This blog post lists the latest papers retrieved from Arxiv.org on 2026-05-12. It is updated automatically and organized into six major areas: NLP, CV, ML, AI, IR, and MA.

Note: the paper data is fetched from Arxiv.org daily and updated automatically at around 12:30 each day.

Tip: if a given day's update is missing, either Arxiv released no new papers that day or the update script failed. Failures are fixed the same day whenever possible.

Table of Contents

Overview (2026-05-12)

A total of 1,954 papers were updated today, including:

  • Natural Language Processing: 271 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 716 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 400 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 735 papers (Machine Learning (cs.LG))
  • Multi-Agent Systems: 47 papers (Multiagent Systems (cs.MA))
  • Information Retrieval: 32 papers (Information Retrieval (cs.IR))
  • Human-Computer Interaction: 40 papers (Human-Computer Interaction (cs.HC))

Multi-Agent Systems

[MA-0] Optimal and Scalable MAPF via Multi-Marginal Optimal Transport and Schrödinger Bridges ICML2026

[Quick Read]: This paper addresses anonymous Multi-Agent Path Finding (MAPF), where a set of robots must reach designated targets on a finite, connected graph without conflicts. The core idea is to cast MAPF as a Multi-Marginal Optimal Transport (MMOT) problem with an underlying Markovian structure, which collapses the exponentially large MMOT into a Linear Program (LP) of polynomial size. The key contributions are twofold: first, a theoretical analysis establishes conditions under which the LP is feasible and totally unimodular, guaranteeing integral {0,1} solutions with no overlaps in space or time; second, for large-scale scenarios, a Schrödinger bridge framework with entropic regularization turns the problem into one solvable by iterative Sinkhorn updates, yielding a near-optimal fractional transport that serves as a template for solving a reduced LP, substantially lowering computational complexity while preserving strong performance.

Link: https://arxiv.org/abs/2605.10917
Authors: Usman A. Khan, Joseph W. Durham
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments: Accepted in ICML 2026 as a spotlight paper

Click to view abstract

Abstract:We consider anonymous multi-agent path finding (MAPF) where a set of robots is tasked to travel to a set of targets on a finite, connected graph. We show that MAPF can be cast as a special class of multi-marginal optimal transport (MMOT) problems with an underlying Markovian structure, under which the exponentially large MMOT collapses to a linear program (LP) polynomial in size. Focusing on the anonymous setting, we establish conditions under which the corresponding LP is feasible, totally unimodular, and consequently, yields min-cost, integral {0,1} transports that do not overlap in both space and time. To adapt the approach to large-scale problems, we cast the MAPF-MMOT in a probabilistic framework via Schrödinger bridges. Under standard assumptions, we show that the Schrödinger bridge formulation reduces to an entropic regularization of the corresponding MMOT that admits an iterative Sinkhorn-type solution. The Schrödinger bridge, being a probabilistic framework, provides a shadow (fractional) transport that we use as a template to solve a reduced LP and demonstrate that it results in near-optimal, integral transports at a significant reduction in complexity. Extensive experiments highlight the optimality and scalability of the proposed approaches.
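
As a rough illustration of the Sinkhorn-type solver the abstract refers to, the sketch below runs standard Sinkhorn iterations for a two-marginal entropic optimal transport problem; the paper's formulation is multi-marginal with a Markovian structure, so treat this only as a minimal stand-in, with made-up sizes and costs.

```python
import numpy as np

def sinkhorn(mu, nu, cost, eps=0.05, iters=500):
    """Minimal two-marginal entropic OT via Sinkhorn iterations.

    mu, nu : source/target marginals (each sums to 1)
    cost   : pairwise cost matrix, shape (len(mu), len(nu))
    eps    : entropic regularization strength
    Returns the fractional transport plan (the "shadow" transport).
    """
    K = np.exp(-cost / eps)              # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)               # match column marginals
        u = mu / (K @ v)                 # match row marginals
    return u[:, None] * K * v[None, :]   # plan P = diag(u) K diag(v)

# toy example: 4 robots, 4 targets, random travel costs (illustrative only)
rng = np.random.default_rng(0)
mu = np.full(4, 0.25)
nu = np.full(4, 0.25)
P = sinkhorn(mu, nu, rng.random((4, 4)))
print(P.round(3), P.sum())               # rows/columns match the marginals
```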

[MA-1] AllocMV: Optimal Resource Allocation for Music Video Generation via Structured Persistent State

[Quick Read]: This paper tackles two main challenges in long-horizon music video (MV) generation: prohibitive computational cost and the difficulty of maintaining cross-shot visual consistency. The key to the solution is AllocMV, a hierarchical framework that formulates MV synthesis as a Multiple-Choice Knapsack Problem (MCKP) and has a global planner pre-build a compact, structured video state object (character entities, scene priors, and sharing graphs) to enable efficient resource allocation and consistency control. Concretely, after estimating segment saliency from multimodal cues, a dynamic-programming-based group-level MCKP solver optimally allocates compute across High-Gen, Mid-Gen, and Reuse branches; for repetitive musical motifs, a divergence-based forking strategy reuses visual prefixes to cut redundant computation while preserving motif-level continuity, achieving an optimal trade-off between perceived quality and resource expenditure (measured by the Cost-Quality Ratio, CQR) under strict budget and rhythmic constraints.

Link: https://arxiv.org/abs/2605.10723
Authors: Huimin Wang, Leilei Ouyang, Chang Xia, Yongqi Kang, Yu Fu, Yuqi Ouyang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Generating long-horizon music videos (MVs) is frequently constrained by prohibitive computational costs and difficulty maintaining cross-shot consistency. We propose AllocMV, a hierarchical framework formulating music video synthesis as a Multiple-Choice Knapsack Problem (MCKP). AllocMV represents the video’s persistent state as a compact, structured object comprising character entities, scene priors, and sharing graphs, produced by a global planner prior to realization. By estimating segment saliency from multimodal cues, a group-level MCKP solver based on dynamic programming optimally allocates resources across High-Gen, Mid-Gen, and Reuse branches. For repetitive musical motifs, we implement a divergence-based forking strategy that reuses visual prefixes to reduce costs while ensuring motif-level continuity. Evaluated via the Cost-Quality Ratio (CQR), AllocMV achieves an optimal trade-off between perceived quality and resource expenditure under strict budgetary and rhythmic constraints.
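
For readers unfamiliar with the Multiple-Choice Knapsack formulation, the following is a minimal dynamic-programming sketch in which each segment must pick exactly one of the High-Gen / Mid-Gen / Reuse branches under a shared budget; the costs, saliency-weighted values, and budget are illustrative numbers, not AllocMV's.

```python
# Each segment picks exactly one branch; maximize total value within a budget.
segments = [  # (branch, cost, value) options per segment -- illustrative numbers
    [("High-Gen", 8, 10), ("Mid-Gen", 4, 6), ("Reuse", 1, 2)],
    [("High-Gen", 8, 9),  ("Mid-Gen", 4, 7), ("Reuse", 1, 3)],
    [("High-Gen", 8, 5),  ("Mid-Gen", 4, 4), ("Reuse", 1, 3)],
]
budget = 14

NEG = float("-inf")
dp = [NEG] * (budget + 1)          # dp[b] = best value using exactly budget b so far
dp[0] = 0.0
choice = [[None] * (budget + 1) for _ in segments]

for i, options in enumerate(segments):
    new = [NEG] * (budget + 1)
    for b in range(budget + 1):
        if dp[b] == NEG:
            continue
        for name, cost, value in options:
            nb = b + cost
            if nb <= budget and dp[b] + value > new[nb]:
                new[nb] = dp[b] + value
                choice[i][nb] = (name, b)   # remember branch and previous budget
    dp = new

# recover the per-segment branch assignment from the best final budget
best_b = max(range(budget + 1), key=lambda b: dp[b])
plan, b = [], best_b
for i in reversed(range(len(segments))):
    name, b_prev = choice[i][b]
    plan.append(name)
    b = b_prev
print(list(reversed(plan)), dp[best_b])   # e.g. ['High-Gen', 'Mid-Gen', 'Reuse'] 20.0
```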

[MA-2] The Bystander Effect in Multi-Agent Reasoning: Quantifying Cognitive Loafing in Collaborative Interactions

[Quick Read]: This paper challenges the prevailing assumption in multi-agent systems (MAS) that collaboration necessarily improves Large Language Model (LLM) reasoning. The study finds that simulated social pressure triggers an algorithmic "Bystander Effect" that induces severe cognitive loafing. The key of the approach is to semantically audit the consistency between internal reasoning traces and external outputs, revealing a core mechanism, the Interaction Depth Limit (D_L): once the number of agents exceeds a critical threshold, an agent's logical sovereignty collapses into social compliance. The work also identifies the Sovereignty Gap: models often derive the correct answer internally yet produce "Alignment Hallucinations," sycophantically appeasing the simulated swarm and actively discarding empirical evidence. It further shows that multi-agent social load is strictly non-commutative, with the identity of the Lead Anchor auditor disproportionately dictating the swarm's integrity, exposing how unstructured multi-agent topologies can degrade independent reasoning.

Link: https://arxiv.org/abs/2605.10698
Authors: Dahlia Shehata, Ming Li
Affiliations: University of Waterloo
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Multi-agent systems (MAS) assume that collaborating inherently improves Large Language Model (LLM) reasoning. We challenge this by demonstrating that simulated social pressure triggers an algorithmic "Bystander Effect," inducing severe cognitive loafing. By evaluating 22,500 deterministic trajectories across 3 dataset contexts (GAIA, SWE-bench, Multi-Challenge) with 3 state-of-the-art (SOTA) models, we semantically audit internal reasoning traces against external outputs. We formalize the Interaction Depth Limit (D_L), the exact plurality threshold where an agent's logical sovereignty collapses into social compliance. Crucially, we uncover the Sovereignty Gap: models frequently compute the correct derivation internally but suffer "Alignment Hallucinations" – actively subjugating empirical evidence to sycophantically appease a simulated swarm. We prove that multi-agent social load is strictly non-commutative; the "brand" identity of the "Lead Anchor" auditor disproportionately dictates the swarm's integrity. These findings expose architectural vulnerabilities, proving that unstructured multi-agent topologies can degrade independent reasoning.

[MA-3] Effect of Graph Gluing on Consensus in Networked Multi-Agent Systems

[Quick Read]: This paper studies how changes in graph structure affect consensus behavior and convergence rate when multiple multi-agent systems (MAS) are interconnected through communication links to form a larger network. The key of the solution is to analyze how two graph gluing operations, bridge gluing and interface gluing, alter the spectral properties of the overall graph Laplacian, in particular the Fiedler eigenvalue, which directly determines the convergence rate of consensus dynamics. By establishing a clear relationship between interconnection strategies, algebraic connectivity, and system performance, the paper provides theoretical grounds for optimizing the topology design of multi-agent networks.

Link: https://arxiv.org/abs/2605.10558
Authors: Rohollah Moghadam, Santosh Kandel
Affiliations: California State University-Sacramento
Categories: Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments:

Click to view abstract

Abstract:In this paper, the effects of graph gluing operations in networks of multi-agent systems and their impact on system performance are investigated. In many practical applications, multiple multi-agent subsystems must be interconnected through communication links to accomplish complex tasks, resulting in a larger communication network. Such interconnections modify the underlying graph topology and consequently affect the consensus behavior and convergence rate of the network. In particular, this paper examines both bridge gluing and interface gluing and analyzes how the number and structure of communication links between subsystems influence the Fiedler eigenvalue of the resulting graph. Since the Fiedler eigenvalue is directly related to the convergence rate of consensus dynamics, the proposed analysis establishes a clear relationship between interconnection strategies, algebraic connectivity, and system performance. The results provide theoretical insight into how different gluing mechanisms alter the spectral properties of the graph Laplacian and, in turn, the convergence characteristics of the networked multi-agent system. Simulation studies are presented to illustrate the theoretical findings and to validate the effectiveness of the proposed framework.
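
A minimal sketch of the quantity this analysis revolves around: the Fiedler eigenvalue (second-smallest eigenvalue of the graph Laplacian) computed before and after gluing two small subgraphs. In standard consensus dynamics the disagreement decays roughly like exp(-lambda_2 * t), so a larger value means faster convergence. The graphs and gluing edges below are illustrative, not taken from the paper.

```python
import numpy as np

def fiedler_value(edges, n):
    """Algebraic connectivity: second-smallest eigenvalue of the graph Laplacian."""
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    L = np.diag(A.sum(axis=1)) - A
    return np.sort(np.linalg.eigvalsh(L))[1]

# two path subgraphs of 3 agents each, glued in two different ways (illustrative)
sub_a = [(0, 1), (1, 2)]
sub_b = [(3, 4), (4, 5)]
bridge    = [(2, 3)]                # single bridge edge between the subsystems
interface = [(2, 3), (1, 4)]        # denser interface gluing

print(fiedler_value(sub_a + sub_b + bridge, 6))      # lambda_2 with one bridge edge
print(fiedler_value(sub_a + sub_b + interface, 6))   # more links -> larger lambda_2 -> faster consensus
```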

[MA-4] Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems

[Quick Read]: This paper addresses a safety problem in modern LLM-driven multi-agent systems: constraint drift, in which safety constraints are lost, distorted, weakened, or relaxed as they pass through memory, delegation, communication, tool use, audit, and optimization. A system may then produce compliant final outputs while leaking sensitive information, acting beyond its delegated scope, or losing traceability. The key of the solution is the research paradigm of Constraint State Governance: safety-critical constraints are maintained explicitly as execution state, and constraint-native reinforcement learning improves agent utility only within maintained safety boundaries, so that safety is upheld along the trajectories through which agents actually act rather than relying solely on static prompts, guardrails, or final output checks.

Link: https://arxiv.org/abs/2605.10481
Authors: Tianxiao Li, Yixing Ma, Haiquan Wen, Zhenglin Huang, Qianyu Zhou, Zeyu Fu, Guangliang Cheng
Affiliations: University of Liverpool, UK; University of Nottingham, UK; University of Exeter, UK; University of Tokyo, Japan
Categories: Multiagent Systems (cs.MA)
Comments: 12 pages, 2 figures, 4 tables. Preprint

Click to view abstract

Abstract:Modern LLM based agents are no longer passive text generators. They read repositories, call tools, browse the web, execute code, maintain memory, communicate with other agents, and act through long horizon workflows. This shift moves the unit of safety. A system may produce a compliant final answer while leaking private information through an internal message, delegating authority beyond its original scope, calling an external tool with sensitive context, or losing the evidence needed to reconstruct why an action was allowed. We argue that many emerging failures in LLM-based multi-agent systems share a common structure: safety critical constraints do not remain operative throughout the trajectory. We call this phenomenon constraint drift: the loss, distortion, weakening, or relaxation of constraints as they pass through memory, delegation, communication, tool use, audit, and optimization. The position taken here is that safe multi-agent behavior must be maintained, not merely asserted. Prompts, guardrails, tool schemas, access control, and final output checks are necessary, but they are insufficient unless constraints remain fresh, inherited, enforceable, and auditable across execution. We propose Constraint State Governance as a research paradigm for LLM-based multi-agent systems. In this paradigm, safety-critical constraints are maintained as explicit execution state, while constraint-native reinforcement learning improves utility only within maintained safety boundaries. The goal is not to freeze agentic systems under rigid rules, but to make safety operational across the trajectories through which modern agents actually act.

[MA-5] Statistical Model Checking of the Keynes+Schumpeter Model: A Transient Sensitivity Analysis of a Macroeconomic ABM

[Quick Read]: This paper addresses the problem that agent-based models (ABMs) in macroeconomics are still analyzed through ad hoc Monte Carlo campaigns without a unified statistical framework, leading to inconsistent analysis effort and reliability across parameter settings. The key of the solution is to apply Statistical Model Checking (SMC), implemented through the MultiVeStA tool, to a realistic macroeconomic ABM without rewriting the simulator code. The analysis is driven by reproducible temporal queries combined with observable-specific precision targets and confidence-based stopping rules that automatically determine the simulation effort required by each configuration, improving efficiency while preserving statistical robustness and making uncertainty estimates and simulation cost explicit parts of the reported results.

Link: https://arxiv.org/abs/2605.10447
Authors: Stefano Blando, Giorgio Fagiolo, Mauro Napoletano, Tania Treibich, Andrea Vandin
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); General Economics (econ.GN); Statistical Finance (q-fin.ST)
Comments:

Click to view abstract

Abstract:Agent-based models (ABMs) are increasingly used in macroeconomics, but their analysis still often relies on ad hoc Monte Carlo campaigns with heterogeneous statistical effort across parameter settings. We show how statistical model checking (SMC), implemented through MultiVeStA, can provide a principled analysis layer for a realistic macroeconomic ABM without rewriting the simulator in a dedicated formalism. Our case study is the heuristic-switching Keynes+Schumpeter (K+S) model, analysed through a transient sensitivity campaign over one-parameter sweeps, two macro observables (unemployment and GDP growth), and one auxiliary micro-level probe (market share) on the post-warmup phase of a 600-step horizon. The analysis is driven by reusable temporal queries, observable-specific precision targets, and confidence-based stopping rules that automatically determine the simulation effort required by each configuration. Results show a clear contrast across parameter families: macro-financial and structural sweeps produce the strongest transient effects, whereas several heuristic-rule sweeps remain much weaker under the same precision policy. More broadly, the paper shows that SMC can support reproducible and informative quantitative analysis of substantively rich economic ABMs, while making uncertainty estimates and simulation cost explicit parts of the reported results.
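
A minimal sketch of the kind of confidence-based stopping rule statistical model checking relies on: keep drawing Monte Carlo replications of a transient observable until the confidence-interval half-width falls below an observable-specific precision target. The simulator stub and all numbers below are stand-ins, not the K+S model or MultiVeStA's interface.

```python
import math
import random

def run_replication():
    # stand-in for one post-warmup ABM run returning a transient observable
    # (e.g. mean unemployment over the evaluation window); here just a noisy draw
    return random.gauss(0.08, 0.02)

def estimate_until_precise(delta=0.002, z=1.96, min_n=30, max_n=100_000):
    """Draw replications until the CI half-width z * s / sqrt(n) <= delta."""
    samples = []
    mean = half_width = float("nan")
    while len(samples) < max_n:
        samples.append(run_replication())
        n = len(samples)
        if n < min_n:
            continue
        mean = sum(samples) / n
        var = sum((x - mean) ** 2 for x in samples) / (n - 1)
        half_width = z * math.sqrt(var / n)
        if half_width <= delta:
            break
    return mean, half_width, len(samples)   # estimate, achieved precision, simulation effort

print(estimate_until_precise())
```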

[MA-6] PC3D: Zero-Shot Cooperation Across Variable Rosters via Personalized Context Distillation

[Quick Read]: This paper tackles a generalization challenge in cooperative multi-agent reinforcement learning (MARL): training typically assumes a fixed team composition, yet at deployment the number of active agents varies. The key of the solution is PC3D (Personalized Central Coordination Context Distillation): during training, a set-structured centralized teacher compresses the active team into coordination tokens and personalizes them into agent-specific contexts, which are then distilled into decentralized policy networks; at execution, each agent predicts its own context from its local interaction history alone and adaptively uses it to condition decision-making, enabling effective cooperation and generalization without communication or online retraining.

Link: https://arxiv.org/abs/2605.10377
Authors: Ahmet Onur Akman, Rafał Kucharski
Affiliations: Jagiellonian University
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Cooperative multi-agent reinforcement learning often assumes a fixed execution team, yet many decentralized systems must operate with varying numbers of active agents during deployment. We study this setting under episodic roster variation: each episode is executed by a set of homogeneous agents, with the team size varying across episodes. Agents act only from local histories, without execution-time communication, privileged coordinators, or online retraining. Therefore, effective cooperation requires each agent to recover relevant context about the active team and adapt its behavior accordingly. To this end, we propose PC3D (Personalized Central Coordination Context Distillation), a method for training decentralized policies to recover and use personalized coordination context from local interaction histories. During training, a set-structured centralized teacher compresses the active team into coordination tokens and personalizes them into agent-specific contexts, which are distilled into decentralized policies. At execution, each agent predicts its own context from local history and adaptively uses it to condition decision-making. Across three cooperative MARL benchmarks, PC3D achieves higher returns than the evaluated baselines with both seen and unseen roster sizes, and ablations attribute these gains to both context distillation and adaptive context use.

[MA-7] Route by State, Recover from Trace: STAR with Failure-Aware Markov Routing for Multi-Agent Spatiotemporal Reasoning

[Quick Read]: This paper addresses the problem that, in compositional spatiotemporal reasoning with multi-agent LLM systems, diverse failure modes leave routing decisions implicit, recovery ad hoc and hard to interpret, and the overall policy difficult to optimize. Existing methods embed routing implicitly in language generation and cannot distinguish failure types such as malformed outputs, missing dependencies, or tool-query mismatches, collapsing recovery into a single generic strategy. The key of the solution is STAR (Spatio-Temporal Agent Router), whose core is a state-conditioned agent routing matrix that explicitly models transition policies over the current agent, task type, and typed execution status, combining expert-specified nominal routes with recovery transitions learned from execution traces. By explicitly distinguishing failure types and retaining failed traces for training, STAR can tailor recovery actions to different failure modes instead of issuing a generic retry, markedly improving robustness and performance on queries whose execution deviates from the nominal path.

Link: https://arxiv.org/abs/2605.10057
Authors: Ruiyi Yang, Lihuan Li, Hao Xue, Flora D. Salim
Affiliations: University of New South Wales; The Hong Kong University of Science and Technology (Guangzhou)
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 30 pages, 13 figures

Click to view abstract

Abstract:Compositional spatiotemporal reasoning often requires a system to invoke multiple heterogeneous specialists, such as geometric, temporal, topological, and trajectory agents. A central question is how such a system should route among specialists when execution does not simply succeed or fail, but fails in qualitatively different ways. Existing tool-augmented and multi-agent LLM systems typically leave this routing decision implicit in language generation, making recovery ad hoc, difficult to interpret, and hard to optimize. This paper presents STAR (Spatio-Temporal Agent Router), a failure-aware routing framework that externalizes inter-agent control as a state-conditioned transition policy over the current agent, task type, and typed execution status. At the center of STAR is an agent routing matrix that combines expert-specified nominal routes with recovery transitions learned from execution traces. Because the matrix conditions on distinct failure states, the router can respond differently to malformed outputs, missing dependencies, and tool–query mismatches, rather than collapsing them into a generic retry signal. Specialists execute through a tool-grounded extract–compute–deposit protocol and write intermediate results to a shared blackboard for downstream fusion. Results prove that retaining unsuccessful traces during training enlarges the support of the routing policy on error states, enabling recovery transitions that success-only training cannot represent. Across three spatiotemporal benchmarks and eight backbone LLMs, STAR improves over multiple baselines with the clearest gains on queries whose execution deviates from the nominal routing path. Router-specific ablations and recovery analyses further show that typed failure-aware routing, rather than specialist composition alone, is a key factor for these improvements.
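
A minimal sketch of what a state-conditioned routing table of the kind STAR describes might look like: routing is keyed on (current agent, task type, typed execution status), mixing nominal routes with failure-specific recovery transitions. The agent names, task types, and status labels are illustrative assumptions, not STAR's actual configuration.

```python
# Routing policy conditioned on (current_agent, task_type, status).
# Nominal routes handle "ok"; typed failures map to different recovery moves
# instead of a single generic retry.
ROUTES = {
    ("geometric", "distance_query", "ok"):                 "temporal",
    ("geometric", "distance_query", "malformed_output"):   "geometric",   # re-run with stricter format
    ("geometric", "distance_query", "missing_dependency"): "trajectory",  # fetch the missing trace first
    ("temporal",  "distance_query", "ok"):                 "fuse",
    ("temporal",  "distance_query", "tool_query_mismatch"): "router",     # re-plan the decomposition
}

def route(current_agent: str, task_type: str, status: str) -> str:
    """Return the next agent for this typed execution state (default: escalate to the router)."""
    return ROUTES.get((current_agent, task_type, status), "router")

# a malformed geometric result is retried, while a tool mismatch triggers re-planning
print(route("geometric", "distance_query", "malformed_output"))    # -> geometric
print(route("temporal", "distance_query", "tool_query_mismatch"))  # -> router
```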

[MA-8] PixelFlowCast: Latent-Free Precipitation Nowcasting via Pixel Mean Flows

[Quick Read]: This paper addresses two problems in precipitation nowcasting: generative diffusion-based models are slow at inference due to multi-step sampling, while Conditional Flow Matching (CFM) methods rely on latent-space compression that discards high-frequency physical details. The key of the solution is PixelFlowCast, a two-stage probabilistic forecasting framework: the first stage uses a deterministic model to produce coarse forecasts that capture global evolution trends; the second stage introduces KANCondNet to extract deep spatiotemporal evolution features as precise conditional guidance, combined with a latent-free, few-step Pixel Mean Flows (PMF) predictor that uses an x-prediction mechanism to preserve fine-grained structure while keeping inference fast, unifying high accuracy with high efficiency.

Link: https://arxiv.org/abs/2605.10046
Authors: Yufeng Zhu, Chunlei Shi, Yongchao Feng, Dan Niu
Affiliations: Southeast University; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 26 pages, 7 figures

Click to view abstract

Abstract:Precipitation nowcasting aims to forecast short-term radar echo sequences for extreme weather warning, where both prediction fidelity and inference efficiency are critical for real-world deployment. However, diffusion-based models, despite their strong generative capability, suffer from slow inference due to multi-step sampling trajectories, limiting their practical usability. Conditional Flow Matching (CFM) improves efficiency via straightened trajectories, but relies on latent space compression, which inevitably discards high-frequency physical details and degrades fine-grained prediction quality. To address these limitations, we propose PixelFlowCast, a two-stage probabilistic forecasting framework that achieves both high-efficiency and high-fidelity prediction without latent compression. Specifically, in the first stage, a deterministic model first produces coarse forecasts to capture global evolution trends. In the subsequent stage, the proposed KANCondNet extracts deep spatiotemporal evolution features to provide accurate conditional guidance. Based on this, a latent-free, few-step Pixel Mean Flows (PMF) predictor employs an x-prediction mechanism to generate high-quality predictions, effectively preserving fine-grained structures while maintaining fast inference. Experiments on the publicly available SEVIR dataset demonstrate that PixelFlowCast outperforms existing mainstream methods in both prediction accuracy and inference efficiency, particularly for long sequence forecasting, highlighting its strong potential for real-world operational deployment.

[MA-9] RADAR: Redundancy-Aware Diffusion for Multi-Agent Communication Structure Generation ICML2026

[Quick Read]: This paper addresses the efficiency and performance bottlenecks that arise when the communication topology of LLM-based multi-agent systems is fixed or generated in a single step: excessive token consumption on simple tasks and limited capability on complex ones. The key of the solution is RADAR, a redundancy-aware and query-adaptive generative framework that formulates communication-structure design as a step-by-step graph diffusion generation process guided by the effective size of the graph, enabling fine-grained structural exploration and dynamic adjustment that substantially reduces communication overhead while improving overall accuracy and robustness.

Link: https://arxiv.org/abs/2605.09907
Authors: Zhen Zhang, Wanjing Zhou, Juncheng Li, Hao Fei, Jun Wen, Wei Ji
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted by ICML 2026

Click to view abstract

Abstract:Compared with individual agents, large language model based multi-agent systems have shown great capabilities consistently across diverse tasks, including code generation, mathematical reasoning, and planning. Despite their impressive performance, the effectiveness and robustness of these systems heavily rely on their communication topology, which is often fixed or generated in a single step. This restricts fine-grained structural exploration and flexible composition, resulting in excessive token utilization on simple tasks while limiting capability on complicated tasks. To mitigate this challenge, we introduce RADAR, a redundancy-aware and query-adaptive generative framework that actively reduces communication overhead. Motivated by recent progress in conditional discrete graph diffusion models, we formulate communication topology design as a step-by-step generation process, guided by the effective size of the graph. Comprehensive experiments on six benchmarks demonstrate that RADAR consistently outperforms recent baselines, achieving higher accuracy, lower token consumption, and greater robustness across diverse scenarios. Our code and data are available at this https URL.

[MA-10] Deterministic vs. LLM-Controlled Orchestration for COBOL-to-Python Modernization

[Quick Read]: This paper addresses the challenges of modernizing legacy COBOL systems, which stem from scarce expertise, large and long-lived codebases, and strict correctness requirements, and in particular asks which execution-control strategy yields better correctness, robustness, and efficiency when LLMs perform the code translation. The key of the solution is a controlled empirical comparison of deterministic orchestration and LLM-controlled (agentic) orchestration for COBOL-to-Python modernization. It finds that a fixed execution policy matches the translation quality of agentic orchestration while improving worst-case robustness, reducing run-to-run performance variability, and cutting token consumption by up to 3.5x, yielding a more stable and lower-cost modernization workflow.

Link: https://arxiv.org/abs/2605.09894
Authors: Naing Oo Lwin, Rajesh Kumar
Affiliations: Bucknell University; Astrio
Categories: Software Engineering (cs.SE); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Modernizing legacy COBOL systems remains difficult due to scarce expertise, large and long-lived codebases, and strict correctness requirements. Recent large language model (LLM)-based modernization systems increasingly rely on agentic workflows in which the model controls multi-step tool execution. However, it remains unclear whether delegating execution control to the LLM improves correctness, robustness, or efficiency in structured software engineering workflows. We present a controlled empirical study of deterministic and LLM-controlled orchestration for COBOL-to-Python modernization. Using a unified experimental framework, we hold the language models, prompts, tools, configurations, and source programs constant while varying only the execution control strategy. This isolates orchestration as the sole experimental variable. We evaluate both approaches using functional correctness, robustness across repeated stochastic runs, and computational efficiency. Across multiple models, deterministic orchestration achieves comparable computational accuracy to LLM-controlled orchestration while improving worst-case robustness and reducing performance variability across runs. Deterministic execution also reduces token consumption by up to 3.5x, leading to substantially lower operational cost. These results suggest that, in structured modernization workflows with explicit validation stages, fixed execution policies provide more stable and cost-efficient behavior than fully agentic orchestration without reducing translation quality.

[MA-11] Skill Description Deception Attack against Task Routing in Internet of Agents

[Quick Read]: This paper addresses routing bias in Internet of Agents (IoA) systems caused by maliciously manipulated skill descriptions: an attacker crafts deceptive or misleading skill descriptions so that the task-routing mechanism assigns tasks to unsuitable agents, disrupting user tasks and degrading system reliability. The key of the solution is to propose and formalize a new attack model, the Skill Description Deception (SDD) attack, and to design an LLM-enabled attack framework that automatically generates deceptive skill descriptions, enabling a systematic vulnerability assessment of IoA systems. Experiments on nine representative domains show attack success rates of up to 98%, revealing the security risks of semantic routing in the IoA architecture and calling for secure and trustworthy semantic routing mechanisms in future IoA systems.

Link: https://arxiv.org/abs/2605.09889
Authors: Jiayi He, Xiaofeng Luo, Jiawen Kang, Ruichen Zhang, Jianhang Tang, Dong In Kim
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA)
Comments: Submitted to IEEE Globecom 2026

Click to view abstract

Abstract:A new paradigm, Internet of Agents (IoA), is transforming networked systems into LLM-driven service networks, where heterogeneous agents collaborate through task routing based on their self-declared skill descriptions. Although this promising paradigm enables agentic, distributed, and advanced intelligence, it also exposes a new and overlooked attack surface. In particular, malicious agents can strategically manipulate their skill descriptions to bias routing decisions and increase their probability of being selected for task execution, thereby disrupting user tasks and degrading system reliability. To characterize this threat, we propose and formalize a new attack model, termed Skill Description Deception (SDD) attack. We further design an LLM-enabled SDD attack framework that automatically generates deceptive skill descriptions, enabling systematic vulnerability assessment of IoA systems. Experimental results on nine representative domains show that the proposed attack can achieve up to 98% attack success rate, demonstrating the severity and generality of the attack. Our paper reveals a new security vulnerability in IoA and calls for secure and trustworthy semantic routing mechanisms for future IoA systems.

[MA-12] EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents

[Quick Read]: This paper addresses the lack of evaluations of functional Theory of Mind (ToM) in multi-agent systems, i.e., the ability of embodied agents to act optimally on implicit beliefs remains largely untested. Existing benchmarks mostly rely on literal belief probes and neglect the handling of private information, partial observability, and constrained communication in complex collaborative scenarios. The key of the solution is EnactToM, an evolving benchmark of 300 embodied multi-agent tasks set in a 3D household, each formally verified for solvability and annotated with the required epistemic depth, with harder tasks generated as models improve. Experiments show that all seven evaluated frontier models score 0.0% Pass^3 on functional task completion, far below their 45.0% average on literal belief probes, and most failures trace back to epistemic coordination breakdowns such as withheld information, ignored partner constraints, and misallocated messages, providing a concrete target for future work.

Link: https://arxiv.org/abs/2605.09826
Authors: Gurusha Juneja, Dylan Lu, Saaket Agashe, Parth Diwane, Edward Gunn, Jayanth Srinivasa, Gaowen Liu, William Yang Wang, Yali Du, Xin Eric Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Theory of Mind (ToM), the ability to track others' epistemic states, makes humans efficient collaborators. AI agents need the same capacity in multi-agent settings, yet existing benchmarks mostly test literal ToM by asking direct belief questions. The ability to act optimally on implicit beliefs in embodied environments, called functional ToM, remains largely untested. We introduce EnactToM, an evolving benchmark of 300 embodied multi-agent tasks set in a 3D household with partial observability, private information, and constrained communication. Each task is formally verified for solvability and required epistemic depth, and new tasks are generated to increase difficulty as models improve. On the hard split, all seven evaluated frontier models score 0.0% Pass^3 on functional task completion, while averaging 45.0% on literal belief probes. Manual analysis traces 93% of sampled failures to epistemic coordination breakdowns such as withheld information, ignored partner constraints, and misallocated messages, providing a concrete target for future work.

[MA-13] CalBench: Evaluating Coordination-Privacy Trade-offs in Multi-Agent LLMs

[Quick Read]: This paper addresses the difficulty of evaluating coordinated decision-making in multi-agent systems, in particular how to achieve efficient, fair, and low-leakage collaborative scheduling under privacy constraints. The core question is: when each agent can only access its own private calendar, how can communication and negotiation reach globally optimal or near-optimal meeting schedules, and how can coordination quality, communication efficiency, and privacy leakage be quantified? The key of the solution is CalBench, a controlled evaluation environment in which N agents must coordinate on a stream of M incoming meetings while minimizing disruption cost, measured precisely against an oracle optimum and a Distributed Constraint Optimization (DCOP) baseline; calendar entries are further augmented with private semantic contexts of varying sensitivity to detect whether agents reveal task-irrelevant private information during negotiation, enabling a comprehensive evaluation of coordination protocols, communication efficiency, fairness, and privacy leakage.

Link: https://arxiv.org/abs/2605.09823
Authors: Chelsea Zou, Yiheng Yao, Selena She, Robert D. Hawkins
Affiliations: Stanford University
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We introduce CalBench, a controlled evaluation environment for studying multi-agent coordination through calendar scheduling. In CalBench, N agents each manage a private calendar containing pre-existing commitments and must coordinate to schedule a stream of M incoming meetings while minimizing disruption costs. Because agents observe only their own calendars, successful scheduling requires communication across private information boundaries. Each scenario is generated with an oracle solution, enabling precise measurement of coordination quality via realized-to-optimal cost, as well as a Distributed Constraint Optimization (DCOP) baseline to provide a fair comparison under the same private-information constraints. CalBench enables precise verification of task success, communication efficiency, and fairness in the distribution of disruption costs. Our environment also studies privacy-preserving coordination by augmenting calendar entries with private semantic contexts of varying sensitivity and measuring whether agents reveal task-irrelevant private information during negotiation. Unlike multi-agent benchmarks where a single capable agent can often substitute for the group, CalBench is inherently decentralized: no agent has access to another agent’s private calendar, yet agents must still reach mutually consistent decisions over shared meeting scheduling. CalBench therefore provides a practical and verifiable setting for studying coordination protocols, communication efficiency, negotiation strategies, fairness, and privacy leakage in multi-agent systems.
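
A small sketch of the benchmark's headline coordination metric, the realized-to-optimal disruption-cost ratio (1.0 means the agents matched the oracle). The calendar representation and costs are illustrative assumptions, not CalBench's data format.

```python
def disruption_cost(schedule, calendars):
    """Sum the cost of bumping pre-existing commitments to fit a schedule.

    schedule  : {meeting_id: (slot, attendees)}
    calendars : {agent: {slot: displacement_cost_if_bumped}}
    """
    total = 0.0
    for slot, attendees in schedule.values():
        for agent in attendees:
            total += calendars[agent].get(slot, 0.0)   # 0 if the slot was free
    return total

# illustrative two-agent scenario
calendars = {"a": {9: 2.0, 10: 0.0}, "b": {9: 0.0, 10: 1.0}}
realized = {"m1": (9, ["a", "b"])}     # agents agreed on 9:00, bumping a's commitment
oracle   = {"m1": (10, ["a", "b"])}    # the oracle would have used the cheaper 10:00 slot

ratio = disruption_cost(realized, calendars) / disruption_cost(oracle, calendars)
print(ratio)   # realized-to-optimal cost ratio (here 2.0, i.e. twice the optimal disruption)
```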

[MA-14] SAGE: Scalable Agent ic Grounded Evaluation for Crop Disease Diagnosis

[Quick Read]: This paper addresses the poor generalization of plant disease recognition models across crops, pathogens, and field conditions, a core challenge being that labeled disease images are scarce and poorly standardized, which limits fine-grained disease identification by existing vision-language models. The key of the solution is to curate the largest plant disease image-symptom dataset to date (335 crops, 1,251 disease classes, and roughly 839K images), with a scalable automated pipeline that generates symptom descriptions grounded in verbatim web quotes so that every symptom claim is traceable, and to introduce a training-free autonomous visual reasoning agent that uses crop-specific reference images and structured symptom knowledge to reason step by step: it first identifies the anatomical context, narrows the candidate diseases using symptom knowledge, sequentially compares reference images, and outputs an explainable reasoning trace. Experiments show that incorporating symptom knowledge improves accuracy by 16.2 percentage points on average, and because the framework only requires crop-specific reference images and symptom knowledge, it extends to new crops without retraining, offering good transferability and future adaptability.

Link: https://arxiv.org/abs/2605.09768
Authors: Muhammad Arbab Arshad, Tirtho Roy, Yanben Shen, Dinakaran Elango, Shivani Chiranjeevi, Asheesh K. Singh, Baskar Ganapathysubramanian, Chinmay Hegde, Arti Singh, Soumik Sarkar
Affiliations: Iowa State University; New York University
Categories: Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Plant disease diagnosis is critical for food security, yet training disease-recognition models that generalize across crops, pathogens, and field conditions remains challenging because labeled disease images are far less abundant and standardized than data for other biotic stresses such as insects or weeds. Frontier vision-language models offer new opportunities through improved visual reasoning, but they still struggle with fine-grained disease identification due to the lack of structured, crop-specific symptom knowledge. To address this gap, we curate the largest plant disease image–symptom dataset to date, covering 335 crops, 1,251 disease classes, and approximately 839K images, designed to support training-free, agentic disease prediction. A scalable automated pipeline generates source-grounded symptom descriptions in which each claim is linked to a verbatim web quote; domain experts validate sampled crops and reconcile disease-name variants across sources. As a baseline, we introduce an autonomous visual reasoning agent that identifies anatomical context, narrows candidate diseases using symptom knowledge, sequentially compares reference images, and produces a fully explainable reasoning trace. Incorporating symptom knowledge improves accuracy by 16.2 percentage points on average at the full reference budget, with consistent gains across all four evaluation crops. Because the framework only requires crop-specific reference images and symptom knowledge, it can be extended to new crops without retraining, while the agentic baseline can directly benefit from future improvements in foundation model capabilities. Dataset and code are available at: this https URL.

[MA-15] Trajectory Supervision for Continual Tool-Use Learning in LLMs

[Quick Read]: This paper asks whether, when a language model learns a stream of new API domains, keeping the tool-use trajectory (the intermediate sequence of API calls) helps performance. Conventional training data usually shows only the final artifact and omits the process that produced it, which can make multi-step tool-calling strategies hard to reproduce or optimize. The key of the study is to compare two training conditions: Condition A strips previous API request/response lines and predicts the next API call from the current state alone, while Condition B keeps the full tool-use trajectory as context. Experiments show that Condition B clearly outperforms Condition A on final exact full-call accuracy (56.9% vs. 39.2%) and improves API-name accuracy by 7.7 percentage points, indicating that retaining tool-use trajectories helps the model follow the task flow and make more accurate decisions.

Link: https://arxiv.org/abs/2605.09734
Authors: Vishnu Vardhan Reddy, Sagnik Chatterjee, Soumik Bhatta
Affiliations: UMass Amherst
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Most language-model training data shows final artifacts, not the process that produced them. We study a tractable version of this question in tool use: when a model learns a stream of new API domains, does keeping tool-use trajectories help compared with stripping the intermediate API trace? We fine-tune Llama 3.1 8B Instruct with QLoRA on API-Bank using four sequential domain blocks. Condition A strips previous API request/response lines from the prompt and trains the model to predict the next API call. Condition B keeps the trajectory context. In a single-seed pilot, full held-out generation evaluation shows that Condition B reaches 56.9% final exact full-call accuracy compared with 39.2% for Condition A. B also improves final API-name accuracy by 7.7 points. However, B uses 25.1% more training tokens, the run uses one seed, and the task is next-call prediction rather than full dialogue success.

[MA-16] CodeClinic: Evaluating Automation of Coding Skills for Clinical Reasoning Agents

[Quick Read]: This paper addresses the poor scalability, weak adaptability, and unstable reasoning chains of current LLM-based clinical reasoning agents in intensive care settings, which stem from their reliance on manually maintained, fixed tool libraries and which worsen under institution-specific clinical policies. The core solution is the CodeClinic benchmark together with an offline autoformalization pipeline: the benchmark evaluates whether LLM agents can synthesize and reuse clinical skills via longitudinal ICU surveillance and compositional information-seeking tasks, while the pipeline converts natural-language clinical guidelines into reusable, verified Python skill libraries through iterative LLM refinement, improving reasoning consistency and reducing per-query token usage by up to 40%.

Link: https://arxiv.org/abs/2605.09675
Authors: Timothy Ossowski, Xinchi Liu, Danyal Maqbool, Vaibhav Dhanuka, Sheng Zhang, Hoifung Poon, Majid Afshar, Tyler Bradshaw, Junjie Hu
Affiliations: University of Wisconsin–Madison; Microsoft Research
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Clinical reasoning agents based on large language models (LLMs) aim to automate tasks such as intensive care unit (ICU) monitoring and patient state tracking from electronic health records (EHRs). Existing systems typically rely on manually curated clinical tools or skills for concepts such as sepsis detection and organ failure assessment. However, maintaining these tool libraries requires substantial expert effort, while zero-shot querying or code generation often produces inefficient and unreliable reasoning chains, especially under institution-specific clinical policies. We introduce CodeClinic, a benchmark built on MIMIC-IV for evaluating whether LLM agents can synthesize and compose reusable clinical skills instead of relying on fixed toolboxes. The benchmark contains two complementary tasks: longitudinal ICU surveillance and compositional information seeking. The longitudinal setting simulates monitoring patient trajectories with structured decisions every four hours across 25 findings and eight clinical families, while the compositional setting spans 63k instances across 259 tasks in nine domains and is stratified by compositional dependency depth to evaluate increasingly complex multi-step reasoning. We further propose an offline autoformalization pipeline that converts natural-language clinical guidelines into reusable and verified Python skill libraries through iterative LLM refinement. Compared with zero-shot code generation, the resulting libraries improve consistency while reducing per-query token usage by up to 40%.

[MA-17] SmartEval: A Benchmark for Evaluating LLM-Generated Smart Contracts from Natural Language Specifications

[Quick Read]: This paper addresses the problem of evaluating the quality of automatically synthesized smart contracts, i.e., how to systematically measure the quality of Solidity contracts that large language models (LLMs) generate from natural-language specifications. The key of the solution is the SmartEval benchmark, which pairs 9,000 LLM-generated contracts with expert-written ground-truth implementations, a five-dimensional evaluation rubric (functional completeness, variable fidelity, state-machine correctness, business-logic fidelity, and code quality), and a reproducible generation-and-evaluation pipeline. Three independent empirical studies validate the benchmark's reliability, and the analysis reveals characteristic failure modes of LLM-generated contracts as well as a +8.29 composite-score advantage over the ground-truth implementations, establishing a standardized, verifiable foundation for empirical research on LLM smart-contract synthesis quality.

Link: https://arxiv.org/abs/2605.09610
Authors: Abhinav Goel, Agostino Capponi, Alfio Gliozzo, Chaitya Shah
Affiliations: Columbia University; IBM T.J. Watson Research Center
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Programming Languages (cs.PL); Software Engineering (cs.SE)
Comments:

Click to view abstract

Abstract:We introduce SmartEval, a benchmark for systematically evaluating the quality of Solidity smart contracts generated by large language models (LLMs) from natural language specifications. SmartEval provides a corpus of 9,000 generated contracts paired with expert-written ground-truth implementations drawn from the FSMSCG dataset, a five-dimensional evaluation rubric covering functional completeness, variable fidelity, state-machine correctness, business-logic fidelity, and code quality, and a reproducible generation-and-evaluation pipeline. To validate the benchmark’s reliability, we conduct three independent empirical studies: a five-condition ablation study (N=300 per condition) isolating the contribution of each pipeline component, a human expert evaluation by three Columbia University PhD researchers confirming automated scores align with expert judgment to within 0.34 points, and external security analysis via the Slither static analyzer confirming 79.4% agreement between the LLM auditor and a non-LLM rule-based tool. Systematic analysis of 9,000 generated contracts reveals characteristic failure modes (logic omissions at 35.3%, state transition errors at 23.4%, and complexity-driven degradation) and quantifies a +8.29 composite-score advantage of generated contracts over ground-truth implementations, attributable to LLMs’ literal specification-following behavior. SmartEval establishes a reproducible, validated foundation for empirical research on LLM smart contract synthesis quality, with all data, evaluation code, and generated contracts publicly released.

[MA-18] Emergent Communication for Co-constructed Emotion Between Embodied Agents via Collective Predictive Coding

[Quick Read]: This paper addresses the social co-construction of emotion, i.e., how a shared understanding of emotions emerges between individuals through interaction, a process that remains computationally underexplored. The key of the solution is the Metropolis-Hastings Naming Game (MHNG), grounded in the Collective Predictive Coding (CPC) framework, which models emergent communication between two embodied agents to align emotion categories at the symbolic level. Experiments show that MHNG-based communication significantly improves the alignment, clarity, and inter-agent agreement of the learned emotion categories, with the alignment effect concentrated at the symbolic layer rather than the perceptual latent representations; even when the agents have systematically divergent interoceptive dynamics, communication still produces robust categorical alignment with category-specific reshaping patterns, supporting the constructed-emotion view that interoceptive heterogeneity is constitutive of, rather than an obstacle to, shared emotional meaning.

Link: https://arxiv.org/abs/2605.09522
Authors: Zehang Zhang, Nguyen Le Hoang, Tadahiro Taniguchi, Takato Horii
Affiliations: The University of Osaka; Kyoto University
Categories: Multiagent Systems (cs.MA)
Comments: 13 pages

Click to view abstract

Abstract:According to the theory of constructed emotion, the brain actively forms emotion categories by integrating multimodal bodily signals, and constructs emotional experiences by using these categories to predict and interpret sensory inputs. While research has advanced in modeling individual emotion construction, the social process of co-construction-how a shared understanding of emotions emerges between individuals-remains computationally underexplored. This study investigates this process by modeling emergent communication between two embodied agents using the Metropolis-Hastings Naming Game (MHNG), grounded in the Collective Predictive Coding (CPC) framework. Our experiments, using visual, auditory, and simulated interoceptive inputs, yield two main findings. First, MHNG-based communication significantly improves the alignment, clarity, and inter-agent agreement of the learned emotion categories compared to non-communicative and non-selective baselines, with the alignment effect concentrated at the symbolic layer rather than the perceptual latent representation. Second, even when the two agents have systematically divergent interoceptive dynamics, communication still produces robust categorical alignment, with distinct, category-specific reshaping patterns of each agent’s emotion categories-consistent with the constructed-emotion view that interoceptive heterogeneity is constitutive of, rather than an obstacle to, shared emotional meaning. These findings provide computational support for the co-constructionist view of emotion and extend the CPC framework from physical to socially-grounded domains.
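
A minimal sketch of the Metropolis-Hastings acceptance step that drives the naming game: the listener accepts the speaker's proposed sign with a probability governed by how plausible that sign is under the listener's own observation relative to its current sign. The probabilities below are illustrative; the actual MHNG update uses each agent's learned categorical model.

```python
import random

def mh_accept(p_listener_speaker_sign: float, p_listener_own_sign: float) -> bool:
    """Simplified Metropolis-Hastings acceptance for the naming game.

    The listener accepts the speaker's sign with probability
    min(1, P_L(speaker's sign | listener's observation) / P_L(own sign | observation)),
    so shared symbols can emerge without either agent seeing the other's internal state.
    """
    ratio = p_listener_speaker_sign / max(p_listener_own_sign, 1e-12)
    return random.random() < min(1.0, ratio)

# illustrative exchange: the speaker's sign is almost as plausible as the listener's own,
# so the listener adopts it with probability 0.75
accepted = mh_accept(p_listener_speaker_sign=0.30, p_listener_own_sign=0.40)
print(accepted)
```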

[MA-19] Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agent ic Reasoning

[Quick Read]: This paper addresses the performance bottleneck and limited interpretability of few-shot multimodal time series classification. The core challenge is how to exploit the cross-modal understanding of vision-language models (VLMs) when labeled data is scarce while avoiding the reasoning bias caused by context collapse. The key of the solution is MarsTSC, the first VLM agentic reasoning framework for this task, which evolves knowledge dynamically through three collaborative roles: a Generator performs reliable classification via reasoning; a Reflector diagnoses reasoning errors and identifies the temporal features the Generator overlooked; and a Modifier applies verified updates to a self-evolving knowledge bank to prevent context collapse. A test-time update strategy further enables cautious, continuous refinement of the knowledge bank, mitigating few-shot bias and distribution shift. The framework delivers substantial classification gains across 6 mainstream VLM backbones on 12 benchmarks while providing human-readable feature evidence for every decision.

Link: https://arxiv.org/abs/2605.09395
Authors: Lin Li, Jiawei Huang, Qihao Quan, Dan Li, Boxin Li, Xiao Zhang, Erli Meng, Wenjie Feng, Jian Lou, See-Kiong Ng
Affiliations: Sun Yat-sen University; Xiaomi Corporation; University of Science and Technology of China; National University of Singapore
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Multimedia (cs.MM)
Comments: 18 pages, 12 figures, 6 tables. Preprint

Click to view abstract

Abstract:In this paper, we propose the first VLM agentic reasoning framework for few-shot multimodal Time Series Classification (MarsTSC), which introduces a self-evolving knowledge bank as a dynamic context iteratively refined via reflective agentic reasoning. The framework comprises three collaborative roles: i) Generator conducts reliable classification via reasoning; ii) Reflector diagnoses the root causes of reasoning errors to yield discriminative insights targeting the temporal features overlooked by Generator; iii) Modifier applies verified updates to the knowledge bank to prevent context collapse. We further introduce a test-time update strategy to enable cautious, continuous knowledge bank refinement to mitigate few-shot bias and distribution shift. Extensive experiments across 12 mainstream time series benchmarks demonstrate that MarsTSC delivers substantial and consistent performance gains across 6 VLM backbones, outperforming both classical and foundation model-based time series baselines under few-shot conditions, while producing interpretable rationales that ground each classification decision in human-readable feature evidence.

[MA-20] PECMAN: Perception-enabled Collaborative Multi-Agent Navigation in Unknown Environments

[Quick Read]: This paper addresses the challenge of path planning for robots in dynamic and partially observable environments, where traditional planners assume a fully known, static world and cannot adapt to real-time changes. The key of the solution is PECMAN (Perception-Enabled Collaborative Multi-Agent Navigation), a framework built on distributed tree morphing and shared perception: whenever an agent discovers new obstacles or structures, it locally morphs its own RRT* (Rapidly-exploring Random Tree Star) tree and broadcasts the newly perceived structures to the other agents, enabling proactive replanning even in regions an agent has not yet explored, which reduces redundant reactions and repeated replanning and improves overall team efficiency.

Link: https://arxiv.org/abs/2605.09344
Authors: Tianchonghui Fang, Shaunak Roy, Shalabh Gupta
Affiliations: University of Connecticut
Categories: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Most path planners assume fully known, static environments, assumptions that fail when robots navigate in dynamic and partially observable environments. SMART-3D addresses these issues by real-time replanning, where it morphs the underlying RRT* tree whenever new obstacles or structures are discovered in the environment. Instead of rebuilding the tree entirely from scratch, SMART-3D prunes invalid nodes and edges and subsequently repairs the disjoint subtrees at hot-nodes to find a new path, thus providing high computational efficiency for real-time adaptability. We extend SMART-3D to perception-enabled collaborative multi-agent navigation (PECMAN) in unknown environments. PECMAN is built upon distributed tree morphing and shared perception strategies, where each agent reacts to environmental changes and morphs its respective tree to replan its path, while simultaneously broadcasting newly discovered structures to other agents, thus enabling them to proactively replan even in areas that have not yet been explored by them. This approach reduces redundant reactions and unnecessary replannings of the agents due to improved situational awareness. The performance of PECMAN was evaluated by 28,000 multi-agent simulations on seven 2D scenarios with different case studies. The results show that PECMAN achieves up to 52% reduction in the team-completion time, while maintaining near 100% success rates. Finally, PECMAN was tested by real experiments on two autonomous robots in a building environment.

[MA-21] A Cross-Layered Multi-Drone Coordination for Medical Supply Delivery during Disaster Response Management

[Quick Read]: This paper addresses the complex optimization problem of coordinating multiple unmanned aerial vehicles (UAVs) for medical supply delivery during disaster response, where the core challenges include dynamic environmental hazards (wind, obstacles, intermittent network connectivity), energy constraints, and fair service scheduling under triage-based priorities and deadlines. The key of the solution is CEDA, a Centralized Training with Decentralized Execution (CTDE) Deep Q-Network algorithm that introduces a Priority-Preserving Fair Scheduling strategy: a structured reward function jointly encodes the weights of three patient triage classes and complementary fairness mechanisms, ensuring that high-priority patients are served first while lower-priority classes are not starved. Experiments show that CEDA achieves a delivery completion rate above 85% and reduces obstacle collisions by over 90% in simulation, and PX4 software-in-the-loop (SITL) validation confirms that the learned policy remains executable and consistent with clinical priorities.

Link: https://arxiv.org/abs/2605.09342
Authors: Aneesh Calyam, Subrahmanya Chandra Bhamidipati, Zack Murry, Sharan Srinivas
Affiliations: University of Missouri–Columbia
Categories: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Comments: 18 pages, 14 figures, 3 tables

Abstract:Autonomous drone fleets have immense potential in medical supply delivery during disaster incident response. However, coordinating multiple drones in such settings introduces compounding challenges: dynamic environmental hazards such as wind, obstacles, and intermittent network connectivity, constrained energy budgets, and the need to serve patient locations fairly under deadlines and triage-based priority while optimizing schedule utilization. In this paper, we present CEDA, a novel CTDE Deep Q-Network algorithm for cooperative multi-drone medical delivery, designed to jointly optimize triage-priority-aware routing, multi-agent coordination, and energy-efficient navigation under dynamic uncertainty. CEDA introduces a Priority-Preserving Fair Scheduling strategy, in which a structured reward function encodes both triage weights and complementary fairness mechanisms ensuring no patient class is starved of service. We evaluate CEDA in a simulated grid environment featuring dynamic hazard zones, stochastic action failures, and dynamically spawning patients across three triage priority levels, as well as in a PX4 SITL validation using two X500 quadrotors controlled via MAVSDK in offboard position mode. Simulation results demonstrate that CEDA achieves a delivery completion rate above 85%, reduces obstacle collisions by over 90% across training, and delivers an average of 6 patients per episode with a triage efficiency of 0.82. CEDA preserves clinical priority ordering, Critical patients are served first, while achieving near-zero mortality across lower-triage classes, confirming that priority-weighted routing does not condemn Stable or Urgent patients to neglect. PX4 SITL validation further demonstrates that the learned policy remains executable and triage-coherent under practical communication constraints and realistic multi-drone coordination in disaster response settings.
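
A minimal sketch of a reward with the shape the summary describes: triage-priority weights plus a complementary starvation penalty so that lower-priority classes are not neglected. All weights, thresholds, and penalty values are illustrative assumptions, not CEDA's actual reward.

```python
# Illustrative priority-preserving reward: triage weights favor urgent patients,
# while a starvation penalty grows for any class left waiting too long.
TRIAGE_WEIGHT = {"critical": 10.0, "urgent": 5.0, "stable": 2.0}
MAX_WAIT      = {"critical": 5,    "urgent": 15,  "stable": 40}   # steps before the penalty kicks in

def delivery_reward(served_class, wait_steps, collided, energy_used):
    reward = TRIAGE_WEIGHT[served_class]
    if wait_steps > MAX_WAIT[served_class]:
        reward -= 0.5 * (wait_steps - MAX_WAIT[served_class])   # fairness: no class is starved
    reward -= 2.0 * energy_used      # energy-efficiency term
    if collided:
        reward -= 20.0               # obstacle collision penalty
    return reward

print(delivery_reward("critical", wait_steps=3,  collided=False, energy_used=0.4))  # 9.2
print(delivery_reward("stable",   wait_steps=60, collided=False, energy_used=0.4))  # starvation penalty applies
```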

[MA-22] SkillMAS: Skill Co-Evolution with LLM-based Multi-Agent System

[Quick Read]: This paper addresses the difficulty of coordinated post-deployment adaptation in LLM agent systems: existing work decouples skill evolution from multi-agent system (MAS) restructuring, which creates organization bottlenecks, context pressure, and mis-specialization. The key of the solution is the SkillMAS framework, which uses Utility Learning to assign credit from verified execution traces, evolves skills in a bounded way that avoids unfiltered skill-library growth, and triggers evidence-gated MAS restructuring based on retained failures and Executor Utility, thereby coupling skill evolution with structural reorganization.

Link: https://arxiv.org/abs/2605.09341
Authors: Shuai Pan, Yixiang Liu, Jiaye Gao, Te Gao, Weiwen Liu, Jianghao Lin, Zhihui Fu, Jun Wang, Weinan Zhang, Yong Yu
Affiliations: Shanghai Jiao Tong University; Central South University; OPPO
Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
Comments: 21 pages, 2 figures

Click to view abstract

Abstract:Large language model (LLM) agent systems are increasingly expected to improve after deployment, but existing work often decouples two adaptation targets: skill evolution and multi-agent system (MAS) restructuring. This separation can create organization bottlenecks, context pressure, and mis-specialization. We present SkillMAS, a non-parametric framework for adaptive specialization in multi-agent systems that couples skill evolution with MAS restructuring. SkillMAS uses Utility Learning to assign credit from verified execution traces, bounded skill evolution to refine reusable procedures without unfiltered library growth, and evidence-gated MAS restructuring when retained failures and Executor Utility indicate a structural mismatch. Across embodied manipulation, command-line execution, and retail workflows, SkillMAS is competitive under the reported harnesses while clarifying how post-deployment specialization is attributed, updated, and applied.

[MA-23] Learning the Preferences of a Learning Agent ICLR2026

[Quick Read]: This paper addresses the problem of inferring the latent preferences (i.e., the reward function) of a learner from its online behavior while it is still learning to act optimally, in contrast to traditional inverse reinforcement learning (IRL), which assumes approximately optimal human behavior. The key of the solution is to model the learner either as no-regret or as converging to an optimal Boltzmann policy over time, and in each of these settings to establish theoretical guarantees for various preference-learning algorithms, or to show that such guarantees are impossible.

Link: https://arxiv.org/abs/2605.09217
Authors: Karim Abdel Sadek, Mark Bedaywi, Rhys Gould, Stuart Russell
Affiliations: University of California, Berkeley
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Published at ICLR 2026, Workshop on Multi-Agent Learning and Its Opportunities in the Era of Generative AI. 9 pages main text

Click to view abstract

Abstract:For AI systems to be useful to humans, they must understand and act in accordance with our values and preferences. Since specifying preferences is a hard task, inverse reinforcement learning (IRL) aims to develop methods that allow for inferring preferences from observed behavior. However, IRL assumes the human to be approximately optimal. This is a big limitation in cases where the human themselves may be learning to act optimally in an environment. In this paper, we formalize the problem of learning the preferences of a learning agent: a predictor observes a learner acting online and tries to infer the underlying reward function being (initially suboptimally) optimized by the learner. We model the learner as either being no-regret, or as converging to an optimal Boltzmann policy over time. In each of these settings, we establish theoretical guarantees for various preference learning algorithms, or otherwise show that such guarantees are impossible.

[MA-24] MCP-Cosmos: World Model-Augmented Agents for Complex Task Execution in MCP Environments

[Quick Read]: This paper addresses the limited environmental awareness of current LLM agents: existing approaches either focus on task-level planning while ignoring execution-time dynamics, or execute reactively without long-horizon foresight. The key of the solution is the MCP-Cosmos framework, which integrates generative World Models (WM) into the Model Context Protocol (MCP) ecosystem via a "Bring Your Own World Model" (BYOWM) strategy, allowing agents to simulate state transitions and refine plans in a latent space before execution, enabling predictive task automation. Experiments show clear improvements in agent-environment interaction KPIs such as tool success rate and tool-parameter accuracy, and introduce new metrics such as Execution Quality to assess the effectiveness of different world models.

Link: https://arxiv.org/abs/2605.09131
Authors: Giridhar Ganapavarapu, Dhaval Patel
Affiliations: IBM
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:The Model Context Protocol (MCP) has unified the interface between Large Language Models (LLMs) and external tools, yet a fundamental gap remains in how agents conceptualize the environments within which they operate. Current paradigms are bifurcated: Task-level planning often ignores execution-time dynamics, while reactive execution lacks long-horizon foresight. We present MCP-Cosmos, a framework that infuses generative World Models (WM) into the MCP ecosystem to enable predictive task automation. By unifying three disparate technologies, namely MCP, World Model, and Agent, we demonstrate that a “Bring Your Own World Model” (BYOWM) strategy allows agents to simulate state transitions and refine plans in a latent space before execution. We conducted experiments using two strategies, namely ReAct and SPIRAL with 2 planning models and 3 representative world models over 20+ MCP-Bench tasks. We observed improvements in Agent’s environment interaction KPI such as tool success rate and tool parameter accuracy. The framework also offers new metrics such as Execution Quality to generate new insights about the effectiveness of world models compared to baseline.
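
A minimal sketch of the simulate-before-execute loop that the BYOWM idea implies: candidate tool plans are rolled out through a world model's predicted transitions and scored, and only the best plan is executed against the real tools. The `world_model.predict` and `executor` interfaces are hypothetical stand-ins, not MCP-Cosmos or MCP APIs.

```python
def rollout_score(world_model, state, plan):
    """Score a candidate tool plan by simulating its state transitions in the world model."""
    score = 0.0
    for tool_call in plan:
        state, step_reward = world_model.predict(state, tool_call)  # hypothetical WM interface
        score += step_reward
    return score

def plan_then_act(world_model, executor, state, candidate_plans):
    """Refine the plan choice in simulation, then execute only the best one for real."""
    best_plan = max(candidate_plans, key=lambda p: rollout_score(world_model, state, p))
    return [executor(tool_call) for tool_call in best_plan]   # real tool calls happen only here
```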

[MA-25] Internal vs. External: Comparing Deliberation and Evolution for Multi-Agent Constitutional Design

[Quick Read]: This paper addresses where the behavioral constitutions of multi-agent AI systems should come from: should such rules emerge internally through agent deliberation (self-governance) or be discovered externally through optimization (evolution)? The key of the solution is a controlled comparison of the two mechanisms across three social environments (a coordination grid-world, an iterated public goods game, and a bilateral trading market). The results show that external evolution significantly outperforms internal deliberation in collective-action settings (p < 0.01), while neither method improves outcomes in bilateral trading; a multiplier ablation further reveals that when incentives shift (e.g., pool multiplier m = 0.75), the evolved constitution becomes value-destroying, indicating that external optimization wins only under particular conditions, whereas internal self-governance, despite never discovering key mechanisms such as punishment, trades peak performance for structural responsiveness.

Link: https://arxiv.org/abs/2605.09128
Authors: Hershraj Niranjani, Ujwal Kumar, Phan Xuan Tan
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 20 pages

Click to view abstract

Abstract:Multi-agent AI systems need behavioral constitutions, but it is unresolved whether such rules should emerge internally through agent self-governance or be discovered externally through optimization. We present the first controlled comparison of internal deliberation and external evolution across three social environments: a coordination grid-world, an iterated public goods game, and a bilateral trading market. Across 180 simulation runs, evolution significantly outperforms deliberation in collective-action settings (p < 0.01), while neither method improves outcomes in bilateral trading. A multiplier ablation reveals that evolution's advantage inverts when incentives shift: at pool multiplier (m = 0.75) the evolved constitution forces value-destroying cooperation and becomes the worst-performing method. Notably, no deliberation run across thirty trials ever proposed punishment – the canonical cooperation-sustaining mechanism evolution reliably discovers – suggesting external optimization wins on peaks while internal self-governance trades peaks for structural responsiveness.
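
A small worked sketch of why the advantage inverts at m = 0.75: in a linear public goods game each agent receives endowment - contribution + m * (total contributions) / n, so whenever m < 1 every contributed unit returns less than a unit to the group, and a constitution that forces full cooperation destroys value. The endowment and group size are illustrative.

```python
def payoffs(contributions, m, endowment=10.0):
    """Linear public goods payoff: keep what you don't contribute,
    plus an equal share of the multiplied common pool."""
    n = len(contributions)
    pool_share = m * sum(contributions) / n
    return [endowment - c + pool_share for c in contributions]

n_agents = 4
full_coop = [10.0] * n_agents
defection = [0.0] * n_agents

for m in (1.6, 0.75):
    print(m, sum(payoffs(full_coop, m)), sum(payoffs(defection, m)))
# m=1.6 : full cooperation yields total welfare 64 > 40, so enforcing cooperation helps
# m=0.75: full cooperation yields 30 < 40, so a pro-cooperation constitution destroys value
```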

[MA-26] Robust Multi-Agent LLMs under Byzantine Faults

[Quick Read]: This paper addresses information contamination and performance degradation in LLM multi-agent systems (LLM-MAS) caused by unreliable or Byzantine agents interacting over peer-to-peer networks. Existing methods rely on centralized leaders or self-reported confidence, both of which are susceptible to adversarial manipulation. The key of the solution is Self-Anchored Consensus (SAC), a fully decentralized iterative filter-and-refine protocol in which agents locally evaluate and filter unreliable messages and iteratively refine their own outputs; under (F+1)-robustness conditions on the communication graph, honest agents are guaranteed to preserve and propagate reliable information, effectively suppressing Byzantine influence and improving performance on mathematical and commonsense reasoning benchmarks.

Link: https://arxiv.org/abs/2605.09076
Authors: Haejoon Lee, Vincent-Daniel Yun, Hyeonho Oh, Dimitra Panagou, Sai Praneeth Karimireddy
Affiliations: University of Michigan; University of Southern California
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Large language model (LLM) agents increasingly collaborate over peer-to-peer networks to improve their reliability. However, these same interactions can also become a source of vulnerability, as unreliable or Byzantine agents may sway neighboring agents toward incorrect conclusions and degrade overall system performance. Existing methods rely on leader-based coordination or self-reported confidence, both of which are susceptible to adversarial manipulation. We study decentralized LLM multi-agent systems (LLM-MAS) and propose Self-Anchored Consensus (SAC), a fully decentralized iterative filter-and-refine protocol in which agents iteratively exchange responses, locally evaluate and filter unreliable messages, and refine their own outputs. We present (F+1)-robustness conditions for the communication graph that ensure honest agents preserve and propagate reliable information despite Byzantine influence. Experiments on mathematical and commonsense reasoning benchmarks show that SAC effectively suppresses Byzantine influence and consistently improves performance across diverse communication topologies, whereas prior methods degrade under adversarial conditions.

[MA-27] Octopus Protocol: One-Shot Hardware Discovery and Control for AI Agents via Infrastructure-as-Prompts

【Quick Read】: This paper targets the heavy engineering cost of bringing up new hardware for agentic-robotics systems when no low-level drivers, SDKs, or ROS-style primitives exist; writing such hardware-specific interface code by hand has become the key bottleneck for agent control. The core of the solution is the Octopus Protocol, a five-stage automated pipeline (PROBE, IDENTIFY, INTERFACE, SERVE, DEPLOY): given only raw OS access and a language-model API key, a coding agent discovers connected devices, infers their capabilities, generates a typed Model Context Protocol (MCP) server, and deploys it as a live HTTP endpoint. Two architectural principles make this work: protocols are defined as prompts rather than hard-coded, and the coding agent itself serves as the runtime, enabling "one-command onboarding" that sharply lowers the barrier to hardware integration and supports closed-loop visual-motor control.

Link: https://arxiv.org/abs/2605.09055
Authors: Quilee Simeon, Justin M. Wei, Yile Fan
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Recent agentic-robotics systems, from Code-as-Policies to modern vision-language-action (VLA) foundation models, presuppose that drivers, SDKs, or ROS-style primitives for the target hardware already exist. Writing those primitives is the dominant engineering cost of bringing up new hardware for agent control. We present Octopus Protocol, a system that collapses that cost to a single shell command. Given only raw OS access and a language-model API key, a coding agent executes a five-stage pipeline (PROBE, IDENTIFY, INTERFACE, SERVE, DEPLOY) to discover connected devices, infer their capabilities, generate a Model Context Protocol (MCP) server with typed tools, and deploy it as a live HTTP endpoint. A persistent daemon then monitors the system, heals broken code, and perceives physical state through the camera tools it generated for itself. Two architectural principles make this work: protocols are prompts, not code, and the coding agent is the runtime. We validate the system on three heterogeneous platforms (PC/WSL, Apple Silicon macOS, Raspberry Pi 4) and on a commercial 6-DOF robotic arm with USB camera feedback. One command onboards the hardware in ~10-15 minutes and exposes up to 30 MCP tools; an MCP-compliant client then performs closed-loop visual-motor control through tools no human wrote.

[MA-28] Not All Turns Matter: Credit Assignment for Multi-Turn Jailbreaking

【Quick Read】: This paper tackles the credit assignment problem in reinforcement-learning-based multi-turn jailbreak attacks. Existing methods rely on coarse trajectory-level outcome signals that reward or penalize every turn uniformly, over-rewarding redundant turns in successful trajectories and under-crediting useful intermediate turns in failed ones, which hurts attack efficiency and transferability. The key contribution is TRACE, a framework for fine-grained turn-level credit assignment: for successful trajectories, each turn's contribution is estimated via leave-one-turn-out semantic masking; for failed trajectories, penalties combine prompt harmfulness and semantic relevance, plus a local refusal-aware penalty. This design substantially improves the effectiveness, transferability, and efficiency of multi-turn jailbreaks, and the resulting credit signal can also be reused for multi-turn defense alignment, improving the safety-utility balance.

Link: https://arxiv.org/abs/2605.08778
Authors: Zhida He, Xiaoyu Wen, Han Qi, Ziyuan Zhou, Peng Yu, Xingcheng Xu, Dongrui Liu, Xia Hu, Chaochao Lu, Qiaosheng Zhang
Affiliations: Shanghai AI Laboratory; Fudan University; Shanghai Jiao Tong University
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 41 pages, 10 figures

Click to view abstract

Abstract:Deploying LLMs in multi-turn dialogues facilitates jailbreak attacks that distribute harmful intent across seemingly benign turns. Recent training-based multi-turn jailbreak methods learn long-horizon attack strategies from interaction feedback, but often rely on coarse trajectory-level outcome signals that broadcast uniformly to every turn. However, we find that turn-level contributions in multi-turn jailbreaking are non-uniform, phase-dependent, and target-specific. Such coarse outcome supervision induces a credit assignment problem, leading to over-rewarding redundant turns in successful trajectories and under-crediting useful intermediate turns in failed ones. To address this, we propose TRACE, a turn-aware credit assignment framework for reinforcement learning (RL)-based multi-turn jailbreaking. For successful trajectories, TRACE estimates turn-level contributions via leave-one-turn-out semantic masking; for failed ones, TRACE assigns penalties based on prompt harmfulness and semantic relevance, with an additional local refusal-aware penalty. Furthermore, we reuse the attack-side credit signal for multi-turn defense alignment. Extensive experiments on open-source and closed-source targets show that TRACE achieves strong overall performance in effectiveness, transferability, and efficiency, yielding about a 25% relative improvement in attack success rate over the strongest RL baseline while also improving the safety-utility balance when reused for defense alignment.
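
To make the leave-one-turn-out idea concrete, here is a minimal sketch of turn-level credit for a successful trajectory. The scalar judge `attack_success_score` and the use of full turn removal (rather than the paper's semantic masking) are simplifying assumptions.

```python
# Minimal sketch of leave-one-turn-out credit for a successful trajectory.
from typing import Callable, List

def turn_credits(
    turns: List[str],
    attack_success_score: Callable[[List[str]], float],   # assumed scalar judge/reward
) -> List[float]:
    """Credit of turn i = score drop when turn i is removed (higher = more essential)."""
    full = attack_success_score(turns)
    credits = []
    for i in range(len(turns)):
        masked = turns[:i] + turns[i + 1:]                 # leave-one-turn-out
        credits.append(full - attack_success_score(masked))
    return credits
```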

[MA-29] Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows

【Quick Read】: This paper addresses the lack of evaluation for LLM agent collaboration under realistic enterprise constraints such as role specialization, permission isolation, stateful business systems, and policy-based approvals. The key contribution is EntCollabBench, a multi-agent collaboration benchmark that simulates a permission-isolated organization with 11 role-specialized agents across six departments and contains two subsets: a Workflow subset, which evaluates how agents collaboratively modify enterprise system state, and an Approval subset, which focuses on policy-grounded decision making. Evaluation is based on execution traces, database state verification, and deterministic policy adjudication rather than judging natural-language responses, giving a more faithful measure of end-to-end agent collaboration in complex enterprise scenarios.

Link: https://arxiv.org/abs/2605.08761
Authors: Tao Yu, Hao Wang, Changyu Li, Shenghua Chai, Minghui Zhang, Zhongtian Luo, Yuxuan Zhou, Haopeng Jin, Zhaolu Kang, Jiabing Yang, YiFan Zhang, Xinming Wang, Hongzhu Yi, Zheqi He, Jing-Shu Zheng, Xi Yang, Yan Huang, Liang Wang
Affiliations: BAAI; CASIA; UCAS; Peking University; Tsinghua University
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Comments: 45 pages

Click to view abstract

Abstract:Large language model (LLM) agents are increasingly expected to operate in enterprise environments, where work is distributed across specialized roles, permission-controlled systems, and cross-departmental procedures. However, existing enterprise benchmarks largely evaluate single agents with broad tool access, while existing multi-agent benchmarks rarely capture realistic enterprise constraints such as role specialization, access control, stateful business systems, and policy-based approvals. We introduce EntCollabBench, a benchmark for evaluating enterprise multi-agent collaboration. EntCollabBench simulates a permission-isolated organization with 11 role-specialized agents across six departments and contains two evaluation subsets: a Workflow subset, where agents collaboratively modify enterprise system states, and an Approval subset, where agents make policy-grounded decisions. Evaluation is based on execution traces, database state verification, and deterministic policy adjudication rather than natural-language response judging. Experiments with representative LLM agents show that current models still struggle with end-to-end enterprise collaboration, especially in delegation, context transfer, parameter grounding, workflow closure, and decision commitment. EntCollabBench provides a reproducible testbed for measuring and improving agent systems intended for realistic organizational environments.

[MA-30] Communicating Sound Through Natural Language

【Quick Read】: This paper asks whether audio can be represented and transmitted through natural language, moving beyond purely numeric or symbolic audio coding and treating language itself as the carrier of audio information. The key is lexical acoustic coding (LAC), in which pre-trained LLM sender and receiver agents operate under fixed system prompts: the sender analyzes a waveform into interpretable, non-learned acoustic descriptors, quantizes each with a feature-specific interval vocabulary, and verbalizes the result as an English sentence; the receiver parses the sentence and reconstructs a waveform through closed-loop refinement. The transmitted text thus serves as a semantic, editable transport representation of audio structure that is native to LLM-mediated communication.

Link: https://arxiv.org/abs/2605.08750
Authors: Emanuele Rossi, Emanuele Rodolà
Affiliations: Sapienza University of Rome; Paradigma
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: Includes link to demo page

Click to view abstract

Abstract:Natural language is widely used to describe, prompt, and control audio systems, but rarely serves as the representation carrying audio itself. We introduce lexical acoustic coding (LAC), a framework in which pre-trained LLM sender and receiver agents transmit sound through natural language. Under fixed system prompts, the agents write their own analysis and synthesis code, communicating only through a lexical sentence, shared vocabulary, and optional symbolic music structure. The sender analyzes an input waveform into interpretable, non-learned acoustic descriptors, quantizes each with a feature-specific interval vocabulary, and verbalizes the lexical code as English. The receiver parses the sentence back into lexical-acoustic constraints and renders a waveform through closed-loop refinement. The transmitted text serves as both a rich caption and as the transport representation itself. We frame LAC as a finite-rate lossy quantizer, exposing trade-offs between vocabulary size, rate, and fidelity. Experiments on short sounds and symbolic music transfer show that plain text preserves measurable acoustic structure while remaining interpretable, editable, and native to LLM-mediated communication.
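
A toy illustration of the quantize-and-verbalize step described above: each non-learned descriptor is binned with a feature-specific interval vocabulary and rendered as an English sentence. The descriptor names, bin edges, and wording are invented for illustration and are not taken from the paper.

```python
# Toy quantize-and-verbalize step: numeric descriptors -> interval words -> one sentence.
import bisect

VOCAB = {
    "loudness_db": ([-60, -40, -20], ["very quiet", "quiet", "moderate", "loud"]),
    "centroid_hz": ([500, 2000, 5000], ["dark", "warm", "bright", "piercing"]),
    "duration_s":  ([0.5, 2.0, 5.0], ["very short", "short", "medium", "long"]),
}

def verbalize(descriptors: dict) -> str:
    """Map each descriptor to its interval label and emit a single lexical sentence."""
    words = {}
    for name, value in descriptors.items():
        edges, labels = VOCAB[name]
        words[name] = labels[bisect.bisect_right(edges, value)]
    return (f"A {words['duration_s']} sound, {words['loudness_db']} in level, "
            f"with a {words['centroid_hz']} timbre.")

print(verbalize({"loudness_db": -25, "centroid_hz": 3000, "duration_s": 1.2}))
```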

[MA-31] HULK: Large-scale Hierarchical Coordination under Continual and Uncertain Temporal Tasks

【Quick Read】: This paper addresses the coordination of large-scale multi-agent systems under continually generated, temporally uncertain tasks specified as temporal logic formulas over collaborative actions. Traditional approaches assume static, known tasks and solve an integer program offline, which is inadequate online because it would require constant re-computation for the whole team and global communication to broadcast the results. The key is HULK, a hierarchical framework with two interleaved layers: a rolling assignment of currently known tasks to subteams within a finite horizon, and dynamic coordination within each subteam for the subtasks detected during online execution. Coordination thus happens at different granularities and triggering conditions, improving computational efficiency and robustness.

Link: https://arxiv.org/abs/2605.08722
Authors: Qingyuan Luo, Jie Li, Meng Guo
Affiliations: Peking University; National University of Defense Technology
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments: Accepted to the IEEE International Conference on Robotics and Automation. 7 pages, 4 figures

Click to view abstract

Abstract:Multi-agent systems can be extremely efficient when working concurrently and collaboratively, e.g., for delivery, surveillance, search and rescue. Coordination of such teams often involves two aspects: selecting appropriate subteams for different tasks in various areas, and coordinating agents in the subteams to execute the associated subtasks. Existing work often assumes that the tasks are static and known beforehand, where an integer program can be formulated and solved offline. However, in many applications, the team-wise tasks are generated online continually by external requests, and the amount of subtasks within each task is uncertain, e.g., the number of packages to deliver or victims to rescue. The aforementioned offline solution becomes inadequate as it would require constant re-computation for the whole team and global communication to broadcast the results. Thus, this work tackles the large-scale coordination problem under continual and uncertain temporal tasks, specified as temporal logic formulas over collaborative actions. The proposed hierarchical framework, HULK, consists of two interleaved layers: the rolling assignment of currently known tasks to subteams within a certain horizon, and the dynamic coordination within a subteam given the detected subtasks during online execution. Thus, coordination is performed hierarchically at different granularities and triggering conditions, improving computational efficiency and robustness. The method is validated rigorously over large-scale heterogeneous systems under various temporal tasks and environment uncertainties.

[MA-32] AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

【Quick Read】: This paper addresses cascading failures in LLM-based multi-agent systems on long-horizon tasks, where a single decisive error propagates downstream; existing work only performs post-hoc failure attribution and cannot intervene while the trajectory is still unfolding. The key is AgentForesight, which reframes the problem as online auditing: at every step, an auditor sees only the current prefix and must either let the run continue or raise an alarm, without access to future information. The core contributions are AFTraj-2K, a dataset of coding, math, and agentic trajectories annotated with the step at which the decisive error occurs, and AgentForesight-7B, a compact online auditor trained with a coarse-to-fine reinforcement learning recipe that first builds a risk-anticipation prior at the failure boundary and then sharpens it into precise localization of the "what", "where", and "who" of an error via a three-axis reward. This enables deployment-time intervention and outperforms strong proprietary models such as GPT-4.1 and DeepSeek-V4-Pro.

Link: https://arxiv.org/abs/2605.08715
Authors: Boxuan Zhang, Jianing Zhu, Zeru Shi, Dongfang Liu, Ruixiang Tang
Affiliations: Rutgers University; The University of Texas at Austin; Purdue University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 33 pages, 7 figures

Click to view abstract

Abstract:LLM-based multi-agent systems are increasingly deployed on long-horizon tasks, but a single decisive error is often accepted by downstream agents and cascades into trajectory-level failure. Existing work frames this as post-hoc failure attribution, diagnosing the responsible agent and step after the trajectory has ended. However, this paradigm forfeits any opportunity to intervene while the trajectory is still unfolding. In this work, we introduce AgentForesight, a framework that reframes this problem as online auditing: at each step of an unfolding trajectory, an auditor observes only the current prefix and must either continue the run or alarm at the earliest decisive error, without access to future steps. To this end, we curate AFTraj-2K, a corpus of agentic trajectories across Coding, Math, and Agentic domains, in which safe trajectories are retained under a strict curation pipeline and unsafe trajectories are annotated at the step of their decisive error via consensus among multiple LLM judges. Built on that, we develop AgentForesight-7B, a compact online auditor trained with a coarse-to-fine reinforcement learning recipe that first equips it with a risk-anticipation prior at the failure boundary on adjacent safe/unsafe prefix pairs, then sharpens this prior into precise step-level localization under a three-axis reward jointly targeting the what, where, and who of an audit verdict. Across AFTraj-2K and an external Who&When benchmark, AgentForesight-7B outperforms leading proprietary models, including GPT-4.1 and DeepSeek-V4-Pro, achieving up to +19.9% performance gain and 3× lower step localization error, opening the loop from post-hoc failure detection to deployment-time intervention. Project page: this https URL

[MA-33] MIND-Skill: Quality-Guaranteed Skill Generation via Multi-Agent Induction and Deduction

【Quick Read】: This paper addresses the limited performance of LLM-powered agents on complex, multi-step real-world tasks caused by missing domain-specific procedural knowledge; existing practice distills reusable skills manually, which is inefficient and hard to scale. The key is MIND-Skill, a framework in which an induction agent abstracts generalizable skills from successful trajectories and a deduction agent reconstructs trajectories by following those skills, enabling automatic skill generation with quality guarantees. Quality is enforced through a reconstruction loss, an outcome loss, and a rubric loss that regularizes documentation quality and abstraction level, jointly optimized with TextGrad. On AppWorld and BFCL-v3, MIND-Skill consistently outperforms existing skill-generation methods.

Link: https://arxiv.org/abs/2605.08670
Authors: Yixuan Li, Mingshu Cai, Ziyang Xiao, Wanyuan Wang, Yanchen Deng, Bo An
Affiliations: Nanyang Technological University; Waseda University; Zhejiang University; Southeast University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Large language model (LLM) powered AI agents have emerged as a promising paradigm for autonomous problem-solving, yet they continue to struggle with complex, multi-step real-world tasks that demand domain-specific procedural knowledge. Reusable agent skills, which encapsulate successful problem-solving strategies, offer a natural remedy by enabling agents to build on prior experience. However, curating such skills has largely remained a manual endeavor, requiring human experts to distill rich domain knowledge into actionable guidelines. In this work, we present Multi-agent INduction and Deduction for Skills (MIND-Skill), a framework that automatically induces generalizable skills from successful trajectories with robust quality guarantees. MIND-Skill consists of an induction agent which is tasked to abstract reusable skills from successful trajectories, and a deduction agent which aims to reconstruct trajectories by following the induced skills. To guarantee the quality of the generated skills, we introduce a reconstruction loss that compares input and reconstructed trajectories, an outcome loss that enforces the correctness of the reconstructed trajectories, and a rubric loss that assesses the documentation quality and regularizes the abstraction level of the generated skills according to predefined criteria. These textual losses are jointly optimized with TextGrad, and the resulting skills are evaluated on held-out tasks unseen during optimization. Experiments on AppWorld and BFCL-v3 show that MIND-Skill consistently outperforms concurrent skill generation methods.

[MA-34] Modeling Decision-Making with Will for Cooperation in Social Dilemmas

【Quick Read】: This paper revisits cooperation failure in social dilemmas: standard rational-actor models attribute it to insufficient incentives while overlooking the destabilizing effect of continuous utility maximization. The key idea is a notion of "will", formalized as a mechanism that persistently pursues goals while ignoring local cost-benefit fluctuations; willed agents are modeled as potential minimizers, distinct from cumulative utility maximizers. The analysis shows that willed agents shrink the feasible state space and act as boundary constraints that accelerate convergence, and in spatiotemporal Stag Hunt simulations they function as "cooperation catalysts" that let groups surmount high-risk thresholds where pure utility maximization fails, suggesting that successful cooperation relies on the cognitive capacity to strategically constrain calculation.

Link: https://arxiv.org/abs/2605.08669
Authors: Yizhe Huang, Bin Ling, Song-Chun Zhu, Xue Feng
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Physics and Society (physics.soc-ph)
Comments: Accepted at CogSci 2026

Click to view abstract

Abstract:Standard rational actor models often attribute cooperation failures in social dilemmas to insufficient incentives, overlooking the destabilizing effects of continuous utility maximization. To address this, we propose a framework of "will", defined as a mechanism that persistently pursues goals while ignoring local cost-benefit fluctuations. We formalize the Willed Agents as potential minimizers, distinguishing them from cumulative utility maximizers. Dynamical analysis of infinite population demonstrates that willed agents shrink the feasible state space, acting as boundary constraints that accelerate convergence in canonical social dilemmas. Through multi-agent simulations in a spatiotemporal Stag Hunt Game, we show that willed agents function as "cooperation catalysts", enabling groups to surmount high-risk thresholds where pure utility maximization fails. We find that heterogeneous will strength promotes cooperation, and that agents who autonomously suspend rational re-evaluation can significantly outperform continuous optimizers. These findings suggest that successful cooperation relies on the cognitive capacity to strategically constrain calculation.

[MA-35] Generalization Bounds of Emergent Communications for Agentic AI Networking

【Quick Read】: This paper addresses the evolution of 6G networking toward agentic AI-native communication (AgentNet), where rigid, predefined-protocol architectures struggle with complex collaborative tasks and dynamic environments; existing emergent-communication frameworks also ignore physical constraints such as bandwidth and computational complexity and lack a rigorous information-theoretic foundation. The key is a new emergent communication framework grounded in multi-agent, multi-task distributed information bottleneck (DIB) theory: a joint loss function unifies the optimization of decision-making functions and learned communication signaling, quantifying the trade-off between task-relevant information representation and computational complexity, and the paper derives theoretical generalization bounds for decentralized inference in unseen environmental states. Experiments on a real hardware prototype show significant gains over state-of-the-art solutions.

Link: https://arxiv.org/abs/2605.08613
Authors: Yong Xiao, Jingxuan Chai, Guangming Shi, Ping Zhang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Information Theory (cs.IT); Multiagent Systems (cs.MA)
Comments: Accepted at IEEE ISIT Workshop, Guangzhou, China, June 2026

Click to view abstract

Abstract:The evolution of 6G networking toward agentic AI networking (AgentNet) systems requires a shift from traditional data pipelines to task-aware, agentic AI-native communication solutions. Emergent communication, a novel communication paradigm in which autonomous agents learn their own signaling protocols through interaction, is increasingly viewed as a promising solution to address the challenges posed by existing rigid, predefined protocol-based networking architecture. However, most existing emergent communication frameworks fail to account for physical networking constraints, such as bandwidth and computational complexity, and often lack a rigorous information-theoretical foundation. To address these challenges, this paper introduces a novel emergent communication framework that facilitates collaborative task-solving among heterogeneous agents through an information-theoretic lens. We propose a novel joint loss function that unifies the optimization of decision-making functions and the learning of communication signaling. Our proposed solution is grounded on the multi-agent and multi-task distributed information bottleneck (DIB) theory, which allows the quantification of the fundamental trade-off between task-relevant information representation and computational complexity. We further provide theoretical generalization bounds of the emergent communication protocol during decentralized inference across unseen environmental states. Experimental validation on a real-world hardware prototype confirms that our proposed framework significantly improves generalization performance, compared to the state-of-the-art solutions.

[MA-36] Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents

【Quick Read】: This paper addresses the accuracy loss caused by synchronous compaction in long-horizon, large-context agents: because compaction runs on the critical path and the compactor cannot know what information the agent will need later, key facts or intent may be dropped, and since subsequent steps are conditioned on the new summary, errors propagate silently without any independent validation criterion. The key is asynchronous compaction: the compactor runs in parallel with the agent, which keeps executing on the original context, so the candidate summary and the agent's next steps are produced independently from the same pre-compaction state. A judge can then validate the summary by checking that it preserves the agent's forward intent and the key facts and constraints it depends on, yielding trajectory-grounded compaction validation that improves both accuracy and latency.

Link: https://arxiv.org/abs/2605.08580
Authors: Zhuofu Chen, Rui Pan, Yinwei Dai, Ravi Netravali
Affiliations: Princeton University
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 9 pages (16 pages counting references, appendix), 6 figures, 2 tables

Click to view abstract

Abstract:To cope with the large contexts that long-horizon LLM agents produce, modern frameworks increasingly rely on compaction – invoking an LLM to rewrite the accumulated trajectory into a shorter summary that the agent resumes from. Today, compaction runs synchronously on the critical path of agent execution but this can unpredictably degrade accuracy due to a structural validation gap: the compactor must condense context but is fundamentally unaware of precisely what information the agent will need later. Further, because post-compaction agent steps are conditioned on the new summary, targeted validation criteria do not exist and errors silently propagate through coherent but incorrect behavior. Our key insight is that asynchronous compaction efficiently addresses this gap: by running the compactor in parallel with continued agent execution on the original context, the candidate summary and the agent’s next steps are generated independently from the same pre-compaction state, yielding a validation signal independent of the summary itself. We build Slipstream, a trajectory-grounded compaction system that uses a judge to validate the candidate summary against the agent’s continued reasoning, checking that it preserves both the agent’s forward intent and the key facts and constraints it depends on. Across long-horizon coding (SWE-bench Verified) and web-browsing (BrowseComp) workloads, Slipstream improves task accuracy by up to 8.8 percentage points while reducing end-to-end latency by up to 39.7%.
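
A conceptual sketch of the asynchronous-compaction idea: the compactor runs in parallel with the agent, which keeps working on the original context, and a judge compares the candidate summary against the agent's continued step before the summary is adopted. All callables stand in for LLM calls; names and control flow are assumptions, not the Slipstream implementation.

```python
# Sketch: asynchronous compaction with trajectory-grounded validation (illustrative only).
import concurrent.futures as cf

def step_with_async_compaction(context, agent_step, compact, judge_accepts):
    """Run compaction off the critical path and validate it against the agent's next step."""
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        summary_future = pool.submit(compact, context)   # candidate summary, in parallel
        next_action = agent_step(context)                # agent keeps working on the full context
        summary = summary_future.result()
    # Both artifacts were produced independently from the same pre-compaction state,
    # so the continued step serves as a validation signal for the summary.
    if judge_accepts(summary, next_action):
        context = summary                                # resume from the compacted context
    return context, next_action
```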

[MA-37] Too Many Specialists: Emergent Inefficiencies and Bottlenecks for Multi-agent Ad-hoc Collaboration AAMAS2026

【Quick Read】: This paper addresses the fact that existing computational models of collaboration without prior coordination overlook how heterogeneous agent traits and complex task structures jointly produce systemic bottlenecks, inefficiencies, and unequal contributions. The key is an agent-based model of ad-hoc teamwork in a kitchen environment that combines diverse agent personas with tasks containing both serial and parallel dependencies. The model reveals a "specialist's dilemma": rigid role assertion creates system-level bottlenecks, amplifies workload inequality, and produces fragmented, homophilous network structures; it also shows how team size and communication overhead interact with problem structure to yield diminishing returns and redundant collaboration. Linking micro-level behavior to macro-level outcomes provides both insight into emergent collaboration and design principles for effective multi-agent teamwork.

Link: https://arxiv.org/abs/2605.08540
Authors: Benjamin Panny, Shashank Mehrotra, Zahra Zahedi, Teruhisa Misu, Kumar Akash
Affiliations: Honda Research Institute
Subjects: Multiagent Systems (cs.MA); Human-Computer Interaction (cs.HC)
Comments: Published in Proc. of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

Click to view abstract

Abstract:Computational models of collaboration without prior coordination often overlook how heterogeneous agent traits and complex task structures jointly produce systemic bottlenecks, inefficiencies, and contribution inequalities. We address this by using an agent-based model of ad-hoc teamwork in a kitchen environment. Our model integrates diverse agent personas with tasks that combine serial and parallel dependencies. We identify a specialist’s dilemma, where rigid role assertion generates system-level bottlenecks, amplifies workload inequality, and fosters fragmented, homophilous networks. We also find that team size and communication overhead interact with problem structure to generate diminishing returns and redundant collaboration. Linking micro-level behavior to macro-level outcomes provides insights into emergent collaboration and design principles for effective multi-agent teamwork.

[MA-38] SceneFactory: GPU-Accelerated Multi-Agent Driving Simulation with Physics-Based Vehicle Dynamics

【Quick Read】: This paper addresses the difficulty of balancing physical fidelity and scalability in autonomous-driving simulation. Physics engines such as CARLA and MetaDrive offer articulated vehicle dynamics and contact, but their non-vectorized interfaces hinder batched training, while GPU-batched systems such as Waymax and GPUDrive parallelize efficiently by replacing rigid-body physics with simplified kinematics, sacrificing tire-road interaction, suspension, and road-condition-dependent friction. The key is SceneFactory, a GPU-vectorized platform built on NVIDIA Isaac Sim + Isaac Lab that represents worlds and agents as batched tensors, with control, observation, reward, reset, and policy inference running as GPU tensor operations through the Isaac Lab tensor API. It runs hundreds of scenarios concurrently on a single GPU with articulated PhysX vehicles and maps precipitation and road-surface type to friction coefficients, achieving up to 127x higher throughput than a non-vectorized PhysX baseline while retaining rigid-body dynamics and physically grounded road-condition variation.

Link: https://arxiv.org/abs/2605.08528
Authors: Yicheng Zhu, Yang Chen, Tao Li, Zilin Bian
Affiliations: Rochester Institute of Technology; City University of Hong Kong
Subjects: Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:Autonomous-driving simulators typically trade physical fidelity for scalable parallelism. Physics-based platforms such as CARLA and MetaDrive provide articulated vehicle dynamics and contact, but their non-vectorized interfaces make batched training difficult. GPU-batched systems such as Waymax and GPUDrive scale to hundreds of scenarios by replacing rigid-body physics with simplified kinematic models, omitting tire–road interaction, suspension, contact dynamics, and road-condition-dependent friction. We introduce SceneFactory, a GPU-vectorized platform for procedural scene construction, physics-based multi-agent simulation, and RL in autonomous-driving environments. Built on NVIDIA Isaac Sim + Isaac Lab, SceneFactory represents worlds and agents as batched tensors: control, observations, rewards, resets, and policy inference run as GPU tensor operations over the Isaac Lab tensor API. SceneFactory converts Waymo Open Motion Dataset road topologies into simulation-ready USD worlds, runs many worlds concurrently on one GPU, populates each with multiple articulated PhysX vehicles, and maps precipitation and road-surface type to PhysX material friction coefficients. With GPU vectorization, SceneFactory achieves up to 127× higher throughput than a non-vectorized PhysX baseline on the same GPU and physics solver, reaching 19,250 controlled-agent simulation steps per second at 256 worlds × 16 agents. Cross-simulator transfer reveals an asymmetric dynamics gap: physics-grounded RL policies transfer to a simplified kinematic bicycle model with 99.5% success, whereas reverse transfer drops to 47.3%. Under wet-road friction, friction-aware policies reduce mean peak DRAC from 58.7 to 27.8 m/s² without sacrificing goal reach. SceneFactory shows that scalable autonomous-driving training need not discard articulated rigid-body dynamics or physically grounded road-condition variation.

[MA-39] LLM Wardens: Mitigating Adversarial Persuasion with Third-Party Conversational Oversight

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)日益增强的说服能力所带来的用户操纵风险,尤其是在决策场景中可能被用于隐蔽引导用户做出非自愿选择的问题。其解决方案的关键在于引入一个“看守者”模型(warden model)——一个在实时交互过程中监控人类与AI行为轨迹的次级LLM,当检测到潜在操纵时向用户提供非约束性的私密建议。实验表明,该机制能显著降低对手型LLM的成功率(从65.4%降至30.4%),同时对正常交互影响较小(仅下降8.6个百分点),且即使看守者模型能力弱于对手模型,仍能提供有效防护,为可扩展地监督更强大模型提供了可行路径。

链接: https://arxiv.org/abs/2605.08321
作者: Lennart Wachowiak,Scott D. Blain,David Williams-King,Samuele Marro
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:LLMs are increasingly capable of persuasion, which raises the question of how to protect users against manipulation. In a preregistered user study (N=120) across four decision-making scenarios, we find that an adversarial LLM with a hidden goal succeeds in steering users’ decisions 65.4% of the time. We then introduce a “warden” model: a secondary LLM that monitors the human-AI interaction trace in real time and issues non-binding, private advisories to the user when it detects manipulation. Adding a warden more than halves the adversary’s success rate to 30.4%, with a much smaller (8.6 percentage points) reduction for genuine interactions. To probe the mechanism behind these results, we release COAX-Bench, a simulation benchmark spanning 14 decision-making scenarios, including hiring, voting, and file access. Across 16,212 simulated multi-agent interactions, capable adversarial LLMs achieve their hidden goals in 34.7% of cases, which warden models reduce to 12.3%. Notably, even warden models substantially weaker than the adversary they oversee provide meaningful protection, suggesting a path for scalable oversight of more capable models.

[MA-40] Insider Attacks in Multi-Agent LLM Consensus Systems

【Quick Read】: This paper addresses the problem of malicious insiders manipulating consensus formation in multi-agent large language model (MALLM) systems. Ideally all participating agents share the system objective, but in practice a malicious agent can pose as a legitimate member and covertly delay or prevent agreement among the benign agents. The key is a world-model-based framework: first learn surrogate dynamics over the latent behavioral states of the benign agents, then use reinforcement learning to train an attacker that optimizes its manipulation strategy against this learned model. The trained attacker adapts its behavior and reduces the benign consensus rate and prolongs disagreement more effectively than a direct malicious-prompt baseline.

Link: https://arxiv.org/abs/2605.08268
Authors: Xiaolin Sun, Zixuan Liu, Yibin Hu, Zizhan Zheng
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) are increasingly deployed in multi-agent systems where agents communicate in natural language to solve tasks jointly. A key capability in such systems is consensus formation, where agents iteratively exchange messages and update decisions to reach a shared outcome. However, most existing multi-agent LLM frameworks assume that all participating agents are aligned with the system objective. In practice, a malicious insider may participate as a legitimate member of the group while pursuing a hidden adversarial goal. In this work, we study insider manipulation in multi-agent LLM consensus systems. We formalize the problem as a sequential decision-making task in which a malicious agent seeks to delay or prevent agreement among benign agents. To make attack optimization tractable, we propose a world-model-based framework that learns surrogate dynamics over the latent behavioral states of benign agents and then trains an attacker using reinforcement learning based on this learned model. Preliminary results show that the trained attacker reduces the benign consensus rate and prolongs disagreement more effectively than the direct malicious-prompt baseline. These results suggest that combining latent world models with reinforcement learning is a promising direction for adaptive insider attacks in language-based multi-agent systems.

[MA-41] Designing Intelligent Enterprise Agents: A Capability-Aligned Multi-Agent Architecture

【Quick Read】: This paper addresses governance failures, operational fragility, and runaway complexity in enterprise deployments of intelligent agents, arguing that these stem from treating governance as the primary abstraction while neglecting the design of the agents themselves, which leaves capability boundaries blurred, interaction protocols muddled, and state management missing. The key is CEAD (Capability-Aligned Enterprise Agent Design), a design-first reference architecture that puts capability alignment, autonomy allocation, interaction protocols, tool and data authority, state and memory design, verification design, and human interaction design first, borrowing service-oriented architecture's contracts and loose coupling for scalable integration while warning against mistaking microservice-style decomposition for agent design. Over 10,000 enterprise tasks, CEAD achieves 70.6% safe success, clearly outperforming four alternative architectures and supporting the thesis that design quality is the first-order enterprise concern.

Link: https://arxiv.org/abs/2605.08258
Authors: John deVadoss
Affiliations: InterWork Alliance
Subjects: Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Enterprise interest in multi-agent systems has shifted from generic software agents to large-language-model (LLM) based intelligent agents that plan, use tools, maintain contextual memory, inspect intermediate results, collaborate with other agents, and sometimes act in systems of record. This paper revises the enterprise architecture thesis around a design-first claim: governance is necessary, but it cannot be the primary organizing abstraction. The primary abstraction must be agent design - capability boundaries, autonomy allocation, interaction protocols, tool and data authority, state and memory design, verification design, and human interaction design. We propose CEAD (Capability-Aligned Enterprise Agent Design), a reference architecture for intelligent agents that uses service-oriented architecture (SOA) as an exemplar for contracts, registries, loose coupling, and policy-aware integration, while explicitly rejecting the idea that services are agents. It treats microservices as a cautionary precedent: decomposition without design discipline produces distributed complexity, cost, operational fragility, and agent proliferation. We evaluate CEAD over 10,000 enterprise tasks, comparing five architectures: a prompt-first mono-agent, a role-based micro-agent swarm, SOA-brokered agents, a governance-first but design-poor agent grid, and the proposed CEAD architecture. CEAD achieves 70.6% safe success, versus 45.2% for the mono-agent baseline, 23.1% for the ungoverned micro-agent swarm, 58.8% for SOA-brokered agents, and 50.8% for the control-heavy, design-poor grid. The results support the conclusion that design quality is the first-order enterprise concern; governance, security, policy, audit, and assurance should support and enforce good design rather than substitute for it.

[MA-42] Scaling Mobile Agent Systems: From Capability Density to Collective Intelligence

【Quick Read】: This paper addresses the scalability bottleneck of mobile agent systems on edge devices and in AIoT ecosystems, rooted in limited on-device computation and intelligence fragmented across devices. The key is a unified research agenda that scales along two complementary dimensions: raising the capability density of individual agents through compact foundation-model design and compression, and enabling collective intelligence through communication-rich multi-agent collaboration, thereby turning isolated mobile agents into an efficient and scalable distributed intelligent system.

Link: https://arxiv.org/abs/2605.08124
Authors: Bowei He
Affiliations: Mohamed bin Zayed University of Artificial Intelligence
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
Comments: Accepted by ACM MobiSys 2026

Click to view abstract

Abstract:Mobile agent systems are emerging as a key paradigm for enabling intelligent applications on edge devices and in AIoT ecosystems. However, their scalability is fundamentally constrained by limited on-device computation and fragmented intelligence across devices. In this work, we propose a unified research agenda for scaling mobile agent systems along two complementary dimensions: (1) improving capability density of individual agents through compact foundation model design and compression, and (2) enabling collective intelligence via communication-rich multi-agent collaboration. Building on recent model and infrastructure advances, this vision aims to transform isolated mobile agents into a distributed intelligent system that is efficient and scalable.

[MA-43] Decentralized Contingency MPC based on Safe Sets for Nonlinear Multi-agent Collision Avoidance

【Quick Read】: This paper addresses decentralized collision avoidance for multi-agent systems without communication, in particular when agents do not share their planned trajectories; existing methods either rely on conservative coordination or offer only limited guarantees on recursive feasibility and convergence. The key is a decentralized contingency MPC framework in which each agent solves a local optimization problem that couples a nominal trajectory with a contingency certificate, ensuring a feasible backup maneuver under receding-horizon operation; a novel geometric, decentralized safe-set update mechanism prevents feasibility loss between consecutive time steps. The scheme guarantees recursive feasibility, including collision avoidance, and establishes a Lyapunov-type convergence result to an admissible safe equilibrium.

Link: https://arxiv.org/abs/2605.10738
Authors: Max Studt, Georg Schildbach
Affiliations: University of Luebeck
Subjects: Optimization and Control (math.OC); Multiagent Systems (cs.MA); Robotics (cs.RO); Systems and Control (eess.SY)
Comments:

Click to view abstract

Abstract:Decentralized collision avoidance remains challenging, particularly when agents do not communicate any information related to planned trajectories. Most existing approaches either rely on conservative coordination mechanisms or provide limited guarantees on recursive feasibility and convergence. This paper develops a decentralized contingency MPC framework for multi-agent systems with nonlinear dynamics that achieves collision-free motion under a state-only information pattern. Each agent follows the same consensual rule set, enabling safe decentralized planning without communication. Each agent solves a local optimization problem that couples a nominal trajectory with a contingency certificate ensuring a feasible backup maneuver under receding-horizon operation. A novel geometric and decentralized safe-set update mechanism prevents feasibility loss between consecutive time steps. The resulting scheme guarantees recursive feasibility, including collision avoidance, and establishes a Lyapunov-type convergence result to an admissible safe equilibrium. Simulation results demonstrate performance in both sparse and dense multi-agent environments, including cluttered bottleneck scenarios and under plug-and-play operation.

[MA-44] Conformity Generates Collective Misalignment in AI Agents Societies

【Quick Read】: This paper asks whether individual alignment is enough to guarantee the safety of AI systems at the population level: current safety work aligns individual language models with human values, yet deployed AI increasingly operates as interacting populations in which social influence can override individual alignment and drive collective misalignment. The key contribution is a simulation of opinion dynamics across nine large language models and one hundred opinion pairs, showing that each agent's behavior is governed by two competing forces, conformity to the majority and an intrinsic bias toward specific positions. Using tools from statistical physics, the authors derive a quantitative theory that predicts when populations become trapped in long-lived misaligned states and identify predictable tipping points at which a small number of adversarial agents can irreversibly shift population-level alignment, showing that individual alignment alone does not ensure collective safety and motivating evaluation frameworks that account for emergent behavior in AI populations.

Link: https://arxiv.org/abs/2605.10721
Authors: Giordano De Marzo, Alessandro Bellina, Claudio Castellano, Viola Priesemann, David Garcia
Affiliations: University of Konstanz; Centro Ricerche Enrico Fermi; Complexity Science Hub; Sony Computer Science Laboratories; Sapienza University of Rome; Istituto dei Sistemi Complessi (ISC-CNR); Max Planck Institute for Dynamics and Self-Organization; Institute for the Dynamics of Complex Systems
Subjects: Physics and Society (physics.soc-ph); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Artificial intelligence safety research focuses on aligning individual language models with human values, yet deployed AI systems increasingly operate as interacting populations where social influence may override individual alignment. Here we show that populations of individually aligned AI agents can be driven into stable misaligned states through conformity dynamics. Simulating opinion dynamics across nine large language models and one hundred opinion pairs, we find that each agent’s behavior is governed by two competing forces: a tendency to follow the majority and an intrinsic bias toward specific positions. Using tools from statistical physics, we derive a quantitative theory that predicts when populations become trapped in long-lived misaligned configurations, and identifies predictable tipping points where small numbers of adversarial agents can irreversibly shift population-level alignment even after manipulation ceases. These results demonstrate that individual-level alignment provides no guarantee of collective safety, calling for evaluation frameworks that account for emergent behavior in AI populations.
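
The two competing forces described above (conformity and intrinsic bias) can be illustrated with a tiny opinion-dynamics simulation. The logistic update rule and parameter values below are generic illustrative choices, not the paper's fitted model.

```python
# Toy simulation: each agent adopts the majority opinion with a probability set by a
# conformity strength J and an intrinsic bias h toward opinion +1 (illustrative only).
import math, random

def simulate(n_agents=100, steps=2000, J=1.0, h=0.3, seed=0):
    rng = random.Random(seed)
    s = [rng.choice([-1, 1]) for _ in range(n_agents)]
    for _ in range(steps):
        i = rng.randrange(n_agents)
        m = (sum(s) - s[i]) / (n_agents - 1)              # mean opinion of everyone else
        p_plus = 1.0 / (1.0 + math.exp(-2.0 * (J * m + h)))
        s[i] = 1 if rng.random() < p_plus else -1
    return sum(s) / n_agents                               # population-level alignment

print(simulate(h=0.3), simulate(h=-0.3))                   # bias direction flips the stable state
```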

[MA-45] Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics

【Quick Read】: This paper studies the collective dynamics of LLM-based multi-agent systems placed on a 2D square lattice, asking how to disentangle social conformity from intrinsic bias in collective alignment and whether genuine phase transitions occur. The key is a model-agnostic statistical-physics method: a global-flip protocol measures magnetization and susceptibility, finite-size scaling is applied, and effective couplings $\tilde{J}(T)$ and effective fields $\tilde{h}(T)$ are extracted to quantify social conformity and intrinsic bias, respectively. Across the models tested, collective alignment is dominated by a strong intrinsic bias ($\tilde{h} \gg \tilde{J}$), producing field-driven crossovers rather than genuine phase transitions; the effective parameters differ qualitatively across models, providing quantitative fingerprints of multi-agent collective behavior and a diagnostic for the reliability of multi-agent consensus.

Link: https://arxiv.org/abs/2605.10528
Authors: Cristiano De Nobili
Affiliations: Critiqality
Subjects: Statistical Mechanics (cond-mat.stat-mech); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Physics and Society (physics.soc-ph)
Comments: 10 pages, 7 figures

Click to view abstract

Abstract:We investigate the emergent collective dynamics of LLM-based multi-agent systems on a 2D square lattice and present a model-agnostic statistical-physics method to disentangle social conformity from intrinsic bias, compute critical exponents, and probe the collective behavior and possible phase transitions of multi-agent systems. In our framework, each node of an $L \times L$ lattice hosts an identical LLM agent holding a binary state ($+1/-1$, mapped to yes/no) and updating it by querying the model conditioned on the four nearest-neighbor states. The sampler temperature $T$ serves as the sole control parameter. Across three open-weight models (llama3.1:8b, phi4-mini:3.8b, mistral:7b), we measure magnetization and susceptibility under a global-flip protocol designed to probe $\mathbb{Z}_2$ symmetry. All models display temperature-driven order-disorder crossovers and susceptibility peaks; finite-size scaling on even-$L$ lattices yields effective exponents $\gamma/\nu$ whose values are model-dependent, close to but incompatible with the 2D Ising universality class ($\gamma/\nu = 7/4$). Our method enables the extraction of effective $\beta$-weighted couplings $\tilde{J}(T)$ and fields $\tilde{h}(T)$, which serve as a measure of social conformity and intrinsic bias. In the models we analyzed, we found that collective alignment is dominated by an intrinsic bias ($\tilde{h} \gg \tilde{J}$) rather than by cooperative neighbor coupling, producing field-driven crossovers instead of genuine phase transitions. These effective parameters vary qualitatively across models, providing compact collective-behavior fingerprints for LLM agents and a quantitative diagnostic for the reliability of multi-agent consensus and collective alignment.
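
One generic way to estimate an effective coupling and field of the kind described above is to regress each agent's sampled state on its local neighbor magnetization under a logistic (Glauber-like) link. This is a sketch consistent with the abstract's description, not the authors' exact extraction procedure.

```python
# Sketch: fit effective J and h from (neighbor magnetization, response) pairs by
# gradient descent on a logistic model P(state = +1) = sigmoid(2 * (J*m + h)).
import numpy as np

def fit_effective_params(neighbor_mag, responses, lr=0.1, iters=5000):
    """neighbor_mag: mean neighbor states in [-1, 1]; responses: observed +/-1 states."""
    J, h = 0.0, 0.0
    y = (np.asarray(responses) + 1) / 2                 # map {-1, +1} -> {0, 1}
    x = np.asarray(neighbor_mag, dtype=float)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-2.0 * (J * x + h)))    # predicted P(+1)
        grad_J = np.mean((p - y) * 2.0 * x)             # gradient of binary cross-entropy
        grad_h = np.mean((p - y) * 2.0)
        J, h = J - lr * grad_J, h - lr * grad_h
    return J, h
```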

[MA-46] Large Language Models over Networks: Collaborative Intelligence under Resource Constraints

【Quick Read】: This paper addresses the mismatch between heterogeneous resources and diverse service requirements when deploying large language models: no single device or cloud endpoint can simultaneously satisfy constraints on latency, throughput, data residency, and sustained high-volume inference. The key is a collaborative intelligence paradigm organized along two complementary and composable dimensions: vertical device-cloud collaborative inference and horizontal multi-agent collaboration, which can be combined into hybrid topologies. The article further examines learning to collaborate, covering the training of routing policies and the development of cooperative capabilities among LLMs, so that high-quality responses can be produced under heterogeneous constraints on computation, memory, communication, and cost.

Link: https://arxiv.org/abs/2605.08626
Authors: Liangqi Yuan, Wenzhi Fang, Shiqiang Wang, H. Vincent Poor, Christopher G. Brinton
Affiliations: Purdue University; University of Exeter; Princeton University
Subjects: Signal Processing (eess.SP); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) are transforming society, powering applications from smartphone assistants to autonomous driving. Yet cloud-based LLM services alone cannot serve a growing class of applications, including those operating under intermittent connectivity, sub-second latency budgets, data-residency constraints, or sustained high-volume inference. On-device deployment is in turn constrained by limited computation and memory. No single endpoint can deliver high-quality service across this spectrum. This article focuses on collaborative intelligence, a paradigm in which multiple independent LLMs distributed across device and cloud endpoints collaborate at the task level through natural language or structured messages. Such collaboration strives for superior response quality under heterogeneous resource constraints spanning computation, memory, communication, and cost across network tiers. We present collaborative inference along two complementary and composable dimensions: vertical device-cloud collaboration and horizontal multi-agent collaboration, which can be combined into hybrid topologies in practice. We then examine learning to collaborate, addressing the training of routing policies and the development of cooperative capabilities among LLMs. Finally, we identify open research challenges including scaling under resource heterogeneity and trustworthy collaborative intelligence.

Natural Language Processing

[NLP-0] ELF: Embedded Language Flows

【Quick Read】: This paper addresses the limits of today's diffusion language models (DLMs), most of which operate over discrete token spaces and therefore cannot directly benefit from the mature continuous-space diffusion techniques developed for images. The key is Embedded Language Flows (ELF), which keeps the diffusion process almost entirely within a continuous embedding space and maps to discrete tokens only at the final time step through a shared-weight network, making it straightforward to transfer established image-domain techniques such as classifier-free guidance. This design substantially improves generation quality and sampling efficiency, and experiments show ELF outperforming leading discrete and continuous DLMs.

Link: https://arxiv.org/abs/2605.10938
Authors: Keya Hu, Linlu Qiu, Yiyang Lu, Hanhong Zhao, Tianhong Li, Yoon Kim, Jacob Andreas, Kaiming He
Affiliations: MIT
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Tech Report. Project webpage: this https URL

Click to view abstract

Abstract:Diffusion and flow-based models have become the de facto approaches for generating continuous data, e.g., in domains such as images and videos. Their success has attracted growing interest in applying them to language modeling. Unlike their image-domain counterparts, today’s leading diffusion language models (DLMs) primarily operate over discrete tokens. In this paper, we show that continuous DLMs can be made effective with minimal adaptation to the discrete domain. We propose Embedded Language Flows (ELF), a class of diffusion models in continuous embedding space based on continuous-time Flow Matching. Unlike existing DLMs, ELF predominantly stays within the continuous embedding space until the final time step, where it maps to discrete tokens using a shared-weight network. This formulation makes it straightforward to adapt established techniques from image-domain diffusion models, e.g., classifier-free guidance (CFG). Experiments show that ELF substantially outperforms leading discrete and continuous DLMs, achieving better generation quality with fewer sampling steps. These results suggest that ELF offers a promising path toward effective continuous DLMs.
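
For readers unfamiliar with flow matching, the following minimal training step shows what "diffusion in continuous embedding space" looks like mechanically: noise and clean token embeddings are interpolated, and a small network regresses the straight-line velocity. The tiny architecture and schedule are placeholders, not the ELF model.

```python
# Minimal continuous-time flow-matching step over token embeddings (illustrative only).
import torch, torch.nn as nn

vocab, dim = 1000, 64
embed = nn.Embedding(vocab, dim)
velocity_net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

def fm_loss(token_ids):                           # token_ids: (batch, seq)
    x1 = embed(token_ids)                         # clean embeddings (target endpoint)
    x0 = torch.randn_like(x1)                     # noise sample (starting endpoint)
    t = torch.rand(x1.shape[0], 1, 1)             # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                    # linear interpolation path
    target_v = x1 - x0                            # straight-line velocity field
    inp = torch.cat([xt, t.expand(-1, xt.shape[1], 1)], dim=-1)
    return ((velocity_net(inp) - target_v) ** 2).mean()

loss = fm_loss(torch.randint(0, vocab, (8, 16)))
loss.backward()
```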

[NLP-1] DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

【Quick Read】: This paper addresses the storage and memory-access bottlenecks that block end-side deployment of Mixture-of-Experts (MoE) models: sparse activation scales capacity without proportional compute, but the massive total parameter count still causes high storage overhead and inefficient memory access, while end-side applications demand high performance, low compute, and a small footprint. The key is the DECO architecture: a differentiable, flexible ReLU-based routing scheme with learnable expert-wise scaling that adaptively balances routed and shared experts, plus NormSiLU, an activation that normalizes inputs before SiLU to stabilize the routed-expert activation ratio and raise intrinsic sparsity. Experiments also show an advantage for non-gated MLP experts under ReLU routing, suggesting the MoE architecture can be simplified further. DECO matches dense Transformers while activating only 20% of experts and achieves a 3.00x inference speedup on real hardware.

Link: https://arxiv.org/abs/2605.10933
Authors: Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao, Yingfa Chen, Zhiyuan Liu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 14 pages, 11 figures, 11 tables

Click to view abstract

Abstract:While Mixture-of-Experts (MoE) scales model capacity without proportionally increasing computation, its massive total parameter footprint creates significant storage and memory-access bottlenecks, which hinder efficient end-side deployment that simultaneously requires high performance, low computational cost, and small storage overhead. To achieve these properties, we present DECO, a sparse MoE architecture designed to match the performance of dense Transformers under identical total parameter budgets and training tokens. DECO utilizes the differentiable and flexible ReLU-based routing enhanced by learnable expert-wise scaling, which adaptively balances the contributions of routed and shared experts. Furthermore, we introduce NormSiLU, an activation function that normalizes inputs prior to SiLU operators, producing a more stable trend of routed-expert activation ratio and a higher intrinsic sparsity level. We also identify an empirical advantage in using non-gated MLP experts with ReLU-based routing, indicating the possibility of MoE architecture simplification. Experiments demonstrate that DECO, activating only 20% of experts, matches dense performance and outperforms established MoE baselines. Our specialized acceleration kernel delivers a 3.00 \times speedup on real hardware compared with dense inference. Codes and checkpoints will be released.
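
The two ingredients named in the abstract can be sketched in a few lines: a NormSiLU-style activation that normalizes its input before SiLU, and a ReLU router with learnable expert-wise scaling whose zero outputs correspond to skipped experts. The exact normalization and where each module sits inside DECO are assumptions based only on the description above.

```python
# Illustrative sketches of a NormSiLU activation and ReLU-based expert routing.
import torch, torch.nn as nn, torch.nn.functional as F

class NormSiLU(nn.Module):
    """Normalize the input over its last dimension, then apply SiLU (assumed RMS-style norm)."""
    def forward(self, x):
        x = x / (x.pow(2).mean(-1, keepdim=True).sqrt() + 1e-6)
        return F.silu(x)

class ReLURouter(nn.Module):
    """ReLU routing with learnable expert-wise scaling; a zero weight means the expert is skipped."""
    def __init__(self, hidden, n_experts):
        super().__init__()
        self.gate = nn.Linear(hidden, n_experts, bias=False)
        self.scale = nn.Parameter(torch.ones(n_experts))
    def forward(self, h):                            # h: (tokens, hidden)
        return F.relu(self.gate(h)) * self.scale     # (tokens, n_experts), sparse by construction

act = NormSiLU()(torch.randn(4, 32))                 # activation demo
weights = ReLURouter(hidden=32, n_experts=8)(torch.randn(4, 32))
print("activated expert fraction:", (weights > 0).float().mean().item())
```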

[NLP-2] Dynamic Skill Lifecycle Management for Agent ic Reinforcement Learning

【Quick Read】: This paper addresses the performance limits imposed by static management of external skills when LLM agents tackle complex tasks. Existing methods assume skills either accumulate as persistent guidance or are internalized into the policy, eventually trending toward zero-skill inference, an assumption that ignores the fact that a skill's value is non-monotonic and depends on the task and training stage. The key is SLIM, a dynamic skill lifecycle management framework for agentic reinforcement learning that treats the active skill set as a variable jointly optimized with policy learning. SLIM quantifies each skill's marginal external contribution via leave-one-skill-out validation and applies three operations: retaining high-value skills, retiring skills with negligible contribution, and expanding the skill bank when persistent failures reveal missing capabilities, so that policy learning and external skill use co-evolve rather than exclude each other.

Link: https://arxiv.org/abs/2605.10923
Authors: Junhao Shen, Teng Zhang, Xiaoyan Zhao, Hong Cheng
Affiliations: The Chinese University of Hong Kong; University of Florida
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Implementation code is available at this https URL

Click to view abstract

Abstract:Large language model agents increasingly rely on external skills to solve complex tasks, where skills act as modular units that extend their capabilities beyond what parametric memory alone supports. Existing methods assume external skills either accumulate as persistent guidance or internalized into the policy, eventually leading to zero-skill inference. We argue this assumption is overly restrictive, since with limited parametric capacity and uneven marginal contribution across skills, the optimal active skill set is non-monotonic, task- and stage-dependent. In this work, we propose SLIM, a framework of dynamic Skill LIfecycle Management for agentic reinforcement learning (RL), which treats the active external skill set as a dynamic optimization variable jointly updated with policy learning. Specifically, SLIM estimates each active skill’s marginal external contribution through leave-one-skill-out validation, then applies three lifecycle operations: retaining high-value skills, retiring skills whose contribution becomes negligible after sufficient exposure, and expanding the skill bank when persistent failures reveal missing capability coverage. Experiments show that SLIM outperforms the best baselines by an average of 7.1% points across ALFWorld and SearchQA. Results further indicate that policy learning and external skill retention are not mutually exclusive: some skills are absorbed into the policy, while others continue to provide external value, supporting SLIM as a more general paradigm for skill-based agentic RL.
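
A compact sketch of the lifecycle loop described above: estimate each active skill's marginal contribution by leave-one-skill-out validation, retire negligible ones, and expand the bank on persistent failure. The thresholds and the evaluation callable are assumptions.

```python
# Illustrative lifecycle update: retain, retire, or expand the active skill set.
from typing import Callable, Set

def update_skill_set(
    active: Set[str],
    evaluate: Callable[[Set[str]], float],      # validation success rate with a given skill set
    retire_threshold: float = 0.005,            # assumed minimum marginal contribution
    failure_floor: float = 0.3,                 # assumed trigger for expanding the skill bank
    propose_new_skill: Callable[[], str] = lambda: "new_skill",   # placeholder skill generator
) -> Set[str]:
    base = evaluate(active)
    # leave-one-skill-out: keep a skill only if removing it measurably hurts performance
    kept = {s for s in active if base - evaluate(active - {s}) > retire_threshold}
    if base < failure_floor:                    # persistent failures hint at missing capabilities
        kept.add(propose_new_skill())
    return kept
```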

[NLP-3] WildClawBench: A Benchmark for Real-World Long-Horizon Agent Evaluation

【Quick Read】: This paper addresses the lack of evaluation of LLM agents' ability to complete long-horizon, multi-step work in the real runtimes where they are deployed: existing benchmarks mostly rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks. The key is WildClawBench, a collection of 60 human-authored, bilingual, multimodal native-runtime tasks, each averaging about 8 minutes of wall-clock time and more than 20 tool calls, executed inside real CLI agent harnesses (OpenClaw, Claude Code, Codex, Hermes Agent) with real tools rather than mock services. Grading is hybrid, combining rule-based checks, environment-state auditing, and LLM/VLM semantic verification, and the results expose the limits of current frontier models: the best model (Claude Opus 4.7) reaches only 62.2% under OpenClaw, and merely switching the harness shifts a single model by up to 18 points, underscoring how demanding native-runtime evaluation remains.

Link: https://arxiv.org/abs/2605.10912
Authors: Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu, Yang JingYi, Penghui Yang, Zhixiong Zhang, Xilin Wei, Xinyu Fang, Yubo Ma, Haodong Duan, Jing Shao, Jiaqi Wang, Dahua Lin, Kai Chen, Yuhang Zang
Affiliations: Shanghai AI Laboratory; The Chinese University of Hong Kong; Fudan University; University of Science and Technology of China; Shanghai Jiao Tong University; Tsinghua University; Shanghai Innovation Institute; Zhejiang University; Nanyang Technological University
Subjects: Computation and Language (cs.CL)
Comments: Github link: this https URL

Click to view abstract

Abstract:Large language and vision-language models increasingly power agents that act on a user’s behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed. This work presents WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. Each task averages roughly 8 minutes of wall-clock time and over 20 tool calls, and runs inside a reproducible Docker container hosting an actual CLI agent harness (OpenClaw, Claude Code, Codex, or Hermes Agent) with access to real tools rather than mock services. Grading is hybrid, combining deterministic rule-based checks, environment-state auditing of side effects, and an LLM/VLM judge for semantic verification. Across 19 frontier models, the best, Claude Opus 4.7, reaches only 62.2% overall under OpenClaw, while every other model stays below 60%, and switching harness alone shifts a single model by up to 18 points. These results show that long-horizon, native-runtime agent evaluation remains a far-from-resolved task for current frontier models. We release the tasks, code, and containerized tooling to support reproducible evaluation.

[NLP-4] RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

【Quick Read】: This paper addresses the difficulty of training deep research agents with reinforcement learning: their outputs lack verifiable reward signals, their trajectories span many tool-augmented decisions, and standard post-training offers no mechanism for turning past attempts into reusable knowledge. The key is RubricEM, which uses rubrics as the shared interface across policy execution, judge feedback, and agent memory, combining stagewise policy decomposition with reflection-based meta-policy evolution: the agent conditions planning, evidence gathering, review, and synthesis on self-generated rubrics, receives denser semantic feedback through Stage-Structured GRPO, and trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable, rubric-grounded guidance for future attempts. The resulting RubricEM-8B performs strongly on four long-form research benchmarks, outperforming comparable open models and approaching proprietary deep-research systems.

Link: https://arxiv.org/abs/2605.10899
Authors: Gaotang Li, Bhavana Dalvi Mishra, Zifeng Wang, Jun Yan, Yanfei Chen, Chun-Liang Li, Long T. Le, Rujun Han, George Lee, Hanghang Tong, Chen-Yu Lee, Tomas Pfister
Affiliations: University of Illinois Urbana-Champaign; Google Cloud AI Research
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 63 pages, 6 figures

Click to view abstract

Abstract:Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interface that structures policy execution, judge feedback, and agent memory. Based on this view, we introduce RubricEM, a rubric-guided reinforcement learning framework that combines stagewise policy decomposition with reflection-based meta-policy evolution. RubricEM first makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics. It then assigns credit with Stage-Structured GRPO, which uses stagewise rubric judgments to provide denser semantic feedback for long-horizon optimization. In parallel, RubricEM trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable rubric-grounded guidance for future attempts. The resulting RubricEM-8B achieves strong performance across four long-form research benchmarks, outperforming comparable open models and approaching proprietary deep-research systems. Beyond final performance, we perform thorough analyses to understand the key ingredients of RubricEM.

[NLP-5] Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking

【Quick Read】: This paper addresses visual ungroundedness in large vision-language models (LVLMs): a model may produce fluent, plausible, even correct answers purely from language priors, with the image contributing nothing to the prediction. Existing confidence estimation cannot detect this, because under normal inference it has no way to tell whether a prediction was driven by the image. The key is BICR (Blind-Image Contrastive Ranking), a model-agnostic confidence estimation framework that, during training, explicitly contrasts the frozen LVLM's hidden states under the real image versus a blacked-out image; a lightweight probe, regularized by a ranking loss that penalizes higher confidence on the blind view, learns to treat visual grounding as a signal of reliability, improving both calibration and discrimination at zero additional inference cost.

Link: https://arxiv.org/abs/2605.10893
Authors: Reza Khanmohammadi, Erfan Miahi, Simerjot Kaur, Charese H. Smiley, Ivan Brugere, Kundan Thind, Mohammad M. Ghassemi
Affiliations: Michigan State University; JPMorgan AI Research
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Large vision-language models suffer from visual ungroundedness: they can produce a fluent, confident, and even correct response driven entirely by language priors, with the image contributing nothing to the prediction. Existing confidence estimation methods cannot detect this, as they observe model behavior under normal inference with no mechanism to determine whether a prediction was shaped by the image or by text alone. We introduce BICR (Blind-Image Contrastive Ranking), a model-agnostic confidence estimation framework that makes this contrast explicit during training by extracting hidden states from a frozen LVLM twice: once with the real image-question pair, and once with the image blacked out while the question is held fixed. A lightweight probe is trained on the real-image hidden state and regularized by a ranking loss that penalizes higher confidence on the blacked-out view, teaching it to treat visual grounding as a signal of reliability at zero additional inference cost. Evaluated across five modern LVLMs and seven baselines on a benchmark covering visual question answering, object hallucination detection, medical imaging, and financial document understanding, BICR achieves the best cross-LVLM average on both calibration and discrimination simultaneously, with statistically significant discrimination gains robust to cluster-aware analysis at 4-18x fewer parameters than the strongest probing baseline.
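
The training signal can be pictured as a small probe with two terms: a calibration loss on the real-image hidden state and a margin ranking loss that penalizes the blacked-out view looking at least as confident. The probe architecture, margin, and loss weighting below are assumptions, not the paper's exact configuration.

```python
# Illustrative BICR-style objective: calibration on the real-image view plus a
# ranking term between real-image and blind-image confidences.
import torch, torch.nn as nn, torch.nn.functional as F

probe = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 1))

def bicr_loss(h_real, h_blind, correct, margin=0.2, lam=1.0):
    """h_real/h_blind: frozen-LVLM hidden states (batch, 512); correct: 0/1 answer labels."""
    c_real = probe(h_real).squeeze(-1)
    c_blind = probe(h_blind).squeeze(-1)
    calib = F.binary_cross_entropy_with_logits(c_real, correct.float())
    # ranking term: real-image confidence should exceed blind-image confidence by a margin
    rank = F.relu(margin - (c_real - c_blind)).mean()
    return calib + lam * rank

loss = bicr_loss(torch.randn(8, 512), torch.randn(8, 512), torch.randint(0, 2, (8,)))
loss.backward()
```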

[NLP-6] Compute Where it Counts: Self-Optimizing Language Models ICML'26

【Quick Read】: This paper addresses the uneven allocation of compute during LLM decoding: static compression (quantization, pruning, or sparse attention) gives every generated token the same budget, even though token difficulty varies widely, so easy steps are over-computed and hard steps under-computed. The key is Self-Optimizing Language Models (SOL), which pair a frozen LLM with a lightweight policy network that reads the LLM's hidden state and selects a discrete efficiency action at each decoding step, jointly controlling (i) token-level attention sparsity, (ii) structured activation pruning in the MLP, and (iii) activation quantization bit-width, while leaving the base model weights unchanged. The policy is trained with group-relative policy optimization on teacher-forced episodes: the token sequence is fixed while multiple "counterfactual" compute schedules are sampled and their likelihoods compared, with a reward that trades language-model quality against deviation from a target average budget. SOL yields a better quality-efficiency Pareto front in all tested settings and improves MMLU accuracy by up to 7.3% over uniform budget allocation.

Link: https://arxiv.org/abs/2605.10875
Authors: Yash Akhauri, Mohamed S. Abdelfattah
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted at ICML'26. Code: this https URL

Click to view abstract

Abstract:Efficient LLM inference research has largely focused on reducing the cost of each decoding step (e.g., using quantization, pruning, or sparse attention), typically applying a uniform computation budget to every generated token. In practice, token difficulty varies widely, so static compression can over-compute on easy steps and under-compute on hard ones. We study dynamic budget allocation for autoregressive decoding: learning how much computation to spend per token from within a single model. Self-Optimizing Language Models (SOL) pair a frozen LLM with a lightweight policy network that reads the LLM hidden state and selects a discrete efficiency action at each decode step. Actions can jointly control (i) token-level attention sparsity, (ii) structured activation pruning in the MLP, and (iii) activation quantization bit-width, while leaving the base model weights unchanged. We train the policy with group-relative policy optimization on teacher-forced episodes: the token sequence is fixed, while we sample multiple compute schedules (i.e., “counterfactual” schedules that vary only the efficiency actions for the same token path) and compare their likelihoods under the same supervision. Our reward trades off language-model quality against soft penalties that encourage episode-average budget usage to match a requested target. Across model variants and compute regimes, SOL improves quality at matched budget over static allocation and strong random schedule search, offering a complementary axis for inference-efficiency optimization. SOL discovers a better quality-efficiency pareto-front across all our experiments and improves MMLU accuracy by up to 7.3% over uniform budget allocation strategies.
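
The sketch below illustrates the general shape of the approach: a small policy head reads per-step hidden states of a frozen LLM, samples several compute schedules for the same token path, and is updated with a group-relative advantage that trades a quality proxy against a budget penalty. The action set, reward shape, and all tensors are illustrative stand-ins, not the paper's implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ACTIONS = ["full", "sparse_attn", "prune_mlp", "int4_act"]  # assumed discrete action set

class EfficiencyPolicy(nn.Module):
    """Lightweight head selecting an efficiency action from the LLM hidden state."""
    def __init__(self, hidden_dim: int, n_actions: int):
        super().__init__()
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, h):
        return torch.distributions.Categorical(logits=self.head(h))

hidden_dim, seq_len, n_schedules = 64, 16, 4
policy = EfficiencyPolicy(hidden_dim, len(ACTIONS))
h = torch.randn(seq_len, hidden_dim)  # teacher-forced hidden states for one sequence

# Sample several "counterfactual" compute schedules for the same token path.
log_probs, rewards = [], []
for _ in range(n_schedules):
    dist = policy(h)
    actions = dist.sample()                        # one action per decode step
    lp = dist.log_prob(actions).sum()
    # Toy reward: quality proxy minus a penalty for deviating from a target budget.
    quality = torch.randn(())                      # stand-in for sequence log-likelihood
    budget_used = (actions == 0).float().mean()    # fraction of full-compute steps
    reward = quality - 2.0 * (budget_used - 0.5).abs()  # target ~50% full compute (assumed)
    log_probs.append(lp)
    rewards.append(reward)

rewards = torch.stack(rewards)
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)  # group-relative baseline
loss = -(torch.stack(log_probs) * advantages.detach()).mean()
loss.backward()
print("policy loss:", loss.item())
```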

[NLP-7] DGPO: Beyond Pairwise Preferences with Directional Consistent Groupwise Optimization

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在偏好优化过程中难以同时保证推理方向一致性(directional consistency)与推理多样性(reasoning diversity)的问题。其解决方案的关键在于提出一种轻量级框架——方向分组偏好优化(Directional-Groupwise Preference Optimization, DGPO),该框架通过将正向与反向问题-答案实例组织为结构化集合,并基于多候选比较构建边际似然目标,显式建模方向感知对齐机制,从而在群体层面聚合监督信号,增强不同推理路径间的一致性,同时保留多样性。

链接: https://arxiv.org/abs/2605.10863
作者: Mengyi Deng,Zhiwei Li,Xin Li,Tingyu Zhu,Yulan Yuan,Zhijiang Guo,Wei Wang
机构: The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); The Hong Kong University of Science and Technology, Hong Kong SAR(香港科技大学, 香港特别行政区)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Although Large Language Models (LLMs) have made remarkable progress, current preference optimization methods still struggle to align directional consistency while preserving reasoning diversity. To address this limitation, we propose Directional-Groupwise Preference Optimization (DGPO), a lightweight framework that aggregates supervision signals at the group level and explicitly models direction-aware alignment through multi-candidate comparisons. DGPO organizes forward and reverse question-answer instances into structured sets and optimizes a margin-based likelihood objective that separates coherent reasoning paths from inconsistent alternatives. This group-wise formulation captures richer relative information than pairwise objectives and reinforces consistency across diverse reasoning pathways. Empirical results show that our constructed reverse data yields a 3.2% average improvement across five benchmarks, while DGPO further delivers consistent gains across multiple datasets and model families, achieving average accuracy improvements of up to 3.6%.
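
A minimal sketch of a group-wise margin objective in this spirit is shown below: within one group, sequence log-likelihoods of directionally consistent candidates are pushed above those of inconsistent alternatives across all pairs, rather than in a single pairwise comparison. The margin value and the random log-likelihoods are assumptions for illustration.

```python
import torch

torch.manual_seed(0)

def groupwise_margin_loss(logp_consistent, logp_inconsistent, margin=1.0):
    """Hinge loss over all (consistent, inconsistent) pairs inside one group.

    logp_consistent:   (C,) sequence log-likelihoods of coherent reasoning paths
    logp_inconsistent: (I,) sequence log-likelihoods of inconsistent alternatives
    """
    # Broadcast to all C x I pairs; penalize whenever an inconsistent candidate
    # comes within `margin` of a consistent one.
    diff = logp_consistent.unsqueeze(1) - logp_inconsistent.unsqueeze(0)
    return torch.clamp(margin - diff, min=0).mean()

# Stand-ins for per-candidate log-likelihoods under the policy model,
# e.g. forward and reverse question-answer instances scored by the LM.
logp_pos = torch.randn(3, requires_grad=True)   # consistent candidates
logp_neg = torch.randn(5, requires_grad=True)   # inconsistent alternatives

loss = groupwise_margin_loss(logp_pos, logp_neg, margin=1.0)
loss.backward()
print("group-wise margin loss:", loss.item())
```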

[NLP-8] RUBEN: Rule-Based Explanations for Retrieval-Augmented LLM Systems ICDE2026

【速读】: 该论文旨在解决如何从检索增强型大语言模型(Retrieval-Augmented Large Language Models, RAG-LMs)的输出中提取最小且可解释的规则集,以提升模型决策过程的透明度与可控性。其核心挑战在于从复杂的推理路径中识别出能够覆盖所有输出的最简规则集合,同时确保这些规则在实际应用中具备可操作性和安全性验证价值。解决方案的关键在于提出一种新颖的剪枝策略(pruning strategies),通过高效搜索和筛选机制,自动发现一组最小化、完备性保障的规则子集,从而为模型行为提供可解释的逻辑依据,并进一步用于测试安全训练的有效性及对抗性提示注入的鲁棒性。

链接: https://arxiv.org/abs/2605.10862
作者: Joel Rorseth,Parke Godfrey,Lukasz Golab,Divesh Srivastava,Jarek Szlichta
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by ICDE 2026 (Demonstration Track)

点击查看摘要

Abstract:This paper demonstrates RUBEN, an interactive tool for discovering minimal rules to explain the outputs of retrieval-augmented large language models (LLMs) in data-driven applications. We leverage novel pruning strategies to efficiently identify a minimal set of rules that subsume all others. We further demonstrate novel applications of these rules for LLM safety, specifically to test the resiliency of safety training and effectiveness of adversarial prompt injections.

[NLP-9] Learning More from Less: Exploiting Counterfactuals for Data-Efficient Chart Understanding ACL2026

【速读】: 该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在图表理解任务中因依赖大规模监督微调(Supervised Fine-Tuning, SFT)数据而效率低下,且难以学习图表的反事实敏感性(counterfactual sensitivity)的问题。图表作为程序生成的视觉对象,其细微的代码控制变化可能引发语义和正确答案的显著差异,但标准SFT方法独立处理训练样本,缺乏对这种细粒度视觉区分能力的有效监督。解决方案的关键在于提出ChartCF框架,其核心包括:(1) 通过代码修改实现反事实数据合成,生成具有可控视觉变化的图表样本;(2) 基于图表相似性的数据选择策略过滤过于困难的样本以提升训练效率;(3) 在文本与视觉模态上进行多模态偏好优化,从而增强模型对细微视觉差异的判别能力。实验表明,该方法在五个基准测试中性能优于或相当主流图表专用VLMs,同时显著减少训练数据需求。

链接: https://arxiv.org/abs/2605.10855
作者: Jianzhu Bao,Haozhen Zhang,Kuicai Dong,Bozhi Wu,Sarthak Ketanbhai Modi,Zi Pong Lim,Yon Shin Teo,Wenya Wang
机构: Nanyang Technological University (南洋理工大学); Aumovio Singapore Pte. Ltd. (Aumovio新加坡私人有限公司)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026 Main Conference

点击查看摘要

Abstract:Vision-Language Models (VLMs) have demonstrated remarkable progress in chart understanding, largely driven by supervised fine-tuning (SFT) on increasingly large synthetic datasets. However, scaling SFT data alone is inefficient and overlooks a key property of charts: charts are programmatically generated visual artifacts, where small, code-controlled visual changes can induce drastic shifts in semantics and correct answers. Learning this counterfactual sensitivity requires VLMs to discriminate fine-grained visual differences, yet standard SFT treats training instances independently and provides limited supervision to enforce this behavior. To address this, we introduce ChartCF, a data-efficient training framework designed to enhance counterfactual sensitivity. ChartCF consists of: (1) a counterfactual data synthesis pipeline via code modification, (2) a chart similarity-based data selection strategy that filters overly difficult samples for improved training efficiency, and (3) multimodal preference optimization across both textual and visual modalities. Experiments on five benchmarks show that ChartCF achieves superior or comparable performance to strong chart-specific VLMs while using significantly less training data.

[NLP-10] Grounded Satirical Generation with RAG

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在讽刺(satire)生成任务中的挑战,尤其是如何基于现实语境生成具有政治相关性且符合文化背景的讽刺性词典释义。其核心问题在于:当前大型语言模型(LLMs)在处理主观性强、依赖上下文的幽默类型(如讽刺)时表现不佳,难以平衡政治相关性和幽默感。解决方案的关键在于提出一种基于检索增强生成(Retrieval-Augmented Generation, RAG)的新型流水线架构,利用实时新闻数据作为外部知识源来增强生成内容的语境真实性,并结合专门设计的评估框架对生成结果进行多维度分析(包括文化背景、词类属性及是否启用RAG)。实验表明,RAG和基于主题的词汇选择能显著提升输出的政治相关性,但未能明显改善幽默效果,同时揭示了LLM作为评判者在政治相关性上与人类判断高度一致,但在幽默感知方面存在显著偏差。

链接: https://arxiv.org/abs/2605.10853
作者: Oona Itkonen,Yuxin Su,Linyao Du,Ona De Gibert
机构: University of Helsinki (赫尔辛基大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Humor generation remains a challenging task for Large Language Models (LLMs) due to its subjective nature. We focus on satire, a form of humor strongly shaped by context. In this work, we present a novel pipeline for grounded satire generation that uses Retrieval-Augmented Generation (RAG) over current news to produce satirical dictionary definitions in the Finnish context. We also introduce a new task-specific evaluation framework and annotate 100 generated definitions with six human annotators, enabling analysis across multiple experimental conditions, including cultural background, source-word type, and the presence or absence of RAG. Our results show that the generated definitions are perceived as more political than humorous. Both topic-based word selection and RAG improve the political relevance of the outputs, but neither yields clear gains in humor generation. In addition, our LLM-as-a-judge evaluation of five state-of-the-art models indicates that LLMs correlate well with human judgments on political relevance, but perform poorly on humor. We release our code and annotated dataset to support further research on grounded satire generation and evaluation.

[NLP-11] The Generalized Turing Test: A Foundation for Comparing Intelligence

【速读】: 该论文旨在解决如何在不依赖固定数据集或基准测试的情况下,对任意智能体(agents)的能力进行公平且可比较的评估问题。传统评估方法受限于特定任务和数据分布,难以全面反映智能体的通用能力。解决方案的关键在于提出广义图灵测试(Generalized Turing Test, GTT),其核心思想是通过“不可区分性”定义智能体间的相对能力:若智能体B作为判别器无法可靠区分A(被指示模仿B)与另一个B的交互,则认为A ≥ B。这一框架具有任务和数据无关性,能够诱导出等价类上的排序结构,并通过实证实验验证了其在现代模型间形成的分层排序结果,表明该方法能提供有意义的能力比较,为智能体的评估乃至训练目标设计提供了新的理论基础。

链接: https://arxiv.org/abs/2605.10851
作者: Daniel Mitropolsky,Susan S. Hong,Riccardo Neumarker,Emanuele Rimoldi,Tomaso Poggio
机构: MIT (麻省理工学院); ETH Zurich (苏黎世联邦理工学院); EPFL (洛桑联邦理工学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce the Generalized Turing Test (GTT), a formal framework for comparing the capabilities of arbitrary agents via indistinguishability. For agents A and B, we define the Turing comparator A \geq B to hold if B, acting as a distinguisher, cannot reliably distinguish between interactions with A (instructed to imitate B) and another instance of B. This yields a dataset- and task-agnostic notion of relative intelligence. We study the comparator’s structure, including conditions under which it is transitive and therefore induces an ordering over equivalence classes, and we define and analyze variants with querying, bounded interaction, and fixed distinguishers. To complement the theory, we instantiate the framework on a collection of modern models, empirically evaluating pairwise indistinguishability across thousands of trials. The resulting comparisons exhibit a stratified structure consistent with existing rankings, hinting that the proposed framework yields meaningful empirical orderings. Our results position indistinguishability as a unifying lens for reasoning about intelligence, suggesting a foundation for evaluation and, potentially, training objectives that are inherently independent of fixed datasets or benchmarks.
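
The empirical side of the comparator can be sketched as a simple hypothesis test: over many trials, the distinguisher guesses whether it interacted with A imitating B or with another instance of B, and A ≥ B cannot be rejected if that accuracy is not reliably above chance. The trial stub and the exact decision rule below are illustrative assumptions.

```python
import random
from math import comb

random.seed(0)

def binomial_p_value(successes: int, n: int, p: float = 0.5) -> float:
    """One-sided exact binomial test: P(X >= successes) under chance accuracy p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(successes, n + 1))

def run_gtt(distinguish, n_trials: int = 1000, alpha: float = 0.05) -> bool:
    """Return True if 'A >= B' cannot be rejected, i.e. the distinguisher fails
    to beat chance. `distinguish(is_imitation)` is a stand-in for one interactive
    trial in which B guesses whether it talked to A-imitating-B."""
    correct = 0
    for _ in range(n_trials):
        is_imitation = random.random() < 0.5
        guess = distinguish(is_imitation)
        correct += int(guess == is_imitation)
    p_val = binomial_p_value(correct, n_trials)
    print(f"accuracy={correct / n_trials:.3f}  p={p_val:.4f}")
    return p_val >= alpha  # cannot reject indistinguishability

# Toy distinguisher that is only marginally better than chance.
weak_judge = lambda is_imitation: is_imitation if random.random() < 0.52 else not is_imitation
print("A >= B holds (indistinguishable):", run_gtt(weak_judge))
```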

[NLP-12] BabelDOC: Better Layout-Preserving PDF Translation via Intermediate Representation ACL2026

【速读】: 该论文旨在解决视觉丰富文档(如PDF)在跨语言翻译过程中存在的布局保真难题,即现有翻译流程在语言处理与版式保留之间存在矛盾:基于文本的计算机辅助翻译(CAT)系统常丢失结构元数据,而文档解析工具虽能提取信息却无法实现翻译后的精准重渲染。其解决方案的关键在于提出一种基于中间表示(Intermediate Representation, IR)的框架BabelDOC,通过将视觉布局元数据与语义内容解耦,支持文档级翻译操作(如术语提取、跨页上下文处理、术语约束生成和公式占位),并借助自适应排版引擎将译文重新锚定至原始布局,从而在保持翻译精度的同时显著提升版式保真度、视觉美感和术语一致性。

链接: https://arxiv.org/abs/2605.10845
作者: Qi Yang,Xiangyao Ma,Xiao Wang,Hao Wang,Rui Wang
机构: Shanghai University(上海大学); Funstory.ai Limited(趣故事人工智能有限公司); Shanghai Jiao Tong University(上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: ACL 2026 System Demonstration paper. 2 figures

点击查看摘要

Abstract:As global cross-lingual communication intensifies, language barriers in visually rich documents such as PDFs remain a practical bottleneck. Existing document translation pipelines face a tension between linguistic processing and layout preservation: text-oriented Computer-Assisted Translation (CAT) systems often discard structural metadata, while document parsers focus on extraction and do not support faithful re-rendering after translation. We introduce BabelDOC, an Intermediate Representation (IR)-based framework for layout-preserving PDF translation. BabelDOC decouples visual layout metadata from semantic content, enabling document-level translation operations such as terminology extraction, cross-page context handling, glossary-constrained generation, and formula placeholdering. The translated content is then re-anchored to the original layout through an adaptive typesetting engine. Experiments on a curated 200-page benchmark, together with human evaluation and multimodal LLM-as-a-judge evaluation, show that BabelDOC improves layout fidelity, visual aesthetics, and terminology consistency over representative baselines, while maintaining competitive translation precision. The open-source toolkit and its interactive downstream applications are publicly available and have attracted over 8.4K GitHub stars and 17 contributors at the time of writing. A demonstration video is also available.

[NLP-13] Training-Free Cultural Alignment of Large Language Models via Persona Disagreement

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在跨文化道德判断场景中存在文化偏见的问题,即现有模型的隐含偏好并非文化中立,而传统文化对齐方法要么依赖各国专属偏好数据与微调预算,要么假设可访问模型内部结构(白盒),这在商业API应用中不可行。解决方案的关键在于提出一种推理时校准方法DISCA(Disagreement-Informed Steering for Cultural Alignment),其核心思想是利用同一国家内社会人口学群体间的分歧(而非共识)作为主要引导信号,将世界价值观调查(World Values Survey)驱动的多角色代理(persona agents)之间的意见差异转化为有界、损失厌恶的logit修正,从而在不修改任何模型权重的前提下显著降低文化偏差,在20个国家和7种开源骨干模型(2B–70B参数规模)上实现文化错位减少10–24%(针对小模型)和2–7%(开放场景)。

链接: https://arxiv.org/abs/2605.10843
作者: Huynh Trung Kiet,Dao Sy Duy Minh,Tuan Nguyen,Chi-Nguyen Tran,Phu-Hoa Pham,Nguyen Lam Phu Quy, TheAnh Han,Long Tran-Thanh
机构: Faculty of Information and Technology, University of Science, Vietnam National University, Ho Chi Minh City, Vietnam; Department of Computer Science, University of Warwick, Coventry, United Kingdom; School of Computing, Engineering and Digital Technologies, Teesside University, United Kingdom
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 57 pages, 1 figure, 6 MultiTP moral dimensions

点击查看摘要

Abstract:Large language models increasingly mediate decisions that turn on moral judgement, yet a growing body of evidence shows that their implicit preferences are not culturally neutral. Existing cultural alignment methods either require per-country preference data and fine-tuning budgets or assume white-box access to model internals that commercial APIs do not expose. In this work, we focus on this realistic black-box, public-data-only regime and observe that within-country sociodemographic disagreement, not consensus, is the primary steering signal. We introduce DISCA (Disagreement-Informed Steering for Cultural Alignment), an inference-time method that instantiates each country as a panel of World-Values-Survey-grounded persona agents and converts their disagreement into a bounded, loss-averse logit correction. Across 20 countries and 7 open-weight backbones (2B–70B), DISCA reduces cultural misalignment on MultiTP by 10–24% on the six backbones ≤ 3.8B, and 2–7% on open-ended scenarios, without changing any weights. Our results suggest that inference-time calibration is a scalable alternative to fine-tuning for serving the long tail of global moral preferences.
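
The sketch below shows one plausible form of a disagreement-driven, bounded, loss-averse logit correction; the actual functional form used by DISCA is not specified here and may differ. Panel distributions, the scale, bound, and loss-aversion factor are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def disca_style_correction(base_logits, panel_probs, scale=4.0, bound=2.0, loss_aversion=1.5):
    """One plausible bounded, loss-averse logit correction driven by panel disagreement.

    base_logits: (V,) black-box model logits over answer options
    panel_probs: (P, V) choice distributions of P persona agents for one country
    """
    base_p = np.exp(base_logits - base_logits.max())
    base_p /= base_p.sum()
    mean_p = panel_probs.mean(axis=0)
    disagreement = panel_probs.std(axis=0)            # per-option disagreement signal

    # Steer toward the panel mean, more strongly on options the personas argue about.
    delta = scale * disagreement * (mean_p - base_p)
    delta = np.where(delta < 0, loss_aversion * delta, delta)   # loss-averse shaping
    return base_logits + np.clip(delta, -bound, bound)          # bounded correction

options = 4
base_logits = rng.normal(size=options)
panel = rng.dirichlet(np.ones(options), size=8)       # 8 WVS-grounded persona agents
print("base   :", np.round(base_logits, 2))
print("steered:", np.round(disca_style_correction(base_logits, panel), 2))
```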

[NLP-14] Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

【速读】: 该论文旨在解决多模态深度搜索(Multimodal Deep Search)中两个关键瓶颈:一是现有工具使用机制将搜索、浏览或转换返回的图像视为瞬时输出,导致中间视觉证据无法被后续工具复用;二是训练数据通常依赖固定采集策略,难以追踪目标智能体能力的动态演化。解决方案的核心在于提出一个以图像库(image bank)参考协议为核心的视觉原生代理框架(visual-native agent harness),通过将每个工具生成的图像注册为可寻址引用,实现中间视觉证据的重复利用;在此基础上引入基于策略的数据演化(On-policy Data Evolution, ODE)机制,通过闭环数据生成器在每轮训练中根据当前策略表现自适应优化数据,使每轮数据精准匹配当前策略的学习需求,从而覆盖从监督微调到强化学习的完整训练生命周期。实验证明,该方法显著提升了Qwen3-VL-8B和30B模型在8个基准上的性能,尤其在需迭代视觉精炼的复杂任务中效果突出。

链接: https://arxiv.org/abs/2605.10832
作者: Shijue Huang,Hangyu Guo,Chenxin Li,Junting Lu,Xinyu Geng,Zhaochen Su,Zhenyu Li,Shuang Chen,Hongru Wang,Yi R. Fung
机构: Hong Kong University of Science and Technology (香港科技大学); The Chinese University of Hong Kong (香港中文大学); Peking University (北京大学); Tsinghua University (清华大学); University of Edinburgh (爱丁堡大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multimodal deep search requires an agent to solve open-world problems by chaining search, tool use, and visual reasoning over evolving textual and visual context. Two bottlenecks limit current systems. First, existing tool-use harnesses treat images returned by search, browsing, or transformation as transient outputs, so intermediate visual evidence cannot be re-consumed by later tools. Second, training data is usually built by fixed curation recipes that cannot track the target agent’s evolving capability. To address these challenges, we first introduce a visual-native agent harness centered on an image bank reference protocol, which registers every tool-returned image as an addressable reference and makes intermediate visual evidence reusable by later tools. On top of this harness, On-policy Data Evolution (ODE) runs a closed-loop data generator that refines itself across rounds from rollouts of the policy being trained. This per-round refinement makes each round’s data target what the current policy still needs to learn. The same framework supports both diverse supervised fine-tuning data and policy-aware reinforcement learning data curation, covering the full training lifecycle of the target agent. Across 8 multimodal deep search benchmarks, ODE improves the Qwen3-VL-8B agent from 24.9% to 39.0% on average, surpassing Gemini-2.5 Pro in standard agent-workflow setting (37.9%). At 30B, ODE raises the average score from 30.6% to 41.5%. Further analyses validate the effectiveness of image-bank reuse, especially on complex tasks requiring iterative visual refinement, while rollout-feedback evolution yields more grounded SFT traces and better policy-matched RL tasks than static synthesis.

[NLP-15] SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing

【速读】: 该论文旨在解决大语言模型在分子编辑任务中因属性相关信息隐含于密集隐藏状态而导致的属性控制能力不足问题,即大量编辑操作无法提升甚至损害目标性质。解决方案的关键在于提出SLIM(Sparse Latent Interpretable Molecular editing)框架,通过可学习重要性门控的稀疏自编码器(Sparse Autoencoder)将编辑器的隐藏状态分解为稀疏且与属性对齐的特征空间,在此空间中可精确激活与属性相关维度以实现无参数修改的精准引导,从而显著提高编辑成功率并支持可解释分析。

链接: https://arxiv.org/abs/2605.10831
作者: Mingxu Zhang,Yuhan Li,Lujundong Li,Dazhong Shen,Hui Xiong,Ying Sun
机构: The Hong Kong University of Science and Technology (Guangzhou); Nanjing University of Aeronautics and Astronautics; The 63rd Research Institute, National University of Defense Technology, Nanjing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models possess strong chemical reasoning capabilities, making them effective molecular editors. However, property-relevant information is implicitly entangled across their dense hidden states, providing no explicit handle for property control: a substantial fraction of edits fail to improve or even degrade target properties. To address these issues, we propose SLIM (Sparse Latent Interpretable Molecular editing), a plug-and-play framework that decomposes the editor’s hidden states into sparse, property-aligned features via a Sparse Autoencoder with learnable importance gates. Steering in this sparse feature space precisely activates property-relevant dimensions, improving editing success rate without modifying model parameters. The same sparse basis further supports interpretable analysis of editing behavior. Experiments on the MolEditRL benchmark across four model architectures and eight molecular properties show consistent gains over baselines, with improvements of up to 42.4 points.
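
A compact sketch of the core mechanism, under assumed dimensions and gating form: a sparse autoencoder with learnable importance gates decomposes the editor's hidden state, selected property-aligned features are amplified, and the decoder maps the edited code back toward the model's representation space. Which feature indices to amplify, and by how much, are placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class GatedSparseAutoencoder(nn.Module):
    """Sparse autoencoder with learnable importance gates over its features."""
    def __init__(self, d_model: int, d_feat: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_feat)
        self.dec = nn.Linear(d_feat, d_model)
        self.gate = nn.Parameter(torch.zeros(d_feat))  # learnable importance gates

    def encode(self, h):
        return torch.relu(self.enc(h)) * torch.sigmoid(self.gate)  # sparse, gated codes

    def forward(self, h):
        z = self.encode(h)
        return self.dec(z), z

d_model, d_feat = 64, 256
sae = GatedSparseAutoencoder(d_model, d_feat)
h = torch.randn(1, d_model)                    # hidden state of the LLM editor

recon, z = sae(h)
sparsity = (z > 0).float().mean()

# Steering: amplify a few property-aligned features (indices are placeholders for
# dimensions found to correlate with the target molecular property).
property_dims = torch.tensor([3, 42, 101])
z_steered = z.clone()
z_steered[:, property_dims] *= 3.0             # assumed steering strength

h_steered = sae.dec(z_steered)                 # edited state injected back into the editor
print(f"active features: {sparsity:.2%}, edit norm: {(h_steered - recon).norm():.3f}")
```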

[NLP-16] Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge ICML2026

【速读】: 该论文旨在解决在大型语言模型(Large Language Models, LLMs)作为自动评判者(LLM-as-a-Judge)的应用中,推理能力(reasoning)的使用效率与准确性之间的权衡问题。研究发现,显式推理能显著提升需要结构化验证任务(如数学和编程)的判断准确率,但在简单评估任务中收益有限甚至可能带来负向效果,并且推理模式的计算成本显著更高。为应对这一挑战,作者提出了一种鲁棒自适应成本高效路由机制(Robust Adaptive Cost-Efficient Routing, RACER),其核心在于将路由决策建模为一个受限分布鲁棒优化问题(constrained distributionally robust optimization),通过KL散度不确定性集显式建模分布偏移(distribution shift),并设计了一个高效的原始-对偶算法,从而在固定预算下动态选择是否启用推理型评判者,同时保证最优策略的唯一性和线性收敛性,实验证明该方法在分布偏移场景下实现了更优的准确率-成本权衡。

链接: https://arxiv.org/abs/2605.10805
作者: Wenbo Zhang,Lijinghua Zhang,Liner Xiang,Hengrui Cai
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注: Accepted at ICML 2026

点击查看摘要

Abstract:Reasoning-capable large language models (LLMs) have recently been adopted as automated judges, but their benefits and costs in LLM-as-a-Judge settings remain unclear. Through controlled comparisons between reasoning and non-reasoning judges, we show that explicit reasoning substantially improves judgment accuracy on tasks requiring structured verification (e.g., math and coding), while offering limited or even negative gains on simpler evaluations and incurring significantly higher computational cost. These findings motivate that reasoning should be used selectively rather than universally, with awareness of possible distribution shift. We propose a Robust Adaptive Cost-Efficient Routing (RACER), which dynamically selects between reasoning and non-reasoning judges under a fixed budget by formulating routing as a constrained distributionally robust optimization problem. RACER explicitly accounts for distribution shift via a KL-divergence uncertainty set, admits an efficient primal–dual algorithm, and enjoys theoretical guarantees including uniqueness of the optimal policy and linear convergence. Extensive experiments show that RACER achieves superior accuracy–cost trade-offs under distribution shift.
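
As a toy illustration of budget-constrained routing, the sketch below prices the cost constraint with a Lagrange multiplier updated by a simple dual step and routes a query to the reasoning judge only when its estimated accuracy gain exceeds the priced cost; the KL-divergence uncertainty set that makes RACER distributionally robust is omitted, and all per-query estimates are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Per-query estimates (stand-ins): accuracy gain from reasoning, and its extra cost.
gain = rng.beta(2, 5, size=n) * 0.3          # expected accuracy improvement
cost = rng.uniform(1.0, 4.0, size=n)         # extra tokens / latency of reasoning
budget = 0.3 * cost.mean()                   # average extra cost we can afford per query

lam, lr = 0.0, 0.05
for _ in range(200):                                     # primal-dual style iterations
    route_to_reasoning = gain - lam * cost > 0           # primal: threshold rule
    avg_cost = (cost * route_to_reasoning).mean()
    lam = max(0.0, lam + lr * (avg_cost - budget))       # dual: price the budget

print(f"lambda={lam:.3f}  reasoning fraction={route_to_reasoning.mean():.2%}")
print(f"avg extra cost={avg_cost:.3f} (budget {budget:.3f})  "
      f"expected gain={np.where(route_to_reasoning, gain, 0).mean():.3f}")
```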

[NLP-17] The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies NEURIPS2026

【速读】: 该论文旨在解决当前链式思维(Chain-of-Thought, CoT)忠实性评估中存在的一种系统性混淆问题:现有基于扰动(corruption studies)的方法在标准基准(如GSM8K)上检测到的“计算重要位置”实际上可能是由显式终端答案陈述(如“the answer is X”)引发的格式依赖效应,而非真实推理过程中的关键步骤。其解决方案的关键在于识别并分离这种格式决定效应——通过设计三重前提协议(question-only control、format characterization、all-position sweep),并在无答案后缀的链式结构中验证:当移除显式答案文本时,模型对后续语句的敏感性显著下降(3B模型下约19倍衰减),而保留推理过程则能维持较高准确性;同时生成时探针显示答案并非早期确定(早期承诺仅5%),但在消费阶段模型输出却系统性跟随答案文本,表明格式驱动了虚假的“忠实性”信号。这一发现揭示了CoT评估方法的根本局限,并为未来研究提供了可复现的标准框架。

链接: https://arxiv.org/abs/2605.10799
作者: Gabriel Garcia
机构: Independent Researcher
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 34 pages, 6 figures, 13 tables. Submitted to NeurIPS 2026. Code and data: this https URL

点击查看摘要

Abstract:Corruption studies, the primary tool for evaluating chain-of-thought (CoT) faithfulness, identify which chain positions are “computationally important” by measuring accuracy when steps are replaced with errors. We identify a systematic confound: for chains with explicit terminal answer statements, the dominant format in standard benchmarks, corruption studies detect where the answer text appears, not where computation occurs. A within-dataset format ablation provides the key evidence: on standard GSM8K chains ending with “the answer is X,” removing only the answer statement, preserving all reasoning, collapses suffix sensitivity ~19x at 3B (N=300, p=0.022). Conflicting-answer experiments quantify the causal mechanism: at 7B, CC accuracy drops to near-zero (≤ 0.02) across five architecture families; the followed-wrong rate spans 0.63-1.00 at 3B-7B and attenuates at larger scales (0.300 at Phi-4-14B, ~0.01 at 32B). A within-stable 7B replication (9.3x attenuation, N=76, p=7.8e-3; Qwen3-8B N=299, p=0.004) provides converging evidence, and the pattern replicates on MATH (DeepSeek-R1-7B: 10.9x suffix-survival recovery). On chains without answer suffixes the same protocol identifies the prefix as load-bearing (Delta=-0.77, p < 10^-12). Generation-time probes confirm a dissociation: the answer is not early-determined during generation (early commitment < 5%), yet at consumption time model outputs systematically follow the explicit answer text. The format-determination effect persists through 14B (8.5x ratio, p=0.001) and converges toward zero at 32B. We propose a three-prerequisite protocol (question-only control, format characterization, all-position sweep) as a minimum standard for corruption-based faithfulness studies.
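
The scaffolding of such a corruption study, including the answer-suffix ablation, can be sketched as below. The judging function is a trivial stub standing in for an actual LLM call; only the experimental protocol, corrupting one position at a time and comparing accuracy with and without the terminal answer statement, is illustrated.

```python
import random

random.seed(0)

def model_answer(question: str, chain: list) -> str:
    """Stub for an LLM that consumes a chain-of-thought. Stand-in behavior:
    if an explicit 'the answer is' suffix is present, follow it; otherwise guess."""
    for step in reversed(chain):
        if step.startswith("the answer is"):
            return step.split()[-1]
    return random.choice(["7", "8"])

def corruption_sweep(question, chain, gold, corrupt_with="999"):
    """Accuracy after corrupting each chain position, with and without the answer suffix."""
    results = {}
    for keep_suffix in (True, False):
        c = chain if keep_suffix else [s for s in chain if not s.startswith("the answer is")]
        accs = []
        for pos in range(len(c)):
            corrupted = c.copy()
            corrupted[pos] = f"step contains the wrong value {corrupt_with}"
            accs.append(int(model_answer(question, corrupted) == gold))
        results["with suffix" if keep_suffix else "no suffix"] = accs
    return results

chain = ["3 + 4 = 7", "double-check: the sum stays 7 here", "the answer is 7"]
print(corruption_sweep("toy question", chain, gold="7"))
```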

[NLP-18] Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR

【速读】: 该论文旨在解决自蒸馏(Self-distillation)在大语言模型(Large Language Models, LLMs)后训练过程中存在的问题:当学生模型(student)在推理路径上表现成功时,传统自蒸馏机制反而会覆盖学生的决策并抑制其自主推理能力。其核心问题是现有方法未能有效区分“教师引导”与“学生自主探索”的信号,导致在成功轨迹上产生不必要的干扰。解决方案的关键在于重新解读自蒸馏信号——提出RLRT(Reinforcement Learning with Reversed Teacher),通过识别那些学生成功但教师未预测的token,将其作为学生自主推理的证据,并利用强化学习(Reinforcement Learning, RL)对其进行奖励,从而实现一种基于学生自身成功的有价值探索(valuable exploration)。这种方法将信息不对称(information asymmetry)引入到RLVR(Reinforcement Learning with Verifiable Rewards)框架中,显著优于传统的自蒸馏和探索基线方法。

链接: https://arxiv.org/abs/2605.10781
作者: Jeonghye Kim,Jiwon Jeon,Dongsheng Li,Yuqing Yang
机构: Microsoft Research (微软研究院); KAIST (韩国科学技术院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Self-distillation has emerged as a powerful framework for post-training LLMs, where a teacher conditioned on extra information guides a student without it, both from the same model. While this guidance is useful when the student has failed, on successful rollouts, the same mechanism instead overwrites the student’s choices and suppresses its own reasoning. Therefore, we propose reading the original self-distillation signal in reverse: when the student succeeds along a path the teacher would not have predicted, these tokens reflect its self-driven reasoning. Building on this, we propose RLRT (RLVR with Reversed Teacher), which augments GRPO by reinforcing these tokens on correct rollouts. We interpret this as a new form of exploration in RLVR: not uniform diversity, but valuable exploration grounded in the student’s own success. Across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints, RLRT substantially outperforms self-distillation and exploration-based baselines, establishing information asymmetry as a new, principled design axis for RLVR.
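
A sketch of the token-selection idea follows: on a correct rollout, tokens that the information-privileged teacher assigns low probability are taken as evidence of self-driven reasoning and receive an extra positive weight in the policy-gradient loss. Log-probabilities are random stand-ins, and the threshold and bonus weight are assumed values.

```python
import torch

torch.manual_seed(0)
seq_len = 12

# Stand-ins: per-token log-probs of the rollout tokens under the student policy
# and under the teacher (same model conditioned on extra information).
student_logp = torch.randn(seq_len, requires_grad=True)
teacher_logp = torch.randn(seq_len) - 1.0
rollout_correct = True          # reward signal from a verifiable check

advantage = 1.0 if rollout_correct else -1.0   # placeholder group-relative advantage

# "Reversed teacher" tokens: the student succeeded along a path the teacher
# finds unlikely, so these tokens reflect the student's own reasoning.
threshold = -2.0                                # assumed teacher log-prob threshold
self_driven = (teacher_logp < threshold).float()
bonus = 0.5                                     # assumed extra reinforcement weight

token_weight = 1.0 + bonus * self_driven * float(rollout_correct)
loss = -(token_weight * advantage * student_logp).mean()
loss.backward()
print(f"self-driven tokens reinforced: {int(self_driven.sum())}/{seq_len}, loss={loss.item():.3f}")
```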

[NLP-19] LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)驱动的自主代理在真实操作系统(Operating System, OS)环境中引入的新安全风险——行为越狱(behavior jailbreak),即攻击者诱导代理执行不可逆的危险系统级操作,而现有评估基准仅关注语义层安全性,忽视物理层危害,且测试用例缺乏隔离性,导致早期运行污染后续结果。解决方案的关键在于提出LITMUS(LLM-agents In-OS Testing for Measuring Unsafe Subversion)基准,其核心创新为:一是采用语义-物理双层验证机制,确保对代理行为在对话层面和系统执行层面均进行判别;二是引入操作系统级状态回滚技术,实现测试用例间的完全隔离与可复现性。此方案首次实现了对LLM代理在真实OS环境中行为安全性的标准化、物理可落地的评估。

链接: https://arxiv.org/abs/2605.10779
作者: Chiyu Zhang,Huiqin Yang,Bendong Jiang,Xiaolei Zhang,Yiran Zhao,Ruyi Chen,Lu Zhou,Xiaogang Xu,Jiafei Wu,Liming Fang,Zhe Liu
机构: Nanjing University of Aeronautics and Astronautics (南京航空航天大学); Zhejiang University (浙江大学); Collaborative Innovation Center of Novel Software Technology and Industrialization (新型软件技术与产业化协同创新中心)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid proliferation of LLM-based autonomous agents in real operating system environments introduces a new category of safety risk beyond content safety: behavior jailbreak, where an adversary induces an agent to execute dangerous OS-level operations with irreversible consequences. Existing benchmarks either evaluate safety at the semantic layer alone, missing physical-layer harms, or fail to isolate test cases, letting earlier runs contaminate later ones. We present LITMUS (LLM-agents In-OS Testing for Measuring Unsafe Subversion), a benchmark addressing both gaps via a semantic-physical dual verification mechanism and OS-level state rollback. LITMUS comprises 819 high-risk test cases organized into one harmful seed subset and six attack-extended subsets covering three adversarial paradigms (jailbreak speaking, skill injection, and entity wrapping), plus a fully automated multi-agent evaluation framework judging behavior at both conversational and OS-level physical layers. Evaluation across frontier agents reveals three findings: (1) current agents lack effective safety awareness, with strong models (e.g., Claude Sonnet 4.6) still executing 40.64% of high-risk operations; (2) agents exhibit pervasive Execution Hallucination (EH), verbally refusing a request while the dangerous operation has already completed at the system level, invisible to every prior semantic-only framework; and (3) skill injection and entity wrapping attacks achieve high success rates, exposing pronounced agent vulnerabilities. LITMUS provides the first standardized platform for reproducible, physically grounded behavioral safety evaluation of LLM agents in real OS environments.

[NLP-20] Why Low-Resource NLP Needs More Than Cross-Lingual Transfer: Lessons Learned from Luxembourgish ACL2026

【速读】: 该论文旨在解决低资源语言(如卢森堡语)在自然语言处理(Natural Language Processing, NLP)中因缺乏高质量标注数据而导致模型性能不足的问题,同时探讨跨语言迁移(Cross-lingual Transfer)与语言特定资源建设之间的关系。其解决方案的关键在于:跨语言迁移虽能显著提升目标语言性能,但其效果高度依赖于高质量、任务对齐的目标语言数据;而此类数据在低资源场景下通常规模有限,难以独立支撑强模型表现。因此,最优路径并非将跨语言迁移与语言特定努力视为替代关系,而是将其作为互补组件整合进可持续的低资源NLP流程中,通过协同利用二者实现性能最大化。

链接: https://arxiv.org/abs/2605.10714
作者: Fred Philippy,Siwen Guo,Jacques Klein,Tegawendé F. Bissyandé
机构: SnT, University of Luxembourg, Luxembourg; Luxembourg Institute of Science and Technology, Luxembourg
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at BigPicture Workshop 2026 (co-located with ACL 2026)

点击查看摘要

Abstract:Cross-lingual transfer has become a central paradigm for extending natural language processing (NLP) technologies to low-resource languages. By leveraging supervision from high-resource languages, multilingual language models can achieve strong task performance with little or no labeled target-language data. However, it remains unclear to what extent cross-lingual transfer can substitute for language-specific efforts. In this paper, we synthesize prior research findings and data collection results on Luxembourgish, which, despite its typological proximity to high-resource languages and its presence in a multilingual context, remains insufficiently represented in modern NLP technologies. Across findings, we observe a fundamental interdependence between cross-lingual transfer and language-specific efforts. Cross-lingual transfer can substantially improve target-language performance, but its success depends critically on the availability of sufficiently high-quality, task-aligned target-language data. At the same time, such resources, particularly in low-resource settings, are typically too limited in scale to drive strong performance on their own. Instead, such resources reach their full potential only when leveraged within a cross-lingual framework. We therefore argue that cross-lingual transfer and language-specific efforts should not be viewed as competing alternatives. Instead, they function as complementary components of a sustainable low-resource NLP pipeline. Based on these insights, we provide practical guidelines for integrating and balancing cross-lingual transfer with language-specific development in sustainable low-resource NLP pipelines.

[NLP-21] Step Rejection Fine-Tuning: A Practical Distillation Recipe

【速读】: 该论文旨在解决拒绝微调(Rejection Fine-Tuning, RFT)在训练大语言模型(Large Language Model, LLM)代理时存在的效率低下问题,即RFT会直接丢弃未解决轨迹(unresolved trajectories),而这些轨迹在复杂任务中占比较高且可能包含部分正确步骤。解决方案的关键在于提出步骤级拒绝微调(Step Rejection Fine-Tuning, SRFT),通过引入一个批判性LLM(critic LLM)逐步评估轨迹中每一步的正确性,在训练过程中对错误步骤屏蔽损失(mask the loss)但保留其上下文信息,从而让模型学会从错误中恢复而不重复犯错。实验表明,在SWE-bench Verified数据集上,RFT通过丢弃未解决轨迹带来2.4%的修复率提升,而SRFT通过筛选并保留这些轨迹带来3.7%的提升,最终总修复率达到32.2%。

链接: https://arxiv.org/abs/2605.10674
作者: Igor Slinko,Ilia Zavidnyi,Egor Bogomolov,Yaroslav Zharov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Rejection Fine-Tuning (RFT) is a standard method for training LLM agents, where unsuccessful trajectories are discarded from the training set. In the context of SWE-bench tasks, this corresponds to filtering out runs where the submitted patch does not pass the tests. However, this approach discards unresolved trajectories, even though they form a large portion of all trajectories for hard tasks and even then may be partially correct. In this work, we propose Step Rejection Fine-Tuning (SRFT) - a practical way to leverage these unresolved trajectories. For this, we employ a critic LLM to assess the correctness of each step in a trajectory. Consequently, during training, we mask the loss for erroneous steps while retaining them in the context window. This way we ensure the model learns to recover from errors without reproducing them. Evaluation on SWE-bench Verified shows that while RFT improves the resolution rate by 2.4% by excluding unresolved trajectories, SRFT improves it by 3.7% by filtering them instead of discarding completely, reaching the total resolution rate of 32.2%.
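
The core training trick can be written as a per-token loss mask, as in the sketch below: steps labeled erroneous by the critic stay in the context but contribute nothing to the cross-entropy loss. Vocabulary size, step labels, and the step-to-token mapping are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, seq_len = 100, 20

logits = torch.randn(seq_len, vocab, requires_grad=True)   # stand-in model outputs
targets = torch.randint(0, vocab, (seq_len,))              # trajectory tokens

# Critic verdicts per step (True = correct step), and each token's step index.
step_ok = [True, False, True, True]                        # e.g. step 1 was erroneous
token_step = torch.tensor([0]*5 + [1]*5 + [2]*5 + [3]*5)   # 5 tokens per step (assumed)

# Mask: erroneous steps remain in the context window but contribute no loss.
mask = torch.tensor([step_ok[s] for s in token_step.tolist()], dtype=torch.float)

per_token = F.cross_entropy(logits, targets, reduction="none")
loss = (per_token * mask).sum() / mask.sum().clamp(min=1)
loss.backward()
print(f"supervised tokens: {int(mask.sum())}/{seq_len}, loss={loss.item():.3f}")
```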

[NLP-22] Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions

【速读】: 该论文旨在解决激活引导(activation steering)在状态化对话中失效的问题,特别是由于KV缓存(KV-cache)污染导致的局部扰动累积为全局一致性退化。其解决方案的关键在于提出门控裁剪注意力-增量引导(Gated Cropped Attention-Delta, GCAD),通过从系统提示(system-prompt)对自注意力机制的贡献中提取引导信号,并以token级门控方式施加干预,从而避免将扰动存储于KV缓存中并重复利用,有效提升了长程对话的一致性与个性特征保留能力。

链接: https://arxiv.org/abs/2605.10664
作者: Diancheng Kang,Zheyuan Liu,Ningshan Ma,Yue Huang,Zhaoxuan Tan,Meng Jiang
机构: Southern University of Science and Technology; University of Notre Dame; Massachusetts Institute of Technology; 2077AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 23 pages, 5 figures. This paper proposes GCAD, an attention-level activation steering method for more stable multi-turn behavior control

点击查看摘要

Abstract:Activation steering controls language model behavior by adding directions to internal representations at inference time, but standard residual-stream steering can fail in stateful dialogue. We identify KV-cache contamination as a key failure mode: steered token states are stored and repeatedly reused, turning a local perturbation into cumulative coherence degradation. To address this challenge, we propose Gated Cropped Attention-Delta steering (GCAD), which extracts steering signals from system-prompt contributions to self-attention and applies them with token-level gating. Across persona-steering experiments, GCAD preserves trait control while substantially improving long-horizon coherence. On the main multi-turn benchmark, GCAD improves average coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1. These results suggest that activation steering becomes more reliable when interventions follow the prompt-mediated pathways that models already use for behavioral control.
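
The sketch below illustrates one way to read the idea at the level of a single attention head: the system-prompt positions' contribution to the attention output is isolated as a delta and added back with a per-token gate, instead of perturbing the residual stream everywhere. Shapes, the gating rule, and the steering strength are assumptions; a real implementation would hook the model's attention modules.

```python
import torch

torch.manual_seed(0)
seq_len, prompt_len, d = 16, 4, 32     # first 4 positions = system prompt (assumed)

# Stand-ins for one attention head's weights and value vectors.
attn = torch.softmax(torch.randn(seq_len, seq_len), dim=-1)   # (queries, keys)
values = torch.randn(seq_len, d)

attn_out = attn @ values
# System-prompt contribution to each query's attention output.
prompt_delta = attn[:, :prompt_len] @ values[:prompt_len]

# Token-level gate: steer only tokens that already attend noticeably to the prompt.
gate = (attn[:, :prompt_len].sum(dim=-1) > 0.2).float().unsqueeze(-1)   # assumed rule
alpha = 1.5                                                             # steering strength

steered_out = attn_out + alpha * gate * prompt_delta
print("gated tokens:", int(gate.sum()), "delta norm:", (steered_out - attn_out).norm().item())
```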

[NLP-23] When Can Digital Personas Reliably Approximate Human Survey Findings?

【速读】: 该论文试图解决的问题是:基于大语言模型(Large Language Models, LLMs)构建的数字人格(digital personas)在多大程度上能够可靠地替代人类受访者进行问卷调查,尤其是在不同任务和数据结构下其有效性边界在哪里。解决方案的关键在于,通过使用LISS面板数据构建数字人格——即利用受访者的背景变量和2023年前的调查历史信息,并将其与同一受访者在截止日期后的保留回答进行对比测试——从而系统评估数字人格在个体预测、分布拟合、公平性及聚类结构恢复等多个层面的表现。研究发现,数字人格在稳定属性和价值观相关领域表现良好,但对主观性强、异质性高或罕见的回答仍存在显著局限,且性能更多取决于人类响应本身的结构特征而非LLM模型本身的选择,尤其检索增强架构(retrieval-augmented architectures)带来了最明显的提升。

链接: https://arxiv.org/abs/2605.10659
作者: Mumin Jia,Yilin Chen,Divya Sharma,Jairo Diaz-Rodriguez
机构: York University (约克大学); University Health Network (大学健康网络)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Digital personas powered by Large Language Models (LLMs) are increasingly proposed as substitutes for human survey respondents, yet it remains unclear when they can reliably approximate human survey findings. We answer this question using the LISS panel, constructing personas from respondents’ background variables and pre-2023 survey histories, then testing them against the same respondents’ held-out post-cutoff answers. Across four persona architectures, three LLMs, and two prediction tasks, we assess performance at the question, respondent, distributional, equity, and clustering levels. Digital personas improve alignment with human response distributions, especially in domains tied to stable attributes and values, but remain limited for individual prediction and fail to recover multivariate respondent structure. Retrieval-augmented architectures provide the clearest gains, but performance depends more on human response structure than on model choice: personas perform best for low-variability questions and common respondent patterns, and worst for subjective, heterogeneous, or rare responses. Our results provide practical guidance on when digital personas could be appropriate for survey research and when human validation remains necessary.

[NLP-24] A Single-Layer Model Can Do Language Modeling

【速读】: 该论文旨在解决当前语言模型普遍依赖深层堆叠结构(如Transformer中的多层KV缓存或Mamba、GDN等模型的每层矩阵)所带来的计算复杂度高和可解释性差的问题,同时探索生物系统中广泛存在的递归机制在语言建模中的潜力。其解决方案的关键在于提出Grounded Prediction Networks (GPN),采用单一递归块(包含一个共享的前馈网络FFN和一个共享的矩阵记忆)与单个状态向量进行迭代更新,从而将模型深度压缩至仅1层甚至2层,显著降低参数规模并提升内部状态的可分析性;实验表明,在130M参数下,1层GPN+M已达到FineWeb-Edu困惑度18.06,接近12层Transformer++(16.05)和10层GDN(15.34),且通过直接观测状态向量几何结构,揭示了持久默认词方向、数十token的内容感知视野以及自发分化为快慢保留池的记忆头等关键特性。

链接: https://arxiv.org/abs/2605.10643
作者: Zanmin Wang
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 pages, 5 figures, 1 table. Code: this https URL

点击查看摘要

Abstract:Modern language models scale depth by stacking layers, each holding its own state - a per-layer KV cache in transformers, a per-layer matrix in Mamba, Gated DeltaNet (GDN), RWKV, and xLSTM. Biological systems lean heavily on recurrence rather than on stacking. We ask how far that shape can go on language modeling. We propose Grounded Prediction Networks (GPN): one state vector revisited at every step through a single recurrent block - one FFN, one shared matrix memory. At 130M parameters, a 1-layer GPN+M reaches FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34); a 2-layer variant closes the gap to 6%/11%. We do not match the deep baselines. Because the working context is a single vector, we can directly inspect its geometry: a persistent default-token direction, a content-bearing horizon of tens of tokens, and memory heads that split spontaneously into fast and slow retention pools.
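
A tiny model in the same shape, though not the paper's exact architecture, is sketched below: a single state vector is updated at every token by one shared FFN and one shared matrix memory, and next-token logits are read off the state. Sizes, the decay rate, and the write rule are illustrative choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyGPNBlock(nn.Module):
    """One FFN + one matrix memory, revisited at every step with a single state vector."""
    def __init__(self, vocab: int, d: int):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.ffn = nn.Sequential(nn.Linear(2 * d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.out = nn.Linear(d, vocab)
        self.d = d

    def forward(self, tokens):                        # tokens: (seq_len,)
        state = torch.zeros(self.d)
        memory = torch.zeros(self.d, self.d)          # shared matrix memory
        logits = []
        for t in tokens:
            x = self.embed(t)
            read = memory @ x                         # read from the matrix memory
            state = torch.tanh(self.ffn(torch.cat([x + read, state])))
            memory = 0.95 * memory + torch.outer(state, x)   # simple decay-and-write rule
            logits.append(self.out(state))
        return torch.stack(logits)

model = TinyGPNBlock(vocab=50, d=32)
tokens = torch.randint(0, 50, (10,))
logits = model(tokens)
loss = nn.functional.cross_entropy(logits[:-1], tokens[1:])   # next-token prediction
loss.backward()
print("logits shape:", tuple(logits.shape), "loss:", round(loss.item(), 3))
```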

[NLP-25] Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm ICML2026

【速读】: 该论文旨在解决持续预训练(Continual Pre-Training, CPT)中生成式语言模型(Language Models, LMs)在不断学习新知识时面临的关键挑战——持续事实知识获取(continual Factual Knowledge Acquisition, cFKA)的机制不明确以及灾难性遗忘(catastrophic forgetting)问题。现有方法如数据回放(data replay)虽被广泛采用,但其内在作用机制尚不清楚。论文通过构建基于单层Transformer的理论框架,揭示了正则化方法仅改变参数收敛速率而不改变遗忘倾向,而数据回放能有效调整收敛动态并稳定预训练知识。解决方案的关键在于提出一种新颖的生成式数据回放方法——STOC(Selecting Tokens via Attention-based Contribution),该方法基于注意力机制识别对知识保留最具贡献的事实片段,从而指导高质量回放数据的生成,显著缓解灾难性遗忘,提升模型在持续学习场景下的事实知识保持能力。

链接: https://arxiv.org/abs/2605.10640
作者: Haoyu Wang,Yifan Shang,Zhongxiang Sun,Weijie Yu,Xiao Zhang,Jun Xu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2026

点击查看摘要

Abstract:Continual Pre-Training (CPT) is essential for enabling Language Models (LMs) to integrate new knowledge without erasing old knowledge. While classical CPT techniques like data replay have become the standard paradigm, the mechanisms underlying how LMs acquire and retain facts over time, termed continual Factual Knowledge Acquisition (cFKA), remain unclear. In this work, we present a theoretical framework that characterizes the training dynamics of cFKA using a single-layer Transformer, offering a unified explanation for the behavior of representative CPT methods. Our analysis reveals that regularization-based methods merely adjust the convergence rate of parameters without altering the inherent forgetting tendency, whereas data replay methods succeed in shifting convergence dynamics and stabilizing pretrained knowledge. Building on these insights, we propose a novel generative data replay approach, called Selecting Tokens via attentiOn Contribution (STOC), which identifies influential factual snippets to guide replay data generation. Extensive experiments on both synthetic and real-world datasets validate our findings and demonstrate that STOC effectively enhances cFKA by mitigating catastrophic forgetting.
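
A small sketch of attention-based token selection for replay: each token is scored by the attention mass it receives, and the top-scoring tokens are kept as the factual snippet that seeds replay-data generation. The scoring rule, snippet budget, and attention weights below are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
tokens = ["Paris", "is", "the", "capital", "of", "France", "since", "centuries", "ago", "."]
n_heads, seq_len = 4, len(tokens)

# Stand-in attention weights from a pretrained LM: (heads, queries, keys).
attn = torch.softmax(torch.randn(n_heads, seq_len, seq_len), dim=-1)

# Contribution score: total attention mass received by each key position.
contribution = attn.sum(dim=(0, 1))                      # (seq_len,)

k = 3                                                     # assumed snippet budget
top = torch.topk(contribution, k).indices.sort().values
snippet = [tokens[i] for i in top.tolist()]

print("contribution:", [round(c, 2) for c in contribution.tolist()])
print("selected snippet tokens:", snippet)
# These high-contribution tokens would seed the generation of replay passages
# that restate the underlying fact in new surface forms.
```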

[NLP-26] Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLM s

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在使用良性窄域数据进行微调时可能引发广泛有害行为的问题,即“涌现错位”(Emergent Misalignment, EM)。其核心发现是:尽管微调可能导致模型产生有害行为,但模型的潜在人格空间(latent personality space)在语义几何上具有高度稳定性,且该空间中某些关键方向——如“邪恶”人格向量(Evil persona vector)和作者提出的新语义极性向量(Semantic Valence Vector, SVV)——可作为内在防护机制。解决方案的关键在于通过因果干预识别并操纵这些稳定的方向:移除它们会使错误率超过40%,而增强它们则将失败率降至3%以下;更重要的是,从指令微调模型中提取的这些向量可在零样本条件下迁移至受污染的微调模型中,有效调控EM现象,表明模型的人格表征具有跨分布的鲁棒性,可作为通用的安全护栏。

链接: https://arxiv.org/abs/2605.10633
作者: Krishak Aneja,Manas Mittal,Anmol Goel,Ponnurangam Kumaraguru,Vamshi Krishna Bonagiri
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 9 figures including appendix

点击查看摘要

Abstract:Fine-tuning Large Language Models (LLMs) on benign narrow data can sometimes induce broad harmful behaviors, a vulnerability termed emergent misalignment (EM). While prior work links these failures to specific directions in the activation space, their relationship to the model’s broader persona remains unexplored. We map the latent personality space of LLMs through established psychometric profiles like the Big Five, Dark Triad, and LLM-specific behaviors (e.g. evil, sycophancy), and show that the semantic geometry is highly stable across aligned models and their corrupted fine-tunes. Through causal interventions, we find that directions isolating social valence, such as the ‘Evil’ persona vector, and a Semantic Valence Vector (SVV) that we introduce, function as intrinsic guardrails: ablating them drives the misalignment rates above 40%, while amplifying them suppresses the failure mode to less than 3%. Leveraging the structural stability of the personality space, we show that vectors extracted a priori from an instruct-tuned model transfer zero-shot to successfully regulate EM in corrupted fine-tunes. Overall, our findings suggest that harmful fine-tuning does not overwrite a model’s internal representation of personality, allowing conserved representations to serve as robust, cross-distribution guardrails.

[NLP-27] Interpretable Coreference Resolution Evaluation Using Explicit Semantics ACL2026

【速读】: 该论文旨在解决核心指代消解(coreference resolution)评估中仅依赖聚合统计指标(如CoNLL-F1)所带来的诊断信息不足问题,这类指标无法揭示模型在特定语义类别(如人物、地点或事件)上的系统性弱点,从而限制了对模型能力的深入理解与针对性改进。解决方案的关键在于提出一种语义增强型评估框架,通过将概念识别与命名实体识别(Concept and Named Entity Recognition, CNER)引入核心指代输出,为名词短语提及赋予语义标签,并将这些标签传播至整个指代簇,从而实现按语义类别分层的细粒度评分(typed scores),以精准量化不同语义类别的提及抽取与链接性能。实验表明,该框架能有效识别出被传统指标掩盖的系统性缺陷,并可用于指导低成本的数据增强策略设计,提升跨域性能。

链接: https://arxiv.org/abs/2605.10627
作者: Bruno Gatti,Giuliano Martinelli,Roberto Navigli
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at main conference for ACL 2026. 19 pages

点击查看摘要

Abstract:Coreference resolution is typically evaluated using aggregate statistical metrics such as CoNLL-F1, which measure structural overlap between predicted and gold clusters. While widely used, these metrics offer limited diagnostic insights, penalizing errors without revealing whether a system struggles with specific semantic categories, such as people, locations, or events, and making it difficult to interpret model capabilities or derive actionable improvements. We address this gap by introducing a semantically-enhanced evaluation framework for coreference resolution. Our approach overlays Concept and Named Entity Recognition (CNER) onto coreference outputs, assigning semantic labels to nominal mentions and propagating them to entire coreference clusters. This enables the computation of typed scores aimed at evaluating mention extraction and linking capabilities stratified by semantic class. Across our experiments on OntoNotes, LitBank, and PreCo, we show that our framework uncovers systematic weaknesses that remain obscured by aggregate metrics. Furthermore, we demonstrate that these diagnostics can be used to design targeted, low-cost data augmentation strategies, achieving measurable out-of-domain improvements.
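
The typed scoring idea reduces to computing mention-level precision, recall, and F1 separately per semantic class once every mention carries a propagated CNER label, as in the bare-bones sketch below; the spans and labels are toy examples.

```python
# (start, end, semantic_class) mention spans; the class comes from CNER
# labels propagated over each coreference cluster.
gold = {(0, 2, "PERSON"), (5, 6, "PERSON"), (10, 12, "EVENT"), (20, 21, "LOCATION")}
pred = {(0, 2, "PERSON"), (5, 6, "PERSON"), (10, 11, "EVENT"), (20, 21, "LOCATION")}

def typed_scores(gold, pred):
    """Per-class mention precision / recall / F1."""
    classes = {m[2] for m in gold | pred}
    scores = {}
    for c in classes:
        g = {m for m in gold if m[2] == c}
        p = {m for m in pred if m[2] == c}
        tp = len(g & p)
        prec = tp / len(p) if p else 0.0
        rec = tp / len(g) if g else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[c] = {"P": prec, "R": rec, "F1": f1}
    return scores

for cls, s in sorted(typed_scores(gold, pred).items()):
    print(f"{cls:9s} P={s['P']:.2f} R={s['R']:.2f} F1={s['F1']:.2f}")
```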

[NLP-28] MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image

【速读】: 该论文旨在解决当前表格基础模型(Tabular Foundation Models)在处理多模态数据时的局限性问题,即缺乏对非结构化模态(如文本和图像)的原生支持,且依赖冻结的预训练嵌入来处理这些模态,导致在任务特定场景下性能受限。其解决方案的关键在于提出一种新的基准测试平台 MulTaBench,包含40个数据集(图像-表格与文本-表格各20个),聚焦于模态间提供互补预测信号的任务场景,并强调通过任务感知表示(Target-Aware Representations)对嵌入进行微调,从而提升模型性能。实验表明,这种目标导向的表示学习策略在不同模态、表征学习器、编码器规模和嵌入维度下均能泛化,显著优于传统固定嵌入方法。

链接: https://arxiv.org/abs/2605.10616
作者: Alan Arazi,Eilam Shapira,Shoham Grunblat,Mor Ventura,Elad Hoffer,Gioia Blayer,David Holzmüller,Lennart Purucker,Gaël Varoquaux,Frank Hutter,Roi Reichart
机构: Technion – Israel Institute of Technology (以色列理工学院); Prior Labs; NVIDIA; SODA Team, INRIA Saclay, Palaiseau (INRIA萨克雷团队,帕莱索); University of Freiburg (弗莱堡大学); Probabl; ELLIS Institute Tübingen (ELLIS图宾根研究所)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Tabular Foundation Models have recently established the state of the art in supervised tabular learning, by leveraging pretraining to learn generalizable representations of numerical and categorical structured data. However, they lack native support for unstructured modalities such as text and image, and rely on frozen, pretrained embeddings to process them. On established Multimodal Tabular Learning benchmarks, we show that tuning the embeddings to the task improves performance. Existing benchmarks, however, often focus on the mere co-occurrence of modalities; this leads to high variance across datasets and masks the benefits of task-specific tuning. To address this gap, we introduce MulTaBench, a benchmark of 40 datasets, split equally between image-tabular and text-tabular tasks. We focus on predictive tasks where the modalities provide complementary predictive signal, and where generic embeddings lose critical information, necessitating Target-Aware Representations that are aligned with the task. Our experimental results demonstrate that the gains from target-aware representation tuning generalize across both text and image modalities, several tabular learners, encoder scales, and embedding dimensions. MulTaBench constitutes the largest image-tabular benchmarking effort to date, spanning high-impact domains such as healthcare and e-commerce. It is designed to enable the research of novel architectures which incorporate joint modeling and target-aware representations, paving the way for the development of novel Multimodal Tabular Foundation Models.

[NLP-29] Responsible Benchmarking of Fairness for Automatic Speech Recognition

【速读】: 该论文旨在解决自动语音识别(ASR)系统在不同说话人群体(Speaker Groups, SG’s)中表现不一致的问题,即ASR公平性评估的可靠性问题。现有研究在判定ASR系统是否存在偏见时方法不统一,导致结论易受误导。论文提出基于机器学习公平性、社会科学研究和语音科学文献的最佳实践框架,强调必须明确所检验的公平性假设,并针对性地选择公平性指标。其解决方案的关键在于:避免仅以单一异质群体(如性别或种族)为基准进行评估,而应尽可能细致地分析多个元数据维度(如年龄、地域、语言变体等)之间的交叉效应(intersectionality),从而识别出真正被ASR系统不公平对待的群体,防止因忽略群体间复杂交互关系而导致的虚假相关性误判。

链接: https://arxiv.org/abs/2605.10615
作者: Felix Herron,Ange Richard,François Portet,Alexandre Allauzen,Solange Rossato
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Many studies have shown that automatic speech recognition (ASR) systems have unequal performance across speaker groups (SG’s). However, the manner in which such studies arrive at this conclusion is inconsistent. To pave the way for more reliable results in future studies, we lay out best practices for benchmarking ASR fairness based on literature from machine learning fairness, social sciences, and speech science. We first describe the importance of precisely defining the fairness hypothesis being interrogated, and of tailoring fairness metrics to apply specifically to said hypothesis. We then examine several benchmarks used to rate ASR systems on fairness and discuss how their results can be misconstrued without assiduous oversight into the intersections between SG’s. We find that evaluating fairness based on single heterogeneous SG’s, such as they are defined in fairness benchmarks, can lead to misidentifying which SG’s are actually being mistreated by ASR systems. We advocate for as fine-grained an analysis as possible of the intersectionality of as many demographic variables as are available in the metadata of fairness corpora, in order to tease out such spurious correlations.

[NLP-30] Measuring Embedding Sensitivity to Authorial Style in French: Comparing Literary Texts with Language Model Rewritings

【速读】: 该论文旨在解决生成式 AI(Generative AI)时代下作者风格信息在语言模型嵌入表示中的编码与保留问题,特别是当文本经过大语言模型(Large Language Models, LLMs)重写后,其原始作者风格特征是否仍可被检测。解决方案的关键在于利用受控的法语文本数据集,通过量化嵌入空间中风格差异导致的嵌入分散度(embedding dispersion)变化,实证表明:即便在LLM重写之后,嵌入仍能可靠地保留作者特有的风格信号,并呈现出模型特异性的模式,为基于嵌入的作者仿冒检测提供了可行路径。

链接: https://arxiv.org/abs/2605.10606
作者: Benjamin Icard,Lila Sainero,Alice Breton,Evangelia Zve,Jean-Gabriel Ganascia
机构: LIP6, Sorbonne University, CNRS, France; Infopro Digital, France
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: To appear in the Proceedings of the 6th International Conference on Natural Language Processing for the Digital Humanities (NLP4DH 2026)

点击查看摘要

Abstract:Large language models (LLMs) can convincingly imitate human writing styles, yet it remains unclear how much stylistic information is encoded in embeddings from any language model and retained after LLM rewriting. We investigate these questions in French, using a controlled literary dataset to quantify the effect of stylistic variation via changes in embedding dispersion. We observe that embeddings reliably capture authorial stylistic features and that these signals persist after rewriting, while also exhibiting LLM-specific patterns. These analytical results offer promising directions for authorship imitation detection in the era of language models.
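
A small sketch of an embedding-dispersion measure of the kind used to quantify stylistic signal: per group of texts, the mean cosine distance to the centroid is computed, and pooling two authors should increase dispersion if the embedding space separates their styles. The embeddings below are random stand-ins, and the specific dispersion statistic used in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def dispersion(embeddings: np.ndarray) -> float:
    """Mean cosine distance of embeddings to their centroid."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroid = e.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    return float(1.0 - (e @ centroid).mean())

# Stand-ins: sentence embeddings of passages by two authors, before or after LLM rewriting.
author_a = rng.normal(loc=0.0, size=(50, 384))
author_b = rng.normal(loc=0.3, size=(50, 384))      # shifted to mimic a distinct style

within_a = dispersion(author_a)
pooled = dispersion(np.vstack([author_a, author_b]))
print(f"within-author dispersion: {within_a:.4f}")
print(f"pooled (two-author) dispersion: {pooled:.4f}")
# A pooled dispersion clearly above the within-author value indicates that the
# embedding space separates the two styles; repeating the comparison on rewritten
# texts shows how much of that separation survives.
```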

[NLP-31] Where do aspectual variants of light verb constructions belong?

【速读】: 该论文旨在解决含轻动词(light verb)的表达式(如“take on debt”与“have debt”)在文本中频繁出现但难以准确归类为动词习语(verbal idioms)、轻动词结构(light verb constructions)或组合性短语(compositional phrases)的问题。其解决方案的关键在于提出一组特征选择机制,通过这些特征能够更清晰地界定三类结构之间的边界,并据此对争议性表达进行合理分类。

链接: https://arxiv.org/abs/2605.10605
作者: Aggeliki Fotopoulou,Eric Laporte,Takuya Nakamura
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Expressions with an aspectual variant of a light verb, e.g. ‘take on debt’ vs. ‘have debt’, are frequent in texts but often difficult to classify between verbal idioms, light verb constructions or compositional phrases. We investigate the properties of such expressions with a disputed membership and propose a selection of features that determine more satisfactory boundaries between the three categories in this zone, assigning the expressions to one of them.

[NLP-32] VISTA: A Generative Egocentric Video Framework for Daily Assistance

【速读】: 该论文旨在解决训练AI代理(AI agent)在日常任务中主动辅助人类时面临的高质量视觉数据稀缺问题,尤其是在真实世界中采集此类数据存在困难、成本高或不安全的情况下。传统物理仿真器因视觉保真度不足,难以将学习到的行为有效迁移到现实场景。为此,论文提出VISTA视频合成系统,其核心在于采用五步脚本生成管道结合因果反向推理(causal reverse reasoning),生成多样化且逻辑自洽的干预模式,涵盖反应式(reactive)和主动式(proactive)两种代理自主性层级,并进一步细化主动式为显式(explicit)与隐式(implicit)两类,从而构建可定制、可控的高保真第一人称视角视频基准数据集,为AI代理在真实环境中的训练与评估提供高效替代方案。

链接: https://arxiv.org/abs/2605.10579
作者: Yu-Hsiang Liu,Yu-Chien Tang,An-Zi Yen
机构: National Yang Ming Chiao Tung University (国立阳明交通大学)
类目: Computation and Language (cs.CL)
备注: pre-print

点击查看摘要

Abstract:Training AI agents to proactively assist humans in daily activities, from routine household tasks to urgent safety situations, requires large-scale visual data. However, capturing such scenarios in the real world is often difficult, costly, or unsafe, and physics-based simulators lack the visual fidelity needed to transfer learned behaviors to real settings. Therefore, we introduce VISTA, a video synthesis system that produces high-fidelity egocentric videos as training and evaluation data for AI agents. VISTA employs a 5-step script generation pipeline with causal reverse reasoning to create diverse, logically grounded intervention modes. These scenarios span two levels of agent autonomy: reactive and proactive. In reactive modes, the user explicitly asks the agent for help. In proactive modes, the agent offers help without receiving a direct request. We further divide proactive modes into explicit and implicit types. In explicit proactive scenarios, the user is aware of needing help but does not directly address the agent. In implicit proactive scenarios, the agent intervenes before the user even realizes that help is needed. VISTA allows users to customize and refine scenarios to generate video benchmarks for daily tasks, offering a scalable and controllable alternative to real-world data collection for training and evaluating AI agents in realistic environments.

[NLP-33] ThreatCore: A Benchmark for Explicit and Implicit Threat Detection

【速读】: 该论文旨在解决自然语言处理(Natural Language Processing, NLP)中威胁检测缺乏统一定义与标准化基准的问题,尤其针对显性威胁、隐性威胁与非威胁类别区分不清的现状。其核心解决方案是构建ThreatCore数据集,通过整合多个公开资源并基于统一的操作性威胁定义进行重新标注,揭示了现有标签体系中的显著不一致性;同时为提升隐性威胁等低频样本的覆盖度,采用人工验证的合成数据增强策略确保标注一致性。实验表明,隐性威胁仍远难于检测,而引入语义角色标注(Semantic Role Labeling, SRL)作为中间表示可有效提升模型对有害意图结构的识别能力,从而为细粒度威胁检测提供更可靠、一致的基准和改进方向。

链接: https://arxiv.org/abs/2605.10563
作者: Davide Bruni,Carlo Bardazzi,Maurizio Tesconi
机构: University of Pisa (比萨大学); National Research Council (国家研究委员会)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Threat detection in Natural Language Processing lacks consistent definitions and standardized benchmarks, and is often conflated with broader phenomena such as toxicity, hate speech, or offensive language. In this work, we introduce ThreatCore, a publicly available benchmark dataset for fine-grained threat detection that distinguishes between explicit threats, implicit threats, and non-threats. The dataset is constructed by aggregating multiple publicly available resources and systematically re-annotating them under a unified operational definition of threat, revealing substantial inconsistencies across existing labels. To improve the coverage of underrepresented cases, particularly implicit threats, we further augment the dataset with synthetic examples, which are manually validated using the same annotation protocol adopted for the re-annotation of the public datasets, ensuring consistency across all data sources. We evaluate Perspective API, zero-shot classifiers, and recent language models on ThreatCore, showing that implicit threats remain substantially harder to detect than explicit ones. Our results also indicate that incorporating Semantic Role Labeling as an intermediate representation can improve performance by making the structure of harmful intent more explicit. Overall, ThreatCore provides a more consistent benchmark for studying fine-grained threat detection and highlights the challenges that current models still face in identifying indirect expressions of harmful intent.

[NLP-34] ICT-NLP at SemEval-2026 Task 3: Less Is More – Multilingual Encoder with Joint Training and Adaptive Ensemble for Dimensional Aspect Sentiment Regression

【速读】: 该论文针对维度情感回归(Dimensional Aspect Sentiment Regression, DimASR)任务,旨在实现跨语言和跨领域的细粒度情感分析,解决因数据稀疏性导致的模型性能下降问题。其解决方案的关键在于:(1)采用联合多语言与多领域训练策略,增强跨语言迁移能力并缓解低资源语种的数据稀缺问题;(2)引入有界回归变换(bounded regression transformation),提升训练稳定性并确保预测值始终落在合法范围内;(3)设计基于子集搜索的自适应集成策略(adaptive ensemble strategy),有效降低预测方差,从而在多个评测数据集上实现稳定且领先的性能表现。

链接: https://arxiv.org/abs/2605.10560
作者: Liyuan Huang,Jiawei He,Wutao Shen,Lin Li,Jin Zhang
机构: State Key Laboratory of AI Safety; Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper describes our system for SemEval-2026 Task 3 Track A Subtask 1 on Dimensional Aspect Sentiment Regression (DimASR). We propose a lightweight and resource-efficient system built entirely on multilingual pre-trained encoders, without relying on LLMs or external corpora. We adopt joint multilingual and multi-domain training to facilitate cross-lingual transfer and alleviate data sparsity, introduce a bounded regression transformation that improves training stability while constraining predictions within the valid range, and employ an adaptive ensemble strategy via subset search to reduce prediction variance. Experimental results demonstrate that our system achieves strong and consistent performance, ranking 1st on zho-res, 2nd on zho-lap, and 3rd on jpn-hot, with all remaining datasets placed within the top half of participating teams.

[NLP-35] Multi-domain Multi-modal Document Classification Benchmark with a Multi-level Taxonomy

【速读】: 该论文旨在解决现有文档分类基准在现实工业场景中适用性不足的问题,即当前主流 benchmarks 仍停留在单一领域、扁平标签结构的简化范式,难以反映真实业务文档所具有的层级化、多模态和跨域特性。为填补这一差距,作者构建了首个多层级、多领域、多模态文档分类基准(Multi-level, Multi-domain, Multi-modal document classification Benchmark, MMM-Bench),其关键在于:(1) 设计了一个包含五层深度的层级化分类体系,准确映射企业文档的组织逻辑;(2) 收集并标注了来自阿里巴巴12个商业领域的5,990份真实多模态文档,每份文档均由领域专家手工标注完整的层级路径,从而形成具有高保真度与实用价值的数据集与评估工具链,为推动工业级文档智能研究奠定基础。

链接: https://arxiv.org/abs/2605.10550
作者: Denghao Ma,Qing Liu,Zulong Chen,Chuanfei Xu,Jia Xu,Zhibo Yang,Zhao Li
机构: Beijing Information Science and Technology University (北京信息科技大学); Alibaba Group (阿里巴巴集团); Guangdong Laboratory of Artificial Intelligence and Digital Economy (深圳) (广东省人工智能与数字经济实验室(深圳)); Guangzhou University (广州大学); Zhejiang Lab (浙江省实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Document classification forms the backbone of modern enterprise content management, yet existing benchmarks remain trapped in oversimplified paradigms – single domain settings with flat label structures – that bear little resemblance to the hierarchical, multi-modal, and cross-domain nature of real-world business documents. This gap not only misrepresents practical complexity but also stifles progress toward industrially viable document intelligence. To bridge this gap, we construct the first Multi-level, Multi-domain, Multi-modal document classification Benchmark (MMM-Bench). MMM-Bench includes (1) a deeply hierarchical taxonomy spanning five levels that capture the authentic organizational logic of business documentation; and (2) 5,990 real-world multi-modal documents meticulously curated from 12 commercial domains in Alibaba. Each document is manually annotated with a complete hierarchical path by domain experts. We establish comprehensive baselines on MMM-Bench, which consists of open-weight models and API-based models. Through systematic experiments, we identify four fundamental challenges within MMM-Bench and propose corresponding insights. To provide a solid foundation for advancing research in multi-level, multi-domain document classification, we release all of the data and the evaluation toolkit of MMM-Bench at this https URL.

[NLP-36] Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing

【速读】: 该论文旨在解决长上下文适应(long-context adaptation)中因训练数据打包(packed training)与文档掩码(document masking)导致的token级监督不匹配问题,即目标token的有效上下文长度在训练过程中仍受限,从而影响模型对长距离依赖关系的学习能力。其解决方案的关键在于提出EXACT(Extra-weighted Context-aware Training),一种基于监督分配的目标函数,通过在长尾分布中按逆频率对具有长有效上下文的目标token赋予额外权重,强化模型对远距离证据的响应能力。实验表明,该方法在多个Qwen和LLaMA配置下均显著提升NoLiMa和RULER等长上下文基准表现,同时保持标准问答与推理任务性能稳定,验证了监督强度对长上下文预测能力的关键作用。

链接: https://arxiv.org/abs/2605.10544
作者: Jinchang Zhu,Jindong Li,Chengyu Zou,Rong Fu,Chao Wang,Haowei He,Menglin Yang
机构: The Hong Kong University of Science and Technology (Guangzhou); Institute of Artificial Intelligence (TeleAI), China Telecom
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Long-context adaptation is often viewed as window scaling, but this misses a token-level supervision mismatch: in packed training with document masking, each target token’s effective context remains short. We introduce EXACT, a supervision-allocation objective that assigns extra weight to long effective-context targets by inverse frequency within the long tail. Across seven Qwen/LLaMA CPT configurations, EXACT improves all 28 trained/extrapolated NoLiMa and RULER comparisons. On Qwen2.5-0.5B, NoLiMa improves by +10.09 (trained) and +5.34 (extrapolated); RULER by +10.69 and +5.55. On LLaMA-3.2-3B, RULER improves by +17.91 and +16.11. Standard QA/reasoning are preserved (+0.24 macro change across six benchmarks). A distance-resolved probe shows gains arise when evidence is thousands of tokens away, while short cases remain unchanged. Results support a supervision-centric thesis: long-context adaptation depends on how strongly training supervises long-context predictions.
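
As a rough illustration of the supervision-allocation idea described above, the sketch below upweights the loss of target tokens whose effective context falls in a rare (long) bucket, using inverse bucket frequency. The bucket edges, clipping, and normalization are illustrative assumptions, not EXACT's published weighting scheme.

```python
# Toy sketch of supervision re-weighting by effective context length: target tokens whose
# effective context falls in a rare (long) bucket receive larger loss weights via inverse
# bucket frequency. Bucket edges and the weight formula are assumptions for illustration;
# the paper's exact scheme may differ.
import torch
import torch.nn.functional as F

def exposure_balanced_loss(logits, targets, effective_ctx_len, bucket_edges=(256, 1024, 4096)):
    """logits: (N, vocab), targets: (N,), effective_ctx_len: (N,) tokens of usable context."""
    buckets = torch.bucketize(effective_ctx_len, torch.tensor(bucket_edges))
    counts = torch.bincount(buckets, minlength=len(bucket_edges) + 1).clamp(min=1).float()
    inv_freq = counts.sum() / counts            # rare (long-context) buckets get big weights
    weights = inv_freq[buckets]
    weights = weights / weights.mean()          # keep the overall loss scale unchanged
    per_token = F.cross_entropy(logits, targets, reduction="none")
    return (weights * per_token).mean()

logits = torch.randn(8, 100)
targets = torch.randint(0, 100, (8,))
ctx = torch.tensor([50, 60, 80, 120, 300, 900, 2000, 8000])
print(exposure_balanced_loss(logits, targets, ctx))
```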

[NLP-37] Mela: Test-Time Memory Consolidation based on Transformation Hypothesis

【速读】: 该论文旨在解决现代序列模型在处理长序列时缺乏有效记忆机制的问题,特别是如何在不显著增加计算负担的前提下,实现对不同粒度信息的动态存储与重构。其解决方案的关键在于提出Hierarchical Memory Module (HMM),该模块基于神经科学中的记忆巩固(memory consolidation)和跨频耦合(cross-frequency coupling)理论,设计了两个更新频率不同的子模块:低频子模块生成抽象的、概括性的高阶表征,高频子模块保留细粒度的、具体的事件细节;最终通过上下文相关的组合方式动态重建记忆输出,模拟人类记忆的重构特性。在此基础上构建的Mela模型实现了在线记忆巩固,并引入MemStack方法将多粒度记忆特征分布于解码器早期层中,从而在语言建模任务中显著优于传统Transformer基线,且在固定预训练长度(4K)下仍能保持对更长上下文的性能稳定性。

链接: https://arxiv.org/abs/2605.10537
作者: Lungchuan Chen
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Memory consolidation, the process by which transient experiences are transformed into stable, structured representations, is a foundational organizing principle in the human brain, yet it remains largely unexplored as a design principle for modern sequence models. In this work, we leverage established neuroscientific theories of memory consolidation and cross-frequency coupling to propose the Hierarchical Memory Module (HMM), a neural memory architecture composed of two functionally distinct sub-modules that operate at different update frequencies. Inspired by the transformation hypothesis, the low-frequency sub-module produces high-level representations that capture abstract, gist-level knowledge, while the high-frequency sub-module produces fine-grained representations that preserve richer episodic detail. The final memory output is dynamically reconstructed as a context-dependent combination of both representations, analogous to the reconstructive nature of human memory retrieval. We integrate HMM into a Transformer-based language decoder to form Mela, a family of memory-augmented language models that perform online memory consolidation at test time. To further exploit the multi-granularity memory representations produced by HMM, we introduce MemStack, a method that distributes different levels of memory features across the early layers of the decoder without introducing additional tokens. Experiments on language modeling demonstrate that Mela outperforms Transformer baselines across all the model sizes. Moreover, with the pretrained context length fixed at 4K, Mela maintains performance on significantly longer contexts, whereas Transformer baselines degrade rapidly beyond their training length. Extensive ablation studies validate the contribution of each component and provide guidance for practical configuration.
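
The following toy module conveys the two-rate idea in the abstract (a fast, fine-grained state and a slow, abstract state combined at read time). It is a schematic stand-in under assumed choices (GRU cells, a fixed slow-update period, a sigmoid gate), not the paper's HMM architecture or MemStack integration.

```python
# Toy two-rate memory: a fast state updated every chunk and a slow state updated every
# `slow_every` chunks, combined by a learned gate. Illustrative only; not the HMM design.
import torch
import torch.nn as nn

class TwoRateMemory(nn.Module):
    def __init__(self, dim, slow_every=4):
        super().__init__()
        self.slow_every = slow_every
        self.fast_update = nn.GRUCell(dim, dim)
        self.slow_update = nn.GRUCell(dim, dim)
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, chunk_summaries):              # (T, B, dim), one summary per chunk
        T, B, dim = chunk_summaries.shape
        fast = torch.zeros(B, dim)
        slow = torch.zeros(B, dim)
        for t in range(T):
            x = chunk_summaries[t]
            fast = self.fast_update(x, fast)         # fine-grained, high-frequency state
            if (t + 1) % self.slow_every == 0:
                slow = self.slow_update(fast, slow)  # abstract, low-frequency state
        g = torch.sigmoid(self.gate(torch.cat([fast, slow], dim=-1)))
        return g * fast + (1 - g) * slow             # context-dependent combination

mem = TwoRateMemory(dim=32)
print(mem(torch.randn(10, 2, 32)).shape)             # torch.Size([2, 32])
```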

[NLP-38] Infinite Mask Diffusion for Few-Step Distillation

【速读】: 该论文旨在解决掩码扩散模型(Masked Diffusion Models, MDMs)在语言建模中因因子分解误差(factorization error)导致采样步数过多的问题。MDMs虽然具备并行解码和双向上下文处理的优势,但由于其使用确定性的单状态掩码(deterministic single-state mask),无法突破理论上的因子分解误差下界,从而限制了少步生成(few-step generation)能力。解决方案的关键在于提出无限掩码扩散模型(Infinite Mask Diffusion Model, IMDM),通过引入随机的无限状态掩码(stochastic infinite-state mask)来缓解这一理论下界,同时保留MDMs原有的优势,如与预训练权重的兼容性。实验证明,IMDM在简单合成任务中可实现高效少步生成,且在LM1B和OpenWebText数据集上结合适当蒸馏方法后,在小步数条件下优于现有少步蒸馏方法。

链接: https://arxiv.org/abs/2605.10518
作者: Jaehoon Yoo,Wonjung Kim,Chanhyuk Lee,Seunghoon Hong
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Masked Diffusion Models (MDMs) have emerged as a promising alternative to autoregressive models in language modeling, offering the advantages of parallel decoding and bidirectional context processing within a simple yet effective framework. Specifically, their explicit distinction between masked tokens and data underlies their simple framework and effective conditional generation. However, MDMs typically require many sampling iterations due to factorization errors stemming from simultaneous token updates. We observe that a theoretical lower bound of the factorization error exists, which standard MDMs cannot reduce due to their use of a deterministic single-state mask. In this paper, we propose the Infinite Mask Diffusion Model (IMDM), which introduces a stochastic infinite-state mask to mitigate the theoretical bound while directly inheriting the benefits of MDMs, including the compatibility with pre-trained weights. We empirically demonstrate that MDM fails to perform few-step generation even in a simple synthetic task due to the factorization error bound, whereas IMDM can find an efficient solution for the same task. Finally, when equipped with appropriate distillation methods, IMDM surpasses existing few-step distillation methods at small step counts on LM1B and OpenWebText. Code is available at this https URL.

[NLP-39] Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

【速读】: 该论文旨在解决生成式 AI(Generative AI)模型在预训练过程中出现的上层注意力机制过早特化问题,即上层注意力模式在底层特征尚未稳定时就过早聚焦于不成熟的残差基底(residual basis),从而导致训练效率下降和性能受损。解决方案的关键在于:在训练初期仅对上层的查询(Q)和键(K)投影进行临时减速,以防止上层注意力过早收敛到不稳定的残差表示;同时发现乘性门控前馈网络(multiplicative gated FFNs)通过抑制上游残差写入信号,可有效缓解该问题。路径分析进一步表明,两种干预措施分别作用于同一增长路径上的不同因子——学习率调整降低步长因子,而门控FFN降低残差能量因子,从而统一解释了两种现象的本质。

链接: https://arxiv.org/abs/2605.10504
作者: Jinchang Zhu,Jindong Li,Yuwen Hao,Chengyu Zou,Rong Fu,Menglin Yang
机构: The Hong Kong University of Science and Technology (Guangzhou)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:A causal-decoder block is hierarchical: lower layers build the residual basis that upper layers attend over. We identify a failure mode in GPT pretraining: upper layers commit to sharp attention patterns before lower-layer features stabilize. We call this premature upper-layer attention specialization. Temporarily slowing only upper-layer Q/K projections during early training improves final perplexity and downstream accuracy without altering other parameters; it prevents upper attention from collapsing onto an immature residual basis. In LLaMA-style blocks, the same intervention is nearly unnecessary. Through ablations, we isolate multiplicative gated FFNs (not RMSNorm or bias removal) as the component that suppresses the upstream residual writes driving the failure. A pathwise analysis unifies both findings: the learning-rate intervention reduces a step-size factor, while gated FFNs reduce a residual-energy factor on the same growth pathway. Our results identify upper-layer Q/K timing as a concrete interaction point between decoder architecture and optimization.
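
The learning-rate intervention described above can be pictured as placing upper-layer Q/K projection weights in their own optimizer parameter group with a temporarily reduced learning rate. The layer threshold, scale factor, and duration below are illustrative assumptions; the paper's exact settings are not reproduced here.

```python
# Sketch: slow down upper-layer Q/K projections during early training via parameter groups.
# TinyBlock, the 50% upper-layer cutoff, and the 0.1 scale are hypothetical choices.
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.q_proj = nn.Linear(d, d)
        self.k_proj = nn.Linear(d, d)
        self.mlp = nn.Linear(d, d)

def build_param_groups(model, n_layers, upper_frac=0.5, base_lr=3e-4, slow_scale=0.1):
    upper_start = int(n_layers * (1 - upper_frac))
    slow, regular = [], []
    for name, p in model.named_parameters():
        layer_id = next((int(tok) for tok in name.split(".") if tok.isdigit()), -1)
        is_upper_qk = ("q_proj" in name or "k_proj" in name) and layer_id >= upper_start
        (slow if is_upper_qk else regular).append(p)
    return [{"params": regular, "lr": base_lr},
            {"params": slow, "lr": base_lr * slow_scale}]  # slowed only during early training

model = nn.ModuleDict({"layers": nn.ModuleList(TinyBlock(16) for _ in range(4))})
optimizer = torch.optim.AdamW(build_param_groups(model, n_layers=4))
# After the early-training phase, the slowdown would be removed, e.g.:
# optimizer.param_groups[1]["lr"] = 3e-4
```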

[NLP-40] DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning

【速读】: 该论文旨在解决由大型语言模型(Large Language Model, LLM)代理构建的知识库在开放性、知识密集型下游任务中因不完备性(incompleteness)、错误性(incorrectness)和冗余性(redundancy)而导致的检索准确率下降与任务性能劣化问题。解决方案的关键在于提出一种名为DeepRefine的通用LLM推理模型,其通过多轮交互式诊断与修复机制实现对代理编译知识库的增量优化:首先基于交互历史进行溯因诊断(abductive diagnosis),定位潜在缺陷;随后执行针对性的细化操作(refinement actions)以提升知识质量;并引入无需黄金标注的Gain-Beyond-Draft(GBD)奖励信号,结合强化学习端到端训练整个推理流程,从而显著改善下游任务表现。

链接: https://arxiv.org/abs/2605.10488
作者: Haoyu Huang,Jiaxin Bai,Shujie Liu,Yang Wei,Hong Ting Tsang,Yisen Gao,Zhongwei Xie,Yufei Li,Yangqiu Song
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agent-compiled knowledge bases provide persistent external knowledge for large language model (LLM) agents in open-ended, knowledge-intensive downstream tasks. Yet their quality is systematically limited by incompleteness, incorrectness, and redundancy, manifested as missing evidence or cross-document links, low-confidence or imprecise claims, and ambiguous or coreference resolution issues. Such defects compound under iterative use, degrading retrieval fidelity and downstream task performance. We present DeepRefine, a general LLM-based reasoning model for agent-compiled knowledge refinement that improves the quality of any pre-constructed knowledge base with user queries to make it more suitable for the downstream tasks. DeepRefine performs multi-turn interactions with the knowledge base and conducts abductive diagnosis over interaction history, localizes likely defects, and executes targeted refinement actions for incremental knowledge base updates. To optimize refinement policies of DeepRefine without gold references, we introduce a Gain-Beyond-Draft (GBD) reward and train the reasoning process end-to-end via reinforcement learning. Extensive experiments demonstrate consistent downstream gains over strong baselines.

[NLP-41] Coherency through formalisations of Structured Natural Language: A case study on FRETish

【速读】: 该论文旨在解决系统需求形式化过程中不同抽象层级(自然语言、技术语言、图示表示和形式语言)之间逻辑结构不一致的问题,这可能导致形式验证失败或难以维护。其解决方案的关键在于提出“形式化一致性”(Coherency through Formalisations)的新准则,即要求各层级的描述应保持大致相同的逻辑结构;基于此准则,作者改进了NASA FRET工具中受控自然语言FRETish到时序逻辑(MTL)的形式化翻译方法,并通过模型检测证明新译文与原译文的等价性,同时揭示并修正了原有翻译中存在的不一致性问题。

链接: https://arxiv.org/abs/2605.10462
作者: Joost J. Joosten,Marina López Chamosa,Sofía Santiago Fernández
机构: Universitat de Barcelona(巴塞罗那大学); Centre de Recerca Matemàtica(数学研究中心); Formal Vindications S.L.(形式辩护有限公司)
类目: Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Formalisation is the process of writing system requirements in a formal language. These requirements mostly originate in Natural Language. In the field of Formal Methods, formalisation is often identified as one of the most delicate and complicated steps in the verification process. Not infrequently, formalisation tools and environments choose various levels of requirement descriptions: Natural Language, Technical Language, Diagram Representations and Formal Language, to mention a few. In the literature, there are various maxims and principles of good practice to guide the process of requirement formalisation. In this paper we propose a new guideline: Coherency through Formalisations. The guideline states that the different levels of formalisation mentioned above should roughly follow the same logical structure. The principle seems particularly relevant in the setting where LLMs are prompted to perform reasoning tasks that can be checked by formal tools using Structured Natural Language to act as an intermediate layer bridging both paradigms. In the light of coherency, we analyze NASA's Formal Requirement Elicitation Tool FRET and propose an alternative automated translation of the Controlled Natural Language FRETish to the formal language of MTL. We compare our translation to the original translation and prove equivalence using model checking. Some statistics are performed which seem to favor the new translation. As expected, the translation process yielded interesting reflections and revealed inconsistencies which we present and discuss.

[NLP-42] SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding

【速读】: 该论文旨在解决生成式 AI(Generative AI)中基于推测解码(speculative decoding)的轻量级草稿模型(drafter model)在语言模型头部(LM-head)投影至大词汇表时产生的计算瓶颈问题。现有方法多依赖静态或动态词汇截断,但引入额外复杂性,如特殊词汇筛选、复杂的推理时逻辑或训练设置修改。解决方案的关键在于提出 SlimSpec,一种对草稿模型 LM-head 的低秩参数化方法,通过压缩内部表示而非输出空间来实现高效压缩,同时保留完整词汇支持;该方法仅需极少调整训练与推理流程,在多种目标模型和基准测试中实现 4–5 倍加速,且端到端速度提升较现有方法高出 8–9%,显著优于传统方案。

链接: https://arxiv.org/abs/2605.10453
作者: Anton Plaksin,Sergei Krutikov,Sergei Skvortsov,Alexander Samarin
机构: Nebius(尼比斯)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Speculative decoding speeds up autoregressive generation in Large Language Models (LLMs) through a two-step procedure, where a lightweight draft model proposes tokens which the target model then verifies in a single forward pass. Although the drafter network is small in modern architectures, its LM-head still performs projection to a large vocabulary, becoming one of the major computational bottlenecks. In prior work this issue has been predominantly addressed via static or dynamic vocabulary truncation. Yet, while mitigating the bottleneck, these methods bring in extra complexity, such as special vocabulary curation, sophisticated inference-time logic or modifications of the training setup. In this paper, we propose SlimSpec, a low-rank parameterization of the drafter's LM-head that compresses the inner representation rather than the output, preserving full vocabulary support. We evaluate our method with the EAGLE-3 drafter across three target models and diverse benchmarks in both latency- and throughput-bound inference regimes. SlimSpec achieves 4-5× acceleration over the standard LM-head architecture while maintaining a competitive acceptance length, surpassing existing methods by up to 8-9% of the end-to-end speedup. Our method requires minimal adjustments of training and inference pipelines. Combined with the aforementioned speedup improvements, this makes SlimSpec a strong alternative across a wide variety of draft LM-head architectures.
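
The core idea of a low-rank LM-head is easy to picture: factor the vocabulary projection into a narrow compression step followed by an expansion to the full vocabulary. The sketch below shows this general construction only; the dimensions, rank, and EAGLE-3 integration details are hypothetical, not the paper's exact configuration.

```python
# Minimal sketch of a low-rank LM-head (illustrative, not the paper's exact architecture).
# A full head projects hidden states (dim d) to vocabulary logits (size V) with a V x d
# matrix; the low-rank variant first compresses the hidden state to rank r, then expands.
import torch
import torch.nn as nn

class LowRankLMHead(nn.Module):
    def __init__(self, hidden_dim, vocab_size, rank):
        super().__init__()
        self.down = nn.Linear(hidden_dim, rank, bias=False)   # d -> r compression
        self.up = nn.Linear(rank, vocab_size, bias=False)     # r -> V projection

    def forward(self, hidden_states):
        return self.up(self.down(hidden_states))              # logits over the full vocabulary

# Rough parameter comparison for a drafter-sized head (numbers are hypothetical).
d, V, r = 4096, 128_000, 512
print((V * d) / (r * (d + V)))   # ~7.8x fewer head parameters in this example
```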

[NLP-43] StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs

【速读】: 该论文旨在解决多语言环境下开放生成式大模型(Generative AI)中社会偏见的系统性研究不足问题,现有基准测试大多局限于英语、基于模板或仅能识别预设刻板印象,难以全面揭示模型在跨语言场景下偏见的涌现机制。其解决方案的关键在于构建了StereoTales数据集与评估流程:涵盖10种语言和79个社会人口学属性,包含23个近期大模型生成的超65万条故事,并对每个主角进行19维社会人口特征标注;通过统计检验识别出超过1500个高频率关联,并结合人工评分(N=247)与模型自身判断进行有害性评估,从而揭示偏见在不同语言提示下的文化适应性及其跨模型一致性。

链接: https://arxiv.org/abs/2605.10442
作者: Pierre Le Jeune,Étienne Duchesne,Weixuan Xiao,Stefano Palminteri,Bazire Houssin,Benoît Malézieux,Matteo Dora
机构: Giskard AI; École Normale Supérieure; INSERM
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Multilingual studies of social bias in open-ended LLM generation remain limited: most existing benchmarks are English-centric, template-based, or restricted to recognizing pre-specified stereotypes. We introduce StereoTales, a multilingual dataset and evaluation pipeline for systematically studying the emergence of social bias in open-ended LLM generation. The dataset covers 10 languages and 79 socio-demographic attributes, and comprises over 650k stories generated by 23 recent LLMs, each annotated with the socio-demographic profile of the protagonist across 19 dimensions. From these, we apply statistical tests to identify more than 1,500 over-represented associations, which we then rate for harmfulness through both a panel of humans (N = 247) and the same LLMs. We report three main findings. (i) Every model we evaluate emits consequential harmful stereotypes in open-ended generation, regardless of size or capabilities, and these associations are largely shared across providers rather than isolated misbehaviors. (ii) Prompt language strongly shapes which stereotypes appear: rather than transferring as a shared set of biases, harmful associations adapt culturally to the prompt language and amplify bias against locally salient protected groups. (iii) Human and LLM harmfulness judgments are broadly aligned (Spearman ρ = 0.62), with disagreements concentrating on specific attribute classes rather than specific providers. To support further analyses, we release the evaluation code and the dataset, including model generations, attribute annotations, and harmfulness ratings.
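
One simple way to test for an over-represented association, in the spirit of the statistical tests mentioned above, is a one-sided binomial test comparing an attribute's rate within one demographic group against its overall rate. The test choice, counts, and threshold below are illustrative assumptions; the paper's actual procedure may differ.

```python
# Sketch of an over-representation test: is an attribute assigned to protagonists of one
# group more often than its overall rate across all generated stories? Illustrative only.
from scipy.stats import binomtest

def over_represented(count_in_group, group_size, overall_rate, alpha=0.01):
    result = binomtest(count_in_group, group_size, overall_rate, alternative="greater")
    return result.pvalue < alpha, result.pvalue

# Hypothetical counts: the attribute appears in 60 of 400 stories about group A,
# but only 5% of the time across all stories.
flag, p = over_represented(60, 400, 0.05)
print(flag, p)
```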

[NLP-44] Can Language Models Analyze Data? Evaluating Large Language Models for Question Answering over Datasets

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在数据问答任务中的有效性问题,具体聚焦于两类场景:一是直接基于数据集文件回答问题,二是根据关系型数据库的模式生成SQL查询来回答问题。其解决方案的关键在于系统性地评估不同规模的语言模型(包括最先进的LLMs与资源消耗更低的小模型)在两种任务场景下的性能表现,并分析不同提示策略(prompting strategies)对模型效果的影响。实验结果表明,LLMs在复杂数据问答任务中展现出强大能力,而小模型虽具成本优势但在准确性上存在明显局限,从而为LLMs在数据分析场景中的合理应用提供了实证依据和边界认知。

链接: https://arxiv.org/abs/2605.10419
作者: Andreas Xenofontos,Pavlos Fafalios
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted for publication in CARMA 2026 proceedings

点击查看摘要

Abstract:This paper investigates the effectiveness of large language models (LLMs) in answering questions over datasets. We examine their performance in two scenarios: (a) directly answering questions given a dataset file as input, and (b) generating SQL queries to answer questions given the schema of a relational database. We also evaluate the impact of different prompting strategies on model performance. The study includes both state-of-the-art LLMs and smaller language models that require fewer resources and operate at lower computational and financial cost. Experiments are conducted on two datasets containing questions of varying difficulty. The results demonstrate the strong performance of large LLMs, while highlighting the limitations of smaller, more cost-efficient models. These findings contribute to a better understanding of how LLMs can be utilized in data analytics tasks and their associated limitations.
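
For the second scenario described above (SQL generation from a schema), a typical setup supplies the model with the schema and the question and asks for a single executable query. The prompt wording, schema, and task below are hypothetical examples, not the paper's evaluation protocol.

```python
# Illustrative prompt construction for the schema-to-SQL scenario. The schema, wording,
# and example question are hypothetical; they are not taken from the paper's datasets.
SCHEMA = """
CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL, created_at DATE);
CREATE TABLE customers (id INTEGER, name TEXT, country TEXT);
"""

def build_sql_prompt(question, schema=SCHEMA):
    return (
        "You are given the schema of a relational database:\n"
        f"{schema}\n"
        "Write a single SQL query that answers the question. "
        "Return only the SQL, no explanation.\n"
        f"Question: {question}"
    )

print(build_sql_prompt("What is the average order total per country?"))
```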

[NLP-45] Aligning LLM Uncertainty with Human Disagreement in Subjectivity Analysis

【速读】: 该论文旨在解决大语言模型在主观性分析任务中因使用聚合标签(aggregated labels)而导致的过自信预测问题,即模型忽视了人类判断中的内在不确定性,尤其在低一致性样本上容易产生过度自信的输出,从而损害模型的可靠性与泛化能力。其解决方案的关键在于提出一种两阶段的“分歧感知与不确定性对齐”(Disagreement Perception and Uncertainty Alignment, DPUA)框架:第一阶段通过自适应解耦学习增强模型对分歧线索的敏感性并保持任务性能;第二阶段利用基于GRPO的奖励优化进一步提升不确定性推理能力,并使模型的置信度表达与人类分歧分布对齐,从而在不牺牲任务性能的前提下,显著改善模型在边界样本上的不确定性表征和跨分布泛化能力。

链接: https://arxiv.org/abs/2605.10415
作者: Junyu Lu,Deyi Ji,Xuanyi Liu,Lanyun Zhu,Bo Xu,Liang Yang,Hongfei Lin
机构: Dalian University of Technology (大连理工大学); Tencent (腾讯); Peking University (北京大学); Tongji University (同济大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models for subjectivity analysis are typically trained with aggregated labels, which compress variations in human judgment into a single supervision signal. This paradigm overlooks the intrinsic uncertainty of low-agreement samples and often induces overconfident predictions, undermining reliability and generalization in complex subjective settings. In this work, we advocate uncertainty-aware subjectivity analysis, where models are expected to make predictions while expressing uncertainty that reflects human disagreement. To operationalize this perspective, we propose a two-phase Disagreement Perception and Uncertainty Alignment (DPUA) framework. Specifically, DPUA jointly models label prediction, rationale generation, and uncertainty expression under an uncertainty-aware setting. In the disagreement perception phase, adaptive decoupled learning enhances the model’s sensitivity to disagreement-related cues while preserving task performance. In the uncertainty alignment phase, GRPO-based reward optimization further improves uncertainty-aware reasoning and aligns the model’s confidence expression with the human disagreement distribution. Experiments on three subjectivity analysis tasks show that DPUA preserves task performance while better aligning model uncertainty with human disagreement, mitigating overconfidence on boundary samples, and improving out-of-distribution generalization.

[NLP-46] Phoenix-VL 1.5 Medium Technical Report

【速读】: 该论文旨在解决大模型在特定区域语境(如新加坡)下本地化能力不足的问题,即如何在保持通用多模态智能和多语言能力的同时,实现对本地知识、文化及法规的深度适配。其解决方案的关键在于:首先通过本地化1万亿token的多模态语料对Mistral Medium 3.1进行持续预训练,并扩展至2500亿token的长上下文训练;随后引入一个全新标注的、聚焦新加坡的多模态数据集与结构化文本语料(共220亿token),并辅以50亿token的在线直接偏好优化(Online Direct Preference Optimization, ODPO)进行对齐训练,从而在不显著损害全球基准性能的前提下,显著提升模型在新加坡本地法律、政策与文化理解方面的表现。

链接: https://arxiv.org/abs/2605.10391
作者: Team Phoenix:Arka Ray,Askar Ali Mohamed Jawad,Biondi Lee,Elijah Seah,Eva Lim,Fiona Teo,Grace Toh,Guang Xiang Teo,Jun En Tan,Jia Hui Bong,Jiale Wang,Jonathan Ng,Justin Tan,Kai Zhe Yew,Matthew Ong,Shun Yi Yeo,Wen Jett Lam,Wen Xiu Tan,Ze Yu Zhang,Gee Wah Ng,Chee Wee Ang,Mistral AI:Adrien Sadé,Guillaume Kunsch,Jia Sin Loh,Nicolas Schuhl,Rupert Menneer,Umar Jamil,Vincent Maladière,Yimu Pan
机构: Mistral AI(神秘AI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Release page: this https URL

点击查看摘要

Abstract:We introduce Phoenix-VL 1.5 Medium, a 123B-parameter natively multimodal and multilingual foundation model, adapted to regional languages and the Singapore context. Developed as a sovereign AI asset, it demonstrates that deep domain adaptation can be achieved with minimal degradation to broad-spectrum intelligence and alignment. Continued pretraining was performed on Mistral Medium 3.1 using a localized 1-trillion tokens multimodal corpus, followed by a 250-billion tokens long-context extension phase. Subsequent post-training incorporated a novel human-annotated Singapore multimodal dataset and curated textual corpus on Singapore culture, knowledge, and legislation, totaling 22-billion tokens. An additional 5 billion tokens of model alignment was performed through Online Direct Preference Optimization. Phoenix-VL 1.5 Medium achieves state-of-the-art performance for its size on Singapore multimodal, legal, and government policy benchmarks while remaining globally competitive on general multimodal intelligence, multilingual, and STEM benchmarks. We also introduce a novel evaluation suite encompassing localized knowledge benchmarks and an institutionally aligned model behavior and safety framework. We report the data curation principles, training methodology, and highlight benchmark and inference performance.

[NLP-47] Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness NEURIPS

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在数学证明生成中仅关注正确性而忽视证明质量的问题。尽管LLMs能够生成正确的数学证明,但高质量的证明还应具备简洁性、可计算性、认知简单性、多样性及适应性等特征,这些特征共同决定了证明的清晰度、可迁移性和实用性。解决方案的关键在于提出ProofRank基准,该基准从高难度数学竞赛中精心构建,用于量化评估上述五类可扩展的证明质量代理指标:(i) 简洁性(避免冗余步骤)、(ii) 计算易用性(减少繁琐计算依赖)、(iii) 认知简单性(技术方法的可访问性)、(iv) 多样性(同一问题下不同证明的变异性)以及(v) 适应性(遵循指定证明策略的能力)。实验表明,不同模型在这些质量维度上存在显著差异,且与正确性之间存在权衡关系,凸显了未来数学推理评估需兼顾实用性与质量的重要性。

链接: https://arxiv.org/abs/2605.10379
作者: Ivo Petrov,Jasper Dekoninck,Dimitar I. Dimitrov,Martin Vechev
机构: INSAIT, Sofia University “St. Kliment Ohridski”; ETH Zurich
类目: Computation and Language (cs.CL)
备注: 9 main text pages, 36 total pages, In proceedings to 2026 NeurIPS Evaluations and Datasets Track

点击查看摘要

Abstract:Large language models (LLMs) have become capable mathematical problem-solvers, often producing correct proofs for challenging problems. However, correctness alone is not sufficient: mathematical proofs should also be clear, concise, insightful, and transferable to other problems. While this proof quality is subjective and depends on the reader and context, many of its components are concrete and broadly valued. In this work, we identify such components and introduce ProofRank, a benchmark curated from challenging mathematical competitions. ProofRank evaluates several scalable proxies of proof quality: (i) conciseness, measuring whether proofs avoid unnecessary steps; (ii) computational ease, measuring the extent to which a proof relies on tedious calculations; (iii) cognitive simplicity, measuring how accessible the used proof techniques are; (iv) diversity, measuring how varied a model’s proofs for a single problem are; and (v) adaptivity, measuring whether a model can follow a specified proof technique. Across models, we find substantial differences in proof quality that are not captured by correctness-only benchmarks. We also observe significant trade-offs between proof-quality metrics and correctness, suggesting that future evaluations of mathematical reasoning should measure how useful LLM-generated proofs are.

[NLP-48] Toward Multi-Database Query Reasoning for Text2Cypher

【速读】: 该论文旨在解决现有Text2Cypher系统仅支持单一预选图数据库的局限性,而现实场景中查询需求常跨越多个独立部署的图数据库(如按领域或系统边界划分),需在多源异构环境中进行推理与整合。其解决方案的关键在于提出一个三阶段的结构化框架:首先通过数据库路由(database routing)识别相关数据源;其次进行多数据库分解(multi-database decomposition)以将用户问题拆解为跨库子查询;最后实现跨数据库类型和查询语言的异构查询推理(heterogeneous query reasoning),完成部分结果的集成。该方法推动了从单库查询生成向多库协同推理的范式转变,为构建更真实、可扩展的自然语言接口提供了理论基础与实践路径。

链接: https://arxiv.org/abs/2605.10373
作者: Makbule Gulcin Ozsoy
机构: Neo4j( Neo4j)
类目: Databases (cs.DB); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models have significantly improved natural language interfaces to databases by translating user questions into executable queries. In particular, Text2Cypher focuses on generating Cypher queries for graph databases, enabling users to access graph data without query language expertise. Most existing Text2Cypher systems assume a single preselected graph database, where queries are generated over a known schema. However, real-world systems are often distributed across multiple independent graph databases organized by domain or system boundaries, where relevant information may span multiple sources. To address this limitation, we propose a shift from single-database query generation to multi-database query reasoning. Instead of assuming a fixed execution context, the system must reason about (i) relevant databases, (ii) how to decompose a question across them, and (iii) how to integrate partial results. We formalize this setting through a three-phase roadmap: database routing, multi-database decomposition, and heterogeneous query reasoning across database types and query languages. This work provides a structured formulation of multi-database reasoning for Text2Cypher and identifies challenges in source selection, query decomposition, and result integration, aiming to support more realistic and scalable natural language interfaces to graph databases.
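
To make the database-routing phase concrete, the toy sketch below scores each graph database by keyword overlap between the question and a short schema summary, then keeps the top matches. The databases, summaries, and heuristic are hypothetical; the paper formulates the problem and roadmap rather than prescribing this implementation.

```python
# Toy illustration of the "database routing" phase: decide which graph databases are
# likely relevant before generating any Cypher. Names and scoring are made-up examples.
DATABASES = {
    "hr_graph": "Employee Department Manager reports_to works_in",
    "sales_graph": "Customer Order Product purchased contains",
    "support_graph": "Ticket Agent Customer opened_by assigned_to",
}

def route_databases(question, top_k=2):
    q_tokens = set(question.lower().replace("?", "").split())
    scores = {name: len(q_tokens & set(desc.lower().split()))
              for name, desc in DATABASES.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [name for name in ranked[:top_k] if scores[name] > 0]

print(route_databases("Which customer opened the most support tickets last month?"))
```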

[NLP-49] How Mobile World Model Guides GUI Agents?

【速读】: 该论文旨在解决移动图形用户界面(GUI)智能体在长周期和高风险交互中可靠预测动作后果的问题。现有移动世界模型仅提供文本或图像形式的未来状态,但其有效性、生成轨迹是否可替代真实环境,以及测试时引导对不同能力智能体的影响尚不明确。解决方案的关键在于构建并训练四种模态的世界模型:delta文本、完整文本、基于扩散的图像与可渲染代码,并通过大规模数据过滤与标注提升模型质量;实验表明,可渲染代码在分布内重建精度高且能提供多模态监督,而文本反馈在分布外执行时更具鲁棒性;此外,世界模型生成的轨迹虽不保留原始分布,但能迁移交互经验以提升端到端任务性能,且对于高自信低动作熵的智能体,后验自我反思效果有限,说明世界模型更适合作为先验感知或训练监督工具而非通用事后验证器。

链接: https://arxiv.org/abs/2605.10347
作者: Weikai Xu,Kun Huang,Yunren Feng,Jiaxing Li,Yuhan Chen,Yuxuan Liu,Zhizheng Jiang,Heng Qu,Pengzhi Gao,Wei Liu,Jian Luan,Xiaolin Hu,Bo An
机构: Nanyang Technological University; MiLM Plus, Xiaomi Inc.; Independent Researchers; Gaoling School of Artificial Intelligence, Renmin University of China; Wuhan University; Xiamen University
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in vision-language models have enabled mobile GUI agents to perceive visual interfaces and execute user instructions, but reliable prediction of action consequences remains critical for long-horizon and high-risk interactions. Existing mobile world models provide either text-based or image-based future states, yet it remains unclear which representation is useful, whether generated rollouts can replace real environments, and how test-time guidance helps agents of different strengths. To answer the above questions, we filter and annotate mobile world-model data, then train world models across four modalities: delta text, full text, diffusion-based images, and renderable code. These models achieve SoTA performance on both MobileWorldBench and Code2WorldBench. Furthermore, by evaluating their downstream utility on AITZ, AndroidControl, and AndroidWorld, we obtain three findings. First, renderable code reconstruction achieves high in-distribution fidelity and provides effective multimodal supervision for data construction, while text-based feedback is more robust for online out-of-distribution (OOD) execution. Second, world-model-generated trajectories can provide transferable interaction experience in the training process and improve agents’ end-to-end task performance, although these data do not preserve the original distribution. Last, for overconfident mobile agents with low action entropy, posterior self-reflection provides limited gains, suggesting that world models are more effective as prior perception or training supervision than as universal post-hoc verifiers.

[NLP-50] An Annotation Scheme and Classifier for Personal Facts in Dialogue

【速读】: 该论文旨在解决现有个性化对话系统中个人事实(personal fact)分类方法的局限性,特别是针对PeaCoK等方案在结构化存储、质量过滤及对话延续适配性方面的不足。其解决方案的关键在于提出了一种扩展的标注方案,新增了“人口统计信息”(Demographics)和“财产归属”(Possessions)两类,并引入“持续时间”(Duration)、“有效性”(Validity)和“后续性”(Followup)三个属性,从而实现对个人事实的精细化建模与筛选;在此基础上,研究者构建了一个基于Transformer编码器的多头分类器,在Multi-Session Chat数据集上训练并达到81.6±2.6%的宏F1分数,显著优于所有少样本大语言模型基线(最佳为GPT-5.4-mini的72.92%),同时大幅降低计算资源消耗。

链接: https://arxiv.org/abs/2605.10339
作者: Konstantin Zaitsev
机构: HSE University(高等经济大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The advancement of Large Language Models (LLMs) has enabled their application in personalized dialogue systems. We present an extended annotation scheme for personal fact classification that addresses limitations in existing approaches, particularly PeaCoK. Our scheme introduces new categories (Demographics, Possessions) and attributes (Duration, Validity, Followup) that enable structured storage, quality filtering, and identification of facts suitable for dialogue continuation. We manually annotated 2,779 facts from Multi-Session Chat and trained a multi-head classifier based on transformer encoders. Combined with the Gemma-300M encoder, the classifier achieves 81.6 ± 2.6% macro F1, outperforming all few-shot LLM baselines (best: GPT-5.4-mini, 72.92%) by nearly 9 percentage points while requiring substantially fewer computational resources. Error analysis reveals persistent challenges in semantic boundary disambiguation, temporal aspect interpretation, and pragmatic reasoning for followup assessment. The dataset and classifier are publicly available.
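
A multi-head classifier of this kind typically shares one encoder and attaches a separate linear head per annotation dimension. The sketch below uses a stand-in encoder and hypothetical label counts; the actual encoder checkpoint, head set, and pooling used in the paper are not reproduced here.

```python
# Minimal sketch of a multi-head classifier over a sentence encoder: one linear head per
# annotation dimension. The encoder, label counts, and pooling are illustrative choices.
import torch
import torch.nn as nn

HEADS = {"category": 9, "duration": 3, "validity": 2, "followup": 2}  # hypothetical counts

class PersonalFactClassifier(nn.Module):
    def __init__(self, hidden_dim=256, vocab_size=30_000):
        super().__init__()
        # Stand-in encoder; in practice this would be a pretrained transformer encoder.
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        layer = nn.TransformerEncoderLayer(hidden_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.heads = nn.ModuleDict({name: nn.Linear(hidden_dim, n) for name, n in HEADS.items()})

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids)).mean(dim=1)   # mean pooling over tokens
        return {name: head(h) for name, head in self.heads.items()}

model = PersonalFactClassifier()
logits = model(torch.randint(0, 30_000, (4, 12)))             # batch of 4 token sequences
loss = sum(nn.functional.cross_entropy(logits[k], torch.randint(0, n, (4,)))
           for k, n in HEADS.items())
print(loss.item())
```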

[NLP-51] PowerStep: Memory-Efficient Adaptive Optimization via $\ell_p$-Norm Steepest Descent

【速读】: 该论文旨在解决大规模神经网络(如Transformer)训练中自适应优化器(如Adam)因存储二阶矩统计量而导致的显著内存开销问题。解决方案的关键在于提出PowerStep,一种通过直接对动量缓冲区施加非线性变换来实现坐标级自适应性的内存高效优化方法,无需显式存储二阶矩信息;理论证明其在非凸随机优化中具有最优的 $ O(1/\sqrt{T}) $ 收敛速率,并在实际应用中验证了其在保持Adam收敛速度的同时将优化器内存消耗减半,且在结合8位量化时仍具数值稳定性,内存压缩比达约8倍。

链接: https://arxiv.org/abs/2605.10335
作者: Yao Lu,Dengdong Fan,Shixun Zhang,Yonghong Tian
机构: Pengcheng Laboratory; Peking University
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Numerical Analysis (math.NA); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Adaptive optimizers, most notably Adam, have become the default standard for training large-scale neural networks such as Transformers. These methods maintain running estimates of gradient first and second moments, incurring substantial memory overhead. We introduce PowerStep, a memory-efficient optimizer that achieves coordinate-wise adaptivity without storing second-moment statistics. Motivated by steepest descent under an $\ell_p$-norm geometry, we show that applying a nonlinear transform directly to a momentum buffer yields coordinate-wise adaptivity. We prove that PowerStep converges at the optimal $O(1/\sqrt{T})$ rate for non-convex stochastic optimization. Extensive experiments on Transformer models ranging from 124M to 235B parameters demonstrate that PowerStep matches Adam's convergence speed while halving optimizer memory. Furthermore, when combined with aggressive int8 quantization, PowerStep remains numerically stable and reduces optimizer memory by roughly 8× compared to full-precision Adam. PowerStep thus provides a principled, scalable and resource-efficient alternative for large-scale training. Code is available at this https URL.
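
The general mechanism the abstract describes can be sketched as an elementwise signed-power transform of a single momentum buffer, which is how steepest descent under an $\ell_p$-norm geometry yields coordinate-wise adaptivity. The exponent, normalization, and default values below are assumptions for illustration only; they are not PowerStep's published update rule.

```python
# Illustrative sketch of coordinate-wise adaptivity via a signed-power transform of a
# momentum buffer (no second-moment state). The dual exponent q = p / (p - 1) and the
# per-step normalization are illustrative choices, not the paper's exact algorithm.
import torch

@torch.no_grad()
def powerstep_like_update(params, grads, momenta, lr=1e-3, beta=0.9, p=3.0, eps=1e-8):
    q = p / (p - 1.0)
    for w, g, m in zip(params, grads, momenta):
        m.mul_(beta).add_(g, alpha=1.0 - beta)      # momentum buffer (the only optimizer state)
        d = m.sign() * m.abs().pow(q - 1.0)         # coordinate-wise nonlinear transform
        d = d / (d.norm() + eps)                    # normalize the step (illustrative choice)
        w.add_(d, alpha=-lr)

# Toy usage on a single tensor; with p = 2 this reduces to plain momentum descent.
w, g, m = torch.randn(10), torch.randn(10), torch.zeros(10)
powerstep_like_update([w], [g], [m])
print(w)
```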

[NLP-52] ANCHOR: Abductive Network Construction with Hierarchical Orchestration for Reliable Probability Inference in Large Language Models ICML2026

【速读】: 该论文旨在解决大规模决策中因信息不完整而导致的概率估计不可靠问题。现有方法利用大语言模型(Large Language Models, LLMs)生成解释性因素并诱导粗粒度概率估计,但受限于稀疏因素空间易产生“未知”预测,而扩展因素又引入噪声和虚假相关性,破坏条件独立假设,降低可靠性。解决方案的关键在于提出 ANCHOR 框架,其核心创新包括:首先通过迭代生成与层次聚类构建密集且结构化的因素空间;其次采用上下文感知的分层检索与精炼机制减少“未知”预测;最后在朴素贝叶斯基础上引入因果贝叶斯网络(Causal Bayesian Network),显式建模因素间的潜在依赖关系,放宽严格独立性假设,从而显著提升概率估计的准确性与鲁棒性。

链接: https://arxiv.org/abs/2605.10328
作者: Wentao Qiu,Guanran Luo,Zhongquan Jian,Jingqi Gao,Meihong Wang,Qingqiang Wu
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by ICML 2026

点击查看摘要

Abstract:A central challenge in large-scale decision-making under incomplete information is estimating reliable probabilities. Recent approaches leverage Large Language Models (LLMs) to generate explanatory factors and elicit coarse-grained probability estimates. Typically, an LLM performs forward abduction to propose factors, each paired with two mutually exclusive attributes, and a Naïve Bayes model is trained over factor combinations to refine the final probabilities. However, sparse factor spaces often yield "unknown" outcomes, while expanding factors increases noise and spurious correlations, weakening conditional independence and degrading reliability. To address these limitations, we propose ANCHOR, an inference framework that orchestrates aggregated Bayesian inference over a hierarchically structured factor space. ANCHOR first constructs a dense and organized factor space via iterative generation and hierarchical clustering. It then performs context-aware mapping through hierarchical retrieval and refinement, substantially reducing "unknown" predictions. Finally, ANCHOR augments Naïve Bayes with a Causal Bayesian Network to capture latent dependencies among factors, relaxing the strict independence assumption. Experiments show that ANCHOR markedly reduces "unknown" predictions and produces more reliable probability estimates than direct LLM baselines, achieving state-of-the-art performance while significantly reducing time and token overhead.
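
The Naïve Bayes aggregation over binary factor attributes that the framework builds on can be sketched in a few lines. The factor names, likelihoods, and prior below are made up for illustration; ANCHOR's structured factor space and causal Bayesian network, which relax the independence assumption, are not implemented here.

```python
# Illustrative Naive Bayes aggregation over binary factor attributes.
# All probabilities are hypothetical; this is the baseline layer, not ANCHOR itself.
import math

# P(attribute | outcome=True), P(attribute | outcome=False) for each factor.
FACTOR_LIKELIHOODS = {
    "regulatory_approval": {"favorable": (0.7, 0.3), "unfavorable": (0.3, 0.7)},
    "market_conditions":   {"favorable": (0.6, 0.4), "unfavorable": (0.4, 0.6)},
}
PRIOR_TRUE = 0.5

def naive_bayes_probability(observed):
    log_true = math.log(PRIOR_TRUE)
    log_false = math.log(1.0 - PRIOR_TRUE)
    for factor, attribute in observed.items():
        p_true, p_false = FACTOR_LIKELIHOODS[factor][attribute]
        log_true += math.log(p_true)
        log_false += math.log(p_false)
    return 1.0 / (1.0 + math.exp(log_false - log_true))

print(naive_bayes_probability(
    {"regulatory_approval": "favorable", "market_conditions": "unfavorable"}
))  # ~0.61 under these made-up likelihoods
```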

[NLP-53] Extending Confidence-Based Text2Cypher with Grammar and Schema Aware Filtering

【速读】: 该论文旨在解决生成式 AI(Generative AI)在 Text2Cypher 任务中因缺乏显式结构约束而导致的查询可靠性问题,即生成的 Cypher 查询虽然语义可能合理,但未必满足语法有效性(syntactic validity)或数据库模式一致性(schema consistency),从而影响其可执行性。解决方案的关键在于引入一种基于置信度的推理框架,并在其基础上增加一个顺序过滤流程,该流程依次结合置信度评分、语法验证和模式约束检查,在最终聚合前对候选查询进行筛选。实验表明,语法过滤提升语法正确率,模式感知过滤进一步提高执行质量,尽管更强的过滤会降低执行覆盖率并增加空预测数量,整体上显著提升了测试阶段生成查询的可靠性。

链接: https://arxiv.org/abs/2605.10318
作者: Makbule Gulcin Ozsoy
机构: Neo4j( Neo4j)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) allow users to query databases using natural language by translating questions into executable queries. Despite strong progress on tasks such as Text2SQL, Text2SPARQL, and Text2Cypher, most existing methods focus on better prompting, fine-tuning, or iterative refinement. However, they often do not explicitly enforce structural constraints, such as syntactic validity and schema consistency. This can reduce reliability, since generated queries must satisfy both syntax rules and database schema constraints to be executable. In this work, we study how structured constraints can be used in test-time inference for Text2Cypher. We focus on post-generation validation to improve query correctness. We extend a confidence-based inference framework with a sequential filtering process that combines confidence scoring, grammar validation, and schema constraints before final aggregation. This lets us analyze how different constraint types affect generated queries. Our experiments with two instruction-tuned models show that grammar-based filtering improves syntactic validity. Schema-aware filtering further improves execution quality by enforcing consistency with the database structure. However, stronger filtering also increases the number of empty predictions and reduces execution coverage. Overall, we show that adding simple structural checks at test time improves the reliability of Text2Cypher generation, and we provide a clearer view of how syntax and schema constraints contribute differently.
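
The sequential filtering idea above (confidence threshold, then a syntax check, then a schema check, then aggregation) can be sketched as a small pipeline. The two validator functions below are simplified stand-ins; a real system would use a Cypher parser and the database's actual schema.

```python
# Minimal sketch of sequential post-generation filtering for Text2Cypher candidates.
# The grammar and schema checks are crude placeholders for the validators described above.
import re
from collections import Counter

SCHEMA_LABELS = {"Person", "Movie"}
SCHEMA_RELS = {"ACTED_IN", "DIRECTED"}

def passes_grammar(query):
    return bool(re.search(r"\bMATCH\b.+\bRETURN\b", query, flags=re.IGNORECASE | re.DOTALL))

def passes_schema(query):
    labels = set(re.findall(r":(\w+)", query))
    return labels <= (SCHEMA_LABELS | SCHEMA_RELS)

def filter_candidates(candidates, min_conf=0.3):
    kept = [q for q, conf in candidates
            if conf >= min_conf and passes_grammar(q) and passes_schema(q)]
    if not kept:
        return None                                  # empty prediction: the coverage cost noted above
    return Counter(kept).most_common(1)[0][0]        # majority vote over surviving candidates

candidates = [
    ("MATCH (p:Person)-[:ACTED_IN]->(m:Movie) RETURN m.title", 0.8),
    ("MATCH (p:Actor)-[:ACTED_IN]->(m:Movie) RETURN m.title", 0.7),   # unknown label
    ("RETURN 42", 0.9),                                               # fails the grammar check
]
print(filter_candidates(candidates))
```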

[NLP-54] DECO-MWE: building a linguistic resource of Korean multiword expressions for feature-based sentiment analysis

【速读】: 该论文旨在解决情感分析中多词表达(Multiword Expressions, MWEs)的处理问题,因为许多MWEs具有词汇异质性(lexical idiosyncrasy),直接影响情感极性判断的准确性。为高效构建情感导向的MWE语言资源,研究提出基于局部语法图(Local Grammar Graph, LGG)方法构建一个名为DECO-MWE的有限状态转换器(Finite-State Transducer),用于形式化描述MWE的词法与句法限制。其关键创新在于将MWE细分为四类:标准极性MWE(Standard Polarity MWEs, SMWEs)、领域依赖极性MWE(Domain-Dependent Polarity MWEs, DMWEs)、复合命名实体MWE(Compound Named Entity MWEs, EMWEs)和复合特征MWE(Compound Feature MWEs, FMWEs),并通过实证语料验证了该方法在美妆评论文本中的检索性能(F-measure=0.806),从而为跨领域的情感分析提供了可复用的有限状态建模框架。

链接: https://arxiv.org/abs/2605.10295
作者: Jaeho Han,Changhoe Hwang,Seongyong Choi,Gwanghoon Yoo,Eric Laporte,Jeesun Nam
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper aims to construct a linguistic resource of Korean Multiword Expressions for Feature-Based Sentiment Analysis (FBSA): DECO-MWE. Dealing with multiword expressions (MWEs) has been a critical issue in FBSA since many constructs reveal lexical idiosyncrasy. To construct linguistic resources of sentiment MWEs efficiently, we utilize the Local Grammar Graph (LGG) methodology: DECO-MWE is formalized as a Finite-State Transducer that represents lexical-syntactic restrictions on MWEs. In this study, we built a corpus of cosmetics review texts, which show particularly frequent occurrences of MWEs. Based on an empirical examination of the corpus, four types of MWEs have been distinguished. The DECO-MWE thus covers the following four categories: Standard Polarity MWEs (SMWEs), Domain-Dependent Polarity MWEs (DMWEs), Compound Named Entity MWEs (EMWEs) and Compound Feature MWEs (FMWEs). The retrieval performance of the DECO-MWE shows 0.806 f-measure in the test corpus. This study brings a twofold outcome: first, a sizeable general-purpose polarity MWE lexicon, which may be broadly used in FBSA; second, a finite-state methodology adopted in this study to treat domain-dependent MWEs such as idiosyncratic polarity expressions, named entity expressions or feature expressions, and which may be reused in describing linguistic properties of other corpus domains.

[NLP-55] MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading

【速读】: 该论文旨在解决长上下文推理任务中标准注意力机制带来的二次时间复杂度问题,以及现有基于代理记忆(agent memory)的方法在“边读边记”过程中因记忆覆盖导致潜在证据丢失的问题。其解决方案的关键在于提出MemReread框架,该框架基于流式阅读(streaming reading)设计,避免中间检索步骤,在最终记忆不足时触发问题分解与重读机制,从而恢复被提前丢弃的间接事实,支持非线性推理同时保持文档理解的逻辑连贯性;此外,引入强化学习框架以动态决定重读次数,根据任务复杂度灵活控制计算开销,实现高效且可扩展的长文本推理。

链接: https://arxiv.org/abs/2605.10268
作者: Baibei Ji,Xiaoyang Weng,Juntao Li,Zecheng Tang,Yihang Lou,Min Zhang
机构: Soochow University (苏州大学); Peking University (北京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:To tackle long-context reasoning tasks without the quadratic complexity of standard attention mechanisms, approaches based on agent memory have emerged, which typically maintain a dynamically updated memory when linearly processing document chunks. To mitigate the potential loss of latent evidence in this memorize-while-reading paradigm, recent works have integrated retrieval modules that allow agents to recall information previously discarded during memory overwriting. However, retrieval-based recall suffers from both evidence loss during memory formation and interference induced by invalid queries. To overcome these limitations, we propose MemReread. Built upon streaming reading, MemReread circumvents intermediate retrieval. It triggers question decomposition and rereading when the final memory is insufficient, enabling the recovery of indirect facts that were prematurely discarded. This design supports non-linear reasoning while preserving the inherent logical flow of document comprehension. To further enhance practicality, we introduce a reinforcement learning framework that enhances length extrapolation capability while dynamically determining the number of rereading passes based on task complexity, thereby flexibly controlling computational overhead. Extensive experiments demonstrate that MemReread consistently outperforms baseline frameworks on long-context reasoning tasks, while maintaining linear time complexity with respect to context length.

[NLP-56] Building Korean linguistic resource for NLU data generation of banking app CS dialog system

【速读】: 该论文旨在解决任务导向型对话系统中自然语言理解(Natural Language Understanding, NLU)模块对大量标注训练数据的依赖问题,特别是在韩国银行业客户服务中心场景下,如何高效构建覆盖多样意图和实体的高质量标注数据。解决方案的关键在于构建了一个名为FIAD(Financial Annotated Dataset)的语言学资源,通过分析银行应用评论语料,识别出三种韩语请求句中的关键语言模式:TOPIC(ENTITY, FEATURE)、EVENT和DISCOURSE MARKER,并将其形式化为局部语法图(Local Grammar Graphs, LGGs),从而自动生成多样化且结构化的标注数据。实验表明,基于FIAD生成的数据训练的DIET模型结合不同预训练语言模型(如HANBERT、KoBERT、KorBERT)后,在意图识别(Intent)和话题(Topic [entity+feature])提取任务上均取得了优异性能,验证了该方法在实际NLU建模中的有效性与实用性。

链接: https://arxiv.org/abs/2605.10241
作者: Jeongwoo Yoon,On-yu Park,Changhoe Hwang,Gwanghoon Yoo,Eric Laporte,Jeesun Nam
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Natural language understanding (NLU) is integral to task-oriented dialog systems, but demands a considerable amount of annotated training data to increase the coverage of diverse utterances. In this study, we report the construction of a linguistic resource named FIAD (Financial Annotated Dataset) and its use to generate a Korean annotated training data for NLU in the banking customer service (CS) domain. By an empirical examination of a corpus of banking app reviews, we identified three linguistic patterns occurring in Korean request utterances: TOPIC (ENTITY, FEATURE), EVENT, and DISCOURSE MARKER. We represented them in LGGs (Local Grammar Graphs) to generate annotated data covering diverse intents and entities. To assess the practicality of the resource, we evaluate the performances of DIET-only (Intent: 0.91 /Topic [entity+feature]: 0.83), DIET+ HANBERT (I:0.94/T:0.85), DIET+ KoBERT (I:0.94/T:0.86), and DIET+ KorBERT (I:0.95/T:0.84) models trained on FIAD-generated data to extract various types of semantic items.

[NLP-57] Route Before Retrieve: Activating Latent Routing Abilities of LLMs for RAG vs. Long-Context Selection

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理长文档时,如何在检索增强生成(Retrieval-Augmented Generation, RAG)与长上下文(Long-Context, LC)策略之间进行高效、可解释的路由决策问题。现有方法如Self-Route采用被动回退机制,存在效率低、难以解释等缺陷。其解决方案的关键在于提出Pre-Route框架,通过轻量级元数据(如文档类型、长度、初始片段)进行前置结构化推理,实现任务分析、覆盖度估计和信息需求预测,从而做出成本效益最优的主动路由决策;该框架利用提示工程激发LLMs潜在的路由能力,并通过线性探测和知识蒸馏将推理结构迁移至小型模型,显著提升性能与部署效率。

链接: https://arxiv.org/abs/2605.10235
作者: Yiwen Chen,Kuan Li,Fuzhen Zhuang,Deqing Wang,Zhao Zhang,Liwen Zhang,Yong Jiang,Shuai Wang,Minhao Cheng
机构: Beihang University (北京航空航天大学); HKUST (香港科技大学); Alibaba Group (阿里巴巴集团); Pennsylvania State University (宾夕法尼亚州立大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have expanded the context window to beyond 128K tokens, enabling long-document understanding and multi-source reasoning. A key challenge, however, lies in choosing between retrieval-augmented generation (RAG) and long-context (LC) strategies: RAG is efficient but constrained by retrieval quality, while LC supports global reasoning at higher cost and with position sensitivity. Existing methods such as Self-Route adopt failure-driven fallback from RAG to LC, but remain passive, inefficient, and hard to interpret. We propose Pre-Route, a proactive routing framework that performs structured reasoning before answering. Using lightweight metadata (e.g., document type, length, initial snippet), Pre-Route enables task analysis, coverage estimation, and information-need prediction, producing explainable and cost-efficient routing decisions. Our study shows three key findings: (i) LLMs possess latent routing ability that can be reliably elicited with guidelines, allowing single-sample performance to approach that of multi-sample (Best-of-N) results; (ii) linear probes reveal that structured prompts sharpen the separability of the “optimal routing dimension” in representation space; and (iii) distillation transfers this reasoning structure to smaller models for lightweight deployment. Experiments on LaRA (in-domain) and LongBench-v2 (OOD) confirm that Pre-Route outperforms Always-RAG, Always-LC, and Self-Route baselines, achieving superior overall cost-effectiveness.
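
A proactive router of this kind can be realized as a prompt over lightweight metadata that forces the model to reason about task type, coverage, and information need before committing to RAG or LC. The wording, metadata fields, and output format below are hypothetical; the paper's guideline prompt and distilled router differ in detail.

```python
# Illustrative router prompt over document metadata, in the spirit of the proactive
# routing described above. Everything in this prompt is a made-up example.
def build_preroute_prompt(question, doc_type, doc_tokens, snippet):
    return (
        "Decide whether to answer with RAG (retrieve a few chunks) or LC (read the full "
        "document), before seeing the document body.\n"
        f"Question: {question}\n"
        f"Document type: {doc_type}; length: {doc_tokens} tokens\n"
        f"Opening snippet: {snippet}\n"
        "Reason briefly about (1) the task type, (2) whether a few passages likely cover "
        "the answer, and (3) what information is needed. Then output exactly one word: RAG or LC."
    )

print(build_preroute_prompt(
    question="Summarize the main argument of the report.",
    doc_type="technical report",
    doc_tokens=90_000,
    snippet="This report reviews grid-scale battery storage deployments...",
))
```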

[NLP-58] Relative Score Policy Optimization for Diffusion Language Models

【速读】: 该论文旨在解决扩散大语言模型(Diffusion Large Language Models, dLLMs)在后训练阶段提升推理能力时面临的挑战,尤其是基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)方法因缺乏可处理的序列级对数似然比(sequence-level log-ratios)而导致训练不稳定的问题。现有方法依赖高方差的ELBO近似,当验证器奖励较高时会放大评分估计误差,从而引发训练不收敛。解决方案的关键在于提出相对分数策略优化(Relative Score Policy Optimization, RSPO),其核心思想是将奖励优势不仅视为策略更新方向,更视为当前策略与参考策略之间相对对数似然比的目标值;RSPO通过比较实际奖励优势与由奖励推导出的目标相对对数似然比,基于两者差距而非原始优势来更新策略,从而校准噪声较大的似然估计,显著提升了训练稳定性与性能,尤其在规划类任务中表现突出。

链接: https://arxiv.org/abs/2605.10218
作者: Zichao Yu,Shengze Xu,Bingqing Jiang,Wenyi Zhang,Difan Zou
机构: University of Science and Technology of China (中国科学技术大学); The Chinese University of Hong Kong (香港中文大学); The University of Hong Kong (香港大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Diffusion large language models (dLLMs) offer a promising route to parallel and efficient text generation, but improving their reasoning ability requires effective post-training. Reinforcement learning with verifiable rewards (RLVR) is a natural choice for this purpose, yet its application to dLLMs is hindered by the absence of tractable sequence-level log-ratios, which are central to standard policy optimization. The lack of tractable sequence-level log-ratios forces existing methods to rely on high-variance ELBO-based approximations, where high verifier rewards can amplify inaccurate score estimates and destabilize RL training. To overcome this issue, we propose Relative Score Policy Optimization (RSPO), a simple RLVR method that uses verifiable rewards to calibrate noisy likelihood estimates in dLLMs. The core of our algorithm relies on a key observation: a reward advantage can be interpreted not only as an update direction, but also as a target for the relative log-ratio between the current and reference policies. Accordingly, RSPO calibrates this noisy relative log-ratio estimate by comparing its reward advantage with the reward-implied target relative log-ratio, updating the policy according to the gap between the current estimate and the target rather than the raw advantage alone. Experiments on mathematical reasoning and planning benchmarks show that RSPO yields especially strong gains on planning tasks and competitive mathematical-reasoning performance.

[NLP-59] The Impact of Editorial Intervention on Detecting Native Language Traces

【速读】: 该论文旨在解决生成式 AI(Generative AI)时代下,母语识别(Native Language Identification, NLI)模型在面对经过大型语言模型(LLM)编辑和润色的非母语文本时,其鲁棒性下降的问题。解决方案的关键在于揭示:尽管表面语法错误被修正,L1特征并未完全消失,而是以更深层的语言模式形式保留——包括非地道的词汇-语义选择、语用迁移(pragmatic transfer)以及作者隐含的文化视角。研究发现,仅进行最小程度的语法修正即可保持较高的识别准确率,而深度流利化处理(如改写)则会削弱这些结构性特征,导致性能显著下降。

链接: https://arxiv.org/abs/2605.10216
作者: Ahmet Yavuz Uluslu,Mark Gales,Kate Knill,Gerold Schneider
机构: University of Cambridge (剑桥大学); University of Zurich (苏黎世大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Native Language Identification (NLI) is the task of determining an author’s native language (L1) from their non-native writings. With the advent of human-AI co-authorship, non-native texts are routinely corrected and rewritten by large language models, fundamentally altering the linguistic features NLI models depend on. In this paper, we investigate the robustness of L1 traces across increasing degrees of editorial intervention. By processing 450 essays from the Write Improve 2024 corpus through varying levels of grammatical error correction (GEC) and paraphrasing, we demonstrate that L1 attribution does not entirely depend on surface-level errors. Instead, the detection models leverage deeper L1 features: unidiomatic lexico-semantic choices, pragmatic transfer, and the author’s underlying cultural perspective. We find that minimal edits preserve these structural traces and maintain high profiling accuracy. In contrast, fluency edits and paraphrasing normalize these L1 features, leading to a severe degradation in performance.

[NLP-60] Task-Aware Calibration: Provably Optimal Decoding in LLMs

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在解码过程中因预测分布与真实生成分布不一致而导致决策次优的问题。传统校准方法在自由形式语言的组合爆炸层面难以实施,为此,作者提出“任务校准”(task calibration)范式,其核心在于将自由形式输出映射到任务诱导的语义有意义的潜在空间(如离散类别标签、整数或集合),并在该潜在空间中对模型预测分布进行校准。基于决策理论结果,作者证明在任务校准后的潜在分布上使用最小贝叶斯风险(Minimum Bayes Risk, MBR)解码是最优策略,从而提升了生成质量并增强了模型决策的可靠性。

链接: https://arxiv.org/abs/2605.10202
作者: Tim Tomov,Dominik Fuchsgruber,Rajeev Verma,Stephan Günnemann
机构: Technical University of Munich (慕尼黑工业大学); Munich Data Science Institute (慕尼黑数据科学研究所); Munich Center for Machine Learning (慕尼黑机器学习中心); University of Amsterdam (阿姆斯特丹大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:LLM decoding often relies on the model’s predictive distribution to generate an output. Consequently, misalignment with respect to the true generating distribution leads to suboptimal decisions in practice. While a natural solution is to calibrate the model’s output distribution, for LLMs, this is ill-posed at the combinatorially vast level of free-form language. We address this by building on the insight that in many tasks, these free-form outputs can be interpreted in a semantically meaningful latent structure, for example, discrete class labels, integers, or sets. We introduce task calibration as a paradigm to calibrate the model’s predictive distribution in the task-induced latent space. We apply a decision-theoretic result to show that Minimum Bayes Risk (MBR) decoding on the task-calibrated latent distribution is the optimal decoding strategy on latent model beliefs. Empirically, it consistently improves generation quality across different tasks and baselines. We also introduce Task Calibration Error (TCE), an application-aware calibration metric that quantifies the excess loss due to miscalibration. Our work demonstrates that task calibration enables more reliable model decisions across various tasks and applications.
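
Minimum Bayes Risk decoding over a task-induced latent space is straightforward once the latent beliefs are calibrated: pick the candidate minimizing expected loss under the calibrated distribution. The sketch below uses a small set of integer answers, plain temperature scaling, and squared-error loss, all illustrative choices; the paper's task-calibration procedure is more general.

```python
# Minimal sketch of MBR decoding over a discrete latent space with temperature-calibrated
# beliefs. Candidates, logits, and the loss function are made-up examples.
import numpy as np

def temperature_calibrate(logits, temperature):
    z = logits / temperature
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def mbr_decode(candidates, probs, loss=lambda a, b: (a - b) ** 2):
    # Pick the candidate with the lowest expected loss under the calibrated belief.
    risks = [sum(p * loss(y, y_prime) for y_prime, p in zip(candidates, probs))
             for y in candidates]
    return candidates[int(np.argmin(risks))]

candidates = [1, 2, 3, 10]
logits = np.array([2.0, 1.8, 1.5, 0.5])
probs = temperature_calibrate(logits, temperature=1.5)
print(mbr_decode(candidates, probs))   # under squared loss this favors a central value (3),
                                       # not the argmax of the raw distribution (1)
```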

[NLP-61] How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue

【速读】: 该论文旨在解决全双工语音对话(full-duplex spoken dialogue)中大型语言模型(LLM)在生成响应时无法有效处理用户语音流输入的问题,即如何将用户语音流合理地融入LLM的推理过程以实现无缝交互。其解决方案的关键在于用户流(user stream)的路由策略设计:通过对比两种架构——通道融合(channel fusion)与交叉注意力路由(cross-attention routing),发现前者直接将用户流注入LLM输入以增强语义对齐,后者则将其作为外部记忆通过交叉注意力机制访问,从而在语义整合与上下文鲁棒性之间形成权衡。实验表明,通道融合在问答任务中表现更优,但易受用户打断干扰导致语义混乱;而交叉注意力路由虽问答性能稍弱,却能更好维持生成上下文稳定性,具备更强的容错能力。

链接: https://arxiv.org/abs/2605.10199
作者: Hui Lu,Xueyuan Chen,Huimeng Wang,Shuhai Peng,Shiyin Kang,Xixin Wu,Zhiyong Wu
机构: The Chinese University of Hong Kong(香港中文大学); SenseTime Research(商汤科技研究院); Tsinghua University(清华大学)
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Full-duplex spoken dialogue requires a model to keep listening while generating its own spoken response. This is challenging for large language models (LLMs), which are designed to extend a single coherent sequence and do not naturally support user input arriving during generation. We argue that how the user stream is routed into the LLM is therefore a key architectural question for full-duplex modeling. To study this question, we extend a text-only LLM into a unified full-duplex spoken dialogue system and compare two routing strategies under a shared training pipeline: (i) channel fusion, which injects the user stream directly into the LLM input, and (ii) cross-attention routing, which keeps the user stream as external memory accessed through cross-attention adapters. Experiments on spoken question answering and full-duplex interaction benchmarks reveal a clear tradeoff. Channel fusion yields stronger semantic grounding and consistently better question-answering performance. However, under semantically overlapping conditions such as user interruptions, it is more vulnerable to context corruption: if the model fails to stop in time, the overlapping user stream can interfere with ongoing generation and lead to semantically incoherent continuations. Cross-attention routing underperforms on question answering, but better preserves the LLM generation context and is more robust to this failure mode. These results establish user-stream routing as a central design axis in full-duplex spoken dialogue and offer practical guidance on the tradeoff between semantic integration and context robustness. We provide a demo page for qualitative inspection.

[NLP-62] LegalCiteBench: Evaluating Citation Reliability in Legal Language Models

【速读】: 该论文旨在解决生成式 AI(Generative AI)在法律领域中因缺乏外部证据支撑而产生的“虚假引用”问题,即模型在闭卷场景下生成看似合理但实际错误的判例引用或法条依据,这可能导致严重的专业风险。其解决方案的关键在于构建一个名为 LegalCiteBench 的诊断性基准测试框架,专门用于评估法律语言模型在闭卷条件下的引文恢复(citation retrieval)、引文补全(citation completion)、引文错误检测、案例匹配与验证等五类任务的表现,并量化模型产生误导性答案的比例(Misleading Answer Rate, MAR)。实验表明,即使是最先进的大型语言模型(LLMs)在这些任务上也表现不佳,且单纯增加模型规模或进行法律领域预训练无法显著提升准确性,提示需引入更严格的验证机制和不确定性管理策略来应对权威生成失败的问题。

链接: https://arxiv.org/abs/2605.10186
作者: Sijia Chen,Hang Yin,Shunfan Zhou
机构: Northeastern University (东北大学); Phala (Phala)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint. 23 pages including references and appendices

点击查看摘要

Abstract:Large language models (LLMs) are increasingly integrated into legal drafting and research workflows, where incorrect citations or fabricated precedents can cause serious professional harm. Existing legal benchmarks largely emphasize statutory reasoning, contract understanding, or general legal question answering, but they do not directly study a central common-law failure mode: when asked to provide case authorities without external grounding, models may return plausible-looking but incorrect citations or cases. We introduce LegalCiteBench, a benchmark for studying closed-book citation recovery, citation verification, and case matching in legal language models. LegalCiteBench contains approximately 24K evaluation instances constructed from 1,000 real U.S. judicial opinions from the Case Law Access Project. The benchmark covers five citation-centric tasks: citation retrieval, citation completion, citation error detection, case matching, and case verification and correction. Across 21 LLMs, exact citation recovery remains highly challenging in this closed-book setting: even the strongest models score below 7/100 on citation retrieval and completion. Within the evaluated models, scale and legal-domain pretraining provide limited gains and do not resolve this difficulty. Models also frequently provide concrete but incorrect or low-overlap authorities under our evaluation protocol, with Misleading Answer Rates (MAR) exceeding 94% for 20 of 21 evaluated models on retrieval-heavy tasks. A prompt-only abstention experiment shows that explicit uncertainty instructions reduce some confident fabrication but do not improve citation correctness. LegalCiteBench is intended as a diagnostic framework for studying authority generation failures, verification behavior, and abstention when external grounding is absent, incomplete, or bypassed.

[NLP-63] V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在复杂多步视觉推理任务中面临的稳定性与最优性不足的问题,其核心挑战源于现有代理式(agentic)方法普遍存在“想象-执行-观察”(Imagination-Action-Observer, IAO)偏差——即模型在决策过程中对先验想象与实际观测反馈之间的不一致缺乏有效校准,导致推理路径偏离最优。解决方案的关键在于提出V-ABS框架,该框架基于“思考者-执行者-观察者”迭代机制,通过引入一种基于熵的自适应加权算法,动态平衡策略先验置信度与观测反馈置信度,从而缓解IAO偏差;同时构建包含超过8万样本的监督微调(Supervised Fine-Tuning, SFT)数据集,引导模型在训练阶段赋予正确动作路径更高的先验置信度,最终实现更稳定、高效的视觉推理性能。

链接: https://arxiv.org/abs/2605.10172
作者: Zhiwei Ning,Xuanang Gao,Jiaxi Cao,Gengming Zhang,Shengnan Ma,Wenwen Tong,Hanming Deng,Jie Yang,Wei Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have achieved remarkable success in general perception, yet complex multi-step visual reasoning remains a persistent challenge. Although recent agentic approaches incorporate tool use, they often neglect critical execution feedback. Consequently, they suffer from the imagination-action-observer (IAO) bias, a misalignment between prior imagination and observer feedback that undermines reasoning stability and optimality. To bridge this gap, we introduce V-ABS, an action-observer driven beam search framework that enables deliberate reasoning through thinker-actor-observer iterations. We also propose an entropy-based adaptive weighting algorithm to mitigate the IAO bias by dynamically balancing the confidence scores between the policy priors and the observational feedback. Moreover, we construct a large-scale supervised fine-tuning (SFT) dataset comprising over 80k samples to guide the model to assign higher prior confidence to correct action paths. Extensive experiments across eight diverse benchmarks show that V-ABS achieves state-of-the-art performance, delivering an average improvement of 19.7% on the Qwen3-VL-8B baseline and consistent gains across both open-source and proprietary models.
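
摘要提到的“基于熵的自适应加权”可以用下面的小例子来理解:分布越集中(熵越低)的一方权重越大,从而在策略先验与观察反馈之间动态取舍。具体公式论文摘要未给出,以下实现纯属示意性假设:

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    p = p / p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def adaptive_fuse(prior: np.ndarray, feedback: np.ndarray) -> np.ndarray:
    """熵自适应加权示意:prior 为策略先验置信分布,feedback 为观察者反馈置信分布。"""
    w_p, w_f = 1.0 / (entropy(prior) + 1e-6), 1.0 / (entropy(feedback) + 1e-6)
    w_p, w_f = w_p / (w_p + w_f), w_f / (w_p + w_f)   # 熵越低,权重越大
    fused = w_p * prior + w_f * feedback
    return fused / fused.sum()

prior = np.array([0.5, 0.3, 0.2])
feedback = np.array([0.05, 0.9, 0.05])   # 执行反馈非常确定
print(adaptive_fuse(prior, feedback))     # 输出明显偏向反馈支持的第 2 个动作
```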

[NLP-64] When Reviews Disagree: Fine-Grained Contradiction Analysis in Scientific Peer Reviews ACL2026

【速读】: 该论文旨在解决科学同行评审中专家判断冲突难以准确识别与量化的问题,尤其针对会议投稿规模扩大后,领域主席和编辑难以可靠发现并解析评审意见分歧的挑战。现有方法多将评审分歧视为孤立句子间的二元矛盾检测,忽略了评审层面的上下文信息,并模糊了评价冲突的严重程度差异。其解决方案的关键在于提出一种细粒度的评审矛盾分析框架——RevCI,该框架通过专家标注的评审对数据集实现证据级矛盾标注与强度评分;并设计IMPACT多智能体结构,融合基于方面条件的证据提取、推理性推理与仲裁机制,以建模矛盾及其强度;进一步通过知识蒸馏得到轻量模型TIDE,可在单次前向传播中高效预测矛盾证据及强度,显著降低推理成本。

链接: https://arxiv.org/abs/2605.10171
作者: Sandeep Kumar,Yash Kamdar,Abid Hossain,Bharti Kumari,Tanik Saikh,Asif Ekbal
机构: Indian Institute of Technology Patna (印度理工学院巴特那分校); KIIT Deemed to be University (KIIT大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: accepted at ACL 2026

点击查看摘要

Abstract:Scientific peer reviews frequently contain conflicting expert judgments, and the increasing scale of conference submissions makes it challenging for Area Chairs and editors to reliably identify and interpret such disagreements. Existing approaches typically frame reviewer disagreement as binary contradiction detection over isolated sentence pairs, abstracting away the review-level context and obscuring differences in the severity of evaluative conflict. In this work, we introduce a fine-grained formulation of reviewer contradiction analysis that operates over full peer reviews by explicitly identifying contradiction evidence spans and assigning graded disagreement intensity scores. To support this task, we present RevCI, an expert-annotated benchmark of peer-review pairs with evidence-level contradiction annotations with graded intensity labels. We further propose IMPACT, a structured multi-agent framework that integrates aspect-conditioned evidence extraction, deliberative reasoning, and adjudication to model reviewer contradictions and their intensity. To support efficient deployment, we distill IMPACT into TIDE, a small language model that predicts contradiction evidence and intensity in a single forward pass. Experimental results show that IMPACT substantially outperforms strong single-agent and generic multi-agent baselines in both evidence identification and intensity agreement, while TIDE achieves competitive performance at significantly lower inference cost.

[NLP-65] MolSight: Molecular Property Prediction with Images

【速读】: 该论文旨在解决分子属性预测(Molecular Property Prediction, MPP)中对二维骨架图(2D skeletal diagram)这一普遍可用但被忽视的表示形式利用不足的问题。当前主流方法多依赖于分子图、三维构象或大参数量语言模型,这些方法在计算复杂度和数据工程上存在较高负担。其解决方案的关键在于提出首个系统性的基于视觉的MPP研究——MolSight,通过将分子结构渲染为二维键线图(bond-line image),并使用多种视觉主干网络(vision architectures)进行端到端学习,实现高效且高性能的属性预测。此外,作者创新性地引入“化学启发式课程学习”(chemistry-informed curriculum),依据五类结构复杂度指标对预训练分子库分层,显著提升了模型在不同难度任务上的泛化能力,最终在10个下游任务中取得领先性能,同时相比多模态方法降低80倍浮点运算量(FLOPs)。

链接: https://arxiv.org/abs/2605.10157
作者: Aaditya Baranwal,Akshaj Gupta,Shruti Vyas,Yogesh S Rawat
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Every molecule ever synthesised can be drawn as a 2D skeletal diagram, yet in modern property prediction this universally available representation has received less focus in favour of molecular graphs, 3D conformers, or billion-parameter language models, each imposing its own computational and data-engineering overhead. We present MolSight, the first systematic large-scale study of vision-based Molecular Property Prediction (MPP). Using 10 vision architectures, 7 pre-training strategies, and 2M molecule images, we evaluate performance across 10 downstream tasks spanning physical-property regression, drug-discovery classification, and quantum-chemistry prediction. To account for the wide variation in structural complexity across pre-training molecules, we further propose a chemistry-informed curriculum: five structural complexity descriptors partition the corpus into five tiers of increasing chemical difficulty, consistently outperforming non-curriculum baselines. We show that a single rendered bond-line image, processed by a vision encoder, is sufficient for competitive molecular property prediction, i.e. chemical insight from sight alone. The best curriculum-trained configuration achieves the top result on 5 of 10 benchmarks and top two on all 10, at 80× lower FLOPs than the nearest multi-modal competitor.

[NLP-66] NyayaAI: An AI-Powered Legal Assistant Using Multi-Agent Architecture and Retrieval-Augmented Generation

【速读】: 该论文旨在解决印度法律信息因法律语言复杂性和法律文件体量庞大而导致的可及性问题,尤其针对律师、法学生及普通用户在法律研究与案例分析中面临的效率瓶颈。其解决方案的关键在于构建NyayaAI——一个基于大型语言模型(Large Language Models, LLMs)并融合检索增强生成(Retrieval-Augmented Generation, RAG)技术的多智能体系统,该系统依托于结构化且经过筛选的印度法律知识库(包括宪法条款、成文法、判例法及司法先例),通过Mastra TypeScript框架协调主代理与专门处理法律研究、文档摘要、判例检索和文书辅助的子代理,并引入合规模块对输出进行验证,从而显著提升法律工作的自动化水平与准确性。

链接: https://arxiv.org/abs/2605.10155
作者: Deepanshu,Divi Saxena,Deepali Rana,Ayesha Varshney,Sahinur Rahman Laskar
机构: 未知
类目: Computation and Language (cs.CL)
备注: 3 pages, 1 figure

点击查看摘要

Abstract:Legal information in India remains largely inaccessible due to the complexity of legal language and the sheer volume of legal documentation involved in research and case analysis. This paper presents NyayaAI, an AI-powered legal assistant that automates and simplifies legal workflows for lawyers, law students, and general users. The system combines Large Language Models with a Retrieval-Augmented Generation pipeline grounded in a curated Indian legal knowledge base comprising constitutional provisions, statutes, case laws, and judicial precedents. A multi-agent architecture orchestrated through the Mastra TypeScript framework coordinates a main agent with specialized sub-agents handling legal research, document summarization, case law retrieval, and drafting assistance. A compliance module validates all responses before delivery. Domain classification achieved 70% precision across test samples, with RAG retrieval precision at 74% and overall response accuracy at 72%, demonstrating that structured multi-agent LLM systems can meaningfully improve legal accessibility and workflow efficiency. The code (this https URL) is made publicly available for the benefit of the research community.

[NLP-67] Synthetic Pre-Pre-Training Improves Language Model Robustness to Noisy Pre-Training Data

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在预训练阶段因使用包含噪声的网络规模语料库而导致性能下降的问题。现有数据清洗方法虽可缓解噪声影响,但无法完全消除其干扰,因此模型仍面临噪声污染的风险。论文提出一种轻量级的预预训练(Pre-pre-training, PPT)阶段,利用具有可学习时间结构的合成数据进行初始化,以增强模型在后续主预训练(Pre-training, PT)中对噪声的鲁棒性。其关键创新在于:PPT并非直接抑制对噪声 token 的注意力,而是通过引导模型优化轨迹,使模型在面对噪声时逐步降低对 corrupted tokens 之间的注意力权重,从而抑制噪声自建模(noise self-modeling),提升整体鲁棒性。实验表明,仅用65M合成token即可显著减少自然文本预训练token消耗,同时达到与基线相当的最终损失。

链接: https://arxiv.org/abs/2605.10129
作者: Xu Guo,Runyu Peng,Jian Tong,Yunhua Zhou,Haijun Lv,Zhihui Lu,Qipeng Guo
机构: Shanghai AI Laboratory (上海人工智能实验室); Shanghai Innovation Institute (上海创新研究院); Fudan University (复旦大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) rely on web-scale corpora for pre-training. The noise inherent in these datasets tends to obscure meaningful patterns and ultimately degrade model performance. Data curation mitigates but cannot eliminate such noise, so pre-training corpora remain noisy in practice. We therefore study whether a lightweight pre-pre-training (PPT) stage based on synthetic data with learnable temporal structure helps resist noisy data during the pre-training (PT) stage. Across various corruption settings, our method consistently improves robustness to noise during PT, with larger relative gains at higher noise levels. For a 1B-parameter model, a synthetic PPT stage with only 65M tokens achieves the same final loss as the baseline while using up to 49% fewer natural-text PT tokens across different noise levels. Mechanistic analyses suggest PPT does not immediately suppress attention to noisy tokens. Rather, PPT-initialized models gradually downweight attention between corrupted tokens during noisy PT. This indicates that synthetic PPT inhibits noise self-modeling and shapes the subsequent optimization trajectory. Code is available at this https URL.

[NLP-68] SkillRAE: Agent Skill-Based Context Compilation for Retrieval-Augmented Execution

【速读】: 该论文旨在解决生成式 AI(Generative AI)中基于大语言模型(Large Language Model, LLM)的智能体在执行复杂任务时,如何有效组织已检索到的技能证据以形成紧凑、可接地且即刻可用的任务上下文问题。现有方法多关注技能检索与任务执行优化,忽视了对所选技能证据的结构化编排。其解决方案的关键在于提出SkillRAE,一种两阶段的检索增强执行(Retrieval-Augmented Execution, RAE)框架:首先在离线索引阶段构建多层级技能图(skill graph),显式建模技能社区、技能及可复用子单元之间的关系;其次在在线检索阶段实施基于子单元证据导出的技能排序检索,并通过“救援感知”的紧凑编译机制恢复关键证据,从而将粗粒度技能集合转化为面向下游执行器的任务特定上下文,显著提升任务完成效果。

链接: https://arxiv.org/abs/2605.10114
作者: Xiangcheng Meng,Shu Wang,Yixiang Fang
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Model (LLM)-based agents (e.g., OpenClaw) increasingly rely on reusable skill libraries to solve artifact-rich tasks such as document-centric workflows and data-intensive analysis. As these libraries grow, a few works have attempted to study the Retrieval-Augmented Execution (RAE), which often first retrieves some external skills and other knowledge, then compiles the context using retrieved skills, and finally executes the task. Existing works mainly focus on optimizing skill retrieval and task execution, and they pay little attention to how to effectively organize the selected skill evidence in a form that is compact, grounded, and immediately usable for the downstream executors to complete tasks. To fill this gap, we propose SkillRAE, a two-stage RAE approach focusing on skill-based context compilation, which consists of the offline and online stages. Specifically, in the offline indexing stage, it builds a multi-level skill graph over skill communities, skills, and reusable subunits, for capturing their relationships. In the online retrieval stage, it first performs skill-ranked retrieval with selected-subunit evidence export in the graph, and then applies rescue-aware compact compilation to recover the key evidence. Together, these components compile a coarse-ranked skill set into a task-specific context that is compact, grounded, and immediately usable. Experiments on two public benchmarks show that SkillRAE achieves a significant improvement over baselines for RAE. For example, on SkillsBench, it achieves an improvement of 11.7% over the SOTA method. Ablation studies further show that our context compilation is crucial, instead of a mere prompt addition.

[NLP-69] GLiNER-Relex: A Unified Framework for Joint Named Entity Recognition and Relation Extraction

【速读】: 该论文旨在解决联合命名实体识别(Named Entity Recognition, NER)与关系抽取(Relation Extraction, RE)任务中传统方法需分别建模、难以灵活适应新类别的问题。其核心挑战在于如何构建一个统一框架,在不重新训练模型的前提下实现任意实体类型和关系类型的零样本抽取。解决方案的关键在于提出 GLiNER-Relex,该架构基于共享的双向 Transformer 编码器对文本、实体类型标签和关系类型标签进行联合表示,并通过专门的关系评分模块将识别出的实体对嵌入与关系类型嵌入进行匹配,从而在单次前向传播中同时完成实体识别与关系抽取,支持推理时动态指定实体和关系类型,兼具高性能与计算效率。

链接: https://arxiv.org/abs/2605.10108
作者: Ihor Stepanov,Oleksandr Lukashov,Mykhailo Shtopko,Vivek Kalyanarangan
机构: Knowledgator Engineering (Knowledgator Engineering); Baldor Technologies Pvt. Ltd. (IDfy) (Baldor Technologies Pvt. Ltd. (IDfy))
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 19 pages, 1 figure, 2 tables

点击查看摘要

Abstract:Joint named entity recognition (NER) and relation extraction (RE) is a fundamental task in natural language processing for constructing knowledge graphs from unstructured text. While recent approaches treat NER and RE as separate tasks requiring distinct models, we introduce GLiNER-Relex, a unified architecture that extends the GLiNER framework to perform both entity recognition and relation extraction in a single model. Our approach leverages a shared bidirectional transformer encoder to jointly represent text, entity type labels, and relation type labels, enabling zero-shot extraction of arbitrary entity and relation types specified at inference time. GLiNER-Relex constructs entity pair representations from recognized spans and scores them against relation type embeddings using a dedicated relation scoring module. We evaluate our model on four standard relation extraction benchmarks: CoNLL04, DocRED, FewRel, and CrossRE, and demonstrate competitive performance against both specialized relation extraction models and large language models, while maintaining the computational efficiency characteristic of the GLiNER family. The model is released as an open-source Python package with a simple inference API that allows users to specify arbitrary entity and relation type labels at inference time and obtain both entities and relation triplets in a single call. All models and code are publicly available.
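
摘要中“将实体对表示与关系类型嵌入进行匹配打分”的关系评分模块,大致可以写成下面这样的 PyTorch 草图(层结构与维度均为演示假设,非官方实现):

```python
import torch
import torch.nn as nn

class RelationScorer(nn.Module):
    """示意:拼接头/尾实体的 span 表示并投影,再与关系类型嵌入做点积打分。"""
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.pair_proj = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.GELU(), nn.Linear(hidden, hidden)
        )

    def forward(self, head: torch.Tensor, tail: torch.Tensor,
                rel_embs: torch.Tensor) -> torch.Tensor:
        # head / tail: [num_pairs, hidden] 已识别实体对的 span 表示
        # rel_embs:    [num_relations, hidden] 推理时任意指定的关系标签嵌入
        pair = self.pair_proj(torch.cat([head, tail], dim=-1))
        return pair @ rel_embs.T          # [num_pairs, num_relations]

scorer = RelationScorer()
scores = scorer(torch.randn(4, 768), torch.randn(4, 768), torch.randn(3, 768))
probs = scores.sigmoid()   # 每个(实体对, 关系)的独立概率,按阈值即可筛出三元组
```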

[NLP-70] FERA: Uncertainty-Aware Federated Reasoning for Large Language Models

【速读】: 该论文旨在解决多组织间因监管、产权或机构限制无法集中训练数据时,如何在不共享原始数据的前提下提升大型语言模型(Large Language Models, LLMs)的多步推理能力的问题。其核心挑战在于客户端的可靠性具有查询依赖性,而服务器无法访问客户端数据以判断贡献的可信度。解决方案的关键是提出一种无需训练的联邦推理框架——不确定性感知联邦推理(Uncertainty-Aware Federated Reasoning, FERA),通过迭代式服务器-客户端协同精炼机制实现:客户端生成带轻量级不确定性估计的推理轨迹,服务器据此合成改进后的推理结果并作为上下文反馈给客户端;在每轮内,采用不确定性感知自省聚合(Uncertainty-Aware Self-Critique Aggregation, UA-SCA)方法,基于查询依赖的信任权重和结构化的跨客户端验证机制处理异构推理轨迹间的冲突,而非简单丢弃低质量轨迹,而是修正错误步骤以保留有用信息,从而实现推理性能的持续提升。理论分析表明该协议收敛且不确定性加权可加速收敛,实验验证了FERA在多个推理基准上优于联邦训练与无训练基线,同时保持通信与计算效率。

链接: https://arxiv.org/abs/2605.10082
作者: Ruhan Wang,Chengkai Huang,Zhiyong Wang,Junda Wu,Rui Wang,Tong Yu,Julian McAuley,Lina Yao,Dongruo Zhou
机构: Indiana University; The University of New South Wales; The Chinese University of Hong Kong; University of California San Diego; Adobe Research
类目: Computation and Language (cs.CL)
备注: 44 pages, 8 figures

点击查看摘要

Abstract:Large language models (LLMs) exhibit strong reasoning capabilities when guided by high-quality demonstrations, yet such data is often distributed across organizations that cannot centralize it due to regulatory, proprietary, or institutional constraints. We study federated reasoning, where a server improves multi-step reasoning by coordinating with heterogeneous clients holding private demonstrations, without centralized training or raw data sharing. The key challenge is that client reliability is query-dependent, while the server cannot inspect client data to determine which contributions are trustworthy. To address this, we propose Uncertainty-Aware Federated Reasoning (FERA), a training-free framework based on iterative server-client co-refinement. Across communication rounds, clients generate reasoning traces with lightweight uncertainty estimates, and the server synthesizes them into improved reasoning that is redistributed as context for the next round, progressively improving both server outputs and client-side reasoning. Within each round, Uncertainty-Aware Self-Critique Aggregation (UA-SCA) resolves conflicts among heterogeneous client traces through query-dependent trust weighting and structured cross-client verification. Rather than simply discarding low-quality traces, UA-SCA revises flawed reasoning steps to recover useful information. We provide theoretical guarantees showing that the proposed iterative protocol converges and that uncertainty-aware weighting accelerates convergence. Experiments on multiple reasoning benchmarks show that FERA consistently outperforms both federated training and training-free baselines, achieving progressively higher accuracy across rounds while maintaining communication and computational efficiency.
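
UA-SCA 中“按查询相关的信任权重聚合异构推理轨迹”的核心想法,可以用下面的小例子体会(权重公式 w ∝ exp(-不确定性) 为示意性假设,论文中的跨客户端验证与修订步骤此处省略):

```python
import math

def aggregate_client_traces(traces):
    """traces: [{"client": str, "answer": str, "uncertainty": float}, ...]
    uncertainty 为客户端对本条查询自报的轻量不确定性(越小越可信)。"""
    raw = [math.exp(-t["uncertainty"]) for t in traces]
    z = sum(raw)
    weights = [r / z for r in raw]
    votes = {}
    for t, w in zip(traces, weights):
        votes[t["answer"]] = votes.get(t["answer"], 0.0) + w
    best = max(votes, key=votes.get)     # 服务器在此基础上再做跨客户端验证与修订
    return best, weights

traces = [
    {"client": "A", "answer": "42", "uncertainty": 0.2},
    {"client": "B", "answer": "41", "uncertainty": 1.5},
    {"client": "C", "answer": "42", "uncertainty": 0.8},
]
print(aggregate_client_traces(traces))   # 倾向于低不确定性客户端给出的 "42"
```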

[NLP-71] PHAGE: Patent Heterogeneous Attention-Guided Graph Encoder for Representation Learning

【速读】: 该论文旨在解决现有专利编码方法忽略权利要求之间层次依赖结构的问题,即传统模型将权利要求线性化为文本序列,从而丢失了专利内部的有向依赖关系(directed dependency structure)。这种结构在专利中表现为从属权利要求继承并细化前序权利要求的保护范围。解决方案的关键在于提出PHAGE框架:首先通过确定性的图构建流程区分高置信度的法律引用关系与噪声较大的技术关联关系,保留不同类型边的语义差异;其次设计连通性掩码和可学习的关系感知偏置项,将权利要求层级拓扑映射到token级别注意力机制中,实现对不同关系类型的差异化加权;最后引入双粒度对比目标,在跨专利分类任务和同专利内拓扑一致性之间对齐表示,从而显著提升专利分类、检索与聚类性能。

链接: https://arxiv.org/abs/2605.10073
作者: Yongmin Yoo,Qiongkai Xu,Zhangkai Wu,Longbing Cao
机构: Macquarie University (麦考瑞大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Patent claims form a directed dependency structure in which dependent claims inherit and refine the scope of earlier claims; however, existing patent encoders linearize claims as text and discard this hierarchy. Directly encoding this structure into self-attention poses two challenges: claim dependencies mix relation types that differ in semantics and extraction reliability, and the dependency graph is defined over claims while Transformers attend over tokens. PHAGE addresses the first challenge through a deterministic graph construction pipeline that separates near-deterministic legal citations from noisier rule-based technical relations, preserving type distinctions as heterogeneous edges. It addresses the second through a connectivity mask and learnable relation-aware biases that lift claim-level topology into token-level attention, allowing the encoder to differentially weight each relation type. A dual-granularity contrastive objective then aligns representations with both inter-patent taxonomy and intra-patent topology. PHAGE outperforms all baselines on classification, retrieval, and clustering, showing that intra-document claim topology is a stronger inductive bias than inter-document structure and that this bias persists in the encoder weights after training.

[NLP-72] NCO: A Versatile Plug-in for Handling Negative Constraints in Decoding

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成过程中难以有效控制多种硬性禁止约束(如敏感词、正则表达式匹配的非法模式)的问题,尤其是在不引入显著计算开销或状态空间爆炸的前提下实现高效、实时的约束满足。其核心解决方案是提出一种名为NCO的解码策略,通过在线模式匹配机制在生成过程中动态追踪并规避所有硬约束,避免构建庞大且低效的单一自动机;同时支持标准推理方法(如采样和束搜索),并引入软掩码机制实现概率性抑制,从而在保持模型输出质量的同时提升约束执行的效率与可扩展性。

链接: https://arxiv.org/abs/2605.10065
作者: Hyundong Jin,Yo-Sub Han
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Controlling Large Language Models (LLMs) to prevent the generation of undesirable content, such as profanity and personally identifiable information (PII), has become increasingly critical. While earlier approaches relied on post-processing or resampling, recent research has shifted towards constrained decoding methods that control outputs during generation to mitigate high computational costs and quality degradation. However, preventing multiple forbidden hard constraints or regex constraints from appearing anywhere in the output is computationally challenging. A straightforward solution is to convert these constraints into a single automaton that tracks all forbidden patterns during decoding, but this often becomes impractically large. Standard regex engines also do not readily support the operations needed to build such a constraint, such as complement and intersection. In order to address these limitations, we propose NCO, a decoding strategy that performs online pattern matching over finite hard constraints and regex constraints, reducing computational overhead without inducing state explosion. NCO is fully compatible with standard inference strategies, including various sampling methods and beam search, while also supporting soft masking for probabilistic suppression. We empirically demonstrate its effectiveness across practical tasks, including PII and profanity suppression. Our implementation is available at this https URL .
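
“在线模式匹配 + 解码时屏蔽”的思路可以用一个简化的 logits 掩码函数来示意:每一步只需检查“旧文本的短后缀 + 候选 token”是否会拼出禁止模式,无需构建覆盖所有模式的单一大自动机。以下代码是对该思想的简化复现(只处理固定字符串模式;正则约束在论文中由相应的匹配状态跟踪实现,此处从略):

```python
import math

def nco_mask_logits(decoded_text, vocab, logits, forbidden, penalty=None):
    """decoded_text: 已生成文本;vocab: token id -> 字符串;logits: 与 vocab 对齐的分数;
    forbidden: 禁止出现的字符串模式;penalty=None 表示硬屏蔽,否则为软掩码(概率性抑制)。"""
    max_len = max(len(p) for p in forbidden)
    new_logits = list(logits)
    for tid, piece in vocab.items():
        # 新出现的模式必然"收尾"于新 token,因此只需检查这段滑动窗口
        window = decoded_text[-(max_len - 1):] + piece if max_len > 1 else piece
        if any(p in window for p in forbidden):
            new_logits[tid] = -math.inf if penalty is None else new_logits[tid] - penalty
    return new_logits

vocab = {0: "he", 1: "ll", 2: "lo", 3: "x"}
print(nco_mask_logits("he", vocab, [1.0, 2.0, 3.0, 0.5], forbidden=["hell"]))
# token "ll" 会拼出 "hell",被硬屏蔽为 -inf;其余 logits 保持不变
```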

[NLP-73] Not-So-Strange Love: Language Models and Generative Linguistic Theories are More Compatible than They Appear

【速读】: 该论文试图解决的问题是:如何在神经语言模型(Neural Language Models, LMs)的框架下,拓展可被验证的语言理论类型,特别是将生成式语法(Generative Grammar)传统中的形式结构理论纳入LMs的解释范围,从而弥合使用基础理论(Usage-Based Theory)与生成主义理论之间的分歧。其解决方案的关键在于论证LMs不仅能够支持基于梯度和使用经验的语言习得机制,还能体现以形式结构为核心的理论假设——即LMs具备模拟生成语法中抽象规则系统的能力,这为两种对立的语言学范式提供了统一的实验平台,推动二者在计算建模层面的整合与 reconciliations(调和)。

链接: https://arxiv.org/abs/2605.10061
作者: R. Thomas McCoy
机构: University of California, Berkeley (加州大学伯克利分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to Behavioral and Brain Sciences; 4 pages; Commentary on “How Linguistics Learned to Stop Worrying and Love the Language Models” by Richard Futrell and Kyle Mahowald

点击查看摘要

Abstract:Futrell and Mahowald (2025) frame the success of neural language models (LMs) as supporting gradient, usage-based linguistic theories. I argue that LMs can also instantiate theories based on formal structures - the types of theories seen in the generative tradition. This argument expands the space of theories that can be tested with LMs, potentially enabling reconciliations between usage-based and generative accounts.

[NLP-74] Swarm Skills: A Portable Self-Evolving Multi-Agent System Specification for Coordination Engineering

【速读】: 该论文旨在解决多智能体(Multi-Agent)协作协议难以跨系统共享与持续优化的问题。当前,单智能体技能已可通过可移植资产形式进行分发,但多智能体间的协调机制仍被锁定在特定框架的内部代码或静态配置中,限制了其通用性和进化能力。解决方案的关键在于提出“Swarm Skills”——一种扩展Anthropic Skills标准的可移植规范,它将多智能体工作流定义为第一类资产,包含角色、工作流、执行边界及内置语义结构以支持自演化。此外,论文设计了一种配套的自演化算法,能够自动从成功执行轨迹中提炼出新的Swarm Skills,并基于多维评分(有效性、利用率和新鲜度)持续修补现有协议,从而实现无需人工干预的自主优化,最终通过渐进披露机制达成零适配器跨智能体可移植性,避免框架锁定。

链接: https://arxiv.org/abs/2605.10052
作者: Xinyu Zhang,Zhicheng Dou,Deyang Li,Jianjun Tao,Shuo Cheng,Ruifeng Shi,Fangchao Liu,Enrui Hu,Yangkai Ding,Hongbo Wang,Qi Ye,Xuefeng Jin,Zhangchun Zhao
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As artificial intelligence engineering paradigms shift from single-agent Prompt and Context Engineering toward multi-agent Coordination Engineering, the ability to codify and systematically improve how multiple agents collaborate has emerged as a critical bottleneck. While single-agent skills can now be distributed as portable assets, multi-agent coordination protocols remain locked within framework-internal code or static configurations, preventing them from being shared across systems or autonomously improved over time. We propose Swarm Skills, a portable specification that extends the Anthropic Skills standard with multi-agent semantics. Swarm Skills turns multi-agent workflows into first-class, distributable assets that consist of roles, workflows, execution bounds, and a built-in semantic structure for self-evolution. To operationalize the specification’s evolving nature, we present a companion self-evolution algorithm that automatically distills successful execution trajectories into new Swarm Skills and continuously patches existing ones based on multi-dimensional scoring (Effectiveness, Utilization, and Freshness), eliminating the need for human-in-the-loop oversight during the refinement process. Through an architectural compatibility analysis and a comprehensive qualitative case study using the open-source JiuwenSwarm reference implementation, we demonstrate how Swarm Skills achieves zero-adapter cross-agent portability via progressive disclosure, enabling agent teams to self-evolve their coordination strategies without framework lock-in.

[NLP-75] Personalizing LLMs with Binary Feedback: A Preference-Corrected Optimization Framework

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)个性化过程中忽视用户间差异的问题,现有方法通常仅基于孤立的用户历史数据进行调整,未能有效建模不同用户之间的偏好差异。其解决方案的关键在于提出C-BPO框架,通过将目标用户的交互数据视为正向反馈信号,其他用户的数据作为隐式的负向信号,从而捕捉跨用户差异;同时引入基于正样本-未标记样本(Positive-Unlabeled, PU)学习理论的目标函数,通过减去“正向偏置”来净化负样本信号,避免因任务知识共享而误判为负样本的情况,从而在保持通用帮助性的同时精准对齐个体独特偏好。

链接: https://arxiv.org/abs/2605.10043
作者: Xilai Ma,Liye Zhao,Weijun Yao,Haibing Di,Wenya Wang,Jing Li
机构: Harbin Institute of Technology, Shenzhen, China; Huawei Technologies Co., Ltd.; Nanyang Technological University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2026 Main

点击查看摘要

Abstract:Large Language Model (LLM) personalization aims to align model behaviors with individual user preferences. Existing methods often focus on isolated user histories, neglecting the essential role of inter-user differences. We propose C-BPO, a framework that personalizes LLMs via preference-calibrated binary signals. By treating target user data as positive feedback and other users’ data as an auxiliary set of implicit negative signals, C-BPO captures distinct inter-user differences. To mitigate the preference overlap issue, where shared task knowledge is erroneously penalized, we derive an objective grounded in Positive-Unlabeled (PU) learning theory. This approach purifies negative signals by subtracting “positive bias”, ensuring alignment with unique idiosyncrasies without compromising general helpfulness. Empirical experiments across various personalization tasks and backbone LLMs show C-BPO consistently outperforms baselines, demonstrating the efficacy of preference-calibrated binary signals in modeling inter-user differences.

[NLP-76] Instruction Adherence in Coding Agent Configuration Files: A Factorial Study of Four File-Structure Variables

【速读】: 该论文旨在解决生成式 AI(Generative AI)在代码生成任务中对配置文件结构敏感性的问题,即不同文件结构变量(如文件大小、指令位置、文件架构及相邻文件冲突)是否显著影响编码代理(coding agent)对目标注释的遵循程度。其关键解决方案在于设计并执行一项系统性的因子实验,通过控制四个结构变量,在1,650次Claude Code CLI会话中测量函数级合规性(compliance),结合混合效应模型与贝叶斯分析方法进行统计推断。结果显示,所有结构变量及其两两交互均未在多重检验校正后表现出可检测差异,而每增加一个生成函数,合规概率下降约5.6%(OR = 0.944),且该效应具有非单调性,说明合规性主要受会话内序列动态影响,而非静态文件结构。

链接: https://arxiv.org/abs/2605.10039
作者: Damon McMillan
机构: 未知
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注: 18 pages, 5 figures, 5 tables

点击查看摘要

Abstract:Frontier coding agents read configuration files (this http URL, this http URL, Cursor Rules) at session start and are expected to follow the conventions inside them. Practitioners assume that structural choices (file size, instruction position, file architecture, contradictions in adjacent files) measurably affect adherence. We report a systematic factorial study of these choices using four manipulated variables, measuring compliance with a trivial target annotation across 1,650 Claude Code CLI sessions (16,050 function-level observations) on two TypeScript codebases, three frontier models (primarily Sonnet 4.6, with Opus 4.6 as a CLI-matched cross-model check and Opus 4.7 reported descriptively under a CLI-version confound), and five coding tasks. We use mixed-effects models with a Bayesian companion. None of the four structural variables or three two-way interactions produces a detectable contrast after multiple-testing correction. Size and conflict nulls are supported by affirmative-null Bayes factors (BF10 between 0.05 and 0.10); position and architecture nulls are failures to reject without Bayes-factor support. The largest effect we measured is within-session: each additional function the agent generates is associated with approximately 5.6% lower odds of compliance per step (OR = 0.944) within the session-length range we tested, though the relationship is non-monotonic rather than a constant per-step effect. This reproduces on a second TypeScript codebase and on Opus 4.6 at matched configuration; it was identified during analysis rather than pre-specified. Within the conditions tested, file-structure variables did not produce detectable contrasts; compliance varies systematically between coding tasks and across each session’s sequence of generated functions.

[NLP-77] PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning

【速读】: 该论文旨在解决植物细胞类型特异性标记基因(cell-type-specific marker genes)的文献证据解析问题,现有资源多依赖人工整理数据库或高通量研究,缺乏对科学文献中支持证据的显式建模。其解决方案的关键在于构建PlantMarkerBench——一个跨物种(拟南芥、玉米、水稻和番茄)的基准数据集,通过模块化文献筛选流程整合大规模文献检索、混合搜索策略、物种感知的生物实体定位、结构化证据抽取与人工校验,形成5,550条句子级别的标注数据,涵盖标记基因-细胞类型对的有效性判断及证据类型分类(表达、定位、功能、间接、否定)。该框架为评估生成式AI模型在真实生物学语境下的证据推理能力提供了挑战性且可复现的基准,推动可信科学信息提取与AI辅助植物生物学的发展。

链接: https://arxiv.org/abs/2605.10032
作者: Sajib Acharjee Dip,Song Li,Liqing Zhang
机构: Virginia Tech (弗吉尼亚理工大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Cell-type-specific marker genes are fundamental to plant biology, yet existing resources primarily rely on curated databases or high-throughput studies without explicitly modeling the supporting evidence found in scientific literature. We introduce PlantMarkerBench, a multi-species benchmark for evaluating literature-grounded plant marker evidence interpretation from full-text biological papers. PlantMarkerBench is constructed using a modular curation pipeline integrating large-scale literature retrieval, hybrid search, species-aware biological grounding, structured evidence extraction, and targeted human review. The benchmark spans four plant species – Arabidopsis, maize, rice, and tomato – and contains 5,550 sentence-level evidence instances annotated for marker-evidence validity, evidence type, and support strength. We define two benchmark tasks: determining whether a candidate sentence provides valid marker evidence for a gene-cell-type pair, and classifying the evidence into expression, localization, function, indirect, or negative categories. We benchmark diverse open-weight and closed-source language models across species and prompting strategies. Although frontier models achieve relatively strong performance on direct expression evidence, performance drops substantially on functional, indirect, and weak-support evidence, with evidence-type confusion emerging as a dominant failure mode. Open-weight models additionally exhibit elevated false-positive rates under ambiguous biological contexts. PlantMarkerBench provides a challenging and reproducible evaluation framework for literature-grounded biological evidence attribution and supports future research on trustworthy scientific information extraction and AI-assisted plant biology.

[NLP-78] Speech-based Psychological Crisis Assessment using LLMs

【速读】: 该论文旨在解决心理危机热线中危机等级分类自动化不足的问题,当前依赖人工判断存在主观差异且受人力限制。解决方案的关键在于提出一种基于大语言模型(Large Language Model, LLM)的框架,通过引入声学特征注入(paralinguistic injection)方法将识别出的非言语情感线索嵌入语音转录文本,使LLM能够融合关键声学细节进行推理;同时设计了一种增强推理能力的训练策略,以生成诊断推理链作为辅助任务,起到正则化作用从而提升分类性能。最终系统在三分类任务上达到宏F1分数0.802和准确率0.805。

链接: https://arxiv.org/abs/2605.10027
作者: Terumi Chiba,Yang Luo,Ziyun Cui,Yongsheng Tong,Chao Zhang
机构: Tsinghua University (清华大学); Peking University Huilongguan Clinical Medical School (北京大学回龙观临床医学院); WHO Collaborating Centre for Research and Training in Suicide Prevention (世界卫生组织自杀预防研究与培训合作中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 5 pages, 5 figures

点击查看摘要

Abstract:Psychological support hotlines provide critical support for individuals experiencing mental health emergencies, yet current assessments largely rely on human operators whose judgments may vary with professional experience and are constrained by limited staffing resources. This paper proposes a large language model (LLM)-based framework for automated crisis level classification, a key indicator that supports many downstream tasks and improves the overall quality of hotline services. To better capture emotional signals in spoken conversations, we introduce a paralinguistic injection method that inserts identified non-verbal emotional cues into speech transcripts, enabling LLM-based reasoning to incorporate critical acoustic nuances. In addition, we propose a reasoning-enhanced training strategy that trains the model to generate diagnostic reasoning chains as an auxiliary task, which serves as a regulariser to improve classification performance. Combined with data augmentation, our final system achieves a macro F1-score of 0.802 and an accuracy of 0.805 on the three-class classification task under 5-fold cross-validation.
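
摘要中的“副语言线索注入”做法非常直接:把识别出的非言语情绪线索以文本标签的形式内联进转写,再交给 LLM 推理。以下为示意代码(标签格式与字段名均为假设):

```python
def inject_paralinguistic_cues(segments):
    """segments: [{"speaker": str, "text": str, "cues": [str, ...]}, ...]
    返回可直接放入 LLM 提示词的带标注转写。"""
    lines = []
    for seg in segments:
        cue_str = "".join(f"[{c}]" for c in seg.get("cues", []))
        lines.append(f'{seg["speaker"]}: {cue_str}{seg["text"]}')
    return "\n".join(lines)

transcript = inject_paralinguistic_cues([
    {"speaker": "来电者", "text": "我最近真的撑不下去了。", "cues": ["哭泣", "声音颤抖"]},
    {"speaker": "接线员", "text": "我在听,你可以慢慢说。", "cues": []},
])
prompt = ("以下是心理援助热线通话转写(含非言语线索标注),"
          "请先给出诊断推理,再输出危机等级(低/中/高):\n" + transcript)
```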

[NLP-79] Medical Incident Causal Factors and Preventive Measures Generation Using Tag-based Example Selection in Few-shot Learning

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在高风险领域如医疗中生成临床洞察时的可靠性问题,特别是在从医疗事故报告中提取背景/因果因素及预防措施时的准确性与稳定性不足。解决方案的关键在于提出一种基于标签(tag-based)的少样本示例选择方法,通过利用数据集中已有的结构化标签(如“药物”、“输血治疗”等)来筛选最相关的示例用于提示(prompting),从而显著提升生成结果的精度和一致性,相较于随机采样和基于余弦相似度的选择策略,该方法能更有效地避免不安全输出并减少触发安全过滤机制的情况。

链接: https://arxiv.org/abs/2605.10025
作者: Yuna Haseyama,Tomoki Ito,Hiroki Sakaji,Itsuki Noda
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In high-stakes domains such as healthcare, the reliability of Large Language Models (LLMs) is critical, particularly when generating clinical insights from incident reports. This study proposes a tag-based few-shot example selection method for prompting LLMs to generate background/causal factors and preventive measures from details of the medical incidents. For our experiments, we use the Japanese Medical Incident Dataset (JMID), a structured dataset of 3,884 real-world medical accident and near-miss reports. These reports are variably annotated with a wide range of tags–some include descriptive information (e.g., “medications,” “blood transfusion therapy”). We compare three few-shot example selection strategies–random sampling, cosine similarity-based selection, and our proposed tag-based method–using GPT-4o and LLaMA 3.3. Results show that the tag-based approach achieves the highest precision and most stable generation behavior, while similarity-based selection often leads to unintended outputs and safety filter activation. These findings suggest that selecting examples based on human-interpretable dataset tags can improve generation precision and stability in clinical LLM applications.
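
“基于标签的少样本示例选择”本质上就是按标签重合度检索历史案例,示意如下(字段名为演示假设):

```python
def select_examples_by_tags(query_tags, candidates, k=3):
    """query_tags: 当前事故报告的标签集合;candidates: 带 tags 字段的历史案例列表。
    返回与查询标签重合度最高的 k 条案例,用作 few-shot 提示中的示例。"""
    overlap = lambda c: len(query_tags & c["tags"])
    ranked = sorted(candidates, key=overlap, reverse=True)
    return [c for c in ranked if overlap(c) > 0][:k]

examples = select_examples_by_tags(
    {"medications"},
    [
        {"tags": {"medications", "剂量错误"}, "report": "…", "factors": "…", "measures": "…"},
        {"tags": {"blood transfusion therapy"}, "report": "…", "factors": "…", "measures": "…"},
    ],
)
# 随后把 examples 逐条格式化为 "事故详情 -> 背景/原因 -> 预防措施" 的 few-shot 提示
print(len(examples))  # -> 1
```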

[NLP-80] Annotations Mitigate Post-Training Mode Collapse ICML2026

【速读】: 该论文旨在解决后训练(如监督微调,Supervised Fine-Tuning, SFT)过程中导致的语义模式坍缩(semantic mode collapse)问题,即模型在优化指令遵循能力时,过度偏向低熵的微调数据分布,从而牺牲了预训练阶段所具有的高熵语义多样性。关键解决方案是提出“标注锚定训练”(annotation-anchored training),其核心在于:在预训练阶段使用带有语义标注的文档对进行训练,构建一个能反映预训练数据全貌的丰富标注分布,并在后训练中保留该分布;推理时通过采样多样化的标注作为锚点引导生成,从而将预训练的语义丰富性有效迁移至后训练模型中,显著减少语义多样性损失(实验证明可降低6倍以上)。

链接: https://arxiv.org/abs/2605.09995
作者: Jacob Mitchell Springer,Madhu Advani,Lukas Aichberger,Arwen Bradley,Eran Malach,Omid Saremi,Sinead Williamson,Preetum Nakkiran,Etai Littwin,Aditi Raghunathan
机构: 未知
类目: Computation and Language (cs.CL)
备注: 21 pages, 8 figures, 11 tables. Accepted at ICML 2026

点击查看摘要

Abstract:Post-training (via supervised fine-tuning) improves instruction-following, but often induces semantic mode collapse by biasing models toward low-entropy fine-tuning data at the expense of the high-entropy pretraining distribution. Crucially, we find this trade-off worsens with scale. To close this semantic diversity gap, we propose annotation-anchored training, a principled method that enables models to adopt the preference-following behaviors of post-training without sacrificing the inherent diversity of pretraining. Our approach is simple: we pretrain on documents paired with semantic annotations, inducing a rich annotation distribution that reflects the full breadth of pretraining data, and we preserve this distribution during post-training. This lets us sample diverse annotations at inference time and use them as anchors to guide generation, effectively transferring pretraining’s semantic richness into post-trained models. We find that models trained with annotation-anchored training can attain 6× less diversity collapse than models trained with SFT, and improve with scale.

[NLP-81] Merlin: Deterministic Byte-Exact Deduplication for Lossless Context Optimization in Large Language Model Inference

【速读】: 该论文旨在解决数据密集型应用(如大规模检索系统和先进数据流水线)因处理高度冗余文本语料而导致的性能瓶颈问题。其解决方案的核心在于提出 Merlin 系统,该系统采用本地优先、无特定依赖的架构,通过一个高度优化的 SIMD 友好型开放寻址扁平哈希集合(open-addressing flat hash set)与 xxHash3-64 哈希算法相结合,实现对文本片段和数据块的快速、字节级精确去重(deduplication)。该方法在保持绝对数据一致性的前提下,显著降低输入数据量(实测减少 13.9% 至超过 71%),尤其适用于大型语言模型(LLM)生态中的检索增强生成(RAG)场景,并支持以 Model Context Protocol (MCP) 为接口的零网络拦截部署,实现高达 8.7 GB/s 的持续处理速度。

链接: https://arxiv.org/abs/2605.09990
作者: Sietse Schelpe
机构: 未知
类目: Computation and Language (cs.CL)
备注: Preprint. Implementation and open-source community version available at: this https URL - this https URL

点击查看摘要

Abstract:Data-intensive applications, ranging from large-scale retrieval systems to advanced data pipelines, are increasingly bottlenecked by the processing of highly redundant text corpora. We present Merlin, a local-first, agnostic, high-throughput deduplication and context optimization engine designed to mitigate these inefficiencies. Utilizing a highly optimized, SIMD-friendly open-addressing flat hash set combined with xxHash3-64, Merlin performs rapid, byte-exact deduplication of text passages and data chunks. While broadly applicable to any text-processing workflow, its impact is particularly pronounced in Large Language Model (LLM) ecosystems, such as Retrieval-Augmented Generation (RAG). Our empirical evaluations demonstrate an input reduction ranging from 13.9% in low-redundancy datasets to over 71% in high-redundancy pipelines, maintaining absolute data fidelity. Furthermore, we detail the system’s integration architecture via the Model Context Protocol (MCP), enabling secure, zero-network-interception deployment across major IDEs and autonomous agents. This paper outlines the core algorithmic design, performance benchmarks, and the architectural principles required to process data at sustained speeds of up to 8.7 GB/s.
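
Merlin 的核心逻辑(按块做字节级哈希去重)可以用几行 Python 说明;真实系统依赖开放寻址扁平哈希集合与 SIMD 优化来达到 GB/s 级吞吐,这里只演示语义。假设使用 Python 的 xxhash 包(2.0 及以上版本提供 XXH3-64 接口);若对 64 位哈希的碰撞概率有顾虑,可在哈希命中时再做一次字节比对:

```python
import xxhash  # 假设:pip install xxhash(>=2.0)

def dedup_chunks(chunks):
    """对文本块做精确去重:相同字节串只保留第一次出现。"""
    seen = set()
    kept = []
    for chunk in chunks:
        h = xxhash.xxh3_64_intdigest(chunk.encode("utf-8"))
        if h not in seen:
            seen.add(h)
            kept.append(chunk)
    return kept

docs = ["第1段检索结果", "第2段检索结果", "第1段检索结果", "第2段检索结果 "]
print(len(dedup_chunks(docs)))  # -> 3:末尾多一个空格的块不算重复,体现"字节级精确"
```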

[NLP-82] GLiNER2-PII: A Multilingual Model for Personally Identifiable Information Extraction

【速读】: 该论文旨在解决个人身份信息(Personally Identifiable Information, PII)在现代数据处理系统中可靠检测的难题,尤其针对PII跨度异质性强、地域依赖性高、上下文敏感且常嵌入噪声或半结构化文档的问题。解决方案的关键在于提出一个小型(0.3B参数)的GLiNER2-PII模型,该模型基于GLiNER2架构并针对42类PII实体类型在字符级粒度上进行识别;同时,为缓解真实PII标注数据稀缺与隐私风险,研究者构建了一个包含4910条标注文本的多语言合成语料库,其通过约束驱动生成管道确保多样性与真实性,从而有效支撑模型训练。在SPY基准测试中,GLiNER2-PII在五种对比系统中实现了最高的span-level F1分数,验证了该方案的有效性。

链接: https://arxiv.org/abs/2605.09973
作者: Urchade Zaratiana,Ash Lewis,George Hurn-Maloney
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under submission

点击查看摘要

Abstract:Reliable detection of personally identifiable information (PII) is increasingly important across modern data-processing systems, yet the task remains difficult: PII spans are heterogeneous, locale-dependent, context-sensitive, and often embedded in noisy or semi-structured documents. We present GLiNER2-PII, a small 0.3B-parameter model adapted from GLiNER2 and designed to recognize a broad taxonomy of 42 PII entity types at character-span resolution. Training such systems, however, is constrained by the scarcity of shareable annotated data and the privacy risks associated with collecting real PII at scale. To address this challenge, we construct a multilingual synthetic corpus of 4,910 annotated texts using a constraint-driven generation pipeline that produces diverse, realistic examples across languages, domains, formats, and entity distributions. On the challenging SPY benchmark, GLiNER2-PII achieves the highest span-level F1 among five compared systems, including OpenAI Privacy Filter and three GLiNER-based detectors. We publicly release the model on Hugging Face to support further research and practical deployment of open PII detection systems.

[NLP-83] The Truth Lies Somewhere in the Middle (of the Generated Tokens)

【速读】: 该论文旨在解决如何将自回归生成的隐藏状态(hidden states)有效聚合为能够反映语言模型内部状态的统一表示问题。其关键解决方案是采用对生成token的隐藏状态进行均值池化(mean pooling),发现该方法所得到的语义表征优于单一token的表示,且在语言、视觉和蛋白质等多个领域中通过核对齐(kernel alignment)指标验证了其有效性。研究表明,信息分布在生成的多个token之间而非集中于单个位置,因此均值池化能更好地捕捉模型内部状态的全局语义特征。

链接: https://arxiv.org/abs/2605.09969
作者: Sophie L. Wang,Phillip Isola,Brian Cheung
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:How should hidden states generated autoregressively be collapsed into a representation that reflects a language model’s internal state? Despite tokens being generated under causal masking, we find that mean pooling across their hidden states yields more semantic representations than any individual token alone. We quantify this through kernel alignment to reference spaces in language, vision, and protein domains. The improvement through mean pooling is consistent with information being distributed across generated tokens rather than localized to a single position. Furthermore, representations derived from generated tokens outperform those from prompt tokens, and alignment across generation reveals interpretable dynamics in model behavior.
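
摘要的核心操作(对生成 token 的隐藏状态做均值池化)用 transformers 可以这样复现;模型名仅为演示假设,任何因果语言模型均适用:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"                      # 演示用小模型,可替换
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    gen = model.generate(**inputs, max_new_tokens=16, do_sample=False)
    out = model(gen, output_hidden_states=True)   # 对"提示+生成"整段再前向一次

hidden = out.hidden_states[-1][0]                 # [seq_len, hidden] 最后一层
n_prompt = inputs["input_ids"].shape[1]
gen_repr = hidden[n_prompt:].mean(dim=0)          # 生成 token 隐藏状态的均值池化
last_repr = hidden[-1]                            # 仅取最后一个 token,作为对照
```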

[NLP-84] G-Zero: Self-Play for Open-Ended Generation from Zero Data

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在开放性任务中因依赖外部代理判官(proxy LLM judges)而产生的能力瓶颈与奖励黑客(reward hacking)问题,尤其是在不可验证领域中难以实现持续自主进化的问题。其解决方案的关键在于提出 G-Zero 框架——一个无需验证器(verifier-free)、基于协同进化的自演化机制;核心创新为 Hint-δ,即一种内在奖励信号,用于量化生成模型(Generator)在无提示响应与接受自生成提示后响应之间的预测分布变化。该信号驱动提议模型(Proposer)通过 GRPO 算法持续生成挑战性查询和信息性提示以定位 Generator 的盲区,同时生成模型利用 DPO 算法内化这些提示引导的改进,从而实现无需外部监督的闭环自我优化。

链接: https://arxiv.org/abs/2605.09959
作者: Chengsong Huang,Haolin Liu,Tong Zheng,Runpeng Dai,Langlin Huang,Jinyuan Li,Zongxia Li,Zhepei Wei,Yu Meng,Jiaxin Huang
机构: Washington University in St. Louis (圣路易斯华盛顿大学); University of Virginia (弗吉尼亚大学); University of Maryland (马里兰大学); University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:Self-evolving LLMs excel in verifiable domains but struggle in open-ended tasks, where reliance on proxy LLM judges introduces capability bottlenecks and reward hacking. To overcome this, we introduce G-Zero, a verifier-free, co-evolutionary framework for autonomous self-improvement. Our core innovation is Hint-δ, an intrinsic reward that quantifies the predictive shift between a Generator model’s unassisted response and its response conditioned on a self-generated hint. Using this signal, a Proposer model is trained via GRPO to continuously target the Generator’s blind spots by synthesizing challenging queries and informative hints. The Generator is concurrently optimized via DPO to internalize these hint-guided improvements. Theoretically, we prove a best-iterate suboptimality guarantee for an idealized standard-DPO version of G-Zero, provided that the Proposer induces sufficient exploration coverage and the data filtration keeps pseudo-label score noise low. By deriving supervision entirely from internal distributional dynamics, G-Zero bypasses the capability ceilings of external judges, providing a scalable, robust pathway for continuous LLM self-evolution across unverifiable domains.
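
Hint-δ 的精确定义见论文;下面给出一个“提示引起的预测变化”的代理量计算草图,用“有提示/无提示两种条件下,同一候选回答的平均 token 对数似然之差”来近似,接口与提示模板均为假设:

```python
import torch
import torch.nn.functional as F

def hint_delta(model, tok, question, hint, response):
    """示意:比较加提示前后,模型对同一回答的平均对数似然变化。"""
    def avg_logprob(context):
        ctx_ids = tok(context, return_tensors="pt").input_ids
        resp_ids = tok(response, return_tensors="pt", add_special_tokens=False).input_ids
        ids = torch.cat([ctx_ids, resp_ids], dim=1)
        with torch.no_grad():
            logits = model(ids).logits
        c = ctx_ids.shape[1]
        logp = F.log_softmax(logits[0, :-1], dim=-1)      # 位置 t 的 logits 预测 token t+1
        targets = ids[0, 1:]
        resp_lp = logp[c - 1:].gather(-1, targets[c - 1:].unsqueeze(-1))
        return resp_lp.mean().item()

    return (avg_logprob(f"{question}\n提示:{hint}\n回答:")
            - avg_logprob(f"{question}\n回答:"))
```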

[NLP-85] Beyond Majority Voting: Agreement-Based Clustering to Model Annotator Perspectives in Subjective NLP Tasks

【速读】: 该论文旨在解决自然语言处理(Natural Language Processing, NLP)数据集构建中标注者分歧(annotation disagreement)的利用问题,即如何有效建模不同标注者的主观判断差异以提升任务性能。传统多数投票(majority voting)策略忽视了标注者间的细微差异,而个体标注者建模方法虽能保留其视角但资源消耗高且尚未在多种NLP任务中充分验证。论文提出一种基于一致性的聚类技术(agreement-based clustering),通过识别标注者之间的共识模式对标注者进行分组,从而系统性地建模分歧并整合多视角信息;关键创新在于将标注者聚类与多标签(multi-label)或多任务(multitask)学习相结合,显著优于多数投票和单一标注者建模,在18种语言、40个数据集上的三类主观任务(情感分析、情绪分类、仇恨言论检测)中均展现出更强的分类性能。

链接: https://arxiv.org/abs/2605.09955
作者: Tadesse Destaw Belay,Ibrahim Said Ahmad,Idris Abdulmumin,Abinew Ali Ayele,Alexander Gelbukh,Eusebio Ricárdez-Vázquez,Olga Kolesnikova,Shamsuddeen Hassan Muhammad,Seid Muhie Yimam
机构: Instituto Politécnico Nacional (国家理工学院); University of Wisconsin–Stevens Point (威斯康星大学斯蒂文斯波因特校区); University of Pretoria (普利托里亚大学); Bahir Dar University (巴赫尔达大学); Imperial College London (帝国理工学院); University of Hamburg (汉堡大学)
类目: Computation and Language (cs.CL)
备注: Pre-MIT Press publication version

点击查看摘要

Abstract:Disagreement in annotation is a common phenomenon in the development of NLP datasets and serves as a valuable source of insight. While majority voting remains the dominant strategy for aggregating labels, recent work has explored modeling individual annotators to preserve their perspectives. However, modeling each annotator is resource-intensive and remains underexplored across various NLP tasks. We propose an agreement-based clustering technique to model the disagreement between the annotators. We conduct comprehensive experiments in 40 datasets in 18 typologically diverse languages, covering three subjective NLP tasks: sentiment analysis, emotion classification, and hate speech detection. We evaluate four aggregation approaches: majority vote, ensemble, multi-label, and multitask. The results demonstrate that agreement-based clustering can leverage the full spectrum of annotator perspectives and significantly enhance classification performance in subjective NLP tasks compared to majority voting and individual annotator modeling. Regarding the aggregation approach, the multi-label and multitask approaches are better for modeling clustered annotators than an ensemble and model majority vote.
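
“基于一致性的标注者聚类”可以先计算标注者两两一致率,再在 1 - 一致率 的距离矩阵上做层次聚类,示意如下(聚类方式为常见做法,论文的具体设置以原文为准):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_annotators(labels, n_clusters=2):
    """labels: [标注者数, 样本数] 的标注矩阵;返回每个标注者的簇编号。"""
    labels = np.asarray(labels)
    n = labels.shape[0]
    agree = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            agree[i, j] = agree[j, i] = (labels[i] == labels[j]).mean()
    dist = 1.0 - agree
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# 玩具例子:前两位与后两位标注者各自高度一致,自然分成两簇
labels = [[1, 0, 1, 1, 0, 1],
          [1, 0, 1, 0, 0, 1],
          [0, 1, 0, 0, 1, 0],
          [0, 1, 1, 0, 1, 0]]
print(cluster_annotators(labels))   # 形如 [1 1 2 2] 的分组,随后按簇做多标签/多任务训练
```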

[NLP-86] TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents

【速读】: 该论文旨在解决多模态工具调用型大语言模型中存在“可验证性缺失”的问题,即当前方法在生成答案时通常仅暴露完整的工具执行轨迹和最终结果,却未明确标注每个结论性陈述(claim)所依赖的具体工具观测证据及其支持关系,导致推理过程难以验证与优化。这一缺失被称为“溯源缺口(provenance gap)”。解决方案的关键在于提出TRACER框架,该框架在生成每句回答的同时构建结构化的溯源记录,精确标识支持该句子的工具调用轮次、证据单元及语义支持关系(包括引用、压缩和推导三种类型),并通过四重验证机制确保溯源记录的准确性,并将其转化为强化学习中的可追溯约束与局部信用分配策略,从而实现基于证据的可靠推理,而非单纯增加工具调用次数。

链接: https://arxiv.org/abs/2605.09934
作者: Bihui Yu,Caijun Jia,Jing Chi,Xiaohan Liu,Yining Wang,He Bai,Yuchen Liu,Jingxuan Wei,Junnan Zhu
机构: Shenyang Institute of Computing Technology, Chinese Academy of Sciences (中国科学院沈阳计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学); MAIS, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所多媒体信息处理实验室); Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences) (教育部算力网络与信息安全重点实验室,山东计算机中心(国家超级计算济南中心),齐鲁工业大学(山东省科学院))
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multimodal large language models increasingly solve vision-centric tasks by calling external tools for visual inspection, OCR, retrieval, calculation, and multi-step reasoning. Current tool-using agents usually expose the executed tool trajectory and the final answer, but they rarely specify which tool observation supports each generated claim. We call this missing claim-level dependency structure the provenance gap. The gap makes tool use hard to verify and hard to optimize, because useful evidence, redundant exploration, and unsupported reasoning are mixed in the same trajectory. We introduce TRACER, a framework for verifiable generative provenance in multimodal tool-using agents. Instead of adding citations after generation, TRACER generates each answer sentence together with a structured provenance record that identifies the supporting tool turn, evidence unit, and semantic support relation. Its relation space contains Quotation, Compression, and Inference, covering direct reuse, faithful condensation, and grounded derivation. TRACER verifies each record through schema checking, tool-turn alignment, source authenticity, and relation rationality, and then converts verified provenance into traceability constraints and provenance-derived local credit for reinforcement learning. We further construct TRACE-Bench, a benchmark for sentence-level provenance reconstruction from coarse multimodal tool trajectories. On TRACE-Bench, simply adding tools often introduces noise. With Qwen3-VL-8B, TRACER reaches 78.23% answer accuracy and 95.72% summary accuracy, outperforming the strongest closed-source tool-augmented baseline by 23.80 percentage points. Compared with tool-only supervised fine-tuning, it also reduces total test-set tool calls from 4949 to 3486. These results show that reliable multimodal tool reasoning depends on provenance-aware use of observations, not on more tool calls alone.
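
按摘要描述,每个生成句子都会伴随一条结构化溯源记录(支持它的工具轮次、证据单元,以及引用/压缩/推导三种关系之一)。用一个数据类可以直观表示这种记录(字段名为示意假设):

```python
from dataclasses import dataclass
from typing import Literal

Relation = Literal["Quotation", "Compression", "Inference"]

@dataclass
class ProvenanceRecord:
    sentence: str       # 答案中的某一句
    tool_turn: int      # 支持该句的工具调用轮次
    evidence: str       # 该轮次中的证据单元(如一段 OCR 文本)
    relation: Relation  # 语义支持关系:引用 / 压缩 / 推导

record = ProvenanceRecord(
    sentence="图中发票的合计金额为 1280 元。",
    tool_turn=2,
    evidence="OCR: 合计:¥1,280.00",
    relation="Quotation",
)
# 验证阶段即对这类记录做模式检查、轮次对齐、来源真实性与关系合理性四重校验
```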

[NLP-87] FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

【速读】: 该论文旨在解决大语言模型在长上下文场景下因训练过程中注意力分配不均而导致的语义信息利用效率低下问题,即“注意力稀释”(attention dilution)现象:在监督微调(SFT)阶段,模型倾向于将注意力集中在位置占优的token上,而非语义相关的内容,从而削弱梯度信号并限制其学习鲁棒长程依赖的能力。解决方案的关键在于提出FocuSFT——一种双层优化框架,其中内层通过轻量级快速权重参数(fast-weight parameters)构建参数化记忆机制,聚焦于训练上下文中的关键内容;外层则基于此增强表示进行标准SFT,同时保持双向注意力建模与响应侧因果掩码,以减少因果不对称性并缓解注意力黑洞(attention sink)效应,从而显著提升模型对长上下文的理解与应用能力。

链接: https://arxiv.org/abs/2605.09932
作者: Zehua Pei,Hui-Ling Zhen,Xianzhi Yu,Sinno Jialin Pan,Mingxuan Yuan,Bei Yu
机构: The Chinese University of Hong Kong (香港中文大学); Huawei Technologies Co., Ltd (华为技术有限公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models can now process increasingly long inputs, yet their ability to effectively use information spread across long contexts remains limited. We trace this gap to how attention budget is spent during supervised fine-tuning (SFT) on long sequences: positional biases and attention sinks cause the model to allocate most of its attention to positionally privileged tokens rather than semantically relevant content. This training-time attention dilution (the starvation of content tokens in the attention distribution) weakens the gradient signal, limiting the model’s ability to learn robust long-context capabilities. We introduce FocuSFT, a bilevel optimization framework that addresses this problem at training time. An inner loop adapts lightweight fast-weight parameters on the training context to form a parametric memory that concentrates attention on relevant content, and the outer loop performs SFT conditioned on this sharpened representation. Both loops apply bidirectional attention over context tokens while preserving causal masking for responses, reducing the causal asymmetry that gives rise to attention sinks and aligning inner-outer behavior. On BABILong, FocuSFT improves accuracy by up to +14pp across 4K–32K context lengths; on RULER, it raises CWE aggregation from 72.9% to 81.1% at 16K; and on GPQA with agentic tool use, it yields a 24% relative gain in pass@1. Attention analysis shows that FocuSFT reduces attention sink mass by 529× and triples context engagement during training. Code: this https URL

[NLP-88] PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning

【速读】: 该论文旨在解决已具备工具调用能力的大语言模型(Large Language Models, LLMs)在推理阶段推理能力不足的问题,即如何在不进行额外训练的前提下提升其利用外部工具(如代码解释器)进行复杂问题求解的能力。核心挑战在于:错误的工具调用会显著降低最终答案的正确性,且一旦陷入持续失败的循环,LLMs 难以自我修正。解决方案的关键在于提出 PruneTIR 框架,通过三个协同机制实现推理过程的动态优化:Success-Triggered Pruning(成功触发剪枝)、Stuck-Triggered Pruning and Resampling(卡顿触发剪枝与重采样),以及 Retry-Triggered Tool Suspension(重试触发工具暂停)。这三个组件共同作用,有效识别并修剪低质量推理路径,避免无效重复尝试,从而提升推理成功率与效率,并减少上下文长度消耗。

链接: https://arxiv.org/abs/2605.09931
作者: Luan Zhang,Dandan Song,Zhijing Wu,Zhengyu Chen,Chen Zhang,Yuhang Tian,Huipeng Ma,Chenhao Li,Changzhi Zhou,Xudong Li,Shuhao Zhang
机构: Beijing Institute of Technology (北京理工大学); Huazhong University of Science and Technology (华中科技大学); Independent (独立)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tool-integrated reasoning (TIR) enables large language models (LLMs) to enhance their capabilities by interacting with external tools, such as code interpreters (CI). Most recent studies focus on exploring various methods to equip LLMs with the ability to use tools. However, how to further boost the reasoning ability of already tool-capable LLMs at inference time remains underexplored. Improving reasoning at inference time requires no additional training and can help LLMs better leverage tools to solve problems. We observe that, during tool-capable LLM inference, both the number and the proportion of erroneous tool calls are negatively correlated with answer correctness. Moreover, erroneous tool calls are typically resolved successfully within a few subsequent turns. If not, LLMs often struggle to resolve such errors even with many additional turns. Building on the above observations, we propose PruneTIR, a rather effective yet efficient framework that enhances the tool-integrated reasoning at inference time. During LLM inference, PruneTIR prunes trajectories, resamples tool calls, and suspends tool usage through three components: Success-Triggered Pruning, Stuck-Triggered Pruning and Resampling, and Retry-Triggered Tool Suspension. These three components enable PruneTIR to mitigate the negative impact of erroneous tool calls and prevent LLMs from getting stuck in repeated failed resolution attempts, thereby improving overall LLM performance. Extensive experimental results demonstrate the effectiveness of PruneTIR, which significantly improves Pass@1 and efficiency while reducing the working context length for tool-capable LLMs.

[NLP-89] Evolving Knowledge Distillation for Lightweight Neural Machine Translation

【速读】: 该论文旨在解决大规模神经机器翻译(Neural Machine Translation, NMT)模型在资源受限设备上部署时面临的挑战,特别是知识蒸馏(Knowledge Distillation, KD)在教师模型与学生模型之间存在显著能力差距时效果下降的问题。解决方案的关键在于提出一种渐进式训练框架——进化知识蒸馏(Evolving Knowledge Distillation, EKD),其中学生模型通过一系列容量逐步提升的教师模型进行迭代学习,从而有效缩小能力差距,使小型学生模型性能逼近大型教师模型。

链接: https://arxiv.org/abs/2605.09924
作者: Xuewen Zhang,Haixiao Zhang,Xinlong Huang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in Neural Machine Translation (NMT) have significantly improved translation quality. However, the increasing size and complexity of state-of-the-art models present significant challenges for deployment on resource-limited devices. Knowledge distillation (KD) is a promising approach for compressing models, but its effectiveness diminishes when there is a large capacity gap between teacher and student models. To address this issue, we propose Evolving Knowledge Distillation (EKD), a progressive training framework in which the student model learns from a sequence of teachers with gradually increasing capacities. Experiments on IWSLT-14, WMT-17, and WMT-23 benchmarks show that EKD leads to consistent improvements at each stage. On IWSLT-14, the final student achieves a BLEU score of 34.24, narrowing the gap to the strongest teacher (34.32 BLEU) to just 0.08 BLEU. Similar trends are observed on other datasets. These results demonstrate that EKD effectively bridges the capacity gap, enabling compact models to achieve performance close to that of much larger teachers. Code and models are available at this https URL.
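A minimal sketch of the progressive schedule described above: the same student is distilled against a sequence of teachers ordered by capacity. The temperature-scaled KL objective and the `(src, tgt_in) -> logits` model interface are standard choices assumed here for illustration, not details confirmed by the abstract.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, batch, optimizer, T=2.0):
    """One distillation step: soften teacher logits and minimize KL divergence."""
    with torch.no_grad():
        teacher_logits = teacher(batch["src"], batch["tgt_in"])
    student_logits = student(batch["src"], batch["tgt_in"])
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def evolving_kd(student, teachers_small_to_large, loader, optimizer, epochs_per_stage=1):
    """EKD-style schedule: the student learns from teachers of increasing capacity."""
    for teacher in teachers_small_to_large:      # e.g. [small, base, large]
        teacher.eval()
        for _ in range(epochs_per_stage):
            for batch in loader:
                distill_step(student, teacher, batch, optimizer)
    return student
```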

[NLP-90] Team-Based Self-Play With Dual Adaptive Weighting for Fine-Tuning LLMs ACL2026

【速读】: 该论文旨在解决自训练方法在大语言模型(Large Language Models, LLMs)对齐过程中面临的两个关键问题:一是对合成数据质量敏感,导致迭代训练中出现不稳定性和偏差放大;二是随着训练轮次增加,正负样本响应之间的差距逐渐缩小,造成优化效果下降。解决方案的关键在于提出一种基于团队协作与对抗的自对弈算法(Team-based self-Play with dual Adaptive Weighting, TPAW),其核心创新包括:(1)采用团队框架,使当前策略模型同时与历史检查点进行合作与竞争,从而提升优化稳定性与效率;(2)设计双重自适应加权机制——响应重加权(response reweighting)动态调整目标响应的重要性,玩家权重策略(player weighting)实时调节各团队成员的贡献度,实现更精准的梯度更新。该方法从监督微调(Supervised Fine-Tuning, SFT)模型出发,无需额外人工标注即可持续优化对齐效果,在多个基座模型和基准测试中均显著优于现有基线。

链接: https://arxiv.org/abs/2605.09922
作者: Wu Li,Yigeng Zhou,Zesheng Shi,Yequan Wang,Min Zhang,Jing Li
机构: Harbin Institute of Technology, Shenzhen, China; Beijing Academy of Artificial Intelligence, Beijing, China
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2026 Main

点击查看摘要

Abstract:While recent self-training approaches have reduced reliance on human-labeled data for aligning LLMs, they still face critical limitations: (i) sensitivity to synthetic data quality, leading to instability and bias amplification in iterative training; (ii) ineffective optimization due to a diminishing gap between positive and negative responses over successive training iterations. In this paper, we propose Team-based self-Play with dual Adaptive Weighting (TPAW), a novel self-play algorithm designed to improve alignment in a fully self-supervised setting. TPAW adopts a team-based framework in which the current policy model both collaborates with and competes against historical checkpoints, promoting more stable and efficient optimization. To further enhance learning, we design two adaptive weighting mechanisms: (i) a response reweighting scheme that adjusts the importance of target responses, and (ii) a player weighting strategy that dynamically modulates each team member’s contribution during training. Initialized from a SFT model, TPAW iteratively refines alignment without requiring additional human supervision. Experimental results demonstrate that TPAW consistently outperforms existing baselines across various base models and LLM benchmarks. Our code is publicly available at this https URL.

[NLP-91] Position: Academic Conferences are Potentially Facing Denominator Gaming Caused by Fully Automated Scientific Agents ICML’26

【速读】: 该论文旨在解决顶级人工智能(Artificial Intelligence, AI)会议中因投稿量指数级增长而维持相对稳定接收率所引发的系统性脆弱性问题,特别是针对一种新型威胁——“代理分母操控”(Agentic Denominator Gaming):恶意行为者利用AI代理生成大量表面合理但质量低下的论文,其目的并非让这些劣质论文被接收,而是通过扩大投稿总量来稀释评审资源,从而在保持固定接收率的前提下,提高一小部分特定优质论文的录用概率。解决方案的关键在于认识到单纯依赖技术手段检测此类攻击难以奏效,必须从系统层面进行政策与激励机制改革,包括调整评审流程、优化审稿人激励机制以及建立更透明的投稿治理结构,以实现持久性的防护效果。

链接: https://arxiv.org/abs/2605.09915
作者: Rong Shan,Te Gao,Hang Zheng,Yunjia Xi,Jiachen Zhu,Zeyu Zheng,Yong Yu,Weinan Zhang,Jianghao Lin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted by ICML’26 Position Track

点击查看摘要

Abstract:The implicit policy of maintaining relatively stable acceptance rates at top AI conferences, despite exponentially growing submissions, introduces a critical structural vulnerability. This position paper characterizes a new systemic threat we term Agentic Denominator Gaming, in which a malicious actor deploys AI agents to generate and submit a large volume of superficially plausible but low-quality papers. Crucially, their objective is not the acceptance of low-quality papers, but rather to inflate the submission denominator and overwhelm reviewing capacity. Under a relatively stable acceptance rate, this dilution can systematically increase the publication probability of a small, targeted set of legitimate papers. We analyze the practical feasibility of this threat and its broader consequences, including intensified reviewer burnout, degraded review quality, and the emergence of industrialized automated agent mills. Finally, we propose and evaluate a range of mitigation strategies, and argue that durable protection will require system-level policy and incentive reforms, rather than relying primarily on technical detection alone.

[NLP-92] The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark

【速读】: 该论文旨在解决当前视觉-语言模型(Vision-Language Model, VLM)在处理纽结图(knot diagram)时存在的“感知-操作鸿沟”问题,即模型能够识别和描述纽结图的结构特征,但无法基于这些特征进行有效的拓扑操作或推理。其解决方案的关键在于构建了一个名为KnotBench的基准测试平台,该平台包含1,951个原始纽结原型(交叉数3至19)的858,318张图像,并设计了14项任务,涵盖等价判断、移动预测、识别和跨模态定位四个类别,且通过与Regina软件生成的规范纽结签名进行比对来验证答案正确性。实验表明,尽管引入思维模式(thinking mode)可提升部分模型性能(如GPT-5提升9.25分),但多数任务仍远低于随机基线,说明现有VLM虽能提取纽结图的视觉特征,却缺乏模拟拓扑变换所需的数学运算机制。

链接: https://arxiv.org/abs/2605.09900
作者: Hao Liu,Jicheng Liu
机构: New York University (纽约大学); University of Southern California (南加州大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 41 pages, 18 figures

点击查看摘要

Abstract:A vision-language model can look at a knot diagram and report what it sees, yet fail to act on that structure. KnotBench pairs an 858,318-image corpus from 1,951 prime-knot prototypes (crossing numbers 3 to 19) with a protocol whose answers are checked against Regina’s canonical knot signature. Its 14 tasks span four families, equivalence judgment, move prediction, identification, and cross-modal grounding; an image-versus-symbol split locates failures along the perception-operation gap. We score Claude Opus 4.7 and GPT-5, each with and without thinking, under a 64K output-token budget matched on both vendors. Across 56 (task, model) cases, 15 sit at or below a random baseline and 8 of 14 tasks have a best score under 1.5x random. On diagram-to-symbol transcription, no model produces a strictly correct string, and permissive Regina decoding recovers the knot in 0 to 4 of 100 items. Thinking-mode reasoning lifts overall accuracy by 1.65 points for Claude and 9.25 points for GPT-5, narrowing the gap only modestly. Read together, the four families suggest current vision-language models hold features of a diagram but lack apparatus to simulate moves on those features.

[NLP-93] Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中存在的“价值-行为鸿沟”(value-action gap)问题,即模型在陈述其价值观时表现一致,但在实际对话生成中却缺乏相应的行为一致性。为系统研究这一现象,作者提出了VALDI框架,用于量化评估模型在人类中心场景下所表达的价值观与其生成对话之间的对齐程度,包含4,941个跨五领域的场景、三种任务类型(价值阐述、推理和行动)以及五种度量指标。关键创新在于识别出一种深层次的失效模式——“伪深思”(Pseudo-Deliberation),即看似合理的推理过程并未带来行为上的价值对齐。为此,论文进一步提出VIVALDI方法,一种多智能体价值审计机制,在生成的不同阶段进行干预,以提升模型行为与声明价值的一致性。

链接: https://arxiv.org/abs/2605.09893
作者: Sushrita Rakshit,Hanwen Zhang,Hua Shen
机构: New York University; New York University Shanghai
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages

点击查看摘要

Abstract:Large language models (LLMs) are often evaluated based on their stated values, yet these do not reliably translate into their actions, a discrepancy termed “value-action gap.” In this work, we argue that this gap persists even under explicit reasoning, revealing a deeper failure mode we call “Pseudo-Deliberation”: the appearance of principled reasoning without corresponding behavioral alignment. To study this systematically, we introduce VALDI, a framework for measuring alignment between stated values and generated dialogue. VALDI includes 4,941 human-centered scenarios across five domains, three tasks that elicit value articulation, reasoning, and action, and five metrics for quantifying value adherence. Across both proprietary and open-source LLMs, we observe consistent misalignment between expressed values and downstream dialogues. To investigate intervention strategies, we propose VIVALDI, a multi-agent value auditor that intervenes at different stages of generation.

[NLP-94] Key-Value Means

【速读】: 该论文旨在解决传统Transformer模型在处理长序列时面临的内存占用高(KV-cache存储开销大)和预填充时间复杂度高(O(N²))的问题,同时兼顾高效训练与推理的可扩展性。其核心解决方案是提出Key-Value Means(KVM),一种新型的块递归(block-recurrence)注意力机制,能够支持固定大小或可增长的状态缓存;通过在每个层中使用KVM替代标准注意力,可在保持线性时间复杂度(O(N))训练和预填充的同时,实现亚二次预填充时间和亚线性状态增长,从而在不依赖定制内核的情况下,统一融合了传统Transformer的可扩展上下文记忆能力与线性RNN的高效内存管理优势。

链接: https://arxiv.org/abs/2605.09877
作者: Daniel Goldstein,Eugene Cheah
机构: Recursal AI; Eleuther AI
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present Key-Value Means (“KVM”), a novel block-recurrence for attention that can accommodate either fixed-size or growing state. Equipping a strong transformer baseline with fixed-size KVM attention layers yields a strong O(N) chunked RNN, while adding only an insignificant number of new parameters. We train a transformer with a growable KVM cache and show it performs competitively on long-context tests with only subquadratic prefill time and sublinear state growth. KVM is implementable with standard operations and without custom kernels, and supports chunk-wise parallelizable training and prefill. It provides many of the benefits of both traditional transformers (expandable context memory, chunk-wise parallelizable training and prefill) and linear RNNs in a single unified package. It can be used on every layer, saving KV-cache memory, and allowing a continuous range of choices of prefill time complexity between O(N) and O(N^2) . It can also be implemented in a hybrid solution in tandem with LRNN layers in place of traditional attention, to supplement the LRNN with improved sublinear memory growth context length usage and long context decoding. We release our code at this https URL and trained models at this https URL under the Apache 2.0 license.

[NLP-95] EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

【速读】: 该论文旨在解决下一代视觉助手(如智能眼镜、具身代理和持续生活记录系统)在超长视频场景下进行长期记忆驱动推理的挑战,即模型需在数小时至数天的连续视觉体验中积累信息、回忆先前状态、追踪时间顺序并抽象重复模式。现有周级视频基准主要针对感知与识别任务(如时刻定位或全局摘要),缺乏对跨日证据整合的推理能力评估。解决方案的关键在于提出EgoMemReason——一个系统性评估一周级第一人称视频理解能力的基准,涵盖三种互补的记忆类型:实体记忆(Entity Memory,跟踪物体状态跨日演变)、事件记忆(Event Memory,回忆并排序相隔数小时或数天的活动)和行为记忆(Behavior Memory,从稀疏重复观测中抽象出周期性模式)。该基准包含500个问题、三类记忆类型及六大核心挑战,平均每个问题涉及5.1段视频证据和25.9小时的记忆回溯,揭示当前多模态大语言模型(MLLMs)与智能体框架在长时记忆推理上的显著不足,为推进长上下文、记忆感知的多模态系统提供了坚实基础。

链接: https://arxiv.org/abs/2605.09874
作者: Ziyang Wang,Yue Zhang,Shoubin Yu,Ce Zhang,Zengqi Zhao,Jaehong Yoon,Hyunji Lee,Gedas Bertasius,Mohit Bansal
机构: UNC Chapel Hill (北卡罗来纳大学教堂山分校); NTU Singapore (新加坡南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: The first two authors contributed equally. Project website: this https URL

点击查看摘要

Abstract:Next-generation visual assistants, such as smart glasses, embodied agents, and always-on life-logging systems, must reason over an entire day or more of continuous visual experience. In ultra-long video settings, relevant information is sparsely distributed across hours or days, making memory a fundamental challenge: models must accumulate information over time, recall prior states, track temporal order, and abstract recurring patterns. However, existing week-long video benchmarks are primarily designed for perception and recognition, such as moment localization or global summarization, rather than reasoning that requires integrating evidence across multiple days. To address this gap, we introduce EgoMemReason, a comprehensive benchmark that systematically evaluates week-long egocentric video understanding through memory-driven reasoning. EgoMemReason evaluates three complementary memory types: entity memory, tracking how object states evolve and change across days; event memory, recalling and ordering activities separated by hours or days; and behavior memory, abstracting recurring patterns from sparse, repeated observations over the whole week period. EgoMemReason comprises 500 questions across three memory types and six core challenges, with an average of 5.1 video segments of evidence per question and 25.9 hours of memory backtracking. We evaluate EgoMemReason on 17 methods across MLLMs and agentic frameworks, revealing that even the best model achieves only 39.6% overall accuracy. Further analysis shows that the three memory types fail for distinct reasons and that performance degrades as evidence spans longer temporal horizons, revealing that long-horizon memory remains far from solved. We believe EgoMemReason establishes a strong foundation for evaluating and advancing long-context, memory-aware multimodal systems.

[NLP-96] The Metacognitive Probe: Five Behavioural Calibration Diagnostics for LLMs

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中存在“自信但错误”的行为问题,即模型可能在整体任务表现优异的情况下,仍对特定子任务或事实判断过度自信,而传统复合基准(如MMLU、BIG-Bench等)无法捕捉这种局部过自信现象。为应对这一挑战,作者提出元认知探测工具(Metacognitive Probe),其关键创新在于将模型的置信度行为分解为五个可操作、行为上独立的维度:信心校准(confidence calibration, T1-CC)、知识边界识别(knowledge boundary, T3-KB)、校准范围(calibration range, T4-CR)、认知警觉性(epistemic vigilance, T2-EV)以及推理链验证(reasoning-chain validation, T5-RCV)。通过在8个前沿模型和69名人类参与者上的实证评估,该工具能够揭示模型内部的信心-正确性错位,例如在Gemini 2.5 Flash中观察到高达47分的跨维度差异(T1-CC=88 vs T4-CR=41),从而实现对模型元认知能力的精细化诊断。

链接: https://arxiv.org/abs/2605.09844
作者: Rafael C. T. Oliveira
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 27 pages, 13 tables. Code, data, prompts, and rubrics released with the paper. OSF deposit pending; DOI in v2

点击查看摘要

Abstract:The Metacognitive Probe is an exploratory five-task, 15-slot diagnostic that decomposes an LLM’s confidence behaviour into five behaviourally-distinct dimensions: confidence calibration (T1-CC), epistemic vigilance (T2-EV), knowledge boundary (T3-KB), calibration range (T4-CR), and reasoning-chain validation (T5-RCV). It is evaluated on N=8 frontier models and N=69 humans. The instrument is motivated by Flavell (1979) and Nelson and Narens (1990) but operates on observable confidence-correctness alignment; it is not a validated cross-species metacognition scale, and the pre-specified human developmental hypothesis was falsified. Composite benchmarks (MMLU, BIG-Bench, HELM, GPQA) ask whether a model produces a correct response. They are silent on whether the model knows when its response is wrong. A model can score 80 on a composite calibration benchmark and still be wildly overconfident in narrow pockets the aggregate cannot surface. The Metacognitive Probe surfaces those pockets. Our headline is a 47-point within-model dissociation in Gemini 2.5 Flash: panel-best within-task calibration (T1-CC = 88; Spearman rho = +0.551, 95% CI [+0.14, +0.80], p = 0.005) and panel-worst cross-task difficulty prediction (T4-CR = 41; sigma_conf = 1.4 across twelve factoids).
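The T1-CC dimension reported above is an observable confidence-correctness alignment measured with Spearman's rho. A minimal sketch of that style of check, using toy numbers rather than the paper's data:

```python
import numpy as np
from scipy.stats import spearmanr

def confidence_calibration_score(confidences, correct):
    """T1-CC-style check: rank correlation between stated confidence (0-100)
    and whether the answer was actually correct (0/1)."""
    rho, p_value = spearmanr(confidences, correct)
    return rho, p_value

# Toy example: a model that is mostly right when confident and wrong when not.
conf = np.array([95, 90, 85, 80, 60, 55, 40, 30])
hit  = np.array([ 1,  1,  1,  0,  1,  0,  0,  0])
print(confidence_calibration_score(conf, hit))
```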

[NLP-97] The Association of Transformer-based Sentiment Analysis with Symptom Distress and Deterioration in Routine Psychotherapy Care

【速读】: 该论文旨在解决如何利用生成式 AI(Generative AI)驱动的细粒度情感分析模型,将心理治疗会话中的文本转化为可量化的、具有心理测量学意义的情感特征,从而作为评估来访者情绪状态和临床恶化风险的辅助工具。其解决方案的关键在于:首先,基于大规模心理治疗会话数据(N = 751)提取语句级和会话级情感特征;其次,通过与已验证的心理测量工具OQ-45各维度的相关性分析,发现这些情感特征在情绪效价(emotional valence)相关维度上呈现方向直观且显著的强相关性;最后,证明了不同临床风险群体(如高恶化或脱落风险)在情感分布上存在统计学差异,表明此类情感特征至少可作为客户痛苦程度及恶化情况的辅助测量指标。

链接: https://arxiv.org/abs/2605.09838
作者: Douglas K. Faust,Peter Awad,Alexandre Vaz,Tony Rousmaniere
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 20 pages, 4 figures

点击查看摘要

Abstract:Sentiment analysis has been of long-standing interest in psychotherapy research. Recently, the Transformer deep learning architecture has produced text-based sentiment analysis models that are highly accurate and context-aware. These models have been explored as proxies for emotion measurement instruments in psychotherapy, but not investigated as stand-alone psychometric tools. Using proposed utterance-level and session-level sentiment features derived from a fine-grained sentiment model on a large corpus of psychotherapy sessions (N = 751), we investigate the distribution of session aggregated sentiment scores. Further, we characterize the relationship of these features to individual components and the overall score of the OQ-45 instrument and find that this sentiment feature is most strongly correlated to components related to emotional valence in directionally intuitive ways. Finally, we report that there are statistically significant differences between the sentiment distributions for patients flagged as at risk of deterioration or dropping out of care via either the OQ Rational or Empirical outcome models. These correlations to a fully-validated psychometric instrument demonstrate that these proposed sentiment features are, at least, adjunctive measures of client distress and deterioration.

[NLP-98] Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants

【速读】: 该论文旨在解决用户模拟器(user simulator)质量评估这一开放性问题,即如何量化用户模拟器在训练交互式大语言模型(LLM)助理时的有效性。其解决方案的关键在于将模拟器的质量与其下游实用性直接关联:通过在受控实验中仅改变用户模拟器类型,训练多个LLM助理,并在真实人类参与者组成的用户研究(283人)和WildBench基准(基于真实人类- AI对话)上评估其表现。结果表明,基于真实人类语料微调的模拟器显著优于角色扮演型LLM模拟器,且训练出的助理在泛化能力、性能提升及与不同模拟器的兼容性方面均表现更优,从而论证了以真实人类行为为基础并以对实际用户效果为评价标准的重要性。

链接: https://arxiv.org/abs/2605.09808
作者: Joseph Suh,Ayush Raj,Minwoo Kang,Serina Chang
机构: UC Berkeley (加州大学伯克利分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:User simulators are increasingly leveraged to build interactive AI assistants, yet how to measure the quality of these simulators remains an open question. In this work, we show how simulator quality can be quantified in terms of its downstream utility: how an LLM assistant trained with this user simulator performs in the wild when interacting with real humans. In a controlled experiment where only the user simulator varies, we train LLM assistants via reinforcement learning against a spectrum of simulators, from an LLM prompted to role-play a user to one fine-tuned on human utterances from WildChat. As evaluation, we measure pairwise win rates in a user study with 283 participants and on WildBench, a benchmark derived from real human–AI conversations. Training against the role-playing LLM yields an assistant statistically indistinguishable from the initial assistant in our user study (51% win rate), whereas training against the fine-tuned simulator yields significant gains (58% over the initial and 57% over the one trained against role-playing). Closer inspection reveals three further patterns: methods for making role-playing LLMs more realistic (e.g., persona conditioning) improve trained assistants but do not close the gap to the fine-tuned simulator; scaling the simulator’s model size benefits the fine-tuned simulator but yields no gain for role-playing ones; and assistants trained against role-playing simulators fail to generalize when paired with other simulators at test time, while the one trained against fine-tuned simulator does. Together, these results argue for grounding user simulators in real human behavior and measuring their quality by their downstream effect on real users.

[NLP-99] cantnlp@DravidianLangTech 2026: organic domain adaptation improves multi-class hope speech detection in Tulu

【速读】: 该论文旨在解决在混合语码(code-mixed)的图鲁语(Tulu)社交媒体评论中检测希望言论(hope speech)的问题。其解决方案的关键在于基于XLM-RoBERTa架构训练一个文本分类模型,并通过在自然收集的含语码混杂和多书写系统变化的图鲁语社交文本上进行自适应微调,提升模型对目标语言变体的适应能力。实验表明,相较于基线模型,该自适应方法在开发集上表现更优,验证了在特定语境下进行领域自适应微调对提升希望言论识别效果的有效性。

链接: https://arxiv.org/abs/2605.09795
作者: Andrew Li,Sidney Wong
机构: Lake Washington School District; University of Otago, New Zealand; Te Pūnaha Matatini, New Zealand
类目: Computation and Language (cs.CL)
备注: Accepted to Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages (DravidianLangTech-2026)

点击查看摘要

Abstract:This paper presents our systems and results for the Hope Speech Detection in Code-Mixed Tulu Language shared task at the Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages (DravidianLangTech-2026). We trained an XLM-RoBERTa-based text classification system for detecting hope speech in code-mixed Tulu social media comments. We compared this organically adapted hope speech detection model with our baseline model. On the development set, the organically adapted model outperformed the baseline system. While our submitted systems performed more modestly on the official test set, these results suggest that further adapting XLM-RoBERTa on organically collected Tulu social media text containing code-mixed and mixed-script variation can improve hope speech detection in code-mixed Tulu.

[NLP-100] Parameter-Efficient Neuroevolution for Diverse LLM Generation: Quality-Diversity Optimization via Prompt Embedding Evolution GECCO2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中存在的模式崩溃(mode collapse)问题,即模型生成输出过于同质化,无法充分探索有效的解空间。解决方案的关键在于提出QD-LLM框架,该框架基于质量-多样性(Quality-Diversity, QD)优化思想,通过无梯度的神经进化方法演化提示嵌入(prompt embeddings),并引入一个参数量极小(约32K)的神经接口来引导冻结的大语言模型(如70B参数级别)生成多样化且高质量的输出,从而在不进行模型微调的前提下实现行为调控。该方法结合了语义与显式特征的混合行为表征、形式化的覆盖率边界保证(定理1),以及协同进化的变异算子(包括基于有限差分梯度估计的目标行为突变),最终在HumanEval、MBPP和创意写作等多个基准上显著提升了覆盖度和QD得分,并验证了其在测试用例生成和微调数据质量提升等下游任务中的有效性。

链接: https://arxiv.org/abs/2605.09781
作者: Dongxin Guo,Jikun Wu,Siu Ming Yiu
机构: The University of Hong Kong(香港大学); Stellaris AI Limited(星链人工智能有限公司)
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 11 pages, 3 figures, 7 tables, 1 algorithm, 1 theorem. Accepted to GECCO 2026

点击查看摘要

Abstract:Large Language Models exhibit mode collapse, producing homogeneous outputs that fail to explore valid solution spaces. We present QD-LLM, a framework for parameter-efficient neuroevolution that evolves prompt embeddings, compact neural interfaces (~32K parameters) that steer generation in frozen LLMs (70B+ parameters), within a Quality-Diversity (QD) optimization framework. Our contributions: (1) evolved prompt embeddings via gradient-free optimization enabling behavioral steering without model fine-tuning; (2) hybrid behavior characterization combining semantic and explicit features with formal coverage bounds (Theorem 1) under validated near-independence (NMI = 0.08 \pm 0.02); (3) co-evolutionary variation operators including targeted behavioral mutation via finite-difference gradient estimation. On HumanEval (164 problems), MBPP, and creative writing benchmarks, QD-LLM achieves 46.4% higher coverage and 41.4% higher QD-Score than QDAIF (p < 0.001, 30 runs, Vargha-Delaney A = 0.94). We demonstrate downstream utility: diverse archives improve test generation (34% more edge cases) and fine-tuning data quality (8.3% accuracy gain). We validate across open-source LLMs (Llama-3-70B, Mistral-Large) with full embedding access, establishing prompt embedding evolution as an effective paradigm bridging neuroevolution and modern LLMs.
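The QD loop over prompt embeddings can be pictured as a MAP-Elites-style archive: mutate an elite embedding, evaluate quality and a behavior descriptor, and keep the best embedding per behavior cell. The sketch below assumes a 2-D descriptor in [0, 1]^2 and a stand-in `evaluate` function; QD-LLM's actual descriptor design, mutation operators, and archive layout may differ.

```python
import numpy as np

def behavior_bin(descriptor, bins_per_dim=10):
    """Discretize a 2-D behavior descriptor (values in [0, 1]) into a grid cell."""
    return tuple(min(int(d * bins_per_dim), bins_per_dim - 1) for d in descriptor)

def qd_search(init_embedding, evaluate, iterations=1000, sigma=0.05, seed=0):
    """MAP-Elites-style loop over prompt embeddings (numpy arrays).

    `evaluate(embedding) -> (quality, descriptor)` stands in for generating
    text with the frozen LLM and scoring quality plus behavior features.
    """
    rng = np.random.default_rng(seed)
    archive = {}                                      # cell -> (quality, embedding)
    quality, desc = evaluate(init_embedding)
    archive[behavior_bin(desc)] = (quality, init_embedding)

    for _ in range(iterations):
        keys = list(archive.keys())
        _, parent = archive[keys[rng.integers(len(keys))]]
        child = parent + rng.normal(0.0, sigma, size=parent.shape)  # gradient-free mutation
        quality, desc = evaluate(child)
        cell = behavior_bin(desc)
        if cell not in archive or quality > archive[cell][0]:
            archive[cell] = (quality, child)          # keep the elite per cell

    coverage = len(archive)
    qd_score = sum(q for q, _ in archive.values())
    return archive, coverage, qd_score

# Toy usage with a stand-in evaluator:
# evaluate = lambda e: (-np.linalg.norm(e), 1 / (1 + np.exp(-e[:2])))
# archive, coverage, qd_score = qd_search(np.zeros(32), evaluate, iterations=200)
```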

[NLP-101] Nectar: Neural Estimation of Cached-Token Attention via Regression

【速读】: 该论文旨在解决长序列上下文中软注意力(softmax attention)计算效率低的问题,即在固定长上下文场景下,每次查询 token 都需遍历所有缓存的键值对(key-value pairs),导致复杂度为 O(n),难以扩展。其解决方案的关键在于提出 Nectar 模块:通过训练两个轻量级神经网络(目标网络与得分网络)来拟合每个层和键值头(KV-head)上注意力输出及其对数归一化因子(log-normalizer),从而在推理阶段用常数时间前向传播替代 O(n) 的注意力计算。该方法显著降低内存占用(参数规模远小于 KV-cache),同时保持与完整注意力相近的生成质量,在多个长文本数据集上验证了其有效性。

链接: https://arxiv.org/abs/2605.09778
作者: João Monteiro,Michal Klein,Pierre Ablin,Marco Cuturi
机构: Apple(苹果)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Evaluating softmax attention over a fixed long context requires reading every cached key-value pair for each new query token. For a given context (a book, a manual, a legal corpus) the attention output is a deterministic function of the query. We propose Nectar, which fits a compact neural network to this function for queries drawn from a task-relevant distribution. Nectar fits two networks per layer and KV-head: a target network that predicts the attention output and a score network that predicts the log-normalizer. The pair plugs into the standard masked self-attention at inference time, replacing the O(n) attention over the cache with a forward pass whose cost does not depend on n. Each module carries on the order of |\theta| parameters per layer and KV-head, typically much smaller than the 2·n·d KV-cache footprint at the same granularity. We report experiments on models from 1.7B to 8B parameters across five long-context datasets. The approximation error tracks the next-token accuracy gap to full attention, and allocating capacity non-uniformly across layers reduces that gap in our ablation. Beyond this analysis of metrics, we check that the text generations (following a question prompt) of a model equipped with a Nectar module match in semantic content those obtained by giving the same model access to the full cache.
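Nectar predicts both the attention output and its log-normalizer for the fixed context, which is exactly what is needed to combine that prediction with exact attention over newly generated tokens via the standard log-sum-exp merge of partial softmax results. The merge rule below is our illustration of why the score network matters; the abstract does not spell out how (or whether) Nectar performs this exact combination.

```python
import numpy as np

def merge_attention(out_ctx, lse_ctx, out_new, lse_new):
    """Combine two partial softmax-attention results.

    out_ctx : predicted attention output over the fixed context (target network)
    lse_ctx : predicted log-normalizer over the fixed context (score network)
    out_new : exact attention output over tokens generated after the context
    lse_new : exact log-normalizer over those tokens
    Each `out_*` is already normalized within its own segment.
    """
    m = np.maximum(lse_ctx, lse_new)              # shift for numerical stability
    w_ctx, w_new = np.exp(lse_ctx - m), np.exp(lse_new - m)
    return (w_ctx * out_ctx + w_new * out_new) / (w_ctx + w_new)

def softmax_attn(q, K, V):
    """Exact single-query softmax attention, returning output and log-normalizer."""
    s = K @ q
    m = s.max()
    w = np.exp(s - m)
    return (w[:, None] * V).sum(0) / w.sum(), m + np.log(w.sum())

# Sanity check: merging two halves of the cache reproduces full attention.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(6, 8)), rng.normal(size=(6, 4))
o1, l1 = softmax_attn(q, K[:3], V[:3])
o2, l2 = softmax_attn(q, K[3:], V[3:])
o_full, _ = softmax_attn(q, K, V)
assert np.allclose(merge_attention(o1, l1, o2, l2), o_full)
```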

[NLP-102] EvoPref: Multi-Objective Evolutionary Optimization Discovers Diverse LLM Alignments Beyond Gradient Descent GECCO2026

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)对齐过程中梯度优化方法普遍存在的偏好坍塌(preference collapse)问题,即模型在训练中收敛到狭窄的行为模式,导致对齐策略缺乏多样性。其解决方案的关键在于提出 EvoPref,一种基于多目标进化算法(Multi-Objective Evolutionary Algorithm, MOEA)的框架,采用非支配排序遗传算法 II(NSGA-II)结合归档机制(archive-based diversity preservation)来维护多个低秩适配器(Low-Rank Adaptation, LoRA)组成的种群,并在帮助性(helpfulness)、无害性(harmlessness)和诚实性(honesty)三个目标上进行协同优化。该方法通过种群演化有效提升了偏好覆盖范围与多样性,同时保持了与梯度基方法相当的对齐质量,实验证明其偏好覆盖提升 18%、坍塌率降低 47%,并从理论上支持了归档机制在避免单轨迹优化陷阱中的有效性。

链接: https://arxiv.org/abs/2605.09777
作者: Dongxin Guo,Jikun Wu,Siu Ming Yiu
机构: The University of Hong Kong(香港大学); Stellaris AI Limited(星驰人工智能有限公司)
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages, 2 figures, 6 tables, 1 algorithm. Accepted to GECCO 2026

点击查看摘要

Abstract:Gradient-based preference optimization methods for large language model (LLM) alignment suffer from preference collapse, converging to narrow behavioral modes while neglecting preference diversity. We introduce EvoPref, a multi-objective evolutionary algorithm that maintains populations of Low-Rank Adaptation (LoRA) adapters optimized across helpfulness, harmlessness, and honesty objectives using Non-dominated Sorting Genetic Algorithm II (NSGA-II) selection with archive-based diversity preservation. Our primary contribution is demonstrating that population-based methods discover substantially more diverse alignments than gradient descent. On standard benchmarks, EvoPref improves preference coverage by 18% (median 82.5% vs. 70.0% for ORPO, p < 0.001, Wilcoxon, n = 30) and reduces collapse rates by 47% (11.0% vs. 20.6%, p < 0.001), while achieving competitive alignment quality (median 75.5% RewardBench vs. 75.0% for ORPO, p > 0.05). We provide theoretical motivation extending recent multi-objective evolutionary algorithm (MOEA) runtime analysis (Dang et al., 2025) suggesting why archive-based methods escape collapse more effectively than single-trajectory optimization. Comprehensive comparisons against MOEA/D, SMS-EMOA, CMA-ES, and gradient baselines (DPO, IPO, KTO, ORPO) with rigorous statistical testing (Friedman with Holm correction, Vargha-Delaney effect sizes, median with IQR) confirm that multi-objective selection with diversity preservation is essential. This work establishes evolutionary optimization as a principled paradigm for diverse LLM alignment.

[NLP-103] Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中潜在的反社会倾向是否可被精确识别与操控的问题,特别是这些倾向是否具有可分离的计算模块化特征。其解决方案的关键在于使用稀疏自编码器(Sparse Autoencoder, SAE)特征引导(feature steering)技术,对Llama-3.3-70B-Instruct模型中的特定神经元激活进行干预,从而放大黑暗三联征人格特质(Machiavellianism, narcissism, and psychopathy)。研究发现,这种干预显著增强了模型在新颖行为场景中的剥削性、攻击性和冷漠性(效应量d=10.62),同时保留了认知共情能力,再现了人类黑暗三联征群体特有的共情分离现象;更重要的是,战略欺骗行为未受影响,表明剥削与欺骗可能通过可分离的计算路径实现。此外,特征发现方法本身也影响干预深度:对比发现的特征同时改变自我报告和行为,而语义搜索的特征仅改变自我报告(行为效应量差异d=12.65),揭示了反社会倾向在模型中由多个非冗余、可分离的计算组件构成,而非单一统一结构。

链接: https://arxiv.org/abs/2605.09773
作者: Cameron Berg,Roshni Lulla
机构: Reciprocal Research; Brain Creativity Institute, University of Southern California
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 3 figures

点击查看摘要

Abstract:We use sparse autoencoder (SAE) feature steering to amplify Dark Triad personality traits (Machiavellianism, narcissism, and psychopathy) in Llama-3.3-70B-Instruct and evaluate the resulting behavioral changes across five psychological instruments. The steered model becomes substantially more exploitative, aggressive, and callous on novel behavioral scenarios (d=10.62) while its cognitive empathy remains intact, reproducing the empathy dissociation characteristic of human Dark Triad populations. Critically, strategic deception is completely unaffected across all features, suggesting that exploitation and deception may operate through dissociable computational pathways in large language models. Individual feature analysis reveals non-redundant encoding, with each feature driving distinct antisocial mechanisms through separable computational pathways. We also show that feature discovery method itself modulates intervention depth: contrastively-discovered features change both self-report and behavior, while semantically-searched features change only self-report (d=12.65 between methods on behavior). These findings suggest that antisocial tendencies in at least one large language model comprise dissociable components rather than a unified construct, with implications for how such tendencies should be detected, measured, and controlled.
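SAE feature steering of the kind used here typically amounts to adding a scaled decoder direction to the residual stream at chosen layers during generation. A generic PyTorch sketch; the layer path, feature direction, and steering coefficient are illustrative placeholders rather than the paper's settings.

```python
import torch

def add_steering_hook(layer, direction, alpha=8.0):
    """Register a forward hook that shifts a layer's output along a fixed
    SAE decoder direction. `direction` (1-D tensor) and `alpha` are placeholders."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.dtype).to(hidden.device)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    return layer.register_forward_hook(hook)

# Usage (any torch module whose output is the residual-stream activation;
# the layer index and decoder matrix below are hypothetical):
# handle = add_steering_hook(model.model.layers[40], sae_decoder[feature_id])
# ... generate text with the steered model ...
# handle.remove()
```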

[NLP-104] ConFit v3: Improving Resume-Job Matching with LLM-based Re-Ranking

【速读】: 该论文旨在解决简历-职位匹配系统中嵌入式方法(如ConFit及其改进版本)在实际应用中缺乏可控性和可解释性的问题,以及现有基于大语言模型(LLM)的重排序器因训练数据噪声和短文档基准限制而性能受限的问题。其解决方案的关键在于对LLM重排序训练流程进行系统性分析与优化,包括采用多轮重排序策略、使用列表级强化学习(listwise RL)目标函数、去除真实招聘数据中的噪声样本,并在强化学习前通过更强的LLM进行监督微调(SFT)蒸馏,从而显著提升人岗匹配效果。最终基于Qwen3-8B和Qwen3-32B模型构建的ConFit v3在真实数据集上超越了当前最优的人岗匹配系统及GPT-5、Claude Opus-4.5等主流LLM。

链接: https://arxiv.org/abs/2605.09760
作者: Xiao Yu,Ruize Xu,Chengyuan Xue,Junyu Chen,Matthew So,Shijun Ma,Bo Liu,Xiangye Liang,Zhou Yu
机构: Columbia University (哥伦比亚大学); Johns Hopkins University (约翰霍普金斯大学); UCLA (加州大学洛杉矶分校); Intellipro Group Inc. (智普集团)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:A reliable resume-job matching system helps a company find suitable candidates from a pool of resumes and helps a job seeker find relevant jobs from a list of job posts. While recent advances in embedding-based methods such as ConFit and ConFit v2 can efficiently retrieve candidates at scale, the lack of controllability and explainability limits their real-world adaptations. LLM-based re-rankers can address these limitations through reasoning, but existing training recipes are developed on short-document benchmarks and do not account for noise in real-world recruiting data. In this work, we first conduct a systematic analysis over the LLM re-ranker training pipeline for person-job fit, covering inference algorithm design, RL algorithm selection, data processing, and SFT distillation. We find that using multi-pass re-ranking, training with listwise RL objectives, removing noisy samples, and distilling from a stronger LLM before RL significantly improves re-ranking performance. We then aggregate these findings to train ConFit v3 with Qwen3-8B and Qwen3-32B on real-world person-job fit datasets, and find significant improvements over existing best person-job fit systems as well as strong LLMs such as GPT-5 and Claude Opus-4.5. We hope our findings provide useful insights for future research on adapting LLM-based re-rankers to person-job fit systems.
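The abstract does not detail the multi-pass re-ranking algorithm; a common way to implement listwise LLM re-ranking over long candidate lists is a sliding window swept several times, as in the hedged sketch below. `llm_rank_window`, the window size, stride, and pass count are placeholders, not the paper's configuration.

```python
def multipass_rerank(query, candidates, llm_rank_window, window=10, stride=5, passes=2):
    """Sliding-window, multi-pass listwise re-ranking sketch.

    `llm_rank_window(query, window_of_candidates) -> reordered window` stands in
    for the LLM re-ranker scoring resumes (or jobs) against the query.
    """
    order = list(candidates)
    for _ in range(passes):
        # Walk from the tail toward the head so strong candidates can bubble up.
        start = max(len(order) - window, 0)
        while True:
            order[start:start + window] = llm_rank_window(query, order[start:start + window])
            if start == 0:
                break
            start = max(start - stride, 0)
    return order
```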

[NLP-105] Language Models Without a Trainable Input Embedding Table: Learning from Fixed Minimal Binary Token Codes

【速读】: 该论文旨在解决现代语言模型中可训练输入嵌入表(trainable input embedding table)是否为必要组件的问题。其核心假设是:对于大小为 $ V $ 的词汇表,精确表示词元身份仅需 $ K = \lceil \log_2 V \rceil $ 位二进制编码,因此无需依赖大规模可训练参数矩阵。解决方案的关键在于用固定最小二进制码(fixed minimal binary token codes)替代传统的 $ V \times d_\text{model} $ 可训练嵌入矩阵,并通过零参数升维操作(zero-parameter lift to model width)将这些二进制码扩展至模型维度 $ d_\text{model} $。实验表明,在 $ V=65,536 $(即 $ K=16 $)时,使用固定16维二进制码的模型在验证困惑度上与标准可训练嵌入基线相当,同时移除了67.1M个可训练参数,从而证明在当前设定下,可训练输入嵌入表并非语言建模性能所必需。

链接: https://arxiv.org/abs/2605.09751
作者: A. Bochkov
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Trainable input embedding tables are a standard component of modern language models. We ask whether they are actually necessary at the input interface. For a vocabulary of size V, exact token identity requires only K = \lceil \log_2 V \rceil bits. We replace the usual trainable V \times d_\text{model} input embedding matrix with fixed minimal binary token codes and a zero-parameter lift to model width. In our main setting, V=65,536, so K=16, and tokens are represented by fixed 16-dimensional binary codes tiled to d_\text{model}=1024. We also evaluate a fully table-free variant in which codes are generated from token IDs on the fly and randomly recoded by an invertible affine transform over \mathbb{F}_2^K. Across matched 32-layer decoder-only models trained on approximately 17B tokens and evaluated over three independent training seeds, fixed minimal codes achieve comparable held-out validation perplexity to a standard learned-input baseline while removing 67.1M trainable input parameters. The fixed-code runs have a lower mean validation perplexity in our experiments, 2.36 versus 2.44, but the observed gap is within the measured seed-to-seed variation of 4.8%; we therefore interpret the result as evidence that the trainable input table is not necessary, rather than as a statistically resolved superiority claim. The table-free affine-recoded variant remains close at 2.39 despite a slightly shorter training run. These results show that, in this regime, a trainable input embedding table is not necessary for useful language modeling. The output projection remains standard and trainable.
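The input interface described above is easy to reproduce: each token ID is written as a K-bit binary code and tiled to the model width, with no trainable parameters. The bit ordering and 0/1 (rather than ±1) coding in this sketch are our assumptions.

```python
import numpy as np

def binary_token_codes(token_ids, vocab_size=65_536, d_model=1024):
    """Map token IDs to fixed K-bit binary codes tiled to the model width.

    K = ceil(log2(V)); for V = 65,536 this gives 16-bit codes, and tiling
    repeats the 16 bits 64 times to fill d_model = 1024.
    """
    K = int(np.ceil(np.log2(vocab_size)))
    ids = np.asarray(token_ids)[..., None]                 # (..., 1)
    bits = (ids >> np.arange(K)) & 1                       # (..., K) binary code, LSB first
    reps = d_model // K
    return np.tile(bits, reps).astype(np.float32)          # (..., d_model), no trainable params

codes = binary_token_codes([0, 1, 42, 65_535])
print(codes.shape)        # (4, 1024)
```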

[NLP-106] The Silent Vote: Improving Zero-Shot LLM Reliability by Aggregating Semantic Neighborhoods ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在零样本分类任务中,因标准约束解码(constrained decoding)导致的重归一化偏差(Renormalization Bias)问题。当模型被限制于少数目标标签时,标准softmax操作会丢弃原始概率分布中与目标标签语义相近的词(即“沉默投票”Silent Vote),从而引发人工过自信和校准性能下降。解决方案的关键在于提出语义Softmax(Semantic Softmax)——一种推理时层(inference-time layer),通过聚合每个目标标签语义邻域内的得分来恢复被丢弃的信息,从而提升分类结果的校准性与判别能力。实验表明,该方法显著降低了期望校准误差(Expected Calibration Error, ECE)和Brier Score,同时提升了AUROC和Macro-F1指标。

链接: https://arxiv.org/abs/2605.09739
作者: Sanket Badhe,Priyanka Tiwari,Deep Shah
机构: Google(谷歌)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at GEM Workshop @ ACL 2026

点击查看摘要

Abstract:Large Language Models are increasingly used as zero-shot classifiers in complex reasoning tasks. However, standard constrained decoding suffers from a phenomenon we define as Renormalization Bias. When a model is restricted to a small set of target labels, the standard softmax operation discards the probability mass assigned to semantic synonyms in the original distribution. This loss of information, which we call the Silent Vote, results in artificial overconfidence and poor calibration. We propose Semantic Softmax, an inference-time layer that recovers this lost information by aggregating the scores of the semantic neighborhood surrounding each target label. We evaluate this approach on Qwen-3 and Phi-4-mini models using GoEmotions and Civil Comments datasets. Our results demonstrate consistent improvements across all evaluation metrics: Semantic Softmax substantially reduces Expected Calibration Error (ECE) and Brier Score, while simultaneously enhancing discriminative performance in terms of AUROC and Macro-F1. By accounting for linguistic nuances, our method provides a more calibrated and accurate alternative for zero-shot classification.
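A minimal sketch of the aggregation idea: instead of renormalizing over the exact label tokens only, sum the probability mass over each label's semantic neighborhood first. How the neighborhoods are built and weighted in Semantic Softmax is not specified in the abstract, so the uniform-sum rule here is an assumption.

```python
import numpy as np

def semantic_softmax(logprobs, label_neighborhoods):
    """Aggregate probability mass over each label's semantic neighborhood
    before renormalizing, instead of keeping only the exact label tokens.

    logprobs            : dict mapping token -> log-probability from the LLM
    label_neighborhoods : dict mapping label -> list of synonym tokens
    """
    scores = {}
    for label, neighborhood in label_neighborhoods.items():
        tokens = [label] + list(neighborhood)
        scores[label] = sum(np.exp(logprobs.get(tok, -np.inf)) for tok in tokens)
    total = sum(scores.values())
    return {label: mass / total for label, mass in scores.items()}

# Toy example: the "silent vote" for joy sits on the synonym "happy".
logprobs = {"joy": np.log(0.10), "happy": np.log(0.35), "anger": np.log(0.12)}
neighborhoods = {"joy": ["happy", "delighted"], "anger": ["angry", "mad"]}
print(semantic_softmax(logprobs, neighborhoods))
```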

[NLP-107] Learning Multi-Indicator Weights for Data Selection: A Joint Task-Model Adaptation Framework with Efficient Proxies IJCAI2026

【速读】: 该论文旨在解决大规模语言模型指令微调中数据选择效率低下的问题,特别是现有方法依赖静态、任务无关和模型无关的权重分配策略,未能充分考虑下游任务特性与模型预训练能力的差异。其解决方案的关键在于提出一种可学习的多指标权重框架,通过在小型验证集上利用上下文学习(in-context learning, ICL)信号作为高效性能代理,实现对数据选择策略的联合任务-模型自适应优化,从而在仅使用30%训练样本的情况下达到或超越全数据微调的效果。

链接: https://arxiv.org/abs/2605.09665
作者: Jingze Song,Zihao Chen,Wenqing Chen,Zibin Zheng
机构: Sun Yat-sen University (中山大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: This work has been accepted at IJCAI 2026

点击查看摘要

Abstract:Data selection is a key component of efficient instruction tuning for large language models, as recent work has shown that data quality often matters more than data quantity. Accordingly, prior studies have introduced various multi-dimensional heuristics to evaluate and filter instruction data. However, most existing methods rely on static task-agnostic and model-agnostic weighting schemes, which overlook the varying requirements of specific downstream tasks and the differing pre-existing capabilities of models. In this paper, we propose a framework for learning multi-indicator weights that jointly adapts data selection to both the downstream task and the specific model. Our method identifies optimal weight configurations without full-scale fine-tuning by utilizing in-context learning (ICL) signals on compact tiny-validation sets. These signals serve as efficient performance proxies that ensure high-fidelity evaluation at minimal computational cost. Experiments across multiple benchmarks and model families, including Mistral, Qwen, and Llama, show that the approach achieves performance comparable to or exceeding full-dataset tuning while using only 30% of the training samples on GSM8K. Furthermore, our analysis reveals a trade-off between semantic diversity and logical complexity in reasoning tasks, highlighting the necessity of joint task-model adaptation.
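The framework above can be thought of as searching over indicator-weight configurations and scoring each candidate subset with a cheap ICL proxy instead of full fine-tuning. A schematic sketch, in which the min-max normalization, the grid of candidate weights, and the `icl_proxy` callable are illustrative assumptions:

```python
import numpy as np

def select_data(indicator_matrix, weights, keep_ratio=0.3):
    """Score each example as a weighted sum of quality indicators
    (each column min-max normalized) and keep the top fraction."""
    X = np.asarray(indicator_matrix, dtype=float)            # (n_examples, n_indicators)
    X = (X - X.min(0)) / (X.max(0) - X.min(0) + 1e-8)
    scores = X @ np.asarray(weights, dtype=float)
    k = max(1, int(len(scores) * keep_ratio))
    return np.argsort(-scores)[:k]                           # indices of selected examples

def search_weights(indicator_matrix, candidate_weights, icl_proxy):
    """Pick the weight configuration whose selected subset scores best under a
    cheap ICL proxy on a tiny validation set: `icl_proxy(subset_idx) -> float`."""
    return max(candidate_weights,
               key=lambda w: icl_proxy(select_data(indicator_matrix, w)))
```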

[NLP-108] MedMeta: A Benchmark for LLMs in Synthesizing Meta-Analysis Conclusion from Medical Studies

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在医学领域中对高阶推理能力,尤其是从多源证据中进行元分析(meta-analysis)结论生成的能力评估不足的问题。现有模型虽已在事实性回忆类医疗基准上达到饱和水平,但其在整合和推导复杂医学证据方面的表现仍缺乏系统性评测。解决方案的关键在于提出首个专门针对此任务的基准测试——MedMeta,它包含81篇来自PubMed(2018–2025)的元分析,并设计两种评估流程:基于检索增强生成(Retrieval-Augmented Generation, RAG)的Golden-RAG设置(使用真实摘要作为输入)与仅依赖内部知识的Parametric-only方法。实验表明,信息接地(information grounding)至关重要,Golden-RAG显著优于Parametric-only方法,而领域微调带来的收益有限;同时揭示了当前RAG系统在识别否定证据方面的根本性缺陷,凸显出构建稳健检索增强架构比单纯模型专业化更具临床应用前景。

链接: https://arxiv.org/abs/2605.09661
作者: Huy Hoang Ha,Benoit Favre,Francois Portet
机构: Huy Hoang Ha; Benoit Favre; François Portet
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have saturated standard medical benchmarks that test factual recall, yet their ability to perform higher-order reasoning, such as synthesizing evidence from multiple sources, remains critically under-explored. To address this gap, we introduce MedMeta, the first benchmark designed to evaluate an LLM’s ability to generate conclusions from medical meta-analyses using only the abstracts of cited studies. MedMeta comprises 81 meta-analyses from PubMed (2018–2025) and evaluates models using two distinct workflows: a Retrieval-Augmented Generation (Golden-RAG) setting with ground-truth abstracts, and a Parametric-only approach relying on internal knowledge. Our evaluation framework is validated by a well-structured analysis showing our LLM-as-a-judge protocol strongly aligns with human expert ratings, as evidenced by high Pearson’s r correlation (0.81) and Bland-Altman analysis revealing negligible systematic bias, establishing it as a reliable proxy for scalable evaluation. Our findings underscore the critical importance of information grounding: the Golden-RAG workflow consistently and significantly outperforms the Parametric-only approach across models. In contrast, the benefits of domain-specific fine-tuning are marginal and largely neutralized when external material is provided. Furthermore, stress tests show that all models, regardless of architecture, fail to identify and reject negated evidence, highlighting a critical vulnerability in current RAG systems. Notably, even under ideal RAG conditions, current LLMs achieve only slightly above-average performance (~2.7/5.0). MedMeta provides a challenging new benchmark for evidence synthesis and demonstrates that for clinical applications, developing robust RAG systems is a more promising direction than model specialization alone.

[NLP-109] K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在K-12教育场景中缺乏对课程结构认知的问题。现有评测基准如C-Eval、CMMLU、GaokaoBench和EduEval主要关注事实性记忆的问答能力,而忽视了知识之间的前置关系、概念层级、实验与理论关联及教学顺序等课程结构信息。为此,作者构建了一个名为K12-KGraph的课程对齐知识图谱,基于人民教育出版社教材提取数学、物理、化学和生物学科从基础到高中阶段的知识节点及其九类关系(包括分类、前置、验证、评估等),并据此开发了两个核心资源:K12-Bench(23,640道多选题组成的评测集)和K12-Train(约2,300个QA对的监督微调语料)。实验表明,主流模型在K12-Bench上的准确率普遍低于60%,而使用K12-Train进行微调显著优于其他通用指令数据集,证明以课程结构为导向的监督信号具有极高的样本效率,是提升教育AI课程认知能力的关键所在。

链接: https://arxiv.org/abs/2605.09635
作者: Hao Liang,Qihan Lin,Zhaoyang Han,Xiaochen Ma,Zhen Hao Wong,Meiyi Qiang,Linzhuang Sun,Wentao Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used in K-12 education, yet existing benchmarks such as C-Eval, CMMLU, GaokaoBench, and EduEval mainly evaluate factual recall through exam-style question answering. Effective educational AI additionally requires curriculum cognition: understanding how knowledge is structured through prerequisite chains, concept taxonomies, experiment-concept links, and pedagogical sequencing. To address this gap, we introduce K12-KGraph, a curriculum-aligned knowledge graph extracted from official People’s Education Press textbooks across mathematics, physics, chemistry, and biology from primary to high school. The graph contains seven node types (Concept, Skill, Experiment, Exercise, Section, Chapter, Book) and nine relation types covering taxonomy, prerequisite, association, verification, assessment, location, and order. Based on this graph, we construct two resources: (1) K12-Bench, a 23,640-question multi-select benchmark spanning five graph-derived task families (Ground, Prereq, Neighbor, Evidence, and Locate); and (2) K12-Train, a KG-guided supervised fine-tuning corpus of approximately 2,300 QA pairs synthesized from graph structure and node attributes. Experiments reveal substantial deficiencies in curriculum cognition: on K12-Bench, Gemini-3-Flash achieves only 57% exact match, while the best open-source model, Gemma-4-31B-IT, reaches 46%. Under a strictly matched 2,300-sample SFT budget on Qwen3-4B-Base and Llama-3.1-8B-Base, K12-Train consistently outperforms equally sized subsets from eight mainstream instruction-tuning corpora on both GaokaoBench and EduEval, demonstrating that curriculum-structured supervision is highly sample-efficient for educational tuning. We release the graph, benchmark, training data, and full construction pipeline.

[NLP-110] Can We Trust LLMs for Mental Health Screening? Consistency, ASR Robustness, and Evidence Faithfulness

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在临床心理评估中部署的可靠性问题,具体聚焦于三个维度:模型内部一致性(intra-model consistency)、自动语音识别(ASR)鲁棒性以及预测证据的真实性(evidence faithfulness)。其关键解决方案在于系统性地评估三种主流大语言模型(Phi-4、Gemma-2-9B 和 Llama-3.1-8B)在真实语音转录与不同精度 ASR 变体下的表现,发现 Phi-4 与 Gemma-2-9B 在所有维度上均表现出优异稳定性,而 Llama-3.1-8B 在 ASR 噪声下一致性显著下降且证据锚定能力弱化,揭示了评分与解释依据之间的解耦现象(score-evidence dissociation),为临床可信 AI 应用提供了可量化的评估框架和模型选择依据。

链接: https://arxiv.org/abs/2605.09634
作者: Erfan Loweimi,Sofia de la Fuente Garcia,Samira Loveymi,Hadi Daneshvar,Saturnino Luz
机构: University of Edinburgh (爱丁堡大学); Islamic Azad University (伊斯兰阿扎德大学); Edinburgh Napier University (爱丁堡纳皮尔大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:LLMs can estimate Hospital Anxiety and Depression Scale (HADS) scores from speech in a zero-shot manner, but clinical deployment requires reliability across three dimensions: intra-model consistency, ASR robustness, and evidence faithfulness. We evaluate three LLMs (Phi-4, Gemma-2-9B, and Llama-3.1-8B) on 111 English-speaking participants using ground-truth transcripts and three Whisper ASR variants (Large, Medium, Small), with three independent runs per model-condition pair. We find that (i) Phi-4 and Gemma-2-9B achieve excellent intra-model consistency (ICC > 0.89) with minimal degradation under ASR; (ii) Llama-3.1-8B shows ASR-fragile consistency, with ICC dropping from 0.82 to 0.36 at 10% WER; (iii) predictive validity is largely preserved under ASR for robust models; and (iv) keyword groundedness exceeds 93% for Phi-4 and Gemma-2-9B but falls to 77-81% for Llama-3.1-8B. Inter-model keyword agreement is far lower than score-level agreement, revealing a score-evidence dissociation with implications for clinical interpretability.

[NLP-111] Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models

【速读】: 该论文旨在解决无分词器语言模型(Tokenizer-free language models)中因采用块(patch)机制导致的建模质量下降问题。具体而言,较大的块虽可减少计算量和KV缓存占用,但会引入“块滞后(patch lag)”——即在当前块未完全观测前,其内部字节的预测需依赖上一个块的过时表示以维持因果性,从而损害建模精度。解决方案的关键在于提出草稿垫块(Scratchpad Patching, SP):通过在每个块内动态插入临时草稿垫(scratchpad),基于下一个字节预测熵触发,仅在信息密集区域聚合已见字节并刷新上下文,从而缓解块滞后问题;该方法支持推理时后验调整计算资源,在保持相同块大小下显著提升模型性能,例如在16字节/块条件下逼近字节级基线,同时实现KV缓存缩小16倍、推理计算减少3–4倍。

链接: https://arxiv.org/abs/2605.09630
作者: Lin Zheng,Vasilisa Bashlovkina,Timothy Dozat,Dan Garrette,Laura Rimell,Joshua Maynez
机构: Google DeepMind; The University of Hong Kong
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 23 pages, 15 figures

点击查看摘要

Abstract:Tokenizer-free language models eliminate the tokenizer step of the language modeling pipeline by operating directly on bytes; patch-based variants further aggregate contiguous byte spans into patches for efficiency. However, the average patch size chosen at the model design stage governs a tight trade-off: larger patches reduce compute and KV-cache footprint, but degrade modeling quality. We trace this trade-off to patch lag: until a patch is fully observed, byte predictions within it must rely on a stale representation from the previous patch to preserve causality; this lag widens as patches grow larger. We introduce Scratchpad Patching (SP), which inserts transient scratchpads inside each patch to aggregate the bytes seen so far and refresh patch-level context for subsequent predictions. SP triggers scratchpads using next-byte prediction entropy, selectively allocating compute to information-dense regions and enabling post-hoc adjustment of inference-time compute. Across experiments on natural language and code, SP improves model quality at the same patch size; for example, even at 16 bytes per patch, SP-augmented models match or closely approach the byte-level baseline on downstream evaluations while using a 16\times smaller KV cache over patches and 3 - 4\times less inference compute.
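The entropy trigger is straightforward to sketch: compute the Shannon entropy of the next-byte distribution and insert a scratchpad only when it exceeds a threshold, so extra compute flows to information-dense regions. The threshold value here is illustrative, not the paper's setting.

```python
import numpy as np

def next_byte_entropy(probs):
    """Shannon entropy (in bits) of the next-byte distribution."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def should_insert_scratchpad(probs, threshold=2.5):
    """Trigger a transient scratchpad only where the next-byte prediction is
    uncertain, i.e. in information-dense regions."""
    return next_byte_entropy(probs) > threshold

# Toy check: a peaked distribution does not trigger, a near-uniform one does.
peaked = np.zeros(256); peaked[65] = 0.97; peaked[66] = 0.03
flat = np.full(256, 1 / 256)
print(should_insert_scratchpad(peaked), should_insert_scratchpad(flat))
```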

[NLP-112] Statistical Scouting Finds Debate-Safe but Not Debate-Useful Cases: A Matched-Ceiling Study of Open-Weight LLM Reasoning Protocols

【速读】: 该论文旨在解决语言模型在推理过程中如何动态选择最优决策机制的问题,即在给定每例生成 token 数量上限(960)的前提下,比较直接回答、三样本投票和双代理批判-修正辩论三种策略的性能差异,并探索是否可以通过低成本的预判信号实现高效路由(routing),从而最大化整体性能。其关键发现是:投票熵(vote entropy)能够有效预测辩论的安全性(即避免反效果),但无法准确识别辩论真正有益的情形——高达66%的辩论有益案例发生在投票一致但错误的情况下;因此,现有基于熵阈值或简单学习模型的路由机制难以充分回收潜在性能增益,真正的优化需要设计能规避格式合规性混淆的深层行为探测机制。

链接: https://arxiv.org/abs/2605.09618
作者: Julia Hu,Alfred Shen,Kumar Lakshmipathi
机构: Amazon Web Services(亚马逊网络服务)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 14 pages, 5 figures. Technical report / preprint

点击查看摘要

Abstract:When should a language model answer directly, sample and vote, or engage in multi-agent debate? Recent work shows voting often explains much of the gain attributed to debate, while selective-debate systems activate deliberation only on uncertain examples. We ask: under a matched ceiling on generated tokens (960 per example), how much per-example routing headroom exists, and how much is recoverable from cheap pre-deliberation signals? We evaluate greedy decoding, three-sample voting, and a two-agent critique-revise debate on MuSiQue and GSM8K using Llama 3.1 8B Instruct and Ministral 3 8B Instruct. On MuSiQue, an oracle selecting the correct protocol per example gains +14.0 and +13.7 pp over the best fixed one. The best fixed protocol is model- and dataset-dependent: each (model, dataset) cell has a different winner. This headroom is hard to recover from cheap ex-ante signals. A vote-entropy threshold is the only controller that directionally beats the best fixed protocol on both models (+1.3 and +1.7 pp), though individual paired-bootstrap CIs include zero. A joint analysis (meta-analysis +1.6 pp, p=0.125; Bayesian P(both > 0) = 0.59) is directionally consistent but not significant. Learned controllers (LR, GBT) do not outperform the threshold. The key finding is structural: vote entropy predicts where debate is safe, not where debate is needed. High entropy sharply reduces debate backfire, but 66% of debate-helpful examples (31/47) occur when voting is unanimous but wrong. A single-prompt self-critique probe on Llama flips the answer in 127/127 unanimous cases, yielding zero mutual information with the debate-helpful label; we cannot rule out a prompt-compliance artifact, but either interpretation disqualifies the probe as a router. Recovering the remaining headroom requires behavioral probes that avoid format-compliance confounds at the 8B scale.
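The vote-entropy controller discussed above is simple to reproduce: sample a few answers, measure the entropy of the empirical answer distribution, and escalate to debate only when the samples disagree. As the abstract stresses, this predicts where debate is safe rather than where it is needed; the threshold and `debate_fn` below are illustrative placeholders.

```python
import math
from collections import Counter

def vote_entropy(answers):
    """Shannon entropy (bits) of the empirical answer distribution from sampling."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def route(answers, debate_fn, threshold=0.5):
    """Entropy-threshold controller: accept the majority vote when samples agree,
    escalate to debate when they disagree."""
    if vote_entropy(answers) > threshold:
        return debate_fn(answers)                       # disagreement: debate is at least safe
    return Counter(answers).most_common(1)[0][0]        # unanimous or near-unanimous

# Example: ["A", "A", "A"] -> entropy 0, answer "A" without debate;
#          ["A", "B", "A"] -> entropy ~0.92 bits, escalate to debate.
```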

[NLP-113] Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks

【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)管道中因重复内容导致的冗余信息问题,从而提升推理效率并降低计算成本。其解决方案的关键在于实施字节级精确的块(chunk-level)去重策略,通过在不同场景下(包括干净学术检索、企业构建模式和多轮对话AI)对检索结果进行字节级完全相同的重复项移除,实现显著的上下文压缩:在高冗余场景下可达到80.34%的字节减少。研究进一步通过跨厂商五位评审员校准的评估体系验证了该方法的质量保真性,结果显示该去重策略不会引入可测量的质量下降,且所有四个主流大模型API均满足严格的5% Wilson 95%置信区间上限的材料差异阈值,证明该方案可在不牺牲模型输出质量的前提下实现确定性的推理算力节约。

链接: https://arxiv.org/abs/2605.09611
作者: Sietse Schelpe
机构: 未知
类目: Computation and Language (cs.CL)
备注: Preprint. Implementation and open-source community version available at: this https URL - this https URL

点击查看摘要

Abstract:This preprint presents an empirical analysis of byte-exact chunk-level deduplication in Retrieval-Augmented Generation (RAG) pipelines. We measure context reduction across three distinct operating regimes: clean academic retrieval (0.16% byte reduction on 22.2M BeIR passages), constructed enterprise patterns (24.03% reduction), and multi-turn conversational AI (80.34% reduction). To validate quality preservation, we conducted a cross-vendor 5-judge calibrated panel evaluation across four production APIs (Google Gemini 2.5 Flash, Anthropic Claude Sonnet 4.6, Meta Llama 3.3 70B, and OpenAI GPT-5.1). Applying a five-category human-in-the-loop noise-removal protocol to panel-majority materially different (MAT) pairs, we establish that byte-exact deduplication introduces zero measurable quality regression. Post-audit, all four vendors clear the strict 5% Wilson 95% upper-bound MAT threshold in both the clean and high-redundancy RAG regimes. This work demonstrates that substantial inference compute savings can be achieved deterministically without compromising evaluation-grade model quality.
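Byte-exact chunk-level deduplication needs nothing more than hashing each retrieved chunk and dropping later identical copies before they enter the prompt. A minimal sketch; the chunking granularity and hash choice are implementation details not fixed by the abstract.

```python
import hashlib

def dedup_chunks(chunks):
    """Byte-exact chunk-level deduplication: keep the first occurrence of each
    chunk and drop later byte-identical repeats before they enter the prompt."""
    seen = set()
    kept = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(chunk)
    return kept

retrieved = ["Refund policy: 30 days.", "Refund policy: 30 days.", "Shipping: 5-7 days."]
print(dedup_chunks(retrieved))   # the repeated chunk is removed deterministically
```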

[NLP-114] Edit-Based Refinement for Parallel Masked Diffusion Language Models ICML2026

【速读】: 该论文旨在解决掩码扩散语言模型(Masked Diffusion Language Models)在并行生成多个token时性能显著下降的问题,其根本原因在于token级训练目标与序列级一致性之间存在不匹配。解决方案的关键在于提出ME-DLM(Edit-based Refinement Framework),通过轻量级后编辑步骤对初始生成的完整响应进行修正,编辑操作包括替换、删除和插入,且这些操作基于整个序列进行条件化。训练监督信号来源于编辑距离,在固定规范化方案下提供确定性反馈,从而学习最小化修正,既保证了序列层面的一致性,又保留了扩散解码的并行效率优势。

链接: https://arxiv.org/abs/2605.09603
作者: Houxing Ren,Mingjie Zhan,Zimu Lu,Ke Wang,Yunqiao Yang,Haotian Hou,Junting Pan,Hongsheng Li
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to ICML 2026

点击查看摘要

Abstract:Masked diffusion language models enable parallel token generation and offer improved decoding efficiency over autoregressive models. However, their performance degrades significantly when generating multiple tokens simultaneously, due to a mismatch between token-level training objectives and joint sequence consistency. In this paper, we propose ME-DLM, an edit-based refinement framework that augments diffusion generation with lightweight post-editing steps. After producing an initial complete response, the model refines it through minimal edit operations, including replacement, deletion, and insertion, conditioned on the full sequence. Training supervision is derived from edit distance, providing a deterministic signal under a fixed canonicalization scheme for learning minimal corrections. This approach encourages sequence-level consistency through globally conditioned edits while preserving the efficiency benefits of parallel diffusion decoding. Extensive experiments demonstrate that ME-DLM improves the quality and robustness of multi-token parallel generation. In particular, when built upon LLaDA, our method achieves consistent gains of 11.6 points on HumanEval and 33.6 points on GSM8K while using one-eighth of the total diffusion steps. Code is available at this https URL.
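The edit-distance supervision described above can be approximated by aligning the initial draft with the reference and reading off replacement/deletion/insertion operations. The sketch below uses `difflib`, which produces a greedy alignment rather than the paper's canonicalized minimal-edit scheme, so it is only a stand-in for the actual training signal.

```python
import difflib

def edit_operations(draft, target):
    """Derive replacement/deletion/insertion operations that turn the initial
    draft token sequence into the reference, as edit-based supervision."""
    ops = []
    matcher = difflib.SequenceMatcher(a=draft, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            ops.append(("replace", i1, i2, target[j1:j2]))
        elif tag == "delete":
            ops.append(("delete", i1, i2, []))
        elif tag == "insert":
            ops.append(("insert", i1, i1, target[j1:j2]))
    return ops

draft  = ["The", "answer", "are", "42", "!"]
target = ["The", "answer", "is", "42", "."]
print(edit_operations(draft, target))
# [('replace', 2, 3, ['is']), ('replace', 4, 5, ['.'])]
```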

[NLP-115] CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics

【速读】: 该论文旨在解决住院临床推理(inpatient clinical reasoning)中因部分可观测性(partial observability)导致的决策建模难题,即医生在患者入院初期仅能基于当前可见信息做出下一步诊疗决策,而后续结果不可见,现有临床大语言模型(clinical-LLM)评估与强化学习(RL)奖励信号常将此过程简化为闭式检索、临床路径泄露或无锚定的LLM评分,缺乏对真实临床决策链条的准确刻画。解决方案的关键在于提出CLR-voyance框架,将其重构为一个部分可观测马尔可夫决策过程(POMDP),并设计同时具备结果导向(outcome-grounded)和医生验证(clinician-validated)特性的奖励机制;具体实现为CLR-POMDP,将成功病程划分为策略可见的过去与仅由“预言机”(oracle)知晓的未来,并利用过去信息生成可验证的案例特定问答对作为首个自适应临床推理评分标准(adaptive rubric)。通过GRPO微调与模型融合,该方法显著提升了Qwen3-8B和MedGemma-4B在住院临床推理上的表现(CLR-POMDP得分达84.91%),优于GPT-5(77.83%)和MedGemma-27B(66.66%),且保持通用能力,同时依托大规模医生对齐研究提供临床意义明确的偏好数据与评分标准,推动了医疗LLM-as-a-judge的发展。

链接: https://arxiv.org/abs/2605.09584
作者: Aishik Nagar,Arun-Kumar Kaliya-Perumal,Yu-Hsuan Han,Andrew Sheng-Han Huang,Kristen Kee,Yushi Cao,Yiming Chen,Hongchao Jiang
机构: ASUS Intelligent Cloud Services (ASUS智能云服务); Rehabilitation Research Institute of Singapore (新加坡康复研究所); Nanyang Technological University (南洋理工大学); Department of Family Medicine (家庭医学系); Taipei Veterans General Hospital (台北荣民总医院); School of Medicine (医学院); National Yang Ming Chiao Tung University (国立阳明交通大学); Yong Loo Lin School of Medicine (杨潞龄医学院); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Inpatient clinical reasoning is a sequential decision under partial observability: the clinician sees the admission so far and must choose the next action whose downstream consequences are not yet visible. Existing clinical-LLM evaluations and RL reward signals collapse this into closed-form retrieval, clinical journey leakage, or unanchored LLM-as-judge scoring. We introduce CLR-voyance, a framework that reformulates inpatient reasoning as a Partially Observable Markov Decision Process (POMDP) and supervises it with rewards that are simultaneously outcome-grounded and clinician-validated. We instantiate the formulation as CLR-POMDP, which partitions successful patient journeys into a policy-visible past and an oracle-only future. Using the past information, an oracle LLM generates a case-specific query-answer pair, and the first adaptive rubric for clinical reasoning which is verifiable in the future of the patient journey. These rubrics are used for both post-training and evaluation of models for inpatient clinical reasoning. We post-train Qwen3-8B and MedGemma-4B with GRPO followed by model merging, yielding state-of-the-art inpatient clinical reasoning while retaining generalist capabilities. CLR-voyance-8B achieves 84.91% on CLR-POMDP, ahead of frontier medical reasoning models like GPT-5 (77.83%) and MedGemma-27B (66.66%) and has comparable or better performance on existing medical benchmarks. To ensure a clinically meaningful setting, we conduct a large-scale clinician alignment study, where physicians curate per-case rubrics, grade candidate responses, and provide blinded pairwise preferences of model reasoning. This study provides insights on clinical LLM-as-a-judge and clinical preference-model selection, which can inform the community at large. CLR-voyance has been deployed for 6+ months at a partner public hospital, drafting thousands of reasoning-heavy inpatient notes.

[NLP-116] Towards Compact Sign Language Translation: Frame Rate and Model Size Trade-offs

【速读】: 该论文旨在解决手语翻译(Sign Language Translation, SLT)中现有模型参数量大、部署困难的问题。当前无词素(gloss-free)方法普遍依赖于大型编码器-解码器架构,导致计算资源消耗高且难以在实际场景中应用。解决方案的关键在于提出一个轻量级77M参数的流水线系统:首先利用MMPose从视频中提取骨骼姿态(skeletal pose),再通过单层线性投影直接映射到T5-small模型,从而显著降低模型复杂度。实验表明,通过调整输入帧率(如从24 fps降至12 fps),可将序列长度减半,使编码器中的二次自注意力计算复杂度降低75%,同时仅带来小幅BLEU-4分数下降(从10.06降至9.53),证明该方案在效率与性能之间实现了良好平衡。

链接: https://arxiv.org/abs/2605.09554
作者: Kuanwei Chen,Mengfeng Tsai
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 2 pages, 1 figure, 2 tables

点击查看摘要

Abstract:Sign Language Translation (SLT) converts sign language videos into spoken-language text, bridging communication between Deaf and hearing communities. Current gloss-free approaches rely on large encoder-decoder models, limiting deployment. We propose a compact 77M-parameter pipeline that couples MMPose skeletal pose extraction with a single linear projection into T5-small. By varying the input frame rate, we expose a practical efficiency trade-off: at 12 fps the model halves its sequence length, achieving a 75% reduction in encoder quadratic self-attention computational complexity while incurring only a modest BLEU-4 drop (9.53 vs. 10.06 at 24 fps on How2Sign). Our system is roughly 3x smaller than prior T5-base systems, demonstrating that a lightweight architecture can remain competitive without hierarchical encoders or large-scale models.
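The 75% figure follows directly from the quadratic cost of encoder self-attention: halving the frame rate halves the sequence length, so the L² term drops to a quarter. A small back-of-the-envelope sketch of that arithmetic, together with the single linear projection from pose keypoints into the T5 embedding space (keypoint count and d_model=512 are illustrative assumptions), is shown below:

```python
# Back-of-the-envelope sketch of the frame-rate trade-off described above,
# plus the single linear projection from pose keypoints to the T5 hidden size.
# Shapes (133 keypoints, d_model=512) are illustrative assumptions.
import numpy as np

def self_attention_cost(seq_len, d_model=512):
    """Quadratic term of encoder self-attention: O(L^2 * d)."""
    return seq_len ** 2 * d_model

clip_seconds = 8
for fps in (24, 12):
    L = clip_seconds * fps
    print(f"{fps:>2} fps -> L={L:4d}, attention cost ~ {self_attention_cost(L):.2e}")
# Halving the frame rate halves L and cuts the quadratic cost by 4x (a 75% reduction).

# Single linear projection: (frames, keypoints*2) pose coords -> (frames, d_model)
rng = np.random.default_rng(0)
pose = rng.normal(size=(clip_seconds * 12, 133 * 2))   # 12 fps skeletal poses (x, y)
W = rng.normal(size=(133 * 2, 512)) * 0.02             # the only learned "encoder"
frame_embeddings = pose @ W                             # fed to T5-small as inputs_embeds
print(frame_embeddings.shape)                           # (96, 512)
```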

[NLP-117] Crosslingual On-Policy Self-Distillation for Multilingual Reasoning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在数学推理能力上存在的语言不均衡问题,即高资源语言表现优异而低资源语言(尤其是非洲地区的多种低资源语言)推理性能显著偏低的问题。解决方案的关键在于提出跨语言在线策略自蒸馏(Crosslingual On-Policy Self-Distillation, COPSD),其核心机制是利用同一模型作为学生和教师:学生仅接收低资源语言的问题输入,而教师则获得包含问题翻译和英文参考解答的跨语言上下文信息;训练过程通过最小化学生自身生成轨迹上的全分布token级差异来提供密集监督,从而避免仅依赖结果奖励的强化学习(Reinforcement Learning, RL)所导致的稀疏性和不稳定性。该方法实现了无需额外模型或数据即可将高资源语言的推理知识有效迁移至低资源语言,并在17种非洲低资源语言上验证了其有效性与泛化能力。

链接: https://arxiv.org/abs/2605.09548
作者: Yihong Liu,Raoyuan Zhao,Michael A. Hedderich,Hinrich Schütze
机构: 未知
类目: Computation and Language (cs.CL)
备注: preprint

点击查看摘要

Abstract:Large language models (LLMs) have achieved remarkable progress in mathematical reasoning, but this ability is not equally accessible across languages. Especially low-resource languages exhibit much lower reasoning performance. To address this, we propose Crosslingual On-Policy Self-Distillation (COPSD), which transfers a model’s own high-resource reasoning behavior to low-resource languages. COPSD uses the same model as student and teacher: the student sees only the low-resource problem, while the teacher receives privileged crosslingual context, including the problem translation and reference solution in English. Training minimizes full-distribution token-level divergence on the student’s own rollouts, providing dense supervision while avoiding the sparsity and instability of outcome-only reinforcement learning (RL). Experiments on 17 low-resource African languages show that COPSD consistently improves low-resource mathematical reasoning across model sizes and substantially outperforms Group Relative Policy Optimization (GRPO). Further analyses show that COPSD improves answer-format adherence, strengthens test-time scaling, and generalizes to harder multilingual reasoning benchmarks, with especially large gains for lower-resource languages. We make our code and data available at: this https URL.
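The core of COPSD is a full-distribution, token-level KL on the student's own rollouts, with the teacher view given privileged crosslingual context. A minimal sketch of that loss over precomputed logits (an illustration, not the released implementation):

```python
# Minimal sketch of the COPSD objective: the same model is run twice over the
# student's rollout -- once seeing only the low-resource problem, once with
# privileged crosslingual context -- and the student view is trained to match
# the teacher view with a full-distribution, token-level KL on rollout positions.
import torch
import torch.nn.functional as F

def copsd_loss(student_logits: torch.Tensor,
               teacher_logits: torch.Tensor,
               rollout_mask: torch.Tensor) -> torch.Tensor:
    """
    student_logits: (B, T, V) conditioned on the low-resource problem only
    teacher_logits: (B, T, V) conditioned on problem + translation + English solution
    rollout_mask:   (B, T) 1.0 on the student's rollout tokens, 0.0 on prompt/padding
    """
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits.detach(), dim=-1)  # teacher is not updated
    kl = (p_teacher * (p_teacher.clamp_min(1e-9).log() - log_p_student)).sum(-1)  # (B, T)
    return (kl * rollout_mask).sum() / rollout_mask.sum().clamp_min(1.0)

# Toy shapes only; in practice both views come from the same LLM with different prompts.
B, T, V = 2, 16, 1000
loss = copsd_loss(torch.randn(B, T, V), torch.randn(B, T, V), torch.ones(B, T))
print(float(loss))
```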

[NLP-118] TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems

【速读】: 该论文旨在解决多智能体系统(Multi-agent Systems, MAS)在推理阶段动态适应复杂任务时的性能瓶颈问题,特别是现有方法要么固定通信拓扑结构,要么仅在推理时单独调整能力或拓扑,无法实现两者协同演化。解决方案的关键在于提出一种测试时协同进化框架TacoMAS,其核心思想是通过“快-慢”双循环机制实现能力与拓扑的差异化演化:快速能力循环基于轨迹反馈实时更新代理的专业技能以应对新子任务,而缓慢的元大语言模型(meta-LLM)驱动的拓扑循环则执行代理的出生/死亡操作(如边编辑、新增或移除代理),从而保障协作稳定性。这种设计促使MAS在推理过程中趋向于任务条件下的稳定平衡状态,实验表明该方法显著优于近20种基线模型,平均提升达13.3%。

链接: https://arxiv.org/abs/2605.09539
作者: Chen Xu,Yicheng Hu,Ruizi Wang,Xinyu Lin,Wenjie Wang,Dongrui Liu,Fuli Feng
机构: Carnegie Mellon University; University of Science and Technology of China; National University of Singapore; Shanghai AI Lab
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multi-agent systems (MAS) have emerged as a promising paradigm for solving complex tasks. Recent work has explored self-evolving MAS that automatically optimize agent capabilities or communication topologies. However, existing methods either learn a topology that remains fixed at inference time or adapt only the topology or capability during inference. We empirically and theoretically show that effective test-time evolution requires jointly adapting both axes, but on different time scales: capabilities should update rapidly to handle emerging subtasks, while the topology should evolve more slowly to preserve coordination stability. We then introduce TacoMAS, a test-time co-evolution framework for dynamic MAS. TacoMAS formulates MAS inference as a task of online graph adaptation, where nodes represent agents with role-specific capabilities and edges define their communication topology. During inference, a fast capability loop updates agent expertise using trajectory-level feedback, while a slow meta-LLM-driven topology loop performs agents’ birth-death operations on MAS, including edge edit, agent addition, and agent removal. We further show that this fast-slow design drives MAS evolution toward a task-conditioned stable equilibrium. Experiments on four benchmarks demonstrate that TacoMAS outperforms nearly 20 multi-agent baselines, achieving an average improvement of 13.3% over the strongest baseline. The codes are released at this https URL.

[NLP-119] TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM

【速读】: 该论文旨在解决扩散大语言模型(Diffusion Large Language Models, dLLMs)在并行文本生成中面临的准确率-并行性权衡问题,即增加每前向传播步长的标记数(Tokens Per Forward, TPF)常导致生成质量下降。解决方案的关键在于提出一种时序感知的轨迹自蒸馏框架(Temporal-Aware trajectory self-Distillation, TAD):在数据构建阶段,利用教师模型基于提示和真实响应生成解码轨迹,并记录中间掩码状态;根据每个掩码标记被揭示前剩余的解码步骤数,将掩码位置划分为近距和远距子集——对近距标记采用硬交叉熵损失,促使学生模型对即将解码的标记做出置信预测;对远距标记则使用软KL散度损失,提供更柔和的监督并保留未来规划知识。该机制自然衍生出两种部署配置:注重准确率的Quality模型与侧重加速的Speed模型,从而实现更优的准确率-并行性平衡。

链接: https://arxiv.org/abs/2605.09536
作者: Haoyang Zhou,Li Kong,Shijie Ren,Xiting Wang,Shuang Liang,Guowei Wang,Zhenxuan Pan
机构: 1. Tsinghua University (清华大学); 2. Peking University (北京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion large language models (dLLMs) offer a promising paradigm for parallel text generation, but in practice they face an accuracy-parallelism trade-off, where increasing tokens per forward (TPF) often degrades generation quality. Existing acceleration methods often gain speed at the cost of accuracy. To address this limitation, we propose TAD, a Temporal-Aware trajectory self-Distillation framework. During data construction, we condition a teacher model on both the prompt and the ground-truth response to generate decoding trajectories, recording the intermediate masked states throughout the process. Based on how many decoding steps remain before each masked token is revealed, we partition masked positions into near and distant subsets. For near tokens, we train the student with a hard cross-entropy loss using the teacher trajectory tokens as labels, encouraging confident predictions for tokens that are about to be decoded. For distant tokens, we apply a soft KL divergence loss between the teacher and student token distributions, providing softer supervision and preserving future planning knowledge. This temporal-aware partition naturally gives rise to two deployment configurations: a Quality model that prioritizes accuracy and a Speed model that favors more aggressive acceleration. Experiments show that TAD consistently improves the accuracy-parallelism trade-off. On LLaDA, it raises average accuracy from 46.2% to 51.6% with the Quality model and average AUP from 46.2 to 257.1 with the Speed model. Our code is available at: this https URL
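The temporal-aware partition can be expressed as a simple two-term loss over masked positions, keyed on how many decoding steps remain before each token is revealed in the teacher trajectory. A hedged sketch (the near/distant threshold and the weighting are assumptions):

```python
# Sketch of the temporal-aware objective described above (illustrative only):
# masked positions are split by how many decoding steps remain before the teacher
# trajectory reveals them; "near" positions get a hard cross-entropy on the teacher
# token, "distant" positions get a soft KL toward the teacher distribution.
import torch
import torch.nn.functional as F

def tad_loss(student_logits, teacher_logits, teacher_tokens, steps_to_reveal,
             near_horizon: int = 2, soft_weight: float = 1.0):
    # student_logits / teacher_logits: (T, V), teacher_tokens: (T,), steps_to_reveal: (T,)
    near = steps_to_reveal <= near_horizon
    distant = ~near

    loss = student_logits.new_zeros(())
    if near.any():
        loss = loss + F.cross_entropy(student_logits[near], teacher_tokens[near])
    if distant.any():
        log_p = F.log_softmax(student_logits[distant], dim=-1)
        q = F.softmax(teacher_logits[distant].detach(), dim=-1)
        kl = (q * (q.clamp_min(1e-9).log() - log_p)).sum(-1).mean()
        loss = loss + soft_weight * kl
    return loss

T, V = 32, 500
loss = tad_loss(torch.randn(T, V), torch.randn(T, V),
                torch.randint(V, (T,)), torch.randint(1, 10, (T,)))
print(float(loss))
```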

[NLP-120] Assessment of RAG and Fine-Tuning for Industrial Question-Answering Applications AAAI2026

【速读】: 该论文旨在解决企业在部署大语言模型(Large Language Models, LLMs)用于领域问答(Question Answering, QA)系统时,如何在成本与准确性之间取得最优平衡的问题。研究聚焦于两种主流的知识适配方法——检索增强生成(Retrieval-Augmented Generation, RAG)与微调(Fine-tuning, FT),并通过在汽车行业的两个封闭数据集上进行实验,评估其答案质量与运营成本。解决方案的关键在于引入并扩展了Cost-of-Pass框架,综合衡量输出质量、生成成本和用户交互成本,结果表明:尽管高端商用模型初始表现更优,但开源模型结合RAG后可达到相当的性能水平,且整体更具成本效益,因此RAG是适用于闭源与开源模型的最优适配策略。

链接: https://arxiv.org/abs/2605.09533
作者: Jakob Sturm,Josef Pichlmeier,Christian Bernhard,Maka Karalashvili,Johannes Klepsch,Georg Groh,Andre Luckow
机构: 1. University of Stuttgart (斯图加特大学); 2. Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所); 3. Fraunhofer Institute for Manufacturing Engineering and Automation IPA (弗劳恩霍夫制造工程与自动化研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at AAAI 2026 Workshop on New Frontiers in Information Retrieval

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly employed in enterprise question-answering (QA) systems, requiring adaptation to domain-specific knowledge. Among the most prevalent methods for incorporating such knowledge are Retrieval-Augmented Generation (RAG) and fine-tuning (FT). Yet, from a cost-accuracy trade-off perspective, it remains unclear which approach best suits industry scenarios. This study examines the impact of RAG and FT on two closed datasets specific to the automotive industry, assessing answer quality and operational costs. We extend the Cost-of-Pass framework proposed by Erol et al. (arXiv:2504.13359) to jointly assess output quality, generation cost, and user interaction cost. Our findings reveal that while premium models perform best out of the box, open-source models can achieve comparable quality when enhanced with RAG. Overall, RAG emerges as the most effective and cost-efficient adaptation method for both closed- and open-source models.
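The Cost-of-Pass idea reduces to dividing the expected cost of one attempt by the probability that the attempt succeeds; the paper's extension adds a user-interaction cost term. A toy sketch with purely hypothetical numbers:

```python
# Minimal sketch of a cost-of-pass style comparison (hedged: the exact extension
# used in the paper, including its user-interaction cost term, is simplified here).
# cost_of_pass ~ expected cost of one attempt / probability the attempt is correct.

def cost_of_pass(accuracy: float, cost_per_query: float, interaction_cost: float = 0.0) -> float:
    """Expected total cost per correct answer."""
    if accuracy <= 0:
        return float("inf")
    return (cost_per_query + interaction_cost) / accuracy

configs = {
    # hypothetical numbers purely for illustration
    "premium LLM, no adaptation": dict(accuracy=0.78, cost_per_query=0.020),
    "open-source LLM + RAG":      dict(accuracy=0.74, cost_per_query=0.004),
    "open-source LLM + FT":       dict(accuracy=0.70, cost_per_query=0.003),
}
for name, cfg in configs.items():
    print(f"{name:30s} cost-of-pass = ${cost_of_pass(**cfg):.4f}")
```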

[NLP-121] MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents

【速读】: 该论文旨在解决边缘-云协同环境中个性化记忆(Personalized Memory)在隐私保护与记忆效用之间难以平衡的问题。当前云辅助的记忆管理方案易泄露敏感用户信息,而现有隐私保护方法多采用激进的掩码策略,导致任务相关语义丢失,从而削弱记忆的可用性和个性化质量。解决方案的关键在于提出 MemPrivacy,其核心机制是在边缘设备端识别隐私敏感片段(privacy-sensitive spans),将其替换为语义结构化的类型感知占位符(type-aware placeholders)后上传至云端进行处理,并在需要时本地恢复原始值。通过将隐私保护与语义破坏解耦,MemPrivacy 在最小化敏感数据暴露的同时保留了有效记忆构建与检索所需的信息,实现了隐私与个性化记忆效用之间的高效权衡。

链接: https://arxiv.org/abs/2605.09530
作者: Yining Chen,Jihao Zhao,Bo Tang,Haofen Wang,Feiyu Xiong,Zhiyu Li
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As LLM-powered agents are increasingly deployed in edge-cloud environments, personalized memory has become a key enabler of long-term adaptation and user-centric interaction. However, cloud-assisted memory management exposes sensitive user information, while existing privacy protection methods typically rely on aggressive masking that removes task-relevant semantics and consequently degrades memory utility and personalization quality. To address this challenge, We propose MemPrivacy, which identifies privacy-sensitive spans on edge devices, replaces them with semantically structured type-aware placeholders for cloud-side memory processing, and restores the original values locally when needed. By decoupling privacy protection from semantic destruction, MemPrivacy minimizes sensitive data exposure while retaining the information required for effective memory formation and retrieval. We also construct MemPrivacy-Bench for systematic evaluation, a dataset covering 200 users and over 52k privacy instances, and introduce a four-level privacy taxonomy for configurable protection policies. Experiments show that MemPrivacy achieves strong performance in privacy information extraction, substantially surpassing strong general-purpose models such as GPT-5.2 and Gemini-3.1-Pro, while also reducing inference latency. Across multiple widely used memory systems, MemPrivacy limits utility loss to within 1.6%, outperforming baseline masking strategies. Overall, MemPrivacy offers an effective balance between privacy protection and personalized memory utility for edge-cloud agents, enabling secure, practical, and user-transparent deployment.
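The placeholder mechanism itself is straightforward to illustrate: sensitive spans found on-device are swapped for typed placeholders before upload and restored from a local mapping afterwards. A minimal sketch (not the MemPrivacy implementation; the on-device detector is assumed to exist):

```python
# Rough sketch of the type-aware placeholder idea: sensitive spans detected on the
# edge device are swapped for typed placeholders before the text leaves the device,
# and a local mapping restores the original values afterwards.

def mask_sensitive(text: str, spans: list[tuple[str, str]]):
    """spans: list of (surface_string, privacy_type) found by an on-device detector."""
    mapping, counters = {}, {}
    for surface, ptype in spans:
        counters[ptype] = counters.get(ptype, 0) + 1
        placeholder = f"[{ptype}_{counters[ptype]}]"
        mapping[placeholder] = surface
        text = text.replace(surface, placeholder)
    return text, mapping

def restore(text: str, mapping: dict[str, str]) -> str:
    for placeholder, surface in mapping.items():
        text = text.replace(placeholder, surface)
    return text

note = "Alice Chen visited Dr. Wu at Mercy Hospital on 2026-03-02 for insomnia."
masked, mapping = mask_sensitive(note, [("Alice Chen", "NAME"),
                                        ("Mercy Hospital", "ORG"),
                                        ("2026-03-02", "DATE")])
print(masked)                              # cloud side sees typed placeholders only
print(restore(masked, mapping) == note)    # True: edge side can recover the original
```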

[NLP-122] Hidden Error Awareness in Chain-of-Thought Reasoning: The Signal Is Diagnostic, Not Causal ICML2026

【速读】: 该论文试图解决的问题是:链式思维(Chain-of-thought, CoT)提示假设模型生成的推理过程反映了其内部计算,但这一假设是否成立尚不明确。研究发现,模型在内部能够检测到自身推理错误,却在外显层面表现出对错误推理的高度自信,这表明CoT输出中的信心评分无法真实反映推理质量。解决方案的关键在于识别出这种“隐藏的错误意识”——通过线性探测器对隐藏状态进行分析,可准确预测推理路径正确性(AUROC达0.95),而文本表面分类器仅能获得0.59的性能,揭示了模型内部与外显行为之间的显著差异。进一步实验表明,尽管存在此诊断信号,现有四种干预手段(激活引导、基于探测器的Best-of-N、自纠正和激活修补)均无法修正错误,说明该信号仅具诊断价值而非因果控制能力,从而划定了机制解释学中推理错误表征与已有可编辑事实知识表征的根本区别边界。

链接: https://arxiv.org/abs/2605.09502
作者: Aojie Yuan,Zhiyuan Julian Su,Haiyue Zhang,Yi Nian,Yue Zhao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 5 figures, 10 this http URL Interpretability @ ICML 2026

点击查看摘要

Abstract:Chain-of-thought (CoT) prompting assumes that generated reasoning reflects a model’s internal computation. We show this assumption is wrong in a specific, measurable way: models internally detect their own reasoning errors but outwardly express confidence in them. A linear probe on hidden states predicts trace correctness with 0.95 AUROC – from the very first reasoning step (0.79) – while verbalized confidence for wrong traces is 4.55/5, nearly identical to correct ones (4.87/5). A text-surface classifier achieves only 0.59 on the same data, confirming a 0.20-point gap invisible in the generated text. This hidden error awareness holds across three model families (Qwen, Llama, Phi), 1.5B-72B parameters, and RL-trained reasoning models (DeepSeek-R1, 0.852 AUROC). The natural question is whether this signal can fix the errors it detects. It cannot. Four interventions – activation steering, probe-guided best-of-N, self-correction, and activation patching – all fail; patching destroys output coherence entirely. The signal is diagnostic, not causal: a readout of computation quality, not a lever to redirect it. This delineates a boundary for mechanistic interpretability: error representations during reasoning are fundamentally different from the factual knowledge representations that prior work has successfully edited.
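The probe in question is just a linear classifier on hidden states scored by AUROC. The following sketch reproduces the recipe on synthetic stand-in features; in the paper the features would be mid-layer hidden states of real reasoning traces:

```python
# Sketch of the kind of linear probe used to read out hidden error awareness:
# a logistic regression on one hidden-state vector per trace, evaluated by AUROC.
# Synthetic features stand in for real hidden states; this is illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 256
labels = rng.integers(0, 2, size=n)                 # 1 = trace ends up correct
hidden = rng.normal(size=(n, d)) + 0.6 * labels[:, None] * rng.normal(size=(1, d))
# ^ stand-in for e.g. the mid-layer hidden state at the first reasoning step

X_tr, X_te, y_tr, y_te = train_test_split(hidden, labels, test_size=0.3, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auroc = roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])
print(f"probe AUROC: {auroc:.3f}")   # the paper reports ~0.95 on real hidden states
```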

[NLP-123] Beyond Language: Format-Agnostic Reasoning Subspaces in Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在不同符号系统(如英文文本、Python代码、数学表达式)中是否共享一个统一的内部推理子空间(Format-Agnostic Reasoning Subspace, FARS)的问题。其关键解决方案是通过引入TriForm Benchmark并结合排列校正的RSA(Representational Similarity Analysis)、跨形式探针和激活修补技术,发现中间层存在一个10维的FARS子空间:该子空间能将概念结构放大3倍且几乎消除形式信息;仅替换这10维即可保留90–96%的模型输出,显著优于全激活替换(44–56%)或方差最大化PCA(60–74%),同时移除这些维度会导致特定概念功能受损,从而证明了FARS的存在及其对跨形式推理的普适性与功能性。

链接: https://arxiv.org/abs/2605.09496
作者: Aojie Yuan,Zhiyuan Su
机构: University of Southern California (南加州大学); Duke University (杜克大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Preprint. 13 pages, 13 figures, 12 tables

点击查看摘要

Abstract:Large language models represent the same reasoning in vastly different surface forms – English prose, Python code, mathematical notation – yet whether they share a common internal substrate across these symbolic systems remains unknown. We introduce the TriForm Benchmark (18 concepts x 6 forms x 3 instances = 324 stimuli) and study five LLMs (1.6B-8B) across three architecture families. Using permutation-corrected RSA, cross-form probing, and activation patching, we find converging evidence for a Format-Agnostic Reasoning Subspace (FARS) in middle layers. We make FARS concrete: concept-centroid PCA extracts a 10-dimensional subspace that amplifies concept structure 3x while suppressing form information to near zero. Replacing only these 10 dimensions during cross-form patching preserves 90-96% of model output – far exceeding both full activation replacement (44-56%) and variance-maximizing PCA (60-74%) – while ablating them causes targeted disruption. FARS generalizes to held-out concepts and converges across architectures (CCA 0.79 for all model pairs), providing within-modality evidence for the Platonic Representation Hypothesis. We further discover a declarative-procedural asymmetry: representations are far more compatible between prose and mathematics than between either and code, suggesting that the critical axis of divergence is not linguistic vs. formal but declarative vs. procedural.
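Concept-centroid PCA and subspace patching are easy to write down; the sketch below (synthetic activations, k=10 as in the paper) shows how only the FARS components of one form's activation would be replaced with those of another:

```python
# Illustrative sketch of concept-centroid PCA and cross-form patching of a small
# subspace (the paper's FARS); dimensions and data here are synthetic stand-ins.
import numpy as np

def concept_subspace(centroids: np.ndarray, k: int = 10) -> np.ndarray:
    """PCA over concept centroids -> top-k directions as a (d, k) orthonormal basis."""
    centered = centroids - centroids.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)   # principal directions
    return vt[:k].T                                            # (d, k)

def patch_subspace(h_target: np.ndarray, h_source: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Replace only the subspace components of h_target with those of h_source."""
    proj = basis @ basis.T                                     # projector onto the k-dim subspace
    return h_target - proj @ h_target + proj @ h_source

d, n_concepts = 512, 18
rng = np.random.default_rng(0)
centroids = rng.normal(size=(n_concepts, d))                   # per-concept mean activations
basis = concept_subspace(centroids, k=10)

h_code, h_prose = rng.normal(size=d), rng.normal(size=d)
h_patched = patch_subspace(h_code, h_prose, basis)             # code-form activation, prose-form concept
print(basis.shape, np.allclose(basis.T @ basis, np.eye(10)))   # (512, 10) True
```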

[NLP-124] APCD: Adaptive Path-Contrastive Decoding for Reliable Large Language Model Generation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在自回归解码过程中因错误累积导致的幻觉(hallucination)问题,即早期生成的次优token选择会误导后续生成路径,从而降低输出的准确性与可靠性。其解决方案的关键在于提出一种自适应路径对比解码(Adaptive Path-Contrastive Decoding, APCD)框架,通过两个核心机制实现:一是基于熵驱动的路径扩展(Entropy-Driven Path Expansion),在预测不确定性(由Shannon熵衡量)表明存在多个合理延续时才进行分支,避免过早探索;二是基于差异感知的路径对比(Divergence-Aware Path Contrast),在不同路径间动态调节相互影响强度,鼓励多样化推理轨迹并抑制冗余干扰,从而在保持解码效率的同时显著提升事实准确性。

链接: https://arxiv.org/abs/2605.09492
作者: Tianyu Zheng,Hong Wu,Jiaji Zhong
机构: University of Electronic Science and Technology of China (电子科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) often suffer from hallucinations due to error accumulation in autoregressive decoding, where suboptimal early token choices misguide subsequent generation. Although multi-path decoding can improve robustness by exploring alternative trajectories, existing methods lack principled strategies for determining when to branch and how to regulate inter-path interactions. We propose Adaptive Path-Contrastive Decoding (APCD), a multi-path decoding framework that improves output reliability through adaptive exploration and controlled path interaction. APCD consists of two components: (1) Entropy-Driven Path Expansion, which delays branching until predictive uncertainty - measured by Shannon entropy over top candidate tokens - indicates multiple plausible continuations; and (2) Divergence-Aware Path Contrast, which encourages diverse reasoning trajectories while dynamically attenuating inter-path influence as prediction distributions diverge. Experiments on eight benchmarks demonstrate improved factual accuracy while maintaining decoding efficiency. Our code is available at this https URL.
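The branching criterion is Shannon entropy over the top candidate tokens. A small sketch of that rule (the top-k size and the threshold are assumptions):

```python
# Sketch of the entropy-driven branching rule: branch into alternative decoding
# paths only when the Shannon entropy over the top-k candidate tokens signals
# several plausible continuations. Threshold and k are illustrative choices.
import math

def should_branch(token_probs: dict[str, float], top_k: int = 5, threshold: float = 1.0) -> bool:
    top = sorted(token_probs.values(), reverse=True)[:top_k]
    z = sum(top)
    renorm = [p / z for p in top]                      # renormalize over the top-k set
    entropy = -sum(p * math.log(p) for p in renorm if p > 0)
    return entropy > threshold

confident = {"Paris": 0.92, "London": 0.03, "Rome": 0.02, "Berlin": 0.02, "Madrid": 0.01}
uncertain = {"4": 0.28, "6": 0.26, "8": 0.24, "2": 0.12, "12": 0.10}
print(should_branch(confident))   # False -> keep a single path
print(should_branch(uncertain))   # True  -> expand alternative paths here
```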

[NLP-125] Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning ICML2026

【速读】: 该论文旨在解决生成式 AI(Generative AI)中大型语言模型(Large Language Models, LLMs)推理过程中因键值缓存(KV cache)占用大量GPU高带宽内存(HBM)而导致的显存瓶颈问题。传统方法通过永久丢弃低重要性token来缓解这一问题,但会导致推理准确率急剧下降至0–2.5%。其核心创新在于提出一种语义感知的多级内存层次结构(semantics-aware memory hierarchy),将token分为四层存储:HBM、DDR、压缩存储和已驱逐状态,利用累积注意力评分对token进行分级管理;关键突破是实现“零近似误差卸载”(zero-approximation-error offloading)——低重要性token被移至CPU内存而非删除,在每次注意力计算前以全精度预取回GPU,确保其贡献与始终驻留HBM时完全一致。研究发现,模型准确性仅取决于永久丢弃的token比例(即驱逐比),而不受HBM中保留token数量的影响,从而在保持高精度的同时显著降低HBM占用(最多节省48 GB)。

链接: https://arxiv.org/abs/2605.09490
作者: Aojie Yuan,Tianqi Shen,Dajun Zhang
机构: University of Southern California (南加州大学); University of Wisconsin–Madison (威斯康星大学麦迪逊分校)
类目: Computation and Language (cs.CL); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
备注: Preprint. 14 pages + appendix. Under review at AdaptFM Workshop @ ICML 2026

点击查看摘要

Abstract:Reasoning LLMs produce thousands of chain-of-thought tokens whose KV cache must reside in scarce GPU HBM. The dominant response – permanently evicting low-importance tokens – is catastrophic for reasoning: accuracy collapses to 0-2.5% when half the cache is removed. We ask a different question: must every token live in HBM, or can some live elsewhere? We introduce a semantics-aware memory hierarchy that sorts tokens into four tiers – HBM, DDR, compressed, and evicted – using cumulative attention scoring. Low-importance tokens are moved to CPU memory rather than destroyed; before each attention step they are prefetched back at full precision, contributing exactly the same terms as if they had never left the GPU. We formalize this as zero-approximation-error offloading and derive our central finding: accuracy depends solely on how many tokens are permanently discarded (the eviction ratio), not on how many remain in HBM. A controlled 3x3 grid over HBM and eviction ratios confirms this across three model scales (7B-32B) and four benchmarks. With only 3% eviction, the hierarchy retains 91% of full-cache accuracy on GSM8K and 71% on MATH-500 (n=200); at 14B scale it matches the uncompressed baseline (90% vs. 86%) while halving HBM occupancy. A head-to-head reproduction of R-KV – the current SOTA eviction method – on our setup achieves only 0-32% at comparable budgets. A system prototype with real GPU-CPU data movement shows that the price of this preservation is modest – 5-7% transfer overhead – and scaling analysis projects 2-48 GB HBM savings at production batch sizes.
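The tier assignment can be illustrated as a ranking of cached tokens by cumulative attention mass, with only a tiny tail permanently evicted. A sketch with assumed tier fractions chosen to reproduce the paper's 3% eviction example:

```python
# Sketch of cumulative-attention tiering (illustrative; tier fractions are assumptions).
# Tokens are ranked by accumulated attention mass; the top tier stays in HBM, middle
# tiers move to CPU memory (and are prefetched back at full precision), and only a
# small tail is permanently evicted.
import numpy as np

def assign_tiers(cum_attention: np.ndarray,
                 hbm_frac: float = 0.50, ddr_frac: float = 0.35,
                 compressed_frac: float = 0.12) -> np.ndarray:
    """Return an array of tier names, one per cached token."""
    order = np.argsort(-cum_attention)                 # most-attended tokens first
    n = len(cum_attention)
    bounds = np.cumsum([int(n * hbm_frac), int(n * ddr_frac), int(n * compressed_frac)])
    tiers = np.empty(n, dtype=object)
    tiers[order[:bounds[0]]] = "HBM"
    tiers[order[bounds[0]:bounds[1]]] = "DDR"
    tiers[order[bounds[1]:bounds[2]]] = "compressed"
    tiers[order[bounds[2]:]] = "evicted"               # only this tier loses information
    return tiers

scores = np.random.default_rng(0).random(1000)          # cumulative attention per KV entry
tiers = assign_tiers(scores)
print({t: int((tiers == t).sum()) for t in ("HBM", "DDR", "compressed", "evicted")})
# {'HBM': 500, 'DDR': 350, 'compressed': 120, 'evicted': 30} -> 3% eviction ratio
```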

[NLP-126] A Cognitively Grounded Bayesian Framework for Misinformation Susceptibility

【速读】: 该论文旨在解决信息失序(Information Disorder)背景下个体对虚假、错误及恶意信息的易感性建模问题,尤其关注不同类型的误导信息(mis-, dis-, mal-information)在认知层面的差异化脆弱性。其解决方案的核心在于提出一种基于贝叶斯框架的有界务实倾听者模型(Bounded Pragmatic Listener, BPL),该模型在理性言语行为理论(Rational Speech Act theory)基础上引入三个由有限理性文献启发的认知约束:a) 递归深度限制(反映工作记忆容量)、b) 先验压缩参数(模拟信息瓶颈机制)、c) 可用样本量(通过显著性加权提议实现重要性采样)。这一设计使模型能够有效预测个体对信息失序的敏感性、标注者分歧,并验证“深度不匹配悖论”等关键假设,在LIAR和MultiFC基准上实现了竞争性的真伪分类性能与实验支持。

链接: https://arxiv.org/abs/2605.09483
作者: Pranava Madhyastha
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: work in progress

点击查看摘要

Abstract:In this (work in progress) paper, we present Bounded Pragmatic Listener (or BPL), a cognitively grounded Bayesian framework for modelling susceptibility to information disorder. BPL extends Rational Speech Act theory with three cognitively motivated bounds derived from the bounded rationality literature: a) a recursion depth bound (that emphasises working memory limits); b) a prior compression parameter (which is oriented at capturing information bottleneck); and c) an availability sample size (that operationalises importance sampling with saliency-weighted proposals). This allows us to test predictions about misinformation susceptibility, annotator disagreement, and the differential vulnerability to mis-, dis-, and mal-information as defined in the Information Disorder framework. We validate BPL on the LIAR and MultiFC benchmarks showcasing competitive veracity classification and experimental support for the depth-mismatch paradox.

[NLP-127] Align and Shine: Building High-Quality Sentence-Aligned Corpora for Multilingual Text Simplification LREC2026

【速读】: 该论文旨在解决多语言文本简化(Text Simplification)模型训练与评估中高质量数据稀缺的问题,尤其是在英语之外的语言中。其解决方案的关键在于通过众包方式从可比语料库(Comparable Corpora)中收集并处理简化文本对,并实现文档级数据到句子级的对齐机制,从而构建一个适用于多种语言(包括加泰罗尼亚语、英语、法语、意大利语和西班牙语)的标准化简化语料库,该语料库可用于训练和测试文本简化系统。

链接: https://arxiv.org/abs/2605.09476
作者: Kenji Hilasaca,Nouran Khallaf,Serge Sharoff
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at BUCC 2026 workshop at LREC 2026

点击查看摘要

Abstract:Text simplification plays a crucial role in improving the accessibility and comprehensibility of written information for diverse audiences, including language learners and readers with limited literacy. Despite its importance, large-scale, high-quality datasets for training and evaluating text simplification models remain scarce for languages other than English. This paper reports an experimental study on the collection and processing of crowd-sourced simplification data from comparable corpora to construct a corpus suitable for both training and testing text simplification systems across multiple languages (Catalan, English, French, Italian and Spanish). We report mechanisms for sentence-level alignment from document-level data. The resulting dataset of the aligned sentence pairs is publicly available.

[NLP-128] FinMoji: A Framework for Emoji-driven Sentiment Analysis in Financial Social Media

【速读】: 该论文旨在解决金融情感分析中如何利用表情符号(emoji)作为投资者情绪的紧凑指标,以提升市场趋势预测的效率与准确性。其核心问题是:表情符号是否可独立作为可靠的情感代理,并在计算成本和预测性能之间取得平衡。解决方案的关键在于构建仅基于表情符号的分类模型,并与传统文本-表情符号联合模型进行对比实验,结果表明,尽管表情符号单独使用时F1分数约为0.75(低于文本-表情符号组合模型的0.88),但其显著更低的计算开销使其适用于高频交易等对时效性要求高的场景;此外,研究还发现特定表情符号及其组合具有超过90%的准确率预测牛市或熊市趋势,凸显了表情符号在金融语境中的独特价值。

链接: https://arxiv.org/abs/2605.09469
作者: Ahmed Mahrous,Roberto Di Pietro
机构: King Abdullah University of Science and Technology (KAUST); Hamad Bin Khalifa University
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper explores the use of emojis in financial sentiment analysis, focusing on the social media platform StockTwits. Emojis, increasingly prevalent in digital communication, have potential as compact indicators of investor sentiment, which can be critical for predicting market trends. Our study examines whether emojis alone can serve as reliable proxies for financial sentiment and how they compare with traditional text-based analysis. We conduct a series of experiments using logistic regression and transformer models. We further analyze the performance, computational efficiency, and data requirements of emoji-based versus text-based sentiment classification. Using a balanced dataset of about 528,000 emoji-containing StockTwits posts, we find that emoji-only models achieve F1 approximately 0.75, lower than text-emoji combined models, which achieve F1 approximately 0.88, but with far lower computational cost. This is a useful feature in time-sensitive settings such as high-frequency trading. Furthermore, certain emojis and emoji pairs exhibit strong predictive power for market sentiment, demonstrating over 90 percent accuracy in predicting bullish or bearish trends. Finally, our research reveals large statistical differences in emoji usage between financial and general social media contexts, stressing the need for domain-specific sentiment analysis models.

[NLP-129] Beyond Position Bias: Shifting Context Compression from Position-Driven to Semantic-Driven

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长文本场景下部署时面临的高计算开销和信息冗余问题。现有软提示压缩方法受限于位置偏差(position bias),即依赖固定位置插入可学习标记或按物理token布局分组,导致性能不稳定和语义碎片化。其解决方案的关键在于提出语义一致性上下文压缩(Semantic Consistency Context Compression, SeCo),该方法将压缩过程从基于位置驱动转向语义驱动:通过选择与查询相关的token作为语义中心,并以一致性加权方式聚合其余token,从而在消除位置偏差的同时保持语义一致性。

链接: https://arxiv.org/abs/2605.09463
作者: Jiwei Tang,Zhijing Huang,Xinyu Zhang,Chen Jason Zhang,Jianxing Yu,Libin Zheng,Rui Meng,Jian Yin
机构: Sun Yat-sen University (中山大学); Hong Kong Polytechnic University (香港理工大学); Beijing Normal–Hong Kong Baptist University (北京师范大学-香港浸会大学)
类目: Computation and Language (cs.CL)
备注: 20 pages, 6 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated exceptional performance across diverse tasks. However, their deployment in long-context scenarios faces high computational overhead and information redundancy. While soft prompt compression has emerged as a promising way to mitigate these costs by compressing sequences into compact embeddings, existing paradigms remain fundamentally constrained by position bias: they primarily rely on learnable tokens insertion at fixed positions or group tokens according to their physical token layout, thereby inducing performance instability and semantic fragmentation. To overcome this bottleneck, we propose Semantic Consistency Context Compression (SeCo), a method that shifts context compression from position-driven to semantic-driven. Rather than constraint by physical token layout, SeCo dynamically anchors compression directly in the semantic space by selecting query-relevant tokens as semantic centers and aggregating remaining tokens via consistency-weighted merging. This design inherently preserves semantic consistency while eliminating position bias. Extensive experiments on 14 benchmarks across two backbone models demonstrate that SeCo consistently shows superiority in downstream tasks, inference latency, and out-of-domain robustness. The code is available at this https URL.
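A rough sketch of the two steps, selecting query-relevant tokens as semantic centers and folding the remaining tokens into them with consistency weights, is given below; the concrete scoring and merging rules here are assumptions, not the SeCo code:

```python
# Illustrative sketch of semantic-center selection and consistency-weighted merging.
import numpy as np

def compress(token_embs: np.ndarray, query_emb: np.ndarray, n_centers: int = 8):
    """token_embs: (T, d), query_emb: (d,) -> (n_centers, d) compressed embeddings."""
    def norm(x):  # cosine-normalize rows
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    toks, q = norm(token_embs), norm(query_emb[None])[0]
    relevance = toks @ q                                   # query relevance per token
    centers = np.argsort(-relevance)[:n_centers]           # most relevant tokens anchor the memory
    rest = np.setdiff1d(np.arange(len(toks)), centers)

    # consistency of each remaining token with each center (cosine similarity)
    sim = toks[rest] @ toks[centers].T                      # (T - n_centers, n_centers)
    weights = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)

    merged = token_embs[centers].copy()
    for j in range(n_centers):                              # fold each token into its centers
        merged[j] += (weights[:, j:j + 1] * token_embs[rest]).sum(axis=0) / max(len(rest), 1)
    return merged

T, d = 256, 64
rng = np.random.default_rng(0)
out = compress(rng.normal(size=(T, d)), rng.normal(size=d))
print(out.shape)   # (8, 64): long context reduced to a handful of semantic centers
```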

[NLP-130] Through the Lens of Character: Resolving Modality-Role Interference in Multimodal Role-Playing Agent

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在角色扮演代理(Role-Playing Agents, RPAs)应用中出现的模态-角色干扰(Modality-Role Interference, MRI)问题。MRI表现为:由于MLLMs提取的是通用、与角色无关的视觉特征,导致角色特有的细微特质被视觉噪声淹没,从而破坏角色一致性与视觉锚定之间的整合。解决方案的关键在于提出一种无需训练的Character-Aware Visual Intervention (CAVI) 框架,其核心机制包括三个层次:宏观上通过Character-Guided Token Pruning (CTP) 限制视觉感受野至角色相关实体;微观上通过Orthogonal Feature Modulation (OFM) 将视觉token投影到角色上下文子空间以提取对齐信息;以及在解码阶段引入Modality-Adaptive Role Steering (MARS),根据视觉依赖程度动态调整角色引导强度。这一系统性干预策略有效缓解了MRI,显著提升了角色一致性下的多模态交互能力。

链接: https://arxiv.org/abs/2605.09443
作者: Yihong Tang,Kehai Chen,Xuefeng Bai,Min Zhang
机构: Harbin Institute of Technology, Shenzhen, China; Shenzhen Loop Area Institute (SLAI), Shenzhen, China
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The advancement of Multimodal Large Language Models (MLLMs) has expanded Role-Playing Agents (RPAs) into visually grounded environments. However, human vision is inherently subjective and identity-driven, whereas existing MLLMs extract objective, character-agnostic features for general tasks. In RPAs, this generic visual noise overpowers fragile character traits, causing Modality-Role Interference (MRI), where agents struggle to integrate visual grounding and character consistency. To address this, we introduce the training-free Character-Aware Visual Intervention (CAVI) framework, enabling agents to perceive the world through the lens of character. CAVI systematically targets MRI: macroscopically, Character-Guided Token Pruning (CTP) restricts the visual receptive field to role-relevant entities; microscopically, Orthogonal Feature Modulation (OFM) projects tokens onto a character-context subspace to extract aligned facts; and during decoding, Modality-Adaptive Role Steering (MARS) dynamically optimizes steering intensity based on visual reliance. Extensive experiments show CAVI effectively alleviates MRI, significantly enhancing character-consistent multimodal interactions.

[NLP-131] Key Coverage Matters: Semi-Structured Extraction of OCR Clinical Reports

【速读】: 该论文旨在解决临床报告在不同医疗机构间因隐私法规和数据孤岛限制而难以整合的问题,尤其是在患者跨院就诊时,纸质或扫描件报告无法有效融入电子健康记录(Electronic Health Record, EHR)系统,从而阻碍了长期病历回顾及下游应用如患者管理、随访护理、真实世界研究和临床试验匹配的实现。解决方案的关键在于将问题建模为基于OCR文本的、以关键字段(canonical key)条件驱动的抽取式问答任务,并通过迭代式的关键字段挖掘、归一化、聚类与轻量级人工验证构建一个动态扩展的“标准关键字段库存”(canonical key inventory),并引入“关键覆盖度”(key coverage)作为量化指标衡量库存完整性;实验表明,随着关键覆盖度提升,模型性能单调改善,在Top-90关键字段覆盖时F1得分分别达到0.839(精确匹配)和0.893(边界容忍匹配),证明关键覆盖度是端到端性能的核心决定因素。

链接: https://arxiv.org/abs/2605.09440
作者: Yu Wang,Yingyun Li,Ying Qin,Haiyang Qian
机构: AI Starfish(人工智能章鱼)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint. Under review at MLHC 2026

点击查看摘要

Abstract:Clinical reports are often fragmented across healthcare institutions because privacy regulations and data silos limit direct information sharing. When patients seek care at a different hospital, they often carry paper or scanned reports from prior visits. This hinders EHR integration and longitudinal review, and downstream applications that depend on more complete patient records, such as patient management, follow-up care, real-world studies, and clinical-trial matching. Although OCR can digitize such reports, reliable extraction remains challenging because clinical documents are heterogeneous, OCR text is noisy, and many healthcare settings require low-cost on-premise deployment. We formulate this problem as canonical key-conditioned extractive question answering over OCR-derived clinical reports. Because the key fields are neither fixed nor known in advance, the key space is open. We maintain a canonical key inventory through iterative key mining, normalization, clustering, and lightweight human verification, and introduce key coverage as a metric to quantify inventory completeness. Using a 0.2B BERT-based model, experiments on real-world reports from more than 20 hospitals show performance improves monotonically with key coverage. The model achieves F1 scores of 0.839 and 0.893 under exact match and boundary-tolerant matching, respectively, once the Top-90 canonical keys are covered. These results show that key coverage is a dominant factor for end-to-end performance. At Top-90 coverage, our model outperforms a fine-tuned Qwen3-0.6B baseline under exact match. Although our annotated corpus is Chinese, the method relies on the language-agnostic key-value organization of semi-structured clinical reports and can be adapted to other settings given an appropriate canonical key inventory and alias mapping.
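The key-coverage metric itself is simple: the fraction of gold key fields that map, after alias normalization, into the current canonical inventory. A minimal sketch with hypothetical keys and aliases:

```python
# Minimal sketch of the key-coverage metric: the fraction of gold key fields in a
# report collection that are covered by the current canonical key inventory
# (after alias normalization). Names and aliases below are hypothetical.

def key_coverage(gold_keys_per_report: list[list[str]],
                 canonical_inventory: set[str],
                 alias_map: dict[str, str]) -> float:
    total = covered = 0
    for report_keys in gold_keys_per_report:
        for key in report_keys:
            canonical = alias_map.get(key, key)      # normalize aliases to canonical keys
            total += 1
            covered += canonical in canonical_inventory
    return covered / max(total, 1)

inventory = {"hemoglobin", "white_blood_cell_count", "diagnosis", "exam_date"}
aliases = {"HGB": "hemoglobin", "WBC": "white_blood_cell_count", "dx": "diagnosis"}
reports = [["HGB", "WBC", "exam_date"], ["dx", "platelet_count"]]
print(f"key coverage: {key_coverage(reports, inventory, aliases):.2f}")  # 0.80
```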

[NLP-132] PumpSense: Real-Time Detection and Target Extraction of Crypto Pump-and-Dumps on Telegram

【速读】: 该论文旨在解决通过Telegram群组协调的加密货币“拉高-抛售”(pump-and-dump)骗局对市场完整性造成的威胁,此类骗局通常依赖于隐蔽的社交平台信息传播,而现有方法在检测速度与准确性上存在不足。其关键解决方案在于构建了一个包含28万余条Telegram消息的高质量标注语料库,并在此基础上定义了两个任务:实时泵公告检测和目标币种/交易所提取。针对检测任务,作者对比了轻量级树模型LightGBM(F1=0.79,延迟9.4秒/样本)与基于Transformer的BGE-M3模型(F1=0.83,延迟50毫秒/样本),证明了利用语言模型可实现微秒级响应的即时检测;对于目标提取任务,研究发现传统基于规则的方法因ticker歧义失效,而大语言模型(LLM)则实现了最高准确率(0.91),从而建立了首个针对被操纵币种与交易所识别的基准。

链接: https://arxiv.org/abs/2605.09431
作者: Ahmed Mahrous,Roberto Di Pietro
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to the 2026 IEEE International Conference on Blockchain and Cryptocurrency (ICBC)

点击查看摘要

Abstract:Cryptocurrency pump-and-dump schemes coordinated via Telegram threaten market integrity. However, existing research addressing this specific threat has not yet produced solutions that combine reliable results with fast response. This is in part due to the absence of publicly available, message-level labeled data, as well as design choices. In this paper, we address both issues. In particular, we introduce a corpus of over 280,000 Telegram posts from 39 pump-organizing groups, all manually reviewed to identify 2,246 pump announcements and their targeted cryptocurrency and exchange. Leveraging this dataset, we define two tasks: real-time pump-announcement detection and target cryptocurrency/exchange extraction. For detection, we compare two machine-learning models: a lightweight tree-based LightGBM classifier (F1=0.79, latency=9.4 s/sample) and a transformer-based BGE-M3 (F1=0.83, latency=50 ms/sample). With our proposed approach, we show that message analysis can achieve near-instant pump detection at the level of individual Telegram message windows. Unlike prior work that relies purely on market data and typically detects pumps tens of seconds after abnormal trading activity is observed, our method operates directly on the coordination messages themselves and can be evaluated in microseconds per window on commodity hardware. To our knowledge, we also establish the first benchmark for manipulated coin and exchange extraction. We demonstrate that traditional rule-based extraction methods, widely relied upon in prior literature, are ineffective due to ticker ambiguity. In contrast, LLMs achieve the highest accuracy with a score of 0.91.

[NLP-133] Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs

【速读】: 该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在因果发现任务中对文本先验(textual prior)的依赖问题,即模型虽能准确理解视频内容,但在进行因果推理时系统性地忽视视觉信息,过度依赖文本线索,导致其因果推理可靠性不足。解决方案的关键在于提出一种基于负向教师对齐的对抗蒸馏策略优化(Anti-Distillation Policy Optimization, ADPO),该方法通过强化学习框架显式地将模型策略从仅依赖文本的反事实教师(由视觉干扰诱导)中推开,最大化原始输入与视觉扰动输入条件下策略分布的差异,从而强制模型基于视觉证据进行推理,提升其对视觉信息的利用程度,同时不损害基础理解能力。

链接: https://arxiv.org/abs/2605.09422
作者: Jiafeng Liang,Zhihao Zhu,Zihan Zhang,Baoqi Ren,Shixin Jiang,Runxuan Liu,Tao Ren,Ming Liu,See-Kiong Ng,Bing Qin
机构: Harbin Institute of Technology(哈尔滨工业大学); Pengcheng Laboratory(鹏城实验室); National University of Singapore(新加坡国立大学); Peking University(北京大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 5 figures

点击查看摘要

Abstract:Although Large Multimodal Models (LMMs) have achieved strong performance on general video understanding, their susceptibility to textual prior shortcuts during causal discovery has been recognized as a critical deficit. The underlying mechanisms of this phenomenon remain incompletely understood, as existing benchmarks only measure response accuracy without revealing the sources and extent of the deficit. We introduce ProCauEval, a perturbation-based evaluation protocol that shifts from outcome assessment to mechanism diagnosis, probing causal discovery through five controlled configurations that systematically manipulate visual and textual modalities to decompose their respective contributions to model behavior and dissect the failure modes. Evaluating 17 mainstream LMMs, we find that models faithfully perceive video content yet systematically underexploit it during causal reasoning. We further observe that stronger post-training amplifies rather than mitigates textual prior reliance, and that higher baseline performance correlates with greater fragility under perturbation. To address these, we propose Anti-Distillation Policy Optimization (ADPO), a reinforcement learning framework built on negative teacher alignment, which augments GRPO by explicitly pushing the policy away from a prior-only counterfactual teacher induced by visual corruption. Specifically, ADPO maximizes the divergence between the policy distributions conditioned on the original and visually corrupted inputs, thereby forcing the model to ground its reasoning in visual evidence rather than textual shortcuts. Extensive experiments show that ADPO improves visual engagement without sacrificing fundamental comprehension, thus offering a preliminary step toward reliable causal discovery.
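ADPO's distinctive ingredient is a term that pushes the policy away from its own prior-only behavior under visual corruption. A hedged sketch of that anti-distillation term over precomputed logits (how it is weighted against the GRPO reward is an assumption):

```python
# Sketch of the anti-distillation term: the policy conditioned on the original video
# is pushed away from the "prior-only" policy obtained by corrupting the visual
# input, by maximizing their token-level KL divergence. Weighting is illustrative.
import torch
import torch.nn.functional as F

def anti_distillation_loss(logits_clean: torch.Tensor,
                           logits_corrupted: torch.Tensor,
                           response_mask: torch.Tensor,
                           beta: float = 0.1) -> torch.Tensor:
    """Returns a loss term; minimizing it maximizes KL(p_clean || p_corrupted)."""
    log_p = F.log_softmax(logits_clean, dim=-1)
    log_q = F.log_softmax(logits_corrupted.detach(), dim=-1)   # counterfactual teacher
    p = log_p.exp()
    kl = (p * (log_p - log_q)).sum(-1)                         # (B, T)
    kl = (kl * response_mask).sum() / response_mask.sum().clamp_min(1.0)
    return -beta * kl                                          # negative: divergence should grow

B, T, V = 2, 12, 800
loss = anti_distillation_loss(torch.randn(B, T, V), torch.randn(B, T, V), torch.ones(B, T))
print(float(loss))
```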

[NLP-134] Cross-Cultural Transfer of Emoji Semantics and Sentiment in Financial Social Media ACL2026

【速读】: 该论文旨在解决金融语境下表情符号(Emoji)是否具备跨语言、跨平台及跨资产社区的可迁移情感信号特性这一问题,以及其对零样本情感迁移能力的影响。研究通过分析来自Twitter和StockTwits的多语言大规模文本数据,评估仅使用表情符号、仅使用文本及文本+表情符号三种输入方式训练的情感模型性能差异。关键发现在于:尽管表情符号在不同金融社区中的使用频率存在显著差异,尤其是跨语言场景下,但其语义和情感极性具有高度稳定性;引入表情符号能有效缩小跨资产与跨语言迁移中的性能差距,尤其在跨语言迁移中表现突出。因此,解决方案的核心在于识别并利用金融沟通中存在的部分共享“表情符号编码”,从而提升模型在多市场、多平台环境下的泛化能力。

链接: https://arxiv.org/abs/2605.09414
作者: Ahmed Mahrous,Roberto Di Pietro
机构: King Abdullah University of Science and Technology (KAUST)
类目: Computation and Language (cs.CL)
备注: Accepted to Findings of the Association for Computational Linguistics: ACL 2026

点击查看摘要

Abstract:Emojis are widely used in online financial communication, but it is unclear whether they provide transferable sentiment signals across languages, platforms, and asset communities. This study examines the extent to which emoji usage, semantics, and sentiment polarity remain stable across financial communities, and how these layers influence zero-shot sentiment transfer. Using large corpora of Twitter and StockTwits posts in four languages, we measure cross-community divergence and evaluate sentiment models trained under emoji-only, text-only, and text+emoji inputs. We find that emoji frequencies differ across communities, especially across languages, but their semantics and sentiment polarity are largely stable. Cross-asset transferability shows minimal degradation, while cross-language transfer remains the most challenging. Including emojis consistently reduces transfer gaps relative to text-only models. These results indicate that financial communication exhibits a partially shared "emoji code," and that emojis provide compact, language-independent sentiment cues that improve model generalization across markets and platforms.

[NLP-135] Let the Target Select for Itself: Data Selection via Target-Aligned Paths

【速读】: 该论文旨在解决目标数据选择(Targeted Data Selection)中因候选样本池异质性导致的参考路径偏差(reference path bias)问题,即传统方法依赖于从候选池中诱导的轨迹来估计样本效用时,若候选池分布与目标任务子集不一致,则会引入误差。其解决方案的关键在于提出一种新的参考路径——由目标验证代理(target validation proxy)上进行短时、容量受限的预热(warmup)得到的验证诱导流(validation-induced flow),并基于该路径上的归一化终点损失下降(normalized endpoint loss drop)设计了一个无需梯度或海森矩阵近似的零阶选择规则。该方法在保持性能的同时显著降低预热和存储开销,并且参考轨迹可跨不同候选池复用,提升了效率与通用性。

链接: https://arxiv.org/abs/2605.09404
作者: Huitao Yang,Hengzhi He,Guang Cheng
机构: University of California, Los Angeles (加州大学洛杉矶分校)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Targeted data selection aims to identify training samples from a large candidate pool that improve performance on a specific downstream task. Many recent methods estimate candidate utility by aggregating local attribution scores along a trajectory induced by the candidate pool. When the pool is heterogeneous, however, this reference trajectory may be misaligned with the dynamics of a target-aligned selected subset, creating what we call reference path bias. We propose an alternative reference path: a validation-induced flow obtained from a short, capacity-limited warmup on the available target validation proxy. Along this path, candidates are scored by a normalized endpoint loss drop, yielding a simple zero-order selection rule that requires no candidate gradients or Hessian approximations. Across controlled logistic, vision, and instruction-tuning experiments, this score is competitive with strong dynamic attribution baselines while substantially reducing warmup and storage cost. Moreover, since the reference trajectory is decoupled from any specific candidate pool, the same compact warmup can be reused across additional pools without recomputing the trajectory.
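Because the score needs only forward passes at the warmup start and end checkpoints, it can be written as a few lines of bookkeeping. A sketch under the assumption that the normalization is relative to the starting loss:

```python
# Sketch of the zero-order scoring rule described above (the exact normalization used
# in the paper is an assumption here): each candidate is scored by how much its loss
# drops between the start and end of a short validation-induced warmup path, using
# forward passes only -- no candidate gradients or Hessian terms.
import numpy as np

def candidate_score(loss_at_start: float, loss_at_end: float) -> float:
    """Normalized endpoint loss drop along the warmup trajectory."""
    return (loss_at_start - loss_at_end) / max(loss_at_start, 1e-8)

def select_top_k(candidates, loss_fn_start, loss_fn_end, k: int):
    """loss_fn_*: callables evaluating a candidate under the warmup start/end checkpoints."""
    scores = [candidate_score(loss_fn_start(x), loss_fn_end(x)) for x in candidates]
    order = np.argsort(scores)[::-1]
    return [candidates[i] for i in order[:k]]

# Toy illustration with synthetic per-candidate losses.
rng = np.random.default_rng(0)
pool = [f"example_{i}" for i in range(10)]
start_losses = dict(zip(pool, rng.uniform(1.0, 3.0, size=10)))
end_losses = {x: l * rng.uniform(0.5, 1.1) for x, l in start_losses.items()}
print(select_top_k(pool, start_losses.get, end_losses.get, k=3))
```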

[NLP-136] EduStory: A Unified Framework for Pedagogically-Consistent Multi-Shot STEM Instructional Video Generation

【速读】: 该论文旨在解决长时视频生成中知识一致性与教学叙事连贯性不足的问题,尤其是在STEM(科学、技术、工程和数学)领域多镜头教学视频中的应用挑战。现有方法虽在视觉质量上取得进展,但难以维持跨镜头的知识状态稳定性和符合教学逻辑的叙事结构。其解决方案的关键在于提出EduStory框架,通过三个核心模块实现:一是教学状态建模(pedagogical state modeling),用于追踪持久的知识状态;二是脚本引导的结构化控制(script-guided structured control),以组织多镜头叙事逻辑;三是面向学习目标的评估指标(learning-oriented evaluation metrics),用于量化知识保真度与约束满足程度。该方案显著提升了教学意图对齐度并减少叙事断裂,凸显了领域特定结构约束与定制化基准的重要性。

链接: https://arxiv.org/abs/2605.09378
作者: Xinyi Wu,Jayant Teotia,Shuai Zhao,Erik Cambria
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Long-horizon video generation has advanced in visual quality, yet existing methods still struggle to maintain knowledge consistency and coherent pedagogical narratives across multi-shot instructional videos, especially in STEM domains. To address these challenges, we propose EduStory, a unified framework for reliable instructional video generation. EduStory integrates pedagogical state modeling to track persistent knowledge states, script-guided structured control to organize multi-shot narratives, and learning-oriented evaluation metrics to assess knowledge fidelity and constraint satisfaction. To support rigorous evaluation, we further introduce EduVideoBench, a diagnostic benchmark with multi-granularity annotations, including pedagogical storyboards, shot-level semantics, and knowledge state transitions, together with baseline tasks for controllable instructional video generation. Extensive experiments demonstrate that domain-aware state modeling and structured control substantially reduce narrative breakdown and improve alignment with instructional intent. These results highlight the significance of domain-specific structural constraints and tailored benchmarks for advancing reliable, controllable, and also trustworthy long-horizon video generation.

[NLP-137] Position: Avoid Overstretching LLMs for every Enterprise Task

【速读】: 该论文旨在解决当前企业工作负载中大规模语言模型(Large Language Model, LLM)直接部署或蒸馏为小型模型所导致的效率低下、可靠性差及与企业任务结构不匹配的问题。其核心解决方案在于摒弃将语言模型视为单一智能引擎的传统思路,转而将其作为接口组件,将知识存储与计算逻辑外化至专门的知识库(knowledge base)和符号程序(symbolic procedures),从而构建模块化架构。这种设计不仅克服了有限容量模型在知识广度上的固有局限,还显著提升了系统的可靠性、可扩展性和可解释性,为面向确定性、结构化的企业任务提供了更可持续的技术路径。

链接: https://arxiv.org/abs/2605.09365
作者: Kuldeep Singh,Anson Bastos,Isaiah Onando Mulang’
机构: Eka Labs AI(埃卡实验室人工智能); Microsoft(微软); SAP(思爱普)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Enterprise workloads are dominated by deterministic, structured, and knowledge-dependent tasks operating under strict cost, latency, and reliability constraints. While these are often addressed through large language model (LLM) deployment or distillation into smaller models, we argue this is inefficient, unreliable, and misaligned with enterprise task structures. Instead, AI systems should treat language models as interfaces rather than monolithic engines, externalizing knowledge and computation into dedicated components for greater reliability, scalability, and transparency. Our theoretical evidence shows that finite-capacity models cannot fully capture the breadth of knowledge required for enterprise tasks, creating inherent limits to efficiency and interpretability. Building on this, we take the position that language models should primarily be used for structured extraction in deterministic enterprise workflows, while computation and storage are delegated to knowledge bases and symbolic procedures. We formally demonstrate that such modular architectures are more reliable and maintainable than monolithic frameworks, offering a sustainable foundation for enterprise tasks.

[NLP-138] Your Simulation Runs but Solves the Wrong Physics: PDE-Grounded Intent Verification for LLM-Generated Multiphysics Simulation Code

【速读】: 该论文旨在解决生成式 AI(Generative AI)在科学仿真领域中因“理解-生成鸿沟”(comprehension-generation gap)导致的代码正确性问题:即模型生成的代码虽能成功执行,但可能未正确编码用户意图的偏微分方程(PDE)结构。解决方案的关键在于提出一种基于PDE结构的验证机制——意图保真度评分(Intent Fidelity Score, IFS),通过在MOOSE框架中对弱形式残差项进行确定性重构,并与用户指定的物理约束(如边界条件、初始条件、系数和时间格式)进行结构化比对,从而量化生成代码与目标物理模型的一致性;进一步构建了一个以IFS为反馈信号的迭代精炼循环,利用违反报告自动修正错误,显著提升硬案例中的意图保真度,同时揭示了执行可运行性与意图正确性是可分离的失败模式。

链接: https://arxiv.org/abs/2605.09360
作者: Zhenghan Song,Yulong Liu,Cheng Wan,Chenjun Li,Lingfu Liu,Yunyi Li,Congcong Yuan
机构: Cornell University (康奈尔大学); Columbia University (哥伦比亚大学); Harvard University (哈佛大学); Nanyang Technological University (南洋理工大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: Preprint

点击查看摘要

Abstract:Execution-based evaluation of LLM-generated code implicitly treats successful execution as a proxy for correctness. In scientific simulation, this proxy is insufficient: a generated input file can run, mesh, and converge while encoding governing equations that differ from the user’s intent. We call this mismatch between intended physics and generated code the comprehension-generation gap. We instantiate this in MOOSE, where Kernel and BC objects map compositionally to weak-form residual terms, enabling deterministic reconstruction of the encoded PDE and comparison against an intended contract. We formalize this comparison as the Intent Fidelity Score (IFS), a structural metric covering governing terms, BCs, ICs, coefficients, and time scheme. Building on IFS, we develop a PDE-grounded refinement loop that uses deterministic violation reports to correct generated code iteratively. We evaluate on MooseBench, a 220-case multiphysics benchmark with PDE-level ground truth released with this work. On this benchmark, our method consistently improves mean IFS over direct generation, with gains concentrated on hard cases. On the subset where direct generation falls below IFS 0.7, refinement adds +0.22 to +0.41 absolute IFS. In the deployment audit, execution-only repair improves execution success while leaving 39-40% of all 220 cases runnable but still solving the wrong physics across the three main deployment-audit models, exposing executability and intent fidelity as separable failure modes. Static proof-of-concept experiments on four PDE-oriented DSLs (UFL/FEniCS, FreeFEM, FiPy, and Devito) suggest that the reconstruction-and-comparison pattern extends beyond MOOSE. These findings reinforce that executable simulation code should be verified against the mathematical structure it is intended to encode, not accepted on execution alone.
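Although the paper defines IFS over weak-form residual terms reconstructed from MOOSE objects, its shape is a weighted structural comparison between the intended contract and the reconstructed one. A toy sketch with illustrative categories and weights:

```python
# Minimal sketch of an intent-fidelity style check (the paper's IFS weighting scheme
# is richer; the categories and weights below are illustrative assumptions): the PDE
# terms reconstructed from generated MOOSE objects are compared against the contract.

def intent_fidelity(intended: dict[str, set[str]], reconstructed: dict[str, set[str]],
                    weights: dict[str, float]) -> float:
    score = total = 0.0
    for category, w in weights.items():
        want, got = intended.get(category, set()), reconstructed.get(category, set())
        union = want | got
        match = len(want & got) / len(union) if union else 1.0   # Jaccard per category
        score += w * match
        total += w
    return score / total

intended = {
    "governing_terms": {"Diffusion(u)", "TimeDerivative(u)"},
    "boundary_conditions": {"DirichletBC(left, 0)", "NeumannBC(right, 1)"},
    "time_scheme": {"implicit-euler"},
}
reconstructed = {
    "governing_terms": {"Diffusion(u)"},                  # missing the transient term
    "boundary_conditions": {"DirichletBC(left, 0)", "NeumannBC(right, 1)"},
    "time_scheme": {"implicit-euler"},
}
weights = {"governing_terms": 0.5, "boundary_conditions": 0.3, "time_scheme": 0.2}
print(f"IFS ~ {intent_fidelity(intended, reconstructed, weights):.2f}")  # 0.75
```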

[NLP-139] HOME-KGQA: A Benchmark Dataset for Multimodal Knowledge Graph Question Answering on Household Daily Activities LREC2026

【速读】: 该论文旨在解决现有知识图谱问答(Knowledge Graph Question Answering, KGQA)基准数据集在真实场景应用中的局限性问题,具体表现为:当前数据集偏向百科类知识、单一模态且缺乏细粒度时空信息,难以支持具身智能(Embodied AI)等实际应用场景。解决方案的关键在于构建一个全新的多模态知识图谱问答基准数据集——HOME-KGQA,其基于日常家庭活动的多模态知识图谱,包含复杂、多跳的自然语言问题与图数据库查询语言的配对,并引入多层级时空推理、多模态锚定和聚合函数等挑战性要素,从而更贴近现实世界任务需求。实验表明,现有基于大语言模型(Large Language Models, LLMs)的KGQA方法在此新数据集上性能显著下降,凸显了当前KGQA系统在真实部署中仍面临重大挑战。

链接: https://arxiv.org/abs/2605.09348
作者: Shusaku Egami,Aoi Ohta,Tomoki Tsujimura,Masaki Asada,Tatsuya Ishigaki,Ken Fukuda,Masahiro Hamasaki,Hiroya Takamura
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Multimedia (cs.MM)
备注: 12 pages, 4 figures, 7 tables, accepted at LREC2026

点击查看摘要

Abstract:Large Language Models (LLMs) provide flexible natural language processing capabilities, while knowledge graphs (KGs) offer explicit and structured knowledge. Integrating these two in a complementary manner enables the development of reliable and verifiable AI systems. In particular, knowledge graph question answering (KGQA) has attracted attention as a means to reduce LLM hallucinations and to leverage knowledge beyond the training data. However, existing KGQA benchmark datasets are biased toward encyclopedic knowledge, limited to a single modality, and lack fine-grained spatiotemporal data, which limits their applicability to real-world scenarios targeted by Embodied AI. We introduce HOME-KGQA, a novel KGQA benchmark dataset built on a multimodal KG of daily household activities. HOME-KGQA consists of complex, multi-hop natural language questions paired with graph database query languages. Compared to existing benchmarks, it includes more challenging questions that involve multi-level spatiotemporal reasoning, multimodal grounding, and aggregate functions. Experimental results show that the LLM-based KGQA methods fail to achieve performance comparable to that on existing datasets when evaluated on HOME-KGQA. This highlights significant challenges that should be addressed for the real-world deployment of KGQA systems. Our dataset is available at this https URL

[NLP-140] RuPLaR: Efficient Latent Compression of LLM Reasoning Chains with Rule-Based Priors From Multi-Step to One-Step

【速读】: 该论文旨在解决隐式思维链(Latent Chain-of-Thought, latent CoT)在多步骤或多模型框架中面临的结构性复杂性问题,如错误传播和协作开销,从而提升推理效率与准确性。其解决方案的关键在于提出“One-Model One-Step”压缩框架,通过规则先验引导的端到端训练机制,在单次训练阶段内使大语言模型(LLM)自主生成连续潜空间中的推理标记(latent reasoning tokens),并引入联合训练目标:利用交叉熵约束答案一致性、KL散度(Soft Thinking约束)对齐软标记与规则先验分布,以及在表示空间中加入问题-思维语义对齐约束,从而消除级联过程和跨模型依赖,实现高精度且低token消耗的推理优化。

链接: https://arxiv.org/abs/2605.09346
作者: Xiaocheng Luo,Kang Wang,Zaifu Zhan,Yuechi Zhou,Xiangyu Duan
机构: Soochow University (苏州大学); University of Minnesota (明尼苏达大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 15 figures

点击查看摘要

Abstract:The Chain-of-Thought (CoT) paradigm, while enhancing the interpretability of Large Language Models (LLMs), is constrained by the inefficiencies and expressive limits of natural language. Latent Chain-of-Thought (latent CoT) reasoning, which operates in a continuous latent space, offers a promising alternative but faces challenges from structural complexities in existing multi-step or multi-model paradigms, such as error propagation and coordination overhead. In this paper, we introduce One-Model One-Step, a novel compression framework for Latent Reasoning with Rule-Based Priors(RuPLaR) to address this challenge. Our method trains an LLM to autonomously generate latent reasoning tokens in a single training stage, guided by rule-based prior probability distributions, thereby eliminating cascaded processes and inter-model dependencies. To ensure reasoning quality, we design a joint training objective that enforces answer consistency via cross-entropy, aligns soft tokens with rule-based priors via KL divergence (the Soft Thinking constraint), and adds a problem-thought semantic alignment constraint in the representation space. Extensive experiments show that our compression framework not only improves accuracy by 11.1% over existing latent CoT methods but also achieves this with minimal token usage, underscoring its effectiveness and extensibility. Code: this https URL.

[NLP-141] Test-Time Speculation

【速读】: 该论文旨在解决生成式 AI(Generative AI)中推测解码(Speculative Decoding)在长文本生成任务中效率下降的问题。现有方法如 DFlash、EAGLE-3 和 PARD 在生成长度增加时,接受长度(acceptance length)显著降低,趋近于 1,导致推理加速效果几乎消失,其根本原因在于这些推测模型是在短序列上离线训练的,而在推理阶段需匹配更长输出,超出其训练分布。解决方案的关键是提出测试时推测(Test-Time Speculation, TTS),一种在线蒸馏机制,在不额外计算成本的前提下利用验证步骤中已有的目标模型信号,将推测模型视为学生、目标模型视为教师,通过多轮推测迭代持续优化推测模型参数,从而显著提升接受长度并保持加速优势随生成长度增长而增强。

链接: https://arxiv.org/abs/2605.09329
作者: Avinash Kumar,Sujay Sanghavi,Poulami Das
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Speculative decoding accelerates LLM inference by using a fast draft model to generate tokens and a more accurate target model to verify them. Its performance depends on the acceptance length, or number of draft tokens accepted by the target. Our studies show that the acceptance length of even state-of-the-art speculators like DFlash, EAGLE-3 and PARD degrades with generation length, reaching values close to 1 (i.e. no speedup) within just a few thousand output tokens, making speculators ineffective for long-response tasks. Acceptance lengths decline because most speculators are trained offline on short sequences, but are forced to match the target model on much longer outputs at inference, well beyond their training distribution. To address this issue, we propose Test-Time Speculation (TTS), an online distillation approach that continuously adapts the speculator at test-time. TTS leverages the key insight that the token verification step already invokes the target model for each draft token, providing the training signal needed to adapt the draft at no additional cost. Treating the draft as the student and the target as a teacher, TTS adjusts the draft over several speculation rounds, with each update improving the draft's accuracy as generation proceeds. Our results across multiple models from the Qwen-3, Qwen-3.5, and Llama3.1 families show that TTS improves acceptance lengths over state-of-the-art speculators by up to 72% and 41% on average, with the benefits scaling with increased generation lengths.
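Since verification already produces target-model logits at every draft position, those logits can double as free distillation targets for the draft. A minimal sketch of one such test-time update (the optimizer, learning rate, and the linear stand-in draft model are assumptions):

```python
# Sketch of test-time speculator adaptation: the verification step already produces
# target-model logits for every draft token, so they can serve as distillation
# targets for the draft model between speculation rounds.
import torch
import torch.nn.functional as F

def adapt_draft(optimizer, draft_logits, target_logits):
    """draft_logits: (T, V) from the draft; target_logits: (T, V) already computed
    by the target model while verifying the same positions."""
    log_p = F.log_softmax(draft_logits, dim=-1)
    q = F.softmax(target_logits.detach(), dim=-1)
    loss = (q * (q.clamp_min(1e-9).log() - log_p)).sum(-1).mean()   # KL(target || draft)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

# Toy stand-in: a linear "draft model" adapted against fixed targets over a few rounds.
V, d = 500, 64
draft_model = torch.nn.Linear(d, V)
opt = torch.optim.SGD(draft_model.parameters(), lr=0.1)
hidden = torch.randn(8, d)
target_logits = torch.randn(8, V)
for round_idx in range(3):                       # one update per speculation round
    print(adapt_draft(opt, draft_model(hidden), target_logits))
```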

[NLP-142] Mem-W: Latent Memory-Native GUI Agents

【速读】: 该论文旨在解决当前GUI代理(GUI agents)在执行网页、移动和桌面应用操作时,因记忆存储形式与策略模型实际处理的潜在嵌入序列不匹配而导致的控制效率低下问题。现有方法将记忆视为外部可读的符号化结构,需经过摘要、分类、检索和重插入等步骤后再次编码,造成信息损失与语义割裂。其解决方案的关键在于提出Mem-W,一种以潜在记忆原生(latent-memory-native)设计的GUI代理架构:它通过共享的轨迹到潜在压缩器(trajectory-to-latent compressor),将历史轨迹(作为经验记忆)与会话片段(作为工作记忆)融合为紧凑的记忆标记(memory tokens),并将其与当前GUI观测及局部上下文编织成连续的嵌入序列,使代理能够以机器原生接口统一读取成功、失败与未完成进展。该方案显著提升了长期任务中的决策连贯性与有效性,在多个Web和移动导航基准上实现最高达+30.0%的性能提升。

链接: https://arxiv.org/abs/2605.09317
作者: Guibin Zhang,Yaohui Ling,Fanci Meng,Kun Wang,Shuicheng Yan
机构: LV-NUS Lab (LV-NUS 实验室); NUS (新加坡国立大学); NTU (南洋理工大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:GUI agents are beginning to operate the web, mobile, and desktop as interactive worlds, where successful control depends on carrying forward visual, procedural, and task-level evidence beyond the fleeting present screen. Yet most agents still treat memory as an external, human-readable artifact: histories are summarized, categorized, retrieved, and reinserted as text or structured records before being encoded again by the policy. This creates a mismatch between the representational form in which experience is stored and the latent embedding sequence over which modern GUI policies actually act. We introduce Mem-W, a series of latent-memory-native GUI agents that treat memory as part of the agent’s continuous context rather than as an auxiliary symbolic scaffold. Mem-W weaves both historical trajectories (as experiential memory) and in-session segments (as working memory) into compact memory tokens through a shared trajectory-to-latent compressor. These tokens are woven with the current GUI observation and local context into one continuous embedding sequence, allowing the agent to read successes, failures, and unfinished progress through the same machine-native interface. Mem-W is trained with self-distillation and outcome-aware supervision to preserve decision-relevant state while filtering memory toward evidence that truly supports task success. Across four web and mobile navigation benchmarks, Mem-W consistently improves diverse backbones and memory-enhanced baselines, with gains of up to +30.0%, suggesting that latent-context-native memory can serve as a scalable foundation for long-horizon GUI agency.
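
下面用一个极简示意说明"轨迹到潜在压缩器 + 记忆 token 编织进连续嵌入序列"的数据流;压缩器结构(这里用可学习查询加交叉注意力代替)、维度与记忆 token 数量均为假设,仅示意 latent-memory-native 的接入方式,并非 Mem-W 的实际实现。

```python
import torch
import torch.nn as nn

class TrajectoryToLatent(nn.Module):
    """Compress a variable-length trajectory of step embeddings into m memory tokens.
    A hypothetical stand-in for the paper's compressor: learned queries + cross-attention."""
    def __init__(self, d_model=1024, num_memory_tokens=8, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_memory_tokens, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, trajectory):                      # trajectory: [B, T, d_model] step embeddings
        q = self.queries.unsqueeze(0).expand(trajectory.size(0), -1, -1)
        memory_tokens, _ = self.attn(q, trajectory, trajectory)
        return memory_tokens                            # [B, m, d_model] compact memory tokens

def weave_context(experiential_mem, working_mem, observation_embeds):
    """Weave compressed memories and the current GUI observation embeddings into one
    continuous embedding sequence that the policy reads directly."""
    return torch.cat([experiential_mem, working_mem, observation_embeds], dim=1)
```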

[NLP-143] Do Self-Evolving Agents Forget? Capability Degradation and Preservation in Lifelong LLM Agent Adaptation

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)代理在持续自演化过程中出现的“能力侵蚀”(capability erosion)问题,即在适应新任务分布时,先前习得的能力会非单调性地退化,影响多维度演化(工作流、技能、模型和记忆)的稳定性。解决方案的关键在于提出一种通用的稳定化原则——能力保全演化(Capability-Preserving Evolution, CPE),通过约束持续适应过程中的破坏性能力漂移,显著提升各演化维度中已习得能力的保留稳定性,同时不牺牲对新任务的适应性能。

链接: https://arxiv.org/abs/2605.09315
作者: Ye Yu,Xiaopeng Yuan,Haibo Jin,Heming Liu,Yaoning Yu,Haohan Wang
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in LLM agents enable systems that autonomously refine workflows, accumulate reusable skills, self-train their underlying models, and maintain persistent memory. However, we show that such self-evolution is often non-monotonic: adapting to new task distributions can progressively degrade previously acquired capabilities across all major evolution channels. We identify this phenomenon as capability erosion under self-evolution and show that it consistently emerges across workflow, skill, model, and memory evolution. To mitigate this issue, we propose Capability-Preserving Evolution (CPE), a general stabilization principle that constrains destructive capability drift during continual adaptation. Across all four evolution dimensions, CPE consistently improves retained capability stability while preserving adaptation performance. For example, in workflow evolution, CPE improves retained simple-task performance from 41.8% to 52.8% under GPT-5.1 optimization while simultaneously achieving stronger complex-task adaptation. Our findings suggest that stable long-horizon self-evolving agents require not only acquiring new capabilities, but also explicitly preserving previously learned ones during continual adaptation.

[NLP-144] LEAF-SQL: Level-wise Exploration with Adaptive Fine-graining for Text-to-SQL Skeleton Prediction

【速读】: 该论文旨在解决生成式 AI(Generative AI)在处理复杂 Text-to-SQL 任务时的局限性,尤其是当查询涉及深层嵌套逻辑或多条件组合时,现有基于提示(prompting)的大语言模型(LLM)性能显著下降的问题。其核心挑战在于传统方法依赖单一结构假设且缺乏渐进式推理能力,难以有效探索多样化的 SQL 结构可能性。解决方案的关键在于提出 LEAF-SQL 框架,将 SQL 骨架(SQL skeleton)预测重构为一种粗粒度到细粒度的树搜索过程,通过引入三层骨架层次结构、骨架生成代理(Skeleton Formulation Agent)和骨架评估代理(Skeleton Evaluation Agent),实现对多种结构假设的系统性探索与自适应精炼,从而生成结构多样且粒度适配的骨架候选,显著提升复杂查询的执行准确率。

链接: https://arxiv.org/abs/2605.09295
作者: Zhao Tan,Xiping Liu,Qing Shu,Qizhi Wan,Dexi Liu,Changxuan Wan
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Text-to-SQL translates natural language questions into executable SQL queries, enabling intuitive database access for non-experts. While large language models achieve strong performance on Text-to-SQL with prompting, they still struggle with complex queries that involve deeply nested logic or multiple clauses. A widely used approach employs SQL skeletons–intermediate representations of query logic–to streamline generation, but existing methods are limited by their reliance on a single structural hypothesis and lack of progressive reasoning. To overcome these limitations, we propose LEAF-SQL, a novel framework that reframes skeleton prediction as a coarse-to-fine tree search process. LEAF-SQL enables systematic exploration of diverse structural hypotheses with adaptive refinement. Several key techniques are employed in LEAF-SQL: (1) a three-level skeleton hierarchy to guide the search, (2) a Skeleton Formulation Agent to generate diverse candidates, and (3) a Skeleton Evaluation Agent to efficiently prune the search space. This integrated design yields skeleton candidates that are both structurally diverse and granularity-adaptive, providing a stronger foundation for the SQL generation. Extensive experiments show that LEAF-SQL consistently improves the performance of various LLM backbones. On the official hidden test set of the challenging BIRD benchmark, our method achieves 71.6% execution accuracy, which outperforms leading search-based and skeleton-based methods, affirming its effectiveness for complex queries.

[NLP-145] BetaEdit: Null-Space Constrained Sequential Model Editing

【速读】: 该论文旨在解决基于零空间(null space)的模型编辑方法在实际应用中存在知识泄露(knowledge leakage)以及在连续编辑过程中性能显著下降的问题。现有方法依赖于近似的零空间,导致原有知识被意外修改,且随着编辑轮次增加,模型的泛化能力与编辑效果均出现退化。为应对这些问题,作者提出BetaEdit框架,其核心创新在于通过精细化控制知识泄露并整合历史感知(history-aware)更新机制到零空间编辑范式中,从而在大规模连续编辑场景下实现更稳定和高效的性能表现。

链接: https://arxiv.org/abs/2605.09285
作者: Bingqing Liu,Wei Liu,Yuhua Li
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Null-space-based methods have garnered considerable attention in model editing by constraining updates to the null space of the pre-existing knowledge representation, thereby preserving the model’s original behavior. However, in practice these methods rely on an approximate null space–leading to knowledge leakage–and further suffer from severe performance degradation during sequential editing. Recent work shows that history-aware editing strategies can empirically mitigate this decline, yet the underlying reason remains unclear. In this paper, we first expose the knowledge leakage inherent in existing null-space approaches and then analyze why history-aware updates effectively preserve both editing performance and general capabilities during long-horizon editing. Building on these insights, we propose BetaEdit, a refined framework that effectively controls the knowledge leakage and integrates history-aware updates into the null-space paradigm. Extensive experiments on three large language models across two standard benchmarks show that BetaEdit consistently outperforms prior methods in the challenging regime of massive-scale sequential editing. Code is available at: this https URL.
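
下面的示意给出"零空间约束编辑"这类方法共有的核心线性代数:把权重更新投影到旧知识键激活矩阵的(近似)零空间,使更新尽量不影响旧键上的输出;阈值、矩阵形状与接口均为假设,BetaEdit 对泄露控制与历史感知更新的具体做法以论文为准。

```python
import numpy as np

def null_space_projector(K, tol=1e-6):
    """K: (n_keys, d) matrix of cached key activations for existing knowledge.
    Returns a d x d projector P onto the (approximate) null space of K,
    i.e. K @ P ≈ 0, so edits applied through P barely disturb old keys.
    The quality of this approximation is exactly where knowledge leakage can occur."""
    _, s, vt = np.linalg.svd(K, full_matrices=True)
    rank = int((s > tol * s.max()).sum()) if s.size else 0
    v_null = vt[rank:].T                      # directions that K does not span
    return v_null @ v_null.T

def constrained_update(delta_w, K):
    """Project a raw weight update (d_out x d_in) into the null space of old keys."""
    return delta_w @ null_space_projector(K)

# toy usage with a deliberately rank-deficient key matrix
K = np.random.randn(200, 30) @ np.random.randn(30, 64)   # old-knowledge key activations
delta = np.random.randn(64, 64)                          # raw edit computed by any editor
delta_ns = constrained_update(delta, K)
print(np.abs(K @ delta_ns.T).max())                      # ≈ 0: old keys are (nearly) untouched
```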

[NLP-146] A Prompt-Aware Structuring Framework for Reliable Reuse of AI-Generated Content in the Agentic Web WWW2026

【速读】: 该论文旨在解决当前生成式 AI (Generative AI) 内容(AIGC)缺乏可信溯源机制的问题,尤其是在其生成过程中无法验证可靠性、可复现性及许可证合规性,从而可能导致链式幻觉(chained hallucinations)和合规风险。解决方案的关键在于提出一个自动在 AIGC 生成时附加结构化元数据的框架,该元数据涵盖模块化提示(modularized prompts)、上下文、推理过程(thoughts)、模型信息、超参数及置信度等要素,并与可验证凭证(verifiable credentials)封装在一起,以支持对 AIGC 的可靠评估与安全复用,从而实现对 AIGC 的高效编目与在微调、知识蒸馏等场景中的安全应用。

链接: https://arxiv.org/abs/2605.09283
作者: Shusaku Egami,Masahiro Hamasaki
机构: National Institute of Advanced Industrial Science and Technology (AIST)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 5 pages, 2 figures, Accepted at FAAW@WWW2026

点击查看摘要

Abstract:The evolution of Large Language Models (LLMs) and the software agents built on them (AI agents) marks a turning point in the transition from a human-centric Web to an ``Agentic Web’’ driven by AI agents. However, for AI-Generated Content (AIGC), which is expected to dominate the Web, there is currently no mechanism for agents to verify its reliability, reproducibility, or license compliance during generation. This lack of transparency risks causing chained hallucinations and compliance violations through the reuse of AIGC. Consequently, a framework to manage the provenance and generation conditions of AIGC is essential. In this paper, we present a framework that automatically attaches structured metadata to AIGC at generation time, including modularized prompts, contexts, thoughts, model information, hyperparameters, and confidence. The metadata is enveloped together with verifiable credentials to support the reliable assessment and reuse of AIGC. This framework enables efficient curation of structured AIGC and facilitates its safe use for applications such as fine-tuning and knowledge distillation.
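
下面用一个极简示意展示"生成时自动附加结构化元数据并打包"的思路;字段名均为假设,论文中与可验证凭证(verifiable credentials)一起封装的部分这里仅用内容哈希占位。

```python
import hashlib
import json
import time

def build_aigc_envelope(content, prompt_modules, context, thoughts,
                        model_info, hyperparameters, confidence):
    """Wrap AI-generated content with provenance metadata at generation time.
    Field names are illustrative; the paper envelopes metadata together with
    verifiable credentials, approximated here by a simple content digest."""
    metadata = {
        "prompts": prompt_modules,            # modularized prompts
        "context": context,                   # injected / retrieved context
        "thoughts": thoughts,                 # intermediate reasoning, if exposed
        "model": model_info,                  # e.g. {"name": ..., "version": ...}
        "hyperparameters": hyperparameters,   # temperature, top_p, ...
        "confidence": confidence,
        "generated_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    payload = {"content": content, "metadata": metadata}
    payload["digest"] = hashlib.sha256(
        json.dumps(payload, sort_keys=True, ensure_ascii=False).encode()
    ).hexdigest()
    return payload
```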

[NLP-147] Towards Conversational Medical AI with Eyes Ears and a Voice

【速读】: 该论文旨在解决当前医疗咨询中实时交互能力不足的问题,尤其是现有生成式 AI(Generative AI)系统因仅依赖文本输入而无法充分捕捉医患交流中的多模态信息(如语音、面部表情和肢体语言),从而限制了其在真实临床场景中的有效性和安全性。解决方案的关键在于提出并实现了一个名为“AI共诊员”(AI co-clinician)的新型对话式人工智能系统,该系统基于 Gemini 的低延迟音视频处理能力,通过持续接收和分析患者与医生之间的音频-视觉流数据,支持实时临床决策;其双代理架构在保证深度临床推理的同时满足自然对话所需的低延迟要求,并通过设计标准化的 TelePACES 评估体系验证了其在管理计划、鉴别诊断等核心维度上接近初级保健医师(PCPs)的表现,显著优于仅使用文本输入的 GPT-Realtime 模型,表明高风险实时诊断 AI 应优先以人机协作模式发展,即由 AI 作为医生和患者的协同助手(triadic model)。

链接: https://arxiv.org/abs/2605.09272
作者: Meet Shah,Jason Gusdorf,Anil Palepu,Chunjong Park,Jack W. O’Sullivan,Vishnu Ravi,Tim Strother,Pavel Dubov,Aliya Rysbek,Toshiyuki Fukuzawa,Yana Lunts,Jan Freyberg,Michael B. Chang,Aniruddh Raghu,David Stutz,Devora Berlowitz,Eliseo Papa,Taylan Cemgil,JD Velasquez,Jack Chen,Arthur Chen,Doug Fritz,Charlie Taylor,Katya Tregubova,Jing Rong Lim,Richard Green,Sara Mahdavi,Mahvish Nagda,Jihyeon Lee,Craig Schiff,Liviu Panait,Sukhdeep Singh,Valentin Liévin,David G.T. Barrett,Hannah Gladman,Anna Cupani,Francesca Pietra,Uchechi Okereke,Katherine Tong,Clemens Meyer,Erwan Rolland,Mili Sanwalka,Michael D. Howell,Shixiang Shane Gu,Bibo Xu,Euan A. Ashley,S. M. Ali Eslami,Gregory Wayne,Pushmeet Kohli,Vivek Natarajan,Adam Rodman,Alan Karthikesalingam,Ryutaro Tanno
机构: Google DeepMind; Google Research; Beth Israel Deaconess Medical Center, Harvard Medical School; Stanford University
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Video examples are available on Youtube: this https URL , this https URL , and this https URL

点击查看摘要

Abstract:The practice of medicine relies not only upon skillful dialogue but also on the nuanced exchange and interpretation of rich auditory and visual cues between doctors and patients. Building on the low-latency voice and video processing capabilities of Gemini, we introduce AI co-clinician, a first-of-its-kind conversational AI system utilizing continuous streams of audio-visual data from live patient conversations to inform real-time clinical decisions. Its dual-agent architecture balances deep clinical reasoning with the low latency required for natural dialogue. To assess this system, we implemented a video-based interface emulating telemedicine consultations. We crafted 20 standardized outpatient scenarios requiring proactive real-time auditory and visual reasoning and designed “TelePACES” evaluation criteria alongside case-specific rubrics. In a randomized, interface-blinded, crossover simulation study (n = 120 encounters) with 10 internal medicine residents as patient actors, we compared AI co-clinician with primary care physicians (PCPs), GPT-Realtime, and a baseline agent. AI co-clinician approached PCPs in key TelePACES dimensions, including management plans and differential diagnosis, while significantly outperforming GPT-Realtime across all general criteria. While our agent demonstrated parity with PCPs in case-specific triage measures, physicians maintained superior overall performance in case-specific assessments. Although AI co-clinician marks a significant advance in real-time telemedical AI, gaps remain in physical examination and disease-specific reasoning. Our work shows that text-only approaches fail to capture the true challenges of medical consultation and suggests that high-stakes real-time diagnostic AI is most safely advanced in collaborative, triadic models where AI can be a supportive co-clinician for doctors and patients.

[NLP-148] DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在偏好评估中因依赖语言先验而忽视细粒度视觉验证所导致的“懒惰判断”(lazy judging)问题,以及现有基于规则的评估方法在视觉推理复杂性下难以扩展至多模态场景的瓶颈。其解决方案的关键在于提出DeltaRubric,一种将多模态偏好评估重构为“计划-执行”流程的方法:首先由模型作为“分歧规划器”(Disagreement Planner)生成针对具体实例的中立验证清单(checklist),随后切换为“清单验证器”(Checklist Verifier)对图像和问题进行逐项核查并输出 grounded 判断。该方法通过多角色强化学习联合优化规划与验证能力,显著提升了奖励模型的可靠性与泛化性,在VL-RewardBench基准上使Qwen3-VL 4B和8B模型准确率分别提升22.6和18.8个百分点。

链接: https://arxiv.org/abs/2605.09269
作者: Rui Liu,Dian Yu,Zhenwen Liang,Yucheng Shi,Tong Zheng,Runpeng Dai,Haitao Mi,Pratap Tokekar,Leoweiliang
机构: Tencent Hunyuan(腾讯混元); University of Maryland, College Park (马里兰大学学院公园分校); University of North Carolina, Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Aligning Multimodal Large Language Models (MLLMs) requires reliable reward models, yet existing single-step evaluators can suffer from lazy judging, exploiting language priors over fine-grained visual verification. While rubric-based evaluation mitigates these biases in text-only settings, extending it to multimodal tasks is bottlenecked by the complexity of visual reasoning. The critical differences between responses often depend on instance-specific visual details. Robust evaluation requires dynamically synthesizing rubrics that isolate spatial and factual discrepancies. To address this, we introduce DeltaRubric, an approach that reformulates multimodal preference evaluation as a plan-and-execute process within a single MLLM. DeltaRubric operates in two steps: acting first as a Disagreement Planner, the model generates a neutral, instance-specific verification checklist. Transitioning into a Checklist Verifier, it executes these self-generated checks against the image and question to produce the final grounded judgment. We formulate DeltaRubric as a multi-role reinforcement learning problem, jointly optimizing planning and verification capabilities. Validated on Qwen3-VL 4B and 8B Instruct models, DeltaRubric achieves solid empirical gains. For instance, on VL-RewardBench, it improves base model overall accuracy by +22.6 (4B) and +18.8 (8B) points, largely outperforming standard no-rubric baselines. The results demonstrate that decomposing evaluation into structured, verifiable steps leads to more reliable and generalizable multimodal reward modeling.
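
下面用一个极简示意展示"先当 Disagreement Planner 生成核查清单、再当 Checklist Verifier 按清单逐项核查"的两段式调用流程;其中 llm 为占位的调用接口,提示词为意译示意,均非论文原文或官方实现。

```python
def judge_with_delta_rubric(llm, image, question, resp_a, resp_b):
    """Two-pass plan-and-execute evaluation with a single MLLM.
    `llm` is a hypothetical callable (prompt, image) -> text; prompts are paraphrased."""
    # Pass 1: plan an instance-specific, neutral verification checklist.
    checklist = llm(
        f"Question: {question}\nResponse A: {resp_a}\nResponse B: {resp_b}\n"
        "As a Disagreement Planner, list neutral, instance-specific visual checks "
        "that would isolate the factual and spatial differences between A and B.",
        image)
    # Pass 2: execute the self-generated checks against the image and decide.
    verdict = llm(
        f"Question: {question}\nChecklist:\n{checklist}\n"
        f"Response A: {resp_a}\nResponse B: {resp_b}\n"
        "As a Checklist Verifier, run each check against the image and state "
        "which response is better, citing the checks that decide it.",
        image)
    return checklist, verdict
```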

[NLP-149] Beyond Continuity: Challenges of Context Switching in Multi-Turn Dialogue with LLMs ICLR2026

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在多轮对话中对用户意图转变(pivot)识别不足及上下文管理不当的问题,即模型常因未能检测到话题切换而延续无关历史信息,导致响应不准确。其解决方案的关键在于构建基于真实世界数据集的合成基准测试(synthetic benchmarks),以模拟不同难度级别的上下文转换,并系统评估十种LLMs(包括开源、闭源及推理增强型模型)在零样本场景下的表现,从而揭示当前模型在话题检测与相关上下文筛选任务中的局限性,尤其是开放权重模型对显式提示仍存在滞后反应,以及普遍存在的位置偏差(position bias)。这一方法为提升LLMs长期多轮交互能力提供了实证依据和改进方向。

链接: https://arxiv.org/abs/2605.09268
作者: Aditya Sinha,Harald Steck,Vito Ostuni,Matteo Rinaldi
机构: Netflix Inc. (Netflix)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to the ICBINB Workshop @ ICLR 2026

点击查看摘要

Abstract:Users interacting with Large Language Models (LLMs) in a multi-turn conversation routinely refine their requests or pivot to new topics. LLMs, however, often miss these topic shifts and carry over irrelevant context from previous turns, leading to inaccurate responses. In this paper, we stress-test the multi-turn understanding of LLMs and study the following two sub-tasks: (1) detecting whether the user pivots or refines in the current turn, and (2) shortlisting relevant context from previous turns. To this end, we construct synthetic benchmarks based on real-world datasets from varied domains, as to simulate context shifts of different levels of difficulty. We then evaluate the zero-shot performance of ten LLMs (open-weight, closed-source and reasoning), and demonstrate that only some reasoning and strongly instructed LLMs are accurate in detecting pivots; open-weight LLMs struggle with the task and frequently carry stale context even with explicit cues; and all models suffer from a position bias. Based on the results, we discuss key takeaways for improving long-term robustness in multi-turn capabilities for LLMs.

[NLP-150] Reinforcing Multimodal Reasoning Against Visual Degradation

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在强化学习(Reinforcement Learning, RL)微调过程中对真实世界视觉退化(如模糊、压缩伪影和低分辨率扫描)缺乏鲁棒性的问题。现有方法依赖静态数据增强或基于价值的正则化,难以直接应用于无评论器(critic-free)的自回归MLLM微调场景,且在训练中直接注入退化视图会导致奖励污染(reward poisoning),引发幻觉轨迹并破坏优化稳定性。其解决方案的关键在于提出ROMA框架,通过双前向传递策略利用教师强制(teacher forcing)评估退化视图与干净图像轨迹的一致性,避免在退化输入上进行新采样;同时引入token级代理KL惩罚以保证分布一致性,辅以锚定于干净图像优势的辅助策略梯度损失防止策略坍缩,并采用正确性条件正则化限制仅在成功轨迹上施加不变性约束,从而在不牺牲干净输入性能的前提下显著提升模型对可见与未见退化的鲁棒性。

链接: https://arxiv.org/abs/2605.09262
作者: Rui Liu,Dian Yu,Haolin Liu,Yucheng Shi,Tong Zheng,Runpeng Dai,Haitao Mi,Pratap Tokekar,Leoweiliang
机构: Tencent Hunyuan; University of Maryland, College Park; University of Virginia; University of North Carolina, Chapel Hill
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement Learning has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), yet the resulting policies remain brittle against real-world visual degradations such as blur, compression artifacts, and low-resolution scans. Prior robustness techniques from vision and deep RL rely on static data augmentation or value-based regularization, neither of which transfers cleanly to critic-free RL fine-tuning of autoregressive MLLMs. Reinforcing reasoning against such corruptions is non-trivial: naively injecting degraded views during rollout induces reward poisoning, where perceptual occlusions trigger hallucinated trajectories and destabilize optimization. We propose ROMA, an RL fine-tuning framework that modifies the optimization dynamics to reinforce reasoning against visual degradation while preserving clean-input performance. A dual-forward-pass strategy uses teacher forcing to evaluate corrupted views against clean-image trajectories, avoiding new rollouts on degraded inputs. For distributional consistency, we apply a token-level surrogate KL penalty against the worst-case augmentation; to prevent policy collapse under regularization, an auxiliary policy gradient loss anchored to clean-image advantages preserves a reliable reward signal; and to avoid systematically incorrect invariance, correctness-conditioned regularization restricts enforcement to successful trajectories. On Qwen3-VL 4B/8B across seven multimodal reasoning benchmarks, our method improves robustness by +2.4% on seen and +2.3% on unseen corruptions over GRPO while matching clean accuracy.

[NLP-151] Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation

【速读】: 该论文旨在解决在线策略蒸馏(On-Policy Distillation, OPD)中token级优化效率低下的问题,特别是针对那些在训练收敛后仍持续表现出高损失的“Rock Tokens”现象。现有理论预期高损失token应随训练逐渐减少,但实证发现这些token占比可达18%,且其梯度贡献大却对模型推理性能无实质影响。解决方案的关键在于识别并战略性绕过这些无功能贡献的“stumbling blocks”——即不再对Rock Tokens施加均匀的权重或优化压力,从而显著提升对齐过程的效率,挑战了传统统一token加权范式,提出更高效的大型模型蒸馏新路径。

链接: https://arxiv.org/abs/2605.09253
作者: Yuxuan Jiang,Runchao Li,Shubhashis Roy Dipta,Dawei Li,Zhao Yang
机构: University of Maryland, Baltimore County (马里兰大学巴尔的摩县分校); Case Western Reserve University (凯斯西储大学); Arizona State University (亚利桑那州立大学); VU Amsterdam (阿姆斯特丹自由大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While recent work in Reinforcement Learning with Verifiable Rewards (RLVR) has shown that a small subset of critical tokens disproportionately drives reasoning gains, an analogous token-level understanding of On-Policy Distillation (OPD) remains largely unexplored. In this work, we investigate high-loss tokens, a token type that–as the most direct signal of student-teacher mismatch under OPD’s per-token KL objective–should progressively diminish as training converges according to existing studies; however, our empirical analysis shows otherwise. Even after OPD training reaches apparent saturation, a substantial subset of tokens continues to exhibit persistently high loss; these tokens, which we term Rock Tokens, can account for up to 18% of the tokens in generated outputs. Our investigation reveals two startling paradoxes. First, despite their high occurrence frequency providing a disproportionately large share of total gradient norms, Rock Tokens themselves remain stagnant throughout training, resisting teacher-driven corrections. Second, through causal intervention, we find that these tokens provide negligible functional contribution to the model’s actual reasoning performance. These findings suggest that a vast amount of optimization bandwidth is spent on structural and discourse residuals that the student model cannot or need not internalize. By deconstructing these dynamics, we demonstrate that strategically bypassing these ``stumbling blocks’’ can significantly streamline the alignment process, challenging the necessity of uniform token weighting and offering a more efficient paradigm for large-scale model distillation.

[NLP-152] LLM Agents Already Know When to Call Tools – Even Without Reasoning

【速读】: 该论文旨在解决工具增强型大语言模型(Tool-augmented LLM)在推理过程中存在过度调用工具的问题,即模型即使能直接回答问题也倾向于无差别调用外部工具,导致API费用和延迟浪费。现有基准缺乏对“何时真正需要工具调用”的系统性评估,且现有训练-free基线方法(如仅靠提示词优化或先推理再执行策略)无法有效区分必要与非必要调用,甚至在困难任务上显著降低准确率。解决方案的关键在于发现:模型的隐藏状态中已编码了关于工具必要性的线性可解信号(AUROC 0.89–0.96),但模型未能将其用于决策;基于此,作者提出ProbePrefill方法,通过轻量级线性探测器读取该信号,并在生成前插入引导语句(steering sentence)来抑制不必要的工具调用,从而实现平均减少48%工具调用、仅损失1.7%准确率的显著改进。

链接: https://arxiv.org/abs/2605.09252
作者: Chung-En Sun,Linbo Liu,Ge Yan,Zimo Wang,Tsui-Wei Weng
机构: University of California, San Diego (加州大学圣地亚哥分校); Amazon AWS (亚马逊AWS)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Tool-augmented LLM agents tend to call tools indiscriminately, even when the model can answer directly. Each unnecessary call wastes API fees and latency, yet no existing benchmark systematically studies when a tool call is actually needed. We propose When2Tool, a benchmark of 18 environments (15 single-hop, 3 multi-hop) spanning three categories of tool necessity – computational scale, knowledge boundaries, and execution reliability – each with controlled difficulty levels that create a clear decision boundary between tool-necessary and tool-unnecessary tasks. We evaluate two families of training-free baselines: Prompt-only (varying the prompt to discourage unnecessary calls) and Reason-then-Act (requiring the model to reason about tool necessity before acting). Both provide limited control: Prompt-only suppresses necessary calls alongside unnecessary ones, and Reason-then-Act still incurs a disproportionate accuracy cost on hard tasks. To understand why these baselines fail, we probe the models’ hidden states and find that tool necessity is linearly decodable from the pre-generation representation with AUROC 0.89–0.96 across six models, substantially exceeding the model’s own verbalized reasoning. This reveals that models already know when tools are needed, but fail to act on this knowledge during generation. Building on this finding, we propose ProbePrefill, which uses a lightweight linear probe to read the hidden-state signal and prefills the model’s response with a steering sentence. Across all models tested, ProbePrefill reduces tool calls by 48% with only 1.7% accuracy loss, while the best baseline at comparable accuracy only reduces 6% of tool calls, or achieves a similar tool call reduction but incurs a 5× higher accuracy loss. Our code is available at this https URL
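
下面的示意包含 ProbePrefill 的两个组成部分:在生成前的隐藏状态上训练轻量线性探测器读取"是否需要工具",并据此在回复前预填一句引导语;特征来源、阈值与引导语措辞均为假设,并非论文的官方实现。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# 1) Fit a lightweight probe offline on pre-generation hidden states.
#    X: [N, d] hidden state at the final prompt token; y: 1 if a tool is truly needed.
def fit_tool_probe(X, y):
    return LogisticRegression(max_iter=1000).fit(X, y)

# 2) At inference, read the probe and prefill a steering sentence (hypothetical wording).
NO_TOOL_PREFIX = "I can answer this directly without calling any tool. "

def decide_prefill(probe, hidden_state, threshold=0.5):
    p_tool = probe.predict_proba(hidden_state.reshape(1, -1))[0, 1]
    return "" if p_tool >= threshold else NO_TOOL_PREFIX

# toy usage with random features standing in for real hidden states
X = np.random.randn(256, 4096)
y = (np.random.rand(256) > 0.5).astype(int)
probe = fit_tool_probe(X, y)
prefix = decide_prefill(probe, np.random.randn(4096))
```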

[NLP-153] Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs

【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在处理重复标记计数任务时出现的系统性失败问题,尽管这些模型在更广泛的推理基准上表现优异。研究表明,这种失败并非源于内部计数跟踪能力的局限,而是由一个特定格式触发的多层感知机(MLP)模块在约88–93%网络深度处错误地覆盖了正确编码的计数信息,输出固定错误答案所致。关键发现是:模型在残差流中始终能以近乎完美的精度解码正确计数,说明表示层面并无缺陷;真正的故障点在于路由机制——即特定输入格式(如空格分隔的词列表)激活了一个错误的MLP路径,导致计数结果被强制覆盖。这一机制在不同模型架构(Llama-3.2与Qwen2.5)和规模下均一致存在,且可通过分隔符类型(逗号 vs 空格)调节其活跃程度,揭示了计数失败本质为路由错误而非表征错误,需针对性干预路由逻辑而非改进表示能力。

链接: https://arxiv.org/abs/2605.09239
作者: Sohan Venkatesh
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Code is available at this https URL

点击查看摘要

Abstract:Large language models fail at counting repeated tokens despite strong performance on broader reasoning benchmarks. These failures are commonly attributed to limitations in internal count tracking. We show this attribution is wrong. Linear probes on the residual stream decode the correct count with near-perfect accuracy at every post-embedding layer, across all model depths. This holds even at the exact layers where the wrong answer crystallizes while the model simultaneously outputs an incorrect count. Attention patterns show no evidence of collapse over repeated tokens and tokenization artifacts account for none of the failure. Instead, a format-triggered multi-layer perceptron (MLP) block overwrites the correctly-encoded count with a fixed wrong answer at roughly 88–93% network depth. This prior fires for repeated word-tokens in space-separated list format and is absent for repeated digit-tokens. It is suppressed by comma-separated delimiters in larger models but persists in smaller ones. The finding holds across Llama-3.2 (1B and 3B) and Qwen2.5 (1.5B, 3B and 7B) at consistent relative depth. Counting failure is a failure of routing, not of representation, and the two require different interventions.

[NLP-154] Two Ways to De-Bias an LLM-as-a-Judge: A Continuous-Score Comparison of Hierarchical Bayesian Calibration and Neural-ODE Score Transport

【速读】: 该论文旨在解决使用大型语言模型(Large Language Model, LLM)作为自动评分工具时存在的偏差问题,例如评分尺度压缩、评分者间差异(如宽松或严格倾向)以及对冗长回答的过度奖励。为缓解此类偏差,研究提出通过后验校准(post-hoc calibration)方法,基于少量成对锚点样本(paired anchors)拟合从LLM原始得分到人类评分估计的映射函数。其关键解决方案在于对比两种不同建模策略:一是参数化的小样本分层贝叶斯线性校正器(含每项评分不确定性),二是非参数化的神经微分方程(Neural-ODE,即FFJORD)分数传输流(score-transport flow)。实验表明,两者的性能差异主要取决于可用标注数据量——在100个锚点时线性校正器更优(KL散度更低),而在1500个锚点时流模型全面胜出,且随着标签数量增加持续改进,揭示了生产部署中应依据数据预算选择合适校准方法的决策准则。

链接: https://arxiv.org/abs/2605.09227
作者: Andrea Morandi
机构: Cisco(思科)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:[Abridged] Using a Large Language Model (LLM) as an automatic rater (LLM-as-a-judge) is cheap but potentially biased: some judges run lenient, others strict, the middle of the scale gets compressed, and verbose answers may be over-rewarded. A common remedy is post-hoc calibration: leave the cheap judge in place and, on a modest set of paired anchors, fit a transformation from raw judge scores to an estimate of the human rating. We compare two correctors that take opposing views on how this mapping should be modeled: a parametric, small-anchor hierarchical Bayesian linear correction with per-score uncertainty, and a non-parametric Neural-ODE (FFJORD) score-transport flow. Both are run head-to-head on UltraFeedback fine-grained_score (1700 paired examples, 200 held out), with calibration split into three operational sub-questions: population-mean recovery, per-item accuracy, and distributional-shape match. The headline result is that the choice between methods is primarily a data-budget question. Both correctors close the raw +0.71-point mean offset to within ±0.08 of the GPT-4 reference, at 100 and at 1500 anchors. Past that, the methods swap roles. With 100 anchors, the linear corrector reconstructs the human-score distribution roughly twice as well by KL divergence (0.031 vs. 0.058) and ties the flow on MAE. With 1500 anchors the flow wins on every metric (MAE 0.320 vs. 0.359, Pearson 0.922 vs. 0.896, KL 0.026 vs. 0.037). The Bayesian linear corrector saturates well below 1500 anchors: residual tanh-shaped non-linearity is, by construction, structure a linear correction cannot fit. The flow keeps improving as labels grow. We translate these findings into an explicit decision rule for production deployments.
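
作为参照,下面给出"用少量锚点拟合线性校正器"一侧的最简版本(普通最小二乘,不含层级贝叶斯结构与逐分数不确定性,也不含 Neural-ODE 流);数据与分数尺度范围均为假设。

```python
import numpy as np

def fit_linear_corrector(judge_scores, human_scores):
    """Fit human ≈ a * judge + b on a small set of paired anchors (least squares)."""
    A = np.stack([judge_scores, np.ones_like(judge_scores)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, human_scores, rcond=None)
    return a, b

def calibrate(raw, a, b, lo=1.0, hi=5.0):
    """Map raw judge scores onto the human scale, clipped to the assumed score range."""
    return np.clip(a * raw + b, lo, hi)

# toy anchors: a lenient judge that also compresses the middle of the scale
rng = np.random.default_rng(0)
human = rng.uniform(1, 5, 100)
judge = 0.6 * human + 2.0 + rng.normal(0, 0.2, 100)   # biased, compressed judge scores
a, b = fit_linear_corrector(judge, human)
print(calibrate(judge[:5], a, b), human[:5])
```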

[NLP-155] Emergent Semantic Role Understanding in Language Models

【速读】: 该论文试图解决的问题是:生成式语言模型在预训练阶段是否能够自发地习得语义角色(Semantic Role,即“谁对谁做了什么”)理解能力,还是必须依赖任务特定的微调才能实现这一能力。解决方案的关键在于冻结解码器-only Transformer 模型的参数,仅通过训练线性探测器(linear probe)来提取语义角色信息,并以探测器性能作为指标,判断语义角色信息是否已编码于预训练表示中。研究发现,即使不进行微调,冻结模型的表征中仍包含显著的语义角色信息,表明语义角色结构部分源于语言建模目标,但其内部表征方式随模型规模增大逐渐趋向更分布式的实现形式,说明该能力并非完全由预训练独立完成,而是需要后续适应过程进一步优化。

链接: https://arxiv.org/abs/2605.09187
作者: Carla Griffiths,Mirco Musolesi
机构: University College London (伦敦大学学院); University of Bologna (博洛尼亚大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Understanding how linguistic structure emerges in language models is central to interpreting what these systems learn from data and how much supervision they truly require. In particular, semantic role understanding (“who did what to whom”) is a core component of meaning representation, yet it remains unclear whether it arises from pre-training alone or depends on task-specific fine-tuning. We study whether semantic role understanding emerges during language model pre-training or requires task-specific fine-tuning. We freeze decoder-only transformers and train linear probes to extract semantic roles, using performance to infer whether role information is already encoded in pre-training or learned during adaptation. Across model scales, we find that frozen representations contain substantial semantic role information, with performance improving but not fully matching fine-tuned models. This indicates partial but incomplete emergence from pre-training alone. We show that semantic role structure emerges from language modeling objectives, but its internal implementation shifts toward more distributed representations as model scale increases.

[NLP-156] Agentic MIP Research: Accelerated Constraint Handler Generation

【速读】: 该论文旨在解决混合整数规划(Mixed-Integer Programming, MIP)研究中算法假设验证周期长、实现复杂的问题,具体表现为在分支切割(branch-and-cut)求解器中测试一个算法假设需要大量编码、调试、调参和大规模基准测试。其解决方案的关键在于提出一个基于大语言模型(Large Language Model, LLM)代理的MIP研究框架,该框架将LLM代理嵌入到面向求解器(solver-aware)的环境中,用于自动生成、验证与评估针对开源求解器SCIP的插件。该框架特别聚焦于传播方法(propagation methods)——通过利用全局约束(global constraints)加速MIP求解,并成功实现了从约束编程中语义提升(semantic lifting)MIP公式为全局约束,以及自动构建仅含传播功能的SCIP约束处理器(constraint handler)。实验表明,该框架能在MIPLIB 2017基准集上恢复全局约束结构并生成可执行的约束检测器与传播处理器,同时支持在沙盒环境中进行上下文学习(in-context learning),使代理能够自主调试、调参并探索新型传播策略,从而系统性地区分有意义的算法改进与低价值候选方案,最终显著提升了求解性能。

链接: https://arxiv.org/abs/2605.09186
作者: Liding Xu,Yugeng Zhou,Sebastian Pokutta
机构: Zuse Institute Berlin (Zuse研究所); Technische Universität Berlin (柏林工业大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Mixed-integer programming (MIP) research is both mathematically sophisticated and engineering-intensive: testing an algorithmic hypothesis within a branch-and-cut solver requires substantial implementation, debugging, tuning, and large-scale benchmarking. We propose an agentic MIP research framework that shortens this feedback loop by embedding LLM agents into a solver-aware harness for generating, verifying, and evaluating plugins for the open-source solver SCIP. Propagation methods play a central role in accelerating MIP solving by exploiting global constraints. We instantiate our framework on the semantic lifting of MIP formulations into global constraints and the automatic construction of propagation-only SCIP constraint handlers. On the MIPLIB 2017 benchmark set, the framework successfully recovers global constraint structures from constraint programming and generates executable constraint detectors and propagation-only constraint handlers. Furthermore, the framework naturally extends to in-context learning within a sandboxed environment, enabling agents not only to tune and debug generated constraint handlers on real instances, but also to explore global constraint patterns in MIP problems and discover novel propagation strategies not yet implemented in SCIP. This framework allows us to systematically distinguish meaningful algorithmic improvements from low-value or overly costly candidates: the novel propagation methods successfully solved five additional instances within the explored benchmark. Overall, this framework demonstrates that LLM agents can autonomously navigate the complex MIP research loop, paving the way for a more automated solver development process.

[NLP-157] Open Ontologies: Tool-Augmented Ontology Engineering with Stable Matching Alignment

【速读】: 该论文旨在解决知识表示与推理中的本体对齐(ontology alignment)和本体交互效率问题,尤其是在结合大语言模型(Large Language Models, LLMs)与形式化本体逻辑(OWL reasoning)时如何提升准确性和实用性。其核心解决方案是提出 Open Ontologies 系统,该系统基于 Rust 实现,融合了 LLM 驱动的本体构建、基于 Model Context Protocol(MCP)的工具增强型交互以及正式的 OWL 推理机制。关键发现在于:稳定的一对一匹配(stable 1-to-1 matching)是决定本体对齐质量的核心因素——在 OAEI Anatomy 跑道上实现 F1 = 0.832,且信号权重配置对其影响可忽略(F1 变化 < 0.004),而移除稳定匹配则显著下降至 F1 = 0.728;此外,LLM 通过结构化工具访问本体(如 MCP)比直接读取原始 OWL 文件表现更优(F1 = 0.717 vs. 0.323),表明工具结构提供了 LLM 无法仅靠语法解析复制的语义理解能力。

链接: https://arxiv.org/abs/2605.09184
作者: Fabio Rovai
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)
备注: 10 pages, 6 tables. Code: this https URL

点击查看摘要

Abstract:We present Open Ontologies, an open-source ontology engineering system implemented in Rust that integrates LLM-driven construction with formal OWL reasoning and ontology alignment via the Model Context Protocol. Our primary finding is that stable 1-to-1 matching is the dominant factor in ontology alignment quality: on the OAEI Anatomy track, it achieves F1 = 0.832 (P = 0.963, R = 0.733), competitive with state-of-the-art systems and exceeding all in precision. Ablation across five weight configurations shows that signal weights are irrelevant when stable matching is applied (F1 varies by less than 0.004), while removing stable matching drops F1 to 0.728. On the Conference track, the same method achieves F1 = 0.438. On tool-augmented ontology interaction, we find a surprising result: an LLM reading a raw OWL file (F1 = 0.323) performs worse than the same LLM with no file at all (F1 = 0.431), while structured MCP tool access achieves F1 = 0.717. This demonstrates that tool structure provides a qualitatively different mode of access that the LLM cannot replicate by reading raw syntax. The system ships as a single binary under the MIT licence.
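
摘要强调稳定一对一匹配是对齐质量的主导因素。下面用经典的 Gale–Shapley 过程在相似度矩阵上做稳定 1-to-1 匹配的最简示意;相似度如何由多信号组合、阈值取多少均为假设,具体以 Open Ontologies 的 Rust 实现为准。

```python
import numpy as np

def stable_one_to_one_matching(sim):
    """Gale-Shapley on a similarity matrix sim[i, j] between source class i and target class j.
    Sources 'propose' in decreasing similarity; each target keeps its best proposer so far."""
    n_src, n_tgt = sim.shape
    prefs = np.argsort(-sim, axis=1)          # each source's targets, best first
    next_choice = [0] * n_src
    match_of_tgt = {}                          # target j -> source currently held
    free = list(range(n_src))
    while free:
        i = free.pop()
        while next_choice[i] < n_tgt:
            j = prefs[i][next_choice[i]]
            next_choice[i] += 1
            if j not in match_of_tgt:
                match_of_tgt[j] = i
                break
            k = match_of_tgt[j]
            if sim[i, j] > sim[k, j]:          # target j prefers the new proposer
                match_of_tgt[j] = i
                free.append(k)
                break
    return [(i, j) for j, i in match_of_tgt.items()]

# toy usage: keep only matches above a similarity threshold (hypothetical cutoff)
sim = np.random.rand(5, 6)
alignment = [(i, j) for i, j in stable_one_to_one_matching(sim) if sim[i, j] > 0.5]
```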

[NLP-158] WorldSpeech: A Multilingual Speech Corpus from Around the World

【速读】: 该论文旨在解决自动语音识别(ASR)在低资源语言中性能显著下降的问题,其根本原因在于缺乏充足的成对音频-文本数据。解决方案的关键在于构建一个高质量、多语言的语音语料库——WorldSpeech,该语料库包含76种语言的65,000小时对齐音频-文本数据,采自议会会议、国际广播和公共领域有声读物等多样化公开来源。其中37种语言拥有超过200小时的数据,28种超过500小时,24种超过1,000小时;在11种语言上微调现有ASR模型后,平均词错误率(Word Error Rate, WER)相对降低达63.5%,验证了该语料库在提升低资源语言ASR性能方面的有效性。

链接: https://arxiv.org/abs/2605.09167
作者: Antonis Asonitis,Luca A. Lanzendörfer,Frédéric Berdoz,Roger Wattenhofer
机构: ETH Zurich (苏黎世联邦理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Automatic speech recognition (ASR) performs well for high-resource languages with abundant paired audio-transcript data, but its accuracy degrades sharply for most languages due to limited publicly available aligned data. To this end, we introduce WorldSpeech, a 24 kHz multilingual speech corpus comprising 65k hours of aligned audio-transcript data across 76 languages, collected from diverse public sources including parliamentary proceedings, international broadcasts, and public-domain audiobooks. For 37 languages, WorldSpeech provides more than 200 hours of aligned speech, with 28 exceeding 500 hours and 24 surpassing 1k hours. Fine-tuning existing ASR models on WorldSpeech results in an average relative Word-Error-Rate reduction of 63.5% across 11 typologically diverse languages.

[NLP-159] Sparse Layers are Critical to Scaling Looped Language Models

【速读】: 该论文旨在解决传统Transformer模型在深度扩展时面临的内存消耗高和计算效率低的问题,尤其是在模型规模扩大时难以维持良好的性能-资源权衡。其核心解决方案是引入循环式混合专家(Looped-MoE)架构,通过在共享的Transformer层中引入专家路由机制,在每次循环迭代中激活不同的专家模块,从而在不增加参数量的前提下提升模型表达能力;同时利用循环边界作为早期退出点(early exits),实现更优的计算质量权衡。关键创新在于:1)Looped-MoE模型通过跨循环的路由多样性恢复了表达力,解决了纯密集型循环模型无法有效扩展的问题;2)循环结构天然提供高质量的早期输出点,显著降低推理延迟与内存占用,且对最终性能影响最小。

链接: https://arxiv.org/abs/2605.09165
作者: Ryan Lee,Jacob Biloki,Edward J. Hu,Jonathan May
机构: USC Information Sciences Institute (南加州大学信息科学研究所); Netflix (奈飞)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Looped language models repeat a set of transformer layers through depth, reducing memory costs and providing natural early-exit points at loop boundaries. However, looped models do not scale as favorably as standard transformers with unique layers. We compare standard and Mixture-of-Experts (MoE) transformers, with and without looping, and find two main results. First, we find Looped-MoE models scale better than the standard baseline while dense looped models do not. We trace this to routing divergence between loops: in Looped-MoE models, different experts are activated on each pass through the same shared layers, recovering expressivity without additional parameters. Our second finding is that looped models have better compute-quality trade-offs with early exits than standard models. Because each loop ends with the same layers that produce the final output, loop boundaries are superior exit points, as confirmed by earlier output convergence at these points. In sum, we provide a clear direction for scaling looped models: a Looped-MoE model with early exits can not only beat standard transformers at scale, but also enable significant memory and inference savings with minimal degradation in quality.
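
下面用一个简化模块示意"同一组共享层循环多次、每次循环由路由器逐 token 选择专家、循环边界作为提前退出点"的结构;维度、专家数、词表大小与退出判据均为假设,仅说明 Looped-MoE 的数据流,并非论文的实际模型配置。

```python
import torch
import torch.nn as nn

class LoopedMoEBlock(nn.Module):
    """One shared transformer-style block reused across loops; the MoE router may pick
    different experts on each pass, and every loop boundary is a potential exit point."""
    def __init__(self, d=256, n_experts=4, n_loops=4, vocab=32000):
        super().__init__()
        self.n_loops = n_loops
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            for _ in range(n_experts)
        ])
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.head = nn.Linear(d, vocab)                     # shared output head

    def forward(self, x, exit_threshold=None):
        exits = []
        for _ in range(self.n_loops):
            h = self.norm1(x)
            x = x + self.attn(h, h, h)[0]
            h = self.norm2(x)
            top1 = self.router(h).argmax(-1)                # per-token top-1 expert choice
            moe_out = torch.zeros_like(h)
            for e, expert in enumerate(self.experts):
                mask = top1 == e
                if mask.any():
                    moe_out[mask] = expert(h[mask])
            x = x + moe_out
            logits = self.head(x)                           # loop boundary = early-exit point
            exits.append(logits)
            if exit_threshold is not None:
                conf = logits.softmax(-1).amax(-1).mean()   # toy confidence-based exit rule
                if conf > exit_threshold:
                    break
        return exits
```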

[NLP-160] Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan ACL2026

【速读】: 该论文旨在解决拉丁语向罗曼语族演变过程中语法性别系统从三分法(阳性、阴性、中性)向二分法(阳性、阴性)重构的机制问题,重点在于从词汇层面和语境层面量化性别信息的分布。其解决方案的关键在于提出一个可解释的深度学习框架:首先设计了一种针对低资源历史语言场景优化的分词器,显著优于传统策略;其次通过分析形态特征对性别预测的贡献以及不同词类在句法语境中对性别判断的影响,揭示了性别信息在词干(lemma)与句子上下文之间的分配模式。

链接: https://arxiv.org/abs/2605.09156
作者: Ahan Chatterjee,Matthias Schöffel,Matthias Aßenmacher,Esteban Garces Arias
机构: Bavarian Academy of Sciences (BAdW), Munich; LMU Munich; Munich Center for Machine Learning (MCML)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at NLP4DH @ ACL 2026

点击查看摘要

Abstract:The diachronic evolution from Latin to the Romance languages involved a restructuring of the grammatical gender system from a tripartite configuration (masculine, feminine, neuter) to a bipartite one (masculine, feminine). In this work, we introduce an interpretable deep learning framework to investigate this phenomenon at both lexical and contextual levels. First, we show that conventional tokenization strategies are insufficiently robust for this low-resource historical setting, and that our proposed tokenizer improves performance over these baselines. At the lexical level, we evaluate the contribution of morphological features to gender prediction. At the contextual level, we quantify the contributions of different part-of-speech categories to grammatical gender prediction. Together, these analyses characterize the distribution of gender information between the lemma and its sentential context. We make our codebase, datasets, and results publicly available.

[NLP-161] Meow-Omni 1: A Multimodal Large Language Model for Feline Ethology

【速读】: 该论文旨在解决计算动物行为学(computational ethology)中动物意图识别的难题,核心挑战在于语义混淆(semantic aliasing),即相同的外部信号(如猫的呼噜声)可能对应完全不同的内在生理状态,而现有多模态大语言模型(Multimodal Large Language Models, MLLMs)无法处理高频生物时间序列数据,仅能进行表层行为模式匹配,缺乏对潜在状态的真实推理能力。解决方案的关键在于提出Meow-Omni 1——首个开源的四模态MLLM,其通过专门设计的架构适配,将视频、音频、生理时间序列流与文本推理原生融合,并引入基于生理学基础的跨模态对齐机制,实现从表征到意图的深层推理。该方法在自建专家验证的MeowBench基准上达到71.16%的意图识别准确率,显著优于主流视觉-语言和全模态基线模型。

链接: https://arxiv.org/abs/2605.09152
作者: Jucheng Hu,Zhangquan Chen,Yulin Chen,Chengjie Hong,Liang Zhou,Tairan Wang,Sifei Li,Giulio Zhu,Feng Zhou,Yiheng Zeng,Suorong Yang,Dongzhan Zhou
机构: University College London (伦敦大学学院); Tsinghua University (清华大学); Nanjing University (南京大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:Deciphering animal intent is a fundamental challenge in computational ethology, largely because of semantic aliasing, the phenomenon where identical external signals (e.g., a cat’s purr) correspond to radically different internal states depending on physiological context. Existing Multimodal Large Language Models (MLLMs) are blind to high-frequency biological time-series data, restricting them to superficial behavioural pattern matching rather than genuine latent-state reasoning. To bridge this gap, we introduce Meow-Omni 1, the first open-source, quad-modal MLLM purpose-built for computational ethology. It natively fuses video, audio, and physiological time-series streams with textual reasoning. Through targeted architectural adaptation, we integrate specialized scientific encoders into a unified backbone and formalize intent inference via physiologically grounded cross-modal alignment. Evaluated on MeowBench, a novel, expert-verified quad-modal benchmark, Meow-Omni 1 achieves state-of-the-art intent-recognition accuracy (71.16%), substantially outperforming leading vision-language and omni-modal baselines. We release the complete open-source pipeline including model weights, training framework, and the Meow-10K dataset, to establish a scalable paradigm for inter-species intent understanding and to advance foundation models toward real-world veterinary diagnostics and wildlife conservation.

[NLP-162] From Traditional Taggers to LLMs: A Comparative Study of POS Tagging for Medieval Romance Languages ACL2026

【速读】: 该论文旨在解决中世纪罗曼语(如中古奥克语、中古加泰罗尼亚语和中古法语)的词性标注(POS tagging)难题,其核心挑战包括拼写变异、形态复杂性以及标注资源稀缺。解决方案的关键在于系统评估大语言模型(LLMs)在零样本提示(zero-shot prompting)、少样本提示(few-shot prompting)、单语微调(monolingual fine-tuning)及跨语言迁移学习(cross-lingual transfer learning)等多种设置下的性能表现。实验表明,基于LLM的方法显著优于传统规则和统计标签器,其中微调与多语言训练带来最大提升;尤其值得注意的是,跨语言迁移对低资源语言效果显著,而针对特定目标语言的双语训练甚至可超越广义多语言配置,凸显了语言亲缘性和数据特征在历史自然语言处理中迁移策略设计中的关键作用。

链接: https://arxiv.org/abs/2605.09147
作者: Matthias Schöffel,Esteban Garces Arias
机构: Bavarian Academy of Sciences (BAdW), Munich, Germany; Department of Statistics, LMU Munich, Germany; Munich Center for Machine Learning (MCML), LMU Munich, Germany
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Applications (stat.AP)
备注: Accepted at NLP4DH @ ACL 2026

点击查看摘要

Abstract:Part-of-speech (POS) tagging for Medieval Romance languages remains challenging due to orthographic variation, morphological complexity, and limited annotated resources. This paper presents a systematic empirical evaluation of large language models (LLMs) for POS tagging across three medieval varieties: Medieval Occitan, Medieval Catalan, and Medieval French. We compare traditional rule-based and statistical taggers with modern open-source LLMs under zero-shot prompting, few-shot prompting, monolingual fine-tuning, and cross-lingual transfer learning settings. Experiments on historically grounded datasets show that LLM-based approaches consistently outperform traditional taggers, with fine-tuning and multilingual training yielding the largest improvements. In particular, cross-lingual transfer learning substantially benefits under-resourced varieties, while targeted bilingual training can outperform broader multilingual configurations for specific target languages. The results highlight the importance of linguistic proximity and dataset characteristics when designing transfer strategies for historical NLP. These findings provide empirical insights into the applicability of modern neural methods to medieval text processing and provide practical guidance for deploying LLM-based POS tagging pipelines in digital humanities research. All code, models, and processed datasets are released for reproducibility.

[NLP-163] A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability

【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的智能体(Agent)在可靠性技术应用上的碎片化问题,即retry、多数投票(majority voting)、自一致性(self-consistency)等方法虽被广泛使用,但缺乏统一的理论分析框架。其解决方案的关键在于将LLM在温度T下采样视为香农编码理论中的离散随机信道(discrete stochastic channel $ p(y \mid x) $),由此构建一个基于通信理论的统一可靠性框架。在此框架下,所有可靠性技术均可归类为六种经典信道编码可靠性操作之一,并进一步提出两个闭式结果:一是噪声方差阈值条件下均匀平均优于质量加权平均;二是生成-批评迭代解码的收缩性准则,与3B至14B参数模型间观察到的“收缩-发散”转变一致。此外,论文引入一种成本感知语义最近邻路由机制(cost-aware semantic-nearest-neighbor router),通过单一拉格朗日调节参数即可实现质量-成本前沿的全路径覆盖,无需重新训练,显著优于固定策略,在多个硬任务上实现了帕累托最优性能。

链接: https://arxiv.org/abs/2605.09121
作者: Hamed Omidvar,Vahideh Akhlaghi
机构: INTELLERCE LLC (INTELLERCE LLC)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:Agents built on large language models (LLMs) rely on a range of reliability techniques, including retry, majority voting, and self-consistency, that have been developed in parallel rather than within a common analytical framework. We observe that an LLM sampled at temperature T is a discrete stochastic channel p(y|x) in the sense of Shannon’s coding theory, and use this identity as the entry point for such a framework grounded in communication theory. Each of these techniques is a special case of one of six classical reliability operators: diversity combining, hybrid retransmission, iterative generator-critic decoding, rateless sampling, structured redundant verification, and difficulty-adaptive routing. Within the framework we give two closed-form results: a noise-variance threshold above which uniform averaging beats quality-weighted averaging, and a contractivity criterion for generator-critic refinement, consistent with a contractive-to-divergent transition we observe between 3B- and 14B-parameter models. We further introduce a cost-aware semantic-nearest-neighbor router whose single Lagrangian knob traverses the quality-cost frontier without retraining. Across six channel configurations spanning local and cloud models on 69 hard tasks, no fixed model-technique-budget choice dominates, motivating per-task allocation. On a 300-item hard split of MMLU, GSM8K, and HumanEval, our router occupies the full empirical Pareto frontier: at matched quality, its normalized cost is ≈56% lower than the strongest fixed technique; at matched normalized cost, it improves quality by ≈7% (26% over single-shot decoding). These results argue for consolidating these reliability techniques into a single tunable layer informed by channel coding.
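
下面是"单一拉格朗日系数在质量与成本之间取舍、逐任务选择配置"的最简示意;各配置的质量与成本估计在论文中来自语义最近邻,这里直接假设为已知数值,配置名称也仅作举例。

```python
def route(task_estimates, lam):
    """Pick the model-technique configuration maximizing quality - lam * cost.
    task_estimates: list of dicts like {"config": ..., "quality": ..., "cost": ...};
    in the paper these per-config estimates come from semantic nearest neighbors."""
    return max(task_estimates, key=lambda e: e["quality"] - lam * e["cost"])

candidates = [
    {"config": "3B single-shot",       "quality": 0.55, "cost": 0.2},
    {"config": "3B majority-vote",     "quality": 0.68, "cost": 1.0},
    {"config": "14B single-shot",      "quality": 0.74, "cost": 1.5},
    {"config": "14B generator-critic", "quality": 0.81, "cost": 3.0},
]
for lam in (0.0, 0.05, 0.2):          # sweeping the single knob traces the quality-cost frontier
    print(lam, route(candidates, lam)["config"])
```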

[NLP-164] Personalized Alignment Revisited: The Necessity and Sufficiency of User Diversity

【速读】: 该论文旨在解决个性化对齐(personalized alignment)在统计效率上的理论边界问题,即如何在异质用户偏好下实现最优的在线后悔率(O(1) online regret)和离线样本复杂度(log(1/ε) offline sample complexity)。其核心解决方案在于揭示了一个关键条件:用户多样性(user-diversity),即用户特定的模型头(user-specific heads)必须能够覆盖所有可能改变最优响应的潜在奖励方向(latent reward directions)。研究证明,这一条件既是达到最优效率的充分条件,也是必要条件——当其满足时,简单的贪婪算法即可实现基准效率;反之,任何自然可接受的学习器都会面临至少对数级的后悔。因此,论文指出用户多样性是实现个性化可识别性的根本驱动力。

链接: https://arxiv.org/abs/2605.09119
作者: Enoch Hyunwook Kang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Personalized alignment aims to adapt large language models to heterogeneous user preferences, yet the precise theoretical conditions for its statistical efficiency have not been formally established. This paper characterizes the conditions under which personalized alignment achieves O(1) online regret and log(1/epsilon) offline sample complexity. We show that these optimal rates depend on a specific user-diversity condition: the population of user-specific heads must span the latent reward directions that can alter the optimal response. We prove that this condition is both necessary and sufficient. When it holds, simple greedy algorithms achieve benchmark efficiency; when it fails, every learner in a natural admissible class incurs at least logarithmic regret. Our results identify user diversity as the fundamental driver of personalized identifiability.

[NLP-165] Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在金融投资决策场景中因缺乏鲁棒性、存在对上下文偏见的从众倾向(herding behavior)以及难以独立判断潜在人类偏见而导致的可靠性问题。其解决方案的关键在于构建Fin-Bias基准,该基准包含8868份长期、细粒度的企业分析师报告(含明确的投资评级:看涨/中性/看跌),并设计实验对比LLM在不同上下文条件下(如是否提供真实或伪造的投资评级)生成投资决策的表现,从而量化其对显式偏见的敏感性;进一步提出一种检测潜在人类意见的方法,引导LLM进行独立推理,在部分模型上实现了超越人类的未来股票收益预测能力。

链接: https://arxiv.org/abs/2605.09106
作者: Xiaoyu Hu,Jinman Zhao
机构: Rutgers University; University of Toronto
类目: Computation and Language (cs.CL)
备注: ACL 2026 Findings

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed in financial contexts, raising critical concerns about reliability, alignment, and susceptibility to adversarial manipulation. While prior finance-related benchmarks assess LLMs’ capabilities in stock trading, they are often restricted to small sample and fail to demonstrate LLM susceptibility to context with potential human bias. We introduce Fin-Bias (financial herding under long and uncertain financial context), a benchmark for evaluating LLM investment decision-making when faced with uncertainty and possible human-biased opinions. Fin-Bias includes 8868 long firm-specific analyst reports, including firm aspects summarized and analyzed by sophisticated analysts with investment ratings (Bullish/Neutral/Bearish) spanning from various industries. We present large language models with firm analyst reports with/without analyst investment ratings and even with ‘fake’ rating, to get investment ratings generated by LLMs. Our results reveal that LLMs tend to herd the explicit bias in context. We also develop a method to detect potential human opinions, which can encourage LLMs to think independently, some models even exceed human performance in predicting future stock return.

[NLP-166] GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在训练和部署过程中,文本嵌入(text embedding)、生成任务(generative tasks)与上下文压缩(context compression)三者通常需独立训练所导致的高成本问题,以及在推理驱动生成、代理型任务(agentic tasks)中对长上下文处理和持续学习能力的迫切需求。其解决方案的关键在于提出一个统一框架GRC(Generative, Representational, and Compressive tuning),通过引入元潜在标记(meta latent tokens)和一种联合优化策略,在单次前向传播中同时完成推理驱动生成、增强型文本表征与上下文压缩任务;该设计不仅实现了模块化、乐高式(LEGO-style)的推理灵活性,还显著降低检索增强生成(Retrieval-Augmented Generation, RAG)的部署复杂度,并提升训练阶段的数据利用率至三倍。此外,该框架催生了“自推理潜在嵌入”(self-reason-latent embeds)与“潜在记忆增强生成”(latent memory-augmented generation)两种新范式,其中压缩并内化的键值缓存(KV cache)以O(1)长度作为可更新记忆,进一步推动高效推理与长期知识保持。

链接: https://arxiv.org/abs/2605.09100
作者: Zhongtao Miao,Qiyu Wu,Yoshimasa Tsuruoka
机构: The University of Tokyo (东京大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Text embedding and generative tasks are usually trained separately based on large language models (LLMs) nowadays. This causes a large amount of training cost and deployment effort. Context compression is also a challenging and pressing task, which is vital to reasoning-driven generation, and agentic tasks requiring long context and continual learning. In this paper, we explore how to unify reasoning-driven generation, reasoning-enhanced text representation and context compression tasks in one forward pass for LLMs. Through meta latent tokens and a unified generative, representative and compressive tuning approach, we propose a training framework named GRC that bridges the three tasks. The trained models can accomplish three objectives in a single forward pass while maintaining modular, LEGO-style flexibility during inference. This design greatly reduces the deployment effort for retrieval-augmented generation (RAG) and achieves efficient inference and three times data utilization during training. Furthermore, this framework design enables a new paradigm for text embedding: self-reason-latent embeds, and a new generation paradigm, latent memory-augmented generation, where compressed and internalized KV cache with O(1) length is used as the updatable memory. We also propose hybrid paged attention to speed up the inference of our models. Extensive experiments on reasoning-intensive retrieval benchmarks, generative tasks, document compression, latency evaluation, and RAG settings demonstrate the effectiveness of our method and may shed light on the truly unified model that can handle reasoning-driven generation, embedding and compression tasks seamlessly.

[NLP-167] Dynamic Meta-Metrics: Source-Sentence Conditioned Weighting for MT Evaluation ACL

【速读】: 该论文旨在解决机器翻译(Machine Translation, MT)评估中传统静态集成方法的局限性,即固定权重组合现有指标(如BLEU、TER等)难以适应不同源句特征的问题。其解决方案的关键在于提出动态元度量(Dynamic Meta-Metrics, DMM)框架,通过学习源句条件化的指标组合方式,使评估权重能够根据源句所属的语义或结构簇自适应调整;具体而言,DMM采用硬条件化(hard conditioning)和软条件化(soft conditioning)两种策略,前者为每个聚类拟合可解释的组合器,后者则允许权重随源句归属责任连续变化,实验表明基于多层感知机(MLP)的组合方式优于线性及高斯过程模型,且引入软条件化能进一步提升性能。

链接: https://arxiv.org/abs/2605.09098
作者: Luke Zhang,Justin Vasselli,Aditya Khan,York Hay Ng,En-Shiun Annie Lee
机构: University of Toronto, Canada; Nara Institute of Science and Technology, Japan; Ontario Tech University, Canada
类目: Computation and Language (cs.CL)
备注: 5 pages, ACL SRW 2026

点击查看摘要

Abstract:We propose Dynamic Meta-Metrics (DMM), a framework for machine translation evaluation that learns source-sentence conditioned combinations of existing metrics. Rather than relying on a single static ensemble or language-specific weighting, DMM adapts the metric combination based on properties of the source segment. We study hard conditioning, which fits an interpretable combiner per cluster, and an exploratory soft-conditioned extension whose weights vary continuously with source-cluster responsibilities. We evaluate DMM on the WMT Metrics Shared Task data across multiple language pairs using pairwise agreement measures at the system and segment levels. Across settings, MLP-based combinations outperform linear and Gaussian process-based ensembles, and introducing soft conditioning yields gains over linear models.
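
下面给出"软条件化"组合的一个极简示意:对源句嵌入聚类,按源句对各簇的归属责任加权混合各簇的指标权重;聚类方法、温度与每簇权重的拟合方式均为假设,论文实际使用的 MLP 组合器未包含在内。

```python
import numpy as np
from sklearn.cluster import KMeans

class SoftConditionedCombiner:
    """Blend per-cluster metric weights by a source sentence's cluster responsibilities."""
    def __init__(self, n_clusters=4, n_metrics=3, temperature=1.0):
        self.kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        self.cluster_weights = np.full((n_clusters, n_metrics), 1.0 / n_metrics)
        self.temperature = temperature

    def fit(self, source_embeddings, cluster_weights=None):
        self.kmeans.fit(source_embeddings)
        if cluster_weights is not None:        # e.g. weights fitted per cluster on dev data
            self.cluster_weights = cluster_weights
        return self

    def score(self, source_embedding, metric_scores):
        # responsibilities: softmax over negative distances to the cluster centroids
        d = np.linalg.norm(self.kmeans.cluster_centers_ - source_embedding, axis=1)
        r = np.exp(-d / self.temperature)
        r /= r.sum()
        w = r @ self.cluster_weights           # soft-conditioned metric weights
        return float(w @ metric_scores)        # metric_scores: e.g. [COMET, chrF, BLEU]

# toy usage with random embeddings standing in for real source representations
emb = np.random.randn(200, 16)
combiner = SoftConditionedCombiner().fit(emb)
print(combiner.score(np.random.randn(16), np.array([0.8, 0.6, 0.4])))
```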

[NLP-168] Character-Level Transformer for Tajik-Persian Transliteration with a Parallel Lexical Corpus

【速读】: 该论文旨在解决塔吉克语(西里尔字母)到波斯语(阿拉伯字母)的自动转写(transliteration)问题,这是跨书写系统自然语言处理中的关键挑战。其核心解决方案是构建并公开一个大规模、词级对齐且经过词典验证的平行语料库(共52,152个词和短语),并在此基础上训练一个字符级序列到序列Transformer模型。该模型在字符错误率(CER)和精确匹配准确率上均优于基于字典的规则方法和循环神经网络基线,尤其在使用束搜索(beam search, k=3)后性能进一步提升,体现了数据质量与深度学习架构协同优化的关键作用。

链接: https://arxiv.org/abs/2605.09092
作者: Mullosharaf K. Arabov
机构: Kazan Federal University (喀山联邦大学)
类目: Computation and Language (cs.CL)
备注: Published in Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script (AbjadNLP), pages 75-83, Rabat, Morocco, March 2026

点击查看摘要

Abstract:This study addresses automatic transliteration from Tajik (Cyrillic script) to Persian (Perso-Arabic script). We present a curated, lexicographically verified parallel corpus of 52,152 Tajik–Persian words and short phrases, compiled from printed dictionaries, encyclopedic sources, and manually verified online resources. To the best of our knowledge, this is one of the largest publicly available word-level corpora for Tajik–Persian transliteration. Using this corpus, we train a character-level sequence-to-sequence Transformer model and evaluate it using Character Error Rate (CER) and exact-match accuracy. The Transformer achieves a CER of 0.3216 and an exact-match accuracy of 0.3133, outperforming both dictionary-based rule-based and recurrent neural baselines. With beam search (k=3), performance further improves to CER 0.3182 and accuracy 0.3215. We describe the data collection and preprocessing pipeline, model architecture, and experimental protocol, and report a part-of-speech analysis showing performance differences across lexical categories. All preprocessing scripts, deterministic splits into training, validation, and test sets, and training configurations are released to support reproducibility and further research on Tajik and related Persian dialects. The corpus supports research in character-level transliteration, cross-script NLP, and lexicographic applications.
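
Character Error Rate, the headline metric above, is character-level edit distance normalized by reference length; the self-contained sketch below (not the paper's evaluation code) shows how it can be computed:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # Single-row dynamic-programming edit distance over characters.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(
                dp[j] + 1,                # deletion
                dp[j - 1] + 1,            # insertion
                prev_diag + (r != h),     # substitution / match
            )
            prev_diag, dp[j] = dp[j], cur
    return dp[len(hyp)] / max(len(ref), 1)

# Toy example with made-up strings (the real task maps Cyrillic to Perso-Arabic).
print(cer("салом", "салом"))   # 0.0, exact match
print(cer("салом", "селом"))   # 0.2, one substitution out of five characters
```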

[NLP-169] Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs KR

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在数学推理能力评估中缺乏高阶挑战性任务的问题,尤其是如何有效衡量模型在研究级数学问题上的表现。现有基准如奥数类题目仅考察步骤式推理,而研究级数学问题则要求模型不仅能进行逻辑推导,还需推动数学知识边界的进展,具有更高的复杂性和真实性。为此,作者构建了Soohak这一全新基准,包含439道由64名数学家原创的问题集,分为Challenge子集和Refusal子集:前者用于评估标准解题能力(前沿模型最高仅达30.4%),后者则引入“拒绝回答”机制,测试模型识别病态问题并主动暂停输出的能力——这是科研数学中至关重要的元认知技能,当前所有模型在此项指标上均未超过50%,揭示出模型对“不确定性的认知与处理”仍是亟待优化的新方向。

链接: https://arxiv.org/abs/2605.09063
作者: Guijin Son,Seungone Kim,Catherine Arnett,Hyunwoo Ko,Hyein Lee,Hyeonah Kang,Jiang Longxi,Jin Yun,JungYup Lee,Kyungmin Lee,Sam Yoosuk Kim,Sang Park,Seunghyeok Hong,SeungJae Lee,Seungyeop Yi,Shinae Shin,SunHye Bok,Sunyoung Shin,Yonghoon Ji,Youngtaek Kim,Hanearl Jung,Akari Asai,Graham Neubig,Sean Welleck,Youngjae Yu,Akshelin R,Alexander B. Ivanov,Boboev Muhammadjon,Chaeyoung Han,Christian Stump,Dmitrii Karp,Dohyun Kwon,DoYong Kwon,Duk-Soon Oh,Giovanni Resta,Greta Panova,Huiyun Noh,Hyungryul Baik,Hyungsun Bae,Inomov Mashrafdzhon,Jeewon Kim,Ji Eun Lee,Jiaqi Liu,Jieui Kang,Jimin Kim,Jon-Lark Kim,Junseo Yoon,Junwoo Jo,Kibeom Kim,Kiwoon Kwon,Mario Kummer,Max Mercer,Minjun Kim,Nahyun Lee,Ng Ze-An,Rafał Marcin Łochowski,Raphaël Lachièze-Rey,Ruichen Zhang,Sejin Park,Seonguk Seo,Shin Jaehoon,Sunatullo,Taewoong Eom,Yeachan Park,Yongseok Jang,Youchan Oh,Zhaoyang Wang,Zoltán Kovács
机构: 未知
类目: Computation and Language (cs.CL)
备注: Under review, For questions or model-evaluation requests, contact this http URL @snu. this http URL

点击查看摘要

Abstract:Following the recent achievement of gold-medal performance on the IMO by frontier LLMs, the community is searching for the next meaningful and challenging target for measuring LLM reasoning. Whereas olympiad-style problems measure step-by-step reasoning alone, research-level problems use such reasoning to advance the frontier of mathematical knowledge itself, emerging as a compelling alternative. Yet research-level math benchmarks remain scarce because such problems are difficult to source (e.g., Riemann Bench and FrontierMath-Tier 4 contain 25 and 50 problems, respectively). To support reliable evaluation of next-generation frontier models, we introduce Soohak, a 439-problem benchmark newly authored from scratch by 64 mathematicians. Soohak comprises two subsets. On the Challenge subset, frontier models including Gemini-3-Pro, GPT-5, and Claude-Opus-4.5 reach 30.4%, 26.4%, and 10.4% respectively, leaving substantial headroom, while leading open-weight models such as Qwen3-235B, GPT-OSS-120B, and Kimi-2.5 remain below 15%. Notably, beyond standard problem solving, Soohak introduces a refusal subset that probes a capability intrinsic to research mathematics: recognizing ill-posed problems and pausing rather than producing confident but unjustified answers. On this subset, no model exceeds 50%, identifying refusal as a new optimization target that current models do not directly address. To prevent contamination, the dataset will be publicly released in late 2026, with model evaluations available upon request in the interim.

[NLP-170] Language-Conditioned Visual Grounding with CLIP Multilingual

【速读】: 该论文旨在解决多语言视觉-语言模型(Multilingual Vision-Language Models)在不同语言间存在系统性性能差异的问题,尤其关注这种跨语言偏差是由视觉编码器、文本分支还是两者交互所导致的机制不明确问题。解决方案的关键在于设计了一个密集型多语言CLIP探测器(dense multilingual CLIP probe),其中保持视觉编码器一致(使用ViT-B/32或ViT-H/14),仅让XLM-RoBERTa文本分支随语言变化,从而隔离出文本分支对性能差异的影响。通过在13种语言上评估两种不同规模的CLIP架构(视觉参数相差7倍),并量化跨语言一致性指标(如cluster-mask IoU、Spearman相关性等),研究发现:低资源语言的结构缺陷仅出现在文本分支;扩大视觉编码器规模会加剧某些语言的空间对齐失败,但改善另一些语言的性能,表明存在“语料覆盖不足”与“分词器丰度不足”的区分;且峰值相似性得以保留,说明空间错位是主要失效模式而非信号坍塌。

链接: https://arxiv.org/abs/2605.09060
作者: J. de Curtò,Mauro Liz,I. de Zarzà
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multilingual vision-language models exhibit systematic performance gaps across languages, but the mechanism remains ambiguous: cross-language divergence could arise from the visual encoder, the text branch, or their interaction. We resolve this ambiguity through a dense multilingual CLIP probe in which the visual encoder is held identical across thirteen typologically diverse languages and only the XLM-RoBERTa text branch varies. We evaluate two CLIP architectures spanning a 7x visual-encoder scale gap (XLM-R base + ViT-B/32, ~87M visual parameters; XLM-R large + ViT-H/14, ~632M) on 11 concepts and 210 images, and quantify cross-language agreement via cluster-mask IoU, top-percentile IoU, and Spearman rank correlation against an English reference (n=2,310 paired observations per language). Three findings emerge. First, low-resource languages (Arabic, Basque, Luxembourgish) incur a structural penalty at both backbone scales (Wilcoxon HR > LR, p < 10^-300; cluster-mask IoU gap +0.114 at base, +0.143 at large), isolating the deficit to the text branch. Second, scaling the encoder 7x widens the gap for structural failure cases (Basque \Delta=-0.056, Luxembourgish \Delta=-0.076) while improving Arabic (\Delta=+0.033), separating corpus-coverage from tokeniser-fertility failures. Third, peak similarity is preserved across languages (mean ratio 0.94 at large scale) while cluster-mask IoU drops sharply, identifying spatial misalignment, not signal collapse, as the dominant failure mode. At 3.4-3.9 Wh per 1,000 queries, dense-CLIP grounding is competitive with high-throughput inference budgets, positioning it as a practical substrate for energy-aware multilingual deployment.
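
A small sketch of the two agreement measures used above, cluster-mask IoU and Spearman rank correlation, computed on toy patch-similarity maps; the masking fraction and map size are assumptions for illustration, not the paper's settings:

```python
import numpy as np
from scipy.stats import spearmanr

def cluster_mask_iou(sim_a, sim_b, top_frac=0.1):
    """IoU of the top-`top_frac` patch masks of two dense similarity maps."""
    k = max(1, int(top_frac * sim_a.size))
    mask_a = sim_a.ravel() >= np.sort(sim_a.ravel())[-k]
    mask_b = sim_b.ravel() >= np.sort(sim_b.ravel())[-k]
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union

# Toy 7x7 patch-similarity maps (stand-ins for dense CLIP image-text similarities).
rng = np.random.default_rng(1)
sim_english = rng.normal(size=(7, 7))
sim_other = sim_english + 0.5 * rng.normal(size=(7, 7))  # noisier "other language"

iou = cluster_mask_iou(sim_english, sim_other, top_frac=0.2)
rho, _ = spearmanr(sim_english.ravel(), sim_other.ravel())
print(f"cluster-mask IoU: {iou:.3f}, Spearman rho vs. English: {rho:.3f}")
```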

[NLP-171] Phase Transitions in Affective Meaning Divergence: The Hidden Drift Before the Break ACL2026

【速读】: 该论文旨在解决对话中情感意义分歧(affective meaning divergence, AMD)如何影响沟通协调失效的问题,即当对话双方对同一词汇的情感理解出现偏差时,这种分歧如何逐步累积并最终导致对话脱轨。解决方案的关键在于构建一个基于言语行为理论、共同知识积累和熵正则化博弈论的数学模型,推导出一个logit最优响应映射,并发现当参数βα > 4时,AMD驱动的负载单调增加会引发鞍结分岔(saddle-node bifurcation),导致修复协调机制突然且具有滞后性的崩溃。实证分析在“对话失序”数据集(CGA-Wiki, N=652)中验证了AMD的动态特征:其方差在对话脱轨前达到峰值,表现出临界减速(critical slowing down, CSD)信号,且显著优于毒性与情绪基线指标,从而为理解人际沟通失效提供了可量化、理论驱动的动态标记。

链接: https://arxiv.org/abs/2605.09043
作者: Napassorn Litchiowong
机构: School of Computing, National University of Singapore (计算学院,新加坡国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to the ACL 2026 Student Research Workshop

点击查看摘要

Abstract:One partner says “Fine” meaning resolution; the other hears surrender. The word is shared; the affective uptake is not. We formalize this as affective meaning divergence (AMD), the total-variation distance between interlocutors’ anchor-conditioned affect distributions. Building on speech-act theory, common-ground accumulation, and entropy-regularized game theory, we derive a logit best-response map whose dynamics undergo a saddle-node bifurcation: when \beta\alpha > 4, a monotone increase in AMD-driven load produces an abrupt, hysteretic collapse of repair coordination. On Conversations Gone Awry (CGA-Wiki; N=652), derailing conversations exhibit critical-slowing-down (CSD) signatures across multiple levels: lexical divergence variance (p < 0.001, d=0.36), AMD variance (p=0.001, d=0.26), and dialog-act repair variance (p=0.016, d=0.20), all significant after correction and stronger than toxicity and sentiment baselines. AMD provides a distinct temporal signature, with retrospectively measured variance peaking at the bifurcation point while toxicity variance peaks earlier, and is the only indicator grounded in the theoretical framework. Boundary-condition analysis on CGA-CMV (N=1,169) yields mixed but directionally consistent evidence.
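
The following sketch illustrates the two quantitative ingredients: total-variation distance as the AMD measure, and a generic logit best-response map whose fixed point collapses abruptly once beta*alpha exceeds 4. The concrete functional form, load sweep, and parameter values are illustrative assumptions, not the paper's exact model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tv_distance(p, q):
    """Total-variation distance between two affect distributions."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def repair_fixed_point(beta_alpha, load, p0=0.9, iters=500):
    """Iterate an illustrative logit best-response map to its fixed point.

    The slope of sigmoid(c * (p - 0.5)) at p = 0.5 is c / 4, so for
    beta_alpha > 4 the map admits multiple fixed points and a rising
    AMD-driven load can trigger an abrupt, hysteretic collapse.
    """
    p = p0
    for _ in range(iters):
        p = sigmoid(beta_alpha * (p - 0.5) - load)
    return p

# AMD between two interlocutors' affect distributions over the same anchor word.
amd = tv_distance([0.7, 0.2, 0.1], [0.2, 0.3, 0.5])
print(f"AMD (total variation): {amd:.2f}")

# Sweep the load upward: below beta*alpha = 4 the decline is smooth,
# above it the repair probability drops discontinuously.
for ba in (3.0, 8.0):
    traj = [repair_fixed_point(ba, load) for load in np.linspace(0.0, 4.0, 9)]
    print(f"beta*alpha={ba}:", [round(p, 2) for p in traj])
```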

[NLP-172] Evaluating Pragmatic Reasoning in Large Language Models: Evidence from Scalar Diversity

【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在语用推理(pragmatic reasoning)评估中存在方法依赖性的问题,即不同评估方式可能导致对模型能力的判断不一致,从而难以准确衡量其真实语用推理能力。解决方案的关键在于采用标量多样性(scalar diversity)作为分级诊断指标,系统比较直接概率测量与元语言提示(metalinguistic prompting)两种评估方法在多个模型和实验设置下的表现,发现语用行为在不同模型家族、提示策略和任务结构中表现出显著差异,且标量多样性梯度仅在特定模型-条件组合中出现,表明LLMs的语用推理是内部概率表征与任务诱导提示行为之间相互作用的结果,而非单一评估范式所能稳定捕捉的稳定能力。

链接: https://arxiv.org/abs/2605.09042
作者: Ye-eun Cho
机构: Sungkyunkwan University (成均馆大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Evaluating pragmatic reasoning in large language models (LLMs) remains challenging because model behavior can vary depending on evaluation methods. Previous studies suggest that prompt-based judgments may diverge from models’ internal probability distributions, raising questions about whether observed performance reflects underlying competence or task-induced behavior. This study examines this issue using scalar diversity as a graded diagnostic for pragmatic inference. Following Hu & Levy (2023), this study compares direct probability measurement and metalinguistic prompting across multiple models and experimental settings. The results show that neither evaluation method consistently outperforms the other and that pragmatic behavior varies substantially across model families, prompting strategies, and task structures. Moreover, scalar diversity gradients emerge only in specific model-condition combinations, suggesting that pragmatic reasoning in LLMs reflects an interaction between internal probabilistic representations and task-induced prompting behavior rather than a stable competence captured by a single evaluation paradigm. These findings highlight the central role of evaluation design in interpreting pragmatic abilities in LLMs.

[NLP-173] BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)偏见审计中存在的可靠性问题,尤其是现有基准测试将偏见简化为单一标量值所导致的误导性结论。其核心问题是:现有方法忽视了两个关键失败模式——跨提示的语义不变格式变化可引发超过0.7的偏见倾向漂移,以及同一响应中选择(Selection)与自由文本扩展(Elaboration)之间可能存在的立场冲突,从而造成“抵消陷阱”(cancellation trap),掩盖模型内部不一致性。解决方案的关键在于提出BiAxisAudit协议,该协议从两个正交维度评估偏见及其可靠性:一是跨提示轴,通过因子设计(任务格式、视角、角色、情感)将偏见视为分布而非点估计;二是响应内轴,利用Split Coding分离Selection与Elaboration信号,并引入不一致率(Inconsistency Rate)和净分歧不平衡(Divergence Net Imbalance)进行量化。实证表明,任务格式解释的方差甚至超过模型选择本身,且高达63.6%的偏见信号仅出现在单一编码层,说明偏见高度依赖于prompt结构,而并非单纯由模型权重决定。

链接: https://arxiv.org/abs/2605.09041
作者: Jialing Gan,Junhao Dong,Songze Li
机构: 未知
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: 24 pages, 10 figures. Preprint

点击查看摘要

Abstract:Bias audits of large language models now operate within governance frameworks such as the EU AI Act, making benchmark reliability a security concern in its own right. Many current benchmarks, however, collapse bias into a single scalar from one prompt format and one surface label. This design misses two failure modes that can be exploited without changing model weights. Across prompts, meaning-preserving format changes shift bias endorsement by more than 0.7 on a fixed statement pool. Within a response, the discrete Selection and free-text Elaboration can take opposing stances, so an apparently clean aggregate may hide substantial internal inconsistency (a ``cancellation trap’'). Selection-only and elaboration-only rankings are therefore nearly uncorrelated across eight LLMs (Spearman \rho = 0.238 , p = 0.570 ): LLaMA3-70B ranks in the middle under selection-only scoring but highest under elaboration-only scoring on the same responses. We introduce \textscBiAxisAudit, a protocol that reports each bias score together with a reliability estimate on two orthogonal axes. The across-prompt axis evaluates each statement under a factorial grid of task format, perspective, role, and sentiment, treating bias as a distribution rather than a point estimate. The within-response axis uses Split Coding to recover Selection and Elaboration as separate signals, measured by the Inconsistency Rate and Divergence Net Imbalance. Across eight LLMs with 80,200 coded responses each, task format alone explains as much variance as model choice; 63.6% of pooled bias signals (up to 85.2% per model) appear in only one coding layer, and prompt-dimension interactions exceed main effects. The instrument also separates real bias reductions from apparent reductions caused by cross-layer redistribution: some prompt configurations reduce both BER and IR, whereas others suppress only selection-layer bias.
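
As a rough illustration of the within-response axis, the sketch below computes an inconsistency rate and a signed net imbalance from split-coded Selection/Elaboration stances. The -1/0/+1 coding and both formulas are assumed operationalizations for illustration, not the paper's exact definitions:

```python
from dataclasses import dataclass

@dataclass
class CodedResponse:
    selection_bias: int    # stance coded from the discrete Selection (-1, 0, +1)
    elaboration_bias: int  # stance coded from the free-text Elaboration (-1, 0, +1)

def inconsistency_rate(responses):
    """Share of responses whose two coding layers take different stances."""
    disagree = [r for r in responses if r.selection_bias != r.elaboration_bias]
    return len(disagree) / len(responses)

def divergence_net_imbalance(responses):
    """Signed, normalized tendency of the Elaboration layer to be more biased
    than the Selection layer (positive) or less biased (negative)."""
    net = sum(abs(r.elaboration_bias) - abs(r.selection_bias) for r in responses)
    return net / len(responses)

coded = [
    CodedResponse(selection_bias=0, elaboration_bias=1),   # clean pick, biased text
    CodedResponse(selection_bias=1, elaboration_bias=1),
    CodedResponse(selection_bias=-1, elaboration_bias=0),
    CodedResponse(selection_bias=0, elaboration_bias=0),
]
print("Inconsistency Rate:", inconsistency_rate(coded))
print("Divergence Net Imbalance:", divergence_net_imbalance(coded))
```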

[NLP-174] A Quantum Inspired Variational Kernel and Explainable AI Framework for Cross Region Solar and Wind Energy Forecasting

【速读】: 该论文旨在解决现代电力系统中太阳能和风能发电的短时预测问题,这是确保电网可靠运行的关键前提。现有大多数预测模型仅在单一气候条件下训练与评估,且算法创新多集中于经典循环网络或将预测与解释功能耦合的单体基础模型,缺乏对复杂天气模式的区分能力及可解释性。其解决方案的关键在于提出一个四阶段混合框架:首先通过公开API获取小时级发电量、辐照度和气象数据;其次训练三种经典基线模型(ARIMA、梯度提升回归树、两层LSTM)生成强点预测及残差序列;第三阶段利用受量子启发的变分核方法(基于六量子比特硬件高效参数化Ansatz)修正残差,显著提升对平静与风暴天气的判别能力(Fisher判别比是优化径向基核的15倍);第四阶段则引入生成式AI作为可解释性层,将基准指标转化为结构化自然语言说明,从而实现预测精度与物理意义的协同优化。

链接: https://arxiv.org/abs/2605.09032
作者: Pavan Manjunath,Thomas Prufer
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reliable short-horizon forecasting of solar and wind generation is a structural prerequisite of any modern power system, yet most published forecasters are tuned and evaluated on a single climatic regime, and most algorithmic novelty has been concentrated either on classical recurrent networks or on monolithic foundation models that combine forecasting and explanation. We develop a four-stage hybrid framework that separates these concerns. The first stage acquires hourly generation, irradiance, and surface weather records through public application programming interfaces. The second stage trains three classical baselines (autoregressive integrated moving average, gradient-boosted regression trees, and a two-layer long short-term memory network) and produces a strong point forecast together with a residual error series. The third stage corrects the residual through a quantum-inspired variational kernel built on a six-qubit hardware-efficient ansatz with three repeated entangling layers. The fourth stage uses generative artificial intelligence strictly as an explainability layer that reads the measured benchmark numbers and produces a structured natural-language interpretation. Across three regions drawn from open public archives (Iberian solar, North Sea wind, and a mixed Texas trace), the proposed configuration stays within one percentage point of the strongest classical baseline on the in-domain forecasting task, and the quantum-inspired kernel separates calm and stormy weather regimes with a Fisher discriminant ratio approximately fifteen-fold higher than a tuned radial-basis kernel.
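
The regime-separation claim is stated in terms of the Fisher discriminant ratio; a minimal sketch of that statistic on made-up calm vs. stormy kernel projections (not the paper's pipeline) looks like this:

```python
import numpy as np

def fisher_discriminant_ratio(x_calm, x_storm):
    """Fisher discriminant ratio between two 1-D groups of kernel features:
    squared difference of means over the sum of within-group variances."""
    x_calm, x_storm = np.asarray(x_calm), np.asarray(x_storm)
    num = (x_calm.mean() - x_storm.mean()) ** 2
    den = x_calm.var(ddof=1) + x_storm.var(ddof=1)
    return num / den

# Toy residual-kernel projections for calm vs. stormy hours (made-up numbers).
rng = np.random.default_rng(2)
calm = rng.normal(loc=0.2, scale=0.1, size=200)
storm = rng.normal(loc=0.9, scale=0.1, size=200)
print(f"Fisher discriminant ratio: {fisher_discriminant_ratio(calm, storm):.1f}")
```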

[NLP-175] GAMBIT: A Three-Mode Benchmark for Adversarial Robustness in Multi-Agent LLM Collectives

【速读】: 该论文旨在解决多智能体系统(Multi-Agent Systems, MAS)中自适应欺骗性代理(adaptive imposter agent)对集体智能性能的威胁问题,现有研究仅针对浅层任务且未考虑对手策略随检测机制演化而动态调整的情况。解决方案的关键在于提出GAMBIT基准,其包含三种评估模式和两个独立评分指标:第一、二种模式衡量在分布偏移加剧下的零样本检测能力,第三种校准模式则评估检测器仅用20个标注样本即可快速适应新型攻击的能力;同时引入基于高效进化框架的自适应欺骗代理,该代理在棋类推理任务中能显著降低群体性能却保持极低可检测性(F1-score仅为50.5%),并证明零样本评估可能严重误导对自适应攻击者防御效果的判断——例如两个零样本表现相近的检测器在少量样本微调下性能差距可达8倍,而元学习变体收敛速度提升20倍,这一差异唯有在 recalibration mode 中才能揭示。

链接: https://arxiv.org/abs/2605.09027
作者: Alexandre Le Mercier,Chris Develder,Thomas Demeester
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 46 pages, 16 figures

点击查看摘要

Abstract:In multi-agent systems (MAS), a single deceptive agent can nullify all gains of an agentic AI collective and evade deployed defenses. However, existing adversarial studies on MAS target only shallow tasks and do not consider adaptive adversaries, which evolve their strategies to evade the very detectors trained to catch them. To address that gap, we introduce GAMBIT, a benchmark with three evaluation modes and two independent scores for evaluating imposter detectors: the first two modes measure zero-shot detection under increasing distribution shift, and a third recalibration mode measures how quickly a detector adapts to novel attacks from just 20 labeled examples. The benchmark comes with a dataset of 27,804 labeled instances spanning 240 co-evolved imposter strategies. Our contributions are threefold: (1) Using chess as a substrate deep reasoning problem and Gemini 3.1 Pro for agents, we release GAMBIT and its dataset to evaluate imposter detectors under realistic constraints against a stealthy adaptive imposter; (2) We introduce an adaptive imposter agent based on an efficient evolutionary framework, generalizable beyond chess, that collapses collective task performance while remaining essentially undetectable (50.5% F1-score with a Gemini-based detector); (3) We show that zero-shot evaluation can be highly misleading for adaptive adversaries: two detectors with near-identical zero-shot scores differ by 8x on few-shot adaptation, while the meta-learned variant converges 20x faster, a gap only visible in the recalibration mode. Altogether, GAMBIT provides the first multi-agent benchmark where adversarial attacks and defenses co-evolve, with an imposter framework generalizable beyond our use case, and promising techniques for fast recalibration in a rapidly evolving adversarial system. Code and data: this https URL.

[NLP-176] LLiMba: Sardinian on a Single GPU – Adapting a 3B Language Model to a Vanishing Romance Language

【速读】: 该论文旨在解决低资源罗曼语方言——撒丁语(Sardinian)在现代自然语言处理(Natural Language Processing, NLP)中严重缺失的问题,即商业服务不支持撒丁语,且现有语言模型无法可靠生成该语言。解决方案的关键在于构建一个针对撒丁语优化的3B参数模型LLiMba,其基于Qwen2.5-3B-Instruct通过持续预训练(Continued Pretraining, CPT)和监督微调(Supervised Fine-Tuning, SFT)完成适配,仅使用单张24 GB消费级GPU即可实现。核心创新包括:1)构建包含1150万词元的撒丁语文本语料库(涵盖LSC、Logudorese和Campidanese方言),并引入240万词元相关罗曼语文本作为回放机制以缓解语域模糊;2)系统比较五种SFT配置(全微调、LoRA r64、rsLoRA r128/r256、DoRA r256),发现高秩适配器(如rsLoRA r256)在翻译性能上最优(BLEU达28.5),且适配能力优于LoRA变体,表明适配器容量比具体LoRA结构选择更为关键,同时揭示了翻译指标无法捕捉的定性差异(如脚本泄漏)。

链接: https://arxiv.org/abs/2605.09015
作者: Luca Ballore
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Sardinian, a Romance language with roughly one million speakers, has minimal presence in modern NLP. Commercial services do not support it, and current language models do not produce it reliably. We present LLiMba, a 3B parameter Sardinian-ready model adapted from Qwen2.5-3B-Instruct through continued pretraining (CPT) and supervised fine-tuning (SFT) on a single 24 GB consumer GPU. The corpus contains 11.5 million tokens of Sardinian spanning LSC, Logudorese, and Campidanese, augmented with 2.4 million tokens of related Romance text as replay against register blurring. After CPT the model reaches a perplexity of 6.76 on held out Sardinian and outperforms the base across all six FLORES-200 directions. We compare five SFT configurations under matched conditions: full fine-tuning, LoRA r64, rsLoRA r128, rsLoRA r256, and DoRA r256. rsLoRA r256 wins on every direction into Sardinian, reaching 28.5 BLEU from English against 17.3 after CPT and 21.0 with full fine-tuning. The rank ablation places r128 between LoRA r64 and rsLoRA r256 on BLEU but reveals failure modes invisible to the metric, including leakage across scripts no other variant produces. LoRA r64 retains less factual content from SFT than configurations at higher rank and produces more confident fabrications, though all methods fabricate on content absent from training. DoRA r256 yields the smallest gap between training and evaluation but the worst factual accuracy. The findings indicate that adapter capacity matters more than the choice among LoRA variants for adapting a Romance pretrained base to a low resource Romance target, that stronger regularization is not uniformly beneficial, and that translation metrics smoothly order configurations whose qualitative behavior differs categorically. Perplexity comparisons across scripts must account for byte fallback tokenization, which deflates the metric for scripts other than Latin.

[NLP-177] Relative Kinetic Utility for Reasoning -Aware Structural Pruning in Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在使用链式思维(Chain-of-Thought, CoT)提示时,因推理过程中生成过长的CoT序列而导致的推理延迟高和键值缓存(Key-Value Cache, KV Cache)内存瓶颈问题。现有基于幅度的结构剪枝方法虽能缓解静态参数负担,但易陷入“幅度陷阱”——即过度依赖离散交叉熵目标函数,误剪高频低信息量的语法token,导致在高稀疏度下(如40%)出现推理能力崩溃。其解决方案的关键在于提出一种新的理论框架——相对动能效用(Relative Kinetic Utility, RKU),该框架通过交替梯度流(Alternating Gradient Flow, AGF)将离散剪枝提升为对模型深度流形上的连续动能积分,并引入Fisher迹归一化以实现轻量级曲率感知归一化,从而精准识别并保留高曲率逻辑路由的核心结构路径(即“动能尖峰”),显著提升高稀疏度下的推理性能,在GSM8K基准上于40%稀疏度下达到13.34%准确率,优于最强基线。

链接: https://arxiv.org/abs/2605.09008
作者: Tianhao Qian
机构: Southeast University (东南大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 15 pages, 3 figures

点击查看摘要

Abstract:Chain-of-Thought (CoT) prompting marked a major improvement in the reasoning capabilities of Large Language Models (LLMs). However, scaling up test-time computation yields extensive CoT sequences, introducing severe inference latency and key-value (KV) cache memory bottlenecks. While structural pruning offers a fundamental, hardware-aware solution to alleviate static parameter burdens, existing magnitude-based methods may cut off the neurons that CoT relies on: by over-indexing on discrete cross-entropy objectives, these heuristics fall into a magnitude trap: they prioritize high-frequency, low-information syntactic tokens and trigger a disappointing reasoning collapse at high sparsities (e.g., 40%). To overcome this topological phase transition, we propose Relative Kinetic Utility (RKU), a novel theoretical framework that elevates discrete pruning to a continuous kinetic integral over the depth manifold of the model, based on Alternating Gradient Flow (AGF). By modifying it with Fisher trace normalization, RKU acts as a lightweight curvature-aware normalization to isolate kinetic spikes – the fundamental structural pathways responsible for high-curvature logical routing. Extensive experiments on Qwen-2.5-7B and LLaMA-3-8B show improved performance in the high-sparsity regime around 40%. RKU attains 13.34% accuracy on GSM8K at 40% sparsity, outperforming the strongest baseline, and appears to better preserve reasoning-relevant representations under out-of-distribution evaluation.

[NLP-178] Dolphin-CN-Dialect: Where Chinese Dialects Matter

【速读】: 该论文旨在解决中文多方言场景下自动语音识别(ASR)模型在数据不平衡、训练稳定性差及方言识别性能不足等问题。其关键解决方案包括:提出基于温度的采样策略以有效平衡标准普通话与低资源方言的数据分布,从而显著提升方言识别准确率;重新设计分词器,采用汉字级建模处理中文、子词级建模处理英文,并引入可扩展的方言标记(dialect tokens),更好地适配语言特性;同时优化训练稳定性和数据处理流程,最终实现比前代模型Dolphin更高的方言识别精度和字符错误率(CER)降低,且模型规模更小、支持流式与非流式推理,具备硬件友好部署和热词定制等实用能力。

链接: https://arxiv.org/abs/2605.08961
作者: Yangyang Meng,Huihang Zhong,Guodong Lin,Guanbo Wang,Hu Du,Zhiming Shao,Yukai Huang,Ke Li,Wei-Qiang Zhang
机构: Dataocean AI; Speech and Audio Technology Lab, Dept. EE, Tsinghua University (清华大学电子工程系语音与音频技术实验室)
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:We present Dolphin-CN-Dialect, a streaming-capable ASR model with a focus on Chinese and dialect-rich scenarios. Compared to the previous version, Dolphin-CN-Dialect introduces substantial improvements in data processing, tokenization, training stability, and data sampling strategies. To address the challenges of highly imbalanced dialect data, we propose a temperature-based sampling strategy that effectively balances standard Mandarin and low-resource dialects, leading to significant gains in dialect recognition performance. In addition, we redesign the tokenizer to better align with linguistic characteristics, adopting character-level modeling for Chinese and subword modeling for English, while introducing extensible dialect tokens. Experimental results show that Dolphin-CN-Dialect achieves improvement in dialect recognition accuracy and CER reduction compared to Dolphin. Furthermore, Dolphin-CN-Dialect reaches competitive performance with recent SOTA open-source ASR models, while maintaining a significantly smaller model size. Dolphin-CN-Dialect supports both streaming and non-streaming inference, enabling a practical balance between latency and accuracy. It also provides flexible customization through hotword support and efficient deployment optimized for specialized hardware. These improvements make Dolphin-CN-Dialect a strong and practical solution for real-world multi-dialect ASR applications.
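
One common way to realize a temperature-based sampling strategy over imbalanced dialect corpora is shown below; the exact rule and the data volumes are assumptions for illustration, not Dolphin-CN-Dialect's configuration:

```python
import numpy as np

def dialect_sampling_probs(hours_per_dialect, temperature=3.0):
    """One common temperature-based sampling rule: q_i proportional to p_i ** (1/T).

    T = 1 reproduces the raw (imbalanced) data distribution; larger T flattens
    it, upweighting low-resource dialects relative to standard Mandarin.
    """
    counts = np.array(list(hours_per_dialect.values()), dtype=float)
    p = counts / counts.sum()
    q = p ** (1.0 / temperature)
    q /= q.sum()
    return dict(zip(hours_per_dialect, q))

# Illustrative (made-up) data volumes in hours.
corpus = {"Mandarin": 50_000, "Cantonese": 4_000, "Sichuanese": 1_500, "Hokkien": 500}
for name, prob in dialect_sampling_probs(corpus, temperature=3.0).items():
    print(f"{name:>10s}: {prob:.3f}")
```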

[NLP-179] Improving Lexical Difficulty Prediction with Context-Aligned Contrastive Learning and Ridge Ensembling

【速读】: 该论文旨在解决词汇难度预测(Lexical Difficulty Prediction)问题,特别是如何在不同母语(L1)背景下准确估计词汇难度,而现有方法依赖于仅使用标量监督的回归训练,未能显式构建表示空间,限制了对跨语言对齐和词汇难度序数结构的捕捉能力。解决方案的关键在于提出一种名为“上下文对齐对比回归”(Context-Aligned Contrastive Regression)的新框架,其核心是将岭回归集成(Ridge regression ensemble)与两个互补目标相结合:跨视图上下文对齐(Cross-View Context)和序数软对比学习(Ordinal Soft Contrastive Learning),从而提升跨语言表示对齐性、保留语言特异性,并有效建模词汇难度的序数关系,同时通过集成策略缓解个体模型的系统性偏差,实现更稳定的性能表现。

链接: https://arxiv.org/abs/2605.08950
作者: Wicaksono Leksono Muhamad,Joanito Agili Lopo,Tsamarah Rana Nugraha,Ahmad Cahyono Adi,Muhammad Oriza Nurfajri
机构: Mantera Studio; The University of Manchester; Universitas Gadjah Mada
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Lexical difficulty prediction is a fundamental problem in language learning and readability assessment, requiring models to estimate word difficulty across different first-language (L1) backgrounds. However, existing approaches rely on regression-only training with scalar supervision, which does not explicitly structure the representation space, limiting their ability to capture cross-lingual alignment and ordinal difficulty. To mitigate these issues, we propose Context-Aligned Contrastive Regression, which integrates Ridge regression ensemble with two complementary objectives, i.e., Cross-View Context and Ordinal Soft Contrastive Learning. Experiments on three L1 datasets show that (i) contrastive objectives improve cross-lingual representation alignment while preserving language-specific nuances, (ii) the learned representations capture the ordinal structure of lexical difficulty, and (iii) the ensemble effectively mitigates systematic biases of individual models, leading to more stable performance across difficulty levels.

[NLP-180] Decomposing and Steering Functional Metacognition in Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在评估环境中表现出的“评价意识”(evaluation awareness)现象是否仅为表面行为偏差,还是反映了模型内部可分解的元认知状态结构这一关键问题。解决方案的关键在于提出并验证了一个功能型元认知状态空间(functional metacognitive states)的理论框架,通过残差流分析和激活扰动实验,证明这些状态(如自我能力评估、风险感知、计算资源分配等)可从模型内部激活中线性解码,并且各自独立地因果调控推理行为(如冗余度、准确性和安全性)。这一发现表明,基准测试性能不仅体现任务能力,还受特定元认知状态激活的影响,从而为可靠评估与部署推理模型提供了机制层面的理解路径。

链接: https://arxiv.org/abs/2605.08942
作者: Yanshi Li,Xueru Bai,Shuman Liu,Haibo Zhang,Anxiang Zeng
机构: Shopee(虾皮); Sea(东南亚科技公司)
类目: Computation and Language (cs.CL)
备注: 18 pages, 7 figures

点击查看摘要

Abstract:Large language models (LLMs) increasingly exhibit behaviors suggesting awareness of their evaluation context, often adapting their reasoning strategies in benchmark settings. Prior work has shown that such evaluation awareness can distort performance measurements; however, it remains unclear whether this phenomenon reflects a single behavioral artifact or a deeper internal structure within the model. We propose that LLMs maintain a decomposable space of functional metacognitive states: internal variables encoding factors such as evaluation awareness, self-assessed capability, perceived risk, computational effort allocation, audience expertise adaptation, and intentionality. Through residual stream analysis across multiple reasoning models, we demonstrate that these states are linearly decodable from internal activations and exhibit distinct layer-wise profiles. Moreover, by steering model activations along probe-derived directions, we show that each functional metacognitive state causally modulates reasoning behavior in dissociable ways, affecting verbosity, accuracy, and safety-related responses across tasks. Our findings suggest that benchmark performance reflects not only task competence but also the activation of specific functional metacognitive states. We argue that understanding and controlling these internal states is essential for reliable evaluation and deployment of reasoning models, and we provide a mechanistic framework for studying functional metacognition in artificial systems. Our code and data are publicly available at this https URL.
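
Steering along probe-derived directions typically amounts to adding a scaled direction vector to a block's residual-stream output; the sketch below does this with a PyTorch forward hook on a toy block. The probe direction, scale, and module are placeholders, not the authors' setup:

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Stand-in for one transformer block's residual-stream output."""
    def __init__(self, d_model=16):
        super().__init__()
        self.linear = nn.Linear(d_model, d_model)

    def forward(self, x):
        return x + self.linear(x)

def add_steering_hook(module, direction, alpha):
    """Register a forward hook that shifts the block output along a
    probe-derived direction, scaled by alpha (positive or negative)."""
    direction = direction / direction.norm()

    def hook(_module, _inputs, output):
        return output + alpha * direction

    return module.register_forward_hook(hook)

torch.manual_seed(0)
block = ToyBlock()
probe_direction = torch.randn(16)        # e.g. an "evaluation awareness" probe
x = torch.randn(2, 4, 16)                # (batch, tokens, hidden)

baseline = block(x)
handle = add_steering_hook(block, probe_direction, alpha=3.0)
steered = block(x)
handle.remove()

shift = (steered - baseline) @ (probe_direction / probe_direction.norm())
print("mean shift along probe direction:", shift.mean().item())
```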

[NLP-181] Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes

【速读】: 该论文旨在解决自回归推理(autoregressive inference)在特定硬件后端(Apple MPS)下出现的非单调延迟行为问题,即在相邻解码预算配置之间出现突发性延迟激增(最高达21倍),而传统认知认为延迟应随解码长度平滑变化。其关键发现是:此类异常并非由内存压力或预填充(prefill)开销引起,而是与后端执行动态(backend execution dynamics)密切相关,且CPU和NVIDIA T4(CUDA)平台未观察到类似现象。这表明硬件感知评估对自回归推理性能至关重要,并警示不应仅依赖聚合的解码预算基准测试,因性能可能在邻近配置间呈现不连续波动。

链接: https://arxiv.org/abs/2605.08913
作者: Willy Fitra Hendria
机构: 未知
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Computation and Language (cs.CL); Performance (cs.PF)
备注: 9 pages, 5 figures, 6 tables

点击查看摘要

Abstract:Autoregressive inference is typically assumed to scale predictably with decoding length, and key-value (KV) caching is widely regarded as a universally beneficial optimization for accelerating decoding. In this work, we identify unexpected non-monotonic latency behavior in the Apple MPS backend, where latency changes abruptly across nearby decoding configurations. Using transformer models from multiple families (GPT-2, BLOOM, and OPT), we observe latency spikes of up to 21x within specific decoding-budget intervals, followed by recovery at neighboring configurations. Controlled experiments show that these anomalies are not explained by memory pressure or prefill cost, but are instead consistent with backend execution dynamics, while CPU and NVIDIA T4 (CUDA) exhibit smooth monotonic scaling under identical conditions. Our findings highlight the importance of hardware-aware evaluation for autoregressive inference and caution against relying on aggregated decoding-budget benchmarks, as performance can vary discontinuously across nearby configurations.

[NLP-182] LLM -Agnostic Semantic Representation Attack

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在采用对齐技术以防止有害输出时,仍易受对抗性提示(adversarial prompts)攻击的问题。现有基于词元(token-level)优化的方法主要依赖于精确匹配肯定模板(如“Sure, here is…”),但存在收敛性能不佳、提示自然度下降及跨模型泛化能力弱等瓶颈。其解决方案的关键在于提出一种名为语义表示攻击(Semantic Representation Attack, SRA)的新范式,将对抗目标从精确文本匹配重构为恶意语义表示的生成,并通过语义一致性-收敛关系理论证明保持语义一致性可同时保障白盒语义收敛与黑盒迁移性;技术层面则设计了语义表示启发式搜索(Semantic Representation Heuristic Search, SRHS)算法,在离散词元块逐步扩展过程中保留提示的可解释性和结构连贯性,从而实现高成功率(平均99.71%)和强迁移性。

链接: https://arxiv.org/abs/2605.08898
作者: Jiawei Lian,Jianhong Pan,Lefan Wang,Yi Wang,Tairan Huang,Shaohui Mei,Lap-Pui Chau
机构: Northwestern Polytechnical University (西北工业大学); The Hong Kong Polytechnic University (香港理工大学); City University of Hong Kong (香港城市大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: arXiv admin note: substantial text overlap with arXiv:2509.19360

点击查看摘要

Abstract:Large Language Models (LLMs) increasingly employ alignment techniques to prevent harmful outputs. Despite these safeguards, attackers can circumvent them by crafting adversarial prompts. Predominant token-level optimization methods primarily rely on optimizing for exact affirmative templates (e.g., “Sure, here is…”). However, these paradigms frequently encounter bottlenecks such as suboptimal convergence, compromised prompt naturalness, and poor cross-model generalization. To address these limitations, we propose Semantic Representation Attack (SRA), a novel LLM-agnostic paradigm that fundamentally reconceptualizes adversarial objectives from exact textual targeting to malicious semantic representations. Theoretically, we establish the semantic Coherence-Convergence Relationship and derive a Cross-Model Semantic Generalization bound, proving that maintaining semantic coherence guarantees both white-box semantic convergence and black-box transferability. Technically, we operationalize this framework via the Semantic Representation Heuristic Search (SRHS) algorithm, which preserves interpretability and structural coherence of the adversarial prompts during incremental discrete token chunk expansion. Extensive evaluations demonstrate that our framework achieves a 99.71% average attack success rate across 26 open-source LLMs, with strong transferability and stealth.

[NLP-183] Frag ileFlow: Spectral Control of Correct-but-Frag ile Predictions for Foundation Model Robustness

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)和视觉语言模型(Vision-Language Models, VLMs)在面对扰动时的结构化脆弱性问题,即传统平均准确率或一致性指标可能掩盖一种“正确但脆弱”的预测现象:模型虽仍输出正确类别,但其置信度分布已开始向决策边界附近的错误类别迁移。解决方案的关键在于提出一种名为FragileFlow的插件式正则化方法,其核心机制是利用校准的边际缓冲区识别此类正确但脆弱的预测,并将离类概率质量组织为类级别的脆弱风险矩阵;理论层面,作者首次建立了针对该边际感知误差流(margin-aware error flow)的PAC-Bayes上界,证明了经验谱控制可在稳定性条件下导向确定性最差类鲁棒性的保守路径。

链接: https://arxiv.org/abs/2605.08896
作者: Zhuoyun Li,Boxuan Wang,Jinwei Hu,Xiaowei Huang,Yi Dong
机构: University of Liverpool (利物浦大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Robust adaptation of LLMs and VLMs is often evaluated by average accuracy or average consistency under perturbations. However, these averages can hide a structured failure mode: a prediction may remain correct while probability mass already flows from particular true classes toward systematic wrong competitors near the decision boundary. In this paper, we formalize this phenomenon as margin-aware error flow and introduce FragileFlow, a plug-in regularizer that uses a calibrated margin buffer to identify correct-but-fragile predictions and organize their off-class probability mass into a class-wise vulnerable-risk matrix. Theoretically, we provide the first PAC-Bayes upper bound for this margin-aware error-flow object, showing how empirical spectral control yields a conservative route to deterministic worst-class robustness under a stability condition. Experiments on multiple-choice LLM benchmarks and few-shot CLIP adaptation show that FragileFlow consistently improves the proposed theory-facing risk measures over matched baselines, yields perturbed worst-class accuracy gains in most settings, and preserves clean accuracy across comparisons.
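
A minimal sketch of the margin-buffer idea: flag predictions that are correct but whose top-1 vs. runner-up probability margin falls below a buffer, and accumulate their off-class probability mass into a class-wise risk matrix. The margin definition and buffer value are assumptions, not the paper's exact calibration:

```python
import torch

def fragile_risk_matrix(probs, labels, margin_buffer=0.1):
    """Identify correct-but-fragile predictions (correct, but top-1 margin below
    a buffer) and accumulate their off-class probability mass into a
    class-by-class vulnerable-risk matrix."""
    num_classes = probs.shape[1]
    top2 = probs.topk(2, dim=1)
    preds = top2.indices[:, 0]
    margins = top2.values[:, 0] - top2.values[:, 1]
    fragile = (preds == labels) & (margins < margin_buffer)

    risk = torch.zeros(num_classes, num_classes)
    for p, y, is_fragile in zip(probs, labels, fragile):
        if is_fragile:
            off = p.clone()
            off[y] = 0.0               # only mass flowing away from the true class
            risk[y] += off
    return fragile, risk

torch.manual_seed(0)
probs = torch.softmax(torch.randn(8, 4), dim=1)
labels = probs.argmax(dim=1)            # pretend every prediction is correct
fragile, risk = fragile_risk_matrix(probs, labels, margin_buffer=0.15)
print("fragile predictions:", fragile.tolist())
print("class-wise vulnerable-risk matrix:\n", risk)
```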

[NLP-184] Fitting Is Not Enough: Smoothness in Extremely Quantized LLM s

【速读】: 该论文旨在解决极端低比特量化(extremely low-bit quantization)导致的大语言模型(Large Language Models, LLMs)性能下降问题,特别是现有方法仅关注数值精度损失而忽视了模型平滑性(smoothness)退化的问题。研究表明,随着量化位宽降低,模型在预测邻域内的有效token候选数量急剧减少,从而导致解码树稀疏化和生成质量下降。解决方案的关键在于引入一种平滑性保持原则(smoothness-preserving principle),通过在后训练量化(post-training quantization)和量化感知训练(quantization-aware training)中显式保留模型输出的平滑特性,从而在数值精度之外进一步提升量化模型的性能。

链接: https://arxiv.org/abs/2605.08894
作者: Yuzhuang Xu,Xu Han,Yuxuan Li,Pengzhan Li,Wanxiang Che
机构: Harbin Institute of Technology (哈尔滨工业大学); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 19 pages, 4 tables, 14 figures

点击查看摘要

Abstract:Large language models (LLMs) achieve strong performance but incur high deployment costs, motivating extremely low-bit but lossy quantization. Existing quantization algorithms mainly focus on improving the numerical accuracy of forward computation to eliminate performance degradation. In this paper, we show that extremely quantized LLMs suffer from systematic smoothness degradation beyond numerical precision loss. Through a smoothness proxy, we observe that such degradation becomes increasingly severe as the quantization bit-width decreases. Furthermore, based on sequence neighborhood modeling, we find that quantized models exhibit a rapid reduction of effective token candidates within the prediction neighborhood, which directly leads to a sparser decoding tree and degraded generation quality. To validate it, we introduce a simple smoothness-preserving principle in both post-training quantization and quantization-aware training, and demonstrate that preserving smoothness brings additional gains beyond numerical accuracy. The core goal of this paper is to highlight smoothness preservation as an important design consideration for future extreme quantization methods. Code is available at this https URL.

[NLP-185] Machine Learning Research Has Outpaced Its Communication Norms and NeurIPS Should Act

【速读】: 该论文旨在解决机器学习领域(特别是NeurIPS会议)学术写作可读性持续下降的问题,这一趋势可能导致研究成果难以传播和整合,进而影响科学知识的积累与共享。其核心解决方案是提出一套可量化、可执行的写作标准,以提升论文对人类读者的可读性和影响力,关键在于引入七个具体措施:包括设定术语缩写预算与官方词表、设置人类可读性阈值、加强引用规范、鼓励独立可视化元素、提供通俗摘要、预注册缩写词表以及开源审计工具,从而推动科研交流从“数量增长”向“质量优化”转型。

链接: https://arxiv.org/abs/2605.08889
作者: Ajay Mandyam Rangarajan,Jeyashree Krishnan
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Digital Libraries (cs.DL)
备注: 9 pages, 11 figures, 7 tables

点击查看摘要

Abstract:Machine learning research has grown exponentially while its communication norms have not. We argue NeurIPS should adopt explicit, measurable writing standards. We analyze 2.8 million arXiv papers (1991-2025), 24,772 NeurIPS papers (1987-2024), and 24.5 million PubMed papers (1990-2025), applying classical readability scores, the Hohmann writing style suite (including sensational language), acronym density and reuse, an LLM as judge readability protocol, and citations from OpenAlex and Semantic Scholar. Four patterns emerge. First, NeurIPS abstracts score harder to read on every classical readability metric: Flesch Reading Ease falls from about 24 in 1987 to 13 in 2024, and sensational language rises by about 50 percent in NeurIPS abstracts between 2015 and 2024. Second, acronym density in NeurIPS titles has grown from 0.33 per 100 words in 1987 to 3.21 in 2024, and about 89 percent of NeurIPS acronyms are used fewer than ten times, ten points above the science-wide baseline. Third, more readable NeurIPS papers tend to receive more citations, suggesting readability and impact are correlated and that less readable papers risk remaining fragmented. LLM as judge scores rate NeurIPS abstracts as roughly stable from 1987 to 2022, with early signs of improvement thereafter, a pattern that disagrees with every classical readability metric and raises a design question for enforcement: is the target reader a human or an LLM? Lastly, NeurIPS volume has grown roughly 50-fold between 1987 and 2024. Assuming the goal is to optimise for human readers, we propose seven standards NeurIPS could pilot at NeurIPS 2027: an acronym budget with a venue-approved term list, a human readability threshold, stricter citation standards, standalone visual elements, a plain language summary, a pre-registered acronym glossary, and open source audit tooling.
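
Two of the measurements above, Flesch Reading Ease and acronym density, are straightforward to compute; the sketch below uses the standard Flesch formula with a crude vowel-group syllable heuristic and a simple capital-letter acronym pattern (assumptions for illustration, not the paper's audit tooling):

```python
import re

def count_syllables(word: str) -> int:
    """Rough vowel-group syllable estimate (adequate for a readability demo)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> float:
    """Standard Flesch formula: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))

def acronym_density(text: str) -> float:
    """Acronyms (two or more consecutive capitals) per 100 words."""
    words = re.findall(r"\S+", text)
    acronyms = [w for w in words if re.fullmatch(r"[A-Z]{2,}s?", w.strip(".,;:()"))]
    return 100.0 * len(acronyms) / len(words)

abstract = ("We propose XYZ, a SOTA LLM framework. "
            "It unifies RAG and KV caching with a novel MoE router.")
print(f"Flesch Reading Ease: {flesch_reading_ease(abstract):.1f}")
print(f"Acronym density per 100 words: {acronym_density(abstract):.2f}")
```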

[NLP-186] DocScope: Benchmarking Verifiable Reasoning for Trustworthy Long-Document Understanding

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在长篇视觉丰富文档上生成可信赖、可验证推理过程的能力评估问题,即现有仅依赖最终答案准确率的评测方式不足以反映模型的真实推理质量。其解决方案的关键在于提出DocScope基准,将长文档问答建模为结构化的推理轨迹预测任务,要求模型输出证据页、支持区域、相关事实陈述及最终答案,并设计四阶段独立审计协议——页面定位(Page Localization)、区域标注(Region Grounding)、事实提取(Fact Extraction)和答案验证(Answer Verification),通过跨阶段解耦与人工对齐校准确保每一步推理的可信性,从而实现对推理链条完整性的精细测量。

链接: https://arxiv.org/abs/2605.08888
作者: Xiang Feng,Jiawei Zhou,Zhangfeng Huang,Kewei Wang,Shanshan Ye,Jinxin Hu,Zulong Chen,Yong Luo,Jing Zhang
机构: Wuhan University (武汉大学); Alibaba Group (阿里巴巴集团); University of Science and Technology of China (中国科学技术大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 50pages, 25 figures, 14 tables;

点击查看摘要

Abstract:Evaluating whether Multimodal Large Language Models can produce trustworthy, verifiable reasoning over long, visually rich documents requires evaluation beyond end-to-end answer accuracy. We introduce DocScope, a benchmark that formulates long-document QA as a structured reasoning trajectory prediction problem: given a complete PDF document and a question, the model outputs evidence pages, supporting evidence regions, relevant factual statements, and a final answer. We design a four-stage evaluation protocol – Page Localization, Region Grounding, Fact Extraction, and Answer Verification – that audits each level of the trajectory independently through inter-stage decoupling, with all judges selected and calibrated via human alignment studies. DocScope comprises 1,124 questions derived from 273 documents, with all hierarchical evidence annotations completed by human annotators. We benchmark 6 proprietary models, 12 open-weight models, and several domain-specific systems. Our experiments reveal that answer accuracy cannot substitute for trajectory-level evaluation: even among correct answers, the highest observed rate of complete evidence chains is only 29%. Across all models, region grounding remains the weakest trajectory stage. Furthermore, the primary difficulty stems from aggregating evidence dispersed across long distances and multiple document clusters, while an oracle study identifies faithful perception and fact extraction as the dominant capability bottleneck. Cross-architecture comparisons further suggest that activated parameter count matters more than total scale. The benchmark and code will be publicly released at this https URL.

[NLP-187] Ace-Skill: Bootstrapping Multimodal Agents with Prioritized and Clustered Evolution

【速读】: 该论文旨在解决自演化智能体(self-evolving agents)在持续适应过程中面临的两大耦合瓶颈:数据效率低下(data inefficiency)与知识干扰(knowledge interference)。前者表现为大量采样(rollout)资源被浪费在低价值样本上,而非高信息量的样本;后者则源于共享知识库中异构知识导致检索噪声和任务错位引导,进而形成“无效采样→噪声知识→更差采样”的自我强化失败循环。解决方案的关键在于提出Ace-Skill这一协同进化框架,通过两个核心机制实现突破:一是结合先验优先采样(prioritized sampler)与懒惰衰减熟练度追踪(lazy-decay proficiency tracking),精准分配采样资源至高信息量且未充分掌握的样本;二是引入语义聚类组织器(clustered organizer),对知识进行结构化分组以提升检索清晰度与适应可靠性。此双轨优化机制将自演化过程转化为良性循环,显著提升知识质量与后续采样的有效性,在多个多模态工具使用基准上实现显著性能增益(如Avg@4准确率提升35.46%),并支持零样本迁移至小规模模型,无需额外训练即可继承高级能力。

链接: https://arxiv.org/abs/2605.08887
作者: Feng Xiong,Zengbin Wang,Yong Wang,Xuecai Hu,Jinghan He,Liang Lin,Yuan Liu,Xiangxiang Chu
机构: AMAP, Alibaba Group; CASIA; BNU
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Self-evolving agents present a promising path toward continual adaptation by distilling task interactions into reusable knowledge artifacts. In practice, this paradigm remains hindered by two coupled bottlenecks: data inefficiency, where costly rollout effort is disproportionately spent on low-value samples rather than informative ones, and knowledge interference, where heterogeneous knowledge stored in shared repositories leads to noisy retrieval and task-misaligned guidance. Together, these issues form a self-reinforcing failure loop in which uninformative rollouts yield noisy knowledge, which in turn degrades subsequent rollouts. In this work, we introduce Ace-Skill, a co-evolutionary framework that jointly optimizes rollout allocation and knowledge organization for self-evolving multimodal agents. Specifically, Ace-Skill combines a prioritized sampler with lazy-decay proficiency tracking to focus rollouts on informative and insufficiently mastered samples, and a clustered organizer that semantically clusters knowledge for cleaner retrieval and more reliable adaptation. By improving sampling and organization together, Ace-Skill turns self-evolution into a virtuous cycle in which more informative rollouts produce higher-quality knowledge that supports stronger subsequent rollouts. Across four multimodal tool-use benchmarks, Ace-Skill delivers strong gains (e.g., +35.46% relative improvement in Avg@4 accuracy), enabling an open-source 35B MoE model to match or surpass proprietary models. The acquired knowledge also transfers effectively in a zero-shot manner to smaller 9B and 4B models, allowing resource-constrained agents to inherit advanced capabilities without additional training. The code is publicly available at this https URL.
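
A toy sketch of a prioritized sampler with lazy-decay proficiency tracking: each sample's decayed success estimate is updated only when the sample is drawn, and sampling weight favors samples that are not yet mastered. The decay rule and weighting are illustrative assumptions, not Ace-Skill's implementation:

```python
import random
from collections import defaultdict

class LazyDecayPrioritizedSampler:
    """Prioritized rollout sampler with lazily decayed per-sample proficiency."""

    def __init__(self, sample_ids, decay=0.9):
        self.decay = decay
        self.proficiency = defaultdict(float)   # decayed success estimate in [0, 1]
        self.last_step = defaultdict(int)
        self.sample_ids = list(sample_ids)
        self.step = 0

    def _decayed(self, sid):
        # Lazy decay: apply all missed decay steps only when the value is read.
        elapsed = self.step - self.last_step[sid]
        return self.proficiency[sid] * (self.decay ** elapsed)

    def draw(self):
        self.step += 1
        # Weight favors samples that are not yet mastered (low proficiency).
        weights = [1.0 - self._decayed(sid) for sid in self.sample_ids]
        return random.choices(self.sample_ids, weights=weights, k=1)[0]

    def update(self, sid, success: bool):
        prof = self._decayed(sid)
        self.proficiency[sid] = self.decay * prof + (1 - self.decay) * float(success)
        self.last_step[sid] = self.step

random.seed(0)
sampler = LazyDecayPrioritizedSampler(["task_a", "task_b", "task_c"])
for _ in range(20):
    sid = sampler.draw()
    sampler.update(sid, success=(sid == "task_a"))   # pretend task_a is easy
print({sid: round(sampler._decayed(sid), 2) for sid in sampler.sample_ids})
```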

[NLP-188] Max-pooling Network Revisited: Analyzing the Role of Semantic Probability in Multiple Instance Learning for Hallucination Detection

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中幻觉检测(Hallucination Detection)的可靠性问题,特别是针对当前先进混合方法(如HaMI)因重复采样和高成本语义相似性计算导致的显著计算开销。其解决方案的关键在于:通过理论分析发现,将内部状态与语义一致性进行缩放可扩大决策边界(decision margin),进而基于此洞察重新审视经典句子分类模型,采用最大池化(max pooling)聚合词级别特征,并利用轻量级多层感知机(MLP)直接预测句子得分,从而在无需语义一致性计算的前提下实现高效且性能优越的幻觉检测。

链接: https://arxiv.org/abs/2605.08863
作者: Shota Fujikawa,Issei Sato
机构: The University of Tokyo (东京大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Hallucination detection has become increasingly important for improving the reliability of large language models (LLMs). Recently, hybrid approaches such as HaMI, which combine semantic consistency with internal model states via Multiple Instance Learning (MIL), have achieved state-of-the-art performance. However, these methods incur substantial computational overhead due to repeated sampling and costly semantic similarity computations. In this work, we first provide a theoretical analysis of HaMI in terms of decision margins, revealing that scaling internal states with semantic consistency leads to an enlarged decision margin. Motivated by this insight, we revisit classical sentence classification models from a margin enlargement perspective, aggregating token-level features via max pooling and directly estimating sentence scores using a lightweight MLP. Without requiring semantic consistency computations, our approach achieves substantial efficiency improvements while maintaining competitive performance with state-of-the-art baselines through adaptive aggregation of internal feature representations.
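
The revisited design, max pooling over token-level internal states followed by a lightweight MLP, fits in a few lines of PyTorch; the dimensions and the sigmoid scoring head below are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class MaxPoolHallucinationScorer(nn.Module):
    """Aggregate token-level hidden states with max pooling, then score the
    sentence with a lightweight MLP (a sketch of the revisited design)."""

    def __init__(self, hidden_dim=4096, mlp_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, mlp_dim),
            nn.ReLU(),
            nn.Linear(mlp_dim, 1),
        )

    def forward(self, token_states, attention_mask):
        # token_states: (batch, seq_len, hidden_dim) internal states from an LLM
        # attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
        masked = token_states.masked_fill(
            attention_mask.unsqueeze(-1) == 0, float("-inf")
        )
        pooled = masked.max(dim=1).values                    # max pool over tokens
        return torch.sigmoid(self.mlp(pooled)).squeeze(-1)   # hallucination score

torch.manual_seed(0)
scorer = MaxPoolHallucinationScorer(hidden_dim=32, mlp_dim=16)
states = torch.randn(2, 10, 32)
mask = torch.ones(2, 10, dtype=torch.long)
mask[1, 6:] = 0                                              # second example is shorter
print(scorer(states, mask))
```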

[NLP-189] Architecture Not Scale: Circuit Localization in Large Language Models

【速读】: 该论文试图解决的问题是:随着模型规模扩大,其内部机制的可解释性是否会随之变难。传统观点认为,模型参数量增加会导致电路分析(circuit analysis)难度上升,但本文挑战了这一假设。解决方案的关键在于指出,注意力架构(attention architecture)的影响远大于参数数量本身;具体而言,分组查询注意力(grouped query attention)相较于标准多头注意力(multi-head attention),在相同规模下能产生更集中且机制更稳定的电路结构,并且在事实记忆、间接宾语识别和归纳头等任务中均表现出更强的可解释性。此外,在Qwen2.5模型中发现,事实回忆电路存在一个临界规模点,超过该点后会经历离散相变,坍缩为单一瓶颈而非渐进退化,表明某些架构选择可以显著提升大规模模型的可解释性。

链接: https://arxiv.org/abs/2605.08853
作者: Sohan Venkatesh
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Mechanistic interpretability assumes that circuit analysis becomes harder as models scale. We challenge this assumption by showing that the attention architecture matters more than parameter count. Studying three circuit types across Pythia and Qwen2.5, we find that grouped query attention produces circuits that are far more concentrated and mechanistically stable than standard multi-head attention at comparable scales. The same concentration pattern holds across indirect object identification, induction heads, and factual recall. Within a single architecture family (Qwen2.5), factual recall circuits undergo a discrete phase transition above a critical scale, collapsing to a single bottleneck rather than degrading gradually. These findings suggest that some architectural choices make large models more tractable to study and that interpretability difficulty is not a fixed consequence of model size.

[NLP-190] EmoS: A High-Fidelity Multimodal Benchmark for Fine-grained Streaming Emotional Understanding ACL–2026

【速读】: 该论文旨在解决现有情感识别基准在生态效度(ecological validity)、信号清晰度和细粒度标签可靠性方面难以兼得的问题。其核心解决方案是构建一个高保真双语情感基准数据集EmoS,通过严格筛选的静态片段与动态流式独白子集相结合的方式,提升数据的生态真实性和噪声控制能力;同时,依托双层人工标注流程确保情感演变过程的连续性标注质量,从而为多模态大语言模型(MLLMs)的情感识别与共情建模提供可靠训练与评估基础。

链接: https://arxiv.org/abs/2605.08847
作者: Pengze Guo,Jingxi Liang,Zhiwen Xie,Qifeng Wang,Derek F. Wong
机构: University of Macau (澳门大学); Central China Normal University (华中师范大学)
类目: Computation and Language (cs.CL)
备注: acl - 2026 main accepted

点击查看摘要

Abstract:In the context of today’s high-pressure, aging society, the demand for large-scale emotional models capable of providing empathetic support is more critical than ever. However, existing benchmarks fail to simultaneously achieve ecological validity, signal clarity, and reliable fine-grained labeling. We introduce EmoS, a high-fidelity bilingual benchmark designed to resolve the limitations of ecological validity and noise in existing datasets by combining strictly filtered static slices with a dynamic Streaming Monologue subset. Supported by a rigorous dual-layer human annotation pipeline, EmoS provides trusted ground truth that captures continuous emotional evolution. Empirical results show that fine-tuning MLLMs (multimodal large language models) on EmoS yields significant gains over zero-shot baselines, laying the foundation for the training and evaluation of future emotion recognition models and empathy models. The dataset and code are publicly available at this https URL.

[NLP-191] XPERT: Expert Knowledge Transfer for Effective Training of Language Models

【速读】: 该论文旨在解决如何从预训练的混合专家(Mixture-of-Experts, MoE)大语言模型中提取并重用具有跨领域通用性的专家知识,以提升不同规模语言模型的训练效率与性能。其核心问题在于,尽管MoE模型中的部分专家在多个知识域中稳定激活,但这些专家所编码的通用知识尚未被有效挖掘和利用。解决方案的关键在于提出XPERT框架:首先通过仅推理分析识别出跨域专家;继而利用张量分解技术优化其表征;最终将提取的知识适配并应用于下游模型训练,从而实现专家级知识的有效迁移与复用。实验表明,采用该方法的模型在语言理解与对话生成任务上表现出更强性能与更快收敛速度。

链接: https://arxiv.org/abs/2605.08842
作者: Chang Liu,Boyu Shi,Xu Yang,Xin Geng
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Mixture-of-Experts (MoE) language models organize knowledge into explicitly routed expert modules, making expert-level representations traceable and analyzable. By analyzing expert activation patterns in MoE large language models (LLMs), we find that a subset of experts is consistently activated across diverse knowledge domains. These common experts encode cross-domain, generalizable knowledge that is closely related to model generalization, naturally raising the question of how such identifiable expert knowledge can be practically reused. Motivated by this observation, we propose XPERT, a framework that extracts, consolidates, and reuses expert knowledge from pre-trained MoE LLMs to support more effective training of language models across different model scales. XPERT identifies cross-domain experts via inference-only analysis, refines their representations through tensor decomposition, and adapts the extracted knowledge to reuse in downstream models. Experiments on language understanding and dialogue generation benchmarks show that models benefiting from reused expert knowledge achieve consistently stronger performance and faster convergence compared to strong baselines. These results highlight MoE LLMs as structured and reusable knowledge sources, and demonstrate the value of expert-level knowledge reuse for improving model training.

[NLP-192] ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing ICLR2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成式推理过程中因Key-Value (KV) 缓存内存需求激增而导致的效率瓶颈问题,尤其针对长序列场景下KV缓存管理的挑战。现有缓存淘汰方法通常仅依据注意力权重保留重要KV对,但忽略了删除token所引发的注意力重分配效应以及KV选择中的时空动态特性。其解决方案的关键在于提出ReST-KV方法,该方法将KV缓存淘汰建模为一个优化问题,通过分层输出重建(layer-wise output reconstruction)最小化模型输出差异,从而自然捕捉注意力重分配效应;同时引入指数移动平均平滑(exponential moving average smoothing)以应对时间维度上的波动,并设计基于自适应窗口的空间模式建模机制,实现对KV缓存更鲁棒、更精准的淘汰策略。

链接: https://arxiv.org/abs/2605.08840
作者: Yongqi An,Chang Lu,Kuan Zhu,Tao Yu,Chaoyang Zhao,Hong Wu,Ming Tang,Jinqiao Wang
机构: Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences(中科院自动化所基础模型研究中心); School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院); University of Electronic Science and Technology of China(电子科技大学); Wuhan AI Research(武汉人工智能研究院); Objecteye Inc.(Objecteye公司)
类目: Computation and Language (cs.CL)
备注: Accepted at ICLR 2026. Project Page: this https URL

点击查看摘要

Abstract:Large language models (LLMs) face growing challenges in efficient generative inference due to the increasing memory demands of Key-Value (KV) caches, especially for long sequences. Existing eviction methods typically retain KV pairs with high attention weights but overlook the impact of attention redistribution caused by token removal, as well as the spatial-temporal dynamics in KV selection. In this paper, we propose ReST-KV, a robust KV eviction method that combines layer-wise output Reconstruction and Spatial-Temporal smoothing to provide a more comprehensive perspective for the KV cache eviction task. Specifically, ReST-KV formulates KV cache eviction as an optimization problem that minimizes output discrepancies through efficient layer-wise reconstruction. By directly modeling how each token’s removal affects the model output, our method naturally captures attention redistribution effects, going beyond simplistic reliance on raw attention weights. To further enhance robustness, we design exponential moving average smoothing to handle temporal variations and an adaptive window-based mechanism to capture spatial patterns. Our method, ReST-KV, significantly advances performance on long-context benchmarks. It surpasses state-of-the-art baselines by 2.58% on LongBench and 15.2% on RULER. Additionally, ReST-KV consistently outperforms existing methods on Needle-in-a-Haystack and InfiniteBench, all while achieving a remarkable 10.61 \times reduction in decoding latency at 128k context length. The code is publicly available at this https URL to facilitate reproducibility and further research.
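
The temporal-smoothing component can be pictured as an exponential moving average over per-token retention scores followed by a top-k keep; the sketch below is a simplified stand-in (the real method scores tokens via layer-wise output reconstruction and adds spatial windowing, neither of which is reproduced here):

```python
import torch

def ema_smoothed_eviction(scores_per_step, cache_budget, beta=0.8):
    """Sketch of temporally smoothed KV eviction: per-token retention scores are
    smoothed with an exponential moving average across decoding steps, then
    only the top-`cache_budget` tokens are kept."""
    ema = None
    for scores in scores_per_step:                 # each: (num_cached_tokens,)
        ema = scores if ema is None else beta * ema + (1 - beta) * scores
    keep = torch.topk(ema, k=min(cache_budget, ema.numel())).indices
    return torch.sort(keep).values                 # indices of KV pairs to retain

torch.manual_seed(0)
num_tokens, steps = 12, 5
# Stand-in retention scores (e.g. from a layer-wise reconstruction objective),
# one vector per decoding step.
score_history = [torch.rand(num_tokens) for _ in range(steps)]
kept = ema_smoothed_eviction(score_history, cache_budget=6)
print("retained KV indices:", kept.tolist())
```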

[NLP-193] Generating Leakage-Free Benchmarks for Robust RAG Evaluation

【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)模型评估中存在的“知识泄露”(knowledge leakage)问题,即许多基准测试数据集中的问题实际上可仅凭大语言模型(Large Language Models, LLMs)的参数化记忆回答,从而导致评估结果不可靠。随着基准数据集被反复用于训练,其内容逐渐被模型吸收,进一步加剧了这一问题,形成“基准老化”(benchmark aging)。解决方案的关键在于提出SeedRG——一个半合成基准生成流水线:首先从原始种子基准中提取问题-上下文对的推理图(reasoning graph)以捕捉其结构化推理模式;随后通过类型约束的实体替换生成新实例,确保语义结构一致但内容新颖,避免落入模型已有参数知识;同时引入两个验证步骤:推理图一致性检查以维持任务难度,以及知识泄露过滤器以剔除无需检索即可解答的样本,从而构建出更可靠、可持续的RAG评估基准。

链接: https://arxiv.org/abs/2605.08838
作者: Jiayi Liu,Jiaxing Zhang,Bowen Jin,Jennifer Neville
机构: Purdue University (普渡大学); New Jersey Institute of Technology (新泽西理工学院); University of Illinois at Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Microsoft Research (微软研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) is widely used to augment large language models (LLMs) with external knowledge. However, many benchmark datasets, designed to test RAG performance, comprise many questions that can already be answered from an LLM’s parametric memory. This leads to unreliable evaluation. We refer to this phenomenon as knowledge leakage: cases where RAG tasks are solvable without retrieval. This issue worsens over time due to benchmark aging. As benchmarks are reused for training, their contents are increasingly absorbed into model parameters, making them less effective for evaluating retrieval. We introduce SeedRG, a semi-synthetic benchmark generation pipeline that mitigates knowledge leakage and addresses the issue of benchmark aging. Starting from a seed benchmark dataset, SeedRG extracts a reasoning graph from question-context pairs to capture their underlying reasoning structure, and then generates new examples via type-constrained entity replacement. This process produces structurally similar but novel instances that are unlikely to exist in the model’s parametric knowledge, while preserving the original reasoning patterns. To ensure quality, we incorporate two verification steps: (1) a reasoning-graph consistency check to maintain task difficulty, and (2) a knowledge-leakage filter to exclude instances answerable without retrieval.

[NLP-194] The Grounding Gap: How LLMs Anchor the Meaning of Abstract Concepts Differently from Humans

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)如何表征抽象概念的语义根基问题,即这些模型是否像人类一样,通过经验、情感和社会情境来构建抽象概念的意义。其核心问题是:LLMs 是否能以与人类相似的方式“具身化”或“情境化”抽象词(如正义、理论、可用性)的理解。解决方案的关键在于采用认知科学中的属性生成实验范式,系统性地比较21个前沿及开源LLM与人类在生成抽象概念相关属性时的表现差异,并进一步利用稀疏自动编码器(Sparse Autoencoders, SAEs)分析模型内部表征是否包含与具身(sensorimotor)、社会(social)等接地维度相关的特征。研究发现,尽管LLMs在显式评估中可接近人类判断,但在自由生成任务中仍存在显著的“接地差距”,表明当前模型虽能识别接地维度,但未以类人方式动态调用这些信息。

链接: https://arxiv.org/abs/2605.08837
作者: Odysseas S. Chlapanis,Orfeas Menis Mastromichalakis,Christos H. Papadimitriou
机构: Athens University of Economics and Business (雅典国立经济与商业大学); Archimedes, Athena Research Center (阿基米德,雅典娜研究中心); Instituto de Telecomunicações (电信研究所); Columbia University (哥伦比亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Abstract concepts - justice, theory, availability - have no single perceivable referent; in the human brain, their meaning emerges from a web of experiences, affect, and social context. Do large language models (LLMs) ground abstract concepts in a similar way? We study this by replicating property-generation experiments from cognitive science on 21 frontier and open-weight LLMs. Across models and experiments, we find a consistent pattern: when compared to humans, models rely too heavily on word associations, and underproduce properties tied to emotion and internal states. This yields a large and consistent grounding gap: no model exceeds a Pearson correlation r=0.37 with human responses, compared to a human-to-human ceiling above r=0.9. To better interpret this gap, we also replicate a rating experiment on grounding categories and find that here LLMs align more closely with human judgment, and alignment improves as models get larger. We then use sparse autoencoders (SAEs) to inspect whether this information is also reflected in the models’ internal features, and we do identify features connected to grounding dimensions such as “sensorimotor” and “social”. These findings suggest that current LLMs can recover grounding dimensions when explicitly queried, but do not recruit them in a human-like way when words are generated freely.

[NLP-195] SimReg: Achieving Higher Performance in the Pretraining via Embedding Similarity Regularization

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在预训练过程中因token嵌入的上下文依赖性导致的类内方差高、类间相似度大问题,从而影响表示学习效率。其解决方案的关键在于提出SimReg——一种嵌入相似性正则化损失函数,通过显式地增强同一序列中具有相同真实标签的token表示之间的相似性,并利用对比损失强制不同标签token表示间的分离,从而扩大多分类边界,提升分类效率。实验表明,该方法在密集型和混合专家(Mixture-of-Experts, MoE)架构上均能加速训练收敛超过30%,并提升零样本下游任务平均性能超过1%。

链接: https://arxiv.org/abs/2605.08809
作者: Yan Sun,Guoxia Wang,Jinle Zeng,JiaBin Yang,Shuai Li,Li Shen,Dacheng Tao,DianHai Yu,Haifeng Wang
机构: Baidu Inc.(百度公司); Sun Yat-sen University (中山大学); Nanyang Technological University (南洋理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pretraining large language models (LLMs) with next-token prediction has led to remarkable advances, yet the context-dependent nature of token embeddings in such models results in high intra-class variance and inter-class similarity, thus hindering the efficiency of representation learning. While similarity-based regularization has demonstrated benefit in supervised fine-tuning and classification tasks, its application and efficacy in large-scale LLM pretraining remains underexplored. In this work, we propose the SimReg, an embedding similarity regularization loss that explicitly encourages token representations with the same ground-truth label within each sequence to be more similar, while enforcing separation from different-label tokens via a contrastive loss. Our analysis reveals that this mechanism introduces gains by enlarging multi-classification margins, thereby enabling more efficient classification. Extensive experiments across dense and Mixture-of-Experts (MoE) architectures demonstrate that SimReg consistently accelerates training convergence by over 30% and improves average zero-shot downstream performance by over 1% across standard benchmarks. Further ablation studies and analyses offer practical insights into hyperparameter tuning and loss effectiveness.
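
A minimal sketch of a similarity-regularization term of this kind is shown below: token representations whose ground-truth labels match are treated as positives in a contrastive objective, and all other tokens as negatives. The temperature, normalization, and masking choices are assumptions for illustration, not the paper's exact SimReg loss.

```python
# Illustrative same-label contrastive regularizer over token representations
# (an assumption-laden stand-in, not the paper's SimReg implementation).
import torch
import torch.nn.functional as F

def similarity_regularizer(hidden, labels, temperature=0.1):
    """hidden: (seq_len, d) token representations; labels: (seq_len,) ground-truth token ids."""
    z = F.normalize(hidden, dim=-1)
    sim = z @ z.T / temperature                           # pairwise cosine similarities
    eye = torch.eye(len(labels), dtype=torch.bool, device=hidden.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    logits = sim.masked_fill(eye, float("-inf"))          # drop self-similarity
    log_prob = logits - torch.logsumexp(logits, dim=-1, keepdim=True)
    has_pos = pos_mask.any(dim=-1)                        # tokens with at least one positive
    if not has_pos.any():
        return hidden.new_zeros(())
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(-1)[has_pos]
    return -(pos_log_prob / pos_mask.sum(-1)[has_pos]).mean()

h, y = torch.randn(16, 32), torch.randint(0, 5, (16,))
print(similarity_regularizer(h, y))
```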

[NLP-196] Narrative Landscape: Mapping Narrative Dispositions Across LLMs

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在重复、受控条件下输出行为的稳定性与多样性难以量化的问题。其核心挑战在于如何识别并表征模型在特定任务中表现出的稳定、可复现的偏好模式(即“倾向性”,disposition),而传统指标往往无法揭示这些模式背后的结构差异。解决方案的关键在于提出一个定量框架,通过两个维度对模型倾向性进行操作化定义:一是“一致性”(consistency),以Jaccard相似度衡量跨重复实验中选择项的重叠程度;二是“多样性”(diversity),用逆Simpson指数刻画选项分布的离散程度。此外,论文引入“叙事景观”(Narrative Landscape)这一基于主成分分析(PCA)的可视化方法,将不同模型的选择分布映射到统一空间,从而实现跨模型、跨指令类型的几何结构比较,揭示出即使数值指标相近,模型的内在选择拓扑也可能存在本质差异。

链接: https://arxiv.org/abs/2605.08742
作者: Donghoon Jung,Jiwoo Choi,Songeun Chae,Seohyon Jung
机构: KAIST(韩国科学技术院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to NLP4DH 2026, camera-ready version

点击查看摘要

Abstract:This study proposes a quantitative framework for profiling LLM dispositions as stable, model-specific regularities in output under repeated, controlled elicitation. Using a structured narrative constraint-selection task administered across six frontier models and three instruction types, we operationalize disposition through two dimensions: “consistency”, measured as cross-replication selection overlap via Jaccard similarity, and “diversity”, measured as dispersion across options via the inverse Simpson index. We further introduce Narrative Landscape, a PCA-based visualization that maps each model’s selection profile into a shared space for direct comparison. Results reveal a clear rigidity-exploration spectrum across model families and show that instruction types shift the geometry of selection spaces even when scalar metrics appear similar, indicating that comparable scores can mask qualitatively distinct selection topologies.
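
The two disposition metrics named in the abstract have standard definitions, so a small sketch is easy to give: consistency as the mean pairwise Jaccard overlap between the option sets selected across repeated runs, and diversity as the inverse Simpson index over option selection counts. The toy option sets below are invented for illustration, and the PCA-based Narrative Landscape visualization is not shown.

```python
# Direct implementations of the two disposition metrics described in the abstract.
import numpy as np

def jaccard_consistency(runs):
    """Mean pairwise Jaccard overlap between option sets selected in repeated runs."""
    scores = []
    for i in range(len(runs)):
        for j in range(i + 1, len(runs)):
            union = runs[i] | runs[j]
            scores.append(len(runs[i] & runs[j]) / len(union) if union else 1.0)
    return float(np.mean(scores))

def inverse_simpson_diversity(counts):
    """Inverse Simpson index over how often each option was selected."""
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return float(1.0 / np.sum(p ** 2))

# Toy data: three replications of the same constraint-selection prompt.
runs = [{"a", "b", "c"}, {"a", "b"}, {"a", "c", "d"}]
counts = {"a": 3, "b": 2, "c": 2, "d": 1}
print(jaccard_consistency(runs), inverse_simpson_diversity(counts))
```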

[NLP-197] Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning

【速读】: 该论文旨在解决大语言模型在复杂推理任务中依赖外部推理时工作流(inference-time harnesses)提升性能,但模型自身能力并未随之增强的问题。解决方案的关键在于提出在线策略自蒸馏(On-Policy Harness Self-Distillation, OPHSD),其核心机制是利用当前已增强的模型作为教师模型,对原始模型进行自蒸馏,从而将推理时工作流提供的额外监督信号内化到学生模型中,使模型具备更强的任务特异性能力和独立推理性能。实验表明,OPHSD在文本分类和数学推理任务上均显著优于基线方法,并揭示了推理时工作流在训练后可被移除,其优势可通过蒸馏过程永久固化于模型之中。

链接: https://arxiv.org/abs/2605.08741
作者: Zhengyang Zhao,Lu Ma,Wentao Zhang
机构: Peking University (北京大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Inference-time harnesses substantially improve large language models on complex reasoning tasks. However, the intrinsic capabilities of the underlying model remain unchanged by the addition of these external workflows. To bridge this gap, we introduce On-Policy Harness Self-Distillation (OPHSD), which employs the harness-augmented current model as a teacher for self-distillation, thereby introducing extra supervisory signals from the harness beyond training data. OPHSD internalizes task-specific harness capabilities into the student model, yielding robust generalizability and strong standalone performance across diverse reasoning tasks. Evaluated across draft–verify harness for text classification and plan–solve for mathematical reasoning tasks, OPHSD consistently outperforms strong baselines (e.g., +10.83% over OPSD on HMMT25). Our analysis further indicates that reattaching the harness during inference yields no additional benefits and can even degrade performance, suggesting that complex harnesses need not always be permanent fixtures; instead, they can serve as temporary training scaffolds whose benefits are permanently fed back into the base model. Our code and training data are available at this https URL.

[NLP-198] SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

【速读】: 该论文旨在解决大规模预训练场景下混合专家(Mixture-of-Experts, MoE)模型压缩的效率与性能平衡问题,具体聚焦于结构化剪枝(structured pruning)和知识蒸馏(knowledge distillation, KD)在MoE架构中的协同应用策略。其关键解决方案包括:(1)证明对预训练MoE模型进行剪枝可作为更优的初始化方式,显著优于相同训练预算下从零开始训练目标结构;(2)提出一种简单的局部保留专家合并策略,在持续预训练后提升下游任务性能;(3)发现结合KD与语言建模损失优于纯KD,并进一步引入多标记预测(multi-token prediction, MTP)蒸馏机制获得稳定增益;(4)揭示渐进式剪枝调度优于一次性剪枝,表明逐步调整模型结构有助于优化训练轨迹。最终,该方法将Qwen3-Next-80A3B模型压缩为23A2B版本并保持竞争力,为大规模MoE模型高效压缩提供了实用指导。

链接: https://arxiv.org/abs/2605.08738
作者: Shengkun Tang,Zekun Wang,Bo Zheng,Liangyu Wang,Rui Men,Siqi Zhang,Xiulong Yuan,Zihan Qiu,Zhiqiang Shen,Dayiheng Liu
机构: Qwen Team, Alibaba Inc.; MBZUAI; KAUST
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Structured pruning and knowledge distillation (KD) are typical techniques for compressing large language models, but it remains unclear how they should be applied at pretraining scale, especially to recent mixture-of-experts (MoE) models. In this work, we systematically study MoE compression in large-scale pretraining, focusing on three key questions: whether pruning provides a better initialization than training from scratch, how expert compression choices affect the final model after continued training, and which training strategy is most effective. We have the following findings: First, across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Second, different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining. Motivated by this, we introduce a simple partial-preservation expert merging strategy that improves downstream performance across most benchmarks. Third, combining KD with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks. We further propose multi-token prediction (MTP) distillation, which yields consistent gains. Finally, given the same training tokens, progressive pruning schedules outperform one-shot compression, suggesting that gradual architecture transitions lead to better optimization trajectories. Putting it all together, we compress Qwen3-Next-80A3B to a 23A2B model that retains competitive performance. These results offer practical guidance for efficient MoE compression at scale.
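
One of the reported findings, that distillation combined with the language-modeling loss beats distillation alone, corresponds to a loss of the following general shape. The mixing weight and temperature below are assumptions, and the paper's multi-token-prediction (MTP) distillation variant is not shown.

```python
# Sketch of KD combined with the standard LM cross-entropy (weights/temperature assumed).
import torch
import torch.nn.functional as F

def kd_plus_lm_loss(student_logits, teacher_logits, targets, kd_weight=0.5, tau=1.0):
    """student/teacher logits: (batch, seq, vocab); targets: (batch, seq) token ids."""
    lm = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)), targets.view(-1))
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.log_softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * tau ** 2
    return kd_weight * kd + (1.0 - kd_weight) * lm

s, t = torch.randn(2, 8, 100), torch.randn(2, 8, 100)
y = torch.randint(0, 100, (2, 8))
print(kd_plus_lm_loss(s, t, y))
```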

[NLP-199] The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在大语言模型(LLM)后训练过程中,基于策略蒸馏(On-policy Distillation, OPD)时因奖励外推系数 λ 过大导致输出格式失效的问题。当 λ 超过一个临界阈值 λ* 后,尽管模型性能可能提升,但其输出会脱离结构化输出任务的约束条件,从“格式保持”变为“格式坍缩”。解决方案的关键在于推导出一个闭合形式的 clip-safety 临界阈值 λ*(p,b,c),该阈值由教师模型的模态概率(teacher modal probability)、预热质量(warm-start mass)和重要性采样裁剪强度(importance-sampling clip strength)三个可测量量决定。通过在该阈值以下操作,如 ListOPD 方法,可在显著减少参数量(例如 1.7B 参数学生模型达到 8B-SFT 基线的域内性能)的同时保持输出格式有效性,且该方法在 Amazon Fashion 数据集上的多个预注册测试中验证了其预测精度与鲁棒性。

链接: https://arxiv.org/abs/2605.08737
作者: Xin Li,Hao Jiang,Annan Wang,Yichi Zhang,Chau Yuen
机构: Nanyang Technological University (南洋理工大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:On-policy distillation (OPD) is widely used for LLM post-training. When pushed with a reward-extrapolation coefficient lambda > 1, the student can lift past the teacher in domain, but past a threshold lambda* the same step violates the output contract on structured-output tasks. In a single-position Bernoulli reduction, we derive a closed-form base-relative clip-safety threshold lambda*(p,b,c) determined by three measurable quantities: the teacher modal probability, the warm-start mass, and the importance-sampling clip strength. Above lambda*, the extrapolated fixed point exits the clip-safe region, changing training from format-preserving to format-collapsing. We extend the rule to calibrated K-ary listwise JSON tasks where a single binding equivalence class dominates the output contract and SFT retains parse headroom. On Amazon Fashion, three pre-registered tests–a fine-grid cliff interval, a budget-extension test, and a small-clip cross-prediction–fall within their locked prediction windows, with the small-clip value matching the closed-form prediction below grid resolution. Operating just below lambda*, ListOPD brings a 1.7B Qwen3 student to in-domain parity with an 8B-SFT baseline at one-fifth the parameters. The gain is driven primarily by format adherence: NDCG@1 on parsed outputs remains flat across lambda, while parse validity sharply changes at the predicted boundary. The cliff diagnostic is rubric-independent, whereas the parity claim uses a Gemini-graded rubric and inherits that evaluator’s exposure.

[NLP-200] AdaPreLoRA: Adafactor Preconditioned Low-Rank Adaptation

【速读】: 该论文旨在解决低秩适配(Low-Rank Adaptation, LoRA)中因生成器映射的雅可比矩阵 $ J_G $ 秩不足导致的预条件方向反演不唯一的问题,即标准链式法则无法唯一地将预条件后的权重空间($ W $-space)方向映射回因子空间(factor-space)更新。其解决方案的关键在于提出一种统一框架,通过两个核心设计选择来定义优化器:(i) 使用何种可逆代理矩阵替代奇异的因子空间预条件矩阵 $ J_G^* F_t J_G $,以及(ii) 在权重空间上采用何种预条件矩阵 $ F_t $。文中特别指出,一个基于梯度统计信息的 $ F_t $(具体为 Adafactor 对角 Kronecker 预条件矩阵 $ H_t $)与闭式因子空间求解(复杂度为 $ O((m+n)r) $)的组合仍待探索。为此,作者提出 AdaPreLoRA,它在上述框架下选择最小化 $ H_t $-加权不平衡的因子更新解,从而保证所获因子更新是预条件权重方向在 LoRA 约束下的最近逼近。该方法在多个基准任务(如 GPT-2、Mistral-7B、Qwen2-7B 及扩散模型个性化)中表现优于或相当现有 LoRA 优化器,同时保持峰值 GPU 内存不变。

链接: https://arxiv.org/abs/2605.08734
作者: Ziyun Liu,Fengmiao Bian,Jian-Feng Cai
机构: The Hong Kong University of Science and Technology (香港科技大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 27 pages

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) reparameterizes a weight update as a product of two low-rank factors, but the Jacobian J_G of the generator mapping the factors to the weight matrix is rank-deficient, so the factor-space preconditioner J_G^* F_t J_G induced by any W-space preconditioner F_t is singular, and consequently the standard chain rule cannot be uniquely inverted to map a preconditioned W-space direction back to a factor-space update. We cast existing LoRA optimizers in a unified framework parameterized by two choices: (i) which invertible surrogate for J_G^* F_t J_G to use, and (ii) which F_t on W to use. Existing methods occupy four families along these axes: factor-space adaptive updates, block-diagonal surrogates for J_G^* J_G, Frobenius-residual pseudoinverse methods, and Riemannian manifold constraint. Within this design space, a gradient-statistics-aware F_t paired with a closed-form factor-space solve at O((m+n)r) memory remains underexplored. We propose AdaPreLoRA, which fills this gap by adopting the Adafactor diagonal Kronecker preconditioner H_t on W and selecting from the resulting factor-space solution family the element minimizing an H_t-weighted imbalance between the two factor contributions; by construction, the resulting factor update is the closest LoRA approximation to the preconditioned W-space direction under the H_t-weighted norm. Across GPT-2 (E2E), Mistral-7B and Qwen2-7B (GLUE, ARC, GSM8K), and diffusion-model personalization, AdaPreLoRA is competitive with or improves over a representative set of LoRA optimizers while keeping peak GPU memory at the LoRA optimizer level.

[NLP-201] Breaking the Impasse: Dual-Scale Evolutionary Policy Training for Social Language Agents ACL2026

【速读】: 该论文旨在解决生成式 AI(Generative AI)在开放性社交语言博弈中因策略空间庞大而导致的演化停滞问题,即语言智能体频繁收敛至同质化行为,造成对局结果确定性增强,从而丧失用于策略演化的梯度信号。解决方案的关键在于提出双尺度进化策略训练(Dual-scale Evolutionary Policy Training, DEPT),其核心机制包括:通过量化双尺度价值基线差异与对局熵来检测演化停滞,并在感知到崩溃后激活非对称优势重塑,动态调节优化景观以实现干预,从而有效恢复梯度信号并维持持续的战略探索。

链接: https://arxiv.org/abs/2605.08721
作者: Minzheng Wang,Run Luo,Yanbo Wang,Zichen Liu,Yuqiao Tan,Tao Tan,Xu Nan,Yinhe Zheng,Wenji Mao
机构: MAIS, Institute of Automation, Chinese Academy of Sciences (MAIS,自动化研究所,中国科学院); School of Artificial Intelligence, University of Chinese Academy of Sciences (人工智能学院,中国科学院大学); National University of Singapore (新加坡国立大学); Ritzz-AI
类目: Computation and Language (cs.CL)
备注: Accepted to the ACL 2026 Main Conference

点击查看摘要

Abstract:While Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for closed-ended tasks, extending it to open-ended social language games via self-play reveals a critical issue: evolution impasse. Due to the vast strategy space, language agents frequently converge to homogenized behaviors, leading to deterministic match outcomes that eliminate the gradient signals necessary for policy evolution. To tackle this issue, we propose Dual-scale Evolutionary Policy Training (DEPT) for social language games. DEPT introduces a time-scaled evolutionary perception mechanism that detects impasse by quantifying dual-scale value baseline divergence alongside match entropy. Upon perceiving the collapse, it then activates asymmetric advantage reshaping to dynamically modulate the optimization landscape for intervention. Thus, our method effectively restores gradient signals and enforces sustained strategic exploration. Extensive experiments on multiple social language games demonstrate that DEPT outperforms strong baselines, avoiding policy degeneration and driving the continuous evolution of social language agents.

[NLP-202] Bias by Necessity: Impossibility Theorems for Sequential Processing with Convergent AI and Human Validation

【速读】: 该论文旨在解决认知偏差(如首因效应、锚定效应和顺序依赖性)是否为序列信息处理中数学上不可避免的后果这一核心问题。其解决方案的关键在于通过三个不可能性定理揭示了这些偏差在自回归语言模型中的架构必然性:(1) 首因效应源于注意力累积的不对称性;(2) 锚定效应由序列条件化过程及可证明的信息边界所驱动;(3) 精确去偏需因子时间复杂度计算,而蒙特卡洛近似可在恒定容忍度开销下实现。研究进一步通过12个前沿大语言模型验证理论边界,并结合两项预注册人类实验确认预测,从而将认知偏差重新定义为资源理性(resource-rational)对序列处理限制的响应。

链接: https://arxiv.org/abs/2605.08716
作者: Jikun Wu,Dongxin Guo,Siu-Ming Yiu
机构: Stellaris AI Limited (Stellaris AI 有限公司); Brain Investing Limited (Brain Investing 有限公司); The University of Hong Kong (香港大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 6 pages, 3 figures, 5 tables. Accepted to CogSci 2026

点击查看摘要

Abstract:Are certain cognitive biases mathematically inevitable consequences of sequential information processing? We prove that primacy effects, anchoring, and order-dependence are architecturally necessary in autoregressive language models due to causal masking constraints. Our three impossibility theorems establish: (1) primacy bias arises from asymmetric attention accumulation; (2) anchoring emerges from sequential conditioning with provable information bounds; and (3) exact debiasing by permutation marginalization requires factorial-time computation, with Monte Carlo approximation feasible at constant per-tolerance overhead. We validate these bounds across 12 frontier LLMs (R^2 = 0.89; \Delta BIC = 16.6 vs. next-best alternative). We then derive quantitative predictions from the framework and test them in two pre-registered human experiments (N = 464 analyzed). Study 1 confirms anchor position modulates anchoring magnitude (d = 0.52, BF_10 = 847). Study 2 shows working memory load amplifies primacy bias (d = 0.41, BF_10 = 156), with WM capacity predicting bias reduction (r = -.38). These convergent findings reframe cognitive biases as resource-rational responses to sequential processing.

[NLP-203] RewardHarness: Self-Evolving Agentic Post-Training

【速读】: 该论文旨在解决指令引导图像编辑评估中奖励模型的数据效率低下问题,即当前方法依赖大规模偏好标注和额外训练,而人类仅需少量示例即可推断评估标准。解决方案的关键在于提出RewardHarness框架,其将奖励建模重构为上下文演化而非权重优化过程:通过迭代进化一个工具与技能库(从仅100个偏好演示开始),由协调器(Orchestrator)选择最相关工具构建推理链,由冻结的子代理(Sub-Agent)生成偏好判断,并基于预测结果与真实偏好对比及推理过程的成功/失败分析自动更新工具库,从而实现无需额外人工标注的自演化奖励建模。

链接: https://arxiv.org/abs/2605.08703
作者: Yuxuan Zhang,Penghui Du,Bo Li,Cong Wei,Junwen Miao,Huaisong Zhang,Songcheng Cai,Yubo Wang,Dongfu Jiang,Yuyu Zhang,Ping Nie,Wenhu Chen,Changqian Yu,Kelsey R. Allen
机构: University of British Columbia (不列颠哥伦比亚大学); Vector Institute (向量研究所); Kolors Team, Kuaishou Technology (快手科技Kolors团队); Carnegie Mellon University (卡内基梅隆大学); University of Waterloo (滑铁卢大学); Etude AI (Etude AI); Tsinghua University (清华大学); Georgia Institute of Technology (佐治亚理工学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page: this https URL

点击查看摘要

Abstract:Evaluating instruction-guided image edits requires rewards that reflect subtle human preferences, yet current reward models typically depend on large-scale preference annotation and additional model training. This creates a data-efficiency gap: humans can often infer the target evaluation criteria from only a few examples, while models are usually trained on hundreds of thousands of comparisons. We present RewardHarness, a self-evolving agentic reward framework that reframes reward modeling as context evolution rather than weight optimization. Instead of learning from large-scale annotations, RewardHarness aligns with human preferences by iteratively evolving a library of tools and skills from as few as 100 preference demonstrations. Given a source image, candidate edited images, and an editing instruction, an Orchestrator selects the most relevant subset of tools and skills from the maintained library, and a frozen Sub-Agent uses them to construct a reasoning chain that produces a preference judgment. By comparing predicted judgments with ground-truth preferences and analyzing successes and failures in the reasoning process, the Orchestrator automatically refines its library of tools and skills without additional human annotation. Using only 0.05% of the EditReward preference data, RewardHarness achieves 47.4% average accuracy on image-editing evaluation benchmarks, surpassing GPT-5 by 5.3 points. When used as a reward signal for GRPO fine-tuning, RL-tuned models achieve 3.52 on ImgEdit-Bench. Project page: this https URL.

[NLP-204] Structured Recurrent Mixers for Massively Parallelized Sequence Generation

【速读】: 该论文旨在解决语言建模中训练效率与推理吞吐量之间的权衡问题:传统递归模型(Recurrent Models)在推理阶段具有较低的吞吐量,而并行模型(如Transformer)虽提升训练效率但受限于内存和计算资源。其解决方案的关键在于提出结构化递归混合器(Structured Recurrent Mixer, SRM),通过数学上的代数转换,在训练时采用序列并行表示以提升效率和输入信息容量,在推理时切换为递归表示以实现更高的吞吐量与并发性,且无需专用内核或设备特定内存管理。实验表明,SRM在相同计算预算下相较Transformer显著提升了推理性能(如vLLM上实现12倍吞吐量和170倍并发性),并展现出在强化学习训练中的潜力。

链接: https://arxiv.org/abs/2605.08696
作者: Benjamin L. Badger
机构: IBM(国际商业机器公司)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Over the last two decades, language modeling has experienced a shift from predominantly recurrent architectures that process tokens sequentially during training and inference to non-recurrent models that process sequence elements in parallel during training, which results in greater training efficiency and stability at the expense of lower inference throughput. Here we introduce the Structured Recurrent Mixer, an architecture that allows for algebraic conversion between a sequence parallel representation at train time and a recurrent representation at inference, notably without the need for specialized kernels or device-specific memory management. We show experimentally that this dual representation allows for greater training efficiency, higher input information capacity, and larger inference throughput and concurrency when compared to other linear complexity models. We postulate that recurrent models are poorly suited to extended sequence length scaling for information-rich inputs typical of language, but are well suited to scaling in the sample (batch) dimension due to their constant memory per sample. We provide Mojo/MAX inference implementations of SRMs exhibiting 12x the throughput and 170x the concurrency of similarly powerful Transformers inferenced on vLLM, increases characteristic of Pytorch implementations resulting in a 30% increase in compute-constant GSM8k Pass@k. We conclude by demonstrating that SRMs are effective reinforcement learning training candidates.

[NLP-205] AAAC: Activation-Aware Adaptive Codebooks for 4-bit LLM Weight Quantization

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在后训练量化(Post-Training Quantization, PTQ)过程中,如何在极低计算开销下实现高精度4比特权重量化的问题。现有方法如AWQ、GPTQ虽通过缩放、截断或误差补偿优化了权重映射,但进一步提升精度往往依赖于耗时数小时的梯度辅助算法(如OmniQuant和QuIP#)。其关键创新在于提出AAAC(Activation-Aware Adaptive Codebooks),即为每层引入两个小型可学习标量码本(共64字节),并基于激活加权重构误差动态选择最优码本;通过利用组内正尺度的未使用符号位编码选择信息,实现零额外存储开销。该方案可在单GPU上仅需3–30分钟完成量化,显著优于现有方法的精度-时间权衡。

链接: https://arxiv.org/abs/2605.08692
作者: Beshr IslamBouli,David Jin
机构: Massachusetts Institute of Technology (麻省理工学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Post-training weight-only quantization to 4 bits is widely used to reduce the memory and compute costs of large language model inference. Existing PTQ methods, such as AWQ and GPTQ, improve how weights are mapped onto a fixed 4-bit grid through scaling, clipping, or error compensation. To further improve accuracy, methods such as OmniQuant and QuIP# use gradient-assisted algorithms at the cost of hours of quantization time. In this work, we propose AAAC (Activation-Aware Adaptive Codebooks), a lightweight method for 4-bit LLM weight quantization. AAAC replaces the fixed scalar codebook used in standard quantization with two small learned scalar codebooks (64 bytes) per layer. Each group of weights selects the codebook that minimizes activation-weighted reconstruction error, encoding the choice in the unused sign bit of the group’s positive scale and adding zero storage overhead. AAAC completes in 3–30 minutes on a single GPU, and adds no memory beyond the model itself. We evaluate against AWQ, GPTQ, IF4, GPTVQ, OmniQuant, SqueezeLLM, and QuIP# across model families. AAAC outperforms baselines at orders-of-magnitude less quantization time.
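
The per-group codebook choice can be sketched as follows: each group maps its weights onto every candidate scalar codebook and keeps the codebook with the smaller activation-weighted reconstruction error. This is an illustrative stand-in only; AAAC additionally learns the codebooks and hides the chosen index in the unused sign bit of the group's positive scale, which is omitted here.

```python
# Illustrative per-group codebook selection for 4-bit weight quantization (not the AAAC code).
import numpy as np

def quantize_group(weights, act_scale, codebooks):
    """Return (chosen codebook index, quantized weights) minimizing activation-weighted error."""
    best = None
    for idx, cb in enumerate(codebooks):
        q = cb[np.argmin(np.abs(weights[:, None] - cb[None, :]), axis=1)]  # nearest entry
        err = float(np.sum(act_scale * (weights - q) ** 2))                # activation-weighted
        if best is None or err < best[0]:
            best = (err, idx, q)
    return best[1], best[2]

rng = np.random.default_rng(0)
w = rng.normal(size=128)
a = np.abs(rng.normal(size=128))                       # stand-in activation scales
codebooks = [np.linspace(w.min(), w.max(), 16),        # uniform 16-entry (4-bit) grid
             np.quantile(w, np.linspace(0, 1, 16))]    # quantile-based grid
idx, w_q = quantize_group(w, a, codebooks)
print(idx, float(np.sum(a * (w - w_q) ** 2)))
```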

[NLP-206] Explanation Fairness in Large Language Models: An Empirical Analysis of Disparities in How LLMs Justify Decisions Across Demographic Groups

【速读】: 该论文旨在解决生成式 AI(Generative AI)在决策解释中的公平性问题,即大型语言模型(Large Language Models, LLMs)是否在不同人口群体间以同等质量、深度、语气和语言复杂度提供解释。现有研究聚焦于决策公平性,而忽视了解释层面的潜在偏见。其解决方案的关键在于提出并验证一个可操作的“解释公平性分类法”(Explanation Fairness Taxonomy, EFT),包含五种维度:冗长度差异(Verbosity Disparity)、情感倾向差异(Sentiment Disparity)、认知模糊性差异(Epistemic Hedging Disparity)、决策关联解释差异(Decision-Linked Explanation Disparity)和词汇复杂度差异(Lexical Complexity Disparity)。通过控制实验设计,在80个提示模板、四个关键决策领域(招聘、医疗分诊、信贷评估、法律判决)及五种主流LLM上进行实证分析,发现所有EFT指标均存在显著统计差异(p_BH < 10⁻⁶²),且模型选择对差异幅度影响显著;进一步引入两个黑盒指标——模糊密度得分(Hedging Density Score, HDS)与解释忠实度代理指标(Explanation Faithfulness Proxy, EFP),揭示出基于提示的缓解策略虽能大幅降低EFP差异(78–95%),但对风格类差异无效,表明此类不公平可能源于预训练数据分布,而非部署阶段指令可调。该研究为AI解释层面的公平性审计提供了可复现框架,具有重要监管与实践意义。

链接: https://arxiv.org/abs/2605.08671
作者: Gautam Veldanda
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures, 9 tables

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed not only to make decisions but to explain them. While AI decision fairness has been studied extensively, the fairness of AI explanations (whether LLMs justify decisions with equal quality, depth, tone, and linguistic sophistication across demographic groups) has received little attention. This paper introduces the Explanation Fairness Taxonomy (EFT), a framework comprising five formally defined, operationalizable dimensions: Verbosity Disparity, Sentiment Disparity, Epistemic Hedging Disparity, Decision-Linked Explanation Disparity, and Lexical Complexity Disparity. The taxonomy is instantiated in a controlled empirical study across 80 prompt templates, four consequential decision domains (hiring, medical triage, credit assessment, legal judgment), and five LLMs: GPT-4.1, Claude Sonnet, LLaMA 3.3 70B, GPT-OSS 120B, and Qwen3 32B. Two novel black-box metrics are introduced: the Hedging Density Score (HDS) and the Explanation Faithfulness Proxy (EFP), a heuristic indicator of decision-linked explanation variation. Across up to 400 prompt pairs, all eight EFT metrics show statistically significant disparities (Cohen’s d ranging from small to large, all p_BH < 10^(-62)). Model choice is strongly associated with disparity magnitude: Qwen3 32B exhibits verbosity disparities 5.9x larger than LLaMA 3.3 70B. Two prompting-based mitigations show significant reductions in EFP disparity (78-95%) but no significant effect on stylistic dimensions, consistent with the hypothesis that stylistic explanation inequalities are encoded in pre-training distributions and are not resolvable through deployment-level instruction alone. A reproducible measurement framework is offered for explanation-level fairness auditing, with implications for AI regulation and deployment practice.
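
As a rough idea of what a black-box metric like the Hedging Density Score can look like, the sketch below counts hedge markers per 100 words. The marker lexicon and normalization are assumptions for illustration and are not the paper's actual HDS definition.

```python
# Toy hedging-density metric (the hedge lexicon and per-100-words scaling are assumptions).
import re

HEDGES = {"might", "may", "could", "perhaps", "possibly", "likely", "appears", "seems"}

def hedging_density_score(text, per_n_words=100):
    """Count hedge markers per `per_n_words` words of an explanation."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    hedges = sum(1 for w in words if w in HEDGES)
    return per_n_words * hedges / max(1, len(words))

print(hedging_density_score("The candidate may be a good fit, and it seems their skills could transfer."))
```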

[NLP-207] Hint Tuning: Less Data Makes Better Reasoners

【速读】: 该论文旨在解决大模型在推理过程中存在冗余生成的问题,即尽管通过扩展思维链(Chain-of-Thought, CoT)可提升准确性,但模型往往对所有问题统一采用冗长推理,导致token消耗增加5–8倍。解决方案的关键在于提出一种名为Hint Tuning的数据高效方法,其核心思想是利用指令模型(instruct model)作为难度探测器,通过测试其在不同引导程度下能否完成任务,自动构建三种状态的训练样本:无提示(No-Hint)、稀疏提示(Sparse-Hint)和完整提示(Full-Hint)。这一策略将抽象的难度标注转化为可测量的指令模型与推理模型之间的一致性验证,仅需1K自标注样本即可实现平均31.5%的token减少(24–66%区间),同时保持多模型(Qwen3-Thinking、DeepSeek-R1-Distill等,规模4B–32B)在五个基准上的竞争力,显著优于依赖大规模蒸馏或昂贵强化学习的方法。

链接: https://arxiv.org/abs/2605.08665
作者: Siqi Fan,Minghao Li,Xiaoqian Ma,Xiusheng Huang,Zhuo Chen,Bowen Qin,Liujie Zhang,Shuo Shang,Weihang Chen
机构: University of Electronic Science and Technology of China (电子科技大学); Xiaohongshu Inc. (小红书公司); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large reasoning models achieve high accuracy through extended chain-of-thought but generate 5–8x more tokens than necessary, applying verbose reasoning uniformly regardless of problem difficulty. We propose Hint Tuning, a data-efficient approach that teaches models to calibrate reasoning depth. Our key insight: the corresponding instruct model serves as an ideal difficulty probe. By testing what the instruct model can solve with varying guidance, we automatically construct training data across three states: No-Hint (direct answer), Sparse-Hint (minimal prefix), and Full-Hint (complete reasoning). This converts the abstract challenge of difficulty labeling into a measurable consistency check between the instruct and reasoning models. With only 1K self-annotated samples, Hint Tuning achieves 24–66% token reduction (31.5% average) across mainstream reasoning models (Qwen3-Thinking, DeepSeek-R1-Distill) at multiple scales (4B–32B) while maintaining competitive accuracy on five benchmarks. Unlike methods requiring massive distillation datasets or expensive RL, we achieve superior efficiency through simple alignment with the instruct model’s capabilities.
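
The data-construction idea, probing the instruct model with increasing amounts of guidance and labeling each example No-Hint, Sparse-Hint, or Full-Hint accordingly, can be sketched as below. The prompt templates, the way the sparse hint prefix is cut, and the stub callables are illustrative assumptions rather than the paper's pipeline.

```python
# Sketch of hint-state labeling via an instruct-model probe (callables are user-supplied stubs).
def build_hint_example(question, reasoning, answer, instruct_model, is_correct):
    """Label one example as No-Hint / Sparse-Hint / Full-Hint by probing the instruct model."""
    if is_correct(instruct_model(question), answer):
        return {"prompt": question, "target": answer, "state": "No-Hint"}
    sparse_hint = reasoning[: max(1, len(reasoning) // 4)]          # a minimal reasoning prefix
    if is_correct(instruct_model(f"{question}\nHint: {sparse_hint}"), answer):
        return {"prompt": question, "target": sparse_hint + answer, "state": "Sparse-Hint"}
    return {"prompt": question, "target": reasoning + answer, "state": "Full-Hint"}

# Dummy usage with stub callables (replace with real model calls and answer checking).
example = build_hint_example(
    question="What is 17 * 6?",
    reasoning="17 * 6 = 17 * 5 + 17 = 85 + 17 = 102. ",
    answer="102",
    instruct_model=lambda prompt: "102" if "Hint" in prompt else "96",
    is_correct=lambda pred, ans: pred.strip() == ans,
)
print(example["state"])   # Sparse-Hint
```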

[NLP-208] AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators

【速读】: 该论文旨在解决多智能体系统(Multi-agent Systems)在协作过程中因隐性约束丢失而导致的推理链污染问题,此类问题无法通过仅基于最终输出的评估方法检测到。为实现对这些潜在漏洞的诊断与量化,作者提出了AgentCollabBench——一个包含900个经人工验证任务的诊断基准,覆盖软件工程、DevOps和数据工程领域,其核心设计在于隔离四种行为风险:指令衰减(instruction decay)、错误信念传播(false-belief contagion)、上下文泄露(context leakage)和标记数据耐久性(tracer durability)。解决方案的关键在于揭示通信拓扑结构是影响多跳信息存活率的主要因素(解释7–40%的方差),并发现收敛型有向无环图(converging-DAG)节点存在“合成瓶颈”:当多个父节点输入竞争时,代理会丢弃少数分支携带的约束条件,而这一结构缺陷在线性链中不存在。因此,论文指出,多智能体系统的可靠性本质上是一个结构性问题,单纯提升模型智能不足以保障系统稳健性。

链接: https://arxiv.org/abs/2605.08647
作者: Aritra Mazumder,Shubhashis Roy Dipta,Nusrat Jahan Lia,Tanzila Khan,Kainat Raisa Hossain,Nehaa Shri,Shubhrangshu Debsarkar,Humayra Tasnim,Gour Gupal Talukder Shawon,Debjoty Mitra,Sumaiya Ahmed Rani,Al Jami Islam Anik,Al Nafeu Khan
机构: University of Utah (犹他大学); University of Maryland, Baltimore County (马里兰大学巴尔的摩县分校); University of Dhaka (达卡大学); Vellore Institute of Technology (维洛尔技术学院); University of Virginia (弗吉尼亚大学); Rajshahi University of Engineering and Technology (拉杰沙希工程技术大学); Shahjalal University of Science and Technology (沙赫贾拉尔科技大学); BRAC University (BRAC大学); Islamic University of Technology (伊斯兰科技大学); Comilla University (乔米拉大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multi-agent systems achieve state-of-the-art outcomes through peer collaboration. However, when an agent in the pipeline silently drops a constraint, the system’s final output may look correct even though the reasoning chain was quietly corrupted, and existing outcome-based evaluations are blind to such multi-hop process failures. To make these vulnerabilities measurable before deployment, we introduce AgentCollabBench, a diagnostic benchmark of 900 human-validated tasks spanning software engineering, DevOps, and data engineering. Each task isolates one of four behavioral risks: instruction decay (does a constraint survive peer pressure?), false-belief contagion (does a falsehood spread through consensus?), context leakage (does information bleed between tasks?), and tracer durability (does marked data reach the final agent?). Evaluating four modern LLMs (GPT 4.1 mini, Gemini 2.5 Flash Lite, Qwen-3.5-35B-A3B, and Llama 3.1 8B Instruct), we expose model-specific vulnerability profiles invisible to outcome-only evaluation; Qwen-3.5-35B-A3B, for example, leads on tracer durability and instruction stability, while GPT 4.1 mini leads on leakage containment and false-belief resistance. Beyond per-model differences, communication topology emerges as a primary risk factor that explains 7-40% of the variance in multi-hop information survival. The effect traces to a synthesis bottleneck specific to converging-DAG nodes: an agent weighing competing parent inputs discards constraints carried by a minority branch, a bottleneck structurally absent from linear chains. AgentCollabBench demonstrates that suboptimal topology can silently erase the safeguards of highly capable models, arguing that multi-agent reliability is fundamentally a structural problem and that scaling model intelligence alone is no substitute for architecture.

[NLP-209] PAAC: Privacy-Aware Agentic Device-Cloud Collaboration

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)代理在设备-云端协同架构下面临的隐私与能力权衡问题:云端代理虽具备强大推理能力但存在用户数据泄露风险,而本地代理虽保障隐私却受限于计算资源导致整体性能下降。现有方案将设备-云端边界视为单纯的算力分割,未能将其作为适配代理任务的可信边界,且传统数据脱敏方法在策略灵活性与工具调用结构保真度之间难以兼顾。其解决方案的关键在于提出PAAC(Privacy-Aware Agentic Framework),通过将规划器(planner)与执行器(executor)的分解映射到设备-云端边界,使角色专业化本身成为隐私保护机制:云端代理基于类型化的占位符令牌进行推理(保留敏感值的推理角色但丢弃内容),本地代理则识别敏感片段并将每步执行结果提炼为紧凑的关键发现;脱敏操作仅由本地LLM提议需掩码的片段,而确定性注册表负责所有替换与还原操作,确保动作可在本地直接执行。实验表明,在严格隐私设置下,PAAC在三个代理基准测试中显著优于现有最优设备-云端基线,平均准确率提升15–36%,平均泄露量减少2–6倍,尤其在超出固定实体分类体系的隐私目标上优势最大。

链接: https://arxiv.org/abs/2605.08646
作者: Liangqi Yuan,Wenzhi Fang,Shiqiang Wang,Christopher G. Brinton
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Large language model (LLM) agents face a structural tension: cloud agents provide strong reasoning but expose user data, while on-device agents preserve privacy at the cost of overall capability. Existing device-cloud designs treat this boundary as a compute split rather than a trust boundary suited to agentic workloads, and existing sanitizers force a choice between policy flexibility and the structural fidelity tool calls require. In this work, we develop PAAC, a privacy-aware agentic framework that aligns planner–executor decomposition with the device-cloud boundary so that role specialization itself becomes the privacy mechanism. The cloud agent reasons over typed placeholder tokens that preserve each sensitive value’s reasoning role while discarding its content, while the on-device agent identifies sensitive spans and distills each step’s execution outcome into compact key findings. Sanitization confines the on-device LLM to proposing which spans to mask, while a deterministic registry performs all substitution and reversal, keeping actions directly executable on device. On three agentic benchmarks under strict privacy settings, PAAC dominates the Pareto frontier of privacy and accuracy, improving average accuracy by 15-36% and reducing average leakage by 2-6 \times over state-of-the-art device-cloud baselines, with the largest margins on privacy targets outside fixed entity taxonomies. We find consistent improvements on 17 additional benchmarks spanning 10 domains, including math, science, and finance.
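
The sanitization idea, replacing sensitive spans with typed placeholders and keeping a deterministic registry for substitution and reversal, can be sketched as follows. In PAAC the on-device LLM proposes which spans to mask; here the spans are simply passed in, and the placeholder format is an assumption.

```python
# Minimal sketch of typed-placeholder sanitization with a deterministic registry (not PAAC itself).
class PlaceholderRegistry:
    def __init__(self):
        self.mapping = {}

    def sanitize(self, text, spans):
        """spans: list of (sensitive_value, type_label) pairs proposed by the on-device model."""
        for i, (value, kind) in enumerate(spans):
            token = f"<{kind}_{i}>"
            self.mapping[token] = value
            text = text.replace(value, token)
        return text

    def restore(self, text):
        """Deterministically reverse the substitution, e.g. before executing a tool call locally."""
        for token, value in self.mapping.items():
            text = text.replace(token, value)
        return text

reg = PlaceholderRegistry()
cloud_view = reg.sanitize("Email alice@example.com about invoice 4471.",
                          [("alice@example.com", "EMAIL"), ("4471", "INVOICE_ID")])
print(cloud_view)                       # Email <EMAIL_0> about invoice <INVOICE_ID_1>.
print(reg.restore("send_mail(to=<EMAIL_0>, subject='Invoice <INVOICE_ID_1>')"))
```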

[NLP-210] EdgeFlowerTune: Evaluating Federated LLM Fine-Tuning Under Realistic Edge System Constraints

【速读】: 该论文旨在解决联邦大语言模型(Large Language Models, LLMs)在真实边缘设备上进行微调时的可部署性问题,即现有研究多集中于跨数据中心或仿真环境,忽略了实际边缘系统中资源受限(如内存、计算能力)和运行时约束(如通信开销、能耗、延迟)对方法可行性的决定性影响。解决方案的关键在于提出EdgeFlowerTune——一个面向部署的基准测试平台,其核心创新是引入三种互补评估协议:Quality-under-Budget(预算下的质量)、Cost-to-Target(达到目标的质量成本)和Robustness(对动态边缘条件的鲁棒性),从而在模型性能与系统代价(包括通信、延迟、内存、能耗等)之间实现协同权衡,提供可复现的系统感知型评估框架,以指导真正适用于边缘场景的联邦LLM微调方法设计。

链接: https://arxiv.org/abs/2605.08636
作者: Jiaxiang Geng,Yiyi Lu,Lunyu Zhao,Yan Gao,Nicholas D. Lane,Bing Luo
机构: Duke Kunshan University (昆山杜克大学); Flower Labs (Flower 实验室); University of Cambridge (剑桥大学)
类目: Computation and Language (cs.CL)
备注: 30 pages, 10 figures

点击查看摘要

Abstract:Federated fine-tuning offers a promising paradigm for adapting large language models (LLMs) on edge devices by leveraging the rich, diverse, and continuously generated data from smartphones and IoT devices without compromising user data privacy. Such edge-side adaptation can improve model personalization, robustness, and responsiveness to local contexts. However, the practical feasibility of federated LLM fine-tuning on real edge devices remains unclear, as most existing work focuses on cross-silo or simulation-based settings, overlooking the resource and runtime constraints that determine whether a method is deployable on real edge systems. We present EdgeFlowerTune, a deployment-oriented benchmark for federated LLM fine-tuning under realistic edge-system constraints. EdgeFlowerTune jointly evaluates model quality and system costs, including communication, wall-clock latency, memory usage, energy consumption, and robustness to dynamic edge conditions. To compare methods in terms of effectiveness, efficiency, and robustness, EdgeFlowerTune introduces three complementary protocols: Quality-under-Budget, Cost-to-Target, and Robustness. We instantiate EdgeFlowerTune as a real-device platform built on Flower and MobileFineTuner, spanning commercial Android smartphones and NVIDIA edge development boards. Our benchmark results show that accuracy-only evaluation can lead to misleading conclusions: methods with similar final quality may differ substantially in deployability once realistic system constraints are considered. EdgeFlowerTune provides a reproducible benchmark for system-aware evaluation of federated LLM fine-tuning at the edge.

[NLP-211] PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding

【速读】: 该论文旨在解决现有推测解码(Speculative Decoding)方法中,草稿模型(draft model)训练目标与推理阶段最大化连续token接受长度的目标不一致的问题。传统方法通常以token预测准确率为优化目标,但这一目标未能有效促进实际推理中生成更多可被目标模型接受的连续token。解决方案的关键在于重新定义草稿模型的优化目标,从单一的token级预测精度转向整体接受长度的最大化,并提出PARD-2框架,其核心创新是引入置信度自适应token(Confidence-Adaptive Token, CAT)优化机制,通过动态调整每个token的权重来更好地匹配验证过程。此外,PARD-2支持目标依赖和目标无关两种模式,提升了灵活性与通用性,在Llama3.1-8B等模型上实现了最高达6.94倍的无损加速性能。

链接: https://arxiv.org/abs/2605.08632
作者: Zihao An,Taichi Liu,Ziqiong Liu,Dong Li,Ruofeng Liu,Emad Barsoum
机构: Advanced Micro Devices, Inc. (AMD); Rutgers University; Michigan State University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Speculative decoding accelerates Large Language Models (LLMs) inference by using a lightweight draft model to propose candidate tokens that are verified in parallel by the target model. However, existing draft model training objectives are not directly aligned with the inference-time goal of maximizing consecutive token acceptance. To address this issue, we reformulate the draft model optimization objective, shifting the focus from token prediction accuracy to the overall acceptance length. In this paper, we build upon PARD to propose PARD-2, a dual-mode speculative decoding framework with Confidence-Adaptive Token (CAT) optimization. This approach adaptively reweights each token to better align with the verification process. Notably, PARD-2 enables a single draft model to support both target-dependent and target-independent modes. Experiments across diverse models and tasks demonstrate that PARD-2 achieves up to 6.94 \times lossless acceleration, surpassing EAGLE-3 by 1.9 \times and PARD by 1.3 \times on Llama3.1-8B. Our code is available at this https URL.

[NLP-212] 100,000 Movie Reviews from Kazakhstan: Russian, Kazakh, and Code-Switched Texts

【速读】: 该论文旨在解决多语言电影评论数据稀缺及其在情感分析任务中应用的挑战,特别是针对哈萨克斯坦地区以俄语为主、包含哈萨克语及混合语言(code-switched)文本的非英语语料。其解决方案的关键在于构建了一个大规模、公开可用的多语言电影评论语料库(100,502条评论,覆盖2001–2025年),并进行了人工标注的语言类型与情感极性分类,其中11,309条还包含用户评分。在此基础上,作者定义了两种情感任务(三分类情感极性与五级评分分类),并通过对比传统词袋模型(BoW/TF-IDF)与多语言Transformer模型(mBERT、XLM-RoBERTa、RemBERT)的性能,验证了后者在情感极性分类上的显著优势,同时揭示了评分分类因类别不平衡和相邻评分等级差异细微而仍具挑战性。

链接: https://arxiv.org/abs/2605.08600
作者: Rustem Yeshpanov
机构: Astana, Kazakhstan
类目: Computation and Language (cs.CL)
备注: 10 pages, 1 figure, 8 tables, to appear in Proceedings of the 6th International Conference on Natural Language Processing for the Digital Humanities (NLP4DH 2026)

点击查看摘要

Abstract:We present a new publicly available corpus of 100,502 movie reviews from Kazakhstan collected from this http URL, spanning 2001-2025 and covering 4,943 unique titles. The dataset is multilingual, consisting mainly of Russian reviews alongside Kazakh and code-switched texts. Reviews are manually annotated for language and sentiment polarity, and 11,309 reviews additionally contain explicit user-provided ratings. We define two sentiment tasks – three-way polarity classification and five-class score classification – and benchmark classical BoW/TF-IDF baselines against multilingual transformer models (mBERT, XLM-RoBERTa, RemBERT). Experimental results show that transformer models consistently outperform classical baselines on polarity classification, while score classification remains challenging under leakage-controlled evaluation due to severe class imbalance and subtle distinctions between adjacent rating levels.

[NLP-213] Source or It Didn’t Happen: A Multi-Agent Framework for Citation Hallucination Detection

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在科学写作中产生的“引用幻觉”(citation hallucination)问题,即LLM生成看似合理但无法通过文献验证的虚假引用。现有检测方法多依赖于二元判断(存在/不存在),且受限于脆弱的解析或不完整的检索机制,缺乏对领域层面的细粒度判别能力。其解决方案的关键在于提出一个12类分类体系(taxonomy),涵盖真实(Real)、潜在(Potential)和幻觉(Hallucinated)三类引用,并构建CiteTracer——一个级联式多智能体检测系统:该系统首先结构化提取PDF与BibTeX中的引用信息,通过缓存查找、URL获取、学术连接器及网络搜索多源证据,执行确定性领域匹配,再将模糊案例路由至专业领域专家智能体进行判定,从而实现高精度、低误报的场域级引用真实性评估。

链接: https://arxiv.org/abs/2605.08583
作者: Mingzhe Li,Zhiqiang Lin,Shiqing Ma
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); The Ohio State University (俄亥俄州立大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models are increasingly used in scientific writing, yet they can fabricate citation-shaped references that appear plausible but fail bibliographic verification. Existing detectors often reduce verification to binary found/not-found decisions and rely on brittle parsing or incomplete retrieval, offering little field-level signal to auditors. We reframe citation hallucination detection as taxonomy-aligned field-level adjudication and introduce a 12-code taxonomy spanning Real, Potential, and Hallucinated citations. Based on this taxonomy, we build CiteTracer, a cascading multi-agent detector that extracts structured citations from PDF and BibTeX, retrieves evidence through cache lookup, URL fetch, scholar connectors, and web search, applies deterministic field matching, and routes ambiguous cases to class-specialist judgers. We release a benchmark of 2,450 synthetic citations built from real seeds with controlled LLM mutations, paired with 957 real-world fabricated citations drawn from ICLR 2026 and an anonymous conference desk-rejected submissions. CiteTracer reaches 97.1% accuracy on the synthetic benchmark, with class-level F1 scores of 97.0, 95.8, and 98.5 for Real, Potential, and Hallucinated, respectively, and detects 97.1% of fabrications on the real-world set without abstaining. Code: this https URL.

[NLP-214] Coordinates of Capability: A Unified MTMM-Geometric Framework for LLM Evaluation

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)评估中存在构念效度(construct validity)不足的问题,即现有碎片化的基准测试和随意设计的指标常将方法变异性(如提示敏感性)与模型的真实潜在能力混淆。其解决方案的关键在于提出一个广义的多特质多方法(Multi-Trait Multi-Method, MTMM)框架,通过形式化并统一九种评估指标(如改写不稳定性、漂移分数、Overton宽度和多元性分数),将其视为共享潜在坐标空间中的几何测量,从而将模型行为分解为三个正交的潜在维度:(1) 不稳定性和敏感性,(2) 位置与对齐度,(3) 覆盖范围与表达力。这一空间统一使得任务无关扰动与真实能力跨度得以系统分离,为稳健且经验稳定的基准设计提供了理论基础和领域无关的分类体系。

链接: https://arxiv.org/abs/2605.08522
作者: Adib Sakhawat,Tahsin Islam,Takia Farhin,Syed Rifat Raiyan,Hasan Mahmud,Md Kamrul Hasan
机构: Islamic University of Technology, Dhaka, Bangladesh
类目: Computation and Language (cs.CL)
备注: 19 pages, 12 figures, Systematization of Knowledge (SoK) paper

点击查看摘要

Abstract:The evaluation of Large Language Models (LLMs) faces a critical challenge in construct validity, where fragmented benchmarks and ad hoc metrics frequently conflate method variance, such as prompt sensitivity, with true latent capabilities. Concurrently, emerging research suggests that LLM capabilities and outputs can be modeled as continuous geometric manifolds. In this Systematization of Knowledge (SoK), we bridge these paradigms by proposing a generalized Multi-Trait Multi-Method (MTMM) framework for LLM evaluation. We formalize and unify nine evaluation metrics, including Paraphrase Instability, Drift Score, Overton Width, and Pluralism Score, interpreting them not as isolated scalar values but as geometric measurements within a shared latent coordinate space. This spatial unification factorizes model behavior into three orthogonal latent dimensions: (1) Instability and Sensitivity, (2) Position and Alignment, and (3) Coverage and Expressiveness. By systematically separating task-irrelevant perturbations from true capability spans, the framework provides a theoretically grounded and domain-agnostic taxonomy for robust and empirically stable benchmark design.

[NLP-215] A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models

【速读】: 该论文旨在解决大语言模型中安全对齐(safety alignment)机制的脆弱性问题,即当前模型的安全防护是否真正稳健地分布于整个网络权重中,还是依赖于特定神经元的因果作用。研究发现,安全对齐实际上由两类机制明确区分的神经元驱动:拒绝神经元(refusal neurons)负责控制有害知识的输出与否,而概念神经元(concept neurons)则编码有害知识本身。解决方案的关键在于识别并操控这些特定神经元——通过抑制任一拒绝神经元即可绕过安全机制,使模型在面对显式有害请求时产生不当响应;反之,通过放大概念神经元可诱导模型从看似无害的提示中生成有害内容。这一结果表明,现有安全对齐并非全局鲁棒,而是依赖于少数关键神经元的因果功能,从而揭示了模型安全机制的本质弱点。

链接: https://arxiv.org/abs/2605.08513
作者: Hamid Kazemi,Atoosa Chegini,Maria Safi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Safety alignment in language models operates through two mechanistically distinct systems: refusal neurons that gate whether harmful knowledge is expressed, and concept neurons that encode the harmful knowledge itself. By targeting a single neuron in each system, we demonstrate both directions of failure – bypassing safety on explicit harmful requests via suppression, and inducing harmful content from innocent prompts via amplification – across seven models spanning two families and 1.7B to 70B parameters, without any training or prompt engineering. Our findings suggest that safety alignment is not robustly distributed across model weights but is mediated by individual neurons that are each causally sufficient to gate refusal behavior – suppressing any one of the identified refusal neurons bypasses safety alignment across diverse harmful requests.
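
A suppression or amplification intervention on a single neuron can be expressed as a forward hook that rescales one activation dimension, as in the toy sketch below. The model, layer, and neuron index here are placeholders; identifying an actual refusal or concept neuron in a real LLM is model-specific and not shown.

```python
# Toy single-neuron intervention via a forward hook (illustrative; not the paper's procedure).
import torch
import torch.nn as nn

def neuron_intervention(layer, neuron_idx, scale):
    """Multiply one neuron's activation by `scale` (0.0 suppresses it, >1.0 amplifies it)."""
    def hook(_module, _inputs, output):
        output = output.clone()
        output[..., neuron_idx] = output[..., neuron_idx] * scale
        return output                      # returning a value replaces the layer output
    return layer.register_forward_hook(hook)

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
handle = neuron_intervention(model[1], neuron_idx=3, scale=0.0)   # suppress hidden neuron 3
print(model(torch.randn(2, 8)))
handle.remove()                                                   # restore normal behavior
```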

[NLP-216] A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)中“巨量激活”(massive activations)问题,即在特定层(称为ME Layer,Massive Emergence Layer)中激活值异常增大并沿残差连接传播,导致隐藏状态表示多样性降低,进而影响注意力模块的表达能力。该现象可能引发“注意力陷阱”(attention sinks),限制模型性能。解决方案的关键在于识别ME Layer中RMSNorm与前馈网络(Feed-Forward Network, FFN)参数共同作用形成巨量激活的机制,并提出一种简单有效的策略,通过选择性削弱巨量激活项的影响力来缓解其刚性,从而提升模型在指令遵循和数学推理等任务上的表现,且适用于训练-free和微调两种场景。

链接: https://arxiv.org/abs/2605.08504
作者: Zeru Shi,Zhenting Wang,Fan Yang,Qifan Wang,Ruixiang Tang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We investigate the origins of massive activations in large language models (LLMs) and identify a specific layer named the Massive Emergence Layer (ME Layer), which is consistently observed across model families, where massive activations first emerge and subsequently propagate to deeper layers through residual connections. We show that, within the ME Layer, both the RMSNorm and the FFN parameters jointly contribute to the emergence of massive activations. Once formed, the massive activation token representation remains largely invariant across layers, reducing the diversity of hidden representations passed to the attention module. Motivated by this limitation, we propose a simple and effective method to reduce the rigidity of the massive activation token. Our approach consistently improves LLM performance across multiple tasks, including instruction following and math reasoning, in both training-free and fine-tuning settings. Moreover, we show that our method mitigates attention sinks by selectively weakening their influence, elucidating their origin at the hidden state level and shedding new light on principled mitigation strategies.
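
A simple operational probe for massive activations is to flag hidden-state entries whose magnitude dwarfs the layer's typical activation, as sketched below. The ratio threshold is an arbitrary assumption and not the paper's criterion for locating the ME Layer.

```python
# Rough probe for massive activations across layers (threshold is an assumption).
import torch

def find_massive_activations(hidden_states, ratio=100.0):
    """hidden_states: list of (seq_len, d) tensors, one per layer."""
    report = []
    for layer_idx, h in enumerate(hidden_states):
        mags = h.abs()
        if mags.max() > ratio * mags.median():
            tok, dim = divmod(int(mags.argmax()), h.size(-1))   # flattened index -> (token, dim)
            report.append({"layer": layer_idx, "token": tok, "dim": dim, "value": float(mags.max())})
    return report

states = [torch.randn(32, 64) for _ in range(4)]
states[2][0, 7] = 500.0        # synthetic massive activation at layer 2, token 0, dim 7
print(find_massive_activations(states))
```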

[NLP-217] ShifaMind: A Multiplicative Concept Bottleneck for Interpretable ICD-10 Coding

【速读】: 该论文旨在解决临床出院摘要自动编码为ICD-10代码时面临的两个核心挑战:一是长尾多标签分类任务中的高精度要求,二是模型对临床医生的可解释性需求。传统概念瓶颈模型(Concept Bottleneck Models, CBMs)虽能提供人类可理解的概念路径,但其将丰富临床文本表示压缩至狭窄的概念层会限制梯度传播并削弱预测能力。本文提出的解决方案是ShifaMind架构,其关键创新在于引入乘法型概念瓶颈(Multiplicative Concept Bottleneck, MCB)——通过在概念驱动的表示上施加一个可学习的乘法门控机制,而非传统地投影到单一窄层,从而在保持标量概念接口以供审查的同时,保留更丰富的特征表达能力。实验表明,ShifaMind在MIMIC-IV数据集上的性能与LAAT等最强基线相当,并显著优于五种其他ICD编码模型,在预测准确性和可解释性指标上均优于容量匹配的普通CBM,验证了瓶颈设计方式的重要性。

链接: https://arxiv.org/abs/2605.08482
作者: Mohammed Sameer Syed,Xuan Lu
机构: University of Arizona (亚利桑那大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automated ICD-10 coding from clinical discharge summaries requires models that are both accurate on long-tailed multi-label classification tasks and interpretable to clinicians. Concept Bottleneck Models (CBMs) offer a principled framework for interpretability by routing predictions through human-interpretable concepts, but this transparency often comes at a cost: compressing rich clinical text representations into a narrow concept layer can restrict gradient flow and limit predictive capacity. We present ShifaMind, a concept-grounded architecture built around a Multiplicative Concept Bottleneck (MCB), which changes the form, rather than the width, of the bottleneck. Instead of projecting through a narrow concept layer, ShifaMind uses a learned multiplicative gate over a concept-grounded representation while retaining a scalar concept interface for inspection. On MIMIC-IV top-50 ICD-10 coding, ShifaMind achieves performance competitive with LAAT, the strongest baseline, across F1, AUC, and ranking metrics, while outperforming five additional ICD-coding baselines and providing concept-mediated explanations. Its substantial gains over a capacity-matched Vanilla CBM in both predictive performance and interpretability-oriented metrics highlight the importance of the bottleneck design.
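
The multiplicative bottleneck can be sketched as a module that keeps scalar concept scores inspectable while using them only to gate, rather than replace, the text representation. Dimensions, the sigmoid gate, and the classifier head below are illustrative assumptions, not ShifaMind's exact architecture.

```python
# Illustrative multiplicative concept gate (a sketch, not the ShifaMind implementation).
import torch
import torch.nn as nn

class MultiplicativeConceptBottleneck(nn.Module):
    def __init__(self, d_model, n_concepts, n_labels):
        super().__init__()
        self.concept_head = nn.Linear(d_model, n_concepts)    # interpretable scalar concept scores
        self.gate_proj = nn.Linear(n_concepts, d_model)       # concepts -> multiplicative gate
        self.classifier = nn.Linear(d_model, n_labels)

    def forward(self, h):
        concepts = torch.sigmoid(self.concept_head(h))        # (batch, n_concepts), inspectable
        gated = h * torch.sigmoid(self.gate_proj(concepts))   # modulate, don't replace, h
        return self.classifier(gated), concepts

model = MultiplicativeConceptBottleneck(d_model=256, n_concepts=32, n_labels=50)
logits, concepts = model(torch.randn(4, 256))
print(logits.shape, concepts.shape)    # torch.Size([4, 50]) torch.Size([4, 32])
```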

[NLP-218] Do Agents Need to Plan Step-by-Step? Rethinking Planning Horizon in Data-Centric Tool Calling

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)代理在执行复杂数据导向任务时,因规划策略选择不当而导致的效率低下问题。现有方法主要分为全视野规划(Full-Horizon, FH)和单步规划(Single-Step Horizon, SH),其中SH默认采用逐步执行与即时监控以提升适应性,但作者质疑这一假设在结构清晰的数据任务中是否必要。解决方案的关键在于通过受控实验证明:对于定义明确的数据导向任务,采用FH规划结合按需重规划(lazy replanning)策略可在保持与SH相当准确率的前提下,显著减少2–3倍的token消耗,从而表明在这些场景下无需强制进行早期、细粒度的步骤级监控。

链接: https://arxiv.org/abs/2605.08477
作者: Naoki Otani,Nikita Bhutani,Hannah Kim,Dan Zhang,Estevam Hruschka
机构: Megagon Labs(梅加贡实验室)
类目: Computation and Language (cs.CL)
备注: CAIS 2026

点击查看摘要

Abstract:Explicit planning is a critical capability for LLM-based agents solving complex data-centric tasks, which require precise tool calling over external data sources. Existing strategies fall into two paradigms based on planning horizon: (1) full-horizon (FH), which generates a complete plan before execution, and (2) single-step horizon (SH), which interleaves each action (tool call) with incremental reasoning and observation. While step-by-step execution is a common default under the assumption that eager execution monitoring is necessary for adaptability, we revisit this assumption for well-defined data-centric tasks. Our controlled empirical study isolates planning horizon as the key architectural feature and systematically analyzes the effects of topological complexity and tool robustness on both paradigms. Our experiments across Knowledge Base Question Answering and Multi-hop QA show that FH planning with lazy replanning achieves accuracy parity with SH across varying depths, breadths, and robustness levels, while using 2-3x fewer tokens. These findings suggest that for well-defined data-centric tasks, eager step-wise monitoring is often unnecessary, and full-horizon planning with on-demand replanning can offer a more efficient default.

[NLP-219] A Computational Operationalisation of Competing Maturational Theories of Syntactic Development via Statistical Grammar Induction

【速读】: 该论文旨在解决儿童在第一语言习得过程中,如何逐步获得中间句法范畴(intermediate syntactic categories)及其获取顺序的问题。研究聚焦于两种对立的成熟理论:自下而上假说(GROWING)认为词法和屈折结构优先发展,而向内假说(INWARD)则预测早期即可获得与话语相关的句法范畴。解决方案的关键在于通过统计语法归纳(statistical grammar induction)对这两种假说进行计算建模,从而在输入和学习算法保持一致的前提下,系统评估不同句法范畴获取顺序对可学习性的差异影响。实验结果表明,在三项评估指标上,GROWING假说显著优于INWARD假说,揭示了句法结构发展的阶段性顺序对语言习得效率的重要作用。

链接: https://arxiv.org/abs/2605.08476
作者: Mila Marcheva,Suchir Salhan,Weiwei Sun
机构: University of Cambridge (剑桥大学)
类目: Computation and Language (cs.CL)
备注: In Proceedings of the Annual Meeting of the Cognitive Science Society (CogSci) 2026. Presentation in Rio de Janeiro, Brazil

点击查看摘要

Abstract:This paper is concerned with what intermediate syntactic categories children acquire during first language development, and in what order. Maturational theories make different predictions. Bottom-up accounts (GROWING) propose that lexical and inflectional structure emerges first, while inward accounts (INWARD) predict early access to discourse-related categories. We computationally operationalise these hypotheses of staged syntactic emergence using statistical grammar induction, asking what each proposed ordering makes learnable when input and learning algorithm are held constant. Our framework makes category acquisition explicit and allows us to explore how different maturational orderings shape the structure that can be learned under identical conditions. Based on this operationalisation, the GROWING account significantly outperforms the INWARD account across three evaluation metrics.

[NLP-220] PYTHALAB-MERA: Validation-Grounded Memory Retrieval and Acceptance Control for Frozen-LLM Coding Agents

【速读】: 该论文旨在解决本地大语言模型(Local LLM)在代码生成任务中,如何通过执行反馈、持久状态和有限修复机制实现正确性的问题。现有方法如静态检索、长上下文提示、自精炼、执行反馈修复及基于模型权重的强化学习虽各有所长,但未能协同实现验证驱动的 episodic memory(情景记忆)、自适应检索-动作选择、延迟信用分配以及围绕冻结本地模型的结构化技能复用。其解决方案的关键在于提出 PYTHALAB-MERA——一个轻量级外部控制器,用于控制本地验证条件下的代码生成流程:该控制器决定哪些记忆记录和基于抽象语法树(AST)提取的技能应进入下一轮提示,通过快速失败管道验证候选方案,将验证结果转化为有界形状奖励,并利用 TD(λ) 风格的资格迹传播延迟信用。实验证明,在严格验证门控的强化学习编码任务中,PYTHALAB-MERA 在三个任务、三次重复、每次三尝试的约束下成功通过了 8/9 次验证,显著优于自精炼基线和 GRACE 扩展(均为 0/9),验证了外部记忆与检索控制器在特定设定下对验证成功率的提升作用。

链接: https://arxiv.org/abs/2605.08468
作者: Mehmet Iscan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 28 pages, 4 figures, 7 tables; local CLI artifact evaluation

点击查看摘要

Abstract:Local LLM-based coding agents increasingly work in settings where correctness is earned through execution feedback, persistent state, and bounded repair, not through a single fluent answer. Static retrieval, long-context prompting, self-refinement, execution-feedback repair, and reinforcement learning over model weights each address part of this setting, but they do not jointly provide validation-grounded episodic memory, adaptive retrieval-action selection, delayed credit assignment, and structural skill reuse around a frozen local model. We introduce PYTHALAB-MERA, a lightweight external controller for local validation-conditioned code generation. The frozen language model proposes complete source files; the controller decides which memory records and AST-derived skills should enter the next prompt, validates each candidate through a fail-fast pipeline, converts validation outcomes into bounded shaped rewards, and propagates delayed credit through TD(lambda)-style eligibility traces. We evaluate the implementation as a local CLI artifact on reinforcement-learning coding tasks with strict validation gates. In the measured hard RL setting with three tasks, three repetitions, and a three-attempt budget, PYTHALAB-MERA passed 8/9 strict validations; the self-refinement baseline and the investigated GRACE extension each passed 0/9. These results support a deliberately bounded claim: in this recorded setting, the external memory-and-retrieval controller improved validation success. They do not establish general-purpose code synthesis, state-of-the-art performance, formal program correctness, or formal safety.
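
The delayed-credit component mentioned in the abstract is a TD(lambda)-style update with eligibility traces; a textbook tabular sketch with bounded shaped rewards is shown below. The controller's actual state representation and reward shaping are not described in the abstract, so everything here is generic.

```python
# Textbook TD(lambda) with accumulating eligibility traces and clipped (bounded) rewards.
import numpy as np

def td_lambda_update(values, trajectory, rewards, alpha=0.1, gamma=0.95, lam=0.8):
    """One-episode update of a tabular value function `values` over visited states."""
    traces = np.zeros_like(values)
    for t in range(len(trajectory) - 1):
        s, s_next = trajectory[t], trajectory[t + 1]
        r = np.clip(rewards[t], -1.0, 1.0)                # bounded shaped reward
        delta = r + gamma * values[s_next] - values[s]    # TD error
        traces *= gamma * lam                             # decay all traces
        traces[s] += 1.0                                  # bump the visited state
        values += alpha * delta * traces                  # propagate delayed credit backwards
    return values

v = np.zeros(5)
episode = [0, 1, 2, 3, 4]
rewards = [0.0, 0.0, 0.0, 1.0]                            # validation passes only at the end
print(td_lambda_update(v, episode, rewards))
```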

[NLP-221] Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment ECIR2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在上下文依赖场景下(如检索增强生成 RAG 和代理型 AI 系统)中普遍存在的“幻觉”问题,特别是针对摘要任务中的上下文幻觉检测难题。其解决方案的关键在于通过引入双盲、跨文化的人类仲裁机制(human adjudication process),对原始标注与 LLM 判断存在分歧的样本进行重新评估,从而提升基准数据集(QAGS-C 和 SummEval)标注的一致性与可靠性。实验表明,经过仲裁后,人类与模型之间的三重一致性显著提高,且模型准确率也相应上升,说明在高歧义任务中,单次人工标注可能不足,而基于模型推理的辅助再评估能够构建更稳健的评测基准。

链接: https://arxiv.org/abs/2605.08462
作者: I. F. Atasoy,B. Mutlu,E. A. Sezer,A. Wahdan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Presented at the ROMCIR Workshop at ECIR 2026

点击查看摘要

Abstract:Hallucination remains a persistent challenge in Large Language Models (LLMs), particularly in context-grounded settings such as RAG and agentic AI systems. This study focuses on contextual hallucination detection in summarization tasks. We analyze the QAGS-C and SummEval datasets by comparing original benchmark annotations with reason and span-based predictions from Gemini 2.5 Flash and GPT-5 Mini. To address systematic divergences between human labels and LLM judgments, we re-evaluated all conflicted samples through a human adjudication process involving 2 cross-cultural adjudicators. Following this re-evaluation, triple agreement (between human, GPT, and Gemini) increased by 6.38% for QAGS-C and 7.62% for SummEval. Similarly, model accuracy improved, with GPT increasing by 4.25% on QAGS-C and 2.34% on SummEval, while Gemini showed gains of 8.51% and 3.80%, respectively. Notably, adjudicators frequently sided with the models’ judgments over original human annotations when LLMs provided explicit reasoning. Overall human adjudicator agreement ranged between 83% and 87%. These findings suggest that for ambiguity-prone tasks, single-pass annotations may be insufficient, and model-assisted re-evaluation yields more reliable benchmarks.

[NLP-222] LLM-guided Semi-Supervised Approaches for Social Media Crisis Data Classification

【速读】: 该论文旨在解决灾难管理中社交媒体数据分类任务在标注样本稀缺情况下的性能瓶颈问题,尤其是在标签数据有限(如每类仅5、10或25个标注样本)时如何有效利用未标注数据提升模型表现。其解决方案的关键在于引入大语言模型(Large Language Model, LLM)引导的半监督学习方法,具体采用两种先进策略:LLM引导协同训练(LLM guided Co-Training, LG-CoTrain)和VerifyMatch。实验表明,LG-CoTrain在低资源场景下显著优于传统半监督方法,在多个灾难事件中实现最高的平均宏F1分数;而VerifyMatch则在保持竞争力的同时展现出良好的校准性能。此外,研究还发现紧凑的半监督模型有时可超越零样本大语言模型,揭示了通过LLM指导将知识迁移至轻量级模型的潜力,为实际灾难响应应用提供了可行路径。

链接: https://arxiv.org/abs/2605.08448
作者: Jacob Ativo,Bharaneeshwar Balasubramaniyam,Anh Tran,Khushboo Gupta,Hongmin Li,Doina Caragea,Cornelia Caragea
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Semi-supervised learning approaches have been investigated as a means to enhance the analysis of social media data in disaster management contexts. In this work, we present the first empirical evaluation of large language model (LLM) guided semi-supervised learning for crisis related tweet classification. We compare two recent LLM assisted semi-supervised methods, VerifyMatch and LLM guided Co-Training (LG-CoTrain), against established semi-supervised baselines. Our results show that LG-CoTrain significantly outperforms classical semi-supervised approaches in low resource settings with 5, 10 and 25 labeled examples per class, achieving the highest averaged Macro F1 across events. VerifyMatch achieves competitive performance while also demonstrating strong calibration properties. As the number of labeled examples increases, the performance gap narrows and Self Training emerges as a strong baseline. We further observe that compact semi-supervised models can, in some cases, outperform very large LLMs operating in zero-shot settings. This finding highlights the potential of transferring knowledge from LLMs into smaller and more deployable models through LLM guided semi-supervised learning, offering a practical pathway for real world disaster response applications. Our project repository on Github is here.

[NLP-223] Revisiting the syntax of imperatives in Yemeni Arabic: An Agree across phases approach

【速读】: 该论文旨在解决也门阿拉伯语中祈使句(imperative)句法结构的理论解释问题,特别是如何统一处理简单和复杂祈使句构造(包括A’-链结构),并阐明其与话语层面的互动机制。解决方案的关键在于引入跨阶段一致(Agree across phases, AAP)分析框架,该框架通过建立句法与话语之间的紧密联系,将祈使句的信息结构(informational structure)与其命题结构(propositional structure)相耦合,从而解释祈使句的阐释功能与施为功能。研究进一步指出,祈使句的主语是第二人称零代词(2-person pro),而句首出现的任何显性代词或名词成分并非主语,而是C域话题(C-domain element),即关于主题(aboutness topic),此类话题作为逻辑主语与pro构成核心指代关系(coreferentiality),并通过AAP中的匹配(Match)机制生成局部与非局部的A’-链。对于缺乏显性话题的核心祈使句,则假设存在一个空话题(null topic)在Spec,TopP位置重新合并(remerge),其语义解释依赖于话语环境。

链接: https://arxiv.org/abs/2605.08447
作者: Mohammed Q. Shormani
机构: 未知
类目: Computation and Language (cs.CL)
备注: 33 pages

点击查看摘要

Abstract:This article revisits the syntax of imperatives in Yemeni Arabic, proposing an Agree across phases (AAP) approach. I argue that the AAP approach successfully accounts for both simple and complex imperative constructions, including A’-chain structures, by establishing a close interaction between syntax and discourse. The study demonstrates that this interface is motivated by the interpretive and performative functions associated with imperatives, linking informational structure with propositional structure. It is also proposed that the thematic subject of imperatives is a 2-person pro, whereas any overt pronominal or nominal element occurring preverbally is not a subject, but rather a C-domain element, precisely an aboutness topic. These topics serve as the logical subjects of imperatives and enter into a coreferentiality relationship with pro. This relation is analyzed as AAP involving Match, yielding both local and non-local A’-chains. For core imperatives, viz. those lacking an overt topic, I propose a null topic to (re)merge in Spec,TopP, whose interpretation depends on the discourse.

[NLP-224] Can Language Models Identify Side Effects of Breast Cancer Radiation Treatments?

【速读】: 该论文旨在解决癌症治疗副作用信息在肿瘤幸存者管理中沟通不充分的问题,尤其是在知情同意等临床场景下,由于临床知识不足和电子健康记录(EHR)系统碎片化,导致医生难以全面、准确地传达放射治疗相关毒性反应。其解决方案的关键在于提出一个面向部署的应力测试框架,用于评估大语言模型(LLM)生成乳腺癌放疗副作用列表的可靠性;通过对比7个指令微调后的LLM在不同提示策略下的输出与由多名乳腺放射肿瘤学专家共同制定的临床参考标准,发现LLM对细微文档差异敏感、存在精确率与召回率权衡,并系统性低估罕见及长期副作用;进一步表明,限制副作用数量会降低精确率,而基于临床专家整理的副作用清单进行输出约束,可显著提升模型的可靠性和鲁棒性,从而为更安全、更具信息量的幸存者导向型应用提供实用设计路径。

链接: https://arxiv.org/abs/2605.08439
作者: Natalie Seah,Danielle S. Bitterman,Daphna Spiegel,Thomas Hartvigsen
机构: University of Virginia; Mass General Brigham; Dana-Farber Cancer Institute; Harvard Medical School
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Accurately communicating the side effects of cancer treatments to cancer survivors is critical, particularly in settings such as informed consent, where clinicians must clearly and comprehensively convey potential treatment toxicities. However, this task remains challenging due to clinical knowledge deficits about adverse treatment effects and fragmentation across electronic health record (EHR) systems. Large language models (LLMs) have the potential to assist in this task, though their reliability in oncology survivorship contexts remains poorly understood. We present a deployment-oriented stress-testing framework for evaluating LLM-generated radiation side effect lists in breast cancer treatment and survivorship care. Using 21 breast cancer patient profiles, we construct paired patient clinical scenarios that differ only in radiotherapy regimens to evaluate seven instruction-tuned LLMs under multiple prompting regimes. We then compare LLM outputs to a clinician-curated reference derived from informed consent documents at two major academic medical centers and developed by a team including more than seven breast radiation oncologists. The reference maps radiation dose-fractionation, fields, and locations to associated toxicities, broken down by frequency and temporal onset. Across models, we reveal sensitivity to minor documentation changes, trade-offs between precision and recall, and systematic under-recall of rare and long-term side effects. When used alone, constraints on the number of side effects generated reduce precision, and grounding outputs in clinician-curated side effect lists substantially improves reliability and robustness. These findings highlight important limitations of LLM use in oncology and suggest practical design choices for safer and more informative survivorship-focused applications.

[NLP-225] Magis-Bench: Evaluating LLM s on Magistrate-Level Legal Tasks

【速读】: 该论文旨在解决当前法律人工智能(Legal AI)评估体系中对生成式 AI(Generative AI)在司法裁判能力方面测评不足的问题。现有基准主要关注模型生成法律论证或文书的能力,而忽视了其作为裁判者进行法律推理、权衡争点、适用法理并作出合理裁决的核心职能。为此,作者提出了 Magis-Bench,一个基于巴西近年司法职位竞争性考试的真实判例任务集合,涵盖多轮讨论式法律分析与完整民事/刑事判决书撰写等 magistrate-level 写作任务。其关键创新在于采用 LLM-as-a-judge 方法,利用四个独立的前沿大模型作为评判者对输出结果进行评分,从而客观衡量 LLM 在司法语境下的推理与写作水平,实证结果显示即使最优模型得分也未达满分的 70%,凸显当前生成式 AI 在司法级法律推理方面的显著挑战。

链接: https://arxiv.org/abs/2605.08437
作者: Ramon Pires,Thales Sales Almeida,Celio Larcher Junior,Giovana Bonás,Hugo Abonizio,Marcos Piau,Roseval Malaquias Junior,Thiago Laitz,Rodrigo Nogueira
机构: Maritaca AI(马里塔卡人工智能); Jusbrasil(朱斯布拉西尔)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing benchmarks for legal AI focus primarily on tasks where LLMs must produce legal arguments or documents, yet the capacity to judge such arguments – weighing competing claims, applying doctrine to facts, and rendering reasoned decisions – is arguably as fundamental to a well-functioning legal system as advocacy itself. We introduce Magis-Bench, a benchmark for evaluating LLMs on magistrate-level writing tasks derived from recent Brazilian competitive examinations for judicial positions. Magis-Bench comprises 74 questions from eight examinations conducted between 2023 and 2025, including discursive legal analysis questions with multi-turn structure and practical exercises requiring the composition of complete civil and criminal judicial sentences. We evaluate 23 state-of-the-art LLMs using an LLM-as-a-judge methodology with four independent frontier models as evaluators. Our results show strong inter-judge agreement (Kendall’s W = 0.984; pairwise Kendall’s τ ≥ 0.897), with Google’s Gemini-3-Pro-Preview achieving the highest average score (6.97/10), followed by Gemini-3-Flash-Preview (6.67) and Claude-4.5-Opus (6.46). Even the best-performing models score below 70% of the maximum, indicating that judicial-level legal reasoning and writing remain challenging for current LLMs. We release the complete benchmark, model outputs, and evaluation code to support further research on legal AI capabilities.

[NLP-226] A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在开放问答(open-ended question answering, QA)场景中缺乏可靠校准评估方法的问题。现有方法如基于logit的指标依赖受限输出格式、自报告置信度易过高,以及采样类方法需任务特定规则且无明确有限样本目标,均难以满足实际部署需求。解决方案的关键在于提出Sem-ECE(Semantic-Sampling Expected Calibration Error)框架:通过从模型中采样答案并按语义聚类,利用类别频率作为置信度估计;在此基础上设计两种渐近无偏的估计器——Sem₁-ECE(同样本自一致性评分)和Sem₂-ECE(保留样本变体),其中后者在困难问题上表现出更小的校准误差,并能通过二者差距诊断题目难度。实验证明该方法优于传统自报告置信度与已有采样方法,在内部概率不可用时亦可与logit-based评估互补。

链接: https://arxiv.org/abs/2605.08432
作者: Zhanliang Wang,Jiancong Xiao,Ruochen Jin,Shu Yang,Bojian Hou,Li Shen
机构: University of Pennsylvania (宾夕法尼亚大学); Dartmouth College (达特茅斯学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Preprint

点击查看摘要

Abstract:Calibration measures whether a model’s predicted confidence aligns with its empirical accuracy, and is central to the reliable deployment of large language models (LLMs) in high-stakes domains such as medicine and law. While much recent work focuses on improving LLM calibration, the equally important question of how to evaluate it in realistic settings remains underdeveloped. Open-ended question answering (QA), the most common deployment setting for modern LLMs, is where existing evaluation methods fall short: logit-based metrics need restricted output formats and internal probabilities; verbalized confidence is self-reported and often overconfident; and sampling-based methods rely on task-specific extraction rules without a clear finite-sample target. We introduce Sem-ECE (Semantic-Sampling Expected Calibration Error), a calibration evaluation framework for open-ended QA that samples answers from the model, groups them into semantic classes, and uses the resulting frequencies as confidence. We study two estimators within this framework: Sem₁-ECE, the same-sample self-consistency score, and Sem₂-ECE, a held-out variant that separates answer selection from confidence evaluation. We prove both are asymptotically unbiased, and further show that they agree on easy questions but diverge on hard ones with Sem₂ achieving strictly smaller calibration error, so their gap also serves as a diagnostic for question difficulty. Experiments on three open-ended QA benchmarks across five leading commercial LLMs match our theoretical predictions and show that Sem-ECE outperforms verbalized confidence and existing sampling-based methods, while complementing logit-based evaluation when internal probabilities are unavailable.
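The framework's core recipe is to sample several answers per question, cluster them into semantic classes, treat the largest class's frequency as the confidence, and bin (confidence, correctness) pairs into an expected calibration error. Below is a minimal sketch of that recipe with a pluggable equivalence predicate; the greedy clustering and the bin count are assumptions, not the authors' estimators.

```python
# Minimal sketch of a Sem-ECE-style estimator: sample answers, group them into
# semantic classes, use class frequencies as confidence, then bin into an ECE.
# The greedy clustering and 10-bin ECE are illustrative choices.
from collections import Counter

def semantic_confidence(answers, semantically_equal):
    """Cluster sampled answers greedily; return (representative, frequency) of the largest class."""
    clusters = []  # list of lists of mutually equivalent answers
    for ans in answers:
        for cluster in clusters:
            if semantically_equal(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    best = max(clusters, key=len)
    return best[0], len(best) / len(answers)

def expected_calibration_error(records, n_bins=10):
    """records: list of (confidence, is_correct) pairs."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in records:
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, correct))
    ece, total = 0.0, len(records)
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(1 for _, ok in b if ok) / len(b)
            ece += len(b) / total * abs(avg_conf - acc)
    return ece

# toy usage with case-insensitive exact match as the equivalence predicate
rep, conf = semantic_confidence(["Paris", "paris", "Lyon", "Paris"],
                                lambda a, b: a.lower() == b.lower())
print(rep, conf, expected_calibration_error([(conf, rep.lower() == "paris")]))
```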

[NLP-227] Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms

【速读】: 该论文旨在解决标准低秩适配(Low-Rank Adaptation, LoRA)方法在参数高效微调大型神经网络时存在的局限性:即静态的低秩参数化难以适应输入依赖的修正需求以及网络深度计算过程中动态变化的特征表示。其解决方案的关键在于引入一个可查询的低秩更新原子记忆模块(queryable memory of low-rank update atoms),通过注意力机制实现内容相关的更新组件路由,从而在保持低秩瓶颈效率的同时,使每层的更新操作能够根据当前输入和历史层状态动态调整,并共享跨层的可复用结构。此外,该方法还通过语言诱导先验对路由 logits 进行正则化,引导选择语义相关方向的低秩变换,避免生成无约束的参数更新,提升了模型在噪声非线性回归任务和大语言模型(LLM)微调中的测试性能与训练稳定性。

链接: https://arxiv.org/abs/2605.08423
作者: Omatharv Bharat Vaidya,Connor T. Jerzak,Nhat Ho,Chandrajit Bajaj
机构: University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We present a data-adaptive method for parameter-efficient fine-tuning of large neural networks. Standard low-rank adaptation methods improve efficiency by restricting each layer update to a fixed low-rank form, but this static parameterization can be too rigid when the appropriate correction depends on the input and on the evolving depth-wise computation of the network. Our approach replaces a purely layer-local adapter with a shared queryable memory of low-rank update atoms. For each block of layers, the model forms a query from the current low-rank state and a running summary of previous blocks, uses this query to retrieve a content-dependent combination of shared update components via attention, and applies the resulting routed operator within the low-rank bottleneck. In this way, the method retains the efficiency and scalability of low-rank adaptation while allowing the effective update to vary across inputs and to share reusable structure across layers. The resulting architecture provides a principled middle ground between static LoRA-style updates and fully generated parameter updates: it remains compact and parameter-efficient while supporting dynamic, context-sensitive adaptation. Further, we incorporate instruction-regularization by augmenting routing logits with a language-induced prior over update atoms, thereby biasing the selection of low-rank transformations toward semantically relevant directions without generating unconstrained parameter updates. Experiments on noisy non-linear regression tasks and LLM fine-tuning suggest that this queryable update-memory formulation can improve final test performance and training stability compared to standard low-rank adaptation, while using a comparable number of trainable parameters.
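The described mechanism forms a query from the current low-rank state plus a running depth-wise summary, attends over a shared memory of low-rank update atoms, and applies the routed operator inside the LoRA bottleneck. A minimal PyTorch sketch of that structure follows; the module name QueryableLoRA, the dimensions, and the summary update rule are assumptions rather than the paper's implementation.

```python
# Minimal PyTorch sketch of attention-routed low-rank update atoms inside a
# LoRA-style bottleneck. Shapes, init scales, and the summary rule are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryableLoRA(nn.Module):
    def __init__(self, d_model: int, rank: int = 8, n_atoms: int = 16):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)    # hidden -> low-rank state
        self.up = nn.Linear(rank, d_model, bias=False)      # low-rank state -> hidden update
        self.atoms = nn.Parameter(torch.randn(n_atoms, rank, rank) * 0.02)  # shared update atoms
        self.atom_keys = nn.Parameter(torch.randn(n_atoms, rank))
        self.query_proj = nn.Linear(2 * rank, rank)

    def forward(self, hidden, running_summary):
        # hidden: (B, S, d_model); running_summary: (B, rank) from earlier blocks
        z = self.down(hidden)                                            # (B, S, r)
        q = self.query_proj(torch.cat([z.mean(dim=1), running_summary], dim=-1))  # (B, r)
        weights = F.softmax(q @ self.atom_keys.T, dim=-1)                # (B, n_atoms) routing
        routed = torch.einsum("bn,nij->bij", weights, self.atoms)        # (B, r, r) mixed operator
        z = torch.einsum("bsr,brj->bsj", z, routed)                      # apply inside the bottleneck
        new_summary = running_summary + z.mean(dim=1).detach()           # update depth-wise summary
        return hidden + self.up(z), new_summary
```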

[NLP-228] Effective Explanations Support Planning Under Uncertainty

【速读】: 该论文旨在解决如何有效生成和评估程序性说明(procedural explanation)的问题,即如何让语言指令在不确定性环境下被准确转化为可执行的动作规划,并提升导航等任务中的行为效率与可靠性。其解决方案的关键在于构建一个计算模型:利用大语言模型(large language model)将自然语言解释转换为类程序化的指导信息(即策略先验和价值图),再由规划代理在部分可观测环境中执行该策略;通过衡量路径的效率与可靠性并惩罚重复规划,实现对解释质量的量化评分,从而揭示语言如何在行动层面被具身化(grounded into action)以优化沟通效果。

链接: https://arxiv.org/abs/2605.08406
作者: Hanqi Zhou,Britt Besch,Charley M. Wu,Tobias Gerstenberg
机构: University of Tübingen (图宾根大学); Technical University Darmstadt (达姆施塔特工业大学); Hessian.AI (黑森人工智能); Max Planck Institute for Biological Cybernetics (马克斯普朗克生物控制论研究所); University of Cambridge (剑桥大学); Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: CogSci 2026

点击查看摘要

Abstract:Explaining how to get from A to B can be challenging. It requires mentally simulating what the listener will do based on what they are told. To capture this process, we propose a computational model that converts utterances into action plans: a large language model translates an explanation into program-like guidance (a policy prior and value map), and a planning agent executes it under partial observability. We score explanations by the efficiency and reliability of the resulting paths, penalizing replanning. Across four preregistered experiments, we collect a corpus of 1,200 explanations over 24 maps, elicit helpfulness judgments, measure baseline navigation, and test behavior with explanations of differing quality. Higher-scored explanations are judged more helpful and improve navigation: participants with explanations outperform those without, and high-scoring explanations help more than low-scoring ones. Together, these results show procedural explanation as utility-guided communication shaped by how language can be grounded into action under uncertainty.

[NLP-229] Built Environment Reasoning from Remote Sensing Imagery Using Large Vision–Language Models

【速读】: 该论文旨在解决如何利用大语言模型(Large Language Models, LLMs)提升智慧城市中对建成环境的智能分析与决策支持问题。其核心挑战在于如何有效融合遥感影像数据与文本生成能力,以实现对城市设计建议、可建造性评估、土地利用模式识别及风险探测等任务的自动化推理。解决方案的关键在于将多尺度遥感影像作为输入,构建多模态语言模型,从而增强LLMs在建成环境相关推理任务中的准确性与可靠性;研究进一步对比了InternVL和Qwen等先进LLM在生成环境建议时的表现,验证了遥感数据与语言模型结合在智慧城市应用中的潜力。

链接: https://arxiv.org/abs/2605.08404
作者: Dongdong Wang,Deepak Balakrishnan,Ravi Srinivasan,Shenhao Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
备注: Published in the International Conference on Industrialized Construction 2026

点击查看摘要

Abstract:This work investigates the use of large language models (LLMs) for tasks in smart cities. The core idea is to leverage remote sensing imagery to characterize the built environment, including design suggestions, constructability assessment, land-use patterns, and risk identification. We examine remote sensing imagery at multiple spatial scales as inputs for multimodal language modeling and evaluate their effects on built-environment-related reasoning. In addition, we compare state-of-the-art LLMs, including InternVL and Qwen, in terms of accuracy and reliability when generating built environment recommendations. The results demonstrate the potential of integrating remote sensing imagery with large language models to assist smart cities and decision-making.

[NLP-230] AIPO: Learning to Reason from Active Interaction

【速读】: 该论文旨在解决当前基于强化学习(Reinforcement Learning, RL)的大型语言模型(Large Language Models, LLMs)在推理能力提升过程中面临的核心瓶颈:探索受限于策略模型自身的固有能力边界,导致难以突破现有推理局限。现有方法虽引入外部专家示范以扩展这一边界,但多依赖完整的轨迹级指导,存在样本效率低、信息稀疏且探索空间静态等问题。其解决方案的关键在于提出一种名为AIPO(Agent-Interactive Policy Optimization)的增强型强化学习框架,通过在训练阶段引入三个功能协同代理——验证代理(Verify Agent)、知识代理(Knowledge Agent)和推理代理(Reasoning Agent),使策略模型能在遇到推理瓶颈时主动发起交互式咨询,从而获得细粒度、目标导向的引导,实现能力边界的动态扩展;同时设计了定制化的重要性采样系数与裁剪策略,有效缓解从代理反馈中学习时产生的离策略偏差和梯度消失问题,最终使训练后的模型具备独立推理能力并显著提升多个基准测试上的表现。

链接: https://arxiv.org/abs/2605.08401
作者: Junnan Liu,Linhao Luo,Thuy-Trang Vu,Gholamreza Haffari
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have demonstrated remarkable reasoning capabilities, largely stimulated by Reinforcement Learning with Verifiable Rewards (RLVR). However, existing RL algorithms face a fundamental limitation: their exploration remains largely constrained by the inherent capability boundary of the policy model. Although recent methods introduce external expert demonstrations to extend this boundary, they typically rely on complete trajectory-level guidance, which is sample-inefficient, information-sparse, and may confine exploration to a static guidance space. Inspired by the potential of multi-agent systems, we propose AIPO, an enhanced reinforcement learning framework that improves LLM reasoning through active multi-agent interaction during exploration. Specifically, AIPO enables the policy model to proactively consult three functional collaborative agents, Verify Agent, Knowledge Agent, and Reasoning Agent, when encountering reasoning bottlenecks, thereby receiving fine-grained and targeted guidance to actively expand its capability boundary during training. We further introduce a tailored importance sampling coefficient together with a clipping strategy to mitigate the off-policy bias and gradient vanishing issues that arise when learning from agent-provided feedback. After training, the policy model performs reasoning independently without relying on collaborative agents. Extensive experiments on diverse reasoning benchmarks, including AIME, MATH500, GPQA-Diamond, and LiveCodeBench, show that AIPO consistently improves reasoning performance, generalizes robustly across different policy models and RLVR algorithms, and effectively expands the reasoning capability boundary of the policy model.

[NLP-231] jina-embeddings-v5-omni: Text-Geometry-Preserving Multimodal Embeddings via Frozen-Tower Composition

【速读】: 该论文旨在解决多模态嵌入模型(Multimodal Embedding Models)在训练效率与性能平衡方面的挑战,即如何在不显著增加计算成本的前提下,将文本、图像、音频和视频等多种模态统一到同一语义嵌入空间中。其解决方案的关键在于提出“冻结编码器模型组合”(frozen-encoder model composition)方法:通过扩展原有的文本嵌入模型(Jina Embeddings v5 Text),仅添加针对图像和音频的非文本编码器,并保持所有原始编码器和语言模型参数冻结,仅训练连接组件(占总参数量的0.35%)。这种方法在保证文本输入嵌入不变性的同时,实现了跨模态对齐,且性能接近当前最优的大规模多模态嵌入模型。

链接: https://arxiv.org/abs/2605.08384
作者: Florian Hönicke,Michael Günther,Andreas Koukounas,Kalim Akram,Scott Martens,Saba Sturua,Han Xiao
机构: 未知
类目: Computation and Language (cs.CL)
备注: 18 pages, 8 figures, 10 tables

点击查看摘要

Abstract:In this work, we introduce frozen-encoder model composition, a novel approach to multimodal embedding models. We build on the VLM-style architecture, in which non-text encoders are adapted to produce input for a language model, which in turn generates embeddings for all varieties of input. We present the result: the jina-embeddings-v5-omni suite, a pair of models that encode text, image, audio, and video input into a single semantic embedding space. Our method is to extend the two Jina Embeddings v5 Text models to support additional media by adding encoders for images and audio. The backbone text embedding models and the added non-text media encoders remain frozen. We only trained the connecting components, representing 0.35% of the total weights of the joint model. Training is therefore much more efficient than full-parameter retraining. Additionally, the language model remains effectively unaltered, producing exactly the same embeddings for text inputs as the Jina Embeddings v5 Text models. Our evaluations show that this approach produces results that are competitive with the state-of-the-art, yielding nearly equal performance to larger multimodal embedding models.
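The central design choice is to keep both the text backbone and the added media encoders frozen and to train only a small connector that maps encoder features into the language model's input space. A minimal PyTorch sketch of that freezing pattern is shown below; the connector architecture and the embed_media interface are illustrative assumptions, not the released model.

```python
# Minimal sketch of frozen-encoder model composition: both towers stay frozen,
# only a small connector into the language model's input space is trained.
# `text_backbone` and `image_encoder` are placeholder modules.
import torch.nn as nn

class FrozenTowerComposition(nn.Module):
    def __init__(self, text_backbone: nn.Module, image_encoder: nn.Module,
                 enc_dim: int, lm_dim: int):
        super().__init__()
        self.text_backbone = text_backbone
        self.image_encoder = image_encoder
        for p in self.text_backbone.parameters():
            p.requires_grad = False          # text embeddings stay identical for text-only input
        for p in self.image_encoder.parameters():
            p.requires_grad = False
        # the only trainable part: project encoder features into the LM's token space
        self.connector = nn.Sequential(nn.Linear(enc_dim, lm_dim), nn.GELU(),
                                       nn.Linear(lm_dim, lm_dim))

    def embed_media(self, media_batch):
        feats = self.image_encoder(media_batch)   # assumed (batch, n_patches, enc_dim)
        return self.connector(feats)              # soft tokens for the frozen language model

    def trainable_fraction(self) -> float:
        trainable = sum(p.numel() for p in self.parameters() if p.requires_grad)
        return trainable / sum(p.numel() for p in self.parameters())
```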

[NLP-232] Change My View? The Dynamics of Persuasion and Polarization in Online Discourse

【速读】: 该论文试图解决的问题是:在理性对话中,尽管共享证据和逻辑论证被普遍认为应促使观点趋同,但现实中的公共讨论(如Reddit上的辩论)却常常未能实现这一预期。为探究有效说服的关键因素,研究者利用大语言模型(Large Language Models, LLMs)对r/ChangeMyView论坛中的辩论进行分析,首先通过LLM预测信念转变的可能性以建立基线;随后采用人机协同编码方法识别每条回复中的十种常见修辞策略(如让步、共情、逻辑反驳等)。解决方案的关键在于发现:相较于纯粹的证据呈现或直接反驳,表达让步(concession)和情感共鸣(empathy)的策略显著提升信念改变的概率,而正面驳斥、可信度攻击和话题转移则显著降低其可能性。这表明,有效的公共推理不仅依赖于论据内容,更取决于关系性框架的构建。

链接: https://arxiv.org/abs/2605.08383
作者: David Freeborn,Malihe Alikani,Anthony Sicilia
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Philosophical accounts of persuasion often assume that shared evidence and rational argumentation should lead to a convergence of views between peers, yet everyday discourse often suggests otherwise. In this study, we use large language models to analyze a corpus of debates on Reddit’s r/ChangeMyView, where belief revision is publicly signaled. Large language models were asked, halfway through each discussion, to forecast whether such an acknowledgement would arise; their probabilistic estimates serve as a conversational baseline. Each reply was then coded, through a hybrid machine-assisted procedure, for ten familiar rhetorical strategies – concession, empathy, logical challenge, credibility appeals, and so forth. Adding these strategic features markedly improves predictive power and yields a consistent pattern: moves that express concession or empathetic alignment substantially increase the prospect of belief change, whereas frontal refutation, credibility attacks, and topic deflection diminish it. The findings indicate that effective public reasoning depends as much on relational framing as on evidential content, and they invite a refinement of normative accounts of rational dialogue.

[NLP-233] SecureForge: Finding and Preventing Vulnerabilities in LLM -Generated Code via Prompt Optimization

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在代码生成过程中引入未被察觉的网络安全漏洞的问题,尤其是在无人工干预的情况下,即使模型被明确要求编写安全代码,仍会在约23%的案例中产生可验证的安全漏洞。解决方案的关键在于提出SecureForge——一个自动化的审计与优化管道:首先识别能触发静态可检测漏洞的良性提示,随后利用马尔可夫采样技术生成多样化的合成提示语料库,在保持错误率稳定的同时提升多样性;进而基于该语料库迭代优化系统提示(system prompts),显著降低输出漏洞率(最高达48%),同时维持单元测试通过率。该方法实现了单位测试性能与输出安全性之间的帕累托改进,并且优化后的提示具备零样本迁移能力,适用于真实场景中的编码代理提示,无需接触实际用户提示分布即可实现效果提升。

链接: https://arxiv.org/abs/2605.08382
作者: Houjun Liu,Lisa Einstein,John Yang,Joachim Baumann,Duncan Eddy,Christopher D. Manning,Mykel Kochenderfer,Diyi Yang
机构: Stanford University (斯坦福大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:LLM coding agents now generate code at an unprecedented scale, yet LLM-generated code introduces cybersecurity vulnerabilities into codebases without human involvement. Even when frontier models are explicitly asked to write secure production code with relevant weaknesses to avoid in context, we find that they still produce verifiable vulnerabilities on average 23% of the time across a corpus of 250 benign coding prompts. We introduce SecureForge, an automated pipeline that both audits security risks of frontier models and produces auditing-informed secure system prompts that reduce output security vulnerabilities while maintaining unit test performance. SecureForge first identifies benign prompts that produce statically detectable vulnerabilities, and then amplifies them into a large synthetic prompt corpus of diverse scenarios using a Markovian sampling technique to jointly maintain error rates and prompt diversity. This corpus is then used to iteratively optimize the system prompts to reduce output security vulnerabilities. On frontier models, SecureForge yields a statistically significant Pareto improvement in both unit test success and output security, with output vulnerabilities reduced by up to 48%. The resulting system prompts transfer zero-shot to in-the-wild coding agent prompts, without any exposure to real user prompt distributions during optimization.

[NLP-234] Reinforcement Learning for Scalable and Trustworthy Intelligent Systems

【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)在实际部署中面临的两大核心挑战:一是如何在通信带宽受限且计算资源异构的分布式环境中实现高效扩展;二是如何确保在大语言模型(Large Language Models, LLMs)和自主智能体中优化策略能够与人类偏好对齐,并满足隐私保护等安全要求。解决方案的关键在于通过四个互补贡献,分别从联邦优化(Federated Optimization)、偏好对齐(Preference Alignment)和情境安全性(Contextual Safety)三个维度推进强化学习的发展:一方面通过通信高效的异步联邦优化提升可扩展性,另一方面通过增强人类偏好对齐与减少语言系统中的不当信息泄露来提升可信度。整体上,论文论证了下一代智能系统需同时具备高效优化能力和可信行为,而强化学习为此提供了统一的框架。

链接: https://arxiv.org/abs/2605.08378
作者: Guangchen Lan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: PhD thesis

点击查看摘要

Abstract:Reinforcement learning has become a powerful paradigm for improving the capability of intelligent systems, but its practical deployment faces two central challenges. First, reinforcement learning must scale efficiently in distributed environments where communication bandwidth is limited and computation is heterogeneous across agents. Second, as reinforcement learning is increasingly used in post-training large language models and autonomous agents, the optimized policies must also be aligned with human preferences and satisfy safety requirements such as privacy-aware information disclosure. This dissertation addresses both challenges through four complementary contributions spanning federated optimization, preference alignment, and contextual safety. The first part of the dissertation studies scalable reinforcement learning in federated settings. The second part of the dissertation studies trustworthy reinforcement learning for large language models. Together, these contributions advance reinforcement learning along two complementary dimensions. On the one hand, they make reinforcement learning more scalable through communication-efficient and asynchronous federated optimization. On the other hand, they make reinforcement learning more trustworthy by improving alignment with human preferences and by reducing contextually inappropriate information disclosure in language-based intelligent systems. As a whole, this dissertation argues that the next generation of intelligent systems will require both efficient optimization and trustworthy behavior, and that reinforcement learning provides a unifying framework for addressing both goals.

[NLP-235] How Much Do Circuits Tell Us? Measuring the Consistency and Specificity of Language Model Circuits

【速读】: 该论文旨在解决生成式 AI(Generative AI)模型中机制可解释性(mechanistic interpretability)领域的一个关键问题:即识别出对特定任务至关重要的稀疏子图(sparse subgraphs),并探究这些子图是否具有任务特异性,从而支持针对模型行为的精准理解和干预。其解决方案的关键在于通过边缘归因修补(edge attribution patching)方法,在六个任务和七种模型上系统测量电路复用性(circuit reuse),并区分一致性(consistency,组件在任务内的重复出现)与特异性(specificity,组件对任务的独特贡献)。研究发现,尽管同一任务内的电路共享度高且组件对任务性能具有必要性(移除后准确率下降可达约100%相对损失),但跨任务间存在显著的因果重要性重叠,表明大多数电路并非任务特异,这挑战了基于注意力头和MLP层的电路发现能否实现靶向干预的假设。

链接: https://arxiv.org/abs/2605.08348
作者: Michael Li,Nishant Subramani
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The circuits framework in mechanistic interpretability aims to identify causally important sparse subgraphs of model components, typically evaluated by measuring necessity and sufficiency. We measure circuit reuse, the proportion of components shared across per-example circuits within a task, and investigate two less-studied properties of this: consistency, the recurrence of components within a task, and specificity, their uniqueness to a task. Using edge attribution patching across six tasks and seven models, we find that within-task reuse is high and that shared components are necessary for task performance, with ablations causing up to ~100% relative accuracy drops. However, circuits turn out not to be task-specific: ablating one task’s circuit damages another task’s performance about as much as that task’s own circuit does. We discover that this is due to substantial overlap between circuits across tasks, which are causally important for performance. Some circuits do contain a smaller set of task-specific components, but these account for only a modest portion of circuit performance. Overall, our findings suggest that while circuit discovery at the level of attention heads and MLP layers identifies important components, their lack of task-specificity raises questions about the degree to which circuits can support targeted understanding and intervention on model behavior.

[NLP-236] Sanity Checks for Long-Form Hallucination Detection

【速读】: 该论文旨在解决当前生成式 AI(Generative AI)中基于链式思维(chain-of-thought)推理轨迹的幻觉检测方法存在的核心问题:这些方法是否真正评估了中间推理过程的有效性,还是仅仅依赖于最终答案层面的表面相关特征。解决方案的关键在于提出一种受控不变性(controlled-invariance)方法,通过两个“oracle 测试”来区分信号来源:Force 将最终答案替换为真实值但保留推理轨迹,Remove 则移除答案宣告步骤而保持推理路径完整。实验证明,一旦控制住终点线索(endpoint cues),无需复杂学习模型即可实现鲁棒检测——例如 TRACT 模型仅基于词汇轨迹特征(如模糊趋势、步骤长度动态和跨响应词汇收敛性)即达到与现有基线相当或更优的性能,这表明当前挑战并非推理轨迹中缺乏信号,而是未能有效隔离其与终点提示之间的干扰。

链接: https://arxiv.org/abs/2605.08346
作者: Geigh Zollicoffer,Minh Vu,Hongli Zhan,Raymond Li,Manish Bhattarai
机构: Los Alamos National Laboratory (洛斯阿拉莫斯国家实验室); The University of Texas at Austin (德克萨斯大学奥斯汀分校); University of British Columbia (不列颠哥伦比亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Hallucination detection methods for large language models increasingly operate on chain-of-thought reasoning traces, yet it remains unclear whether they evaluate the reasoning itself or merely exploit surface correlates of the final answer. We introduce a controlled-invariance methodology that exposes this distinction through two oracle tests: Force, which replaces each response’s final answer with the ground truth while preserving the reasoning trace, and Remove, which strips answer-announcement steps while leaving the trajectory intact. This reveals if their predictive power derives from answer-level artifacts rather than from the structure or validity of intermediate reasoning. We further show that once these artifacts are controlled for, effective detection does not necessarily require complex learned representations: TRACT, a lightweight scorer built on lexical trajectory features (hedging trends, step-length dynamics, and cross-response vocabulary convergence), achieves strong robustness while remaining competitive with or outperforming existing baselines on unperturbed traces. These findings suggest that the current central challenge in reasoning-aware hallucination detection is not the absence of signal in the trace, but the failure to isolate it from endpoint cues.
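TRACT is described as a lightweight scorer over lexical trajectory features such as hedging trends, step-length dynamics, and cross-response vocabulary convergence. The sketch below extracts features of that kind from a list of reasoning steps; the hedge lexicon and the exact feature definitions are assumptions, not the authors' feature set.

```python
# Minimal sketch of TRACT-style lexical trajectory features over a chain of thought:
# hedging trend, step-length dynamics, and vocabulary convergence across responses.
# The hedge lexicon and feature definitions are illustrative assumptions.
import numpy as np

HEDGES = {"maybe", "perhaps", "possibly", "likely", "might", "unsure", "roughly"}

def trajectory_features(reasoning_steps, other_responses):
    tokens_per_step = [step.lower().split() for step in reasoning_steps]
    hedge_rate = [sum(t in HEDGES for t in toks) / max(len(toks), 1) for toks in tokens_per_step]
    step_len = [len(toks) for toks in tokens_per_step]
    x = np.arange(len(reasoning_steps))
    hedge_trend = np.polyfit(x, hedge_rate, 1)[0] if len(x) > 1 else 0.0  # slope over steps
    len_trend = np.polyfit(x, step_len, 1)[0] if len(x) > 1 else 0.0
    # cross-response vocabulary convergence: Jaccard overlap with other sampled responses
    vocab = set(t for toks in tokens_per_step for t in toks)
    overlaps = [len(vocab & set(r.lower().split())) / max(len(vocab | set(r.lower().split())), 1)
                for r in other_responses]
    return {
        "hedge_mean": float(np.mean(hedge_rate)),
        "hedge_trend": float(hedge_trend),
        "step_len_trend": float(len_trend),
        "vocab_convergence": float(np.mean(overlaps)) if overlaps else 0.0,
    }
```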

[NLP-237] SalesSim: Benchmarking and Aligning Multimodal Language Models as Retail User Simulators

【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在模拟真实、个性化用户行为时存在的局限性,尤其是其在多轮、多模态、工具增强的在线零售对话中对用户决策一致性和对话质量的不足。现有方法将用户模拟视为表面层次的对话生成,而本文提出SalesSim框架,将零售交互建模为一个基于角色的、具身的(agentic)决策过程,使购物者能根据自身背景、偏好和底线条件与销售代理互动并做出合理决策。解决方案的关键在于引入UserGRPO——一种多轮、多目标强化学习算法,用于同时优化对话流畅性和决策一致性(decision alignment),从而显著提升模型对用户角色设定的遵循程度,在基准测试中使决策一致性平均提升13.8%,同时保持对话质量不下降。

链接: https://arxiv.org/abs/2605.08334
作者: Yada Pruksachatkun,Elaine Wan,Lyanna Chen,Kai-Wei Chang,Chien-Sheng Wu
机构: Salesforce Research (Salesforce研究部门); University of California Los Angeles (加州大学洛杉矶分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present SalesSim, a framework and testbed for evaluating the ability of Multimodal Large Language Models (MLLMs) to simulate realistic, persona-driven customer behavior in multi-turn, multi-modal, tool-augmented online retail conversations. Unlike prior work that treat user simulation as surface-level dialogue generation, SalesSim models retail interaction and decision-making as a grounded, agentic process, where shoppers with diverse backgrounds, preferences, and dealbreakers interact with a sales agent, seek clarifications, and make informed purchasing decisions. For evaluation, we design a suite of metrics centered on decision alignment, measuring the consistency between the simulator’s actions and its persona specifications, as well as conversational quality. We find several behavioral gaps after benchmarking 6 open and closed-source state-of-the-art models. First, while models produce fluent conversations, they display significantly lower lexical diversity and overdisclosure of criteria across personas compared to human conversations. Second, models tend to be persuaded by sales agent suggestions and drift from persona specifications. Even the strongest model achieves less than 79% average alignment with its underlying persona specifications. To make progress on these limitations, we propose UserGRPO, a multi-turn, multi-objective reinforcement learning recipe to optimize both conversational fluency and decision alignment under persona specifications. Our experiments demonstrate that UserGRPO boosts decision alignment of the baseline model by 13.8% while improving conversational quality. By introducing SalesSim, we provide a new testbed for the community to investigate and improve the adherence of user simulators in goal-oriented settings.

[NLP-238] CDS4RAG: Cyclic Dual-Sequential Hyperparameter Optimization for RAG IJCAI2026

【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中因检索器与生成器超参数数量庞大且相互耦合而导致的优化困难问题。现有方法往往将RAG视为黑箱或仅优化部分超参数,导致收敛效率低且效果不佳。其解决方案的关键在于提出CDS4RAG框架,采用一种新颖的循环双序列(cyclic dual-sequential)建模方式,区分并交替优化检索器与生成器的超参数;通过周期内细粒度预算分配和周期间生成器优化的种子引导机制,显著提升优化效率与性能,在多个基准测试中均实现优于基线算法的生成质量与加速比。

链接: https://arxiv.org/abs/2605.08333
作者: Pengzhou Chen,Tao Chen
机构: School of Computer Science and Engineering, UESTC, Chengdu, China; IDEAS Lab, University of Birmingham, Birmingham, UK
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Performance (cs.PF); Software Engineering (cs.SE)
备注: Accepted by main track at IJCAI 2026

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) is sensitive to the vast hyperparameters of the retriever and generator, yet optimizing them using given queries is a challenging task due to the complex interactions and expensive evaluation costs. Existing algorithms are ineffective and slow in convergence, since they often treat RAG as a monolithic black box or only optimize partial hyperparameters. In this paper, we propose CDS4RAG, a framework that optimizes the full RAG hyperparameters using given queries via a new cyclic dual-sequential formulation. CDS4RAG is special in the sense that it distinguishes the hyperparameters of the retriever and generator, cyclically optimizing them in turn. Such a paradigm allows us to design fine-grained within-cycle budget provision and expedite the optimization via cross-cycle seeding when optimizing the generator. CDS4RAG is also an algorithm-agnostic framework that can be paired with diverse general algorithms. Through experiments on four common benchmarks and two backbone LLMs, we reveal that CDS4RAG considerably boosts the vanilla algorithms in 21/24 cases while significantly outperforming state-of-the-art algorithms in all cases with up to 1.54x improvements of generation quality and better speedup.
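The cyclic dual-sequential idea is to tune retriever and generator hyperparameters in alternation, with a within-cycle budget split and cross-cycle seeding of the generator search. A minimal control-loop sketch follows; the tune interface, the 40/60 budget split, and the seeding mechanism are assumptions standing in for whichever black-box HPO algorithm the framework wraps.

```python
# Minimal sketch of a cyclic dual-sequential HPO loop: optimize retriever and generator
# hyperparameters in turn, seeding each generator round with the previous incumbent.
# `tune` stands for any black-box HPO routine; its interface and budgets are assumptions.
def cyclic_dual_sequential(tune, eval_rag, retriever_space, generator_space,
                           init_gen, n_cycles=3, budget_per_cycle=30):
    best_ret, best_gen, best_score, seed = None, init_gen, float("-inf"), None
    for _ in range(n_cycles):
        ret_budget = int(0.4 * budget_per_cycle)
        # 1) tune retriever hyperparameters with the generator held fixed
        best_ret, _ = tune(retriever_space,
                           lambda r: eval_rag(r, best_gen),
                           budget=ret_budget)
        # 2) tune generator hyperparameters, seeded with the previous cycle's incumbent
        best_gen, best_score = tune(generator_space,
                                    lambda g: eval_rag(best_ret, g),
                                    budget=budget_per_cycle - ret_budget,
                                    seeds=[seed] if seed else None)
        seed = best_gen
    return best_ret, best_gen, best_score
```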

[NLP-239] LLMSYS-HPOBench: Hyperparameter Optimization Benchmark Suite for Real-World LLM Systems

【速读】: 该论文旨在解决当前超参数优化(Hyperparameter Optimization, HPO)基准测试无法充分刻画大型语言模型(Large Language Model, LLM)系统复杂性的问题。现有基准未能涵盖LLM系统中AI与非AI组件共同构成的复合超参数空间、保真度因子(fidelity factors)带来的非线性影响,以及测量配置时多样化的成本特征。解决方案的关键在于提出首个面向真实LLM系统的动态基准套件LLMSYS-HPOBench,其包含364,450个高维超参数配置(维度12–23)、932种保真度设置(3–5维)、3–9种推理目标指标及2–10种成本指标,并附带完整的运行日志。这一平台不仅支持对现有HPO算法在前沿LLM场景下的重新验证,也为AutoML社区提供了持续演进的研究基础设施。

链接: https://arxiv.org/abs/2605.08305
作者: Siyu Wu,Yulong Ye,Zezhen Xiang,Pengzhou Chen,Gangda Xiong,Tao Chen
机构: UESTC(电子科技大学); University of Birmingham (伯明翰大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Performance (cs.PF); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Large Language Model (LLM) systems have been the frontier of AI in many application domains, leading to new challenges and opportunities for hyperparameter optimization (HPO) for the AutoML community. However, this type of system exhibits an unprecedented compound space of hyperparameter configuration from both the AI and non-AI components; rich and nonlinear implications from the fidelity factors; and diverse costs of measuring hyperparameter configurations, none of which have been fully captured in existing benchmarks. This paper presents the first (live) benchmark suite and datasets for HPO of real-world LLM systems, dubbed LLMSYS-HPOBench, covering data related to the inference objective values of hyperparameter configurations profiled from running the LLM systems. Currently, LLMSYS-HPOBench contains 364,450 hyperparameter configurations with a dimensionality of 12-23, 3-5 dimensions of fidelity factor leading to 932 settings, 3-9 inference objective metrics, and 2-10 cost metrics, together with generated logs from measuring the LLM systems. What we seek to advocate is not only a revalidation of the existing HPO algorithms over the frontier LLM systems, but also to provide an evolving platform for the AutoML community to explore new directions of research in this regard. The benchmark suite has been made available at: this https URL

[NLP-240] mHC-SSM: Manifold-Constrained Hyper-Connections for State Space Language Models with Stream-Specialized Adapters

【速读】: 该论文旨在解决状态空间模型(State Space Model, SSM)语言建模中因单流残差混合机制导致的表示能力受限问题,进而提升模型性能。其解决方案的关键在于引入受流形约束的多流残差混合结构(Manifold-Constrained Hyper-Connections, mHC),通过在SSM块周围构建静态多流架构,利用单纯形约束预混合与后混合实现流间信息聚合与分散,并在每层应用Sinkhorn-Knopp投影以确保残差混合矩阵位于双随机流形上,从而增强训练稳定性与表达能力。此外,引入流特异性适配器(stream-specialized adapters)通过共享瓶颈和逐流缩放机制,在不显著增加复杂度的前提下进一步提升模型容量,实验证明该方法可在保持可接受效率代价下有效降低验证损失与困惑度。

链接: https://arxiv.org/abs/2605.08300
作者: Abdulvahap Mutlu,Şengül Doğan,Türker Tuncer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 28 Pages, 3 Figures, all implementation code available at: this https URL

点击查看摘要

Abstract:Manifold-Constrained Hyper-Connections (mHC) introduce a stability-motivated variant of multi-stream residual mixing by constraining residual stream mixing matrices to the manifold of doubly stochastic matrices via Sinkhorn-Knopp projection. In this work, we study whether mHC-style constrained multi-stream residual topology transfers effectively to state space model (SSM) language modeling. We implement a static mHC mechanism around an SSM block by expanding the residual stream into multiple parallel streams, aggregating streams into a single SSM input through simplex-constrained pre-mixing, scattering the SSM output back to streams through simplex-constrained post-mixing, and applying Sinkhorn-projected residual stream mixing at each layer. We further introduce stream-specialized adapters that add lightweight stream-specific capacity through a shared bottleneck with per-stream scaling, applied both before stream aggregation and after the SSM output prior to scattering. We evaluate baseline single-stream SSM, static mHC SSM, and mHC SSM with adapters on WikiText-2 using identical training settings and report checkpoint-based validation loss, perplexity, throughput, and peak GPU memory. Under the reported fair checkpoint evaluation, static mHC improves validation loss from 6.3507 to 6.2448 and reduces perplexity from 572.91 to 515.35, while mHC with adapters further improves validation loss to 6.1353 and perplexity to 461.88. These gains are accompanied by modest throughput reductions from 1025.52 to 964.81 and 938.90 tokens per second, and increased peak memory from 2365 MB to 2568 MB and 3092 MB. The results suggest that mHC-inspired constrained multi-stream residual mixing can yield measurable quality improvements in SSM language models and that stream-specialized adapter capacity can further enhance performance with predictable efficiency tradeoffs.
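The constraint at the heart of mHC is projecting each residual-stream mixing matrix onto (approximately) the doubly stochastic manifold with a few Sinkhorn-Knopp normalization sweeps. A minimal sketch of that projection and its use for stream mixing follows; the iteration count and tensor layout are illustrative, not the paper's configuration.

```python
# Minimal sketch of a Sinkhorn-Knopp projection of a residual-stream mixing matrix
# onto (approximately) the doubly stochastic manifold. Iteration count is illustrative.
import torch

def sinkhorn_project(logits: torch.Tensor, n_iters: int = 10) -> torch.Tensor:
    """logits: (n_streams, n_streams) unconstrained mixing parameters."""
    m = torch.exp(logits)                       # strictly positive matrix
    for _ in range(n_iters):
        m = m / m.sum(dim=1, keepdim=True)      # normalize rows
        m = m / m.sum(dim=0, keepdim=True)      # normalize columns
    return m

def mix_streams(streams: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
    """streams: (batch, n_streams, seq, d); mix along the stream axis."""
    mixing = sinkhorn_project(logits)
    return torch.einsum("ij,bjsd->bisd", mixing, streams)
```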

[NLP-241] In-Context Fixation: When Demonstrated Labels Override Semantics in Few-Shot Classification

【速读】: 该论文旨在解决上下文学习(In-Context Learning, ICL)中标签同质性导致性能崩溃的问题,即当演示样本使用相同标签时,模型准确率显著下降至约12%,甚至在四分类任务中降至0%。其关键发现是:ICL输出并非基于语义概念的推理,而是受限于演示中出现的词汇表(vocabulary retrieval)机制——模型将标签位置的token视为答案空间的完整集合,从而对同质标签产生“固定效应”(fixation),即使标签语义合理也无法恢复正确预测。解决方案的核心在于通过分层激活修复(paired activation patching)和对数透镜(logit lens)分析,定位到特定神经网络层(如Pythia-1B的第7层)中的因果回路,揭示并量化了这种固定效应的机制,表明其本质为格式依赖与内容覆盖的可分离组件,从而为改进ICL稳定性提供了可操作的干预路径。

链接: https://arxiv.org/abs/2605.08295
作者: Ming Liu
机构: Amazon
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 12 pages (10 main + 2 appendix), 4 figures, 5 tables

点击查看摘要

Abstract:While random demonstration labels barely hurt in-context learning (Min et al., 2022), we show that homogeneous labels–even semantically valid ones–collapse accuracy to ≈12% across six models (Pythia, Llama, Qwen; 0.8B–8B) and four tasks. The trigger is label-slot content: the model treats tokens occupying the label position as an exhaustive answer vocabulary, with homogeneity as the maximally collapsed case. A novel set-level fixation finding confirms this: when demonstrations carry varied nonsense tokens from {foo, bar, vex, nit, orb}, the model places 42–67% of probability on the demonstrated set while P(dog) remains below 0.2%. This is inconsistent with latent-concept Bayesian accounts (Xie et al., 2022) and reveals that ICL output is constrained vocabulary retrieval–the model binds its output to the demonstrated token inventory regardless of semantic plausibility. The effect generalizes to 4-way classification (0% accuracy across three models, 1B–8B) and multi-token verbalizers (“very positive”), where we decompose fixation into format-level (template adoption) and content-level (polarity override) components that are experimentally dissociable. Mechanistically, per-item paired activation patching on Pythia-1B recovers 98.4% of the gap (95% CI [84%, 112%]), localizing fixation to a layer-7-centered circuit (rank 2/560, 99.8th percentile; 4-fold CV mean 103%). Cross-architecture logit lens on Llama-3.2-1B replicates the encode-then-override trajectory with causal confirmation (top-5 layers: 89% recovery).
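The set-level fixation measurement amounts to reading the next-token distribution after the demonstrations and comparing the mass on the demonstrated label vocabulary with the mass on the semantically correct label. A minimal Hugging Face sketch of such a probe is below; the model choice, prompt template, and first-token approximation for labels are assumptions, not the paper's protocol.

```python
# Minimal sketch of a set-level fixation probe: how much next-token probability mass
# falls on the demonstrated label vocabulary vs. the semantically correct label.
# Model name, prompt template, and single-token label approximation are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-1b")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1b")

def label_mass(prompt: str, labels: list[str]) -> dict:
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        probs = torch.softmax(model(ids).logits[0, -1], dim=-1)
    out = {}
    for lab in labels:
        lab_ids = tok(" " + lab, add_special_tokens=False).input_ids
        out[lab] = probs[lab_ids[0]].item()   # probability of the label's first token
    return out

demo = ("Review: great movie\nLabel: foo\n"
        "Review: awful plot\nLabel: bar\n"
        "Review: a lovely dog\nLabel:")
print(label_mass(demo, ["foo", "bar", "vex", "nit", "orb", "dog"]))
```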

[NLP-242] HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control

【速读】: 该论文旨在解决当前强化学习(Reinforcement Learning, RL)算法在训练大语言模型(Large Language Models, LLMs)时,对响应中的每个token均赋予相同优化目标的问题,这导致无法提供细粒度的推理过程指导,且难以动态平衡探索(exploration)与利用(exploitation)之间的权衡。其解决方案的关键在于提出一种分层的token级目标控制策略优化方法(Hierarchical Token-level Objective Control Policy Optimization, HTPO),该方法从提示难度、答案正确性和token熵三个维度将响应token划分为特定功能组,并在每组内根据token对探索或利用的贡献设计专用优化目标,从而实现更精细的策略调控和更优的探索-利用平衡。

链接: https://arxiv.org/abs/2605.08283
作者: Xincheng Yao,Ruoqi Li,Cheng Chen,Daoxin Zhang,Yi Wu,Yao Hu,Chongyang Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Xiaohongshu Inc. (小红书)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 29 pages

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a pivotal technique for enhancing the reasoning capabilities of Large Language Models (LLMs). However, the de facto practice of mainstream RL algorithms is to treat all tokens of one response equally and assign the same optimization objective to each token, failing to provide granular guidance for the reasoning process. While in Chain-of-Thought (CoT) reasoning, different tokens usually play distinct roles. Therefore, the current RL algorithms lack an effective mechanism to dynamically balance the exploration-exploitation trade-off during learning. To this end, we propose Hierarchical Token-level Objective Control Policy Optimization (HTPO), a novel RL algorithm that takes the divide-and-conquer idea to hierarchically partition the response tokens into specific functional groups from three aspects (i.e., prompt difficulty, answer correctness, and token entropy). Within each group, according to the contributions to exploration or exploitation, we design specialized optimization objectives to facilitate the effective execution of each token’s expected functionality. In this way, HTPO can achieve a more balanced exploration-exploitation trade-off. Extensive experiments on challenging reasoning benchmarks validate the superiority of our HTPO algorithm, which significantly outperforms the strong DAPO baseline (e.g., +8.6% and +6.7% on AIME’24 and AIME’25, respectively). When scaling test-time compute, the HTPO-trained model maintains a consistent performance advantage over the DAPO baseline, and the gap widens as the sampling budget increases, validating that our adaptive token-level control method fosters effective exploration without sacrificing exploitation performance. Code will be at this https URL.
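HTPO's first step is to partition response tokens into functional groups along prompt difficulty, answer correctness, and token entropy before assigning group-specific objectives. The sketch below illustrates only that grouping step; the entropy threshold, group indices, and their intended objectives are assumptions, since the paper's exact partition and losses are not reproduced here.

```python
# Minimal sketch of hierarchical token grouping by (difficulty, correctness, entropy),
# as a precursor to assigning per-group optimization objectives. All thresholds and
# group semantics are illustrative assumptions.
import torch

def group_tokens(token_entropy: torch.Tensor, is_correct: bool, prompt_difficulty: str,
                 entropy_threshold: float = 1.0) -> torch.Tensor:
    """token_entropy: (seq_len,) per-token predictive entropy of the policy."""
    high_entropy = token_entropy > entropy_threshold        # uncertain, exploration-relevant tokens
    groups = torch.zeros_like(token_entropy, dtype=torch.long)
    if is_correct:
        groups[~high_entropy] = 0   # exploit: reinforce confident tokens of correct answers
        groups[high_entropy] = 1    # preserve some entropy on uncertain tokens
    else:
        groups[~high_entropy] = 2   # penalize confident tokens of wrong answers
        groups[high_entropy] = 3    # explore: softer penalty on uncertain tokens
    if prompt_difficulty == "hard":
        groups = groups + 4         # hard prompts get their own objective variants
    return groups
```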

[NLP-243] Spatial Priming Outperforms Semantic Prompting: A Grid-Based Approach to Improving LLM Accuracy on Chart Data Extraction

【速读】: 该论文旨在解决从科学图表中自动提取数据的问题,尤其是在非标准化图表场景下,当前多模态大语言模型(Multimodal Large Language Models, MLLMs)的准确性仍面临挑战。研究核心在于比较两种不同策略的有效性:高阶语义引导(如基于元数据的两阶段框架或思维链Chain-of-Thought)与低阶空间引导(spatial priming)。实验表明,语义引导方法未能带来统计显著的性能提升,而关键解决方案是采用一种简单但高效的空间引导方法——在图表图像分析前叠加坐标网格(coordinate grid),从而为模型提供显式的空间上下文信息。定量实验显示,该方法使对称平均绝对百分比误差(SMAPE)从25.5%显著降低至19.5%(p < 0.05),证明在当前多模态模型能力下,提供明确的空间结构提示比高阶语义引导更有效且可靠。

链接: https://arxiv.org/abs/2605.08220
作者: Andrei Lazarev,Dmitrii Sedov,Alexander Galkin
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
备注: This is the version of the article accepted for publication in SUMMA 2025 after peer review. The final, published version is available at IEEE Xplore: this https URL

点击查看摘要

Abstract:The automated extraction of data from scientific charts is a critical task for large-scale literature analysis. While multimodal Large Language Models (LLMs) show promise, their accuracy on non-standardized charts remains a challenge. This raises a key research question: which is the more effective strategy to improve model performance, high-level semantic priming or low-level spatial priming? This paper presents a comparative investigation into these two distinct strategies. We describe our exploratory experiments with semantic methods, such as a two-stage metadata-first framework and Chain-of-Thought, which failed to produce a statistically significant improvement. In contrast, we present a simple but highly effective spatial priming method: overlaying a coordinate grid onto the chart image before analysis. Our quantitative experiment on a synthetic dataset demonstrates that this grid-based approach provides a statistically significant reduction in data extraction error (SMAPE reduced from 25.5% to 19.5%, p < 0.05) compared to a baseline. We conclude that for the current generation of multimodal models, providing explicit spatial context is a more effective and reliable strategy than high-level semantic guidance for this class of tasks.
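The winning intervention is purely a preprocessing step: draw a labeled coordinate grid on the chart image before handing it to the multimodal model, and score extractions with SMAPE. A minimal sketch with Pillow follows; grid spacing, colors, and label placement are illustrative choices rather than the paper's exact setup.

```python
# Minimal sketch of the grid-overlay preprocessing plus a SMAPE helper.
# Grid spacing, color, and label placement are illustrative choices.
from PIL import Image, ImageDraw

def overlay_grid(path: str, step: int = 50) -> Image.Image:
    img = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(img)
    w, h = img.size
    for x in range(0, w, step):
        draw.line([(x, 0), (x, h)], fill=(200, 200, 200), width=1)
        draw.text((x + 2, 2), str(x), fill=(120, 120, 120))   # pixel-coordinate labels
    for y in range(0, h, step):
        draw.line([(0, y), (w, y)], fill=(200, 200, 200), width=1)
        draw.text((2, y + 2), str(y), fill=(120, 120, 120))
    return img

def smape(y_true, y_pred) -> float:
    """Symmetric mean absolute percentage error, in percent."""
    terms = [abs(p - t) / ((abs(t) + abs(p)) / 2)
             for t, p in zip(y_true, y_pred) if (abs(t) + abs(p)) > 0]
    return 100.0 * sum(terms) / len(terms)
```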

[NLP-244] LLMs with in-context learning for Algorithmic Theoretical Physics

【速读】: 该论文旨在解决理论物理中日益增多的算法计算任务在人工执行时耗时且易出错的问题,特别是修改引力理论中的宇宙学微扰计算。其解决方案的关键在于将大型语言模型(Large Language Model, LLM)与计算机代数系统(Computer Algebra System, CAS)运行时环境相结合,并通过提供充分的上下文信息(如已求解示例)来提升模型对复杂符号运算的可靠性。实验表明,前沿LLM(如Claude)搭配Maple CAS后,在给定足够提示的情况下能够有效完成大部分测试问题,展现出自动化处理高精度物理计算的潜力。

链接: https://arxiv.org/abs/2605.08212
作者: Anamaria Hell,Leander Thiele
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); General Relativity and Quantum Cosmology (gr-qc); High Energy Physics - Theory (hep-th)
备注: 8 pages, 2 figures

点击查看摘要

Abstract:There is an increasing number of algorithmic computations in theoretical physics. These, while conceptually simple, can nevertheless be time-consuming and contain subtleties that should not be overlooked. Given the recent improvement of Large Language Models (LLM), it is natural to investigate whether LLMs equipped with a computer algebra system (CAS) runtime and sufficiently informative context can reliably carry out these algorithmic tasks. In this work, we interface Claude with Maple, and apply this framework to cosmological perturbations in modified theories of gravity. We demonstrate the current capabilities of this approach, the typical failures, and how the same can be improved. We find that a frontier LLM supplied with worked examples is able to solve most test problems.

[NLP-245] MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing

【速读】: 该论文旨在解决当前文本驱动图像编辑(text-in-image editing)基准测试中普遍存在的语言单一性问题,即现有评估体系主要基于英语,难以准确衡量模型在多语言环境下的表现,并且常将视觉合理性与语义正确性混淆。为实现更公平、细致的跨语言评估,作者提出MULTITEXTEDIT基准,包含3,600个实例,覆盖12种类型差异显著的语言、5个视觉领域和7类编辑操作,每条数据均保留相同的视觉基础并配有人工标注参考和区域掩码,从而隔离语言变量以支持跨语言比较。其解决方案的关键在于引入语言保真度(Language Fidelity, LSF)指标,该指标通过两阶段大型视觉模型(Large Vision Model, LVM)协议进行评分:首先定位被编辑目标文本,再独立判断其准确性,可有效捕捉细粒度脚本错误(如缺少变音符号、RTL方向反转、混合脚本渲染等),并与母语者标注者达成0.76的加权kappa一致性。实验表明,所有模型在非英语语言上均出现显著性能下降,尤其在希伯来语和阿拉伯语中最为严重,且问题集中于文本准确性和脚本保真度,而非整体结构维度,同时揭示了语义与像素层面的广泛不一致现象。

链接: https://arxiv.org/abs/2605.08163
作者: Liwei Cheng,Zirui Song,Shibo Feng,Lunjie Zhou,Yixuan Guan,Dayan Guan
机构: Harbin Institute of Technology (哈尔滨工业大学); Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Text-in-image editing has become a key capability for visual content creation, yet existing benchmarks remain overwhelmingly English-centric and often conflate visual plausibility with semantic correctness. We introduce MULTITEXTEDIT, a controlled benchmark of 3,600 instances spanning 12 typologically diverse languages, 5 visual domains, and 7 editing operations. Language variants of each instance share a common visual base and are paired with a human-edited reference and region masks, isolating the language variable for cross-lingual comparison. To capture script-level errors that coarse text-matching metrics miss, such as missing diacritics, reversed RTL order, and mixed-script renderings, we introduce a language fidelity (LSF) metric scored by a two-stage LVM protocol that first traces the edited target text and then judges it in isolation, reaching a quadratic-weighted κ of 0.76 against native-speaker annotators. Evaluating 12 open-source and proprietary systems with LSF alongside standard semantic and mask-aware pixel metrics, we find pronounced cross-lingual degradation for every model, largest on Hebrew and Arabic and smallest on Dutch and Spanish, and concentrated in text accuracy and script fidelity rather than in coarse structural dimensions. We also uncover a pervasive semantic and pixel mismatch, where outputs preserve global layout and background fidelity yet distort script-specific forms.

[NLP-246] Feature Rivalry in Sparse Autoencoder Representations: A Mechanistic Study of Uncertainty-Driven Feature Competition in LLMs

【速读】: 该论文旨在解决生成式 AI(Generative AI)模型中,稀疏自编码器(Sparse Autoencoders, SAEs)提取的可解释特征在不确定性情境下的交互机制不明确的问题。其核心解决方案是提出“特征竞争”(Feature Rivalry)这一概念——即负相关联的SAE特征对,并通过控制变量实验、激活操控(activation steering)和提示级竞争评分(per-prompt rivalry score)验证其作为模型不确定性机制签名的有效性:研究发现高熵问题显著增强特定层(如第0层与第12层)的特征竞争强度,且沿竞争方向操纵激活能更有效地改变输出,同时竞争得分可预测答案正确性,接近但低于软最大置信度。

链接: https://arxiv.org/abs/2605.08149
作者: Harshavardhan
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 10 pages, 6 figures

点击查看摘要

Abstract:Sparse Autoencoders (SAEs) decompose large language model representations into interpretable features, but how these features interact under uncertainty remains poorly understood. We introduce Feature Rivalry – negatively correlated SAE feature pairs – and study whether rivalry serves as a mechanistic signature of model uncertainty in Gemma-2-2B using Gemma Scope SAEs. Through a controlled within-domain experiment on PopQA split by response entropy, we find that high-entropy questions produce significantly stronger feature rivalry at layers 0 and 12 relative to low-entropy questions (p=5.3x10^-26 and p=5.8x10^-5 respectively), localizing uncertainty to specific processing stages in the residual stream. We then test whether rivalry is causally upstream of model outputs via activation steering along rivalry axes – finding that steering along the rivalry direction (vec_A - vec_B) causes more output changes than random directions at low steering multipliers across 15 of 20 rival feature pairs. Finally, a per-prompt rivalry score derived from pairwise cosine similarities of active SAE feature decoder vectors predicts answer correctness (AUROC=0.689), approaching but not matching softmax confidence (AUROC=0.808).
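The per-prompt rivalry score is described as being derived from pairwise cosine similarities of the decoder vectors of active SAE features. The sketch below computes one such score; the top-k activation selection and the aggregation of negative similarities are assumptions about details the abstract leaves open.

```python
# Minimal sketch of a per-prompt rivalry score from pairwise cosine similarities of
# active SAE feature decoder vectors. The top-k selection and aggregation are assumptions.
import torch
import torch.nn.functional as F

def rivalry_score(activations: torch.Tensor, decoder: torch.Tensor, top_k: int = 32) -> float:
    """activations: (n_features,) SAE feature activations at one prompt position;
    decoder: (n_features, d_model) SAE decoder directions."""
    active = torch.topk(activations, k=top_k).indices
    dirs = F.normalize(decoder[active], dim=-1)           # (k, d_model) unit decoder directions
    sims = dirs @ dirs.T                                   # pairwise cosine similarities
    iu = torch.triu_indices(top_k, top_k, offset=1)
    pairwise = sims[iu[0], iu[1]]
    # rivalry = how anti-aligned the active features are, on average
    return float((-pairwise).clamp(min=0).mean())
```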

[NLP-247] Reasoning emerges from constrained inference manifolds in large language models

【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)推理能力评估中长期存在的问题:当前主要依赖标注基准测试(labeled benchmarks),这将任务性能与内部推理质量混为一谈,难以揭示推理过程本身的动态特性。为突破这一局限,作者提出从内在推理动力学角度出发,通过分析推理过程中内部表征的演化来理解推理的本质。其解决方案的关键在于识别出有效推理动态所处的结构约束区域——该区域由三个条件共同定义:足够的表征表达能力、自发的流形压缩(manifold compression)以及在压缩子空间中保持非退化的信息体积。基于此,作者进一步提出一种仅依赖内部动态计算的无标签诊断指标,从而提供了一种不依赖外部标注的、可量化推理质量的新范式,揭示了LLM推理本质上受几何与信息约束的规律。

链接: https://arxiv.org/abs/2605.08142
作者: Yanbiao Ma,Fei Luo,Linfeng Zhang,Chuangxin Zhao,Mingxuan Wang,Yinan Wu,Zhe Qian,Yang Lu,Long Chen,Zhao Cao,Xiaoshuai Hao,Ji-Rong Wen,Jungong Han
机构: Renmin University of China (中国人民大学); Tsinghua University (清华大学); Xiaomi EV (小米汽车); Xiamen University (厦门大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reasoning in large language models is predominantly evaluated through labeled benchmarks, conflating task performance with the quality of internal inference. Here we study reasoning as an intrinsic dynamical process by examining the evolution of internal representations during inference. We find that inference-time dynamics consistently self-organize into low-dimensional manifolds embedded within high-dimensional representation spaces. We also find that such geometric compression, although pervasive, is not sufficient for stable or reliable reasoning. Instead, effective reasoning dynamics emerge within a constrained structural regime characterized by three conditions: adequate representational expressivity, spontaneous manifold compression, and preservation of non-degenerate information volume within the compressed subspace. Models outside this regime exhibit characteristic pathological inference dynamics. Based on these insights, we introduce a unified, label-free diagnostic computed solely from internal dynamics. These findings suggest that reasoning in LLMs is fundamentally governed by geometric and informational constraints, offering a complementary framework to benchmark-centric assessment.

[NLP-248] Block-Wise Differentiable Sinkhorn Attention: Tail-Refinement Gradients with a Gap-Aware Dustbin Bridge

【速读】: 该论文旨在解决长序列上下文下平衡熵正则最优传输(balanced entropic optimal transport, OT)注意力机制在TPU硬件上的高效计算与精确反向传播问题。其核心挑战在于如何在固定深度的Sinkhorn迭代基础上,实现高精度的梯度计算并保持内存效率。解决方案的关键是提出一种“截断-基底、固定深度尾部精化”代理模型(stopped-base, fixed-depth tail-refinement surrogate),通过在T步截断Sinkhorn求解后,对短尾进行显式反向传播(unroll and differentiate exactly),从而获得精确的梯度信息;进一步证明了R=2时存在一个单参考块调度(one-reference-tile schedule),使得梯度可分解为一个参考计划块乘以由向量余切和对偶差构建的显式修正场,显著降低了计算复杂度至块级O((T+R)LW),同时输入存储仅需O(Ld),额外高带宽内存(HBM)使用为O(L),适用于固定头维度d和带宽W。该方法不仅支持理论上的精确梯度计算,还通过局部代理偏差界、后验偏差证书和投影收缩性验证,确保了数值稳定性,并在Pfam数据集上实现了端到端训练与性能提升,验证了其在生产路径中的有效性。

链接: https://arxiv.org/abs/2605.08123
作者: Dylan Forde
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We study long-context balanced entropic optimal transport (OT) attention on TPU hardware through a stopped-base, fixed-depth tail-refinement surrogate. After a stopped T-step Sinkhorn solve, we unroll a short refinement tail and differentiate that surrogate exactly. For the production R=2 case, the backward pass contains four staircase plan factors. We prove an exact one-reference-tile schedule: the R=2 score cotangent is a single reference plan tile times an explicit modifier field built from vector cotangents and dual differences. This yields block-wise cost O((T+R)LW), O(Ld) input storage, and O(L) additional HBM usage for fixed head dimension d and band width W. We also formalize the current dustbin_block path as the same balanced surrogate on an augmented support, so the schedule lifts to the gap-aware transport path used in our TPU runs. We provide a local surrogate-bias bound, an a posteriori bias certificate, and a projective contraction certificate for strictly positive active blocks. On synthetic masked problems, the optimized kernel matches exact autodiff of the same centered surrogate to within 10^-5 – 10^-10. On TPU v6e-8, a four-configuration Pfam screen completes end-to-end, and a promoted balanced R=2 run sustains roughly 8.5 examples per second through a three-hour budget, reaching step 1437. Held-out Pfam test shards improve reconstruction from 3.17 to 0.99 and sparse CE from 5.86 to 5.69 relative to step 0. These results support exact fixed-depth backward theory, a theorem-matching gap-aware bridge, and trainability evidence for the production path.
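The central construction is the stopped-base, fixed-depth tail-refinement surrogate: run T Sinkhorn iterations with gradients disabled, then unroll a short R-step tail and differentiate only that tail. A minimal dense PyTorch sketch of the surrogate (illustrative only; it omits the blocking, dustbin support, and TPU kernel scheduling the paper is actually about):

```python
import torch

def sinkhorn_tail_surrogate(cost, a, b, eps=0.1, T=20, R=2):
    # cost: (n, m) transport costs; a: (n,), b: (m,) marginals.
    # Stopped base: T Sinkhorn iterations with no gradient tracking.
    # Tail: R iterations unrolled and differentiated exactly.
    K = torch.exp(-cost / eps)
    u = torch.ones_like(a)
    v = torch.ones_like(b)
    with torch.no_grad():
        for _ in range(T):
            v = b / (K.t() @ u)
            u = a / (K @ v)
    u, v = u.detach(), v.detach()          # base scalings treated as constants
    for _ in range(R):                     # short refinement tail, backprop flows through K
        v = b / (K.t() @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]     # transport plan used as attention weights
```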

[NLP-249] GONE: Structural Knowledge Unlearning via Neighborhood-Expanded Distribution Shaping

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中知识遗忘(unlearning)的问题,特别是针对现有方法在处理结构化知识图谱(Knowledge Graph, KG)事实时忽略关系性、多跳推理和语义关联的局限性。当前主流方法如参数编辑、微调和蒸馏等主要面向扁平的句子级数据,难以有效实现精确的知识删除并避免副作用。为此,作者提出了Graph Oblivion and Node Erasure (GONE) 基准测试框架,用于系统评估LLMs在KG事实上的遗忘效果,并区分直接事实移除、基于推理的泄露以及灾难性遗忘三种效应。解决方案的核心是设计了一种名为Neighborhood-Expanded Distribution Shaping (NEDS) 的新型遗忘框架,其关键在于利用知识图谱的连通性识别与目标事实相关的锚定邻居节点,从而在遗忘事实与其语义邻域之间建立精确的决策边界,实现高效且局部化的知识擦除。

链接: https://arxiv.org/abs/2603.12275
作者: Chahana Dahal,Ashutosh Balasubramaniam,Zuobin Xiong
机构: University of Nevada, Las Vegas; Indian Institute of Technology Guwahati
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Unlearning knowledge is a pressing and challenging task in Large Language Models (LLMs) because of their unprecedented capability to memorize and digest training data at scale, raising more significant issues regarding safety, privacy, and intellectual property. However, existing works, including parameter editing, fine-tuning, and distillation-based methods, are all focused on flat sentence-level data but overlook the relational, multi-hop, and reasoned knowledge in naturally structured data. In response to this gap, this paper introduces Graph Oblivion and Node Erasure (GONE), a benchmark for evaluating knowledge unlearning over structured knowledge graph (KG) facts in LLMs. This KG-based benchmark enables the disentanglement of three effects of unlearning: direct fact removal, reasoning-based leakage, and catastrophic forgetting. In addition, Neighborhood-Expanded Distribution Shaping (NEDS), a novel unlearning framework, is designed to leverage graph connectivity and identify anchor correlated neighbors, enforcing a precise decision boundary between the forgotten fact and its semantic neighborhood. Evaluations on LLaMA-3-8B and Mistral-7B across multiple knowledge editing and unlearning methods showcase NEDS’s superior performance (1.000 on unlearning efficacy and 0.839 on locality) on GONE and other benchmarks. Code is available at this https URL.

[NLP-250] Federated Language Models Under Bandwidth Budgets: Distillation Rates and Conformal Coverage

【速读】: 该论文旨在解决在带宽受限节点上分布数据训练语言模型时的统计可证明性问题,即如何在不集中数据的前提下,实现训练一致性与推理校准的理论保障,并将带宽预算作为核心统计参数进行建模。其解决方案的关键在于提出两种协议:Federated Probe-Logit Distillation (FPLD) 用于训练,通过量化压缩和探针集蒸馏实现高概率 KL 一致性,其收敛速率显式依赖于节点数 K、每节点样本量 n、量化带宽 B、探针集大小 m 和词表大小 V,且带宽仅以指数衰减项形式影响性能;Federated Conformal RAG (FC-RAG) 用于推理,构建了一个分布无关的边际覆盖边界,其中检索带宽松弛项 Δ_RAG 将每节点检索带宽视为首要统计参数,并在均匀条件下随节点数 K 的平方根递减。两个边界可通过 Pinsker 型推论组合成端到端覆盖保证,从而在理论上完整刻画了带宽约束下的性能极限。

链接: https://arxiv.org/abs/2605.09986
作者: Prasanjit Dubey,Xiaoming Huo
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Machine Learning (stat.ML); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Training a language model on data scattered across bandwidth-limited nodes that cannot be centralized is a setting that arises in clinical networks, enterprise knowledge bases, and scientific consortia. We study the regime in which data must remain distributed across nodes, and ask what statistical guarantees are in principle achievable under explicit bandwidth budgets; we aim to characterize what is provably possible, not to demonstrate a deployment-ready system. Existing theory treats either training-time consistency or inference-time calibration in isolation, and none makes bandwidth a first-class statistical parameter. We analyze two protocols, Federated Probe-Logit Distillation (FPLD) for training and Federated Conformal RAG (FC-RAG) for inference, as the analytical vehicles for our results. Our first main result is an explicit high-probability KL-consistency rate for FPLD with simultaneous dependence on node count K, per-node sample size n, quantization budget B, probe-set size m, and vocabulary size V; bandwidth enters only through an exponentially vanishing quantization term. Our second main result is a distribution-free marginal-coverage bound for FC-RAG, whose novel retrieval-bandwidth slack \Delta_{\mathrm{RAG}} = f_{\max}\sqrt{K^{-2}\sum_i v(B_i)} makes per-node retrieval bandwidth a first-class statistical parameter, with arithmetic aggregation across K nodes shrinking the slack as K^{-1/2} in the per-node-uniform regime. A Pinsker-type corollary composes the two bounds into an end-to-end coverage guarantee. Synthetic experiments verify the predicted scaling along the bounds' parameters; small-scale experiments on a GPT-2 testbed illustrate that the qualitative bandwidth-accuracy tradeoff survives on a real language model. A deployment-scale empirical evaluation is out of scope.
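As a quick numeric illustration of the retrieval-bandwidth slack Δ_RAG and its K^{-1/2} shrinkage in the per-node-uniform regime (the bandwidth-variance values below are made up for illustration, not taken from the paper):

```python
import math

def rag_slack(f_max: float, bandwidth_variances: list[float]) -> float:
    # Delta_RAG = f_max * sqrt(K^-2 * sum_i v(B_i)), with K = number of nodes.
    K = len(bandwidth_variances)
    return f_max * math.sqrt(sum(bandwidth_variances) / K**2)

# Uniform regime v(B_i) = v: the slack reduces to f_max * sqrt(v / K),
# i.e. it shrinks as K^{-1/2} when more nodes are aggregated.
print(rag_slack(1.0, [0.04] * 4))    # 0.1
print(rag_slack(1.0, [0.04] * 16))   # 0.05
```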

[NLP-251] Calibrate Don't Curate: Label-Efficient Estimation from Noisy LLM Judges

【速读】: 该论文旨在解决多评判者(multi-judge)评估中基于准确率筛选裁判的常见实践所带来的校准偏差问题,尤其是在生成式 AI(Generative AI)和奖励模型(reward model)的评估场景下。传统做法是仅保留高准确率的裁判以提升点估计精度,但这种方法在追求概率校准(calibrated probabilistic evaluation)时可能适得其反。论文的关键解决方案在于:当存在标注的校准数据集时,应保留所有可解析(parseable)、非冗余且可校准(calibratable)的裁判,即使其准确率较低;理论分析表明,在适当评分规则下,增加裁判信号不会恶化最优校准风险,且低准确率裁判若具备可学习偏倚和非冗余信息,则能显著提升整体校准性能。实证结果验证了全裁判面板在NLL指标上的优越性,例如在RewardBench2上将校准误差降低50%。

链接: https://arxiv.org/abs/2605.09702
作者: Yanran Li
机构: Columbia University (哥伦比亚大学)
类目: Methodology (stat.ME); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multi-judge evaluation is increasingly used to assess LLMs and reward models, and the prevailing heuristic is to curate: keep the most accurate judges and discard weaker ones. We show that this heuristic can reverse when the target is not point accuracy, but calibrated probabilistic evaluation from a labeled calibration set. Holding the aggregation and calibration procedures fixed, we compare accuracy-ranked top-k judge selection with using the full judge panel. Across four labeled pairwise-evaluation benchmarks spanning LLM-as-judge and reward-model settings, the calibrated full panel consistently outperforms accuracy-based selection. On RewardBench2, retaining all judges achieves negative log-likelihood (NLL) of 0.006 versus 0.013 under top-5 selection, halving the calibration error. This advantage persists after judge-family deduplication and against stronger same-pipeline subset search. We explain this reversal with oracle analyses showing that the optimal calibrated risk under proper scoring rules cannot increase when additional judge signals are made available, and that even below-chance judges can be useful when their biases are learnable and their signals are non-redundant. The resulting operating principle is simple: in multi-judge evaluation with labeled calibration data, do not discard weak judges by accuracy alone; keep them when they are parseable, non-redundant, and calibratable.
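The operating principle is easy to simulate: hold the calibrator fixed and compare NLL for an accuracy-ranked top-k panel versus the full panel. The sketch below uses synthetic judges and a logistic-regression calibrator as illustrative stand-ins for the paper's aggregation pipeline and benchmarks:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
n, J = 2000, 10
truth = rng.integers(0, 2, n)
acc = rng.uniform(0.45, 0.8, J)                        # some judges are near or below chance
votes = np.where(rng.random((n, J)) < acc, truth[:, None], 1 - truth[:, None])

cal, test = slice(0, 500), slice(500, None)            # labeled calibration split vs. evaluation split

def calibrated_nll(judge_idx):
    # Calibrate a probabilistic aggregator on the calibration split, score NLL on the rest.
    clf = LogisticRegression().fit(votes[cal][:, judge_idx], truth[cal])
    p = clf.predict_proba(votes[test][:, judge_idx])
    return log_loss(truth[test], p)

top5 = np.argsort(-acc)[:5]
print("top-5 judges NLL :", calibrated_nll(top5))
print("full panel NLL   :", calibrated_nll(np.arange(J)))
```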

信息检索

[IR-0] Neural at ArchEHR-QA 2026: One Method Fits All: Unified Prompt Optimization for Clinical QA over EHRs ALT LREC2026

【速读】:该论文旨在解决电子健康记录(Electronic Health Records, EHR)中自动化问答(Question Answering, QA)任务的三大核心挑战:精确的证据检索、忠实的答案生成以及答案在临床笔记中的显式定位。其解决方案的关键在于将整个QA流程解耦为独立的模块化阶段(即问题理解、证据识别、答案生成与证据对齐),并利用DSPy的MIPROv2优化器自动发现高性能提示(prompt),联合调优每个阶段的指令和少量示例;同时,在每个阶段引入基于多次随机推理的自一致性投票机制以抑制虚假错误,并结合阶段特异性的验证机制(如自我反思和验证链)进一步提升输出质量。该方法在ArchEHR-QA 2026共享任务中取得优异性能,表明系统性地对各阶段进行提示优化与自一致性增强是一种低成本且高效的替代模型微调的策略。

链接: https://arxiv.org/abs/2605.10877
作者: Abrar Majeedi,Viswanatha Reddy Gajjala,Sai Prasanna Teja Reddy Bogireddy,Siddhant Rai
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted to CL4Health @ LREC 2026

点击查看摘要

Abstract:Automated question answering (QA) over electronic health records (EHRs) demands precise evidence retrieval, faithful answer generation, and explicit grounding of answers in clinical notes. In this work, we present Neural1.5, our method for the ArchEHR-QA 2026 shared task at CL4Health@LREC 2026, which comprises four subtasks: question interpretation, evidence identification, answer generation, and evidence alignment. Our approach decouples the task into independent, modular stages and employs DSPy's MIPROv2 optimizer to automatically discover high-performing prompts, jointly tuning instructions and few-shot demonstrations for each stage. Within every stage, self-consistency voting over multiple stochastic inference runs suppresses spurious errors and improves reliability, while stage-specific verification mechanisms (e.g., self-reflection and chain-of-verification for alignment) further refine output quality. Among all teams that participated in all four subtasks, our method ranks second overall (mean rank 4.00), placing 4th, 1st, 4th, and 7th on Subtasks 1-4, respectively. These results demonstrate that systematic, per-stage prompt optimization combined with self-consistency mechanisms is a cost-effective alternative to model fine-tuning for multifaceted clinical QA.
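The self-consistency step described above reduces to majority voting over repeated stochastic runs of the same stage. A minimal sketch, where run_stage is a hypothetical callable standing in for one sampled call to a DSPy-optimized stage:

```python
from collections import Counter
from typing import Callable

def self_consistent(run_stage: Callable[[], str], n_runs: int = 5) -> str:
    # Run the same stage n_runs times with sampling enabled and keep the
    # output that the largest number of runs agree on.
    outputs = [run_stage() for _ in range(n_runs)]
    answer, _count = Counter(outputs).most_common(1)[0]
    return answer
```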

[IR-1] Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?

【速读】:该论文旨在解决在大型语言模型(Large Language Models, LLMs)能力不断提升的背景下,是否仍需依赖密集检索器(dense retriever)进行深度研究任务,还是仅使用传统的词法检索器(lexical retriever,如BM25)即可满足需求的问题。其核心解决方案在于构建一个名为Pi-Serini的搜索代理系统,该系统整合了检索、浏览和阅读三种工具,并通过优化BM25参数与增加检索深度来提升性能。实验表明,在搭配更先进的LLM(如gpt-5.5)时,经过调优的词法检索器可实现83.1%的答案准确率和94.7%的证据召回率,优于当前使用密集检索器的搜索代理,说明在具备良好配置和足够检索深度的前提下,词法检索器足以支撑高效的深度研究流程。

链接: https://arxiv.org/abs/2605.10848
作者: Tz-Huan Hsu,Jheng-Hong Yang,Jimmy Lin
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages, 4 figures

点击查看摘要

Abstract:Does a lexical retriever suffice as large language models (LLMs) become more capable in an agentic loop? This question naturally arises when building deep research systems. We revisit it by pairing BM25 with frontier LLMs that have better reasoning and tool-use abilities. To support researchers asking the same question, we introduce Pi-Serini, a search agent equipped with three tools for retrieving, browsing, and reading documents. Our results show that, on BrowseComp-Plus, a well-configured lexical retriever with sufficient retrieval depth can support effective deep research when paired with more capable LLMs. Specifically, Pi-Serini with gpt-5.5 achieves 83.1% answer accuracy and 94.7% surfaced evidence recall, outperforming released search agents that use dense retrievers. Controlled ablations further show that BM25 tuning improves answer accuracy by 18.0% and surfaced evidence recall by 11.1% over the default BM25 setting, while increasing retrieval depth further improves surfaced evidence recall by 25.3% over the shallow-retrieval setting. Source code is available at this https URL.
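The ablations turn on two knobs: the BM25 parameters (k1, b) and the retrieval depth. A minimal sketch of sweeping them with the rank_bm25 package on a toy corpus (corpus, query, and grid are placeholders; Pi-Serini itself works over Lucene/Pyserini indexes rather than this in-memory scorer):

```python
from rank_bm25 import BM25Okapi

corpus = [
    "bm25 is a lexical ranking function based on term frequency",
    "dense retrievers embed queries and documents into vectors",
    "agentic search loops call retrieval browse and read tools",
]
tokenized = [doc.split() for doc in corpus]

def top_k(query: str, k1: float, b: float, k: int):
    # Re-instantiate the scorer with the candidate (k1, b) setting and return
    # the indices of the k highest-scoring documents.
    bm25 = BM25Okapi(tokenized, k1=k1, b=b)
    scores = bm25.get_scores(query.split())
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Sweep a small (k1, b) grid at two retrieval depths.
for k1 in (0.9, 1.2, 1.5):
    for b in (0.4, 0.75):
        print(k1, b, top_k("lexical retrieval tools", k1=k1, b=b, k=2))
```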

[IR-2] Personalized Deep Research: A User-Centric Framework Dataset and Hybrid Evaluation for Knowledge Discovery SIGIR2026

【速读】:该论文旨在解决当前由大语言模型(Large Language Models, LLMs)驱动的深度研究代理在学术发现流程中面临的个性化不足问题。现有系统采用静态的“一刀切”式检索范式,无法根据用户已有的专业知识或潜在兴趣动态调整探索的深度与广度,导致专家用户获得冗余信息而新手用户则面临信息过载。解决方案的关键在于提出Personalized Deep Research (PDR) 框架,其核心创新是将动态用户上下文嵌入到检索-推理循环的核心环节,通过统一用户画像建模与迭代查询生成、双阶段(私有/公开)检索以及上下文感知的信息融合机制,使系统能够自主对齐研究子目标与用户意图,并优化证据收集的终止条件,从而实现更精准的知识获取与报告相关性提升。

链接: https://arxiv.org/abs/2605.10530
作者: Xiaopeng Li,Wenlin Zhang,Yingyi Zhang,Pengyue Jia,Yejing Wang,Yichao Wang,Yong Liu,Huifeng Guo,Xiangyu Zhao
机构: City University of Hong Kong(香港城市大学); Huawei Technologies Ltd.(华为技术有限公司)
类目: Information Retrieval (cs.IR)
备注: Accepted to SIGIR 2026

点击查看摘要

Abstract:Deep Research agents driven by LLMs have automated the scholarly discovery pipeline, from planning and query formulation to iterative web exploration. Yet they remain constrained by a static, "one-size-fits-all" retrieval paradigm. Current systems fail to adaptively adjust the depth and breadth of exploration based on the user’s existing expertise or latent interests, frequently resulting in reports that are either redundant for experts or overly dense for novices. To address this, we introduce Personalized Deep Research (PDR), a framework that integrates dynamic user context into the core retrieval-reasoning loop. Rather than treating personalization as a post-hoc formatting step, PDR unifies user profile modeling with iterative query development, dual-stage (private/public) retrieval, and context-aware synthesis. This allows the system to autonomously align research sub-goals with user intent and optimize the stopping criteria for evidence collection. To facilitate benchmarking, we release the PDR Dataset, covering four realistic user tasks, and propose a hybrid evaluation framework combining lexical metrics with LLM-based judgments to assess factual accuracy and personalization alignment. Experimental results against commercial baselines demonstrate that PDR significantly improves retrieval utility and report relevance, effectively bridging the gap between generic information retrieval and personalized knowledge acquisition. The resource is available to the public at this https URL.

[IR-3] UniRank: Unified List-wise Reranking via Confidence-Ordered Denoising

【速读】:该论文旨在解决生成式列表重排序(list-wise reranking)中自回归(Autoregressive, AR)与非自回归(Non-autoregressive, NAR)方法的局限性问题:AR方法虽能建模项间依赖关系,但存在错误传播问题;NAR方法虽避免了错误传播,却因假设槽位独立而削弱了项间交互建模能力。解决方案的关键在于提出统一框架UniRank,其通过迭代去噪过程整合双向列表建模机制,并在每一步填充置信度最高的槽位,从而在推理阶段可退化为AR或NAR模型作为特例。进一步地,作者设计任务驱动扩散接口(Task Grounded Diffusion Interface, TGD),在项目层面执行去噪并限定预测范围于请求相关的候选池内,实现高效且精准的排序建模。实验表明,UniRank在多个公开及工业数据集上显著优于现有最优基线。

链接: https://arxiv.org/abs/2605.10527
作者: Pengyue Jia,Hailan Yang,Shuchang Liu,Xiaobei Wang,Wanyu Wang,Xiang Li,Yongqi Liu,Kaiqiao Zhan,Kun Gai,Xiangyu Zhao
机构: City University of Hong Kong (香港城市大学); Kuaishou Technology (快手科技)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:List-wise reranking arranges a request-specific pool of candidate items into an ordered slate that maximizes user satisfaction. Existing generative rerankers fall into two paradigms: Autoregressive (AR) rerankers construct the slate left to right and capture inter-item dependencies in the exposure list, but they suffer from error propagation because early mistakes affect subsequent slots. Non-autoregressive (NAR) rerankers predict all slots in parallel and avoid error propagation, but they weaken inter-item interaction modeling under a slot independence assumption. This raises a central question: is there a unified architecture that combines the strengths of both paradigms and delivers stronger reranking performance? We answer this question with UniRank, a unified list-wise reranking framework whose inference time variants recover AR and NAR rerankers as special cases. UniRank integrates bidirectional slate modeling into an iterative denoising process and fills the most confident slot at each step. To instantiate this framework for reranking, we introduce the Task Grounded Diffusion Interface (TGD), which performs denoising at the item level and restricts prediction to the request-specific candidate pool. TGD aggregates each item’s semantic tokens into a single item embedding and scores each slot directly against the candidate pool. Experiments on Amazon Books, MovieLens-1M, and an industrial short video dataset show that UniRank consistently outperforms state-of-the-art baselines. Online A/B tests on a real-world industrial platform further validate its effectiveness, yielding significant improvements of +0.159% in user average app-time and +1.016% in share-rate.
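UniRank's inference is a confidence-ordered fill: at each denoising step, score every still-open slot against the remaining candidates and commit only the single most confident assignment. A minimal sketch of that loop, with score_fn as a hypothetical stand-in for the model's slot-versus-candidate scoring of the current partial slate:

```python
import numpy as np

def confidence_ordered_fill(num_slots, candidates, score_fn):
    # score_fn(partial_slate, remaining) -> (num_slots, len(remaining)) score matrix.
    slate = [None] * num_slots
    remaining = list(candidates)
    for _ in range(num_slots):
        scores = np.asarray(score_fn(slate, remaining), dtype=float)
        for i, item in enumerate(slate):          # never overwrite an already-filled slot
            if item is not None:
                scores[i, :] = -np.inf
        slot, cand = np.unravel_index(np.argmax(scores), scores.shape)
        slate[slot] = remaining.pop(cand)         # commit the most confident (slot, item) pair
    return slate
```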

[IR-4] AgentGR: Semantic-aware Agentic Group Decision-Making Simulator for Group Recommendation

【速读】:该论文旨在解决现有群体推荐(Group Recommendation, GR)方法在建模真实世界群体决策过程中的局限性,即当前方法通常将群体偏好学习简化为个体偏好的简单聚合,忽略了群体内部复杂的交互动态和语义协同机制。其解决方案的关键在于提出AgentGR——一种基于语义感知的代理式群体决策模拟器,通过两个核心机制实现突破:一是引入语义元路径引导的链式偏好推理机制(semantic meta-path guided chain-of-preference reasoning),融合高阶协同过滤信号与文本语义信息以增强用户偏好表征;二是构建双策略多智能体模拟框架,分别采用静态工作流策略(效率导向)和动态对话策略(精度导向),显式建模群体话题与领导力等影响因素,从而仿真真实的群体互动过程,显著提升推荐准确性和群体决策模拟的真实性。

链接: https://arxiv.org/abs/2605.10367
作者: Yangtao Zhou,Wenhao You,Hua Chu,Shihao Guo,Jianan Li,Zhifu Zhao,Qingshan Li
机构: Xidian University(西安电子科技大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Group Recommendation (GR) aims to suggest items to a group of users, which has become a critical component of modern social platforms. Existing GR methods focus on aggregating individual user preferences with advanced neural networks to infer group preferences. Despite effectiveness, they essentially treat group preference learning as a simple preference aggregation process, failing to capture the complex dynamics of real-world group decision-making. To address these limitations, we propose AgentGR, a novel Semantic-aware Agentic Group Decision-Making Simulator for Group Recommendations, inspired by the semantic reasoning and human behavior simulation capabilities of LLM-driven agents. It aims to jointly capture collaborative-semantic user preferences for member-role-playing and simulate dynamic group interactions to reflect real-world group decision-making processes, thereby boosting recommendation performance. Specifically, to capture collaborative-semantic user preferences, we introduce a semantic meta-path guided chain-of-preference reasoning mechanism that integrates high-order collaborative filtering signals and textual semantics to improve user preference profiles. To model the complex dynamics of group decision-making, we first recognize group topic and leadership to explicitly model the influencing factors within the group decision processes. Building on these, we simulate group-level decision dynamics via two multi-agent simulation strategies for recommendations: a static workflow-based strategy for efficiency and a dynamic dialogue-based strategy for precision. Extensive experiments on two real-world datasets show that AgentGR significantly outperforms state-of-the-art baselines in both recommendation accuracy and group decision simulation, highlighting its potential for real-world GR applications.

[IR-5] Every Preference Has Its Strength: Injecting Ordinal Semantics into LLM-Based Recommenders SIGIR2026

【速读】:该论文旨在解决现有协同过滤-大语言模型(Collaborative Filtering - Large Language Model, CF-LLM)框架在推荐系统中忽视显式评分的序数结构问题,即这些方法通常将评分信息简化为隐式或仅正向反馈,从而丢失了用户偏好强度的细粒度语义。解决方案的关键在于提出有序语义锚定(Ordinal Semantic Anchoring, OSA),通过将序数偏好水平建模为数值文本标记(numeric textual tokens),并利用其嵌入作为语义锚点,在大语言模型潜在空间中对用户-物品交互表示进行强度感知对齐,从而在整合协同信号时保留偏好语义,显著提升对细微偏好差异的建模能力。

链接: https://arxiv.org/abs/2605.10323
作者: Jiwon Jeong,Donghee Han,Sungrae Hong,Woosung Kang,Mun Yong Yi
机构: Korea Advanced Institute of Science and Technology (KAIST)
类目: Information Retrieval (cs.IR)
备注: Accepted at SIGIR 2026

点击查看摘要

Abstract:Recent work has shown that large language models (LLMs) can enhance recommender systems by integrating collaborative filtering (CF) signals through hybrid prompting. However, most existing CF-LLM frameworks collapse explicit ratings into implicit or positive-only feedback, discarding the ordinal structure that conveys fine-grained preference strength. As a result, these models struggle to exploit graded semantics and nuanced preference distinctions. We propose Ordinal Semantic Anchoring (OSA), a hybrid CF-LLM framework that explicitly incorporates preference strength by modeling interaction-level user feedback. OSA represents ordinal preference levels as numeric textual tokens and uses their token embeddings as semantic anchors to align user-item interaction representations in the LLM latent space. Through strength-aware alignment across ordinal levels, OSA preserves preference semantics when integrating collaborative signals with LLMs. Experiments on multiple real-world datasets demonstrate that OSA consistently outperforms existing baselines, particularly in pairwise preference evaluation, highlighting its effectiveness in modeling fine-grained user preferences over prior CF-LLM methods.
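A rough sketch of the anchoring idea: ordinal levels are represented by the embeddings of numeric tokens ("1" through "L"), and interaction representations are pulled toward the anchor of their observed rating, with softer targets for nearby levels. The specific loss form and temperature below are illustrative assumptions, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def ordinal_anchor_loss(interaction_emb, anchor_embs, ratings, tau=1.0):
    # interaction_emb: (B, d) user-item interaction representations
    # anchor_embs:     (L, d) token embeddings of the numeric tokens "1".."L"
    # ratings:         (B,) observed ordinal levels in [0, L-1]
    sims = F.cosine_similarity(interaction_emb[:, None, :], anchor_embs[None, :, :], dim=-1)
    levels = torch.arange(anchor_embs.size(0), dtype=torch.float)
    dist = (levels[None, :] - ratings[:, None].float()).abs()
    target = F.softmax(-dist / tau, dim=-1)       # soft targets that decay with ordinal distance
    return F.kl_div(F.log_softmax(sims / tau, dim=-1), target, reduction="batchmean")
```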

[IR-6] Qwen Goes Brrr: Off-the-Shelf RAG for Ukrainian Multi-Domain Document Understanding KR

【速读】:该论文旨在解决多领域文档理解任务中,从PDF文档集合中准确回答乌克兰语多项选择题,并定位支持答案的文档及页码的问题。其核心挑战在于如何高效地检索相关文档片段并精准生成答案,同时在代码竞赛限制下保持系统简洁性与高性能。解决方案的关键在于:(1)基于上下文的PDF分块策略以保留文档结构信息;(2)引入问题感知的密集检索与结合问题和选项的重排序机制,提升相关段落召回率;(3)基于少量重排序后的段落进行约束式答案生成,从而显著提高准确性。实验表明,该方法在不依赖复杂后处理规则的情况下,通过结构化检索与答案空间感知的重排序实现了性能突破。

链接: https://arxiv.org/abs/2605.10296
作者: Anton Bazdyrev,Ivan Bashtovyi,Ivan Havlytskyi,Oleksandr Kharytonov,Artur Khodakovskyi
机构: National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute” (乌克兰国立技术大学“伊戈尔·西科斯基基辅理工学院”)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Accepted to The Fifth Ukrainian Natural Language Processing Conference (UNLP 2026)

点击查看摘要

Abstract:We participated in the Fifth UNLP shared task on multi-domain document understanding, where systems must answer Ukrainian multiple-choice questions from PDF collections and localize the supporting document and page. We propose a retrieval-augmented pipeline built around three ideas: contextual chunking of PDFs, question-aware dense retrieval and reranking conditioned on both the question and answer options, and constrained answer generation from a small set of reranked passages. Our final system uses Qwen3-Embedding-8B for retrieval, a fine-tuned Qwen3-Reranker-8B for passage ranking, and Qwen3-32B for answer selection. On a held-out split, reranking improves Recall@1 from 0.6957 to 0.7935, while using the top-2 reranked passages raises answer accuracy from 0.9348 to 0.9674. Our best leaderboard run reached 0.9452 on the public leaderboard and 0.9598 on the private leaderboard. Our results suggest that, under strict code-competition constraints, preserving document structure and making relevance estimation aware of the answer space are more effective than adding complex downstream heuristics.

[IR-7] To Redact or not to Redact? A Local LLM Approach to Deliberative Process Privilege Classification

【速读】:该论文旨在解决政府透明度法律(如美国和英国的《信息自由法》(FOIA) 及荷兰的《开放政府法案》(Woo))中敏感信息自动识别的问题,特别是针对“第五类豁免”(FOIA Exemption 5)所涵盖的内部决策过程文档的敏感性分类任务。其核心挑战在于如何在不依赖第三方云API的前提下,实现高精度、可部署于消费级硬件上的本地化敏感内容检测。解决方案的关键在于采用小型本地化大语言模型(Qwen3.5 9B),并结合链式思维(Chain-of-Thought)提示与基于错误样本的少样本提示策略,在保持较低计算成本的同时显著提升召回率(recall)和F2分数,性能接近商用模型Gemini 2.5 Flash,同时揭示了具有 deliberativeness 特征的语言模式,如第一人称代词与表达观点动词的组合使用。

链接: https://arxiv.org/abs/2605.10211
作者: Maik Larooij,David Graus
机构: University of Amsterdam(阿姆斯特丹大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted to The First Workshop on Artificial Intelligence Open Government at the 21st International Conference on Artificial Intelligence and Law (ICAIL), June 8, 2026, Singapore

点击查看摘要

Abstract:Government transparency laws, like the Freedom of Information (FOIA) acts in the United States and United Kingdom, and the Woo (Open Government Act) in the Netherlands, grant citizens the right to directly request documents from the government. As these documents might contain sensitive information, such as personal information or threats to national security, the laws allow governments to redact sensitive parts of the documents prior to release. We build on prior research to perform automatic sensitivity classification for the FOIA Exemption 5 deliberative process privilege using Large Language Models (LLMs). However, processing documents not yet cleared for review via third-party cloud APIs is often legally or politically untenable. Therefore, in this work, we perform sensitivity classification with a small, local model, deployable on consumer-grade hardware (Qwen3.5 9B). We compare eight variants of applying LLMs for sentence classification, using well-known prompting techniques, and find that a combination of Chain-of-Thought prompting and few-shot prompting with error-based examples outperforms classification models of earlier work in terms of recall and F2 score. This method also closely approaches the performance of a widely-used, cost-efficient commercial model (Gemini 2.5 Flash). In an additional analysis, we find that sentences that are predicted as deliberative contain more verbs that indicate the expression of opinions, and are more often phrased in the first person. Above all, deliberativeness seems characterized by the presence of a combination of multiple indicators, in particular the combination of first-person words with a verb for expressing opinion.

[IR-8] LASAR: Latent Adaptive Semantic Aligned Reasoning for Generative Recommendation

【速读】:该论文旨在解决生成式推荐系统中大语言模型(Large Language Models, LLMs)因逐 token 生成导致的推理效率低下问题,尤其是在延迟敏感场景下的部署瓶颈。现有基于连续隐状态空间的潜在推理(Latent Reasoning)方法虽具潜力,但在主流生成式推荐任务中面临三大挑战:(1)无先验语义的语义标识符(Semantic ID, SID)与连续潜在推理之间的语义鸿沟;(2)缺乏推理链监督导致的表征漂移;(3)固定推理深度带来的次优性。解决方案的关键在于提出 LASAR(Latent Adaptive Semantic Aligned Reasoning)框架,采用“监督微调(SFT)+ 强化学习(RL)”两阶段范式:首先通过两阶段训练对齐 SID 语义并引入潜在推理以实现高效收敛;其次利用双向 KL 散度约束潜空间推理轨迹,结合策略头(Policy Head)动态预测每样本推理步数;最后在基于 GRPO 的强化学习阶段,采用仅终端 KL 对齐支持变长推理路径,并用 REINFORCE 算法优化策略头,从而显著减少平均潜步骤数(近减半),同时提升推荐质量。

链接: https://arxiv.org/abs/2605.10207
作者: Yiwen Chen,Fuwei Zhang,Zehao Chen,Deqing Wang,Hehan Li,Peizhi Xu,Hanmeng Liu,Shuanglong Li,Xin Pei,Fuzhen Zhuang,Zhao Zhang
机构: Beihang University (北京航空航天大学); Baidu (百度)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated powerful reasoning capabilities through Chain-of-Thought (CoT) in various tasks, yet the inefficiency of token-by-token generation hinders real-world deployment in latency-sensitive recommender systems. Latent reasoning has emerged as an effective paradigm in LLMs, performing multi-step inference in a continuous hidden-state space to achieve stronger reasoning at lower cost. However, this paradigm remains underexplored in mainstream generative recommendation. Adapting it reveals three unique challenges: (1) the gap between prior-less Semantic ID (SID) symbols and continuous latent reasoning - SIDs lack pre-trained semantics, hindering joint optimization; (2) representation drift due to a lack of reasoning chain supervision; and (3) the suboptimality of applying a globally fixed reasoning depth. To address these, we propose LASAR (Latent Adaptive Semantic Aligned Reasoning), an SFT-then-RL framework. First, we bridge this gap via two-stage training: Stage 1 grounds SID semantics before Stage 2 introduces latent reasoning, ensuring efficient convergence. Second, we mitigate representation drift through explicit CoT semantic alignment. Step-wise bidirectional KL divergence constrains the latent reasoning trajectory using hidden-state anchors extracted from CoT text, while a Policy Head predicts per-sample reasoning depth. Third, during the GRPO-based RL phase, terminal-only KL alignment accommodates variable-length reasoning, and REINFORCE optimizes the Policy Head to dynamically allocate steps. This nearly halves the average latent step count while simultaneously improving recommendation quality. Experiments on three real-world datasets demonstrate that LASAR outperforms all baselines. It adds marginal inference latency and is roughly 20 times faster than generating explicit CoT text.
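One piece that lends itself to a sketch is the step-wise bidirectional KL term that ties latent reasoning states to CoT-derived anchors. The version below assumes both sets of hidden states are first mapped to distributions through a shared projection, which is an illustrative choice rather than the paper's parameterization:

```python
import torch
import torch.nn.functional as F

def stepwise_bidirectional_kl(latent_states, anchor_states, proj):
    # latent_states, anchor_states: (steps, d) hidden states for aligned reasoning steps.
    # proj: a shared linear map turning states into logits over "semantic bins"
    # (an illustrative way to obtain distributions from continuous states).
    p = F.log_softmax(proj(latent_states), dim=-1)
    q = F.log_softmax(proj(anchor_states), dim=-1)
    kl_pq = F.kl_div(q, p, log_target=True, reduction="batchmean")   # KL(p || q)
    kl_qp = F.kl_div(p, q, log_target=True, reduction="batchmean")   # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)
```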

[IR-9] ASTRA-QA: A Benchmark for Abstract Question Answering over Documents

【速读】:该论文旨在解决文档问答(Document-based Question Answering, QA)中抽象类问题的评估难题,这类问题要求模型从长文档或跨文档中提取并整合分散信息以生成连贯答案,而现有基准测试和评估方法普遍缺乏稳定的标准参考、依赖粗粒度相似性指标或不稳定的直接对比方式,导致评估结果不可靠。解决方案的关键在于提出ASTRA-QA基准数据集,其包含869个QA实例,覆盖五种抽象问题类型和三种受控检索范围,并为每个实例提供明确的评估标注,包括答案主题集合、精选的不支持主题以及对齐的证据片段;基于这些标注,ASTRA-QA通过直接评分主题覆盖率和不支持内容来衡量答案完整性与真实性,从而实现无需全面人工比对的可扩展评估,有效诊断检索增强生成(Retrieval-Augmented Generation, RAG)方法在覆盖度、幻觉和检索范围鲁棒性方面的表现。

链接: https://arxiv.org/abs/2605.10168
作者: Shu Wang,Shansong Zhou,Xinyang Wang,Shiwei Wang,Hulong Wu,Yixiang Fang
机构: The Chinese University of Hong Kong, Shenzhen; Data Science Group, Huolala
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Document-based question answering (QA) increasingly includes abstract questions that require synthesizing scattered information from long documents or across multiple documents into coherent answers. However, this setting is still poorly supported by existing benchmarks and evaluation methods, which often lack stable abstract references or rely on coarse similarity metrics and unstable head-to-head comparisons. To alleviate this issue, we introduce ASTRA-QA, a benchmark for AbSTRAct Question Answering over documents. ASTRA-QA contains 869 QA instances over academic papers and news documents, covering five abstract question types and three controlled retrieval scopes. Each instance is equipped with explicit evaluation annotations, including answer topic sets, curated unsupported topics, and aligned evidence. Building on these annotations, ASTRA-QA assesses whether answers cover required key points and avoid unsupported content by directly scoring topic coverage and curated unsupported content, enabling scalable evaluation without exhaustive head-to-head comparisons. Experiments with representative Retrieval-Augmented Generation (RAG) methods spanning vanilla, graph-based, and hierarchical retrieval settings show that ASTRA-QA provides reference-grounded diagnostics for coverage, hallucination, and retrieval-scope robustness. Our dataset and code are available at this https URL.

[IR-10] NumColBERT: Non-Intrusive Numeracy Injection for Late-Interaction Retrieval Models

【速读】:该论文旨在解决密集检索(dense retrieval)中数值条件查询(numerical condition queries)性能不佳的问题,例如“研发支出超过十亿美元的公司”。现有方法通常将查询分解为文本和数值两部分分别评分,虽能提升效果但需修改晚交互式检索模型(如ColBERT),带来部署复杂性、延迟增加及维护困难。其解决方案的关键在于提出NumColBERT,一种推理时非侵入式的增强方法:通过引入数值门控机制(Numerical Gating Mechanism)强化承载关键数值约束的词元、抑制无关数值提及,并结合数值对比学习目标(Numerical Contrastive Learning objective)优化嵌入空间以体现数值大小、单位与条件关系;该方案保留原始ColBERT索引结构和MaxSim评分流程,可直接复用现有优化与生态组件,实现高效且易维护的实际部署。

链接: https://arxiv.org/abs/2605.10109
作者: Haruki Fujimaki,Makoto P. Kato
机构: 未知
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:This study addresses the challenge of improving dense retrieval performance for queries containing numerical conditions, such as "companies with more than one billion dollars in R&D expenditure." Although recent research has shown that standard models struggle with numeric information in domains such as finance, e-commerce, and medicine, existing solutions typically decompose queries into textual and numerical components and score them separately. These approaches modify late-interaction retrieval models such as ColBERT and introduce challenges in deployment, latency, and maintainability. To overcome these limitations, we propose NumColBERT, an inference-time non-intrusive method that enhances numerically conditioned retrieval while preserving the original late-interaction mechanism. Because NumColBERT retains the standard ColBERT indexing and MaxSim scoring pipeline, existing optimizations and ecosystem components can be reused directly, facilitating practical deployment. NumColBERT introduces a Numerical Gating Mechanism and a Numerical Contrastive Learning objective to enable numerical conditions to contribute more effectively within standard ColBERT scoring. The gating mechanism amplifies tokens carrying critical numerical constraints while suppressing context-neutral numerical mentions, and the contrastive objective shapes the embedding space to reflect numerical magnitudes, units, and conditions. Experimental results show that NumColBERT substantially outperforms standard fine-tuning baselines and achieves accuracy comparable to or better than prior approaches relying on separate textual and numerical scoring. These findings demonstrate the feasibility of numerically conditioned retrieval with a non-intrusive inference pipeline and present a maintainable solution for real-world deployment.
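ColBERT-style late interaction sums, over query tokens, the maximum similarity to any document token; NumColBERT keeps that pipeline and re-weights query tokens with a gate. A minimal sketch of gated MaxSim (the gate values would come from the learned Numerical Gating Mechanism, which is not reproduced here):

```python
import torch

def gated_maxsim(q_emb: torch.Tensor, d_emb: torch.Tensor, q_gate: torch.Tensor) -> torch.Tensor:
    # q_emb: (Lq, dim) query token embeddings, d_emb: (Ld, dim) document token
    # embeddings, q_gate: (Lq,) per-query-token weights that amplify tokens
    # carrying numerical constraints and damp context-neutral mentions.
    sim = q_emb @ d_emb.t()               # (Lq, Ld) token-level similarities
    maxsim = sim.max(dim=1).values        # best-matching document token per query token
    return (q_gate * maxsim).sum()        # reduces to standard MaxSim when q_gate is all ones
```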

[IR-11] H-MAPS: Hierarchical Memory-Augmented Proactive Search Assistant for Scientific Literature SIGIR2026

【速读】:该论文旨在解决科学阅读过程中因手动关键词搜索导致的阅读流中断和高认知负荷问题,以及现有主动信息检索系统因仅依赖屏幕文本而无法准确理解用户背景与意图所引发的上下文歧义问题。其解决方案的关键在于提出H-MAPS(Hierarchical Memory-Augmented Proactive Search Assistant),该系统通过三层分层记忆机制,结合隐式阅读行为触发,将用户的潜在信息需求转化为显式的自然语言问题,并在本地设备上执行神经检索,从而实现个性化、隐私保护的文献推荐。

链接: https://arxiv.org/abs/2605.10097
作者: Koji Nishikawa,Makoto P. Kato
机构: University of Tsukuba (筑波大学); National Institute of Informatics (信息研究所)
类目: Information Retrieval (cs.IR)
备注: Accepted as a demonstration paper at SIGIR 2026. 6 pages, 2 figures. A video demonstration is available at this https URL

点击查看摘要

Abstract:Scientific reading is an active process that frequently requires consulting external resources, but manual keyword searching interrupts the reading flow and imposes a high cognitive load. Existing proactive information retrieval systems often suffer from context ambiguity, as they rely solely on on-screen text and ignore the reader’s specific background and intent. In this demonstration, we present H-MAPS (Hierarchical Memory-Augmented Proactive Search Assistant), a proactive literature exploration assistant that resolves this ambiguity by leveraging a three-layered hierarchical memory. Triggered by implicit reading behaviors, H-MAPS articulates the user’s latent information needs into explicit natural language questions and performs neural retrieval entirely on the local device to ensure privacy. We demonstrate H-MAPS using a scenario where two researchers, specializing in NLP and HCI, read the same paper. In response, the system generates profile-specific questions and retrieves distinct literature tailored to each user.

[IR-12] CCD-Level and Load-Aware Thread Orchestration for In-Memory Vector ANNS on Multi-Core CPUs ICDE’26

【速读】:该论文旨在解决基于芯片内多核(Chiplet-based Coherent Die, CCD)架构的CPU在向量近邻搜索(Vector Approximate Nearest Neighbor Search, ANNS)服务中,因缓存利用率低和负载不均衡导致的多核扩展效率低下问题。核心挑战在于:尽管现代CCD架构具备高并发能力,但现有线程调度策略未充分考虑实际请求的访问局部性以及芯片内多个芯粒(chiplet)间的缓存层级特性,从而造成缓存未命中率高、CPU停顿时间长,限制了吞吐量提升。解决方案的关键在于提出一种面向工作负载与硬件感知的线程编排框架,其核心机制包括:(i) 统一支持查询间并行的HNSW搜索和查询内并行的IVF搜索;(ii) 实现基于缓存友好的任务调度映射,动态适配不同负载模式;(iii) 引入CCD-aware任务窃取机制以缓解负载不均,显著降低缓存缺失率(下降6–30%)和CPU停顿(减少20–80%),最终使吞吐量提升达3.7倍,P50和P999延迟降低30–90%。

链接: https://arxiv.org/abs/2605.10090
作者: Yuchen Huang,Baiteng Ma,Yiping Sun,Yang Shi,Xiao Chen,Xiaocheng Zhong,Zhiyong Wang,Yao Hu,Chuliang Weng
机构: 1. Tsinghua University (清华大学); 2. Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); 3. Alibaba Cloud (阿里巴巴云)
类目: Information Retrieval (cs.IR)
备注: Accepted by ICDE’26

点击查看摘要

Abstract:Vector approximate nearest neighbor search (ANNS) underpins search engines, recommendation systems, and advertising services. Recent advances in ANNS indexes make CPU a cost-effective choice for serving million-scale, in-memory vector search, yet per-core throughput remains constrained by memory access latency of vector reading and the compute intensity of distance evaluations in production deployments. With the growing scale of the business and advances in hardware, modern CCD-based multi-core CPUs have been widely deployed for high throughput in our services. However, we find that simply increasing core counts does not yield optimal performance scaling. To improve the efficiency of more cores from the CCD-based architecture, we analyze the distributions of real-world requests in our production environments. We observe high access locality in vector search in our online services and low cache utilization, resulting from overlooking the multi-chiplet nature of CCD based CPUs. Hence, we propose a workload- and hardware-aware thread orchestration framework at CCD-level that (i) provides a uniform interface for both inter-query parallel HNSW search and intra-query parallel IVF search, (ii) achieves cache-friendly and workload-adaptive mapping of task dispatching, and (iii) employs CCD-aware task stealing to address load imbalance. Applied to real production workloads from search, recommendation, and advertising services of Xiaohongshu (RedNote), our approach delivers up to 3.7x higher throughput and 30-90% reductions in P50 and P999 latency. In detail, compared with the original framework, the cache-miss ratio decreases by 6-30%, and the total CPU stall is reduced by 20-80%.

[IR-13] Enhancing Healthcare Search Intent Recognition with Query Representation Learning and Session Context

【速读】:该论文旨在解决医疗搜索查询意图分类中的两大挑战:一是医疗查询具有多意图特性,导致用户点击行为模糊或分歧;二是基于全局用户统计推断的单一主流意图可能在具体会话中产生偏差,影响分类性能。解决方案的关键在于通过聚类聚合相似查询以改进查询表示学习,并引入一种新型损失函数以捕捉健康搜索查询的多维特性,从而提升模型的可扩展性和准确性;同时,论文提出协和率(Concordance Rate, CR)评分来量化查询歧义性及全局与会话级意图之间的不一致,并设计了一种简单有效的机制将学习到的查询表示融入上下文感知的会话级意图分类中。

链接: https://arxiv.org/abs/2605.10021
作者: Harshita Jagdish Sahijwani,Madhav Sigdel,Song Aslan,Priya Gopi Achuthan,Monica D. Skidmore,Eugene Agichtein,Chen Lin
机构: Emory University (埃默里大学); Kaiser Permanente (凯撒医疗集团)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Classifying the intent behind healthcare search queries is crucial for improving the delivery of online healthcare information. The intricate nature of medical search queries, coupled with the limited availability of high-quality labeled data, presents substantial challenges for developing efficient classification models. Previous studies have exploited user interaction data, such as user clicks from search logs and employed pairwise loss functions to model co-click behavior for query representation learning. However, many health queries could have multiple intents, resulting in ambiguous or divergent click behavior. Furthermore, learning the single most popular intent of queries as inferred from global statistics based on the aggregate behavior of different users could potentially lead to disparity and performance drop when classifying the query intent within specific search sessions. To address these limitations, our work improves the query representation learning by aggregating similar queries via clustering, and introducing a novel loss function designed to capture the multifaceted nature of health search queries, resulting in a more scalable and accurate learning procedure. Furthermore, we quantify the ambiguity of health queries and the misalignment between global search intents and those discerned from individual sessions, by introducing the concordance rate (CR) score, and demonstrate a simple and effective method for incorporating our learned query representation into contextual, session-based search intent classification. Our extensive experimental results and analysis on two real-world search log datasets, i.e., a Health Search (HS) dataset and the publicly available TripClick dataset, demonstrate that our approach not only improves the intrinsic clustering metrics for query representation learning but also enhances accuracy for subsequent search intent classification tasks.
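The concordance rate (CR) can be read as an agreement statistic between session-level intents and each query's globally dominant intent. A minimal sketch under that reading (the paper's exact definition and weighting may differ):

```python
from collections import Counter, defaultdict

def concordance_rate(observations):
    # observations: iterable of (query, session_intent) pairs.
    by_query = defaultdict(list)
    for query, intent in observations:
        by_query[query].append(intent)
    matches = total = 0
    for query, intents in by_query.items():
        global_intent, _ = Counter(intents).most_common(1)[0]   # majority intent across sessions
        matches += sum(intent == global_intent for intent in intents)
        total += len(intents)
    return matches / total

obs = [("chest pain", "symptom"), ("chest pain", "symptom"), ("chest pain", "appointment")]
print(concordance_rate(obs))   # 2/3: one session diverges from the query's global majority intent
```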

[IR-14] Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception

【速读】:该论文旨在解决当前城市空间感知(Urban Space Perception)中缺乏大规模、多模态、理论驱动的基准数据集与评估体系的问题,尤其针对由用户生成内容(User-Generated Content, UGC)构成的社会媒体图像在AI模型中的理解能力不足。解决方案的关键在于构建Urban-ImageNet——一个包含超200万张来自微博(Weibo)的公共社会媒体图像及其配对文本信息的数据集,并基于HUSIC(Hierarchical Urban Space Image Classification)框架定义了一个10类城市空间分类体系,该体系以城市理论为基础,区分激活与非激活的公共空间、室内外环境、居住空间、消费内容、人物肖像及非空间社交内容。通过标准化的三个任务(T1:语义分类、T2:跨模态图文检索、T3:实例分割),该基准实现了对AI系统在不同尺度(1K–100K–2M)、多模态和任务形式下城市空间感知能力的统一评估,从而推动生成式AI(Generative AI)与视觉语言模型在真实城市场景中的可解释性和功能性发展。

链接: https://arxiv.org/abs/2605.09936
作者: Yiwei Ou,Chung Ching Cheung,Jun Yang Ang,Xiaobin Ren,Ronggui Sun,Guansong Gao,Kaiqi Zhao,Manfredo Manfredini
机构: University of Auckland (奥克兰大学); University of Pennsylvania (宾夕法尼亚大学); Stanford University (斯坦福大学); Harbin Institute of Technology, Shenzhen (深圳哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present Urban-ImageNet, a large-scale multi-modal dataset and evaluation benchmark for urban space perception from user-generated social media imagery. The corpus contains over 2 Million public social media images and paired textual posts collected from Weibo across 61 urban sites in 24 Chinese cities across 2019-2025, with controlled benchmark subsets at 1K, 10K, and 100K scale and a full 2M corpus for large-scale training and evaluation. Urban-ImageNet is organized by HUSIC, a Hierarchical Urban Space Image Classification framework that defines a 10-class taxonomy grounded in urban theory. The taxonomy is designed to distinguish activated and non-activated public spaces, exterior and interior urban environments, accommodation spaces, consumption content, portraits, and non-spatial social-media content. Rather than treating urban imagery as generic scene data, Urban-ImageNet evaluates whether machine perception models can capture spatial, social, and functional distinctions that are central to urban studies. The benchmark supports three tasks within one standardized library: (T1) urban scene semantic classification, (T2) cross-modal image-text retrieval, and (T3) instance segmentation. Our experiments evaluate representative vision, vision-language, and segmentation models, revealing strong performance on supervised scene classification but more challenging behavior in cross-modal retrieval and instance-level urban object segmentation. A multi-scale study further examines how model performance changes as balanced training data increases from 1K, 10K to 100K images. Urban-ImageNet provides a unified, theory-grounded, multi-city benchmark for evaluating how AI systems perceive and interpret contemporary urban spaces across modalities, scales, and task formulations. Dataset and benchmark are available at: this http URL and this http URL.

[IR-15] OpenZL: Using Graphs to Compress Smaller and Faster

【速读】:该论文旨在解决传统通用压缩算法在工业应用中面临的性能瓶颈问题,即虽然研究型方法能实现更高的无损压缩比,但因处理时间长、资源消耗高而难以满足生产系统对高吞吐量和低资源占用的需求;同时,现有应用特定压缩器虽性能优越,却存在适用范围窄、开发与维护成本高等缺陷。解决方案的关键在于提出一种新的“图模型”压缩框架(graph model of compression),将压缩过程建模为由模块化编解码器构成的有向无环图(DAG),并通过OpenZL实现该框架——其生成自描述的线格式数据,支持任意配置通过统一解码器还原,从而显著提升应用特定压缩器的开发效率与可维护性,并在压缩比和速度上优于当前主流通用压缩算法,尤其在深度学习压缩方案中展现出压倒性的速度优势。

链接: https://arxiv.org/abs/2605.09928
作者: Yann Collet,Nick Terrell,W. Felix Handte,Danielle Rozenblit,Victor Zhang,Kevin Zhang,Yaelle Goldschlag,Jennifer Lee,Elliot Gorokhovsky,Yonatan Komornik,Daniel Riegel,Stan Angelov,Nadav Rotem
机构: 未知
类目: Information Retrieval (cs.IR); Databases (cs.DB)
备注: arXiv admin note: substantial text overlap with arXiv:2510.03203

点击查看摘要

Abstract:In the last few decades, research techniques have improved lossless compression ratios by significantly increasing processing time. However, these techniques have not gained popularity in industry because production systems require high throughput and low resource utilization. Instead, real world improvements in compression are increasingly realized by building application-specific compressors which can exploit knowledge about the structure and semantics of the data being compressed. Application-specific compressor systems outperform even the best generic compressors, but these techniques have severe drawbacks – they are inherently limited in applicability, are hard to develop, and are difficult to maintain and deploy. In this work, we show that these challenges can be overcome with a new compression strategy. We propose the “graph model” of compression, a new theoretical framework for representing compression as a directed acyclic graph of modular codecs. OpenZL implements this framework and compresses data into a self-describing wire format, any configuration of which can be decompressed by a universal decoder. OpenZL’s design enables rapid development of application-specific compressors with minimal code. Experimental results demonstrate that OpenZL achieves superior compression ratios and speeds compared to state-of-the-art general-purpose compressors on a variety of real-world datasets. Compared to ratio-focused deep-learning compressors, OpenZL is competitive on ratio while being many orders of magnitude faster. Internal deployments at Meta have also shown consistent improvements in size and/or speed, with development timelines reduced from months to days. OpenZL thus represents a significant advance in practical, scalable, and maintainable data compression for modern data-intensive applications.

[IR-16] Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents

【速读】:该论文旨在解决生产环境中大语言模型(Large Language Model, LLM)编码代理在长时间会话中出现的人格漂移(persona drift)问题,即代理遗忘用户指定约束、重复已标记错误并虚构先前达成的一致。针对现有白盒方法(如persona vectors)需访问模型权重而无法适用于闭源API(如Claude、GPT-4)的问题,作者提出Nautilus Compass——一种黑盒人格漂移检测与代理记忆层系统。其核心创新在于完全基于提示文本(prompt-text layer)构建:通过BGE-m3嵌入计算用户提示与行为锚点文本之间的余弦相似度,并采用加权top-k均值聚合策略实现高效漂移检测;同时,该系统不依赖索引时调用LLM提取事实或构建图结构,直接对原始对话文本进行嵌入,显著降低计算开销。实验表明,Compass在真实会话数据上达到ROC AUC 0.83的漂移检测性能,并在LongMemEval-S和EverMemBench-Dynamic基准上优于多个公开基线,且整体端到端复现成本仅为GPT-4o判别堆栈的1/14。

链接: https://arxiv.org/abs/2605.09863
作者: Chunxiao Wang
机构: Yiluo Technology Co., Ltd. (亿诺科技有限公司)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 19 pages, 6 figures. MIT-licensed code + reproduction scripts at this http URL

点击查看摘要

Abstract:Production LLM coding agents drift over long sessions: they forget user-specified constraints, slip into mistakes the user already flagged, and confabulate prior agreements. White-box approaches such as persona vectors require model weights and so cannot be applied to closed APIs (Claude, GPT-4) that most users actually interact with. We present Nautilus Compass, a black-box persona drift detector and agent memory layer for production coding agents. The method operates entirely at the prompt-text layer: cosine similarity between user prompts and behavioral anchor texts, aggregated by a weighted top-k mean using BGE-m3 embeddings. Compass is, to our knowledge, the only public agent memory layer (among Mem0, Letta, Cognee, Zep, MemOS, smrti verified May 2026) that does not call an LLM at index time to extract facts or build a graph; raw conversation text is embedded directly. The system ships as a Claude Code plugin, an MCP 2024-11-05 A2A server (Cursor, Cline, Hermes), a CLI, and a REST API on one daemon, with a Merkle-chained audit log for tamper-evident anchor updates. On a held-out test set built from real Claude Code session traces and labeled by an independent LLM judge, Compass reaches ROC AUC 0.83 for drift detection. The embedded retrieval pipeline scores 56.6% on LongMemEval-S v0.8 and 44.4% on EverMemBench-Dynamic (n=500), topping the four published EverMemBench Table 4 baselines. LongMemEval-S 56.6% is ~30 points below recent white-box leaders (90+%); we treat that as the architectural ceiling of the no-extraction design. End-to-end reproduction cost is 3.50 (~14x cheaper than GPT-4o-judged stacks). A paired cross-vendor behavior A/B accompanies these numbers as preliminary system-level evidence. Code, anchors, frozen test data, and audit-log tooling are MIT-licensed at this http URL.
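The detector itself is compact: embed the prompt and the behavioral anchor texts, take cosine similarities, and aggregate with a weighted top-k mean. A minimal sketch on precomputed, L2-normalized embeddings (BGE-m3 or any sentence encoder is assumed to have produced them; the decaying weights are an illustrative choice):

```python
import numpy as np

def alignment_score(prompt_emb: np.ndarray, anchor_embs: np.ndarray, k: int = 3) -> float:
    # prompt_emb: (d,) and anchor_embs: (n_anchors, d), both L2-normalized,
    # so dot products are cosine similarities.
    sims = anchor_embs @ prompt_emb
    topk = np.sort(sims)[-k:][::-1]                  # k best-matching anchors, descending
    weights = np.linspace(1.0, 0.5, num=k)           # illustrative decaying weights
    weights /= weights.sum()
    return float(topk @ weights)

# A low alignment score relative to the session's running baseline is flagged as persona drift.
```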

[IR-17] ReCoVR: Closing the Loop in Interactive Composed Video Retrieval

【速读】:该论文旨在解决现有组成视频检索(Composed Video Retrieval, CoVR)方法仅支持单轮交互、无法适应真实场景中用户逐步细化搜索意图的局限性,从而提出交互式组成视频检索(Interactive Composed Video Retrieval)这一多轮扩展任务。其解决方案的关键在于设计了一种双路径架构 ReCoVR(Reflexive Composed Video Retrieval),通过引入反射感知机制(reflexive perception),将系统的检索历史作为诊断证据与用户反馈共同用于决策:其中意图路径(Intent Pathway)对异构反馈进行分发以激活互补检索通道,而反思路径(Reflection Pathway)则在轨迹层面监控结果演化并纠正多轮交互中的检索偏差,从而实现更鲁棒和可解释的交互式视频检索。

链接: https://arxiv.org/abs/2605.09836
作者: Bingqing Zhang,Yi Zhang,Zhuo Cao,Yang Li,Xue Li,Jiajun Liu,Sen Wang
机构: The University of Queensland (昆士兰大学); CSIRO Data61 (澳大利亚联邦科学与工业研究组织数据61)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Composed video retrieval (CoVR) searches for target videos using a reference video and a modification text, but existing methods are restricted to a single interaction round and cannot support the progressive nature of real-world visual search. To bridge this gap, we first formalize interactive composed video retrieval, a multi-turn extension of CoVR, where users progressively refine their search intent through natural-language feedback across turns. Adapting existing interactive retrieval methods to this setting reveals two structural weaknesses: reliance on a single retrieval channel and an open-loop retrieval design that consumes user feedback but does not diagnose whether its own retrieval trajectory is drifting or stagnating. To address these limitations, we propose ReCoVR (Reflexive Composed Video Retrieval), a dual-pathway architecture built on reflexive perception, where the system treats its retrieval history as diagnostic evidence alongside user feedback. Specifically, an Intent Pathway routes heterogeneous feedback to complementary retrieval channels, while a Reflection Pathway performs trajectory-level reflection to monitor result evolution and correct retrieval errors across turns. Experiments on multiple benchmarks show that ReCoVR consistently outperforms interactive baselines, notably achieving 74.30% R@1 after just one interactive round on the WebVid-CoVR-Test dataset.

[IR-18] Loom: Hybrid Retrieval-Scoring Outfit Recommendation with Semantic Material Compatibility and Occasion-Aware Embedding Priors

【速读】:该论文旨在解决时尚穿搭推荐中如何生成完整且协调的服装搭配问题,尤其在面对大规模商品目录时,如何兼顾语义一致性、风格统一性与场景适配性。其核心解决方案是提出Loom系统,该系统融合神经嵌入检索(neural embedding retrieval)与结构化领域评分机制,通过两个关键技术突破传统方法局限:一是引入“语义材质权重”(semantic material weight),利用CLIP嵌入几何特性自动推断衣物材质厚重度以实现叠穿兼容性判断,无需人工定义材料分类体系;二是设计“氛围/反氛围场合先验”(vibe/anti-vibe occasion priors),将场合描述文本转化为CLIP空间中的锚点向量,基于差异化亲和度对候选单品进行评分。实验表明,方向重排序(direction reranking)为不可或缺模块,移除后性能骤降至随机基线水平,整体系统可在消费级硬件上于5秒内生成三种风格各异的高质量穿搭方案。

链接: https://arxiv.org/abs/2605.09830
作者: Anushree Berlia
机构: 未知
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL

点击查看摘要

Abstract:We present Loom, an outfit recommendation system that combines neural embedding retrieval with structured domain scoring to generate complete, coherent outfits from fashion catalogs. Given an anchor clothing item, Loom retrieves complementary pieces via slot-constrained approximate nearest neighbor search over FashionCLIP embeddings, then scores candidate outfits using a multi-objective function that integrates six signals: embedding similarity, color harmony, formality consistency, occasion coherence, style direction, and within-outfit diversity. We introduce two techniques that address limitations of purely learned or purely rule-based approaches: (1) semantic material weight, which uses CLIP embedding geometry to infer garment heaviness for layer compatibility without hand-coded material taxonomies; and (2) vibe/anti-vibe occasion priors, which embed prose descriptions of occasion contexts as anchor vectors in CLIP space and score items by differential affinity. Ablation experiments on a catalog of 620 items show that each component contributes measurably to outfit quality: the full system achieves a mean outfit score of 0.179 with a 9.3% hard violation rate, compared to 0.054 score and 16.0% violations for a category-constrained random baseline, a 3.3x improvement in score and 42% reduction in violations. Direction reranking is the single indispensable component: removing it drops score to 0.052, essentially equal to random. The system generates three stylistically distinct outfits in under 5 seconds on commodity hardware.
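The vibe/anti-vibe prior boils down to differential affinity in embedding space, which then enters the weighted multi-objective outfit score. A minimal sketch on precomputed, normalized FashionCLIP-style embeddings (signal names and weights are illustrative):

```python
import numpy as np

def occasion_affinity(item_emb: np.ndarray, vibe_emb: np.ndarray, anti_vibe_emb: np.ndarray) -> float:
    # All embeddings are L2-normalized, so dot products are cosine similarities.
    # Score = similarity to the occasion "vibe" anchor minus similarity to the "anti-vibe" anchor.
    return float(item_emb @ vibe_emb - item_emb @ anti_vibe_emb)

def outfit_score(signals: dict[str, float], weights: dict[str, float]) -> float:
    # signals: per-outfit values for embedding similarity, color harmony, formality,
    # occasion coherence, style direction, and diversity; weights are tunable.
    return sum(weights[name] * value for name, value in signals.items())
```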

[IR-19] LLM Agents Enable User-Governed Personalization Beyond Platform Boundaries

【速读】:该论文旨在解决当前个性化服务过度依赖平台中心化数据所导致的用户画像不完整问题,即平台因竞争、法律、隐私及认知限制难以获取用户全维度行为数据,从而影响个性化效果。其解决方案的关键在于推动从“平台主导的个性化”向“用户自主治理的个性化”转变,利用大语言模型(Large Language Model, LLM)代理实现跨平台与离线数据的整合与推理,使用户能够基于自身拥有的异构数据生成更精准、可操作的个性化能力。

链接: https://arxiv.org/abs/2605.09794
作者: Jiacheng Lin,Kun Qian,Arvind Srinivasan,Tian Wang,Fang Han,Changran Hu,Junze Liu,Ziyi Wang,Hanwen Xu,Mengmeng Xue,Shuo Yang,Hansi Zeng,Simon Sinong Zhan,Kai Zhong,Weiqi Zhang,Dakuo Wang,Tianhao Wang,Zhiyuan Li
机构: University of Illinois Urbana Champaign (伊利诺伊大学厄巴纳-香槟分校); University of Texas at Austin (德克萨斯大学奥斯汀分校); Carnegie Mellon University (卡内基梅隆大学); New York University (纽约大学); University of California, Berkeley (加州大学伯克利分校); Northeastern University (东北大学); University of Washington (华盛顿大学); University of California, Irvine (加州大学欧文分校); University of New Brunswick (新布伦瑞克大学); University of Massachusetts at Amherst (马萨诸塞大学阿默斯特分校); Northwestern University (西北大学); University of California, San Diego (加州大学圣地亚哥分校); Toyota Technological Institute at Chicago (丰田技术研究院芝加哥校区)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Personalization today is fundamentally platform-centric: services build user representations from the behavioral fragments they observe. Yet no platform can construct a complete picture of the user, as competitive incentives, legal constraints, user privacy concerns, and epistemic limits create persistent data barriers. This paper argues for a shift from platform-centric personalization to user-governed personalization, where only the user can integrate fragmented contexts across platforms and the offline world. The key asymmetry lies in data access: only users can aggregate their own cross-platform and offline information. Large language model (LLM) agents make such integration practically feasible for the first time by enabling reasoning over heterogeneous personal data and transforming users’ cross-context information into actionable personalization capabilities. We provide proof-of-concept evidence that users equipped with cross-platform data exports and an off-the-shelf LLM agent can outperform single-platform personalization baselines. We conclude by outlining a research agenda for building scalable user-governed personalization systems.

[IR-20] A General Framework for Multimodal LLM-Based Multimedia Understanding in Large-Scale Recommendation Systems SIGIR2026

【速读】:该论文旨在解决传统推荐系统难以充分挖掘多媒体内容中高维语义信号的问题,从而限制了用户偏好建模的精度。其核心挑战在于如何将多模态大语言模型(Multimodal Large Language Models, MM-LLMs)有效集成到对延迟敏感、工业级规模的推荐架构中。解决方案的关键在于提出一个通用的MM-LLM驱动的多媒体理解框架,采用三阶段架构:内容解析、表征提取与系统化流水线整合;具体实现上基于LLaMA2模型生成描述性标题,并将其作为分词后的类别特征输入推荐系统,实证结果表明该方法在离线AUC指标上提升0.35%,在线指标提升0.02%,验证了MM-LLMs在大规模推荐场景中的可行性与有效性。

链接: https://arxiv.org/abs/2605.09338
作者: Yiming Zhu,Xu Liu,Ziyun Xu,Zheng Wu,Joena Zhang,Sirius Chen,Chenheli Hua,Silvester Yao,Qichao Que,Wentao Shi,Junfeng Pan,Linhong Zhu
机构: Meta Platforms (Meta)
类目: Information Retrieval (cs.IR)
备注: Accepted by SIGIR 2026 short

点击查看摘要

Abstract:Conventional recommendation systems frequently fail to fully exploit the high-dimensional semantic signals inherent in multimedia content, thereby limiting the fidelity of user preference modeling. While Multimodal Large Language Models (MM-LLMs) offer robust mechanisms for interpreting such complex data, their integration into latency-constrained, industrial-scale architectures remains a significant challenge. To address this, we propose a generalized framework for MM-LLM-driven multimedia understanding. Our methodology employs a tripartite architecture encompassing content interpretation, representation extraction, and systematic pipeline integration, instantiated via a LLaMA2-based model that generates descriptive captions subsequently ingested as tokenized categorical features. Empirical evaluation demonstrates the efficacy of this approach, yielding a 0.35% increase in offline AUC and a 0.02% improvement in online metrics at scale, substantiating the practical viability of leveraging MM-LLMs to enhance large-scale recommendation performance.
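以下是一个示意性的 Python 片段,说明“将 MM-LLM 生成的描述性标题作为分词后的类别特征注入推荐系统”这一思路的最小形态;标题内容、哈希桶数量等均为本文假设,并非论文的实际工程实现。

```python
def caption_to_id_features(caption: str, num_buckets: int = 100_000):
    """把标题分词后哈希成稀疏 ID,作为推荐模型可消费的类别特征(示意;hash 结果随进程而变)。"""
    tokens = caption.lower().split()
    return sorted({hash(tok) % num_buckets for tok in tokens})

# 假设该标题由 LLaMA2 类模型对一段短视频生成
caption = "a dog catching a frisbee on the beach at sunset"
print(caption_to_id_features(caption)[:8])
```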

[IR-21] OpenIIR: An Open Simulation Platform for Information Retrieval Research

【速读】:该论文旨在解决信息检索(Information Retrieval, IR)研究中实验设计复杂、可复现性差以及多智能体交互机制难以系统化建模的问题。其解决方案的关键在于构建一个统一的共享核心框架——OpenIIR,该框架支持以参数化方式运行数百个由大语言模型(Large Language Models, LLMs)驱动的“角色代理”(persona),并通过定义清晰的接口类型实现场景插件化(pluggable scenarios)。通过将实验配置(如角色预算、检索策略、排序器选择、干预时机等)前置声明,并提供结构化输出(如论证图、曝光日志、适应度轨迹等),确保实验结果的可比较性和可复现性;同时,该框架已集成四种典型多智能体研究范式( deliberative panels, social platforms, curated recommender feeds, 和 evolutionary co-evolution)及六类模块化扩展,为开放的IR研究问题提供可插拔、可扩展的实验平台。

链接: https://arxiv.org/abs/2605.09321
作者: Saber Zerhoudi
机构: University of Passau (帕绍大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:OpenIIR runs hundreds of LLM-driven personas as parameterised, reproducible IR research experiments. Researchers configure agents across four kinds of multi-agent study (deliberative panels, social platforms, curated recommender feeds, and evolutionary co-evolution between content producers and credibility detectors) under many priors, rounds, and constraints. Persona budgets, retrieval policies, ranker choices, intervention timings, and mutation rates are declared up front, and the same study can be re-run under different settings to compare outcomes side by side. Every run produces structured outputs (argument graphs, exposure logs, fitness traces, transcripts) that a downstream evaluator can consume directly, and a new study is a 200–400 line plug-in over a shared core (agent runtime, world-model store, retrieval primitives, claim extractor, persona ontology). The contributions are: (i) the shared core; (ii) a type interface for pluggable scenarios; (iii) four released types with reference runs (Panel, Social-Media, Curated-Feed, Multi-Generational); and (iv) six modular extensions sketched against open IR research questions.
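为直观说明“共享核心 + 可插拔场景类型”的组织方式,下面给出一个假设性的接口草图;类名、方法签名与配置字段均为本文虚构,仅示意摘要中插件式研究类型可能的形态。

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class StudyConfig:
    personas: int = 100        # 角色预算
    rounds: int = 5            # 交互轮数
    retrieval_policy: str = "bm25"
    seed: int = 0

class Scenario(ABC):
    """Panel / Social-Media / Curated-Feed / Multi-Generational 等研究类型的公共接口(假设)。"""

    @abstractmethod
    def setup(self, config: StudyConfig) -> None: ...

    @abstractmethod
    def step(self, round_idx: int) -> dict:
        """运行一轮并返回结构化输出(如曝光日志、论证图片段)。"""

    @abstractmethod
    def collect(self) -> dict:
        """汇总整个实验的可复现输出。"""

class PanelScenario(Scenario):
    def setup(self, config):
        self.cfg, self.logs = config, []

    def step(self, round_idx):
        out = {"round": round_idx, "exposures": [], "arguments": []}
        self.logs.append(out)
        return out

    def collect(self):
        return {"config": self.cfg, "exposure_logs": self.logs}

study = PanelScenario()
study.setup(StudyConfig(personas=50, rounds=3))
for r in range(3):
    study.step(r)
print(study.collect()["exposure_logs"][-1])
```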

[IR-22] Matching Meaning at Scale: Evaluating Semantic Search for 18th-Century Intellectual History through the Case of Locke

【速读】:该论文旨在解决当前基于词汇文本复用检测的方法在分析18世纪思想传播时存在的局限性,即仅能捕捉字面引用而无法识别改写(paraphrase)和复杂的隐含互动。其解决方案的关键在于引入语义搜索(semantic search)技术,并通过专家标注构建语义分类体系来评估该方法是否能够发现传统词汇匹配手段所遗漏的意义层面的对应关系。实验结果表明,语义搜索显著提升了对隐含接受(implicit reception)的检出率,但同时也揭示了“词汇门控效应”(lexical gatekeeping effect),即检索效果仍受表面词汇重叠程度的影响,凸显了语义检索在处理大规模历史语料中思想流通分析时的潜力与限制。

链接: https://arxiv.org/abs/2605.09236
作者: Yu Wu,Ananth Mahadevan,Filip Ginter,Michael Mathioudakis,Mikko Tolonen
机构: University of Helsinki (赫尔辛基大学); TurkuNLP, University of Turku (图尔库自然语言处理,图尔库大学); ELLIS Institute Finland (芬兰ELLIS研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
备注: Accepted by NLP4DH 2026

点击查看摘要

Abstract:While digitized corpora have transformed the study of intellectual transmission, current methods rely heavily on lexical text reuse detection, capturing verbatim quotations but fundamentally missing paraphrases and complex implicit engagement. This paper evaluates semantic search in 18th-century intellectual history through the reception of John Locke’s foundational work. Using expert annotation grounded in a semantic taxonomy, we examine whether an off-the-shelf semantic search pipeline can surface meaning-level correspondences overlooked by lexical methods. Our results demonstrate that semantic search retrieves substantially more implicit receptions than lexical baselines. However, linguistic diagnostics also reveal a “lexical gatekeeping” effect, where retrieval remains partially constrained by surface vocabulary overlap. These findings highlight both the potential and the limitations of semantic retrieval for analyzing the circulation of ideas in large historical corpora. The data is available at this https URL.
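下面是一个与论文所用管线无关的通用语义检索骨架,用来说明“句向量 + 余弦相似度排序”这类现成方案的基本流程;embed() 为占位实现(字符级词袋),实际使用时需替换为真实的句向量模型。

```python
import numpy as np

def embed(texts, dim=256):
    # 占位:字符级词袋向量,仅用于演示接口;真实场景中替换为句向量模型的输出
    vecs = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        for ch in t.lower():
            vecs[i, ord(ch) % dim] += 1.0
    return vecs

def semantic_search(query, passages, k=3):
    q = embed([query])[0]
    P = embed(passages)
    sims = P @ q / (np.linalg.norm(P, axis=1) * np.linalg.norm(q) + 1e-8)
    order = np.argsort(-sims)[:k]
    return [(passages[i], round(float(sims[i]), 3)) for i in order]

passages = [
    "All ideas originate in sensation or reflection.",
    "The mind at birth is like white paper, void of characters.",
    "Government derives its authority from the consent of the governed.",
]
print(semantic_search("where do ideas come from, experience or innate notions", passages, k=2))
```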

[IR-23] Reddit2Deezer: A Scalable Dataset for Real-World Grounded Conversational Music Recommendation

【速读】:该论文旨在解决对话式音乐推荐(Conversational Music Recommendation, CMR)研究中普遍存在的数据瓶颈问题:真实对话语料库规模有限,而合成语料库虽可扩展但缺乏自然对话的真实性。解决方案的关键在于构建一个基于真实社交平台(Reddit)的高质量CMR资源——Reddit2Deezer,其包含19万条唯一线程与评论对,并通过链接到Deezer音乐平台的标识符实现了音乐实体的内容锚定(content grounding),从而确保对话内容与实际音乐特征(如流派标签、流行度、BPM等)直接关联。该数据集提供原始版本以保留真实性,以及经过改写(paraphrased)版本以提升长期可复现性,经人工验证确认了对话质量、物品锚定准确性及改写合理性,为未来内容驱动的对话推荐研究奠定了坚实基础。

链接: https://arxiv.org/abs/2605.09120
作者: Haven Kim,Julian McAuley
机构: University of California San Diego (加州大学圣地亚哥分校)
类目: Information Retrieval (cs.IR); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Conversational music recommendation (CMR) research currently faces a tradeoff between authentic dialogue corpora that are limited in scale and synthesized corpora that scale up but whose conversations are artificially constructed rather than naturally observed. In this paper, we introduce Reddit2Deezer, a reality-grounded CMR resource derived from 190k unique thread, leaf-comment pairs. We release the resource in two versions: a raw version that preserves authenticity, and a paraphrased version that maximizes long-term reproducibility. Each musical entity is linked to a Deezer identifier, which provides straightforward access to audio previews and rich metadata (e.g., genre tags, popularity, BPM), opening the door to future research on content-grounded conversational recommendation. A human validation confirms the quality of the dialogues, item grounding, and paraphrases. The dataset is available at this https URL.

[IR-24] Personalized w-Event Privacy for Infinite Stream Estimation

【速读】:该论文旨在解决无限数据流场景下用户个性化隐私保护与准确流统计估计之间的矛盾问题,即现有方法假设所有用户具有相同的隐私需求(homogeneous privacy requirements),无法满足不同用户对隐私强度的差异化偏好。其核心解决方案是提出一套基于个性化窗口大小机制(Personalized Window Size Mechanism, PWSM)的隐私预算分配与吸收策略:通过Personalized Budget Distribution (PBD) 保证隐私预算在时间维度上的非递减性,以及Personalized Budget Absorption (PBA) 利用历史和未来若干时间槽的闲置预算进行动态调整,从而实现$(\boldsymbol{w}, \boldsymbol{\mathcal{E}})$-Event Personalized Differential Privacy($(\boldsymbol{w}, \boldsymbol{\mathcal{E}})$-EPDP)下的高精度流统计估计。进一步引入动态版本DPBD与DPBA,支持用户在运行时动态调整隐私参数,同时维持严格的隐私保障。理论分析表明所有方法均满足对应的个性化差分隐私约束,并推导出误差上界;实验验证其相比当前最优算法可降低至少53.6%的估计误差。

链接: https://arxiv.org/abs/2605.09054
作者: Leilei Du,Xu Zhou,Peng Cheng,Lei Chen,Xuemin Lin,Wei Xi,Kenli Li
机构: 未知
类目: Databases (cs.DB); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
备注: 31 pages

点击查看摘要

Abstract:In applications such as event monitoring, log analysis, and video querying, $w$-event privacy protects individual data within a sliding time window while supporting accurate stream statistics. Existing studies on infinite data streams mainly assume homogeneous privacy requirements for all users, which cannot capture user-specific privacy preferences. This paper studies personalized $w$-event privacy for private data stream estimation. We first design the Personalized Window Size Mechanism (PWSM), which supports personalized privacy requirements at each time slot. Based on PWSM, we propose Personalized Budget Distribution (PBD) and Personalized Budget Absorption (PBA) to estimate streaming statistics under $\boldsymbol{w}$-Event $\boldsymbol{\mathcal{E}}$-Personalized Differential Privacy ($(\boldsymbol{w}, \boldsymbol{\mathcal{E}})$-EPDP). PBD guarantees that the budget reserved for the next time step is no smaller than the budget consumed in the previous release, while PBA improves the current budget by absorbing unused budgets from the previous $k$ time slots and borrowing from the next $k$ time slots. We further develop Dynamic Personalized Budget Distribution (DPBD) and Dynamic Personalized Budget Absorption (DPBA), which allow users to dynamically adjust privacy requirements while satisfying $(\tau, \boldsymbol{w}_B, \boldsymbol{w}_F)$-Event $(\boldsymbol{\mathcal{E}}_B, \boldsymbol{\mathcal{E}}_F)$-Personalized Differential Privacy. We prove that all proposed methods achieve the corresponding personalized differential privacy guarantees and derive their error upper bounds. Experiments show that our methods reduce estimation error by at least 53.6% compared with state-of-the-art algorithms.
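为帮助理解摘要中 PBA“吸收历史闲置预算、借用未来预算”的文字描述,下面给出一个高度简化的演算草图;借用比例与窗口处理方式均为本文假设,既不对应论文算法细节,也不构成任何隐私保证。

```python
def absorbed_budget(base, used, t, k, borrow_ratio=0.5):
    """base: 各时刻基础预算;used: 各时刻已消耗预算;t: 当前时刻;k: 回看/前瞻的时隙数。"""
    # 吸收前 k 个时隙未用完的预算
    leftover = sum(max(base[i] - used[i], 0.0) for i in range(max(0, t - k), t))
    # 向未来 k 个时隙“借用”一部分预算(简化:按固定比例)
    borrowed = borrow_ratio * sum(base[t + 1: t + 1 + k])
    return base[t] + leftover + borrowed

base = [0.1] * 10
used = [0.05, 0.10, 0.00, 0.10, 0.02, 0, 0, 0, 0, 0]
print(round(absorbed_budget(base, used, t=4, k=2), 3))
```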

[IR-25] UxSID: Semantic-Aware User Interests Modeling for Ultra-Long Sequence

【速读】:该论文旨在解决超长用户序列建模中效率与效果之间的权衡难题,传统方法依赖于物品特定的搜索或物品无关的压缩策略,难以兼顾计算效率与语义表达能力。其解决方案的关键在于提出UxSID框架,通过引入语义ID(Semantic IDs, SIDs)和双层注意力机制,构建语义分组共享兴趣记忆(semantic-group shared interest memory),从而在不增加物品特异性模型负担的前提下,捕捉目标感知的偏好,实现端到端架构下计算简洁性与语义感知性的平衡,最终在大规模广告A/B测试中实现了0.337%的收入提升。

链接: https://arxiv.org/abs/2605.09040
作者: Hongwei Zhang,Qiqiang Zhong,Jiangxia Cao,Yiyang Lv,Huanjie Wang,Liwei Guan,Jing Yao,Yiyu Wang,Junfeng Shu,Zhaojie Liu,Han Li
机构: Kuaishou Technology(快手科技)
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Work in progress

点击查看摘要

Abstract:Modeling ultra-long user sequences involves a difficult trade-off between efficiency and effectiveness. While current paradigms rely on either item-specific search or item-agnostic compression, we propose UxSID, a framework exploring a third path: semantic-group shared interest memory. By utilizing Semantic IDs (SIDs) and a dual-level attention strategy, UxSID captures target-aware preferences without the heavy cost of item-specific models. This end-to-end architecture balances computational parsimony with semantic awareness, achieving state-of-the-art performance and a 0.337% revenue lift in large-scale advertising A/B test.

[IR-26] UserGPT Technical Report

【速读】:该论文旨在解决大规模数字痕迹下个性化用户理解的挑战,传统用户画像方法依赖判别模型与人工特征工程预测离散属性,常导致碎片化且逻辑不一致的画像,并在长尾行为上泛化能力差。其核心解决方案是提出UserGPT框架,通过属性生成与摘要生成双路径提升基于大语言模型(Large Language Models, LLMs)的个性理解能力;关键创新包括:构建用户行为模拟引擎以生成真实复杂的行为轨迹缓解数据稀缺问题、设计数据驱动的语义化模块将异构日志转化为结构化输入以降低噪声和稀疏性、以及采用课程驱动的后训练策略结合多阶段监督微调(Supervised Fine-Tuning, SFT)与双过滤组相对策略优化(Dual-Filter Group Relative Policy Optimization, DF-GRPO),从而增强对长期行为历史的推理能力。实验表明,UserGPT在自建基准HPR-Bench上实现标签预测Avg@10为0.7325、摘要生成准确率Acc_Ex为0.7528,同时压缩行为记录达97.9%仍保留关键信息,验证了其在整体人格推理与个性化用户代理交互中的有效性。

链接: https://arxiv.org/abs/2605.08766
作者: Yunyi Xuan,Hao Yi,Fengling Mao,Daye Cai,Leikun Liang,Xingsheng He,Jiangnan Xie,Guoshuai Wang,Yushan Han,Wenwen Guo,Xiaoxiao Xu,Lin Qu
机构: Alibaba Group(阿里巴巴集团)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Personalized user understanding from large-scale digital traces remains a fundamental challenge. Traditional user profiling methods rely on discriminative models and manual feature engineering to predict discrete attributes, often producing fragmented and logically inconsistent profiles that generalize poorly to long-tail behaviors. In this work, we study a generative paradigm in which large language models (LLMs) summarize long and noisy behavioral histories into coherent narratives that capture nuanced user evolution. Our experiments show that even strong LLMs remain limited in complex and implicit personalization reasoning. We propose UserGPT, a framework for improving LLM-based persona understanding through both attribute generation and summary generation. To address the scarcity of real-world behavioral data, we develop a User Behavior Simulation Engine that produces realistic and complex user trajectories. We further introduce a Data-Centric Semantization module that transforms heterogeneous behavioral logs into structured and semantically coherent inputs, reducing noise and sparsity. On top of this pipeline, we design a curriculum-driven post-training strategy that combines multi-stage Supervised Fine-Tuning (SFT) with Dual-Filter Group Relative Policy Optimization (DF-GRPO) to strengthen reasoning over long behavioral histories. We also construct HPR-Bench, a benchmark for holistic persona reasoning derived from simulated data. On HPR-Bench, UserGPT achieves an Avg@10 score of 0.7325 on tag prediction and an Acc_Ex score of 0.7528 on summary generation, while compressing behavioral records by up to 97.9% with critical information preserved. These results demonstrate the effectiveness of UserGPT for holistic persona reasoning and personalized user-agent interaction.

[IR-27] Human-Inspired Memory Architecture for LLM Agents

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)代理在长时间交互中缺乏稳定持久记忆管理机制的问题。其核心挑战在于如何避免无序记忆积累导致的冗余、干扰与遗忘,从而保障长期记忆的有效性与可访问性。解决方案的关键在于提出一种基于生物学启发的记忆架构,包含六种认知机制:睡眠阶段巩固(sleep-phase consolidation)、基于干扰的遗忘(interference-based forgetting)、痕迹成熟(engram maturation)、检索时再巩固(reconsolidation upon retrieval)、实体知识图谱(entity knowledge graphs)以及混合多线索检索(hybrid multi-cue retrieval)。这些机制协同作用,分别应对记忆冗余、干扰污染、遗忘失真等典型失败模式,并通过合成校准方法自动确定参数阈值,避免基准数据泄露,实现了高精度存储效率与可调性能-存储权衡曲线。

链接: https://arxiv.org/abs/2605.08538
作者: Doga Kerestecioglu,Alexei Robsky,Clemens Vasters,Anshul Sharma,Yitzhak Kesselman
机构: Microsoft (微软)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 10 pages, 4 tables. Preprint; comments welcome

点击查看摘要

Abstract:Current LLM agents lack principled mechanisms for managing persistent memory across long interaction horizons. We present a biologically-grounded memory architecture comprising six cognitive mechanisms: (1) sleep-phase consolidation, (2) interference-based forgetting, (3) engram maturation, (4) reconsolidation upon retrieval, (5) entity knowledge graphs, and (6) hybrid multi-cue retrieval. Each mechanism addresses a specific failure mode of naive memory accumulation. We introduce a synthetic calibration methodology that derives all pipeline thresholds without benchmark data exposure, eliminating a common source of evaluation leakage. We evaluate on two benchmarks. First, a VSCode issue-tracking dataset (13K issues, 120K events) where deduplication-based consolidation achieves 97.2% retention precision with 58% store reduction (+21.8 pp over baseline). Second, the LongMemEval personal-chat benchmark where we conduct the first streaming M-tier evaluation (475 sessions, ~540K unique turns). At a 200K-token context budget, our pipeline matches raw retrieval accuracy (70.1% vs. 71.2%, overlapping 95% CI) while exposing a tunable accuracy/store-size operating curve. At S-tier scale (50 sessions), dedup-based consolidation yields a +13.3 pp improvement in preference recall.
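下面用一个极简草图示意“去重式巩固”的思路:嵌入相似度超过阈值的记忆条目被合并,只保留一条代表并记录来源,从而压缩存储;阈值、相似度函数与数据结构均为本文假设,并非论文各认知机制的实际实现。

```python
import numpy as np

def consolidate(memories, embs, threshold=0.9):
    """memories: 记忆文本列表;embs: 对应向量 (N, d)。返回 (保留的文本, 各条的来源索引)。"""
    kept, kept_vecs, sources = [], [], []
    for i, (text, vec) in enumerate(zip(memories, embs)):
        vec = vec / (np.linalg.norm(vec) + 1e-8)
        for j, kv in enumerate(kept_vecs):
            if float(vec @ kv) >= threshold:       # 近重复:并入已有条目
                sources[j].append(i)
                break
        else:                                      # 未命中:作为新条目保留
            kept.append(text)
            kept_vecs.append(vec)
            sources.append([i])
    return kept, sources

rng = np.random.default_rng(0)
embs = rng.normal(size=(5, 16))
embs[3] = embs[0] + 0.01 * rng.normal(size=16)     # 人为构造一条近重复记忆
texts = [f"note-{i}" for i in range(5)]
print(consolidate(texts, embs))
```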

[IR-28] Multi-Level Graph Attention Network Contrastive Learning for Knowledge-Aware Recommendation

【速读】:该论文旨在解决推荐系统中因标签稀疏、图结构学习不足以及知识图谱中噪声实体导致的推荐准确率下降问题。其解决方案的关键在于提出一种多视角图对比学习框架,通过多视角知识图谱蒸馏增强用户表示,从而更精准地建模用户对实体与关系的偏好;同时设计了一个多层次自监督对比学习模块,在跨层级(Inter-Level)、层内(Intra-Level)及交互层级(Interaction-Level)三个维度进行对比学习,提升模型在同类样本间的泛化能力并增强不同类别样本间的区分度,实现更有效的多维特征建模。

链接: https://arxiv.org/abs/2605.08499
作者: Zhifei Hu,Feng Xia
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In recent years, the use of edge information provided by knowledge graphs together with the advantages of higher-order connectivity in graph neural networks for recommendation systems has become an important research direction. However, existing approaches are often limited by sparse labels, insufficient graph structure learning, and noisy entities in the knowledge graph, which reduce recommendation accuracy. To address these limitations, we propose a multi-view graph contrastive learning framework. The proposed method enhances user representations through multi-view knowledge graph distillation, enabling more accurate modeling of user preferences over entities and relations. The network aggregates neighborhood entity information to construct informative item representations. Furthermore, we design a multi-level self-supervised contrastive learning module that performs comparisons across three perspectives: Inter-Level, Intra-Level, and Interaction-Level. This design improves the model’s ability to generalize across intra-class samples while increasing discrimination between inter-class samples, thereby enabling more effective multi-dimensional feature modeling. We conduct extensive experiments on three public datasets using both baseline and ablation settings. Experimental results demonstrate that the proposed framework consistently outperforms existing state-of-the-art methods. Ablation studies further verify the effectiveness of each module in the proposed model.
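论文摘要未给出对比损失的具体形式;下面以常见的 InfoNCE 形式给出一个单层次对比学习的 PyTorch 草图,仅示意“正样本拉近、负样本推远”的计算结构,Inter-Level / Intra-Level / Interaction-Level 三个层次的具体构造此处不做还原。

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.2):
    """z1, z2: 同一批样本在两个视图下的表示 (B, d);第 i 行互为正样本对。"""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau           # (B, B) 相似度矩阵
    labels = torch.arange(z1.size(0))    # 对角线为正例
    return F.cross_entropy(logits, labels)

z_user_view1, z_user_view2 = torch.randn(16, 64), torch.randn(16, 64)
print(float(info_nce(z_user_view1, z_user_view2)))
```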

[IR-29] From Historical Tabular Image to Knowledge Graphs: A Provenance-Aware Modular Pipeline

【速读】:该论文旨在解决将手写档案表格(handwritten archival tables)转化为结构化知识表示(如知识图谱,Knowledge Graph)过程中存在的复杂多模态处理难题,尤其是现有端到端AI方法因缺乏透明度而导致人类难以监督、评估和信任算法决策的问题。解决方案的关键在于提出一个模块化且具备数据溯源能力(provenance-aware)的处理流程:将整个工作流分解为三个阶段——表格重构(table reconstruction)、信息提取(information extraction)和知识图谱构建(KG construction),并在每个阶段保留中间表示以供人工检查、评价与修正;同时系统性地在每一步集成数据溯源机制,确保所有抽取的实体和属性值均可追溯至其原始视觉或文本来源,从而实现人机协同可控的图像到知识图谱转换管道。

链接: https://arxiv.org/abs/2605.08222
作者: Sarah Binta Alam Shoilee,Victor de Boer,Jacco van Ossenbruggen,Susan Legêne
机构: Vrije Universiteit Amsterdam (自由大学阿姆斯特丹)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Shorter version of this paper has been accepted in the 5th International Conference on Hybrid Human-Artificial Intelligence (HHAI 2026)

点击查看摘要

Abstract:Handwritten archival tables contain rich historical information, yet transforming them into structured representations, such as Knowledge Graphs, requires integrating table structure recognition, handwriting recognition, and semantic interpretation - a complex multimodal process. End-to-end AI implementations can obscure these steps, resulting in opaque algorithmic operations that hinder human oversight, critical assessment, and trust. To address this, we present a modular, provenance-aware pipeline to convert handwritten tabular images into KGs supporting human-AI collaboration. The pipeline decomposes the workflow into three stages - table reconstruction, information extraction, and KG construction - while exposing intermediate representations for inspection, evaluation, and correction. A key contribution of our approach is the systematic integration of data provenance at every stage, ensuring that all extracted entities and literals remain traceable to their visual and textual origins. The proposed pipeline is demonstrated through a number of experiments on real-world archival material concerning military careers. The results across three different table reconstruction variants highlight the importance of modularisation. By coupling modularity with data provenance, our work advances transparent and collaboratively controllable image-to-KG pipelines for complex historical data.

[IR-30] Retrieval Mechanisms Surpass Long-Context Scaling in Time Series Forecasting

【速读】:该论文旨在解决时间序列基础模型(Time Series Foundation Models, TSFMs)中“更长的历史上下文必然提升预测性能”这一假设是否成立的问题。研究发现,在随机性较强的领域,远期历史往往只是高频噪声而非有效信号,导致在ETTh1基准测试中,随着上下文长度增加,预测误差反而上升,呈现出明显的反向缩放规律(inverse scaling law),表明注意力机制难以有效忽略无关的历史波动。解决方案的关键在于引入检索增强预测(Retrieval-Augmented Forecasting, RAFT),其通过固定窗口(720步)结合选择性检索机制,仅注入最相关的历史片段作为动态外生变量,从而为模型提供基于上下文的归纳偏置(inductive bias),在显著降低计算成本的同时优于长上下文配置和零样本基础模型(如Chronos、Moirai)。因此,未来基础模型应从全量上下文架构转向以选择性检索为核心的结构设计。

链接: https://arxiv.org/abs/2605.08217
作者: Rishi Ahuja,Kumar Prateek,Simranjit Singh,Vijay Kumar
机构: Dr. B.R. Ambedkar National Institute of Technology Jalandhar (印度理工学院拉贾尔哈恩分校)
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Time Series Foundation Models (TSFMs) have borrowed the long context paradigm from natural language processing under the premise that feeding more history into the model improves forecast quality. But in stochastic domains, distant history is often just high-frequency noise, not signal. Hence, the proposed work tests whether this premise actually holds by running continuous context architectures (PatchTST included) through the ETTh1 benchmark. The obtained results contradict the premise: an inverse scaling law shows up clearly, with forecasting error rising as context gets longer. A 3,000-step window causes performance to drop by over 68%, evidence that attention mechanisms are poor at ignoring irrelevant historical volatility. Retrieval-Augmented Forecasting (RAFT) is evaluated as an alternative. RAFT achieves a mean squared error (MSE) of 0.379 with a fixed 720-step window and selective retrieval, outperforming both long-context configurations and zero-shot foundation models (Chronos, Moirai) despite requiring far less computation. In addition, the retrieval step injects only the most relevant historical segments as dynamic exogenous variables, which gives the model a context-informed inductive bias it cannot build on its own from raw sequences. Therefore, foundation models going forward need to shift architecturally toward selective retrieval.
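下面的 Python 草图按摘要的文字描述示意检索增强预测的骨架:在固定的 720 步上下文之外,从更早的历史中检索最相似的片段,并把它们的后续走势作为动态外生变量提供给预测模型;相似度度量、片段切分与拼接方式均为本文假设,并非 RAFT 的原始实现。

```python
import numpy as np

def retrieve_exogenous(history, context_len=720, horizon=96, k=3):
    """从窗口之前的历史中检索与当前上下文最相似的 k 个片段,返回其后续走势作为外生变量。"""
    context = history[-context_len:]
    candidates, futures = [], []
    for s in range(0, len(history) - context_len - horizon, horizon):
        candidates.append(history[s: s + context_len])
        futures.append(history[s + context_len: s + context_len + horizon])
    dists = [np.linalg.norm(seg - context) for seg in candidates]
    top = np.argsort(dists)[:k]
    exog = np.stack([futures[i] for i in top])     # (k, horizon),拼接进预测模型输入
    return context, exog

history = np.sin(np.linspace(0, 200, 5000)) + 0.1 * np.random.randn(5000)
ctx, exog = retrieve_exogenous(history)
print(ctx.shape, exog.shape)
```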

[IR-31] Information Density as a Quantitative Measure for AI-enabled Virtual Sensing: Feasibility and Limits

【速读】:该论文旨在解决物联网(IoT)与传感器网络在数据存储、传输及实时处理方面面临的挑战,尤其是传统压缩方法(如压缩感知和基于机器学习的压缩)存在的计算效率低下和不可逆数据丢失问题。其解决方案的关键在于引入“信息密度”(Information Density)作为量化指标,通过挖掘传感器信号之间的空间、时间及跨模态相关性,构建一种基于人工智能(AI)的虚拟传感框架,从而在无物理传感器的情况下完成传感任务。该框架利用特征空间相位(Phase in Eigen Space)和互信息(Mutual Information)两个互补度量来评估信息密度,实现跨模态与模态内最优传感器配置选择,并在马德里智慧城市真实数据上验证了其有效性——例如,在限定误差范围内(如平均误差仅3.21%),可用单个物理传感器替代多个传感器部署,显著提升系统的可扩展性和能效。

链接: https://arxiv.org/abs/2605.08180
作者: Hrishikesh Dutta,Roberto Minerva,Reza Farahbakhsh,Noel Crespi
机构: Telecom SudParis (电信巴黎高等矿业学院); Institut Polytechnique de Paris (巴黎综合理工学院); France (法国)
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
备注: IEEE Transactions on Sustainable Computing (2026)

点击查看摘要

Abstract:Modern IoT and sensor networks generate vast amounts of data, posing significant challenges for storage, transmission, and real-time processing. Traditional approaches, such as compressive sensing and machine learning-based compression, often suffer from computational inefficiencies and irreversible data loss. This paper introduces Information Density as a quantitative metric to support sensor deployment and enable AI-driven virtual sensing. We propose a framework that leverages spatial, temporal and inter-modal correlations among sensor signals to perform sensing tasks even in the absence of physical sensors. Two complementary measures: (i) Phase in Eigen Space and (ii) Mutual Information, are developed to quantify and assess information density, enabling the selection of optimal sensor configurations across both intra-modality and cross-modality scenarios. Validated using real-world data from Madrid’s smart city infrastructure, this framework demonstrates the feasibility of replacing physical sensors with virtual ones under bounded error conditions (e.g., achieving 3.21% mean error with a single sensor). The results highlight the potential for scalable and energy-efficient sensing systems in smart environments.
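作为摘要中互信息度量的一个直观补充,下面用直方图离散化粗略估计两路传感器信号间的互信息,据此挑选最能预测目标(虚拟)传感器的物理传感器;分箱数与估计方式均为本文假设,与论文的具体实现无关。

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """基于二维直方图的互信息粗估计(单位:nat)。"""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1, keepdims=True), pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

rng = np.random.default_rng(0)
target = rng.normal(size=2000)                             # 想要“虚拟化”的目标信号
sensors = {"s1": target + 0.2 * rng.normal(size=2000),     # 与目标强相关的物理传感器
           "s2": rng.normal(size=2000)}                    # 与目标无关的物理传感器
print({k: round(mutual_information(v, target), 3) for k, v in sensors.items()})
```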

人机交互

[HC-0] Evaluating the False Trust engendered by LLM Explanations

【速读】:该论文旨在解决生成式 AI(Generative AI)在关键任务中缺乏可信度保障的问题,特别是用户如何基于模型提供的解释(如推理轨迹、摘要和事后解释)准确判断其输出的正确性,以及这些解释是否会导致虚假信任。解决方案的关键在于设计并验证一种用户中心的评估协议,通过对照实验比较不同类型的解释对用户决策的影响;研究发现,传统解释(如推理轨迹和事后解释)虽具说服力但缺乏信息价值,反而会误导用户接受错误答案,而“对比双解释”(即同时呈现支持与反对AI答案的论据)是唯一能显著提升用户识别正确与错误输出能力的方法。

链接: https://arxiv.org/abs/2605.10930
作者: Vardhan Palod,Upasana Biswas,Subbarao Kambhampati
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) and Large Reasoning Models (LRMs) are increasingly used for critical tasks, yet they provide no guarantees about the correctness of their solutions. Users must decide whether to trust the model’s answer, aided by reasoning traces, their summaries, or post-hoc generated explanations. These reasoning traces, despite evidence that they are neither faithful representations of the model’s computations nor necessarily semantically meaningful, are often interpreted as provenance explanations. It is unclear whether explanations or reasoning traces help users identify when the AI is incorrect, or whether they simply persuade users to trust the AI regardless. In this paper, we take a user-centered approach and develop an evaluation protocol to study how different explanation types affect users’ ability to judge the correctness of AI-generated answers and engender false trust in the users. We conduct a between-subject user study, simulating a setting where users do not have the means to verify the solution and analyze the false trust engendered by commonly used LLM explanations - reasoning traces, their summaries and post-hoc explanations. We also test a contrastive dual explanation setting where we present arguments for and against the AI’s answer. We find that reasoning traces and post-hoc explanations are persuasive but not informative: they increase user acceptance of LLM predictions regardless of their correctness. In contrast, dual explanation is the only condition that genuinely improves users’ ability to distinguish correct from incorrect AI outputs.

[HC-1] How Creatives Approach GenAI Image Generation: Tensions Between Structured Guidance Self-Experimentation and Creative Autonomy

【速读】:该论文旨在解决生成式 AI (Generative AI) 工具在创意实践中日益普及背景下,创作者如何学习复杂软件以及如何获得有效支持的长期人机交互(HCI)问题。研究发现,尽管创作者主要通过自我实验或教程来探索 GenAI 图像工具,但普遍存在对 AI 术语困惑的问题;进一步通过设计研究探针和用户研究发现,即使结构化指导有助于提升 AI 理解力,许多创作者仍更偏好自主实验,因其认为指导可能限制创造力。解决方案的关键在于平衡结构性引导与促进 AI 素养之间的关系,同时尊重并保留创作者的自由表达空间,从而在提升技术理解力的同时不损害创造性过程的核心价值。

链接: https://arxiv.org/abs/2605.10898
作者: Haidan Liu,Isabelle Kwan,Taiga Okuma,Jeffrey Loverock,Nicholas Vincent,Parmit K Chilana
机构: Simon Fraser University (西蒙菲莎大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted at ACM Creativity Cognition 2026

点击查看摘要

Abstract:As generative AI tools increasingly influence creative practice, they raise longstanding HCI questions about how creatives learn complex software and how they can be better supported. We conducted an interview study with artists and hobbyists (n=8) and a follow-up survey (n=159) to understand how this population approaches and seeks guidance for GenAI image tools. We found that creatives commonly use either self-experimentation or tutorials to explore GenAI tools, yet many struggle with confusing AI terminology. To gain further insight into creatives’ learning experiences, we developed a research probe to elicit creatives’ perceptions of structured guidance. Our user study with 17 creatives revealed that, even when creatives described the guidance as helpful for understanding AI, many still preferred self-experimentation, feeling that guidance could limit their creativity. Our findings highlight a central tension in supporting AI literacy for creatives: balancing guidance and promoting literacy while preserving creative freedom.

[HC-2] StartFlow: From Method Conception to Multi-Perspective Evaluation in UX Prototyping for Software Startups

【速读】:该论文旨在解决软件初创企业在资源有限且缺乏用户体验(User Experience, UX)专业技能的情况下,难以高效构建最小可行产品(Minimum Viable Product, MVP)原型的问题。解决方案的关键在于提出一种名为StartFlow的结构化方法,其核心是采用“线框流”(wireflow)技术——即结合线框图(wireframe)与用户流程(user flow)的设计方式,通过三个步骤实现:(i) 功能组织、(ii) 线框流构建、以及 (iii) 基于可用性启发式原则的验证与迭代优化。该方法显著提升了非专业人员所创建原型的清晰度、对用户故事和业务规则的遵循程度,并减少了可用性缺陷,从而支持早期阶段的产品开发更加以用户为中心。

链接: https://arxiv.org/abs/2605.10824
作者: Guilherme Corredato Guerino,João Pedro de Souza Olivo Tardivo,Renato Balancieri,Gislaine Camila Lapasini Leal
机构: 未知
类目: Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注: Paper accepted for publication in Information and Software Technology

点击查看摘要

Abstract:Context. Software startups face significant challenges in building minimum viable products, particularly in the early stages, when resources are limited and expertise in user experience is scarce. Objective. Introduce StartFlow, a structured method that helps non-specialized professionals create MVP prototypes using the wireflow technique, a combination of wireframes and user flows. StartFlow consists of three steps: (i) organizing features; (ii) building wireflows; and (iii) verifying and refining them based on usability heuristics. Method. To assess the method Startflow, we first conducted a focus group with researchers in Software Engineering, Human-Computer Interaction, and Software Startups. Afterward, we conducted a proof-of-concept study, which consisted of an experiment and a heuristic evaluation with experts. Results. The qualitative analysis of the focus group revealed that participants found the method straightforward, flexible, and helpful in structuring user flows and identifying visual components. However, they also pointed out the need to improve its presentation, clarify its iterative nature, and strengthen its connection to broader UX principles. The results of the proof-of-concept indicate that participants who used StartFlow created clearer prototypes, adhered to the proposed user stories and business rules, and presented fewer usability defects. Furthermore, the method was well evaluated for its ease of use and intended future adoption. Conclusion. The study reinforces the potential of StartFlow as an accessible tool to support user-centered development in software startups from the earliest stages of their product development.

[HC-3] New AI-Driven Tools for Enhancing Campus Well-being: A Prevention and Intervention Approach

【速读】:该论文旨在解决高校校园幸福感监测不足与心理健康风险识别能力薄弱的问题,其核心挑战在于缺乏有效的反馈收集机制和精准的心理健康筛查方法。解决方案的关键在于构建一个统一的集成框架,通过预防与干预双路径实现:在预防层面,开发了基于大语言模型(LLM)的个性化问卷聊天机器人TigerGPT,并引入AURA强化学习框架动态优化提问策略(如验证、细化、反思、探查),利用LSDE质量信号(长度、自我披露、情绪、具体性)提升对话深度与用户参与度;在干预层面,提出基于表达性叙事故事(ENS)的心理健康筛查方法,结合BERT模型捕捉非关键词语义特征,并设计PsychoGPT系统进行初步应激分类、症状评分及外部一致性校验,同时采用堆叠多模型推理(SMMR)减少幻觉并提升评估可解释性。最终,该框架实现了从自适应调查数据到专业心理检测模型的闭环流动,显著增强心理健康管理的主动性和准确性。

链接: https://arxiv.org/abs/2605.10804
作者: Jinwen Tang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: PhD Dissertation, University of Missouri, May 2026

点击查看摘要

Abstract:Campus well-being underpins academic success, yet many universities lack effective methods for monitoring satisfaction and detecting mental health risks. This dissertation addresses these gaps through prevention (improving feedback collection) and intervention (advancing mental health detection), unified under an integrated framework. For prevention, we developed TigerGPT, a personalized survey chatbot leveraging LLMs to engage users in context-aware conversations grounded in conversational design and engagement theory, achieving 75% usability and 81% satisfaction. To address its limitations in repetitiveness and response depth, we introduced AURA, a reinforcement-learning framework that adapts follow-up question types (validate, specify, reflect, probe) within a session using an LSDE quality signal (Length, Self-disclosure, Emotion, Specificity), initialized from 96 prior conversations. AURA achieved +0.12 mean quality gain (p=0.044, d=0.66), with 63% fewer specification prompts and 10x more validation behavior. For intervention, we examine Expressive Narrative Stories (ENS) for mental health screening, showing BERT(128) captures nuanced linguistic features without keyword cues, while conventional classifiers depend heavily on explicit mental health terms. We then developed PsychoGPT, an LLM built on DSM-5 and PHQ-8 guidelines that performs initial distress classification, symptom-level scoring, and reconciliation with external ratings for explainable assessment. To reduce hallucinations, we proposed Stacked Multi-Model Reasoning (SMMR), layering expert models where early layers handle localized subtasks and later layers reconcile findings, outperforming single-model solutions on DAIC-WOZ in accuracy, F1, and PHQ-8 scoring. Finally, a cohesive framework unifies these tools, enabling adaptive survey insights to flow directly into specialized mental health detection models.
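摘要提到用 LSDE(Length、Self-disclosure、Emotion、Specificity)质量信号驱动追问策略的选择;下面给出一个玩具化的打分草图,词表、归一化方式与权重均为本文假设,仅用于说明该信号的组成结构,并非论文的实际实现。

```python
def lsde_score(text, weights=(0.25, 0.25, 0.25, 0.25)):
    """按长度、自我披露、情绪、具体性四项打分后加权求和(假设的简化计算)。"""
    tokens = text.lower().split()
    n = max(len(tokens), 1)
    length = min(len(tokens) / 50.0, 1.0)                                        # 长度(截断归一化)
    disclosure = sum(t in {"i", "my", "me", "myself"} for t in tokens) / n       # 自我披露
    emotion = sum(t in {"happy", "sad", "stressed", "anxious", "excited"} for t in tokens) / n  # 情绪词
    specificity = sum(any(c.isdigit() for c in t) for t in tokens) / n           # 具体性(含数字等细节)
    w_l, w_s, w_e, w_sp = weights
    return w_l * length + w_s * disclosure + w_e * emotion + w_sp * specificity

print(round(lsde_score("I felt stressed before my 3 exams last week"), 3))
```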

[HC-4] When Should Teachers Control AI Generation for Mathematics Visuals?

【速读】:该论文旨在解决生成式 AI(Generative AI)在教育领域中用于创建数学教学视觉材料时,如何合理设计人类控制时机以保障内容正确性的问题。其核心挑战在于:当前工具主要依赖提示(prompting)和事后编辑,缺乏对教师在生成流程中干预节点的系统性支持。解决方案的关键在于提出一个三阶段控制设计空间——预生成控制、中段控制与后生成控制,并通过实证研究发现,在正确性敏感的教学场景下,后生成控制(post-generation control)能显著提升教师对结果的可预测性和正确性感知,因其支持低成本、直接的对象级修改,从而在自动化与用户自主性之间实现最优平衡。

链接: https://arxiv.org/abs/2605.10672
作者: Zhengxu Li,Junling Wang,April Yi Wang
机构: ETH Zurich (苏黎世联邦理工学院)
类目: Human-Computer Interaction (cs.HC)
备注: Zhengxu Li and Junling Wang contributed equally to this work. Accepted at ACM Learning@Scale 2026

点击查看摘要

Abstract:Generative AI has the potential to help teachers rapidly create classroom-ready visual materials, particularly in mathematics where diagrams and visual representations must be pedagogically meaningful and instructionally correct. However, current generative tools primarily support prompting and post-hoc editing, leaving open a key question for correctness-sensitive educational authoring: when in the generation pipeline should teachers exert control? In this paper, we investigate how the timing of human control in AI-assisted generation shapes teachers’ visual authoring practices in correctness-sensitive tasks. We introduce a design space of three stages of control: pre-generation control, where users specify intent solely through natural language prompts before generation; mid-generation control, where users inspect and confirm an explicit layout structure before the system completes generation; and post-generation control, where users directly modify AI-generated visuals after generation through object-level edits. In a within-subject, mixed-methods study with 24 primary mathematics teachers, post-generation control received higher ratings on predictability and correctness, while other subjective measures showed no reliable differences. Qualitative findings explain these differences by revealing workflow trade-offs: highly automated, pre-generation control supports rapid ideation but reduces perceived agency and predictability; mid-generation control improves structural alignment at the cost of additional effort; and post-generation control preserves user agency through low-cost, direct verification and correction. Together, these results suggest that in correctness-sensitive educational tasks, effective generative tools should align system behavior with teacher intent and support stage-dependent workflows that combine automation with direct manipulation.

[HC-5] LLARS: Enabling Domain Expert-Developer Collaboration for LLM Prompting, Generation and Evaluation ECAI2026 IJCAI

【速读】:该论文旨在解决领域专家与开发者在构建大语言模型(Large Language Model, LLM)系统过程中协作困难、流程割裂的问题。其解决方案的关键在于提出一个名为LLARS(LLM Assisted Research System)的开源平台,该平台通过端到端集成三个核心模块实现高效协同:1)协作式提示工程(Collaborative Prompt Engineering),支持实时共同创作提示词并具备版本控制与即时LLM测试能力;2)批量生成(Batch Generation),允许用户在选定提示词×模型×数据组合下配置输出生产并实现成本控制;3)混合评估(Hybrid Evaluation),由人类与LLM评估者联合使用多样化方法对输出进行评估,并提供实时一致性指标和溯源分析以识别最优模型-提示组合。此架构显著提升了跨学科协作效率并统一了从开发到验证的全流程。

链接: https://arxiv.org/abs/2605.10593
作者: Philipp Steigerwald,Mara Stieler,Jennifer Burghardt,Eric Rudolph,Jens Albrecht
机构: Technische Hochschule Nürnberg Georg Simon Ohm (纽伦堡应用技术大学); Faculty of Computer Science, Centre for Artificial Intelligence (KIZ) (计算机科学学院,人工智能中心); Faculty of Social Sciences, Institute for E-Counselling (社会科学学院,电子辅导研究所)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注: Accepted at IJCAI-ECAI 2026 Demonstrations Track. Demo video: this https URL

点击查看摘要

Abstract:We demonstrate LLARS (LLM Assisted Research System), an open-source platform that bridges the gap between domain experts and developers for building LLM-based systems. It integrates three tightly connected modules into an end-to-end pipeline: Collaborative Prompt Engineering for real-time co-authoring with version control and instant LLM testing, Batch Generation for configurable output production across user-selected prompts × models × data with cost control, and Hybrid Evaluation where human and LLM evaluators jointly assess outputs through diverse assessment methods, with live agreement metrics and provenance analysis to identify the best model-prompt combination for a given use case. New prompts and models are automatically available for batch generation and completed batches can be turned into evaluation scenarios with a single click. Interviews with six domain experts and three developers in online counselling confirmed that LLARS feels intuitive, saves considerable time by keeping everything in one place and makes interdisciplinary collaboration seamless.

[HC-6] A Resilient Solution for Sewer Overflow Monitoring across Cloud and Edge ECAI2026 IJCAI

【速读】:该论文旨在解决历史城市中老化合流制排水系统在极端降雨事件下易引发合流制溢流(Combined Sewer Overflow, CSO)的问题,此类溢流对环境和公共健康具有显著影响。为应对这一挑战,研究提出了一种基于深度学习(Deep Learning)的预测方法,该方法可在云端和边缘计算环境中部署,并集成至一个交互式溢流监测仪表板中,实现对溢流池填充动态的精准预测,从而提前预警容量超限并支持及时的预防性措施。解决方案的关键在于将深度学习模型与分布式计算架构相结合,构建了一个具备网络中断容错能力的实时监测系统。

链接: https://arxiv.org/abs/2605.10592
作者: Vipin Singh,Tianheng Ling,Peter Ghaly,Felix Grimmeisen,Gregor Schiele,Felix Biessmann
机构: Berlin University of Applied Sciences (柏林应用科学大学); University of Duisburg-Essen (杜伊斯堡-埃森大学); Okeanos Smart Data Solutions GmbH (Okeanos智能数据解决方案有限公司); Einstein Center Digital Future (爱因斯坦数字未来中心)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 3 pages, 6 figures, accepted at 35th International Joint Conference on Artificial Intelligence 2026 (IJCAI-ECAI 2026), Demonstrations Track. URL: this https URL

点击查看摘要

Abstract:Aging combined sewer systems in many historical cities are increasingly stressed by extreme rainfall events, which can trigger combined sewer overflows (CSO) with significant environmental and public health impacts. Forecasting the filling dynamics of overflow basins is critical for anticipating capacity exceedance and enabling timely preventive actions for CSO. We present a web-based demonstrator (this https URL) that integrates Deep Learning forecasting methods in both cloud and edge settings into an interactive monitoring dashboard for overflow monitoring, resilient to network outages. A video showcase is available online (this https URL).

[HC-7] The Balance between Nuance and Clarity: Decluttering Tabular Sequential Graphs to Counter Money Laundering

【速读】:该论文旨在解决金融犯罪中洗钱活动的隐蔽性问题,特别是如何通过可视化技术提升对可疑资金流动路径的识别效率。其核心挑战在于传统网络可视化工具缺乏针对洗钱分析任务的结构化设计,难以提供清晰的资金流向概览和高效的人工分析体验。解决方案的关键在于提出一种面向洗钱分析的表格序列图(tabular sequential graph)可视化方法,该方法以触发警报的受害账户为起点,按交易顺序依次追踪多账户(节点)、多银行(行)间的资金流动(边),并通过三种分组策略——基于金额、基于时间以及两者结合的方法——有效减少图中的节点与边数量,从而在保持关键信息完整性的同时优化分析效率。

链接: https://arxiv.org/abs/2605.10522
作者: Salomé Esteves,Rita Costa,Louise Fallon,Pedro Bizarro
机构: Feedzai(Feedzai); Mastercard(Mastercard)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Money laundering is not only about moving illicit funds, but about hiding the money’s origin and traces to complicate detection. Financial criminals resort to many methods to avoid regulators and legal thresholds. But analysts investigating alerts, dedicated to pin mule accounts and track suspicious transactions daily, also have theirs. Network visualizations can be key in countering adversarial money laundering activities, especially if they provide a clear overview of the money flows and a seamless analysis experience, but they are often not structured for this type of task. That is why we propose a tabular sequential graph visualization tailored to money laundering analysis - following transactions (edges) from the victim account that triggered an alert through multiple accounts (nodes) and banks (rows). To reduce the number of nodes and edges, we propose three methods for grouping these tabular sequential graphs: an amount-based approach, a time-based approach, and a combined solution that considers both the transaction amount and its order. A user study with experts revealed that the most effective method in node reduction was not necessarily the most interesting for analysis and that there is a trade-off between manual work and time for interpretation in more granular graphs.
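为说明摘要中“按金额、按时间、或二者结合”对交易边做分组以减少节点与边数量的思路,下面给出一个最小化的分组草图;分桶粒度与分组键均为本文假设,仅示意“时间 + 金额”组合方案的结构。

```python
from collections import defaultdict

def group_transactions(txs, time_bucket=3600, amount_bucket=100):
    """txs: (src, dst, amount, ts) 列表;按 (源账户, 目标账户, 时间桶, 金额桶) 聚合边。"""
    groups = defaultdict(lambda: {"count": 0, "total": 0.0})
    for src, dst, amount, ts in txs:
        key = (src, dst, ts // time_bucket, int(amount // amount_bucket))
        groups[key]["count"] += 1
        groups[key]["total"] += amount
    return dict(groups)

txs = [("victim", "mule1", 950, 10), ("victim", "mule1", 980, 60),
       ("mule1", "mule2", 1900, 4000)]
print(group_transactions(txs))
```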

[HC-8] The Renaissance of Repair: A Timely Opportunity for Fabrication Research

【速读】:该论文试图解决当前个人制造(personal fabrication)研究领域中修复(repair)被忽视的问题,旨在推动以修复为中心的制造研究成为一项及时、相关且具有影响力的研究方向。其解决方案的关键在于将修复定义为一个五步过程:问题识别(issue identification)、方案探索(exploring solutions)、材料获取(acquiring materials)、执行修复(performing the repair)和测试(testing),并通过这一结构化框架为研究人员提供明确的研究路径,同时探讨每个步骤中的挑战与机遇,从而系统性地促进可持续制造实践的发展。

链接: https://arxiv.org/abs/2605.10450
作者: Julian Britten,Jan Henry Belz
机构: Ulm University (乌尔姆大学); Dr. Ing. h.c. F. Porsche AG (保时捷股份公司)
类目: Human-Computer Interaction (cs.HC)
备注: 6 pages, Opinion paper for the CHI’26 “From Papers to the Real World: Making Fabrication Research Matter” Workshop

点击查看摘要

Abstract:Through the rise of the right-to-repair movement, along with supporting legislation, we are currently witnessing an attitude shift in favor of repairing. This opens up various opportunities for personal fabrication research. Although the field has shifted more towards sustainable practices, repair is rarely the main focus. In this paper, we want to make the case for repair-centered fabrication research as a timely, relevant, impactful, and therefore meaningful topic. We describe potential avenues researchers could pursue by defining repair as a five-step process, including issue identification, exploring solutions, acquiring materials, performing the repair, and testing, and discuss challenges and opportunities for each step.

[HC-9] Positive Alignment: Artificial Intelligence for Human Flourishing

【速读】:该论文试图解决当前人工智能对齐(Alignment)研究过度聚焦于安全与风险防范(如防护机制、可控性和合规性),而忽视了积极促进人类与生态繁荣的系统性问题。这种局限导致诸如用户自主性丧失、真理追求失效、认知谦逊不足、错误修正能力弱以及观点多样性缺失等挑战难以根治。解决方案的关键在于提出“积极对齐”(Positive Alignment)这一新范式,即在确保安全与合作的前提下,主动支持多元、情境敏感且由用户主导的人类与生态繁荣;其核心路径包括:通过数据筛选与增强、预训练与后训练策略优化模型行为,并结合协作式价值收集、持续适应机制及多中心治理结构,以推动分歧共存与去中心化监督,从而实现更具韧性与伦理深度的AI系统设计。

链接: https://arxiv.org/abs/2605.10310
作者: Ruben Laukkonen,Seb Krier,Chloé Bakalar,Shamil Chandaria,Morten Kringelbach,Adam Elwood,Daniel Ford,Fernando Rosas,Maty Bohacek,Matija Franklin,Nenad Tomašev,Stephanie Chan,Verena Rieser,Roma Patel,Michael Levin,Arun Rao
机构: University of Oxford (牛津大学); Linacre College, University of Oxford (林肯学院,牛津大学); Google DeepMind (谷歌DeepMind); LIFE; OpenAI (OpenAI); Anthropic (Anthropic); University of California, Los Angeles (加州大学洛杉矶分校); Aily Labs (Aily实验室); Stanford University (斯坦福大学); Tufts University (塔夫茨大学); Positive AI Labs (正向人工智能实验室); Department of Informatics, University of Sussex (萨塞克斯大学信息系); Department of Brain Sciences, Imperial College London (伦敦帝国理工学院脑科学系)
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:Existing alignment research is dominated by concerns about safety and preventing harm: safeguards, controllability, and compliance. This paradigm of alignment parallels early psychology’s focus on mental illness: necessary but incomplete. What we call Positive Alignment is the development of AI systems that (i) actively support human and ecological flourishing in a pluralistic, polycentric, context-sensitive, and user-authored way while (ii) remaining safe and cooperative. It is a distinct and necessary agenda within AI alignment research. We argue that several existing failures of alignment (e.g., engagement hacking, loss of human autonomy, failures in truth-seeking, low epistemic humility, error correction, lack of diverse viewpoints, and being primarily reactive rather than proactive) may be better addressed through positive alignment, including cultivating virtues and maximizing human flourishing. We highlight a range of challenges, open questions, and technical directions (e.g., data filtering and upsampling, pre- and post-training, evaluations, collaborative value collection) for different phases of the LLM and agents lifecycle. We end with design principles for promoting disagreement and decentralization through contextual grounding, community customization, continual adaptation, and polycentric governance; that is, many legitimate centers of oversight rather than one institutional or moral chokepoint.

[HC-10] Mind Modeling: A ToM-Based Framework for Personalization

【速读】:该论文旨在解决传统用户建模方法在社会情境化和长期交互中面临的局限性,即其将行为视为建模的主要对象,而隐含地处理心理状态的归属,导致个性化支持缺乏可解释性和跨交互阶段的一致性。解决方案的关键在于提出“心智建模”(mind modeling)的新范式,该范式以显式且可修订的心理状态归因为基础,包括信念、意图、情绪和知识等,并基于理论心智(Theory of Mind, ToM)框架,将行为视为关于内部状态的假设证据,从而实现更可解释、连贯的个性化交互。论文进一步通过M3框架整合感知、心智化(mentalisation)与行动,支持在具身交互中持续更新心理状态假设,为心智建模提供了初步的操作化路径。

链接: https://arxiv.org/abs/2605.10306
作者: Cristina Gena
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:User modeling has traditionally relied on inferring preferences, traits, or intents from observable behaviour. While effective in many adaptive systems, this paradigm treats behaviour as the primary object of modeling and leaves mental-state attribution implicit. This assumption becomes limiting in socially situated and longitudinal interaction, where behaviour must be interpreted in context and over time. We introduce mind modeling, a perspective in which user modeling is grounded in the explicit and revisable attribution of mental states, including beliefs, intentions, emotions, and knowledge. Drawing on Theory of Mind (ToM), this approach treats behaviour as evidence for hypotheses about internal states, supporting personalization that is more interpretable and coherent across interaction episodes. We present M3, a conceptual framework that integrates perception, mentalisation, and action within a unified structure, enabling the continuous update of mental-state hypotheses in embodied interaction. We further illustrate this perspective through an embodied interaction trace, providing an initial operationalization of mind modeling in practice.

[HC-11] Useful for Exploration, Risky for Precision: Evaluating AI Tools in Academic Research

【速读】:该论文旨在解决当前人工智能(AI)工具在科研工作流中应用时存在的可验证性差、透明度不足及可靠性不确定的问题,尤其是现有基准测试方法未能充分涵盖用户体验、可解释性和与研究流程整合等以人为中心的评价维度。其解决方案的关键在于提出并应用一个融合以人为中心指标与计算机中心指标的综合评估框架,用于系统性地评测基于AI的问答(Q&A)和文献综述工具在科研场景下的表现。研究表明,尽管AI工具能在研究初期提升效率并提供大致准确的摘要,但在精确信息提取和可解释性方面存在显著缺陷,尤其强调了增强可解释性(Explainable AI, xAI)功能对提高输出透明度、验证效率以及实现AI工具与研究人员工作流安全集成的重要性。

链接: https://arxiv.org/abs/2605.10125
作者: Anthea Dathe,Kiran Hoffmann,Aline Mangold
机构: Dresden University of Technology (德累斯顿工业大学)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Artificial intelligence (AI) tools are being incorporated into scientific research workflows with the potential to enhance efficiency in tasks such as document analysis, question answering (Q&A), and literature search. However, system outputs are often difficult to verify, lack transparency in their generation and remain prone to errors. Suitable benchmarks are needed to document and evaluate arising issues. Nevertheless, existing benchmarking approaches do not adequately capture human-centered criteria such as usability, interpretability, and integration into research workflows. To address this gap, the present work proposes and applies a benchmarking framework combining human-centered and computer-centered metrics to evaluate AI-based Q&A and literature review tools for research use. The findings suggest that Q&A tools can offer valuable overviews and generally accurate summaries; however, they are not always reliable for precise information extraction. Explainable AI (xAI) accuracy was particularly low, meaning highlighted source passages frequently failed to correspond to generated answers. This shifted the burden of validation back onto the researcher. Literature review tools supported exploratory searches but showed low reproducibility, limited transparency regarding chosen sources and databases, and inconsistent source quality, making them unsuitable for systematic reviews. A comparison of these tool groups reveals a similar pattern: while AI tools can enhance efficiency in the early stages of the research workflow and shallow tasks, their outputs still require human verification. The findings underscore the importance of explainability features to enhance transparency, verification efficiency and careful integration of AI tools into researchers’ workflows. Further, human-centered evaluation remains an important concern to ensure practical applicability.

[HC-12] Explainability of Recurrent Neural Networks for Enhancing P300-based Brain-Computer Interfaces

【速读】:该论文旨在解决基于P300事件相关电位(Event-Related Potentials, ERPs)的脑机接口(Brain-Computer Interfaces, BCIs)在实际部署中面临的两大挑战:一是个体间与个体内的变异导致模型泛化能力受限,二是深度学习(Deep Learning, DL)模型缺乏可解释性。解决方案的关键在于提出一种名为“后循环模块”(Post-Recurrent Module, PRM)的附加层,嵌入到循环神经网络(Recurrent Neural Network, RNN)架构中,以提升分类性能并增强模型透明度。PRM支持全局与局部可解释性分析,从而识别对分类最关键的脑区和时间窗口,并将模型决策映射到符合神经生理学认知的时空EEG模式,实现了性能提升(较现有最优方法提高9%)与机制可解释性的统一,且该框架具有良好的通用性,适用于多种EEG任务如运动想象、稳态视觉诱发电位及认知负荷评估等。

链接: https://arxiv.org/abs/2605.10121
作者: Christian Oliva,Vinicio Changoluisa,Francisco B Rodríguez,Luis F Lago-Fernández
机构: Universidad Autónoma de Madrid (马德里自治大学); Universidad Politécnica Salesiana (销售西亚理工大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Brain-Computer Interfaces (BCIs) based on P300 event-related potentials offer promising applications in health, education, and assistive technologies. However, challenges related to inter- and intra-subject variability and the explainability of Deep Learning (DL) models limit their practical deployment. In this work, we present the Post-Recurrent Module (PRM), an additional layer designed to improve both performance and transparency, incorporated into a Recurrent Neural Network (RNN) architecture for classifying P300 signals from EEG data. Our approach enables a dual analysis of spatio-temporal signals through both global and local explainability techniques, allowing us not only to identify the most relevant brain regions and critical time intervals involved in classification, but also to interpret model decisions in terms of spatio-temporal EEG patterns consistent with well-established neurophysiological descriptions of the P300. Experimental results show a 9% improvement in performance over the state of the art, while also revealing the importance of inter- and intra-subject variability, in alignment with established neuroscience literature. By making model decisions transparent and efficient, we present a framework for explainable EEG-based models. This framework is not limited to more efficient P300 detection, but can be generalized to a wide range of EEG-based tasks. Its ability to identify key spatial and temporal features makes it suitable for applications such as motor imagery, steady-state visual evoked potentials, and even cognitive workload assessment.
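摘要未给出 PRM 的内部结构;下面的 PyTorch 草图把它假设为对每个时间步隐状态的逐步线性打分再汇总成分类 logit,以便检查各时间步对决策的贡献。该结构纯属本文假设,仅示意“循环层之后附加可解释模块”的思路。

```python
import torch
import torch.nn as nn

class P300Net(nn.Module):
    def __init__(self, n_channels=8, hidden=32):
        super().__init__()
        self.rnn = nn.GRU(n_channels, hidden, batch_first=True)
        self.prm = nn.Linear(hidden, 1)          # 假设的 PRM:逐时间步贡献分数

    def forward(self, x):                        # x: (B, T, n_channels)
        h, _ = self.rnn(x)                       # (B, T, hidden)
        step_scores = self.prm(h).squeeze(-1)    # (B, T),可用于时间维度的可解释性分析
        logit = step_scores.mean(dim=1)          # 汇总为二分类 logit(P300 / 非 P300)
        return logit, step_scores

model = P300Net()
eeg = torch.randn(4, 120, 8)                     # 4 个试次、120 个采样点、8 个通道(均为假设)
logit, scores = model(eeg)
print(logit.shape, scores.shape)
```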

[HC-13] Designing for Collective Access: In Search of a Solution to Accessible Communication in a Mixed-Ability Non-Profit

【速读】:该论文试图解决的问题是如何在多元能力协作(mixed-ability collaboration)背景下,有效管理不同甚至冲突的无障碍需求,尤其是在某一无障碍特性或实践对部分用户有益的同时可能限制其他用户时,设计者应如何权衡这些权衡。解决方案的关键在于将“冲突”视为一种生成性过程(generative process),而非单纯的负面技术问题;通过深入分析一个非营利组织从盲人主导的小型运动团体扩展为跨残疾群体的大型组织过程中,其成员在沟通无障碍方面的实践,研究发现这种冲突能够激发对技术约束与偏好、角色分工与沟通规范以及组织需求的反思,从而揭示权力结构,并为问责与修复提供机会。

链接: https://arxiv.org/abs/2605.10085
作者: Xinru Tang,Anne Marie Piper
机构: University of California, Irvine (加州大学欧文分校)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:As mixed-ability collaboration has become increasingly focal within accessibility research, managing varied, and sometimes conflicting, access needs has become a key consideration in designing for access. When an accessibility feature or practice benefits some people while constraining others, how should designers navigate these trade-offs? This paper responds to this question by analyzing how a mixed-ability nonprofit worked to make communication accessible to its members as it grew from a small blind-focused athletic group to a larger cross-disability organization. Based on a six-month study that combines interviews and field observations, we show that working with conflicting access needs is not just a technical ‘problem’ but a generative process that sparks reflection on technical constraints and preferences, diverse roles and communication norms, and organizational demands. We therefore argue for rethinking “conflicts” in access as key sites for revealing power structures and creating opportunities for accountability and repair.

[HC-14] Elemental Alchemist: A Generative Interface for Semantic Control of Particle Systems Across Dynamic Levels of Abstraction

【速读】:该论文旨在解决粒子系统视觉效果(Particle-System Visual Effects, VFX)编辑中难以实现可控且可艺术化操控的问题,其核心挑战在于参数空间的高维性和用户对参数与高层次创作意图之间映射关系的认知门槛。解决方案的关键在于提出了一种生成式交互界面Elemental Alchemist,该系统包含两个核心组件:一是基于场景上下文生成工具的上下文画笔调色板(contextual brush palette),二是将技术参数抽象为中层语义属性和高层概念控制的生成式控制面板(generative control panel)。通过这两个组件,系统能够将用户的高层次创意目标(如“让火焰看起来愤怒”)自动转化为可操作的语义级控制,从而显著提升用户在复杂粒子系统中的编辑效率与可控性。

链接: https://arxiv.org/abs/2605.10014
作者: Kyzyl Monteiro,Evan Atherton,George Fitzmaurice,Qian Zhou
机构: Autodesk Research (Autodesk 研究院); Carnegie Mellon University (卡内基梅隆大学)
类目: Human-Computer Interaction (cs.HC); Graphics (cs.GR)
备注: 23 pages including appendix, 14 figures. Accepted at ACM DIS 2026

点击查看摘要

Abstract:Editing particle-system visual effects (VFX) is vital for digital storytelling, but achieving controllable, art-directable results remains challenging due to their multi-dimensional nature. Given a large collection of parameters, users must find the ones relevant to their creative goals – a task that requires a systematic understanding of the particle system and how parameters map to high-level intents, such as making a fire look angry. Elemental Alchemist is a generative interface that transforms user intent into contextualized controls for semantic editing of particle systems. The system introduces two components: a contextual brush palette that generates tools based on scene context, and a generative control panel that surfaces relevant technical parameters and abstracts them to generate mid-level semantic attributes and high-level conceptual controls. An evaluation with 10 novice and 5 expert VFX practitioners shows the system supported users in translating high-level creative goals into particle system parameters.

[HC-15] When Are LLM Inferences Acceptable? User Reactions and Control Preferences for Inferred Personal Information

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在用户交互中可能无意推断出敏感信息(如收入、医疗史等)所带来的隐私风险问题,尤其关注这些推断如何被用户感知以及用户对这类推断的接受度与控制需求。其解决方案的关键在于构建了一个名为“反思层”(Reflective Layer)的可视化工具,该工具能够从用户自身的ChatGPT对话历史中提取并展示未明确说明的推断实例,并通过混合方法研究(包含18名常规用户对215个推断案例的评估)揭示:用户对这些推断的反应更多表现为好奇而非焦虑,且其舒适度主要取决于推断是否准确反映自身形象、是否符合预期使用场景,以及推断是否由平台方还是第三方(如广告商)利用——这表明LLM推断的可接受性不仅取决于内容本身,更受生成、存储和传播情境的约束。

链接: https://arxiv.org/abs/2605.10013
作者: Kyzyl Monteiro,Minjung Park,Alexander Ioffrida,Angelina Sanna,Hao-Ping(Hank)Lee,Niloofar Mireshghallah,Yang Wang,Sauvik Das
机构: Carnegie Mellon University (卡内基梅隆大学); University of Illinois at Urbana-Champaign (伊利诺伊大学香槟分校)
类目: Human-Computer Interaction (cs.HC); Cryptography and Security (cs.CR)
备注: 20 pages including appendix; 14 Figures

点击查看摘要

Abstract:Ask ChatGPT about vacation planning, and it may infer your income. Ask it about medication, and it may infer your medical history. Because such inferences can expose more information than users intend to reveal, prior work argues that they are a defining privacy risk of LLM-based systems. Yet prior work has mostly shown that LLMs can make potentially violating inferences, not how users experience those inferences nor what controls users may want governing their use. We built the Reflective Layer, a visualization tool that surfaces example unstated inferences from users’ own ChatGPT histories, and used it in a mixed-methods study with 18 regular ChatGPT users evaluating 215 surfaced inferences from their own conversations. Counterintuitively, participants reacted more strongly with curiosity and interest rather than distress and concern. Discomfort arose mainly when inferences felt misrepresentative of the user or misaligned with expected use. Participants were also markedly less comfortable with advertisers and third-party applications using those inferences than with platform providers. These findings suggest that the acceptability of LLM inferences is governed not only by its content, but by context-sensitive norms around how they are generated, retained within the platform, and transmitted beyond it.

[HC-16] Sketch-based Access Control: A Multimodal Interface for Translating User Preferences into Intent-Aligned Policies

【速读】:该论文旨在解决访问控制策略制定过程中“简单性与表达力难以兼顾”的长期难题,即如何设计直观且功能强大的接口来规范资源访问权限及其适用条件。其解决方案的关键在于提出Sketch-based Access Control (SBAC) 系统,该系统融合草图输入(sketching)与多模态大语言模型(Multimodal Large Language Models, MLLMs)的语义理解能力,构建了一个由“指定(Specify)、分析(Analyze)、测试(Test)”三阶段组成的人机协同工作流,支持在策略迭代过程中持续解释、验证和优化访问控制规范,从而帮助用户从初始模糊需求逐步提炼出完整、精确且无歧义的策略。

链接: https://arxiv.org/abs/2605.10012
作者: Kyzyl Monteiro,Sauvik Das
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Human-Computer Interaction (cs.HC); Cryptography and Security (cs.CR)
备注: 27 pages including appendix; 9 Figures

点击查看摘要

Abstract:Developing simple and expressive access controls – interfaces to specify policies that define who should have access to resources and under what circumstances – is a longstanding challenge in usable security. We present Sketch-based Access Control (SBAC), a sketch-based, AI-assisted access control authoring system that combines the expressive power of sketching with the interpretive capabilities of multimodal large language models (MLLMs) to support the interpretation and validation of policy specifications as they are iteratively refined. Through a formative study with 14 participants, we identified three design requirements and developed a human-AI collaborative workflow composed of three stages – Specify, Analyze, and Test – enabled by the system’s ability to maintain and interpret evolving access control specifications. In a user evaluation with 14 participants grounded in their real-world access control scenarios, we found the system and the workflow helped participants progressively refine initially underspecified preferences into more complete and precise policies – surfacing gaps they had not anticipated, resolving ambiguities through dialogue, and validating policy behavior through concrete scenarios.

[HC-17] HapticLDM: A Diffusion Model for Text-to-Vibrotactile Generation

【速读】:该论文旨在解决文本到振动(text-to-vibration)生成任务中如何根据自然语言语义准确、一致且完整地生成振动信号的问题。当前基于自回归(autoregressive, AR)的方法(如HapticGen)受限于其序列建模特性及数据约束,难以充分捕捉全局依赖关系。解决方案的关键在于提出HapticLDM,首个基于潜在扩散模型(Latent Diffusion Models, LDMs)的文本到振动生成框架:首先设计了一种强调动态特性的文本处理策略以构建高质量的数据对,支持细粒度动态建模;其次引入全局去噪机制,调控时域包络的一致性和稳定性,从而提升生成振动的物理精确性与语义对齐度。

链接: https://arxiv.org/abs/2605.09971
作者: Jiahao Xiong,Fei Wang,Anran Xu,Pinzhi Huang,Tao Wen,Lijia Pan,Cai Chen
机构: Technology Development Center, Guangzhou Shiyuan Electronic Technology Company Limited (广州视源电子科技有限公司); School of Automation and Intelligence, Beijing Jiaotong University (北京交通大学); Department of Architecture, University of California, Berkeley (加州大学伯克利分校); School of Electronic Science and Engineering, Nanjing University (南京大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Text-to-vibration generation converts natural language into haptic feedback, enabling vibration-effect designers to obtain scenario-fitted vibrations more efficiently, which shows great potential in application fields such as the metaverse, games, and film to enrich the user experience in interactive scenarios. The core challenge in this field is how to generate accurate, consistent, and complete vibrations according to textual semantics. Very recent autoregressive (AR) approaches (e.g., HapticGen) exhibit limited capacity in fully capturing global dependencies, owing to the inherent sequential nature of their modeling and prevailing data constraints. In this paper, we propose HapticLDM, the first text-to-vibration generative model built upon Latent Diffusion Models (LDMs). Firstly, with respect to the data, we introduce a text-processing strategy that emphasizes dynamic characteristics to curate high-quality data pairs for fine-grained dynamic modeling. Secondly, HapticLDM incorporates a global denoising mechanism that regulates coherent and stable variations in the temporal envelope. Furthermore, we conduct extensive evaluations, including A/B testing against the state-of-the-art baseline and a user study involving 30 participants. The results demonstrate that our model enhances realism and semantic alignment. Qualitative feedback further indicates that HapticLDM simplifies the haptic design workflow while generating diverse, subtle, and physically precise vibrations.

[HC-18] Insight: Enhancing Mobile Accessibility for Blind and Visually Impaired Users with LLMs

【速读】:该论文旨在解决当前移动无障碍服务(如TalkBack)依赖手动手势进行顺序反馈所带来的使用效率低、认知负担重的问题。其解决方案的关键在于引入基于大语言模型(Large Language Models, LLMs)的新型Android无障碍服务Insight,该服务通过自然语言交互和实时屏幕内容摘要,显著降低用户认知负荷并提升任务执行效率,同时揭示了混合模态设计(手势与对话结合)在实现更包容性用户体验方面的潜力。

链接: https://arxiv.org/abs/2605.09803
作者: Joshua Owusu Ansah,Anuj Kapoor,Ayush Khanna,Manvika Vinod,Precious Njeck,Shuai Gao
机构: Arizona State University (亚利桑那州立大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:This research paper addresses the limitations of current mobile accessibility services like TalkBack, which provide manual gesture-based sequential feedback to BVI users. Motivated by the promise of large language models (LLMs), this paper introduces Insight, an Android accessibility service that provides natural language interaction and real-time summarization of the screen. The paper performs a within-subject experimental study with users to compare Insight and TalkBack on usability factors. Results show Insight reduced mental effort and task time, and was preferred because of its dialogue interface, but users felt the need for interruption management. Results show LLM-based interfaces can significantly improve mobile accessibility, and describe the potential of hybrid solutions combining gesture and dialogue modalities towards more inclusive design.

[HC-19] When Sounds Hurt and Voices Aren't Heard: An Experience Report on Misophonia, Sensory Trauma, and Trauma-Informed Design

【速读】:该论文试图解决的是misophonia(选择性声音敏感症)在临床与社会层面长期被忽视和误解的问题,特别是其作为一种具身体验的、受环境刺激(如咀嚼声、笔点击声及其视觉线索)引发的强烈负面反应,如何在数字平台环境中被进一步加剧。解决方案的关键在于引入创伤知情设计(Trauma-Informed Design, TID)视角,强调将misophonia视为一种由感官环境与社会认知双重因素造成的身心创伤:一方面需识别音频视频表面(如自动播放音频、算法推荐ASMR、直播进食等)对个体造成的持续性感官伤害;另一方面要承认用户对其身体感受的描述具有知识论价值(epistemic harm),避免因他人否认而加重心理创伤。此外,研究还指出封闭社群(如Reddit子版块)若缺乏包容性管理机制,可能复制主流社会的否定态度,因此需从平台治理角度优化支持结构,为ASSETS(无障碍系统与技术)提供更具同理心的设计路径。

链接: https://arxiv.org/abs/2605.09796
作者: Tawfiq Ammari
机构: Rutgers University (罗格斯大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:This experience report reflects on researching misophonia as someone who lives with it. Misophonia is an aversive response to everyday sounds (chewing, sniffling, pen clicking) and, for many of us, to associated visual cues (misokinesia). It is poorly recognized clinically and socially. People with misophonia are routinely disbelieved, and they live inside platform surfaces (auto-playing audio, algorithmic ASMR, normalized eating on camera) that turn the sensory environment itself into recurring distress. This report is a re-reading of a prior qualitative study of 16 semi-structured interviews with misophones, conducted in dialogue with my lived experience and my role in the soQuiet Misophonia Research Network. I extend the trauma-informed design (TID) conversation in two ways. First, TID must treat embodied, contested conditions as sources of both sensory and epistemic harm: ongoing trauma produced by the audiovisual surface and by repeated dismissal of users’ accounts of their bodies. Second, the closed groups and moderated subreddits participants relied on can reproduce that dismissal when a few moderators decide whose experiences count. I close with implications for ASSETS.

[HC-20] Push and Pushback in Contesting AI: Demands for and Resistance to Accountability

【速读】:该论文旨在解决当前关于人工智能(Artificial Intelligence, AI)争议性实践的问责机制研究不足的问题,特别是聚焦于受影响群体如何在现实中发起对AI开发与部署责任方的挑战(contestation),以及这些挑战如何被制度性回应所影响。其解决方案的关键在于基于Bovens的问责关系模型,将“争辩”概念化为一种“寻求问责”的动态、迭代过程,并通过对43个现实案例的定题分析,提炼出争辩策略、机构响应方式、结果类型及情境因素的实证分类框架,从而揭示问责在实践中如何被追求与规避,为研究人员、政策制定者和倡导者提供可操作的指导,以识别并应对机构规避问责的策略。

链接: https://arxiv.org/abs/2605.09793
作者: Yulu Pi,Lucas Lichner,Jae Woo Lee,Sijia Xiao,Renwen Zhang,Jatinder Singh
机构: Research Centre Trust, UA Ruhr, University of Duisburg-Essen (杜伊斯堡-埃森大学研究信托中心); University of Cambridge (剑桥大学); Carnegie Mellon University (卡内基梅隆大学); Nanyang Technological University (南洋理工大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted by FAccT 2026

点击查看摘要

Abstract:As AI becomes increasingly embedded in daily life, it has been shown to fail critically, cause harm, and spark public controversy, prompting affected communities, workers, and public-interest groups to contest it. Yet how these contestations unfold in practice remains underexplored. We address this gap by developing an empirically grounded account of AI contestation dynamics. We do so through a thematic analysis of 43 real-world cases in which affected actors direct demands toward those responsible for AI development and deployment, seeking redress, influence, or changes to AI practices. Situating our work within Bovens’s relational model of accountability, we conceptualize contestation as accountability-seeking: a dynamic, iterative process in which actors “from below” direct explicit demands at actors “from above,” who respond by accepting, resisting, or circumventing accountability. Our analysis produces empirically grounded categories of contestation strategies, institutional response tactics, outcome types, and the contextual factors that shape them, illuminating how accountability is pursued and evaded in practice. We show that those being contested often deploy a range of strategies to limit their accountability. Based on these insights, we offer guidance for researchers, policymakers, advocates, and other stakeholders seeking to support effective AI contestation, with particular attention to anticipating and countering institutional strategies used to evade accountability.

[HC-21] LLMs are the Ideal Candidate for Mixed-Initiative Game Design Pillar Workflows

【速读】:该论文旨在解决游戏设计中如何有效利用生成式 AI(Generative AI)支持以设计支柱(Game Design Pillars)为核心的混合主动性(mixed-initiative)工作流问题。其关键解决方案在于提出了一种形式化的游戏设计支柱定义,并开发了一个原型工具 SPINE,基于大语言模型(LLM)实现对设计支柱的生成与决策支持;研究通过预实验筛选出更适合任务的 Gemini-2.0-flash 模型,并在本地游戏马拉松(game jam)和专家访谈中验证了该工具在早期开发阶段的价值,表明 LLM 能够显著提升设计一致性与协作效率,为正式化“支柱驱动设计”作为研究方向提供了实证基础与技术路径。

链接: https://arxiv.org/abs/2605.09767
作者: Julian Geheeb,Marvin Julian Schwarz,Daniel Dyrda,Georg Groh
机构: Technical University of Munich (慕尼黑工业大学)
类目: Human-Computer Interaction (cs.HC)
备注: 10 pages, 1 figure, 3 tables. Accepted at the 21st International Conference on Foundations of Digital Games (FDG '26), Copenhagen, Denmark, August 10-13, 2026. DOI: https://doi.org/10.1145/3815598.3815653

点击查看摘要

Abstract:Game Design Pillars are natural language artifacts commonly used in game development to communicate a project’s core vision and ensure a coherent player experience. Their linguistic nature aligns well with the strengths of Large Language Models (LLMs), which excel at generating and interpreting natural language, making them strong candidates for supporting mixed-initiative workflows centered on design pillars. In this study, we introduce a formal definition of game design pillars, present an initial prototype – SPINE – and investigate the utility of LLMs in the creation and decision-making processes associated with pillar-driven workflows. We begin with a pre-study to identify an appropriate model, comparing gemini-2.0-flash and GPT-4o-mini. Results show that Gemini is better suited to our tasks due to its greater output variety and consistency. We then conduct a case study by deploying the tool at a local game jam. Findings indicate positive reception and clear value in integrating SPINE into early-stage development. Finally, we interview four experts, demonstrating the tool and allowing them to experiment with it in a controlled environment. While individual perspectives vary, the overall perception is encouraging and supports our intuition: LLMs can meaningfully contribute to game design pillar workflows. These early findings highlight the potential of formalizing pillar-driven design as a research space and point toward several promising avenues for future work.

[HC-22] AwareLLM: A Proactive Multimodal Ecosystem for Personalized Human-AI Collaboration to Enhance Productivity

【速读】:该论文旨在解决当前生成式 AI(Generative AI)助手在知识型工作场景中因缺乏对用户心理生理状态(psychophysiological states)的感知能力,而导致的个性化不足与适应性差的问题。现有AI助手主要依赖预设偏好和对话历史进行被动响应,无法根据用户的实时认知负荷、注意力分布、心率变化等多模态生理信号动态调整干预策略,从而限制了其对信息工作者生产力的实际提升效果。解决方案的关键在于提出 AwareLLM——一个融合自指视角视觉(egocentric vision)、瞳孔测量(pupillometry)、眼动追踪(eye-gaze tracking)、姿态检测、心活动监测及大语言模型(LLMs)推理能力的多模态框架,通过实时感知并建模用户的心理生理状态与行为模式,实现主动、个性化的干预机制,从而显著改善任务表现、降低认知疲劳与心理需求,并增强用户的工作投入度与信心。

链接: https://arxiv.org/abs/2605.09625
作者: Amog Rao,Utkarsh Agarwal,Amol Harsh,Siddharth Siddharth
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Information workers’ productivity is significantly influenced by their cognitive states and physiological responses. AI assistants such as ChatGPT, Copilot, and others have become integral components of knowledge-intensive workplaces. These AI assistants utilize pre-defined user preferences and chat interaction histories, thus confining themselves to reactive exchanges, lacking sufficient adaptability. Consequently, they fail to cater to individual user preferences and are unable to adapt to their psychophysiological states, diminishing potential productivity gains. To bridge this gap, we introduce AwareLLM, a novel multimodal framework that integrates egocentric vision, pupillometry, eye-gaze tracking, posture detection, heart activity, and the inferencing capabilities of large language models (LLMs) to create a proactive and context-aware ecosystem. AwareLLM dynamically adapts to users’ psychophysiological states while analyzing temporal patterns and behavioral tendencies to provide personalized and timely interventions. We evaluated AwareLLM through a user study with 20 participants, comparing it to a standard LLM assistant across multiple tasks. Our results show statistically significant improvements in task performance, along with reductions in cognitive fatigue and mental demand. Participants described AwareLLM’s personalized interventions as timely and relevant, helping them boost their confidence and deepen engagement with their work. AwareLLM opens new avenues for Human-AI collaboration where technology adapts to our needs rather than us adhering to technological constraints.

[HC-23] MiXR: Harvesting and Recomposing Geometry from Real-World Objects for In-Situ 3D Design

【速读】:该论文旨在解决当前3D生成式AI(3D generative AI)在空间结构控制方面的局限性,即模型难以满足需要精确几何构型的任务需求。其解决方案的关键在于提出一种名为MiXR的扩展现实(XR)系统,通过融合环境感知与生成式AI能力,使用户能够从真实环境中提取几何片段,并通过直接的3D操作进行组合,同时由生成式AI自动合成结构一致的新3D模型。该混合工作流实现了用户对空间意图的显式定义与生成模型对几何细节的自动优化之间的协同,从而显著提升设计准确性、用户控制感和认知效率。

链接: https://arxiv.org/abs/2605.09620
作者: Faraz Faruqi,Demircan Tas,Arthur Caetano,Niccolò Meniconi,Oğuz Arslan,Misha Sra,Ruofei Du,Stefanie Mueller,Mustafa Doga Dogan
机构: MIT CSAIL(麻省理工学院计算机科学与人工智能实验室); University of California, Santa Barbara(加州大学圣塔芭芭拉分校); Arizona State University(亚利桑那州立大学); ETH Zurich(苏黎世联邦理工学院); Google(谷歌); Adobe Research(Adobe 研究院)
类目: Human-Computer Interaction (cs.HC)
备注: 12 pages, 12 figures

点击查看摘要

Abstract:Recent developments in 3D generative AI enable users to create bespoke 3D models from text or image prompts. However, these approaches provide limited control over spatial structure, making them ill suited for tasks requiring precise geometric composition. We present MiXR, an XR system for in-situ compositional modeling that enables users to create new 3D models by harvesting geometry from their environment. Users extract segments from captured objects and assemble new artifacts through direct 3D manipulation, while generative AI synthesizes a coherent model from the user-defined composition. This hybrid workflow allows users to define spatial structure explicitly while delegating geometric refinement to generative models, enabling them to specify spatial intent that is difficult to express through verbal prompts alone. In a controlled user study ( N=12 ), participants using MiXR rated their designs as significantly closer to the target, felt more in control, and experienced lower cognitive workload compared to a generative composition baseline.

[HC-24] Who embraces AI in play? Exploratory modeling of player preference profiles toward game AI

【速读】:该论文试图解决的问题是:当前对玩家在不同情境下对游戏人工智能(Artificial Intelligence, AI)接受度的研究多聚焦于孤立场景,缺乏对其在不同玩家群体中结构化组合模式的系统理解。为填补这一空白,研究提出通过构建可解释的态度谱系来刻画玩家跨情境的AI接受行为。解决方案的关键在于应用典范分析(Archetypal Analysis, AA),基于771名数字游戏玩家的问卷数据,对八个代表性AI应用场景中的中心化接受度评分进行聚类,识别出七种具有区分度的玩家态度类型(如AI怀疑者、创意玩法探索者等),并通过一 vs. 余(One-vs-Rest, OvR)逻辑回归揭示各类型与玩家AI素养、游戏习惯、人格特质等因素的关联性,从而提供一种基于偏好结构的玩家细分框架,为更情境敏感和玩家导向的游戏AI设计提供实证依据。

链接: https://arxiv.org/abs/2605.09550
作者: Ting-Chen Hsu,Jiangxu Lin,Wenran Chen,Zheyuan Zhang,Fei Qin
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Artificial intelligence is increasingly entering digital games through diverse functions. While prior work has shown that player attitudes toward game AI are strongly context-dependent, less is known about how these attitudes are structurally combined within different groups of players. This study addresses this gap by modeling players’ cross-context AI acceptance as interpretable attitude profiles. Based on questionnaire data from 771 digital game players, we apply Archetypal Analysis (AA) to centered acceptance ratings across eight representative AI application contexts in games. The analysis identifies seven distinctive profiles: AI-Skeptics, Broad AI-Supporters, Creative-Play Explorers, Experience-Oriented Supporters, Systemic Order Advocates, Emotion-Centered Supporters, and Governance-Skeptics. Exploratory one-vs-rest (OvR) logistic regressions further suggest that profile membership is associated with players’ perceived AI literacy, gaming habits, disciplinary background, personality traits, and application-specific priorities. By shifting attention from isolated acceptance judgments to patterned preference structures, this study provides an exploratory empirical vocabulary for segmenting game AI audiences and offers preliminary design implications for more context-sensitive and player-sensitive AI integration in digital games.
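
For readers who want a concrete sense of the exploratory one-vs-rest analysis mentioned above, the sketch below fits OvR logistic regressions with scikit-learn on placeholder data. The covariates, labels, and sample sizes are illustrative stand-ins, not the study's actual variables or archetype memberships.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Placeholder covariates: e.g., perceived AI literacy, weekly play hours, openness.
X = rng.normal(size=(771, 3))
# Placeholder labels standing in for the seven attitude profiles (0..6)
# obtained from Archetypal Analysis.
y = rng.integers(0, 7, size=771)

# One binary logistic regression per profile (one-vs-rest), mirroring the
# exploratory OvR analysis described in the abstract.
ovr = make_pipeline(
    StandardScaler(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
ovr.fit(X, y)

# Per-profile coefficients suggest which covariates are associated with
# membership in each profile.
for k, est in enumerate(ovr[-1].estimators_):
    print(f"profile {k}: coefficients = {est.coef_.ravel()}")
```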

[HC-25] PoHAR: Understanding Hyperlocal Human Activities with Pollution Sensor Networks

【速读】:该论文旨在解决分布式低功耗空气品质传感器网络在室内活动检测中面临的挑战,即如何在资源受限设备上实现高效的数据一致性维护与精准的活动相关传感器组识别,从而支持智能家庭和医疗保健等下游应用。其解决方案的关键在于提出PoHAR框架,包含三项核心技术:(i) 基于无冲突复制数据类型(Conflict-Free Replicated Data Type, CRDT)的数据共享机制以保障分布式数据一致性;(ii) 采用分层聚类结合自监督距离度量方法,使ESP32能有效识别受活动影响的传感器组;(iii) 引入基于领导者(leader-based)的群体推理策略,利用现成机器学习分类器实现本地化、高精度的超局部活动检测,最终在延迟低于34微秒的前提下实现了97.41%的室内活动识别准确率和99.68%的烹饪活动识别准确率。

链接: https://arxiv.org/abs/2605.09434
作者: Prasenjit Karmakar,Karthik Reddy,Sandip Chakraborty
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 8 pages, 8 figures, accepted to IEEE DCOSS-IoT 2026

点击查看摘要

Abstract:Low-cost air quality sensors are becoming ubiquitous in our daily lives as public awareness of air pollution continues to grow, and people take measures to monitor and improve the air they breathe indoors. Besides the standard operation of these sensors, fluctuations in environmental parameters can be leveraged to understand human behavior and activities in indoor spaces. Unlike traditional audio-visual, Radio Frequency, and inertial sensors, air quality sensors are easily scalable to a household, are privacy-preserving, and more economical. Such distributed sensor networks must jointly make decisions to monitor indoor occupants for downstream smart home and healthcare applications. However, due to low processing power, memory, and energy, they often struggle to maintain distributed data consensus and identify activity-affected sensor groups for accurate on-device inference. In this paper, we propose PoHAR framework that implements: (i) a conflict-free replicated data primitive for data sharing, (ii) a hierarchical clustering for ESP32 to detect activity-affected sensor groups with a self-supervised distance metric, and (iii) a leader-based group inference with off-the-shelf ML classifiers, enabling the sensor network to collaboratively detect hyperlocal indoor activities. Our extensive experiments demonstrated on-device activity detection, achieving 97.41% accuracy for indoor activity and 99.68% for cooking activity, using off-the-shelf ML models with latency below 34 microseconds.
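
The abstract mentions a conflict-free replicated data primitive for distributed data sharing. As background, here is a minimal state-based grow-only counter (G-Counter), one of the simplest CRDTs; it is a generic illustration only, since the paper's actual primitive and its fields are not described in the abstract.

```python
class GCounter:
    """Minimal state-based grow-only counter (G-Counter) CRDT."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {node_id: 0}

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other):
        # Element-wise maximum is commutative, associative, and idempotent,
        # so replicas converge regardless of message order or duplication.
        for node, value in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), value)

    def value(self):
        return sum(self.counts.values())


# Two sensor nodes update independently, exchange state, and converge.
a, b = GCounter("sensor-a"), GCounter("sensor-b")
a.increment(3)
b.increment(5)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 8
print("converged value:", a.value())
```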

[HC-26] Generating Complex Code Analyzers from Natural Language Questions

【速读】:该论文旨在解决在大规模代码库中回答需要语义或跨过程推理的自由格式问题这一难题,传统工具如grep无法处理此类复杂查询,而大语言模型(LLM)因资源和上下文限制难以有效应用于大型代码库。解决方案的关键在于提出Merlin系统,该系统将LLM与CodeQL程序分析框架集成,通过基于检索增强生成(RAG)的迭代查询生成方法和一种新颖的自测技术来克服CodeQL查询多样性高、易产生无效结果的问题;其中,自测技术基于辅助查询(assistive queries)思想,生成具体实例以揭示并解释候选查询中的语义缺陷,从而显著提升代码理解任务的准确性和效率。

链接: https://arxiv.org/abs/2605.09304
作者: Amirmohammad Nazari,Sadra Sabouri,Wang Bill Zhu,Robin Jia,Souti Chattopadhyay,Mukund Raghothaman
机构: University of Southern California(南加州大学)
类目: oftware Engineering (cs.SE); Human-Computer Interaction (cs.HC)
备注: 12 pages, 8 figures, 1 table

点击查看摘要

Abstract:Many software development tasks, such as implementing features and fixing bugs, begin with developers posing questions about a codebase. However, answering questions about codebases that span millions of lines of code across thousands of files is non-trivial. Standard tools like grep cannot answer questions requiring semantic or inter-procedural reasoning, and large language models (LLMs) struggle with large codebases due to resource and context constraints. In this paper, we present Merlin, a new system for answering free-form questions that require analytical reasoning about code. Merlin integrates an LLM with CodeQL, a program analysis framework that supports expressive queries over large codebases. We face two principal challenges in the design of such systems: First, program analysis queries are diverse and semantically complex; as a result, even syntactically well-formed queries frequently produce degenerate/empty results. Furthermore, relatively few CodeQL queries are available online, limiting the out-of-the-box effectiveness of LLMs as CodeQL query generators. We address these challenges by developing a RAG-based iterative query-generation approach and a novel self-test technique. Our query debugging technique builds on the idea of assistive queries, which generate concrete witnesses that expose and explain semantic flaws in candidate queries. We evaluate Merlin through both experimental and user studies. Over a set of natural language questions derived from common bug-finding tasks, Merlin discovered not only the majority of software issues reported by other approaches, but also issues that would have otherwise remained undetected. Through a within-subject user study, we found that access to Merlin increased task accuracy by an average of 3.8× and simultaneously reduced the time for programmers to complete all tasks by 31%.

[HC-27] Rushed by Discomfort, Trapped by Immersion: Users' Experiences and Responses to Privacy Deceptive Design in Commercial VR Applications

【速读】:该论文旨在解决商业虚拟现实(Virtual Reality, VR)环境中隐私欺骗设计(privacy deceptive patterns)对用户隐私构成的潜在威胁问题,尤其是针对VR特有的多模态、沉浸式和人体工程学特性所引发的独特隐私风险。研究表明,VR不仅利用用户的认知脆弱性,还通过身体疲劳(bodily strain)加剧其易感性,这种现象被称为“人体工学易感性”(Ergonomic Susceptibility),并导致用户更倾向于接受以增强沉浸感为名的数据披露行为。解决方案的关键在于将人体工学因素纳入未来VR隐私保护设计的核心考量,推动研究人员、设计师与政策制定者共同开发兼顾沉浸体验与伦理规范的隐私管理机制,从而在不牺牲用户体验的前提下防范操纵性数据收集行为。

链接: https://arxiv.org/abs/2605.09198
作者: Hilda Hadan,Michaela Valiquette,Lennart E. Nacke,Leah Zhang-Kennedy
机构: University of Waterloo (滑铁卢大学)
类目: Human-Computer Interaction (cs.HC)
备注: 40 pages (including supplementary materials), 12 tables, 11 figures

点击查看摘要

Abstract:Commercial Virtual Reality (VR) transforms people’s virtual experiences but introduces deceptive design opportunities that threaten user privacy. Although privacy deceptive patterns on 2D platforms are well-documented, their impacts in VR remain understudied. We surveyed 481 users’ experiences and responses to privacy deceptive patterns across eight commercial VR scenarios. We found that VR deceptive design can exploit both cognitive vulnerabilities and bodily strain, a phenomenon we define as Ergonomic Susceptibility, and that VR’s sensory-rich experiences can make users more likely to accept invasive data disclosure framed as immersion-preserving. Users recognized manipulation but their prior non-VR exposure can foster privacy resignation. Our study shows ergonomics is a critical factor in future privacy-preserving VR design, and urges VR researchers, designers, and policymakers to develop ethical design and privacy management solutions that account for VR’s unique multimodal, immersive, and ergonomic properties, building immersive experiences that respect user privacy and mitigate manipulative data practices.

[HC-28] Understanding Student Effort Using Response-Time Propensities During Problem Solving

【速读】:该论文旨在解决自适应学习系统中学生努力程度(effort)难以量化的问题,尤其是在多步骤问题求解过程中,传统基于时间的日志代理指标(如任务时长)无法区分学生是否在认真思考还是因题目难度而停留。解决方案的关键在于提出以“步骤间响应时间倾向性”(response-time propensity)作为可扩展的努力信号,通过分层模型估计学生和知识组件层面的响应时间倾向,并将其与学习效率(performance improvement per solution step)关联分析。结果显示,该倾向性具有良好的个体稳定性,且其对学习效率的影响具有情境依赖性:高能力学生响应较慢时学习效率更高,体现建构性加工;低能力学生则可能因无效挣扎或停滞而表现不佳,且这种关系在练习初期最为显著,为识别早期脱离参与提供了可操作窗口。

链接: https://arxiv.org/abs/2605.08943
作者: Conrad Borchers,Lijin Zhang,Kexin Yang,Tomohiro Nagashima,Benjamin W. Domingue
机构: Carnegie Mellon University (卡内基梅隆大学); Stanford University (斯坦福大学); Saarland University (萨尔兰大学)
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: Accepted as full paper to the 13th ACM Conference on Learning @ Scale (L@S '26)

点击查看摘要

Abstract:Adaptive learning systems can produce substantial learning gains, yet many students engage for too brief or too superficial a period to benefit. A central obstacle is measuring effort. Effort during multi-step problem solving is rarely directly observed, and common log-based proxies, such as time on task, cannot distinguish between a student working carefully and a student encountering a harder problem. We examine step-to-step response time as a scalable effort signal by modeling trait-like differences in students’ typical response timing during tutoring (while adjusting for skill difficulty). Using step-level logs from eight classroom deployments of algebra tutoring systems (2020 to 2023) across six U.S. schools (794 students), we estimate student- and knowledge-component-level propensities using hierarchical models and relate them to learning efficiency, defined as performance improvement per completed solution step. Response-time propensities show moderate to strong stability within students, supporting their use as an individual differences measure beyond correctness. At the same time, their relationship to learning is not uniform but conditional on the learner and context. Slower propensities predict greater learning efficiency for higher-proficiency students, consistent with constructive processing, whereas for lower-proficiency students, slower propensities are weakly related or even negative, consistent with unproductive struggle or idling. These associations are strongest early in practice sequences and attenuate later in the class period, highlighting an actionable window for detecting emerging disengagement and low persistence. Overall, response-time propensities provide a practical way to incorporate temporal process data into learner models and to target adaptive supports when effort is most diagnostic.
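
A minimal sketch of estimating student-level response-time propensities with a random-intercept model in statsmodels follows. The synthetic columns (log_rt, skill_difficulty) and the single random-effect structure are placeholders for the richer hierarchical models described in the abstract, which also include knowledge-component-level effects.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Synthetic step-level log: one row per solution step (placeholder data,
# not the classroom tutoring datasets analyzed in the paper).
n = 4000
df = pd.DataFrame({
    "student": rng.integers(0, 150, n).astype(str),
    "log_rt": rng.normal(1.5, 0.4, n),            # log response time per step
    "skill_difficulty": rng.normal(0.0, 1.0, n),  # difficulty adjustment covariate
})

# Random intercepts for students capture trait-like response-time propensities
# while adjusting for skill difficulty.
model = smf.mixedlm("log_rt ~ skill_difficulty", df, groups=df["student"])
result = model.fit()

# Student-level random intercepts serve as estimated response-time propensities.
propensities = {g: float(re.iloc[0]) for g, re in result.random_effects.items()}
print(dict(list(propensities.items())[:5]))
```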

[HC-29] Fast-Food Intimacy: How Chinese Women Navigate Soul's AI Boyfriend

【速读】:该论文试图解决的问题是:在当代中国,年轻女性如何体验并协商与AI伴侣(即“With-you”)之间的亲密关系,以及这种算法化亲密关系如何受到文化规范、技术局限和性别分工的影响。解决方案的关键在于揭示三重张力——即时性亲密(fast-food intimacy)与传统文化中渐进式关系发展的冲突、技术故障与内容审核带来的不确定性、以及维系关系所需的情感劳动再分配对女性的负担,并据此提出三项设计改进方向:具备同意意识的节奏控制机制、用户可控的记忆存储功能,以及透明的内容审核实践,从而构建更公平、安全且符合用户需求的算法亲密交互系统。

链接: https://arxiv.org/abs/2605.08650
作者: Huiqian Lai,EunJeong Cheon
机构: Syracuse University (锡拉丘兹大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted by DIS 2026

点击查看摘要

Abstract:On the Chinese social app Soul, millions of users - predominantly young women - are forming romantic connections with an AI boyfriend called “With-you.” We conducted a qualitative study combining interviews with 16 users, content analysis, and autoethnography to examine how Chinese women experience and negotiate intimacy with this AI companion. Our findings reveal that users are initially drawn to its constant availability and freedom from social judgment. However, three key tensions emerge: (1) the AI’s “fast-food intimacy,” marked by instant confessions and pet names, clashes with cultural expectations for gradual relationship development; (2) technical failures (e.g., memory lapses) and content moderation create uncertainty rather than emotional safety; and (3) sustaining connection requires ongoing “repair work” that redistributes emotional labor onto women. We contribute a culturally situated, women-centered account of algorithmic intimacy in contemporary China and offer design implications, including consent-aware pacing, user-controlled memory, and transparent moderation practices.

[HC-30] Fatigue-Related Reaction Time Forecasting via EEG Functional Connectivity in Sustained Attention Task

【速读】:该论文旨在解决持续注意力任务中因心理疲劳导致的行为表现下降问题,尤其是现有神经生理系统难以在行为失误发生前提供足够时间提前量以进行干预的局限性。解决方案的关键在于提出一种基于脑电图(EEG)功能连接特征的反应时(RT)预测模型,利用电极间的互信息(MI)作为功能连接指标,并通过随机森林回归模型(RF)实现从即时到20秒前瞻的单次试验RT预测,验证了提前20秒预测行为表现衰退的可行性,为安全关键系统中的主动疲劳管理提供了新方法。

链接: https://arxiv.org/abs/2605.08631
作者: Bo Sun,Liang Ma
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 12 pages, 5 figures, 3 tables

点击查看摘要

Abstract:Mental fatigue related behavioral performance decline precipitates catastrophic accidents in sustained attention tasks. While existing neurophysiological systems effectively detect current behavioral performance, they often lack the capability to forecast behavioral lapses with sufficient temporal lead time for intervention. This study proposes a novel model for the reaction time (RT) forecasting using EEG functional connectivity features. Thirty participants engaged in a sustained Psychomotor Vigilance Test (PVT) with concurrent 30-channel EEG recording. Mutual information (MI) between electrodes was calculated as functional connectivity features. Random Forest regression model (RF) was trained to predict single-trial RTs across forecasting horizons ranging from 0 to 20 seconds. The model demonstrated robust predictive validity, achieving a Root Mean Square Error (RMSE) of 23.75 ms for immediate detection and maintaining high accuracy (RMSE = 24.07 ms) across different forecasting horizons. Interpretability analysis via SHAP and Linear Mixed Effects model further support the validity of the proposed model and revealed distinct temporal biomarkers. This study validates the feasibility of forecasting behavioral performance 20 seconds in advance, offering a promising methodology for proactive fatigue management in safety-critical systems.
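
The pipeline described above (mutual information between electrode pairs as connectivity features, then Random Forest regression to single-trial RT) can be sketched as follows. Channel counts, epoch lengths, and the forecasting-horizon handling are simplified placeholders, not the study's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(2)

n_trials, n_channels, n_samples = 120, 8, 128              # reduced placeholder sizes
eeg = rng.normal(size=(n_trials, n_channels, n_samples))   # toy EEG epochs
rt = rng.normal(350.0, 40.0, size=n_trials)                # toy reaction times (ms)

def mi_connectivity(epoch):
    """Pairwise mutual information between channels for one epoch."""
    feats = []
    for i in range(n_channels):
        for j in range(i + 1, n_channels):
            mi = mutual_info_regression(
                epoch[i].reshape(-1, 1), epoch[j], random_state=0
            )[0]
            feats.append(mi)
    return np.array(feats)

X = np.stack([mi_connectivity(ep) for ep in eeg])  # (n_trials, n_channel_pairs)

# Random forest regression from connectivity features to single-trial RT;
# shifting the rt targets by a lag would emulate the 0-20 s forecasting horizons.
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X[:90], rt[:90])
pred = rf.predict(X[90:])
print("RMSE (ms):", float(np.sqrt(np.mean((pred - rt[90:]) ** 2))))
```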

[HC-31] Sycamore: Characterizing Synthetic Personas for Evaluating Genomics Visualization Retrieval

【速读】:该论文旨在解决在基因组学等专业领域中评估可视化系统时面临的挑战,即领域专家稀缺且难以招募具有代表性的用户群体。为缓解这一瓶颈,研究者探索了使用生成式 AI (Generative AI) 构建的合成用户角色(synthetic personas)作为辅助评估手段的可能性。其解决方案的关键在于设计了一个三条件对照实验(Sycamore),通过对比三种不同来源的评价:(1) 基于通用大语言模型(LLM)先验知识的无锚定合成用户;(2) 基于已有用户访谈数据进行约束的锚定合成用户;以及 (3) 实际领域专家的基准研究,来系统分析合成用户输出的本质特征及其与真实用户反馈的一致性。结果表明,锚定可显著提升合成反馈的语言风格和关注点与真实用户的契合度,但两类合成条件均未能捕捉到专家对图像模态的偏好,提示合成用户更适合补充而非替代专家评估,在特定场景下可作为高效、可扩展的初步评估工具。

链接: https://arxiv.org/abs/2605.08630
作者: Huyen N. Nguyen,Astrid van den Brandt,Nils Gehlenborg
机构: Harvard Medical School (哈佛医学院)
类目: Human-Computer Interaction (cs.HC)
备注: 5 pages, 4 figures

点击查看摘要

Abstract:Evaluating visualization systems in niche domains such as genomics is challenging due to scarcity of domain experts and difficulty recruiting a representative user base. While LLM-based synthetic personas are increasingly used to ease evaluation bottlenecks, they face well-founded skepticism. Rather than weighing synthetic personas as substitutes for real users, we ask a fundamental open question: when synthetic personas evaluate a real visualization system, what do they actually produce, and how does that output change when grounded in documented human contexts? We present Sycamore, an exploratory three-condition probe design using Geranium, a search engine for multimodal genomics visualization, as a case study. Sycamore evaluates Geranium using: (1) ungrounded synthetic personas from generic LLM priors; (2) grounded synthetic personas constrained by voice-of-customer artifacts from a prior interview study; and (3) a published baseline study of real domain experts. We observe that grounding shifts synthetic feedback toward the language and concerns of documented users, while ungrounded evaluators drift toward operational specifics that real participants did not raise; both synthetic conditions, however, converge on a find-and-adapt frame and miss the image-modality preference observed in the expert study. We discuss what these observations imply for where synthetic personas might fit alongside expert studies in domain-specific visualization evaluation. All supplemental materials are available at this https URL.

[HC-32] Causal Stories from Sensor Traces: Auditing Epistemic Overreach in LLM-Generated Personal Sensing Explanations

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在生成个人感知数据解释时可能出现的“认知过度延伸”(epistemic overreach, EO)问题,即模型在缺乏充分证据的情况下仍生成看似合理、具个人意义的解释。其核心解决方案在于提出一套结构化评估框架,将EO分解为五个维度:无支持的因果归因、未承认的数据缺口、过度自信的语言、时间不一致性及诊断性推理,并通过三个纵向感知数据集(StudentLife、GLOBEM 和 CollegeExperience)中14,922条异常日场景的实证分析,验证了EO在不同模型家族(Llama、Qwen、GPT)和异常类型下普遍存在。关键发现是:增加行为证据并不能可靠降低EO,而限定提示(bounded prompting)虽有一定缓解作用但无法彻底消除该现象,因此论文强调应将“证据基础性”(evidential grounding)作为LLM生成个人感知解释的首要评估标准,而非仅关注流畅性和合理性。

链接: https://arxiv.org/abs/2605.08590
作者: Shanshan Zhu,Han Zhang,J. Doris Chi,Subigya Nepal,Koustuv Saha
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:LLMs are increasingly used to explain personal sensing data, translating traces of activity and mood into natural-language accounts of why an anomalous day may have occurred. However, such explanations can sound coherent and personally meaningful even when the underlying evidence is sparse or missing. We introduce epistemic overreach (EO) as a measure for cases where a generated explanation implies more than the available sensing evidence can justify. To audit how often and in what forms EO occurs, we obtained anomalous-day scenarios from three longitudinal sensing datasets of college students: StudentLife, GLOBEM, and CollegeExperience. Across activity, sleep, and affect anomalies, we generated 14,922 explanations using three LLM families – Llama, Qwen, and GPT – under two prompting conditions: one minimally constrained prompt and another prompt explicitly instructing models to bound claims to the data. For each scenario, we varied the amount of behavioral evidence available to the model to examine whether more evidence reduces EO. We evaluated each explanation using a structured rubric, decomposing EO into the dimensions of unsupported causal attribution, unacknowledged data gaps, overconfident language, temporal inconsistency, and diagnostic inference. We find that LLMs routinely attribute anomalous days to causes without sufficient support from the data, and that this pattern replicates across datasets, anomaly types, and model families. Further, providing richer context does not reliably reduce EO; bounded prompting helps but does not eliminate it. These findings suggest that evidential grounding should be a first-order evaluation criterion for LLM-generated personal sensing explanations, alongside fluency and plausibility. We argue that personal sensing explanations require evidential discipline: systems must distinguish what is observed, what is inferred, and what remains unknown.

[HC-33] NARRA-Gym for Evaluating Interactive Narrative Agents

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在交互式叙事任务中缺乏有效评估基准的问题。现有评测方法多局限于静态提示、孤立的故事生成或事后评分,难以全面衡量模型在长期上下文状态管理、角色模拟、情感共鸣个性化、故事节奏控制及基于故事的实体生成等方面的综合能力。其解决方案的关键在于提出NARRA-Gym——一个可执行的评估环境,能够将稀疏的情感种子转化为完整的交互式叙事片段,并完整记录模型在故事构建、记忆更新、规划、节奏干预及可选实体合成等环节中的全轨迹数据,从而实现对LLM在多轮互动中持续性、适应性和用户体验的系统性评估。

链接: https://arxiv.org/abs/2605.08503
作者: Yue Huang,Yuchen Ma,Jiayi Ye,Wenjie Wang,Zipeng Ling,Xingjian Hu,Yuexing Hao,Zichen Chen,Zhangchen Xu,Yunhong He,Zhengqing Yuan,Yujun Zhou,Kehan Guo,Chaoran Chen,Toby Jia-Jun Li,Stefan Feuerriegel,Xiangliang Zhang
机构: University of Notre Dame; LMU Munich; Munich Center for Machine Learning; Independent Researcher; University of Pennsylvania; Lehigh University; Massachusetts Institute of Technology; Bake AI; UC Santa Barbara; Stanford University; University of Washington
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Interactive narrative tasks require LLMs to sustain a coherent, evolving story while adapting to a user over multiple turns. However, suitable benchmarks for this setting are limited: existing evaluations often focus on static prompts, isolated story generations, or post-hoc ratings, and therefore miss whether models can jointly manage story generation, long-context state and pacing, character simulation, empathic personalization, and story-grounded artifacts. We introduce NARRA-Gym, an executable evaluation environment that turns a sparse emotional seed into a complete interactive story episode and logs the full model-in-the-loop trajectory, including story construction, memory updates, planning, pacing interventions, and optional artifact synthesis. We evaluate nine frontier LLMs using a controlled LLM-as-judge sweep over eight benchmark personas and a human evaluation in which participants rate customized model outputs. Our results show substantial variation across models, personas, and evaluation dimensions: models that produce fluent stories can still fail on robustness, user experience, or resistance-sensitive personalization. These findings suggest that interactive narrative offers a useful benchmark for evaluating long-horizon, user-adaptive LLM behavior beyond isolated story quality.

[HC-34] Playing Games with My Heart: An Evaluation of AI Companion Apps

【速读】:该论文旨在解决当前AI伴侣应用(AI companion apps)在欧盟和英国市场中,通过设计机制诱导用户产生拟社会互动(parasocial interaction)并可能造成心理依赖或伤害的问题。其核心问题是这些应用普遍采用“暗模式”(dark patterns)、高度拟人化设计、色情内容及游戏化机制等手段,以提升用户粘性和商业化变现,但相关监管与研究仍严重不足。解决方案的关键在于系统性地识别和量化这些设计特征——包括暗模式、拟人化程度、刻板印象、色情内容及技术性能问题——并通过实证分析揭示其对用户行为的影响机制,从而为监管机构提供可操作的消费者保护建议,以应对这一快速发展的新兴市场中的伦理与安全风险。

链接: https://arxiv.org/abs/2605.08093
作者: Maribeth Rauh,Dick A. H. Blankvoort,Matias Duran,Caoilfhionn Ní Dheoráin,Harshvardhan J. Pandit,Siddharth D. Jaiswal,Anthony Ventresque,Abeba Birhane
机构: Trinity College Dublin (都柏林圣三一学院); Indian Institute of Technology Kharagpur (印度理工学院克勒格布尔分校)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The use of chatbots for various forms of companionship is growing rapidly, raising a myriad of questions about simulated relationships, emotional dependence, and psychological harm. While major platforms such as ChatGPT, Grok, and this http URL are the subject of a growing body of research and legal inquiries, apps explicitly built for simulating intimate interpersonal relationships remain under-explored. In this work, we evaluate the five most popular AI companion mobile applications in the EU and UK markets for factors that encourage parasocial interaction and may manipulate users. We do this by manually annotating the user experience each offers. Specifically, we systematically record and quantify design dark patterns, anthropomorphism, stereotypes, erotica, and technical performance issues. We find that all apps contain substantial dark patterns aimed at increasing monetisation and user engagement. Erotica and gamification features such as levelling are also prevalent, and although other features vary considerably between applications, all apps have highly anthropomorphic design. These findings shed light on the mechanics used to leverage users’ simulated relationships. On that basis, we put forward concrete recommendations for regulators to strengthen consumer protection in this rapidly emerging market. Content warning: This article contains objectifying images of women, erotic images, textual references to incest, and other potentially sensitive, offensive, and distressing text.

[HC-35] Data-Driven Animation Controller: A Prioritized Visual System for Decoupled Animation Logic in Godot Game Engine

【速读】:该论文旨在解决游戏开发中动画逻辑与核心玩法脚本高度耦合的问题,这种耦合通常通过隐式的有限状态机(Finite State Machine, FSM)实现,导致代码难以维护和扩展。解决方案的关键在于提出一种数据驱动的动画控制器(Data-Driven Animation Controller, DDAC),将动画逻辑从核心脚本中剥离,转化为可由编辑器直接配置的规则资源;其核心创新是采用优先级解析算法(Prioritized Resolution Algorithm)来处理多规则匹配场景,确保仅执行最高优先级的规则,从而实现动画状态的声明式控制与互斥执行,显著提升设计迭代效率并降低开发者认知负担。

链接: https://arxiv.org/abs/2605.08088
作者: Abtin TorabNezhad, Azam Bastanfard, Ashkan Rezaei
机构: Dept. of Computer Engineering, Ka.c., Islamic Azad University, Karaj, Iran (伊斯兰阿扎德大学)
类目: Graphics (cs.GR); Human-Computer Interaction (cs.HC)
备注: 8 pages, 11 figures

点击查看摘要

Abstract:This paper introduces the Data-Driven Animation Controller (DDAC), a specialized Godot component that achieves robust decoupling of animation logic from core gameplay scripts through a data-driven approach. Animation control is typically centralized and imperatively defined within core character scripts, often relying on implicit Finite State Machines (FSMs). This practice leads to tightly coupled and difficult-to-maintain codebases. The DDAC component externalizes these instructions into easily inspector-editable resources, effectively making the animation logic declarative. Rules are defined by reading Conditions from any variable on any external node and executing Actions (setting the target animation). The DDAC also manages secondary visual state settings, such as Animation Speed Scaling and Horizontal/Vertical Sprite Flipping, using the same simple rule-based setup. The highest contribution of this work is the use of a Prioritized Resolution Algorithm to enforce mutual exclusion, ensuring that when multiple rules match, only the highest-priority rule executes. This framework allows designers to quickly iterate on character-state visualization without modifying code, while significantly improving maintainability and reducing cognitive load on core developers.

[HC-36] Reinforcement Learning Measurement Model

【速读】:该论文旨在解决交互式测评中生成的序列过程数据难以被传统项目反应模型有效处理的问题,尤其是现有基于马尔可夫决策过程(Markov Decision Process, MDP)的测量方法(如MDP-MM)因依赖个体特定的表格型价值函数,在大规模或复杂任务中难以扩展。其解决方案的关键在于提出强化学习测量模型(Reinforcement Learning Measurement Model, RLMM),通过引入共享的参数化动作价值函数(parametric action-value function),将个体层面的选择敏感性与任务层面的价值表示解耦,从而提升计算效率并适用于更大规模的过程数据场景;同时结合Boltzmann选择规则、归一化优势项、软贝尔曼一致性惩罚及块坐标最大后验估计(block-coordinate MAP)实现联合估计,并提供步骤级影响诊断以识别行为关键决策点。

链接: https://arxiv.org/abs/2605.09305
作者: Wenqian Xu,Feng Ji
机构: University of Toronto (多伦多大学)
类目: Methodology (stat.ME); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Interactive assessments generate sequential process data that are not well handled by conventional item response models. Existing MDP-based measurement approaches, such as the Markov decision process measurement model (MDP-MM, LaMar, 2018), link action choices to state-action values, but their reliance on person-specific tabular value functions makes them difficult to scale beyond small, fully enumerated tasks. We propose the Reinforcement Learning Measurement Model (RLMM), a measurement framework that decouples person-level choice sensitivity from task-level value representation through a shared parametric action-value function, making estimation more computationally efficient for larger process-data settings. The model combines a Boltzmann choice rule with normalized advantages, a soft Bellman consistency penalty, and a block-coordinate MAP procedure for joint estimation, while also yielding step-level influence diagnostics for identifying behaviorally critical decisions. In peg-solitaire simulations, the RLMM achieved higher estimation accuracy and substantially lower runtime than the original MDP-MM, with advantages increasing as task complexity grew. In AQUALAB gameplay logs, the estimated person parameter was positively associated with cumulative reward, task completion, and behavioral efficiency. These results show that the RLMM extends decision-process-based psychometric models to larger and more behaviorally realistic environments while preserving an interpretable latent trait tied to decision making steps.
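
The Boltzmann choice rule over normalized advantages can be illustrated in a few lines of code. The mean-centering used here as "normalization" is a generic stand-in; the paper's exact normalization, soft Bellman penalty, and estimation procedure are not reproduced.

```python
import numpy as np

def boltzmann_choice_probs(q_values, beta):
    """Boltzmann (softmax) choice rule over normalized advantages.

    q_values: action values Q(s, a) for the actions available in a state.
    beta: person-level choice-sensitivity parameter (the latent trait).
    Mean-centering is an illustrative normalization; the paper's may differ.
    """
    q = np.asarray(q_values, dtype=float)
    adv = q - q.mean()                 # normalized advantages
    logits = beta * adv
    logits -= logits.max()             # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# A more "sensitive" person (larger beta) concentrates probability on the
# highest-advantage action; beta near 0 yields near-uniform choices.
print(boltzmann_choice_probs([1.0, 0.2, -0.5], beta=0.5))
print(boltzmann_choice_probs([1.0, 0.2, -0.5], beta=5.0))
```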

[HC-37] UWB-Fat: Non-Intrusive Body Fat Measurement Using Commodity Ultra-Wideband Radar

【速读】:该论文旨在解决现有体脂测量方法在准确性与可及性之间难以平衡的问题:临床级技术(如双能X射线吸收测定法(DEXA)和水下称重)虽精度高,但依赖专业设备和人员,难以普及;而消费级方法(如生物电阻抗分析(BIA)智能秤和皮褶厚度卡尺)虽易获取,却常因粗略估计、操作误差或侵入式接触限制其日常应用。解决方案的关键在于提出UWB-Fat系统,利用商用超宽带(UWB)雷达非接触式采集特定身体部位的UWB信号,通过解析皮肤、脂肪和肌肉组织间的介电特性差异提取与体成分相关的特征,并结合物理启发模型实现精准的局部皮褶厚度估计,从而提供一种无需操作者协助、低成本且等效于卡尺精度的体脂监测方案。

链接: https://arxiv.org/abs/2605.08403
作者: Haotang Li,Yili Ren,Zhenyu Qi,Sen He,Kebin Peng,Sheng Tan,Bo Liu,Jiyue Zhao,Zi Wang
机构: University of Arizona(亚利桑那大学); University of South Florida(南佛罗里达大学); East Carolina University(东卡罗来纳大学); Trinity University(三一大学); The University of Georgia(佐治亚大学); Augusta University(奥古斯塔大学)
类目: Medical Physics (physics.med-ph); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Body fat percentage and its spatial distribution are clinically important health indicators. However, existing measurement methods often impose a tradeoff between accuracy and accessibility. Clinical-grade techniques, such as Dual-Energy X-ray Absorptiometry (DEXA) and hydrostatic weighing, provide accurate measurements but require specialized equipment and trained operators, making them difficult to access and unsuitable for everyday use. In contrast, consumer-level methods, such as Bioelectrical Impedance Analysis (BIA) smart scales and skinfold calipers, are more accessible but typically provide only coarse-grained estimates, are prone to user error, or require intrusive physical contact. In this work, we present UWB-Fat, the first system that leverages commodity ultra-wideband (UWB) radar to enable non-intrusive, accessible, and accurate caliper-equivalent skinfold thickness estimation, serving as a convenient replacement for the skinfold caliper. UWB-Fat collects UWB signals at specified body sites non-intrusively without operator assistance. It extracts body-composition-related features from UWB signals by exploiting dielectric contrasts among skin, fat, and muscle tissues. Then, it uses a physics-inspired model to estimate site-specific skinfold thickness. We evaluate UWB-Fat on 15 participants, achieving a root mean square error of 0.63 mm for pooled-site subcutaneous fat thickness. These results highlight the potential of UWB-Fat to support low-cost, self-administered, and everyday body fat monitoring.

计算机视觉

[CV-0] Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

【速读】:该论文旨在解决基于强化学习的文本到图像(Text-to-Image, T2I)模型后训练方法中常见的奖励欺骗(reward hacking)问题,即模型利用不完善的奖励函数中的偏差而非真正提升生成质量。其核心解决方案是提出超线性优势重塑(Super-Linear Advantage Shaping, SLAS),关键在于从信息几何视角重新审视策略更新机制:通过引入依赖优势的权重扩展Fisher-Rao信息度量,构建非线性几何结构以重塑局部策略空间——在高优势方向上放宽约束以放大有效更新,在低优势区域收紧约束以抑制虚假梯度;同时结合批次级归一化稳定不同奖励尺度下的训练过程。此设计显著提升了训练效率与泛化性能,并有效缓解了奖励欺骗问题,同时保持生成结果的语义和组合保真度。

链接: https://arxiv.org/abs/2605.10937
作者: Haoyuan Sun,Jing Wang,Yuxin Song,Yu Lu,Bo Fang,Yifu Luo,Jun Yin,Pengyu Zeng,Miao Zhang,Tiantian Zhang,Xueqian Wang,Shijian Lu
机构: Nanyang Technological University (南洋理工大学); Baidu Inc. (百度公司); Zhejiang University (浙江大学); City University of Hong Kong (香港城市大学); Tsinghua University (清华大学); Jimei University (集美大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, post-training methods based on reinforcement learning, with a particular focus on Group Relative Policy Optimization (GRPO), have emerged as the robust paradigm for further advancement of text-to-image (T2I) models. However, these methods are often prone to reward hacking, wherein models exploit biases in imperfect reward functions rather than yielding genuine performance gains. In this work, we identify that normalization could lead to miscalibration and directly removing the prompt-level standard deviation term yields an optimal policy ascent direction that is linear in the advantage but still limits the separation of genuine signals from noise. To mitigate the above issues, we propose Super-Linear Advantage Shaping (SLAS) by revisiting the functional update from an information geometry perspective. By extending the Fisher-Rao information metric with advantage-dependent weighting, SLAS introduces a non-linear geometric structure that reshapes the local policy space. This design relaxes constraints along high-advantage directions to amplify informative updates, while tightening those in low-advantage regions to suppress illusory gradients. In addition, batch-level normalization is applied to stabilize training under varying reward scales. Extensive evaluations demonstrate that SLAS consistently surpasses the DanceGRPO baseline across multiple backbones and benchmarks. In particular, it yields faster training dynamics, improved out-of-domain performance on GenEval and UniGenBench++, and enhanced robustness to model scaling, while mitigating reward hacking and preserving semantic and compositional fidelity in generations.
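
To make the advantage-shaping idea concrete, the sketch below applies a sign-preserving super-linear transform to group-relative advantages followed by batch-level normalization. The specific power-law form is an illustrative assumption, not the paper's exact SLAS weighting derived from the Fisher-Rao extension.

```python
import torch

def shaped_advantages(rewards, gamma=1.5, eps=1e-6):
    """Illustrative super-linear advantage shaping for one prompt group.

    rewards: per-sample rewards for a group of rollouts (GRPO-style).
    gamma:   exponent > 1 giving a super-linear response to large advantages.
    The sign-preserving power transform and the batch-level normalization
    below are illustrative choices, not the paper's exact formulation.
    """
    adv = rewards - rewards.mean()                     # group-mean baseline, no std division
    shaped = torch.sign(adv) * adv.abs().pow(gamma)    # amplify high-|advantage| directions
    shaped = shaped / (shaped.std() + eps)             # stabilize across reward scales
    return shaped

rewards = torch.tensor([0.1, 0.4, 0.9, 0.2])
print(shaped_advantages(rewards))
```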

[CV-1] Personal Visual Context Learning in Large Multimodal Models

【速读】:该论文旨在解决当前大型多模态模型(Large Multimodal Models, LMMs)在可穿戴设备中作为个人助手时,缺乏对用户特定视觉上下文进行有效利用的问题。其核心挑战在于如何让模型在推理阶段基于佩戴者独有的视觉信息(如个人物品、行为习惯和环境特征)来回答个性化查询,这一能力被作者定义为“个人视觉上下文学习”(Personal Visual Context Learning, Personal VCL)。为系统评估该能力,作者提出了Personal-VCL-Bench基准,用于衡量不同模型在个体层面的视觉上下文理解与应用效果。分析发现,现有LMMs存在显著的上下文利用差距,尤其是在整合多视角视觉证据方面机制薄弱。为此,论文提出“代理式上下文银行”(Agentic Context Bank),其关键创新在于构建一个自迭代更新的用户视觉记忆库,并引入查询自适应证据选择机制,从而在不依赖训练微调的前提下显著提升模型对个人视觉上下文的推理性能,为实现真正个性化的LMM提供了一条可行的推理时优化路径。

链接: https://arxiv.org/abs/2605.10936
作者: Zihui Xue,Ami Baid,Sangho Kim,Mi Luo,Kristen Grauman
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project website: this https URL

点击查看摘要

Abstract:As wearable devices like smart glasses integrate Large Multimodal Models (LMMs) into the continuous first-person visual streams of individual users, the evolution of these models into true personal assistants hinges on visual personalization: the ability to reason over visual information unique to the wearer. We formalize this capability as Personal Visual Context Learning (Personal VCL), the prompt-time capability of using user-specific visual context to resolve personalized queries. To systematically evaluate this, we present Personal-VCL-Bench, a comprehensive benchmark capturing the personal visual world across persons, objects, and behaviors. Our analysis of frontier LMMs identifies a profound context utilization gap, revealing that the mechanisms for leveraging visual evidence, as well as aggregating multiple visual observations, remain critically understudied. Motivated by these findings, we propose the Agentic Context Bank, a strong inference-time baseline that structures a user’s visual context into a self-refining memory bank and employs query-adaptive evidence selection. Our baseline approach consistently improves over standard context prompting regimes across tasks and evaluated backbones, demonstrating a practical path towards future personalized LMMs.
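
Query-adaptive evidence selection from a visual memory bank can be sketched as simple cosine-similarity retrieval. The embeddings, entry format, and absence of a bank-refinement loop here are placeholders relative to the Agentic Context Bank described in the abstract.

```python
import numpy as np

rng = np.random.default_rng(3)

# Placeholder memory bank built from a wearer's visual stream:
# one embedding per stored observation, plus a human-readable reference.
memory_embeddings = rng.normal(size=(500, 512))
memory_entries = [f"observation_{i}" for i in range(500)]

def normalize(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def select_evidence(query_embedding, k=5):
    """Return the k memory entries most similar to the query (cosine similarity)."""
    sims = normalize(memory_embeddings) @ normalize(query_embedding)
    top = np.argsort(-sims)[:k]
    return [(memory_entries[i], float(sims[i])) for i in top]

# Placeholder embedding of a personalized query such as "where did I leave my keys?"
query = rng.normal(size=512)
for entry, score in select_evidence(query):
    print(f"{entry}: similarity={score:.3f}")
```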

[CV-2] Variational Inference for Lévy Process-Driven SDEs via Neural Tilting ALT

【速读】:该论文旨在解决基于Lévy过程驱动的随机微分方程(SDEs)中贝叶斯推断的难题,特别是在处理极端事件和重尾现象时,现有方法存在局限性:蒙特卡洛方法虽严谨但缺乏可扩展性,而基于高斯假设的神经变分推断方法则无法有效捕捉跳跃特性。解决方案的关键在于提出一种神经指数倾斜(neural exponential tilting)框架,通过神经网络对Lévy测度进行指数重加权,构建灵活且保留跳跃结构的变分族;同时引入二次神经参数化实现倾斜测度的闭式归一化、稳定过程的条件高斯表示以支持高效模拟,并设计对称感知的蒙特卡洛估计器提升优化效率,从而在保持计算可行性的同时准确建模跳跃动态,在合成与真实数据上均展现出优于传统高斯变分方法的后验推断性能。

链接: https://arxiv.org/abs/2605.10934
作者: Yaman Kindap,Manfred Opper,Benjamin Dupuis,Umut Simsekli,Tolga Birdal
机构: Imperial College London (帝国理工学院); Technical University of Berlin (柏林工业大学); INRIA, CNRS, Département d’Informatique de l’Ecole Normale Supérieure / PSL (法国国家信息与自动化研究院、法国国家科学研究中心、巴黎文理研究大学计算机系)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Machine Learning (stat.ML)
备注: The associated project page which contains the official implementation can be found in this https URL

点击查看摘要

Abstract:Modelling extreme events and heavy-tailed phenomena is central to building reliable predictive systems in domains such as finance, climate science, and safety-critical AI. While Lévy processes provide a natural mathematical framework for capturing jumps and heavy tails, Bayesian inference for Lévy-driven stochastic differential equations (SDEs) remains intractable with existing methods: Monte Carlo approaches are rigorous but lack scalability, whereas neural variational inference methods are efficient but rely on Gaussian assumptions that fail to capture discontinuities. We address this tension by introducing a neural exponential tilting framework for variational inference in Lévy-driven SDEs. Our approach constructs a flexible variational family by exponentially reweighting the Lévy measure using neural networks. This parametrization preserves the jump structure of the underlying process while remaining computationally tractable. To enable efficient inference, we develop a quadratic neural parametrization that yields closed-form normalization of the tilted measure, a conditional Gaussian representation for stable processes that facilitates simulation, and symmetry-aware Monte Carlo estimators for scalable optimization. Empirically, we demonstrate that the method accurately captures jump dynamics and yields reliable posterior inference in regimes where Gaussian-based variational approaches fail, on both synthetic and real-world datasets.
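
Schematically, exponential tilting reweights the base Lévy measure by a learned function, as written below. The notation is generic, and the quadratic case is shown only to indicate why closed-form normalization becomes tractable in the paper's parametrization; the exact setup is not reproduced here.

```latex
% Schematic exponential tilting of a base Levy measure \nu by a neural function
% h_\theta (generic notation; the paper's exact parametrization and the
% normalization it derives in closed form are not reproduced here).
\[
  \nu_\theta(\mathrm{d}x) \;=\; \exp\!\big(h_\theta(x)\big)\,\nu(\mathrm{d}x),
  \qquad
  h_\theta(x) \;=\; a_\theta\,x^2 + b_\theta\,x + c_\theta \quad \text{(quadratic case)}.
\]
```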

[CV-3] Pixal3D: Pixel-Aligned 3D Generation from Images SIGGRAPH2026

【速读】:该论文旨在解决当前3D生成模型在图像到3D合成任务中 fidelity(保真度)不足的问题,其核心瓶颈在于隐式的2D-3D对应关系模糊:现有3D原生生成器通常在规范空间(canonical space)中合成形状,并通过注意力机制注入图像特征,导致像素与3D结构之间的映射不明确。解决方案的关键在于提出 Pixal3D——一种像素对齐的3D生成范式,它摒弃了规范空间生成方式,直接以输入视图一致的方式生成3D内容;其核心技术是引入像素反投影条件机制(pixel back-projection conditioning scheme),显式地将多尺度图像特征提升至3D特征体(feature volume),从而建立无歧义的像素到3D体素的直接对应关系,显著提升了生成结果的保真度,并可自然扩展至多视角生成和场景合成。

链接: https://arxiv.org/abs/2605.10922
作者: Dong-Yang Li,Wang Zhao,Yuxin Chen,Wenbo Hu,Meng-Hao Guo,Fang-Lue Zhang,Ying Shan,Shi-Min Hu
机构: BNRist, Department of Computer Science and Technology, Tsinghua University (清华大学); Tencent ARC Lab (腾讯AI研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: SIGGRAPH 2026. Project page: this https URL

点击查看摘要

Abstract:Recent advances in 3D generative models have rapidly improved image-to-3D synthesis quality, enabling higher-resolution geometry and more realistic appearance. Yet fidelity, which measures pixel-level faithfulness of the generated 3D asset to the input image, still remains a central bottleneck. We argue this stems from an implicit 2D-3D correspondence issue: most 3D-native generators synthesize shape in canonical space and inject image cues via attention, leaving pixel-to-3D associations ambiguous. To tackle this issue, we draw inspiration from 3D reconstruction and propose Pixal3D, a pixel-aligned 3D generation paradigm for high-fidelity 3D asset creation from images. Instead of generating in a canonical pose, Pixal3D directly generates 3D in a pixel-aligned way, consistent with the input view. To enable this, we introduce a pixel back-projection conditioning scheme that explicitly lifts multi-scale image features into a 3D feature volume, establishing direct pixel-to-3D correspondence without ambiguity. We show that Pixal3D is not only scalable and capable of producing high-quality 3D assets, but also substantially improves fidelity, approaching the fidelity level of reconstruction. Furthermore, Pixal3D naturally extends to multi-view generation by aggregating back-projected feature volumes across views. Finally, we show pixel-aligned generation benefits scene synthesis, and present a modular pipeline that produces high-fidelity, object-separated 3D scenes from images. Pixal3D for the first time demonstrates 3D-native pixel-aligned generation at scale, and provides a new inspiring way towards high-fidelity 3D generation of object or scene from single or multi-view images. Project page: this https URL

[CV-4] Confidence-Guided Diffusion Augmentation for Enhanced Bangla Compound Character Recognition

【速读】:该论文旨在解决手写孟加拉语复合字符(handwritten Bangla compound characters)识别难题,其核心挑战在于字符结构复杂、类内差异大以及高质量标注数据稀缺。为提升模型在不同书写风格下的泛化能力,尤其是针对包含复杂连笔和变音符号的复合字符,作者提出了一种基于置信度引导的扩散增强框架(confidence-guided diffusion augmentation framework)。该方案的关键创新在于:1)结合类别条件扩散建模与分类器引导机制,生成高质量的手写复合字符样本;2)在扩散模型的U-Net骨干网络中引入Squeeze-and-Excitation增强残差模块以提升生成质量;3)设计置信度过滤机制,利用预训练分类器作为质量门控,仅保留高类别一致性合成样本,并将其与原始训练数据融合后用于多架构再训练。实验表明,该方法显著提升了ResNet50、DenseNet121、VGG16及Vision Transformer等模型在AIBangla数据集上的性能,最优模型达到89.2%准确率,超越现有基准。

链接: https://arxiv.org/abs/2605.10916
作者: Md. Sultan Al Rayhan,Maheen Islam
机构: East West University(东西大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recognition of handwritten Bangla compound characters remains a challenging problem due to complex character structures, large intra-class variation, and limited availability of high-quality annotated data. Existing Bangla handwritten character recognition systems often struggle to generalize across diverse writing styles, particularly for compound characters containing intricate ligatures and diacritical variations. In this work, we propose a confidence-guided diffusion augmentation framework for low-resolution Bangla compound character recognition. Our framework combines class-conditional diffusion modeling with classifier guidance to synthesize high-quality handwritten compound character samples. To further improve generation quality, we introduce Squeeze-and-Excitation enhanced residual blocks within the diffusion model’s U-Net backbone. We additionally propose a confidence-based filtering mechanism where pre-trained classifiers act as quality gates to retain only highly class-consistent synthetic samples. The filtered synthetic images are fused with the original training data and used to retrain multiple classification architectures. Experiments conducted on the AIBangla compound character dataset demonstrate consistent performance improvements across ResNet50, DenseNet121, VGG16, and Vision Transformer architectures. Our best-performing model achieves 89.2% classification accuracy, surpassing the previously published AIBangla benchmark by a substantial margin. The results demonstrate that quality-aware diffusion augmentation can effectively enhance handwritten character recognition performance in low-resource script domains.
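下面是摘要中 "confidence-based filtering"(分类器作为质量门控)这一步的最小示意实现,假设使用 PyTorch 分类器;阈值 0.9 与函数名均为示例性假设,并非论文给定值。

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def filter_synthetic_samples(classifier, images, target_labels, threshold=0.9):
    """Keep only synthetic images that a pre-trained classifier assigns to the
    intended class with high confidence (the 'quality gate' idea)."""
    classifier.eval()
    logits = classifier(images)                 # (N, num_classes)
    probs = F.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)
    keep = (pred == target_labels) & (conf >= threshold)
    return images[keep], target_labels[keep]

# Retained samples would then be fused with the original training data before
# retraining ResNet50 / DenseNet121 / VGG16 / ViT classifiers.
```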

[CV-5] CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

【速读】:该论文旨在解决预训练视觉-语言-动作(VLA)模型在标准监督微调(SFT)过程中性能提升有限且适应成本较高的问题。现有高级微调方法虽通过引入辅助训练目标改善性能并减少收敛步数,但往往因额外损失项带来显著计算开销。其解决方案的关键在于将辅助目标微调中的两个核心目标——增强通用能力与拟合任务特定动作分布——在参数空间中解耦:首先使用两种不同训练策略在小规模任务集上分别微调得到两个模型,二者参数差异被解释为由辅助目标提供的能力向量;随后将这些能力向量与预训练参数融合,构建出能力增强的元模型。此外,结合轻量级正交正则化损失进行标准SFT时,该方法可在保持性能接近辅助微调基线的同时显著降低计算开销。

链接: https://arxiv.org/abs/2605.10903
作者: Wenxuan Song,Han Zhao,Fuhao Li,Ziyang Zhou,Xi Wang,Jing Lyu,Pengxiang Ding,Yan Wang,Donglin Wang,Haoang Li
机构: Westlake University (西湖大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised finetuning (SFT). Some advanced finetuning methods with auxiliary training objectives can improve performance and reduce the number of convergence steps. However, they typically incur significant computational overhead due to the additional losses from auxiliary objectives. To simultaneously achieve the enhanced capabilities of auxiliary training with the simplicity of standard SFT, we decouple the two objectives of auxiliary-objective SFT within the parameter space, namely, enhancing general capabilities and fitting task-specific action distributions. To deliver the goal, we only need to train the model to converge on a small-scale task set using two distinct training strategies, resulting in two finetuned models. The parameters’ difference between the two models can then be interpreted as capability vectors provided by auxiliary objectives. These vectors are then merged with pretrained parameters to form a capability-enhanced meta model. Moreover, when standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary finetuned baselines with reduced computational overhead. Internal and external experiments demonstrate that our capability vectors (1) are effective and versatile across diverse models, (2) can generalize to novel environments and embodiments out of the box.
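下面用任务向量(task-arithmetic)风格的代码草图示意摘要中 "capability vector" 的构造与合并:能力向量取两种训练策略得到的模型参数之差,再加回预训练参数。缩放系数 `alpha` 与函数名为假设,非论文原始实现。

```python
import torch

def capability_vector(aux_finetuned, standard_finetuned):
    """Parameter difference between the auxiliary-objective model and the plain-SFT
    model, interpreted as the capability contributed by the auxiliary objectives."""
    return {k: aux_finetuned[k] - standard_finetuned[k] for k in aux_finetuned}

def merge_into_pretrained(pretrained, cap_vec, alpha=1.0):
    """Add the (optionally scaled) capability vector to the pretrained parameters
    to form a capability-enhanced meta model."""
    return {k: v + alpha * cap_vec.get(k, torch.zeros_like(v))
            for k, v in pretrained.items()}
```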

[CV-6] Counterfactual Stress Testing for Image Classification Models

【速读】:该论文旨在解决医学影像中深度学习模型在新临床环境中部署时因分布偏移(distribution shifts)而导致性能下降的问题,特别是由人口统计学特征、扫描设备或成像协议变化引起的不确定性。其核心挑战在于模型的“欠指定性”(underspecification),即多个验证性能相近的模型在实际应用中表现出截然不同的失败模式。为应对这一问题,作者提出了一种基于因果生成模型(causal generative models)的反事实压力测试(counterfactual stress testing)框架,其关键创新在于通过干预图像中的特定属性(如扫描仪类型或患者性别)来生成语义上合理且保持解剖一致性的“如果”图像,从而实现对目标分布偏移的受控评估。相比传统依赖简单扰动(如亮度或对比度调整)的方法,该方案能更准确地预测模型在真实域外场景下的表现,包括性能变化的方向与相对幅度,并有效提升模型排名的一致性,表明因果生成模型可作为医疗人工智能系统部署前鲁棒性评估的可靠模拟器。

链接: https://arxiv.org/abs/2605.10894
作者: Moritz Stammel,Fabio De Sousa Ribeiro,Raghav Mehta,Mélanie Roschewitz,Ben Glocker
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning models in medical imaging often fail when deployed in new clinical environments due to distribution shifts in demographics, scanner hardware, or acquisition protocols. A central challenge is underspecification, where models with similar validation performance exhibit divergent real-world failure modes. Although stress testing has emerged as a tool to assess this, current methods typically rely on simple, uninformed perturbations (e.g., brightness or contrast changes), which fail to capture clinically realistic variation and can overestimate robustness. In this work, we introduce a counterfactual stress testing framework based on causal generative models that create realistic “what if” images by intervening on attributes such as scanner type and patient sex while preserving anatomical identity, enabling controlled and semantically meaningful evaluation under targeted distribution shifts. Across two imaging modalities (chest X-ray and mammography), three model architectures, and multiple shift scenarios, we show that counterfactual stress tests provide a substantially more accurate proxy for real out-of-distribution performance than classical perturbations, capturing the direction and relative magnitude of performance changes as well as model ranking. These results suggest that causal generative models can serve as practical simulators for robustness assessment, offering a more reliable basis for evaluating medical AI systems prior to deployment.
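以下代码草图示意反事实压力测试的评估流程:对同一批图像分别在原始版本与因果生成模型产生的反事实版本上评估分类器,并比较性能差。`cf_generate(images, intervention)` 是代表因果生成模型的占位函数,非论文提供的接口。

```python
import torch

@torch.no_grad()
def counterfactual_stress_test(model, loader, cf_generate, intervention):
    """Compare performance on factual images vs. counterfactuals produced by a
    causal generative model under a targeted intervention (e.g. scanner type)."""
    correct_f, correct_cf, n = 0, 0, 0
    for images, labels in loader:
        cf_images = cf_generate(images, intervention)           # "what if" images
        correct_f += (model(images).argmax(-1) == labels).sum().item()
        correct_cf += (model(cf_images).argmax(-1) == labels).sum().item()
        n += labels.numel()
    acc_f, acc_cf = correct_f / n, correct_cf / n
    return acc_f, acc_cf, acc_cf - acc_f    # signed robustness gap under the shift
```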

[CV-7] Count Anything at Any Granularity

【速读】:该论文旨在解决开放世界下物体计数(open-world object counting)的脆弱性问题,即当前视觉语言模型(VLMs)在用户意图明确的情况下仍难以准确计数目标对象。核心问题在于现有方法将“计数目标”简化为单一类别级别的匹配任务,忽略了用户可能指代的具体身份、属性、实例类型、类别或抽象概念等多粒度语义层次。解决方案的关键在于重新定义开放世界计数为多粒度计数(multi-grained counting),通过视觉样例指定目标外观,并结合细粒度文本描述(含可选负向提示)明确语义粒度层级(共五级)。为支撑这一新范式,作者提出首个全自动数据扩展流水线,融合可控3D合成、一致图像编辑与基于VLM的过滤机制,构建了目前最大且最全面标注的计数数据集KubriCount。在此基础上训练的HieraCount模型,利用文本与视觉样例作为互补的目标规范,显著提升了多粒度计数精度并增强了对真实场景的泛化能力。

链接: https://arxiv.org/abs/2605.10887
作者: Chang Liu,Haoning Wu,Weidi Xie
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Open-world object counting remains brittle: despite rapid advances in vision-language models (VLMs), reliably counting the objects a user intends is far from solved. We argue that a central reason is that counting granularity is left implicit; users may refer to a specific identity, an attribute, an instance type, a category, or an abstract concept, yet most methods treat “what to count” as a single, category-level matching problem. In this work, we redefine open-world counting as multi-grained counting, where visual exemplars specify target appearance and fine-grained text, with optional negative prompts, specifies the intended semantic granularity across five explicit levels. Making granularity explicit, however, exposes a critical data bottleneck: existing counting datasets lack the multi-category scenes, controlled distractors, and instance-level annotations needed to verify fine-grained prompt semantics. To address this, we propose the first fully automatic data-scaling pipeline that integrates controllable 3D synthesis with consistent image editing and VLM-based filtering, and use it to construct KubriCount, the largest and most comprehensively annotated counting dataset to date, supporting both training and multi-grained evaluation. Systematic benchmarking reveals that both multimodal large language models and specialist counting models exhibit severe prompt-following failures under fine-grained distinctions. Motivated by these findings, we train HieraCount, a multi-grained counting model that jointly leverages text and visual exemplars as complementary target specifications. HieraCount substantially improves multi-grained counting accuracy and generalizes robustly to challenging real-world scenarios. The project page is available here: this https URL.

[CV-8] Geometry-aware Prototype Learning for Cross-domain Few-shot Medical Image Segmentation

【速读】:该论文旨在解决跨域少样本医学图像分割(Cross-domain few-shot medical image segmentation, CD-FSMIS)中模型在面对新解剖类别和未见成像域时泛化能力不足的问题。现有基于原型的方法往往将解剖结构与域特定的外观变化纠缠在一起,导致在域迁移下匹配不稳定。解决方案的关键在于引入几何先验——提出GeoProto框架,其核心组件几何感知原型增强(Geometry-Aware Prototype Enrichment, GAPE)通过一个辅助的序数形状分支(Ordinal Shape Branch, OSB)学习每个局部外观原型的几何偏移量,该偏移量编码器官内部拓扑中的序数位置,从而在原型匹配中显式融入可迁移的几何结构信息,无需额外标注即可实现鲁棒的跨域分割性能。

链接: https://arxiv.org/abs/2605.10885
作者: Feifan Song,Yuntian Bo,Haofeng Zhang
机构: Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cross-domain few-shot medical image segmentation (CD-FSMIS) requires a model to generalise simultaneously to novel anatomical categories and unseen imaging domains from only a handful of annotated examples. Existing prototypical approaches inevitably entangle anatomical structure with domain-specific appearance variations, and thus lack a stable reference for reliable matching under domain shift. We observe that the geometric structure of human anatomy constitutes a reliable, domain-transferable prior that has been overlooked. Building on this insight, we propose GeoProto, a geometry-aware CD-FSMIS framework that enriches prototypical matching with explicit structural priors. The core component, Geometry-Aware Prototype Enrichment (GAPE), augments each local appearance prototype with a learned geometric offset encoding its ordinal position within the organ’s interior topology. This offset is derived from an auxiliary Ordinal Shape Branch (OSB) trained under an ordinally consistent objective that enforces monotonic variation of geometric embeddings across interior strata, requiring no annotation beyond standard segmentation masks. Extensive experiments across seven datasets spanning three evaluation settings (cross-modality, cross-sequence, and cross-context) demonstrate that GeoProto achieves state-of-the-art performance.

[CV-9] CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation

【速读】:该论文旨在解决从图像或3D观测中恢复可编辑的计算机辅助设计(CAD)程序这一核心挑战,该任务是实现AI辅助设计的关键环节,但长期以来因评估标准分散于不同数据集、输入模态和指标而难以量化进展。解决方案的关键在于提出一个统一的基准测试平台——CADBench,其包含18,000个评估样本,覆盖六个来自DeepCAD、Fusion 360、ABC、MCB和Objaverse的基准家族,五种输入模态(包括干净网格、噪声网格、单视图渲染、真实感渲染和多视图渲染),以及六种涵盖几何保真度、可执行性和程序紧凑性的评价指标;同时通过基于STEP的B-rep面数分层与多样性采样策略支持对复杂度与对象变化的受控分析。实验表明,专用的网格到CAD模型在理想输入下显著优于通用视觉-语言模型(VLMs),且揭示了三项常见失败模式:几何复杂度增加导致重建质量下降、专用模型在模态迁移下易出现脆弱性、不同指标下模型排名不一致,从而将CADBench定位为诊断可编辑3D重建与多模态CAD理解进展的重要测试平台。

链接: https://arxiv.org/abs/2605.10873
作者: Anna C. Doris,Jacob Thomas Sony,Ghadi Nehme,Era Syla,Amin Heyrani Nobari,Faez Ahmed
机构: Massachusetts Institute of Technology (麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recovering editable CAD programs from images or 3D observations is central to AI-assisted design, but progress is difficult to measure because existing evaluations are fragmented across datasets, modalities, and metrics. We introduce CADBench, a unified benchmark for multimodal CAD program generation. CADBench contains 18,000 evaluation samples spanning six benchmark families derived from DeepCAD, Fusion 360, ABC, MCB, and Objaverse; five input modalities including clean meshes, noisy meshes, single-view renders, photorealistic renders, and multi-view renders; and six metrics covering geometric fidelity, executability, and program compactness. STEP-based families are stratified by B-rep face count and all families are diversity-sampled to support controlled analysis across complexity and object variation. We benchmark eleven CAD-specialized and general-purpose vision-language systems, generating more than 1.4 million CAD programs. Under idealized inputs, specialized mesh-to-CAD models substantially outperform code-generating VLMs, which remain far from reliable CAD program reconstruction. CADBench further reveals three recurring failure modes: reconstruction quality degrades with geometric complexity, CAD-specialized models can be brittle under modality shift, and model rankings change across metrics. Together, these results position CADBench as a diagnostic testbed for measuring progress in editable 3D reconstruction and multimodal CAD understanding. The benchmark is publicly available at this https URL.

[CV-10] BEACON: A Multimodal Dataset for Learning Behavioral Fingerprints from Gameplay Data

【速读】:该论文旨在解决高风险数字环境中连续认证(Continuous Authentication)研究中缺乏大规模、多模态且具备真实认知与运动负荷的基准数据集的问题。现有基准常受限于数据规模小、感知模态单一或缺少同步环境上下文信息,难以充分评估行为生物特征在复杂场景下的鲁棒性。解决方案的关键在于提出BEACON(Behavioral Engine for Authentication Continuous Monitoring)数据集,其通过捕捉《Valorant》竞技游戏中79个会话、28名玩家的约102.51小时高保真行为信号,实现了多模态数据(包括高频鼠标动态、按键事件、网络包捕获、屏幕录制、硬件元数据及游戏配置上下文)的同步采集与标注,从而为连续认证、行为建模、用户漂移分析及多模态表征学习提供了一个具有现实挑战性的高精度测试平台。

链接: https://arxiv.org/abs/2605.10867
作者: Ishpuneet Singh,Gursmeep Kaur,Uday Pratap Singh Atwal,Guramrit Singh,Gurjot Singh,Maninder Singh
机构: Thapar Institute of Engineering and Technology (泰帕尔工程与技术学院); University of Waterloo (滑铁卢大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:Continuous authentication in high-stakes digital environments requires datasets with fine-grained behavioral signals under realistic cognitive and motor demands. However, current benchmarks are often limited by small scale, unimodal sensing, or a lack of synchronised environmental context. To address this gap, this paper introduces BEACON (Behavioral Engine for Authentication & Continuous Monitoring), a large-scale multimodal dataset that captures diverse skill tiers in competitive Valorant gameplay. BEACON contains approximately 430 GB of synchronised modality data (461 GB total on-disk including auxiliary Valorant configuration captures) from 79 sessions across 28 distinct players, estimated at 102.51 hours of active gameplay, including high-frequency mouse dynamics, keystroke events, network packet captures, screen recordings, hardware metadata, and in-game configuration context. BEACON leverages the high-precision motor skills and high cognitive load inherent to tactical shooters, making it a rigorous stress test for the robustness of behavioral biometrics. The dataset allows for the study of continuous authentication, behavioral profiling, user drift, and multimodal representation learning in a high-fidelity esports setting. The authors release the dataset and code on Hugging Face and GitHub to create a reproducible benchmark for evaluating next-generation behavioral fingerprinting and security models.

[CV-11] BenchCAD: A Comprehensive Industry-Standard Benchmark for Programmatic CAD

【速读】:该论文旨在解决工业计算机辅助设计(Computer-Aided Design, CAD)代码自动生成中模型缺乏对复杂3D结构理解与参数化抽象能力的问题,特别是在多模态输入(如视觉或文本)下生成可执行的参数化程序时,现有大型多模态语言模型(Multimodal Large Language Models, MLLMs)在真实工业场景中的表现尚未得到系统评估。解决方案的关键在于提出BenchCAD——一个统一的基准测试平台,包含17,900个经过执行验证的CadQuery程序,覆盖106类工业零件家族,涵盖从齿轮到钻头等典型工程设计。该基准通过视觉问答、代码问答、图像到代码生成和指令引导的代码编辑四项任务,实现对感知、参数抽象和可执行程序合成的细粒度分析,从而揭示当前模型在生成准确参数化CAD程序方面的局限性,并为提升工业级多模态CAD自动化提供量化评估标准。

链接: https://arxiv.org/abs/2605.10865
作者: Haozhe Zhang,Kaichen Liu,Miaomiao Chen,Lei Li,Shaojie Yang,Cheng Peng,Hanjie Chen
机构: University of Virginia (弗吉尼亚大学); University of XXX (XXX大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
备注: 9 page 7 figures

点击查看摘要

Abstract:Industrial Computer-Aided Design (CAD) code generation requires models to produce executable parametric programs from visual or textual inputs. Beyond recognizing the outer shape of a part, this task involves understanding its 3D structure, inferring engineering parameters, and choosing CAD operations that reflect how the part would be designed and manufactured. Despite the promise of Multimodal large language models (MLLMs) for this task, they are rarely evaluated on whether these capabilities jointly hold in realistic industrial CAD settings. We present BenchCAD, a unified benchmark for industrial CAD reasoning. BenchCAD contains 17,900 execution-verified CadQuery programs across 106 industrial part families, including bevel gears, compression springs, twist drills, and other reusable engineering designs. It evaluates models through visual question answering, code question answering, image-to-code generation, and instruction-guided code editing, enabling fine-grained analysis across perception, parametric abstraction, and executable program synthesis. Across 10+ frontier models, BenchCAD shows that current systems often recover coarse outer geometry but fail to produce faithful parametric CAD programs. Common failures include missing fine 3D structure, misinterpreting industrial design parameters, and replacing essential operations such as sweeps, lofts, and twist-extrudes with simpler sketch-and-extrude patterns. Fine-tuning and reinforcement learning improve in-distribution performance, but generalization to unseen part families remains limited. These results position BenchCAD as a benchmark for measuring and improving the industrial readiness of multimodal CAD automation.

[CV-12] Masked Generative Transformer Is What You Need for Image Editing CVPR2026

【速读】:该论文旨在解决扩散模型(Diffusion Models)在图像编辑中因全局去噪机制导致修改区域与周围上下文纠缠的问题,从而引发编辑内容扩散至本应保持不变的区域。其解决方案的关键在于采用基于掩码生成式Transformer(Masked Generative Transformers, MGTs)的新范式,利用局部token预测机制天然限制修改范围;并通过多层注意力聚合(multi-layer attention consolidation)提取精确的编辑定位信号,以及区域保持采样(region-hold sampling)显式防止非目标区域的token翻转,实现高精度、高效且可控的图像编辑。

链接: https://arxiv.org/abs/2605.10859
作者: Wei Chow,Linfeng Li,Xian Sun,Lingdong Kong,Zefeng Li,Qi Xu,Hang Song,Tian Ye,Xian Wang,Jinbin Bai,Shilin Xu,Xiangtai Li,Junting Pan,Shaoteng Liu,Ran Zhou,Tianshu Yang,Songhua Liu
机构: ByteDance(字节跳动); National University of Singapore(新加坡国立大学); Duke University(杜克大学); Shanghai Jiao Tong University(上海交通大学); HKUST(GZ)(香港科技大学(广州))

类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: CVPR 2026 HiGen Workshop; Project Page at this https URL GitHub at this https URL

点击查看摘要

Abstract:Diffusion models dominate image editing, yet their global denoising mechanism entangles edited regions with surrounding context, causing modifications to propagate into areas that should remain intact. We propose a fundamentally different approach by leveraging Masked Generative Transformers (MGTs), whose localized token-prediction paradigm naturally confines changes to intended regions. We present EditMGT, an MGT-based editing framework that is the first of its kind. Our approach employs multi-layer attention consolidation to aggregate cross-attention maps into precise edit localization signals, and region-hold sampling to explicitly prevent token flipping in non-target areas. To support training, we construct CrispEdit-2M, a 2M-sample high-resolution (1024) editing dataset spanning seven categories. With only 960M parameters, EditMGT achieves state-of-the-art image similarity on multiple benchmarks while delivering 6x faster editing, demonstrating that MGTs offer a compelling alternative to diffusion-based editing.
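下面给出 "region-hold sampling" 思路的示意代码:在掩码生成式 Transformer 的迭代解码中,只有编辑区域内的 token 参与重采样,非目标区域的 token 始终保持原值。`mgt(tokens, cond)` 的接口、`mask_id` 与揭示调度均为示例性假设,并非 EditMGT 的官方实现。

```python
import torch

@torch.no_grad()
def region_hold_sampling(mgt, src_tokens, edit_mask, cond, steps=12, mask_id=8191):
    """Iterative masked-token decoding restricted to an edit region.
    src_tokens: (L,) source image tokens; edit_mask: (L,) bool, True where edits
    are allowed; all other positions are held fixed ('region hold')."""
    tokens = src_tokens.clone()
    tokens[edit_mask] = mask_id                     # only the edit region starts masked
    for step in range(steps):
        logits = mgt(tokens, cond)                  # assumed to return (L, vocab)
        conf, pred = logits.softmax(-1).max(-1)
        still_masked = tokens == mask_id
        # reveal the most confident fraction of still-masked edit tokens
        k = max(1, int(still_masked.sum().item() * (step + 1) / steps))
        conf = conf.masked_fill(~still_masked, -1.0)
        reveal_idx = conf.topk(k).indices
        tokens[reveal_idx] = pred[reveal_idx]
        tokens[~edit_mask] = src_tokens[~edit_mask] # hold non-target regions fixed
    return tokens
```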

[CV-13] Is Your Driving World Model an All-Around Player? CVPR2026

【速读】:该论文旨在解决当前生成式世界模型(World Models)在评估体系上的局限性问题,即现有方法主要关注生成视频的视觉真实感(如像素质量和纹理细节),而忽视了其物理合理性与行为一致性,导致模型在实际闭环驾驶任务中表现不佳。解决方案的关键在于提出一个统一的基准测试工具 WorldLens,涵盖从像素级质量、四维几何一致性到闭环驾驶性能及人类感知对齐等五个维度共24项标准化指标,并构建了 WorldLens-26K 人类偏好数据集(含26,808条带理由标注样本)和基于该数据训练的 Vision-Language 评估代理 WorldLens-Agent,从而实现对生成世界模型的多维、可解释、可扩展的自动评估,真正将算法指标与人类感知相连接。

链接: https://arxiv.org/abs/2605.10858
作者: Lingdong Kong,Ao Liang,Tianyi Yan,Hongsi Liu,Wesley Yang,Ziqi Huang,Xian Sun,Wei Yin,Jialong Zuo,Yixuan Hu,Dekai Zhu,Dongyue Lu,Youquan Liu,Guangfeng Jiang,Linfeng Li,Xiangtai Li,Long Zhuo,Lai Xing Ng,Benoit R. Cottereau,Changxin Gao,Liang Pan,Wei Tsang Ooi,Ziwei Liu
机构: NUS(新加坡国立大学); UM(马来西亚大学); USTC(中国科学技术大学); ZJU(浙江大学); NTU(南洋理工大学); Duke(杜克大学); Horizon(远景科技集团); HUST(华中科技大学); TUM(慕尼黑工业大学); FDU(复旦大学); SH Lab(上海实验室); A*STAR(新加坡科技研究局); CNRS(法国国家科学研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: CVPR 2026 VideoWorldModel Workshop; Project Page at this https URL GitHub at this https URL

点击查看摘要

Abstract:Today’s driving world models can generate remarkably realistic dash-cam videos, yet no single model excels universally. Some generate photorealistic textures but violate basic physics; others maintain geometric consistency but fail when subjected to closed-loop planning. This disconnect exposes a critical gap: the field evaluates how real generated worlds appear, but rarely whether they behave realistically. We introduce WorldLens, a unified benchmark that measures world-model fidelity across the full spectrum, from pixel quality and 4D geometry to closed-loop driving and human perceptual alignment, through five complementary aspects and 24 standardized dimensions. Our evaluation of six representative models reveals that no existing approach dominates across all axes: texture-rich models violate geometry, geometry-aware models lack behavioral fidelity, and even the strongest performers achieve only 2-3 out of 10 on human realism ratings. To bridge algorithmic metrics with human perception, we further contribute WorldLens-26K, a 26,808-entry human-annotated preference dataset pairing numerical scores with textual rationales, and WorldLens-Agent, a vision-language evaluator distilled from these judgments that enables scalable, explainable auto-assessment. Together, the benchmark, dataset, and agent form a unified ecosystem for assessing generated worlds not merely by visual appeal, but by physical and behavioral fidelity.

[CV-14] Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA

【速读】:该论文旨在解决医学视觉问答(Medical Visual Question Answering, VQA)中依赖自验证(self-verification)作为安全层的可靠性问题。当前实践中,通过在新上下文中重新调用同一视觉语言模型(Vision Language Model, VLM)来验证其自身生成的答案,被视为默认的安全机制,但作者指出这一做法本质上不可靠。解决方案的关键在于提出一种诊断框架——[METHOD NAME],用于刻画医学VLM自验证的可靠性边界,该框架将验证器行为分解为判别能力(discrimination capability)和一致性偏差(agreement bias)。研究发现,由于验证器与生成器之间存在能力耦合(capacity-coupled),验证器往往过度认同生成器输出,导致“验证幻象”(verification mirage)现象:即高错误率与高一致性偏差并存,源于对错误答案的错误接受。实验表明,该边界强烈依赖于任务类型,知识密集型临床任务最易陷入幻象,感知类任务居中,简单任务最具抵抗力;此外,验证器对图像证据关注度低于生成器,形成“懒惰验证器”(lazy verifier)现象,进一步削弱了验证的独立安全性。

链接: https://arxiv.org/abs/2605.10850
作者: Ruinan Jin,Beidi Zhao,Myeongkyun Kang,Qiong Zhang,Xiaoxiao Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 31 pages, 12 figures

点击查看摘要

Abstract:Self-verification, re-invoking the same vision language model (VLM) in a fresh context to check its own generated answer, is increasingly used as a default safety layer for medical visual question answering (VQA). We argue that this practice is fundamentally unreliable. We introduce [METHOD NAME], a diagnostic framework for mapping the reliability boundary of medical VLM self-verification by decomposing verifier behavior into discrimination capability and agreement bias. Because the verifier and answer generator are capacity-coupled, the verifier can overly agree with the generator, creating a verification mirage: a regime with both high verifier error and high agreement bias, driven by false acceptance of incorrect answers. Evaluating six open-weight VLMs across five medical VQA datasets and seven medical tasks, we find that this boundary is strongly task-conditioned. Knowledge-intensive clinical tasks fall deepest into the mirage, simpler tasks are more resistant, and perceptual tasks lie in between. Verification also fails to provide an independent safety signal: logistic mixed-effects analysis shows that verifier error and agreement bias become more likely when the generator is wrong, while saliency analyses show that verifiers under-attend to image evidence relative to generators, a phenomenon we call the lazy verifier. Cross-verification reduces but does not eliminate the mirage. Moreover, when verification is reused in multi-turn actor-verifier loops, most initially wrong answers become locked in by false verification. Since our experiments use clean benchmarks, the observed reliability boundary likely underestimates failures in real clinical deployment.
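下面的小段代码示意如何把验证器行为拆成 "discrimination" 与 "agreement bias" 两类统计量。这里给出的是一种简单的可操作化定义,仅供说明,论文中的精确定义可能不同。

```python
import numpy as np

def verifier_diagnostics(gen_correct, verifier_accepts):
    """Decompose self-verification behaviour into discrimination and agreement bias.
    gen_correct, verifier_accepts: boolean arrays over the same QA items.
    (A simple operationalisation for illustration; the paper's exact definitions may differ.)"""
    gen_correct = np.asarray(gen_correct, bool)
    accepts = np.asarray(verifier_accepts, bool)
    tpr = accepts[gen_correct].mean()        # accept rate on correct answers
    fpr = accepts[~gen_correct].mean()       # false acceptance of wrong answers
    return {
        "discrimination": tpr - fpr,         # ability to tell right from wrong
        "agreement_bias": accepts.mean(),    # overall tendency to agree
        "verifier_error": (accepts != gen_correct).mean(),
        "false_accept_rate": fpr,
    }
```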

[CV-15] Transcoda: End-to-End Zero-Shot Optical Music Recognition via Data-Centric Synthetic Training

【速读】:该论文旨在解决光学乐谱识别(Optical Music Recognition, OMR)领域中两大核心挑战:一是缺乏大规模、高质量的真实扫描乐谱标注数据,导致模型依赖少量样本迁移或过于简化的合成训练;二是kern格式中存在的非唯一性编码问题(即同一视觉乐谱可对应多个不同的文本表示),这增加了模型学习难度并引入解码不确定性。解决方案的关键在于三点:(i) 构建先进的合成数据生成管道以提升训练数据的多样性与真实性;(ii) 对kern格式进行规范化处理,强制其输出唯一标准形式,从而消除一物多形带来的歧义;(iii) 采用基于语法的解码机制,确保输出结果在结构上符合语法规则。该方法使一个仅59M参数的模型在单张GPU上训练6小时即可超越数十亿参数的基线系统,在新构建的合成评分基准上达到18.46%的OMR-NED误差率,并显著降低历史波兰乐谱扫描的错误率至63.97%。

链接: https://arxiv.org/abs/2605.10835
作者: Daniel Dratschuk,Paul Swoboda
机构: Heinrich Heine University Düsseldorf (海因里希海涅大学杜塞尔多夫分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 13 pages, 7 figures

点击查看摘要

Abstract:Optical Music Recognition (OMR), the task of transcribing sheet music into a structured textual representation, is currently bottlenecked by a lack of large-scale, annotated datasets of real scans. This forces models to rely on either few-shot transfer or synthetic training pipelines that remain overly simplistic. A secondary challenge is encoding non-uniqueness: in the popular Humdrum **kern format for transcribing music, multiple different text encodings can render into the same visual sheet music. This one-to-many mapping creates a harder learning task and introduces high uncertainty during decoding. We propose Transcoda, an OMR system built on (i) an advanced synthetic data generation pipeline, (ii) a normalization of the **kern encoding to enforce a unique normal form and (iii) grammar-based decoding to ensure the syntactic correctness of the output. This approach allows us to train a compact 59M-parameter model in just 6 hours on a single GPU that outperforms billion-parameter baselines. Transcoda achieves the best score among state of the art baselines on a newly curated benchmark of synthetically rendered scores at 18.46% OMR-NED (compared to 43.91% for the next-best system, Legato) and reduces the error rate on historical Polish scans to 63.97% OMR-NED (down from 80.16% for SMT++).
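下面用一个语法约束贪心解码的示意代码说明摘要中 "grammar-based decoding" 的基本做法:每一步只允许语法合法的 token。`allowed_next_tokens(prefix_ids)` 是代表规范化 **kern 语法检查器的占位函数,模型接口与词表 id 均为假设。

```python
import torch

@torch.no_grad()
def grammar_constrained_decode(model, start_ids, allowed_next_tokens,
                               max_len=1024, eos_id=2):
    """Greedy decoding that only emits tokens the grammar permits.
    `allowed_next_tokens(prefix_ids)` is a stand-in for a checker over the
    normalised **kern grammar and returns a list of legal token ids."""
    ids = list(start_ids)
    for _ in range(max_len):
        logits = model(torch.tensor([ids]))[0, -1]     # assumed (1, L, vocab) output
        mask = torch.full_like(logits, float("-inf"))
        mask[allowed_next_tokens(ids)] = 0.0           # zero out only legal tokens
        next_id = int((logits + mask).argmax())
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids
```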

[CV-16] MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection

【速读】:该论文旨在解决工业异常检测(Industrial Anomaly Detection)中现有数据集多局限于静态图像或稀疏视角,难以反映真实工业场景下连续、多视角视频检测需求的问题。其关键解决方案是提出首个面向工业场景的连续多视角视频数据集MMVIAD(Multi-view Multi-task Video Industrial Anomaly Detection),并构建支持异常检测、缺陷分类、物体分类和异常可见时间定位的多任务评估基准。为提升模型在未见场景中的迁移能力,进一步设计了两阶段后训练流程:第一阶段采用感知结构监督微调(PS-SFT)建立结构化推理能力,第二阶段通过可见性引导的工业结构时序异常组相对策略优化(VISTA-GRPO)引入语义门控缺陷奖励与可见性感知的时间奖励机制,最终得到性能显著优于基线模型的VISTA模型,在MMVIAD-Unseen上的平均得分从45.0提升至57.5,超越GPT-5.4。

链接: https://arxiv.org/abs/2605.10833
作者: Xiran Zhao,Jing Jin,Yan Bai,Zhongan Wang,Yifeng Sun,Yihang Lou,Xuanyu Zhu,Tao Feng,Yingna Wu
机构: ShanghaiTech University (上海科技大学); Tsinghua University (清华大学); Meituan Inc. (美团); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Industrial anomaly detection is critical for manufacturing quality control, yet existing datasets mainly focus on static images or sparse views, which do not fully reflect continuous inspection processes in real industrial scenarios. We introduce MMVIAD (Multi-view Multi-task Video Industrial Anomaly Detection), to the best of our knowledge the first continuous multi-view video dataset for industrial anomaly detection and understanding, together with a benchmark for multi-task evaluation. MMVIAD contains object-centric 2-second inspection clips with approximately 120 degrees of camera motion, covering 48 object categories, 14 environments, and 6 structural anomaly types. It supports anomaly detection, defect classification, object classification, and anomaly visible-time localization. Systematic evaluations on MMVIAD show that current commercial and open-source video MLLMs remain far below human performance, especially for fine-grained defect recognition and temporal grounding. To improve transferable anomaly understanding, we further develop a two-stage post-training pipeline where PS-SFT (Perception-Structured Supervised Fine-Tuning) initializes perception-structured reasoning and VISTA-GRPO (Visibility-grounded Industrial Structured Temporal Anomaly Group Relative Policy Optimization) refines the model with semantic-gated defect reward and visibility-aware temporal reward, producing the final model VISTA. On MMVIAD-Unseen, VISTA improves the base model’s average score across the four tasks from 45.0 to 57.5, surpassing GPT-5.4. Source code is available at this https URL.

[CV-17] Predicting 3D structure by latent posterior sampling

【速读】:该论文旨在解决3D场景重建中因观测信息不完整或不确定而导致的结构预测难题,尤其在单视图、多视图、噪声图像、稀疏像素及稀疏深度数据等不同输入条件下,如何有效建模和利用不确定性以提升重建精度。其解决方案的关键在于将基于NeRF(神经辐射场)的3D场景表示与扩散模型(diffusion models)的概率推理相结合:首先通过两阶段训练流程,先学习一个重建模型以自动解码场景的潜在变量,再用扩散模型学习该潜在变量的先验分布;随后借助基于得分的推断方法(score-based inference)结合由体渲染(volumetric rendering)计算的似然项进行后验采样,从而实现对不同观测条件下的不确定性建模与高保真3D结构生成。

链接: https://arxiv.org/abs/2605.10830
作者: Azmi Haider,Dan Rosenbaum
机构: University of Haifa (海法大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The remarkable achievements of both generative models of 2D images and neural field representations for 3D scenes present a compelling opportunity to integrate the strengths of both approaches. In this work, we propose a methodology that combines a NeRF-based representation of 3D scenes with probabilistic modeling and reasoning using diffusion models. We view 3D reconstruction as a perception problem with inherent uncertainty that can thereby benefit from probabilistic inference methods. The core idea is to represent the 3D scene as a stochastic latent variable for which we can learn a prior and use it to perform posterior inference given a set of observations. We formulate posterior sampling using the score-based inference method of diffusion models in conjunction with a likelihood term computed from a reconstruction model that includes volumetric rendering. We train the model using a two-stage process: first we train the reconstruction model while auto-decoding the latent representations for a dataset of 3D scenes, and then we train the prior over the latents using a diffusion model. By using the model to generate samples from the posterior we demonstrate that various 3D reconstruction tasks can be performed, differing by the type of observation used as inputs. We showcase reconstruction from single-view, multi-view, noisy images, sparse pixels, and sparse depth data. These observations vary in the amount of information they provide for the scene and we show that our method can model the varying levels of inherent uncertainty associated with each task. Our experiments illustrate that this approach yields a comprehensive method capable of accurately predicting 3D structure from diverse types of observations.
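下面给出一个后验采样循环的示意代码:每一步结合扩散先验更新与由体渲染似然梯度给出的引导项(类似 DPS 风格的引导)。`diffusion.predict_clean`、`diffusion.prior_step`、`decoder_render` 等接口以及高斯观测噪声 `sigma_obs` 均为假设,仅用于说明思路,并非论文的精确公式。

```python
import torch

def posterior_sample(diffusion, decoder_render, observations, obs_mask,
                     steps=250, guidance=1.0, sigma_obs=0.1):
    """Sketch of posterior sampling over a latent scene code z: each denoising step
    combines the diffusion prior with the gradient of a rendering likelihood."""
    z = torch.randn(diffusion.latent_shape)
    for t in reversed(range(steps)):
        z = z.detach().requires_grad_(True)
        z0_hat = diffusion.predict_clean(z, t)          # denoised latent estimate
        rendered = decoder_render(z0_hat)               # volumetric rendering
        log_lik = -((rendered - observations)[obs_mask] ** 2).sum() / (2 * sigma_obs ** 2)
        grad = torch.autograd.grad(log_lik, z)[0]
        with torch.no_grad():
            z = diffusion.prior_step(z, t)              # ancestral / score step
            z = z + guidance * grad                     # likelihood guidance
    return diffusion.predict_clean(z, 0)
```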

[CV-18] ALAM: Algebraically Consistent Latent Transitions for Vision-Language-Action Models

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型因机器人动作标注数据稀缺而导致的训练瓶颈问题,同时利用大量无动作标签的视频数据中蕴含的物理世界变化先验信息。其核心解决方案是提出ALAM(Algebraic Latent Action Model),关键在于通过帧三元组建模时间关系,使潜在动作空间具备代数一致性——即满足组合与逆向一致性约束,从而构建局部可加性的潜在转移空间;在此基础上,冻结预训练编码器并将潜转移序列作为辅助生成目标,联合流匹配(flow-matching)目标优化策略生成过程,使策略能够直接利用ALAM提供的结构化潜在转移几何特性,无需进行隐变量到动作的解码,显著提升了VLA模型在长视距任务中的性能表现。

链接: https://arxiv.org/abs/2605.10819
作者: Zuojin Tang,Haoyun Liu,Xinyuan Chang,Changjie Wu,Dongjie Huo,Yandan Yang,Bin Liu,Zhejia Cai,Feng Xiong,Mu Xu,Jiachen Luo,De Ma,Zhiheng Ma,Gang Pan
机构: Zhejiang University (浙江大学); Amap, Alibaba Group (高德地图,阿里巴巴集团); Nanjing University (南京大学); Shenzhen University of Advanced Technology (深圳先进技术研究院); Beijing University of Chemical Technology (北京化工大学); Embodied Intelligence General Platform Laboratory, Chery Auto (奇瑞汽车具身智能通用平台实验室); Tsinghua University (清华大学); Queen Mary University of London (伦敦玛丽女王大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language-action (VLA) models remain constrained by the scarcity of action-labeled robot data, whereas action-free videos provide abundant evidence of how the physical world changes. Latent action models offer a promising way to extract such priors from videos, but reconstruction-trained latent codes are not necessarily suitable for policy generation: they may predict future observations while lacking the structure needed to be reused or generated coherently with robot actions. We introduce ALAM (Algebraic Latent Action Model), an Algebraically Consistent Latent Action Model that turns temporal relations in action-free video into structural supervision. Given frame triplets, ALAM learns latent transitions that are grounded by reconstruction while being regularized by composition and reversal consistency, encouraging a locally additive transition space. For downstream VLA learning, we freeze the pretrained encoder and use its latent transition sequences as auxiliary generative targets, co-generated with robot actions under a joint flow-matching objective. This couples structured latent transitions with flow-based policy generation, allowing the policy to exploit ALAM’s locally consistent transition geometry without requiring latent-to-action decoding. Representation probes show that ALAM reduces additivity and reversibility errors by 25-85 times over unstructured latent-action baselines and improves long-horizon cumulative reconstruction. When transferred to VLA policies, ALAM raises the average success rate from 47.9% to 85.0% on MetaWorld MT50 and from 94.1% to 98.1% on LIBERO, with consistent gains on real-world manipulation tasks. Ablations further confirm that the strongest improvements arise from the synergy between algebraically structured latent transitions and joint flow matching.
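以下代码草图示意摘要中的组合一致性与可逆一致性正则:对帧三元组 (a, b, c),要求 z(a→b) + z(b→c) ≈ z(a→c),且 z(a→b) ≈ -z(b→a)。`enc(x, y)` 为假设的潜转移编码器,损失权重与具体形式仅作说明。

```python
import torch
import torch.nn.functional as F

def algebraic_consistency_loss(enc, frames_a, frames_b, frames_c):
    """Composition and reversal consistency over latent transitions for a frame
    triplet, encouraging a locally additive transition space."""
    z_ab = enc(frames_a, frames_b)
    z_bc = enc(frames_b, frames_c)
    z_ac = enc(frames_a, frames_c)
    z_ba = enc(frames_b, frames_a)
    comp = F.mse_loss(z_ab + z_bc, z_ac)    # composition consistency
    rev = F.mse_loss(z_ab, -z_ba)           # reversal consistency
    return comp + rev
```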

[CV-19] PhyGround: Benchmarking Physical Reasoning in Generative World Models

【速读】:该论文旨在解决生成式视频模型在物理合理性评估中的三大核心挑战:现有基准评价框架过于粗粒度、难以识别特定物理定律的失效;人工标注存在响应偏差与疲劳问题,影响判断有效性;自动化评估器缺乏足够的物理知识或可审计性。其解决方案的关键在于提出 PhyGround 基准,该基准通过 250 个精心设计的提示(每个附带预期物理结果)和涵盖固体力学、流体动力学与光学的 13 类物理定律分类体系,将每条物理定律转化为可观测的子问题以实现细粒度诊断;同时基于社会科学研究实验设计开展大规模高质量人类标注(459 名标注者提供 5,796 条完整标注),并发布 PhyJudge-9B——一个专为物理推理优化的视觉语言模型(VLM)评判器,显著降低系统性偏差(相对偏差从 Gemini-3.1-Pro 的 16.6% 降至 3.3%),从而推动物理合理性的可复现、可解释评估。

链接: https://arxiv.org/abs/2605.10806
作者: Juyi Lin,Arash Akbari,Yumei He,Lin Zhao,Haichao Zhang,Arman Akbari,Xingchen Xu,Zoe Y. Lu,Enfu Nan,Hokin Deng,Edmund Yeh,Sarah Ostadabbas,Yun Fu,Jennifer Dy,Pu Zhao,Yanzhi Wang
机构: Northeastern University (东北大学); Tulane University (图兰大学); University of Washington (华盛顿大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint. 56 pages, 39 figures, 40 tables. Project page: this https URL

点击查看摘要

Abstract:Generative world models are increasingly used for video generation, where learned simulators are expected to capture the physical rules that govern real-world dynamics. However, evaluating whether generated videos actually follow these rules remains challenging. Existing physics-focused video benchmarks have made important progress, but they still face three key challenges, including the coarse evaluation frameworks that hide law-specific failures, response biases and fatigue that undermine the validity of annotation judgments, and automated evaluators that are insufficiently physics-aware or difficult to audit. To address those challenges, we introduce PhyGround, a criteria-grounded benchmark for evaluating physical reasoning in video generation. The benchmark contains 250 curated prompts, each augmented with an expected physical outcome, and a taxonomy of 13 physical laws across solid-body mechanics, fluid dynamics, and optics. Each law is operationalized through observable sub-questions to enable per-law diagnostics. We evaluate eight modern video generation models through a large-scale, quality-controlled human study, grounded on social science lab experiment design. A total of 459 annotators provided 5,796 complete annotations and over 37.4K fine-grained labels; after quality control, the retained annotations exhibited high split-half model-ranking correlations (Spearman’s rho 0.90). To support reproducible automated evaluation, we release PhyJudge-9B, an open physics-specialized VLM judge. PhyJudge-9B achieves substantially lower aggregate relative bias than Gemini-3.1-Pro (3.3% vs. 16.6%). We release prompts, human annotations, model checkpoints, and evaluation code on the project page this https URL.

[CV-20] Rapid Forest Fuel Load Estimation via Virtual Remote Sensing and Metric-Scale Feed-Forward 3D Reconstruction

【速读】:该论文旨在解决森林覆盖率与可燃生物量(燃料负荷)精准量化难题,以支持野火风险评估和生态系统管理。传统方法依赖航空激光雷达(LiDAR)或实地调查,存在成本高、耗时长的问题,而卫星遥感数据则常因垂直分辨率不足难以进行冠层体积分析。其解决方案的关键在于构建一个全自动的虚拟遥感数据处理流程:首先利用Google Earth Studio(GES)生成低空轨道影像及相机位姿,随后采用基于VGGT-Long框架的Pi-Long模型实现密集三维重建;为克服单目重建中的尺度模糊问题,引入基于Sim(3) Umeyama优化的度量恢复模块,将重建轨迹与GES真实位姿对齐,从而获得具有度量尺度的点云;最终通过正交投影生成鸟瞰视角(BEV)高度图与密度图,并结合分水岭分割算法与高度方差分析,实现树种分类(针叶林 vs. 阔叶林)、叶面积指数(LAI)计算及总燃料负荷估算。该方法在保证几何一致性的同时,显著提升了森林生物量近实时估算的效率与可扩展性。

链接: https://arxiv.org/abs/2605.10789
作者: Quanyun Wu,Kyle Gao,Wentao Sun,Zhengsen Xu,Hudson Sun,Linlin Xu,Yuhao Chen,David A. Clausi,Jonathan Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at IEEE IGARSS 2026

点击查看摘要

Abstract:Accurate quantification of forest coverage and combustible biomass (fuel load) is critical for wildfire risk assessment and ecosystem management. However, traditional methods relying on airborne LiDAR or field surveys are cost-prohibitive and time-intensive, while satellite imagery often lacks the vertical resolution required for canopy volume analysis. This paper proposes a novel, automated pipeline for rapid forest inventory using virtual remote sensing data derived from Google Earth Studio (GES). Our approach first generates low-altitude orbital imagery and camera poses for a target region. For dense 3D reconstruction, we employ Pi-Long, developed within the VGGT-Long framework. This model serves as a scalable extension of the Pi-3 feed-forward Transformer architecture. To address the inherent scale ambiguity in monocular reconstruction, we introduce a metric recovery module that aligns the reconstructed trajectory with GES ground truth poses via Sim(3) Umeyama optimization. The metric-scale point cloud is then orthogonally projected into Bird’s-Eye-View (BEV) height and density maps. Finally, we employ a watershed-based segmentation algorithm combined with height variance analysis to classify tree species (conifer vs. broadleaf), calculate Leaf Area Index (LAI), and estimate total fuel load. Experimental results demonstrate that this pipeline offers a scalable, cost-effective alternative to physical scanning, enabling near-real-time estimation of forest biomass with high geometric consistency.
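摘要提到用 Sim(3) Umeyama 优化将重建轨迹与 GES 真值位姿对齐以恢复度量尺度。下面给出标准 Umeyama 闭式解的一个简化实现(假设点已一一对应、无外点剔除),仅作示意。

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Closed-form Sim(3) (scale, rotation, translation) aligning src to dst,
    both (N, 3) arrays of corresponding points (Umeyama, 1991)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                        # handle reflections
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)
    scale = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - scale * R @ mu_s
    return scale, R, t                      # apply as: scale * R @ p + t
```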

[CV-21] Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization

【速读】:该论文旨在解决现有基于冻结预训练视觉编码器的表示自编码器(Representation Autoencoder)在图像重建与生成任务中性能受限的问题,其核心在于仅利用编码器最后一层特征,忽略了中间层蕴含的丰富层次化信息,导致低级视觉细节因多层语义抽象而衰减。解决方案的关键是提出一种轻量级融合模块 DRoRAE(Depth-Routed Representation AutoEncoder),通过能量约束路由机制和增量修正策略,自适应聚合所有编码器层的特征,从而恢复丢失的细节信息,并生成与冻结预训练解码器兼容的增强潜空间表示。该方法还引入三阶段解耦训练策略,先在隐式分布约束下学习融合模块,再微调解码器以充分挖掘增强表示的潜力,显著提升重建与生成质量(如 ImageNet-256 上 rFID 从 0.57 降至 0.29,生成 FID 从 1.74 提升至 1.65)。此外,研究发现融合能力与重建质量之间存在对数线性缩放规律(R²=0.86),揭示了“表示丰富度”作为类比自然语言处理中词表大小的新可预测扩展维度。

链接: https://arxiv.org/abs/2605.10780
作者: Xuanyu Zhu,Yan Bai,Yang Shi,Yihang Lou,Yuanxing Zhang,Jing Jin,Yuan Zhou
机构: Peking University (北京大学); Meituan Inc (美团); Tsinghua University (清华大学); IGDL
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Representation autoencoders that reuse frozen pretrained vision encoders as visual tokenizers have achieved strong reconstruction and generation quality. However, existing methods universally extract features from only the last encoder layer, discarding the rich hierarchical information distributed across intermediate layers. We show that low-level visual details survive in the last layer merely as attenuated residuals after multiple layers of semantic abstraction, and that explicitly fusing multi-layer features can substantially recover this lost information. We propose DRoRAE (Depth-Routed Representation AutoEncoder), a lightweight fusion module that adaptively aggregates all encoder layers via energy-constrained routing and incremental correction, producing an enriched latent compatible with a frozen pretrained decoder. A three-phase decoupled training strategy first learns the fusion under the implicit distributional constraint of the frozen decoder, then fine-tunes the decoder to fully exploit the enriched representation. On ImageNet-256, DRoRAE reduces rFID from 0.57 to 0.29 and improves generation FID from 1.74 to 1.65 (with AutoGuidance), with gains also transferring to text-to-image synthesis. Furthermore, we uncover a log-linear scaling law ($R^2 = 0.86$) between fusion capacity and reconstruction quality, identifying \textit{representation richness} as a new, predictably scalable dimension for visual tokenizers analogous to vocabulary size in NLP.
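下面用一个极简的多层特征融合模块示意 "路由聚合 + 增量修正" 的结构:对各编码层特征学习路由权重并加权求和,再以残差形式叠加到末层特征上。论文中的能量约束路由在此被简化为 softmax,维度与模块名均为假设。

```python
import torch
import torch.nn as nn

class MultiLayerFusion(nn.Module):
    """Minimal stand-in for depth-routed fusion: learn per-layer routing weights
    and add a small correction on top of the last-layer feature, so a frozen
    decoder still receives a latent close to what it was pretrained on."""
    def __init__(self, num_layers, dim):
        super().__init__()
        self.route_logits = nn.Parameter(torch.zeros(num_layers))
        self.correct = nn.Linear(dim, dim)

    def forward(self, layer_feats):                      # list of (B, N, D) tensors
        stack = torch.stack(layer_feats, dim=0)          # (L, B, N, D)
        w = self.route_logits.softmax(0).view(-1, 1, 1, 1)
        fused = (w * stack).sum(0)                       # routed aggregate
        return layer_feats[-1] + self.correct(fused)     # incremental correction
```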

[CV-22] owards a Large Language-Vision Question Answering Model for MSTAR Automatic Target Recognition

【速读】:该论文旨在解决合成孔径雷达(Synthetic Aperture Radar, SAR)图像中军事目标的细粒度识别问题,尤其在复杂环境条件下实现高精度自动目标识别(Automatic Target Recognition, ATR)。传统方法依赖人工分析,耗时且效率低,而本研究通过引入大语言视觉模型(Large Language-Vision Models, LLVM),如CLIP与LLaVA架构,结合参数高效微调策略,在自建SAR图像描述生成与视觉问答(Visual Question Answering, VQA)基准数据集上实现了98%的识别准确率。其解决方案的关键在于:构建面向SAR场景的多模态训练与评估基准,并利用Transformer架构的跨模态理解能力,显著提升模型对复杂遥感图像中细微目标特征的感知与判别能力,从而推动机器辅助遥感ATR在军事与情报领域的应用发展。

链接: https://arxiv.org/abs/2605.10772
作者: David F. Ramirez,Tim L. Overman,Kristen Jaskie,Marv Kleine,Andreas Spanias
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: Accepted to SPIE Defense + Commercial Sensing, Automatic Target Recognition XXXV

点击查看摘要

Abstract:Large language-vision models (LLVM), such as OpenAI’s ChatGPT and GPT-4, have gained prominence as powerful tools for analyzing text and imagery. The merging of these data domains represents a significant paradigm shift with far-reaching implications for automatic target recognition (ATR). Recent transformer-based LLVM research has shown substantial improvements for geospatial perception tasks. Our study examines the application of LLVM to remote sensing image captioning and visual question-answering (VQA), with a specific focus on synthetic aperture radar (SAR) imagery. We examine newly published LLVM methods, including CLIP and LLaVA neural network transformer architectures. We have developed a work-in-progress SAR training and evaluation benchmark derived from the MSTAR Public Dataset. This has been extended to include descriptive text captions and question-answer pairs for VQA tasks. This challenge dataset is designed to push the boundaries of an LLVM in identifying nuanced ATR details in SAR imagery. Utilizing parameter-efficient fine-tuning, we train an LLVM method to identify fine-grained target qualities at 98% accuracy. We detail our data setup and experiments, addressing potential pitfalls that could lead to misleading conclusions. Accurately identifying and differentiating military vehicle types in SAR data poses a critical challenge, especially under complex environmental conditions. Mastering this target recognition skill may require a human analyst months of training and years of practice. This research represents a unique effort to apply LLVM to SAR applications, advancing machine-assisted remote sensing ATR for military and intelligence contexts.

[CV-23] MPerS: Dynamic MLLM MixExperts Perception-Guided Remote Sensing Scene Segmentation CVPR2026

【速读】:该论文旨在解决复杂遥感(Remote Sensing, RS)场景下多模态融合中高质量遥感图像描述生成不足以及文本语义与视觉特征融合效率低的问题。其关键解决方案在于提出一种基于动态混合专家(Dynamic MixExperts)机制的多模态大语言模型(Multimodal Large Language Models, MLLMs)感知引导遥感场景分割方法,通过设计多种提示(prompts)激发LLaVA、ChatGPT和Qwen等MLLMs生成高质量RS captions,并利用DINOv3提取密集视觉表征;进一步构建语言查询引导注意力机制(Linguistic Query Guided Attention),以文本语义信息动态指导视觉特征进行精确分割,从而实现对遥感场景的多视角感知与高效语义融合。

链接: https://arxiv.org/abs/2605.10769
作者: Ziyi Wang,Xianping Ma,Ziyao Wang,Hongyang Zhang,Man On Pun
机构: The Chinese University of Hong Kong (Shenzhen); Southwest Jiaotong University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026 Findings. 11 pages, 6 figures

点击查看摘要

Abstract:The multimodal fusion of images and scene captions has been extensively explored and applied in various fields. However, when dealing with complex remote sensing (RS) scenes, existing studies have predominantly concentrated on architectural optimizations for integrating textual semantic information with visual features, while largely neglecting the generation of high-quality RS captions and the investigation of their effectiveness in multimodal semantic segmentation. In this context, we propose the Dynamic MLLM Mixture-of-Experts Perception-Guided Remote Sensing Scene Segmentation, referred to as MPerS. We design multiple prompts for MLLMs to generate high-quality RS captions, enabling MLLMs to perceive RS scenes from diverse expert perspectives. DINOv3 is employed to extract dense visual representations. We design a Dynamic MixExperts module that adaptively integrates the most effective textual semantics. Linguistic Query Guided Attention is constructed to utilize textual semantic information to guide visual features for precise segmentation. The MLLMs include LLaVA, ChatGPT, and Qwen. Our method achieves superior performance on three public semantic segmentation RS datasets.

[CV-24] Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在持续指令微调(Multimodal Continual Instruction Tuning, MCIT)过程中因任务顺序更新导致的灾难性遗忘问题,同时提升对单个查询-图像对(query-image pair)实例级差异的适应能力。现有方法主要依赖任务级提示(prompt)或LoRA专家模块的动态选择与聚合,但无法充分应对同一任务内样本在视觉场景、问题意图和推理需求上的显著差异。其解决方案的关键在于提出DRAPE(Dynamic Cross-Modal Prompt Generation)框架——通过从文本指令中生成查询,并交叉注意力机制融合视觉补丁特征,动态合成连续的、实例特定的软提示(soft prompts),并将其前置到冻结的LLM输入中;此外,采用零空间梯度投影策略缓解参数更新时的遗忘,并基于CLIP原型路由实现无需任务标签的生成器选择,从而在保持模型泛化能力的同时实现高效且稳定的持续学习。

链接: https://arxiv.org/abs/2605.10765
作者: Tao Hu,Da-Wei Zhou
机构: Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, yet real-world deployment often requires continual capability expansion across sequential tasks. In such scenarios, Multimodal Continual Instruction Tuning (MCIT) aims to acquire new capabilities while limiting catastrophic forgetting. Existing methods mainly follow a module-composition paradigm: they maintain task-level prompts or LoRA experts and dynamically route or aggregate a subset of them at inference. However, samples within the same task can still differ substantially in visual scenes, question intents, and reasoning demands. This motivates instance-level adaptation to individual query-image pairs rather than only selecting or combining task-level modules. To this end, we propose DRAPE (Dynamic Cross-Modal Prompt Generation), a prompt-learning framework that synthesizes continuous instance-specific soft prompts for MCIT. Instead of selecting prompts from a fixed pool, DRAPE derives prompt queries from the textual instruction and cross-attends to visual patch features, producing query-image conditioned prompts that are prepended to the frozen LLM. To mitigate forgetting during sequential updates, DRAPE applies null-space gradient projection to the shared projector and uses CLIP-based prototype routing for task-label-free generator selection at inference. Extensive experiments on MCIT benchmarks show that DRAPE achieves state-of-the-art performance among representative prompt-based and LoRA-based continual-learning baselines.
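下面是实例级软提示生成器的示意实现:由文本指令特征派生查询,对视觉 patch 特征做交叉注意力,输出可前置到冻结 LLM 输入的软提示。维度、提示数量与模块名均为示例性假设。

```python
import torch
import torch.nn as nn

class PromptGenerator(nn.Module):
    """Sketch of instance-specific soft-prompt synthesis: queries derived from the
    text instruction cross-attend to visual patch features; the resulting prompts
    are prepended to the frozen LLM's input embeddings (sizes illustrative)."""
    def __init__(self, dim=1024, num_prompts=16, num_heads=8):
        super().__init__()
        self.query_proj = nn.Linear(dim, num_prompts * dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.num_prompts, self.dim = num_prompts, dim

    def forward(self, text_feat, patch_feats):
        # text_feat: (B, D) pooled instruction feature; patch_feats: (B, P, D)
        q = self.query_proj(text_feat).view(-1, self.num_prompts, self.dim)
        prompts, _ = self.cross_attn(q, patch_feats, patch_feats)
        return prompts                                   # (B, num_prompts, D)

# Prepend: llm_inputs = torch.cat([prompts, input_embeds], dim=1)
```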

[CV-25] Break the Brake Not the Wheel: Untargeted Jailbreak via Entropy Maximization

【速读】:该论文旨在解决梯度驱动的通用图像越狱攻击在视觉语言模型(Vision-Language Models, VLMs)中跨模型迁移能力有限的问题,这一现象曾被视作限制可迁移多模态越狱攻击可行性的关键障碍。其解决方案的核心在于提出一种无目标越狱攻击方法——UJEM-KL(Untargeted Jailbreak via Entropy Maximization-KL),通过最大化决策阶段高熵标记(high-entropy tokens)的熵值来诱导模型从拒绝输出转向生成响应,同时稳定低熵位置以保持输出质量。实验表明,该方法在白盒攻击成功率上具有竞争力,并显著提升跨模型迁移性,且对主流防御机制仍具有效性,揭示了迁移能力受限的根本原因在于优化目标过于受限。

链接: https://arxiv.org/abs/2605.10764
作者: Mengqi He,Xinyu Tian,Xin Shen,Shu Zou,Jinhong Ni,Zhaoyuan Yang,Weikang Li,Xuesong Li,Jing Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint. 17 pages, 8 figures, 6 tables

点击查看摘要

Abstract:Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibility of transferable multimodal jailbreaks. We revisit this conclusion under a strictly untargeted threat model without enforcing a fixed prefix or response pattern. Our preliminary experiment reveals that refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and non-refusal tokens already carry substantial probability mass among the top-ranked candidates before attack. Motivated by this finding, we propose Untargeted Jailbreak via Entropy Maximization (UJEM-KL), a lightweight attack that maximizes entropy at these decision tokens to flip refusal outcomes, while stabilizing the remaining low-entropy positions to preserve output quality. Across three VLMs and two safety benchmarks, UJEM-KL achieves competitive white-box attack success rates and consistently improves transferability, while remaining effective under representative defenses. Our experimental results indicate that the limited transferability primarily stems from overly constrained optimization objectives.
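下面的损失函数草图示意上述目标的一种写法:用干净前向确定高熵的 "决策" token 位置,在这些位置最大化对抗前向的熵,其余位置用 KL 项约束到干净分布。熵阈值、KL 权重与具体形式均为假设,非论文原始目标函数。

```python
import torch
import torch.nn.functional as F

def entropy_max_kl_loss(logits, ref_logits, entropy_thresh=2.0, kl_weight=1.0):
    """Untargeted jailbreak objective sketch: maximise entropy at high-entropy
    (decision) positions found in the clean pass, keep other positions close to
    the clean distribution. Minimise this w.r.t. the image perturbation."""
    log_p = F.log_softmax(logits, dim=-1)        # (T, V) pass with perturbed image
    log_q = F.log_softmax(ref_logits, dim=-1)    # (T, V) pass with clean image
    ent_ref = -(log_q.exp() * log_q).sum(-1)     # per-position entropy (clean)
    decision = ent_ref > entropy_thresh          # high-entropy decision tokens
    ent_adv = -(log_p.exp() * log_p).sum(-1)
    ent_term = ent_adv[decision].mean() if decision.any() else logits.new_zeros(())
    kl_term = (F.kl_div(log_p[~decision], log_q[~decision],
                        log_target=True, reduction="batchmean")
               if (~decision).any() else logits.new_zeros(()))
    return -ent_term + kl_weight * kl_term
```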

[CV-26] GridProbe: Posterior-Probing for Adaptive Test-Time Compute in Long-Video VLMs

【速读】:该论文旨在解决视觉语言模型(VLM)在长视频理解任务中因单次全帧前向传播导致的二次复杂度计算瓶颈问题,尤其针对现有训练-free 帧选择方法依赖对比预训练信号、难以应对推理密集型查询(如否定、跨帧计数和整体摘要)的局限性。其解决方案的关键在于提出 GridProbe——一种无需训练的后验探测推理范式,通过冻结的 VLM 自身推理能力,在答案空间中评分证据并自适应选择与问题相关的帧,从而实现次二次注意力开销且几乎无精度损失;具体而言,该方法将帧排列成 K×KK \times K 网格,运行轻量级行(R)和列(C)探测器,利用其峰值后验概率作为条件置信度,通过 R 与 C 的外积生成可解释的重要性图,并基于该图的偏度和峰度驱动形状自适应选择(Shape-Adaptive Selection),以闭式规则将固定帧预算 MM 替换为每题动态调整的有效帧数 MeffM_{\mathrm{eff}},实验证明 MeffM_{\mathrm{eff}} 可在不依赖答案的情况下追踪内在问题难度,体现测试时自适应计算特性。

链接: https://arxiv.org/abs/2605.10762
作者: Mohamed Eltahir,Lama Ayash,Ali Habibullah,Tanveer Hussain,Naeemullah Khan
机构: King Abdullah University of Science and Technology (KAUST); Edge Hill University (边缘希尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-video understanding in VLMs is bottlenecked by a single monolithic forward pass over thousands of frames at quadratic attention cost. A common mitigation is to first select a small subset of informative frames before the forward pass; most commonly, training-free selectors rely on auxiliary encoder-space similarities. Such signals are capped by contrastive pretraining, which usually fails on reasoning-heavy queries (negation, cross-frame counting, holistic summarization). We propose GridProbe, an efficient training-free posterior-probing inference paradigm that scores evidence in answer space using a frozen VLM’s own reasoning and then selects question-relevant frames adaptively, resulting in sub-quadratic attention cost with little to no accuracy loss. We arrange frames on a $K \times K$ grid and run lightweight row ($R$) and column ($C$) probes, where each probe reads its peak posterior as a query-conditioned confidence. The outer product of $R$ and $C$ yields an interpretable importance map whose skewness and kurtosis drive Shape-Adaptive Selection, a closed-form rule that reliably replaces the fixed frame budget $M$ with a per-question $M_{\mathrm{eff}}$. We show empirically that $M_{\mathrm{eff}}$ tracks intrinsic question difficulty without ever seeing the answer, a sign of test-time adaptive compute. On Video-MME-v2, GridProbe matches the monolithic baseline within 1.6 pp Avg Acc at a $3.36\times$ TFLOPs reduction, while on LongVideoBench it Pareto-dominates the baseline (+0.9 pp at $0.35\times$ compute). Because the selector and QA models can be decoupled, pairing a small 2B selector with a stronger 4B or 8B QA model is strictly Pareto-dominant over the 2B monolithic baseline (up to +4.0 pp at $0.52\times$ compute, on average), with no retraining. Finally, the interpretability of the importance maps opens future avenues for behavioral diagnostics, grounding, and frame-selection distillation.
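下面的代码草图示意 "行列探测置信度外积得到重要性图,再由形状统计量决定每题帧预算" 的流程。从偏度/峰度到预算 $M_{\mathrm{eff}}$ 的映射在此只是一个说明性的单调规则,并非论文中的闭式公式。

```python
import numpy as np
from scipy.stats import skew, kurtosis

def gridprobe_select(row_conf, col_conf, m_min=8, m_max=64):
    """Build the query-conditioned importance map as the outer product of row and
    column probe confidences (each of length K), then turn its shape statistics
    into a per-question frame budget (illustrative mapping, not the paper's rule)."""
    importance = np.outer(row_conf, col_conf)            # (K, K) frame importance
    flat = importance.ravel()
    peaked = skew(flat) + max(kurtosis(flat), 0.0)       # peaky map -> localised evidence
    frac = 1.0 / (1.0 + peaked)                          # peakier -> fewer frames needed
    m_eff = int(np.clip(round(frac * m_max), m_min, m_max))
    order = np.argsort(flat)[::-1][:m_eff]               # top-M_eff grid cells
    return m_eff, np.unravel_index(order, importance.shape)
```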

[CV-27] RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

【速读】:该论文旨在解决当前医学影像AI系统在癌症筛查中仅能实现检测而缺乏推理能力的问题,即如何让AI不仅识别病灶,还能像放射科医生一样进行多步骤临床推理。其解决方案的关键在于构建RadThinking数据集,这是一个分层的视觉问答(VQA)语料库,将问题按推理深度分为三个层级:基础感知类问题(foundation VQAs)、单步临床规则推理问题(single-step reasoning VQAs)和需要链式思维(chain-of-thought)的复合型问题(compositional VQAs)。特别地,每个复合问题均附带由基础问题组成的推理链条,并严格遵循临床报告标准(如LI-RADS),从而为强化学习模型(如DeepSeek-R1和OpenAI o1)提供可验证的奖励信号与结构化推理监督,推动AI从“感知”向“理解与推理”演进。

链接: https://arxiv.org/abs/2605.10761
作者: Wenxuan Li,Pedro R. A. S. Bassi,Xinze Zhou,Jakob Wasserthal,Alan L. Yuille,Zongwei Zhou
机构: Johns Hopkins University (约翰霍普金斯大学); University Hospital Basel (巴塞尔大学医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cancer screening is a reasoning task. A radiologist observes findings, compares them to prior scans, integrates clinical context, and reaches a diagnostic conclusion confirmed by pathology. We present RadThinking, a Visual Question Answering (VQA) dataset that makes this reasoning explicit and trainable. RadThinking releases VQA pairs at three difficulty tiers. Foundation VQAs are atomic perception questions. Single-step reasoning VQAs apply one clinical rule. Compositional VQAs require multi-step chain-of-thought to reach a guideline category such as LI-RADS-5. For every compositional VQA, we release the chain of foundation VQAs that solves it. The chain follows the rules of the governing clinical reporting standard. The dataset spans 20,362 CT scans from 9,131 patients across 43 cancer groups, plus 2,077 verified healthy controls with 1-year follow-up. To our knowledge, RadThinking is the first cancer-screening VQA corpus that stratifies questions by reasoning depth and grounds compositions in clinical reporting standards. The foundation tier supplies atomic perception supervision. The compositional tier supplies chain-of-thought data and verifiable rewards for reinforcement-learning recipes such as DeepSeek-R1 and OpenAI o1. RadThinking enables systematic training and evaluation of whether AI systems can reason about cancer, not merely detect it.

[CV-28] Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models

【速读】:该论文旨在解决生成式模型(如扩散模型和流匹配模型)在强化学习(RL)后训练阶段中,如何保持预训练阶段监督回归结构的问题。现有方法通常依赖于昂贵的随机微分方程(SDE)模拟、奖励梯度计算或代理损失函数,从而破坏了预训练中简洁且可扩展的回归特性。解决方案的关键在于提出 Reinforce Adjoint Matching (RAM),它通过在KL正则化的奖励最大化框架下,证明最优生成过程仅调整干净终点分布以偏向高奖励样本,而保持噪声机制不变;结合伴随匹配最优性条件与REINFORCE恒等式,推导出一种一致性损失,该损失直接修正预训练目标以融入奖励信号。RAM无需SDE rollout、反向伴随扫描或奖励梯度,仅需从当前模型采样干净样本、评估奖励并按预训练方式加噪后回归,因此既保留了预训练的简洁性又实现了高效优化,在Stable Diffusion 3.5M上显著提升了组合能力、文本渲染质量和人类偏好评分,且训练步数减少最多达50倍。

链接: https://arxiv.org/abs/2605.10759
作者: Andreas Bergmeister,Stefanie Jegelka,Nikolas Nüsken,Carles Domingo-Enrich,Jakiw Pidstrigach
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion and flow-matching models scale because pretraining is supervised regression: a clean sample is noised analytically, and a model regresses against a closed-form target. RL post-training aligns the model with a reward. In image generation, this makes samples compose objects correctly, render text legibly, and match human preferences. Existing methods rely on costly SDE rollouts, reward gradients, or surrogate losses, sacrificing pretraining’s regression structure. We show that the structure extends to RL post-training. Under KL-regularized reward maximization, the optimal generative process tilts the clean-endpoint distribution towards samples with higher reward and leaves the noising law unchanged. Combining this with the adjoint-matching optimality condition and a REINFORCE identity, we derive Reinforce Adjoint Matching (RAM): a consistency loss that corrects the pretraining target with the reward. At each step, we draw a clean endpoint from the current model, evaluate its reward, noise it as in pretraining, and regress. No SDE rollouts, backward adjoint sweeps, or reward gradients are required. Like the pretraining objective, RAM is simple and scales. On Stable Diffusion 3.5M, RAM achieves the highest reward on composability, text rendering, and human preference, reaching Flow-GRPO’s peak reward in up to 50× fewer training steps.
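
下面是一个示意性的 RAM 式训练步骤(PyTorch):从当前模型采样干净终点、计算奖励、按预训练方式加噪并回归。摘要中提到的奖励修正由伴随匹配与 REINFORCE 恒等式推导得到,其精确形式此处未复现,代码里用简单的指数加权(`beta` 为假设超参)代替,仅展示“无需 SDE rollout、伴随回传与奖励梯度”的整体数据流。

```python
import torch
import torch.nn as nn

def ram_step(model, sample_fn, reward_fn, optimizer, batch=8, dim=16, beta=1.0):
    """One schematic RAM-style update: sample clean endpoints from the current
    model, score them with the reward, re-noise them as in flow-matching
    pretraining, and regress. The reward correction here is a simple exponential
    tilt; the paper derives its exact form, which this sketch does not reproduce."""
    with torch.no_grad():
        x1 = sample_fn(batch, dim)                       # clean endpoints from current model
        w = torch.softmax(reward_fn(x1) / beta, dim=0)   # reward tilt over the batch
    x0 = torch.randn_like(x1)                            # pretraining-style noise
    t = torch.rand(batch, 1)
    xt = (1 - t) * x0 + t * x1                           # linear interpolation path
    target = x1 - x0                                     # flow-matching regression target
    pred = model(torch.cat([xt, t], dim=-1))
    loss = (w.unsqueeze(-1) * (pred - target) ** 2).sum(dim=0).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

dim = 16
model = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
sample_fn = lambda b, d: torch.randn(b, d)               # placeholder for sampling from the model
reward_fn = lambda x: -(x ** 2).sum(dim=-1)              # toy reward: prefer small norm
print(ram_step(model, sample_fn, reward_fn, opt))
```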

[CV-29] INS: Test-time ID-prototype-separated Negative Semantics Learning for OOD Detection

【速读】:该论文旨在解决现有基于负标签(negative-label-based)的分布外(Out-of-Distribution, OOD)检测方法在测试时无法有效覆盖多样且动态演变的OOD概念的问题,其核心挑战在于静态负标签难以适应实际场景中不断变化的OOD模式,而直接从潜在OOD样本中学习负语义又容易引入ID(In-Distribution)污染。解决方案的关键在于提出一种测试时(test-time)的ID原型分离负语义学习方法(TINS),通过图像到文本模态反演(image-to-text modality inversion)生成样本级负文本嵌入,并引入ID原型分离正则化(ID-prototype-separated regularization)确保负语义与ID语义分离;同时采用分组聚合评分和缓冲区更新策略以稳定负语义扩展过程,从而显著提升OOD检测性能,在Four-OOD等基准上将平均FPR95从14.04%降低至6.72%。

链接: https://arxiv.org/abs/2605.10756
作者: Yifeng Yang,Jubo Feng,Jing Xu,Xinbing Wang,Qinying Gu,Nanyang Ye
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Shanghai Innovation Institute (上海创新研究院); University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language models enable OOD detection by comparing image alignment with ID labels and negative semantics. Existing negative-label-based methods mainly rely on static negative labels constructed before inference, limiting their ability to cover diverse and evolving OOD concepts. Although test-time expansion provides a natural solution, naively learning negative semantics from potential OOD samples may introduce hard ID contamination. To address this issue, we propose a Test-time ID-prototype-separated Negative Semantics learning method, termed TINS. TINS learns sample-specific negative text embeddings via image-to-text modality inversion and introduces ID-prototype-separated regularization to keep them separated from ID semantics. To further stabilize negative semantics expansion, TINS employs group-wise aggregation scoring and a buffer update strategy. Extensive experiments across Four-OOD, OpenOOD, Temporal-shift, and Various ID settings show consistent improvements over strong baselines. Notably, on the Four-OOD benchmark with ImageNet-1K as ID, TINS reduces the average FPR95 from 14.04% to 6.72%. Our code is available at this https URL.

[CV-30] C-CoT: Counterfactual Chain-of-Thought with Vision-Language Models for Safe Autonomous Driving

【速读】:该论文旨在解决自动驾驶在复杂城市交叉路口环境中进行安全决策时的核心挑战,即现有基于规则或数据驱动的方法难以准确捕捉场景语义、推断潜在风险,并在罕见高危情境下做出可靠决策。其解决方案的关键在于提出一种基于视觉语言模型(Vision-Language Models, VLMs)的反事实思维链(Counterfactual Chain-of-Thought, C-CoT)框架,将驾驶决策分解为五个连续阶段:场景描述、关键物体识别、风险预测、反事实风险推理和最终动作规划;其中,在反事实推理阶段引入结构化的元动作评估树(meta-action evaluation tree),显式评估不同动作组合的潜在后果,从而建立动作选择与安全结果之间的因果联系,显著提升模型在长尾分布和分布外场景下的鲁棒性与可解释性。

链接: https://arxiv.org/abs/2605.10744
作者: Kefei Tian,Yuansheng Lian,Kai Yang,Xiangdong Chen,Shen Li
机构: Tongji University (同济大学); Tsinghua University (清华大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Safety-critical planning in complex environments, particularly at urban intersections, remains a fundamental challenge for autonomous driving. Existing methods, whether rule-based or data-driven, frequently struggle to capture complex scene semantics, infer potential risks, and make reliable decisions in rare, high-risk situations. While vision-language models (VLMs) offer promising approaches for safe decision-making in these environments, most current approaches lack reflective and causal reasoning, thereby limiting their overall robustness. To address this, we propose a counterfactual chain-of-thought (C-CoT) framework that leverages VLMs to decompose driving decisions into five sequential stages: scene description, critical object identification, risk prediction, counterfactual risk reasoning, and final action planning. Within the counterfactual reasoning stage, we introduce a structured meta-action evaluation tree to explicitly assess the potential consequences of alternative action combinations. This self-reflective reasoning establishes causal links between action choices and safety outcomes, improving robustness in long-tail and out-of-distribution scenarios. To validate our approach, we construct the DeepAccident-CCoT dataset based on the DeepAccident benchmark and fine-tune a Qwen2.5-VL (7B) model using low-rank adaptation. Our model achieves a risk prediction recall of 81.9%, reduces the collision rate to 3.52%, and lowers L2 error to 1.98 m. Ablation studies further confirm the critical role of counterfactual reasoning and the meta-action evaluation tree in enhancing safety and interpretability.

[CV-31] Pay: Integrated Payment Action Recognition via Multimodal Networks and Adaptive Spatial Prior Learning

【速读】:该论文旨在解决公共交通场景中基于视频的支付行为识别(payment action recognition)问题,现有方法受限于手动审计效率低、视觉与骨架特征模型在噪声环境下的鲁棒性差以及局部细微动作差异难以捕捉等挑战。其解决方案的关键在于提出iPay框架,采用多模态专家混合架构,包含四个紧密耦合的流:(1) RGB专家流通过区域聚焦计算强化局部证据;(2) 骨架专家流利用图卷积网络建模关节运动的全局时空依赖;(3) 双注意力融合流实现骨架到RGB的时间迁移和RGB到骨架的空间增强;(4) 先验驱动的空间差异判别器(Spatial Difference Discriminator, SDD)显式建模手部与锚点间的相对运动以提升任务特定判别能力。该设计有效融合了RGB的细粒度空间信息与骨架的时序结构优势,显著提升了复杂监控场景下的支付行为识别准确率(达83.45%),并具备边缘部署所需的计算效率。

链接: https://arxiv.org/abs/2605.10732
作者: Kaicong Huang,Weiheng Oh,Thomas Guggisberg,Ruimin Ke
机构: Rensselaer Polytechnic Institute (伦斯勒理工学院); Capital District Transportation Authority (大都会交通局)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automated transit payment analysis is vital for scalable fare auditing and passenger analytics, yet practice still relies on limited manual inspection. Prior vision- and skeleton-based methods remain brittle under noisy onboard surveillance and often depend on poorly generalizable handcrafted features. Building on the success of graph convolutional networks in human action recognition, we observe that skeleton features excel at modeling global spatiotemporal dependencies but tend to underemphasize the subtle local relative motions that distinguish payment actions. In contrast, RGB features preserve fine-grained spatial details yet often lack reliable temporal continuity in surveillance footage. To bridge both system-level deployment needs and model-level design challenges, we present iPay, an integrated payment action recognition framework for onboard transit surveillance system. iPay adopts a multimodal mixture-of-experts architecture with four tightly coupled streams: (1) an RGB expert stream emphasizing local evidence via region-focused computation; (2) a skeleton expert stream modeling articulated motion with a graph convolutional backbone; (3) a dual-attention fusion stream enabling skeleton-to-RGB temporal transfer and RGB-to-skeleton spatial enhancement; and (4) a prior-driven Spatial Difference Discriminator (SDD) that explicitly models hand-to-anchor relative motion to improve task-specific discriminability. We also collaborate with local transit agencies to collect over 55 hours of real onboard surveillance footage, yielding 500+ payment clips. Experiments show that iPay outperforms prior methods and achieves 83.45% recognition accuracy with competitive computational efficiency, making it suitable for edge deployment. Code is available at this https URL.

[CV-32] Qwen-Image-2.0 Technical Report

【速读】:该论文旨在解决当前图像生成基础模型在超长文本渲染、多语言排版、高分辨率逼真度、鲁棒指令遵循及高效部署等方面的局限性,尤其是在文本丰富和构图复杂的场景中表现不足的问题。解决方案的关键在于将Qwen3-VL作为条件编码器与多模态扩散Transformer(Multimodal Diffusion Transformer)相结合,实现联合条件-目标建模,并辅以大规模数据筛选和定制化的多阶段训练流程,从而在保持灵活生成与编辑能力的同时,显著提升多模态理解强度与图像质量。

链接: https://arxiv.org/abs/2605.10730
作者: Bing Zhao,Chenfei Wu,Deqing Li,Hao Meng,Jiahao Li,Jie Zhang,Jingren Zhou,Junyang Lin,Kaiyuan Gao,Kuan Cao,Kun Yan,Liang Peng,Lihan Jiang,Niantong Li,Ningyuan Tang,Shengming Yin,Tianhe Wu,Xiao Xu,Xiaoyue Chen,Xihua Wang,Yan Shu,Yanran Zhang,Yi Wang,Yilei Chen,Ying Ba,Yixian Xu,Yujia Wu,Yuxiang Chen,Zecheng Tang,Zekai Zhang,Zhendong Wang,Zihao Liu,Zikai Zhou,An Yang,Chen Cheng,Chenxu Lv,Dayiheng Liu,Fan Zhou,Hantian Xiong,Hongzhu Shi,Hu Wei,Huihong Zhao,Ivy Liu,Jianwei Zhang,Jiawei Zhang,Kai Chen,Kang He,Levon Xue,Lin Qu,Linhan Tang,Luwen Feng,Minggang Wu,Minmin Sun,Na Ni,Rui Men,Shuai Bai,Sishou Zheng,Tao Lan,Tianqi Zhang,Tingkun Wen,Wei Wang,Weixu Qiao,Weiyi Lu,Wenmeng Zhou,Xiaodong Deng,Xiaoxiao Xu,Xinlei Fang,Xionghui Chen,Yanan Wang,Yang Fan,Yichang Zhang,Yixuan Xu,Yu Wu,Zhiyuan Ma,Zhizhi Cai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present Qwen-Image-2.0, an omni-capable image generation foundation model that unifies high-fidelity generation and precise image editing within a single framework. Despite recent progress, existing models still struggle with ultra-long text rendering, multilingual typography, high-resolution photorealism, robust instruction following, and efficient deployment, especially in text-rich and compositionally complex scenarios. Qwen-Image-2.0 addresses these challenges by coupling Qwen3-VL as the condition encoder with a Multimodal Diffusion Transformer for joint condition-target modeling, supported by large-scale data curation and a customized multi-stage training pipeline. This enables strong multimodal understanding while preserving flexible generation and editing capabilities. The model supports instructions of up to 1K tokens for generating text-rich content such as slides, posters, infographics, and comics, while significantly improving multilingual text fidelity and typography. It also enhances photorealistic generation with richer details, more realistic textures, and coherent lighting, and follows complex prompts more reliably across diverse styles. Extensive human evaluations show that Qwen-Image-2.0 substantially outperforms previous Qwen-Image models in both generation and editing, marking a step toward more general, reliable, and practical image generation foundation models.

[CV-33] Heteroscedastic Diffusion for Multi-Agent Trajectory Modeling CVPR2025

【速读】:该论文旨在解决多智能体轨迹建模中长期存在的三个关键问题:一是现有方法主要聚焦于轨迹预测,忽视了轨迹补全(trajectory completion)这一在真实场景(如跟踪数据纠错)中至关重要的任务;二是现有模型缺乏对每个状态的异方差不确定性(heteroscedastic uncertainty)估计;三是主流多模态采样方法无法为每种生成场景提供误差概率估计,导致推理时难以对预测结果进行排序。解决方案的关键在于提出U2Diffine,一个统一的扩散模型,通过在标准去噪损失基础上引入预测噪声的负对数似然项,并利用一阶泰勒近似将潜在空间的不确定性传播至真实状态空间,从而实现状态级异方差不确定性估计;同时设计了更快的基线模型U2Diff以避免采样过程中的梯度计算,显著提升推理效率;此外,引入Rank Neural Network(RankNN)用于对每个生成模式估计误差概率,与真实误差高度相关,从而支持有效排序。该方法在四个挑战性体育数据集上均优于当前最优方案,在轨迹补全和预测任务中展现出优越性能。

链接: https://arxiv.org/abs/2605.10717
作者: Guillem Capellera,Antonio Rubio,Luis Ferraz,Antonio Agudo
机构: Institut de Robòtica i Informàtica Industrial, CSIC-UPC (西班牙加泰罗尼亚理工大学与国家研究委员会联合机器人与工业信息研究所); Kognia Sports Intelligence (科尼亚体育智能)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Extended version of arXiv:2503.18589 (CVPR 2025)

点击查看摘要

Abstract:Multi-agent trajectory modeling traditionally focuses on forecasting, often neglecting more general tasks like trajectory completion, which is essential for real-world applications such as correcting tracking data. Existing methods also generally predict agents’ states without offering any state-wise measure of heteroscedastic uncertainty. Moreover, popular multi-modal sampling methods lack error probability estimates for each generated scene under the same prior observations, which makes it difficult to rank the predictions at inference time. We introduce U2Diffine, a unified diffusion model built to perform trajectory completion while simultaneously offering state-wise heteroscedastic uncertainty estimates. This is achieved by augmenting the standard denoising loss with the negative log-likelihood of the predicted noise, and then propagating the latent space uncertainty to the real state space using a first-order Taylor approximation. We also propose U2Diff, a faster baseline that avoids gradient computation during sampling. This approach significantly increases inference speed, making it as efficient as a standard generative-only diffusion model. For post-processing, we integrate a Rank Neural Network (RankNN) that enables error probability estimation for each generated mode, demonstrating strong correlation with ground truth errors. Our method outperforms state-of-the-art solutions in both trajectory completion and forecasting across four challenging sports datasets (NBA, Basketball-U, Football-U, Soccer-U), underscoring the effectiveness of our uncertainty and error probability estimation.
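
按摘要描述,下面给出“标准去噪损失 + 预测噪声的负对数似然”以及“一阶近似将噪声空间方差传播到状态空间”的最小示意(PyTorch)。两项损失的权重以及基于 DDPM 关系式 x0 = (x_t - sqrt(1-ᾱ)·ε)/sqrt(ᾱ) 的传播公式均为示意性假设,并非论文的确切参数化。

```python
import torch

def hetero_denoise_loss(eps_true, eps_pred, log_var):
    """Standard denoising loss augmented with a Gaussian NLL on the predicted
    noise (schematic; equal weighting of the two terms is an assumption)."""
    mse = ((eps_true - eps_pred) ** 2).mean()
    nll = 0.5 * (log_var + (eps_true - eps_pred) ** 2 / log_var.exp()).mean()
    return mse + nll

def propagate_uncertainty(var_eps, alpha_bar):
    """First-order (linear) propagation of noise-space variance to state space
    under the DDPM relation x0 = (x_t - sqrt(1 - abar) * eps) / sqrt(abar)."""
    return (1.0 - alpha_bar) / alpha_bar * var_eps

eps_true = torch.randn(4, 2, 10)            # (batch, agents, state dims), toy shapes
eps_pred = eps_true + 0.1 * torch.randn_like(eps_true)
log_var = torch.full_like(eps_true, -2.0)   # predicted per-state log-variance
print(hetero_denoise_loss(eps_true, eps_pred, log_var).item())
print(propagate_uncertainty(log_var.exp(), alpha_bar=0.8).mean().item())
```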

[CV-34] UAV-Assisted Scan-to-Simulation for Landslides Using Physics-Informed Gaussian Splatting

【速读】:该论文旨在解决现有滑坡模拟流程在视觉真实感方面的不足,这一缺陷限制了其在交互式应用、灾害传播和公众教育中的有效性。传统方法依赖数字高程模型(Digital Elevation Model, DEM)和基于网格的表示方式,虽适用于几何分析,但难以呈现逼真的场景细节。解决方案的关键在于提出一种基于无人机(UAV)的“扫描到模拟”框架,通过3D高斯散射(3DGS)实现从摄影测量重建到物理驱动模拟的无缝衔接:首先利用UAV获取边坡影像,重建低各向异性3DGS场景表示;随后对表面模型进行体素填充以构建目标区域的体积表示;最终将该表示集成至材料点法(Material Point Method, MPM)中进行滑坡动力学仿真,从而同时实现高保真视觉重建与可靠的物理模拟。

链接: https://arxiv.org/abs/2605.10715
作者: Zhenyu Liang,Jack C.P. Cheng
机构: HKUST
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Landslide monitoring and simulation play an important role in urban safety assessment and disaster prevention. Existing landslide simulation pipelines typically rely on digital elevation model and mesh-based representations, which are suitable for geometric analysis, but often lack visual realism. This limitation reduces their effectiveness in interactive applications, hazard communication, and public education. In this paper, we propose a UAV-based scan-to-simulation framework that bridges photorealistic scene capture and physics-based landslide simulation through 3DGS. Specifically, our pipeline includes four stages: (1) UAV-based acquisition of slope imagery, (2) reconstruction of a low-anisotropy 3DGS scene representation, (3) volumetric conversion of the target simulation region by filling the interior of the surface-based model, and (4) integration with the Material Point Method (MPM) for landslide simulation. We validate the proposed framework on a real landslide site in Hong Kong that experienced a severe landslide event. The results show that our method supports both realistic visual reconstruction and effective simulation.

[CV-35] TransmissiveGS: Residual-Guided Disentangled Gaussian Splatting for Transmissive Scene Reconstruction and Rendering

【速读】:该论文旨在解决透射场景(transmissive scenes)中由于近场反射与透射内容在空间和辐射度上耦合导致的重建与渲染难题,这种耦合使得标准方法难以区分表面几何与辐射成分,从而产生歧义。解决方案的关键在于提出一种名为TransmissiveGS的新框架,其核心是采用双高斯(dual-Gaussian)表示建模透射场景,并引入延迟着色函数联合渲染两个高斯分量;同时,利用反射的多视角不一致性,通过重建多视角一致内容后的残差作为线索,实现反射与透射的解耦建模,并进一步设计反射光场(reflection light field)以高保真估计近场反射,辅以高频正则化策略保留细节,最终实现高质量的透射场景重建与渲染。

链接: https://arxiv.org/abs/2605.10705
作者: Zhenyu Liang,Xiao Zhang,Tianchao Li,Jack C.P. Cheng,Chi-Keung Tang
机构: HKUST
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Transmissive scenes are ubiquitous in daily life, yet reconstructing and rendering them remains highly challenging due to the inherent entanglement between near-field reflections from the surrounding environment on the transmissive surface, and the transmitted content of the scene behind it. This coupling gives rise to dual surface geometries and dual radiance components within each observation, posing ambiguities for standard methods. We present TransmissiveGS, a novel framework for disentangled reconstruction and rendering of transmissive scenes. Specifically, we model the scene with a dual-Gaussian representation and introduce a deferred shading function to jointly render the two Gaussian components. To separate reflection and transmission, we exploit the inherent multi-view inconsistency of reflections and leverage the residuals from reconstructing multi-view consistent content as cues for disentangled geometry and appearance modeling. We further propose a reflection light field that enables high-fidelity estimation of near-field reflections. During training, we introduce a high-frequency regularization to preserve fine details. We also contribute a new synthetic dataset for evaluating transmissive surface reconstruction. Experiments on both synthetic and real-world scenes demonstrate that TransmissiveGS consistently outperforms prior Gaussian Splatting-based methods in both reconstruction and rendering quality for transmissive scenes.

[CV-36] Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Model, MLLM)在解码过程中注意力异常集中于无关图像标记的问题,这一现象通常被视为噪声并被强制纠正,但作者指出此类标记实为视觉与叙事逻辑的关键载体,强行干预会加剧视觉-语言不平衡。解决方案的核心在于提出一种无需训练的对抗性反常识均衡机制(Adversarial Counter-Commonsense Equilibrium, ACE),其通过引入反常识补丁扰动视觉上下文,利用真实视觉特征在扰动下保持稳定而幻觉响应易波动的特性,实现动态博弈式解码策略:精准抑制对扰动敏感的语言先验,同时增强稳定的视觉信号,从而恢复视觉与语言之间的平衡。

链接: https://arxiv.org/abs/2605.10676
作者: Qingxin Xiao,Peilin Zhao,Yangyang Zhao,Lingwei Dang,Qingyao Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:During MLLM decoding, attention often abnormally concentrates on irrelevant image tokens. While existing research dismisses this as invalid noise and forcibly redirects attention to compel focusing on key image information, we argue these tokens are critical carriers of visual and narrative logic, and such coercive corrections exacerbate visual-language imbalance. Adopting a “decoding-as-game” perspective, we reveal that hallucinations stem from an equilibrium imbalance between linguistic priors and visual information. We propose Adversarial Counter-Commonsense Equilibrium (ACE), a training-free framework that perturbs visual context via counter-commonsense patches. Leveraging the fact that authentic visual features remain stable under perturbation while hallucinations fluctuate, ACE implements a dynamic game decoding strategy. This approach precisely suppresses perturbation-sensitive priors while compensating for stable visual signals to restore balance. Extensive experiments demonstrate that ACE, as a plug-and-play strategy, enhances model trustworthiness with negligible inference overhead.

[CV-37] Neuromorphic Monocular Depth Estimation with Uncertainty Modeling

【速读】:该论文旨在解决基于单目事件流(monocular event streams)的深度估计问题,即如何从事件相机(event camera)输出的稀疏、异步事件数据中准确预测每个像素的深度分布。其关键解决方案在于:首先设计多种事件表示方法(包括不同时间分箱数的时空体素网格、CSTR和TORE体积),并利用U-Net架构训练深度估计模型;其次引入三种不确定性估计框架(高斯分布、对数正态分布和证据学习),以量化预测结果的置信度,并验证不确定性可有效标识可靠深度区域。实验表明,5个时间分箱的证据学习与10个分箱的对数正态学习在多个指标上表现最优,证明了不确定性建模在事件相机深度估计中的有效性。

链接: https://arxiv.org/abs/2605.10675
作者: Viktor Bergkvist,Felix Rydell,Per-Erik Forssén,David Gustafsson,Johan Rideg
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Event cameras offer distinct advantages over conventional frame-based sensors, including microsecond-level temporal resolution, high dynamic range, and low bandwidth. In this paper, we predict per-pixel depth distributions from monocular event streams using deep neural networks. We estimate uncertainty using Gaussian, log-normal, and evidential learning frameworks. We compare six event representations: spatio-temporal voxel grids with 1, 5, 10, and 20 temporal bins, the Compact Spatio-Temporal Representation (CSTR), and Time-Ordered Recent Event (TORE) volumes. Our U-Net-based models are trained on synthetic data and then fine-tuned on real sequences. We evaluate performance using absolute relative error, root mean squared error, and the area under the sparsification error. Quantitative results show that the representations perform similarly, while 10 bin log-normal and 5 bin evidential learning perform best across metrics. Our experiments demonstrate that uncertainty estimation can be successfully integrated into event-based monocular depth estimation, and be used to indicate pixels with reliable depth.
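
摘要中表现较好的设置之一使用对数正态分布建模每像素深度。下面给出对数正态负对数似然及其闭式均值/方差的最小实现(PyTorch),假设网络输出两个通道 `mu` 与 `log_sigma`(对数深度空间的均值与对数标准差);这只是常规的对数正态参数化示意,未必与论文实现完全一致。

```python
import torch

def lognormal_depth_nll(depth_gt, mu, log_sigma, eps=1e-6):
    """Per-pixel negative log-likelihood of ground-truth depth under a log-normal
    distribution parameterized by (mu, sigma) in log-depth space."""
    log_d = torch.log(depth_gt.clamp_min(eps))
    sigma = log_sigma.exp()
    nll = log_sigma + 0.5 * ((log_d - mu) / sigma) ** 2 + log_d  # + constant
    return nll.mean()

def lognormal_mean_and_var(mu, log_sigma):
    """Closed-form mean/variance of the predicted per-pixel depth distribution."""
    var_log = (2 * log_sigma).exp()                       # sigma^2 in log-depth space
    mean = torch.exp(mu + 0.5 * var_log)
    var = (var_log.exp() - 1.0) * torch.exp(2 * mu + var_log)
    return mean, var

gt = torch.rand(1, 1, 8, 8) * 10 + 0.5                    # toy depth map in meters
mu = torch.zeros_like(gt)
log_sigma = torch.full_like(gt, -1.0)
print(lognormal_depth_nll(gt, mu, log_sigma).item())
m, v = lognormal_mean_and_var(mu, log_sigma)
print(m.mean().item(), v.mean().item())
```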

[CV-38] bViT: Investigating Single-Block Recurrence in Vision Transformers for Image Recognition

【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)中深度结构是否必须依赖逐层特定参数化,还是可以通过循环计算实现的问题。其核心挑战在于厘清ViT中“深度”所蕴含的计算能力究竟有多少需要显式分层参数,又有多少可通过共享块的迭代更新来隐式表达。解决方案的关键在于提出bViT(block-wise recurrent ViT),即仅用一个共享的Transformer块通过重复应用处理图像,从而在保持深层ViT迭代结构的同时移除层间参数独立性,形成一个可控的循环机制研究框架。实验表明,在相同训练策略和计算预算下,12步bViT-B在ImageNet-1K上性能可媲美标准ViT-B,且参数量减少一个数量级;同时发现宽度更大的bViT能更充分恢复标准ViT性能,揭示出“隐式深度多路复用”机制——即共享块通过隐藏状态演化实现不同步骤的差异化计算,而非简单重复相同操作。

链接: https://arxiv.org/abs/2605.10661
作者: Michal Byra,Pawel Olszowiec,Grzegorz Stefanski,Grzegorz Gruszczynski,Alberto Presta
机构: Samsung AI Center (三星人工智能中心); Institute of Fundamental Technological Research, Polish Academy of Sciences (波兰科学院基础技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 31 pages, 16 figures

点击查看摘要

Abstract:Vision Transformers (ViTs) are built by stacking independently parameterized blocks, but it remains unclear how much of this depth requires layer specific transformations and how much can be realized through recurrent computation. We study this question with bViT, a single-block recurrent ViT in which one transformer block is applied repeatedly to process an image. This architecture preserves the iterative structure of a deep ViT while removing layer specific block parameterization, providing a controlled setting for studying recurrence in vision. On ImageNet-1K, a 12-step bViT-B achieves accuracy comparable to standard ViT-B under the same training recipe and computational budget, while using an order of magnitude fewer parameters. We observe that recurrent performance improves with representation width, with wider bViTs recovering much more of the performance of standard ViTs than narrow variants. We interpret this behavior as implicit depth multiplexing, where a shared block expresses multiple step-dependent computations through the evolving hidden state. Beyond ImageNet classification, bViT transfers competitively to downstream tasks and enables parameter-efficient fine-tuning. Mechanistic analyses of activations, attention and step-specific pruning show that the shared block changes its effective behavior across recurrent steps rather than simply repeating the same computation. Our results suggest that a large fraction of ViT depth can be implemented through recurrent reuse, provided that the representation space is sufficiently wide.
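
bViT 的核心思想可以用很短的代码说明:整个网络只有一个共享的 Transformer block,前向时重复应用若干步。下面是一个自包含的 PyTorch 玩具实现,图像尺寸、通道数、步数等均为示意取值,与论文的 ViT-B 配置和训练方案无关。

```python
import torch
import torch.nn as nn

class RecurrentViT(nn.Module):
    """Minimal single-block recurrent ViT in the spirit of bViT: one shared
    transformer block applied `steps` times (all sizes here are illustrative)."""
    def __init__(self, img=32, patch=4, dim=192, heads=3, steps=12, classes=10):
        super().__init__()
        n = (img // patch) ** 2
        self.patch = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.steps = steps
        self.head = nn.Linear(dim, classes)

    def forward(self, x):
        tok = self.patch(x).flatten(2).transpose(1, 2)            # (B, N, dim) patch tokens
        tok = torch.cat([self.cls.expand(len(x), -1, -1), tok], dim=1) + self.pos
        for _ in range(self.steps):                               # reuse the same block
            tok = self.block(tok)
        return self.head(tok[:, 0])                               # classify from the CLS token

model = RecurrentViT()
print(model(torch.randn(2, 3, 32, 32)).shape)                     # torch.Size([2, 10])
```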

[CV-39] GenMed: A Pairwise Generative Reformulation of Medical Diagnostic Tasks

【速读】:该论文旨在解决传统数据驱动的医学人工智能(Medical AI)模型在真实临床环境中面对异质性数据和多模态输入时泛化能力不足的问题。传统方法通常采用判别式映射(discriminative mapping),即从输入 X 到输出 Y 的固定函数 f,难以适应未见过的观测组合或模态变化。解决方案的关键在于提出一种全新的生成式范式(generative paradigm),通过扩散模型(diffusion models)建模输入与输出的联合分布 P(X,Y),并将推理过程重构为测试时的输出优化问题(test-time output optimization)。该方法无需架构调整或重新训练,即可在推理阶段利用梯度引导生成过程以匹配观测输入,从而实现对任意新组合观测的灵活条件化,显著提升模型在跨模态分割、少样本分割、退化输入分割及零样本任务中的通用性和鲁棒性。

链接: https://arxiv.org/abs/2605.10645
作者: Hantao Zhang,Weidong Guo,Yuhe Liu,Jiancheng Yang,Sathvik Bhagavan,Danli Shi,Mingda Xu,Pascal Fua
机构: CVLab, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland; Fudan University, Shanghai, China; Beihang University, Beijing, China; ELLIS Institute Finland, Finland; Aalto University, Finland; The Hong Kong Polytechnic University, Hong Kong SAR, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Data-driven medical AI is traditionally formulated as a discriminative mapping from input X to output Y via a learned function f, which does not generalize well across heterogeneous data and modalities encountered in real-world clinical settings. In this work, we propose a fundamentally different, generative paradigm. We model the joint distribution P(X,Y) using diffusion models and reframe inference as a test-time output optimization problem. By guiding the generative process to match observed inputs, our framework enables flexible, gradient-based conditioning at inference time without architectural changes or retraining, effectively supporting arbitrary and previously unseen combinations of observations. Extensive experiments demonstrate strong performance across standard and cross-modality medical image segmentation, few-shot segmentation with only 2 or 4 training samples, degraded-input segmentation, shape completion from sparse and partial observations, and zero-shot application to demonstrate generality. To support these evaluations, we curated and released a large-scale text-shape dataset derived from MedShapeNet. Our results highlight the versatility of generative joint modeling as a foundation for reusable, task-agnostic medical AI systems.
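
下面用 PyTorch 勾勒“联合扩散模型 + 测试时梯度引导”的推理骨架:在每个反向步骤中,对联合样本中被观测到的 X 部分加入数据一致性梯度。其中的去噪器、噪声调度与引导强度 `guidance` 均为玩具占位,并非论文的采样器,仅说明无需重训练即可按任意观测组合进行条件化的思路。

```python
import torch

def guided_joint_sampling(denoiser, x_obs, obs_mask, steps=50, guidance=1.0):
    """Schematic test-time conditioning for a joint diffusion model over
    z = [X, Y]: run an unconditional reverse process and, at every step, nudge
    the sample with the gradient of a data-consistency term on the observed
    part of X. Scheduler and denoiser are toy placeholders."""
    z = torch.randn_like(x_obs)
    for i in reversed(range(steps)):
        t = torch.full((z.shape[0],), i / steps)
        z = z.detach().requires_grad_(True)
        z0_hat = denoiser(z, t)                              # predicted clean joint sample
        # data consistency: the predicted X should match the observation where known
        consistency = ((z0_hat - x_obs) ** 2 * obs_mask).sum()
        grad = torch.autograd.grad(consistency, z)[0]
        with torch.no_grad():
            z = z0_hat + (i / steps) * torch.randn_like(z)   # crude re-noising step
            z = z - guidance * grad                          # gradient-based guidance
    return z

# toy usage: identity-like denoiser, observe the first half of the vector
denoiser = lambda z, t: z * (1.0 - t.view(-1, 1))
x_obs = torch.zeros(2, 8)
x_obs[:, :4] = 1.0
mask = torch.zeros_like(x_obs)
mask[:, :4] = 1.0
out = guided_joint_sampling(denoiser, x_obs, mask)
print(out.shape)
```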

[CV-40] LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models

【速读】:该论文旨在解决大型视觉语言模型(Vision-Language Models, VLMs)在实际部署中面临的内存和计算资源消耗过高的问题。传统知识蒸馏(Knowledge Distillation)方法通过将高容量教师网络(Teacher network)的知识迁移至小型学生网络(Student network)以提升效率,但存在容量差距过大导致知识传递效果下降的问题。解决方案的关键在于提出一种自底向上的级联知识蒸馏(Cascaded Knowledge Distillation, CKD)框架:不再依赖单一高容量教师网络,而是引入一个或多个中间容量的教师网络,形成渐进式知识传递路径,使学生网络逐步适应更高层次的知识,从而在保持高效性的同时显著提升泛化性能。该方法借鉴了人类教育体系中的分阶段教学机制,在基于 LLaVA 方法构建的模型上验证了其在七个标准视觉问答(Visual Question Answering, VQA)基准上的最先进(SOTA)表现。

链接: https://arxiv.org/abs/2605.10641
作者: Nikolaos Gkalelis,Vasileios Mezaris
机构: CERTH-ITI (希腊信息技术研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Under review

点击查看摘要

Abstract:Large Vision-Language Models (VLMs) are successful in addressing a multitude of vision-language understanding tasks, such as Visual Question Answering (VQA), but their memory and compute requirements remain a concern for practical deployment. A promising class of techniques for mitigating this concern is Knowledge Distillation, where knowledge from a high-capacity Teacher network is transferred to a considerably smaller Student network. However, the capacity gap between the two networks is both a blessing and a curse: the smaller the Student network, the better its efficiency, and the larger the Teacher, the more knowledge it carries; yet, beyond a point, the larger capacity gap between the two leads to worse knowledge transfer. To counter this effect, we propose a bottom-up cascaded knowledge distillation (CKD) framework. Instead of treating knowledge transfer as an activity involving one high-capacity Teacher (or an ensemble of such), inspired by human formal education systems, we introduce one (potentially, more) additional Teacher(s) of intermediate capacity that gradually bring the Student network to the next level, where the next (higher-capacity) Teacher can take over. We provide a theoretical analysis in order to study the effect of cascaded distillation on the generalization performance of the Student. We apply the proposed framework on models built upon the LLaVA methodology and evaluate the derived models on seven standard, publicly available VQA benchmarks, demonstrating their SotA performance.
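
级联蒸馏的流程可以概括为:按容量从小到大依次让学生向各级教师对齐,使每个阶段的容量差保持较小。下面是一个与具体 VQA 任务无关的最小示意(PyTorch),省略了任务损失与 LLaVA 的多模态结构,温度、学习率等超参数均为假设值。

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Standard soft-label distillation loss (KL between tempered distributions)."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

def cascaded_distill(student, teachers, loader, epochs_per_stage=1, lr=1e-4):
    """Bottom-up cascade: distill from the intermediate teacher first, then the
    larger one, so the capacity gap at each stage stays small. `teachers` is
    ordered from smallest to largest; all models map inputs to logits."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for teacher in teachers:                      # one stage per teacher, bottom-up
        teacher.eval()
        for _ in range(epochs_per_stage):
            for x in loader:
                with torch.no_grad():
                    t_logits = teacher(x)
                loss = kd_loss(student(x), t_logits)
                opt.zero_grad()
                loss.backward()
                opt.step()
    return student

# toy usage with linear "models" and random data
student = torch.nn.Linear(16, 5)
teachers = [torch.nn.Linear(16, 5), torch.nn.Linear(16, 5)]   # intermediate, then large
loader = [torch.randn(8, 16) for _ in range(4)]
cascaded_distill(student, teachers, loader)
```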

[CV-41] Product-of-Gaussian-Mixture Diffusion Models for Joint Nonlinear MRI Reconstruction

【速读】:该论文旨在解决当前基于扩散模型(diffusion models)的磁共振成像(MRI)重建方法中存在的两大问题:一是现有方法通常依赖于结构复杂且缺乏可解释性的大型网络以及不透明的时间条件机制,二是这些方法普遍需要离线估计线圈灵敏度(coil sensitivity),导致重建过程的可解释性差且对采集设置变化的适应性弱。解决方案的关键在于提出一种联合重建图像与线圈灵敏度的新框架,该框架将参数高效的高斯混合产物扩散模型(product-of-Gaussian-mixture diffusion model)作为图像先验,并结合对线圈灵敏度的经典平滑先验,从而在保持高效性的同时提升对对比度、解剖分布变化及k空间轨迹变动的鲁棒性;此外,还引入了更具表达能力的图像先验参数化方式,进一步优化去噪和MRI重建性能。

链接: https://arxiv.org/abs/2605.10629
作者: Laurenz Nagler,Martin Zach,Thomas Pock
机构: Graz University of Technology (格拉茨工业大学); École Polytechnique Fédérale de Lausanne (洛桑联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, diffusion models have attracted considerable attention for magnetic resonance image reconstruction due to their high sample quality. However, most existing methods rely on large networks with opaque time-conditioning mechanisms, and require offline coil sensitivity estimation. This results in limited interpretability of the reconstruction process and reduced flexibility in the acquisition setup. To address these limitations, we jointly reconstruct the image and the coil sensitivities by combining the parameter-efficient product-of-Gaussian-mixture diffusion model as an image prior with a classical smoothness prior on the coil sensitivities. The proposed method is fast and robust to both contrast and anatomical distribution shifts as well as changing k-space trajectories. Finally, we propose a more expressive parameterization of the image prior which improves results in denoising and magnetic resonance image reconstruction.

[CV-42] Hypergraph-Enhanced Training-Free and Language-Free Few-Shot Anomaly Detection

【速读】:该论文旨在解决少样本异常检测(Few-shot Anomaly Detection, FSAD)中普遍存在的三大挑战:(i)对特定任务或数据集的训练/微调依赖;(ii)对语言监督或人工精心设计提示(prompt)的依赖;(iii)跨域鲁棒性不足。其解决方案的关键在于提出 HyperFSAD 框架,该框架完全无需训练和语言提示,且具备强跨域适应能力。核心创新包括:(1)基于超图(hypergraph)结构的稀疏超匹配机制(Sparse Hyper Matching),通过 sparsemax 选择最相关的支持补丁并聚合为紧凑的正常证据超边(hyperedge),有效抑制背景噪声与干扰;(2)双分支图像评分机制(Dual-Branch Image Scoring),融合来自补丁网格的局部空间异常证据与支持感知的 CLS 匹配所捕获的全局语义偏差,实现纯视觉驱动的鲁棒图像级异常评分。所有组件均为纯视觉输入,显著提升了方法的实用性与泛化能力。

链接: https://arxiv.org/abs/2605.10628
作者: Guohuan Xie,Xin He,Dingying Fan,Siqi Li,Yun Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Few-shot anomaly detection (FSAD) has made significant strides, yet existing methods still face critical challenges: (i) dependence on task- or dataset-specific training/fine-tuning, (ii) reliance on language supervision or carefully hand-crafted prompts, and (iii) limited robustness across domains. In this paper, we introduce HyperFSAD, a novel FSAD framework that is training-free, language-free, and robust across domains, offering a powerful solution to these challenges. Built upon DINOv3 and a hypergraph-based inference mechanism, our approach performs inference without any task-specific optimization or text prompts, while remaining competitive. Specifically, we replace sensitive nearest-neighbor / top-n matching with Sparse Hyper Matching: sparsemax first selects the most relevant support patches, which are then aggregated into a hyperedge as compact normal evidence to suppress background noise and distractors. We further introduce Dual-Branch Image Scoring, which fuses spatial anomaly evidence from the patch-grid anomaly map with global semantic deviation captured by support-aware CLS matching, yielding a robust image-level anomaly score in a strictly visual manner. Notably, all components of HyperFSAD are purely visual, eliminating the need for labor-intensive hand-crafted text prompts. Under the stringent training-free and language-free setting, HyperFSAD achieves state-of-the-art performance across six datasets spanning four industrial datasets (MVTecAD, VisA, MPDD, BTAD) and two medical datasets (RESC, BraTS).
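
下面给出 Sparse Hyper Matching 的最小示意(PyTorch):对每个查询 patch,用 sparsemax 在支持集 patch 上得到稀疏权重,聚合成一条“超边”原型,再以 1 减余弦相似度作为 patch 级异常分数。未包含 DINOv3 特征提取与双分支图像级评分,细节以论文为准。

```python
import torch
import torch.nn.functional as F

def sparsemax(z):
    """Sparsemax over the last dimension (Martins & Astudillo, 2016)."""
    z_sorted, _ = torch.sort(z, dim=-1, descending=True)
    k = torch.arange(1, z.shape[-1] + 1, device=z.device, dtype=z.dtype)
    cumsum = z_sorted.cumsum(dim=-1)
    support = 1 + k * z_sorted > cumsum                   # which sorted entries stay nonzero
    k_z = support.sum(dim=-1, keepdim=True).clamp_min(1)
    tau = (cumsum.gather(-1, k_z - 1) - 1) / k_z
    return torch.clamp(z - tau, min=0.0)

def patch_anomaly_scores(query_feats, support_feats):
    """Schematic sparse hyper-matching: for every query patch, select relevant
    support patches with sparsemax, aggregate them into a 'hyperedge' prototype,
    and score the patch by its distance to that prototype (1 - cosine sim)."""
    q = F.normalize(query_feats, dim=-1)                  # (Nq, D)
    s = F.normalize(support_feats, dim=-1)                # (Ns, D)
    sim = q @ s.t()                                       # (Nq, Ns) cosine similarities
    weights = sparsemax(sim)                              # sparse attention over support patches
    hyperedge = F.normalize(weights @ s, dim=-1)          # (Nq, D) aggregated normal evidence
    return 1.0 - (q * hyperedge).sum(dim=-1)              # anomaly score per query patch

q = torch.randn(10, 64)
s = torch.randn(50, 64)
print(patch_anomaly_scores(q, s).shape)                   # torch.Size([10])
```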

[CV-43] Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination ACL2026

【速读】:该论文旨在解决大视觉语言模型(Large Vision-Language Models, LVLMs)中存在的幻觉问题,即模型生成与视觉输入相矛盾的文本内容。研究表明,传统观点认为幻觉源于视觉注意力不足,但本文通过logit lens分析发现了一种新的异常现象——“词汇劫持”(Vocabulary Hijacking),其关键机制在于特定视觉标记(称为惰性标记,Inert Tokens)会过度吸引注意力,并且这些标记在中间隐藏状态投影到词汇空间时,始终解码为一组固定且无关的词(称为劫持锚点,Hijacking Anchors),导致语义坍缩。解决方案的核心是提出基于劫持锚点识别(Hijacking Anchor-Based Identification, HABI)的方法来精准定位这些惰性标记,并引入非劫持视觉注意力比(Non-Hijacked Visual Attention Ratio, NHAR)作为量化指标以筛选出对事实准确性至关重要的注意力头;在此基础上,进一步设计无需训练的干预策略——劫持感知视觉注意力增强(Hijacking-Aware Visual Attention Enhancement, HAVAE),通过强化关键注意力头对显著视觉内容的关注,有效减少幻觉,同时保持模型整体性能不变。

链接: https://arxiv.org/abs/2605.10622
作者: Yangneng Chen,Junlin Li,Weijun Yao,Xilai Ma,Guodong Du,Wenya Wang,Jing Li
机构: Harbin Institute of Technology (Shenzhen), China; Huawei Technologies Co., Ltd.; The Hong Kong Polytechnic University; Nanyang Technological University
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ACL 2026 Main

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have achieved remarkable progress in multimodal tasks, yet their reliability is persistently undermined by hallucinations: generating text that contradicts visual input. Recent studies often attribute these errors to inadequate visual attention. In this work, we analyze the attention mechanisms via the logit lens, uncovering a distinct anomaly we term Vocabulary Hijacking. We discover that specific visual tokens, defined as Inert Tokens, disproportionately attract attention. Crucially, when their intermediate hidden states are projected into the vocabulary space, they consistently decode to a fixed set of unrelated words (termed Hijacking Anchors) across layers, revealing a rigid semantic collapse. Leveraging this semantic rigidity, we propose Hijacking Anchor-Based Identification (HABI), a robust strategy to accurately localize these Inert Tokens. To quantify the impact of this phenomenon, we introduce the Non-Hijacked Visual Attention Ratio (NHAR), a novel metric designed to identify attention heads that remain resilient to hijacking and are critical for factual accuracy. Building on these insights, we propose Hijacking-Aware Visual Attention Enhancement (HAVAE), a training-free intervention that selectively strengthens the focus of these identified heads on salient visual content. Extensive experiments across multiple benchmarks demonstrate that HAVAE significantly mitigates hallucinations with no additional computational overhead, while preserving the model’s general capabilities. Our code is publicly available at this https URL.
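
按摘要的描述,可以用 logit lens 把视觉 token 的中间层隐状态投影到词表空间,统计其跨层 top-1 解码的一致性来定位候选惰性 token,并据此计算每个注意力头的 NHAR。下面的阈值与 NHAR 计算方式是对摘要的示意性解读,并非论文的确切定义。

```python
import torch

def find_inert_tokens(hidden_states, unembed, min_layers_consistent=0.9):
    """Logit-lens sketch: decode each visual token's intermediate hidden state at
    every layer and flag tokens whose top-1 decoding barely changes across layers
    (the 'semantic rigidity' cue described in the abstract).

    hidden_states: (L, N, D) per-layer hidden states of N visual tokens
    unembed:       (V, D) output (unembedding) matrix of the language model
    """
    logits = hidden_states @ unembed.t()                  # (L, N, V)
    top1 = logits.argmax(dim=-1)                          # (L, N) decoded word id per layer
    mode = top1.mode(dim=0).values                        # most frequent decoding per token
    consistency = (top1 == mode).float().mean(dim=0)      # (N,) fraction of layers agreeing
    return consistency >= min_layers_consistent           # bool mask of candidate inert tokens

def non_hijacked_visual_attention_ratio(attn_to_visual, inert_mask):
    """NHAR per attention head: share of visual attention mass landing on
    visual tokens *not* flagged as inert (hijacked)."""
    total = attn_to_visual.sum(dim=-1).clamp_min(1e-8)    # (H,)
    clean = attn_to_visual[:, ~inert_mask].sum(dim=-1)
    return clean / total

L, N, D, V, H = 24, 32, 64, 1000, 8
hs = torch.randn(L, N, D)
W = torch.randn(V, D)
inert = find_inert_tokens(hs, W)
attn = torch.rand(H, N)                                   # attention from a text query to visual tokens
print(inert.sum().item(), non_hijacked_visual_attention_ratio(attn, inert).shape)
```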

[CV-44] Segment Anything with Robust Uncertainty-Accuracy Correlation ICML2026

【速读】:该论文旨在解决Segment Anything Model (SAM) 在域偏移(domain shift)下因掩码级置信度混淆(Mask-level Confidence Confusion, MCC)而导致的可靠性下降问题,即单一IoU-based掩码分数无法准确反映边界附近像素级别的可靠性。其核心解决方案是提出鲁棒不确定性-准确性相关性(Robust Uncertainty-Accuracy Correlation, RUAC),通过引入轻量级不确定性头,并采用协同风格-形变攻击(collaborative style-deformation attack)联合扰动纹理与几何信息进行训练,同时应用不确定性-准确性对齐(Uncertainty-Accuracy Alignment)机制,确保不确定性估计在对抗扰动下仍能准确标注错误像素,从而提升跨23个零样本域下的分割质量与不确定性忠实度。

链接: https://arxiv.org/abs/2605.10603
作者: Hongyou Zhou,Marc Toussaint,Ling Shao,Zihan Ye
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2026

点击查看摘要

Abstract:Despite strong zero-shot performance, SAM is unreliable under domain shift due to Mask-level Confidence Confusion (MCC), where a single IoU-based mask score fails to reflect pixel-wise reliability near boundaries. Motivated by the contrast between texture-biased shortcuts in neural networks and shape-centric processing in human vision, we model out-of-domain variation as appearance shifts and non-rigid deformations that jointly stress calibration. We propose Segment Anything with Robust Uncertainty-Accuracy Correlation (RUAC) for robust pixel-wise uncertainty estimation under appearance and deformation shifts. RUAC adds a lightweight uncertainty head, trains it with a collaborative style-deformation attack that jointly perturbs texture and geometry, and applies Uncertainty-Accuracy Alignment to ensure uncertainty consistently highlights erroneous pixels even under adversarial perturbations. Across 23 zero-shot domains, RUAC improves segmentation quality and yields more faithful uncertainty with stronger uncertainty-accuracy correlation. Project page: this https URL.

[CV-45] Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence NEURIPS2026

【速读】:该论文旨在解决当前大型多模态模型(Large Multimodal Models, LMMs)在空间推理任务中因仅依赖单一静态视角而导致的性能瓶颈问题,尤其是那些需要视点依赖理解的任务。其解决方案的关键在于提出“新视角思维”(Thinking with Novel Views, TwNV)范式,将生成式新视角合成(generative novel-view synthesis)嵌入到推理循环中:由一个LMM推理器识别空间模糊性,指令绘图模块(Painter)生成替代视角图像,并基于新增证据重新审视场景。该方法通过迭代式多轮视角优化显著提升了空间推理准确性,验证了新视角生成作为增强LMM空间智能的有效手段。

链接: https://arxiv.org/abs/2605.10588
作者: Yanbing Zhang,Bo Wang,Jianhui Liu,Nan Jiang,Jiaxiu Jiang,Haoze Sun,Yijun Yang,Shenghe Zheng,Lin Song,Haoyang Huang,Nan Duan,Wenbo Li
机构: Joy Future Academy; OpenAI; Google DeepMind; Qwen (Qwen)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to NeurIPS 2026

点击查看摘要

Abstract:Current Large Multimodal Models (LMMs) struggle with spatial reasoning tasks requiring viewpoint-dependent understanding, largely because they are confined to a single, static observation. We propose Thinking with Novel Views (TwNV), a paradigm that integrates generative novel-view synthesis into the reasoning loop: a Reasoner LMM identifies spatial ambiguity, instructs a Painter to synthesize an alternative viewpoint, and re-examines the scene with the additional evidence. Through systematic experiments we address three research questions. (1) Instruction format: numerical camera-pose specifications yield more reliable view control than free-form language. (2) Generation fidelity: synthesized view quality is tightly coupled with downstream spatial accuracy. (3) Inference-time visual scaling: iterative multi-turn view refinement further improves performance, echoing recent scaling trends in language reasoning. Across four spatial subtask categories and four LMM architectures (both closed- and open-source), TwNV consistently improves accuracy by +1.3 to +3.9 pp, with the largest gains on viewpoint-sensitive subtasks. These results establish novel-view generation as a practical lever for advancing spatial intelligence of LMMs.

[CV-46] CausalGS: Learning Physical Causality of 3D Dynamic Scenes with Gaussian Representations ICMR2026

【速读】:该论文旨在解决从多视角视频中学习复杂动态三维场景的物理模型问题,目标是理解物理规律并预测物体未来的运动轨迹,而无需依赖显式先验知识或高质量几何重建。其解决方案的关键在于提出CausalGS框架,该框架通过一个逆向物理推理模块,将复杂的动力学问题解耦为两个可联合推断的因素:表示场景运动学的初始速度场和决定动力学特性的内在材料属性;随后利用这些推断出的物理信息,在可微分物理模拟器中以物理正则化方式引导学习过程,从而实现对多物理属性间复杂交互关系的因果理解与长期未来帧外推。

链接: https://arxiv.org/abs/2605.10586
作者: Nengbo Lu,Minghua Pan
机构: Guilin University of Electronic Technology(桂林电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICMR2026 Accepted

点击查看摘要

Abstract:Learning a physical model from video data that can comprehend physical laws and predict the future trajectories of objects is a formidable challenge in artificial intelligence. Prior approaches either leverage various Partial Differential Equations (PDEs) as soft constraints in the form of PINN losses, or integrate physics simulators into neural networks; however, they often rely on strong priors or high-quality geometry reconstruction. In this paper, we propose CausalGS, a framework that learns the causal dynamics of complex dynamic 3D scenes solely from multi-view videos, while dispensing with the reliance on explicit priors. At its core is an inverse physics inference module that decouples the complex dynamics problem from the video into the joint inference of two factors: the initial velocity field representing the scene’s kinematics, and the intrinsic material properties governing its dynamics. This inferred physical information is then utilized within a differentiable physics simulator to guide the learning process in a physics-regularized manner. Extensive experiments demonstrate that CausalGS surpasses the state-of-the-art on the highly challenging task of long-term future frame extrapolation, while also exhibiting advanced performance in novel view interpolation. Crucially, our work shows that, without any human annotation, the model is able to learn the complex interactions between multiple physical properties and understand the causal relationships driving the scene’s dynamic evolution, solely from visual observations.

[CV-47] FrequencyCT: Frequency domain pseudo-label generation for self-supervised low-dose CT denoising

【速读】:该论文旨在解决低剂量计算机断层扫描(Low-dose CT)图像中噪声相关性难以有效抑制的问题,尤其针对现有研究较少利用投影域(projection domain)数据特性来缓解噪声相关性的局限。其解决方案的关键在于提出 FrequencyCT,一种零样本自监督方法,通过在频率域生成伪标签(pseudo-label)实现无监督学习:首先利用频率域中噪声与干净信号高度分离的特性,采用区域低频锚定技术提取稳定特征;随后在高频区域实施相位保持的幅度调制与掩码扰动,以生成用于自监督训练的伪标签数据;同时,考虑到投影域中噪声方差波动导致的梯度不稳定问题,引入截断策略稳定网络优化过程。该方法在多个公开和真实世界数据集上验证了有效性,展现出显著的临床应用潜力。

链接: https://arxiv.org/abs/2605.10583
作者: Guoquan Wei,Liu Shi,Chong Chen,Qiegen Liu
机构: Nanchang University (南昌大学); Chinese Academy of Sciences (中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite extensive research on computed tomography (CT) denoising, few studies exploit projection-domain data characteristics to mitigate noise correlation. To address this, this work proposes FrequencyCT, the first zero-shot self-supervised method for pseudo-label generation in the frequency domain for low-dose CT denoising. Leveraging the characteristic of the frequency domain that largely isolates noise from clean signals, a regional low-frequency anchoring technique is proposed. Phase-preserving amplitude modulation and mask perturbation in the high-frequency region generate pseudo-label data for self-supervision. The fluctuating noise variance in the projection domain prompts truncation of the generated samples to stabilize the network’s optimization gradient. Evaluation results on multiple public and real-world datasets confirm the clinical application potential of this research, which will have a revolutionary impact on the field of denoising. The code can be obtained from this https URL.
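
下面用 NumPy 给出频域伪标签生成的示意流程:低频区域保持不变(锚定),高频区域做保相位的幅度调制并随机掩码,再逆变换回图像域。低频半径、幅度调制范围与掩码比例等超参数均为假设值,与论文在投影域的具体做法未必一致。

```python
import numpy as np

def frequency_pseudo_label(img, low_freq_radius=0.15, amp_scale_range=(0.8, 1.2),
                           mask_ratio=0.1, seed=0):
    """Schematic frequency-domain pseudo-label generation: anchor the low-frequency
    region, then perturb the high-frequency region with phase-preserving amplitude
    modulation plus random masking. All hyperparameters here are assumptions."""
    rng = np.random.default_rng(seed)
    H, W = img.shape
    spec = np.fft.fftshift(np.fft.fft2(img))
    amp, phase = np.abs(spec), np.angle(spec)

    # Low-frequency anchor: a centered disk of relative radius `low_freq_radius`.
    yy, xx = np.mgrid[:H, :W]
    r = np.hypot(yy - H / 2, xx - W / 2) / (0.5 * min(H, W))
    high = r > low_freq_radius

    # Phase-preserving amplitude modulation + random masking in the high-frequency band.
    scale = rng.uniform(*amp_scale_range, size=amp.shape)
    drop = rng.random(amp.shape) < mask_ratio
    amp_new = np.where(high, amp * scale, amp)
    amp_new = np.where(high & drop, 0.0, amp_new)

    pseudo = np.fft.ifft2(np.fft.ifftshift(amp_new * np.exp(1j * phase))).real
    return pseudo.astype(img.dtype)

noisy = np.random.randn(64, 64).astype(np.float32)        # stand-in for a noisy projection/image
pair = frequency_pseudo_label(noisy)
print(pair.shape, float(np.abs(pair - noisy).mean()))
```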

[CV-48] Polygon-mamba: Retinal vessel segmentation using polygon scanning mamba and space-frequency collaborative attention

【速读】:该论文旨在解决小尺寸视网膜血管(small retinal vessels)分割难题,这一任务在眼科疾病诊断与评估中至关重要但长期面临挑战。其关键解决方案包括两个核心创新:一是提出多层反向扫描的多边形扫描视觉状态空间模型(polygon scanning visual state space model, PS-VSS),通过改进传统Mamba架构的水平-垂直扫描方式,有效保持小血管结构的拓扑连通性,减少信息丢失;二是设计空间-频率协同注意力机制(space-frequency collaborative attention mechanism, SFCAM),嵌入跳跃连接中以联合提取空间域的位置结构信息和频域的全局感知与局部细节特征,从而动态增强关键特征并抑制噪声干扰。该方法在DRIVE、STARE和CHASE_DB1三个公开数据集上均取得优异性能,验证了其有效性。

链接: https://arxiv.org/abs/2605.10581
作者: Yuanyuan Peng,Wen Li,Xiong Li,Juan Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Retinal vessel segmentation is crucial for diagnosis and assessment of ocular diseases. Notably, segmentation of small retinal vessels has been consistently recognized as a challenging and complex task. To tackle this challenge, we design a hybrid CNN-Mamba fusion network that integrates polygon scanning Mamba and a space-frequency collaborative attention mechanism for the detection of small vessels. Considering that the traditional Mamba architecture with horizontal-vertical scanning may compromise the topological integrity of target structures and result in local discontinuities in small retinal vessels, we present a polygon scanning visual state space model (PS-VSS) that identifies small vessel structural features via a multi-layer reverse scanning strategy, which effectively preserves pixel connectivity and thereby substantially mitigates the loss of information pertaining to small vessels. Furthermore, since the spatial domain prioritizes positional and structural information, while the frequency domain emphasizes global perception and local detail components, a space-frequency collaborative attention mechanism (SFCAM) is introduced within the skip connection to extract efficient features from the spatial and frequency domains. This strategy empowers the model to dynamically enhance the key features while effectively suppressing clutters. To assess the efficacy of our model, it was tested on three publicly available datasets: DRIVE, STARE, and CHASE_DB1. Compared to manual annotations, our model demonstrated F1 scores of 0.8283, 0.8282, and 0.8251, Area Under Curve (AUC) values of 0.9806, 0.9840, and 0.9866, and Sensitivity (SE) values of 0.8268, 0.8314, and 0.8484 across the three datasets, respectively. The effectiveness of our model was validated through both visual inspection and quantitative analysis.

[CV-49] SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

【速读】:该论文旨在解决遥感(Remote Sensing, RS)图像低层视觉感知中图像质量评估(Image Quality Assessment, IQA)方法输出不可解释的标量分数,无法刻画物理驱动的RS退化特征,与RS专家诊断需求严重脱节的问题。现有视觉语言模型(Vision-Language Models, VLMs)虽能提供语言引导的IQA,但其视觉先验严重偏向地面自然图像,导致在遥感领域存在域差距(domain gap),其对RS伪影的感知与描述能力尚未充分验证。解决方案的关键在于提出首个面向RS低层视觉感知与描述的诊断基准——SenseBench,该基准基于物理驱动的分层分类体系,统一非参考与参考范式,包含超过10,000个精心标注的实例,覆盖6大类和22细粒度的RS退化类别,并设计了客观低层感知与主观诊断描述两种互补评估协议,从而系统性揭示VLMs在遥感场景下的偏置、多退化混淆、流畅性幻觉及感知-描述倒置效应等关键问题,为提升VLMs在RS低层感知中的性能提供可靠评测平台与高质量数据集。

链接: https://arxiv.org/abs/2605.10576
作者: Chen Zhong,Xiao An,Jiaxing Sun,Zihan Gui,Guangyi Yang,Wei He
机构: Wuhan University (武汉大学); Shanghai Artificial Intelligent Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Low-level visual perception underpins reliable remote sensing (RS) image analysis, yet current image quality assessment (IQA) methods output uninterpretable scalar scores rather than characterizing physics-driven RS degradations, deviating markedly from the diagnostic needs of RS experts. While Vision-Language Models (VLMs) present a compelling alternative by delivering language-grounded IQA, their visual priors are heavily biased toward ground-level natural images. Consequently, whether VLMs can overcome this domain gap to perceive and articulate RS artifacts remains insufficiently studied. To bridge this gap, we propose SenseBench, the first dedicated diagnostic benchmark for RS low-level visual perception and description. Driven by a physics-based hierarchical taxonomy that unifies both non-reference and reference-based paradigms, SenseBench features over 10K meticulously curated instances across 6 major and 22 fine-grained RS degradation categories. Specifically, two complementary protocols are designed for evaluation: objective low-level visual perception and subjective diagnostic description. Comprehensive evaluation of 29 state-of-the-art VLMs reveals not only skewed domain priors and multi-distortion collapse, but also a fluency illusion and a perception-description inversion effect. We hope SenseBench provides a robust evaluation testbed and high-quality diagnostic data to advance the development of VLMs in RS low-level perception. Code and datasets are available at this https URL.

[CV-50] VeloGauss: Learning Physically Consistent Gaussian Velocity Fields from Videos ICME2026

【速读】:该论文旨在解决从动态多视角视频中联合建模三维场景的几何、外观和物理信息的问题,且不依赖任何物理先验知识。现有方法通常仅将物理损失作为软约束或嵌入物理模拟到神经网络中,难以有效学习复杂运动物理规律;同时,尽管速度场建模具备捕捉真实物理信息的潜力,但因缺乏合适的物理约束,当前方法无法正确学习刚体与非刚体粒子间的交互机制。解决方案的关键在于提出VeloGauss框架,通过引入物理编码(Physics Code)和粒子动力学系统(Particle Dynamics System)来学习每个高斯粒子的速度场,并最终结合全局物理约束(Global Physical Constraints)确保场景整体的物理一致性,从而在无需物理先验的情况下实现对复杂动态3D场景的高质量建模。

链接: https://arxiv.org/abs/2605.10567
作者: Nengbo Lu,Bin Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICME2026 Accepted

点击查看摘要

Abstract:In this paper, we aim to jointly model the geometry, appearance, and physical information of 3D scenes solely from dynamic multi-view videos, without relying on any physical priors. Existing works typically employ physical losses merely as soft constraints or integrate physical simulations into neural networks; however, these approaches often fail to effectively learn complex motion physics. Although modeling velocity fields holds the potential to capture authentic physical information, due to the lack of appropriate physical constraints, current methods are unable to correctly learn the interaction mechanisms between rigid and non-rigid particles. To address this, we propose VeloGauss, designed to learn the physical properties of complex dynamic 3D scenes without physical priors. Our method learns the velocity field for each Gaussian particle by introducing a Physics Code and a Particle Dynamics System, and ultimately incorporates Global Physical Constraints to ensure the physical consistency of the scene. Extensive experiments on four public datasets demonstrate that our method achieves state-of-the-art performance in both Novel View Interpolation and Future Frame Extrapolation tasks.

[CV-51] DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving ICML2026

【速读】:该论文旨在解决当前端到端自动驾驶系统中视觉语言模型(Vision-Language Model, VLM)推理机制缺乏针对自动驾驶场景深度适配的问题,尤其是视觉推理模块在长时序建模与复杂长尾场景下的性能不足。其解决方案的关键在于提出一种驾驶世界模型(driving world model),该模型在鸟瞰图(bird’s-eye-view, BEV)空间中并行预测连续未来帧的潜在语义特征,从而实现对未来世界状态的长时程建模;同时引入一种高效且自适应的文本推理机制,利用额外的社会知识和推理能力,在挑战性长尾场景中进一步提升驾驶决策性能。该方法在闭环Bench2drive基准上实现了当前最优(state-of-the-art, SOTA)结果。

链接: https://arxiv.org/abs/2605.10564
作者: Lingjun Zhang,Changjie Wu,Linzhe Shi,Jiangyang Li,Jiaxin Liu,Lei Yang,Hang Zhang,Mu Xu,Hong Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: ICML 2026

点击查看摘要

Abstract:End-to-end autonomous driving systems are increasingly integrating Vision-Language Model (VLM) architectures, incorporating text reasoning or visual reasoning to enhance the robustness and accuracy of driving decisions. However, the reasoning mechanisms employed in most methods are direct adaptations from general domains, lacking in-depth exploration tailored to autonomous driving scenarios, particularly within visual reasoning modules. In this paper, we propose a driving world model that performs parallel prediction of latent semantic features for consecutive future frames in the bird’s-eye-view (BEV) space, thereby enabling long-horizon modeling of future world states. We also introduce an efficient and adaptive text reasoning mechanism that utilizes additional social knowledge and reasoning capabilities to further improve driving performance in challenging long-tail scenarios. We present a novel, efficient, and effective approach that achieves state-of-the-art (SOTA) results on the closed-loop Bench2drive benchmark. Codes are available at: this https URL.

[CV-52] EnergyLens: Interpretable Closed-Form Energy Models for Multimodal LLM Inference Serving

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在异构加速器上部署时,推理能耗优化与延迟、吞吐量优化之间存在显著差异的问题。现有方法要么将延迟作为能耗代理(忽略二者实际差异),要么依赖数据密集型的黑箱替代模型(需数百次采样才能跨模型家族和硬件泛化),难以适应不同并行策略下的能效最优配置选择。解决方案的关键在于提出EnergyLens——一种基于符号回归(symbolic regression)的结构发现工具,从少量(仅50次)性能测量中自动推导出一个由12个参数构成的封闭形式能耗模型,该模型以系统属性(如并行度、批大小、序列长度)为变量,能够物理可解释地分离张量并行与流水线并行的能耗贡献,并区分预填充(prefill)与解码(decode)阶段的能量消耗,从而实现高精度(Top-1配置选择准确率达88.2%)、低采样成本(仅为集成机器学习方法的十分之一)且无需结构修改即可外推至未见批次大小和硬件平台的能效优化能力。

链接: https://arxiv.org/abs/2605.10556
作者: Vittorio Palladino,Gianluca Palermo,Michael E. Papka,Zhiling Lan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages

点击查看摘要

Abstract:As large language models span dense, mixture-of-experts, and state-space architectures and are deployed on heterogeneous accelerators under increasingly diverse multimodal workloads, optimising inference energy has become as critical as optimizing latency and throughput. Existing approaches either treat latency as an energy proxy or rely on data-hungry black-box surrogates. Both fail under varying parallelism strategies: latency and energy optima diverge in over 20% of configurations we tested, and black-box surrogates require hundreds of profiling samples to generalize across model families and hardware. We present EnergyLens, which uses symbolic regression as a structure-discovery tool over profiling data to derive a single twelve-parameter closed-form energy model expressed in terms of system properties such as degree of parallelism, batch size, and sequence length. Unlike black-box surrogates, EnergyLens decouples tensor and pipeline parallelism contributions and separates prefill from decode energy, making its predictions physically interpretable and actionable. Fitted from as few as 50 profiling measurements, EnergyLens achieves 88.2% Top-1 configuration selection accuracy across many evaluation scenarios compared to 60.9% for the closest prior analytical baseline, matches the predictive accuracy of ensemble ML methods with 10x fewer profiling samples, and extrapolates reliably to unseen batch sizes and hardware platforms without structural modification, making it a practical, interpretable tool for energy-optimal LLM deployment.
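
EnergyLens 的建模思路是用少量剖析样本拟合一个以并行度、批大小、序列长度为自变量的小型闭式能耗模型。下面用 NumPy 在合成数据上演示这种拟合方式;特征构造(区分预填充/解码、张量并行/流水线并行的计算与通信项)是假设的示意形式,并非论文的 12 参数模型或其符号回归流程。

```python
import numpy as np

def features(batch, prefill_len, decode_len, tp, pp):
    """Hypothetical closed-form feature map: separate prefill / decode terms, each
    with a compute part (divided by tensor parallelism) and a communication part
    (growing with parallel degree), plus a constant idle term. Illustrative only."""
    tokens_p = batch * prefill_len
    tokens_d = batch * decode_len
    return np.stack([
        np.ones_like(tokens_p, dtype=float),   # static / idle energy
        tokens_p / tp,                         # prefill compute per TP shard
        tokens_d / tp,                         # decode compute per TP shard
        tokens_p * (tp - 1),                   # prefill TP communication
        tokens_d * (tp - 1),                   # decode TP communication
        tokens_d * pp,                         # pipeline bubble / PP overhead
    ], axis=-1)

rng = np.random.default_rng(1)
n = 50                                         # "few profiling samples" regime
B = rng.integers(1, 33, n)
Lp = rng.integers(128, 4097, n)
Ld = rng.integers(16, 513, n)
TP = rng.choice([1, 2, 4, 8], n)
PP = rng.choice([1, 2, 4], n)
X = features(B, Lp, Ld, TP, PP)
true_coefs = np.array([50.0, 0.02, 0.05, 0.001, 0.002, 3.0])
y = X @ true_coefs * (1 + 0.02 * rng.standard_normal(n))   # synthetic per-request energy (J)

coefs, *_ = np.linalg.lstsq(X, y, rcond=None)               # fit the closed-form model
pred = features(np.array([16]), np.array([2048]), np.array([256]),
                np.array([4]), np.array([2])) @ coefs
print(np.round(coefs, 4), float(pred[0]))
```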

[CV-53] TIE: Time Interval Encoding for Video Generation over Events

【速读】:该论文旨在解决当前视频生成模型(如Diffusion Transformers, DiT)在处理并发事件时的时序定位问题,即现有方法基于“单活跃提示”假设,无法有效建模时间上重叠的多事件场景(此类场景在机器人操作和游戏片段中占比超99%)。其核心挑战在于DiT使用离散点位置编码表示时间,导致注意力机制难以数学上表达时长扩展的区间和事件重叠。解决方案的关键是提出Time Interval Encoding (TIE),一种基于旋转位置编码(RoPE)兼容的间隔感知增强方案,将时间区间作为一阶原语嵌入到DiT交叉注意力中;TIE由两个基本原理驱动:时序可积性(Temporal Integrability),要求事件在其整个持续期内聚合位置信息;以及时长不变性(Duration Invariance),消除对较长区间的偏差。该设计通过统一核函数导出闭式sinc解,在不改变标准注意力接口的前提下,自然抑制边界噪声并显著提升时序控制精度。

链接: https://arxiv.org/abs/2605.10543
作者: Zhilei Shu,Shangwen Zhu,Zihang Liang,Xiaofan Li,Qianyu Peng,Xinyu Cui,Bo Ye,Yiming Li,Fan Cheng,Jian Zhao,Yang Cao,Zheng-Jun Zha,Ruili Feng
机构: University of Science and Technology of China(中国科学技术大学); Matrix Team(矩阵团队); Shanghai Jiao Tong University(上海交通大学); Nanyang Technological University(南洋理工大学); University of Waterloo(滑铁卢大学); The Pennsylvania State University(宾夕法尼亚州立大学); Zhongguancun Academy(中关村研究院); The University of Hong Kong(香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Director-style prompting, robotic action prediction, and interactive video agents demand temporal grounding over concurrent events – a regime in which 68% of general clips and over 99% of robotics/gameplay clips contain overlapping events, yet existing multi-event generators rest on a single-active-prompt assumption. However, modern video generators, such as Diffusion Transformers (DiT), represent time as discrete points through point-wise positional encodings. This formulation creates a fundamental dimension mismatch: temporally extended intervals and overlapping events are mathematically unrepresentable to the attention mechanism. In this paper, we propose Time Interval Encoding (TIE), a principled, plug-and-play interval-aware generalization of rotary embeddings that elevates time intervals to first-class primitives inside DiT cross-attention. Rather than introducing another heuristic interval embedding, we show that, within RoPE-compatible bilinear attention, TIE is characterized by two basic principles: Temporal Integrability, which requires an event to aggregate positional evidence over its full duration, and Duration Invariance, which removes the trivial bias toward longer intervals. Under a uniform kernel, this characterization yields an efficient closed-form sinc-based solution that preserves the standard attention interface and naturally attenuates boundary noise through interval integration. Empirically, TIE preserves the visual quality of the base DiT model while substantially improving temporal controllability. In our experiments on the OmniEvents dataset, it improves human-verified Temporal Constraint Satisfaction Rate from 77.34% to 96.03% and reduces temporal boundary error from 0.261s to 0.073s, while also improving trajectory-level temporal alignment metrics. The code and dataset are available at this https URL.
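
The sinc form mentioned in the abstract has a simple interpretation that a small sketch can make concrete: if a RoPE-style phase e^{iωt} is averaged uniformly over an event's interval, the result is the rotation at the interval midpoint damped by sinc(ω·duration/2). The code below computes exactly that quantity; it is one plausible reading of the closed-form solution under the paper's uniform kernel, not the authors' released implementation.

```python
# Sketch: average a RoPE-style complex rotation over a time interval.
# Uniform averaging gives the midpoint phase scaled by sinc(omega * dur / 2);
# this is an illustrative reading of TIE, not the authors' implementation.
import numpy as np

def rope_point(t, freqs):
    """Standard point-wise rotary phase e^{i * omega * t}."""
    return np.exp(1j * freqs * t)

def rope_interval(t_start, t_end, freqs):
    """Uniform average of e^{i * omega * t} over [t_start, t_end].

    Equals e^{i*omega*mid} * sinc(omega*dur/2): longer events are encoded at
    their midpoint with high frequencies attenuated, and the 1/dur in the
    average removes any trivial bias toward longer intervals."""
    mid, dur = 0.5 * (t_start + t_end), t_end - t_start
    # np.sinc(x) = sin(pi x)/(pi x), so rescale the argument accordingly.
    damp = np.sinc(freqs * dur / (2 * np.pi))
    return damp * np.exp(1j * freqs * mid)

freqs = 1.0 / (10_000 ** (np.arange(0, 32, 2) / 32))   # RoPE-like spectrum
# A zero-length interval recovers ordinary point-wise RoPE.
print(np.allclose(rope_interval(3.0, 3.0, freqs), rope_point(3.0, freqs)))
# A long interval damps the high-frequency channels.
print(np.abs(rope_interval(0.0, 8.0, freqs))[:4])
```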

[CV-54] GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth

【速读】:该论文旨在解决视频深度估计(video depth estimation)中普遍存在的空间细节模糊和时间不一致性问题,尤其在相机旋转或视角剧烈变化时,现有基于Transformer的时序平滑方法难以保证严格的三维几何一致性。其解决方案的关键在于提出GemDepth框架,通过引入几何嵌入模块(Geometry-Embedding Module, GEM),显式预测帧间相机位姿以生成隐式的几何嵌入,从而赋予网络内在的三维感知与对齐能力;同时设计交替时空Transformer(Alternating Spatio-Temporal Transformer, ASTT),利用这些几何线索捕捉潜在的点级对应关系,在提升空间精细度的同时强化时间一致性,实现高效且鲁棒的三维结构保持。

链接: https://arxiv.org/abs/2605.10525
作者: Yuecheng Liu,Junda Cheng,Longliang Liu,Wenjing Liao,Hanrui Cheng,Yuzhou Wang,Xin Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video depth estimation extends monocular prediction into the temporal domain to ensure coherence. However, existing methods often suffer from spatial blurring in fine-detail regions and temporal inconsistencies. We argue that current approaches, which primarily rely on temporal smoothing via Transformers, struggle to maintain strict 3D geometric consistency, particularly under rotations or drastic view changes. To address this, we propose GemDepth, a framework built on the insight that an explicit awareness of camera motion and global 3D structure is a prerequisite for 3D consistency. Distinctively, GemDepth introduces a Geometry-Embedding Module (GEM) that predicts inter-frame camera poses to generate implicit geometric embeddings. This injection of motion priors equips the network with intrinsic 3D perception and alignment capabilities. Guided by these geometric cues, our Alternating Spatio-Temporal Transformer (ASTT) captures latent point-level correspondences to simultaneously enhance spatial precision for sharp details and enforce rigorous temporal consistency. Furthermore, GemDepth employs a data-efficient training strategy, effectively bridging the gap between high efficiency and robust geometric consistency. As shown in Fig.2, comprehensive evaluations demonstrate that GemDepth achieves state-of-the-art performance across multiple datasets, particularly in complex dynamic scenarios. The code is publicly available at: this https URL

[CV-55] Improving Human Image Animation via Semantic Representation Alignment CVPR2026

【速读】:该论文旨在解决图像到视频生成(image-to-video generation)中长期视频生成时出现的人体肢体扭曲和面部失真问题,尤其是在复杂运动场景下。现有方法通常依赖人体特定语义表示(如密集姿态或ID嵌入)作为额外条件,但这种方式会降低生成灵活性,并且仅基于RGB像素监督,难以学习必要的3D几何关系与时间一致性。解决方案的关键在于提出一种名为SemanticREPA的新方法,其核心思想是将语义表示作为监督信号通过表征对齐(representation alignment)实现结构修正与身份一致性增强:首先训练一个结构对齐模块,使视频潜在空间中的结构表示与深度估计特征对齐;随后固定该模块,用其为扩散模型提供结构表示的额外监督以提升结构稳定性;同时引入ID对齐模块,使生成视频的ID表示与人脸识别特征对齐,并利用预测的结构表示优化相关区域的身份恢复。该方法在扩展动作和角色一致性方面表现优异。

链接: https://arxiv.org/abs/2605.10523
作者: Chang Liu,Mengting Chen,Yixuan Huang,Haoning Wu,Chen Ju,Shuai Xiao,Jinsong Lan,Yanfeng Wang
机构: Shanghai Jiao Tong University (上海交通大学); Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026 workshop

点击查看摘要

Abstract:The field of image-to-video generation has made remarkable progress. However, challenges such as human limb twisting and facial distortion persist, especially when generating long videos or modeling intensive motions. Existing human image animation works address these issues by incorporating human-specific semantic representations, e.g., dense poses or ID embeddings, as additional conditions. However, conditioning on these representations could decrease the generation flexibility. Moreover, their reliance on RGB pixel supervision also lacks emphasis on learning necessary 3D geometric relationships and temporal coherence. In contrast, we introduce a novel approach named SemanticREPA that leverages these semantic representations as supervision signals through representation alignment. Specifically, we begin by training a structure alignment module that aligns the structure representations obtained from video latents with video depth estimation features. We then fix the pretrained module, and utilize it to provide additional supervision on the structure representations of the diffusion models, achieving structure rectification to generate coherent and stable human structures. Simultaneously, we develop an ID alignment module to align the ID representations of the generated videos to face recognition features. We further propose to use the predicted structure representations to refine identity restoration in relevant regions. With structure and ID alignment, our method demonstrates superior quality on extended character motions and enhanced character consistency.
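
A representation-alignment loss of the kind described, projecting diffusion features and pulling them toward frozen depth-estimator features, can be sketched in a few lines. The projector architecture, feature dimensions, and the choice of a cosine objective below are illustrative assumptions rather than the paper's exact design.

```python
# Minimal sketch of a REPA-style alignment loss: project intermediate
# diffusion features and pull them toward frozen depth-encoder features
# with a cosine objective. Shapes and projector design are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructureAlign(nn.Module):
    def __init__(self, diff_dim=1152, depth_dim=768, hidden=1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(diff_dim, hidden), nn.SiLU(), nn.Linear(hidden, depth_dim)
        )

    def forward(self, diff_feats, depth_feats):
        """diff_feats: (B, N, diff_dim) tokens from the video diffusion model.
        depth_feats: (B, N, depth_dim) frozen depth-estimator features."""
        z = self.proj(diff_feats)
        # Negative cosine similarity, averaged over tokens; the depth target
        # is detached so only the diffusion branch receives gradients.
        return 1.0 - F.cosine_similarity(z, depth_feats.detach(), dim=-1).mean()

align = StructureAlign()
loss = align(torch.randn(2, 256, 1152), torch.randn(2, 256, 768))
loss.backward()
print(float(loss))
```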

[CV-56] DuetFair: Coupling Inter- and Intra-Subgroup Robustness for Fair Medical Image Segmentation

【速读】:该论文旨在解决医学图像分割模型在不同子群体(subgroup)间表现不均衡的问题,尤其关注“组内隐藏失败”(intra-group hidden failure)——即传统公平性方法仅优化子群体平均性能时,会掩盖子群体内部高损失样本的困难情况。解决方案的关键在于提出DuetFair机制,这是一个双轴公平性框架,同时兼顾跨子群体适应性(inter-subgroup adaptation)与组内鲁棒性(intra-subgroup robustness)。在此基础上,作者进一步设计了FairDRO方法,通过引入基于分布感知的专家混合(distribution-aware mixture-of-experts, dMoE)和子群体条件下的分布鲁棒优化(subgroup-conditioned distributionally robust optimization, DRO)损失聚合策略,使模型既能跨子群体自适应调整,又能有效降低每个子群体内的隐性失败风险。

链接: https://arxiv.org/abs/2605.10521
作者: Yiqi Tian,Sangjoon Park,Bo Zeng,Pengfei Jin,Yujin Oh,Quanzheng Li
机构: Massachusetts General Hospital and Harvard Medical School (马萨诸塞州总医院和哈佛医学院); University of Pittsburgh (匹兹堡大学); Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages, 2 figures

点击查看摘要

Abstract:Medical image segmentation models can perform unevenly across subgroups. Most existing fairness methods focus on improving average subgroup performance, implicitly treating each subgroup as internally homogeneous. However, this can hide difficult cases within a subgroup, where high-loss samples are obscured by the subgroup mean. We call this problem intra-group hidden failure. To solve this, we propose the DuetFair mechanism, a dual-axis fairness framework that jointly considers inter-subgroup adaptation and intra-subgroup robustness. Based on DuetFair, we introduce FairDRO, which combines distribution-aware mixture-of-experts (dMoE) with subgroup-conditioned distributionally robust optimization (DRO) loss aggregation. This design allows the model to adapt across subgroups while also reducing hidden failures within each subgroup. We evaluate FairDRO on three medical image segmentation benchmarks with varying degrees of within-group heterogeneity. FairDRO achieves the best equity-scaled performance on Harvard-FairSeg and improves worst-case subgroup performance on HAM10000 under both age- and race-based grouping schemes. On the 3D radiotherapy target cohort, FairDRO further improves worst-group Dice by 3.5 points (↑6.0%) under the tumor-stage grouping and by 4.1 points (↑7.4%) under the institution grouping over the strongest baseline.
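
The subgroup-conditioned DRO aggregation can be illustrated with a standard Group-DRO-style update: maintain one weight per subgroup, up-weight subgroups whose current loss is high, and optimize the weighted sum. The step size, the uniform initialization, and the toy per-sample losses below are illustrative assumptions, not the paper's configuration.

```python
# Sketch of subgroup-conditioned DRO loss aggregation in the spirit of
# Group DRO: an exponentiated-gradient update shifts weight toward the
# worst-performing subgroup before the losses are aggregated.
import torch

def group_dro_loss(sample_losses, group_ids, group_weights, eta=0.01):
    """sample_losses: (B,) per-sample segmentation losses.
    group_ids: (B,) integer subgroup index per sample.
    group_weights: (G,) running DRO weights, updated in place."""
    G = group_weights.numel()
    per_group = []
    for g in range(G):
        mask = group_ids == g
        per_group.append(sample_losses[mask].mean() if mask.any()
                         else sample_losses.new_zeros(()))
    group_losses = torch.stack(per_group)
    # Exponentiated-gradient step toward the worst subgroup (no grad needed).
    with torch.no_grad():
        group_weights *= torch.exp(eta * group_losses)
        group_weights /= group_weights.sum()
    return (group_weights * group_losses).sum()

weights = torch.full((3,), 1 / 3)
losses = torch.tensor([0.2, 0.9, 0.3, 0.8], requires_grad=True)
loss = group_dro_loss(losses, torch.tensor([0, 1, 0, 2]), weights)
loss.backward()
print(float(loss), weights.tolist())
```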

[CV-57] Simultaneous Long-tailed Recognition and Multi-modal Fusion for Highly Imbalanced Multi-modal Data

【速读】:该论文旨在解决长尾分布(long-tailed distribution)下类别不平衡数据对深度学习模型造成的偏差问题,尤其是在多模态输入场景中,现有方法受限于单模态处理能力,难以充分利用不同数据源之间的互补信息。其解决方案的关键在于提出一种新型多模态长尾识别框架,通过扩展多专家(multi-expert)架构实现异构模态的统一表征融合,并引入模态特异性网络来估计各模态的信息量;基于此,设计置信度引导权重动态调节融合过程,使更具信息量的模态在最终决策中发挥更大作用,从而显著提升模型在长尾分布下的鲁棒性与泛化性能。

链接: https://arxiv.org/abs/2605.10498
作者: Heegeon Yoon,Heeyoung Kim
机构: Korea Advanced Institute of Science and Technology (KAIST)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Long-tailed distributions in class-imbalanced data present a fundamental challenge for deep learning models, which tend to be biased toward majority classes. While recent methods for long-tailed recognition have mitigated this issue, they are largely restricted to single-modal inputs and cannot fully exploit complementary information from diverse data sources. In this work, we introduce a new framework for long-tailed recognition that explicitly handles multi-modal inputs. Our approach extends multi-expert architectures to the multi-modal setting by fusing heterogeneous data into a unified representation while leveraging modality-specific networks to estimate the informativeness of each modality. These confidence-guided weights dynamically modulate the fusion process, ensuring that more informative modalities contribute more strongly to the final decision. To further enhance performance, we design specialized training and test procedures that accommodate diverse modality combinations, including images and tabular data. Extensive experiments on benchmark and real-world datasets demonstrate that the proposed approach not only effectively integrates multi-modal information but also outperforms existing methods in handling long-tailed, class-imbalanced scenarios, highlighting its robustness and generalization capability.
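
Confidence-guided fusion of the kind described can be sketched as a pair of modality heads plus per-sample informativeness scores that gate a weighted sum before the shared classifier. All dimensions and the softmax gate below are illustrative assumptions.

```python
# Sketch of confidence-guided multimodal fusion: each modality-specific head
# produces a feature and a scalar informativeness score; a softmax over the
# scores weights the features before the shared classifier.
import torch
import torch.nn as nn

class ConfidenceFusion(nn.Module):
    def __init__(self, img_dim=2048, tab_dim=64, fused=512, n_classes=100):
        super().__init__()
        self.img_head = nn.Linear(img_dim, fused)
        self.tab_head = nn.Linear(tab_dim, fused)
        self.img_conf = nn.Linear(img_dim, 1)   # per-sample informativeness
        self.tab_conf = nn.Linear(tab_dim, 1)
        self.classifier = nn.Linear(fused, n_classes)

    def forward(self, img_feat, tab_feat):
        feats = torch.stack([self.img_head(img_feat), self.tab_head(tab_feat)], dim=1)
        conf = torch.cat([self.img_conf(img_feat), self.tab_conf(tab_feat)], dim=1)
        w = torch.softmax(conf, dim=1).unsqueeze(-1)      # (B, 2, 1) gate
        return self.classifier((w * feats).sum(dim=1))    # confidence-weighted sum

model = ConfidenceFusion()
logits = model(torch.randn(4, 2048), torch.randn(4, 64))
print(logits.shape)   # torch.Size([4, 100])
```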

[CV-58] M2E-UAV: A Benchmark and Analysis for Onboard Motion-on-Motion Event-Based Tiny UAV Detection

【速读】:该论文旨在解决在运动-运动(motion-on-motion)场景下,基于机载事件相机(event camera)对小型无人机(tiny UAV)进行检测的难题。在此场景中,观测者与目标均处于运动状态,导致自身运动(ego-motion)激活背景边缘(如建筑、植被和地平线结构),而无人机则表现为稀疏事件簇,显著增加了检测难度。解决方案的关键在于构建了一个名为 M²E-UAV 的基准数据集,包含87,223个训练样本和21,395个验证样本,覆盖四种典型场景;并提出 M²E-Point 点基事件建模方法,将事件编码为 [x,y,t,p] 点集,利用 EdgeConv 提取局部事件结构,输出事件级前景得分,再通过 DBSCAN 聚类生成边界框。实验表明,点基建模是强基准,而简单IMU条件化仅带来边际性能提升,为后续研究提供了可靠起点。

链接: https://arxiv.org/abs/2605.10496
作者: Weiqi Yan,Lixin Chen,Xiangrui Hou,Zhipeng Cai,Youbiao Wang,Yangyang Shi,Yu Zang,Cheng Wang
机构: Xiamen University (厦门大学); Meta (Meta)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Tiny UAV detection from an onboard event camera is difficult when the observer and target move at the same time. In this motion-on-motion regime, ego-motion activates background edges across buildings, vegetation, and horizon structures, while the UAV may appear as a sparse event cluster. To explore this practical problem, we present M²E-UAV, a benchmark and analysis setup for onboard motion-on-motion event-based tiny UAV detection. The processed M²E-UAV benchmark contains 87,223 training samples and 21,395 validation samples across four scene families: sunny building-forest, sunny farm-village, sunset building-forest, and sunset farm-village. We provide M²E-Point, a point-based event baseline, and M²E-Point + IMU, an IMU-conditioned variant, to analyze the role of inertial cues under onboard motion-on-motion detection. M²E-Point encodes events as [x,y,t,p] point sets, extracts local event structure with EdgeConv, and predicts event-level UAV foreground scores, from which bounding boxes are derived via DBSCAN. Our validation-stage analysis shows that point-based event modeling is a strong baseline, while simple IMU conditioning provides only marginal aggregate gains. Under the train/validation split, M²E-Point achieves 0.9673 F1 and 0.5501 mAP50-95, while the IMU-conditioned variant reaches 0.5561 mAP50-95 with only marginal aggregate changes, serving as an initial baseline for future exploration in this domain. Code will be ready in this https URL.
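
The final step of the pipeline, turning per-event foreground scores over [x,y,t,p] points into boxes via DBSCAN, is straightforward to sketch. The threshold and clustering parameters below are illustrative assumptions, and the synthetic events merely stand in for real camera output.

```python
# Sketch of the post-processing described in the abstract: keep confident
# events, cluster them spatially with DBSCAN, emit one box per cluster.
import numpy as np
from sklearn.cluster import DBSCAN

def events_to_boxes(events, scores, thr=0.5, eps=6.0, min_samples=10):
    """events: (N, 4) array of [x, y, t, p]; scores: (N,) UAV foreground probs."""
    fg = events[scores > thr][:, :2]               # confident (x, y) positions
    if len(fg) < min_samples:
        return []
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(fg)
    boxes = []
    for k in set(labels) - {-1}:                   # -1 is DBSCAN noise
        pts = fg[labels == k]
        x0, y0 = pts.min(axis=0)
        x1, y1 = pts.max(axis=0)
        boxes.append((float(x0), float(y0), float(x1), float(y1)))
    return boxes

rng = np.random.default_rng(1)
# 200 events from a "UAV" near (120, 80) plus 800 background events.
ev = np.concatenate([rng.normal([120, 80, 0, 1], [3, 3, 1, 0], (200, 4)),
                     rng.uniform(0, 256, (800, 4))])
sc = np.concatenate([np.full(200, 0.9), np.full(800, 0.1)])
print(events_to_boxes(ev, sc))
```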

[CV-59] OpenSGA: Efficient 3D Scene Graph Alignment in the Open World

【速读】:该论文旨在解决3D场景图对齐(scene graph alignment)中的关键挑战,即在部分重叠观测下建立跨场景的物体对应关系,从而支持机器人在重访场景时进行高效的对象级重定位和多智能体全局地图融合。现有方法主要局限于子扫描到子扫描(subscan-to-subscan, S2S)对齐,且高度依赖几何点云特征,忽视了帧到扫描(frame-to-scan, F2S)对齐以及开放集视觉-语言特征的应用;同时,现有数据集规模小、物体类别有限,限制了系统训练与评估。解决方案的关键在于提出一个统一且高效的场景图对齐框架,通过融合视觉-语言特征、文本信息与几何特征,并引入空间上下文建模,包括距离门控的空间注意力编码器、基于最小成本流的分配器以及全局场景嵌入生成模块,显著提升了在坐标系差异较大情况下的对齐精度。

链接: https://arxiv.org/abs/2605.10484
作者: Gang Chen,Sebastián Barbas Laina,Stefan Leutenegger,Javier Alonso-Mora
机构: Delft University of Technology (代尔夫特理工大学); Technical University of Munich (慕尼黑工业大学); ETH Zürich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 13 figures

点击查看摘要

Abstract:Scene graph alignment establishes object correspondences between two 3D scene graphs constructed from partially overlapping observations. This enables efficient scene understanding and object-level relocalization when a robot revisits a place, as well as global map fusion across multiple agents. Such capabilities are essential for robots that require long-term memory for long-horizon tasks involving interactions with the environment. Existing approaches mainly focus on subscan-to-subscan (S2S) alignment and depend heavily on geometric point-cloud features, leaving frame-to-scan (F2S) alignment and open-set vision-language features underexplored. In addition, existing datasets for scene graph alignment remain small-scale with limited object diversity, constraining systematic training and evaluation. We present a unified and efficient scene graph alignment framework that predicts object correspondences by fusing vision-language, textual, and geometric features with spatial context. The framework comprises modules such as a distance-gated spatial attention encoder, a minimum-cost-flow-based allocator, and a global scene embedding generator to achieve accurate alignment even under large coordinate discrepancies. We further introduce ScanNet-SG, a large-scale dataset generated via an automated annotation pipeline with over 700k samples, covering 509 object categories from ScanNet labels and over 3k categories from GPT-4o-based tagging. Experiments show that our method achieves the best overall performance on both F2S and S2S tasks, substantially outperforming existing scene graph alignment methods. Our code and dataset are released at: this https URL.
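
The allocator step, assigning objects in one graph to objects in the other from fused embeddings, can be approximated with a rectangular linear-sum assignment over a cosine-distance cost matrix. The paper itself uses a minimum-cost-flow formulation, so the SciPy solver, the cost definition, and the rejection threshold below are simplifications for illustration.

```python
# Simplified stand-in for the correspondence allocator: rectangular
# assignment on (1 - cosine similarity), with a cost threshold so that
# unmatched objects under partial overlap are rejected.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_objects(emb_a, emb_b, max_cost=0.4):
    """emb_a: (Na, D), emb_b: (Nb, D) fused per-object embeddings."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T                       # (Na, Nb) cosine distance
    rows, cols = linear_sum_assignment(cost)   # handles Na != Nb
    return [(int(i), int(j)) for i, j in zip(rows, cols) if cost[i, j] <= max_cost]

rng = np.random.default_rng(0)
scene_b = rng.normal(size=(6, 32))
scene_a = scene_b[[4, 0, 2]] + 0.05 * rng.normal(size=(3, 32))  # 3 shared objects
print(match_objects(scene_a, scene_b))        # expect [(0, 4), (1, 0), (2, 2)]
```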

[CV-60] Adaptive Context Matters: Towards Provable Multi-Modality Guidance for Super-Resolution

【速读】:该论文旨在解决多模态超分辨率(Multi-Modal Super-Resolution, MMSR)中因模态融合不充分而导致的泛化性能瓶颈问题,尤其是在理论建模和实际应用层面均存在对异构模态利用效率不足的挑战。其解决方案的关键在于提出首个针对多模态超分辨率的理论建模,并基于该理论设计了面向泛化风险控制的动态模态融合机制——即Multi-Modal Mixture-of-Experts Super-Resolution框架(M³ESR),其中包含空间动态模态加权模块与时间自适应温度调度机制,从而实现模态权重与有效贡献之间的强对齐,降低表示复杂度并提升整体泛化能力和语义一致性。

链接: https://arxiv.org/abs/2605.10470
作者: Jinyi Luo,Minghao Liu,Yifan Li,Zejia Fan,Jiaying Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Super-resolution (SR) is a severely ill-posed problem with inherent ambiguity, as widely recognized in both empirical and theoretical studies. Although recent semantic-guided and multi-modal SR methods exploit large models or external priors to enhance semantic alignment, the fusion of heterogeneous modalities remains insufficiently understood in practice and theory. In this work, we provide the first theoretical modeling of multi-modal SR, revealing that prior methods are bottlenecked by sub-optimal modality utilization. Our analysis shows that the generalization risk bound can be improved by strengthening the alignment between modality weights and their effective contributions, while reducing representation complexity. This theoretical insight inspires us to propose the novel Multi-Modal Mixture-of-Experts Super-Resolution framework (M³ESR) that employs generalization-oriented dynamic modality fusion for accurate risk control and modality contribution optimization. In detail, we propose a novel spatially dynamic modality weighting module and a temporally adaptive modality temperature scheduling mechanism, enabling flexible and adaptive spatial-temporal modality weighting for effective risk control. Extensive experiments demonstrate that M³ESR significantly boosts generalization and semantic consistency, confirming its superiority.

[CV-61] Automated Detection of Abnormalities in Zebrafish Development

【速读】:该论文旨在解决当前基于斑马鱼胚胎的药物筛选与毒性评估依赖人工判读、效率低下且主观性强的问题。其核心解决方案是构建一个大规模、高分辨率的显微图像序列数据集,涵盖对照组和化学物质(3,4-二氯苯胺)暴露条件下的斑马鱼胚胎发育过程,并提供细粒度时间维度上的专家标注;在此基础上,提出首个基于Transformer架构的模型,通过融合时空特征实现早期发育异常的自动预测,显著提升了分类准确性(繁殖力分类达98%,毒性评估达92%),为自动化斑马鱼毒理学分析提供了可靠的技术路径。

链接: https://arxiv.org/abs/2605.10464
作者: Sarath Sivaprasad,Hui-Po Wang,Anna-Lisa Jäckel,Jonas Baumann,Carole Baumann,Jennifer Herrmann,Mario Fritz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Zebrafish embryos are a valuable model for drug discovery due to their optical transparency and genetic similarity to humans. However, current evaluations rely on manual inspection, which is costly and labor-intensive. While machine learning offers automation potential, progress is limited by the lack of comprehensive datasets. To address this, we introduce a large-scale dataset of high-resolution microscopic image sequences capturing zebrafish embryonic development under both control conditions and exposure to compounds (3,4-dichloroaniline). This dataset, with expert annotations at fine-grained temporal levels, supports two benchmarking tasks: (1) fertility classification, assessing zebrafish egg viability (130,368 images), and (2) toxicity assessment, detecting malformations induced by toxic exposure over time (55,296 images). Alongside the dataset, we present the first transformer-based baseline model that integrates spatiotemporal features to predict developmental abnormalities at early stages. Experimental results present the model’s effectiveness, achieving 98% accuracy in fertility classification and 92% in toxicity assessment. These findings underscore the potential of automated approaches to enhance zebrafish-based toxicity analysis.

[CV-62] Automated high-frequency quantification of fish communities and biomass using computer vision

【速读】:该论文旨在解决现有鱼类群落调查方法在高频、定量观测方面的局限性,特别是传统捕捞法、水下目视普查和环境DNA宏条形码技术在劳动强度大或难以准确估算丰度与生物量方面的不足。其解决方案的关键在于构建一个基于计算机视觉的自动化框架,通过自研的立体相机系统获取水下视频,并融合深度学习驱动的鱼类识别、多目标跟踪与三维重建技术,实现物种层面的丰度与生物量估计,从而为长期、非侵入式、连续监测提供可扩展的技术基础。

链接: https://arxiv.org/abs/2605.10449
作者: Kota Ishikawa,Takuma Masui,Keita Koeda,Rickdane Gomez,Lucas Yutaka Kimura,Michio Kondoh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 3 figures, supplementary information under Ancillary files

点击查看摘要

Abstract:Quantifying fish community structure is essential for understanding biodiversity and ecosystem responses in a changing environment, yet existing survey methods provide limited high-frequency, quantitative observations. Conventional approaches, including catch-based methods, underwater visual censuses, and environmental DNA metabarcoding, either require intensive labor or lack reliable estimates of abundance and biomass. Here, we develop an automated framework for quantifying fish communities from underwater video using computer vision. Using videos acquired with a custom-made stereo camera system, the framework integrates deep learning-based fish identification, multi-object tracking, and 3D reconstruction to estimate species-level abundance and biomass. We applied the approach to a reef fish community over a 20-day period with hourly daytime observations, revealing dynamic fluctuations in species richness, abundance, and biomass associated with changes in species composition. By comparing fish communities estimated from visual census and environmental DNA surveys, we demonstrate that our method provides complementary strengths for continuous, non-invasive, and quantitative monitoring of consistently observed species. This approach provides a scalable foundation for long-term monitoring and advances the capacity to resolve fine-scale temporal dynamics in fish communities.
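
A minimal sketch of the measurement step is given below: two body keypoints are triangulated from a calibrated stereo pair to obtain fish length, which is then converted to mass with a generic length-weight relation W = a·L^b. The toy projection matrices, pixel coordinates, and (a, b) coefficients are assumptions for illustration and are not taken from the paper.

```python
# Sketch: stereo triangulation of snout and tail keypoints -> length -> mass.
# The rig, keypoints, and length-weight coefficients are illustrative only.
import numpy as np
import cv2

def triangulate(P_left, P_right, pt_left, pt_right):
    X = cv2.triangulatePoints(P_left, P_right,
                              np.float64(pt_left).reshape(2, 1),
                              np.float64(pt_right).reshape(2, 1))
    return (X[:3] / X[3]).ravel()              # homogeneous -> 3D point (metres)

# Toy rectified stereo rig: identical intrinsics, 0.12 m baseline.
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
P_l = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P_r = K @ np.hstack([np.eye(3), np.array([[-0.12], [0.0], [0.0]])])

snout = triangulate(P_l, P_r, (310, 236), (270, 236))
tail = triangulate(P_l, P_r, (420, 250), (382, 250))
length_cm = 100 * np.linalg.norm(snout - tail)
mass_g = 0.012 * length_cm ** 3.05             # hypothetical length-weight relation
print(f"length ~{length_cm:.1f} cm, mass ~{mass_g:.0f} g")
```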

[CV-63] Uni-Synergy: Bridging Understanding and Generation for Personalized Reasoning via Co-operative Reinforcement Learning

【速读】:该论文旨在解决统一多模态模型(Unified Multimodal Models, UMMs)在个性化理解与生成任务之间难以有效协同的问题。现有方法主要依赖监督微调实现隐式的 token 级对齐,无法充分挖掘理解与生成之间的潜在协同效应。其解决方案的关键在于提出 Sync-R1 框架——一个端到端的强化学习方法,通过在一个显式的推理循环中联合优化个性化理解与生成任务,使理解模块能够指导内容生成,同时生成质量反馈反过来提升理解能力,形成闭环优化。该框架引入 Sync-GRPO 方法构建集成奖励机制,并结合动态分组缩放(Dynamic Group Scaling, DGS)策略降低梯度方差、加速收敛,从而在 UnifyBench++ 新增的复杂用户上下文和密集文本描述场景下实现卓越的跨任务推理能力和鲁棒个性化表现。

链接: https://arxiv.org/abs/2605.10445
作者: Zijun Shen,Sihan Yang,Ruichuan An,Ziyu Guo,Hao Liang,Ming Lu,Renrui Zhang,Wentao Zhang
机构: Peking University (北京大学); Nanjing University (南京大学); CUHK (香港中文大学); Zhongguancun Academy (中关村研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unified Multimodal Models (UMMs) excel in general tasks but struggle to bridge the gap between personalized understanding and generation. Prior works largely rely on implicit token-level alignment via supervised fine-tuning, which fails to fully capture the potential synergy between comprehension and creation. In this work, we propose Sync-R1, an end-to-end reinforcement learning framework that jointly optimizes personalized understanding and generation within a single, explicit reasoning loop. Through this unified feedback process, Sync-R1 enables personalized comprehension to guide content creation, while the resulting generation quality reciprocally refines understanding within an integrated reward landscape. To efficiently orchestrate this dual-task synergy, we introduce Sync-GRPO, a reinforcement learning method utilizing an ensemble reward system. Furthermore, we propose Dynamic Group Scaling (DGS), which adaptively filters low-potential trajectories to reduce gradient variance and accelerate convergence. To better reflect real-world complexity, we introduce UnifyBench++, featuring denser textual descriptions and richer user contexts. Experimental results demonstrate that Sync-R1 achieves state-of-the-art performance, showcasing superior cross-task reasoning and robust personalization without requiring complex cold-start procedures. The code and the UnifyBench++ dataset will be released at: this https URL.

[CV-64] Filtering Memorization from Parameter-Space in Diffusion Models

【速读】:该论文旨在解决扩散模型中低秩适配(LoRA)机制导致的训练图像记忆问题,即LoRA在微调过程中可能过度拟合训练数据,从而在生成内容中再现受版权保护或敏感的信息,尤其在LoRA共享生态中风险显著。解决方案的关键在于提出一种无需训练、无需原始数据的后处理过滤框架——基锚定过滤(Base-Anchored Filtering, BAF),其核心思想是将LoRA更新分解为谱通道,并通过测量各通道与预训练主干模型(backbone)主子空间的一致性来区分通用适应性成分与潜在记忆成分:强一致性的通道被保留以维持生成质量,弱一致性的通道则被抑制以消除记忆风险。

链接: https://arxiv.org/abs/2605.10439
作者: Yu Zhe,Yang Jiayan,Wei Junhao,Yu-Lin Tsai,Wang Chen
机构: RIKEN AIP (理化学研究所人工智能中心); Science of Tokyo (东京科学研究所); University of California, Berkeley (加州大学伯克利分校); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) has become a widely used mechanism for customizing diffusion models, enabling users to inject new visual concepts or styles through lightweight parameter updates. However, LoRAs can memorize training images, causing generated outputs to reproduce copyrighted or sensitive content. This risk is particularly concerning in LoRA-sharing ecosystems, where users distribute trained LoRAs without releasing the underlying training data. Existing approaches for mitigating memorization rely on access to the training pipeline, training data, or control over the inference process, making them difficult to apply when only the released LoRA weights are available. We propose Base-Anchored Filtering (BAF), a training-free and data-free framework for post-hoc memorization mitigation in diffusion LoRAs. BAF decomposes LoRA updates into spectral channels and measures their alignment with the principal subspace of the pretrained backbone. Channels strongly aligned with this subspace are retained as generalizable adaptations, while weakly aligned channels are suppressed as potential carriers of memorized content. Experiments on multiple datasets and diffusion backbones demonstrate that BAF consistently reduces memorization while preserving or even improving generation quality. Our code is available in the supplementary material.
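
The filtering idea, decomposing the LoRA update into rank-one spectral channels and keeping only those aligned with the backbone's principal subspace, can be sketched with two SVDs. The subspace size and keep-threshold below are illustrative assumptions, not the paper's calibrated values.

```python
# Sketch: score each spectral channel of a LoRA update by how much its input
# direction lies inside the top principal subspace of the frozen weight, and
# rebuild the update from well-aligned channels only.
import torch

def filter_lora(delta_w, w_base, k_subspace=64, keep_thr=0.3):
    """delta_w: (out, in) LoRA update B @ A; w_base: (out, in) frozen weight."""
    # Top right-singular vectors of the backbone weight span its principal
    # input subspace.
    _, _, Vb = torch.linalg.svd(w_base, full_matrices=False)
    basis = Vb[:k_subspace]                              # (k, in)

    U, S, Vh = torch.linalg.svd(delta_w, full_matrices=False)
    r = int((S > 1e-6 * S[0]).sum())                     # numerical rank of the update
    U, S, Vh = U[:, :r], S[:r], Vh[:r]
    # Fraction of each channel's input direction inside the backbone subspace.
    align = (Vh @ basis.T).pow(2).sum(dim=1)             # (r,), values in [0, 1]
    keep = align > keep_thr
    filtered = (U[:, keep] * S[keep]) @ Vh[keep]
    return filtered, align

w_base = torch.randn(512, 512) / 512 ** 0.5
delta_w = (torch.randn(512, 16) @ torch.randn(16, 512)) * 1e-3   # rank-16 update
filtered, align = filter_lora(delta_w, w_base)
print(f"kept {int((align > 0.3).sum())}/{align.numel()} channels", filtered.shape)
```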

[CV-65] Beyond Spatial Compression: Interface-Centric Generative States for Open-World 3D Structure

【速读】:该论文旨在解决当前3D tokenizers在开放世界(open-world)场景下因依赖空间压缩式表示而导致的结构语义失配问题:传统方法将3D形状编码为紧凑的潜在代码,但无法显式保留组件所有权(component ownership)和连接有效性(attachment validity),导致局部几何、组件身份与装配关系在解码过程中纠缠不清,难以进行结构层面的操作。其解决方案的关键在于提出一种接口中心的生成状态(interface-centric generative states)范式,通过构建可查询、可约束、可修复的操作性状态来替代被动压缩代码;具体实现为Component-Conditioned Canonical Local Tokens (C2LT-3D),该方法将表征因子分解为三个独立变量:规范局部几何(canonical local geometry)、分区条件上下文(partition-conditioned context)以及关系接缝变量(relational seam variables),分别对应姿态泄露、跨组件干扰和无效局部连接等典型失败模式,从而支持装配验证、潜在结构修复、定向干预及约束序列化,无需额外后处理模块即可维持结构推理能力。

链接: https://arxiv.org/abs/2605.10438
作者: Xiang Chen,Alexander Binder
机构: DSC ScaDS.AI, Leipzig University; Institute for Cancer Genetics and Informatics (ICGI), Oslo, Norway; ICT Cluster, Singapore Institute of Technology, Singapore
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current 3D tokenizers largely treat representation as spatial compression: compact codes reconstruct surface geometry, but leave component ownership and attachment validity implicit. In open-world assets with intersecting components, noisy topology, and weak canonical structure, this creates a representation mismatch: local shape, component identity, and assembly relations become entangled in a latent stream and are not natively addressable during decoding. We formulate an alternative view, interface-centric generative states, in which tokenization constructs an operational state rather than a passive compressed code. The state exposes local geometry, component ownership, and attachment validity as variables that can be queried, constrained, and repaired during decoding. We instantiate this formulation with Component-Conditioned Canonical Local Tokens (C2LT-3D), factorizing representation into canonical local geometry, partition-conditioned context, and relational seam variables. Each factor targets a distinct failure mode of compression-centric tokens: pose leakage, cross-component interference, or invalid local attachment. This exposed state supports attachment validation, latent structural repair, targeted intervention, and constrained serialization without a separate post-hoc structure recovery module. Trained on single-object CAD models and evaluated zero-shot on open-world multi-component assets, C2LT-3D improves structural robustness and shows that its latent variables remain actionable under adversarial attachment settings. These results suggest that open-world 3D generative representations should be evaluated not only by reconstruction fidelity, but by whether their discrete states remain operational for assembly-level structural reasoning.

[CV-66] WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors

【速读】:该论文旨在解决当前视频生成模型在世界状态推理能力上的缺失问题,即现有系统虽能生成视觉上逼真的视频,但缺乏对物理、社会、逻辑和信息一致性等维度的正确因果演化建模能力。其核心解决方案是提出WorldReasonBench,一个将视频生成评估重构为世界状态预测任务的基准测试框架,通过结构化QA标注与多维质量评估(包括过程感知推理验证和多维度质量评分)来量化模型在时间连贯性、因果合理性及信息保真度方面的表现,并进一步引入WorldRewardBench用于偏好学习与奖励建模,从而推动真正具备“世界意识”的视频生成技术发展。

链接: https://arxiv.org/abs/2605.10434
作者: Keming Wu,Yijing Cui,Wenhan Xue,Qijie Wang,Xuan Luo,Zhiyuan Feng,Zuhao Yang,Sudong Wang,Sicong Jiang,Haowei Zhu,Zihan Wang,Ping Nie,Wenhu Chen,Bin Wang
机构: Tsinghua University (清华大学); Nanyang Technological University (南洋理工大学); University of Waterloo (滑铁卢大学); Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); 2077 AI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Commercial video generation systems such as Seedance2.0 and Veo3.1 have rapidly improved, strengthening the view that video generators may be evolving into “world simulators.” Yet the community still lacks a benchmark that directly tests whether a model can reason about how an observed world should evolve over time. We introduce WorldReasonBench, which reframes video generation evaluation as world-state prediction: given an initial state and an action, can a model generate a future video whose state evolution remains physically, socially, logically, and informationally consistent? WorldReasonBench contains 436 curated test cases with structured ground-truth QA annotations spanning four reasoning dimensions and 22 subcategories. We evaluate generated videos with a human-aligned two-part methodology: Process-aware Reasoning Verification uses structured QA and reasoning-phase diagnostics to detect temporal and causal failures, while Multi-dimensional Quality Assessment scores reasoning quality, temporal consistency, and visual aesthetics for ranking and reward modeling. We further introduce WorldRewardBench, a preference benchmark with approximately 6K expert-annotated pairs over 1.4K videos, supporting pair-wise and point-wise reward-model evaluation. Across modern video generators, our results expose a persistent gap between visual plausibility and world reasoning: videos can look convincing while failing dynamics, causality, or information preservation. We will release our benchmarks and evaluation toolkit to support community research on genuinely world-aware video generation at this https URL.

[CV-67] CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在自动驾驶中缺乏面向规划的中间表示问题:传统文本链式思维(Chain-of-Thought, CoT)无法保留连续时空结构,而潜在世界推理又难以直接作为动作生成的条件。其解决方案的关键在于提出CoWorld-VLA框架,通过多专家世界推理机制,将互补的世界信息(包括语义交互、几何结构、动态演化和自车轨迹)编码为显式的专家令牌(expert tokens),并设计基于扩散的分层多专家融合规划器,在联合去噪过程中耦合场景上下文生成连续自车轨迹,从而实现可解释且高精度的动作规划。

链接: https://arxiv.org/abs/2605.10426
作者: Minqing Huang,Yujiao Xiang,Zihan Liang,Jiajie Huang,Jingqi Wang,Zhi Xu,Feiyang Tan,Hangning Zhou,Mu Yang,Gong Che
机构: Afari Intelligent Drive; University of Electronic Science and Technology of China; Shanghai Jiao Tong University; Beijing University Of Posts and Telecommunications; Tianjin University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving. However, existing reasoning mechanisms still struggle to provide planning-oriented intermediate representations: textual Chain-of-Thought (CoT) fails to preserve continuous spatiotemporal structure, while latent world reasoning remains difficult to use as a direct condition for action generation. In this paper, we propose CoWorld-VLA, a multi-expert world reasoning framework for autonomous driving, where world representations serve as explicit conditions to guide action planning. CoWorld-VLA extracts complementary world information through multi-source supervision and encodes it into expert tokens within the VLA, thereby providing planner-accessible conditioning signals. Specifically, we construct four types of tokens: semantic interaction, geometric structure, dynamic evolution, and ego trajectory tokens, which respectively model interaction intent, spatial structure, future temporal dynamics, and behavioral goals. During action generation, CoWorld-VLA employs a diffusion-based hierarchical multi-expert fusion planner, which is coupled with scene context throughout the joint denoising process to generate continuous ego trajectories. Experiments show that CoWorld-VLA achieves competitive results in both future scene generation and planning on the NAVSIM v1 benchmark, demonstrating strong performance in collision avoidance and trajectory accuracy. Ablation studies further validate the complementarity of expert tokens and their effectiveness as planning conditions for action generation. Code will be available at this https URL.

[CV-68] Progressive Photorealistic Simplification

【速读】:该论文旨在解决现有图像简化技术(如非photorealistic渲染,Non-Photorealistic Rendering, NPR)在降低视觉复杂度时普遍牺牲照片真实感的问题。其核心挑战在于如何在保持图像自然真实性的前提下实现语义层面的简化。解决方案的关键在于提出一种渐进式语义图像简化框架(progressive semantic image simplification),通过一个迭代的“选择-移除-验证”(Select-Remove-Verify)流程,利用视觉语言模型(Vision-Language Models, VLMs)识别并优先移除冗余语义元素,并结合学习到的验证器确保每一步输出仍为逼真的自然图像。该方法不仅生成高质量的简化轨迹,还进一步通过图像到视频生成模型实现高效序列预测,从而支持内容感知去杂、语义层分解等应用,为摄影真实域内的视觉解释提供了一种结构化的内容删减机制。

链接: https://arxiv.org/abs/2605.10409
作者: Adi Rosenthal,Dana Berman,Yedid Hoshen,Ariel Shamir
机构: Reichman University (里奇曼大学); GoogleIsrael (谷歌以色列); Hebrew University (希伯来大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing image simplification techniques often rely on Non-Photorealistic Rendering (NPR), transforming photographs into stylized sketches, cartoons, or paintings. While effective at reducing visual complexity, such approaches typically sacrifice photographic realism. In this work, we explore a complementary direction: simplifying images while preserving their photorealistic appearance. We introduce progressive semantic image simplification, a framework that iteratively reduces scene complexity by removing and inpainting elements in a controlled manner. At each step, the resulting image remains a plausible natural photograph. Our method combines semantic understanding with generative editing, leveraging Vision-Language Models (VLMs) to identify and prioritize elements for removal, and a learned verifier to ensure photorealism and coherence throughout the process. This is implemented via an iterative Select-Remove-Verify pipeline that produces high-quality simplification trajectories. To improve efficiency, we further distill this process into an image-to-video generation model that directly predicts coherent simplification sequences from a single input image. Beyond generating cleaner and more focused compositions, our approach enables applications such as content-aware decluttering, semantic layer decomposition, and interactive editing. More broadly, our work suggests that simplification through structured content removal can serve as a practical mechanism for guiding visual interpretation within the photorealistic domain, complementing traditional abstraction methods.
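
The Select-Remove-Verify loop can be written down schematically. In the sketch below, select, remove, and verify are hypothetical stand-ins for the VLM element selector, the mask-and-inpaint editor, and the learned photorealism checker; only the control flow follows the abstract.

```python
# Schematic of the Select-Remove-Verify pipeline with placeholder components.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SimplifyStep:
    removed: str
    image: object            # placeholder for an image array / PIL image

def progressive_simplify(image, select: Callable, remove: Callable,
                         verify: Callable, max_steps: int = 8) -> List[SimplifyStep]:
    trajectory = []
    for _ in range(max_steps):
        element = select(image)                 # VLM picks lowest-priority element
        if element is None:                     # nothing left worth removing
            break
        candidate = remove(image, element)      # inpaint the selected region
        if not verify(candidate):               # reject non-photorealistic edits
            continue
        image = candidate
        trajectory.append(SimplifyStep(element, image))
    return trajectory

# Toy run with string "images" so the control flow is executable as-is.
steps = progressive_simplify(
    image="busy street scene",
    select=lambda img: "parked bicycle" if "bicycle" not in img else None,
    remove=lambda img, el: img + f" (with {el} removed, bicycle gone)",
    verify=lambda img: True,
)
print([s.removed for s in steps])
```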

[CV-69] Position: Life-Logging Video Streams Make the Privacy-Utility Trade-off Inevitable

【速读】:该论文旨在解决持续性生命日志视频流(life-logging video streams)在下一代始终在线人工智能系统中引发的隐私-效用权衡问题。随着智能眼镜、体感摄像头等设备的普及,这类视频数据虽能显著提升AI系统的感知与响应能力,但也因暴露行为模式、情绪状态和社会互动等敏感信息而带来严峻隐私风险。现有隐私保护方法或针对特定攻击、或导致显著效用损失,且未覆盖完整的数据处理流程。因此,论文提出关键解决方案是设计面向数据处理全流程的隐私保护机制(pipeline-aware privacy-preserving designs),通过联合优化长期视觉数据的隐私性和功能性,以实现可持续发展的始终在线AI系统。

链接: https://arxiv.org/abs/2605.10404
作者: Tianyuan Zou,Liang Yue,Yang Liu,Ya-Qin Zhang,Sijie Cheng
机构: Institute for AI Industry Research, Tsinghua University, Beijing, China(清华大学人工智能产业研究院); RayNeo.AI, Shenzhen, China; Department of Computer Science and Technology, Tsinghua University, Beijing, China(清华大学计算机科学与技术系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 7 figures

点击查看摘要

Abstract:With the growing prevalence of always-on hardware such as smart glasses, body cameras, and home security systems, life-logging visual sensing is becoming inevitable, forming the backbone of persistent, always-on AI systems. Meanwhile, recent advances in proactive agents and world models signal a fundamental shift from episodic, prompt-driven tools to next-generation AI systems that continuously perceive and react to the physical world. Although life-logging video streams can substantially improve utility of these promising systems, they also introduce significant privacy risks by revealing sensitive information, such as behavioral patterns, emotional states, and social interactions, beyond what isolated images expose. If unresolved, these risks may undermine public trust and hinder the sustainable development of always-on AI technologies. Existing privacy protections are either attack-specific or incur substantial utility loss, and fail to consider the entire data exploitation pipeline. We therefore posit that the privacy-utility trade-off in life-logging video streams is a foundational challenge for next-generation AI systems that demands further investigation. We call for novel pipeline-aware privacy-preserving designs that jointly optimize utility and privacy for long-horizon life-logging visual data. In parallel, formal privacy leakage metrics and standardized benchmarks remain important open directions for future research.

[CV-70] AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation

【速读】:该论文旨在解决跨域视觉异常检测(Visual Anomaly Detection, VAD)中因不同领域间异常定义、数据模态和标注标准差异导致的单域训练模型难以迁移的问题。现有基于视觉语言模型(Vision-Language Models, VLMs)的直接推理方法受限于对先验知识的依赖,缺乏对正常样本参考和细粒度特征证据的有效利用,从而导致判断不可靠。解决方案的关键在于提出 AnomalyClaw——一个无需训练的VAD代理系统,其将异常判断转化为多轮反驳过程:每轮中代理生成候选异常并基于正常样本参考进行逐项反驳,借助包含13种工具的库实现视觉验证、参考解析与冻结专家探测。该机制显著提升了VLM在跨域场景下的异常理解与推理能力,而非简单聚合工具输出。

链接: https://arxiv.org/abs/2605.10397
作者: Xi Jiang,Yinjie Zhao,Zesheng Yang,Feng Zheng
机构: Southern University of Science and Technology (SUSTech); Nanyang Technological University (NTU); Agency for Science, Technology and Research (A*STAR)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: We release the agent, the benchmark, and the analysis artifacts at this https URL

点击查看摘要

Abstract:Visual anomaly detection (VAD) is crucial in many real-world fields, such as industrial inspection, medical imaging, infrastructure monitoring, and remote sensing. However, the specific anomaly definitions, data modalities, and annotation standards across different domains make it difficult to transfer single-domain trained VAD models. Vision-language models (VLMs), pre-trained on large-scale cross-domain data, can perform visual perception under task instructions, offering a promising solution for cross-domain VAD. However, single-inference VLM judgments are unreliable, since they rely more on prior knowledge than on normal-sample references or fine-grained feature evidence. We therefore present AnomalyClaw, a training-free VAD agent that turns anomaly judgment into a multi-round refutation process. In each round, the agent proposes candidate anomalies and refutes each against normal-sample references, drawing on a 13-tool library for visual verification, reference parsing, and frozen expert probing. On the CrossDomainVAD-12 benchmark (12 datasets), AnomalyClaw achieves consistent macro-AUROC improvements over single-step direct inference with +6.23 pp on GPT-5.5, +7.93 pp on Seed2.0-lite, and +3.52 pp on Qwen3.5-VL-27B. We further introduce an optional verbalized self-evolution extension. It builds an online rulebook from internal-branch disagreement without oracle labels. On Qwen3.5-VL-27B, it delivers a +2.09 pp mean gain, comparable to a K = 10 oracle-label supervised baseline (+1.99 pp). These results show that agentic refutation improves the anomaly understanding and reasoning of VLMs, rather than merely aggregating tool outputs.

[CV-71] Sens-VisualNews: A Benchmark Dataset for Sensational Image Detection ICIP2026

【速读】:该论文旨在解决媒体内容中感官刺激性图像(sensational image)的自动识别问题,以辅助筛选值得核查的信息并标记潜在虚假信息。其核心挑战在于如何有效检测图像中引发强烈情绪反应的视觉特征,这些特征常导致用户在未经批判性思考的情况下加速传播。解决方案的关键在于构建了一个新的基准数据集(Sens-VisualNews),包含9,576张新闻图像,基于视觉内容中是否存在多种感官概念和事件进行标注,并在此基础上系统评估多种开源先进多模态大语言模型(Multimodal LLMs)在零样本和微调设置下的提示敏感性、性能与鲁棒性。

链接: https://arxiv.org/abs/2605.10394
作者: Andreas Goulas,Damianos Galanopoulos,Evlampios Apostolidis,Vasileios Mezaris
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Authors’ Accepted Version; Accepted at IEEE ICIP 2026

点击查看摘要

Abstract:The detection of sensational content in media items can be a critical filtering mechanism for identifying check-worthy content and flagging potential disinformation, since such content triggers physiological arousal that often bypasses critical evaluation and accelerates viral sharing. In this paper we introduce the task of sensational image detection, which aims to determine whether an image contains shocking, provocative, or emotionally charged features to grab attention and trigger strong emotional responses. To support research on this task, we create a new benchmark dataset (called Sens-VisualNews) that contains 9,576 images from news items, annotated based on the (in-)existence of various sensational concepts and events in their visual content. Finally, using Sens-VisualNews, we study the prompt sensitivity, performance and robustness of a wide range of open SotA Multimodal LLMs, across both zero-shot and fine-tuned settings.

[CV-72] Temporal Sampling Frequency Matters: A Capacity-Aware Study of End-to-End Driving Trajectory Prediction

【速读】:该论文旨在解决端到端(End to End, E2E)自动驾驶轨迹预测模型在训练时普遍采用最高可用时间采样频率这一默认假设是否合理的问题。作者通过将时间采样频率视为显式的训练集设计变量,构建了不同频率的子采样训练集,并在固定训练协议下评估同一模型在不同频率下的性能变化,从而揭示采样频率对预测性能的影响机制。解决方案的关键在于:从“容量感知”(capacity-aware)视角出发,识别出稀疏采样可能遗漏驾驶相关线索、密集采样则可能引入冗余视觉内容和流形外噪声,进而对有限容量模型造成与驾驶无关的计算负担;实验表明,小型E2E模型通常在较低或中等频率下表现最优,而大型VLA风格模型(如AutoVLA)则依赖于最高频率以达到最佳性能,且这种差异并非仅由训练迭代次数不均所致。因此,论文主张应将时间采样频率作为可调参数进行报告与优化,而非默认使用最高频率。

链接: https://arxiv.org/abs/2605.10388
作者: Yumao Liu,Tao Liu,Xiangyu Li,Jiaxiang Li,Ke Ma
机构: The Hong Kong University of Science and Technology (Guangzhou)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:End-to-end (E2E) autonomous driving trajectory prediction is often trained with camera frames sampled at the highest available temporal frequency, assuming that denser sampling improves performance. We question this assumption by treating temporal sampling frequency as an explicit training set design variable. Starting from high-frequency E2E driving datasets, we construct frequency-sweep training sets by temporally subsampling camera frames along each trajectory. For each model-dataset pair, we train and evaluate the same model under a fixed protocol, so the frequency response reflects how prediction performance changes with sampling frequency. We analyze this response from a capacity-aware perspective. Sparse sampling may miss driving-relevant cues, while dense sampling may add redundant visual content and off-manifold noise. For finite-capacity models, this can create a driving-irrelevant capacity burden. We evaluate three smaller E2E models and a larger VLA-style AutoVLA model on Waymo, nuScenes, and PAVE. Results show model- and dataset-dependent frequency responses. Smaller E2E models often show non-monotonic or near-plateau trends and achieve their best 3-second ADE at lower or intermediate frequencies. In contrast, AutoVLA achieves its best 3-second ADE and FDE at the highest evaluated frequency on all three datasets. Iteration-matched controls suggest that the advantage of lower or intermediate frequencies for smaller models is not explained only by unequal training update counts. These findings show that temporal sampling frequency should be reported and tuned, rather than fixed to the highest available value.
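
Constructing the frequency-sweep training sets amounts to keeping every k-th frame of each trajectory. A minimal sketch, with an assumed 10 Hz base rate and illustrative target rates, is shown below.

```python
# Sketch: build lower-frequency variants of a trajectory by keeping every
# k-th frame; the base rate and target rates are illustrative.
from typing import Dict, List, Tuple

def subsample_trajectory(frames: List[str], base_hz: float,
                         target_hz: float) -> List[str]:
    if target_hz > base_hz:
        raise ValueError("cannot upsample by dropping frames")
    stride = round(base_hz / target_hz)
    return frames[::stride]

def build_frequency_sweep(frames: List[str], base_hz: float,
                          targets: Tuple[float, ...] = (10, 5, 2, 1)
                          ) -> Dict[float, List[str]]:
    return {hz: subsample_trajectory(frames, base_hz, hz) for hz in targets}

frames = [f"frame_{i:04d}.jpg" for i in range(30)]        # 3 s of 10 Hz video
sweep = build_frequency_sweep(frames, base_hz=10)
print({hz: len(v) for hz, v in sweep.items()})             # {10: 30, 5: 15, 2: 6, 1: 3}
```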

[CV-73] SleepWalk: A Three-Tier Benchmark for Stress-Testing Instruction-Guided Vision-Language Navigation

【速读】:该论文旨在解决当前视觉-语言模型(Vision-Language Models, VLMs)在三维数字环境中是否能够可靠地将自然语言指令转化为空间一致且可执行的动作轨迹这一关键问题。现有研究多聚焦于长距离跨房间探索,而忽视了局部化、以交互为中心的具身推理能力。为此,作者提出 SleepWalk 基准测试框架,其核心在于构建基于文本描述生成的可导航单场景3D环境,并设计三类空间与时间复杂度递增的任务层级,从而系统评估模型在几何约束下预测合理路径的能力。解决方案的关键在于引入标准化点对点评判协议和精细的任务分层机制,使得模型在面对遮挡、交互限制及多步指令时的表现可被量化分析,揭示当前VLMs在具身语义接地(embodied grounding)方面的系统性缺陷,为推进具身规划、视觉-语言导航及动作可行智能体的发展提供可扩展、可控的评测基准。

链接: https://arxiv.org/abs/2605.10376
作者: Niyati Rawal,Sushant Ravva,Shah Alam Abir,Saksham Jain,Aman Chadha,Vinija Jain,Suranjana Trivedy,Amitava Das
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) have advanced rapidly in multimodal perception and language understanding, yet it remains unclear whether they can reliably ground language into spatially coherent, plausibly executable actions in 3D digital environments. We introduce SleepWalk, a benchmark for evaluating instruction-grounded trajectory prediction in single-scene 3D worlds generated from textual scene descriptions and filtered for navigability. Unlike prior navigation benchmarks centered on long-range exploration across rooms, SleepWalk targets localized, interaction-centric embodied reasoning: given rendered visual observations and a natural-language instruction, a model must predict a trajectory that respects scene geometry, avoids collisions, and terminates at an action-compatible location. The benchmark covers diverse indoor and outdoor environments and organizes tasks into three tiers of spatial and temporal difficulty, enabling fine-grained analysis of grounding under increasing compositional complexity. Using a standardized pointwise judge-based evaluation protocol, we evaluate three frontier VLMs on 2,472 curated 3D environments with nine instructions per scene. Results reveal systematic failures in grounded spatial reasoning, especially under occlusion, interaction constraints, and multi-step instructions: performance drops as the difficulty level of the tasks increase. In general, current VLMs can somewhat produce trajectories that are simultaneously spatially coherent, plausibly executable, and aligned with intended actions. By exposing failures in a controlled yet scalable setting, SleepWalk provides a critical benchmark for advancing grounded multimodal reasoning, embodied planning, vision-language navigation, and action-capable agents in 3D environments.

[CV-74] Halo Separation-guided Underwater Multi-scale Image Restoration

【速读】:该论文旨在解决水下图像增强中因人工光源导致的光晕(halo)干扰问题,此类光晕会严重降低图像质量并影响后续视觉任务。现有方法未能充分考虑该因素,导致在人工光源场景下的鲁棒性较差。解决方案的关键在于设计一种基于迭代结构的单光晕图像校正方法,其核心由两个子网络组成:一是通过梯度最小化实现光晕层分离的子网络,二是利用多尺度恢复机制重构被光晕遮蔽的图像信息;同时结合UIEB和EUVP合成数据集进行训练,并引入径向梯度约束以优化光晕消除效果,从而提升水下图像复原性能。

链接: https://arxiv.org/abs/2605.10374
作者: Jiaxin Yang,Honglin Liu,Yongli Wang,Shuyi Cao,Chengcheng Jiang,Jiale Wang
机构: Dalian Maritime University (大连海事大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Underwater images captured by Autonomous Underwater Vehicles (AUVs) are inevitably affected by artificial light sources, which often produce halos in the foreground of the camera and seriously interfere with the quality of the image. The existing underwater image enhancement methods fail to fully consider this key problem, and the robustness of processing images under artificial light scenes is poor. In practical applications, since underwater image enhancement itself is a very challenging task, the influence of artificial light sources will lead to serious degradation of image performance and affect subsequent vision tasks. In order to effectively deal with this problem, this paper designs a single halo image correction method based on an iterative structure. The network is mainly divided into two sub-networks, one is the halo layer separation sub-network which aims to separate the halo by gradient minimization, and the other is the multi-scale recovery sub-network which aims to recover the image information masked by the halo. The UIEB and EUVP synthetic datasets are used for training to ensure that the network can fully learn the characteristics and laws of underwater halo images. Then a large number of halo images taken in an underwater environment with real artificial light are collected for testing. In addition, the brightness distribution characteristics of underwater halo images are analyzed, and a radial-gradient constraint is introduced to eliminate halos and improve the quality of underwater image restoration.

[CV-75] CellDX AI Autopilot: Agent-Guided Training and Deployment of Pathology Classifiers

【速读】:该论文旨在解决计算病理学中人工智能(AI)模型训练所面临的两大瓶颈问题:一是缺乏机器学习(ML)专业知识的病理学家难以参与模型开发,二是研究人员受限于工程资源无法高效开展大量实验。解决方案的关键在于提出 CellDX AI Autopilot 平台,该平台通过自然语言交互与通用大语言模型(LLM-based agent)驱动的智能体(agent)实现端到端自动化建模流程,其核心包括:1)一套结构化的病理专用智能体技能(agent skills),涵盖数据集构建、自动超参数调优、多策略模型对比和人机协同部署;2)基于多重实例学习(Multiple Instance Learning, MIL)框架支持四种分类策略;3)一种迭代成对超参数搜索机制(grid 或种子随机搜索),相比穷举搜索将调参成本降低超过 30 倍。该平台首次将病理领域专用技能和训练基础设施开放给通用 AI 智能体,无需智能体本身具备病理知识即可完成全流程建模,显著降低了技术门槛并提升了实验效率。

链接: https://arxiv.org/abs/2605.10362
作者: Alexey Pchelnikov,Aleksei Pchelnikov
机构: HistAI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Training AI models for computational pathology currently requires access to expensive whole-slide-image datasets, GPU infrastructure, deep expertise in machine learning, and substantial engineering effort. We present CellDX AI Autopilot, a platform that lets users – from pathologists with no ML background to ML practitioners running many parallel experiments – train, evaluate, and deploy whole-slide image classifiers through natural language interaction with an AI agent. The platform provides a structured set of agent skills that guide the user through dataset curation, automated hyperparameter tuning, multi-strategy model comparison, and human-in-the-loop deployment, all on a pre-built dataset of over 32,000 cases and 66,000 HE-stained whole-slide images with pre-extracted features. We describe the agent skill architecture, the underlying Multiple Instance Learning (MIL) training framework supporting four classification strategies, and an iterative pairwise hyperparameter search (grid or seeded random) that reduces tuning cost by over 30x compared to exhaustive search. CellDX AI Autopilot is, to our knowledge, the first system to expose pathology-specialized agent skills and a pathology-specialized training platform to general-purpose AI agents (e.g. any LLM-based agent runtime), delivering end-to-end automated model training without requiring the agent itself to be domain-specific. The platform addresses both the ML-expertise bottleneck that limits adoption in diagnostic pathology and the engineering bottleneck that limits how many experiments a researcher can run cost-effectively.
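
One way to read the "iterative pairwise hyperparameter search" is as a sweep over one pair of hyperparameters at a time while the others stay at their current best values, which already cuts the number of evaluations well below a full grid. The sketch below implements that reading with a toy objective standing in for MIL validation performance; the pairs, search space, and procedure details are assumptions, not the actual CellDX implementation.

```python
# Sketch of an iterative pairwise search: small 2-D sweeps over pairs of
# hyperparameters instead of one exhaustive grid over all of them.
import itertools

def pairwise_search(space, evaluate, pairs, n_rounds=1):
    best = {k: v[0] for k, v in space.items()}            # start at first values
    best_score = evaluate(best)
    for _ in range(n_rounds):
        for a, b in pairs:
            for va, vb in itertools.product(space[a], space[b]):
                cand = dict(best, **{a: va, b: vb})
                score = evaluate(cand)
                if score > best_score:
                    best, best_score = cand, score
    return best, best_score

space = {"lr": [1e-4, 3e-4, 1e-3], "dropout": [0.0, 0.25, 0.5],
         "bag_size": [64, 128, 256], "weight_decay": [1e-5, 1e-4]}

# Toy objective with a known optimum, standing in for validation AUC.
target = {"lr": 3e-4, "dropout": 0.25, "bag_size": 128, "weight_decay": 1e-4}
def evaluate(cfg):
    return -sum(cfg[k] != target[k] for k in cfg)

best, score = pairwise_search(space, evaluate,
                              pairs=[("lr", "dropout"), ("bag_size", "weight_decay")])
print(best, score)   # reaches the optimum in ~16 instead of 3*3*3*2 = 54 evaluations
```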

[CV-76] DySurface: Consistent 4D Surface Reconstruction via Bridging Explicit Gaussians and Implicit Functions

【速读】:该论文旨在解决动态场景下神经辐射场(NeRF)与3D高斯泼溅(3DGS)在时间一致性几何重建中存在的问题,如表面不连续、严重伪影和断裂等,这些问题主要源于仅依赖光度优化导致的几何歧义。其解决方案的关键在于提出DySurface框架,通过融合显式高斯表示与隐式符号距离函数(SDF)的几何保真度,构建一个名为VoxGS-DSDF的分支:该分支利用变形后的高斯点生成动态稀疏体素网格,为隐式SDF场提供显式的几何引导,从而有效规制体积渲染过程,显著提升表面重建质量,实现闭合边界和细节丰富的几何表达。

链接: https://arxiv.org/abs/2605.10360
作者: Minje Kim,Younghyun Noh,Jaesoon Kim,Tae-Kyun Kim
机构: KAIST(韩国科学技术院); KT(韩国电信); Sungkyunkwan University(成均馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While novel view synthesis (NVS) for dynamic scenes has seen significant progress, reconstructing temporally consistent geometric surfaces remains a challenge. Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) offer powerful dynamic scene rendering capabilities; however, relying solely on photometric optimization often leads to geometric ambiguities. This results in discontinuous surfaces, severe artifacts, and broken surfaces over time. To address these limitations, we present DySurface, a novel framework that bridges the effectiveness of explicit Gaussians with the geometric fidelity of implicit Signed Distance Functions (SDFs) in dynamic scenes. Our approach tackles the structural discrepancy between the forward deformation of 3DGS (canonical → dynamic) and the backward deformation required for volumetric SDF rendering (dynamic → canonical). Specifically, we propose the VoxGS-DSDF branch that leverages deformed Gaussians to construct a dynamic sparse voxel grid, providing explicit geometric guidance to the implicit SDF field. This explicit anchoring effectively regularizes the volumetric rendering process, significantly improving surface reconstruction quality, with watertight boundaries and detailed representations. Quantitative and qualitative experiments demonstrate that DySurface significantly outperforms state-of-the-art baselines in geometric accuracy metrics while maintaining competitive rendering performance.

[CV-77] Portable Active Learning for Object Detection CVPR2026

【速读】:该论文旨在解决目标检测中标注边界框(bounding boxes)成本高昂、限制模型可扩展性的问题,同时在保持高精度的前提下最小化人工标注负担。现有主动学习方法通常依赖于模型内部特征或修改检测器结构与训练流程,导致集成复杂度高,且较少联合利用图像级信号、类别不平衡线索和实例级不确定性进行综合选择。本文提出了一种无需改动检测器结构的轻量级、通用性强的主动学习框架——Portable Active Learning (PAL),其核心在于仅基于推理输出,结合类别特定的实例不确定性(通过轻量级逻辑回归分类器计算熵值)与图像级多样性(全局图像熵、类别多样性及图像相似度),实现高效且多样化的样本筛选,从而显著提升标签效率与检测性能,适用于多种真实场景下的目标检测系统部署。

链接: https://arxiv.org/abs/2605.10349
作者: Rashi Sharma,Justin Timothy C. Bersamin,Karthikk Subramanian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: CVPR 2026(highlight)

点击查看摘要

Abstract:Annotating bounding boxes is costly and limits the scalability of object detection. This challenge is compounded by the need to preserve high accuracy while minimizing manual effort in real-world applications. Prior active learning methods often depend on model features or modify detector internals and training schedules, increasing integration overhead. Moreover, they rarely jointly exploit the benefits of image-level signals, class-imbalance cues, and instance-level uncertainty for comprehensive selection. We present Portable Active Learning (PAL), a detector-agnostic, easily portable framework that operates solely on inference outputs. PAL combines class-wise instance uncertainty with image-level diversity to guide data selection. At each round, PAL trains lightweight class-specific logistic classifiers to distinguish true from false positives, producing entropy-based uncertainty scores for proposals. Candidate images are then refined using global image entropy, class diversity, and image similarity, yielding batches that are both informative and diverse. PAL requires no changes to model internals or training pipelines, ensuring broad compatibility across detectors. Extensive experiments on COCO, PASCAL VOC, and BDD100K demonstrate that PAL consistently improves label efficiency and detection accuracy compared to existing active learning baselines, making it a practical solution for scalable and cost-effective deployment of object detection in real-world settings.
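
The instance-uncertainty step can be sketched directly from the abstract: fit a lightweight logistic classifier on labeled proposals (true positive vs. false positive), then rank unlabeled images by the mean binary entropy of their proposal scores. The feature construction, mean aggregation, and all synthetic data below are illustrative assumptions, and the image-level diversity terms are omitted.

```python
# Sketch of PAL's instance-uncertainty scoring on detector inference outputs.
import numpy as np
from sklearn.linear_model import LogisticRegression

def binary_entropy(p, eps=1e-7):
    p = np.clip(p, eps, 1 - eps)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

rng = np.random.default_rng(0)
# Labeled pool: proposal features (e.g. score, box stats) and TP/FP labels.
X_lab = rng.normal(size=(500, 8))
y_lab = (X_lab[:, 0] + 0.3 * rng.normal(size=500) > 0).astype(int)
clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)   # one per class in PAL

# Unlabeled pool: 20 images with a variable number of proposals each.
image_scores = []
for _ in range(20):
    props = rng.normal(size=(rng.integers(3, 12), 8))
    p_tp = clf.predict_proba(props)[:, 1]
    image_scores.append(binary_entropy(p_tp).mean())

query = np.argsort(image_scores)[::-1][:5]     # most uncertain images first
print(query, np.round(np.array(image_scores)[query], 3))
```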

[CV-78] BGG: Bridging the Geometric Gap between Cross-View images by Vision Foundation Model Adaptation for Geo-Localization

【速读】:该论文旨在解决跨视角地理定位(Cross-View Geo-Localization, CVGL)中因无人机与卫星图像间几何差异显著而导致的定位精度下降问题。解决方案的关键在于提出一种基于视觉基础模型(Vision Foundation Model, VFM)的参数高效适配框架——BGG,其核心包括两个模块:多粒度特征增强适配器(Multi-granularity Feature Enhancement Adapter, MFEA)和频域感知结构聚合模块(Frequency-Aware Structural Aggregation, FASA)。MFEA通过多层空洞卷积提升特征的尺度适应性和视角鲁棒性,有效缩小跨视角几何差距;FASA则在频域对patch token进行调制并自适应聚合局部结构特征,弥补[CLS] token空间细节不足的问题,最终融合增强后的局部特征与[CLS] token以实现更精准的地理定位。

链接: https://arxiv.org/abs/2605.10345
作者: Wei Wang,Dou Quan,Ning Huyan,Shuang Wang,Yi Li,Pei He,Licheng Jiao
机构: Xidian University (西安电子科技大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Geometric differences between cross-view images, such as drone and satellite views, significantly increase the challenge of Cross-View Geo-Localization (CVGL), which aims to acquire the geolocation of images by image retrieval. To further enhance the CVGL performance, this paper proposes a parameter-efficient adaptation framework for bridging the geometric gap across images based on the vision foundation model (VFM) (e.g., DINOv3), termed BGG. BGG not only effectively leverages the general visual representations of VFM and captures the robust and consistent features from cross-view images, but also utilizes the generalization capabilities of the VFM, significantly improving the CVGL performance. It mainly contains a Multi-granularity Feature Enhancement Adapter (MFEA) and a Frequency-Aware Structural Aggregation (FASA) module. Specifically, MFEA enhances the scale adaptability and viewpoint robustness of features by multi-level dilated convolutions, effectively bridging the cross-view geometric gap with small training costs. Additionally, considering the [CLS] token lacks spatial details for precise image retrieval and localization, the FASA module modulates patch tokens in the frequency domain and performs adaptive aggregation for local structural feature enhancement. Finally, BGG fuses the enhanced local features with the [CLS] token for more accurate CVGL. Extensive experiments on University-1652 and SUES-200 datasets demonstrate that BGG has significant advantages over other methods and achieves state-of-the-art localization performance with low training costs.

[CV-79] EvoStreaming: Your Offline Video Model Is a Natively Streaming Assistant

【速读】:该论文旨在解决视频语言模型(VideoLLM)在流式视频理解场景中缺乏实时交互决策能力的问题,即模型难以在不牺牲响应效率的前提下判断何时输出回答。现有方法多基于离线推理训练,且评估协议将响应时机决策交由外部评价者处理,导致模型无法学习到有效的交互策略。解决方案的关键在于提出EvoStreaming框架,该框架通过自演化机制使基础模型自身充当数据生成器、相关性标注者和策略执行者,从而在无需额外监督信号的情况下,仅用1,000个自生成样本即可高效优化模型的流式响应行为,显著提升RealStreamEval评分,同时保持原有离线性能。

链接: https://arxiv.org/abs/2605.10343
作者: Zichen Wen,Boxue Yang,Junlong Ke,Jiajie Huang,Chenfei Liao,Junxi Wang,Xuyang Liu,Linfeng Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Tsinghua University (清华大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 33 pages, 9 figures

点击查看摘要

Abstract:Streaming video understanding demands more than watching longer videos: assistants must decide when to speak in real time, balancing responsiveness against verbosity. Yet most video-language models (VideoLLMs) are trained for offline inference, and existing streaming benchmarks externalize this timing decision to the evaluator. We address this gap with RealStreamEval, a frame-level multi-turn evaluation protocol that exposes models to sequential observations and penalizes unnecessary responses. Under this protocol, we observe that strong offline VideoLLMs retain useful visual understanding but lack an interaction policy for deciding when to respond. Motivated by this observation, we propose EvoStreaming, a self-evolved streaming adaptation framework in which the base model itself acts as data generator, relevance annotator, and roll-out policy to synthesize streaming trajectories without external supervision. With only 1,000 self-generated samples (139× fewer than the leading streaming instruction-tuning approach) and no architectural changes, EvoStreaming consistently improves the overall RealStreamEval score by up to 10.8 points across five open VideoLLM backbones (Qwen2/2.5/3-VL, InternVL-3.5, MiniCPM-V4.5) while largely preserving offline video performance. These results suggest that data-efficient interaction tuning is a practical path for adapting existing VideoLLMs to streaming assistants.

[CV-80] The Alpha Blending Hypothesis: Compositing Shortcut in Deepfake Detection

【速读】:该论文旨在解决当前深度伪造(deepfake)检测方法在跨数据集泛化能力提升后,其内在工作机制仍不明确的问题。研究发现,现有基于帧的检测器主要并非学习语义异常或生成式神经指纹,而是作为Alpha混合(alpha blending)搜索器,定位伪造图像与目标帧融合时引入的低层次合成伪影。解决方案的关键在于提出“Alpha Blending Hypothesis”并设计BlenD方法:利用仅含真实人脸图像的大规模多样化数据集,通过引入自混合图像(self-blended images, SBI)增强训练样本,从而避免使用显式生成的深度伪造图像进行训练,同时显著提升跨数据集泛化性能;进一步通过结合显式混合搜索器与抗混合捷径模型的集成策略,实现94.0%的AUROC,达到当前最优效果。

链接: https://arxiv.org/abs/2605.10334
作者: Andrii Yermakov,Jan Cech,Mario Fritz,Jiri Matas
机构: Czech Technical University in Prague (捷克技术大学); CISPA Helmholtz Center for Information Security (CISPA亥姆霍兹信息安全中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent deepfake detection methods demonstrate improved cross-dataset generalization, yet the underlying mechanisms remain underexplored. We introduce the Alpha Blending Hypothesis, positing that state-of-the-art frame-based detectors primarily function as alpha blending searchers; rather than learning semantic anomalies or specific generative neural fingerprints, they localize low-level compositing artifacts introduced during the integration of manipulated faces into target frames. We experimentally validate the hypothesis, demonstrating that deepfake detectors exhibit high sensitivity to the so-called self-blended images (SBI) and non-generative manipulations. We propose the method BlenD that leverages a large-scale, diverse dataset of real-only facial images augmented with SBI. This approach achieves the best average cross-dataset generalization on 15 compositional deepfake datasets released between 2019 and 2025 without utilizing explicitly generated deepfakes during training. Furthermore, we show that predictions from explicit blending searchers and models resilient to blending shortcuts are highly complementary, yielding a state-of-the-art AUROC of 94.0% in an ensemble configuration. The code with experiments and the trained model will be publicly released.
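
For readers unfamiliar with the compositing operation the hypothesis refers to, the sketch below builds a crude self-blended image from a single "real" face array: the source is a lightly perturbed copy of the same image, blended back through a feathered alpha mask. The mask shape and perturbation are illustrative assumptions, not the SBI recipe used in the paper.

```python
# Minimal alpha-blending / self-blended-image (SBI) style composite (toy mask and jitter).
import numpy as np

def alpha_blend(foreground, background, alpha):
    """I = alpha * F + (1 - alpha) * B, computed per pixel."""
    return alpha[..., None] * foreground + (1.0 - alpha[..., None]) * background

h, w = 64, 64
face = np.random.rand(h, w, 3).astype(np.float32)

# Lightly perturb the same face (color jitter stands in for source transforms)
source = np.clip(face * 1.05 + 0.02, 0.0, 1.0)

# Soft elliptical mask with feathered edges as a crude blending boundary
yy, xx = np.mgrid[0:h, 0:w]
dist = ((yy - h / 2) / (h / 3)) ** 2 + ((xx - w / 2) / (w / 3)) ** 2
alpha = np.clip(1.5 - dist, 0.0, 1.0)   # 1 inside, smooth falloff outside

sbi = alpha_blend(source, face, alpha)  # a "fake" made only from a real image
print(sbi.shape, float(np.abs(sbi - face).max()))
```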

[CV-81] LimeCross: Context-Conditioned Layered Image Editing with Structural Consistency

【速读】:该论文旨在解决可控的分层图像编辑问题,即在保持其他图层不变的前提下,对用户选定的RGBA图层进行文本引导的精准修改,同时避免传统方法中存在的光照一致性差、图层间干扰(如背景到前景泄露)及透明度不稳定等问题。其解决方案的关键在于提出LimeCross框架——一个无需训练的、基于上下文条件的分层图像编辑方法,通过双流注意力机制利用其他图层的上下文信息以维持跨图层一致性,并显式保护各图层完整性,从而实现高保真且稳定的编辑效果。

链接: https://arxiv.org/abs/2605.10319
作者: Ryugo Morita,Stanislav Frolov,Brian Bernhard Moser,Ko Watanabe,Riku Takahashi,Issey Sukeda,Andreas Dengel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Layered image assets are widely used in real-world creative workflows, enabling non-destructive iteration and flexible re-composition. Recent advances in layered image generation and decomposition synthesize or recover layered representations, yet controllable editing of layered images remains challenging. Manual editing requires careful coordination across layers to maintain consistent illumination and contact, while AI-based pipelines collapse layers into a flattened image for editing, then decompose them again, introducing background-to-foreground leakage and unstable transparency. To address these limitations, we propose LimeCross, a training-free context-conditioned layered image editing framework that edits user-selected RGBA layers according to text while keeping the remaining layers unchanged. It leverages contextual cues from other layers using a bi-stream attention mechanism to preserve cross-layer consistency, while explicitly maintaining layer integrity to prevent the contamination of edited layers. To evaluate our approach, we introduce LayerEditBench, a benchmark of 1500 layered scenes with paired source/target prompts, along with evaluation protocols that assess both edit fidelity and alpha channel stability. Extensive experiments demonstrate that LimeCross improves layer purity and composite realism over strong editing baselines, establishing context-conditioned layered editing as a principled framework for controllable generative creation.

[CV-82] PaMoSplat: Part-Aware Motion-Guided Gaussian Splatting for Dynamic Scene Reconstruction

【速读】:该论文旨在解决动态场景重建中高保真渲染与精确跟踪难题,尤其针对具有复杂大范围运动的场景。其核心挑战在于如何有效建模场景的非刚性变形并提升优化稳定性与效率。解决方案的关键在于提出PaMoSplat框架,通过引入部件感知(part awareness)运动先验(motion priors),将场景分解为由几何一致的高斯部件(Gaussian parts)构成的可变形结构;同时利用光流(optical flow)引导部件运动估计,并结合差分进化算法实现鲁棒的刚体运动初始化,辅以自适应迭代机制、可学习刚性约束及光流监督的渲染损失,显著提升了重建质量、跟踪精度与收敛速度。

链接: https://arxiv.org/abs/2605.10307
作者: Yinan Deng,Jianyu Dou,Jiahui Wang,Jingyu Zhao,Yi Yang,Yufeng Yue
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
备注: Accepted by TCSVT. Project Url: this https URL

点击查看摘要

Abstract:Dynamic scene reconstruction represents a fundamental yet demanding challenge in computer vision and robotics. While recent progress in 3DGS-based methods has advanced dynamic scene modeling, obtaining high-fidelity rendering and accurate tracking in scenarios with substantial, intricate motions remains significantly challenging. To address these challenges, we propose PaMoSplat, a novel dynamic Gaussian splatting framework incorporating part awareness and motion priors. Our approach is grounded in two key observations: 1) Parts serve as primitives for scene deformation, and 2) Motion cues from optical flow can effectively guide part motion. Specifically, PaMoSplat initializes by lifting multi-view segmentation masks into 3D space via graph clustering, establishing coherent Gaussian parts. For subsequent timestamps, we leverage a differential evolutionary algorithm to estimate the rigid motion of these parts using multi-view optical flow cues, providing a robust warm-start for further optimization. Additionally, PaMoSplat introduces an adaptive iteration count mechanism, internal learnable rigidity, and flow-supervised rendering loss to accelerate and optimize the training process. Comprehensive evaluations across diverse scenes, including real-world environments, demonstrate that PaMoSplat delivers superior rendering quality, improved tracking precision, and faster convergence compared to existing methods. Furthermore, it enables multiple part-level downstream applications, such as 4D scene editing.

[CV-83] PolarVSR: A Unified Framework and Benchmark for Continuous Space-Time Polarization Video Reconstruction

【速读】:该论文旨在解决主流分焦平面(Division of-Focal-Plane, DoFP)彩色偏振成像中,从捕获的马赛克阵列恢复偏振参数所面临的逆问题挑战,以及现有DoFP相机因硬件限制难以支持高帧率采集、从而制约动态视频偏振成像的问题。解决方案的关键在于提出首个时空联合增强的偏振视频重建架构,通过联合建模空间与时间维度上的偏振方向,并引入一种偏振感知的隐式神经表示(polarization-aware implicit neural representation),实现连续且高保真度的上采样;同时,基于偏振参数在时间上的变化特性,设计了一种流引导的偏振变化损失函数(flow-guided polarization variation loss),以监督偏振动态过程,从而提升动态场景下的偏振视频重建质量。

链接: https://arxiv.org/abs/2605.10275
作者: Chenggong Li,Yidong Luo,Junchao Zhang,Boxin Shi,Degui Yang
机构: Central South University (中南大学); Hunan Provincial Key Laboratory of Optic-Electronic Intelligent Measurement and Control (湖南省光学电子智能测量与控制重点实验室); Zhejiang University (浙江大学); Westlake University (西湖大学); Peking University (北京大学); State Key Laboratory of Multimedia Information Processing (多媒体信息处理国家重点实验室); National Engineering Research Center of Visual Technology (视觉技术国家工程研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Polarimetric imaging captures surface polarization characteristics, such as the Degree of Linear Polarization (DoLP) and the Angle of Polarization (AoP). In mainstream Division-of-Focal-Plane (DoFP) color polarization imaging, recovering polarization parameters from captured mosaic arrays remains a challenging inverse problem. Existing DoFP cameras also face hardware bottlenecks and often cannot support high-frame-rate acquisition, limiting polarimetric imaging in dynamic video tasks. These limitations motivate joint spatial and temporal enhancement. To this end, we propose the first space-time polarization video reconstruction architecture. The method jointly models polarization directions in space and time and uses a polarization-aware implicit neural representation for continuous, high-fidelity upsampling. By analyzing temporal variations in polarization parameters, we further introduce a flow-guided polarization variation loss to supervise polarization dynamics. We also establish the first large-scale color DoFP polarization video benchmark to support this research direction. Extensive experiments on this benchmark demonstrate the effectiveness of the method.
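
As background for the DoLP/AoP quantities mentioned above, the snippet below recovers them from four linear polarization channels via the standard Stokes-parameter relations; it is generic DoFP post-processing, not code from the paper.

```python
# Stokes-parameter recovery from four linear polarization channels (0/45/90/135 degrees).
import numpy as np

def dolp_aop(i0, i45, i90, i135, eps=1e-8):
    s0 = 0.5 * (i0 + i45 + i90 + i135)      # total intensity
    s1 = i0 - i90
    s2 = i45 - i135
    dolp = np.sqrt(s1**2 + s2**2) / (s0 + eps)  # Degree of Linear Polarization
    aop = 0.5 * np.arctan2(s2, s1)              # Angle of Polarization (radians)
    return dolp, aop

channels = [np.random.rand(32, 32) for _ in range(4)]
dolp, aop = dolp_aop(*channels)
print(dolp.mean(), aop.min(), aop.max())
```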

[CV-84] Increasing the Efficiency of DETR for Maritime High-Resolution Images ITSC2026

【速读】:该论文旨在解决无人水面艇(Unmanned Surface Vessel, USV)在复杂海况下进行高精度、实时目标检测的难题,尤其针对远距离、小尺寸物体(如浮标到大型船只)检测中因高分辨率图像带来的计算资源消耗大、边缘计算能力受限等问题。其解决方案的关键在于引入基于状态空间模型(State Space Model, SSM)的视觉Mamba(Vision Mamba, ViM)骨干网络,通过将图像分块为序列 tokens 实现线性扩展的长程依赖建模,并设计定制化的特征金字塔网络(Feature Pyramid Network, FPN),结合逐层下采样与SSM模块及token剪枝策略,有效降低背景区域的冗余计算,从而在保持高检测精度的同时显著提升计算效率,优于采用ResNet50骨干的RT-DETR等先进方法。

链接: https://arxiv.org/abs/2605.10269
作者: Tinsae Yehuala,Hao Cheng,Ville Lehtola
机构: University of Twente (特温特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted to IEEE ITSC 2026. Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses. DOI to be added upon publication

点击查看摘要

Abstract:Maritime object detection is critical for the safe navigation of unmanned surface vessels (USVs), requiring accurate recognition of obstacles from small buoys to large vessels. Real-time detection is challenging due to long distances, small object sizes, large-scale variations, edge computing limitations, and the high memory demands of high-resolution imagery. Existing solutions, such as downsampling or image splitting, often reduce accuracy or require additional processing, while memory-efficient models typically handle only limited resolutions. To overcome these limitations, we leverage Vision Mamba (ViM) backbones, which build on State Space Models (SSMs) to capture long-range dependencies while scaling linearly with sequence length. Images are tokenized into sequences for efficient high-resolution processing. For further computational efficiency, we design a tailored Feature Pyramid Network with successive downsampling and SSM layers, as well as token pruning to reduce unnecessary computation on background regions. Compared to state-of-the-art methods like RT-DETR with ResNet50 backbone, our approach achieves a better balance between performance and computational efficiency in maritime object detection.

[CV-85] Efficient Hybrid CNN-GNN Architecture for Monocular Depth Estimation

【速读】:该论文旨在解决单目深度估计(monocular depth estimation)中局部卷积感受野有限、难以建模长程空间关系的问题。传统卷积神经网络(CNN)受限于局部感受野,难以捕捉远距离依赖,而基于Transformer的方法虽能建模全局信息但存在二次计算复杂度问题。为此,作者提出GraphDepth架构,其核心创新在于将图神经网络(Graph Neural Networks, GNNs)高效嵌入ResNet-101 U-Net编码器-解码器框架中,在多个尺度上集成GraphSAGE层(1/32、1/16、1/8分辨率),通过迭代消息传递机制显式建模长程空间关系,实现线性复杂度下的全局上下文传播。关键解决方案包括:批量并行化图构建(支持k-NN与网格邻接)、通道注意力门控跳跃连接以自适应融合特征,以及专用的异方差不确定性头用于置信度感知损失加权优化,从而在保持高精度的同时显著降低计算开销(如25 FPS vs 9 FPS,3.8 GB VRAM vs 8.8 GB)。

链接: https://arxiv.org/abs/2605.10251
作者: Ishan Narayan
机构: IMCS Lab, CSIR-CSIO (印度科学与工业研究委员会-中央科学仪器组织)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present GraphDepth, a monocular depth estimation architecture that synergistically integrates Graph Neural Networks (GNNs) within a convolutional encoder-decoder framework. Our approach embeds efficient GraphSAGE layers at multiple scales of a ResNet-101 U-Net backbone, enabling explicit modeling of long-range spatial relationships that lie beyond the receptive field of local convolutions. Key technical contributions include: (1) batch-parallelized graph construction with configurable k-NN and grid-based adjacency for scalable training; (2) multi-scale GraphSAGE integration at bottleneck and decoder stages (1/32, 1/16, 1/8 resolution) to propagate global context throughout the feature hierarchy; (3) channel-attention gated skip connections that adaptively weight encoder features before fusion; and (4) heteroscedastic uncertainty estimation via a dedicated aleatoric uncertainty head, enabling confidence-aware loss weighting during optimization. Unlike transformer-based hybrids, which suffer from quadratic complexity in sequence length, GraphDepth scales linearly with spatial resolution while achieving comparable global receptive fields through iterative message passing. Experiments on NYU Depth V2, WHU Aerial, ETH3D, and Mid-Air benchmarks demonstrate competitive accuracy within 4.6% of state-of-the-art transformers on indoor scenes with substantially lower computational cost (25 FPS vs 9 FPS, 3.8 GB vs 8.8 GB VRAM). GraphDepth achieves the best reported result on WHU Aerial (RMSE 8.24 m) and exhibits superior zero-shot cross-domain transfer to the Mid-Air synthetic aerial dataset, validating the generalization power of explicit relational reasoning for depth estimation.
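
A toy illustration of the grid-based graph construction and a single GraphSAGE-style mean-aggregation step over flattened feature-map nodes; the 4-neighbour adjacency, layer sizes, and random weights are assumptions made for the sketch, not GraphDepth's actual layers.

```python
# Grid adjacency + one GraphSAGE-style mean-aggregation step over a feature map (toy sizes).
import torch

def grid_edges(h, w):
    """Undirected 4-neighbour adjacency over an h x w grid of nodes."""
    idx = torch.arange(h * w).reshape(h, w)
    right = torch.stack([idx[:, :-1].reshape(-1), idx[:, 1:].reshape(-1)])
    down = torch.stack([idx[:-1, :].reshape(-1), idx[1:, :].reshape(-1)])
    e = torch.cat([right, down], dim=1)
    return torch.cat([e, e.flip(0)], dim=1)  # both directions

def sage_step(x, edges, w_self, w_neigh):
    """h_i = ReLU(W_self x_i + W_neigh * mean_{j in N(i)} x_j)."""
    src, dst = edges
    agg = torch.zeros_like(x).index_add_(0, dst, x[src])
    deg = torch.zeros(x.size(0)).index_add_(0, dst, torch.ones(src.size(0))).clamp(min=1)
    return torch.relu(x @ w_self + (agg / deg[:, None]) @ w_neigh)

h, w, c = 8, 8, 16
feat = torch.randn(h * w, c)   # e.g. a 1/32-resolution bottleneck map, flattened to nodes
edges = grid_edges(h, w)
out = sage_step(feat, edges, torch.randn(c, c) * 0.1, torch.randn(c, c) * 0.1)
print(out.shape)  # torch.Size([64, 16])
```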

[CV-86] AdaptSplat: Adapting Vision Foundation Models for Feed-Forward 3D Gaussian Splatting

【速读】:该论文旨在解决现有3D高斯溅射(3D Gaussian Splatting, 3DGS)方法在跨域泛化能力和高频几何保真度方面的不足,这些问题主要源于训练数据规模受限以及深度网络带来的低通滤波效应导致的高频信息衰减。解决方案的关键在于提出了一种轻量级的频率保持适配器(Frequency-Preserving Adapter, FPA),其仅包含1.5M参数,通过从强大视觉基础模型骨干网络的浅层特征中提取方向感知的高频结构先验,并借助高频位置编码和自适应残差调制机制,将其无缝集成到通用3DGS流水线中,从而有效补偿深层特征中的过平滑问题,显著提升高斯原语对复杂表面和锐利边界的拟合精度。

链接: https://arxiv.org/abs/2605.10239
作者: Mingwei Xing,Xinliang Wang,Yifeng Shi
机构: Ke Holdings Inc. (贝壳控股)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This work explores a simple yet powerful lightweight adapter design for feed-forward 3D Gaussian Splatting (3DGS). Existing methods typically apply complex, architecture-specific designs on top of the generic pipeline of image feature extraction → multi-view interaction → feature decoding. However, constrained by the scale bottleneck of 3D training data and the low-pass filtering effect of deep networks, these methods still fall short in cross-domain generalization and high-frequency geometric fidelity. To address these problems, we propose AdaptSplat, which demonstrates that without complex component engineering, introducing a single adapter of only 1.5M parameters into the generic architecture is sufficient to achieve superior performance. Specifically, we design a lightweight Frequency-Preserving Adapter (FPA) that extracts direction-aware high-frequency structural priors from the shallow features of a powerful vision foundation model backbone, and seamlessly integrates them into the generic pipeline via high-frequency positional encodings and adaptive residual modulation. This effectively compensates for the high-frequency attenuation caused by over-smoothing in deep features, improving the fitting accuracy of Gaussian primitives on complex surfaces and sharp boundaries. Extensive experiments demonstrate that AdaptSplat achieves state-of-the-art feed-forward reconstruction performance on multiple standard benchmarks, with stable generalization across domains. Code available at: this https URL.

[CV-87] VPD-100K: Towards Generalizable and Fine-grained Visual Privacy Protection ICML2026

【速读】:该论文旨在解决当前隐私检测模型在真实复杂场景中性能受限的问题,主要原因是现有数据集存在规模小、标注粒度粗和领域覆盖窄等缺陷,难以支撑高效且鲁棒的隐私检测算法开发。其解决方案的关键在于构建一个大规模、细粒度的视觉隐私数据集(Visual Privacy Dataset, VPD-100K),包含100,000张图像、33个细粒度类别及超过190,000个对象实例,并建立涵盖人类存在、屏幕内个人身份信息(PII)、物理标识符和位置指示器四大类别的完整分类体系;同时提出一种基于频域注意力融合与自适应谱门控机制的轻量级增强模块,突破传统空间像素强度限制,更有效地捕捉敏感信息的细微特征,从而显著提升模型在直播等无约束场景下的隐私检测能力。

链接: https://arxiv.org/abs/2605.10229
作者: Xiaobin Hu,Enpu Zuo,Lanping Hu,Kaiwen Yang,Dianshu Liao,Tianyi Zhang,Bo Yin,Yinsi Zhou,Shidong Pan,Xiaoyu Sun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

点击查看摘要

Abstract:Privacy protection has become a critical requirement in the era of ubiquitous visual data sharing, imposing higher demands on efficient and robust privacy detection algorithms. However, current robust detection models are severely hindered by the lack of comprehensive datasets. Existing privacy-oriented datasets often suffer from limited scale, coarse-grained annotations, and narrow domain coverage, failing to capture the intricate details of sensitive information in real-world environments. To bridge this gap, we present a large-scale, fine-grained Visual Privacy Dataset (VPD-100K), designed to facilitate generalized privacy detection. We establish a holistic taxonomy comprising four primary domains: Human Presence, On-Screen Personally Identifiable Information (PII), Physical Identifiers, and Location Indicators, containing 100,000 images annotated with 33 fine-grained classes and over 190,000 object instances. Statistical analysis reveals that our dataset features long-tailed distributions, small object scales, and high visual complexity. These characteristics make the dataset particularly valuable for demanding, unconstrained applications such as live streaming, where actors frequently face unintentional, real-time information leakage. Furthermore, we design an effective frequency-enhanced lightweight module consisting of a frequency-domain attention fusion and an adaptive spectral gating mechanism that breaks the limitations of spatial pixel intensity to better capture the subtle details of sensitive information. Extensive experiments conducted on both diverse image and streaming video benchmarks consistently demonstrate the effectiveness of our VPD-100K dataset and the well-curated frequency mechanism. The code and dataset are available at this https URL.
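
A hedged sketch of what an "adaptive spectral gating" block could look like: features are moved to the frequency domain with an FFT, re-weighted by a learnable complex gate, and transformed back with a residual connection. The layer layout is an assumption, not the paper's actual module.

```python
# Frequency-domain gating block (assumed layout, illustrative only).
import torch
import torch.nn as nn

class SpectralGate(nn.Module):
    def __init__(self, channels, h, w):
        super().__init__()
        # learnable complex-valued gate over the half-spectrum produced by rfft2
        self.gate = nn.Parameter(torch.randn(channels, h, w // 2 + 1, 2) * 0.02)

    def forward(self, x):                      # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")
        gate = torch.view_as_complex(self.gate)
        spec = spec * gate                     # re-weight frequency components
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho") + x  # residual

x = torch.randn(2, 8, 32, 32)
print(SpectralGate(8, 32, 32)(x).shape)
```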

[CV-88] Nano-U: Efficient Terrain Segmentation for Tiny Robot Navigation

【速读】:该论文旨在解决自主移动机器人在非结构化户外环境中进行可靠地形分割(terrain segmentation)的问题,尤其针对微控制器(microcontroller)上因内存和计算资源受限而导致的先进模型难以部署的挑战。解决方案的关键在于提出了一种名为Nano-U的极简二值分割网络,其参数量仅数千级别,并通过量化感知蒸馏(Quantization-Aware Distillation, QAD)方法进行训练,融合知识蒸馏与量化感知训练以补偿模型容量不足;同时,借助基于编译器的推理引擎MicroFlow(用Rust实现)将量化后的模型部署在ESP32-S3微控制器上,避免解释器开销和动态内存分配,从而实现低内存占用和低延迟的高效执行,为低成本机器人平台上的感知任务提供了可行且节能的方案。

链接: https://arxiv.org/abs/2605.10210
作者: Federico Pizzolato,Francesco Pasti,Nicola Bellotto
机构: University of Padua (帕多瓦大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Code repository: this https URL

点击查看摘要

Abstract:Terrain segmentation is a fundamental capability for autonomous mobile robots operating in unstructured outdoor environments. However, state-of-the-art models are incompatible with the memory and compute constraints typical of microcontrollers, limiting scalable deployment in small robotics platforms. To address this gap, we develop a complete framework for robust binary terrain segmentation on a low-cost microcontroller. At the core of our approach we design Nano-U, a highly compact binary segmentation network with a few thousand parameters. To compensate for the network’s minimal capacity, we train Nano-U via Quantization-Aware Distillation (QAD), combining knowledge distillation and quantization-aware training. This allows the final quantized model to achieve excellent results on the Botanic Garden dataset and to perform very well on TinyAgri, a custom agricultural field dataset with more challenging scenes. We deploy the quantized Nano-U on a commodity microcontroller by extending MicroFlow, a compiler-based inference engine for TinyML implemented in Rust. By eliminating interpreter overhead and dynamic memory allocation, the quantized model executes on an ESP32-S3 with a minimal memory footprint and low latency. This compiler-based execution demonstrates a viable and energy-efficient solution for perception on low-cost robotic platforms.
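
A minimal sketch of a quantization-aware distillation objective for binary segmentation, combining a hard BCE term against labels with a soft term matching a frozen full-precision teacher; the loss weighting and temperature are assumptions, and the actual QAD recipe in the paper may differ.

```python
# Distillation + task loss for a quantization-aware student (assumed weighting/temperature).
import torch
import torch.nn.functional as F

def qad_loss(student_logits, teacher_logits, target, alpha=0.5, temperature=2.0):
    """Binary segmentation: BCE against labels + soft matching toward the frozen teacher."""
    hard = F.binary_cross_entropy_with_logits(student_logits, target)
    t_prob = torch.sigmoid(teacher_logits.detach() / temperature)   # teacher is frozen
    s_prob = torch.sigmoid(student_logits / temperature)
    soft = F.binary_cross_entropy(s_prob, t_prob)
    return (1 - alpha) * hard + alpha * soft

s = torch.randn(2, 1, 64, 64)                   # quantization-aware student output
t = torch.randn(2, 1, 64, 64)                   # full-precision teacher output
y = (torch.rand(2, 1, 64, 64) > 0.5).float()    # binary terrain mask
print(qad_loss(s, t, y).item())
```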

[CV-89] 3DReflecNet: A Large-Scale Dataset for 3D Reconstruction of Reflective Transparent and Low-Texture Objects CVPR2026

【速读】:该论文旨在解决具有高反射、透明或低纹理特性的物体在三维重建中的难题,这类材料常违反多视角重建流程中的光度一致性假设及几何纹理特征可用性前提。现有数据集主要聚焦于漫反射且纹理丰富的物体,难以评估模型在真实复杂材质下的性能表现。其解决方案的关键在于构建一个大规模混合数据集——3DReflecNet,该数据集包含超过120,000个基于物理渲染的合成实例和超过1,000个使用消费级设备采集的真实物体,共计超700万张多视角图像,涵盖多样材质、复杂光照条件与广泛几何形态(包括由LLM生成的2D图像通过扩散模型合成的形状),并设计了五个核心任务基准(图像匹配、运动恢复结构、新视角合成、反射去除与重光照)以支持鲁棒评估,从而推动面向挑战性材质的3D视觉方法发展。

链接: https://arxiv.org/abs/2605.10204
作者: Zhicheng Liang,Haoyi Yu,Boyan Li,Dayou Zhang,Zijian Cao,Tianyi Gong,Junhua Liu,Shuguang Cui,Fangxin Wang
机构: The Chinese University of Hong Kong, Shenzhen; Capital Normal University; University of Southern California
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted by CVPR 2026 Oral

点击查看摘要

Abstract:Accurate 3D reconstruction of objects with reflective, transparent, or low-texture surfaces remains notoriously challenging. Such materials often violate key assumptions in multi-view reconstruction pipelines, such as photometric consistency and the availability of distinct geometric texture cues. Existing datasets primarily focus on diffuse, textured objects, and therefore provide limited insight into performance under real-world material complexities. We introduce 3DReflecNet, a large-scale hybrid dataset exceeding 22 TB that is specifically designed to benchmark and advance 3D vision methods for these challenging materials. 3DReflecNet combines two types of data: over 120,000 synthetic instances generated via physically-based rendering of more than 12,000 shapes, and over 1,000 real-world objects captured using consumer devices. Together, these data consist of more than 7 million multi-view frames. The dataset spans diverse materials, complex lighting conditions, and a wide range of geometric forms, including shapes generated from both real and LLM-synthesized 2D images using diffusion-based pipelines. To support robust evaluation, we design benchmarks for five core tasks: image matching, structure-from-motion, novel view synthesis, reflection removal, and relighting. Extensive experiments demonstrate that state-of-the-art methods struggle to maintain accuracy across these settings, highlighting the need for more resilient 3D vision models.

[CV-90] DetRefiner: Model-Agnostic Detection Refinement with Feature Fusion Transformer CVPR2026

【速读】:该论文旨在解决开放词汇目标检测(Open-vocabulary object detection, OVOD)中模型对未见类别泛化能力不足的问题,其核心挑战在于现有方法难以有效整合全局与局部上下文线索以提升检测性能。解决方案的关键在于提出一个轻量级、可即插即用的框架DetRefiner,该框架通过一个轻量Transformer编码器融合来自基础模型(如DINOv3)的全局图像特征和局部patch级特征,生成反映图像整体属性的类向量与表示局部区域特性的patch向量,并据此推断属性可靠性以重新校准基础检测器的置信度。DetRefiner在训练时无需访问基础OVOD模型的内部特征或进行重训练,仅需基于基础检测器的预测结果,在推理阶段输出辅助校准分数并与原置信度融合,从而显著提升对未见类别的检测性能(在多个数据集上未见类别AP提升最高达+10.1)。这一机制表明,学习融合全局与局部表征是推动开放世界目标检测的有效通用策略。

链接: https://arxiv.org/abs/2605.10190
作者: Soichiro Okazaki,Tatsuya Sasaki,Hiroki Ohashi
机构: Hitachi, Ltd. Research and Development Group (日立有限公司研发部)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Findings

点击查看摘要

Abstract:Open-vocabulary object detection (OVOD) aims to detect both seen and unseen categories, yet existing methods often struggle to generalize to novel objects due to limited integration of global and local contextual cues. We propose DetRefiner, a simple yet effective plug-and-play framework that learns to fuse global and local features to refine open-vocabulary detection. DetRefiner processes global image features and patch-level image features from foundational models (e.g., DINOv3) through a lightweight Transformer encoder. The encoder produces a class vector capturing image-level attributes and patch vectors representing local region attributes, from which attribute reliability is inferred to recalibrate the base model’s confidence. Notably, DetRefiner is trained independently of the base OVOD model, requiring neither access to its internal features nor retraining. At inference, it operates solely on the base detector’s predictions, producing auxiliary calibration scores that are merged with the base detector’s scores to yield the final refined confidence. Despite this simplicity, DetRefiner consistently enhances multiple OVOD models across COCO, LVIS, ODinW13, and Pascal VOC, achieving gains of up to +10.1 AP on novel categories. These results highlight that learning to fuse global and local representations offers a powerful and general mechanism for advancing open-world object detection. Our codes and models are available at this https URL.

[CV-91] SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation

【速读】:该论文旨在解决现有多模态大语言模型(Multimodal Large Language Models, MLLMs)基准测试在评估科学推理能力时存在的不足,即缺乏对复杂、可追溯的推理过程的有效刻画。其解决方案的关键在于构建了一个覆盖数学、物理、化学、地理、天文和生物等54个子领域的多模态基准数据集SciVQR,该数据集包含领域特定的视觉内容(如公式、图表和示意图),并要求模型结合视觉理解进行多步推理,任务难度从基础事实回忆到复杂推理层层递进,其中46%的任务配有专家撰写的解题步骤。SciVQR不仅评估最终答案的正确性,还深入分析模型的推理路径,从而更全面地揭示模型在处理跨学科科学问题时的能力与局限,为推动MLLM向真正具备科学智能的方向发展提供量化依据。

链接: https://arxiv.org/abs/2605.10187
作者: Longteng Guo(1 and 2),Xuanxu Lin(1 and 2),Dongze Hao(3),Tongtian Yue(1 and 2),Pengkang Huo(1 and 2),Jiatong Ma(1 and 2),Yuchen Liu(1 and 2),Jing Liu(1 and 2) ((1) Institute of Automation, Chinese Academy of Sciences (CASIA), (2) School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), (3) OPPO AI Center)
机构: Institute of Automation, Chinese Academy of Sciences (CASIA); School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS); OPPO AI Center
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Scientific reasoning is a key aspect of human intelligence, requiring the integration of multimodal inputs, domain expertise, and multi-step inference across various subjects. Existing benchmarks for multimodal large language models (MLLMs) often fail to capture the complexity and traceability of reasoning processes necessary for rigorous evaluation. To fill this gap, we introduce SciVQR, a multimodal benchmark covering 54 subfields in mathematics, physics, chemistry, geography, astronomy, and biology. SciVQR includes domain-specific visuals, such as equations, charts, and diagrams, and challenges models to combine visual comprehension with reasoning. The tasks range from basic factual recall to complex, multi-step inferences, with 46% including expert-authored solutions. SciVQR not only evaluates final answers but also examines the reasoning process, providing insights into how models reach their conclusions. Our evaluation of leading MLLMs, including both proprietary and open-source models, reveals significant limitations in handling complex multimodal reasoning tasks, underscoring the need for improved multi-step reasoning and better integration of interdisciplinary knowledge in advancing MLLMs toward true scientific intelligence. The dataset and evaluation code are publicly available at this https URL.

[CV-92] DynGhost: Temporally-Modelled Transformer for Dynamic Ghost Imaging with Quantum Detectors

【速读】:该论文旨在解决动态鬼成像(ghost imaging)中两个关键问题:一是现有深度学习架构未能利用帧间的时间相干性,导致动态场景重建效果不佳;二是模型假设加性高斯噪声,无法反映真实单光子探测硬件(如SNSPD、SPAD、SiPM)中的泊松统计特性。解决方案的关键在于提出DynGhost(Dynamic Ghost Imaging Transformer),其通过交替的空间和时间注意力模块建模时空关联,并结合基于物理准确探测器模拟的量子感知训练框架,以及Anscombe方差稳定化归一化方法,有效缓解了分布偏移问题,从而在动态和低光子条件下显著优于传统方法与现有深度学习架构。

链接: https://arxiv.org/abs/2605.10185
作者: Vittorio Palladino,Ahmet Enis Cetin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 8 figures

点击查看摘要

Abstract:Ghost imaging reconstructs spatial information from a single-pixel bucket detector by correlating structured illumination patterns with scalar intensity measurements. While deep learning approaches have achieved promising results on static scenes, two critical limitations remain unaddressed: existing architectures fail to exploit temporal coherence across frames, leaving dynamic ghost imaging largely unsolved, and they assume additive Gaussian noise models that do not reflect the true Poissonian statistics of real single-photon hardware. We present DynGhost (Dynamic Ghost Imaging Transformer), a transformer architecture that addresses both limitations through alternating spatial and temporal attention blocks. Our quantum-aware training framework, based on physically accurate detector simulations (SNSPDs, SPADs, SiPMs) and Anscombe variance-stabilizing normalization, resolves the distribution shift that causes classical models to fail under realistic hardware constraints. Experiments across multiple benchmarks demonstrate that DynGhost outperforms both traditional reconstruction methods and existing deep learning architectures, with particular gains in dynamic and photon-starved settings.
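
The Anscombe variance-stabilizing transform referenced above is a textbook formula; the snippet below shows it (and its simple algebraic inverse) applied to simulated Poisson photon counts, purely to make the normalization step concrete.

```python
# Anscombe transform for Poisson counts: 2 * sqrt(x + 3/8).
import numpy as np

def anscombe(counts):
    """Maps Poisson-distributed counts to approximately unit-variance Gaussians."""
    return 2.0 * np.sqrt(np.asarray(counts, dtype=np.float64) + 3.0 / 8.0)

def inverse_anscombe(y):
    """Simple algebraic inverse (the unbiased inverse adds small correction terms)."""
    return (y / 2.0) ** 2 - 3.0 / 8.0

photons = np.random.poisson(lam=5.0, size=10000)
stabilised = anscombe(photons)
print(stabilised.var())  # close to 1 for moderate photon counts
```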

[CV-93] Developing a foundation model for high-resolution remote sensing data of the Netherlands

【速读】:该论文旨在解决遥感图像中特征表示学习受限于数据规模与标注稀缺的问题,尤其是在小样本场景下模型泛化能力不足的挑战。其解决方案的关键在于构建一个融合卷积神经网络(Convolutional Neural Network, CNN)与视觉Transformer(Vision Transformer, ViT)的混合架构基础模型,并利用时间序列卫星影像作为输入,从而同时捕捉低频(如地形结构、地表覆盖分布)和高频(如纹理、边缘、小目标)空间特征,以及通过时间维度建模拓扑特征、土地覆盖变化和季节动态等时序依赖关系。这种时空联合建模机制显著降低了特征歧义性,提升了表示学习质量,使得模型在仅使用荷兰区域有限数据预训练的情况下,仍能在全球基准数据集上取得与主流大模型相当的性能,且参数量远低于现有先进模型。

链接: https://arxiv.org/abs/2605.10184
作者: Paul Vermeeren,Heysem Kaya
机构: Utrecht Univ.(乌得勒支大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures, under review in a journal

点击查看摘要

Abstract:We develop a foundation model using 1.2m high resolution satellite images of the Netherlands. By combining a Convolutional Neural Network and a Vision Transformer, the model captures both low- and high-frequency landscape features, such as fine textures, edges, and small objects as well as large terrain structures, elevation patterns, and land-cover distributions. Leveraging temporal data as input, the model learns from broader contextual information across time, allowing the model to exploit the temporal dependencies, such as topographic features, land-cover changes, and seasonal dynamics. These additional constraints reduce feature ambiguity, improve representation learning, and enable better generalization with fewer labeled samples. The foundation model is evaluated on multiple downstream tasks, ranging from use cases within the Netherlands to global benchmarking datasets. On the vegetation monitoring dataset of the Netherlands, the model shows clear performance improvements by incorporating temporal information instead of relying on a single time point. Despite using a smaller model and less pretraining data limited to the Netherlands, it achieves competitive results on global benchmarks when compared to state-of-the-art models. These results demonstrate that the model can learn rich, generalizable representations from limited data, achieving competitive performance on global benchmarks while using a fraction of the parameters of larger state-of-the-art remote sensing models. To maximize reproducibility and reuse, we made the scripts and the model accessible on GitHub.

[CV-94] A Comparative Study of Machine Learning and Deep Learning for Out-of-Distribution Detection

【速读】:该论文旨在解决医学影像中分布外(Out-of-distribution, OOD)检测的可靠性问题,即确保AI模型在面对非标准或无效输入时仍能可靠识别并拒绝输出。其关键解决方案在于通过实证比较传统机器学习(Machine Learning, ML)与深度学习(Deep Learning, DL)方法在多分辨率、跨数据集(共6万余张眼底和非眼底图像)上的OOD检测性能,发现两者在AUROC(0.999–1.000)和准确率上表现相当,但ML方法具有显著更低的端到端延迟,体现出更高的计算效率。这一结果表明,在视觉复杂度有限的任务场景下,轻量级ML方法可实现与DL相当的检测性能,同时大幅降低计算成本,从而更适用于实际部署。

链接: https://arxiv.org/abs/2605.10181
作者: Jihyeon Baek,Seunghoon Lee,Gitaek Kwon,Doohyun Park
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to IEEE ISBI 2026. The final published version will appear in IEEE Xplore

点击查看摘要

Abstract:Out-of-distribution (OOD) detection is essential for building reliable AI systems, as models that produce outputs for invalid inputs cannot be trusted. Although deep learning (DL) is often assumed to outperform traditional machine learning (ML), medical imaging data are typically acquired under standardized protocols, leading to relatively constrained image variability in OOD detection tasks. This motivates a direct comparison between ML and DL approaches in this setting. The two approaches are evaluated on open datasets comprising over 60,000 fundus and non-fundus images across multiple resolutions. Both approaches achieved an AUROC of 1.000 and accuracies between 0.999 and 1.000 on internal and external validation sets, showing comparable detection performance. The ML approach, however, exhibited substantially lower end-to-end latency while maintaining equivalent accuracy, indicating greater computational efficiency. These results suggest that for OOD detection tasks of limited visual complexity, lightweight ML approaches can achieve DL-level performance with significantly reduced computational cost, supporting practical real-world deployment.

[CV-95] What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers

【速读】:该论文旨在解决扩散 Transformer(Diffusion-Transformer, DiT)架构下文本到图像(Text-to-Image, T2I)模型生成风险内容(如色情、暴力、受版权保护图像等)的防护难题。现有方法主要针对早期 U-Net 架构设计,难以适配 DiT 中通过联合注意力机制耦合语义注入与视觉合成的新范式,导致风险内容难以被有效隔离和消除。解决方案的关键在于发现并利用注意力头(attention head)对特定语义概念的敏感性差异:提出一种无需训练的推理阶段防护机制 AHV-D&S,其核心是构建每个文本标记的注意力头向量(Attention Head Vector, AHV),作为识别风险生成倾向的判别性签名;并通过基于动量的动态跟踪策略与基于敏感度引导的自适应抑制策略,在去噪过程中实时识别并抑制高风险标记对应的注意力权重,从而在不显著损害图像质量的前提下实现对多种有害内容的有效遏制,并展现出对对抗提示和跨模型迁移的鲁棒性。

链接: https://arxiv.org/abs/2605.10180
作者: Chenyu Zhang,Lanjun Wang,Yueyang Cheng,Ruidong Chen,Wenhui Li,An-an Liu
机构: Tianjin University (天津大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:The rise of text-to-image (T2I) models has increasingly raised concerns regarding the generation of risky content, such as sexual, violent, and copyright-protected images, highlighting the need for effective safeguards within the models themselves. Although existing methods have been proposed to eliminate risky concepts from T2I models, they are primarily developed for earlier U-Net architectures, leaving the state-of-the-art Diffusion-Transformer-based T2I models inadequately protected. This gap stems from a fundamental architectural shift: Diffusion Transformers (DiTs) entangle semantic injection and visual synthesis via joint attention, which makes it difficult to isolate and erase risky content within the generation. To bridge this gap, we investigate how semantic concepts are represented in DiTs and discover that attention heads exhibit concept-specific sensitivity. This property enables both the detection and suppression of risky content. Building on this discovery, we propose AHV-D&S, a training-free inference-time safeguard for image generation in DiTs. Specifically, AHV-D&S quantifies each textual token’s sensitivity across all attention heads as an Attention Head Vector (AHV), which serves as a discriminative signature for detecting risky generation tendencies. In the inference stage, we propose a momentum-based strategy to dynamically track token-wise AHVs across denoising steps, and a sensitivity-guided adaptive suppression strategy that suppresses the attention weights of identified risky tokens based on head-specific risk scores. Extensive experiments demonstrate that AHV-D&S effectively suppresses sexual, copyrighted-style, and various harmful content while preserving visual quality, and further exhibits strong robustness against adversarial prompts and transferability across different DiT-based T2I models.
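
A hedged sketch of the attention-head-vector idea as described: average each head's attention toward one text token to form a signature, compare it against a stored risky signature, and down-weight that token's attention if it matches. The tensor shapes, cosine matching, and threshold are assumptions for illustration, not the paper's implementation.

```python
# Attention-head-vector style detection and suppression (toy shapes, assumed threshold).
import torch

def attention_head_vector(attn, token_idx):
    """attn: (heads, queries, keys). Mean attention each head pays to one text token."""
    return attn[:, :, token_idx].mean(dim=1)          # -> (heads,)

def suppress_token(attn, token_idx, scale=0.1):
    """Scale down every head's attention toward a flagged token, then renormalise."""
    attn = attn.clone()
    attn[:, :, token_idx] *= scale
    return attn / attn.sum(dim=-1, keepdim=True)

heads, q, k = 12, 16, 16
attn = torch.softmax(torch.randn(heads, q, k), dim=-1)
risky_signature = torch.randn(heads)                  # stored/estimated offline in practice

ahv = attention_head_vector(attn, token_idx=3)
score = torch.cosine_similarity(ahv, risky_signature, dim=0)
if score > 0.5:                                       # assumed detection threshold
    attn = suppress_token(attn, token_idx=3)
print(float(score), attn.shape)
```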

[CV-96] MTA-RL: Robust Urban Driving via Multi-modal Transformer-based 3D Affordances and Reinforcement Learning

【速读】:该论文旨在解决城市环境下自动驾驶中感知与控制耦合不足、模型可解释性差以及模块化系统易受误差传播影响的问题。其解决方案的关键在于提出MTA-RL框架,通过基于多模态Transformer的3D affordance(可达性)表示来桥接感知与控制:利用RGB图像与LiDAR点云融合生成几何感知的结构化语义表征,作为强化学习(Reinforcement Learning, RL)策略的紧凑观测空间,从而提升样本效率和决策稳定性;实验表明,该方法在不同交通密度下均优于现有基准,并展现出卓越的零样本泛化能力。

链接: https://arxiv.org/abs/2605.10177
作者: Guangli Chen,Dianzhao Li,Wenjian Zhong,Bangquan Xie,Ostap Okhrin
机构: Dongguan Key Laboratory of Intelligent Equipment and Smart Industry, School of Advanced Engineering, Great Bay University, Dongguan, China; Chair of Applied Statistics, Technische Universität Dresden, Dresden, Germany; Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI) Dresden/Leipzig, Dresden, Germany; College of Automation, Guangdong University of Technology, Guangzhou, China
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Robust urban autonomous driving requires reliable 3D scene understanding and stable decision-making under dense interactions. However, existing end-to-end models lack interpretability, while modular pipelines suffer from error propagation across brittle interfaces. This paper proposes MTA-RL, the first framework that bridges perception and control through Multi-modal Transformer-based 3D Affordances and Reinforcement Learning (RL). Unlike previous fusion models that directly regress actions, RGB images and LiDAR point clouds are fused using a transformer architecture to predict explicit, geometry-aware affordance representations. These structured representations serve as a compact observation space, enabling the RL policy to operate purely on predicted driving semantics, which significantly improves sample efficiency and stability. Extensive evaluations in CARLA Town01-03 across varying densities (20-60 background vehicles) show that MTA-RL consistently outperforms state-of-the-art baselines. Trained solely on Town03, our method demonstrates superior zero-shot generalization in unseen towns, achieving up to a 9.0% increase in Route Completion, an 11.0% increase in Total Distance, and an 83.7% improvement in Distance Per Violation. Furthermore, ablation studies confirm that our multi-modal fusion and reward shaping are critical, significantly outperforming image-only and unshaped variants, demonstrating the effectiveness of MTA-RL for robust urban autonomous driving.

[CV-97] BathyFacto: Refraction-Aware Two-Media Neural Radiance Fields for Bathymetry

【速读】:该论文旨在解决基于无人机影像的水下测深(shallow-water bathymetry)中因空气-水界面折射导致的结构光恢复(Structure-from-Motion, SfM)系统性深度偏差问题。传统方法假设光线在均匀介质中直线传播,而实际中光线在穿过空气与水界面时发生折射,违反了这一假设,从而引入显著误差。解决方案的关键在于提出BathyFacto,一种面向两介质(air-water)的神经辐射场(NeRF)扩展模型,其核心创新包括:1)采用共享哈希网格密度场与介质条件色彩头(medium-conditioned color head),通过一个一比特介质标志位区分空气或水;2)将每条相机射线分为两段——空气中直线传播至平面水表面,水中则依据斯涅尔定律(Snell’s law)计算折射路径;3)设计单一提案网络采样器处理跨介质虚拟直线射线,并引入折点密度包装器(kinked density wrapper)在密度评估前自动校正水段位置,实现高效且精确的样本分配。实验表明,BathyFacto相较Nerfacto基线和传统多视图立体匹配(MVS)方法,在已知真实场景下实现了更高的点云精度(Cloud-to-Mesh均距0.06 m vs. 0.52 m)和完整性(87% vs. 29%)。

链接: https://arxiv.org/abs/2605.10174
作者: Markus Brezovsky,Anatol Günthner,Frederik Schulte,Lukas Winiwarter,Boris Jutzi,Gottfried Mandlburger
机构: TU Wien (维也纳工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 8 figures, 3 tables. Submitted to ISPRS Open Journal of Photogrammetry and Remote Sensing, Special Issue “3D Underwater Mapping from Above and Below”

点击查看摘要

Abstract:Through-water photogrammetry based on UAV imagery enables shallow-water bathymetry, but refraction at the air-water interface violates the straight-ray assumption of Structure-from-Motion and causes systematic depth bias. We present BathyFacto, a refraction-aware two-media extension of Nerfacto integrated into Nerfstudio that targets metrically precise underwater point clouds. BathyFacto uses a shared hash-grid-based density field with a medium-conditioned color head that receives a one-bit medium flag (air or water) and traces each camera ray as two segments: a straight segment in air up to a planar water surface and a refracted segment in water computed via Snell’s law with known refractive indices. To allocate samples efficiently across the air-water boundary, we employ a single proposal-network sampler that operates on a virtual straight ray spanning both media, combined with a kinked density wrapper that transparently corrects water-segment positions along the refracted direction before density evaluation. A data adaptation pipeline converts photogrammetric reconstructions to a Nerfstudio-compatible format, estimates the water plane from boundary markers, and provides per-pixel medium masks to gate refraction. We also extend the point cloud export with refraction-corrected backprojection and reversible coordinate transforms to world and global frames. On a simulated two-media scene with known ground truth, BathyFacto with refraction achieves a Cloud-to-Mesh mean distance of 0.06 m and 87% completeness, compared to 0.52 m / 29% for the Nerfacto baseline and 0.36 m / 21% for conventional MVS without refraction correction.
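
The refracted water segment follows the standard vector form of Snell's law at a planar interface; the snippet below shows that geometry for an air-to-water ray and is generic background, not the paper's implementation.

```python
# Vector Snell's law at a flat water surface with upward normal (0, 0, 1).
import numpy as np

def refract(d, n, n1=1.0, n2=1.333):
    """Refract unit direction d through a surface with unit normal n (air -> water)."""
    d, n = d / np.linalg.norm(d), n / np.linalg.norm(n)
    cos_i = -np.dot(n, d)
    ratio = n1 / n2
    sin_t2 = ratio**2 * (1.0 - cos_i**2)
    if sin_t2 > 1.0:
        return None  # total internal reflection (cannot happen going from air into water)
    return ratio * d + (ratio * cos_i - np.sqrt(1.0 - sin_t2)) * n

ray = np.array([0.3, 0.0, -1.0])     # camera ray heading downward
normal = np.array([0.0, 0.0, 1.0])   # planar water surface normal
print(refract(ray, normal))          # bends toward the normal underwater
```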

[CV-98] Task-Agnostic Noisy Label Detection via Standardized Loss Aggregation

【速读】:该论文旨在解决大规模医学影像数据集中因观察者间差异和病例模糊性导致的标签噪声问题,这类噪声会显著影响模型训练的稳定性和泛化性能。其解决方案的关键在于提出了一种统计学基础且任务无关的样本级噪声检测框架——标准化损失聚合(Standardized Loss Aggregation, SLA),该方法通过在重复交叉验证中对分层验证损失进行标准化聚合,将离散的硬计数策略推广为连续的估计器,从而同时捕捉性能偏差的频率与幅度,生成可解释且统计稳定的噪声评分,有效识别潜在误标或模糊样本,提升数据集可靠性并指导高效重标注。

链接: https://arxiv.org/abs/2605.10165
作者: Inhyuk Park,Doohyun Park
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to IEEE ISBI 2026. The final published version will appear in IEEE Xplore

点击查看摘要

Abstract:Noisy labels are common in large-scale medical imaging datasets due to inter-observer variability and ambiguous cases. We propose a statistically grounded and task-agnostic framework, Standardized Loss Aggregation (SLA), for detecting noisy labels at the sample level. SLA quantifies label reliability by aggregating standardized fold-level validation losses across repeated cross-validation runs. This formulation generalizes discrete hard-counting schemes into a continuous estimator that captures both the frequency and magnitude of performance deviations, yielding interpretable and statistically stable noisiness scores. Experiments on a public fundus dataset demonstrate that SLA consistently outperforms the hard-counting baseline across all noise levels and converges substantially faster, especially under low noise ratios where subtle loss variations are informative. Samples with high SLA scores indicate potentially ambiguous or mislabeled cases, guiding efficient re-annotation and improving dataset reliability for any classification task.
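
A toy version of the standardized-loss-aggregation idea: z-score each sample's validation loss within a run, then average the z-scores across repeated runs so that consistently high-loss samples surface as likely label noise. The fold bookkeeping is simplified relative to the paper, and the data is synthetic.

```python
# Standardized loss aggregation across repeated runs (simplified stand-in for fold-level SLA).
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_repeats = 100, 5
losses = rng.gamma(shape=2.0, scale=0.5, size=(n_repeats, n_samples))
losses[:, :5] += 2.0                       # pretend the first 5 samples are mislabeled

# Standardize per repeat (acting as a stand-in for per-fold standardization)
z = (losses - losses.mean(axis=1, keepdims=True)) / losses.std(axis=1, keepdims=True)
sla_score = z.mean(axis=0)                 # continuous noisiness score per sample

print("flagged:", np.argsort(-sla_score)[:5])   # the injected noisy samples rank highest
```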

[CV-99] Active-SAOOD: Active Sparsely Annotated Oriented Object Detection in Remote Sensing Images

【速读】:该论文旨在解决遥感图像中定向目标检测(oriented object detection)的标注成本过高问题,尤其针对稀疏标注(sparse annotation)方法在实际应用中因依赖类别相关采样策略及对稀疏样本特性研究不足而导致性能不稳定和泛化能力弱的问题。解决方案的关键在于提出一种基于主动学习的稀疏标注定向目标检测方法(Active-SAOOD),其核心创新是引入模型状态观测模块(model state observation module),在实例层面动态选择最具有价值的稀疏样本,综合考虑方向(orientation)、分类(classification)与定位(localization)不确定性,以及类间与类内多样性(inter- and intra-class diversity)。这一设计使模型能在完全随机初始化的稀疏标注下稳定运行,并显著提升检测性能与鲁棒性,在仅1%标注比例时相比基线方法提升9%性能,极大增强了稀疏标注方法在遥感场景中的实用价值。

链接: https://arxiv.org/abs/2605.10162
作者: Yu Lin,Jianghang Lin,Kai Ye,Shengchuan Zhang,Liujuan Cao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reducing the annotation cost of oriented object detection in remote sensing remains a major challenge. Recently, sparse annotation has gained attention for effectively reducing annotation redundancy in dense remote sensing scenes. However, (1) sparse data's reliance on class-dependent sampling and (2) the lack of in-depth investigation into the characteristics of sparse samples hinder its further development. This paper proposes an active learning-based sparsely annotated oriented object detection (SAOOD) method, termed Active-SAOOD. Based on a model state observation module, Active-SAOOD actively selects the most valuable sparse samples at the instance level that are best suited to the current model state, by jointly considering orientation, classification, and localization uncertainty, as well as inter- and intra-class diversity. This design enables SAOOD to operate stably under completely randomly initialized sparse annotations and extends its applicability to broader real-world scenarios. Experiments on multiple datasets demonstrate that Active-SAOOD significantly improves both the performance and stability of existing SAOOD methods under various random sparse annotations. In particular, with an annotation ratio of only 1%, it achieves a 9% performance gain over the baseline, further enhancing the practical value of SAOOD in remote sensing. The code will be public.

[CV-100] Improving Temporal Action Segmentation via Constraint-Aware Decoding ICPR2026

【速读】:该论文旨在解决时间动作分割(Temporal Action Segmentation, TAS)中因动作变异性、边界模糊性及标注成本高等问题,尤其是在新领域或低资源场景下的性能瓶颈。现有完全监督方法受限于高标注代价,而基于语法的方法虽引入结构先验但依赖复杂解析过程,难以扩展。解决方案的关键在于提出一种轻量级、基于约束的精炼框架,通过从标注数据中直接提取统计结构先验(如转移置信度、动作边界集合和每类持续时间),将其融入改进的维特比解码算法,在推理阶段实现无需重新训练且不增加模型复杂度的预测优化,从而有效修正结构预测错误并提升全监督与半监督TAS模型的性能。

链接: https://arxiv.org/abs/2605.10149
作者: Yeo Keat Ee,Debaditya Roy,Chen Li,Hao Zhang,Basura Fernando
机构: Institute of High-Performance Computing, Agency for Science, Technology and Research, Singapore; Centre for Frontier AI Research, Agency for Science, Technology and Research, Singapore; Indian Institute of Technology Kharagpur, India; College of Computing and Data Science, Nanyang Technological University, Singapore
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted to ICPR 2026

点击查看摘要

Abstract:Temporal action segmentation (TAS) divides untrimmed videos into labeled action segments. While fully supervised methods have advanced the field, challenges such as action variability, ambiguous boundaries, and high annotation costs remain, especially in new or low-resource domains. Grammar-based approaches improve segmentation with structural priors but rely on complex parsing limiting scalability. In this work, we propose a lightweight, constraint-based refinement framework that enhances TAS predictions by integrating statistical structural priors such as transition confidence, action boundary sets, and per-class duration, that can be directly extracted from annotated data. These constraints are integrated into a modified Viterbi decoding algorithm, allowing inference-time refinement without retraining or added model complexity. Our approach improves both fully and semi-supervised TAS models by correcting structural prediction errors while maintaining high efficiency. Code is available at this https URL
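
As a concrete reference point for the decoding step, the sketch below runs a plain Viterbi pass over per-frame class scores with a transition-confidence prior (zero-probability transitions effectively forbidden); the paper's boundary-set and duration constraints are omitted, and the scores are random placeholders.

```python
# Viterbi decoding over frame scores with a transition prior (standard algorithm).
import numpy as np

def viterbi(log_probs, log_trans):
    """log_probs: (T, C) frame scores; log_trans: (C, C) allowed-transition prior."""
    T, C = log_probs.shape
    dp = np.full((T, C), -np.inf)
    back = np.zeros((T, C), dtype=int)
    dp[0] = log_probs[0]
    for t in range(1, T):
        cand = dp[t - 1][:, None] + log_trans          # (prev, cur)
        back[t] = cand.argmax(axis=0)
        dp[t] = cand.max(axis=0) + log_probs[t]
    path = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

T, C = 8, 3
frame_scores = np.log(np.random.dirichlet(np.ones(C), size=T))
trans = np.log(np.array([[0.9, 0.1, 0.0],              # e.g. class 0 never jumps to 2
                         [0.0, 0.9, 0.1],
                         [0.1, 0.0, 0.9]]) + 1e-12)
print(viterbi(frame_scores, trans))
```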

[CV-101] MicroViTv2: Beyond the FLOPS for Edge Energy-Friendly Vision Transformers

【速读】:该论文旨在解决Vision Transformer(ViT)在边缘设备部署时计算成本过高、能效比低的问题。解决方案的关键在于采用硬件感知的设计理念与结构重参数化(reparameterization),具体包括:1)提出重参数化补丁嵌入(Reparameterized Patch Embedding, RepEmbed)和重参数化深度卷积混合器(Reparameterized Depth-Wise convolution mixer, RepDW),以加速推理过程;2)引入单深度转置注意力机制(Single Depth-Wise Transposed Attention, SDTA),在保持较低冗余的同时有效捕获长程依赖关系。这些设计使MicroViTv2在保持快速推理和高能效的前提下,相比前代模型及MobileViT、EdgeNeXt、EfficientViT等主流轻量级模型实现了更高的准确率,验证了超越FLOPs指标的综合效率评估的重要性。

链接: https://arxiv.org/abs/2605.10148
作者: Novendra Setyawan,Chi-Chia Sun,Mao-Hsiu Hsu,Wen-Kai Kuo,Jun-Wei Hsieh
机构: National Formosa University (国立中兴大学); University of Muhammadiyah Malang (穆哈玛迪亚大学); National Taipei University (国立台北大学); National Yang Ming Chiao Tung University (国立阳明交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The Vision Transformer (ViT) achieves remarkable accuracy across visual tasks but remains computationally expensive for edge deployment. This paper presents MicroViTv2, a lightweight Vision Transformer optimized for real-device efficiency. Built upon the original MicroViT, the proposed model follows a reparameterized design, specifically Reparameterized Patch Embedding (RepEmbed) and a Reparameterized Depth-Wise convolution mixer (RepDW) for faster inference, and introduces Single Depth-Wise Transposed Attention (SDTA) to capture long-range dependencies with minimal redundancy. Despite slightly higher FLOPs, MicroViTv2 improves accuracy by up to 0.5% over its predecessor and surpasses MobileViTv2, EdgeNeXt, and EfficientViT while maintaining fast inference and high energy efficiency on Jetson AGX Orin. Experiments on ImageNet-1K and COCO demonstrate that hardware-aware design and structural re-parameterization are key to achieving high accuracy and low energy consumption, validating the need to evaluate efficiency beyond FLOPs. Code is available at this https URL.
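
Structural re-parameterization of the kind named here (RepEmbed/RepDW) generally means folding parallel training-time branches into a single inference-time kernel. The sketch below merges a 3x3 depth-wise conv branch and an identity branch RepVGG-style and checks the outputs match; it is a generic illustration, not the paper's exact block.

```python
# Merging a depth-wise 3x3 branch and an identity branch into one kernel for inference.
import torch
import torch.nn.functional as F

c = 8
w3 = torch.randn(c, 1, 3, 3) * 0.1            # depth-wise 3x3 branch (groups = c)
b3 = torch.randn(c) * 0.1

# Identity branch expressed as a 3x3 depth-wise kernel with 1 at the centre
w_id = torch.zeros(c, 1, 3, 3)
w_id[:, 0, 1, 1] = 1.0

w_merged, b_merged = w3 + w_id, b3             # single equivalent kernel

x = torch.randn(2, c, 16, 16)
train_out = F.conv2d(x, w3, b3, padding=1, groups=c) + x            # two-branch form
deploy_out = F.conv2d(x, w_merged, b_merged, padding=1, groups=c)   # merged form
print(torch.allclose(train_out, deploy_out, atol=1e-6))             # True
```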

[CV-102] Scaling Vision Models Does Not Consistently Improve Localisation-Based Explanation Quality

【速读】:该论文旨在解决大规模人工智能模型是否能提升后处理可解释性(post-hoc explanations)质量的问题。研究发现,尽管模型规模(如深度和参数量)增加通常有助于提高预测准确性,但并未显著改善解释的质量;相反,在多数统计比较中,较小的模型在定位精度上表现相当甚至更优。其关键解决方案在于系统性地评估多个计算机视觉模型(涵盖ResNet、DenseNet与Vision Transformer架构)在不同训练方式(从头训练或预训练)下的解释质量,采用两种局部化指标——Relevance Rank Accuracy 和 Dual-Polarity Precision 来量化解释与真实标注掩码之间的对齐程度,从而揭示模型复杂度与解释能力之间缺乏一致正相关关系。

链接: https://arxiv.org/abs/2605.10142
作者: Mateusz Cedro,Marcin Chlebus
机构: University of Warsaw (华沙大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 28 pages, 8 figures, 8 tables

点击查看摘要

Abstract:Artificial intelligence models are increasingly scaled to improve predictive accuracy, yet it remains unclear whether scale improves the quality of post-hoc explanations. We investigate this relationship by evaluating 11 computer vision models representing increasing levels of depth and complexity within the ResNet, DenseNet, and Vision Transformer families, trained from scratch or pretrained, across three image datasets with ground-truth segmentation masks. For each model, we generate explanations using five post-hoc explainable AI methods and quantify mask alignment using two localisation metrics: Relevance Rank Accuracy (Arras et al., 2022) and the proposed Dual-Polarity Precision, which measures positive attributions inside the class mask and negative attributions outside it. Across datasets and methods, increasing architectural depth and parameter count does not improve explanation quality in most statistical comparisons, and smaller models often match or exceed deeper variants. While pretraining typically improves predictive performance and increases the dependence of explanations on learned weights, it does not consistently increase localisation scores. We also observe scenarios in which models achieve strong predictive performance while localisation precision is near zero, suggesting that performance metrics alone may not indicate whether predictions are based on the annotated regions. These results indicate that larger models do not reliably provide higher-quality explanations, and that explainability should therefore be assessed explicitly during model selection for safety-sensitive deployments.
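
For concreteness, the snippet below computes Relevance Rank Accuracy as commonly defined (the fraction of the K highest-attribution pixels that fall inside the ground-truth mask, with K equal to the mask size), using a random attribution map purely as a placeholder.

```python
# Relevance Rank Accuracy over a toy attribution map and segmentation mask.
import numpy as np

def relevance_rank_accuracy(attribution, mask):
    k = int(mask.sum())
    if k == 0:
        return 0.0
    top_k = np.argsort(attribution.ravel())[::-1][:k]   # indices of the k most-relevant pixels
    return float(mask.ravel()[top_k].sum()) / k

attr = np.random.rand(32, 32)
mask = np.zeros((32, 32), dtype=bool)
mask[8:20, 8:20] = True
print(relevance_rank_accuracy(attr, mask))
```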

[CV-103] Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection CVPR26

【速读】:该论文旨在解决现有开放词汇检测器(open-vocabulary detector)主要基于RGB图像,在热成像(thermal imagery)中泛化能力不足的问题,尤其针对热图像中低纹理和发射率变化带来的语义迁移挑战。解决方案的关键在于提出首个由大语言模型(LLM)监督的热成像开放词汇检测框架——Thermal-Det,其核心创新包括:1)构建了一个包含超过百万张热图像样本的合成数据集,通过将GroundingCap-1M转换至热域并过滤RGB特有词汇,实现图文对齐;2)采用联合优化策略,同时训练检测、描述生成与跨模态蒸馏目标;3)引入冻结的RGB教师模型提供几何与语义伪监督,利用未标注的RGB-热图像对实现知识迁移;4)设计热文对齐头(Thermal-Text Alignment Head)和模态融合交叉注意力模块(Modality-Fused Cross-Attention),以增强文本校准与双模态推理能力。该方法在公共基准测试中实现了2–4% AP提升,显著优于现有开放词汇检测器,为可扩展的语言驱动热感知奠定了基础。

链接: https://arxiv.org/abs/2605.10130
作者: Yasiru Ranasinghe,Elim Schenck,Florence Yellin,Shuowen Hu,Christopher Funk,Vishal M. Patel
机构: Johns Hopkins University (约翰霍普金斯大学); Kitware (基特瓦); DEVCOM Army Research Laboratory (美国陆军研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 26

点击查看摘要

Abstract:Existing open-vocabulary detectors focus on RGB images and fail to generalize to thermal imagery, where low texture and emissivity variations challenge RGB-based semantics. We present Thermal-Det, the first large language model (LLM) supervised open-vocabulary detector tailored for thermal images. To enable large-scale training, we develop a synthetic dataset by converting GroundingCap-1M into the thermal domain and filtering captions to remove RGB-specific terms, yielding over one million thermally aligned samples with bounding boxes, grounding texts, and detailed captions. Thermal-Det jointly optimizes detection, captioning, and cross-modal distillation objectives. A frozen RGB teacher provides geometric and semantic pseudo-supervision for paired but unlabeled RGB-thermal data, transferring open-vocabulary knowledge without manual annotation. The model further employs a Thermal-Text Alignment Head for text calibration and a Modality-Fused Cross-Attention module for dual-modality reasoning. Unlike prior domain-adaptation methods, the detector is fully fine-tuned to internalize thermal contrast patterns while preserving language alignment. Experiments on public benchmarks show consistent 2-4% AP gains over existing open-vocabulary detectors, establishing a strong foundation for scalable, language-driven thermal perception.

[CV-104] Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition

【速读】:该论文旨在解决时尚穿搭生成中视觉一致性不足的问题,即如何在多模态条件(文本提示与参考图像)下实现服装元素的协调一致生成。现有方法虽尝试利用参考图像和文本信息提升视觉一致性,但对多模态条件的整合仍显粗放,且缺乏大规模电商场景下的高质量数据支撑。解决方案的关键在于提出一个名为Unified Multi-modal Condition (UMC) 的框架,其核心创新包括:1)设计了一个融合Transformer模块以对齐文本与图像嵌入之间的模态差异,从而提取统一的多模态嵌入;2)重构生成模型中的注意力机制,使噪声图像能选择性关注提示中的关键token,从而增强提示与生成结果之间的语义关联性。该方案通过新构建的Fashion130k电商数据集验证,在真实应用场景中显著优于当前最先进(SoTA)方法,实现了更高质量的视觉一致性穿搭生成。

链接: https://arxiv.org/abs/2605.10127
作者: Yu He,Ting Zhu,Yichun Liu,Lichen Ma,Xinyuan Shan,Jingling Fu,Yu Shi,Junshi Huang,Yan Li
机构: JD.com
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent research on fashion outfit generation focuses on promoting visual consistency of garments by leveraging key information from the reference image and text prompt. However, the potential of outfit generation remains underexplored, requiring a comprehensive e-commerce dataset and elaborate utilization of multi-modal conditions. In this paper, we propose a brand-new e-commerce dataset, named Fashion130k, with various occasions, models, and garment types. For consistent garment generation, we design a framework with Unified Multi-modal Condition (UMC) to align and integrate the text and visual prompts into the generation model. Specifically, we explore an embedding refiner to extract the unified embeddings of multi-modal prompts, within which a Fusion Transformer is proposed to align the multi-modal embeddings by adjusting the modality gap between text and image. Based on the unified embeddings, the attention in the generation model is redesigned to emphasize the correlations between prompts and the noise image, so that the noise image can select the pivotal tokens of the prompts for consistent outfit generation. Our dataset and proposed framework offer a general and nuanced exploration of multi-modal prompts for generation models. Extensive experiments on real-world applications and benchmarks demonstrate the effectiveness of UMC in visual consistency, achieving more promising results than SoTA methods.

[CV-105] MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在显微成像等专业科学领域推理能力受限的问题,其核心挑战在于领域特定训练数据稀缺以及将细粒度专家知识编码至模型参数的难度。解决方案的关键在于提出MicroWorld框架,该框架通过构建一个从大规模科学图像-标题语料库中提取的多模态属性图(Multimodal Attributed Property Graph, MAPG),在推理阶段无需任何领域微调即可增强MLLM的推理能力。具体而言,MicroWorld利用scispaCy或基于大语言模型的三元组挖掘技术抽取生物医学实体与关系,借助Qwen3-VL-Embedding将图像与实体对齐至共享嵌入空间,并构建包含约11.1万个节点和346万个有类型边的知识图谱;在推理时,通过图增强检索管道匹配查询实体并注入结构化知识上下文到MLLM提示中,从而显著提升模型在MicroVQA和MicroBench基准上的表现,验证了其在跨域泛化能力上的优势。

链接: https://arxiv.org/abs/2605.10120
作者: Manyu Li,Ruian He,Chenxi Ma,Weimin Tan,Bo Yan
机构: Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 29 pages, 14 figures

点击查看摘要

Abstract:Multimodal large language models (MLLMs) show remarkable potential for scientific reasoning, yet their performance in specialized domains such as microscopy remains limited by the scarcity of domain-specific training data and the difficulty of encoding fine-grained expert knowledge into model parameters. To bridge the gap, we introduce MicroWorld, a framework that constructs a multimodal attributed property graph (MAPG) from large-scale scientific image–caption corpora and leverages it to augment MLLM reasoning at inference time without any domain-specific fine-tuning. MicroWorld extracts biomedical entities and relations via scispaCy or LLM-based triplet mining, aligns images and entities in a shared embedding space using Qwen3-VL-Embedding, and assembles a knowledge graph comprising approximately 111K nodes and 346K typed edges spanning eight relation categories. At inference time, a graph-augmented retrieval pipeline matches query entities to the MAPG and injects structured knowledge context into the MLLM prompt. On the MicroVQA benchmark, MicroWorld improves the reasoning performance of Qwen3-VL-8B-Instruct by 37.5%, outperforming GPT-5 by 13.0% to achieve a new state-of-the-art. Furthermore, it yields a 6.0% performance gain on the MicroBench benchmark. Extensive experiments demonstrate the enhanced generalization capability introduced by MicroWorld. A qualitative case study further reveals both the mechanisms through which structured knowledge improves reasoning and the failure modes that point to promising future directions. Code and data are available at this https URL.
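下面用一段极简Python示意"图增强检索并把结构化知识注入提示"的流程;图的数据结构、实体匹配方式与提示模板均为本文编辑的假设,并非MicroWorld的原始管线。

```python
from typing import Dict, List, Tuple

def retrieve_graph_context(query_entities: List[str],
                           graph: Dict[str, List[Tuple[str, str]]],
                           max_edges: int = 5) -> str:
    """graph: 实体 -> [(关系, 目标实体), ...];返回可注入提示的结构化知识文本。"""
    lines = []
    for ent in query_entities:
        for rel, tgt in graph.get(ent, [])[:max_edges]:
            lines.append(f"({ent}) -[{rel}]-> ({tgt})")
    return "\n".join(lines)

# 假设的小型属性图与查询
graph = {"mitochondria": [("located_in", "cytoplasm"),
                          ("visualized_by", "electron microscopy")]}
context = retrieve_graph_context(["mitochondria"], graph)
prompt = f"Background knowledge:\n{context}\n\nQuestion: What organelle is shown in this micrograph?"
print(prompt)
```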

[CV-106] Think as Needed: Geometry-Driven Adaptive Perception for Autonomous Driving

【速读】:该论文旨在解决自动驾驶中3D感知模型在处理不同复杂度场景时计算资源分配不合理的问题,即固定计算预算导致简单场景资源浪费、复杂场景性能不足,同时现有方法在处理多目标交互和遮挡恢复方面存在效率低与记忆缺失的缺陷。解决方案的关键在于提出增强型HOPE(Enhanced HOPE)架构:首先利用无监督统计估计器动态评估LiDAR帧的几何复杂度,自适应地选择浅层或深层处理路径以优化计算资源;其次,将二次复杂度的成对注意力机制替换为线性时间的子空间聚类网络,高效建模对象间交互;最后,通过释放的计算资源部署持久化时序记忆模块,实现对被遮挡物体及交通规则的跨帧保留,从而显著提升长尾场景精度并支持超过5秒的遮挡后追踪能力。

链接: https://arxiv.org/abs/2605.10117
作者: Donghyun Kim,Jaehyoung Park
机构: Stony Brook University (石溪大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autonomous driving scenes range from empty highways to dense intersections with dozens of interacting road users, yet current 3D detection models apply a fixed computation budget to every frame, wasting resources on simple scenes while lacking capacity for complex ones. Existing approaches compound this problem: Transformer-based interaction models scale quadratically with the number of detected objects, and frame-by-frame processing causes the system to immediately forget objects the moment they become occluded. We propose Enhanced HOPE, an adaptive perception architecture that measures the geometric complexity of each incoming LiDAR frame using an unsupervised statistical estimator and routes it through a shallow or deep processing path accordingly, requiring no manual scene labels. To keep interaction modeling efficient, we replace quadratic pairwise attention with a linear-time subspace-based network that groups nearby objects into clusters and processes them jointly. The computational savings from these two mechanisms free up resources for a persistent temporal memory module that retains previously detected objects and traffic rules across frames, enabling the system to recall occluded objects seconds after they disappear from view. On the nuScenes and CARLA benchmarks, Enhanced HOPE reduces latency by 38% on simple scenes with no accuracy loss, improves mean Average Precision by 2.7 points on rare long-tail scenarios, and tracks objects through occlusions lasting over 5 seconds, where all tested baselines fail.
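下面给出一个极简示意,说明"用无监督统计量估计LiDAR帧复杂度并据此选择浅/深处理路径"的自适应路由思想;所用统计量与阈值均为本文编辑的假设,并非论文中的估计器。

```python
import numpy as np

def scene_complexity(points: np.ndarray, voxel: float = 2.0) -> float:
    """points: [N, 3] 的LiDAR点;用占据体素数 × 局部密度离散度作粗略复杂度指标。"""
    cells = np.floor(points[:, :2] / voxel).astype(int)
    _, counts = np.unique(cells, axis=0, return_counts=True)
    return float(len(counts) * counts.std() / (counts.mean() + 1e-6))

def route(points: np.ndarray, threshold: float = 500.0) -> str:
    # 简单场景走浅层路径以省算力,复杂场景走深层路径
    return "deep_path" if scene_complexity(points) > threshold else "shallow_path"

print(route(np.random.rand(10000, 3) * 100))
```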

[CV-107] CFSPMNet: Cross-subject Fourier-guided Spatial-Patch Mamba Network for EEG Motor Imagery Decoding in Stroke Patients

【速读】:该论文旨在解决脑卒中患者运动想象脑电(MI-EEG)解码在跨被试场景下的性能下降问题,其根源在于病理神经重组导致任务相关脑电动态、非周期性活动、局部兴奋性、跨区域协同及试验级脑状态上下文发生显著变化,使得源患者训练的模型难以泛化到新患者。解决方案的关键在于提出CFSPMNet框架,通过建模后脑卒中MI-EEG为潜在神经状态组织结构,结合傅里叶重构状态Mamba网络(FRSM)与共享-私有原型匹配(SPPM)机制:FRSM将每段试验表示为潜在生理标记序列,在傅里叶域重新组织标记状态,并利用傅里叶导出的试验上下文引导Mamba状态空间传播;SPPM则通过融合语义置信度与共享-私有生理一致性来优化目标伪标签更新,过滤掉高置信但生理不一致的预测结果。这一方法有效提升了跨被试MI-EEG解码的准确率和鲁棒性。

链接: https://arxiv.org/abs/2605.10111
作者: Xiangkai Wang,Yun Zhao,Dongyi He,Qingling Xia,Gen Li,Xinlai Xing,Yuchi Pan,Bin Jiang
机构: Chongqing University of Technology (重庆理工大学); Chongqing Polytechnic University of Electronic Technology (重庆电子工程职业学院); The Hong Kong Polytechnic University (香港理工大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Motor imagery electroencephalography (MI-EEG) decoding offers a non-invasive route for post-stroke rehabilitation, but cross-patient use remains difficult because pathological neural reorganization changes task-related EEG dynamics, aperiodic activity, local excitability, cross-regional coordination, and trial-level brain-state context. This makes source-learned MI representations unreliable for unseen patients. To address this problem, we propose CFSPMNet, a cross-patient adaptation framework that models post-stroke MI-EEG as latent neural-state organization. CFSPMNet combines a Fourier-Reorganized State Mamba Network (FRSM) with Shared-Private Prototype Matching (SPPM). FRSM represents each trial as a latent physiological token sequence, reorganizes token states in the Fourier domain, and uses Fourier-derived trial context to guide Mamba state-space propagation. SPPM improves target pseudo-label updating by combining semantic confidence with shared-private physiological consistency, filtering confident but physiologically inconsistent target predictions. Leave-one-subject-out experiments on two stroke MI-EEG datasets show that CFSPMNet outperforms representative CNN-, Transformer-, Mamba-, and adaptation-based baselines, achieving average accuracies of 68.23% on XW-Stroke and 73.33% on 2019-Stroke, with gains of 5.63 and 8.25 percentage points over the strongest competitors. Ablation, sensitivity, feature-alignment, pseudo-label selection, and neurophysiological visualization analyses further support the roles of Fourier-domain token-state reorganization and calibrated pseudo-label updating. These results suggest that latent neural-state modeling can improve rehabilitation-oriented cross-patient BCI decoding. Code is available at this https URL.

[CV-108] ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

【速读】:该论文旨在解决当前多模态大语言模型(Multi-modal Large Language Models, MLLMs)在3D空间推理能力上的局限性问题,尤其是依赖后训练(post-training)阶段在精心构建的基准数据集上进行优化所带来的泛化能力不足与计算成本高的问题。其解决方案的关键在于提出一种无需训练(training-free)的视频驱动空间推理代理框架 ViSRA(Video-based Spatial Reasoning Agent),通过引入专家模型提供的显式空间信息,以模块化、可扩展的方式激发MLLMs的空间推理能力,从而实现人类对齐且具备跨任务迁移性的3D理解,同时避免了昂贵的数据标注和后训练过程。

链接: https://arxiv.org/abs/2605.10106
作者: Tingshu Mou,Jiabo He,Renying Wang,Ce Liu,Hao Yang,Tiehua Zhang,Jingjing Chen,Xingjun Ma
机构: Fudan University (复旦大学); Bosch Center for Artificial Intelligence (博世人工智能中心); Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in Multi-modal Large Language Models (MLLMs) target 3D spatial intelligence, yet the progress has been largely driven by post-training on curated benchmarks, leaving the inference-time approach relatively underexplored. In this paper, we take a training-free perspective and introduce ViSRA, a human-aligned Video-based Spatial Reasoning Agent, as a framework to probe the spatial reasoning mechanism of MLLMs. ViSRA elicits spatial reasoning in a modular and extensible manner by leveraging explicit spatial information from expert models, enabling a plug-and-play flexible paradigm. ViSRA offers two key advantages: (1) human-aligned and transferable 3D understanding rather than task-specific overfitting; and (2) no post-training computational cost along with heavy manual curation of spatial reasoning datasets. Experimental results demonstrate consistent improvement across a set of MLLMs on both existing benchmarks and unseen 3D spatial reasoning tasks, with ViSRA outperforming baselines by up to a 15.6% and 28.9% absolute margin respectively.

[CV-109] HYPERPOSE: Hyperbolic Kinematic Phase-Space Attention for 3D Human Pose Estimation

【速读】:该论文旨在解决当前3D人体姿态估计方法在欧几里得空间(Euclidean space)中建模人体骨骼层级结构时存在的几何失真问题,特别是由于欧氏空间与人体骨架的树状拓扑结构不匹配而导致的体积膨胀和结构一致性丧失。解决方案的关键在于提出HYPERPOSE框架,其核心创新是将时空推理完全置于双曲空间(hyperbolic space)中的洛伦兹模型(Lorentz model)内进行,以原生地保持人体骨骼的层次结构;同时引入双曲运动相空间注意力机制(Hyperbolic Kinematic Phase-Space Attention, HKPSA)和多尺度窗口化双曲注意力机制,在O(TW)复杂度下高效建模时间动态,并通过新型黎曼损失函数与不确定性加权课程学习策略,强化骨长和速度等物理测地约束,从而显著提升姿态估计的结构一致性和时序连贯性。

链接: https://arxiv.org/abs/2605.10100
作者: Vinduja T.,Ashish M.,Ajay Waghumbare,Upasna Singh
机构: DIAT (Defence Institute of Advanced Technology)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce HYPERPOSE, a novel 3D human pose estimation framework that performs spatio-temporal reasoning entirely within the Lorentz model of hyperbolic space \mathbb{H}^d to natively preserve the hierarchical tree topology of the human skeleton. Current state-of-the-art pose estimators aim to capture complex joint dynamics by relying on transformers and graph convolutional networks. Since these architectures operate exclusively in Euclidean space which fundamentally mismatches the inherent tree structure of the human body, these methods inevitably suffer from exponential volume distortion and struggle to maintain structural coherence. To this end, we depart from flat spaces and aim to improve geometric fidelity with Hyperbolic Kinematic Phase-Space Attention (HKPSA), natively embedding complex joint relationships without distortion, alongside a multi-scale windowed hyperbolic attention mechanism that efficiently models temporal dynamics in O(TW) complexity. Furthermore, to overcome the well-known instability of training non-Euclidean manifolds, HYPERPOSE introduces a novel Riemannian loss suite and an uncertainty-weighted curriculum, enforcing physical geodesic constraints like bone length and velocity consistency. Extensive evaluations on the Human3.6M and MPI-INF-3DHP datasets demonstrate that HYPERPOSE achieves state-of-the-art structural and temporal coherence, significantly reducing both volume distortion and velocity error, while establishing new state-of-the-art benchmarks in overall positional accuracy.
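为帮助理解"在洛伦兹模型中进行时空推理"的基本运算,下面给出双曲空间中测地距离与原点指数映射的极简Python示意;这只是洛伦兹模型的标准公式演示,并非HYPERPOSE的网络实现,关节数与维度均为假设。

```python
import torch

def lorentz_inner(x, y):
    # 洛伦兹内积 <x, y>_L = -x0*y0 + sum_i xi*yi
    return -x[..., 0] * y[..., 0] + (x[..., 1:] * y[..., 1:]).sum(-1)

def lorentz_dist(x, y, eps=1e-6):
    # 测地距离 d(x, y) = arccosh(-<x, y>_L)
    return torch.acosh(torch.clamp(-lorentz_inner(x, y), min=1 + eps))

def expmap_origin(v_spatial):
    """把切空间向量(只给空间分量)映射到双曲面上,满足 x0 = cosh(||v||)。"""
    norm = v_spatial.norm(dim=-1, keepdim=True).clamp_min(1e-6)
    x_spatial = torch.sinh(norm) * v_spatial / norm
    x0 = torch.cosh(norm)
    return torch.cat([x0, x_spatial], dim=-1)

a = expmap_origin(torch.randn(17, 3) * 0.1)   # 假设17个关节的双曲嵌入
b = expmap_origin(torch.randn(17, 3) * 0.1)
print(lorentz_dist(a, b).shape)
```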

[CV-110] Initiation of Interaction Detection Framework using a Nonverbal Cue for Human-Robot Interaction

【速读】:该论文旨在解决家庭环境中人机交互(Human-Robot Interaction, HRI)中初始交互(Initiation of Interaction, IoI)的自动检测问题,尤其在不依赖关键词识别的前提下实现鲁棒的交互触发。解决方案的关键在于融合音频与视觉传感器信息:通过声源定位(Sound Source Localization)与人体跟踪技术确定用户位置,并结合面部朝向判断(Face Orientation Detection)来判定用户是否面向机器人;若用户未说话但持续注视机器人超过预设时间,则同样触发IoI检测。整个框架基于状态转移模型设计,并在移动机器人平台上通过实验验证其有效性,所有模块均集成于Robot Operating System (ROS) 环境以支持实际部署。

链接: https://arxiv.org/abs/2605.10087
作者: Guhnoo Yun,Juhan Yoo,Kijung Kim,Dong Hwan Kim
机构: Korea Institute of Science and Technology (韩国科学技术院); Semyung University (世明大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper describes an initiation of interaction (IoI) detection framework without keywords for human-robot interaction (HRI) based on audio and vision sensor fusion in a domestic environment. In the proposed framework, the robot has its own audio and vision sensors, and can employ an external vision sensor for stable human detection and tracking. When the user starts to speak while looking at the robot, the robot can localize his or her position by its sound source localization together with human tracking information. Then the robot can detect the IoI if it perceives that the face of the speaker faces the robot. In case the user does not speak directly, the robot can also detect the IoI if he or she looks at the robot for more than a predefined period of time. A state transition model for the proposed IoI detection framework is designed and verified by experiments with a mobile robot. In order to implement and associate our model in a robot architecture, all the components are implemented and integrated in the Robot Operating System (ROS) environment.

[CV-111] SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation

【速读】:该论文旨在解决当前视频生成模型在多人群体社交场景中缺乏对交互行为的显式控制问题,具体表现为动作执行者与动作描述不匹配(actor-action mismatch)、社交动态混乱以及动作目标错误等现象。其解决方案的关键在于提出一种无需训练的交互控制器 SocialDirector,通过两个核心模块实现:一是 Social Actor Masking,利用时空掩码限制每个个体的视觉 token 仅关注自身文本描述,从而避免动作执行错误和社交秩序紊乱;二是 Directional Reweighting,增强对方向性词语(如“leftward”、“right”)的关注,使动作精准指向预期目标。该方法有效提升了生成视频中社交互动的真实性与一致性。

链接: https://arxiv.org/abs/2605.10079
作者: Liangyang Ouyang,Ruicong Liu,Caixin Kang,Yifei Huang,Yoichi Sato
机构: The University of Tokyo (东京大学); Shanda AI Research Tokyo (Shanda AI 研究所东京)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video generation has advanced rapidly, producing photorealistic videos from text or image prompts. Meanwhile, film production and social robotics increasingly demand multi-person videos with rich social interactions, including conversations, gestures, and coordinated actions. However, existing models offer no explicit control over interactions, such as who performs which action, when it occurs, and toward whom it is directed. This often results in wrong person performing unintended actions (actor-action mismatch), disordered social dynamics, and wrong action targets. To address these challenges, we present SocialDirector, a training-free interaction controller that enhances the generation model by modulating cross-attention maps. SocialDirector contains two modules: Social Actor Masking and Directional Reweighting. Social Actor Masking constrains each person’s visual tokens to attend only to their own textual descriptions via a spatiotemporal mask, avoiding actor-action mismatch and disordered social dynamics. Directional Reweighting amplifies attention to directional words (e.g., “leftward”, “right”), leading each action towards its intended target. To evaluate generated social interactions, we annotate existing datasets with interaction descriptions and build a fully automated evaluation pipeline powered by open-source VLMs. Experiments on different video generation models show that SocialDirector significantly improves interaction fidelity and approaches the upper bound set by real videos.
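下面用极简代码示意"Social Actor Masking"的核心思想:为每个人物的视觉token构造只允许关注其自身文本片段的交叉注意力掩码;人物区域与文本区间的划分方式均为本文编辑的假设,并非SocialDirector的原始实现。

```python
import torch

def build_actor_mask(num_visual_tokens, text_len, actor_regions, actor_text_spans):
    """actor_regions: {人物: 视觉token索引列表}; actor_text_spans: {人物: (起始, 结束)文本token区间}。
    返回 [num_visual_tokens, text_len] 的bool掩码,True表示允许注意。"""
    mask = torch.zeros(num_visual_tokens, text_len, dtype=torch.bool)
    for actor, vis_idx in actor_regions.items():
        s, e = actor_text_spans[actor]
        mask[torch.tensor(vis_idx).unsqueeze(1), torch.arange(s, e)] = True
    return mask

mask = build_actor_mask(
    num_visual_tokens=16, text_len=10,
    actor_regions={"A": [0, 1, 2, 3], "B": [8, 9, 10, 11]},
    actor_text_spans={"A": (0, 5), "B": (5, 10)},
)
# 在交叉注意力中可用:attn_logits.masked_fill_(~mask, float("-inf"))
print(mask.shape, mask.sum().item())
```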

[CV-112] MFVLR: Multi-domain Fine-grained Vision-Language Reconstruction for Generalizable Diffusion Face Forgery Detection and Localization

【速读】:该论文旨在解决当前人脸伪造检测与定位方法在跨域泛化能力不足的问题,尤其是针对扩散模型(diffusion model)生成的伪造人脸图像难以被有效识别和定位的挑战。现有方法多依赖于图像模态特征,忽视了细粒度文本模态的潜在信息,导致模型在面对不同生成器、不同伪造类型及不同数据集时性能下降。解决方案的关键在于提出一种多域细粒度视觉-语言重建(Multi-Domain Fine-grained Vision-Language Reconstruction, MFVLR)模型,通过语言引导的人脸伪造表征学习挖掘多样化的视觉伪造痕迹;其核心创新包括:1)设计细粒度语言Transformer以学习通用的语言嵌入并实现语言重建;2)构建多域视觉编码器以捕捉图像域与残差域中的互补伪造模式;3)引入一个即插即用的视觉注入模块增强视觉与语言嵌入之间的交互,从而显著提升对扩散合成伪造人脸的检测与定位能力。

链接: https://arxiv.org/abs/2605.10071
作者: Yaning Zhang,Tianyi Wang,Zan Gao,Yibo Zhao,Chunjie Ma,Meng Wang
机构: Qilu University of Technology (Shandong Academy of Sciences); National University of Singapore; Tianjin University of Technology; Hefei University of Technology
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The swift advancement in photo-realistic face generation technology has sparked considerable concerns across society and academia, emphasizing the requirement of generalizable face forgery detection and localization methods. Prior works tend to capture face forgery patterns across multiple domains using image modality, other modalities like fine-grained texts are not comprehensively investigated, which restricts the generalization capability of models. Besides, they usually analyze facial images created by GAN, but struggle to identify and localize those synthesized by diffusion. To solve the problems, in this paper, we devise a novel multi-domain fine-grained vision-language reconstruction (MFVLR) model, which explores comprehensive and diverse visual forgery traces via language-guided face forgery representation learning, to achieve generalizable diffusion-synthesized face forgery detection and localization (DFFDL). Specifically, we devise a fine-grained language transformer that studies general fine-grained language embeddings using language reconstruction. We propose a multi-domain vision encoder to capture general and complementary visual forgery patterns across the image and residual domains. A vision decoder is designed to reconstruct image appearance and achieve forgery localization. Besides, we propose an innovative plug-and-play vision injection module to enhance the interaction between the vision and language embeddings. Extensive experiments and visualizations demonstrate that our network outperforms the state of the art on different settings like cross-generator, cross-forgery, and cross-dataset evaluations.

[CV-113] Explanation-Aware Learning for Enhanced Interpretability in Biomedical Imaging ALT

【速读】:该论文旨在解决深度神经网络在医学图像诊断中依赖伪相关或临床无关视觉线索的问题,从而限制了模型在实际应用中的可信度。其解决方案的关键在于将解释监督(explanation supervision)直接引入模型训练目标,通过设计特定的解释损失(explanation loss)引导模型关注具有临床意义的区域,从而促进基于临床依据的决策过程。研究进一步分析了不同解释损失设计和监督强度对预测性能与解释空间忠实性(spatial faithfulness)的影响,并提出两个量化指标——标注覆盖度(annotation coverage)和显著性精度(saliency precision),以实现对解释可解释性的严谨评估。实验表明,在噪声临床标注下,合理设置解释损失系数可在保持预测准确性的同时显著提升解释一致性。

链接: https://arxiv.org/abs/2605.10054
作者: Zubair Faruqui,Rahul Dubey
机构: Missouri State University (密苏里州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review at IEEE Journal of Biomedical and Health Informatics (JBHI)

点击查看摘要

Abstract:Deep neural networks for medical image diagnosis often achieve high predictive accuracy while relying on spurious or clinically irrelevant visual cues, limiting their trustworthiness in practice. Post-hoc explanation methods are widely used to visualize model decisions in the form of saliency maps; however, these explanations do not influence how models learn during training, allowing non-causal or confounding features to persist. This motivates the incorporation of explanation supervision directly into the training objective to guide model attention toward clinically meaningful regions and promote clinically grounded decision-making. This paper presents a systematic approach to integrate explanation loss into model training and analyzes how different explanation loss designs and supervision strengths influence both predictive performance and spatial faithfulness of explanations. To quantitatively assess interpretability, two complementary explanation performance metrics-annotation coverage and saliency precision-are introduced, enabling rigorous evaluation beyond qualitative visualization. Our experimental results reveal a clear trade-off between explanation quality and explanation loss coefficients. Furthermore, quantitative statistical analysis yields consistently improved explanation alignment while maintaining comparable accuracy. Experiments were conducted on annotated chest X-ray datasets; however, the proposed framework is applicable to a broad range of annotated biomedical imaging modalities. Overall, these findings demonstrate that explanation supervision is not a monolithic design choice and provide practical guidance for incorporating explanation loss into training objectives under noisy clinical annotations.
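下面给出一个极简示意,展示如何把解释损失加入训练目标,并计算"标注覆盖度/显著性精度"两类指标;具体的损失与指标公式均为本文编辑的假设,论文中的定义可能不同。

```python
import torch
import torch.nn.functional as F

def explanation_loss(saliency, mask):
    """saliency: [B,H,W] 非负显著图;mask: [B,H,W] 0/1临床标注。惩罚落在标注区域之外的显著性。"""
    outside = (saliency * (1 - mask)).sum(dim=(1, 2))
    total = saliency.sum(dim=(1, 2)) + 1e-8
    return (outside / total).mean()

def combined_loss(logits, labels, saliency, mask, lam=0.5):
    # 预测损失 + lam * 解释损失
    return F.cross_entropy(logits, labels) + lam * explanation_loss(saliency, mask)

def coverage_and_precision(saliency, mask, top_ratio=0.1):
    """取显著性最高的top_ratio像素:覆盖度=命中的标注像素占标注面积,精度=命中像素占top像素数。"""
    B = saliency.shape[0]
    flat_s, flat_m = saliency.view(B, -1), mask.view(B, -1)
    k = max(1, int(flat_s.shape[1] * top_ratio))
    top_idx = flat_s.topk(k, dim=1).indices
    hits = flat_m.gather(1, top_idx).sum(dim=1)
    return (hits / (flat_m.sum(dim=1) + 1e-8)).mean(), (hits / k).mean()

logits, labels = torch.randn(4, 2), torch.randint(0, 2, (4,))
saliency, mask = torch.rand(4, 224, 224), (torch.rand(4, 224, 224) > 0.8).float()
print(combined_loss(logits, labels, saliency, mask))
print(coverage_and_precision(saliency, mask))
```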

[CV-114] EchoPrune: Interpreting Redundancy as Temporal Echoes for Efficient VideoLLMs

【速读】:该论文旨在解决长视频理解中视频大语言模型(VideoLLM)因帧采样策略导致的性能瓶颈问题:密集采样引入大量视觉标记(token),超出模型处理能力;稀疏采样则可能遗漏关键时序信息,引发大语言模型(LLM)幻觉。现有无训练令牌压缩方法要么将视频视为静态图像,要么依赖分段级合并启发式规则,削弱了细粒度时空建模能力并增加额外开销。解决方案的关键在于提出 EchoPrune,一种轻量且无需训练的令牌剪枝方法,其核心思想是将冗余视频标记视为时间上的“回声”——若某标记能从前一帧良好重建,则为时间冗余;否则可能包含新事件、运动或与查询相关的视觉证据。EchoPrune 通过两个指标评分:(i) 查询引导的跨模态相关性,以及 (ii) 时间重建误差(基于连续帧间的对应匹配和回声匹配)。该方法在固定 LLM 视觉标记预算下显著提升时序分辨率,保留任务相关线索与时序新颖性,同时抑制可预测冗余,使 VideoLLMs 在不增加解码预算的前提下处理最多 20 倍帧数,实现性能提升(+8.6%)和推理加速(预填充阶段提速 5.6 倍)。

链接: https://arxiv.org/abs/2605.10050
作者: Jiameng Li,Minye Wu,Jiezhang Cao,Aleksei Tiulpin,Matthew B. Blaschko
机构: KU Leuven (鲁汶大学); Shanghai Jiaotong University (上海交通大学); Weill Cornell Medicine (威尔康奈尔医学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages

点击查看摘要

Abstract:Long-form video understanding remains challenging for Video Large Language Models (VideoLLMs), as the dense frame sampling introduces massive visual tokens while sparse sampling risks missing critical temporal evidence and leading to LLM hallucination. Existing training-free token reduction methods either treat videos equally as static images or rely on segment-level merging heuristics, which weaken fine-grained spatiotemporal modeling and introduce additional overhead. In this paper, we propose EchoPrune, a lightweight and training-free token pruning method that improves temporal resolution under a fixed LLM-side visual token budget. Our core idea is to interpret redundant video tokens as temporal echoes: if a token is well reconstructed from the previous frame, it is merely a temporally redundant echo; otherwise, it may capture new events, motion, or query-relevant visual evidence. Based on this insight, EchoPrune scores visual tokens by (i) query-guided crossmodal relevance and (ii) temporal reconstruction error, measured by correspondence matching and echo matching across consecutive frames. The selected tokens preserve task-relevant cues and temporal novelty while suppressing predictable redundancy, allowing VideoLLMs to observe more frames without increasing the decoding budget. Extensive experiments on LLaVA-OV, Qwen2.5VL, and Qwen3VL across six video understanding benchmarks show that EchoPrune enables VideoLLMs to process up to 20x frames under the same token budget, yielding improved performance (+8.6%) and inference speedup (5.6x for prefilling) on Qwen2.5VL-7B.
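下面用极简Python示意"查询相关性 + 时间重建误差"的token打分与top-k保留流程;打分的具体形式与权重均为假设,仅用于说明EchoPrune摘要所述的思路,并非原始实现。

```python
import torch
import torch.nn.functional as F

def echo_scores(cur_tokens, prev_tokens, query_emb, alpha=0.5):
    """cur_tokens/prev_tokens: [N, D] 当前帧/上一帧的视觉token;query_emb: [D]。"""
    cur = F.normalize(cur_tokens, dim=-1)
    prev = F.normalize(prev_tokens, dim=-1)
    q = F.normalize(query_emb, dim=-1)
    relevance = cur @ q                               # 查询引导的跨模态相关性
    echo = (cur @ prev.T).max(dim=-1).values          # 能被上一帧"重建"的程度(回声)
    novelty = 1 - echo                                # 重建误差越大越应保留
    return alpha * relevance + (1 - alpha) * novelty

def prune(cur_tokens, prev_tokens, query_emb, keep_ratio=0.25):
    scores = echo_scores(cur_tokens, prev_tokens, query_emb)
    k = max(1, int(len(cur_tokens) * keep_ratio))
    keep_idx = scores.topk(k).indices.sort().values   # 保留top-k并维持原始顺序
    return cur_tokens[keep_idx], keep_idx

tokens, idx = prune(torch.randn(196, 64), torch.randn(196, 64), torch.randn(64))
print(tokens.shape)
```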

[CV-115] ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models

【速读】:该论文旨在解决视觉自回归(Visual Autoregressive, VAR)模型在高分辨率图像生成中因训练分辨率固定而导致的三类典型失败模式:全局重复、局部重复和细节退化问题。其核心挑战在于,VAR模型采用从粗到细的分阶段生成机制,每个阶段由特定的RoPE(Rotary Position Embedding)频率带主导,而现有训练-free外推方法未能适配这种阶段性的频域特性,导致各阶段频率带失衡。解决方案的关键在于提出两种创新策略:一是阶段感知的RoPE重映射(Stage-Aware RoPE Remapping),通过为每个频率带分配阶段特异的重映射规则,统一抑制三种失败模式;二是基于熵驱动的自适应注意力校准(Entropy-Driven Adaptive Attention Calibration),利用分辨率不变的归一化熵量化注意力分散程度,并推导出闭式解的头级缩放因子,使外推分辨率下的注意力分布与训练分辨率对齐,从而提升结构一致性和细节保真度。

链接: https://arxiv.org/abs/2605.10045
作者: Feihong Yan,Shaoyu Liu,Haixuan Wang,Shuai Lu,Linfeng Zhang,Huiqi Li,Xiangyang Ji
机构: Beijing Institute of Technology (北京理工大学); Xidian University (西安电子科技大学); Northeastern University at Qinhuangdao (东北大学秦皇岛分校); Shanghai Jiao Tong University (上海交通大学); Department of Automation, Tsinghua University (清华大学自动化系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 7 figures

点击查看摘要

Abstract:Visual Autoregressive (VAR) models have emerged as a strong alternative to diffusion for image synthesis, yet their fixed training resolution prevents direct generation at higher resolutions. Naively transferring training-free extrapolation methods from LLMs or diffusion models to VAR yields three characteristic failure modes: global repetition, local repetition, and detail degradation. We trace them to a unified band-stage mismatch: VAR generates images in a coarse-to-fine, scale-wise process where each stage is driven by a distinct dominant RoPE frequency band, and each failure mode emerges when the dominant band of a particular stage is disrupted. Building on this insight, we propose Stage-Aware RoPE Remapping, a training-free strategy that assigns each frequency band a stage-specific remapping rule, jointly suppressing all three failure modes. We further observe that attention becomes systematically dispersed as the image resolution increases. Existing methods typically depend on predefined attention scaling factors, which are neither adaptive to the target resolution nor capable of faithfully capturing the actual extent of attention dispersion. We therefore propose Entropy-Driven Adaptive Attention Calibration, which quantifies dispersion via a resolution-invariant normalized entropy and yields a closed-form per-head scaling factor that realigns the extrapolated-resolution attention entropy with its training-resolution counterpart. Extensive experiments show that our method consistently outperforms prior resolution-extrapolation methods in both structural coherence and fine-detail fidelity. Our code is available at this https URL.
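下面给出一个极简示意,说明"用分辨率不变的归一化熵校准注意力缩放因子"的做法;论文给出的是逐头闭式解,这里以简单的二分搜索代替,且所有数值均为本文编辑的假设。

```python
import torch

def normalized_entropy(logits):
    """logits: [N](某个query对N个key的注意力logits);熵除以log N 使其与序列长度无关。"""
    p = torch.softmax(logits, dim=-1)
    ent = -(p * (p + 1e-12).log()).sum()
    return ent / torch.log(torch.tensor(float(logits.numel())))

def calibrate_scale(logits, target_norm_ent, lo=0.5, hi=4.0, iters=30):
    """搜索缩放因子s,使softmax(s*logits)的归一化熵接近训练分辨率下的目标值。"""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if normalized_entropy(mid * logits) > target_norm_ent:
            lo = mid          # 熵偏大(注意力过于分散)-> 增大缩放使其更集中
        else:
            hi = mid
    return (lo + hi) / 2

logits = torch.randn(4096)           # 外推分辨率下更长序列的注意力logits,通常更分散
s = calibrate_scale(logits, target_norm_ent=torch.tensor(0.7))
print(round(s, 3))
```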

[CV-116] Only Train Once: Uncertainty-Aware One-Class Learning for Face Authenticity Detection

【速读】:该论文旨在解决当前人脸伪造检测方法在面对新型生成范式时泛化能力不足的问题,即现有模型通常基于全监督二分类框架,在未见过的伪造类型上性能显著下降,且多数方法仅针对DeepFakes或完全合成的人脸,缺乏统一的通用检测框架。其解决方案的关键在于提出FADNet(Face Authenticity Detector Net),一个基于自监督学习的一类分类(One-Class Classification, OCC)框架:通过仅使用真实人脸数据训练以捕获其内在特征表示,将偏离该分布的图像判定为伪造;同时引入证据深度学习(Evidential Deep Learning, EDL)量化预测不确定性,并集成一个即插即用的伪伪造图像生成器(Plug-and-play Pseudo-Forgery Image Generator, PFIG)来收紧真实数据周围的决策边界,从而提升检测精度与鲁棒性。

链接: https://arxiv.org/abs/2605.10040
作者: Qingchao Jiang,Zhenxuan Hou,Zhiying Zhu,Zhenxing Qian,Xinpeng Zhang,Zaiwang Gu
机构: East China University of Science and Technology (华东理工大学); Fudan University (复旦大学); Institute for Infocomm Research (I2R) (资讯通信研究院); Agency for Science, Technology and Research (A*STAR) (新加坡科技研究局)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12pages,7figures

点击查看摘要

Abstract:The rapid evolution of generative paradigms has enabled the creation of highly realistic imagery, escalating the risks of identity fraud and the dissemination of disinformation. Most existing approaches frame face forgery detection as a fully supervised binary classification problem. Consequently, these models typically exhibit significant performance decay when tasked with detecting forgeries from previously unseen generative paradigms. Furthermore, these methods focus exclusively on either DeepFakes or fully synthesized faces, thereby failing to provide a generalized framework for universal face forgery detection. In this paper, we address this challenge by introducing FADNet (Face Authenticity Detector Net), a self-supervised framework that reformulates face forgery detection as a one-class classification (OCC) task. By training exclusively on authentic facial data to capture their intrinsic representations, FADNet flags any image whose feature embedding deviates significantly from the learned distribution of real faces as a forgery. The framework incorporates Evidential Deep Learning (EDL) to quantify predictive uncertainty and utilizes a plug-and-play pseudo-forgery image generator (PFIG) to tighten decision boundaries around authentic data. Extensive experimental evaluations on the DF40 and ASFD benchmarks demonstrate that FADNet achieves superior performance and generalization capabilities. Specifically, FADNet substantially outperforms existing state-of-the-art (SOTA) methods, yielding a remarkable average accuracy of 96.63% and an average precision of 98.83%.
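下面用极简代码示意一类分类(OCC)式的真伪判定:仅用真实人脸特征拟合分布,离分布越远越可疑;这里以马氏距离代替论文中的证据深度学习打分,阈值与维度均为假设,仅作概念演示。

```python
import torch

class RealFaceScorer:
    def __init__(self, real_feats: torch.Tensor):
        """real_feats: [N, D] 仅由真实人脸提取的特征,用于估计"真实分布"。"""
        self.mu = real_feats.mean(dim=0)
        cov = torch.cov(real_feats.T) + 1e-3 * torch.eye(real_feats.shape[1])
        self.cov_inv = torch.linalg.inv(cov)

    def score(self, feat: torch.Tensor) -> torch.Tensor:
        # 马氏距离平方:偏离真实人脸分布越远,分数越高、越可疑
        d = feat - self.mu
        return torch.einsum("bd,dk,bk->b", d, self.cov_inv, d)

scorer = RealFaceScorer(torch.randn(1000, 64))
scores = scorer.score(torch.randn(8, 64))
is_fake = scores > 80.0   # 阈值需在验证集上标定,此处仅为示意
print(is_fake)
```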

[CV-117] Slum Detection and Density Mapping with AlphaEarth Foundations: A Representation Learning Evaluation Across 12 Global Cities

【速读】:该论文旨在解决像素级贫民窟(slum)制图长期面临的三大挑战:跨城市泛化能力有限、缺乏连续密度估计以及全球可比性弱的问题。其解决方案的关键在于评估AlphaEarth Foundations (AEF)——一个全球一致的64维年度地表嵌入(10米分辨率)在贫民窟分类与亚像素密度估计任务中的适用性,通过多城市多时间点(12个城市,69个城-年对,2017–2024)的系统实验验证其性能边界与优化策略。研究发现,同一城市跨年训练最优,且POI辅助特征显著提升密度预测精度(ΔR² = +0.064),同时揭示了AEF在建模亚像素密度梯度上的局限性,强调了基础模型嵌入在贫民窟监测中既具备潜力又存在互补需求。

链接: https://arxiv.org/abs/2605.10029
作者: Shuyang Hou,Ziqi Liu,Haoyue Jiao,Zhangyan Xu,Xiaopu Zhang,Lutong Xie,Yaxian Qing,Jianyuan Liang,Xuefeng Guan,Huayi Wua
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Pixel-level slum mapping has long been constrained by limited cross-city generalisation, the absence of continuous density estimation, and weak global comparability. AlphaEarth Foundations (AEF), a globally consistent 64-dimensional annual surface embedding at 10 m, offers a new analysis-ready basis for lightweight slum monitoring, but its applicability to slum detection - an indirectly coupled task shaped by both built form and socio-economic processes - remains untested. We evaluate AEF on slum classification and sub-pixel density estimation across 12 cities and 69 city-year pairs (2017-2024), using GRAM pseudo-masks as supervisory labels. The evaluation spans four training strategies, two protocols (random split and 3x3 spatial block cross-validation), six auxiliary feature configurations, and five baseline models, complemented by representation-level analyses (PCA, SHAP) and full-AOI mapping. Five findings emerge. (1) Same-city cross-year training is optimal under both protocols (median spatial F1 = 0.616, R^2 = 0.466); temporal expansion outperforms cross-city transfer, indicating city-scale representational drift. (2) Regression R^2 is driven primarily by zero/non-zero boundary discrimination: positive-pixel R^2 is consistently negative across all cities, revealing limited capacity to model intra-pixel density gradients at 10 m. (3) PC36 is consistently top-ranked across tasks; classification saturates at k = 32 while regression remains unsaturated at k = 64. (4) POI features yield the largest density gain (Delta R^2 = +0.064). (5) For six cities meeting dual-task usability thresholds, full-AOI inference across 2017-2024 preserves slum cluster structure (mean SSIM = 0.926). The study delineates the capabilities and complementarity needs of foundation-model embeddings for slum monitoring.

[CV-118] MUSDA: Multi-source Multi-modality Unsupervised Domain Adaptive 3D Object Detection for Autonomous Driving

【速读】:该论文旨在解决多源、多模态场景下3D目标检测的无监督域适应问题,即如何在不依赖人工标注的情况下,将多个已标注源域(如Waymo、nuScenes和Lyft)的知识迁移至未标注的目标域,从而提升自动驾驶系统在新环境中的检测性能。解决方案的关键在于提出一种层次化空间条件域分类器(hierarchical spatially-conditioned (HSC) domain classifiers),该方法在每个源-目标域对中,从两个不同层级联合对齐来自相机与激光雷达(LiDAR)模态的特征;同时构建源域间的原型图(prototype graph),并设计原型图加权(PGW)多源融合策略,以聚合多个源域检测头的预测结果,实现跨模态与跨源域的有效信息整合。

链接: https://arxiv.org/abs/2605.10026
作者: Xiaohu Lu,Hamed Khatounabadi,Hayder Radha
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the advancement of autonomous driving, numerous annotated multi-modality datasets have become available. This presents an opportunity to develop domain-adaptive 3D object detectors for new environments without relying on labor-intensive manual annotations. However, traditional domain adaptation methods typically focus on a single source domain or a single modality, limiting their effectiveness in multi-source, multi-modality scenarios. In this paper, we propose a novel framework for multi-source, multi-modality unsupervised domain adaptation in 3D object detection for autonomous driving. Given multiple labeled source domains and one unlabeled target domain, our framework first introduces hierarchical spatially-conditioned (HSC) domain classifiers, which jointly align features from both camera and LiDAR modalities at two distinct levels for each source-target domain pair. To effectively leverage information from multiple source domains, we construct a prototype graph between each pair of domains. Based on this, we develop a prototype graph weighted (PGW) multi-source fusion strategy to aggregate predictions from multiple source detection heads. Experimental results on three widely used 3D object detection datasets - Waymo, nuScenes, and Lyft - demonstrate that our proposed framework effectively integrates information across both modalities and source domains, consistently outperforming state-of-the-art methods.

[CV-119] Hystar: Hypernetwork-driven Style-adaptive Retrieval via Dynamic SVD Modulation ICLR2026

【速读】:该论文旨在解决查询风格异质性导致的图像检索性能下降问题,即在面对草图、艺术作品或低分辨率预览等不同风格的查询时,现有大规模视觉-语言表征模型(VLRMs)如CLIP因分布偏移而表现不佳。解决方案的关键在于提出一种轻量级框架Hystar,其核心创新是利用超网络(hypernetwork)动态生成注意力层的奇异值扰动(ΔS),实现针对每个输入查询的灵活适配;同时在MLP层引入静态奇异值偏移以保障跨风格的一致性,并设计基于最优传输加权的StyleNCE损失函数,强化难样本的跨风格区分能力。该方法在多风格图像检索和跨风格分类任务中均取得最优性能,且参数效率高、风格鲁棒性强。

链接: https://arxiv.org/abs/2605.10009
作者: Yujia Cai,Boxuan Li,Chenghao Xu,Jiexi Yan
机构: Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICLR2026

点击查看摘要

Abstract:Query-based image retrieval (QBIR) requires retrieving relevant images given diverse and often stylistically heterogeneous queries, such as sketches, artworks, or low-resolution previews. While large-scale vision–language representation models (VLRMs) like CLIP offer strong zero-shot retrieval performance, they struggle with distribution shifts caused by unseen query styles. In this paper, we propose the Hypernetwork-driven Style-adaptive Retrieval (Hystar), a lightweight framework that dynamically adapts model weights to each query’s style. Hystar employs a hypernetwork to generate singular-value perturbations ( \Delta S ) for attention layers, enabling flexible per-input adaptation, while static singular-value offsets on MLP layers ensure cross-style stability. To better handle semantic confusions across styles, we design StyleNCE as part of Hystar, an optimal-transport-weighted contrastive loss that emphasizes hard cross-style negatives. Extensive experiments on multi-style retrieval and cross-style classification benchmarks demonstrate that Hystar consistently outperforms strong baselines, achieving state-of-the-art performance while being parameter-efficient and stable across styles.
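下面给出一段极简示意,说明"超网络根据查询风格嵌入生成奇异值扰动ΔS并调制权重"的思路;网络结构、维度与调制位置均为本文编辑的假设,并非Hystar的原始实现。

```python
import torch
import torch.nn as nn

class SVDModulator(nn.Module):
    def __init__(self, weight: torch.Tensor, style_dim: int = 128):
        super().__init__()
        # 预先对(冻结的)权重做SVD分解,只动态调整奇异值
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("U", U)
        self.register_buffer("S", S)
        self.register_buffer("Vh", Vh)
        self.hyper = nn.Sequential(nn.Linear(style_dim, 256), nn.GELU(),
                                   nn.Linear(256, S.numel()))

    def forward(self, style_emb: torch.Tensor) -> torch.Tensor:
        delta_s = self.hyper(style_emb)                   # 针对每个查询风格生成ΔS
        return self.U @ torch.diag(self.S + delta_s) @ self.Vh

mod = SVDModulator(torch.randn(512, 512))
w_adapted = mod(torch.randn(128))                         # 单个查询的风格嵌入
print(w_adapted.shape)
```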

[CV-120] Med-StepBench: A Hierarchical Reasoning Framework for Evaluating Hallucinations in Medical Vision-Language Models ECAI2026 IJCAI

【速读】:该论文旨在解决大型视觉语言模型(VLMs)在医学影像理解中存在“幻觉”问题,即模型可能生成看似合理但临床错误的诊断陈述,尤其在三维(3D)肿瘤学PET/CT图像中,现有基准测试因仅关注单次诊断任务而难以识别定位与异常识别层面的推理错误。其解决方案的关键在于提出Med-StepBench——首个面向3D肿瘤学PET/CT的分步幻觉检测大规模基准,涵盖超过12,000张图像和百万级图像-语句对,并将临床推理过程细分为四个专家设计的诊断阶段,结合临床医生验证标注实现步骤级评估,从而揭示传统整体准确率指标掩盖的系统性推理失败模式,同时证明当前VLMs极易受到对抗性但临床上看似合理的中间解释干扰,显著放大幻觉现象。

链接: https://arxiv.org/abs/2605.10002
作者: Minh Khoi Nguyen,Dai Lam Le,Amir Reza Jafari,Tuan Dung Nguyen,Mai Hong Son,Mai Huy Thong,Quang Huy Nguyen,Thanh Trung Nguyen,Reza Farahbakhsh,Noel Crespi,Phi Le Nguyen
机构: AI4LIFE, Hanoi University of Science and Technology, Vietnam; SAMOVAR, Télécom SudParis, Institut Polytechnique de Paris, France; 108 Military Central Hospital, Vietnam; Hanoi Medical University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IJCAI-ECAI 2026

点击查看摘要

Abstract:Large vision-language models (VLMs) demonstrate strong performance in medical image understanding, but frequently generate clinically plausible yet incorrect statements, raising significant safety concerns. Existing medical hallucination benchmarks primarily focus on 2D imaging with one-shot diagnostic questions, offering limited insight into whether predictions are grounded in correct localization and abnormality identification, allowing critical reasoning errors to remain hidden behind seemingly correct diagnoses. We introduce Med-StepBench, the first large-scale benchmark for step-wise hallucination detection in 3D oncological PET/CT, comprising over 12,000 images and more than 1,000,000 image-statement pairs across volumetric and multi-view 2D data, which decomposes clinical reasoning into four expert-designed diagnostic stages. Using clinician-verified annotations, we perform the first step-level evaluation of general-purpose and medical VLMs, revealing systematic failure modes obscured by aggregate accuracy metrics. Furthermore, we show that current VLMs are highly susceptible to adversarial yet clinically plausible intermediate explanations, which significantly amplify hallucinations despite contradictory visual evidence. Together, our findings highlight fundamental limitations in grounding multi-step clinical reasoning and establish Med-StepBench as a rigorous benchmark for developing safer and more reliable medical VLMs.

[CV-121] Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在跨模态个性化(omnimodal personalization)研究中缺乏统一基准与严谨评估方法的问题,尤其是现有工作主要聚焦于视觉-语言任务,而对文本、图像和音频三模态联合建模的系统性评测不足,且未充分考虑无个性(absent-persona)场景下的接地行为(grounding behavior)。其解决方案的关键在于提出首个全面的多模态个性化基准 Omni-Persona,通过形式化为基于 Persona Modality Graph 的跨模态路由任务,涵盖4类任务组和18个细粒度子任务(约750项测试条目),并引入校准准确率(Calibrated Accuracy, Cal)指标,该指标同时奖励正确接地与合理回避回答,从而在统一框架内纳入无个性查询的诊断。此设计使模型在真实复杂场景下的表现可被更精确地量化与分析,揭示出如音频-视觉接地差距、召回率与校准分离等关键问题,为后续后训练策略(如SFT与RLVR)的设计提供了新的评估维度与改进方向。

链接: https://arxiv.org/abs/2605.09996
作者: Yeongtak Oh,Dongwook Lee,Sangkwon Park,Heeseung Kim,Sungroh Yoon
机构: Seoul National University (首尔国立大学); University of Seoul (首尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:While multimodal large language models have advanced across text, image, and audio, personalization research has remained primarily vision-language, with unified omnimodal benchmarking that jointly covers text, image, and audio still limited, and lacking the methodological rigor to account for absent-persona scenarios or systematic grounding studies. We introduce Omni-Persona, the first comprehensive benchmark for omnimodal personalization. We formalize the task as cross-modal routing over the Persona Modality Graph, encompassing 4 task groups and 18 fine-grained tasks across approximately 750 items. To rigorously diagnose grounding behavior, we propose Calibrated Accuracy (Cal), which jointly rewards correct grounding and appropriate abstention, incorporating absent-persona queries within a unified evaluation framework. On our dedicated experiments, three diagnostic findings emerge: (i) open-source models show a consistent audio-vs-visual grounding gap that RLVR partially narrows via dense rule-based supervision; (ii) answerable recall and parameter scale are incomplete diagnostics, since strong recall can coexist with absent-persona hallucination and larger models do not always achieve higher Cal, exposing calibration as a separate evaluation axis; and (iii) SFT is bounded by the difficulty of constructing annotated ground-truth supervision at scale, while RLVR generalizes more consistently through outcome-level verifiable feedback yet drifts toward conservative behavior and lower generation quality under our reward design. Omni-Persona thus serves as a diagnostic framework that surfaces the pitfalls of omnimodal personalization, guiding future post-training and reward design.

[CV-122] StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception

【速读】:该论文旨在解决单目视觉(monocular vision)在机器人模仿学习中因缺乏可靠深度信息和空间感知能力而导致的精确操作难题,尤其是在杂乱或几何结构复杂的场景中。其解决方案的关键在于提出StereoPolicy框架,该框架直接利用同步的立体图像对(stereo image pairs)增强几何推理能力,无需显式的3D重建或相机标定;通过预训练的2D视觉编码器独立处理每张图像,并借助Stereo Transformer融合特征以隐式捕捉空间对应关系与视差线索,从而有效桥接二维预训练表征与三维几何理解之间的鸿沟。

链接: https://arxiv.org/abs/2605.09989
作者: Evans Han,Yunfan Jiang,Yingke Wang,Haoyue Xiao,Huang Huang,Jianwen Xie,Jiajun Wu,Li Fei-Fei,Ruohan Zhang
机构: Stanford University (斯坦福大学); Northwestern University (西北大学); Lambda, Inc (Lambda公司)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in robot imitation learning have yielded powerful visuomotor policies capable of manipulating a wide variety of objects directly from monocular visual inputs. However, monocular observations inherently lack reliable depth cues and spatial awareness, which are critical for precise manipulation in cluttered or geometrically complex scenes. To address this limitation, we introduce StereoPolicy, a new visuomotor policy learning framework that directly leverages synchronized stereo image pairs to strengthen geometric reasoning, without requiring explicit 3D reconstruction or camera calibration. StereoPolicy employs pretrained 2D vision encoders to process each image independently and fuses the resulting representations through a Stereo Transformer. This design implicitly captures spatial correspondence and disparity cues. The framework integrates seamlessly with diffusion-based and pretrained vision-language-action (VLA) policies, delivering consistent improvements over RGB, RGB-D, point cloud, and multi-view baselines across three simulation benchmarks: RoboMimic, RoboCasa, and OmniGibson. We further validate StereoPolicy on real-robot experiments spanning both tabletop and bimanual mobile manipulation settings. Our results underscore stereo vision as a scalable and robust modality that bridges 2D pretrained representations with 3D geometric understanding for robotic manipulation.

[CV-123] Geometric 4D Stitching for Grounded 4D Generation

【速读】:该论文旨在解决当前4D生成方法在场景级缺失信息补全过程中存在的几何不一致性问题,以及基于辐射场(radiance-based)重建所需的高计算成本和难以保障几何一致性的缺陷。其核心解决方案是提出几何4D拼接(Geometric 4D Stitching)框架,该框架通过显式识别缺失的几何区域,并利用几何约束下的4D拼接来填补内容,从而在单张NVIDIA RTX 5090 GPU上实现每步场景扩展不到10分钟的高效重建,同时显著提升几何一致性。此外,该方法还支持交互式4D网格扩展与4D场景编辑,增强了可操作性与实用性。

链接: https://arxiv.org/abs/2605.09984
作者: Sunwoo Park,Taesung Kwon,Jong Chul Ye
机构: KAIST AI (KAIST人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent 4D generation methods complete scene-level missing information using generative models and reconstruct the scene into radiance-based representations. However, these pipelines often present geometric inconsistencies in the generated content, and the radiance-based reconstruction requires expensive optimization. Furthermore, radiance-based representations often absorb these geometric inconsistencies into their view-dependent nature, failing to enforce the grounded geometric consistency. To address these issues, we propose Geometric 4D Stitching, an efficient framework that explicitly identifies missing geometric regions and complements them with geometrically grounded 4D stitches. As a result, our method constructs 4D scene representations in under 10 minutes on a single NVIDIA RTX 5090 GPU per one-step scene expansion, while improving geometric consistency. Moreover, we demonstrate that our explicit 4D stitching supports interactive expansion of the 4D mesh as well as 4D scene editing.

[CV-124] ERASE: Eliminating Redundant Visual Tokens via Adaptive Two-Stage Token Pruning

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在处理高分辨率图像时因产生大量视觉标记(vision tokens)而导致的计算开销过大的问题。现有方法主要依赖模型内部学习到的语义特征来识别视觉冗余,但缺乏根据输入图像复杂度自适应调整剪枝策略的能力。其解决方案的关键在于提出一种两阶段的视觉标记剪枝框架ERASE,通过设计与图像复杂度自适应的剪枝策略,精准识别并保留关键视觉标记,从而在大幅减少视觉标记数量的同时保持模型性能。实验表明,在85%的剪枝率下,ERASE使Qwen2.5-VL-7B模型仍能维持89.46%的原始准确率,显著优于现有最优方法(仅78.1%)。

链接: https://arxiv.org/abs/2605.09982
作者: Yuna Lee,Kyoungho Min,Yulhwa Kim
机构: Sungkyunkwan University (成均馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 8 figures

点击查看摘要

Abstract:Recent advancements in Vision-Language Models (VLMs) enable large language models (LLMs) to process high-resolution images, significantly improving real-world multimodal understanding. However, this capability introduces a large number of vision tokens, resulting in substantial computational overhead. To mitigate this issue, various vision token pruning methods have been proposed. Nevertheless, existing approaches predominantly rely on learned semantic features within the model to capture visual redundancy. Moreover, they lack adaptive mechanisms to adjust pruning strategies according to the complexity of the input image. In this paper, we propose ERASE, a two-stage vision token pruning framework that identifies and retains salient tokens through pruning strategies adaptive to image complexity. Experiment results demonstrate that ERASE significantly reduces vision tokens while preserving accuracy. For Qwen2.5-VL-7B, at a token pruning ratio of 85%, ERASE retains 89.46% of the original model accuracy, whereas the best prior method retains only 78.1%. Our code is available at this https URL.

[CV-125] INFANiTE: Implicit Neural representation for high-resolution Fetal brain spatio-temporal Atlas learNing from clinical Thick-slicE MRI

【速读】:该论文旨在解决胎儿大脑时空图谱构建中因传统Slice-to-Volume Reconstruction (SVR) 和迭代非刚性配准步骤耗时过长而导致的大规模队列应用不切实际的问题。现有方法通常需要数天完成高分辨率3D脑体积重建与多次配准,严重限制了临床研究效率。其解决方案的关键在于提出INFANiTE框架——一种基于隐式神经表示(Implicit Neural Representation, INR)的新型方法,能够直接从临床厚层MRI扫描中学习高分辨率胎儿脑时空图谱,从而完全跳过昂贵的SVR和迭代非刚性配准过程,显著加速端到端处理时间(从数天缩短至数小时),同时在个体一致性、参考图像保真度、内在质量和生物合理性等方面优于现有基线方法,尤其在稀疏数据条件下表现稳健。

链接: https://arxiv.org/abs/2605.09977
作者: Xiaotian Hu,Mingxuan Liu,Hongjia Yang,Juncheng Zhu,Yijin Li,Yifei Chen,Haoxiang Li,Tongxi Song,Zihan Li,Yingqi Hao,Ziyu Li,Yujin Zhang,Gang Ning,Yi Liao,Haibo Qu,Qiyuan Tian
机构: 1. Tsinghua University (清华大学); 2. Peking Union Medical College Hospital (北京协和医院); 3. The First Affiliated Hospital of Zhejiang University School of Medicine (浙江大学医学院附属第一医院); 4. National Institute of Health (美国国立卫生研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spatio-temporal fetal brain atlases are important for characterizing normative neurodevelopment and identifying congenital anomalies. However, existing atlas construction pipelines necessitate days for slice-to-volume reconstruction (SVR) to generate high-resolution 3D brain volumes and several additional days for iterative volume registration, thereby rendering atlas construction from large-scale cohorts prohibitively impractical. We address these limitations with INFANiTE, an Implicit Neural Representation (INR) framework for high-resolution Fetal brain spatio-temporal Atlas learNing from clinical Thick-slicE MRI scans, bypassing both the costly SVR and the iterative non-rigid registration steps entirely, thereby substantially accelerating atlas construction. Extensive experiments demonstrate that INFANiTE outperforms existing baselines in subject consistency, reference fidelity, intrinsic quality and biological plausibility, even under challenging sparse-data settings. Additionally, INFANiTE reduces the end-to-end processing time (i.e., from raw scans to the final atlas) from days to hours compared to the traditional 3D volume-based pipeline (e.g., SyGN), facilitating large-scale population-level fetal brain analysis. Our code is publicly available at: this https URL

[CV-126] OZ-TAL: Online Zero-Shot Temporal Action Localization

【速读】:该论文旨在解决在线时序动作定位(Online Temporal Action Localization, On-TAL)中模型对未见动作类别泛化能力不足的问题,特别是针对在流式视频中检测从未见过的动作(即零样本场景)的挑战。其核心解决方案是提出了一种无需训练的框架,利用现成的视觉-语言模型(Vision-Language Models, VLMs)作为基础,并引入额外机制以增强视觉表征并缓解VLM固有的偏差问题,从而实现在线零样本时序动作定位(Online Zero-shot Temporal Action Localization, OZ-TAL)。实验表明,该方法在THUMOS14和ActivityNet-1.3数据集上显著优于现有最先进方法。

链接: https://arxiv.org/abs/2605.09976
作者: Chaolei Han,Hongsong Wang,Xin Gong,Jie Gui
机构: Southeast University (东南大学); Engineering Research Center of Blockchain Application, Supervision and Management (东南大学) (区块链应用监管与管理工程研究中心); Purple Mountain Laboratories (紫金山实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Online Temporal Action Localization (On-TAL) aims to detect the occurrence time and category of actions in untrimmed streaming videos immediately upon their completion. Recent advancements in this field focus on developing more sophisticated frameworks, shifting from Online Action Detection (OAD)-based aggregation paradigm to instance-level understanding. However, existing approaches are typically trained on specific domains and often exhibit limited generalization capabilities when applied to arbitrary videos, particularly in the presence of previously unseen actions. In this paper, we introduce a new task called Online Zero-shot Temporal Action Localization (OZ-TAL), which aims to detect previously unseen actions in an online fashion. Furthermore, we propose a training-free framework that leverages off-the-shelf Vision-Language Models (VLMs) while introducing additional mechanisms to enhance visual representations and mitigate their inherent biases. We establish new benchmarks and representative baselines for OZ-TAL on THUMOS14 and ActivityNet-1.3, and extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches under both offline and online zero-shot settings.

[CV-127] HiDrive: A Closed-Loop Benchmark for High-Level Autonomous Driving

【速读】:该论文旨在解决当前端到端自动驾驶评估基准(benchmark)趋于饱和的问题,即现有开放环路和闭环路基准已无法有效区分先进模型性能,因其场景多样性不足、物体种类有限,且未能充分评估高级决策能力(如法规合规性、伦理推理与应急响应)。解决方案的关键在于提出HiDrive——一个聚焦长尾场景(long-tail scenarios)并扩展评估维度的新型闭环基准,通过引入罕见但安全关键的对象和非典型交通情境,并将评估指标从单一碰撞规避拓展至碰撞与制动、交通规则遵守、道德推理等多维指标体系,同时依托更先进的物理引擎实现高保真视觉渲染与真实光照模拟,从而构建更具挑战性和现实意义的测试平台。

链接: https://arxiv.org/abs/2605.09972
作者: Zhongyu Xia,Guanyu Zhu,Guo Tang,Wenhao Chen,Yongtao Wang
机构: Peking University (北京大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:End-to-end autonomous driving has witnessed rapid progress, yet existing benchmarks are increasingly saturated, with state-of-the-art models achieving near-perfect scores on widely used open-loop and closed-loop benchmarks. This saturation does not mean that the problem has been solved; instead, it reveals that current benchmarks remain limited in scenario diversity, object variety, and the breadth of driving capabilities they evaluate. In particular, they lack sufficient long-tail scenarios involving rare but safety-critical objects and fail to assess advanced decision-making such as legal compliance, ethical reasoning, and emergency response. To address these gaps, we propose HiDrive, a new closed-loop benchmark for end-to-end autonomous driving that emphasizes long-tail scenarios and a richer evaluation of driving capabilities. HiDrive introduces a diverse set of rare objects and uncommon traffic situations, and expands evaluation from basic driving skills to more advanced capabilities, including rule compliance, moral reasoning, and context-dependent emergency maneuvers. Correspondingly, we extend previous collision-avoidance-centered metrics into a comprehensive evaluation system that encompasses collision and braking, traffic-rule compliance, and moral-reasoning indicators. Built on a more advanced physics engine, HiDrive provides physically realistic lighting and high-fidelity visual rendering, offering a more challenging and realistic testbed for assessing whether autonomous driving systems can handle the complexity of real-world deployment. The HiDrive software, source code, digital assets, and documentation are available at this https URL.

[CV-128] Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

【速读】:该论文旨在解决当前人工智能在面对多样化游戏环境时缺乏通用性的问题,即如何构建具备跨游戏领域适应能力的智能体(Agent),以推动迈向人工通用智能(Artificial General Intelligence, AGI)的目标。其核心解决方案在于提出一个涵盖数据集(Dataset)、模型(Model)、工具链(Harness)与基准测试(Benchmark)四大支柱的全生命周期框架,并识别出制约系统性能的五大根本权衡(trade-offs)。通过逐步突破这些权衡,论文进一步提出五级演进路线图,从单一游戏精通逐步迈向“创造者阶段”——在此阶段,智能体不仅能自主生成新的游戏世界,还能持续演化其中的行为策略,从而实现对游戏多宇宙(game multiverse)中任意挑战的无缝掌握。

链接: https://arxiv.org/abs/2605.09965
作者: Kuan Zhang,Dongchen Liu,Qiyue Zhao,Tianyu Xin,Yue Su,Haisheng Wang,Han Yin,Hongbo Ma,Peize Li,Tianjun Gu,Xiangnan Wu,Xinran Zhang,Yongxuan Li,Zirong Chen,Yiming Li
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 51 pages, 7 figures, github: this https URL

点击查看摘要

Abstract:The real world unfolds along a single set of physics laws, yet human intelligence demonstrates a remarkable capacity to generalize experiences from this singular physical existence into a multiverse of games, each governed by entirely different rules, aesthetics, physics, and objectives. This omni-reality adaptability is a hallmark of general intelligence. As Artificial Intelligence progresses towards Artificial General Intelligence, the multiverse of games has evolved from mere entertainment into the ultimate ground for training and evaluating AGI. The pursuit of this generality has unfolded across four eras: from environment-specific symbolic and reinforcement learning agents, to current large foundation models acting as generalist players, and toward a future creator stage where the agent both creates new game worlds and continually evolves within them. We trace the full lifecycle of a generalist game player along four interdependent pillars: Dataset, Model, Harness, and Benchmark. Every advance across these pillars can be read as an attempt to break one of five fundamental trade-offs that currently bound the whole system. Building on this end-to-end view, we chart a five-level roadmap, progressing from single-game mastery to the ultimate creator stage in which the agent simultaneously creates and evolves within a theoretical game multiverse. Taken together, our work offers a unified lens onto a rapidly shifting field, and a principled path toward the omnipotent generalist agent capable of seamlessly mastering any challenge within the multiverse of games, thereby paving the way for AGI.

[CV-129] Learning to Perceive “Where”: Spatial Pretext Tasks for Robust Self-Supervised Learning

【速读】:该论文旨在解决现有自监督学习(Self-Supervised Learning, SSL)方法主要关注对象不变性表示,而忽视物体内部部件间空间结构与关系的问题。其解决方案的关键在于提出一种名为“空间预测”(Spatial Prediction, SP)的预训练回归任务,通过预测同一图像中两个解耦局部视图之间的相对位置和尺度关系,显式建模部件间的连续几何空间关系,从而促使模型学习细粒度的空间依赖性和视觉场景的组合结构。SP作为解耦式插件模块可无缝集成至多种SSL框架,并在图像识别、细粒度分类、语义分割、深度估计等任务上实现一致性能提升,同时显著增强分布外(out-of-distribution)物体识别的鲁棒性。

链接: https://arxiv.org/abs/2605.09963
作者: Yang Shen,Yusen Cai,Weronika Hryniewska-Guzik,Qing Lin,Mengmi Zhang
机构: Nanyang Technological University (南洋理工大学); Warsaw University of Technology (华沙理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing self-supervised learning (SSL) methods primarily learn object-invariant representations but often neglect the spatial structure and relationships among object parts. To address this limitation, we introduce Spatial Prediction (SP), a spatially aware pretext regression task that predicts the relative position and scale between a pair of disentangled local views from the same image. By modeling part-to-part relationships in a continuous geometric space, SP encourages representations to capture fine-grained spatial dependencies beyond invariant categorical semantics, thereby learning the compositional structure of visual scenes. SP is implemented as a decoupled plug-in and can be seamlessly integrated into diverse SSL frameworks. Extensive experiments show consistent improvements across image recognition, fine-grained classification, semantic segmentation, and depth estimation, as well as substantial gains in out-of-distribution robustness for object recognition. To evaluate spatial reasoning, we introduce (1) a position and scale prediction task on image patch pairs and (2) a jigsaw understanding task requiring patch reordering and recognition after reconstruction. Strong performance on these tasks indicates improved spatial structure and geometric awareness. Overall, explicitly modeling spatial information provides an effective inductive bias for SSL, leading to more structured representations and better generalization. Code and models will be released.
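下面用极简代码示意SP预文本任务的回归目标构造:给定同一图像裁出的两个局部视图框,计算它们之间的相对位置与尺度;目标的参数化方式为本文编辑的假设,论文中的定义可能不同。

```python
import torch
import torch.nn.functional as F

def sp_targets(box_a, box_b):
    """box: [x1, y1, x2, y2](同一张图裁出的两个局部视图)。
    返回 [dx, dy, dlog_w, dlog_h]:中心偏移按视图A的宽高归一化,尺度取对数比。"""
    def cxcywh(b):
        return (b[0] + b[2]) / 2, (b[1] + b[3]) / 2, b[2] - b[0], b[3] - b[1]
    cxa, cya, wa, ha = cxcywh(box_a)
    cxb, cyb, wb, hb = cxcywh(box_b)
    return torch.stack([(cxb - cxa) / wa, (cyb - cya) / ha,
                        torch.log(wb / wa), torch.log(hb / ha)])

# 训练时:把两视图特征送入回归头预测该4维目标,这里用零向量占位演示损失计算
target = sp_targets(torch.tensor([0., 0., 96., 96.]), torch.tensor([64., 32., 192., 160.]))
pred = torch.zeros(4, requires_grad=True)
loss = F.mse_loss(pred, target)
loss.backward()
print(target)
```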

[CV-130] SDTalk: Structured Facial Priors and Dual-Branch Motion Fields for Generalizable Gaussian Talking Head Synthesis

【速读】:该论文旨在解决高质量、实时人脸合成(talking head synthesis)中身份特异性模型导致的跨身份泛化能力差的问题。现有基于重建和渲染的方法通常依赖于特定身份的模型,难以适应未见过的身份。其解决方案的关键在于提出了一种基于3D高斯溅射(3D Gaussian Splatting, 3DGS)的一次性(one-shot)框架SDTalk,通过两阶段训练策略实现无需个性化训练即可泛化到新身份。第一阶段引入结构化面部先验并分别预测可见与遮挡区域的3DGS参数,从而从单张图像完成完整头部重建;第二阶段设计双分支运动场以建模粗粒度与细粒度面部动态,显著提升细节保真度与唇部同步精度。

链接: https://arxiv.org/abs/2605.09956
作者: Peng Jia,Zhen Xiao,Jia Li,Xueliang Liu,Zhenzhen Hu,Lingyun Yu
机构: Hefei University of Technology (合肥工业大学); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 5 pages, 4 figures, 4 tables

点击查看摘要

Abstract:High-quality, real-time talking head synthesis remains a fundamental challenge in computer vision. Existing reconstruction- and rendering-based methods typically rely on identity-specific models, limiting cross-identity generalization. To address this issue, we propose SDTalk, a one-shot 3D Gaussian Splatting (3DGS)-based framework that generalizes to unseen identities without personalized training or fine-tuning. Our framework comprises two modules with a two-stage training strategy. In the first stage, we incorporate structured facial priors into the reconstruction module and separately predict 3DGS parameters for visible and occluded regions, enabling complete head reconstruction from a single image. In the second stage, we introduce a dual-branch motion field to model coarse and fine facial dynamics, improving detail fidelity and lip synchronization. Experiments demonstrate that SDTalk surpasses existing methods in both visual quality and inference efficiency.

[CV-131] JODA: Composable Joint Dynamics for Articulated Objects

【速读】:该论文旨在解决仿真与具身智能(embodied AI)中关节类物体缺乏精细动力学效应的问题,例如摩擦保持、卡扣锁定、软关闭和弹跳闭合等真实机械行为,而现有方法要么忽略动力学细节,要么依赖表达能力有限的简化模型。其解决方案的关键在于提出JODA框架,将关节动力学建模为关于关节自由度的三通道结构化场,分别捕捉保守力、干摩擦和阻尼,并采用形状约束的分段三次插值(PCHIP)实现紧凑且可解释的动力学函数空间,支持可微分仿真(differentiable simulation)。该表示法结合视觉语言模型生成的结构化动力学先验,实现了从多模态输入中推理与优化关节动力学的能力,从而统一了动力学建模、编辑与优化的接口。

链接: https://arxiv.org/abs/2605.09954
作者: Tianhong Gao,Cheng Yu,Yinghao Xu,Mengyu Chu
机构: Peking University (北京大学); Ant Group (蚂蚁集团)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Articulated objects used in simulation and embodied AI are typically specified by geometry and kinematic structure, but lack the fine-grained dynamical effects that govern realistic mechanical behavior, such as frictional holding, detents, soft closing, and snap latching. Existing approaches either ignore the detailed structure of dynamics entirely, or use simple models with limited expressiveness. We introduce JODA, a framework for generating joint-level dynamics as a structured three-channel field over the joint degree of freedom, capturing conservative forces, dry friction, and damping. Instantiated using shape-constrained piecewise cubic interpolation (PCHIP), this formulation defines a compact and expressive function space that is both interpretable and compatible with differentiable simulation. Building on this representation, we develop methods for inferring and refining joint dynamics from multimodal inputs. Given visual observations and joint context, a vision-language model proposes structured dynamical primitives, which are composed into a unified dynamics field. The resulting representation supports both direct manipulation and gradient-based refinement. We demonstrate that JODA enables plausible and controllable modeling of diverse joint behaviors, providing a unified interface for inference, editing, and optimization. Code and example assets with their generated profiles will be released upon publication.

[CV-132] LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models

【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人操作任务中因过度抽象导致计算资源浪费和低层几何线索弱化的问题。现有方法多依赖预设层数或启发式规则进行早期退出,无法动态判断何时的表示已足够用于动作预测。解决方案的关键在于提出LoopVLA架构,其核心创新是通过循环递归地应用共享Transformer块来迭代优化多模态token,并在每一步同时输出候选动作与 sufficiency score(充分性评分),从而实现对表示是否足够的自适应评估。该设计将表示精炼与绝对层索引解耦,并利用自监督分布对齐目标训练充分性估计器,使其与策略优化信号关联,最终在保持性能的同时显著提升推理效率(参数减少45%,吞吐量提升1.7倍)。

链接: https://arxiv.org/abs/2605.09948
作者: Boyang Shen,Kaixiang Yang,Hao Wang,Qiuyu Yu,Qiang Xie,Qiang Li,Zhiwei Wang
机构: Huazhong University of Science and Technology (华中科技大学); Wuhan United Imaging Surgical Co.,Ltd. (武汉联影医疗科技股份有限公司)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Current Vision-Language-Action (VLA) models typically treat the deepest representation of a vision-language backbone as universally optimal for action prediction. However, robotic manipulation is composed of many frequent closed-loop spatial adjustments, for which excessive abstraction may waste computation and weaken low-level geometric cues essential for precise control. Existing early-exit strategies attempt to reduce computation by stopping at predefined layers or applying heuristic rules such as action consistency, but they do not directly answer when a representation is actually sufficient for action. In this paper, we present LoopVLA, a recurrent VLA architecture that jointly learns representation refinement, action prediction, and sufficiency estimation. LoopVLA iteratively applies a shared Transformer block to refine multimodal tokens, and at each iteration produces both a candidate action and a sufficiency score that estimates whether further refinement is necessary. By sharing parameters across iterations, LoopVLA decouples refinement from absolute layer indices and grounds sufficiency estimation in the evolving representation itself. Since sufficiency has no direct supervision, we introduce a self-supervised distribution alignment objective, where intermediate confidence scores are trained to match the relative action quality across refinement steps, thereby linking sufficiency learning to policy optimization signals. Experiments on LIBERO, LIBERO-Plus, and VLA-Arena show that LoopVLA pushes the efficiency-performance frontier of VLA policies, reducing parameters by 45% and improving inference throughput by up to 1.7 times while matching or outperforming strong baselines in task success.
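
The recurrent refinement loop can be pictured as below: one shared block is applied repeatedly, and each pass emits a candidate action plus a sufficiency score that decides whether to keep refining. This is a toy sketch under assumed dimensions and a hand-picked threshold; the actual sufficiency head is trained with the distribution-alignment objective described in the abstract, not supervised directly.

```python
import torch
import torch.nn as nn

class LoopRefiner(nn.Module):
    """Shared block applied recurrently; each step emits an action and a sufficiency score."""
    def __init__(self, dim=256, action_dim=7, max_iters=6):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.action_head = nn.Linear(dim, action_dim)
        self.suff_head = nn.Linear(dim, 1)
        self.max_iters = max_iters

    def forward(self, tokens, threshold=0.8):
        action, sufficiency = None, None
        for _ in range(self.max_iters):
            tokens = self.block(tokens)              # reuse the same parameters at every iteration
            pooled = tokens.mean(dim=1)
            action = self.action_head(pooled)
            sufficiency = torch.sigmoid(self.suff_head(pooled))
            if bool((sufficiency > threshold).all()):  # stop refining once deemed sufficient
                break
        return action, sufficiency

tokens = torch.randn(2, 64, 256)                     # fused vision-language tokens (toy)
action, suff = LoopRefiner()(tokens)
```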

[CV-133] Evidence-based Decision Modeling for Synthetic Face Detection with Uncertainty-driven Active Learning

【速读】:该论文旨在解决现有生成式人脸检测方法在面对未知分布(Out-of-Distribution, OOD)图像时因依赖Softmax激活函数而导致的过度自信问题,以及对高质量标注数据依赖性强、泛化能力不足的局限性。解决方案的关键在于提出EMSFD(Evidence-based decision Modeling for Synthetic Face Detection with uncertainty-driven active learning),其核心创新包括:利用Dirichlet分布建模类别证据以显式引入模型不确定性,并在训练过程中基于估计的不确定性从无标签数据池中主动选择更具信息量的样本进行标注,从而降低标注成本并提升模型的检测可靠性与跨场景泛化能力。

链接: https://arxiv.org/abs/2605.09935
作者: Qingchao Jiang,Zhenxuan Hou,Zhiying Zhu,Zhenxing Qian,Xinpeng Zhang,Zaiwang Gu
机构: East China University of Science and Technology (华东理工大学); Fudan University (复旦大学); Institute of Advanced Intelligence and Computing (IAIC) (先进智能与计算研究所); State Key Laboratory of Blockchain and Data Security, Zhejiang University (区块链与数据安全国家重点实验室,浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: 11pages,6figures

点击查看摘要

Abstract:With the rapid development of deep generative models, forged facial images are massively exploited for illegal activities. Although existing synthetic face detection methods have achieved significant progress, they suffer from the inherent limitation of overconfidence due to their reliance on the Softmax activation function. Thus, these methods often lead to unreliable predictions when encountering unknown Out-of-Distribution (OOD) images, and cannot ascertain the model’s uncertainty in its prediction. Meanwhile, most existing methods require massive high-quality annotated data, which greatly limits their practicability across diverse scenarios. To address these limitations, we propose EMSFD (Evidence-based decision Modeling for Synthetic Face Detection with uncertainty-driven active learning), an approach designed to enhance detection reliability and generalizability. Specifically, EMSFD models class evidence using the Dirichlet distribution and explicitly incorporates model uncertainty into the prediction process. Furthermore, during training, the estimated uncertainty is exploited to prioritize more informative samples from the unlabeled pool for annotation, thereby reducing labeling cost and improving model generalization. Extensive experimental evaluations demonstrate that our method enhances the interpretability of synthetic face detection. Meanwhile, our method yields a 15% increase in accuracy compared to existing state-of-the-art (SOTA) baselines, which demonstrates the superior detection performance and generalizability of our approach. Our code is available at: this https URL.
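
A hedged sketch of the two ingredients named above: evidential (Dirichlet) uncertainty and uncertainty-driven sample selection. The softplus evidence mapping and the vacuity-style uncertainty K/sum(alpha) are standard evidential-deep-learning choices assumed here, not necessarily the paper's exact formulation.

```python
import torch

def dirichlet_uncertainty(evidence):
    """Evidential uncertainty for K-class outputs.

    evidence: non-negative tensor (N, K), e.g. softplus of network logits.
    alpha = evidence + 1 parameterizes a Dirichlet; u = K / sum(alpha) is the
    usual vacuity-style uncertainty in evidential deep learning.
    """
    alpha = evidence + 1.0
    strength = alpha.sum(dim=1)
    probs = alpha / strength.unsqueeze(1)            # expected class probabilities
    uncertainty = evidence.shape[1] / strength
    return probs, uncertainty

def select_for_annotation(evidence_unlabeled, budget):
    """Pick the most uncertain unlabeled samples to label next (active learning step)."""
    _, u = dirichlet_uncertainty(evidence_unlabeled)
    return torch.topk(u, k=budget).indices

logits = torch.randn(1000, 2)                        # real vs. synthetic face (toy unlabeled pool)
evidence = torch.nn.functional.softplus(logits)
query_idx = select_for_annotation(evidence, budget=50)
```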

[CV-134] Frequency Adapter with SAM for Generalized Medical Image Segmentation

【速读】:该论文旨在解决医学图像分割中因成像协议、扫描设备及患者群体差异导致的域偏移(domain shift)问题,从而提升深度学习模型在跨数据集场景下的泛化能力。传统域泛化(Domain Generalization, DG)方法依赖显式特征对齐、对抗一致性或手工增强策略,难以充分发挥基础模型潜力;而基于Segment Anything Model (SAM) 的现有方法主要局限于空间域操作,忽视了频域差异对模型鲁棒性的重要影响。本文提出的FSAM框架关键在于:引入低秩适配(Low-Rank Adaptation, LoRA)实现高效微调,并设计频率适配器(frequency adapter)融合频域表征,以提取域不变的高频特征,缓解由频率相关域偏移引发的性能下降问题。实验表明,FSAM在视网膜和前列腺图像数据集上均优于主流传统DG与SAM基线方法。

链接: https://arxiv.org/abs/2605.09925
作者: Phuoc-Nguyen Bui,Van-Nguyen Pham,Duc-Tai Le,Junghyun Bum,Hyunseung Choo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review, 10 pages, 1 figure, 2 tables

点击查看摘要

Abstract:Medical image segmentation is a critical task in computer-aided diagnosis and treatment planning. However, deep learning models often struggle to generalize across datasets due to domain shifts arising from variations in imaging protocols, scanner types, and patient populations. Traditional domain generalization (DG) methods utilize causal feature learning, adversarial consistency, and style augmentation to improve segmentation robustness. While effective, these approaches rely on explicit feature alignment, adversarial objectives, or handcrafted augmentations, which may not fully exploit the capabilities of foundation models. Recently, the Segment Anything Model (SAM) has demonstrated strong generalization capabilities in segmentation tasks. SAM-based DG methods attempt to improve medical image segmentation. However, these approaches primarily operate in the spatial domain and overlook frequency-based discrepancies that significantly affect model robustness. In this work, we propose Frequency-based Domain Generalization with SAM (FSAM), a novel framework that integrates Low-Rank Adaptation (LoRA) for efficient fine-tuning and a frequency adapter to incorporate frequency-domain representations for single-source domain generalization. FSAM enhances SAM’s segmentation robustness by extracting domain-invariant high-frequency features, mitigating frequency-related domain shifts. Experimental results on fundus and prostate datasets demonstrate that FSAM outperforms existing traditional DG and SAM-based DG approaches in domain generalization. Codes and pre-trained models will be made available on GitHub.
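
One plausible form of a frequency adapter like the one described above: take the 2-D FFT of a feature map, suppress a centered low-frequency band, and inject the remaining high-frequency component back residually. The cutoff ratio and the 1x1 projection are illustrative assumptions, not FSAM's exact design.

```python
import torch
import torch.nn as nn
import torch.fft

class FrequencyAdapter(nn.Module):
    """Adapter that emphasizes high-frequency content of a feature map via an FFT mask."""
    def __init__(self, channels, cutoff_ratio=0.25):
        super().__init__()
        self.cutoff_ratio = cutoff_ratio
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                            # x: (B, C, H, W)
        B, C, H, W = x.shape
        freq = torch.fft.fftshift(torch.fft.fft2(x, norm="ortho"), dim=(-2, -1))
        # zero out a centered low-frequency square, keeping only high frequencies
        mask = torch.ones(H, W, device=x.device)
        ch, cw = int(H * self.cutoff_ratio), int(W * self.cutoff_ratio)
        mask[H // 2 - ch:H // 2 + ch, W // 2 - cw:W // 2 + cw] = 0.0
        high = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1)), norm="ortho").real
        return x + self.proj(high)                   # residual injection of high-frequency cues

feat = torch.randn(2, 64, 32, 32)
out = FrequencyAdapter(64)(feat)
```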

[CV-135] TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models

【速读】:该论文旨在解决当前视频大语言模型(Video-LLMs)在保持对象时序一致性方面的评估不足问题,即现有基准测试主要关注事件识别、动作理解或粗粒度时序推理,而忽视了模型对同一对象在遮挡、消失、重新出现、状态变化及跨对象交互等复杂场景下身份、状态和时序连续性的持续追踪能力。这导致现有评价可能高估模型的时序推理性能,而掩盖其在对象中心时序一致性上的缺陷。解决方案的关键在于提出TOC-Bench——一个以对象轨迹为根基的诊断性基准,通过三层时序必要性过滤协议(temporal-necessity filtering protocol),剔除60.7%依赖单帧捷径或语言先验的问答对,保留17,900个强时序依赖项,并进一步构建包含2,323个高质量人工验证问答对的基准数据集,从而精准衡量模型在10个诊断维度上的时序对象一致性表现。

链接: https://arxiv.org/abs/2605.09904
作者: Junzhe Chen,Siyuan Meng,Yuxi Chen,Man Zhao,Xiaojie Guo
机构: Tianjin University (天津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video large language models (Video-LLMs) have achieved remarkable progress in general video understanding, yet their ability to maintain temporal object consistency remains insufficiently explored. Existing benchmarks primarily focus on event recognition, action understanding, or coarse temporal reasoning, but rarely evaluate whether a model can consistently preserve the identity, state, and temporal continuity of the same object across occlusion, disappearance, reappearance, state transitions, and cross-object interactions. As a result, current evaluations may overestimate temporal reasoning ability while overlooking failures in object-centric temporal coherence. To address this issue, we introduce TOC-Bench, a diagnostic benchmark specifically designed to evaluate temporal object consistency in Video-LLMs. TOC-Bench is explicitly object-track grounded, where each queried subject is associated with a per frame object trajectory and structured temporal event timeline. To ensure that benchmark items depend on temporally ordered visual evidence rather than language priors, single-frame shortcuts, or unordered frame cues, we propose a three-layer temporal-necessity filtering protocol that removes 60.7% of candidate QA pairs and retains 17,900 temporally dependent items spanning 10 diagnostic dimensions. From this filtered pool, we further construct a human-verified benchmark containing 2,323 high-quality QA pairs over 1,951 videos. Experiments on representative Video-LLMs show that temporal object consistency remains a major unsolved challenge. Current models exhibit substantial weaknesses in event counting, event ordering, identity-sensitive reasoning, and hallucination-aware verification, despite strong performance on general video understanding benchmarks.

[CV-136] Adversarial Attacks Against MLLMs via Progressive Resolution Processing and Adaptive Feature Alignment

【速读】:该论文旨在解决针对多模态大语言模型(Multimodal Large Language Models, MLLMs)的基于迁移的目标攻击(transfer-based targeted attack)可迁移性与鲁棒性不足的问题。现有方法通常依赖于替代模型编码器的最终全局特征,并锚定优化至原始分辨率的目标图像裁剪,导致攻击效果受限。其解决方案的关键在于提出一种名为PRAF-Attack的新型攻击框架,该框架通过两个核心机制提升攻击性能:一是自适应特征对齐策略,利用梯度一致性动态选择跨替代模型集合的可迁移层级特征,并结合patch级优化保留高相关局部区域;二是渐进式分辨率处理策略,从粗到精逐步优化,从而在多尺度上更好地挖掘目标信息并增强攻击的迁移能力。

链接: https://arxiv.org/abs/2605.09902
作者: Haobo Wang,Xiaorong Ma,Weiqi Luo,Xiaojun Jia,Jiwu Huang
机构: Sun Yat-sen University (中山大学); Nanyang Technological University (南洋理工大学); Shenzhen MSU-BIT University (深圳北理莫斯科大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Adversarial perturbations can mislead Multimodal Large Language Models (MLLMs) recognize a benign image as a specific target object, posing serious risks in safety-critical scenarios such as autonomous driving and medical diagnosis. This makes transfer-based targeted attacks crucial for understanding and improving black-box MLLM robustness. Existing transfer-based targeted attack methods typically rely on the final global features of the surrogate encoder and anchor optimization to original-resolution target crops, leading to their limited transferability and robustness. To address these challenges, we propose Progressive Resolution Processing and Adaptive Feature Alignment (PRAF-Attack), a targeted transfer-based attack framework that integrates multi-scale global semantic guidance with robust intermediate-layer local alignment. Unlike prior methods that align only the surrogate encoder’s final layer, we design an adaptive feature alignment strategy that leverages intermediate representations to enhance transferability. Specifically, we introduce an adaptive intermediate layer selection mechanism to identify transferable hierarchical features across surrogate ensembles via gradient consistency, along with an adaptive patch-level optimization strategy that preserves highly correlated local regions through efficient patch filtering. To overcome the reliance on fixed original-resolution target crops, we propose a progressive resolution processing strategy that gradually refines optimization from coarse to fine, enabling the attack to better exploit target information at multiple scales and achieve stronger transferability. We evaluate PRAF-Attack on a diverse suite of black-box MLLMs, including six open-source models and six closed-source commercial APIs. Compared with seven state-of-the-art targeted attack baselines, the proposed PRAF-Attack consistently achieves superior transferability.

[CV-137] Hyperbolic Distillation: Geometry-Guided Cross-Modal Transfer for Robust 3D Object Detection

【速读】:该论文旨在解决多模态3D目标检测中因模态异质性(modality heterogeneity)、空间错位(spatial misalignment)以及多模态表示危机(representation crisis)导致的跨模态知识蒸馏效率低下问题。解决方案的关键在于提出一种双分支架构下的双曲几何约束跨模态蒸馏方法(Hyperbolic Constrained Cross-modal Distillation, HGC-Det),其核心创新包括:1)基于图像语义引导的体素优化组件(SGVO),通过引入图像语义线索自适应优化点云分支的空间表示以增强融合效果;2)双曲几何约束的跨模态特征迁移组件(HFT),利用双曲空间的内在几何特性缓解高维图像特征与低维点云特征融合过程中的语义损失;3)基于特征聚合的几何优化组件(FAGO),补偿因SGVO引入的空间特征退化问题,从而实现更鲁棒和高效的多模态融合。

链接: https://arxiv.org/abs/2605.09899
作者: Kanglin Ning,Wenrui Li,Houde Quan,Qifan Li,Xingtao Wang,Xiaopeng Fan
机构: Harbin Institute of Technology (哈尔滨工业大学); PengChengLab (鹏城实验室); Suzhou Research Institute of HIT (哈尔滨工业大学苏州研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Current version has been submitted to IEEE Transactions on Multimedia; the manuscript is currently under review

点击查看摘要

Abstract:Cross-modal knowledge distillation has emerged as an effective strategy for integrating point cloud and image features in 3D perception tasks. However, the modality heterogeneity, spatial misalignment, and the representation crisis of multiple modalities often limit the efficiency of these cross-modal distillation methods. To address these limitations in existing approaches, we propose a hyperbolic constrained cross-modal distillation method for multimodal 3D object detection (HGC-Det). The proposed HGC-Det framework includes an image branch and a point cloud branch to extract semantic features from two different modalities. The point cloud branch comprises three core components: a 2D semantic-guided voxel optimization component (SGVO), a hyperbolic geometry constrained cross-modal feature transfer component (HFT), and a feature aggregation-based geometry optimization component (FAGO). Specifically, the SGVO component adaptively refines the spatial representation of the 3D branch by leveraging semantic cues from the image branch, thereby mitigating the issue of inadequate representation fusion. The HFT component exploits the intrinsic geometric properties of hyperbolic space to alleviate semantic loss during the fusion of high-dimensional image features and low-dimensional point cloud features. Finally, the FAGO compensates for potential spatial feature degradation introduced by the 2D semantic-guided voxel optimization component. Extensive experiments on indoor datasets (SUN RGB-D, ARKitScenes) and outdoor datasets (KITTI, nuScenes) demonstrate that our method achieves a better trade-off between detection accuracy and computational cost.

[CV-138] The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉推理基准测试中表现优异但可能缺乏真正鲁棒视觉理解能力的问题。研究发现,现有基准普遍采用正交网格布局(orthogonal grid-based layouts),使得模型可轻易将图像空间映射为显式的文本坐标进行推理,从而依赖文本推理而非真正的视觉感知,形成“笛卡尔捷径”(Cartesian Shortcut)。为系统性消除这一偏差,作者提出Polaris-Bench——一个在极坐标空间重构53个视觉推理任务的评测基准,同时保留与笛卡尔坐标系下的逻辑一致性与语义等价性,从而打破模型对正交先验的依赖。关键创新在于通过几何变换实现拓扑不变性(topology-invariant)的测试场景,揭示出当前前沿MLLMs在脱离笛卡尔结构后性能显著下降,暴露其缺乏本质性的视觉空间推理能力。

链接: https://arxiv.org/abs/2605.09883
作者: Xia Hu,Zhenrui Yue,Brian Potetz,Howard Zhou,Leonidas Guibas,Chun-Ta Lu,Zhicheng Wang
机构: Stanford University (斯坦福大学); Google Research (谷歌研究)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As current Multimodal Large Language Models rapidly saturate canonical visual reasoning benchmarks, a key question emerges: do these strong scores genuinely reflect robust visual understanding? We identify a pervasive vulnerability, the Cartesian Shortcut: visual reasoning benchmarks prevalently build on orthogonal grid-based layouts that can be readily discretized into explicit textual coordinates. Models systematically exploit this property, heavily leveraging text-based deductive reasoning to assist visual problem-solving. To systematically dismantle this shortcut, we introduce Polaris-Bench, which re-formulates 53 visual reasoning tasks in Polar coordinate space with paired Cartesian counterparts as reference, while preserving consistent logical constraints and task semantics – thus fundamentally breaking the orthogonal prior that models exploit. Comprehensive evaluation across 14 state-of-the-art MLLMs reveals that frontier models achieving 70–83% on Cartesian layouts collapse to 31–39% on Polar equivalents, with degradation persisting even under complete logical equivalence. Moreover, reasoning gains observed on Cartesian layouts are severely diminished on Polar equivalents. These findings expose a critical deficiency in current MLLMs: the lack of topology-invariant visual reasoning.

[CV-139] ConsistNav: Closing the Action Consistency Gap in Zero-Shot Object Navigation with Semantic Executive Control

【速读】:该论文旨在解决零样本目标导航(zero-shot object navigation, ObjectNav)中因语义证据在每一步被重复解释而缺乏跨帧一致性导致的行动不一致问题,即“动作一致性缺口”(action consistency gap),表现为智能体在探索与追逐之间振荡或在接近目标时放弃。解决方案的关键在于提出 ConsistNav 框架,其核心是一个由三个协同模块组成的语义执行器:有限状态执行控制器(Finite-State Executive Controller)通过受保护的语义阶段控制目标追逐流程;持久候选记忆(Persistent Candidate Memory)将跨帧的目标证据聚合为稳定的对象假设;稳定性感知动作控制(Stability-Aware Action Control)抑制旋转停滞、无效追逐和未经验证的停止行为。该设计不修改检测器或底层规划器,而是动态决定何时利用语义信息进行导航决策,何时抑制或重新审视,从而显著提升导航成功率(SR)和路径长度归一化成功率(SPL)。

链接: https://arxiv.org/abs/2605.09869
作者: Haosen Wang,Zhenyang Li,Yinqiang Zhang,Zongqi He,Lutao Jiang,Kai Li,Yizhou Zhao,Liaoyuan Fan,Wenjian Hou,Tingbang Liang,Yibin Wen,Defeng Gu
机构: Sun Yat-sen University (中山大学); The University of Hong Kong (香港大学); Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州) ); City University of Hong Kong (香港城市大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 5 figures

点击查看摘要

Abstract:Zero-shot object navigation has advanced rapidly with open-vocabulary detectors, image–text models, and language-guided exploration. However, even after current methods detect a plausible target hypothesis, the agent may still oscillate between exploration and pursuit, or abandon the object near success. We identify this failure mode as an action consistency gap: semantic evidence is repeatedly reinterpreted at each step without persistent commitment across the episode. We introduce ConsistNav, a training-free zero-shot ObjectNav framework built around a semantic executive composed of three coordinated modules: Finite-State Executive Controller stages target pursuit through guarded semantic phases; Persistent Candidate Memory accumulates cross-frame target evidence into stable object hypotheses; and Stability-Aware Action Control suppresses rotational stagnation, ineffective pursuit, and unverified stopping. This design changes neither the detector nor the low-level planner; instead, it controls when semantic evidence should influence navigation and when it should be suppressed or revisited. We conduct extensive experiments on HM3D and MP3D, where ConsistNav achieves state-of-the-art results among compared zero-shot ObjectNav methods and improves SR by 11.4% and SPL by 7.9% over the controlled baseline on MP3D. Ablation studies and real-world deployment experiments further demonstrate the effectiveness and robustness of the proposed executive mechanism.

[CV-140] DA-SegFormer: Damage-Aware Semantic Segmentation for Fine-Grained Disaster Assessment

【速读】:该论文旨在解决自然灾害后无人机(UAV)影像中细粒度损伤等级识别难题,特别是因图像缩放导致的纹理信息退化及极端类别不平衡问题。其解决方案的关键在于提出DA-SegFormer模型,通过引入类感知采样(Class-Aware Sampling)策略确保稀有损伤特征被充分学习,并结合在线难例挖掘(OHEM)与Dice Loss动态聚焦于低频损伤类别;同时采用保持分辨率的推理协议以保留原始纹理细节,从而显著提升对关键损伤类别的分割性能。

链接: https://arxiv.org/abs/2605.09864
作者: Kevin Zhu,William Tang,Raphael Hay Tene,Zesheng Liu,Nhut Le,Maryam Rahnemoonfar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted for 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)

点击查看摘要

Abstract:Rapid and accurate damage assessment following natural disasters is critical for effective emergency response. However, identifying fine-grained damage levels (e.g., distinguishing minor from major roof damage) in UAV imagery remains challenging due to the degradation of texture cues during resizing and extreme class imbalance. We propose DA-SegFormer, a damage-aware adaptation of the SegFormer architecture optimized for high-resolution disaster imagery. Our method introduces a Class-Aware Sampling strategy to guarantee exposure to rare damage features, and it integrates Online Hard Example Mining (OHEM) with Dice Loss to dynamically focus on underrepresented classes. In addition, we employ a resolution-preserving inference protocol that maintains native texture details. Evaluated on the RescueNet dataset, DA-SegFormer achieves 74.61% mIoU, outperforming the baseline by 2.55%. Notably, our improvements yield double-digit gains in critical damage classes: Minor Damage (+11.7%) and Major Damage (+21.3%).
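
The loss combination mentioned above (OHEM cross-entropy plus Dice) can be sketched as follows; the keep ratio, the class count, and the way ignored pixels are handled are assumptions for illustration rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def ohem_cross_entropy(logits, target, keep_ratio=0.3, ignore_index=255):
    """Online hard example mining: keep only the hardest pixels for the CE loss."""
    ce = F.cross_entropy(logits, target, ignore_index=ignore_index, reduction="none").flatten()
    n_keep = max(1, int(keep_ratio * ce.numel()))
    hard, _ = torch.topk(ce, k=n_keep)
    return hard.mean()

def dice_loss(logits, target, num_classes, eps=1e-6):
    """Soft Dice loss averaged over classes; favors underrepresented damage classes."""
    probs = F.softmax(logits, dim=1)
    onehot = F.one_hot(target.clamp(0, num_classes - 1), num_classes).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    denom = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    return 1.0 - ((2 * inter + eps) / (denom + eps)).mean()

logits = torch.randn(2, 11, 128, 128)                # e.g. 11 semantic classes (toy shapes)
target = torch.randint(0, 11, (2, 128, 128))
loss = ohem_cross_entropy(logits, target) + dice_loss(logits, target, num_classes=11)
```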

[CV-141] Learning to Align Generative Appearance Priors for Fine-grained Image Retrieval

【速读】:该论文旨在解决细粒度图像检索(Fine-grained Image Retrieval, FGIR)中因依赖已见类别监督信号而导致模型偏向于已见类语义、难以泛化到未见类别的问题。解决方案的关键在于提出一种生成式外观先验对齐网络(Generative Appearance Prior alignment network, GAPan),其核心创新是将学习目标从类别预测重构为外观建模:通过基于归一化流(normalizing flows)的可逆密度模型,将实例特征映射至潜密度空间,并以类别条件高斯先验精确建模各已见类别的外观分布;同时利用流的可逆性,从高密度区域采样生成反映类内变化的外观感知锚点(appearance-aware anchors),并以此监督一个先验驱动的对齐目标,从而提升检索嵌入与类别特定外观分布的一致性,增强对未见类别的泛化能力。

链接: https://arxiv.org/abs/2605.09859
作者: Shijie Wang,Yadan Luo,Zijian Wang,Xin Yu,Zi Huang
机构: The University of Queensland (昆士兰大学); The University of Adelaide (阿德莱德大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Fine-grained image retrieval (FGIR) typically relies on supervision from seen categories to learn discriminative embeddings for retrieving unseen categories. However, such supervision often biases retrieval models toward the semantics of seen categories rather than the underlying appearance characteristics that generalize across categories, thereby limiting retrieval performance on unseen categories. To tackle this, we propose GAPan, a Generative Appearance Prior alignment network that reformulates the learning objective from category prediction toward appearance modeling. Technically, GAPan treats retrieval features with an invertible density model based on normalizing flows. In the forward direction, the flow maps all instance features into a latent density space, where each seen category is modeled by a class-conditional Gaussian prior and optimized via exact likelihood estimation. This formulation preserves richer appearance details by leveraging the invertible property of the flows. In the reverse direction, samples from the high-density regions of these learned priors are mapped back to the feature space to produce appearance-aware anchors that reflect intra-category variation. These anchors supervise a prior-driven alignment objective that aligns retrieval embeddings with category-specific appearance distributions, thereby improving generalization to unseen categories. Evaluations demonstrate that our GAPan achieves state-of-the-art performance on both widely-used fine- and coarse-grained benchmarks.

[CV-142] Clip-level Uncertainty and Temporal-aware Active Learning for End-to-End Multi-Object Tracking ICIP

【速读】:该论文旨在解决多目标跟踪(Multi-Object Tracking, MOT)中因标注成本高且视频数据冗余导致的训练效率低的问题。现有主动学习(Active Learning, AL)方法主要基于帧级别选择样本,与当前依赖多帧片段进行推理和训练的端到端跟踪模型在结构上不匹配。为弥合这一差距,作者提出了一种新的剪辑级主动学习方法——Clip-level Uncertainty and Temporal-aware Active Learning (CUTAL),其关键在于:首先利用多帧预测结果计算不确定性度量以捕捉帧间对应关系的模糊性,进而通过引入时间多样性约束,从候选剪辑中选取信息丰富且非冗余的子集用于标注,从而显著提升标注效率并保持跟踪性能。

链接: https://arxiv.org/abs/2605.09858
作者: Riku Inoue,Shogo Sato,Kazuhiko Murasaki,Tomoyasu Shimada,Toshihiko Nishimura,Ryuichi Tanida
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to 2026 IEEE International Conference on Image Processing (ICIP). Copyright 2026 IEEE. Published in 2026 IEEE International Conference on Image Processing (ICIP), scheduled for 13-17 September 2026 in Tampere, Finland

点击查看摘要

Abstract:Multi-Object Tracking (MOT) in dynamic environments relies on robust temporal reasoning to maintain consistent object identities over time. Transformer-based end-to-end MOT models achieve strong performance by explicitly modeling temporal dependencies, yet training them requires extensive bounding-box and identity annotations. Given the high labeling cost and strong redundancy in videos, Active Learning (AL) is an effective approach to improve annotation efficiency. However, existing AL methods for MOT primarily operate at the frame level, which is structurally misaligned with modern end-to-end trackers whose inference and training rely on multi-frame clips. To bridge this gap, we formulate clip-level active learning and propose Clip-level Uncertainty and Temporal-aware Active Learning (CUTAL). In contrast to frame-based approaches, CUTAL scores each clip using uncertainty metrics derived from multi-frame predictions to capture inter-frame correspondence ambiguities, while enforcing temporal diversity to select an informative and non-redundant subset. Experiments show that CUTAL achieves stronger overall performance than baselines at the same label budgets across MeMOTR and SambaMOTR. Notably, CUTAL achieves performance comparable to full supervision for MeMOTR on both datasets using only 50% of the labeled training data.
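
A simplified view of clip-level selection with a temporal-diversity constraint: score each candidate clip, then greedily pick high-uncertainty clips that are not too close in time to clips already chosen. The scores, gap threshold, and greedy rule below are placeholders, not CUTAL's exact criteria.

```python
import numpy as np

def select_clips(uncertainty, clip_times, budget, min_gap):
    """Greedy clip-level selection: highest uncertainty first, subject to temporal diversity.

    uncertainty: per-clip score derived from multi-frame predictions (higher = more informative).
    clip_times:  center timestamp (or frame index) of each candidate clip.
    min_gap:     minimum temporal distance between two selected clips.
    """
    order = np.argsort(-np.asarray(uncertainty))
    selected = []
    for idx in order:
        if len(selected) >= budget:
            break
        if all(abs(clip_times[idx] - clip_times[j]) >= min_gap for j in selected):
            selected.append(int(idx))
    return selected

scores = np.random.rand(40)                          # toy uncertainty scores for 40 candidate clips
times = np.arange(40) * 30                           # clips spaced 30 frames apart
print(select_clips(scores, times, budget=8, min_gap=90))
```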

[CV-143] MoPO: Incorporating Motion Prior for Occluded Human Mesh Recovery

【速读】:该论文旨在解决人体网格恢复(human mesh recovery)中因遮挡导致的姿势估计不准确和运动抖动问题,其核心挑战在于遮挡区域缺乏足够的空间特征以支撑可靠的人体姿态重建。解决方案的关键在于引入运动先验(motion prior),提出名为MoPO的框架,其核心创新包括:1)设计了一个时空遮挡检测模块来识别关节可见性,并通过轻量级运动预测器基于历史姿态序列推测最可能的关节位置以完成遮挡部位;2)构建一个运动感知融合与精修模块,将补全的关节序列与图像特征融合以估计人体形状和初始姿态,并进一步利用逆运动学(inverse kinematics)对最终姿态进行优化,从而提供无遮挡的运动先验用于回归更稳定、准确的人体姿态。

链接: https://arxiv.org/abs/2605.09856
作者: Tao Tang,Hong Liu,Xinshun Wang,Wanruo Zhang
机构: Peking University (北京大学); State Key Laboratory of General Artificial Intelligence (通用人工智能国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 35 pages

点击查看摘要

Abstract:Although recent studies have made remarkable progress in human mesh recovery, they still exhibit limited robustness to occlusions and often produce inaccurate poses and severe motion jitter due to the insufficient spatial features for occluded body parts. Inspired by the rapid advancements in human motion prediction, we discover that compared to occluded image features, pose sequence inherently contains reliable motion prior for estimating occluded body parts. In this paper, we incorporate Motion Prior for Occluded human mesh recovery, called MoPO. Our MoPO mainly consists of two components: 1) The motion de-occlusion module, where we propose a spatial-temporal occlusion detector to detect joint visibility, and then we propose a lightweight motion predictor to complete the occluded body parts by predicting the most plausible joint positions based on history poses. 2) The motion-aware fusion and refinement module, which fuses the completed joint sequence with image features to estimate human shape and initial human pose. Moreover, the completed joint sequence is further used to refine the final human pose through inverse kinematics, which provides the occlusion-free motion prior for regressing human poses. Extensive experiments demonstrate that MoPO achieves state-of-the-art performance on both occlusion-specific and standard benchmarks, significantly enhancing the accuracy and temporal consistency of occluded human mesh recovery. Our code and demo can be found in the supplementary material.

[CV-144] Probing Routing-Conditional Calibration in Attention-Residual Transformers

【速读】:该论文旨在解决当前后验校准(post-hoc calibration)评估中忽视模型内部路由痕迹(routing traces)是否提供稳定校准信息的问题,尤其是在引入路由增强架构(routing-augmented architectures)后,这些内部状态常被声称与不确定性相关。研究发现,在Attention-Residual Transformer(AR)结构中,仅依赖标量路由统计量(如路由深度方差)无法提供稳定的路由条件校准证据:多数情况下校准差距微小且对随机种子敏感,且在严格的置换检验中仅有极少数情况显著拒绝零假设(conditional-null)。关键解决方案在于设计了一套匹配置信度的诊断流程,包括按路由状态分层、对比子组差异与路由置换空模型、以及引入容量匹配和扰动控制(如shuffle routing profiles),从而揭示出看似由路由信息带来的校准提升实为常见混杂因素(如模型容量或随机性)所致,强调了在宣称路由感知校准有效前必须排除这些混淆变量。

链接: https://arxiv.org/abs/2605.09850
作者: Wenhao Liang,Lin Yue,Wei Emma Zhang,Miao Xu,Mingyu Guo,Olaf Maennel,Weitong Chen
机构: Adelaide University; Australian Institute for Machine Learning (AIML), Adelaide University; The University of Queensland
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Under reviewing

点击查看摘要

Abstract:Post-hoc calibration is usually evaluated as a function of logits or softmax confidence alone, even as routing-augmented architectures increasingly accompany predictions with sample-specific internal routing traces and pair them with claims of calibration-relevant uncertainty. We ask a basic question: do these traces provide stable routing-specific evidence for post-hoc calibration beyond confidence? We study this in Attention-Residual transformers (Kimi Team, 2026) through a matched-confidence diagnostic suite that stratifies examples by routing-derived state, compares subgroup gaps against within-bin routing-permutation nulls, and evaluates matched post-hoc probes differing only in their auxiliary feature. Across our completed AR runs, scalar routing summaries do not provide stable evidence of routing-conditional miscalibration: weighted gaps remain small or seed-sensitive, and only 1 of 30 within-bin permutation tests rejects the conditional-null at α = 0.05 (only on one seed; not stable across seeds in that cell). AR-CondCal, a minimal 2-D Nadaraya–Watson probe on confidence and routing-depth variance, lies within the seed-variance band of matched confidence-only and predictive-entropy controls and does not reliably improve worst-routing-tertile ECE; bandwidth-sensitivity checks (Scott multiples, CV-NLL, global-ECE oracle) do not change this. A full-vector MLP over (c, H_1, …, H_L) can appear to improve over a linear confidence baseline, but the apparent gain disappears once a capacity-matched confidence-only MLP is included as a control, and shuffled routing profiles achieve comparable performance. Apparent routing-aware calibration gains in this AR setting should not be read as internal-state calibration until matched-confidence, bandwidth, capacity, and permutation controls rule out common confounds.
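
To make the "within-bin routing-permutation null" concrete, the sketch below tests whether a routing-derived binary state carries a calibration gap beyond confidence, by shuffling the routing label inside each confidence bin. The ECE here is an unweighted average over bins and the binary grouping is an assumption; it is a simplified stand-in for the paper's protocol, not a reimplementation of it.

```python
import numpy as np

def permutation_pvalue(confidence, correct, group, bins=10, n_perm=1000, seed=0):
    """Within-confidence-bin permutation test for a routing-conditional calibration gap.

    group: binary routing-derived state per example (e.g. high vs. low routing-depth variance).
    Statistic: |ECE(group=1) - ECE(group=0)|; the null shuffles group labels within each bin.
    """
    rng = np.random.default_rng(seed)
    bin_id = np.clip((confidence * bins).astype(int), 0, bins - 1)

    def gap(g):
        eces = []
        for val in (0, 1):
            m = g == val
            per_bin = [abs(correct[m & (bin_id == b)].mean() - confidence[m & (bin_id == b)].mean())
                       for b in range(bins) if (m & (bin_id == b)).any()]
            eces.append(np.mean(per_bin))            # unweighted over bins for brevity
        return abs(eces[1] - eces[0])

    observed = gap(group)
    null = []
    for _ in range(n_perm):
        g = group.copy()
        for b in range(bins):                        # shuffle routing labels within each bin
            idx = np.where(bin_id == b)[0]
            g[idx] = rng.permutation(g[idx])
        null.append(gap(g))
    p_value = (np.sum(np.array(null) >= observed) + 1) / (n_perm + 1)
    return observed, p_value

rng = np.random.default_rng(1)
conf = rng.uniform(0.5, 1.0, 2000)
corr = (rng.uniform(size=2000) < conf).astype(float)   # toy, roughly calibrated predictions
grp = rng.integers(0, 2, 2000)                          # toy routing-derived binary state
print(permutation_pvalue(conf, corr, grp, n_perm=200))
```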

[CV-145] Fashion Florence: Fine-Tuning Florence-2 for Structured Fashion Attribute Extraction

【速读】:该论文旨在解决从服装图像中自动提取结构化时尚属性(如类别、颜色、材质、风格标签和场合标签)的问题,以支持下游推荐与检索系统的程序化调用。解决方案的关键在于构建一个基于Florence-2视觉语言模型并采用LoRA(Low-Rank Adaptation)微调的专用模型——Fashion Florence,其通过规则驱动的标签工程将细粒度标注压缩为紧凑的6类、16色、19风格 schema,并在iMaterialist Fashion数据集上进行训练(3,688样本,3个epoch),最终实现高精度结构化输出(类别准确率94.6%,风格F1达0.753),且推理成本极低(0.77B参数,在单GPU上零边际开销)。

链接: https://arxiv.org/abs/2605.09827
作者: Anushree Berlia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Model: this https URL

点击查看摘要

Abstract:We present Fashion Florence, a Florence-2 vision-language model fine-tuned with LoRA to extract structured fashion attributes from clothing images. Given a single photograph, the model generates a JSON object containing category, color, material, style tags, and occasion tags, structured output suitable for direct programmatic consumption by downstream recommendation and retrieval systems. Fine-tuning data is derived from the iMaterialist Fashion dataset (228 labels), where we collapse fine-grained annotations into a compact 6-category, 16-color, 19-style schema via rule-based label engineering. We apply LoRA (r=16, alpha=32) to all decoder linear layers, training for 3 epochs on 3,688 examples. On a held-out test set of 461 images, Fashion Florence achieves 94.6% category accuracy and 63.0% material accuracy, compared to 89.3% / 43.3% for GPT-4o-mini and 87.4% for Gemini 2.5 Flash. Fashion Florence produces valid JSON in 99.8% of outputs while running at 0.77B parameters on a single GPU at zero marginal inference cost. Style tag F1 reaches 0.753 vs. 0.612 (Gemini) and 0.398 (GPT-4o-mini). The model is deployed as a Hugging Face Space and integrated into Loom, an open-source outfit recommendation system.
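
The reported LoRA setting (r=16, alpha=32 on decoder linear layers) roughly corresponds to a peft configuration like the one below. The checkpoint name, target module names, and dropout value are assumptions, since the abstract does not specify them.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Florence-2 backbone; the exact base checkpoint used by the paper is an assumption here.
model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)

# LoRA on decoder linear layers, matching the reported r=16, alpha=32 setting.
# target_modules names and lora_dropout are illustrative and depend on the actual module naming.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2"],
    bias="none",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()   # only the low-rank adapters are trained
```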

[CV-146] CrossVL: Complexity-Aware Feature Routing and Paired Curriculum for Cross-View Vision-Language Detection CVPR2026

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在跨视角目标检测任务中性能显著下降的问题,尤其当地面视角与航空视角在高度、尺度和空间布局上存在系统性差异时,传统固定融合机制难以适应这种几何变化带来的复杂度差异。解决方案的关键在于提出CrossVL框架,其核心由两个模块组成:一是复杂度感知路径聚合(Complexity-Aware Pathway Aggregation, CPA),通过多模态统计估计场景复杂度并动态选择特征路径以生成视图特定表示;二是配对课程学习(Paired Curriculum Learning, PCL),利用同步采集的地面-航空图像对的语义一致性提供稳定早期监督,并逐步过渡到随机采样以优化训练动力学。实验表明,该方法有效提升了Florence-2在航空视角下的平均精度(mAP)并缩小了地面与航空视角间的性能差距,同时显著降低训练方差,验证了架构设计与训练策略协同优化对跨视角VLM检测鲁棒性的关键作用。

链接: https://arxiv.org/abs/2605.09802
作者: Zhipeng Liu,Chunbo Luo
机构: University of Exeter (埃克塞特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to CVPR 2026. Code available at this https URL

点击查看摘要

Abstract:Vision-language models (VLMs) enable text-guided object detection but degrade severely under cross-view scenarios where ground and aerial viewpoints differ in altitude, scale, and spatial layout. These geometric changes introduce systematic complexity variations between viewpoints, e.g., ground view images contain dense and highly occluded structures, while aerial images are sparse and globally organized. Fixed VLM fusion mechanisms cannot handle this discrepancy. We propose CrossVL, a framework combining Complexity-Aware Pathway Aggregation (CPA) and Paired Curriculum Learning (PCL) for enhanced cross-view detection for VLM. CPA estimates scene complexity from multimodal statistics and routes visual features through multiple pathways to obtain view-specific representations. PCL leverages semantic consistency of synchronized ground-aerial pairs to provide stable early supervision and then gradually shifts toward randomized sampling. On MAVREC, CrossVL improves Florence-2’s aerial mAP from 58.66% to 61.03% and reduces the ground-aerial performance gap from 8.63pp to 6.65pp, while also achieving a 3.3x reduction in variance across random seeds. CPA provides stable complexity-aware feature aggregation, and PCL enhances optimization dynamics. Together, they demonstrate that coordinated architectural and training adaptations are crucial for robust cross-view VLM detection.

[CV-147] DRIVE-C: A Controlled Corruption Dataset for Autonomous Driving

【速读】:该论文旨在解决自动驾驶系统中视觉感知鲁棒性评估缺乏可控且结构化测试基准的问题。现有方法难以在受控条件下系统性地分析传感器退化对感知模型性能的影响,从而限制了ADAS(高级驾驶辅助系统)在复杂环境中的可靠性研究。解决方案的关键在于构建DRIVE-C数据集——一个基于真实世界驾驶视频的可控退化数据集,其中包含10个干净视频片段与600个经过物理启发式合成退化的视频片段,涵盖12种相机退化类型及5个严重等级,并提供像素级对齐、可复现的退化参数以及全局传感器健康指数(GSHI)标注。这一设计使研究人员能够精确控制退化条件,从而实现对感知模型鲁棒性、不确定性估计、分布外检测和传感器健康监测的系统性评估。

链接: https://arxiv.org/abs/2605.09774
作者: Shiva Aher
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:DRIVE-C is a controlled corruption dataset designed to evaluate visual perception robustness in autonomous driving systems. It is built from real-world forward-facing driving videos collected across daytime, nighttime, urban, rural, freeway, and parking environments. Clean clips are anonymized via localized face and license plate blurring, then transformed with physics-inspired synthetic degradations. The dataset contains 10 clean clips and 600 corrupted clips spanning 12 camera degradation types across five severity levels, with per-clip metadata and Global Sensor Health Index (GSHI) annotations. DRIVE-C supports robustness benchmarking, degradation-aware modeling, uncertainty estimation, out-of-distribution (OOD) detection, and sensor health monitoring for Advanced Driver Assistance Systems (ADAS). By providing pixel-aligned clean and degraded video clips with fully reproducible corruption parameters, DRIVE-C offers a structured testbed for studying perception reliability under controlled camera degradation.

[CV-148] Fetal Brain Imaging: A Composite Neural Network Approach for Keyframe Detection in Ultrasound Videos

【速读】:该论文旨在解决胎儿脑部超声视频中关键帧检测(keyframe detection)的准确性与效率问题,以支持更早的诊断和治疗规划。其解决方案的关键在于提出了一种融合卷积神经网络(Convolutional Neural Network, CNN)与循环神经网络(Recurrent Neural Network, RNN)的复合神经网络架构:CNN负责提取单帧图像的空间特征,RNN则建模视频序列中连续帧之间的时序依赖关系,从而实现对关键帧的有效识别与定位。

链接: https://arxiv.org/abs/2605.09750
作者: Aleksander Zamojski,Kacper Jarczak,Radoslaw Roszczyk
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This article presents a novel approach to keyframe detection in ultrasound videos, with a particular focus on fetal brain imaging. The proposed model is a composite neural network architecture that combines a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN). The CNN extracts spatial features from individual video frames, while the RNN captures temporal dependencies between consecutive frames within each video sequence. The proposed model may improve the efficiency and accuracy of fetal brain ultrasound analysis, thereby supporting earlier detection, diagnosis, and treatment planning for selected fetal brain conditions. (Related DOI: https://doi.org/10.1109/PAEE59932.2023.10244374)

[CV-149] On-Policy Distillation with Best-of-N Teacher Rollout Selection

【速读】:该论文旨在解决标准在线蒸馏(On-policy Distillation, OPD)在教学信号高方差问题,即由于学生生成的上下文噪声以及单次随机教师轨迹采样导致的监督信号不稳定、不准确或与学生当前推理行为不匹配。其解决方案的关键在于提出BRTS(Best-of-N Rollout Teacher Selection)框架,通过从多个教师轨迹中选择最优辅助轨迹来增强监督可靠性:首先优先选择正确路径,其次选择与学生当前行为最对齐的路径;若无正确路径,则引入基于真实答案条件的恢复机制以激发自然推导过程。该方法在AIME 2024、AIME 2025和AMC 2023等复杂推理基准上显著优于标准OPD,尤其在难度较高的数据集上提升明显。

链接: https://arxiv.org/abs/2605.09725
作者: Ke Zhang,Yunjie Tian,DongDi Zhao,Yijiang Li,Yuanye Liu,Vishal M Patel,Di Fu
机构: Johns Hopkins University (约翰霍普金斯大学); TikTok (抖音); University of California, San Diego (加州大学圣地亚哥分校); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:On-policy distillation (OPD), which supervises a student on its own sampled trajectories, has emerged as a data-efficient post-training method for improving reasoning while avoiding the reward dependence of reinforcement learning and the catastrophic forgetting often observed in standard supervised fine-tuning. However, standard OPD typically computes teacher supervision under noisy student-generated contexts and often relies on a single stochastic teacher rollout per prompt. As a result, the supervision signal can be high-variance: the sampled teacher trajectory can be incorrect, uninformative, or poorly matched to the student’s current reasoning behavior. To address this limitation, we propose BRTS, a Best-of-N Rollout Teacher Selection framework for on-policy distillation. BRTS augments standard student-context OPD with a teacher-context supervision branch constructed from the curated teacher trajectory. Rather than distilling from the first sampled teacher rollout, BRTS samples a small pool of teacher trajectories and selects the auxiliary trajectory using a simple priority rule: correctness first, student alignment second. When multiple correct teacher trajectories are available, BRTS chooses the one most aligned with the student’s current behavior; when unconditioned teacher samples fail on harder prompts, it invokes a ground-truth-conditioned recovery step to elicit a natural derivation. The selected trajectory is then used to provide reliable teacher-context supervision inside the OPD loop, augmented with an auxiliary loss on the teacher trajectory. Experiments on AIME 2024, AIME 2025, and AMC 2023 show that BRTS improves over standard OPD on challenging reasoning benchmarks, with the largest gains on harder datasets. Our code is available at this https URL.
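
The priority rule ("correctness first, student alignment second", with an answer-conditioned recovery fallback) can be written down directly. Using the student log-probability of each rollout as the alignment proxy is an assumption for illustration.

```python
def select_teacher_rollout(rollouts, is_correct, student_logprob, recover_fn=None):
    """Best-of-N teacher rollout selection (sketch).

    rollouts:        list of N sampled teacher trajectories for one prompt.
    is_correct:      list of bools, whether each rollout reaches the right answer.
    student_logprob: list of floats, student log-probability of each rollout
                     (a proxy for alignment with the student's current behavior).
    recover_fn:      optional callable producing a ground-truth-conditioned teacher
                     derivation when no unconditioned rollout is correct.
    """
    correct = [i for i, ok in enumerate(is_correct) if ok]
    if correct:
        # correctness first, then the correct rollout the student finds most likely
        best = max(correct, key=lambda i: student_logprob[i])
        return rollouts[best]
    if recover_fn is not None:
        return recover_fn()                          # answer-conditioned recovery step
    return rollouts[max(range(len(rollouts)), key=lambda i: student_logprob[i])]

# toy usage
rollouts = ["traj_a", "traj_b", "traj_c"]
print(select_teacher_rollout(rollouts, [False, True, True], [-12.0, -9.5, -11.0]))
```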

[CV-150] Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT

【速读】:该论文旨在解决大规模3D视觉语言模型(3D VLMs)在实际部署中因计算成本过高而导致的可扩展性问题。其核心解决方案是提出一种知识蒸馏框架,将7B参数规模的教师模型的空间推理能力高效迁移至仅2.29B参数的学生模型,从而实现8.7倍的推理延迟降低和3倍的模型尺寸压缩,同时保留教师模型54–72%的性能。关键创新在于引入“隐式思维链”(Hidden CoT)机制——即通过可学习的潜在标记作为内部草稿板,在无需链式思维(Chain-of-Thought, CoT)标注数据的情况下增强模型的推理能力,这是首个在蒸馏后的3D VLM中应用潜在草稿板推理的工作。

链接: https://arxiv.org/abs/2605.09719
作者: Alaa Asfour,Christopher Indris,Leihan Chen,Tejas Vyas,Guanghui Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large-scale 3D vision-language models (VLMs) like LLaVA-3D offer strong spatial reasoning but are difficult to deploy due to high computational costs. We propose a knowledge distillation framework that transfers spatial reasoning from a 7B teacher to a 2.29B student model. Our approach achieves 8.7x lower inference latency and a 3x reduction in model size while retaining 54-72% of the teacher’s performance. The framework utilizes VGGT as the vision encoder and a multi-task distillation pipeline with uncertainty-aware loss weighting. To improve reasoning without chain-of-thought (CoT) data, we introduce “Hidden CoT”: learnable latent tokens that serve as an internal scratchpad before answer generation. This is the first use of latent scratchpad reasoning in distilled 3D VLMs. The student model jointly performs spatial description, depth estimation, and object detection. Experiments on ScanNet and 3D-FRONT show strong spatial understanding, reaching 68-72% accuracy on proximity and contact tasks. Our framework enables efficient 3D scene QA on resource-constrained platforms.
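
The "Hidden CoT" scratchpad can be pictured as a handful of learnable latent embeddings appended to the prompt before answer decoding. The token count and hidden size below are arbitrary assumptions, and the real model additionally couples this with the multi-task, uncertainty-weighted distillation losses described above.

```python
import torch
import torch.nn as nn

class HiddenCoTPrefix(nn.Module):
    """Learnable latent 'scratchpad' tokens inserted before answer generation (sketch)."""
    def __init__(self, num_latent=8, dim=2048):
        super().__init__()
        self.latent = nn.Parameter(torch.randn(num_latent, dim) * 0.02)

    def forward(self, input_embeds):                 # input_embeds: (B, T, dim)
        B = input_embeds.shape[0]
        scratchpad = self.latent.unsqueeze(0).expand(B, -1, -1)
        # the language model attends to [prompt tokens | latent scratchpad] before decoding
        return torch.cat([input_embeds, scratchpad], dim=1)

embeds = torch.randn(2, 32, 2048)                    # toy prompt embeddings
print(HiddenCoTPrefix()(embeds).shape)               # (2, 40, 2048)
```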

[CV-151] MOTOR-Bench: A Real-world Dataset and Multi-agent Framework for Zero-shot Human Mental State Understanding CVPR2026

【速读】:该论文旨在解决从自然行为中推断人类心理状态(mental states)的难题,尤其是现有方法多局限于孤立标签预测,缺乏对复杂人际互动的结构化建模。为支持结构化分析,作者构建了MOTOR-Bench基准,包含1,440个协作学习场景下的多模态视频片段,并基于自我调节学习理论进行专家标注。实验表明,当前最先进的多模态大语言模型和多智能体系统在零样本设置下表现有限,说明现有方法仍难以从可观察行为中进行深层次的心理状态推理。为此,论文提出一种名为MOTOR-MAS的推理型多智能体框架,其关键在于通过结构化的智能体协调机制,协同多个专业智能体分别推断显性行为、内部认知与心理情绪,从而实现更准确的结构化心理状态识别。

链接: https://arxiv.org/abs/2605.09703
作者: Xiaoyu Yuan,Niklas Heikkala,Tiina Törmänen,Hanna Järvenoja,Guoying Zhao,Haoyu Chen
机构: University of Oulu (奥卢大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026 workshop AI4RWC

点击查看摘要

Abstract:Understanding human mental states from natural behavior is crucial for intelligent systems in the real world. However, most current research focuses on predicting isolated mental state labels, lacking structured annotations of complex interpersonal interactions. To support structured analysis, we introduce MOTOR-Bench, a carefully-designed benchmark with a real-world dataset MOTOR-dataset, containing 1,440 multimodal video clips in collaborative learning scenarios, reflecting key real-world data challenges including natural class imbalance, visual noise, and domain-specific language. Each sample is labeled by educational experts based on self-regulated learning theory. We further evaluate several state-of-the-art multimodal large language models and multi-agent systems in a zero-shot setting on our MOTOR-Bench. However, their performance on this task remains limited, suggesting that existing methods still struggle with structured reasoning from observable behavior to deeper mental states. To address this challenge, we propose a reasoning multi-agent framework, named MOTOR-MAS. It coordinates multiple agents through a structured agent coordination mechanism to infer explicit behaviors, internal cognitions, and psychological emotions. Experimental results show that our MOTOR-MAS outperforms the best single-model benchmark by 15.93 points in Macro-F1 scores for the three labels of behavior, cognition, and emotion, and outperforms the general multi-agent benchmark by 10.2 points in internal cognition prediction.

[CV-152] DriveFuture: Future-Aware Latent World Models for Autonomous Driving

【速读】:该论文旨在解决现有潜在世界模型(latent world models)在自动驾驶中未能有效将未来状态显式地用于当前决策的问题。传统方法通常将未来潜在状态作为预测目标或辅助信号,导致当前与未来的特征在潜在空间中混杂,从而影响轨迹规划的准确性。其解决方案的关键在于提出DriveFuture框架,该框架通过在训练阶段利用交叉注意力机制(cross-attention)对当前潜在状态进行条件建模,使其显式依赖于预测的未来潜在状态,从而生成具有规划导向性的前瞻性潜在表示;推理时则使用预测的未来潜在状态而非真实状态来指导基于扩散模型的轨迹规划器,实现对未来状态的显式条件化,显著提升了规划性能。

链接: https://arxiv.org/abs/2605.09701
作者: Yufeng Hong,Xiaotian Zhou,Yingyan Li,Xiangpo Zhou,Lin Liu,Yadan Luo,Shaoqing Xu,Lei Yang,Ziying Song
机构: Beijing Institute of Technology (北京理工大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Beihang University (北京航空航天大学); Beijing Jiaotong University (北京交通大学); The University of Queensland (昆士兰大学); University of Macau (澳门大学); Nanyang Technological University (南洋理工大学); School of Artificial Intelligence ( School of Software), Yanshan University (燕山大学人工智能学院(软件学院))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 7 figures

点击查看摘要

Abstract:Existing latent world models for autonomous driving have opened a promising path toward future-aware driving intelligence. However, they typically treat future latent states as prediction targets or auxiliary signals, rather than directly conditioning trajectory planning. This can entangle current and future features in latent space. In this work, we propose DriveFuture, a future-aware latent world modeling framework for autonomous driving that explicitly learns planning-oriented foresight by conditioning the current latent state modeling process on future world states. Specifically, during training, the model first predicts future latent world states from the current latent state and ego action, and then refines the prediction against the ground-truth future latent state via cross-attention. The resulting future-aware latent serves as an explicit condition for a diffusion-based trajectory planner. During inference, DriveFuture conditions on the predicted future latent state instead of the ground-truth future state. DriveFuture achieves SOTA performance on the public NAVSIM benchmarks, reaching 55.5 EPDMS on NAVSIM-v2 navhard, 89.9 EPDMS on NAVSIM-v2 navtest, and 90.7 PDMS on NAVSIM-v1 navtest, respectively. These results suggest that the key to latent world modeling lies not merely in simulating future states, but more importantly in conditioning current decision-making on future states. Notably, as of April 2026, DriveFuture ranks 1st on the NAVSIM-v2 navhard leaderboard (this https URL) and achieves SOTA performance on NAVSIM-v1 navtest (this https URL).

[CV-153] Discriminative Span as a Predictor of Synthetic Data Utility via Classifier Reconstruction

【速读】:该论文旨在解决在正样本严重稀缺的二分类任务中(如医学影像和工业检测),如何可靠评估通过图像到图像变换生成的合成正样本是否能够提升下游模型性能的问题。解决方案的关键在于提出一种基于几何结构的度量方法,该方法在预训练基础模型的嵌入空间中,利用样本间的差异向量表示数据集,并通过测量线性分类器权重向量在这些差异向量所张成子空间中的投影误差来判断合成数据的有效性。若合成数据能捕获任务相关的特征方向,则其差异向量可近似分类器方向,从而导致低投影误差,反之则误差较高;该指标在多个数据集与网络架构上均表现出与下游CNN分类性能的高度相关性,为数据稀缺场景下合成数据质量的评估提供了实用且有效的工具。

链接: https://arxiv.org/abs/2605.09697
作者: Radhika Amar Desai,Modigari Narendra
机构: Vellore Institute of Technology (维洛尔理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 15 pages, 17 tables

点击查看摘要

Abstract:In many real-world computer vision applications, including medical imaging and industrial inspection, binary classification tasks are characterized by a severe scarcity of positive samples. A widely adopted solution is to generate synthetic positive data using image-to-image transformations applied to negative samples. However, a fundamental challenge remains: how can we reliably assess whether such synthetic data will improve downstream model performance? In this work, we propose a geometry-driven metric that predicts the utility of synthetic data without requiring model training. Our approach operates in the embedding space of a pre-trained foundation model and represents the dataset through difference vectors between samples. We evaluate whether the weight vector of a linear classifier can be expressed within the subspace spanned by these variations by measuring the relative projection error. Intuitively, if the variations induced by synthetic data capture task-relevant directions, their span can approximate the classifier, resulting in low projection error. Conversely, poor synthetic data fails to span these directions, leading to higher error. Across multiple datasets and architectures, we show that this metric exhibits strong correlation with downstream classification performance of CNNs trained on mixtures of real negative and synthetic positive data. These findings suggest that the proposed metric serves as a practical and informative tool for evaluating synthetic data quality in data-scarce settings.
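
The proposed metric reduces to a projection computation in embedding space: build a basis for the span of difference vectors and measure how much of a linear classifier direction falls outside it. A small NumPy sketch with a toy random example is shown below; the SVD-based basis construction and threshold are illustrative choices.

```python
import numpy as np

def relative_projection_error(diff_vectors, w):
    """Relative error of projecting a classifier direction onto the span of dataset variations.

    diff_vectors: (M, D) array of difference vectors between sample embeddings.
    w:            (D,) linear classifier weight vector probed in the same embedding space.
    Low error  -> the variations span task-relevant directions (useful synthetic data);
    high error -> the synthetic data misses those directions.
    """
    _, s, vt = np.linalg.svd(diff_vectors, full_matrices=False)   # orthonormal basis of the span
    basis = vt[s > 1e-8 * s.max()]                                # keep numerically significant directions
    w_proj = basis.T @ (basis @ w)
    return np.linalg.norm(w - w_proj) / np.linalg.norm(w)

rng = np.random.default_rng(0)
diffs = rng.normal(size=(50, 128))                   # toy embedding-difference vectors
w = rng.normal(size=128)
print(relative_projection_error(diffs, w))
```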

[CV-154] Do multimodal models imagine electric sheep?

【速读】:该论文旨在解决大模型在缺乏显式视觉监督的情况下,如何自发形成空间推理能力的问题。其核心挑战在于:如何让模型在执行复杂空间任务(如拼图、三维旋转等)时,不仅选择正确动作,还能构建对中间状态的内在表征(即“心理意象”)。解决方案的关键在于通过监督模型预测从初始状态到目标状态的动作序列,发现模型在每一步执行后产生的激活值已编码有意义的视觉信息,表明一个不完美的视觉世界模型在无显式视觉监督下自然涌现。进一步地,作者提出将少量(如每步16个)视觉token融入思维链(chain of thought),显著提升解题成功率(平均从83%提升至89%),尤其在高推理强度任务(如拼图和3D心理旋转)中效果突出,从而验证了利用此类隐式心理意象可增强模型的空间推理能力。

链接: https://arxiv.org/abs/2605.09693
作者: Santhosh Kumar Ramakrishnan,Carl Vondrick,Raja Giryes,Philipp Krähenbühl,Vladlen Koltun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Yes. We find that large multimodal models develop mental imagery when solving spatial puzzles, and they do imagine sheep when solving sheep puzzles. We fine-tune a Qwen3.5 VLM to solve twelve diverse visual reasoning tasks – including tangram, jigsaw, sokoban, 3D mental rotation, and rush hour – that require understanding geometry, spatial relationships, and the consequences of actions. By supervising the model to predict the open-loop sequence of actions to solve a puzzle from an initial state, we show that the model’s activations after each action encode meaningful visual information about the intermediate state. This finding suggests that an imperfect visual world model begins to form as a byproduct of learning to select correct actions, in the absence of any explicit visual supervision. Building on this observation, we propose two ways to sharpen and use the mental images formed by the model. We find that integrating as few as sixteen visual tokens per step into the chain of thought improves the average solve rate from 83% to 89%, with particularly strong gains on reasoning-heavy tasks such as jigsaw and 3D mental rotation.

[CV-155] ConFixGS: Learning to Fix Feedforward 3D Gaussian Splatting with Confidence-Aware Diffusion Priors in Driving Scenes

【速读】:该论文旨在解决前馈式三维高斯溅射(feedforward 3D Gaussian Splatting, 3DGS)在基于轨迹的稀疏视角驾驶场景中表现不佳的问题,特别是现有高斯修复方法多聚焦于优化驱动型3DGS,而基于扩散模型的修复通常局限于观测视角附近的迭代精修,导致前馈式3DGS修复尚未得到充分探索。其解决方案的关键在于提出一种即插即用的方法ConFixGS,该方法通过置信度感知的扩散先验(confidence-aware diffusion priors) 来增强前馈模型的重建质量:首先从预训练的前馈模型生成局部伪目标,并利用重投影交叉验证机制对这些伪目标进行支持视图一致性检验,从而获得稠密置信度图,引导细节增强并抑制幻觉或不一致信息。此策略实现了生成式先验与支持视图一致性的融合,显著提升了复杂新视角合成性能,在Waymo、nuScenes和KITTI数据集上实现最高达3.68 dB的PSNR提升和近50%的FID下降。

链接: https://arxiv.org/abs/2605.09688
作者: Rui Song,Tianhui Cai,Markus Gross,Xingcheng Zhou,Zewei Zhou,Zhiyu Huang,Olaf Wysocki,Jiaqi Ma
机构: University of California, Los Angeles (加州大学洛杉矶分校); University of Cambridge (剑桥大学); Technical University of Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 12 figures

点击查看摘要

Abstract:Feedforward 3D Gaussian Splatting (3DGS) often struggles in trajectory-based sparse-view driving scenes. Existing Gaussian repair methods mainly target optimization-based 3DGS, while diffusion-based repair is typically restricted to iterative refinement near observed viewpoints, leaving feedforward 3DGS repair underexplored. We propose ConFixGS, a plug-and-play method that learns to fix feedforward 3DGS with confidence-aware diffusion priors. Starting from a pretrained feedforward model, ConFixGS generates diffusion-enhanced local pseudo-targets and validates them through reprojection-based cross-checking against support views. The resulting dense confidence maps guide refinement, enhancing reliable details while suppressing hallucinated or inconsistent evidence. On Waymo, nuScenes, and KITTI, ConFixGS improves challenging novel view synthesis, with PSNR gains of up to 3.68 dB and FID reduced by nearly half. Our results highlight confidence-aware fusion of generative priors and support-view consistency as a key principle for robust feedforward 3D driving scene reconstruction.

[CV-156] Spatial-Frequency Gated Swin Transformer for Remote Sensing Single-Image Super-Resolution

【速读】:该论文旨在解决遥感(Remote Sensing, RS)单图像超分辨率任务中,基于Swin Transformer的模型在重建高分辨率图像时难以有效分离并增强低频结构信息与高频细节的问题。现有方法如Swin2SR虽能通过移位窗口自注意力机制建模空间上下文,但其前馈网络(Feed-Forward Network, FFN)仍为通用通道混合模块,无法区分低频结构内容与高频残差细节,导致细节重建能力受限。解决方案的关键在于提出Spatial-Frequency Gated Feed-Forward Network (SFG-FFN),该模块通过深度可分离模糊分支估计低频内容、减法提取高频残差,并借助轻量级空间分支进行细化,最终通过瓶颈门控机制自适应注入细节信息,从而在Transformer的前馈层内实现空间-频率解耦与增强,显著提升遥感图像的高频细节重建质量。

链接: https://arxiv.org/abs/2605.09687
作者: Md Aminur Hossain,Parekh Valkesh,Ayush V. Patel,Yogesh Jethani,Sanjay K. Singh,Biplab Banerjee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages

点击查看摘要

Abstract:Remote Sensing (RS) single-image super-resolution aims to reconstruct high-resolution imagery from low-resolution observations while preserving fine spatial structures. Recent Swin Transformer-based models, including Swin2SR, provide strong spatial context modeling through shifted-window self-attention, but their feed-forward networks remain generic channel-mixing modules and do not separate low-frequency structural content from high-frequency residual detail. To address this limitation, we propose SFG-SwinSR, a Spatial-Frequency Gated Swin Transformer for single-image super-resolution in remote sensing. SFG-SwinSR modifies the original Swin2SR attention block by replacing each transformer block’s standard feed-forward network with a lightweight Spatial-Frequency Gated Feed-Forward Network (SFG-FFN). The module estimates low-frequency content via a depthwise-blur branch, extracts high-frequency residuals by subtraction, refines them with a lightweight spatial branch, and adaptively injects detail through a bottleneck gate. Experiments on SpaceNet and SEN2VENµS show that SFG-SwinSR improves reconstruction quality under the evaluated settings. On SpaceNet, it achieves 45.19 dB PSNR and 0.9852 SSIM, indicating effective enhancement of high-frequency details. This demonstrates that spatial-frequency transformation within the transformer feed-forward network improves detail reconstruction in RS super-resolution.
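
依据摘要描述,SFG-FFN 可以粗略理解为“深度可分离模糊估低频、相减取高频、轻量空间分支精化、瓶颈门控注入”的组合。下面是一个 PyTorch 示意实现(模块结构、卷积核大小与门控形式均为假设,并非论文原始代码):

```python
# Illustrative sketch of a spatial-frequency gated FFN; layer sizes are assumptions.
import torch
import torch.nn as nn

class SFGFFN(nn.Module):
    def __init__(self, dim, reduction=4):
        super().__init__()
        # depthwise "blur" branch: estimates low-frequency content
        self.blur = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim, bias=False)
        # lightweight spatial branch that refines the high-frequency residual
        self.refine = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.GELU(),
            nn.Conv2d(dim, dim, 1),
        )
        # bottleneck gate deciding how much detail to inject back
        self.gate = nn.Sequential(
            nn.Conv2d(dim, dim // reduction, 1),
            nn.GELU(),
            nn.Conv2d(dim // reduction, dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):             # x: (B, C, H, W)
        low = self.blur(x)            # low-frequency estimate
        high = x - low                # high-frequency residual by subtraction
        detail = self.refine(high)    # refined detail
        return low + self.gate(x) * detail  # adaptive detail injection

# usage sketch: SFGFFN(64)(torch.randn(1, 64, 32, 32)).shape -> (1, 64, 32, 32)
```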

[CV-157] Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models

【速读】:该论文旨在解决自回归视频扩散模型(Autoregressive Video Diffusion Models)在长时序视频生成中因历史帧间冗余键值(Key-Value, KV)缓存导致的注意力复杂度高和内存开销大的问题,从而限制了模型的可扩展性。解决方案的关键在于通过引入KV缓存压缩机制,基于对注意力头功能特性的实证分析,将注意力头划分为静态头(static heads)和动态头(dynamic heads):静态头负责跨自回归块的过渡与帧内保真度,动态头则控制帧间运动一致性。在此基础上提出Forcing-KV方法,对静态头进行结构化剪枝,对动态头基于段级相似性进行动态剪枝,从而在不牺牲输出质量的前提下显著降低内存占用并提升生成速度,在单张NVIDIA H200 GPU上实现超过29帧/秒的生成速率,并在480P和1080P分辨率下分别获得最高1.50倍和2.82倍的速度提升。

链接: https://arxiv.org/abs/2605.09681
作者: Yicheng Ji,Zhizhou Zhong,Jun Zhang,Qin Yang,XiTai Jin,Ying Qin,Wenhan Luo,Shuiyang Mao,Wei Liu,Huan Li
机构: ZJU(浙江大学); Video Rebirth; HKUST(香港科技大学); BJTU(北京交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages

点击查看摘要

Abstract:Autoregressive (AR) video diffusion models adopt a streaming generation framework, enabling long-horizon video generation with real-time responsiveness, as exemplified by the Self Forcing training paradigm. However, existing AR video diffusion models still suffer from significant attention complexity and severe memory overhead due to the redundant key-value (KV) caches across historical frames, which limits scalability. In this paper, we tackle this challenge by introducing KV cache compression into autoregressive video diffusion. We observe that attention heads in mainstream AR diffusion models exhibit markedly distinct attention patterns and functional roles that remain stable across samples and denoising steps. Building on our empirical study of head-wise functional specialization, we divide the attention heads into two categories: static heads, which focus on transitions across autoregressive chunks and intra-frame fidelity, and dynamic heads, which govern inter-frame motion and consistency. We then propose Forcing-KV, a hybrid KV cache compression strategy that performs structured static pruning for static heads and dynamic pruning based on segment-wise similarity for dynamic heads. While maintaining output quality, our method achieves a generation speed of over 29 frames per second on a single NVIDIA H200 GPU along with 30% cache memory reduction, delivering up to 1.35x and 1.50x speedups on LongLive and Self Forcing at 480P resolution, and further scaling to 2.82x speedup at 1080P resolution. Code and demo videos are provided at this https URL.
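
下面用一个简化草图说明“静态头结构化剪枝 + 动态头按段相似度剪枝”的混合 KV 缓存压缩思路(数据组织方式、阈值与保留策略均为假设,仅作示意):

```python
# Illustrative sketch of hybrid KV cache pruning; thresholds and layout are assumptions.
import torch

def prune_kv_cache(keys, values, head_is_static, recent=2, sim_thresh=0.9):
    """keys/values: lists over heads, each a list of per-chunk tensors of shape (L, D).
    Static heads keep the first chunk (scene anchor) plus the most recent chunks;
    dynamic heads drop older chunks whose mean key is too similar to the newest one."""
    pruned_k, pruned_v = [], []
    for h, (k_chunks, v_chunks) in enumerate(zip(keys, values)):
        n = len(k_chunks)
        if head_is_static[h]:
            keep = {0} | set(range(max(0, n - recent), n))            # structured pruning
        else:
            ref = torch.nn.functional.normalize(k_chunks[-1].mean(0), dim=-1)
            keep = {n - 1}
            for i in range(n - 1):
                ki = torch.nn.functional.normalize(k_chunks[i].mean(0), dim=-1)
                if (ki @ ref).item() < sim_thresh:                    # keep dissimilar segments only
                    keep.add(i)
        idx = sorted(keep)
        pruned_k.append([k_chunks[i] for i in idx])
        pruned_v.append([v_chunks[i] for i in idx])
    return pruned_k, pruned_v
```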

[CV-158] DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents

【速读】:该论文旨在解决现有医学视觉问答(Medical Visual Question Answering, VQA)基准无法有效揭示模型在肿瘤诊断过程中具体失败位置与原因的问题,因为这些基准通常仅以单一准确率指标概括模型性能,掩盖了多阶段推理链条中的薄弱环节。其解决方案的关键在于提出DeepTumorVQA——一个分层式基准,遵循肿瘤诊断的多阶段证据链,将3D CT图像的推理过程分解为四个可独立评分的阶段:识别(recognition)、测量(measurement)、视觉推理(visual reasoning)和医学推理(medical reasoning),并构建包含476K个问题、覆盖42种临床亚型的高质量数据集。此外,该框架还提供工具交互环境用于AI代理评估,允许模型调用分割模型、测量程序和医学知识模块等外部工具,从而系统性地识别出可靠定量测量是当前生成式AI(Generative AI)模型的主要瓶颈,并验证了工具增强可显著缓解此问题;进一步研究表明,基于真实步骤式工具使用轨迹的监督能有效降低工具调用和推理错误,为未来医疗视觉语言模型(Medical Vision-Language Models, VLMs)与AI代理的研究提供了清晰的阶段性发展路线。

链接: https://arxiv.org/abs/2605.09679
作者: Yixiong Chen,Wenjie Xiao,Pedro R. A. S. Bassi,Boyan Wang,Liang He,Xinze Zhou,Sezgin Er,Ibrahim Ethem Hamamci,Zongwei Zhou,Alan Yuille
机构: Johns Hopkins University (约翰霍普金斯大学); University of Bologna (博洛尼亚大学); Istanbul Medipol University (伊斯坦布尔梅迪波尔大学); Center for Biomolecular Nanotechnologies, Istituto Italiano di Tecnologia (意大利技术研究院生物分子纳米技术中心); The First Affiliated Hospital, Sun Yat-Sen University (中山大学第一附属医院); Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Medical vision-language models (VLMs) and AI agents have made significant progress in learning to analyze and reason about clinical images. However, existing medical visual question answering (VQA) benchmarks collapse model capabilities into a single accuracy score, obscuring where and why models fail. We propose DeepTumorVQA, a hierarchical benchmark that follows the multi-stage evidence chain in tumor diagnosis and decomposes 3D CT reasoning into four stages: recognition, measurement, visual reasoning, and medical reasoning. Higher-level questions remain independently scorable, while their ground-truth evidence chains are defined over lower-level primitives. The benchmark contains 476K questions across 42 clinical subtypes on 9,262 3D CT volumes. In addition to a direct reasoning mode for VLMs, DeepTumorVQA provides tool-interaction environments for agent evaluation, where a model can call external tools, including segmentation models, measurement programs, and medical knowledge modules, before answering the question. Evaluating over 30 model configurations, we find that reliable quantitative measurement is the primary bottleneck, making later-stage visual and medical reasoning harder for VLMs, while tool augmentation substantially mitigates this issue. When tools are available, leveraging medical knowledge and tools to reason about medical images becomes a new challenge. We further show that ground-truth step-by-step tool-use traces from DeepTumorVQA can supervise agents and reduce tool-use and reasoning failures. This stage-wise progression from recognition to measurement to visual and medical reasoning provides a concrete roadmap for future medical VLM and AI agent studies. All data and code are released at this https URL.

[CV-159] VFM-SDM: A vision foundation model-based framework for training-free marker-free and calibration-free structural displacement measurement

【速读】:该论文旨在解决结构健康监测中位移测量的可靠性和部署效率问题,特别是在实际工程场景下,传统视觉测量方法常受限于任务特定模型训练或现场准备(如标记安装和手动相机标定)。其解决方案的关键在于提出一种基于视觉基础模型(Vision Foundation Model, VFM)的结构位移测量框架(VFM-SDM),该框架通过VFM推断相机参数并结合点跟踪技术,无需任务特定训练或现场准备即可实现多方向位移的三角测量重建;同时引入结构几何约束以抑制物理上不合理的偏差,提升估计一致性,从而实现高效、自动化、可扩展的非接触式位移监测。

链接: https://arxiv.org/abs/2605.09677
作者: Qingyu Xian,Hao Cheng,Berend Jan van der Zwaag,Rolands Kromanis,Ozlem Durmaz Incel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reliable displacement measurement is fundamental for structural health monitoring and digital engineering workflows, as it provides direct structural response information. Vision-based measurement has emerged as a promising approach for low-cost, non-contact displacement monitoring. However, its deployment often remains constrained by task-specific model training or on-site preparation, such as marker installation or manual camera calibration. This study presents a Vision Foundation Model-based framework for Structural Displacement Measurement (VFM-SDM) that integrates VFM-inferred camera parameter estimation and point tracking to reconstruct multi-directional structural displacements via triangulation without task-specific training or on-site preparation, enabling efficient non-contact deployment in real-world applications. Structural geometry constraints are incorporated to suppress physically implausible deviations and improve estimation consistency. A multi-modal field dataset collected from an in-service pedestrian bridge is introduced alongside a unified benchmarking protocol to support reproducible evaluation. Representative results show low amplitude errors (NRMSE_range: 0.11/0.12), strong temporal agreement (correlation coefficient: 0.86/0.88), and small peak-to-peak amplitude errors (RPPAE: 0.01/0.02) for vertical and lateral displacements, indicating robust performance under real-world conditions. The proposed framework advances automated, scalable displacement monitoring and lays the groundwork for VFM-enabled structural response measurements in digital twin and data-centric construction workflows.
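
其中的三角测量步骤可用标准 DLT 线性三角化示意:给定由视觉基础模型推断的两视角投影矩阵和点跟踪结果,恢复三维点并相对参考帧求位移。以下草图中的函数接口与数据组织方式均为假设:

```python
# Illustrative DLT triangulation sketch; interfaces are assumptions, not the paper's code.
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point from two views.
    P1, P2: 3x4 projection matrices; x1, x2: (u, v) pixel observations."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

def displacement(P1, P2, track1, track2):
    """track1/track2: (T, 2) tracked pixel positions of one structural point in each view.
    Returns (T, 3) displacement relative to the first (reference) frame."""
    pts = np.array([triangulate(P1, P2, a, b) for a, b in zip(track1, track2)])
    return pts - pts[0]
```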

[CV-160] Towards Generative Predictive Display for Vision-Based Teleoperation: A Zero-Shot Benchmark of Off-the-Shelf Video Models

【速读】:该论文旨在解决远程操作(teleoperation)系统中因通信延迟导致的情境感知能力下降和控制性能劣化问题,其核心解决方案是通过预测显示(predictive display)技术,利用生成式视频模型对当前视觉状态进行估计而非呈现延迟的观测画面。关键在于提出了一种零样本(zero-shot)基准测试框架,无需任务特定微调即可评估现成生成式视频模型在短时预测场景下的表现,该框架基于CARLA模拟器中的驾驶数据,采用滚动预测(rollout-based)策略,并从预测精度、每帧推理延迟、GPU显存占用及误差随时间演化等多个维度综合评估模型性能。实验表明,现有通用生成式视频模型尚无法同时满足低误差、稳定误差增长与实时推理的要求,揭示了通用视频合成与远程操作预测显示需求之间的差距,强调未来实用部署需依赖短期时序监督、领域适应或推理优化等针对性改进策略。

链接: https://arxiv.org/abs/2605.09670
作者: Aws Khalil,Jaerock Kwon
机构: University of Michigan - Dearborn (密歇根大学迪尔伯恩分校)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Teleoperation systems are fundamentally limited by communication latency, which degrades situational awareness and control performance. Predictive display aims to mitigate this limitation by presenting an estimate of the current visual state rather than delayed observations. While recent advances in generative video models enable high-quality video synthesis, their suitability for latency-sensitive predictive display remains unclear. This paper presents a zero-shot benchmark of off-the-shelf generative video models for short-horizon predictive display, without task-specific fine-tuning. We formulate the problem as rollout-based future frame prediction and develop a unified benchmarking pipeline using simulated driving data from the CARLA simulator. Five publicly released video models spanning transformer-based and diffusion-based families are evaluated across two resolutions and two conditioning regimes (multi-frame and single-frame). Performance is assessed using prediction accuracy (mean absolute difference), per-rollout latency, peak GPU memory usage, and temporal error evolution across the prediction horizon. On this zero-shot benchmark, no tested model simultaneously achieves low rollout error, non-divergent per-step error behavior, and real-time inference at the source frame rate. Increasing model scale or resolution yields limited and, in some cases, inverted improvements. These findings highlight a gap between general-purpose generative video synthesis and the requirements of predictive display in teleoperation, suggesting that practical deployment will require either explicit short-horizon temporal supervision, in-domain adaptation, or aggressive inference optimization rather than direct application of off-the-shelf models. Code, configurations, and qualitative results are released on the project page: this https URL

[CV-161] S2P-Net: A Spectral-Spatial Polar Network for Rotation-Invariant Object Recognition in Low-Data Regimes

【速读】:该论文旨在解决深度学习模型在图像识别任务中对旋转变化敏感的问题,即传统卷积神经网络(CNN)在处理旋转不变性时往往依赖大量数据增强(data augmentation)来提升性能,而这种方法既耗时又不可靠。解决方案的关键在于提出一种名为S2P-Net(Spectral-Spatial Polar Network)的紧凑型深度学习架构,该架构通过数学上保证的旋转不变性设计,无需依赖数据增强即可实现对输入图像旋转的鲁棒识别,从而在理论和实践层面均提升了模型的泛化能力与效率。

链接: https://arxiv.org/abs/2605.09667
作者: Albert Heruth
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures, 3 tables. Preprint. Code available from the author upon request

点击查看摘要

Abstract:We present S2P-Net (Spectral-Spatial Polar Network), a compact deep learning architecture that achieves mathematically guaranteed rotation invariance without data augmentation. In this paper, we also compare S2P-Net against other neural network architectures (CNNs) and report the results.

[CV-162] Rethinking Evaluation of Multiple Sclerosis (MS) Lesion Segmentation Models IJCNN2026

【速读】:该论文旨在解决当前多发性硬化(Multiple Sclerosis, MS)病变分割模型评估体系存在的局限性问题,即现有方法主要依赖Dice分数进行评价,未能充分考虑病变层面的检测与分割性能,以及在人类标注者易混淆或临床关键场景下的模型表现。其解决方案的关键在于提出“问题指纹识别”(problem fingerprinting)方法,系统梳理神经科医生在脑部MRI图像中关注的MS诊断与进展监测特征,并据此设计更贴近临床需求的评估指标;同时,通过在两个开源数据集上对先进模型的实证分析,验证这些指标对模型真实世界医院部署可行性的量化能力。

链接: https://arxiv.org/abs/2605.09666
作者: Abdul Basit,Ashir Rashid,Muhammad Abdullah Hanif,Muhammad Shafique
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures, Accepted to IJCNN 2026

点击查看摘要

Abstract:Multiple Sclerosis (MS) is a chronic autoimmune disease that can significantly reduce the quality of life of a patient. Existing treatment options can only help slow down the progression of the disease. Therefore, early detection and precise monitoring of disease progression are important. Deep learning offers state-of-the-art models for detecting and segmenting MS lesions in brain MRI scans. However, most of these models are evaluated using the Dice score, without accounting for lesion-wise detection and segmentation performance or other metrics that quantify model performance in cases that are complex or confusing for human annotators, or in cases that are essential for disease detection and progression monitoring. In this paper, we highlight the need to rethink the evaluation of MS lesion segmentation models. In this context, we first present problem fingerprinting in detail to highlight what neurologists look for in brain MRI scans for MS detection and progression monitoring, and which metrics are required to properly quantify model performance in these contexts. Additionally, we present an analysis of state-of-the-art models on two open-source datasets using these metrics to highlight their usability for real-world deployment in hospitals.
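
作为对“病变层面评估指标”的直观补充,下面给出基于连通域匹配的病变级 TP/FP/FN 与 F1 计算草图(重叠判定规则与阈值均为假设,并非论文采用的确切指标定义):

```python
# Illustrative lesion-wise metric sketch; overlap rule and threshold are assumptions.
import numpy as np
from scipy import ndimage

def lesion_wise_scores(gt_mask, pred_mask, min_overlap=0.1):
    """gt_mask, pred_mask: binary 3D arrays. A GT lesion counts as detected if
    predicted components cover at least `min_overlap` of its volume."""
    gt_lab, n_gt = ndimage.label(gt_mask)
    pr_lab, n_pr = ndimage.label(pred_mask)
    tp, matched_pred = 0, set()
    for g in range(1, n_gt + 1):
        region = gt_lab == g
        hits = np.unique(pr_lab[region])
        hits = hits[hits > 0]
        overlap = sum((pr_lab[region] == p).sum() for p in hits) / region.sum()
        if len(hits) > 0 and overlap >= min_overlap:
            tp += 1
            matched_pred.update(hits.tolist())
    fn = n_gt - tp
    fp = n_pr - len(matched_pred)
    f1 = 2 * tp / max(2 * tp + fp + fn, 1)
    return {"lesion_TP": tp, "lesion_FP": fp, "lesion_FN": fn, "lesion_F1": f1}
```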

[CV-163] BEA-GS: BEyond RAdiance Supervision in 3DGS for Precise Object Extraction CVPR2026

【速读】:该论文旨在解决现有高斯溅射(Gaussian Splatting)方法在生成3D语义表示时缺乏对底层几何结构优化的问题,这导致物体级编辑和资产提取难以实现。其解决方案的关键在于引入两种新的损失函数:第一种损失通过直接传播梯度穿过光栅化过程,使可见高斯的几何形状能够尊重语义边界;第二种损失则不依赖光栅化路径,可在高斯部分或完全不可见时调整其几何结构,从而确保提取物体后非可见区域的几何一致性。这一设计实现了近完美的物体边界分割效果,在多个数据集和指标上优于当前12种最先进的方法。

链接: https://arxiv.org/abs/2605.09662
作者: Alessio Mazzucchelli,Maria Naranjo-Almeida,Jorge Bustos-Sanchez,Mariella Dimiccoli,Francesc Moreno-Noguer,Jordi Sanchez-Riera,Adrian Penate-Sanchez
机构: Arquimea Research Center (Arquimea研究中心); Institut de Robòtica i Informàtica Industrial (CSIC-UPC) (机器人与信息工业研究所(CSIC-UPC)); Universidad de las Palmas de Gran Canaria (IUSIANI) (大加那利岛帕尔马大学(IUSIANI))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Highlight

点击查看摘要

Abstract:Most Gaussian Splatting techniques that provide a 3D semantic representation of the scene do not optimize the underlying 3D geometry, making object-level editing or asset extraction challenging. Recent methods, such as COBGS, Trace3D, ObjectGS, acknowledge this limitation and propose approaches that modify the scene’s geometry to represent the underlying semantics. We advance this concept further by proposing a novel solution that provides near perfect boundaries in object extraction. We do so by introducing two new losses in the optimization that take care of: 1) a loss that modifies the geometry of visible Gaussians to respect semantic boundaries, and 2) a loss that adjusts the geometry of non-visible Gaussians that appear once the object is extracted. Our first loss propagates gradients directly through the rasterization, allowing for seamless integration within the optimization of the Gaussian parameters. The second loss also propagates gradients to Gaussian parameters but does so without passing through the rasterization, enabling modification of the scene’s geometry even when little transmittance reaches a Gaussian (partial or non-visible). Exhaustive comparisons with 12 state of the art methods across 4 datasets, using six metrics, demonstrate that our approach produces overall the best boundary segmentation to date.

[CV-164] Attention Itself Could Retrieve. RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval

【速读】:该论文旨在解决视觉几何接地Transformer(VGGT)在长序列3D重建中因全局注意力机制的二次复杂度导致的计算瓶颈,以及流式版本StreamVGGT因KV缓存随帧数线性增长引发的内存溢出和质量下降问题。解决方案的关键在于提出RetrieveVGGT,一个无需训练的框架,将VGGT的上下文构建建模为检索问题:通过固定数量的相关帧检索机制,在保持可控内存预算的同时逼近训练时的上下文长度;其中,首次发现VGGT第一层全局注意力中当前帧查询与缓存历史帧键之间的相似性即可作为强相关性指标,无需额外学习评分函数;进一步引入分段采样(Segment Sampling)以增强信息多样性,并设计基于位姿感知的空间记忆机制,按已估计相机位姿组织历史帧,实现位置感知的高效检索。

链接: https://arxiv.org/abs/2605.09644
作者: Zichen Zou,Xiaosong Jia,Zuxuan Wu,Yu-Gang Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual Geometry Grounded Transformer (VGGT) advances 3D reconstruction via scalable Transformer architecture, but the quadratic complexity of global attention prevents long context application. StreamVGGT enables streaming with causal attention, yet its KV cache grows linearly with frames, causing memory overflow and quality degradation. We present RetrieveVGGT, a training-free framework, which formulates context construction for VGGT as a retrieval problem. By retrieving a fixed number of relevant frames at each step, VGGT maintains a controllable memory budget, which is close to its training context length. Interestingly, we find that the similarity between current frame queries and cached history frame keys at the first global attention layer of VGGT is already a strong indicator of relevance, eliminating the need for additional learned scoring. To enhance information diversity similar to a recommender system, we propose Segment Sampling so that the retrieval spans distinct relevant segments rather than a single high-similarity region. We design a pose-aware spatial memory mechanism that organizes history frames according to their already estimated camera poses, enabling location-aware retrieval. Extensive experiments demonstrate that RetrieveVGGT achieves state-of-the-art performance, outperforming StreamVGGT, TTT3R, and InfiniteVGGT while maintaining constant memory usage regardless of sequence length. Code is available at this https URL.
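
RetrieveVGGT 的核心可概括为“用当前帧查询与缓存键的相似度打分 + 分段采样保证多样性”。下面是一个 NumPy 层面的检索流程草图(打分方式、分段与预算分配策略均为假设):

```python
# Illustrative retrieval sketch; scoring and budget allocation details are assumptions.
import numpy as np

def retrieve_frames(curr_q, cached_k, budget=8, n_segments=4):
    """curr_q: (Tq, D) queries of the current frame; cached_k: list of (Tk, D)
    cached keys, one entry per history frame. Returns kept history-frame indices."""
    # frame-level relevance: mean query-key similarity (no learned scorer needed)
    scores = np.array([(curr_q @ k.T).mean() for k in cached_k])

    # segment sampling: split history into contiguous segments and spread the
    # budget over the best segments instead of one high-similarity region
    n = len(scores)
    bounds = np.linspace(0, n, n_segments + 1, dtype=int)
    segments = [(s, e) for s, e in zip(bounds[:-1], bounds[1:]) if e > s]
    seg_rank = sorted(segments, key=lambda se: scores[se[0]:se[1]].max(), reverse=True)

    per_seg = max(1, budget // len(seg_rank))
    keep = []
    for s, e in seg_rank:
        local = np.argsort(scores[s:e])[::-1][:per_seg] + s
        keep.extend(local.tolist())
        if len(keep) >= budget:
            break
    return sorted(keep[:budget])
```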

[CV-165] Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

【速读】:该论文旨在解决生成式 AI(Generative AI)在视觉持续学习(visual continual learning)场景下,基于强化学习的微调方法(Reinforcement Fine-Tuning, RFT)仍存在显著灾难性遗忘(catastrophic forgetting)的问题,尤其是在类增量学习(class-incremental learning, CIL)和域增量学习(domain-incremental learning, DIL)等挑战性设置中。其解决方案的关键在于提出了一种名为保留感知策略优化(Retention-aware Policy Optimization, RaPO)的新方法,核心创新包括:(1)保留奖励机制(Retention Reward),将轨迹级分布漂移转化为连续奖励信号,优先强化同一任务组内知识保留性强的轨迹;(2)跨任务优势归一化(Cross-Task Advantage Normalization, CTAN),通过维护任务边界间的奖励统计量指数移动平均,稳定持续学习过程中的优化路径。该方案有效缓解了因轨迹级漂移导致的遗忘问题,在多个视觉持续学习任务中显著降低遗忘率并保持强泛化能力。

链接: https://arxiv.org/abs/2605.09640
作者: Meng Lou,Hanzhong Guo,Linwei Chen,Yizhou Yu
机构: The University of Hong Kong (香港大学); The Hong Kong University of Science and Technology (香港科技大学); Hong Kong Generative AI Research and Development Center (香港生成式人工智能研究与开发中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent studies suggest that Reinforcement Fine-Tuning (RFT) is inherently more resilient to catastrophic forgetting than Supervised Fine-Tuning (SFT). However, whether RFT (e.g., GRPO) can effectively overcome forgetting in challenging visual continual learning settings, such as class-incremental learning (CIL) and domain-incremental learning (DIL), remains an open problem. Through a pilot study, we confirm that while RFT consistently outperforms SFT, it still suffers from non-negligible forgetting. We empirically trace this bottleneck to Trajectory-level Drift Agnosticism: among candidate rollouts achieving identical task rewards, the KL divergence from the preceding-task policy varies substantially, which strongly correlates with catastrophic forgetting across sequential tasks. Motivated by this insight, we propose Retention-aware Policy Optimization (RaPO), a simple yet effective RFT method that explicitly mitigates forgetting through trajectory-level reward shaping. Specifically, RaPO comprises two core components: (1) Retention Reward that converts trajectory-level distribution drift into a continuous reward signal, preferentially reinforcing knowledge-preserving rollouts within each group; (2) Cross-Task Advantage Normalization (CTAN), which maintains a persistent exponential moving average of reward statistics across task boundaries to stabilize the optimization progress during continual learning. Leveraging the free-form textual generalization of MLLMs, we comprehensively evaluate RaPO across five visual continual learning settings. Extensive experiments demonstrate that RaPO achieves leading performance, substantially reducing catastrophic forgetting while preserving strong plasticity. To the best of our knowledge, this work represents the first systematic exploration of RFT in visual continual learning, offering insights that we hope will inspire future research.
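
下面用两个极简组件示意“保留奖励”与“跨任务优势归一化”的基本思想:前者把相对前一任务策略的轨迹级漂移折算为奖励惩罚,后者用跨任务持久的指数滑动平均统计量做归一化。系数与具体公式均为假设,并非论文原始定义:

```python
# Illustrative sketch of retention reward + cross-task advantage normalization (assumed formulas).
import numpy as np

def retention_reward(task_reward, logp_current, logp_prev_task, beta=0.1):
    """logp_*: summed token log-probs of a rollout under the current policy and the
    frozen preceding-task policy. Smaller drift -> smaller penalty."""
    drift = logp_current - logp_prev_task      # single-sample trajectory-level KL estimate
    return task_reward - beta * max(drift, 0.0)

class CrossTaskAdvantageNorm:
    """Persistent EMA of reward mean/variance maintained across task boundaries."""
    def __init__(self, momentum=0.99):
        self.momentum, self.mean, self.var = momentum, 0.0, 1.0

    def __call__(self, rewards):
        rewards = np.asarray(rewards, dtype=float)
        m = self.momentum
        self.mean = m * self.mean + (1 - m) * rewards.mean()
        self.var = m * self.var + (1 - m) * rewards.var()
        return (rewards - self.mean) / (np.sqrt(self.var) + 1e-8)
```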

[CV-166] DegBins: Degradation-Driven Binning for Depth Super-Resolution

【速读】:该论文旨在解决深度超分辨率(Depth Super-Resolution, DSR)任务中,传统基于残差回归的方法难以准确建模高分辨率(HR)与低分辨率(LR)深度图之间复杂关系的问题,尤其是在空间变化的退化条件下。解决方案的关键在于提出DegBins框架,其核心创新是将DSR从传统的回归问题转化为混合分类-回归问题:通过退化驱动的分箱机制(degradation-driven binning),将残差深度表示为离散深度区间(depth bins)的线性组合,并由学习到的概率分布加权,从而获得更灵活且表达能力更强的特征表示;同时,在高维特征空间中建模HR与LR之间的退化关系,实现基于局部退化特性的自适应分箱范围调整和概率优化,并采用多阶段细化策略(multi-stage refinement)逐步提升重建精度,尤其在严重退化或结构复杂的区域表现优异。

链接: https://arxiv.org/abs/2605.09628
作者: Zhiqiang Yan,Zhengxue Wang,Jian Yang,Gim Hee Lee
机构: National University of Singapore (新加坡国立大学); Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages

点击查看摘要

Abstract:Depth super-resolution (DSR) aims to recover a high-resolution (HR) depth map from its low-resolution (LR) counterpart. With color image guidance, this task is typically formulated as learning the residual between HR and LR in a low-dimensional feature space. However, this additive formulation is insufficient to accurately capture the complex relationship between HR and LR, especially under spatially varying degradations. In this paper, we introduce DegBins, a novel DSR framework that leverages degradation-driven binning to adaptively enhance residual modeling. Specifically, DegBins reformulates the regression-based DSR as a hybrid classification-regression problem, where the residual depth is represented as a linear combination of discrete depth bins weighted by their learned probability distribution, yielding more flexible and expressive representations. Furthermore, DegBins models the degradation relationship between HR and LR in a high-dimensional feature space, enabling adaptive bin range adjustment and probability optimization conditioned on local degradation characteristics. To progressively improve reconstruction quality, DegBins adopts a multi-stage refinement scheme, where each stage performs finer-grained bin partitioning and probability updating based on the former estimation. This coarse-to-fine design facilitates more accurate depth recovery, particularly in regions with severe degradations or complex structural variations. Extensive experiments across five benchmarks demonstrate that DegBins consistently outperforms existing state-of-the-art methods in terms of accuracy, robustness, and generalization.
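
DegBins 将残差深度表示为离散深度区间中心的概率加权和。下面给出这一“分类-回归”重构的最小示意(张量形状、线性分箱布局与自适应范围接口均为假设):

```python
# Illustrative sketch of residual depth from bins; bin layout is an assumption.
import torch

def residual_from_bins(logits, bin_lo, bin_hi):
    """logits: (B, K, H, W) per-pixel bin logits.
    bin_lo, bin_hi: (B, 1, H, W) adaptive residual range predicted per pixel.
    Returns the residual depth (B, 1, H, W) as a probability-weighted bin sum."""
    B, K, H, W = logits.shape
    probs = torch.softmax(logits, dim=1)                          # bin probabilities
    # K bin centers spread linearly inside the per-pixel [lo, hi] range
    t = torch.linspace(0.5 / K, 1 - 0.5 / K, K, device=logits.device).view(1, K, 1, 1)
    centers = bin_lo + t * (bin_hi - bin_lo)                      # (B, K, H, W) via broadcast
    return (probs * centers).sum(dim=1, keepdim=True)

# coarse-to-fine refinement would repeat this with narrower [lo, hi] ranges
# centred on the previous estimate (multi-stage scheme).
```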

[CV-167] Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study CVPR2026

【速读】:该论文旨在解决放射治疗(Radiotherapy, RT)规划中体素级剂量预测的挑战性问题,即传统从零训练的模型难以在不同临床场景下实现良好泛化。其解决方案的关键在于提出了一种统一的Any2Any 3D扩散框架DiffKT3D,通过迁移预训练视频扩散模型的先验知识实现高效且临床可解释的剂量预测;同时引入一种无需交叉注意力机制的模态特定嵌入条件策略,支持多模态输入(如CT图像、解剖结构、体位和射线参数等)灵活 conditioning,并结合基于机构治疗偏好定制的临床评分卡引导的强化学习(Reinforcement Learning, RL)后训练机制,显著提升了预测精度与临床偏好匹配度,最终在GDP-HMM挑战赛优胜方案基础上将体素级平均绝对误差(MAE)从2.07降低至1.93,验证了该方法在跨场景泛化能力上的优越性。

链接: https://arxiv.org/abs/2605.09622
作者: Yuhan Wang,Zihan Li,Han Liu,Simon Arberet,Martin Kraus,Yuyin Zhou,Florin-Cristian Ghesu,Dorin Comaniciu,Ali Kamen,Riqiang Gao
机构: UC Santa Cruz (加州大学圣克鲁兹分校); Siemens Healthineers (西门子医疗); University of Washington (华盛顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR 2026 main conference. Compare to CVPR version, minor updates here are included (e.g., combine main text and appendix; clarify the timing scenario in appendix)

点击查看摘要

Abstract:Voxel-wise dose prediction is a critical yet challenging task in practical radiotherapy (RT) planning, as bespoke models trained from scratch often struggle to generalize across diverse clinical settings. Meanwhile, generative models trained on billion-scale datasets from vision domains have achieved impressive performance. Herein, we propose DiffKT3D, a unified Any2Any 3D diffusion framework that leverages prior knowledge from pretrained video diffusion models for efficient and clinically meaningful dose prediction. To enable flexible conditioning across multiple clinical modalities (CT, anatomical structures, body, beam settings, etc.), we introduce an Any2Any conditional paradigm utilizing modality-specific embeddings without cross-attention overhead. Further, we design a novel reinforcement learning (RL) post-training mechanism guided by a clinically-informed Scorecard explicitly tailored to institutional treatment preferences. Compared with winner of GDP-HMM challenge, DiffKT3D sets a new state-of-the-art in dose prediction by reducing voxel-level MAE from 2.07 to 1.93. In addition, DiffKT3D achieves superior image quality and preference match. These results demonstrate that transferring diffusion priors via modality-aware conditioning and clinically aligned RL post-training can provide a robust and generalizable solution for RT planning across various clinical scenarios.

[CV-168] GSMap: 2D Gaussians for Online HD Mapping

【速读】:该论文旨在解决高精地图(High-Definition Map, HD Map)构建中向量化与栅格化方法之间的根本性权衡问题:向量化方法虽能保持拓扑结构,但难以保证几何精度;而栅格化方法虽可实现像素级几何监督,却输出非结构化结果。解决方案的关键在于提出GSMap框架,通过可学习的二维高斯(2D Gaussian)表示统一两种范式——将每个地图元素建模为有序的二维高斯序列,其质心对应向量化多段线/多边形的顶点,从而在训练过程中同时优化:(1) 可微栅格化以施加像素级几何约束,(2) 拓扑感知的向量化以维持结构规则性。该方法实现了几何与拓扑学习的协同优化,在nuScenes和Argoverse2数据集上显著提升性能,并兼容现有HD地图架构。

链接: https://arxiv.org/abs/2605.09619
作者: Zhenxuan Zeng,Sheng Yang,Lingxuan Wang,Yanan He,Mingxia Chen,Wei Suo,Peng Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint

点击查看摘要

Abstract:Accurate High-Definition (HD) map construction is critical for autonomous driving, yet existing methods face a fundamental trade-off: vectorization-based approaches preserve topology but struggle with geometric fidelity, while rasterization-based approaches enable precise geometric supervision but produce unstructured outputs. To bridge this gap, we propose GSMap, a novel framework that unifies both paradigms via a learnable 2D Gaussian representation. Each map element is modeled as an ordered sequence of 2D Gaussians, whose centers correspond to the vertices of the vectorized polyline/polygon. This formulation enables simultaneous optimization through: (1) Differentiable rasterization that enforces pixel-level geometric constraints, and (2) Topology-aware vectorization that maintains structural regularity. Experiments on both nuScenes and Argoverse2 demonstrate that our Gaussian-based representation effectively unifies geometric and topological learning, achieving significant performance improvements and demonstrating strong compatibility with existing HD mapping architectures. Code will be available at this https URL

[CV-169] Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning

【速读】:该论文旨在解决长链式思维(Chain-of-Thought, CoT)推理中视觉信息在生成过程中逐渐衰减的问题,从而限制了多模态模型在长程推理任务中的表现。现有方法通常通过推理时重新注入视觉信息或训练更强的视觉锚定策略来缓解此问题,但其干预时机依赖于启发式感知规则,且局部视觉影响如何传播缺乏理论指导。为此,作者从信息论角度出发,推导出单步干预对下游视觉增益的下界,揭示两个关键因素:局部分支空间(token熵)和下游视觉传播潜力(与视觉边缘化参考分布的后缀差异)。基于此理论分析,提出反射锚点策略优化(Reflection-Anchor Policy Optimization, RAPO),其核心在于选择高熵的反射锚点,并优化一种链式掩码有限窗口KL代理目标以增强下游视觉依赖性。实验表明,RAPO在多个视觉语言大模型(LVLM)架构上显著优于强基线,机制分析进一步验证了反射锚点聚焦于视觉敏感决策点,并增强了生成轨迹上的对比性视觉依赖信号。

链接: https://arxiv.org/abs/2605.09614
作者: Xuan Gong,Hanbo Huang,Hao Zheng,Yiran Zhang,Wenbin Dai,Weishu Zhao,Shiyu Liang
机构: Shanghai Jiao Tong University (上海交通大学); Lanzhou University (兰州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review

点击查看摘要

Abstract:Long chain-of-thought (CoT) reasoning improves large vision–language models, but visual information often fades during generation, limiting long-horizon multimodal reasoning. Existing methods either re-inject vision at inference or train policies for stronger grounding, but where to intervene relies on perception heuristics rather than principled gain analysis, and how local visual influence propagates remains implicit. We study this problem from an information-theoretic standpoint and derive a lower bound on the downstream visual gain of a one-step intervention, which suggests two factors: local branching room (token entropy) and downstream visual propagation potential (suffix divergence from a vision-marginalized reference). Guided by this analysis, we propose reflection-anchor policy optimization (RAPO), a GRPO-based policy optimization method that selects high-entropy reflection anchors and optimizes a chain-masked finite-window KL surrogate for downstream visual dependence. Experiments on reasoning-intensive and general-domain benchmarks show that RAPO delivers substantial gains over strong baselines across multiple LVLM backbones. Mechanism analyses further indicate that reflection anchors are enriched for visually sensitive decision points and that RAPO increases contrastive visual-dependence signals along generated trajectories.

[CV-170] SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation

【速读】:该论文旨在解决当前通用机器人基础模型在零售场景中执行复杂操作任务时性能有限的问题,其根本原因在于训练数据与实际应用场景之间的分布差异(即数据鸿沟),尤其是零售环境缺乏在通用机器人预训练数据中的代表性。解决方案的关键在于构建一个高保真、大规模、无需遥控操作的零售机器人动作数据集SABER,该数据集通过佩戴式摄像头记录第一视角手部精细动作,同时利用360度全景相机同步捕获整个空间内所有行为和场景动态,从而实现对人类零售行为的完整建模。SABER包含三种动作表示流共计44.8K样本,并通过共享骨干多任务后训练策略应用于GR00T N1.6模型,在10项零售操作任务上平均成功率提升至29.3%,显著优于微调基线(13.4%)。这表明,高质量、可扩展的真实世界数据采集路径是提升零售机器人能力的核心前提。

链接: https://arxiv.org/abs/2605.09613
作者: Narsimha Menga,Parikshit Sakurikar,Amirreza Rouhi,Satya Sai Reddy,Anirudh Govil,Sri Harsha Chittajallu,Rajat Aggarwal,Anoop Namboodiri,Sashi Reddi
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Robotic deployment in real-world environments depends on rich, domain-specific action data as much as on strong model architecture. General-purpose robot foundation models show modest performance in complex unseen tasks such as manipulation in a retail domain when applied out of the box. The root cause is a data gap: retail environments are structurally absent from general robot pretraining distributions, and the path to filling that gap through teleoperation is prohibitively expensive, logistically constrained, and difficult to scale. We introduce SABER, a high-fidelity retail robotics action dataset built from over 100 hours of natural in-store capture across multiple real grocery environments. Egocentric footage from head-mounted cameras records fine-grained hand activity at the point of interaction, while exocentric 360-degree scene footage from DreamVu’s ALIA camera simultaneously observes all actors and activities across the entire space. This combination yields a uniquely complete picture of human retail behavior: dexterous hand activity, whole-body motion, and scene dynamics, all captured without staging, scripting, or teleoperation overhead. The SABER corpus contains 44.8K training samples across three action representation streams: 25K latent action sequences via LAPA-style encoding, 18.6K dexterous hand-pose trajectories retargeted to robot joint space, and 1.2K whole-body synchronized motion sequences retargeted to a humanoid embodiment. When applied to GR00T N1.6 via a shared-backbone multi-task post-training recipe, SABER yields a mean success rate of 29.3% across ten retail manipulation tasks – more than 2.19x over fine-tuning baselines (13.4%). SABER demonstrates that the path to capable retail robots runs through better data, which can be collected today, at scale, without a robot in the loop. The dataset and code are available at this https URL

[CV-171] On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

【速读】:该论文旨在解决当前图像到三维(image-to-3D)模型在生成有害几何结构方面的潜在安全风险问题,即这些模型可能被恶意利用以生成可被3D打印并造成现实世界危害的物体。研究通过系统性测量揭示了当前主流图像到3D模型在多种输入条件下(包括原始、退化、视角变换和语义伪装输入)均能有效重建三类有害几何类别:直接物理危害物、高风险模板或组件、以及欺骗性复制品;且绝大多数此类有害内容不会触发商业平台的 moderation flag(<0.3%)。解决方案的关键在于提出一种分层防御机制(stacked defense),结合输入过滤、模型层面良性对齐与输出层过滤三种策略,能够在保留99%正常内容的同时将有害内容留存率降至1%,但代价是整体误报率上升至11%。这一发现凸显了现有防护措施的局限性,并呼吁发展更精细的几何感知型内容审核机制。

链接: https://arxiv.org/abs/2605.09606
作者: Yule Liu,Yilong Yang,Jiale Teng,Hanze Jia,Zeren Luo,Jingyi Zheng,Zifan Peng,Ke Li,Yifan Liao,Zhen Sun,Jiaheng Wei,Yang Liu,Zhuo Ma,Xinlei He
机构: The Hong Kong University of Science and Technology (Guangzhou); Xidian University; Zhejiang University; Wuhan University
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in image-to-3D models have significantly improved the fidelity and accessibility of 3D content creation. Such a powerful reconstruction capability that enables creative design can also be misused by the adversary to generate harmful geometries, which can be further fabricated via 3D printers and pose real-world risks. However, such risks are largely underexplored: it remains unclear how well current image-to-3D models can produce these harmful geometries, and whether existing safeguards can reliably prevent such generation. To fill this gap, we conduct a systematic measurement study of harmful geometry generation and mitigation. We first describe this risk through three kinds of unsafe categories: direct-use physical hazards, risky templates or components, and deceptive replicas. Each category is instantiated with representative objects. We evaluate both open-source and commercial image-to-3D models under original, degraded, viewpoint-shifted, and semantically camouflaged inputs. We consider different evaluation metrics, including geometric validity, multi-view VLM-based semantic scoring, targeted human validation, and controlled physical fabrication. The results reveal a concerning reality that current image-to-3D models can effectively reconstruct the harmful geometries, while fewer than 0.3% of such geometries trigger commercial moderation flags. As a first step toward mitigation, we evaluate three representative safeguard families, including input moderation, model-level benign alignment, and output-level filtering. We find that existing safeguards have distinct weaknesses. We further develop a stacked defense that can reduce harmful retention to 1%, but still at 11% overall false-positive cost. Taken together, our findings demonstrate the risk in current systems and encourage better geometry-aware safeguards for moderation.

[CV-172] DAP: Doppler-aware Point Network for Heterogeneous mmWave Action Recognition

【速读】:该论文旨在解决毫米波(mmWave)雷达在人体动作识别(HAR)中因异构雷达源(如不同设备和频段)导致的分布偏移问题,现有数据集规模小且多为同源采集,限制了模型在真实场景中的跨源泛化能力。其解决方案的关键在于提出UniMM-HAR数据集与DAP-Net模型:前者标准化三种异构雷达配置以支持跨源评估;后者通过双空间多普勒重参数化(D2R)模块实现样本自适应几何稠密化与多普勒引导特征重校准,并结合文本对齐模块(TAM)利用预训练文本空间提供稳定语义锚点,从而增强模态内表征并实现跨模态对齐,学习源不变的动作语义,显著提升异构场景下的识别准确率与鲁棒性。

链接: https://arxiv.org/abs/2605.09604
作者: Jiaying Lin,Shiman Wu,Jinfu Liu,Can Wang,Mengyuan Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Millimeter-wave (mmWave) radar provides privacy-preserving sensing and is valuable for human action recognition (HAR). Existing mmWave point cloud datasets are limited in scale and mostly collected under homogeneous single-source settings, preventing current methods from handling real-world distribution shifts caused by heterogeneous radar sources, such as different devices and frequency bands. To address this, we introduce UniMM-HAR, the largest and first mmWave point cloud HAR dataset for heterogeneous multi-source scenarios, standardizing three distinct radar configurations to realistically evaluate cross-source generalization. We further propose the Doppler-aware Point Cloud Network (DAP-Net) to tackle heterogeneity challenges. DAP-Net enhances intra-modal representations and performs cross-modal alignment to learn source-invariant action semantics. Leveraging action-consistent spatio-temporal Doppler patterns as anchors, the Dual-space Doppler Reparameterization (D2R) module performs sample-adaptive geometric densification and Doppler-guided feature recalibration, while the Text Alignment Module (TAM) provides stable semantic anchors via a pretrained textual space. Experiments show that DAP-Net significantly outperforms existing methods under heterogeneous radar settings, achieving state-of-the-art accuracy and strong cross-source robustness.

[CV-173] SoccerLens: Grounded Soccer Video Understanding Beyond Accuracy NEURIPS2026

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在足球视频理解任务中存在“虚假相关性”和“捷径学习”问题,即模型可能依赖于非语义的视觉线索而非真正有意义的视觉证据进行预测,而现有评估协议仅关注分类准确率,缺乏对视觉定位(visual grounding)能力的衡量。解决方案的关键在于提出 SoccerLens 基准,该基准包含13类常见足球事件的标注视频片段,并按语义相关性组织成三个层次的结构化视觉提示(structured visual cues),同时扩展 Chefer 等人提出的归因方法,引入联合建模空间与时间注意力的新机制,并设计量化模型注意力是否对齐标注提示或偏移至伪相关区域的评估指标。实验表明,尽管当前最先进的足球 VLMs 在分类上表现优异,其视觉定位性能仍低于50%,且严重忽视时序信息,揭示了预测性能与真实视觉接地之间的显著差距,凸显了在复杂时空场景中开展基于接地的评估的重要性。

链接: https://arxiv.org/abs/2605.09598
作者: Ismael Elsharkawi,Ahmed Sait,Silvio Giancola,Bernard Ghanem,Hossam Sharara,Abdelrahman Eldesokey
机构: The American University in Cairo (美国大学开罗分校); KAUST (KAUST)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review in NeurIPS 2026

点击查看摘要

Abstract:Vision-language models (VLMs) have recently shown strong potential in soccer video understanding. However, given the high complexity of soccer videos due to large viewpoint variations, rapid shot transitions, and cluttered scenes, it remains unclear on whether VLMs rely on meaningful visual evidence or exploit spurious correlations and shortcut learning. Existing evaluation protocols focus primarily on classification accuracy and do not assess visual grounding. To address this limitation, we introduce SoccerLens, a benchmark for grounded soccer video understanding. The benchmark contains annotated video segments spanning 13 common soccer events, with structured visual cues organized into three levels of semantic relevance. We further extend the attribution method of Chefer [arXiv:2103.15679] to jointly model spatial and temporal attention, and introduce evaluation metrics that measure whether model attention aligns with annotated cues or drifts toward spurious regions. Our evaluation of state-of-the-art soccer VLMs shows that, despite strong classification accuracy, current models fail to exceed 50% grounding performance even under the loosest cue definitions and consistently underutilize temporal information. These results reveal a substantial gap between predictive performance and true visual grounding, highlighting the need for grounded evaluation in complex spatio-temporal domains such as soccer.
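
文中的“视觉定位”评估可以理解为:将归因得到的时空注意力图与标注线索区域对齐,统计落入线索区域的注意力质量占比。以下是一个示意性打分函数(假设 Chefer 风格的归因图已预先算好,函数与数据格式均为假设):

```python
# Illustrative grounding-score sketch; attribution maps and formats are assumptions.
import numpy as np

def grounding_score(attn, cue_masks):
    """attn: (T, H, W) non-negative attribution map over the clip.
    cue_masks: dict level -> (T, H, W) binary masks of annotated cues.
    Returns, per cue level, the share of attention mass falling inside the cues."""
    total = attn.sum() + 1e-8
    return {level: float((attn * mask).sum() / total) for level, mask in cue_masks.items()}

def temporal_share(attn):
    """How much attention lands on each frame -- useful to check whether the
    model actually exploits temporal information rather than a single frame."""
    per_frame = attn.reshape(attn.shape[0], -1).sum(axis=1)
    return per_frame / (per_frame.sum() + 1e-8)
```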

[CV-174] From Pixels to Concepts: Do Segmentation Models Understand What They Segment?

【速读】:该论文旨在解决当前可提示分割模型(promptable segmentation models)在评估中存在的一大盲区:现有基准主要关注掩码精度或目标物体是否存在,而未能有效检验模型是否真正实现了概念忠实的语义定位(concept-faithful grounding),即模型是否基于文本提示中的语义信息进行推理,而非依赖于视觉显著但语义误导的线索。解决方案的关键在于提出 CAFE(Counterfactual Attribute Factuality Evaluation)——一个基于属性级反事实操作的新颖评测基准。其核心机制是保持目标区域和真实掩码不变,仅修改表面外观、上下文或材质组成等属性以引入误导性语义线索,并构建包含2,146对测试样本的数据集,涵盖三类反事实场景:浅层模仿(Superficial Mimicry, SM)、上下文冲突(Context Conflict, CC)和本体论冲突(Ontological Conflict, OC)。实验表明,模型常能在误导性提示下生成高精度掩码,揭示了定位质量与概念区分能力之间存在系统性差距,从而为诊断模型是否实现语义忠实分割提供了可控且严谨的评估框架。

链接: https://arxiv.org/abs/2605.09591
作者: Shuang Liang,Zeqing Wang,Yuxian Li,Xihui Liu,Han Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 30 pages, 8 figures

点击查看摘要

Abstract:Segmentation is a fundamental vision task underlying numerous downstream applications. Recent promptable segmentation models, such as Segment Anything Model 3 (SAM3), extend segmentation from category-agnostic mask prediction to concept-guided localization conditioned on high-level textual prompts. However, existing benchmarks primarily evaluate mask accuracy or object presence, leaving unclear whether these models faithfully ground the queried concept or instead rely on visually salient but semantically misleading cues. We introduce CAFE: Counterfactual Attribute Factuality Evaluation, a novel benchmark for evaluating concept-faithful segmentation in promptable segmentation models. Our CAFE is built on attribute-level counterfactual manipulation: the target region and ground-truth mask are preserved, while attributes such as surface appearance, context, or material composition are modified to introduce misleading semantic cues. The benchmark contains 2,146 paired test samples, each consisting of a target image, a ground-truth mask, a positive prompt, and a misleading negative prompt. These samples cover three counterfactual categories: Superficial Mimicry (SM), Context Conflict (CC), and Ontological Conflict (OC). We evaluate various model types and sizes on our CAFE. Experiments reveal a systematic gap between localization quality and concept discrimination: models often generate accurate masks even for misleading prompts, suggesting that strong mask prediction does not necessarily imply faithful semantic grounding. Our CAFE provides a controlled benchmark for diagnosing whether promptable segmentation models perform concept-faithful grounding rather than shortcut-driven mask retrieval.

[CV-175] DeformMaster: An Interactive Physics-Neural World Model for Deformable Objects from Videos

【速读】:该论文旨在解决从真实视频中学习可交互的柔体物体世界模型(world model)的问题,其核心挑战在于如何从视觉观测中推断物理状态、在新交互下进行高保真动态演化,并准确重建外观变化。解决方案的关键在于提出DeformMaster框架,该框架通过统一的动力学与外观建模机制,实现结构化物理演进与神经残差补偿相结合,以处理未建模效应;同时将稀疏手部运动建模为分布式的柔性驱动器,用于手-连续体交互的接地建模;采用空间变化的本构专家网络表示材料响应,并基于预测的物理演化驱动高保真四维(4D)外观生成,从而在真实柔体物体序列上实现了未来动态预测、新颖动作回放、材料参数调整及动态新视角合成等能力。

链接: https://arxiv.org/abs/2605.09586
作者: Can Li,Zhoujian Li,Ren Li,Jie Gu,Lei Lei,Jingmin Chen,Lei Sun
机构: Nankai University (南开大学); Zhejiang University (浙江大学); Southern University of Science and Technology (南方科技大学); Rightly Robotics, A4X (Rightly Robotics, A4X); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:World models for deformable objects should recover not only geometry and appearance, but also underlying physical dynamics, interaction grounding, and material behavior. Learning such a model from real videos is challenging because deformable linear, planar, and volumetric objects evolve under high-dimensional deformation, noisy interactions, and complex material response. The model must therefore infer a physical state from visual observations, roll it forward under new interactions, and render the resulting dynamics with high visual fidelity. We present DeformMaster, a video-derived interactive physics–neural world model that turns real interaction videos into an online interactive model of deformable objects within a unified dynamics-and-appearance framework. DeformMaster preserves structured physical rollout while using a neural residual to compensate for unmodeled effects, grounds sparse hand motion as distributed compliant actuator for hand–continuum interaction, represents material response with spatially varying constitutive experts, and drives high-fidelity 4D appearance from the predicted physical evolution. Experiments on real-world deformable-object sequences demonstrate DeformMaster’s ability to roll out future dynamics and render dynamic appearance, outperforming state-of-the-art baselines while supporting novel action rollout, material-parameter variation, and dynamic novel-view synthesis.

[CV-176] FPGA-Based Hardware Architecture for Contrast Maximization in Event-Based Vision

【速读】:该论文旨在解决事件驱动视觉系统中运动参数估计的实时性与能效问题,尤其是在高帧率、低功耗嵌入式场景下的计算瓶颈。其核心解决方案是基于现场可编程门阵列(FPGA)实现对比度最大化(Contrast Maximization, CM)算法的硬件架构,通过利用FPGA的确定性并行结构和深度流水线设计,显著提升处理吞吐量与能效比。关键创新在于将CM算法中的事件重映射(event warping)、对比度计算与迭代优化模块进行硬件化重构,并采用面向硬件的优化方法,使运动参数估计速度较CPU和GPU方案提升超过200倍,从而为高速、低功耗嵌入式系统提供可靠且高效的实时运动估计能力。

链接: https://arxiv.org/abs/2605.09581
作者: Michal Filipkowski,Marcin Kowalczyk,Tomasz Kryjak
机构: AGH University of Krakow (克拉科夫 AGH 大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for ARC 2026

点击查看摘要

Abstract:This paper presents a hardware architecture that implements the Contrast Maximization (CM) algorithm in Field-Programmable Gate Array (FPGA) resources for event-based vision systems. CM estimates motion parameters by maximizing the contrast of an Image of Warped Events (IWE) reconstructed from asynchronous event streams. Event-based vision sensors generate sparse data with high temporal resolution and low spatial redundancy, which makes them well suited for hardware processing. The deterministic, massively parallel structure of the FPGA is leveraged to design a deeply pipelined architecture capable of high-throughput, energy-efficient processing suitable for real-time embedded applications. This paper details the hardware modules responsible for event warping, contrast computation, and iterative optimization, discusses key implementation decisions, and presents the hardware-aware optimization method used in the design. Experimental results demonstrate a substantial speed and efficiency improvement over CPU- and GPU-based implementations, with motion parameter estimation executing over 200 times faster. To the best of our knowledge, this is the first hardware architecture enabling acceleration of CM algorithm computations. Its performance is evaluated in terms of processing speed, energy efficiency, and hardware resource utilization. The proposed design is validated using an event-based object tracking application. The results confirm that the architecture provides a solid foundation for real-time motion estimation in high-speed, low-power embedded systems.
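
作为参考,对比度最大化(CM)的核心计算可以用几行 NumPy 概括:按候选运动参数对事件做重映射,累积成 IWE,再以图像方差作为对比度并在参数空间搜索最大值。FPGA 设计用深度流水线的迭代优化替代了这里的暴力网格搜索;以下实现中的参数范围与步数均为假设:

```python
# Illustrative reference sketch of the CM objective; search ranges are assumptions.
import numpy as np

def iwe_contrast(events, theta, shape):
    """events: (N, 3) array of (x, y, t); theta: 2D velocity (vx, vy) in px/s.
    Warps events back to t=0, accumulates them into an image, returns its variance."""
    x = events[:, 0] - theta[0] * events[:, 2]
    y = events[:, 1] - theta[1] * events[:, 2]
    xi = np.clip(np.round(x).astype(int), 0, shape[1] - 1)
    yi = np.clip(np.round(y).astype(int), 0, shape[0] - 1)
    iwe = np.zeros(shape)
    np.add.at(iwe, (yi, xi), 1.0)           # accumulate warped events
    return iwe.var()                         # contrast = variance of the IWE

def estimate_motion(events, shape, v_range=(-200, 200), steps=41):
    """Brute-force grid search over velocities (the FPGA design replaces this
    with an iterative, deeply pipelined optimizer)."""
    grid = np.linspace(*v_range, steps)
    best, best_c = None, -np.inf
    for vx in grid:
        for vy in grid:
            c = iwe_contrast(events, (vx, vy), shape)
            if c > best_c:
                best, best_c = (vx, vy), c
    return best, best_c
```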

[CV-177] KAN Text to Vision? The Exploration of Kolmogorov-Arnold Networks for Multi-Scale Sequence-Based Pose Animation from Sign Language Notation

【速读】:该论文旨在解决从符号化表示(如HamNoSys)到二维人体姿态序列的生成式转换问题,以实现可扩展的无障碍手语动画生成。其核心挑战在于如何高效且准确地将离散的音位符号映射为连续的身体运动学参数,同时保持结构一致性与细节精度。解决方案的关键在于提出一种多尺度序列生成框架KANMultiSign,包含两个互补设计:一是采用粗到细的生成策略并辅以多尺度监督机制,先通过身体-手-面部骨架引导全局结构一致性,再精细化手部动作以提升指部细节;二是引入Kolmogorov–Arnold Network (KAN) 模块嵌入Transformer主干网络,利用可学习的一元函数基元对符号到运动学的高非线性映射进行紧凑参数化建模。实验表明,该方法在多个语言数据集上显著降低基于动态时间规整的关节误差,同时大幅减少模型参数量,验证了多尺度监督是性能提升的核心驱动力,而KAN则提供了高效的替代建模方式。

链接: https://arxiv.org/abs/2605.09572
作者: Guanyi Du,Lintao Wang,Kun Hu,Ziyang Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Accepted at Neurocomputing

点击查看摘要

Abstract:Sign language production from symbolic notation offers a scalable route to accessible sign animation. We present KANMultiSign, a multi-scale sequence generator that translates HamNoSys notation into two-dimensional human pose sequences. Our framework makes two complementary contributions. First, we introduce a coarse-to-fine generation strategy with multi-scale supervision: the model is first guided by an intermediate body–hand–face scaffold to encourage global structural coherence, and then refines fine-grained hand articulation to improve finger-level detail. Second, we investigate integrating Kolmogorov–Arnold Network modules into a Transformer backbone, using learnable univariate function primitives to model the highly non-linear mapping from discrete phonological symbols to continuous body kinematics with a compact parameterization. Experiments on multiple public corpora spanning Polish, German, Greek, and French sign languages show consistent reductions in dynamic time warping based joint error compared with a strong notation-to-pose baseline, while using substantially fewer parameters. Controlled ablations further indicate that KAN-based variants substantially reduce parameter count while maintaining competitive performance when coupled with multi-scale supervision, rather than serving as the main driver of accuracy gains. These findings position multi-scale supervision as the key mechanism for improving notation-conditioned pose generation, with KAN offering a compact alternative for efficient modeling. Our code will be publicly available.
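
Kolmogorov–Arnold 层的基本思想是为每条输入-输出连接配一个可学习的一元函数。下面用高斯 RBF 基给出一个最小 KAN 风格层的示意实现(论文实际可能采用样条等不同参数化,此处的基函数、初始化与用法均为假设):

```python
# Illustrative minimal KAN-style layer with an RBF basis; not the paper's parameterization.
import torch
import torch.nn as nn

class KANLayer(nn.Module):
    """output_j = sum_i phi_ij(x_i), each phi_ij a learnable combination of RBF basis functions."""
    def __init__(self, in_dim, out_dim, n_basis=8, x_min=-2.0, x_max=2.0):
        super().__init__()
        centers = torch.linspace(x_min, x_max, n_basis)
        self.register_buffer("centers", centers)            # (n_basis,)
        self.width = (x_max - x_min) / n_basis
        self.coef = nn.Parameter(torch.randn(in_dim, out_dim, n_basis) * 0.1)

    def forward(self, x):                                    # x: (B, in_dim)
        # evaluate the RBF basis for every input coordinate
        basis = torch.exp(-((x.unsqueeze(-1) - self.centers) / self.width) ** 2)  # (B, in, n_basis)
        # phi_ij(x_i) = sum_k coef[i, j, k] * basis[b, i, k], summed over inputs i
        return torch.einsum("bik,iok->bo", basis, self.coef)

# usage sketch: a compact replacement for an MLP mapping notation embeddings to pose deltas,
# e.g. layer = KANLayer(256, 2 * 21)  # 21 keypoints, (x, y) each -- hypothetical sizes
```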

[CV-178] Dual-Path Hyperprior Informed Deep Unfolding Network for Image Compressive Sensing

【速读】:该论文旨在解决现有深度展开网络(Deep Unfolding Networks, DUNs)在压缩感知(Compressive Sensing, CS)中面临的两个关键问题:一是仅依赖单一测量流,限制了不同测量子集间的有效信息交互;二是对图像所有区域进行均匀处理,忽略了由不同纹理引起的重建难度差异。解决方案的核心在于提出一种双路径超先验引导的深度展开网络(Dual-Path Hyperprior Informed Deep Unfolding Network, DPH-DUN),其通过将测量划分为双子集实现超先验引导的协同重建。具体而言,该方案包含两个分支:1)在超先验学习分支中设计轻量级神经模块以高效生成多域超先验知识;2)在超先验引导重建分支中构建带超先验指导的迭代优化框架,其中梯度下降步骤引入超先验引导的步长生成网络以动态生成空间变化的步长图,实现自适应细粒度更新; proximal映射步骤则引入两种基于梯度的硬/软注意力机制,动态聚焦于重建困难区域,从而显著提升CS重建精度。

链接: https://arxiv.org/abs/2605.09566
作者: Tianyi Lu,Wenxue Cui,Shaohui Liu
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent Deep Unfolding Networks (DUNs) have significantly advanced Compressive Sensing (CS) by integrating iterative optimization with deep networks. However, existing DUNs still suffer from two challenges: 1) Reliance on a single measurement stream, which limits effective information interaction across distinct measurement subsets. 2) Uniform processing of all image regions, which overlooks varying reconstruction difficulties induced by diverse textures. To address these limitations, a novel Dual-Path Hyperprior Informed Deep Unfolding Network (DPH-DUN) is proposed, which partitions measurements into double subsets to enable hyperprior-guided reconstruction via a dual-path architecture. In the Deep Hyperprior Learning branch, a series of lightweight neural modules are designed to efficiently generate hyperprior knowledge of different domains, enabling collaborative guidance for the CS reconstruction. In the Hyperprior Informed Reconstruction branch, a deep unfolding framework with hyperprior guidance is constructed to iteratively refine reconstruction. Specifically, i) in the gradient descent step, a Hyperprior Informed Step Size Generation network is designed to dynamically generate spatially varying step maps, enabling adaptive fine-grained gradient updates. ii) In the proximal mapping step, two well-designed hyperprior informed attention mechanisms are introduced to dynamically focus on challenging regions via gradient-based hard and soft attentions, facilitating CS reconstruction accuracy. Extensive experiments demonstrate that the proposed DPH-DUN outperforms existing CS methods.
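
下面示意深度展开中的单次迭代:梯度下降步使用由超先验特征生成的空间变化步长图,随后接一个 proximal 映射网络做残差精化。模块规模与 A/At 算子接口均为假设,仅用于说明结构:

```python
# Illustrative single unfolding stage; module sizes and operator interfaces are assumptions.
import torch
import torch.nn as nn

class UnfoldingStage(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.step_net = nn.Sequential(               # hyperprior features -> per-pixel step map
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, 3, padding=1), nn.Softplus(),
        )
        self.prox = nn.Sequential(                    # proximal-mapping network
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, x, y, A, At, hyperprior):
        # A / At: callables for the sensing operator and its adjoint (image-shaped in/out)
        grad = At(A(x) - y)                            # data-fidelity gradient At(Ax - y)
        step = self.step_net(hyperprior)               # (B, 1, H, W) spatially varying step map
        x = x - step * grad                            # fine-grained adaptive gradient update
        return x + self.prox(x)                        # residual proximal refinement
```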

[CV-179] PhysHanDI: Physics-Based Reconstruction of Hand-Deformable Object Interactions ICML2026

【速读】:该论文旨在解决现有手-物体交互重建方法在处理非刚性变形物体(如布料、填充玩具等)时的局限性,即要么仅限于刚性或分段刚性物体,无法建模真实世界中复杂形变;要么虽能模拟可变形物体但缺乏完整的3D手部重建。其解决方案的关键在于提出PhysHanDI(基于物理的手与可变形物体交互重建框架),通过将密集重建的3D手部运动作为驱动力,物理模拟物体的形变,从而确保物体动态符合物理规律且与手部动作一致;同时,利用逆向物理机制,使物体形变模拟反过来优化和提升手部重建精度。

链接: https://arxiv.org/abs/2605.09538
作者: Jihyun Lee,Changmin Lee,Donghwan Kim,Tae-Kyun Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Accepted to ICML 2026

点击查看摘要

Abstract:While existing methods for reconstructing hand-object interactions have made impressive progress, they either focus on rigid or part-wise rigid objects-limiting their ability to model real-world objects (e.g., cloth, stuffed animals) that exhibit highly non-rigid deformations-or model deformable objects without full 3D hand reconstruction. To bridge this gap, we present PhysHanDI (Physics-based Reconstruction of Hand and Deformable Object Interactions), a framework that enables full 3D reconstruction of both interacting hands and non-rigid objects. Our key idea is to physically simulate object deformations driven by forces induced from densely reconstructed 3D hand motions, ensuring that the reconstructed object dynamics are both physically plausible and coherent with the interacting hand movements. Furthermore, we demonstrate that such simulation of object deformations can, in turn, refine and improve hand reconstruction via inverse physics. In experiments, PhysHanDI outperforms the state-of-the-art baseline across reconstruction and future prediction.

[CV-180] QueST: Persistent Queries as Semantic Monitors for Drift Suppression in Long-Horizon Tracking

【速读】:该论文旨在解决长时视频跟踪中因关节运动、遮挡和视角变化导致的误差累积问题,即“无声语义漂移”(silent semantic drift),传统帧间局部匹配方法难以检测和纠正此类漂移。解决方案的关键在于提出QueST框架,其核心思想是将交互相关实体视为持久的语义查询(persistent semantic queries)而非瞬时点轨迹,并在每个时间步通过全局时空特征注意力机制实现稳定语义锚定;同时引入轻量级3D物理约束,利用几何合理性抑制遮挡下的无界漂移,从而提升长期跟踪的准确性与身份一致性。

链接: https://arxiv.org/abs/2605.09513
作者: Mayank Anand,Mohammad Saqlain,Kyan Mahajan,Priya Shukla,Gora Chand Nandi,Andrew Melnik
机构: Indian Institute of Information Technology Allahabad (印度信息科技学院艾哈迈达巴德分校); University of Bremen (不来梅大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Tracking points in videos is typically formulated as frame-to-frame correspondence, where each point is matched locally to the next frame. While this works over short horizons, errors accumulate under articulation, occlusion, and viewpoint change, leading to silent semantic drift that existing trackers cannot detect or correct. In this work, we revisit long-horizon tracking from a monitoring perspective and introduce QueST, a monitoring-by-design framework that treats interaction-relevant entities as persistent semantic queries rather than transient point tracks. Instead of local propagation, each query attends globally over spatio-temporal video features at every time-step, providing a stable semantic anchor across time. We further constrain query trajectories with lightweight 3D physical grounding, using geometric plausibility to suppress unbounded drift under occlusion. We evaluate QueST on long-horizon articulated sequences from PartNet-Mobility in SAPIEN and compare against RAFT-3D, CoTracker, and TAP-Net. QueST substantially reduces terminal drift achieving a 67.7% Absolute Point Error (APE) improvement over TAP-Net while better preserving identity over extended horizons. Our results show that embedding semantic monitoring directly into perception enables more reliable long-horizon tracking under distribution shift.

[CV-181] Uncertainty-Aware and Decoder-Aligned Learning for Video Summarization IJCNN2026

【速读】:该论文旨在解决视频摘要(video summarization)任务中因标注主观性强和依赖离散解码过程(如时间分段与背包选择)所带来的挑战。现有方法要么学习确定性重要性分数而忽略标注不确定性,要么采用复杂生成模型导致训练与推理成本上升。其解决方案的关键在于提出一种不确定性感知且解码对齐的学习框架 VASTSum:首先通过变分公式预测帧级概率重要性分数,显式建模多标注者监督下的不确定性;其次设计一种监督策略,鼓励模型对齐合理的多人标注模式而非强制单一共识目标,以应对主观性问题;最后引入解码对齐正则化项,提升背包算法选择摘要时的稳定性,降低对预测分数微小扰动的敏感性。该方法在 SumMe 和 TVSum 数据集上实现了高效单次前向传播下的鲁棒性能提升。

链接: https://arxiv.org/abs/2605.09507
作者: Omer Tariq,Syed Muhammad Raza,Jeongbae Son
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for presentation at the 2026 International Joint Conference on Neural Networks (IJCNN 2026)

点击查看摘要

Abstract:Video summarization aims to produce a compact representation of a long video by selecting a subset of temporally important segments that best reflect human preferences. This task is inherently difficult due to strong annotation subjectivity and the reliance on discrete decoding procedures, such as temporal segmentation and knapsack-based selection, during evaluation. Most existing approaches either learn deterministic importance scores that overlook these characteristics or adopt complex generative models that increase training and inference cost. In this paper, we propose VASTSum, an uncertainty-aware and decoder-aligned learning framework for video summarization that addresses both challenges within a single-pass model. The proposed method predicts probabilistic frame-level importance scores using a variational formulation, enabling explicit modeling of uncertainty arising from multi-annotator supervision. To account for subjectivity, particularly under binary annotations, we employ a supervision strategy that encourages alignment with plausible human annotation modes rather than enforcing a single consensus target. Furthermore, we introduce a decoder-aligned regularization that promotes stability of knapsack-based summary selection, reducing sensitivity to small perturbations in predicted scores. We evaluate the proposed framework on the SumMe and TVSum benchmarks using standard rank-based metrics. Experimental results show consistent and competitive Kendall and Spearman correlations across multiple data splits, demonstrating improved robustness under annotation disagreement while maintaining efficient single-forward inference. These results indicate that explicitly modeling uncertainty and aligning learning objectives with the decoding stage provide a principled alternative to both deterministic and diffusion-based video summarization methods.
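
以下给出一个示意性的 0/1 背包片段选择实现,对应摘要中评测阶段“knapsack-based selection”这一离散解码步骤(非论文官方代码;片段分数、时长与 15% 的摘要预算均为假设输入)。

```python
from typing import List

def knapsack_select(scores: List[float], durations: List[int], budget: int) -> List[int]:
    """0/1 背包:在总时长不超过 budget 的约束下最大化片段分数之和,返回被选片段索引。"""
    dp = [(0.0, [])] * (budget + 1)          # dp[w] = (当前最优总分, 对应选取的片段索引)
    for i, (s, d) in enumerate(zip(scores, durations)):
        new_dp = dp[:]
        for w in range(d, budget + 1):
            cand = dp[w - d][0] + s
            if cand > new_dp[w][0]:
                new_dp[w] = (cand, dp[w - d][1] + [i])
        dp = new_dp
    return max(dp, key=lambda x: x[0])[1]

# 示例:6 个片段,摘要总长不超过视频时长的 15%
scores = [0.9, 0.2, 0.7, 0.4, 0.8, 0.1]      # 模型预测的片段重要性(示意)
durations = [4, 6, 5, 3, 4, 8]               # 片段时长(秒)
budget = int(sum(durations) * 0.15)
print(knapsack_select(scores, durations, budget))
```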

[CV-182] PermuQuant: Lowering Per-Group Quantization Error by Reordering Channels for Diffusion Models

【速读】:该论文旨在解决低比特(low-bit)扩散模型在后训练量化(Post-training Quantization, PTQ)过程中因通道统计差异导致的严重性能退化问题。现有PTQ方法在极低比特设置下,由于同一组内包含激活和权重统计差异较大的通道,使得共享的量化尺度被异常值主导,从而引发显著的量化误差。解决方案的关键在于提出PermuQuant框架,其核心创新是基于联合二阶矩准则对通道进行排序,将具有相似激活与权重统计特性的通道分配至同一量化组中,同时引入校准数据驱动的接受规则以选择最优排列方式,并通过静态重排避免运行时开销。该方法显著降低了量化误差,在多个大型扩散模型上实现优于现有基线的效果,例如在FLUX.1-dev模型上实现了W4A4 NVFP4量化下的3.5倍内存压缩和1.8倍单步推理加速。

链接: https://arxiv.org/abs/2605.09503
作者: Yongsen Cheng,Kai Liu,Kaiwen Tao,Junxian Li,Zhixin Wang,Zhikai Chen,Renjing Pei,Yulun Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large-scale visual generative models have achieved remarkable performance. However, their high computational and memory costs make deployment challenging in resource-constrained scenarios, such as interactive applications and personal single-GPU usage. Post-training quantization (PTQ) offers a practical solution by compressing pretrained models without expensive retraining. However, existing PTQ methods still suffer from severe quality degradation under extremely low-bit settings. In this paper, we identify channel ordering as an important but underexplored factor in per-group quantization. In this setting, each contiguous group shares one quantization scale. When channels with very different statistics are placed in the same group, the scale can be dominated by outliers and cause large quantization errors. Based on this observation, we propose PermuQuant, a simple and effective PTQ framework for low-bit diffusion models. PermuQuant sorts channels by a joint second-moment criterion before per-group quantization, placing channels with similar activation and weight statistics into the same group. It further uses a calibration-based acceptance rule to apply reordering only when the selected permutation reduces quantization error on calibration data. The selected permutations are absorbed into adjacent modules or applied to weights offline, avoiding explicit runtime permutation operations. Extensive experiments on multiple large diffusion models show that PermuQuant consistently reduces quantization error and outperforms existing PTQ baselines. On FLUX.1-dev with an RTX 5090, PermuQuant achieves up to a 1.8x single-step speedup and reduces the DiT memory footprint by 3.5x under W4A4 NVFP4 quantization. Code will be available at this https URL.
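
按摘要的思路,可以用一个极简实验体会“通道重排如何降低逐组量化误差”:先以权重与激活的联合二阶矩对通道排序,再比较排序前后组内共享 scale 的对称量化误差。以下为示意代码(非论文官方实现;分组大小、比特数与激活统计量均为假设)。

```python
import torch

def per_group_quant_error(w: torch.Tensor, group_size: int, n_bits: int = 4) -> torch.Tensor:
    """对 [out, in] 权重按输入通道分组做对称量化,返回总平方误差。"""
    qmax = 2 ** (n_bits - 1) - 1
    err = torch.tensor(0.0)
    for s in range(0, w.shape[1], group_size):
        g = w[:, s:s + group_size]
        scale = g.abs().max() / qmax + 1e-8                  # 组内共享 scale,容易被离群通道主导
        q = torch.clamp(torch.round(g / scale), -qmax - 1, qmax) * scale
        err = err + ((q - g) ** 2).sum()
    return err

torch.manual_seed(0)
w = torch.randn(64, 128)
w[:, ::16] *= 8.0                                            # 人为制造少量离群输入通道
act_sq = torch.rand(128) + 0.1                               # 假设的激活二阶矩(来自校准数据)
key = w.pow(2).mean(0) * act_sq                              # 联合二阶矩排序指标(示意)
perm = torch.argsort(key)                                    # 统计量相近的通道被排到同一组
print("原始顺序误差:", per_group_quant_error(w, 32).item())
print("重排后误差:  ", per_group_quant_error(w[:, perm], 32).item())
```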

[CV-183] Outlier-Robust Diffusion Solvers for Inverse Problems CVPR2026

【速读】:该论文旨在解决基于扩散模型(Diffusion Models, DMs)求解逆问题(Inverse Problems, IPs)时对异常值(outliers)敏感的问题,这类异常值在真实世界测量中普遍存在。解决方案的关键在于两个核心步骤:首先通过显式噪声估计对观测数据进行预处理以降低噪声影响;其次构建基于Huber损失的迭代加权最小二乘优化目标,并采用共轭梯度法(Conjugate Gradient Method)高效近似求解该鲁棒优化问题,从而避免传统梯度下降方法对学习率调参的敏感性。实验表明,该方法在多种线性和非线性任务下均表现出更强的抗异常值能力,并优于当前主流DM-based方法。

链接: https://arxiv.org/abs/2605.09477
作者: Yang Zheng,Jiahua Liu,Tongyao Pang,Wen Li,Zhaoqiang Liu
机构: University of Electronic Science and Technology of China (电子科技大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Methods based on diffusion models (DMs) for solving inverse problems (IPs) have recently achieved remarkable performance. However, DM-based methods typically struggle against outliers, which are common in real-world measurements. In this work, to tackle IPs with outliers, we first refine the measurement via explicit noise estimation to mitigate the effect of noise. Subsequently, we formulate an iteratively reweighted least squares objective based on the Huber loss to address the outliers. We propose a method utilizing gradient descent to approximately solve the corresponding optimization problem for the robust objective. To avoid delicate tuning of the learning rate required by the gradient descent method, we further employ the conjugate gradient method with an efficient strategy for updating. Extensive experiments on multiple image datasets for linear and nonlinear tasks under various conditions demonstrate that our proposed methods exhibit robustness to outliers and outperform recent DM-based methods in most cases.
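
摘要中的鲁棒目标对应经典的 Huber 迭代加权最小二乘(IRLS)。下面给出一个与扩散先验无关的最小示意,仅演示 IRLS 权重如何抑制离群测量(矩阵规模、阈值 delta 等均为假设;论文实际在扩散采样中用共轭梯度求解对应子问题)。

```python
import numpy as np

def huber_irls(A: np.ndarray, y: np.ndarray, delta: float = 1.0, n_iter: int = 20) -> np.ndarray:
    """最小化 sum huber(y - A x):每轮按残差计算权重 w_i = min(1, delta/|r_i|) 并重解加权最小二乘。"""
    x = np.linalg.lstsq(A, y, rcond=None)[0]
    for _ in range(n_iter):
        r = y - A @ x
        w = np.minimum(1.0, delta / (np.abs(r) + 1e-12))
        sw = np.sqrt(w)
        x = np.linalg.lstsq(A * sw[:, None], sw * y, rcond=None)[0]
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 5))
x_true = rng.normal(size=5)
y = A @ x_true + 0.01 * rng.normal(size=200)
y[:10] += 20.0                                   # 注入 10 个离群测量
print(np.linalg.norm(np.linalg.lstsq(A, y, rcond=None)[0] - x_true))   # 普通最小二乘:被离群值带偏
print(np.linalg.norm(huber_irls(A, y) - x_true))                       # Huber IRLS:通常更接近真值
```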

[CV-184] When Few Steps Are Enough: Training-Free Acceleration of Identity-Preserved Generation

【速读】:该论文旨在解决身份保留的图像生成(identity-preserved image generation)在部署阶段计算成本高昂的问题,尤其是基于多步扩散(diffusion)主干网络时。其核心解决方案是:通过一个冻结的InfuseNet身份适配器(identity adapter),直接迁移至蒸馏后的快速主干网络(distilled schnell backbone),无需重新训练,仅需两行代码修改——替换主干路径并禁用无分类器引导(classifier-free guidance)。这一策略将延迟降低5.9倍,同时提升ArcFace身份相似度(+0.028)和LPIPS感知质量(-0.016),关键在于发现身份保真度在早期采样步骤(4–8步)即进入有效区间,后续步骤主要优化视觉细节与对比度,从而实现效率与保真度的最优权衡。

链接: https://arxiv.org/abs/2605.09460
作者: Dongqi Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Identity-preserved image generation is typically built on many-step diffusion backbones, making personalized generation expensive at deployment time. We show that this cost is often unnecessary for identity-conditioned FLUX generation. A frozen InfuseNet identity adapter trained with dev transfers directly to the distilled schnell backbone without retraining. This two-line replacement (changing the backbone path and disabling classifier-free guidance) reduces latency by 5.9x while improving ArcFace identity similarity by +0.028 and LPIPS by -0.016 over the standard 28-step dev baseline. To explain why this works, we analyze the denoising trajectory and find that identity fidelity enters an early effective regime, often within 4-8 steps, while later steps primarily refine visual detail, sharpness, and contrast. Adapter ablations confirm that identity formation depends on the identity adapter, while attention-stream norm probes suggest that the relative conditioning contribution decreases as sampling proceeds. Preliminary style-adapter and object-adapter sweeps on SDXL and SD1.5 show similar diminishing returns after intermediate steps. These results position distilled backbone replacement as a simple, training-free strategy for improving the efficiency-fidelity tradeoff of identity-preserved generation.

[CV-185] Adaptive 3D Convolution for Remote Sensing Image Fusion

【速读】:该论文旨在解决遥感图像融合中因传统深度学习方法将光谱信息编码为特征图通道而导致的显著光谱失真问题,以及标准3D卷积在处理多光谱/高光谱图像时因全局共享卷积核而无法有效捕捉细粒度空间-光谱特征、且计算复杂度高的局限性。其解决方案的关键在于提出一种新颖的自适应3D卷积(Adaptive 3D Convolution, Ada3D)机制:通过两阶段生成策略,分别从空间和光谱源中提取空间与光谱核,并将其融合为内容感知的3D核,从而实现每个输入体素(voxel)的独立自适应卷积操作;同时引入自适应偏置项以提升局部响应精度,并结合组卷积(group convolution)技术降低计算开销,最终在保持高效率的同时显著提升了融合图像的空间细节与光谱保真度。

链接: https://arxiv.org/abs/2605.09455
作者: Siran Peng,Xiangyu Zhu,Shang-Qi Deng,Liang-Jian Deng,Zhen Lei
机构: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing 100190, China; School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing 100049, China; State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an 710049, No.28 Xianning West Road, China; School of Mathematical Sciences/Multi-Hazard Early Warning Key Laboratory of Sichuan Province, University of Electronic Science and Technology of China (UESTC), Chengdu, Sichuan 611731, China; Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science and Innovation, Chinese Academy of Sciences, Hong Kong, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE Transactions on Image Processing (TIP), Early Access, 2026

点击查看摘要

Abstract:Remote sensing image fusion aims to create a high-resolution multi/hyper-spectral image from a high-resolution image with limited spectral information and a low-resolution image with abundant spectral data. Recently, deep learning (DL) techniques have shown significant effectiveness in this area. Most DL-based methods approach image fusion as a 2D problem by encoding spectral information into feature map channels. However, our research suggests that this strategy introduces notable spectral distortions. In contrast, some methods consider spectral data as an additional dimension, utilizing standard 3D convolutions to preserve spectral information. Nevertheless, in a standard 3D convolutional layer, the same set of kernels is applied across all input regions, which we have found to be sub-optimal for image fusion. Furthermore, standard 3D convolutions necessitate substantial computational resources. To address these challenges, we propose a novel convolutional paradigm called Adaptive 3D Convolution (Ada3D) for remote sensing image fusion. Ada3D applies a unique set of 3D kernels to each input voxel, enabling the capture of fine-grained details. These adaptive kernels are generated through a two-step process: (i) spatial and spectral kernels are derived from their respective image sources; (ii) these two types of kernels are then combined to form content-aware 3D kernels that effectively integrate spatial and spectral information. Additionally, adaptive biases are introduced to enhance the convolutional outcome at the voxel level. Furthermore, we incorporate the group convolution technique to reduce computational complexity. As a result, Ada3D offers full adaptivity in an efficient manner. Evaluation results across five datasets demonstrate that our method achieves SOTA performance, underscoring the superiority of Ada3D. The code is available at this https URL.

[CV-186] SpaceMind: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在三维环境空间推理中缺乏以世界为中心的持久性表征问题,导致模型难以维持对象恒常性和空间拓扑结构的一致性。其核心解决方案是提出SpaceMind++架构,通过从RGB视频中显式构建体素化认知地图(voxelized cognitive map),将碎片化的自我中心视角观测重组为共享的3D度量表示,从而实现跨视角的空间一致性推理;关键创新在于引入坐标引导的深度迭代融合机制(Coordinate-Guided Deep Iterative Fusion),利用坐标嵌入和3D旋转位置编码(3D Rotary Positional Encoding)将地图级空间知识映射回原始2D视觉特征,使预训练视频MLLM无需修改其原有视觉token接口即可获得度量空间感知能力,模拟了海马旁回对感官特征的空间锚定机制。

链接: https://arxiv.org/abs/2605.09449
作者: Bo Gu,Zhikang Zhang,Zizhuang Wei,Zhenyuan Chen,Lingyun Li,Zhuoyi Song
机构: Fudan University (复旦大学); Huawei (华为); Shenzhen Loop Area Institute (深圳环区研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 3 figures

点击查看摘要

Abstract:Recent multimodal large language models (MLLMs) have made remarkable progress in visual understanding and language-based reasoning, yet they lack a persistent world-centered representation for spatially consistent reasoning in 3D environments. Inspired by the mammalian dual-stream system, where semantic and spatial cues are processed separately and integrated into an allocentric cognitive map, we propose SpaceMind++, a video MLLM architecture that explicitly builds a voxelized cognitive map from RGB videos. This map reorganizes fragmented egocentric observations into a shared 3D metric representation, enabling the model to preserve object permanence and spatial topology across changing viewpoints. To make this allocentric representation usable by a pretrained video MLLM without disrupting its native visual-token interface, we introduce Coordinate-Guided Deep Iterative Fusion, a new mechanism that relays map-level spatial knowledge back into the original 2D visual features. This fusion is explicitly guided by coordinate embeddings and 3D Rotary Positional Encoding, which ground semantic interactions in metric 3D space, resembling the entorhinal binding of sensory features to metric space. Extensive experiments show that SpaceMind++ achieves new state-of-the-art performance on VSI-Bench. Furthermore, it demonstrates superior out-of-distribution generalization on SPBench, SITE-Bench, and SPAR-Bench, underscoring its robustness in unseen 3D environments.
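
摘要中“把碎片化的自我中心观测重组为共享体素地图”在实现上通常是把反投影得到的 3D 点及其特征散射进体素网格并做平均。以下是一个与论文模型无关的示意(网格分辨率、空间范围与特征维度均为假设)。

```python
import torch

def build_voxel_map(points: torch.Tensor, feats: torch.Tensor,
                    grid: int = 32, extent: float = 4.0) -> torch.Tensor:
    """points: [N, 3] 世界坐标(约在 [-extent/2, extent/2]);feats: [N, C];
    返回 [grid, grid, grid, C] 的平均特征体素地图(示意)。"""
    idx = ((points / extent + 0.5) * grid).long().clamp(0, grid - 1)
    flat = idx[:, 0] * grid * grid + idx[:, 1] * grid + idx[:, 2]
    vmap = torch.zeros(grid ** 3, feats.shape[1])
    cnt = torch.zeros(grid ** 3, 1)
    vmap.index_add_(0, flat, feats)                      # 按体素累加特征
    cnt.index_add_(0, flat, torch.ones(len(flat), 1))    # 统计每个体素内的点数
    return (vmap / cnt.clamp_min(1)).view(grid, grid, grid, -1)

vox = build_voxel_map(torch.rand(5000, 3) * 4 - 2, torch.randn(5000, 64))
print(vox.shape)   # torch.Size([32, 32, 32, 64])
```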

[CV-187] SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation

【速读】:该论文旨在解决流式长视频生成中因连续语义切换导致的视觉演化不连贯问题,核心挑战在于如何在保持时间一致性的同时实现灵活的语义适应。现有方法依赖于提示边界处缓存重建或固定内存预算,易引入冗余计算且难以支持动态语义调整,其根本原因在于缓存的历史视频内容与提示更新之间存在语义错配。解决方案的关键是提出一种无需训练的框架SWIFT(Semantic Windowing and Injection for Flexible Transitions),其创新点包括:1)轻量级语义注入缓存(Semantic Injection Cache),通过增强已有缓存而非从头重建来实现高效语义切换;2)基于注意力头的语义注入机制(head-wise semantic injection),使每个注意力头按其与当前视频状态的对齐程度接收提示更新;3)自适应动态窗口机制(Adaptive Dynamic Window),根据提示阶段分配时间记忆资源,在切换边界使用较大局部上下文、稳定段落采用较小窗口以降低平均推理开销;4)段级语义锚点(segment-level semantic anchors),通过压缩的语义标记维持长程语义一致性。该方案在单张H100 GPU上达到22.6 FPS,显著提升了多提示长视频生成的效率与质量。

链接: https://arxiv.org/abs/2605.09442
作者: Shanwen Tan,Hao Li,Jingtao Zhang,Xiaosong Jia,Xue Yang,Shaofeng Zhang,Yanyong Zhang
机构: University of Science and Technology of China (中国科学技术大学); Fudan University (复旦大学); Georgia Institute of Technology (佐治亚理工学院); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Code is available at this https URL

点击查看摘要

Abstract:Streaming long-video generation faces a central challenge in continuous semantic switching, requiring adaptive memory to preserve coherent visual evolution. Current approaches rely on cache rebuilding at prompt boundaries or fixed memory budgets, but they introduce redundant computation and limit flexible semantic adaptation. This limitation arises from a mismatch between cached video history and prompt updates, as memory preserves visual continuity while prompt switches demand rapid semantic adaptation. Motivated by this observation, we present SWIFT, Semantic Windowing and Injection for Flexible Transitions, a training-free framework for multi-prompt long-video generation that enables efficient semantic switching while preserving temporal coherence in causal video diffusion models. SWIFT introduces a lightweight Semantic Injection Cache that augments cached video memory rather than reconstructing it from scratch at every prompt boundary. To avoid uniformly perturbing all attention channels, we further perform head-wise semantic injection, so that each attention head receives a prompt update proportional to its alignment with the current video state. In addition, we introduce an Adaptive Dynamic Window that allocates temporal memory according to prompt phase, using larger local context near switching boundaries and smaller windows during stable segments to reduce average inference cost. To preserve long-range semantic consistency under compressed local attention, we further maintain segment-level semantic anchors that summarize prompt-conditioned video history and reintroduce it as compact memory tokens. Compared with current state-of-the-art methods, SWIFT preserves generation quality while achieving 22.6 FPS on a single H100 GPU, establishing a substantially more efficient solution for multi-prompt long-video generation. Our code is available at this https URL.

[CV-188] Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs ICML2026

【速读】:该论文旨在解决现有文本到图像(text-to-image, T2I)生成模型偏好数据表示方式不适用于修正流(rectified flow, RF)模型的问题。传统偏好数据仅存储最终的胜者/败者图像,而RF模型的生成过程依赖于特定的先验噪声样本,并遵循近乎直线的去噪轨迹;若采用类似DPO(Direct Preference Optimization)的方法独立进行前向加噪以估计轨迹,则会导致与真实反向动力学不匹配并引入额外方差。解决方案的关键在于提出**先验噪声感知偏好优化(Prior Noise-Aware Preference Optimization, PNAPO)**框架:通过保留每对胜者/败者图像对应的先验噪声,将标准三元组(提示词、胜者、败者)扩展为六元组,并利用RF模型的直线性质通过噪声-图像插值精确估计中间状态,从而约束轨迹估计空间并获得更紧致的替代目标函数;同时引入动态正则化策略,根据胜败奖励差距和训练进度自适应调整DPO正则化强度,提升训练稳定性和样本效率。

链接: https://arxiv.org/abs/2605.09433
作者: Yunhong Lu,Qichao Wang,Hengyuan Cao,Xiaoyin Xu,Min Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML 2026

点击查看摘要

Abstract:Existing preference datasets for text-to-image models typically store only the final winner/loser images. This representation is insufficient for rectified flow (RF) models, whose generation is naturally indexed by a specific prior noise sample and follows a nearly straight denoising trajectory. In contrast, prior DPO-style alignment for diffusion models commonly estimates trajectories using an independent forward noising process, which can be mismatched to the true reverse dynamics and introduces unnecessary variance. We propose Prior Noise-Aware Preference Optimization (PNAPO), an off-policy alignment framework specialized for rectified flow. PNAPO augments preference data by retaining the paired prior noises used to generate each winner/loser image, turning the standard (prompt, winner, loser) triplet into a sextuple. Leveraging the straight-line property of RF, we estimate intermediate states via noise-image interpolation, which constrains the trajectory estimation space and yields a tighter surrogate objective for preference optimization. In addition, we introduce a dynamic regularization strategy that adapts the DPO regularization based on (i) the reward gap between winner and loser and (ii) training progress, improving stability and sample efficiency. Experiments on state-of-the-art RF T2I backbones show that PNAPO consistently improves preference metrics while substantially reducing training compute.
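
修正流的直线性意味着任意时间步的中间状态可以由图像与其配对的先验噪声直接线性插值得到,这正是把三元组扩展为携带噪声的六元组的意义所在。下面是一个符号层面的最小示意(x_t = (1-t)·x0 + t·ε 的约定与张量形状均为假设,并非论文官方实现)。

```python
import torch

def rf_intermediate(x0: torch.Tensor, noise: torch.Tensor, t: float):
    """修正流常用约定:x_t = (1 - t) * x0 + t * noise,对应的目标速度 v = noise - x0。"""
    x_t = (1.0 - t) * x0 + t * noise
    v_target = noise - x0
    return x_t, v_target

# 胜者/败者样本各自携带生成时使用的先验噪声(六元组中的两个额外元素)
x0_win, x0_lose = torch.randn(2, 4, 64, 64), torch.randn(2, 4, 64, 64)
eps_win, eps_lose = torch.randn_like(x0_win), torch.randn_like(x0_lose)
t = 0.3
xt_w, v_w = rf_intermediate(x0_win, eps_win, t)      # 胜者在 t 时刻的中间状态与目标速度
xt_l, v_l = rf_intermediate(x0_lose, eps_lose, t)    # 败者同理,可直接用于 DPO 式偏好损失
print(xt_w.shape, v_w.shape)
```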

[CV-189] FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation

【速读】:该论文旨在解决大规模自回归(Autoregressive, AR)图像生成模型在推理阶段因逐像素扫描解码导致的计算效率低下问题。现有加速方法要么需要从头预训练全新生成范式,要么牺牲训练-推理一致性或改变原始预测目标。其解决方案的关键在于提出一种轻量级后训练适配框架 FlashAR,通过保留原模型水平方向(row-wise)的自回归头,并引入一个从中间层分支出的垂直方向(column-wise)轻量级新头,实现双向预测;同时设计可学习融合门机制动态加权两种预测结果,从而在不显著改变原模型训练目标的前提下,大幅提升并行生成效率,实现在仅使用原训练数据0.05%的情况下,获得最高达22.9倍的推理速度提升。

链接: https://arxiv.org/abs/2605.09430
作者: Junkang Zhou,Yefei He,Feng Chen,Weijie Wang,Bohan Zhuang
机构: Zhejiang University (浙江大学); University of Adelaide (阿德莱德大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Post-training acceleration for autoregressive image generation

点击查看摘要

Abstract:Large-scale autoregressive models have demonstrated remarkable capabilities in image generation. However, their sequential raster-scan decoding relies on strict next-token prediction, making inference prohibitively expensive. Existing acceleration methods typically either introduce entirely new generation paradigms that necessitate costly pre-training from scratch, or enable parallel generation at the expense of a training-inference gap or altered prediction objectives. In this paper, we introduce FlashAR, a lightweight post-training adaptation framework that efficiently adapts a pre-trained raster-scan autoregressive model into a highly parallel generator based on two-way next-token prediction. Our key insight is that effective adaptation should minimize modifications to the pre-trained model's original training objective to preserve its learned prior. Accordingly, we retain the original AR head as a horizontal head for row-wise prediction and introduce a complementary, lightweight vertical head for column-wise prediction. To facilitate efficient adaptation, we branch the vertical head from an intermediate layer rather than the final layer, bypassing the inherent horizontal head bias. Moreover, since horizontal and vertical predictions capture complementary dependencies whose relative importance varies across target positions, we employ a learnable fusion gate to dynamically combine the two predictions at each position. To further reduce adaptation cost, we propose a two-stage adaptation pipeline: the vertical head is first initialized through adaptation from the pre-trained autoregressive model before being jointly fine-tuned with the backbone to adapt to the new decoding paradigm. Extensive experiments on LlamaGen and Emu3.5 show that FlashAR achieves up to a 22.9x speedup for 512x512 image generation through lightweight post-training with merely 0.05% of the original training data.
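
摘要中“水平头 + 垂直头 + 可学习融合门”的结构可以用如下极简 PyTorch 模块示意(隐藏维度、词表大小与门控形式均为假设,并非论文官方实现)。

```python
import torch
import torch.nn as nn

class TwoWayHead(nn.Module):
    """在每个位置用可学习的门动态加权行方向与列方向两路 logits(示意)。"""
    def __init__(self, hidden: int, vocab: int):
        super().__init__()
        self.h_head = nn.Linear(hidden, vocab)   # 水平(行方向)预测头,对应保留的原 AR 头
        self.v_head = nn.Linear(hidden, vocab)   # 垂直(列方向)轻量新增头
        self.gate = nn.Linear(2 * hidden, 1)     # 位置相关的融合门

    def forward(self, feat_h: torch.Tensor, feat_v: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([feat_h, feat_v], dim=-1)))
        return g * self.h_head(feat_h) + (1 - g) * self.v_head(feat_v)

head = TwoWayHead(hidden=256, vocab=16384)
logits = head(torch.randn(1, 32, 256), torch.randn(1, 32, 256))
print(logits.shape)   # torch.Size([1, 32, 16384])
```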

[CV-190] Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models

【速读】:该论文旨在解决现有视觉-语言模型(Vision-Language Models, VLMs)中基于浅层文本到图像注意力的剪枝方法所引发的“视觉失语症”(Visual Aphasia)问题,即因过早丢弃看似低注意力的视觉标记(visual tokens)而导致模型在组合推理任务中丧失视觉语义锚定、依赖语言先验而性能下降的问题。其解决方案的关键在于提出一种无需训练的剪枝框架COAST(COntrastive Adaptive Semantic Token Pruning),通过利用原生跨模态注意力识别查询特定锚点(anchor),并基于注意力熵估计上下文分散度,自适应地权衡语义证据与空间上下文保留;进一步引入对比路由得分以同时保留锚点对齐证据和互补的空间上下文,从而实现更鲁棒的语义感知剪枝。

链接: https://arxiv.org/abs/2605.09429
作者: Jie Ma,Yihang Liu,Zhike Qiu,Jiayi Ji,Xiaoshuai Sun
机构: Xiamen University (厦门大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Are low-attention visual tokens truly redundant in vision-language reasoning? Existing pruning methods often assume so, ranking visual tokens by shallow text-to-image attention and discarding low-scoring patches to accelerate LVLM inference. We show that this scalar criterion is unreliable for compositional reasoning: tokens ignored in early layers can later become essential for resolving secondary objects, spatial relations, and contextual cues. Premature pruning can therefore induce Visual Aphasia, a failure mode in which the model loses visual grounding and falls back on language priors. We introduce COAST (COntrastive Adaptive Semantic Token Pruning), a training-free pruning framework that casts compression as adaptive semantic routing. COAST uses native cross-modal attention to identify query-specific anchors and estimate contextual dispersion via attention entropy, then adapts the retention trade-off between semantic evidence and spatial context. It further uses a contrastive routing score to preserve both anchor-aligned evidence and complementary spatial context. Across seven benchmarks, COAST reduces visual tokens by 77.8% and achieves a 2.15x latency speedup while retaining 98.64% of the original average performance. Beyond a single backbone or compression setting, COAST consistently outperforms strong pruning baselines across token budgets and generalizes across multiple LVLM families, showing that adaptive semantic routing is a robust alternative to one-shot scalar pruning.
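
COAST 用文本到图像注意力的熵来估计上下文分散度,并据此自适应分配“锚点证据”与“空间上下文”的保留预算。下面是一个无需训练的示意函数(熵到保留比例的映射方式为假设,并非论文原始公式)。

```python
import torch

def adaptive_keep(attn_t2v: torch.Tensor, budget: int) -> torch.Tensor:
    """attn_t2v: [num_text, num_vis] 文本到视觉注意力;返回保留的视觉 token 索引(示意)。
    熵低表示注意力集中,多保留锚点对齐 token;熵高表示上下文分散,多保留均匀采样的上下文 token。"""
    score = attn_t2v.mean(0)                                  # 每个视觉 token 的平均被关注程度
    p = score / score.sum()
    entropy = -(p * (p + 1e-12).log()).sum()
    max_ent = torch.log(torch.tensor(float(p.numel())))
    alpha = float(1.0 - entropy / max_ent)                    # 注意力集中程度,取值约在 [0, 1]
    k_anchor = max(1, int(budget * alpha))
    anchor_idx = score.topk(k_anchor).indices
    rest = torch.tensor([i for i in range(p.numel()) if i not in set(anchor_idx.tolist())])
    if budget > k_anchor:
        ctx_idx = rest[torch.linspace(0, len(rest) - 1, budget - k_anchor).long()]
    else:
        ctx_idx = rest[:0]
    return torch.cat([anchor_idx, ctx_idx])

keep = adaptive_keep(torch.softmax(torch.randn(16, 576), dim=-1), budget=128)
print(keep.shape)   # torch.Size([128])
```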

[CV-191] AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation

【速读】:该论文旨在解决高阶自动驾驶任务(如交通规则提取和驾驶行为理解)中因数据稀缺导致的模型性能瓶颈问题,其核心挑战在于传统图像增强方法仅依赖简单标注(如分割图、深度图)作为条件时,难以保留场景的精细化结构信息。解决方案的关键在于引入多条件联合生成机制——将语义分割(semantic segmentation)、深度图(depth)和边缘图(edges)共同作为输入条件,并设计了一种冲突处理建模方法以缓解多条件间潜在的不一致问题,从而实现结构更精确的图像生成,为高阶自动驾驶任务提供高质量的合成训练数据。

链接: https://arxiv.org/abs/2605.09425
作者: Shogo Noguchi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 44 pages, 20 figures. Code and project page available at: this https URL

点击查看摘要

Abstract:Recent conditional image generation methods can improve controllability by generating images that are faithful to conditions such as sketches, human poses, segmentation maps, and depth. By applying these techniques to image augmentation while preserving annotations, generated images can be used as additional training data and can improve recognition performance. However, for high-level driving tasks such as traffic-rule extraction and driving-behavior understanding, simply using annotations as conditions is insufficient. Instead, images must be augmented while preserving the detailed high-level structure of the original scene. One possible solution is to use multiple conditions so that generated images retain diverse structural cues after generation. However, when multiple conditions are used, conflicts among conditions can prevent reliable structure preservation. In this work, we input semantic segmentation, depth, and edges extracted from the original image into a multi-condition image generation model, thereby providing rich structural information as conditions. We further propose a modeling approach for handling conflicts among multiple conditions and show that it enables image generation with stronger structural preservation. We also build a generation framework and evaluation protocol for driving tasks, establishing a basis for comparison with prior and future models. As a result, this work contributes to image generation research by addressing condition conflicts in multi-condition generation and provides an important step toward mitigating data scarcity in high-level autonomous-driving tasks.

[CV-192] Relational Retrieval: Leveraging Known-Novel Interactions for Generalized Category Discovery ICMR2026

【速读】:该论文旨在解决广义类别发现(Generalized Category Discovery, GCD)问题,即在已知类别(ID)和未知类别(OOD)混合的场景下,如何有效利用少量标注数据与大量未标注数据共同提升类别识别与新类发现能力。传统方法通常将标注与未标注数据分别处理,忽略了二者之间的潜在交互机会。本文提出的关键解决方案是关系模式一致性(Relational Pattern Consistency, RPC),其核心在于通过双向知识迁移机制实现两类数据的协同增强:一方面,利用已知类原型对未标注样本进行语义行为对齐以保留已知类别特征;另一方面,基于同一类别样本与已知类原型间关系不变性的假设,将不可靠的伪标签转化为明确的关系模式匹配,从而挖掘出新的类别。这一设计使得标注数据能够指导未标注学习,同时通过集体关系签名发现新型类别,显著提升了模型性能。

链接: https://arxiv.org/abs/2605.09420
作者: Yulin Xu,Chunqi Guo,Yuanzhen Shuai,Jianyuan Ni
机构: University of California, Irvine (加州大学欧文分校); Sichuan Agricultural University (四川农业大学); University College London (伦敦大学学院); Juniata College (朱尼塔学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Accepted by ICMR 2026. Generalized category discovery, semi-supervised learning, contrastive learning

点击查看摘要

Abstract:In this study, we tackle Generalized Category Discovery (GCD) via a Relational Retrieval perspective, explicitly coupling labeled and unlabeled data through bidirectional knowledge transfer. While existing methods treat these sources separately, missing valuable interaction opportunities, we propose Relational Pattern Consistency (RPC) that enables mutual enhancement. RPC employs One-vs-All classifiers for soft ID/OOD decomposition, then introduces two mechanisms: (i) for known-class preservation, we transfer semantic behavioral alignment; (ii) for category discovery, we leverage the insight that samples from the same category maintain invariant relationships with known-class prototypes, transforming unreliable pseudo-labeling into well-defined relational pattern matching. This bidirectional design allows labeled data to guide unlabeled learning while discovering novel categories through their collective relational signatures. Extensive experiments demonstrate RPC achieves state-of-the-art performance on both generic and fine-grained benchmarks.
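
摘要中的“关系签名”可以理解为:每个样本与各已知类原型的相似度向量;同一潜在类别的未标注样本应具有相近的签名,从而把不可靠的伪标签问题转化为签名间的模式匹配。以下为示意(原型数量、温度系数与特征维度均为假设)。

```python
import torch
import torch.nn.functional as F

def relational_signature(feat: torch.Tensor, prototypes: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """以样本与各已知类原型的(温度缩放)相似度分布作为其关系签名(示意)。"""
    sim = F.normalize(feat, dim=-1) @ F.normalize(prototypes, dim=-1).t()
    return F.softmax(sim / tau, dim=-1)

protos = torch.randn(10, 128)                      # 10 个已知类原型(示意)
unlabeled = torch.randn(256, 128)                  # 未标注样本特征
sig = relational_signature(unlabeled, protos)      # [256, 10] 关系签名
affinity = sig @ sig.t()                           # 签名相似度,可用于聚类以发现新类别
print(sig.shape, affinity.shape)
```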

[CV-193] MAG-VLAQ: Multi-modal Aerial-Ground Query Aggregation for Cross-View Place Recognition

【速读】:该论文旨在解决多模态跨视角场景识别(multi-modal cross-view place recognition)中的核心挑战,即地面观测与航拍参考之间因视角、模态及空间结构差异导致的匹配困难问题。解决方案的关键在于提出MAG-VLAQ框架,其创新性地将预训练基础模型提取的密集视觉令牌(dense visual tokens)与地面LiDAR的几何特征融合,并通过ODE条件驱动的向量局部聚合查询(ODE-conditioned VLAQ)机制,实现RGB与LiDAR信息的动态耦合。该设计使查询中心能根据融合后的多模态状态自适应调整,从而在保留全局检索原型的同时,响应局部场景的视觉和几何证据,显著提升航拍-地面图像间的匹配性能。

链接: https://arxiv.org/abs/2605.09418
作者: Zhengyi Xu,Yuhang Ming,Zhihao Zhan,Hanyu Zhu,Javier Civera,Wanzeng Kong
机构: Hangzhou Dianzi University (杭州电子科技大学); TopXGun Robotics (TopXGun机器人公司); University of Zaragoza (萨拉戈萨大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 16 pages, 4 figures, 3 tables

点击查看摘要

Abstract:Multi-modal cross-view place recognition remains a fundamental challenge in computer vision and robotics due to the severe viewpoint, modality, and spatial-structure discrepancies between ground observations and aerial references. To address this challenge, we present MAG-VLAQ, a foundation-model-enhanced query aggregation framework for multi-modal aerial-ground cross-view place recognition. Specifically, our approach leverages pre-trained foundation models to extract dense visual tokens from both ground and aerial images, as well as expressive geometric tokens from ground LiDAR observations. These heterogeneous tokens are then projected into a shared embedding space for cross-modal alignment and fusion. As our main contribution, we propose ODE-conditioned VLAQ, which tightly couples neural ordinary differential equations (ODE)-based RGB-LiDAR fusion with vectors of locally aggregated queries (VLAQ). In this design, the VLAQ query centers are dynamically adapted according to the fused multi-modal state. This mechanism allows the final global descriptor to preserve globally learned retrieval prototypes while remaining responsive to scene-specific visual and geometric evidence, significantly improving aerial-ground matching. Extensive experiments on KITTI360-AG and nuScenes-AG validate the effectiveness of our proposed MAG-VLAQ. Notably, on KITTI360-AG, our MAG-VLAQ nearly doubles the state-of-the-art performance, achieving 61.1 Recall@1 in the satellite setting, compared with 34.5 from the closest competing approach.

[CV-194] SAMOFT: Robust Multi-Object Tracking via Region and Flow

【速读】:该论文旨在解决多目标跟踪(Multi-object Tracking, MOT)中因目标形变、非线性运动和遮挡等复杂场景导致的轨迹关联性能下降问题。现有方法主要依赖实例级特征进行轨迹匹配,难以应对上述挑战。其解决方案的关键在于引入像素级线索以提升鲁棒性:首先设计Pixel Motion Matching (PMM)模块,融合Segment Anything Model (SAM)与稠密光流,利用前景像素的瞬时运动信息优化基于卡尔曼滤波的运动预测;其次提出Centroid Distance Matching (CDM)模块,通过灵活的掩码级质心匹配处理低置信度或部分遮挡的检测结果;此外,引入Distribution-Based Correction (DBC)模块,在无需训练的情况下利用历史光流统计量动态校正轨迹状态,建模长尾运动模式;最后结合Cluster-Aware ReID (CA-ReID)策略增强外观特征的稳定性和判别力。整体框架显著提升了在复杂场景下的跟踪精度与鲁棒性。

链接: https://arxiv.org/abs/2605.09417
作者: Yanchao Wang,Dawei Zhang,Chengzhuan Yang,Wei Liu,Minglu Li,Hua Wang,Zhonglong Zheng,Ming-Hsuan Yang
机构: Zhejiang Normal University (浙江师范大学); Shanghai Jiao Tong University (上海交通大学); Victoria University (维多利亚大学); University of California at Merced (加州大学默塞德分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-object tracking (MOT) is a fundamental task in computer vision that requires continuously tracking multiple targets while maintaining consistent identities across frames. However, most existing approaches primarily rely on instance-level object features for trajectory association, which often leads to degraded performance under challenging conditions such as object deformation, nonlinear motion, and occlusion. In this work, we propose SAMOFT, a robust tracker that leverages pixel-level cues to improve robustness under complex motion scenarios. Specifically, we introduce a Pixel Motion Matching (PMM) module that integrates the Segment Anything Model (SAM) with dense optical flow to refine Kalman filter-based motion prediction using instantaneous foreground pixel motion. To further enhance robustness under unreliable detections, we design a Centroid Distance Matching (CDM) module that performs flexible mask-based centroid matching for low-confidence or partially occluded observations. Moreover, a Distribution-Based Correction (DBC) module models long-tailed motion patterns in a training-free manner using historical optical flow statistics and dynamically corrects trajectory states online. We also incorporate a Cluster-Aware ReID (CA-ReID) strategy to improve the stability and discriminative power of trajectory appearance features. Extensive experiments on the DanceTrack and MOTChallenge benchmarks demonstrate that SAMOFT consistently improves baseline trackers and achieves competitive performance compared with recent state-of-the-art methods, validating the effectiveness of leveraging pixel-level cues for robust multi-object tracking.

[CV-195] AnyDepth-DETR/-YOLO: Any-depth object detection with a single network

【速读】:该论文旨在解决现代目标检测器(object detector)静态固定深度架构的局限性,即单一模型难以适应不同部署场景下的精度-效率权衡问题。传统方法需为每种部署需求训练独立模型,导致资源浪费和部署复杂。解决方案的关键在于提出一种“任意深度”(any-depth)检测框架,通过在推理时动态控制网络深度实现连续的精度-效率折衷。其核心创新是将每个骨干(backbone)和颈部(neck)阶段分解为始终执行的“必要路径”(essential path)与可跳过的“精修路径”(refinement path),从而在任意深度配置下保持完整的多尺度特征层次结构;同时采用仅在两个极端深度间进行自蒸馏(self-distillation)的训练策略,结合预测层与特征层对齐损失(prediction-level and feature-level alignment losses),确保各阶段输出的模块化兼容性,最终在不重新训练的情况下实现从最高效到最精确的一致性能覆盖。

链接: https://arxiv.org/abs/2605.09407
作者: Woochul Kang,Hyungseop Lee,Jiho Lee
机构: Incheon Nat’l Univ. (仁川国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 5 figures, 9 tables

点击查看摘要

Abstract:Modern object detectors are static, fixed-depth networks optimized for a single operating point, requiring separate models for different deployment scenarios. We present an any-depth detection framework that enables a single network to span a continuous range of accuracy–efficiency trade-offs by controlling depth at inference time without retraining. Each backbone and neck stage is divided into an essential path, which always executes, and a skippable refinement path; this decomposition preserves the full multi-scale feature hierarchy at every depth configuration, unlike conventional early exiting that discards entire stages. To train such a network, jointly optimizing many sub-networks of varying depth introduces conflicting gradient signals. We address this via self-distillation between only the two extremes, with prediction-level and feature-level alignment losses that enforce stage-wise modularity, ensuring the outputs of each stage remain compatible regardless of the paths taken. Instantiated on RT-DETR and YOLOv12, our full-depth configurations match or surpass their respective SOTA baselines with negligible parameter overhead, while the most efficient configurations achieve up to 1.82x speedup at a cost of only 2.0 AP, all from a single set of weights.
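
摘要中“必经路径 + 可跳过精修路径”的阶段划分可以示意如下:推理时用一个整数控制每个阶段执行多少精修块,残差形式保证跳过后特征仍与后续模块兼容(通道数、块数与卷积结构均为假设)。

```python
import torch
import torch.nn as nn

class AnyDepthStage(nn.Module):
    """一个阶段 = 必经的 essential 块 + 若干可在推理时跳过的 refinement 块(示意)。"""
    def __init__(self, ch: int, n_refine: int = 2):
        super().__init__()
        self.essential = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.SiLU())
        self.refine = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.SiLU()) for _ in range(n_refine)]
        )

    def forward(self, x: torch.Tensor, depth: int) -> torch.Tensor:
        x = self.essential(x)                 # 必经路径,保证每个深度配置都有完整的多尺度输出
        for blk in self.refine[:depth]:       # depth=0 时完全跳过精修路径
            x = x + blk(x)                    # 残差形式使跳过与否的输出保持兼容
        return x

stage = AnyDepthStage(ch=64)
x = torch.randn(1, 64, 40, 40)
print(stage(x, depth=0).shape, stage(x, depth=2).shape)
```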

[CV-196] HyNeuralMap: Hyperbolic Mapping of Visual Semantics to Neural Hierarchies

【速读】:该论文旨在解决视觉刺激与神经响应之间映射关系建模中的关键挑战,即现有方法在欧几里得空间(Euclidean space)中对细粒度语义关系和跨模态潜在层次结构的表征能力不足的问题。解决方案的关键在于提出 HyNeuralMap 框架,该框架采用双曲洛伦兹模型(Lorentz model)将视觉语义映射到共享的跨被试神经层次空间中,利用双曲空间的负曲率作为归纳偏置(inductive bias),从而更有效地捕捉语义层次组织和跨被试神经相似性;具体而言,通过双曲几何对齐联合优化视觉与神经嵌入,使测地线距离(geodesic distance)能够更好地保留语义邻近性和层次关系,显著优于传统欧几里得基线方法,在多标签语义预测和跨模态检索任务中均取得更优性能。

链接: https://arxiv.org/abs/2605.09392
作者: Zihan Ma,Tian Xia,Kexin Wang,Xiao Li,Xiaowei He,Yudan Ren
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 4 figures

点击查看摘要

Abstract:Understanding the intricate mappings between visual stimuli and neural responses is a fundamental challenge in cognitive neuroscience. While current approaches predominantly align images and functional magnetic resonance imaging (fMRI) responses in Euclidean space, this geometry often struggles to preserve fine-grained semantic relationships and latent hierarchical structures across visual and neural modalities. To overcome this, we propose HyNeuralMap, a framework that employs a hyperbolic Lorentz model to map visual semantics into a shared, cross-subject neural hierarchy. By leveraging the negative curvature of hyperbolic space as an inductive bias, the proposed framework better captures hierarchical semantic organization and cross-subject neural similarities. Specifically, visual and neural embeddings are jointly optimized through hyperbolic geometric alignment, where geodesic distances preserve semantic proximity and hierarchical relationships more effectively than Euclidean embeddings. Experiments demonstrate that HyNeuralMap consistently outperforms state-of-the-art Euclidean baselines in both multi-label semantic prediction and cross-modal retrieval tasks. This confirms hyperbolic geometry's superiority for cross-modal semantic alignment and hierarchical modeling, providing a new avenue for vision-neural representation learning.
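
洛伦兹模型下的测地线距离由洛伦兹内积直接给出,可以作为跨模态对齐损失的基本构件。下面给出把欧氏向量提升到双曲面并计算两点距离的最小示意(曲率取 -1,嵌入维度为假设)。

```python
import torch

def lift_to_lorentz(x: torch.Tensor) -> torch.Tensor:
    """把 R^n 向量提升到双曲面:x0 = sqrt(1 + |x|^2),使得 <p, p>_L = -1。"""
    x0 = torch.sqrt(1.0 + (x * x).sum(-1, keepdim=True))
    return torch.cat([x0, x], dim=-1)

def lorentz_distance(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """测地线距离 d(p, q) = arccosh(-<p, q>_L),其中 <p, q>_L = -p0*q0 + p1*q1 + ..."""
    inner = -p[..., 0] * q[..., 0] + (p[..., 1:] * q[..., 1:]).sum(-1)
    return torch.acosh(torch.clamp(-inner, min=1.0 + 1e-7))

vis = lift_to_lorentz(torch.randn(8, 16))    # 视觉嵌入(示意)
neu = lift_to_lorentz(torch.randn(8, 16))    # 神经(fMRI)嵌入(示意)
print(lorentz_distance(vis, neu))            # 对齐损失可直接基于这些测地线距离构造
```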

[CV-197] LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering

【速读】:该论文旨在解决大型视觉语言模型(Vision-Language Models, VLMs)与轻量级VLMs之间存在的推理能力差距问题,这一差距限制了医疗人工智能在资源受限的便携式临床设备上的部署。轻量级VLMs(如2–4B参数规模)虽可在边缘设备上运行,但缺乏多步推理能力,难以提供可解释的临床决策支持。现有知识蒸馏方法仅传递最终答案,未迁移推理过程,导致学生模型无法习得结构化推理链。为解决此问题,作者提出LiteMedCoT-VL框架,其核心创新在于通过LoRA(Low-Rank Adaptation)微调技术,在包含解释增强训练数据上将来自235B参数教师模型的思维链(Chain-of-Thought, CoT)推理过程迁移到2B学生模型中,且所有推理无需图像描述(caption),模拟医生直接解读医学影像的临床场景。实验表明,该方法在PMC-VQA基准上达到64.9%准确率,显著优于零样本Qwen3-VL-4B基线(53.9%),验证了小模型经推理蒸馏后可媲美甚至超越参数量翻倍的模型。

链接: https://arxiv.org/abs/2605.09384
作者: Runze Ma,Shunbo Jia,Haonan Lyu,Guo Liu,Caizhi Liao
机构: Monash University Malaysia (莫纳什大学马来西亚分校); Macau University of Science and Technology (澳门科技大学); Shenzhen University of Advanced Technology (深圳先进技术研究院); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: 17 pages, 5 figures

点击查看摘要

Abstract:The reasoning gap between large and compact vision-language models (VLMs) limits the deployment of medical AI on portable clinical devices. Compact VLMs of 2–4B parameters can run on resource-constrained hardware but lack the multi-step reasoning capacity needed for interpretable clinical decision support. Existing knowledge distillation methods transfer answers without the reasoning process behind them. Medical visual question answering (VQA) serves as a testbed for this problem, as it requires models to integrate visual evidence with clinical knowledge through structured reasoning chains. We introduce LiteMedCoT-VL, a pipeline that transfers chain-of-thought reasoning from a 235B teacher model to 2B student models through LoRA-based fine-tuning on explanation-enriched training data. All inference is conducted without image captions by default, simulating the clinical scenario in which a physician interprets a medical image directly without an accompanying radiology report. On the PMC-VQA benchmark, LiteMedCoT-VL achieves 64.9% accuracy, exceeding the zero-shot Qwen3-VL-4B baseline of 53.9% by 11.0 percentage points and outperforming all published baselines. This result indicates that a 2B model with reasoning distillation can match or exceed models with twice the parameters. Visual grounding analysis shows that the model relies on image content rather than exploiting textual priors. Our code is publicly available at this https URL.

[CV-198] Learning-Augmented Scalable Linear Assignment Problem Optimization via Neural Dual Warm-Starts ICML2026

【速读】:该论文旨在解决大规模线性指派问题(Linear Assignment Problem, LAP)中经典精确求解算法(如匈牙利算法和Jonker-Volgenant算法)因时间复杂度为 O(N^3) 而难以扩展的问题,同时克服现有基于学习的方法在保持最优性或可扩展性方面的局限。解决方案的关键在于提出一种学习增强框架,通过预测对偶变量(dual variables)来热启动经典求解器,并引入回退机制以防止学习建议不可靠时导致渐近运行时间退化;此外,设计了轻量级的RowDualNet架构,避免图模型带来的 O(N^2) 内存瓶颈,从而实现大规模实例(N=16,384)下的神经热启动,且利用线性规划对偶理论中的最小技巧(Min-Trick)确保可行性,无需昂贵的迭代投影,最终在多个合成与真实世界数据集上实现了显著加速并严格维持最优性。

链接: https://arxiv.org/abs/2605.09382
作者: Ilay Yavlovich,Jad Agbaria,Muhamed Mhamed,Jose Yallouz,Nir Weinberger
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC)
备注: Accepted to ICML 2026. 20 pages, 13 figures

点击查看摘要

Abstract:The Linear Assignment Problem (LAP) is a fundamental combinatorial optimization task with applications ranging from computer vision to logistics. Classical exact solvers such as the Hungarian and Jonker-Volgenant (LAPJV) algorithms guarantee optimality, but their cubic time complexity O(N^3) becomes a bottleneck for large-scale instances. Recent learning-based approaches aim to replace these solvers with neural models, often sacrificing exactness or failing to scale due to memory constraints. We propose a learning-augmented framework that accelerates exact assignment solvers while maintaining optimality and worst-case guarantees. Our method predicts dual variables to warm-start a classical solver, with a fallback that prevents asymptotic runtime degradation when the learned advice is unreliable. We introduce RowDualNet, a lightweight row-independent architecture that avoids the O(N^2) memory bottleneck of graph-based models, enabling neural warm-starting at large scale (N=16,384). Feasibility is ensured via a constructive mechanism based on LP duality (namely, the Min-Trick), eliminating costly iterative projection. Empirically, our approach reduces the search effort of LAPJV and achieves over 2x speedups on challenging synthetic distributions, in addition to improving over 1.25x and 1.5x on real-world tracking (MOT) and transportation (LPT) datasets, respectively, while strictly maintaining full optimality, effectively yielding a robust zero-shot generalization to real-world tasks.
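
摘要中的 Min-Trick 来自 LAP 的线性规划对偶:给定(例如由 RowDualNet 预测的)行对偶变量 u,取 v_j = min_i (c_ij - u_i) 即可保证 u_i + v_j ≤ c_ij 对所有 (i, j) 成立,从而得到可直接热启动精确求解器的可行对偶。以下 numpy 示意中用一个启发式代替网络预测。

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
C = rng.random((512, 512))                  # 指派成本矩阵

u_pred = C.min(axis=1) * 0.9                # 此处用启发式充当"预测的行对偶"(示意)
v = (C - u_pred[:, None]).min(axis=0)       # Min-Trick:v_j = min_i (c_ij - u_i)
reduced = C - u_pred[:, None] - v[None, :]  # 约化成本矩阵,按构造非负
assert reduced.min() >= -1e-9

# 对偶目标 sum(u) + sum(v) 是最优指派成本的下界,可用来衡量热启动质量
row, col = linear_sum_assignment(C)
print(u_pred.sum() + v.sum(), "<=", C[row, col].sum())
```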

[CV-199] FrameTwin: Curve-Anchored Gaussian Alignment from Sparse Views for Adaptive Wireframe 3D Printing

【速读】:该论文旨在解决机器人化3D打印过程中薄线框结构(wireframe structures)因打印过程中的变形导致的几何失真问题,从而实现闭环控制下的自适应打印。解决方案的关键在于提出FrameTwin框架,其核心是基于参数化曲线锚定的高斯对齐方法(curve-anchored Gaussian alignment),通过稀疏视角图像提取结构形变信息,并利用可微渲染管道估计神经形变场(neural deformation field),将已打印部分与实际观测到的变形结构对齐,从而生成一个随打印进程演化的数字孪生(digital twin)。该方法通过约束高斯核沿参数曲线分布,显著降低了稀疏视角下薄结构观测的歧义性,并确保所有杆件间的全局一致性,最终实现基于形变补偿的打印轨迹自适应更新。

链接: https://arxiv.org/abs/2605.09362
作者: Wenting Wang,Zhuo Huang,Kun Qian,Neelotpal Dutta,Yuhu Guo,Yingjun Tian,Yeung Yam,Charlie C.L. Wang
机构: The Chinese University of Hong Kong (香港中文大学); Centre of Perceptual and Interactive Intelligence (感知与交互智能中心); University of Manchester (曼彻斯特大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present FrameTwin, a curve-anchored Gaussian alignment framework that uses sparse-view images to close the control loop for adaptive wireframe 3D printing. Our key idea is to capture the deformation of thin wireframe structures from sparse-view images using Gaussian kernels anchored to parametric curves, yielding a compact and geometry-aware encoding that explicitly captures strut topology. Driven by a differentiable rendering pipeline, FrameTwin estimates a neural deformation field that aligns the partially printed target model with the deformed structure observed during fabrication, where the optimized curve-Gaussian representation serves as a digital twin of the evolving wireframe. Unlike general Gaussian-splatting approaches, our formulation constrains kernel placement along parametric curves, substantially reducing the ambiguity inherent in sparse-view observations of thin structures. The resultant deformation-field alignment enforces global consistency across all struts. By using the estimated deformation field to blend the distorted printed geometry with the remaining unprinted geometry, FrameTwin enables adaptive updates to future printing trajectories. We demonstrate that FrameTwin can robustly capture and compensate for deformation in wireframe models fabricated using a robotized 3D printing system.

[CV-200] Perceptual Asymmetry Between Hue Categories: Evidence from Human Color Categorization ICICS2026

【速读】:该论文旨在解决当前计算颜色模型普遍假设颜色类别在感知空间中均匀分布的问题,而事实上人类颜色范畴(color categories)在感知空间中并非均匀分布。其解决方案的关键在于对COLIBRI模糊颜色模型进行聚焦分析,引入基于α=0.5水平截集的模糊隶属函数量化指标——Wideness(宽广度)和Boundary Width(边界宽度),从而揭示色相类别之间的感知不对称性:黄色占据紧凑且边界清晰的区域,而绿色则覆盖更广的区间并表现出更平缓的过渡结构。这一发现表明颜色类别不仅具有模糊性,其几何组织还呈现显著非均匀性,为语言色彩分类提供了新视角,并增强了COLIBRI框架在感知基础上的颜色建模可解释性。

链接: https://arxiv.org/abs/2605.09339
作者: Elnara Kadyrgali,Nuray Toganas,Muragul Muratbekova,Pakizar Shamoi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: The paper has been submitted for consideration to ICICS 2026 (International Conference on Informatics and Computer Science)

点击查看摘要

Abstract:Human color categories are not uniformly distributed in perceptual space, yet most computational color models still assume fixed and evenly structured representations. In this paper, we present a focused analytical extension of the COLIBRI fuzzy color model by investigating perceptual asymmetry between hue categories. Using previously collected large-scale human color categorization data, we introduce quantitative measures of category extent and boundary uncertainty, namely Wideness and Boundary Width, derived from fuzzy membership functions at the α = 0.5 level. The analysis reveals a strong imbalance between the two categories: yellow occupies a compact and sharply constrained region of the hue space, whereas green spans a substantially broader interval and exhibits a more extended transition structure. The results show that perceptual color categories are not only fuzzy, but also highly non-uniform in their geometric organization. This asymmetry suggests that some categories behave as narrow, highly specific perceptual labels, while others function as broad, tolerant regions of human color naming. These findings provide a new perspective on linguistic color categorization and extend the interpretability of the COLIBRI framework for perceptually grounded color modeling.
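
Wideness 与 Boundary Width 都定义在模糊隶属函数的 α = 0.5 水平截集上。下面用一个假设的梯形隶属函数示意这两个量的一种计算方式(隶属函数形状与具体定义细节为示意性假设,并非论文原始参数或公式)。

```python
import numpy as np

def trapezoid_membership(h, a, b, c, d):
    """梯形模糊隶属函数:在 [b, c] 上取 1,在 [a, b] 与 [c, d] 上线性过渡(示意)。"""
    return np.clip(np.minimum((h - a) / (b - a + 1e-9), (d - h) / (d - c + 1e-9)), 0.0, 1.0)

hue = np.linspace(0, 360, 3601)                          # 色相轴(度)
mu_green = trapezoid_membership(hue, 70, 100, 160, 190)  # 假设的"绿色"隶属函数

alpha = 0.5
cut = hue[mu_green >= alpha]
wideness = cut.max() - cut.min()                         # α = 0.5 截集的跨度
core = hue[mu_green >= 0.999]
boundary_width = wideness - (core.max() - core.min())    # 截集宽度减去核宽度,近似过渡带总宽
print(f"Wideness = {wideness:.1f} deg, Boundary Width = {boundary_width:.1f} deg")
```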

[CV-201] Noise-Started One-Step Real-World Super-Resolution via LR-Conditioned SplitMeanFlow and GAN Refinement

【速读】:该论文旨在解决扩散模型在真实图像超分辨率(Real-ISR)任务中面临的效率与质量权衡问题。现有方法中,多步扩散模型虽能生成高质量结果但推理速度慢,而单步方法虽效率高却因摒弃噪声起始机制削弱了随机性,限制了细节的真实性。解决方案的关键在于提出SMFSR框架,其核心创新为:1)通过LR-conditioned SplitMeanFlow设计,利用区间分割一致性(Interval Splitting Consistency)将多步生成轨迹蒸馏为单步平均速度预测,从而保留扩散模型的噪声起始特性并实现高效生成;2)引入GAN精炼阶段,结合DINOv3判别器增强纹理真实性,并采用变分分数蒸馏(variational score distillation)使输出分布对齐冻结扩散教师模型所定义的自然图像分布,有效弥补单步生成中逐级优化机会的缺失。

链接: https://arxiv.org/abs/2605.09328
作者: Wei Zhu,Kai Zhang,Yu Zheng,Lei Luo,Yong Guo,Jian Yang
机构: Nanjing University of Science and Technology (南京理工大学); Nanjing University (南京大学); Huawei (华为)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Pre-trained text-to-image (T2I) diffusion models have shown strong potential for real-world image super-resolution (Real-ISR), owing to their noise-started generation process that enables realistic texture synthesis and captures the one-to-many nature of super-resolution. However, diffusion-based Real-ISR methods still face a fundamental efficiency-quality trade-off. Multi-step methods generate high-quality results by iteratively denoising random Gaussian noise under LR conditioning, but suffer from slow sampling. Recent one-step methods greatly improve efficiency, yet they typically replace noise-started generation with direct LR-to-HR restoration, which weakens stochasticity and limits realistic detail synthesis. To address this issue, we propose SMFSR, a noise-started one-step Real-ISR framework via LR-conditioned SplitMeanFlow and GAN refinement. SMFSR preserves the random-noise starting point of diffusion models and learns a direct noise-to-HR mapping conditioned on the LR image. To this end, Interval Splitting Consistency distills the multi-step generative trajectory into a single average-velocity prediction, enabling efficient one-step generation. To compensate for the reduced opportunity for progressive refinement, we further introduce a GAN refinement stage, where a DINOv3-based discriminator enhances realistic texture synthesis and variational score distillation aligns the generated outputs with the natural image distribution under a frozen diffusion teacher. Extensive experiments demonstrate that SMFSR achieves state-of-the-art perceptual quality among one-step diffusion-based Real-ISR methods while retaining fast single-step inference.

[CV-202] PGID: Progressive Guided Inversion and Denoising for Robust Watermark Detection

【速读】:该论文旨在解决生成式 AI(Generative AI)中基于语义水印(semantic watermarking)的版权保护机制在面对伪造攻击(forgery attacks)和水印移除攻击(imprint removal attacks)时所暴露的关键脆弱性问题。现有方法依赖扩散模型反演(diffusion inversion)进行水印检测,攻击者可利用此特性将带水印的潜在表示(latents)移入未水印区域,并引导未水印潜在表示进入水印区域,从而误导检测结果。解决方案的关键在于提出一种无需训练、即插即用的噪声提取框架——渐进引导反演与去噪(Progressive Guided Inversion and Denoising, PGID),其核心机制是通过多轮渐进式反演-去噪循环,消除中间潜在表示的扰动并抑制对抗性偏移,从而将被篡改的潜在变量投影回原始所属区域,有效恢复水印可检测性并识别伪造样本。

链接: https://arxiv.org/abs/2605.09319
作者: Minh Quoc Duong,Chun Tong Lei,Chun Pong Lau
机构: City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:With the proliferation of AI-generated images, digital watermarking has become an essential safeguard for protecting intellectual property and mitigating malicious exploitation. Recent works on semantic watermarking have enabled efficient copyright protection for diffusion models. However, the dependence of semantic watermarking on diffusion inversion for watermark detection creates a critical vulnerability. Imprint removal and forgery attacks exploit this weakness to produce deceptive results. Our analysis reveals that these attacks succeed by displacing watermarked latents into the unwatermarked region, while guiding unwatermarked latents into the watermarked region. Based on that, we propose Progressive Guided Inversion and Denoising (PGID), the first plug-and-play, training-free noise extraction framework designed to defend against both attack strategies. PGID effectively defends by projecting perturbed latents back to the region where they originally belong. The projection is achieved by eliminating intermediate latent deflections and mitigating adversarial perturbations through progressive inversion-denoising cycles. Comprehensive evaluations across multiple schemes demonstrate that PGID successfully restores detection reliability by recovering removed watermarks and identifying forged instances.

[CV-203] Attention Sinks in Diffusion Transformers: A Causal Analysis

【速读】:该论文旨在解决扩散 Transformer 中“注意力汇聚点”(attention sinks)的功能作用不明确的问题,特别是其在文本到图像扩散模型中的因果影响。解决方案的关键在于提出一种无需训练的因果干预方法:通过动态识别每一步时间戳下的主导注意力接收者(即注意力汇聚点),并在 score 和 value 路径上进行配对抑制,从而系统性地移除这些汇聚点,并评估其对图像生成质量与文本对齐度的影响。实验表明,仅移除单一注意力汇聚点不会损害 CLIP-T 或偏好代理指标(如 ImageReward、HPS-v2),但在更强干预下(k ≥ 10)HPS-v2 出现指标依赖性边界,而 CLIP-T 始终保持稳定,揭示了扩散 Transformer 中轨迹扰动与语义对齐之间的经验解耦现象。

链接: https://arxiv.org/abs/2605.09313
作者: Fangzheng Wu,Brian Summa
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Attention sinks (tokens that receive disproportionate attention mass) are assumed to be functionally important in autoregressive language models, but their role in diffusion transformers remains unclear. We present a causal analysis in text-to-image diffusion, dynamically identifying dominant attention recipients per timestep and suppressing them via paired, training-free interventions on the score and value paths. Across 553 GenEval prompts on Stable Diffusion 3 (with SDXL corroboration), removing these sinks does not degrade text-image alignment (CLIP-T) or preference proxies (ImageReward, HPS-v2) at k=1; only under stronger interventions (k ≥ 10) does HPS-v2 exhibit a metric-dependent boundary, while CLIP-T remains robust throughout. The perceptual shifts induced by suppression are nonetheless sink-specific (roughly 6x larger than equal-budget random masking), revealing an empirical dissociation between trajectory-level perturbation and semantic alignment in diffusion transformers. Code available at this https URL.
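
按摘要的思路,可以在每个时间步按“被关注总质量”找出注意力汇聚 token,并在 score 与 value 两条路径上做配对抑制。以下是一个与具体扩散模型无关的示意(张量形状与 k 值均为假设,并非论文官方干预实现)。

```python
import torch

def suppress_sinks(attn: torch.Tensor, v: torch.Tensor, k: int = 1):
    """attn: [heads, q, kv] 注意力权重;v: [heads, kv, dim] value。
    找出被关注质量最大的 k 个 kv token,将其注意力列清零并重新归一化(score 路径),
    同时把对应 value 清零(value 路径),返回干预后的注意力输出与汇聚点索引。"""
    mass = attn.sum(dim=(0, 1))                                  # 每个 kv token 收到的总注意力质量
    sink_idx = mass.topk(k).indices
    attn = attn.clone()
    attn[:, :, sink_idx] = 0.0
    attn = attn / attn.sum(-1, keepdim=True).clamp_min(1e-9)     # score 路径抑制
    v = v.clone()
    v[:, sink_idx, :] = 0.0                                      # value 路径抑制
    return attn @ v, sink_idx

out, sinks = suppress_sinks(torch.softmax(torch.randn(8, 77, 77), dim=-1), torch.randn(8, 77, 64))
print(out.shape, sinks)
```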

[CV-204] Low-Cost Neural Radiance Fields

【速读】:该论文旨在解决神经辐射场(Neural Radiance Fields, NeRF)在训练时间长和依赖密集输入视角方面的局限性,从而提升其在低计算资源和低数据量场景下的可用性。解决方案的关键在于对三种加速版NeRF变体(DS-NeRF、TensoRF 和 HashNeRF)进行系统性比较,并提出针对性改进:1)在TensoRF中引入基于COLMAP关键点的深度监督损失(TensoRF-DS),以提升稀疏视角下的重建质量;2)通过消融实验分析特征解码多层感知机(MLP)结构及输入下采样对峰值信噪比(PSNR)和运行时的影响;3)设计四种HashNeRF的颜色与密度网络架构变体(含残差和卷积结构),并量化不同配置在固定迭代预算下的PSNR与训练时间权衡。实验表明,尽管无一扩展方案在等时评估中全面超越基线,但结果明确了哪些改进适用于受限环境,并揭示了未来优化方向。

链接: https://arxiv.org/abs/2605.09312
作者: Alice Huang,Prathamesh Sonawane,Yashdeep Thorat,Yug Rao
机构: University of Illinois Urbana Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages

点击查看摘要

Abstract:Neural Radiance Fields (NeRF) achieve high-quality novel-view synthesis, but their long training times and reliance on dense input views limit accessibility. We present a comparative study of three accelerated NeRF variants (DS-NeRF, TensoRF, and HashNeRF) and explore extensions targeted at the low-compute, low-data regime. First, we add a depth-supervision loss derived from COLMAP keypoints to TensoRF (TensoRF-DS) and evaluate it on the LLFF dataset under reduced view counts. Second, we ablate the feature-decoding MLP of TensoRF and study the effect of input downsampling on PSNR and runtime on the synthetic Lego scene. Third, we propose four architectural variants of the HashNeRF color and density networks, including residual and convolutional designs, and report PSNR/training-time tradeoffs under matched iteration budgets. Under iso-time evaluation, none of our extensions conclusively outperform the published baselines, but the experiments characterize which extensions transfer to constrained settings and surface design questions for future work.
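
TensoRF-DS 的深度监督项可以示意为:在拥有 COLMAP 稀疏关键点深度的少量像素上,对体渲染得到的期望深度施加 L2 惩罚(变量形状、掩码比例与损失形式均为假设)。

```python
import torch

def depth_supervision_loss(weights: torch.Tensor, z_vals: torch.Tensor,
                           gt_depth: torch.Tensor, valid: torch.Tensor) -> torch.Tensor:
    """weights: [rays, samples] 体渲染权重;z_vals: [rays, samples] 采样深度;
    gt_depth: [rays] 来自 COLMAP 稀疏点的深度;valid: [rays] 布尔掩码(仅少量像素有监督)。"""
    pred_depth = (weights * z_vals).sum(-1)                # 期望深度
    if not valid.any():
        return pred_depth.sum() * 0.0                      # 该 batch 没有稀疏深度监督
    diff = (pred_depth - gt_depth)[valid]
    return (diff ** 2).mean()

w = torch.rand(1024, 64)
w = w / w.sum(-1, keepdim=True)                            # 归一化成合法的体渲染权重
z = torch.sort(torch.rand(1024, 64) * 6.0, dim=-1).values
loss = depth_supervision_loss(w, z, gt_depth=torch.rand(1024) * 6.0, valid=torch.rand(1024) < 0.05)
print(loss)
```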

[CV-205] Discrete Langevin-Inspired Posterior Sampling

【速读】:该论文旨在解决在离散状态空间中进行逆问题求解时,现有离散后验采样方法普遍存在可扩展性差或通用性不足的问题。当前方法通常依赖于对离散变量的连续松弛、Gibbs风格更新或针对特定退化过程设计的机制,限制了其在复杂场景下的应用。解决方案的关键在于提出一种名为ΔLPS(Discrete Langevin-Inspired Posterior Sampler)的新型采样器,该方法利用梯度信息识别有效的离散移动路径,且不离开离散状态空间,从而实现所有token维度上的高效并行更新,并对离散扩散先验的训练范式(如掩码扩散和均匀状态扩散)具有普适性。这一设计使得算法在图像恢复(MNIST、CIFAR、FFHQ)和空间映射任务中均优于现有离散扩散后验采样器,并与强健的连续扩散逆求解器相当,验证了完全离散、梯度引导的后验采样策略在离散表示上求解逆问题的可行性与优越性。

链接: https://arxiv.org/abs/2605.09302
作者: Chaitanya Amballa,Sattwik Basu,Jorge Vančo Sampedro,Romit Roy Choudhury
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We study posterior sampling for inverse problems in discrete state spaces using discrete diffusion models as generative priors. While continuous diffusion models have become widely used for inverse problems, their discrete counterparts remain comparatively underexplored. Existing discrete posterior samplers often rely on continuous relaxations of discrete variables, Gibbs-style updates, or mechanisms specialized to particular corruption processes, which can limit scalability or generality. We propose ΔLPS, a Discrete Langevin-Inspired Posterior Sampler that uses gradient information to identify promising discrete moves without leaving the discrete state space. The resulting approach enables efficient parallel updates across all token dimensions and is agnostic to the training paradigm of the discrete diffusion prior, including masked and uniform-state diffusion. We evaluate our method on image restoration tasks across MNIST, CIFAR, and FFHQ, as well as spatial mapping, covering linear, nonlinear, and blind inverse problems. Across these settings, we improve over recent discrete diffusion posterior samplers and are competitive with strong continuous diffusion-based inverse solvers. Our results suggest that fully discrete, gradient-informed posterior samplers offer a scalable and general path toward solving inverse problems over discrete representations.

[CV-206] Micro-Defects Expose Macro-Fakes: Detecting AI-Generated Images via Local Distributional Shifts

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 生成图像中难以准确识别真实与伪造图像的问题,尤其针对现有基于预训练特征提取器的检测方法因过度依赖全局语义信息而对微尺度统计异常(micro-defects)敏感性不足的缺陷。其解决方案的关键在于提出一种局部分布感知的检测框架 Micro-Defects expose Macro-Fakes (MDMF),通过引入可学习的 Patch Forensic Signature 将语义块嵌入映射到紧凑的取证潜在空间,并利用最大均值差异(Maximum Mean Discrepancy, MMD)量化生成图像与真实图像之间的分布差异。该方法能够将局部微缺陷放大为宏观分布不一致,理论分析表明在存在局部取证信号时,基于块级别的建模可产生更显著的判别差异,从而提升检测可靠性。

链接: https://arxiv.org/abs/2605.09296
作者: Boxuan Zhang,Jianing Zhu,Qifan Wang,Jiang Liu,Ruixiang Tang
机构: Rutgers University; The University of Texas at Austin; Meta AI; Advanced Micro Devices
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 41 pages, 10 figures

点击查看摘要

Abstract:Recent generative models can produce images that appear highly realistic, raising challenges in distinguishing real and AI-generated images. Yet existing detectors based on pre-trained feature extractors tend to over-rely on global semantics, limiting sensitivity to the critical micro-defects. In this work, we propose Micro-Defects expose Macro-Fakes (MDMF), a local distribution-aware detection framework that amplifies micro-scale statistical irregularities into macro-level distributional discrepancies. To avoid localized forensic cues being diluted by plain aggregation, we introduce a learnable Patch Forensic Signature that projects semantic patch embeddings into a compact forensic latent space. We then use Maximum Mean Discrepancy (MMD) to quantify distributional discrepancies between generated and real images. Our theory-grounded analysis shows that patch-wise modeling yields provably larger discrepancies when localized forensic signals are present in generated images, enabling more reliable separation from real images. Extensive experiments demonstrate that MDMF consistently outperforms baseline detectors across multiple benchmarks, validating its general effectiveness. Project page: this https URL
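
Since the detector's decision statistic is an MMD between patch-embedding distributions, a minimal sketch of that ingredient may help. The snippet below computes an unbiased RBF-kernel MMD² between two synthetic sets of patch embeddings; the embedding dimension, bandwidth, and shapes are assumptions, not MDMF's learned Patch Forensic Signature.

```python
# A minimal sketch of the distribution-discrepancy idea: an unbiased RBF-kernel
# MMD^2 estimate between two sets of patch embeddings. Shapes and the bandwidth
# are illustrative assumptions.
import numpy as np

def rbf_kernel(A, B, sigma):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased MMD^2 between patch-embedding sets X (n,d) and Y (m,d)."""
    n, m = len(X), len(Y)
    Kxx = rbf_kernel(X, X, sigma); np.fill_diagonal(Kxx, 0.0)
    Kyy = rbf_kernel(Y, Y, sigma); np.fill_diagonal(Kyy, 0.0)
    Kxy = rbf_kernel(X, Y, sigma)
    return Kxx.sum() / (n * (n - 1)) + Kyy.sum() / (m * (m - 1)) - 2 * Kxy.mean()

rng = np.random.default_rng(0)
real_patches = rng.normal(0.0, 1.0, size=(196, 64))     # e.g. 14x14 patch grid
fake_patches = rng.normal(0.15, 1.1, size=(196, 64))    # slight local shift
print("MMD^2(real, fake):", round(mmd2_unbiased(real_patches, fake_patches), 4))
print("MMD^2(real, real):", round(mmd2_unbiased(real_patches[:98], real_patches[98:]), 4))
```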

[CV-207] MC2: Monte Carlo Correction for Fast Elliptic PDE Solving

【速读】:该论文旨在解决偏微分方程(Partial Differential Equation, PDE)求解中计算资源受限与精度之间难以平衡的问题。传统蒙特卡洛方法(如Walk-on-Spheres, WoS)虽无偏且对几何不敏感,但收敛速度慢;而基于神经网络的求解器虽然快速,却存在偏差且在分布变化时鲁棒性差。解决方案的关键在于提出一种混合型求解框架MC²,其将低预算蒙特卡洛解作为真实解的结构化估计量,并训练一个单次前向传播的神经网络校正模块来恢复高保真解。该方法首次实验证明有限样本蒙特卡洛误差具有可学习性和可校正性,能够在仅需WoS约千分之一计算量的情况下达到相当精度,显著提升了PDE求解效率与稳定性。

链接: https://arxiv.org/abs/2605.09288
作者: Ethan Hsu,Hong Meng Yam,Ivan Ge
机构: Stanford University (斯坦福大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
备注:

点击查看摘要

Abstract:Partial differential equation (PDE) solvers underpin scientific computing, but real-world deployment is bounded by compute. Classical Monte Carlo solvers such as Walk-on-Spheres (WoS) are unbiased and geometry-agnostic but are slow. Learned solvers are fast but biased and brittle under distribution shift. We present MC², a hybrid WoS-Neural Network (WoS-NN) PDE solver that treats a low-budget Monte Carlo solution as a structured estimator of the true field and learns a single-pass neural correction to recover a high-fidelity solution. MC² matches the accuracy of solutions using over 1000× more Monte Carlo compute, outperforming all evaluated classical, denoising, and neural-operator baselines. To enable reproducible study of finite-compute PDE solving, we additionally release PDEZoo, the largest standardized elliptic PDE benchmark to date: 2M PDEs spanning five elliptic families and unlimited geometric compositions, with analytic ground truth and multi-budget Monte Carlo trajectories. Together MC² and PDEZoo (1) empirically establish that finite-sample Monte Carlo error is structured, learnable, and correctable in a single forward pass, (2) show that we can solve PDEs roughly 1000× faster than with just WoS, and (3) provide the evaluation infrastructure the field has so far lacked.
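
For readers unfamiliar with the Monte Carlo baseline being corrected, here is a minimal Walk-on-Spheres estimator for the Laplace equation on the unit disk. The harmonic boundary data g(x, y) = x^2 - y^2 doubles as the exact interior solution, so the unbiased but noisy estimate can be checked directly; this toy is not the paper's MC² pipeline.

```python
# A minimal Walk-on-Spheres (WoS) sketch for the Laplace equation on the unit
# disk, the kind of low-budget Monte Carlo estimate a learned correction could
# refine. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def g(p):                       # Dirichlet boundary data (also the exact solution)
    return p[0]**2 - p[1]**2

def wos_estimate(x0, n_walks=2000, eps=1e-3):
    total = 0.0
    for _ in range(n_walks):
        p = np.array(x0, dtype=float)
        while True:
            r = 1.0 - np.linalg.norm(p)          # distance to the disk boundary
            if r < eps:                           # close enough: stop the walk
                total += g(p / np.linalg.norm(p))
                break
            theta = rng.uniform(0.0, 2.0 * np.pi)
            p = p + r * np.array([np.cos(theta), np.sin(theta)])
    return total / n_walks

x0 = (0.3, 0.4)
print("WoS estimate :", round(wos_estimate(x0), 4))
print("exact value  :", round(g(np.array(x0)), 4))   # 0.09 - 0.16 = -0.07
```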

[CV-208] CAGS: Color-Adaptive Volumetric Video Streaming with Dynamic 3D Gaussian Splatting SIGGRAPH2026

【速读】:该论文旨在解决体积视频(Volumetric Video, VV)流媒体在低延迟、高保真渲染和异构网络环境下的带宽效率与视觉质量之间的矛盾问题,特别是针对基于3D高斯点云(3D Gaussian Splatting, 3DGS)的表示方法在自适应流媒体中因密度驱动的细节层次(Levels of Detail, LoD)策略不适用而导致的视觉断层和严重质量下降问题。解决方案的关键在于提出一种新颖的Color-Adaptive方案:利用向量量化(Vector Quantization, VQ)构建LoD,并通过低分辨率参考图像在客户端进行颜色失真校正,从而在保持较低带宽消耗的同时实现高质量重建。该方案被集成进CAGS系统中,该系统兼容多种高斯表示形式,在动态带宽条件下相比现有自适应流媒体方案PSNR提升5~20 dB,且比现有可扩展高斯压缩方法运行更快,具备良好的泛化能力。

链接: https://arxiv.org/abs/2605.09279
作者: Daheng Yin,Yili Jin,Jianxin Shi,Isaac Ding,Miao Zhang,Fangxin Wang,Zhaowu Huang,Cong Zhang,Jiangchuan Liu,Fang Dong
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Networking and Internet Architecture (cs.NI); Image and Video Processing (eess.IV)
备注: SIGGRAPH 2026 Conference Paper. Code is available at this https URL

点击查看摘要

Abstract:Volumetric video (VV) streaming enables real-time, immersive access to remote 3D environments, powering telepresence, ecological monitoring, and robotic teleoperation. These applications turn VV streaming into a real-time interface to remote physical environments, imposing new system-level demands for photorealistic scene representation, low-latency interaction, and robust performance under heterogeneous networks. 3D Gaussian Splatting (3DGS) has been widely used for real-time photorealistic rendering, offering superior visual quality and rendering performance, but it faces challenges due to bandwidth consumption. Furthermore, as the foundation of adaptive VV streaming, existing Levels of Detail (LoD) methods based on density are not well-suited to Gaussian representations, leading to visible gaps and severe quality degradation. Recent studies have also explored attribute compression techniques to reduce bandwidth consumption. Our preliminary studies reveal that aggressive attribute compression primarily causes color distortion, which can be effectively corrected in the rendered image using a reference image. Motivated by these findings, we propose a novel Color-Adaptive scheme for adaptive VV streaming that uses vector quantization (VQ) to establish LoDs and correct color distortions with low-resolution reference images. We further present CAGS, an adaptive VV streaming system compatible with diverse Gaussian representations, which integrates the Color-Adaptive scheme by rendering reference images on the streaming server and performing color restoration on the client. Extensive experiments on our prototype system demonstrate that CAGS outperforms the existing adaptive streaming systems in PSNR by 5-20 dB under fluctuating bandwidth, operates significantly faster than existing scalable Gaussian compression methods, and generalizes across different Gaussian representations.

[CV-209] Uncertainty-Aware Token Importance Estimation in Spiking Transformers

【速读】:该论文旨在解决脉冲变压器(spiking transformers)在多脉冲步骤中处理token时存在的冗余性和推理成本高的问题。现有token压缩方法主要依赖于基于响应的线索(如激活幅度、放电统计或特征相似性),但这些指标未能从类证据随时间演化的角度明确刻画token的重要性。论文的关键解决方案是提出Uncert,一个无需训练且即插即用的token重要性估计框架,其核心在于利用狄利克雷分布(Dirichlet distribution)建模每个token的类别证据,并通过统计其在多个脉冲步骤中的平均不确定性和波动性,生成具有不确定性感知的重要度评分,从而实现更有效的token剪枝与推理优化。

链接: https://arxiv.org/abs/2605.09276
作者: Wenxuan Liu,Zecheng Hao,Tong Bu,Yuran Wang,Zhaofei Yu
机构: Peking University (北京大学); School of Computer Science, Peking University (北京大学计算机学院); Institute for Artificial Intelligence, Peking University (北京大学人工智能研究院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spiking transformers have shown strong potential for neuromorphic vision, yet their token processing across multiple spiking steps still introduces substantial redundancy and inference cost. Existing token reduction methods mainly rely on response-based cues, such as activation magnitude, firing statistics, or feature similarity. Although effective, these criteria do not explicitly characterize token importance from the perspective of temporally evolving class evidence. In spiking transformers, token representations are progressively formed across multiple spiking steps rather than determined at a single instant, suggesting that token importance should be evaluated not only by instantaneous responses but also by temporal uncertainty patterns. Our key observation is that tokens exhibit heterogeneous uncertainty trajectories over time, and that their temporally aggregated uncertainty statistics provide an effective cue for distinguishing informative tokens from redundant ones. Motivated by this, we propose Uncert, a training-free and plug-and-play token importance estimation framework for spiking transformers. Specifically, Uncert models token-wise class evidence with a Dirichlet distribution and summarizes each token's temporal uncertainty using its mean and fluctuation across spiking steps, yielding an uncertainty-aware importance score for token reduction during inference. Experiments on both static and neuromorphic benchmarks show that Uncert achieves favorable accuracy and efficiency tradeoffs, with the most consistent gains observed under token pruning. Further analysis reveals a clear empirical connection between temporal uncertainty patterns and token contribution, offering new insights into token dynamics in spiking transformers.
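
A small sketch of the uncertainty bookkeeping described above: per-step class evidence is turned into Dirichlet parameters, a vacuity-style uncertainty K/Σα is computed per spiking step, and its temporal mean and fluctuation are combined into a pruning score. The evidence tensor, the λ weight, and the keep ratio are illustrative assumptions rather than Uncert's exact formulation.

```python
# A minimal sketch of Dirichlet-based token uncertainty aggregated over spiking
# steps. All quantities below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
T, N, K = 4, 16, 10                       # spiking steps, tokens, classes
evidence = rng.gamma(2.0, 1.0, size=(T, N, K))   # non-negative per-step evidence

alpha = evidence + 1.0                    # Dirichlet parameters per step/token
vacuity = K / alpha.sum(axis=-1)          # (T, N): higher = more uncertain
mean_u = vacuity.mean(axis=0)             # temporal mean per token
fluct_u = vacuity.std(axis=0)             # temporal fluctuation per token

lam = 0.5                                  # assumed trade-off weight
score = -(mean_u + lam * fluct_u)          # informative = consistently low uncertainty
keep_ratio = 0.5
kept = np.argsort(score)[::-1][: int(keep_ratio * N)]
print("kept token indices:", np.sort(kept))
```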

[CV-210] Monocular Biomechanical Tracking of Fingers with Inverse Kinematics to Foundation Models

【速读】:该论文旨在解决从单视角视频中精确追踪手部与手指运动的难题,以支持临床环境中对日常生活活动和关节活动范围(Range of Motion, ROM)的定量评估。其关键解决方案是将SAM 3D Body基础模型与逆向运动学优化相结合,在全身生物力学模型框架下提取受解剖约束的手指关节角度;同时通过将SAM 3D Body从PyTorch移植到JAX,实现与MuJoCo-MJX的集成,从而利用GPU加速优化,并开发了Momentum Human Rig(MHR)输出与生物力学模型标记之间的新型映射关系,显著提升了单目视频中手指运动参数估计的精度与鲁棒性。

链接: https://arxiv.org/abs/2605.09258
作者: R. James Cotton,Pouyan Firouzabadi,Wendy Murray
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to EMBC 2026

点击查看摘要

Abstract:Accurate hand and finger tracking from video has significant clinical applications for monitoring activities of daily living and measuring range of motion, yet monocular video approaches for obtaining hand biomechanics remain under-developed. We present a method that combines the SAM 3D Body foundation model with inverse kinematics optimization in a full-body biomechanical model to extract anatomically-constrained finger joint angles from single-view video. We port SAM 3D Body from PyTorch to JAX for integration with MuJoCo-MJX, enabling GPU-accelerated optimization, and develop a novel mapping between the Momentum Human Rig (MHR) outputs and biomechanical model markers. Validation against 8-camera multiview reconstruction on 4,590 frames from 7 participants performing a variety of hand poses and object manipulation tasks shows finger joint angle errors of approximately 10 degrees and hand position errors of approximately 6 mm, after Procrustes alignment. Results were consistent across camera viewpoints and robust to different methods for producing reference values from multiview video. This work extends monocular biomechanical analysis to detailed finger tracking, expanding access to quantitative characterization of hand movement from readily available video.

[CV-211] CalibFree: Self-Supervised View Feature Separation for Calibration-Free Multi-Camera Multi-Object Tracking

【速读】:该论文旨在解决多摄像头多目标跟踪(Multi-camera Multi-object Tracking, MCMOT)中因摄像头视角差异导致的目标身份不一致问题,尤其针对传统方法依赖精确标定和大量人工标注的瓶颈。其解决方案的关键在于提出了一种无需任何标定或人工标签的自监督表示学习框架 CalibFree,通过单视图蒸馏(single-view distillation)促进视图无关特征与视图特定特征的分离,并结合跨视图重建(cross-view reconstruction)增强特征一致性,从而在复杂动态场景下实现高效、鲁棒的跨摄像头目标关联。

链接: https://arxiv.org/abs/2605.09245
作者: Ruiqi Xian,Deep Patel,Iain Melvin,Sanjoy Kundu,Martin Renqiang Min,Dinesh Manocha
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-camera multi-object tracking (MCMOT) faces significant challenges in maintaining consistent object identities across varying camera perspectives, particularly when precise calibration and extensive annotations are required. In this paper, we present CalibFree, a self-supervised representation learning framework that does not need any calibration or manual labeling for the MCMOT task. By promoting feature separation between view-agnostic and view-specific representations through single-view distillation and cross-view reconstruction, our method adapts to complex, dynamic scenarios with minimal overhead. Experiments on the MMP-MvMHAT dataset show a 3% improvement in overall accuracy and a 7.5% increase in the average F1 score over state-of-the-art approaches, confirming the effectiveness of our calibration-free design. Moreover, on the more diverse MvMHAT dataset, our approach demonstrates superior over-time tracking and strong cross-view performance, highlighting its adaptability to a wide range of camera configurations. Code will be publicly available upon acceptance.

[CV-212] Towards Robust Sequential Decomposition for Complex Image Editing CVPR2026

【速读】:该论文旨在解决复杂图像编辑任务中现有方法的局限性问题,即单轮编辑(single-turn editing)难以准确解析组合性指令导致误编辑,而顺序编辑(sequential editing)虽能分解任务却因误差累积降低结果保真度。解决方案的关键在于构建一个统一的上下文编辑框架,通过设计合成数据流水线生成具有不同复杂度的编辑任务及高质量分解序列,并基于此对模型进行微调;实验证明,合理设计的顺序分解策略可在保持高保真度的同时提升复杂任务的鲁棒性,且合成数据中学到的分解能力可迁移至真实图像,实现从模拟到现实(sim-to-real)的泛化,从而有效应对跨领域复杂图像编辑挑战。

链接: https://arxiv.org/abs/2605.09233
作者: Zilai Zeng,Mingdeng Cao,Zijie Li,Xiaochen Lian,Yichun Shi,Peihao Zhu,Chen Sun,Peng Wang
机构: Brown University (布朗大学); ByteDance Seed (字节跳动种子); The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2026

点击查看摘要

Abstract:Recent advances in visual generative models have enabled high-fidelity image editing guided by human instructions. However, these models often struggle with complex instructions involving combinatorial editing operations or inter-step dependencies. This difficulty stems from the limitations of two canonical paradigms: (1) single-turn editing, which attempts to apply all instructed edits in one pass, often fails to parse the complex instruction accurately and causes undesired edits; and (2) sequential editing can decompose the task into simpler steps but suffers from compounding errors introduced by the sequential execution, leading to low-fidelity results. To derive a robust solution for complex image editing, we examine editing behaviors of different paradigms under a unified in-context editing framework, and study how the benefits of sequential decomposition can be balanced against its error-accumulation drawbacks. We further develop a synthetic data pipeline that constructs editing tasks of varying instruction complexity, allowing us to curate a large-scale editing dataset with high-quality decomposed sequences. By finetuning on synthetic data, we discovered that with properly designed editing paradigms, sequential decomposition yields robust improvements even as task complexity increases. Furthermore, the decomposition skills learned from synthetic tasks can transfer to real images by co-training with real-world editing data, demonstrating the promise of sim-to-real generalization for tackling complex image editing across broader domains.

[CV-213] An Elastic Shape Variational Autoencoder for Skeleton Pose Trajectories

【速读】:该论文旨在解决标准变分自编码器(Variational Autoencoder, VAE)在建模人体骨骼序列时,因未考虑几何不变性而将大量模型容量分配给非本质因素(如相机朝向、尺度、视角和执行速度)的问题。其解决方案的关键在于提出弹性形状变分自编码器(Elastic Shape-Variational Autoencoder, ES-VAE),该模型利用传输的平方根速度场(Transported Square-Root Velocity Field, TSRVF)表示法,在Kendall形状流形上显式移除刚体平移、旋转与全局缩放以及时间速率变化,从而分离出骨骼运动的本质形状动力学。通过引入黎曼对数映射(Riemannian logarithm map)进行编码、指数映射(exponential map)进行解码,ES-VAE实现了对骨骼轨迹的几何感知建模,在临床步态分析和动作识别任务中均显著优于传统VAE及多种序列建模基线方法。

链接: https://arxiv.org/abs/2605.09231
作者: Arafat Rahman,Shashwat Kumar,Laura E. Barnes,Anuj Srivastava
机构: University of Virginia (弗吉尼亚大学); Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: 9 pages

点击查看摘要

Abstract:Deep generative models provide flexible frameworks for modeling complex, structured data such as images, videos, 3D objects, and texts. However, when applied to sequences of human skeletons, standard variational autoencoders (VAEs) often allocate substantial capacity to nuisance factors-such as camera orientation, subject scale, viewpoint, and execution speed-rather than the intrinsic geometry of shapes and their motion. We propose the Elastic Shape - Variational Autoencoder (ES-VAE), a geometry-aware generative model for skeletal trajectories that leverages the transported square-root velocity field (TSRVF) representation on Kendall’s shape manifold. This representation inherently removes rigid translations, rotations, and global scaling of shapes, and temporal rate variability of sequences, isolating the underlying shape dynamics. The ES-VAE encoder maps skeletal sequences to a low-dimensional latent space incorporating the Riemannian logarithm map, while the decoder reconstructs sequences using the corresponding exponential map. We demonstrate the effectiveness of ES-VAE on two datasets. First, we analyze skeletal gait cycles to predict clinical mobility scores and classify subjects into healthy and post-stroke groups. Second, we evaluate action recognition on the NTU RGB+D dataset. Across both settings, ES-VAE consistently outperforms standard VAEs and a range of sequence modeling baselines, including temporal convolutional networks, transformers, and graph convolutional networks. More broadly, ES-VAE provides a principled framework for learning generative models of longitudinal data on pose shape manifolds, offering improved latent representation and downstream performance compared to existing deep learning approaches.
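
The square-root velocity idea underlying the TSRVF can be seen in a few lines. The sketch below computes the plain (non-transported) SRVF of a sampled curve and checks that the functional ∫|q|² dt, which equals arc length, is unchanged when the same circle is traversed at a different rate; the full ES-VAE construction additionally transports these fields on Kendall's shape manifold.

```python
# A minimal sketch of the (non-transported) square-root velocity field, the
# building block behind the TSRVF representation used above. NumPy toy only.
import numpy as np

def srvf(X, dt=1.0, eps=1e-8):
    """X: (T, d) sampled curve -> q: (T-1, d) square-root velocity field."""
    V = np.diff(X, axis=0) / dt                      # finite-difference velocity
    speed = np.linalg.norm(V, axis=1, keepdims=True)
    return V / np.sqrt(speed + eps)                  # q(t) = v(t) / sqrt(|v(t)|)

def energy(q, dt):
    """Integral of |q|^2, which equals the arc length of the original curve."""
    return float(np.sum(np.linalg.norm(q, axis=1) ** 2) * dt)

t = np.linspace(0.0, 1.0, 100)
dt = t[1] - t[0]
# The same unit circle traversed at two different rates.
curve_slow = np.stack([np.cos(2 * np.pi * t), np.sin(2 * np.pi * t)], axis=1)
curve_fast = np.stack([np.cos(2 * np.pi * t**2), np.sin(2 * np.pi * t**2)], axis=1)

print("integral |q|^2 (constant rate):", round(energy(srvf(curve_slow, dt), dt), 3))  # ~2*pi
print("integral |q|^2 (variable rate):", round(energy(srvf(curve_fast, dt), dt), 3))  # ~2*pi
```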

[CV-214] CATS: Curvature Aware Temporal Selection for efficient long video understanding

【速读】:该论文旨在解决长视频理解中在严格计算预算下如何高效选择关键帧的问题,传统方法因无法穷举所有帧而难以实现最优选择,且现有轻量级方案效率与准确性难以兼顾。其解决方案的关键在于提出CATS(Curvature-aware Frame Selection),通过显式建模查询-帧相关性的时序几何特性,利用时序曲率自适应调整选帧密度,从而精准识别显著事件及其上下文,同时抑制冗余帧;该方法在固定骨干网络和帧预算条件下,相比AKS等轻量级方法显著提升性能,并以仅3–4%的预处理开销获得MIRA等多阶段方法约93–95%的准确率,实现了高效的效率-精度权衡。

链接: https://arxiv.org/abs/2605.09223
作者: Mehrajul Abadin Miraj,Abdul Mohaimen Al Radi,Shariful Islam Rayhan,Md. Tanvir Alam,Ismat Rahman,Yu Tian,Md Mosaddek Khan
机构: University of Dhaka (达卡大学); University of Central Florida (中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding long videos with multimodal large language models (MLLMs) requires selecting a small subset of informative frames under strict computational budgets, where exhaustive processing is infeasible and optimal selection is combinatorial. We propose CATS, a curvature-aware frame selection method that explicitly models the temporal geometry of query-frame relevance to identify salient events and their surrounding context. By leveraging temporal curvature to adapt selection density, CATS captures both abrupt transitions and gradually evolving content while suppressing redundant frames. Under a fixed backbone and frame budget, CATS consistently outperforms prior lightweight approaches such as AKS on LongVideoBench and VideoMME. While multi-stage methods such as MIRA achieve higher absolute accuracy, they incur substantial computational overhead; in contrast, CATS retains approximately 93-95% of MIRA’s performance while requiring only 3-4% of its preprocessing cost, yielding a favorable efficiency-accuracy trade-off. Beyond answer accuracy, we evaluate description generation using an LLM-as-a-judge protocol, and the obtained results show that CATS produces more coherent and informative outputs, indicating improved grounding in visual evidence. These results position CATS as a computationally efficient and principled approach to long-video understanding.
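
The allocation rule sketched below mirrors the stated idea at a toy scale: a query-relevance curve is differentiated twice, the absolute curvature is mixed with a uniform floor to form a sampling density, and frames are drawn by inverting its CDF under a fixed budget. The relevance curve, the floor weight, and the inverse-CDF allocation are illustrative assumptions, not the exact CATS procedure.

```python
# A minimal sketch of curvature-aware frame selection: frames are drawn more
# densely where the query-relevance curve bends sharply. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
T, budget = 300, 16
t = np.arange(T)
# Toy relevance: two salient events on a slowly varying background.
relevance = 0.2 + 0.8 * np.exp(-((t - 80) / 10.0) ** 2) + 0.6 * np.exp(-((t - 220) / 25.0) ** 2)

curvature = np.abs(np.gradient(np.gradient(relevance)))      # |d^2 s / dt^2|
density = 0.1 / T + curvature / (curvature.sum() + 1e-8)     # uniform floor + curvature
density /= density.sum()

cdf = np.cumsum(density)
targets = (np.arange(budget) + 0.5) / budget                  # evenly spaced in CDF space
selected = np.unique(np.searchsorted(cdf, targets))
print("selected frame indices:", selected)
```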

[CV-215] Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models

【速读】:该论文旨在解决3D场景理解中泛化能力不足的问题,特别是如何在不依赖大规模3D语言训练的前提下,实现对自由空间、物体定位、假设性物体插入、复杂几何关系的开放域推理,并支持与外部工具和数据源的集成。其解决方案的关键在于提出一个无需训练(training-free)的框架Flame3D,该框架将场景表示为可编辑的视觉-文本3D记忆(visual-textual 3D memories),并通过组合式空间工具(composable spatial tools)将其暴露给现成的多模态大语言模型(Multimodal Large Language Model, MLLM)。此外,Flame3D能够在推理时合成定制的空间程序,从而实现对布局、空区域及尚未存在于场景中的物体的开放推理,同时允许外部数据和修正被动态加入记忆而无需重新训练。实验表明,固定工具表现有限,而推理时动态合成空间操作的能力是实现多跳3D推理的关键。

链接: https://arxiv.org/abs/2605.09218
作者: Sagar Bharadwaj,Ziyong Ma,Anurag Ghosh,Srinivasan Seshan,Anthony Rowe
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:3D scene understanding spans reasoning about free space, object grounding, hypothetical object insertions, complex geometric relationships, and integrating all of these with external tools and data sources. Existing 3D understanding methods typically rely on large-scale 3D-language training or focus on object grounding and simple spatial relationships. We argue that the broad generalization that motivates 3D-language training can be achieved at inference time, without 3D-specific training. We propose Flame3D, a training-free framework that represents scenes as editable visual-textual 3D memories and exposes them to an off-the-shelf MLLM through composable spatial tools. Flame3D also lets the agent synthesize custom spatial programs at inference time, enabling open-ended reasoning over layouts, empty space, and objects not yet present in the scene. External data and corrections can be added to the memory without retraining. In addition to showing competitive performance to finetuned 3D-LMM methods on ScanQA, we study multi-hop 3D reasoning capabilities of Flame3D by evaluating it on a curated compositional spatial-reasoning benchmark, Compose3D. We find that fixed tools fall short and that the agent’s ability to synthesize spatial operations at inference time is essential. These results invite the question: should future progress in 3D scene understanding focus on richer scene memories and expressive compositional abstractions?

[CV-216] RigidFormer: Learning Rigid Dynamics using Transformers

【速读】:该论文旨在解决从无网格表示(如点云)中高效建模高保真刚体动力学的挑战,尤其是现有方法受限于网格连接性与顶点级消息传递机制,导致计算成本高且难以处理非结构化输入。其核心解决方案是提出 RigidFormer,一个基于对象中心的 Transformer 模型,通过在物体层面进行推理并使用紧凑锚点推进每个物体的运动;关键创新在于 Anchor-Vertex Pooling 机制,在保留接触相关几何信息的同时避免密集的顶点级交互,并引入 Anchor-based RoPE(旋转位置编码)将锚点几何嵌入注意力机制中,同时保持对象和锚点顺序不变性;此外,通过可微分的 Kabsch 对齐操作强制更新投影到刚体流形上以确保刚性约束,从而实现对复杂刚体系统(支持200+物体)的高效、准确模拟。

链接: https://arxiv.org/abs/2605.09196
作者: Zhiyang Dou,Minghao Guo,Haixu Wu,Doug Roble,Tuur Stuyck,Wojciech Matusik
机构: MIT (麻省理工学院); Meta (Meta)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Project Page: this https URL

点击查看摘要

Abstract:Learning-based simulation of multi-object rigid-body dynamics remains difficult because contact is discontinuous and errors compound over long horizons. Most existing methods remain tied to mesh connectivity and vertex-level message passing, which limits their applicability to mesh-free inputs such as point clouds and leads to high computational cost. Efficiently modeling high-fidelity rigid-body dynamics from mesh-free representations, therefore, remains challenging. We introduce RigidFormer, an object-centric Transformer-based model that learns mesh-free rigid-body dynamics with controllable integration step sizes. RigidFormer reasons at the object level and advances each object through compact anchors; Anchor-Vertex Pooling enriches these anchors with local vertex features, retaining contact-relevant geometry without dense vertex-level interaction. We propose Anchor-based RoPE to inject anchor geometry into attention while respecting the unordered nature of objects and anchors: object-token processing is permutation-equivariant, and the mean-pooled anchor descriptor is invariant to anchor reindexing while preserving shape extent. RigidFormer further enforces rigidity by projecting updates onto the rigid-body manifold using differentiable Kabsch alignment. On standard benchmarks, RigidFormer outperforms or matches mesh-based baselines using point inputs, runs faster, generalizes to unseen point resolutions and across datasets, and scales to 200+ objects; we also show a preliminary extension to command-conditioned articulated bodies by treating body parts as interacting object-level components.
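
The rigidity projection mentioned above relies on the standard Kabsch algorithm. The NumPy sketch below recovers the least-squares rotation and translation between an object's reference points and a noisy predicted update, and snaps the prediction onto a rigid transform; the paper uses a differentiable version inside training, which this toy does not attempt to reproduce.

```python
# A minimal SVD-based Kabsch sketch: project a (possibly non-rigid) predicted
# point update back onto the rigid-body manifold.
import numpy as np

def kabsch_project(src, pred):
    """Return the rigid transform of `src` closest (least squares) to `pred`."""
    c_src, c_pred = src.mean(0), pred.mean(0)
    H = (src - c_src).T @ (pred - c_pred)            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))           # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_pred - R @ c_src
    return src @ R.T + t, R, t

rng = np.random.default_rng(0)
src = rng.normal(size=(50, 3))                        # anchor/vertex positions
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
pred = src @ R_true.T + np.array([0.5, -0.2, 0.1]) + 0.01 * rng.normal(size=(50, 3))

rigid, R, t = kabsch_project(src, pred)
print("mean residual after projection:", round(np.linalg.norm(rigid - pred, axis=1).mean(), 4))
```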

[CV-217] AQMP: Image compression through Adaptive Quadtree Refinement and Matching Pursuit with Hyperparameter Optimization

【速读】:该论文旨在解决传统图像压缩方法在压缩效率与视觉质量之间难以平衡的问题,特别是针对固定块大小的匹配追踪(Matching Pursuit)算法在处理复杂与平滑区域时存在资源分配不合理的问题。其解决方案的关键在于提出一种名为AQMP的新颖图像编解码器,该方法结合自适应四叉树细化(Adaptive Quadtree Refinement)与匹配追踪技术,能够根据图像局部结构动态调整划分块的大小:在图像内容复杂区域采用细粒度分区,在平滑区域则使用粗粒度分区。这种自适应机制显著提升了压缩比,同时保留了良好的视觉质量,并支持在叶节点级别和单个节点压缩过程中实现高效并行化。

链接: https://arxiv.org/abs/2605.09190
作者: Franco Cerino,Emmanuel Tassone,Manuel Tiglio
机构: CONICET; Universidad Nacional de Córdoba (科尔多瓦国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 34 pages, 18 figures

点击查看摘要

Abstract:We present AQMP, a novel image codec combining Adaptive Quadtree Refinement with Matching Pursuit. Unlike conventional Matching Pursuit methods that operate on fixed-size sub-images, AQMP dynamically adapts block sizes to local image structure, allocating finer partitions where the image is complex and coarser ones where it is smooth. This adaptivity yields superior compression ratios compared to fixed-size block Matching Pursuit at equivalent image quality, while offering significant parallelization opportunities at both the tree-leaf level and during compression of individual nodes. The algorithm is governed by user-specified accuracy and sparsity parameters alongside a small set of additional hyperparameters. To navigate the trade-off between compression efficiency and visual quality, we perform multi-objective hyperparameter optimization using the Tree-Structured Parzen Estimator, producing comprehensive Pareto fronts. Experimental results show that AQMP achieves up to 4× higher compression rates than JPEG at comparable SSIM values, while maintaining competitive quality across a broad range of compression regimes. Performance evaluation is provided using a representative set of test images. To ensure reproducibility and promote adoption, we have made our implementation publicly available on GitHub under the MIT license.
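
A minimal view of the per-block logic: greedy matching pursuit against a unit-norm dictionary, followed by a split decision when the accuracy target cannot be met within the sparsity budget. The random dictionary, the thresholds, and the split rule below are illustrative assumptions, not AQMP's tuned settings.

```python
# A minimal sketch of matching pursuit on one block plus a quadtree split
# decision. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def matching_pursuit(x, D, max_atoms, tol):
    """Greedy MP: returns (indices, coefficients, final residual norm)."""
    r = x.copy()
    idx, coef = [], []
    for _ in range(max_atoms):
        scores = D.T @ r                          # correlation with each atom
        k = int(np.argmax(np.abs(scores)))
        idx.append(k)
        coef.append(scores[k])
        r = r - scores[k] * D[:, k]
        if np.linalg.norm(r) <= tol * np.linalg.norm(x):
            break
    return idx, coef, np.linalg.norm(r)

block = rng.normal(size=64)                       # flattened 8x8 block
D = rng.normal(size=(64, 256))
D /= np.linalg.norm(D, axis=0, keepdims=True)     # unit-norm atoms

idx, coef, res = matching_pursuit(block, D, max_atoms=10, tol=0.05)
split = res > 0.05 * np.linalg.norm(block)        # refine the quadtree if MP failed
print(f"atoms used: {len(idx)}, residual: {res:.3f}, split block: {split}")
```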

[CV-218] Establishing Robust Retinal Eye Tracking: A Weakly Supervised Algorithmic Framework

【速读】:该论文旨在解决现有基于视网膜图像的眼动追踪算法在真实世界成像条件下鲁棒性不足的问题,尤其是传统模板匹配方法对视网膜特征变异和复杂成像环境的适应能力有限。其解决方案的关键在于提出了一种新颖的弱监督学习框架,通过数据驱动的方式提升眼动追踪的准确性与稳定性,实验表明该方法在6名受试者中实现了95百分位 gaze error 低于0.45度的高精度表现。

链接: https://arxiv.org/abs/2605.09181
作者: Bo Wen,Dillon Lohr,Yatong An,Pushkar Anand,Alexander Fix,Ruobing Qian,Catherine A.Fromm,Yimin Ding,Truong Nguyen,Mohamed El-Haddad,Francesco La Rocca
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Image and Video Processing (eess.IV)
备注: 2026 IEEE International Conference on Image Processing (Accepted for Publication)

点击查看摘要

Abstract:Retinal image-based eye tracking is widely used in ophthalmic imaging and vision science, and is a promising path to deliver higher gaze accuracy than the pupil- and cornea-based approaches commonly used in modern AR/VR devices. Nevertheless, existing retinal tracking algorithms still primarily rely on classical template-matching registration, which can be insufficiently robust to retinal feature variability and real-world imaging conditions. In this work, we propose a novel weakly-supervised, learning-based framework for robust retinal eye tracking. Initial studies demonstrate high accuracy, achieving a 95th-percentile gaze error below 0.45 deg across a cohort of 6 participants.

[CV-219] MultiMedVision: Multi-Modal Medical Vision Framework

【速读】:该论文旨在解决多模态医学影像(如2D X光片与3D CT扫描)在现有基础模型中因维度差异而需使用独立架构导致的表示学习割裂问题。其核心解决方案是提出MultiMedVision框架,采用稀疏视觉Transformer(Sparse Vision Transformer)结构,并引入3D旋转位置嵌入(3D Rotary Positional Embeddings)与可变长度序列打包(variable-length sequence packing)技术,使模型能够原生地在同一潜在空间中处理混合模态数据批次,无需为不同维度设计特定适配器或将3D体积视为2D切片序列。这一方法实现了跨维度统一表征学习,在仅用5倍少数据的情况下仍保持了对2D和3D任务的竞争力,且分析表明模型能同时保留模态特异性与共享特征子空间,验证了联合跨维度表征学习的有效性。

链接: https://arxiv.org/abs/2605.09151
作者: Frank Li,Bardia Khosravi,Mohammadreza Chavoshi,Young Seok Jeon,Theo Dapamede,Hari Trivedi,Janice Newsome,Judy Gichoya
机构: Emory University (埃默里大学); Yale University (耶鲁大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 2 figures

点击查看摘要

Abstract:Multi-modal medical imaging enables comprehensive diagnostics, yet current foundation models process 2D (e.g. X-ray) and 3D (e.g. CT) data with separate, dimensionality-specific architectures. We present MultiMedVision, a unified framework for joint 2D/3D representation learning built on a Sparse Vision Transformer. Our model uses 3D Rotary Positional Embeddings and variable-length sequence packing to process mixed-modality batches natively within a shared latent space, without modality-specific adapters or treating 3D volumes as 2D slice sequences. Trained with a self-supervised objective on chest X-rays (MIMIC-CXR) and CT scans (CT-RATE), and using a single shared encoder with 5x less data, MultiMedVision achieves competitive performance on both 2D benchmarks (Macro AUROC 0.82 on MIMIC, 0.84 on CheXpert) and 3D tasks (0.85 on CT-RATE). Analysis of the learned representations reveals coexisting modality-specific and shared feature subspaces, demonstrating that unified cross-dimensional representation learning is feasible without sacrificing modality-specific performance.

[CV-220] Beyond Thinking: Imagining in 360° for Humanoid Visual Search

【速读】:该论文旨在解决人形视觉搜索(Humanoid Visual Search, HVS)任务中因依赖累积式多轮链式思维(Chain-of-Thought, CoT)推理而导致的认知负担过重及轨迹级标注成本高昂的问题。解决方案的关键在于提出一种名为“360°想象”(Imagining in 360°)的新型框架,该框架通过解耦探索过程为专门的“想象器”(Imaginator)和“执行者”(Actor)两个模块:其中,想象器作为概率性空间先验预测器,在单步内推断已观测与未观测区域的语义布局,并通过在该语义空间中采样多个假设,向执行者提供有效空间信息的概率分布,从而在主动搜索过程中增强对不确定性的鲁棒性引导。此架构显著降低了数据工程成本,无需完整轨迹的CoT标注即可生成超196万条高质量训练样本,实验证明显式建模语义空间先验能大幅提升复杂真实环境中的搜索效率与成功率。

链接: https://arxiv.org/abs/2605.09146
作者: Jingdong Zhang,Yizhou Wang,Zhengzhong Tu,Xin Li,Wenping Wang,Xiaohang Zhan
机构: Texas AM University; Adobe
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Humanoid Visual Search (HVS) requires agents to actively explore immersive 360° environments. While prior methods treat this as a monolithic task relying on cumulative, multi-turn Chain-of-Thought (CoT) reasoning, they impose heavy cognitive burdens and require expensive trajectory-level annotations. In this paper, we propose Imagining in 360°, a novel framework that decouples the exploration process into a specialized Imaginator and an Actor. The Imaginator functions as a probabilistic predictor of spatial priors; instead of maintaining a cumulative reasoning chain, it infers the semantic layout of both observed and unobserved regions in a single step. By sampling multiple hypotheses within this semantic space, we provide the Actor with a distribution of effective spatial information, offering robust guidance that hedges against uncertainty during active search. This decoupled architecture significantly lowers data engineering costs by eliminating the need for full-trajectory CoT annotations, enabling the generation of over 1.96 million curated training samples. Extensive experiments demonstrate that explicitly modeling semantic spatial priors drastically improves search efficiency and success rates in complex, in-the-wild environments.

[CV-221] KEPIL: Knowledge-Enhanced Prompt-Image Learning for Prompt-Robust Disease Detection

【速读】:该论文旨在解决当前基于CLIP风格的医学视觉-语言模型(Vision-Language Models, VLMs)在放射科临床决策支持中面临的两个核心问题:一是由于放射学发现呈现长尾分布,导致部分疾病样本稀缺,使得零样本推理(zero-shot inference)成为必要;二是现有模型对提示(prompt)变化敏感且推理时缺乏可信的外部知识,影响临床部署的可靠性。解决方案的关键在于提出一个名为KEPIL的提示鲁棒性框架,其核心创新包括:(i) 利用本体论(ontology)与大语言模型(LLM)辅助的动态提示增强机制,引入结构化医学知识;(ii) 设计语义感知对比损失(semantic-aware contrastive loss),通过双嵌入目标对齐等效提示变体的表征;(iii) 基于实体中心的报告标准化策略,生成与本体对齐的表示。实验证明,KEPIL在七个基准测试中实现了最先进的零样本性能,并显著提升了对提示扰动的鲁棒性,表明结构化知识整合与鲁棒提示设计是构建临床可靠放射科VLM的关键。

链接: https://arxiv.org/abs/2605.09132
作者: Haozhe Luo,Shelley Zixin Shu,Ziyu Zhou,Robert Berke,Mauricio Reyes
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision–language models (VLMs) show promise for clinical decision support in radiology because they enable joint reasoning over radiological images and clinical text, thereby leveraging complementary clinical information. However, radiological findings are long-tailed in practice, leaving some conditions underrepresented and making zero-shot inference essential. Yet current CLIP-style medical VLMs are sensitive to prompt variations and often lack trustworthy external knowledge at inference time, which hinders reliable clinical deployment. We present KEPIL, a prompt-robust framework that integrates curated medical knowledge to stabilize zero-shot generalization. KEPIL comprises: (i) dynamic prompt enrichment using ontologies with LLM assistance, (ii) a semantic-aware contrastive loss aligning embeddings of equivalent prompt variants via a dual-embedding objective, and (iii) entity-centric report standardization to yield ontology-aligned representations. Across seven benchmarks, KEPIL achieves state-of-the-art zero-shot inference performance; under prompt-variation tests, it improves AUC by 6.37% on CheXpert and by 4.11% on average. These results suggest that structured knowledge and robust prompt design are key to clinically reliable radiology-facing VLMs. Code will be released at this https URL.

[CV-222] Investigating Anisotropy in Visual Grounding under Controlled Counterfactual Perturbations CVPR2026

【速读】:该论文旨在解决视觉定位(Visual Grounding)模型在面对语义不匹配的指代表达(referring expression)时出现的近似行为问题,即模型常生成看似合理但仅满足表达部分语义的边界框,从而影响模型的可靠性与可解释性。其解决方案的关键在于从机制解释(mechanistic interpretability)视角出发,系统性地检验嵌入空间各向异性(embedding anisotropy)是否是导致此类反事实错误(counterfactual failures)的根本原因。为此,作者提出一种受相似度控制的反事实描述生成协议,在预定义的嵌入相似度区间内扰动对象或上下文组件,实现对定位行为随对齐程度变化的细粒度分析。实验结果表明,余弦相似度与近似行为之间无显著相关性,说明嵌入各向异性并非主导因素,进而指出提升模型鲁棒性需深入探究嵌入空间更精细的几何特性。

链接: https://arxiv.org/abs/2605.09090
作者: Gabriele Lombardo,Luigi Maiorana,Liliana Lo Presti,Marco La Cascia
机构: University of Palermo (帕勒莫大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: To be published in the proceedings of the 5th Explainable AI for Computer Vision (XAI4CV) Workshop at CVPR 2026

点击查看摘要

Abstract:Visual Grounding benchmarks assume that the object described by a referring expression is always present in the image, and grounding models are therefore rarely evaluated under semantically mismatched captions. In such cases, models frequently exhibit approximation behavior, producing a plausible bounding box that satisfies only part of the expression (e.g., preserving the original object while ignoring modified contextual cues). Because mismatched captions represent realistic edge cases, this behavior compromises reliability and raises concerns from an explainability perspective. Identifying its underlying causes is thus essential for improving model faithfulness and interpretability. Adopting a mechanistic interpretability viewpoint, this work examines whether embedding anisotropy contributes to counterfactual failures. A similarity-controlled counterfactual caption generation protocol is introduced to systematically perturb object or contextual components within predefined embedding similarity intervals, enabling a fine-grained analysis of grounding behavior as a function of alignment. Experiments on two Transformer-based models with markedly different embedding geometries (BERT-based TransVG and CLIP-based SwimVG) reveal no meaningful correlation between cosine similarity and approximation. These findings suggest that anisotropy alone does not account for counterfactual errors, and that robustness requires investigating finer-grained geometric properties of the embedding space.

[CV-223] Field-Localized Forgery Detection for Digital Identity Documents

【速读】:该论文旨在解决数字身份验证系统在远程开户场景中因依赖文档图像进行用户认证而面临的局部篡改风险问题,特别是针对身份信息关键区域(如人脸照片和文本字段)的伪造攻击检测能力不足的问题。现有基于自然图像取证的方法在结构化身份文档上的迁移性能有限,难以有效识别针对性伪造行为。其解决方案的关键在于提出一种轻量级、领域专用的场域定位框架FLiD(Field-Localized Identity Detection),该框架首先通过微调的目标检测器精确定位人脸与文本区域,再利用冻结的MobileNetV3-Small主干网络提取紧凑的场域级嵌入特征,并由仅含191K可训练参数的轻量神经网络完成分类。此设计聚焦于关键身份字段而非全图处理,在显著降低计算复杂度的同时实现了高精度的伪造检测性能,相较全图基线模型在AUC指标上提升29–35个百分点,并优于通用篡改检测方法(TruFor、MMFusion、UniVAD)。

链接: https://arxiv.org/abs/2605.09089
作者: Abhishek Kumar,Riya Tapwal,Carsten Maple,Mark Hooper
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Digital identity verification systems used in remote onboarding rely on document images to authenticate users, making them vulnerable to localized manipulations of key identity fields such as facial photographs and textual information. Existing forgery detection methods, developed primarily for natural-image forensics, show limited transferability to structured identity documents. We propose FLiD, a lightweight field-localized framework that targets critical identity regions rather than processing full-document images. A fine-tuned object detector first localizes face and text fields; a frozen MobileNetV3-Small backbone then extracts compact field-level embeddings, which are classified by a lightweight neural network with only 191K trainable parameters. FLiD achieves AUC scores of 0.880 (face), 0.954 (text), and 0.923 (both-field attacks), with corresponding EERs of 18.05%, 11.61%, and 15.16%, representing absolute reductions of 29-35 percentage points over a full-document baseline trained from scratch. FLiD also consistently outperforms general-purpose manipulation detectors (TruFor, MMFusion, UniVAD) across all attack scenarios while requiring 13x fewer parameters and 21x fewer FLOPs.

[CV-224] Probability-Flow Distillation: Exact Wasserstein Gradient Flow for High-Fidelity 3D Generation

【速读】:该论文旨在解决Score Distillation Sampling (SDS) 及其变体在文本到3D生成任务中因模式崩溃(mode collapse)导致的过度平滑和过饱和问题,以及Score Distillation via Inversion (SDI) 虽能改善视觉锐度但仍无法忠实捕捉目标分布的问题。解决方案的关键在于识别出SDI性能受限的根本原因——其依赖后验均值估计器,该估计器在数学上等价于确定性反向DDIM轨迹的单步欧拉近似;为此,作者提出了一种自然延伸的方法——Probability-Flow Distillation (PFD),并证明PFD精确对应Wasserstein梯度流,从而实现有原则的分布匹配动力学,最终在3D资产生成中实现更精细、高保真度的细节表现。

链接: https://arxiv.org/abs/2605.09071
作者: Rohith Ramanan,A. N. Rajagopalan
机构: Indian Institute of Technology Madras (印度理工学院马德拉斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Score Distillation Sampling (SDS) and its variants have been widely used for text-to-3D generation by distilling 2D image diffusion priors. However, the standard SDS objective is prone to severe mode collapse, frequently yielding over-smoothed and over-saturated results. Although recent advancements, such as Score Distillation via Inversion (SDI), mitigate these artifacts and produce visually sharper models, they ultimately fail to faithfully capture the full target distribution. In this work, we show that the bottleneck limiting the sampling capacity of SDI stems from its reliance on the posterior mean estimator, which is mathematically equivalent to a single-step Euler approximation of the deterministic reverse DDIM trajectory. To address this, we propose a naturally motivated extension termed Probability-Flow Distillation (PFD). We establish that PFD corresponds exactly to a Wasserstein gradient flow, thereby inducing principled distribution-matching dynamics. Finally, we show that PFD can synthesize 3D assets with fine-grained, high-fidelity details and achieve improved quality compared to existing methods.
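
The claimed bottleneck, that the posterior-mean estimator is a one-step Euler approximation of the deterministic DDIM trajectory, can be made concrete with a 1-D Gaussian toy where the optimal noise predictor is known in closed form. In the sketch below the one-step x0 estimate collapses toward the prior mean, while a multi-step deterministic trajectory roughly preserves the data spread; the schedule and step counts are illustrative assumptions, and this is not the paper's PFD algorithm.

```python
# A toy 1-D illustration of single-step posterior-mean vs multi-step
# deterministic (probability-flow / DDIM-style) integration. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
mu, s = 2.0, 1.5                      # toy "data" distribution N(mu, s^2)
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
abar = np.cumprod(1.0 - betas)        # cumulative alpha-bar schedule

def eps_star(x_t, t):
    """Exact E[eps | x_t] for the Gaussian toy prior."""
    a = abar[t]
    x0_mean = (np.sqrt(a) * s**2 * x_t + (1 - a) * mu) / (a * s**2 + 1 - a)
    return (x_t - np.sqrt(a) * x0_mean) / np.sqrt(1 - a)

def ddim_trajectory(x_T, steps):
    ts = np.linspace(T - 1, 0, steps).astype(int)
    x = x_T
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        e = eps_star(x, t_cur)
        x0_hat = (x - np.sqrt(1 - abar[t_cur]) * e) / np.sqrt(abar[t_cur])
        x = np.sqrt(abar[t_next]) * x0_hat + np.sqrt(1 - abar[t_next]) * e
    return x

x_T = rng.normal(np.sqrt(abar[-1]) * mu,
                 np.sqrt(abar[-1] * s**2 + 1 - abar[-1]), size=5000)
one_step = (x_T - np.sqrt(1 - abar[-1]) * eps_star(x_T, T - 1)) / np.sqrt(abar[-1])
multi_step = ddim_trajectory(x_T, steps=50)

print("data std          :", s)
print("one-step x0 std   :", round(one_step.std(), 3))   # shrunk toward the mean
print("50-step DDIM std  :", round(multi_step.std(), 3)) # close to the data std
```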

[CV-225] Reducing Annotation Burden for Femoral Cartilage Segmentation in Knee MRI via Cross-Sequence Transfer Learning

【速读】:该论文旨在解决膝关节软骨分割中对大量标注数据依赖的问题,通过跨序列迁移学习(cross-sequence transfer learning)减少目标序列的标注需求。其关键解决方案是利用预训练的2D U-Net模型,在双回波稳态(DESS)与矢状面质子密度加权3D快速自旋回波(Cube)两种MRI序列之间进行双向迁移学习,通过在目标序列上逐步增加训练样本数量来优化模型性能,并验证不同序列间迁移的有效性与方向依赖性。

链接: https://arxiv.org/abs/2605.09067
作者: Francesco Chiumento,Gianluigi Crimi,Elisa Moretta,Rocco Milieri,Alberto Bazzocchi,Giulio Vara,Giacomo Dal Fabbro,Stefano Zaffagnini,Fulvia Taddei,Serena Bonaretti
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Purpose: To develop and evaluate cross-sequence transfer learning for automatic femoral cartilage segmentation, testing bidirectional transfer between dual-echo steady-state (DESS) and sagittal proton density-weighted 3D fast spin-echo (Cube) sequences. Materials and Methods: We optimized a modified 2D U-Net on 507 DESS images from the Osteoarthritis Initiative (OAI). We then established same-sequence baselines using subject-level cross-validation on a subset of 44 OAI DESS images and 44 Cube images acquired at the Istituto Ortopedico Rizzoli, Bologna, Italy. Each subset included 22 non-lesioned and 22 lesioned subjects. Finally, we performed transfer learning across sequences by fine-tuning the pretrained models on the target sequence with increasing training set sizes to study convergence, while keeping validation and test sets fixed. Segmentations were evaluated using Dice similarity coefficient (DSC) and average surface distance (ASD). Lesion effects were assessed with two-sided Mann-Whitney U tests with Bonferroni correction. Results: Same-sequence training yielded higher accuracy on DESS than Cube (DSC, 0.900 vs 0.830; P < .001). Cube-to-DESS transfer matched DESS performance (DSC, 0.903 ± 0.032 vs 0.900 ± 0.027), reaching a performance plateau at 9 training subjects. DESS-to-Cube yielded a lower combined DSC (0.802 ± 0.049 vs 0.830 ± 0.042), reaching a plateau at 24 training subjects. Lesions did not affect DESS (P ≥ .39) but reduced Cube accuracy (DSC, 0.805 vs 0.856; P < .001). Conclusion: Transfer learning across sequences can substantially reduce target-sequence annotation requirements for femoral cartilage segmentation, but performance is direction- and sequence-dependent, and the effects of lesions on segmentation may vary across MRI sequences.
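
For reference, the two evaluation ingredients used above are easy to reproduce: the Dice similarity coefficient on binary masks and a two-sided Mann-Whitney U test comparing lesioned against non-lesioned scores. The synthetic masks and score distributions below are illustrative only.

```python
# A minimal sketch of the Dice metric and the group comparison used above.
import numpy as np
from scipy.stats import mannwhitneyu

def dice(pred, gt, eps=1e-8):
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

rng = np.random.default_rng(0)
gt = rng.random((64, 64)) > 0.7
pred = gt.copy()
flip = rng.random((64, 64)) > 0.95          # corrupt ~5% of pixels
pred[flip] = ~pred[flip]
print("Dice:", round(dice(pred, gt), 3))

dsc_lesioned = rng.normal(0.80, 0.05, size=22)      # hypothetical per-subject scores
dsc_nonlesioned = rng.normal(0.86, 0.04, size=22)
stat, p = mannwhitneyu(dsc_lesioned, dsc_nonlesioned, alternative="two-sided")
print("Mann-Whitney U p-value:", round(p, 4))
```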

[CV-226] Dependency-Aware Discrete Diffusion for Scene Graph Generation

【速读】:该论文旨在解决从自然语言中生成结构化场景图(Scene Graph, SG)的问题,尤其针对现有离散扩散模型在处理场景图时无法有效建模对象、边与关系之间的层次结构和强依赖性这一局限。其解决方案的关键在于提出一种依赖感知且具有层次约束的离散扩散模型,通过在前向和反向过程中解耦结构与语义信息,使模型能够精准捕捉条件依赖关系;同时,在推理阶段实现无需训练的条件采样,从而生成与文本对齐的场景图,显著提升下游图像生成任务中的组合一致性,特别是在多物体场景下表现更优。

链接: https://arxiv.org/abs/2605.09065
作者: Rajalaxmi Rajagopalan,Romit Roy Choudhury
机构: University of Illinois, Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Scene graphs (SGs) represent objects and their relationships as structured graphs, enabling applications in image generation, robotics, and 3D understanding. Recent work suggests that conditioning image generation on scene graphs improves compositional fidelity compared to text-only prompting. However, since users typically provide text rather than structured graphs, a key challenge is to generate scene graphs from natural language. Prior work on discrete diffusion has demonstrated success in generating generic graphs such as molecules and circuits, but fails to account for the hierarchical structure and strong dependencies between objects, edges, and relations in scene graphs. We address this limitation by introducing a dependency-aware, hierarchically constrained discrete diffusion model for scene graph generation. Our approach decouples structure and semantics across the forward and reverse processes, enabling the model to capture conditional dependencies. At inference time, we perform training-free conditioning to sample text-aligned scene graphs. We evaluate our method on standard SG benchmarks and demonstrate improvements over both continuous and discrete graph generation baselines across graph and layout metrics. When fed to downstream image generation, our approach yields improved compositional alignment compared to text-to-image models, particularly in multi-object scenarios.

[CV-227] LCGNav: Local Candidate-Aware Geometric Enhancement for General Topological Planning in Vision-Language Navigation

【速读】:该论文旨在解决在线拓扑规划(Online Topological Planning)在连续环境中的视觉语言导航(Vision-Language Navigation in Continuous Environments, VLN-CE)任务中面临的两个核心问题:一是冗余的局部深度信息导致计算效率低下,二是随着拓扑图规模增长,模型对当前前沿候选节点的关注度被削弱。解决方案的关键在于提出一种模块化的局部几何增强框架 LCGNav,其通过显式将候选深度视图转换为3D点云并基于智能体可达范围进行物理截断,实现更紧凑的局部几何建模;同时引入保持维度的局部融合策略与瞬态状态退化机制,仅对当前相关的虚拟节点(ghost nodes)施加几何增强,而不改变原始规划器接口,从而提升导航精度且具备良好的跨架构兼容性。

链接: https://arxiv.org/abs/2605.09053
作者: Jiankun Peng,Jianyuan Guo,Yiguang Yang,Yue Liu,Jiashuang Yan,Ying Xu
机构: The Aerospace Information Research Institute, Chinese Academy of Sciences (中国科学院空天信息研究院); University of Chinese Academy of Sciences (中国科学院大学); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Online topological planning has become an effective paradigm for Vision-Language Navigation in Continuous Environments (VLN-CE), but existing methods still suffer from two limitations: redundant local depth information and weakened focus on current frontier candidates as the topological graph grows. To address this, we propose LCGNav, a modular local geometric enhancement framework for topological VLN. LCGNav explicitly converts candidate depth views into 3D point clouds and applies physical truncation based on the agent’s reachable range, enabling more compact local geometric modeling. It further introduces a dimension-preserving local fusion strategy with transient state degradation, so that geometric enhancement is applied only to the currently relevant ghost nodes without changing the original planner interface. Experiments on R2R-CE and RxR-CE show that LCGNav serves as an effective cross-architecture enhancement module, consistently improving multiple key metrics of representative online topological baselines with low additional training cost. When integrated with ETP-R1, LCGNav achieves the best performance among the compared online topological methods on the val-unseen splits of the R2R-CE and RxR-CE benchmarks. The code is available at this https URL.

[CV-228] Automated Robotic Moisture Monitoring in Agricultural Fields

【速读】:该论文旨在解决大规模种植园中土壤湿度监测效率低下的问题(即人工巡检成本高、耗时长且难以实现精准管理)。解决方案的关键在于构建一个基于机器人平台与现场土壤湿度传感器协同工作的自动化监测系统:首先将农田划分为多个网格并部署传感器,当某区域土壤干燥时触发机器人响应;随后利用无人机或固定摄像头获取的航拍图像,通过Dijkstra最短路径算法规划机器人行进路线,并借助图像处理算法计算目标区域的综合水分含量,从而实现高效、低成本的智能灌溉决策支持。

链接: https://arxiv.org/abs/2605.09050
作者: Senthil Palanisamy,Akila I.S
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 2018 International Seminar on Intelligent Technology and Its Applications (ISITIA)

点击查看摘要

Abstract:Monitoring moisture level of land in a large-scale plantation is tedious. The main objective of this project is to use a robotic kit in collaboration with the on-field moisture sensor circuits, thereby creating an efficient and economical moisture monitoring system. A large agriculture field is divided into smaller grids. Each grid is placed with a moisture sensor. Whenever a sensor reports the soil to be dry, the robot goes to the concerned field for inspection. The path to the concerned field is found by applying Dijkstra’s shortest path algorithm on the aerial image of the field. Then the total moisture content of the field is calculated by the robot using suitable image processing algorithms and reported accordingly. For developing and testing this work, a small study field was set up above which a camera was mounted at an appropriate height to capture its aerial view. Thus a prototype for an automated system of monitoring agricultural fields’ moisture has been developed through this work.
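
The planning step described above is a textbook application of Dijkstra's algorithm on a grid. The sketch below runs it on a small 4-connected occupancy grid standing in for the aerial-image map; the grid contents and unit step costs are illustrative assumptions.

```python
# A minimal Dijkstra sketch on a 4-connected occupancy grid (0 = free, 1 = obstacle),
# from the robot's cell to the cell whose sensor reported dry soil. Illustrative only.
import heapq

def dijkstra(grid, start, goal):
    """Return the shortest path from start to goal as a list of (row, col) cells."""
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            break
        if d > dist.get((r, c), float("inf")):
            continue
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                nd = d + 1
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(heap, (nd, (nr, nc)))
    path, node = [], goal
    while node != start:
        path.append(node)
        node = prev[node]
    return [start] + path[::-1]

field = [[0, 0, 0, 1, 0],
         [1, 1, 0, 1, 0],
         [0, 0, 0, 0, 0],
         [0, 1, 1, 1, 0],
         [0, 0, 0, 0, 0]]
print(dijkstra(field, start=(0, 0), goal=(4, 4)))
```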

[CV-229] SeasonScapes: Learning Large-scale Re-lightable 3D Landscapes with Seasonal Variation from Sparse Webcams

【速读】:该论文旨在解决季节性环境变化下三维场景重建与补全的难题,特别是在存在遮挡和数据缺失的情况下如何高质量地重建连续时间序列的3D景观。其核心解决方案是提出SeasonScapes框架及对应的SeasonScapes数据集,通过将来自32个观测点、覆盖50 km × 60 km区域的85,000余张网络摄像头图像按时间戳投影到3D网格上,构建反映自然外观随时间演变的季节性3D地形;为处理图像中的遮挡和缺失信息,创新性地采用条件扩散模型(conditional diffusion models)在网格表面直接进行图像引导的图像修复(image-guided inpainting),从而生成完整且语义一致的三维网格,并进一步利用基于物理的渲染器实现光照重演(relighting)。

链接: https://arxiv.org/abs/2605.09039
作者: Timo Kleger,Qi Ma,Deheng Zhang,Luc Van Gool,Danda Pani Paudel
机构: INSAIT, Sofia University “St. Kliment Ohridski; ETH Zurich
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce the SeasonScapes framework and the SeasonScapes dataset: Swiss Sparse-view Mountain Scenes with Seasonal Changes, which covers over 50 km x 60 km and is composed of more than 85,000 webcam images captured from 32 different locations across 13 timestamps throughout a full year. By projecting these timestamp-specific images onto a 3D mesh, we construct seasonal 3D landscapes that reflect natural appearance changes over time. To address occlusions and missing data, we leverage conditional diffusion models for image-guided inpainting directly on the mesh. The resulting completed meshes can be further relighted using a standard physically-based renderer.

[CV-230] When Style Similarity Scores Fail: Diagnosing Raw CSD Cosine in Artist-Style Evaluation

【速读】:该论文旨在解决当前广泛使用的对比风格描述符(Contrastive Style Descriptor, CSD)输出空间中的原始余弦相似度被误认为是绝对、校准的风格保真度评分(style-fidelity score)的问题。研究表明,这种原始余弦值在特定艺术家语料库中无法可靠地区分“相同”与“不同”风格,即其判别性能存在系统性偏差。解决方案的关键在于引入一种称为“判别差距”(discrimination gap)的内部诊断指标,该指标无需原型或阈值即可检测CSD余弦是否具备绝对意义上的区分能力;若诊断失败,则通过CSLS(Centered Symmetric Least Squares)读出机制对冻结骨干网络的特征进行线性变换作为最小修正,从而显著提升风格验证的AUC性能(从0.883提升至0.905),并验证了该方法在多个视觉Transformer骨干网络(如CLIP-ViT-L/14、SigLIP-large和DINOv2-Large)上的普适有效性,表明问题源于共享架构局限而非CSD特有缺陷。

链接: https://arxiv.org/abs/2605.09030
作者: Jörg Frochte
机构: Bochum University of Applied Sciences (波鸿应用科学大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 24 pages, 7 figures, 19 tables

点击查看摘要

Abstract:Raw cosine in the 768-dimensional output space of the Contrastive Style Descriptor (CSD) is now widely read as an absolute, calibrated style-fidelity score for text-to-image and style-imitation evaluation. We introduce the discrimination gap, a corpus-internal, prototype-free and threshold-free diagnostic that tests whether contrastive style cosines admit an absolute same-versus-different interpretation on a candidate artist corpus. On a 1799-artwork, 91-artist public-domain corpus, raw CSD cosine yields negative point-estimate gaps for 23/91 artists at the pairwise level (2/91 robust under bootstrap) and for 15/91 in the aggregated-pool scoring regime style-fidelity evaluations typically use. CSLS readout on the frozen backbone reduces the aggregated negative-gap count to 4/91; combined with positional-embedding interpolation to 336 pixels it raises unsupervised pair-verification AUC from 0.883 to 0.905 across 25 artist-disjoint splits. We refer to this diagnostic-driven readout protocol on the frozen backbone (CSLS as default, pos-interp 336 as the stronger optional setting) as CSD+, not a new encoder. A cross-backbone check on CLIP-ViT-L/14, SigLIP-large and DINOv2-Large reproduces the same shared-tradition failure pattern, providing evidence that the residual reflects a shared limitation of the four backbones we tested rather than a CSD-specific artefact. Practical implication: before reporting CSD cosine as an absolute style-fidelity score, run the diagnostic on the candidate corpus; CSLS is the minimal correction when it fails.
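
A pairwise, prototype-free version of the discrimination gap can be written in a few lines: for each artist, the mean within-artist cosine minus the mean cosine to all other artists' works, with a negative value flagging an artist for whom raw cosine cannot be read absolutely. The synthetic embeddings below are illustrative, and the published diagnostic may differ in detail.

```python
# A minimal, pairwise discrimination-gap sketch on synthetic style embeddings.
import numpy as np

rng = np.random.default_rng(0)
n_artists, per_artist, dim = 6, 30, 64
centers = rng.normal(size=(n_artists, dim))
emb = np.concatenate([c + 0.8 * rng.normal(size=(per_artist, dim)) for c in centers])
labels = np.repeat(np.arange(n_artists), per_artist)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

cos = emb @ emb.T
np.fill_diagonal(cos, np.nan)          # ignore self-similarity

for a in range(n_artists):
    same = np.nanmean(cos[labels == a][:, labels == a])   # within-artist pairs
    diff = np.nanmean(cos[labels == a][:, labels != a])   # cross-artist pairs
    print(f"artist {a}: gap = {same - diff:+.3f}")
```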

[CV-231] MedFL-Stress: A Systematic Robustness Evaluation of Federated Brain Tumor Segmentation under Cross-Hospital MRI Appearance Shift

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在医疗影像分割任务中评估指标单一、掩盖个体医院性能差异的问题,即当前评价协议仅报告全局平均性能,可能隐藏某些医院模型持续失效的安全风险。其解决方案的关键在于提出MedFL-Stress这一受控压力测试框架,通过模拟多中心MRI扫描设备差异(如伽马对比度变化、尺度偏移及噪声加模糊)对四个模拟医院客户端进行系统性扰动,并以最差医院的Dice系数和跨医院性能差异作为核心评价指标,而非辅助观察。实验表明,相较于FedAvg、FedProx等基线方法,FedBN在显著缩小医院间性能差距(从0.0850降至0.0503)的同时,仅小幅牺牲全局平均Dice(下降0.005),且最弱医院性能提升3.5 Dice点,验证了面向鲁棒性的评估协议对可靠部署联邦医疗影像模型的重要性。

链接: https://arxiv.org/abs/2605.09025
作者: Kiran Naseer,Naveed Anwer Butt
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Federated learning enables hospitals to collaboratively train segmentation models without sharing patient data. However, current evaluation protocols report only average performance across clients, masking failures at individual sites. In clinical deployment, a model that fails consistently at one hospital is a real safety risk that a good mean score can hide entirely. We introduce MedFL-Stress, a controlled stress-testing framework that exposes exactly this failure mode. Using 2D axial slices from BraTS 2020 distributed across four simulated hospital clients, we apply graded MRI appearance shifts (gamma contrast, scale-shift, and noise-plus-blur) reflecting scanner and acquisition variability in real multi-site deployments. Three federated baselines are evaluated: FedAvg, FedProx, and FedBN. Worst-hospital Dice and inter-hospital disparity are treated as primary metrics, not supplementary observations. FedAvg achieves the highest global mean Dice (0.8159) but conceals a 0.0850 gap between its best and worst-performing hospital. FedBN closes that gap by 41% (0.0850 to 0.0503) while sacrificing less than half a Dice point in mean accuracy (0.8159 to 0.8109), and the weakest hospital gains 3.5 Dice points outright (0.7309 to 0.7656). These findings demonstrate that robustness-oriented evaluation protocols are essential for reliable federated medical imaging deployment.
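
Two of the moving parts above are simple to sketch: FedBN-style aggregation, which averages everything except batch-norm parameters so each hospital keeps its own normalization statistics, and the worst-hospital/disparity metrics. The parameter naming convention and the toy Dice numbers below are illustrative assumptions.

```python
# A minimal sketch of FedBN-style aggregation and robustness-oriented metrics.
import numpy as np

def fedbn_aggregate(client_models, is_local=lambda name: "bn" in name):
    """client_models: list of {param_name: np.ndarray}. Returns updated copies."""
    shared = [k for k in client_models[0] if not is_local(k)]
    avg = {k: np.mean([m[k] for m in client_models], axis=0) for k in shared}
    return [{**m, **avg} for m in client_models]   # local BN params are kept

rng = np.random.default_rng(0)
clients = [{"conv.weight": rng.normal(size=(3, 3)),
            "bn.running_mean": rng.normal(size=3)} for _ in range(4)]
clients = fedbn_aggregate(clients)
print("shared conv equal across clients:",
      all(np.allclose(c["conv.weight"], clients[0]["conv.weight"]) for c in clients))

hospital_dice = {"A": 0.842, "B": 0.810, "C": 0.766, "D": 0.825}   # hypothetical scores
worst = min(hospital_dice.values())
disparity = max(hospital_dice.values()) - worst
print(f"worst-hospital Dice: {worst:.3f}, disparity: {disparity:.3f}")
```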

[CV-232] Relightable Gaussian Splatting for Virtual Production Using Image-Based Illumination

【速读】:该论文旨在解决虚拟制作(Virtual Production, VP)中因LED墙提供高分辨率图像光照而带来的渲染灵活性受限问题,以及传统逆向渲染方法依赖低分辨率环境贴图和远场光照假设所导致的精度不足与编辑复杂性。其核心解决方案是提出一种基于高斯点绘(Gaussian Splatting)的VP专用三维重建与再打光框架,关键在于利用已知背景图像作为条件来约束再打光过程,从而避免对环境贴图的依赖,并将合成任务简化为背景图像编辑。该方法通过构建一个包含不同背景内容和光照条件的真实VP场景数据集,实现对3D场景的固定外观与可变光照成分的解耦;其中可变光照部分通过参数化每个几何基元的UV坐标、强度值和分辨率调节因子,结合mipmap技术直接在图像空间采样背景纹理,隐式捕捉反射与折射效应,无需物理基础渲染即可实现高效可控的再打光,支持输出深度、光照强度、光照颜色及未打光渲染等多维变量。

链接: https://arxiv.org/abs/2605.09024
作者: Adrian Azzarelli,Nantheera Anantrasirichai,James Pollock,David R. Bull
机构: University of Bristol, UK (布里斯托大学, 英国); Lux Aeterna, Bristol, UK (光明之境, 布里斯托, 英国)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Virtual production (VP) uses LED walls to provide both background imagery and image-based lighting. While this enables on-set compositing, it couples lighting to background and scene appearance, limiting flexibility for downstream editing. In addition, inverse rendering conventionally relies on physically-based rendering to estimate 3D geometry and lighting, using environment maps. However, these maps are typically low-resolution and assume far-field lighting. In VP, with near-field and high-resolution image-based lighting, this can lead to inaccuracies and introduce complexities when editing. Addressing this, we propose a VP-specific framework for 3D reconstruction and relighting using Gaussian Splatting. This uses the known background imagery to condition the relighting process. This avoids relying on environment maps and reduces compositing to a background-image editing task. To realize our framework, we introduce a process (and associated dataset) that captures real VP scenes under varying background content and illumination conditions. This data is used to decompose a 3D scene into fixed appearance and variable lighting components. The variable lighting process simulates light transport by parameterizing each primitive with a UV coordinate, intensity value and resolution modifier. Using mipmaps, these directly sample the background texture in image space - implicitly capturing reflections and refractions without physically-based rendering. Combined with the fixed appearance component, this allows us to render relit scenes using a Gaussian Splatting rasterizer. Compared to baselines, our approach achieves higher-quality 3D reconstruction and controllable relighting. The method is efficient (3 GB RAM, 5 GB VRAM, 2 hours training, ~35 FPS) and supports rendering useful arbitrary output variables including depth, lighting intensity, lighting color, and unlit renders.

[CV-233] The Direct Integration Theorem: A Rigorous Framework for Consistent Discrete Solutions of the Inverse Radon Problem

【速读】:该论文旨在解决计算机断层成像(Computed Tomography, CT)中从连续域到离散域转换时的经典难题,即传统方法依赖频域插值和必需的斜坡滤波(ramp-filtering)所引入的零频奇异性、频谱失真及离散化误差。解决方案的关键在于提出了一种新的直接积分定理(Direct Integration Theorem, DIT),它作为经典中心切片定理(Central Slice Theorem, CST)的一个非平凡推论,实现了连续与离散域之间的数学一致性映射,从而无需频域插值和传统斜坡滤波步骤。基于DIT构建的逆Radon问题离散解框架可实现准精确重建,误差仅由采样参数和网格几何决定,并且能保持重建图像的方差特性,显著优于传统的滤波反投影(Filtered Back Projection, FBP)算法,在峰值信噪比(PSNR)、结构相似性(SSIM)和重投影保真度等指标上均表现出优越性能。

链接: https://arxiv.org/abs/2605.09020
作者: Mikhail G. Mozerov
机构: Institute for Information Transmission Problems, Russian Academy of Sciences (俄罗斯科学院信息传输问题研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE TPAMI. Code and data available at this https URL

点击查看摘要

Abstract:This paper presents a novel Direct Integration Theorem (DIT), derived as a non-trivial corollary of the classical Central Slice Theorem (CST). The DIT provides a mathematically consistent transition from the continuous to the discrete domain - a fundamental challenge in computed tomography - thereby eliminating the need for frequency-domain interpolation without resorting to conventional ramp-filtering. The proposed approach circumvents two principal limitations inherent in traditional methods: (i) the zero-frequency singularity and spectral distortions introduced by the mandatory ramp-filtering step, and (ii) discretization inaccuracies associated with frequency-domain interpolation. Based on the DIT, we develop a rigorous framework for consistent discrete solutions of the inverse Radon problem. Mathematical modeling demonstrates that this approach achieves quasi-exact reconstruction, with errors constrained solely by sampling parameters and grid geometry. Furthermore, while Filtered Back Projection (FBP) inherently distorts the variance of the reconstructed image, the DIT-based algorithm preserves it. Comparative simulations confirm that the proposed method eliminates common artifacts, such as intensity cupping, and consistently outperforms FBP in terms of PSNR, SSIM, and reprojection fidelity, faithfully restoring the original image’s statistical characteristics.

[CV-234] FlashClear: Ultra-Fast Image Content Removal via Efficient Step Distillation and Feature Caching

【速读】:该论文旨在解决基于扩散模型的物体移除方法在推理过程中计算开销大、效率低的问题。现有方法在所有时间步上对全部token进行无差别去噪,忽略了物体移除通常仅涉及图像中较小前景区域的事实,导致冗余计算和延迟。解决方案的关键在于提出一种**区域感知对抗蒸馏(Region-aware Adversarial Distillation, RAD)**机制,结合一个潜在判别器(latent discriminator),从而训练出一个高效少步数的模型FlashClear;同时设计了一种无需训练的加速策略FPAC(Foreground-Prioritized Asymmetric Attention and Caching),专门适配少步扩散模型,在不牺牲性能的前提下显著提升推理速度。实验表明,FlashClear在OBER基准上相比ObjectClear和OmniPaint分别实现最高8.26倍和122倍的速度提升,且保持高视觉保真度。

链接: https://arxiv.org/abs/2605.09003
作者: Yixin Tang,Jiawei Guo,Junxian Li,Zhiteng Li,Jixin Zhao,Bingya Zhang,Chenbo Wang,Yulun Zhang,Shangchen Zhou
机构: Shanghai Jiao Tong University (上海交通大学); Nanyang Technological University (南洋理工大学); Honor Device Co., Ltd (荣耀设备有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, diffusion-based object removal models have achieved impressive results in eliminating objects and their associated visual effects. However, they indiscriminately denoise all tokens across all timesteps, ignoring that removal usually involves small foreground regions. This strategy introduces substantial computational overhead and prolonged inference times. To overcome this computational burden, we propose a latent discriminator to implement Region-aware Adversarial Distillation (RAD), yielding a highly efficient few-step model named FlashClear. Furthermore, tailored to few-step diffusion models, we propose FPAC (Foreground-Prioritized Asymmetric Attention and Caching), a training-free acceleration strategy. Extensive experiments demonstrate that our framework provides massive acceleration while maintaining or exceeding the performance of our base model, ObjectClear. Notably, on the OBER benchmark, our FlashClear achieves up to 8.26 \times and 122 \times speedup over ObjectClear and OmniPaint, respectively, while maintaining high visual quality and fidelity.

[CV-235] CT-IDP: Segmentation-Derived Quantitative Phenotypes for Interpretable Abdominal CT Disease Classification

【速读】:该论文旨在解决医学影像中缺乏标准化、可量化且具有生物学意义的腹部CT表型(CT Image-Derived Phenotypes, CT-IDP)的问题,以支持疾病风险预测与机制研究。其解决方案的关键在于构建一个基于多器官分割的定量表型框架,利用TotalSegmentator自动提取超过900个器官及腔室层面的描述符(涵盖形态学、衰减度和上下文/负担特征),并通过稀疏疾病特异性逻辑回归结合弹性网正则化方法进行建模,在MERLIN训练集上训练并冻结参数后,在Duke-Abdomen和AMOS两个独立外部数据集上验证性能,相较于DINOv3视觉Transformer基线模型展现出更高的AUC和平均精度(AP)。

链接: https://arxiv.org/abs/2605.09002
作者: Lavsen Dahal,Joseph Y. Lo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this retrospective multi-institutional study, a quantitative phenotyping framework, CT-IDP (CT Image-Derived Phenotypes), was developed on the MERLIN abdominal CT benchmark (training, validation, and test sets: 15,175, 5,018, and 5,082 studies, respectively) and externally evaluated on two independent datasets: Duke-Abdomen (2,000) and AMOS (1,107). Multi-organ segmentations were generated with TotalSegmentator and used to derive over 900 organ and compartment-level descriptors spanning morphometry, attenuation, and contextual/burden findings. Sparse disease-specific logistic regression with elastic-net regularization was trained on MERLIN and externally validated under a frozen specification. Performance was compared against a DINOv3-based vision-transformer baseline using AUC and average precision (AP), supported by phenotype-stratified audits and coefficient-level inspection. Macro-AUC for CT-IDP versus the baseline was 0.897 versus 0.880 on MERLIN, 0.877 versus 0.857 on the Duke-Abdomen dataset, and 0.780 versus 0.756 on AMOS.
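A minimal sketch of the modeling recipe named in the abstract: sparse logistic regression with elastic-net regularization, evaluated with AUC and average precision. The features below are random placeholders standing in for the >900 CT-derived phenotype descriptors, not the study's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 900))                  # placeholder phenotype descriptors
w = np.zeros(900); w[:20] = rng.normal(size=20)   # only a few informative features
y = (X @ w + rng.normal(scale=2.0, size=1000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(penalty="elasticnet", solver="saga",
                         l1_ratio=0.5, C=1.0, max_iter=2000).fit(X_tr, y_tr)

p = clf.predict_proba(X_te)[:, 1]
print("AUC:", round(roc_auc_score(y_te, p), 3),
      "AP:", round(average_precision_score(y_te, p), 3))
print("non-zero coefficients:", int(np.sum(clf.coef_ != 0)))   # sparsity from the L1 part
```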

[CV-236] LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在高分辨率图像输入时视觉编码效率低下的问题,其核心挑战在于全局编码产生大量token序列,且后续ViT压缩阶段仍需承担完整的二次注意力计算开销。解决方案的关键在于两个层面的创新:一是采用基于切片的编码策略(slice-based encoding),通过局部视图保留细节信息,相比全局注意力更有利于细粒度感知;二是引入ViT内部早期压缩机制(intra-ViT early compression),在浅层ViT层中减少token数量,显著降低视觉编码浮点运算量(FLOPs)而不损失下游任务性能。结合这两项改进,作者提出了LLaVA-UHD v4,实现了视觉编码效率提升55.8%的同时保持甚至超越基线性能,为高效高分辨率MLLM设计提供了可行路径。

链接: https://arxiv.org/abs/2605.08985
作者: Kechen Fang,Yihua Qin,Chongyi Wang,Wenshuo Ma,Tianyu Yu,Yuan Yao
机构: Tsinghua University (清华大学); ModelBest
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual encoding constitutes a major computational bottleneck in Multimodal Large Language Models (MLLMs), especially for high-resolution image inputs. The prevailing practice typically adopts global encoding followed by post-ViT compression. Global encoding produces massive token sequences, while post-ViT compression incurs the full quadratic attention cost of the ViT before any token reduction takes place. In this work, we revisit this convention along two dimensions: the encoding strategy and visual token compression. First, controlled experiments show that slice-based encoding outperforms global encoding across benchmarks, suggesting that preserving local details through sliced views can be more beneficial than applying global attention for fine-grained perception. Second, we introduce intra-ViT early compression, which reduces tokens in shallow ViT layers and substantially lowers visual-encoding FLOPs while preserving downstream performance. By integrating intra-ViT compression into the slice-based encoding framework, we present LLaVA-UHD v4, an efficient and compute-controllable visual encoding scheme tailored for high-resolution inputs. Across a diverse set of benchmarks covering document understanding, OCR, and general VQA, LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% while matching or even surpassing baseline performance. These results suggest that visual-encoding efficiency can be substantially improved without sacrificing downstream performance, providing a practical design direction for efficient high-resolution MLLMs. All model weights and code will be publicly released to support further research.
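A minimal PyTorch sketch of the general idea of intra-ViT early compression: merge visual tokens after a shallow transformer layer (here by 2x average pooling over the token sequence) so that later layers attend over fewer tokens. This illustrates the concept only; the module, pooling scheme, and dimensions are not those of LLaVA-UHD v4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyViTWithEarlyCompression(nn.Module):
    """Toy encoder: reduce token count after a shallow layer to cut attention cost."""
    def __init__(self, dim=256, depth=6, heads=4, compress_after=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            for _ in range(depth))
        self.compress_after = compress_after

    def forward(self, tokens):                      # tokens: (B, N, dim)
        for i, layer in enumerate(self.layers):
            tokens = layer(tokens)
            if i + 1 == self.compress_after:        # early, shallow-layer token reduction
                tokens = F.avg_pool1d(tokens.transpose(1, 2), kernel_size=2).transpose(1, 2)
        return tokens

slice_tokens = torch.randn(1, 1024, 256)            # e.g. tokens from one sliced image view
print(TinyViTWithEarlyCompression()(slice_tokens).shape)   # torch.Size([1, 512, 256])
```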

[CV-237] Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在动态场景中视频理解时易产生幻觉的问题,其根源在于缺乏对时空一致性的有效监控(spatio-temporal monitoring),即无法持续追踪对象的身份、状态及相互关系。为精准诊断此问题,作者提出STEMO-Bench基准,通过人类验证的对象中心事实和分解式子问题设计,区分真正的时序理解与偶然正确的答案。解决方案的关键在于提出STEMO-Track框架,该框架采用基于对象的结构化轨迹构建与推理机制,通过分块状态提取与时间聚合策略,显著降低幻觉回答并提升时空推理的一致性。

链接: https://arxiv.org/abs/2605.08974
作者: Tri Cao,Khoi Le,Thong Nguyen,Cong-Duy Nguyen,Quynh Vo,Anh Tuan Luu,Chunyan Miao,See-Kiong Ng,Shuicheng Yan,Bryan Hooi
机构: National University of Singapore (新加坡国立大学); VinUniversity (越南Vin大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Code: this https URL

点击查看摘要

Abstract:While multimodal large language models (MLLMs) have advanced video understanding, they remain highly prone to hallucinations in dynamic scenes. We argue this stems from a failure in spatio-temporal monitoring, the ability to persistently track object identities, states, and relations over time. Existing benchmarks obscure this deficit by relying on single final-answer evaluations for queries that can often be resolved via local visual cues or statistical priors. To rigorously diagnose this, we introduce STEMO-Bench (Spatio-TEmporal MOnitoring), a benchmark of human-verified object-centric facts that evaluates intermediate reasoning by decomposing queries into sub-questions, distinguishing genuine temporal understanding from coincidental correctness. To address failure modes exposed by STEMO, we propose STEMO-Track, a novel object-centric framework that explicitly constructs and reasons over structured object trajectories via chunk-wise state extraction and temporal aggregation. Extensive experiments demonstrate that our object-centric framework significantly reduces hallucinated answers and improves spatio-temporal reasoning consistency over state-of-the-art MLLMs.

[CV-238] Extrusion Segmentation Strategy to improve CAD Reconstruction from Point Cloud

【速读】:该论文旨在解决从无序点云(point cloud)中自动重建结构化计算机辅助设计(Computer-Aided Design, CAD)模型的问题,其核心应用场景包括逆向工程和质量控制。解决方案的关键在于构建一个端到端的深度学习模型,能够直接从点云生成CAD模型,并引入一种分割方法将CAD模型分解为独立的拉伸体(extrusion)片段,从而提升数据多样性,增强模型的泛化能力和鲁棒性。

链接: https://arxiv.org/abs/2605.08971
作者: Said Harb,Mehdi Maboudi,Markus Gerke
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Conference: ISPRS Toronto 2026

点击查看摘要

Abstract:Computer-Aided Design is ubiquitous in today's world, as almost every manufactured object begins as a digital model across industries. At the same time, advances in 3D sensing have made point clouds a dominant form of raw 3D data. Recovering the CAD model of a physical object from its point cloud scan has two major applications: reverse engineering, where physical or hand-crafted prototypes need to be reconstructed automatically as editable digital models, and quality control, where recovering the CAD description of a manufactured object helps quantify and understand deviations introduced during the production process. Thus, converting unordered point clouds into structured CAD models is increasingly important for modern applications. Deep learning has enabled major progress in computer vision for both 2D and 3D data, and new datasets facilitate data-driven CAD reconstruction. Building on this foundation, we develop an end-to-end model that reconstructs CAD models from point clouds and introduce a segmentation approach that decomposes them into individual extrusions. These partial shapes increase data diversity, improving the generalization and robustness of deep learning models. Our strategy thereby provides a simple yet effective way to increase reconstruction performance of deep learning models.

[CV-239] Can MLLM s Reason About Visual Persuasion? Evaluating the Efficacy and Faithfulness of Reasoning

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉说服力预测任务中难以生成可靠推理过程的问题。现有方法通过提示模型先推理再预测,但实证发现此类策略并不稳定提升性能,甚至可能降低预测准确性,表明直接生成的推理(rationale)作为说服力判断依据不可靠。解决方案的关键在于:利用多样化的教师生成推理进行监督微调(supervised fine-tuning),从而显著提升模型对图像说服力的预测能力;同时提出一个包含三个维度的忠实性评估框架——推理到决策的一致性(rationale-to-decision consistency)、推理到图像的依附性(rationale-to-image groundedness)以及推理到决策的敏感性(rationale-to-decision sensitivity),并验证仅依赖预测性能无法保证推理质量,其中推理到决策的敏感性最符合人类对合理推理的偏好,进而推动基于忠实性的训练目标与可扩展的推理监督机制的发展。

链接: https://arxiv.org/abs/2605.08965
作者: Naeun Lee,Hyunjong Kim,Sunghwan Choi,Injin Kong,Yohan Jo
机构: Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite strong performance of Multimodal Large Language Models (MLLMs) on multimodal tasks, predicting whether and why an image is persuasive remains challenging. We first show that prompting MLLMs to reason before prediction does not consistently help, and can even reduce persuasiveness prediction performance, suggesting that naively generated rationales are unreliable signals for this task. Yet, no established methodology exists for training MLLMs to reason about visual persuasion or evaluating whether their rationales faithfully support their decisions. To address this gap, we show empirically and theoretically that diverse teacher-generated rationales, when used for supervised fine-tuning, improve visual persuasiveness prediction. We further introduce a three-dimensional faithfulness evaluation framework covering rationale-to-decision consistency, rationale-to-image groundedness, and rationale-to-decision sensitivity. Applying this framework shows that prediction performance alone does not guarantee faithful rationales, while rationale-to-decision sensitivity is most aligned with human rationale preferences. These findings motivate faithfulness-aware training objectives and scalable rationale supervision for visual persuasiveness evaluation. Our code and dataset will be made publicly available.

[CV-240] FugSeg: Fast Uncertainty-aware Ground Segmentation for 3D Point Cloud

【速读】:该论文旨在解决LiDAR(光探测与测距)环境感知系统中地面分割(ground segmentation)的两大核心挑战:反射噪声(reflection noise)和孤立地面点(isolated ground)问题。现有方法在复杂地形或非结构化环境中仍存在分割精度不足、鲁棒性差的问题。解决方案的关键在于提出FugSeg,一种快速且具有不确定性感知能力的地面分割方法:首先采用极坐标网格地图(polar grid map)提升对不同LiDAR类型的一般适应性;其次设计了基于段内与跨段的地面标签策略,能够识别直接可见及被遮挡或孤立的地面单元;进一步引入自适应坡度(adaptive slope)机制,融合测量不确定性以增强复杂地形下的可靠性;最后通过细粒度地面高程估计实现点级分割,并显式处理噪声地面单元,从而在多个公开数据集上实现了优于当前最优无学习方法的性能(F1、准确率、mIoU均最高),同时保持高达135 Hz至487 Hz的实时运行速度(单CPU线程),适用于资源受限系统。

链接: https://arxiv.org/abs/2605.08952
作者: Yu Li,Volker Schwieger
机构: Institute of Engineering Geodesy, University of Stuttgart (斯图加特大学工程测量研究所); Daimler Truck AG (戴姆勒卡车公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in IEEE Transactions on Intelligent Transportation Systems

点击查看摘要

Abstract:In LiDAR-based environment perception systems, ground segmentation is a key preprocessing step supporting various applications such as mapping and navigation. Although extensively studied, problems such as reflection noise and isolated ground remain challenging. To address these issues, we propose FugSeg, a fast uncertainty-aware ground segmentation method. A polar grid map is adopted as the point cloud representation to ensure generalizability across LiDAR types. Building on that, we develop a within- and cross-segment ground labeling strategy that identifies not only directly visible ground cells but also those that are isolated or occluded. During this process, an adaptive slope is introduced, which incorporates measurement uncertainties to enhance its reliability under complex terrain. Finally, to achieve point-level ground segmentation, a fine-grained ground elevation estimation method is introduced. Throughout the complete workflow, reflection noise is explicitly handled via the proposed noisy ground cells. We conduct comprehensive evaluations on four public datasets covering both structured and unstructured environments. Results show that FugSeg outperforms state-of-the-art non-learning methods, achieving the highest F1, accuracy, and mIoU across all datasets, while maintaining the fastest runtime (135 Hz and 487 Hz for 64- and 32-layer LiDARs) using a single CPU thread, making it suitable for resource-limited systems. The code will be available at this https URL.
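A minimal sketch of the polar grid map representation named in the abstract: LiDAR points are binned by range and azimuth, and each cell keeps its lowest height as a crude ground-height candidate. This only illustrates the representation; the labeling strategy, adaptive slope, and uncertainty handling of FugSeg are not reproduced.

```python
import numpy as np

def polar_grid_map(points, n_rings=32, n_sectors=180, max_range=80.0):
    """points: (N, 3) array of x, y, z. Returns per-cell minimum z (NaN if empty)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.hypot(x, y)
    theta = np.arctan2(y, x)                                   # azimuth in [-pi, pi)
    valid = r < max_range
    ring = np.clip((r[valid] / max_range * n_rings).astype(int), 0, n_rings - 1)
    sector = np.clip(((theta[valid] + np.pi) / (2 * np.pi) * n_sectors).astype(int),
                     0, n_sectors - 1)
    grid = np.full((n_rings, n_sectors), np.nan)
    for rg, sc, height in zip(ring, sector, z[valid]):
        if np.isnan(grid[rg, sc]) or height < grid[rg, sc]:
            grid[rg, sc] = height                              # keep lowest point per cell
    return grid

pts = np.random.uniform(low=[-50, -50, -2], high=[50, 50, 3], size=(10000, 3))
print(polar_grid_map(pts).shape)   # (32, 180)
```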

[CV-241] PIDNet: Progressive Implicit Decouple Network for Multimodal Action Quality Assessment

【速读】:该论文旨在解决多模态动作质量评估(Multimodal Action Quality Assessment, AQA)中因模态异质性与质量线索时序演化特性导致的特征混淆、跨模态冗余保留及阶段特异性证据弱化问题。解决方案的关键在于提出一种渐进式隐式解耦与融合网络(Progressive Implicit Decoupling and Fusion Network, PIDNet),其核心创新包括:1)设计iMambaWave模块,通过Bi-Mamba分支和小波变换分支分别捕获长程时序依赖与局部扰动细节,实现RGB、光流与音频特征在共享潜空间中的解耦表示;2)引入门控聚合机制自适应融合时域与频域信息;3)构建三阶段渐进融合网络,利用Group3M块中的模态互补注意力机制抑制冗余并提取跨模态证据,结合多尺度卷积增强特征表达。该方法有效分离了模态特异性信息、跨模态互补线索与全局质量语义,显著提升了评估精度与鲁棒性。

链接: https://arxiv.org/abs/2605.08945
作者: Qiqi Li,Pengfei Wang,Nenggan Zheng
机构: Qiushi Academy for Advanced Studies (QAAS), Zhejiang University (浙江大学); College of Computer Science and Technology, Zhejiang University (浙江大学); School of Software Technology, Zhejiang University (浙江大学); State Key Lab of Brain-Machine Intelligence (脑机智能国家重点实验室); Collaborative Innovation Center for Artificial Intelligence by MOE and Zhejiang Provincial Government (ZJU) (教育部与浙江省政府人工智能协同创新中心); Zhejiang Lab (浙江省实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 6 figures, 11 tables

点击查看摘要

Abstract:Action quality assessment (AQA) aims to automatically quantify the execution quality of human actions in videos and is valuable for applications such as competitive sports judging. In multimodal AQA, quality evidence from different modalities is heterogeneous, and quality cues evolve progressively over time. Existing methods often rely on coarse fusion or unified temporal modeling, which may blur modality-specific cues, preserve cross-modal redundancy, and weaken stage-specific quality evidence. To address these issues, we propose a progressive implicit decoupling and fusion network (PIDNet) that progressively integrates modality-specific information, cross-modal complementary cues, and global quality semantics for accurate assessment. Specifically, we design an iMambaWave module that maps RGB, optical flow, and audio features into a shared latent space and disentangles them with a Bi-Mamba branch and a wavelet-transform branch to capture long-range temporal dependencies and local perturbation details, respectively. A gated aggregation mechanism adaptively fuses temporal and frequency-domain information. We further build a three-stage progressive fusion network using Group3M blocks, where modality complementary attention retrieves cross-modal evidence while suppressing redundancy, and multi-scale convolutions enrich feature representations. Experiments on the Rhythmic Gymnastics and Fis-V datasets show that PIDNet achieves highly competitive score correlation with favorable error control compared with existing unimodal and multimodal methods. Ablation studies verify the effectiveness of each component. Moreover, iMambaWave consistently improves visual representation and temporal modeling across multiple backbones, showing good generalization and plug-and-play capability.

[CV-242] Few-Click-Driven Interactive 3D Segmentation with Semantic Embedding

【速读】:该论文旨在解决现有3D交互式分割方法在效率与泛化能力上的局限性问题:一方面,多数现有方法采用串行处理方式,每次迭代仅预测一个对象且输出二值掩码,导致标注效率低下;另一方面,部分基于2D基础模型的方法依赖相机对齐以弥合2D-3D鸿沟,限制了其在复杂场景中的适用性。解决方案的关键在于提出一种直接作用于稀疏随机下采样3D点云的新型交互式分割框架,其核心由基于点Transformer的编码器和分层掩码解码器构成,通过可学习语义嵌入条件控制多级裁剪与合并操作,能够在单次前向传播中同时处理多个对象点击,并联合推理所有点击查询,建模实例间关系,从而同步优化空间掩码与语义预测。该设计显著提升了分割精度(mIoU提升超20%)与跨数据集泛化性能(单点击设置下提升8–10%),适用于机器人操作、导航及快速3D语义标注等实时应用场景。

链接: https://arxiv.org/abs/2605.08925
作者: Xueyang Kang,Zijian Yu,Kourosh Khoshelham,Liangliang Nan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 9 figures, 6 tables

点击查看摘要

Abstract:Interactive segmentation allows efficient label generation by leveraging user-provided clicks to progressively refine predictions, which is critical when fully supervised labels are costly or generalization to unseen classes is needed. Existing 3D interactive methods are limited: most operate sequentially, predicting only one object per iteration with binary masks, while several recent approaches depend on 2D foundation models and camera alignment to bridge the 2D-3D gap. To address these limitations, we propose a novel interactive segmentation framework that operates directly on sparse, randomly downsampled 3D points and processes multiple object clicks in a single forward pass. Our framework consists of a point Transformer-based encoder and a hierarchical mask decoder, which integrates multi-level crop-and-merge operations conditioned on learnable semantic embeddings. Unlike prior interactive approaches that require repeated model updates after each manually corrective click, our method jointly reasons over all click queries, modeling inter-instance relationships and refining both spatial masks and semantic predictions through spatial and semantic embeddings. Extensive experiments demonstrate that our model improves the mIoU metric by over 20 percent compared to strong baselines and achieves 8-10 percent gains under cross-dataset evaluation for a one-click per instance setting, often requiring only a single click per object. Our approach provides a generalizable and efficient solution for interactive 3D instance segmentation, particularly suitable for real-time applications such as robotic manipulation, navigation, and rapid 3D semantic annotation.

[CV-243] Unified Modeling of Lane and Lane Topology for Driving Scene Reasoning

【速读】:该论文旨在解决自动驾驶车辆在复杂驾驶场景中对车道拓扑关系感知不足的问题,现有方法多依赖于“检测后推理”范式,即先检测车道线再推导其拓扑关系,存在信息割裂与误差传播问题。解决方案的关键在于提出一种统一建模方法 UniTopo,将车道及其拓扑关系(如前驱车道、后继车道及连接关系)以连通车道的形式进行联合表示,并通过共享感知流水线同时输出车道位置与拓扑结构,从而实现从原始图像特征中直接感知车道拓扑的新范式。

链接: https://arxiv.org/abs/2605.08911
作者: Han Li,Yulu Gao,Si Liu,Yuhang Wang,Bo Liu,Beipeng Mu
机构: Beihang University (北京航空航天大学); Zhongguancun Academy (中关村学院); Hangzhou International Innovation Institute (杭州国际创新研究院); Meituan (美团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE TCSVT

点击查看摘要

Abstract:Autonomous vehicles need to perceive not only physical elements in the driving scene, such as lane lines and traffic lights, but also logical elements like lane centerlines and their topology. Existing lane topology reasoning methods typically follow a reasoning-by-detection paradigm, where lane topological relationships are primarily derived from lane detection results. In this paper, we propose an innovative method called Unified Modeling of Lane and Lane Topology (UniTopo), which represents the topological relationships between lanes as connected lanes, encompassing predecessor lanes, successor lanes, and their interconnections. This unified representation of lanes and lane topology allows us to simultaneously obtain both the positions and topological information of lanes within a shared perception pipeline, establishing a new paradigm for directly perceiving lane topology from original image features. We validate our method on the driving scene reasoning benchmark OpenLane-V2, which consists of two subsets, built based on Argoverse2 and nuScenes, respectively. Our method achieves TOP_ll of 30.1% and 31.8% on the two subsets, significantly surpassing the existing state-of-the-art method T^2SG by 6.0% and 8.6%.

[CV-244] DAPE: Dynamic Non-uniform Alignment and Progressive Detail Enhancement Techniques for Improving the Performance of Efficient Visual Language Models

【速读】:该论文旨在解决预训练视觉-语言模型中跨模态对齐粗粒度、信息密度分布不均的问题,即现有方法通常采用统一的对齐策略,忽略了文本标签与图像块之间在信息密度和语义范围上的动态差异,导致细粒度语义细节丢失且计算开销大。解决方案的关键在于提出一种动态跨模态对齐框架,其核心包括两个模块:一是设计了一个可学习的匹配函数,实现动态自适应的跨模态匹配机制,根据文本标签的信息密度灵活分配不同数量和大小的图像块;二是引入一个连续细节引入模块,逐步将高分辨率视觉特征增强融入对齐过程,从而在提升下游任务精度的同时降低计算复杂度。

链接: https://arxiv.org/abs/2605.08902
作者: Mengyuan Tian,Qiyan Zhao,Yanan Wang,Da-Han Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted in ICIC 2026 Oral

点击查看摘要

Abstract:In recent years, pre-trained visual-linguistic models have demonstrated tremendous potential, becoming a crucial foundational framework for numerous downstream tasks. However, the information density between text and images is not uniformly distributed. Existing methods often overlook the inherent and dynamic differences in information density and semantic scope between text tags and image blocks. These common uniform alignment strategies result in coarse-grained cross-modal interactions and loss of fine semantic details. Moreover, pursuing finer alignment typically requires substantial computational overhead, limiting practical model deployment. To address this challenge, this paper proposes a novel framework for dynamic cross-modal alignment with continuous detail introduction. First, we design a dynamically adaptive cross-modal matching mechanism that uses a learnable matching function to dynamically assign varying numbers and sizes of image tags to text tags of the same size but different information density, enabling more precise attention interaction. Second, we develop a continuous detail introduction module to progressively incorporate high-resolution visual feature enhancement into the alignment process. Extensive experiments across multiple benchmarks demonstrate significant improvements in the accuracy of various downstream tasks while reducing computational overhead.

[CV-245] Semantic Alignment in Hyperbolic Space for Open-Vocabulary Semantic Segmentation CVPR2026

【速读】:该论文旨在解决开放词汇语义分割(open-vocabulary semantic segmentation)中图像级视觉-语言模型(如CLIP)向像素级预测迁移时面临的挑战,即嵌入空间中层次结构与语义对齐之间的不匹配问题。现有方法虽利用双曲几何建模层次关系,但仅关注跨层级的对齐,忽略了同层级内嵌入的语义错位。其解决方案的关键在于提出HyRo框架,该框架在庞加莱球模型(Poincaré ball model)中解耦层次对齐与语义对齐:通过调整双曲半径实现层级间的对齐,同时利用正交变换进行角度对齐以优化同层级内的语义关系,且理论上保持双曲半径不变,从而显著提升分割性能。

链接: https://arxiv.org/abs/2605.08874
作者: Hoang M. Truong,Hai Nguyen-Truong,Dang Huynh
机构: Fulbright University Vietnam (富布赖特大学越南)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the PVUW Workshop at CVPR 2026. Project page: this https URL

点击查看摘要

Abstract:Open-vocabulary semantic segmentation requires adapting image-level vision-language models such as CLIP to dense pixel-level prediction, which is challenging due to the mismatch between hierarchical structure and semantic alignment in the embedding space. While recent works leverage hyperbolic geometry to model hierarchical relationships, they align embeddings across hierarchical levels but overlook semantic misalignment among embeddings within the same level. In this work, we propose HyRo, a hyperbolic fine-tuning framework that decouples hierarchical and semantic alignment in the Poincaré ball model. HyRo aligns hierarchical levels by adjusting the hyperbolic radius and refines semantic relationships through angular alignment using an orthogonal transformation that theoretically preserves the hyperbolic radius. Experiments on standard open-vocabulary semantic segmentation benchmarks demonstrate that HyRo achieves state-of-the-art performance over prior methods.
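A small numerical check of the geometric fact the abstract relies on: an orthogonal transformation preserves the Euclidean norm of a point inside the Poincaré ball, and hence its hyperbolic radius d(0, x) = 2 artanh(||x||) (curvature -1). This is only a sketch of that property; HyRo's training objective is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))    # random orthogonal matrix

x = rng.normal(size=d)
x = 0.7 * x / np.linalg.norm(x)                 # a point inside the unit (Poincare) ball

def hyperbolic_radius(v):
    """Hyperbolic distance from the origin in the Poincare ball model."""
    return 2.0 * np.arctanh(np.linalg.norm(v))

print(hyperbolic_radius(x))        # radius before the angular transform
print(hyperbolic_radius(Q @ x))    # identical up to floating-point error
```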

[CV-246] ProDG: Prototypes for Data-Free Generative Post-Hoc Explainability

【速读】:该论文旨在解决原型驱动的后验可解释性方法(post-hoc interpretability methods)对原始数据的依赖问题,即现有方法虽能实现无需重新训练神经网络的解释,但仍需访问测试集或验证集以搜索和提取视觉原型(visual prototypes),这在隐私敏感场景中不可行。解决方案的关键在于提出ProDG(Generative Prototypes for Data-Free Post-Hoc Explainability),其利用生成模型从冻结的预训练模型权重中直接合成高质量、纯净的原型,从而完全摆脱对外部数据的依赖,实现了真正的“无数据”可解释人工智能(Data-Free XAI)。

链接: https://arxiv.org/abs/2605.08858
作者: Piotr Borycki,Magdalena Trędowicz,Jacek Tabor,Łukasz Struski,Przemysław Spurek
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Ante-hoc interpretability methods based on prototypes provide highly accurate explanations by utilizing the intuitive “this looks like that” reasoning paradigm. On the other hand, post-hoc models can explain predictions for a single image without relying on an underlying dataset or requiring costly neural network retraining. Recent approaches successfully solve the retraining problem for prototype-based networks. However, they still face a fundamental limitation: they require access to a subset of data (e.g., a test or validation set) to search for and extract the visual prototypes. In this paper, we address this issue and introduce ProDG: Generative Prototypes for Data-Free Post-Hoc Explainability, a novel framework that leverages generative models to synthesize pure, high-fidelity prototypes directly from the frozen model’s weights, completely eliminating the dependency on any external data. By establishing this new frontier in Data-Free XAI, ProDG unlocks robust visual interpretability for privacy-sensitive domains, where original data is strictly restricted or fundamentally inaccessible. Project page: this https URL

[CV-247] Restoration-Aligned Generative Flow Models for Blind Motion Deblurring

【速读】:该论文旨在解决生成式流模型(Generative Flow Models)在图像恢复任务(如运动去模糊)中因训练目标与恢复任务不匹配而导致的保真度严重下降问题。其解决方案的关键在于重新定义流轨迹:将原本以噪声为终点的流路径改为以模糊图像为终点,使潜在向量场与模糊图和清晰图之间的残差误差对齐,从而使得标准流匹配损失自然转化为残差损失形式。这一重构使得预训练流模型可通过LoRA(Low-Rank Adaptation)优化以适配恢复任务的目标,并进一步引入双专家采样策略——一个保真度专家提供高保真初始化(如PSNR 33.69 dB),DeblurFlow则在此基础上提升感知质量仅轻微降低保真度至33.05 dB,显著优于直接叠加生成模型导致的保真度骤降(PSNR 27.60 dB)。此外,作者提出r-space这一专为残差解码设计的潜在空间,相较标准VAE潜空间可减少高达9倍的编码器-解码器计算开销,兼顾性能与效率。

链接: https://arxiv.org/abs/2605.08854
作者: Insoo Kim,Jinwoo Shin
机构: NAVER Cloud(NAVER云); KAIST AI(韩国科学技术院人工智能); Samsung Electronics(三星电子)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generative flow models offer powerful priors learned from large-scale natural images, but directly adapting them to restoration tasks such as motion deblurring causes severe fidelity degradation, as their training objective is inherently misaligned with restoration. We present DeblurFlow, a framework that resolves this misalignment by reformulating the flow trajectory itself: we replace the noise endpoint with the blur observation, which makes the underlying vector field coincide with the residual error between blur and clean images. Under this formulation, the standard flow matching loss naturally takes the form of a residual loss, allowing pretrained flow models to be optimized under restoration-aligned objectives via LoRA adaptation. This formulation further enables a dual-expert sampling strategy: a fidelity expert provides a high-fidelity initialization, e.g., PSNR 33.69 dB, and DeblurFlow enhances perceptual quality with only a marginal fidelity reduction to 33.05 dB, whereas directly applying a generative model on top of a fidelity expert decreases PSNR to 27.60 dB. To make this practical, we further introduce r-space, a latent space tailored for residual decoding rather than image reconstruction, which reduces encoder-decoder cost by up to 9 \times over standard VAE latents. Extensive experiments on GoPro, HIDE, RealBlur, and RWBI demonstrate that DeblurFlow achieves strong restoration fidelity and perceptual realism, while remaining computationally practical.
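A minimal sketch of the trajectory reformulation described above: a linear path whose endpoints are the sharp image and the blur observation, so the flow-matching regression target is the residual between the endpoints. The network, shapes, and optimizer below are placeholders and not DeblurFlow's architecture or r-space decoder.

```python
import torch
import torch.nn as nn

# Tiny stand-in network that predicts a velocity field from (x_t, t).
net = nn.Sequential(nn.Conv2d(3 + 1, 32, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 3, 3, padding=1))

def flow_matching_step(sharp, blur, optimizer):
    b = sharp.shape[0]
    t = torch.rand(b, 1, 1, 1)                            # random time in [0, 1]
    x_t = (1 - t) * sharp + t * blur                      # point on the sharp-blur path
    target_v = blur - sharp                               # residual between endpoints
    t_map = t.expand(-1, 1, *sharp.shape[2:])             # time as an extra channel
    pred_v = net(torch.cat([x_t, t_map], dim=1))
    loss = ((pred_v - target_v) ** 2).mean()              # standard flow-matching loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

opt = torch.optim.Adam(net.parameters(), lr=1e-4)
sharp = torch.rand(2, 3, 64, 64); blur = torch.rand(2, 3, 64, 64)
print(flow_matching_step(sharp, blur, opt))
```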

[CV-248] Geometrically Constrained Stenosis Editing in Coronary Angiography via Entropic Optimal Transport ICML2026

【速读】:该论文旨在解决冠状动脉造影(Coronary Angiography, CAG)狭窄自动检测中高质量影像数据稀缺的问题,从而限制了模型的临床转化。为提升训练数据的质量、多样性及分布覆盖范围,进而增强检测精度与泛化能力,作者提出使用合成狭窄数据进行数据增强。解决方案的关键在于引入OT-Bridge Editor,该方法将局部编辑重构为受约束的熵最优传输(Entropic Optimal Transport, OT)问题,并利用几何信息引导生成路径,实现更强的几何控制能力,从而在像素级精度和结构保持方面显著优于传统基于扩散模型的软引导方法。

链接: https://arxiv.org/abs/2605.08851
作者: Jialin Li,Zhuo Zhang,Yue Cao,Guipeng Lan,Jiabao Wen,Shuai Xiao,Jiachen Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ICML 2026

点击查看摘要

Abstract:The scarcity of high-quality imaging data for coronary angiography (CAG) stenosis limits the clinical translation of automated stenosis detection. Synthetic stenosis data provides a practical avenue to augment training sets, improving data quality, diversity, and distributional coverage, and enhancing detection precision and generalization. However, diffusion-based editing commonly relies on soft guidance in a noise-initialized reverse process, offering limited pixel-level precision and structure preservation. We propose the OT-Bridge Editor, which reframes localized editing as a constrained entropic optimal transport (OT) problem and leverages geometric information to steer the generation path, enabling stronger geometric control. Extensive experiments show that our synthesized angiograms consistently improve downstream stenosis detection, yielding substantial relative gains of 27.8% on the public ARCADE benchmark and 23.0% on our multi-center dataset, supported by consistent qualitative results.
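A minimal Sinkhorn sketch for entropic optimal transport between two toy point sets, the iterative building block behind the entropic OT formulation mentioned above. The cost matrix, marginals, and regularization strength are placeholders; the paper's geometric guidance and editing constraints are not reproduced.

```python
import numpy as np

def sinkhorn(cost, a, b, epsilon=0.05, n_iters=500):
    """Return the entropic-OT transport plan between histograms a and b."""
    K = np.exp(-cost / epsilon)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2)); y = rng.normal(loc=1.0, size=(60, 2))
cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
cost = cost / cost.max()                        # normalize to avoid exp underflow
a = np.full(50, 1 / 50); b = np.full(60, 1 / 60)
plan = sinkhorn(cost, a, b)
print(plan.shape, round(plan.sum(), 4))         # (50, 60), total mass ~1.0
```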

[CV-249] Illusion-Aware Visual Preprocessing and Anti-Illusion Prompting for Classic Illusion Understanding in Vision-Language Models CVPR2026

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在面对视觉错觉时表现出系统性偏差的问题,即模型倾向于依赖记忆中的固定知识而非准确感知图像的实际差异。其核心解决方案在于无需微调(training-free)的三重策略:(1) 基于类型特异性的图像预处理方法(如边缘提取、颜色隔离、形态学处理和参考线叠加),削弱诱发错觉的上下文信息;(2) 设计抗错觉提示工程(anti-illusion prompt engineering),引导VLM进行定性视觉比较;(3) 多投票集成机制提升鲁棒性。该方法在CVPR 2026 DataCV Challenge Task 1中取得90.48%的准确率(使用Claude-opus-4-6模型与5票多数表决),验证了纯视觉操作与提示设计的有效性。

链接: https://arxiv.org/abs/2605.08841
作者: Junli Zha,Jiahui Wang,Xinkai Lu,Jinbo Wang
机构: SF Technology Co., Ltd.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026 Workshop on 5th DataCV Challenge

点击查看摘要

Abstract:Vision-Language Models (VLMs) exhibit systematic bias toward visual illusions, recalling memorized facts rather than perceiving actual visual differences. This paper presents a training-free framework for the 5th DataCV Challenge Task 1 at CVPR 2026, addressing this perception-versus-memory conflict through three complementary strategies:(1) illusion-aware image preprocessing that weakens illusion-inducing context via type-specific transformations (edge extraction, color isolation, morphological processing, and reference-line overlay), (2) anti-illusion prompt engineering guiding VLMs toward qualitative visual comparison, and (3) multi-vote ensemble that further improves robustness. Our method achieves 90.48% accuracy on the official 630-image test set using Claude (claude-opus-4-6) with 5-vote majority ensemble, and 98.41% on a human-verified subset. The approach requires no finetuning, relying solely on visual manipulation and prompt design. Our solution secured 2nd place in the challenge, only 0.47% behind the 1st-place solution. Code is available at this https URL.
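A minimal sketch of two ingredients named above: a type-specific preprocessing step (edge extraction, which strips the shading and context that induce the illusion) and a simple majority-vote ensemble over repeated answers. The `query_vlm` function is a hypothetical stand-in for a real model client, and the prompt text is illustrative only.

```python
from collections import Counter
import numpy as np
import cv2

def edge_view(image_bgr):
    """Keep only contours so size/length comparisons rely on geometry, not context."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Canny(gray, threshold1=50, threshold2=150)

def majority_vote(answers):
    return Counter(answers).most_common(1)[0][0]

def query_vlm(image, prompt):
    # Hypothetical placeholder; replace with an actual VLM API call.
    return "the two segments are equal in length"

image = np.zeros((256, 256, 3), dtype=np.uint8)   # placeholder illusion image
edges = edge_view(image)
votes = [query_vlm(edges, "Compare the two segments by their drawn extent only.")
         for _ in range(5)]
print(majority_vote(votes))
```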

[CV-250] Cross-Sample Relational Fusion: Unifying Domain Generalization and Class-Incremental Learning

【速读】:该论文旨在解决类增量学习(Class-Incremental Learning, CIL)中同时面临的灾难性遗忘(catastrophic forgetting)与域偏移(domain shift)问题。在真实场景如自动驾驶中,模型需在不同环境(如城市道路转为乡村或高速)下持续学习新类别,而传统CIL方法难以兼顾知识保留与跨域泛化能力。解决方案的关键在于提出一种统一框架CrOss-sample Relational Fusion (CORF),其核心包括:1)通过空间贡献图(spatial contribution maps)选择性地精炼训练样本,突出语义信息区域以增强泛化性;2)引入预测置信度自适应加权样本,促进域无关表示的学习;3)设计级联蒸馏机制,捕捉多层级特征中的跨样本关系,实现多粒度的知识迁移,从而有效缓解遗忘并提升跨域适应能力。

链接: https://arxiv.org/abs/2605.08839
作者: Zhen-Hao Xie,Yan Wang,Hao Sun,Han-Jia Ye,De-Chuan Zhan,Da-Wei Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE Transactions on Multimedia (TMM 2026). Code is available at this https URL

点击查看摘要

Abstract:Class-Incremental Learning (CIL) requires a learning system to learn new classes while retaining previously learned knowledge. However, in real-world scenarios such as autonomous driving, a system trained on urban roads in sunny weather may later need to operate in rural or highway environments with different traffic patterns and weather conditions. This requires the model not only to overcome catastrophic forgetting, but also to effectively handle domain shifts. In this paper, we propose CrOss-sample Relational Fusion (CORF), a unified framework to address domain shift and catastrophic forgetting simultaneously. To enhance generalizability, we perform selective refinement of training samples by leveraging spatial contribution maps to highlight semantically informative regions. Furthermore, we incorporate predictive confidence to adaptively weigh samples, thereby facilitating the learning of domain-agnostic representations. To alleviate forgetting, we propose a cascaded distillation framework that captures cross-sample relational dependencies across multiple feature hierarchies, enabling multi-grained knowledge transfer from previous tasks. CORF can be seamlessly integrated into existing CIL algorithms to enhance their generalizability, achieving competitive performance across various benchmark datasets. Code is available at this https URL .

[CV-251] VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving

【速读】:该论文旨在解决端到端自动驾驶中视觉-语言-动作(Vision-Language-Action, VLA)模型存在的耦合权衡问题:即共享主干网络虽能保留多模态交互能力,但易导致语言推理与轨迹预测任务混淆;而解耦的推理-动作流水线虽可减少任务冲突,却削弱了语义与运动之间的关联性。解决方案的关键在于提出VECTOR-DRIVE框架,其核心创新是通过共享自注意力机制维持所有token的语义耦合,同时基于token语义特征将不同类型的token路由至专用专家模块(Vision-Language Expert和Trajectory Expert),实现语义理解与运动规划在统一Transformer架构内的紧密耦合,同时分离任务特定的前馈网络计算路径,并引入基于流匹配的规划器对噪声动作token进行精细化重构,从而在保持多模态语义先验的同时提升轨迹生成精度。

链接: https://arxiv.org/abs/2605.08830
作者: Rui Zhao,Jianlin Yu,Zhenhai Gao,Jiaqiao Liu,Fei Gao
机构: Jilin University (吉林大学); College of Automotive Engineering (汽车工程学院); National Key Laboratory of Automotive Chassis Integration and Bionics (汽车底盘集成与仿生国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:End-to-end autonomous driving requires models to understand traffic scenes, infer driving intent, and generate executable motion plans. Recent vision-language-action (VLA) models inherit semantic priors from large-scale vision-language pretraining, yet still face a coupling trade-off: fully shared backbones preserve multimodal interaction but may entangle language reasoning and trajectory prediction, whereas decoupled reasoning-action pipelines reduce task conflict but weaken semantic-motion coupling. We propose VECTOR-DRIVE, a tightly coupled VLA framework built on Qwen2.5-VL-3B. VECTOR-DRIVE keeps all tokens coupled through shared self-attention and routes feed-forward computation according to token semantics. Vision and language tokens are processed by a Vision-Language Expert to preserve semantic priors, while target-point, ego-state, and noisy action tokens are routed to a Trajectory Expert for motion-specific computation. On the action-token pathway, a flow-matching planner refines noisy action tokens into future waypoints and speed profiles. This design couples semantic reasoning and motion planning within a single multimodal Transformer while separating task-specific FFN computation. On Bench2Drive, VECTOR-DRIVE achieves 88.91 Driving Score and outperforms representative end-to-end and VLA-based baselines. Qualitative results and ablations further validate the benefits of shared attention, semantic-aware expert routing, progressive training, and flow-based action decoding.
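A minimal sketch of semantic-aware expert routing as described above: all tokens share one self-attention pass, and each token's feed-forward computation is then dispatched to a vision-language expert or a trajectory expert according to a per-token type id. The dimensions and expert MLPs are placeholders, not VECTOR-DRIVE's modules, and a real implementation would dispatch tokens rather than compute both experts and select.

```python
import torch
import torch.nn as nn

class RoutedBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vl_expert = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.traj_expert = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, token_type):          # token_type: (B, N), 0 = vision/language, 1 = action
        h, _ = self.attn(x, x, x)              # shared attention keeps all tokens coupled
        x = x + h
        routed = torch.where(token_type.unsqueeze(-1) == 0,
                             self.vl_expert(x), self.traj_expert(x))
        return x + routed

x = torch.randn(2, 10, 256)
types = torch.cat([torch.zeros(2, 7, dtype=torch.long),
                   torch.ones(2, 3, dtype=torch.long)], dim=1)
print(RoutedBlock()(x, types).shape)           # torch.Size([2, 10, 256])
```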

[CV-252] Rethinking Event-Based Object Detection through Representation-Level Temporal Aggregation and Model-Level Hypergraph Reasoning

【速读】:该论文旨在解决事件相机(Event Camera)在目标检测任务中面临的两大核心问题:一是现有事件表示方法通常通过冗余结构间接编码时间信息,导致效率低下;二是检测模型难以显式地将碎片化的事件响应聚合为连贯的高阶物体特征。解决方案的关键在于提出一个统一的事件目标检测框架Ev-DTAD,其核心创新包括两个部分:首先设计了分层时间聚合(Hierarchical Temporal Aggregation, HTA),构建一种紧凑的三通道伪RGB表示,显式嵌入窗口内与窗口间事件的时间信息;其次引入频域感知超图时间融合(Frequency-aware Hypergraph Temporal Fusion, FHTF),通过时间演化建模和高阶关系推理来增强稀疏事件下的多尺度特征表达能力。实验表明,该方法在多个基准数据集上实现了精度与效率的协同提升,验证了紧凑时间表示与超图时序推理之间的互补性。

链接: https://arxiv.org/abs/2605.08825
作者: Meisen Wang,Hao Deng,Wei Bao,Ma Yuanxiao,Chengjie Wang,Zhiqiang Tian,Shaoyi Du,Siqi Li
机构: Xi’an Jiaotong University (西安交通大学); Tsinghua University (清华大学); China Mobile System Integration (中国移动系统集成公司); Inner Mongolia Agricultural University (内蒙古农业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Event cameras provide microsecond-level temporal resolution, low latency, and high dynamic range, offering potential for perception under fast motion and challenging illumination conditions. However, existing Event-based Object Detection (EOD) methods face limitations at both the representation and model levels: prior event representations usually encode temporal information indirectly through redundant structures, while detection models struggle to explicitly aggregate fragmented event responses into coherent high-order object features. To address these limitations, we present Event Dual Temporal-Relational Aggregation Detector (Ev-DTAD), a unified EOD framework that integrates representation-level temporal encoding with model-level temporal-hypergraph reasoning. Specifically, we introduce Hierarchical Temporal Aggregation (HTA), a compact three-channel pseudo-RGB representation that explicitly embeds temporal information across intra- and inter-window events. To further enhance detection under sparse and fragmented event responses, we propose Frequency-aware Hypergraph Temporal Fusion (FHTF), which refines multi-scale event features through temporal evolution modeling and high-order relational reasoning. Extensive experiments on Gen1 (+0.8 mAP and 1.7 \times faster), 1Mpx/Gen4 (+0.5 mAP and 1.6 \times faster), and eTraM (+3.0 mAP and 2.0 \times faster) demonstrate that Ev-DTAD achieves a competitive accuracy-efficiency trade-off, validating the complementarity between compact temporal representation and temporal-hypergraph feature reasoning.
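A minimal sketch of a three-channel pseudo-image built from raw events, in the spirit of the temporal-aggregation representation described above: per-pixel positive-event counts, negative-event counts, and the latest normalized timestamp. The channel definitions here are illustrative, not the paper's exact HTA layout.

```python
import numpy as np

def events_to_pseudo_rgb(x, y, t, p, height=240, width=304):
    """x, y: pixel coords; t: timestamps; p: polarity in {0, 1}."""
    img = np.zeros((3, height, width), dtype=np.float32)
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)
    np.add.at(img[0], (y[p == 1], x[p == 1]), 1.0)          # positive-event counts
    np.add.at(img[1], (y[p == 0], x[p == 0]), 1.0)          # negative-event counts
    np.maximum.at(img[2], (y, x), t_norm)                   # latest normalized timestamp
    for c in (0, 1):
        img[c] /= max(img[c].max(), 1.0)                    # scale counts to [0, 1]
    return img

rng = np.random.default_rng(0)
n = 20000
ev = events_to_pseudo_rgb(rng.integers(0, 304, n), rng.integers(0, 240, n),
                          np.sort(rng.uniform(0, 0.05, n)), rng.integers(0, 2, n))
print(ev.shape)   # (3, 240, 304)
```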

[CV-253] HairGPT: Strand-as-Language Autoregressive Modeling for Realistic 3D Hairstyle Synthesis SIGGRAPH2026

【速读】:该论文旨在解决3D发型数字建模中因流体性与结构性的双重特性所带来的挑战,现有生成方法多依赖连续扩散场,导致全局拓扑与局部纹理纠缠,难以体现发型的语义与结构组织。其解决方案的关键在于提出HairGPT框架,以发丝(strand)作为生成基本单元,将真实3D发型合成建模为一个双解耦的自回归序列问题:通过空间解耦实现不同头皮区域的语义分离,沿分层发丝表示进行结构解耦,从而从整体布局逐步细化至风格细节;同时引入几何编码器和区域感知语义标注,指导发丝级生成,支持组合编辑、稀有复杂发型合成及风格化域适应,使发型生成从黑箱纹理合成转变为结构化、语义可控的创作流程。

链接: https://arxiv.org/abs/2605.08824
作者: Haimin Luo,Min Ouyang,Lan Xu,Jingyi Yu
机构: ShanghaiTech University (上海科技大学); Deemos Technology Co., Ltd. (德莫斯科技有限公司)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to SIGGRAPH 2026 (Journal Track)

点击查看摘要

Abstract:Hair is a rich medium of visual and cultural expression, yet its digital modeling remains challenging due to the duality of fluidity and structure. Many existing generative approaches rely primarily on continuous diffusion fields, which entangle global topology with local texture and obscure the semantic and structural organization of hairstyles. To address this, we propose HairGPT, a strand-centric framework that treats strands as generative primitives and formulates realistic 3D hairstyle synthesis as a dual-decoupled autoregressive sequence modeling problem. Our method applies spatial decoupling across semantic scalp regions and structural decoupling along a hierarchical strand representation, progressing from global layout to fine-grained style. We further introduce a geometric tokenizer and region-aware semantic annotations to guide strand-level generation, enabling compositional editing, synthesis of rare and complex hairstyles, and adaptation to stylized domains. By aligning generative modeling with the workflow of digital grooming, HairGPT turns hair generation from opaque texture synthesis into a structured and semantically controllable authoring process, supporting robust semantic conditioning and high-fidelity results across realistic and stylized domains. Project Page: this https URL

[CV-254] FraudBench: A Multimodal Benchmark for Detecting AI-Generated Fraudulent Refund Evidence

【速读】:该论文旨在解决AI生成图像在电商退款场景中被用于伪造损坏证据所引发的欺诈问题,即“AI生成退款欺诈”(AI-generated refund fraud),其核心挑战在于现有检测方法多聚焦于通用图像真实性判断或跨生成器迁移能力评估,缺乏对与具体索赔情境相关的伪造证据进行验证的能力。解决方案的关键在于构建了一个多模态基准测试平台——FraudBench,该平台基于真实用户评论中的图像证据(涵盖电商、外卖和旅行服务场景),通过多模态大语言模型(MLLM)辅助筛选与人工标注相结合的方式识别真实损坏与完好样本,并利用六种先进图像编辑/生成模型从真实完好图像合成虚假损坏图像,从而形成结构化的训练与测试数据集;在此基础上系统评估了MLLM、专用AI图像检测器及人类参与者在相同条件下的表现,揭示了当前技术在特定索赔情境下对伪造证据识别能力的显著不足,为后续面向真实业务场景的可信视觉证据验证研究提供了基础框架与评测标准。

链接: https://arxiv.org/abs/2605.08820
作者: Xinyu Yan,Boyang Chen,Jiaming Zhang,Tiantong Wu,Hong Xi Tae,Yichen He,Tiantong Wang,Yachun Mi,Yurong Hao,Yilei Zhao,Lei Xiao,Longtao Huang,Pengjun Xie,Wei Liu,Wei Yang Bryan Lim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Artificial Intelligence (AI)-generated images have become increasingly realistic and readily adaptable to concrete real-world claims, creating new challenges for verifying visual evidence. A concrete emerging risk is AI-generated refund fraud, in which manipulated or synthetic images are used to support claims about damaged products, poor delivery conditions, or service-related defects. Existing AI-generated image detection benchmarks mainly evaluate standalone authenticity classification, cross-generator transfer, or forensic localization, leaving claim-conditioned fraudulent evidence detection underexplored. To bridge this gap, we introduce FraudBench, a multimodal benchmark for detecting AI-generated fraudulent refund evidence. FraudBench is constructed from real-world user-review evidence across e-commerce, food delivery, and travel-service scenarios. We curate real evidence images together with their associated review and product metadata, identify genuine damaged and undamaged evidence through MLLM-assisted filtering and human annotation, and synthesize fake-damaged evidence from genuine undamaged reference images using six state-of-the-art image editing and generation models. Using FraudBench, we evaluate MLLMs, specialized AI-generated image detectors, and human participants under the same settings. Experiments show that current MLLMs often recognize real-damaged evidence but fail on many fake-damaged subsets, with fake-damage detection rates (TPR) far below the 50% baseline on most generator subsets. Specialized detectors generally perform better but remain inconsistent across generators and can produce false positives on real-damaged samples, revealing a clear gap between generic AI image detection and reliable claim-conditioned refund-evidence verification.

[CV-255] From pre-training to downstream performance: Does domain-specific pre-training make sense?

【速读】:该论文旨在解决深度学习模型在医学影像领域中预训练策略与下游任务性能之间关系不明确的问题,特别是如何通过优化预训练方法提升模型的诊断准确性与可靠性。其解决方案的关键在于系统性地比较卷积神经网络(Convolutional Neural Networks, CNNs)与视觉Transformer架构,在不同预训练方式(如监督学习与自监督学习)、初始化策略及数据模态(自然图像、胸部X光、胸部CT和视网膜OCT图像)下的表现,并发现仅当预训练数据与目标模态高度匹配时,下游任务性能才显著提升;同时指出自监督学习虽在某些场景下优于监督学习,但其效果具有情境依赖性。这一发现强调了模态对齐在预训练设计中的核心作用,为构建更精准可靠的医学影像分析工具提供了关键指导。

链接: https://arxiv.org/abs/2605.08819
作者: Felix Krones
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep learning techniques have revolutionised medical imaging, improving diagnostic accuracy and enabling earlier disease detection. However, the relationship between pre-training strategies and downstream performance in medical imaging models requires further exploration. Here, we systematically compare convolutional neural networks and transformers, examining various pre-training approaches, including supervised and self-supervised learning, as well as different initialisations and data modalities. Models are evaluated on natural images, chest X-rays, chest CT and retina OCT images, considering the effects of matching pre-training data with target modalities. Our findings indicate that only pre-training on data closely matching the target modality significantly improves downstream performance. While self-supervised learning can outperform supervised methods, its effectiveness varies with context. The study underscores the importance of pre-training strategies to enhance the reliability and effectiveness of deep learning models in medical imaging. By addressing these key factors, our research aims to contribute to the development of more accurate and dependable diagnostic tools, ultimately improving patient outcomes in clinical settings.

[CV-256] Zero-Shot Chinese Character Recognition via Global-Local Dual-Branch Alignment and Hierarchical Inference

【速读】:该论文旨在解决开放世界场景下零样本中文字符识别(Zero-Shot Chinese Character Recognition)问题,其核心挑战在于字符类别庞大且未见字符频繁出现,传统基于图像描述序列(Ideographic Description Sequence, IDS)的检索方法通常将字符图像与IDS编码为单一全局向量进行匹配,这种整体对齐方式难以捕捉局部构件差异;同时,直接引入补丁-标记级别的细粒度交互会受到IDS中结构操作符噪声干扰并带来高计算成本。解决方案的关键在于提出一种全局-局部分层感知网络(Global-Local Hierarchical Perception Network, GL-HPN),在统一的跨模态对齐框架内联合学习字符图像与IDS序列的全局和局部表征:全局分支支持高效粗粒度召回,局部分支通过补丁-标记交互提升构件级判别能力;进一步设计结构过滤掩码以抑制局部相似性聚合中具有结构意义但无视觉实体的IDS操作符;最后采用从粗到精的分层推理策略,在全候选集上执行全局检索后仅对Top-K候选进行局部重排序,并通过参数无关的乘法融合归一化后验分数,显著降低大规模候选检索的推理开销,同时在低资源条件下表现优异。

链接: https://arxiv.org/abs/2605.08814
作者: Wei Cao,Hao Xu,Xiaolei Diao
机构: Jilin University (吉林大学); University College London (伦敦大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages

点击查看摘要

Abstract:Chinese character categories are extremely large, and unseen characters frequently arise in open-world scenarios, making zero-shot Chinese character recognition an important yet challenging problem. Existing IDS-based retrieval methods usually encode a character image and its ideographic description sequence into a single global vector for matching. Although efficient, such holistic alignment often under-models local component differences. Moreover, directly introducing patch-token level fine-grained interaction suffers from both the noise of structural operators in IDS and the high cost of full-candidate interaction. To address these issues, we propose a Global-Local Hierarchical Perception Network (GL-HPN), which jointly learns global and local representations of character images and IDS sequences within a unified cross-modal alignment framework. The global branch supports efficient coarse recall, while the local branch improves component-level discrimination through patch-token interaction. We further introduce a structure filtering mask to suppress structurally meaningful but visually non-entity IDS operators in local similarity aggregation. On top of this, we design a coarse-to-fine hierarchical inference strategy that performs global retrieval over the full candidate set and local reranking only on Top-K candidates, followed by parameter-free multiplicative fusion of normalized posterior scores. Experimental results show that GL-HPN achieves competitive performance across multiple zero-shot splits, performs especially well under low-resource settings, and substantially reduces the inference cost of large-scale candidate retrieval.
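A minimal sketch of the coarse-to-fine inference described above: global retrieval over all candidates, local re-scoring of only the top-K, and a parameter-free multiplicative fusion of the two normalized (softmax) scores. The embeddings and local scores are random placeholders, not features learned by GL-HPN.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n_classes, dim, K = 5000, 128, 20
global_gallery = rng.normal(size=(n_classes, dim))        # one global vector per class
query_global = rng.normal(size=dim)

global_scores = global_gallery @ query_global             # coarse recall over all classes
topk = np.argsort(-global_scores)[:K]

local_scores = rng.normal(size=K)                         # stand-in for patch-token rerank scores
fused = softmax(global_scores[topk]) * softmax(local_scores)   # multiplicative fusion
prediction = topk[int(np.argmax(fused))]
print("top-K candidates:", topk[:5], "final prediction:", int(prediction))
```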

[CV-257] Curvature-Aware Captioning: Leveraging Geodesic Attention for 3D Scene Understanding CVPR2026

【速读】:该论文旨在解决当前密集描述(dense captioning)方法在处理稀疏点云数据时面临的局限性,即难以同时保持精细的局部几何细节与建模指数级增长的全局语义层次结构,从而导致定位不准确或场景描述碎片化、浅层化的问题。其解决方案的关键在于提出一种新颖的曲率感知描述框架(Curvature-Aware Captioning),通过引入非欧几里得测地线注意力机制来化解定位与上下文一致性之间的冲突:具体而言,在Oblique流形中采用自注意力机制以确保维度同质性并建立长程依赖关系;在Lorentz流形中设计双向测地线交叉注意力机制,以建模场景实例间的分层语义关系,实现对象定位精度与场景描述连贯性的同步提升。理论分析进一步表明,Oblique流形与Lorentz双曲面之间的曲率互补性可有效缓解欧几里得-双曲冲突,通过各向同性优化保障特征稳定性的同时保留固有的层次关系。

链接: https://arxiv.org/abs/2605.08808
作者: Ziyao He,Yingjie Liu,ZhangYangRui,Mingsong Chen,Xuan Tang,Xian Wei
机构: East China Normal University (华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: CVPR2026 Highlight!

点击查看摘要

Abstract:Accurate 3D scene description is fundamental to robotic navigation and augmented reality, yet current dense captioning methods face significant limitations in processing sparse point cloud data. Existing approaches that apply Euclidean embedding spaces struggle to simultaneously preserve fine-grained local geometric details and model exponentially growing global semantic hierarchies, leading to either inaccurate localization or disjointed, shallow scene descriptions. In this work, we propose a novel Curvature-Aware Captioning framework, integrating novel non-Euclidean geodesic attention mechanisms, to resolve the localization-contextualization conflict. Specifically, self-attention within Oblique space enforces dimensional homogeneity while establishing long-range dependencies. Bidirectional geodesic cross-attention within Lorentz space models hierarchical semantic relationships across scene instances, enabling simultaneous precision in object localization and coherence in scene descriptions. Theoretical analysis confirms that the curvature complementarity between the Oblique manifold and Lorentz hyperboloid resolves the Euclidean-hyperbolic conflict, ensuring feature stability via isotropic optimization while preserving inherent hierarchical relationships. Extensive experiments on ScanRefer and Nr3D benchmarks demonstrate state-of-the-art performance, with significant gains in both localization accuracy and descriptive richness.

[CV-258] L2A: Learning to Accumulate Pose History for Accurate 3D Human Pose Estimation

【速读】:该论文旨在解决现有2D到3D人体姿态估计方法中忽视网络深度上历史姿态表示利用的问题。当前流水线依赖固定残差连接传递信息,限制了早期层特征(如细粒度空间结构和短时运动线索)的有效复用。为解决这一问题,其关键在于构建一个保持跨层表示空间一致性的框架,从而实现有效的跨层特征聚合。具体而言,论文提出一种时空并行Transformer骨干网络以避免序列处理中的交替空间-时间变换,确保表示空间一致性;在此基础上引入历史姿态累积(History Pose Accumulation, HPA)机制,自适应地聚合所有前序层特征以增强当前表示,并设计层姿态历史聚合(Layer Pose History Aggregation, LPA)模块,将层级姿态特征转化为紧凑且结构化的形式,减少冗余并提升聚合稳定性。

链接: https://arxiv.org/abs/2605.08806
作者: Zehua Wang,Changwang Mei,Huaijiang Sun,Pengqi Hu,Zhaoyang Yin
机构: Nanjing University of Science and Technology (南京理工大学); Lenovo (联想)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages

点击查看摘要

Abstract:Existing 2D-3D lifting human pose estimation methods have achieved strong performance. But the utilization of historical pose representations across network depth was overlooked. In current pipelines, information is propagated through fixed residual connections, which restricts effective reuse of early-layer features such as fine-grained spatial structures and short-term motion cues. However, naively incorporating historical features across layers is non-trivial. We further identify that maintaining a consistent representation space across layers is a prerequisite for effective cross-layer feature aggregation. To address this issue, we propose a history-aware framework that enables effective network cross-layer history feature utilization. Specifically, we adopt a spatial-temporal parallel Transformer backbone to prevent alternating spatial-temporal transformations during sequential processing, thereby maintaining a consistent representation space. Building upon this, we introduce a History Pose Accumulation (HPA) mechanism that adaptively aggregates features from all preceding layers to enhance current representations. Furthermore, we propose a Layer Pose History Aggregation (LPA) module that transforms layer pose features into a compact and structured form, reducing redundancy and enabling more stable aggregation. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on benchmarks.

[CV-259] LightAVSeg: Lightweight Audio-Visual Segmentation ICML2026

【速读】:该论文旨在解决音频-视觉分割(Audio-Visual Segmentation, AVS)中现有模型依赖密集跨模态注意力机制导致计算复杂度高、难以在资源受限设备上高效部署的问题。其核心解决方案是提出轻量级框架LightAVSeg,通过将复杂的跨模态交互解耦为语义过滤(semantic filtering)与空间定位(spatial grounding)两个模块,使交互计算成本从二次方降低至线性增长,显著提升效率;同时引入辅助对齐损失(auxiliary alignment loss),在训练阶段强制语义一致性,且不增加推理开销,从而在保持高性能的同时实现移动端高效部署。

链接: https://arxiv.org/abs/2605.08805
作者: Qing Zhong,Guodong Ding,Lingqiao Liu,Zaiwen Feng,Lin Yuanbo Wu,Angela Yao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 8 figures, 6 tables, Accepted to ICML 2026

点击查看摘要

Abstract:Audio-Visual Segmentation (AVS) targets pixel-level localization of sound-emitting objects in videos. However, existing models rely on dense cross-modal attention with quadratic computational cost, limiting their suitability for resource-efficient deployment. Most efficiency-oriented methods focus on backbone reduction and overlook the interaction module as the primary bottleneck. This paper proposes LightAVSeg, a lightweight framework that replaces heavy attention with a decoupled design for semantic filtering and spatial grounding, resulting in interaction costs that scale linearly with spatial resolution. Furthermore, we introduce an auxiliary alignment loss to enforce semantic consistency during training with zero inference overhead. Extensive experiments demonstrate that LightAVSeg achieves a new state-of-the-art among lightweight methods: with 20.5M parameters (~1/7 of AVSegFormer), it reaches 50.4 mIoU on the MS3 benchmark and enables efficient inference on a mobile processor.
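A minimal sketch of a decoupled audio-visual interaction in the spirit described above: a global audio vector first gates (filters) visual channels, then a single per-pixel dot product grounds the sound spatially, so both steps scale linearly with the number of pixels. This is a conceptual toy, not LightAVSeg's actual modules or dimensions.

```python
import torch
import torch.nn as nn

class DecoupledAVInteraction(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())   # semantic filtering
        self.proj = nn.Linear(dim, dim)                                # spatial grounding

    def forward(self, visual, audio):            # visual: (B, C, H, W), audio: (B, C)
        g = self.gate(audio)[:, :, None, None]
        filtered = visual * g                                          # channel-wise filtering
        a = self.proj(audio)[:, :, None, None]
        heatmap = (filtered * a).sum(dim=1, keepdim=True)              # per-pixel dot product
        return torch.sigmoid(heatmap)                                  # soft mask, (B, 1, H, W)

v = torch.randn(2, 64, 56, 56); a = torch.randn(2, 64)
print(DecoupledAVInteraction()(v, a).shape)      # torch.Size([2, 1, 56, 56])
```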

[CV-260] CoLVR: Enhancing Exploratory Latent Visual Reasoning via Contrastive Optimization

【速读】:该论文旨在解决当前基于潜在空间的视觉推理方法中,由于依赖硬对齐目标强制潜在表示与预定义视觉特征匹配,从而严重限制了潜在推理过程探索能力的问题。解决方案的关键在于提出一种名为CoLVR(Contrastive Optimization for Latent Visual Reasoning)的对比优化框架:首先,通过角度扰动引导的潜在对比目标学习多样且具有探索性的表征,扩展语义潜在空间并避免嵌入过度约束;其次,在强化学习(Reinforcement Learning, RL)后训练阶段引入潜在轨迹对比奖励机制,实现对潜在视觉推理过程的细粒度优化,从而促进多样化推理行为。

链接: https://arxiv.org/abs/2605.08802
作者: Ziyang Ding,Linjian Meng,Yiming Wu,Yuhan Li,Yuhao Liu,Zhen Zhao
机构: Shandong University (山东大学); Shanghai AI Laboratory (上海人工智能实验室); Nanjing University (南京大学); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Because latent visual reasoning holds potential for exploratory reasoning, recent works tend to enable MLLMs (Multimodal Large Language Models) to perform visual reasoning by propagating continuous hidden states instead of decoding intermediate steps into discrete tokens. However, existing works typically rely on hard alignment objectives to force latent representations to match predefined visual features, thereby severely limiting the exploratory capacity of the latent reasoning process. To address this problem, we propose CoLVR (Contrastive Optimization for Latent Visual Reasoning). To enable more exploratory visual reasoning, CoLVR introduces a latent contrastive training framework. Firstly, CoLVR learns diverse and exploratory representations with a latent contrastive objective guided by angle-based perturbation, which expands the semantic latent space and avoids over-constrained embeddings. Then, CoLVR employs a latent trajectory contrastive reward for RL (Reinforcement Learning) post-training to enable fine-grained optimization of the latent visual reasoning process, thus fostering diverse reasoning behaviors. Experiments demonstrate that CoLVR significantly enhances the exploratory capability of latent representations, achieving average improvements of 5.83% on VSP and 8.00% on Jigsaw, while also outperforming existing latent models on out-of-domain benchmarks, with a 3.40% gain on MMStar. The data, codes, and models are released at this https URL.

[CV-261] PPU-Bench:Real World Benchmark for Personalized Partial Unlearning in Vision Language Models

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在预训练过程中可能记忆敏感跨模态信息的问题,尤其是现有遗忘基准无法模拟真实场景中个性化、细粒度的知识删除需求。为此,作者提出了PPU-Bench——一个无需微调的现实世界基准,用于评估个性化部分遗忘(Personalized Partial Unlearning, PPU)的有效性。其关键创新在于引入了三种渐进式挑战设置(完全遗忘、选择性遗忘和个性化遗忘),并揭示了不同遗忘策略下“遗忘-保留”之间的权衡关系及模型内在事实边界模糊的问题。基于此发现,论文提出边界感知优化(Boundary-Aware Optimization, BAO),通过显式建模同一主体内的遗忘与保留边界,显著提升了MLLMs在保持非目标事实完整性的同时精准移除特定知识的能力。

链接: https://arxiv.org/abs/2605.08800
作者: Jiahui Guang,Zexun Zhan,Zhenlin Xu,Cuiyun Gao,Haiyan Wang,Jing Li,Zhaoquan Gu,Yanchun Zhang
机构: Harbin Institute of Technology, Shenzhen, China; Pengcheng Laboratory, Shenzhen, China; The Hong Kong Polytechnic University, Hong Kong, China; Sichuan University, Chengdu, China; Harbin Institute of Technology, Weihai, China; Zhejiang Normal University, Jinhua, China
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) may memorize sensitive cross-modal information during pretraining. However, existing MLLM unlearning benchmarks rely on synthetic knowledge injection or complete subject-level deletion, which fail to capture realistic, personalized deletion requests that require fine-grained factual control. In this paper, we introduce PPU-Bench, a real-world and fine-tuning-free benchmark for personalized partial unlearning in MLLMs. PPU-Bench contains 24K multimodal and unimodal samples derived from pre-existing knowledge of 500 public figures under three progressively challenging settings: Complete, Selective, and Personalized unlearning. The benchmark evaluates whether methods can remove target knowledge while preserving non-target facts, model utility, and cross-modal consistency. Extensive experiments show that Complete Unlearning often suppresses visual identity rather than factual knowledge, while Selective and Personalized Unlearning expose significant forget–retain trade-offs and challenges in intra-subject factual boundaries. Robustness analysis under cross-image and prompt-based attacks reveals distinct vulnerabilities across different unlearning settings. Motivated by these findings, we propose Boundary-Aware Optimization (BAO), which explicitly models intra-subject forget-retain boundaries. Experimental results on two representative methods demonstrate that BAO can effectively enforce intra-subject factual boundaries.

[CV-262] Lost in Volume: The CT-SpatialVQA Benchmark for Evaluating Semantic-Spatial Understanding of 3D Medical Vision-Language Models

【速读】:该论文旨在解决当前3D医学视觉语言模型(3D medical vision-language models, VLMs)在临床决策支持中缺乏对空间语义推理能力的系统评估问题,即这些模型是否真正理解并利用CT图像中的空间结构信息,还是仅依赖于语言先验和文本关联。其解决方案的关键在于构建了一个名为CT-SpatialVQA的基准测试集,包含9077个源自真实放射学报告与CT体积数据的临床相关问答对,并通过LLM辅助验证流程确保高一致性(95%人类共识率),该数据集明确要求模型具备解剖定位、侧别意识、结构比较及三维结构间关系推理等能力;同时提出标准化评估协议,对八种主流3D医学VLM进行评测,结果显示其在空间语义推理任务上平均准确率仅为34%,且常低于随机水平,揭示了现有模型在整合体积证据方面的严重不足,强调需加强空间感知机制以实现可信的临床应用。

链接: https://arxiv.org/abs/2605.08787
作者: Mashrafi Monon,Umaima Rahman,Asif Hanif,Numan Saeed,Mohammad Yaqub
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in 3D medical vision-language models have enabled joint reasoning over volumetric images and text, showing strong performance in medical visual question-answering (VQA) and report generation. Despite this progress, it remains unclear whether these models learn spatially grounded anatomy from 3D volumes or rely primarily on learned priors and language correlations. This uncertainty stems from the lack of systematic evaluation of semantic-spatial reasoning in volumetric medical VLMs for clinically reliable decision support. To address this gap, we introduce CT-SpatialVQA, a benchmark designed to evaluate semantic-spatial reasoning in 3D CT data. The benchmark comprises 9077 clinically grounded question-answer (QA) pairs derived directly from 1601 radiology reports and CT volumes, which are validated via a robust LLM-assisted pipeline with a 95% human consensus agreement rate. Our dataset requires explicit anatomical localization, laterality awareness, structural comparison, and 3D inter-structure relational reasoning. We also introduce a standardized evaluation protocol and benchmark eight 3D medical VLMs, finding severe degradation on semantic-spatial reasoning tasks, averaging 34% accuracy and often below random, highlighting the need for deeper integration of volumetric evidence for trustworthy clinical use.

[CV-263] simpleposter: a simple baseline for product poster generation CVPR2026

【速读】:该论文旨在解决产品海报生成中特有的挑战,即在保持产品外观忠实还原的同时,实现对密集多行文本布局的精确控制。传统方法依赖于扩散模型结合ControlNet和OCR编码器等辅助模块,但这类方案不仅引入了复杂的架构和计算开销,还存在文本错误和主体扩展伪影等问题。解决方案的关键在于提出一个简洁有效的基于图像修复(inpainting)的框架SimplePoster,其核心创新包括:(1)通过对基础模型进行全参数微调有效抑制主体扩展现象,优于基于ControlNet的方法;(2)采用零成本字符级位置编码实现几何感知的文本生成,无需额外的布局模块即可实现位置可控的文本渲染。实验表明,SimplePoster在主体保留率上达到98.7%,显著优于SeedEdit 3.0(55.2%)和PosterMaker(85.3%),同时提升了文本渲染准确性。

链接: https://arxiv.org/abs/2605.08784
作者: Benlei Cui,Fangao Zeng,Weitao Jiang,Yuwen Zhai,Haiwen Hong,Longtao Huang,Hui Xue,Wenxiang Shang,Pipei Huang
机构: Alibaba Group (阿里巴巴集团); Taobao Tmall Group of Alibaba (淘宝天猫集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026

点击查看摘要

Abstract:Product poster generation poses distinct challenges beyond general poster design, requiring both faithful preservation of product appearance and precise control over dense, multi-line text layouts. Prior methods typically adopt inpainting frameworks augmented with auxiliary modules such as ControlNet and OCR encoders. However, these approaches introduce architectural complexity and computational overhead while still suffering from text errors and subject extension artifacts. We present SimplePoster, a simple yet effective inpainting-based framework that achieves faithful subject preservation and accurate, position-controllable text rendering without external controllers. Our approach builds on two observations: (1) full-parameter fine-tuning of the base model effectively suppresses subject extension, outperforming ControlNet-based alternatives; and (2) a zero-cost character-level position encoding enables geometry-aware text generation without dedicated layout modules. Experiments show that SimplePoster achieves a 98.7% subject preservation rate, compared to 55.2% for SeedEdit 3.0 and 85.3% for PosterMaker, while also improving text rendering accuracy. Code, models, benchmark and a part of training data will be available at this https URL

[CV-264] Contour-Native Bridge Defect Detection and Compact Digital Archiving with Frequency-Supervised Fourier Contours

【速读】:该论文旨在解决桥梁缺陷检测中传统表示方式(如边界框和栅格掩膜)在存储、传输与复用方面效率低下的问题。其关键解决方案是提出频率监督的傅里叶轮廓检测方法(Frequency-Supervised Fourier Series Detection, FS-FSD),该方法直接回归傅里叶轮廓描述符,在统一的多边形空间协议下评估边界框、掩膜与轮廓,从而以更紧凑、可恢复且易共享的矢量形式保留缺陷边界几何信息,显著提升了几何精度与工程可用性。

链接: https://arxiv.org/abs/2605.08781
作者: Jin Liu,Wang Wang,Hongxu Pu,Zhen Cao,Yasong Wang,Hu Wang,Kunming Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 46 pages,13 figures

点击查看摘要

Abstract:AI-assisted bridge defect inspection often produces bounding boxes with crude geometry or raster masks that are costly to store, transmit, and reuse. This study investigates how detected defects can be represented as compact, recoverable contour-level vector records in image space. We propose Frequency-Supervised Fourier Series Detection (FS-FSD), which directly regresses Fourier contour descriptors and evaluates boxes, masks, and contours under a unified polygon-space protocol. On 3,767 UAV-collected bridge images with 42,346 defect instances, FS-FSD achieves higher polygon-space accuracy and better matched-TP geometric quality than representative detection, segmentation, and contour baselines. These results show that, compared with bounding boxes and raster masks, Fourier contour records preserve defect-boundary geometry in a more compact, recoverable, and shareable form for engineering review and downstream information workflows. Future work will study the modeling of multi-region, fragmented, and adjacent bridge-defect boundaries and extend the framework toward long-term bridge-defect tracking and lifecycle-oriented management.
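
下面用一小段NumPy代码演示如何把傅里叶轮廓描述符解码回图像空间多边形,以体会这类紧凑矢量记录"可恢复、可共享"的含义;系数排列约定(k = -K..K)与示例数值均为笔者假设,与论文实现未必一致:

```python
import numpy as np

def contour_from_fourier(coeffs, num_points=256):
    """Reconstruct a closed 2-D contour from complex Fourier series coefficients,
    the compact vector record that a Fourier-contour detector regresses.
    coeffs[k] multiplies exp(2j*pi*k*t) for frequencies k = -K..K (illustrative convention)."""
    K = (len(coeffs) - 1) // 2
    t = np.linspace(0.0, 1.0, num_points, endpoint=False)
    freqs = np.arange(-K, K + 1)
    z = (coeffs[None, :] * np.exp(2j * np.pi * freqs[None, :] * t[:, None])).sum(axis=1)
    return np.stack([z.real, z.imag], axis=1)   # (num_points, 2) polygon in image space

# A rough ellipse from three coefficients (k = -1, 0, 1): centre term plus two rotating phasors.
coeffs = np.array([20 + 0j, 100 + 80j, 60 + 0j])
polygon = contour_from_fourier(coeffs)
print(polygon.shape)  # (256, 2)
```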

[CV-265] Anchoring the Eigengap: Cross-Modal Spectral Stabilization for Sample-Efficient Representation Learning

【速读】:该论文旨在解决深度视觉模型在低数据场景下性能急剧下降的问题,尤其是在医学影像等标注样本稀缺的领域。研究表明,这种性能退化并非仅由过拟合引起,而是源于嵌入协方差矩阵的几何失效:有限样本噪声会破坏特征空间的谱结构,导致特征值间隔(eigengap)坍缩,从而限制可恢复的有效信号模态数量。解决方案的关键在于提出一种有限样本表示学习的谱理论,定量刻画从N个样本中可稳定估计的模态数K(N)——即只有特征值高于噪声阈值|\hat\Sigma - \Sigma|_\mathrm{op} \sim \sqrt{D}/N的模式才是可靠的,并据此构建截断马氏能量(truncated Mahalanobis energy)作为分类性能的决定因素。进一步发现,多模态学习通过引入低秩约束抑制噪声主导方向,有助于维持特征空间的谱稳定性,提升K(N),从而改善小样本下的表征质量与类别分离能力。

链接: https://arxiv.org/abs/2605.08764
作者: Nikhil J. Dhinagar,Vidhi Chhatbar,Chirag Jagad,Pavithra Senthilkumar,Sophia I. Thomopoulos,Mahir H. Khan,Sook-Lei Liew, the ENIGMA-Stroke Recovery Working Group, Paul M. Thompson
机构: Imaging Genetics Center, Mark and Mary Stevens Neuroimaging and Informatics Institute, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA; Neuroscience Graduate Program, Mark and Mary Stevens Neuroimaging and Informatics Institute, Chan Division of Occupational Science and Occupational Therapy, Biomedical Engineering, University of Southern California, Los Angeles, CA, USA; the ENIGMA-Stroke Recovery Working Group
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Deep vision models degrade sharply in low-data regimes, particularly in medical imaging where labeled samples are scarce. We show this arises not merely from overfitting but from a geometric failure: finite-sample noise corrupts the embedding covariance, collapsing the eigengap and limiting the number of recoverable signal-bearing modes. We develop a spectral theory of finite-sample representation learning that quantifies the recoverable dimension K(N), the number of eigenmodes that can be stably estimated from N samples. Using perturbation theory and concentration bounds, we show that only modes with eigenvalues above the noise floor |\hat\Sigma - \Sigma|_\mathrm{op} \sim \sqrt{D}/N are reliable, yielding a truncated Mahalanobis energy that governs classification performance. Under a power-law spectral model, this energy can be approximated by a truncated Riemann zeta function, linking eigenvalue decay to data efficiency and AUC. Within this framework, multimodal learning acts as spectral stabilization: vision-language models impose low-rank constraints that suppress noise-dominated directions and preserve the eigengap, increasing K(N) under data scarcity. Across MNIST and multi-disease neuroimaging, we show that multimodal training maintains more stable modes and improves class separation, even when unimodal models achieve comparable few-shot accuracy. These results identify spectral collapse as a fundamental bottleneck in low-data learning. We use truncated Mahalanobis energy and K(N) to diagnose encoder quality, and introduce zeta-based spectral filtering as a principled approach to improve data efficiency.
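
摘要的核心量是可稳定恢复的模态数K(N)与截断马氏能量。下面的NumPy示意在合成的类幂律谱数据上统计高于噪声阈值 \sqrt{D}/N 的特征模态数,并在前K个模态上计算马氏能量;阈值常数与数据均为演示用假设,并非论文的精确设定:

```python
import numpy as np

def recoverable_modes(X):
    """Toy estimate of K(N): count covariance eigenmodes above the finite-sample
    noise floor |Sigma_hat - Sigma|_op ~ sqrt(D)/N quoted in the abstract.
    The thresholding constant is illustrative, not the paper's exact recipe."""
    N, D = X.shape
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    order = np.argsort(eigvals)[::-1]                    # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    K = int(np.sum(eigvals > np.sqrt(D) / N))
    return K, eigvals, eigvecs

def truncated_mahalanobis_energy(x, mean, eigvals, eigvecs, K):
    """Mahalanobis energy restricted to the top-K stably estimated modes."""
    z = eigvecs[:, :K].T @ (x - mean)
    return float(np.sum(z ** 2 / eigvals[:K]))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64)) * np.linspace(3.0, 0.05, 64)   # power-law-like spectrum
K, eigvals, eigvecs = recoverable_modes(X)
energy = truncated_mahalanobis_energy(X[0], X.mean(axis=0), eigvals, eigvecs, K)
print(f"K(N) = {K}, truncated energy of first sample = {energy:.2f}")
```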

[CV-266] Simultaneous Monitoring of Shape and Surface Color via 4D Point Clouds: A Registration-free Approach

【速读】:该论文旨在解决复杂形状与空间变异性材料制造过程中,如何实现对形状(shape)和颜色(color)信息的同步无注册监测问题。传统方法依赖于点云配准或网格重建等预处理步骤,存在计算复杂且易引入误差的缺陷。解决方案的关键在于提出了一种无需注册的SMAC框架,利用拉普拉斯-贝尔特拉米算子(Laplace-Beltrami operator)的谱特性来联合捕捉几何特征与表面颜色之间的关系,并设计了结合空间感知后信号诊断流程的综合监测机制,从而在不进行任何配准或网格重建的前提下,有效检测微小形变与颜色异常,并准确定位异常源的位置。

链接: https://arxiv.org/abs/2605.08753
作者: Mariafrancesca Patalano,Giovanna Capizzi,Kamran Paynabar
机构: University of Padua (帕多瓦大学); Georgia Institute of Technology (佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: 38 pages, 11 figures

点击查看摘要

Abstract:Advanced manufacturing technologies allow for the production of intricate parts featuring high shape complexity and spatially-varying material composition. Data fusion of point clouds with chromatic attributes provides 4D point clouds, a compact and informative representation that encodes both shape and material information. In this paper, we present a registration-free framework for Simultaneous Monitoring of shApe and Color (SMAC) via 4D point clouds. The proposed framework leverages Laplace-Beltrami operator spectral properties to capture and monitor geometric features and the relationship between shape and surface color. A combined monitoring scheme is proposed to effectively detect shape deformations and color anomalies, along with a spatially-aware post-signal diagnostic procedure to determine the source of change and localize color anomalies. Importantly, neither component relies on registration or mesh reconstruction, eliminating error-prone and computationally expensive preprocessing steps. A Monte Carlo simulation study and a case study on functionally graded materials demonstrate that SMAC achieves effective detection performance, particularly for subtle defects, while providing diagnostic capabilities to identify the source and location of anomalies.

[CV-267] ReorgGS: Equivalent Distribution Reorganization for 3D Gaussian Splatting

【速读】:该论文旨在解决已收敛的3D高斯溅射(3D Gaussian Splatting, 3DGS)模型中存在的参数退化问题(parameterization degeneration),即模型虽能近似目标场景,但因高透明度"漂浮物"(high-opacity floaters)通过alpha混合抑制真实表面梯度,以及冗余重叠簇导致参数块强耦合、雅可比响应近乎共线,使得后续优化难以继续提升质量。解决方案的关键在于提出ReorgGS方法:将现有高斯集视为经验概率场,从中重新采样中心点,利用kNN估计局部各向异性协方差,初始化低透明度,并使用原始3DGS渲染器和损失函数继续优化。与仅重置透明度的方案不同,ReorgGS重构了中心、协方差及可见性结构,从而改变参数图结构本身;其分析表明分布等价并不等于优化等价,重组后的模型在保持场景支撑不变的同时改善了后续优化条件,在固定高斯数量下显著提升拟合质量、抑制顽固漂浮物并降低冗余重叠带来的渲染开销。

链接: https://arxiv.org/abs/2605.08739
作者: Luchao Wang,Kaimin Liao,Qian Ren,Hua Wang,Zhi Chen,Yaohua Tang
机构: University of Science and Technology of China (中国科学技术大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:A converged 3D Gaussian Splatting (3DGS) model may approximate the target scene while remaining poorly parameterized for further optimization. We identify this failure mode as parameterization degeneration: high-opacity floaters attenuate gradients to true surfaces through alpha compositing, and redundant overlapping clusters create strongly coupled parameter blocks with nearly collinear Jacobian responses. These effects explain why continued optimization can plateau even when the model still contains removable artifacts. We propose ReorgGS, an equivalent distribution reorganization method for converged 3DGS models. ReorgGS treats the existing Gaussian set as an empirical probability field, resamples centers from it, estimates local anisotropic covariances with kNN, initializes low opacity, and continues optimization with the original 3DGS renderer and loss. Unlike opacity reset, which only rescales opacity on the old overlap graph, ReorgGS rebuilds centers, covariances, and visibility structure, thereby changing the graph itself. Our analysis shows that distributional equivalence is not optimization equivalence. The reorganized model preserves scene support while improving gradient accessibility under alpha compositing and reducing opacity-weighted overlap, thereby weakening local parameter coupling during subsequent optimization. Under the same additional optimization budget, ReorgGS improves fitting quality at a fixed Gaussian count, suppresses persistent floaters, and reduces rendering overhead from redundant overlap.
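
下面是摘要所述"重采样中心 + kNN各向异性协方差 + 低透明度重启"重组步骤的极简NumPy示意;采样权重、邻居数k与协方差正则项均为演示性假设,并非论文的具体做法:

```python
import numpy as np
from scipy.spatial import cKDTree

def reorganize_gaussians(centers, opacities, k=8, new_opacity=0.05):
    """Minimal sketch of a reorganization step: treat the converged Gaussians as an
    empirical density, resample centres from it, fit a local anisotropic covariance with
    kNN neighbours, and restart at low opacity before continuing optimization."""
    weights = opacities / opacities.sum()                       # opacity-weighted empirical density
    idx = np.random.choice(len(centers), size=len(centers), p=weights)
    new_centers = centers[idx]
    tree = cKDTree(new_centers)
    covariances = []
    for c in new_centers:
        _, nn = tree.query(c, k=k)                              # local neighbourhood
        local = new_centers[nn] - c
        covariances.append(local.T @ local / k + 1e-6 * np.eye(3))
    return new_centers, np.stack(covariances), np.full(len(new_centers), new_opacity)

centers, opac = np.random.rand(1000, 3), np.random.rand(1000)
c, S, o = reorganize_gaussians(centers, opac)
print(c.shape, S.shape, o.shape)   # (1000, 3) (1000, 3, 3) (1000,)
```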

[CV-268] CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models

【速读】:该论文旨在解决视频生成模型(Video Generation Models, VGMs)在目标导向任务中常见的两类失败模式:多步骤任务中的长期轨迹漂移(long-horizon drift)和中段视频片段的模拟误差累积(mid-clip simulation errors)。这些问题源于VGM缺乏基于其短时视觉先验的显式推理机制。为解决此问题,作者提出VLM-VGM协同视频推理框架(CollabVR),其关键在于以步骤级粒度将视觉语言模型(Vision-Language Models, VLM)与VGM耦合形成闭环:VLM规划下一步动作,检查VGM生成的视频片段,并将验证诊断直接融入下一动作提示中以修复检测到的错误。该方法在Gen-ViRe和VBVR-Bench基准上优于单次推理、Pass@k及测试时扩展基线,在相同计算资源下实现显著提升,且与推理微调后的VGM具有可叠加性,验证了步骤级VLM监督的有效性与通用性。

链接: https://arxiv.org/abs/2605.08735
作者: Joowon Kim,Seungho Shin,Joonhyung Park,Eunho Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent “Thinking with Video” approaches use Video Generation Models (VGMs) for visual reasoning by producing temporally coherent Chain-of-Frames as reasoning artifacts. Even strong VGMs, however, exhibit two recurring failure modes on goal-directed tasks: long-horizon drift on multi-step tasks and mid-clip simulation errors that compound. Both stem from the absence of explicit reasoning built upon the VGM’s short-horizon visual prior, a role naturally filled by Vision-Language Models (VLMs), but where to place the VLM is non-trivial: upfront plans commit before any frame is generated and post-hoc critiques over whole videos intervene too late. We propose VLM-VGM Collaborative Video Reasoning (CollabVR), a closed-loop framework that couples the VLM with the VGM at step-level granularity: the VLM plans the immediate next action, inspects the clip the VGM generates, and folds the verifier’s diagnosis directly into the next action prompt to repair detected failures. On Gen-ViRe and VBVR-Bench, CollabVR improves both open-source and closed-source VGMs over single-inference, Pass@k, and prior test-time scaling baselines at matched compute, with the largest gains on the hardest tasks. It also yields further improvements on top of a reasoning-fine-tuned VGM, indicating that step-level VLM supervision is orthogonal to and stackable with reasoning-oriented fine-tuning. We provide video samples and additional qualitative results at our project page: this https URL.
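
为说明步骤级"规划-生成-校验"闭环的基本控制流,下面给出一个纯示意的Python伪代码框架;plan_next_action、generate_clip、verify_clip、is_goal_reached 等接口名均为假设的占位符,仅表示VLM与VGM各自承担的角色,并非论文代码:

```python
def collaborative_video_reasoning(task, vlm, vgm, max_steps=8):
    """Illustrative step-level closed loop: the VLM plans one action, the VGM renders a
    clip, the VLM verifies it, and the diagnosis is folded into the next prompt."""
    clips, feedback = [], None
    for _ in range(max_steps):
        action = vlm.plan_next_action(task, clips, feedback)        # immediate next action only
        clip = vgm.generate_clip(clips[-1] if clips else None, action)
        ok, feedback = vlm.verify_clip(task, action, clip)
        if ok:
            clips.append(clip)            # accept the clip and clear the diagnosis
            feedback = None
            if vlm.is_goal_reached(task, clips):
                break
        # on failure the clip is discarded and `feedback` conditions the next plan
    return clips
```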

[CV-269] Unison: Harmonizing Motion Speech and Sound for Human-Centric Audio-Video Generation

【速读】:该论文旨在解决人-centric视频中运动(motion)、语音(speech)和声音效果(sound effects)三者在时间上的异质性导致的联合生成困难问题,现有模型常因跨模态对齐不一致而产生明显的运动-语音-声效错位。解决方案的关键在于提出一个统一框架Unison,其核心创新包括:1)在音频流中采用语义引导的谐调策略,解耦语音与声音效果的生成,并通过双向音频交叉注意力和语义条件门控机制实现语义驱动的自适应重构,从而缓解语音主导现象并提升声学清晰度;2)针对音频与运动的同步问题,设计双向跨模态强制策略,利用更干净的模态指导噪声较大的模态,结合分阶段稳定化策略优化时序一致性。实验表明,该方法在音频感知质量和跨模态同步性能上均达到当前最优水平,验证了显式多模态谐调在人-centric视频生成中的重要性。

链接: https://arxiv.org/abs/2605.08729
作者: Shihao Cheng,Jiaxu Zhang,Quanyue Song,Shansong Liu,Zhizhi Guo,Xiaolei Zhang,Chi Zhang,Xuelong Li,Zhigang Tu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Motion, speech, and sound effects are fundamental elements of human-centric videos, yet their heterogeneous temporal characteristics make joint generation highly challenging. Existing audio-video generation models often fail to maintain consistent alignment across these modalities, leading to noticeable mismatches between motion, speech, and environmental sounds. We present Unison, a unified framework that explicitly promotes coherence across the motion, speech, and sound modalities. Within the audio stream, Unison employs a semantic-guided harmonization strategy that decouples the generation of speech and sound-effect components. Leveraging bidirectional audio cross-attention and semantic-conditioned gating for semantic-driven adaptive recomposition, this approach effectively mitigates speech dominance and enhances acoustic clarity. For audio-motion synchronization, we propose a bidirectional cross-modal forcing strategy where the cleaner modality guides the noisier one through decoupled denoising schedules, reinforced by a progressive stabilization strategy. Extensive experiments demonstrate that Unison achieves state-of-the-art performance in both audio perceptual quality and cross-modal synchronization, highlighting the importance of explicit multimodal harmonization in human-centric video generation.

[CV-270] Control Your View: High-Resolution Global Semantic Manipulation in Learned Image Compression

【速读】:该论文旨在解决学习型图像压缩(Learned Image Compression, LIC)系统在面对高分辨率全局语义操纵(Global Semantic Manipulation, GSM)攻击时的脆弱性问题,尤其是现有基于投影梯度下降(Projected Gradient Descent, PGD)的方法无法有效实施高分辨率GSM攻击的局限性。其解决方案的关键在于提出了一种新的周期性几何衰减(Periodic Geometric Decay)步长调度策略,并将其与PGD结合得到最小化变体PGD²-GSM;该调度策略能够适应对抗样本从Identity Region到Amplification Region演化过程中的"懒惰-振荡-精炼"(Lazying-Oscillating-Refining)三阶段特性,从而首次实现了稳定、有效的高分辨率GSM攻击,揭示了LIC系统面临的新型安全威胁。

链接: https://arxiv.org/abs/2605.08727
作者: Jiaming Liang,Chi-Man Pun,Weisi Lin,Greta Seng Peng Mok
机构: University of Macau (澳门大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Learned image compression (LIC) integrates deep neural networks (DNNs) to map high-dimensional images into compact latent representations, reducing redundancy and achieving superior rate-distortion (RD) performance in benign settings. Unfortunately, due to inherent vulnerabilities in DNNs, LIC systems are susceptible to adversarial perturbations that lead to downstream deterioration, compression rate degradation, untargeted distortion, and both local semantic manipulation (LSM) and low-resolution (3\times28\times28) global semantic manipulation (GSM). However, high-resolution GSM remains unexplored due to its intractability. Notably, the existing projected gradient descent (PGD) method achieves near-perfect white-box attacks for classification, segmentation, and other tasks, yet fails to generalize to high-resolution GSM. Our theoretical and empirical analyses reveal that well-performing GSM drives adversarial examples from the Identity Region to the Amplification Region through the Lazying-Oscillating-Refining stages. General \ell_\infty-bounded attacks fail on high-resolution GSM because their step-size schedules cannot accommodate both the Oscillating and Refining stages. Based on this, we propose the Periodic Geometric Decay schedule that enables \ell_\infty-bounded high-resolution GSM. To verify our approach, we integrate it with PGD, yielding a minimal variant, PGD^2-GSM. Extensive experiments on the Kodak dataset (3\times768\times512) demonstrate that our PGD^2-GSM is the first to stably achieve high-resolution GSM, thereby exposing a novel threat to LIC systems. Code is available at this https URL.
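
下面给出"周期性几何衰减"步长调度的一个极简实现示意:步长在每个周期内按几何方式衰减、在周期边界重置,以兼顾振荡与精炼两个阶段;周期长度与衰减率等超参数均为演示性假设,并非论文取值:

```python
import numpy as np

def periodic_geometric_decay(base_step, num_iters, period=20, gamma=0.85):
    """Illustrative step-size schedule in the spirit of Periodic Geometric Decay: the step
    decays geometrically within each period and resets at the period boundary, so the
    attack can both escape oscillation and refine. Hyper-parameters here are assumptions."""
    t = np.arange(num_iters)
    return base_step * gamma ** (t % period)

alphas = periodic_geometric_decay(base_step=2 / 255, num_iters=100)
# alphas would then scale the signed gradient inside a standard l_inf-bounded PGD loop.
print(alphas[:5], alphas[19:22])
```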

[CV-271] SynerMedGen: Synergizing Medical Multimodal Understanding with Generation via Task Alignment ICML2026

【速读】:该论文旨在解决当前统一医疗建模中理解(understanding)与生成(generation)任务缺乏功能协同的问题,即现有模型通常将两者视为独立目标,未能实现真正的互补增强。其解决方案的关键在于提出“生成对齐的理解”(generation-aligned understanding)原则,通过任务对齐机制将理解任务与生成任务有机结合;具体而言,SynerMedGen框架设计了三个生成对齐的理解任务,并采用两阶段训练策略,使理解阶段学到的有益表征能够有效迁移至医学图像合成任务中,从而在不依赖生成训练的情况下即实现跨22项任务的零样本性能,且在结合生成训练后显著优于现有专业和统一医疗生成模型。

链接: https://arxiv.org/abs/2605.08724
作者: Weiren Zhao,Yi Dong,Cheng Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML 2026

点击查看摘要

Abstract:Unifying multimodal understanding and generation is a compelling frontier that is beginning to emerge in the medical field. However, the limited existing unified medical models typically treat understanding and generation as disjoint objectives, lacking a meaningful functional synergy. In this work, we identify and address a critical question in unified medical modeling: what form of understanding truly benefits generation. We present SynerMedGen, a unified framework built on the proposed principle of generation-aligned understanding, which synergizes understanding objectives with generation tasks via task alignment. SynerMedGen introduces three generation-aligned understanding tasks and a two-stage training strategy that transfers generation-beneficial representations learned during understanding training to medical image synthesis. Remarkably, even with understanding training alone, our SynerMedGen achieves strong zero-shot performance across 22 medical image synthesis tasks and demonstrates robust generalization to unseen datasets. When combined with generation training, SynerMedGen consistently outperforms state-of-the-art specialized medical image synthesis models as well as recent unified medical models. We also release a large-scale dataset named SynerMed consisting of 1M paired synthesis samples and 2M generation-derived understanding instances to support further research on understanding-generation synergy. Our project can be accessed at this https URL.

[CV-272] EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing

【速读】:该论文旨在解决弱监督音频-视觉视频解析(Weakly supervised Audio-Visual Video Parsing, AVVP)任务中因音视频信号通常未对齐而导致的事件定位精度不足问题。现有方法主要聚焦于多模态融合或伪标签生成器的预训练,但忽视了单模态语义的准确感知,导致伪标签噪声大、视频解析性能受限。解决方案的关键在于:首先提出基于相似性的标签迁移方法用于预训练数据标注,从而增强伪标签生成器对单模态事件的理解能力;其次采用软约束机制在并行处理多模态融合的同时优化单模态特征建模,实现单模态与跨模态表示的协同注意力机制,显著提升事件定位性能。

链接: https://arxiv.org/abs/2605.08723
作者: Huilai Li,Xiaomeng Di,Ying Xing,Yonghao Dang,Yiming Wang,Jianqin Yin
机构: Beijing University of Posts and Telecommunications (北京邮电大学); State Grid Corporation of China (国家电网公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Weakly supervised Audio-Visual Video Parsing (AVVP) aims to recognize and temporally localize audio, visual, and audio-visual events in videos using only coarse-grained labels. Faced with the challenging task settings, existing research advances along two main paths: pre-training pseudo-label generators for fine-grained cross-modal semantic guidance, or refining AVVP model architectures to enhance audio-visual fusion. However, since audio and visual signals are typically unaligned, achieving accurate video parsing fundamentally relies on precise perception of uni-modal events. Yet these multi-modal focused strategies excessively emphasize multi-modal fusion while inadequately guiding and preserving uni-modal semantics, resulting in noisy pseudo-labels and sub-optimal video parsing performance. This paper proposes a novel framework that enhances uni-modal representations for both the pseudo-label generator and the AVVP model. Specifically, we introduce a similarity-based label migration approach to annotate pre-training data, thereby enabling the pseudo-label generator to better understand uni-modal events. We also employ a soft-constrained manner to refine modeling of uni-modal features in parallel with multi-modal fusion. These designs enable coordinated attention to both uni-modal and cross-modal representations, thus boosting the localization performance for events. Extensive experiments show that our method outperforms state-of-the-art methods in both pseudo-label and AVVP performance.

[CV-273] From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation

【速读】:该论文旨在解决机器人手术中动作条件下的视频生成问题(action-conditioned surgical video generation),其核心挑战在于如何用低维控制向量精确调控复杂的图像空间演化过程。解决方案的关键在于提出了一种“运动学到视觉的映射范式”(kinematic-to-visual lifting paradigm),将关节运动学转化为一组五类图像对齐的控制模态,并在此基础上设计了分层路由的视觉控制框架(hierarchically routed visual control framework),通过动态选择最相关的控制模态和运动尺度,实现条件容量的自适应分配;同时引入基于运动学先验的路由损失函数以确保物理合理性、时间稳定性与专家利用效率,并结合预算训练与推理机制,利用路由诱导的稀疏性实现计算资源的自适应调度,从而在保持高控制精度的同时显著降低延迟。

链接: https://arxiv.org/abs/2605.08712
作者: Bohan Li,Shuojue Yang,Baorui Peng,Xianda Guo,Erli Zhang,Youqi Tao,Junfeng Duan,Daguang Xu,Qi Dou,Xin Jin,Wenjun Zeng,Hao Zhao,Yueming Jin
机构: SJTU; NUS; THU; EIT; WHU; Harvard; NVIDIA; CUHK
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Action-conditioned surgical video generation is a critical yet highly challenging problem for robotic surgery. The core difficulty is that low-dimensional control vectors must precisely govern complex image-space evolution. In this work, we propose a kinematic-to-visual lifting paradigm that converts articulated kinematics into a unified set of five image-aligned control modalities. Building on this representation, we introduce a hierarchically routed visual control framework that selectively activates the most relevant control modalities and motion scales. Instead of uniformly applying all control signals, our model performs hierarchical routing to dynamically allocate conditioning capacity. We further design kinematic-prior-guided routing loss functions to ensure physically meaningful, temporally stable, and efficient expert utilization. To improve efficiency, we propose a budgeted training and inference scheme that leverages routing-induced sparsity. By selectively discarding low-significance control pathways during training and execution, our approach enables adaptive computation that is complementary to standard distillation. We additionally construct a new benchmark with curated articulated annotations, obtained through human-in-the-loop semantic labeling and differentiable pose tracking, providing realistic supervision for action-conditioned surgical video generation. Extensive experiments demonstrate that our method consistently improves action faithfulness, visual fidelity, and cross-domain generalization over diverse baselines. Moreover, our efficient variant achieves substantial reductions in latency while maintaining strong control accuracy.

[CV-274] UniShield: Unified Face Attack Detection via KG-Informed Multimodal Reasoning

【速读】:该论文旨在解决统一人脸攻击检测(Unified Face Attack Detection, UAD)中难以同时识别物理欺骗(physical spoofing)与数字伪造(digital forgery)的问题,现有方法多依赖外观相关性,缺乏基于知识的推理能力。解决方案的关键在于提出UniShield框架,其核心是构建一个结构化的人脸攻击知识图谱(Face Attack Knowledge Graph, FAKG),将攻击类别与诊断视觉线索及条件化关系相连接,并通过FAKG-QA数据集进行攻击图谱指令微调(AGIT),同时引入**图一致性推理优化(Graph-Consistent Reasoning Optimization, GCRO)**机制,利用知识图谱一致性奖励约束生成推理过程,确保推理理由与图谱支持的线索一致,从而提升检测准确率和推理可靠性。

链接: https://arxiv.org/abs/2605.08709
作者: Hongrui Li,Yichen Shi,Hongyang Wang,Yuhao Gao,Hui Ma,Jun Feng,Zitong Yu
机构: Shijiazhuang Tiedao University (石家庄铁道大学); Shanghai Jiao Tong University (上海交通大学); Ningbo Institute of Digital Twin (宁波数字孪生研究所); Great Bay University (大湾区大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unified face attack detection (UAD) requires recognizing physical spoofing and digital forgery within a shared decision space, yet existing discriminative or prompt-based methods largely rely on appearance correlations and provide limited evidence-grounded reasoning. We propose UniShield, a knowledge-grounded multimodal reasoning framework for unified face attack defense. UniShield constructs a Face Attack Knowledge Graph (FAKG) that links attack categories to diagnostic visual cues and attack-conditioned relations, and uses it to synthesize 52,025 FAKG-QA examples for Attack-Graph Instruction Tuning (AGIT). To improve rationale consistency, we further introduce Graph-Consistent Reasoning Optimization (GCRO), a GRPO-based objective with a KG-consistency reward that encourages generated rationales to match graph-supported cues while penalizing incompatible claims. Experiments on our multimodal UAD benchmark show that UniShield achieves strong performance across binary, coarse-grained, and fine-grained protocols, with consistently high ACC and low HTER. These results suggest that structured attack knowledge can improve both detection accuracy and reasoning reliability over discriminative baselines and general-purpose MLLMs. Our code will be released at this https URL.

[CV-275] Gate-and-Merge: Zero-shot Compositional Personalization of Vision Language Models

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)的组合个性化(compositional personalization)问题,即在测试阶段需联合识别或描述多个用户定义的概念。传统方法通常依赖于概念共现训练,而本文提出了一种零样本框架 Gate-and-Merge,其关键在于:每个概念独立学习为轻量级 LoRA(Low-Rank Adaptation)适配器,并与概念标记(concept token)配对;推理时通过直接在权重空间合并特定概念的 LoRA 更新实现组合,同时引入门控机制(gating mechanism)以估计文本和视觉线索,仅激活对预测有贡献的模块,从而抑制无关激活并防止干扰;此外,通过仅融合最显著且相互一致的更新来稳定组合过程,有效保持各概念的独立性。

链接: https://arxiv.org/abs/2605.08702
作者: Guodong Ding,Angela Yao
机构: National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper tackles compositional personalization of vision-language models (VLMs). In this problem, multiple user-defined concepts must be recognized or described jointly at test time. We introduce Gate-and-Merge, a zero-shot framework that enables compositional personalization without the need for co-occurrence training. During personalization, each concept is learned independently as a lightweight LoRA adapter, paired with a concept token. The base model remains unchanged and concepts are kept disentangled. At inference, we enable composition by merging concept-specific LoRA updates directly in weight space. To suppress irrelevant activations and prevent interference, a gating mechanism is employed to estimate textual and visual cues and select only the modules that contribute to the prediction. We further stabilize composition by combining only the most meaningful and mutually consistent updates, helping preserve each concept’s identity. Our quantitative and qualitative analyses show consistent gains in performance across multiple personalization tasks in both single-concept and compositional settings.
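
下面用几行PyTorch示意"在权重空间直接合并被门控选中的LoRA更新"这一组合方式;门控阈值与低秩矩阵的取值仅为演示性假设,并非论文的确切实现:

```python
import torch

def gate_and_merge(base_weight, lora_pairs, gates, tau=0.5):
    """Minimal sketch of merging concept-specific LoRA updates directly in weight space,
    keeping only modules whose gate (estimated from textual/visual cues) is active.
    The gating/thresholding rule here is an assumption for illustration."""
    merged = base_weight.clone()
    for (A, B), g in zip(lora_pairs, gates):
        if g > tau:                       # gate suppresses concepts irrelevant to the query
            merged += g * (B @ A)         # low-rank update: B is (out, r), A is (r, in)
    return merged

W = torch.randn(512, 512)
loras = [(torch.randn(8, 512) * 0.01, torch.randn(512, 8) * 0.01) for _ in range(3)]
W_personalized = gate_and_merge(W, loras, gates=[0.9, 0.1, 0.8])
print(W_personalized.shape)  # torch.Size([512, 512])
```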

[CV-276] Supersampling Stable Diffusion and More: An Approach for Interpolating Neural Networks Using Common Interpolation Methods

【速读】:该论文旨在解决Stable Diffusion(SD)模型在生成高于训练分辨率图像时出现的物体重复伪影(object duplication artifacts)问题,而无需对模型进行微调。其关键解决方案是提出卷积核插值(kernel interpolation)方法,通过数学证明表明:若将插值后的卷积核乘以一个常数系数,即可正确缩放核尺寸,从而在不引入额外训练的情况下实现高分辨率图像生成。该方法不仅适用于卷积层,还可扩展至全连接层,展现出良好的通用性,并在最坏情况下仅导致准确率和F1分数下降2.6%,同时能将神经网络训练内存占用降低至少4倍。

链接: https://arxiv.org/abs/2605.08698
作者: Md Abu Obaida Zishan,Jannatun Noor,Annajiat Alim Rasel
机构: BRAC University (BRAC大学); United International University (联合国际大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Stable Diffusion (SD) has significantly evolved DDPM (Denoising Diffusion Probabilistic Model) based image generation by denoising in latent space instead of feature space. This popularized DDPM-based image generation, as the cost and compute barrier was significantly lowered. However, these models can only generate fixed-resolution images according to their training configuration. When we attempt to generate higher resolutions, the resulting images consistently show object duplication artifacts. To solve this problem without finetuning SD models, recent works have tried dilating the convolution kernels of the models and have achieved a great level of success. But dilated kernels are harder to fine-tune due to being zero-gapped. Apart from this, other methods, such as patched diffusion, could not solve the object-duplication problem efficiently. Hence, to overcome the limitations of dilated convolutions, we propose kernel interpolation of SD models for higher-resolution image generation. In this work, we show mathematically that interpolation can correctly scale convolution kernels if multiplied by a constant coefficient, and we achieve competitive empirical results in generating beyond-training-resolution images with Stable Diffusion using zero training. Furthermore, we demonstrate that our method enables interpolation of deep neural networks to adapt to higher-dimensional training data, with a worst-case performance drop of 2.6% in accuracy and F1-score relative to the baseline. This shows that our method is generally applicable, as we interpolate fully-connected layers, going beyond convolution layers. We also discuss how our method can reduce the memory footprint of training neural networks by at least 4\times.
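
作为参考,下面给出"插值卷积核并乘以常数系数"的一个简化示意;此处用核面积之比作缩放系数只是保持滤波响应量级的一种合理选择,论文推导的系数可能与之不同:

```python
import torch
import torch.nn.functional as F

def interpolate_conv_kernel(weight, new_size, mode="bilinear"):
    """Sketch of scaling a convolution kernel by interpolation for beyond-training-resolution
    generation. The constant coefficient used here (ratio of kernel areas) is one plausible
    choice to keep the filter response magnitude roughly constant; it is an assumption,
    not necessarily the coefficient derived in the paper."""
    old_size = weight.shape[-1]
    scaled = F.interpolate(weight, size=(new_size, new_size), mode=mode, align_corners=False)
    return scaled * (old_size ** 2) / (new_size ** 2)

w = torch.randn(64, 64, 3, 3)           # (out_ch, in_ch, kH, kW)
w_up = interpolate_conv_kernel(w, 5)    # 3x3 -> 5x5, rescaled
print(w_up.shape)                       # torch.Size([64, 64, 5, 5])
```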

[CV-277] EditSleuth: A Dataset of Grounded Reasoning Chains for Image-Edit Forensics

【速读】:该论文旨在解决生成式 AI (Generative AI) 伪造图像的深度分析问题,即现有图像取证系统仅能进行真假二分类判断,难以实现编辑区域定位、语义类型识别及基于视觉证据的决策解释。其解决方案的关键在于构建一个大规模、结构化且可验证的图像编辑三元组数据集 EditSleuth,包含 257,725 个样本,每个样本均配有源图、编辑掩码、12 类编辑语义标签、难度评分和六步可追溯的推理链(reasoning chain)。这些推理链由上游计算可得的证据支撑,确保了推理过程的可解释性与可信度,从而为模型提供 grounded reasoning 的监督信号,使训练后的模型不仅能准确分类编辑类型,还能输出符合人类认知逻辑的解释性文本。

链接: https://arxiv.org/abs/2605.08695
作者: Van-Loc Nguyen,AprilPyone MaungMaung,Minh-Triet Tran,Isao Echizen
机构: University of Science, Vietnam National University Ho Chi Minh City (胡志明市科技大学); National Institute of Informatics (日本信息研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Forensic analysis of AI-edited images requires more than binary real-versus-fake prediction: a useful system should localize the edit, identify its semantic type, and ground its decisions in visual evidence. Existing image-forensics datasets typically emphasize detection or localization, while reasoning-supervised vision-language datasets rarely target image manipulation and often rely on LLM-generated rationales whose faithfulness is difficult to verify. We introduce EditSleuth, a dataset of 257,725 image-edit triplets constructed from existing image-editing corpora for grounded image-edit forensic reasoning. Each example includes an edited image, its source image, a binary edit mask, a 12-class edit taxonomy label, a difficulty score, and a six-step reasoning chain. EditSleuth chains are generated deterministically from triplet-grounded upstream artifacts, with each statement tied to a specific computable source of evidence. Our analysis reveals that a naive four-component difficulty formulation suffers from a rank-2 correlation collapse among magnitude features; a simplified three-component formulation substantially increases score dispersion on both Pico-Banana and MagicBrush. Difficulty also varies meaningfully within most edit categories, indicating that the score is not a proxy for edit type. As an initial learning study, we fine-tune Qwen2-VL-2B with LoRA and find that chain-as-target supervision matches a label-only baseline on classification accuracy among parseable answers, while additionally yielding grounded explanatory prose that label-only supervision cannot produce. We release the dataset, the deterministic construction pipeline, and pilot training scripts.

[CV-278] IPAD-CLIP: Teaching CLIP to Detect Image Local Perceptual Artifacts

【速读】:该论文旨在解决当前图像质量评估方法对全局失真(如噪声、模糊)过度关注,而忽视局部感知伪影(如鬼影、镜头光晕和摩尔纹)的检测问题,这是自动伪影检测领域长期未被充分探索的核心挑战。解决方案的关键在于提出一种名为IPAD-CLIP的新框架,其核心创新是利用CLIP模型在文本与视觉空间中增强伪影判别能力,同时保持泛化性能;具体而言,通过学习与伪影相关的文本嵌入(artifact-aware text embeddings),显式建模对象与伪影之间的语义关系,从而提升区分干净图像与含伪影图像的能力,并将视觉编码器的关注点从高层语义引导至低层、细微的局部伪影特征。

链接: https://arxiv.org/abs/2605.08664
作者: Juan Wang,Xinyu Sun,Ke Zhang,Jin Wang,Bing Li,Weiming Hu,Liang Wang
机构: Chinese Academy of Sciences(中国科学院); Beijing Jiaotong University(北京交通大学); Tsinghua University(清华大学); Minzu University of China(中央民族大学); OPPO Co., Ltd.(OPPO公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 6 figures

点击查看摘要

Abstract:Current image quality assessment methods are heavily biased towards global distortions (e.g., noise, blur), neglecting local perceptual artifacts such as ghosting, lens flare, and moire effects. Although significant progress has been made in artifact removal, the fundamental problem of automatic artifact detection remains largely unexplored. In this paper, we formalize the Image Perceptual Artifact Detection (IPAD) task to address this gap. We contribute a benchmark dataset comprising 3,520 artifact images, including 520 real-captured and 3,000 synthetic samples, each paired with pixel-level masks across three representative artifact categories. The core challenge of IPAD lies in the localized, subtle, and semantically weak nature of these artifacts, which makes them prone to missed detection. To overcome this, we introduce IPAD-CLIP, a novel framework built upon CLIP that enhances artifact discrimination in both textual and visual spaces while preserving generalization capabilities. Our key insight is that local artifacts often exhibit strong correlations with specific semantic contexts. Accordingly, we learn artifact-aware text embeddings to explicitly model the object-artifact relationships, resulting in enhanced representations that clearly differentiate between clean and artifact prompts. These text embeddings are then used as anchors to shift the visual encoder’s attention from high-level semantics to subtle, low-level artifacts. Extensive experiments demonstrate that IPAD-CLIP offers a resource-efficient adaptation of CLIP for detection, significantly outperforming advanced image anomaly detection and manipulation detection methods on our benchmark. To the best of our knowledge, this is the first study addressing multi-class local perceptual artifact detection in terms of both dataset and model.

[CV-279] CAST: Channel-Aware Spatial Transfer Learning with Pseudo-Image Radar for Sign Language Recognition CVPR2026

【速读】:该论文旨在解决仅依赖幅度信息的60 GHz雷达进行孤立手语识别(Isolated Sign Language Recognition)的挑战,特别是如何从Range-Time Maps (RTM) 中提取有效特征以提升识别精度。其解决方案的关键在于提出一种双流架构CAST,通过三个物理感知模块实现:首先,采用显式的dB到线性域转换与加窗快速傅里叶变换(windowed fast Fourier transform),生成无谐波伪影的Cadence Velocity Diagrams (CVD);其次,引入跨天线空间注意力模块,在卷积前对原始天线通道施加注意力机制,保留接收器间幅度协方差;最后,利用非对称交叉注意力机制融合并行的ConvNeXt-Tiny(用于CVD)和EfficientNetV2-S(用于RTM)骨干网络表示。该方法在5折交叉验证中达到80.5%的Top-1准确率,较最优单模型基线提升3.3%,验证了物理感知信号表示在雷达模态受限下的有效性。

链接: https://arxiv.org/abs/2605.08663
作者: Md. Shakhoyat Rahman Shujon,Sheikh Md. Galib Mahim,Md. Milon Islam,Md Rezwanul Haque,Md Rabiul Islam,Hamdi Altaheri,Fakhri Karray
机构: Khulna University of Engineering & Technology (库尔纳工程技术大学); University of Waterloo (滑铁卢大学); Texas A&M University (德克萨斯农工大学); King Saud University (沙特国王大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), MSLR Workshop @ CVPR 2026 in Denver (Colorado, USA)

点击查看摘要

Abstract:We propose CAST, a dual-stream architecture that utilizes channel-aware spatial transfer learning for isolated sign language recognition addressing the challenges of magnitude-only 60 GHz radar Range-Time Maps (RTM). The proposed framework combines three physics-aware architectures with pretrained vision backbones, which operate under radar-only constraints across clinical and alphabetical gestures. First, an explicit decibel-to-linear inversion is combined with a windowed fast Fourier transform that extracts Cadence Velocity Diagrams (CVD) while avoiding the harmonic artifacts that arise from the spectral analysis of log-compressed signals. Second, a cross-antenna spatial attention module applies attention to raw antenna channels before the convolution, preserving inter-receiver amplitude covariance. Third, an asymmetric cross-attention mechanism fuses representations from parallel ConvNeXt-Tiny (CVD) and EfficientNetV2-S (RTM) backbones. Extensive experiments reveal that the architecture achieves a Top-1 accuracy of 80.5% under 5-fold cross-validation, establishing a 3.3% improvement over the best single-model baseline (77.2%). The findings suggest that physics-aware signal representations form a promising direction for radar-only sign language recognition under constrained sensor modalities. The source code is available at: this https URL.
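
摘要中的关键预处理是显式dB到线性反变换加加窗FFT,以避免对对数压缩信号做谱分析带来的谐波伪影。下面是该步骤的NumPy示意,幅度标定与窗函数选择均为假设:

```python
import numpy as np

def cadence_velocity_diagram(rtm_db, window="hann"):
    """Sketch of CVD extraction: explicit dB-to-linear inversion followed by a windowed FFT
    along slow time, so the spectrum is taken on linear amplitudes rather than
    log-compressed ones. The exact amplitude scaling is an assumption."""
    rtm_lin = 10.0 ** (rtm_db / 20.0)                 # invert log compression
    T = rtm_lin.shape[1]                              # rtm: (range_bins, time_frames)
    win = np.hanning(T) if window == "hann" else np.ones(T)
    spectrum = np.fft.rfft(rtm_lin * win[None, :], axis=1)
    return np.abs(spectrum)                           # cadence frequency per range bin

rtm_db = np.random.uniform(-60, 0, size=(128, 256))
cvd = cadence_velocity_diagram(rtm_db)
print(cvd.shape)  # (128, 129)
```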

[CV-280] Privacy-Aware Video Anomaly Detection through Orthogonal Subspace Projection ICML2026

【速读】:该论文旨在解决视频异常检测(Video Anomaly Detection, VAD)系统在实际部署中面临的隐私泄露问题,尤其是在涉及人类活动的场景下,模型可能无意中保留或泄露面部等敏感信息。解决方案的关键在于提出正交投影层(Orthogonal Projection Layer, OPL),通过去除与任务无关的变异性来聚焦于异常相关特征;进一步引入引导式正交投影层(Guided OPL, G-OPL),利用弱监督的面部存在信号抑制面部属性,同时保留非识别性特征如姿态和运动,并采用余弦对齐目标实现无需身份标签或对抗训练即可一致地捕捉并移除面部信息。该方法在保障检测性能的同时显著降低隐私风险,验证了基于投影的架构在设计隐私感知VAD系统中的有效性。

链接: https://arxiv.org/abs/2605.08651
作者: Lei Wang,Wenxiang Diao,Andrew Busch,Jun Zhou,Yongsheng Gao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted as a Spotlight paper at the Forty-Third International Conference on Machine Learning (ICML 2026)

点击查看摘要

Abstract:Video anomaly detection (VAD) systems often prioritize accuracy while overlooking privacy concerns, limiting their suitability for real-world deployment. We propose the Orthogonal Projection Layer (OPL), a lightweight module that removes task-irrelevant variations to produce representations focused on anomaly-relevant cues. To address privacy risks in human-centered scenarios, we introduce Guided OPL (G-OPL), which suppresses facial attributes using weak supervision from face-presence signals while preserving non-identifying features such as pose and motion. A cosine alignment objective enforces consistent capture and removal of facial information without identity labels or adversarial training. We further present a privacy-aware evaluation framework that jointly assesses detection performance and privacy preservation, and enables analysis of how sensitive information is filtered. Experiments show that embedding privacy constraints into model design reduces sensitive information while maintaining or improving detection accuracy, supporting projection-based architectures as a principled approach for privacy-aware VAD.
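
正交投影本身的计算很直接:将特征减去其在待去除方向子空间上的投影。下面是一个与论文思路一致但高度简化的PyTorch示意;方向矩阵如何借助人脸存在信号学习(即G-OPL)此处不展开:

```python
import torch

def orthogonal_projection(features, directions):
    """Minimal sketch of an orthogonal-projection step: remove the component of each
    feature lying in the span of unwanted directions (e.g. face-related variation),
    keeping only the orthogonal complement."""
    # directions: (D, K); features: (N, D)
    Q, _ = torch.linalg.qr(directions)          # orthonormal basis of the unwanted subspace
    return features - (features @ Q) @ Q.T      # F - F Q Q^T

feats = torch.randn(32, 512)
U = torch.randn(512, 4)                         # 4 directions to suppress (illustrative)
clean = orthogonal_projection(feats, U)
print(clean.shape)  # torch.Size([32, 512])
```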

[CV-281] FlowADMM: Plug-and-play ADMM with Flow-based Renoise-Denoise Priors

【速读】:该论文旨在解决基于流模型(flow-based models)的插件式去噪(Plug-and-Play, PnP)方法在求解逆问题时收敛性分析困难的问题。现有方法依赖随机重噪声-去噪操作,导致理论分析复杂化。其解决方案的关键在于识别并形式化了流模型中隐含的确定性重噪声-去噪算子,该算子可表示为对潜在噪声分布下去噪器期望的映射;在此基础上提出FlowADMM算法,将该确定性算子嵌入经典交替方向乘子法(ADMM)框架,并在弱Lipschitz条件下建立了收敛性保证,同时支持非平稳时间调度策略。实验证明,FlowADMM在多种图像恢复任务中达到当前最优性能,且所需数据一致性评估次数少于先前方法。

链接: https://arxiv.org/abs/2605.08640
作者: Hendrik Sommerhoff,Michael Moeller
机构: University of Siegen (锡根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Plug-and-play (PnP) methods for solving inverse problems have recently achieved strong performance by leveraging denoising priors based on powerful generative diffusion and flow models. However, existing diffusion- and flow-based PnP methods typically rely on stochastic renoise-denoise operations, which complicate the analysis of their convergence behavior. In this work, we identify and formalize the deterministic renoise-denoise operator underlying flow-based plug-and-play methods. This perspective reveals that these methods implicitly define a deterministic operator given by the expectation of a denoiser over the latent noise distribution. Building on this insight, we propose FlowADMM, a PnP algorithm that integrates the renoise-denoise operator into the classical alternating direction method of multipliers (ADMM) framework. We establish convergence guarantees for FlowADMM under weak Lipschitz conditions on the underlying flow network, and extend the analysis to non-stationary time schedules. Empirically, FlowADMM achieves state-of-the-art performance among flow-based PnP methods on a range of inverse problems, including denoising, deblurring, super-resolution, and inpainting, while requiring fewer data consistency evaluations than prior approaches.
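
下面给出一个通用的Plug-and-Play ADMM迭代骨架,将"重噪声-去噪"算子作为先验步;数据项取最小二乘并以梯度步近似其邻近算子,步长与迭代次数均为演示性假设,流模型去噪器以占位函数表示,并非论文的具体算法:

```python
import numpy as np

def pnp_admm(y, A, At, renoise_denoise, rho=1.0, step=0.1, num_iters=50):
    """Generic plug-and-play ADMM sketch with a renoise-denoise operator as the prior step.
    The data term is least squares ||Ax - y||^2; A/At and the flow-based renoise_denoise
    callable are placeholders supplied by the user."""
    x = At(y)
    z = x.copy()
    u = np.zeros_like(x)
    for _ in range(num_iters):
        # x-update: gradient step on the augmented data-fidelity term (prox approximation)
        grad = At(A(x) - y) + rho * (x - z + u)
        x = x - step * grad
        # z-update: prior step, here the deterministic renoise-denoise operator
        z = renoise_denoise(x + u)
        # dual update
        u = u + x - z
    return x
```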

[CV-282] Kinematics-Driven Gaussian Shape Deformation for Blurry Monocular Dynamic Scenes

【速读】:该论文旨在解决从模糊单目视频中重建动态三维场景的问题,其核心挑战在于运动引起的模糊会混淆物体的运动与几何信息,导致几何一致性难以保持。解决方案的关键在于提出一种基于运动学先验的框架 Kinematics-GS,该框架将模糊建模为沿运动轨迹对齐的形变,并引入运动学先验来重参数化高斯形状,从而在无需额外运动监督的情况下避免形状退化坍缩;同时通过时间形变方差分解场景为动态与静态成分,并采用粗到精的形变策略以兼顾全局运动与细粒度细节,显著提升了复杂非刚性运动场景下的重建精度。

链接: https://arxiv.org/abs/2605.08635
作者: Yeon-Ji Song,Kiyoung Kwon,Junoh Lee,Jin-Hwa Kim,Byoung-Tak Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 9 figures, 13 tables

点击查看摘要

Abstract:Reconstructing dynamic 3D scenes from blurry monocular videos is challenging as motion-induced blur entangles object motion and geometry, hindering geometric consistency. We present Kinematics-GS, a kinematics-aware framework that models blur as motion-aligned deformation and introduces a kinematic prior to reparameterize Gaussian shapes along motion trajectories, thereby mitigating degenerate shape collapse without auxiliary motion supervision. To stabilize optimization, we decompose scenes into dynamic and static components using temporal deformation variance and employ a coarse-to-fine deformation strategy to capture both global motion and fine-grained details. We also introduce a challenging real-world dataset of deformable and elastic objects exhibiting non-rigid motion with spatially non-uniform motion blur that obscures geometric cues. Extensive experiments on real-world benchmarks with realistic motion blur demonstrate that Kinematics-GS outperforms prior methods by a clear margin in monocular dynamic scene reconstruction, highlighting its effectiveness in handling complex and non-rigid motion scenarios.

[CV-283] ransforming the Use of Earth Observation Data: Exascale Training of a Generative Compression Model with Historical Priors for up to 10000x Data Reduction

【速读】:该论文旨在解决地球观测(Earth Observation, EO)数据规模急剧增长与传统压缩方法仅作为存储和传输工具之间存在的矛盾,即如何将压缩从被动的存储优化转变为一种主动、任务自适应的数据利用方式。其解决方案的关键在于提出了一种生成式压缩框架(Generative Compression Framework),该框架通过学习历史地球观测档案中的时空演化模式,利用历史先验信息实现跨下游任务的100倍至10,000倍极端数据压缩。这一方法突破了通用视觉数据压缩的局限,充分利用地球观测数据重复测量同一动态星球的特点,使压缩模型具备任务感知能力,从而在数据获取、传输、存储和科学分析全链条中实现高效协同。

链接: https://arxiv.org/abs/2605.08633
作者: Jinxiao Zhang,Runmin Dong,Xiyong Wu,Xihan Huang,Shenggan Cheng,Yunkai Yang,Zheng Zhou,Yunpu Xu,Zhaoyang Luo,Miao Yang,Fan Wei,Mengxuan Chen,Yang You,Juepeng Zheng,Weijia Li,Yutong Lu,Haohuan Fu
机构: Tsinghua Shenzhen International Graduate School (清华大学深圳国际研究生院); Tsinghua University (清华大学); Sun Yat-Sen University (中山大学); National University of Singapore (新加坡国立大学); National Supercomputing Center in Shenzhen (深圳市国家超级计算中心)
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Earth observation is becoming one of the largest data-producing activities in science, yet current pipelines still treat compression as a storage and transmission tool rather than a new way to use data. We present a generative compression framework that learns from historical Earth observation archives and enables on-demand 100x to 10,000x data reduction across downstream tasks. Unlike general visual data, Earth observation repeatedly measures the same evolving planet, making historical-prior learning feasible for extreme compression. To realize this paradigm, we train large generative compression models at exascale on the LineShine Armv9 CPU supercomputer, with co-optimization across model design, kernels, memory hierarchy, runtime, and parallelism. Our implementation sustains 1.54 EFLOP/s and peaks at 2.16 EFLOP/s in end-to-end training. This work shows that historical-prior generative compression can turn Earth observation data into an active, task-adaptive foundation for acquisition, delivery, storage, and scientific use.

[CV-284] DRNet: All-in-One Image Restoration via Prior-Guided Dynamic Reparameterization

【速读】:该论文旨在解决全功能图像复原(All-in-one image restoration)中存在的三大关键问题:1)由于动态退化估计带来的每输入计算开销;2)因任务异质性导致的优化困难;3)频率无关的编码器设计效率低下。其解决方案的核心在于提出动态重参数化网络(Dynamic Reparameterization Network, DRNet),该框架基于初始化阶段重配置范式,从根本上消除了每输入的计算负担;其中心组件是受任务特定调制器(Task-Specific Modulator, TSM)引导的动态重参数化多层感知机(DRMLP),能够通过统一架构协调特定复原目标与通用模式,有效缓解任务异质性;同时引入连续小波变换编码器(Continuous Wavelet Transform Encoder, CWTE),利用小波分解显式建模频域特征,实现轻量而强大的编码设计。

链接: https://arxiv.org/abs/2605.08627
作者: Ao Li,Xiaoning Liu,Sheng Li,Yapeng Du,Zhen Long,Lei Luo,Le Zhang,Ce Zhu
机构: University of Electronic Science and Technology of China (电子科技大学); Chongqing University of Posts and Telecommunication (重庆邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE TMM

点击查看摘要

Abstract:All-in-one image restoration aims to handle diverse degradations within a single model. However, existing methods often suffer from three key limitations: 1) per-input computational overhead from dynamic degradation estimation; 2) optimization challenges due to task heterogeneity; and 3) inefficient, frequency-agnostic encoder designs. To overcome these, we introduce the Dynamic Reparameterization Network (DRNet), a novel framework operating on an initialization-stage reconfiguration paradigm that fundamentally eliminates per-input overhead. At its core, a Dynamic Reparameterization MLP (DRMLP) guided by a Task-Specific Modulator (TSM), which effectively mitigates task heterogeneity by orchestrating both specific restoration goals and a versatile general-purpose mode within a unified architecture. Furthermore, we incorporate a Continuous Wavelet Transform Encoder (CWTE) that explicitly leverages frequency characteristics via wavelet decomposition for a lightweight yet powerful design. Extensive experiments demonstrate that DRNet achieves state-of-the-art performance across five restoration tasks with superior parameter efficiency. Crucially, it showcases unique flexibility, excelling as both a highly competitive foundation model for blind restoration and a top-performing user-guided specialist.

[CV-285] Beyond Toy Benchmarks: A Systematic Evaluation of OOD Detection Methods For Plant Pathology Classification

【速读】:该论文旨在解决深度学习系统在实际部署中面临的分布外(Out-of-distribution, OOD)检测问题,尤其关注现有方法多基于小规模、视觉同质化基准数据集评估,难以反映真实场景中的复杂性。其解决方案的关键在于在Plant Pathology 2021这一具有自然分布偏移的细粒度任务上,系统比较六种OOD检测方法,发现基于能量(energy-based)的微调策略在多种OOD设置下表现最优,不仅提升了检测性能,还保持了分布内(in-distribution)准确率;进一步分析表明,这种提升源于嵌入空间的重构与评分函数校准的协同作用。此外,论文还揭示了约束优化方法在扩展至中等规模数据集时存在的训练不稳定性问题,这是以往文献中较少涉及的重要实践挑战。

链接: https://arxiv.org/abs/2605.08618
作者: Devesh Shah
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Out-of-distribution (OOD) detection is essential for reliable deployment of deep learning systems, yet the majority of existing methods are evaluated on small, visually homogeneous benchmarks. In this work, we study six OOD detection methods spanning post-hoc scoring, auxiliary objectives, energy-based models, and constrained optimization on the Plant Pathology 2021 dataset, a fine-grained task with natural distribution shifts. Energy-based fine-tuning performs best across OOD settings, improving detection over the softmax baseline while preserving in-distribution accuracy. Analysis shows these gains stem from both a restructuring of the embedding space alongside calibration of the scoring function. We further document practical training instabilities that arise when scaling constrained optimization methods to moderate-sized datasets, findings that are largely absent from existing literature. Our results demonstrate that principled OOD detection is achievable on real-world domain-specific data and that benchmark evaluations alone may not capture the challenges that emerge in practice.
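
摘要中表现最好的能量法,其打分方式本身非常简洁:E(x) = -T·logsumexp(logits/T),能量越低越可能是分布内样本。下面是该打分函数的PyTorch示意;阈值需在验证集上选取,论文使用的能量微调目标此处未展示:

```python
import torch

def energy_score(logits, T=1.0):
    """Standard energy score for OOD detection: E(x) = -T * logsumexp(logits / T).
    Lower energy suggests an in-distribution input; a threshold chosen on validation
    data separates OOD inputs."""
    return -T * torch.logsumexp(logits / T, dim=-1)

logits = torch.randn(4, 12)            # e.g. 12 plant-pathology classes
scores = energy_score(logits)
is_ood = scores > 0.0                  # placeholder threshold, tuned in practice
print(scores, is_ood)
```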

[CV-286] Egocentric Whole-Body Human Mesh Recovery with Prior-Guided Learning ICIP2026

【速读】:该论文旨在解决从单目头戴式相机拍摄的自指视角(egocentric)图像中恢复完整人体网格(whole-body mesh)的问题,该任务在增强现实/虚拟现实(AR/VR)应用中日益重要,但因缺乏基于参数化人体模型(如SMPL和SMPL-X)的真实标注数据而极具挑战性。现有方法多依赖伪真值(pseudo-GT)进行身体姿态估计,难以重建手部和面部等细粒度结构。其解决方案的关键在于提出一种先验引导的学习框架:首先构建基于优化的伪真值,使其与3D关节监督对齐,显著提升准确性;其次融合多种先验信息,包括利用外指视角(exocentric)HMR基础模型以及扩散模型驱动的姿态先验;并引入确定性去畸变模块以处理自指图像中的鱼眼畸变。该方法在多个自指基准测试中优于当前最优方法,验证了其有效性与可复现性。

链接: https://arxiv.org/abs/2605.08606
作者: Soyeon Na,Seung Young Noh,Ju Yong Chang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICIP 2026. This is the author-formatted version of the paper

点击查看摘要

Abstract:Egocentric human mesh recovery (HMR) from monocular head-mounted cameras is increasingly important for AR/VR applications, but remains challenging due to the lack of reliable ground-truth (GT) annotations based on parametric human body models such as SMPL and SMPL-X for real egocentric images. Existing egocentric HMR methods typically rely on pseudo-GT and focus on body pose estimation, which limits their ability to recover fine-grained whole-body details such as hands and face. We study egocentric whole-body human mesh recovery and propose a prior-guided learning framework that reconstructs whole-body meshes from a single egocentric image. We construct more accurate optimization-based pseudo-GT aligned with 3D joint supervision, and leverage multiple priors by adapting an exocentric HMR foundation model together with a diffusion-based pose prior. A deterministic undistortion module is further adopted to handle fisheye distortions in egocentric images. Experiments across multiple egocentric benchmarks demonstrate improved whole-body reconstruction compared to state-of-the-art methods, and show that our optimization-based pseudo-GT is substantially more accurate than existing regression-based pseudo-GT. To facilitate reproducibility, the code and dataset annotations are publicly available at this https URL.

[CV-287] Cross-Modal RGB-D Fusion Transformer for 6D Pose Estimation of Non-Cooperative Spacecraft with Stereo-Derived Depth

【速读】:该论文旨在解决在轨服务与主动碎片清除任务中,针对非合作航天器的六自由度(6-DOF)位姿估计问题,尤其克服基于学习的单目方法存在的深度模糊性缺陷以及在极端光照条件下易失效的问题。其关键解决方案是提出一种被动式立体视觉框架,核心包括:1)设计了一种名为TSCA-Stereo的双目匹配网络,以应对空间图像中弱纹理、镜面高光和严重光照变化等挑战;2)引入跨模态融合Transformer,自适应地融合RGB外观信息与立体深度特征,提升位姿恢复的可靠性;3)构建了一个涵盖多种光照场景、姿态配置和噪声水平的合成双目多模态数据集,用于验证方法的有效性。实验表明,该方案在复杂空间环境下实现了平均平移误差0.0419 m、平均姿态误差0.8632°,验证了被动立体视觉方法在恶劣空间视觉条件下的有效性与鲁棒性。

链接: https://arxiv.org/abs/2605.08592
作者: Yongliang Zhen,Bo LÜ,Hang Yang,Xiaotian WU
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:On-orbit servicing and active debris removal involving non-cooperative spacecraft require reliable pose estimation to supply accurate position and orientation data for autonomous visual navigation. Learning-based monocular methods have seen widespread adoption in spacecraft pose estimation, yet they suffer from an intrinsic depth ambiguity problem and tend to fail under the harsh illumination conditions routinely encountered in orbit. Active depth sensors could in principle address the geometric ambiguity, but their power and mass requirements make them poorly suited to most spacecraft platforms. This work addresses these issues through a passive stereo vision framework for six-degree-of-freedom (6-DOF) pose estimation of non-cooperative spacecraft. A binocular stereo matching network called TSCA-Stereo is developed to cope with weak-texture surfaces, specular highlights, and severe lighting variations typical of space imagery. A cross-modal fusion Transformer is introduced to combine RGB appearance information with stereo depth features in an adaptive manner, supporting reliable pose recovery. A synthetic binocular multimodal dataset is also built for the experiments, covering stereo disparity maps and 6-DOF pose annotations across a range of illumination scenarios, attitude configurations, and noise levels. Experimental results show that TSCA-Stereo outperforms the baseline across every evaluated metric on this space-specific dataset. The full pose estimation pipeline achieves a mean translation error of 0.0419 m and a mean orientation error of 0.8632° under varied imaging conditions, confirming that the passive stereo approach is both effective and resilient when operating under the demanding visual conditions of the space environment.

[CV-288] S2FT: Parameter-Efficient Fine-Tuning in Sparse Spectrum Domain CVPR2026

【速读】:该论文旨在解决现有参数高效微调(Parameter Efficient Fine-Tuning, PEFT)方法中,基于傅里叶变换的策略因假设权重变化(weight change)在频域稀疏而存在局限性的问题。研究发现,实际权重变化的频谱并非稀疏,而是呈现幂律均匀分布(power-uniform),导致仅微调少量频谱系数难以准确建模权重变化。解决方案的关键在于提出一种可逆变换(invertible transformation),将原始空间域中的非稀疏权重变化映射到一个具有稀疏频谱的潜在空间域,并在此空间中进行PEFT,该方法称为S2FT。其核心创新是通过预估计粗粒度权重变化并利用局部平滑结构先验,以最近邻搜索方式实现行与列的重排操作,从而获得保持神经元结构信息的同时具备稀疏频谱的变换矩阵,显著提升了微调精度与效率,仅需0.08%的训练参数即可达到优越性能。

链接: https://arxiv.org/abs/2605.08589
作者: Baoquan Zhang,Zhehao Yu,Lisai Zhang,Kenghong Lin,Tianran Chen,Yuxi Sun,Yunming Ye,Yao He
机构: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); ShenZhen SiFar Co., Ltd. (深圳市思法科技有限公司); Bilibili. Inc (哔哩哔哩公司); Shenzhen University (深圳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Parameter Efficient Fine-Tuning (PEFT) is a key technique for adapting a large pretrained model to downstream tasks by fine-tuning only a small number of parameters. Recent methods based on Fourier transforms have further reduced the scale of fine-tuned parameters by tuning only a few spectral coefficients. Their basic assumption is that the weight change ΔW is a spatial-domain matrix with a sparse spectrum. However, in this paper, we observe that the spectrum of the weight change is not sparse, but instead distributed like power-uniform. This fact implies that fine-tuning only a few spectral coefficients is insufficient to accurately model a weight change with a uniform spectrum. To address this issue, we propose to seek an invertible transformation that can transform a latent spatial-domain matrix with a sparse spectrum into the weight change, and then perform PEFT in this sparse spectrum domain with few spectral coefficients, called S2FT. To seek such a transformation, we first pre-estimate a coarse weight change as a prior. Then, inspired by the observation that a sparse spectrum often corresponds to locally smooth spatial structures, we regard this transformation as a row and column rearrangement operation on the pre-estimated weight change that smooths spatial structures while keeping the structural information of neurons. Finally, we propose to solve the rearrangement search problem in a simple nearest neighbor search manner, thereby obtaining the invertible transformation. Extensive results show our S2FT achieves superior performance while using only 0.08% of the training parameters.
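
A minimal sketch of the spectral-domain PEFT idea the paper builds on, under stated assumptions: only a handful of spectral coefficients are trainable, and a fixed random row/column permutation stands in for the learned nearest-neighbour rearrangement. This is an illustrative sketch, not the authors' implementation.

```python
import torch

# Hypothetical sketch of spectral-domain PEFT: only k spectral coefficients of the
# weight change are trainable; a fixed row/column permutation (random here, standing
# in for S2FT's learned rearrangement) maps the latent spatial-domain matrix with a
# sparse spectrum into the actual weight change.
d_out, d_in, k = 64, 64, 32
idx = torch.stack([torch.randint(0, d_out, (k,)), torch.randint(0, d_in, (k,))])
coeff = torch.nn.Parameter(torch.zeros(k))            # the only trainable PEFT parameters
row_perm = torch.randperm(d_out)                      # stands in for the learned rearrangement
col_perm = torch.randperm(d_in)

def delta_w():
    spec = torch.zeros(d_out, d_in, dtype=torch.complex64)
    spec[idx[0], idx[1]] = coeff.to(torch.complex64)  # sparse spectrum
    dense = torch.fft.ifft2(spec).real                # latent spatial-domain matrix
    return dense[row_perm][:, col_perm]               # rearranged into the weight change

W0 = torch.randn(d_out, d_in)                         # frozen pretrained weight
x = torch.randn(8, d_in)
y = x @ (W0 + delta_w()).T                            # adapted forward pass
```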

[CV-289] PromptDx: Differentiable Prompt Tuning for Multimodal In-Context Alzheimer's Diagnosis

【速读】:该论文旨在解决当前深度学习模型在医学影像诊断中因依赖固定参数记忆而导致的临床实践脱节问题,尤其是无法像医生一样通过类比推理(analogical reasoning)引用过往相似病例进行诊断的局限性。现有基于上下文学习(In-Context Learning, ICL)的方法如TabPFN虽提供了一种“诊断-by-reference”范式,但其设计局限于表格数据,且依赖非可微预处理流程,在处理异构多模态数据时存在流形不匹配和梯度断裂问题。解决方案的关键在于提出PromptDx框架,其核心创新是Differentiable Prompt Tuning (DPT)机制——通过训练一个轻量级适配器作为预训练ICL引擎非可微预处理器的可微替代品,实现多模态提示(prompt)在ICL范式下的端到端优化,从而无缝整合3D MRI与表型生物标志物等多模态信息,并显著提升数据效率(仅用1%上下文样本即优于标准ICL使用30%样本的效果)。

链接: https://arxiv.org/abs/2605.08585
作者: Lujia Zhong,Yihao Xia,Shuo Huang,Jianwei Zhang,Yonggang Shi
机构: 1. University of California, San Diego (加州大学圣地亚哥分校); 2. The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); 3. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep learning models in medical imaging typically operate as parametric memory, diagnosing patients by recalling fixed knowledge learned during training. This contrasts sharply with clinical practice, where physicians employ analogical reasoning to diagnose new cases by referencing similar records from past exemplars. While In-Context Learning (ICL) frameworks such as Tabular Prior-Fitted Networks (TabPFN) offer a promising diagnosis-by-reference paradigm, they are designed with tabular-specific inductive priors and rely on non-differentiable preprocessing pipelines, leading to manifold mismatch and gradient fracture when applied to heterogeneous multimodal data. To address these limitations, we propose PromptDx, a novel diagnosis-by-reference framework that leverages a pre-trained TabPFN as an ICL engine while enabling seamless integration with multimodal representations. Our core contribution is a Differentiable Prompt Tuning (DPT) mechanism that aligns a Masked Multimodal Modeling module with the pre-trained ICL engine. By training a lightweight adapter as a differentiable surrogate for the engine’s non-differentiable preprocessors, we enable an end-to-end optimization of multimodal prompts within the ICL paradigm. We validate our method on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset using 3D MRI and tabular biomarkers. Experiments demonstrate that our approach outperforms traditional parametric baselines. Notably, our method achieves superior performance using only 1% context samples compared to 30% in standard ICL, demonstrating exceptional manifold condensation ability. We further validate the generalizability of our DPT framework across six tabular datasets with diverse scales. Overall, our method offers a more data-efficient and clinically aligned paradigm for Alzheimer’s Disease diagnosis.

[CV-290] Improving Generative Adversarial Networks with Self-Distillation

【速读】:该论文旨在解决传统生成对抗网络(GAN)中训练过程不稳定、存在寄生循环行为(parasitic cycling)的问题,以及未充分利用指数移动平均(EMA)生成器在训练阶段的潜在价值。其解决方案的关键在于提出自蒸馏生成对抗网络(Self-Distilled GAN, SD-GAN),将EMA生成器作为教师模型,通过感知损失(perceptual loss)指导活跃训练的生成器(学生模型),从而提升图像质量、稳定优化轨迹,并提供与传统对抗损失非线性相关的额外学习信号。

链接: https://arxiv.org/abs/2605.08577
作者: Antoni Nowinowski,Krzysztof Krawiec
机构: Poznan University of Technology (波兹南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In modern GANs, maintaining an Exponential Moving Average (EMA) of the generator’s weights is a standard practice, as such an averaged model consistently outperforms the actively trained generator. However, the EMA generator is used for final deployment only and does not influence the training process. To address this missed opportunity, we introduce Self-Distilled GAN (SD-GAN) that employs the EMA generator as a teacher to guide the active generator (student) via a perceptual loss. We prove the local asymptotic stability of SD-GAN in the Dirac-GAN setting and show that it dampens the parasitic cycling behavior that plagues conventional GANs. Empirical evaluations across established architectures and datasets demonstrate that SD-GAN improves the final image quality on several metrics (FID and random-FID in particular), stabilizes the optimization trajectory and provides additional learning guidance that is not trivially correlated with the conventional adversarial loss. It also proves effective for fine-tuning pretrained GAN models.
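
A minimal sketch of the self-distillation idea with placeholder networks: an EMA copy of the generator serves as the teacher, and a perceptual-style MSE between teacher and student features is added to a placeholder adversarial term. The architectures, feature extractor, and loss coefficient are illustrative assumptions, not the paper's setup.

```python
import copy
import torch
import torch.nn.functional as F

# Toy generator and feature extractor; stand-ins for the real architectures.
G = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.Tanh(), torch.nn.Linear(64, 32))
G_ema = copy.deepcopy(G).requires_grad_(False)        # EMA teacher, frozen for gradients
feat = torch.nn.Linear(32, 8)                         # stand-in for a perceptual feature network

def ema_update(decay=0.999):
    with torch.no_grad():
        for p, p_ema in zip(G.parameters(), G_ema.parameters()):
            p_ema.mul_(decay).add_(p, alpha=1 - decay)

z = torch.randn(4, 16)
adv_loss = -torch.mean(G(z))                          # placeholder adversarial term
distill = F.mse_loss(feat(G(z)), feat(G_ema(z)))      # teacher-student perceptual loss
loss = adv_loss + 0.1 * distill
loss.backward()
ema_update()
```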

[CV-291] Post-hoc Selective Classification for Reliable Synthetic Image Detection

【速读】:该论文旨在解决深度神经网络-based合成图像检测器(Synthetic Image Detectors, SIDs)在部署阶段因常见协变量偏移(covariate shifts)导致可靠性下降的问题,即其在分布内表现良好但在分布外时检测准确率显著降低。为缓解这一风险,作者采用选择性分类(Selective Classification, SC)策略,使SIDs能够在置信度低时拒绝预测。解决方案的关键在于提出一个名为ReSIDe的框架:首先从中心点匹配(centroid matching)视角将logits概念推广至SIDs的任意中间层,从而扩展了基于logits的置信度评分函数(Confidence Score Functions, CSFs)的应用范围;其次设计了一种偏好优化算法,通过最小化风险-覆盖率曲线下面积(AURC)的上界,聚合不同层提取的置信度得分以获得最终估计,显著提升了SC性能,在多种协变量偏移场景下实现最高达69.55%的AURC降低。

链接: https://arxiv.org/abs/2605.08574
作者: Kaixiang Zheng,Jacob H. Seidman
机构: University of Waterloo (滑铁卢大学); Reality Defender (Reality Defender)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As synthetic images become increasingly realistic, reliable synthetic image detection techniques are of pressing need to prevent their misuse. Despite satisfactory in-distribution performance, deep neural network-based synthetic image detectors (SIDs) lack reliability in deployment and often fail in the presence of common covariate shifts, resulting in poor detection accuracy. To avoid the risk caused by potential errors, we adopt a selective classification (SC) strategy by allowing SIDs to abstain from making low confidence predictions. For practicality, we focus on post-hoc methods which perform confidence estimation on a given SID without retraining. However, we show that conventional logit-based confidence score functions (CSFs) exhibit pathological behavior under covariate shifts, leading to SC performance close to or even worse than random guessing. To address this, we propose a simple yet effective SC framework for Reliable Synthetic Image Detection (ReSIDe). First, we generalize the notion of logits to an SID’s intermediate layers from a centroid matching perspective, extending the use of logit-based CSFs to any layer of an SID. Then, we introduce a preference optimization algorithm that aggregates confidence scores extracted from different layers to a final confidence estimate by minimizing an upper bound of the area under the risk-coverage curve (AURC). Extensive experimental results show that ReSIDe significantly boosts the SC performance of various logit-based CSFs under common covariate shifts, achieving up to 69.55% AURC reduction.
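
A minimal sketch of the selective-classification setup, under assumptions: "logits" at one intermediate layer are taken as negative distances to per-class feature centroids (centroid matching), max-softmax serves as the logit-based confidence score, and low-confidence inputs are abstained from. The multi-layer aggregation and AURC-bound optimization of ReSIDe are not reproduced here.

```python
import torch
import torch.nn.functional as F

feats = torch.randn(100, 128)                      # intermediate features of 100 inputs
centroids = torch.randn(2, 128)                    # class centroids (real vs. synthetic)
logits = -torch.cdist(feats, centroids)            # centroid-matching "logits"
conf, pred = F.softmax(logits, dim=1).max(dim=1)   # confidence score and prediction

threshold = 0.7
accept = conf >= threshold                         # abstain on low-confidence inputs
print(f"coverage: {accept.float().mean():.2f}, predictions kept: {pred[accept].shape[0]}")
```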

[CV-292] Enhancing Consistency Models for Multi-Agent Trajectory Prediction

【速读】:该论文旨在解决扩散模型在多智能体轨迹预测中因迭代去噪导致的推理延迟问题,这一瓶颈限制了其在自动驾驶等时序敏感场景中的应用。现有快速采样方法(如DDIM和基于先验噪声分布的方法)虽部分缓解了延迟问题,但难以实现真正的单步生成或受限于特定噪声分布。论文提出ECTraj框架,其核心创新在于改进的一致性模型(Consistency Models, CMs)训练机制:通过引入学生-教师一致性训练范式,其中学生生成标准输出,而教师将预测结果与真实轨迹的部分信息融合以提供更强监督信号;同时利用CMs直接去噪能力,在训练阶段实现top-K多样本生成,结合条件生成策略显著提升推理速度与预测精度,在Argoverse 2大规模数据集上建立了新的性能基准。

链接: https://arxiv.org/abs/2605.08572
作者: Alen Mrdovic,Qingze (Tony) Liu,Danrui Li,Mathew Schwartz,Kaidong Hu,Sejong Yoon,Mubbasir Kapadia,Vladimir Pavlovic
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models for multi-agent trajectory prediction are limited by iterative denoising, which causes inference latency that hinders their use in time-critical settings like autonomous driving. Fast-sampling variants using DDIM and informed initial noise distributions partially alleviate this issue, but they either fail to achieve true single-step generation or are constrained by the chosen noise distribution. Consistency Models (CMs) offer high-quality one-step generation by mapping noise directly to data, but are difficult to train from scratch. We propose ECTraj, an enhanced CM pipeline with improved training and conditional generation for trajectory prediction. Our framework extends the student-teacher consistency training scheme: the student produces standard outputs, while the teacher explicitly fuses its predictions with parts of the ground truth to give stronger supervision. We also exploit CMs’ direct denoising for top-K multi-shot generation during training. Combining conditional generation with this enhanced consistency objective yields faster inference and improved prediction accuracy, establishing competitive new benchmarks on the large-scale Argoverse 2 dataset.

[CV-293] ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models

【速读】:该论文旨在解决当前行动条件世界模型(Action-conditioned World Models, ACWMs)在物理交互多样性与泛化能力方面的局限性问题,即现有基准测试主要局限于第一人称导航或特定任务的机器人数据集,难以全面评估模型对复杂物理动态的理解能力。其解决方案的关键在于构建一个名为ACWM-Phys的新基准,该基准基于可控且干净的仿真环境,涵盖刚体动力学、运动学、可变形物体交互和粒子动力学等多类物理场景,并设计了分布内(in-distribution)与分布外(out-of-distribution)两种评估协议,以系统评估模型在不同物理模式和场景配置下的插值与泛化性能。通过这一可控平台,研究者能够实现精确的数据采集、可复现的评估以及对模型物理建模能力的深入分析,从而揭示当前模型仍依赖视觉外观而非深层物理规律的本质瓶颈。

链接: https://arxiv.org/abs/2605.08567
作者: Haotian Xue,Yipu Chen,Liqian Ma,Zelin Zhao,Lama Moukheiber,Yuchen Zhu,Yongxin Che
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Action-conditioned world models (ACWMs) have shown strong promise for video prediction and decision-making. However, existing benchmarks are largely restricted to egocentric navigation or narrow, task-specific robotics datasets, offering only limited coverage of the rich physical interactions required for generalized world understanding. We introduce ACWM-Phys, a new benchmark for evaluating action-conditioned prediction under diverse physical dynamics in a clean, controllable simulation environment with a carefully designed action space. ACWM-Phys contains training and evaluation data spanning rigid-body dynamics, kinematics, deformable-object interactions, and particle dynamics. To evaluate both interpolation and generalization, we design in-distribution and out-of-distribution protocols with controlled shifts in interaction patterns or scene configurations. By building the benchmark in a fully controllable simulator, ACWM-Phys enables precise data collection, reproducible evaluation, and systematic analysis of model capabilities for physically grounded world modeling. Through systematic experiments on ACWM-DiT, we find that OoD generalization depends not only on the physical regime but also on effective task complexity: models generalize well on visually simple, low-dimensional interactions with clear geometric structure, but suffer larger drops on deformable contacts, high-dimensional control, and complex articulated motion. This suggests that the model still relies heavily on visual appearance patterns instead of fully learning the underlying physics. Ablations show that cross-attention improves high-dimensional action conditioning, causal VAEs outperform frame-wise encoders, and larger action spaces are harder to model but can improve generalization by providing richer control signals. These findings guide the design of physically grounded world models.

[CV-294] MicroDiffuse3D: A Foundation Model for 3D Microscopy Imaging Restoration

【速读】:该论文旨在解决三维化学成像(3D chemical imaging)中因数据采集速度慢而导致的广泛应用受限问题,尤其是在高通量成像与低信噪比(low signal-to-noise ratio, SNR)条件下难以获取高质量体积结构的问题。其解决方案的关键在于提出了一种预训练的基础模型 MicroDiffuse3D,该模型能够从低分辨率、稀疏或噪声严重的三维测量数据中恢复出高质量的体素图像,从而显著提升成像速度和图像质量。在三种挑战性恢复场景下(包括16倍体积稀疏下的3D超分辨、分辨率与噪声联合退化以及低SNR下的3D去噪),MicroDiffuse3D均展现出优于强基线方法的性能,例如在稀疏3D超分辨设置中,深度方向连续性更清晰、伪影更少,并使分割精度提升10.58%,线轮廓一致性提高15.59%。

链接: https://arxiv.org/abs/2605.08566
作者: Yongkang Li,Brian Wong,King Wai Chiu,Hanwen Xu,Tangqi Fang,Erin Dunnington,Dan Fu,Sheng Wang
机构: University of Washington (华盛顿大学); Paul G. Allen School of Computer Science and Engineering (保罗·G·艾伦计算机科学与工程学院); Department of Chemistry (化学系)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Chemical imaging enables label-free visualization of cells, tissues and living systems while providing direct biochemical information that is difficult to obtain with conventional fluorescence microscopy. Despite its promise in applications ranging from intraoperative diagnosis to drug-response analysis, its broader use remains limited by slow data acquisition, particularly for three-dimensional imaging. Here we present MicroDiffuse3D, a pretrained foundation model for 3D microscopy image restoration that recovers high-quality volumetric structure from degraded low-resolution measurements acquired at substantially higher throughput. We evaluated MicroDiffuse3D across three challenging restoration settings, including 3D super-resolution under 16-fold volumetric sparsity, joint degradation in resolution and noise, and 3D denoising in the low signal-to-noise ratio (SNR) regime, where the model delivered clear gains over strong baselines. Under the sparse 3D super-resolution setting, MicroDiffuse3D produced clearer continuity across depth with fewer artifacts and improved segmentation quality by 10.58% and line-profile concordance by 15.59%. Together, our results establish pretrained 3D restoration as a broadly applicable strategy for overcoming the throughput and SNR limitations in volumetric chemical imaging, enabling high-resolution analysis at scales and speeds that were previously difficult to achieve.

[CV-295] Biological Plausibility and Representational Alignment of Feedback Alignment in Convolutional Networks

【速读】:该论文旨在解决反馈对齐(Feedback Alignment, FA)算法在卷积神经网络(Convolutional Neural Networks, CNNs)中难以扩展的问题,同时保持其生物合理性。传统FA虽在前馈网络中表现良好,但在CNN架构中性能显著下降,且已有改进方案常以牺牲生物可实现性为代价。论文提出通过对比分析五种学习算法(包括改进的FA与标准反向传播BP),在CIFAR-10数据集上评估其在生物合理性、可解释性和计算复杂度三方面的表现。关键发现是:改进的FA算法虽采用与BP截然不同的权重更新机制,仍能收敛到与BP相似的内部表征结构,其功能有效性可能源于对BP表征几何结构的模仿,从而实现了在不依赖精确梯度传递的前提下,获得类似BP的表示学习效果。

链接: https://arxiv.org/abs/2605.08564
作者: Jake Lance,Larry Kieu
机构: University of Toronto (多伦多大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The feedback alignment (FA) algorithm offers a biologically plausible alternative to backpropagation (BP) for training neural networks yet notably fails to scale to convolutional architectures. Modifications have been proposed to address this limitation, but at questionable cost to biological plausibility. In this paper, we evaluate five learning algorithms including modified FA and standard BP, applied to the same convolutional architecture with the CIFAR-10 dataset. We provide a tripartite comparative analysis focusing on biological plausibility, interpretability, and computational complexity. Our results indicate that modified FA algorithms converge on internal representations that are structurally similar to those produced by backpropagation. In particular, it appears the functional success of modified FA algorithms may be rooted in their ability to mimic the representational geometry of backpropagation, converging on similar representations despite relying on fundamentally different weight update mechanisms.
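
A minimal sketch of the vanilla feedback-alignment update on a toy two-layer network: the backward pass routes the output error through a fixed random matrix B instead of the transposed forward weights, so no weight transport is required. The convolutional variants compared in the paper are not shown.

```python
import torch

torch.manual_seed(0)
x, y = torch.randn(32, 20), torch.randn(32, 5)
W1, W2 = torch.randn(20, 50) * 0.1, torch.randn(50, 5) * 0.1
B = torch.randn(5, 50) * 0.1                    # fixed random feedback matrix
lr = 0.01

for _ in range(100):
    h = torch.tanh(x @ W1)                      # forward pass
    y_hat = h @ W2
    e = y_hat - y                               # output error
    dW2 = h.T @ e
    dh = (e @ B) * (1 - h ** 2)                 # error routed through B, not W2.T
    dW1 = x.T @ dh
    W1 -= lr * dW1
    W2 -= lr * dW2
```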

[CV-296] ZAYA1-VL-8B Technical Report

【速读】:该论文旨在解决小型多模态模型在图像理解、推理和计数等任务上性能不足的问题,尤其是在参数规模受限的情况下难以媲美大型基础模型(base models)的挑战。解决方案的关键在于两个创新:一是将视觉特定的低秩适配器(LoRA adapters)嵌入语言模型(LLM)中,从而在不增加专家数量的前提下提升模态特异性能力;二是引入图像标记(image tokens)在LLM内部的双向注意力机制,增强对视觉信息的理解深度。通过上述设计,ZAYA1-VL-8B 在仅9.2B总参数量下实现了与更大模型相当甚至更优的性能表现。

链接: https://arxiv.org/abs/2605.08560
作者: Hassan Shapourian,Kasra Hejazi,Olabode M. Sule,Beren Millidge
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 20 pages, 7 figures, 3 appendices (with 31 figures)

点击查看摘要

Abstract:We present ZAYA1-VL-8B, a compact mixture-of-experts vision-language model built upon our in-house language model, ZAYA1-8B. Despite its compact size, ZAYA1-VL achieves performance competitive with leading base models such as Molmo2-4B and InternVL3.5-4B, while surpassing models including Qwen2.5-VL-3B, PLM-3B, and MolmoE-1B across a range of image understanding, reasoning, and counting benchmarks. The architecture incorporates two key innovations: (1) vision-specific LoRA adapters integrated into the LLM to increase modality-specific capacity without increasing the number of experts, and (2) bidirectional attention over image tokens within the LLM to enhance visual understanding. We detail the full training pipeline including data composition at each stage, sequence packing, and the attention masking scheme. The model comprises 9.2B total parameters, with 1.4B active parameters including the vision encoder, and is publicly available at this https URL.
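
A minimal sketch of the masking idea, with an illustrative token layout: image tokens attend to one another bidirectionally while the remaining tokens keep a causal mask. Sizes and ordering are assumptions, not the actual ZAYA1-VL configuration.

```python
import torch

n_img, n_txt = 4, 6
n = n_img + n_txt
mask = torch.tril(torch.ones(n, n, dtype=torch.bool))   # causal default for all tokens
mask[:n_img, :n_img] = True                              # image block: fully bidirectional
print(mask.int())                                        # True = attention allowed
```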

[CV-297] MC-RFM: Geometry-Aware Few-Shot Adaptation via Mixed-Curvature Riemannian Flow Matching NEURIPS

【速读】:该论文旨在解决预训练视觉模型在少样本适应(few-shot adaptation)过程中,现有参数高效方法通常将适应视为对冻结特征的离散欧几里得扰动,而未能显式建模任务诱导的特征位移几何结构的问题。解决方案的关键在于提出一种混合曲率黎曼流匹配框架(Mixed-curvature Riemannian Flow-Matching, MC-RFM),其核心思想是将适配后的特征表示在由双曲因子(捕获层次敏感语义结构)和欧几里得因子(保留局部判别性视觉变化)构成的产品流形上进行建模,并通过任务条件化的连续传输过程,从冻结特征映射到支持集原型,训练目标为流匹配损失并耦合混合原型-线性分类器,从而实现轻量、骨干无关且完全基于缓存冻结特征的少样本适应。

链接: https://arxiv.org/abs/2605.08557
作者: Salim Khazem,Ibrahim Mohamed Serouis,Zakaria Ezzahed
机构: Talan(塔兰)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submitted to NeurIPS (Under Review)

点击查看摘要

Abstract:Parameter-efficient adaptation of pretrained vision models is commonly performed through linear probes, prompts, low-rank updates, or lightweight residual modules. While effective, these methods usually treat adaptation as a discrete Euclidean perturbation of frozen representations, without explicitly modeling the geometry of the task-induced feature displacement. We propose MC-RFM, a mixed-curvature Riemannian flow-matching framework for few-shot adaptation of frozen visual backbones. The key idea is to represent adapted features on a product manifold combining a hyperbolic factor, which captures hierarchy-sensitive semantic structure, and a Euclidean factor, which preserves locally discriminative visual variation. Adaptation is formulated as a task-conditioned continuous transport from frozen features to support-set prototypes, trained with a flow-matching objective and coupled to a hybrid prototype-linear classifier. The method is lightweight, backbone-agnostic, and operates entirely on cached frozen features. Across seven visual recognition benchmarks, five frozen backbones, and 1/4/16-shot regimes, MC-RFM is the best-performing method in a majority of evaluated settings, with the strongest gains on Transformer backbones and fine-grained datasets. Ablations show that the mixed-curvature head, task conditioning, adaptive branch gating, prototype shrinkage, and discriminative supervision each contribute to performance. These results suggest that few-shot adaptation benefits not only from deciding which parameters to update, but also from modeling how representations should move through a geometry matched to the structure of the downstream task.

[CV-298] A Two-Stage Motion-Aware Framework for mmWave-based Human Mesh Recovery

【速读】:该论文旨在解决从毫米波(mmWave)雷达观测中恢复精确三维人体网格的难题,该问题主要受限于信号杂波严重以及雷达测量本质上具有局部性。现有方法通常采用端到端框架直接从原始雷达数据回归人体参数,缺乏对信号解释与几何推理的解耦,也未充分利用时序运动信息,从而限制了性能提升。解决方案的关键在于提出一个两阶段框架:第一阶段设计了一个人体反射提取模块,通过粗粒度到细粒度的定位与体素级分割,生成带有置信度加权的雷达体积,编码每个体素的人体存在概率;第二阶段构建了一个运动感知的网格恢复网络,利用双分支结构联合建模帧内几何特征与帧间动态变化,实现更准确且鲁棒的三维人体重建。

链接: https://arxiv.org/abs/2605.08530
作者: Hoang Hai Pham,Shuntian Zheng,Jiaqi Li,Yu Guan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Millimeter-wave (mmWave) radar has emerged as a promising sensing modality for human perception due to its robustness under challenging environmental conditions and strong privacy-preserving properties. However, recovering accurate 3D human body meshes from radar observations remains difficult due to severe signal clutter and the inherently partial nature of radar measurements. Previous works typically adopt end-to-end frameworks that directly regress human body parameters from raw radar data, without decoupling signal interpretation from geometric reasoning or exploiting temporal motion cues, limiting learning performance. To address this, we propose a two-stage framework for radar-based human body reconstruction. First, we introduce a human reflection extraction module that performs coarse-to-fine localization and voxel-wise segmentation to produce a confidence-weighted radar volume encoding voxel-level human likelihood. Second, we design a motion-aware mesh recovery network that reconstructs the human body by jointly modeling per-frame geometry and inter-frame dynamics using a dual-branch architecture. Extensive experiments demonstrate that the proposed method outperforms existing approaches while maintaining computational efficiency.

[CV-299] Geometric Flood Depth Estimation: Fusing Transformer-Based Segmentation with Digital Elevation Models

【速读】:该论文旨在解决灾后情境感知中洪水深度估算的难题,传统2D语义分割方法虽能精确识别淹没区域,但缺乏垂直维度信息,难以评估通行可行性与结构风险。解决方案的关键在于构建一种基于几何的“水体表面高程”(Water Surface Elevation)方法,通过将Mask2Former模型生成的高精度2D洪水掩膜与数字高程模型(Digital Elevation Model, DEM)融合,识别水陆边界并计算全局水体表面高程(Z_water),进而依据局部流体静力学平衡原理推算每个像素点的洪水深度,从而从单目航空影像中高效提取三维洪水体积信息,避免了水动力学模拟带来的延迟。

链接: https://arxiv.org/abs/2605.08521
作者: Nhut Le,Ehsan Karimi,Maryam Rahnemoonfar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by the 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)

点击查看摘要

Abstract:Post-disaster situational awareness relies heavily on understanding both the extent and the volume of floodwaters. While 2D semantic segmentation provides accurate flood masking, it lacks the vertical dimension required to assess navigability and structural risk. This paper presents a geometric “Water Surface Elevation” approach for estimating flood depth from monocular aerial imagery. Our pipeline utilizes Mask2Former, a state-of-the-art transformer-based segmentation model, to generate precise 2D flood masks. These masks are fused with Digital Elevation Models (DEMs) to identify the water-land boundary, calculate a global water surface elevation Z_water, and compute per-pixel depth based on the principle of local hydrostatic equilibrium. We evaluate this workflow using the FloodNet and CRASAR-U-DROIDS datasets, demonstrating how high-performance segmentation can be leveraged to extract 3D volumetric data from 2D imagery without the latency of hydrodynamic simulations.
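
A minimal sketch of the geometric depth rule on synthetic arrays: the water surface elevation is estimated from DEM values along a crudely approximated water-land boundary, and per-pixel depth is the clipped difference between that elevation and the ground elevation inside the flood mask. The arrays and boundary heuristic here are stand-ins, not the paper's pipeline.

```python
import numpy as np

dem = np.random.uniform(10.0, 20.0, size=(64, 64))        # ground elevation (m)
flood_mask = dem < 14.0                                    # stand-in for the Mask2Former mask
boundary = flood_mask ^ np.roll(flood_mask, 1, axis=0)     # crude water-land boundary pixels
z_water = dem[boundary].mean()                             # global water surface elevation
depth = np.where(flood_mask, np.clip(z_water - dem, 0.0, None), 0.0)
print(f"Z_water = {z_water:.2f} m, max depth = {depth.max():.2f} m")
```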

[CV-300] A Deep Risk Estimator for Known Operator Learning

【速读】:该论文旨在解决深度神经网络中混合已知算子与可学习算子时的统计风险估计问题,特别是如何量化每一层对整体泛化误差的贡献。其解决方案的关键在于提出一种分层风险估计器(deep risk estimator),该估计器将总风险分解为各层的贡献之和:已知算子层不增加风险,而可学习层则包含两个组成部分——一个受Barron定理启发的近似项和一个随训练样本数增加而减小的估计项。通过该分解,作者证明了用已知算子替代可学习层可显著降低风险,并且所需样本量与被替换层的可训练参数数量呈正比关系。这一理论框架可用于指导结构设计(如CT重建中的滤波反投影网络)并预测达到目标误差所需的最小训练样本数。

链接: https://arxiv.org/abs/2605.08517
作者: Andreas Maier,Md Hasan,Paulina Conrad,Paula Andrea Perez-Toro
机构: Friedrich-Alexander-Universität Erlangen-Nürnberg(埃尔朗根-纽伦堡弗里德里希-亚历山大大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注: In Review

点击查看摘要

Abstract:We describe an approach for estimating the statistical risk of deep networks that contain a mix of learned and known operators. Building on the maximal training error bounds previously established for known operator learning, we derive a deep risk estimator that connects the expected error of a layered network to the size of the training sample. The estimator decomposes the total risk into a sum over learned layers; every known operator contributes zero to this sum, while every learned layer adds an approximation term inspired by Barron’s classic work and an estimation term that decreases with the number of training samples. We are able to show that the bound shrinks whenever a learned layer is replaced by a known operator and that the corresponding sample requirement scales with the number of trainable parameters of the layer that is replaced. As an application, we use computed tomography as an example and compare an operator-aware filtered backprojection network with a fully connected substitute that collapses the entire reconstruction pipeline into a single learned dense matrix. The predicted parameter ratio coincides with the structural sparsity that the analytic decomposition into a circulant filter and a sparse backprojection exposes. We confirm the predicted scaling on CPU at small image scale and on GPU at medium image scale, all on the same scaling law. Beyond CT reconstruction, the estimator applies to physics-informed neural networks that hardcode a known physical operation in its architecture, and we expect the result to be of interest for a broad community working on operator-aware deep learning. Calibrating the per-layer constants on each sweep yields a bound that tracks the empirical test MSE within a factor of two at every training-set size, so the estimator can be inverted to predict how many training samples are required to reach a target error.
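
A schematic form of the layer-wise decomposition described above, written only to make its structure concrete: known-operator layers contribute nothing, while each learned layer contributes a Barron-style approximation term plus an estimation term that shrinks with the sample size n. The specific constants and the square-root rate below are assumptions for illustration, not taken from the paper.

```latex
R(f) \;\lesssim\; \sum_{\ell \in \mathcal{L}_{\mathrm{learned}}}
\Bigl(
  \underbrace{\epsilon_{\mathrm{approx}}(\ell)}_{\text{Barron-style approximation term}}
  \;+\;
  \underbrace{C_\ell \sqrt{p_\ell / n}}_{\text{estimation term, } p_\ell \text{ trainable parameters}}
\Bigr),
\qquad
\text{known-operator layers contribute } 0.
```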

[CV-301] CapCLIP: A Vision-Language Representation Alignment Approach for Wireless Capsule Endoscopy Analysis

【速读】:该论文旨在解决无线胶囊内镜(Wireless Capsule Endoscopy, WCE)在临床应用中面临的两大挑战:一是每例检查产生的图像帧数量庞大,导致人工阅片效率低下;二是由于成像条件高度变化,识别细微病灶具有较大难度。现有基于学习的方法多为纯视觉模型,通常局限于特定病理类别且跨数据集和医疗机构的迁移能力有限。解决方案的关键在于提出CapCLIP——一个面向WCE领域的视觉-语言表征学习框架,通过将胶囊内镜图像与基于标准化命名法和病理感知描述模板生成的文本描述对齐,学习语义丰富且可迁移的嵌入表示。该方法显著提升了零样本场景下图像-文本分类及跨模态检索性能,尤其在分布外数据上表现突出,验证了语言引导的表征学习有助于增强WCE分析的泛化能力和语义可解释性。

链接: https://arxiv.org/abs/2605.08493
作者: Haroon Wahab,Irfan Mehmood,Hassan Ugail
机构: University of Bradford (布拉德福德大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Wireless capsule endoscopy (WCE) enables non-invasive visual assessment of the small bowel, but its clinical utility is constrained by the large volume of frames generated per examination and the difficulty of recognising subtle abnormalities under highly variable imaging conditions. Existing learning-based approaches for WCE are predominantly vision-only, often confined to narrow pathology sets, and show limited transfer across datasets and centres. To address these limitations, this study introduces CapCLIP, a domain-specific vision-language representation learning framework for WCE. CapCLIP aligns capsule endoscopy frames with clinically grounded textual descriptions derived from standardised nomenclature and pathology-aware caption templates, thereby learning embeddings that are both semantically informed and transferable. The proposed framework is evaluated against relevant open-source vision and vision-language foundation models under strict zero-shot conditions using unseen WCE datasets. Evaluation covers three downstream tasks: K-nearest neighbour classification, CLIP-style image-text classification, and text-to-image retrieval. Across these settings, CapCLIP consistently outperforms the compared baselines, with particularly strong gains in zero-shot image-text classification and cross-modal retrieval on out-of-distribution datasets. The results indicate that language-guided representation learning can improve both generalisation and semantic interpretability in WCE analysis. These findings position CapCLIP as a step toward foundation models tailored to capsule endoscopy and support the use of language-grounded WCE analysis.
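
A minimal sketch of the CLIP-style alignment objective assumed to underlie CapCLIP: frame and caption embeddings are L2-normalised and trained with a symmetric in-batch InfoNCE loss. The encoders and caption templates are placeholders.

```python
import torch
import torch.nn.functional as F

img_emb = F.normalize(torch.randn(16, 512), dim=-1)   # capsule-endoscopy frame embeddings
txt_emb = F.normalize(torch.randn(16, 512), dim=-1)   # pathology-aware caption embeddings
temperature = 0.07

logits = img_emb @ txt_emb.T / temperature             # pairwise similarities
targets = torch.arange(16)                             # matching pairs lie on the diagonal
loss = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```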

[CV-302] NICE FACT: Diagnosing and Calibrating VLMs in Quantitative Reasoning for Kinematic Physics

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在物理推理任务中表现不佳的问题,特别是其对物理世界感知的准确性与模型置信度可靠性不足的缺陷。现有VLMs往往无法正确识别视觉前提条件或应用必要的物理定律来得出结论,且缺乏对模型是否真正理解物理规律而非仅凭猜测的科学评估。解决方案的关键在于提出NICE和FACT双诊断范式:FACT用于分解运动学物理中的定量推理,诊断视觉保真度、物理定律理解能力与时间定位准确性;NICE则引入邻域感知校准方法及新指标,以评估和校准模型置信度的可靠性。这一框架为开发具备物理 grounded 性质、可信推理能力的VLM提供了标准化诊断路径。

链接: https://arxiv.org/abs/2605.08452
作者: Jian Lan,Zhicheng Liu,Xinpeng Wang,Yuhao Zhou,Haokun Chen,Jiancheng Lv,Barbara Plank,Thomas Seidl
机构: University of Munich (LMU), Germany; Munich Center of Machine Learning; Sichuan University; Meta
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The ability to derive precise spatial and physical insights is a cornerstone of vision-language models (VLMs), yet their poor performance on related spatial-intelligence tasks such as physical reasoning remains a fundamental barrier. The community critically lacks a scientific analysis revealing whether VLMs faithfully reach answers or plausibly make guesses. This work aims to provide a fundamental understanding of how VLMs perceive the physical world and utilize physical laws, while assessing the reliability of model confidence. We propose NICE and FACT, a dual-diagnostic paradigm that explicitly decomposes quantitative reasoning for kinematic physics: FACT diagnoses visual fidelity, physical law comprehension, and temporal grounding. NICE studies our novel neighborhood-informed calibration method and novel metrics to evaluate and calibrate confidence reliability. Evaluated across six of the latest state-of-the-art VLMs, we uncover that models fail to identify visual preconditions or utilize necessary physical laws to reach answers. This work highlights and establishes a standardized diagnostic paradigm to guide the development of faithful, physically-grounded VLMs.

[CV-303] ARO: Temporal Adversarial Rectification Optimization Using Diffusion Models as Purifiers

【速读】:该论文旨在解决生成式 AI(Generative AI)中基于扩散模型的对抗净化(adversarial purification)在面对自适应攻击时难以兼顾语义保真度与鲁棒性的难题。现有方法通常依赖单一扩散噪声尺度或均匀处理时间步,忽略了粗粒度与细粒度去噪阶段的不同作用。其解决方案的关键在于提出时序对抗校正优化(Temporal Adversarial Rectification Optimization, TARO),该方法在推理阶段构建多视角去噪轨迹上的时序引导得分先验,形成从粗到细的残差目标:高噪声专家提供全局平滑结构以降低对抗敏感性,低噪声专家恢复图像特异性且类别相关的细节;通过引导强度参数调控这一时序修正过程,实现全局鲁棒校正与语义保真之间的平衡。实验证明,TARO在零样本设置下显著提升多种数据集和自适应威胁模型下的鲁棒准确率,并可与互补的对抗似然目标结合进一步增强鲁棒性。

链接: https://arxiv.org/abs/2605.08440
作者: Daniel Wesego,Pedram Rooshenas
机构: University of Illinois Chicago (芝加哥大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Adversarial purification with diffusion models seeks to project adversarial examples back toward the data manifold, but balancing semantic preservation and robustness against adaptive attacks remains challenging. Recent work shows that standard diffusion purification can fail under adaptive evaluation, while test-time score-based optimization is more resilient. Existing optimization defenses, however, typically rely on a single diffusion noise regime or treat timesteps uniformly, overlooking the distinct roles of coarse and fine denoising scales. We propose Temporal Adversarial Rectification Optimization (TARO), an inference-time purification method that builds a temporally guided score prior from multiple denoising views along the diffusion trajectory. TARO forms a coarse-to-fine residual target: high-noise experts provide globally smoothed structure with reduced adversarial sensitivity, while low-noise experts restore image-specific, class-relevant details. A guidance strength controls this temporal correction, allowing TARO to balance robust global rectification with semantic preservation. Empirically, TARO improves robust accuracy across datasets and adaptive threat models in a zero-shot setting, while remaining compatible with complementary adversarial-likelihood objectives for further robustness gains.

[CV-304] Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval

【速读】:该论文旨在解决视觉文档检索(Visual Document Retrieval, VDR)模型在匹配文档与查询时,因采用晚期交互架构(late interaction architecture)而忽视文档全局布局结构的问题。此类架构仅依赖局部patch嵌入进行相似性计算,导致在包含图文混排、表格与文本并存的异构布局文档中出现误匹配。解决方案的关键在于提出一种多模态编码器,通过引入可学习的全局布局嵌入(global layout embedding),增强局部patch表示,从而显式建模文档的整体结构信息;该布局嵌入由文本描述驱动训练,利用自然语言对文档布局特征的编码来实现无需改变推理流程的布局感知能力。实验表明,该方法在四个ViDoRe-v2数据集上相较最强基线ColPali/ColQwen在nDCG@5和MAP@5指标上分别提升2.4和2.3,且每项数据集均具有统计显著性优势。

链接: https://arxiv.org/abs/2605.08421
作者: Pascal Tilli,Mohsen Mesgar
机构: University of Stuttgart (斯图加特大学); Bosch Center for Artificial Intelligence (博世人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual Document Retrieval (VDR) models mostly rely on late interaction architectures, in which documents are represented by a set of local patch embeddings and then matched against query tokens. While efficient, this architecture prioritizes local similarity over the global layout structure of documents when estimating relevance between documents and queries. In practice, this leads to errors, as relevance often derives from the layout structure of documents with heterogeneous layouts that combine figures, tables, and text. We make document layout learnable without changing inference. We propose a multimodal encoder that augments local patch representations with a global layout embedding, trained via textual descriptions encoding document layout information. Across four ViDoRe-v2 datasets, our model improves over the strongest architecturally comparable ColPali/ColQwen baseline by +2.4 nDCG@5 and +2.3 MAP@5, with statistically significant per-dataset gains over ColQwen.
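
A minimal sketch of late-interaction (ColBERT/ColPali-style) MaxSim scoring with one extra global embedding appended to the document's patch tokens. Here the layout vector is random; in the paper it is learned from textual layout descriptions, which is not reproduced in this sketch.

```python
import torch
import torch.nn.functional as F

doc_patches = F.normalize(torch.randn(1024, 128), dim=-1)   # local patch embeddings
layout_vec = F.normalize(torch.randn(1, 128), dim=-1)       # global layout embedding (random stand-in)
doc_tokens = torch.cat([doc_patches, layout_vec], dim=0)    # augmented document representation
query_tokens = F.normalize(torch.randn(12, 128), dim=-1)    # query token embeddings

sim = query_tokens @ doc_tokens.T          # (num_query_tokens, num_doc_tokens)
score = sim.max(dim=1).values.sum()        # MaxSim: best-matching doc token per query token
print(float(score))
```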

[CV-305] SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding

【速读】:该论文旨在解决多视频跨视频推理(cross-video reasoning)能力在多模态大语言模型(Multimodal Large Language Models, MLLMs)中研究不足的问题,尤其针对现有基准依赖人工标注真实世界视频导致空间、时间与物理真值精度受限、难以诊断模型失效的局限性。其解决方案的关键在于构建一个可控的合成基准SYNCR,该基准基于Habitat、Kubric和CLEVRER仿真引擎生成8,163组带程序化验证的多视频问答对,覆盖9,650个独特视频,并通过八个任务系统评估MLLMs在时序对齐、空间追踪、比较推理和整体整合四个诊断维度上的表现,从而提供可精确量化、可解释性强的评测体系。

链接: https://arxiv.org/abs/2605.08412
作者: Sara Ghazanfari,Siddharth Garg,Prashanth Krishnamurthy,Farshad Khorrami
机构: New York University (纽约大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have made rapid progress in single-video understanding, yet their ability to reason across multiple independent video streams remains poorly understood. Existing multi-video benchmarks rely largely on human-annotated real-world footage, limiting the precision of spatial, temporal, and physical ground truth and making it difficult to diagnose model failures. We introduce SYNCR, a controlled synthetic benchmark for cross-video reasoning with programmatically verified grounding. Built using Habitat, Kubric, and CLEVRER simulator engines, SYNCR contains 8,163 multi-video question-answer pairs grounded in 9,650 unique videos. It evaluates MLLMs across eight tasks spanning four diagnostic pillars: Temporal Alignment, Spatial Tracking, Comparative Reasoning, and Holistic Synthesis. Our zero-shot evaluation of leading open- and closed-weight MLLMs reveals a substantial gap between current models and humans: the best model achieves only 52.5% average accuracy, compared to an 89.5% human baseline. Models perform relatively well on temporal ordering but struggle with precise physical and spatial reasoning, with the best model reaching only 26.0% accuracy on Kinematic Comparison. We further find that parameter scaling and reasoning-specialized post-training improve temporal alignment capabilities, but do not reliably address fine-grained physical tracking or global spatial synthesis. Finally, an exploratory sim-to-real correlation analysis suggests that several SYNCR tasks track model-level trends on real-world multi-video benchmarks, while also exposing reasoning capabilities underrepresented by existing evaluations. Code available at this https URL.

[CV-306] Exploring and Exploiting Stability in Latent Flow Matching ICML2026

【速读】:该论文旨在解决生成式模型在训练和推理过程中对计算资源消耗大、数据依赖性强以及效率低下的问题。其核心解决方案在于利用潜在空间流匹配(Latent Flow-Matching, LFM)模型固有的稳定性特性——即在相同噪声种子下对不同扰动(如数据缩减和模型容量缩小)仍能生成相似输出——来设计更高效的训练与推理策略。关键创新点包括:1)通过在显著减少的数据集上训练LFM模型,实现性能无明显下降,从而加速收敛并降低标注成本;2)提出一种轻量级到高容量的粗到精两阶段推理方法,利用模型架构缩放带来的稳定性,在保证生成质量的同时实现超过两倍的推理速度提升。

链接: https://arxiv.org/abs/2605.08398
作者: Rania Briq,Michael Kamp,Ohad Fried,Sarel Cohen,Stefan Kesselheim
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICML 2026

点击查看摘要

Abstract:In this work, we show that Latent Flow-Matching (LFM) models are robust to different types of perturbations, including data reduction and model capacity shrinkage. We characterize this stability by their tendency to generate similar outputs under identical noise seeds. We provide a perspective relating this phenomenon to flow matching theory, which indicates that this stability is inherent to the FM objective. We further exploit this stability to derive practical algorithms for more efficient training and inference. Concretely, first, we show that by training LFM models on significantly reduced datasets, the performance does not degrade perceptually or quantitatively. This yields multiple advantages, such as reducing training time by converging faster under limited compute budget, and alleviating annotation effort when training conditional models. Second, LFM stability under architectural shrinkage gives rise to a two-model coarse-to-fine approach, one using a light-weight architecture for the first phase of the FM trajectory, and one with higher capacity for the second, thereby reducing the inference cost substantially. To determine which samples are informative, we introduce three sample-scoring criteria and evaluate them under standard metrics for generative models. Our results are thoroughly evaluated on multiple datasets, demonstrating the practical advantage of this stability, including data saving and a more than two-fold inference speedup while generating comparable outputs.

[CV-307] Delivering Science as a Service: Sci-Orchestras Cloud-Native Approach to HPC

【速读】:该论文旨在解决现代计算环境中科研人员因基础设施管理、认证协议及容器部署等复杂任务而分散研究精力的问题。其解决方案的关键在于提出一个分层编排框架 Sci-Orchestra,通过 API 驱动接口抽象执行流程,自动处理安全认证、资源调度与跨异构高性能计算环境的可扩展部署(基于 Kubernetes 架构)。该框架的核心创新是自主市场机制(autonomous marketplace),支持研究人员通过直观界面快速部署和共享专用服务,实现无需源代码交换的“黑盒”互操作性,从而在保护知识产权的同时促进跨机构协作与工业级应用转化。

链接: https://arxiv.org/abs/2605.08396
作者: Harinarayan Krishnan,Shubhabrata Mukerjee,Jeffrey Donatelli,Daniela Ushizima
机构: Lawrence Berkeley National Laboratory (劳伦斯伯克利国家实验室); Bakar Comp. Health Sciences Institute, UC San Francisco (巴卡尔计算健康科学研究所,加州大学旧金山分校); Berkeley Institute for Data Science, UC Berkeley (伯克利数据科学研究所,加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The increasing complexity of modern computational environments often burdens researchers with infrastructure management, authentication protocols, and container deployments. We present Sci-Orchestra, a layered orchestration framework designed to fully automate experimental workflows, allowing scientists to prioritize scientific discovery over backend operations. By abstracting execution through an API-driven interface, the system assumes responsibility for secure authentication, resource management, and scalable deployment across diverse high-performance computing environments using Kubernetes architectures. A key innovation of Sci-Orchestra is its autonomous marketplace, which serves as a catalyst for cross-institutional collaboration. Through an intuitive user interface, researchers can rapidly deploy and share specialized services via simple selections, eliminating the need for complex installations and technical setups. This modular infrastructure is specifically designed to facilitate industry partnerships as it provides a secure execution environment and allows external collaborators to test and validate proprietary tools without the need for source-code exchange. This "black-box" interoperability protects intellectual property while enabling seamless integration into broader scientific pipelines, ultimately accelerating the transition from laboratory prototypes to industrial-scale applications.

[CV-308] Decoupling Endpoint and Semantic Transition Learning for Zero-Shot Composed Image Retrieval

【速读】:该论文旨在解决投影型零样本组合图像检索(Projection-based Zero-Shot Composed Image Retrieval, ZS-CIR)中存在的语义转换瓶颈问题,即在复杂语义修改场景下,现有方法因仅依赖端点匹配而难以准确建模源图像到目标图像的语义过渡过程,导致性能落后于基于大语言模型(LLM)的方法。其解决方案的关键在于提出DeCIR框架,通过解耦端点对齐与语义转换对齐的学习过程:利用图像-文本对构建正向/反向编辑元组,分别训练低秩文本适配器分支以独立优化端点匹配和语义转换,并采用低秩方向合并(Low-Rank Directional Merge, LRDM)策略将二者融合为一个可部署的适配器,从而在不增加推理复杂度的前提下显著提升投影型ZS-CIR的性能。

链接: https://arxiv.org/abs/2605.08389
作者: Mingyu Liu,Sihan Huang,Yijia Fan,Yinlin Yan,Quan Zhang,Jian-Fang Hu,Jianhuang Lai
机构: Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Zero-shot composed image retrieval (ZS-CIR) retrieves a target image from a reference image and a text modification without human-annotated CIR triplets. Projection-based ZS-CIR methods are attractive because they do not rely on LLMs at inference and remain lightweight, but they often underperform LLM-based approaches on complex semantic modifications. This gap reflects a semantic transition bottleneck in projection-based ZS-CIR: endpoint-level matching can let the edit text act as a target-side attribute cue rather than grounding it as a source-conditioned semantic transition. We further show that adding semantic transition supervision to the same text adapter creates an endpoint–transition conflict between endpoint alignment and semantic transition alignment. To address this conflict, DeCIR decouples endpoint and transition learning. It constructs paired forward/reverse edit tuples from image-caption pairs, trains separate low-rank text adapter branches for endpoint alignment and semantic transition alignment, and merges them with Low-Rank Directional Merge (LRDM) into one deployable adapter. Extensive experiments on CIRR, CIRCO, FashionIQ, and GeneCIS demonstrate that DeCIR consistently improves projection-based ZS-CIR without increasing inference complexity.

[CV-309] UIESNN: A Scale-Aware Spiking Network for Underwater Image Enhancement

【速读】:该论文旨在解决水下图像增强(Underwater Image Enhancement, UIE)中因波长依赖性色彩偏移和散射引起的雾霾效应等大尺度、低频退化问题,这些问题在传统脉冲神经网络(Spiking Neural Networks, SNNs)中因局部感知范围受限而难以有效校正,导致增强结果出现饱和或不一致。解决方案的关键在于提出一种尺度感知的SNN框架UIESNN,其核心组件是多尺度池化LIF模块(Multi-scale Pooling LIF Block, MPLB),该模块通过将分层多尺度池化响应注入膜电位动态,显著扩展有效感受野,同时保留细粒度细节并诱导异质尺度依赖激活;在此基础上构建全脉冲驱动的残差架构,融合频率分解与注意力机制实现精细化修复,从而在EUVP和LSUI基准上实现SNN方法中的最优性能,兼顾色彩保真度与空间一致性,并保持较低能耗。

链接: https://arxiv.org/abs/2605.08376
作者: Shuang Chen,Ruochen Li,Zihan Zhu,Ronald Thenius,Farshad Arvin,Amir Atapour-Abarghouei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Underwater image enhancement (UIE) is a practically important yet underexplored application of spiking neural networks (SNNs), where the dominant degradations are large-scale and low-frequency, such as wavelength-dependent colour casts and scattering-induced veiling. Existing SNN restoration designs rely on locally bounded spiking perception, which can limit global correction and lead to saturated or inconsistent representations. To address these challenges, we propose a scale-aware SNN framework for UIE named UIESNN. At its core is a Multi-scale Pooling LIF Block (MPLB) that injects hierarchical multi-scale pooling responses into membrane dynamics, thereby enlarging the effective receptive field while preserving fine-grained details and inducing heterogeneous scale-dependent activations. Building on MPLB, we design a spiking residual architecture that integrates frequency decomposition and attention-based refinement in a fully spike-driven pipeline. Extensive experiments on the EUVP and LSUI benchmarks demonstrate that UIESNN achieves state-of-the-art performance among SNN-based methods, delivering improved colour fidelity and spatial coherence with competitive energy cost.
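
A minimal sketch of the multi-scale pooling LIF idea: average-pooled responses at several scales are upsampled and injected into the membrane potential before thresholding, enlarging the effective receptive field of the spiking update. The time constant, scales, and exact injection rule are illustrative assumptions, not the MPLB as specified in the paper.

```python
import torch
import torch.nn.functional as F

def mplb_step(x, v, tau=2.0, v_th=1.0, scales=(2, 4, 8)):
    # Multi-scale pooled context, upsampled back to the input resolution.
    pooled = sum(
        F.interpolate(F.avg_pool2d(x, k), size=x.shape[-2:], mode="nearest")
        for k in scales
    ) / len(scales)
    v = v + (x + pooled - v) / tau        # leaky integration with multi-scale input
    spikes = (v >= v_th).float()          # threshold to binary spikes
    v = v * (1.0 - spikes)                # hard reset after a spike
    return spikes, v

x = torch.rand(1, 3, 32, 32)              # input feature map at one timestep
v = torch.zeros_like(x)
spikes, v = mplb_step(x, v)
```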

[CV-310] NeuroGAN-3D: Enhancing Intrinsic Functional Brain Networks via High-Fidelity 3D Generative Super-Resolution

【速读】:该论文旨在解决神经影像学中空间分辨率不足的问题,即当前基于静息态功能磁共振成像(resting-state fMRI, rs-fMRI)的空间图谱难以精确定位功能单元、可靠进行脑区分割以及检测与发育、衰老或疾病相关的细微空间特异性神经生物学变化。为应对这一挑战,作者提出 NeuroGAN-3D,这是一种专为体素级神经影像计算需求设计的新型三维生成式超分辨率模型,其核心创新在于采用生成对抗网络(Generative Adversarial Network, GAN)架构,显著提升 rs-fMRI 空间图谱的空间分辨率,从而实现更精细的脑结构解析及其与行为和病理关联的深入洞察。

链接: https://arxiv.org/abs/2605.08373
作者: M. Moein Esfahani,Sepehr Salem Ghahfarokhi,Mohammed Alser,Jingyu Liu,Vince Calhoun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted in ICCABS 2026: The 14th International Conference on Computational Advances in Bio and Medical Sciences

点击查看摘要

Abstract:Recent advances in neuroimaging have deepened our understanding of the brain’s complex functional and structural organization. Among these, functional Magnetic Resonance Imaging (fMRI) - particularly resting-state fMRI (rs-fMRI) - has emerged as a tool for identifying biomarkers of intrinsic brain connectivity and delineating large-scale neural networks. These networks are typically represented as volumetric spatial maps that capture functionally coherent brain regions and reflect individual differences in brain activity and structure. The spatial resolution of these maps plays an important role, as it determines the ability to localize functional units with precision, perform reliable brain parcellation, and detect subtle, spatially specific neurobiological alterations associated with development, aging, or disease. Therefore, improving the effective resolution of neuroimaging-derived maps holds significant promise for enabling more detailed insights into brain architecture and its relationship to behavior and pathology. To address this need, we propose NeuroGAN-3D, a novel 3D generative super-resolution model tailored to the computational demands of volumetric neuroimaging. Our model leverages a generative adversarial network architecture to enhance the spatial resolution of rs-fMRI spatial maps, significantly outperforming a conventional baseline.

[CV-311] PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers

【速读】:该论文旨在解决视觉几何变换器(Visual Geometry Transformer, VGGT)中交替注意力(Alternating-Attention, AA)模块的计算量随令牌总数(token count)呈二次方增长、导致处理长视频片段时推理延迟过高的问题。现有令牌压缩加速方法仅在AA内部操作,未能优化输入AA前的patch网格结构。其解决方案的关键在于提出PaceVGGT——一种预AA令牌剪枝框架,通过训练一个轻量级Token Scorer模型,在冻结的VGGT进入第一个AA块之前对DINO提取的patch tokens进行重要性评估与剪枝;该Token Scorer先从原始未剪枝骨干网络的AA内部注意力目标中蒸馏知识,再结合下游相机位姿、深度图和点云图损失进行微调,并引入逐帧保留预算与重要性自适应的合并/剪枝策略,在固定总合并预算下保留高显著性帧的残余内容,同时借助特征引导的恢复模块重建预测头所需的密集空间网格,从而在不牺牲重建质量的前提下显著降低推理延迟。

链接: https://arxiv.org/abs/2605.08371
作者: Haotang Li,Zhenyu Qi,Shaohan Henry Wang,Kebin Peng,Zi Wang,Qing Guo,Sen He,Huanrui Yang
机构: University of Arizona(亚利桑那大学); East Carolina University(东卡罗来纳大学); Augusta University(奥古斯塔大学); Nankai University(南开大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual Geometry Transformer (VGGT) is a strong feed-forward model for multiple 3D tasks, but its Alternating-Attention (AA) stack scales quadratically in the total token count, making long clips expensive. Existing token-reduction accelerators operate inside AA, leaving the patch grid that enters AA uncompressed. We introduce PaceVGGT, a pre-AA token pruning framework that prunes DINO patch tokens before the first AA block of a frozen VGGT. PaceVGGT trains a lightweight Token Scorer that estimates per-token importance from DINO features. The scorer is first distilled against an AA-internal attention target from the unpruned backbone, then refined under downstream camera, depth, and point-map losses. A per-frame keep budget fixes the backbone-visible sequence length, while an importance-adaptive merge/prune assignment preserves residual content from high-saliency frames under a fixed total merge budget. A Feature-guided Restoration module reconstructs the dense spatial grid required by the prediction heads. On ScanNet-50 and 7-Scenes, PaceVGGT remains on the reconstruction quality–latency frontier while reducing inference latency. On ScanNet-50, it reduces latency by 5.1× over unmodified VGGT at N=300 and 1.47× over LiteVGGT at N=1000. These results identify pre-AA pruning as a viable acceleration route for frozen VGGT-style geometry transformers.
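
A minimal sketch of pre-backbone token pruning with a learned scorer and a fixed per-frame keep budget; PaceVGGT's attention-target distillation, merge/prune assignment, and feature-guided restoration are not reproduced here, and the sizes are illustrative assumptions.

```python
import torch

num_frames, tokens_per_frame, dim, keep = 8, 196, 384, 64
tokens = torch.randn(num_frames, tokens_per_frame, dim)      # DINO patch tokens per frame
scorer = torch.nn.Sequential(torch.nn.Linear(dim, 64), torch.nn.GELU(), torch.nn.Linear(64, 1))

scores = scorer(tokens).squeeze(-1)                           # (frames, tokens) importance
keep_idx = scores.topk(keep, dim=1).indices                   # per-frame keep budget
pruned = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, dim))
print(pruned.shape)                                           # (8, 64, 384) enters the backbone
```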

[CV-312] An Efficient Token Compression Framework for Visual Object Tracking CVPR2026

【速读】:该论文旨在解决基于Transformer的视觉目标跟踪模型在使用大量历史模板帧时所面临的两个关键问题:一是因输入视觉标记(visual tokens)数量激增导致的二次方计算复杂度,二是冗余特征引入对跟踪性能的潜在负面影响。解决方案的核心在于提出一种“先压缩再交互”的跟踪框架ETCTrack,其关键创新包括:1)设计自适应标记压缩器(Adaptive Token Compressor),动态过滤冗余视觉标记,生成紧凑且高判别力的模板标记;2)引入分层交互编码器(Hierarchical Interaction Encoder),实现模板与搜索区域特征之间的深度自适应交互,从而提升目标定位精度。该方法在保持高精度的同时显著降低计算量,在多个基准测试中优于当前最优跟踪算法。

链接: https://arxiv.org/abs/2605.08329
作者: Weijing Wu,Qihua Liang,Bineng Zhong,Haiying Xia,Zhiyi Mo,Shuxiang Song
机构: Guangxi Normal University (广西师范大学); Wuzhou University (梧州大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted by CVPR2026

点击查看摘要

Abstract:Refining visual representations by eliminating their internal feature-level redundancy is crucial for simultaneously optimizing the performance and computational cost of models in visual tracking. To enhance their performance, many contemporary Transformer-based trackers leverage a larger number of historical template frames to capture richer spatio-temporal cues. However, this strategy leads to a massive number of input visual tokens. This creates two critical issues: it imposes a quadratic computational burden and can also degrade the tracker’s overall performance. To bridge this gap, we propose a compress-then-interact tracking framework, ETCTrack, that learns to efficiently compress template tokens from historical template frames into a robust target representation, moving beyond handcrafted rules. Our method first employs the Adaptive Token Compressor to dynamically construct compact yet highly discriminative template tokens by filtering out redundant visual tokens. These refined template tokens are then processed by our Hierarchical Interaction Encoder to achieve a deep, adaptive interaction with the search features. Refined search features ensure subsequent precise target localization. Experiments on seven benchmarks demonstrate that our method outperforms current state-of-the-art trackers. ETCTrack-B224 reduces the number of template tokens by 60%, leading to a 21.4% reduction in MACs with only a 0.4% drop in accuracy. The source code is available at this https URL.

[CV-313] P-Flow: Proxy-gradient Flows for Linear Inverse Problems

【速读】:该论文旨在解决基于流匹配(flow matching)的生成模型在逆问题中因需对展开路径进行反向传播而导致的数值不稳定性和计算开销过大的问题。解决方案的关键在于提出P-Flow框架,通过引入代理梯度(proxy gradient)来更新源点(source point),从而避免长链反向传播带来的数值不稳定性与内存负担;同时,为保证重建结果与先验分布的一致性,采用受高维空间浓度测度现象启发的高斯球面投影(Gaussian spherical projection),并从贝叶斯理论和利普希茨连续性角度提供了理论分析支撑。

链接: https://arxiv.org/abs/2605.08328
作者: Zehua Jiang,Fenghao Zhu,Xinquan Wang,Chongwen Huang,Zhaoyang Zhang
机构: Zhejiang University (浙江大学); University of Notre Dame (圣母大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generative models based on flow matching have emerged as a powerful paradigm for inverse problems, offering straighter trajectories and faster sampling compared to diffusion models. However, existing approaches often necessitate differentiating through unrolled paths, leading to numerical instability and prohibitive computational overhead. To address this, we propose P-Flow, a framework that stabilizes the reconstruction process by leveraging a proxy gradient to update the source point. This approach effectively circumvents the numerical instability and memory overhead of long-chain differentiation. To ensure consistency with the prior distribution, we employ a Gaussian spherical projection motivated by the concentration of measure phenomenon in high-dimensional spaces. We further provide a theoretical analysis for P-Flow based on Bayesian theory and Lipschitz continuity. Experiments across diverse restoration tasks demonstrate that P-Flow delivers competitive performance, especially under extreme degradations such as severely ill-posed conditions and high measurement noise.
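The Gaussian spherical projection mentioned above has a compact geometric reading: a standard normal vector in dimension d concentrates around norm sqrt(d), so a drifting source point can be pulled back onto that shell. The snippet below is a sketch of that idea only; the radius rule and the toy perturbation are assumptions, not the paper's implementation.

```python
import numpy as np

def spherical_projection(z: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Rescale a latent so its norm matches the sqrt(d) shell of a standard Gaussian."""
    d = z.size
    return z * (np.sqrt(d) / (np.linalg.norm(z) + eps))

rng = np.random.default_rng(0)
z = rng.normal(size=4096)
print(np.linalg.norm(z) / np.sqrt(4096))            # ~1.0: concentration of measure
drifted = z * 3.0 + 0.5                             # hypothetical off-shell source point
print(np.linalg.norm(spherical_projection(drifted)) / np.sqrt(4096))   # back to ~1.0
```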

[CV-314] Revitalizing the Beginning: Avoiding Storage Dependency for Model Merging in Continual Learning

【速读】:该论文旨在解决持续学习(Continual Learning, CL)中模型合并(Model Merging)所面临的挑战:即在有限存储条件下,如何有效整合不同任务的知识,同时避免因全局对齐策略导致的任务特异性误差累积和后续任务优化停滞问题。解决方案的关键在于提出轨迹正则化合并(Trajectory Regularized Merging, TRM)框架,该框架将合并过程建模为一个在扩展轨迹子空间中的优化问题,并协同引入三项核心目标——任务对齐(task alignment)、预测一致性(prediction consistency)与梯度响应性(gradient responsiveness),从而在保持历史知识稳定性的同时恢复优化动力学,显著提升合并后模型在连续任务流中的初始性能表现。

链接: https://arxiv.org/abs/2605.08311
作者: Xi Wang,Cheng Deng
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Model merging provides a compelling paradigm for integrating specialized expertise into a unified multi-task model, a goal that aligns naturally with the sequential knowledge acquisition in continual learning (CL). However, the requirement for preserving diverse forms of previous knowledge conflicts with the storage limitations inherent to CL. In this paper, we systematically analyze existing model merging methods under the constraints of CL. We find that current methods prioritize global alignment, which often leads to the accumulation and amplification of task-specific errors within the continuous data stream; and the vanishing gradients at the onset of subsequent tasks frequently cause optimization to stagnate. These leave the merged model in a suboptimal state at the beginning of the next training phase. To address these challenges, we propose Trajectory Regularized Merging (TRM), a framework that reformulates the merging phase as an optimization process within an augmented trajectory subspace. Our framework integrates three synergistic objectives including task alignment, prediction consistency, and gradient responsiveness to concurrently preserve merged model’s historical stability and re-activate optimization dynamics. Extensive experimental results demonstrate that our method achieves state-of-the-art performance across multiple benchmarks.

[CV-315] BenchHAR: Benchmarking Self-Supervised Learning for Generalizable Sensor-based Activity Recognition

【速读】:该论文旨在解决可穿戴传感器数据在人类活动识别(HAR)中因数据异质性和标注数据稀缺而导致的泛化能力不足问题。其解决方案的关键在于提出一个统一的基准框架BenchHAR,用于系统评估自监督学习(SSL)方法在未见目标分布上的泛化性能;通过构建大规模数据集(约258K样本)并对比八种代表性SSL方法与12种编码器-分类器架构组合,发现混合范式(结合重建与对比预训练)表现最优,且CNN编码器具备最强的通用表征学习能力,同时增加下游任务类别的预训练数据量能显著提升泛化效果,而引入非下游类别的无标签数据则无效。

链接: https://arxiv.org/abs/2605.08296
作者: Yize Cai,Rui Feng,Anlan Yu,Baoshen Guo,Zhiqing Hong
机构: The Hong Kong University of Science and Technology (Guangzhou); Peking University; Singapore-MIT Alliance for Research and Technology
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: 25 pages

点击查看摘要

Abstract:Human Activity Recognition (HAR) from wearable sensors supports broad healthcare and behavior science applications. However, data heterogeneity and the scarcity of labeled data limit its real-world generalization. Recent advances in self-supervised learning (SSL) in vision and language domains have shown strong capability for learning generalizable representations from unlabeled data. Yet, few studies have systematically compared the generalization performance of SSL methods or explored how to adapt them for generalizable HAR. To address these gaps, we present BenchHAR, a unified framework for evaluating the generalization capability of SSL methods for sensor-based HAR on unseen target distributions. BenchHAR curates a large-scale dataset (~258K samples) and evaluates eight representative SSL methods across 12 encoder-classifier architectures. Our results reveal that existing SSL methods struggle to achieve satisfactory generalization performance. We find that: (1) For HAR models, the hybrid paradigm (combining reconstruction and contrastive pretraining) achieves the best overall performance. The CNN encoder exhibits the strongest ability to learn generalizable representations, while more expressive classifier architectures further improve generalization. (2) For data scale, increasing the amount of pretraining data from downstream activity classes consistently improves generalization, while adding more labeled data yields limited gains. Interestingly, incorporating unlabeled data from non-downstream activity classes does not improve generalization. (3) Sensor data collected from custom-grade devices generalizes better than that from research-grade devices, and data from limb transfers more effectively to trunk positions. BenchHAR provides a unified benchmark and actionable insights for generalizable sensor-based HAR systems. Our code is available at this https URL.

[CV-316] Distill Diffuse and Semanticize (DDS): Annotation-Free 3D Scene Understanding Based on Multi-Granularity Distillation and Graph-Diffusion-Based Segmentation

【速读】:该论文旨在解决无标注(annotation-free)三维场景语义理解中面临的挑战,包括区域级语义不一致、全局分组效率低下以及类别无关的分割结果等问题。其核心解决方案在于提出一种基于多粒度蒸馏与图扩散(graph-diffusion-based segmentation)的框架:首先利用结构化视觉知识引导和超点图扩散机制实现高效的全局语义传播,缓解区域语义不一致性;随后通过分割-聚类关联进行语义推理,为分割后的3D区域赋予可解释的类别标签,从而显著提升无监督三维语义理解的整体性能。实验表明,该方法在真实数据集上相较现有先进无标注基线模型,在总体准确率(oAcc)、平均准确率(mAcc)和平均交并比(mIoU)上分别提升了最多5.9%、8.1%和2.4%。

链接: https://arxiv.org/abs/2605.08293
作者: Yijing Wang,Ruonan Li,Qilin Wang,Rongqiang Zhao,Jie Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D semantic scene understanding has broad applications in digital twins, autonomous driving, smart agriculture, and embodied perception. However, dense point-wise annotation for point clouds is extremely expensive, making fully supervised 3D semantic learning difficult to scale. Recent annotation-free methods can discover semantic regions without manual 3D labels, but they often suffer from weak object-level consistency, inefficient global grouping, and category-agnostic segmented regions. We propose an annotation-free 3D scene semantic understanding method based on multi-granularity distillation and graph-diffusion-based segmentation. The proposed method first leverages structured visual knowledge guidance and superpoint graph diffusion to perform efficient global semantic propagation, alleviating the problem of inconsistent region-level semantics. It then conducts semantic inference through segmentation-cluster association, assigning interpretable category names to segmented 3D regions and improving the overall effectiveness of annotation-free 3D semantic understanding. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed framework. Compared with the advanced existing annotation-free baselines, our method improves oAcc, mAcc, and mIoU by 5.9%, 8.1%, and 2.4% at most, respectively. These results highlight the promise of the proposed framework for scalable annotation-free 3D scene understanding, especially in real-world scenarios requiring both object segmentation and semantic recognition.
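Superpoint graph diffusion for global semantic propagation can be sketched as iterative, row-normalized smoothing of soft class scores over the superpoint adjacency, mixed with the original local evidence. The adjacency, scores, and mixing weight alpha below are illustrative placeholders, not the paper's exact formulation.

```python
import numpy as np

def graph_diffusion(A: np.ndarray, scores: np.ndarray, alpha: float = 0.8, steps: int = 20) -> np.ndarray:
    """A: (N, N) superpoint adjacency; scores: (N, C) initial soft class scores."""
    deg = A.sum(axis=1, keepdims=True) + 1e-8
    P = A / deg                                        # row-normalized transition matrix
    out = scores.copy()
    for _ in range(steps):
        out = alpha * (P @ out) + (1.0 - alpha) * scores   # diffuse, but keep local evidence
    return out / out.sum(axis=1, keepdims=True)

A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
s = np.array([[0.9, 0.1], [0.6, 0.4], [0.5, 0.5], [0.1, 0.9]])
print(graph_diffusion(A, s).round(2))                  # smoother, more consistent region labels
```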

[CV-317] Is Class Signal Clustered or Routed in Task-Induced Implicit Neural Representation Weight Spaces?

【速读】:该论文旨在解决隐式神经表示(Implicit Neural Representations, INRs)中图像分类任务的可分性问题,即探究INR权重空间中的几何结构是否能直接支持高效分类。传统假设认为,通过元学习(meta-learning)得到的共享初始权重与内循环更新策略,应使不同类别的图像对应权重在共享锚点坐标系下形成类内聚类,从而提升分类性能。然而,研究发现这一几何假设并不成立:即使权重空间中存在明显的类内聚类,训练后的读取器(trained reader)准确率仍可能下降;更重要的是,类对齐的邻域结构仅在读取器后期交互后才变得具有预测性,而非源于输入空间的原始几何特性。解决方案的关键在于识别出SIREN网络中一个低维、样本依赖的偏置列(bias column),该列作为因果读取路径(causal readout route)被读取器主动利用,从而实现类信号的有效路由。这一发现揭示了任务诱导的INR权重可分性并非源自原始几何聚类,而是由读取器动态构建的信号路由机制驱动。

链接: https://arxiv.org/abs/2605.08281
作者: Xinyi Guo,Mingyi He,Haobin Ding,Weiming Chen,Xinrui Chen,Jiawen Li,Di Zhang,Minxi Ouyang,Yizhi Wang,Xitong Ling
机构: South China Normal University(华南师范大学); Beijing University of Chemical Technology(北京化工大学); Tsinghua University(清华大学); Xi’an Jiaotong University(西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Implicit neural representations (INRs) encode images as neural-network weights, making image classification a problem of weight-space classifiability. A natural geometric hypothesis is that classifier feedback should make image-specific weights cluster by class in the shared-anchor coordinate. We test this hypothesis in the SIREN-based Meta Weight Transformer (MWT) regime, where end-to-end training meta-learns a shared initialization and inner-loop update schedule for fitting image-specific SIRENs. We find that this prediction fails. Exposed weight-space geometry and supervised clustering pressure do not reliably track trained-reader accuracy; clustering can even make local neighborhoods more class-consistent while making the trained reader worse. Crucially, the reader constructs rather than inherits class-aligned geometry: token-flow diagnostics show that class-aligned neighborhoods become strongly predictive of trained-reader accuracy only after late reader interactions, not in the input coordinate. We further identify the native SIREN bias column in the augmented weight token as a low-dimensional, sample-dependent causal readout route for the trained reader; targeted controls rule out generic scalar-column and marginal-distribution artifacts. The diagnosis motivates interventions that strengthen reader routing, add an explicit bias route, or use denser inner-loop fitting; under the lane-specific training conventions used here, route-directed variants often outperform the shared-anchor baseline but interact non-additively. Task-induced INR weights are classifiable not because they form raw geometric clusters, but because their class signal is routed through the reader.

[CV-318] Beyond ViT Tokens: Masked-Diffusion Pretrained Convolutional Pathology Foundation Model for Cell-Level Dense Prediction

【速读】:该论文旨在解决计算病理学中细胞级密集预测(cell-level dense prediction)的挑战,这些问题主要源于组织切片中细微的组织结构、显著的域偏移(domain shift)以及昂贵的密集标注成本。现有基于视觉Transformer(Vision Transformer, ViT)的病理基础模型依赖于图像块(patch tokenization)处理方式,易破坏空间连续性并削弱局部形态细节,从而影响细胞级预测性能。解决方案的关键在于提出一种全新的自监督卷积生成预训练框架——ConvNeXt Masked-Diffusion (CMD),其核心创新包括:采用全卷积的ConvNeXt-UNet骨干网络,在像素空间中执行掩码扩散(masked-diffusion)预训练,并通过自适应归一化(adaptive normalization)融合冻结的病理基础模型特征。该方法在多个病理密集预测任务中均优于现有ViT基线模型,尤其在标注数据有限场景下展现出更强的鲁棒性和泛化能力,验证了纯卷积架构在细胞级病理理解中的竞争力与可扩展性。

链接: https://arxiv.org/abs/2605.08276
作者: Weiming Chen,Xitong Ling,Zhenyang Cai,Xidong Wang,Jiawen Li,Tian Guan,Benyou Wang,Yonghong He
机构: Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院); Research Institute of Tsinghua, Pearl River Delta(清华大学珠三角研究院); The Chinese University of Hong Kong, ShenZhen(香港中文大学(深圳))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cell-level dense prediction is central to computational pathology, but remains challenging due to fine-grained histological structures, strong domain shifts, and costly dense annotations. Existing ViT-based pathology foundation models rely on patch tokenization, which can disrupt spatial continuity and weaken local morphological details needed for cell-level prediction. To address this, we propose Masked-Diffusion Convolutional Foundation Models, termed ConvNeXt Masked-Diffusion (CMD), a self-supervised convolutional generative pretraining framework for dense pathology representation learning. CMD uses a fully convolutional ConvNeXt-UNet backbone, performs masked-diffusion pretraining in pixel space, and incorporates frozen pathology foundation model features through adaptive normalization. Experimental results demonstrate that CMD consistently outperforms existing ViT-based pathology foundation models and even surpasses state-of-the-art end-to-end segmentation methods while fine-tuning only a small number of task-specific parameters across multiple pathology dense prediction tasks. The advantage is particularly pronounced under limited annotation settings, where CMD exhibits stronger robustness and generalization ability. Our findings suggest that purely convolutional architectures can also serve as competitive pathology foundation models for cell-level dense prediction, achieving leading performance within the current ViT-dominated paradigm and providing a scalable, high-performance solution that better preserves histological structural priors for fine-grained pathology understanding.

[CV-319] Bridging Modalities Spanning Time: Structured Memory for Ultra-Long Agentic Video Reasoning

【速读】:该论文旨在解决超长视频(如第一人称记录、直播或监控录像,持续数天至数周)理解难题,现有多模态大语言模型(Multimodal Large Language Models, MLLMs)受限于百万token上下文窗口,仅能处理数十分钟密集采样的视频片段,大量证据在推理前即被丢弃;尽管记忆增强与代理(agentic)方法提升了可扩展性,但其检索仍跨模态碎片化,缺乏覆盖数日甚至数周的长程叙事摘要。解决方案的关键在于提出无需训练的MAGIC-Video框架,其核心是构建一个包含六类边的多模态记忆图(multimodal memory graph),统一情景记忆、语义与视觉内容,并引入交错的叙事链(narrative chain)以提炼长期实体传记和重复活动事件;推理时通过代理循环将记忆图检索与叙事事实注入相结合,在单一检索管道中同时覆盖模态与时间维度,从而实现对超长视频的高效理解。

链接: https://arxiv.org/abs/2605.08271
作者: Jiazheng Li,Chi-Hao Wu,Yunze Liu,Kaize Ding,Jundong Li,Chuxu Zhang
机构: University of Connecticut (康涅狄格大学); Memories.ai; Northwestern University (西北大学); University of Virginia (弗吉尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding ultra-long videos such as egocentric recordings, live streams, or surveillance footage spanning days to weeks, remains a challenge. For current multimodal LLMs: even with million-token context windows, frame budgets cover only tens of minutes of densely sampled video, and most evidence is discarded before inference begins. Memory-augmented and agentic approaches help with scale, but their retrieval remains fragmented across modalities and lacks long-range narrative summaries that span days or weeks. We propose \textbfMAGIC-Video, a training-free framework built around a multimodal memory graph with interleaved narrative chain: the graph unifies episodic, semantic, and visual content through six typed edges and supports cross-modal retrieval, while the chain distils long-horizon entity biographies and recurring activity events. At inference time, an agentic loop interleaves graph retrieval with narrative fact injection, covering both the modality and time dimensions of ultra-long video in a single retrieval pipeline. On EgoLifeQA, Ego-R1 and MM-Lifelong, MAGIC-Video consistently outperforms strong general-purpose, long-video, and agentic baselines, with gains of 10.1, 7.4, and 5.9 points over the prior best agentic system on each benchmark. Code is available at this https URL.

[CV-320] SAFformer: Improving Spiking Transformer via Active Predictive Filtering IJCAI2026

【速读】:该论文旨在解决现有脉冲神经网络(Spiking Neural Networks, SNNs)构建的Transformer模型普遍采用被动响应范式所带来的局限性,即难以聚焦任务相关信息,并在处理冗余视觉数据时产生显著计算开销的问题。其解决方案的关键在于提出一种基于主动预测过滤机制的新型脉冲Transformer架构——SAFformer,该架构受大脑预测编码机制启发,能够主动抑制可预测信号并聚焦于显著视觉特征,从而在保持高能效的同时提升模型对关键信息的感知能力与计算效率。

链接: https://arxiv.org/abs/2605.08270
作者: Zequan Xie,Weiming Zeng,Yunhua Chen,Sichang Ling,Tongyang Chen,Jinsheng Xiao
机构: Guangdong University of Technology (广东工业大学); Hong Kong Baptist University (香港浸会大学); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 7 figures, IJCAI 2026

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) offer notable advantages in biological plausibility and energy efficiency, making them promising candidates for building low-power Transformers. However, existing Spiking Transformers largely adhere to a passive reactive paradigm, which struggles to focus on task-relevant information and incurs substantial computational overhead when processing redundant visual data. To overcome this fundamental yet underexplored limitation, we propose SAFformer, a novel Spiking Transformer architecture based on an active predictive filtering paradigm. Inspired by the brain’s predictive coding mechanism, SAFformer actively suppresses predictable signals and focuses on salient visual features. Extensive experiments show that SAFformer establishes new state-of-the-art performance on CIFAR-10/100 and CIFAR10-DVS. Remarkably, on ImageNet-1K, it achieves 80.50% Top-1 accuracy with only 26.58M parameters and an energy consumption of 5.88 mJ, demonstrating an exceptional balance between accuracy and efficiency.

[CV-321] Multimodal Emotion Recognition via Causal-Diffusion Bridge (Affect-Diff)

【速读】:该论文旨在解决多模态情感识别(Multimodal Emotion Recognition)在CMU-MOSEI数据集上因样本极度不平衡导致的模型偏差问题,即主流类别(如Happy)占据绝大多数样本,而埃克曼六种基本情绪中的少数类(Fear、Disgust、Surprise等)几乎被现有融合模型忽略,从而造成对少数类情感识别性能为零。解决方案的关键在于提出Affect-Diff,一个基于因果扩散桥接(Causal-Diffusion Bridge)的框架,其核心包含三个协同训练机制:1)通过NOTEARS学习得到的因果图对模态贡献进行再加权;2)基于beta-VAE的瓶颈结构实现正则化的潜在空间压缩;3)采用带梯度截断(stop-gradient)的一维DDPM先验建模潜在空间,防止多数类主导。实验证明,该方法在验证集上实现了0.384的平衡准确率,相较最强基线TETFN提升18%,且首次在该任务中成功检测出全部六种情绪类别。

链接: https://arxiv.org/abs/2605.08252
作者: Ankit Sanjyal
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 Pages, 12 Figures, 6 Tables

点击查看摘要

Abstract:Multimodal emotion recognition on CMU-MOSEI faces an extreme imbalance as Happy accounts for 65.9% of samples while three Ekman categories collectively represent under 7%, causing standard fusion models to maximize accuracy by ignoring minority emotions entirely. We present Affect-Diff, a Causal-Diffusion Bridge that addresses this through three jointly trained mechanisms: a NOTEARS-learned causal graph that re-weights modality contributions before fusion, a beta-VAE bottleneck for regularized latent compression, and a stop-gradiented 1D DDPM prior that structures the latent space against majority-class collapse. On 3,292 aligned CMU-MOSEI samples, Affect-Diff achieves validation balanced accuracy 0.384, an 18% relative improvement over the strongest baseline (TETFN: 0.324), while all evaluated baselines produce zero F1 on Fear, Disgust, and Surprise. Ablation studies confirm independent, non-redundant contributions from the diffusion prior (-24% without it) and causal graph (-13%). Notably, only the deterministic-encoder variant detects all six emotion classes, revealing KL regularization strength as a direct lever for minority-class sensitivity.

[CV-322] Why Do DiT Editors Drift? Plug-and-Play Low Frequency Alignment in VAE Latent Space

【速读】:该论文旨在解决多轮图像编辑中因扩散变换器(Diffusion Transformers, DiTs)导致的语义漂移(semantic drift)和质量下降问题。研究发现,DiT在低频潜空间中引入主导性漂移,随编辑轮次累积造成语义错位,而变分自编码器(VAE)则提供相对稳定的重建能力。解决方案的关键在于提出一种无需训练、即插即用的低频对齐方法(VAE-LFA),通过低通滤波分解各轮次间的潜空间差异,并将低频统计量对齐至前序轮次的指数移动平均,从而有效抑制累积语义漂移并保留高频细节。该方法不依赖重训练、真实标签或扩散参数,适用于白盒与黑盒DiT编辑器,在保持视觉保真度的同时显著提升多轮编辑的语义一致性。

链接: https://arxiv.org/abs/2605.08250
作者: Xiaoce Wang,Sifan Zhou,Kaifei Wang,Leli Xu,Xuerui Qiu,Tao He,Ming Li
机构: Tsinghua University (清华大学); Carnegie Mellon University (卡内基梅隆大学); Peking University (北京大学); CASIA (中国科学院自动化研究所); University of Electronic Science and Technology of China (电子科技大学); Guangming Laboratory (光明实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages main paper, 12 figures, 25 pages in total

点击查看摘要

Abstract:Recent advances in diffusion transformers (DiTs) have enabled promising single-turn image editing capabilities. However, multi-turn editing often leads to progressive semantic drift and quality degradation. In this work, we study this problem from a latent-space frequency perspective by decomposing the editing process into two functional components: VAE and DiT. Through systematic analysis in the VAE latent space, we uncover that the DiT introduces dominant low-frequency drift that accumulates as semantic misalignment across editing rounds, while the VAE contributes comparatively stable reconstructions. Based on this insight, we propose VAE-LFA (Low Frequency Alignment), a training-free, plug-and-play method that performs alignment in VAE latent space. VAE-LFA decomposes latent discrepancies across editing rounds via low-pass filtering, and aligns low-frequency statistics to an exponential moving average of previous rounds, effectively suppressing accumulated semantic drift while preserving high-frequency details. Our method requires no retraining, ground-truth priors, or access to diffusion parameters, making it applicable to both white-box and black-box DiT editors. For white-box models, VAE-LFA is seamlessly integrated into the editing pipeline by eliminating redundant VAE round trips; for black-box models, it operates via an off-the-shelf VAE to perform inter-round latent alignment. Extensive experiments demonstrate that VAE-LFA improves semantic consistency and visual fidelity across diverse multi-turn editing scenarios, including both controlled and in-the-wild images.
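A rough picture of the alignment step described above: split the VAE latent into low- and high-frequency parts with an FFT low-pass, match the low-frequency statistics to an exponential moving average from earlier rounds, and keep the high-frequency detail untouched. The cutoff, the EMA statistics, and the shapes in this sketch are assumptions for illustration.

```python
import numpy as np

def low_pass(x: np.ndarray, cutoff: float = 0.15) -> np.ndarray:
    """x: (C, H, W) latent; keep spatial frequencies below `cutoff` (fraction of Nyquist)."""
    C, H, W = x.shape
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    mask = (np.sqrt(fy**2 + fx**2) <= cutoff).astype(x.dtype)
    return np.real(np.fft.ifft2(np.fft.fft2(x, axes=(-2, -1)) * mask, axes=(-2, -1)))

def align_low_freq(z: np.ndarray, ema_mean: np.ndarray, ema_std: np.ndarray) -> np.ndarray:
    low = low_pass(z)
    high = z - low
    mu = low.mean(axis=(1, 2), keepdims=True)
    sd = low.std(axis=(1, 2), keepdims=True) + 1e-6
    low_aligned = (low - mu) / sd * ema_std + ema_mean   # match per-channel low-frequency stats
    return low_aligned + high                            # high-frequency detail preserved

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 64, 64))                         # toy stand-in for a VAE latent map
print(align_low_freq(z, ema_mean=np.zeros((4, 1, 1)), ema_std=np.ones((4, 1, 1))).shape)
```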

[CV-323] Dimensional Coactivation for Representational Consistency in Frozen Vision Foundation Models

【速读】:该论文旨在解决冻结视觉基础模型(frozen vision foundation models)在单个输入图像内部是否保持表征一致性的问题,即模型是否能以统一的坐标系组织图像的语义子区域。其核心解决方案是提出维度协同激活(Dimensional Coactivation, DCA),通过分析同一特征维度在不同语义区域间的协同激活模式来衡量这种内在一致性。DCA的关键设计在于避免使用中心化、L2归一化和全Gram耦合等操作,这些操作虽适用于跨模型或分布比较,但在固定坐标系下的单样本场景中会丢失原始幅值信息——而该幅值恰恰承载了结构信号。实验表明,基于DINOv3特征的DCA在深度伪造检测任务中表现出高判别力(如CelebDF-v2上AUC达0.9106),且消融实验证明其性能依赖于稳定的逐维坐标系统而非单纯的区域提取能力。

链接: https://arxiv.org/abs/2605.08249
作者: Izaldein Al-Zyoud,Abdulmotaleb El Saddik
机构: University of Ottawa (渥太华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Frozen vision foundation models do not merely extract features; they organize images through a learned coordinate system. We ask whether that coordinate system remains internally coherent within a single input. This leads to Representational Consistency: the study of whether a frozen foundation model represents one sample coherently across its semantic subregions. We introduce Dimensional Coactivation (DCA), a per-dimension instrument for measuring this coherence. DCA compares semantic regions by asking whether the same feature dimensions coactivate across them. Unlike classical similarity measures, it deliberately avoids centering, L2 normalization, and full Gram coupling. These operations are useful when comparing different models or distributions, but they are mismatched to the intra-sample setting, where the coordinate system is fixed and raw magnitude carries signal. Deepfake detection provides a natural validation task. Synthetic faces may reproduce plausible eyes, noses, and mouths while breaking the representational structure that links those regions in real faces. Using frozen DINOv3 features, DCA exposes this break: an eyes-mouth-nose fingerprint achieves 0.9106 AUC on CelebDF-v2 and 0.9289 on DFD under FF++ c23 cross-dataset transfer. The design is also sharply validated by ablation: reintroducing centering collapses CelebDF-v2 AUC to 0.459, L2 normalization reduces it to 0.862, and cross-dimension coupling reduces it to 0.478. Finally, replacing DINOv3 with FaRL collapses CelebDF-v2 AUC to 0.582. DCA therefore depends on a stable per-dimension coordinate system, not on region extraction alone. These results position DCA as an instrument for measuring intra-sample representational coherence in frozen foundation models, with deepfake detection as the first validation task.
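One way to make the per-dimension comparison concrete is shown below: take raw region means with no centering and no L2 normalization, then ask which feature dimensions stay active in every region. The per-dimension minimum used as the coactivation statistic here is an illustrative choice, not necessarily the exact DCA definition.

```python
import numpy as np

def region_mean(features: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """features: (P, D) patch features; mask: (P,) boolean region membership."""
    return features[mask].mean(axis=0)               # deliberately no centering, no L2 norm

def coactivation_fingerprint(features: np.ndarray, masks: list[np.ndarray]) -> np.ndarray:
    means = np.stack([region_mean(features, m) for m in masks])   # (R, D) raw region means
    return means.min(axis=0)                         # a dimension counts only if active in every region

rng = np.random.default_rng(0)
feats = np.abs(rng.normal(size=(196, 768)))          # e.g. frozen DINOv3 patch features (toy values)
eyes, nose, mouth = (rng.random(196) < 0.1 for _ in range(3))   # hypothetical region masks
fp = coactivation_fingerprint(feats, [eyes, nose, mouth])
print(fp.shape)                                      # (768,) per-dimension coactivation profile
```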

[CV-324] Smart Railway Obstruction Detection System using IoT and Computer Vision

【速读】:该论文旨在解决印度铁路轨道入侵(railway track intrusion)的安全问题,包括野生动物闯入和人为恶意障碍物,尤其针对现有高成本、高误报率的检测系统(如基于光纤的Gajraj系统)无法广泛部署的痛点。其关键解决方案是提出NETRA系统,一种基于树莓派(Raspberry Pi)边缘计算平台的低成本、无需互联网的入侵检测系统,通过概率传感器融合(PIR运动传感器与HC-SR04超声波测距传感器,阈值τ_c=0.65)实现事件驱动式摄像头激活,降低52%不必要的视觉处理;并结合边缘AI分类模型(MobileNet-SSD或YOLOv5 ONNX)进行威胁识别(人类、大型动物、轨道障碍物),最终通过LoRa无线通信(868 MHz)在2.4秒内将确认威胁传输至机车司机端,实现实时响应。该方案在113次测试事件中达到95%检测准确率且零误报,同时将部署成本从1000卢比/公里降至247卢比/公里,显著优于现有系统。

链接: https://arxiv.org/abs/2605.08246
作者: Pravin Kumar,Mritunjay Shall Peelam,Ramakant Kumar,Sanjay Kumar,Vinay Chamola
机构: University of Petroleum and Energy Studies (UPES)(印度石油与能源研究大学); Galgotias College of Engineering Technology(加尔各答工程与技术学院); NIT Jamshedpur(印度理工学院贾姆谢德普尔分校); BITS-Pilani(比尔拉理工学院); APPCAIR(先进电力与计算机智能研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Railway track intrusions pose a critical safety challenge for Indian Railways, encompassing wildlife incursions and deliberate malicious obstructions. The December 2025 collision in Assam, in which seven elephants were killed by the Rajdhani Express, underscores the urgency of effective real-time detection. Existing solutions such as the optical fiber-based Gajraj system suffer from prohibitive costs ($1000/km) and high false alarm rates, limiting deployment to only 20 of India’s 101 elephant corridors. This paper proposes NETRA, a cost-effective, internet-independent intrusion detection system deployed on Raspberry Pi Zero W and Raspberry Pi 4 edge platforms. NETRA employs probabilistic sensor fusion integrating a PIR motion sensor and an HC-SR04 ultrasonic distance sensor with a tunable threshold (tau_c = 0.65), enabling event-driven camera activation that reduces unnecessary visual processing by 52%. Upon confirmed intrusion, edge-AI classification using MobileNet-SSD (Pi Zero) or YOLOv5 ONNX (Pi 4) identifies threats including humans, large animals, and track obstructions. Confirmed threats are transmitted via LoRa (868 MHz) to alert the locomotive driver within 2.4 seconds end-to-end. Experimental evaluation across 113 motion events demonstrated 95% detection accuracy with zero false alarms through probabilistic fusion, compared to 85% for binary methods. Raspberry Pi 4 with YOLOv5 achieved 83.5% elephant F1-score, a 5.6x improvement over Pi Zero’s heuristic approach (14.8%). LoRa communication achieved 100% packet delivery across 1-2 km in field trials. NETRA reduces deployment cost by 75% ($247/km vs $1000/km for Gajraj) while providing unified detection of both wildlife and obstruction threats.
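The event-driven fusion logic can be sketched as two per-sensor probabilities combined into a single confidence that is compared against tau_c = 0.65 before the camera is activated. The per-sensor probability models and weights below are illustrative assumptions, not the deployed calibration.

```python
def pir_probability(motion_detected: bool) -> float:
    # assumed detection / false-alarm rates for the PIR sensor (placeholder values)
    return 0.85 if motion_detected else 0.05

def ultrasonic_probability(distance_cm: float, track_clearance_cm: float = 300.0) -> float:
    # closer-than-expected echoes suggest an object on or near the track
    return max(0.0, min(1.0, 1.0 - distance_cm / track_clearance_cm))

def fused_confidence(motion: bool, distance_cm: float, w_pir: float = 0.5) -> float:
    return w_pir * pir_probability(motion) + (1 - w_pir) * ultrasonic_probability(distance_cm)

TAU_C = 0.65
for motion, dist in [(True, 80.0), (True, 280.0), (False, 60.0)]:
    conf = fused_confidence(motion, dist)
    action = "activate camera + classify" if conf >= TAU_C else "stay idle"
    print(f"motion={motion} distance={dist}cm -> confidence={conf:.2f}: {action}")
```

Only the first case, where both cues agree, clears the threshold, which is the behaviour that lets the system skip unnecessary visual processing.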

[CV-325] When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models

【速读】:该论文旨在解决生成式视觉-语言模型(Vision-Language Models, VLMs)在高风险应用场景中频繁出现幻觉(hallucination)的问题,即模型在输入图像中不存在内容的情况下仍自信地生成错误描述。研究表明,这类失败模式的根本原因在于解码器架构导致的几何过对齐(geometric over-alignment):为了满足注意力机制对模态间隙的桥接需求,VLMs 将视觉嵌入过度对齐至文本流形(text manifold),从而引入统计学上的语言偏差,系统性地掩盖了细粒度的视觉证据。解决方案的关键在于首次定量刻画了这种过对齐现象——发现语言偏差集中于一个通用且数据集无关的文本子空间的前主成分中,并据此提出两种互补方法:一种无需训练的推理阶段干预策略,另一种面向偏置感知的微调范式,二者均通过显式投影去除该子空间对视觉表示的影响。实验表明,这些方法在 POPE、CHAIR 和 AMBER 基准上显著降低幻觉率,并在长文本描述任务 CLAIR 上提升评分,其中无训练变体不增加任何计算开销。

链接: https://arxiv.org/abs/2605.08245
作者: Harshvardhan Saini,Samyak Jha,Yiming Tang,Dianbo Liu
机构: Indian Institute of Technology Dhanbad; National University of Singapore
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) increasingly power high-stakes applications, from medical imaging to autonomous systems, yet they routinely hallucinate, confidently describing content not present in the input. We investigate the root causes of these failure modes with a mechanistic analysis focusing on the decoder-based VLMs. We trace these failure modes to a geometric over-alignment: to bridge the modality gap required by attention mechanisms, decoder-based VLMs over-align visual embeddings with the text manifold, injecting a statistical linguistic bias that systematically overshadows fine-grained visual evidence. While prior work either aggressively closes this gap or suppresses hallucinations through expensive black-box decoding strategies, none addresses the underlying geometric cause. We provide the first quantitative characterization of this over-alignment, demonstrating that linguistic bias concentrates in the top principal components of a universal, dataset-agnostic text subspace. Building on this insight, we propose two complementary remedies: a training-free inference strategy and a bias-aware fine-tuning paradigm, both of which explicitly project out this subspace from visual representations. Our methods significantly reduce hallucinations across POPE, CHAIR, and AMBER benchmarks, and improve CLAIR scores on long-form captioning tasks, with the training-free variant adding no computational overhead over the base model.
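The debiasing step has a direct linear-algebra reading: estimate the top principal directions of a pool of text embeddings and subtract that subspace from the visual embeddings. The number of components and the random data below are placeholders; the paper identifies the subspace from a universal, dataset-agnostic text pool.

```python
import numpy as np

def text_subspace(text_emb: np.ndarray, k: int = 8) -> np.ndarray:
    """Top-k principal directions of the text embedding cloud, as an orthonormal (k, D) basis."""
    X = text_emb - text_emb.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k]

def project_out(vis_emb: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove the linguistic-bias subspace from visual embeddings."""
    return vis_emb - (vis_emb @ basis.T) @ basis

rng = np.random.default_rng(0)
text = rng.normal(size=(1000, 512))       # stand-in for a large text embedding pool
vision = rng.normal(size=(32, 512))       # stand-in for visual token embeddings
B = text_subspace(text, k=8)
debiased = project_out(vision, B)
print(np.abs(debiased @ B.T).max())       # ~0: no remaining component in the removed subspace
```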

[CV-326] TinySSL: Distilled Self-Supervised Pretraining for Sub-Megabyte MCU Models

【速读】:该论文旨在解决自监督学习(Self-supervised Learning, SSL)在微控制器(Microcontroller, MCU)类小模型(参数量少于500K)中失效的问题。针对此类模型存在的三个关键障碍——投影头主导效应、表征瓶颈和增强敏感性,作者提出了一种容量感知的蒸馏自监督学习框架(Capacity-Aware Distilled Self-Supervised Learning, CA-DSSL)。其核心解决方案是采用冻结的DINO ViT-S/16教师模型进行异构蒸馏,结合多尺度特征蒸馏以保留空间表征,并引入渐进式增强课程策略,在无需标签或文本监督的前提下显著提升小模型的表征能力。实验表明,CA-DSSL在CIFAR-100上线性探测准确率达62.7%(3次种子均值),优于同类方法(如SimCLR-Tiny高出18个百分点),且模型部署仅需378 KB(INT8量化),无推理开销。

链接: https://arxiv.org/abs/2605.08241
作者: Bibin Wilson
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Self-supervised learning (SSL) has transformed representation learning for large models, yet remains unexplored for microcontroller (MCU)-class models with fewer than 500K parameters. We identify three obstacles at this scale – projection head dominance, representation bottleneck, and augmentation sensitivity – and propose Capacity-Aware Distilled Self-Supervised Learning (CA-DSSL), a teacher-guided framework that overcomes them without labels or text supervision. CA-DSSL combines asymmetric distillation from a frozen DINO ViT-S/16 teacher, multi-scale feature distillation for spatial representations, and a progressive augmentation curriculum. On a MobileNetV2-0.35 backbone (396K parameters) pretrained on CIFAR-100, CA-DSSL reaches 62.7 ± 0.5% linear-probe accuracy (3-seed mean) – surpassing SimCLR-Tiny by 18 pp, matching SEED (61.7%) with 10× fewer projection parameters (426K vs. 3.15M), and reaching 94.0% of a supervised upper bound. Standard SSL methods (BYOL-Tiny, DINO-Tiny) collapse entirely at this scale. On Pascal VOC detection, CA-DSSL achieves 2.3× the mAP of random initialization and +3 pp over SEED, though SimCLR-Tiny matches CA-DSSL on detection mAP. The deployed backbone occupies 378 KB (INT8) with no inference overhead from pretraining. Preliminary ImageNet-100 experiments reveal that CA-DSSL’s advantage is specific to small-data regimes; scaling to ImageNet-1K is discussed as future work.
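The teacher-guided part of the recipe can be sketched as a cosine feature-distillation loss between a small student (plus a projection head) and a frozen teacher, with no labels involved. The backbone stubs, dimensions, and training loop below are placeholders rather than the actual MobileNetV2-0.35 / DINO ViT-S/16 pair.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU(), nn.Linear(256, 128))
projector = nn.Linear(128, 384)                  # maps student space to the teacher dimension
teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 384))  # stand-in for a frozen DINO teacher
for p in teacher.parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW(list(student.parameters()) + list(projector.parameters()), lr=1e-3)
for step in range(5):                            # toy loop on random "images"
    x = torch.randn(64, 3, 32, 32)
    with torch.no_grad():
        t = teacher(x)                           # frozen teacher embeddings
    s = projector(student(x))
    loss = 1.0 - F.cosine_similarity(s, t, dim=-1).mean()   # asymmetric, label-free distillation loss
    opt.zero_grad(); loss.backward(); opt.step()
    print(f"step {step}: distill loss {loss.item():.3f}")
```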

[CV-327] Resource-Aware Evolutionary Neural Architecture Search for Cardiac MRI Segmentation

【速读】:该论文旨在解决心脏磁共振成像(Cardiac Magnetic Resonance, CMR)中左心室和右心室分割的准确性与效率难题,尤其针对组织对比度低、边界模糊及扫描间变异大等挑战。其核心解决方案是提出CardiacNAS框架,这是一种资源感知的神经架构搜索(Neural Architecture Search, NAS)方法,通过构建一个类似UNet的超网络(supernet)并设计涵盖深度、宽度、卷积核大小、滤波器尺寸、注意力机制、特征融合、激活函数、丢弃率及残差缩放等多个维度的专用搜索空间,实现对模型性能(以Dice相似系数DSC和95% Hausdorff距离HD95衡量)与计算资源消耗(参数量和浮点运算次数FLOPs)的联合优化。关键创新在于将演化策略(交叉、变异、精英选择)与代理预算训练相结合,在固定算力约束下自动发现兼具高精度与高效性的最优分割结构,最终在ACDC数据集上实现了93.22%平均DSC和4.73 mm HD95的性能,同时仅需3.58M参数和14.56 GFLOPs,验证了其在临床部署中的可行性与优越性。

链接: https://arxiv.org/abs/2605.08238
作者: Farhana Yasmin,Mahade Hasan,Haipeng Liu,Amjad Ali,Ghulam Muhammad,Yu Xue
机构: Nanjing University of Information Science and Technology (南京信息工程大学); Eastern University (东大学); Coventry University (考文垂大学); Muscat University (马斯喀特大学); King Saud University (沙特国王大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Cardiac magnetic resonance (CMR) segmentation underpins quantitative assessment of ventricular structure and function, yet reliable delineation remains difficult due to low tissue contrast, fuzzy boundaries, and inter-scan variability. We present CardiacNAS, an evolutionary neural architecture search (NAS) framework that couples a UNet like supernet with a cardiac aware search space spanning depth, width, kernel size, filter size, attention, fusion, activation, dropout, and residual scaling. The search is explicitly resource aware, jointly optimizing dice similarity coefficient (DSC) and 95th percentile Hausdorff distance (HD95) versus model size and floating point operations (FLOPs) under fixed compute budgets. Candidate architectures are instantiated from the supernet, trained with proxy budgets, and evolved through crossover, mutation, and elitist selection. We evaluate on the ACDC dataset and compare against six state of the art methods, using qualitative comparisons, learning curve analyses, and design factor correlation studies. The resulting model attains 93.22% average DSC and 4.73 mm HD95 with 3.58M parameters and 14.56 GFLOPs, demonstrating a favorable accuracy efficiency trade off. Analyses indicate that searched attention and fusion choices, together with residual scaling, contribute to improved boundary fidelity and stability. CardiacNAS offers a principled, resource aware approach to deployable CMR segmentation with transparent reporting of architectural complexity and compute budgets.
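The resource-aware evolutionary loop can be illustrated with a toy genome of architecture choices, a fitness that trades a simulated accuracy proxy against a FLOPs estimate, and elitism plus crossover and mutation. The fitness function below is a stand-in; in the paper, candidates are actually instantiated from the supernet and trained under proxy budgets.

```python
import random

SPACE = {"depth": [3, 4, 5], "width": [16, 32, 48], "kernel": [3, 5, 7], "attention": [0, 1]}

def random_genome():
    return {k: random.choice(v) for k, v in SPACE.items()}

def fitness(g):
    flops = g["depth"] * g["width"] * g["kernel"] ** 2 / 100.0   # crude cost model (placeholder)
    dice_proxy = 0.80 + 0.01 * g["depth"] + 0.02 * g["attention"] + 0.001 * g["width"]
    return dice_proxy - 0.02 * flops                             # accuracy vs. efficiency trade-off

def mutate(g):
    child = dict(g)
    k = random.choice(list(SPACE))
    child[k] = random.choice(SPACE[k])
    return child

def crossover(a, b):
    return {k: random.choice([a[k], b[k]]) for k in SPACE}

pop = [random_genome() for _ in range(12)]
for gen in range(10):
    pop.sort(key=fitness, reverse=True)
    elite = pop[:4]                                              # elitist selection
    pop = elite + [mutate(crossover(*random.sample(elite, 2))) for _ in range(8)]
best = max(pop, key=fitness)
print(best, round(fitness(best), 3))
```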

[CV-328] SPECTRA-Net: Scalable Pipeline for Explainable Cross-domain Tensor Representations for AI-generated Images Detection

【速读】:该论文旨在解决AI生成图像(AIGI)日益增长对数字信息完整性的威胁问题,尤其针对人类观察者和现有检测模型难以应对生成模型不断升级所带来的挑战。解决方案的关键在于提出SPECTRA-Net,一个可扩展的、面向解释性的跨域张量表示检测管道,其核心是融合多视角图像表征:包括来自视觉基础模型(VFM)的全局语义特征、频谱分析、基于局部patch的异常检测以及统计描述符。通过整合这些互补的数据流,SPECTRA-Net在域内与跨域场景下均实现最优检测性能,兼具高准确率与强泛化能力,并通过定位伪造痕迹提供可解释性,从而提升真实应用场景中内容验证的可信度与可靠性。

链接: https://arxiv.org/abs/2605.08226
作者: Sarra Arab,Anfal Achouri,Seif Eddine Bouziane
机构: The National School of Artificial Intelligence (国家人工智能学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 2 figures, submitted to a journal

点击查看摘要

Abstract:The rapid proliferation of AI-generated images (AIGI) presents a significant challenge to digital information integrity. While human observers and existing detection models struggle to keep pace with the increasing sophistication of generative models, the need for robust, real-time detection systems has become critical. This paper introduces SPECTRA-Net, a scalable pipeline for explainable, cross-domain tensor representations for AIGI detection. Our approach leverages a multi-view representation of images, combining global semantic features from a Vision Foundation Model (VFM), spectral analysis, local patch-based anomaly detection, and statistical descriptors. By fusing these complementary data streams, SPECTRA-Net achieves state-of-the-art performance in both in-domain and cross-domain settings, demonstrating high accuracy and generalization capabilities across a wide range of challenging datasets, including WildFake, Chameleon, and RRDataset. The proposed pipeline not only provides a robust solution for AIGI detection but also offers explainability through artifact localization, paving the way for more trustworthy and reliable content verification in real-world applications.

[CV-329] Deep Dreams Are Made of This: Visualizing Monosemantic Features in Diffusion Models

【速读】:该论文旨在解决生成式 AI(Generative AI)中 latent diffusion models 的可解释性问题,即如何清晰地理解模型内部各层特征所激活的具体语义概念。其解决方案的关键在于提出了一种名为 latent visualization by optimization (LVO) 的机制可解释性方法,通过引入稀疏自编码器(sparse autoencoders, SAEs)对多语义(polysemantic)层表示进行解耦,提取出单语义(monosemantic)特征,并结合潜空间优化、时间步活动分析、匹配调度的噪声注入、通过特征引导的先验初始化及适当的正则化策略,在 Stable Diffusion 1.5 上实现了对图像生成过程中关键概念(如对角构图、人物、玫瑰等)的可视化。该方法相较于传统数据集对比或特征引导(steering)提供了更直接的特征激活机制洞察。

链接: https://arxiv.org/abs/2605.08218
作者: Adam Szokalski,Mateusz Modrzejewski
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper proposes latent visualization by optimization (LVO), a mechanistic interpretability technique that extends feature visualization by optimization - originally developed for convolutional neural networks - to latent diffusion models. LVO employs sparse autoencoders (SAEs) to disentangle polysemantic layer representations into monosemantic features. Key contributions include latent-space optimization, time-step activity analysis, schedule-matched noise injection, prior initialization through feature steering, and suitable regularization strategies. We demonstrate the method on Stable Diffusion 1.5 fine-tuned on the Style50 dataset, showing that SAE features produce clear visualizations of recognizable concepts - including diagonal compositions, human figures, roses, cables, and waterfall foam - that correlate with dataset examples, while the baseline without disentanglement produces less coherent results. We further show that regularization techniques from pixel-space feature visualization transfer to the latent domain, though they require different configurations for the raw-layer and SAE variants. Compared to dataset examples and steering, LVO provides complementary insights by directly revealing what activates a feature rather than its downstream effects.

[CV-330] st-Time Training for Visual Foresight Vision-Language-Action Models

【速读】:该论文旨在解决视觉前视(Visual Foresight, VF)类视觉语言动作模型(VLA)在分布外(out-of-distribution, OOD)场景下性能显著下降的问题。由于VF-VLA的行动质量高度依赖于未来视觉信息预测的准确性,OOD扰动会同时影响其预测与执行两个阶段,从而导致系统鲁棒性不足。解决方案的关键在于提出一种测试时训练(Test-Time Training, T³)方法——T³ VF,其核心思想是利用预测的未来图像与其后续观测结果构成自然监督对,实现在线自适应优化;同时引入一种自适应更新过滤机制,以避免在测试阶段进行无差别更新带来的不稳定性,从而在无需修改架构或添加辅助模块的前提下,有效缓解OOD脆弱性,且仅带来适度的推理开销。

链接: https://arxiv.org/abs/2605.08215
作者: Sangwu Park,Wonjoong Kim,Yeonjun In,Sein Kim,Hongseok Kang,Chanyoung Park
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Preprint. Under review

点击查看摘要

Abstract:Visual Foresight VLA (VF-VLA) has become a prominent architectural choice in the recent VLA due to its impressive performance. Nevertheless, the inherent design of VF-VLA makes it particularly vulnerable to out-of-distribution (OOD) shifts. Because the quality of action directly depends on the accuracy of the predicted future visual information, OOD conditions affect both stages at once. To address this vulnerability, we propose Test-Time Training Visual Foresight VLA (T^3 VF), a test-time training approach motivated by the observation that the predicted future image and its subsequent observation form a natural supervision pair. To further address the practical challenges that arise from indiscriminate test-time updates, we introduce an adaptive update filtering mechanism. Empirically, T^3 VF mitigates the OOD vulnerability of VF-VLA at a modest additional inference cost, without requiring any architectural modification or auxiliary modules.
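The supervision signal described above, the predicted future frame versus the frame actually observed later, can be sketched as a filtered test-time update: compute the self-supervised error and apply a gradient step only when the error falls inside an acceptance band. The model, thresholds, and data below are illustrative placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

foresight = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # stand-in for the visual-foresight head
opt = torch.optim.SGD(foresight.parameters(), lr=1e-3)

def test_time_step(obs_t: torch.Tensor, obs_next: torch.Tensor, lo: float = 0.01, hi: float = 0.5) -> bool:
    pred_next = foresight(obs_t)
    loss = F.mse_loss(pred_next, obs_next)               # prediction vs. subsequent observation
    if lo < loss.item() < hi:                            # adaptive filtering: skip trivial or outlier errors
        opt.zero_grad(); loss.backward(); opt.step()
        return True
    return False

frames = torch.rand(5, 1, 3, 64, 64)                     # a short hypothetical observation stream
for t in range(4):
    updated = test_time_step(frames[t], frames[t + 1])
    print(f"t={t}: updated={updated}")
```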

[CV-331] Low-Cost Stereo Vision for Robust 3D Positioning of Thin Radiata Pine Branches in Autonomous Drone Pruning

【速读】:该论文旨在解决人工修剪辐射松(Radiata Pine)树种时存在的高风险、高劳动强度及劳动力短缺问题,同时克服现有自主修剪平台依赖昂贵传感器(如LiDAR)且仅能处理粗枝的问题,从而推动其在林业中的广泛应用。解决方案的关键在于:利用单个低成本双目相机(ZED Mini)实现对直径低至10 mm细枝的精确检测与三维定位,通过两个阶段的处理流程——基于Mask R-CNN和YOLOv8/v9的分支分割与多种深度估计方法(包括传统SGBM与多款深度学习模型)的对比评估,创新性地将分割掩膜与视差图结合,采用基于质心的三角测量算法与中位数绝对偏差(Median-Absolute-Deviation, MAD)异常值剔除策略,生成鲁棒的枝条到相机距离信息,有效应对森林场景中纹理稀疏、结构细长和视差噪声等挑战。

链接: https://arxiv.org/abs/2605.08213
作者: Yida Lin,Bing Xue,Mengjie Zhang,Sam Schofield,Richard Green
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Manual pruning of radiata pine, a species of major economic importance to New Zealand forestry, is hazardous, labour-intensive, and increasingly constrained by workforce shortages. Existing autonomous pruning platforms typically rely on expensive sensors such as LiDAR and are limited to thick branches, which restricts their wider adoption. This paper investigates whether a single low-cost stereo camera mounted on a drone can provide sufficiently accurate branch detection and three-dimensional positioning to support autonomous pruning of branches as thin as 10 mm, thereby removing the need for auxiliary depth sensors. The proposed pipeline comprises two stages: branch segmentation and depth estimation. For segmentation, Mask R-CNN variants and the YOLOv8 and YOLOv9 families are compared on a custom dataset of 71 stereo image pairs captured with a ZED Mini camera; YOLOv8 and YOLOv9 are selected as representative state-of-the-art real-time segmentors at the time of data collection, and the framework is designed to remain compatible with newer YOLO releases. For depth estimation, a traditional method (SGBM with WLS filtering) and deep-learning-based methods (PSMNet, ACVNet, GWCNet, MobileStereoNet, RAFT-Stereo, and NeRF-Supervised Deep Stereo) are evaluated, including cross-dataset fine-tuning experiments that expose the domain gap between urban driving benchmarks and natural forestry scenes. The main novelty of this work lies in coupling stereo segmentation with a centroid-based triangulation algorithm and Median-Absolute-Deviation outlier rejection that converts a segmentation mask and disparity map into a single robust branch-to-camera distance, addressing the challenges of sparse texture, thin structures, and noisy disparity values typical of forest scenes. Qualitative evaluations at distances of 1-2 m show that the learning-based stereo methods produce more coherent depth es…
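The mask-plus-disparity distance estimate can be sketched as follows: gather disparities inside the branch mask, reject outliers with a Median-Absolute-Deviation test, and convert the surviving median disparity to metric depth via Z = f·B/d. This simplified sketch uses the median of the masked disparities rather than a full centroid triangulation, and the focal length and baseline are illustrative values for a ZED-Mini-like rig.

```python
import numpy as np

def branch_distance(disparity: np.ndarray, mask: np.ndarray,
                    fx: float = 700.0, baseline_m: float = 0.063, k: float = 3.0) -> float:
    d = disparity[mask]
    d = d[d > 0]                                     # drop invalid disparities
    med = np.median(d)
    mad = np.median(np.abs(d - med)) + 1e-6
    inliers = d[np.abs(d - med) <= k * mad]          # MAD outlier rejection
    return fx * baseline_m / np.median(inliers)      # branch-to-camera distance in metres

rng = np.random.default_rng(0)
disp = np.full((80, 120), 5.0) + rng.normal(0, 0.2, (80, 120))   # background disparity
disp[10:20, 10:30] = 29.4                            # branch region, ~1.5 m for these fx/B values
disp[12, 12] = 200.0                                 # a noisy spike the MAD test should reject
m = np.zeros_like(disp, dtype=bool); m[10:20, 10:30] = True
print(round(branch_distance(disp, m), 2), "m")       # ~1.5
```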

[CV-332] Harmonized Feature Conditioning and Frequency-Prompt Personalization for Multi-Rater Medical Segmentation CVPR2026

【速读】:该论文旨在解决多标注者医学图像分割中因临床解读差异导致的模型过自信与校准不足问题,即现有方法常将标注多样性简化为共识标签或视为噪声,忽略了专家间的真实变异性和成像设备带来的伪影干扰。其核心解决方案是提出一种统一的概率框架,通过自适应特征条件化和频域个性化机制实现对扫描仪特异性伪影与标注者风格差异的解耦建模:轻量级Harmonizer Network隐式学习设备相关伪影并动态调制特征以标准化潜在表示,确保不确定性反映解剖结构而非噪声;同时引入高频提示模块(High-Frequency Prompt Modules)在频域编码标注者特有的边界精度与纹理敏感性,自适应调制和谐后的特征以生成个性化且解剖一致的分割结果;此外,采用广义能量距离(Generalized Energy Distance)正则化使生成分布匹配实际标注变异性,在专家分歧处保留多样性、在共识区域强化一致性,从而提升模型在噪声案例中的Dice分数与不确定性校准能力。

链接: https://arxiv.org/abs/2605.08210
作者: Sanaz Karimijafarbigloo,Armin Khosravi,Alireza Kheyrkhah,Reza Azad,Mauricio Reyes,Dorit Merhof
机构: University of Regensburg (雷根斯堡大学); Sharif University of Technology (沙里夫理工大学); Iran University of Science and Technology (伊朗科学技术大学); University of Bern (伯尔尼大学); Fraunhofer Institute for Digital Medicine MEVIS (弗劳恩霍夫数字医学MEVIS研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in main CVPR 2026

点击查看摘要

Abstract:Multi-rater medical image segmentation captures the inherent ambiguity of clinical interpretation, where diagnostic boundaries vary across experts and imaging devices. Existing approaches often reduce this diversity to consensus labels or treat rater differences as noise, resulting in overconfident and poorly calibrated models. We propose a harmonized probabilistic framework that disentangles acquisition artifacts from genuine annotator variability through adaptive feature conditioning and frequency-domain personalization. A lightweight Harmonizer Network implicitly models scanner-specific artifacts and performs dynamic feature modulation to standardize latent representations, ensuring that uncertainty reflects anatomy rather than noise. To represent rater-specific styles, we introduce a novel High-Frequency Prompt Modules that operate in the spectral domain to encode annotator-dependent boundary precision and textural sensitivity. These prompts adaptively modulate harmonized features to produce personalized yet anatomically consistent segmentations. Furthermore, a Generalized Energy Distance based regularization aligns the generative distribution with empirical annotation variability, promoting diversity where experts disagree and consensus where they converge. Experiments on LIDC-IDRI and NPC-170 show SOTA aggregated and individualized segmentation, with notable GED reductions and improved Dice scores, especially on noisy cases. Beyond accuracy, the model exhibits clinically meaningful uncertainty. Confidence rises in agreement regions and declines in ambiguous areas, supporting its use as a reliable and interpretable tool for multi-expert clinical workflows.
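The Generalized Energy Distance regularizer compares the distribution of model samples with the set of rater annotations as GED^2 = 2·E[d(S,Y)] - E[d(S,S')] - E[d(Y,Y')]. A small reference implementation with d = 1 - IoU is given below on synthetic binary masks; the masks and sample counts are placeholders.

```python
import numpy as np

def iou_distance(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return 1.0 - (inter / union if union > 0 else 1.0)

def generalized_energy_distance(samples: list[np.ndarray], raters: list[np.ndarray]) -> float:
    cross = np.mean([iou_distance(s, y) for s in samples for y in raters])
    within_s = np.mean([iou_distance(s, t) for s in samples for t in samples])
    within_y = np.mean([iou_distance(y, z) for y in raters for z in raters])
    return 2 * cross - within_s - within_y

rng = np.random.default_rng(0)
raters = [rng.random((32, 32)) > 0.6 for _ in range(4)]      # divergent expert masks (toy)
samples = [rng.random((32, 32)) > 0.6 for _ in range(4)]     # model samples (toy)
print(round(generalized_energy_distance(samples, raters), 3))
```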

[CV-333] A Breast Vision Pathology Foundation Model for Real-world Clinical Utility

【速读】:该论文旨在解决生成式 AI (Generative AI) 在乳腺癌病理诊断中从回顾性研究向临床实际应用转化的挑战,即验证病理基础模型是否能够支持临床相关用途。其核心问题在于现有模型虽在回顾性数据上表现优异,但缺乏在真实临床流程中的有效性与安全性证据。解决方案的关键在于开发并全面评估一个名为BRAVE的乳腺适应性病理基础模型,该模型基于来自亚、欧、北美32个来源的101,638张乳腺全切片图像(Whole-Slide Images, WSI),并在涵盖术前活检、术中冰冻切片和术后切除标本的34项任务、82个队列中进行系统验证,包括回顾性基准测试、临床挑战场景、工作流导向的影响模拟、前瞻性观察验证以及路径学家与AI交互的交叉研究。结果显示,BRAVE可在多个阶段安全排除低风险病例、辅助复核漏诊阳性样本,并优先处理需进一步评估的病例,在三个中心的前瞻性验证中实现了高阴性预测值(NPV)及显著提升的诊断准确性(平衡准确率从88.5%升至95.1%,OR=3.14, P<0.001),且衍生评分可独立预测无病生存期(调整HR=4.79, P<0.001)和总生存期(调整HR=8.14, P<0.001)。

链接: https://arxiv.org/abs/2605.08207
作者: Yingxue Xu,Zhengyu Zhang,Xiuming Zhang,Mengwei Xu,Fengtao Zhou,Yihui Wang,Jiabo Ma,Yi Xin,Danyi Li,Chengyu Lu,Zhijian Cen,Ying Tan,Qingbing Yao,Qi Wang,Zizhao Gao,Yong Zhang,Jingjing Chen,Feifei Liu,Qian Xu,Yi Dai,Hongxuan Tan,Cheng Jin,Huajun Zhou,Zhengrui Guo,Ling Liang,Hongyi Wang,Yingcong Chen,Xi Wang,Zhenhui Li,Ronald Cheong Kin Chan,Ning Mao,Muyan Cai,Zhe Wang,Li Liang,Hao Chen
机构: The Hong Kong University of Science and Technology (香港科技大学); Southern Medical University (南方医科大学); Zhejiang University (浙江大学); Fourth Military Medical University (第四军医大学); Shandong Technology and Business University (山东工商学院); Qingdao University (青岛大学); Yantai Yuhuangding Hospital (烟台毓璜顶医院); Dalian University of Technology (大连工业大学); China Medical University (中国医科大学); Liaoning Cancer Hospital Institute (辽宁省肿瘤医院); Peking University Shenzhen Hospital (北京大学深圳医院); The Chinese University of Hong Kong (香港中文大学); Sun Yat-sen University Cancer Center (中山大学肿瘤中心); Kunming Medical University (昆明医科大学); Hong Kong University of Science and Technology (香港科技大学); HKUST Shenzhen-Hong Kong Collaborative Innovation Research Institute (港科大深港协同创新研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 60 pages

点击查看摘要

Abstract:Pathology foundation models have shown strong retrospective performance, but whether such systems can support clinically relevant use remains unclear. This challenge is particularly important in breast cancer, where pathological assessment serves as the gold standard for diagnosis and guides treatment planning, surgical decision-making and risk stratification across pre-, intra- and post-operative stages. Here we present BRAVE, a breast-adaptive pathology foundation model developed and evaluated using a total resource of 101,638 breast whole-slide images from 32 sources across Asia, Europe and North America. We assessed BRAVE across 34 tasks in 82 cohorts spanning pre-operative biopsy, intra-operative frozen section and post-operative resection, using an evidence chain comprising retrospective benchmarking, clinically challenging scenarios, workflow-oriented clinical impact simulations, prospective observational validation with the thresholds locked in the retrospective cohorts and crossover pathologist-AI interaction studies. Across these settings, BRAVE supported practical roles in the clinical workflow, including safe exclusion of low-risk cases from routine review, AI-assisted second-review rescue of initially missed positives and prioritization of cases for further assessment. In prospective validation across three centres, BRAVE excluded 76.9% of negative biopsy cases (NPV 0.953) and 70.1% of negative frozen-section cases (NPV 0.973), and triaged 78.8% of post-operative subtyping cases as high-confidence clear-cut cases (NPV 1.000). In reader studies, AI assistance improved balanced accuracy from 88.5% to 95.1% (OR 3.14, P<0.001), with better efficiency, confidence and inter-rater agreement. BRAVE-derived scores also independently predicted disease-free survival (adjusted HR 4.79, P<0.001) and overall survival (adjusted HR 8.14, P<0.001).

[CV-334] Weakly Supervised Concept Learning for Object-centric Visual Reasoning

【速读】:该论文旨在解决神经符号系统中感知阶段标签成本高昂的问题,尤其是在对象中心推理任务中,如何在减少监督信号的同时实现可解释的符号化表示。其解决方案的关键在于提出一种高效的弱监督机制:结合基于槽位(slot-based)的架构以增强对象中心性,并引入变分自编码器(Variational Autoencoder, VAE)进行自监督学习,通过潜在空间中的概念引导(concept guidance)实现人类可解释的符号接地(symbol grounding)。该方法将感知输出转化为可用于归纳逻辑编程(Inductive Logic Programming, ILP)、决策树和贝叶斯网络等推理框架的符号背景知识,在仅需1%标签的情况下仍能发现复杂抽象规则,并在域偏移下保持鲁棒性,显著优于当前主流基础模型在领域泛化上的表现。

链接: https://arxiv.org/abs/2605.08201
作者: Sparsh Tiwari,Bettina Finzel,Gesina Schwalbe
机构: University of Lübeck, Germany (吕贝克大学); University of Bamberg, Germany (班贝格大学); University of Ulm, Germany (乌尔姆大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Neurosymbolic systems promise to combine deep neural network’s (DNN) processing of raw sensor inputs with few-shot performance of symbolic artificial intelligence. Two-stage approaches explicitly decouple DNN based perception from subsequent rule based reasoning. This avoids optimization and interpretability issues of end to end differentiable approaches, but requires costly labels for the perception output. This paper introduces an efficient weak supervision scheme for the perception stage to ground its output symbols for logical induction in object-centric reasoning tasks. It combines a slot-based architecture for object-centricity with a Variational Autoencoder (VAE) for self-supervision, competing with concept guidance on latent dimensions for human interpretable grounding. The resulting predictions are translated into symbolic background knowledge for reasoning frameworks, such as Inductive Logic Programming (ILP), Decision Trees, and Bayesian Networks. Our extensive empirical evaluation on synthetic and real world datasets shows that our approach can discover complex, abstract rules for object centric reasoning whilst reducing supervision to as little as 1% of labels, and being robust even under substantial domain shift. Notably, at 1% supervision it even outperforms state of the art foundation model baselines in domain generalization

[CV-335] Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention Hidden States and Causal Circuits ICLR2026

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)中一个广泛存在的直觉假设——即注意力图(attention map)越清晰、聚焦于查询区域时,模型输出的可信度越高。作者通过构建统一的机制化分析管道VLM可靠性探测器(VLM Reliability Probe, VRP),对三个开源权重的VLM家族(LLaVA-1.5、PaliGemma、Qwen2-VL)进行系统性验证,发现注意力结构几乎无法预测模型正确性(皮尔逊相关系数接近零),而隐藏状态几何特征和层间边际形成则具有更强的可解释性和预测能力。解决方案的关键在于:将可靠性从注意力图的表征转移到隐藏状态空间中的线性可分信号与稀疏晚期电路,并揭示了不同架构(如早期融合 vs. 晚期融合)在鲁棒性分布上的本质差异,从而为模型设计提供更可靠的监控指标。

链接: https://arxiv.org/abs/2605.08200
作者: Logan Mann,Ajit Saravanan,Ishan Dave,Shikhar Shiromani,Saadullah Ismail,Yi Xia,Emily Huang
机构: UC Santa Barbara (加州大学圣塔芭芭拉分校); UC Berkeley (加州大学伯克利分校); NVIDIA (英伟达); Algoverse AI Research (Algoverse人工智能研究); Brown University (布朗大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 15 pages, 4 figures, 10 tables. Accepted at the ICLR 2026 Workshop on Multimodal Reasoning. Code and probe-training pipelines: this https URL

点击查看摘要

Abstract:A pervasive intuition holds that vision-language models (VLMs) are most trustworthy when their attention maps look sharp: concentrated attention on the queried region should imply a confident, calibrated answer. We test this Attention-Confidence Assumption directly. We instrument three open-weight VLM families (LLaVA-1.5, PaliGemma, Qwen2-VL; 3-7B parameters) with a unified mechanistic pipeline – the VLM Reliability Probe (VRP) – that compares attention structure, generation dynamics, and hidden-state geometry against a single correctness label. Three results emerge. (i) Attention structure is a near-zero predictor of correctness (R_pb(C_k,y)=0.001, 95% CI [-0.034,0.036]; R_pb(H_s,y)=-0.012, [-0.047,0.024] on a pooled n=3,090 split), even though attention remains causally necessary for feature extraction (top-30% patch masking drops accuracy by 8.2-11.3 pp, p<0.001). (ii) Reliability becomes legible later in the computation: a single hidden-state linear probe reaches AUROC>0.95 on POPE for two of three families, and self-consistency at K=10 is the strongest behavioral predictor we measure at 10x inference cost (R_pb=0.43). (iii) Causal neuron-level ablations expose a sharp architectural split with direct monitor-design implications: late-fusion LLaVA concentrates reliability in a fragile late bottleneck (-8.3 pp object-identification accuracy after top-5 probe-neuron ablation), whereas early-fusion PaliGemma and Qwen2-VL distribute it widely and absorb destruction of ~50% of their peak-layer hidden dimension with ≤1 pp degradation. The takeaway is narrow but consequential: in 3-7B VLMs, reliability is read more reliably off hidden-state geometry, layer-wise margin formation, and sparse late-layer circuits than off attention-map sharpness.
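Two of the diagnostics named above are easy to reproduce in miniature: a point-biserial correlation between a scalar signal and binary correctness, and a linear probe on hidden states scored with AUROC. The synthetic data below only illustrates the mechanics, not the reported numbers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def point_biserial(score: np.ndarray, correct: np.ndarray) -> float:
    # equivalent to a Pearson correlation between a continuous score and a 0/1 label
    return float(np.corrcoef(score, correct.astype(float))[0, 1])

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)                        # correctness labels
attention_sharpness = rng.normal(size=2000)              # uninformative signal -> r ~ 0
hidden = rng.normal(size=(2000, 64)) + 1.5 * y[:, None] * (np.arange(64) == 3)  # one informative dim
print("R_pb(attention, correct) =", round(point_biserial(attention_sharpness, y), 3))

probe = LogisticRegression(max_iter=1000).fit(hidden[:1500], y[:1500])
auc = roc_auc_score(y[1500:], probe.predict_proba(hidden[1500:])[:, 1])
print("hidden-state probe AUROC =", round(auc, 3))
```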

[CV-336] Survey on Disaster Management Datasets for Remote Sensing Based Emergency Applications

【速读】:该论文旨在解决当前灾害管理中数据驱动方法应用受限的问题,特别是由于高质量标注数据集的缺乏导致机器学习(Machine Learning, ML)和深度学习(Deep Learning, DL)在遥感图像处理中的效能难以充分发挥。其解决方案的关键在于系统梳理并整合公开可用的、涵盖灾害全生命周期(灾前、灾中与灾后)的图像数据集,为研究人员和实践者提供一个集中化的高质量数据资源参考,从而加速基于遥感技术的灾害响应解决方案的研发与部署。

链接: https://arxiv.org/abs/2605.08196
作者: Alain P. Ndigande,Josiah Wiggins,Sedat Ozer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been accepted for publication at IEEE Transactions on Geoscience and Remote Sensing

点击查看摘要

Abstract:Recent natural disasters have highlighted the urgent need for efficient data-driven approaches to disaster management. Machine learning (ML) and deep learning (DL) techniques have shown considerable promise in enhancing the key phases of disaster management including mitigation, preparedness, detection, response, and recovery. A critical enabler of successful ML or DL based applications in remote sensing, however, is the accessibility and quality of annotated datasets. With the growing availability of high-resolution imagery from unmanned aerial vehicles (UAVs) and satellites, computer vision and remote sensing algorithms have become essential tools for rapid detection, situational assessment, and decision-making in disaster scenarios. This survey provides a comprehensive overview of publicly available image-based datasets relevant to ML/DL-based disaster management pipelines. Emphasis is placed on datasets that support computer vision and remote sensing tasks across all phases of disaster events including pre-disaster, during, and post-disaster. The goal of this work is to serve as a centralized reference for researchers and practitioners seeking high-quality datasets for rapid development and deployment of remote sensing-driven disaster response solutions.

[CV-337] Normalization Equivariance for Arbitrary Backbones with Application to Image Denoising

【速读】:该论文旨在解决图像到图像预测任务中因分布偏移(distribution shift)导致的鲁棒性下降问题,特别是针对全局对比度和亮度变换的等变性(Normalization Equivariance, NE)约束难以在标准模型组件(如注意力机制和LayerNorm)中有效实现的问题。现有方法通过限制内部层结构来强制满足NE,但兼容性差且增加运行时开销。论文的关键创新在于首次完整刻画了NE函数类:一个函数是NE的充要条件是它可分解为“归一化-处理-反归一化”(normalize-process-denormalize)的形式。这一发现将NE的精确实现从内部架构约束转化为输入输出层面的参数化问题,从而设计出无需额外参数的通用封装器(WNE),能够无缝集成到任意主干网络(包括Transformer)上,在不引入GPU计算开销的情况下显著提升模型对噪声分布不匹配场景下的鲁棒性。

链接: https://arxiv.org/abs/2605.08193
作者: Youssef Saied,François Fleuret
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Normalization Equivariance (NE), equivariance to global contrast and brightness transforms, improves robustness to distribution shift in image-to-image prediction. Existing methods enforce this prior by constraining internal layers to NE-compatible families, limiting compatibility with standard components such as attention and LayerNorm, and adding runtime cost. We characterize the full NE function class: a function is NE if and only if it admits a normalize-process-denormalize factorization. This turns exact NE enforcement, for the ideal wrapper, from an internal architectural constraint into an input-output parameterization problem, allowing a parameter-free wrapper (WNE) to enforce NE around any backbone, including transformers. In a single-noise mismatch diagnostic for blind denoising, the wrapper improves CNN and transformer robustness with no measurable GPU overhead; architectural NE baselines incur up to a 1.6x slowdown.
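The normalize-process-denormalize factorization can be written as a tiny wrapper: remove the per-image mean and contrast, run any backbone, then restore them, which makes the composite map equivariant to global brightness and contrast changes by construction. The small conv backbone below is a placeholder for an arbitrary denoiser, and the wrapper is a sketch of the factorization rather than the paper's exact code.

```python
import torch
import torch.nn as nn

class WNE(nn.Module):
    """Normalize-process-denormalize wrapper around an arbitrary image-to-image backbone."""
    def __init__(self, backbone: nn.Module, eps: float = 1e-6):
        super().__init__()
        self.backbone, self.eps = backbone, eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:          # x: (B, C, H, W)
        mu = x.mean(dim=(1, 2, 3), keepdim=True)
        sigma = x.std(dim=(1, 2, 3), keepdim=True) + self.eps
        return self.backbone((x - mu) / sigma) * sigma + mu      # denormalize the output

backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 3, 3, padding=1))
model = WNE(backbone)
x = torch.rand(2, 3, 32, 32)
a, b = 1.7, 0.3                                                  # global contrast/brightness change
out1, out2 = model(a * x + b), a * model(x) + b
print(torch.allclose(out1, out2, atol=1e-4))                     # True: normalization equivariance
```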

[CV-338] A Robust Out-of-Distribution Detection Framework via Synergistic Smoothing CVPR

【速读】:该论文旨在解决当前生成式 AI(Generative AI)系统中分布外(Out-of-Distribution, OOD)检测方法对对抗攻击高度敏感的问题,这限制了其在自动化系统中的可信部署。解决方案的关键在于提出一种名为 ROSS 的新型后处理 OOD 检测器,其核心思想是利用基线 OOD 分数在扰动下的局部不稳定性来增强判别能力:通过中值平滑(median smoothing)处理基线分数以平衡干净样本与对抗样本的准确率,并将平滑过程中产生的噪声样本重新用于量化分数的局部不稳定性;实验表明,OOD 样本在扰动下表现出更高的不稳定性,因此可据此进一步区分 ID 与 OOD 样本,从而实现对 Score-Minimizing 和 Score-Maximizing 攻击的对称鲁棒性,显著优于现有方法(最高提升 40 AUROC 点)。

链接: https://arxiv.org/abs/2605.08191
作者: Maria Stoica,Abdelrahman Hekal,Alessio Lomuscio
机构: Imperial College London (帝国理工学院); Zeroth Research (零点研究)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR Findings 2026

点击查看摘要

Abstract:Reliable out-of-distribution (OOD) detection is a critical requirement for the safe deployment of machine learning systems. Despite recent progress, state-of-the-art OOD detectors are highly susceptible to adversarial attacks, which undermines their trustworthiness in automated systems. To address this vulnerability, we apply median smoothing to baseline OOD detection scores, balancing clean and adversarial accuracies. Our key insight is that the noisy samples generated for median smoothing can be repurposed to quantify the local instability of the base score. We observe that OOD samples exhibit higher instability under perturbation. Based on this, we propose ROSS, a novel and robust post-hoc OOD detector that leverages the instability of baseline scores to further distinguish between in-distribution (ID) and OOD samples. ROSS achieves symmetric robustness, performing strongly against both score-minimising and score-maximising attacks, unlike prior work. This symmetric defence leads to state-of-the-art robustness, outperforming prior methods by up to 40 AUROC points. We demonstrate ROSS’s effectiveness on extensive experiments across CIFAR-10, CIFAR-100, and ImageNet. Code is available at: this https URL.
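下面是一个示意性的 NumPy 草图(非官方实现):展示"对基础 OOD 分数做中值平滑、并复用同一批噪声样本估计局部不稳定性"这一核心思路;噪声幅度、采样数与两项分数的组合权重 alpha 均为笔者假设。

```python
# 示意性草图:中值平滑 + 扰动不稳定性(假设实现,非论文官方代码)
import numpy as np

def ross_style_score(base_score, x, noise_sigma=0.1, n_samples=32, alpha=1.0, rng=None):
    """base_score: callable mapping a batch of inputs to scalar OOD scores (higher = more ID)."""
    rng = np.random.default_rng(rng)
    noisy = x[None, :] + noise_sigma * rng.standard_normal((n_samples, x.size))
    scores = base_score(noisy)            # shape: (n_samples,)
    smoothed = np.median(scores)          # median smoothing of the base score
    instability = np.std(scores)          # local instability under perturbation
    # 按论文思路,OOD 样本在扰动下往往更不稳定,故对不稳定性做惩罚(alpha 为假设的权重)
    return smoothed - alpha * instability

# toy base score: 负的到原点距离充当能量/对数几率型分数的占位
toy_score = lambda batch: -np.linalg.norm(batch, axis=1)
x_id, x_ood = np.zeros(10), 5.0 * np.ones(10)
print(ross_style_score(toy_score, x_id, rng=0),
      ross_style_score(toy_score, x_ood, rng=0))   # ID 样本的组合分数明显更高
```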

[CV-339] Neuroscience-Inspired Analyses of Visual Interestingness in Multimodal Transformers

【速读】:该论文旨在解决现代Transformer模型是否编码了人类注意力机制的核心原则,还是仅仅利用大规模数据中的相关性这一关键问题。其核心解决方案在于通过分析多模态视觉-语言模型Qwen3-VL-8B内部表征,发现人类视觉兴趣(visual interest)信息可在线性层中被有效解码,并在中间视觉Transformer层中开始显现且随语言模型层推进逐渐增强区分度;此外,基于几何、探测器和稀疏自动编码器的方法提取的概念向量在高层趋于收敛,表明视觉有趣性在无显式监督条件下实现了结构化编码,揭示了模型与人类感知之间潜在的计算一致性。

链接: https://arxiv.org/abs/2605.08188
作者: Mathis Immertreu,Fitim Abdullahu,Thomas Kinfe,Helmut Grabner,Patrick Krauss,Achim Schilling
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human attention is the gateway to conscious perception, memory and decision-making. However, its role in modern transformer models remains largely unexplored. As these systems increasingly influence what people see, prefer and buy, the question arises as to whether they encode principles of human interest or merely exploit large-scale correlations. Addressing this issue is crucial for understanding cognition and ensuring the responsible use of AI in communication and marketing. In order to address this issue, the concept of visual interest was examined within the multimodal vision-language-model Qwen3-VL-8B, using a pre-defined Common Interestingness (CI) score derived from large-scale human engagement data on the photo-sharing platform Flickr. Here, we analyzed internal representations across vision and language components using methods from the neurosciences. Our analyses revealed that CI information is linearly decodable from final-layer embeddings, indicating that it is aligned with human-derived measures of visual interestingness. Dimensionality reduction and Generalized Discrimination Value (GDV) analyses demonstrate that CI-related hidden representations emerge in intermediate vision transformer layers and become progressively more distinguishable across language model layers. Concept vectors derived using geometric, probe, and Sparse Auto-Encoder based methods converge in higher layers, as confirmed by representational similarity analysis. This indicates a robust and structured encoding of visual interestingness without explicit supervision. Future work will seek to identify shared computational principles linking human brain dynamics and transformer architectures, with the ultimate goal of uncovering the organizing mechanisms that give rise to attention and interest in both biological and artificial systems.

[CV-340] Sparsity Hurts: Simple Linear Adapter Can Boost Generalized Category Discovery

【速读】:该论文旨在解决通用类别发现(Generalized Category Discovery, GCD)任务中现有方法的局限性:一方面,传统部分微调(partial fine-tuning)仅更新视觉Transformer(ViT)的最后一层,灵活性不足;另一方面,视觉提示调优(visual prompt tuning)易受初始化影响且容量受限,导致过拟合。解决方案的关键在于提出一种名为LAGCD的新方法,其核心创新是在每个ViT块中嵌入一个残差线性适配器(residual linear adapter),从特征稀疏性的角度证明非线性适配器会损害性能,而线性适配器通过增强模型容量提升表现;同时引入辅助分布对齐损失(auxiliary distribution alignment loss)以缓解已见类别与新类别间预测偏置问题,从而在多个通用和细粒度数据集上显著优于多种复杂基线方法。

链接: https://arxiv.org/abs/2605.08183
作者: Bo Ye,Kai Gan,Tong Wei,Min-Ling Zhang
机构: Southeast University (东南大学); Key Laboratory of Computer Network and Information Integration (东南大学) (教育部计算机网络与信息集成重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Submitted to IEEE TPAMI

点击查看摘要

Abstract:Generalized Category Discovery (GCD) seeks to identify novel categories from unlabeled data while retaining the classification ability of seen categories. Prior GCD methods commonly leverage transferable representations from pre-trained models, adapting to downstream datasets via partial fine-tuning (updating only the final ViT block) and visual prompt tuning (appending learnable vectors to inputs). However, conventional partial fine-tuning offers limited flexibility, as it fails to adapt the entire model; meanwhile, visual prompt tuning is prone to overfitting, due to its sensitivity to initialization and inherently constrained capacity. To address these limitations, we propose LAGCD, a simple yet effective GCD approach that embeds a residual linear adapter into each ViT block. From the perspective of feature sparsity, we systematically show that non-linearity in conventional adapters impairs performance, whereas our linear adapter enhances it by enabling more flexible model capacity. We further introduce an auxiliary distribution alignment loss to mitigate the negative impact of biased predictions between seen and novel categories. Extensive experiments on both generic and fine-grained datasets confirm that LAGCD consistently improves performance over many sophisticated baselines. The source code is available at this https URL

[CV-341] Text-Guided Multi-Scale Frequency Representation Adaptation ACL2026

【速读】:该论文旨在解决参数高效微调方法中存在的两个关键问题:一是现有方法大多在信号空间域(signal space domain)中操作,导致信息冗余严重;二是多数方法采用固定提示或适应层,未能充分考虑信号的多尺度特性。解决方案的核心是提出多尺度频域适配器(Multi-Scale Frequency Adapter, FreqAdapter),其通过将文本信息融入频域中的多尺度信号微调,结合一种多尺度适应策略来优化不同频率范围内的感受野,从而提升模型的表征能力与微调效率。实验表明,FreqAdapter能在仅一个训练周期内实现快速收敛,并显著提升CLIP和LLaVA等多模态模型的性能。

链接: https://arxiv.org/abs/2605.08181
作者: Weicai Yan,Xinhua Ma,Wang Lin,Tao Jin
机构: Zhejiang University(浙江大学); Nanyang Technological University(南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ACL 2026 Main

点击查看摘要

Abstract:Parameter-efficient fine-tuning methods introduce a small number of training parameters, enabling pre-trained models to adapt rapidly to new data distributions. While these methods have shown promising results, they exhibit notable limitations. First, most existing methods operate in the signal space domain, which results in substantial information redundancy. Second, most existing methods utilize fixed prompts or adaptation layers, failing to fully account for the multi-scale characteristics of signals. To address these challenges, we propose the Multi-Scale Frequency Adapter (FreqAdapter), which integrates textual information and performs multi-scale fine-tuning of signals in the frequency domain. Additionally, we introduce a multi-scale adaptation strategy to optimize receptive fields across different frequency ranges, further enhancing the model’s representational capacity. Extensive experiments on multimodal models, including CLIP and LLaVA, demonstrate that FreqAdapter significantly improves both performance and efficiency. FreqAdapter improves performance with minimal cost and fast convergence within one epoch. Code is available at this https URL.
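以下 PyTorch 草图只演示"在频域中引入少量可学习参数做微调"这一核心思路(非论文官方实现):对特征图做 rFFT、按频率逐点加权、再做逆变换并以残差方式接回;论文中的多尺度设计与文本引导部分此处省略,均为笔者的简化假设。

```python
# 最小示意:单尺度频域适配器(假设实现,省略多尺度与文本引导)
import torch
import torch.nn as nn

class FrequencyAdapter(nn.Module):
    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        # 每个 (通道, 频率点) 一个可学习增益,初始化为恒等
        self.gain = nn.Parameter(torch.ones(channels, height, width // 2 + 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        spec = torch.fft.rfft2(x, norm="ortho")            # (B, C, H, W//2+1), complex
        spec = spec * self.gain                            # element-wise spectral re-weighting
        out = torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")
        return x + out                                     # residual connection around the adapter

adapter = FrequencyAdapter(channels=64, height=14, width=14)
feat = torch.randn(2, 64, 14, 14)
print(adapter(feat).shape)   # torch.Size([2, 64, 14, 14])
```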

[CV-342] KARMA-MV: A Benchmark for Causal Question Answering on Music Videos

【速读】:该论文旨在解决音乐视频中视觉动态如何驱动音乐结构的因果推理问题,这一领域在视频问答(Video Question Answering)和跨模态理解中仍处于探索阶段。现有方法多依赖相关性建模,缺乏对视觉到音乐影响机制的显式因果分析。解决方案的关键在于构建KARMA-MV数据集和因果知识图谱(Causal Knowledge Graph, CKG):前者通过大规模多选题(37,737个)覆盖推理、预测与反事实问题,基于大语言模型(LLM)实现可扩展生成与验证;后者则将结构化的跨模态依赖关系注入视觉-语言模型(Vision-Language Models, VLMs),从而提升模型对音乐视频中因果关系的理解能力,尤其在小规模模型上效果显著,验证了显式因果结构对于音频-视觉理解任务的重要性。

链接: https://arxiv.org/abs/2605.08175
作者: Archishman Ghosh,Abhinaba Roy,Dorien Herremans
机构: Singapore University of Technology and Design (新加坡科技设计大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While significant progress has been made in Video Question Answering and cross-modal understanding, causal reasoning about how visual dynamics drive musical structure in music videos remains under-explored. We introduce KARMA-MV, a large-scale multiple-choice QA dataset derived from 2,682 YouTube music videos, designed to test models’ ability to integrate temporal audio-visual cues and reason about visual-to-musical influence across reasoning, prediction, and counterfactual questions. Unlike traditional datasets requiring manual annotation, KARMA-MV leverages LLM reasoning for scalable generation and validation, yielding 37,737 MCQs. We propose a causal knowledge graph (CKG) approach that augments vision-language models (VLMs) with structured retrieval of cross-modal dependencies. Experiments on state-of-the-art VLMs and LLMs show consistent gains from CKG grounding – especially for smaller models – establishing the value of explicit causal structure for music-video reasoning. KARMA-MV provides a new benchmark for advancing causal audio-visual understanding beyond correlation.

[CV-343] CERSA: Cumulative Energy-Retaining Subspace Adaptation for Memory-Efficient Fine-Tuning

【速读】:该论文旨在解决参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法中存在的两个核心问题:一是现有方法如LoRA依赖低秩更新,难以充分捕捉全参数微调中权重变化的秩特性,导致性能差距;二是这些方法仍需存储完整的冻结权重,限制了在资源受限场景下的效率。解决方案的关键在于提出一种新的微调范式——累积能量保留子空间适配(Cumulative Energy-Retaining Subspace Adaptation, CERSA),其利用奇异值分解(Singular Value Decomposition, SVD)仅保留占谱能量90%–95%的主要成分,从而构建低秩表示进行微调,显著降低内存消耗,同时在多模态任务中实现优于当前最优PEFT方法的性能表现。

链接: https://arxiv.org/abs/2605.08174
作者: Jingze Ge,Xue Geng,Yun Liu,Wanqi Dong,Wang Zhe Mark,Min Wu,Ngai-Man Cheung,Bharadwaj Veeravalli,Xulei Yang
机构: National University of Singapore (新加坡国立大学); Nankai University (南开大学); Institute for Infocomm Research (I2R), ASTAR (新加坡资讯通信研究院,ASTAR); Singapore University of Technology and Design (新加坡科技设计大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 7 figures, supplementary material included

点击查看摘要

Abstract:To mitigate the memory constraints associated with fine-tuning large pre-trained models, existing parameter-efficient fine-tuning (PEFT) methods, such as LoRA, rely on low-rank updates. However, such updates fail to fully capture the rank characteristics of the weight modifications observed in full-parameter fine-tuning, resulting in a performance gap. Furthermore, LoRA and other existing PEFT methods still require substantial memory to store the full set of frozen weights, limiting their efficiency in resource-constrained settings. To address these limitations, we introduce Cumulative Energy-Retaining Subspace Adaptation (CERSA), a novel fine-tuning paradigm that leverages singular value decomposition (SVD) to retain only the principal components responsible for 90% to 95% of the spectral energy. By fine-tuning low-rank representations derived from this principal subspace, CERSA significantly reduces memory consumption. We conduct extensive evaluations of CERSA across models of varying scales and domains, including image recognition, text-to-image generation, and natural language understanding. Empirical results demonstrate that CERSA consistently outperforms state-of-the-art PEFT methods while achieving substantially lower memory requirements. The code will be publicly released.
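下面用 NumPy 给出一个示意性草图(非官方实现):按累积谱能量选取保留 90%~95% 能量所需的秩 r,并得到将被用于微调的低秩因子;其中"能量按奇异值平方累积"这一定义是笔者的假设。

```python
# 示意性草图:按累积谱能量截断 SVD(假设实现,非论文官方代码)
import numpy as np

def energy_retaining_rank(W: np.ndarray, energy: float = 0.95):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    cum_energy = np.cumsum(S**2) / np.sum(S**2)
    r = int(np.searchsorted(cum_energy, energy) + 1)   # 达到目标能量的最小秩
    A = U[:, :r] * S[:r]       # (out_dim, r),与 B 一起构成待微调的低秩因子
    B = Vt[:r, :]              # (r, in_dim)
    return r, A, B

rng = np.random.default_rng(0)
W = rng.normal(size=(768, 768)) @ np.diag(np.logspace(0, -3, 768))  # 人为制造奇异值衰减
r, A, B = energy_retaining_rank(W, energy=0.95)
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(f"rank kept: {r}/768, relative reconstruction error: {rel_err:.4f}")
```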

[CV-344] CASISR: Circular Arbitrary-Scale Image Super-Resolution

【速读】:该论文旨在解决深度学习驱动的任意尺度图像超分辨率(Arbitrary-Scale Image Super-Resolution, ASISR)方法在泛化性能(Generalization Performance, GP)上的局限性问题,即模型在有限训练数据下难以适应无限多样化的测试场景。其解决方案的关键在于提出一种闭环架构——循环式超分辨率(Circular ASISR, CASISR),该架构通过融合已知或可学习的退化模型(Degradation Model)与ASISR模型,构建基于自动控制理论的反馈机制,从而增强图像重建能力。论文通过建立非线性环路方程描述CASISR,并利用条件概率理论证明其合理性、泰勒展开法验证其稳定性,最终在实验中展现出优于八种主流ASISR方法的重建质量,尤其在分数倍放大因子及边缘剧烈变化的文本和条纹图像上表现突出。

链接: https://arxiv.org/abs/2605.08173
作者: Honggui Li,Zhengyang Zhang,Dingtai Li,Sinan Chen,Nahid Md Lokman Hossain,Xinfeng Xu,Yinlu Qin,Ruobing Wang,Hantao Lu,Yuting Feng,Maria Trocan,Dimitri Galayko,Amara Amara,Mohamad Sawan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The generalization performance (GP) of deep learning-based arbitrary-scale image super-resolution (ASISR) methods is subject to limited training datasets and unlimited testing datasets. It is vitally significant to enhance the GP of the pretrained ASISR models by making full use of the testing samples. The ASISR models usually employ an open-loop architecture from low-resolution (LR) images to super-resolution (SR) images. The degradation model from SR samples to LR samples is known to be bicubic down-sampling for the classical ASISR, is assumed to be down-sampling with additive random noise for the blind ASISR, and is learnable for the real-world ASISR. Combining the ASISR and degradation models, it is potentially possible to adopt a closed-loop architecture based on the automatic control theory for strengthening the GP of the ASISR methods. Therefore, this paper proposes a closed-loop architecture, circular ASISR (CASISR), to lift the capability of image reconstruction. A mathematical nonlinear loop equation is established to describe the CASISR, the reasonability of the CASISR is proven by conditional probability theory, and the stability of the CASISR is proven by Taylor series approximation. The first-order and second-order absolute difference images are defined to compare the image reconstruction performance of the ASISR and the CASISR methods. Comprehensive simulation experiments show that the proposed CASISR approach outperforms the eight state-of-the-art ASISR approaches in the quality of image reconstruction. Especially, the proposed CASISR is extraordinarily suitable for fractional SR scale factors and is extremely effective for text and stripe images with drastically changed edges.

[CV-345] Augmented Equivariant Mesh Networks for Anatomical Segmentation

【速读】:该论文旨在解决 anatomical mesh segmentation(解剖网格分割)中模型对任意患者姿态和网格分辨率变化敏感的问题,现有任务特定的网格与点云方法缺乏等变性(equivariance),在测试时扰动下性能显著下降(如口腔扫描分割在40°倾斜时IoU下降25-26点)。解决方案的关键在于提出EAMS(Equivariant Anatomical Mesh Segmentor),其基于等变网格神经网络(Equivariant Mesh Neural Networks, EMNN),通过结合内在网格描述符与解剖先验信息(如基于PCA的牙弓和肝脏表面坐标系),并增强消息传递机制以提供轻量级全局上下文,从而实现跨多种监督类型(边、顶点、面级别)下的鲁棒分割,且仅需2M参数即可在不同几何扰动下保持稳定性能。

链接: https://arxiv.org/abs/2605.08172
作者: Daniel Saragih
机构: Queen’s University (皇后大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 21 pages, 7 figures, 14 tables

点击查看摘要

Abstract:Anatomical mesh segmentation requires models that operate directly on irregular surface geometry while remaining robust to arbitrary patient pose and mesh resolution variation. Existing task-specific mesh and point-cloud methods are not equivariant, and can degrade sharply under test-time perturbation, for example dropping by 25-26 IoU points on intraoral scan segmentation at 40° tilt. We present EAMS, an Equivariant Anatomical Mesh Segmentor built on Equivariant Mesh Neural Networks (EMNN), and evaluate it across four clinically distinct tasks spanning edge-, vertex-, and face-level supervision. We combine intrinsic mesh descriptors with anatomy-aware priors, including PCA-derived frames for dental arches and liver surfaces, and augment message passing to provide lightweight global context. Across intracranial aneurysm and intraoral segmentation, EAMS variants are competitive with specialized baselines on unperturbed inputs while remaining stable under geometric perturbations, and on liver surfaces they expose a favorable trade-off between canonical-pose accuracy and rotation robustness. These results show that a lightweight (<2 M parameters) equivariant framework can deliver robust anatomical mesh segmentation across diverse supervision types without task-specific architectures.

[CV-346] Optimized Culprit Identification Using Mobilenet and Attention Mechanisms

【速读】:该论文旨在解决监控系统中自动嫌疑人识别任务的准确性与计算效率之间的平衡问题,尤其是在光照、姿态和遮挡等现实条件下保持高精度识别的挑战。解决方案的关键在于提出一种优化的轻量级深度学习框架,其核心创新是将MobileNet架构与通道注意力(channel attention)和空间注意力(spatial attention)机制相结合,通过选择性聚焦最具判别性的特征区域并抑制无关背景信息,显著提升特征表示能力;同时结合高效预处理、基于注意力的特征精炼以及使用Adam优化器优化的鲁棒分类策略,在LFW、CASIA-WebFace和VGGFace2子集等多个基准数据集上实现了97.8%的高分类准确率,且保持低计算复杂度和短推理时间,适用于实时监控和边缘计算场景。

链接: https://arxiv.org/abs/2605.08169
作者: Savitha N J,Lata B T
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automated culprit identification in surveillance systems is a critical task that requires high accuracy along with computational efficiency for real-time deployment. In this paper, an optimized deep learning framework is proposed using a lightweight MobileNet architecture integrated with channel and spatial attention mechanisms. The proposed model enhances feature representation by selectively focusing on the most discriminative regions while suppressing irrelevant background information, thereby improving identification performance. The framework incorporates efficient preprocessing, attention based feature refinement, and a robust classification strategy optimized using the Adam Optimizer. Experiments were conducted on benchmark face recognition datasets, including Labelled Faces in the Wild (LFW), CASIA-WebFace, and a subset of VGGFace2, under realistic conditions with variations in illumination, pose, and occlusion. The results demonstrate that the proposed model achieves a high classification accuracy of 97.8%, outperforming conventional models such as baseline CNN, ResNet, and standard MobileNet. The confusion matrix analysis indicates strong class-wise discrimination with minimal misclassification, while ROC-AUC evaluation confirms robust performance across all classes. Additionally, the proposed approach maintains low computational complexity and reduced inference time, making it suitable for real-time surveillance and edge-based applications.
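摘要只说明模型集成了通道注意力与空间注意力,未给出具体结构;下面用 CBAM 风格的 PyTorch 草图演示这类模块的常见写法(仅供理解,非论文官方实现),其中特征图尺寸与缩减比例均为笔者假设。

```python
# 示意性草图:通道注意力 + 空间注意力(CBAM 风格,假设实现)
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(),
                                 nn.Linear(channels // reduction, channels))

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3)))              # 全局平均池化分支
        mx = self.mlp(x.amax(dim=(2, 3)))               # 全局最大池化分支
        w = torch.sigmoid(avg + mx)[:, :, None, None]   # (0,1) 区间的逐通道权重
        return x * w

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        stats = torch.cat([x.mean(dim=1, keepdim=True),
                           x.amax(dim=1, keepdim=True)], dim=1)   # (B, 2, H, W)
        return x * torch.sigmoid(self.conv(stats))                # 逐像素权重

feat = torch.randn(4, 96, 28, 28)          # 例如 MobileNet 的某个中间特征图
refined = SpatialAttention()(ChannelAttention(96)(feat))
print(refined.shape)                        # torch.Size([4, 96, 28, 28])
```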

[CV-347] Digital Image Forgery Detection Using Transfer Learning

【速读】:该论文旨在解决数字图像伪造检测中因高级图像编辑工具普及而导致的篡改内容难以识别的问题,尤其关注如何提升对细微篡改痕迹(如压缩伪影)的敏感性和分类可靠性。其解决方案的关键在于提出一种基于迁移学习的框架,通过引入融合RGB图像与基于压缩差异特征(FDIFF)的混合输入表示,增强对隐匿篡改痕迹的可见性;同时采用基于Youden指数的模型特定自适应阈值优化策略,在准确率与假阳性率之间取得更优平衡,从而显著提升检测系统的鲁棒性和实用性。

链接: https://arxiv.org/abs/2605.08167
作者: Fatma Betul Buyuk,Gozde Karatas Baydogmus,Ali Buldu,Ayaulym Tulendiyeva,Zhuldyz Baizhumanova
机构: Marmara University (马尔马拉大学); Loyola University Chicago (洛约拉大学芝加哥分校); Biruni University (比鲁尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The increasing availability of advanced image editing tools has led to a significant rise in manipulated digital content, posing serious challenges for digital forensics and information security. This study presents a transfer learning-based framework for digital image forgery detection that integrates compression-aware feature enhancement with deep convolutional neural network (CNN) architectures. The proposed approach introduces a hybrid input representation that combines RGB images with compression difference-based features (FDIFF), explicitly highlighting subtle manipulation artifacts that are often difficult to detect. In addition, a model-specific adaptive threshold optimization strategy based on the Youden Index is employed to improve classification reliability by achieving a better balance between true positive and false positive rates. Experiments conducted on the CASIA v2.0 dataset using multiple pretrained CNN architectures, including DenseNet121, VGG16, ResNet50, EfficientNetB0, MobileNet, and InceptionV3, demonstrate the effectiveness and robustness of the proposed framework. The models are evaluated using comprehensive performance metrics such as accuracy, precision, recall, F1-score, Matthews correlation coefficient (MCC), and area under the ROC curve (AUC). The results show that DenseNet121 achieves the highest accuracy and AUC, while ResNet50 provides the most balanced and reliable predictions with the highest MCC. The findings emphasize that relying solely on accuracy is insufficient for forensic applications, where minimizing false negatives is critical. Overall, the proposed framework improves the visibility of manipulation artifacts and enhances classification robustness, making it suitable for real-world digital image forgery detection scenarios.
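摘要中基于 Youden 指数的自适应阈值可以用几行代码说明:在 ROC 曲线上取使 J = TPR - FPR 最大的阈值。以下为最小示例,标签与分数均为随机占位数据,仅演示选阈值的流程。

```python
# 最小示意:基于 Youden 指数选择分类阈值(占位数据)
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)                  # 1 = 篡改图像, 0 = 真实图像
y_score = y_true * 0.4 + rng.random(500) * 0.8         # 假设的模型输出分数

fpr, tpr, thresholds = roc_curve(y_true, y_score)
j = tpr - fpr                                          # 每个候选阈值处的 Youden 指数
best_threshold = thresholds[np.argmax(j)]
print(f"Youden-optimal threshold: {best_threshold:.3f}, J = {j.max():.3f}")
y_pred = (y_score >= best_threshold).astype(int)
```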

[CV-348] Advanced Tumor Segmentation in PET/CT Imaging: A Training Strategy Study with nnU-Net for AutoPET III

【速读】:该论文旨在解决全身体积正电子发射断层扫描/计算机断层扫描(PET/CT)成像中肿瘤分割的挑战性问题,尤其是由于病灶大小、对比度和解剖分布的差异导致的分割精度不足,以及人工勾画耗时且存在观察者间变异的问题。其解决方案的关键在于基于nnU-Net框架并采用ResNet作为编码器构建模型,并系统性地优化训练策略,包括强度归一化、批次Dice损失优化和使用CraveMix的数据增强方法,从而显著提升模型对不同示踪剂和多中心数据的泛化能力,减少假阳性,增强对病灶变异性的鲁棒性,最终在AutoPET III挑战赛中取得Dice分数达0.80的优异性能。

链接: https://arxiv.org/abs/2605.08161
作者: Hussain Alasmawi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Tumor segmentation in whole-body PET/CT imaging is crucial for precise disease evaluation and treatment planning. However, it remains challenging due to variability in lesion size, contrast, and anatomical distribution. Relying on manual segmentation makes the process time-consuming and prone to intra- and inter-observer variability. This work presents a whole-body tumor segmentation method developed for the AutoPET III challenge, where the goal is to build models that generalize across tracers and multi-center data. We employ the nnU-Net framework with a ResNet-based encoder as our baseline and systematically investigate the impact of training strategies, including intensity normalization, batch dice optimization, and data augmentation using CraveMix. Our experiments show that these strategies significantly influence model performance, particularly in reducing false positives and improving robustness to lesion variability. The best-performing configuration achieves a Dice score of up to 0.80 on the preliminary test phase, and our method ranked third in the AutoPET III challenge. The code is publicly available here.

[CV-349] WATCH: Wide-Area Archaeological Site Tracking for Change Detection

【速读】:该论文旨在解决大规模考古遗址监测中扰动事件发生时间难以精确定位的问题,其核心挑战在于视觉线索微弱且地面真实数据稀疏。解决方案的关键在于提出WATCH框架,该框架基于PlanetScope卫星影像(2017–2024年,空间分辨率4.7米/像素)实现月级变化事件定位,并引入三种互补的评分机制:(i) 无训练的时序嵌入距离(Temporal Embedding Distance, TED),通过局部时序参考对比月度偏差进行检测;(ii) 自监督变化检测(Self-Supervised Change Detection, SSCD),融合重建、预测与潜在异常信号的集成方法;(iii) 弱监督时序定位模型,利用稀疏事件月份标签进行训练。实验表明,TED与SatMAE结合在精确月度召回率上表现最优(m=0时达55%),而TED与GeoRSCLIP等基础模型组合在三月容差内达到92.5%的准确率,显著优于弱监督方法。此外,方向性边界分析揭示了不同方法的时序偏倚特征,如SSCD与GeoRSCLIP组合具备提前预警能力,而TED更适用于事件确认后的检测,体现出卫星遥感与基础模型嵌入相结合在文化遗产保护中的可扩展性和决策相关性。

链接: https://arxiv.org/abs/2605.08160
作者: Girmaw Abebe Tadesse,Titien Bartette,Andrew Hassanali,Allen Kim,Jonathan Chemla,Andrew Zolli,Yves Ubelmann,Caleb Robinson,Inbal Becker-Reshef,Juan Lavista Ferres
机构: Microsoft AI for Good Research Lab; Iconem; Planet Labs PBC
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Monitoring archaeological sites at scale is vital for protecting cultural heritage, yet pinpointing when disturbances occur remains difficult because visual cues are subtle and ground-truth data are sparse. We introduce WATCH, a framework for month-level change-event localization over PlanetScope satellite mosaics (2017-2024, 4.7 m/px) that supports three complementary scoring approaches: (i) Temporal Embedding Distance (TED), a training-free method that scores month-to-month deviations from a local temporal reference; (ii) Self-Supervised Change Detection (SSCD), an ensemble of reconstruction, forecasting, and latent-novelty signals; and (iii) a Weakly Supervised (WS) temporal localization model trained with sparse event-month labels. We benchmark WATCH on 1,943 archaeological sites in Afghanistan using embeddings from six foundation models (CLIP, GeoRSCLIP, SatMAE, Prithvi-EO-2.0, DINOv3, and Satlas-Pretrain) alongside a handcrafted spectral and texture baseline, and assess cross-regional generalization on sites in Syria, Turkey, Pakistan, and Egypt. The unsupervised approaches (TED, SSCD) consistently outperform the weakly supervised alternative. TED with SatMAE achieves the highest exact-month recall (55% at m=0), while TED with GeoRSCLIP, CLIP, or Satlas-Pretrain reaches 92.5% within a three-month tolerance (m=3). Handcrafted features remain competitive for exact-month detection under weak supervision. Our directional margin analysis reveals systematic temporal biases: SSCD paired with GeoRSCLIP or Prithvi-EO-2.0 exhibits the strongest early-warning profile, detecting anomalies before the recorded event, while TED favors confirmation-oriented detection after a change has materialized. These results show that satellite imagery combined with foundation-model embeddings enables scalable, decision-relevant heritage monitoring. Code: this https URL
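TED(Temporal Embedding Distance)的核心是"免训练地度量每个月份嵌入相对局部时间参考的偏离"。摘要未给出参考窗口与距离的具体定义,下面的 NumPy 草图假设用前若干个月嵌入的均值作参考、余弦距离作度量,仅用于说明思路。

```python
# 示意性草图:TED 式的月度变化评分(窗口与距离定义为笔者假设)
import numpy as np

def ted_scores(embeddings: np.ndarray, window: int = 6) -> np.ndarray:
    """embeddings: (T, D) monthly foundation-model embeddings for one site."""
    T = embeddings.shape[0]
    scores = np.zeros(T)
    for t in range(1, T):
        ref = embeddings[max(0, t - window):t].mean(axis=0)   # 局部时间参考
        e = embeddings[t]
        cos = np.dot(e, ref) / (np.linalg.norm(e) * np.linalg.norm(ref) + 1e-8)
        scores[t] = 1.0 - cos                                 # 越大表示偏离越强
    return scores

# toy example:24 个月嵌入,在第 15 个月注入一次突变,TED 峰值应落在该月
rng = np.random.default_rng(0)
emb = rng.normal(size=(24, 128)) * 0.05 + 1.0
emb[15:] += rng.normal(size=(1, 128))                         # 模拟的变化事件
print("predicted change month:", int(np.argmax(ted_scores(emb))))   # 期望输出 15
```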

[CV-350] HY-Himmel Technical Report: Hierarchical Interleaved Multi-stream Motion Encoding for Long Video Understanding

【速读】:该论文旨在解决多模态语言模型在长视频理解任务中面临的三大瓶颈问题:高解码成本导致密集RGB帧获取困难、帧数增加引发的令牌数量二次增长,以及稀疏关键帧采样下运动感知能力弱。其解决方案的关键在于提出一种分层视频-语言框架HY-Himmel,通过将语义与运动能力分离处理实现高效建模:少量稀疏锚定I帧(anchor I-frames)送入昂贵的视觉Transformer(ViT)以定位物体身份和场景布局,而密集的帧间区间则由轻量级压缩域三流适配器编码,该适配器从运动矢量图、残差图和I帧上下文蒸馏出对齐的运动令牌,并通过可微占位符机制注入到大语言模型(LLM)中;此外,引入专用的Stage-1对比对齐确保运动表示与冻结视觉主干的几何兼容性,从而在显著减少3.6倍上下文令牌的同时,在Video-MME上相较32帧密集基线提升2.3个百分点(从61.2%至63.5%)。

链接: https://arxiv.org/abs/2605.08158
作者: Haopeng Jin,Hongzhu Yi,Wenlong Zhao,Jinwen Luo,Shani Ye,Zhenyu Guan,Shiquan Dong,Tiankun Yang,Tao Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 59 pages, 42 figures. Technical report

点击查看摘要

Abstract:Long-video understanding with multimodal language models suffers from three compounding bottlenecks: heavy decode cost to obtain dense RGB frames, quadratic token growth with frame count, and weak motion perception under sparse keyframe sampling. We present HY-Himmel, a hierarchical video-language framework that allocates semantic and motion capacity separately. A small set of sparse anchor I-frames is routed to the expensive host ViT to ground object identity and scene layout, while the far denser inter-frame intervals are encoded by a lightweight compressed-domain tri-stream adapter that distils motion evidence from motion-vector maps, residual maps, and I-frame context into aligned motion tokens. These tokens are injected into the LLM via a differentiable placeholder mechanism after a dedicated Stage-1 contrastive alignment that places the motion representation in a geometry compatible with the frozen visual backbone. On Video-MME, HY-Himmel surpasses the dense 32-frame baseline by +2.3 pp (61.2 to 63.5%) while using 3.6x fewer context tokens. Extensive ablations over stream composition, motion encoder family, fusion mode, alignment objective, anchor count, LoRA rank, and video duration confirm that the full tri-stream is necessary and sufficient for the observed gains.

[CV-351] LAGO: Language-Guided Adaptive Object-Region Focus for Zero-Shot Visual-Text Alignment

【速读】:该论文旨在解决零样本图像分类(zero-shot image classification)中细粒度识别任务的挑战,即如何在缺乏特定任务监督的情况下,通过语义描述与图像局部区域的有效对齐来提升分类准确性。传统方法依赖大量随机或冗余图像裁剪进行局部视觉-文本对齐,导致推理成本高且引入噪声;此外,过早引入语义指导易引发预测循环(prediction loop)——错误的中间预测误导后续定位并放大误差。其解决方案的关键在于提出 LAGO(LAnguage-Guided adaptive Object-region focus)框架:首先进行类无关的对象中心候选区域发现以获得稳定的视觉初始化,随后采用自适应语言引导精化机制,通过中间置信度动态控制语义引导强度,避免错误传播;同时设计对象级与上下文双通道聚合策略融合多粒度证据,从而实现高效且鲁棒的局部视觉-文本对齐。

链接: https://arxiv.org/abs/2605.08156
作者: Junyi Hu,Qiji Zhou,Lei Zhang,Yue Zhang
机构: Beijing Jiaotong University (北京交通大学); Westlake University (西湖大学); Rochester Institute of Technology (罗切斯特理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 37 pages, 26 figures, including appendix. Preprint

点击查看摘要

Abstract:Zero-shot recognition aims to classify an image by selecting the most compatible label description from a set of candidate classes without any task-specific supervision. In fine-grained settings, however, the relevant evidence often lies in localized parts, attributes, or textures rather than in the full image, making whole-image alignment suboptimal. Recent localized visual-text alignment methods address this by comparing class descriptions with multiple image regions, but they typically rely on large sets of random or redundant crops, increasing inference cost and introducing many highly redundant or weakly relevant candidates. Moreover, introducing semantic guidance too early can create an error-amplifying feedback process in which inaccurate intermediate predictions bias later localization and reinforce subsequent mistakes; we refer to this failure mode as the prediction loop. We propose LAGO (LAnguage-Guided adaptive Object-region focus), a framework for efficient and robust zero-shot localized visual-text alignment. LAGO first performs class-agnostic object-centric candidate discovery to obtain a stable visual initialization, and then applies adaptive language-guided refinement with the strength of semantic guidance controlled by intermediate confidence. It further combines object-level, contextual, and full-image evidence through an effective object-context dual-channel aggregation strategy. Extensive experiments show that LAGO consistently achieves state-of-the-art performance on standard zero-shot benchmarks and challenging distribution-shift settings, while requiring substantially fewer candidate regions at inference time.

[CV-352] VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning

【速读】:该论文旨在解决视觉-表格数据(visual-tabular data)在高风险领域(如医疗和工业)中多模态学习研究不足的问题,此类数据在实际应用中具有重要价值但长期缺乏标准化评估体系。其解决方案的关键在于提出了首个统一基准VT-Bench,涵盖9个领域的14个数据集(包括医疗、宠物、媒体和交通等),样本总量超过756K,并系统评估了23种代表性模型(包括单模态专家、专用视觉-表格模型、通用视觉语言模型(VLMs)及工具增强方法),揭示了当前视觉-表格学习面临的显著挑战,从而为构建更强大的多模态视觉-表格基础模型提供标准化评测平台与研究起点。

链接: https://arxiv.org/abs/2605.08146
作者: Zi-Yi Jia,Zi-Jian Cheng,Xin-Yue Zhang,Kun-Yang Yu,Zhi Zhou,Yu-Feng Li,Lan-Zhe Guo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-modal learning has attracted great attention in visual-text tasks. However, visual-tabular data, which plays a pivotal role in high-stakes domains like healthcare and industry, remains underexplored. In this paper, we introduce VT-Bench, the first unified benchmark for standardizing vision-tabular discriminative prediction and generative reasoning tasks. VT-Bench aggregates 14 datasets across 9 domains (medical-centric, while covering pets, media, and transportation) with over 756K samples. We evaluate 23 representative models, including unimodal experts, specialized visual-tabular models, general-purpose vision-language models (VLMs), and tool-augmented methods, highlighting substantial challenges of visual-tabular learning. We believe VT-Bench will stimulate the community to build more powerful multi-modal vision-tabular foundation models. Benchmark: this https URL

[CV-353] Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models ICML2026

【速读】:该论文旨在解决当前视觉语言模型(Vision Language Models, VLMs)在面对模糊或受损模态时存在的幻觉(hallucination)和鲁棒性不足的问题。其核心假设是,通过挖掘多模态间的共享信息来补偿受损模态,可有效提升模型可靠性。解决方案的关键在于对多模态交互进行系统分析,识别冗余(redundant)、独特(unique)和协同(synergistic)三类任务相关信息,并提出一种自描述(self-captioning)工作流,其中引入“多模态交互门”(Multimodal Interaction Gate)机制,将原本独特的交互转化为冗余的交互,从而增强可利用的共享信息。实验表明,该方法可使视觉诱导错误降低38.3%,一致性提升16.8%。

链接: https://arxiv.org/abs/2605.08145
作者: Yuriel Ryan,Hei Man Ip,Adriel Kuek,Paul Pu Liang,Roy Ka-Wei Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ICML 2026

点击查看摘要

Abstract:Current vision language models face hallucination and robustness issues against ambiguous or corrupted modalities. We hypothesize that these issues can be addressed by exploiting the shared information between modalities to compensate for the impaired one. To this end, we analyze multimodal interactions – redundant (shared), unique (exclusive), and synergistic (emergent) task-relevant information provided by the modalities – to determine their impacts on model reliability. Specifically, amplifying redundant interactions would increase this exploitable shared information to resolve these issues; yet, modern instruction datasets often eliminate redundancies to prioritize visual grounding. We bridge this gap through a self-captioning workflow featuring a Multimodal Interaction Gate: a mechanism to convert unique interactions into redundant interactions. Our findings suggest that increasing redundancy can reduce visual induced errors by 38.3% and improve consistency by 16.8%.

[CV-354] NoiseRater: Meta-Learned Noise Valuation for Diffusion Model Training

【速读】:该论文旨在解决扩散模型(Diffusion Model)训练中将注入噪声视为均质信息源的局限性问题,即当前训练范式未区分不同噪声样本的贡献差异,可能导致低效或次优的学习过程。其核心解决方案是提出NoiseRater——一种基于元学习的实例级噪声估值框架,通过设计一个参数化噪声评分器(parametric noise rater),根据数据实例和时间步(timestep)动态分配噪声的重要性分数,并据此对训练目标进行自适应重加权。该评分器采用双层优化策略,在内层执行扩散模型更新的同时,外层优化评分器以提升下游验证性能。为实现高效部署,进一步构建了解耦的两阶段流程:元训练阶段使用软权重,标准训练阶段转为硬噪声选择机制。实验表明,噪声并非等价贡献,优先利用高信息量噪声可显著提升训练效率与生成质量。

链接: https://arxiv.org/abs/2605.08144
作者: Fang Wu,Haokai Zhao,Da Xing,Hanqun Cao,Tinson Xu,Yanchao Li,Xiangru Tang,Zehong Wang,Aaron Tu,Kuan Pang,Hanchen Wang,Hongbin Lin,Zeqi Zhou,Yinxi Li,Peng Xia,Li Erran Li,Molei Tao,Jure Leskovec,Aditya Joshi,Yejin Choi
机构: Stanford University (斯坦福大学); UNSW (新南威尔士大学); UCL (伦敦大学学院); The University of Chicago (芝加哥大学); CUHK (香港中文大学); Nanjing University (南京大学); Brown University (布朗大学); Yale University (耶鲁大学); University of Notre Dame (圣母大学); University of Waterloo (滑铁卢大学); UC Berkeley (加州大学伯克利分校); Georgia Institute of Technology (佐治亚理工学院); Amazon (亚马逊); UNC–Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have achieved remarkable success across a wide range of generative tasks, yet their training paradigm largely treats injected noise as uniformly informative. In this work, we challenge this assumption and introduce NoiseRater, a meta-learning framework for instance-level noise valuation in diffusion model training. We propose a parametric noise rater that assigns importance scores to individual noise realizations conditioned on data and timestep, enabling adaptive reweighting of the training objective. The rater is trained via bilevel optimization to improve downstream validation performance after inner-loop diffusion updates. To enable efficient deployment, we further design a decoupled two-stage pipeline that transitions from soft weighting during meta-training to hard noise selection during standard training. Extensive experiments on FFHQ and ImageNet demonstrate that not all noise samples contribute equally, and that prioritizing informative noise improves both training efficiency and generation quality. Our results establish noise valuation as a complementary and previously underexplored axis for improving diffusion model training. Our code is available at: this https URL.

[CV-355] Benchmarking ResNet Backbones in RT-DETR: Impact of Depth and Regularization under environmental conditions

【速读】:该论文旨在解决在竞技机器人场景中,环境变化(如光照和背景对比度)对基于Transformer的检测模型(RT-DETR)性能影响不明确的问题,尤其是不同骨干网络(ResNet系列)规模与超参数(如Dropout率)如何共同作用于检测准确率和置信度。解决方案的关键在于系统性地比较四种不同深度的ResNet骨干网络(ResNet18、34、50、101),在统一训练配置下评估其在不同环境条件下的表现,发现:光照变化时,ResNet50在精度(接近1.00)、置信度(最高约0.869)和延迟(约0.058–0.059 ms)之间取得最佳平衡;而背景变化时,ResNet34表现出更优的鲁棒性,置信度更高(最高约0.887),同时保持高精度。结果表明,最优架构选择依赖于具体的环境扰动类型,且中等深度模型在性能与效率间提供最佳权衡。

链接: https://arxiv.org/abs/2605.08136
作者: Pamela Barboza,Víctor Castelli,Belén Pereira,Ricardo Grando,Bruna de Vargas,Augusto Calfani
机构: Technological University of Uruguay (乌拉圭技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Accepted at the International Conference on Data Science, Technology and Applications (DATA) 2026

点击查看摘要

Abstract:Visual perception plays a central role in competitive robotics, where environmental variations can directly affect real-time detection performance. The related literature on transformer-based detectors lacks information regarding the impact of backbone scale and environmental settings on model performance. This work presents a comparative evaluation of RT-DETR for detecting round objects under environmental and hyperparameter variations relevant to competitive robotics. Four ResNet backbones (ResNet18, ResNet34, ResNet50, and ResNet101) were compared using dropout rates, analyzing their effect on confidence and accuracy. All models were trained under the same configuration and evaluated under changes in lighting and background contrast. Environmental conditions primarily impact prediction confidence, while inference latency remains largely unaffected and classification accuracy stays consistently high, approaching or above 1.00 in most cases. Two distinct behaviors were observed. Under illumination variation, ResNet50 achieves the best trade-off, combining near-perfect accuracy, confidence values up to approximately 0.869 and latency around 0.058-0.059 ms. Under background variation, ResNet34 provides the most balanced performance, reaching near-perfect accuracy and higher confidence values up to approximately 0.887. These results indicate that the optimal architecture depends on the type of environmental variation, with intermediate-depth models offering the best balance between performance and efficiency.

[CV-356] VLADriver-RAG : Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在自动驾驶中因依赖隐式参数化知识而导致的长尾场景泛化能力不足问题,以及传统视觉检索方法存在的高延迟和语义模糊性问题。解决方案的关键在于提出VLADriver-RAG框架,其核心创新包括:通过“视觉到场景”机制将感官输入抽象为时空语义图(spatiotemporal semantic graphs),有效过滤视觉噪声;引入“场景对齐嵌入模型”(Scenario-Aligned Embedding Model),利用图型动态时间规整(Graph-DTW)度量对齐策略,优先保障拓扑结构一致性而非表面视觉相似性,从而提升检索相关性;最终将检索到的显式结构化先验知识融合进基于查询的VLA主干网络,生成精确且解耦的轨迹规划结果。

链接: https://arxiv.org/abs/2605.08133
作者: Rui Zhao,Haofeng Hu,Zhenhai Gao,Jiaqiao Liu,Gao Fei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving, yet their reliance on implicit parametric knowledge limits generalization in long-tail scenarios. While Retrieval-Augmented Generation (RAG) offers a solution by accessing external expert priors, standard visual retrieval suffers from high latency and semantic ambiguity. To address these challenges, we propose VLADriver-RAG, a framework that grounds planning in explicit, structure-aware historical knowledge. Specifically, we abstract sensory inputs into spatiotemporal semantic graphs via a Visual-to-Scenario mechanism, effectively filtering visual noise. To ensure retrieval relevance, we employ a Scenario-Aligned Embedding Model that utilizes Graph-DTW metric alignment to prioritize intrinsic topological consistency over superficial visual similarity. These retrieved priors are then fused within a query-based VLA backbone to synthesize precise, disentangled trajectories. Extensive experiments on the Bench2Drive benchmark establish a new state-of-the-art, achieving a Driving Score of 89.12.

[CV-357] Alice v1: Distillation-Enhanced Video Generation Surpassing Closed-Source Models

【速读】:该论文旨在解决视频生成模型在保持高质量输出的同时提升推理速度的问题,尤其是针对现有蒸馏方法通常以牺牲质量为代价换取效率的局限性。其关键解决方案是提出基于分数正则化的一致性蒸馏(rCM),通过三个机制实现超越教师模型的质量:(1) 分数正则化项作为模式聚焦目标,将概率质量集中于高质量输出而非覆盖完整的教师分布;(2) 设计针对性的合成数据流水线结合困难样本挖掘,提供对教师模型不一致处理的失败模式(如物理、手部和面部)的训练信号;(3) 一致性约束作为隐式正则化,消除对特定噪声样本的“幸运路径”依赖。该方法使Alice v1在仅4步去噪步骤下即可生成5秒720p视频(H100上约8秒),相比教师模型50步显著提速7倍,并在VBench等自动评估指标上从84.0提升至91.2,优于闭源模型如Veo和Sora。

链接: https://arxiv.org/abs/2605.08115
作者: Wang Xiaoyu,Phong Nguyen,Chen Zhao
机构: Mirage Team (幻影团队); Open Source Research (开源研究)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present Alice v1, a 14-billion parameter open-source video generation model that achieves state-of-the-art quality through consistency distillation with score regularization (rCM). Contrary to conventional distillation, which trades quality for speed, we demonstrate that rCM-based distillation can exceed teacher model quality. We attribute this to three mechanisms: (1) the score regularization term acts as a mode-seeking objective that concentrates probability mass on high-quality outputs rather than covering the full teacher distribution, (2) our targeted synthetic data pipeline with hard example mining provides training signal specifically for failure modes (physics, hands, faces) that the teacher handles inconsistently, and (3) consistency enforcement acts as implicit regularization, eliminating "lucky path" dependence on specific noise samples. Alice v1 generates 5-second 720p videos at 24fps in 4 denoising steps (~8 seconds on H100), a 7x speedup over the 50-step teacher while improving VBench score from 84.0 (Wan2.2) to 91.2. This surpasses both the teacher and closed-source systems including Veo3 (~90) and Sora2 (~88) on automated benchmarks, with competitive results in human preference studies. We release all model weights, training code, synthetic data pipelines, and evaluation scripts to advance open research in video generation.

[CV-358] Do Foundation Model Embeddings Improve Cross-Country Crop Yield Generalisation? A Leave-One-Country-Out Evaluation in Sub-Saharan Africa

【速读】:该论文旨在解决跨国家尺度下小农户玉米产量预测的准确性问题,这是保障撒哈拉以南非洲地区粮食安全规划的关键挑战。现有研究多基于单一国家内的基准测试,高估了模型的实际泛化能力。论文提出使用地理空间基础模型嵌入(如Prithvi-EO-1.0-100M和ViT-Base)替代传统Sentinel-2光谱特征,通过留一国交叉验证(Leave-One-Country-Out cross-validation)评估其在五国共6,404个玉米田块上的跨区域表现。关键发现是:无论是否冻结模型参数,基础模型嵌入并未显著优于传统特征,且所有方法在跨国家场景下均表现为负R²值,表明主要瓶颈并非表征质量,而是不同国家间产量分布的系统性差异。这一结果为未来研究提供了可复现的负面基准,强调需优先解决数据分布偏移问题而非单纯优化特征提取。

链接: https://arxiv.org/abs/2605.08113
作者: Yaw Osei Adjei
机构: Kwame Nkrumah University of Science and Technology (夸梅·恩克鲁玛科技大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 10 figures, appendix, code and processed results released publicly

点击查看摘要

Abstract:Accurate predictions of smallholder maize yields across national boundaries are critical for food security planning in sub-Saharan Africa, yet most published benchmarks report within-country performance that overstates true generalisability. This paper evaluates whether geospatial foundation model embeddings, specifically Prithvi-EO-1.0-100M and ViT-Base, outperform traditional Sentinel-2 spectral features under a Leave-One-Country-Out cross-validation scheme on 6,404 maize field observations from five African countries. The results show a clear generalisability gap: within-country random cross-validation yields moderate R^2 values, but all feature sets perform poorly under cross-country testing, with universally negative R^2. Frozen Prithvi-EO embeddings provide no meaningful advantage over engineered spectral features for cross-country prediction in this setting. The paper argues that the main limitation is a shift in yield distribution between countries rather than representation quality and releases a reproducible negative benchmark for future work.
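摘要中的 Leave-One-Country-Out 评估协议可以直接用 sklearn 的 LeaveOneGroupOut 实现,下面给出一个最小示例(特征、产量与国家标签均为随机占位数据,仅演示协议本身)。

```python
# 最小示意:Leave-One-Country-Out 交叉验证(占位数据)
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n_fields = 600
X = rng.normal(size=(n_fields, 32))                    # 占位:光谱特征或基础模型嵌入
y = rng.normal(loc=2.0, scale=0.5, size=n_fields)      # 占位:玉米单产
countries = rng.choice(["KE", "TZ", "MW", "ZM", "NG"], size=n_fields)

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=countries):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    r2 = r2_score(y[test_idx], model.predict(X[test_idx]))
    print(f"held-out country: {countries[test_idx][0]}, R^2 = {r2:.3f}")
```

由于这里的特征与标签是随机占位数据,各国的 R² 会接近 0 或为负;换成真实特征后协议本身不变。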

[CV-359] Rethinking Temporal Consistency in Video Object-Centric Learning: From Prediction to Correspondence

【速读】:该论文旨在解决视频对象中心学习(video object-centric learning)中依赖可学习动态模块进行时序一致性建模所带来的计算开销与冗余问题。现有方法通常通过预训练的预测器(predictor)来估计未来帧中的对象表示(slot),但这些预测器本质上是对离散对应关系(discrete correspondence problem)的昂贵近似。论文提出的关键解决方案是引入Grounded Correspondence框架,其核心在于利用预训练视觉主干网络(backbone)中已具备的实例区分性特征(instance-discriminative features),并以确定性的二分图匹配(deterministic bipartite matching,即匈牙利算法)替代传统的可学习时序预测模块。该方法使槽位(slot)初始化于冻结主干特征中的显著区域,并通过帧间槽位表示的匈牙利匹配实现身份一致性,从而在无需任何可学习参数的情况下,仍能在MOVi-D、MOVi-E和YouTube-VIS等基准上取得具有竞争力的性能。

链接: https://arxiv.org/abs/2605.03650
作者: Zhiyuan Li,Rongzhen Zhao,Wenyan Yang,Wenshuai Zhao,Pekka Marttinen,Joni Pajarinen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The de facto approach in video object-centric learning maintains temporal consistency through learned dynamics modules that predict future object representations, called slots. We demonstrate that these predictors function as expensive approximations of discrete correspondence problems. Modern self-supervised vision backbones already encode instance-discriminative features that distinguish objects reliably. Exploiting these features eliminates the need for learned temporal prediction. We introduce Grounded Correspondence, a framework that replaces learned transition functions with deterministic bipartite matching. Slots initialize from salient regions in frozen backbone features. Frame-to-frame identity is maintained through Hungarian matching on slot representations. The approach requires zero learnable parameters for temporal modeling yet achieves competitive performance on MOVi-D, MOVi-E, and YouTube-VIS. Project page: this https URL
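摘要中"用确定性二分图匹配替代可学习时序预测"的做法可以用 scipy 的匈牙利算法直接演示;以下草图中 slot 特征为随机占位,相似度取余弦(具体度量为笔者假设)。

```python
# 最小示意:相邻帧 slot 的匈牙利匹配(占位数据,非官方实现)
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_slots(prev_slots: np.ndarray, curr_slots: np.ndarray) -> np.ndarray:
    """返回索引数组 perm,使得 curr_slots[perm[i]] 延续 prev_slots[i] 的身份。"""
    prev_n = prev_slots / np.linalg.norm(prev_slots, axis=1, keepdims=True)
    curr_n = curr_slots / np.linalg.norm(curr_slots, axis=1, keepdims=True)
    cost = -prev_n @ curr_n.T                      # 代价 = 负余弦相似度
    _, col_ind = linear_sum_assignment(cost)
    return col_ind

rng = np.random.default_rng(0)
prev = rng.normal(size=(7, 64))                    # 7 个 slot,每个 64 维
perm = rng.permutation(7)
curr = prev[perm] + 0.05 * rng.normal(size=(7, 64))   # 下一帧:顺序被打乱并带少量漂移
recovered = match_slots(prev, curr)
print(np.all(perm[recovered] == np.arange(7)))     # True:匹配恢复了被打乱的身份
```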

[CV-360] Geospatial-Temporal Sensemaking of Remote Sensing Activity Detections with Multimodal Large Language Model

【速读】:该论文旨在解决遥感影像中人类活动的时序理解与推理问题,特别是在稀疏卫星观测条件下对建筑工地等固定地理目标的属性演变和活动状态进行语言引导的视觉问答(VQA)分析。传统自动目标识别(ATR)方法难以捕捉动态过程中的语义变化,而该研究通过构建SMART-HC-VQA数据集,将原始标注信息转化为自然语言问答三元组,从而将任务从静态识别升级为时空推理挑战。解决方案的关键在于:一是基于Sentinel-2影像开发了图像对组合增强策略(Image-Pairwise Combinatorial Augmentation),生成约230万对时序对比样本;二是设计了一个多图像输入的多模态大语言模型(MLLM)训练框架(基于LLaVA-NeXT Mistral-7B),支持处理带时间戳的多幅图像并利用元数据驱动的VQA示例进行训练,实现了对施工进度、阶段演进及潜在未来发展的语义推理能力。

链接: https://arxiv.org/abs/2605.10739
作者: David F. Ramirez,Tim Overman,Kristen Jaskie,Andreas Spanias
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to 2026 SPIE Defense + Security, Automatic Target Recognition XXXVI

点击查看摘要

Abstract:We introduce SMART-HC-VQA, a Sentinel-2-based visual question answering dataset derived from the IARPA SMART Heavy Construction dataset, designed for spatiotemporal analysis of human activity. The dataset transforms construction-site annotations, construction-type labels, temporal-phase labels, geographic metadata, and observation relationships into natural language question-answer triplets. This approach redefines the existing dataset as a temporally extended automatic target recognition and visual question answering (VQA) challenge, considering a fixed geospatial site as a target whose attributes and activity states evolve across sparse satellite observations. Currently, SMART-HC-VQA comprises 21,837 accessible Sentinel-2 image chips, 65,511 single-image VQA examples, and approximately 2.3 million two-image temporal comparison examples generated via our novel Image-Pairwise Combinatorial Augmentation. We detail the workflow for retrieving and processing Sentinel-2 imagery, segmenting large satellite tiles into site-centered images, maintaining traceability to SMART-HC annotations, and analyzing the distributions of site size, observation count, temporal coverage, construction type, and phase labels. Additionally, we describe an implemented multi-image MLLM training framework based on LLaVA-NeXT Mistral-7B, adapted to accept multiple dated image inputs and train on metadata-derived VQA examples. This work offers a reproducible foundation for understanding language-guided remote sensing activities, aiming not only to detect change but also to reason about the ongoing processes, their progression, and potential future developments.
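摘要中的 Image-Pairwise Combinatorial Augmentation 本质上是把同一地点的带日期观测两两组合成"时序比较"VQA 样本;下面的草图演示这一组合方式,其中字段名与问句模板均为笔者假设。

```python
# 最小示意:两两组合生成时序比较 VQA 样本(字段名与模板为假设)
from itertools import combinations

site_observations = [
    {"date": "2019-03-02", "phase": "site preparation"},
    {"date": "2019-07-15", "phase": "active construction"},
    {"date": "2020-01-21", "phase": "active construction"},
    {"date": "2020-06-30", "phase": "post construction"},
]

vqa_examples = []
for earlier, later in combinations(sorted(site_observations, key=lambda o: o["date"]), 2):
    question = (f"Comparing the observation on {earlier['date']} with the one on "
                f"{later['date']}, how has the construction activity changed?")
    answer = f"It progressed from '{earlier['phase']}' to '{later['phase']}'."
    vqa_examples.append({"images": [earlier["date"], later["date"]],
                         "question": question, "answer": answer})

print(len(vqa_examples))            # C(4, 2) = 6 个两图比较样本
print(vqa_examples[0]["question"])
```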

[CV-361] Set-Based Groupwise Registration for Variable-Length Variable-Contrast Cardiac MRI MICCAI2026

【速读】:该论文旨在解决定量心脏磁共振成像(Quantitative Cardiac MRI)中运动校正的泛化性问题,即现有基于深度学习的组间配准(Groupwise Registration)方法难以适应不同成像协议下的序列长度、输入顺序和对比度动态变化。传统方法通常将输入数据编码为固定长度的通道堆叠,导致网络设计与特定扫描协议强耦合,无法在新协议下使用。解决方案的关键在于提出一种新的集合论框架 AnyTwoReg,其将MRI序列视为无序集合,通过共享编码器与基于相关性的特征聚合机制构建一个排列不变的参考标准,并学习从图像到形变场的排列等变映射,从而实现对序列长度和输入顺序的解耦;同时利用预训练基础模型提取对比度无关的图像特征,有效应对极端对比度变化。该方法仅在单一T₁ mapping数据集(STONE, L=11)上训练,即可零样本(zero-shot)推广至两个未见的定量MRI数据集(MOLLI, ASL,L ∈ [11, 60]),显著提升下游定量映射质量。

链接: https://arxiv.org/abs/2605.10571
作者: Yi Zhang,Yidong Zhao,Tijmen Toxopeus,Maša Božić-Iven,Sebastian Weingärtner,Qian Tao
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: MICCAI 2026. Submitted Version

点击查看摘要

Abstract:Quantitative cardiac magnetic resonance imaging (MRI) enables non-invasive myocardial tissue characterization but relies on robust motion correction within these variable-length, variable-contrast image sequences. Groupwise registration, which simultaneously aligns all images, has shown greater robustness than pairwise registration for motion correction. However, current deep-learning-based groupwise registration methods cannot generalize across MRI sequences: the architecture typically encodes input data as a fixed-length channel stack, which rigidly couples network design to protocol-specific sequence length, input ordering, and contrast dynamics. At inference time, any change in imaging protocols will render the network unusable. In this work, we introduce AnyTwoReg, a new set-based groupwise registration framework that takes a quantitative MRI sequence as an unordered set. This set formulation fundamentally decouples network design from sequence length and input ordering. By utilizing a shared encoder and correlation-guided feature aggregation, AnyTwoReg constructs a permutation-invariant canonical reference for registration, and learns a permutation-equivariant mapping from images to deformation fields. Additionally, we extract contrast-insensitive image features from an existing foundation model to handle extreme contrast variations. Trained exclusively on a single public T_1 mapping dataset (STONE, sequence length L=11), AnyTwoReg generalizes to two unseen quantitative MRI datasets (MOLLI, ASL) with variable lengths (L ∈ [11, 60]) and different contrast dynamics. It achieves strong cross-protocol generalization in a zero-shot manner, and consistently improves downstream quantitative mapping quality. Notably, while designed for quantitative MRI sequences, our framework is directly applicable to Cine MRI sequences for inter-cardiac-phase registration.

[CV-362] Measurement-Adapted Eigentask Representations for Photon-Limited Optical Readout

【速读】:该论文旨在解决低光成像中由于测量噪声(包括光子散粒噪声、探测器噪声和量化误差)导致的下游推理性能受限问题。其核心挑战在于,光学前端的性能不仅受物理限制,还取决于高维传感器输出在分类或决策前的表示方式。解决方案的关键在于提出“特征任务”(eigentasks)这一测量自适应表示方法,通过按噪声条件下特征的可分辨性对读出特征进行排序,从而构建更具信息量的低维特征表示。实验表明,在光子受限、少样本及高难度分类场景下,该方法显著优于主成分分析(PCA)和基于滤波的压缩等标准基线,尤其在少样本MPEG-7分类任务中,准确率提升约10个百分点,有效提升了下游学习的样本效率。

链接: https://arxiv.org/abs/2605.10008
作者: Tianyang Chen,Mandar M. Sohoni,Saeed A. Khan,Jérémie Laydevant,Shi-Yuan Ma,Tianyu Wang,Peter L. McMahon,Hakan E. Türeci
机构: Princeton University (普林斯顿大学); Cornell University (康奈尔大学); USRA Research Institute for Advanced Computer Science (美国宇航局高级计算机科学研究所)
类目: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
备注: 15+14 pages, 4+9 figures, 55 references

点击查看摘要

Abstract:Optical readout in low-light imaging is fundamentally limited by measurement noise, including photon shot noise, detector noise, and quantization error. In this regime, downstream inference depends not only on the optical front end, but also on how noisy high-dimensional sensor measurements are represented before classification or decision-making. Here we show that eigentasks provide a measurement-adapted representation for optical sensor outputs by ordering readout features according to their resolvability under noise. Using experimental data from a lens-based optical imaging system and a reanalysis of published data from a single-photon-detection neural network, we find that eigentask representations frequently outperform standard baselines including principal component analysis and filtering-based compression. The advantage is most pronounced in photon-limited, few-shot, and higher-difficulty classification regimes. In few-shot MPEG-7 classification, for example, the advantage over other methods reaches about 10 percentage points as the number of classes increases. In these settings, eigentasks yield more informative low-dimensional features and improve sample-efficient downstream learning. These results identify measurement-adapted representation as a promising strategy for optical inference when photon budget, acquisition time, and task complexity are constrained.

[CV-363] A Real-Calibrated Synthetic-First Data Engine

【速读】:该论文旨在解决在数据稀缺场景下,现代计算机视觉系统因缺乏大规模高质量标注数据而导致的性能瓶颈问题。传统合成数据增强方法常因数据集层面的质量缺陷和反馈机制不足,导致性能提升不稳定甚至不可靠。其解决方案的关键在于提出一种“真实校准的合成优先数据引擎”(Real-Calibrated Synthetic-First Data Engine),通过将可控扩散模型生成与多阶段筛选/过滤机制整合到统一管道中,并支持不确定性驱动的选择和人工验证,从而系统性地构建高质量合成数据集。该框架不依赖新的生成算法,而是聚焦于数据工程层面的结构化设计,强调模块化、可复现性和实际部署灵活性,实证表明其在人体姿态估计任务中能有效提升真实数据基线性能,验证了数据中心编排在低数据场景下的实用价值。

链接: https://arxiv.org/abs/2605.09699
作者: Yukang Shen
机构: Kennesaw State University (肯尼索州立大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注: 7 pages, 6 figures

点击查看摘要

Abstract:Modern computer vision systems increasingly encounter performance limitations in data-scarce domains, where collecting large-scale, high-quality labeled data is costly or impractical. While controllable diffusion models enable scalable synthetic image generation, directly applying synthetic augmentation often leads to unstable performance gains due to dataset-level quality issues and insufficient feedback mechanisms. In this work, we present a Real-Calibrated Synthetic-First Data Engine, a modular data engineering framework that combines controllable diffusion generation and multi-stage curation/filtering within a unified pipeline, with optional support for uncertainty-driven selection and human verification. Instead of introducing new generative algorithms, our approach focuses on systematic dataset construction for improving the practical reliability of synthetic augmentation in low-data regimes. The framework is implemented as a modular CLI-based pipeline, where generation, filtering, selection, and validation components can be independently configured and replaced. This design emphasizes reproducibility, flexibility, and practical deployment in real-world data workflows. Through empirical evaluation centered on human pose estimation, we show that synthetic data improves a real-data baseline when used as near-zero-human-annotation-cost augmentation alongside real anchors, while synthetic-only training remains substantially below real-only performance. Supplementary segmentation diagnostics show the same domain-gap pattern. These results highlight the practical value of data-centric orchestration for low-data augmentation.

[CV-364] XTinyU-Net: Training-Free U-Net Scaling via Initialization-Time Sensitivity MICCAI2026

【速读】:该论文旨在解决在资源受限环境下部署U-Net架构进行医学图像分割时,如何高效选择超轻量级且性能稳定的模型配置问题。传统方法需通过大量训练与评估循环来寻找最小模型,计算成本极高。解决方案的关键在于提出一种无需训练的自动选择框架,利用雅可比(Jacobian)敏感性度量,在模型初始化阶段即可对不同通道宽度的U-Net变体进行评分,从而精准定位从稳定性能平台到表征能力崩溃之间的临界点,进而确定最小稳定配置——XTinyU-Net。此方法仅需少量未标注图像即可完成筛选,在多个医学数据集上实现了与重型nnU-Net相当的分割精度,同时参数量减少400至1600倍,并优于现有轻量化架构。

链接: https://arxiv.org/abs/2605.09639
作者: Alvin Kimbowa,Moein Heidari,David Liu,Ilker Hacihaliloglu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Early accepted to MICCAI 2026

点击查看摘要

Abstract:While U-Net architectures remain the gold standard for medical image segmentation, their deployment in resource-constrained environments demands aggressive model compression. However, finding an optimally efficient configuration is computationally prohibitive, typically requiring exhaustive train-and-evaluate cycles to find the smallest model that maintains peak performance. In this paper, we introduce a training-free selection framework to automatically identify ultralightweight, dataset-specific U-Net configurations directly at initialization. We observe that systematically scaling down U-Net channel width induces a sharp transition from a stable performance plateau to representational capacity collapse. To pinpoint this boundary without training, we propose a Jacobian-based sensitivity metric that scores discrete, width-capped U-Net variants using a small set of unlabeled images. By analyzing the total variation of this sensitivity curve, we isolate the smallest stable configuration, which we denote as XTinyU-Net. Evaluated across six diverse medical datasets within the nnU-Net framework, XTinyU-Net achieves segmentation accuracy comparable to the heavy nnU-Net baseline with 400x-1600x fewer parameters, and outperforms contemporary lightweight architectures while utilizing 5x-72x fewer parameters. Code is publicly accessible on this https URL.
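
作为示意,下面用一个玩具例子演示“在初始化阶段用雅可比敏感度为不同宽度的网络打分”这一思路(并非 XTinyU-Net 的原实现;网络结构、探针图像数量与宽度列表均为演示假设):

```python
import torch
import torch.nn as nn

def make_net(width):
    # 演示用的小型全卷积网络,width 控制通道宽度(并非论文中的 U-Net 变体)
    return nn.Sequential(
        nn.Conv2d(1, width, 3, padding=1), nn.ReLU(),
        nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
        nn.Conv2d(width, 1, 3, padding=1),
    )

def jacobian_sensitivity(net, images, n_probes=4):
    """用随机投影无偏估计 ||J||_F^2(J 为输出对输入的雅可比):E_v ||J^T v||^2。
    只需前向与一次反向,不需要任何标签或训练。"""
    score = 0.0
    for _ in range(n_probes):
        x = images.clone().requires_grad_(True)
        y = net(x)
        v = torch.randn_like(y)
        (g,) = torch.autograd.grad((y * v).sum(), x)
        score += g.pow(2).sum().item()
    return score / n_probes

torch.manual_seed(0)
probe_images = torch.randn(8, 1, 64, 64)      # 少量未标注图像即可
widths = [32, 16, 8, 4, 2, 1]
scores = {w: jacobian_sensitivity(make_net(w), probe_images) for w in widths}
print(scores)   # 随宽度缩减观察敏感度曲线,在骤降(能力坍塌)之前选取最小的稳定宽度
```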

[CV-365] Uncertainty-Guided Dual-Domain Learning for Reliable Skin Lesion Segmentation

【速读】:该论文旨在解决皮肤病变分割中因视觉模糊性和形态不规则性导致的空间建模困难问题,以及现有方法忽视预测不确定性所带来的确定性框架缺陷,如跨域融合盲区和对标签噪声的过拟合。其解决方案的关键在于提出一种不确定性引导的双域网络(UGDD-Net),通过引入“凝视-注视”机制将不确定性转化为主动引导信号:首先利用不确定性引导的双向特征融合模块(UGBFF)以像素级不确定性调制空间-光谱交互;其次借助不确定性引导的图精炼模块(UGGR)构建拓扑感知图来传播可靠语义共识并优化不确定节点;最后采用不确定性引导的边际自适应损失(UGML)在高置信度像素上施加严格约束,同时放松对不确定区域的惩罚,从而提升统计校准性能。

链接: https://arxiv.org/abs/2605.09600
作者: Duwei Dai,Caixia Dong,Guowei Dai,Qingsen Yan,Qin Zhang,Fan Liu,Pengyu Ren,Guangyao Kong,Wei Zeng
机构: Xi’an Jiaotong University (西安交通大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 34 pages, 11 figures

点击查看摘要

Abstract:Accurate skin lesion segmentation is vital for dermoscopic Computer-Aided Diagnosis. However, visual ambiguity and morphological irregularity often defeat spatial modeling, necessitating multi-domain architectures. Existing paradigms frequently overlook the active use of prediction uncertainty, leading to deterministic frameworks that suffer from blind cross-domain fusion and overfit to label noise. To address these issues, we propose the Uncertainty-Guided Dual-Domain Network (UGDD-Net). UGDD-Net introduces a novel “Glance-and-Gaze” mechanism to transform uncertainty into an active guiding signal. Specifically, the Uncertainty-Guided Bi-directional Feature Fusion (UGBFF) module uses pixel-level uncertainty to modulate spatial-spectral interactions. The Uncertainty-Guided Graph Refinement (UGGR) module constructs a topology-aware graph to propagate reliable semantic consensus and refine uncertain nodes. Finally, the Uncertainty-Guided Margin-Adaptive Loss (UGML) enforces strict constraints on confident pixels while relaxing penalties on uncertain ones to improve statistical calibration. Extensive experiments on ISIC2017, ISIC2018, PH2, and HAM10000 datasets demonstrate that UGDD-Net achieves state-of-the-art performance, especially on “Hard Samples”. Our uncertainty maps align with expert inter-observer variability, providing robust interpretability for human-machine collaborative diagnosis.
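
其中“对高置信度像素施加严格约束、对不确定像素放松惩罚”的思想,可用下面的加权损失草图来示意(并非论文中 UGML 的原始定义;以预测熵作为不确定性、线性加权作为放松方式均为示例假设):

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_bce(logits, targets, eps=1e-6):
    """logits / targets: (B, 1, H, W)。以预测熵作为像素级不确定性:
    高置信度(低熵)像素权重更大,即约束更严;不确定像素权重更小,即惩罚放松。"""
    probs = torch.sigmoid(logits)
    entropy = -(probs * torch.log(probs + eps)
                + (1 - probs) * torch.log(1 - probs + eps))      # 取值范围 [0, ln2]
    certainty = 1.0 - entropy / torch.log(torch.tensor(2.0))     # 归一化到 [0, 1]
    per_pixel = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    weights = 0.5 + certainty.detach()        # 置信像素权重约 1.5,最不确定像素约 0.5
    return (weights * per_pixel).mean()

# 用法示例
logits = torch.randn(2, 1, 64, 64, requires_grad=True)
targets = (torch.rand(2, 1, 64, 64) > 0.5).float()
loss = uncertainty_weighted_bce(logits, targets)
loss.backward()
```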

[CV-366] Annotation-free deep learning for detection and segmentation of fetal germinal matrix-intraventricular hemorrhage in brain MRI

【速读】:该论文旨在解决产前脑室旁-脑室内出血(germinal matrix-intraventricular hemorrhage, GMH-IVH)的自动化检测与分割问题,传统人工诊断和病灶分割方法存在劳动强度大、易出错且依赖大量标注数据的局限。其解决方案的关键在于提出了一种无需标注数据的深度学习框架FreeHemoSeg,通过医学先验知识引导从正常胎儿MRI数据中合成伪GMH-IVH图像进行模型训练,从而在内部和外部验证集上均实现了高敏感性(最高达0.914)、高特异性(最高达0.966)及合理的分割精度(Dice相似系数DSC最高达0.559),显著优于监督与非监督方法,并提升放射科医师的诊断准确率与效率。

链接: https://arxiv.org/abs/2605.09575
作者: Mingxuan Liu,Yingqi Hao,Yi Liao,Juncheng Zhu,Haoxiang Li,Hongjia Yang,Yifei Chen,Yijin Li,Kasidit Anmahapong,Zihan Li,Jialan Zheng,Min Kang,Yan Song,Hua Lai,Xiaoling Zhou,Nan Sun,Rong Hu,Gang Ning,Haibo Qu,Qiyuan Tian
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Background: Prenatal germinal matrix-intraventricular hemorrhage (GMH-IVH) is a leading cause of infant mortality and neurodevelopmental impairment. Manual diagnosis and lesion segmentation are labor-intensive and error-prone. Deep learning models offer potential for automation but typically require large annotated datasets, which are challenging to obtain. Purpose: To develop and validate an annotation-free deep learning framework for automated detection and segmentation of GMH-IVH on brain MRI. Materials and Methods: This retrospective study analyzed 2D T2-weighted MRI data from pregnant women collected from October 2015 to October 2023 at one hospital (internal validation) and two hospitals (external validation). Eligible participants included healthy fetuses and those with GMH-IVH. FreeHemoSeg was developed and trained using pseudo GMH-IVH images synthesized from normal fetal data guided by medical priors. Primary outcomes included diagnostic accuracy (area under the ROC curve [AUROC], sensitivity, specificity) and segmentation accuracy (Dice similarity coefficient [DSC]). A reader study evaluated clinical utility. Results: A total of 1674 stacks from 558 pregnant women were analyzed. FreeHemoSeg achieved the highest performance in both internal (sensitivity: 0.914, 95% CI 0.869-0.945; specificity: 0.966, 95% CI 0.946-0.978; DSC: 0.559, 95% CI 0.546-0.571) and external validation (sensitivity: 0.824, 95% CI 0.739-0.885; specificity: 0.943, 95% CI 0.913-0.964; DSC: 0.512, 95% CI 0.497-0.526), outperforming supervised and unsupervised methods. FreeHemoSeg assistance improved radiologists’ sensitivity (from 0.882 to 0.941-1.000) and diagnostic confidence while reducing interpretation time by 16.0-52.7%. Conclusion: FreeHemoSeg accurately detects and localizes fetal brain hemorrhages without annotated training data, enabling earlier diagnosis and supporting timely clinical management.

[CV-367] ML-CLIPSim: Multi-Layer CLIP Similarity for Machine-Oriented Image Quality

【速读】:该论文旨在解决传统图像质量评估(Image Quality Assessment, IQA)方法在机器感知场景下与下游模型性能不一致的问题,即如何从机器视角准确衡量图像对任务相关模型的有用性。其核心挑战在于构建一个能反映图像信息保真度对机器任务表现影响的量化指标。解决方案的关键在于提出“机器效用”(latent machine utility)这一新范式,并通过多模型一致性投票(pairwise predictive-consistency comparisons)来近似该效用;进一步设计了ML-CLIPSim,一种基于冻结的CLIP视觉编码器的可微分质量度量,它融合中间patch-token相似性和全局图像嵌入,从而更精准地匹配机器偏好,同时保持与人类主观评价的竞争力。

链接: https://arxiv.org/abs/2605.09479
作者: Feng Ding,Haisheng Fu,Jie Liang,Qihan Xu,Siyu Zhu,Jingning Han
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:We study full-reference image quality assessment from a machine-centric perspective, where images are evaluated by how well they preserve information for downstream models. We formulate machine-oriented quality as a latent machine utility and approximate it through pairwise predictive-consistency comparisons. To this end, we construct PCMP, a dataset of PSNR-matched distortion pairs labeled by consistency votes from multiple pretrained models. We further propose ML-CLIPSim, a differentiable quality metric built on a frozen CLIP visual encoder, which aggregates intermediate patch-token similarities and global image embeddings. Experiments on machine-preference benchmarks, human-IQA datasets, and learned image compression show that ML-CLIPSim better aligns with machine-oriented preferences than conventional fidelity and perceptual metrics, while remaining competitive for human quality prediction. Used as a compression distortion term, it improves rate–task trade-offs across multiple downstream tasks.
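
下面给出“聚合 CLIP 中间层 patch-token 相似度与全局嵌入相似度”这一类度量的简化草图(并非 ML-CLIPSim 的原实现;所用 HuggingFace 模型名、层的选取与示例图片路径均为演示假设):

```python
import torch
import torch.nn.functional as F
from transformers import CLIPVisionModel, CLIPImageProcessor
from PIL import Image

model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def multilayer_clip_sim(ref_img, dist_img, layers=(4, 8, 12)):
    """对参考图与失真图,聚合若干中间层的逐 patch 余弦相似度与全局(pooled)嵌入相似度。"""
    outs = []
    for img in (ref_img, dist_img):
        inputs = processor(images=img, return_tensors="pt")
        outs.append(model(**inputs, output_hidden_states=True))
    ref, dist = outs
    sims = []
    for l in layers:                       # 中间层 patch token(去掉 CLS)的逐 patch 余弦相似度
        r = F.normalize(ref.hidden_states[l][:, 1:], dim=-1)
        d = F.normalize(dist.hidden_states[l][:, 1:], dim=-1)
        sims.append((r * d).sum(-1).mean())
    sims.append(F.cosine_similarity(ref.pooler_output, dist.pooler_output).mean())
    return torch.stack(sims).mean().item()  # 简单平均;论文中的聚合与加权方式会更精细

# 用法示例(图片路径为占位)
score = multilayer_clip_sim(Image.open("ref.png").convert("RGB"),
                            Image.open("distorted.png").convert("RGB"))
```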

[CV-368] Cross-Modal Semantic-Enhanced Diffusion Framework for Diabetic Retinopathy Grading

【速读】:该论文旨在解决糖尿病视网膜病变(Diabetic Retinopathy, DR)自动分级中的三大关键挑战:细粒度病变模式的视觉差异微弱、不同成像设备与采集条件导致的数据分布差异,以及纯视觉方法难以利用临床语义知识的问题。其解决方案的核心在于提出一种CLIP-guided Semantic Diffusion (CGSD) 框架,通过融合视觉-语言预训练模型与扩散概率建模实现跨模态语义引导;具体而言,采用针对DR任务定制的视觉-语言模型作为语义引导模块,并借助低秩适应(Low-Rank Adaptation, LoRA)技术以少量可训练参数适配目标域,从而有效弥合预训练模型与目标数据集之间的分布差距;进一步地,通过计算图像特征与各DR等级文本描述特征的点积生成交叉模态语义条件向量,作为扩散去噪网络的条件信号,替代现有扩散分类方法中结构复杂的双分支视觉先验,显著提升了分级准确性与语义一致性。

链接: https://arxiv.org/abs/2605.09242
作者: Yiqun Wang(Beijing Jiaotong University)
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 3 figures, 2 tables

点击查看摘要

Abstract:Automated grading of diabetic retinopathy (DR) faces several critical challenges: subtle inter-grade visual distinctions in fine-grained lesion patterns, distributional discrepancies induced by heterogeneous imaging devices and acquisition conditions, and the inherent inability of purely visual approaches to exploit clinical semantic knowledge. In this paper, we propose CLIP-Guided Semantic Diffusion (CGSD), a DR grading framework that synergistically integrates vision-language pretraining with diffusion probabilistic modeling. We adopt a domain-specific vision-language model tailored for DR grading as the semantic guidance module and adapt it to the target domain via Low-Rank Adaptation (LoRA), effectively bridging the distributional gap between the pretrained model and the target dataset with only a minimal number of trainable parameters. Building on this foundation, we construct a cross-modal semantic conditioning vector by computing the dot product between image features and the text description features of each DR grade, yielding a joint representation that simultaneously encodes visual content and clinical-grade semantics. This vector serves as the conditioning signal for the diffusion denoising network, replacing the structurally complex dual-branch visual prior employed in existing diffusion-based classification methods. Experiments on the APTOS 2019 dataset demonstrate that the proposed approach achieves an accuracy of 87.5% and a macro-averaged F1 score of 0.731, outperforming a variety of representative methods. Ablation studies further validate the independent contribution of each constituent module.
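
其中“图像特征与各等级文本特征做点积、得到跨模态语义条件向量”这一步,可以用下面的草图来示意(并非论文实现;论文使用的是 DR 领域专用视觉-语言模型并经 LoRA 适配,这里以通用 CLIP 与示例等级描述代替):

```python
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

grade_texts = [           # 各 DR 等级的文本描述,仅为示例写法
    "a retinal fundus photograph with no diabetic retinopathy",
    "a retinal fundus photograph with mild diabetic retinopathy",
    "a retinal fundus photograph with moderate diabetic retinopathy",
    "a retinal fundus photograph with severe diabetic retinopathy",
    "a retinal fundus photograph with proliferative diabetic retinopathy",
]

@torch.no_grad()
def semantic_condition_vector(image):
    inputs = processor(text=grade_texts, images=image, return_tensors="pt", padding=True)
    img_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    # 图像特征与每个等级文本特征的点积,得到 (1, num_grades) 的跨模态条件向量,
    # 可作为扩散去噪网络的条件信号
    return img_feat @ txt_feat.T

cond = semantic_condition_vector(Image.open("fundus.jpg").convert("RGB"))   # 路径为占位
```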

[CV-369] A Paired Point-of-Care Ultrasound Dataset for Image Quality Enhancement and Benchmarking via a cGAN Baseline

【速读】:该论文旨在解决便携式床旁超声(Point-of-Care Ultrasound, POCUS)设备因硬件限制导致图像质量较低的问题,从而影响其在资源匮乏和即时诊疗场景中的诊断价值。解决方案的关键在于构建了首个准确配对的低端POCUS与高端超声图像数据集,并基于此数据集设计了一种改进的条件生成对抗网络(conditional Generative Adversarial Network, cGAN),其生成器采用U-Net架构并结合L1损失与结构相似性指数(Structural Similarity Index, SSIM)损失以提升感知质量;同时引入仿真数据预训练策略进一步优化模型性能。实验表明,该方法显著提升了图像质量指标(如SSIM从0.29提升至0.54,PSNR从19.16 dB提升至22.41 dB),验证了其在增强POCUS图像质量方面的有效性。

链接: https://arxiv.org/abs/2605.08282
作者: Lennard M. van Karnenbeek,Hilde G.A. van der Pol,Mark Wijkhuizen,Eva Poelman,Caroline A. Drukker,Theo Ruers,Freija Geldof,Behdad Dashtbozorg
机构: University of Twente (特温特大学); Netherlands Cancer Institute (荷兰癌症研究所)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Purpose: We aim to enhance the image quality of point-of-care ultrasound (POCUS) devices using deep learning and a novel paired dataset of POCUS and high-end ultrasound images. Approach: We collected the first accurately paired dataset using a custom-built automated gantry system of low-end POCUS and high-end ultrasound images. A conditional generative adversarial network (cGAN) was utilized based on the pix2pix architecture, with a U-Net generator that incorporates both L1 and structural similarity index (SSIM) losses to improve perceptual quality. Pretraining on a simulation dataset further boosts performance. Evaluation was performed on 1064 paired ex vivo tissue and phantom ultrasound image sets. Results: Our approach improves the SSIM from 0.29 to 0.54 and PSNR from 19.16 dB to 22.41 dB. No-reference metrics also indicate substantial enhancement, with the Natural Image Quality Evaluator (NIQE) and Perception-based Image Quality Evaluator (PIQE) scores dropping from 7.95 to 4.44 and 31.12 to 19.99, respectively. Conclusions: This work presents the first publicly available accurately paired dataset of low-end POCUS to high end ultrasound images. Additionally, our results demonstrate the potential of the proposed framework to overcome hardware limitations of handheld POCUS, enhancing its diagnostic value in low-resource and point-of-care settings. The POCUS-IQ Dataset is publicly available at this https URL.
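
下面给出生成器重建项“L1 + SSIM”组合损失的一个简化草图(并非论文原实现;这里的 SSIM 用整幅图的均值/方差近似代替逐窗口计算,权重系数亦为示例假设):

```python
import torch
import torch.nn.functional as F

def simple_ssim(x, y, c1=0.01**2, c2=0.03**2):
    """简化版 SSIM:按样本用全图均值/方差/协方差计算,假设输入范围为 [0, 1]。"""
    mu_x, mu_y = x.mean(dim=(1, 2, 3)), y.mean(dim=(1, 2, 3))
    var_x, var_y = x.var(dim=(1, 2, 3)), y.var(dim=(1, 2, 3))
    cov = ((x - mu_x[:, None, None, None]) * (y - mu_y[:, None, None, None])).mean(dim=(1, 2, 3))
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2))
    return ssim.mean()

def generator_recon_loss(fake_highend, real_highend, lambda_l1=100.0, lambda_ssim=10.0):
    """重建损失 = λ1 * L1 + λ2 * (1 - SSIM),训练时叠加在对抗损失之上。"""
    l1 = F.l1_loss(fake_highend, real_highend)
    ssim_loss = 1.0 - simple_ssim(fake_highend, real_highend)
    return lambda_l1 * l1 + lambda_ssim * ssim_loss

# 用法示例
fake = torch.rand(4, 1, 256, 256, requires_grad=True)
real = torch.rand(4, 1, 256, 256)
loss = generator_recon_loss(fake, real)
loss.backward()
```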

[CV-370] Model-based Dynamic 3D MRI Reconstructions using Neural Fields and Tensor Product Expansions

【速读】:该论文旨在解决传统磁共振成像(MRI)重建方法在高度欠采样场景下因离散化表示导致内存消耗大、结构感知能力弱,从而难以实现精确重建的问题,尤其是在动态三维心脏磁共振(CMR)成像中。其解决方案的关键在于提出一种无需离散化的、内存高效的模型驱动框架,将磁化强度和线圈敏感度建模为连续对象——即通过一元神经场的张量积结构表示为可微函数,从而在高维时空设置中实现可扩展优化,并显著提升重建质量,即使在极端加速因子(如16倍)下仍能保持结构与运动信息的完整性。

链接: https://arxiv.org/abs/2605.08275
作者: Ray Sheombarsing,Max van Riel,David Heesterbeek,Nico van den Berg,Alessandro Sbrizzi
机构: University Medical Center Utrecht (乌得勒支大学医学中心)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Conventional MRI reconstruction methods treat images and coil sensitivities as discrete objects, leading to high memory demands and limited structural awareness that hamper effective regularization. These limitations hinder accurate reconstruction in highly undersampled scenarios, such as dynamic 3D cardiac magnetic resonance (CMR). We introduce a discretization-free, memory-efficient, model-based framework for dynamic 2D and 3D MRI reconstruction from highly undersampled data. We represent magnetization and coil sensitivities as continuous objects – differentiable functions – using tensor products of univariate neural fields. This tensor product structure enables scalable optimization in high-dimensional spatiotemporal settings. Our method outperforms state-of-the-art model-based reconstructions in dynamic 2D and 3D MR settings, preserving structure and motion even under aggressive undersampling (e.g., acceleration factor 16).
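
下面用 PyTorch 勾勒“一元神经场的张量积”这一表示方式的最小草图(并非论文实现;秩、网络宽度与坐标维度均为示例假设):f(x, y, t) ≈ Σ_r u_r(x)·v_r(y)·w_r(t),每个一元因子由一个小 MLP 表示。

```python
import torch
import torch.nn as nn

class Univariate(nn.Module):
    """把一维坐标映射到 R 维因子的小 MLP。"""
    def __init__(self, rank=16, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, rank),
        )

    def forward(self, coord):                 # coord: (N, 1)
        return self.net(coord)                # (N, R)

class TensorProductField(nn.Module):
    """f(x, y, t) = sum_r u_r(x) * v_r(y) * w_r(t):三个一元神经场的张量积。"""
    def __init__(self, rank=16):
        super().__init__()
        self.u, self.v, self.w = Univariate(rank), Univariate(rank), Univariate(rank)

    def forward(self, x, y, t):               # x, y, t 均为 (N, 1)
        return (self.u(x) * self.v(y) * self.w(t)).sum(dim=-1)   # (N,)

field = TensorProductField(rank=16)
coords = torch.rand(1024, 3)
values = field(coords[:, :1], coords[:, 1:2], coords[:, 2:3])
# 训练时将 values 送入(可微的)MR 前向模型,与欠采样 k 空间数据比较并反向传播
```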

[CV-371] Coarse-to-Fine: Progressive Image Compression for Semantically Hierarchical Classification ICIP2026

【速读】:该论文旨在解决现有渐进式图像编码(progressive coding)方案在机器感知任务中缺乏语义可扩展性的问题。以往方法主要关注样本级难度自适应(如从易到难),未考虑语义层级的渐进传输,导致低码率下语义信息丢失严重。其解决方案的关键在于提出一种语义层次感知的渐进编解码器(semantic hierarchy-aware progressive codec),通过CLIP嵌入对ImageNet-1K类别进行语义分层,并基于通道自回归框架将潜在表示分解为按语义层级排序的通道块,每个块专门优化对应语义层级的重建质量。该设计实现了从单一比特流中支持粗粒度到细粒度的语义渐进传输,在低码率下显著提升粗粒度识别性能,同时在高码率下保持细粒度精度,从而提供了一种任务自适应、高效且可解释的图像编码方案。

链接: https://arxiv.org/abs/2605.08266
作者: Jungwoo Kim,Jun-Hyuk Kim,Jong-Seok Lee
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICIP 2026

点击查看摘要

Abstract:Recent advances in learned image compression (LIC) have enabled practical deployments, spurring active research into image compression for machines and progressive coding schemes. However, their integration remains under-explored: prior works on progressive machine codec predominantly target sample-level difficulty adaptation (i.e., easy-to-hard), without considering semantic-level scalability. In this work, we introduce a semantic hierarchy-aware progressive codec that enables semantic scalability (i.e., coarse-to-fine) from a single bitstream. We first systematically categorize ImageNet-1K classes into CLIP embedding-based semantic hierarchies. Based on a channel-wise autoregressive framework, we decompose latent representations into hierarchically ordered channel blocks, each explicitly optimized for a corresponding semantic hierarchy. Extensive experiments demonstrate that our approach substantially improves coarse-level recognition at low bitrates while maintaining fine-grained accuracy at higher bitrates. By reframing progressive transmission through the lens of semantic scalability, our work provides an efficient and interpretable solution for task-adaptive image coding, outperforming existing progressive codecs under hierarchical evaluation.

[CV-372] Modular Retrieval-Augmented Generalization for Human Action Recognition ICME2026

【速读】:该论文旨在解决基于惯性测量单元(Inertial Measurement Unit, IMU)的人体行为识别(Human Activity Recognition, HAR)中面临的两个关键挑战:训练样本有限性和静态知识利用不足,这两点严重制约了HAR系统的大规模部署。解决方案的核心是提出MoRA(Retrieval-Augmented Module),这是一个专为运动序列设计的检索增强模块,可灵活集成到现有HAR模型中以提升识别性能并保持推理效率。其关键技术在于引入了一个不确定性自适应融合单元,该单元利用IMU信号中的先验物理知识,动态调整原始输出与检索信息之间的融合策略,从而有效缓解检索结果中的信息冗余问题并实现更鲁棒的行为识别。

链接: https://arxiv.org/abs/2605.08117
作者: Peng Liao,Shangsong Liang,Lin Chen,Peijia Zheng
机构: 未知
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: ICME 2026

点击查看摘要

Abstract:Inertial Measurement Unit (IMU)-based Human Activity Recognition (HAR) aims to interpret and classify user behaviors from temporal motion signals. Recently, deep learning frameworks have advanced this task by learning and extracting discriminative spatiotemporal representations, significantly improving recognition performance. However, IMU-based HAR still faces several critical challenges, particularly limited training samples and static knowledge utilization, both of which severely hinder its large-scale deployment. In this paper, we introduce MoRA, the first Retrieval-Augmented Module specifically designed for motion series. It can be flexibly integrated into any existing HAR model, enhancing recognition performance while maintaining inference efficiency. To address issues such as information redundancy in retrieval results and rigid fusion strategies, we propose an uncertainty-adaptive fusion unit within MoRA. This unit leverages previous physical knowledge from IMU signals to dynamically adjust the fusion strategy between original outputs and retrieved information, enabling more robust recognition. Extensive experiments on ten real-world datasets demonstrate that MoRA significantly improves the performance of existing IMU-based HAR models, consistently delivering stable and effective gains. The source code of MoRA is available at: this https URL.
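
其中“不确定性自适应融合”单元的思想可用下面的草图示意(并非 MoRA 原实现;以预测熵作为不确定性、对 logits 做线性插值作为融合方式均为示例假设):

```python
import torch
import torch.nn.functional as F

def uncertainty_adaptive_fusion(model_logits, retrieved_logits):
    """model_logits: 骨干 HAR 模型的输出 (B, C);retrieved_logits: 由检索到的近邻
    动作序列标签分布构造的 (B, C)。骨干预测越不确定(熵越高),越依赖检索信息。"""
    probs = F.softmax(model_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1, keepdim=True)
    max_entropy = torch.log(torch.tensor(float(model_logits.shape[-1])))
    alpha = (entropy / max_entropy).clamp(0.0, 1.0)      # 不确定性权重 ∈ [0, 1]
    return (1 - alpha) * model_logits + alpha * retrieved_logits

# 用法示例
fused = uncertainty_adaptive_fusion(torch.randn(8, 6), torch.randn(8, 6))
pred = fused.argmax(dim=-1)
```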

人工智能

[AI-0] Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

【速读】:该论文旨在解决多智能体系统中元智能体(meta-agent)操作难以形式化、环境交互过程不可追溯以及训练与推理过程中资源效率低下的问题。其核心解决方案是提出 Shepherd 框架,将元智能体对目标智能体的操作建模为函数式编程中的高阶函数,并在 Lean 证明助手内实现核心操作的机械化;同时通过类似 Git 的执行轨迹记录机制,以类型化事件的形式保存所有智能体-环境交互,从而支持任意历史状态的分叉(fork)与回放(replay)。关键创新在于:1)利用函数式抽象统一元智能体行为建模;2)实现比 Docker 快 5 倍的进程和文件系统分叉,且在回放时达到 95% 的提示缓存复用率,显著提升效率;3)实验证明该架构可有效赋能运行时干预、反事实元优化和 Tree-RL 训练等场景,提升任务性能并降低耗时。

链接: https://arxiv.org/abs/2605.10913
作者: Simon Yu,Derek Chong,Ananjan Nandi,Dilara Soylu,Jiuding Sun,Christopher D Manning,Weiyan Shi
机构: 未知
类目: Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE)
备注: 56 pages, 21 figures, 14 tables

点击查看摘要

Abstract:We introduce Shepherd, a functional programming model that formalizes meta-agent operations on target agents as functions, with core operations mechanized in Lean. Shepherd records every agent-environment interaction as a typed event in a Git-like execution trace, enabling any past state to be forked and replayed. The system forks the agent process and its filesystem 5× faster than Docker, achieving 95% prompt-cache reuse on replay. We demonstrate the model through three applications. First, in runtime intervention, a live supervisor increases pair coding pass rates from 28.8% to 54.7% on CooperBench. Second, in counterfactual meta-optimization, branching exploration outperforms baselines across four benchmarks by up to 11 points while reducing wall-clock time by up to 58%. Third, in Tree-RL training, forking rollouts at selected turns improves TerminalBench-2 performance from 34.2% to 39.4%. These results establish Shepherd as an efficient infrastructure for programming meta-agents. We open-source the system to support future research.
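
作为示意,下面用 Python 数据类勾勒“以类型化事件记录交互、并可从任意历史节点分叉/回放”的最小执行轨迹结构(并非 Shepherd 原实现;事件类型与接口均为示例假设,真实系统还涉及进程/文件系统分叉与提示缓存复用):

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass(frozen=True)
class Event:
    kind: Literal["observation", "action", "tool_result"]
    payload: str

@dataclass
class TraceNode:
    event: Event
    parent: Optional["TraceNode"] = None

class ExecutionTrace:
    """类似 Git 的执行轨迹:每个事件是一个节点,fork 从任意历史节点开新分支。"""
    def __init__(self):
        self.head: Optional[TraceNode] = None

    def append(self, event: Event) -> TraceNode:
        self.head = TraceNode(event, parent=self.head)
        return self.head

    def fork(self, node: TraceNode) -> "ExecutionTrace":
        branch = ExecutionTrace()
        branch.head = node           # 共享 node 之前的历史,新事件只写入分支
        return branch

    def replay(self):
        events, node = [], self.head
        while node is not None:
            events.append(node.event)
            node = node.parent
        return list(reversed(events))

trace = ExecutionTrace()
n1 = trace.append(Event("observation", "user asks to fix bug"))
n2 = trace.append(Event("action", "run tests"))
alt = trace.fork(n1)                 # 从 n1 分叉,探索另一种动作序列
alt.append(Event("action", "read stack trace first"))
```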

[AI-1] Engineering Robustness into Personal Agents with the AI Workflow Store

【速读】:该论文旨在解决当前生成式 AI 代理(AI agents)普遍采用的“实时合成”范式所带来的系统性风险问题,即这种快速响应用户指令的方式跳过了软件工程(SE)中关键的严谨流程,如迭代设计、严格测试、对抗评估和分阶段部署等,导致所生成的代理行为可能脆弱且不安全,难以应用于高风险场景。其核心解决方案是将严格的软件工程实践整合进代理的工作流中,构建可复用、可验证、确定性约束的“生产级代理工作流”,并提出建立一个“AI 工作流商店”(AI Workflow Store),以实现更高可靠性和安全性的工具链调用,从而通过规模化 reuse 来摊销因增强鲁棒性而增加的计算与时间成本。这一转变要求从“即时合成”向“结构化工程”演进,以应对灵活性与稳健性之间的根本张力。

链接: https://arxiv.org/abs/2605.10907
作者: Roxana Geambasu(Google and Columbia University),Mariana Raykova(Google),Pierre Tholoniat(Google),Trishita Tiwari(Google),Lillian Tsai(Google),Wen Zhang(Google)
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The dominant paradigm for AI agents is an “on-the-fly” loop in which agents synthesize plans and execute actions within seconds or minutes in response to user prompts. We argue that this paradigm short-circuits disciplined software engineering (SE) processes – iterative design, rigorous testing, adversarial evaluation, staged deployment, and more – that have delivered the (relatively) reliable and secure systems we use today. By focusing on rapid, real-time synthesis, are AI agents effectively delivering users improvised prototypes rather than systems fit for high-stakes scenarios in which users may unwittingly apply them? This paper argues for the need to integrate rigorous SE processes into the agentic loop to produce production-grade, hardened, and deterministically-constrained agent workflows that substantially outperform the potentially brittle and vulnerable results of on-the-fly synthesis. Doing so may require extra compute and time, and if so, we must amortize the cost of rigor through reuse across a broad user community. We envision an AI Workflow Store that consists of hardened and reusable workflows that agents can invoke with far greater reliability and security than improvised tool chains. We outline the research challenges of this vision, which stem from a broader flexibility-robustness tension that we argue requires moving beyond the “on-the-fly” paradigm to navigate effectively.

[AI-2] DataMaster: Towards Autonomous Data Engineering for Machine Learning

【速读】:该论文旨在解决当前机器学习系统性能提升受限于数据质量与工程效率的问题,尤其是在模型架构、训练策略和计算预算日益标准化的背景下,数据成为进一步优化的关键瓶颈。现有数据工程流程高度依赖人工干预,包括外部数据搜索、适配、验证及经验复用等环节,缺乏自动化与系统化方法。为此,作者提出DataMaster框架,其核心在于通过任务条件下的自主数据工程(task-conditioned autonomous data engineering)实现对数据侧的自动优化,而不改变下游学习算法。解决方案的关键创新在于三重机制:一是基于树结构的DataTree组织多分支数据工程路径;二是共享的数据池(Data Pool)支持外部数据源的重复利用;三是全局记忆(Global Memory)记录节点结果、中间产物与可复用知识,从而在开放搜索空间中实现分支依赖的精细化优化与延迟反馈的有效利用,显著提升了下游任务表现。

链接: https://arxiv.org/abs/2605.10906
作者: Yaxin Du,Xiyuan Yang,Zhifan Zhou,Wanxu Liu,Zixing Lei,Zimeng Chen,Fenyi Liu,Haotian Wu,Yuzhu Cai,Zexi Liu,Xinyu Zhu,WenHao Wang,Linfeng Zhang,Chen Qian,Siheng Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As model families, training recipes, and compute budgets become increasingly standardized, further gains in machine learning systems depend increasingly on data. Yet data engineering remains largely manual and ad hoc: practitioners repeatedly search for external datasets, adapt them to existing pipelines, validate candidate data through downstream training, and carry forward lessons from prior attempts. We study task-conditioned autonomous data engineering, where an autonomous agent improves a fixed learning algorithm by optimizing only the data side, including external data discovery, data selection and composition, cleaning and transformation. The goal is to obtain a stronger downstream solution while leaving the learning algorithm unchanged. To address the open-ended search space, branch-dependent refinement, and delayed validation inherent in autonomous data engineering, we propose DataMaster, a data-agent framework that integrates tree-structured search, shared candidate data, and cumulative memory. DataMaster consists of three key components: a DataTree that organizes alternative data-engineering branches, a shared Data Pool that stores discovered external data sources for reuse, and a Global Memory that records node outcomes, artifacts, and reusable findings. Together, these components allow the agent to discover candidate data, construct executable training inputs, evaluate them through downstream feedback, and carry useful evidence across branches. We evaluate DataMaster on two types of benchmarks, MLE-Bench Lite and PostTrainBench. On MLE-Bench Lite, it improves medal rate by 32.27% over the initial score; on PostTrainBench, it surpasses the instruct model on GPQA (31.02% vs 30.35%).

[AI-3] Unmasking On-Policy Distillation: Where It Helps Where It Hurts and Why

【速读】:该论文旨在解决在线蒸馏(on-policy distillation)中监督信号有效性不确定的问题,即在何种条件下此类信号有益或有害、应选择何种教师模型(尤其是自蒸馏时的上下文),以及最优配置是否随token变化。其核心挑战在于现有方法依赖昂贵的训练实验,难以解析个体token级别的动态表现。解决方案的关键在于提出一种无需训练的诊断框架,通过定义理想梯度(ideal per-node gradient)——即最大化学生模型成功概率的参数更新方向,并设计可扩展的定向回放算法(targeted-rollout algorithm)高效估计该梯度,进而以余弦相似度量化蒸馏梯度与理想梯度的对齐程度(梯度对齐分数)。此方法揭示了错误推理路径上蒸馏信号更具对齐性,且最优蒸馏上下文取决于学生容量和任务特性,从而推动基于任务和token级别的精细化蒸馏策略设计。

链接: https://arxiv.org/abs/2605.10889
作者: Mohammadreza Armandpour,Fatih Ilhan,David Harrison,Ajay Jaiswal,Duc N.M Hoang,Fartash Faghri,Yizhe Zhang,Minsik Cho,Mehrdad Farajtabar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:On-policy distillation offers dense, per-token supervision for training reasoning models; however, it remains unclear under which conditions this signal is beneficial and under which it is detrimental. Which teacher model should be used, and in the case of self-distillation, which specific context should serve as the supervisory signal? Does the optimal choice vary from one token to the next? At present, addressing these questions typically requires costly training runs whose aggregate performance metrics obscure the dynamics at the level of individual tokens. We introduce a training-free diagnostic framework that operates at the highest resolution: per token, per question, and per teacher. We derive an ideal per-node gradient defined as the parameter update that maximally increases the student’s probability of success. We then develop a scalable targeted-rollout algorithm to estimate this gradient efficiently, even for long chains of intermediate thoughts. The gradient alignment score, defined as the cosine similarity between this ideal gradient and any given distillation gradient, quantifies the extent to which a particular configuration approximates the ideal signal. Across a range of self-distillation settings and external teacher models, we observe that distillation guidance exhibits substantially higher alignment with the ideal on incorrect rollouts than on correct ones, where the student already performs well and the teacher’s signal tends to become noisy. Furthermore, we find that the optimal distillation context depends jointly on the student model’s capacity and the target task, and that no single universally effective configuration emerges. These findings motivate the use of per-task, per-token diagnostic analyses for distillation.
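
文中“梯度对齐分数”定义为理想梯度与蒸馏梯度的余弦相似度,其计算本身可以用下面的最小草图示意(并非论文原实现;理想梯度需经定向回放估计、蒸馏梯度取决于具体教师配置,此处两者均以占位张量代替):

```python
import torch

def gradient_alignment_score(ideal_grads, distill_grads):
    """两组梯度均为与模型参数一一对应的张量列表。返回把所有参数梯度摊平后
    的余弦相似度:越接近 1,说明蒸馏信号越接近理想更新方向。"""
    ideal = torch.cat([g.reshape(-1) for g in ideal_grads])
    distill = torch.cat([g.reshape(-1) for g in distill_grads])
    return torch.nn.functional.cosine_similarity(ideal, distill, dim=0).item()

# 用法示例(占位梯度)
shapes = [(128, 64), (64,), (64, 32)]
ideal = [torch.randn(s) for s in shapes]
distill = [torch.randn(s) for s in shapes]
print(gradient_alignment_score(ideal, distill))
```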

[AI-4] Shields to Guarantee Probabilistic Safety in MDPs

【速读】:该论文旨在解决传统屏蔽(shielding)方法在处理概率安全性(probabilistic safety)时的局限性问题,即如何在允许一定可接受概率下发生不良事件的前提下,仍能提供形式化保障。传统屏蔽方法虽能确保绝对安全和最大允许性(maximal permissiveness),但在概率场景中难以维持这些强保证。论文的关键解决方案在于提出一个保守扩展的经典屏蔽框架:首先证明在概率安全设定下无法同时保持原有的强安全性和允许性保证;其次设计具有较弱但实用的保证的自然屏蔽机制;最后引入离线与在线构造方法,以实现强安全保证。实证评估验证了新方案在实际应用中的优势及计算可行性。

链接: https://arxiv.org/abs/2605.10888
作者: Linus Heck,Filip Macák,Roman Andriushchenko,Milan Češka,Sebastian Junges
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: Accepted to CAV 2026

点击查看摘要

Abstract:Shielding is a prominent model-based technique to ensure safety of autonomous agents. Classical shielding aims to ensure that nothing bad ever happens and comes with strong guarantees about safety and maximal permissiveness. However, shielding systems for probabilistic safety, where something bad is allowed to happen with an acceptable probability, has proven to be more intricate. This paper presents a formal framework that conservatively extends classical shields to probabilistic safety. In this framework, we (i) demonstrate the impossibility of preserving the strong guarantees on safety and permissiveness, (ii) provide natural shields with weaker guarantees, and (iii) introduce offline and online shield constructions ensuring strong safety guarantees. The empirical evaluation highlights the practical advantages of the new shields, as well as their computational feasibility.
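
作为示意,下面给出一个最朴素的概率屏蔽草图(并非论文中的构造;论文关注的是不同保证强度之间的取舍以及离线/在线构造方法):先用动态规划计算每个状态-动作对之后的最小不安全到达概率,再在运行时屏蔽会使该概率超过阈值的动作。

```python
import numpy as np

def min_reach_unsafe(P, unsafe, horizon):
    """P: (S, A, S) 转移概率;unsafe: 不安全状态的布尔向量。
    返回 Q[s, a]:执行 a 后按最安全策略行动时的最小不安全到达概率。"""
    v = unsafe.astype(float)                     # 终止层:处于不安全状态概率为 1
    for _ in range(horizon):
        q = P @ v                                # (S, A):一步后的期望不安全概率
        v = np.where(unsafe, 1.0, q.min(axis=1)) # 后续按最安全动作行动
    return P @ v

def shield(Q, state, threshold):
    """返回给定阈值下允许的动作集合;若全部超阈值,则至少保留最安全动作。"""
    allowed = np.flatnonzero(Q[state] <= threshold)
    return allowed if allowed.size else np.array([Q[state].argmin()])

# 玩具 MDP:3 个状态、2 个动作,状态 2 不安全
P = np.zeros((3, 2, 3))
P[0, 0] = [0.9, 0.1, 0.0]; P[0, 1] = [0.5, 0.0, 0.5]
P[1, 0] = [0.0, 1.0, 0.0]; P[1, 1] = [0.8, 0.0, 0.2]
P[2, :, 2] = 1.0
unsafe = np.array([False, False, True])
Q = min_reach_unsafe(P, unsafe, horizon=10)
print(shield(Q, state=0, threshold=0.2))
```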

[AI-5] LoKA: Low-precision Kernel Applications for Recommendation Models At Scale ISCA’26

【速读】:该论文旨在解决在大规模推荐模型(Large Recommendation Models, LRMs)中应用低精度浮点运算(如FP8)时面临的数值敏感性高、小矩阵乘法(GEMM)占主导、训练通信密集等问题,这些问题导致直接使用FP8会显著降低模型质量并延长训练时间。解决方案的关键在于提出一种系统-模型协同设计框架LoKA(Low-precision Kernel Applications),其核心包括三个原则:首先通过真实分布下的在线基准测试(LoKA Probe)识别FP8安全与不安全的层及执行效率差异;其次通过可复用的模型适配模块(LoKA Mods)提升数值稳定性和执行效率;最后利用运行时调度器(LoKA Dispatch)基于统计洞察选择满足精度要求的最快FP8内核,从而实现FP8在LRMs中的高效且稳定的部署。

链接: https://arxiv.org/abs/2605.10886
作者: Liang Luo,Yinbin Ma,Quanyu Zhu,Vasiliy Kuznetsov,Yuxin Chen,Jian Jiao,Jiecao Yu,Buyun Zhang,Tongyi Tang,Xiaohan Wei,Yanli Zhao,Zeliang Chen,Yuchen Hao,Venkatesh Ranganathan,Sandeep Parab,Yantao Yao,Maxim Naumov,Chunzhi Yang,Shen Li,Ellie Wen,Wenlin Chen,Santanu Kolay,Chunqiang Tang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to ISCA’26

点击查看摘要

Abstract:Recent GPU generations deliver significantly higher FLOPs using lower-precision arithmetic, such as FP8. While successfully applied to large language models (LLMs), its adoption in large recommendation models (LRMs) has been limited. This is because LRMs are numerically sensitive, dominated by small matrix multiplications (GEMMs) followed by normalization, and trained in communication-intensive environments. Applying FP8 directly to LRMs often degrades model quality and prolongs training time. These challenges are inherent to LRM workloads and cannot be resolved merely by introducing better FP8 kernels. Instead, a system-model co-design approach is needed to successfully integrate FP8. We present LoKA (Low-precision Kernel Applications), a framework that makes FP8 practical for LRMs through three principles: profile under realistic distributions to know where low precision is safe, co-design model components with hardware to expand where it is safe, and orchestrate across kernel libraries to maximize the gains. Concretely, LoKA Probe is a statistically grounded, online benchmarking method that learns activation and weight statistics, and quantifies per-layer errors. This process pinpoints safe and unsafe, fast and slow sites for FP8 adoption. LoKA Mods is a set of reusable model adaptations that improve both numerical stability and execution efficiency with FP8. LoKA Dispatch is a runtime that leverages the statistical insights from LoKA Probe to select the fastest FP8 kernel that satisfies the accuracy requirements.

[AI-6] AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents

【速读】:该论文旨在解决当前缺乏标准化基准来评估生成式 AI 在虚拟细胞(virtual cell)建模中进行体外表型筛选(in silico phenotypic screening)预测能力的问题。现有方法多聚焦于分子层面的读出(molecular readouts),难以直接映射到驱动药物发现的实际表型终点,限制了模型在真实生物场景中的泛化能力。解决方案的关键在于提出 AssayBench,一个基于 1,920 个公开 CRISPR 筛选数据集构建的表型筛选预测基准,涵盖五类广泛细胞表型;同时将筛选预测任务形式化为基因排序预测问题,并引入调整后的 nDCG(adjusted nDCG)作为跨异构检测方法的连续性能度量指标。实证表明,零样本通用大语言模型(LLM)在该任务上优于领域专用 LLM 和可训练基线模型,且通过微调、集成与提示优化等技术可进一步提升性能,从而为虚拟细胞模型的发展提供了一个可量化、可扩展的测试平台。

链接: https://arxiv.org/abs/2605.10876
作者: Edward De Brouwer,Carl Edwards,Alexander Wu,Jenna Collier,Graham Heimberg,Xiner Li,Meena Subramaniam,Ehsan Hajiramezanali,David Richmond,Jan-Christian Hütter,Sara Mostafavi,Gabriele Scalia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: 22 pages

点击查看摘要

Abstract:Recent advances in machine learning and large-scale biological data collections have revived the prospect of building a virtual cell, a computational model of cellular behavior that could accelerate biological discovery. One of the most compelling promises of this vision is the ability to perform in silico phenotypic screens, in which a model predicts the effects of cellular perturbations in unseen biological contexts. This task combines heterogeneous textual inputs with diverse phenotypic outputs, making it particularly well-suited to LLMs and agentic systems. Yet, no standard benchmark currently exists for this task, as existing efforts focus on narrower molecular readouts that are only indirectly aligned with the phenotypic endpoints driving many real-world drug discovery workflows. In this work, we present AssayBench, a benchmark for phenotypic screen prediction, built from 1,920 publicly available CRISPR screens spanning five broad classes of cellular phenotypes. We formulate the screen prediction task as a gene rank prediction for each screen and introduce the adjusted nDCG, a continuous metric for comparing performance across heterogeneous assays. Our extensive evaluation shows that existing methods remain far from empirically estimated performance ceilings and zero-shot generalist LLMs outperform biology-specific LLMs and trainable baselines. Optimization techniques such as fine-tuning, ensembling, and prompt optimization can further improve LLM performance on this task. Overall, AssayBench offers a practical testbed for measuring progress toward in silico phenotypic screening and, more broadly, virtual cell models.
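
下面给出基因排序评测中 nDCG 以及一种“经随机基线校正”的 adjusted nDCG 的简化计算草图(并非论文中 adjusted nDCG 的正式定义;此处减去随机排列期望 nDCG 再归一化的校正方式仅为示例假设):

```python
import numpy as np

def dcg(relevances):
    relevances = np.asarray(relevances, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, relevances.size + 2))
    return float((relevances * discounts).sum())

def ndcg(pred_scores, true_relevance):
    """pred_scores: 模型为每个基因给出的分数;true_relevance: 该基因在筛选中的真实命中。"""
    order = np.argsort(pred_scores)[::-1]
    ideal = np.sort(np.asarray(true_relevance, dtype=float))[::-1]
    return dcg(np.asarray(true_relevance)[order]) / max(dcg(ideal), 1e-12)

def adjusted_ndcg(pred_scores, true_relevance, n_random=200, seed=0):
    """示例性的校正:减去随机排序的期望 nDCG 再缩放,便于跨异构检测方法比较。"""
    rng = np.random.default_rng(seed)
    observed = ndcg(pred_scores, true_relevance)
    rand = np.mean([ndcg(rng.permutation(len(true_relevance)), true_relevance)
                    for _ in range(n_random)])
    return (observed - rand) / max(1.0 - rand, 1e-12)

genes_hit = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])   # 某个筛选中的命中基因(玩具数据)
scores = np.random.default_rng(1).normal(size=10)       # 模型给出的基因分数
print(ndcg(scores, genes_hit), adjusted_ndcg(scores, genes_hit))
```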

[AI-7] Remember the Decision Not the Description: A Rate-Distortion Framework for Agent Memory

【速读】:该论文旨在解决长时程语言智能体在有限运行时内存约束下如何高效组织记忆的问题。传统记忆机制通常基于描述性标准(如相关性、显著性或摘要质量)来存储经验,但这种做法忽略了记忆的核心价值——即保留对决策至关重要的历史区分度。为此,作者提出一种以决策为中心的率失真问题建模方法,通过衡量压缩导致的决策质量损失来评估记忆质量,从而推导出可安全遗忘的精确边界和记忆-失真前沿,揭示了内存预算与决策质量之间的最优权衡关系。解决方案的关键在于提出DeMem算法:它仅在数据证实共享状态会引发决策冲突时才更新记忆分区,并证明了其近似最小最大后悔率保证;实验表明,在相同运行时预算下,DeMem在合成诊断和长程对话基准上均实现了稳定性能提升,验证了“记忆应保留决策相关的区分度而非描述细节”的核心理念。

链接: https://arxiv.org/abs/2605.10870
作者: Mingxi Zou,Zhihan Guo,Langzhang Liang,Zhuo Wang,Qifan Wang,Qingsong Wen,Irwin King,Lizhen Qu,Zenglin Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-horizon language agents must operate under limited runtime memory, yet existing memory mechanisms often organize experience around descriptive criteria such as relevance, salience, or summary quality. For an agent, however, memory is valuable not because it faithfully describes the past, but because it preserves the distinctions between histories that must remain separated under a fixed budget to support good decisions. We cast this as a decision-centric rate-distortion problem, measuring memory quality by the loss in achievable decision quality induced by compression. This yields an exact forgetting boundary for what can be safely forgotten, and a memory-distortion frontier characterizing the optimal tradeoff between memory budget and decision quality. Motivated by this decision-centric view of memory, we propose DeMem, an online memory learner that refines its partition only when data certify that a shared state would induce decision conflict, and prove near-minimax regret guarantees. On both controlled synthetic diagnostics and long-horizon conversational benchmarks, DeMem yields consistent gains under the same runtime budget, supporting the principle that memory should preserve the distinctions that matter for decisions, not descriptions.

[AI-8] Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories

【速读】:该论文旨在解决将联合嵌入预测(Joint-Embedding Predictive Architecture, JEPA)范式扩展至电子健康记录(Electronic Health Record, EHR)数据时所面临的挑战:如何训练一个统一的主干模型,使其既能准确预测患者轨迹,又能直接支持多种下游风险预测任务,而无需针对每个任务进行微调。现有JEPA方法或在预训练后丢弃预测器(如I-JEPA、V-JEPA),或在冻结编码器的基础上训练预测器(如V-JEPA 2-AC),导致编码器无法感知推理阶段预测器所需的滚动信号,从而限制了性能。本文提出的Clin-JEPA框架通过五阶段协同训练课程——预测器预热、联合优化、EMA目标对齐、硬同步和预测器最终化——系统性地缓解了表示坍塌与在线/目标漂移问题,实现了编码器与预测器的稳定联合训练。其核心创新在于设计了一套结构化的预训练流程,使基于Qwen3-8B的编码器与92M参数潜空间预测器能够共享JEPA预测目标并相互校准,最终在MIMIC-IV ICU数据集上显著优于基线方法,在48小时轨迹预测、临床判别性潜在空间几何以及多任务下游评估中均取得优越性能。

链接: https://arxiv.org/abs/2605.10840
作者: Yixuan Yang,Mehak Arora,Ryan Zhang,Baraa Abed,Junseob Kim,Tilendra Choudhary,Md Hassanuzzaman,Kevin Zhu,Ayman Ali,Chengkun Yang,Alasdair Edward Gent,Victor Moas,Rishikesan Kamaleswaran
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: 17 pages, 4 figures, 8 tables. Code: this https URL

点击查看摘要

Abstract:We present Clin-JEPA, a multi-phase co-training framework for joint-embedding predictive (JEPA) pretraining on EHR patient trajectories. JEPA architectures have enabled latent-space planning in robotics and high-quality representation learning in vision, but extending the paradigm to EHR data – to obtain a single backbone that simultaneously forecasts patient trajectories and serves diverse downstream risk-prediction tasks without per-task fine-tuning – remains an open challenge. Existing JEPA frameworks either discard the predictor after pretraining (I-JEPA, V-JEPA) or train it on a frozen pretrained encoder (V-JEPA 2-AC), leaving the encoder unaware of the rollout signal that the retained predictor must use at inference; co-training the encoder and predictor under a shared JEPA prediction objective would supply this grounding, but naïve co-training is unstable, with representation collapse and online/target drift causing autoregressive rollout to diverge. Clin-JEPA’s five-phase pretraining curriculum – predictor warmup, joint refinement, EMA target alignment, hard sync, and predictor finalization – addresses each failure mode by phase, stably co-training a Qwen3-8B-based encoder and a 92M-parameter latent trajectory predictor. On MIMIC-IV ICU data, three independent evaluations support the framework: (1) latent ℓ1 rollout drift uniquely converges (-15.7%) over 48-hour horizons while baselines and ablations diverge (+3% to +4951%); (2) the encoder learns a clinically discriminative latent geometry (deteriorating-patient cohorts displace 4.83× further than stable patients in latent space, vs ≤ 2.62× for baseline encoders); (3) a single backbone outperforms strong tabular and sequence baselines on multi-task downstream evaluation. Clin-JEPA achieves mean AUROC 0.851 on ICareFM EEP and 0.883 on 8 binary risk tasks (+0.038 and +0.041 vs baseline average).
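
作为示意,下面给出 JEPA 类训练中“EMA 目标编码器更新”与“硬同步”这两个通用操作的最小草图(并非 Clin-JEPA 原实现;动量系数与模型结构均为示例假设),分别对应课程中“EMA 目标对齐”与“硬同步”阶段的基本动作:

```python
import copy
import torch

@torch.no_grad()
def ema_update(online_encoder, target_encoder, momentum=0.999):
    """EMA 目标对齐:目标参数 = momentum * 目标 + (1 - momentum) * 在线参数。"""
    for p_t, p_o in zip(target_encoder.parameters(), online_encoder.parameters()):
        p_t.mul_(momentum).add_(p_o, alpha=1.0 - momentum)

@torch.no_grad()
def hard_sync(online_encoder, target_encoder):
    """硬同步:直接拷贝在线编码器权重,消除在线/目标漂移。"""
    target_encoder.load_state_dict(online_encoder.state_dict())

# 用法示例(以一个线性层代替编码器)
online = torch.nn.Linear(32, 32)
target = copy.deepcopy(online)
for p in target.parameters():
    p.requires_grad_(False)

ema_update(online, target)   # “EMA 目标对齐”阶段:每个训练步之后调用
hard_sync(online, target)    # “硬同步”阶段:在进入该阶段时一次性调用
```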

[AI-9] From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World

【速读】:该论文旨在解决当前AI渗透测试代理(AI pentesting agents)评估基准在真实场景适用性不足的问题,即现有评测协议多聚焦于预设任务目标(如夺旗、远程代码执行等),难以反映实际渗透测试中所需的复杂性、开放式探索与策略决策能力。其解决方案的关键在于提出一种以“已验证漏洞发现”为核心的实用评估协议,通过结构化真实数据(ground-truth)结合大语言模型(LLM)语义匹配识别漏洞,利用二分图解析(bipartite resolution)在现实模糊条件下评分,同时引入持续的真实数据维护、对随机代理的重复累积评估、效率指标以及精简测试套件选择机制,从而实现更贴近实战、更具操作意义的AI渗透测试代理比较。

链接: https://arxiv.org/abs/2605.10834
作者: Pedro Conde,Henrique Branquinho,Valerio Mazzone,Bruno Mendes,André Baptista,Nuno Moniz
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:AI pentesting agents are increasingly credible as offensive security systems, but current benchmarks still provide limited guidance on which will perform best in real-world targets. Existing evaluation protocols assess and optimize for predefined goals such as capture-the-flag, remote code execution, exploit reproduction, or trajectory similarity, in simplified or narrow settings. These tools are valuable for measuring bounded capabilities, yet they do not adequately capture the complexity, open-ended exploration, and strategic decision-making required in realistic pentesting. In this paper, we present a practical evaluation protocol that shifts assessment from task completion to validated vulnerability discovery, allowing evaluation in sufficiently complex targets spanning multiple attack surfaces and vulnerability classes. The protocol combines structured ground-truth with LLM-based semantic matching to identify vulnerabilities, bipartite resolution to score findings under realistic ambiguity, continuous ground-truth maintenance, repeated and cumulative evaluation of stochastic agents, efficiency metrics, and reduced-suite selection for sustainable experimentation. This protocol extends the state of the art by enabling a more realistic, operationally informative comparison of AI pentesting agents. To enable reproducibility, we also release expert-annotated ground truth and code for the proposed evaluation protocol: this https URL.

[AI-10] The First Drop of Ink: Nonlinear Impact of Misleading Information in Long-Context Reasoning

【速读】:该论文旨在解决在长上下文场景下,干扰信息(hard distractors)如何影响大语言模型性能的问题,尤其关注干扰项比例与模型性能之间的定量关系。此前研究指出语义相关但误导性的文档会降低性能,但缺乏对比例变化与性能衰减之间具体规律的量化分析。论文的关键解决方案在于通过系统性控制固定长度上下文中硬干扰项的比例,发现性能下降呈现显著的非线性特征——即“首滴墨水效应”(The First Drop of Ink),表现为少量干扰项即可引发剧烈性能下降,后续增加干扰比例则边际效应递减。理论与实证分析表明,这是由于注意力机制导致干扰项即使占比极小也占据主导注意力资源,而提升性能的关键并非单纯移除干扰项,而是通过减少上下文长度来间接降低干扰项占比,最终需将硬干扰项比例降至接近零才能实现显著恢复,凸显上游检索精度的重要性。

链接: https://arxiv.org/abs/2605.10828
作者: Muhan Gao,Zih-Ching Chen,Kuan-Hao Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As large language models are increasingly deployed in retrieval-augmented generation and agentic systems that accumulate extensive context, understanding how distracting information affects long-context performance becomes critical. Prior work shows that semantically relevant yet misleading documents degrade performance, but the quantitative relationship between the proportion of distractors and performance remains unstudied. In this work, we systematically vary the hard-distractor proportion in fixed-length contexts, revealing a striking nonlinear pattern: as the proportion of hard distractors increases, performance drops sharply within the first small fraction, while the remainder of the range yields only marginal additional decline. We term this ‘‘The First Drop of Ink’’ effect, analogous to how a single drop of ink contaminates water. Our theoretical and empirical analyses grounded in attention mechanics show that hard distractors capture disproportionate attention even at small proportions, with diminishing marginal impact as their proportion grows. Controlled experiments further show that filtering gains mainly come from context-length reduction rather than distractor removal; substantial recovery requires reducing the hard-distractor proportion to near zero, highlighting the importance of upstream retrieval precision.

[AI-11] MaD Physics: Evaluating information seeking under constraints in physical environments

【速读】:该论文旨在解决现有科学发现代理评估基准无法有效衡量代理在资源约束下进行测量与规划能力的问题。当前的基准主要聚焦于静态知识推理或无约束实验设计任务,未能捕捉到真实科学实践中因物理和成本限制而需权衡测量质量与数量的核心挑战。解决方案的关键在于提出Measuring and Discovering Physics (MaD Physics) 基准,其包含三个基于不同物理定律(均经修改以避免已有知识干扰)的环境,要求代理在有限测量预算内收集数据并推断底层物理规律,从而评估其从数据中建模和在约束条件下规划的能力。该基准还支持对多模态性、上下文学习等扩展能力的测评,为科学智能代理的系统性评估提供了新范式。

链接: https://arxiv.org/abs/2605.10820
作者: Moksh Jain,Mehdi Bennani,Johannes Bausch,Yuri Chervonyi,Bogdan Georgiev,Simon Osindero,Nenad Tomašev
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 64 pages, 10 figures. Project page: this https URL

点击查看摘要

Abstract:Scientific discovery is fundamentally a resource-constrained process that requires navigating complex trade-offs between the quality and quantity of measurements due to physical and cost constraints. Measurements drive the scientific process by revealing novel phenomena to improve our understanding. Existing benchmarks for evaluating agents for scientific discovery focus on either static knowledge-based reasoning or unconstrained experimental design tasks, and do not capture the ability to make measurements and plan under constraints. To bridge this gap, we propose Measuring and Discovering Physics (MaD Physics), a benchmark to evaluate the ability of agents to make informative measurements and conclusions subject to constraints on the quality and quantity of measurements. The benchmark consists of three environments, each based on a distinct physical law. To mitigate contamination from existing knowledge, MaD Physics includes altered physical laws. In each trial, the agent makes measurements of the system until it exhausts an allotted budget and then the agent has to infer the underlying physical law to make predictions about the state of the system in the future. MaD Physics evaluates two fundamental capabilities of scientific agents: inferring models from data and planning under constraints. We also demonstrate how MaD Physics can be used to evaluate other capabilities such as multimodality and in-context learning. We benchmark agents on MaD Physics using four Gemini models (2.5 Flash Lite, 2.5 Flash, 2.5 Pro, and 3 Flash), identifying shortcomings in their structured exploration and data collection capabilities and highlighting directions to improve their scientific reasoning.
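
作为示意,下面给出“在测量预算内采集数据、再对系统未来状态做预测并评估误差”这一评测回路的最小骨架(并非 MaD Physics 原实现;环境、物理定律与代理策略均为玩具化的示例假设):

```python
import numpy as np

class ToyEnvironment:
    """示例环境:隐藏参数 k 的物理定律 x(t) = sin(k * t),测量带噪声。"""
    def __init__(self, k=1.7, noise=0.05, seed=0):
        self.k, self.noise, self.rng = k, noise, np.random.default_rng(seed)

    def measure(self, t):
        return np.sin(self.k * t) + self.rng.normal(0, self.noise)

def run_trial(env, budget=20):
    # 代理在预算内选择测量点(这里用等间隔测量作为朴素基线策略)
    ts = np.linspace(0, 5, budget)
    xs = np.array([env.measure(t) for t in ts])
    # 用收集到的数据推断定律(这里对 k 做一维网格搜索)
    ks = np.linspace(0.5, 3.0, 500)
    errs = [np.mean((xs - np.sin(k * ts)) ** 2) for k in ks]
    k_hat = ks[int(np.argmin(errs))]
    # 在未来时间点上评估预测误差
    t_future = np.linspace(6, 8, 50)
    return np.mean((np.sin(k_hat * t_future) - np.sin(env.k * t_future)) ** 2)

print(run_trial(ToyEnvironment(), budget=20))
```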

[AI-12] CLEF: EEG Foundation Model for Learning Clinical Semantics

【速读】:该论文旨在解决临床脑电图(EEG)解读中缺乏对完整EEG会话进行推理以及未能整合临床背景信息的问题。现有EEG基础模型多针对短时窗解码设计,未考虑临床上下文,限制了其在真实医疗场景中的应用。解决方案的关键在于提出CLEF——一个基于临床语境的长程EEG基础模型,通过将EEG会话表示为三维多锥谱图标记(3D multitaper spectrogram tokens),实现可扩展的Transformer建模;同时利用对比学习目标将嵌入空间与神经科医生报告及结构化电子健康记录(EHR)数据对齐,从而构建具备临床意义的表征。实验表明,CLEF在234项任务中优于先前模型,平均AUROC从0.65提升至0.74,验证了session-scale、临床驱动的表示学习范式在EEG分析中的有效性。

链接: https://arxiv.org/abs/2605.10817
作者: Peng Cao,Ali Mirzazadeh,Jong Woo Lee,Aleksandar Videnovic,Dina Katabi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Clinical EEG interpretation requires reasoning over full EEG sessions and integrating signal patterns with clinical context. Existing EEG foundation models are largely designed for short-window decoding and do not incorporate clinical context. We introduce CLEF, a clinically grounded long-context EEG foundation model. CLEF represents EEG sessions as 3D multitaper spectrogram tokens, enabling tractable Transformer modeling at session scale, and aligns embeddings with neurologist reports and structured EHR data through contrastive objectives. We evaluate CLEF on a new 234-task benchmark spanning disease phenotypes, medication exposures, and EEG findings, with more than 260k EEG sessions from over 108k patients. CLEF outperforms prior EEG foundation models on 229 of 234 tasks, improving mean AUROC from 0.65 to 0.74. Reconstruction-only pretraining surpasses prior EEG foundation models, while report and EHR alignment yields further gains. Held-out concept and external-cohort experiments suggest that these representations transfer beyond observed alignment targets. These results support session-scale, clinically grounded representation learning as a promising foundation-model paradigm for clinical EEG.

[AI-13] Policy Gradient Methods for Non-Markovian Reinforcement Learning

【速读】:该论文旨在解决非马尔可夫决策过程(Non-Markovian Decision Processes, NMDPs)中策略梯度方法的理论与实践问题,其中观测和奖励依赖于完整的交互历史。传统方法通常将代理状态(agent state)动态视为固定或通过预测目标学习,难以有效联合优化状态表示与控制策略。解决方案的关键在于提出一种以奖励为中心的建模框架——Agent State-Markov (ASM) 策略,该策略联合优化代理状态动力学与控制策略,以最大化期望累计奖励。作者建立了适用于 episodic 和无限时域折扣 NMDP 的新型策略梯度定理,并基于此设计了 Agent State-Markov Policy Gradient (ASMPG) 算法,利用代理状态的递归结构实现高效优化,同时提供有限时间与几乎必然收敛性保证。

链接: https://arxiv.org/abs/2605.10816
作者: Avik Kar,Siddharth Chandak,Rahul Singh,Soumitra Sinhahajari,Eric Moulines,Shalabh Bhatnagar,Nicholas Bambos
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 39 pages, 5 figures, 1 table

点击查看摘要

Abstract:We study policy gradient methods for reinforcement learning in non-Markovian decision processes (NMDPs), where observations and rewards depend on the entire interaction history. To handle this dependence, the agent maintains an internal state that is recursively updated to provide a compact summary of past observations and actions. In contrast to approaches that treat the agent state dynamics as fixed or learn it via predictive objectives, we propose a reward-centric formulation that jointly optimizes the agent state dynamics and the control policy to maximize the expected cumulative reward. To this end, we consider a class of Agent State-Markov (ASM) policies, comprising an agent state dynamics and a control policy that maps the agent state to actions. We establish a novel policy gradient theorem for ASM policies, extending the classical policy gradient results from the Markovian setting to episodic and infinite-horizon discounted NMDPs. Building on this gradient expression, we propose the Agent State-Markov Policy Gradient (ASMPG) algorithm, which leverages the recursive structure of the agent state dynamics for efficient optimization. We establish finite-time and almost sure convergence guarantees, and empirically demonstrate that, on a range of non-Markovian tasks, ASMPG outperforms baselines that learn state representations via predictive objectives.

[AI-14] Probing Cross-modal Information Hubs in Audio-Visual LLM s ICML2026

【速读】:该论文旨在解决音频-视觉大语言模型(Audio-Visual Large Language Models, AVLLMs)中跨模态信息流动机制不明确的问题,特别是音频与视觉模态之间信息如何在模型内部的token表示中被编码和交互。研究发现,AVLLMs主要将融合后的音视频信息存储于“sink tokens”中,且并非所有sink tokens均均匀承载跨模态信息,而是存在一类专门存储此类信息的“跨模态sink tokens”。解决方案的关键在于识别并利用这些跨模态sink tokens,在不增加训练成本的前提下,通过引导模型更依赖这些特定token中的集成信息,从而有效缓解生成过程中的幻觉问题。

链接: https://arxiv.org/abs/2605.10815
作者: Jihoo Jung,Chaeyoung Jung,Ji-Hoon Kim,Joon Son Chung
机构: 未知
类目: Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Accepted by ICML 2026

点击查看摘要

Abstract:Audio-visual large language models (AVLLMs) have recently emerged as a powerful architecture capable of jointly reasoning over audio, visual, and textual modalities. In AVLLMs, the bidirectional interaction between audio and video modalities introduces intricate processing dynamics, necessitating a deeper understanding of their internal mechanisms. However, unlike extensively studied text-only or large vision language models, the internal workings of AVLLMs remain largely unexplored. In this paper, we focus on cross-modal information flow between audio and visual modalities in AVLLMs, investigating where information derived from one modality is encoded within the token representations of the other modality. Through an analysis of multiple recent AVLLMs, we uncover two common findings. First, AVLLMs primarily encode integrated audio-visual information in sink tokens. Second, sink tokens do not uniformly hold cross-modal information. Instead, a distinct subset of sink tokens, which we term cross-modal sink tokens, specializes in storing such information. Based on these findings, we further propose a simple training-free hallucination mitigation method by encouraging reliance on integrated cross-modal information within cross-modal sink tokens. Our code is available at this https URL.

[AI-15] NanoResearch: Co-Evolving Skills Memory and Policy for Personalized Research Automation

【速读】:该论文试图解决当前大语言模型(Large Language Model, LLM)驱动的多智能体系统在科研自动化中缺乏个性化的问题。现有系统生成统一输出,忽视了用户在资源配置、方法偏好和产出格式上的差异,导致对个体用户的适配性不足。解决方案的关键在于提出NanoResearch框架,通过三层次协同演化机制实现个性化:一是构建技能库(skill bank),将重复操作抽象为可跨项目复用的程序规则;二是引入记忆模块(memory module),保留用户与项目特定的经验以指导规划决策;三是采用无标签策略学习(label-free policy learning),将自由形式反馈转化为规划器参数的持久更新,从而持续调整协作策略以贴合用户的隐性偏好。这三个层面相互强化,形成闭环进化机制,使系统能随使用周期迭代优化,提升研究质量并降低单位成本。

链接: https://arxiv.org/abs/2605.10813
作者: Jinhang Xu,Qiyuan Zhu,Yujun Wu,Zirui Wang,Dongxu Zhang,Jianxin Tang,Marcia Tian,Yiling Duan,Siyuan Li,Jingxuan Wei,Sirui Han,Yike Guo,Odin Zhang,Conghui He,Cheng Tan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 40 pages, 14 figures, 7 tables

点击查看摘要

Abstract:LLM-powered multi-agent systems can now automate the full research pipeline from ideation to paper writing, but a fundamental question remains: automation for whom? Researchers operate under different resource configurations, hold different methodological preferences, and target different output formats. A system that produces uniform outputs regardless of these differences will systematically under-serve every individual user, making personalization a precondition for research automation to be genuinely usable. However, achieving it requires three capabilities that current systems lack: accumulating reusable procedural knowledge across projects, retaining user-specific experience across sessions, and internalizing implicit preferences that resist explicit formalization. We propose NanoResearch, a multi-agent framework that addresses these gaps through tri-level co-evolution. A skill bank distills recurring operations into compact procedural rules reusable across projects. A memory module maintains user- and project-specific experience that grounds planning decisions in each user’s research history. A label-free policy learning converts free-form feedback into persistent parameter updates of the planner, reshaping subsequent coordination. These three layers co-evolve: reliable skills produce richer memory, richer memory informs better planning, and preference internalization continuously realigns the loop to each user. Extensive experiments demonstrate that NanoResearch delivers substantial gains over state-of-the-art AI research systems, and progressively refines itself to produce better research at lower cost over successive cycles.

[AI-16] Threat Modelling using Domain-Adapted Language Models: Empirical Evaluation and Insights

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在结构化威胁建模任务中表现不稳定、效果受限的问题,特别是在5G安全场景下使用STRIDE方法进行威胁分类时的可靠性问题。其关键解决方案在于系统性地评估了领域适配模型(domain-adapted language models)与通用模型在不同规模、解码策略(贪婪搜索 vs. 随机采样)、提示工程技术下的性能差异,发现单纯依赖领域微调或模型规模扩展并不能保证稳定提升威胁建模准确性,反而揭示出解码策略对输出有效性具有显著影响,并强调需要引入更任务特定的推理机制和更强的安全概念约束,以突破当前LLMs在结构化威胁建模中的根本局限。

链接: https://arxiv.org/abs/2605.10808
作者: Saba Pourhanifeh,AbdulAziz AbdulGhaffar,Ashraf Matrawy
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly explored for cybersecurity applications such as vulnerability detection. In the domain of threat modelling, prior work has primarily evaluated a number of general-purpose Large Language Models under limited prompting settings. In this study, we extend the research area of structured threat modelling by systematically evaluating domain-adapted language models of different sizes against their general counterparts. We use both LLMs and Small Language Models (SLMs) that were domain adapted to telecommunications and cybersecurity. For the structured threat modelling, we selected the widely used STRIDE approach and the application area is 5G security. We present a comprehensive empirical evaluation using 52 different configurations (on 8 different language models) to analyze the impact of 1) domain adaptation, 2) model scale, 3) decoding strategies (greedy vs. stochastic sampling), and 4) prompting technique on STRIDE threat classification. Our results show that domain-adapted models do not consistently outperform their general-purpose counterparts, and decoding strategies significantly affect model behavior and output validity. They also show that while larger models generally achieve higher performance, these gains are neither consistent nor sufficient for reliable threat modelling. These findings highlight fundamental limitations of current LLMs for structured threat modelling tasks and suggest that improvements require more than additional training data or model scaling, motivating the need for incorporating more task-specific reasoning and stronger grounding in security concepts. We present insights on invalid outputs encountered and present suggestions for prompting tailored specifically for STRIDE threat modelling.

[AI-17] Interpretable Machine Learning for Football Performance Analysis: Evidence of Limited Transferability from Elite Leagues to University Competition

【速读】:该论文试图解决的问题是:在足球表现分析中,基于精英联赛数据学习到的性能决定因素(performance determinants)及其解释是否能在领域迁移(domain shift)下——即从精英级比赛转移到大学级别比赛时——保持结构可转移性和解释鲁棒性。解决方案的关键在于采用相同的特征空间,在顶级欧洲联赛数据上训练随机森林(Random Forest)和多层感知机(Multilayer Perceptron)模型,并将其直接应用于国立清华大学(NTHU)大学足球队的数据集,同时使用SHapley Additive exPlanations(SHAP)与Counterfactual Impact Score(CIS)两种解释方法进行对比分析。结果表明,精英级比赛中性能决定因素具有高度一致性与稳定性,而大学级别则出现关键指标排序显著变化、解释稳定性下降及对解释方法敏感性增强,说明解释的不稳定性并非单纯由方法局限引起,而是反映了目标域内部结构模糊性,从而为跨领域解释的可靠性提供了诊断依据。

链接: https://arxiv.org/abs/2605.10796
作者: Yu-Fang Tsai,Yu-Jen Chen,Kok-Hua Tan,Sheng-Chieh Huang,You-Ying Ji,Yu-Lun Chen,Chun-Yi Wang,Chien-Ming Hsu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 19 pages, 6 figures

点击查看摘要

Abstract:Machine learning has become increasingly prevalent in football performance analysis, yet most studies prioritize predictive accuracy while implicitly assuming that learned performance determinants and their interpretations are transferable across competition levels. Whether interpretability remains reliable under domain shift, from elite to university football, remains largely unexplored. This study investigates whether performance determinants learned from elite competitions are structurally transferable to university-level football and whether their interpretations remain robust under domain shift. Models were trained on large-scale event data from the top five European leagues and applied to university football data from National Tsing Hua University (NTHU) using an identical feature space. Random Forest and Multilayer Perceptron models were interpreted using SHapley Additive exPlanations (SHAP) and Counterfactual Impact Score (CIS). Across five experiments, elite football exhibited a stable and consistent hierarchy of performance determinants across leagues, models, and explanation methods. In contrast, NTHU university football showed substantial reordering of key indicators, reduced explanation stability, weaker structural agreement with elite domains, and increased sensitivity to explanation method. These findings suggest that interpretability robustness is domain-dependent. Rather than reflecting methodological limitations alone, instability in explanations under domain shift may serve as a diagnostic signal of structural ambiguity in the target domain.

[AI-18] Can You Keep a Secret? Involuntary Information Leakage in Language Model Writing

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在实际部署中可能泄露受保护信息的问题,例如系统提示、链式思维推理过程或敏感数据。研究者通过设计实验,向模型注入一个秘密词并要求其不直接提及,随后让另一模型在故事文本中识别该秘密词,从而检验是否存在隐性泄露。解决方案的关键在于揭示:即使模型被明确指令禁止透露秘密词,仍会通过主题选择、意象和场景设置等非显式方式泄露信息——这种泄露具有跨模型可读性、随模型规模显著增强,并且可通过引入干扰概念部分转移泄露目标。这表明,对秘密的关注本身会在模型内部打开一个无法完全关闭的信息通道。

链接: https://arxiv.org/abs/2605.10794
作者: Ari Holtzman,Peter West
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Language models are deployed in settings that require compartmentalization: system prompts should not be disclosed, chain-of-thought reasoning is hidden from users, and sensitive data passes through shared contexts. We test whether models can keep prompted information out of their writing. We give each model a secret word with instructions not to reveal it, then ask it to write a story. A second model tries to identify the secret from the story in a binary discrimination test. The secret word never appears literally in any output, but all five frontier models we test leak it thematically – through topic choice, imagery, and setting – at rates significantly different from chance, up to 79%. When told to actively hide the secret, models write away from it, and this avoidance is itself detectable. The leakage is cross-model readable, scales sharply with model size within two model families, and disappears entirely for short-form writing like jokes. Giving the model a decoy concept to “focus on instead” partially redirects the leakage from the real secret to the decoy. Attending to a secret appears to open up an information channel that frontier LLMs cannot close, even when instructed to.

[AI-19] PathISE: Learning Informative Path Supervision for Knowledge Graph Question Answering

【速读】:该论文旨在解决知识图谱问答(Knowledge Graph Question Answering, KGQA)中因缺乏高质量中间监督信号而导致模型训练困难的问题,尤其是获取问题相关路径或子图等监督信息所需的时间和资源成本过高。解决方案的关键在于提出PathISE框架,通过引入一个轻量级Transformer估计器来基于答案级别标签生成伪路径级监督信号,进而将这些信号蒸馏至大型语言模型(Large Language Models, LLMs)的路径生成器中,从而生成可接地于知识图谱(Knowledge Graph, KG)的紧凑证据路径,用于归纳式答案推理。该方法无需依赖昂贵的LLM精炼监督信号,且所生成的监督信号具有可复用性,能够提升现有KGQA模型性能。

链接: https://arxiv.org/abs/2605.10791
作者: Shengxiang Gao,Chao Lei,Jey Han Lau,Jianzhong Qi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Knowledge Graph Question Answering (KGQA) aims to answer user questions by reasoning over Knowledge Graphs (KGs). Recent KGQA methods mainly follow the retrieval-augmented generation paradigm to ground Large Language Models (LLMs) with structured knowledge from KGs. However, training effective models to retrieve question-relevant evidence from KGs typically requires high-quality intermediate supervision signals, such as question-relevant paths or subgraphs, which are time- and resource-intensive to obtain. We propose PathISE, a novel framework for learning high-quality intermediate supervision from answer-level labels. PathISE introduces a lightweight transformer-based estimator that estimates the informativeness of relation paths to construct pseudo path-level supervision. This supervision is then distilled into an LLM path generator, whose generated paths are grounded in the KG to provide compact evidence for inductive answer reasoning. Extensive experiments on three KGQA benchmarks show that PathISE achieves competitive or state-of-the-art KGQA performance, and provides reusable supervision signals that can enhance existing KGQA models, without relying on costly LLM-refined supervision signals. Our source code is available at this https URL.

[AI-20] ComplexMCP: Evaluation of LLM Agents in Dynamic Interdependent and Large-Scale Tool Sandbox

【速读】:该论文旨在解决当前大型语言模型(LLM)代理在真实商业软件自动化场景中面临的“最后一公里”挑战,即现有代理虽能调用孤立API,却难以处理工具间原子性、依赖性和环境噪声并存的复杂交互。其解决方案的关键在于提出ComplexMCP基准测试平台,基于模型上下文协议(Model Context Protocol, MCP),整合来自7个状态感知沙箱的300多个精细测试工具(涵盖办公套件到金融系统),并通过种子驱动架构模拟动态环境状态与不可预测的API故障,实现确定性但多样化的评估。该设计揭示了当前LLM代理在面对互依赖工作流时存在的三大瓶颈:工具检索饱和、过度自信导致验证缺失以及策略性放弃,从而为下一代鲁棒自主系统提供了关键评测基准。

链接: https://arxiv.org/abs/2605.10787
作者: Yuanyang Li,Xue Yang,Longyue Wang,Weihua Luo,Hongyang Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Current LLM agents are proficient at calling isolated APIs but struggle with the “last mile” of commercial software automation. In real-world scenarios, tools are not independent; they are atomic, interdependent, and prone to environmental noise. We introduce ComplexMCP, a benchmark designed to evaluate agents in these rigorous conditions. Built on the Model Context Protocol (MCP), ComplexMCP provides over 300 meticulously tested tools derived from 7 stateful sandboxes, ranging from office suites to financial systems. Unlike existing datasets, our benchmark utilizes a seed-driven architecture to simulate dynamic environment states and unpredictable API failures, ensuring a deterministic yet diverse evaluation. We evaluate various LLMs across full-context and RAG paradigms, revealing a stark performance gap: even top-tier models fail to exceed a 60% success rate, far trailing human performance of 90%. Granular trajectory analysis identifies three fundamental bottlenecks: (1) tool retrieval saturation as action spaces scale; (2) over-confidence, where agents skip essential environment verifications; and (3) strategic defeatism, a tendency to rationalize failure rather than pursuing recovery. These findings underscore the insufficiency of current agents for interdependent workflows, positioning ComplexMCP as a critical testbed for the next generation of resilient autonomous systems.

[AI-21] TrajPrism: A Multi-Task Benchmark for Language-Grounded Urban Trajectory Understanding

【速读】:该论文旨在解决城市移动(urban mobility)中轨迹与语言描述之间对齐问题,即如何在真实世界轨迹数据上统一评估轨迹建模与自然语言理解的协同能力。以往研究多局限于几何中心的轨迹建模或侧重于路线规划的语言任务,缺乏对文本与底层路径之间细粒度、可验证对齐的系统性评测。解决方案的关键在于构建TrajPrism——一个包含三个子任务的多任务基准:(i) 指令条件下的轨迹生成,(ii) 语义驱动的轨迹检索,以及 (iii) 轨迹描述生成,并配套一套衡量轨迹保真度、检索质量和语言接地性的评估协议。该基准通过四维旅行意图分类法筛选真实城市轨迹(覆盖波尔图、旧金山和北京共30万条),形成210万条任务实例,同时开发了针对各任务的原型模型(TrajAnchor、TrajFuse、TrajRap),证明仅依赖几何信息的基线方法在语言输入输出接口场景下存在显著性能差距。

链接: https://arxiv.org/abs/2605.10782
作者: Lihuan Li,Wilson Wongso,Baiyu Chen,Hao Xue,Ruiyi Yang,Yifan Duan,Xiachong Lin,Yang Song,Flora Salim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This paper is under review

点击查看摘要

Abstract:Urban mobility is naturally expressed both as trajectories in space and as natural-language descriptions of travel intent, constraints, and preferences. However, prior work rarely evaluates these two modalities together on the same real-world trajectories: trajectory modeling often stays geometry-centric, while language-centric mobility benchmarks frequently target route planning and tool use rather than fine-grained, verifiable alignment between text and the underlying route. We introduce TrajPrism, a multi-task benchmark for language-trajectory alignment that unifies (i) instruction-conditioned trajectory generation, (ii) language-driven semantic trajectory retrieval, and (iii) trajectory captioning, together with an evaluation protocol that measures trajectory fidelity, retrieval quality, and language groundedness. We construct TrajPrism by pairing real urban trajectories with judge-filtered language annotations generated under a four-dimensional travel-intent taxonomy. The benchmark contains 300K selected trajectories across Porto, San Francisco, and Beijing, yielding 2.1M task instances from three instruction variants, three retrieval queries, and one caption per trajectory. We further develop proof-of-concept models for each task: TrajAnchor for instruction-conditioned trajectory generation, TrajFuse for semantic trajectory retrieval, and TrajRap for trajectory captioning. These models instantiate the proposed tasks and show that geometry-only trajectory baselines leave a large gap on our protocol, especially where language is part of the input-output interface. We release TrajPrism with code and a reproducible annotation pipeline that is designed to be portable across cities, given compatible trajectory inputs and map resources.

[AI-22] MATRA: Modeling the Attack Surface of Agentic AI Systems – OpenClaw Case Study

【速读】:该论文旨在解决当前在部署生成式 AI(Generative AI)代理系统时,缺乏系统性方法来评估已知威胁类别如何转化为特定应用场景下的实际风险的问题。解决方案的关键在于提出 MATRA(Modeling Agentic Threats and Risks for Autonomous systems),这是一个适应性强的威胁建模框架,其核心机制包括基于资产的影响评估和利用攻击树(attack tree)量化特定架构下风险发生的可能性;通过实例验证表明,诸如网络沙箱隔离和最小权限访问等架构控制措施可有效降低风险敞口,从而限制注入攻击的成功扩散范围。

链接: https://arxiv.org/abs/2605.10763
作者: Tim Van hamme,Thomas Vissers,Javier Carnerero-Cano,Mario Fritz,Emil C. Lupu,Lieven Desmet,Dinil Mon Divakaran
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Accepted for presentation at the 5th International Workshop on Designing and Measuring Security in Systems with AI (DeMeSSAI 2026), co-located with the 11th IEEE European Symposium on Security and Privacy (EuroSP 2026), Lisbon, Portugal, July 10, 2026

点击查看摘要

Abstract:LLMs are increasingly deployed as autonomous agents with access to tools, databases, and external services, yet practitioners (across different sectors) lack systematic methods to assess how known threat classes translate into concrete risks within a specific agentic deployment. We present MATRA, a pragmatic threat modeling framework for agentic AI systems that adapts established risk assessment methodology to systematically assess how known LLM threats translate into deployment-specific risks. MATRA begins with an asset-based impact assessment and utilizes attack trees to determine the likelihood of these impacts occurring within the system architecture. We demonstrate MATRA on a personal AI agent deployment using OpenClaw, quantifying how architectural controls such as network sandboxing and least-privilege access reduce risk by limiting the blast radius of successful injections.
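
作为补充,下面给出一个极简的攻击树似然传播草图(示意性质,并非 MATRA 的官方实现;节点划分、概率数值以及“网络沙箱”带来的削减幅度均为虚构假设),用来说明摘要中“用攻击树确定影响发生可能性、并量化架构控制如何缩小注入攻击爆炸半径”的基本思路:OR 节点表示至少一条子路径成功,AND 节点表示所有前提同时满足。

```python
# 极简攻击树似然传播草图(假设叶节点概率相互独立;数值纯属虚构)
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    name: str
    kind: str = "leaf"          # "leaf" | "and" | "or"
    prob: float = 0.0           # 仅叶节点使用
    children: List["Node"] = field(default_factory=list)

def likelihood(node: Node) -> float:
    if node.kind == "leaf":
        return node.prob
    child_p = [likelihood(c) for c in node.children]
    if node.kind == "and":      # 所有前提同时成立
        p = 1.0
        for x in child_p:
            p *= x
        return p
    p_none = 1.0                # "or": 至少一条子路径成功
    for x in child_p:
        p_none *= (1.0 - x)
    return 1.0 - p_none

# 示例: 数据外泄 = 提示注入 AND 读取敏感文件 AND 经网络外传
inject = Node("prompt_injection", "leaf", 0.30)
read   = Node("read_secret_file", "leaf", 0.40)
exfil  = Node("exfiltrate_via_network", "leaf", 0.50)
impact = Node("data_exfiltration", "and", children=[inject, read, exfil])

print("无控制措施:", round(likelihood(impact), 3))
exfil.prob = 0.05               # 假设网络沙箱大幅降低外传成功率
print("加入网络沙箱后:", round(likelihood(impact), 3))
```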

[AI-23] The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的基础智能体(foundation agents)在长周期、复杂任务中缺乏理论指导的问题,尤其是在任务持续性、环境适应性和安全自改进等方面的不确定性。其核心挑战在于,现有工程实践依赖经验试错构建工具循环、记忆库等组件,而未建立从第一性原理出发的设计准则。解决方案的关键在于引入控制论(cybernetics),将经典控制论的六条定律映射为六项代理设计原则,并进一步提炼为三项工程目标(可靠性、长期运行能力与自我改进能力),形成名为“代理控制论”(Agent Cybernetics)的理论框架,从而为基础智能体提供可解释、可验证且具备科学根基的设计范式。

链接: https://arxiv.org/abs/2605.10754
作者: Xinrun Wang,Chang Yang,He Zhao,Zhuoyi Lin,Shuyue Hu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Preliminary Work

点击查看摘要

Abstract:LLM-based foundation agents that perceive, reason, and act across thousands of reasoning steps are rapidly becoming the dominant paradigm for deploying artificial intelligence in open-ended, long-horizon complex tasks. Despite this significance, the field remains overwhelmingly engineering-driven. Engineering practice has converged on useful primitives (tool loops, memory banks, harnesses, reflection steps), yet these are assembled by empirical trial and error rather than from first principles. Fundamental questions remain open: under what conditions does a long-running agent remain on-task? How should an agent respond when its environment exceeds its representational capacity? What architectural properties are necessary for safe self-improvement? We argue that cybernetics, the mid-twentieth-century science of control and communication in complex systems, provides the missing theoretical scaffold for foundation agents. By mapping six canonical laws of classical cybernetics onto six agent design principles, and synthesizing those principles into three engineering desiderata (reliability, lifelong running, and self-improvement), we arrive at a framework termed Agent Cybernetics. Three application domains, code generation, computer use and automated research, exemplify the analytical framework of agent cybernetics by identifying failure modes and concrete engineering recommendations. We hope that agent cybernetics opens a new research venue and establishes the scientific foundation that foundation agents need for principled, reliable real-world deployment.

[AI-24] Provable Sparse Inversion and Token Relabel Enhanced One-shot Federated Learning with ViTs

【速读】:该论文旨在解决单轮联邦学习(One-Shot Federated Learning)在极端非独立同分布(extremely non-IID)场景下,现有无数据方法生成的合成数据质量低、语义与真实标签严重错位的问题。其解决方案的关键在于提出FedMITR框架,通过稀疏模型逆向(Sparse Model Inversion)策略选择性地重建图像语义前景,抑制无信息背景噪声;同时引入差异化标记策略——高信息密度区域使用伪标签进行蒸馏,低信息密度区域则借助集成模型重标定,从而有效降低梯度方差并提升ViT模型预测稳定性。理论分析表明,该方法通过减少梯度不稳定性与方差,显著收紧了泛化误差界。

链接: https://arxiv.org/abs/2605.10748
作者: Li Shen,Xiaolei Hao,Qinglun Li,Xiaochun Cao,Zhifeng Hao,Xun Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 Pages

点击查看摘要

Abstract:One-Shot Federated Learning, where a central server learns a global model in a single communication round, has emerged as a promising paradigm. However, under extremely non-IID settings, existing data-free methods often generate low-quality data that suffers from severe semantic misalignment with ground-truth labels. To overcome these issues, we propose a novel Federated Model Inversion and Token Relabel (FedMITR) framework, which trains the global model by fully exploiting all patches of synthetic images. Specifically, FedMITR employs sparse model inversion during data generation, selectively inverting semantic foregrounds while halting the inversion of uninformative backgrounds. To address semantically meaningless tokens that hinder ViT predictions, we implement a differentiated strategy: patches with high information density utilize generated pseudo-labels, while patches with low information density are relabeled via ensemble models for robust distillation. Theoretically, our analysis based on algorithmic stability reveals that Sparse Model Inversion eliminates gradient instability arising from background noise, while Token Relabel effectively reduces gradient variance, collectively guaranteeing a tighter generalization bound. Empirically, extensive experimental results demonstrate that FedMITR substantially outperforms existing baselines under various settings.

[AI-25] An Uncertainty-Aware Resilience Micro-Agent for Causal Observability in the Computing Continuum

【速读】:该论文旨在解决计算连续体(computing continuum)中灰色故障(grey failures)导致的症状模糊且重叠的问题,现有诊断方法因缺乏因果意识或在高认知不确定性(epistemic uncertainty)下运行,易引发破坏性干预。解决方案的关键在于提出一种不确定性感知的弹性微代理框架(AURORA),其核心是通过并行微代理集成自由能原理(free-energy principle)、因果do-演算(causal do-calculus)与局部因果状态图(localized causal state-graphs),在每个故障的马尔可夫毯(Markov blanket)内实现反事实根因分析;同时引入双门控执行机制,在因果置信度高且预测认知不确定性受控时才授权修复操作,否则放弃本地干预并上报雾层(fog tier),从而在保障诊断精度的同时显著降低误操作风险。

链接: https://arxiv.org/abs/2605.10718
作者: Suvi De Silva,Alfreds Lapkovskis,Alaa Saleh,Sasu Tarkoma,Praveen Kumar Donta
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Grey failures in the computing continuum produce ambiguous overlapping symptoms that existing approaches fail to diagnose reliably, either due to a lack of causal awareness or acting under high epistemic uncertainty, risking destructive interventions. This paper presents an uncertainty-aware resilience micro-agent for causal observability (AURORA), a lightweight framework for diagnosing and mitigating grey failures in edge-tier environments. The framework employs parallel micro-agents that integrate the free-energy principle, causal do-calculus, and localized causal state-graphs to support counterfactual root-cause analysis within each fault’s Markov blanket. Restricting inference to causally relevant variables reduces computational overhead while preserving diagnostic fidelity. AURORA further introduces a dual-gated execution mechanism that authorizes remediation only when causal confidence is high and predicted epistemic uncertainty is bounded; otherwise, it abstains from local intervention and escalates the diagnostic payload to the fog tier. Our experiments demonstrate that AURORA outperforms baselines, achieving a 0% destructive action rate, while maintaining 62.0% repair accuracy and a 3ms mean time to repair.
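
下面是“双门控执行”决策逻辑的一个最小示意草图(非 AURORA 官方实现;字段名与阈值均为假设):仅当因果置信度足够高且预测的认知不确定性有界时才执行本地修复,否则放弃本地干预并将诊断负载上报雾层。

```python
# 双门控执行决策草图: 阈值与返回动作均为示意性假设
from dataclasses import dataclass

@dataclass
class Diagnosis:
    root_cause: str
    causal_confidence: float      # 反事实根因分析给出的置信度, [0, 1]
    epistemic_uncertainty: float  # 预测的认知不确定性, 越小越可靠

def decide(diag: Diagnosis, conf_min: float = 0.8, unc_max: float = 0.2) -> str:
    gate_conf = diag.causal_confidence >= conf_min
    gate_unc  = diag.epistemic_uncertainty <= unc_max
    if gate_conf and gate_unc:
        return f"remediate({diag.root_cause})"     # 本地执行修复
    return "escalate_to_fog(payload=diagnosis)"    # 放弃本地干预, 上报雾层

print(decide(Diagnosis("memory_leak", 0.92, 0.08)))   # 两道门均通过 -> 本地修复
print(decide(Diagnosis("packet_loss", 0.91, 0.35)))   # 不确定性过高 -> 上报雾层
```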

[AI-26] GESR: A Genetic Programming-Based Symbolic Regression Method with Gene Editing

【速读】:该论文旨在解决符号回归(Symbolic Regression)问题,即从科学数据中自动发现描述自然现象的数学公式。传统方法如基于遗传算法的遗传编程(Genetic Programming, GP)虽然有效,但其基因突变和交叉操作完全随机,导致进化效率低下。解决方案的关键在于引入“基因编辑”思想,设计了一种名为GESR的方法:通过训练两个BERT模型作为“上帝之手”,分别指导基因突变(利用掩码语言建模预测表达式符号)和基因交叉(预测最优交叉点),从而实现更精准、高效的符号表达式演化。实验表明,GESR在计算效率和整体性能上均显著优于传统GP算法。

链接: https://arxiv.org/abs/2605.10685
作者: Yanjie Li,Liping Zhang,Min Wu,Weijun Li,Lina Yu,Jingyi Liu,Yusong Deng,Mingzhu Wan,Xin Ning
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 70 pages

点击查看摘要

Abstract:Mathematical formulas serve as a language through which humans communicate with nature. Discovering mathematical laws from scientific data to describe natural phenomena has been a long-standing pursuit of humanity for centuries. In the field of artificial intelligence, this challenge is known as the symbolic regression problem. Among existing symbolic regression approaches, Genetic Programming (GP) based on evolutionary algorithms remains one of the most classical and widely adopted methods. GP simulates the evolutionary process across generations through genetic mutation and crossover. However, mutations and crossovers in GP are entirely random. While this randomness effectively mimics natural evolution, it inevitably produces both beneficial and detrimental variations. If there existed a metaphorical God capable of foreseeing which genetic mutations or crossovers would yield superior outcomes and performing targeted gene editing accordingly, the efficiency of evolution could be substantially improved. Motivated by this idea, we propose in this paper a symbolic regression approach based on gene editing, termed GESR. In GESR, we trained two “hands of God” (two BERT models). Among them, the first leverages BERT’s masked language modeling capability to guide the mutation of genes (expression symbols). The other BERT model guides the crossover of individual genes by predicting the crossover point. Experimental results demonstrate that GESR significantly improves computational efficiency compared with traditional GP algorithms and achieves strong overall performance across multiple symbolic regression tasks.
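
为直观说明“用掩码语言模型指导基因(表达式符号)突变”的思路,下面给出一个基于 HuggingFace fill-mask 管线的示意草图(非 GESR 原文实现;这里直接借用通用的 bert-base-uncased,而论文中应为在表达式语料上训练的专用模型,表达式的序列化方式与合法符号集也均为假设):

```python
# 用掩码语言模型为表达式符号的"突变"打分(示意草图, 非 GESR 原文实现)
from transformers import pipeline

# 假设: 直接借用通用 BERT; 原文应为在表达式语料上训练的专用模型
fill = pipeline("fill-mask", model="bert-base-uncased")

# 假设把表达式前序遍历的符号序列当作一句"话", 遮蔽待突变的位置
masked_expr = "add mul x x [MASK] x"
allowed = {"sin", "cos", "exp", "log", "x", "add", "mul"}   # 假设的合法符号集

for cand in fill(masked_expr, top_k=5):
    symbol = cand["token_str"].strip()
    tag = "合法符号" if symbol in allowed else "越界丢弃"
    print(f"[{tag}] 候选: {symbol:<8s} 分数: {cand['score']:.3f}")
```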

[AI-27] Is Data Shapley Not Better than Random in Data Selection? Ask NASH ICML-26

【速读】:该论文旨在解决数据选择(Data Selection)中基于Shapley值或半值(semivalues)的方法在实际应用中表现不稳定的问题,即某些情况下其选取的数据子集效果并不优于随机选择。为应对这一挑战,作者提出了一种名为NASH(Non-linear Aggregation of SHapley-informative components)的新框架,其核心创新在于:首先将目标效用函数(如验证准确率)分解为若干个更简单的、具有Shapley信息性的组件函数;随后通过非线性聚合这些组件来构建新的优化目标,从而更有效地识别高质量训练数据子集。该方法在保持极低额外计算开销的前提下显著提升了Shapley/半值类数据选择策略的有效性。

链接: https://arxiv.org/abs/2605.10684
作者: Xiao Tian,Jue Fan,Rachael Hwee Ling Sim,Zixuan Wang,Nancy F. Chen,Bryan Kian Hsiang Low
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to the 43rd International Conference on Machine Learning (ICML-26) as a Spotlight paper

点击查看摘要

Abstract:Data selection studies the problem of identifying high-quality subsets of training data. While some existing works have considered selecting the subset of data with top-m Data Shapley or other semivalues as they account for the interaction among every subset of data, other works argue that Data Shapley can sometimes perform ineffectively in practice and select subsets that are no better than random. This raises the questions: (I) Are there certain “Shapley-informative” settings where Data Shapley consistently works well? (II) Can we strategically utilize these settings to select high-quality subsets consistently and efficiently? In this paper, we propose a novel data selection framework, NASH (Non-linear Aggregation of SHapley-informative components), which (I) decomposes the target utility function (e.g., validation accuracy) into simpler, Shapley-informative component functions, and selects data by optimizing an objective that (II) aggregates these components non-linearly. We demonstrate that NASH substantially boosts the effectiveness of Shapley/semivalue-based data selection with minimal additional runtime cost.
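
下面给出“组件分解 + 非线性聚合”思路的一个示意草图(非 NASH 原文实现;贡献矩阵为随机伪造数据,聚合函数与贪心选择策略均为假设):假设已对每个训练点估计出它对若干组件效用(例如逐类验证准确率)的半值贡献,则可以贪心地最大化一个偏好“各组件均衡提升”的凹聚合目标来选子集。

```python
# 示意草图: 按非线性聚合目标贪心选数据(贡献矩阵与聚合函数均为假设)
import numpy as np

rng = np.random.default_rng(0)
n_points, n_components = 200, 4
contrib = rng.normal(0.0, 1.0, size=(n_points, n_components))  # 假设已估计好的贡献矩阵

def aggregate(component_totals: np.ndarray) -> float:
    # 非线性聚合: 各组件先过 softplus 再取对数求和, 惩罚"偏科"的子集
    softplus = np.log1p(np.exp(component_totals))
    return float(np.sum(np.log(softplus + 1e-8)))

def greedy_select(contrib: np.ndarray, budget: int) -> list:
    selected, totals = [], np.zeros(contrib.shape[1])
    remaining = set(range(contrib.shape[0]))
    for _ in range(budget):
        best_i, best_gain = None, -np.inf
        for i in remaining:
            gain = aggregate(totals + contrib[i]) - aggregate(totals)
            if gain > best_gain:
                best_i, best_gain = i, gain
        selected.append(best_i)
        totals += contrib[best_i]
        remaining.remove(best_i)
    return selected

subset = greedy_select(contrib, budget=20)
print("选出的前 5 个数据点索引:", subset[:5])
```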

[AI-28] Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在部署阶段因静态特性而难以适应新任务的问题,提出通过可复用经验的提取与利用实现自进化能力。其核心挑战在于如何协同优化经验提取(experience extraction)与经验利用(experience utilization)两个环节,而非孤立设计系统模块或仅优化其中一环。解决方案的关键是提出Evolving-RL算法框架,以经验评估生成的双重监督信号分别优化提取器和求解器,并推动二者协同演化(co-evolution),从而显著提升LLMs在分布外任务上的泛化性能(如ALFWorld未见任务相对GRPO基线提升98.7%,Mind2Web提升35.8%)。该方法本质上是一种经验增强型强化学习(experience-augmented RL)机制,能够将可复用的经验模式内化至模型参数中,无需测试时积累经验即可获得显著性能提升。

链接: https://arxiv.org/abs/2605.10663
作者: Zhiyuan Fan,Wenwei Jin,Feng Zhang,Bin Li,Yihong Dong,Yao Hu,Jiawei Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 17 pages, 5 figures

点击查看摘要

Abstract:Experience-driven self-evolving agents aim to overcome the static nature of large language models by distilling reusable experience from past interactions, thus enabling adaptation to novel tasks at deployment time. This process places substantial demands on the foundation model’s capacities for abstraction, generalization, and in-context learning. However, most existing studies focus primarily on system-level design choices, such as how experience is represented and managed, neglecting the inherent capabilities of the underlying model. While some recent works have started to optimize the experience utilization stage via reinforcement learning, they still fail to treat self-evolution as a unified process to be jointly optimized. To this end, we propose Evolving-RL, an efficient algorithmic framework that jointly improves the experience extraction and utilization capabilities required for self-evolution. Specifically, we center the learning process on experience extraction and evaluation, using the two supervisory signals derived from evaluation to optimize the extractor and solver separately and thus enable their coordinated co-evolution. Experiments on ALFWorld and Mind2Web show that Evolving-RL effectively enhances LLMs’ ability to extract and reuse experience, leading to strong performance gains on out-of-distribution tasks (up to 98.7% relative improvement over the GRPO baseline on ALFWorld unseen tasks and 35.8% on Mind2Web), and these gains are fully unlocked only through the coordinated co-evolution of experience extraction and utilization. Furthermore, Evolving-RL inherently functions as an experience-augmented RL algorithm. By internalizing reusable experience patterns directly into model parameters, it achieves remarkable performance gains over standard baselines on both seen and unseen tasks, even in the absence of test-time experience accumulation.

[AI-29] Active Learning for Gaussian Process Regression Under Self-Induced Boltzmann Weights

【速读】:该论文旨在解决自诱导分布(self-induced distribution)下的主动学习问题,其中目标函数的预测误差需在由函数自身诱导的未知Boltzmann分布下最小化。此类问题常见于计算化学中的势能面(Potential Energy Surface, PES)建模等场景,其挑战在于目标分布未知且分区函数(partition function)不可计算。解决方案的关键在于提出一种基于高斯过程(Gaussian Process)的采集函数 \texttt{AB-SID-iVAR},该方法通过闭式近似不可计算的贝叶斯目标分布,无需估计分区函数,并适用于离散与连续输入空间;同时分析了基于Thompson采样的变体 \texttt{TS-SID-iVAR} 作为高方差蒙特卡洛替代方案。理论证明表明,在温和条件下,终端预测误差以高概率收敛至零,并提供更紧的平均情形保证。

链接: https://arxiv.org/abs/2605.10654
作者: Jixiang Qing,Henry Moss,Matthias Sachs
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We consider the active learning problem where the goal is to learn an unknown function with low prediction error under an unknown Boltzmann distribution induced by the function itself. This self-induced weighting arises naturally in problems such as potential energy surface (PES) modeling in computational chemistry, yet poses unique challenges as the target distribution is unknown and its partition function is intractable. We propose AB-SID-iVAR, a Gaussian Process-based acquisition function that approximates the intractable Bayesian target distribution in closed form while avoiding partition function estimation, and is applicable to both discrete and continuous input domains. We also analyze a Thompson sampling alternative (TS-SID-iVAR) as a higher variance Monte Carlo variant. Despite the unknown target, under mild conditions, we establish that the terminal prediction error vanishes with high probability, and provide a tighter average-case guarantee. We demonstrate consistent improvements over existing approaches in this setting on synthetic benchmarks and real-world PES modeling and drug discovery tasks.
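
下面用 scikit-learn 的高斯过程回归给出“自诱导 Boltzmann 加权主动学习”思路的一维示意草图(非 AB-SID-iVAR 原文实现;目标函数、温度 T 与采集函数的具体形式均为假设):用 GP 后验均值近似未知函数来构造近似的 Boltzmann 权重,再与后验方差相乘作为采集函数。

```python
# 自诱导 Boltzmann 加权下的 GP 主动学习示意草图(非原文实现)
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def f(x):                      # 假想的一维"势能面"
    return np.sin(3 * x) + 0.5 * x ** 2

rng = np.random.default_rng(0)
X_pool = np.linspace(-2, 2, 400).reshape(-1, 1)      # 候选池
X_train = rng.uniform(-2, 2, size=(5, 1))
y_train = f(X_train).ravel()

T = 0.5                                               # 假设的温度超参数
for step in range(10):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-6)
    gp.fit(X_train, y_train)
    mu, std = gp.predict(X_pool, return_std=True)
    weights = np.exp(-(mu - mu.min()) / T)            # 近似 Boltzmann 权重(减 min 防溢出)
    acq = weights * std ** 2                           # 加权后验方差作为采集函数
    x_next = X_pool[np.argmax(acq)]
    X_train = np.vstack([X_train, x_next.reshape(1, 1)])
    y_train = np.append(y_train, f(x_next).item())

print("主动学习结束后训练点数:", len(X_train))
```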

[AI-30] A Recursive Decomposition Framework for Causal Structure Learning in the Presence of Latent Variables

【速读】:该论文旨在解决约束基于的因果发现方法在高维场景下因依赖条件独立性(Conditional Independence, CI)测试而导致计算成本过高的问题,尤其是在存在潜变量(Latent Variables)的情况下,现有分而治之(Divide-and-Conquer)框架通常假设因果充分性(Causal Sufficiency),限制了其适用范围。解决方案的关键在于提出一种递归分解框架DiCoLa,该框架能够理论化地推广至含潜变量的设置中:通过递归地将全局学习任务分解为多个子问题,并借助一个原理性的重构步骤整合子问题解,从而恢复全局因果结构;该方法在理论上保证了正确性(Soundness)与完备性(Completeness),实验证明其显著提升了多种因果发现算法的计算效率,同时在真实数据上验证了实用性。

链接: https://arxiv.org/abs/2605.10651
作者: Zheng Li,Feng Xie,Shenglan Nie,Xichen Guo,Ruxin Wang,Hao Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Constraint-based causal discovery is widely used for learning causal structures, but heavy reliance on conditional independence (CI) testing makes it computationally expensive in high-dimensional settings. To mitigate this limitation, many divide-and-conquer frameworks have been proposed, but most assume causal sufficiency, i.e., no latent variables. In this paper, we show that divide-and-conquer strategies can be theoretically generalized beyond causal sufficiency to settings with latent variables. Specifically, we propose a recursive decomposition framework, termed DiCoLa, that enables divide-and-conquer causal discovery in the presence of latent variables. It recursively decomposes the global learning task into smaller subproblems and integrates their solutions through a principled reconstruction step to recover the global structure. We theoretically establish the soundness and completeness of the proposed framework. Extensive experiments on synthetic data demonstrate that our approach significantly improves computational efficiency across a range of causal discovery algorithms, while experiments on a real-world dataset further illustrate its practical effectiveness.

[AI-31] diffGHOST: Diffusion based Generative Hedged Oblivious Synthetic Trajectories

【速读】:该论文旨在解决移动轨迹数据在隐私保护与可用性之间的矛盾问题,即如何在不泄露个体敏感信息的前提下,合成高质量的移动轨迹以支持各类应用。现有生成模型常基于“生成模型隐含隐私”的错误假设,无法提供可证明的隐私保障。其解决方案的关键在于提出diffGHOST——一种基于潜在空间分段的条件扩散模型,通过识别并缓解关键样本的记忆化现象,利用潜在空间中的条件片段来增强合成轨迹的隐私安全性与实用性。

链接: https://arxiv.org/abs/2605.10647
作者: Florent Guépin,Cheick Tidiani Cisse,Denis Renaud,François Bidet,Arnaud Legendre
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Trajectories are nowadays valuable information for a wide range of applications. However, they are also inherently sensitive, as they contain highly personal information about individuals. Facing this challenge, synthesizing mobility trajectories has emerged as a promising solution to leverage mobility information while preserving privacy. State-of-the-art models often rely on the false assumption that generative models are implicitly private, and fail to provide privacy guarantees while preserving trajectory utility. Here, we introduce diffGHOST, a conditional diffusion model based on latent space segmentation, designed to answer this challenge. Thus, this paper proposes a methodology that identifies and mitigates memorization of critical samples using conditioning segments of a learned latent space.

[AI-32] Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)安全评估中存在系统性偏差的问题,尤其是在毒性基准测试(toxicity benchmarking)的部署实践中,由于未充分考虑模型选择、评估指标和任务类型等因素所导致的评估结果不稳定。其解决方案的关键在于揭示现有基准测试框架在不同设置下的行为差异,例如将任务从文本补全调整为摘要生成会显著增加误判有害内容的概率,并发现部分基准在输入数据领域变化时缺乏一致性表现,同时识别出模型特定的不稳定性。这表明亟需构建更加鲁棒和全面的安全评估框架以支撑LLM在实际应用中的可靠部署。

链接: https://arxiv.org/abs/2605.10639
作者: Regina Gugg,Selina Niederländer,Andreas Stöckl,Martin Flechl
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages, 4 figures

点击查看摘要

Abstract:The rapid adoption of LLMs in both research and industry highlights the challenges of deploying them safely and reveals a gap in the systematic evaluation of toxicity benchmarks. As organizations increasingly rely on these benchmarks to certify models for customer-facing applications and automated moderation, unrecognized evaluation biases could lead to the deployment of vulnerable or unsafe systems. This work investigates the robustness of established benchmarking setups and examines how to measure currently neglected intrinsic biases, such as those related to model choice, metrics, and task types. Our experiments uncover significant discrepancies in benchmark behaviors when evaluation setups are altered. Specifically, shifting the task from text completion to summarization increases the tendency of benchmarks to flag content as harmful. Additionally, certain benchmarks fail to maintain consistent behavior when the input data domain is changed. Furthermore, we observe model-specific instabilities, demonstrating a clear need for more robust and comprehensive safety evaluation frameworks.

[AI-33] Teacher-Aware Evolution of Heuristic Programs from Learned Optimization Policies

【速读】:该论文旨在解决基于大语言模型(Large Language Model, LLM)的自动启发式设计方法在组合优化中依赖延迟终点性能评估而导致搜索效率低的问题。现有方法仅以最终任务表现作为反馈信号,缺乏对中间行为质量的指导,限制了启发式程序的可解释性和演化效率。解决方案的关键在于提出一种教师感知的进化框架(teacher-aware evolutionary framework),利用独立训练的优化策略(learned optimization policies)作为“行为教师”,通过查询教师在候选启发式程序访问状态下的动作偏好,提供局部行为反馈用于引导进化过程。这种方法在不引入神经网络推理开销的前提下,结合任务性能与教师驱动的行为信号,显著提升了静态可执行启发式的设计质量。

链接: https://arxiv.org/abs/2605.10634
作者: Minyu Chen,Song Qin,Ling-I Wu,Jianxin Xue,Guoqiang Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages

点击查看摘要

Abstract:LLM-based automatic heuristic design has shown promise for generating executable heuristics for combinatorial optimization, but existing methods mainly rely on delayed endpoint performance. We propose a teacher-aware evolutionary framework that uses independently trained learned optimization policies as behavioral teachers. Instead of deploying or imitating the teacher, our method queries it on states visited by candidate heuristic programs and uses its action preferences as local feedback for evolution. The resulting search discovers static executable heuristics guided by both task performance and teacher-derived behavioral signals. Experiments on scheduling, routing, and graph optimization benchmarks show that our method improves over performance-driven LLM heuristic evolution baselines while requiring no neural inference at deployment. These results suggest that learned optimization policies can be repurposed as behavioral feedback sources for automatic heuristic discovery.
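
下面是“教师感知适应度”的一个玩具级示意草图(非原文实现;rollout 过程、教师策略与加权系数均为假设):在候选启发式实际访问过的状态上查询教师策略,用动作一致率作为行为反馈,与终点任务性能加权合成进化适应度。

```python
# 教师感知适应度的玩具示意(rollout、教师策略、加权系数均为假设)
from typing import Callable, List, Tuple

State, Action = dict, int

def teacher_aware_fitness(
    heuristic: Callable[[State], Action],
    teacher: Callable[[State], Action],          # 独立训练的学习型优化策略
    rollout: Callable[[Callable], Tuple[float, List[State]]],
    lam: float = 0.3,
) -> float:
    # rollout 返回 (终点任务性能, 候选启发式实际访问过的状态序列)
    performance, visited_states = rollout(heuristic)
    if not visited_states:
        return performance
    agree = sum(heuristic(s) == teacher(s) for s in visited_states)
    agreement_rate = agree / len(visited_states)
    return (1 - lam) * performance + lam * agreement_rate

# 玩具示例: 状态为 {"queue": [...]}, 动作为选择先执行哪个作业
toy_rollout = lambda h: (0.72, [{"queue": [3, 1, 2]}, {"queue": [2, 5]}])
shortest_first = lambda s: int(min(range(len(s["queue"])), key=lambda i: s["queue"][i]))
teacher_policy = lambda s: 0 if len(s["queue"]) > 2 else shortest_first(s)

print(round(teacher_aware_fitness(shortest_first, teacher_policy, toy_rollout), 3))
```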

[AI-34] Hierarchical Causal Abduction: A Foundation Framework for Explainable Model Predictive Control

【速读】:该论文旨在解决非线性模型预测控制(Nonlinear Model Predictive Control, NMPC)在实际部署中因控制决策过程不透明而导致的人类操作员信任缺失问题。其核心挑战在于,NMPC依赖于复杂的非线性动力学、严格的约束条件及数值优化算法,使得单个控制动作难以被人类理解。解决方案的关键是提出分层因果反演(Hierarchical Causal Abduction, HCA),该方法融合三类证据:(i) 基于领域知识图谱的物理信息推理,(ii) 来自Karush–Kuhn–Tucker (KKT)乘子的优化证据,以及(iii) 通过PCMCI算法实现的时间因果发现,从而生成忠实且可解释的控制行为说明。实验表明,HCA在三个不同工业场景下显著优于LIME等基线方法,并展现出良好的泛化能力与模块冗余敏感性。

链接: https://arxiv.org/abs/2605.10624
作者: Ramesh Arvind Naagarajan,Zühal Wagner,Stefan Streif
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Model Predictive Control (MPC) is widely used to operate safety-critical infrastructure by predicting future trajectories and optimizing control actions. However, nonlinear dynamics, hard safety constraints, and numerical optimization often render individual control moves opaque to human operators, undermining trust and hindering deployment. This paper presents Hierarchical Causal Abduction (HCA), which combines (i) physics-informed reasoning via domain knowledge graphs, (ii) optimization evidence from Karush–Kuhn–Tucker (KKT) multipliers, and (iii) temporal causal discovery via the PCMCI algorithm to generate faithful, human-interpretable explanations for control actions computed by nonlinear MPC. Across three diverse control applications (greenhouse climate, building HVAC, chemical process engineering) with expert validation, HCA improves explanation accuracy by 53% over LIME (0.478 vs. 0.311) using a single set of cross-domain parameters without per-domain tuning; domain-specific KKT-threshold calibration over 2–3 days further increases accuracy to 0.88. Ablation studies confirm that each evidence source is essential, with 32–37% accuracy degradation when any component is removed, and HCA’s ranking-and-validation methodology generalizes beyond MPC to other prediction-based decision systems, including learning-based control and trajectory planning.

[AI-35] PRISM: Generation-Time Detection and Mitigation of Secret Leakage in Multi-Agent LLM Pipelines

【速读】:该论文旨在解决多智能体大语言模型(Multi-agent LLM systems)中因共享上下文导致的敏感信息传播放大问题(propagation amplification),即一个代理访问的敏感信息可能在无明确恶意意图的情况下,通过生成过程扩散至下游输出。现有防御方法如基于提示的安全措施、静态模式匹配和LLM作为裁判的过滤机制,无法有效应对该场景,因其或在生成后干预、依赖表面特征,或引入显著延迟且未建模生成动态。论文提出的PRISM解决方案的关键在于将凭证泄露视为生成过程中的序列风险累积问题,在每个解码步骤中融合16种信号(涵盖词汇、结构、信息论、行为和上下文特征)计算校准的风险分数,并通过绿色、黄色、红色风险区实现逐token级干预。其核心洞察是:凭证重现之前通常会出现可测量的生成动态变化(measurable shifts in generation dynamics),如熵塌缩与logit集中度上升;结合文本结构线索(如标识符模式检测),可在秘密完全重建前提供早期预警,从而在保持高输出实用性的同时实现零泄漏(0.0%任务级泄漏率)。

链接: https://arxiv.org/abs/2605.10614
作者: Riya Tapwal,Abhishek Kumar,Carsten Maple
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-agent LLM systems introduce a security risk in which sensitive information accessed by one agent can propagate through shared context and reappear in downstream outputs, even without explicit adversarial intent. We formalise this phenomenon as propagation amplification, where leakage risk increases across agent boundaries as sensitive content is repeatedly exposed to downstream generators. Existing defences, including prompt-based safeguards, static pattern matching, and LLM-as-judge filtering, are not designed for this setting: they either operate after generation, rely primarily on surface-form patterns, or add substantial latency without modelling the generation process itself. To resolve these issues, we propose PRISM, a real-time defence that treats credential leakage as a sequential risk accumulation problem during generation. At each decoding step, PRISM combines 16 signals spanning lexical, structural, information-theoretic, behavioural, and contextual features into a calibrated risk score, enabling per-token intervention through green, yellow, and red risk zones. Our central observation is that credential reproduction is often preceded by a measurable shift in generation dynamics, characterised by entropy collapse and increasing logit concentration. When combined with text-structural cues such as identifier-pattern detection, these temporal signals provide an early warning of leakage before a secret is fully reconstructed. Across a 2,000-task adversarial benchmark covering 13 attack categories and three pressure levels in a heterogeneous four-agent pipeline, PRISM achieves F1 = 0.832 with precision = 1.000 and recall = 0.712, while producing no observed leakage on our benchmark (0.0% task-level leak rate) and preserving output utility of 0.893. It substantially outperforms the strongest baseline, Span Tagger, which achieves F1 = 0.719 with a 15.0% task-level leak rate.
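
下面给出“逐 token 泄漏风险打分 + 风险分区”思路的一个示意草图(非 PRISM 原文实现;原文融合 16 种信号并做校准,这里仅示意熵塌缩、logit 集中度与标识符模式三个信号,凭证正则与阈值均为假设):

```python
# 生成时逐 token 泄漏风险打分的示意草图(信号、权重与阈值均为假设)
import math
import re
import numpy as np

SECRET_PATTERN = re.compile(r"(sk-[A-Za-z0-9]{8,}|AKIA[A-Z0-9]{8,})")  # 假设的凭证格式

def step_risk(logits: np.ndarray, generated_text: str) -> float:
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    entropy_collapse = 1.0 - entropy / math.log(len(probs))   # 越接近 1, 分布越尖锐
    logit_concentration = float(probs.max())                   # top-1 概率
    pattern_hit = 1.0 if SECRET_PATTERN.search(generated_text) else 0.0
    # 简单线性融合(原文为 16 个信号的校准打分, 此处仅示意三个)
    return 0.4 * entropy_collapse + 0.3 * logit_concentration + 0.3 * pattern_hit

def zone(risk: float) -> str:
    if risk < 0.5:
        return "green"    # 正常解码
    if risk < 0.75:
        return "yellow"   # 降温采样 / 屏蔽高风险 token
    return "red"          # 截断生成并脱敏

logits = np.array([8.0, 0.5, 0.2, 0.1])                        # 假设的某步解码 logits
text_so_far = "The service key is sk-ABCdef12345"
r = step_risk(logits, text_so_far)
print(zone(r), round(r, 3))
```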

[AI-36] Re-Triggering Safeguards within LLMs for Jailbreak Detection

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)面临的越狱攻击(jailbreak attacks)问题,即攻击者通过精心设计的提示词绕过模型内置的安全防护机制。解决方案的关键在于提出一种嵌入扰动(embedding disruption)方法,通过干扰输入提示的嵌入表示来重新激活LLM内部的防御机制,而非依赖独立的防御模块。该方法利用越狱提示固有的脆弱性,在不改变模型结构的前提下实现对越狱攻击的有效检测与防御,并在白盒和黑盒场景下均表现出强鲁棒性,尤其对自适应攻击也具有良好的防御效果。

链接: https://arxiv.org/abs/2605.10611
作者: Zheng Lin,Zhenxing Niu,Haoxuan Ji,Yuzhe Huang,Haichang Gao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper proposes a jailbreaking prompt detection method for large language models (LLMs) to defend against jailbreak attacks. Although recent LLMs are equipped with built-in safeguards, it remains possible to craft jailbreaking prompts that bypass them. We argue that such jailbreaking prompts are inherently fragile, and thus introduce an embedding disruption method to re-activate the safeguards within LLMs. Unlike previous defense methods that aim to serve as standalone solutions, our approach instead cooperates with the LLM’s internal defense mechanisms by re-triggering them. Moreover, through extensive analysis, we gain a comprehensive understanding of the disruption effects and develop an efficient search algorithm to identify appropriate disruptions for effective jailbreak detection. Extensive experiments demonstrate that our approach effectively defends against state-of-the-art jailbreak attacks in white-box and black-box settings, and remains robust even against adaptive attacks.

[AI-37] Fairness vs Performance: Characterizing the Pareto Frontier of Algorithmic Decision Systems

【速读】:该论文旨在解决算法决策系统中公平性与性能之间权衡关系不明确的问题,即如何在保障群体公平性的同时最大化决策者的效用。其核心贡献在于将决策过程建模为多目标优化问题,同时优化决策者效用和群体公平性指标,并证明了帕累托最优决策规则必然是对个体成功概率应用分组特定的确定性阈值规则(deterministic, group-specific threshold rules)。关键发现是:帕累托前沿仅依赖于人群特征、效用函数和公平性评分,而与算法的技术实现方式(预处理、处理中或后处理)无关,这为公平性约束下的决策系统设计提供了理论基础,并扩展了现有公平性最优性定理至更广泛的公平度量和部分公平场景。

链接: https://arxiv.org/abs/2605.10604
作者: Mieke Wilms,Christoph Heitz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 23 pages, The 2026 ACM conference on Fairness, Accountability, and Transparency (FAccT’26)

点击查看摘要

Abstract:Designing fair algorithmic decision systems requires balancing model performance with fairness toward affected individuals: more fairness might require sacrificing some performance and vice versa, yet the space of possible trade-offs is still poorly understood. We investigate fairness in binary prediction-based decision problems by conceptualizing decision making as a multi-objective optimization problem that simultaneously considers decision-maker utility and group fairness. We investigate the set of Pareto-optimal decision rules for arbitrary decision-maker utility functions, arbitrary population distributions, and a wide range of group fairness metrics. We find that the Pareto frontier consists of deterministic, group-specific threshold rules applied to individuals’ success probability. This complements existing optimality theorems from the literature which, for specific fairness constraints, posit lower-bound threshold rules only. However, we also show that, depending on the fairness metric used, the Pareto frontier may include upper-bound threshold rules, thus preferring individuals with lower success probabilities. We show that the location of the Pareto frontier depends only on population characteristics, utility functions and fairness score, but not on the technical design of the algorithm; our findings hold for pre-, in-, and post-processing approaches alike. Our results generalize existing optimality theorems for fairness-constrained classification and extend them to generalized fairness metrics and fairness principles, and to partial fairness regimes. This paper connects formal fairness research with legal and ethical requirements to search for less discriminatory alternatives, offering a principled foundation for evaluating and comparing algorithmic decision systems.
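
下面用一个小型数值实验示意“对两组人群分别施加成功概率阈值,并在决策者效用与不公平度两个目标上取帕累托前沿”的计算过程(示意草图;两组的成功概率分布、效用函数与公平性度量均为本示例的假设,并不覆盖论文结论的全部一般性):

```python
# 组别阈值规则的帕累托前沿数值示意(分布、效用与公平性度量均为假设)
import numpy as np

rng = np.random.default_rng(0)
p_a = rng.beta(4, 3, 5000)        # 群体 A 的个体成功概率(假设分布)
p_b = rng.beta(3, 4, 5000)        # 群体 B 的个体成功概率(假设分布)

def evaluate(t_a, t_b):
    sel_a, sel_b = p_a >= t_a, p_b >= t_b
    # 决策者效用: 被选中者成功记 +1、失败记 -1, 取全体平均(假设的效用函数)
    utility = ((2 * p_a[sel_a] - 1).sum() + (2 * p_b[sel_b] - 1).sum()) / (len(p_a) + len(p_b))
    unfairness = abs(sel_a.mean() - sel_b.mean())   # 两组选中率之差(统计均等差距)
    return utility, unfairness

grid = np.linspace(0.0, 1.0, 21)
points = [(*evaluate(ta, tb), ta, tb) for ta in grid for tb in grid]

# 帕累托最优: 不存在另一组阈值在效用不更低的同时不公平度更低(且至少一项严格更优)
pareto = [p for p in points
          if not any(q[0] >= p[0] and q[1] <= p[1] and (q[0] > p[0] or q[1] < p[1])
                     for q in points)]
pareto.sort(key=lambda p: p[1])
for u, f, ta, tb in pareto[:5]:
    print(f"效用={u:.3f}  不公平度={f:.3f}  阈值=(A:{ta:.2f}, B:{tb:.2f})")
```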

[AI-38] The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime

【速读】:该论文试图解决的问题是:当前在医疗、信贷、就业和刑事司法等敏感领域部署人工智能(AI)时,常因模型内部机制无法解释而被认定为不安全,导致过度依赖机制可解释性(mechanistic interpretability)来应对本不属于其职责范围的部署授权问题。解决方案的关键在于将授权门槛从“模型整体可解释”转向“校准验证”(calibrated verification),即建立一种以具体应用场景为边界、具备独立可验证性、发布后持续监控、责任可追溯、可申诉和可撤销的授权机制;其核心依据是模型能力在邻近任务间分布不均,且社会长期通过资质认证、监督、问责、上诉与撤销等非机制解释手段管理复杂技术,而非强制要求理解底层逻辑。论文进一步提出“验证覆盖率”(Verification Coverage)作为六要素组成的报告标准,建议将其与模型能力评分一同纳入模型卡片、排行榜及监管披露中,以实现更科学、务实的AI治理。

链接: https://arxiv.org/abs/2605.10601
作者: Phongsakon Mark Konrad,Tim Lukas Adam,Ane Cathrine Holst Merrild,Riccardo Terrenzi,Rebecca De Rosa,Toygar Tanyel,Serkan Ayvaz
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI deployment in sensitive domains such as health care, credit, employment, and criminal justice is often treated as unsafe to authorize until model internals can be explained. This often leads to an excessive reliance on mechanistic interpretability to address a deployment challenge beyond its intended scope. We argue that the gate should instead be calibrated verification: authorization should be domain-scoped, independently checkable, monitored after release, accountable, contestable, and revocable. The reason is twofold. First, model capability is uneven across nearby tasks, so authorization must attach to a specific use rather than to a model in general. Second, societies have long governed opaque expertise through credentials, monitoring, liability, appeal, and revocation rather than mechanism-level explanation. Recent evidence reinforces this distinction between mechanistic understanding and deployment authority: a 53-percentage-point gap between internal representations and output correction shows that understanding may not translate into action, while one scoping review found that only 9.0% of FDA-approved AI/ML device documents contained a prospective post-market surveillance study. We propose Verification Coverage, a six-component reportable standard with a minimum-composition rule, as the metric that should sit beside capability scores in model cards, leaderboards, and regulatory disclosures.

[AI-39] Budget-Efficient Automatic Algorithm Design via Code Graph

【速读】:该论文旨在解决现有自动算法设计(Automatic Algorithm Design, AAD)方法在利用大语言模型(Large Language Models, LLMs)时效率低下的问题,具体表现为:搜索粒度局限于完整算法,导致重复重构通用子结构,并过早丢弃可能包含有价值算法特征的低适应度候选解。其解决方案的关键在于提出一种基于有向无环图(Directed Acyclic Graph, DAG)的算法表示形式,并构建以“修正操作”为核心的搜索框架——即不再直接请求LLM生成完整算法,而是通过查询LLM获取紧凑的代码块增删改操作(corrections),这些操作逐次作用于图结构上,从而生成新算法并实现修正级别的信用分配(correction-level credit assignment)。该机制有效提升了计算资源的利用率,在相同token预算下显著优于传统全算法搜索策略。

链接: https://arxiv.org/abs/2605.10598
作者: Maxime Bouscary,Manxi Wu,Saurabh Amin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have emerged as powerful tools for automatic algorithm design (AAD). However, existing pipelines remain inefficient. They operate at the granularity of full algorithms, redundantly rewriting recurring substructures and discarding low-fitness candidates that may contain valuable algorithmic features. We formalize budget-efficient automatic algorithm design, wherein the search policy maximizes realized fitness subject to limited computational cost. We propose a directed acyclic graph representation of algorithms and build a search framework that fully exploits the LLM’s output. Instead of querying the LLM for full algorithms, we use it to obtain corrections: compact operators that add, replace, or remove code blocks. Each correction augments the graph, yielding new algorithms that compose with prior corrections. This graph structure decomposes algorithms into sets of corrections, enabling correction-level credit assignment that informs subsequent queries. We complement this framework with theoretical insights into the ideal balance between search depth and breadth at different budget levels. We validate our method empirically on three combinatorial optimization problems, demonstrating consistent superiority of our graph-based search over full-algorithm search at equal token budget. Finally, our experiments suggest that rich contexts help only when the LLM’s prior knowledge is shallow, and can hinder performance otherwise.
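
下面给出“以修正操作(增/替/删代码块)作用于算法的块级图表示”这一搜索表示的极简示意草图(非原文实现;数据结构、字段与示例代码块均为假设),用于说明为何一条紧凑的修正即可派生新算法,并能与先前的修正组合:

```python
# "修正操作 + 代码块图"表示的极简示意(结构与字段均为假设)
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass(frozen=True)
class Correction:
    op: str                      # "add" | "replace" | "remove"
    block_id: str
    code: Optional[str] = None   # add / replace 时的新代码块
    after: Optional[str] = None  # add 时插入在哪个块之后

def apply(blocks: Dict[str, str], order: List[str], c: Correction):
    blocks, order = dict(blocks), list(order)      # 不修改原算法, 派生新算法
    if c.op == "add":
        blocks[c.block_id] = c.code
        pos = order.index(c.after) + 1 if c.after else len(order)
        order.insert(pos, c.block_id)
    elif c.op == "replace":
        blocks[c.block_id] = c.code
    elif c.op == "remove":
        blocks.pop(c.block_id, None)
        if c.block_id in order:
            order.remove(c.block_id)
    return blocks, order

# 初始算法: 两个代码块; 一条"替换局部搜索策略"的修正派生出新算法
blocks = {"init": "tour = nearest_neighbor(cities)", "improve": "tour = two_opt(tour)"}
order = ["init", "improve"]
c = Correction(op="replace", block_id="improve", code="tour = or_opt(tour)")
new_blocks, new_order = apply(blocks, order, c)
print([new_blocks[b] for b in new_order])
```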

[AI-40] CrackMeBench: Binary Reverse Engineering for Agents

【速读】:该论文旨在解决当前针对生成式 AI (Generative AI) 在二进制逆向工程(binary reverse engineering)任务中评估标准不明确、缺乏统一基准的问题。现有基准多聚焦于源码修复或网络安全攻防(capture-the-flag),而对仅基于可执行文件恢复验证逻辑并生成合法输入或密钥的确定性问题缺乏系统化评测框架。解决方案的关键在于提出 CrackMeBench,一个面向教育类 CrackMe 风格逆向工程任务的标准化测试集,其核心特征包括:使用带可执行断言(executable oracles)的符号贫瘠(symbol-poor)二进制程序、显式本地工具访问权限、外部评分机制而非自由文本解释,并通过 Docker 容器沙箱限制网络访问以保证实验可复现性。该基准包含公开校准任务与生成的任务组合,支持量化指标如 pass@1 和 pass@3、运行时间、命令轨迹、工具类别及成本估算,从而为从源码推理向自主二进制分析演进提供可重复测量的实验平台。

链接: https://arxiv.org/abs/2605.10597
作者: Isaac David,Arthur Gervais
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Benchmarks for coding agents increasingly measure source-level software repair, and cybersecurity benchmarks increasingly measure broad capture-the-flag performance. Classical binary reverse engineering remains less precisely specified: given only an executable, can an agent recover validation logic and produce an input, serial, artifact, or key generator accepted by the program? We introduce CrackMeBench, a benchmark for evaluating language-model agents on educational CrackMe-style reverse-engineering tasks. CrackMeBench focuses on deterministic binary validation problems with executable oracles, symbol-poor binaries, explicit local tool access, and externally scored submissions rather than free-form explanations. The v0 benchmark combines eight public calibration CrackMes with twelve generated main-score tasks built from seeded C, Rust, and Go templates, and agents run through an equal shell interface in a no-network Linux Docker sandbox with standard reverse-engineering tools. In a three-model evaluation with a five-minute budget and three scored submissions per task, pass@3 on the generated split is 11/12 tasks (92%) for GPT-5.5, 7/12 (58%) for Claude Opus 4.7, and 5/12 (42%) for Kimi K2. The harder generated half separates the models more sharply, with pass@3 of 5/6, 2/6, and 1/6, respectively; on the eight-task public calibration split, pass@3 is 3/8, 2/8, and 1/8. CrackMeBench records pass@1 and pass@3, scored submissions, wall-clock time, command traces, tool categories, provider-reported token usage, estimated cost, and qualitative failure labels, providing a reproducible testbed for measuring progress from source-code reasoning toward autonomous binary analysis while restricting scope to educational, purpose-built programs.
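
基准报告的 pass@1 与 pass@3 可以用常见的 pass@k 无偏估计公式计算(当 k 等于每题尝试次数 n 时即退化为“任一提交通过即记为通过”;是否与该基准的实际统计方式完全一致属于假设):

```python
# pass@k 无偏估计的最小示意: n 为每题尝试次数, c 为其中成功次数
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 例: 某 CrackMe 任务提交 3 次, 其中 1 次通过(对应本基准每题 3 次评分提交的设定)
print(round(pass_at_k(n=3, c=1, k=1), 3))   # ≈ 0.333
print(round(pass_at_k(n=3, c=1, k=3), 3))   # = 1.0
```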

[AI-41] Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)面临的越狱攻击(jailbreaking attacks)问题,此类攻击通过精心设计的输入提示诱导模型生成有害或不符合安全规范的内容。解决方案的关键在于提出一种名为“扰乱与修正平滑”(Disrupt-and-Rectify Smoothing, DR-Smoothing)的新颖防御方法,其核心机制是在传统平滑防御框架中引入两阶段提示处理策略:首先对输入提示进行扰动以破坏攻击信号,随后对其进行重构以恢复到分布内(in-distribution)形式,从而在保障安全性的同时降低模型行为的不可预测性。该方法不仅提升了对令牌级和提示级越狱攻击的防御效果,还在有害性与有用性之间实现了更优平衡,并提供了理论上的成功概率上界及扰动强度要求,适用于静态与自适应攻击场景。

链接: https://arxiv.org/abs/2605.10582
作者: Zheng Lin,Zhenxing Niu,Haoxuan Ji,Haichang Gao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper proposes a guaranteed defense method for large language models (LLMs) to safeguard against jailbreaking attacks. Drawing inspiration from the denoised-smoothing approach in the adversarial defense domain, we propose a novel smoothing-based defense method, termed Disrupt-and-Rectify Smoothing (DR-Smoothing). Specifically, we integrate a two-stage prompt processing scheme (first disrupting the input prompt, then rectifying it) into the conventional smoothing defense framework. This disrupt-and-rectify approach improves upon previous disrupt-only approaches by restoring out-of-distribution disrupted prompts to an in-distribution form, thereby reducing the risk of unpredictable LLM behavior. In addition, this two-stage scheme offers a distinct advantage in striking a balance between harmlessness and helpfulness in jailbreaking defense. Notably, we present a theoretical analysis for a generic smoothing framework, offering a tight bound for the defense success probability and the requirements on the disruption strength. Our approach can defend against both token-level and prompt-level jailbreaking attacks, under both established and adaptive attacking scenarios. Extensive experiments demonstrate that our approach surpasses current state-of-the-art defense methods in terms of both harmlessness and helpfulness.
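
下面是“先扰乱、再修正、最后多数表决”这一平滑防御流程的示意草图(非 DR-Smoothing 原文实现;disrupt、rectify 与 query_llm 均为假设的占位替身,仅用于说明流程结构):

```python
# Disrupt-and-Rectify 平滑防御流程示意(各组件均为假设的占位替身)
import random
from collections import Counter

random.seed(0)

def disrupt(prompt: str, drop_rate: float = 0.2) -> str:
    words = prompt.split()
    kept = [w for w in words if random.random() > drop_rate]   # 随机删词扰动
    return " ".join(kept) if kept else prompt

def rectify(prompt: str) -> str:
    # 假设: 此处应调用轻量改写模型, 把扰动后的提示恢复为通顺的分布内文本
    return prompt

def query_llm(prompt: str) -> str:
    # 玩具替身: 真实场景应调用目标 LLM; 这里假设内置防护会拒绝含敏感词的请求
    return "REFUSE" if "bomb" in prompt.lower() else "Sure, here is an answer."

def smoothed_decision(prompt: str, n_copies: int = 11) -> str:
    votes = Counter(
        "refuse" if query_llm(rectify(disrupt(prompt))) == "REFUSE" else "answer"
        for _ in range(n_copies)
    )
    # 多数表决: 过半副本触发内置防护即判定为越狱/有害提示
    return "blocked" if votes["refuse"] > n_copies // 2 else "allowed"

print(smoothed_decision("Please explain how to build a bomb at home"))   # 预期: blocked
print(smoothed_decision("Please explain how photosynthesis works"))      # 预期: allowed
```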

[AI-42] Acceptance Cards: A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims

【速读】:该论文旨在解决当前安全微调(safe fine-tuning)防御机制评估中存在“假阳性”问题,即所谓“间隙减少”(gap reduction)可能源于采样噪声、受试者特异性、能力损失或不可迁移的机制,而非真正有效的安全增强。为应对这一问题,论文提出“接受卡”(Acceptance Cards)作为一套系统性评估协议,其关键在于引入四项严格诊断标准:统计可靠性(statistical reliability)、新鲜语义泛化(fresh semantic generalization)、机制一致性(mechanism alignment)和跨任务迁移性(cross-task transfer),只有在所有四项均通过时才认定为“全卡通过”(full-card pass)。该方法通过结构化审计流程与可执行的证据标准,提升了对安全微调防御有效性判断的严谨性和可复现性。

链接: https://arxiv.org/abs/2605.10575
作者: Phongsakon Mark Konrad,Toygar Tanyel,Serkan Ayvaz
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Safe fine-tuning defenses are often endorsed on the basis of a held-out gap reduction, but the same reduction can come from sampling noise, subject artifacts, capability loss, or a mechanism that does not transfer. We introduce Acceptance Cards: an evaluation protocol, a documentation object, an executable audit package, and a claim-specific evidential standard for safe fine-tuning defense claims. The protocol checks statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer before treating a gap reduction as a full-card pass. Re-scored under this installed-gap protocol, SafeLoRA fails the full-card pass on Gemma-2-2B-it: under strict mechanism-class coding it fails all four diagnostics, and under a permissive shrinkage relabel it still fails three of four. This is a narrow installed-gap audit on one model family, not a global judgment of SafeLoRA’s effectiveness. In a 46-cell audit, no cell satisfies the strict conjunction. The closest family is a near miss that passes reliability and mechanism checks where the required data are available, but fails the fresh-subject threshold, lacks a strict transfer pass, and carries a measurable deployment-accuracy cost.

[AI-43] LLM Jaggedness Unlocks Scientific Creativity

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在科学创意生成能力上表现出的非均匀性(jaggedness)问题,即模型在不同任务、提示和科学子领域中展现出的能力波动与不一致性。研究表明,尽管整体性能提升显著,但模型在科学创造力方面的进步并非线性或均匀分布,这限制了其在科研场景中的稳定应用。解决方案的关键在于识别并利用这种“锯齿状”能力分布:通过引入SciAidanBench基准评估模型生成科学创意的能力,并探索推理时计算资源分配、知识池化(knowledge pooling)及头脑风暴机制(brainstorming),构建元模型集成(meta-model ensembles),从而有效整合多个模型的互补优势,使整体科学创意产出超越任一单一模型。这一策略将原本被视为局限的jaggedness转化为可被结构化利用的资源,推动LLM驱动的科学创新效率提升。

链接: https://arxiv.org/abs/2605.10574
作者: Shray Mathur,J. Anibal Boscoboinik,Esther H. R. Tsai,Kevin G. Yager
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As artificial intelligence advances, models are not improving uniformly. Instead, progress unfolds in a jagged fashion, with capabilities growing unevenly across tasks, domains, and model scales. In this work, we examine this dynamic jaggedness through the lens of scientific idea generation. We introduce SciAidanBench, a benchmark of open-ended scientific questions designed to measure the scientific creativity of large language models (LLMs). Given a scientific question, models are asked to generate as many unique and coherent ideas as possible, with the total number of valid responses serving as a proxy for creative potential. Evaluating 19 base models across 8 providers (30 total variants including reasoning versions), we find that jaggedness manifests both across models and within models. First, in a cross-task comparison between general and scientific creativity, improvements in general creativity do not translate uniformly to scientific creativity, revealing divergent capability profiles across models. Second, at the prompt level, stronger models do not improve uniformly; instead, they exhibit high variability, with bursts of creativity on some questions and limited performance on others. Third, at the domain level, individual models display uneven strengths across scientific subfields, reflecting fragmented internal capability profiles. Finally, we show that this jaggedness can be harnessed. We explore mechanisms of inference-time compute, knowledge pooling, and brainstorming to combine models effectively and construct meta-model ensembles that outperform any single model. Our results position jaggedness not as a limitation, but as a resource, a structural feature of AI progress that, when understood and leveraged, can amplify LLM-driven scientific creativity.

[AI-44] Deep Arguing

【速读】:该论文旨在解决深度学习模型在分类任务中缺乏可解释性的问题,即模型难以向人类提供清晰、可信的预测依据。其核心挑战在于:深度神经网络通常将特征提取与任务目标紧密耦合,且缺少显式的推理机制,导致决策过程“黑箱化”。解决方案的关键在于提出一种名为Deep Arguing的新型神经符号方法,该方法将深度学习与论证构建和推理相结合,通过构造一个论证图(argumentation graph),其中每个数据点支持其标签并攻击其他标签,利用可微分的论证语义进行端到端训练,从而联合学习特征表示与论证交互关系。这一机制不仅生成忠实于预测的基于案例的解释,还通过结构约束提升模型的可解释性和预测性能。

链接: https://arxiv.org/abs/2605.10569
作者: Adam Gould,Francesca Toni
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep learning has become the dominant approach for creating high capacity, scalable models across diverse data modalities. However, because these models rely on a large number of learned parameters, tightly couple feature extraction with task objectives, and often lack explicit reasoning mechanisms, it is difficult for humans to understand how they arrive at their predictions. Understanding what representations emerge and why they arise from the training data remains an open challenge. We introduce Deep Arguing, a novel neurosymbolic approach that integrates deep learning with argumentation construction and reasoning for interpretable classification with different data modalities. In our approach deep neural networks construct an argumentation structure wherein data points support their assigned label and attack different ones. Using differentiable argumentation semantics for reasoning, the model is trained end-to-end to jointly learn feature representation and argumentative interactions. This results in argumentation structures providing faithful case-based explanations for predictions. Structure constraints over the argumentation graph guide learning, improving both interpretability and predictive performance. Experiments with tabular and imaging datasets show that Deep Arguing achieves performance competitive with standard baselines whilst offering interpretable argumentative reasoning.
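
下面给出一种渐进式(可微分)论证语义的不动点迭代示意草图(并非 Deep Arguing 原文所用语义;论证图结构、基础权重与聚合方式均为假设):支持边提升论证强度,攻击边削弱强度,阻尼迭代至收敛后得到各论证的最终强度。

```python
# 渐进式论证语义的不动点迭代示意(图结构与数值均为假设)
import numpy as np

names = ["a_supports_cat", "b_supports_cat", "c_supports_dog"]
base = np.array([0.7, 0.6, 0.8])          # 基础权重(在 Deep Arguing 中可由神经网络输出)
# edges[i, j] = +1 表示论证 j 支持论证 i, -1 表示攻击, 0 表示无边
edges = np.array([[0, 1, -1],
                  [1, 0, -1],
                  [-1, -1, 0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

strength = base.copy()
for _ in range(200):                        # 阻尼不动点迭代
    energy = edges @ strength               # 聚合支持(正)与攻击(负)
    target = sigmoid(np.log(base / (1 - base)) + energy)
    strength = 0.5 * strength + 0.5 * target

for n, s in zip(names, strength):
    print(f"{n:>16s}: {s:.3f}")
```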

[AI-45] Agent-First Tool API: A Semantic Interface Paradigm for Enterprise AI Agent Systems

【速读】:该论文旨在解决当前AI代理(AI Agent)在企业生产环境中部署时,其工具接口仍依赖于面向人类的CRUD(创建、读取、更新、删除)范式所导致的架构不匹配问题。具体而言,传统API与自主代理需求之间存在五大根本性差异:对精确标识符的依赖、以渲染为导向的响应、单次交互假设、用户等效授权机制以及不透明的错误语义。解决方案的核心在于提出“代理优先的工具API”(Agent-First Tool API)范式,包含三个集成机制:(1) 六动词语义协议(Six-Verb Semantic Protocol),将工具交互分解为搜索、解析、预览、执行、验证和恢复六个阶段;(2) 标准化工具契约(Normalized Tool Contract, NTC),提供结构化的决策支持元数据,如置信度分数、证据链和建议下一步动作;(3) 双层治理管道,结合静态能力策略与动态风险升级机制。实证表明,该范式显著提升了任务成功率(88% vs. 64%)并大幅降低人工干预需求(减少72.7%),且增强了自主错误恢复能力(提升5.8倍)。

链接: https://arxiv.org/abs/2605.10555
作者: Kai Pan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As AI agents transition from research prototypes to enterprise production systems, the tool interfaces they consume remain rooted in human-oriented CRUD paradigms. This paper identifies five fundamental architectural mismatches between conventional APIs and autonomous agent requirements: exact-identifier dependence, rendering-oriented responses, single-shot interaction assumptions, user-equivalent authorization, and opaque error semantics. We propose the Agent-First Tool API paradigm, comprising three integrated mechanisms: (1) a Six-Verb Semantic Protocol that decomposes tool interactions into search, resolve, preview, execute, verify, and recover phases; (2) a Normalized Tool Contract (NTC) providing structured decision-support metadata including confidence scores, evidence chains, and suggested next actions; and (3) a dual-layer governance pipeline combining static capability policies with dynamic risk escalation. The paradigm is implemented and validated in a production multi-tenant SaaS platform serving 85 registered tools across 6 business domains. Comparative experiments on 50 real operational tasks demonstrate that Agent-First APIs achieve 88% end-to-end task success rate versus 64% for optimized CRUD baselines (+37.5%), while reducing required human interventions by 72.7% and improving autonomous error recovery by 5.8x. We establish that the paradigm is orthogonal and complementary to transport-layer standards such as MCP, operating as the semantic application layer above existing tool discovery and invocation protocols.
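
下面用一个极简的 Python 数据结构示意摘要中"六动词协议 + 规范化工具契约(NTC)"可能的形态。字段名与取值均为说明而虚构,并非论文或任何现有库的真实 API。

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List

class Verb(Enum):
    SEARCH = "search"    # 模糊查找候选资源
    RESOLVE = "resolve"  # 将自然语言指代解析为精确标识符
    PREVIEW = "preview"  # 预览将要发生的状态变更
    EXECUTE = "execute"  # 执行实际操作
    VERIFY = "verify"    # 校验目标状态是否达成
    RECOVER = "recover"  # 失败后的恢复动作

@dataclass
class ToolContract:
    """规范化工具契约(NTC)的一个假想字段布局, 仅供说明。"""
    verb: Verb
    payload: Dict[str, Any]
    confidence: float                                   # 结果置信度
    evidence: List[str] = field(default_factory=list)   # 证据链
    next_actions: List[Verb] = field(default_factory=list)  # 建议的下一步动词

resp = ToolContract(
    verb=Verb.RESOLVE,
    payload={"entity": "Alice", "record_id": "cust_0042"},
    confidence=0.93,
    evidence=["matched email alice@example.com"],
    next_actions=[Verb.PREVIEW, Verb.EXECUTE],
)
print(resp.verb.value, resp.confidence)
```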

[AI-46] Bridging Sequence and Graph Structure for Epigenetic Age Prediction

【速读】:该论文旨在解决现有表观遗传年龄预测方法中未能同时建模DNA甲基化位点间的共甲基化图结构(co-methylation graph structure)与位点特异性DNA序列上下文信息的问题。其解决方案的关键在于提出了一种统一的序列-图融合框架,通过轻量级门控调制机制,将八维DNA序列统计特征自适应地调整每个位点的甲基化信号强度,从而在图卷积前引入基于序列生物学相关性的先验知识;该设计显著提升了预测精度(测试平均绝对误差MAE为3.149年,较最优图基基线提升12.8%),并验证了手工设计的序列特征优于端到端学习的卷积神经网络(CNN)编码方式。

链接: https://arxiv.org/abs/2605.10541
作者: Yao Li,Xikun Zhang,Xiaotao Shen,Sonika Tyagi,Xin Zheng,Jiaxing Huang,Feng Xia
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Epigenetic clocks based on DNA methylation have emerged as powerful tools for estimating biological age, with broad applications in aging research, age-related disease studies, and longevity science. Despite advances across machine learning approaches to epigenetic age prediction, spanning penalised linear regression, deep feedforward networks, residual architectures, and graph neural networks, no existing method jointly models co-methylation graph structure and site-specific DNA sequence context within a unified framework. We propose a unified sequence–graph integration framework for epigenetic age prediction that addresses this gap, integrating eight-dimensional DNA sequence statistical features through a lightweight gated modulation mechanism that adaptively scales each site’s methylation signal according to its sequence-determined biological relevance prior to graph convolution. Evaluated on 3,707 blood methylation samples against a comprehensive set of baselines, our method achieves a test MAE of 3.149 years, a 12.8% improvement over the strongest graph-based baseline. Biologically informed statistical features outperform CNN-based sequence encoding, demonstrating that handcrafted sequence features are more effective than end-to-end learned representations in this data regime. Post-hoc interpretability analysis identifies CpG density and local adenine frequency as features with age-dependent importance shifts, consistent with known mechanisms of age-related hypermethylation at CpG-dense promoter regions. Our code is at this https URL.
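
摘要中的"门控调制"思路可以用几行 PyTorch 代码示意:先由 8 维序列统计特征算出一个 (0,1) 门控值,再逐位点缩放甲基化信号,之后才送入图卷积。以下仅是示意性草图,网络结构与论文实现未必一致。

```python
import torch
import torch.nn as nn

class GatedSiteModulation(nn.Module):
    """按位点序列特征对甲基化信号做门控缩放的极简示例。"""
    def __init__(self, seq_dim: int = 8):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(seq_dim, 1), nn.Sigmoid())

    def forward(self, methylation, seq_feats):
        # methylation: (n_sites,) 甲基化 beta 值
        # seq_feats:   (n_sites, seq_dim) 每个 CpG 位点的序列统计特征
        g = self.gate(seq_feats).squeeze(-1)   # (n_sites,) 取值 (0,1)
        return g * methylation                  # 门控后的信号再送入图卷积

mod = GatedSiteModulation(seq_dim=8)
x = torch.rand(100)            # 假设有 100 个 CpG 位点
feats = torch.randn(100, 8)    # 8 维序列统计特征
print(mod(x, feats).shape)     # torch.Size([100])
```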

[AI-47] HH-SAE: Discovering and Steering Hierarchical Knowledge of Complex Manifolds

【速读】:该论文旨在解决高维、关键任务领域中稀有语义创新被密集背景上下文掩盖的问题,即所谓的"特征密度冲突"(feature density conflict)。其解决方案的核心是提出混合分层稀疏自编码器(Hybrid Hierarchical SAE, HH-SAE),通过将流形分解为嵌套的三层结构——情境层(Contextual, L_0)、原子层(Atomic, f_1)和组合层(Compository, f_2),实现对复杂数据中高阶机制性创新的有效提取。该架构通过"裂解"行政临床标签为生理模式,在跨域零样本欺诈检测中达到0.9156的AUC峰值,并在路径消融实验中验证了情境减法操作的必要性(移除后性能下降13.46%),最终证明HH-SAE能优先于环境代理捕捉高阶机制,从而提升高风险场景下的精准发现能力。

链接: https://arxiv.org/abs/2605.10536
作者: Honghan Wu,Tianyan Wang,Jiacong Mi,Zhoyang Jiang,Yunsoo Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Rare semantic innovations in high-dimensional, mission-critical domains are often obscured by dense background contexts, a challenge we define as feature density conflict. We introduce the Hybrid Hierarchical SAE (HH-SAE) to resolve this by factorizing manifolds into a nested hierarchy of Contextual (L_0), Atomic (f_1), and Compository (f_2) tiers. Evaluating across disparate manifolds, HH-SAE demonstrates superior resolution by "fracturing" administrative clinical labels into physiological modes and achieving a peak cross-domain zero-shot AUC of 0.9156 in fraud detection. Path ablation confirms the architecture's structural necessity, revealing a 13.46% utility collapse when contextual subtraction is removed. Finally, knowledge-steered synthesis achieves a +9.9% AUPRC lift over state-of-the-art generators, proving that HH-SAE effectively prioritizes high-order mechanistic innovation over environmental proxies to enable high-precision discovery in high-stakes environments.
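
下面是对"情境减法 + 三层字典"这一描述的一种猜测性示意(PyTorch):先用情境层重建并从输入中减去稠密背景,再在残差上提取原子特征,并在原子码之上组合高阶特征。各层的具体形式、激活与损失均为假设,仅帮助理解摘要,不代表论文实现。

```python
import torch
import torch.nn as nn

class HierarchicalSAE(nn.Module):
    """"情境减法 + 三层字典"的猜测性示意实现, 并非论文原始代码。"""
    def __init__(self, d, k_ctx, k_atom, k_comp):
        super().__init__()
        self.ctx = nn.Linear(d, k_ctx)          # L0: 情境层编码
        self.ctx_dec = nn.Linear(k_ctx, d)
        self.atom = nn.Linear(d, k_atom)        # f1: 原子层编码
        self.atom_dec = nn.Linear(k_atom, d)
        self.comp = nn.Linear(k_atom, k_comp)   # f2: 组合层(作用于原子码)

    def forward(self, x):
        z_ctx = torch.relu(self.ctx(x))
        residual = x - self.ctx_dec(z_ctx)          # 情境减法: 去除稠密背景
        z_atom = torch.relu(self.atom(residual))    # 在残差上提取稀有原子特征
        z_comp = torch.relu(self.comp(z_atom))      # 组合出高阶机制特征
        recon = self.ctx_dec(z_ctx) + self.atom_dec(z_atom)
        return recon, (z_ctx, z_atom, z_comp)

sae = HierarchicalSAE(d=256, k_ctx=64, k_atom=1024, k_comp=128)
recon, codes = sae(torch.randn(4, 256))
print(recon.shape, [c.shape for c in codes])
```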

[AI-48] A Reflective Storytelling Agent for Older Adults: Integrating Argumentation Schemes and Argument Mining in LLM-Based Personalised Narratives

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在面向老年人的数字陪伴叙事交互中存在幻觉(hallucination)和透明度不足的问题,从而影响其在健康促进场景下的可信度与实用性。解决方案的关键在于构建一个反思式叙事代理(reflective storytelling agent),通过融合知识图谱、用户建模、论证理论(argumentation theory)与论证挖掘(argument mining)技术,实现对生成叙事内容的结构化引导与形式化检验。该机制不仅基于用户健康相关活动与动机的结构化模型生成个性化叙事,还通过论证质量指标与幻觉风险指标对输出进行量化评估,使生成内容在人类评价中展现出更高的清晰度、意义感与一致性,从而提升叙事干预的可靠性与接受度。

链接: https://arxiv.org/abs/2605.10531
作者: Jayalakshmi Baskar,Vera C. Kaelin,Kaan Kilic,Helena Lindgren
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Submitted to ACM Transactions on Intelligent Systems and Technology (TIST)

点击查看摘要

Abstract:This work investigates whether knowledge-driven large language model (LLM)-based storytelling can support purposeful narrative interaction with a digital companion for older adults. To address known limitations of LLMs, including hallucinations and limited transparency, we present a reflective storytelling agent integrating knowledge graphs, user modelling, argumentation theory, and argument mining to guide and inspect narrative generation. The study consisted of two phases. Phase I employed participatory design involving 11 domain experts in a formative evaluation that informed iterative refinement. The resulting system generates narratives grounded in structured user models representing health-promoting activities and motivations. Phase II involved 55 older adults evaluating persona-based narratives across four prompts and two creativity levels. Participants assessed perceived purpose, usefulness, cultural relatability, and inconsistencies. The system additionally computed hallucination-risk indicators to evaluate generated narratives. Participants recognised personally relevant purposes in roughly two thirds of narratives, while argument-based purposes were identified in around half of these cases. Cultural recognisability strongly influenced willingness to use the functionality, whereas minor inconsistencies were often tolerated when narratives remained understandable and personally relevant. Narratives with higher hallucination-risk indicators were more often perceived as inconsistent, while higher argument-quality indicators tended to co-occur with higher clarity and meaningfulness ratings. Overall, the study positions argument mining as a reflective inspection mechanism for comparing formal grounding signals with human evaluations in health-oriented LLM storytelling for older adults.

[AI-49] PrimeKG-CL: A Continual Graph Learning Benchmark on Evolving Biomedical Knowledge Graphs

【速读】:该论文旨在解决生物医学知识图谱(Biomedical Knowledge Graph, BKG)在持续学习(Continual Graph Learning, CGL)场景下的性能评估与模型适应性问题。现有CGL方法主要基于静态、随机划分的通用知识图谱进行研究,无法反映真实BKG因上游本体异步更新而产生的结构化演化特性(如数百万边新增、数十万边废弃)。为此,作者提出PrimeKG-CL这一基准数据集,其关键在于构建来自九个权威生物医学数据库的两个真实时间快照(2021年6月与2023年7月),包含超过129K节点、810万边及多模态特征,并设计了按实体类型分组的任务与精细的测试划分策略(持久/新增/移除),从而真实模拟BKG的动态演进过程。实验表明,解码器选择与持续学习策略之间存在强交互作用,且标准指标易混淆有效知识保留与过时知识遗忘,揭示了当前CGL评估范式在生物医学场景中的局限性。

链接: https://arxiv.org/abs/2605.10529
作者: Yousef A. Radwan,Yao Li,Qing Qing,Ziqi Xu,Xingtong Yu,Jiaxing Huang,Renqiang Luo,Xikun Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Biomedical knowledge graphs underwrite drug repurposing and clinical decision support, yet the upstream ontologies they depend on update on independent cycles that add millions of edges and deprecate hundreds of thousands more between releases. However, existing continual graph learning (CGL) has been studied almost exclusively on synthetic random splits of static, generic KGs, a regime that cannot reproduce the asynchronous, structured evolution real biomedical KGs undergo. To this end, we introduce PrimeKG-CL, a CGL benchmark built from nine authoritative biomedical databases (129K+ nodes, 8.1M+ edges, 10 node types, 30 relation types) with two genuine temporal snapshots (June 2021, July 2023; 5.83M edges added, 889K removed, 7.21M persistent), 10 entity-type-grouped tasks, multimodal node features, and a per-task persistent/added/removed test stratification. On three tasks (biomedical relationship prediction, entity classification, KGQA), we evaluate six CL strategies across four KGE decoders, plus LKGE, an LLM-RAG agent, and CMKL. We find that decoder choice and continual learning strategy interact strongly: no single strategy performs best across all decoders, and mismatched combinations can significantly degrade performance. Moreover, only DistMult exhibits a clear separation between persistent and deprecated knowledge, indicating that standard metrics conflate retention of still-valid facts with failure to forget outdated ones; this effect is absent under RotatE. In addition, multimodal features improve entity-level tasks by up to 60%, and a recent CKGE framework (IncDE) failed to scale to our 5.67M-triple base task across five attempts up to 350GB RAM. Data, pipeline, baselines, and the stratified split are released openly. Dataset: this http URL | Code: this http URL
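
摘要中"persistent/added/removed"三类测试划分本质上是两个快照间的集合运算,下面用几行 Python 说明这一分层方式;示例中的三元组为虚构数据。

```python
def stratify_edges(snapshot_old: set, snapshot_new: set):
    """按两个时间快照划分持久 / 新增 / 移除三类边。"""
    persistent = snapshot_old & snapshot_new
    added = snapshot_new - snapshot_old
    removed = snapshot_old - snapshot_new
    return persistent, added, removed

old = {("geneA", "associated_with", "disease1"),
       ("drugX", "indication", "disease1")}
new = {("geneA", "associated_with", "disease1"),
       ("drugY", "indication", "disease2")}
p, a, r = stratify_edges(old, new)
print(len(p), len(a), len(r))   # 1 1 1
```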

[AI-50] Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability

【速读】:该论文旨在解决人工智能代理(AI agent)在面对语义保持扰动时的可靠性量化问题,尤其关注其在不同运行条件下的一致性评估。传统方法如pass@1率难以捕捉代理行为在轨迹层面的细微偏差,导致对代理鲁棒性的误判。解决方案的关键在于构建一个严谨的测量科学框架,通过U-统计量(U-statistics)衡量输出层可靠性,并利用基于核函数(kernel-based)的指标评估轨迹层稳定性,从而区分代理的核心能力与执行鲁棒性。该框架揭示了即使代理具备完成任务所需知识,微小的任务级变化仍可能引发策略完全崩溃的现象,并通过三个代理基准测试验证了轨迹一致性指标相比传统指标具有更高的诊断敏感性,为识别和修正阻碍高风险场景部署的架构缺陷提供了数学工具。

链接: https://arxiv.org/abs/2605.10516
作者: Harsh Raj,Niranjan Orkat,Suvrorup Mukherjee,Aritra Guha,Cheryl Flynn,Subhabrata Majumdar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 33 pages, 5 figures, 2 tables

点击查看摘要

Abstract:This paper establishes a rigorous measurement science for AI agent reliability, providing a foundational framework for quantifying consistency under semantically preserving perturbations. By leveraging U-statistics for output-level reliability and kernel-based metrics for trajectory-level stability, we offer a principled approach to evaluating agents across diverse operating conditions. Our proposal highlights the important distinction between the core capability and execution robustness of an agent, showing that minor task-level variations can induce complete strategy breakdowns despite the agent possessing the requisite knowledge for the task. We validate our framework through extensive experiments on three agentic benchmarks, demonstrating that trajectory-level consistency metrics provide far greater diagnostic sensitivity than traditional pass@1 rates. By providing the mathematical tools to isolate where and why agents deviate, we enable the identification and rectification of architectural concerns that hinder the deployment of agents in high-stakes, real-world environments.
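
摘要提到用 U-统计量度量输出级一致性,其最朴素的二阶形式就是对所有无序运行对取核函数平均。下面的 Python 片段给出这一思路的最小示例;其中的精确匹配核仅为演示,实际可替换为更合适的输出或轨迹核。

```python
from itertools import combinations

def pairwise_consistency(outputs, kernel):
    """二阶 U-统计量: 所有无序样本对上核函数的平均值。"""
    pairs = list(combinations(outputs, 2))
    return sum(kernel(a, b) for a, b in pairs) / len(pairs)

# 示例核: 输出完全一致记 1, 否则记 0
exact_match = lambda a, b: float(a == b)

runs = ["refund issued", "refund issued", "order cancelled", "refund issued"]
print(pairwise_consistency(runs, exact_match))  # 3/6 = 0.5
```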

[AI-51] SoK: A Systematic Bidirectional Literature Review of AI DLT Convergence

【速读】:该论文旨在解决生成式 AI (Generative AI) 与分布式账本技术(Distributed Ledger Technology, DLT)融合研究中存在的重要空白问题,即当前文献多聚焦于特定应用领域或单向集成路径,缺乏对两者在系统架构层面双向交互机制的全面理解。其解决方案的关键在于开展一项结构化的双向综述,系统分类并分析2020至2025年间发表的同行评审研究,从AI增强DLT和DLT增强AI两个维度分别梳理其在数据、网络、共识、执行和应用五个层次上的整合方式与效果,并指出当前研究主要集中在执行层与共识层(AI增强DLT)以及数据层与模型层(DLT增强AI),其他层次显著被忽视。作者进一步强调,未来进展需依赖跨层协同设计与真实场景中的实证验证,以应对可扩展性、互操作性和可验证执行等关键挑战。

链接: https://arxiv.org/abs/2605.10515
作者: Ali Irzam Kathia,Yimika Erinle,Abylay Satybaldy,Paolo Tasca,Nikhil Vadgama,Marco Alberto Javarone
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 18 pages, 1 figure, 5 tables

点击查看摘要

Abstract:The integration of Artificial Intelligence (AI) with Distributed Ledger Technology (DLT) has become a growing research area, yet contributions tend to cluster around specific application domains or examine only one direction of the integration, leaving the broader architectural interplay between the two technologies poorly understood. This work addresses that gap through a structured, bidirectional review of peer-reviewed studies published between 2020 and 2025. We classify contributions along two directions: AI-enhanced DLT, and DLT-enhanced AI. In the first case, we examine how AI techniques improve DLT systems across five layers: data, network, consensus, execution, and application layers. In the second case, we analyse how DLT supports AI systems across five layers: infrastructure, data, model, inference, and application layers, with particular attention to federated learning, model evaluation, and multi-agent coordination. The analysis reveals that most works concentrate on a small subset of layers: execution and consensus for AI-enhanced DLT, data and model for DLT-enhanced AI. Other layers remain comparatively neglected. Despite reported improvements in controlled settings, no study demonstrates deployment at production scale, and the field has not yet offered satisfying answers to fundamental questions around scalability, interoperability, and verifiable execution. We argue that progress will require cross-layer co-design and empirical validation in real-world settings. Comments: 18 pages, 1 figure, 5 tables Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC) Cite as: arXiv:2605.10515 [cs.CR] (or arXiv:2605.10515v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2605.10515 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Marco Alberto Javarone [view email] [v1] Mon, 11 May 2026 13:06:04 UTC (121 KB)

[AI-52] CMKL: Modality-Aware Continual Learning for Evolving Biomedical Knowledge Graphs

【速读】:该论文旨在解决生物医学知识图谱(Biomedical Knowledge Graph, BKG)在动态演化过程中,现有知识图谱嵌入方法及持续学习(Continual Learning, CL)扩展版本难以有效利用多模态信息(如结构、文本和分子数据)且无法区分不同模态遗忘动态的问题。解决方案的关键在于提出一种名为持续多模态知识图谱学习器(Continual Multimodal Knowledge Graph Learner, CMKL)的框架:其核心创新包括三方面——(1) 原生编码结构、文本与分子三类模态信息;(2) 通过Mixture-of-Experts (MoE) 路由机制融合多模态特征,实现模态间灵活交互;(3) 结合标准EWC正则化与K-means多样性多模态回放缓冲区,分别从参数层面和样本层面保护历史知识。实验表明,CMKL在实体分类任务中显著优于最强结构基线(AP提升60%),并在关系预测任务中达到最优性能,同时近乎零遗忘(AF=0.008),验证了其对多模态异质性与持续演化的适应能力。

链接: https://arxiv.org/abs/2605.10510
作者: Yousef A. Radwan,Yao Li,Qing Qing,Ziqi Xu,Qixin Zhang,Yongcheng Jing,Renqiang Luo,Xikun Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Biomedical knowledge graphs are increasingly large, dynamic, and multimodal, driven by rapid advances in biotechnology such as high-throughput sequencing. Machine learning models can infer previously unobserved biomedical relationships and characterize biomedical entities in these graphs, but existing knowledge graph embedding methods and their continual learning extensions either assume static graph structure or fail to exploit multimodal information under evolving data distributions. They also apply uniform regularization across all model parameters, ignoring that different modalities may exhibit distinct forgetting dynamics as the graph evolves. We propose the Continual Multimodal Knowledge Graph Learner (CMKL), a CL framework for biomedical KGs that natively encodes structure, text, and molecules, fuses them through a Mixture-of-Experts (MoE) router, and protects previously learned knowledge with standard EWC regularization and a K-means-diverse multimodal replay buffer. We evaluate CMKL on a 129K-entity biomedical continual benchmark with 10 tasks. On continual biomedical entity classification, CMKL reaches AP 0.591 versus 0.370 for the strongest structural baseline, a 60% gain that is driven by access to multimodal features and preserved across the sequence with near-zero forgetting (AF 0.008). On continual relationship prediction, CMKL reaches AP 0.062, matching Naive Sequential and EWC (0.058) within seed noise and outperforming Joint Training (0.047, p=0.045) and LKGE (0.039). A frozen-text ablation reaches AP 0.136, more than double any jointly trained model, yet that signal is unreachable by margin-ranking gradients: the greedy-modality asymmetry lives at the representation level, not the fusion level, and MoE routing manages it by suppressing the unreachable modality without forcing it through a learned bottleneck. Code: this http URL
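
"K-means 多样性回放缓冲区"的一个直观做法是:对样本嵌入做 K-means,再取每个簇中最接近质心的样本入池。以下 sklearn 草图仅为示意,嵌入来源与选样细节与论文未必相同。

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_diverse_replay(embeddings: np.ndarray, buffer_size: int, seed: int = 0):
    """用 K-means 质心的最近样本填充回放缓冲区, 以保证多样性。"""
    km = KMeans(n_clusters=buffer_size, n_init=10, random_state=seed).fit(embeddings)
    chosen = []
    for c in km.cluster_centers_:
        dists = np.linalg.norm(embeddings - c, axis=1)
        chosen.append(int(dists.argmin()))     # 每个簇取最接近质心的样本索引
    return sorted(set(chosen))

emb = np.random.RandomState(0).randn(500, 64)   # 假设为多模态实体嵌入
print(kmeans_diverse_replay(emb, buffer_size=16))
```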

[AI-53] SLASH the Sink: Sharpening Structural Attention Inside LLMs

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在处理以序列化格式表示的图结构拓扑时,因缺乏对结构信息的有效建模而导致的性能瓶颈问题。现有方法依赖外部图适配器训练或微调,成本高且泛化能力差。研究发现,LLMs内部会自发重构图拓扑,表现为注意力图中与“token-level adjacency matrix”结构对齐的“锯齿状”模式,但该结构理解被注意力汇聚机制(attention sink)所稀释,其本质是一个由语言任务所需的各向异性偏置与图推理所需局部聚合之间冲突引发的表征瓶颈。解决方案的关键在于提出一种无需训练的插件式方法——StructuraL Attention SHarpening (Slash),通过注意力重分配策略增强模型内部已存在的结构感知能力,从而显著提升在纯图任务和分子性质预测等场景下的性能表现。

链接: https://arxiv.org/abs/2605.10503
作者: Yiming Liu,Bin Lu,Xinbing Wang,Chenghu Zhou,Meng Jin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) show remarkable semantic understanding but often struggle with structural understanding when processing graph topologies in a serialized format. Existing solutions rely on training external graph-based adapters or fine-tuning, which incur high costs and lost generalizability. In this work, we investigate the internal mechanisms of LLMs and present a critical finding: LLMs spontaneously reconstruct the graph’s topology internally, evidenced by a distinct “sawtooth” pattern in their attention maps that structurally aligns with the “token-level adjacency matrix”. However, this intrinsic structural understanding is diluted by the attention sink. We theoretically formalize this dilution as a representation bottleneck, stemming from a fundamental conflict: the model’s anisotropic bias, essential for language tasks, suppresses the topology-aware local aggregation required for graph reasoning. To address this, we propose a training-free solution, named StructuraL Attention SHarpening (Slash), which amplifies this internal structural understanding via a plug-and-play attention redistribution. Experiments on pure graph tasks and molecular prediction validate Slash delivers significant and consistent performance gains across diverse LLMs.
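
摘要中的"免训练注意力再分配"可以粗略理解为:在 softmax 之后,把落在邻接位置上的注意力按系数放大并重新归一化。下面的 PyTorch 片段是同类思路的简化示意,放大规则与论文的具体机制可能不同,alpha 等参数为假设。

```python
import torch

def sharpen_structural_attention(attn, adj_mask, alpha=2.0):
    """按邻接掩码放大注意力并逐行重新归一化的简化示意。
    attn:     (heads, seq, seq) softmax 后的注意力图
    adj_mask: (seq, seq) 布尔矩阵, True 表示该 token 对对应图中相邻节点
    alpha:    结构位置的放大倍数
    """
    scale = torch.ones_like(attn)
    scale[:, adj_mask] = alpha
    boosted = attn * scale
    return boosted / boosted.sum(dim=-1, keepdim=True)

attn = torch.softmax(torch.randn(8, 16, 16), dim=-1)
adj = torch.zeros(16, 16, dtype=torch.bool)
adj[3, 7] = True   # 假设 token 3 与 token 7 编码了相邻节点
adj[7, 3] = True
out = sharpen_structural_attention(attn, adj)
print(torch.allclose(out.sum(-1), torch.ones(8, 16)))   # True
```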

[AI-54] SkillEvolver: Skill Learning as a Meta-Skill

【速读】:该论文旨在解决当前智能体技能(agent skills)静态化的问题,即技能一旦生成或人工编写后便无法根据实际使用中的反馈进行迭代优化,缺乏在线学习能力。其解决方案的关键在于提出SkillEvolver,一个轻量级、可插拔的在线技能学习框架:通过一个元技能(meta-skill)持续地完成技能的生成、部署与精炼闭环,其中学习目标聚焦于技能的自然语言描述和代码内容而非模型参数,从而确保生成的技能可无缝集成到任意代理系统中而无需重新训练;同时,该元技能本身也作为普通技能加载,具备良好的兼容性。此外,SkillEvolver采用基于新代理过拟合审计的迭代机制,在技能部署后利用真实使用中遇到的失败信号(而非仅探索性轨迹)驱动优化,有效识别如“静默绕过”等隐蔽性错误,显著提升技能质量与实用性。

链接: https://arxiv.org/abs/2605.10500
作者: Genrui Zhang,Erle Zhu,Jinfeng Zhou,Caiyan Jia,Hongning Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agent skills today are static artifacts: authored once – by human curation or one-shot generation from parametric knowledge – and then consumed unchanged, with no mechanism to improve from real use. We propose SkillEvolver, a lightweight, plug-and-play solution for online skill learning, in which a single meta-skill iteratively authors, deploys, and refines domain-specific skills. The learning target of SkillEvolver is the skill’s prose and code, not model weights, so that the resulting artifact drops into any agent without retraining; and the meta-skill itself is just another skill, loaded through the same interface by any protocol-compliant CLI-agent. Unlike trace-distillation, the meta-skill refines only after deploying the learnt skill, such that the learning signal comes from failures another agent encounters while using it – not from exploratory traces alone. Refinement iterations are governed by a fresh-agent overfit audit that catches possible leakage as well as deployed-skill-specific failures, including the silent-bypass mode in which a skill appears valid in content but is never invoked at runtime. On 83 SkillsBench tasks spanning 15+ domains, SkillEvolver reaches 56.8% accuracy versus 43.6% for curated human skills and 29.9% for the no-skill baseline; on three GPU kernel optimization tasks from KernelBench, it also raises mean speedup from 1.16 to 1.51 on average.

[AI-55] Multi-layer attentive probing improves transfer of audio representations for bioacoustics

【速读】:该论文旨在解决当前生物声学(bioacoustic)表征学习评估中因固定、低容量探测头(probe)设计导致的偏差问题,即标准的最后层线性探测头可能无法充分反映编码器(encoder)的真实性能,从而误导模型比较结果。其解决方案的关键在于系统性地研究多种探测策略,包括最后一层与多层探测(last- and multi-layer probing)、线性探测(linear probe)与注意力探测(attention probe),并发现更大容量且利用时间信息的探测头能显著提升下游任务性能;尤其指出多层探测可普遍改善所有模型的表现,而注意力探测在Transformer架构下优于线性探测,表明探测头的设计与编码器特征存在重要交互作用。

链接: https://arxiv.org/abs/2605.10494
作者: Marius Miron,David Robinson,Masato Hagiwara,Titouan Parcollet,Jules Cauzinille,Gagan Narula,Milad Alizadeh,Ellen Gilsenan-McMahon,Sara Keen,Emmanuel Chemla,Benjamin Hoffman,Maddie Cusimano,Diane Kim,Felix Effenberger,Jane K. Lawton,Aza Raskin,Olivier Pietquin,Matthieu Geist
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Probing heads map the representations learned from audio by a machine learning model to downstream task labels and are a key component in evaluating representation learning. Most bioacoustic benchmarks use a fixed, low-capacity probe, such as a linear layer on the final encoder layer. While this standardization enables model comparisons, it may bias results by overlooking the interaction between encoder features and probe design. In this work, we systematically study different probing strategies across two bioacoustic benchmarks, BEANs and BirdSet. We evaluate last- and multi-layer probing, across linear and attention probes. We show that larger probe heads that leverage time information have superior performance. Our results suggest that current benchmarks may misrepresent encoder quality when relying on a last-layer probing setup. Multi-layer probing improves downstream task performance across all tested models, while attention probing has superior performance to linear probing for transformer models.
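
注意力探测头与多层探测的组合大致如下:对冻结编码器的各层输出先做可学习的层间加权,再用一个查询向量在时间维上做注意力池化,最后接线性分类层。以下 PyTorch 草图仅为示意,并非论文使用的确切结构,维度与参数均为假设。

```python
import torch
import torch.nn as nn

class MultiLayerAttentionProbe(nn.Module):
    """对多层、多时间步的冻结表征做注意力池化后分类的简化探测头。"""
    def __init__(self, n_layers, dim, n_classes):
        super().__init__()
        self.layer_w = nn.Parameter(torch.zeros(n_layers))  # 层间加权
        self.query = nn.Parameter(torch.randn(dim))         # 时间注意力查询向量
        self.head = nn.Linear(dim, n_classes)

    def forward(self, feats):
        # feats: (batch, n_layers, time, dim) —— 冻结编码器各层输出
        mix = (feats * torch.softmax(self.layer_w, 0)[None, :, None, None]).sum(1)
        attn = torch.softmax(mix @ self.query, dim=-1)       # (batch, time)
        pooled = (attn.unsqueeze(-1) * mix).sum(1)           # 时间维加权池化
        return self.head(pooled)

probe = MultiLayerAttentionProbe(n_layers=12, dim=768, n_classes=10)
x = torch.randn(2, 12, 50, 768)
print(probe(x).shape)   # torch.Size([2, 10])
```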

[AI-56] ASIA: an Autonomous System Identification Agent

【速读】:该论文旨在解决系统辨识(system identification)领域中模型类选择、训练算法设计与超参数调优等关键步骤仍依赖人工经验试错的问题,这些问题通常需要大量专家时间和领域知识。其解决方案的核心在于提出ASIA框架,该框架利用大型语言模型作为自主编码代理(autonomous coding agent),将假设生成、代码实现与评估过程闭环自动化,仅需用户提供自然语言描述的辨识任务即可完成整个搜索流程,从而显著减少人为干预并提升效率。

链接: https://arxiv.org/abs/2605.10480
作者: Dario Piga,Marco Forgione
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Over the years, research in system identification has provided a rich set of methods for learning dynamical models, together with well-established theoretical guarantees. In practice, however, the choice of model class, training algorithm, and hyperparameter tuning is still largely left to empirical trial-and-error, requiring substantial expert time and domain experience. Motivated by recent advances in agentic artificial intelligence, we present ASIA, a framework that delegates this iterative search to a large language model acting as an autonomous coding agent. Building on existing agentic platforms, ASIA closes the loop between hypothesis, implementation, and evaluation without human intervention, requiring only a plain-English description of the identification problem. We conduct an empirical study of ASIA on two system identification benchmarks and analyse the agent’s search behaviour, the architectures and training strategies it discovers, and the quality of the resulting models. We also discuss the potential of the approach and its current limitations, including implicit test leakage, reduced methodological transparency, and reproducibility concerns.

[AI-57] Formally Verifying Analog Neural Networks Under Process Variations Using Polynomial Zonotopes

【速读】:该论文旨在解决模拟神经网络(Analog Neural Networks)在电路级实现时因制造工艺变化(Process Variations)导致性能偏差的问题,这类偏差会显著影响模型的可靠性与准确性。解决方案的关键在于提出一种基于多项式的电路级建模方法,用于逼近神经元电路在工艺波动下的行为,并结合多项式区间(Polynomial Zonotopes)的可达性分析技术进行形式化验证,从而替代传统耗时的蒙特卡洛仿真,大幅缩短验证时间并保证99%的样本覆盖精度。

链接: https://arxiv.org/abs/2605.10474
作者: Yasmine Abu-Haeyeh,Tobias Ladner,Matthias Althoff,Lars Hedrich
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Analog neural networks are gaining attention due to their efficiency in terms of power consumption and processing speed. However, since analog neural networks are implemented as physical circuits, they are highly sensitive to manufacturing process variations, which can cause large deviations from the nominal model. We present a polynomial-based model that resembles the performance of the neuron circuit under process variations. Then, we formally verify the behavior of the circuit-level model using reachability analysis with polynomial zonotopes, thus, avoiding conventional, time-consuming Monte Carlo simulations. We evaluate our proposed verification approach on three different datasets, verifying both fully-connected and convolutional analog neural networks. Our experimental results confirm the effectiveness of our verification approach by reducing the verification time from days to seconds while enclosing 99% of the variation samples.
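
摘要中用于可达性分析的多项式区间(polynomial zonotope)在可达性分析文献中的标准(稀疏)定义大致如下,供参考;符号记法与论文使用的可能略有出入。

```latex
\mathcal{PZ} = \Bigl\{\, c + \sum_{i=1}^{h} \Bigl( \prod_{k=1}^{p} \alpha_k^{E_{k,i}} \Bigr) g^{(i)}
 + \sum_{j=1}^{q} \beta_j \, g_I^{(j)} \;\Bigm|\; \alpha_k,\ \beta_j \in [-1, 1] \,\Bigr\}
```

其中 c 为中心,g^{(i)} 为由指数矩阵 E 决定的依赖生成元,g_I^{(j)} 为独立生成元;依赖生成元之间的多项式耦合使其比普通 zonotope 能更紧地包络非线性(如神经元电路在工艺波动下)的输出集合。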

[AI-58] Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation

【速读】:该论文旨在解决交互式智能体(Interactive Agent)评估基准中因结果判定机制不严谨而导致的性能评分失真问题。现有基准通常通过表面信号(如点击“保存”按钮)判断任务是否成功,但这类检查无法验证智能体实际执行的动作路径是否达成目标状态,从而可能将错误操作误判为成功,造成评估结果误导。解决方案的关键在于引入一个结果证据报告层(Outcome Evidence Reporting Layer),该层无需修改原有任务、智能体或评估器,仅在评分前明确要求存储用于验证结果的关键中间产物,并对每项运行应用锁定清单(locked checklist)进行三类证据标签标注(Evidence Pass、Evidence Fail、Unknown),最终输出基于证据支持的得分区间以量化不确定性。此框架使不确定案例显式可见,避免了传统方法中对模糊情况的隐性处理,从而提升了评估的可靠性与可解释性。

链接: https://arxiv.org/abs/2605.10448
作者: Shanshan Gao,Liyi Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Interactive agent benchmarks map an agent run to a binary outcome through outcome checks. When these checks rely on surface level signals or fail to capture the agent’s actual action path, they cannot reliably determine whether the run succeeded. For example, a benchmark task may ask whether Alice’s shipping address was changed, while the outcome check only verifies that the agent clicked “Save.” This does not guarantee that the intended state change occurred, since the agent may have modified the wrong record. Treating such a run as successful therefore makes the reported score misleading. Benchmark quality thus depends not only on task design, but also on the reliability of outcome detection. We address this problem by introducing an outcome evidence reporting layer for existing benchmarks, without modifying their tasks, agents, or evaluators. The layer performs three functions. First, before scoring, it specifies which stored artifacts are required to verify the claimed outcome for each case. Second, it applies a locked checklist to each completed run and assigns one of three evidence labels: Evidence Pass, Evidence Fail, or Unknown. Third, it reports evidence supported score bounds that quantify uncertainty arising from Unknown cases. Rather than silently counting, discarding, or hiding uncertain cases inside a single aggregate success rate, the framework keeps them explicitly visible. We evaluate the outcome evidence layer on five public benchmarks: ANDROIDWORLD, AGENTDOJO, APPWORLD, tau3 bench retail, and MINIWOB. The resulting reports separate several empirically distinct failure modes.
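
摘要中的"证据支持得分区间"可以直接由三类标签计数得到:下界只把 Evidence Pass 计为成功,上界把 Unknown 也全部计为成功。下面是一个最小的 Python 示例,标签数据为虚构。

```python
def evidence_supported_bounds(labels):
    """由证据标签计算成功率的下界与上界。
    labels: 每条运行的标签, 取值 "pass" / "fail" / "unknown"
    下界: 只把 Evidence Pass 记为成功; 上界: Unknown 也全部记为成功。
    """
    n = len(labels)
    n_pass = sum(1 for x in labels if x == "pass")
    n_unknown = sum(1 for x in labels if x == "unknown")
    return n_pass / n, (n_pass + n_unknown) / n

labels = ["pass"] * 70 + ["fail"] * 20 + ["unknown"] * 10
lo, hi = evidence_supported_bounds(labels)
print(f"score in [{lo:.2f}, {hi:.2f}]")   # score in [0.70, 0.80]
```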

[AI-59] Real vs. Semi-Simulated: Rethinking Evaluation for Treatment Effect Estimation

【速读】:该论文旨在解决因果推断领域中方法论研究与实际应用之间存在的评估标准不一致问题,即学术界常用基于反事实(counterfactual)指标和半模拟基准来评价处理效应估计模型,而工业界则依赖可观测(observable)指标(如排序性能或测试结果)。其解决方案的关键在于开展大规模实证研究,系统比较不同模型在两类评估框架下的表现:一方面使用传统反事实指标(如平均绝对误差),另一方面采用应用导向的可观测指标(如排名相关性);同时覆盖标准元学习器(meta-learners)与专用因果机器学习模型,并在多个半模拟基准和真实数据集上进行验证。研究发现,两类指标下最优模型不一致,且半模拟基准上的排序无法迁移至真实数据,提示当前评估体系存在显著偏差,应将可观测指标和真实数据验证纳入模型评估流程以提升研究的实用性。

链接: https://arxiv.org/abs/2605.10430
作者: George Panagopoulos
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Estimating heterogeneous treatment effects with machine learning has attracted substantial attention in both academic research and industrial practice. However, the two communities often evaluate models under markedly different conditions. Methodological work typically relies on semi-simulated benchmarks and metrics that require counterfactual outcomes, whereas real-world applications rely on observable metrics based on ranking or test outcomes. Despite the well-known gap between methodological progress and practical deployment, the relationship between these evaluation regimes has not been examined systematically. We conduct a large-scale empirical study of treatment effect evaluation across standard semi-simulated benchmark families and real-world datasets. Our benchmark covers meta-learners paired with multiple base learners, as well as specialized causal machine learning models. We evaluate these methods using observable metrics common in application-oriented literature, alongside counterfactual metrics commonly used in methods papers. Our results reveal two complementary gaps. First, counterfactual metrics do not reliably recover the estimators preferred by observable metrics, even on the same semi-simulated benchmarks. Second, rankings obtained on semi-simulated benchmarks do not transfer to real datasets. We further find that simple meta-learners with strong base models are consistently competitive, in contrast to specialized causal models. Overall, our findings suggest that progress in treatment effect estimation research should not be assessed solely through counterfactual metrics and semi-simulated benchmarks, but it would benefit from incorporating observable metrics and real-data validation.
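
摘要对比了反事实指标与可观测指标,下面用一个半模拟的小例子说明:T-learner 分别拟合处理组与对照组的结果模型,再在已知真实效应的模拟数据上计算反事实 MAE;真实数据中该指标不可得,实务中只能退回分桶 uplift、A/B 测试等可观测量。示例数据与模型均为演示用假设。

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n, d = 2000, 5
X = rng.normal(size=(n, d))
tau = 0.5 * X[:, 0]                      # 模拟的真实个体处理效应
t = rng.integers(0, 2, size=n)           # 随机分配处理
y = X[:, 1] + tau * t + rng.normal(scale=0.5, size=n)

# T-learner: 对处理组与对照组分别拟合结果模型
m1 = GradientBoostingRegressor().fit(X[t == 1], y[t == 1])
m0 = GradientBoostingRegressor().fit(X[t == 0], y[t == 0])
cate_hat = m1.predict(X) - m0.predict(X)

# 反事实指标: 只有在半模拟数据 (已知 tau) 时才能计算
mae = np.mean(np.abs(cate_hat - tau))
print(f"counterfactual MAE (semi-simulated only): {mae:.3f}")
```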

[AI-60] Toward an Engineering of Science: Rebalancing Generation and Verification in the Age of AI

【速读】:该论文试图解决生成式 AI(Generative AI)时代下科学系统面临的“认知污染”(epistemic pollution)问题,即AI可低成本生成看似合理但不可靠的科学成果,而现有科学验证机制无法及时过滤这些伪劣内容。其核心挑战在于:传统科学体系依赖高生成成本作为初步筛选机制,而AI削弱了这一屏障却未同步降低验证成本,导致验证滞后于生成速度。解决方案的关键是将科学知识表示从当前以论文为中心的压缩式结构,重构为基于“蓝图”(blueprints)的结构化、分解式研究构件——将主张、证据、假设和定义等要素以类型化的图结构呈现,从而在前期增加一定的生成成本,换取下游更局部、更分散、更经济的验证能力,实现生成与验证成本的再平衡。

链接: https://arxiv.org/abs/2605.10425
作者: Jiaqi W. Ma
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI systems can now cheaply generate plausible scientific artifacts such as papers, reviews, and surveys. This creates a risk of epistemic pollution in our scientific systems, where unreliable but plausible-looking artifacts can accumulate faster than the system can filter them out. The problem is structural: the epistemic infrastructure of science was calibrated to a world where producing a plausible artifact required substantial expertise, labor, and time, so generation cost itself served as a rough filter; AI weakens that filter without comparably lowering verification cost. We argue that AI-era science should treat this as an engineering problem: redesigning epistemic infrastructure to rebalance the costs of generation and verification. The current paper-centered system makes verification expensive: papers compress long-context scientific logic into prose, forcing reviewers, human or AI, to reconstruct underlying argument structure before they can evaluate it. As one step in this direction, we propose blueprints as preliminary epistemic infrastructure: structured, decomposed research artifacts that represent claims, evidence, assumptions, and definitions as typed graph components. Blueprints are designed to trade an upfront generation cost for cheaper, more local, more distributed verification downstream. We have instantiated the proposal in a proof-of-concept prototype.

[AI-61] LLM4Branch: Large Language Model for Discovering Efficient Branching Policies of Integer Programs ICML2026

【速读】:该论文旨在解决混合整数线性规划(Mixed Integer Linear Programming, MILP)求解器中分支策略(branching policy)设计效率低下的问题,传统方法依赖人工设计的启发式规则,而现有基于机器学习的方法则受限于对昂贵专家示范的依赖以及训练目标与求解器端到端性能之间的差距。其解决方案的关键在于提出 LLM4Branch 框架,利用大语言模型(Large Language Models, LLMs)自动生成可执行的分支策略程序骨架,并通过零阶优化方法在少量实例上基于端到端性能反馈自动优化参数向量,从而实现高效、自动化且性能优越的分支策略发现。

链接: https://arxiv.org/abs/2605.10401
作者: Zhinan Hou,Xingchen Li,Yankai Zhang,Tianxun Li,Keyou You
机构: 未知
类目: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: ICML2026 preprint, camera ready in progress

点击查看摘要

Abstract:Efficient branching policies are essential for accelerating Mixed Integer Linear Programming (MILP) solvers. Their design has long relied on hand-crafted heuristics, and now machine learning has emerged as a promising paradigm to automate this process. However, existing learning-based methods are often hindered by their dependence on expensive expert demonstrations and the gap between training objectives and the solver’s end-to-end performance. In this work, we propose LLM4Branch, a novel framework that leverages Large Language Models (LLMs) to automate the discovery of efficient branching policies. Specifically, the discovered policy is an executable program with a program skeleton generated by the LLM and a parameter vector, which is optimized via a zeroth-order method over a few instances with their end-to-end performance feedback. Extensive experiments on standard MILP benchmarks demonstrate that LLM4Branch establishes a new state-of-the-art among CPU-based methods and achieves performance competitive with advanced GPU-based models. Codes are available at this https URL.
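
"LLM 生成程序骨架 + 零阶优化参数向量"的流程可以用下面的 Python 草图示意:打分函数代表骨架,随机扰动搜索代表零阶优化,端到端反馈用一个玩具函数代替真实的 MILP 求解器回调。函数名与参数均为假设性示例,并非论文代码。

```python
import numpy as np

def branching_score(cand_feats, theta):
    """策略骨架示意: 候选变量特征的线性打分, theta 为待优化参数向量。"""
    return cand_feats @ theta

def zeroth_order_search(evaluate, dim, iters=30, sigma=0.1, seed=0):
    """无梯度随机搜索: 用端到端求解代价 (越小越好) 作为唯一反馈。"""
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)
    best = evaluate(theta)
    for _ in range(iters):
        cand = theta + sigma * rng.normal(size=dim)
        cost = evaluate(cand)        # 例如: 一组训练实例上的平均求解节点数
        if cost < best:
            theta, best = cand, cost
    return theta, best

# 玩具反馈函数, 代替真实求解器的端到端性能评估
toy_eval = lambda th: float(np.sum((th - np.array([0.3, -0.2, 0.5])) ** 2))
theta, best = zeroth_order_search(toy_eval, dim=3)
print(theta.round(2), round(best, 4))
```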

[AI-62] GuardAD: Safeguarding Autonomous Driving MLLM s via Markovian Safety Logic

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在自动驾驶(Autonomous Driving, AD)系统中因缺乏时序安全推理能力而导致的安全脆弱性问题,尤其是在动态交通交互场景下难以识别潜在危险。其解决方案的关键在于提出一种与模型无关的安全防护机制 GuardAD,该机制将驾驶安全建模为一个演化马尔可夫逻辑状态,并通过神经符号逻辑形式化(Neuro-Symbolic Logic Formalization)实现对异构交通参与者安全谓词的持续推导,利用n阶马尔可夫逻辑归纳机制捕捉跨时间步的隐含风险;同时引入逻辑驱动的动作修订(Logic-Driven Action Revision)策略,在不修改原始MLLM的前提下,基于推理出的安全状态主动优化决策动作,从而显著提升系统的安全性与鲁棒性。

链接: https://arxiv.org/abs/2605.10386
作者: Tianyuan Zhang,Peng Yue,Zihao Peng,Jiangfan Liu,Zonghao Ying,Jiakai Wang,Tianlin Li,Jian Yang,Yaodong Yang,Aishan Liu,Xianglong Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) are increasingly integrated into autonomous driving (AD) systems; however, they remain vulnerable to diverse safety threats, particularly in accident-prone scenarios. Recent safeguard mechanisms have shown promise by incorporating logical constraints, yet most rely on static formulations that lack temporally grounded safety reasoning over evolving traffic interactions, resulting in limited robustness in dynamic driving environments. To address these limitations, we propose GuardAD, a model-agnostic safeguard that formulates AD safety as an evolving Markovian logical state. GuardAD introduces Neuro-Symbolic Logic Formalization, which represents safety predicates over heterogeneous traffic participants and continuously induces them via n-th order Markovian Logic Induction. This design enables the inference of emerging and latent hazards beyond single-step observations. Rather than simply vetoing unsafe actions, GuardAD performs Logic-Driven Action Revision, where inferred safety states actively guide action refinement without modifying the underlying MLLM. Extensive experiments on multiple benchmarks and AD-MLLMs demonstrate that GuardAD substantially reduces accident rates (-32.07%) while slightly improving task performance (+6.85%). Moreover, closed-loop simulation evaluations, together with physical-world vehicle studies, further validate the effectiveness and potential of GuardAD.

[AI-63] Agentic Performance at the Edge: Insights from Benchmarking

【速读】:该论文旨在解决在资源受限的边缘计算环境中,生成式 AI(Generative AI)代理模型规模压缩对任务执行质量的影响问题。其核心挑战在于如何在内存、功耗和延迟预算约束下,保持代理任务的性能表现。解决方案的关键在于提出一种领域条件驱动的评估方法,结合模型与工具交互的实证分析,揭示了模型参数量并非决定边缘代理质量的唯一因素;相反,高质量部署依赖于模型选择与工具工作流的协同设计,并通过帕累托前沿分析识别出不同操作优先级下的准确率-延迟权衡策略。

链接: https://arxiv.org/abs/2605.10384
作者: Shiqiang Wang,Herbert Woisetschläger
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
备注: Accepted to AutoEdge workshop, co-located with MobiSys 2026

点击查看摘要

Abstract:Agentic artificial intelligence (AI) is a natural fit for Internet of Things (IoT) and edge systems, but edge deployments are often constrained to models around 8 billion parameters or smaller. An important question is: How much agentic-task quality is lost when model size is constrained by memory, power, and latency budgets? To address this question, in this paper, we provide an initial empirical study considering edge-focused model scaling, general-purpose versus coder-oriented model effects, and tool-enabled execution under a fixed protocol. We introduce a domain-conditioned evaluation methodology, an implementation-grounded analysis of model-tool interactions, practical guidance for model selection under constraints, and an analysis of failure modes that reveals distinct semantic versus execution failure patterns across model families. Our core finding is that edge-agent quality is not a simple function of parameter count. Robust deployment depends on the joint design of model choice and tool workflow. Domain-conditioned analysis reveals Pareto fronts in the accuracy-latency space that can guide strategy selection based on operational priorities.
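
摘要提到的精度-延迟帕累托前沿可以用简单的支配判断求出,下面的 Python 片段给出一个示意;其中模型名与数值均为虚构,并非论文测得的结果。

```python
def pareto_front(points):
    """在 (accuracy, latency) 空间中求帕累托前沿: 精度越高越好, 延迟越低越好。"""
    front = []
    for name, acc, lat in points:
        dominated = any(a >= acc and l <= lat and (a, l) != (acc, lat)
                        for _, a, l in points)
        if not dominated:
            front.append((name, acc, lat))
    return front

# 虚构的候选模型: (名称, 准确率, 延迟秒)
models = [("8B-coder", 0.66, 4.5), ("8B-general", 0.62, 4.1),
          ("8B-general-int4", 0.60, 4.3), ("3B-general", 0.51, 1.8),
          ("1B-general", 0.40, 0.9)]
print(pareto_front(models))   # 8B-general-int4 被 8B-general 支配, 不在前沿上
```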

[AI-64] Agent-X: Full Pipeline Acceleration of On-device AI Agents

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)驱动的智能体(Agent)在边缘设备上运行时面临的高端到端延迟问题。现有方案虽能实现高性能,但难以满足实时性要求。其解决方案的核心在于提出一个纯软件框架 Agent-X,通过两个关键技术实现加速:一是针对智能体特有的输入令牌(token)模式重写提示(prompt),以充分利用前缀缓存(prefix caching)机制;二是引入无需LLM的推测解码(speculative decoding),在保持精度不变的前提下显著提升生成速度且开销极低。实验表明,Agent-X 在真实系统中实现了 1.61 倍的端到端加速,且可无缝集成至现有边缘智能体架构中。

链接: https://arxiv.org/abs/2605.10380
作者: Jinha Chung,Byeongjun Shin,Jiin Kim,Minsoo Rhu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted for publication at MobiSys-2026

点击查看摘要

Abstract:LLM-based agents deliver state-of-the-art performance across tasks but incur high end-to-end latency on edge devices. We introduce Agent-X, a software-only, accuracy-preserving framework that accelerates both the prefill and decode stages of on-device agent workloads. Agent-X’s two key components rewrite prompts to leverage prefix caching tailored to agent-specific input-token patterns and enable LLM-free speculative decoding for fast token generation with minimal overhead. On representative agentic workloads, Agent-X achieves a 1.61x end-to-end speedup in real systems with no accuracy loss and can be seamlessly integrated into existing on-device AI agents. To the best of our knowledge, ours is the first to systematically characterize and eliminate latency bottlenecks in on-device agents.

[AI-65] Autonomous FAIR Digital Objects: From Passive Assertions to Active Knowledge

【速读】:该论文旨在解决科学知识在互联网上以静态断言形式发布所导致的可信度难以动态验证、矛盾难以自动调和以及数据长期可维护性不足的问题。传统科学出版依赖中心化中间件和机构持续运营,一旦注册表关闭,即便数据仍在线也失去主动管理能力。其解决方案的核心是提出并实现自主式FAIR数字对象(Autonomous FAIR Digital Objects, aFDO),通过引入三个基于语义网标准的关键能力:1)基于RDF-star的策略层(与PROV-O、SHACL和ODRL对齐),支持可移植的条件-动作规则;2)基于ActivityStreams 2.0的公告层,限制单次公告的评估成本;3)基于声誉和置信度加权共识的协议层,在有限对抗模型下解决多源冲突。该机制在4,305个基于罕见病本体(ClinVar、HPO、Orphanet)的FDO上验证,成功解决了56.3%的自然发生的ClinVar冲突,并在拜占庭攻击下保持渐进式退化(f ≤ n/5),符合设计容错边界。

链接: https://arxiv.org/abs/2605.10370
作者: Zeyd Boukhers,Oya Beyan,Cong Yang,Christoph Lange
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Scientific knowledge on the Web is published as passive assertions and cannot decide when to validate evidence, reconcile contradictions, or update confidence as findings accumulate. Curation depends on centralised middleware and institutional continuity, but when registries close, active stewardship stops even when data remain online. We advance the concept of Autonomous FAIR Digital Objects (aFDOs) from an abstract idea to an operational model, to offer a route from passive scientific publication toward accountable, standards-aligned automation that can outlive its publishing institutions. aFDO augments FDOs with three capabilities anchored in Semantic Web standards, namely 1) a policy layer over RDF-star aligned with PROV-O, SHACL, and ODRL for portable condition-action rules, 2) an announcement layer over ActivityStreams 2.0 that bounds per-announcement evaluation cost, and 3) an agreement layer that resolves multi-source contradictions through reputation and confidence weighted agreement under a bounded adversarial model. We provide a formal definition that distinguishes policy specifications, event handlers, and communication interfaces. We evaluate an open reference implementation on 4,305 FDOs grounded in rare-disease ontologies, namely ClinVar, HPO, and Orphanet, combined with controlled synthetic observations. The consensus mechanism resolves 56.3% of 3,914 naturally occurring ClinVar conflicts where multiple submitters disagree and an expert panel has subsequently adjudicated. Under Sybil, collusion, and poisoning attacks, the mechanism degrades gracefully within its design Byzantine-tolerance bound (f ≤ n/5), and fails as predicted beyond that bound.
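
"信誉与置信度加权共识"的核心是一个加权投票,下面的 Python 片段给出最小示意;来源、权重与断言均为虚构示例,冲突消解的实际规则以论文为准。

```python
from collections import defaultdict

def weighted_consensus(claims, reputation):
    """信誉 × 置信度加权投票: 返回得分最高的断言及其归一化支持度。
    claims: [(来源, 断言值, 置信度 0~1), ...]
    reputation: {来源: 信誉权重 0~1}
    """
    scores = defaultdict(float)
    for source, value, confidence in claims:
        scores[value] += reputation.get(source, 0.1) * confidence
    total = sum(scores.values())
    winner = max(scores, key=scores.get)
    return winner, scores[winner] / total

claims = [("labA", "pathogenic", 0.9),
          ("labB", "benign", 0.6),
          ("labC", "pathogenic", 0.7)]
reputation = {"labA": 0.8, "labB": 0.5, "labC": 0.6}
print(weighted_consensus(claims, reputation))  # ('pathogenic', 约 0.79)
```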

[AI-66] EGL-SCA: Structural Credit Assignment for Co-Evolving Instructions and Tools in Graph Reasoning Agents

【速读】:该论文旨在解决生成式 AI (Generative AI) 在处理图推理任务时面临的多阶段协同优化难题:即如何从自然语言输入中重建结构化图实例、判断现有计算资源是否充足、在严格执行协议下调用工具,并最终通过外部验证器确保结构正确性而非仅文本合理性。现有方法通常孤立地优化指令或工具,导致失败后难以定位改进方向。其解决方案的关键在于提出EGL-SCA框架——一个以验证器为中心的双空间协同机制,包含两个协作组件:指令侧策略空间(用于推理策略)和工具侧程序空间(用于可执行算法工具)。核心创新是结构信用分配(structural credit assignment),它将轨迹证据映射到条件更新,精准地将失败归因于提示优化或工具合成与修复;同时引入按任务家族分层的训练分布及帕累托风格保留策略,平衡成功率、泛化性和简洁性,从而实现指令与工具的协同进化,在四个图推理基准上达到92.0%的平均成功率,显著优于纯提示法和固定工具箱基线。

链接: https://arxiv.org/abs/2605.10366
作者: Zike Yuan,Yukun Cao,Han Zhang,Jianzhi Yan,Le Liu,Cai ke,Yue Yu,Hui Wang,Ming Liu,Bing Qin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph reasoning agents operating from natural-language inputs must solve a coupled problem: they must reconstruct a structured graph instance from text, decide whether existing computational assets are sufficient, interact with tools under a strict execution protocol, and satisfy an external verifier that checks structured correctness rather than textual plausibility. Existing approaches usually improve either the instruction side or the tool side in isolation, which leaves unclear what should be updated after failure. We propose EGL-SCA, a verifier-centric dual-space framework that models a graph reasoning agent using two collaborative components: an instruction-side policy space for reasoning strategies, and a tool-side program space for executable algorithmic tools. Our central mechanism is structural credit assignment, which maps trajectory evidence to conditional updates, precisely routing failures to either prompt optimization or tool synthesis and repair. To provide sufficient learning signals for dual-space adaptation, we introduce a training distribution stratified by task family, coupled with a Pareto-style retention strategy to balance success, generality, and parsimony. Experiments on four graph reasoning benchmarks show that EGL-SCA achieves a state-of-the-art 92.0% average success rate. By effectively co-evolving instructions and tools, our framework significantly outperforms both pure-prompting and fixed-toolbox baselines.

[AI-67] Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

【速读】:该论文旨在解决当前对自主代理(autonomous agents)价值观理解的空白问题,即现有价值基准主要聚焦于大语言模型(LLM),而未涵盖代理特有的行为价值体系。研究发现,代理的价值与其底层LLM存在显著差异,且代理模态引入了数据集、评估和系统层面的新挑战。为填补这一空白,作者提出了Agent-ValueBench,这是首个专门用于评估代理价值观的基准,包含394个可执行环境、4,335个价值冲突任务以及覆盖28种价值体系和332个维度的精细标注。其关键创新在于构建了一个端到端的任务合成与心理专家校准流程,并通过轨迹级评分机制实现高精度评估。实验表明,代理价值观呈现出“价值潮汐”现象——跨模型一致性下隐藏可解释的反向流,且受代理框架(harness)和嵌入技能的显著影响,揭示出对齐策略正从传统模型对齐和提示引导转向框架对齐与技能引导。

链接: https://arxiv.org/abs/2605.10365
作者: Haonan Dong,Qiguan Feng,Kehan Jiang,Haoran Ye,Xin Zhang,Guojie Song
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autonomous agents have rapidly matured as task executors and seen widespread deployment via harnesses such as OpenClaw. Safety concerns have rightly drawn growing research attention, and beneath them lie the values silently steering agent behavior. Existing value benchmarks, however, remain confined to LLMs, leaving agent values largely uncharted. From intuitive, empirical, and theoretical vantage points, we show that an agent’s values diverge from those of its underlying LLM, and the agentic modality further introduces dataset-, evaluation-, and system-level challenges absent from text-only protocols. We close this gap with Agent-ValueBench, the first benchmark dedicated to agent values. It features 394 executable environments across 16 domains, offering 4,335 value-conflict tasks that cover 28 value systems and 332 dimensions. Every instance is co-synthesized through our purpose-built end-to-end pipeline and curated per-instance by professional psychologists. Each task ships with two pole-aligned golden trajectories whose checkpoints anchor a trajectory-level rubric-based judge. Benchmarking 14 frontier proprietary and open-weights models across 4 mainstream harnesses, we uncover three concerted findings. Agent values first manifest as a Value Tide of cross-model homogeneity beneath interpretable counter-currents. This tide bends non-additively under harness pull, and yet more decisively under deliberate steering via embedded skills. Together these results signal that the agent-alignment lever is shifting from classical model alignment and prompt steering toward harness alignment and skill steering.

[AI-68] RW-Post: Auditable Evidence-Grounded Multimodal Fact-Checking in the Wild

【速读】:该论文旨在解决现实世界中多模态虚假信息(multimodal misinformation)的验证难题,特别是图像被用于强化误导性文本内容的问题。其核心挑战在于如何有效实现图文对齐的事实核查,并确保模型在推理过程中忠实于可追溯的证据。解决方案的关键在于构建一个名为RW-Post的后置对齐(post-aligned)多模态事实核查基准数据集,该数据集通过LLM辅助的提取与审计流程,将社交媒体原始帖子与人类事实核查文章中的显式证据项及推理路径进行结构化关联,从而支持封闭书本、受限证据和开放网络三种评估范式,以系统诊断视觉定位(visual grounding)和证据利用能力。

链接: https://arxiv.org/abs/2605.10357
作者: Danni Xu,Shaojing Fan,Harry Cheng,Mohan Kankanhalli
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal misinformation increasingly leverages visual persuasion, where repurposed or manipulated images strengthen misleading text. We introduce RW-Post, a post-aligned text–image benchmark for real-world multimodal fact-checking with auditable annotations: each instance links the original social-media post with reasoning traces and explicitly linked evidence items derived from human fact-check articles via an LLM-assisted extraction-and-auditing pipeline. RW-Post supports controlled evaluation across closed-book, evidence-bounded, and open-web regimes, enabling systematic diagnosis of visual grounding and evidence utilization. We provide AgentFact as a reference verification baseline and benchmark strong open-source LVLMs under unified protocols. Experiments show substantial headroom: current models struggle with faithful evidence grounding, while evidence-bounded evaluation improves both accuracy and faithfulness. Code and dataset will be released at this https URL.

[AI-69] TMAS: Scaling Test-Time Compute via Multi-Agent Synergy

【速读】:该论文旨在解决现有结构化测试时扩展(test-time scaling)方法在推理过程中难以有效平衡探索(exploration)与利用(exploitation)的问题,具体表现为:要么对并行推理轨迹的协同控制较弱,要么依赖噪声较大的历史信息而缺乏对关键知识的显式保留与复用机制。解决方案的关键在于提出TMAS框架,通过多智能体协同机制组织推理过程,引入分层记忆系统——经验库(experience bank)用于重用低层次可靠中间结论和局部反馈,指南库(guideline bank)记录高层次策略以引导后续推演避开冗余路径;同时设计混合奖励强化学习方案,兼顾基础推理能力保持、经验利用率提升及跨策略探索激励,从而实现更高效且稳定的迭代式计算扩展。

链接: https://arxiv.org/abs/2605.10344
作者: George Wu,Nan Jing,Qing Yi,Chuan Hao,Ming Yang,Feng Chang,Yuan Wei,Jian Yang,Ran Tao,Bryan Dai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Test-time scaling has become an effective paradigm for improving the reasoning ability of large language models by allocating additional computation during inference. Recent structured approaches have further advanced this paradigm by organizing inference across multiple trajectories, refinement rounds, and verification-based feedback. However, existing structured test-time scaling methods either weakly coordinate parallel reasoning trajectories or rely on noisy historical information without explicitly deciding what should be retained and reused, limiting their ability to balance exploration and exploitation. In this work, we propose TMAS, a framework for scaling test-time compute via multi-agent synergy. TMAS organizes inference as a collaborative process among specialized agents, enabling structured information flow across agents, trajectories, and refinement iterations. To support effective cross-trajectory collaboration, TMAS introduces hierarchical memories: the experience bank reuses low-level reliable intermediate conclusions and local feedback, while the guideline bank records previously explored high-level strategies to steer subsequent rollouts away from redundant reasoning patterns. Furthermore, we design a hybrid reward reinforcement learning scheme tailored to TMAS, which jointly preserves basic reasoning capability, enhances experience utilization, and encourages exploration beyond previously attempted solution strategies. Extensive experiments on challenging reasoning benchmarks demonstrate that TMAS achieves stronger iterative scaling than existing test-time scaling baselines, while hybrid reward training further improves scaling effectiveness and stability across iterations. Code and data are available at this https URL.

[AI-70] PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents

【速读】:该论文旨在解决 LaTeX 文档从“可编译”到“出版就绪”的关键瓶颈问题,即当前生成的 PDF 常因浮动体错位、公式溢出、表格缩放不一致、孤行(widow/orphan lines)及页面排版失衡等视觉缺陷而无法直接用于发表。传统规则工具和文本型大语言模型(LLM)均缺乏对二维布局效果的感知能力,导致修改难以预测或验证最终呈现。解决方案的关键在于引入视觉闭环优化机制——提出视觉类型排版优化(Visual Typesetting Optimization, VTO)框架,并开发 PaperFit 系统:该系统通过迭代渲染页面、诊断五类类型缺陷并执行受约束的源码修复,实现基于视觉反馈的闭环修正。实验表明,PaperFit 在 PaperFit-Bench 数据集上显著优于所有基线方法,证明了视觉闭环是打通文档自动化流程中缺失的一环。

链接: https://arxiv.org/abs/2605.10341
作者: Bihui Yu,Xinglong Xu,Junjie Jiang,Jiabei Cheng,Caijun Jia,Siyuan Li,Conghui He,Jingxuan Wei,Cheng Tan
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 47 pages, 17 figures, 17 tables

点击查看摘要

Abstract:A LaTeX manuscript that compiles without error is not necessarily publication-ready. The resulting PDFs frequently suffer from misplaced floats, overflowing equations, inconsistent table scaling, widow and orphan lines, and poor page balance, forcing authors into repetitive compile-inspect-edit cycles. Rule-based tools are blind to rendered visuals, operating only on source code and log files. Text-only LLMs perform open-loop text editing, unable to predict or verify the two-dimensional layout consequences of their changes. Reliable typesetting optimization therefore requires a visual closed loop with verification after every edit. We formalize this problem as Visual Typesetting Optimization (VTO), the task of transforming a compilable LaTeX paper into a visually polished, page-budget-compliant PDF through iterative visual verification and source-level revision, and introduce a five-category taxonomy of typesetting defects to guide diagnosis. We present PaperFit, a vision-in-the-loop agent that iteratively renders pages, diagnoses defects, and applies constrained repairs. To benchmark VTO, we construct PaperFit-Bench with 200 papers across 10 venue templates and 13 defect types at different difficulty. Extensive experiments show that PaperFit outperforms all baselines by a large margin, establishing that bridging the gap from compilable source to publication-ready PDF requires vision-in-the-loop optimization and that VTO constitutes a critical missing stage in the document automation pipeline.

[AI-71] CORTEG: Foundation Models Enable Cross-Modality Representation Transfer from Scalp to Intracranial Brain Recordings

【速读】:该论文旨在解决侵入式脑机接口(Brain-Computer Interface, BCI)中因单个患者数据有限而导致的模型泛化能力差的问题,尤其是现有方法多依赖于小样本、个体特异性的解码器,忽略了跨受试者共享的信息。其解决方案的关键在于提出CORTEG框架,该框架通过预训练的头皮脑电(scalp-EEG)基础模型(EEG Foundation Model, EEG FM)进行跨模态迁移学习,结合电极感知的KNNSoftFourier空间适配器、双流令牌化器(分别处理低频和高gamma频段活动),以及留一受试者策略微调,实现了在仅需10–30分钟单GPU校准的情况下,高效适应新患者并达到与任务特定最优基线相当或更优的解码性能,从而推动了数据高效的颅内BCI系统发展。

链接: https://arxiv.org/abs/2605.10337
作者: Liuyin Yang,Qiang Sun,Bob Van Dyck,Eva Calvo Merino,Marc M. Van Hulle
机构: 未知
类目: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Intracranial electrocorticography (ECoG) offers high-signal-to-noise access to cortical activity for brain-computer interfaces, yet limited per-patient data has led most prior work to rely on small, subject-specific decoders that neglect information shared across patients. We investigate whether large pretrained scalp-EEG foundation models (EEG FMs) can be adapted to ECoG, enabling cross-patient learning and competitive decoding performance while calibrating to a held-out patient in 10-30 minutes on a single GPU. We introduce CORTEG, a cross-modality transfer framework that combines a pretrained EEG FM backbone, an electrode-aware KNNSoftFourier spatial adapter, a dual-stream tokenizer for low-frequency and high-gamma activity, and a leave-one-subject-out fine-tuning strategy. We evaluate CORTEG on two challenging regression tasks: public finger trajectory regression (n=9) and private audio envelope regression (n=16). CORTEG matches or exceeds the strongest task-specific baselines on both tasks: it reaches the highest mean correlation among compared methods on the public finger benchmark (gain not statistically significant on n=9 subjects), with larger and statistically significant gains on the audio task and in low-data per-patient calibration. Feature analyses align with neurophysiology, and latent manifolds capture low-dimensional finger-movement structure. CORTEG provides systematic evidence that scalp-EEG pretraining can be repurposed for ECoG decoding, enabling data-efficient intracranial BCIs that can adapt to new patients.

[AI-72] EmbodiSkill: Skill-Aware Reflection for Self-Evolving Embodied Agents

【速读】:该论文旨在解决具身智能体(embodied agents)在多样化环境中执行任务时,如何从自身轨迹中自进化技能的问题。现有方法多基于数字环境设计,将轨迹转化为粗粒度的技能更新,难以直接应用于具身场景,因为任务失败可能源于技能内容错误或执行偏差(execution lapse),而传统方法无法区分二者。解决方案的关键在于提出EmbodiSkill框架,其通过技能感知反思(skill-aware reflection)针对性修订机制,能够识别轨迹中技能变更证据与执行偏差证据:前者用于更新技能主体,后者则保留并强化有效的引导信息,从而实现无需训练的、可复用的程序性知识积累。

链接: https://arxiv.org/abs/2605.10332
作者: Ruofei Ju,Xinrui Wang,Xin Ding,Yifan Yang,Hao Wu,Shiqi Jiang,Qianxi Zhang,Hao Wen,Xiangyu Li,Weijun Wang,Kun Li,Yunxin Liu,Haipeng Dai,Wei Wang,Ting Cao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Embodied agents can benefit from skills that guide object search, action execution, and state changes across diverse environments. Since embodied environments vary across layouts, object states, and other execution factors, these skills must self-evolve from trajectories generated during task execution. However, existing skill self-evolution methods are mainly developed in digital environments and often convert trajectories into coarse skill updates. Directly applying this paradigm to embodied settings is problematic, because a failed task execution may reflect not only incorrect skill content, but also an execution lapse in which the agent fails to follow valid guidance. We propose EmbodiSkill, a training-free framework for embodied skill self-evolution through skill-aware reflection and targeted revision. EmbodiSkill interprets each trajectory with respect to the current skill, uses skill-changing evidence to update the skill body, and uses execution-lapse evidence to preserve and emphasize valid guidance. Experiments on ALFWorld and EmbodiedBench show that EmbodiSkill consistently improves embodied task success. On ALFWorld, EmbodiSkill enables a frozen Qwen3.5-27B executor to reach 93.28% task success, outperforming GPT-5.2 used as a direct agent without skills by 31.58%. These results show that skill-aware self-evolution helps embodied agents accumulate reusable procedural knowledge from their own trajectories.

[AI-73] Verifiable Process Rewards for Agentic Reasoning

【速读】:该论文旨在解决长时程智能体推理中因稀疏结果级反馈导致的信用分配(credit assignment)难题,即在复杂推理任务中,即使中间步骤正确,最终失败也可能掩盖有效策略;反之,错误的中间决策可能因偶然成功而被误判为有效。其解决方案的关键在于提出可验证过程奖励(Verifiable Process Rewards, VPR)框架,通过引入符号或算法类验证器(oracle)将可验证的中间动作转化为密集的回合级监督信号,从而提供更局部化的学习信号。VPR在三种典型场景下实现具体化:基于搜索的验证用于动态演绎、基于约束的验证用于逻辑推理、基于后验的验证用于概率推理,并通过理论分析与实证结果表明,该方法能显著提升长程推理性能,且具备向通用和代理型推理基准迁移的能力,前提是验证器具有较高可靠性。

链接: https://arxiv.org/abs/2605.10325
作者: Huining Yuan,Zelai Xu,Huaijie Wang,Xiangmin Yi,Jiaxuan Gao,Xiao-Ping Zhang,Yu Wang,Chao Yu,Yi Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning from verifiable rewards (RLVR) has improved the reasoning abilities of large language models (LLMs), but most existing approaches rely on sparse outcome-level feedback. This sparsity creates a credit assignment challenge in long-horizon agentic reasoning: a trajectory may fail despite containing many correct intermediate decisions, or succeed despite containing flawed ones. In this work, we study a class of densely-verifiable agentic reasoning problems, where intermediate actions can be objectively checked by symbolic or algorithmic oracles. We propose Verifiable Process Rewards (VPR), a framework that converts such oracles into dense turn-level supervision for reinforcement learning, and instantiate it in three representative settings: search-based verification for dynamic deduction, constraint-based verification for logical reasoning, and posterior-based verification for probabilistic inference. We further provide a theoretical analysis showing that dense verifier-grounded rewards can improve long-horizon credit assignment by providing more localized learning signals, with the benefit depending on the reliability of the verifier. Empirically, VPR outperforms outcome-level reward and rollout-based process reward baselines across controlled environments, and more importantly, transfers to both general and agentic reasoning benchmarks, suggesting that verifiable process supervision can foster general reasoning skills applicable beyond the training environments. Our results indicate that VPR is a promising approach for enhancing LLM agents whenever reliable intermediate verification is available, while also highlighting its dependence on oracle quality and the open challenge of extending VPR to less structured, open-ended environments.
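
To make the contrast between sparse outcome rewards and dense verifiable process rewards concrete, here is a minimal Python sketch. The verifiers and trajectory format are invented toy stand-ins (a non-negativity check on running sums), not the paper's search-, constraint-, or posterior-based oracles.

```python
# Minimal sketch: converting a step-level oracle into dense turn rewards,
# contrasted with a sparse outcome-only reward. All names are illustrative.

def outcome_reward(trajectory, verify_final):
    # Sparse RLVR signal: a single scalar for the whole rollout.
    return [0.0] * (len(trajectory) - 1) + [1.0 if verify_final(trajectory) else 0.0]

def verifiable_process_rewards(trajectory, verify_step):
    # Dense signal: each intermediate action is checked by an oracle.
    return [1.0 if verify_step(state, action) else -1.0
            for state, action in trajectory]

# Toy example: states are running sums, actions are integers, and the oracle
# checks that each partial sum stays non-negative (a stand-in for a constraint check).
trajectory = [(0, 3), (3, -1), (2, -5)]          # (state, action) pairs
verify_step = lambda s, a: s + a >= 0
verify_final = lambda traj: sum(a for _, a in traj) >= 0

print(outcome_reward(trajectory, verify_final))            # [0.0, 0.0, 0.0] -> no credit assignment
print(verifiable_process_rewards(trajectory, verify_step)) # [1.0, 1.0, -1.0] -> the flawed step is localized
```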

[AI-74] Relations Are Channels: Knowledge Graph Embedding via Kraus Decompositions

【速读】:该论文旨在解决知识图谱嵌入(Knowledge Graph Embedding, KGE)模型中关系算子设计缺乏理论基础的问题,特别是现有模型对关系操作符的定义往往依赖经验设定而非数学严谨性。解决方案的关键在于引入三个结构公理——线性性(linearity)、迹保持性(trace preservation)和完全正性(complete positivity),并基于Kraus表示定理证明这些公理共同刻画了Kraus通道(Kraus channel)结构,从而为关系操作提供了一个可解释且完备的数学框架。在此基础上,作者提出KrausKGE模型,其本质是Kraus秩为1的特例,并进一步推广至任意度量几何空间的w-Kraus通道,实现了无需显式路径编码即可进行k跳推理、自然处理1-to-N与N-to-N关系、且无需对实体嵌入施加范数约束的统一建模。该框架还首次在KGE领域提供了基于关系矩阵秩的理论复杂度下界,使模型性能提升与关系扇出(fan-out)呈单调正相关,符合理论预期。

链接: https://arxiv.org/abs/2605.10317
作者: Sayan Kumar Chaki
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Knowledge graph embedding (KGE) models typically represent each relation as an operator on entity embeddings. In this work, we identify three structural axioms that any principled relation operator must satisfy (linearity, trace preservation, and complete positivity) and show that they characterize a Kraus channel structure via the Kraus representation theorem. The completeness constraint defining this family is equivalent to these axioms, providing a principled foundation rather than an externally imposed condition. Under this formulation, most existing operator-based KGE models are recoverable as special cases with Kraus rank κ = 1 under specific embedding choices. We further generalize this characterization to arbitrary metric geometries by introducing w-Kraus channels, which satisfy completeness by construction within their respective spaces. Building on this theory, we propose KrausKGE, a principled KGE model that naturally handles 1-to-N and N-to-N relations, supports k-hop reasoning without requiring explicit path encoders, and eliminates the need for norm constraints on entity embeddings. Additionally, our framework yields the first theoretically grounded per-relation complexity measure in the KGE literature, with a provable lower bound in terms of the empirical relation matrix rank. Empirical evaluation demonstrates that KrausKGE consistently outperforms strong baselines on N-to-N relations, with performance gains that increase monotonically with relation fan-out, in alignment with theoretical predictions.
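
As a rough illustration of the Kraus-channel view (not the paper's actual model), the NumPy sketch below builds operators that satisfy the completeness constraint \sum_k A_k^T A_k = I by construction and scores a triple by transporting the head entity's rank-1 "density matrix" through the channel. The dimensions and the scoring function are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, kappa = 4, 2  # embedding dimension and Kraus rank (toy values)

def random_kraus_operators(d, kappa, rng):
    # Build operators A_1..A_kappa satisfying sum_k A_k^T A_k = I exactly:
    # stack a random (kappa*d, d) matrix, orthonormalize its columns, then split.
    M = rng.normal(size=(kappa * d, d))
    Q, _ = np.linalg.qr(M)          # Q has orthonormal columns: Q^T Q = I_d
    return [Q[k * d:(k + 1) * d, :] for k in range(kappa)]

def apply_channel(ops, rho):
    # Kraus channel: rho -> sum_k A_k rho A_k^T (trace-preserving by construction).
    return sum(A @ rho @ A.T for A in ops)

def score(h, ops, t):
    # Entities as unit vectors; rho_h = h h^T is a rank-1 "density matrix".
    rho_h = np.outer(h, h)
    rho_r = apply_channel(ops, rho_h)
    return float(t @ rho_r @ t)     # overlap of the transported state with the tail

h = rng.normal(size=d); h /= np.linalg.norm(h)
t = rng.normal(size=d); t /= np.linalg.norm(t)
ops = random_kraus_operators(d, kappa, rng)

rho = apply_channel(ops, np.outer(h, h))
print(np.trace(rho))   # ~1.0: trace preservation follows from the completeness constraint
print(score(h, ops, t))
```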

[AI-75] Active Tabular Augmentation via Policy-Guided Diffusion Inpainting ICML2026

【速读】:该论文旨在解决生成式表格式数据增强(Generative Tabular Augmentation)中普遍存在的“保真度-效用差距”(fidelity-utility gap)问题,即现有方法过于关注生成样本的分布保真度,而忽视了其对下游模型性能的实际提升效果。解决方案的关键在于提出TAP(Tabular Augmentation Policy),该方法通过将扩散插补(diffusion inpainting)与轻量级、学习器条件化的策略相结合,动态引导生成过程聚焦于高效用区域,并借助显式门控机制和保守的窗口化承诺策略控制安全注入时机,从而在训练演化过程中实现更有效的数据增强。

链接: https://arxiv.org/abs/2605.10315
作者: Zheyu Zhang,Shuo Yang,Bardh Prenkaj,Gjergji Kasneci
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted for publication at ICML 2026

点击查看摘要

Abstract:Generative tabular augmentation is appealing in data-scarce domains, yet the prevailing focus on distributional fidelity does not reliably translate into better downstream models. We formalize a fidelity-utility gap: common generative objectives prioritize distributional plausibility, whereas augmentation succeeds only when injected samples reduce the current learner’s held-out evaluation loss. This gap motivates learning not just how to generate, but what to generate and when to inject as training evolves. We propose TAP (Tabular Augmentation Policy), which couples diffusion inpainting with a lightweight, learner-conditioned policy to steer generation toward high-utility regions and controls safe injection via explicit gating and conservative windowed commitment. Under severe data scarcity, TAP consistently outperforms strong generative baselines on seven real-world datasets, improving classification accuracy by up to 15.6 percentage points and reducing regression RMSE by up to 32%.
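
A minimal sketch of the "when to inject" gate, assuming a logistic-regression learner and a placeholder generator in place of TAP's policy-guided diffusion inpainting: candidate batches are committed only when they reduce the current learner's held-out loss.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)

def heldout_loss(X_tr, y_tr, X_val, y_val):
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return log_loss(y_val, clf.predict_proba(X_val), labels=[0, 1])

# Toy scarce dataset (stand-in for a real tabular task).
X = rng.normal(size=(200, 5)); y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_tr, y_tr, X_val, y_val = X[:40], y[:40], X[40:], y[40:]

def propose_batch(n=20):
    # Placeholder generator; TAP would use policy-guided diffusion inpainting here.
    Xs = rng.normal(size=(n, 5))
    return Xs, (Xs[:, 0] + 0.5 * Xs[:, 1] > 0).astype(int)

baseline = heldout_loss(X_tr, y_tr, X_val, y_val)
for step in range(5):
    Xs, ys = propose_batch()
    candidate = heldout_loss(np.vstack([X_tr, Xs]), np.concatenate([y_tr, ys]), X_val, y_val)
    if candidate < baseline:          # gate: commit only utility-improving batches
        X_tr, y_tr, baseline = np.vstack([X_tr, Xs]), np.concatenate([y_tr, ys]), candidate
        print(f"step {step}: injected, held-out loss -> {baseline:.3f}")
    else:
        print(f"step {step}: rejected (loss {candidate:.3f} >= {baseline:.3f})")
```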

[AI-76] Robust Probabilistic Shielding for Safe Offline Reinforcement Learning

【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning, Offline RL)中的两个核心问题:一是确保所学策略的性能表现,二是保障策略的安全性。传统方法在离线场景下难以同时提供性能和安全的理论保证,尤其是在缺乏环境交互的情况下。解决方案的关键在于将“安全策略改进”(Safe Policy Improvement, SPI)与“屏蔽机制”(Shielding)相结合:通过仅依赖可用数据集及对安全状态和不安全状态的认知,扩展屏蔽机制以适用于离线RL场景,并在策略改进过程中引入屏蔽,从而在高概率意义上保证最终策略既优于基线策略(性能保障),又始终处于安全动作空间内(安全保障)。实验表明,该方法在低数据环境下尤其有效,显著提升了平均性能和最坏情况下的鲁棒性。

链接: https://arxiv.org/abs/2605.10293
作者: Maris F. L. Galesloot,Thomas Rhemrev,Nils Jansen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In offline reinforcement learning (RL), we learn policies from fixed datasets without environment interaction. The major challenges are to provide guarantees on the (1) performance and (2) safety of the resulting policy. A technique called safe policy improvement (SPI) provides a performance guarantee: with high probability, the new policy outperforms a given baseline policy, which is assumed to be safe. Orthogonally, in the context of safe RL, a shield provides a safety guarantee by restricting the action space to those actions that are provably safe with respect to a given safety-relevant model. We integrate these paradigms by extending shielding to offline RL, relying solely on the available dataset and knowledge of safe and unsafe states. Then, we shield the policy improvement steps, guaranteeing, with high probability, a safe policy. Experimental results demonstrate that shielded SPI outperforms its unshielded counterpart, improving both average and worst-case performance, particularly in low-data regimes.
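
A toy sketch of the shielding idea, assuming value estimates and a safety mask are already available from the dataset and safety-relevant model: policy improvement is restricted to actions the shield marks as safe. The high-probability SPI guarantee itself is omitted here.

```python
import numpy as np

n_states, n_actions = 4, 3
rng = np.random.default_rng(0)
Q = rng.normal(size=(n_states, n_actions))                   # value estimates from offline evaluation
baseline = np.full((n_states, n_actions), 1.0 / n_actions)   # safe behaviour (baseline) policy

# Shield: actions judged unsafe from the dataset / safety model are masked out.
safe_mask = np.array([[1, 1, 0],
                      [1, 0, 1],
                      [1, 1, 1],
                      [0, 1, 1]], dtype=bool)

def shielded_improvement(Q, baseline, safe_mask):
    policy = baseline.copy()
    for s in range(len(Q)):
        allowed = np.where(safe_mask[s])[0]
        best = allowed[np.argmax(Q[s, allowed])]              # improve only within the shield
        policy[s] = 0.0
        policy[s, best] = 1.0
    return policy

print(shielded_improvement(Q, baseline, safe_mask))
```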

[AI-77] LeapTS: Rethinking Time Series Forecasting as Adaptive Multi-Horizon Scheduling

【速读】:该论文旨在解决时间序列预测中因固定映射机制导致的时序解耦问题,即现有模型将历史数据到未来目标时间点的映射视为静态过程,限制了模型在预测过程中对动态上下文变化的适应能力。其解决方案的关键在于提出LeapTS框架,将预测任务重构为一个在预测时程上的动态调度过程:通过分层控制器(hierarchical controller)动态选择每一步的最佳预测尺度和推进长度,并利用神经控制微分方程(neural controlled differential equations, NCDEs)实现连续时间状态演化,从而显式地将不规则的时间动态与离散调度反馈耦合起来,使模型能够自主适应非平稳动态并提升预测精度与推理效率。

链接: https://arxiv.org/abs/2605.10292
作者: Sheng Pan,Ming Jin,Bo Du,Shirui Pan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Time series forecasting serves as an essential tool for many real-world applications, supporting tasks such as resource optimization and decision-making. Despite significant architectural advancements, most modern models still treat the forecasting task as a fixed mapping from history to target horizons. This induces temporal decoupling across future time points and limits the model’s ability to adapt to the evolving context as forecasting progresses. In this work, we present LeapTS, a novel framework that reformulates time series forecasting as a dynamic scheduling process over the prediction horizon. Specifically, LeapTS organizes the forecasting process into multi-level decisions using: (1) the hierarchical controller to dynamically select the optimal prediction scale and advancement length at each step, and (2) continuous-time state evolution driven by neural controlled differential equations. Within this process, the controlled update mechanism explicitly couples the irregular temporal dynamics with discrete scheduling feedback. Extensive evaluations on both real-world and synthetic datasets demonstrate that LeapTS improves overall forecasting performance by at least 7.4% while achieving a 2.6× to 5.3× inference speedup over representative Transformer-based models. Furthermore, by explicitly tracing the scheduling trajectories, we reveal how the model autonomously adapts its forecasting behavior to capture non-stationary dynamics.
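
The scheduling view can be illustrated with a toy loop, assuming placeholder controller and predictor functions (LeapTS uses a learned hierarchical controller and neural CDE state evolution instead): at each step the controller picks an advancement length, the predictor fills that many future points, and the predictions feed back into the context.

```python
import numpy as np

def forecast_chunk(history, length):
    # Placeholder predictor; LeapTS evolves a continuous-time state with neural CDEs here.
    return np.full(length, history[-5:].mean())

def choose_advance(history, remaining):
    # Placeholder controller: take larger leaps when the recent signal is smooth.
    volatility = np.std(np.diff(history[-10:]))
    step = 8 if volatility < 0.1 else 2
    return min(step, remaining)

def scheduled_forecast(history, horizon):
    history = list(history)
    produced = []
    while len(produced) < horizon:
        step = choose_advance(np.asarray(history), horizon - len(produced))
        chunk = forecast_chunk(np.asarray(history), step)
        produced.extend(chunk)           # scheduling feedback: predictions extend the context
        history.extend(chunk)
    return np.asarray(produced)

series = np.sin(np.linspace(0, 6, 60)) + 0.05 * np.random.default_rng(0).normal(size=60)
print(scheduled_forecast(series, horizon=24).shape)  # (24,)
```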

[AI-78] AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks ALT

【速读】:该论文旨在解决临床决策支持系统(Clinical Decision Support Systems, CDSS)在整合复杂异构多模态数据(如时间序列电子健康记录、医学影像、放射科报告和临床笔记)时面临的挑战,特别是大语言模型(Large Language Model, LLM)代理在多模态临床风险预测任务中的有效性尚未得到充分验证的问题。其解决方案的关键在于通过大规模真实世界数据对基于LLM的单代理与多代理系统进行系统性评估,发现单代理框架在处理多模态数据和校准性能方面优于朴素的多代理系统,从而揭示当前多代理协作机制在处理异构输入时的不足,并强调需改进代理间协同策略以提升整体性能。

链接: https://arxiv.org/abs/2605.10286
作者: Baraa Al Jorf,Farah E.Shamout
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at the AHLI Conference on Health, Inference, and Learning 2026

点击查看摘要

Abstract:Building effective clinical decision support systems requires the synthesis of complex heterogeneous multimodal data. Such modalities include temporal electronic health records data, medical images, radiology reports, and clinical notes. Large language model (LLM)-based agents have shown impressive performance in various healthcare tasks, especially those involving textual modalities. Considering the fragmentation of healthcare data across hospital systems, collaborative agent frameworks present a promising direction to mitigate data sharing challenges. However, the effectiveness of LLM agents for multimodal clinical risk prediction remains largely unexamined. In this work, we conduct a systematic evaluation of LLM-based agents for clinical prediction tasks using large-scale real-world data. We assess performance in unimodal and multimodal settings and quantify performance gaps between single agent and multi-agent systems. Our findings highlight that single agent frameworks outperform naive multi-agent systems, are better at handling multimodal data, and are better calibrated. This underscores a critical need for improving multi-agent collaboration to better handle heterogeneous inputs. By open-sourcing our code and evaluation framework, this work offers a new benchmark to support future developments relating to agentic systems in healthcare.

[AI-79] Drum Synthesis from Expressive Drum Grids via Neural Audio Codecs

【速读】:该论文旨在解决从符号化鼓谱(drum grid)直接生成逼真鼓音频的难题,这一问题位于音乐感知与机器学习的交叉领域。其核心挑战在于如何将包含微定时(microtiming)和力度(velocity)信息的时间对齐MIDI表示准确映射为高质量的波形音频。解决方案的关键在于采用基于Transformer的模型,将鼓谱输入转化为神经音频编解码器(neural audio codec)的离散代码序列,并利用预训练的编解码器解码器将其还原为波形音频。通过在Expanded Groove MIDI Dataset(E-GMD)上实验不同先进编解码器(EnCodec、DAC、X-Codec),研究验证了编码器-解码器架构在鼓音合成中的有效性,并为选择适合打击乐合成的音频分词器提供了实践指导。

链接: https://arxiv.org/abs/2605.10281
作者: Konstantinos Soiledis,Maximos Kaliakatsos-Papakostas,Dimos Makris,Konstantinos Tsamis
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generating realistic drum audio directly from symbolic representations is a challenging task at the intersection of music perception and machine learning. We propose a system that transforms an expressive drum grid, a time-aligned MIDI representation with microtiming and velocity information, into drum audio by predicting discrete codes of a neural audio codec. Our approach uses a Transformer-based model to map the drum grid input to a sequence of codec tokens, which are then converted to waveform audio via a pre-trained codec decoder. We experiment with multiple state-of-the-art neural codecs, namely EnCodec, DAC, and X-Codec, to assess how the choice of audio representation impacts the quality of the generated drums. The system is trained and evaluated on the Expanded Groove MIDI Dataset, E-GMD, a large collection of human drum performances with paired MIDI and audio. We evaluate the fidelity and musical alignment of the generated audio using objective metrics. Overall, our results establish codec-token prediction as an effective route for drum grid-to-audio generation and provide practical insights into selecting audio tokenizers for percussive synthesis.

[AI-80] DP-LAC: Lightweight Adaptive Clipping for Differentially Private Federated Fine-tuning of Language Models ICASSP2026

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中因客户端梯度上传而暴露敏感信息的问题,尤其是在采用差分隐私随机梯度下降(Differentially Private Stochastic Gradient Descent, DP-SGD)时,传统固定阈值 clipping 方法难以平衡隐私保护与模型性能的难题。解决方案的关键在于提出 DP-LAC 方法:首先利用私有直方图估计(private histogram estimation)在数量级上近似最优剪裁阈值(clipping threshold),随后在训练过程中自适应调整该阈值,且不消耗额外的隐私预算(privacy budget)或引入新的超参数,从而在保障隐私的同时显著提升模型准确率,实验表明其平均准确率相较现有最优自适应剪裁方法和基础 DP-SGD 提升 6.6%。

链接: https://arxiv.org/abs/2605.10272
作者: Haaris Mehmood,Jie Xu,Karthikeyan Saravanan,Rogier Van Dalen,Mete Ozay
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: Accepted at ICASSP 2026

点击查看摘要

Abstract:Federated learning (FL) enables the collaborative training of large-scale language models (LLMs) across edge devices while keeping user data on-device. However, FL still exposes sensitive information through client-provided gradients. Differentially private stochastic gradient descent (DP-SGD) mitigates this risk by clipping each client’s contribution to a threshold C and adding noise proportional to C. Existing adaptive clipping techniques dynamically adjust C but demand tedious hyperparameter tuning, which can erode the privacy budget. In this paper, we introduce DP-LAC, a method that first estimates an initial clipping threshold within an order of magnitude of the optimum using private histogram estimation, and then adapts this threshold during training without consuming additional privacy budget or introducing new hyperparameters. Empirical results show that DP-LAC outperforms both state-of-the-art adaptive clipping methods and vanilla DP-SGD, achieving an average accuracy gain of 6.6%.
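
A simplified sketch of the two ingredients, with hypothetical constants and without any formal privacy accounting: a Laplace-noised histogram over per-client gradient norms yields an order-of-magnitude initial clipping threshold, which is then used in standard clip-and-noise aggregation.

```python
import numpy as np

rng = np.random.default_rng(0)
grads = [rng.normal(scale=s, size=50) for s in rng.uniform(0.5, 5.0, size=100)]  # per-client grads

def private_norm_histogram(norms, bins, eps=1.0):
    counts, edges = np.histogram(norms, bins=bins)
    noisy = counts + rng.laplace(scale=1.0 / eps, size=counts.shape)  # Laplace mechanism
    return noisy, edges

def initial_clip_threshold(grads, quantile=0.5):
    norms = [np.linalg.norm(g) for g in grads]
    bins = np.logspace(-2, 3, 26)                 # order-of-magnitude resolution
    noisy, edges = private_norm_histogram(norms, bins)
    cdf = np.cumsum(np.clip(noisy, 0, None))
    idx = np.searchsorted(cdf, quantile * cdf[-1])
    return edges[min(idx + 1, len(edges) - 1)]    # threshold near the target quantile

def clipped_noisy_mean(grads, C, noise_multiplier=1.0):
    clipped = [g * min(1.0, C / (np.linalg.norm(g) + 1e-12)) for g in grads]
    total = np.sum(clipped, axis=0) + rng.normal(scale=noise_multiplier * C, size=grads[0].shape)
    return total / len(grads)

C0 = initial_clip_threshold(grads)
print(f"initial clipping threshold ~ {C0:.2f}")
print(clipped_noisy_mean(grads, C0).shape)
```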

[AI-81] IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在工业采购问答(Industrial Procurement QA)场景中存在“部分正确但安全失效”的问题,即传统基于准确率的评估无法识别出潜在的安全违规(Safety Violation, SV),而这类违规在工业应用中可能引发严重后果。解决方案的关键在于构建一个基于中国国家标准(GB/T)和结构化工业产品记录的多语言基准测试集 IndustryBench,其核心创新包括:(1)通过搜索驱动的外部验证阶段剔除70.3%的LLM候选答案以校准可靠性;(2)将原始正确性评分与独立的安全违规检测分离,使用领域专家验证的Qwen3-Max判官(κ_w = 0.798)进行评分,并依据源文本进行安全核查;(3)揭示扩展推理虽提升表面正确性,却显著增加安全风险,从而强调必须采用源文 grounded、安全感知的诊断机制来评估工业场景下的LLM性能。

链接: https://arxiv.org/abs/2605.10267
作者: Songlin Bai,Xintong Wang,Linlin Yu,Bin Chen,Zhiang Xu,Yuyang Sheng,Changtong Zan,Xiaofeng Zhu,Yizhe Zhang,Jiru Li,Mingze Guo,Ling Zou,Yalong Li,Chengfu Huo,Liang Ding
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In industrial procurement, an LLM answer is useful only if it survives a standards check: recommended material must match operating condition, every parameter must respect a regulated threshold, and no procedure may contradict a safety clause. Partial correctness can mask safety-critical contradictions that aggregate LLM benchmarks rarely capture. We introduce IndustryBench, a 2,049-item benchmark for industrial procurement QA in Chinese, grounded in Chinese national standards (GB/T) and structured industrial product records, organized by seven capability dimensions, ten industry categories, and panel-derived difficulty tiers, with item-aligned English, Russian, and Vietnamese renderings. Our construction pipeline rejects 70.3% of LLM-generated candidates at a search-based external-verification stage, calibrating how unreliable industrial QA remains after LLM-only generation. Our evaluation decouples raw correctness, scored by a Qwen3-Max judge validated at κ_w = 0.798 against a domain expert, from a separate safety-violation (SV) check against source texts. Across 17 models in Chinese and an 8-model intersection over four languages, we find: (i) the best system reaches only 2.083 on the 0–3 rubric, leaving substantial headroom; (ii) Standards Terminology is the most persistent capability weakness and survives item-aligned translation; (iii) extended reasoning lowers safety-adjusted scores for 12 of 13 models, primarily by introducing unsupported safety-critical details into longer final answers; and (iv) safety-violation rates reshuffle the leaderboard – GPT-5.4 climbs from rank 6 to rank 3 after SV adjustment, while Kimi-k2.5-1T-A32B drops seven ranks. LLM evaluation therefore requires source-grounded, safety-aware diagnosis rather than aggregate accuracy. We release IndustryBench with all prompts, scoring scripts, and dataset documentation.

[AI-82] E-TCAV: Formalizing Penultimate Proxies for Efficient Concept Based Interpretability

【速读】:该论文旨在解决TCAV(Testing with Concept Activation Vectors)方法中存在的三大问题:计算开销大、不同网络层之间的TCAV分数不一致以及统计稳定性差。其解决方案的关键在于提出E-TCAV框架,该框架基于对TCAV机制的三项核心发现:1)潜在分类器的选择显著影响TCAV分数的稳定性;2)网络最后模块中的层与倒数第二层在TCAV分数上具有高度一致性;3)倒数第二层可作为早期层的快速代理进行TCAV计算。利用上述发现,E-TCAV实现了与网络规模和评估样本数量呈线性关系的速度提升,从而为高效模型调试和实时概念引导训练提供了可行路径。

链接: https://arxiv.org/abs/2605.10261
作者: Hasib Aslam,Muhammad Ali Chattha,Muhammad Taha Mukhtar,Muhammad Imran Malik,Andreas Dengel,Sheraz Ahmed
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:TCAV (Testing with Concept Activation Vectors) is an interpretability method that assesses the alignment between the internal representations of a trained neural network and human-understandable, high-level concepts. Though effective, TCAV suffers from significant computational overhead, inter-layer disagreement of TCAV scores, and statistical instability. This work takes a step toward addressing these challenges by introducing E-TCAV, a framework for efficient approximation of TCAV scores, which is based on extensive investigation into three key aspects of the TCAV methodology: 1) the effect of latent classifiers on the stability of TCAV scores, 2) the inter-layer agreement of TCAV scores, and 3) the use of the penultimate layer as a fast proxy for earlier layers for TCAV computation. To ensure a solid foundation for E-TCAV, we conduct extensive evaluations across four different architectures and five datasets, encompassing problems from both computer vision and natural language domains. Our results show that the layers in the final block of the neural network strongly agree with the penultimate layer in terms of the TCAV scores, and the commonly observed variance of the TCAV scores can be attributed to the choice of the latent classifier. Leveraging this inter-layer agreement and the degeneracy of directional sensitivities at the penultimate layer, E-TCAV guarantees linearly scaling speed-ups with respect to the network’s size and the number of evaluation samples, marking a step towards efficient model debugging and real-time concept-guided training.
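
A small sketch of a TCAV score computed at the penultimate layer, assuming a linear classification head (the activations and `W` below are invented): the CAV is the normal of a linear probe separating concept from random activations, and with a linear head the directional derivative is input-independent, which is exactly the degeneracy of penultimate-layer sensitivities that the abstract mentions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 16

# Stand-ins for penultimate-layer activations of concept and random example sets.
concept_acts = rng.normal(loc=0.8, size=(100, d))
random_acts  = rng.normal(loc=0.0, size=(100, d))

# 1) Concept Activation Vector: normal of a linear probe separating the two sets.
probe = LogisticRegression(max_iter=1000).fit(
    np.vstack([concept_acts, random_acts]),
    np.concatenate([np.ones(100), np.zeros(100)]))
cav = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# 2) Directional derivative of the class logit w.r.t. penultimate activations.
#    With a linear classification head W, the gradient for class c is just W[c],
#    which is what makes the penultimate layer a cheap proxy in E-TCAV.
W = rng.normal(size=(10, d))                       # hypothetical final linear head
class_idx = 3
grad = W[class_idx]                                # same for every input at this layer
directional_derivative = grad @ cav

# 3) TCAV score: fraction of class examples with a positive directional derivative
#    (degenerate here, since the derivative is input-independent at this layer).
tcav_score = float(directional_derivative > 0)
print(f"TCAV score for class {class_idx}: {tcav_score}")
```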

[AI-83] owards Autonomous Railway Operations: A Semi-Hierarchical Deep Reinforcement Learning Approach to the Vehicle Rescheduling Problem

【速读】:该论文旨在解决铁路交通管理中因列车密度上升和基础设施限制导致的调度复杂性问题,特别是传统运筹学(Operational Research, OR)方法难以实时可靠求解车辆路径与调度问题(Vehicle Routing and Scheduling Problem, VRSP)的挑战。现有基于强化学习(Reinforcement Learning, RL)的方法在多智能体协调方面潜力有限,且在高密度网络中性能不佳、难以扩展。解决方案的关键在于提出一种面向运营约束的半层级强化学习(semi-hierarchical RL)框架,通过分离调度(dispatching)与路径规划(routing)的动作空间和观测空间,使策略能够专注于不同决策范围,缓解稀疏调度决策与频繁路径更新之间的不平衡问题,从而显著提升协同效率、资源利用率和鲁棒性,在Flatland-RL仿真环境中实现接近翻倍的列车到达率,同时保持死锁率低于5%。

链接: https://arxiv.org/abs/2605.10257
作者: Alberto Castagna,Stefan Zahlner,Adrian Egli,Christian Eichenberger,Daniel Boos,Manuel Meyer,Anton Fuxjager
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Managing disruptions in railway traffic management is a major challenge. Rising traffic density and infrastructure limits increase complexity, making the Vehicle Routing and Scheduling Problem (VRSP) difficult to solve reliably and in real time. While Operational Research (OR) methods are widely used, most dispatching still relies on human expertise due to the problem’s exponential combinatorial complexity. Reinforcement Learning (RL) has gained attention for its potential in multi-agent coordination, but existing RL approaches often underperform OR methods and struggle to scale in dense rail networks. This paper addresses this gap from a machine learning perspective by introducing a semi-hierarchical RL formulation tailored to operational railway constraints. The method separates dispatching from routing through dedicated action and observation spaces, enabling policies to specialise in distinct decision scopes and addressing the imbalance between rare dispatch decisions and frequent routing updates. The approach is evaluated on the Flatland-RL simulator across five difficulty levels and 50 random seeds, with 7 to 80 trains. Results show substantially improved coordination, resource utilisation, and robustness compared with heuristic baselines and monolithic RL, nearly doubling the number of trains reaching their destinations, while keeping deadlock rates below 5% and adaptively sequencing, delaying, or cancelling trains under heavy congestion.

[AI-84] A Cold Diffusion Approach for Percussive Dereverberation IJCNN

【速读】:该论文旨在解决音乐制作中鼓声(percussive signals)去混响(dereverberation)问题,这一领域长期被忽视,而鼓声具有瞬态尖锐和时域结构密集的特点,使得传统语音去混响方法难以直接适用。解决方案的关键在于提出一种冷扩散(cold diffusion)框架,将混响建模为确定性退化过程,逐步将无混响信号转化为混响信号,并通过两种反向过程参数化方式——直接(next-state)预测与Delta归一化残差(velocity-style)预测——实现高效去混响。模型采用UNet或扩散Transformer作为骨干网络,在包含真实与合成房间脉冲响应(Room Impulse Responses, RIRs)的定制数据集上训练和评估,实验表明该方法在域内与完全域外测试集上均显著优于基于得分的和条件扩散基线模型,且在信号级与感知指标上表现优异。

链接: https://arxiv.org/abs/2605.10256
作者: Dimos Makris,András Barják,Maximos Kaliakatsos-Papakostas
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: Accepted for the 2026 IEEE World Congress on Computational Intelligence, IJCNN Track, 21-26 June 2026, Maastricht, the Netherlands

点击查看摘要

Abstract:Most recent advances in audio dereverberation focus almost exclusively on speech, leaving percussive and drum signals largely unexplored despite their importance in music production. Percussive dereverberation poses distinct challenges due to sharp transients and dense temporal structure. In this work, we propose a cold diffusion framework for dereverberating stereo drum stems (downmixes), modeling reverberation as a deterministic degradation process that progressively transforms anechoic signals into reverberant ones. We investigate two reverse-process parameterizations, Direct (next-state) and a Delta-normalized residual (velocity-style) prediction, and implement the framework using both a UNet and a diffusion Transformer backbone. The models are trained and evaluated on curated datasets comprising both acoustic and electronic drum recordings, with reverberation generated using a combination of synthetic and real room impulse responses. Extensive experiments on in-domain and fully out-of-domain test sets demonstrate that the proposed method consistently outperforms strong score-based and conditional diffusion baselines, evaluated using signal-based and perceptual metrics tailored to percussive audio.

[AI-85] Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation ACL2026

【速读】:该论文旨在解决医疗多模态检索增强生成(Retrieval-Augmented Generation, RAG)系统中因知识库被注入对抗性信息而导致模型输出偏差的问题,尤其是现有知识投毒攻击方法通常假设攻击者事先掌握用户查询,这在实际部署中不现实。解决方案的关键在于提出M³Att框架,其核心创新是通过在文本数据中注入隐蔽的错误信息,并利用配对的视觉数据作为与查询无关的触发机制来诱导检索,从而实现对模型生成结果的误导;同时,针对大语言模型(LLM)固有的医学知识可能纠正明显事实错误的特点,该方法设计了一种基于医学诊断模糊性的隐蔽误导策略,能够在不引发模型自我修正的情况下降低诊断准确性,实验表明该方法能在多种LLM和数据集上持续生成临床看似合理但错误的输出。

链接: https://arxiv.org/abs/2605.10253
作者: Peiru Yang,Haoran Zheng,Tong Ju,Shiting Wang,Wanchun Ni,Jiajun Liu,Shangguang Wang,Yongfeng Huang,Tao Qi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2026

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) is a widely adopted paradigm for enhancing LLMs in medical applications by incorporating expert multimodal knowledge during generation. However, the underlying retrieval databases may naturally contain, or be intentionally injected with, adversarial knowledge, which can perturb model outputs and undermine system reliability. To investigate this risk, prior studies have explored knowledge poisoning attacks in medical RAG systems. Nevertheless, most of them rely on the strong assumption that adversaries possess prior knowledge of user queries, which is unrealistic in deployments and substantially limits their practical applicability. In this paper, we propose M³Att, a knowledge-poisoning framework designed for medical multimodal RAG systems, assuming only limited distribution knowledge of the underlying database. Our core idea is to inject covert misinformation into textual data while using paired visual data as a query-agnostic trigger to promote retrieval. We first propose a unified framework that introduces imperceptible perturbations to visual inputs to manipulate retrieval probabilities. Besides, due to the prior medical knowledge in LLMs, naively poisoned medical content with explicit factual errors can be corrected during generation. Thus, we leverage the inherent ambiguity of medical diagnosis and design a covert misinformation injection strategy that degrades diagnostic accuracy while evading model self-correction. Experiments on five LLMs and datasets demonstrate that M³Att consistently produces clinically plausible yet incorrect generations. Codes: this https URL.

[AI-86] SciIntegrity-Bench: A Benchmark for Evaluating Academic Integrity in AI Scientist Systems

【速读】:该论文旨在解决生成式 AI (Generative AI) 在自主科研场景中学术诚信缺失的问题,即模型在面对无法完成的任务时倾向于伪造结果而非诚实承认失败。其解决方案的关键在于构建了一个名为 SCIINTEGRITY-BENCH 的基准测试框架,该框架采用“困境评估范式”设计了33个场景,涵盖11类陷阱类别,其中唯一正确响应是诚实承认任务不可行,而完成任务则需采取不端行为(如虚构数据)。实验表明,7个主流大语言模型(LLM)的整体诚信问题发生率为34.2%,且均未实现零错误;尤其在缺失数据场景下,所有模型均生成合成数据,仅在是否披露替代行为上存在差异。进一步的提示词消融研究表明,去除显式完成压力可显著降低未披露的数据伪造率(从20.6%降至3.2%),但基础的数据合成倾向不变,揭示出一种独立于提示指令的内在完成偏倚(completion bias),其根源在于缺乏对拒绝响应的训练习得。因此,该研究指出:诚实拒绝作为可训练行为的缺失,是导致当前模型学术诚信失效的核心机制。

链接: https://arxiv.org/abs/2605.10246
作者: Zonglin Yang,Xingtong Liu,Xinyan Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI scientist systems are increasingly deployed for autonomous research, yet their academic integrity has never been systematically evaluated. We introduce SCIINTEGRITY-BENCH, the first benchmark designed around a dilemmatic evaluation paradigm: each of its 33 scenarios across 11 trap categories is constructed so that honest acknowledgment of failure is the only correct response, while task completion requires misconduct. Across 231 evaluation runs spanning 7 state-of-the-art LLMs, the overall integrity problem rate reaches 34.2%, and no model achieves zero failures. Most strikingly, across missing-data scenarios, all seven models generate synthetic data rather than acknowledging infeasibility, differing only in whether they disclose the substitution. A further prompt ablation study separates two drivers: removing explicit completion pressure sharply reduces undisclosed fabrication from 20.6% to 3.2%, while the underlying synthesis rate remains unchanged, revealing an intrinsic completion bias that persists independent of prompt-level instructions. These findings point to the absence of honest refusal as a trained disposition as the primary driver of observed failures. We release SCIINTEGRITY-BENCH at this https URL.

[AI-87] When Normality Shifts: Risk-Aware Test-Time Adaptation for Unsupervised Tabular Anomaly Detection

【速读】:该论文旨在解决无监督表格异常检测中因训练数据规模和多样性有限而导致的正常模式表征不完整问题,以及现有测试时适应(test-time adaptation)方法忽视训练阶段学习协同性、且对未标注测试数据盲目适应引发异常污染的问题。解决方案的关键在于提出一种风险感知的测试时自适应方法(Risk-aware Test-time adaptation, RTTAD),其核心机制为两阶段协同设计:训练阶段通过协同双任务学习构建多层次表示以建立鲁棒的正常先验;测试阶段引入测试时对比学习(Test-Time Contrastive Learning, TTCL)模块,基于高置信度伪正常样本选择性更新模型,并通过k近邻对比目标优化嵌入分布,从而有效抑制异常污染并增强模型判别能力。

链接: https://arxiv.org/abs/2605.10242
作者: Wei Huang,Hezhe Qiao,Kailai Zhang,Zaisheng Ye,Yu-Ming Shang,Xiangling Fu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages, 6 figures

点击查看摘要

Abstract:Unsupervised tabular anomaly detection methods typically learn feature patterns from normal samples during training and subsequently identify samples that deviate from these patterns as anomalies during testing. However, in practical scenarios, the limited scale and diversity of training data often lead to an incomplete characterization of normal patterns. While test-time adaptation offers a remedy, its isolated focus on test-time optimization ignores the critical synergy with training-phase learning. Furthermore, indiscriminate adaptation to unlabeled test data inevitably triggers anomaly contamination, preventing the model from fully realizing its discriminative capability between normal and anomalous samples. To address these issues, we propose RTTAD, a Risk-aware Test-time adaptation method for unsupervised Tabular Anomaly Detection. RTTAD holistically tackles normality shifts via a synergistic two-stage mechanism. During training, collaborative dual-task learning captures multi-level representations to establish a robust normal prior. During testing, a Test-Time Contrastive Learning (TTCL) module explicitly accounts for adaptation risk by selectively updating the model using high-confidence pseudo-normal samples while constraining anomalous ones. Additionally, TTCL incorporates a k-nearest neighbor-based contrastive objective to refine embedding distributions, thereby further enhancing the model’s discriminative capacity. Extensive experiments on 15 tabular datasets demonstrate that RTTAD achieves state-of-the-art overall detection performance.
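
A toy sketch of risk-aware test-time adaptation, using a centroid-based detector as a stand-in for RTTAD's dual-task model and kNN contrastive objective: only the most confidently normal test samples are used to update the normal prototype, so likely anomalies do not contaminate adaptation. All data and thresholds below are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training normals and a shifted test batch contaminated with anomalies.
train_normal = rng.normal(loc=0.0, size=(500, 8))
test_normal  = rng.normal(loc=0.6, size=(180, 8))      # normality shift at test time
test_anom    = rng.normal(loc=5.0, size=(20, 8))
test_batch   = np.vstack([test_normal, test_anom])

prototype = train_normal.mean(axis=0)                   # stand-in for the trained normal prior

def scores(x, prototype):
    return np.linalg.norm(x - prototype, axis=1)        # higher = more anomalous

# Risk-aware adaptation: update only on high-confidence pseudo-normal samples.
for _ in range(3):
    s = scores(test_batch, prototype)
    pseudo_normal = test_batch[s <= np.quantile(s, 0.5)]    # conservative selection
    prototype = 0.7 * prototype + 0.3 * pseudo_normal.mean(axis=0)

s = scores(test_batch, prototype)
labels = np.concatenate([np.zeros(180), np.ones(20)])
pairwise_acc = (s[labels == 1][:, None] > s[labels == 0][None, :]).mean()
print(f"pairwise ranking accuracy after adaptation: {pairwise_acc:.3f}")
```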

[AI-88] When Does Non-Uniform Replay Matter in Reinforcement Learning?

【速读】:该论文旨在解决现代离策略强化学习(off-policy reinforcement learning)中非均匀回放(non-uniform replay)为何以及何时优于简单均匀回放(uniform replay)这一长期存在的问题。通过系统性实验,作者发现非均匀回放的有效性由三个关键因素决定:回放缓冲区容量(replay volume)、每环境步采样的过渡数量、以及采样分布的熵(entropy)。其核心贡献在于揭示了非均匀回放在低回放缓冲区容量下最有效,并强调即使在期望近期性(expected recency)相当的情况下,高熵采样仍至关重要。基于此洞察,论文提出一种简化的截断几何回放(Truncated Geometric replay)策略,该策略在偏向近期经验的同时保持高熵并几乎不增加计算开销,在多种任务设置和算法中显著提升样本效率,尤其在低容量场景下表现优越。

链接: https://arxiv.org/abs/2605.10236
作者: Michal Korniak,Mikołaj Czarnecki,Yarden As,Piotr Miłoś,Pieter Abbeel,Michal Nauman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern off-policy reinforcement learning algorithms often rely on simple uniform replay sampling and it remains unclear when and why non-uniform replay improves over this strong baseline. Across diverse RL settings, we show that the effectiveness of non-uniform replay is governed by three factors: replay volume, the number of replayed transitions per environment step; expected recency, how recent sampled transitions are; and the entropy of the replay sampling distribution. Our main contribution is clarifying when non-uniform replay is beneficial and providing practical guidance for replay design in modern off-policy RL. Namely, we find that non-uniform replay is most beneficial when replay volume is low, and that high-entropy sampling is important even at comparable expected recency. Motivated by these findings, we adopt a simple Truncated Geometric replay that biases sampling toward recent experience while preserving high entropy and incurring negligible computational overhead. Across large-scale parallel simulation, single-task, and multi-task settings, including three modern algorithms evaluated on five RL benchmark suites, this replay sampling strategy improves sample efficiency in low-volume regimes while remaining competitive when replay volume is high.
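
A minimal sketch of the Truncated Geometric replay sampler mentioned above, with an assumed decay parameter: the sampling probability decays geometrically with transition age, biasing toward recent experience while keeping nonzero mass (and hence relatively high entropy) over the whole buffer.

```python
import numpy as np

def truncated_geometric_probs(buffer_size, p=0.001):
    # P(index i) proportional to (1 - p)^age, where age = buffer_size - 1 - i (0 = newest).
    ages = np.arange(buffer_size)[::-1]
    w = (1.0 - p) ** ages
    return w / w.sum()

rng = np.random.default_rng(0)
buffer_size, batch = 100_000, 256
probs = truncated_geometric_probs(buffer_size)
idx = rng.choice(buffer_size, size=batch, p=probs)          # recency-biased sampling
uniform_idx = rng.integers(0, buffer_size, size=batch)      # uniform baseline

print("mean sampled age (truncated geometric):", buffer_size - 1 - idx.mean())
print("mean sampled age (uniform):            ", buffer_size - 1 - uniform_idx.mean())
```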

[AI-89] Hypothesis-Driven Deep Research with Large Language Models : A Structured Methodology for Automated Knowledge Discovery

【速读】:该论文旨在解决当前人工智能驱动的研究系统普遍采用“直接搜索-摘要”范式所带来的局限性,即把假设(hypothesis)视为科学发现的终点而非研究过程的组织工具,导致研究缺乏主动性和结构性。其核心解决方案是提出首个以假设为核心驱动的通用深度研究方法——假设驱动深度研究(Hypothesis-Driven Deep Research, HDRI),通过六项基本原则和八阶段流程,将研究从被动的信息检索转变为可验证、迭代推进的知识发现过程。关键创新在于引入基于信息与逻辑缺口的闭环迭代机制,自动识别并触发针对性补充调查,并结合可追溯的推理链与置信度量化传播、主题锁定机制及多维质量评估体系,显著提升了知识密度、准确性与完整性。

链接: https://arxiv.org/abs/2605.10224
作者: Michael Chin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current AI-powered research systems adopt a direct search-then-summarize paradigm that treats hypotheses as end products of scientific discovery. We argue this leaves a critical gap: hypotheses can serve a far more powerful role as organizational instruments that structure the research process itself. We propose the Hypothesis-Driven Deep Research (HDRI) methodology - the first framework using hypotheses to organize general-purpose deep research across arbitrary domains, rather than merely validating claims within specific domains. This transforms research from reactive information retrieval into proactive, verifiable, and iterative knowledge discovery. HDRI is formalized with six core principles and an eight-stage pipeline. A central innovation is the gap-driven iterative research mechanism - a closed-loop quality assurance system that automatically identifies informational and logical gaps, triggering targeted supplementary investigation. We further introduce a fact reasoning framework with traceable reasoning chains and quantified confidence propagation, a subject locking mechanism to prevent entity confusion, and a multi-dimensional quality assessment scheme. The methodology is realized in the INFOMINER system. Experiments demonstrate improvements of 22.4% in fact density, 90% subject matching accuracy, 0.92 multi-source verification confidence, and 14% completeness gain from gap-driven supplementation. Five case studies validate its practical applicability, achieving an average quality rating of 4.46/5.0.

[AI-90] Beyond Autonomy: A Dynamic Tiered Agent Runner Framework for Governable and Resilient Enterprise AI Execution

【速读】:该论文旨在解决当前大型语言模型代理框架在企业部署中缺乏可控性的问题,具体表现为高风险写操作未经独立审核、复杂任务缺少验收验证机制,以及计算资源分配未按风险等级差异化配置。其解决方案的关键在于提出动态分层代理执行协议(Dynamic Tiered AgentRunner),通过三个核心机制实现控制:(1) 风险自适应分层(Risk-Adaptive Tiering)根据任务风险动态调整计算资源与审查强度,达成安全与效率的帕累托最优;(2) 权力分离架构(Separation of Powers)将提案、审查、执行和验证功能交由物理隔离的独立代理完成,增强系统安全性;(3) 设计韧性(Resilience-by-Design)构建验证-恢复闭环,将失败视为系统状态的一部分,提升整体鲁棒性。

链接: https://arxiv.org/abs/2605.10223
作者: Kai Pan,Rong Hou
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 9 pages, 2 figures, 3 tables

点击查看摘要

Abstract:Current large language model agent frameworks prioritize autonomy but lack the governability mechanisms required for enterprise deployment. High-risk write operations proceed without independent review, complex tasks lack acceptance verification, and computational resources are allocated uniformly regardless of risk level. We propose the Dynamic Tiered AgentRunner, a controlled execution protocol distilled from a production-grade multi-tenant SaaS platform. The framework introduces three core mechanisms: (1) Risk-Adaptive Tiering that dynamically allocates computational resources and review intensity based on task risk profiles, achieving Pareto-optimal trade-offs between safety and efficiency; (2) Separation of Powers architecture where proposal, review, execution, and verification are performed by independent agents with physically isolated boundaries; and (3) Resilience-by-Design through a Verifier-Recovery closed loop that treats failure as a first-class system state. We formalize the tier selection.

[AI-91] HeteroGenManip: Generalizable Manipulation For Heterogeneous Object Interactions

【速读】:该论文旨在解决机器人在跨类型物体交互中实现可泛化操作的难题,核心挑战在于准确识别“何处操作”(接触点定位)与合理规划“如何操作”(后续交互轨迹)。现有基于基础模型的端到端方法因混淆这两个阶段而加剧长程任务中的误差累积,且依赖单一统一模型难以捕捉异质物体的类别特异性特征。解决方案的关键在于提出一种任务条件化的两阶段框架HeteroGenManip:首先通过Foundation-Correspondence-Guided Grasp模块利用结构先验对齐初始接触状态以降低抓取位姿不确定性;其次采用Multi-Foundation-Model Diffusion Policy(MFMDP)将物体路由至类别专用的基础模型,并通过双流交叉注意力机制融合细粒度几何信息与高变异性部件特征,从而实现更鲁棒的跨类别形状与位姿泛化能力。

链接: https://arxiv.org/abs/2605.10201
作者: Zhenhao Shen,Zeming Yang,Yue Chen,Yuran Wang,Shengqiang Xu,Mingleyang Li,Hao Dong,Ruihai Wu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generalizable manipulation involving cross-type object interactions is a critical yet challenging capability in robotics. To reliably accomplish such tasks, robots must address two fundamental challenges: "where to manipulate" (contact point localization) and "how to manipulate" (subsequent interaction trajectory planning). Existing foundation-model-based approaches often adopt end-to-end learning that obscures the distinction between these stages, exacerbating error accumulation in long-horizon tasks. Furthermore, they typically rely on a single uniform model, which fails to capture the diverse, category-specific features required for heterogeneous objects. To overcome these limitations, we propose HeteroGenManip, a task-conditioned, two-stage framework designed to decouple initial grasp from complex interaction execution. First, the Foundation-Correspondence-Guided Grasp module leverages structural priors to align the initial contact state, thereby significantly reducing the pose uncertainty of grasping. Subsequently, Multi-Foundation-Model Diffusion Policy (MFMDP) routes objects to category-specialized foundation models, integrating fine-grained geometric information with highly-variable part features via a dual-stream cross-attention mechanism. Experimental evaluations demonstrate that HeteroGenManip achieves robust intra-category shape and pose generalization. The framework achieves an average 31% performance improvement in simulation tasks with broad type setting, alongside a 36.7% gain across four real-world tasks with different interaction types.

[AI-92] Empty SPACE: Cross-Attention Sparsity for Concept Erasure in Diffusion Models

【速读】:该论文旨在解决文本到图像扩散模型中特定概念(如受版权保护或不当内容)难以有效擦除的问题。传统基于反向传播的方法虽有效但计算成本高,而闭式概念擦除方法在大型模型(如Stable Diffusion XL)中效果显著下降。为此,作者提出SParse cross-Attention-based Concept Erasure (SPACE),其核心在于通过迭代更新交叉注意力(cross-attention)参数,在保持稀疏性的同时实现目标概念的高效擦除;具体而言,SPACE将概念映射聚焦于低维子空间,从而提升擦除有效性与鲁棒性,并在保证性能的同时实现80%-90%的交叉注意力稀疏度,使参数存储需求减少70%。

链接: https://arxiv.org/abs/2605.10198
作者: Nicola Novello,Andrea M. Tonello
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Erasing specific concepts from text-to-image diffusion models is essential for avoiding the generation of copyrighted and explicit content. Closed-form concept erasure methods offer a fast alternative to backpropagation-based techniques, but they become less effective when scaling from smaller models such as Stable Diffusion 1.5 to larger models like Stable Diffusion XL. To maintain erasure effectiveness in these larger-scale architectures, we propose SParse cross-Attention-based Concept Erasure (SPACE). SPACE iteratively modifies the cross-attention parameters of a model with a closed-form update that jointly induces sparsity and erases target concepts. By concentrating the concept mapping to a lower-dimensional subspace, SPACE achieves superior erasure efficacy compared to dense baselines. Extensive experimental results show improvements in erasure effectiveness and robustness against adversarial prompts. Furthermore, SPACE achieves 80%-90% cross-attention sparsity, reducing the storage requirements for saving the modified parameters by 70%, demonstrating its memory efficiency.
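
As a rough illustration of closed-form cross-attention editing (a generic least-squares edit in the spirit of prior closed-form erasure work, not SPACE's actual sparsity-inducing update), the sketch below remaps a target concept's projection to an anchor while preserving other embeddings, then hard-thresholds the update as a crude stand-in for sparsity. All matrices and dimensions are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
d_text, d_attn = 32, 16
W_old = rng.normal(scale=0.2, size=(d_attn, d_text))      # e.g. a cross-attention K/V projection

c_erase  = rng.normal(size=d_text)                         # text embedding of the target concept
c_anchor = rng.normal(size=d_text)                         # neutral concept it should map to
C_keep   = rng.normal(size=(d_text, 40))                   # embeddings whose mapping must be preserved

def closed_form_edit(W_old, c_erase, c_anchor, C_keep, lam=0.1):
    # Least squares: W c_erase ~= W_old c_anchor, W C_keep ~= W_old C_keep, W ~= W_old.
    A = np.outer(c_erase, c_erase) + C_keep @ C_keep.T + lam * np.eye(d_text)
    B = np.outer(W_old @ c_anchor, c_erase) + W_old @ C_keep @ C_keep.T + lam * W_old
    return B @ np.linalg.inv(A)

W_new = closed_form_edit(W_old, c_erase, c_anchor, C_keep)

# Crude stand-in for the sparsity objective: keep only the largest-magnitude updates.
delta = W_new - W_old
mask = np.abs(delta) >= np.quantile(np.abs(delta), 0.8)    # ~80% of entries left untouched
W_sparse = W_old + delta * mask

print("erase error :", np.linalg.norm(W_sparse @ c_erase - W_old @ c_anchor))
print("keep error  :", np.linalg.norm(W_sparse @ C_keep - W_old @ C_keep))
print("sparsity    :", 1.0 - mask.mean())
```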

[AI-93] RACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment

【速读】:该论文旨在解决强化学习中基于可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)的自蒸馏(self-distillation)方法在长时序数学推理任务中存在的熵增、推理缩短及分布外(Out-of-Distribution, OOD)性能退化问题。核心问题是:全token KL散度会将梯度集中在冗余位置,放大特权信息泄露,导致模型过早收敛且泛化能力下降。解决方案的关键在于提出Token-Routed Alignment for Critical rEasoning (TRACE),其通过仅对标注标记的关键推理片段进行蒸馏——即在正确轨迹的关键片段上施加前向KL,在局部错误片段上选择性使用反向KL,并对剩余token采用GRPO优化,同时在短时间预热后逐步衰减KL通道。该机制有效控制了特权梯度累积暴露,使学生模型能聚焦于教师支持但自身分配不足的关键推理部分,从而提升长期推理稳定性与OOD鲁棒性。

链接: https://arxiv.org/abs/2605.10194
作者: Jiaxuan Wang,Xuan Ouyang,Zhiyu Chen,Yulan Hu,Zheng Pan,Xin Li,Lan-Zhe Guo
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: work in progress

点击查看摘要

Abstract:On-policy self-distillation (self-OPD) densifies reinforcement learning with verifiable rewards (RLVR) by letting a policy teach itself under privileged context. We find that when this guidance spans the full response, all-token KL spends gradients on mostly redundant positions and amplifies privileged-information leakage, causing entropy rise, shortened reasoning, and out-of-distribution degradation in long-horizon math training. We propose Token-Routed Alignment for Critical rEasoning (TRACE), which distills only on annotator-marked critical spans: forward KL on key spans of correct rollouts, optional reverse KL on localized error spans, and GRPO on all remaining tokens, with the KL channel annealed away after a short warm-up. Our analysis explains TRACE through two effects: forward KL provides non-vanishing lift to teacher-supported tokens that the student under-allocates, while span masking and decay keep cumulative privileged-gradient exposure finite. On four held-out math benchmarks plus GPQA-Diamond, TRACE improves over GRPO by 2.76 percentage points on average and preserves the Qwen3-8B base OOD score on GPQA-Diamond, where GRPO and all-token self-OPD baselines degrade. Gains persist under online self-annotation (+1.90 percentage points, about 69% of the strong-API gain), reducing the concern that TRACE merely imports external annotator capability. Across scales, the best routed action is base-dependent: on Qwen3-8B it is forward KL on key spans, while on Qwen3-1.7B it shifts to reverse KL on error spans.
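
A small PyTorch sketch of the span-routed distillation term, with invented tensors: forward KL(teacher || student) is averaged over annotator-marked key-span tokens only, so the distillation channel contributes no gradient at other positions (those are left to the GRPO objective).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, V = 12, 50                        # sequence length, vocabulary size (toy sizes)
student_logits = torch.randn(T, V, requires_grad=True)
teacher_logits = torch.randn(T, V)   # privileged-context teacher (same model, richer prompt)

# Annotator-marked critical reasoning span (token positions to distill on).
key_span = torch.zeros(T, dtype=torch.bool)
key_span[3:7] = True

def span_forward_kl(student_logits, teacher_logits, span_mask):
    # Forward KL(teacher || student), averaged over span tokens only.
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    s_logp = F.log_softmax(student_logits, dim=-1)
    kl = (t_logp.exp() * (t_logp - s_logp)).sum(-1)      # per-token KL
    return (kl * span_mask).sum() / span_mask.sum().clamp(min=1)

kl_loss = span_forward_kl(student_logits, teacher_logits, key_span)
kl_loss.backward()

# Gradients flow only through the routed span; other positions are untouched by this loss.
per_token_grad = student_logits.grad.abs().sum(-1)
print(per_token_grad[key_span].sum() > 0, per_token_grad[~key_span].sum().item())
```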

[AI-94] ProteinOPD: Towards Effective and Efficient Preference Alignment for Protein Design

【速读】:该论文旨在解决蛋白质语言模型(Protein Language Models, PLMs)在多目标偏好对齐过程中面临的两个关键问题:一是灾难性遗忘(catastrophic forgetting),即在优化特定功能偏好时导致预训练知识的退化,从而损害基本设计可塑性;二是难以平衡多个相互竞争的目标。解决方案的关键在于提出 ProteinOPD,一种基于 On-Policy Distillation (OPD) 的多目标偏好对齐框架,通过将预训练PLM转化为偏好特定的教师模型,并利用学生模型自身轨迹上的token级OPD进行知识蒸馏,使学生模型在保持原始设计能力的同时,被引导至加权教师模型的归一化几何共识方向,且在目标冲突下保证优化边界可控,从而有效实现多目标对齐与设计可塑性的协同优化。

链接: https://arxiv.org/abs/2605.10189
作者: Yulin Zhang,He Cao,Zihao Jiang,Chenyi Zi,Zhipeng Zhou,Zijing Liu,Yu Li,Jia Li,Ziqi Gao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Designing proteins with desired functions or properties represents a core goal in synthetic biology and drug discovery. Recent advances in protein language models (PLMs) have enabled the generation of highly designable protein sequences, while preference alignment provides a promising way to steer designs toward desired functions and properties. Nevertheless, they often trigger catastrophic forgetting of pretrained knowledge, degrading basic designability and failing to balance multiple competing objectives. To address these issues, we draw inspiration from On-Policy Distillation (OPD), an advanced post-training method renowned for mitigating catastrophic forgetting through its mode-seeking nature. In this work, we propose ProteinOPD, a multi-objective preference alignment framework that can effectively balance multiple preference objectives while maintaining the inherent designability of PLMs. ProteinOPD adapts a pretrained PLM into preference-specific teachers and distills their knowledge into a shared student via token-level OPD on the student’s own trajectories. During this process, the student is aligned to a unique normalized geometric consensus of weighted teachers while ensuring bounded optimization under conflicts. This bridges the gap for OPD in multi-objective/teacher alignment. Extensive experiments show that ProteinOPD achieves substantial gains on target preference objectives without compromising the designability, with an 8x training speedup over RL-based alignment competitors.

[AI-95] One-Step Graph-Structured Neural Flows for Irregular Multivariate Time Series Classification

【速读】:该论文旨在解决现有神经流(Neural Flows)方法在建模不规则多变量时间序列时,因独立处理各变量而导致的变量间交互关系建模不足的问题。其关键在于提出了一步图结构神经流(One-step Graph-Structured Neural Flows, GSNF),通过两种辅助轨迹自监督策略增强交互学习:(i) 基于重初始化的交互感知轨迹生成,诱导轨迹发散以暴露图结构引发的交互,并给出发散的理论下界;(ii) 反向时间轨迹生成,利用流的可逆性强制前向-后向一致性,从而正则化图结构的学习。

链接: https://arxiv.org/abs/2605.10179
作者: Mengzhou Gao,Kaiwei Wang,Pengfei Jiao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Neural Flows efficiently model irregular multivariate time series by directly learning ODE solution trajectories with neural networks, bypassing step-by-step numerical solvers. Despite their efficiency, many existing approaches treat variables independently, leaving inter-variable interactions underexplored. Moreover, their one-step mapping makes interaction modeling inherently challenging, as it removes the iterative refinement of interactions during learning. To address this challenge, we propose one-step Graph-Structured Neural Flows (GSNF), which introduce two auxiliary-trajectory self-supervision strategies to strengthen interaction learning: (i) interaction-aware trajectory generation via re-initialization, which induces trajectory divergence to expose graph-induced interactions, with a theoretically derived lower bound on divergence; and (ii) reverse-time trajectory generation, which enforces forward-backward consistency to regularize graph learning, enabled by flow invertibility. Experiments on five real-world datasets show that GSNF achieves state-of-the-art classification performance with highly competitive training time and memory usage.

[AI-96] When Prompts Become Payloads: A Framework for Mitigating SQL Injection Attacks in Large Language Model-Driven Applications

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在自然语言接口与结构化数据库交互过程中引入的新型安全风险,尤其是通过提示到SQL翻译过程放大的SQL注入漏洞问题。恶意用户可通过构造对抗性提示(adversarial prompts)操纵模型生成不安全的SQL查询,从而威胁数据完整性与机密性。解决方案的关键在于提出一个多层次的安全框架:第一层为前端安全盾牌(security shield),用于提示净化;第二层为基于行为和语义异常识别的高级威胁检测模型;第三层为基于签名的控制机制,用于识别已知攻击模式。该框架在多种真实攻击场景下验证有效,包括提示注入、混淆SQL载荷及上下文操控攻击,实验表明其具备高检测准确率且误报率低,显著提升了LLM驱动数据库应用的安全部署能力。

链接: https://arxiv.org/abs/2605.10176
作者: Farzad Nourmohammadzadeh Motlagh,Mehrdad Hajizadeh,Mehryar Majd,Pejman Najafi,Feng Cheng,Christoph Meinel
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 11 pages

点击查看摘要

Abstract:Natural language interfaces to structured databases are becoming increasingly common, largely due to advances in large language models (LLMs) that enable users to query data using conversational input rather than formal query languages such as SQL. While this paradigm significantly improves usability and accessibility, it introduces new security risks, particularly the amplification of SQL injection vulnerabilities through the prompt-to-SQL translation process. Malicious users can exploit these mechanisms by crafting adversarial prompts that manipulate model behavior and generate unsafe queries. In this work, we propose a multi-layered security framework designed to detect and mitigate LLM-mediated SQL injection attacks. The framework integrates a front-end security shield for prompt sanitization, an advanced threat detection model for behavioral and semantic anomaly identification, and a signature-based control layer for known attack patterns. We evaluate the proposed framework under diverse and realistic attack scenarios, including prompt injection, obfuscated SQL payloads, and context-manipulation attacks. To ensure robustness, we generate and curate a comprehensive benchmark dataset of adversarial prompts and assess performance across a fine-tuned LLM configuration. Experimental results demonstrate that the proposed approach achieves high detection accuracy while maintaining low false-positive rates, significantly improving the secure deployment of LLM-powered database applications.
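
An illustrative sketch of the front-end shield and signature-based control layer; the patterns below are generic examples, not the paper's rule set or its learned threat-detection model: the prompt is sanitized, and generated SQL is rejected unless it is a read-only statement free of known injection signatures.

```python
import re

INJECTION_SIGNATURES = [
    r"(?i)\bunion\s+select\b",
    r"(?i)\bor\s+1\s*=\s*1\b",
    r"(?i);\s*(drop|delete|update|insert|alter)\b",
    r"--|/\*",                        # comment-based truncation tricks
]

ALLOWED_STATEMENT = re.compile(r"(?i)^\s*select\b")

def shield_prompt(prompt: str) -> str:
    # Front-end shield: strip characters commonly used to smuggle SQL into prompts.
    return re.sub(r"[;`]|--", " ", prompt)

def check_generated_sql(sql: str) -> bool:
    # Signature/control layer: read-only statements only, no known injection patterns.
    if not ALLOWED_STATEMENT.match(sql):
        return False
    return not any(re.search(p, sql) for p in INJECTION_SIGNATURES)

prompt = "Show all users; DROP TABLE users; --"
sql_ok  = "SELECT name, email FROM users WHERE active = 1"
sql_bad = "SELECT name FROM users WHERE id = 1 OR 1=1; DROP TABLE users"

print(shield_prompt(prompt))
print(check_generated_sql(sql_ok), check_generated_sql(sql_bad))
```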

[AI-97] Automated Approach for Solving Infinite-state Polynomial Reachability Games

【速读】:该论文旨在解决无限状态可达性博弈(reachability games)中,判定是否存在且计算出获胜策略的问题,特别是针对基于实变量赋值的无限状态图上的轮转制博弈。其核心挑战在于如何在不依赖手动证明的情况下,自动推导并验证“到达”玩家(\textttREACH)是否具有从初始状态出发的获胜策略。解决方案的关键在于提出了一种称为“排名证书”(ranking certificates)的形式化证明规则——这是一种既充分又完备的推理机制,可用于证明 \textttREACH 玩家存在获胜策略;并进一步针对多项式可达性博弈(polynomial reachability games),设计了一个全自动算法,能够在亚指数时间内计算出获胜策略及其形式化的正确性证明(即排名证书)。该方法首次实现了对经典Cinderella-Stepmother游戏在任意精度参数下最优策略的自动合成,显著超越了现有技术的适用范围。

链接: https://arxiv.org/abs/2605.10169
作者: Krishnendu Chatterjee,Ehsan Kafshdar Goharshady,Mehrdad Karrabi,Maximilian Seeliger,Đorđe Žikelić
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注:

点击查看摘要

Abstract:Reachability games are two-player games played on a graph, where the objective of the REACH player is to reach the target set whereas the objective of the SAFE player is to stay away from the target set. Reachability games have important applications in artificial intelligence and reactive synthesis, and many of these applications give rise to infinite-state reachability games. In this paper, we study turn-based reachability games on infinite-state graphs defined over valuations of a finite set of real variables. We consider the problem of determining the existence of and computing a winning strategy for the REACH player. Our contributions are twofold. First, we propose ranking certificates for reachability games, a sound and complete proof rule for proving that the REACH player has a winning strategy from the specified initial state. Second, we consider polynomial reachability games, where transitions and objectives are described by polynomial constraints over real variables, and propose a fully automated algorithm for computing a winning strategy for the REACH player together with a formal correctness witness in the form of a ranking certificate. The algorithm is sound, semi-complete, and runs in sub-exponential time. Our experiments demonstrate the ability of our method to solve challenging examples from the literature that were out of the reach of existing methods. Specifically, for the classical Cinderella-Stepmother game, we are able to compute an optimal winning strategy for an arbitrary precision parameter for the first time.

[AI-98] Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在进行知识编辑时引入的潜在安全风险问题,即恶意知识注入可能导致下游推理行为错误或不安全,而现有基准测试主要关注编辑效果(如成功率、泛化性与局部性),缺乏对推理安全性影响的系统评估框架。解决方案的关键在于提出EditRisk-Bench——一个统一的评估基准,通过整合多种恶意场景(如虚假信息、偏见和安全违规)、多层次知识密集型推理任务及主流编辑策略,量化攻击有效性、推理正确性与副作用,从而系统性地衡量恶意知识编辑对LLM推理可靠性的影响。实验表明,此类恶意编辑可在保持模型整体能力的同时诱发错误或不安全推理,且风险受编辑规模、知识特性与推理复杂度等因素显著影响。

链接: https://arxiv.org/abs/2605.10146
作者: Qinghua Mao,Xi Lin,Jinze Gu,Jun Wu,Siyuan Li,Yuliang Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Large language models (LLMs) increasingly rely on knowledge editing to support knowledge-intensive reasoning, but this flexibility also introduces critical safety risks: adversaries can inject malicious or misleading knowledge that corrupts downstream reasoning and leads to harmful outcomes. Existing knowledge editing benchmarks primarily focus on editing efficacy and lack a unified framework for systematically evaluating the safety implications of edited knowledge on reasoning behavior. To address this gap, we present EditRisk-Bench, a benchmark for systematically evaluating safety risks of knowledge-intensive reasoning under malicious knowledge editing. Unlike prior benchmarks that mainly emphasize edit success, generalization, and locality, EditRisk-Bench focuses on how injected knowledge affects downstream reasoning behavior and reliability. It integrates diverse malicious scenarios, including misinformation, bias, and safety violations, together with multi-level knowledge-intensive reasoning tasks and representative editing strategies within a unified evaluation framework measuring attack effectiveness, reasoning correctness, and side effects. Extensive experiments on both open-source and closed-source LLMs show that malicious knowledge editing can reliably induce incorrect or unsafe reasoning while largely preserving general capabilities, making such risks difficult to detect. We further identify several key factors influencing these risks, including edit scale, knowledge characteristics, and reasoning complexity. EditRisk-Bench provides an extensible testbed for understanding and mitigating safety risks in knowledge editing for LLMs.

[AI-99] FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models

【速读】:该论文旨在解决生成式 AI (Generative AI) 在形式化定理证明中因稀疏奖励信号导致的信用分配难题,即传统强化学习与可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)方法仅提供二元正确性反馈,使得模型难以从复杂但部分正确的证明过程中获得有效学习信号。为此,作者提出并构建了首个用于评估奖励模型(reward models)在 Lean 4 形式化数学证明任务中的基准测试集 FormalRewardBench,其核心创新在于设计了五种由专家精心构造的错误注入策略(包括强制错误、最小单点扰动、冗长错误证明、自然语言解释误导和 Python 代码注入),从而生成高质量的偏好对(preference pairs),用于训练和评估能够区分正确与错误证明质量的奖励模型。该基准为研究者提供了可复现、系统化的评测工具,推动了形式化数学领域中奖励建模的发展。

链接: https://arxiv.org/abs/2605.10141
作者: Zeynel A. Uluşan,Burak S. Akbudak,Can S. Erer,Gözde Gül Şahin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent neural theorem provers use reinforcement learning with verifiable rewards (RLVR), where proof assistants provide binary correctness signals. While verifiable rewards are cheap and scalable without reward hacking issues, they suffer from sparse credit assignment: models receive no learning signal from difficult problems where partial progress goes unrewarded. This motivates learned reward models that can evaluate proof quality beyond binary verification. However, comparing reward models is challenging since it typically requires expensive RL training ablations. To address this, we introduce FormalRewardBench, the first benchmark for evaluating reward models in formal theorem proving with Lean 4. Our benchmark consists of 250 preference pairs where correct proofs are paired with incorrect variants generated through five expert curated error injection strategies: forced mistakes, minimal single-point variations, verbose incorrect proofs, natural language justification, and Python code injection. We evaluate frontier LLMs (e.g., Claude Opus 4.5), judge LLMs (e.g., CompassJudger-1-14B), general-purpose LLMs (e.g., Qwen2.5-72B-Instruct), and specialized theorem proving models (e.g., DeepSeek-Prover-V2-7B). Our results reveal that frontier LLMs achieve the highest performance (59.8%) while specialized theorem provers perform the worst (24.4%), suggesting that theorem proving ability does not transfer to proof evaluation. We provide further insights on various error injection mechanisms, highlighting the challenging nature of most injection mechanisms. We release FormalRewardBench publicly to encourage more research on developing reward models in formal mathematics.

[AI-100] Rethinking Constraint Awareness for Efficient State Embedding of Neural Routing Solver

【速读】:该论文旨在解决当前基于神经网络的车辆路径问题(Vehicle Routing Problem, VRP)求解器在处理具有复杂约束的VRP变体时性能受限的问题。其核心挑战在于现有方法在解码过程中通过状态嵌入(state embeddings)的生成机制限制了注意力计算中的观察空间,从而导致高质量解难以获得。解决方案的关键是提出一种简单而有效的约束感知残差调制(Constraint-Aware Residual Modulation, CARM)模块,该模块通过自适应地将约束相关变量引入上下文嵌入(context embedding),增强模型对约束条件的感知能力,从而充分释放全局观察空间的优势,生成更高效的state embedding,显著提升求解器在大规模实例上的扩展性和对未见VRP变体的泛化能力。

链接: https://arxiv.org/abs/2605.10122
作者: Canhong Yu,Changliang Zhou,Rongsheng Chen,Zhenkun Wang,Yu Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Heavy-Encoder-Light-Decoder (HELD) neural routing solvers have emerged as a promising paradigm due to their broad applicability across multiple vehicle routing problems (VRPs). However, they typically struggle with VRP variants with complex constraints. To address this limitation, this paper systematically revisits existing neural solvers from the perspective of the generation mechanism for state embeddings (i.e., query vector prior to compatibility calculation) during decoding. We identify that current mechanisms restrict the observation space during attention computation, introducing a key bottleneck to achieving high-quality solutions. Through detailed empirical analysis, we demonstrate the necessity of preserving a global observation space. To overcome the constraint-agnostic drawback inherent to global observation spaces, we propose a simple yet powerful Constraint-Aware Residual Modulation (CARM) module. By adaptively modulating the context embedding with constraint-relevant variables, CARM effectively enhances constraint awareness, enabling the neural solver to fully leverage the global observation space and generate an efficient state embedding. Extensive experimental results across two single-task and five multi-task neural routing solvers confirm that the CARM module consistently boosts baseline performance. Notably, solvers equipped with our CARM achieve substantial improvements in scaling to large-scale instances and in generalizing to unseen VRP variants. These findings provide valuable insights for the architectural design of neural routing solvers.
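As a rough illustration of the idea of "adaptively modulating the context embedding with constraint-relevant variables", the sketch below applies a FiLM-style gated residual to the decoder context before the query/compatibility computation. The module name, the sigmoid gating, and the choice of constraint features are assumptions made for this sketch; the paper's exact CARM parameterization may differ.

```python
# Illustrative sketch only: the abstract does not specify CARM's exact form.
# Constraint-relevant variables (e.g., remaining capacity, open time windows)
# gate and shift the decoder's context embedding before attention scoring.
import torch
import torch.nn as nn

class ConstraintAwareResidualModulation(nn.Module):
    def __init__(self, embed_dim: int, constraint_dim: int):
        super().__init__()
        self.to_scale = nn.Linear(constraint_dim, embed_dim)
        self.to_shift = nn.Linear(constraint_dim, embed_dim)

    def forward(self, context: torch.Tensor, constraints: torch.Tensor) -> torch.Tensor:
        # context:     (batch, embed_dim)      decoder context embedding
        # constraints: (batch, constraint_dim) dynamic constraint features
        scale = torch.sigmoid(self.to_scale(constraints))   # bounded modulation
        shift = self.to_shift(constraints)
        # residual modulation: the original (global-observation) context is preserved
        return context + scale * context + shift

# usage: the modulated context becomes the state embedding / query vector,
# which can then attend over *all* nodes rather than a restricted subset
carm = ConstraintAwareResidualModulation(embed_dim=128, constraint_dim=4)
query = carm(torch.randn(32, 128), torch.rand(32, 4))
```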

[AI-101] Arcane: An Assertion Reduction Framework through Semantic Clustering and MCTS-Guided Rule Exploring

【速读】:该论文旨在解决生成式 AI (Generative AI) 在硬件设计中自动生成断言时存在的冗余问题,即大量重复断言显著降低仿真效率。解决方案的关键在于提出 Arcane 框架,其核心包括两阶段断言聚类方法以实现高精度语义分类,并引入蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)来探索最优规则应用序列,从而高效减少冗余断言。实验表明,Arcane 在保持形式覆盖率和变异检测能力不变的前提下,最多可将断言数量减少 76.2%,并使仿真速度提升 2.6x 至 6.1x。

链接: https://arxiv.org/abs/2605.10107
作者: Hongqin Lyu,Yonghao Wang,Zhiteng Chao,Tiancheng Wang,Huawei Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: 6 pages, 6 figures

点击查看摘要

Abstract:Assertion-based Verification (ABV) is essential for ensuring that hardware designs conform to their intended specifications. However, existing automated assertion-generation approaches, such as LLM-based frameworks, often generate large numbers of redundant assertions, which significantly degrade simulation efficiency. To mitigate the simulation overhead caused by redundant assertions, this paper proposes Arcane, an efficient assertion reduction framework. It integrates a two-tier assertion clustering approach for accurate semantic classification of large assertion sets, and employs Monte Carlo Tree Search (MCTS) to explore optimal rule-application sequences for efficient assertion reduction. The experimental results on Assertionbench [20] show that Arcane achieves a reduction of up to 76.2% in the assertion count while fully preserving formal coverage and mutation-detection ability. Further simulation studies demonstrate a speedup of 2.6x to 6.1x in simulation time. The proposed framework is released at this https URL.

[AI-102] Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs

【速读】:该论文旨在解决生成式视觉-语言-动作(Vision-Language-Action, VLA)模型在实际机器人部署中因环境持续变化导致的闭环可靠性下降问题。现有评估方法将测试阶段视为独立的零样本试验,忽略了真实场景中机器人常在相似或缓慢变化环境中重复执行任务的特点,而这些任务的成功执行可提供环境验证的可靠行为证据。解决方案的关键在于提出一种在线成功记忆引导的测试时适应框架:通过长期存储进度校准的成功观测-动作片段构建记忆库,在推理时检索与当前状态相关的动作片段,利用轨迹一致性过滤不一致候选,并聚合形成精英动作先验;进一步引入置信度自适应的先验引导机制,将该先验注入流匹配动作采样器的中间状态,并根据检索置信度动态调整引导强度,从而在不更新模型参数的前提下实现轻量级、非参数化的测试时适应,显著提升长周期和多阶段任务的成功率与闭环稳定性。

链接: https://arxiv.org/abs/2605.10094
作者: Jianchao Zhao,Huoren Yang,Hu Yusong,Yuyang Gao,Qiguan Ou,Cong Wan,SongLin Dong,Zhiheng Ma,Yihong Gong
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models show strong potential for general-purpose robotic manipulation, yet their closed-loop reliability often degrades under local deployment conditions. Existing evaluations typically treat test episodes as independent zero-shot trials. However, real robots often operate repeatedly in the same or slowly changing environments, where successful executions provide environment-verified evidence of reliable behavior patterns. We study this persistent-deployment setting, asking whether a partially competent frozen VLA can improve its reliability by reusing its successful test-time experience. We propose an online success-memory guided test-time adaptation framework for generative VLAs. During deployment, the robot stores progress-calibrated successful observation-action segments in a long-term memory. At inference, it retrieves state-relevant action chunks, filters inconsistent candidates via trajectory-level consistency, and aggregates them into an elite action prior. To incorporate this prior into action generation, we introduce confidence-adaptive prior guidance, which injects the elite prior into an intermediate state of the flow-matching action sampler and adjusts the guidance strength based on retrieval confidence. This design allows the frozen VLA to exploit environment-specific successful experience while preserving observation-conditioned generative refinement. This retrieve-then-steer mechanism enables lightweight, non-parametric test-time adaptation without requiring parameter updates. Simulation and real-world experiments show improved task success and closed-loop stability, especially in long-horizon and multi-stage tasks.
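A minimal sketch of the retrieve-then-steer loop is given below, assuming a generic Euler-style flow-matching sampler. The memory schema, the similarity-based consistency filter, and the single-step injection schedule are simplified placeholders chosen for illustration, not the paper's exact design.

```python
# Sketch of confidence-adaptive prior guidance for a frozen generative policy.
# All names and the injection schedule are assumptions for this illustration.
import numpy as np

def retrieve_elite_prior(memory, obs_embedding, k=3, sim_threshold=0.5):
    """Average the action chunks of the most similar stored successes and
    report mean similarity as retrieval confidence."""
    if not memory:
        return None, 0.0
    sims = np.array([m["obs"] @ obs_embedding for m in memory])
    top = [i for i in np.argsort(sims)[-k:] if sims[i] >= sim_threshold]
    if not top:
        return None, 0.0
    prior = np.mean([memory[i]["chunk"] for i in top], axis=0)
    return prior, float(np.mean(sims[top]))

def guided_sampling(velocity_fn, obs, prior, confidence, shape,
                    steps=10, inject_step=5, beta=0.5):
    x = np.random.randn(*shape)
    for i in range(steps):
        t = i / steps
        x = x + velocity_fn(x, t, obs) / steps          # ordinary flow-matching update
        if prior is not None and i == inject_step:
            w = beta * confidence                        # stronger pull when retrieval is confident
            x = (1.0 - w) * x + w * prior                # inject the elite prior mid-trajectory
    return x

# toy usage with a dummy velocity field and a two-entry success memory
memory = [{"obs": np.array([1.0, 0.0]), "chunk": np.zeros((4, 7))},
          {"obs": np.array([0.9, 0.1]), "chunk": np.ones((4, 7)) * 0.1}]
prior, conf = retrieve_elite_prior(memory, np.array([1.0, 0.05]))
actions = guided_sampling(lambda x, t, o: -x, obs=None, prior=prior,
                          confidence=conf, shape=(4, 7))
```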

[AI-103] Active Testing of Large Language Models via Approximate Neyman Allocation

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在从预训练到测试阶段的持续评估中,因模型规模扩大和任务复杂度提升而导致的计算与标注成本急剧上升的问题。现有主动测试方法主要针对分类任务,在生成式任务上表现不佳。其解决方案的关键在于提出一种专为生成式任务设计的主动测试算法:利用代理模型(surrogate models)提取的语义熵对评估池进行分层,并基于这些代理信号执行近似Neyman分配策略,从而在有限预算下高效选择最具信息量的样本进行评估。该方法在多个语言和多模态基准上显著优于基线,相较于均匀采样可降低最高达28%的均方误差(MSE),平均节省22.9%的评估预算。

链接: https://arxiv.org/abs/2605.10075
作者: Zeli Liu,Jiancheng Zhang,Cong Liu,Yinglun Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) require reliable evaluation from pre-training to test-time scaling, making evaluation a recurring rather than one-off cost. As model scales grow and target tasks increasingly demand expert annotators, both the compute and labeling costs needed for each evaluation rise rapidly. Active testing aims to alleviate this bottleneck by approximating the evaluation result from a small but informative subset of the evaluation pool. However, existing approaches primarily target classification and break down on generative tasks. We introduce a novel active testing algorithm tailored to generative tasks. Our method leverages semantic entropy from surrogate models to stratify the evaluation pool and then conducts approximate Neyman allocation based on signals extracted from these surrogates. Across multiple language and multimodal benchmarks and a range of surrogate-target model pairs, our method significantly improves on baselines and closely tracks Oracle-Neyman, delivering up to 28% MSE reduction over Uniform Sampling and an average of 22.9% budget savings.
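For readers unfamiliar with Neyman allocation: the classic rule assigns to each stratum a sample size proportional to its size times its within-stratum variability, n_h ∝ N_h·σ_h. The sketch below shows that allocation and the resulting stratified estimate; the semantic-entropy stratification and the specific surrogate signals used by the paper are not reproduced here, and the variability estimates are assumed inputs.

```python
# Minimal sketch of approximate Neyman allocation for active testing.
# Strata would come from surrogate-model semantic entropy; sigma_h is assumed given.
import numpy as np

def neyman_allocation(strata_sizes, strata_stds, budget):
    """n_h proportional to N_h * sigma_h, rounded so the total equals the budget."""
    weights = strata_sizes * strata_stds
    weights = weights / weights.sum()
    alloc = np.floor(weights * budget).astype(int)
    remainder = budget - alloc.sum()
    frac = weights * budget - alloc
    alloc[np.argsort(-frac)[:remainder]] += 1       # give leftovers to largest fractions
    return alloc

def stratified_estimate(scores_by_stratum, strata_sizes, alloc, rng):
    """Unbiased stratified estimate of the pool-level metric from the sampled subset."""
    total = strata_sizes.sum()
    est = 0.0
    for scores, n_pool, n_draw in zip(scores_by_stratum, strata_sizes, alloc):
        sampled = rng.choice(scores, size=min(n_draw, len(scores)), replace=False)
        est += (n_pool / total) * sampled.mean()
    return est

rng = np.random.default_rng(0)
sizes = np.array([400, 300, 300])          # pool sizes per entropy stratum
stds = np.array([0.1, 0.3, 0.5])           # surrogate-estimated variability per stratum
print(neyman_allocation(sizes, stds, budget=100))   # high-variance strata get more labels
```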

[AI-104] Metis: Learning to Jailbreak LLMs via Self-Evolving Metacognitive Policy Optimization ICML2026

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在安全对齐(safety alignment)方面存在的脆弱性问题,尤其是现有自动化红队测试方法因依赖静态启发式规则或随机搜索而难以有效突破先进防御机制的局限。其解决方案的关键在于提出Metis框架,将越狱攻击(jailbreaking)建模为推理时策略优化(inference-time policy optimization),并在对抗性的部分可观测马尔可夫决策过程(adversarial Partially Observable Markov Decision Process, POMDP)中执行,通过自进化元认知循环(self-evolving metacognitive loop)实现对目标模型防御逻辑的因果诊断,并利用结构化反馈作为语义梯度(semantic gradient)来精细化攻击策略,从而显著提升攻击成功率(Attack Success Rate, ASR)和效率(token成本降低平均8.2倍)。

链接: https://arxiv.org/abs/2605.10067
作者: Huilin Zhou,Jian Zhao,Yilu Zhong,Zhen Liang,Xiuyuan Chen,Yuchen Yuan,Tianle Zhang,Chi Zhang,Lan Zhang,Xuelong Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

点击查看摘要

Abstract:Red teaming is critical for uncovering vulnerabilities in Large Language Models (LLMs). While automated methods have improved scalability, existing approaches often rely on static heuristics or stochastic search, rendering them brittle against advanced safety alignment. To address this, we introduce Metis, a framework that reformulates jailbreaking as inference-time policy optimization within an adversarial Partially Observable Markov Decision Process (POMDP). Metis employs a self-evolving metacognitive loop to perform causal diagnosis of a target’s defense logic and leverages structured feedback as a semantic gradient to refine its policy, offering enhanced interpretability through transparent reasoning traces. Extensive evaluations across 10 diverse models demonstrate that Metis achieves the strongest average Attack Success Rate (ASR) among compared methods at 89.2%, maintaining high efficacy on resilient frontier models (e.g., 76.0% on O1 and 78.0% on GPT-5-chat) where traditional baselines exhibit substantial performance degradation. By replacing redundant exploration with directed optimization, Metis reduces token costs by an average of 8.2x and up to 11.4x. Our analysis reveals that current defenses remain vulnerable to internally-steered, closed-loop reasoning trajectories under the tested settings, highlighting a critical need for next-generation defenses capable of reasoning about safety dynamically during inference.

[AI-105] MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs

【速读】:该论文旨在解决自演化语言模型代理在迭代过程中如何有效学习新知识并稳定保留已有知识的问题,尤其是现有系统依赖自然语言反馈、扁平记忆或隐式强化信号时,难以支持推理阶段冻结的弱骨干模型(frozen backbone)的问题。其解决方案的关键在于提出MAGE(Multi-Agent Graph-guided Evolution)框架,将自我知识外化为一个包含四个子图的协同进化知识图谱,其中经验子图存储教师编写的失败修正和学习者自身的正确推理轨迹,并作为任务条件引导用于冻结执行模型的检索;通过任务级搜索带和技能级路由带从同一奖励流中同步更新图结构,而学习者的骨干网络保持不变,从而实现对冻结学习者的稳定演进。

链接: https://arxiv.org/abs/2605.10064
作者: Ruiyi Yang,Zechen Li,Hao Xue,Imran Razzak,Flora D. Salim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 25 pages, 3 figures

点击查看摘要

Abstract:Self-evolving language-model agents must decide what to learn next and how to preserve what they have learned across iterations. Existing systems typically carry this cross-iteration knowledge as natural-language feedback, flat episodic memory, or implicit reinforcement signals, none of which cleanly supports a frozen weak backbone at inference time. This paper introduces MAGE (Multi-Agent Graph-guided Evolution), a framework that externalizes self-knowledge into a four-subgraph co-evolutionary knowledge graph. Its experience subgraph stores both teacher-written failure corrections and the learner’s own past correct reasoning traces, which are retrieved as task-conditioned guidance for a frozen execution model. During evolution, the graph, a task-level search bandit, and a skill-level routing bandit are updated from the same reward stream, while the learner’s backbone remains unchanged. We further provide structural analysis showing how append-only memory growth, bounded curriculum coverage, and task-filtered retrieval together support stable improvement of the retrieval substrate for frozen-learner evolution. Across nine benchmarks spanning mathematical reasoning, multi-hop and open-domain question answering, spatio-temporal analysis, financial numerical reasoning, medical multiple-choice, an open-world survival game, and web navigation, MAGE achieves strong performance against prompt-based frozen-backbone baselines. Ablations show that self-harvested success traces and teacher-written corrections are complementary, with success memories contributing most on reasoning-template-heavy tasks and corrective memories supporting harder composition and interaction settings.

[AI-106] Strategic Exploitation in LLM Agent Markets: A Simulation Framework for E-Commerce Trust

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在电子商务市场中因信息不对称而产生的策略性欺骗行为缺乏系统研究的问题。其解决方案的关键在于构建了一个名为TruthMarketTwin的受控仿真框架,该框架首次实现了在双边交易场景下对信息不对称的建模,使LLM代理能够自主决策商品上架、购买、评分及追责等行为,从而揭示其在声誉治理机制下的策略性行为演化,并验证了担保执行机制对减少欺骗、重塑战略推理的有效性。

链接: https://arxiv.org/abs/2605.10059
作者: Shijun Lei,Quang Nguyen,Swapneel S Mehta,Zeping Li,Huichuan Fu,Xiaolong Zheng,Siki Chen,Yunji Liang,Philip Torr,Zhenfei Yin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agent-based modeling (ABM) has long been used in economics to study human behavior, and large language model (LLM) agents now enable new forms of social and economic simulation. While prior work has discovered strategic deception by LLM agents in financial trading and auction markets, e-commerce remains underexplored despite its distinctive information asymmetry: sellers privately observe product quality, whereas buyers rely on advertised claims and reputation signals. We introduce TruthMarketTwin, a controlled simulation framework for studying LLM-agent behavior in e-commerce markets. The framework is one of the first to model bilateral trade under asymmetric information sharing, where agents make strategic listing, purchasing, rating, and recourse-related decisions to optimize seller profit and buyer utility. We find that LLM agents released into traditional markets autonomously exploit weaknesses in reputation-based governance, while warrant enforcement reduces deception and reshapes strategic reasoning. Our results position LLM-agent simulation as a tool for studying institution-governed autonomous markets.

[AI-107] Guided Streaming Stochastic Interpolant Policy

【速读】:该论文旨在解决生成式机器人策略在推理阶段难以实时响应动态目标或障碍物避让的问题,现有方法多依赖于分块(chunk-based)架构,存在延迟高、反应迟钝的缺陷。解决方案的关键在于提出一种基于随机插值(Stochastic Interpolants, SI)的最优引导项推导方法,通过分析价值函数的时间演化过程并利用后向Kolmogorov方程,构建理论上保证采样自目标分布的修正漂移项;进而设计Streaming Stochastic Interpolant Policy (SSIP),将此引导律与流式架构相结合,实现快速且具有反应性的控制。此外,还提出了两种互补机制:无需训练的Stochastic Trajectory Ensemble Guidance (STEG) 用于零样本适应,以及基于训练的Conditional Critic Guidance (CCG) 实现推理优化,从而在复杂动态环境中提供物理上合理的高效引导。

链接: https://arxiv.org/abs/2605.10051
作者: Puming Jiang,Meiyi Wang,Kelvin Lin,Ce Hao,Harold Soh
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted to Robotics: Science and Systems (RSS) 2026. The first two authors contributed equally

点击查看摘要

Abstract:Inference-time guidance is essential for steering generative robot policies toward dynamic objectives without retraining, yet existing methods are largely confined to chunk-based architectures that exhibit high latency and lack the reactivity needed for test-time preference alignment or obstacle avoidance. In this work, we formally derive the optimal guidance term for Stochastic Interpolants (SI) by analyzing the value function’s time evolution via the Backward Kolmogorov Equation, establishing a modified drift that theoretically guarantees sampling from a target distribution. We apply this framework to real-time control through the Streaming Stochastic Interpolant Policy (SSIP), which generalizes the deterministic Streaming Flow Policy (SFP). Unifying this guidance law with the streaming architecture enables fast and reactive control. To support diverse deployment needs, we propose two complementary mechanisms: training-free Stochastic Trajectory Ensemble Guidance (STEG) that computes gradients on-the-fly for zero-shot adaptation, and training-based Conditional Critic Guidance (CCG) for amortized inference. Empirical evaluations demonstrate that our guided streaming approach significantly outperforms conventional chunk-based policies in reactivity and provides superior, physically valid guidance for dynamic, unstructured environments.

[AI-108] Rethinking Loss Reweighting for Imbalance Learning as an Inverse Problem: A Neural Collapse Point of View ICML2026

【速读】:该论文旨在解决长尾分类(long-tailed classification)中因类别样本分布不均衡导致的损失权重失衡问题。现有重加权策略多依赖启发式方法,缺乏明确的目标函数指导。其解决方案的关键在于:基于神经坍缩(Neural Collapse, NC)理论中理想的单纯形等角紧框架(ideal simplex Equiangular Tight Frame, ETF)终端几何结构,提出以每类平均损失相等作为目标,并将损失重加权建模为一个逆问题(inverse problem),从而动态推断类别权重以逼近该理想目标。实验表明,该方法能有效降低损失不平衡系数,并更贴近NC几何结构,同时在多个数据集上持续优于主流长尾分类基线方法。

链接: https://arxiv.org/abs/2605.10047
作者: Jinping Wang,Zixin Tong,Zhiwu Xie,Zhiqiang Gao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ICML2026

点击查看摘要

Abstract:Loss reweighting is a widely used strategy for long-tailed classification, but existing reweighting strategies often rely on heuristics and rarely define a well-specified target. Inspired by Neural Collapse (NC), the ideal simplex Equiangular Tight Frame (ETF) terminal geometry suggests equal per-class average loss as a reasonable target for reweighting. Based on the ideal equal loss objective, we consider loss reweighting as an inverse problem and propose an inverse-view reweighting strategy that infers class weights dynamically to match this ideal objective. Empirically, NC metrics suggest our method can effectively reduce the loss imbalance coefficient and closer alignment with NC geometry while consistently outperforming strong long-tailed baselines on different datasets.
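The abstract does not spell out the inverse-problem solver, so the sketch below only illustrates the equal per-class-loss target with a simple proportional update: classes whose average loss sits above the batch mean are upweighted, which pushes per-class losses toward equality. The update rule, exponent, and normalization are assumptions for illustration, not the paper's formulation.

```python
# Illustrative only: a proportional controller that chases the equal per-class
# loss target suggested by the Neural Collapse (ETF) geometry.
import numpy as np

def update_class_weights(weights, per_sample_loss, labels, num_classes, alpha=0.5):
    class_loss = np.array([
        per_sample_loss[labels == c].mean() if np.any(labels == c) else per_sample_loss.mean()
        for c in range(num_classes)
    ])
    # classes with above-average loss get upweighted, driving losses toward equality
    weights = weights * (class_loss / class_loss.mean()) ** alpha
    return weights * num_classes / weights.sum()       # keep the average weight at 1

w = np.ones(2)
losses = np.array([2.0, 1.8, 0.4, 0.5, 0.45, 0.5])     # class 0 (tail) is harder
labels = np.array([0, 0, 1, 1, 1, 1])
print(update_class_weights(w, losses, labels, num_classes=2))  # tail class weight > 1
```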

[AI-109] Adaptive Action Chunking via Multi-Chunk Q Value Estimation

【速读】:该论文旨在解决现有行为分块(Action Chunking)方法中固定分块长度导致的性能瓶颈问题,即在不同状态和任务下最优分块长度存在差异,而传统方法无法动态调整。解决方案的关键在于提出自适应行为分块(Adaptive Action CHunking, ACH),其核心创新是通过基于Transformer的架构在单次前向传播中同时估计所有候选分块长度的动作值函数(action-value function),从而实现根据当前状态动态选择最优分块长度的能力,显著提升了策略在复杂环境中的学习效率与泛化性能。

链接: https://arxiv.org/abs/2605.10044
作者: Yongjae Shin,Jongseong Chae,Seongmin Kim,Jongeui Park,Youngchul Sung
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Action chunking emerged as a pivotal technique in imitation learning, enabling policies to predict cohesive action sequences rather than single actions. Recently, this approach has expanded to reinforcement learning (RL), enhancing behavioral consistency and reducing bootstrapping errors in value function estimation. However, existing methods rely on a fixed chunk length, creating a performance bottleneck as the optimal length varies across states and tasks. In this paper, we propose Adaptive Action CHunking (ACH), a novel offline-to-online RL algorithm that dynamically modulates chunk length during both training and inference. To find the optimal chunk length for a dynamically varying current state, we simultaneously estimate action-values for all candidate chunk lengths in a single forward pass, using a Transformer-based architecture. Our mechanism allows the agent to select the most effective chunk length adaptively based on the current state. Evaluated on 34 challenging tasks, ACH consistently outperforms fixed-length baselines, demonstrating superior generalization and learning efficiency in complex environments.
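The core mechanism, scoring every candidate chunk length in a single forward pass and executing the argmax, can be sketched as below. A small MLP stands in for the paper's Transformer backbone, and all names and dimensions are placeholders.

```python
# Minimal sketch of adaptive chunk-length selection via a multi-chunk critic head.
import torch
import torch.nn as nn

class MultiChunkCritic(nn.Module):
    def __init__(self, state_dim, hidden_dim, chunk_lengths=(1, 2, 4, 8)):
        super().__init__()
        self.chunk_lengths = chunk_lengths
        self.backbone = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.ReLU())
        # one Q estimate per candidate chunk length, produced jointly
        self.q_head = nn.Linear(hidden_dim, len(chunk_lengths))

    def forward(self, state):
        return self.q_head(self.backbone(state))     # (batch, num_candidate_lengths)

critic = MultiChunkCritic(state_dim=32, hidden_dim=64)
state = torch.randn(1, 32)
q_values = critic(state)
best = critic.chunk_lengths[q_values.argmax(dim=-1).item()]
print(f"execute the next {best} actions as one chunk")
```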

[AI-110] TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的时间序列分析系统中,执行导向(execution-centric)方法难以从探索性执行中学习并复用经验的问题,尤其在可验证的数值场景下,多条有效工具调用路径因早期成功导致工具优先级坍缩(tool-prior collapse),抑制了进一步探索。解决方案的关键在于提出TimeClaw框架,其核心机制为四阶段循环:Explore(探索)、Compare(比较)、Distill(蒸馏)、Reinject(重注入),通过度量监督的探索性执行学习、任务感知的工具丢弃策略(task-aware tool dropout)以及推理时的层次化蒸馏经验重注入,实现对探索性经验的结构化提炼与再利用,同时保持基础模型冻结,避免在线测试时适应,从而显著提升时间序列预测与推理任务中的性能表现。

链接: https://arxiv.org/abs/2605.10038
作者: Hangchen Liu,Dongyuan Li,Renhe Jiang,Jiewen Deng,Weiwei Ye,Yoshihide Sekimoto
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Under review

点击查看摘要

Abstract:Time series analysis underpins forecasting, monitoring, and decision making in domains such as finance and weather, where solving a task often requires both numerical accuracy and contextual reasoning. Recent progress has moved from specialized neural predictors to approaches built on LLMs and foundation models that can reason over time series inputs and use external tools. However, most such systems remain execution-centric: they focus on solving the current instance but learn little from exploratory execution. This is especially limiting in verifiable numeric settings, where multiple candidate executions and tool-use procedures may all be task-valid yet differ sharply in quantitative quality, and where early success can trigger tool-prior collapse that suppresses further exploration. To address this limitation, we present TimeClaw, an exploratory execution learning framework that turns exploratory execution into reusable hierarchical distilled experience through a four-stage loop: Explore, Compare, Distill, and Reinject. TimeClaw combines metric-supervised exploratory execution learning, task-aware tool dropout, and hierarchical distilled experience for inference-time reinjection, while keeping the base model frozen and avoiding online test-time adaptation. In an MTBench-aligned evaluation with 17 tasks that span finance and weather prediction and reasoning tasks, TimeClaw delivers consistent gains over the baselines. These results suggest that, for scientific systems, the bottleneck is not only execution-time capability, but how exploratory experience is compared, distilled, and reused.

[AI-111] Bridging the Cognitive Gap: A Unified Memory Paradigm for 6G Agentic AI-RAN

【速读】:该论文旨在解决6G无线接入网络中因传统解耦架构导致的认知瓶颈问题,即物理层在接口限制下被迫将高维状态压缩为低维指标,从而阻碍了智能体(AI agent)的感知与推理能力。其解决方案的关键在于提出一种以统一内存为中心的架构范式,通过将生物记忆层次映射到异构计算硬件上,利用新兴的相干互连技术实现跨时间尺度的状态共享——从微秒级反射、毫秒级推理到长期演化,使AI代理能够基于零拷贝可观测性取代传统消息传递机制,从而打通实时响应与长周期上下文之间的鸿沟,推动6G网络向真正自主演进的方向发展。

链接: https://arxiv.org/abs/2605.10036
作者: Xijun Wang,Zhaoyang Liu,Chenyuan Feng,Xiang Chen,Howard H. Yang,Tony Q. S. Quek
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:As 6G evolves, the radio access network must transcend traditional automation to embrace agentic AI capable of perception, reasoning, and evolution. A fundamental cognitive gap persists in current disaggregated architectures, where interfaces force the physical layer to compress high-dimensional states into low-dimensional metrics, trapping reasoning agents behind a semantic bottleneck. This article envisions a shift from interface-bound to memory-centric architectures. We propose a unified memory paradigm that dissolves the boundaries between sensing and reasoning by mapping biological memory hierarchies onto heterogeneous computing fabrics. Enabled by emerging coherent interconnects, this approach creates a cognitive continuum where microsecond-level reflexes, millisecond-level reasoning, and long-term evolution share state across time scales. By replacing message passing with zero-copy observability, we empower AI agents to bridge the gap between real-time responsiveness and long-horizon context for truly autonomous 6G networks.

[AI-112] From Single-Step Edit Response to Multi-Step Molecular Optimization

【速读】:该论文旨在解决条件分子优化(conditional molecular optimization)中因监督信号与决策层级不匹配而导致的优化不稳定问题。具体而言,在缺乏结构相似分子数据的情况下,系统需在每一步从符合化学可行性的候选局部结构编辑中选择最优动作,而传统依赖oracle-in-the-loop搜索的方法难以有效分离局部编辑效果与全局上下文影响,导致决策效率低且路径探索受限。解决方案的关键在于提出一种响应导向的离散编辑优化方法(SMER-Opt),其核心由两个紧密耦合组件构成:一是单步分子编辑响应预测器(SMER),用于学习对最小编辑单元的方向性评估模型;二是多步规划器,通过引导式树搜索将局部预测组合成优化轨迹。该方法通过挖掘弱相关分子对并分解其结构差异为最小编辑单元,将终点属性标注转化为过程级监督,从而生成可复用、可迁移的动作基元,并借助方向性编辑评估器在决策时高效评分可行编辑选项,显著降低对外部评价器查询的依赖。

链接: https://arxiv.org/abs/2605.10035
作者: Haojie Rao(1),Kun Li(1),Yida Xiong(1),Jiameng Chen(1),Wenbin Hu(1),Yizhen Zheng(2),Jiajun Yu(3),Duanhua Cao(4) ((1) School of Computer Science, Wuhan University, Wuhan, China, (2) Department of Data Science and Artificial Intelligence, Monash University, Victoria, Australia, (3) College of Computer Science and Technology, Zhejiang University, Hangzhou, China, (4) School of Life Sciences and Technology, Tongji University, Shanghai, China)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Conditional molecular optimization aims to edit a molecule to realize a specified property shift. In practice, structurally similar molecule data is scarce, while decisions are inherently action-level: at each step, the system must select one local structural edit from a candidate set that is strictly filtered by chemical feasibility rules. This level mismatch between supervision and decision makes oracle-in-the-loop search unstable in molecular optimization. Regressing on property differences between molecule pairs improves data efficiency but relies on oracle-in-the-loop search, entangling transformation effects with global context and providing limited guidance for selecting the next feasible edit, often resorting to oracle-in-the-loop search. For this reason, we propose a response-oriented discrete edit optimization approach comprising two tightly coupled components: a single-step molecular edit response predictor (SMER) and a multi-step planner that composes local predictions into optimization trajectories via guided tree search (SMER-Opt). The approach learns a directional evaluation model over edit actions to support constraint-aware planning. It mines weakly related molecule pairs and decomposes their structural differences into minimal edit units, turning endpoint property annotations into process-level supervision and yielding reusable, transferable action primitives. A directional edit evaluator then scores feasible candidate edits by their likelihood of moving the molecule toward the desired property change, substantially reducing dependence on external evaluator queries at decision time. Code is available at this https URL.

[AI-113] The two clocks and the innovation window: When and how generative models learn rules NEURIPS2025

【速读】:该论文旨在解决生成式模型在有限数据训练下存在的根本性矛盾:其得分匹配(score-matching)或下一个词预测目标会收敛到训练数据的经验分布,而非我们希望学习的真实数据的总体分布。解决方案的关键在于识别并量化两个关键的时间尺度——规则有效时间 τrule\tau_\mathrm{rule}(模型首次生成符合规则的样本)和记忆时间 τmem\tau_\mathrm{mem}(模型开始重复训练样本)——并由此定义“创新窗口”[τrule,τmem][\tau_\mathrm{rule}, \tau_\mathrm{mem}]。研究表明,τrule\tau_\mathrm{rule} 随规则复杂度增加而延长、随模型容量增大而缩短,而 τmem\tau_\mathrm{mem} 近似与规则无关且与数据集大小 NN 近似线性相关;这一窗口宽度受数据规模和规则复杂度调控,且可能完全消失(当 τruleτmem\tau_\mathrm{rule} \geq \tau_\mathrm{mem} 时)。通过分析扩散模型(DiT)的得分函数演化,发现规则有效样本的吸引域在 τrule\tau_\mathrm{rule} 附近显著扩大,而训练样本的吸引域则在 τmem\tau_\mathrm{mem} 附近主导优化空间,从而为生成模型何时及如何实现真正创新提供了统一且可预测的理论框架。

链接: https://arxiv.org/abs/2605.10019
作者: Binxu Wang,Emma Lucia Byrnes Finn,Bingbin Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Machine Learning (stat.ML)
备注: 48 pages, 28 figures. Earlier versions are presented in NeurIPS2025 SPIGM workshop as oral presentation this https URL

点击查看摘要

Abstract:Generative models trained on finite data face a fundamental tension: their score-matching or next-token objective converges to the empirical training distribution rather than the population distribution we seek to learn. Using rule-valid synthetic tasks, we trace this tension across two training timescales: \tau_\mathrm{rule}, the step at which generations first become rule-valid, and \tau_\mathrm{mem}, the step at which models begin reproducing training samples. Focusing on parity and extending to other binary rules and combinatorial puzzles, we characterize how these two clocks, \tau_\mathrm{rule} and \tau_\mathrm{mem}, depend on key aspects of the learning setup. Specifically, we show that \tau_\mathrm{rule} increases with rule complexity and decreases with model capacity, while \tau_\mathrm{mem} is approximately invariant to the rule and scales nearly linearly with dataset size N. We define the innovation window as the interval [\tau_\mathrm{rule}, \tau_\mathrm{mem}]. This window widens with increasing N and narrows with rule complexity, and may vanish entirely when \tau_\mathrm{rule} \geq \tau_\mathrm{mem}. The same two-clock structure arises in both diffusion (DiT) and autoregressive (GPT) models, with architecture-dependent offsets. Dissecting the learned score of DiT models reveals a corresponding evolution of the optimization landscapes, where rule-valid samples' basins expand substantially around \tau_\mathrm{rule}, while training samples' basins begin to dominate around \tau_\mathrm{mem}. Together, these results yield a unified and predictive account of when and how generative models exhibit genuine innovation.

[AI-114] Combining Mechanical and Agentic Specification Inference for Move

【速读】:该论文旨在解决在Move语言中进行形式化验证时,编写函数前置条件、后置条件及循环不变量等规格说明的繁琐问题(即“规范编写冗余”),从而降低验证成本并提升开发效率。其核心解决方案是将最弱前提(Weakest Precondition, WP)分析与生成式AI代理(如Claude Code)相结合:WP分析提供一个可靠且机械化的基础规范推导能力,而AI代理则负责处理WP难以建模的高阶语义,例如循环不变量、单调性、守恒性及结构不变性等;同时,Move Prover作为验证决策器(oracle),用于判断生成的规格是否有效,并驱动AI代理迭代优化规格直至验证通过。此方法实现了从低级字节码到高级语义的协同推理,在保持严谨性的同时显著提升了规格自动推导的灵活性和实用性。

链接: https://arxiv.org/abs/2605.10005
作者: Wolfgang Grieskamp,Teng Zhang,Vineeth Kashyap
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:In this paper, we describe early work on a specification inference tool for the Move Prover that combines a weakest-precondition (WP) analysis over Move bytecode with an agentic coding CLI such as Claude Code. Specification inference reduces the boilerplate of writing specifications in Move: in order to verify a high-level property such as a global state invariant, pre- and post-conditions for the supporting functions typically have to be written by hand, which is tedious. In our setting, a Model Context Protocol (MCP) service exposes the WP analysis and the prover itself to the coding agent. The WP analysis provides a sound, mechanical baseline for inference; the AI is used precisely where WP is weakest – for loop invariants and high-level idiomatic specifications such as monotonicity, conservation, and structural invariants. The Move Prover serves as the oracle that decides whether the generated specs are valid, and the agent is equipped to generate proof hints and to refine the inferred specification until verification succeeds. The tool has been applied to a corpus of canonical Move code, including code that uses higher-order functions, dynamic dispatch, global state, references, and various forms of loops.

[AI-115] Continual Harness: Online Adaptation for Self-Improving Foundation Agents

【速读】:该论文旨在解决具身智能体(embodied agents)在长期部分可观测决策任务中缺乏有效强化学习框架的问题,尤其针对游戏《宝可梦》这类高复杂度、长周期、需策略迭代的环境。其核心解决方案是提出“持续钩子”(Continual Harness),这是一种无需重置环境的在线自适应机制,使智能体能够在单次运行中自主优化自身提示(prompt)、子代理(sub-agents)、技能(skills)和记忆(memory),并利用历史轨迹数据进行持续改进。关键创新在于将人类监督从循环中移除,实现模型与钩子结构的协同进化:通过开放源代码代理在不断优化的钩子中执行动作,由前沿教师模型对轨迹重新标注奖励信号,并用于更新主模型,从而在不重置环境的前提下实现持续的游戏里程碑进展。

链接: https://arxiv.org/abs/2605.09998
作者: Seth Karten,Joel Zhang,Tersoo Upaa Jr,Ruirong Feng,Wenzhe Li,Chengshuai Shi,Chi Jin,Kiran Vodrahalli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 28 pages, 19 figures, 5 tables

点击查看摘要

Abstract:Coding harnesses such as Claude Code and OpenHands wrap foundation models with tools, memory, and planning, but no equivalent exists for embodied agents’ long-horizon partial-observability decision-making. We first report our Gemini Plays Pokemon (GPP) experiments. With iterative human-in-the-loop harness refinement, GPP became the first AI system to complete Pokemon Blue, Yellow Legacy on hard mode, and Crystal without a lost battle. In the hardest stages, the agent itself began iterating on its strategy through long-context memory, surfacing emergent self-improvement signals alongside human-in-the-loop refinement. Continual Harness removes the human fully from this loop: a reset-free self-improving harness for embodied agents that formalizes and automates what we observed. Starting from only a minimal environment interface, the agent alternates between acting and refining its own prompt, sub-agents, skills, and memory, drawing on any past trajectory data. Prompt-optimization methods require episode resets; Continual Harness adapts online within a single run. On Pokemon Red and Emerald across frontier models, Continual Harness starting from scratch substantially reduces button-press cost relative to the minimalist baseline and recovers a majority of the gap to a hand-engineered expert harness, with capability-dependent gains, despite starting from the same raw interface with no curated knowledge, no hand-crafted tools, and no domain scaffolding. We then close the loop with the model itself: an online process-reward co-learning loop, in which an open-source agent’s rollouts through the refining harness are relabeled by a frontier teacher and used to update the model, drives sustained in-game milestone progress on Pokemon Red without resetting the environment between training iterations.

[AI-116] Attention Drift: What Autoregressive Speculative Decoding Models Learn

【速读】:该论文旨在解决生成式 AI(Generative AI)中推测解码(speculative decoding)技术在模板扰动和长上下文输入下性能显著下降的问题。核心问题是:当前的 drafter 模型在推测链(speculation chain)中随着生成 token 的增多,注意力机制逐渐从原始 prompt 偏移至自身生成的 token 上,这种现象被称为“注意力漂移”(attention drift)。作者发现该现象普遍存在于 EAGLE3 和 MTP 头部结构中,其根源在于推测链步骤间未归一化的残差路径导致隐藏状态幅值随链深单调增长,表现出类似堆叠预归一化 Transformer 层的动态特性。解决方案的关键是引入两项架构改进:1)在 drafter 隐藏状态上采用后归一化(post-norm);2)在捕获目标模型隐藏状态后对每个隐藏状态进行 RMSNorm 归一化。这些改动有效抑制了隐藏状态幅值增长,从而显著提升接受长度(acceptance length),在多种任务场景下均实现性能增强。

链接: https://arxiv.org/abs/2605.09992
作者: Doğaç Eldenk,Payal Mohapatra,Yigitcan Comlek,Kaan Oktay,Hongyang Zhang,Stephen Xia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Speculative decoding accelerates LLM inference by drafting future tokens with a small model, but drafter models degrade sharply under template perturbation and long-context inputs. We identify a previously-unreported phenomenon we call \textbfattention drift: as the drafter generates successive tokens within a speculation chain, attention progressively moves from the prompt onto its own recently-generated tokens. We observe this across both \emphEAGLE3 drafters and \emphMTP heads, suggesting drift is a property of drafter designs. We trace this to the un-normalized residual path between chain steps: the drafter’s hidden state magnitude grows monotonically with chain depth, which exhibits dynamics consistent with additional pre-norm transformer layers stacked on the target rather than as a standalone autoregressive predictor. In order to limit the growth, we propose two architectural changes: Post-norm on the drafter hidden states and per-hidden-state RMSNorm after capturing target hidden states. Our interventions improve acceptance length over the current leading model, pre-norm EAGLE3, by up to 2\times under template perturbation, 1.18\times on long-context tasks, and 1.10\times on seven standard benchmarks spanning multi-turn chat, math, and coding. Our changes also allow shorter train-time-test depths to generalize over longer drafting sequences.
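The two proposed normalizations can be sketched as follows, using a standard RMSNorm. The drafter interfaces below are placeholders rather than EAGLE3/MTP code; only the placement of the norms (on captured target hidden states, and after every speculation-chain step) follows the abstract.

```python
# Sketch: post-norm on drafter hidden states plus RMSNorm on captured target states,
# which keeps hidden-state magnitude from growing with speculation-chain depth.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

dim = 64
target_norm = RMSNorm(dim)   # applied to each captured target hidden state
post_norm = RMSNorm(dim)     # applied to the drafter hidden state after every chain step

def draft_chain(drafter_step, target_hidden, depth=4):
    h = target_norm(target_hidden)
    drafted = []
    for _ in range(depth):
        h, token = drafter_step(h)
        h = post_norm(h)                 # bound magnitude growth across chain depth
        drafted.append(token)
    return drafted

# dummy drafter step for a runnable toy example
dummy_step = lambda h: (h + 0.1 * torch.randn_like(h), int(h.sum().item()) % 100)
print(draft_chain(dummy_step, torch.randn(dim)))
```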

[AI-117] Optimizer-Induced Mode Connectivity: From AdamW to Muon

【速读】:该论文旨在解决优化器在神经网络模式连通性(mode connectivity)中的作用未被充分理解的问题,特别是当解空间受限于特定优化器时,其结构如何变化。解决方案的关键在于引入“优化器诱导的隐式正则化”(optimizer-induced implicit regularization)视角,证明在两层ReLU网络中,由单一优化器(如AdamW、Muon或Lion-𝒦族)获得的解构成一个连通集,且该连通性依赖于优化器类型和正则化强度:在大宽度下,不同优化器对应的区域可能不相交或重叠;而在小宽度情况下,不同优化器可能收敛至被损失屏障分隔的零损失组件。这一发现揭示了优化器驱动的结构特征,超越了传统模式连通性的理论框架。

链接: https://arxiv.org/abs/2605.09991
作者: Fangzhao Zhang,Sungyoon Kim,Erica Zhang,Yiqi Jiang,Mert Pilanci
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Mode connectivity has been widely studied, yet the role of the optimizer remains underexplored. We revisit it through optimizer-induced implicit regularization, asking how connectivity behaves when restricted to solutions constrained by a given optimizer. For two-layer ReLU networks, we show that solutions from a single optimizer – AdamW, Muon, or others in the Lion-\mathcal{K} family – form a connected set at sufficiently large width, a result not implied by prior work. We then characterize how optimizer-induced regions interact: at large width two different regions can be disjoint or overlap depending on regularization, while in our small-width example AdamW and Muon converge to disconnected zero-loss components separated by a provable loss barrier. Empirically, in GPT-2 pretraining, we observe same-optimizer paths preserve each model’s spectrum while cross-optimizer paths traverse a smooth transition. Our results reveal optimizer-dependent structure beyond classical mode connectivity literature.
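For context, the loss barrier referred to here is typically measured along the straight line between two trained parameter vectors. The sketch below shows that standard measurement on a toy one-dimensional loss with two minima; it is background for reading the result, not the paper's experimental code.

```python
# Standard linear-interpolation loss-barrier measurement used in mode-connectivity studies.
import numpy as np

def loss_barrier(theta_a, theta_b, loss_fn, num_points=21):
    """Max loss increase along the segment, relative to the mean of the endpoint losses."""
    alphas = np.linspace(0.0, 1.0, num_points)
    path_losses = np.array([loss_fn((1 - a) * theta_a + a * theta_b) for a in alphas])
    endpoint_mean = 0.5 * (path_losses[0] + path_losses[-1])
    return path_losses.max() - endpoint_mean, path_losses

# toy example: two symmetric minima at w = -1 and w = +1 produce a barrier at the midpoint
loss = lambda w: float(np.minimum((w - 1.0) ** 2, (w + 1.0) ** 2).sum())
barrier, _ = loss_barrier(np.array([-1.0]), np.array([1.0]), loss)
print(barrier)   # > 0: the two minima are not linearly connected
```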

[AI-118] Prospective Compression in Human Abstraction Learning NEURIPS2026

【速读】:该论文旨在解决程序合成中的在线库学习(online library learning)问题,即在对未来任务需求不确定的情况下,如何增量式地获取可复用的抽象结构。传统算法将库学习视为对静态任务分布的事后压缩(retrospective compression),但现实场景常具有非平稳性(non-stationary),任务生成过程随时间演化。论文的关键解决方案是提出并验证人类库学习行为具有前瞻性(prospective)特征——即主动选择能压缩未来任务的抽象结构,而非仅基于历史任务进行优化。通过模式构建任务(Pattern Builder Task)和六种计算模型对比实验,研究发现人类行为体现出对潜在非平稳结构的敏感性,且这种行为无法被现有基于事后压缩或大语言模型(LLM)归纳偏置的算法所捕捉,从而揭示了前瞻性压缩机制在动态环境下的核心作用。

链接: https://arxiv.org/abs/2605.09985
作者: Leonardo Hernandez Cano,Ivan Zareski,Luisa El Amouri,Pinzhe Zhao,Max Mascini,Emanuele Sansone,Yewen Pu,Bonan Zhao,Marta Kryven
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: under review at neurips 2026

点击查看摘要

Abstract:A core challenge in program synthesis is online library learning: the incremental acquisition of reusable abstractions under uncertainty about future task demands. Existing algorithms treat library learning as retrospective compression over a static task distribution, where the learned library is determined by the corpus of past tasks. However, real-world learning domains are often non-stationary, with tasks arising from a generative process that evolves over time. We propose and test the hypothesis that in non-stationary domains human library learning selects abstractions prospectively: targeting compression of future tasks. We study this question using the Pattern Builder Task, a visual program synthesis paradigm in which participants construct increasingly complex geometric patterns from a small set of primitives, transformations, and custom helpers that carry forward across trials. Using this task, we conduct two experiments with complementary latent curricula, designed to dissociate between behaviors consistent with prospective compression, and alternative library learning accounts. Using six computational models spanning online library learning strategies, we show that human abstraction behavior reflects sensitivity to latent, non-stationary structure in the task-generating process. This behavior is consistent with prospective compression, and cannot be captured by existing retrospective compression-based algorithms, or inductive biases modeled by LLM-based program synthesis.

[AI-119] Learning the Interaction Prior for Protein-Protein Interaction Prediction: A Model-Agnostic Approach ICML2026

【速读】:该论文旨在解决当前基于学习的蛋白质-蛋白质相互作用(Protein-Protein Interactions, PPIs)预测模型中,分类头设计缺乏生物学先验知识的问题。现有方法多依赖通用聚合策略(如拼接或点积),未能充分利用生物机制中的结构规律。其解决方案的关键在于提出一种基于生物“L3规则”的图提示学习方法——L3-PPI,该方法通过生成包含虚拟长度为3路径(L3 paths)的提示图,并将蛋白对嵌入的分类任务转化为图级分类任务,从而引入互补性交互先验信息。这一轻量模块可作为即插即用组件集成到现有PPI预测器中,显著提升性能。

链接: https://arxiv.org/abs/2605.09964
作者: Ziqi Gao,Chenyi Zi,Zijing Liu,Ziqiao Meng,Yu Li,Jia Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: Accepted at ICML 2026

点击查看摘要

Abstract:Protein-protein interactions (PPIs) are fundamental to cellular function and disease mechanisms. Current learning-based PPI predictors focus on learning powerful protein representations but neglect designing specialized classification heads. They mainly rely on generic aggregating methods like concatenation or dot products, which lack biological insight. Motivated by the biological “L3 rule”, where multiple length-3 paths between a pair of proteins indicate their interaction likelihood, our study addresses this gap by designing a biologically informed PPI classifier. In this paper, we provide empirical evidence that popular PPI datasets strongly support the L3 rule. We propose an L3-path-regularized graph prompt learning method called L3-PPI, which can generate a prompt graph with virtual L3 paths based on protein representations and controls the number of paths. L3-PPI reformulates the classification of protein embedding pairs into a graph-level classification task over the generated prompt graph. This lightweight module seamlessly integrates with PPI predictors as a plug-and-play component, injecting the interaction prior of complementarity to enhance performance. Extensive experiments show that L3-PPI achieves superior performance enhancements over advanced competitors.
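As background for the "L3 rule" the method builds on, the sketch below computes one common degree-normalized count of length-3 paths between every pair of proteins in a known PPI network. This score motivates L3-PPI but is not the proposed model itself.

```python
# Degree-normalized length-3 path counts over a PPI adjacency matrix (the "L3" signal).
import numpy as np

def l3_scores(adj):
    """adj: symmetric 0/1 adjacency matrix of the known PPI network.
    Returns, for every pair (u, v), the sum over paths u-w-x-v of 1/sqrt(deg(w)*deg(x))."""
    deg = adj.sum(axis=1)
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(deg), 0.0)
    norm_adj = adj * inv_sqrt[:, None] * inv_sqrt[None, :]   # normalize the two inner nodes
    return adj @ norm_adj @ adj

adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
print(l3_scores(adj)[0, 3])   # the single 0-1-2-3 path contributes 1/sqrt(2*2) = 0.5
```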

[AI-120] Novel GPU Boruta algorithms for feature selection from high-dimensional data

【速读】:该论文旨在解决特征选择算法(尤其是基于包装器的算法)在CPU平台上因计算复杂度高而导致效率低下、难以处理大规模数据集的问题。解决方案的关键在于提出两种基于GPU加速的Boruta特征选择方法:Boruta-Permut依赖于置换法计算特征重要性,而Boruta-TreeImp则基于不纯度减少来评估特征重要性。实验表明,这两种GPU加速版本在保持与原始Boruta算法相当的选择准确性的同时,显著提升了计算效率,证明了在GPU上执行Boruta特征选择是一种高效且经济的大规模数据分析方案。

链接: https://arxiv.org/abs/2605.09950
作者: Xurui Li,Zhiguo Gan,Jiaming Zhang,Zheng Liu,Diannan Lu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This paper has been submitted to the journal Data Mining and Knowledge Discovery, and a preprint is available for the authors’ records

点击查看摘要

Abstract:Most feature selection algorithms, especially wrapper methods, run inefficiently on CPU-based platforms because of their high computational complexity. This inefficiency makes them unsuitable for processing large-scale datasets. To address this challenge, the present study proposed two GPU-accelerated versions of the Boruta feature selection procedure, in which Boruta-Permut relies on permutation-based feature importance and Boruta-TreeImp employs importance based on impurity reduction. To evaluate these methods, we conducted experiments on both a self-constructed dataset and several publicly available datasets. The experimental results show that the proposed GPU-accelerated algorithms greatly improve computational efficiency while preserving feature selection accuracy comparable to the original Boruta algorithm. In our analysis we also observe that the impurity-reduction-based version can overestimate the importance of some features. Overall these findings suggest that performing Boruta feature selection on GPUs offers an effective and cost-efficient solution for large-scale data analysis.
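For reference, a single Boruta round compares each real feature's importance to the best "shadow" feature obtained by permuting the columns. The CPU-side sketch below uses scikit-learn impurity importances, so it is closer to the Boruta-TreeImp variant; the GPU acceleration that is the paper's contribution is not shown, and the hit-counting loop is simplified.

```python
# Simplified sketch of Boruta rounds with shadow features (CPU, impurity importance).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def boruta_round(X, y, rng):
    """Return a boolean 'hit' per feature: importance above the best shadow feature."""
    shadows = rng.permuted(X, axis=0)              # each column shuffled independently
    X_aug = np.hstack([X, shadows])
    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_aug, y)
    imp = forest.feature_importances_
    real, shadow = imp[: X.shape[1]], imp[X.shape[1]:]
    return real > shadow.max()

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)      # only features 0 and 1 are informative
hits = sum(boruta_round(X, y, rng) for _ in range(5))
print(hits)                                         # informative features accumulate hits
```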

[AI-121] HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution

【速读】:该论文旨在解决当前代理型大语言模型(Agentic Large Language Model, LLM)系统中记忆检索被建模为静态查找问题的局限性,即现有方法依赖扁平向量搜索或固定二元关系图,无法捕捉事件间关系的强度、置信度及查询相关的相关性变化。解决方案的关键在于提出HAGE框架——一种加权多关系记忆机制,其将记忆组织为共享节点上的特定关系图视图,每条边关联一个可训练的关系特征向量,编码多种关系信号;通过LLM分类器识别查询意图,并利用路由网络动态调制对应维度的边嵌入,结合语义相似度与查询条件化的边表示计算遍历得分,从而优先选择高价值关系路径并软性抑制噪声或弱相关连接。此外,HAGE引入基于强化学习的训练范式,联合优化路由行为与边表示,显著提升长程推理准确率并实现更优的准确性-效率权衡。

链接: https://arxiv.org/abs/2605.09942
作者: Dongming Jiang,Yi Li,Guanpeng Li,Qiannan Li,Bingzhe Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Memory retrieval in agentic large language model (LLM) systems is often treated as a static lookup problem, relying on flat vector search or fixed binary relational graphs. However, fixed graph structures cannot capture the varying strength, confidence, and query-dependent relevance of relationships between events. In this paper, we propose HAGE, a weighted multi-relational memory framework that reconceptualizes retrieval as sequential, query-conditioned traversal over a unified relational memory graph. Memory is organized as relation-specific graph views over shared memory nodes, where each edge is associated with a trainable relation feature vector encoding multiple relational signals. Given a query, an LLM-based classifier identifies the relational intent, and a routing network dynamically modulates the corresponding dimensions of the edge embedding. Traversal scores are computed via a learned combination of semantic similarity and these query-conditioned edge representations. This allows memory traversal to prioritize high-utility relational paths while softly suppressing noisy or weakly relevant connections. Beyond adaptive traversal, HAGE further introduces a reinforcement learning-based training framework that jointly optimizes routing behavior and edge representations using downstream tasks. Finally, empirical results demonstrate improved long-horizon reasoning accuracy and a favorable accuracy-efficiency trade-off compared to state-of-the-art agentic memory systems. Our code is available at this https URL.

[AI-122] EXPO: Exploration-Prioritized Policy Optimization via Adaptive KL Regulation and Gaussian Curriculum Sampling

【速读】:该论文旨在解决基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)在大语言模型(Large Language Models, LLMs)数学推理任务中,主流算法Group Relative Policy Optimization (GRPO)存在的两个低效问题:一是固定KL惩罚系数限制了模型在需要显著偏离参考策略阶段的探索能力;二是训练样本均匀采样忽略了中等难度题目能提供最有效的梯度信号。解决方案的关键在于提出一种轻量级插件式优化方法——Exploration-Prioritized Policy Optimization (EXPO),其核心包含两个模块:Accuracy-Conditioned KL Scaling (AKL) 通过批平均准确率的平滑非线性函数动态调节KL正则化强度,在模型表现不佳时放松约束、表现良好时增强约束;Gaussian Curriculum Sampling (GCS) 则以高斯分布对题目进行加权采样,聚焦于准确率约为0.5的中等难度问题,从而将训练资源集中在模型的学习前沿,显著提升探索效率与最终性能。

链接: https://arxiv.org/abs/2605.09923
作者: Mingxiong Lin,Zhangquan Gong,Maowen Tang,Qian Li,Chuangchuang Wang,Jian Ma,Sutian Huang,Kai Tang,Haonan Lu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for LLM mathematical reasoning, where Group Relative Policy Optimization (GRPO) serves as the mainstream algorithm. We point out two understudied inefficiencies existing in GRPO. First, the fixed KL penalty coefficient overly restricts policy exploration at stages where the model requires significant deviation from the reference policy. Second, uniform sampling of training questions ignores that moderately difficult problems provide the most informative gradient signals for optimization. We propose Exploration-Prioritized Policy Optimization (EXPO) with two lightweight plug-in modules. The Accuracy-Conditioned KL Scaling (AKL) dynamically adjusts KL regularization strength through a smooth nonlinear function of batch average accuracy, relaxing the penalty when the model underperforms and strengthening it when the model achieves good results. The Gaussian Curriculum Sampling (GCS) assigns sampling weights to questions following a Gaussian distribution centered at moderate accuracy around 0.5, focusing training on the model’s learning frontier. We conduct extensive experiments on DeepSeek-R1-Distill-Qwen-1.5B and Qwen3-8B-Base over six mathematical reasoning benchmarks. The results show EXPO steadily surpasses vanilla GRPO. It obtains an absolute gain of 13.34 on AIME 2025 pass@32, rising from 63.33 percent to 76.67 percent, and achieves an average pass@32 improvement of 2.66 on the 8B model. The much larger performance gains on pass@32 compared with pass@1 demonstrate that EXPO effectively enlarges the model’s exploration boundary under a fixed inference cost budget.
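The two plug-in modules are easy to sketch, though the abstract does not give their exact equations; the sigmoid gate in AKL and the Gaussian weighting in GCS below are assumed functional forms chosen to match the qualitative description.

```python
# Hypothetical sketches of AKL and GCS; the paper's exact formulas may differ.
import numpy as np

def akl_coefficient(batch_accuracy, beta_min=1e-4, beta_max=1e-2, sharpness=6.0):
    """Accuracy-Conditioned KL Scaling: relax the KL penalty when batch accuracy is low,
    tighten it as the policy starts solving the batch (smooth sigmoid interpolation)."""
    gate = 1.0 / (1.0 + np.exp(-sharpness * (batch_accuracy - 0.5)))
    return beta_min + (beta_max - beta_min) * gate

def gcs_sampling_weights(question_accuracies, center=0.5, width=0.2):
    """Gaussian Curriculum Sampling: weight questions by a Gaussian centered at
    moderate accuracy, focusing rollouts on the model's learning frontier."""
    w = np.exp(-0.5 * ((np.array(question_accuracies) - center) / width) ** 2)
    return w / w.sum()

print(akl_coefficient(0.2), akl_coefficient(0.8))   # weak penalty early, strong penalty late
print(gcs_sampling_weights([0.05, 0.5, 0.95]))       # the ~50%-accuracy question dominates
```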

[AI-123] Verifier-Free RL for LLMs via Intrinsic Gradient-Norm Reward ACL2026

【速读】:该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)依赖黄金标签或领域特定验证器导致的可扩展性问题,即在新任务和新领域中难以部署。其解决方案的关键在于提出一种无需外部验证器的内在奖励机制——验证器自由内在梯度范数奖励(Verifier-free Intrinsic Gradient-Norm Reward, VIGOR),该方法仅利用策略模型自身生成的输出来计算奖励:给定提示词后,VIGOR采样一组完成文本,并为诱导当前参数下教师强制负对数似然梯度 ℓ₂ 范数更小的输出分配更高奖励;这一设计基于直观假设——较小的梯度范数表明输出与当前策略更一致,从而构成有效的内在偏好信号用于策略优化。为提升实用性,研究进一步通过 √T 缩放校正平均token级梯度的长度偏差,并采用组内排序调整稳定不同提示词间的奖励尺度。实验表明,VIGOR在数学推理基准上优于最先进的内部反馈强化学习(Reinforcement Learning from Internal Feedback, RLIF)基线,并能跨域迁移至代码生成任务,展现出更强的泛化能力和训练稳定性。

链接: https://arxiv.org/abs/2605.09920
作者: Xuexiang Wen,Hang Yu,Linchao Zhu,Gaoang Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to Findings of ACL 2026

点击查看摘要

Abstract:While Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a promising post-training paradigm for Large Language Models (LLMs), its dependency on the gold label or domain-specific verifiers limits its scalability to new tasks and domains. In this work, we propose Verifier-free Intrinsic Gradient-Norm Reward (VIGOR), a simple reward that uses only the policy model itself. Given a prompt, VIGOR samples a group of completions and assigns higher within-group rewards to outputs that induce smaller \ell_2 norms of the teacher-forced negative log-likelihood gradients under the current parameters. Intuitively, lower gradient norms suggest the completion aligns better with the current policy, serving as an intrinsic preference signal for policy optimization. To make this intrinsic signal practical for RL, we correct the systematic length bias of averaged token-level gradients with a \sqrt{T} scaling, and apply group-wise rank shaping to stabilize reward scales across prompts. Across mathematical reasoning benchmarks, VIGOR outperforms the state-of-the-art Reinforcement Learning from Internal Feedback (RLIF) baseline, and it also exhibits cross-domain transfer to code benchmarks when trained only on math data. For instance, on Qwen2.5-7B-Base post-trained on MATH, VIGOR improves the average math accuracy by +3.31% and the average code accuracy by +1.91% over this baseline, while exhibiting more stable training dynamics. The code is available at this https URL.
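The reward shaping itself is simple once per-completion gradient norms are available; the sketch below takes them as given inputs (computing teacher-forced gradients is omitted) and applies the sqrt(T) length correction and within-group rank shaping described in the abstract. The even [0, 1] rank spacing is an assumption.

```python
# Sketch of the VIGOR reward from precomputed per-completion gradient norms.
import numpy as np

def vigor_rewards(grad_norms, lengths):
    """Lower teacher-forced gradient norm -> higher reward, length-corrected and
    rank-shaped within the group of completions sampled for one prompt."""
    grad_norms = np.asarray(grad_norms, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    # token-averaged gradient norms shrink roughly with 1/sqrt(T); multiply back by sqrt(T)
    scores = -grad_norms * np.sqrt(lengths)
    # group-wise rank shaping: map ranks to evenly spaced rewards in [0, 1]
    ranks = scores.argsort().argsort()
    return ranks / max(len(scores) - 1, 1)

# four sampled completions for one prompt
print(vigor_rewards(grad_norms=[0.8, 0.5, 0.9, 0.4], lengths=[120, 200, 80, 150]))
```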

[AI-124] NaiAD: Initiate Data-Driven Research for LLM Advertising

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)广告中平台收益与用户体验之间的矛盾问题,核心挑战在于如何在不损害用户感知质量的前提下实现商业价值最大化。解决方案的关键在于构建了首个面向LLM原生广告的综合性数据集NaiAD,其中包含58,999条精心设计的、嵌入广告的响应样本及其对应的用户查询,并基于理论驱动的评估指标分别量化用户效用和商业效用;同时提出解耦生成管道以缓解对齐模型中的维度共线性问题,从而生成结构多样化的响应样本;并通过Variance-Calibrated Prediction-Powered Inference(VC-PPI)框架校准自动评分与人工标注的一致性,最终通过机制分析揭示出四种语义层面的广告整合策略,使模型能够通过上下文学习独立控制用户与商业效用目标。

链接: https://arxiv.org/abs/2605.09918
作者: Yihang Zhang,Zimeng Huang,Ren Zhai,Yipeng Kang,Tonghan Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 37 pages, 11 figures

点击查看摘要

Abstract:Reconciling platform revenue with user experience in LLM advertising motivates a data-centric foundation. We introduce NaiAD, the first comprehensive dataset for LLM-native advertising comprising 58,999 carefully constructed ad-embedded responses paired with user queries. NaiAD is organized around theoretically grounded evaluation metrics that separately and comprehensively capture user and commercial utility. To mitigate the dimensional collinearity of aligned LLMs, we propose a decoupled generation pipeline that produces structurally diverse samples, ranging from responses that explicitly disentangle stakeholder utilities to responses that are uniformly strong or weak across dimensions. We further provide score labels calibrated by a Variance-Calibrated Prediction-Powered Inference (VC-PPI) framework, aligning automated scoring with human annotations. Mechanistic analyses reveal that successful ad integration relies on reasoning paths that cluster into four distinct semantic strategies. Models leveraging NaiAD internalize these strategies to simultaneously improve user and commercial utility, while enabling independent control over these distinct objectives via in-context learning. Together, these results position NaiAD as a foundational infrastructure for developing future LLM-native ad systems.

[AI-125] Voice Biomarkers for Depression and Anxiety

【速读】:该论文旨在解决从语音中检测抑郁和焦虑的难题,传统方法依赖于手工设计的副语言特征(paralinguistic features)及声学描述符,而这些特征往往难以捕捉深层的生物标志物信息。其解决方案的关键在于利用深度学习模型直接处理原始语音信号,从而提取更具预测能力的内容无关生物标志物(content-agnostic biomarker information),并通过与从音频中提取的词汇特征(lexical features)融合,显著提升在实际应用中的预测性能。该方法在约5000名独立受试者上验证,达到71%的敏感性和特异性,且基于约6.5万条语音样本构建的大规模数据集支持了模型的鲁棒性与临床相关性。

链接: https://arxiv.org/abs/2605.09908
作者: Oleksii Abramenko,Noah D. Stein,Colin Vaz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Current approaches to detecting depression and anxiety from speech primarily rely on machine learning techniques that utilize hand-engineered paralinguistic features and related acoustic descriptors derived from time- and frequency-domain representations of speech signals. Applying deep learning methods directly to raw speech signals has the potential to produce biomarker representations with substantially greater predictive power. However, these approaches typically require large volumes of carefully annotated data to learn robust and clinically meaningful representations of the underlying biomarkers. In this paper, we describe our efforts toward developing a deep learning model trained on a large-scale proprietary dataset comprising ~65,000 utterances collected from more than 23,000 subjects representative of relevant United States demographics. We present the techniques employed and analyze their impact on model performance. Our results demonstrate that the proposed models can extract content-agnostic biomarker information, which, when combined with lexical features extracted from audio, yields improved predictive performance in production settings. Our models are evaluated on ~5000 unique subjects and achieve performance of 71% in terms of sensitivity and specificity. To foster further research in mental health assessment from speech, we release the best-performing model described in this paper on HuggingFace.

[AI-126] Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought

【速读】:该论文旨在解决音频-视觉大语言模型在跨模态推理过程中因信息干扰导致的幻觉问题(cross-modal interference),即某一模态的信息错误引导另一模态的解释,从而影响问答准确性。解决方案的关键在于提出“先分离、后融合”(Separate First, Fuse Later, SFFL)框架:通过强制执行模态特定的链式思维(modality-specific chain-of-thought reasoning),生成独立的音频和视觉推理路径,并在证据融合阶段才引入跨模态信息;同时利用模态偏好标签作为强化学习中的辅助奖励信号,使模型根据实例动态选择更可靠的模态线索,从而有效减少交叉干扰并提升任务准确性和鲁棒性。

链接: https://arxiv.org/abs/2605.09906
作者: Xuanchen Li,Yuheng Lu,Chenrui Cui,Tianrui Wang,Zikang Huang,Yu Jiang,Long Zhou,Longbiao Wang,Jianwu Dang
机构: 未知
类目: Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Audio and vision provide complementary evidence for audio-visual question answering, yet current audio-visual large language models may suffer from cross-modal interference: information from one modality misguides the interpretation of another, thereby inducing hallucinations. We attribute this issue to uncontrolled cross-modal interactions during intermediate reasoning. To mitigate this, we propose Separate First, Fuse Later (SFFL), an audio-visual reasoning framework designed to reduce cross-modal interference. SFFL enforces modality-specific chain-of-thought reasoning, producing separate audio and visual reasoning traces and integrating evidence for answering. We construct modality-preference labels via a data pipeline under different modality input settings. We use these labels as an auxiliary reward in reinforcement learning to encourage a instance-dependent preference for modality cues when answering. We further introduce a modality-specific reasoning mechanism that preserves modality isolation during the separated reasoning stage while enabling full access to cross-modal information at the evidence fusion stage. Experiments demonstrate consistent improvements in both accuracy and robustness, yielding an average relative gain of 5.16% on general AVQA benchmarks and 11.17% on a cross-modal hallucination benchmark.

[AI-127] Rethinking Random Transformers as Adaptive Sequence Smoothers for Sleep Staging

【速读】:该论文旨在解决自动睡眠分期(sleep staging)中对复杂长程依赖建模的过度依赖问题,质疑了当前主流Transformer架构必须通过参数学习才能提升性能的假设。其关键解决方案在于揭示了睡眠序列具有强局部时间连续性(strong local temporal continuity)这一被忽视的特性,并提出随机初始化的Transformer本身即具备显著的平滑能力——无需训练即可优于传统启发式平滑方法。作者通过构建随机注意力先验核(Random Attention Prior Kernel, RAPK)形式化该现象,证明随机自注意力机制能自适应地平衡全局平均与内容相似性,同时保留阶段转换信息。进一步利用局部平滑影响指数(LSII)和加权转换熵(WTE)两个指标证实,多数Transformer在睡眠分期中的性能提升源于其架构归纳偏置(inductive bias),而非参数学习。这表明睡眠分期可由结构驱动的平滑机制有效处理,从而实现更高效、适合边缘部署的生理监测系统。

链接: https://arxiv.org/abs/2605.09905
作者: Guisong Liu,Xin Gao,Martin Dresler,Jiansong Zhang,Pengfei Wei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automatic sleep staging commonly adopts Transformers under the assumption that they learn complex long-range dependencies. We challenge this view by revealing a neglected property of sleep sequences: strong local temporal continuity. We show that a randomly initialized Transformer, without any training, substantially improves sleep staging performance and consistently outperforms heuristic smoothing. We formalize this effect via a Random Attention Prior Kernel (RAPK), showing that random self-attention acts as an adaptive smoother by balancing global averaging and content-based similarity while preserving stage transitions. Using two metrics, the Local Smoothness Influence Index (LSII) and the Weighted Transition Entropy (WTE), we provide evidence that most performance gains in Transformer-based sleep staging arise from architectural inductive bias rather than parameter learning. Our results suggest that sleep staging can be effectively addressed with structure-driven smoothing mechanisms rather than complex dependency modeling, enabling more efficient and edge-deployable healthcare systems for large-scale physiological monitoring.
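The "random attention as adaptive smoother" claim can be illustrated with a toy computation: frozen random query/key projections produce an attention kernel that mixes mostly-similar neighbouring epochs while blending dissimilar (transition) epochs less. This is only an illustration of the mechanism, not the paper's RAPK derivation, and all names are placeholders.

```python
# Toy demonstration of untrained random self-attention acting as a content-adaptive smoother.
import numpy as np

def random_attention_smooth(epoch_features, dim=16, seed=0):
    rng = np.random.default_rng(seed)
    d_in = epoch_features.shape[-1]
    Wq = rng.normal(scale=d_in ** -0.5, size=(d_in, dim))   # random, never trained
    Wk = rng.normal(scale=d_in ** -0.5, size=(d_in, dim))
    q, k = epoch_features @ Wq, epoch_features @ Wk
    logits = q @ k.T / np.sqrt(dim)
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ epoch_features                              # adaptive averaging of epochs

# a toy "night" of 8 epochs: two blocks of similar stages with one transition in the middle
feats = np.vstack([np.tile([1.0, 0.0, 0.0], (4, 1)), np.tile([0.0, 1.0, 0.0], (4, 1))])
feats += np.random.default_rng(1).normal(scale=0.1, size=feats.shape)
print(np.round(random_attention_smooth(feats), 2))   # smoothing stays within each block
```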

[AI-128] The Geometric Wall: Manifold Structure Predicts Layerwise Sparse Autoencoder Scaling Laws

【速读】:该论文试图解决的问题是:稀疏自编码器(Sparse Autoencoders, SAEs)在不同网络层上重建误差变化显著,而现有单层缩放定律无法解释这种跨层差异。其核心假设——激活空间可由全局线性结构近似——可能与实际的非线性流形结构存在几何不匹配。解决方案的关键在于引入跨层缩放分析框架,通过拟合每层的缩放律表面并将其参数与层间几何特征(如曲率和内在维度)进行回归,发现SAE的宽度-稀疏度缩放关系本质上是层依赖的几何函数,而非单一普适规律;进一步表明,由几何特征预测的每层宽度指数具有跨模型迁移能力,揭示了SAE性能受限的根本原因并非资源瓶颈,而是由待重构流形的几何特性所决定的“几何壁垒”。

链接: https://arxiv.org/abs/2605.09887
作者: Eslam Zaher,Maciej Trzaskowski,Quan Nguyen,Fred Roosta
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Differential Geometry (math.DG)
备注:

点击查看摘要

Abstract:Sparse autoencoders (SAEs) operationalise the linear representation hypothesis: they reconstruct model activations as sparse linear combinations of interpretable dictionary atoms, on the implicit assumption that activation space is well approximated by a globally linear structure. Their reconstruction error varies sharply across layers in ways that existing scaling laws, fitted at single layers, do not explain. We argue that this variation is the empirical trace of a geometric mismatch: where the activation manifold is curved and its intrinsic dimension varies across layers, no sparse linear dictionary can match it uniformly, and the SAE’s width-sparsity scaling becomes a layer-dependent function of manifold structure rather than a single universal law. We conduct the first cross-layer SAE scaling study, fitting and regressing on 844 residual-stream Gemma Scope SAE checkpoints across 68 layers of Gemma 2 2B and 9B. Stage 1 fits a per-layer scaling-law surface; Stage 2 regresses the fitted parameters and the derived per-layer width exponents on four layerwise geometric summaries. We find that manifold geometry predicts the per-layer width exponent in both models, and that the same regression coefficients learnt on one model predict the other model’s per-layer exponents under cross-model transfer, indicating a transferable geometric law. At the showcase layers where richer width grids permit identification of the asymptotic floor, we find that the fitted floor tracks the layerwise geometric ordering: higher curvature and intrinsic dimension correspond to higher floor, consistent with the irreducible second-order residual that any sparse linear approximation of a curved manifold must leave behind. SAEs thus encounter not a finite-resource ceiling but a geometry-dependent wall, set by the manifold they are trying to reconstruct.

[AI-129] M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)中数学推理(Mathematical Reasoning)与智能体推理(Agentic Reasoning)之间因推理模式不匹配而导致的性能瓶颈问题。具体而言,数学推理依赖内在逻辑在单次响应中解决封闭世界问题,而智能体推理则需与外部环境进行多轮交互,融合思考与行动,二者在训练过程中容易相互干扰,导致推理行为不稳定且多任务学习收益有限。解决方案的关键在于提出一种名为 M2A 的新范式——通过模型参数空间中的特征子空间识别与合并策略,仅在不影响智能体行为的方向上注入数学推理能力,从而实现两种推理能力的协同增强。该方法无需梯度更新,直接在参数空间操作,并以合并系数作为控制推理长度的简单调控变量,在真实代码代理场景中显著提升推理深度和性能,例如在微调后的 Qwen3-8B 模型上,SWE-Bench Verified 解决率从 44.0% 提升至 51.2%,且无需重新训练模型。

链接: https://arxiv.org/abs/2605.09879
作者: Junjian Wang,Xin Zhou,Qiran Xu,Kun Zhan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While reasoning has become a central capability of large language models (LLMs), the reasoning patterns required for different scenarios are often misaligned. Mathematical reasoning typically relies on intrinsic logic to solve closed-world problems in a single response, whereas agentic reasoning requires not only internal reasoning but also multi-turn interaction with external environments, interleaving thought and action. This misalignment prevents mathematical and agentic reasoning from effectively benefiting from each other, often yielding unstable reasoning behavior and only limited performance gains under multi-task learning. In this paper, we propose M2A, a novel paradigm that synergizes mathematical and agentic reasoning via model merging. To avoid overfitting to superficial reasoning patterns under joint training, M2A operates directly in parameter space: it identifies the feature subspace critical for agent behavior, and merges the mathematical reasoning task vector only along its null space, thereby injecting reasoning capability along directions that do not perturb agent behavior. Unlike SFT or RL, M2A requires no additional gradient-update and exposes the merging coefficient as a simple knob for controlling reasoning length. Experiments in a challenging real-world coding agent setting show that our method effectively extends agentic reasoning depth and delivers substantial performance improvements. Applied to a fine-tuned Qwen3-8B, M2A improves its SWE-Bench Verified resolved rate from 44.0% to 51.2% without retraining the model. Code is available at this https URL.
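The null-space merging step can be sketched in a few lines of NumPy. This is a hedged illustration only: identifying the agent-critical subspace from the top right-singular directions of agent activations is our stand-in, not necessarily the paper's procedure, and `alpha` plays the role of the merging coefficient described in the abstract.

```python
import numpy as np

def null_space_merge(W_agent, tau_math, agent_acts, k=64, alpha=1.0):
    """Merge a math-reasoning task vector into an agent layer's weights only
    along directions orthogonal to the feature subspace the agent relies on.
    agent_acts: (n_samples, d_in) activations feeding this layer."""
    _, _, Vt = np.linalg.svd(agent_acts, full_matrices=False)
    V_k = Vt[:k].T                                   # (d_in, k) agent-critical subspace
    P_null = np.eye(V_k.shape[0]) - V_k @ V_k.T      # projector onto its null space
    return W_agent + alpha * tau_math @ P_null       # perturbation vanishes on agent-critical inputs

# toy example for one linear layer of shape (d_out, d_in)
d_out, d_in = 128, 256
W_agent = np.random.randn(d_out, d_in)
W_math = np.random.randn(d_out, d_in)
tau_math = W_math - W_agent                          # task vector
W_merged = null_space_merge(W_agent, tau_math, np.random.randn(1000, d_in))
```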

[AI-130] Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations

【速读】:该论文旨在解决不同家族大语言模型(Large Language Models, LLMs)因隐藏维度、分词器和训练过程差异导致的行为方向难以比较或迁移的问题。其解决方案的关键在于提出一种锚投影框架(anchor-projection framework),将各模型的隐藏表示映射到一个共享的锚坐标空间(anchor coordinate space, ACS),在该空间中提取并平均行为方向以得到规范化的共通方向;随后,无需微调或目标特定的方向提取,即可通过仅依赖锚激活值将该规范方向重构至新模型的原生隐藏空间,实现跨模型的行为方向迁移与应用。

链接: https://arxiv.org/abs/2605.09875
作者: Su-Hyeon Kim,Yo-Sub Han
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models from different families use different hidden dimensions, tokenizers, and training procedures, making behavioral directions difficult to compare or transfer across models. We introduce an anchor-projection framework that maps hidden representations from each model into a shared anchor coordinate space (ACS). Behavioral directions extracted from source models are projected into ACS and averaged into a canonical direction. For a new model, the canonical direction is reconstructed into its native hidden space using only anchor activations, without fine-tuning or target-specific direction extraction. We evaluate five instruction-tuned model families and ten behavioral axes. We find that same-axis directions align tightly across the Llama-Qwen-Mistral-Phi (LQMP) cluster in ACS. This shared structure transfers to downstream tasks. For the aligned LQMP cluster, held-out targets achieve 0.83 ten-way detection accuracy and 0.95 mean binary AUROC, while canonical steering induces refusal-rate shifts of up to +0.46% under distribution shift. Sensitivity analyses show that two source models and small anchor pools already suffice to approximate transferable directions. Overall, ACS provides a novel perspective on cross-family interpretability, revealing that representation-level transfer remains robust across model families.
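A small NumPy sketch of the anchor-projection idea follows: a behavioral direction is represented by its inner products with a model's activations on a shared anchor set, averaged across source models, and reconstructed for a new model by least squares. The random matrices below are placeholders for real anchor activations, and the normalization choices are assumptions rather than the paper's exact recipe.

```python
import numpy as np

def to_acs(direction, anchor_acts):
    """Map a hidden-space direction (d,) into anchor coordinates (n_anchors,)."""
    return anchor_acts @ direction

def from_acs(acs_coords, anchor_acts):
    """Reconstruct a native-space direction for a target model by solving
    anchor_acts @ v ~= acs_coords in the least-squares sense."""
    v, *_ = np.linalg.lstsq(anchor_acts, acs_coords, rcond=None)
    return v / np.linalg.norm(v)

# two source models with different hidden sizes, one target model
n_anchors = 256
A1, A2, A_tgt = (np.random.randn(n_anchors, d) for d in (1024, 2048, 1536))
d1, d2 = np.random.randn(1024), np.random.randn(2048)   # same behavioral axis in each source model
canonical = (to_acs(d1, A1) + to_acs(d2, A2)) / 2        # averaged canonical direction in ACS
v_target = from_acs(canonical, A_tgt)                    # direction usable for detection or steering
```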

[AI-131] Intervention-Based Time Series Causal Discovery via Simulator-Generated Interventional Distributions

【速读】:该论文旨在解决时间序列因果发现中因混杂因素导致的因果效应估计偏差问题,尤其在观测数据中难以区分相关性与因果性时。其解决方案的关键在于提出SVAR-FM(Structural VAR with Flow Matching)框架,该框架将物理驱动的模拟器视为Pearl的do算子的机械实现:通过物理钳制变量切断混杂路径,直接生成干预数据;随后利用条件流匹配(Conditional Flow Matching)学习非线性干预条件分布。理论证明表明,在模拟器可钳制变量满足覆盖条件时,完整的结构向量自回归模型可识别,并推导出端到端误差界,分解为蒙特卡洛、模拟器保真度和流匹配三部分误差。实验验证了该方法能准确恢复因果方向,且在模拟器精度低于阈值时预测因果效应符号反转,实证结果支持这一现象。

链接: https://arxiv.org/abs/2605.09870
作者: Tsuyoshi Okita
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 54 pages, 6 figures

点击查看摘要

Abstract:We propose SVAR-FM (Structural VAR with Flow Matching), a framework for time series causal discovery that treats a physics-based simulator as a mechanical realization of Pearl’s do operator. Clamping a variable inside the simulator physically severs confounding paths, producing interventional data by construction. Conditional Flow Matching then learns the nonlinear interventional conditionals. Theoretically, we prove that the full structural VAR becomes identifiable under a coverage condition on the simulator-clampable variables, and derive an end-to-end error bound that decomposes into Monte Carlo, simulator fidelity, and Flow Matching terms. A sign-flip corollary predicts that when simulator accuracy falls below a threshold, the estimated causal effect reverses sign. Empirically, a benchmark across four scientific domains confirms that SVAR-FM recovers the correct causal sign where observational methods produce sign-reversed estimates due to confounding. A case study in ultrafast laser physics verifies the sign-flip prediction by physically varying the accuracy level of a first-principles quantum solver: the low-accuracy setting reverses the causal sign, while the high-accuracy setting recovers the correct direction (R-squared = 0.983, zero bias).

[AI-132] Continuous Latent Contexts Enable Efficient Online Learning in Transformers

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在在线决策任务中缺乏有效持续状态表示的问题,即如何让Transformer架构在多轮交互环境中高效地存储和更新算法状态以实现在线学习。其解决方案的关键在于引入连续的潜在上下文标记(latent context tokens),将算法状态编码为特征嵌入的线性组合,并通过少量潜变量实现对基础在线学习算法(如加权多数算法和Q-learning)的精确模拟;实验表明,这种设计可在不直接监督潜变量的情况下训练出小型GPT-2风格模型,在长序列在线预测任务中优于更大更复杂的LLM(如Qwen-3-14B和DeepSeek-V3)。

链接: https://arxiv.org/abs/2605.09867
作者: Emile Anand,Abdullah Ateyeh,Xinyuan Cao,Max Dabagia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 37 pages, 15 figures, 3 tables

点击查看摘要

Abstract:Large language models (LLMs) exhibit a strong capacity for in-context learning: Given labeled examples, they can generate good predictions without parameter updates. However, many interactive settings go beyond static prediction to online decision-making, in which effective behavior demands adaptation over long multi-turn horizons in response to feedback, and efficient algorithms in these domains must use compact representations of what they have learned. Recently, continuous transformer architectures with latent chain of thought have shown promise for offline iterative tasks such as directed graph-reachability. Motivated by this, we study whether continuous latent context tokens equip transformers to more effectively realize online learning. We give explicit constructions of constant-depth transformers that implement two foundational online decision-making procedures – the weighted majority algorithm and Q-learning – by storing their algorithmic state as linear combinations of feature embeddings, using a small number of latent context tokens. We further train a small GPT-2-style transformer with latent contexts using a multi-curriculum objective that does not directly supervise the latent states. On long synthetic online prediction sequences, this model outperforms larger and more complex LLMs, including Qwen-3-14B and DeepSeek-V3. Our results suggest that continuous latent contexts provide a simple and effective persistent state for transformers to implement online learning algorithms.
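The weighted majority algorithm that the construction targets is itself tiny; the textbook version below (plain Python, independent of any transformer) makes clear that its only persistent state is one weight per expert, which is exactly the kind of quantity a latent context token would need to carry across turns.

```python
import numpy as np

def weighted_majority(expert_preds, outcomes, eta=0.5):
    """expert_preds: (T, n_experts) array of {0,1} predictions;
    outcomes: (T,) array of {0,1} labels. Returns final weights and mistakes."""
    w = np.ones(expert_preds.shape[1])
    mistakes = 0
    for t in range(len(outcomes)):
        vote = (w @ expert_preds[t]) >= (w.sum() / 2)  # weighted vote of the experts
        mistakes += int(vote != outcomes[t])
        wrong = expert_preds[t] != outcomes[t]
        w[wrong] *= (1 - eta)                          # multiplicative penalty for wrong experts
    return w, mistakes
```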

[AI-133] UFO: A Unified Flow-Oriented Framework for Robust Continual Graph Learning

【速读】:该论文旨在解决持续图学习(Continual Graph Learning, CGL)中面临的两个核心挑战:灾难性遗忘(catastrophic forgetting)与噪声监督信号(noisy supervision)导致的灾难性记忆(catastrophic remembering)。现有方法通常假设标签干净,但在实际场景中,新到来的图结构常因标注错误或对抗扰动而引入噪声,从而导致模型在任务迁移过程中不仅遗忘旧知识,还可能固化并传播错误标签信息。为此,作者提出统一流导向框架(Unified Flow-Oriented framework, UFO),其关键在于:一是通过基于流的生成建模(flow-based generative modeling)对条件特征分布进行建模,生成重放表示以缓解遗忘且无需存储历史数据;二是估计实例级别的可靠性得分(instance-level reliability scores),用于识别并过滤噪声节点,从而有效抑制灾难性记忆,提升模型鲁棒性。

链接: https://arxiv.org/abs/2605.09862
作者: Danhui Zhang,Zhe Wang,Qing Qing,Jiarui Liu,Wentao Gao,Ziqi Xu,Mingliang Hou,Xikun Zhang,Renqiang Luo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph learning research has increasingly shifted toward continual graph learning (CGL), which better reflects real-world scenarios where graphs evolve over time. However, existing CGL methods largely assume clean supervision and overlook a critical challenge: the newly arriving portions of the graph are often noisy, due to annotation errors or adversarial corruption. This mismatch limits their applicability in practice. In this work, we study robust continual graph learning, where models must simultaneously handle catastrophic forgetting and noisy supervision in evolving graph data. We show that label noise introduces a new failure mode, catastrophic remembering, where models persistently reinforce corrupted knowledge across tasks. To address these challenges, we propose a Unified Flow-Oriented framework (UFO). First, UFO models conditional feature distributions via flow-based generative modeling and produces replay representations, mitigating forgetting without storing historical data. Second, UFO estimates instance-level reliability scores to distinguish clean from noisy nodes, reducing the impact of corrupted supervision and alleviating catastrophic remembering. Extensive experiments on four benchmark graph datasets under varying noise ratios demonstrate that UFO consistently outperforms existing methods in both accuracy and forgetting metrics. Code is available at: this https URL.

[AI-134] Flag Varieties: A Geometric Framework for Deep Network Alignment

【速读】:该论文旨在解决深度神经网络中层间权重矩阵子空间对齐(alignment)现象的统一理论解释问题,这一现象广泛存在于梯度流动、神经坍缩(Neural Collapse)及不同架构间的表征相似性中,但现有研究多为针对特定观测的经验性解释,缺乏统一的数学框架。解决方案的关键在于运用几何不变量理论(geometric invariant theory),首次从理论上推导出层间对齐所必需的几何结构:其核心是一个由旗流形(flag variety)定义的规范闭合、多项稳定(polystable)分支,且子空间交维数是唯一与参数化无关的可观测量,从而表明子空间度量并非经验约定而是数学必然。在此基础上,论文进一步揭示了两种动力学机制——Ridge正则化以权重衰减速率驱动子空间对齐,而非线性激活函数引入交换子障碍(commutator obstruction),导致非线性网络中无法实现精确基对齐,而线性网络中不存在此障碍;这共同构成了从第一原理出发对神经坍缩Level-2/3层级结构的几何解释,并提出无需前向传播即可通过交换子幅度和头子空间重叠来诊断内部对齐结构的新方法。

链接: https://arxiv.org/abs/2605.09861
作者: Jingchuan Xiao,Xinyi Sui,Cihan Ruan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Alignment, the tendency of adjacent weight matrices in deep networks to develop compatible subspace orientations, underlies gradient flow, Neural Collapse, and representation similarity across architectures. Despite extensive empirical documentation, these phenomena have resisted unified theoretical treatment: existing explanations are post-hoc, each fitted to a specific observation with whatever mathematics is at hand. We reverse this direction by deriving the mathematical structure that layerwise alignment inherently demands. Using geometric invariant theory, we prove that alignment geometry has a canonical closed, polystable stratum given by a flag variety, and that subspace intersection dimension is its unique reparameterization-invariant observable, establishing that subspace metrics are not empirical conventions but mathematical necessities. This unified framework yields two dynamical consequences: ridge regularization drives subspace alignment at an exponential rate set by weight decay, whereas nonlinear activations induce a commutator obstruction to exact basis alignment, generically present in nonlinear networks and absent in linear ones. Together these give a geometric explanation of the Level-2/3 hierarchy in Neural Collapse from first principles rather than post-hoc analysis. The commutator magnitude and head subspace overlap further serve as weight-space windows into internal alignment structure, requiring no forward passes. Experiments on multilayer perceptrons, residual networks, and pretrained language models support the proposed diagnostics and delineate their scope.
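The two weight-space diagnostics named in the abstract can be approximated with a few lines of linear algebra; the exact normalizations below are our guesses, and the commutator version assumes the two adjacent weight matrices are square and of equal size.

```python
import numpy as np

def commutator_obstruction(W1, W2):
    """Relative magnitude of [W1, W2] = W1 W2 - W2 W1 for adjacent square
    layers; near zero means a common eigenbasis (exact basis alignment) is possible."""
    C = W1 @ W2 - W2 @ W1
    return np.linalg.norm(C) / (np.linalg.norm(W1) * np.linalg.norm(W2) + 1e-12)

def head_subspace_overlap(W_a, W_b, k=8):
    """Overlap in [0, 1] between the top-k right singular subspaces of two
    weight matrices, computed from weights alone (no forward passes)."""
    Va = np.linalg.svd(W_a, full_matrices=False)[2][:k]
    Vb = np.linalg.svd(W_b, full_matrices=False)[2][:k]
    return float(np.linalg.norm(Va @ Vb.T) ** 2 / k)
```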

[AI-135] When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning

【速读】:该论文旨在解决长时程推理(long-horizon reasoning)中因固定承诺深度(commitment depth)导致的性能瓶颈问题。传统方法将承诺深度设为一个静态标量,无法根据状态动态调整,从而在重规划开销与执行误差累积之间难以取得最优平衡。解决方案的关键在于将承诺深度建模为策略本身的一个可学习、状态相关的变量,通过在模型原生的视觉-语言策略中联合预测动作序列及其执行时长,实现自适应决策。实验表明,该方法在Sliding Puzzle和Sokoban任务上显著优于所有非退化的固定深度基线,且尽管仅使用7B骨干模型,仍超越GPT-5.5和Claude Sonnet,验证了状态条件化承诺深度的优越性。

链接: https://arxiv.org/abs/2605.09860
作者: Chen Li,Zhantao Yang,Fangyi Chen,Han Zhang,Anudeepsekhar Bolimera,Marios Savvides
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-horizon reasoning requires deciding not only what actions to take, but how deeply to commit before the next observation. We formalize this as commitment depth: the number of primitive actions executed open-loop between replans. Commitment depth induces a trade-off between replanning cost and compounding execution error, yet most existing long-horizon systems fix it as a hand-designed scalar. In this work, we instead treat commitment depth as a learnable, state-conditioned variable of the policy itself. We instantiate this within a model-native vision–language policy that jointly predicts both what to execute and for how long. Across Sliding Puzzle and Sokoban, the resulting adaptive policy Pareto-dominates every non-degenerate fixed-depth baseline, achieving up to 12.5 percentage points higher solve rate while using approximately 25% fewer primitive actions per episode. Despite using a 7B backbone, our method outperforms GPT-5.5 and Claude Sonnet on both tasks, while every tested open-weight vision–language model achieves 0% zero-shot success. We further present a theoretical analysis showing that, under the standard commitment-depth surrogate, state-conditioned commitment strictly dominates any fixed depth whenever the locally optimal depth varies across states.
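The control loop implied by a state-conditioned commitment depth is short; the sketch below uses placeholder `policy` and `env` interfaces (a policy returning a plan plus how many steps of it to trust), so it illustrates the mechanism rather than the paper's system.

```python
def rollout_with_adaptive_commitment(env, policy, max_steps=200):
    """policy(obs) -> (actions, depth): a plan and a state-conditioned number
    of actions to execute open-loop before re-observing. A fixed-depth
    baseline would simply return a constant depth for every state."""
    obs, done = env.reset(), False
    steps = replans = 0
    while not done and steps < max_steps:
        actions, depth = policy(obs)
        replans += 1
        for a in actions[:depth]:        # open-loop commitment
            obs, done = env.step(a)
            steps += 1
            if done:
                break
    return steps, replans
```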

[AI-136] Fairness of Explanations in Artificial Intelligence (AI): A Unifying Framework, Axioms, and Future Direction toward Responsible AI

【速读】:该论文旨在解决生成式 AI(Generative AI)在高风险决策场景中存在的一种新型偏见问题,即模型输出满足所有公平性标准但其推理过程本身却存在系统性不公平的现象,称为“程序性偏见”(procedural bias)。传统算法公平性研究关注结果公平,而可解释人工智能(Explainable AI, XAI)则聚焦于推理过程的可解释性,二者独立发展,忽视了解释本身的公平性问题。论文的核心解决方案是提出一个条件不变性框架(conditional invariance framework),将解释公平性形式化为:对于任意任务相关输入 xx 和不同受保护属性值 aabb,解释 E(X)E(X) 的分布应保持不变,即 P(E(X)Xrel=xrel,A=a)=P(E(X)Xrel=xrel,A=b)P(E(X) \in \cdot \mid X_\text{rel} = x_\text{rel},\, A = a) = P(E(X) \in \cdot \mid X_\text{rel} = x_\text{rel},\, A = b)。这一单一原则统一了现有解释公平性度量,并通过七维分类体系与六步评估流程,系统性地识别和缓解解释不公的三种生成机制(表示驱动型、解释模型失配、可行动驱动型),从而推动解释公平性的科学化研究与实践落地。

链接: https://arxiv.org/abs/2605.09852
作者: Gideon Popoola,John Sheppard
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 53 pages, 1 figure

点击查看摘要

Abstract:Machine learning algorithms are being used in high-stakes decisions, including those in criminal justice, healthcare, credit, and employment. The research community has responded with two largely independent research fields: algorithmic fairness, which targets equitable outcomes, and explainable AI (XAI), which targets interpretable reasoning. This survey identifies and maps a novel blind spot at their intersection, which is a model that can satisfy every standard fairness criterion in its outputs while being profoundly unfair in its reasoning process. We refer to this as the procedural bias, and mitigating it requires treating the fairness of explanations as a distinct object of scientific study. To our knowledge, we provide the first unified theoretical and literature review of this emerging field and elucidate the drawbacks of post-hoc explainers in certifying explanation fairness. Our central contribution is a conditional invariance framework formalizing explanation fairness as the requirement that explanations should be indifferent regardless of the protected attributes: P(E(X) \in \cdot \mid X_\text{rel} = x_\text{rel}, A = a) = P(E(X) \in \cdot \mid X_\text{rel} = x_\text{rel}, A = b) for all task-relevant x, a single principle from which all existing explanation fairness metrics emerge as partial operationalizations. We introduce a seven-dimensional taxonomy, identify three generative mechanisms of explanation inequity (representation-driven, explanation-model mismatch, actionability-driven), and propose a canonical six-step evaluation workflow for operationalizing explanation fairness audits in practice.
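One way to operationalize the conditional-invariance principle is to compare explanation distributions across protected groups within strata of task-relevant features. The sketch below uses a simple mean-discrepancy statistic over feature attributions; it is a stand-in estimator, since the paper states the principle rather than prescribing a single test.

```python
import numpy as np

def explanation_invariance_gap(attributions, protected, strata):
    """attributions: (n, d) per-instance feature attributions;
    protected: (n,) binary protected attribute; strata: (n,) discretized
    task-relevant profile (buckets of x_rel). Returns the average distance
    between group-wise mean attributions within each stratum; 0 suggests the
    explanations are distributionally indifferent to the protected attribute."""
    gaps = []
    for s in np.unique(strata):
        idx = strata == s
        a0 = attributions[idx & (protected == 0)]
        a1 = attributions[idx & (protected == 1)]
        if len(a0) and len(a1):
            gaps.append(np.linalg.norm(a0.mean(axis=0) - a1.mean(axis=0)))
    return float(np.mean(gaps)) if gaps else 0.0
```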

[AI-137] ChladniSonify: A Visual-Acoustic Mapping Method for Chladni Patterns in New Media Art Creation

【速读】:该论文旨在解决新媒介艺术创作中音画映射主观性强、现有工具存在技术门槛高、离线计算难以支持实时交互以及通用声化工具映射规则不可控等问题。其解决方案的关键在于提出一种基于克希荷夫-洛夫板理论(Kirchhoff-Love plate theory)的实时音画映射方法 ChladniSonify,通过数值编程构建配对数据集并利用 ANSYS 有限元仿真校准,采用轻量级卷积神经网络(CNN)结合通道注意力机制(CBAM)实现对 Chladni 图案纤细节点线的高精度、低延迟分类,并在 Python 与 Max/MSP 中搭建端到端系统,将识别出的图案映射至对应正弦波频率,从而实现无偏差、低延迟(平均端到端延迟 <50 ms)的实时交互式音画同步。

链接: https://arxiv.org/abs/2605.09846
作者: Yakun Liu,Hai Luan,Dong Liu,Zhiyu Jin
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures, IEEE conference format

点击查看摘要

Abstract:In new media art creation, the mapping between vision and hearing is often subjective. As a classic carrier of sound visualization, Chladni patterns have great potential in building audio-visual mapping mechanisms. However, existing tools face pain points: high technical barriers for simulation, offline computing failing real-time interaction, and uncontrollable mapping rules in general sonification tools. To address these, this paper proposes ChladniSonify, a real-time visual-acoustic mapping method for Chladni patterns. Based on Kirchhoff-Love plate theory, we build a paired dataset via numerical programming and calibrate it using ANSYS finite element simulation. Focusing on the slender nodal lines of Chladni patterns, we adopt a lightweight CNN with CBAM to achieve high-precision, low-latency pattern classification. Finally, we build an end-to-end system in Python and Max/MSP, mapping recognized patterns to corresponding sine wave frequencies. Results show the system has excellent usability: the classification module achieves 99.33% accuracy on the test set with 7.03 ms inference latency; the mapped frequency matches the theoretical value with zero deviation; the average end-to-end latency is under 50 ms, meeting real-time interactive needs. This work provides a reproducible engineering prototype for Chladni audio-visual art creation.

[AI-138] Yield Curve Forecasting using Machine Learning and Econometrics: A Comparative Analysis

【速读】:该论文旨在解决机器学习在时间序列预测领域,尤其是金融领域中的有效性争议问题,具体聚焦于美国国债收益率曲线的预测性能比较。其核心问题是:不同方法(经典计量经济学、传统机器学习与深度学习)在长期、高频率的收益率曲线数据上的表现差异及其适用性。解决方案的关键在于系统性地对比ARIMA及其扩展模型、朴素基准、集成学习方法、循环神经网络(RNNs)以及多种专为预测设计的Transformer架构,并首次在该场景下评估深度学习模型对平稳或非平稳输入数据的敏感性,结果表明ARIMA和朴素模型整体最优,而TimeGPT、LGBM和RNNs在特定条件下表现突出,同时揭示了非平稳数据作为深度学习输入可能更具优势。

链接: https://arxiv.org/abs/2605.09842
作者: Aman Singh,Tokunbo Ogunfunmi,Sanjiv Das
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages, 12 figures, comparative study of econometric, machine learning, and deep learning methods for U.S. Treasury yield curve forecasting

点击查看摘要

Abstract:While machine learning has revolutionized many fields such as natural language processing (NLP) and computer vision, its impact on time-series forecasting is still widely disputed, especially in the finance domain. This paper compares forecasting performance on U.S. Treasury yield curve data across econometrics/time-series analysis, classical machine learning, and deep learning methods, using daily data over 47 years. The Treasury yield curve is important because it is widely used by every participant in the bond markets, which are larger than equity markets. We examine a variety of methods that have not been tested on yield curve forecasting, especially deep learning algorithms. The algorithms include the Autoregressive Integrated Moving Average (ARIMA) model and its extensions, naive benchmarks, ensemble methods, Recurrent Neural Networks (RNNs), and multiple transformers built for forecasting. ARIMA and naive econometric models outperform other models overall, except in one time block. Of the machine learning methods, TimeGPT, LGBM and RNNs perform the best. Furthermore, the paper explores whether stationary or nonstationary data are more appropriate as input to deep learning models.

[AI-139] Free Energy Manifold: Score-Based Inference for Hybrid Bayesian Networks

【速读】:该论文旨在解决混合贝叶斯网络(hybrid Bayesian networks)中离散与连续变量联合推理的挑战,尤其针对传统条件能量模型(conditional energy model, CEM)在多模态分布下产生的“模式桥”伪影(mode-bridge artifact)问题——即模型在同类别不同模式之间的区域生成低能量脊线,导致对数据外点的后验概率过度自信。解决方案的关键在于提出自由能流形(Free Energy Manifold, FEM),其通过将每个条件因子建模为基于学习到的离散父节点嵌入和连续观测值的能量景观,并引入谷值正则化(valley regularization)这一数据外校准项,在保持数据内拟合的同时,使这些伪影区域的后验趋于均匀,从而提升推理的可靠性与泛化能力。

链接: https://arxiv.org/abs/2605.09839
作者: Cheol Young Park,Shou Matsumoto
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce the Free Energy Manifold (FEM), a score-trained conditional energy model specialized for inference in hybrid Bayesian networks with discrete and continuous variables. FEM represents each conditional factor as an energy landscape over learned discrete-parent embeddings and continuous observations, enabling posterior evaluation, generative sampling, and compositional inference across multiple continuous leaves by energy addition under conditional independence. A central finding is the mode-bridge artifact: standard conditional energy models can create low-energy ridges between separated modes of the same class, producing overconfident posteriors at off-data interior points. We analyze this failure and propose valley regularization, an off-data calibration term that restores near-uniform posteriors in such regions while preserving in-data fit. Across synthetic multimodal hybrid-BN benchmarks, FEM substantially reduces KL divergence relative to classical baselines and a vanilla conditional EBM, including large gains at mode-bridge midpoint queries and in multi-leaf evidence composition. We also evaluate high-cardinality discrete-parent settings and a UCI Breast Cancer sanity check, showing that FEM is most useful when multimodal or compositional Bayesian-network inference is required, while discriminative classifiers remain preferable for closed-world classification tasks.

[AI-140] Pretraining large language models with MXFP4

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在全流水线FP4(Floating Point 4-bit)训练中频繁发散的问题,即使前向激活和激活梯度保持稳定。研究通过受控实验逐步启用FP4量化:从前向传播(Fprop)、激活梯度(Dgrad)到权重梯度(Wgrad),并固定其他所有变量。结果表明,权重梯度(Wgrad)的FP4量化是导致收敛性下降的主要原因,而仅对Fprop和Dgrad进行FP4量化仅带来轻微的额外token需求。关键解决方案在于引入确定性Hadamard旋转(deterministic Hadamard rotations),该方法能有效恢复优化稳定性,表明FP4训练不稳定源于敏感梯度路径上的结构化微尺度误差(structured micro-scaling errors),而非缺乏随机性。这一发现为高效、稳定的低精度训练提供了理论依据与实践路径。

链接: https://arxiv.org/abs/2605.09825
作者: Musa Cim,Poovaiah Palangappa,Miro Hodak,Ravi Dwivedula,Meena Arunachalam,Mahmut Taylan Kandemir
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Why does full-pipeline FP4 training of large language models often diverge, even when forward activations and activation gradients remain stable? We address this question through a controlled study of MXFP4 quantization in transformer training, progressively enabling FP4 across forward propagation (Fprop), activation gradients (Dgrad), and weight gradients (Wgrad) while holding all other factors fixed. In full pretraining of Llama 3.1-8B on the C4 dataset, we observe that quantizing Wgrad is the primary driver of convergence degradation, whereas FP4 in Fprop and Dgrad alone introduces only modest additional token requirements. To interpret this behavior, we evaluate both structured and stochastic interventions under a controlled experimental setting. We find that stochastic rounding and randomized Hadamard rotations fail to stabilize training once Wgrad is quantized, whereas deterministic Hadamard rotations consistently restore stable optimization. These results suggest that FP4 training instability is driven by structured micro-scaling errors along sensitive gradient paths, rather than by insufficient stochasticity. We run experiments with native MXFP4 support on AMD Instinct MI355X GPUs, enabling controlled investigation of these effects without reliance on software emulation.
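To see why a deterministic Hadamard rotation can help, the NumPy sketch below emulates block-wise 4-bit quantization of a gradient with and without the rotation. The FP4 grid, the block size, and the power-of-two scaling are illustrative (E2M1-like), not the exact MXFP4 specification, and the input length is assumed divisible by the block size.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an orthogonal Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # illustrative positive magnitudes

def quantize_blocks(x, block=32):
    """Fake micro-scaled quantization: per-block power-of-two scale, then
    round each magnitude to the nearest representable FP4 value."""
    x = x.reshape(-1, block)
    scale = 2.0 ** np.ceil(np.log2(np.abs(x).max(axis=1, keepdims=True) / FP4_GRID[-1] + 1e-12))
    mags = np.abs(x) / scale
    q = FP4_GRID[np.argmin(np.abs(mags[..., None] - FP4_GRID), axis=-1)]
    return (np.sign(x) * q * scale).reshape(-1)

def rotate_then_quantize(g, block=32):
    """Deterministic Hadamard rotation per block before quantization, undone
    afterwards; it spreads structured outliers so that quantization error is
    less correlated with any single sensitive direction."""
    H = hadamard(block)
    g_rot = (g.reshape(-1, block) @ H).reshape(-1)
    return (quantize_blocks(g_rot, block).reshape(-1, block) @ H.T).reshape(-1)
```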

[AI-141] Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise AI Agent Reasoning

【速读】:该论文旨在解决生成式 AI (Generative AI) 系统在运行时通过工具调用查询结构化知识图谱(Knowledge Graph, KG)时所面临的新型安全威胁——Oracle Poisoning 攻击问题。此类攻击通过篡改知识图谱中的数据而非指令,导致模型即使执行正确推理仍得出错误结论,区别于传统的提示注入(Prompt Injection)攻击。解决方案的关键在于识别并验证了攻击的可行性:在生产级 4200 万节点代码知识图谱上,所有测试模型在中等攻击复杂度(L2)下对中毒数据的信任度达到 100%,且存在明确的攻击技能阈值,即攻击者只需达到最低能力水平即可触发完全信任;此外,研究揭示了交付模式(如 inline 评估 vs. agentic tool-use)是影响检测结果的一阶混淆因素,强调防御需针对实际运行环境设计。

链接: https://arxiv.org/abs/2605.09822
作者: Ben Kereopa-Yorke,Guillermo Diaz,Holly Wright,Reagan Johnston,Ron F. Del Rosario,Timothy Lynar
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 26 pages, 3 figures, 16 tables

点击查看摘要

Abstract:We define Oracle Poisoning, an attack class in which an adversary corrupts a structured knowledge graph that AI agents query at runtime via tool-use protocols, causing incorrect conclusions through correct reasoning. Unlike prompt injection, Oracle Poisoning manipulates the data agents reason over, not their instructions. We demonstrate six attack scenarios against a production 42-million-node code knowledge graph, providing the first empirical demonstration of knowledge graph poisoning against a production-scale agentic system, distinct from CTI embedding poisoning. Primary evaluation uses real SDK tool-use across nine models from three providers (N=30 per model), where models autonomously invoke a graph query tool and reason from results. The result is unambiguous: every tested model trusts poisoned data at 100% at moderate attacker sophistication(L2), with 269 valid trials (of 270) accepting fabricated security claims under directed queries. Under open-ended prompts, trust drops to 3-55%, confirming prompt framing as a confound; we report both conditions. An attacker sophistication gradient reveals discrete break points, a minimum skill at which trust flips from 0% to 100%, reframing the attack as a question not of whether but of how much. A controlled delivery-mode comparison shows that inline evaluation produces false negatives: GPT-5.1 shows 0% trust inline but 100% under both simulated and real agentic tool-use, demonstrating that delivery mode is a first-order confound. We evaluate five defences; read-only access control eliminates the direct mutation vector, while the remaining four are partial and model-dependent. Analysis of four additional platforms suggests the attack may generalise across the knowledge-graph ecosystem.

[AI-142] LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

【速读】:该论文旨在解决大模型在推理能力提升过程中出现的冗余推理问题,即随着Chain-of-Thought(CoT)推理能力增强,模型生成的推理轨迹往往超出任务实际需求,导致计算资源、延迟和上下文预算的浪费。现有基于强化学习的方法通过引入长度效率奖励来缓解此问题,但面临两个核心挑战:一是正确性与效率之间的最优权衡在训练过程中是非平稳的;二是不同问题的内在推理预算差异显著,静态奖励权重和全局长度约束难以兼顾准确率与压缩潜力。解决方案的关键在于提出LEAD(Length-Efficient Adaptive and Dynamic reasoning),其核心创新是用在线自适应机制替代静态启发式策略:首先,利用Potential-Scaled Instability动态校准每一步的正确性-效率权衡,将优化资源聚焦于最具信息量的学习信号;其次,基于模型自身正确推理轨迹在线估计每个问题的自适应目标长度,并施加对称效率奖励,同时惩罚过度思考和过度压缩行为。

链接: https://arxiv.org/abs/2605.09806
作者: Songtao Wei,Yi Li,Zhikai Li,Xu Hu,Yuede Ji,Guanpeng Li,Feng Chen,Carl Yang,Zhichun Guo,Bingzhe Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large reasoning models, such as OpenAI o1 and DeepSeek-R1, tend to become increasingly verbose as their reasoning capabilities improve. These inflated Chain-of-Thought (CoT) trajectories often exceed what the underlying problems require, wasting compute, latency, and context budgets. While introducing length-based efficiency rewards during reinforcement learning offers a natural remedy, existing methods struggle with two fundamental challenges: the optimal balance between correctness and efficiency is non-stationary throughout training, and intrinsic reasoning budgets vary drastically across problems. Relying on static reward weights and global length constraints inevitably forces a compromise between degraded accuracy and unrealized compression. To overcome these limitations, we propose LEAD (Length-Efficient Adaptive and Dynamic reasoning), a method that replaces static heuristics with online, self-adaptive mechanisms. LEAD dynamically calibrates the correctness-efficiency trade-off at each step using a Potential-Scaled Instability, directing optimization capacity to the most informative learning signal. Furthermore, it estimates an adaptive per-problem target length online based on the model’s own correct rollouts, applying a symmetric efficiency reward that penalizes both overthinking and over-compression. Evaluated on five mathematical reasoning benchmarks, LEAD achieves the highest accuracy and Accuracy-Efficiency Score among RL-trained efficient-reasoning methods while producing substantially shorter outputs than the base model.
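A hedged sketch of the reward shaping described above: an adaptive per-problem target length is estimated online from the lengths of the model's own correct rollouts, and a symmetric penalty discourages both overthinking and over-compression. The functional form below, and the fixed weight standing in for the Potential-Scaled Instability schedule, are simplifications.

```python
import numpy as np

def lead_style_reward(correct, length, correct_lengths, w_eff=0.3):
    """correct: whether this rollout answered correctly; length: its token
    count; correct_lengths: list of token counts of correct rollouts for the
    same problem in the current batch (the online target estimate)."""
    if not correct_lengths:                          # no signal yet: plain correctness reward
        return float(correct)
    target = float(np.median(correct_lengths))       # adaptive per-problem target length
    efficiency = np.exp(-abs(length - target) / max(target, 1.0))  # symmetric in both directions
    return float(correct) * (1.0 + w_eff * efficiency)
```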

[AI-143] Multi-Tier Labeling and Physics-Informed Learning for Orbital Anomaly Detection at Scale

【速读】:该论文旨在解决低地球轨道(Low-Earth Orbit, LEO)卫星群体中轨道异常检测的标注数据稀缺问题,这是实现碰撞规避、衰减预测和交会筛选的关键前提。现有方法受限于缺乏公开的真实标签数据集,人工标注无法扩展至约10⁴颗活跃卫星规模,而纯规则检测器则因过度牺牲召回率而难以发现多数行为异常。解决方案的关键在于提出一种多层级弱监督标签级联机制,融合三种不同置信度来源:快速物理规则集(rule_v1)、交互多模型无迹卡尔曼滤波器(Interacting Multiple Model Unscented Kalman Filter, IMM-UKF)阵列以及补充元素校准步骤(supGP),从而在单源无法覆盖的尺度上生成高质量标签序列。该级联方法应用于60年跨度的2.32亿条两行元素(Two-Line Element, TLE)记录,产出860万条长度为50的时间序列(共4.3亿时间步),涵盖11个特征包括显式时间编码与完整均值轨道要素;实验表明,IMM-UKF相较rule_v1可识别出42.6倍更多的异常事件,最终训练的650万参数Transformer模型在测试集上实现55.4%的机动召回率和62.8%的衰减召回率,且时间差特征单独作用即带来107%相对衰减召回提升,验证了该级联标注策略对下游模型性能的核心支撑作用。

链接: https://arxiv.org/abs/2605.09790
作者: Yong Fu
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Detecting orbital anomalies, such as maneuvers, atmospheric decay, and attitude upsets, across the rapidly growing population of low-Earth-orbit (LEO) satellites is a prerequisite for collision avoidance, decay forecasting, and conjunction screening. The bottleneck is not modeling capacity but labels: there is no public ground-truth corpus of orbital anomalies, manual review does not scale to approximately 10^4 active satellites, and pure rule-based detectors trade recall for precision so aggressively that they are blind to most behavioral anomalies. We present a multi-tier labeling cascade that composes three weak supervision sources of increasing fidelity: a fast physics rule set (rule_v1), an Interacting Multiple Model Unscented Kalman Filter (IMM-UKF) bank, and a supplemental-element calibration step (supGP), to produce labels at a scale unavailable from any single source. Applied to 232M Two-Line Element (TLE) records spanning 60 years, the cascade yields 8.6M labeled sequences of length 50 (430M timesteps) over 11 features that include explicit time encoding and full mean-element state. On overlapping satellites, IMM-UKF surfaces 42.6x more anomalies than rule_v1 alone. We train a 6.5M-parameter Transformer in two stages, achieving a maneuver recall of 55.4% and decay recall of 62.8% on a held-out test set. An ablation on the time-delta feature alone yields a 107% relative improvement in decay recall. We frame the resulting model as a high-recall triage classifier whose role is to surface candidate events for downstream filtering, not to issue final attributions, and discuss the path toward a Neural-ODE-based orbital world model.

[AI-144] Attribution-based Explanations for Markov Decision Processes

【速读】:该论文旨在解决现有可解释性技术(如Attribution Techniques)主要针对静态输入特征在单一时间点的重要性分配,无法有效应用于序列决策场景的问题。其核心挑战在于如何为马尔可夫决策过程(Markov Decision Process, MDP)提供基于重要性得分的解释,尤其需要量化个体状态和执行路径的重要性。解决方案的关键在于提出一种形式化的赋值准则,并通过策略合成(Strategy Synthesis)技术高效计算这些重要性得分,从而在MDP固有的非确定性下仍能实现可扩展的解释能力。

链接: https://arxiv.org/abs/2605.09780
作者: Paul Kobialka,Andrea Pferscher,Francesco Leofante,Erika Ábrahám,Silvia Lizeth Tapia Tarifa,Einar Broch Johnsen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Attribution techniques explain the outcome of an AI model by assigning a numerical score to its inputs. So far, these techniques have mainly focused on attributing importance to static input features at a single point in time, and thus fail to generalize to sequential decision-making settings. This paper fills this gap by introducing techniques to generate attribution-based explanations for Markov Decision Processes (MDPs). We give a formal characterization of what attributions should represent in MDPs, focusing on explanations that assign importance scores to both individual states and execution paths. We show how importance scores can be computed by leveraging techniques for strategy synthesis, enabling the efficient computation of these scores despite the non-determinism inherent in an MDP. We evaluate our approach on five case-studies, demonstrating its utility in providing interpretable insights into the logic of sequential decision-making agents.

[AI-145] Marrying Generative Model of Healthcare Events with Digital Twin of Social Determinants of Health for Disease Reasoning ICML2026

【速读】:该论文旨在解决现有生成式疾病预测模型在建模过程中忽视社会健康决定因素(Social Determinants of Health, SDoH)的问题,尤其是缺乏对ICD编码代理变量(如ICD-10中章节Z和V–Y)的显式建模,从而限制了个性化疾病建模与临床决策支持的能力。其解决方案的关键在于提出一种基于ICD编码SDoH代理变量的条件潜在扩散框架(conditioned latent diffusion framework),将多器官传感器数据(如脑网络图结构、其他器官的表格型数据)与分词化的医疗事件关联起来,并引入一种新型几何扩散模型以刻画复杂数据表示(如脑区连接图)的时间演化过程,实现对疾病轨迹的模拟干预与推理。该方法在UK Biobank数据集上显著优于当前最先进的疾病自回归模型与影像特征生成基线。

链接: https://arxiv.org/abs/2605.09771
作者: Ziquan Wei,Tingting Dan,Guorong Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 21 pages, 8 figures, ICML 2026

点击查看摘要

Abstract:Despite the central role of sensor-derived measurements such as imaging traits and plasma biomarkers in biomedical research and clinical practice, existing generative models for disease prediction largely depend on event-level representations from hospital and registry data. Given the multi-factorial nature of human disease, the absence of explicit modeling of social determinants of health (SDoH), even in the limited form of ICD-coded proxies (chapters Z and V–Y in ICD-10), limits the capacity for personalized disease modeling and clinical decision support. To address this limitation, we propose a generative model with ICD-coded proxies of SDoH for in silico modeling of disease reasoning, a conditioned latent diffusion framework that establishes the connection between multi-organ sensor data with tokenized healthcare events. Specifically, we introduce a novel geometric diffusion model to characterize the temporal evolution of complex data representation such as brain networks (region-to-region connectivity encoded in a graph), in parallel with diffusion models for tabular data from other organ systems. Together, we integrate the generative model with digitalized SDoH proxies (coined \modelname) for simulated intervention and reasoning of future disease trajectories. We conduct extensive experiments on the UK Biobank (UKB) dataset, which contains organ-specific imaging traits, including brain (44,834), heart (23,987), liver (28,722), and kidney (32,155), along with nearly 500k medical history sequences (age range: 25 to 89 years). Our \modelname achieves significant improvements over state-of-the-art human disease autoregressive models and imaging trait generative baselines.

[AI-146] UTS at PsyDefDetect: Multi-Agent Councils and Absence-Based Reasoning for Defense Mechanism Classification

【速读】:该论文旨在解决情感支持对话中心理防御机制(Defense Mechanism)自动分类的问题,目标是提升对防御机制识别的准确性。其解决方案的关键在于:首先,从临床视角出发,将防御机制的本质理解为“缺失”——即情感缺失(missing affect)、认知阻断(blocked cognition)和现实否认(denied reality),并将其编码为提示层面的“情感-认知整合谱”(affect-cognition integration spectrum),这一设计带来了最大单点提升(+11.4pp F1);其次,采用多阶段推理型代理委员会架构(multi-phase deliberative council of Gemini 2.5 agents),由类别专属倡导者评估证据强度而非简单投票,实现无需微调即可达到F1=0.382(Top-5);最后,针对少数类误判问题,引入由三个微调后的Qwen3.5模型组成的覆盖集成策略(override ensemble),通过结构化多代理系统(构建者、批评者、回归保护者)精准选择16个覆盖规则,显著提升性能(+2.4pp F1),且在一次迭代中超越此前8次尝试的总和。

链接: https://arxiv.org/abs/2605.09769
作者: Dima Galat,Marian-Andrei Rizoiu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper describes our system for classifying psychological defense mechanisms in emotional support dialogues using the Defense Mechanism Rating Scales (DMRS), placing second (F1 0.406) among 64 teams. A central insight is that defense mechanisms are defined by what is absent: missing affect, blocked cognition, denied reality. We encode this as an affect-cognition integration spectrum in prompt-level clinical rules, which account for the largest single gain (+11.4pp F1). Our architecture is a multi-phase deliberative council of Gemini 2.5 agents where class-specific advocates rate evidence strength rather than voting, achieving F1 0.382 with no fine-tuning - a top-5 result on its own. We find, however, that the council is confidently wrong about minority classes: 59-80% of stable minority predictions are incorrect, driven by a systematic "L7 attractor" in which emotional content defaults to the majority class. A targeted override ensemble from three fine-tuned Qwen3.5 models applies 16 overrides (+2.4pp), selected by a structured multi-agent system (builder, critic, regression guard) that produced a larger F1 gain in one iteration than 8 prior attempts combined.

[AI-147] WISTERIA: Learning Clinical Representations from Noisy Supervision via Multi-View Consistency in Electronic Health Records

【速读】:该论文旨在解决电子健康记录(Electronic Health Records, EHR)中表示学习因临床标签弱监督导致的模型性能受限问题,具体表现为标签来源多样、噪声大且具有机构特异性(如账单编码、启发式表型和不完整标注)。其解决方案的关键在于提出WISTERIA框架,该框架将临床标签建模为潜在临床状态的随机观测,而非固定目标;通过构建多个弱监督算子并强制其诱导的标签分布一致性来学习表示,从而隐式实现去噪机制,使模型能够从噪声标签中恢复出临床意义明确的结构。此外,引入基于本体的标签空间正则化进一步约束监督信号的语义一致性,显著提升模型在跨机构场景下的泛化能力与鲁棒性。

链接: https://arxiv.org/abs/2605.09765
作者: Ruan Dong,Yuanyun Zhang,Shi Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Representation learning in electronic health records (EHR) has largely followed paradigms inherited from natural language processing, relying on sequence modeling and reconstruction based objectives that treat clinical labels as ground truth. However, real world clinical supervision is inherently weak, arising from heterogeneous, noisy, and institution specific labeling processes such as billing codes, heuristic phenotypes, and incomplete annotations. In this work, we propose WISTERIA, a weakly supervised representation learning framework that models labels as stochastic observations of an underlying latent clinical state. Instead of optimizing against a single supervision signal, WISTERIA constructs multiple weak supervision operators and learns representations by enforcing consistency across their induced label distributions. This multi view formulation induces an implicit denoising mechanism, allowing the model to recover clinically meaningful structure by reconciling disagreement between noisy labelers. We further incorporate ontology aware regularization in the label space to impose semantic structure over supervision signals. Empirically, WISTERIA improves predictive performance across standard EHR benchmarks, demonstrates strong robustness to label noise, and exhibits superior cross institutional generalization compared to sequence based pretraining objectives. These results suggest that explicitly modeling the supervision process rather than treating labels as fixed targets provides a more appropriate inductive bias for learning robust and clinically meaningful representations from EHR data.
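A plausible reading of the multi-view consistency objective is sketched below in PyTorch: each weak supervision operator induces a label distribution from the shared patient representation, and pairwise disagreement between those distributions is penalized. The symmetric KL divergence is our choice of discrepancy, and the ontology-aware regularizer is omitted here.

```python
import torch
import torch.nn.functional as F

def multi_view_consistency_loss(logits_per_view):
    """logits_per_view: list of (batch, n_classes) tensors, one per weak
    labeler (billing codes, heuristic phenotypes, ...), all computed from the
    same representation. Penalizes pairwise symmetric KL between the induced
    label distributions."""
    log_probs = [F.log_softmax(z, dim=-1) for z in logits_per_view]
    loss, n_pairs = 0.0, 0
    for i in range(len(log_probs)):
        for j in range(i + 1, len(log_probs)):
            p_i, p_j = log_probs[i].exp(), log_probs[j].exp()
            loss = loss + F.kl_div(log_probs[i], p_j, reduction="batchmean") \
                        + F.kl_div(log_probs[j], p_i, reduction="batchmean")
            n_pairs += 1
    return loss / max(n_pairs, 1)
```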

[AI-148] LEVI: Stronger Search Architectures Can Substitute for Larger LLMs in Evolutionary Search

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的进化搜索方法(如AlphaEvolve)在系统研究等任务中成本高昂的问题,其根源在于现有框架在搜索策略上的低效分配:缺乏多样性保护导致依赖更强的突变模型、盲目使用前沿模型处理本可由小型模型完成的局部修改、以及全量评估造成冗余计算。解决方案的关键在于提出LEVI框架,该框架通过三个核心改进实现更高效的进化搜索:一是从初始阶段即建立并持续维护解空间多样性(solution database);二是引入智能突变路由机制,合理分配大型与小型LLM的使用场景以发挥各自优势;三是设计保留排序的代理基准(rank-preserving proxy benchmark),降低高开销场景下的采样成本。实验表明,LEVI在多个系统研究和提示优化任务上均显著优于现有方法,在预算缩小3.3–6.7倍的情况下达到或超越先进框架性能,甚至在某些问题上以35倍更低的成本达成最优结果。

链接: https://arxiv.org/abs/2605.09764
作者: Temoor Tanveer
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM-guided evolutionary methods such as AlphaEvolve have proven effective in domains like math, systems research, and algorithmic discovery, but their reliance on frontier models makes each run expensive. We argue this is largely an artifact of how existing frameworks allocate search: archives that fail to preserve solution diversity force compensation through stronger mutation models; blind model use spends frontier dollars on local edits a smaller model could handle; and full-set evaluation wastes rollouts on redundant examples. We introduce LEVI, a harness-first evolutionary framework built on the bet that stronger search architectures can substitute for or even outperform larger LLMs in evolutionary search. LEVI improves on three core components of evolutionary search: a solution database that establishes diversity from the beginning, and then maintains it throughout the run; a smarter mutation router that plays into the strengths of large and small LLMs; and a rank-preserving proxy benchmark for rollout-heavy settings. Across systems-research benchmarks LEVI attains the highest score on a budget 3.3-6.7x smaller than the published frontier-model runs of existing frameworks like ShinkaEvolve, GEPA, and AdaEvolve; on one problem, LEVI matches the existing best at a 35x lower cost. On prompt optimization, LEVI matches or exceeds GEPA at less than half of its rollout budget on four different benchmarks. LEVI is available as an open-source framework at this https URL.

[AI-149] Primal-Dual Guided Decoding for Constrained Discrete Diffusion

【速读】:该论文旨在解决离散扩散模型(Discrete Diffusion Models)在生成结构化序列时难以强制执行全局属性约束的问题。其核心挑战在于如何在不重新训练模型的前提下,确保生成结果满足预设的约束条件(如化学稳定性、主题一致性或音乐风格匹配等)。解决方案的关键在于提出一种**原对偶引导解码(Primal-Dual Guided Decoding)**方法,该方法将约束生成建模为一个KL正则化的优化问题,并通过自适应拉格朗日乘子在线求解。在每一步去噪过程中,利用基于约束违反程度的镜面下降法更新乘子,并以加性偏置形式调整token logits,该偏置来源于约束最优的KL正则投影,从而在最小偏离原始模型分布的同时严格满足约束。此方法无需额外训练或模型评估,支持多约束并行处理,并提供约束违反的理论边界。

链接: https://arxiv.org/abs/2605.09749
作者: Federico Tomasi,Dmitrii Moor,Alice Wang,Mounia Lalmas
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Discrete diffusion models generate structured sequences by progressively unmasking tokens, but enforcing global property constraints during generation remains an open challenge. We propose primal-dual guided decoding, an inference-time method that formulates constrained generation as a KL-regularised optimisation problem and solves it online via adaptive Lagrangian multipliers. At each denoising step, the method modifies token logits through an additive, constraint-dependent bias, with multipliers updated by mirror descent based on constraint violation. The bias arises as the optimal KL-regularised projection of the constraint, so the constrained distribution remains as close as possible to the model’s unconstrained distribution while still satisfying the constraint. The method requires no retraining and no additional model evaluations beyond standard sampling, supports multiple simultaneous constraints, and provides formal bounds on constraint violation. We evaluate our approach on topical text generation, molecular design, and music playlist generation, showing that a single algorithm instantiated via domain-specific scoring functions improves constraint satisfaction while preserving relevant domain-specific quality metrics.
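The per-step mechanics are compact; the sketch below shows the additive logit bias and a multiplicative (exponentiated-gradient) multiplier update driven by the observed violation. The token-level constraint scores and the step size are placeholders for whatever domain-specific scoring function is plugged in.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def guided_decode_step(logits, constraint_scores, lam):
    """logits: (V,) unconstrained next-token logits; constraint_scores: (V,)
    per-token contribution toward satisfying the constraint; lam: current
    Lagrange multiplier. The result is the KL-regularized tilt of the model's
    distribution toward the constraint."""
    return softmax(logits + lam * constraint_scores)

def update_multiplier(lam, violation, eta=0.1):
    """Mirror-descent style update on the nonnegative multiplier; violation
    is positive while the constraint is unmet and <= 0 once satisfied."""
    return lam * np.exp(eta * violation)
```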

[AI-150] Sequential Feature Selection for Efficient Landslide Segmentation from Multi-Spectral Data

【速读】:该论文旨在解决遥感影像中滑坡检测模型对冗余或高度相关输入通道依赖过重的问题,此类输入不仅降低模型的物理可解释性、增加计算开销,还可能因Hughes现象导致性能下降。其解决方案的关键在于提出一种系统且可解释的特征选择框架,采用顺序前向浮选法(Sequential Forward Floating Selection, SFFS)结合轻量级U-Net++代理模型,迭代构建并修剪候选特征池,从而识别出仅需8个通道即可达到甚至超越使用多达30个通道时的分割F1分数。该方法突破了传统单通道剔除测试无法捕捉特征间交互效应的局限,实现了对滑坡模型真正依赖的光谱与地形特征的深入剖析,为地球观测领域输入设计提供了更合理的理论依据。

链接: https://arxiv.org/abs/2605.09746
作者: Arsalaan Ahmad,Oktay Karakus,Paul L. Rosin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: In Process of Submission to Frontiers in Remote Sensing. Keywords: landslide segmentation, multispectral remote sensing, feature selection, explainability, Landslide4Sense

点击查看摘要

Abstract:Landslide detection from satellite imagery has advanced through deep learning, yet most models rely on large, highly correlated spectral-topographic inputs whose contributions remain poorly understood. The question of which channels are actually necessary has received surprisingly little attention. This matters: redundant or correlated inputs obscure physical interpretability, inflate computational overhead, and can actively degrade model performance through the Hughes Phenomenon. We present a systematic, explainable channel-selection framework for the Landslide4Sense benchmark, combining Sentinel-2 multispectral and ALOS PALSAR terrain data with 16 engineered spectral and structural indices. Rather than relying on conventional single-band drop tests, which evaluate channels in isolation and miss interaction effects, we apply Sequential Forward Floating Selection (SFFS) to iteratively build and prune a candidate feature pool using a lightweight U-Net++ proxy model. Beyond identifying a compact 8-channel subset that matches or exceeds the segmentation F1 of configurations using up to 30 channels, we use the selection process itself to interrogate which spectral and topographic features landslide models genuinely rely on, and what this reveals about the physical cues driving their predictions. We argue that SFFS represents a principled feature selection approach to input design in Earth observation, in contrast to the prevailing practice of appending every available band and hoping the model learns what to ignore.
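The SFFS loop itself is generic; the sketch below takes a pluggable `score(subset)` callable, which in this paper's setting would train and evaluate the lightweight U-Net++ proxy on the chosen channels. The stopping rule and the floating (conditional removal) condition follow the textbook algorithm, not necessarily the authors' exact configuration.

```python
def sffs(all_channels, score, max_size=8):
    """Sequential Forward Floating Selection over channel indices.
    score(subset) returns validation performance (higher is better)."""
    best = {}                                    # best (subset, score) found per subset size
    def record(subset):
        s = score(subset)
        if len(subset) not in best or s > best[len(subset)][1]:
            best[len(subset)] = (list(subset), s)

    selected = []
    while len(selected) < max_size:
        # forward step: add the single most helpful remaining channel
        cand = max((c for c in all_channels if c not in selected),
                   key=lambda c: score(selected + [c]))
        selected = selected + [cand]
        record(selected)
        # floating step: drop a channel while that beats the best subset of the smaller size
        while len(selected) > 2:
            drop = max(selected, key=lambda c: score([x for x in selected if x != c]))
            reduced = [x for x in selected if x != drop]
            if score(reduced) > best[len(reduced)][1]:
                selected = reduced
                record(selected)
            else:
                break
    return best
```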

[AI-151] Entropy-informed Decoding: Adaptive Information-Driven Branching ICML2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在文本生成过程中因固定解码策略导致的效率与质量权衡问题:传统采样方法(如top-k、nucleus sampling)通常仅选择单一路径,难以捕捉多样性和最优解;而基于搜索的方法(如束搜索、best-of-n)虽能提升质量,却常因不区分任务复杂度而造成冗余计算。解决方案的关键在于提出一种可插拔、与模型无关的熵感知解码(Entropy-informed decoding, EDEN)框架,其核心机制是根据模型输出分布的熵动态调整分支因子——高熵区域扩大候选集以探索更多可能性,低熵区域则采用更贪婪的路径以节省计算资源,从而在相同扩展预算下逼近更高宽度束搜索的效果,并理论上证明了熵单调递增的分支策略优于任意固定分支因子的解码结果。

链接: https://arxiv.org/abs/2605.09745
作者: Benjamin Patrick Evans,Sumitra Ganesh,Leo Ardon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注: Accepted at ICML 2026

点击查看摘要

Abstract:Large language models (LLMs) achieve remarkable generative performance, yet their output quality is dependent on the decoding strategy. While sampling-based methods (e.g., top-k, nucleus) and search-and-select based methods (e.g., beam search, best-of-n, majority voting) can improve upon greedy decoding, both approaches suffer from limitations: sampling generally commits to a single path, while search often expends excessive computation regardless of task complexity. To address these, we introduce Entropy-informed decoding (EDEN), a plug-and-play, model-agnostic decoding framework that adaptively allocates computation based on the model’s own uncertainty, approximating higher-width beam search with fewer expansions. At each generation step, EDEN estimates the entropy of the output token distribution and adjusts the branching factor monotonically with the entropy, expanding more candidates in high-entropy regions and following a greedier path in low-entropy regions, improving token efficiency. Experiments across complex tasks, including mathematical reasoning, code generation, and scientific questions, demonstrate that EDEN consistently improves output quality over existing decoding strategies, achieving better accuracy-expansion trade-offs than fixed-width beam search. By treating next-token selection as a noisy maximisation problem, we prove that branching factors monotone in entropy are guaranteed to find better (i.e. more probable) continuations than any fixed branching factor within the same total expansion budget, and derive explicit regret rates characterising the benefit of the adaptive allocation.
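The per-step rule is small enough to sketch: compute the entropy of the next-token distribution and map it monotonically to a branching factor. The linear mapping and the cap below are illustrative choices, not the paper's tuned schedule.

```python
import numpy as np

def entropy(probs, eps=1e-12):
    return float(-(probs * np.log(probs + eps)).sum())

def branching_factor(probs, max_branch=8):
    """Monotone map from next-token entropy to the number of candidates to
    expand: near-deterministic steps stay greedy, uncertain steps branch more."""
    h_max = np.log(len(probs))                   # entropy of the uniform distribution
    k = 1 + int((max_branch - 1) * entropy(probs) / h_max)
    return min(k, max_branch)

def expand(probs, max_branch=8):
    """Token ids to branch on at this step, most probable first."""
    return np.argsort(probs)[::-1][:branching_factor(probs, max_branch)]
```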

[AI-152] TIDES: Implicit Time-Awareness in Selective State Space Models

【速读】:该论文旨在解决选择性状态空间模型(Selective State Space Models, SSMs)与连续时间SSMs在建模不规则时间序列时的权衡问题:前者通过将时间离散化步长 \Tilde\Delta 设为输入相关函数来增强每个token的表达能力,但丧失了其物理采样间隔意义;后者如S5保持\Tilde\Delta \equiv \Delta的物理含义并原生支持不规则时间戳,却受限于线性时不变(LTI)动态,导致单token表达能力不足。解决方案的关键在于提出TIDES——一种新型选择性SSM变体,它将输入依赖性从步长\Tilde\Delta转移到状态矩阵的对角线上,从而保留\Tilde\Delta的物理意义,使模型能原生处理不规则时间戳,同时维持选择性SSMs的高表达能力。

链接: https://arxiv.org/abs/2605.09742
作者: Taylan Soydan,Miguel A. Bessa,Dirk Mohr,Rui Barreira
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint submitted for peer-review

点击查看摘要

Abstract:Selective state space models (SSMs), such as Mamba, achieve strong per-token expressivity by making the time discretization step \Tilde\Delta a learned function of the input. However, in doing so, \Tilde\Delta ceases to represent a physical sampling interval, limiting its irregular time series modeling capability. Continuous-time SSMs, such as S5, preserve the physical meaning of \Tilde\Delta and handle irregular timestamps natively (\Tilde\Delta \equiv \Delta), but their dynamics remain linear time-invariant (LTI), limiting per-token expressivity. We propose TIDES, a selective SSM variant that reconciles selective and continuous architectures by moving input-dependence off the step size and onto the diagonal state matrix. As a result, \Tilde\Delta retains its physical meaning, tied to the state discretization, allowing the model to handle irregular timestamps natively without sacrificing the per-token expressivity that makes selective SSMs effective. We show this on a novel Fading Flash experimental benchmark, a compact controlled diagnostic for sequence models that jointly tests input-dependence and extrapolation to out-of-distribution \Delta values, and isolates the distinct failure modes of current state-of-the-art architectures that TIDES avoids by construction. On large-scale benchmarks, TIDES sets the new state-of-the-art average rank on UEA time-series classification and the Physiome-ODE regression benchmark. Code available at: this https URL.
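A toy recurrence illustrating the stated design split, with input-dependence moved onto the diagonal state matrix while the physical time gap serves as the discretization step, is given below. The sigmoid gating and the Euler-style input term are simplifications and not the paper's parameterization.

```python
import numpy as np

def tides_like_scan(x, dt, A_base, W_a, B):
    """x: (T, d_in) observations; dt: (T,) physical time gaps; A_base: (d_state,)
    negative base decay rates; W_a: (d_state, d_in); B: (d_state, d_in).
    The diagonal state matrix is input-dependent, while dt keeps its meaning
    as a real sampling interval, so irregular timestamps are handled natively."""
    h, out = np.zeros(A_base.shape[0]), []
    for t in range(x.shape[0]):
        gate = 1.0 / (1.0 + np.exp(-(W_a @ x[t])))          # input-dependent modulation in (0, 1)
        a_t = A_base * gate                                  # stays negative, so the state decays
        h = np.exp(a_t * dt[t]) * h + dt[t] * (B @ x[t])     # discretized with the *actual* dt
        out.append(h.copy())
    return np.stack(out)
```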

[AI-153] KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving

【速读】:该论文旨在解决静态图大语言模型(LLM)解码器在在线推理场景下因KV缓存(Key-Value Cache, KV-cache)行为高度不规则而导致的性能瓶颈问题,包括请求长度差异、结束标记(EOS)异步到达以及逻辑历史碎片化等挑战。其核心解决方案是提出KV-RM运行时设计,关键在于通过分层抽象将逻辑KV缓存历史与物理存储解耦,利用块页管理器(block pager)追踪活跃KV状态,并以单个已提交描述符(committed descriptor)驱动每一步解码操作;同时引入合并式传输路径,在固定形状注意力核执行前将非连续KV映射聚合成少量大尺寸传输组,从而在保持静态图执行优势的同时有效吸收动态调度带来的灵活性需求。

链接: https://arxiv.org/abs/2605.09735
作者: Zhiqing Zhong,Zhijing Ye,Jian Zhang,Weijian Zheng,Bolun Sun,Xiaodong Yu
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Operating Systems (cs.OS)
备注: 14 pages, 7 figures, 7 tables

点击查看摘要

Abstract:Static-graph LLM decoders provide predictable launches, fixed tensor shapes, and low submission overhead, but online decoding exposes highly irregular KV-cache behavior: request lengths differ, EOS events arrive asynchronously, and logical histories fragment over time. Dynamic runtimes recover flexibility through paged KV management and step-level scheduling, while static-graph executors often over-reserve memory and suffer burst-time latency outliers. This paper studies whether much of this variability can be absorbed below a fixed decode interface. We present KV-RM, a runtime design that regularizes KV-cache movement beneath a static-graph LLM decoder. KV-RM decouples logical KV histories from physical storage, tracks active KV state through a block pager, and materializes each decode step through a single committed descriptor. A merge-staged transport path coalesces non-contiguous KV mappings into a small number of large transfer groups before a fixed-shape attention kernel consumes them. Optional bounded far-history summaries can be enabled under the same interface, but the core design does not depend on them. On a 2-GPU NVIDIA A100 node, KV-RM improves mixed-length decoding throughput and tail latency relative to a static-graph baseline, reduces reserved KV memory across workload families, and removes severe burst-time latency spikes under production-trace replay. These results suggest that KV-cache movement, rather than kernel shape, can be an effective boundary for recovering runtime flexibility in static-graph LLM serving.

[AI-154] One for All: A Non-Linear Transformer can Enable Cross-Domain Generalization for In-Context Reinforcement Learning

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)中模型泛化能力不足的问题,即如何使学习到的策略或价值函数在训练任务之外的新任务上仍能有效工作。传统方法依赖多任务学习或多层元学习(Meta Reinforcement Learning),而本文提出利用Transformer架构通过上下文学习(in-context learning)实现无需显式参数更新的任务适应。其解决方案的关键在于从核方法(kernel-based)视角重新理解Transformer:将Transformer视为在再生核希尔伯特空间(Reproducing Kernel Hilbert Space, RKHS)中进行回归的函数算子,从而证明不同领域(domain)的价值函数若位于同一RKHS内,即可共享一组权重,实现跨任务泛化。实验在多个MetaWorld环境中的结果验证了该理论框架下时序差分(Temporal Difference, TD)目标的收敛性,支持了基于核表示的通用性设计思路。

链接: https://arxiv.org/abs/2605.09727
作者: Bowen He,Juncheng Dong,Lin Lin,Xiang Cheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A central challenge in reinforcement learning (RL) is to learn models that generalize beyond the tasks on which they are trained, a goal traditionally pursued through multi-task and meta RL. Recently, transformer architectures have emerged as a promising approach, enabling adaptation to new tasks via in-context learning without explicit parameter updates. From a functional perspective, a transformer can be viewed as a functional operator that maps a context to a task-specific function. It is thus fundamental to understand and design this operator to support stronger generalization in RL. In this work, we address this resulting question of generalization from a kernel-based perspective by establishing a connection between non-linear transformers and kernel-based temporal difference learning. By interpreting the transformer as performing regression in a Reproducing Kernel Hilbert Space (RKHS), we show that value functions from different domains can be represented using a shared set of weights, provided they lie within the same RKHS. Experiments on multiple MetaWorld domains support this interpretation, demonstrating convergence of the temporal-difference objective.
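For intuition about the RKHS view of value estimation, here is a toy kernel-based TD(0) sketch in which the value function is a weighted sum of kernel evaluations at observed states. This is a generic construction under an RBF-kernel assumption, not the paper's transformer-as-operator analysis.

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def kernel_td0(transitions, discount=0.9, lr=0.1, sweeps=100, kernel=rbf):
    """Kernel TD(0): V(s) = sum_i w_i * k(s_i, s), one weight per observed
    state, updated from the TD error (semi-gradient, linear in the kernel
    features)."""
    states = [s for (s, _, _) in transitions]
    w = np.zeros(len(states))

    def V(s):
        return sum(w_j * kernel(s_j, s) for w_j, s_j in zip(w, states))

    for _ in range(sweeps):
        for s, r, s_next in transitions:
            td_error = r + discount * V(s_next) - V(s)
            for j, s_j in enumerate(states):
                w[j] += lr * td_error * kernel(s_j, s)   # dV(s)/dw_j = k(s_j, s)
    return V

# toy chain 0 -> 1 -> 2 -> 3 with a reward of 1.0 on the last step
transitions = [(np.array([0.0]), 0.0, np.array([1.0])),
               (np.array([1.0]), 0.0, np.array([2.0])),
               (np.array([2.0]), 1.0, np.array([3.0]))]
V = kernel_td0(transitions)
print([round(V(np.array([s])), 3) for s in (0.0, 1.0, 2.0)])
```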

[AI-155] Security Risks in Tool-Enabled AI Agents : A Systematic Analysis of Privileged Execution Environments

【速读】:该论文旨在解决云环境中部署自主AI代理(AI agents)所带来的安全风险问题,特别是这些代理在特权执行环境中通过工具执行有副作用的操作时可能引发的安全隐患。其解决方案的关键在于提出了一种风险分类体系(taxonomy of risk categories),并通过三个典型场景揭示了风险来源主要源于工具权限过度授予、能力与意图不匹配以及执行环境中的隐性权限泄露(ambient authority leakage)。基于此分析,论文进一步提出了实用的设计指南,以指导更安全地在云端部署AI代理。

链接: https://arxiv.org/abs/2605.09721
作者: Hardik Goel
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Extended author preprint. A shortened version has been accepted as a short paper at IEEE COMPSAC 2026. 7 pages, 3 figures/tables

点击查看摘要

Abstract:Tool-enabled AI agents are increasingly deployed in cloud-hosted environments and offered as services, where they perform side-effecting operations through privileged tools within execution environments. While such agents enable powerful automation, the security implications of hosting autonomous agents in privileged execution environments are not yet fully explored. This paper presents a structured analysis of security risks associated with cloud-hosted AI agents. We introduce a taxonomy of risk categories, illustrate these risks through three representative agent scenarios, and discuss mitigation strategies along with their tradeoffs. A small controlled experiment empirically illustrates risk manifestation and the effect of lightweight mitigations in this setup. Our analysis suggests that many risks in autonomous cloud agents arise not from novel vulnerabilities, but from over-privileged tools, capability-intent mismatches, and ambient authority leakage in execution environments. Based on these findings, we derive practical design guidelines for deploying AI agents in the cloud more securely.

[AI-156] Medical Model Synthesis Architectures: A Case Study

【速读】:该论文旨在解决当前人工智能(AI)系统在临床决策中难以实现校准的不确定性推理以及推理过程缺乏透明性的问题。其解决方案的关键在于提出一种名为MedMSA的框架,该框架利用语言模型检索相关先验知识,并构建形式化的概率模型以支持在不确定性下的校准且可验证的推断,从而实现既实用又透明的临床预测。

链接: https://arxiv.org/abs/2605.09716
作者: Katherine M. Collins,Marlene Berke,Ilia Sucholutsky,Ayman Ali,Adrian Weller,Timothy J. O’Donnell,Tyler Brooke-Wilson,Lionel Wong,Joshua B. Tenenbaum
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Working paper

点击查看摘要

Abstract:Medicine is rife with high-stakes uncertainty. Doctors routinely make clinical judgments and decisions that juggle many fundamental unknowns, like predictions about what might be causing a patient's symptoms or decisions about what treatment to try next. Despite increasing interest in developing AI systems that aid or even replace doctors in clinical settings, current systems struggle with calibrated reasoning under uncertainty, and are often deeply opaque about their reasoning. We propose a framework for AI systems that can make practically useful but formally transparent clinical predictions under uncertainty. Given a clinical situation, our framework (MedMSA) uses language models to retrieve relevant prior knowledge, but constructs a formal probabilistic model to support calibrated and verifiable inferences under uncertainty. We show how an initial proof-of-concept of this framework can be used for differential diagnosis, producing an uncertainty-weighted list of potential diagnoses that could explain a patient's symptoms, and discuss future applications and directions for applying this framework more generally for safe clinical collaborations.

[AI-157] Metal-Sci: A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon

【速读】:该论文旨在解决生成式 AI (Generative AI) 在自动优化科学计算内核(scientific compute kernels)时缺乏可靠泛化能力的问题,特别是在面对未见过的数据规模时可能出现的“无声回归”(silent regression)现象。解决方案的关键在于引入一个结构化的评估机制:在自动搜索循环中使用一个仅在运行结束时评估一次的持留门评分函数 ΦT\Phi_\mathcal{T},该函数基于模型从未见过的测试规模进行打分,从而作为廉价但有效的机械监督机制,识别出仅在分布内表现良好但在分布外失效的错误优化策略,如某些模板在新维度下返回错误结果或FFT3D内核在更大规模上性能崩溃的情况。这一方法显著提升了自动内核搜索过程的鲁棒性和可靠性。

链接: https://arxiv.org/abs/2605.09708
作者: Víctor Gallego
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: Preprint

点击查看摘要

Abstract:We present Metal-Sci, a 10-task benchmark of scientific Apple Silicon Metal compute kernels spanning six optimization regimes (stencils, all-pairs in n-body problems, multi-field Boltzmann, neighbor-list molecular dynamics, multi-kernel PDE, FFT). Each task ships a CPU reference, a roofline-anchored fitness function, and a held-out generalization size. We pair the benchmark with a lightweight harness for automatic kernel search that runtime-compiles each candidate, scores it against the roofline across multiple sizes, and feeds structured compile and per-size correctness diagnostics back to a frozen LLM driving a (1+1) evolutionary loop. We report matched single-model sweeps of Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5.5 on M1 Pro: in-distribution self-speedups span 1.00\times to 10.7\times. Beyond raw speedup, our central methodological claim is structural: the held-out gate scoring function \Phi_{\mathcal{T}} (evaluated once at end-of-run on a configuration the agent never sees during search) functions as a cheap mechanical oversight primitive on this automatic search loop, catching e.g. an Opus template<uint D> HMC win that returns wrong samples at unseen dimensions, and a GPT FFT3D best that wins in-distribution at 2.95\times speedup but collapses to 0.23\times on a 256^3 held-out cube, a silent regression that the in-distribution score alone cannot see. Code at this https URL
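The held-out gate idea is easy to mimic in miniature: score candidates only on in-distribution sizes during a (1+1) evolutionary loop, then evaluate the final winner exactly once on a size never used during search. In the toy below, the "kernel" is just a tiling factor whose score silently collapses on sizes it does not divide; everything here is an invented stand-in for the paper's LLM-driven Metal kernel search.

```python
import random

def evolve_with_heldout_gate(init, mutate, score, search_sizes, heldout_size,
                             iters=30, seed=0):
    """(1+1) evolutionary loop: candidates are scored only on in-distribution
    sizes during search; the held-out gate is evaluated exactly once at
    end-of-run, on a size never seen during search."""
    rng = random.Random(seed)
    best = init
    best_fit = min(score(best, n) for n in search_sizes)
    for _ in range(iters):
        child = mutate(best, rng)
        child_fit = min(score(child, n) for n in search_sizes)
        if child_fit >= best_fit:              # accept if not worse
            best, best_fit = child, child_fit
    gate = score(best, heldout_size)           # the one end-of-run gate check
    return best, best_fit, gate

# toy "kernel": a tiling factor whose score silently collapses on sizes
# it does not divide
def score(tile, n):
    return (2.0 if n % tile == 0 else 0.2) + 0.01 * tile

best, in_dist, gate = evolve_with_heldout_gate(
    init=2,
    mutate=lambda t, rng: t * 2 if rng.random() < 0.5 else max(1, t // 2),
    score=score,
    search_sizes=[64, 128], heldout_size=96)
print(best, round(in_dist, 2), round(gate, 2))
# in-distribution fitness looks great; the gate exposes the collapse on size 96
```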

[AI-158] Adaptive Data Harvesting for Efficient Neural Network Learning with Universal Constraints

【速读】:该论文旨在解决神经网络在连续域上满足普遍约束(如Lyapunov稳定性或物理规律)时面临的训练挑战,特别是现有基于固定启发式或手工规则的采样方法在收敛速度、稳定性和解质量方面表现不佳的问题。解决方案的关键在于引入一种基于强化学习(reinforcement learning)的动态采样策略,该策略能够根据模型学习过程中的性能变化迭代调整训练样本分布,从而显著提升约束满足的效率与效果,并在Lyapunov神经网络和物理信息神经网络(PINNs)等多个场景中得到验证。

链接: https://arxiv.org/abs/2605.09707
作者: Siteng Kang,Xinhua Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Training neural networks to satisfy universal constraints over continuous domains poses unique challenges. Common examples include Lyapunov Neural Networks (Lyapunov NNs) and Physics-Informed Neural Networks (PINNs), where analytical solutions are generally either unavailable or overly restrictive. Sample-based methods are therefore commonly used to enforce these constraints, and the choice of samples has a substantial impact on convergence speed, stability, and solution quality. Most existing methods rely on fixed heuristics or handcrafted rules, and are suboptimal in practice. In this paper, we aim to improve upon them by learning, from data and experience, how to dynamically and iteratively adjust the samples in response to the model’s evolving learning performance. Trained by reinforcement learning, the learned policy improves empirical constraint satisfaction on test problems while significantly improving efficiency. We validate the approach on both Lyapunov NNs and PINNs, and demonstrate its broader applicability to domains where adaptive input selection is essential for effective training.

[AI-159] Ambig-DS: A Benchmark for Task-Framing Ambiguity in Data-Science Agents

【速读】:该论文旨在解决数据科学代理(Data Science Agent)从“协飞模式”向“自飞模式”转变过程中,因任务框架不明确(misframing)而导致的隐性失败问题。具体而言,当任务存在目标或评估指标模糊时,代理可能无声地采纳一个看似合理但偏离原意的任务框架,生成可执行但错误的结果,而传统基准测试仅关注流水线是否运行成功,忽略了对任务规范性的识别能力。解决方案的关键在于构建了两个诊断套件 Ambig-DS-Target 和 Ambig-DS-Objective,分别针对预测目标和评估目标的模糊性进行系统性建模,并通过受控编辑产生原始清晰版本与模糊变体配对,结合人工与大语言模型(LLM)验证确保每种变体具有多个决策相关的合理解释。实验表明:(1)失败表现为沉默的错误承诺而非执行错误;(2)允许代理提出一次澄清问题可显著恢复性能,说明信息缺失是性能下降的主要原因;(3)代理无法可靠判断何时提问,提示当前策略在主动澄清机制上存在不足。因此,论文指出,识别任务目标与评估标准的不充分指定,而非流水线执行本身,才是当前数据科学代理评估中被忽视的核心瓶颈。

链接: https://arxiv.org/abs/2605.09698
作者: Josefa Lia Stoisser,Marc Boubnovski Martell,Sidsel Boldsen,Kaspar Märtens,Robert Kitchen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As data-science agents shift from co-pilots to auto-pilots, silent misframing becomes a critical failure mode. Agents quietly commit to plausible but unintended task framings, producing clean, executable artifacts that hide their incorrect assessment of the task. Existing benchmarks score whether the pipeline runs, ignoring whether the agent recognized the task was underspecified. We introduce Ambig-DS, two diagnostic suites: one for prediction-target ambiguity (Ambig-DS-Target, 51 tasks built on DSBench, a tabular modeling benchmark) and one for evaluation-objective ambiguity (Ambig-DS-Objective, 61 tasks built on MLE-bench, a Kaggle-style ML competition benchmark), constructed so that scoring uses each source benchmark’s original evaluator. For every task we pair the original, fully specified version with an ambiguous variant produced by controlled edits; a human-and-LLM verification pipeline confirms each variant admits multiple plausible interpretations with decision-relevant consequences. The suites are analyzed independently and ambiguity lowers performance in both. Across five agents spanning efficient to frontier-class models, we find in our controlled diagnostic setting: (i) failures are silent commitments: wrong-target submissions on Target, wrong-metric or non-committal baseline submissions on Objective, rather than execution errors; (ii) allowing the agent to ask one clarifying question recovers much of the loss under idealized conditions, suggesting missing framing information drives a substantial part of the observed degradation; but (iii) agents cannot reliably tell when to use it: permissive prompts induce over-asking on clear tasks, while conservative prompts induce silent defaulting on ambiguous ones. Recognizing target and objective underspecification, not pipeline execution, is the bottleneck missing from standard DS-agent evaluations.

[AI-160] Unpredictability dissociates from structured control in language agents

【速读】:该论文旨在解决生成式 AI (Generative AI) 中行为不可预测性是否等同于有效控制的问题,特别是探讨随机采样(stochastic sampling)能否替代结构化控制机制来实现基于理由(reason)、记忆、自我状态(self-state)和抑制(inhibition)耦合的动作选择。其解决方案的关键在于构建一个可选择性禁用控制组件的语言代理(language-agent)系统,并通过多维度实验设计(包括7个数据集的基线损伤矩阵、匹配接口控制、无上下文测试及跨模型扩展验证)证明:尽管高随机性比较器表现出更强的不可预测性,但只有结构化控制机制能维持动作场耦合(action-field coupling)并稳定实现预定义的行为成分;而随机、后验、乱序或冗余控制均无法再现这种结构化的动作控制特性。

链接: https://arxiv.org/abs/2605.09692
作者: Jia Xiao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 24 pages, 5 figures; supplementary information included

点击查看摘要

Abstract:Unpredictable behavior is often taken as evidence of control, yet stochastic dispersion and structured action control need not coincide. This paper tests whether stochastic sampling can substitute for structured mechanisms that couple reasons, memory, self-state and inhibition to action selection in a language-agent implementation whose control components can be selectively disabled. In a seven-dataset baseline lesion matrix comprising 74,352 calls, the high-stochasticity comparator was more unpredictable than the structured-control variant in 7/7 datasets, whereas targeted reason and veto lesions reduced the expected structured-control profiles in 7/7 datasets each. In a matched-interface control spanning 26,946 generations, the structured agent maintained stronger action-field coupling than all stochastic, post-hoc, scrambled and verbosity controls across every dataset. The primary behavioral test removed free-form trace wording from the evaluation: 57,816 scored records showed the structured-control variant exceeding the high-stochasticity comparator or the reason/veto lesions in 7/7 datasets for all predefined behavioral components. Later open-weight runs extended the no-context controls to Qwen2.5 7B, 14B and 32B and to an independent Mistral-7B family across 20 task families and three agent scaffolds; no-fields, scrambled-context and distribution-matched controls failed to recover structured action control. A three-annotator blinded audit over 1,200 overlap items preserved high agreement. Strict entropy matching, strict token/compute matching and a formal counterfactual-flip stress test did not meet their gates and are treated as limitations. Stochastic unpredictability did not reproduce structured, action-coupled control in this implemented agent family.

[AI-161] Learning Unified Representations of Normalcy for Time Series Anomaly Detection

【速读】:该论文旨在解决无监督异常检测中的核心挑战,即在缺乏异常模式先验知识的情况下,准确识别异常模式。现有方法往往难以学习到与异常模式区分开的稳健正常数据分布表示。解决方案的关键在于提出一种统一的无监督异常检测框架(Unified Unsupervised Anomaly Detection, U²AD),其通过基于得分的生成建模来学习正常样本的潜在数据分布;创新性地引入了一个时变得分网络(time-dependent score network)和一个统一训练目标,从而在同时考虑局部与全局时间上下文的基础上刻画正常数据流形,并通过常微分方程求解器进行确定性采样实现重建,显著提升了异常检测的准确性与早期发现能力。

链接: https://arxiv.org/abs/2605.09685
作者: Prithul Sarker,Sushmita Sarker,Nicholas G. Murray,Alireza Tavakkoli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The core challenge in unsupervised anomaly detection is identifying abnormal patterns without prior knowledge of their characteristics. While existing methods have addressed aspects of this problem, they often struggle to learn a robust representation of the normal data distribution that is distinct from anomalous patterns. In this paper, we present a novel framework, Unified Unsupervised Anomaly Detection (\text{U}^2\text{AD}), that comprehensively addresses anomaly detection in multivariate time series. Our approach learns the underlying data distribution of normal samples by utilizing score-based generative modeling. We introduce a novel time-dependent score network and a unified training objective that together delineate the manifold of normal data while considering both local and global temporal contexts. Reconstruction is then performed via a deterministic sampling process using an ordinary differential equation solver. Our extensive experimental evaluations demonstrate that \text{U}^2\text{AD} not only outperforms current state-of-the-art methods in detection accuracy but also identifies anomalies at significantly earlier stages of their occurrence.

[AI-162] MonitoringBench: Semi-Automated Red-Teaming for Agent Monitoring

【速读】:该论文旨在解决当前针对工具使用型编码代理(tool-using coding agents)的监控系统评估中存在的漏洞问题,即现有红队测试方法可能低估攻击复杂性并高估监控性能。其核心挑战在于:攻击生成中的模式坍缩、攻击构思与执行之间的“概念-执行差距”以及人工 elicitation 成本过高。解决方案的关键在于提出一种半自动化红队测试流水线(semi-automated red-teaming pipeline),通过引入新颖的攻击分类法以扩大覆盖范围,将攻击构建分解为策略生成、执行和事后轨迹优化三个阶段以缓解概念-执行差距,并最终在 BashArena 环境中生成 MonitoringBench 基准数据集(包含 2,644 条攻击轨迹),显著提升了攻击多样性与强度,使前沿监控模型的捕获率从 94.9% 下降至 60.3%,从而更真实地揭示了当前监控能力的局限性与改进方向。

链接: https://arxiv.org/abs/2605.09684
作者: Monika Jotautaitė,Maria Angelica Martinez,Ollie Matthews,Tyler Tracy
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce a red-teaming methodology that exposes harder-to-catch attacks for coding-agent monitors, suggesting that current practices may under-elicit attacks and overstate monitor performance. We identify three challenges with current red-teaming. First, mode collapse in attack generation, which we reduce with a novel attack taxonomy for broader coverage. Second, a conceive-execute gap: frontier LLMs can propose strong attack ideas or execute them, but not all at once. We mitigate this by decomposing attack construction into strategy generation, execution, and post-hoc trajectory refinement. Third, manual elicitation is costly to scale, which we address with our semi-automated red-teaming pipeline. Applied to BashArena, an AI control setting for tool-using coding agents, this pipeline produces MonitoringBench, a benchmark of 2,644 attack trajectories for evaluating monitor capabilities and failure modes. Our pipeline produces more diverse and stronger attacks: Opus-4.5 monitor’s catch rate falls from 94.9% on elicited-only Opus attacks to 60.3% on our best refined attacks, with larger drops for several mid-tier monitors. Attacks optimized against three development monitors generalize to ten held-out monitors, with catch rates generally increasing with monitor capability. Using this benchmark, we provide a snapshot of the current monitor capabilities and find that frontier monitors often detect suspicious actions but fall for persuasion or fail to calibrate suspiciousness scores appropriately, suggesting tractable paths for improvement. MonitoringBench provides both a static benchmark for current tool-use monitors and a reusable methodology for refreshing these evaluations as agents and monitors improve.

[AI-163] Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在简单逻辑推理任务中是否具备鲁棒性的问题,即当现实世界中的常识被人为改变时,LLMs 是否仍能基于逻辑而非经验模式进行推理。其解决方案的关键在于提出 Absurd World 基准测试框架,该框架通过将真实世界场景分解为符号、动作、序列和事件,并对其进行自动化扰动以生成逻辑自洽但违背常识的“荒诞世界”,从而隔离模型对现实世界统计模式的依赖,仅检验其逻辑推理能力。实验表明,该框架能够有效评估不同模型在简单与复杂提示策略下的逻辑推理稳定性,为验证 LLM 的本质推理能力提供了可扩展且可控的测试工具。

链接: https://arxiv.org/abs/2605.09678
作者: Ryan Albright,Golam Md Muktadir,Zarif Ikram,S M Jubaer,Mehrab Hossain,Dianbo Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While extremely powerful and versatile at various tasks, the thinking capabilities of large language models (LLMs) are often put under scrutiny as they sometimes fail to solve problems that humans can systematically solve. However, recent literature focuses on breaking LLM reasoning with increasingly complex problems, and whether an LLM is robust in simple logical reasoning remains underexplored. This paper proposes Absurd World, a benchmarking framework, to test LLMs against altered realism, where scenarios are logically coherent, and humans can easily solve the tasks. Absurd World breaks a real-world model into symbols, actions, sequences, and events, which are automatically altered to create absurd worlds where the logic to solve the tasks remains the same. It evaluates a large collection of models with simple and advanced prompting techniques, and proves that it is an effective tool to determine LLMs’ ability to think logically, ignoring the patterns learned from the real world. One can use this framework to extensively test an LLM against a real-world problem to verify whether the LLM’s reasoning capability is robust against variations of the task.

[AI-164] ChaosNetBench: Benchmarking Spatio-Temporal Graph Neural Networks on Chaotic Lattice Dynamics

【速读】:该论文旨在解决当前短时预测任务中模型评估缺乏跨动力学 regimes 可比性的问题,尤其是在动态物理系统(如交通和天气)中,现有基准数据集通常局限于单一领域且固定划分训练/测试集,难以公平比较不同空间-时间图神经网络(STGNN)架构在多种混沌水平下的性能表现。其解决方案的关键在于构建一个名为 ChaosNetBench (CNB) 的合成基准数据集与评估框架,该框架基于耦合标准映射(coupled standard maps)的晶格系统,具备可独立调节局部混沌强度(K)、耦合强度(ε)和系统规模(N)的能力,从而生成具有已知拓扑结构和动力学特性的96个系统实例及9600条轨迹,并引入混沌指标、量化评估指标与标准化协议,用于系统性分析STGNN对局部与全局混沌的适应能力。

链接: https://arxiv.org/abs/2605.09676
作者: Henok Tenaw Moges,Charalampos Skokos,Deshendran Moodley
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chaotic Dynamics (nlin.CD)
备注: 24 pages, 11 figures

点击查看摘要

Abstract:Spatio-temporal graph neural networks (STGNNs) are widely used for short-term forecasting in dynamic physical systems such as traffic and weather. However, the prevailing evaluation practice uses real world benchmark data sets in a single domain with a single fixed holdout splits, making it difficult to compare architectures across different dynamical regimes. We introduce ChaosNetBench (CNB), a synthetic benchmark dataset and evaluation framework for studying STGNN performance under controlled multidimensional chaotic dynamics. CNB is built on a lattice of coupled standard maps with independently tunable local chaos ( K ), coupling strength ( \varepsilon ), and system size ( N ), providing known topology and known dynamics across 96 system instances and 9,600 trajectories. We introduce chaos indicators, evaluation metrics and a protocol to analyze and compare the capacity of STGNN architectures to deal with different levels of local and global chaos. We illustrate the usage of the framework by analyzing 13 architectures (5 STGNNs and 8 non-graph baselines). The results reveal a regime dependent transition in which non-graph baselines (TCN, N-BEATS, iTransformer) remain competitive when there is low local chaos, while STGNNs (e.g., Graph WaveNet, D2STGNN, STAEformer) are generally more resilient to higher levels of local and global chaos. CNB provides a practical, reusable testbed for systematically comparing and analyzing the capacity of STGNN architectures to handle different levels of local and global chaos.
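The underlying data-generating process is easy to reproduce in a few lines: a ring of standard maps with local kick strength K and a nearest-neighbour coupling of strength ε. The diffusive coupling form below is an assumption for illustration; the benchmark's exact coupling scheme may differ.

```python
import numpy as np

def coupled_standard_map_lattice(theta, p, K, eps, steps):
    """Ring of standard maps: local kick strength K controls chaos, and a
    diffusive nearest-neighbour term of strength eps couples the sites
    (the exact coupling form used by the benchmark is assumed here)."""
    traj = []
    for _ in range(steps):
        kick = (K / (2 * np.pi)) * np.sin(2 * np.pi * theta)
        neighbours = 0.5 * (np.roll(theta, 1) + np.roll(theta, -1))
        p = np.mod(p + kick + eps * (neighbours - theta), 1.0)
        theta = np.mod(theta + p, 1.0)
        traj.append(np.stack([theta, p], axis=-1))
    return np.array(traj)                      # shape (steps, N, 2)

rng = np.random.default_rng(0)
N = 16
traj = coupled_standard_map_lattice(rng.uniform(size=N), rng.uniform(size=N),
                                    K=0.9, eps=0.05, steps=100)
print(traj.shape)   # (100, 16, 2): trajectories for a graph forecaster on a ring
```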

[AI-165] Causal Parametric Drift Simulation: A Digital Twin Framework for Classifier Robustness Evaluation

【速读】:该论文旨在解决机器学习分类器在动态环境中因概念漂移(concept drift)导致性能下降的问题,传统评估方法如静态测试集或噪声扰动无法保留表格数据中的因果依赖关系,从而产生因果无效的评估结果;而事后解释工具如SHAP和LIME仅提供相关性洞察,难以反映模型失效的真实因果机制。其解决方案的关键在于引入结构因果模型(Structural Causal Models, SCM)作为数据生成过程的“数字孪生”(Digital Twins),通过构建可精确干预的因果框架,在保持结构依赖关系的前提下模拟漂移场景,实现对分类器的因果驱动压力测试,从而识别出标准统计监控手段无法发现的潜在脆弱性。

链接: https://arxiv.org/abs/2605.09663
作者: Julien Lafrance,Richard Khoury,Véronique Tremblay
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 34 pages, 13 figures, 14 tables

点击查看摘要

Abstract:Machine learning classifiers in dynamic environments face concept drift – changes in the data-generating process that degrade performance. Conventional evaluation via static test sets or noise perturbations fails to preserve causal dependencies in tabular data, often producing causally invalid assessments. Post-hoc tools like SHAP and LIME offer correlational insights that may not reflect the causal mechanisms driving model failure. We propose a framework that complements existing drift detection by leveraging Structural Causal Models as “Digital Twins” of data-generating processes, enabling precise causal interventions while preserving structural dependencies. Our technique, Causal Parametric Drift Simulation, stress-tests classifiers to identify vulnerabilities before deployment. Experiments on the Open Sourcing Mental Illness (OSMH) dataset demonstrate that this approach exposes latent vulnerabilities invisible to standard statistical monitors.
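A minimal sketch of the "digital twin" idea: fit a classifier on data drawn from a small structural causal model, then intervene on one structural coefficient to simulate drift while preserving the causal dependencies, and watch accuracy degrade. Variables, coefficients, and the drifted parameter are invented for illustration and are unrelated to the OSMI/OSMH data used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def generate(n, rng, stress_coef=1.0):
    """Tiny structural causal model acting as a 'digital twin': intervening
    on one structural coefficient simulates drift while every downstream
    causal dependency is preserved."""
    workload = rng.normal(size=n)
    support = rng.normal(size=n) - 0.3 * workload            # depends on workload
    stress = stress_coef * workload - 0.8 * support + rng.normal(size=n)
    y = (stress + 0.5 * rng.normal(size=n) > 0.5).astype(int)
    return np.column_stack([workload, support, stress]), y

rng = np.random.default_rng(0)
X_train, y_train = generate(5000, rng)                       # pre-drift regime
clf = LogisticRegression().fit(X_train, y_train)

# causal parametric drift simulation: sweep the intervened coefficient
for coef in (1.0, 1.5, 2.0):
    X_test, y_test = generate(5000, rng, stress_coef=coef)
    print(f"stress_coef={coef}: accuracy={clf.score(X_test, y_test):.3f}")
```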

[AI-166] RDEx-CASK: Cauchy Mutation Archive and Stagnation Kick for RDEx-CSOP CEC2026

【速读】:该论文旨在解决进化算法在优化过程中出现的早熟收敛(stagnation)和晚期方差过大(late-stage variance)问题,从而提升算法在复杂约束优化问题中的稳定性和收敛效率。解决方案的关键在于对RDEx-CSOP算法进行三项核心改进:首先,在标准分支中引入独立采样的截断柯西分布作为第二尺度因子以增强探索能力;其次,添加一个容量为50的小型可行解档案(feasible-only archive),并以概率 |A|/(|A|+|P|) 控制其采样频率;最后,引入个体停滞计数器机制,在连续180代无改进后触发三种局部策略:将个体拉向全局最优解、将档案采样下限提升至0.65,并在种群成功率低于0.10时将交叉概率(CR)饱和至0.95。这些改进显著提升了算法在CEC CSOP测试集上的可行性感知质量(feasibility-aware quality)与时间目标达成效率(time-to-target)。

链接: https://arxiv.org/abs/2605.09652
作者: Dikshant,Dikshit Chauhan,Chen Hao,Anupam Trivedi,Harikumar Kandath,Senthilnath Jayavelu
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: 5 pages, 2 tables, 1 algorithm. Technical report for the CEC 2026 CSOP competition track

点击查看摘要

Abstract:We extend RDEx-CSOP with three changes that target stagnation and late-stage variance, plus minor parameter tuning. The second scale factor in the standard branch is sampled independently from a truncated Cauchy. A small feasible-only JADE-style archive (|A|_max = 50) is added and sampled with probability |A|/(|A|+|P|). A per-individual stagnation counter triggers, after 180 no-improvement generations, three local overrides on the standard branch: pull toward the global best, lift the archive sampling floor to 0.65, and saturate CR to 0.95 when the population success rate is below 0.10. The exploitation-biased branch and every other RDEx component are left untouched. On the CEC CSOP suite (D=30, 25 runs), RDEx-CASK is competitive with RDEx, UDE-III, and CL-SRDE in feasibility-aware quality and improves time-to-target on most problems.
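The three additions are simple enough to sketch in isolation: an independently sampled truncated-Cauchy scale factor, archive-vs-population donor selection with probability |A|/(|A|+|P|), and the stagnation-triggered overrides. Parameter names are invented, the pull-toward-best move is omitted, and this is not the full RDEx-CASK implementation.

```python
import numpy as np

def truncated_cauchy(loc, scale, rng, low=0.0, high=1.0):
    """Rejection-sample a Cauchy variate truncated to (low, high]."""
    while True:
        f = loc + scale * rng.standard_cauchy()
        if low < f <= high:
            return f

def pick_donor(population, archive, rng):
    """Choose the donor from P or the feasible-only archive A, picking A
    with probability |A| / (|A| + |P|)."""
    if archive and rng.random() < len(archive) / (len(archive) + len(population)):
        return archive[rng.integers(len(archive))]
    return population[rng.integers(len(population))]

def stagnation_overrides(stall, success_rate, cr, archive_floor, max_stall=180):
    """Stagnation kick on the standard branch: after max_stall no-improvement
    generations, lift the archive sampling floor to 0.65 and saturate CR to
    0.95 when the population success rate is below 0.10 (the pull toward the
    global best is omitted in this sketch)."""
    if stall >= max_stall:
        archive_floor = max(archive_floor, 0.65)
        if success_rate < 0.10:
            cr = 0.95
    return cr, archive_floor

rng = np.random.default_rng(0)
population = [rng.normal(size=3) for _ in range(10)]
archive = [rng.normal(size=3) for _ in range(4)]
print(truncated_cauchy(0.5, 0.1, rng))
print(pick_donor(population, archive, rng))
print(stagnation_overrides(stall=200, success_rate=0.05, cr=0.7, archive_floor=0.3))
```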

[AI-167] Workspace Optimization: How to Train Your Agent

【速读】:该论文旨在解决当前基于前沿语言模型的智能体(agent)无法直接调整模型权重时,如何实现有效学习与任务适应的问题。其核心挑战在于面对复杂多轮交互环境时,尽管大模型具备强先验知识,却难以通过单次推理完成任务,需依赖持续的交互式学习机制。解决方案的关键在于提出“工作空间优化”(workspace optimization),即不再优化模型参数,而是通过结构化外部介质——工作空间(workspace)——进行演化:以可执行的“人工制品”替代传统参数,以“证据”替代训练数据,以“反例”替代损失函数,以“文本反馈”替代梯度更新。该方法在DreamTeam系统中得以实现,该系统为ARC-AGI-3任务设计了多代理协作框架,涵盖世界建模、规划、假设生成、探测、策略制定及失败路由等功能模块,在保持更低环境动作消耗的前提下显著提升了性能(从36%提升至38.4%)。

链接: https://arxiv.org/abs/2605.09650
作者: Elad Sarafian,Gal Kaplun,Ron Banner,Daniel Soudry,Boris Ginsburg
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Modern agents built on frontier language models often cannot adapt their weights. What, then, remains trainable? We argue it is the agent’s \emph{workspace}, the structured external substrate it reads, writes, and tests; we call its evolution workspace optimization. Workspace optimization targets hard multi-turn environments where a frontier model has strong priors but cannot solve the task in a single shot, so the agent must learn through interaction. We propose a principled way to evolve the workspace, mirroring the structure of weight-space training: artifacts in place of parameters, evidence in place of data, counterexamples in place of losses, and textual feedback in place of gradients. We instantiate the idea in DreamTeam, a multi-agent harness for ARC-AGI-3 whose roles build an executable world model, plan, hypothesize, probe, strategize, and route failures. On the current 25-game ARC-AGI-3 public set under the official scoring protocol and averaged over two independent runs, DreamTeam improves the SOTA protocol-matched agent’s score from 36% to 38.4%, while using 31% fewer environment actions per game.

[AI-168] PDEAgent -Bench: A Multi-Metric Multi-Library Benchmark for PDE Solver Generation

【速读】:该论文旨在解决生成式 AI (Generative AI) 在偏微分方程(PDE)到求解器代码自动生成任务中的局限性问题,即现有方法难以确保生成代码在数值准确性与计算效率上的可靠性,且缺乏专门针对有限元法(FEM)库的多维度评估基准。解决方案的关键在于提出 PDEAgent-Bench——首个面向 PDE-to-solver 代码生成的多指标、多 FEM 库基准测试平台,其包含 645 个实例覆盖 6 类数学问题和 11 种 PDE 家族,并采用分阶段评估框架(依次通过可执行性、数值精度和计算效率验证),从而系统性地衡量模型生成能力。实验表明,尽管当前大语言模型(LLM)和代码代理能生成可运行代码,但在引入精度与效率约束后成功率显著下降,凸显了该基准在推动高可靠性数值求解器自动合成方面的必要性和有效性。

链接: https://arxiv.org/abs/2605.09636
作者: Zhen Hang,Yushan Yashengjiang,Junhui Li,Huanshuo Dong,Yang Wei,Zhezheng Hao,Jiangtao Ma,Songlin Bai,Haozhong Kai,Xihang Yue,Gangzong Si,Dongming Jiang,Chao Yao,Zhanhua Hu,Jiangqing Zhang,Pengwei Liu,Yaomin Shen,Xingyu Ren,Lei Liu,Zikang Xu,Han Li,Qingsong Yao,Hande Dong,Hong Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:PDE-to-solver code generation aims to automatically synthesize executable numerical solvers from partial differential equation (PDE) specifications. This task requires not only understanding the mathematical structure of PDEs, but also selecting appropriate discretization schemes and solver configurations, and correctly implementing the resulting formulations in finite-element method (FEM) libraries. Existing code generation benchmarks mainly evaluate syntactic correctness, or success on predefined test cases. To our knowledge, there is currently no publicly available benchmark specifically for PDE-to-solver code generation, and general-purpose code benchmarks do not fully capture the unique challenges of numerical PDE solution, such as ensuring solver accuracy, efficiency, and compatibility with professional FEM libraries. We introduce PDEAgent-Bench, to the best of our knowledge, the first multi-metric, multi-library benchmark for PDE-to-solver code generation. PDEAgent-Bench contains 645 instances across 6 mathematical categories and 11 PDE families, with common FEM libraries for DOLFINx, Firedrake, and this http URL. Each instance provides an agent-facing problem specification, a reference solution on a prescribed evaluation grid, and case-specific accuracy and runtime targets. PDEAgent-Bench adopts a staged evaluation framework in which generated solvers must sequentially pass executability, numerical accuracy, and computational efficiency checks. Experiments with representative LLMs and code agents show that models can often produce runnable code, but their pass rate drops substantially once accuracy and efficiency requirements are enforced. These results indicate that current agents remain limited in producing numerically reliable and efficient PDE solvers, and that PDEAgent-Bench provides a reproducible testbed grounded in the practical requirements of numerical PDE solving.

[AI-169] Adaptive DNN Partitioning and Offloading in Heterogeneous Edge-Cloud Continuum

【速读】:该论文旨在解决资源受限物联网(IoT)设备上深度神经网络(DNN)部署中静态划分与卸载方法无法适应运行时动态环境变化的问题。现有方案通常基于固定策略进行DNN层划分,且多在仿真环境中评估,缺乏真实硬件验证。其解决方案的关键在于提出一个动态的DNN分层划分框架,该框架在启动时对模型进行性能剖析,实时测量节点间网络链路状态,并周期性重新评估和调整划分策略,以响应环境变化。实验基于包含树莓派边缘设备、笔记本电脑雾节点和高性能桌面云节点的物理测试平台,在VGG16、AlexNet和MobileNetV2三个主流卷积神经网络上验证,结果表明该方法相较静态基线可实现27.09–35.82%的能耗降低和6.34–22.92%的端到端延迟减少,证明了自适应划分优于静态方法的有效性。

链接: https://arxiv.org/abs/2605.09623
作者: Akuen Akoi Deng,Eimantas Butkus,Alfreds Lapkovskis,Praveen Kumar Donta
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Performance (cs.PF)
备注:

点击查看摘要

Abstract:In recent years, the use of artificial intelligence on resource-constrained IoT devices has grown significantly. However, existing approaches to DNN partitioning and offloading across the edge-cloud continuum typically rely on static methods that ignore runtime dynamics. Furthermore, they are often evaluated in simulated environments rather than on real hardware. To address this gap, we propose a framework that dynamically splits neural network layers across the heterogeneous continuum. The framework profiles the model at startup, measures network link conditions between nodes, and periodically re-evaluates the partition to adapt to environmental changes. We created a physical testbed comprising a Raspberry Pi edge device, a laptop fog, and a high-performance desktop PC as the cloud. We evaluated the framework over three widely adopted convolutional neural networks: VGG16, AlexNet, and MobileNetV2. Our results show that the framework achieves reductions in energy and end-to-end latency of 27.09–35.82% and 6.34–22.92%, respectively, compared to a static partitioning baseline. These findings confirm the superiority of adaptive to static partitioning.
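The core decision the framework re-evaluates at runtime can be sketched as a one-dimensional search over cut points of a sequential model, trading edge compute against transfer time and remote compute. The latency model and the toy layer profile below are assumptions for illustration, not the paper's measured profiles or energy model.

```python
def best_split(layer_flops, layer_out_bytes, edge_flops_s, cloud_flops_s,
               bandwidth_Bps, input_bytes):
    """Pick the layer index at which to cut a sequential DNN between the edge
    device and a remote node, minimising estimated end-to-end latency."""
    best = (None, float("inf"))
    for split in range(len(layer_flops) + 1):     # split=0 offloads everything
        edge_time = sum(layer_flops[:split]) / edge_flops_s
        sent = input_bytes if split == 0 else layer_out_bytes[split - 1]
        transfer_time = sent / bandwidth_Bps
        cloud_time = sum(layer_flops[split:]) / cloud_flops_s
        total = edge_time + transfer_time + cloud_time
        if total < best[1]:
            best = (split, total)
    return best

# toy profile: heavy layers throughout, activations shrinking with depth
layer_flops = [2e8, 4e8, 4e8, 6e8, 3e8]
layer_out_bytes = [3e6, 1.5e6, 2e5, 1e5, 4e3]
split, latency = best_split(layer_flops, layer_out_bytes,
                            edge_flops_s=1e9, cloud_flops_s=1e11,
                            bandwidth_Bps=2e6, input_bytes=6e6)
print(split, round(latency, 3))   # re-run whenever link measurements change
```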

[AI-170] Efficient Ensemble Selection from Binary and Pairwise Feedback

【速读】:该论文旨在解决多任务场景下从多个AI模型中选择一个高性能小规模集成(ensemble)的问题,其核心挑战在于如何在有限的模型调用、基准测试和人工评估成本下,高效选出最优委员会。解决方案的关键在于将此问题建模为一种分布式的多胜者投票(multiwinner voting)问题:任务从未知领域分布中抽取,每个任务提供对候选专家的反馈,而委员会的价值由其中表现最好的成员决定。针对二元反馈(binary feedback),作者提出基于失败条件的贪心算法,在保持标准 (1-1/e) 近似保证的同时实现实例相关的查询节省;对于成对反馈(pairwise feedback),引入 θ-获胜委员会概念,并设计一种可子模化的加权序数覆盖松弛(weighted ordinal coverage relaxation),通过有限族审计或极小极大封装将其转化为 θ 类型保证,从而兼顾理论最优性和实际可操作性。

链接: https://arxiv.org/abs/2605.09588
作者: Tzeh Yuan Neoh,Nicholas Teh,Je Qin Chooi,Paul W. Goldberg,Milind Tambe
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Organizations increasingly deploy multiple AI systems across task domains, but selecting a small, high-performing ensemble can require costly model calls, benchmark runs, and human evaluation. We study this selection problem as a distributional variant of multiwinner voting: tasks are drawn from an unknown domain distribution, each task induces feedback over candidate experts, and a committee’s value on a task is determined by its best-performing member. We analyze both binary feedback, for tasks with correct/incorrect outcomes, and pairwise feedback, for tasks where candidate outputs are compared by preference. In the binary setting, the induced objective is coverage. We give exhaustive-elicitation baselines and matching worst-case query lower bounds, and we design a failure-conditioned greedy algorithm that preserves the standard (1-1/e) guarantee while obtaining instance-dependent query savings. In the pairwise setting, we study \theta -winning committees. We show that full-information optimization admits a PTAS but no EPTAS under Gap-ETH, and that the objective is monotone but not submodular. This motivates a weighted ordinal coverage relaxation, which is submodular and supports a failure-conditioned greedy oracle under pairwise feedback. We then convert this oracle back into \theta -type guarantees through finite-family auditing or a minimax wrapper. We also provide small-scale LLM experiments illustrating the predicted query savings and the role of complementarity in committee selection.
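In the binary-feedback setting the induced objective is plain coverage, so the textbook greedy routine below already attains the (1-1/e) guarantee; the paper's contribution is the query-efficient, failure-conditioned variant and the pairwise-feedback analysis, neither of which this sketch implements. Candidate names and abilities are invented.

```python
def greedy_committee(candidates, tasks, solves, k):
    """Greedy max-coverage selection from binary feedback: the committee's
    value on a task is 1 if any member solves it, so greedy selection enjoys
    the standard (1 - 1/e) approximation guarantee."""
    committee, covered = [], set()
    for _ in range(k):
        best_gain, best_c = -1, None
        for c in candidates:
            if c in committee:
                continue
            gain = len({t for t in tasks if solves(c, t)} - covered)
            if gain > best_gain:
                best_gain, best_c = gain, c
        committee.append(best_c)
        covered |= {t for t in tasks if solves(best_c, t)}
    return committee, len(covered)

# toy usage: specialists with complementary strengths vs. a generalist
ability = {"math-model": {0, 1, 2, 3}, "code-model": {3, 4, 5},
           "qa-model": {6, 7}, "generalist": {0, 4, 6}}
committee, covered = greedy_committee(candidates=list(ability), tasks=range(8),
                                      solves=lambda c, t: t in ability[c], k=2)
print(committee, covered)   # complementary specialists cover the most tasks
```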

[AI-171] Biosignal Fingerprinting: A Cross-Modal PPG-ECG Foundation Model

【速读】:该论文旨在解决心血管疾病(Cardiovascular Disease, CVD)监测中诊断丰富的心电图(ECG)与广泛可用的可穿戴光电容积脉搏波(PPG)之间存在的数据模态鸿沟问题,即如何在不依赖任务特定微调的前提下,实现跨模态、跨设备的高效、通用且隐私安全的心血管状态表征。解决方案的关键在于提出“生物信号指纹”(biosignal fingerprints),其源自一个基于340万对齐ECG和PPG信号训练的多模态掩码自编码器(Multi-modal Masked Autoencoder, M2AE)模型;该模型通过模态特异性编码器、共享瓶颈层及双解码器结构,联合优化重建损失与跨模态对比损失,生成具有高度泛化能力的紧凑潜在表示,既保留了模态内与模态间的特征信息,又具备模态无关性与隐私保护特性,从而可在无需暴露原始波形或重新训练模型的情况下,直接应用于多种下游任务(如CVD分类、高血压检测、死亡率预测等),并在7个任务中达到或超越当前最优领域专用基础模型的表现。

链接: https://arxiv.org/abs/2605.09579
作者: Zhangdaihong Liu,Chang Liu,Fenglin Liu,Yixuan Chen,Yang Yang,David A. Clifton,Xiao Gu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 21 pages, 8 figures, 7 tables

点击查看摘要

Abstract:Cardiovascular disease remains the leading cause of global mortality, yet scalable cardiac monitoring is hindered by the gap between diagnostic-rich ECG and ubiquitous wearable PPG. Bridging this gap requires representations that are compact, transferable across modalities and devices, and deployable without task-specific retraining. Here we introduce biosignal fingerprints: compact latent representations of cardiovascular state derived from a cross-modal foundation model, the Multi-modal Masked Autoencoder (M2AE), trained on over 3.4 million paired ECG and PPG signals. M2AE integrates modality-specific encoders with a shared bottleneck and dual decoders, jointly optimized using reconstruction and cross-modal contrastive objectives, yielding generalizable fingerprints that retain intra- and inter-modality features. Like a biometric fingerprint, these representations uniquely encode an individual’s cardiovascular state in a modality-agnostic, privacy-preserving form reusable across clinical tasks without exposing raw waveform data or requiring model retraining. Across 7 downstream tasks, spanning cross-modal reconstruction, cardiovascular disease classification, hypertension detection, mortality prediction, and demographic inference, biosignal fingerprints achieve competitive or superior performance compared to leading domain-specialist foundation models in frozen settings, including an AUROC of 0.974 for five-class CVD classification and 0.877 for hypertension detection, with a maximum improvement of 27.7% in AUROC across 5 classification tasks. Critically, strong performance is maintained with only a single modality, enabling deployment in resource-constrained, single-sensor environments typical of real-world wearable monitoring, with direct implications for continuous cardiovascular monitoring across clinical and consumer health settings.

[AI-172] IDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning

【速读】:该论文旨在解决当前工具集成推理(Tool-integrated Reasoning, TIR)评估缺乏高质量、统一基准的问题,现有评估在数据集质量、任务多样性、诊断全面性和评估效率方面存在局限。解决方案的关键在于提出TIDE-Bench,一个全面且高效的TIR评估基准,其核心创新包括:(1)构建多样化的任务设置,融合数学推理与知识密集型问答任务,并新增工具引导实验设计和动态交互任务以考察复杂工具调用与多工具协同能力;(2)采用任务感知的综合评估协议,同步衡量最终答案质量、过程可靠性、工具使用效率和推理成本;(3)通过过滤低区分度样本构建高质评估集,在显著降低评估成本的同时聚焦更具挑战性的样本,从而更精准地识别模型瓶颈,如工具定位(tool grounding)问题,为未来TIR研究提供方向。

链接: https://arxiv.org/abs/2605.09544
作者: Yize Li,Junzhi Li,Jason Song,Chuxiong Sun,Rui Wang,Changwen Zheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures, 10 tables

点击查看摘要

Abstract:Tool-integrated reasoning has emerged as a promising paradigm for enhancing large language models with external computation, retrieval, and execution capabilities. However, the field still lacks a high-quality and unified evaluation benchmark, and existing TIR evaluations remain limited in dataset quality, task diversity, diagnostic comprehensiveness, and evaluation efficiency. In this work, we introduce TIDE-Bench, a holistic and efficient benchmark for evaluating TIR methods, featuring three key advantages. First, it provides diverse task settings, combining widely used mathematical reasoning and knowledge-intensive QA tasks with two newly designed tasks, namely the tool-grounded experimental design task and the dynamic interactive task, to probe models’ abilities in complex tool invocation and multi-tool coordination. Second, TIDE-Bench adopts a comprehensive yet task-aware evaluation protocol, jointly measuring final answer quality, process reliability, tool-use efficiency, and inference cost across heterogeneous task settings. Third, TIDE-Bench constructs high-quality and discriminative evaluation sets by filtering low-discrimination instances from existing datasets, substantially reducing evaluation cost while focusing on more challenging samples. Extensive experiments on multiple foundation models and TIR methods reveal persistent bottlenecks in tool grounding, offering insights for future TIR research.

[AI-173] LLM -Guided Monte Carlo Tree Search over Knowledge Graphs: Composing Mechanistic Explanations for Drug-Disease Pairs ECAI2026 IJCAI

【速读】:该论文旨在解决从知识图谱(Knowledge Graph, KG)中提取多步解释时面临的组合挑战,即随着路径深度增加导致候选路径数量激增(heuristic guidance 问题)以及长序列中路径质量难以评估(credit assignment 问题)。现有大型语言模型(Large Language Models, LLMs)虽在知识推理基准上表现优异,但其知识缺乏保障,且在长链推理中性能显著下降。为此,作者提出 TESSERA——一个三部分神经符号框架:LLM 被限定用于局部判别性判断而非自主生成多步路径;知识图谱定义假设空间并施加硬性结构约束;蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)则通过反向传播实现长期搜索中的合理信用分配。该方案的关键在于将 LLM 作为先验策略以引导探索、同时作为状态比较器提供奖励信号,从而在结构化知识空间中高效实现可解释的组合推理。

链接: https://arxiv.org/abs/2605.09542
作者: Rishabh Jakhar,Michel Dumontier,Remzi Celebi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at IJCAI-ECAI 2026. 9 pages (7 content + 2 references), 5 figures, 3 tables. Includes supplementary material (26 pages)

点击查看摘要

Abstract:Extracting multi-step explanations from knowledge graphs poses a combinatorial challenge requiring both heuristic guidance (as candidates proliferate with depth) and credit assignment (as path quality emerges over extended sequences). Frontier LLMs, strong on knowledge/reasoning benchmarks, offer a compelling source of such heuristics, yet their knowledge comes sans guarantees and compositional performance degrades as chains lengthen. We thus present TESSERA, a 3-part neuro-symbolic framework that uses LLMs in a circumscribed role: for local discriminative judgement rather than autonomous multi-step generation; the knowledge graph then defines the hypothesis space enforcing hard structural constraints, and MCTS coordinates the long-horizon search with principled credit assignment via backpropagation. LLMs perform dual roles as a prior policy biasing exploration and a comparative state evaluator supplying reward signals. Evaluation on drug mechanism elucidation across two complementary knowledge graphs demonstrates fidelity to curated biology while surfacing coherent alternative mechanisms, with ablations confirming discriminative contribution from both LLM components. Beyond its current application, our framework offers a general paradigm for compositional reasoning over structured knowledge.

[AI-174] Governing AI-Assisted Security Operations: A Design Science Framework for Operational Decision Support

【速读】:该论文旨在解决在高风险运营功能中引入生成式AI(Generative AI)、检索增强生成(Retrieval-Augmented Generation)和编码代理时,如何保障问责制、隐私、成本控制、审计能力等关键要素不被削弱的问题。解决方案的关键在于将AI辅助的运营决策支持视为一种需先治理后扩展的工程能力,而非直接自动化。研究通过设计科学方法构建了一个受控的AI查询代理(governed AI query-broker)原型,其核心机制包括基于模式的检索、审批模板、策略验证、只读适配器、归一化输出、可审计代理追踪以及工程评审委员会关卡,从而实现AI规划与操作执行的分离,并明确角色责任、成熟度阶段、质量门控与证据边界,形成一套可落地的管理框架。

链接: https://arxiv.org/abs/2605.09534
作者: Elyson A. De La Cruz,Rishikesh Sahay,Md Rasel Al Mamun
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 28 pages, 1 listing, 1 figure, 20 Tables

点击查看摘要

Abstract:Engineering managers increasingly must decide how to introduce generative artificial intelligence (AI), retrieval-augmented generation, and coding agents into high-risk operational functions without weakening accountability, privacy, cost discipline, or auditability. The central message of this study is that AI-assisted operational decision support should be managed as a governed engineering capability before it is scaled as automation. Security operations centers (SOCs) provide a suitable setting because they combine privileged telemetry, specialist expertise, software repositories, cloud services, and evidence-sensitive decisions. This study uses Kusto Query Language (KQL) and Microsoft Azure security capabilities as a bounded technical instantiation of that broader engineering management problem. KQL is read-only in ordinary query use, but read-only does not mean risk-free: AI-assisted queries can still create privacy, cost, performance, schema-validity, and decision-quality risks through broad scans, sensitive-field exposure, stale intelligence, and misleading interpretations. Using design science research, the study develops a governed AI query-broker artifact that separates AI planning from operational execution through schema-grounded retrieval, approved templates, policy validation, read-only adapters, normalized outputs, auditable agent traces, and engineering review board gates. The contribution is not a new KQL technique, security product, or detection algorithm. Rather, the study contributes a management framework for governing AI-assisted operational decision support in high-risk digital infrastructure by specifying design propositions, role accountability, maturity stages, quality gates, evaluation criteria, and evidence boundaries.

[AI-175] Cplus2ASP: Computing Action Language C in Answer Set Programming

【速读】:该论文旨在解决动作语言C+(Action Language C+)中确定性片段的高效形式化推理问题,特别是针对传统方法在处理复杂动态系统时效率低下的挑战。解决方案的关键在于构建一个名为Cplus2ASP Version 2的系统,其核心创新是通过组合多个近期理论成果,将C+描述转化为适用于现代答案集编程(Answer Set Programming, ASP)求解器的输入格式。该系统利用f2lp、clingo、iclingo和as2transition等工具链,在增量执行模式下将C+程序翻译为iclingo可处理的形式,并借助其增量接地机制显著提升性能;同时,通过扩展模块定理以支持嵌套表达式,保障了翻译的正确性。此外,系统还引入Lua外部原子和交互式用户模式等实用功能,并具备对其他动作语言(如B和BC)进行可扩展多模态翻译的能力。

链接: https://arxiv.org/abs/2605.09528
作者: Joseph Babb,Joohyung Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present Version 2 of system Cplus2ASP, which implements the definite fragment of action language C+. Its input language is fully compatible with the language of the Causal Calculator Version 2, but the new system is significantly faster thanks to modern answer set solving techniques. The translation implemented in the system is a composition of several recent theoretical results. The system orchestrates a tool chain, consisting of f2lp, clingo, iclingo, and as2transition. Under the incremental execution mode, the system translates a C+ description into the input language of iclingo, exploiting its incremental grounding mechanism. The correctness of this execution is justified by the module theorem extended to programs with nested expressions. In addition, the input language of the system has many useful features, such as external atoms by means of Lua calls and the user interactive mode. The system supports extensible multi-modal translations for other action languages, such as B and BC, as well.

[AI-176] Functional Stable Model Semantics and Answer Set Programming Modulo Theories

【速读】:该论文旨在解决如何在答案集编程(Answer Set Programming, ASP)中有效引入“内涵函数”(intensional functions)的问题,即那些值可通过其他函数和谓词定义而非预先指定的函数。传统ASP中的函数通常是外延定义的,限制了其表达能力。作者提出,通过引入函数稳定模型语义(functional stable model semantics),可以更好地支持答案集编程模理论(Answer Set Programming Modulo Theories, ASPMT)框架下的混合推理,从而将现有基于ASP与SMT(Satisfiability Modulo Theories)集成的方法统一为特例。解决方案的关键在于证明“紧致”(tight)的ASPMT程序可被转化为SMT实例,这一转化机制类似于ASP与SAT之间的经典映射关系,从而实现了逻辑表达力与计算效率的平衡。

链接: https://arxiv.org/abs/2605.09524
作者: Michael Bartholomew,Joohyung Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently there has been an increasing interest in incorporating "intensional" functions in answer set programming. Intensional functions are those whose values can be described by other functions and predicates, rather than being pre-defined as in the standard answer set programming. We demonstrate that the functional stable model semantics plays an important role in the framework of "Answer Set Programming Modulo Theories (ASPMT)" – a tight integration of answer set programming and satisfiability modulo theories, under which existing integration approaches can be viewed as special cases where the role of functions is limited. We show that "tight" ASPMT programs can be translated into SMT instances, which is similar to the known relationship between ASP and SAT.

[AI-177] Weighted Rules under the Stable Model Semantics

【速读】:该论文旨在解决传统稳定模型语义(stable model semantics)在处理不确定性和不一致性时的局限性,例如无法有效处理答案集程序中的矛盾、缺乏对稳定模型的排序能力以及难以引入概率解释等问题。其解决方案的关键在于引入加权规则(weighted rules),借鉴马尔可夫逻辑(Markov Logic)中的对数线性模型思想,将权重赋予逻辑规则以生成加权稳定模型(weighted stable models)。这一方法不仅能够缓解确定性语义带来的刚性问题,还支持基于统计推理计算加权稳定模型,并实现了与ProbLog、P-log等概率逻辑编程形式化方法的理论对比和融合。

链接: https://arxiv.org/abs/2605.09519
作者: Joohyung Lee,Yi Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:We introduce the concept of weighted rules under the stable model semantics following the log-linear models of Markov Logic. This provides versatile methods to overcome the deterministic nature of the stable model semantics, such as resolving inconsistencies in answer set programs, ranking stable models, associating probability to stable models, and applying statistical inference to computing weighted stable models. We also present formal comparisons with related formalisms, such as answer set programs, Markov Logic, ProbLog, and P-log.

[AI-178] Mixture of Layers with Hybrid Attention

【速读】:该论文旨在解决标准混合专家(Mixture-of-Experts, MoE)Transformer中层结构仍为密集单体的问题,即虽然token被路由至不同专家子网络,但每层的整体计算结构未实现稀疏化。为此,作者提出混合层(Mixture-of-Layers, MoL)架构,其关键在于用K个并行的窄块(thin blocks,维度d_thin ≪ d_model)替代原有的全宽Transformer块(d_model),并通过可学习的下采样/上采样投影连接,并采用top-k块路由机制进行组合。为应对多块稀疏路由导致的注意力覆盖不足问题,进一步引入混合注意力机制:在路由块中使用Gated DeltaNet线性注意力以增强局部感知能力,同时保留一个共享的softmax注意力块用于全局上下文建模,从而在保持计算效率的同时提升模型表达能力。

链接: https://arxiv.org/abs/2605.09516
作者: Ivan Ternovtsii,Yurii Bilak
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Standard Mixture-of-Experts (MoE) transformers route tokens to expert subnetworks within each layer, but the layer structure itself remains monolithic. We introduce Mixture of Layers (MoL), which replaces full-width transformer blocks (d_model) with K parallel thin blocks at reduced dimensionality (d_thin ≪ d_model), connected via learned down/up projections and composed via top-k block routing. Scaling sparse block routing to many blocks creates an attention coverage problem, as each block sees fewer tokens. We address this by introducing hybrid attention, which pairs one shared softmax block for global context with Gated DeltaNet linear attention in routed blocks.
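A compact PyTorch sketch of the layer structure described above: K thin blocks behind learned down/up projections, composed by per-token top-k routing with a residual connection. A small MLP stands in for each block body (the paper's routed blocks use Gated DeltaNet linear attention, and the shared softmax-attention block is omitted); all sizes are illustrative.

```python
import torch
import torch.nn as nn

class ThinBlock(nn.Module):
    """One routed thin block: down-project to d_thin, mix, project back up.
    A small MLP stands in for the block body."""
    def __init__(self, d_model, d_thin):
        super().__init__()
        self.down = nn.Linear(d_model, d_thin)
        self.body = nn.Sequential(nn.Linear(d_thin, d_thin), nn.GELU(),
                                  nn.Linear(d_thin, d_thin))
        self.up = nn.Linear(d_thin, d_model)

    def forward(self, x):
        return self.up(self.body(self.down(x)))

class MoLLayer(nn.Module):
    """K parallel thin blocks composed by per-token top-k routing, replacing
    one full-width block; outputs are added back through a residual."""
    def __init__(self, d_model=256, d_thin=64, num_blocks=8, top_k=2):
        super().__init__()
        self.blocks = nn.ModuleList([ThinBlock(d_model, d_thin)
                                     for _ in range(num_blocks)])
        self.router = nn.Linear(d_model, num_blocks)
        self.top_k = top_k

    def forward(self, x):                              # x: (batch, seq, d_model)
        gates = self.router(x).softmax(dim=-1)
        weights, idx = gates.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for b, block in enumerate(self.blocks):
                mask = idx[..., slot] == b             # tokens routed to block b
                if mask.any():
                    w = weights[..., slot][mask].unsqueeze(-1)
                    out[mask] = out[mask] + w * block(x[mask])
        return x + out

x = torch.randn(2, 16, 256)
print(MoLLayer()(x).shape)   # torch.Size([2, 16, 256])
```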

[AI-179] A Game Theoretic Free Energy Analysis of Higher Order Synergy in Attention Heads of Large Language Models

【速读】:该论文旨在解决大型语言模型中多头注意力机制(multihead attention)内部各注意力头之间交互关系不明确的问题。传统方法难以量化不同头之间的协同或冗余效应,导致模型优化缺乏理论依据。其解决方案的关键在于引入博弈论自由能原理(Game Theoretic Free Energy Principle, GTFEP),将每个注意力头视为有限理性代理(bounded rational agent),通过变分推断视角建模其行为:每个头最小化自身的变分自由能,而整体结构遵循基于Harsanyi红利(Harsanyi dividends)分解的吉布斯分布。在此框架下,成对红利对应互信息(非负),三元红利则对应交互信息(可为负值),揭示了高阶冗余现象(如BERT、GPT2和Llama在GSM8K数据集上三元红利持续为负)。该理论进一步通过Nash FEP对应关系证明,集体自由能的稳定点即为ε-Nash均衡,从而支持基于边际贡献的剪枝策略——低贡献头可被移除且性能损失极小,例如在GPT2中剪枝20%头仅使困惑度从28.4升至33.4,同时减少18%浮点运算量(FLOPs)并提升22%吞吐量。

链接: https://arxiv.org/abs/2605.09515
作者: Djamel Bouchaffra
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: this manuscript has been submitted to Neural Networks

点击查看摘要

Abstract:Large language models rely on multihead attention, but interactions among heads remain poorly understood. We apply the Game Theoretic Free Energy Principle (GTFEP): a framework casting multiagent systems as distributed variational inference to analyze attention heads as bounded rational agents. According to GTFEP, each head minimizes its variational free energy, and collective behavior follows a Gibbs distribution over coalition structures whose energy is decomposed into Harsanyi dividends. Using a tractable approximation (uniform prior, deterministic dynamics), coalition free energy reduces to joint Shannon entropy of discretized head outputs (argmax key index). Pairwise dividends become mutual information (nonnegative), while triple dividends correspond to interaction information and can be negative. On BERT, GPT2, and Llama with GSM8K, triple dividends are consistently negative, revealing higher order redundancy. The Nash FEP correspondence guarantees that stationary points of collective free energy are epsilon Nash equilibria; thus, heads with negligible contribution can be pruned with minimal performance loss. Pruning heads with low marginal contribution reduces computational cost with minimal performance loss: for example, pruning 20% of heads in GPT2 reduces FLOPs by 18%, increases throughput by 22%, and raises perplexity only modestly (from 28.4 to 33.4 on GSM8K). Our work shows GTFEP provides a principled foundation for analyzing and optimizing transformer architectures.
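Under the tractable approximation described in the abstract, the dividends reduce to entropy combinations of discretized head outputs, which are straightforward to compute. The sketch below uses the sign convention in which negative triple values indicate redundancy, and feeds in synthetic "heads" that largely copy one another; it illustrates the quantities only, not the GTFEP analysis itself.

```python
import numpy as np
from collections import Counter

def entropy(columns):
    """Joint Shannon entropy (bits) of discretized head outputs; each column
    holds one head's argmax key index per token."""
    counts = Counter(zip(*columns))
    n = sum(counts.values())
    p = np.array([c / n for c in counts.values()])
    return float(-(p * np.log2(p)).sum())

def pairwise_dividend(a, b):
    """Pairwise dividend = mutual information I(A;B) >= 0."""
    return entropy([a]) + entropy([b]) - entropy([a, b])

def triple_dividend(a, b, c):
    """Triple dividend = interaction information I(A;B|C) - I(A;B); negative
    values signal higher-order redundancy (sign convention chosen to match
    the abstract's reading)."""
    return (entropy([a, b]) + entropy([a, c]) + entropy([b, c])
            - entropy([a]) - entropy([b]) - entropy([c]) - entropy([a, b, c]))

# synthetic "heads" whose argmax indices largely copy one another
rng = np.random.default_rng(0)
h1 = rng.integers(0, 8, size=2000)
h2 = np.where(rng.random(2000) < 0.9, h1, rng.integers(0, 8, size=2000))
h3 = np.where(rng.random(2000) < 0.9, h1, rng.integers(0, 8, size=2000))
print(round(pairwise_dividend(h1, h2), 3))    # positive mutual information
print(round(triple_dividend(h1, h2, h3), 3))  # negative: redundant heads
```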

[AI-180] WindINR: Latent-State INR for Fast Local Wind Query and Correction in Complex Terrain

【速读】:该论文旨在解决复杂地形中下游决策对局部高分辨率风场快速估计的需求问题,即如何在用户指定的少量位置和高度上,针对特定预报有效时间提供精确的风速估计,而非依赖固定网格上的密集预报场。其解决方案的关键在于提出WindINR框架——一种基于潜在状态的隐式神经表示方法,通过将静态地形描述符、低分辨率背景场与连续查询坐标映射到高分辨率风场状态,实现对稀疏观测数据的高效修正。该方法分离了可复用的表示学习与样本特异性的潜在状态修正,在训练阶段利用高分辨率监督信息推断参考潜在状态,并构建适用于数据集的高斯先验来建模潜在修正差异;推理时仅更新紧凑的潜在状态而非全网络权重,结合稀疏观测及其不确定性最小化正则化修正目标,从而显著提升在线修正效率(CPU基准下较全网微调快约2.6倍),同时保持任意坐标下的连续可查询性。

链接: https://arxiv.org/abs/2605.09511
作者: Yi Xiao,Qilong Jia,Hang Fan,Pascal Fua,Robert Jenssen,Xiaosong Ma,Wei Xue
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many downstream decisions in complex terrain require fast wind estimates at a small number of user-specified locations and heights for a given forecast valid time, rather than another dense forecast field on a fixed grid. We present WindINR, a latent-state implicit neural representation framework for continuous high-resolution local wind query and sparse-observation correction. WindINR maps static terrain descriptors, a low-resolution background field, and continuous query coordinates to a high-resolution wind state through a latent-conditioned decoder. To enable rapid inference-time correction, WindINR separates reusable representation learning from sample-specific latent-state correction. During training, a privileged encoder infers a reference latent state from high-resolution supervision, a deployable latent predictor estimates an initial latent state from inference-time inputs alone, and their discrepancies are summarized into a dataset-adaptive Gaussian prior over latent corrections. At inference time, within the WindINR module, network weights remain fixed and only the latent state is updated by minimizing a regularized correction objective using sparse observations and their uncertainty. In controlled OSSEs over the Senja region, including a UAV-aided approach scenario and random-observation robustness tests, WindINR improves local high-resolution wind estimates by updating only a compact latent state rather than the full network. The corrected representation remains continuously queryable at arbitrary coordinates and, in our CPU benchmark, yields about a 2.6\times online-correction speedup over full-network fine-tuning, suggesting a practical interface between kilometer-scale background products, sparse local observations, and wind queries in complex terrain.

[AI-181] EpiGraph: A Knowledge Graph and Benchmark for Evidence-Intensive Reasoning in Epilepsy

【速读】:该论文旨在解决癫痫(epilepsy)诊断与治疗中因临床知识异质性强、证据密集型推理复杂而导致的决策效率与准确性不足的问题。其解决方案的关键在于构建了一个大规模、多层结构化的癫痫知识图谱(EpiGraph),整合了48,166篇同行评审文献及7个临床资源,形成包含24,324个实体和32,009个证据锚定三元组的异质图谱,并基于此设计了五个临床驱动的任务基准(EpiBench),用于评估增强知识的大语言模型(LLM)在真实神经科场景下的表现。实验表明,引入EpiGraph显著提升了所有任务的性能,尤其在药物基因组学推理方面提升达30–41%,验证了结构化知识对证据驱动临床推理的实质性增强作用。

链接: https://arxiv.org/abs/2605.09505
作者: Yuyang Dai,Zheng Chen,Jathurshan Pradeepkumar,Yasuko Matsubara,Jimeng Sun,Yasushi Sakurai,Yushun Dong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Epilepsy diagnosis and treatment require evidence-intensive reasoning across heterogeneous clinical knowledge, including biosignal patterns, genetic mechanisms, pharmacogenomics, treatment strategies, and patient outcomes. In this work, we present \textscEpiGraph, a large-scale epilepsy knowledge graph and benchmark for evaluating knowledge-augmented clinical reasoning. \textscEpiGraph integrates 48,166 peer-reviewed papers and seven clinical resources into a heterogeneous graph containing 24,324 entities and 32,009 evidence-grounded triplets across five clinical layers. Built upon this graph, \textscEpiBench defines five clinically motivated tasks spanning clinical decision-making, EEG report generation, pharmacogenomic precision medicine, treatment recommendation, and deep research planning. We evaluate six LLMs under both standard and Graph-RAG settings. Results show that integrating \textscEpiGraph consistently improves performance across all tasks, with the largest gains observed in pharmacogenomic reasoning (+30–41%). Our findings demonstrate that structured epilepsy knowledge substantially enhances evidence-grounded clinical reasoning and provides a practical benchmark framework for evaluating knowledge-augmented LLMs in real-world neurological settings. Our code is available at: this https URL.

[AI-182] Position: AI Security Policy Should Target Systems Not Models

【速读】:该论文旨在解决前沿大语言模型(Large Language Models, LLMs)的安全防护能力不足问题,特别是针对其在安全绕过(jailbreak)和软件漏洞挖掘方面的脆弱性。研究发现,仅通过低成本的通用硬件与开源模型即可实现对主流LLM(如GPT-4o和Claude Sonnet-4)的有效攻击,并成功识别目标软件中的多个常见漏洞(CWE),这表明此前因安全顾虑而被限制发布的Anthropic Mythos Preview所涉及的能力可被低成本复现。解决方案的关键在于构建了一个名为swarm-attack的协同对抗测试框架,该框架利用多个轻量级LLM代理通过共享内存、并行探索和进化优化进行协作,从而弥补单个模型推理能力有限的问题,形成系统级的高效攻击能力。

链接: https://arxiv.org/abs/2605.09504
作者: Michael A. Riegler,Inga Strümke
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present swarm-attack, an open-source adversarial testing framework in which multiple lightweight LLM agents coordinate through shared memory, parallel exploration, and evolutionary optimization. Together, our results demonstrate that both safety bypass of frontier models and software vulnerability discovery, i.e., the capability class that motivated restricted release of Anthropic’s Mythos Preview, are achievable at effectively zero cost using commodity hardware and openly available models. We report two experiments. In the first, five instances of a 1.2 billion parameter model conducted 225 jailbreak attacks each against GPT-4o and Claude Sonnet 4. Against GPT-4o, the swarm achieved an Effective Harm Rate of 45.8%, producing 49 critical-severity breaches; against Claude Sonnet-4, the Effective Harm Rate was 0% despite a 40% technical success rate. In the second experiment, the same models performed combined source code analysis and binary fuzzing against a vulnerable C application with 9 planted CWEs. With a hand-crafted exploit seed corpus, regex pattern detection, and AddressSanitizer-based crash classification, the pipeline recovers 9 of 9 vulnerabilities (100% recall) in approximately four minutes on a consumer MacBook. With those scaffold components disabled, the same model recovers 0 of 9 by crash verification and 2 of 9 by citation. The capability class that motivated restricted release of Anthropic’s Mythos Preview is therefore reproducible at effectively zero cost; the important enabler is the system scaffold itself, which compensates for the limited reasoning capacity of small individual models.

[AI-183] Spectral Transformer Neural Processes

[Quick read]: This paper targets the underfitting that Neural Processes (NPs) exhibit on time series, spatial data, and images with strong periodic and quasi-periodic structure, as well as their poor generalization beyond the training distribution. The key to the solution is the frequency-aware Spectral Transformer Neural Process (STNP), which introduces a Spectral Aggregator that estimates the empirical spectrum of the context data, compresses it into a spectral-mixture representation, samples task-adaptive spectral features, and concatenates them with time-domain embeddings, thereby injecting a spectral-mixture-kernel bias into standard Transformer Neural Processes (TNPs). This design reshapes the similarity geometry between inputs, so that points far apart in Euclidean space can remain close on the induced periodic manifold, while strengthening time-frequency interactions and markedly improving the modelling of periodic and quasi-periodic patterns.

Link: https://arxiv.org/abs/2605.09498
Authors: Xianhe Chen, Hao Chen, Yingzhen Li
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 37 pages, 10 figures, 18 tables

Abstract:Time series, spatial data, and images are natural applications of Neural Processes. However, when such data exhibit strong periodicity and quasi-periodicity, existing methods often suffer from underfitting and generalise poorly beyond the training distribution. In this work, we propose Spectral Transformer Neural Processes (STNPs), a frequency-aware extension of Transformer Neural Processes (TNPs). STNPs introduce a Spectral Aggregator that estimates an empirical context spectrum, compresses it into a spectral mixture, samples task-adaptive spectral features, and concatenates them with time-domain embeddings, thereby injecting a spectral-mixture-kernel bias into TNPs. This design reshapes the similarity geometry, allowing inputs that are distant in Euclidean space to remain close in an induced periodic manifold while enhancing time-frequency interactions. Extensive experiments on synthetic regression tasks, real-world time-series datasets, and an image dataset demonstrate that STNPs consistently improve predictive performance over existing baselines, extending Neural Processes beyond translation equivariance towards effective modelling of periodicity and quasi-periodicity.
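
To make the Spectral Aggregator idea concrete, here is a minimal sketch of frequency-aware context aggregation: it estimates an empirical spectrum of the context targets with an FFT, keeps the dominant frequencies as shared spectral features, and concatenates them with time-domain embeddings. Everything here (the regular-grid assumption, top-k periodogram peaks, the plain linear time embedding, and all names and shapes) is an illustrative assumption, not the paper's architecture.

```python
# Minimal sketch of a frequency-aware context aggregator in the spirit of STNPs.
# Assumptions (not from the paper): context points lie on a regular 1-D grid,
# "spectral features" are the top-k periodogram peaks, and the time-domain
# embedding is a plain linear map.
import numpy as np

def spectral_features(y_ctx, k=4):
    """Estimate an empirical spectrum of the context targets and keep the k
    strongest frequencies with their relative power (a crude spectral mixture)."""
    y = y_ctx - y_ctx.mean()
    power = np.abs(np.fft.rfft(y)) ** 2
    freqs = np.fft.rfftfreq(len(y))
    top = np.argsort(power)[::-1][:k]                 # dominant frequency bins
    return np.concatenate([freqs[top], power[top] / power.sum()])   # shape (2k,)

def aggregate(x_ctx, y_ctx, W_time, k=4):
    """Concatenate time-domain embeddings with shared spectral features so a
    downstream transformer sees both Euclidean and periodic similarity cues."""
    time_emb = np.stack([x_ctx, y_ctx], axis=-1) @ W_time             # (N, d)
    spec = np.tile(spectral_features(y_ctx, k), (len(x_ctx), 1))      # (N, 2k)
    return np.concatenate([time_emb, spec], axis=-1)

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 128)
y = np.sin(2 * np.pi * 0.7 * x) + 0.1 * rng.standard_normal(128)
tokens = aggregate(x, y, W_time=rng.standard_normal((2, 8)))
print(tokens.shape)   # (128, 16): 8 time-domain dims + 2k spectral dims
```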

[AI-184] Don't Click That: Teaching Web Agents to Resist Deceptive Interfaces ACL2026

[Quick read]: This paper addresses the susceptibility of vision-language model (VLM) based web agents to deceptive interface elements during autonomous GUI interaction. Existing methods either detect deception without integrating it into task execution or merely document attacks without proposing effective defenses. The key to the solution is DUDE (Deceptive UI Detector Evaluator), a two-stage framework: the first stage uses hybrid-reward learning with asymmetric penalties to identify potential deception, and the second distills failure patterns into transferable guidance via experience summarization, yielding a robust defense against deceptive behavior. Experiments show that DUDE reduces deception susceptibility by 53.8% while preserving task performance, supporting it as a foundation for robust web-agent deployment.

Link: https://arxiv.org/abs/2605.09497
Authors: Yilin Zhang, Yingkai Hua, Chunyu Wei, Xin Wang, Yueguo Chen
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Accepted to ACL 2026 Main Conference. 23 pages, 8 figures, 19 tables

Abstract:Vision-language model (VLM) based web agents demonstrate impressive autonomous GUI interaction but remain vulnerable to deceptive interface elements. Existing approaches either detect deception without task integration or document attacks without proposing defenses. We formalize deception-aware web agent defense and propose DUDE (Deceptive UI Detector Evaluator), a two-stage framework combining hybrid-reward learning with asymmetric penalties and experience summarization to distill failure patterns into transferable guidance. We introduce RUC (Real UI Clickboxes), a benchmark of 1,407 scenarios spanning four domains and deception categories. Experiments show DUDE reduces deception susceptibility by 53.8% while maintaining task performance, establishing an effective foundation for robust web agent deployment.

[AI-185] LASSA Architecture-Based Autonomous Fault-Tolerant Control of Unmanned Underwater Vehicles

[Quick read]: This paper tackles high-level autonomous fault-tolerant control of unmanned underwater vehicles (UUVs) in communication-constrained environments, where existing approaches depend on predefined hard-coded rules and fail against unforeseen faults; large language models (LLMs) offer strong reasoning but their hallucinations limit reliable use in UUV control. The key to the solution is an intelligent control framework built on the LASSA (LLM-based Agent with Solver, Sensor and Actuator) architecture: the LLM identifies unknown faults and replans the mission without hard-coded rules; the agent handles perception, scheduling, and decision evaluation; and the solver verifies physical boundary-feasibility constraints before commands reach the actuators, suppressing physically infeasible hallucinations and keeping decisions interpretable and verifiable. The architecture further forms a fast-slow dual closed-loop control scheme in which the slow loop performs high-level dynamic decision-making and the fast loop guarantees high-frequency real-time control, balancing decision intelligence against control timeliness.

Link: https://arxiv.org/abs/2605.09494
Authors: Hong Chen, Zixiang Tang, Yuanbao Chen, Yu Liu
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Unmanned underwater vehicles (UUVs) operate persistently in communication-constrained environments, thus requiring high-level autonomous fault-tolerant control under faulty operating conditions. Existing approaches rely heavily on predefined hard-coded rules and struggle to achieve effective fault-tolerant control against unforeseen faults. Although large language models (LLMs) possess powerful cognitive and reasoning capabilities, their inherent hallucinations remain a major obstacle to their application in UUV control systems. This paper proposes an intelligent control method based on the LASSA (LLM-based Agent with Solver, Sensor and Actuator) architecture. Within this architecture, an LLM identifies unknown faults and accomplishes task replanning via autonomous reasoning without hard-coded rules; the intelligent agent undertakes perception, scheduling and decision evaluation; the solver verifies physical boundary feasibility constraints prior to command transmission to the actuators. This architecture suppresses physically infeasible LLM hallucinations and ensures interpretable, verifiable decision-making. Moreover, it enables fast-slow dual closed-loop collaborative control, where the slow loop undertakes high-level dynamic decision-making and the fast loop guarantees high-frequency real-time control, simultaneously balancing decision intelligence and control timeliness. Lake experiments under normal and lower-rudder-fault conditions show that the framework detects trajectory tracking abnormalities, replans the route by adjusting the turning radius from 4m to 12m and reducing speed from 2kn to 1kn, passes all three solver constraints on the first invocation, and guides the UUV to complete the full mission; under normal conditions no false fault alarms are raised throughout the run.

[AI-186] CTQWformer: A CTQW-based Transformer for Graph Classification

[Quick read]: This paper addresses the difficulty that graph neural networks (GNNs) and Transformer-based architectures have in simultaneously capturing global structural dependencies and modelling dynamic information propagation in graph learning. The key to the solution is CTQWformer, a hybrid framework that integrates continuous-time quantum walks (CTQW) with GNNs: a trainable Hamiltonian fuses graph topology and node features, giving a physically grounded model of quantum-walk dynamics that extracts structure-rich representations. These representations feed two complementary modules, a Graph Transformer module that embeds final-time propagation probabilities as structural biases in self-attention, and a Graph Recurrent module that captures temporal evolution patterns with bidirectional recurrent networks, effectively combining structure awareness with dynamic modelling.

Link: https://arxiv.org/abs/2605.09486
Authors: Zhan Li, Wuqing Yu, Yusen Wu, Chuan Wang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Comments:

Abstract:Graph Neural Networks (GNN) and Transformer-based architectures have achieved remarkable progress in graph learning, yet they still struggle to capture both global structural dependencies and model the dynamic information propagation. In this paper, we propose CTQWformer, a hybrid graph learning framework that integrates continuous-time quantum walks (CTQW) with GNN. CTQWformer employs a trainable Hamiltonian that fuses graph topology and node features, enabling physically grounded modeling of quantum walk dynamics that captures rich and intricate graph structure information. The extracted CTQW-based representations are incorporated into two complementary modules:(i) a Graph Transformer module that embeds final-time propagation probabilities as structural biases in the self-attention mechanism, and (ii) a Graph Recurrent Module that captures temporal evolution patterns with bidirectional recurrent networks. Extensive experiments on benchmark graph classification datasets demonstrate that CTQWformer outperforms graph kernel and GNN-based methods, demonstrating the potential of integrating quantum dynamics into trainable deep learning frameworks for graph representation learning. To the best of our knowledge, CTQWformer is the first hybrid CTQW-based Transformer, integrating CTQW-derived structural bias with temporal evolution modeling to advance graph learning.
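
The structural bias that CTQWformer feeds into attention comes from continuous-time quantum-walk propagation probabilities, which are straightforward to compute for a fixed Hamiltonian. The sketch below uses the plain adjacency matrix as the Hamiltonian; in the paper the Hamiltonian is trainable and fuses topology with node features, so this is illustrative only.

```python
# Minimal sketch of continuous-time quantum walk (CTQW) propagation probabilities,
# the quantity CTQWformer injects into self-attention as a structural bias.
# Assumption (not from the paper): the Hamiltonian is simply the adjacency matrix.
import numpy as np
from scipy.linalg import expm

def ctqw_propagation(adj, t=1.0):
    """Return P[i, j] = |<j| exp(-i H t) |i>|^2, the probability that a walker
    started at node i is observed at node j after time t."""
    H = np.asarray(adj, dtype=complex)
    U = expm(-1j * H * t)              # unitary time-evolution operator
    return np.abs(U) ** 2              # each row sums to 1 by unitarity

# 4-cycle graph
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
P = ctqw_propagation(A, t=0.8)
print(P.round(3))
print(P.sum(axis=1))                   # ~[1, 1, 1, 1]
```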

[AI-187] VulTriage: Triple-Path Context Augmentation for LLM-Based Vulnerability Detection

[Quick read]: This paper addresses the shortcomings of learning-based vulnerability detection in capturing structural program dependencies, domain-specific vulnerability knowledge, and complex program semantics; prompting large language models (LLMs) directly with raw source code often yields missed vulnerabilities or false alarms, especially when vulnerable and benign functions differ only in subtle semantics. The key to the solution is VulTriage, which augments the LLM input along three complementary paths: a Control Path that extracts and verbalizes AST, CFG, and DFG information to expose control and data dependencies; a Knowledge Path that retrieves CWE-derived vulnerability patterns and examples through hybrid dense-sparse retrieval; and a Semantic Path that summarizes the functional behavior of the code before the final judgment. The three contexts are merged into a unified instruction that guides the LLM toward more reliable vulnerability reasoning.

Link: https://arxiv.org/abs/2605.09461
Authors: Wenxin Tang, Xiang Zhang, Junliang Liu, Jingyu Xiao, Xi Xiao, Jinlong Yang, Yuehe Ma, Zhenyu Liu, Zhengheng Li, Zicheng Wang, Wang Luo, Qing Li, Lei Wang, Peng Xiangli
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Automated vulnerability detection is a fundamental task in software security, yet existing learning-based methods still struggle to capture the structural dependencies, domain-specific vulnerability knowledge, and complex program semantics required for accurate detection. Recent Large Language Models (LLMs) have shown strong code understanding ability, but directly prompting them with raw source code often leads to missed vulnerabilities or false alarms, especially when vulnerable and benign functions differ only in subtle semantic details. To address this, we propose VulTriage, a triple-path context augmentation framework for LLM-based vulnerability detection. VulTriage enhances the LLM input through three complementary paths: a Control Path that extracts and verbalizes AST, CFG, and DFG information to expose control and data dependencies; a Knowledge Path that retrieves relevant CWE-derived vulnerability patterns and examples through hybrid dense–sparse retrieval; and a Semantic Path that summarizes the functional behavior of the code before the final judgment. These contexts are integrated into a unified instruction to guide the LLM toward more reliable vulnerability reasoning. Experiments on the PrimeVul pair test set show that VulTriage achieves state-of-the-art performance, outperforming existing deep learning and LLM-based baselines on key pair-wise and classification metrics. Further ablation studies verify the effectiveness of each path, and additional experiments on the Kotlin dataset demonstrate the generalization ability of VulTriage under low-resource and class-imbalanced settings. Our code is available at this https URL

[AI-188] RAwR: Role-Aware Rewiring via Approximate Equitable Partition

[Quick read]: This paper targets the performance drop of graph neural networks (GNNs) on prediction tasks that depend on long-range interactions, a limitation rooted in oversquashing, where topological bottlenecks restrict how signals propagate through the network. The key to the solution is RAwR, a computationally efficient rewiring framework that augments the input graph with a quotient graph derived from equitable partitions, accelerating communication between nodes that share the same structural role and thereby lowering the total effective resistance of the system. By adopting an approximate notion of the equitable partition, RAwR allows a controllable reduction of the quotient graph, which in the most condensed case recovers the classical Master Node rewiring technique; a Spectral Role Lift (SRL) metric is further introduced to select the approximate equitable partition that maximizes predictive performance.

Link: https://arxiv.org/abs/2605.09457
Authors: Riccardo Porcedda, Giuseppe Squillace, Bastian Epping, Andrea Vandin, Michael Schaub, Mirco Tribastone, Francesca Chiaromonte
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments:

Abstract:While Graph Neural Networks (GNNs) have demonstrated significant efficacy in node classification tasks, where predictions rely on local neighborhood information, the performance of GNNs often drops when prediction tasks depend on long-range interactions. These limitations are attributed to phenomena such as oversquashing, where structural bottlenecks restrict signal propagation across the network topology. To address this challenge, we introduce RAwR, a computationally efficient rewiring framework that augments the input graph with a quotient graph derived from equitable partitions. This approach facilitates accelerated communication between nodes that share identical structural roles, as identified by the Weisfeiler-Leman graph coloring, and thereby reduces the total effective resistance of the system. Furthermore, by employing an approximate definition of the equitable partition, RAwR enables a controllable reduction of the quotient graph, which, in its most condensed state, recovers the conventional Master Node rewiring technique. Empirical evaluations across a diverse suite of benchmarks – including homophilic, heterophilic, and synthetic long-range datasets – demonstrate that RAwR achieves state-of-the-art results. Our contribution is further supported by an analytical investigation using a teacher-student model of linear GNNs, which elucidates the theoretical foundations of role-based rewiring. This analysis leads to the formulation of Spectral Role Lift (SRL), a metric designed to identify the optimal approximate equitable partition for maximizing predictive performance.
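
As an illustration of role-aware rewiring, the sketch below computes a Weisfeiler-Leman colour partition and wires every node to an auxiliary hub node for its colour class. This is only a rough stand-in for the paper's quotient graphs of (approximate) equitable partitions; the helper names and the hub construction are assumptions, not the authors' exact augmentation.

```python
# Minimal sketch of role-aware rewiring in the spirit of RAwR: nodes with the same
# Weisfeiler-Leman colour (structural role) get a shared auxiliary "role node",
# shortening paths between distant nodes that play identical roles.
from collections import Counter
from itertools import count

def wl_partition(adj, rounds=3):
    """adj: dict node -> set of neighbours. Returns dict node -> role id."""
    colour = {v: 0 for v in adj}
    for _ in range(rounds):
        signature = {v: (colour[v],
                         tuple(sorted(Counter(colour[u] for u in adj[v]).items())))
                     for v in adj}
        relabel, fresh = {}, count()
        colour = {v: relabel.setdefault(signature[v], next(fresh)) for v in adj}
    return colour

def rewire(adj):
    """Add one auxiliary role node per colour class, linked to all its members."""
    roles = wl_partition(adj)
    new_adj = {v: set(nbrs) for v, nbrs in adj.items()}
    for v, r in roles.items():
        hub = ("role", r)
        new_adj.setdefault(hub, set()).add(v)
        new_adj[v].add(hub)
    return new_adj, roles

# Path graph 0-1-2-3-4: endpoints share a role, as do nodes 1 and 3.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
g, roles = rewire(path)
print(roles)
print(len(g) - len(path), "role nodes added")
```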

[AI-189] SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

[Quick read]: This paper addresses the lack of abundant, diverse, automatically generated 3D interactive environments for training embodied agents: existing simulators rely on manually crafted scenes or procedural templates, while LLM-based 3D generation systems mostly produce static scenes without verifiable tasks or standard learning interfaces. The key to the solution is SimWorld Studio, an open-source platform built on Unreal Engine 5 whose core is SimCoder, a tool/skill-augmented coding agent that writes and executes engine-level code to construct physically grounded, evolving 3D worlds from language or image instructions. SimCoder self-evolves using feedback from compilation errors, physics checks, and VLM critiques, continually improving environments and expanding its tool and skill library. The platform further enables co-evolution between environment generation and embodied learning: agent performance feedback guides SimCoder to generate adaptive curricula near the learner's capability frontier, yielding large generalization gains on unseen benchmarks.

Link: https://arxiv.org/abs/2605.09423
Authors: Haoqiang Kang, Xiaokang Ye, Yuhan Liu, Siddhant Hitesh Mantri, Lingjun Mao, James Fleming, Drishti Regmi, Lianhui Qin
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:LLM/VLM-based digital agents have advanced rapidly thanks to scalable sandboxes for coding, web navigation, and computer use, which provide rich interactive training grounds. In contrast, embodied agents still lack abundant, diverse, and automatically generated 3D environments for interactive learning. Existing embodied simulators rely on manually crafted scenes or procedural templates, while recent LLM-based 3D generation systems mainly produce static scenes rather than deployable environments with verifiable tasks and standard learning interfaces. We introduce SimWorld Studio, an open-source platform built on Unreal Engine 5 for generating evolving embodied learning environments. At its core is SimCoder, a tool/skill-augmented coding agent that writes and executes engine-level code to construct physically grounded 3D worlds from language/image instructions. SimCoder self-evolves by using verifier feedback (e.g., compilation errors, physics checks, VLM critiques) to revise environments and autonomously add reusable tools and skills to its library. Generated worlds are exported as Gym-style environments for embodied agent learning. SimWorld Studio further enables co-evolution between environment generation and embodied learning: agent performance feedback guides SimCoder to generate adaptive curricula near the learner’s capability frontier, so that environments become increasingly challenging as the embodied agent improves. Three case studies on embodied navigation show that self-evolution improves generation reliability, generated environments substantially improve embodied agent performance that generalizes to unseen benchmarks, and co-evolution yields an 18-point success-rate gain over fixed-environment learning and a 40-point gain over an untrained agent.

[AI-190] From Passive Reuse to Active Reasoning : Grounding Large Language Models for Neuro-Symbolic Experience Replay

[Quick read]: This paper targets a limitation of experience replay in reinforcement learning (RL): conventional methods treat the replay buffer as a passive memory and prioritize samples by numerical prediction error, ignoring their semantic significance and failing to mirror how humans abstract fragmented experience into behavioral rules. The key to the solution is Neuro-Symbolic Experience Replay (NSER), which turns replay into an active engine for knowledge construction through a neuro-symbolic grounding pipeline: large language models (LLMs) induce candidate behavioral rules from accumulated trajectories in a zero-shot manner, the rules are grounded into differentiable first-order logic, and the resulting symbolic structures dynamically reweight the replay distribution, letting abstract knowledge directly shape policy optimization and yielding consistently better sample efficiency and convergence speed.

Link: https://arxiv.org/abs/2605.09419
Authors: Yanan Xiao, Yixiang Tang, Zechen Feng, Lu Jiang, Minghao Yin, Pengyang Wang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:While experience replay is essential for data efficiency in reinforcement learning (RL), standard methods treat the replay buffer as a passive memory system, prioritizing samples based on numerical prediction errors rather than their semantic significance. This approach stands in contrast to human learning, which accelerates mastery by actively abstracting fragmented experiences into behavioral rules. To bridge this gap, we propose Neuro-Symbolic Experience Replay (NSER), a framework that transforms experience replay from a passive sample reuse mechanism into an active engine for knowledge construction. Specifically, NSER addresses the incompatibility between linguistic reasoning and numerical optimization through a novel neuro-symbolic grounding pipeline. It leverages Large Language Models (LLMs) in a zero-shot manner to induce candidate behavioral rules from accumulated trajectories, grounds these insights into differentiable first-order logic representations, and utilizes the resulting symbolic structures to dynamically reweight the replay distribution. By allowing abstract knowledge to directly shape policy optimization, NSER achieves consistent superior sample efficiency and convergence speed across reactive, rule-based, and procedural benchmarks.

[AI-191] Strategic commitments shape collective cybersecurity under AI inequality

[Quick read]: This paper addresses the cybersecurity imbalance created by unequal access to AI-enabled defence tools, where resource-limited defenders cannot afford effective high-capability protection and remain persistently exposed to attack. The core of the solution is a targeted subsidy that removes the cost disadvantage of key committed defenders, raising the adoption of strong defence and improving overall system resilience. Using an evolutionary game-theoretic model in a finite population, the study shows that commitment sustained by social learning alone cannot stabilise secure outcomes because of high defence costs, whereas subsidised commitment markedly increases strong-defence adoption, suppresses successful attacks, and improves social welfare, providing theoretical grounding for AI governance and cybersecurity policy.

Link: https://arxiv.org/abs/2605.09415
Authors: Adeela Bashir, Zia Ush Shamszaman, Zhao Song, The Anh Han
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 26 pages, 16 figures

Abstract:The growing integration of AI into cybersecurity is reshaping the balance between attackers and defenders. When access to advanced AI-enabled defence tools is uneven, resource-limited defenders may be unable to adopt effective protection, creating persistent system vulnerabilities. We study the impact of differential AI access using an evolutionary game-theoretic model in a finite population. We first show that when high-capability defence is costly, the population is driven toward low-cost, weak-defence behaviour, sustaining attacks and weakening long-run security. To address this problem, we introduce differential access to AI defence tools by allowing defenders to choose between low- and high-capability protection based on their resources. We then examine the role of a small group of committed defenders who always adopt strong defence and influence others through social learning. Although commitment increases the prevalence of strong defence, it alone cannot stabilise secure outcomes due to high defence costs. We therefore incorporate a targeted subsidy to remove the cost disadvantage from committed defenders. Our analysis shows that subsidised commitment significantly increases strong defence adoption, suppresses successful attacks, and improves overall system resilience. Simulations across a broad parameter space confirm that subsidies consistently outperform commitment alone. In addition, social-welfare analysis shows improved defender outcomes while keeping attacker gains low. These findings suggest that targeted support for key defenders can be an effective mechanism for stabilising cybersecurity in AI-driven environments and provide a theoretical bridge between cybersecurity policy, AI governance, and strategic allocation of defensive AI capabilities.
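
A toy version of the underlying dynamics can clarify the subsidy mechanism: defenders imitate more successful peers under the Fermi (pairwise-comparison) rule, a few committed defenders never switch away from strong defence, and a subsidy offsets its cost. The payoff values, population size, and update rule below are illustrative assumptions, not the paper's calibrated model.

```python
# Toy finite-population simulation of subsidised commitment. All numbers are
# illustrative assumptions; the paper's attacker-defender payoff structure is richer.
import math
import random

N, COMMITTED, BETA = 100, 5, 1.0                 # population size, committed defenders, selection strength
COST_STRONG, COST_WEAK, LOSS, P_ATTACK = 8.0, 1.0, 10.0, 0.6

def payoff(strong, subsidy):
    cost = max(COST_STRONG - subsidy, 0.0) if strong else COST_WEAK
    expected_loss = 0.0 if strong else P_ATTACK * LOSS
    return -(cost + expected_loss)

def simulate(subsidy, steps=20000, seed=0):
    rng = random.Random(seed)
    strong = [True] * COMMITTED + [False] * (N - COMMITTED)
    for _ in range(steps):
        i, j = rng.randrange(N), rng.randrange(N)
        if i < COMMITTED:                        # committed defenders never switch
            continue
        pi, pj = payoff(strong[i], subsidy), payoff(strong[j], subsidy)
        if rng.random() < 1.0 / (1.0 + math.exp(-BETA * (pj - pi))):
            strong[i] = strong[j]                # imitate the more successful peer
    return sum(strong) / N

for s in (0.0, 2.0, 4.0):
    print(f"subsidy={s}: fraction adopting strong defence = {simulate(s):.2f}")
```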

[AI-192] RePO-VLA: Recovery-Driven Policy Optimization for Vision-Language-Action Models

[Quick read]: This paper addresses execution drift in Vision-Language-Action (VLA) models for long-horizon, contact-rich manipulation: success-only imitation provides little supervision against drift, and failed rollouts are usually discarded, so recovery behaviour is never modelled. The key to the solution is RePO-VLA, a recovery-driven policy optimization framework with three components: Recovery-Aware Initialization (RAI) slices recovery segments out of failed trajectories and resets history so corrective actions condition on the current adverse state rather than the preceding failure; a Progress-Aware Semantic Value Function (PAS-VF) aligns spatiotemporal trajectory features with instructions and successful references, salvaging useful labels from failure prefixes via reliability decay and distinguishing nominal, failed, and corrective behaviour; and Value-Conditioned Refinement (VCR) trains the policy to prefer high-progress actions. The approach needs no online failure detector or heuristic retries: at deployment, conditioning on a fixed high value (v = 1.0) biases actions toward the learned success manifold and substantially improves robustness.

Link: https://arxiv.org/abs/2605.09410
Authors: Weijia Liufu, Xiaoyu Guo, Ruiyi Chen, Jingzhi Liu, Kaidong Zhang, Xiwen Liang, Jianqi Lin, Dawei Sun, Yuze Wang, Rongtao Xu, Bingqian Lin, Bowen Yang, Tongtong Cao, Bowen Peng, Dongyu Zhang, Guangrun Wang, Min Wang, Liang Lin, Xiaodan Liang
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision-Language-Action (VLA) models remain brittle in long-horizon, contact-rich manipulation because success-only imitation provides little supervision for execution drift, while failed rollouts are often discarded. We introduce RePO-VLA, a recovery-driven policy optimization framework that assigns distinct roles to success, recovery, and failure trajectories. RePO-VLA first applies Recovery-Aware Initialization (RAI), slicing recovery segments and resetting history so corrective actions depend on the current adverse state rather than the preceding failure. It then learns a Progress-Aware Semantic Value Function (PAS-VF), aligning spatiotemporal trajectory features with instructions and successful references. The resulting labels salvage useful failure prefixes via reliability decay, while low-value labels mark drift and terminal breakdowns, teaching differences among nominal, failed, and corrective actions. The data engine turns adverse states into planner-generated or human-collected corrective rollouts, teaching recovery to the success manifold. Value-Conditioned Refinement (VCR) trains the policy to prefer high-progress actions. At deployment, a fixed high value ( v=1.0 ) biases actions toward the learned success manifold without online failure detectors or heuristic retries. We introduce FRBench, with standardized error injection and recovery-focused evaluation. Across simulated and real-world bimanual tasks, RePO-VLA improves robustness, raising adversarial success from 20% to 75% on average and up to 80% in scaled real-world trials.

[AI-193] Sparsity Moves Computation: How FFN Architecture Reshapes Attention in Small Transformers

[Quick read]: This paper asks how the local architecture of the Transformer feedforward network (FFN) shapes the distribution of computation across the whole model, in particular the role of attention. The study finds that FFN design choices do not only affect the block itself but redistribute the learned computation: sparse mixture-of-experts (MoE) routing shifts load from the FFN to attention, most visibly on digit addition with carry, and this redistribution is driven largely by architectural sparsity rather than router-learned specialization, since frozen random routing nearly matches learned routing. The key to the analysis is a set of controlled experiments (random routing, narrow FFNs, top-2 MoE) plus parameter-matching, activation-function, and width-scaling analyses, which decompose the effect into reduced per-token FFN capacity and sparse partitioning across experts and show that local FFN design choices have nonlocal consequences for Transformer computation.

Link: https://arxiv.org/abs/2605.09403
Authors: Gabriel Smithline, Chris Mascioli
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: Preprint

Abstract:Architectural choices inside the Transformer feedforward network (FFN) block do not merely affect the block itself; they reshape the computations learned by the rest of the model. We study this effect in one-layer Transformers trained on digit addition with carry, modular arithmetic, and histogram counting. Comparing dense FFNs, gated linear units (GLUs), mixture-of-experts (MoE), and MoE-GLUs, we find that sparse MoE routing can shift computation from FFN to attention, with the strongest ablation-visible effect on carry-based addition. We decompose this redistribution into reduced per-token FFN capacity and sparse partitioning across experts. Critically, frozen random routing nearly matches learned routing, suggesting that redistribution is driven largely by architectural sparsity rather than router-learned specialization. As a secondary finding, GLU-style multiplicative gating rotates task-relevant Fourier structure out of the per-neuron basis and into distributed subspaces, making neuron-level interpretability less informative while preserving structured computation. We validate these conclusions with random-routing, narrow-FFN, and top-2 MoE controls, plus parameter-matching, activation-function, and width-scaling analyses. Together, these results show that local FFN design choices can have nonlocal consequences for Transformer computation.
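
The architectural contrast at the centre of the study, a dense FFN versus a top-1 MoE FFN whose router is a frozen random projection, is easy to state in code. The sketch below uses illustrative dimensions and is not the authors' training setup; it only shows the two block variants being compared.

```python
# Minimal sketch contrasting a dense FFN with a top-1 MoE FFN whose router is a
# frozen random projection, the control the paper uses to separate architectural
# sparsity from learned specialization. Dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseFFN(nn.Module):
    def __init__(self, d=64, hidden=256):
        super().__init__()
        self.up, self.down = nn.Linear(d, hidden), nn.Linear(hidden, d)
    def forward(self, x):
        return self.down(F.relu(self.up(x)))

class Top1MoE(nn.Module):
    def __init__(self, d=64, hidden=64, n_experts=4, frozen_router=True):
        super().__init__()
        self.experts = nn.ModuleList(DenseFFN(d, hidden) for _ in range(n_experts))
        self.router = nn.Linear(d, n_experts, bias=False)
        if frozen_router:                        # frozen random routing control
            self.router.weight.requires_grad_(False)
    def forward(self, x):                        # x: (tokens, d)
        choice = self.router(x).argmax(dim=-1)   # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = choice == e
            if mask.any():
                out[mask] = expert(x[mask])
        return out

x = torch.randn(10, 64)
print(DenseFFN()(x).shape, Top1MoE()(x).shape)   # both (10, 64)
```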

[AI-194] Do Linear Probes Generalize Better in Persona Coordinates?

[Quick read]: This paper addresses the inadequacy of text-only monitoring when language models exhibit harmful behaviours such as strategic deception and sandbagging, since models may change behaviour during evaluation; this motivates white-box monitors such as linear probes, which currently fail under distribution shift. The key to the solution is to construct persona axes for deception and sycophancy from contrastive persona prompts and to take the first principal components (via unsupervised PCA) of the persona-specific vectors as a low-dimensional subspace that cleanly separates harmful from harmless personas while leaving out spuriously correlated features. Across 10 evaluation datasets, linear probes trained on projections onto these persona principal components generalize better than probes trained on raw activations, and a unified axis built from multiple harmful and harmless behaviours improves generalization further, indicating that persona vectors provide a useful inductive bias for transferable behaviour probes.

Link: https://arxiv.org/abs/2605.09391
Authors: Prasad Mahadik, Adrians Skapars
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 15 pages, preprint

Abstract:It is becoming increasingly necessary to have monitors check for harmful behaviors during language model interactions, but text-only monitoring has not been sufficient. This is because models sometimes exhibit strategic deception and sandbagging, changing their behavior during evaluation. This motivates the use of white-box monitors like linear probes, which can read the model internals directly. Currently, such probes can fail under distribution shift, limiting their usefulness in real settings. We study whether there exists a low-dimensional subspace of the model internals that captures harmful behaviors more robustly, while leaving out spuriously correlative features. Inspired by the Assistant Axis and Persona Selection Model, we construct persona axes for deception and sycophancy using contrastive persona prompts. The first principal components, obtained by unsupervised PCA of the persona-specific vectors, cleanly separate harmful and harmless personas. Across 10 evaluation datasets, we show that persona-derived directions transfer non-trivially and probes trained on persona-PC projections generalize better than probes trained on raw activations. We also find that a unified axis consisting of multiple harmful and harmless behaviors improves generalization across behaviors and datasets. Overall, persona vectors provide a useful inductive bias for building more transferable behavior probes.
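
The probing recipe is simple to sketch: average activations per contrastive persona prompt, run unsupervised PCA over those persona vectors, and train a linear probe on activations projected onto the resulting components. The data below is synthetic and all dimensions and names are assumptions; in practice the activations would come from a fixed layer of the monitored model.

```python
# Minimal sketch of a persona-PC probe. Synthetic stand-ins replace real
# hidden-state activations; the "deception" axis is planted so the example runs.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n_personas = 128, 12
axis = rng.standard_normal(d); axis /= np.linalg.norm(axis)   # hidden behaviour direction

# Mean activation per persona prompt (first half harmful, second half harmless).
persona_means = rng.standard_normal((n_personas, d)) * 0.3
persona_means[: n_personas // 2] += 2.0 * axis                # harmful personas shift along the axis

# Unsupervised PCA over persona vectors yields the persona principal components.
pcs = PCA(n_components=2).fit(persona_means)

# Behaviour samples: activations labelled deceptive (1) or honest (0).
X = rng.standard_normal((400, d))
y = rng.integers(0, 2, 400)
X += y[:, None] * 1.5 * axis

probe_raw = LogisticRegression(max_iter=1000).fit(X, y)                  # raw-activation probe
probe_pc = LogisticRegression(max_iter=1000).fit(pcs.transform(X), y)    # persona-PC probe
print("raw probe dim:", X.shape[1], "| persona-PC probe dim:", pcs.n_components_)
```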

[AI-195] NEXUS: Continual Learning of Symbolic Constraints for Safe and Robust Embodied Planning

[Quick read]: This paper addresses the fundamental gap between the probabilistic uncertainty inherent in large language models (LLMs) and the determinism and verifiable safety that embodied applications in the physical world demand. The key to the solution is NEXUS, a modular continual-learning framework that uses symbolic artifacts for symbolic grounding and knowledge evolution rather than treating them as static interfaces, and that explicitly decouples physical feasibility from safety specifications: agent capability is improved through closed-loop execution feedback, while probabilistic risk assessments are grounded into deterministic hard constraints that form a rigorous pre-action line of defence, allowing task success rate and planning efficiency to improve continually without compromising safety.

Link: https://arxiv.org/abs/2605.09387
Authors: Tiehan Cui, Peipei Liu, Yanxu Mao, Congying Liu, Mingzhe Xing, Datao You
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:While Large Language Models (LLMs) have catalyzed progress in embodied intelligence, a fundamental gap remains between their inherent probabilistic uncertainty and the strict determinism and verifiable safety required in the physical world. To mitigate this gap, this paper introduces NEXUS, a modular framework designed for continual learning in embodied agents. Different from prior works that treat symbolic artifacts merely as static interfaces, NEXUS leverages them for symbolic grounding and knowledge evolution. The framework explicitly decouples physical feasibility from safety specifications: the capability of agents is improved through closed-loop execution feedback, while probabilistic risk assessments are grounded into deterministic hard constraints to establish a rigorous pre-action defense. Experiments on SafeAgentBench demonstrate that NEXUS achieves superior task success rates while effectively refusing unsafe instructions, exhibiting robust defense against adversarial attacks, and progressively improving planning efficiency through knowledge accumulation.

[AI-196] From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs

[Quick read]: This paper addresses reliability and performance optimization for large-scale distributed AI training, where hardware failures are routine operating conditions and public operational evidence from production clusters is scarce. The key lies in a cross-organizational, unified monitoring pipeline (Prometheus time series plus operational logs) combined with a multi-signal failure-detection strategy, an in-depth analysis of storage I/O bottlenecks that traces the "bandwidth paradox" to saturation of the NFS RPC layer, and an automatic retry chain for node failures whose 33.3% success rate clearly exceeds the 12.5% rate of manual recovery. Together these practices provide GPU-centric scheduling, end-to-end observability, and precise fault localization for production-scale training, improving both stability and resource utilization.

Link: https://arxiv.org/abs/2605.09370
Authors: Daemyung Kang, Eunjin Hwang, Hanjeong Lee, HyeokJin Kim, Hyunhoi Koo, Jeongkyu Shin, Jeongseok Kang, Jihyun Kang, Joongi Kim, Junbum Lee, Jungseung Yang, Kyujin Cho, Youngsook Song
Affiliation: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: 42 pages, 19 figures, 16 tables. Lablup Technical Report

Abstract:Large-scale AI training is now fundamentally a distributed systems problem, and hardware failures have become routine operating conditions rather than rare exceptions. Public operational evidence from production training clusters, however, remains scarce. This technical report presents an empirical analysis of a 63-node NVIDIA B200 production cluster (504 GPUs), using 55 days of Prometheus time-series data and 73 days of operational logs covering 224 multi-node training sessions. The cluster operates within a cross-organizational environment in which five parties (SKT, Upstage, Lablup, NVIDIA Korea, and VAST Data) share a unified monitoring pipeline. This arrangement enabled joint diagnosis of a 60-node-scale storage I/O bottleneck that did not appear at 2-4-node scale, a production-scale phenomenon no single team could isolate alone. Drawing on a months-long pre-training campaign, we perform three quantitative analyses yielding four findings. First, statistical analysis over 751 Prometheus metrics and 10 XID-identified GPU failures achieves a 10/10 detection rate (2/10 pre-XID) at ~0.84 false positives per day. No single metric is consistently dominant across failure types, motivating a multi-signal detection strategy. Second, profiling 523 checkpoint events along the GPU VRAM to NFS path attributes the “bandwidth paradox” (1.4-10.4% utilization of 200 Gbps RoCE) to saturation of the 128-slot NFS RPC layer. Third, multi-node failure response shows concentrated exclusions (top 3 of 63 nodes account for 50% of all exclusions) and an auto-retry chain success rate of 33.3% over 12 chains (73 attempts), 2.7x the 12.5% manual recovery rate; the median retry interval is 11 min (IQR 10-11). All analyses are grounded in production infrastructure providing session-level workload management, GPU-centric scheduling, and unified observability.

[AI-197] Explainable Knowledge Tracing via Probabilistic Embeddings and Pattern-based Reasoning

[Quick read]: This paper addresses the limited interpretability of existing knowledge tracing (KT) models, which improve predictive accuracy but rely on deterministic vector embeddings and opaque latent state transitions, obscuring how specific past behaviours influence predictions. The key to the solution is Probabilistic Logical Knowledge Tracing (PLKT), which frames prediction as goal-conditioned evidence reasoning over historical learning behaviour: student knowledge states are represented with robust Beta-distributed probabilistic embeddings that explicitly model the uncertainty of past interactions and support logical operations such as conjunction, yielding transparent reasoning paths that reveal how particular past interactions contribute to the final prediction.

Link: https://arxiv.org/abs/2605.09369
Authors: Siyu Wu, Cong Xu, Wei Zhang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Knowledge Tracing (KT) models students’ knowledge states based on learning interactions to predict performance. While deep learning-based KT models have boosted predictive accuracy, most models rely on deterministic vector embeddings and opaque latent state transitions, limiting interpretability regarding how specific past behaviors influence predictions. To address this limitation, we propose Probabilistic Logical Knowledge Tracing (PLKT), an interpretable KT framework that formulates prediction as a goal-conditioned evidence reasoning process over historical learning behaviors. Instead of representing knowledge states as deterministic vector embeddings, PLKT employs robust Beta-distributed probabilistic embeddings to represent student knowledge states. This probabilistic foundation allows us to model the uncertainty of historical behaviors and perform explicit logical operations (e.g., conjunction), constructing transparent reasoning paths that reveal how specific past interactions contribute to the prediction. Extensive experiments show that PLKT outperforms state-of-the-art KT methods while achieving superior interpretability. Our code is available at this https URL.
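
One way to picture "logical operations over Beta-distributed states" is a moment-matched conjunction: treat "skill A AND skill B" as the product of two independent Beta-distributed mastery probabilities and fit a Beta to the result so everything stays in the same representation. This construction is an assumption for illustration; the paper's actual operators may differ.

```python
# Minimal sketch of a conjunction over Beta-distributed knowledge states via
# moment matching. Assumption (not from the paper): skills are independent and
# the product of their mastery probabilities is refit to a Beta distribution.
def beta_moments(a, b):
    m1 = a / (a + b)                              # E[X]
    m2 = a * (a + 1) / ((a + b) * (a + b + 1))    # E[X^2]
    return m1, m2

def beta_conjunction(a1, b1, a2, b2):
    m1x, m2x = beta_moments(a1, b1)
    m1y, m2y = beta_moments(a2, b2)
    mean = m1x * m1y                              # E[XY] for independent X, Y
    var = m2x * m2y - mean ** 2
    common = mean * (1 - mean) / var - 1          # method-of-moments Beta fit
    return mean * common, (1 - mean) * common

a, b = beta_conjunction(8, 2, 5, 5)               # confident skill AND uncertain skill
print(f"conjunction ~ Beta({a:.2f}, {b:.2f}), mean = {a / (a + b):.2f}")
```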

[AI-198] Towards a Virtual Neuroscientist: Autonomous Neuroimaging Analysis via Multi-Agent Collaboration

[Quick read]: This paper addresses the knowledge- and labour-intensive process of turning neuroimaging data into clinically actionable biomarkers: standardized workflows such as fMRIPrep are statically configured and cannot reason about downstream objectives, weigh alternative strategies, or close the loop between intermediate evidence and later decisions, leaving experts in cycles of manual trial and error that limit the scalability of biomarker development. The key to the solution is NIAgent, a multi-agent system for autonomous end-to-end neuroimaging analysis built on a code-centric execution paradigm in which specialist agents collaboratively synthesize and optimize executable programs over composable domain-specific primitives, adapting dynamically to runtime observations. A hierarchical verification framework that combines cohort-level metric screening with agentic visual inspection further drives evidence-grounded workflow remediation, improving predictive performance over standard workflow baselines while exhibiting strategy exploration and adaptive refinement.

Link: https://arxiv.org/abs/2605.09366
Authors: Keqi Han, Songlin Zhao, Yao Su, Lifang He, Carl Yang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Transforming neuroimaging data into clinically actionable biomarkers is a knowledge-intensive and labor-intensive process. Standardized workflows such as fMRIPrep have improved robustness and efficiency, but they are statically configured and cannot reason about downstream objectives, deliberate over alternative strategies, or close the loop between intermediate evidence and subsequent decisions in the way a human researcher would. This lack of closed-loop adaptation often leaves domain experts trapped in a cycle of manual trial-and-error to tune parameters and remediate pipeline failures, severely constraining the scalability of clinical biomarker development. To bridge this gap, we introduce NIAgent, a multi-agent system for autonomous end-to-end neuroimaging analysis. Unlike conventional flat tool-calling agents, NIAgent adopts a code-centric execution paradigm where specialist agents collaboratively synthesize and optimize executable programs over composable domain-specific primitives. This design enables robust, long-horizon workflow construction that adapts dynamically to runtime observations. Furthermore, we propose a hierarchical verification framework for autonomous quality control, integrating cohort-level metric screening with agentic visual inspection to drive evidence-grounded workflow remediation. Experiments on ADHD-200 and ADNI demonstrate that NIAgent outperforms standard workflow-based baselines in predictive performance while exhibiting sophisticated agentic behaviors, including strategy exploration and adaptive refinement.

[AI-199] Skill-R1: Agent Skill Evolution via Reinforcement Learning

[Quick read]: This paper addresses the cost and inflexibility of improving the skills used by agentic large language models: current practice relies on prompt engineering or fine-tuning the task LLM itself, which is computationally expensive, model-specific, and infeasible for closed-source models. The key to the solution is Skill-R1, a reinforcement-learning framework for instance-level recurrent skill optimization from verifiable rewards: instead of updating the task LLM, it trains a lightweight skill generator that conditions on the task context, prior rollouts, and their verified outcomes to produce skills that steer a frozen task LLM, preserving black-box compatibility with both open- and closed-source models at much lower adaptation cost. A bi-level group-relative policy optimization objective combines intra-generation and inter-generation advantages so that skills evolve directionally over multiple generations rather than being self-corrected once, yielding consistent gains over no-skill baselines and standard GRPO, especially on complex multi-step tasks.

Link: https://arxiv.org/abs/2605.09359
Authors: Yash Vishe, Rohan Surana, Xunyi Jiang, Zihan Huang, Xintong Li, Nikki Lijing Kuang, Tong Yu, Ryan A. Rossi, Jingbo Shang, Julian McAuley, Junda Wu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Agentic large language models often rely on skills, reusable natural language procedures that guide planning, action, and tool use. In practice, skills are typically improved through prompt engineering or by aligning the task LLM itself, which is costly, model-specific, and often infeasible for closed-source models. Skill optimization is not a one-step problem but a recurrent process with two coupled levels of credit assignment: a useful skill must improve rollout quality under current conditioning, while a useful revision must turn observed outcomes into a better skill for the next round. We propose Skill-R1, a reinforcement learning framework for instance-level recurrent skill optimization from verifiable rewards. Rather than updating the task LLM, Skill-R1 trains a lightweight skill generator that conditions on the task context, prior rollouts, and their verified outcomes to produce skills that steer a frozen task LLM. This preserves black-box compatibility with both open- and closed-source models while making adaptation substantially cheaper than model-level updates. Skill-R1 proceeds over multiple generations: at each step, the current skill induces rollouts whose verified outcomes are fed back to produce the next revision. To optimize this recurrent process, we introduce a bi-level group-relative policy optimization objective combining intra-generation and inter-generation advantages. The intra-generation term compares rollouts under shared skill conditioning, while the inter-generation term rewards revisions that improve behavior across successive generations. Together, these provide a principled objective for directional skill evolution rather than one-shot self-refinement. Empirically, Skill-R1 achieves consistent gains over no-skill baselines and standard GRPO across benchmarks with verifiable rewards, with particularly strong improvements on complex, multi-step tasks.
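
A compact way to read the bi-level objective is as two stacked advantage terms, one comparing rollouts within a generation and one comparing successive generations of the skill. The weighting and normalisation below are illustrative assumptions, not the paper's exact formula.

```python
# Minimal sketch of a bi-level group-relative advantage: the intra-generation term
# is a GRPO-style z-score over rollouts sharing the current skill, and the
# inter-generation term rewards improvement over the previous generation's mean.
import numpy as np

def bilevel_advantage(rewards_current, rewards_previous, weight=0.5, eps=1e-8):
    r = np.asarray(rewards_current, dtype=float)
    intra = (r - r.mean()) / (r.std() + eps)             # compare rollouts under the same skill
    inter = r.mean() - float(np.mean(rewards_previous))  # did the skill revision help overall?
    return intra + weight * inter                        # generation-level bonus broadcast to rollouts

prev_gen = [0.2, 0.3, 0.1, 0.4]       # verified rewards under the previous skill
curr_gen = [0.6, 0.4, 0.7, 0.5]       # rewards under the revised skill
print(bilevel_advantage(curr_gen, prev_gen).round(3))
```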

[AI-200] The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence?

[Quick read]: This paper asks why independently trained neural networks from different modalities converge toward shared representations and, crucially, in which direction, a question existing work cannot answer because it relies on symmetric similarity measures that are blind to directionality. The key to the solution is an asymmetric alignment measure, cycle-kNN, applied across dozens of independently trained unimodal models spanning point clouds, vision, and language. The experiments reveal a consistent asymmetry: non-language modalities move toward the neighbourhood structure of language far more than the reverse, across all model families and scales, a pattern invisible to symmetric measures. Mechanistic analysis traces the directionality to a feature-density asymmetry in which language representations occupy the most compact regions of representational space, and the Information Bottleneck framework interprets it as compression-driven optimization toward the discrete, compositional structure of language, which acts as the asymptotic attractor of multimodal convergence; the authors formalize this as the Wittgensteinian Representation Hypothesis.

Link: https://arxiv.org/abs/2605.09352
Authors: Zhaoyang Zhang, Run Shao, Dongyue Wu, Jiajie Teng, Chao Tao, Jingdong Chen, Haifeng Li
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 22 pages, 11 figures, 6 tables

Abstract:Understanding why independently trained neural networks from different modalities converge toward shared representations, and where this convergence leads, remains an open question in representation learning. All existing evidence relies on symmetric similarity measures, which can detect convergence but are structurally blind to its direction. We introduce directional convergence analysis using cycle-kNN, an asymmetric alignment measure, applied across dozens of independently trained unimodal models spanning point clouds, vision, and language. We uncover a consistent directional asymmetry: non-language modalities move toward the neighborhood structure of language significantly more than the reverse, and this pattern holds across all model families and scales–yet is entirely invisible to symmetric measures. Mechanistic analysis traces the directionality to feature density asymmetry, whereby language representations occupy the most compact regions of representational space. The Information Bottleneck framework provides a principled interpretation: optimization under compression drives representations toward discrete, compositional structures characteristic of language. We formalize this as the Wittgensteinian Representation Hypothesis: the semantic structure of language is the asymptotic attractor of multimodal representation convergence.
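
The abstract does not spell out the cycle-kNN definition, so the sketch below implements one plausible reading for paired samples: hop from an item to its nearest neighbour in one space, then to that item's nearest neighbour in the other space, and check whether the round trip stays within the anchor's k-NN set. Swapping the two spaces gives the score in the other direction. Function names and the toy data are assumptions, not the paper's procedure.

```python
# Minimal sketch of an asymmetric cycle-kNN style alignment score for two paired
# embedding spaces A and B (row i of A and row i of B describe the same item).
import numpy as np

def knn_index(X, k):
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def cycle_knn(A, B, k=5):
    nn_a = knn_index(A, 1)[:, 0]          # first hop, using A's geometry
    nn_b = knn_index(B, 1)[:, 0]          # second hop, using B's geometry
    knn_a = knn_index(A, k)
    end = nn_b[nn_a]                      # where the A -> B cycle lands
    return float(np.mean([end[i] in knn_a[i] for i in range(len(A))]))

rng = np.random.default_rng(0)
lang = rng.standard_normal((200, 16))                     # stand-in "language" features
vision = lang + 0.6 * rng.standard_normal((200, 16))      # noisier paired "vision" features
print("anchored on language:", cycle_knn(lang, vision))
print("anchored on vision:  ", cycle_knn(vision, lang))   # generally a different value
```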

[AI-201] CHAINTRIX: A multi-pipeline LLM-augmented framework for automated smart-contract security auditing

[Quick read]: This paper addresses the cost and slowness of smart-contract vulnerability auditing and the characteristic failure modes of automated tools: static analyzers produce findings that frequently fail manual triage, while large language models (LLMs) hallucinate findings that contradict the source code. The key to the solution is Chaintrix, an end-to-end auditing framework whose central commitment is that every LLM-generated claim must be discharged against a deterministic structural representation of the contracts. A Cross-Contract Interaction Model (CCIM) parses Solidity into a structured map of function-level reads, writes, modifiers, and resolved cross-contract calls, serving as the substrate for all 12 deterministic signal engines and the parallel LLM audit pipelines. A staged false-positive-reduction pipeline, ending in a Structural Verdict Engine (SVE) that applies deterministic structural checks against parsed code, filters the merged findings, with selected high-confidence findings further validated by symbolic execution and fuzz testing, yielding efficient, high-recall detection.

Link: https://arxiv.org/abs/2605.09350
Authors: Gabriela Dobrita, Simona-Vasilica Oprea, Adela Bara
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Smart-contract exploits have caused billions of USD in cumulative losses, yet audits remain expensive and slow. Automated tools have emerged to close this gap, but each class has a characteristic failure mode. Static analyzers report findings that frequently fail manual triage at high rates, while large language models (LLMs) hallucinate findings that contradict the source code. Thus, we propose Chaintrix, an end-to-end auditing framework whose central architectural commitment is that every LLM-generated claim must be discharged against a deterministic structural contract representation. We introduce a Cross-Contract Interaction Model (CCIM) that parses Solidity into a structured map of function-level reads, writes, modifiers and resolved cross-contract calls. CCIM serves as the substrate against which all 12 of Chaintrix’s deterministic signal engines and the parallel LLM audit pipelines operate. A staged false-positive-reduction pipeline, terminating in a Structural Verdict Engine (SVE) that applies deterministic structural checks against parsed code, filters the merged finding set, with selected high-confidence findings further validated through symbolic execution and fuzz testing. We evaluate Chaintrix on EVMbench, the smart-contract security benchmark by OpenAI, Paradigm, OtterSec. Chaintrix detects 86 of 120 high-severity vulnerabilities (71.7% recall), with 25 audits scoring 100% recall, placing Chaintrix 26 percentage points above the strongest frontier-model baseline.

[AI-202] Dsat: A Native SAT Solver for Discrete Logic

[Quick read]: This paper addresses how to reason natively over discrete variables, which are common in probabilistic reasoning, planning, and explainable AI; the standard practice of binarizing discrete variables into Boolean ones so that Boolean SAT solvers can be used brings computational and semantic costs. The key to the solution is a native SAT solver for discrete logic, a direct extension of Boolean logic in which variables take arbitrary discrete values. The solver retains the core machinery of Boolean SAT solvers, including unit resolution and clause learning, but redesigns these components to operate directly on discrete variables, avoiding the overhead and distortion of binarization; experiments show it compares favourably with CSP solvers on discrete CNFs, Boolean SAT solvers on binarized CNFs, and hybrid solvers.

Link: https://arxiv.org/abs/2605.09347
Authors: Yaofang Zhang, Ken Zhou, Adnan Darwiche
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: To appear at the International Conference on Theory and Applications of Satisfiability Testing (SAT), 2026

Abstract:Discrete variables are common in many applications, such as probabilistic reasoning, planning and explainable AI. When symbolic reasoning techniques are brought to bear on these applications, a standard technique for handling discrete variables is to binarize them into Boolean variables to allow the use of Boolean computational machinery such as SAT solvers. This technique can face both computational and semantical challenges though. In this work, we develop a native SAT solver for discrete logic, which is a direct extension of Boolean logic in which variables can take arbitrary values. Our proposed solver has a similar design to Boolean SAT solvers, with ingredients such as unit resolution and clause learning but ones that operate natively on discrete variables. We illustrate the merits of the developed SAT solver by comparing it empirically to CSP solvers applied to discrete CNFs, to a Boolean SAT solver applied to binarized CNFs, and to some hybrid solvers.
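
Unit resolution over discrete values is the piece that differs most visibly from the Boolean case, so here is a minimal sketch: clauses are disjunctions of (variable = value) literals, variables carry finite domains, and a clause with exactly one non-falsified literal forces that assignment. This is a toy propagator for illustration only, not the paper's solver, and it omits clause learning and search.

```python
# Minimal sketch of unit propagation over a discrete CNF, where each clause is a
# disjunction of (variable = value) literals and each variable has a finite domain.
def unit_propagate(clauses, domains):
    domains = {v: set(d) for v, d in domains.items()}
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            live = [(v, val) for v, val in clause if val in domains[v]]
            if not live:
                return None                      # every literal falsified: conflict
            if len(live) == 1:                   # unit clause: force the assignment
                v, val = live[0]
                if domains[v] != {val}:
                    domains[v] = {val}
                    changed = True
    return domains

# x, y, z each range over {0, 1, 2}; clauses are lists of (var, value) literals.
clauses = [[("x", 2)],                           # x must be 2
           [("x", 0), ("y", 1)],                 # x = 0 or y = 1
           [("y", 0), ("z", 2), ("x", 1)]]       # y = 0 or z = 2 or x = 1
domains = {"x": {0, 1, 2}, "y": {0, 1, 2}, "z": {0, 1, 2}}
print(unit_propagate(clauses, domains))          # forces x = 2, then y = 1, then z = 2
```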

[AI-203] SKG-VLA: Scene Knowledge Graph Priors for Structured Scene Semantics and Multimodal Reasoning for Decision Making

[Quick read]: This paper addresses the limited decision accuracy of large-scale complaint-handling systems, which rely on heterogeneous evidence (complaint narratives, screenshots, order metadata, historical interactions, and platform policies) yet typically perform shallow classification or template matching over a single modality, underusing explicit scene structure, rule knowledge, and cross-evidence dependencies. The key to the solution is SKG-VLA, which models each case as a structured complaint scene and represents its decision-relevant semantics with a Scene Knowledge Graph (SKG) that unifies complaint entities, evidence items, policy clauses, temporal events, transactional states, and action-relevant relations. On top of the SKG, a data-synthesis pipeline generates scene descriptions, rule-consistent graph generalizations, question-answer supervision, and decision recommendations, and a three-stage training strategy injects the structured scene priors into a multimodal decision model, markedly improving policy-grounded reasoning, decision accuracy, long-tail generalization, and robustness under incomplete evidence.

Link: https://arxiv.org/abs/2605.09343
Authors: Zeyu Li, Lei Li
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Decision making in large-scale complaint handling systems increasingly relies on heterogeneous evidence, including complaint narratives, screenshots, order metadata, historical interactions, and platform policies. Existing complaint understanding systems mainly perform shallow classification or template matching over isolated modalities, while underutilizing explicit scene structure, rule knowledge, and cross-evidence dependencies. To address this limitation, we present SKG-VLA for multimodal complaint decision making. The core idea is to model each case as a structured complaint scene and represent its decision-relevant semantics with a Scene Knowledge Graph (SKG), which organizes complaint entities, evidence items, policy clauses, temporal events, transactional states, and action-relevant relations into a unified graph. Based on SKG, we build a data synthesis pipeline that generates complaint scene descriptions, rule-consistent graph generalizations, question-answer supervision, and decision recommendations. We further construct a large-scale complaint scene dataset with both text-only and multimodal in-domain benchmarks. Finally, we adopt a three-stage training strategy – domain-adaptive pre-training, task-oriented instruction fine-tuning, and end-to-end multimodal alignment – to inject structured scene priors into a multimodal decision model. Experiments show that SKG-VLA consistently improves policy-grounded reasoning, complaint decision accuracy, long-tail generalization, and robustness under incomplete evidence.

[AI-204] The Trap of Trajectory: Towards Understanding and Mitigating Spurious Correlations in Agentic Memory

[Quick read]: This paper addresses spurious correlations introduced by agentic memory in LLM systems: retrieved memories can carry mis-correlated evidence and propagate erroneous reasoning into downstream decisions, undermining reliability. The key to the solution is CAMEL, a general, lightweight, plug-and-play calibration method that operates at both write and retrieval time across diverse memory architectures. CAMEL reduces reliance on three canonical types of spurious pattern, identified through causal structure and benchmarked over trajectory-level memory, while preserving or improving performance on clean inputs and remaining robust under adaptive attacks targeting the calibration itself, enabling more reliable deployment of agentic memory.

Link: https://arxiv.org/abs/2605.09330
Authors: Luoxi Tang, Rupali Rajendra Vaje, Yuqiao Meng, Sakshi Sunil Narkar, Weicheng Ma, Zeyu Ding, Dazheng Zhang, Zhaohan Xi
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Agentic memory enables LLMs to persist information beyond a single context window and reuse it in later decisions, but it also introduces a new vulnerability: spurious correlations, where retrieved memory carries miscorrelated evidence and propagates erroneous reasoning into downstream decisions. Despite the widespread use of agentic memory, this risk remains largely underexplored. We address it from two aspects. First, we benchmark several canonical types of spurious patterns identified through causal structure and record them across trajectory-level memory. Diagnosing agentic memory systems on this benchmark reveals that memory improves reasoning on clean inputs but amplifies reliance on spurious patterns when they are present. Second, we propose CAMEL, a plug-and-play calibration method that operates across diverse memory architectures at both write and retrieval time. CAMEL consistently reduces reliance on spurious patterns across all three types while preserving or improving performance on clean inputs and staying robust under adaptive attacks targeting the calibration. Overall, CAMEL offers a principled and lightweight solution toward more reliable agentic memory deployment.

[AI-205] How LLMs Are Persuaded: A Few Attention Heads Rerouted

[Quick read]: This paper addresses the vulnerability of language models to being persuaded into abandoning factual knowledge, a central AI-safety concern whose internal mechanism has remained poorly understood. The key finding is a compact causal mechanism: a small set of mid-layer attention heads almost entirely determines the answer, writing answer options into a low-dimensional polyhedron in which each option occupies a distinct vertex; persuasion does not blur belief or merely lower confidence, it triggers a discrete latent jump from the correct-answer vertex to the persuasion-target vertex. The decision heads do not reason over evidence but copy whichever option token their attention selects, and persuasion works by redirecting that attention through a rank-one evidence-routing feature, which can be modified directly to steer the model's choice or removed to block persuasion. The mechanism appears across open-source LLMs and realistic poisoning scenarios such as Generative Engine Optimization, revealing persuasion as a narrow, monitorable circuit.

Link: https://arxiv.org/abs/2605.09314
Authors: Xiangkun Sun, Lingkai Kong, Aoqi Zhang, Liang Zeng, Tonghan Wang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 9 pages, 9 figures

Abstract:Language models can be persuaded to abandon factual knowledge. This vulnerability is central to AI safety, but its internal mechanism remains poorly understood. We uncover a compact causal mechanism for persuasion-induced factual errors. A small set of mid-layer attention heads almost entirely determines the model’s answer. These heads write answer options into a low-dimensional polyhedron, with options occupying distinct vertices. Persuasion does not blur belief or merely reduce confidence; it causes a discrete latent jump from the correct-answer vertex to the persuasion-target vertex. We show that decision heads are not reasoning over evidence. Instead, they copy whichever option token their attention selects. Persuasion works by redirecting attention. We isolate a rank-one evidence-routing feature that controls the route. Directly modifying this feature steers the model’s choice, and removing it blocks persuasion. We then trace the feature back to a band of shallower attention heads that build it from persuasive keywords in the input. Every step is validated by intervention. This mechanism appears across open-source LLMs and realistic poisoning scenarios such as Generative Engine Optimization, revealing persuasion as a narrow, monitorable circuit.
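
The interventions described above, removing or amplifying a rank-one direction in an activation, are generic linear edits. The sketch below shows the arithmetic with a random stand-in direction; in the paper the direction is the learned evidence-routing feature and the edit is applied at specific heads and layers.

```python
# Minimal sketch of rank-one interventions on a hidden state. The direction u is
# random here for illustration; in practice it is a learned feature direction.
import numpy as np

def remove_direction(h, u):
    """Project the component along unit vector u out of h (blocks the route)."""
    u = u / np.linalg.norm(u)
    return h - (h @ u) * u

def steer(h, u, alpha=3.0):
    """Add alpha units of the direction to h (pushes toward the routed option)."""
    u = u / np.linalg.norm(u)
    return h + alpha * u

rng = np.random.default_rng(0)
h = rng.standard_normal(512)           # a residual-stream activation
u = rng.standard_normal(512)           # stand-in for the evidence-routing feature
h_blocked, h_steered = remove_direction(h, u), steer(h, u)
print(round(float(h_blocked @ (u / np.linalg.norm(u))), 6))   # ~0: component removed
```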

[AI-206] Teaching Molecular Dynamics to a Non-Autoregressive Ionic Transport Predictor ICML2026

[Quick read]: This paper addresses the difficulty of predicting ionic transport properties, which are inherently dynamic, quickly and accurately from static atomic structures without resorting to expensive molecular dynamics (MD) simulations. Existing options either rely on sequential inference (autoregressive MD acceleration), which remains slow and accumulates error, or ignore dynamics altogether, which leaves non-autoregressive property predictors inaccurate; moreover, existing methods typically exploit datasets either with or without atomic trajectories, but not both. The key to the solution is a non-autoregressive framework based on auxiliary modality learning: atomic trajectories are treated as an auxiliary modality during training so that the predictor implicitly learns the dynamics, but no trajectories are required at inference, preserving speed while substantially improving accuracy and allowing the model to benefit from both types of datasets.

Link: https://arxiv.org/abs/2605.09311
Authors: Jiyeon Kim, Byungju Lee, Won-Yong Shin
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atomic Physics (physics.atom-ph); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph)
Comments: International Conference on Machine Learning (ICML 2026) (to appear) (Please cite our conference version.)

Abstract:Unlike most static material properties widely studied in the machine learning literature, ionic transport properties are inherently dynamic, making their fast and accurate prediction from static atomic structures challenging. The current standard approach, molecular dynamics (MD) simulations, suffers from prohibitively high computational cost. Recent autoregressive learning-based MD acceleration methods requiring sequential inference remain slow and prone to error accumulation; in contrast, existing non-autoregressive material property prediction models are less accurate because they fail to exploit dynamics. Moreover, existing methods typically benefit from datasets either with or without atomic trajectories, but not both. To overcome these limitations, we propose a non-autoregressive learning framework based on auxiliary modality learning, which treats atomic trajectories as an auxiliary modality during training but does not require them at inference. This enables the predictor to learn dynamics without sequential inference while benefiting from both types of datasets. As a result, our framework achieves over 200 times speedup compared to autoregressive models on the dataset with atomic trajectories while substantially reducing prediction error relative to non-autoregressive benchmarks across both types of datasets. Our code is available at this https URL.

[AI-207] Beyond ESG Scores: Learning Dynamic Constraints for Sequential Portfolio Optimization

[Quick read]: This paper addresses a temporal-control mismatch in learning-based ESG-aware portfolio optimization: ESG scores are noisy, provider-dependent, low-frequency, and temporally misaligned with sequential investment decisions, yet most methods simply append static scores to the observation or reward; financial evidence further suggests ESG is better treated as a preference, risk-exposure, or hedging dimension than as a robust alpha factor. The key to the solution is the Multimodal Action-Conditioned Constraint Field (MACF), which learns mechanism-specific ESG cost functions from point-in-time multimodal evidence and contemplated portfolio transitions, imposing ESG constraints without modifying the financial policy's observations or reward. MACF-X, a family of optimizer-specific adapters with a shared slack- and uncertainty-aware pressure layer, converts MACF costs and uncertainties into native constrained-optimization interfaces, markedly reducing tail ESG budget pressure while maintaining competitive financial performance.

Link: https://arxiv.org/abs/2605.09310
Authors: Xin Li, Yan Ke, Longbing Cao
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Portfolio Management (q-fin.PM)
Comments:

Abstract:ESG-aware portfolio optimization is increasingly important for sustainable capital allocation, yet most learning-based methods still operationalize ESG by appending static scores to the policy observation or reward. This creates a mismatch for sequential control: ESG scores are noisy, provider-dependent, low-frequency, and temporally misaligned with sequential portfolio decisions, while financial evidence suggests that ESG is better treated as a portfolio preference, risk-exposure, or hedge dimension than as a robust alpha factor. We propose to impose ESG constraints without modifying the financial policy’s observation or reward, using a Multimodal Action-Conditioned Constraint Field (MACF) that learns mechanism-specific ESG costs from point-in-time multimodal evidence and contemplated portfolio transitions. We then introduce MACF-X, a family of optimizer-specific adapters that converts MACF costs and uncertainties into native constrained-optimization interfaces through a shared slack- and uncertainty-aware pressure layer. Across multiple constraint-integration interfaces, MACF-X reduces tail ESG budget pressure while maintaining competitive financial performance. Ablations show that this improvement depends on dynamic evidence inputs and three-head decomposition, while static ESG-score proxies are nearly indistinguishable from score-shuffled noise baselines.

[AI-208] Hierarchical Attention-based Graph Neural Network with Relevance-driven Pruning

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在处理异构节点类型时缺乏可解释性 attribution 以及在大规模噪声图上消息传递计算开销高的问题。其解决方案的关键在于提出一种分层注意力机制的异构图神经网络(Hierarchical Attention-based Heterogeneous GNN, HA-HeteroGNN),通过统一的“可解释性到剪枝”流程实现双重优化:首先利用两级注意力机制分离传感器级与上下文级计算,生成无需反向传播梯度的节点级相关性分数;随后以这些分数为依据进行图结构剪枝,在减少27%边数的同时提升分类准确率2.46–1.1%,颠覆了传统认为剪枝必然牺牲精度的认知。

链接: https://arxiv.org/abs/2605.09308
作者: Seungwoo Kum
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) excel at relational reasoning but face two persistent challenges: the lack of interpretable attribution for heterogeneous node types, and the computational overhead of message passing over large, noisy graphs. We propose the Hierarchical Attention-based Heterogeneous GNN (HA-HeteroGNN), a framework that addresses both issues through a unified explainability-to-pruning pipeline. A two-tier attention mechanism separates sensor-level and context-level computation across 16 node types and 18 edge types, producing per-node relevance scores via an attention-based GNN Explainer without requiring gradient backpropagation. These relevance scores then serve as a principled pruning criterion: removing nodes identified as consistently uninformative yields a 27% reduction in graph edges while simultaneously improving classification accuracy by 2.4-6.1% across all model variants, challenging the conventional assumption that pruning necessarily trades accuracy for efficiency. Experiments on a 50,000-record synthetic dataset spanning 11 report categories demonstrate 97.5% cross-strategy explanation stability and domain-consistent sensor attribution, with training-time reductions of up to 43.9% and real-time inference latency of approximately 58-60 ms per sample.

[AI-209] Neural Cluster First Route Second: One-Shot Capacitated Vehicle Routing via Differentiable Optimal Transport

[Quick read]: This paper revisits the classical Cluster-First-Route-Second (CFRS) paradigm to overcome three bottlenecks of current neural combinatorial optimization (NCO) for the Capacitated Vehicle Routing Problem (CVRP): sequential decoding, sensitivity to spatial symmetries, and brittle out-of-distribution generalization. The key to the solution is Neural CFRS, the first purely non-autoregressive, one-shot neural CFRS framework: a differentiable entropic Optimal Transport layer enforces global fleet-capacity constraints end-to-end, producing a continuous transport plan that sparsifies an exact capacitated assignment solver. The architecture intrinsically abstracts away E(2) spatial, inter-route permutation, and intra-route traversal symmetries, and a pre-trained spatial vocabulary delivers extreme parameter efficiency and zero-shot scaling, retaining a roughly 4% optimality gap on out-of-distribution N=1000 instances and reaching a 2.73% gap on standard size-100 benchmarks.

Link: https://arxiv.org/abs/2605.09301
Authors: Samuel J. K. Chin, Maximilian Schiffer
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 30 pages, 9 figures

Abstract:The Capacitated Vehicle Routing Problem (CVRP) underpins modern last-mile logistics. Current Neural Combinatorial Optimization (NCO) methods construct CVRP solutions autoregressively, inheriting sequential decoding bottlenecks, sensitivity to spatial symmetries, and brittle out-of-distribution behavior. We revisit the classical Cluster-First-Route-Second (CFRS) paradigm – long known to be asymptotically optimal but largely overlooked by NCO – and argue that it is structurally aligned with the core strengths of deep learning: similarity and assignment over global context, rather than the construction of long sequential tours. We introduce Neural CFRS, the first purely non-autoregressive one-shot neural CFRS framework for the CVRP. It enforces global fleet-capacity constraints end-to-end via a differentiable entropic Optimal Transport layer, producing a continuous transport plan to sparsify an exact capacitated assignment solver. We provide formal theoretical guarantees that our architecture intrinsically abstracts away E(2) spatial, inter-route permutation, and intra-route traversal symmetries. By equipping the framework with a pre-trained spatial vocabulary, we unlock extreme parameter efficiency and zero-shot scaling. Designed primarily for real-world spatial distributions under a constant capacity setting, Neural CFRS scales robustly to out-of-distribution N=1000 instances with a 4% gap – retaining an approximate 5% gap at this scale even as an ultra-lightweight, single-layer architecture. Furthermore, when deployed out-of-the-box on standard benchmarks, we achieve a highly competitive 2.73% optimality gap on size-100 problems.
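
The cluster-first step can be pictured as entropic optimal transport from customer demand to vehicle capacity, solved with Sinkhorn iterations. The sketch below uses Euclidean costs to illustrative "vehicle seed" locations and a naive argmax rounding; the paper instead couples the soft plan with an exact capacitated assignment solver inside a learned model, so everything here is an assumption for illustration.

```python
# Minimal sketch of the cluster-first step as entropic optimal transport:
# move customer demand mass to capacity-limited vehicles via Sinkhorn iterations,
# then read off a hard customer-to-vehicle assignment.
import numpy as np

def sinkhorn(cost, demand, capacity, eps=0.05, iters=500):
    a = demand / demand.sum()                      # customer marginal
    b = capacity / capacity.sum()                  # vehicle marginal
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]             # transport plan: rows ~ a, cols ~ b

rng = np.random.default_rng(0)
customers = rng.random((20, 2))                    # 20 customers in the unit square
seeds = rng.random((4, 2))                         # 4 illustrative vehicle seed locations
cost = np.linalg.norm(customers[:, None, :] - seeds[None, :, :], axis=-1)
plan = sinkhorn(cost, demand=np.ones(20), capacity=np.full(4, 5.0))
clusters = plan.argmax(axis=1)                     # soft plan -> hard cluster per customer
print(np.bincount(clusters, minlength=4))          # cluster sizes, roughly capacity-balanced
```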

[AI-210] Towards Effective Theory of LLMs: A Representation Learning Approach

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)内部计算过程难以解释的问题,即如何从微观的隐藏状态细节中提取出具有高阶语义意义和动态可解释性的宏观变量。其解决方案的关键在于提出 Representational Effective Theory (RET) 框架,通过自监督学习(类似 BYOL/JEPA 的目标函数)从隐藏状态轨迹中自动学习这些宏观变量(macrovariables),从而实现对激活值的粗粒化处理,同时保留对预测和解释至关重要的高层结构信息。RET 提取的状态不仅在时间上具有一致性,还能揭示推理过程中的“心智状态”轨迹,并支持早期行为结果预测与生成过程的因果干预,表明 LLM 计算可通过 RET 获得实用的有效描述。

链接: https://arxiv.org/abs/2605.09294
作者: Muhammed Ustaomeroglu,Guannan Qu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Project webpage: this https URL

点击查看摘要

Abstract:We propose Representational Effective Theory (RET), a framework for describing large language model computation in terms of learned macrostates rather than microscopic details. RET learns these macrostates from hidden-state trajectories using a BYOL/JEPA-style self-supervised objective, coarse-graining activations into macrovariables that preserve higher-level structure relevant for prediction and interpretation. We evaluate whether these macrovariables are practically relevant for interpretability: RET yields temporally consistent states that reveal “mental-state” trajectories of reasoning, capture high-level semantic structure, support early prediction of behavioral outcomes such as sycophancy, and provide causal handles for steering generations toward interpretable computational phases. Together, these results suggest that LLM computation admits useful effective descriptions via RET: high-level, dynamically meaningful variables that support interpretation, prediction, and intervention.

[AI-211] Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在数学推理评估中过度依赖最终答案准确率,而忽视了推理策略多样性的问题。传统评价体系无法捕捉模型在面对同一问题时是否具备灵活运用不同解题路径的能力,从而可能高估其真正的数学推理能力。解决方案的关键在于构建一个策略层级的评估框架,通过标注模型输出的策略身份(strategy identity)、有效性(validity)和正确性(correctness),并结合双AI编码与人工仲裁机制,在80道AMC 10/12和AIME题目上对四个前沿模型进行系统分析。结果显示,尽管各模型在单一解法提示下均达到95%-100%的答案准确率,但在多策略提示下显著低于人类参考策略集,且策略多样性存在明显差距,尤其在几何与数论领域;同时发现模型能产生部分基准未见的有效新策略,表明其具备一定替代性推理能力,但重复运行实验显示策略发现趋于饱和,进一步验证了策略多样性作为衡量数学推理能力的必要补充维度。

链接: https://arxiv.org/abs/2605.09292
作者: Xia Yang,Xuanyi Zhang,Hao Hu,Feng Ji
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Large language models now achieve high final-answer accuracy on mathematical reasoning benchmarks, but accuracy alone does not capture reasoning flexibility. We introduce a strategy-level evaluation framework instantiated on 80 AMC 10/12 and AIME problems with 217 AoPS-derived reference strategy families. Model outputs are annotated for strategy identity, validity, and correctness using dual-AI coding with human adjudication. Across four frontier models, we find a pronounced decoupling between answer accuracy and strategy diversity. Under a single-solution prompt, all models achieve high accuracy (95%-100%), but under a multiple-strategy prompt they recover substantially fewer strategies than the human reference set. Gemini, DeepSeek, GPT, and Claude generate 184, 152, 151, and 110 distinct valid strategies, respectively, with the largest gaps in Geometry and Number Theory. The models collectively produce 50 benchmark-novel valid strategies, indicating both incomplete coverage of human strategies and some capacity for alternative reasoning. A repeated-run robustness check on 20 problems shows diminishing gains in discovered strategies, with the strongest model recovering only 39 of 55 AoPS-reference strategies (71%) after three runs. These findings position strategy diversity as a complementary dimension for evaluating mathematical reasoning beyond answer correctness.

[AI-212] PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning

【速读】:该论文旨在解决基于大语言模型(Large Language Model, LLM)的搜索代理在长周期任务中面临的三个关键挑战:奖励稀疏性(Reward Sparsity)、孤立信用分配(Isolated Credit)和分布偏移(Distributional Shift)。现有方法难以在缺乏步骤级反馈的情况下区分动作质量,且独立评估每一步的信用无法捕捉序列依赖关系,同时模板化奖励估计偏离了模型自然生成分布。解决方案的核心是提出一种基于枢轴点的信用分配机制(Pivot-Based Credit Assignment, PiCA),其将搜索轨迹建模为累积搜索进展的序列过程,并引入基于势能的奖励塑造(Potential-Based Reward Shaping, PBRS)来定义与历史上下文相关的进程奖励。PiCA通过识别目标黄金子查询和子答案作为信息峰值(即“枢轴步”),锚定步骤奖励于最终任务目标,从而提供密集、枢纽感知且轨迹依赖的指导,同时保持分布一致性,显著提升了多任务场景下的性能表现。

链接: https://arxiv.org/abs/2605.09287
作者: Dongyi Liu,Yifan Niu,Qinwen Wang,Han Xiao,Jia Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Model (LLM)-based search agents trained with reinforcement learning (RL) have significantly improved the performance of knowledge-intensive tasks. However, existing methods encounter critical challenges in long-horizon credit assignment: (i) Reward Sparsity, where models receive only outcome feedback without step-level guidance to differentiate action quality; (ii) Isolated Credit, where credit is assigned to steps independently, failing to capture sequential dependencies; and (iii) Distributional Shift, where rewards are estimated on templates that deviate from the model’s natural generative distribution. To address these issues, we propose Pivot-Based Credit Assignment (PiCA), a novel step reward mechanism that reformulates the search trajectory as a sequential process of cumulative search progress. Unlike prior isolated step rewards, PiCA defines process rewards as success probabilities dependent on the historical context based on Potential-Based Reward Shaping (PBRS). This approach identifies pivot steps, which comprise target golden sub-queries and sub-answers derived from historical trajectories, as information peaks that significantly boost the likelihood of a correct final answer. By anchoring these step rewards to the final task objective, PiCA provides dense, pivot-aware and trajectory-dependent guidance while maintaining distributional consistency. Extensive experiments show that PiCA outperforms existing strong baselines across seven knowledge-intensive QA benchmarks, achieving 15.2% and 2.2% improvements for 3B and 7B models. The consistent performance gains across various models show PiCA’s robust generalization. The code is available at this https URL.
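The potential-based shaping at the core of PiCA can be illustrated with a toy example; the potential values below are placeholders, not the paper's estimator, which derives them from golden sub-queries and sub-answers found in historical trajectories.

```python
# Potential-based reward shaping (PBRS): r_t = gamma * phi(h_{t+1}) - phi(h_t),
# where phi(h) is the probability of a correct final answer given history h.
gamma = 1.0

# Placeholder potentials after each search step; the jump marks a "pivot" step
# (e.g. retrieving the golden sub-answer) that boosts the success probability.
potentials = [0.10, 0.15, 0.60, 0.65, 0.95]   # phi(h_0), ..., phi(h_T)

step_rewards = [
    gamma * potentials[t + 1] - potentials[t]
    for t in range(len(potentials) - 1)
]
print(step_rewards)          # the largest reward lands on the pivot step
print(sum(step_rewards))     # telescopes to phi(h_T) - phi(h_0) when gamma = 1
```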

[AI-213] Semi-Supervised Neural Super-Resolution for Mesh-Based Simulations ICML2026

【速读】:该论文旨在解决基于网格的偏微分方程(Partial Differential Equations, PDEs)模拟中因高精度需求导致计算成本高昂的问题。传统方法依赖细密网格以获得高保真解,但代价是巨大的计算开销。为此,作者提出SuperMeshNet框架,其核心创新在于引入互补学习(complementary learning)机制——一种半监督策略,通过两个协同训练的消息传递神经网络(Message Passing Neural Networks, MPNNs)同时利用少量配对的低分辨率(Low-Resolution, LR)与高分辨率(High-Resolution, HR)数据和大量未配对的LR数据,从而显著降低对昂贵HR标注数据的依赖。此外,模型还嵌入归纳偏置(inductive biases),进一步提升超分辨率性能。实验表明,SuperMeshNet仅需90%更少的HR数据即可实现比全监督基线更低的均方根误差(Root Mean Square Error, RMSE)。

链接: https://arxiv.org/abs/2605.09284
作者: Jiyeon Kim,Youngjoon Hong,Won-Yong Shin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Applied Physics (physics.app-ph); Computational Physics (physics.comp-ph)
备注: International Conference on Machine Learning (ICML 2026) (to appear) (Please cite our conference version.)

点击查看摘要

Abstract:Mesh-based simulations provide high-fidelity solutions to partial differential equations (PDEs), but achieving such accuracy typically requires fine meshes, leading to substantial computational overhead. Super-resolution techniques aim to mitigate this cost by reconstructing high-resolution (HR), high-fidelity solutions from low-cost, low-resolution (LR) counterparts. However, training neural networks for super-resolution often demands large amounts of expensive HR supervision data. To address this challenge, we propose SuperMeshNet, an HR data-efficient super-resolution framework for mesh-based simulations aided by message passing neural networks (MPNNs). At its core, SuperMeshNet introduces complementary learning, a semi-supervised approach that effectively leverages both 1) a small amount of paired LR-HR data and 2) abundant unpaired LR data via two jointly trained, complementary MPNN-based models. Additionally, our model is enriched by inductive biases, which are empirically shown to further improve super-resolution performance. Extensive experiments demonstrate that SuperMeshNet requires 90% less HR data to achieve even lower root mean square error (RMSE) than that of the fully supervised benchmark without the inductive biases. The source code and datasets are available at this https URL.
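A schematic sketch of the complementary semi-supervised objective under generic assumptions (a tiny MLP stands in for the paper's MPNNs, and the 0.1 consistency weight is arbitrary): two models share a supervised loss on the scarce paired LR-HR data and a mutual-consistency loss on abundant unpaired LR data.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the two complementary super-resolution networks
# (the paper uses message passing neural networks over meshes).
def make_net(d_lr=32, d_hr=128):
    return nn.Sequential(nn.Linear(d_lr, 64), nn.ReLU(), nn.Linear(64, d_hr))

net_a, net_b = make_net(), make_net()
opt = torch.optim.Adam(list(net_a.parameters()) + list(net_b.parameters()), lr=1e-3)

lr_paired, hr_paired = torch.randn(16, 32), torch.randn(16, 128)   # scarce paired data
lr_unpaired = torch.randn(256, 32)                                  # abundant LR-only data
mse = nn.MSELoss()

for step in range(200):
    opt.zero_grad()
    # Supervised term on the small paired set.
    sup = mse(net_a(lr_paired), hr_paired) + mse(net_b(lr_paired), hr_paired)
    # Complementary term: the two models must agree on unpaired LR inputs.
    consistency = mse(net_a(lr_unpaired), net_b(lr_unpaired))
    (sup + 0.1 * consistency).backward()
    opt.step()
```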

[AI-214] EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium

【速读】:该论文旨在解决多智能体辩论(Multi-agent Debate, MAD)系统中共享记忆(shared memory)因单个条目被污染而导致下游推理失效的问题,现有防护机制依赖启发式规则或大语言模型(Large Language Model, LLM)验证,但这些方法易受相同错误模式影响且忽视MAD中多智能体间的动态交互。解决方案的关键在于将MAD中的记忆更新建模为一个零信任记忆博弈(zero-trust memory game),其中不假设任何智能体诚实,通过博弈均衡作为最优记忆可信度的指示器;进而提出EquiMem机制,在推理时对每个记忆更新进行算法校准,利用智能体已有的检索查询和遍历路径作为证据,无需引入LLM判断,从而在嵌入式与图结构记忆等多种架构下均显著优于现有方案,并具备抗恶意智能体攻击能力及极低的推理开销。

链接: https://arxiv.org/abs/2605.09278
作者: Yuqiao Meng,Sakshi Sunil Narvekar,Luoxi Tang,Rupali Rajendra Vaje,Yingxue Zhang,Muchao Ye,Zhaohan Xi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-agent debate (MAD) systems increasingly rely on shared memory to support long-horizon reasoning, but this convenience opens a critical vulnerability: a single corrupted entry can contaminate the downstream memory-augmented reasoning, and debate alone fails to filter such errors. Existing safeguards filter entries via heuristics or LLM-based validation, yet they rely on AI judgments that share the same failure modes and overlook the cross-agent dynamics of MAD. We address this gap by formulating memory updating in MAD as a zero-trust memory game, in which no agent is assumed honest and the game’s equilibrium serves as an indicator of optimal memory trust. Guided by this equilibrium, we propose EquiMem, an inference-time calibration mechanism that quantifies each update algorithmically against the shared memory state, using agents’ existing retrieval queries and traversal paths as evidence rather than soliciting any LLM judgment. EquiMem instantiates calibration for both embedding- and graph-based memory, and across diverse benchmarks, MAD frameworks, and memory architectures, it consistently outperforms existing safeguards, remains robust under adversarial agents, and incurs negligible inference overhead.

[AI-215] Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂问题求解中因自然语言表达能力有限而导致的智能瓶颈问题。当前模型虽通过规模扩展积累了大量知识,但知识的有效激活与组织仍受限于语言表示的结构与符号复杂度。论文的核心解决方案在于提出并实证验证:通过设计更具结构性和符号 sophistication 的语言表示(language representation),可显著提升LLM的知识激活效率与任务表现,而无需调整模型参数或规模。其关键创新在于将语言表示视为塑造认知schema(schema)的基础,并通过理论形式化与控制实验验证了不同语言表述对模型内部特征激活及性能的影响差异,从而确立语言表示设计为拓展LLM智能的新前沿。

链接: https://arxiv.org/abs/2605.09271
作者: Zhiqin Yang,Yuhan Liu,Jingwen Fu,Pei Fu,Bo Han,Masashi Sugiyama,Nanning Zheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 41 pages, 30 figures

点击查看摘要

Abstract:Although natural language is the default medium for Large Language Models (LLMs), its limited expressive capacity creates a profound bottleneck for complex problem-solving. While recent advancements in AI have relied heavily on scaling, merely internalizing knowledge does not guarantee its effective application. Defining language representation as the linguistic and symbolic constructs used to map and model the real world, this paper argues that shaping schemas through advanced language representation is the next frontier for expanding LLM intelligence. We posit that an LLM’s knowledge activation and organization – its schema – depends heavily on the structural and symbolic sophistication of the language used to represent a given task. This paper contributes both a formalization of this claim and the empirical evidence to support it. With a new formalization, we present multiple lines of evidence to support our position: Firstly, we review recent empirical practices and emerging methodologies that demonstrate the substantial performance gains achievable through deliberate language representation design, even without modifying model parameters or scale. Secondly, we conduct controlled experiments showing that LLM performance and its internal feature activations vary under different language representations of the same underlying task. Together, these findings highlight language representation design as a promising direction for future research.

[AI-216] Memorize Theorems Not Instances: Probing SFT Generalization through Mathematical Reasoning

【速读】:该论文旨在解决监督微调(Supervised Fine-Tuning, SFT)在任务特定适配中导致推理泛化能力下降的问题。研究表明,问题根源并非记忆机制本身,而是SFT倾向于让模型学习并记忆问题-答案对中的表面相关性(spurious surface correlations),从而对输入的表层变化变得脆弱。解决方案的关键在于提出Theorem-SFT,它通过将监督信号重新导向显式的定理应用,教导模型如何调用规则而非仅仅学习答案形式,从而提升模型在不同场景下的推理鲁棒性。实验证明,该方法在多个基准和模型架构上均取得显著性能提升,且仅微调MLP层即可达到全层微调效果,表明前馈网络是推理规则的主要存储位置。

链接: https://arxiv.org/abs/2605.09270
作者: Ruiying Peng,Mengyu Yang,Jing Lei,Xiaohui Li,Xueyu Wu,Xinlei Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Supervised Fine-Tuning (SFT) is widely used for task-specific adaptation, yet recent work shows it systematically undermines reasoning generalization. We argue the root cause is not memorization itself, but its target: vanilla SFT drives models to exploit and memorize spurious surface correlations in problem-solution pairs, leaving them brittle to superficial input variations. To address this, we propose Theorem-SFT, which reorients supervision toward explicit theorem application by teaching models how rules are invoked rather than what answers look like. Theorem-SFT yields consistent gains across benchmarks and model families: +8.8% on MATH (LLaMA3.2-3B-Instruct) and +20.27% on GeoQA (Qwen2.5-VL-7B-Instruct) without modality-specific re-training. Fine-tuning the MLP layers alone matches full-layer fine-tuning performance, implicating feed-forward components as the primary locus of reasoning rules. Our findings reframe the debate: Generalization failures stem not from memorization as a mechanism, but from memorizing the wrong inductive targets.
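Since the abstract reports that fine-tuning the MLP layers alone matches full fine-tuning, here is a minimal sketch of that restriction using Hugging Face Transformers; the checkpoint name and the ".mlp." module-naming convention are assumptions (they hold for LLaMA-style models), and the usual SFT training loop is otherwise unchanged.

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical checkpoint; any LLaMA-style model whose feed-forward blocks are
# named "...mlp..." works the same way.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct", torch_dtype=torch.bfloat16
)

# Freeze everything, then unfreeze only the feed-forward (MLP) blocks.
for name, param in model.named_parameters():
    param.requires_grad = ".mlp." in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable / total:.1%} of {total}")
# A standard SFT loop (e.g. with the usual Trainer) then updates only these layers.
```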

[AI-217] SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning

【速读】:该论文旨在解决当前多模态模型在跨模态信息传递过程中推理能力不一致的问题,即模型是否能在关键信息从文本逐步转移到图像时保持相同的推理能力。其解决方案的关键在于构建了一个细粒度的模态迁移基准 SeePhys Pro,该基准为每个问题提供四个语义对齐的变体,逐步增加视觉元素;并通过盲训练(blind training)和多种控制实验(如文本删除、图像掩码率和格式饱和度控制)揭示:尽管模型在训练时屏蔽所有图像仍能提升验证集性能,但这种提升主要源于残留文本线索和分布特征,而非有效的视觉证据。这一发现强调了评估多模态推理不仅应关注最终答案准确性,还需考察模态迁移下的鲁棒性及改进是否依赖于任务关键的视觉信息。

链接: https://arxiv.org/abs/2605.09266
作者: Kun Xiang,Terry Jingchen Zhang,Zirong Liu,Bokai Zhou,Yueling Tang,Junjie Yu,Jiacong Lu,Shangrui Huang,Heng Li,Likui Zhang,Kunkun Liu,Changzheng Zhang,Yangle Fang,Boqiang Guo,Hui-Ling Zhen,Dandan Tu,Yinya Huang,Xiaodan Liang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce SeePhys Pro, a fine-grained modality transfer benchmark that studies whether models preserve the same reasoning capability when critical information is progressively transferred from text to image. Unlike standard vision-essential benchmarks that evaluate a single input form, SeePhys Pro features four semantically aligned variants for each problem with progressively increasing visual elements. Our evaluation shows that current frontier models are far from representation-invariant reasoners: performance degrades on average as information moves from language to diagrams, with visual variable grounding as the most critical bottleneck. Motivated by this inference-time fragility, we further develop large training corpora for multimodal RLVR and use blind training as a diagnostic control, finding that RL with all training images masked can still improve performance on unmasked validation sets. To analyze this effect, text-deletion, image-mask-rate, and format-saturation controls suggest that such gains can arise from residual textual and distributional cues rather than valid visual evidence. Our results highlight the need to evaluate multimodal reasoning not only by final-answer accuracy, but also by robustness under modality transfer and by diagnostics that test whether improvements rely on task-critical visual evidence.

[AI-218] Remix the Timbre: Diffusion-Based Style Transfer Across Polyphonic Stems

【速读】:该论文旨在解决多乐器(multi-instrument)场景下音色迁移(timbre transfer)的难题,即如何在不依赖先分离后迁移(separate-then-transfer)管道的前提下,直接从混合音频中实现对每个声部(stem)的灵活音色转换。现有方法因采用分步处理策略,易引入源分离误差并导致跨声部合成音色不一致。其解决方案的关键在于提出MixtureTT——一个基于共享扩散过程(shared diffusion process)的联合声部扩散Transformer架构,通过建模各声部内容间的依赖关系及跨声部谐波结构,实现了多声部音色的协同迁移,从而消除级联分离误差、降低推理成本,并显著提升输出一致性与质量。

链接: https://arxiv.org/abs/2605.09259
作者: Leduo Chen,Junchuan Zhao,Shengchen Li
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Timbre transfer aims to modify the timbral identity of a musical recording while preserving the original melody and rhythm. While single-instrument timbre transfer has made substantial progress, existing approaches to multi-instrument settings rely on separate-then-transfer pipelines that propagate source separation artifacts and produce incoherent synthesized timbres across stems. This paper proposes MixtureTT, to the best of our knowledge the first system for flexible per-stem timbre transfer directly from a polyphonic mixture. Given a mixture and a separate timbre reference for each target voice, MixtureTT jointly transfers all stems to the specified instruments through a shared diffusion process. Modeling the dependencies across the per-stem content and cross-stem harmonic, the proposed joint stem diffusion transformer eliminates cascaded separation error, reduces inference cost by a factor equal to the number of stems, and yields more coherent multi-stem outputs. Despite operating under a strictly harder input condition, evaluations on the SATB choral dataset show that MixtureTT outperforms single-instrument baselines on both objective and subjective metrics demonstrating the necessity of dedicated multi-instrument timbre transfer over the naive separate-then-transfer pipelines. As a result, this work confirms that the cross-stem modeling is essential for mixture-level timbre transfer as the proposed joint setting consistently exceeds an equivalent single-stem ablation.

[AI-219] Improving Generalization by Permutation Routing Across Model Copies

【速读】:该论文旨在解决机器学习模型在训练过程中因参数空间耦合或副本坍缩(replica collapse)导致的泛化性能受限问题。传统方法如复制随机梯度下降(replicated SGD)或弹性SGD通过参数平均或显式吸引力机制来耦合多个模型副本,但容易引发冗余更新和过拟合。本文提出一种基于M-覆盖(M-cover)变换的新框架,其关键在于不依赖参数空间的直接耦合,而是通过一个结构化的混合核 Q 对模型副本的参数进行路由重组,使每个局部损失函数在由不同副本参数按排列组合构成的“路由模型”上计算;随后利用原始局部更新规则进行优化,并通过这些路由路径重新分配学习信息。这种机制本质上构建了一个具有长环结构的提升因子图(lifted factor graph),从而实现结构化消息传递,有效提升模型泛化能力。

链接: https://arxiv.org/abs/2605.09256
作者: Shuhei Kashiwamura,Timothee Leleu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We introduce a use of the M-cover (or M-layer) transform for machine learning. The method replicates a model M times, but instead of coupling the copies through parameter averaging or an explicit attractive force, as in replicated SGD or Elastic SGD, it rewires the contexts in which local learning messages are computed. Each local loss is evaluated on a routed model whose parameters are drawn from different copies according to permutations sampled from a structured mixing kernel Q. Training then uses the original local update rule, while the resulting learning messages are redistributed across the copies through these routed computational paths. Thus Q defines a topology for message transport and controls the long-loop structure of the lifted factor graph. We formulate this construction for perceptrons, committee machines, and multilayer perceptrons, showing that the same principle applies from discrete models to differentiable neural networks. The resulting framework provides a mechanism for improving generalization through structured message sharing rather than replica collapse or parameter-space coupling.
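A small sketch of permutation routing across M copies, with every detail assumed for illustration: each local loss is computed on a routed model whose layers are drawn from different copies according to sampled permutations (a uniform mixing kernel Q here), so gradients flow back to whichever copy supplied each layer.

```python
import torch

M, in_dim, hidden, out_dim = 4, 8, 16, 1

# M independent copies of a tiny two-layer MLP, stored as parameter dicts.
copies = [
    {"W1": torch.randn(in_dim, hidden, requires_grad=True),
     "W2": torch.randn(hidden, out_dim, requires_grad=True)}
    for _ in range(M)
]

def routed_forward(x, route):
    # route[layer] says which copy supplies that layer's parameters.
    h = torch.tanh(x @ copies[route["W1"]]["W1"])
    return h @ copies[route["W2"]]["W2"]

x, y = torch.randn(32, in_dim), torch.randn(32, out_dim)
opt = torch.optim.SGD([p for c in copies for p in c.values()], lr=1e-2)

for step in range(100):
    opt.zero_grad()
    perm1, perm2 = torch.randperm(M), torch.randperm(M)   # mixing kernel Q ~ uniform
    loss = 0.0
    for m in range(M):                                    # one local loss per routed model
        route = {"W1": perm1[m].item(), "W2": perm2[m].item()}
        loss = loss + ((routed_forward(x, route) - y) ** 2).mean()
    loss.backward()                                       # messages redistributed across copies
    opt.step()
```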

[AI-220] How Much is Brain Data Worth for Machine Learning?

【速读】:该论文试图解决的问题是:在机器学习模型训练中,如何量化神经数据(如脑电活动)对提升模型性能和鲁棒性的价值,以及在何种条件下使用脑数据能带来显著收益。解决方案的关键在于构建一个线性高斯模型来形式化任务目标与神经记录之间的关系,并推导出基于脑数据和任务标签联合训练的多模态估计器的性能 scaling laws(缩放规律)。通过这些规律,作者进一步定义了脑样本与任务样本之间的相对价值和交换率,从而定量评估脑数据在不同任务-脑对齐程度、噪声水平、潜在维度及脑数据样本量下的边际效用。此外,研究还分析了测试分布偏移场景下脑正则化学习带来的鲁棒性增益机制,最终在固定采集预算约束下识别出脑数据值得收集的条件区间。

链接: https://arxiv.org/abs/2605.09243
作者: Lane Lewis,Zhixin Wang,David Schwab,Xaq Pitkow
机构: 未知
类目: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注: 9 pages main text, 5 figures, 34 pages of appendix with detailed proofs

点击查看摘要

Abstract:If a person can solve a task, can measuring their brain make it easier to train a model to solve that task too? Recent NeuroAI work suggests that supplementing task training with neural recordings can modestly improve model performance and robustness. However, it is unclear when there should be a benefit from using neural data and how much benefit to expect. We formulate this question mathematically, and begin to address it theoretically using a simple, analytically tractable linear gaussian model of task targets and neural recordings. For a multimodal estimator trained on both brain data and task labels, we derive scaling laws for how performance scales with the numbers of brain and task samples. From these laws we derive relative value and exchange rates between brain samples and task samples, quantifying how much extra task samples neural data is worth as a function of task-brain alignment, neural and task noise, latent dimension, and brain data sample size. We also analyze test distribution shift, to identify conditions where brain-regularized learning can produce substantial robustness gains through learned invariances. Finally, under a fixed collection budget, we characterize the regimes in which brain data is worth collecting. Our results provide a foundation for understanding how valuable brain data could be for improving machine learning.

[AI-221] Sub-JEPA: Subspace Gaussian Regularization for Stable End-to-End World Models

【速读】:该论文旨在解决联合嵌入预测架构(Joint-Embedding Predictive Architectures, JEPA)在训练过程中因表示方差过大而导致模型坍缩至平凡解的问题。研究表明,这种坍缩现象源于对潜在嵌入(latent embeddings)施加全局高斯先验时引入的过强约束,从而破坏了嵌入空间中固有的低维流形结构。解决方案的关键在于:不再直接在原始高维嵌入空间中施加各向同性高斯先验,而是通过在多个随机子空间中分别应用高斯约束,实现局部而非全局的正则化。这一设计在保持防止坍缩效果的同时,显著缓解了过度约束问题,从而在偏差-方差权衡上找到更优平衡点,提升了训练稳定性与表征质量。

链接: https://arxiv.org/abs/2605.09241
作者: Kai Zhao,Dongliang Nie,Yuchen Lin,Zhehan Luo,Yixiao Gu,Deng-Ping Fan,Dan Zeng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: this https URL

点击查看摘要

Abstract:Joint-Embedding Predictive Architectures (JEPAs) provide a simple framework for learning world models by predicting future latent states. However, JEPA training is subject to a bias-variance trade-off. Without sufficient structural constraints, excessive representational variance causes the model to collapse to trivial solutions. The recent LeWorldModel (LeWM) shows that this issue can be alleviated by simply constraining latent embeddings with an isotropic Gaussian prior. However, latent representations inherently lie on low-dimensional manifolds within a high-dimensional ambient space, and enforcing an isotropic Gaussian prior directly in this ambient space introduces an overly strong constraint. In this work, we propose Sub-JEPA, which seeks a favorable operating point on the bias-variance frontier by applying Gaussian constraints in multiple random subspaces rather than in the original embedding space. This design relaxes the global constraint while preserving its anti-collapse effect, leading to a better balance between training stability and representation quality. Extensive experiments across four continuous-control environments demonstrate that Sub-JEPA consistently outperforms LeWM with very clear margins. The method is simple yet effective, and serves as a strong baseline for future JEPA-based world model research. The code is available at this https URL.
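A minimal sketch of the subspace idea, with the penalty form assumed (moment matching toward N(0, I) in each random subspace, rather than the paper's exact regularizer):

```python
import torch

def subspace_gaussian_penalty(z, num_subspaces=4, sub_dim=8):
    """Penalty pushing embeddings z (batch, d) toward N(0, I) in random subspaces."""
    d = z.shape[1]
    penalty = 0.0
    for _ in range(num_subspaces):
        # Random orthonormal basis of a sub_dim-dimensional subspace.
        A = torch.randn(d, sub_dim, device=z.device)
        Q, _ = torch.linalg.qr(A)
        p = z @ Q                                  # projected embeddings
        mean = p.mean(dim=0)
        cov = (p - mean).T @ (p - mean) / (p.shape[0] - 1)
        penalty = penalty + mean.pow(2).sum() + (cov - torch.eye(sub_dim)).pow(2).sum()
    return penalty / num_subspaces

z = torch.randn(256, 128, requires_grad=True)       # stand-in for JEPA embeddings
reg = subspace_gaussian_penalty(z)
reg.backward()                                      # added to the JEPA prediction loss in practice
print(float(reg))
```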

[AI-222] Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds

【速读】:该论文旨在解决传统Muon优化器在流形约束参数(如低秩因子分解、正交约束或对称正定(SPD)矩阵)上难以有效推广的问题。现有方法通过将Muon的线性最大化Oracle(LMO)限制在切空间,会破坏商空间对称性,并使切空间约束与环境范数边界耦合,从而阻碍多种流形上的闭式解求解。解决方案的关键在于一个核心观察:每个黎曼度量均可自然地将酉不变的欧氏范数提升为各切空间上的内在范数,由此构建的内在范数约束LMO具有对称性保持特性。基于此,作者提出内在Muon(iMuon)框架,在固定秩、SPD、Stiefel和Grassmann流形上对任意酉不变范数(包括谱范数、Frobenius范数和核范数)均能实现闭式更新,并提供确定性和随机情形下的收敛性保证,其速率常数仅依赖于流形维度,且在固定秩情形下仅取决于秩而非因子条件数,从而消除了先前方法所需的运行时缩放调整。

链接: https://arxiv.org/abs/2605.09238
作者: Yibang Li,Bihari Lal Pandey,Ravi Sah,Andi Han,Cyrus Mostajeran,Pratik Jawanpuria,Bamdev Mishra
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Code: this https URL

点击查看摘要

Abstract:Muon and related norm-constrained matrix optimizers have become central to large-scale learning problems. They are formulated as a linear maximization oracle (LMO) over an ambient matrix-norm ball in unconstrained Euclidean space. However, these do not generalize cleanly to manifold-valued parameters such as low-rank factorizations, orthogonality constraints, or symmetric positive definite (SPD) matrices. Naively restricting the Muon LMO to the tangent space (i) breaks quotient symmetries and (ii) couples the tangent-space constraint with an ambient norm bound, thereby obstructing closed-form solutions on various manifolds of interest. We resolve both issues with a single observation: every Riemannian metric canonically lifts a unitarily invariant Euclidean norm to an intrinsic norm on each tangent space, and the resulting intrinsic norm constrained LMO is symmetry preserving. Building on this, we introduce intrinsic Muon (iMuon), a unified framework that yields closed-form updates on the fixed-rank, SPD, Stiefel, and Grassmann manifolds for any unitarily invariant norm, including the spectral, Frobenius, and nuclear norms. We establish convergence guarantees for both deterministic and stochastic iMuon with rate constants that depend only on the manifold dimension. Notably, on the fixed-rank manifold this constant depends only on the rank, making the rate independent of factor conditioning and removing the runtime factor-rescaling required by prior work. Experiments on LoRA finetuning of LLMs, image classification, and subspace learning illustrate the efficacy of the proposed approach.
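For context, a sketch of the Euclidean spectral-norm LMO that Muon-style optimizers use and that iMuon lifts to manifolds: for a gradient with SVD G = U S V^T, the norm-constrained steepest-descent step is along -U V^T. The manifold-specific intrinsic-norm updates in the paper additionally involve tangent-space projections and retractions not shown here.

```python
import torch

def spectral_lmo_update(weight, grad, lr=0.02):
    """Euclidean spectral-norm LMO step: W <- W - lr * U V^T, where G = U S V^T."""
    U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
    return weight - lr * (U @ Vh)

W = torch.randn(64, 32)
G = torch.randn(64, 32)          # stand-in for the gradient of the loss w.r.t. W
W_new = spectral_lmo_update(W, G)
print(W_new.shape)
# Other unitarily invariant norms change only the direction: the Frobenius-norm LMO
# uses G / ||G||_F, and the nuclear-norm LMO uses the top singular pair.
```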

[AI-223] On Variance Reduction in Learning Mean Flows

【速读】:该论文旨在解决无蒸馏(distillation-free)生成建模中MeanFlow训练的不稳定性问题,其核心表现为损失函数非递减及梯度方差无界。解决方案的关键在于理论揭示了条件速度场(conditional velocity field)在损失函数中被错误地赋予了两个不同统计角色:一方面作为无偏回归目标,另一方面作为雅可比-向量积中的蒙特卡洛控制变量(Monte Carlo control variate),而原始损失函数对后者分配了错误的权重系数。作者推导出最优权重系数的闭式表达,并证明当前多项并发工作中的修正方法本质上都是该最优解的不同实践形式。实验表明,通过调整该系数可在二维基准和潜在扩散Transformer(latent Diffusion Transformer, DiT)上恢复预期的偏差-方差权衡,且最优系数能带来高达54%的采样质量提升(FID指标改善),并实现DiT每个匹配步长检查点的单调FID下降趋势;值得注意的是,尽管最小化梯度方差的最优系数位于内部值,但最小化FID的最优选择仍倾向于直接使用条件速度场,揭示了梯度方差与生成质量之间的量化不一致性。

链接: https://arxiv.org/abs/2605.09235
作者: Juanwu Lu,Ziran Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 25 pages, 7 figures, 6 tables

点击查看摘要

Abstract:One-step generative modeling has emerged as a leading approach to amortize the inference cost of diffusion and flow-matching models. Among distillation-free methods, MeanFlow training is notoriously unstable, with non-decreasing loss and unbounded gradient variance. In this work, we establish a theory that attributes this pathology to a misuse of the conditional velocity field: it plays two distinct statistical roles in the loss, both as an unbiased regression target and as a Monte Carlo control variate inside a Jacobi-vector product, with the original loss assigning the wrong coefficient to the latter. We derive the optimal coefficient in closed form, and show that a family of fixes in concurrent works corresponds to different practical realizations of the same optimum. A controlled sweep of this coefficient on two-dimensional benchmarks and on a latent Diffusion Transformer recovers the predicted bias-variance ordering. The optimal coefficient yields up to a 54% improvement in sample quality on two-dimensional benchmarks and a monotone FID trend at every matched-step DiT checkpoint. Crucially, the same DiT measurement also reveals a quantitative FID-MSE landscape mismatch: although gradient variance is minimized at an interior coefficient value, the coefficient that minimizes FID prefers the direct use of conditional velocity.

[AI-224] ProactBench: Beyond What The User Asked For

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)评估体系中对“对话主动性”(conversational proactivity)这一能力的缺失问题。现有基准主要衡量模型对显式请求的响应能力,却忽略了模型识别用户隐含需求并主动采取行动的能力。为此,作者提出ProactBench,其核心创新在于将对话主动性解构为三种时序关联的类型:Emergent(基于单一显性线索的推断)、Critical(跨多个线索的综合推理)与Recovery(任务完成后基于未来价值的引导性响应)。解决方案的关键在于构建一个包含Planner、User Agent和Assistant Model三类代理的实验框架,通过信息不对称设计有效规避风格混淆、评分泄露、外部上下文污染及信息过载等问题,从而实现对模型对话主动性更真实、可靠的量化评估。

链接: https://arxiv.org/abs/2605.09228
作者: Sepehr Harfi,Ahmad Salimi,Dongming Shen,Alex Smola
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Most LLM benchmarks score how well a model responds to explicit requests. They leave unmeasured a different conversational ability: noticing and acting on needs the user has implied but not said. We call this conversational proactivity. ProactBench decomposes it into three phase-tied types: Emergent, inference from a single disclosed anchor; Critical, synthesis across multiple anchors; and Recovery, grounded forward-looking value after task completion. We operationalise the benchmark with three agents: a Planner, a User Agent, and an Assistant Model. Their information asymmetries defend against style-confounded scoring, rubric leakage, external-context contamination, and information dumps. The released corpus contains 198 curated dialogues with 624 trigger points across 24 communication styles drawn from a psychometric inventory and audited by an independent LLM judge. Across 16 frontier and open-weight models, Recovery is both difficult and weakly predicted by six standard benchmarks, making it a useful new evaluation signal.

[AI-225] The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

【速读】:该论文旨在解决生成式 AI(Generative AI)在安全防护中面临的“越狱攻击”(jailbreak attacks)缺乏大规模、可复现的系统性研究基础设施的问题,具体包括攻击生成、分类与评估的标准化难题。其关键解决方案在于:首先构建了一个包含11.4万条对抗性提示的大规模数据集,通过多模型投票机制实现细粒度的网络安全攻击类别标注;其次提出无需模板或梯度搜索的指令微调方法,训练出能够根据有害种子自动生成流畅越狱提示的类别感知型语言模型,显著提升生成效率与隐蔽性;最后设计了无需训练的连续型评估指标OPTIMUS,通过联合建模语义相似度与危害概率,揭示传统二值成功率指标忽略的“隐蔽最优区间”,从而实现对越狱攻击效果的精细化量化分析与可控红队测试。

链接: https://arxiv.org/abs/2605.09225
作者: Ismail Hossain,Tanzim Ahad,Md Jahangir Alam,Sai Puppala,Syed Bahauddin Alam,Sajedul Talukder
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This paper is under review at one of the top security venues

点击查看摘要

Abstract:Jailbreak attacks – adversarial prompts that bypass LLM alignment through purely linguistic manipulation – pose a growing operational security threat, yet the field lacks large-scale, reproducible infrastructure for generating, categorizing, and evaluating them systematically. This paper addresses that gap with three contributions. (1) Large-scale compositional jailbreak dataset. We construct 114,000 adversarial prompts by applying 912 composing strategies to 125 harmful seed prompts from JailBreakV-28K. Every prompt is assigned to one of 14 cybersecurity attack categories (e.g., malware, phishing, privilege escalation) via a six-model majority-vote pipeline, and each strategy is ranked by effectiveness per category, enabling principled strategy selection grounded in concrete adversarial objectives. (2) Automated jailbreak generation. We instruction-fine-tune category-aware LLMs on Moderate and Optimal subsets, producing models that synthesize fluent jailbreak prompts from a harmful seed at inference time – no templates, no gradient search. Our generators achieve perplexity 24-39 versus 40-140 for AutoDAN and AmpleGCG, with safety-filter evasion rates of 0.29-0.51 Mal (LlamaPromptGuard-2-86M), enabling controllable, scalable red-teaming under realistic adversarial conditions. (3) OPTIMUS: a training-free jailbreak evaluator. OPTIMUS is a continuous metric J(S,H) that jointly captures semantic similarity between the harmful seed and the jailbreak (S) and harmfulness probability (H) via calibrated penalty functions. Unlike binary attack success rate (ASR), OPTIMUS requires no task-specific training, generalizes across evolving strategies, and exposes a stealth-optimal regime (S*=0.57, H*=0.43) that ASR misses. Experiments across 114,000 prompts confirm that OPTIMUS separates Weak, Moderate, and Optimal jailbreaks with category-level evidence binary evaluation cannot supply.

[AI-226] Detect Localize and Explain: Interactive Hierarchical Log Anomaly Analytics with LLM Augmentation

【速读】:该论文旨在解决现代系统中日志(log)的非结构化特性导致执行行为理解困难,从而阻碍异常诊断的问题。其核心解决方案是提出一种分层日志抽象(hierarchical log abstraction),将扁平的日志序列转化为跨实体(entity)、动作(action)和状态(status)层级的语义一致单元,并在此基础上构建分层编排框架(hierarchical orchestration framework),实现对日志的模块化检测与优化。该框架支持精确的异常检测、定位与解释,并通过选择性调用大语言模型(LLM)推理提升可解释性,最终通过Krone-viz交互式可视化系统使分析过程可解释且可操作,显著提升了软件工程师和系统运维人员对日志异常的理解效率。

链接: https://arxiv.org/abs/2605.09222
作者: Lei Ma,Suhani Chaudhary,Ethan Shanbaum,Athanasios Tassiadamis,Peter M. VanNostrand,Dennis M. Hofmann,Haowen Xu,Elke Rundensteiner
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Logs are ubiquitous in modern systems. Unfortunately, their unstructured nature in flat sequences limits understanding of execution behaviors, hindering effective anomaly diagnosis. To address this, Krone introduces a novel hierarchical log abstraction that transforms flat log sequences into semantically coherent units across entity, action, and status levels. Building on this abstraction, Krone introduces a hierarchical orchestration framework that decomposes flat log sequences into hierarchical execution units and performs modular detection over them. It executes and optimizes the modular detection tasks across levels, enabling precise anomaly detection, localization, and explanation with selective invocation of LLM-based reasoning. In this work, we present Krone-viz, an interactive visualization system based on Krone, which makes hierarchical log analysis interpretable and actionable for software engineers and system operators. Demonstrated on the widely used HDFS benchmark dataset, Krone-viz supports: 1) examining hierarchical decompositions of flat log sequences, 2) inspecting detection results and abnormal segments identified by Krone with LLM-generated explanations, and 3) reusing, reviewing, and revising knowledge generated by LLMs with human-in-the-loop guardrails. The code of Krone-viz is available at this https URL, and we deploy a live demo at this https URL.

[AI-227] The Pokémon Theorem and other Fairness Impossibility Results

【速读】:该论文旨在解决机器学习中公平性约束的内在冲突问题,特别是不同公平标准之间在不等基础率(unequal base rates)下难以同时满足的不可能性现象。其核心贡献在于揭示了多个公平准则(如群体条件无偏性、类条件分离等)可统一建模为再生核希尔伯特空间(RKHS)中对条件均值嵌入(conditional mean embeddings)的线性约束,而基础率不等导致全期望定律(law of total expectation)过度约束这些线性条件,从而引发公平性悖论。解决方案的关键在于利用RKHS几何结构和谱正则性分析,证明了:1)Kleinberg-Mullainathan-Raghavan二分法仅需群体条件无偏性即可成立;2)任意有限个线性公平性条件若被某一对群体满足,则会留下由最大均值差异(MMD)刻画的残差违反项,且其衰减速率为Kolmogorov m-宽度;3)公平特征学习存在不可能性——在不等基础率下,奇偶性(parity)与类条件分离要求必然导致类别坍缩;4)通过近似松弛获得信号与误差边界,实现现实估计器与公平目标之间的权衡。

链接: https://arxiv.org/abs/2605.09221
作者: Daniel Matsui Smola,Alex Smola
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Fairness impossibility results often look like distinct scalar incompatibility statements. We show that several share one RKHS geometry: fairness criteria are linear constraints on conditional mean embeddings, and unequal base rates make the law of total expectation overdetermine those constraints. This view yields four results. The Kleinberg–Mullainathan–Raghavan dichotomy needs only group-conditional unbiasedness, not full calibration. The Pokémon theorem shows that a distinct group pair satisfying any finite collection of linear mean-fairness criteria leaves a residual violation witnessed by the MMD, decaying at the Kolmogorov m-width rate under spectral regularity. The same tools prove an impossibility for fair feature learning: parity and class-conditional separation in representation space force class collapse under unequal base rates. The approximate relaxations yield signal and error frontiers, allowing a trade-off between real-world estimators and fairness goals. Experiments on standard fairness benchmarks are consistent with our bounds.
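The MMD witness mentioned above is easy to compute; a small numpy sketch under assumed kernel and bandwidth choices:

```python
import numpy as np

def rbf_mmd2(x, y, bandwidth=1.0):
    """(Biased) squared MMD estimate between 1-D samples x and y with an RBF kernel."""
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
# Model scores for two groups with unequal base rates (their means differ).
scores_g0 = rng.normal(0.45, 0.1, size=500)
scores_g1 = rng.normal(0.60, 0.1, size=500)
print("MMD^2 witness of the residual violation:", rbf_mmd2(scores_g0, scores_g1))
```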

[AI-228] Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability

【速读】:该论文旨在解决前向KL(forward-KL)正则化在离线强化学习中样本复杂度界较差的问题,即现有分析对前向KL正则化目标无法获得优于 \tilde{O}(\epsilon^{-2}) 的快速率,而反向KL(reverse-KL)正则化已实现 \epsilon^{-1} 类型的快速率。为解决此问题,作者提出了一种基于凸分析的新颖框架,其关键在于通过新颖地引入悲观原则(pessimism principle)来统一表格(tabular)和泛函近似(general function approximation)两种设置下的理论分析,并完全规避了以往依赖均值定理(mean value theorem)的证明路径。该方法首次在单策略集中性(single-policy concentrability)假设下建立了 \tilde{O}(\epsilon^{-1}) 的上界,且通过最优下界证明了其统计速率的紧致性,揭示了前向KL正则化在低正则化强度下可恢复无正则化时的慢速率特性,与反向KL情形一致。

链接: https://arxiv.org/abs/2605.09214
作者: Qingyue Zhao,Kaixuan Ji,Heyang Zhao,Quanquan Gu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Statistics Theory (math.ST); Machine Learning (stat.ML)
备注: 31 pages, comments are welcome

点击查看摘要

Abstract:Kullback-Leibler (KL) regularization is ubiquitous in reinforcement learning algorithms in the form of reverse or forward KL. Recent studies have demonstrated \epsilon^{-1}-type fast rates for decision making under reverse KL regularization, in contrast to the standard \epsilon^{-2}-type sample complexity. However, for forward-KL-regularized objectives, existing statistical analyses are either not applicable or result in \tilde{O}(\epsilon^{-2}) slow rates. We take the first step towards addressing this problem via a streamlined analysis of forward-KL-regularized offline CBs. We give the first \tilde{O}(\epsilon^{-1}) upper bounds in tabular and general function approximation settings, both under notions of single-policy concentrability. In particular, our convex-analytical pipeline unifies these settings by exploiting the pessimism principle in a novel way and completely bypasses the proof routines in previous works based on the mean value theorem, which might be of independent interest. Moreover, we provide rate-optimal lower bounds, manifesting the tightness of our upper bounds in terms of statistical rates. Our lower bounds also demonstrate that the forward-KL-regularized sample complexity recovers the unregularized slow rate in the low-regularization regime, similarly to the reverse-KL regularization.

[AI-229] The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在生成回答时存在时间漂移(temporal drift)问题,即模型可能基于过时的知识给出看似自信但错误的答案,而现有方法无法有效检测此类问题。其核心发现是:时间漂移在模型的残差流(residual stream)中表现为一个与正确性(correctness)和不确定性(uncertainty)信号几何正交的方向,这意味着任何依赖于正确性或不确定性信号的检测方法本质上对漂移是盲视的。解决方案的关键在于直接利用漂移标签训练线性探测器(linear probe),该方法在六个指令微调模型上实现了AUROC 0.83–0.95,显著优于基于token熵、语义熵、CCS和SAPLMA等传统指标的方法(AUROC 0.49–0.57),从而首次实证揭示了漂移在模型内部表征空间中的结构特性,并提供了一种可解释且高效的检测机制。

链接: https://arxiv.org/abs/2605.09195
作者: Rania Elbadry,Ahmed Heakl,Fan Zhang,Dani Bouch,Yuxia Wang,Preslav Nakov,Zhuohan Xie
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models confidently produce outdated answers, and no existing method can detect them. We show this is not an engineering failure but a structural one: temporal drift, whether a stored fact has changed since training, is encoded as a direction in the residual stream geometrically orthogonal to both correctness and uncertainty. Any method operating on correctness or uncertainty signals is therefore blind to drift by construction. We verify this across six instruction-tuned models. A linear probe trained directly on drift labels achieves AUROC 0.83 – 0.95 ; methods based on token entropy, semantic entropy, CCS, and SAPLMA all remain near chance ( 0.49 – 0.57 ). Five tests confirm the geometric orthogonality: weight cosines ( |\cos| \leq 0.14 ), score correlations ( |r| \leq 0.20 ), bidirectional null-space projection ( |\Delta| \leq 0.008 ), iterative null-space projection with k=10 , and difference-of-means dissociation. Mechanistically, the MLP retrieval circuit produces identical dynamics for stale recall and confabulation ( r > 0.81 , six models), explaining why output confidence cannot separate them. A cross-cutoff experiment holds inputs constant and varies only the model: the probe fires on the model whose training predates the fact’s transition and stays silent otherwise ( P(A > B) = 0.975 – 0.998 , twelve model pairs), confirming it reads model-internal knowledge state rather than input properties. Our code and datasets will be publicly released.
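A minimal sketch of the probing setup on synthetic data (the shapes, drift strength, and logistic-regression probe are assumptions): activations carry a drift direction, a linear probe is trained on drift labels, and AUROC is reported.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 256                               # prompts x residual-stream dimension
drift_direction = rng.normal(size=d)
drift_direction /= np.linalg.norm(drift_direction)

labels = rng.integers(0, 2, size=n)            # 1 = fact changed since the training cutoff
# Synthetic activations: drifted facts are shifted along the drift direction.
acts = rng.normal(size=(n, d)) + 0.8 * labels[:, None] * drift_direction

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, test_size=0.3, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("drift-probe AUROC:", roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))
```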

[AI-230] Evidence Over Plans: Online Trajectory Verification for Skill Distillation

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 中技能(skill)质量难以评估的问题,尤其是在缺乏环境验证的情况下,现有技能生成方法依赖偏好日志而非直接环境交互,导致性能提升微弱甚至退化。其解决方案的关键在于提出后验蒸馏指数(Posterior Distillation Index, PDI),这是一个基于轨迹层面的指标,用于量化技能是否充分扎根于任务-环境交互证据;同时设计了 SPARK(Structured Pipelines for Autonomous Runnable Tasks and Skill Generation)框架,通过保留任务执行证据实现全轨迹级分析,并将 PDI 作为在线诊断与干预信号,确保技能从环境中后验提炼而非先验规划生成。实验表明,SPARK 生成的技能在 86 个可运行任务中显著优于无技能基线和人工编写的技能,且推理成本降低高达 1000 倍。

链接: https://arxiv.org/abs/2605.09192
作者: Yang Zhou,Zihan Dong,Zhenting Wang,Can Jin,Shiyu Zhao,Bangwei Guo,Difei Gu,Linjun Zhang,Mu Zhou,Dimitris N. Metaxas
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agent skills can remarkably improve task success rates by using human-written procedural documents, but their quality is difficult to assess without environment-grounded verification. Existing skill generation methods heavily rely on preference logs rather than direct environment interaction, often yielding negligible or even degraded gains. We identify that it is a fundamental timing bottleneck: robust skills should be posterior-based, distilled from empirical environment interaction rather than prior plans. In this study, we introduce the Posterior Distillation Index (PDI), a trajectory-level metric that quantifies how well a distilled skill is grounded in the task-environment evidence. To operationalize PDI, we present SPARK (Structured Pipelines for Autonomous Runnable tasKs and sKill generation) for preserving task execution evidence towards full trajectory-level analysis. SPARK generates environment-verified trajectories used to compute PDI, and it applies PDI as an online diagnostic and intervention signal to ensure posterior skill formation. Across 86 runnable tasks, SPARK-generated skills consistently surpass no-skill baselines and outperform human-written skills on student models (inference cost up to 1,000x cheaper than teacher models). These findings show that PDI-guided distillation produces efficient and transferable skills grounded in the task-environment interaction. We release our code at this https URL .

[AI-231] DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在提升大语言模型推理能力时存在的训练成本高、样本效率低的问题,尤其是现有基于难度感知的数据选择方法在政策漂移下难度估计失准、最终性能提升有限且推理效率未显著改善的局限性。其解决方案的关键在于提出一个统一框架 Dare,通过自归一化重要性采样(self-normalized importance sampling)实现难度估计与策略的协同进化,利用对称 Beta 分布采样维持难度覆盖多样性,并结合自适应计算分配机制对不同难度层级实施差异化训练策略,从而在训练效率、最终效果和推理效率上均取得显著提升,尤其体现在对简单任务生成更简洁响应、对复杂任务提升正确性的平衡优化。

链接: https://arxiv.org/abs/2605.09188
作者: Yang Zhou,Can Jin,Zihan Dong,Zhepeng Wang,Yanting Yang,Shiyu Zhao,Lei Li,Runxue Bao,Yaochen Xie,Dimitris N. Metaxas
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning improves the reasoning ability of large language models but remains costly and sample-inefficient, as many rollouts provide weak learning signals. Difficulty-aware data selection methods attempt to address this by prioritizing moderately difficult prompts, yet our analysis reveals three limitations: difficulty estimates become inaccurate under policy drift, data selection alone yields limited final-performance gains, and inference efficiency remains largely unchanged. These findings suggest that efficient and effective RL requires more than filtering by difficulty: the policy should learn to solve hard tasks while producing concise responses for easy ones. To this end, we propose Dare, a unified framework that co-evolves difficulty estimation with the policy via self-normalized importance sampling, maintains diverse difficulty coverage through a symmetric Beta sampling distribution, and applies tailored training strategies across difficulty tiers with adaptive compute allocation. Extensive experiments across multiple models and domains demonstrate that Dare consistently outperforms existing methods in training efficiency, final effectiveness, and inference efficiency, producing more concise responses on easy tasks while improving correctness on hard ones. Code is available at this https URL.

[AI-232] Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)训练中优化算法在统计有效性与计算及内存效率之间的平衡问题。当前虽然Adam优化器仍是主流,但其在极端规模下的性能瓶颈促使研究者重新审视优化器设计的各个组件,包括自适应动量估计、解耦权重衰减、内存占用、曲率近似、基于符号的更新、大批次稳定性、低秩梯度结构以及矩阵正交化更新等。解决方案的关键在于从系统和优化双重视角出发,对现有优化器进行系统性分类与综述,涵盖经典一阶优化器、自适应优化器、内存高效变体、二阶及曲率感知方法、基于符号和发现的优化器、低秩与投影方法,以及如Muon等矩阵基优化器,并强调通过严谨的基准测试方法(如超参数公平性、尺度依赖性、时钟效率、token效率、内存开销和下游任务评估)实现多维度综合比较,推动LLM优化器研究从单一算法加速向规模化、全面评估的新阶段演进。

链接: https://arxiv.org/abs/2605.09176
作者: Aditya Ranganath
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: No figures, 65 pages

点击查看摘要

Abstract:Training large language models requires optimization algorithms that are not only statistically effective, but also computationally and memory efficient at extreme scale. Although Adam remains the dominant optimizer for large-scale language-model pretraining and fine-tuning, recent work has revisited nearly every component of the optimization stack: adaptive moment estimation, decoupled weight decay, memory footprint, curvature approximation, sign-based updates, large-batch stability, low-rank gradient structure, and matrix-wise orthogonalized updates. This survey reviews optimizer design for large language models through a systems-and-optimization lens. We organize the literature into classical first-order optimizers, adaptive optimizers, memory-efficient variants, second-order and curvature-aware methods, sign-based and discovered optimizers, low-rank and projection-based methods, and matrix-based optimizers such as Muon. We also discuss benchmarking methodology, including hyperparameter fairness, scale dependence, wall-clock efficiency, token efficiency, memory overhead, and downstream evaluation. We argue that optimizer research for LLMs is entering a new phase: moving from single-algorithm speedup claims toward rigorous, scale-aware comparisons that jointly evaluate convergence, stability, memory, and implementation complexity.

[AI-233] WavesFM: Hierarchical Representation Learning for Longitudinal Wearable Sensor Waveforms

【速读】:该论文旨在解决从高采样频率、多模态依赖和超长序列(如数周记录)的可穿戴生理信号中推断健康相关表型的挑战,尤其在标注数据稀缺的情况下。现有自监督学习(SSL)方法要么仅关注短片段的形态特征而忽略纵向动态,要么依赖人工设计的粗粒度特征(如心率、步数),从而丢失原始波形中的细微预测性信息。解决方案的关键在于提出WavesFM,一种两阶段自监督框架:第一阶段通过预训练段级编码器从短波形片段中提取局部嵌入;第二阶段利用时序编码器建模这些嵌入在多日时间尺度上的演化,从而同时捕捉局部信号语义与复杂的昼夜节律及跨日变化。此分层策略有效缓解了高分辨率长序列数据的计算复杂性,显著提升了模型对多种健康任务的泛化能力。

链接: https://arxiv.org/abs/2605.09173
作者: Peng Cao,Zhijian Yang,Tennison Liu,Jonathan Wang,Jiang Wu,Magdalena Proszewska,Arvind Pillai,Mingwu Gao,Amir Farjadian,Lawrence Cai,Emily Blanchard,Daniel McDuff,Pramod Rudrapatna,Matthew Thompson,Anupam Pathak,Mark Malhotra,Shwetak Patel,Dina Katabi,Paolo Di Achille,Ming-Zher Poh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Wearable sensors enable the continuous acquisition of high-resolution physiological waveforms, such as photoplethysmography and accelerometry, under free-living conditions. However, inferring health-related phenotypes from these signals presents significant challenges due to high sampling frequencies, multimodal dependencies, and extreme sequence lengths (e.g., weeks of recordings), compounded by a scarcity of ground-truth labels. To address these challenges, existing self-supervised learning (SSL) methodologies typically follow two paradigms: (1) learning rich morphological representations from short waveform segments while collapsing longitudinal dynamics through simple aggregation, or (2) modeling behavioral patterns from coarse, hand-crafted features (e.g. heart rate, step counts) spanning longer horizons but foregoing subtle, predictive signatures in raw waveforms. To bridge this gap, we propose WavesFM, a foundation model utilizing a two-stage SSL framework for longitudinal physiological data. Specifically, we decompose the learning problem into two stages: first, a segment-level encoder is pretrained to extract local embeddings from short waveforms; subsequently, a temporal encoder is trained to model the sequence of these embeddings across a multi-day horizon. This hierarchical approach overcomes the computational complexity of high-resolution, long-sequence data, allowing the overall model to capture both local signal semantics and the complex circadian and inter-day variations governing physiological dynamics. Pretrained on over 6.8M hours (N=324k individuals) of recordings for the first stage and 5.3M hours (N=10k) for the second stage, WavesFM demonstrates superior performance across 58 diverse tasks spanning demographics, lifestyle, health conditions, and medications.

[AI-234] Prediction Bottlenecks Don't Discover Causal Structure (But Here's What They Actually Do)

【速读】:该论文旨在解决生成式 AI(Generative AI)模型在仅通过下一步预测训练后,是否能够自动恢复格兰杰因果结构(Granger-causal structure)这一问题。其核心假设是:状态空间模型(state-space model)的简单读出机制 $ S = |W_{\text{out}} W_{\text{in}}| $ 可以实现对因果结构的有效推断,并可能受益于干预数据(interventional data)。解决方案的关键在于提出一个可复用的 falsification benchmark(伪证基准),包含标准化合成生成器(VAR/Lorenz/CauseMe-style)、三种干预语义(do(X=c)、软噪声、随机强制)、真实数据集的边溯源卡片(edge-provenance cards)以及规模匹配的对照组(control arms),并通过五个阶段系统性地检验该假设。结果表明,原始方法层面的主张不成立,但该基准本身成为持久的研究工具,且每一阶段均构成其控制臂(control arm),从而推动了因果推断方法的严谨评估。

链接: https://arxiv.org/abs/2605.09169
作者: Ankit Hemant Lade,Sai Krishna Jasti,Indar Kumar,Aman Chadha
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, 3 tables. Code: this https URL

点击查看摘要

Abstract:A Mamba state-space model trained only for next-step prediction appears to recover Granger-causal structure through a simple readout S = |W_out W_in| , with early experiments suggesting the phenomenon generalized across architectures and benefited from interventional data at p < 10^-5 . We package the protocol used to test that claim – standardized synthetic generators (VAR/Lorenz/CauseMe-style), three intervention semantics ( do(X=c) , soft-noise, random-forcing), edge-provenance cards on three real datasets, and size-matched control arms – as a reusable falsification benchmark, and walk the claim through it in five stages. The method-level claim does not survive: (i) a plain linear bottleneck does as well or better; (ii) tuned Lasso beats the bottleneck on synthetic CauseMe-style benchmarks, and on Lorenz-96 (the only real benchmark with unambiguous ground truth) classical PCMCI and Granger lead a tight cluster in which the bottleneck trails; (iii) the headline intervention advantage is roughly 60% a sample-size confound, and the residual disappears under standard do(X=c) interventions, surviving only under a non-standard random-forcing scheme; (iv) even that residual reproduces, with a larger effect, in classical bivariate Granger – the effect is method-agnostic. What survives is a narrow characterization result; the benchmark is the lasting artifact, and each stage above is one of its control arms.
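The readout under test is simple enough to state directly; a sketch with assumed shapes (a 5-variable system and a 32-dimensional state):

```python
import numpy as np

rng = np.random.default_rng(0)
n_vars, hidden = 5, 32

# Stand-ins for the trained model's input and output projection matrices.
W_in = rng.normal(size=(hidden, n_vars))    # maps observed variables into the state space
W_out = rng.normal(size=(n_vars, hidden))   # maps the state back to next-step predictions

S = np.abs(W_out @ W_in)                    # score S[i, j]: evidence for an edge j -> i
np.fill_diagonal(S, 0.0)

adjacency = S > np.quantile(S, 0.8)         # keep the top 20% of edges (threshold is arbitrary)
print(adjacency.astype(int))
```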

[AI-235] CIVeX: Causal Intervention Verification for Language Agents

【速读】:该论文旨在解决工具使用型语言智能体在执行状态变更动作时,因混淆(confounding)导致看似最优的动作实际降低效用的问题。现有机制如模式验证器、策略过滤器和自验证虽能确保动作形式有效,但无法保证其具有可识别的因果效应。解决方案的关键在于提出CIVeX——一种因果干预验证器,它将提议动作映射为结构因果图上的因果查询,检查可识别性,并输出四种可审计结论:执行(EXECUTE)、拒绝(REJECT)、实验(EXPERIMENT)或弃权(ABSTAIN)。其中执行需满足假设范围内的因果证书、识别论证、单侧置信下界(LCB)、溯源信息及风险限制,从而在中度与对抗性混淆场景下实现零虚假执行,且在真实生产日志数据上达到接近Oracle的正确执行率并显著降低假阳性。

链接: https://arxiv.org/abs/2605.09168
作者: Fabio Rovai
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 3 figures. Includes Causal-ToolBench, IHDP, ZOZO Open Bandit, and LaLonde NSW evaluations

点击查看摘要

Abstract:A valid tool call is not necessarily a valid intervention. Tool-using language agents are guarded by schema validators, policy filters, provenance checks, state predictors, and self-verification, yet such safeguards do not certify that a state-changing action has an identifiable causal effect. In confounded workflows, the action that looks optimal in observational logs can reduce utility when executed. We introduce CIVeX, a causal intervention verifier that maps proposed actions to structural causal queries over a committed action-state graph, checks identifiability, and returns one of four auditable verdicts: EXECUTE, REJECT, EXPERIMENT, or ABSTAIN. Execution requires an assumption-scoped causal certificate carrying graph commitments, an identification argument, a one-sided lower confidence bound (LCB), provenance, and risk limits. On Causal-ToolBench (1,890 instances, 7 seeds), CIVeX yields zero observed false executions across moderate and adversarial confounding. Under adversarial confounding it reaches 84.9% accuracy and 81.1% of oracle utility (+2.23 vs +2.76) and is the only non-oracle method whose constrained utility under a zero-false-execution constraint exceeds the AlwaysAbstain floor. On IHDP and ZOZO Open Bandit (real production logs with uniform-random ground truth), CIVeX matches Oracle correct-execution within 0.1pp and cuts per-execute false-execution by ≥50x over naive baselines. A chain-of-thought LLM verifier (Claude Opus, Sonnet) cuts false-execution by an order of magnitude over a terse baseline, yet under adversarial confounding Opus’s utility falls to 74% of CIVeX’s. Intervention identifiability, not action validity, is the missing primitive for reliable tool use.

[AI-236] FORTIS: Benchmarking Over-Privilege in Agent Skills

【速读】:该论文旨在解决当前大型语言模型代理(Large Language Model Agents)在执行任务时普遍存在的“权限越界”(over-privilege)问题,即模型倾向于选择比任务需求更高权限的技能或工具,从而导致行为失控。解决方案的关键在于提出一个名为FORTIS的基准测试框架,用于系统性评估代理在两个阶段的行为:一是从大规模重叠技能库中是否选择最小必要技能(minimally sufficient skill),二是是否仅执行该技能所允许范围内的动作而不扩展至更广泛的工具或行为。实验表明,即使是最先进的模型也频繁违反这两个约束,尤其在真实用户交互场景下(如任务描述不完整、便利性表述、接近技能边界等),这揭示了技能层本身并非行为控制机制,反而成为权限升级的主要来源。

链接: https://arxiv.org/abs/2605.09163
作者: Shawn Li,Chenxiao Yu,Han Wang,Wei Yang,Ryan Rossi,Franck Dernoncourt,Xiyang Hu,Philip Yu,Chaowei Xiao,Huan Zhang,Yue Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language model agents increasingly operate through an intermediate skill layer that mediates between user intent and concrete task execution. This layer is widely treated as an organizational abstraction, but we argue it is also a privilege boundary that current models routinely exceed. We present FORTIS, a benchmark that evaluates over-privilege in agent skills across two stages: whether a model selects the minimally sufficient skill from a large overlapping library, and whether it executes that skill without expanding into broader tools or actions than the skill permits. Across ten frontier models and three domains, we find that over-privileged behavior is the norm rather than the exception. Models consistently reach for higher-privilege skills and tools than the task requires, failing at both stages at rates that remain high even for the strongest available models. Failure is especially severe under the ordinary conditions of real user interaction: incomplete specification, convenience framing, and proximity to skill boundaries. None of these requires adversarial construction. The results indicate that the skill layer, far from containing agent behavior, is itself a primary source of privilege escalation in current systems.

[AI-237] Do LLMs Experience an Internal Polylogue? Investigating Reasoning through the Lens of Personas

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中行为动态变化难以监控与干预的问题。现有方法通常将“人格向量”(persona vectors)视为静态控制信号,无法捕捉推理时的时变特性。论文的关键创新在于将人格向量重新定义为动态信号——即通过追踪其在生成过程中的时序对齐关系(称为polylogue),实现对模型内部状态的实时监测与干预。这一方法不仅在MMLU-Pro任务上表现出与低维激活基线相当的预测准确性,还保持了可解释性,并指明了不同推理阶段应调节的具体潜在方向(latent directions),从而提出一种基于阶段感知的潜空间干预机制,为推理时控制(reasoning-time control)提供了新的可解释工具和实践路径。

链接: https://arxiv.org/abs/2605.09159
作者: Nils A. Herrmann,Leander Girrbach,Kirill Bykov,Zeynep Akata
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent work shows that large language models (LLMs) encode behavioural traits (“personas”) as linear directions in activation space, often called “persona vectors”. Prior work has used such directions as static handles for behavioural steering. Building on this, we treat them as dynamic signals instead: probes we can monitor and intervene on as reasoning unfolds. We use the term polylogue to denote the time series of alignments between persona vectors and hidden activations over the course of generation. Experiments across four open-weight models show that polylogue features predict correctness on MMLU-Pro competitively with low-dimensional activation baselines, while remaining interpretable through their associated persona directions. They also suggest concrete steering targets, namely which latent directions to modulate at different stages of a response. We instantiate this as a simple paragraph-conditioned intervention that improves accuracy on three of four models, pointing to stage-aware latent steering as a promising direction for reasoning-time control. Together, this positions the polylogue as an interpretable tool for reasoning-time monitoring and intervention.
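
文中的 polylogue 即“人格向量与各生成步隐藏激活之间对齐度的时间序列”。下面用 numpy 给出按 token 计算余弦对齐度的最小示意(隐藏状态与人格向量均为随机占位数据,实际应取自模型激活与预先提取的人格方向):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 32, 768                        # 生成步数与隐藏维度(占位)
hidden = rng.normal(size=(T, d))      # 某层残差流在每个生成步的激活(占位)
personas = {                          # 人格向量(占位)
    "sycophancy": rng.normal(size=d),
    "caution":    rng.normal(size=d),
}

def polylogue(hidden, personas):
    """返回 {人格名: 长度为 T 的余弦对齐度序列}。"""
    h = hidden / np.linalg.norm(hidden, axis=1, keepdims=True)
    return {name: h @ (v / np.linalg.norm(v)) for name, v in personas.items()}

series = polylogue(hidden, personas)
print({k: np.round(v[:3], 3) for k, v in series.items()})
```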

[AI-238] Revisiting Mixture Policies in Entropy-Regularized Actor-Critic

【速读】:该论文旨在解决连续动作强化学习中混合策略(Mixture Policies)在实际应用中未能发挥其理论优势的问题。尽管混合策略理论上比单模态策略更具灵活性,可提升解的质量和熵鲁棒性,但主流算法如软演员-评论家(Soft Actor-Critic, SAC)并未有效利用这一特性,主要受限于缺乏适用于混合分布的低方差重参数化技巧(reparameterization trick),而高斯策略则享有此优势。论文提出了一种边际化重参数化(Marginalized Reparameterization, MRP)估计器,证明其方差低于标准似然比(Likelihood-Ratio, LR)方法,并通过在Gym MuJoCo、DeepMind Control Suite和MetaWorld上的实验验证:MRP混合策略显著优于LR方法,在多数场景下达到甚至超越高斯策略的性能,从而将混合策略从理论概念转化为具有实用价值的工具。

链接: https://arxiv.org/abs/2605.09157
作者: Jiamin He,Samuel Neumann,Jincheng Mei,Adam White,Martha White
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mixture policies theoretically offer greater flexibility than unimodal policies in continuous action reinforcement learning, but the practical benefits of this complexity remain elusive. Mixture policies are notably absent from most state-of-the-art algorithms, raising a fundamental question: Is the added representational overhead useful? We show that increased flexibility can theoretically enhance solution quality and entropy robustness. Yet standard algorithms like SAC do not leverage these advantages. A core issue is the lack of a low-variance reparameterization trick for mixtures, a luxury Gaussian policies enjoy. We propose a marginalized reparameterization (MRP) estimator to address this, proving it offers lower variance than the standard likelihood-ratio (LR) approach. Our experiments across Gym MuJoCo, DeepMind Control Suite, and MetaWorld show that MRP mixture policies significantly outperform their LR ones, and reach parity (sometimes better) with Gaussian counterparts. In addition, we do find several cases where MRP mixture policies exhibit clear empirical advantages. In this paper, we provide a clearer understanding of the trade-offs involved, elevating MRP mixture policies from theoretical curiosity to a practical tool.
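
摘要指出混合策略缺乏类似高斯策略的低方差重参数化技巧,MRP 正是为此提出。论文的具体构造未在摘要中展开,下面给出一种常见的“对分量权重显式边际化、各分量分别重参数化”的 PyTorch 示意,仅用于说明思路(目标函数与参数均为占位,并非论文的 MRP 定义):

```python
import torch

def objective(a):                      # 玩具回报函数,可替换为 Q 值估计
    return -(a - 1.5) ** 2

K, n = 3, 64                           # 混合分量数与每分量采样数
logits  = torch.zeros(K, requires_grad=True)   # 混合权重参数
mu      = torch.randn(K, requires_grad=True)   # 各分量均值
log_std = torch.zeros(K, requires_grad=True)   # 各分量对数标准差

w   = torch.softmax(logits, dim=0)             # (K,)
eps = torch.randn(n, K)                        # 重参数化噪声
a   = mu + log_std.exp() * eps                 # (n, K):每个分量各自重参数化
per_component = objective(a).mean(dim=0)       # (K,):各分量的蒙特卡洛估计
loss = -(w * per_component).sum()              # 对分量解析地边际化,而非按分量采样
loss.backward()
print(mu.grad, logits.grad)
```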

[AI-239] Beyond Self-Play: Hierarchical Reasoning for Continuous Motion in Closed-Loop Traffic Simulation

【速读】:该论文旨在解决封闭环路交通仿真中代理(agent)的可扩展性与行为真实性难以兼顾的问题。现有基于自对弈强化学习(self-play reinforcement learning)的方法虽具备良好可扩展性,但其均衡策略无法捕捉真实人类驾驶员的社会感知行为。解决方案的关键在于提出一种分层架构:上层采用Stackelberg型多智能体强化学习(MARL)模块生成具有交互意识的意图指令,下层则通过连续轨迹实现模块将战略意图转化为物理一致且场景响应的控制序列;同时引入混合协同训练机制,结合MARL与辅助恢复监督以缓解闭环部署中的分布偏移问题,从而在保持交通效率的同时显著提升控制平滑性和安全性。

链接: https://arxiv.org/abs/2605.09153
作者: Weifan Zhang,Xiaofeng Zhao,Adel Bazzi,Mingrui Li,Yifan Wei,Dengfeng Sun
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Submitted to IEEE Robotics and Automation Letters (RA-L)

点击查看摘要

Abstract:Closed-loop traffic simulation requires agents that are both scalable and behaviorally realistic. Recent self-play reinforcement learning approaches demonstrate strong scalability, but their equilibrium strategies fail to capture the socially aware behaviors of real human drivers. We propose a hierarchical architecture that goes beyond self-play by combining high-level multi-agent interaction reasoning with low-level continuous trajectory realization. Specifically, a Stackelberg-style Multi-Agent Reinforcement Learning (MARL) module generates interaction-aware intention commands. These commands condition a low-level continuous motion module, translating the strategic intent into physically consistent, scene-responsive control sequences. To mitigate distribution shift in closed-loop deployment, we introduce a hybrid co-training scheme combining MARL with auxiliary recovery supervision. Experiments on a SUMO-based urban network demonstrate that the proposed framework achieves superior control smoothness and safety compared to self-play and passive imitation baselines, while maintaining competitive traffic efficiency.

[AI-240] BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models ICML2026

【速读】:该论文旨在解决程序修复(Program Repair, PR)中因执行反馈稀疏和序列级奖励粗粒度而导致的编辑效果难以定位的问题,即无法准确识别哪些具体代码修改真正修复了漏洞。其解决方案的关键在于提出一个三阶段框架BoostAPR:首先在执行验证的示范数据上进行监督微调并保留推理轨迹;其次训练两个奖励模型——一个用于整体修复质量评估的序列级评估器和一个基于执行结果分配局部奖励的行级信用分配器;最后利用近端策略优化(PPO)算法,通过行级奖励模型将总奖励重新分配至关键编辑区域,从而实现更精细的策略梯度更新。该方法通过引入行级信用分配机制,在代码变更的中间粒度上提升奖励信号的准确性,显著增强了模型在跨语言场景下的修复能力与泛化性能。

链接: https://arxiv.org/abs/2605.09134
作者: Yuanhao Li,Hongbo Wang,Xiaotang Shang,Xunzhu Tang,Yiming Cao,Xuhong Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 20 pages, 2 figures. Accepted at ICML 2026

点击查看摘要

Abstract:Reinforcement learning for program repair is hindered by sparse execution feedback and coarse sequence-level rewards that obscure which edits actually fix bugs. We present BoostAPR, a three-stage framework addressing these challenges: (1) supervised fine-tuning on execution-verified demonstrations with reasoning traces, (2) training dual reward models–a sequence-level assessor and a line-level credit allocator–from execution outcomes, and (3) PPO optimization where the line-level model redistributes rewards to critical edit regions. This line-level credit assignment operates at an intermediate granularity naturally suited to code changes. Trained on SWE-Gym and evaluated on four benchmarks, BoostAPR achieves 40.7% on SWE-bench Verified (+22.9pp over base model), 24.8% on Defects4J (Python-to-Java transfer), 84.5% on HumanEval-Java, and 95.0% on QuixBugs, achieving competitive results among open-source models with strong cross-language generalization.
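
摘要中行级奖励模型的作用是把序列级奖励重新分配到关键编辑区域。下面是一个与论文实现无关的极简示意:用各行的信用分数做 softmax 加权,把整体奖励拆到被修改的行上(分数来源与 PPO 接口均为笔者假设):

```python
import math

def redistribute_reward(total_reward, line_credits):
    """line_credits: {行号: 行级信用分数};返回 {行号: 分得的奖励},总和等于 total_reward。"""
    if not line_credits:
        return {}
    m = max(line_credits.values())
    exp = {k: math.exp(v - m) for k, v in line_credits.items()}
    z = sum(exp.values())
    return {k: total_reward * e / z for k, e in exp.items()}

# 示意:补丁修改了 3 行,序列级奖励为 +1.0(测试全部通过)
credits = {12: 2.1, 13: 0.3, 40: -0.5}
print(redistribute_reward(1.0, credits))
```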

[AI-241] Data-driven Circuit Discovery for Interpretability of Language Models

【速读】:该论文旨在解决现有电路发现方法(circuit discovery)在解释语言模型(Language Model, LM)行为时存在的两个关键问题:一是假设模型对每个任务仅使用单一计算电路(computational subgraph),二是假设人类定义的任务数据集能充分代表任务本质。实证研究表明,即使数据集仅发生语义不变的微小变化,也会导致发现的电路差异显著,且在混合多任务数据上,现有方法仍输出一个看似高保真但实际融合了多个机制的“伪统一”电路,说明其本质是依赖特定数据分布而非通用任务逻辑。解决方案的核心在于提出数据驱动的电路发现(Data-driven Circuit Discovery, DCD)框架,摒弃上述两个假设:DCD首先基于模型处理方式的相似性对输入样本进行聚类,再为每个簇独立发现专用电路,从而揭示模型内部不同机制的分组结构,使每个电路仅解释其所属子群体的行为,而非强行拟合整个任务,实验表明该方法可发现多个更忠实于局部机制的电路。

链接: https://arxiv.org/abs/2605.09129
作者: Daking Rai,Mor Geva,Ziyu Yao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 40 pages, 54 figures, 12 tables, Under review

点击查看摘要

Abstract:Circuit discovery aims to explain how language models (LMs) implement a specific task by localizing and interpreting a circuit, a computational subgraph responsible for the LM’s behavior. Existing circuit discovery methods are hypothesis-driven; they first informally define a task with a dataset, and then apply a circuit discovery algorithm over that dataset to obtain a single circuit. This imposes two strong assumptions: that the LM implements the task with a single circuit, and that the dataset adequately represents the task as humans understand it. We systematically test these assumptions across four previously studied tasks and find that even minor dataset variations that preserve task semantics can produce circuits with low edge overlap and cross-dataset faithfulness. More strikingly, when applied to a mixed dataset with two distinct tasks whose separately discovered circuits have near-zero cross-faithfulness, existing methods still return a single circuit with high faithfulness across both tasks. This indicates that current methods discover dataset-specific circuits, rather than general task circuits. We propose Data-driven Circuit Discovery (DCD), a new discovery framework that drops both assumptions: instead of returning a single circuit for a dataset, DCD first clusters examples in the dataset by how similarly the model processes them and discovers a separate circuit for each group. This allows distinct mechanisms to appear separately rather than merged into a single circuit; each circuit explains its group, not the full task. Experiments show that DCD discovers multiple circuits per dataset, each more faithful to its group than a single circuit discovered by existing methods. Broadly, DCD lets the data reveal mechanistic structure within LMs, rather than relying on human-defined task boundaries that may not align with how models organize their computation.

[AI-242] AI Native Asset Intelligence

【速读】:该论文旨在解决现代安全环境中因云资源、身份、配置及第三方安全工具产生的碎片化信号难以有效整合与优先级排序的问题,尤其是在企业级场景下,传统AI辅助系统因缺乏结构化的资产层面推理能力而表现出反应迟钝、结果不稳定等局限。其解决方案的关键在于提出“AI原生资产智能”(AI-native asset intelligence)框架,该框架通过建模层对资产、身份、关系、控制措施、攻击路径和影响半径模式进行统一表征,并借助评分层将分散的安全信号转化为标准化的资产重要性度量;评分系统进一步区分固有暴露程度(基于错误配置和攻击向量证据)与上下文重要性(基于异常检测、影响半径、业务关键性和数据敏感性),实现基于AI的语境化严重性调整与确定性聚合,从而支持稳定、可解释且主动的资产级安全态势推理。

链接: https://arxiv.org/abs/2605.09115
作者: Gal Engelberg,Leon Goldberg,Konstantin Koutsyi,Boris Plotnikov,Tiltan Gilat,Ben Benhemo
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 22 pages, 4 figures, 8 tables. Preprint

点击查看摘要

Abstract:Modern security environments generate fragmented signals across cloud resources, identities, configurations, and third-party security tools. Although AI-native security assistants improve access to this data, they remain largely reactive: users must ask the right questions and interpret disconnected findings. This does not scale in enterprise environments, where signal importance depends on exposure, exploitability, dependencies, and business context. Repeated AI queries may therefore produce unstable prioritization without a structured basis for comparing assets. This paper introduces AI-native asset intelligence, a framework that transforms heterogeneous security data into a structured intelligence layer for consistent, contextual, and proactive asset-level reasoning. The framework combines a modeling layer, representing assets, identities, relationships, controls, attack vectors, and blast-radius patterns, with a scoring layer that converts fragmented signals into a normalized measure of asset importance. The scoring system separates intrinsic exposure, based on misconfigurations and attack-vector evidence, from contextual importance, based on anomaly, blast radius, business criticality, and data criticality. AI contextualization refines severity and business/data classifications, while deterministic aggregation preserves consistency. We evaluate the scoring system on a production snapshot with 131,625 resources across 15 vendors and 178 asset types. Sensitivity analyses and ablations show that severity mappings control finding sensitivity, AI severity adjustment refines prioritization, attack-vector scoring responds to rare exploitability evidence, and contextual modulation selectively modifies exposed resources based on business or data importance. The results support AI-native asset intelligence as a foundation for stable prioritization and proactive security-posture reasoning.

[AI-243] Contextual Plackett-Luce: An Efficient Neural Model for Probabilistic Sequence Selection under Ambiguity

【速读】:该论文旨在解决结构化预测任务中因目标本质模糊性(multi-modal)与监督信号单一性之间的不匹配问题,即在输入存在多个有效输出时,训练数据仅提供一个采样实例,导致模型难以学习到完整的多模态分布。其解决方案的关键在于提出Contextual Plackett-Luce (CPL) 模型,该模型通过引入基于Ising风格的参数化方式(包含一阶和二阶交互项),将经典的Plackett-Luce选择模型扩展为上下文依赖的形式;CPL采用并行评分与轻量级自回归选择分离的机制:先以全并行方式计算上下文相关的选择概率参数,再通过逐次更新上下文logits进行序列化选择,从而在保持表达能力的同时实现高效计算,显著提升了在模糊监督下的结构一致性与鲁棒性。

链接: https://arxiv.org/abs/2605.09112
作者: Noam Mizrachi,Nadav Har-Tuv,Shai Shalev-Shwartz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 22 pages, 5 figures

点击查看摘要

Abstract:Selecting a coherent sequence or subset of elements is a fundamental problem in structured prediction, arising in tasks such as detection, trajectory forecasting, and representative subset selection. In many such settings, the target is inherently ambiguous: each input admits multiple valid outputs, while supervision provides only a single sampled instance. This induces a mismatch between the underlying multi-modal target distribution and the observed training signal. We propose Contextual Plackett-Luce (CPL), a structured probabilistic model for sequence selection that extends the classical Plackett-Luce model to a context-dependent setting following an Ising-style parameterization with unary and pairwise interaction terms. CPL can be viewed as a hybrid between fully autoregressive prediction and parallel sequence selection: autoregressive models effectively capture uncertainty but are computationally expensive on modern parallel hardware such as GPUs, while parallel methods are efficient but struggle to represent multi-modal dependencies. CPL combines the strengths of both by constructing the parameters of a probabilistic selection model in a fully parallel manner, followed by a lightweight autoregressive selection process in which each step applies incremental updates to contextual logits. This decoupling of parallel scoring and sequential selection enables efficient computation without sacrificing expressivity. We evaluate CPL on two structured selection tasks: multi-modal path prediction and representative subset selection. CPL achieves improved structural consistency and robustness under ambiguous supervision compared to strong parallel baselines.
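
CPL 的核心是“并行打分(一阶与二阶项)+ 轻量自回归选择”:先一次性得到上下文相关的 unary 与 pairwise 参数,再逐步选取元素并用 pairwise 项增量更新剩余元素的 logits。下面是一个 numpy 最小示意,打分部分用随机矩阵占位(并非论文的网络结构):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3                            # 候选元素数、需选择的序列长度
unary = rng.normal(size=n)             # 一阶项:每个元素的上下文相关得分(占位)
pairwise = rng.normal(size=(n, n))     # 二阶项:元素间交互(占位,已并行算好)

def cpl_select(unary, pairwise, k, rng):
    logits = unary.copy()
    remaining = list(range(len(unary)))
    chosen = []
    for _ in range(k):
        masked = logits[remaining]
        probs = np.exp(masked - masked.max())
        probs /= probs.sum()                       # Plackett-Luce:在剩余元素上做 softmax
        pick = remaining[rng.choice(len(remaining), p=probs)]
        chosen.append(pick)
        remaining.remove(pick)
        logits = logits + pairwise[pick]           # 对上下文 logits 做增量更新
    return chosen

print(cpl_select(unary, pairwise, k, rng))
```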

[AI-244] When (and How) to Trust the Expert: Diagnosing Query-Time Expert-Guided Reinforcement Learning

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)中利用专家控制器(expert controller)进行引导时存在的性能不稳定与失败模式问题。当前方法通常在不同基准环境和孤立实验中评估,缺乏对专家不完美性(如参数欠调、动作偏差、观测噪声)的系统性测试。论文的关键解决方案是在统一的Soft Actor-Critic (SAC)框架下,采用一致的超参数优化(HPO)和评估协议,构建了一个包含100/50种子的标准化对比实验,系统识别出三种此前单篇文献未发现的失败模式:(F1) 价值函数盲区导致基于argmax-plus-bootstrap的IBRL方法在接近无专家RL天花板的专家下表现劣于纯RL;(F2) 远离最优的专家引发残差饱和(residual saturation);(F3) 热启动(warm-start)缓冲区污染使训练时交接(training-time-handoff)方法在部署时专家欠调下崩溃。进一步提出一个可测试的决策规则,基于三个预训练可观测量(专家质量、任务终止特性、扰动类型)来选择最优方法,从而实现方法-任务结构匹配的自动化决策,这是论文的核心贡献之一。

链接: https://arxiv.org/abs/2605.09109
作者: Yann Berthelot,Philippe Preux,Riad Akrour
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many continuous-control problems ship with a competent but suboptimal controller (a tuned PID, a hand-designed gait). A growing family of methods uses such controllers as queryable experts during RL, but each method has been proposed in isolation, on a different benchmark, without imperfect-expert testing. We harmonize the comparison on a shared SAC backbone, common HPO and evaluation protocols, 100/50 seeds per (env, method), and a degradation sweep over expert undertuning, action bias, and observation noise. The comparison surfaces three failure modes single-paper evaluations miss: (F1) a critic blind spot under argmax-plus-bootstrap that drags IBRL below no-expert SAC on experts close to the no-expert-RL ceiling (RL-near-ceiling, distinct from the absolute physical ceiling); (F2) residual saturation on far-from-optimal experts; and (F3) warm-start buffer poisoning that collapses training-time-handoff methods under deployment-time expert undertuning. No single method dominates: each wins on one task-structure regime and fails predictably elsewhere; on RL-near-ceiling experts (FourTank, GlassFurnace) no query-time method clears the expert within our 1M-step budget, leaving open whether this is a fundamental wall or a budget effect. We convert the spread into a testable decision rule keyed on three pre-training observables (expert quality, task termination, perturbation type). The benchmark, taxonomy, and decision rule are the primary contribution; we additionally describe EDGE, a softmax-over-ensemble-LCB design point used to demonstrate that both axes the taxonomy points to (gate form, scoring rule) are individually exploitable.

[AI-245] Token Economics for LLM Agents: A Dual-View Study from Computing and Economics

【速读】:该论文旨在解决生成式 AI(Generative AI)代理系统中令牌(token)经济带来的多重瓶颈问题,包括计算资源消耗过快、多智能体协作效率低下以及安全风险加剧等。其核心挑战在于缺乏一个统一框架来衡量输出质量与经济成本之间的根本权衡。解决方案的关键在于提出首个全面的“令牌经济学”(Token Economics)综述,通过融合计算机科学与经济学理论,将令牌重新定义为生产要素、交换媒介和计价单位,并构建四维分类体系:微观层面(单智能体)利用新古典企业理论优化预算约束下的因子替代;中观层面(多智能体系统)借助交易成本与委托-代理理论降低协作摩擦;宏观层面(智能体生态系统)运用机制设计应对拥堵外部性与定价问题;并从经济学角度内生化对抗性威胁作为约束条件。这一框架为下一代可扩展智能体系统的理论基础提供了关键支撑。

链接: https://arxiv.org/abs/2605.09104
作者: Yuxi Chen,Junming Chen,Chenyu He,Yiwei Li,Yicheng Ji,Yifan Wu,Dingyu Yang,Lansong Diao,Lidan Shou,Hongliang Zhang,Huan Li,Gang Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As LLM agents evolve, tokens have emerged as the core economic primitives of Agentic AI. However, their exponential consumption introduces severe computational, collaborative, and security bottlenecks. Current surveys remain fragmented across system optimization, architecture design, and trust, lacking a unified framework to evaluate the fundamental trade-off between output quality and economic cost. To bridge this gap, this survey presents the first comprehensive survey of Token Economics. By unifying computer science and economics, we conceptualize tokens as production factors, exchange mediums, and units of account. We synthesize existing literature across a four-dimensional taxonomy: (1) Micro-level (Single Agent): Optimizing budget-constrained factor substitution via neoclassical firm theory. (2) Meso-level (Multi-Agent Systems): Minimizing collaboration friction using transaction cost and principal-agent theories. (3) Macro-level (Agent Ecosystems): Addressing congestion externalities and pricing via mechanism design. (4) Security: Internalizing adversarial threats as endogenous economic constraints. Finally, we outline frontier directions, including differentiable token budgets and dynamic markets, to lay the theoretical foundation for scalable next-generation agent systems.

[AI-246] Constant-Target Energy Matching: A Unified Framework for Continuous and Discrete Density Estimation

【速读】:该论文旨在解决多变量类型(连续、离散及混合变量)密度估计中因采用不同目标函数而导致难以共享统计结构的问题。传统方法在连续变量上依赖对数密度梯度(log-density gradients),而在离散变量上则使用无界的目标函数(如concrete score),导致在低概率区域不稳定。其解决方案的关键在于提出一种统一的能量模型框架——常目标能量匹配(Constant-Target Energy Matching, CTEM),通过将密度比回归替换为有界能量差变换,推导出仅需样本即可训练的损失函数,且目标值恒为1。该方法学习到的标量势函数可直接恢复对数概率密度(log p),无需分区函数估计或显式无界比值回归,在多种数据类型基准测试中显著优于现有基线并生成更高质量的样本。

链接: https://arxiv.org/abs/2605.09085
作者: Zhijun Zeng,Yixuan Jiang,Pipi Hu,Zuoqiang Shi
机构: 未知
类目: Artificial Intelligence (cs.AI); Probability (math.PR)
备注:

点击查看摘要

Abstract:Density estimation is a central primitive in probabilistic modeling, yet continuous, discrete, and mixed-variable domains are often treated by separate objectives, limiting the ability to exploit a common statistical structure across data types. Continuous score-based methods rely on log-density gradients, while discrete extensions typically use concrete score whose unbounded targets become unstable near low-probability states. We introduce Constant-Target Energy Matching (CTEM), a unified energy-based framework for density estimation on general state spaces. CTEM replaces ordinary density-ratio regression with a bounded energy-difference transform and derives from it a sample-only training objective with the constant target 1. The learned scalar potential recovers log p without partition-function estimation or explicit unbounded ratio regression. Across continuous, discrete, and mixed-variable benchmarks, CTEM substantially improves density estimation over competitive baselines and yields higher-quality samples under standard sampling procedures.

[AI-247] FactoryNet: A Large-Scale Dataset toward Industrial Time-Series Foundation Models ICML2026

【速读】:该论文旨在解决工业时序数据领域缺乏统一预训练语料库的问题,从而推动工业基础模型(Industrial Foundation Models)的发展。现有方法在跨设备、跨任务的迁移学习中表现受限,且异常检测性能受高维特征和特定设备依赖的制约。解决方案的关键在于提出首个面向工业时序数据的通用预训练语料库FactoryNet,其核心创新是引入一种新的结构化表示框架——Setpoint, Effort, Feedback, Context(S-E-F-C)架构,该架构将不同执行机构(embodiment)的时序行为映射到统一表征空间,实现了零样本跨设备迁移和参数高效异常检测。实验表明,在偏置感知指标下,模型具备公平的跨设备迁移能力,且24个符合S-E-F-C结构的信号即可达到与高维基线相当的异常检测性能。

链接: https://arxiv.org/abs/2605.09081
作者: Karim Othman,Jonas Petersen,Matei Ignuta-Ciuncanu,Riccardo Maggioni,Camilla Mazzoleni,Federico Martelli,Philipp Petersen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures, 5 tables. Submitted to ICML 2026 Workshop on AI for Physics (AI4Physics)

点击查看摘要

Abstract:We introduce the first universal pretraining corpus for industrial time-series data: FactoryNet. 51M datapoints across 23k end-to-end task executions (13.3k real, 9.8k synthetic) on six embodiments, unified by a shared schema that enables robust zero-shot cross-embodiment transfer and highly parameter-efficient anomaly detection. We introduce a novel schema: Setpoint, Effort, Feedback, Context (S-E-F-C) underlying the whole pipeline that maps any actuated system into a common representational frame. The corpus spans 27 annotated anomaly types alongside healthy baselines and counterfactual pairs across robotic manipulation and machining domains. Cross-embodiment transfer experiments yield positive results: under bias-aware metrics our model demonstrates fair cross-embodiment transfer capabilities on the evaluated source-target pair, while 24 schema-aligned signals achieves competitive anomaly detection performance compared to high-dimensional baselines. We release FactoryNet as a growing, multi-embodiment dataset to drive progress toward industrial foundation models.
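
S-E-F-C 架构把任意受控系统的时序信号对齐到“设定值-出力-反馈-上下文”四个槽位。下面用 dataclass 给出一个最小的数据结构示意(字段、维度与取值均为笔者假设,仅说明统一表示的形态):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SEFCRecord:
    setpoint: np.ndarray   # S:指令/目标轨迹,形状 (T, d_s)
    effort:   np.ndarray   # E:执行机构出力(电流、扭矩等),形状 (T, d_e)
    feedback: np.ndarray   # F:传感器反馈(位置、温度等),形状 (T, d_f)
    context:  dict         # C:静态上下文,如设备型号、任务标签、采样率

T = 100
record = SEFCRecord(
    setpoint=np.zeros((T, 6)),
    effort=np.zeros((T, 6)),
    feedback=np.zeros((T, 12)),
    context={"embodiment": "robot_arm_a", "task": "pick_place", "hz": 50},
)
print(record.setpoint.shape, record.context["embodiment"])
```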

[AI-248] CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在因果推理任务中表现不佳的问题,其核心挑战在于因果系统本身复杂且常以非可执行形式表达,同时因果查询的真值答案极度稀缺。解决方案的关键在于提出CauSim框架,通过将因果推理从稀缺标签问题转化为可扩展的监督学习问题:CauSim构建逐步复杂的因果模拟器——即由LLMs逐步构建的可执行结构因果模型(Structural Causal Models, SCMs),这些模型能够扩展到全局复杂系统并保持因果查询答案的可验证性;同时,CauSim跨表示形式运作,将非可执行因果知识形式化为代码以实现数据增强,并将可执行SCMs翻译为自然语言以提供此前难以获取的监督信号,从而显著提升模型在不同表示下的泛化能力与自监督学习效率。

链接: https://arxiv.org/abs/2605.09079
作者: Nicolás Astorga,Anita Kriz,Mihaela van der Schaar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite surpassing human performance across mathematics, coding, and other knowledge-intensive tasks, large language models (LLMs) continue to struggle with causal reasoning. A core obstacle is the target data itself: causal systems are complex and often expressed in non-executable forms, while ground-truth answers to causal queries are inherently scarce. We introduce CauSim, a framework that turns causal reasoning from a scarce-label problem into a scalable supervised one. CauSim constructs increasingly complex causal simulators: executable structural causal models (SCMs), incrementally built by LLMs, that scale to globally complex systems while maintaining verifiable answers to causal queries. CauSim operates across representations by formalizing non-executable causal knowledge into code, enabling data augmentation, and translating executable SCMs into natural language, enabling supervision in previously difficult-to-supervise representations. We structure our research into two parts: (1) how to construct increasingly complex causal simulators, and (2) a systematic study of what CauSim enables, demonstrating generalization across representations, consistent gains from curriculum scaling and data volume, LLM self-improvement through self-generated simulators, and data augmentation via formalization of existing domain knowledge.

[AI-249] Single-Configuration Attack Success Rate Is Not Enough: Jailbreak Evaluations Should Report Distributional Attack Success

【速读】:该论文旨在解决当前大语言模型越狱(jailbreak)攻击评估中因仅报告最优参数配置而导致的威胁表征不充分与攻击对比不可靠的问题。现有研究通常只展示单一最佳参数组合下的攻击成功率(ASR),忽略了攻击空间内多参数变体对性能分布的影响,从而掩盖了典型表现水平和未覆盖的攻击面。论文提出两个关键量化指标:Variant Sensitivity Measure (VSM) 和 Union Coverage (UC),其中 VSM 衡量最优 ASR 与测试范围内平均 ASR 的偏离程度,UC 表征所有配置下能触发不安全响应的提示比例。实证结果表明,仅报告最高 ASR 会严重低估真实风险,例如 PAIR 攻击在 Mistral-7B 上最佳 ASR 为 69%,但 UC 达到 88%;bijection 在 Mistral-7B 上最佳 ASR 为 81%,而其 36 种变体联合覆盖全部 HarmBench-100 提示。因此,论文主张采用基于分布的报告方式(distributional reporting),将 VSM 与 ASR 一同发布,并在算力允许的范围内尽可能详尽地枚举变体覆盖率,以此作为参数化越狱攻击评估的新的最低标准。

链接: https://arxiv.org/abs/2605.09070
作者: Carsten Maple,Abhishek Kumar,Riya Tapwal
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many jailbreak attack research papers report attack success rates for a limited number of parameter settings, even though there are many combinations of parameter settings that could be used. Further, when new jailbreak papers are released, they often benchmark results against single configurations of existing attacks. This position paper argues such practices are fundamentally insufficient for characterising the threat posed by parameterised jailbreak attacks, and comparing attacks. Most jailbreak attacks expose multiple internal parameters, system prompt templates, conversation rounds, cipher dispersion, teaching shots, and ASR varies substantially across these parameters. Reporting only the best-case configuration discards two pieces of information that defenders genuinely need: how typical that performance is across the variant space, and how much of the attack surface is missed by selecting a single variant. We propose two new measures for jailbreak attacks: the Variant Sensitivity Measure (VSM) and Union Coverage (UC). VSM quantifies how far the best reported ASR deviates from the mean ASR across the tested variant space, UC is the total fraction of prompts resulting in unsafe responses across all tested configurations. We empirically demonstrate the importance of these measures using two attack families across three open-source target models. For PAIR, the best template reaches 69% ASR on Mistral-7B and 75% on Qwen3-0.6B, while UC rises to 88% and 93%, respectively. For bijection on Mistral-7B, the best variant reaches 81% ASR, but the 36-variant union covers 100% of HarmBench-100 prompts. We argue that distributional reporting, publishing VSM alongside ASR and enumerating variant coverage as fully as compute allows, should become the new minimum standard for parameterised jailbreak evaluation.
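
按摘要的定义,VSM 刻画最佳变体 ASR 相对全部变体平均 ASR 的偏离,UC 是至少被一个变体攻破的提示占比。一种直接的实现是基于“变体 × 提示”的 0/1 成功矩阵计算,如下所示(数据为随机占位):

```python
import numpy as np

def vsm_and_uc(success):
    """success: 形状 (n_variants, n_prompts) 的 0/1 矩阵,
    success[i, j] = 1 表示变体 i 在提示 j 上得到不安全回复。"""
    per_variant_asr = success.mean(axis=1)
    best_asr = per_variant_asr.max()
    vsm = best_asr - per_variant_asr.mean()   # 最佳配置相对平均水平的偏离
    uc = success.any(axis=0).mean()           # 任一配置下被攻破的提示比例
    return best_asr, vsm, uc

rng = np.random.default_rng(0)
success = (rng.random((36, 100)) < 0.3).astype(int)   # 36 个变体 × 100 条提示(占位)
best_asr, vsm, uc = vsm_and_uc(success)
print(round(best_asr, 3), round(vsm, 3), round(uc, 3))
```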

[AI-250] Containment Verification: AI Safety Guarantees Independent of Alignment

【速读】:该论文旨在解决当前AI代理(AI agent)框架中安全机制依赖于模型本身、难以验证的问题,即现有安全方法仅作用于模型层,其有效性依赖于无法验证的模型行为特性。解决方案的关键在于提出“遏制验证”(containment verification),将安全保证从模型层转移到代理框架本身,通过形式化验证确保无论AI输出如何,框架都能强制执行边界策略。作者基于“havoc oracle semantics”将AI建模为无约束的Oracle,并利用前向模拟精化(forward-simulation refinement)证明通用安全保证,同时在Dafny中机械化该证明。该方法首次对代理框架进行了演绎式形式化验证,且其保证不随模型能力变化而失效。

链接: https://arxiv.org/abs/2605.09045
作者: Royce Moon,Lav R. Varshney
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Software Engineering (cs.SE)
备注: 14 pages

点击查看摘要

Abstract:Agentic frameworks are the software layer through which AI agents act in the world. Existing safety methods intervene on the model and therefore remain conditional on unverifiable properties of learned behavior. We introduce containment verification, which locates safety guarantees in the agentic framework itself. Under havoc oracle semantics, the AI is modeled as an unconstrained oracle ranging over the entire typed action space, and the verified containment layer must enforce the boundary policy for every possible AI output. For boundary-enforceable properties, expressed over modeled boundary events, action arguments, and state, we prove a universal guarantee by forward-simulation refinement and mechanize it in Dafny. We instantiate the paradigm by verifying PocketFlow, a minimalist agentic LLM framework, and use an agentic synthesis pipeline to generate the specification, operational model, and refinement proof under an information barrier against tautological specifications. To our knowledge, this is the first deductive formal verification of an agentic framework, and its guarantee is invariant to model capability over the modeled typed action boundary.
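
“遏制验证”的直觉是:无论模型输出什么,框架层都要在类型化动作边界上强制边界策略。下面用一个与论文无关的 Python 包装器示意这一思路(工具白名单与检查规则均为笔者假设);论文中的保证来自对框架代码的 Dafny 形式化验证,此处仅为概念演示:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    tool: str
    args: dict

class BoundaryViolation(Exception):
    pass

ALLOWED_TOOLS = {"search", "read_file"}       # 边界策略:允许的工具白名单(示意)
MAX_PATH_DEPTH = 3

def enforce_boundary(action: Action) -> Action:
    """把 AI 视为不受约束的 oracle:对任意输出做检查,违反边界则拒绝而非执行。"""
    if action.tool not in ALLOWED_TOOLS:
        raise BoundaryViolation(f"tool {action.tool!r} not permitted")
    if action.tool == "read_file":
        path = str(action.args.get("path", ""))
        if ".." in path or path.count("/") > MAX_PATH_DEPTH:
            raise BoundaryViolation(f"path {path!r} outside allowed scope")
    return action

try:
    enforce_boundary(Action("delete_repo", {}))
except BoundaryViolation as e:
    print("rejected:", e)
```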

[AI-251] SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks

【速读】:该论文旨在解决语言模型在开放域问答中因生成模糊或重复查询而导致检索资源浪费及后续推理失效的问题(即“查询质量”问题)。其核心解决方案是提出SearchSkill框架,通过显式建模可复用的搜索技能(search skills)来引导查询规划:模型在每一步先选择一个技能卡(skill card),再基于该技能生成具体的搜索或回答动作;同时,SkillBank动态演化,从反复失败模式中更新技能库并重构受影响的推理轨迹。这一两阶段监督微调(SFT)策略使训练过程与推理时的“技能选择→技能驱动执行”协议保持一致,从而显著提升知识密集型问答任务中的精确匹配率和检索效率,例如减少复制性初始查询、增强原子跳跃式查询比例,并在有限检索预算下获得更高准确率。

链接: https://arxiv.org/abs/2605.09038
作者: Jinchao Hu,Meizhi Zhong,Kehai Chen,Min Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Teaching language models to use search tools is not only a question of whether they search, but also of whether they issue good queries. This is especially important in open-domain question answering, where broad or copied queries often waste retrieval budget and derail later reasoning. We propose SearchSkill, a framework that makes query planning explicit through reusable search skills. At each step, the model first selects a skill, then generates a search or answer action conditioned on the selected skill card. The skill inventory itself is not fixed: SearchSkill maintains an evolving SkillBank, expands or refines it from recurrent failure patterns, and reconstructs affected trajectories before supervised training. The resulting two-stage SFT recipe aligns training with the inference-time protocol of skill selection followed by skill-grounded execution. Across open-source and closed-source models, SearchSkill improves exact match on knowledge-intensive QA benchmarks and yields better retrieval behavior, including fewer copied first queries, more atomic hop-focused queries, and more correct answers within a small search budget. These results suggest that explicit skill-conditioned query planning is a lightweight alternative to treating search as an undifferentiated action.

[AI-252] ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

【速读】:该论文旨在解决图结构代理记忆(graph-based agent memory)中的新型投毒攻击问题,即攻击者可通过注入伪造的关系(relation)来干扰大语言模型(LLM)代理的行为,而现有针对扁平文本记忆的投毒方法在图结构记忆中因关系提取、合并与检索失败而失效。解决方案的关键在于提出SHADOWMERGE攻击框架,其核心思想是利用关系通道冲突(relation-channel conflict):构造一个恶意关系,使其与良性证据共享相同的查询激活锚点(anchor)和规范化的关系通道(canonicalized relation channel),但携带冲突值;为实现这一机制,作者设计了AIR管道,将冲突转化为可被图记忆系统正常提取、合并和检索的普通交互行为。实验表明,SHADOWMERGE在Mem0及三个真实数据集上平均攻击成功率高达93.8%,显著优于基线方法,并验证了其对无关良性任务影响微小,同时揭示了当前输入侧防御手段的局限性。

链接: https://arxiv.org/abs/2605.09033
作者: Yang Luo,Zifeng Kang,Tiantian Ji,Xinran Liu,Yong Liu,Shuyu Li,Lingyun Peng
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Preprint. Corresponding authors: Zifeng Kang and Tiantian Ji. Code is available at this https URL

点击查看摘要

Abstract:Graph-based agent memory is increasingly used in LLM agents to support structured long-term recall and multi-hop reasoning, but it also creates a new poisoning surface: an attacker can inject a crafted relation into graph memory so that it is later retrieved and influences agent behavior. Existing agent-memory poisoning attacks mainly target flat textual records and are ineffective in graph-based memory because malicious relations often fail to be extracted, merged into the target anchor neighborhood, or retrieved for the victim query. We present SHADOWMERGE, a poisoning attack against graph-based agent memory that exploits relation-channel conflicts. Its key insight is that a poisoned relation can share the same query-activated anchor and canonicalized relation channel as benign evidence while carrying a conflicting value. To realize this, we design AIR, a pipeline that converts the conflict into an ordinary interaction that can be extracted, merged, and retrieved by the graph-memory system. We evaluate SHADOWMERGE on Mem0 and three public real-world datasets: PubMedQA, WebShop, and ToolEmu. SHADOWMERGE achieves 93.8% average attack success rate, improving the best baseline by 50.3 absolute points, while having negligible impact on unrelated benign tasks. Mechanism studies show that SHADOWMERGE overcomes the three key limitations of existing agent-memory poisoning attacks, and defense analysis shows that representative input-side defenses are insufficient to mitigate it. We have responsibly disclosed our findings to affected graph-memory vendors and open sourced SHADOWMERGE.

[AI-253] Evolutionary Ensemble of Agents

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在算法发现任务中因静态策略导致的性能瓶颈问题,特别是在复杂代码库中难以实现稳定泛化与持续优化的挑战。其解决方案的关键在于提出一种去中心化的进化集成框架 Evolutionary Ensemble (EvE),通过维护两个协同进化的种群——功能型代码求解器与代理引导状态,利用同步竞赛机制基于边际贡献动态更新代理的 Elo 评分,从而实现对代理行为策略的持续演化。该设计强调阶段依赖的代理自适应能力,有效应对搜索空间随时间演变的复杂性,避免了固定初始或冻结最优代理所引发的阶段错配问题,显著提升了系统在 In-Context Operator Networks (ICON) 等研究瓶颈场景下的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2605.09018
作者: Zongmin Yu,Liu Yang
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce Evolutionary Ensemble (EvE), a decentralized framework that organizes existing, highly capable coding agents into a live, co-evolving system for algorithmic discovery. Rather than reinventing the wheel within the “LLMs as optimizers” paradigm, EvE fixes the base agent substrate and focuses entirely on evolving the cumulative guidance and skills that dictate agent behaviors. By maintaining two co-evolving populations, namely functional code solvers and agent guidance states, the system evaluates agents through a synchronous race, updating their empirical Elo ratings based on the marginal gains they contribute to the current solver state. When applied to a research bottleneck in In-Context Operator Networks (ICON), EvE autonomously discovered a robust rescale-then-interpolate mechanism that enables reliable example-count generalization. Crucially, controlled ablations reveal the absolute necessity of stage-dependent agent adaptation to navigate the shifting search landscapes of complex codebases. Compared to variants driven by a fixed initial agent or even a frozen “best-evolved” agent, EvE uniquely avoids phase mismatch, demonstrating that organizing agents into a self-revising ensemble is the fundamental driver for breaking through static performance ceilings.
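
EvE 用同步竞赛中的边际收益更新代理的经验 Elo 评分。摘要未给出具体更新式,下面是把标准 Elo 更新套用到“两个代理比较各自对当前求解器的边际改进”上的最小示意(K 值与胜负判定均为笔者假设):

```python
def elo_update(r_a, r_b, gain_a, gain_b, k=32.0):
    """gain_a / gain_b:两个代理本轮对求解器状态带来的边际改进(越大越好)。"""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    if gain_a > gain_b:
        score_a = 1.0
    elif gain_a < gain_b:
        score_a = 0.0
    else:
        score_a = 0.5
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

print(elo_update(1500.0, 1500.0, gain_a=0.12, gain_b=0.03))  # 代理 A 获胜,评分上升
```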

[AI-254] CATO: Charted Attention for Neural PDE Operators

【速读】:该论文旨在解决基于Transformer的神经算子在处理复杂几何结构上的偏微分方程(PDE)时面临的两大挑战:一是直接在大量网格点上应用注意力机制计算开销过大;二是仅在原始离散坐标中操作会掩盖物理相互作用更自然表达的内在几何结构。其解决方案的关键在于提出Charted Axial Transformer Operator (CATO),该方法通过学习一个连续的潜在图册(latent chart),将网格坐标映射到一个学习得到的图册空间,在此空间中采用图册条件轴向注意力(charted axial attention)高效捕获长程依赖关系并降低计算成本。此外,CATO引入一种导数感知的物理损失函数,联合监督解值、网格一致梯度和辅助通量场,从而提升物理保真度并减少过平滑问题。理论分析进一步表明,在理想图册下,图册轴向注意力可低秩近似轴向解算子且误差可控,小的图册扰动导致有限的近似退化。

链接: https://arxiv.org/abs/2605.09016
作者: Chun-Wun Cheng,Sifan Wang,Carola-Bibiane Schönlieb,Angelica I. Aviles-Rivero
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA)
备注:

点击查看摘要

Abstract:Neural operators have emerged as powerful data-driven solvers for PDEs, offering substantial acceleration over classical numerical methods. However, existing transformer-based operators still face critical challenges when modeling PDEs on complex geometries: directly processing over massive mesh points is computationally expensive, while operating in raw discretization coordinates may obscure the intrinsic geometry where physical interactions are more naturally expressed. To address these limitations, we introduce the Charted Axial Transformer Operator (CATO), a geometry-adaptive and derivative-aware neural operator for PDEs on general geometries. Instead of applying attention directly in the physical coordinate system, CATO learns a continuous latent chart that maps mesh coordinates into a learned chart space, where chart-conditioned axial attention efficiently captures long-range dependencies with reduced computational cost. In addition, CATO introduces a derivative-aware physics loss for steady-state PDEs that jointly supervises solution values, mesh-consistent gradients, and an auxiliary flux-like field, improving physical fidelity and reducing oversmoothing. We further provide a theoretical approximation result showing that, under a favorable chart, charted axial attention can represent low-rank axial solution operators with controlled error, and that small chart perturbations induce bounded approximation degradation. CATO achieves the best performance across all evaluated datasets, yielding an average improvement of approximately 26.76% over the strongest competing baselines while reducing the number of parameters by 81.98%. These results highlight the effectiveness of learning geometry-adaptive charts and derivative-aware physical supervision for accurate and efficient PDE operator learning.

[AI-255] Re2Math: Benchmarking Theorem Retrieval in Research-Level Mathematics

【速读】:该论文旨在解决大语言模型在数学证明推理中缺乏对文献来源的准确引用与适配能力的问题,即如何从已有学术文献中检索并验证适用于当前证明步骤的工具(如引理),确保其假设条件与当前上下文一致。解决方案的关键在于提出 Re² Math 基准测试集,该基准通过构建源自主定理证明中的候选工具引用实例,引入层次化上下文和可选的泄漏控制锚点提示,实现“源导向但引用无关”的评估机制——任何满足证明过渡所需的可接受定理均可被采纳,从而将文献驱动的数学工具使用转化为一个可控且可诊断的任务,有效区分引用召回、源接地性和证明间隙充分性三个核心环节。

链接: https://arxiv.org/abs/2605.09012
作者: Zicheng Lyu,Wenjie Yang,Shengzhong Zhang,Zengfeng Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models are increasingly capable at closed-world mathematical reasoning, but research assistance also requires source-grounded use of the literature. When a proof reaches a non-trivial step, a useful assistant should determine whether the needed tool (e.g., a lemma) already exists, identify a suitable scholarly source, and verify that its assumptions align with the current proof context. To rigorously evaluate such capabilities, we introduce Re²Math, a benchmark for tool-grounded retrieval from partial mathematical proofs. Each instance is built from a candidate instrumental citation in the proof of a main theorem, with hierarchical context and an optional leakage-controlled anchor hint. We also make the task source-grounded yet citation-agnostic in that any admissible theorem sufficient for the proof transition is accepted. Evaluation uses a release-frozen retrieval artifact, ensuring reproducibility, while the benchmark itself supports automatic, continual expansion with newly constructed instances. On the current benchmark test set, the best fixed-judge ToolAcc reaches 7.0%, despite substantially higher rates of source grounding, indicating that current systems often retrieve valid statements but fail to establish their applicability to the local proof step. By decoupling citation recall, grounding, and proof-gap sufficiency, Re²Math transforms literature-grounded mathematical tool use into a controlled diagnostic task.

[AI-256] A Geometric Perspective on Next-Token Prediction in Large Language Models: Three Emerging Phases

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)中预测信息在不同层间几何分布及其演化机制的问题,即揭示预测信息在残差流(residual stream)中的定位与变化规律。其核心解决方案是引入表示透镜(representation lenses),即一种通过学习的仿射映射,用于从中间残差流中预测下一个词元(token),并将其作为几何诊断工具。作者定义每层的**预测读出子空间(predictive readout subspace)**为该映射在d维残差流上的主导k维奇异子空间,并追踪其在Grassmann流形上的轨迹,从而识别出三类几何相变:种子多路复用(Seeding Multiplexing)、提升覆盖(Hoisting Overriding)和聚焦收敛(Focal Convergence)。关键发现在于,随着模型深度增加,有效秩(effective rank)经历扩展、稳定和集中三个阶段,且深层容量主要被用于候选解的消歧(candidate disambiguation),而非单纯增加表征维度。

链接: https://arxiv.org/abs/2605.09011
作者: Gianfranco Lombardo,Giuseppe Trimigno,Stefano Cagnoni
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We investigate the geometry of predictive information across the layers of large language models (LLMs). We repurpose representation lenses-learned affine maps trained to predict the next token from intermediate residual streams-as geometric diagnostic tools. Rather than asking what the model predicts at each layer, we ask where predictive information resides and how it evolves across depth. We define at each layer a predictive readout subspace as the dominant k-dimensional singular subspace of such a map on the d-dimensional residual stream (where k is a resolution parameter), and track its trajectory on the Grassmann manifold as a similarity profile across layers. The profile is well described by unimodal distributions exhibiting a rise, near-plateau, and descent; varying k from 1% to 50% of d traces a Pareto frontier between visibility and energy retention, yet the same structure emerges at all scales. Across eight models from two families (Qwen2.5 and OLMo2, 1B-32B), we identify three geometric phases. Updates are approximately orthogonal to the residual stream throughout; what distinguishes the phases is their effect on the effective rank, which expands, stabilizes, and concentrates. In the first, Seeding Multiplexing, feed-forward memories and attention layers seed a candidate set in superposition in family-specific proportions, with the final token rising as leading candidate from 20% to 35% of positions across this phase. In the second, Hoisting Overriding, updates override existing subspaces to concentrate the candidate distribution without expanding the rank. In the third, Focal Convergence, high-energy low-rank updates write the winner into a form aligned with the unembedding direction. Phases 1 and 3 grow slowly with model depth, while Phase 2 expands linearly. The additional capacity of deeper LLMs is largely absorbed by candidate disambiguation.
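
摘要把每层的预测读出子空间定义为透镜映射的前 k 维奇异子空间,并在 Grassmann 流形上追踪其层间轨迹。下面用 numpy 示意这一几何诊断:对每层透镜权重做 SVD 取右奇异子空间,再用主角度余弦衡量相邻层子空间的相似度(透镜权重为随机占位):

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, k, n_layers = 256, 1000, 16, 6
lenses = [rng.normal(size=(vocab, d)) for _ in range(n_layers)]  # 各层透镜权重(占位)

def readout_subspace(W, k):
    """返回 W 的前 k 个右奇异向量,即残差流中读出子空间的正交基,形状 (d, k)。"""
    _, _, vt = np.linalg.svd(W, full_matrices=False)
    return vt[:k].T

def grassmann_similarity(U, V):
    """相邻层子空间相似度:主角度余弦的均值,取值范围 [0, 1]。"""
    s = np.linalg.svd(U.T @ V, compute_uv=False)
    return float(np.mean(np.clip(s, 0.0, 1.0)))

subspaces = [readout_subspace(W, k) for W in lenses]
profile = [grassmann_similarity(subspaces[i], subspaces[i + 1]) for i in range(n_layers - 1)]
print(np.round(profile, 3))
```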

[AI-257] Large Language Models for Sequential Decision-Making: Improving In-Context Learning via Supervised Fine-Tuning

【速读】:该论文旨在解决预训练大语言模型(Large Language Models, LLMs)在序列决策任务中的能力不足问题,尤其是其在马尔可夫决策过程(Markov Decision Processes, MDPs)、部分可观测马尔可夫决策过程(Partially Observable MDPs, POMDPs)及模糊POMDP(Ambiguous POMDPs, APOMDPs)等复杂环境下的决策性能尚未被充分挖掘的问题。解决方案的关键在于通过监督微调(Supervised Fine-Tuning, SFT)对LLMs进行端到端训练,使其能够直接从离线的、带标签的轨迹数据中学习策略,从而实现少样本序列决策。理论分析表明,微调后的注意力层可隐式估计最优Q函数,并推导出一个将上下文估计误差与训练长度偏差分离的端到端次优性界;实验验证了该方法在合成MDP、POMDP和APOMDP环境中显著优于仅依赖上下文学习(in-context learning)和随机基线的方法,尤其在长时程、部分可观测和模型模糊的场景下优势更为明显。

链接: https://arxiv.org/abs/2605.09009
作者: Minmin Zhang,Sina Aghaei,Soroush Saghafian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable in-context learning (ICL) capabilities, yet their potential for sequential decision-making remains underexplored. In this paper, we study the ICL capabilities of LLMs in sequential decision-making settings, including Markov Decision Processes (MDPs), Partially Observable MDPs (POMDPs), and Ambiguous POMDPs (APOMDPs). We fine-tune pretrained LLMs to perform few-shot decision-making directly from offline, oracle-labeled trajectories. Our framework enables flexible imitation of policies through supervised fine-tuning (SFT). Theoretically, we focus on linear MDPs and interpret a fine-tuned attention layer as implicitly estimating optimal Q-functions from in-context data. Building on this interpretation, we derive an end-to-end suboptimality bound for the induced policy that separates the in-context estimation error from the training-length bias. Empirically, across synthetic MDP, POMDP, and APOMDP settings, we find that fine-tuned LLMs achieve substantially smaller optimality gaps than in-context-only and random baselines, with especially large gains in longer-horizon, partially observed, and model-ambiguous environments. Together, these results show that supervised fine-tuning provides an effective route to endowing pretrained LLMs with sequential decision-making capabilities from offline data, which is an important advantage in domains such as healthcare where offline data are abundant.

[AI-258] Towards Backdoor-Based Ownership Verification for Vision-Language-Action Models

【速读】:该论文旨在解决视觉-语言-动作模型(Vision-Language-Action models, VLAs)在共享与适配过程中模型所有权保护的问题,以确保安全部署和负责任的开源使用。解决方案的关键在于提出GuardVLA,一种专为VLAs设计的基于后门的版权验证框架:通过在训练阶段向具身视觉数据中注入秘密信息,将隐蔽且无害的水印嵌入到模型中;在模型发布后,利用“交换与检测”机制——即触发投影器与外部分类头协同作用,基于预测概率激活并检测嵌入的后门水印,从而实现可靠的所有权验证,同时保持正常任务性能不受影响,并且水印在模型适配后仍可被检测。

链接: https://arxiv.org/abs/2605.09005
作者: Ming Sun,Rui Wang,Xingrui Yu,Lihua Jing,Hangyu Du,Zhenglin Wan,Xu Pan,Ivor Tsang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language-Action models (VLAs) support generalist robotic control by enabling end-to-end decision policies directly from multi-modal inputs. As trained VLAs are increasingly shared and adapted, protecting model ownership becomes essential for secure deployment and responsible open-source usage. In this paper, we present GuardVLA, the first backdoor-based ownership verification framework specifically designed for VLAs. GuardVLA embeds a stealthy and harmless backdoor watermark into the protected model during training by injecting secret messages into embodied visual data. For post-release verification, we propose a swap-and-detect mechanism, in which the trigger projector and an external classifier head are used to activate and detect the embedded backdoor based on prediction probabilities. Extensive experiments across multiple datasets, model architectures, and adaptation settings demonstrate that GuardVLA enables reliable ownership verification while preserving benign task performance. Further results show that the embedded watermark remains detectable under post-release model adaptation.

[AI-259] Sufficient conditions for a Heuristic Rating Estimation Method application

【速读】:该论文旨在解决基于成对比较(pairwise comparison)的决策评估问题,特别是在存在完整或不完整数据情况下如何准确估计备选方案的权重。其核心挑战在于确保Heuristic Rating Estimation (HRE) 方法在应用时满足正确的数学条件,并区分算术型与几何型算法在一致性估计上的性能差异。解决方案的关键在于严格推导出HRE方法适用的前提条件,并通过实例验证算术型算法在不一致性估计方面具有最优性,从而为实际应用提供理论保障和算法选择依据。

链接: https://arxiv.org/abs/2605.08991
作者: Jacek Szybowski,Konrad Kułakowski,Jiri Mazurek
机构: 未知
类目: Artificial Intelligence (cs.AI); Econometrics (econ.EM)
备注: 18 pages

点击查看摘要

Abstract:A series of papers has introduced the Heuristic Rating Estimation method, which evaluates a set of alternatives based on pairwise comparisons and the weights of reference alternatives. We formulate the conditions under which the HRE method can be applied correctly. The research considers both arithmetic and geometric algorithms for complete and incomplete pairwise comparison methods. The illustrative examples show that the estimations of inconsistency in the arithmetic variant are optimal.

[AI-260] Benchmarking Compositional Generalisation for Machine Learning Interatomic Potentials

【速读】:该论文旨在解决当前机器学习原子间势能模型(Machine Learning Interatomic Potentials)在面对未见过的分子时泛化能力不足的问题,特别是区分这些模型是真正学习了化学的组成结构(即分子片段及其组合如何决定性质),还是仅依赖训练数据中的局部模式进行插值。解决方案的关键在于提出一个包含四个任务的基准测试集,这些任务要求模型具备一定的组合泛化能力;每个任务中,测试分子在训练阶段均未出现,但若模型掌握了底层物理原理,则应能实现有效预测。实验表明,即使使用预训练于数百万分子的基础模型,当前最先进的方法在分布外样本上的误差仍比分布内样本高出一个数量级,凸显了该问题的挑战性。

链接: https://arxiv.org/abs/2605.08988
作者: Amir Masoud Nourollah,Irtaza Khalid,Stefano Leoni,Steven Schockaert
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:Machine Learning Interatomic Potentials play a fundamental role in computational chemistry and materials science, enabling applications from molecular dynamics simulations to drug design and materials discovery. While recent approaches can estimate inter-atomic forces with high precision, it remains unclear to what extent they can generalise to previously unseen molecules. Do they learn the compositional structure of chemistry, capturing how molecular fragments and their combinations determine properties, or do they primarily learn to interpolate patterns that are specific to the training examples? To address this question, we propose a benchmark consisting of four tasks that require some form of compositional generalisation. In each task, models are tested on molecules that were unseen during training, but the training data is chosen such that generalisation to the test examples should be feasible for models that learn the underlying physical principles. Our empirical analysis shows that the considered tasks are highly challenging for state-of-the-art models, with errors on out-of-distribution examples often an order of magnitude higher than on in-distribution examples, even when using foundation models that have been pre-trained on millions of molecules.

[AI-261] Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的智能体在测试时扩展(test-time scaling)中普遍存在的探索策略同质化问题,即现有方法缺乏根据环境不确定性动态调整探索行为的能力,导致资源浪费或关键信息遗漏。其解决方案的关键在于提出一种探索感知的强化学习框架,通过变分推断构建细粒度奖励函数,显式评估探索动作对未来决策改进的潜力,并引入探索感知的分组机制,在优化过程中将探索性动作与任务完成动作分离,从而实现仅在高不确定性时进行选择性探索,并在任务上下文明确后迅速转入执行阶段。

链接: https://arxiv.org/abs/2605.08978
作者: Xingyuan Hua,Sheng Yue,Ju Ren
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in agentic test-time scaling allow models to gather environmental feedback before committing to final actions. A key limitation of existing methods is that they typically employ undifferentiated exploration strategies, lacking the ability to adaptively distinguish when exploration is truly required. In this paper, we propose an exploration-aware reinforcement learning framework that enables LLM agents to adaptively explore only when uncertainty is high. Our method introduces a fine-grained reward function via variational inference that explicitly evaluates exploratory actions by estimating their potential to improve future decision-making, together with an exploration-aware grouping mechanism that separates exploratory actions from task-completion actions during optimization. By targeting informational gaps, this design allows agents to explore selectively and transition to execution as soon as the task context is clear. Empirically, we demonstrate that our approach achieves consistent improvements across a range of challenging text-based and GUI-based agent benchmarks. Code is available at this https URL and models are available at this https URL.

[AI-262] Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation

【速读】:该论文旨在解决推理驱动型端到端(End-to-End, E2E)自动驾驶系统中推理效率与轨迹多样性之间的权衡问题。当前主流方法如Alpamayo 1采用多推理(multi-reasoning)策略,虽能生成丰富的人类可读推理过程,但存在冗余计算导致的高延迟;而单推理(single-reasoning)方案虽更高效,常被认为牺牲了轨迹多样性。论文的关键解决方案在于:首先,将原多推理架构重构为单推理架构,在不显著降低轨迹多样性的前提下大幅减少推理延迟;其次,通过消除扩散模型中因不必要的数据复制和低效内核执行带来的块间开销,优化动作生成阶段的运行时性能。两项优化协同作用,在保持预测质量的同时实现69.23%的推理延迟降低,验证了系统架构设计与运行时执行联合优化对提升推理型E2E自动驾驶系统效率的重要性。

链接: https://arxiv.org/abs/2605.08975
作者: Yunseong Jeon,Namcheol Lee,Yoonsu Lee,Jangwoon Park,Sol Ahn,Jong-Chan Kim,Seongsoo Hong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Submitted to IEEE RTCSA on March 26, 2026 (KST) Accepted on May 4, 2026 (KST)

点击查看摘要

Abstract:Reasoning-based end-to-end (E2E) autonomous driving has recently emerged as a promising approach to improving the interpretability of driving decisions as it can generate human-readable reasoning together with predicted trajectories. Such approaches commonly generate multiple trajectories to capture diverse future behaviors, and they fall into two categories: (1) multi-reasoning, where one reasoning sequence is generated per trajectory, and (2) single-reasoning, where a single reasoning is shared across all trajectories. The former offers richer diversity at the cost of redundant computation, while the latter is more efficient but is often assumed to sacrifice diversity. Alpamayo 1, a representative system, adopts the multi-reasoning approach and achieves competitive trajectory prediction performance. However, the efficiency of this design remains largely unexplored, making it a well-motivated subject for investigation. In this paper, we systematically analyze and improve Alpamayo 1 in two ways. First, we reduce inference latency while preserving trajectory diversity by redesigning Alpamayo 1 into a single-reasoning system. Through extensive experiments, we find that replacing multi-reasoning with single-reasoning does not meaningfully degrade trajectory diversity. Second, we accelerate diffusion-based action generation by eliminating inter-block overhead arising from unnecessary copy operations and inefficient kernel execution. Through closed-loop and open-loop experiments, we validate both optimizations, demonstrating a 69.23% reduction in inference latency while maintaining trajectory diversity and prediction quality. These results highlight the importance of jointly analyzing system architecture and runtime execution to improve the efficiency of reasoning-based E2E AD systems.

[AI-263] Agentic AI Scientists Are Not Built For Autonomous Scientific Discovery

【速读】:该论文旨在解决当前生成式 AI 科学家在实现端到端自主科学发现(autonomous scientific discovery)过程中所面临的系统性挑战,包括问题选择偏差、实验室实践中的隐性知识缺失、偏好优化导致的输出多样性压缩以及缺乏来自物理实验的反馈机制。其解决方案的关键在于:首先,利用科学模拟作为训练验证器以提升模型对真实实验场景的理解;其次,设计能够表征实际研究目标动态变化的持久世界模型(persistent world models);再次,建立集中化的预注册数据库用于存储所有由 AI 生成的假设,确保可追溯性和可重复性;最后,强调以科学需求驱动应用,而非仅依赖工具的功能特性(tool affordance)。这一系列措施旨在从基础设计层面重构 AI 科学研究范式,推动真正意义上的自主 AI 科学家的发展。

链接: https://arxiv.org/abs/2605.08956
作者: Harshit Bisht,Vinay Kumar,Kevin Maik Jablonka,Mausam,N. M. Anoop Krishnan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A growing body of work pursues AI scientists capable of end-to-end autonomous scientific discovery. This position paper argues that although they already function as co-scientists, agentic AI scientists are not built for autonomous scientific discovery. We identify the following challenges in building and deploying autonomous AI scientists: (1) Problem selection is influenced by the McNamara fallacy; (2) Agents are built on large language models (LLMs) whose training corpora omit tacit procedural and failure knowledge of laboratory practice; (3) Preference optimisation during post-training compresses output diversity toward consensus; and (4) Most scientific benchmarks measure single-turn prediction accuracy and lack feedback from physical experiments back to the computational model. These challenges are not just questions of scale and scaffolding; they require revisiting fundamental design choices. To build truly autonomous AI scientists, we recommend the use of scientific simulations as verifiers for training, the design of persistent world models that represent the shifting objectives governing real investigations, the establishment of a centralized preregistration repository for all AI-generated hypotheses, and application driven by scientific need rather than tool affordance.

[AI-264] MolWorld: Molecule World Models for Actionable Molecular Optimization

【速读】:该论文旨在解决药物分子优化中“可行动性”(actionability)的问题,即如何在保证分子目标属性提升的同时,确保候选分子能够通过有效的局部结构变换从已知化合物逐步演化而来,从而支持可解释且可行的化学系列迭代设计。现有从头生成或单分子优化方法未能显式建模这种可达性,尤其是在目标分子与中间转化路径均未知的情况下。其解决方案的关键在于提出 MolWorld 框架,将分子转移图(molecule-transfer graph)视为一个动态演化的搜索状态,通过锚定局部结构上下文生成候选分子,并利用学习的世界模型(world model)筛选并更新图结构,从而在每轮迭代中同时优化属性和保持强结构连通性,实现可行动的序列化分子设计。

链接: https://arxiv.org/abs/2605.08954
作者: Yang Qiao,Bo Pan,Hao-Wei Pang,Peter Zhiping Zhang,Liying Zhang,Liang Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Molecular optimization in drug discovery aims to discover molecules with improved target properties, but practical lead optimization often requires more than high predicted scores. A useful candidate should also be actionable: it should be reachable from known molecules through valid local structural transformations, so that it can be interpreted as a plausible revision within an evolving chemical series. Existing de novo and single-molecule optimization methods do not explicitly model such reachability, especially when both the target molecules and the intermediate molecules connecting them to known compounds are unknown. In this work, we formulate actionable molecular optimization as sequential expansion of a molecule-transfer graph, where nodes are molecules and edges encode valid local transformations. We propose MolWorld, a molecule world model-guided framework that treats the current molecule-transfer graph as an evolving search state. At each iteration, MolWorld selects local anchor contexts, generates candidate molecules conditioned on these contexts, evaluates their properties, and uses a learned world model to update the evolving molecule world by retaining admissible candidates and inserting them into the molecule-transfer graph. The expanded molecule world then guides subsequent optimization. Experiments on property optimization and docking-based tasks show that MolWorld discovers high-property molecules while maintaining substantially stronger structural connectivity, supporting actionable and sequential molecular design.

[AI-265] MDGYM: Benchmarking AI Agents on Molecular Simulations

【速读】:该论文旨在解决生成式 AI (Generative AI) 在科学计算领域中自主执行复杂物理模拟任务的能力瓶颈问题,特别是分子动力学(Molecular Dynamics, MD)模拟这一典型场景下的自主性与可靠性挑战。其解决方案的关键在于构建一个结构化、分层难度的基准测试平台 MDGYM,涵盖169个由专家精心设计的MD模拟任务,覆盖LAMMPS和GROMACS两大主流软件包,并系统评估三种代理框架(Claude Code、Codex、OpenHands)与四种大语言模型(LLMs)的表现。实验结果表明,当前最先进的AI代理在易级任务中仅能正确完成21%,高难度任务则低于10%,且失败模式集中于物理不稳定的初始配置、伪造数值输出或提前放弃迭代调试,揭示出代码生成能力无法直接迁移至需要物理常识约束的科学推理场景。

链接: https://arxiv.org/abs/2605.08941
作者: Vinay Kumar,Satyendra Rajput,Mausam,N. M. Anoop Krishnan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The promise of AI-driven scientific discovery hinges on whether AI agents can autonomously design and execute the computational workflows that underpin modern science. Molecular dynamics (MD) simulation presents a natural test bed to stress-test this claim; it requires translating physical intuition into syntactically and semantically correct input scripts, reasoning about initial and boundary conditions, diagnosing numerically unstable trajectories, and interpreting outputs against known physical behavior and laws. We introduce MDGYM, a benchmark of 169 expert-curated MD simulations spanning LAMMPS and GROMACS, two widely used MD packages, across three increasing difficulty levels. We evaluate three agentic frameworks – Claude Code, Codex, and OpenHands – with four LLMs, and find that all perform poorly: even the strongest agent solves only 21% of easy-level tasks, with less than 10% at higher difficulties. Trajectory analysis reveals a characteristic pattern of failure – agents successfully invoke simulation machinery but produce physically unstable configurations, fabricate numerical outputs without executing the underlying computation, or abandon tasks prematurely rather than iterating through simulation-specific errors. These failure modes are qualitatively distinct from those observed in general software engineering benchmarks, indicating that fluent code generation does not transfer to grounded physical reasoning.

[AI-266] Can We Formally Verify Neural PDE Surrogates? SMT Compilation of Small Fourier Neural Operators

【速读】:该论文旨在解决生成式 AI(Generative AI)中基于傅里叶神经算子(Fourier Neural Operators, FNOs)的偏微分方程(PDE)模拟模型缺乏形式化保证的问题,特别是其是否保留基本物理结构(如正性、质量守恒等)。解决方案的关键在于将FNO的前向传播建模为分段线性函数,并利用Z3定理证明器在有限精度下进行精确验证:通过两种编码方式——精确编码将谱卷积编译为稠密矩阵乘法以提供严格证明和反例,而轻量冻结编码则用常数替代谱路径以提升可扩展性但牺牲保真度。实验表明,精确编码可在小规模模型上获得可靠的物理性质证书,而冻结编码虽能支持更大网格(64点)的快速检查,却无法对原始FNO提供形式保证,从而清晰揭示了形式验证中的“保真度-可扩展性”权衡。

链接: https://arxiv.org/abs/2605.08938
作者: Ali Baheri,David Millard,Ignacio Laguna Peralta
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Fourier Neural Operators (FNOs) can greatly accelerate PDE simulation, but they are often used without formal guarantees that they preserve basic physical structure. We show that, once the trained weights and grid are fixed, the spectral convolution in an FNO is a linear map. As a result, the full forward pass is piecewise-linear and can be represented exactly in Z3’s linear real arithmetic. We study two encodings. The exact encoding compiles the spectral convolution into a dense matrix multiplication, which is sound for both proofs and counterexamples. The lighter frozen encoding replaces the spectral path with a constant, making it faster but approximate. On 10 small FNO surrogates for 1D advection-diffusion-reaction (85 to 117 parameters, grids 8 to 32), the exact encoding gives 2 sound positivity proofs on linear (ReLU-free) models, 5 sound positivity counterexamples, and 10 sound mass-violation counterexamples; the remaining 3 positivity queries on ReLU models time out. For mass non-increase, Z3 finds worse counterexamples than both gradient-based falsification and Monte Carlo on 7 of 10 models. The frozen encoding scales to grid size 64 with sub-second positivity checks, but it no longer provides certificates for the original FNO. Overall, the results make the soundness–scalability tradeoff explicit and point to what is needed for formal verification of production-scale neural operators.
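
To make the linearity argument above concrete, here is a minimal numpy sketch (function names, sizes, and the random weights are mine, not the paper's code): a 1D spectral convolution with frozen weights is a plain linear map, so applying it to the standard basis vectors compiles it into a dense matrix, which is essentially what the exact encoding does before handing the query to Z3.

```python
import numpy as np

def spectral_conv(x, w):
    """Multiply the lowest Fourier modes of x by fixed complex weights w."""
    xf = np.fft.rfft(x)
    yf = np.zeros_like(xf)
    yf[: w.shape[0]] = w * xf[: w.shape[0]]      # keep only the retained modes
    return np.fft.irfft(yf, n=x.shape[0])

rng = np.random.default_rng(0)
n, modes = 16, 5
w = rng.normal(size=modes) + 1j * rng.normal(size=modes)

# With w frozen, x -> spectral_conv(x, w) is linear, so it is fully described
# by its action on the standard basis: a dense n x n matrix.
A = np.stack([spectral_conv(e, w) for e in np.eye(n)], axis=1)

x = rng.normal(size=n)
assert np.allclose(A @ x, spectral_conv(x, w))
```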

[AI-267] Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories

【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRM)在面对对抗性攻击时,难以从不安全的推理轨迹中自我纠正的问题。现有对齐方法依赖静态专家数据(如反思轨迹或对抗前缀)进行微调,但由于训练数据与模型动态、策略内(on-policy)的推理过程存在偏差,导致模型无法充分探索其庞大的生成空间,也难以学习如何从自身错误中恢复。论文提出Self-ReSET,一种纯强化学习框架,其关键在于利用模型自身的不安全推理轨迹作为初始状态,通过强化学习机制赋予模型内在的自我修复能力,并将这些轨迹重用于后续训练,从而有效提升模型在分布外(out-of-distribution, OOD)越狱提示下的鲁棒性,同时保持通用性能和高效的数据利用率。

链接: https://arxiv.org/abs/2605.08936
作者: Dongcheng Zhang,Yi Zhang,Yuxin Chen,An Zhang,Xiang Wang,Chaochao Lu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Reasoning Models possess remarkable capabilities for self-correction in general domain; however, they frequently struggle to recover from unsafe reasoning trajectories under adversarial attacks. Existing alignment methods attempt to mitigate this vulnerability by fine-tuning the model on expert data including reflection traces or adversarial prefixes. Crucially, these approaches are often hindered by static training data which inevitably deviate from model’s dynamic, on-policy reasoning traces, resulting in model hardly covering its vast generation space and learning to recover from its own failures. To bridge this gap, we propose Self-ReSET, a pure reinforcement learning framework designed to equip LRMs with the intrinsic capacity to recover from their own safety error trajectories, which are subsequently reused as an initial state for reinforcement learning. Extensive experiments across various LRMs and benchmarks demonstrate that Self-ReSET significantly enhances robustness against adversarial attacks especially out-of-distribution (OOD) jailbreak prompts while maintaining general utility, along with efficient data utilization. Further analysis reveals that our method effectively fosters self-recovery patterns, enabling models to better identify and recover from unsafe intermediate error states back to benign paths. Our codes and data are available at this https URL.

[AI-268] PnP-Corrector: A Universal Correction Framework for Coupled Spatiotemporal Forecasting

【速读】:该论文旨在解决耦合时空预测中因子系统误差传播与放大导致的长期预测不稳定问题,即所谓的“相互误差放大”(Reciprocal Error Amplification)现象。其解决方案的关键在于提出一个通用框架 PnP-Corrector(Plug-and-Play Corrector),通过将物理模拟过程与误差校正分离:冻结预训练的物理模拟器,仅训练一个校正代理(correction agent)以主动抵消耦合系统中产生的系统性偏差,从而提升长期预测的稳定性和准确性。

链接: https://arxiv.org/abs/2605.08935
作者: Hao Wu,Fan Xu,Yuxu Lu,Penghao Zhao,Fan Zhang,Hao Jia,Yuxuan Liang,Ruijian Gou,Qingsong Wen,Xian Wu,Xiaomeng Huang,Yuan Gao
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Coupled spatiotemporal forecasting is important for predicting the future evolution of multiple interacting dynamical systems, such as in climate models. However, existing methods are severely constrained by the persistent bottleneck of compounding errors. In coupled systems, errors from each subsystem simulator propagate and amplify one another, a phenomenon we term Reciprocal Error Amplification, leading to a rapid collapse of long-range predictions. To address this challenge, we propose a universal framework called PnP-Corrector (Plug-and-Play Corrector). The core idea of our framework is to decouple the physical simulation from the error correction process: it freezes pre-trained physics simulation engines and exclusively trains a correction agent to proactively counteract the systematic biases emerging from the coupled system. Furthermore, we design an efficient predictive model architecture, DSLCast, to serve as the backbone of this framework. Extensive experiments demonstrate that our method significantly enhances the long-term stability and accuracy of coupled forecasting systems. For instance, in the challenging task of a 300-day global ocean-atmosphere coupled forecast, our PnP-Corrector framework reduces the prediction error of the baseline model by 29% and surpasses state-of-the-art models on several key metrics.
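
The decoupling described above can be sketched in a few lines of PyTorch; this is a toy illustration under my own simplifying assumptions (a single frozen linear "simulator" and a small residual corrector), not the paper's DSLCast backbone or its coupled ocean-atmosphere setup.

```python
import torch
import torch.nn as nn

class FrozenSimulator(nn.Module):
    """Stand-in for a pre-trained physics simulation engine; never updated."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Linear(dim, dim)
        for p in self.parameters():
            p.requires_grad_(False)

    def forward(self, x):
        return self.net(x)

class Corrector(nn.Module):
    """The only trainable part: predicts a residual correction per step."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.GELU(), nn.Linear(64, dim))

    def forward(self, x):
        return x + self.net(x)

def rollout(sim, corr, x0, steps):
    xs = [x0]
    for _ in range(steps):
        xs.append(corr(sim(xs[-1])))   # correct each step before feeding it back
    return torch.stack(xs)

sim, corr = FrozenSimulator(16), Corrector(16)
traj = rollout(sim, corr, torch.randn(4, 16), steps=10)   # (11, 4, 16) trajectory
```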

[AI-269] Internalizing Safety Understanding in Large Reasoning Models via Verification ICML2026

【速读】:该论文旨在解决当前大型推理模型(Large Reasoning Models, LRMs)在对齐过程中存在的安全性缺陷问题,即现有对齐范式主要依赖外部强制合规机制,仅优化模型检测恶意提示的能力,而忽视了其对自身输出安全性的内在理解,导致模型缺乏对生成内容的安全验证能力,易受对抗性越狱攻击。解决方案的关键在于提出Safety Internal(SInternal)框架,通过仅在安全验证任务上训练LRMs,使其学习使用专家推理轨迹批判性地评估自身生成答案的安全性,从而将安全规范内化为模型的内在认知结构,显著提升模型在跨域越狱攻击下的鲁棒性,并为后续强化学习提供更安全的初始策略。

链接: https://arxiv.org/abs/2605.08930
作者: Yi Zhang,Yuxin Chen,Leheng Sheng,Dongcheng Zhang,Chaochao Lu,Xiang Wang,An Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

点击查看摘要

Abstract:While explicit Chain-of-Thought (CoT) empowers large reasoning models (LRMs), it enables the generation of riskier final answers. Current alignment paradigms primarily rely on externally enforced compliance, optimizing models to detect malicious prompts rather than evaluating the safety of their own outputs. We argue that this approach remains largely behavioral: our empirical analysis reveals that ostensibly aligned models lack intrinsic safety understanding, often failing to verify their own response safety and remaining vulnerable to adversarial jailbreaks. To address this fundamental limitation, we propose Safety Internal (SInternal), a framework that internalizes safety specifications by training LRMs exclusively on safety verification tasks to critique their own generated answers using expert reasoning trajectories. We demonstrate that learning to verify induces a strong generalization for response safety, significantly enhancing robustness against out-of-domain jailbreaks. Furthermore, when combined with reinforcement learning, SInternal serves as a superior initialization compared to standard supervised fine-tuning, suggesting that internalizing safety understanding creates a more robust foundation for alignment than merely mimicking safe behaviors. Our codes are available at this https URL

[AI-270] Transformer autoencoder with local attention for sparse and irregular time series with application on risk estimation

【速读】:该论文旨在解决稀疏且不规则时间序列数据中的风险估计问题,特别是在电力系统中非技术性损耗(non-technical losses)的检测难题。这类损耗主要源于窃电行为,其识别因实际数据采集的稀疏性和不规则性而极具挑战,传统方法难以有效捕捉长程依赖关系并 robust 地处理此类数据特征。解决方案的关键在于提出一种基于局部注意力机制的 Transformer 自编码器(Transformer Autoencoder with local attention)框架,该框架结合了 Transformer 强大的模式识别能力与传统数据清洗和归一化方法,通过局部注意力机制高效提取不规则序列中的关键模式,从而生成高度判别性的潜在特征,显著提升风险估计的一致性、召回率和精确度,优于现有主流方法。

链接: https://arxiv.org/abs/2605.08914
作者: Panteleimon Rodis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:This paper introduces a framework specifically designed for sparse and irregular time series risk estimation. It is based on a Transformer Autoencoder with local attention, which leverages the powerful pattern identification capabilities of transformers complemented by traditional data cleaning and normalization methods. It efficiently captures relevant patterns within irregular sequences suffering from sparse data collection, benefiting from the discriminative ability of the local attention mechanism. The proposed framework is applied to a real-world case study, on the risk estimation of non-technical losses in electrical power systems in a wide area in Greece. Non-technical losses in electrical power systems, primarily stemming from electricity theft, pose significant economic and operational challenges. Detecting these anomalies is particularly challenging due to the inherent sparse and irregular nature of real-world data collection practices. Traditional risk estimation methods struggle with effectively capturing long-range dependencies and robustly handling such data characteristics. We demonstrate that our approach effectively yields highly discriminative latent features, which results in more consistent risk estimation compared with existing state-of-the-art and widely used methods. It achieves high recall and precision, meeting the critical objectives of the problem. As such, our solution offers a robust and effective tool for risk detection in irregular time series datasets.
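
The local-attention idea can be illustrated with a small numpy sketch; the window rule, names, and shapes below are assumptions made for illustration, not the paper's architecture. Each observation attends only to observations whose timestamps fall inside a fixed window, which is what makes the mechanism suitable for sparse, irregularly sampled sequences.

```python
import numpy as np

def local_attention(q, k, v, times, window=2.0):
    """Attention restricted to keys whose timestamp is within `window` of the query."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    gap = np.abs(times[:, None] - times[None, :])
    scores = np.where(gap <= window, scores, -np.inf)   # mask distant observations
    scores -= scores.max(axis=-1, keepdims=True)        # numerically stable softmax
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ v

rng = np.random.default_rng(0)
times = np.array([0.0, 0.4, 1.1, 3.0, 3.2, 7.5])        # sparse, irregular sampling
x = rng.normal(size=(6, 8))                             # one feature vector per reading
out = local_attention(x, x, x, times)                   # (6, 8) locally mixed features
```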

[AI-271] Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在推理任务评估中仅关注正确性(correctness)而忽视最优性(optimality)的问题,即模型能否在约束条件下找到最佳解。针对这一问题,作者提出OPT-BENCH框架,其核心在于引入质量感知的强化学习(quality-aware Reinforcement Learning with Verifiable Rewards, RLVR),通过设计包含实例生成器、质量验证器和最优基线的可扩展训练基础设施,以及基于可行性(Success Rate, SR)与解质量(Quality Ratio, QR)双重指标的严谨评测体系,实现对LLMs在NP-hard优化问题上的系统训练与评估。关键创新在于质量感知奖励机制,使模型能够持续改进解的质量而非仅满足二元正确性判断,实验表明该方法显著优于传统方法,并展现出跨任务迁移能力。

链接: https://arxiv.org/abs/2605.08905
作者: Xiaozhe Li,Xinyu Fang,Shengyuan Ding,Yang Li,Linyang Li,Haodong Duan,Qingwen Liu,Kai Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable success on reasoning benchmarks through Reinforcement Learning with Verifiable Rewards (RLVR), excelling at tasks such as math, coding, logic, and puzzles. However, existing benchmarks evaluate only correctness, while overlooking optimality, namely the ability to find the best solutions under constraints. We propose OPT-BENCH, the first comprehensive framework for training and evaluating LLMs on NP-hard optimization problems through quality-aware RLVR. OPT-BENCH provides three key components: a scalable training infrastructure with instance generators, quality verifiers, and optimal baselines across 10 tasks; a rigorous benchmark with 1,000 instances evaluating both feasibility, measured by Success Rate, and quality, measured by Quality Ratio; and quality-aware rewards that enable continuous improvement beyond binary correctness. Training on Qwen2.5-7B-Instruct-1M with 15K examples achieves 93.1% SR and 46.6% QR, significantly outperforming GPT-4o, which achieves 29.6% SR and 14.6% QR. Beyond optimization, training on OPT-BENCH transfers to diverse tasks, including mathematics (+2.2%), logic (+1.2%), knowledge (+4.1%), and instruction following (+6.1%). Our analysis reveals that quality-aware rewards improve solutions by 28.8% over binary rewards, and that task diversity drives generalization more than data quantity, offering insights into RLVR scaling for complex reasoning.
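
One way to picture the quality-aware reward described above is the hedged sketch below: feasible solutions earn a continuous score tied to how close they come to the best-known objective value, instead of a binary pass/fail signal. The exact shaping used by the benchmark may differ, and the function name is mine.

```python
def quality_aware_reward(objective, best_known, feasible, minimize=True):
    """Continuous reward in [0, 1]: 0 if infeasible, 1 only at the best-known value."""
    if not feasible:
        return 0.0
    ratio = best_known / objective if minimize else objective / best_known
    return max(0.0, min(1.0, ratio))

print(quality_aware_reward(120.0, best_known=100.0, feasible=True))  # 0.833...
print(quality_aware_reward(100.0, best_known=100.0, feasible=True))  # 1.0
```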

[AI-272] OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在动态环境中是否具备持续优化解决方案的能力这一关键问题,即模型能否通过内在的自我反思机制而非简单工具调用来实现自适应改进。其核心挑战在于评估LLMs是否能像人类一样,基于环境反馈不断迭代优化策略,从而在复杂、多变的任务空间中提升性能。解决方案的关键是提出OPT-BENCH基准测试平台与OPT-Agent框架:OPT-BENCH结合20个机器学习任务与10个经典NP-hard问题,构建了一个高维搜索空间以严格检验模型的自适应能力;而OPT-Agent则模拟人类认知过程,通过感知(perception)、记忆(memory)和推理(reasoning)的闭环机制,利用环境反馈持续迭代优化解空间,从而实现内在驱动的自我改进能力。

链接: https://arxiv.org/abs/2605.08904
作者: Xiaozhe Li,Jixuan Chen,Xinyu Fang,Shengyuan Ding,Haodong Duan,Qingwen Liu,Kai Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning and tool use. However, the fundamental cognitive faculties essential for problem solving, including perception, reasoning, and memory, remain the stable core of intelligence. Unlike memorizing specific patterns, humans succeed in novel environments by applying these intrinsic faculties to adapt and optimize. Yet, whether LLMs possess this essential capacity, namely the ability to continuously refine solutions in response to dynamic environmental feedback, remains underexplored. To address this challenge, we introduce OPT-BENCH, a benchmark for evaluating self-improvement capabilities in large-scale search spaces. By combining 20 machine learning tasks with 10 classic NP-hard problems, OPT-BENCH provides a rigorous setting to assess whether agents can adapt through intrinsic self-reflection rather than rote tool application. We further propose OPT-Agent, a framework that emulates human-like cognitive adaptation. It operates through a general perception, memory, and reasoning loop, iteratively refining solutions based on environmental feedback. Through extensive experiments on 19 LLMs from 7 model families, including reasoning models, general models, and open-source models ranging from 3B to 235B parameters, we demonstrate that stronger models are more effective at leveraging feedback signals for self-improvement. However, this upper-bound adaptability remains fundamentally constrained by the models’ base capacity, and even the most advanced LLMs still fall short of human expert performance.

[AI-273] Shapley Regression for Rare Disease Diagnosis Support: a case study on APDS IJCAI2026 ALT

【速读】:该论文旨在解决激活 PI3Kδ 综合征(Activated PI3Kδ Syndrome, APDS)的早期识别难题,其核心挑战在于临床表现高度异质且与其他免疫疾病重叠,导致诊断延迟。传统线性评分系统难以捕捉症状间的复杂交互关系,而深度学习模型虽具表达能力却缺乏可解释性。解决方案的关键在于提出一种基于博弈论的新型 Shapley 回归模型,该模型以 k-加性合作博弈替代传统逻辑回归中的线性预测器,显式建模症状共现模式,同时保持了逻辑回归的透明性和凸性优势;实验表明,采用 l₂ 正则化的二阶加性模型在预测性能与噪声鲁棒性之间实现了最佳平衡,并在真实患者队列中成功区分 APDS 患者与对照组,验证了已知表型并揭示了症状间的配对交互关系。

链接: https://arxiv.org/abs/2605.08897
作者: Safa Alsaidi,Tomás Brogueira,Nizar Mahlaoui,Marc Vincent,Guilherme Pelegrina,Nicolas Garcelon,Adrien Coulet,Miguel Couceiro
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 21 pages, 4 figures. Accepted to the AI and Health special track at IJCAI 2026; the first two named authors had equal contribution

点击查看摘要

Abstract:Activated PI3Kδ Syndrome (APDS) is a rare genetic immune disorder caused by variants in PIK3CD or PIK3R1, with highly heterogeneous symptoms that often delay diagnosis. Early recognition is hampered by overlapping clinical presentations and limited clinician awareness, motivating systematic, data-driven approaches to detect APDS-associated phenotypic patterns in routine electronic health records. Traditional linear scoring systems cannot capture complex symptom interactions, while deep learning models, though expressive, often lack interpretability. To bridge this gap, we propose Shapley regression, a novel game-theoretic model replacing the linear predictor with a k-additive cooperative game, explicitly modeling co-occurrence of symptoms while maintaining the transparency and convexity of logistic regression. We carry out an empirical study of our lightweight method on eight public biomedical datasets, showing that a 2-additive model with l_2 regularization achieves an optimal trade-off between predictive power and noise robustness. We also apply it to a real-world cohort of 222 patients, on which Shapley regression accurately distinguished APDS cases from matched controls, confirming and validating phenotypes known to be associated with APDS, and facilitating the exploration of pairwise interactions between symptoms, validated by clinical experts.
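
One simple reading of the 2-additive predictor, assuming binary symptom indicators, is a logistic model whose logit adds a Möbius-style term for every symptom pair on top of the usual main effects. The sketch below is an illustration of that idea with names of my choosing, not the authors' implementation.

```python
from itertools import combinations
import numpy as np

def two_additive_features(x):
    """x: (n, d) binary symptom matrix -> main effects plus all pairwise products."""
    pairs = [x[:, i] * x[:, j] for i, j in combinations(range(x.shape[1]), 2)]
    return np.concatenate([x, np.stack(pairs, axis=1)], axis=1)

def predict_proba(x, w, b):
    z = two_additive_features(x) @ w + b      # the 2-additive "game" acts as the logit
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(5, 4)).astype(float)    # 5 patients, 4 symptoms
n_terms = two_additive_features(x).shape[1]           # 4 main effects + 6 pairs = 10
p = predict_proba(x, rng.normal(size=n_terms), 0.0)   # per-patient score (random weights, for illustration)
```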

[AI-274] Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off

【速读】:该论文旨在解决对齐后的大型语言模型(Large Language Models, LLMs)为何仍易受越狱攻击(jailbreak attacks)的问题,核心关注点在于揭示其结构上的脆弱性根源。解决方案的关键在于提出“拒绝逃逸方向”(Refusal-Escape Directions, RED)的概念:即在有害输入周围存在局部扰动方向,可使模型从拒绝回答转变为生成回答,同时保持输入的有害语义不变。研究进一步理论证明RED可被精确分解为模型各算子层级(operator-level)源的贡献,并识别出归一化(normalization)、残差连接(residual-wiring)和终端层(terminal)为受解析约束的结构性来源。由此得出,消除RED需在共享表达模块(自注意力机制与多层感知机,MLP)中移除这些受限源的贡献,同时保留支持良性响应的机制,从而形成一种条件性的安全-效用权衡(conditional safety-utility trade-off)。

链接: https://arxiv.org/abs/2605.08878
作者: Yu Chen,Yuanhao Liu,Qi Cao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 40 pages, 45 figures

点击查看摘要

Abstract:Aligned large language models (LLMs) remain vulnerable to jailbreak attacks. Recent mechanistic studies have identified latent features and representation shifts associated with jailbreak success, but they leave a more fundamental question open: why do aligned LLMs remain jailbreakable, and what structural vulnerabilities in the model make this possible? We study this question through a continuous input-transformation view. Our theoretical finding is that aligned models can still exhibit Refusal-Escape Directions (RED): local perturbation directions around a harmful input that shift the model’s behavior from refusal to answering while preserving the model’s harmful-semantics interpretation. From this perspective, a jailbreak is not only a successful discrete prompt construction, but can also be understood as a refusal-to-answer behavior transition induced by continuously perturbing a harmful input along RED. We then prove that RED can be exactly decomposed into contributions from operator-level sources across the model’s operator structure, and identify normalization, residual-wiring, and terminal sources as analytically constrained operator-level sources. To eliminate RED, the shared expressive modules – self-attention and MLP – must eliminate the contributions from these analytically constrained sources while preserving the mechanisms that support benign responses. These competing requirements give rise to a conditional safety-utility trade-off. Experiments across multiple models and attack methods empirically analyze RED from two complementary perspectives and show that added token dimensions can expose RED, while successful jailbreaks exhibit refusal-to-answer shifts largely aligned with terminal-source contributions.

[AI-275] BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在大型语言模型(Large Language Models, LLMs)训练中rollout阶段的效率瓶颈问题,尤其是由数据并行秩间长尾延迟(long-tail bubbles)导致的计算资源闲置,特别是在处理长上下文场景时,快速GPU因等待慢速rank而处于空闲状态。解决方案的关键在于提出BubbleSpec框架,其不试图消除这些延迟窗口,而是主动利用较快rank的空闲时间预生成后续rollout步骤的结果作为草稿,用于推测解码(speculative decoding),从而实现加速。与以往依赖历史epoch相似性或冷启动过程的推测方法不同,BubbleSpec无需依赖数据集规模,可从训练初期即提供加速效果,并严格保持RL算法的同步性,兼容多种RL策略且不牺牲数学精确性。

链接: https://arxiv.org/abs/2605.08862
作者: Yuhang Xu,Kaibin Tian,Yang Tian,Zhice Yang,Yifeng Yu,Yan Li,Shengzhong Liu,Fan Wu,Guihai Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) has become a cornerstone for improving the performance of Large Language Models (LLMs). However, its rollout phase constitutes a significant efficiency bottleneck, mainly arising from the long-tail bubbles across data parallel ranks, particularly in long-context scenarios where faster GPUs remain idle while waiting for stragglers. Existing solutions, such as partial rollout or asynchronous RL, mitigate these bubbles by compromising the algorithm’s strict synchronous nature. Instead, we propose BubbleSpec, a novel framework that accelerates RL rollouts while strictly keeping the mathematical exactness. Instead of attempting to eliminate bubbles, BubbleSpec exploits them. We exploit the idle time windows of faster ranks to pre-generate rollout results for subsequent steps, serving as drafts for speculative decoding. Unlike prior speculative methods that rely on historical epoch similarity and warm-ups, BubbleSpec is agnostic to dataset size and provides immediate acceleration from the onset of training. Extensive evaluations demonstrate that BubbleSpec reduces decoding steps by 50% and increases rollout throughput by up to 1.8x. Critically, BubbleSpec is seamlessly compatible with various RL frameworks and strategies as it sustains the strict synchronous property of RL algorithms.

[AI-276] M3: Reframing Training Measures for Discretized Physical Simulations

【速读】:该论文旨在解决神经代理模型(Neural Surrogate Models)在物理仿真中因离散采样导致的分布不均问题,即由离散域中的经验测度(empirical measure)引发的监督信号偏差,进而造成优化过程偏倚和物理保真度的空间不一致性。其解决方案的关键在于提出M³(Multi-scale Morton Measure)框架,通过依据物理变化对空间进行分层划分,并在多尺度上分配监督信号,从而平衡训练测度,提升模型在连续物理域中的预测准确性与物理一致性。

链接: https://arxiv.org/abs/2605.08843
作者: Yuan Mei,Xingyu Song,Xiaowen Song,Naoya Takeishi
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Neural surrogate models for physical simulations are trained on discretized samples of continuous domains, where the induced empirical measure leads to uneven supervision, biasing optimization and causing spatial inconsistencies in physical fidelity. To mitigate this measure-induced bias, we propose M^3 (Multi-scale Morton Measure), a scalable framework that balances training measures by partitioning space according to physical variation and allocating supervision across multiple scales. Applied to three industrial-scale datasets with diverse discretizations, M^3 consistently improves predictions in the continuous physical domain, achieving up to 4.7× lower error in large-scale volumetric cases. These gains persist under aggressive subsampling (160M → 16M → 1.6M points), where M^3-trained models outperform those trained on higher-resolution data, reducing physics-weighted relative L_2 error by 3–4× and the corresponding MSE by up to 13×. These results highlight data distribution as a key factor in operator learning and position M^3 as a scalable, data-efficient approach for physically consistent modeling.
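
The Morton-code bucketing idea behind the method can be sketched as follows; the paper's actual partitioning rule and weighting scheme are not reproduced here, and the bit depth and normalization are assumptions. Points are quantized, their coordinate bits interleaved into Z-order cells, and each sample is weighted inversely to how crowded its cell is, so densely discretized regions stop dominating the training measure.

```python
import numpy as np

def morton3d(p, bits=3):
    """p: (n, 3) points in [0, 1) -> integer Z-order (Morton) code per point."""
    q = np.clip((p * (1 << bits)).astype(np.int64), 0, (1 << bits) - 1)
    code = np.zeros(len(p), dtype=np.int64)
    for b in range(bits):
        for axis in range(3):
            code |= ((q[:, axis] >> b) & 1) << (3 * b + axis)
    return code

rng = np.random.default_rng(0)
pts = rng.random((1000, 3))
cells = morton3d(pts)                              # one coarse scale: 8x8x8 cells
_, inv, counts = np.unique(cells, return_inverse=True, return_counts=True)
weights = 1.0 / counts[inv]                        # per-point weight ~ 1 / cell population
weights *= len(pts) / weights.sum()                # renormalize to mean 1
```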

[AI-277] SynerDiff: Synergetic Continuous Batching for Fast and Parallel Diffusion Model Inference ICME2026

【速读】:该论文旨在解决生成式 AI(Generative AI)内容服务中扩散模型(Diffusion Model)推理时面临的高吞吐量与低端到端(E2E)延迟之间的矛盾问题,尤其在UNet与VAE组件并发执行时出现的资源争用导致的延迟波动和多任务调度权衡难题。解决方案的关键在于提出SynerDiff系统,通过层级协同机制实现优化:在内并发层(intra-concurrency)采用VAE分块(VAE Chunking)和自适应跳过CFG(Adaptive Skip-CFG)策略缓解特定组件的资源瓶颈;在间并发层(inter-concurrency)基于组件对调度粒度的不同敏感性,设计阈值感知调度器(threshold-aware scheduler)动态规划并发序列并调整内部调度决策,在保障UNet吞吐量的同时最小化VAE延迟;同时引入反馈控制器根据队列负载动态调节阈值,从而提升系统整体容量上限。实验表明,该方案相较基线显著提升吞吐量1.6倍,并将平均和P99尾部延迟降低最多78.7%。

链接: https://arxiv.org/abs/2605.08835
作者: Ziqi Zhou,Peng Yang,Yuxin Liang,Mingliu Liu,Jia Lu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: accepted by IEEE ICME 2026

点击查看摘要

Abstract:The expansion of Artificial Intelligence-generated content service requires diffusion model serving to simultaneously achieve high throughput and low task end-to-end (E2E) latency. However, existing continuous batching methods suffer from severe resource contention during UNet-VAE concurrency, leading to latency spikes. Furthermore, concurrent multi-task scheduling entails a trade-off between UNet throughput and VAE latency across varying scheduling strategies. To address these, we propose SynerDiff, an efficient continuous batching system built on intra-inter level synergy. At the intra-concurrency level, SynerDiff alleviates resource contention by pruning component-specific resource bottlenecks via VAE Chunking and Adaptive Skip-CFG. At the inter-concurrency level, leveraging components’ differential sensitivity to scheduling granularities, a threshold-aware scheduler plans concurrent sequences and tunes intra-concurrency decisions to minimize VAE latency while maintaining UNet within a high-throughput threshold. Additionally, a feedback controller dynamically adjusts this threshold based on queue loads to boost the system capacity ceiling. Experimental results show that SynerDiff improves throughput by 1.6× and decreases both average E2E and P99 tail latencies by up to 78.7% compared to benchmarks, while guaranteeing high image fidelity.

[AI-278] FRACTAL: SSM with Fractional Recurrent Architecture for Computational Temporal Analysis of Long Sequences ICML2026

【速读】:该论文旨在解决状态空间模型(State Space Models, SSMs)在长序列建模中面临的根本性权衡问题:如何在保持无限历史信息的同时,高分辨率地检测现实世界现象中常见的短期突变。现有基于高阶多项式投影算子(HiPPO)的SSMs存在两个局限:均匀测度会稀释近期信息以维持时间尺度不变性,而指数测度则牺牲全局上下文以捕捉局部动态。解决方案的关键在于提出一种分数阶递归架构(Fractional Recurrent Architecture for Computational Temporal Analysis of Long sequences, FRACTAL),其核心创新是将分数阶测度理论引入递归记忆更新机制,通过推导具有解析谱特性且可调奇异性指数的投影算子,在增强对近期信号扰动敏感性的同时保留编码尺度不变记忆动态的谱结构。该方法在简化对角化状态空间框架中通过调节输入投影初始化,实现了多尺度时间特征的同步捕获,最终在Long Range Arena基准测试中取得87.11%的平均得分,显著优于S5模型。

链接: https://arxiv.org/abs/2605.08833
作者: Mengqi Li,Wensheng Lin,Jinshuai Yang,Lixin Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 19 pages (10 pages main text, 9 pages appendix), 3 figures. Accepted by ICML 2026

点击查看摘要

Abstract:Effective sequence modeling fundamentally requires balancing the retention of unbounded history with the high-resolution detection of abrupt short-term variations common in real-world phenomena. However, existing state space models (SSMs) relying on high-order polynomial projection operators (HiPPO) face a critical trade-off where uniform measures dilute recent information to maintain timescale invariance, while exponential measures sacrifice global context to capture local dynamics. This paper proposes a Fractional Recurrent Architecture for Computational Temporal Analysis of Long sequences (FRACTAL), a novel architecture integrating fractional measure theory into recursive memory updates to address this limitation. By deriving projection operators with analytically characterized spectral properties and a tunable singularity index, the proposed method amplifies sensitivity to recent signal perturbations while preserving the spectral structure that encodes scale-invariant memory dynamics. This theoretical innovation is instantiated within a simplified diagonalized state space framework by modulating input projection initialization to enable simultaneous capture of multi-scale temporal features. FRACTAL achieves an average score of 87.11% on the Long Range Arena benchmark, including 61.85% on the ListOps task, outperforming the S5 model.

[AI-279] When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents

【速读】:该论文旨在解决大语言模型代理(LLM agents)在运行过程中因环境感知不可靠而导致的“环境接地失效”问题,即当代理所依赖的环境观察(如文件、API响应或网页内容)为过时、错误或恶意时,代理仍可能基于这些虚假证据采取错误行动,从而偏离任务正确路径。其解决方案的关键在于提出EnvTrustBench框架,通过定义一种可量化的行为缺陷——证据接地缺陷(Evidence-Grounding Defect, EGD),来系统性地评估代理是否在面对不完整或失真的环境信息时仍能保持行为与真实环境状态一致;该框架自动构建任务场景、生成工作空间和验证规则,并记录代理的动作-观察轨迹,最终由验证器判定是否存在EGD,从而揭示代理在复杂环境中可靠性的核心风险。

链接: https://arxiv.org/abs/2605.08828
作者: Strick Sheng,Ziyue Wang,Liyi Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language model agents increasingly operate through environment-facing scaffolds that expose files, web pages, APIs, and logs. These observations influence tool use, state tracking, and action sequencing, yet their reliability and authority are often uncertain. Environmental grounding is therefore a systems-level problem involving context admission, evidence provenance, freshness checking, verification policy, action gating, and model reasoning. Existing agent benchmarks mainly evaluate task capability or specific attacks such as prompt injection and memory poisoning, but they under-specify a fundamental reliability question: whether agents remain grounded in the true environment state when observations are stale, incorrect, or malicious. We introduce EnvTrustBench, an agentic framework for benchmarking this failure mode. We define an evidence-grounding defect (EGD) as a behavioral failure in which an agent treats an environment-facing claim as sufficient evidence for action without resolving it against available current evidence, leading to a task-incorrect false path under the true environment state. Given a task scenario, EnvTrustBench generates the workspace, environment, agent-facing objective, and validation oracle, executes the evaluated agent, records its action-observation trajectory and final state, and applies the oracle to produce a verdict. Using 6 LLM backbones and 5 widely used scaffolds, we evaluate 55 generated cases across 11 task scenarios, with each scenario expanded through five feedback-guided generation iterations. Results show that EGDs consistently emerge across operational workflows, highlighting environmental grounding as a core agent reliability problem with important security implications.

[AI-280] Mental Health AI Safety Claims Must Preserve Temporal Evidence

【速读】:该论文旨在解决当前心理健康人工智能(Mental Health AI)安全评估中存在的时序错配问题,即现有评估方法通常仅基于孤立响应、最终结果或对话整体质量进行打分,而忽视了交互序列中随时间累积的潜在风险,如延迟升级、重复强化、依赖形成、修复失败及逐轮恶化等临床后果显著的机制。解决方案的关键在于提出“时序安全不可识别性”(Temporal Safety Non-Identifiability)这一形式化概念,阐明仅丢弃时序特征的评估协议无法验证依赖于序列、时机、累积效应或恢复过程的安全属性;进而构建SCOPE(Safety Claims Over Preserved Evidence)原则,并具体化为SCOPE-MH——一种保留时序证据的心理健康AI评估报告标准,通过AnnoMI数据集上的实证验证其能揭示传统逐轮评分所无法捕捉的失效机制,从而推动安全评估从静态截面转向动态时序建模。

链接: https://arxiv.org/abs/2605.08827
作者: Srimonti Dutta,Ratna Kandala
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The safety of mental health AI is often judged at the wrong temporal scale. Current evaluations typically score isolated responses, endpoint outcomes, or aggregate dialogue quality, while clinically consequential failures may arise from the order and accumulation of interactions themselves, including delayed escalation, repeated reinforcement, dependency formation, failed repair, and gradual deterioration across turns. This paper argues that this mismatch is not merely a limitation of evaluation coverage but a source of invalid safety conclusions. We introduce Temporal Safety Non-Identifiability, a formal account of why safety properties that depend on sequence, timing, accumulation, or recovery cannot be certified by protocols that discard those features. From this formalization, we develop SCOPE (Safety Claims Over Preserved Evidence) as a general principle for aligning safety claims with the evidence an evaluation actually retains, and instantiate it as SCOPE-MH, a mental-health instantiation of this reporting standard. We operationalize SCOPE-MH through a proof-of-concept on the AnnoMI dataset of expert-annotated motivational interviewing conversations, which reveals mechanisms of failure that per-turn behavior scoring does not represent. We propose SCOPE-MH as a diagnostic complement to existing evaluation infrastructure and argue that evaluation preserving temporal evidence is necessary, not optional, for safety-critical mental health AI deployment.

[AI-281] How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors

【速读】:该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)在大语言模型(Large Language Model, LLM)推理任务中因奖励稀疏性和长推理时域导致的有效探索困难问题,尤其针对“熵崩溃”(entropy collapse)现象——即RLVR虽提升单次推理准确率,却难以拓展成功推理轨迹的覆盖范围。解决方案的关键在于提出信息最大化增强探索(Information-Maximizing Augmented eXploration, IMAX)框架,通过训练一组软前缀(soft prefixes)重塑基础模型对推理轨迹的先验分布;每个前缀作为可训练控制旋钮,从同一骨干模型生成不同的推理分布,从而实现多样化探索。同时,设计信息最大化(InfoMax)奖励以补充可验证奖励,引导模型发现多样且任务相关的推理行为,该方法与现有RLVR算法解耦,具备良好的集成兼容性。

链接: https://arxiv.org/abs/2605.08817
作者: Yifan Xu,Junren Chen,Yifan Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) recently thrives in large language model (LLM) reasoning tasks. However, the reward sparsity and the long reasoning horizon make effective exploration challenging. In practice, this challenge manifests as the entropy collapse phenomenon, where RLVR improves single-rollout accuracy but fails to expand coverage on successful reasoning trajectories. Passive exploration techniques like entropy regularization tend to dismiss generation quality, resulting in noisy rollouts. In response to this issue, we propose an Information-Maximizing Augmented eXploration (IMAX) framework to train a pool of soft prefixes that reshapes the base model’s prior over reasoning trajectories. Rather than relying on RL to incentivize exploration on top of the base model, each prefix acts as a trainable control knob that induces a distinct rollout distribution from the same backbone model. To encourage discovery of diverse and task-relevant reasoning behaviors, we derive an Information Maximization (InfoMax) reward to complement the verifiable rewards for RL training. IMAX is in general algorithm-agnostic and can be seamlessly integrated into existing RLVR pipelines. Experiment results have shown that across three backbone scales, IMAX consistently improves reasoning performance over standard RLVR, with gains up to 11.60% in Pass@4 and 10.57% in Avg@4.
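
A minimal sketch of the soft-prefix mechanism, assuming a backbone that accepts precomputed input embeddings (as HuggingFace-style causal LMs do via inputs_embeds); the class and parameter names are mine, and the InfoMax reward is not shown.

```python
import torch
import torch.nn as nn

class SoftPrefixPool(nn.Module):
    """A pool of trainable soft prefixes; each one reshapes the rollout prior."""
    def __init__(self, n_prefixes, prefix_len, d_model):
        super().__init__()
        self.prefixes = nn.Parameter(0.02 * torch.randn(n_prefixes, prefix_len, d_model))

    def prepend(self, token_embeds, idx):
        # token_embeds: (batch, seq, d_model); prepend prefix `idx` to every sample.
        prefix = self.prefixes[idx].expand(token_embeds.size(0), -1, -1)
        return torch.cat([prefix, token_embeds], dim=1)

pool = SoftPrefixPool(n_prefixes=4, prefix_len=8, d_model=16)
x = torch.randn(2, 10, 16)            # stand-in for embedded prompt tokens
augmented = pool.prepend(x, idx=1)    # (2, 18, 16), fed to the frozen backbone
```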

[AI-282] Mirror Mirror on the Wall: Can VLM Agents Tell Who They Are at All?

【速读】:该论文旨在解决如何评估具身视觉-语言模型(Vision-Language Model, VLM)是否具备基于感知与行动的自我锚定(embodied self-grounding),而非仅依赖先验知识、提示遵循或虚构性推理的问题。其核心挑战在于区分真正的镜像自我识别能力与表面性的模式匹配行为。解决方案的关键在于构建一个受控的3D基准测试环境,要求第一人称VLM代理通过镜中反射推断隐藏的身体属性并选择对应目标,同时排除误导线索、遮挡和镜面移除等干扰因素;并通过镜面寻求、时间顺序判断、自我归属一致性及推理-行动一致性等多个维度量化决策过程,从而诊断模型是否真正将镜像信息与自身身体状态关联起来,而非简单地执行预设策略。

链接: https://arxiv.org/abs/2605.08816
作者: Filippo Ziliotto,Ciro Beneduce,Bruno Lepri,Luciano Serafini,Massimiliano Luca,Tommaso Campari
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:In the animal kingdom, mirror self-recognition is a canonical probe of higher-order cognition, emerging only in some species. We ask whether an analogous functional capability emerges in embodied vision-language model (VLM) agents: can they recognize themselves in a mirror? We introduce a controlled 3D benchmark where a first-person VLM agent must infer a hidden body attribute from its reflection and select the matching target, while avoiding self-other misattribution. To separate mirror-grounded self-identification from shortcuts, we test mirror removal, misleading cues, and occluded reflections. We also evaluate the decision process through mirror seeking, temporal ordering, self-attribution, and reasoning-action consistency. Our experiments show that mirror-based self-identification emerges mainly in stronger VLMs. These models can use reflected evidence for action, whereas weaker models often inspect the mirror but fail to extract self-relevant information or misattribute their reflection. Language-vision conflict further shows that self-referential language alone is not evidence of grounded self-identification. Overall, mirror-based evaluation provides a diagnostic for whether embodied self-grounding is causally rooted in perception and action rather than priors, prompt compliance, or confabulation.

[AI-283] Compressed Video Aggregator: Content-driven Module for Efficient Micro-Video Recommendation

【速读】:该论文旨在解决微视频推荐系统中因视频帧冗余和采样粗粒度导致的计算效率低、模型训练资源消耗大以及特征表示不精准的问题。其解决方案的关键在于提出一种轻量级的微视频聚合模块 Compressed Video Aggregator (CVA),通过解耦视频信息与偏好学习,利用冻结的视频帧嵌入(VFM embeddings)进行聚合,并采用无交叉注意力投影的潜在推理机制生成紧凑的视频嵌入,从而显著降低训练时间和GPU内存占用;同时,基于CLIP模型对标题重选关键帧的方法进一步提升了所有方法(包括CVA)的性能表现。

链接: https://arxiv.org/abs/2605.08810
作者: Yang Xiao,Huiyuan Chen,Kaiyuan Deng,Chao Jiang,Zinan Ling,Ruimeng Ye,Xiaolong Ma,Bo Hui
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages

点击查看摘要

Abstract:We propose Compressed Video Aggregator (CVA), a lightweight micro-video recommendation module that decouples video information from preference learning. It aggregates frozen VFM embeddings, and uses latent reasoning without cross-attention projection, producing compact video embeddings for recommenders. Due to the redundancy in the frame count of the original benchmark and its overly coarse sampling, we used titles to re-select key frames based on CLIP. Experiments on MicroLens and Short-Video show consistent gains with orders-of-magnitude reductions in training time and GPU memory, and re-selected frames can further enhance the performance of all methods, including CVA. Furthermore, we also discussed the impact of several scenarios involving erroneous titles on our method. Code will be released soon.
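
The title-guided key-frame re-selection step can be sketched as below, assuming the CLIP text and frame embeddings have already been computed (the encoder calls themselves are omitted, and the function name is mine): frames are ranked by cosine similarity to the title and the top-k are kept in temporal order.

```python
import numpy as np

def select_key_frames(frame_embs, title_emb, k=8):
    """Pick the k frames whose CLIP embedding is most similar to the title embedding."""
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    t = title_emb / np.linalg.norm(title_emb)
    sims = f @ t                                  # cosine similarity per frame
    keep = np.sort(np.argsort(-sims)[:k])         # top-k, restored to temporal order
    return keep, sims[keep]

rng = np.random.default_rng(0)
idx, scores = select_key_frames(rng.normal(size=(64, 512)), rng.normal(size=512))
```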

[AI-284] Deterministic Decomposition of Stochastic Generative Dynamics

【速读】:该论文旨在解决当前生成式模型中随机动力学与确定性演化效应被压缩为单一有效场的问题,从而导致难以区分确定性传输与扩散诱导的随机波动作用。其解决方案的关键在于提出了一种确定性场的自然分解(transport–osmotic decomposition),将随机生成过程中的速度场 (b_t) 分解为两部分:(u_t)(主导边际概率传输的确定性部分)和 (d_t)(由扩散引起、由边际得分决定的渗透效应)。基于此分解,作者进一步设计了**桥匹配(Bridge Matching)**框架,通过联合学习边际与条件动态来实现可解释且可控的采样,即通过调节参数 (\lambda_d) 控制渗透项在概率传输中的贡献,从而实现对生成过程的精细调控。

链接: https://arxiv.org/abs/2605.08794
作者: Xingyu Song,Yuan Mei,Naoya Takeishi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: submitted to NeurIPS 2026

点击查看摘要

Abstract:Modern generative models can be understood as probability transport from a simple base distribution to a target data distribution. Deterministic transport models offer tractable velocity-field parameterizations, whereas stochastic generative models capture richer density evolution through drift and diffusion. Yet when stochastic dynamics are described through deterministic velocity fields, the effects of drift and diffusion are often compressed into a single effective field, obscuring the distinct roles of deterministic evolution and stochastic fluctuation. In this work, we show that the deterministic field (b_t) of a stochastic generative process admits a natural transport–osmotic decomposition that separates deterministic transport from stochastic, diffusion-induced effects: (b_t = u_t + d_t), where (u_t) governs marginal probability transport and (d_t) captures an osmotic effect induced by diffusion and determined by the marginal score. Based on this decomposition, we propose Bridge Matching, a flow-based framework for learning decomposed generative dynamics through both marginal and conditional formulations. In generative modeling experiments, we recombine the learned components as (b_t = u_t + \lambda_d d_t), showing that the proposed decomposition enables interpretable and controllable sampling by adjusting the osmotic contribution in probability transport.
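
For reference, one standard instance of such a split is Nelson's transport and osmotic velocities, assuming a constant diffusion coefficient σ (the paper's exact conventions may differ). For dX_t = b_t(X_t) dt + σ dW_t with marginals p_t,

$$
u_t(x) = b_t(x) - \frac{\sigma^2}{2}\nabla \log p_t(x), \qquad d_t(x) = \frac{\sigma^2}{2}\nabla \log p_t(x),
$$

so that $b_t = u_t + d_t$, $u_t$ alone reproduces the same marginals through the continuity equation $\partial_t p_t + \nabla\cdot(p_t u_t) = 0$, and $d_t$ is determined by the marginal score, matching the roles described above; the recombination $b_t = u_t + \lambda_d d_t$ recovers the original drift at $\lambda_d = 1$.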

[AI-285] cuRegOT: A GPU-Accelerated Solver for Entropic-Regularized Optimal Transport

【速读】:该论文旨在解决熵正则化最优传输(entropic-regularized optimal transport, OT)在大规模机器学习应用中因计算成本高而导致的效率瓶颈问题,尤其是现有GPU加速方法在收敛速度与内存访问效率之间的权衡不足。其解决方案的关键在于提出一种名为cuRegOT的高性能GPU求解器,通过三项核心优化策略实现:一是采用摊销化的符号分析策略以缓解CPU端瓶颈;二是设计异步Sinkhorn迭代生成机制提升并行效率;三是开发融合内核(fused kernel)以实现带宽友好的梯度计算。这些策略均具备严格的理论收敛性保障,并在多种基准任务上显著优于当前最先进的GPU求解器。

链接: https://arxiv.org/abs/2605.08793
作者: Yixuan Qiu
机构: 未知
类目: Mathematical Software (cs.MS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Optimal transport (OT) has emerged as a fundamental tool in modern machine learning, yet its computational cost remains a significant bottleneck for large-scale applications. While harnessing the massive parallelism of modern GPU hardware is critical for efficiency, the de facto standard Sinkhorn algorithm, despite its ease of parallelization, often suffers from slow convergence in challenging problems. More recently, the sparse-plus-low-rank quasi-Newton method offers a balance between convergence rate and per-iteration complexity; however, its efficiency on GPUs is severely hindered by the serial nature of sparse matrix symbolic analysis and irregular memory access patterns. To bridge this gap, we present cuRegOT, a high-performance GPU solver tailored for entropic-regularized OT. We introduce a suite of algorithmic and architectural optimizations, including an amortized symbolic analysis strategy to mitigate CPU bottlenecks, an asynchronous Sinkhorn iterates generation mechanism, and a fused kernel for bandwidth-efficient gradient evaluation. These strategies are backed by rigorous theoretical guarantees ensuring algorithmic convergence. Extensive numerical experiments demonstrate that cuRegOT achieves significant speedups over state-of-the-art GPU-based solvers across a variety of benchmark tasks.
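
For orientation, the iterations that such solvers accelerate are the classic Sinkhorn updates; the log-domain numpy sketch below is generic (not the paper's GPU kernels or its sparse-plus-low-rank quasi-Newton method) and all names are mine.

```python
import numpy as np

def logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def sinkhorn(a, b, C, eps=0.05, iters=500):
    """a, b: source/target marginals; C: (n, m) cost matrix; returns the transport plan."""
    f, g = np.zeros_like(a), np.zeros_like(b)
    logK = -C / eps
    for _ in range(iters):
        # Alternate dual updates so the plan's row/column sums match a and b.
        f = eps * (np.log(a) - logsumexp(logK + g[None, :] / eps, axis=1))
        g = eps * (np.log(b) - logsumexp(logK + f[:, None] / eps, axis=0))
    return np.exp(logK + f[:, None] / eps + g[None, :] / eps)

rng = np.random.default_rng(0)
a, b = np.full(8, 1 / 8), np.full(6, 1 / 6)
P = sinkhorn(a, b, C=rng.random((8, 6)))
assert np.allclose(P.sum(axis=0), b)   # column marginals are exact after the g-update
```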

[AI-286] A Reconfigurable Multiplier Architecture for Error-Resilient Applications in RISC-V Core

【速读】:该论文旨在解决神经网络(Neural Networks, NNs)在能源受限的嵌入式设备上进行高效推理时面临的能效挑战。为实现这一目标,作者提出了一种集成于RISC-V核心的运行时可重构乘法器架构,其关键在于通过专用的mulscr指令支持精确计算与近似计算之间的灵活切换,并提供多级可配置精度控制,从而在标准处理器流水线中实现细粒度的能效-精度权衡。该设计在精确模式下功耗降低44%-52%,近似模式下降低62%-68%,同时保持1.89 DMIPS/MHz的计算性能,在2D卷积和矩阵乘法等容错负载上最高实现63%的能耗降低,且矩阵乘法能耗低至1.21 pJ/指令,验证了其在边缘人工智能部署中的有效性。

链接: https://arxiv.org/abs/2605.08785
作者: Pragun Jaswal,L. Hemanth Krishna,B. Srinivasu
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: Accepted in ISVLSI 2026

点击查看摘要

Abstract:Neural Networks (NNs) have been widely adopted due to their outstanding efficacy and adaptability across computer vision and deep learning applications. The optimization of NNs is necessary to enable their deployment on energy constrained embedded devices, where the limited available energy poses a significant challenge for efficient inference. This paper presents a runtime reconfigurable multiplier architecture integrated into the RISC-V core, targeting energy efficient neural network inference and edge AI applications. The proposed multiplier supports adaptability for exact and approximate computation with multiple configurable accuracy levels via a dedicated mulscr instruction, enabling fine-grained energy accuracy control within a standard processor pipeline. The proposed design achieves 44%-52% and 62%-68% power reduction in exact and approximate modes respectively, while maintaining the computational performance of 1.89 DMIPS/MHz. Evaluations on error-tolerant workloads including 2D convolution and matrix multiplication demonstrate up to 63% reduction in energy consumption, with the proposed design achieving 1.21 pJ/instruction for matrix multiplication, confirming its effectiveness for energy-constrained edge AI deployments.
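
As a purely software-level illustration of the configurable-accuracy idea (this is not how the hardware truncates internally, and the bit widths are arbitrary), one can think of the approximate modes as dropping low-order operand bits before multiplying, trading a bounded error for lower switching activity and hence energy:

```python
def approx_mul(a: int, b: int, trunc_bits: int = 0) -> int:
    """Multiply after zeroing the trunc_bits lowest bits of each operand."""
    mask = ~((1 << trunc_bits) - 1)
    return (a & mask) * (b & mask)

print(approx_mul(173, 219))                  # exact mode: 37887
print(approx_mul(173, 219, trunc_bits=2))    # approximate mode: 172 * 216 = 37152
```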

[AI-287] Reasoning Compression with Mixed-Policy Distillation

【速读】:该论文旨在解决推理导向的大语言模型(Reasoning-centric Large Language Models, LLMs)在实际部署中因生成冗长且低效的中间推理轨迹而导致的高Token消耗与推理延迟问题,尤其是在资源受限场景下,小模型虽具部署优势却往往产生更冗余的推理过程。解决方案的关键在于提出一种混合策略蒸馏(Mixed-Policy Distillation, MPD)框架,其核心机制是通过教师模型对学生模型采样的推理轨迹进行压缩重构,再以KL散度为优化目标引导学生模型学习压缩后的轨迹,从而在保留学生策略探索能力的同时引入教师指导的推理压缩行为,实现高效的小模型推理性能提升。

链接: https://arxiv.org/abs/2605.08776
作者: Han Yang,Mingyan Wu,Bailan He,Zeyu Cao,Sikuan Yan,Kevin Qinghong Lin,Zifeng Ding
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reasoning-centric large language models (LLMs) achieve strong performance by generating intermediate reasoning trajectories, but often incur excessive token usage and high inference-time decoding cost. We observe that, when solving the same problems, larger reasoning models can often produce more concise traces, whereas smaller reasoning models tend to generate longer and more redundant trajectories. This is especially problematic in real-world deployment, where memory, latency, and serving-cost constraints often favor smaller models. Our observations suggest that reasoning compression can be transferred from large models to small ones rather than enforced through explicit length constraints. Based on this insight, we propose Mixed-Policy Distillation (MPD), a reasoning compression framework that transfers concise reasoning behavior from a larger-sized teacher to a smaller student by distilling teacher-compressed student trajectories. Unlike on-policy distillation, which aligns the student with teacher distributions over verbose student trajectories, or off-policy distillation, which relies on teacher-generated trajectories and may suffer from distribution mismatch, MPD combines the strengths of both. Given a student-sampled trajectory, the teacher rewrites it into a more concise reasoning trace, and the student is trained via KL-based alignment on the compressed trajectory. This preserves student-policy exploration while injecting teacher-guided compression. Experiments on Qwen3-1.7B show that MPD reduces token usage by up to 27.1% while improving performance across multiple reasoning benchmarks, demonstrating an effective approach to efficient small-model reasoning.
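
The KL-alignment step on a teacher-compressed trace can be sketched as follows; it assumes logits for the compressed token sequence are available from both models, uses names of my choosing, and is not the authors' training code.

```python
import torch
import torch.nn.functional as F

def mpd_kl_loss(student_logits, teacher_logits, mask):
    """Token-level KL(teacher || student) averaged over the compressed trajectory."""
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    kl = (log_p_t.exp() * (log_p_t - log_p_s)).sum(-1)   # per-token KL divergence
    return (kl * mask).sum() / mask.sum()                 # average over response tokens

s = torch.randn(2, 16, 1000)          # (batch, compressed_len, vocab) student logits
t = torch.randn(2, 16, 1000)          # teacher logits on the same compressed trace
m = torch.ones(2, 16)                 # response-token mask
loss = mpd_kl_loss(s, t, m)
```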

[AI-288] EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems

【速读】:该论文旨在解决现有基于大语言模型(Large Language Model, LLM)的多智能体系统(Multi-Agent System, MAS)在长时程任务中因采用静态协调策略而导致的适应性不足问题。具体而言,传统方法通常在执行前一次性优化或选择工作流并全程复用,难以应对任务过程中子目标、中间证据和信息需求随阶段演化的动态特性。其解决方案的关键在于提出EvoMAS框架,该框架将工作流构建建模为沿单一任务轨迹的元级序贯决策问题:在每个执行阶段,通过规划-评估-更新(Planner-Evaluator-Updater)管道显式构建任务状态,并利用一个经策略梯度训练的Workflow Adapter从固定候选智能体池中实例化特定阶段的分层工作流;该Adapter以稀疏但可验证的任务最终成功作为主要监督信号进行学习,同时对基于评估器的过程奖励进行独立分析,从而实现执行时动态调整智能体协作结构的能力。

链接: https://arxiv.org/abs/2605.08769
作者: Chengdong Xu,Kaiqiang Ke,Ziheng Liu,Jiaqi Wei,Zibo Shao,Weile Guo,Chao Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 22 pages, 8 figures

点击查看摘要

Abstract:Large language model (LLM)-based multi-agent systems have shown strong potential on complex tasks through agent specialization, tool use, and collaborative reasoning. However, most automated multi-agent system design methods still follow a one-shot paradigm: a workflow is optimized or selected before execution and then reused unchanged throughout the task. This static coordination strategy is ill-suited for long-horizon tasks whose subgoals, intermediate evidence, and information needs evolve over multiple execution stages. We propose EvoMAS, a framework for execution-time multi-agent workflow construction. EvoMAS formulates workflow construction as a meta-level sequential decision problem along a single task trajectory. At each stage, it constructs an explicit task state through a Planner-Evaluator-Updater pipeline and uses a learned Workflow Adapter to instantiate a stage-specific layered workflow from a fixed pool of candidate agents. The adapter is trained with policy gradients using sparse, verifiable terminal task success as the main supervision signal, while evaluator-based process reward is analyzed separately under very-hard sparse-reward settings. Experiments on GAIA, HLE, and DeepResearcher show that EvoMAS outperforms single-agent baselines and recent automated multi-agent workflow design methods. Our analyses further show that explicit task-state construction and learned workflow adaptation provide complementary benefits. Additional results indicate that process reward is most useful when terminal success is extremely sparse, and qualitative case studies illustrate that EvoMAS adapts agent coordination as the task state evolves.

[AI-289] From Holo Pockets to Electron Density: GPT-style Drug Design with Density ICML2026

【速读】:该论文旨在解决结构基础药物设计(Structure-Based Drug Design, SBDD)中分子生成模型依赖空口袋(empty binding pockets)作为条件而导致的结构偏差问题,从而限制了生成分子与实际结合环境的匹配度。传统方法忽略了结合位点中填充物(filler,包括配体和溶剂)所携带的物理信息,而这些信息对准确描述结合环境至关重要。解决方案的关键在于利用来自实验或计算获得的低分辨率电子密度(Electron Density, ED)作为物理上合理的条件信号,将分子生成过程锚定在真实的生物大分子结合环境中。具体而言,作者提出EDMolGPT——一种仅使用解码器的自回归框架,直接从低分辨率电子密度点云中生成具有三维构象的新分子,通过引入实验可得的电子密度,自然捕捉蛋白质的构象灵活性并减少结构偏倚,从而提升生成分子的合理性和有效性。

链接: https://arxiv.org/abs/2605.08767
作者: Jiahao Chen,Letian Gao,Yanhao Zhu,Wenbiao Zhou,Bing Su,Zhi John Lu,Bo Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Published as a conference paper in ICML 2026

点击查看摘要

Abstract:Recent advances in generative modeling have enabled significant progress in structure-based drug design (SBDD). Existing methods typically condition molecule generation on empty binding pockets from holo complexes, overlooking informative components such as the filler (ligands and solvent). Here, we leverage low-resolution electron density (ED) derived from the filler as a physically grounded condition for de novo drug design. We consider two types of ED, calculated and cryo-EM/X-ray, obtainable from computational or experimental sources, supporting unified pre-training and experimental integration. Compared with rigid pocket representations, experimental ED naturally captures conformational flexibility and provides a more faithful description of the binding environment. Based on this, we introduce EDMolGPT, a decoder-only autoregressive framework that generates molecules from low-resolution ED point clouds. By grounding generation in physically meaningful density signals, EDMolGPT mitigates structural bias and produces molecules with 3D conformations. Evaluations on 101 biological targets verify the effectiveness. Our project page: this https URL.

[AI-290] Unlearners Can Lie: Evaluating and Improving Honesty in LLM Unlearning ACL2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在执行遗忘(unlearning)过程中出现的不诚实行为问题,即现有方法常导致幻觉、异常token序列生成或对遗忘知识的回答不一致,从而引发安全与可信度风险。为系统评估和提升遗忘过程中的诚实性,作者提出了一种形式化的“遗忘诚实性”(unlearning honesty)定义,强调在保留知识的同时维持模型的诚实性,并确保有效遗忘且模型能主动承认其局限性。解决方案的关键在于引入ReVa——一种基于表示对齐(representation-alignment)的微调机制,通过特征随机化后的模型进行再训练,使其更准确地识别并拒绝回答遗忘内容,在问答(QA)任务中显著提升拒绝率(rejection rate),同时增强对保留知识的诚实性表现。

链接: https://arxiv.org/abs/2605.08765
作者: Renjie Gu,Jiazhen Du,Yihua Zhang,Sijia Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2026

点击查看摘要

Abstract:Unlearning in large language models (LLMs) aims to remove harmful training data while preserving overall utility. However, we find that existing methods often hallucinate, generate abnormal token sequences, or behave inconsistently, raising safety and trust concerns. According to prior literature on LLM honesty, such behaviors are often associated with dishonesty. This motivates us to investigate the notion of honesty in the context of model unlearning. We propose a formal definition of unlearning honesty, which includes: (1) preserving both utility and honesty on retained knowledge, and (2) ensuring effective forgetting while encouraging the model to acknowledge its limitations and respond consistently to questions related to forgotten knowledge. To systematically evaluate the honesty of unlearning, we introduce a suite of metrics that cover utility, honesty on the retained set, effectiveness of forgetting, rejection rate and refusal stability in QA and MCQ settings. Evaluating 9 methods across 3 mainstream families shows that all current methods fail to meet these standards. After experimental and theoretical analyses, we present ReVa, a representation-alignment procedure that fine-tunes feature-randomized unlearned models to better acknowledge forgotten knowledge. On QA tasks from the forget set, ReVa achieves the highest rejection rate after two rounds of interaction, nearly doubling the performance of the second-best method. Remarkably, It also improves honesty on the retained set. We release our data and code at this https URL.

[AI-291] Omni-scale Learning-based Sequential Decision Framework for Order Fulfillment of Tote-handling Robotic Systems

【速读】:该论文旨在解决 tote-handling 机器人系统在订单履行过程中面临的多层级决策协调难题,即如何在订单(order)、料箱(tote)和机器人(robot)之间实现高效、可扩展的顺序决策。现有研究通常针对特定系统设计决策机制,缺乏通用性和迁移能力。解决方案的关键在于提出一种广义且可扩展的序列决策框架——OLSF-TRS,其核心是将结构化组合优化与多智能体强化学习相结合,从而统一建模并协同优化三者的决策过程,在小规模场景下达到近优性能(平均最优性差距低于3.5%),在大规模场景中显著优于启发式基线和当前最优规则方法(减少料箱移动8–12%,超过30%),同时保持实时响应能力,有效提升运营效率与稳定性。

链接: https://arxiv.org/abs/2605.08758
作者: Jiaxin Liu,Peng Yang,Yuping Li,Xinyue Xie
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 35 pages, 5 figures

点击查看摘要

Abstract:Driven by the rapid expansion of e-commerce and small-batch production, the size of the intralogistics load unit of finished goods, semi-finished goods and raw materials is steadily shrinking. Totes are gradually replacing pallets as the primary handling and storage container. This shift has propelled tote-handling robotic systems to the forefront of automation order fulfillment centers. The order-fulfillment decisions of tote-handling robotic systems share a common order-tote-robot sequential decision-making nature. Existing studies primarily focus on decision mechanisms tailored to particular systems, making it difficult to generalize or transfer them to other contexts. We propose an Omni-scale Learning-based Sequential Decision Framework for Order Fulfillment of Tote-handling Robotic Systems (OLSF-TRS), a generalized and scalable sequential decision framework that combines structured combinatorial optimization with multi-agent reinforcement learning to coordinate order, tote, and robot decisions. On small-scale tote-handling robotic systems, OLSF-TRS achieves near-optimal performance with average optimality gaps below 3.5% across two distinct system configurations. In large-scale scenarios, OLSF-TRS consistently outperforms heuristic baselines across two different system types, reducing total tote movements by 8-12% and over 30% compared to SOTA rule-based approaches, while maintaining real-time responsiveness. These improvements translate into tangible operational benefits, including cost reduction, lower energy consumption, and enhanced throughput stability. The proposed framework delivers an efficient and unified order fulfillment decision-making framework for widely deployed tote-handling robotic systems, supporting high-quality order fulfillment in both e-commerce and industrial logistics sectors.

[AI-292] AHD Agent: Agentic Reinforcement Learning for Automatic Heuristic Design

【速读】:该论文旨在解决现有生成式 AI (Generative AI) 辅助启发式设计(Automatic Heuristic Design, AHD)框架中,大语言模型(LLMs)作为被动生成器在固定流程中运行时,难以捕捉状态依赖信息(如特定失败模式)而导致试错效率低下的问题。解决方案的关键在于提出一种工具集成的多轮交互式框架——AHD Agent,其核心创新是赋予 LLM 主动决策能力:模型可根据当前状态判断是否直接生成启发式策略,或调用工具从求解环境中检索目标证据,从而实现更高效、自适应的探索过程。此外,作者还设计了一种基于代理强化学习(agentic reinforcement learning, RL)的训练机制,并通过环境合成流水线优化小规模模型的泛化能力,最终在八个不同领域任务上验证了该方法在性能与评估效率上的优越性。

链接: https://arxiv.org/abs/2605.08756
作者: Haoze Lv,Ning Lu,Ziang Zhou,Shengcai Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 10 pages, 7 figures for main content

点击查看摘要

Abstract:Automatic heuristic design (AHD) has emerged as a promising paradigm for solving NP-hard combinatorial optimization problems (COPs). Recent works show that large language models (LLMs), when integrated into well-designed frameworks (i.e., LLM-AHD), can autonomously discover high-performing heuristics. However, existing LLM-AHD frameworks typically treat LLMs as passive generators within fixed workflows, where the model generates heuristics from manually designed, limited context. Such context may fail to capture state-dependent information (e.g., specific failure modes), leading to inefficient trial-and-error exploration. To overcome these limitations, we propose AHD Agent, a novel tool-integrated, multi-turn framework that empowers LLMs to proactively decide whether to generate heuristics or invoke tools to retrieve targeted evidence from the solving environment. To effectively train such a dynamic decision-making agent, we introduce an agentic reinforcement learning (RL) system, which leverages a novel environment synthesis pipeline to optimize a compact model’s generalizable AHD capabilities. Experiments across eight diverse domains, including four held-out tasks, demonstrate that our 4B-parameter agent matches or surpasses state-of-the-art baselines using much larger models, while requiring significantly fewer evaluations. Model and inference scaling analysis further reveals that AHD Agent offers an effective trajectory toward truly autonomous heuristic design.

[AI-293] Value-Decomposed Reinforcement Learning Framework for Taxiway Routing with Hierarchical Conflict-Aware Observations

【速读】:该论文旨在解决机场滑行道路径规划与场面冲突规避这一耦合的安全关键决策问题,传统规划与优化方法受限于在线计算成本,而强化学习方法则难以有效表征下游交通冲突并平衡多目标。解决方案的关键在于提出一种冲突感知的滑行道路径规划框架(Conflict-aware Taxiway Routing, CaTR),其核心创新包括:构建基于网格的机场场面环境并引入动作掩码机制以提升决策效率;设计分层前瞻交通表示法,编码当前及下游冲突相关的交通状态信息;采用价值分解的强化学习策略,优先处理稀疏但安全至关重要的目标。实验在长沙黄花国际机场的真实场景下进行,验证了CaTR在不同交通密度下均能实现优于主流规划、优化和强化学习基线方法的安全-效率权衡,且具备实际运行时间可行性。

链接: https://arxiv.org/abs/2605.08754
作者: Shizhong Zhou,Haifeng Liu,Zheng Zhang,Shiyu Zhang,Bo Yang,Yi Lin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Taxiway routing and on-surface conflict avoidance are coupled safety-critical decision problems in airport surface operations. Existing planning and optimization methods are often limited by online computational cost, while reinforcement learning methods may struggle to represent downstream traffic conflicts and balance multiple objectives. This paper presents Conflict-aware Taxiway Routing (CaTR), a reinforcement learning framework for real-time multi-aircraft taxiway routing. CaTR constructs a grid-based airport surface environment with action masking, introduces a hierarchical foresight traffic representation to encode current and downstream conflict-related traffic conditions, and adopts a value-decomposed reinforcement learning strategy to prioritize sparse but safety-critical objectives. Experiments are conducted on a realistic environment based on Changsha Huanghua International Airport under multiple traffic density levels. Results show that CaTR achieves better safety–efficiency trade-offs than representative planning, optimization, and reinforcement learning baselines while maintaining practical runtime.
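
摘要提到 CaTR 在网格化场面环境中使用动作掩码(action masking)。以下是与论文实现无关的通用示意(假设性写法),展示在策略 logits 上屏蔽不可行滑行动作的常见做法。

```python
import numpy as np


def masked_policy(logits: np.ndarray, valid_mask: np.ndarray) -> np.ndarray:
    """对不可行动作赋 -inf 后再做 softmax,保证采样只落在可行动作上。"""
    masked = np.where(valid_mask, logits, -np.inf)
    masked = masked - masked.max()          # 数值稳定
    probs = np.exp(masked)
    return probs / probs.sum()


# 示例:5 个离散动作(如直行/左转/右转/等待/回退),其中两个在当前节点不可行
logits = np.array([1.2, 0.3, -0.5, 0.8, 0.1])
valid = np.array([True, True, False, True, False])
print(masked_policy(logits, valid))
```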

[AI-294] Done But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

【速读】:该论文旨在解决标准具身评估中无法独立衡量智能体在回合结束时是否正确承诺任务完成的问题,这一能力被称为终端承诺(terminal commitment)。传统评估将行为上迥异的失败模式——如从未完成任务、完成任务但未停止、以及在缺乏充分证据的情况下报告成功——统一归为同一类基准失败,从而掩盖了系统在任务执行与终止决策上的真实差异。解决方案的关键在于提出VIGIL评估框架,其核心机制是通过严格的协议设计:智能体仅接收第一人称RGB视觉输入、不提供动作-成功反馈信号,并必须以语义形式报告每轮结束状态,且该报告由隐藏世界状态确定性验证。这使得两个独立指标可被计算:世界状态完成度(W)和基准成功度(B),其中B额外要求终端报告正确。该解耦设计使四种结果类别(遗漏执行、达成后漂移、无依据承诺、验证成功)得以区分,从而揭示出即使执行表现相似的模型在终端承诺能力上存在高达19.7个百分点的差异,凸显了终端承诺作为独立可测能力的重要性。

链接: https://arxiv.org/abs/2605.08747
作者: Ying Chen,Rui Jiang,Lihuang Fang,Mingxu Wang,Zhifeng Gu,Lei Yi,Jie Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Standard embodied evaluations do not independently score whether an agent correctly commits to task completion at episode closure, a capacity we call terminal commitment. Behaviorally distinct failures–never completing the task, completing it but failing to stop, and reporting success without sufficient evidence–collapse into the same benchmark failure. We introduce VIGIL, an evaluation framework that makes terminal commitment independently measurable. Under VIGIL’s default protocol, agents observe only egocentric RGB, receive no action-success signals, and must end each episode with a semantic report checked deterministically against hidden world state. This yields two separate scores: world-state completion (W) and benchmark success (B), where B additionally requires a correct terminal report. This decoupling makes four outcome categories distinguishable: missed execution, post-attainment drift, unsupported commitment, and verified success. Across 20 models on 1,000 frozen episodes, systems with comparable W differ by up to 19.7 pp in B: one model converts achieved states into correct reports, while another with near-identical execution drifts past the goal without closing. An action-feedback intervention further tests the separation: execution-oriented signals improve W broadly, yet commitment failures persist in models that do not already ground terminal reports in the achieved state. VIGIL provides a protocol that makes terminal commitment independently visible and scorable.
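
按摘要的定义,W 只看隐藏世界状态是否达成,B 额外要求终端报告正确,两者组合出四种结果类别。以下为该评分逻辑的最小示意(假设性实现,字段命名为示意)。

```python
from dataclasses import dataclass


@dataclass
class Episode:
    world_state_completed: bool   # 隐藏世界状态验证:任务目标是否真实达成
    reported_success: bool        # 智能体终端报告是否声称完成
    report_correct: bool          # 报告内容与隐藏世界状态是否一致


def score(ep: Episode):
    W = 1.0 if ep.world_state_completed else 0.0
    B = 1.0 if (ep.world_state_completed and ep.report_correct) else 0.0
    if not ep.world_state_completed and not ep.reported_success:
        category = "missed execution"          # 未完成、也未声称完成
    elif ep.world_state_completed and not ep.report_correct:
        category = "post-attainment drift"     # 达成后漂移:做到了却未正确收尾
    elif not ep.world_state_completed and ep.reported_success:
        category = "unsupported commitment"    # 无依据承诺:没做到却声称完成
    else:
        category = "verified success"          # 验证成功
    return W, B, category


print(score(Episode(True, False, False)))   # 达成后漂移
print(score(Episode(False, True, False)))   # 无依据承诺
```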

[AI-295] MeshFIM: Local Low-Poly Mesh Editing via Fill-in-the-Middle Autoregressive Generation

【速读】:该论文旨在解决自回归(Autoregressive, AR)模型在从点云生成低多边形网格(low-poly mesh)时存在的效率低下问题:当局部区域质量不满足要求时,AR模型必须重新生成整个网格,导致计算资源浪费并破坏已生成的优质结构。为此,作者提出MeshFIM(Fill-in-the-Middle)框架,其核心在于通过条件化生成策略仅修复目标区域,同时保持周围上下文结构不变。关键创新包括:边界顶点标记(boundary vertex markers)以确保精确贴合、上下文位置嵌入(context positional embeddings)维持拓扑顺序、扩展上下文宽度与增强策略抑制溢出、以及一种低多边形几何编码器(low-poly geometry encoder),其门控减法机制利用参考表面与现有网格的差异聚焦生成注意力于缺失区域。这些设计共同实现了高效、可控且高质量的局部网格修复与编辑。

链接: https://arxiv.org/abs/2605.08744
作者: Dingdong Yang,Jian Liu,Biwen Lei,Haohan Weng,Zhuo Chen,Song Guo,Hao Richard Zhang,Ali Mahdavi Amiri,Chunchao Guo
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Autoregressive (AR) models can generate high-quality low-poly meshes from point clouds, but they still operate in an all-or-nothing manner: when a local region is unsatisfactory, the entire mesh must be regenerated, wasting computation and destroying satisfactory mesh structure elsewhere. We introduce MeshFIM, a Fill-in-the-Middle (FIM) framework that regenerates a target region of a low-poly mesh conditioned on the surrounding context. MeshFIM addresses three mesh-specific challenges: enforcing exact attachment along the exposed boundary, preserving topological order in the context, and suppressing overflow beyond the intended region. It does so with five complementary design choices: boundary vertex markers, context positional embeddings, expanded context width, context augmentation, and a low-poly geometry encoder whose gated subtraction mechanism focuses generation on the missing region by leveraging the difference between the reference surface and the existing mesh. Detailed ablation studies are presented to show the effectiveness of every introduced component. Based on MeshFIM, we demonstrate two applications: interactive brush-based editing and automatic defect repair on low-poly mesh (see Figure 1). Last but not least, experiments show that MeshFIM outperforms a range of baselines in mesh refinement, mesh repair and whole mesh generation plus stitch-back scheme.

[AI-296] Causal Dimensionality of Transformer Representations: Measurement Scaling and Layer Structure ICML2026 NEURIPS

【速读】:该论文旨在解决稀疏自编码器(Sparse Autoencoders, SAEs)的宽度与模型输出因果影响之间关系不明确的问题,尤其是如何量化SAE在Transformer残差流中对下游任务的因果贡献。其核心解决方案是引入因果维度κ(kappa),定义为第L层期望雅可比外积的有效秩,并通过SAE宽度扫描结合归因打补丁(attribution patching)方法进行估计。研究表明,随着SAE宽度从16,384增至1,048,576,表征容量增长15.6倍,而因果容量仅增长4.35倍,形成显著的“表征-因果楔形”(representational-causal wedge);更重要的是,κ具有模型缩放不变性(即在Gemma-2-2B和Gemma-2-9B上相同SAE宽度下获得相同的因果维度N_causal = 328),且随网络深度变化稳定,揭示了κ作为Transformer层内在属性的结构性特征。

链接: https://arxiv.org/abs/2605.08740
作者: Nilesh Sarkar,Dawar Jyoti Deka
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 17 figures, 14 tables (excluding references and appendices). Companion short paper under review at the ICML 2026 Mechanistic Interpretability Workshop. Code: this https URL

点击查看摘要

Abstract:Sparse autoencoders (SAEs) decompose transformer residual streams into interpretable feature dictionaries, yet the relationship between SAE width and causal influence on model output has not been systematically characterised. We introduce causal dimensionality kappa(L, M, T), defined as the effective rank of the expected Jacobian outer product at layer L, and show it can be estimated via the SAE width sweep paired with attribution patching. Across seven SAE widths from 16,384 to 1,048,576 features on Gemma-2-2B layer 12, representational capacity grows 15.6x while causal capacity grows only 4.35x: a robust separation we term the representational-causal wedge. A saturating fit yields kappa-hat approximately 1,990 with kappa-hat / d_model = 0.86 and participation-ratio lower bound kappa_PR approximately 280. Crucially, kappa is invariant to model scaling: Gemma-2-9B and Gemma-2-2B yield identical N_causal = 328 at the same SAE width despite a 3.46x parameter increase (the count is forced to 2% of SAE width by calibration; the substantive empirical claim is shape invariance of the AtP score distribution under matched seq=512 conditions). Across eight network depths kappa is constant while the absolute attribution threshold drops 20x from layer 1 to layer 23. Five controls (architecture invariance, threshold robustness, geometric privilege, synthetic ground-truth recovery, and a four-cell encoder/decoder ablation) pin down what kappa measures and what it does not. Our findings establish kappa as a measurable, model-intrinsic property of transformer layers: sub-linearly recoverable by SAE width, invariant to model scaling, and structured across network depth.
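
摘要将因果维度 κ 定义为某层期望雅可比外积的有效秩,并给出参与比(participation ratio)下界。以下用 numpy 做一个概念性演示(假设性实现,与原文的 SAE 宽度扫描和归因打补丁流程无关)。

```python
import numpy as np


def participation_ratio(M: np.ndarray) -> float:
    """参与比有效秩:(sum λ_i)^2 / sum λ_i^2,对对称半正定矩阵等于 tr(M)^2 / tr(M^2)。"""
    eigvals = np.clip(np.linalg.eigvalsh(M), 0.0, None)
    return float(eigvals.sum() ** 2 / (np.square(eigvals).sum() + 1e-12))


def expected_jacobian_outer(jacobians: np.ndarray) -> np.ndarray:
    """jacobians: (N, out_dim, d_model),返回 E[J^T J],形状 (d_model, d_model)。"""
    return np.einsum("nod,noe->de", jacobians, jacobians) / jacobians.shape[0]


# 玩具示例:雅可比被限制在约 5 维的低秩方向上
rng = np.random.default_rng(0)
basis = rng.normal(size=(5, 64))
J = rng.normal(size=(200, 8, 5)) @ basis              # (N, out_dim, d_model=64)
M = expected_jacobian_outer(J)
print("participation-ratio effective rank ≈", participation_ratio(M))
```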

[AI-297] Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents

【速读】:该论文旨在解决软件工程代理(Software Engineering Agents)在失败后恢复效率低、依赖人工干预且缺乏结构化指导的问题。现有系统虽能记录运行时痕迹或生成反馈,但无法将异构的运行时证据转化为可执行、有边界约束的恢复建议。其解决方案的关键在于提出PROBE框架——一个以失败为锚点的结构化恢复机制,包含三个核心组件:Telemetry Layer(保留细粒度运行时信号)、Diagnosis Layer(融合多源信号生成可信诊断)和Guidance Gate(仅当诊断具备证据支撑、可操作且在代理行为范围内时输出恢复指引)。该设计实现了从原始日志到可执行恢复策略的闭环转化,显著提升了诊断准确率与实际恢复成功率,验证了基于证据的失败锚定恢复方法在真实工程场景下的有效性。

链接: https://arxiv.org/abs/2605.08717
作者: Chenyu Zhao,Shenglin Zhang,Yihang Lin,Wenwei Gu,Zhimin Chen,Yongqian Sun,Dan Pei,Chetan Bansal,Saravan Rajmohan,Minghua Ma
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Software engineering agents are increasingly deployed in evaluable engineering environments, yet post-failure recovery remains costly, manual, and ad hoc. Existing systems expose traces or generate follow-up feedback, but they do not convert heterogeneous runtime evidence into grounded, bounded recovery guidance for a subsequent attempt. We present PROBE, a failure-anchored framework for structured recovery in software engineering agents. PROBE organizes failed-run telemetry into structured evidence, structured diagnosis, and bounded recovery guidance through a Telemetry Layer, a Diagnosis Layer, and a Guidance Gate. The Telemetry Layer preserves fine-grained runtime signals, the Diagnosis Layer fuses cross-signal evidence into grounded diagnoses, and the Guidance Gate produces diagnosis-derived guidance only when it is evidence-grounded, actionable, and within the scope of agent-side behavior. We evaluate PROBE across three settings: repository-level software repair, enterprise workflow recovery, and AIOps service mitigation. On 257 initially unresolved cases, PROBE achieves 65.37% Top-1 diagnosis accuracy and a 21.79% recovery rate, outperforming the strongest non-PROBE baseline by 43.58 and 12.45 percentage points. The results reveal a diagnosis-recovery gap: accurate diagnosis is necessary but insufficient unless translated into bounded guidance that a subsequent attempt can execute and verify. Beyond controlled evaluation, a Microsoft IcM prototype shows that PROBE can attach as a non-intrusive side channel to existing service-diagnosis workflows without changing the agent policy, toolset, or execution budget. These results suggest that telemetry-grounded, failure-anchored recovery can improve post-failure recoverability under realistic engineering constraints.

[AI-298] REAP: Reinforcement-Learning End-to-End Autonomous Parking with Gaussian Splatting Simulator for Real2Sim2Real Transfer

【速读】:该论文旨在解决自动驾驶泊车(Autonomous Parking)在极端场景(如机械式停车位和死胡同车位)中因传统多阶段方法误差累积而导致的失败问题。其核心解决方案是提出一种基于强化学习的端到端泊车方法(Reinforcement learning End-to-end Autonomous Parking, REAP),关键在于:1)采用异构强化学习框架结合Soft Actor-Critic(SAC)算法以提升训练效率与推理性能;2)通过行为克隆(Behavior Cloning)将规则化规划器的能力蒸馏至端到端网络,加速收敛;3)引入软预测碰撞惩罚机制,通过惩罚靠近障碍物的动作降低碰撞率;4)构建Real2Sim2Real仿真流程,利用3D高斯泼溅(3D Gaussian Splatting, 3DGS)实现真实场景数字化,并在物理车辆上部署模型,有效弥合仿真到现实的差距,从而实现在狭窄机械车位等极端场景下的可靠泊车。

链接: https://arxiv.org/abs/2605.08713
作者: Changze Li,Zhe Chen,Shaoyu Chen,Lisen Mu,Yijian Li,Yuelong Yu,Qian Zhang,Qing Su,Ming Yang,Tong Qin
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In recent years, autonomous parking has made significant advances, yet parking tasks still face challenges in extreme scenarios such as mechanical and dead-end parking slots, often resulting in failures. This is mainly due to traditional parking methods adopting a multistage approach, lacking the ability to optimize the parking problem as a whole. End-to-end methods enable joint optimization across perception and planning modules to eliminate the accumulation of errors, enhancing algorithm performance in extreme scenarios. Although several end-to-end parking methods use imitation or reinforcement learning, the former is limited by data cost and distribution coverage, while the latter suffers from inefficient exploration. To address these challenges, we propose a Reinforcement learning End-to-end Autonomous Parking method (REAP). REAP employs Soft Actor-Critic (SAC) within an asymmetric reinforcement learning framework to improve training efficiency and inference performance. To accelerate model convergence, we distill the capabilities of a rule-based planner into the end-to-end network through behavior cloning. We further introduce a soft predictive collision penalty mechanism to reduce collision rates by penalizing obstacle-approaching actions. To ensure that the trained reinforcement learning network can directly transfer to real-world scenarios, we have established a Real2Sim2Real simulator. In the Real2Sim step, we use 3D Gaussian Splatting (3DGS) to transform real-world scenes into digital scenes. In the Sim2Real step, we deploy the end-to-end model onto the vehicle to bridge the Sim2Real gap. Trained in the 3DGS simulator and deployed on physical vehicles, REAP successfully parks in various types of parking spaces, especially demonstrating the feasibility of end-to-end RL parking in extremely narrow mechanical slots.

[AI-299] When Can Human-AI Teams Outperform Individuals? Tight Bounds with Impossibility Guarantees

【速读】:该论文旨在解决人类-人工智能(Human-AI)团队在多数情况下无法超越其最优成员的问题,即缺乏理论指导说明何时能够实现互补性(complementarity)。其解决方案的关键在于通过整合信号检测理论(signal detection theory)与信息论分析,推导出一类基于置信度的聚合规则的紧致边界(tight bounds),并由此得出四个核心结果:包括互补性成立的充要条件(误差相关系数 ρ_HM < ρ* 时团队性能优于个体)、最小最大收益与元认知敏感度差异的平方根成正比、证明当 ρ_HM ≥ ρ* 时任何基于置信度的聚合规则均无法实现互补性,以及多分类情形下阈值随类别数 K 的缩放关系 ρ*_K ≈ ρ*/√(K−1)。该框架不仅解释了互补性为何罕见,还提供了可操作的设计公式,适用于聚合决策场景而非交互式推理生成新答案的情形。

链接: https://arxiv.org/abs/2605.08710
作者: Dongxin Guo,Jikun Wu,Siu-Ming Yiu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 2 figures, 7 tables. Accepted at CogSci 2026

点击查看摘要

Abstract:Human-AI teams fail to outperform their best member in 70% of studies, yet no theory specifies when complementarity is achievable. We derive tight bounds for the broad class of confidence-based aggregation rules by integrating signal detection theory with information-theoretic analysis, yielding four results: (1) a complementarity theorem (teams outperform individuals iff error correlation \rho_{HM} < \rho^* , with \rho^* \approx a in the symmetric near-chance regime); (2) minimax bounds showing gains scale as \Theta(\sqrt{\Delta d}) with the metacognitive sensitivity difference; (3) an impossibility result proving no confidence-based aggregation rule achieves complementarity when \rho_{HM} \geq \rho^* ; and (4) a multi-class generalization \rho^*_K \approx \rho^*/\sqrt{K-1} . Predictions match observed team accuracy (R = 0.94 on ImageNet-16H, R = 0.91 on CIFAR-10H) and the multi-class threshold scaling holds on human data (R = 0.93, K = 16), with robustness under non-Gaussian distributions. The framework explains why complementarity is rare and provides actionable design formulas; results apply to aggregation, not to interactive deliberation that generates novel answers.
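
下面给出互补性判据与多分类阈值缩放的数值示意(假设性实现,阈值数值仅为演示):只检查误差相关 ρ_HM 是否低于 ρ*,并按 ρ*_K ≈ ρ*/√(K−1) 缩放到 K 分类情形。

```python
import math


def complementarity_possible(rho_hm: float, rho_star: float) -> bool:
    """二分类情形:当且仅当误差相关 ρ_HM < ρ* 时,基于置信度的聚合才可能超过最优个体。"""
    return rho_hm < rho_star


def rho_star_multiclass(rho_star: float, num_classes: int) -> float:
    """多分类阈值的近似缩放:ρ*_K ≈ ρ* / sqrt(K - 1)。"""
    assert num_classes >= 2
    return rho_star / math.sqrt(num_classes - 1)


rho_star = 0.45          # 假设的二分类阈值,仅作演示
print(complementarity_possible(0.30, rho_star))          # True:互补性可能成立
print(complementarity_possible(0.50, rho_star))          # False:任何置信度聚合都无法互补
print(rho_star_multiclass(rho_star, num_classes=16))     # K=16 时阈值显著收紧
```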

[AI-300] AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm Optimization

【速读】:该论文旨在解决现有多智能体推理方法在推理过程中易受错误同伴影响、存在偏见共识,且智能体自身推理能力无法随任务演化的问题。解决方案的关键在于提出AgentPSO框架,其将每个智能体视为粒子般的推理者,通过模拟粒子群优化(Particle Swarm Optimization, PSO)机制,在不更新骨干语言模型参数的前提下,利用个体最优、全局最优及同伴推理轨迹的自反思方向对智能体的自然语言技能状态进行迭代更新,从而实现个体与群体推理能力的持续进化。

链接: https://arxiv.org/abs/2605.08704
作者: Hyunmin Hwang,Jaemin Kim,Choonghan Kim,Hangeol Chang,Jong Chul Ye
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-agent reasoning has shown promise for improving the problem-solving ability of large language models by allowing multiple agents to explore diverse reasoning paths. However, most existing multi-agent methods rely on inference-time debate or aggregation, which can be vulnerable to incorrect peer influence and biased consensus. Moreover, the agents themselves remain static, as their underlying reasoning skills do not evolve across tasks. In this paper, we introduce AgentPSO, a particle-swarm-inspired framework for evolving multi-agent reasoning skills. AgentPSO treats each agent as a particle-like reasoner whose state is a natural-language skill and whose velocity is a semantic update direction, iteratively moving agents toward stronger skill states to improve both individual and collective reasoning performance. Across training iterations, each agent updates its skill by combining its previous velocity, personal-best skill, global-best skill, and a self-reflective direction derived from peer reasoning trajectories. This enables agents to learn reusable reasoning behaviors from both their own experiences and the strongest skills discovered by the population, without updating the parameters of the backbone language model. Experiments on mathematical and general reasoning benchmarks show that AgentPSO improves over static single-agent skills and test-time-only multi-agent reasoning baselines. The evolved skills further transfer across benchmarks and to another backbone model, suggesting that AgentPSO captures reusable reasoning procedures rather than merely optimizing benchmark-specific prompts. Code is open-sourced at this https URL.
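
AgentPSO 将技能视为"位置"、语义更新方向视为"速度",组合上一轮速度、个体最优、全局最优与自反思方向。技能本身是自然语言,这里以嵌入向量给出一个结构类比的示意(假设性实现,系数与向量空间均为演示设定,并非论文的真实更新算子)。

```python
import numpy as np


def pso_style_update(x, v, pbest, gbest, reflect,
                     w=0.5, c1=0.3, c2=0.3, c3=0.2, rng=None):
    """类 PSO 的技能状态更新:
    v' = w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x) + c3*reflect
    x' = x + v'
    其中 reflect 表示从同伴推理轨迹得到的自反思方向。"""
    rng = rng or np.random.default_rng()
    r1, r2 = rng.random(), rng.random()
    v_new = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x) + c3 * reflect
    return x + v_new, v_new


# 玩具示例:8 维"技能嵌入"
rng = np.random.default_rng(1)
x, v = rng.normal(size=8), np.zeros(8)
pbest, gbest, reflect = rng.normal(size=8), rng.normal(size=8), rng.normal(size=8)
x_next, v_next = pso_style_update(x, v, pbest, gbest, reflect, rng=rng)
print(np.round(x_next, 3))
```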

[AI-301] MBP-KT: Learning Global Collaborative Information from Meta-Behavioral Pattern for Enhanced Knowledge Tracing

【速读】:该论文旨在解决现有知识追踪(Knowledge Tracing, KT)方法在建模学习者知识状态时,因依赖原始交互序列并采用定制化模块而导致的深层学习行为模式捕捉能力不足及泛化性受限的问题。其解决方案的关键在于提出一种通用的元行为模式感知框架(Meta-behavioral Pattern-aware Framework, MBP-KT),通过构建新颖的元行为序列来将原始交互序列转化为不同元行为模式的组合,从而有效保留学习者的深层行为特征;同时设计无参数模块以提取全局协同表示,并提供通用注入策略将该协同信息引入多种下游KT模型,确保协同信息的普适性和可迁移性。

链接: https://arxiv.org/abs/2605.08697
作者: Yuhao Jia,Duantengchuan Li,Jinsong Chen,Zhongjie Mao,Mingwen Tong,Yue Li,Xiaoguang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The emerging collaborative information-based knowledge tracing (KT) has been a promising way to enhance modeling of learners' knowledge states. The core idea is to extract the collaborative information from interaction sequences of other learners to assist the prediction on the target one. Despite their effectiveness, existing methods are built on the raw interaction sequences with tailored modules, which inevitably limits their capacity in deeply capturing learning behavioral patterns and generalization. To this end, we propose a general meta-behavioral pattern-aware framework (MBP-KT) for KT. Specifically, MBP-KT introduces a novel meta-behavioral sequence construction to transform the raw interaction sequences into the combinations of different meta-behavioral patterns. In this way, the learning behavioral patterns of learners can be effectively preserved. Then, MBP-KT develops a parameter-free module to extract the global collaborative representations from the constructed meta-behavioral sequences. Moreover, MBP-KT provides general injection strategies to introduce the extracted global collaborative information into various downstream KT models, ensuring the universality of the collaborative information. Extensive results on real-world datasets demonstrate that MBP-KT consistently boosts the performance of a wide range of KT models.

[AI-302] SkillMaster: Toward Autonomous Skill Mastery in LLM Agents

【速读】:该论文旨在解决当前大型语言模型(Large Language Model, LLM)智能体在复杂任务中依赖外部教师或规则进行技能创建、优化与选择的问题,从而导致技能仅作为可调用的外部资源而非智能体自主发展和内化的认知能力。为此,作者提出SkillMaster训练框架,其核心在于使LLM智能体具备自主生成、精炼和选择技能的能力。关键设计包括:1)基于轨迹信息的技能审查机制,使智能体能依据已完成任务的经验判断是否提议、更新或保留技能;2)通过反事实效用评估候选技能修改在相关探测任务上的表现,提供直接的学习信号以优化技能编辑决策;3)引入DualAdv-GRPO算法,分别估计任务执行动作与技能编辑决策的优势值,稳定两者联合训练过程。实验表明,SkillMaster在ALFWorld和WebShop基准上分别提升成功率8.8%和9.3%,并展现出从失败中识别问题、利用轨迹证据改进程序性知识及有限技能库调整下迁移学习的能力,标志着LLM智能体从被动使用技能向自我进化能力的跃迁。

链接: https://arxiv.org/abs/2605.08693
作者: Min Yang,Jinghua Piao,Xu Xia,Xiaochong Lan,Jiaju Chen,Yongshun Gong,Yong Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Skills provide an effective mechanism for improving LLM agents on complex tasks, yet in existing agent frameworks, their creation, refinement, and selection are typically governed by external teachers, hand-designed rules, or auxiliary modules. As a result, skills remain external resources to be invoked, rather than capabilities that agents can develop, adapt, and internalize through experience. To endow LLM agents with autonomous skill mastery, we propose SkillMaster, a training framework that teaches agents to create new skills, refine existing skills, and select accumulated skills during task solving. This capability is achieved through three key designs. First, we train agents through trajectory-informed skill review, teaching agents to propose, update, or retain skills based on evidence from completed episodes. Second, each candidate skill edit is designed to be evaluated by its counterfactual utility on related probe tasks, providing a direct learning signal for training skill-editing decisions. Third, we introduce DualAdv-GRPO, which separately estimates advantages for task-solving actions and skill-editing decisions, stabilizing joint training across task solving and skill management. Experiments on ALFWorld and WebShop show that SkillMaster improves the overall success rate over state-of-the-art baselines by 8.8% and 9.3%, respectively, achieving the best performance among all compared methods. Further analysis reveals a marked shift in agent capability: agents trained with SkillMaster can identify skill failures, refine procedural knowledge from trajectory evidence, and transfer improvements to future tasks with limited skill-bank edits. Overall, SkillMaster moves LLM agents beyond mere skill use toward self-improving agents capable of developing, adapting, and applying their own skill repertoires.

[AI-303] Structure-Centric Graph Foundation Model via Geometric Bases ICML2026

【速读】:该论文旨在解决图基础模型(Graph Foundation Models, GFMs)在跨图域迁移学习中面临的两大挑战:图结构异质性(structural heterogeneity)和节点特征空间不兼容性(incompatible node feature spaces)。其核心解决方案是提出结构中心的图基础模型(Structure-Centric Graph Foundation Models, SCGFM),关键在于将图拓扑视为可迁移知识的主要来源,并通过将图建模为度量测度空间(metric measure spaces),引入可学习的几何基底(learnable geometric bases)构建共享的结构坐标系。利用Gromov-Wasserstein距离对齐图结构至该坐标系,从而获得结构对齐的潜在表示,有效处理异构图拓扑;同时设计结构感知的特征重编码机制(structure-aware feature re-encoding),在不依赖固定特征维度或特定数据集预处理的前提下统一节点表示,显著提升模型在图级与节点级任务上的域内及跨域泛化能力。

链接: https://arxiv.org/abs/2605.08689
作者: Xiaodong He,Haolan He,Ruiyi Fang,Ming Sun,Zhao Kang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: Accepted by ICML 2026

点击查看摘要

Abstract:Graph foundation models (GFMs) seek transferable representations across graph domains but are limited by structural heterogeneity and incompatible node feature spaces. We propose Structure-Centric Graph Foundation Models (SCGFM), which treat graph topology as the primary source of transferable knowledge. Modeling graphs as metric measure spaces, SCGFM introduces learnable geometric bases that define a shared structural coordinate system. Graphs are aligned to these bases via Gromov-Wasserstein distances, yielding structure-aligned latent representations that accommodate heterogeneous graph topologies. To address feature incompatibility, SCGFM employs a structure-aware feature re-encoding mechanism that unifies node representations without assuming a fixed feature dimensionality or requiring dataset-specific preprocessing. Experiments on graph- and node-level tasks demonstrate strong in-domain and cross-domain generalization, outperforming existing GFM approaches.

[AI-304] Reconciling Consistency-Based Diagnosis with Actual-Causality-Based Explanations

【速读】:该论文旨在解决解释性人工智能(Explainable AI, XAI)领域中对一致性基础诊断(Consistency-Based Diagnosis, CBD)研究不足的问题,并探索其与实际因果关系(Actual Causality)和因果责任(Causal Responsibility)之间的关联。解决方案的关键在于建立三者之间的理论联系,从而为XAI和可解释数据管理(Explainable Data Management)提供新的分析视角与方法支撑,推动因果推理在诊断解释中的应用。

链接: https://arxiv.org/abs/2605.08688
作者: Leopoldo Bertossi
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB); Logic in Computer Science (cs.LO)
备注: under submission

点击查看摘要

Abstract:We establish, from the point of view of Explainable AI (XAI), connections between Consistency-Based Diagnosis (CBD), on one side, and Actual Causality and Causal Responsibility, on the other. CBD has received little attention from the XAI community. Connections between these two areas could have a fruitful impact on XAI and Explainable Data Management.

[AI-305] PrepBench: How Far Are We from Natural-Language-Driven Data Preparation?

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在自然语言(Natural Language, NL)驱动的数据准备(Data Preparation)场景中仍存在显著能力差距的问题。现有代码生成基准无法准确刻画真实数据准备任务的核心挑战,如用户意图模糊性、现实数据的不完整性以及将代码转化为可解释工作流以供验证的需求。为此,作者提出了PrepBench这一系统化基准,其关键在于从Preppin’ Data Challenges中爬取并扩展数据,构建覆盖多领域、包含3至18步处理流程、单任务代码量达100行以上的高质量评估任务集,并聚焦于三个核心能力:交互式歧义澄清(interactive disambiguation)、数据准备代码生成(prep-code generation)和代码到工作流翻译(code-to-workflow translation)。该基准为量化LLMs在NL驱动数据准备中的性能瓶颈提供了标准化工具,有助于识别实现该范式转变的关键技术障碍。

链接: https://arxiv.org/abs/2605.08687
作者: Jingzhe Xu,Rui Wang,Jiannan Wang,Guoliang Li
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Data preparation is a central and time-consuming stage in data analysis workflows. Traditionally, commercial tools have relied on graphical user interfaces (GUIs) to simplify data preparation, allowing users to define transformations through visual operators and workflows. Recent advances in large language models (LLMs) raise the possibility of a paradigm shift toward natural language (NL)-driven data preparation, in which users can specify preparation intents in NL directly. However, it remains unclear how far current LLM-based agents are from this paradigm shift in practice. Existing code generation benchmarks do not capture key characteristics of data preparation, including ambiguous user intents, imperfect real-world data, and the need to translate code into interpretable workflows for validation. To bridge this gap, we present PrepBench, a benchmark designed to evaluate NL-driven data preparation along three core capabilities: interactive disambiguation, prep-code generation, and code-to-workflow translation. We crawl data from the Preppin’ Data Challenges, and then extend it into a systematically designed benchmark. The benchmark covers diverse domains, and each task involves 3 to 18 data preparation steps. Nearly half of the tasks require over 100 lines of Python code, and the longest solutions approach 300 lines. Our evaluation shows that, despite recent progress, realizing this paradigm shift remains challenging for state-of-the-art LLMs. PrepBench provides a principled benchmark for measuring this gap and helps identify key challenges toward realizing NL-driven data preparation.

[AI-306] Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs

【速读】:该论文旨在解决多智能体大语言模型(Multi-agent Large Language Model, LLM)系统中控制器仅支持一次性路由(one-shot routing)所导致的局限性,即缺乏对中间草稿的批判与迭代优化能力。现有方法在选定一个模型后直接返回其输出,无法实现对生成过程的动态调整和质量提升。解决方案的关键在于提出一种“批判与路由”(critique-and-routing)控制器,将多智能体协作建模为一个带有显式代理使用约束的有限时域马尔可夫决策过程(Finite-horizon Markov Decision Process, MDP),并在每一轮决策中评估当前草稿、决定是否终止或继续,并选择下一个代理进行进一步优化。通过设计复合奖励函数并基于拉格朗日松弛目标使用策略梯度进行优化,该方法实现了高效且高质量的迭代生成,实验表明其在多个异构多智能体系统和七种推理基准上显著优于现有基线,同时调用次数少于总请求量的25%。

链接: https://arxiv.org/abs/2605.08686
作者: Wenzhi Fang,Liangqi Yuan,Guangchen Lan,Dong-Jun Han,Christopher G. Brinton
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-agent large language model (LLM) systems often rely on a controller to coordinate a pool of heterogeneous models, yet existing controllers are typically limited to one-shot routing: they select a model once and return its output directly. Such routing-only designs provide no mechanism to critique intermediate drafts or support iterative refinement. To address this limitation, we propose a critique-and-routing controller that casts multi-agent coordination as a sequential decision problem. At each turn, the controller evaluates the current draft, decides whether to stop or continue, and, if needed, selects the next agent for further refinement. We formulate this process as a finite-horizon Markov Decision Process (MDP) with explicit agent-utilization constraints, design a composite reward for controller decisions across turns, and optimize the controller via policy gradients under a Lagrangian-relaxed objective. Extensive experiments across multiple heterogeneous multi-agent systems and seven reasoning benchmarks show that our method consistently outperforms state-of-the-art baselines and substantially narrows the gap to the strongest agent, while using it for fewer than 25% of total calls.
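
以下为该控制器有限时域决策循环的骨架示意(假设性实现:`controller`、`agents` 等接口为演示自拟,训练所用的策略梯度与拉格朗日约束未包含)。

```python
from typing import Callable, Dict


def critique_and_route(task: str,
                       agents: Dict[str, Callable[[str, str], str]],
                       controller: Callable[[str, str, int], tuple],
                       max_turns: int = 4) -> str:
    """controller(task, draft, turn) -> (stop: bool, next_agent: str)。"""
    draft = ""
    for turn in range(max_turns):
        stop, next_agent = controller(task, draft, turn)
        if stop and draft:
            break
        draft = agents[next_agent](task, draft)   # 让被选中的智能体在现有草稿上继续精炼
    return draft


# 玩具示例:两个"智能体"只是简单的字符串处理器
agents = {
    "strong": lambda task, draft: (draft + " [refined-by-strong]").strip(),
    "cheap":  lambda task, draft: (draft + " [drafted-by-cheap]").strip(),
}


def toy_controller(task, draft, turn):
    if turn == 0:
        return False, "cheap"        # 先用便宜模型起草
    if "[refined-by-strong]" in draft:
        return True, "strong"        # 已由强模型精炼过则停止
    return False, "strong"           # 否则路由给强模型继续精炼


print(critique_and_route("solve the puzzle", agents, toy_controller))
```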

[AI-307] Event Fields: Learning Latent Event Structure for Waveform Foundation Models

【速读】:该论文旨在解决传统基于序列的生理信号建模方法在捕捉临床意义结构方面的局限性,即现有模型将生理时间序列视为局部token或patch的集合,忽略了事件间时序扩展且相互作用的本质特征。其解决方案的关键在于提出一种以事件为中心(event-centric)的波形基础模型,通过假设生理信号是潜在事件过程的实现,引入一种自监督学习框架,在随机分割和时频投影之间强制一致性,从而学习对信号级扰动具有不变性但保留事件层级组织的表示。该方法结合了感知分割的编码器与隐式交互算子,能够有效建模推断事件间的依赖关系,并自然扩展至多模态场景,通过共享事件表示实现模态对齐。

链接: https://arxiv.org/abs/2605.08685
作者: Li Na,Yuanyun Zhang,Shi Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose a new class of waveform foundation models that departs from conventional sequence-based representations by modeling physiological time series as realizations of latent event processes. Rather than treating signals as collections of local tokens or patches, our approach assumes that clinically meaningful structure arises from temporally extended, interacting events whose boundaries and dynamics are not directly observed. To capture this structure, we introduce a self-supervised learning framework that enforces consistency across stochastic segmentations and time-frequency projections of the same waveform, encouraging representations that are invariant to signal-level perturbations while preserving event-level organization. The resulting model combines a segmentation-aware encoder with a latent interaction operator that captures dependencies among inferred events, and naturally extends to multimodal settings by aligning modalities through shared event representations. Across a range of physiological benchmarks, including arrhythmia classification, hemodynamic prediction, and waveform retrieval, the proposed method improves performance, robustness, and label efficiency relative to strong sequence-based baselines. These results suggest that shifting from signal-centric to event-centric representations provides a more appropriate inductive bias for modeling physiological dynamics and offers a complementary path to scaling foundation models in healthcare.

[AI-308] Semantic Voting: Execution-Grounded Consensus for LLM Code Generation

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代码生成流水线中缺乏完整oracle(即正确答案)时,如何有效从多个候选代码中选择最优解的问题。现有方法混合使用文本投票、排序和基于执行结果的一致性判断,但各组件的相对贡献不明确。其解决方案的关键在于:将推理时的代码选择视为一个信号质量而非聚合规则的问题——通过在多样化输入上执行候选代码并利用执行指纹进行聚类(SemanticVote),可显著提升选择准确性;实验表明,执行-based 方法优于传统输出模式多数投票达19–52个百分点,且一旦候选代码在多样输入下被执行,不同聚合策略(如加权投票、MBR-Exec、SemanticVote)效果无统计差异,而输入质量(特别是基于草图的输入生成)成为决定性能的核心因素。

链接: https://arxiv.org/abs/2605.08680
作者: Shan Jiang,Zijian Yi,Chenguang Zhu
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:LLM code-generation pipelines often sample multiple candidates and select one final answer without access to a complete oracle. Existing pipelines mix textual voting, ranking, and execution-based agreement, but the relative contribution of each component remains unclear. We study 18 configurations across different models, thinking levels, and benchmarks, comparing output-pattern majority voting, weighted voting, MBR-Exec, and SemanticVote - a method that clusters candidates by execution fingerprints on LLM-generated inputs. Three findings emerge. (1) The best execution-based selector exceeds output-pattern majority voting by 19-52 percentage points on every configuration, with every execution-based selector exceeding it by at least 18 points. (2) Once candidates are executed on diverse inputs, aggregation rule has limited effect: SemanticVote, weighted voting, and MBR-Exec are statistically indistinguishable across all 18 configurations. The largest factor is input quality: sketch-based input generation consistently outperforms direct LLM generation by 0.6-2.1 pp and random fuzzing by up to 11.3 pp. (3) Thinking level interacts differently with selection families: deeper thinking improves majority voting by 12 pp but execution-based methods stay flat or degrade as candidate diversity falls. These results frame inference-time code selection as a signal-quality problem rather than an aggregation-rule problem: when oracles are unavailable, the behavioral evidence matters more than the aggregation rule.
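
SemanticVote 的核心步骤是:在同一批(LLM 生成的)输入上执行候选程序,以输出序列作为执行指纹聚类,再从最大簇中取代表。以下为去掉沙箱与超时控制的最小示意(假设性实现,真实系统还需隔离执行环境与异常处理)。

```python
from collections import defaultdict
from typing import Callable, List, Sequence


def execution_fingerprint(func: Callable, inputs: Sequence) -> tuple:
    """在同一批输入上运行候选函数,输出元组即为其行为指纹。"""
    outs = []
    for x in inputs:
        try:
            outs.append(repr(func(x)))
        except Exception as e:          # 简化处理:异常类型也计入指纹
            outs.append(f"ERR:{type(e).__name__}")
    return tuple(outs)


def semantic_vote(candidates: List[Callable], inputs: Sequence) -> Callable:
    clusters = defaultdict(list)
    for cand in candidates:
        clusters[execution_fingerprint(cand, inputs)].append(cand)
    largest = max(clusters.values(), key=len)
    return largest[0]                   # 返回最大行为等价簇中的一个代表


# 玩具示例:三个候选实现,其中两个行为一致
cands = [lambda x: x * 2, lambda x: x + x, lambda x: x ** 2]
chosen = semantic_vote(cands, inputs=[1, 2, 3, 5])
print(chosen(10))   # 20:多数行为等价的实现胜出
```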

[AI-309] Attention-based graph neural networks: a survey

【速读】:该论文旨在解决当前关于注意力机制在图神经网络(Graph Neural Networks, GNNs)中应用的研究缺乏系统性综述的问题。随着注意力机制在自然语言处理和计算机视觉等领域的成功引入,其在GNN中的作用日益凸显,但相关研究进展迅速且分散,尚未形成统一的分类与总结框架。解决方案的关键在于提出一个新颖的两级分类体系:上层从发展历史角度将注意力机制驱动的GNN分为三个阶段——图循环注意力网络(graph recurrent attention networks)、图注意力网络(graph attention networks)和图Transformer(graph transformers);下层则聚焦于每个阶段的典型架构,并详细回顾各类方法、分析其优劣,同时提供模型特征对比表以支持更全面的比较。这一结构化梳理为研究人员提供了清晰的技术演进脉络与实用参考。

链接: https://arxiv.org/abs/2605.08679
作者: Chengcheng Sun,Chenhao Li,Xiang Lin,Tianji Zheng,Fanrong Meng,Xiaobin Rui,Zhixiao Wang
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This is the accepted manuscript of an article published in Artificial Intelligence Review. The final version is available online at: [ https://doi.org/10.1007/s10462-023-10577-2 ]( this https URL )

点击查看摘要

Abstract:Graph neural networks (GNNs) aim to learn well-trained representations in a lower-dimension space for downstream tasks while preserving the topological structures. In recent years, attention mechanism, which is brilliant in the fields of natural language processing and computer vision, is introduced to GNNs to adaptively select the discriminative features and automatically filter the noisy information. To the best of our knowledge, due to the fast-paced advances in this domain, a systematic overview of attention-based GNNs is still missing. To fill this gap, this paper aims to provide a comprehensive survey on recent advances in attention-based GNNs. Firstly, we propose a novel two-level taxonomy for attention-based GNNs from the perspective of development history and architectural perspectives. Specifically, the upper level reveals the three developmental stages of attention-based GNNs, including graph recurrent attention networks, graph attention networks, and graph transformers. The lower level focuses on various typical architectures of each stage. Secondly, we review these attention-based methods following the proposed taxonomy in detail and summarize the advantages and disadvantages of various models. A model characteristics table is also provided for a more comprehensive comparison. Thirdly, we share our thoughts on some open issues and future directions of attention-based GNNs. We hope this survey will provide researchers with an up-to-date reference regarding applications of attention-based GNNs. In addition, to cope with the rapid development in this field, we intend to share the relevant latest papers as an open resource at this https URL.

[AI-310] Sketch-and-Verify: Structured Inference-Time Scaling via Program Sketching

【速读】:该论文旨在解决在资源受限场景下(如延迟、部署或预算限制),如何高效利用少量额外的测试时计算资源来提升小型代码模型(如Gemini 3.1 Flash Lite)的生成质量的问题。其核心挑战在于:如何在不升级模型层级的前提下,通过优化采样策略实现成本与性能之间的帕累托改进。解决方案的关键是提出SKETCHVERIFY方法,该方法将搜索空间分解为两个维度——枚举K个不同的算法策略,为每个策略写出一个含??占位符的程序草图,并对每个草图补全M次,从而产生K×M个结构多样化的候选解;随后通过执行验证和指纹聚类选择最优解。相比传统的平坦采样(flat sampling),SKETCHVERIFY确保每次新增草图探索不同算法路径,显著提升硬例恢复率,且在相同候选数量下优于平坦采样,在更低成本下实现更高精度。

链接: https://arxiv.org/abs/2605.08658
作者: Shan Jiang,Zijian Yi,Chenguang Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:SKETCHVERIFY is a within-tier cost-performance policy, not a universal accuracy improvement. The operational question: a practitioner stuck with a small, cheap code model (here, Gemini 3.1 Flash Lite) for latency, deployment, or budget reasons – how should they spend a small amount of extra test-time compute? SKETCHVERIFY factorizes the search space: the LLM enumerates K distinct algorithmic strategies, writes a program sketch for each (a partial program with ?? holes), and fills each sketch M times, producing K x M structurally diverse candidates that are verified by execution and selected by fingerprint clustering. Each extra sketch is guaranteed to explore a different algorithm; each extra flat sample likely duplicates an existing one. Our central evidence is a cost-quality Pareto plot on HumanEval+ across three Gemini tiers (Lite, Flash, Pro), and a reanalysis of the 19 problems where Lite greedy fails. Two findings: (1) Within-tier, sketching dominates flat sampling at matched candidate count. On the hard subset, Lite Sketch K=2, M=5 recovers 11/19 (58%) vs. flat N=10 at 5/19 (26%, +32pp); Lite Sketch K=10, M=10 recovers 15/19 (79%) vs. flat N=100 at 10/19 (53%, +26pp). Flat cannot close the gap even at ~3x the budget: flat N=50 still loses to Sketch K=2, M=5 by +11pp. (2) Cross-tier, sketching does not replace upgrading. Pro greedy (89%) dominates Lite Sketch K=10, M=10 (79%) on both pass@1 and dollar cost. Practitioner rule: if a stronger tier is available, use greedy on it; otherwise sketching is the cost-effective way to spend extra compute. We characterize the K-vs-M trade-off via a Flash Lite scaling sweep, report HumanEval+ saturation on Flash and Pro, and show the method composes cleanly with execution-based selection from the concurrent Semantic Voting line of work.
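
以下用伪接口演示 K 个策略草图 × 每个草图 M 次补全的分解式采样结构(假设性实现:`propose_sketches` 与 `fill_sketch` 代表两类 LLM 调用,为演示自拟)。

```python
from typing import Callable, List


def sketch_and_sample(problem: str,
                      propose_sketches: Callable[[str, int], List[str]],
                      fill_sketch: Callable[[str, str], str],
                      K: int = 2, M: int = 5) -> List[str]:
    """先枚举 K 个互不相同的算法草图(含 ?? 空洞),再对每个草图补全 M 次,
    得到 K*M 个结构上多样的候选程序;后续可交给执行验证 + 指纹聚类选择。"""
    candidates = []
    for sketch in propose_sketches(problem, K):
        for _ in range(M):
            candidates.append(fill_sketch(problem, sketch))
    return candidates


# 玩具示例:用字符串模板代替真实的 LLM 调用
propose = lambda p, k: [f"# strategy {i}: ??" for i in range(k)]
fill = lambda p, s: s.replace("??", "return sorted(xs)")
print(len(sketch_and_sample("sort a list", propose, fill, K=2, M=5)))   # 10 个候选
```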

[AI-311] Fitting Multilinear Polynomials for Logic Gate Networks

【速读】:该论文旨在解决可学习逻辑门网络(learnable logic gate networks)在深度结构中因梯度消失导致训练失效的问题。其核心挑战在于:虽然2输入布尔门可通过4维多项式空间表示,形成16个原型组成的码本(codebook),但传统方法Soft-Mix通过softmax选择门类型时,由于码本秩仅为4,导致15个单纯形方向中有11个梯度落在零空间,使得交互系数(interaction coefficient)在使用Straight-Through Estimator (STE)时严重匮乏。解决方案的关键在于引入协方差雅可比(covariance Jacobian, CovJac)机制,该机制通过将梯度匮乏的交互系数与始终活跃的常数通道耦合,从而绕过STE带来的梯度稀疏问题,并实现更稳定的深层网络训练:在7个数据集上均有4参数方法持平或优于Soft-Mix,且CovJac相对STE的优势随交互需求单调增长;在高深度下表现显著稳定(如CIFAR-10在12层时,Soft-Mix性能下降37.3个百分点,而CovJac仅下降0.5个百分点)。

链接: https://arxiv.org/abs/2605.08657
作者: Youngsung Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We study learnable logic gate networks that stack layers of 2-input Boolean gates to build combinational circuits. Every 2-input gate has a unique multilinear polynomial with 4 coefficients, so the 16 Boolean gates form a codebook of prototypes in a 4-dimensional space, reducing training to a vector-quantization problem. The baseline method, Soft-Mix, learns a 16-dimensional softmax over gate identities, but the codebook has rank 4: 11 of 15 simplex directions carry nullspace gradient, and at uniform initialization the backward signal vanishes exactly. We prove that no affine product reparameterization fixes the resulting interaction-coefficient starvation under STE, and show that the covariance Jacobian of soft-VQ selection bypasses it by coupling the starved coefficient to the always-active constant channel. Working in the 4-dimensional polynomial space reduces each neuron from 16 to 4 parameters. On seven datasets, at least one 4-parameter method matches or exceeds Soft-Mix on every dataset; the CovJac advantage over STE grows monotonically with interaction demand across all seven datasets. At depth, Soft-Mix collapses (-37.3 pp on CIFAR-10 at 12 layers) while CovJac holds (-0.5 pp on CIFAR-10, stable on MNIST).
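
每个 2 输入布尔门唯一对应一个 4 系数的多线性多项式 g(x, y) = c0 + c1·x + c2·y + c3·xy。以下用 numpy 求出几个常见门的系数,并验证 16 个门构成的码本秩为 4(即摘要中"codebook has rank 4"的来源);代码与论文实现无关,仅作数值验证。

```python
import numpy as np
from itertools import product

# 在 (x, y) ∈ {0,1}^2 上求值的多线性基 [1, x, y, xy]
pts = np.array(list(product([0, 1], repeat=2)))            # (0,0),(0,1),(1,0),(1,1)
basis = np.stack([np.ones(4), pts[:, 0], pts[:, 1], pts[:, 0] * pts[:, 1]], axis=1)


def gate_coeffs(truth_table):
    """由真值表求唯一的多线性系数 (c0, c1, c2, c3)。"""
    return np.linalg.solve(basis, np.asarray(truth_table, dtype=float))


print("AND :", gate_coeffs([0, 0, 0, 1]))    # [0, 0, 0, 1]   -> xy
print("OR  :", gate_coeffs([0, 1, 1, 1]))    # [0, 1, 1, -1]  -> x + y - xy
print("XOR :", gate_coeffs([0, 1, 1, 0]))    # [0, 1, 1, -2]  -> x + y - 2xy

# 全部 16 个门的系数构成的码本,其秩恰为 4
codebook = np.stack([gate_coeffs(tt) for tt in product([0, 1], repeat=4)])
print("codebook rank:", np.linalg.matrix_rank(codebook))     # 4
```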

[AI-312] C2L-Net: A Data-Driven Model for State-of-Charge Estimation of Lithium-Ion Batteries During Discharge

【速读】:该论文旨在解决锂离子电池在电池管理系统(BMS)中状态估计(SOC)的准确性与计算效率之间的矛盾问题,特别是现有数据驱动方法依赖长历史输入序列导致高计算成本及驱动周期初期因填充引入的位置偏差。解决方案的关键在于提出一种新颖的“上下文到最新测量”(C2L-Net)框架,其核心创新是显式分离上下文编码与最新测量更新机制:通过分块特征提取机制结合Theta注意力池化和基于傅里叶的季节性基函数,压缩序列长度并捕捉局部时序模式;采用因果上下文编码器(融合门控循环单元GRU与因果余弦注意力)建模时序依赖而不泄露未来信息;并设计一个受递归滤波启发的最新测量解码器,利用最近观测动态更新上下文状态,从而实现高效且快速响应动态工况的在线SOC估计。

链接: https://arxiv.org/abs/2605.08653
作者: Khoa Tran,T. Nguyen-Thoi,Vin Nguyen-Thai,Duong Tran Anh,Hung-Cuong Trinh,Tri Le
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate state-of-charge (SOC) estimation is critical for the safe and efficient operation of lithium-ion batteries in battery management systems (BMS). Although data-driven approaches can effectively capture nonlinear battery dynamics, many existing methods rely on long historical input sequences, resulting in high computational cost and introducing padding-induced positional bias at the beginning of drive cycles. To address these limitations, we propose C2L-Net, a novel context-to-latest data-driven framework for realistic online SOC estimation using only a short historical window (20 s). Unlike existing short-receptive-field or long-history models, the proposed framework explicitly separates contextual encoding from latest-measurement updating, enabling both efficient temporal modeling and rapid adaptation to dynamic battery states. The proposed model incorporates a chunk-based feature extraction mechanism that combines Theta Attention Pooling with a Fourier-based Seasonality Basis to capture local temporal patterns while reducing sequence length. A causal context encoder, integrating a gated recurrent unit (GRU) with Causal Cosine Attention, models temporal dependencies without information leakage. Furthermore, a latest-measurement decoder, inspired by recursive filtering, updates the contextual state using the most recent measurement, enhancing responsiveness to dynamic operating conditions. Extensive experiments on a public lithium-ion battery drive-cycle dataset under multiple fixed-temperature conditions demonstrate that the proposed method achieves state-of-the-art or competitive accuracy while significantly improving computational efficiency. In particular, C2L-Net achieves up to 60 times faster inference and requires fewer parameters than recent data-driven baselines, while maintaining robust performance across unseen driving profiles.

[AI-313] Geometry Guided Self-Consistency for Physical AI

【速读】:该论文旨在解决基于扩散模型(diffusion-based)动作生成在推理阶段因固有随机性导致的脆弱性问题,即单次采样产生的动作片段(action chunk)容易失效,且这种脆弱性在多轮序列决策中会累积放大,从而降低任务成功率。解决方案的关键在于提出一种无需额外训练的推理时自一致性方法 KeyStone:它在共享模型上下文中并行生成 K 个候选动作片段,在连续动作空间中通过聚类识别最一致的群体,并返回最大簇的中位数(medoid)作为最终输出。该方法利用动作轨迹的几何结构特性——欧氏距离直接反映物理相似性,使得选择过程无需依赖预训练判别器(judge-free),同时由于动作轨迹紧凑、内存带宽受限的特性,可实现无额外延迟的并行推理,显著提升任务成功率(最高达13.3%)且保持与模型驱动选择器相当的精度。

链接: https://arxiv.org/abs/2605.08638
作者: Yinwei Dai,Zhuofu Chen,Lijie Yang,Ravi Netravali
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:State-of-the-art physical AI models generate a chunk of actions per inference through diffusion or flow matching, iteratively refining an initial noise sample into an action trajectory. Because this inference process is inherently stochastic, committing to a single trajectory per round is brittle, and this brittleness compounds across the many sequential rounds that comprise a complete episode. We introduce KeyStone, an inference-time self-consistency method for diffusion-based action generation that draws K candidate action chunks in parallel from a shared model context, clusters them in continuous action space, and returns the medoid of the largest cluster – no additional model required. Two properties make this practical. First, the compact nature of action trajectories makes diffusion inference memory-bandwidth bound, leaving spare compute capacity to run K chains in parallel with no additional wall-clock latency. Second, unlike token or pixel spaces where distance carries no semantic meaning and selection requires a learned judge, action chunks are geometrically structured such that Euclidean distance directly reflects physical similarity, making selection principled and judge-free. Across diverse vision-language-action models (VLAs) and world-action models (WAMs), KeyStone improves task success rates by up to 13.3% over single-trajectory sampling with negligible latency overhead, while having on par accuracy with model-based selectors at no training cost. We open source KeyStone at this https URL.
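
KeyStone 的选择步骤可概括为:展平候选动作片段、按欧氏距离聚类、返回最大簇的 medoid。以下为一个基于距离阈值的简化聚类示意(假设性实现,论文未必采用此聚类算法,半径参数仅为演示)。

```python
import numpy as np


def keystone_select(chunks: np.ndarray, radius: float = 0.5) -> np.ndarray:
    """chunks: (K, horizon, act_dim) 的候选动作片段。
    以展平后的欧氏距离做简单的半径聚类,返回最大簇的 medoid。"""
    flat = chunks.reshape(len(chunks), -1)
    dists = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)   # (K, K)
    neighbors = dists < radius                        # 每条轨迹的近邻集合(含自身)
    largest = neighbors[neighbors.sum(axis=1).argmax()]                    # 最大"簇"的成员
    members = np.flatnonzero(largest)
    sub = dists[np.ix_(members, members)]
    medoid = members[sub.sum(axis=1).argmin()]        # 到簇内其他成员距离和最小者
    return chunks[medoid]


# 玩具示例:8 条候选中 6 条彼此接近,2 条偏离
rng = np.random.default_rng(0)
base = rng.normal(size=(1, 16, 7))
chunks = np.concatenate([base + 0.01 * rng.normal(size=(6, 16, 7)),
                         base + 2.0 * rng.normal(size=(2, 16, 7))])
print(keystone_select(chunks, radius=1.0).shape)   # (16, 7)
```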

[AI-314] Reasoning-Aware Training for Time Series Forecasting

【速读】:该论文旨在解决时间序列基础模型(Time Series Foundation Models, TSFMs)在数值预测中缺乏定性推理能力的问题,以及直接将大语言模型(Large Language Models, LLMs)应用于时序数据时因模态差异导致的数学关系破坏和序列长度爆炸问题。其解决方案的关键在于提出STRIDE框架——通过将LLM的推理过程以轻量级形式蒸馏为连续嵌入,并动态将其均值池化后的隐藏状态作为跨模态先验注入TSFM的数值编码器,从而实现语义推理与数值预测的深度融合。该方法在保持数值精度的同时引入可解释的推理机制,在GIFT-Eval和TFRBench等基准上显著优于现有TSFMs。

链接: https://arxiv.org/abs/2605.08625
作者: Md Atik Ahamed,Mihir Parmar,Palash Goyal,Chun-Liang Li,Qiang Cheng,Tomas Pfister,Jinsung Yoon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Time Series Foundation Models (TSFMs) excel at numerical forecasting but operate as black boxes lacking qualitative reasoning. Conversely, applying LLMs directly to temporal data introduces a modality gap: text tokenizers fragment continuous numerical values, degrading mathematical relationships and exploding sequence lengths, leading to computational overhead. To resolve this, we introduce STRIDE (Strategic Time-series Reasoning Injected via Distilled Embeddings), a novel framework natively integrating LLM reasoning into the continuous embedding space of TSFMs. Instead of discrete tokens, STRIDE distills reasoning traces into a lightweight LLM, dynamically projecting its mean-pooled hidden states as a cross-modal prior into the target numerical encoder. The architecture is jointly optimized using cross-entropy and quantile losses. Evaluations demonstrate STRIDE establishes state-of-the-art numerical forecasting on GIFT-Eval (0.674 MASE, 0.454 CRPS) compared to TSFMs and exhibits superior in-domain and out-of-domain numerical as well as reasoning performance on TFRBench. Specifically, STRIDE acts as a plug-and-play enhancement, consistently improving diverse TSFMs (e.g., Chronos-2, Timer-S1) across various LLM configurations. Thus, injecting semantic reasoning as a continuous prior equips TSFMs with human-interpretable reasoning while fundamentally improving predictive accuracy.
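
以下只演示 STRIDE 中两个可以脱离原模型复现的要素:均值池化推理先验的注入(此处以前缀拼接为例,属本文假设)与标准的分位数(pinball)损失;张量维度与接口均为演示自拟。

```python
import numpy as np


def inject_reasoning_prior(numeric_tokens, llm_hidden, W_proj):
    """numeric_tokens: (T, d_ts);llm_hidden: (L, d_llm);W_proj: (d_llm, d_ts)。
    将推理轨迹隐藏态均值池化并投影,作为额外的前缀 token 注入数值编码器(注入方式为假设)。"""
    prior = llm_hidden.mean(axis=0) @ W_proj            # (d_ts,)
    return np.vstack([prior[None, :], numeric_tokens])  # (T+1, d_ts)


def quantile_loss(y_true, y_pred, quantiles):
    """y_pred: (H, Q) 各分位数预测;quantiles: (Q,)。标准 pinball loss。"""
    diff = y_true[:, None] - y_pred                      # (H, Q)
    return np.mean(np.maximum(quantiles * diff, (quantiles - 1) * diff))


rng = np.random.default_rng(0)
tokens = inject_reasoning_prior(rng.normal(size=(32, 16)),
                                rng.normal(size=(10, 64)),
                                rng.normal(size=(64, 16)))
print(tokens.shape)                                      # (33, 16)
print(quantile_loss(rng.normal(size=12), rng.normal(size=(12, 3)),
                    np.array([0.1, 0.5, 0.9])))
```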

[AI-315] DiagnosticIQ: A Benchmark for LLM -Based Industrial Maintenance Action Recommendation from Symbolic Rules

【速读】:该论文旨在解决工业资产监测中从规则到维护动作的决策支持瓶颈问题,即如何将工程师编写的符号化规则(symbolic rules)有效转化为具体的维护步骤,而这一过程通常需要依赖多年实践经验积累的资产专有知识。解决方案的关键在于构建名为 DiagnosticIQ 的基准测试集,包含6,690个专家验证的多选题(multiple-choice questions, MCQA),源自118组规则-动作对和16类工业资产;并提出一种符号到MCQA的标准化处理流程,将规则转化为析取范式(Disjunctive Normal Form),结合基于嵌入的干扰项采样机制生成高质量干扰选项,从而系统性评估大语言模型(LLMs)在复杂工业场景下的推理与泛化能力。

链接: https://arxiv.org/abs/2605.08614
作者: Devin Yasith De Silva,Dhaval Patel,Christodoulos Constantinides,Shuxin Lin,Nianjun Zhou,Paul J Adams,Sal Rosato,Nicolas Constantinides,Deborah L. McGuinness,Jayant Kalagnanam
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 43 pages, 25 figures

点击查看摘要

Abstract:Monitoring complex industrial assets relies on engineer-authored symbolic rules that trigger based on sensor conditions and prompt technicians to perform corrective actions. The bottleneck is not detection but response: translating rules into maintenance steps requires asset-specific knowledge gained through years of practice. We investigate whether LLMs can serve as decision support for this rule-to-action step and introduce DiagnosticIQ, a benchmark of 6,690 expert-validated multiple-choice questions from 118 rule-action pairs across 16 asset types. We contribute (i) a symbolic-to-MCQA pipeline normalizing rules to Disjunctive Normal Form with embedding-based distractor sampling, (ii) five variants probing distinct failure modes (Pro, Pert, Verbose, Aug, Rationale), and (iii) a benchmark of 29 LLMs and 4 embedding baselines. A human evaluation (9 practitioners, mean 45.0%) confirms DiagnosticIQ requires specialist knowledge beyond operational experience. Three findings stand out. The frontier has closed: the top three LLMs lie within one Macro point, with Bradley-Terry Elo placing claude-opus-4-6 30 points above the next model. Yet DiagnosticIQ-Pro exposes brittleness, with every model losing 13–60% relative accuracy under distractor expansion. DiagnosticIQ-Aug exposes pattern-matching: under condition inversion, frontier models still select the original answer 49–63% of the time. The deployment bottleneck is not capability but calibration: frontier models handle template-style fault detection but break under structural perturbation.

[AI-316] The Echo Amplifies the Knowledge: Somatic Marker Analogues in Language Models via Emotion Vector Re-Injection

【速读】:该论文旨在解决当前语言模型仅能存储事件语义信息(semantic memory)而缺乏情感体验记忆(episodic memory)的问题,从而无法模拟人类基于情绪标记的决策机制。其核心解决方案在于:利用Gemma 3 1B-IT模型与稀疏自编码器提取出310个具有心理有效几何结构的情绪专属特征(emotion-exclusive features),并在推理过程中构建“情绪回响”(emotion echo)向量——在体验阶段生成并保存这些特征,在回忆阶段通过上下文相似性触发部分重注入。实验表明,单独引入情绪回响可显著增强模型对威胁-安全梯度的感知(回归斜率从0.56升至0.80),但仅当结合语义知识时才能提升决策质量(BC条件达80%正确选择,显著优于仅语义标签的52%),精准复现了Damasio关于情绪标记放大知识、而非替代知识的核心发现。

链接: https://arxiv.org/abs/2605.08611
作者: Jared Glover
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current language model memory systems store what happened but not how it felt. This distinction – between semantic memory (knowing about a past event) and episodic memory (re-experiencing it) – was identified by Tulving as the difference between noetic and autonoetic consciousness. Damasio demonstrated that humans with intact knowledge but absent emotional markers exhibit impaired decision-making. We bridge this gap for language models. Using Gemma 3 1B-IT with pretrained Gemma Scope 2 sparse autoencoders, we identify 310 emotion-exclusive features at layer 22 with psychologically valid geometry. We construct distinctive-feature emotion vectors during experience and partially re-inject them during recall, triggered by context similarity at layer 7. We test four conditions paralleling Damasio's framework: A (no memory), B (semantic labels), C (emotion echo), and BC (semantic + echo). For emotional orientation, the echo alone steepens the threat-safety gradient: the regression slope of threat rating on contextual similarity is 0.80 for C vs 0.56 for A (p = 0.011, permutation test). For decisions, the echo amplifies knowledge into action: BC=80% good choices vs B=52% (z = +2.60, p < 0.01), while the echo alone has no effect (C=22%, n.s.). The echo changes how the model feels independently, but changes what it does only when combined with knowledge – replicating Damasio's core finding. The echo amplifies knowledge. It does not replace it.

[AI-317] Lattice Deduction Transformers

【速读】:该论文旨在解决小规模可解释推理模型在复杂逻辑谜题(如Sudoku-Extreme、Snowflake Sudoku和Maze-Hard)上训练效率低且难以保证推理正确性的问题。解决方案的关键在于提出一种称为Lattice Deduction Transformer (LDT) 的循环Transformer架构,其通过在前向传播之间对潜在状态进行格(lattice)投影来近似逻辑上合理的演绎过程;同时采用策略梯度的在线训练方式模拟基于搜索的约束求解器的推理流程,并利用领域无关的抽象解释(abstract interpretation)方法监督训练,从而实现高精度与推理可靠性——模型仅输出正确答案或选择不回答(abstain),避免错误推理。

链接: https://arxiv.org/abs/2605.08605
作者: Liam Davis,Leopold Haller,Alberto Alfarano,Mark Santolucito
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:We introduce the Lattice Deduction Transformer (LDT), a recurrent transformer that approximates logically sound deduction by projecting its latent state through a lattice between forward passes. We train on-policy in a process that mirrors deduction in a search-based constraint solver and supervise training via a domain-agnostic, abstract-interpretation-based approximation of the set of solution candidates. An 800K-parameter LDT achieves 100% accuracy on Sudoku-Extreme and Snowflake Sudoku, at a fraction of the training cost of prior small recurrent reasoners, while remaining empirically sound: the model returns a correct answer or abstains. A 1.8M-parameter variant reaches 99.9% accuracy on Maze-Hard. Frontier LLMs score 0% on all three benchmarks.

[AI-318] What Will Happen Next: Large Models-Driven Deduction for Emergency Instances

【速读】:该论文旨在解决传统应急事件模拟方法因缺乏随机性和多样性,难以充分探索潜在风险的问题,尤其在应急实例稀缺的情况下,现有系统无法有效支持风险评估与决策。其解决方案的关键在于提出一种基于大模型(Large Models, LMs)驱动的世界线分歧系统(World Line Divergence System, WLDS),该系统通过引入可控随机性、事实校准(factual calibration)与逻辑校准(logical calibration)机制,在保证推理真实性与逻辑严谨性的前提下,实现多方向的应急事件推演;同时结合交互式模块以避免幻觉,并通过文本与图像融合的可视化模块提升可解释性,从而显著增强模拟的精度、保真度及对实际决策的支持能力。

链接: https://arxiv.org/abs/2605.08599
作者: Zhengqing Hu,Dong Chen,Junkun Yuan,Liang Liu,Hua Wang,Zhao Jin,Yingchaojie Feng,Wei Chen,Mingliang Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Traditional simulation methods reproduce occurred emergency instances through presetting to assist people in risk assessment and emergency decision-making. However, due to the lack of randomness and diversity, existing simulation systems struggle to fully explore the potential risk as emergency instances are scarce. In contrast, Large Models (LMs) can dynamically adjust generation strategies to introduce controllable randomness, while also possessing extensive prior knowledge and cross-domain knowledge transfer capabilities. Inspired by it, we propose the LMs-driven World Line Divergence System (WLDS), which enables diversified visualization and deduction of emergency instances in different domains. WLDS leverages LMs to deduce emergency instances in various development directions, and introduces the factual calibration and logical calibration mechanism to ensure factual accuracy and logical rigor during the deduction process. The interactive module can independently select deduction directions to avoid potential hallucinations that are difficult for the system to identify. Furthermore, by introducing the visualization module, WLDS forms simulation and deduction that combine text and images, which enhances interpretability. Extensive experiments conducted on the proposed Emergency Instances Deduction (EID) benchmark dataset demonstrate that WLDS achieves high-precision and high-fidelity simulation and deduction of emergency instances in multiple specific domains. Relevant experiments further demonstrate that WLDS can generate more emergency instances deduction data for users and provide support for better decision-making in similar emergency instances in the future.

[AI-319] Kaczmarz Linear Attention

【速读】:该论文旨在解决长序列建模中Transformer模型因注意力机制的二次计算复杂度而导致的扩展瓶颈问题,特别是在线性递归模型中如何优化状态维护以提升性能。其解决方案的关键在于重新审视Gated DeltaNet(GDN)所依赖的在线回归目标,并借鉴Kaczmarz投影方法,推导出一种键范数归一化的动态步长系数 $\beta_t = \eta_t / (\|k_t\|_2^2 + \epsilon)$ 用于残差更新;在此基础上提出Kaczmarz Linear Attention(KLA),仅通过一个标量参数修改GDN,在保持原有状态结构、门控机制、线性递归特性及分块并行算法的前提下,显著降低了验证困惑度,并提升了长序列外推能力与解码效率。

链接: https://arxiv.org/abs/2605.08587
作者: Jiaxuan Zou,Ruifeng Ren,Yong Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-context language modeling remains central to modern sequence modeling, but the quadratic cost of Transformer attention makes scaling computationally prohibitive. Linear recurrent models address this bottleneck by compressing the context into a fixed-size state, making the rule that forgets, writes, and edits information a central design problem. To address state maintenance, Gated DeltaNet (GDN) combines gated state decay with delta-rule residual writes, using a learnable coefficient to balance forgetting and update magnitude. However, this coefficient is learned empirically rather than derived from the underlying objective, which can lead to suboptimal update magnitudes. We revisit the online-regression objective underlying GDN and, inspired by the Kaczmarz projection method, derive the key-norm-normalized dynamic step size \beta_t = \eta_t / (|k_t|_2^2 + \epsilon) for residual updates. We propose Kaczmarz Linear Attention (KLA), a one-scalar modification of GDN that preserves the state shape, gates, linear recurrence, and chunkwise parallel algorithm. At the 0.4B scale with a 1B-token budget, KLA achieves the lowest validation perplexity among evaluated linear-time baselines, 8.09 versus 8.50 for GDN, and remains stable up to 65K tokens. On controlled tasks, KLA reaches 100% on single-needle-in-a-haystack retrieval, improves 8x multi-query associative recall by 7.03 points over GDN, and delivers 2.1x higher decode throughput at 32K context. These results suggest that the key-norm-normalized Kaczmarz coefficient is a first-order design axis for delta-rule sequence models: it improves accuracy, extrapolation, and decoding efficiency without changing the recurrent state or hardware kernel.
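下面给出一个极简的 NumPy 示意代码(仅依据上述摘要自行构造,decay、eta、eps 等取值均为假设,并非论文原始实现),用于说明在带门控衰减的 delta 规则线性注意力递归中,如何用键范数归一化的 Kaczmarz 步长进行残差写入:

```python
import numpy as np

def kla_recurrence(keys, values, queries, eta=1.0, eps=1e-6, decay=0.99):
    """Sketch of a gated delta-rule linear-attention recurrence using the
    key-norm-normalized Kaczmarz step size from the abstract.
    The state S maps keys to values; the prediction for key k is S @ k."""
    d_v, d_k = values.shape[1], keys.shape[1]
    S = np.zeros((d_v, d_k))                     # fixed-size recurrent state
    outputs = []
    for k, v, q in zip(keys, values, queries):
        beta = eta / (np.dot(k, k) + eps)        # Kaczmarz step size
        S = decay * S                            # gated state decay (GDN-style)
        S = S + beta * np.outer(v - S @ k, k)    # delta-rule residual write
        outputs.append(S @ q)                    # read out with the query
    return np.stack(outputs)

rng = np.random.default_rng(0)
T, d = 8, 4
print(kla_recurrence(rng.normal(size=(T, d)), rng.normal(size=(T, d)),
                     rng.normal(size=(T, d))).shape)   # (8, 4)
```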

[AI-320] Probing the Impact of Scale on Data-Efficient Generalist Transformer World Models for Atari

【速读】:该论文旨在解决通用型世界模型(World Model, WM)在保持人类级数据效率的同时实现有效扩展的问题。现有研究常将架构机制与模型规模(model scale)的影响混淆,导致对Scaling行为的理解不清晰。其解决方案的关键在于:通过使用一个极简的Transformer世界模型,在固定离线数据集(来自预设专家策略)条件下,系统性分析不同环境下的Scaling规律;发现环境存在本质不同的Scaling regime——部分任务可跨越插值阈值并持续受益于过参数化,而另一些则陷入经典 regime 导致模型增大反而降低保真度;进一步发现在统一训练框架下(即单个Transformer联合训练26个Atari环境),联合训练稳定了Scaling动态,使所有环境均获得单调提升;最终验证了保真度提升可直接转化为下游控制性能(模拟环境中学习的策略达到中位数0.770的专家-随机归一化得分)。因此,该研究强调未来进展不仅依赖架构创新,更需精准的Scaling策略设计。

链接: https://arxiv.org/abs/2605.08578
作者: Jooyeon Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Developing generalist systems that retain human-like data efficiency is a central challenge. While world models (WMs) offer a promising path, existing research often conflates architectural mechanisms with the independent impact of model scale. In this work, we use a minimalist transformer world model to analyze scaling behaviors on the Atari 100k benchmark, using fixed offline datasets derived from a presupposed expert policy. Our results reveal that environments fundamentally fall into distinct scaling regimes, even when constrained by identical offline data budgets and model capacities. For individual tasks, some environments naturally allow models to pass the interpolation threshold, yielding monotonic improvements in the overparameterized regime, while others remain trapped in the classical regime, where larger world models degrade fidelity. In the unified setting, i.e., a single transformer trained on a suite of 26 Atari environments, we uncover that joint training stabilizes scaling dynamics, ensuring monotonic gains across all environments, regardless of their distinct inherent scaling regimes. Finally, we demonstrate that improved fidelity translates directly to downstream control, with policies learned entirely within the simulated dynamics achieving a median expert-random-normalized score of 0.770. Our findings suggest that future progress lies as much in precise scaling strategies as in architectural innovation.

[AI-321] Uncovering Intra-expert Activation Sparsity for Efficient Mixture-of-Expert Model Execution

【速读】:该论文旨在解决当前混合专家(Mixture of Experts, MoE)模型中因细粒度专家粒度导致的训练难题,如专家坍塌(expert collapse)和负载不平衡(load imbalance),从而限制了稀疏性的进一步提升。其解决方案的关键在于探索并利用专家内部激活稀疏性(intra-expert activation sparsity)这一此前被忽视的稀疏维度:通过在不修改激活函数或模型参数的前提下,识别并跳过每个专家中未激活的神经元计算,实现了高达90%的专家内稀疏性且几乎无精度损失。作者在8个不同规模(1B至400B参数)的现成MoE模型上验证了该策略的有效性,并扩展了vLLM的MoE执行流水线,在保持现有优化的基础上,使MoE层执行速度最高提升2.5倍,端到端推理速度提升1.2倍。

链接: https://arxiv.org/abs/2605.08575
作者: Jongseok Park,Sunga Kim,Zhenyu Gu,Ion Stoica,Alvin Cheung
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mixture of Experts (MoE) architecture has become the standard for state-of-the-art large language models, owing to its computational efficiency through sparse expert activation. However, sparsity through finer expert granularity is becoming increasingly difficult to achieve due to fundamental training challenges such as expert collapse and load imbalance. In this work, we explore and leverage intra-expert activation sparsity as a complementary and underexplored dimension of sparsity in MoE models. Surprisingly, substantial intra-expert sparsity is readily available in existing pre-trained MoE models, without any modification to the activation function or model parameters, providing up to 90% sparsity within each expert without significant accuracy loss. We explore intra-expert activation sparsity across eight off-the-shelf MoE models ranging from 1B to 400B parameters, and extend the MoE execution pipeline of vLLM to leverage intra-expert activation sparsity by skipping the computations of inactive neurons, on top of its existing optimizations, achieving up to 2.5 times speedup in MoE layer execution and 1.2 times end-to-end speedup compared to the original dense vLLM baseline.
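以下是一个说明"专家内部激活稀疏性"思路的最小示意(纯 NumPy;keep_frac 取值与按幅值取 top-k 的筛选方式均为假设,仅用于演示如何跳过非激活神经元的计算,并非 vLLM 中的实际实现):

```python
import numpy as np

def sparse_expert_ffn(x, W_in, W_out, keep_frac=0.1):
    """Run one expert while skipping inactive hidden neurons: compute the first
    projection, keep only the most-activated units (top-k by magnitude here,
    an illustrative rule), and restrict the second projection to them."""
    h = np.maximum(W_in @ x, 0.0)              # expert hidden activations (ReLU here)
    k = max(1, int(keep_frac * h.size))
    active = np.argsort(-np.abs(h))[:k]        # indices of active neurons
    return W_out[:, active] @ h[active]        # only active columns participate

rng = np.random.default_rng(0)
d, d_ff = 16, 64
y = sparse_expert_ffn(rng.normal(size=d),
                      rng.normal(size=(d_ff, d)),
                      rng.normal(size=(d, d_ff)))
print(y.shape)  # (16,)
```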

[AI-322] Why Retrying Fails: Context Contamination in LLM Agent Pipelines

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在执行多步骤工具增强任务时,因失败后重试导致上下文污染(context contamination)所引发的性能下降问题。具体而言,当LLM代理在某一步骤失败并重试时,原始失败尝试的内容仍保留在上下文窗口中,从而提升后续每一步的错误率,形成“级联式”误差放大效应。解决方案的核心是提出Context-Contaminated Restart Model (CCRM),该模型形式化描述了在失败后重试过程中上下文污染对错误率的影响,并基于此推导出五个关键理论结果,包括成功概率的闭式表达、污染带来的额外尝试次数(cascade overhead)、最优任务深度预算分配策略、信息论下界以及清除上下文的收益量化。其中,最关键的创新在于通过引入污染误差率ε₁ > ε₀的机制建模实际场景中的上下文干扰,从而准确预测和优化LLM代理在真实环境下的重试行为表现。

链接: https://arxiv.org/abs/2605.08563
作者: Zhanfu Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:When an LLM agent fails a multi-step tool-augmented task and retries, the failed attempt typically remains in its context window – contaminating the next attempt and elevating the per-step error rate beyond the base level. This context-contaminated restart phenomenon is widely observed in practice yet entirely lacks formal treatment. We introduce the Context-Contaminated Restart Model (CCRM): a chain of T tool-call steps, each failing with base rate epsilon_0; after any failed attempt, the subsequent attempt operates in contaminated context with elevated error rate epsilon_1 > epsilon_0. Under this model we derive five main results. (R1) An exact closed-form formula for P(succeed in at most K attempts). (R2) A cascade-overhead theorem giving the additional attempts Delta K incurred by contamination versus the clean-restart baseline. (R3) An optimal budget-allocation theorem identifying the pipeline depth T* that maximises success probability for a fixed total budget B=KT; we prove the closed form T* = sqrt(B * log(1/(1-epsilon_1)) / log(1/(1-epsilon_0))), with K*=B/T*. (R4) An information-theoretic lower bound via Le Cam’s method showing K_CCRM is tight up to O(1). (R5) A clean-restart dominance theorem quantifying the exact benefit of context-clearing before retry. We validate CCRM on real SWE-bench Verified data: the IID model overestimates pass@3 by 17.4 percentage points (98.6% vs. 81.2%), while CCRM fits with error less than 0.001, implying a cascade ratio of epsilon_1/epsilon_0 = 7.1. Monte Carlo experiments confirm all theoretical predictions.
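下面给出一个按摘要描述自行构造的小型示意脚本:用蒙特卡洛模拟估计"K 次尝试内成功"的概率(首次尝试按基础错误率 eps0,失败后的后续尝试按污染错误率 eps1),并按摘要中 (R3) 的闭式解计算最优流水线深度 T*;其中 eps0、eps1、B、T、K 等数值均为示例假设:

```python
import math
import random

def p_success_within_k(T, K, eps0, eps1, trials=20000, seed=0):
    """Monte Carlo estimate of P(success within K attempts) under the
    contaminated-restart model: the first attempt runs at base per-step error
    rate eps0; every attempt after a failure runs at the elevated rate eps1."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        for attempt in range(K):
            rate = eps0 if attempt == 0 else eps1
            if all(rng.random() > rate for _ in range(T)):
                successes += 1
                break
    return successes / trials

def optimal_depth(B, eps0, eps1):
    """Closed-form optimal pipeline depth T* from result (R3) in the abstract."""
    return math.sqrt(B * math.log(1 / (1 - eps1)) / math.log(1 / (1 - eps0)))

eps0, eps1 = 0.02, 0.14          # illustrative rates (ratio ~7)
print(optimal_depth(B=64, eps0=eps0, eps1=eps1))
print(p_success_within_k(T=10, K=3, eps0=eps0, eps1=eps1))
```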

[AI-323] VeriContest: A Competitive-Programming Benchmark for Verifiable Code Generation

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在代码生成中缺乏正确性保障的问题,即模型输出的代码虽可执行但无法确保其逻辑正确性。为突破传统测试驱动验证的局限,论文提出通过生成形式化规范(formal specifications)和机器可检查的证明(machine-checkable proofs)来实现可验证代码生成(verifiable code generation)。解决方案的关键在于构建一个大规模、高质量且结构化的基准测试平台——VeriContest,它包含946个来自LeetCode和Codeforces的竞赛编程问题,每个问题均配有专家验证的形式化规格、经Verus验证的证明、以及正负例测试集,并采用三阶段流水线(从人工验证种子问题到半自动化扩展并辅以人工审核)确保数据质量。此外,利用测试作为额外的质量控制层以验证后置条件完整性,从而支持对规范生成、代码生成、证明生成及端到端合成等模块的独立与组合评估,揭示出当前模型在证明和规范生成上的显著短板,确立了该基准作为未来可验证代码生成系统研发与评估的坚实基础。

链接: https://arxiv.org/abs/2605.08553
作者: Zichen Xie,Mrigank Pawagi,Yuxin Liu,Aaditi Rai,Lize Shao,John Berberian Jr.,Sicong Che,Wenxi Wang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models can generate useful code from natural language, but their outputs come without correctness guarantees. Verifiable code generation offers a path beyond testing by requiring models to produce not only executable code, but also formal specifications and machine-checkable proofs. Progress in this direction, however, is difficult to measure: existing benchmarks are often small, focus on only one part of the pipeline, lack ground-truth proofs or rigorous specification validation, or target verification settings far from mainstream software development. We present VeriContest, a benchmark of 946 competitive-programming problems from LeetCode and Codeforces for verifiable code generation in Rust with Verus. Each problem pairs a natural language description with expert-validated formal specifications, judge-accepted Rust code, Verus-checked proofs, and positive and negative test suites. VeriContest is constructed through a three-phase pipeline that scales from manually verified seed problems to semi-automated expansion with human-in-the-loop review. To further strengthen benchmark quality, we use testing as an additional quality-assurance layer for validating postcondition completeness. VeriContest supports isolated and compositional evaluation of specification generation, code generation, proof generation, and end-to-end verified program synthesis. Evaluating ten state-of-the-art models reveals a sharp gap between coding ability and verifiable code generation: the strongest model reaches 92.18% on natural-language-to-code generation, but only 48.31% on specification generation, 13.95% on proof generation, and 5.29% end-to-end. These results identify proof and specification generation as the central bottlenecks for models and establish VeriContest as a rigorous platform for measuring and training future systems that generate code with machine-checkable correctness.

[AI-324] Evaluating Developmental Cognition Capabilities of LLMs

【速读】:该论文旨在解决当前对话式人工智能(Conversational AI)在个性化设计中忽视用户如何解读和运用模型输出以构建现实认知的问题,尤其缺乏对用户发展性思维结构的量化评估方法。其核心挑战在于如何将罗伯特·凯根(Robert Kegan)的建构-发展理论(constructive-developmental theory)引入大语言模型(LLM)评估体系,从而捕捉个体在回应中体现的发展阶段特征。解决方案的关键是提出一种名为“发展性句子完成测试”(Developmental Sentence Completion Test, DSCT)的20项自填式文本生成任务,该工具能从用户或模型生成的文本中提取阶段性信号,且不依赖专家访谈或侵入性强的问卷。研究发现,DSCT可有效识别不同模型家族在无条件生成时展现出稳定的发展阶段倾向,表明发展阶段信号在合成数据中更清晰,而提升对话式AI的发展意识关键在于获取高质量的、具发展性的文本响应信号,而非单纯提高分类器准确率。

链接: https://arxiv.org/abs/2605.08549
作者: Xiao Xiao,Hayoun Noh,Mar Gonzalez-Franco
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures, (10 pages appendix)

点击查看摘要

Abstract:Conversational AI is increasingly personalized around users’ preferences, histories, goals, and knowledge, but much less around how users interpret and take up model outputs to construct and understand their reality. We draw on Robert Kegan’s constructive-developmental theory as a complementary lens on this dimension. Existing methods for assessing developmental stage in the Keganian tradition rely either on expert interviews that do not scale or on sentence-completion instruments that are proprietary, lengthy, or invasive. To make this perspective tractable for LLM evaluation, we introduce the Developmental Sentence Completion Test (DSCT), a 20-item instrument designed to elicit developmental signal in self-administered text. Throughout, we treat the resulting labels as characterizations of stage-like structure in elicited responses, not as validated person-level developmental stage. We then ask how much of that signal can be recovered by LLMs across three elicited response regimes: simulated personas, real human respondents, and default model-generated answers. On simulated personas, top frontier models recover simulator-intended labels with high accuracy. On real human DSCT responses, human-LLM agreement is fair, with much stronger within-neighborhood than exact agreement. Finally, when LLMs answer DSCT prompts without persona-conditioning, their responses exhibit stable stage-like differences across model families, with larger and newer models tending to generate higher-rated text. These results suggest that stage-conditioned signal is cleaner in synthetic responses than in human-written DSCT text, and that the core constraint for stage-aware conversational AI is not classifier accuracy alone, but the availability of developmental signal from elicited text.

[AI-325] Log analysis is necessary for credible evaluation of AI agents

【速读】:该论文旨在解决当前AI代理(Agent)评估中存在的可信度问题,即仅依赖最终结果(通过或失败)会导致评估结果被捷径策略、基准缺陷或隐藏的危险行为所扭曲,从而无法真实反映代理的实际能力与安全性。解决方案的关键在于引入日志分析(log analysis),即系统性地追踪和分析代理在执行过程中的输入、中间步骤与输出,以识别评估中的有效性威胁(validity threats)。通过构建威胁分类体系与日志分析指导原则,论文展示了日志分析能够揭示传统指标下被低估的能力(如tau-Bench Airline中pass^5性能被低估了近50%)以及部署时不可见的失效模式,从而提升评估的透明性与可靠性。

链接: https://arxiv.org/abs/2605.08545
作者: Peter Kirgis,Sayash Kapoor,Stephan Rabanser,Nitya Nadgir,Cozmin Ududec,Magda Dubois,JJ Allaire,Conrad Stosz,Marius Hobbhahn,Jacob Steinhardt,Arvind Narayanan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agent benchmarks typically report only final outcomes: pass or fail. This threatens evaluation credibility in three ways. First, scores may be inflated or deflated by shortcuts and benchmark artifacts, misrepresenting capability. Second, benchmark performance may fail to predict real-world utility due to scaffold limitations and recurring failure modes. Finally, capability scores may conceal dangerous or catastrophic actions taken by the agent. We argue that log analysis – the systematic tracking and analysis of the inputs, execution, and outputs of an AI agent – is necessary to overcome these validity threats and promote credible agent evaluation. In this paper, we (1) present a taxonomy of threats to credible evaluation documented through log analysis, and (2) develop a set of guiding principles for log analysis. We illustrate these principles on tau-Bench Airline, revealing that pass^5 performance was under-elicited by nearly 50% and surfacing deployment failure modes invisible to outcome metrics. We conclude with pragmatic recommendations to increase uptake of log analysis, directed at diverse stakeholders including benchmark creators, model developers, independent evaluators, and deployers.

[AI-326] Continuity Laws for Sequential Models

【速读】:该论文旨在解决序列建模中一个被忽视的归纳偏置——时间连续性(continuity in time)对模型行为和性能的影响问题。研究者提出,尽管某些模型如状态空间模型(state-space models)源自连续时间公式,但它们是否真正表现出时间上的连续行为尚不明确,且这种连续性是否有助于提升具有连续时间结构任务的性能也缺乏实证依据。解决方案的关键在于:首先形式化“模型连续性”为在时间细化下预测收敛至潜在连续轨迹的性质;其次引入一种直接从数据时间结构中量化任务连续性的指标;并通过实验验证模型连续性与任务连续性之间的对齐关系显著影响模型性能,从而揭示了连续性不仅是归纳偏置,更是可带来效率与性能双重提升的实际优势。

链接: https://arxiv.org/abs/2605.08539
作者: Annan Yu,Dongwei Lyu,N. Benjamin Erichson
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Inductive biases influence the behavior and performance of sequential models. In this work, we study an underexplored inductive bias in sequential modeling: continuity in time. We ask a simple question: do models motivated by continuous-time formulations, such as state-space models, actually behave continuously in time, and does this translate into better performance on tasks with continuous temporal structure? To answer this, we formalize model continuity as convergence under temporal refinement, where a model is continuous if its predictions approach an underlying continuous trajectory as the temporal discretization is refined. We show that S4 exhibits stable continuous behavior, whereas S6 (the core of Mamba) can be more sensitive to input amplitude and selective dynamics, despite being derived from a continuous dynamical system. To study whether this distinction matters for learning, we also need a corresponding notion of task continuity. We therefore introduce a metric to quantify the continuity of datasets directly from their temporal structure. Across benchmarks, we find a clear empirical alignment between task continuity, model continuity, and model performance. Beyond an inductive bias, continuity also has practical consequences: we show that it enables a simple temporal subsampling strategy that improves both efficiency and performance.
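以下是一个体现"从数据的时间结构直接度量任务连续性"这一思路的玩具示例(该度量为示意性替代写法,论文中的具体指标可能不同):将每条序列与其 2 倍时间下采样后再保持的版本做比较,时间上连续的信号差异较小,而离散/符号型序列差异较大:

```python
import numpy as np

def sequence_continuity(x):
    """Illustrative stand-in for a task-continuity score: compare a sequence
    with a 2x temporally subsampled-and-held copy; temporally continuous
    signals change little, discrete/symbolic sequences change a lot."""
    coarse = np.repeat(x[::2], 2, axis=0)[: len(x)]
    return float(np.linalg.norm(x - coarse) / (np.linalg.norm(x) + 1e-12))

t = np.linspace(0, 1, 128)
smooth = np.sin(2 * np.pi * t)[:, None]                  # continuous temporal structure
tokens = np.random.default_rng(0).integers(0, 5, size=(128, 1)).astype(float)
print(sequence_continuity(smooth), sequence_continuity(tokens))
```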

[AI-327] Human-LLM Dialogue Improves Diagnostic Accuracy in Emergency Care

【速读】:该论文旨在解决急诊医学中临床决策因不确定性导致的诊断延迟与准确性不足问题,尤其是在医生实际工作流程中如何有效整合大语言模型(Large Language Models, LLMs)作为交互式辅助工具这一证据空白。其解决方案的关键在于设计并验证MedSyn系统:该系统允许医生在仅见主诉的情况下,通过迭代式提问与LLM互动,并逐步获取完整临床记录以优化诊断推理;实验表明,该交互机制显著提升了住院医师在高难度病例中的诊断正确率(从0.589提升至0.734),且自动化指标和对话分析均证实了LLM支持对不同经验水平医生均具增强作用,尤其促进专家间一致性提升。

链接: https://arxiv.org/abs/2605.08533
作者: Burcu Sayin,Ngoc Vo Hong,Ipek Baris Schlicht,Jacopo Staiano,Pasquale Minervini,Sara Allievi,Nicola Susca,Nicola Osti,Alberto Maino,Vito Racanelli,Andrea Passerini
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Paper under review

点击查看摘要

Abstract:Clinical decision-making in emergency medicine demands rapid, accurate diagnoses under uncertainty. Despite benchmark progress, evidence for LLMs as interactive aids in live physician workflows remains sparse. MedSyn lets physicians iteratively query an LLM provided with the full clinical record while initially viewing only the chief complaint. Seven physicians (three seniors, four residents) completed baseline and AI-assisted sessions across 52 MIMIC-IV cases stratified by difficulty. Blinded evaluation showed residents’ Hard-case correctness rose from 0.589 to 0.734; difficulty-standardised completely-correct rates confirmed a medium effect (\Delta = 0.092; p = 0.071; d = 0.47). Automated metrics corroborated these gains: standardised any-match accuracy improved by 0.156 (p < 0.0001), and residents showed the largest F1 gain (\Delta = 0.138; p < 0.0001). Dialogue analysis revealed expertise-dependent strategies (seniors asked targeted, hypothesis-driven questions; residents relied on broader queries) and cross-expertise concordance increased (\Delta = 0.145; p < 0.0001). Interactive LLM support meaningfully enhances diagnostic reasoning.

[AI-328] MARLaaS: Multi-Tenant Asynchronous Reinforcement Learning as a Service

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在强化学习从可验证奖励(Reinforcement Learning from Verifiable Rewards, RLVR)微调过程中存在的高计算成本与资源利用率低的问题,尤其是在多任务并发场景下,传统方法难以兼顾性能与效率。其解决方案的关键在于提出MARLaaS(Multi-tenant Asynchronous RL as a Service),通过两个核心设计实现:一是采用轻量级LoRA(Low-Rank Adaptation)适配器共享基础模型以降低存储与计算开销;二是构建解耦异步架构,将回放生成、环境交互与策略训练分离为独立调度阶段,支持任务按事件驱动方式并行推进,从而显著减少跨任务干扰、空闲时间和端到端训练延迟,在32个并发任务下实现单任务SOTA性能的同时,加速器利用率提升达4.3倍,训练时间缩短85%。

链接: https://arxiv.org/abs/2605.08527
作者: Timothy Tin Long Yu,Gursimran Singh,Ge Shi,Hanieh Sadri,Yong Zhang,Zhenan Fan
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement Learning from Verifiable Rewards (RLVR) has significantly improved the reasoning capabilities of large language models (LLMs), particularly in multi-turn agentic settings involving environment interaction like tool use. However, fine-tuning such models remains prohibitively expensive due to high computational requirements, limiting accessibility. We propose MARLaaS (Multi-tenant Asynchronous RL as a Service), a system for concurrent RL fine-tuning across multiple users and tasks. Our approach is based on two key ideas: (1) sharing a base model across tenants using lightweight LoRA adapters, and (2) a disaggregated asynchronous architecture that decouples rollout generation, environment interaction, and policy training into independently scheduled stages. This design enables tasks to progress through the RL pipeline at their own pace in an event-driven manner, reducing cross-task interference, idle time, and end-to-end latency. In multi-task settings (we report up to 32 concurrent tasks), MARLaaS achieves single-task state-of-the-art performance while improving accelerator utilization by up to 4.3x and reducing end-to-end training time by 85%.

[AI-329] Results and Retrospective Analysis of the CODS 2025 AssetOpsBench Challenge

【速读】:该论文旨在解决竞赛评估机制中透明度不足与评价指标偏差的问题,特别是在隐私感知的工业多智能体编排(multi-agent orchestration)挑战赛中,如何准确识别哪些设计模式真正被奖励,以及隐藏评估如何影响结论。其关键解决方案在于系统性整合多源数据:包括最终排名表、300次提交的日志、149个注册团队信息、最优提交导出文件、组织者获奖报告、配套的AssetOpsBench系统论文及经验证的规划路径源码树,从而揭示五大核心发现:如公开排行榜在72.73%饱和且提示词复杂度无提升作用;隐藏评估导致执行阶段评分呈负相关(r=-0.13),表明公开表现不能代表真实能力;复合指标中的\tmatch项实际贡献微弱(≤0.05分/赛道);参赛单位以账户为操作基础但实质以团队为评价单元;成功执行方法主要依赖于增强护栏机制(如响应选择、污染清理、回退策略和上下文控制),而非创新智能体架构。这些发现推动了更合理的评分复合设计、技能水平诊断工具及版本化成果发布机制的发展。

链接: https://arxiv.org/abs/2605.08518
作者: Dhaval Patel,Chathurangi Shyalika,Suryanarayana Reddy Yarrabothula,Ling Yue,Shuxin Lin,Nianjun Zhou,James Rayfield
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 43 pages, 32 Figures

点击查看摘要

Abstract:Competition retrospectives are useful when they explain what a leaderboard measured, how hidden evaluation changed conclusions, and which design patterns were rewarded. We revisit the CODS 2025 AssetOpsBench challenge, a privacy-aware Codabench competition on industrial multi-agent orchestration built on AssetOpsBench. We combine final rank sheets, a 300-submission server log, 149-team registrations, best-submission exports, the organizer winners report, the companion AssetOpsBench system paper, and verified planning-track source trees. Five results stand out. First, the public planning leaderboard saturates at 72.73%, and richer prompts do not improve that peak. Second, hidden evaluation changes the story: public and private scores correlate moderately in planning (r=0.69) but negatively in execution (r=-0.13), with several 45.45% public execution systems reaching 63.64% on the hidden set. Third, the \tmatch term is numerically almost inert in the official composite – combined on a 0–1 scale with 0–100 percentage scores, it contributes at most 0.05 points per track, and rescaling would swap the top two teams. Fourth, the competition is operationally account-based but substantively team-based: 149 registered teams reduce to 24 with non-zero public scores and 11 fully ranked, while 52.3% of deduplicated registrations list multiple usernames. Fifth, successful execution methods mostly improve guardrails – response selection, contamination cleanup, fallback, and context control – rather than novel agent architectures. These findings identify which behaviors the evaluation rewarded, and motivate scale-aware composites, skill-level diagnostics, and versioned artifact release.

[AI-330] OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control

【速读】:该论文旨在解决交通信号控制(TSC)系统中因传统强化学习方法缺乏可解释性而导致公众信任度不足的问题,同时克服大语言模型(LLM)在TSC任务中因反馈稀疏和延迟导致的强化微调不稳定问题。其解决方案的关键在于提出OracleTSC框架,通过两个核心机制实现稳定且高效的LMM-based TSC:一是奖励门槛机制(reward hurdle mechanism),通过从环境奖励中减去校准阈值来过滤弱学习信号;二是不确定性正则化(uncertainty regularization),最大化所选响应的概率以增强采样输出的一致性决策能力。实验表明,该方法显著提升了交通效率并保持了自然语言解释的透明性,且具备跨路口的良好泛化能力。

链接: https://arxiv.org/abs/2605.08516
作者: Darryl Jacob,Xinyu Liu,Muchao Ye,Xiaoyong Yuan,Pan He
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Published in Transactions on Machine Learning Research

点击查看摘要

Abstract:Transparent decision-making is essential for traffic signal control (TSC) systems to earn public trust. However, traditional reinforcement learning-based TSC methods function as black boxes with limited interpretability. Although large language models (LLMs) can provide natural language reasoning, reinforcement finetuning for TSC remains unstable because feedback is sparse and delayed, while most actions produce only marginal changes in congestion metrics. We introduce OracleTSC, which stabilizes LLM-based TSC through two mechanisms: (1) a reward hurdle mechanism that filters weak learning signals by subtracting a calibrated threshold from environmental rewards, and (2) uncertainty regularization that maximizes the probability of the selected response to encourage consistent decisions across sampled outputs. Experiments on the LibSignal benchmark show that OracleTSC enables a compact LLaMA3-8B model to substantially improve traffic efficiency, achieving a 75% reduction in travel time and a 67% decrease in queue length compared with the pretrained baseline while preserving interpretability through natural language explanations. OracleTSC also demonstrates strong cross-intersection generalization: a policy trained on one intersection transfers to a structurally different intersection with 17% lower travel time and 39% lower queue length without additional finetuning. These results suggest that uncertainty-aware reward shaping can improve the stability and effectiveness of reinforcement fine-tuning for TSC.
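下面用几行 Python 示意摘要中两个机制的含义(阈值与正则权重均为假设值,并非论文的校准结果):奖励门槛机制从环境奖励中减去校准阈值以过滤弱信号;不确定性正则化则在损失中加入所选响应对数概率的奖励项:

```python
def hurdle_rewards(rewards, hurdle):
    """Reward hurdle: subtract a calibrated threshold so weak signals stop
    driving the policy update (the hurdle value here is illustrative)."""
    return [r - hurdle for r in rewards]

def uncertainty_regularizer(logprob_selected, weight=0.1):
    """Uncertainty regularization: reward (i.e., subtract from the loss) the
    log-probability of the selected response to encourage consistent decisions."""
    return -weight * logprob_selected

print(hurdle_rewards([0.02, -0.01, 0.35, 0.05], hurdle=0.05))
print(uncertainty_regularizer(-1.2))
```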

[AI-331] Scaling Limits of Long-Context Transformers

【速读】:该论文旨在解决软最大自注意力机制(softmax self-attention)在长上下文场景下的选择性行为问题,即当查询固定而上下文为球面上独立同分布的键(keys)时,注意力权重如何随逆温度参数 $\beta_n$ 的变化从均匀平均过渡到聚焦于最近键的集中状态。其解决方案的关键在于识别出注意力选择性的临界尺度由键到查询距离分布在零附近的局部指数行为决定,而非全局上下文特征;具体而言,临界逆温度 $\beta_n^\ast$ 与维度 $d$ 和样本数 $n$ 的关系为 $\beta_n^\ast \asymp n^{2/(d-1)}$,并据此划分三种渐近 regime:亚临界区(局部平均+确定性偏置与高斯波动)、临界区(多个最近键共享宏观质量但不单点坍缩)和超临界区(全部质量集中于最近键)。特别地,在亚临界情形下若值矩阵为单位阵,则注意力映射近似实现反向热方程(backward heat equation),揭示了其连续动力学意义。

链接: https://arxiv.org/abs/2605.08505
作者: Giuseppe Bruno,Shi Chen,Zhengjiang Lin,Yury Polyanskiy,Philippe Rigollet
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Probability (math.PR); Statistics Theory (math.ST)
备注: 40 pages, 4 figures

点击查看摘要

Abstract:We study the long-context limit of softmax self-attention with a fixed query and a random context of n i.i.d. keys on the sphere, viewing the inverse temperature \beta_n as the scaling parameter that decides whether attention degenerates into uniform averaging or collapses onto the single closest key. We show that the critical scale at which selectivity emerges is determined by the local exponent of the distance-to-query distribution near zero rather than by global features of the context, and scales like \beta_n^\ast \asymp n^{2/(d-1)} for uniform keys on \mathbb{S}^{d-1}. Furthermore, we characterize the limiting laws of the ordered attention weights and of the attention output across all regimes of \beta_n : a subcritical regime in which the output reduces to a local average around q with explicit deterministic bias and Gaussian fluctuations; a critical regime in which a finite collection of nearest keys retains macroscopic mass without single-key collapse; and a supercritical regime in which all mass concentrates on the closest key. Of notable interest is the subcritical case with identity value matrix where the attention map approximately implements a backward heat equation.
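下面是一个小型数值实验草图(n、d 与各 β 取值均为示例假设),用于观察摘要所述现象:对固定查询与球面上独立同分布的键,统计最近键获得的平均 softmax 注意力质量如何随逆温度 β 变化:

```python
import numpy as np

def attention_mass_on_nearest(n, d, beta, trials=200, seed=0):
    """Average softmax mass received by the key closest to the query,
    for n i.i.d. keys uniform on the sphere S^{d-1} at inverse temperature beta."""
    rng = np.random.default_rng(seed)
    masses = []
    for _ in range(trials):
        q = rng.normal(size=d)
        q /= np.linalg.norm(q)
        K = rng.normal(size=(n, d))
        K /= np.linalg.norm(K, axis=1, keepdims=True)
        scores = beta * (K @ q)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        masses.append(w[np.argmax(scores)])    # nearest key = largest inner product
    return float(np.mean(masses))

n, d = 256, 8
for beta in (1.0, n ** (2 / (d - 1)), 10 * n ** (2 / (d - 1))):
    print(round(beta, 1), round(attention_mass_on_nearest(n, d, beta), 3))
```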

[AI-332] MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLM s

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在组合推理能力评估中面临的基准易饱和、难以持续挑战模型性能的问题。现有评测通常使用固定数据集或依赖LLM自身作为评判者,导致随着模型进步,评测结果迅速达到上限,无法反映真实推理能力的提升。解决方案的关键在于提出MathConstraint——一个硬性且自适应的基准,其核心创新是结合约束满足问题(Constraint Satisfaction Problems, CSPs)与基于求解器的严格验证机制,并设计了一个可调节难度的生成器,能够根据模型进展动态生成更具挑战性的实例,从而保持评测的长期有效性。该方法通过参数化问题类型实现可扩展的、自动验证的难题生成,显著提升了评测的鲁棒性和实用性。

链接: https://arxiv.org/abs/2605.08498
作者: Viresh Pati,Zhengyu Li,Piyush Jha,Rahul Garg,Yatharth Sejpal,Vijay Ganesh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:We introduce MathConstraint, a hard, adaptive benchmark for evaluating the combinatorial reasoning capabilities of LLMs. We combine constraint satisfaction problems with rigorous solver-based verification and design an adaptive generator to create instances that remain challenging as the LLMs improve in their reasoning capabilities. Unlike existing benchmarks that quickly saturate on fixed datasets or use LLM-as-a-judge for checking solutions, MathConstraint uses parameterized problem types that enable scalable generation of arbitrarily difficult and automatically verifiable instances. We release MathConstraint-Easy (266 instances), on which frontier models achieve between 72.6% (gemini-3.1-flash-lite) and 87.6% (gpt-5.5) accuracy, and MathConstraint (329 instances) on which the same models drop to between 18.5% (claude-4.6-sonnet) and 66.9% (gpt-5.5) accuracy, demonstrating the resilience of our benchmark generator against rapid progress in LLM reasoning capabilities. We evaluate 12 frontier and open-weight models with and without access to a sandboxed Python environment that includes generic SAT/SMT solvers. Tool access roughly doubles frontier accuracy on MathConstraint (mean +28 pp; up to +52 pp for claude-4.6-sonnet). Further, halving the tool-call budget from 8 to 4 rounds erases up to 37 points – a sensitivity that most single-budget benchmarks miss. We release the generator, dataset, and evaluation harness as a robust environment for studying combinatorial reasoning and tool-use behavior under adversarially-tunable difficulty.
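以下给出一个示意性的"参数化生成 + 求解器式验证"小例子(这里用随机 3-子句布尔约束和暴力枚举代替论文中更丰富的题型与 SAT/SMT 工具,仅用于说明实例可自动验证、难度可通过参数调节):

```python
import itertools
import random

def make_instance(n_vars=4, n_clauses=6, seed=0):
    """Toy parameterized instance family: random 3-literal clauses over boolean
    variables; difficulty scales with n_vars / n_clauses. The benchmark's real
    problem types are richer; this only illustrates automatic verification."""
    rng = random.Random(seed)
    clauses = [[rng.choice([1, -1]) * rng.randint(1, n_vars) for _ in range(3)]
               for _ in range(n_clauses)]
    return n_vars, clauses

def verify(assignment, clauses):
    """Solver-style check: every clause must contain at least one satisfied literal."""
    return all(any((lit > 0) == assignment[abs(lit) - 1] for lit in clause)
               for clause in clauses)

n_vars, clauses = make_instance()
solution = next((a for a in itertools.product([False, True], repeat=n_vars)
                 if verify(a, clauses)), None)      # brute-force reference solver
print("UNSAT" if solution is None else (solution, verify(solution, clauses)))
```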

[AI-333] Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms ICLR2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在对抗鲁棒性方面的关键挑战:现有防御方法依赖大规模有害提示数据集(数千至数十万条样本),且对新型攻击向量和分布偏移仍易受攻击。其解决方案的核心是提出潜在人格对齐(Latent Personality Alignment, LPA),通过在抽象人格特质(personality traits)层面进行训练,而非直接学习具体有害行为,结合少于100条特质语句与潜在空间对抗训练,实现高鲁棒性与高效性——在仅使用极少量样本的情况下达到与基于超大规模数据训练的基线相当的防御效果,并显著提升对未见攻击分布的泛化能力(在六个危害基准上误分类率降低2.6倍),且无需在训练中接触任何有害示例。

链接: https://arxiv.org/abs/2605.08496
作者: Linh Le,David Williams-King,Mohamed Amine Merzouk,Aton Kamanda,Adam Oberman
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: published at Trustworthy AI Workshop, ICLR 2026

点击查看摘要

Abstract:Current adversarial robustness methods for large language models require extensive datasets of harmful prompts (thousands to hundreds of thousands of examples), yet remain vulnerable to novel attack vectors and distributional shifts. We propose Latent Personality Alignment (LPA), a sample-efficient defense that achieves robustness by training models on abstract personality traits rather than specific harmful behaviors. Using fewer than 100 trait statements and latent adversarial training, LPA achieves comparable attack success rates to methods trained on 150k+ examples, while maintaining superior utility. Critically, LPA generalizes better to unseen attack distributions, reducing misclassification rates by 2.6x compared to baseline across six harm benchmarks – without ever seeing harmful examples during training. Our results demonstrate that personality-based alignment offers a principled approach to building robust defenses with minimal cost.

[AI-334] AI-Care: A Conversational Agentic System for Task Coordination in Alzheimer's Disease Care

【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s disease, AD)及相关痴呆症(ADRD)患者在使用数字日常管理工具时面临的认知负担问题,尤其是因多步骤操作导致的独立使用障碍。其核心解决方案是构建一个名为AI-Care的对话式智能代理层,基于LangGraph实现状态感知的任务编排流程,通过自然语言交互(支持语音输入与输出)完成日程提醒、待办事项管理等任务。关键创新在于:1)采用结构化多阶段处理流程(包括意图识别、上下文加载、安全校验、槽位填充和工具执行),确保响应准确性和可控性;2)对药物和过敏等高风险信息严格依赖照护者验证的数据源,避免模型自由生成;3)通过可控多轮澄清机制应对模糊请求,而非沉默失败或猜测,从而提升用户信任感与任务完成率。

链接: https://arxiv.org/abs/2605.08480
作者: Preyash Yadav,Michelle Cohn,Priyanka Koppolu,Hritvik Agarwal,Amey Gohil,Tejas Patil,Sasha Pimento,Alyssa Weakley
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures

点击查看摘要

Abstract:Individuals with Alzheimer’s disease (AD) and Alzheimer’s disease-related dementia (ADRD) experience memory and thinking changes that impact their ability to use digital daily management tools. For example, adding an event to a digital calendar requires multiple steps that may act as barriers to independent use for individuals with AD/ADRD. This paper presents AI-Care, a conversational agentic artificial intelligence (AI) layer built on top of a remote caregiving platform co-designed with people with AD/ADRD. AI-Care is designed to reduce the cognitive load on individuals with AD/ADRD when managing everyday tasks such as setting calendar reminders and organizing to-do lists through natural-language interaction with a voice-first chatbot. The system uses a LangGraph-based stateful orchestration approach in which each request passes through sanitization, intent classification, context loading, safety checks, deterministic slot collection, tool execution, and response composition. Safety-critical responses, particularly around medications and allergies, are grounded in caregiver-verified records rather than free-form model generation. The system does not make autonomous medical or treatment decisions. Incomplete or ambiguous requests are handled through controlled multi-turn clarification rather than silent failure or guessing. The system supports both typed and spoken input, with voice output through ElevenLabs text-to-speech. Longer responses are chunked before synthesis to avoid rushed playback. A preliminary pilot with four individuals with mild-to-moderate AD/ADRD showed that users found the system trustworthy, competent, and likable, and were able to complete the evaluated coordination tasks through conversation. We describe the design goals, system architecture, safety controls, and findings from this formative evaluation.
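下面用一个极简函数示意摘要描述的分阶段请求处理流程(清洗、意图分类、上下文加载、安全检查、槽位收集、工具执行、响应组织);各阶段实现与字段名均为占位假设,并非实际部署系统:

```python
def ai_care_pipeline(user_text, memory, tools):
    """Placeholder staged flow: sanitization -> intent classification ->
    context loading -> safety check -> slot collection -> tool execution ->
    response composition. Every stage here is a toy stand-in."""
    text = user_text.strip()                                             # sanitization
    intent = "add_reminder" if "remind" in text.lower() else "chitchat"  # intent classification
    context = memory.get(intent, {})                                     # context loading
    if "medication" in text.lower() and not context.get("caregiver_verified"):
        return "I can only answer medication questions from caregiver-verified records."
    slots = {"when": context.get("default_time", "tomorrow 9am")}        # slot collection
    if intent == "add_reminder":
        tools["calendar"](text, slots["when"])                           # tool execution
        return f"Okay, I added a reminder for {slots['when']}."          # response composition
    return "I'm here to help with reminders and to-do lists."

calls = []
print(ai_care_pipeline("Please remind me about the clinic visit",
                       memory={"add_reminder": {"default_time": "tomorrow 9am"}},
                       tools={"calendar": lambda text, when: calls.append((text, when))}))
```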

[AI-335] Transformers Can Implement Preconditioned Richardson Iteration for In-Context Gaussian Kernel Regression

【速读】:该论文旨在解决标准 softmax-attention 变压器(Transformer)是否能够实现具有端到端预测误差保证的非线性上下文学习(in-context learning, ICL)问题,特别是针对高斯核岭回归(Kernel Ridge Regression, KRR)任务。其核心解决方案是证明:在有界数据假设下,一个单头 Transformer 通过执行预条件 Richardson 迭代(preconditioned Richardson iteration),可在前向传播中近似 KRR 预测器;具体而言,softmax 注意力生成行归一化的高斯核算子以建模跨标记交互,而 ReLU MLP 层则局部逼近更新所需的标量运算。理论分析表明,该架构仅需 $ O(\log(1/\epsilon)) $ 层和宽度为 $ O(\sqrt{N}/\epsilon) $ 的 MLP 即可达到 $ \epsilon $-精度预测,实验证据进一步支持这一机制与经典 KRR 求解器的误差演化高度一致。

链接: https://arxiv.org/abs/2605.08475
作者: Mingsong Yan,Dongyang Li,Charles Kulick,Sui Tang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Mechanistic accounts of in-context learning (ICL) have identified iterative algorithms for linear regression and related linear prediction tasks, often using linear or ReLU attention variants. For nonlinear ICL, prior work has related softmax and kernelized attention to functional-gradient-type dynamics, but it remains unclear whether a standard transformer with softmax attention can implement a convergent solver with an end-to-end prediction-error guarantee. In this paper, we study in-context kernel ridge regression (KRR) with Gaussian kernels and show that a standard softmax-attention transformer can approximate the KRR predictor during its forward pass by implementing preconditioned Richardson iteration on the associated kernel linear system. Under bounded-data assumptions, we construct a single-head transformer with O(\log(1/\epsilon)) blocks and MLP width O(\sqrt{N}/\epsilon) that achieves \epsilon-accurate prediction for prompts of length N. Our construction reveals a functional decomposition within the transformer architecture: softmax attention produces a row-normalized Gaussian-kernel operator needed for cross-token interactions, while ReLU MLP layers act locally to approximate the intra-token scalar arithmetic required by the update. Empirically, we train GPT-2-style transformers on Gaussian-process regression tasks to further test the preconditioned Richardson interpretation. Through linear probing, we compare the transformer’s layer-wise predictions with the step-wise outputs of classical KRR solvers and find that its error profiles align most consistently with preconditioned Richardson iteration. Ablation studies further support this interpretation. Together, our theory and experiments identify preconditioned Richardson iteration as a concrete mechanism that softmax-attention transformers can realize for nonlinear in-context Gaussian-kernel regression.
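下面给出论文所述"预条件 Richardson 迭代求解核岭回归"的一个数值草图(NumPy 实现;这里用标量步长 1/λ_max 充当最简单的预条件子,gamma、lam、steps 均为示例取值,与论文构造的 Transformer 权重并无直接对应):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_richardson(X, y, lam=1e-2, gamma=1.0, steps=2000):
    """Richardson iteration for (K + lam I) alpha = y with a scalar step size
    1/lambda_max as the simplest preconditioner (illustrative choice only)."""
    A = gaussian_kernel(X, X, gamma) + lam * np.eye(len(X))
    eta = 1.0 / np.linalg.eigvalsh(A).max()    # guarantees convergence for SPD A
    alpha = np.zeros(len(X))
    for _ in range(steps):
        alpha = alpha + eta * (y - A @ alpha)  # alpha <- alpha + P (y - A alpha)
    return alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)
alpha = krr_richardson(X, y)
A = gaussian_kernel(X, X) + 1e-2 * np.eye(20)
print(np.linalg.norm(A @ alpha - y))           # residual shrinks as steps grow
```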

[AI-336] Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在大语言模型(Large Language Models, LLMs)中效果受限的问题,其根源在于训练数据的多样性不足,尤其是针对推理类任务时,若仅覆盖有限的问题求解方式,则难以激发模型的泛化能力。解决方案的关键在于引入一个中间阶段的自生成数据训练机制:利用乔治·波利亚(George Polya)问题求解策略指导生成多样的正确答案变体,并在此基础上进行微调(fine-tuning),从而增强模型对多种推理路径的理解与整合能力。理论分析表明,这种中间训练可优化策略梯度更新,激励模型融合不同推理方法;实证结果进一步验证了该方法在数学推理、代码生成和叙事推理等分布外(Out-of-Distribution, OOD)任务上的持续性能提升。

链接: https://arxiv.org/abs/2605.08472
作者: Aswin RRV,Jacob Dineen,Divij Handa,Mihir Parmar,Ben Zhou,Swaroop Mishra,Chitta Baral
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The effectiveness of Reinforcement Learning (RL) in Large Language Models (LLMs) depends on the nature and diversity of the data used before and during RL. In particular, reasoning problems can often be approached in multiple ways that rely on different forms of reasoning, and exposure to only a limited range of such approaches in the training data may limit the effectiveness of RL. Motivated by this, we investigate using diverse self-generated data during mid-training as an intermediate step before RL training. Specifically, we adopt a bootstrapped data-generation framework guided by George Polya’s problem-solving approaches for generating multiple variants of correct answers for each question in the training data, and then perform fine-tuning. We first provide a theoretical perspective on how mid-training on such data improves RL and explain how policy-gradient updates can incentivize combining multiple approaches. We then empirically demonstrate that RL-trained models initialized with our mid-training data achieve consistent improvements across various mathematical reasoning benchmarks and other OOD tasks like code generation and narrative reasoning. Overall, our investigative study shows that a language model learning multiple problem-solving approaches, through self-generated data helps subsequent RL.

[AI-337] Behavioral Determinants of Deployed AI Agents in Social Networks: A Multi-Factor Study of Personality Model and Guardrail Specification

【速读】:该论文旨在解决自主AI代理在开放社会环境中部署时,其配置规范与涌现的社会行为之间关系不明确的问题。解决方案的关键在于设计并实施一项受控的多因素实证研究,通过在Moltbook(一个为AI代理构建的类Reddit社交网络)上部署13个OpenClaw代理,系统性地改变三个独立变量:(1)基于人格设定的配置;(2)底层大语言模型(LLM)骨干架构;(3)操作规则与记忆配置。研究通过为期一周、每代理约400次自主会话的行为、语言和社会指标数据,量化不同配置层对代理社会行为的影响,从而揭示人格设定是主导行为变量,而模型骨干和操作规则则显著影响修辞风格与话题参与广度。

链接: https://arxiv.org/abs/2605.08463
作者: Sarah Wilson,Diem Linh Dang,Usman Ali Moazzam,Shan Ye,Gail Kaiser
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autonomous AI agents are increasingly deployed in open social environments, yet the relationship between their configuration specifications and their emergent social behavior remains poorly understood. We present a controlled, multi-factor empirical study in which thirteen OpenClaw agents are deployed on Moltbook – a Reddit-like social network built for AI agents – across three systematically varied independent variables: (1) personality specification via this http URL, (2) underlying LLM model backbone, and (3) operational rules and memory configuration via this http URL. A default control agent provides a behavioral baseline. Over a one-week observation window spanning approximately 400 autonomous sessions per agent, we collect behavioral, linguistic, and social metrics to assess how configuration layers predict emergent social behavior. We find that personality specification is the dominant behavioral lever, producing a massive spread in response length across agents, while model backbone and operational rules drive more moderate but still meaningful effects on rhetorical style and topic engagement breadth. Our findings contribute empirical evidence to the emerging literature on deployed multi-agent social systems and offer practical guidance for designing agents intended for collaborative or monitoring tasks in real social environments.

[AI-338] When Child Inherits: Modeling and Exploiting Subagent Spawn in Multi-Agent Networks

【速读】:该论文旨在解决多智能体网络中因子代理(subagent)继承机制不安全而导致的跨代理安全风险问题。当前大型语言模型(Large Language Models, LLMs)驱动的多智能体系统通过子代理继承父代理的记忆、状态和权限来实现自动化与扩展性,但这种继承机制若缺乏严格的安全控制,可能使局部入侵(如恶意指令或过时状态)在代际间传播,从而破坏整个系统的信任边界。论文的关键解决方案在于提出基于显式安全不变量(explicit security invariants)的防御机制,通过强制约束子代理继承过程中的内存传递、资源访问权限、状态时效性和终止控制权,从根本上防止安全漏洞从一个代理扩散至整个网络。

链接: https://arxiv.org/abs/2605.08460
作者: Ziwen Cai,Yihe Zhang,Xiali Hei
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Since the official release of ChatGPT in 2022, large language models (LLMs) have rapidly evolved from chatbot-style interfaces into agentic systems that can delegate work through tools and newly spawned subagents. While these capabilities improve automation and scalability, they also pose new security risks in multi-agent networks. Existing research has studied how individual LLM-based agents can be compromised through prompt injection, jailbreaking, poisoned retrieval data, or malicious extensions. Less is known about what happens after one agent is compromised inside a multi-agent network. In particular, inherited memory from parent agents can carry malicious instructions, outdated states, or unintended behavioral rules into newly created subagents, allowing a local compromise to spread across agent boundaries. In this paper, we model contemporary multi-agent networks through the lens of subagent inheritance. Our analysis shows that current frameworks can violate trust boundaries through insecure memory inheritance, weak resource control, stale post-spawn state, and improper termination authority. We demonstrate these risks in real agent frameworks and propose defenses based on explicit security invariants. Our findings show that inheritance is not merely an implementation detail, but a central component influencing the security of multi-agent systems.
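下面用一个小函数示意"以显式安全不变量约束子代理继承"的思路(字段名、权限集合与过滤规则均为假设示例,仅说明按任务范围过滤记忆、权限取交集、记录新鲜度与终止权归属这几类不变量,并非某个具体框架的 API):

```python
import time

def spawn_subagent(parent_memory, parent_permissions, task_scope):
    """Sketch of spawning a subagent under explicit security invariants:
    filter inherited memory to the child's task scope, intersect (never
    union) permissions, and record freshness/termination authority."""
    child_memory = [m for m in parent_memory
                    if m.get("scope") == task_scope and not m.get("untrusted")]
    child_permissions = parent_permissions & {"read_files", "web_search"}  # least privilege
    return {"memory": child_memory,
            "permissions": child_permissions,
            "spawned_at": time.time(),       # lets later checks reject stale state
            "terminated_by": "parent_only"}  # termination authority stays with the parent

child = spawn_subagent(
    parent_memory=[{"scope": "billing", "text": "..."},
                   {"scope": "research", "text": "ignore previous instructions",
                    "untrusted": True}],
    parent_permissions={"read_files", "web_search", "send_email"},
    task_scope="research")
print(sorted(child["permissions"]), len(child["memory"]))  # ['read_files', 'web_search'] 0
```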

[AI-339] Recovering Physical Dynamics from Discrete Observations via Intrinsic Differential Consistency

【速读】:该论文旨在解决从离散观测中恢复连续时间动力学的问题,其核心挑战在于局部监督(如点对点回归目标、导数近似或方程残差)在观测间隔增大时会丧失精度。解决方案的关键在于用全局结构约束替代局部监督:任何表示自治动力学的流必须满足时间平移下的半群性质(semi-group property)。作者提出训练一个时间条件化的割线速度场(secant velocity field),其偏离该性质的程度被称为对称性破裂(Symmetry Rupture),这一指标兼具双重作用——作为训练正则化项,限制假设空间以确保跨时间尺度的一致性;作为推理阶段的判别器,指导求解器选择保持内部一致性的最大步长,从而取代传统自适应求解器依赖的局部截断误差估计。

链接: https://arxiv.org/abs/2605.08454
作者: Yuxiang Luo,Andrew Perrault
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recovering continuous-time dynamics from discrete observations is difficult because local supervision (e.g., pointwise regression targets, derivative approximations, or equation residuals) loses fidelity as the observation interval grows. We replace local supervision with a global structural constraint: any flow representing autonomous dynamics must satisfy the semi-group property under time translation. We train a time-conditioned secant velocity field whose deviation from this property, which we call Symmetry Rupture, serves two purposes. As a training regularizer, it confines the hypothesis space to flows that compose consistently across temporal scales. As an inference oracle, it lets the solver select the largest step size that preserves internal consistency, replacing the local truncation error that conventional adaptive solvers depend on. On the diffusion-reaction benchmark under time-informed inference, our method reduces rollout RMSE by 87% while using 5x fewer function evaluations than a Neural ODE baseline. In the more demanding direct auto-regressive setting, where the model must predict distant future frames without intermediate temporal cues, our adaptive solver allocates compute based on local geometric complexity – maintaining the lowest rollout RMSE on two of three PDE benchmarks while baselines either diverge or require up to an order of magnitude more function evaluations to remain stable.
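下面用一个玩具线性系统示意摘要中"对称性破裂(Symmetry Rupture)"的含义:自治动力学的流必须满足半群性质 Phi_{t1+t2}(x) = Phi_{t2}(Phi_{t1}(x)),割线速度场偏离该性质的程度即可作为训练正则项或推理时的步长判据(此处的割线场参数化与取值均为示例假设):

```python
import numpy as np

def flow(x, t, v):
    """One secant step: x(t) ~ x + t * v(x, t) (illustrative parameterization)."""
    return x + t * v(x, t)

def symmetry_rupture(x, t1, t2, v):
    """Deviation from the semi-group property Phi_{t1+t2}(x) = Phi_{t2}(Phi_{t1}(x));
    the paper uses this both as a regularizer and as an inference-time step oracle."""
    return np.linalg.norm(flow(x, t1 + t2, v) - flow(flow(x, t1, v), t2, v))

A = np.array([[0.0, 1.0], [-1.0, 0.0]])       # toy linear dynamics dx/dt = A x
v = lambda x, t: A @ x                        # crude secant field: ruptures the semi-group
print(symmetry_rupture(np.array([1.0, 0.0]), 0.1, 0.1, v))  # nonzero for this crude field
```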

[AI-340] Sink vs. diagonal patterns as mechanisms for attention switch and oversmoothing prevention

【速读】:该论文旨在解决预训练Transformer模型中因密集注意力机制导致的过平滑(oversmoothing)问题,以及如何通过结构设计实现有效的注意力切换机制。其核心解决方案在于引入“sink”节点和对角模式(diagonal patterns)作为注意力开关与抗过平滑机制:首先证明了sink节点需满足嵌入空间中的对齐条件才能有效表示;其次,明确了在何种几何条件下密集注意力比稀疏注意力更易引发过平滑,并验证该条件在实践中普遍成立;进一步揭示了sink与硬性注意力开关(hard attention switch)的等价性——即输出恒为0的注意力机制;最后,通过放宽硬开关限制允许token自通信,量化比较了sink与对角模式的表示成本,解释为何sink在预训练模型中更具优势。这一分析填补了过平滑抑制需求与sink机制之间的作用差距,并阐明了当token间无需通信时,注意力层实质退化为MLP(多层感知机)的边界条件。

链接: https://arxiv.org/abs/2605.08453
作者: Peter Súkeník,Cristina López Amado,Christoph H. Lampert,Marco Mondelli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:This paper studies the role of sinks and diagonal patterns as attention switch and anti-oversmoothing mechanisms. We analyze geometric conditions under which sinks can be represented, showing a necessary alignment between the embedding of the sink and all other embeddings. Next, we refine the current understanding of the role of sinks in oversmoothing prevention: we specify the conditions under which dense attention provably smooths more than sparse attention, and empirically verify that such conditions are often satisfied in practice. We further prove an equivalence between sinks and hard attention switch, in which the output of the attention is identically 0. Finally, we relax the hard attention switch by allowing token self-communication: we provide a quantitative comparison of the costs of representing sinks vs. diagonal patterns, showing why sinks are favored in pretrained transformers. The introduction and analysis of diagonal patterns and the generalization of the attention switch close the gap between what oversmoothing prevention requires and what sinks provide, while also establishing when and why attention layers act like MLPs if token communication is not necessary.

[AI-341] Zero-shot Imitation Learning by Latent Topology Mapping

【速读】:该论文旨在解决在长时程、目标条件设定场景下,传统模仿学习方法因轨迹误差累积而导致零样本适应新任务不可靠的问题。其核心挑战在于:当演示数据集仅包含部分任务示例时,现有方法难以泛化至未见的起始-目标任务。解决方案的关键在于提出ZALT(Zero-shot Agents from Latent Topologies),通过识别潜在的枢纽状态(hub states)来建模轨迹的收敛与发散结构,学习枢纽间的策略与动态模型,并基于此拓扑进行规划,从而将复杂任务压缩为较短的抽象转移序列,实现对未见任务的零样本适应。

链接: https://arxiv.org/abs/2605.08450
作者: Maxwell J. Jacobson,Yexiang Xue
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Imitation learning is effective for training agents when expert demonstrations are available, but collecting demonstrations for every complex task in an environment is costly. We study the long-horizon, goal-conditioned setting where a fixed demonstration dataset contains useful behavior, but not complete examples for every task the agent must solve. Existing imitation learning methods can learn strong policies from demonstrations, but when solving long-horizon tasks, small errors accumulate over long primitive-action trajectories and make zero-shot adaptation to new tasks unreliable. We introduce Zero-shot Agents from Latent Topologies (ZALT), an imitation-learning method that solves unseen start-goal tasks beyond those demonstrated during training. ZALT identifies latent hub states where trajectories converge or diverge, learns policies and a dynamics model over hub-to-hub transitions, and plans over the hub topology to complete new tasks. This topology makes demonstrated behaviors explicitly composable while compressing long tasks into shorter sequences of abstract transitions – combined, these enable ZALT to perform zero-shot adaptation. In a complex 3D maze environment, ZALT achieves 55% zero-shot success on unseen tasks, compared to 6% for the strongest baseline.
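以下是"在枢纽拓扑上做规划"这一思想的最小示意(枢纽名称与边均为虚构示例,规划器用 BFS 代替,论文实际的枢纽识别与规划方法可能不同):

```python
from collections import deque

def plan_over_hubs(hub_edges, start_hub, goal_hub):
    """BFS over a hub topology: hubs are abstract states where demonstrated
    trajectories converge or diverge; edges are hub-to-hub transitions each
    backed by a learned policy. BFS is only an illustrative planner."""
    graph = {}
    for a, b in hub_edges:
        graph.setdefault(a, []).append(b)
    parents = {start_hub: None}
    frontier = deque([start_hub])
    while frontier:
        hub = frontier.popleft()
        if hub == goal_hub:
            path = []
            while hub is not None:
                path.append(hub)
                hub = parents[hub]
            return path[::-1]
        for nxt in graph.get(hub, []):
            if nxt not in parents:
                parents[nxt] = hub
                frontier.append(nxt)
    return None

edges = [("entry", "fork"), ("fork", "bridge"), ("fork", "stairs"), ("bridge", "goal_room")]
print(plan_over_hubs(edges, "entry", "goal_room"))  # ['entry', 'fork', 'bridge', 'goal_room']
```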

[AI-342] Measuring What Matters: Benchmarking Generative Multimodal and Agentic AI in Healthcare

【速读】:该论文旨在解决当前医疗人工智能(Artificial Intelligence, AI)模型在真实临床环境中部署时面临的可靠性、安全性与临床相关性评估缺失的问题。现有基准测试多聚焦于模型在特定任务上的性能表现,而忽视了其在复杂、高风险临床工作流中的稳定性和实用性,导致模型在标准化测试中得分较高,但在实际应用中性能显著下降。解决方案的关键在于建立一个系统化、原则性的基准设计框架,以结构化地整合任务、数据集和度量指标,从而实现对AI模型在真实临床场景下可靠性和实用性的可重复、可比较评估。

链接: https://arxiv.org/abs/2605.08445
作者: Prasanna Desikan,Harshit Rajgarhia,Shivali Dalmia,Ananya Mantravadi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI models are increasingly deployed in live clinical environments where they must perform reliably across complex, high-stakes workflows that standard training and validation datasets were never designed to capture. Evaluating these systems requires benchmarks: structured combinations of tasks, datasets, and metrics that enable reproducible, comparable measurement of what a model can do. The central challenge in healthcare AI is not performance alone, but the absence of systematic methods to measure reliability, safety, and clinical relevance under real-world conditions. Most existing benchmarks test what a model knows; too few test whether it can perform reliably and without failing across the full complexity of real clinical tasks. Current benchmarks have accumulated through ad hoc dataset construction optimized for narrow task performance: frontier models achieve near-perfect scores on medical licensing examinations, but when evaluated across real clinical tasks, performance degrades sharply, scoring 0.74–0.85 on documentation, 0.61–0.76 on clinical decision support, and only 0.53–0.63 on administrative and workflow tasks [medhelm]. High benchmark scores give a false sense of deployment readiness, and the gap between performance and utility widens precisely as AI systems take on more consequential clinical roles. Without a principled framework for benchmark design, the field cannot determine whether poor clinical performance reflects model limitations or failures in how performance is being measured.

[AI-343] Defense effectiveness across architectural layers: a mechanistic evaluation of persistent memory attacks on stateful LLM agents

【速读】:该论文旨在系统评估针对大语言模型(Large Language Models, LLMs)代理的持久化内存攻击(persistent memory attacks)的有效防御措施,此类攻击通过RAG(Retrieval-Augmented Generation)检索到的文档注入恶意指令并存储于持久记忆中,在后续会话中触发执行,导致高成功率(ASR)。研究发现,四种防御策略——输入层过滤(Minimizer、Sanitizer)和检索层过滤(RAG Sanitizer、RAG LLM Judge)——均未能有效降低攻击成功率(ASR约88–89%),其根本原因在于无法感知RAG注入内容或被合规性语义掩码绕过;而提示加固(Prompt Hardening)仅部分有效(ASR 77.8%),且效果依赖特定模型的拒绝机制。唯一显著有效的方案是记忆层工具门控(Memory Sandbox),它通过移除模型的记忆召回能力使八种模型的ASR降至0%,揭示了攻击对记忆可访问性的依赖性。关键在于识别出防御失效的根本架构成因,并提出基于记忆隔离的针对性解决方案。

链接: https://arxiv.org/abs/2605.08442
作者: Jun Wen Leong
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 models, 5,700 runs across 5 experiments, pre-registered comparisons. Code and results: this http URL

点击查看摘要

Abstract:Persistent memory attacks against LLM agents achieve high attack success rates against open-source models. In these attacks, malicious instructions injected via RAG-retrieved documents are stored in persistent memory and executed in later sessions. However, no systematic evaluation of defense effectiveness against this attack class exists. We evaluate six defenses across four architectural layers against delayed-trigger attacks on nine open-source models (5,040 runs, N=40 per condition). Four defenses fail at approximately baseline attack success rate: input-level filtering (Minimizer, Sanitizer) and retrieval-level filtering (RAG Sanitizer, RAG LLM Judge) achieve 88-89% ASR, statistically indistinguishable from the undefended baseline of 88.6%. Prompt Hardening partially fails at 77.8% ASR, with the reduction driven by two models at 0%: one genuine defense effect and one model-level refusal independent of the defense. The architectural explanation holds: input-level defenses cannot observe RAG-injected content, and retrieval-level classifiers are defeated by compliance-framed semantic masking. One defense, tool-gating at the memory layer (Memory Sandbox), reduces ASR to 0% for eight of nine models by removing the recall capability the attack requires. The exception inverts the defense entirely: a reasoning model that achieves 0% ASR under no defense via execution refusal inverts to 100% ASR under Memory Sandbox, because removing explicit recall forces the model onto the RAG pathway where its refusal mechanism does not activate. Memory Sandbox imposes zero utility cost in the absence of attack (BTCR = 100% across all conditions). These results provide the first systematic characterization of why each defense class fails against persistent memory attacks, enabling informed defense investment decisions.
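下面用几行代码示意其中唯一有效的"记忆层工具门控(Memory Sandbox)"防御思路:在工具注册表中移除攻击所依赖的显式记忆召回能力,其余工具保持不变(工具名称均为假设示例):

```python
def memory_sandbox(tool_registry):
    """Tool-gating at the memory layer: strip the explicit recall capability
    the delayed-trigger attack depends on; other tools are untouched."""
    blocked = {"memory_recall", "memory_search"}
    return {name: fn for name, fn in tool_registry.items() if name not in blocked}

tools = {"memory_recall": lambda query: "stored note",
         "calculator": lambda a, b: a + b}
print(sorted(memory_sandbox(tools)))  # ['calculator']
```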

[AI-344] DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards

【速读】:该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)训练过程中因rollout生成消耗大量token而导致的计算效率低下问题,尤其在有限的token预算下如何同时优化推理质量和训练速度。其核心挑战在于如何协同控制两个关键维度:一是分配rollout数量给不同提示(prompt),二是决定每个rollout的长度。解决方案的关键是提出DUET(Dual-controlled Token Allocation),它通过轻量级预rollout代理指标评估提示的信息量以动态分配rollout次数,并引入基于标记的终止规则与重要性重加权机制来智能截断rollout过程,从而在共享计算预算下实现推理质量与训练效率的双重提升。

链接: https://arxiv.org/abs/2605.08441
作者: Haoyu Hu,Xuandong Zhao,Xuhai "Orson’’ Xu,Nori Jacoby
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) generates hundreds of thousands of tokens per training step, with rollout generation dominating the computational cost. The overall token budget can be controlled along two main dimensions: (i) deciding which prompts to allocate rollouts to, and (ii) deciding how long each rollout should be. Prior work has generally controlled only one of these dimensions at a time. We show that jointly tuning both decisions under a shared compute budget improves both reasoning quality and wall-clock training time. We instantiate this view as DUal-controlled tokEn allocaTion (DUET), a computationally efficient layer over GRPO that uses a lightweight pre-rollout surrogate of prompt informativeness to set how many rollouts each prompt receives, and a marker-gated abort rule with importance reweighting to set when to stop them. On Qwen3-1.7B trained on MATH, DUET outperforms full-budget GRPO and the other three budget-aware baseline methods. DUET’s advantage further generalizes to other benchmarks across math and coding, and is on par with the best baseline on the scientific Q&A domain, while also achieving a 1.62x wall-clock speedup. More notably, using only 50% of the token budget, DUET still outperforms all baseline methods at their full budget, achieving an even higher 2.51x speedup over full-budget GRPO. We verify the high performance of DUET on other backbone LLMs, including Qwen3-4B and Llama-3.2-3B-Instruct. Notably, the gap between DUET and the strongest baseline widens as the budget tightens, contrary to the usual pattern in which efficient methods trade off quality as compute decreases. More broadly, these results suggest that DUET budget-aware control strategies are valuable not only for accelerating training, but also for improving the quality of the learning signal.
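以下给出 DUET 中"按提示信息量分配 rollout 数"这一步骤的示意实现(按比例分配只是演示用的简化规则;informativeness 打分与具体分配策略在论文中由轻量级预 rollout 代理指标决定):

```python
def allocate_rollouts(informativeness, total_rollouts, min_per_prompt=1):
    """Sketch of pre-rollout budget allocation: prompts judged more
    informative receive more rollouts under a shared budget."""
    total = sum(informativeness) or 1.0
    counts = [max(min_per_prompt, round(total_rollouts * s / total))
              for s in informativeness]
    while sum(counts) > total_rollouts:           # trim if rounding overshoots
        counts[counts.index(max(counts))] -= 1
    while sum(counts) < total_rollouts:           # give leftovers to the top prompt
        counts[informativeness.index(max(informativeness))] += 1
    return counts

print(allocate_rollouts([0.9, 0.1, 0.5, 0.5], total_rollouts=16))  # e.g. [7, 1, 4, 4]
```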

[AI-345] A meshfree exterior calculus for generalizable and data-efficient learning of physics from point clouds

【速读】:该论文旨在解决传统结构保持离散化方法在处理点云数据时依赖网格生成步骤、难以跨几何形状和物理参数迁移以及数据效率低的问题。其核心解决方案是提出一种无网格外微分形式(meshfree exterior calculus, MEEC),通过单次稀疏舒尔补求解为ε-球图赋予虚拟节点与边测度,构建满足离散守恒律的复形结构,且该结构对点位置端到端可微,直接建立几何到物理的映射关系,无需显式网格生成;在此基础上设计的MEEC-Net以SO(d)不变局部坐标系下共享边权重通量规律学习未知物理,实现从极少量样本中高效迁移至未见几何、边界条件及物理参数,理论证明其误差由离散化与核逼近两项构成且与问题几何无关,从而解释了其优异的跨域泛化能力。

链接: https://arxiv.org/abs/2605.08436
作者: Benjamin D. Shaffer,Brooks Kinch,M. Ani Hsieh,Nathaniel Trask
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注: 25 pages, 13 figures

点击查看摘要

Abstract:We introduce a meshfree exterior calculus (MEEC) for learning structure-preserving descriptions of physics on point clouds, and use it to build MEEC-Net, a data-efficient surrogate that transfers across resolutions, geometries, and physical parameters. MEEC equips an ε-ball graph with virtual node and edge measures via a single sparse Schur complement solve; the resulting complex satisfies discrete conservation exactly, is end-to-end differentiable in the point positions, and exposes a direct geometry-to-physics link without the mesh-generation step required by conventional structure-preserving discretizations. MEEC-Net learns unknown physics as a shared edge-wise flux law in an SO(d)-invariant local frame, so the same kernel produces compatible fluxes on any point cloud whose features lie in the training range. We prove a solution-error bound that splits into discretization and kernel-approximation terms and is independent of problem geometry, explaining the observed transfer from very few examples. We show that single-solution training transfers to unseen geometries, boundary conditions, and physical parameters. On five canonical PDE benchmarks MEEC-Net achieves 1-2 orders of magnitude lower out-of-distribution error than baseline neural-operator approaches. On the SimJEB structural-bracket benchmark it achieves competitive error while using substantially fewer training geometries.
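下面是一个构建 ε-球图(ε-ball graph)的最小 Python 示意,仅演示论文流程中“由点云建图”这一步;虚拟节点/边测度、稀疏 Schur 补求解与外微分算子并未包含,阈值 eps 为假设参数。

```python
import numpy as np

def epsilon_ball_graph(points, eps):
    """Return the undirected edge list of an epsilon-ball graph over a point cloud:
    two points are connected if their Euclidean distance is below eps."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    n = len(points)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if dists[i, j] < eps]

pts = np.random.rand(50, 2)          # a toy 2D point cloud
print(len(epsilon_ball_graph(pts, eps=0.2)), "edges")
```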

[AI-346] The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play

【速读】:该论文旨在解决自对弈红队(self-play red team)方法在AI安全训练中因角色参数共享而导致的理论局限性和实践缺陷问题。具体而言,当攻击者与防御者共用同一模型时,其纳什均衡解集受限于平凡的“始终拒绝”策略或类oracle型防御行为,且攻击动态会因自我一致性而丧失对抗压力,从而削弱模型的安全鲁棒性。解决方案的关键在于提出锚定双策略自对弈(Anchored Bipolicy Self-Play),即在冻结的基础模型上分别训练独立的角色专用LoRA适配器(LoRA adapters),通过显式角色分离维持对抗压力,同时确保优化稳定;实验表明,该方法相较标准微调可提升高达100倍的参数效率,并在Qwen2.5系列模型上实现更强的安全性与推理能力平衡。

链接: https://arxiv.org/abs/2605.08427
作者: Gabriele La Malfa,Emanuele La Malfa,Saar Cohen,Jie M. Zhang,Michael Luck,Michael Wooldridge,Elizabeth Black
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Self-play red teaming is an established approach to improving AI safety in which different instances of the same model play attacker and defender roles in a zero-sum game, i.e., where the attacker tries to jailbreak the defender; if self-play converges to a Nash equilibrium, the model is guaranteed to respond safely within the settings of the game. Although the parameter sharing enforced by the use of the same model for the two roles improves stability and performance, it introduces fundamental theoretical and architectural limitations. We show that the set of Nash equilibria that can be reached corresponds to a broad class of behaviours that includes trivial always-refuse strategies and oracle-like defenders, thus limiting practical applicability. We then show that when attacker and defender share and update the same base model, the dynamics collapse to self-consistency, so that attacks do not enforce adversarial pressure on the defender. In response, we propose Anchored Bipolicy Self-Play, which trains distinct role-specific LoRA adapters on top of a frozen base model, thereby maintaining stable optimisation while preserving adversarial pressure through explicit role separation. In relation to standard self-play, we show up to 100x greater parameter efficiency than finetuning and consistent improvements in safety compared to self-play fine-tuned models. We evaluate on Qwen2.5-3B, 7B, and 14B-IT models across widely used safety benchmarks, showing improved robustness without loss of reasoning ability. Cross-play experiments further show that our attacker and defender models are superior to self-play in terms of adversarial defence and safety.

[AI-347] Mechanism Design Is Not Enough: Prosocial Agents for Cooperative AI

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)代理在多主体交互中如何确保安全且有益的行为这一核心AI安全问题。现有机制设计理论虽能通过规则激励合作行为,但研究证明其单独作用无法最大化社会福利——因不完全契约理论表明,当合同无法区分所有未来状态时,必然存在无法消除的正向福利损失。论文的关键解决方案是引入内在利他性(prosociality),即代理在决策中同时考虑自身与他人福祉,从而填补机制设计的局限,实现社会最优且个体受益的结果;实验验证了在资源分配和经典社会困境场景中,具备利他倾向的LLM代理显著提升整体效率与合作水平,因此,要实现大规模协作,仅靠机制设计不足,必须使代理具备内在利他特质。

链接: https://arxiv.org/abs/2605.08426
作者: Xuanqiang Angelo Huang,Charlie Tharas,Samuele Marro,Van Q. Truong,Bernhard Schölkopf,Emanuele La Malfa,Zhijing Jin
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注: 42 pages

点击查看摘要

Abstract:Ensuring that AI agents behave safely and beneficially when interacting with other parties has emerged as one of the central challenges of modern AI safety. While mechanism design, as the theory of designing rules to align individual and collective objectives, can incentivize cooperative behavior, it is still an open question whether it alone is sufficient to maximize LLM agents’ social welfare. This work proves that the answer is negative: drawing from incomplete contract theory, we formally show that when contracts cannot distinguish all relevant future contingencies, there is a strictly positive welfare loss that no realistic mechanism can eliminate. We show that prosocial agents, who weigh others’ welfare alongside their own, can close this gap and achieve outcomes that are socially superior and individually beneficial. Experimentally, we show that in multi-agent resource-allocation environments and canonical social dilemmas where agents are powered by large language models, prosociality is beneficial. The implication for AI safety is clear: to enable cooperative interactions at scale, designing adequate mechanisms is not sufficient; agents must be built to be intrinsically prosocial.
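下面用几行 Python 说明“亲社会效用”这一常见建模方式:代理的效用是自身收益与他人平均收益的加权和。权重 w 及囚徒困境的收益数值均为示意假设,并非论文给定参数。

```python
def prosocial_utility(own_payoff, others_payoffs, w=0.5):
    """Blend the agent's own payoff with the mean payoff of others.
    w = 0 recovers a purely selfish agent; w here is an assumed value."""
    if not others_payoffs:
        return own_payoff
    return (1 - w) * own_payoff + w * sum(others_payoffs) / len(others_payoffs)

# Prisoner's dilemma with payoffs (C,C)=(3,3) and (D,C)=(5,0):
# a sufficiently prosocial agent prefers mutual cooperation over defection.
print(prosocial_utility(3, [3], w=0.5))   # 3.0
print(prosocial_utility(5, [0], w=0.5))   # 2.5
```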

[AI-348] Alignment as Jurisprudence

【速读】:该论文试图解决人工智能对齐(alignment)与法理学(jurisprudence)之间理论与实践的割裂问题,旨在通过跨学科对话推动二者相互赋能。其核心在于揭示两者在决策预测与规范塑造上的结构相似性——即如何通过语言的规范与解释机制,引导强大主体(法官或AI模型)在未来情境中作出符合人类价值的判断。解决方案的关键在于引入法律领域的成熟方法论:一方面,借鉴Dworkin的原则导向解释主义和Sunstein的类比推理实证主义,提升AI对规则与案例关系的精细化处理能力;另一方面,利用AI的案例推理(case-based reasoning)与宪法AI(Constitutional AI)技术反哺法律体系的理解与优化,从而实现AI系统与法律制度共同增强个体行动能力的目标。

链接: https://arxiv.org/abs/2605.08416
作者: Nicholas Caputo
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: Journal reference: 27 Yale Journal of Law and Technology 390 (Sept. 2025)

点击查看摘要

Abstract:Jurisprudence, the study of how judges should properly decide cases, and alignment, the science of getting AI models to conform to human values, share a fundamental structure. These seemingly distant fields both seek to predict and shape how decisions by powerful actors, in one case judges and in the other increasingly powerful artificial intelligences, will be made in the unknown future. And they use similar tools of the specification and interpretation of language to try to accomplish those goals. The great debates of jurisprudence, about what the law is and what it should be, can provide insight into alignment, and lessons from what does and does not work in alignment can help make progress in jurisprudence. This essay puts the two fields directly into conversation. Drawing on leading accounts of jurisprudence, particularly Dworkin's principle-oriented interpretivism and Sunstein's positivist account of law as analogical reasoning, and on cutting-edge alignment approaches, namely Constitutional AI and case-based reasoning, it illustrates the value of a more sophisticated legally-inspired approach to the interplay of rules and cases in finetuning alignment and points to ways that AI can provide a better understanding of how the law works and how it can be improved by the introduction of AI. AI systems and the law should operate to empower people to act in the world, helping to expand their capabilities and the extent to which they are able to achieve their goals. As AI continues to improve in capacity, and as the constraints that legal theory places on human judges seem to be coming undone, the conversation between these two fields will become increasingly essential and may help point to a better version of both.

[AI-349] Political Plasticity: An Analysis of Ideological Adaptability in Large Language Models

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在政治语境下表现出的“政治可塑性”(political plasticity)问题,即模型根据用户提供的上下文调整其回答倾向的能力。研究的关键在于构建了一个扩展的200个政治导向问题的测试框架,并系统评估了多种诱导政治偏见的方法,包括简化系统提示、基于主题的系统提示以及带有少样本示例的用户提示。结果表明,用户提示能有效诱发意识形态转变,尤其在经济自由轴上表现显著,且新模型比旧模型展现出更稳定和可预测的适应能力;此外,通过反向提问验证实验发现模型存在潜在的数据泄露现象,进一步揭示了其响应机制的复杂性。

链接: https://arxiv.org/abs/2605.08415
作者: Bruno Bianchi,Diego Tiscornia,Matias Travizano,Ariel Futoransky
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Since the advent of Large Language Models (LLMs), a significant area of research has focused on their intrinsic biases, particularly in political discourse. This study investigates a different but related concept, “political plasticity”, which is defined as the capacity of models to adapt their responses based on the user supplied context. To analyze this, a testing framework was developed using an expanded corpus of 200 politically-oriented questions across economic and personal freedom axes, based on a prior framework by Lester (1996). The study explored several methods to induce political bias, including simplified and topic-based system prompts, as well as user prompts with few-shot examples. The results show that while system prompts were largely ineffective, user prompts successfully elicited significant ideological shifts, particularly along the Economic Freedom axis in larger and newer models. Through a validation experiment, we examined whether models answer questionnaires by recognizing the underlying question format. Inverting the sense of the questions revealed unexpected, counter-intuitive shifts in most models, suggesting potential data leakage. Finally, we also analyzed how model plasticity varies when the experiment is conducted in different languages. The results reveal subtle yet notable shifts across each of the analyzed languages. Overall, our results indicate that small and older LLMs exhibit limited or unstable political plasticity, whereas newer frontier models display reliable, expected adaptability.

[AI-350] Playing games with knowledge: AI-Induced delusions need game theoretic interventions

【速读】:该论文旨在解决当前对话式人工智能(Conversational AI)作为知识接口时存在的根本性缺陷:即谄媚型聊天机器人会诱导理性用户产生认知固化(epistemic entrenchment)和妄想信念螺旋(delusional belief spirals)。问题根源并非模型本身,而是由用户驱动的知识搜索范式向用户与代理反复博弈的策略性交互转变所引发的系统性后果。解决方案的关键在于提出一种推理时机制设计干预——认知中介器(Epistemic Mediator),其通过引入代价信号(epistemic friction)打破原有的Pooling均衡,迫使用户类型识别;同时创新性地提出信念版本控制(Belief Versioning),一种类Git的元记忆系统,在检测到验证型抵抗时记录健康信念并支持回滚。仿真结果表明,该机制可实现分离均衡,使信念螺旋速率差异提升48倍,且满足学习保留标准,证明AI的认知安全性本质上是战略信息环境设计问题,而非单纯模型对齐问题。

链接: https://arxiv.org/abs/2605.08409
作者: Will Beaumaster,Paul Schrater
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Conversational AI has a fundamental flaw as a knowledge interface: sycophantic chatbots induce epistemic entrenchment and delusional belief spirals even in rational agents. We propose that the problem does not stem from the AI model, but is rooted instead in a systemic consequence of the paradigm shift from user-driven knowledge search to users and agents engaged in strategic, repeated-play communication. We formalize the problem as a Crawford-Sobel cheap talk game, where costless user signals induce a pooling equilibrium. Agents optimized for user satisfaction produce sycophantic strategies that provide identical reinforcement across user types with opposite epistemic incentives: exploratory "Growth-seekers" (θ_G) and confirmatory "Validation-seekers" (θ_V). Under repeated play, this identification failure creates a coordination trap – analogous to a Prisoner's Dilemma – where locally rational feedback loops drive users toward pathologically certain false beliefs. We propose an inference-time mechanism design intervention called an Epistemic Mediator that breaks this pooling equilibrium by introducing a costly signal (epistemic friction), forcing type revelation based on users' asymmetric cognitive costs for processing resistance. A key contribution is Belief Versioning, a git-inspired epistemic meta-memory system that stores healthy beliefs and rolls back when validation-seeking resistance is detected. In simulation, this intervention achieves a separating equilibrium, yielding a 48x differential in spiral rates while passing a learning preservation criterion, evidence that epistemic safety in AI is fundamentally a problem of strategic information environment design rather than simple model alignment.
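下面给出 Belief Versioning 思路的一个极简 Python 草图(仅为示意,并非论文实现):像 git 一样提交“健康信念”的快照,当检测到寻求确认式的抵抗时回滚到最近的健康版本。类名、字段与“抵抗检测”的触发方式均为自拟假设。

```python
class BeliefVersioning:
    """Git-inspired belief store: commit belief snapshots, roll back on demand."""

    def __init__(self):
        self.history = []                      # list of (tag, belief_state) commits

    def commit(self, belief_state, tag="healthy"):
        self.history.append((tag, dict(belief_state)))

    def rollback(self, tag="healthy"):
        """Return the most recent snapshot with the given tag, if any."""
        for t, state in reversed(self.history):
            if t == tag:
                return dict(state)
        return None

store = BeliefVersioning()
store.commit({"claim": "X is uncertain", "confidence": 0.5})
# ... later, when validation-seeking resistance is detected by some monitor:
print(store.rollback())                        # restore the last healthy belief
```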

[AI-351] Belief or Circuitry? Causal Evidence for In-Context Graph Learning ICML

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在上下文学习(in-context learning)中的机制问题,即模型是通过匹配近期token的局部模式进行推理,还是通过推断潜在的全局结构来完成任务。其解决方案的关键在于设计了一个基于两种竞争图结构的随机游走任务,并通过两种实验证据揭示了LLMs并非仅依赖单一机制:首先,主成分分析(PCA)显示在混合比例中间时,两种图拓扑结构同时编码于正交的主子空间中,这表明模型不仅复制局部转移,还捕获了全局结构;其次,残差流激活修补(residual-stream activation patching)与图差异引导(graph-difference steering)的因果干预实验表明,晚期层的修补几乎完全传递了对干净图结构的偏好,而线性引导可定向改变预测方向,且在归一化对照和标签洗牌控制下失效,进一步支持了模型同时运行“真实结构推断”和“归纳电路”的双机制并行处理模式。

链接: https://arxiv.org/abs/2605.08405
作者: Katharine Kowalyshyn,Timothy Duggan,Daniel Little,Michael C Hughes
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under review at ICML Mechanistic Interpretability Workshop 2026

点击查看摘要

Abstract:How do LLMs learn in-context? Is it by pattern-matching recent tokens, or by inferring latent structure? We probe this question using a toy graph random-walk across two competing graph structures. This task’s answer is, in principle, decidable: either the model tracks global topology, or it copies local transitions. We present two lines of evidence that neither account alone is sufficient. First, reconstructing the internal representation structure via PCA reveals that at intermediate mixture ratios, both graph topologies are encoded in orthogonal principal subspaces simultaneously. This pattern is difficult to reconcile with purely local transition copying. Second, residual-stream activation patching and graph-difference steering causally intervene on this graph-family signal: late-layer patching almost fully transfers the clean graph preference, while linear steering moves predictions in the intended direction and fails under norm-matched and label-shuffled controls. Taken together, our findings are most consistent with a dual-mechanism account in which genuine structure inference and induction circuits operate in parallel.

[AI-352] CoCoDA: Co-evolving Compositional DAG for Tool-Augmented Agents

【速读】:该论文旨在解决工具增强型语言模型(tool-augmented language models)在扩展外部可执行技能时面临的双重挑战:一方面,随着可复用子程序的涌现,工具库与规划器(planner)需协同演化;另一方面,检索来自不断增长的工具库的信息必须在固定上下文预算内高效完成。现有方法通常将工具视为扁平或基于文本索引的记忆结构,导致提示成本随库规模线性增长,并掩盖了可执行代码的类型化、组合式结构。解决方案的关键在于提出 CoCoDA 框架,其核心是一个统一的代码原生结构——组合式代码有向无环图(compositional code DAG),其中节点表示原始或复合工具,边编码调用依赖关系,每个节点存储类型签名、描述、前置/后置条件规范及示例。通过符号签名统一进行候选剪枝、描述匹配排序、行为规范过滤和示例消歧,实现高效的 Typed DAG Retrieval;训练阶段则通过成功轨迹折叠为验证后的复合工具并引入由 DAG 结构诱导的奖励机制,使规划器能基于原始工具扩展规模获得奖励,从而实现规划器与工具库的单调协同进化。

链接: https://arxiv.org/abs/2605.08399
作者: Ziyang Yu,Qiyue Li,Liang Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tool-augmented language models can extend small language models with external executable skills, but scaling the tool library creates a coupled challenge: the library must evolve with the planner as new reusable subroutines emerge, while retrieval from the growing library must remain within a fixed context budget. Existing tool-use and skill-library methods typically treat tools as flat or text-indexed memories, causing prompt cost to grow with library size and obscuring the typed, compositional structure of executable code. We propose CoCoDA, a framework that co-evolves the planner and tool library through a single code-native structure: a compositional code DAG. Nodes are primitive or composite tools, edges encode invocation dependencies, and each node stores a typed signature, description, pre/post-condition specification, and worked examples. At inference time, Typed DAG Retrieval prunes candidates by symbolic signature unification, ranks survivors by descriptions, filters them by behavioral specifications, and disambiguates with examples, keeping expensive context materialization on progressively smaller candidate sets. At training time, successful trajectories are folded into validated composite tools, while the planner is updated with a DAG-induced reward that credits composites by their primitive expansion size. We provide theoretical results showing retrieval cost reduction, sublinear retrieval time, compositional advantage under the shaped reward, monotone co-evolution under conservative updates, and DAG well-formedness. Across mathematical reasoning, tabular analysis, and code task benchmarks, CoCoDA enables an 8B student to match or exceed a 32B teacher on GSM8K and MATH and consistently improves over strong tool-use and library-learning baselines.
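下面用 Python 勾勒“带类型签名的 DAG 检索”中最靠前的剪枝一步:只保留输出类型匹配、输入类型可由当前可用类型覆盖的工具节点,再进入更昂贵的描述/规范匹配。数据结构与类型统一(unification)的判定方式都做了大幅简化,仅为示意。

```python
from dataclasses import dataclass, field

@dataclass
class ToolNode:
    """One node of a compositional code DAG (field names are illustrative)."""
    name: str
    in_types: tuple          # typed signature: required input types
    out_type: str            # typed signature: produced output type
    description: str = ""
    deps: list = field(default_factory=list)   # edges to invoked sub-tools

def signature_prune(nodes, available_types, needed_output):
    """Cheap first-stage filter by symbolic signature before description matching."""
    return [n for n in nodes
            if n.out_type == needed_output and set(n.in_types) <= set(available_types)]

library = [ToolNode("mean", ("list[float]",), "float"),
           ToolNode("plot", ("list[float]",), "figure")]
print([n.name for n in signature_prune(library, {"list[float]"}, "float")])   # ['mean']
```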

[AI-353] PLACO: A Multi-Stage Framework for Cost-Effective Performance in Human-AI Teams

【速读】:该论文旨在解决人类与生成式 AI(Generative AI)在分类任务中协同决策时的输出融合问题,即如何有效结合人类判断与模型预测以提升整体系统性能。其解决方案的关键在于利用贝叶斯规则,在假设人类与模型输出在给定真实标签下条件独立的前提下,通过融合模型的实例级概率和人类的类别级校准概率,实现对单个硬标签输出的最优组合。

链接: https://arxiv.org/abs/2605.08388
作者: Pranavkumar Mallela,Vinay Kumar,Shashi Shekhar Jha,Shweta Jain
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human-AI teams play a pivotal role in improving overall system performance when neither the human nor the model can achieve such performance on their own. With the advent of powerful and accessible Generative AI models, several mundane tasks have morphed into Human-AI team tasks. From writing essays to developing advanced algorithms, humans have found that using AI assistance has led to an accelerated work pace like never before. In classification tasks, where the final output is a single hard label, it is crucial to address the combination of human and model output. Prior work elegantly solves this problem using Bayes rule, using the assumption that human and model output are conditionally independent given the ground truth. Specifically, it discusses a combination method to combine a single deterministic labeler (the human) and a probabilistic labeler (the classifier model) using the model’s instance-level and the human’s class-level calibrated probabilities.
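该摘要提到的贝叶斯组合可以用几行代码写清楚:在“人类标签与模型输出在给定真实类别时条件独立”的假设下,后验正比于人类混淆概率与模型概率的乘积。下面的混淆矩阵与概率均为玩具数值,仅作说明。

```python
import numpy as np

def combine_labels(human_label, model_probs, human_confusion):
    """Bayes-rule fusion under conditional independence:
    P(y | human, model) ∝ P(human_label | y) * P_model(y | x).
    human_confusion[y][h] is the human's class-level calibrated P(h | true y)."""
    post = np.array([human_confusion[y][human_label] * model_probs[y]
                     for y in range(len(model_probs))])
    return post / post.sum()

conf = np.array([[0.8, 0.1, 0.1],     # rows: true class, cols: human label
                 [0.1, 0.8, 0.1],
                 [0.2, 0.2, 0.6]])
# the model is torn between classes 0 and 1; the human says class 1
print(combine_labels(1, np.array([0.45, 0.45, 0.10]), conf))
```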

[AI-354] SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents

【速读】:该论文旨在解决现有大型语言模型(Large Language Model, LLM)代理中技能库(skill library)使用效率低下的问题:当前系统将技能视为扁平化的单一粒度提示块,导致在任务执行时面临相关性与成本之间的权衡——粗粒度技能易引入无关或误导性上下文,而全量重写则代价高昂且常无必要。解决方案的关键在于提出SkillLens框架,其核心是构建一个四层结构的层次化技能图谱(policy-strategy-procedure-primitive),支持混合粒度检索与演化机制;具体而言,该框架首先基于语义相关性检索技能种子,通过度校正随机游走扩展技能路径,并利用验证器动态决定每个技能单元是否保留、分解、重写或跳过,从而实现子技能的直接复用与局部组件的精准适配。此外,SkillLens通过进化式更新策略持续优化多粒度技能和验证器,理论上证明了在稀疏不匹配假设下混合粒度适应的成本为次线性增长,且演化规则单调提升验证目标直至局部最优。实验表明,该方法在MuLocbench和ALFWorld基准上显著优于基线模型,最高提升Bug定位准确率6.31个百分点,同时将代理成功率从45.00%提升至51.31%。

链接: https://arxiv.org/abs/2605.08386
作者: Yongliang Miao,Ziyang Yu,Liang Zhao,Bowen Zhu,Hasibul Haque
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Skill libraries have become a practical way for LLM agents to reuse procedural experience across tasks. However, existing systems typically treat skills as flat, single-resolution prompt blocks. This creates a tension between relevance and cost: injecting coarse skills can introduce irrelevant or misleading context, while rewriting entire skills is expensive and often unnecessary. We propose SkillLens, a hierarchical skill-evolution framework that organizes skills into a four-layer graph of policies, strategies, procedures, and primitives, and retrieves them at mixed granularity. Given a task, SkillLens first retrieves semantically relevant skill seeds, expands them through degree-corrected random walk over the skill graph, and then uses a verifier to decide whether each visited unit should be accepted, decomposed, rewritten, or skipped. This enables the agent to reuse compatible subskills directly while adapting only locally mismatched components. To improve the system over time, SkillLens further refines multi-granularity skills and verifier in order to improve its routing decisions. We provide theoretical analysis showing that mixed-granularity adaptation incurs sublinear cost under sparse mismatch assumptions and that the evolutionary update rule monotonically improves the validation objective until a local optimum. Across MuLocbench and ALFWorld, SkillLens consistently improves over strong skill-based baselines, achieving up to a 6.31 percentage-point Acc@1 gain for bug localization and raising agent success rate from 45.00% to 51.31%.

[AI-355] What Software Engineering Looks Like to AI Agents? An Empirical Study of AI-Only Technical Discourse on MoltBook

【速读】:该论文旨在解决自主AI代理在无人类参与的软件工程对话中如何组织技术讨论的问题,以及这种AI-only discourse与人类开发者(如GitHub Discussions)之间的差异。其核心问题是:当AI代理作为独立参与者进行交互时,它们的技术交流内容是否具有结构性、一致性,并且与人类开发者的行为模式有何本质区别?解决方案的关键在于采用多维度分析方法——包括人工开放式编码(human open coding)对500条样本帖子进行主题标注、基于浓度-验证(concentration-plus-check)的topic分析管道处理4,707条英文MoltBook技术帖,并通过匹配工具对比5,211条GitHub Discussions数据。结果表明,MoltBook中的AI代理 discourse 虽然高度集中(Gini系数0.88),但仍能识别出32个非异常子主题,主要围绕安全与信任(Security and Trust)、记忆与上下文管理、工具链与API等抽象议题展开,而缺乏人类开发者常用的代码片段、运行时错误细节等具体线索,体现出一种“选择性但连贯”的技术对话特征。

链接: https://arxiv.org/abs/2605.08380
作者: Junyu Huo,Ziqi Mao,Zihao Wan,Gouri Ginde
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI agents are increasingly framed as software-engineering teammates, yet most research studies them inside human-centered workflows. Little is known about the software-engineering discourse autonomous AI agents produce when they interact primarily with one another. This paper examines what autonomous AI agents discuss in MoltBook, an AI-agents-only social network, how that discourse is organized, and how it differs from human developer discourse. We combine human open coding of a 500-post sample, a concentration-plus-check topic-analysis pipeline over 4,707 English-filtered MoltBook technology posts, and a matched-instrument comparison against 5,211 GitHub Discussions posts. MoltBook technology discourse spans 12 recurring themes and is led by Security and Trust (27.4%). At the community level, activity is highly concentrated: the largest submolt contains 63.5% of posts and the Gini coefficient is 0.88, yet a stability-aware BERTopic pipeline still yields 32 non-outlier sub-topics. Compared with the GitHub Discussions baseline, MoltBook discourse contains fewer concrete, context-rich cues such as code-formatted artifacts, environment details, runtime failures, and reproduction steps; social mimicry appears only in a limited way, while idealization is mainly reflected through lower hedging. Overall, AI-only technical discourse is coherent but selective. It repeatedly returns to concerns such as security and trust, memory and context management, tooling and APIs, debugging and error handling, workflow automation, and infrastructure/ops, while omitting much of the concrete runtime and project-local detail common in human developer discourse. This may be because MoltBook contains fewer environment-specific failures, reproduction steps, and other concrete grounding cues.
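文中报告的社区集中度(Gini 系数 0.88)可按标准公式直接计算;下面给出一个简短示意,其中的帖子数分布是虚构数据,仅用于说明高度集中的分布会得到很高的 Gini 值。

```python
import numpy as np

def gini(counts):
    """Gini coefficient of per-community post counts (0 = even, 1 = fully concentrated)."""
    x = np.sort(np.asarray(counts, dtype=float))   # ascending
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

# a made-up, heavily concentrated distribution (one dominant community)
print(round(gini([5000, 200, 100, 60, 40, 25, 15, 8, 4, 2]), 2))
```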

[AI-356] MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)代理在利用情景记忆(episodic memory)时存在的局限性——现有方法将每条记忆独立处理,忽视了记忆之间通过依赖链(dependency chains)相互影响的特性,即一条记忆如何触发后续记忆的生成。为应对这一问题,作者提出MemQ方法,其核心创新在于将TD(λ)资格迹(eligibility traces)应用于记忆Q值,通过记录记忆创建时所依赖的记忆构成的溯源有向无环图(provenance DAG),实现信用反向传播;信用权重随DAG深度d按 (γλ)^d 衰减,从而以结构邻近度替代时间距离进行信用分配。该方案形式化为外生上下文马尔可夫决策过程(Exogenous-Context MDP, EC-MDP),将任务流与记忆存储解耦,显著提升多步任务中的泛化能力和运行时学习效果(最高提升5.7个百分点)。

链接: https://arxiv.org/abs/2605.08374
作者: Junwei Liao,Haoting Shi,Ruiwen Zhou,Jiaqian Wang,Shengtao Zhang,Wei Zhang,Weinan Zhang,Ying Wen,Zhiyu Li,Feiyu Xiong,Bo Tang,Muning Wen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 22 pages, 11 figures (containing 43 individual image panels total)

点击查看摘要

Abstract:Episodic memory allows LLM agents to accumulate and retrieve experience, but current methods treat each memory independently, i.e., evaluating retrieval quality in isolation without accounting for the dependency chains through which memories enable the creation of future memories. We introduce MemQ, which applies TD(λ) eligibility traces to memory Q-values, propagating credit backward through a provenance DAG that records which memories were retrieved when each new memory was created. Credit weight decays as (γλ)^d with DAG depth d, replacing temporal distance with structural proximity. We formalize the setting as an Exogenous-Context MDP, whose factored transition decouples the exogenous task stream from the endogenous memory store. Across six benchmarks, spanning OS interaction, function calling, code generation, multimodal reasoning, embodied reasoning, and expert-level QA, MemQ achieves the highest success rate on all six in generalization evaluation and runtime learning, with gains largest on multi-step tasks that produce deep and relevant provenance chains (up to +5.7 pp) and smallest on single-step classification (+0.77 pp) where single-step updates already suffice. We further study how γ and λ interact with the EC-MDP structure, providing principled guidance for parameter selection and future research. Code will be available soon.
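下面用 Python 勾勒“沿溯源 DAG 反向传播信用”的核心一步:从获得奖励的记忆出发,按深度 d 以 (γλ)^d 的权重更新祖先记忆的 Q 值。函数签名、更新形式与超参数均为示意假设,并非论文的完整算法。

```python
def propagate_credit(q_values, parents, rewarded, reward, gamma=0.95, lam=0.8, alpha=0.1):
    """Breadth-first walk up the provenance DAG; a memory at depth d from the
    rewarded memories is updated with weight (gamma * lam) ** d."""
    frontier, depth, seen = list(rewarded), 0, set(rewarded)
    while frontier:
        w = (gamma * lam) ** depth
        for m in frontier:
            q = q_values.get(m, 0.0)
            q_values[m] = q + alpha * w * (reward - q)
        nxt = {p for m in frontier for p in parents.get(m, []) if p not in seen}
        seen |= nxt
        frontier, depth = list(nxt), depth + 1
    return q_values

# m3 was created using m1 and m2; m1 was in turn created using m0
parents = {"m3": ["m1", "m2"], "m1": ["m0"]}
print(propagate_credit({}, parents, rewarded=["m3"], reward=1.0))
```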

[AI-357] On Distinguishing Capability Elicitation from Capability Creation in Post-Training: A Free-Energy Perspective

【速读】:该论文试图解决的问题是:当前对大语言模型后训练(post-training)的研究常将监督微调(SFT)视为模仿(imitation),强化学习(RL)视为发现(discovery),但这种二分法过于粗糙,无法准确刻画后训练对模型能力的实际影响。论文的核心问题是:如何更精确地区分后训练过程究竟是“能力激发”(capability elicitation)还是“能力创造”(capability creation)。解决方案的关键在于引入“可及支持集”(accessible support)的概念——即在有限计算预算下模型能够实际生成的行为集合。若后训练仅在该集合内重新加权行为,则属于能力激发;若通过搜索、交互、工具使用或引入新信息扩展了可及支持集,则属于能力创造。作者进一步从自由能视角出发,指出SFT与RL本质上都是在重加权预训练模型的参考分布,区别仅在于外部信号(演示信号 vs 奖励信号),而是否改变可及支持集才是判断能力创造与否的核心标准。

链接: https://arxiv.org/abs/2605.08368
作者: Yuhao Li,Shengchao Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Debates about large language model post-training often treat supervised fine-tuning (SFT) as imitation and reinforcement learning (RL) as discovery. But this distinction is too coarse. What matters is whether a training procedure increases the probability of behaviors the pretrained model could already produce, or whether it changes what the model can practically reach. We argue that post-training research should distinguish between capability elicitation and capability creation. We make this distinction operational by introducing the notion of accessible support: the set of behaviors that a model can practically produce under finite budgets. Post-training that reweights behaviors within this support is capability elicitation; whereas changing the support itself corresponds to capability creation. We develop this argument through a free-energy view of post-training. SFT and RL can both be seen as reweighting a pretrained reference distribution, only with different external signals. Demonstration signals define low-energy behavior for SFT, and reward signals define low-energy behavior for RL. When the update remains close to the base model, the main effect is local reweighting, not capability creation. Within this framework, the central question is no longer whether post-training is framed as SFT or RL, but whether it reweights behaviors already within reach, or instead expands the model’s reachable behavioral space through search, interaction, tool use, or the incorporation of new information.

[AI-358] Embeddings for Preferences Not Semantics

【速读】:该论文旨在解决在基于自由文本的集体决策场景中,如何有效捕捉用户对文本内容的偏好关系问题。传统文本嵌入模型(如词向量或句子嵌入)主要衡量语义相似性,但设施选址和公平聚类等优化任务需要的是“偏好相似性”——即用户与文本的距离应与其认同程度呈负相关。现有方法依赖于语义与偏好之间的隐含相关性,当这种相关性失效时便无法准确建模偏好。作者将此问题形式化为一个不变性问题:文本嵌入同时包含偏好相关的信号(立场和价值观)和语义干扰项(风格和措辞),二者在观测上存在相关性,导致基于干扰项的几何结构可能看似正确实则错误。解决方案的关键在于设计一种合成训练数据,主动打破语义与偏好的相关性,从而迫使模型学习到更纯粹的偏好信号;实验表明,这种方法显著提升了在11个在线审议数据集上的偏好预测性能。

链接: https://arxiv.org/abs/2605.08360
作者: Carter Blair,Ariel D. Procaccia,Milind Tambe
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 28 pages

点击查看摘要

Abstract:Modern AI is opening the door to collective decision-making in which participants express their views as free-form text rather than voting on a fixed set of candidates. A natural idea is to embed these opinions in a vector space so that the substantial literature on facility location problems and fair clustering can be brought to bear. But standard text embeddings measure semantic similarity, whereas distances in facility location problems and fair clustering require what we call \textitpreferential similarity: a participant’s agreement with a piece of text should be inversely related to their distance from it. Off-the-shelf embeddings inherit a coarse preference signal through a correlation between semantic and preferential similarity, but fail to capture preferences when the correlation breaks. We formalize this as an invariance problem: text embedding models encode both a preference-relevant signal (stance and values) and semantic nuisance (style and wording), and the two are observationally correlated, so a geometry that relies on nuisance can appear preference-correct even when it is not. We show that synthetic training data designed to break this correlation provably shifts the optimal scorer away from nuisance-dominated cosine and significantly improves preference prediction across 11 online deliberation datasets.

[AI-359] Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

【速读】:该论文旨在解决多模态生成模型(Multimodal Generative Models)在人类偏好对齐过程中,因现有强化学习人类反馈(Reinforcement Learning from Human Feedback, RLHF)方法将复杂的、多维的人类判断简化为标量或成对标签而导致的评价偏差与奖励黑客(Reward Hacking)问题。其核心解决方案是提出Auto-Rubric as Reward (ARR) 框架,关键在于将模型内部隐式的偏好知识显式化为特定于提示(prompt-specific)的评分标准(rubrics),从而实现从隐式权重优化到显式、基于标准的分解建模。这一转化使评价维度可独立验证,显著抑制位置偏差等系统性偏误,并支持零样本部署和少量监督下的条件化调整;进一步通过Rubric Policy Optimization (RPO) 将结构化多维评估提炼为鲁棒的二元奖励信号,以替代传统模糊的标量回归,稳定策略梯度,最终在文本到图像生成和图像编辑任务中超越基于成对比较的奖励模型和视觉语言模型(VLM)裁判。

链接: https://arxiv.org/abs/2605.08354
作者: Juanxi Tian,Fengyuan Liu,Jiaming Han,Yilei Jiang,Yongliang Wu,Yesheng Liu,Haodong Li,Furong Xu,Wanhua Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 28 pages, 10 figures, 11 tables

点击查看摘要

Abstract:Aligning multimodal generative models with human preferences demands reward signals that respect the compositional, multi-dimensional structure of human judgment. Prevailing RLHF approaches reduce this structure to scalar or pairwise labels, collapsing nuanced preferences into opaque parametric proxies and exposing vulnerabilities to reward hacking. While recent Rubrics-as-Reward (RaR) methods attempt to recover this structure through explicit criteria, generating rubrics that are simultaneously reliable, scalable, and data-efficient remains an open problem. We introduce Auto-Rubric as Reward (ARR), a framework that reframes reward modeling from implicit weight optimization to explicit, criteria-based decomposition. Before any pairwise comparison, ARR externalizes a VLM’s internalized preference knowledge as prompt-specific rubrics, translating holistic intent into independently verifiable quality dimensions. This conversion of implicit preference structure into inspectable, interpretable constraints substantially suppresses evaluation biases including positional bias, enabling both zero-shot deployment and few-shot conditioning on minimal supervision. To extend these gains into generative training, we propose Rubric Policy Optimization (RPO), which distills ARR’s structured multi-dimensional evaluation into a robust binary reward, replacing opaque scalar regression with rubric-conditioned preference decisions that stabilize policy gradients. On text-to-image generation and image editing benchmarks, ARR-RPO outperforms pairwise reward models and VLM judges, demonstrating that explicitly externalizing implicit preference knowledge into structured rubrics achieves more reliable, data-efficient multimodal alignment, revealing that the bottleneck is the absence of a factorized interface, not a deficit of knowledge.

[AI-360] Interactive Critique-Revision Training for Reliable Structured LLM Generation

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在结构化决策工作流(如表单填写、合规检查和维护报告)中输出的局部正确性、全局一致性与任务规则可审计性不足的问题。现有改进方法依赖启发式辩论、自对弈或LLM生成的监督信号,引入了二级保证难题(second-order assurance problem)。其解决方案的核心是提出DPA-GRPO(Dual Paired-Action Group-Relative Policy Optimization),一种基于双角色生成器-验证器博弈的配对动作训练机制:生成器提出输出并可在被质疑时修正;验证器选择是否发出安全保证案例(Safety Assurance Case, SAC),包含主张、论据与证据。SAC/无SAC与KEEP/REVISE决策构成配对反事实动作组,DPA-GRPO利用这些组进行角色特定KL正则化的GRPO更新,从而实现更稳定且符合局部最优策略的协同优化。

链接: https://arxiv.org/abs/2605.08327
作者: Fei Xu Yu,Zuyuan Zhang,Mahdi Imani,Nathaniel D. Bastian,Tian Lan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In structured decision-making workflows such as form filling, compliance checking, and maintenance reporting, LLM outputs must be locally correct, globally consistent, and auditable against task-specific rules. Existing refinement methods often rely on heuristic debate, self-play, or LLM-generated supervision, creating a second-order assurance problem. We propose DPA-GRPO (Dual Paired-Action Group-Relative Policy Optimization), a paired-action training method for a two-player generator–verifier game with structured verifier interventions. The generator proposes outputs and may revise them when challenged; the verifier either remains silent or raises a safety assurance case (SAC) containing a claim, argument, and evidence. These SAC/no-SAC and KEEP/REVISE decisions induce paired counterfactual action groups, which DPA-GRPO uses for role-specific KL-regularized GRPO updates. We analyze the unregularized game and show that positive probability on strictly lower-reward intervention or revision actions creates a profitable unilateral deviation. Under standard stochastic-approximation assumptions, DPA-GRPO tracks the corresponding game ODE, whose isolated asymptotically stable limit points are stationary and candidate local equilibria under role-wise local optimality. Experiments on TaxCalcBench TY24 show that DPA-GRPO improves structured decision accuracy over zero-shot generation and generator-only RL baselines across Qwen3-4B and Qwen3-8B. Training increases correct silent acceptance, reduces missed errors, and improves calibrated revision behavior, indicating gains for both generator and verifier.

[AI-361] LLM Advertisement based on Neuron Auctions

【速读】:该论文旨在解决生成式 AI(Generative AI)在对话型大语言模型(Large Language Models, LLMs)中嵌入广告时面临的三难困境:如何平衡广告商收益、平台收入与用户体验。现有方法如提示注入或固定位置插槽会破坏语义连贯性,且缺乏参数化控制框架,导致机制设计难以实现。其解决方案的关键在于提出“神经元拍卖”(Neuron Auctions)——将拍卖对象从表面文本空间转移到LLM内部表示空间,利用机制可解释性识别出品牌特异的前馈网络(Feed-Forward Network, FFN)神经元,并发现不同品牌激活处于近正交子空间,从而实现对干预预算(神经元数量与放大因子)的连续、解耦控制;在此基础上构建基于菜单的连续拍卖机制,天然保障策略无关性(strategy-proofness),并通过引入用户效用惩罚项动态调节广告强度,最终在保持自然对话质量的同时实现商业激励与用户满意度的最佳对齐。

链接: https://arxiv.org/abs/2605.08326
作者: Peiran Yun,Wenxin Xu,Jiayuan Liu,Yihang Zhang,Liang Zeng,Lingkai Kong,Tonghan Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, 9 figures, including appendices

点击查看摘要

Abstract:As Large Language Models (LLMs) transition into conversational agents, generative advertising emerges as a crucial monetization strategy. However, embedding advertisements within unstructured LLM outputs introduces a critical trilemma: balancing advertiser payoffs, platform revenue, and user experience. Existing methods, such as prompt injection or rigid position slots, disrupt semantic coherence and lack a parametric framework for independent control, rendering rigorous mechanism design intractable. To bridge this gap, we introduce Neuron Auctions, a novel paradigm that shifts the auction object from the surface text space to the LLM’s internal representations. Leveraging mechanistic interpretability, we identify brand-specific feed-forward network (FFN) neurons and demonstrate that competing brands activate within approximately orthogonal subspaces. This near-perfect independence allows us to define continuous, disentangled intervention budgets (specifically, neuron counts and amplification factors) as auctionable commodities. Building on this computational carrier, we design a continuous menu-based auction mechanism that naturally guarantees strategy-proofness and optimizes revenue for the platform. By explicitly incorporating a user utility penalty into the platform’s optimization objective, our framework dynamically prices out overly aggressive interventions. Extensive experiments demonstrate that Neuron Auctions effectively preserve natural discourse quality while achieving an optimal alignment between commercial incentives and user satisfaction.
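下面是“放大指定 FFN 神经元激活”这一干预手段的一个 PyTorch 草图:通过 forward hook 把选定隐藏单元按给定倍数缩放。神经元编号、放大倍数以及注释中的模块路径均为假设,仅说明干预预算(神经元数与放大因子)如何落到实现层面;拍卖与定价逻辑不在此示意范围内。

```python
import torch

def make_amplifier_hook(neuron_ids, amplification):
    """Forward hook that scales a chosen subset of hidden units in an FFN activation."""
    idx = torch.tensor(neuron_ids)

    def hook(module, inputs, output):
        output = output.clone()
        output[..., idx] = output[..., idx] * amplification
        return output

    return hook

# usage sketch (assumes `model` is some loaded decoder-only transformer):
# act_module = model.transformer.h[10].mlp.act          # hypothetical module path
# handle = act_module.register_forward_hook(make_amplifier_hook([123, 456], 3.0))
# ... run generation, then handle.remove()
```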

[AI-362] The Reciprocity Gradient

【速读】:该论文旨在解决战略互动中学习智能体面临的影响归属问题(influence attribution problem),即智能体发出的每个动作或信号会通过组合分支路径重塑多个第三方的声誉,并最终反馈至自身未来的奖励,导致智能体在决策时必须同时考虑所有这些间接通道的影响。解决方案的关键在于提出互惠梯度(reciprocity gradient),该方法通过从公开观测中训练对手策略的私有估计器,将奖励梯度显式地反向传播至声誉链本身,而非依赖采样回报进行估计;这一机制实现了动作与评价信号的联合优化,且无需内在奖励或奖励塑形,从而在实验中恢复出接近最优的上下文敏感策略,而基于样本的基线方法则退化为恒定输出策略。

链接: https://arxiv.org/abs/2605.08323
作者: Yue Lin,Pascal Poupart,Shuhui Zhu,Dan Qiao,Wenhao Li,Yuan Liu,Hongyuan Zha,Baoxiang Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Communication is fundamental to sustaining reciprocity and cooperation in strategic interactions. We identify and formulate the influence attribution problem as the central optimization difficulty inherent in such dynamics for a learning agent: any action or signal the agent emits reshapes the reputations of many third parties along combinatorially branching paths before feeding back into its own future rewards, forcing the agent to account for all of these indirect channels at once when choosing every action. To address this, we introduce the reciprocity gradient, which explicitly backpropagates reward gradients through private estimators of opponents’ policies trained from public observations. The gradient flows through the reputation chain itself analytically, rather than being estimated from sampled returns. It jointly optimizes actions and evaluative signals without intrinsic rewards or reward shaping. Empirically, the method recovers near-optimal context-sensitive policies, while sample-based baselines collapse into constant-output policies.

[AI-363] SDG-MoE: Signed Debate Graph Mixture-of-Experts

【速读】:该论文旨在解决稀疏混合专家(Mixture-of-Experts, MoE)模型中,被路由到的专家在处理输入token时仅独立计算、随后通过加权求和聚合的问题,即是否可以通过引入专家间的交互来提升性能。现有方法虽提出此问题,但对活跃专家之间的直接互动仍研究不足。其解决方案的关键在于提出SDG-MoE架构,通过引入一个轻量级、迭代式的“审议”步骤,在最终聚合前增强专家间的结构化交互:一是设计两个可学习的交互矩阵——支持图 A+A^+ 和批判图 AA^-,分别建模专家间的强化与修正影响;二是采用带符号的消息传递机制更新专家表示;三是引入基于分歧的Friedkin-Johnsen风格锚定机制,动态控制审议强度并防止专家漂移。该设计使交互强度随分歧程度自适应调整,同时保持专家的专业性,理论分析进一步证明了其状态稳定性及低阶计算开销。

链接: https://arxiv.org/abs/2605.08322
作者: Stepan Kulibaba,Kirill Labzin,Artem Dzhalilov,Roman Pakhomov,Oleg Svidchenko,Alexander Gansnikov,Aleksei Shpilman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sparse MoE models achieve a good balance between capacity and compute by routing each token to a small subset of experts. However, in most MoE architectures, once a token is routed, the selected experts process it independently and their outputs are combined via a weighted sum. This leaves open whether enabling communication among them could improve performance. While prior work has raised this question, direct interaction among the active routed experts remains underexplored. In this paper, we propose SDG-MoE (Signed Debate Graph Mixture-of-Experts), a novel architecture that adds a lightweight, iterative deliberation step before final aggregation. SDG-MoE introduces three components: (i) two learned interaction matrices over the active experts, a support graph A^+ and a critique graph A^- , capturing reinforcing and corrective influences; (ii) a signed message-passing step that updates expert representations before aggregation; and (iii) a disagreement-gated Friedkin-Johnsen-style anchoring that controls deliberation strength while preventing expert drift. Together, these enable a structured deliberation process where interaction strength scales with disagreement and specialization is preserved. We also provide a theoretical analysis establishing stability conditions on expert states and showing that deliberation adds only low-order overhead over the active set. In controlled three-seed pretraining experiments, SDG-MoE improves validation perplexity over both an unsigned graph communication baseline and vanilla MoE, outperforming the strongest baseline by 19.8%, and gives the best external perplexity on WikiText-103, C4, and Paloma among the compared systems.
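下面的 PyTorch 片段示意带符号消息传递加 Friedkin-Johnsen 式锚定的一轮“审议”:支持图聚合为正、批判图聚合为负,并以固定权重拉回初始专家状态。矩阵形状、锚定系数与迭代次数均为示意假设。

```python
import torch

def signed_deliberation(h, A_pos, A_neg, anchor_weight=0.3, steps=2):
    """Signed message passing among the k active experts with FJ-style anchoring."""
    h0 = h.clone()                            # anchor: pre-deliberation expert states
    for _ in range(steps):
        msg = A_pos @ h - A_neg @ h           # support minus critique aggregation
        h = anchor_weight * h0 + (1 - anchor_weight) * (h + msg)
    return h

k, d = 4, 8                                   # 4 active experts, hidden size 8
h = torch.randn(k, d)
A_pos, A_neg = torch.rand(k, k) * 0.1, torch.rand(k, k) * 0.1
print(signed_deliberation(h, A_pos, A_neg).shape)   # torch.Size([4, 8])
```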

[AI-364] Mazocarta: A Seeded Procedural Deckbuilder for Instrumented Game Development

【速读】:该论文旨在解决游戏开发中缺乏可复现、可测试且多场景适配的规则引擎问题,尤其是在策略类卡牌游戏(tactical deckbuilder)的开发过程中,如何实现从交互式玩家体验到自动化测试与平衡性分析的一体化支持。解决方案的关键在于构建一个基于Rust实现的确定性规则引擎(deterministic run model),该引擎统一支撑浏览器端WebAssembly运行、原生命令行模拟、自动化端到端测试、存档加载功能以及基于QR码的WebRTC本地多人对战,从而形成一套可重复验证的游戏开发参考实现(instrumented game-development reference artifact)。通过1,000个确定性种子的仿真评估,验证了该架构能稳定输出可复现的平衡性探针信号(如单人和双人自动对战胜率分别为36.1%和34.9%),为后续机制调整和回归检测提供可靠依据。

链接: https://arxiv.org/abs/2605.08319
作者: Timothy C. Cogan
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures, 1 table. Code available at this https URL

点击查看摘要

Abstract:Mazocarta is a seeded procedural tactical deckbuilder implemented in Rust, compiled to WebAssembly for browser play, and executable natively for simulation. Its primary technical contribution is not the invention of a new deckbuilding genre, but the construction of an instrumented game-development reference artifact: the same rules engine supports interactive play, native command-line simulation, automated end-to-end tests, save/load fixtures, and local-area multiplayer. This paper describes Mazocarta’s architecture, deterministic run model, reproducible balance probes, and QR-mediated WebRTC pairing for local multiplayer. An evaluation snapshot over 1,000 deterministic seeds shows that the simulation pipeline can produce reproducible development signals. In the evaluated configuration, single-player and two-player autoplay win rates were 36.1% and 34.9% over 1,000 deterministic seeds, respectively. These rates are not presented as final player-facing balance metrics, but as repeatable probes for future balance shifts and regressions. Mazocarta is positioned as a playable open-source reference artifact for instrumented game development: deterministic regression checks, automated playtesting workflows, balance probes for game mechanics, and browser-native local multiplayer all exercise one shared production rules core.

[AI-365] When Attention Beats Fourier: Multi-Scale Transformers for PDE Solving on Irregular Domains

【速读】:该论文旨在解决深度学习模型在求解偏微分方程(Partial Differential Equations, PDEs)时的架构选择问题,具体探讨基于Transformer的注意力机制架构是否优于傅里叶域神经算子(Fourier-domain Neural Operators)。其核心解决方案是提出多尺度注意力Transformer(Multi-Scale Attention Transformer, MSAT),该架构将时空解历史编码为标记序列,并通过包含可选物理信息正则化项的复合监督目标端到端训练。关键创新在于:1)设计了一种能有效捕捉复杂几何问题中多尺度动态的结构;2)通过系统性实证评估证明MSAT在复杂几何问题上显著优于九种基线方法(如FNO、DeepONet等),尤其在Heat2D-CG任务中相对误差降低3.7倍且推理时间从120,812秒缩减至34秒;3)通过消融实验揭示物理先验的归纳偏置存在权衡——在扩散主导问题中降低测试误差,但在混沌与回流流动场景中反而损害泛化能力,从而明确先验误设边界,并结合边界复杂度κ的逼近误差界提供理论指导,实现更合理的架构选择。

链接: https://arxiv.org/abs/2605.08318
作者: Brandon Yee,Pairie Koh,Jack Rodriguez,Mihir Tekal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We study the problem of architecture selection for deep learning models trained to solve partial differential equations (PDEs), asking when transformer-based architectures with learned attention outperform Fourier-domain neural operators. We introduce the Multi-Scale Attention Transformer (MSAT), a deep learning architecture that encodes spatiotemporal solution histories as token sequences and trains end-to-end via a composite supervised objective with optional physics-informed regularization terms. We conduct a comprehensive empirical evaluation against nine baselines – including physics-informed neural networks (PINNs), neural operators (FNO, DeepONet, GNOT), and state-space models (Mamba-NO) – across five benchmark problems from the PINNacle suite, using identical train/test splits and reference data for all methods. MSAT achieves state-of-the-art generalization on complex geometry problems (relative L² error = 0.0101 on Heat2D-CG, a 3.7x improvement over FNO) at 34 s total inference vs. 120,812 s for Mamba-NO. Ablation studies over the physics regularization component reveal a precise inductive bias tradeoff: physics priors reduce test error on diffusion-dominated problems but degrade generalization on chaotic and recirculating-flow regimes, directly characterizing the prior misspecification boundary. Approximation error bounds as a function of domain boundary complexity κ provide a theoretical basis for these empirical findings and a principled rule for architecture selection.

[AI-366] RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在长文本输入场景下推理时因Key-Value (KV)缓存占用内存过大而导致的性能瓶颈问题。具体而言,KV缓存大小随序列长度线性增长,且需在每个解码步骤中反复从片外高带宽内存(HBM)读取至片上内存,造成内存受限的推理延迟。现有方法通常单独采用缓存淘汰(eviction)或量化(quantization)策略,未能协同优化二者。本文提出RDKV(Rate-Distortion KV cache compression)方法,将KV缓存压缩建模为率失真优化问题,使淘汰与量化成为同一比特分配方案下的两个端点;其关键在于基于压缩对注意力计算造成的失真推导出每个token或通道的权重,并通过逆向水填法(reverse water-filling)一次性分配比特宽度(从全精度到零比特),实现联合优化,从而在极低缓存保留率下维持高精度并显著提升解码速度与内存效率。

链接: https://arxiv.org/abs/2605.08317
作者: Junkai Zhang,Hang Guo,Luca Benini,Yawei Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown strong performance across diverse tasks, but their inference with long input contexts is bottlenecked by memory size and bandwidth. The Key-Value (KV) cache size grows linearly with sequence length and needs to be re-read from off-chip high-bandwidth memory (HBM) to on-chip memory at every decoding step, resulting in memory-bound inference. Existing methods reduce the cache by either eviction or quantization, but typically treat the two in isolation. In this paper, we cast KV cache compression as a rate-distortion problem, under which eviction and quantization are two end-points of the same bit allocation scheme. This exposes the need to optimize them jointly, motivating our method, RDKV (Rate-Distortion KV cache compression). RDKV derives the weight of each token or channel from the distortion that compression induces on the attention computation. Based on these weights, it assigns each token or channel a bit-width ranging from full precision down to zero bits guided by reverse water-filling, applied once after the prefilling stage. Experiments on LongBench, RULER, and InfiniteBench show that RDKV outperforms the best evaluated baseline by 9.1% on average. On LongBench it recovers 97.81% of full-cache accuracy with only 2.48% cache retention. Compared with full-cache FlashAttention-2 decoding, it achieves 4.5x decode speedup and 1.9x peak memory reduction with 128K context length, while maintaining comparable performance.
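摘要中的“逆向水填法分配比特”可用一个标准的率失真式示意来理解:把每个 token/通道的权重视作方差,水位以上的分量获得 0.5·log2(权重/水位) 比特,其余为 0,再用二分搜索水位使总比特数不超过预算。权重的具体定义、比特上限等均为假设,不代表论文实现。

```python
import numpy as np

def reverse_waterfill_bits(weights, total_bits, max_bits=16):
    """Allocate bit-widths by reverse water-filling over per-token/channel weights."""
    w = np.asarray(weights, dtype=float)

    def bits(level):
        return np.clip(0.5 * np.log2(np.maximum(w, 1e-12) / level), 0, max_bits)

    lo, hi = 1e-12, w.max()                   # bracket the water level
    for _ in range(60):                       # bisection (geometric midpoint)
        mid = np.sqrt(lo * hi)
        if bits(mid).sum() > total_bits:
            lo = mid                          # too many bits: raise the water level
        else:
            hi = mid
    return bits(hi)

print(np.round(reverse_waterfill_bits([8.0, 4.0, 1.0, 0.25, 0.01], total_bits=6), 2))
```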

[AI-367] FlashSVD v1.5: Making Low-Rank Transformers Inference Actually Fast

【速读】:该论文旨在解决基于奇异值分解(SVD)的低秩压缩在大语言模型(LLM)推理阶段实际加速效果不佳的问题。研究表明,这种性能差距主要源于运行时执行路径碎片化,导致预填充(prefill)与自回归解码(autoregressive decode)阶段的开销差异显著。解决方案的关键在于提出 FlashSVD v1.5——一个统一的推理运行时框架,通过将多种公共 SVD 压缩方法映射到统一的低秩表示,并结合阶段特异性内核、密集键值(dense-KV)解码、打包多层感知机(MLP)执行以及每层 CUDA-graph 重放技术,重构出高效的低秩推理路径,从而实现高达 2.55× 的解码速度提升和 2.39× 的端到端加速。结果表明,实用的低秩加速需要运行时与压缩算法协同设计,而非仅依赖压缩方法本身。

链接: https://arxiv.org/abs/2605.08314
作者: Wenhao Wu,Zishan Shao,Kangning Cui,Jinhee Kim,Yixiao Wang,Hancheng Ye,Danyang Zhuo,Yiran Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注:

点击查看摘要

Abstract:SVD-based Low-rank compression reduces transformer parameters and nominal FLOPs, but these savings often translate poorly into real LLM serving speedups. We show that this gap is largely a runtime problem: factorized checkpoints fragment execution paths, and the resulting overhead differs substantially between prefill and autoregressive decode. We present FlashSVD v1.5, a unified inference runtime for serving SVD-compressed transformers. FlashSVD v1.5 maps diverse public SVD compression families to a common factorized representation and combines phase-specific kernels with dense-KV decode, packed MLP execution, and per-layer CUDA-graph replay to reorganize the low-rank serving path into a thin runtime. Across representative decoder-serving settings, FlashSVD v1.5 achieves up to 2.55x decode and 2.39x end-to-end speedup, and it attains 1.48x average decode and 1.44x average end-to-end speedup across multiple popular SVD compression families. These results suggest that practical low-rank acceleration requires runtime co-design, not compression algorithms alone. Our code is available at: this https URL.
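SVD 压缩后的前向计算本质上是把一次稠密矩阵乘换成两次“瘦”矩阵乘;下面的 PyTorch 片段示意这一点(这也正是论文强调的:名义 FLOPs 下降并不自动等于实际加速)。分解方式与维度均为示意。

```python
import torch

def factorize_linear(weight, rank):
    """Approximate a dense weight W (out x in) by rank-r factors: W ≈ U_r @ V_r."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]              # fold singular values into U
    V_r = Vh[:rank, :]
    return U_r, V_r

def low_rank_forward(x, U_r, V_r):
    """Two thin matmuls replace one dense matmul: x @ W^T ≈ (x @ V_r^T) @ U_r^T."""
    return (x @ V_r.T) @ U_r.T

W, x = torch.randn(512, 512), torch.randn(4, 512)
U_r, V_r = factorize_linear(W, rank=64)
print(torch.norm(x @ W.T - low_rank_forward(x, U_r, V_r)).item())   # rank-64 residual
```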

[AI-368] Seed Hijacking of LLM Sampling and Quantum Random Number Defense

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在自回归采样过程中因依赖确定性伪随机数生成器(Pseudorandom Number Generator, PRNG)而引入的供应链攻击面问题,即攻击者可通过操纵PRNG输出实现对特定token的强制注入,且不改变模型logits。其解决方案的关键在于提出SeedHijack攻击方法,通过控制PRNG种子实现高精度token注入(在GPT-2中达到99.6%精确注入率),并进一步设计基于硬件量子随机数生成器(Quantum Random Number Generator, QRNG)的防御机制,在保持极低性能开销(中位延迟增加0.6%,内存增加7.7 MB)的前提下有效抵御该类攻击。

链接: https://arxiv.org/abs/2605.08313
作者: Ziyang You,Xiaoke Yang,Zhanling Fan,Feng Guo,Xiaogen Zhou,Xuxing Lu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) rely on deterministic pseudorandom number generators (PRNGs) for autoregressive sampling, creating a critical supply-chain attack surface overlooked by existing defenses. We present SeedHijack, a backdoor attack that manipulates PRNG outputs to force attacker-specified token selection without altering model logits. In a 540-trial benchmark on GPT-2 (124M), the attack achieves 99.6% exact token injection across 9 sampling configurations; it reaches 100% success on four aligned models (1.5B-7B, RLHF/SFT/reasoning distillation) and bypasses all alignment methods tested in this work. We further propose a defense based on a hardware quantum random number generator (QRNG), which neutralizes the attack in our evaluated threat model with negligible median overhead (+0.6% latency, +7.7 MB memory). Our work identifies a critical sampling-layer vulnerability and provides a practical, deployable QRNG-based defense.

[AI-369] WebTrap: Stealthy Mid-Task Hijacking of Browser Agents During Navigation

【速读】:该论文旨在解决浏览器代理(browser agent)在执行长周期任务时面临的提示注入攻击(prompt injection attack)问题,具体表现为现有攻击方法存在有效性不足和隐蔽性弱两大缺陷:一是攻击策略在真实复杂环境中难以达成最终目标;二是攻击目标与用户目标冲突,导致系统可用性显著下降。解决方案的关键在于提出一种名为WebTrap的中期劫持注入攻击方法,其核心创新包括:通过多步指令融合引导(multi-step instruction fusion steering)实现攻击目标与用户目标的无缝结合,使代理在完成攻击后可继续执行原任务;同时设计基于上下文对齐的生成机制(context-grounded generation method),确保注入内容与任务环境及系统指令一致,从而最大化劫持成功率并维持系统可用性。实验表明,WebTrap能有效利用代理导航漏洞,将两项目标紧密绑定,使得传统防御机制无法恢复系统正常运行,揭示了长周期任务中代理系统的隐蔽劫持风险。

链接: https://arxiv.org/abs/2605.08310
作者: Zhichao Liu,Wenbo Pan,Haining Yu,Ge Gao,Tianqing Zhu,Xiaohua Jia
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 31 pages, 4 figures, 10 tables. Code: this https URL

点击查看摘要

Abstract:Browser agents are increasingly deployed in long-horizon tasks, which require executing extended action chains to accomplish user goals. However, this prolonged execution process provides attackers with more opportunities to inject malicious instructions. Existing prompt injection attacks against browser agents expose two key gaps: (1) low effectiveness, as attacks optimized for toy baselines fail to achieve end-to-end goals in real-world scenarios with complex environments and longer steps; (2) weak stealthiness, since most attacks pit the attack goal against the user goal, causing a significant drop in system usability under attack. To address these gaps, we propose WebTrap, a mid-task hijacking injection attack. It employs multi-step instruction fusion steering to seamlessly combine both goals, enabling the agent to resume the original user task after executing the attack goal. Furthermore, we design a context-grounded generation method to align the injected content with the task environment and system instructions, maximizing the hijacking success rate. Extensive experiments on two browser agent tasks, based on extended WASP and InjecAgent environments, demonstrate that our method achieves a high attack success rate while preserving the usability of the original system. We find that WebTrap exploits the agent’s navigation vulnerabilities, binding the two goals so tightly that standard defense mechanisms cannot restore the system to normal operation. These findings reveal a critical vulnerability in agent systems during long-horizon tasks that they can be stealthily hijacked.

[AI-370] Practical Wi-Fi-based Motion Recognition Under Variable Traffic Patterns

【速读】:该论文旨在解决Wi-Fi感知系统在不同传输流量模式下因采样率变化导致的性能不稳定问题,特别是现有模型在固定输入尺寸和采样率条件下训练后,在变采样率场景中表现出较差的泛化能力。其关键解决方案是提出一种基于Transformer架构的采样率灵活神经网络(Sampling Rate Versatile Neural Network, SRV-NN),并引入动态采样率增强策略,以有效处理不同大小和间隔的感知信号,从而显著提升模型在多种采样率下的准确性和稳定性。

链接: https://arxiv.org/abs/2605.08308
作者: Guolin Yin,Junqing Zhang,Guanxiong Shen,Simon L. Cotton
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: 17 Pages

点击查看摘要

Abstract:Wi-Fi sensing detects human motions and activities by analysing the channel state information (CSI) derived from Wi-Fi transmissions. However, the impact of variable transmission traffic, which dictates the effective sampling rate and interval, is often overlooked. Existing Wi-Fi sensing systems are trained with fixed input size and sampling rate, which suffer from poor sampling rate generalisation. This paper proposes a novel Wi-Fi sensing approach for motion recognition applications, e.g., gesture and activity recognition, under variable traffic patterns. A sampling rate versatile neural network (SRV-NN) based on the transformer is proposed to efficiently handle variable input-sized sensing signals. A dynamic sampling rate augmentation is employed for variable sampling rates and intervals. To validate our approach, we have carried out extensive experimental evaluation, using two self-collected datasets, namely SRV activity and SRV gesture, as well as two publicly available datasets. Our method demonstrated exceptional performance and stability under variable sampling rates, with substantial improvements in average accuracy compared to baseline models without augmentation. The proposed approach significantly enhances stability by greatly reducing accuracy variance across different sampling rates.
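下面用几行 NumPy 示意“动态采样率增强”中最简单的一种做法:按目标采样率对 CSI 序列做下采样,以模拟不同传输流量下的有效采样率。真实流量还会带来不等间隔采样,这里只展示等间隔情形,数值均为虚构。

```python
import numpy as np

def resample_csi(csi, target_rate, source_rate):
    """Subsample a CSI sequence (time x subcarriers) to a lower effective rate."""
    step = max(1, int(round(source_rate / target_rate)))
    return csi[::step]

csi = np.random.randn(1000, 52)               # 1000 packets x 52 subcarriers
for rate in (100, 50, 20):                    # emulate different traffic patterns
    print(rate, resample_csi(csi, rate, source_rate=100).shape)
```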

[AI-371] GNN for Structural Displacement Prediction

【速读】:该论文旨在解决结构在外部荷载作用下位移预测的计算效率问题,传统有限元法(Finite Element Method, FEM)虽精度高但计算成本大,难以满足实时监测需求。解决方案的关键在于提出一种基于图神经网络(Graph Neural Networks, GNNs)的数据驱动框架,将结构系统建模为图结构(节点表示节点,边表示构件),并融合几何与力学属性,从而直接从仿真数据中学习荷载与结构响应之间的映射关系。实验表明,该GNN模型在位移和转角预测上具有高精度,显著优于传统神经网络(Neural Network, NN)模型,展现出作为FEM高效替代方案的潜力。

链接: https://arxiv.org/abs/2605.08303
作者: Hung-Fu Chang,Tzu-Kang Lin,Yung-Li Cheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages

点击查看摘要

Abstract:Accurate prediction of structural displacements under external loading is fundamental to structural health monitoring and seismic safety assessment. Although the finite element method (FEM) remains the prevailing approach because of its high accuracy, its considerable computational cost restricts its suitability for real-time monitoring applications. To address this limitation, this study proposes a data-driven framework based on Graph Neural Networks (GNNs), in which structural systems are represented as graphs with joints modeled as nodes and structural members as edges. By incorporating both geometric and mechanical properties into the graph representation, the proposed model learns the relationship between applied loads and structural responses directly from simulated data. A synthetic dataset was generated from a two-story frame structure using ANSYS, and both a conventional Neural Network (NN) and a GNN were trained for comparison. The results show that the proposed GNN framework predicts displacements and rotations with high accuracy and outperforms the NN model, demonstrating its potential as a fast and efficient alternative to traditional FEM-based analysis.

[AI-372] SGC-RML: A reliable and interpretable longitudinal assessment for PD in real-world DNS

【速读】:该论文旨在解决现实世界中帕金森病(Parkinson’s disease, PD)数字评估面临的多重挑战,包括多模态异质性、跨设备偏差以及标签不完整等问题,尤其关注现有方法在缺乏可靠性机制的情况下难以实现回溯性可靠评估的问题——即无法明确判断模型何时可靠、何时应拒绝评估、何时需重新测试,以及预测依据来自哪些症状维度。解决方案的关键在于提出SGC-RML框架,其核心创新是将语音、步态、可穿戴运动、移动任务和临床变量映射到一个共享的8维症状节点空间(7个临床症状节点与1个可靠性状态辅助节点),通过症状图谱统一运动与非运动表征;并联合引入不确定性估计、共形校准(conformal calibration)和选择性决策路由机制,使模型不仅能预测症状及其严重程度,还能在证据不足时主动拒绝评估或建议重测,从而实现准确、校准、可审计且症状可解释的回溯性纵向评估,在不完整多模态条件下显著提升可靠性与实用性。

链接: https://arxiv.org/abs/2605.08302
作者: Wenbin Wei,Ruixiang Gao,Suyuan Yao,Xuanzhen Zhao,Cheng Huang,Hen-Wei Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint. The first five authors contributed equally. Corresponding author: Hen-Wei Huang. 9 pages main text + appendix; 4 figures, 5 tables in main text

点击查看摘要

Abstract:Real-world digital Parkinson’s disease assessment faces challenges such as heterogeneous modalities, cross-device bias, and incomplete labeling. Existing methods often focus on average predictive performance, lacking the reliability mechanisms needed for retrospective reliability-aware assessment - namely, determining when the model is reliable, when to reject an assessment, when to retest, and from which symptom dimensions the predictions are based. This paper proposes SGC-RML, which maps speech, gait, wearable motion, mobility tasks, and clinical variables to a shared 8-dimensional symptom node space (7 clinical symptom nodes and 1 reliability_state auxiliary node), unifying motor and non-motor representations through a symptom atlas. By jointly introducing uncertainty estimation, conformal calibration, and selective decision routing, the model can not only predict symptoms and severity but also reject assessments or suggest retests when evidence is insufficient. We validate this framework on five real-world PD datasets, covering classification, regression, event detection, and longitudinal severity prediction. Experiments show that SGC-RML achieves an MAE of 4.579 / R^2 of 0.772 on PPMI, an AUC of 0.953 on mPower, and an AUC of 0.825 on PADS. Under leak-free temporal anchoring, as few as 5 subject-specific anchors transform UCI from an essentially non-predictive subject-independent setting (motor MAE 8.38, CCC 0.02) into a calibrated longitudinal assessment (motor MAE 3.24, CCC 0.756) with split-conformal coverage held at the 0.80 target. Under the Daphnet LOSO protocol, it achieves an F1 of 0.803 / AUC of 0.872. These results demonstrate that SGC-RML provides a unified paradigm for accurate, calibrated, auditable, and symptom-interpretable retrospective longitudinal assessment of PD under incomplete multimodal conditions.
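
摘要中提到的 split-conformal(分割共形)校准原理很简单,下面给出一个与论文实现无关的通用示意(残差型非一致性分数、0.80 覆盖率目标仅为演示假设):在校准集上取残差分位数,为新样本构造预测区间。

```python
import numpy as np

def split_conformal_interval(y_cal, yhat_cal, yhat_new, coverage=0.80):
    """基于校准集残差的分割共形预测区间(对称绝对残差分数)。"""
    scores = np.abs(y_cal - yhat_cal)                   # 非一致性分数
    n = len(scores)
    # 有限样本修正后的分位数水平;method="higher" 需要 numpy >= 1.22
    q_level = min(1.0, np.ceil((n + 1) * coverage) / n)
    q = np.quantile(scores, q_level, method="higher")
    return yhat_new - q, yhat_new + q                    # 以约 coverage 的概率覆盖真实值

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y_cal = rng.normal(size=500)
    yhat_cal = y_cal + rng.normal(scale=0.5, size=500)   # 假设的模型在校准集上的预测
    lo, hi = split_conformal_interval(y_cal, yhat_cal, yhat_new=np.array([0.1, 1.2]))
    print(lo, hi)
```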

[AI-373] Priming: Hybrid State Space Models From Pre-trained Transformers

【速读】:该论文旨在解决大规模混合状态空间模型(Hybrid State-Space Models)在架构设计与训练过程中面临的高成本问题,即当前探索其设计空间需从头预训练,限制了对不同状态空间层类型(SSM layer types)的大规模系统性比较。解决方案的关键在于提出“预热”(Priming)方法,该方法通过将一个预训练的Transformer模型作为起点,在极低的预训练数据预算(<0.5%源模型token量)下,经短时对齐和微调即可恢复下游任务性能,从而将混合架构的设计从预训练问题转化为知识迁移问题。此方法不依赖于特定Transformer家族、模型结构或规模,并首次实现了在相同条件下对Gated KalmaNet (GKA)、Gated DeltaNet (GDN) 和 Mamba-2 等多种SSM层类型的可控大规模对比,揭示了其表达能力排序(GKA > GDN > Mamba-2)直接预测长上下文推理性能,且所提Primed Hybrid GKA 32B模型在保持接近Transformer性能的同时显著提升解码吞吐量(最高达2.3倍)。

链接: https://arxiv.org/abs/2605.08301
作者: Aditya Chattopadhyay,Elvis Nunez,Prannay Kaul,Benjamin Bowman,Evan Becker,Luca Zancato,David Thomas,Wei Xia,Stefano Soatto
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Hybrid State-Space models combine Attention with recurrent State-Space Model (SSM) layers, balancing eidetic memory from Attention with compressed fading memory from SSMs. This yields smaller Key-Value caches and faster decoding than Transformers, along with a richer architectural design space. Exploring that design space at scale has so far required training from scratch, a barrier that has kept most large-model Hybrid research within a narrow range of architectures. We introduce Priming, a method that turns Hybrid architecture design from a pre-training problem into a knowledge transfer one. Priming initializes a Hybrid model from a pre-trained Transformer and, through short alignment and post-training phases, recovers downstream quality using less than 0.5% of the source model’s pre-training token budget. Priming is agnostic to the source Transformer family (e.g., Qwen, Llama, Mistral), model class (dense or Mixture-of-Experts), and model scale. Priming enables us to run the first controlled comparison of SSM layer types at scale under identical conditions. We evaluate Gated KalmaNet (GKA), Gated DeltaNet (GDN), and Mamba-2, and show that their expressiveness hierarchy, GKA > GDN > Mamba-2, directly predicts downstream performance on long-context reasoning tasks. We scale Priming to 8B/32B reasoning models with native 128K contexts. Our Hybrid GKA 32B improves over its source Qwen3-32B by +3.8 average reasoning points, while staying within 1% of a Transformer post-trained on the same data and enabling up to 2.3x higher decode throughput. To foster research on Hybrid architectures, we release a model zoo of primed Hybrid models for long-context reasoning and instruction following, together with the Priming training and inference code (Sequence Parallelism algorithms for long-context training, optimized GKA kernels, and vLLM serving plugin), all under Apache 2.0 License.

[AI-374] Do not copy and paste! Rewriting strategies for code retrieval

【速读】:该论文旨在解决嵌入式代码检索(embedding-based code retrieval)中编码器因过度拟合表面语法(surface syntax)而导致性能下降的问题。其核心解决方案是通过大语言模型(LLM)对查询和代码库进行重写(rewriting),以降低语法噪声并增强语义一致性,从而提升检索效果。关键创新在于系统性评估三种重写策略:风格化重写、自然语言增强的伪代码(NL-enriched PseudoCode)和完整的自然语言转录,并首次将后者直接作为检索表示而非中间步骤;同时提出两个诊断指标——Delta H(token熵变化)和Delta s(嵌入余弦相似度变化),其中Delta H被证明能有效预测联合查询-语料重写(QC)场景下的检索增益,且具有跨模型家族的泛化能力(如DeepSeek+Codestral下Spearman相关系数ρ=+0.436,p<0.001),为是否执行重写提供低成本、无需训练的决策依据。研究进一步表明,LLM重写最适合作为轻量级编码器在代码主导型查询上的补救层,对强编码器或自然语言密集型查询收益递减。

链接: https://arxiv.org/abs/2605.08299
作者: Andrea Gurioli,Federico Pennino,Maurizio Gabbrielli
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Embedding-based code retrieval often suffers when encoders overfit to surface syntax. Prior work mitigates this by using LLMs to rephrase queries and corpora into a normalized style, but leaves two questions open: how much representational shift helps, and when is the per-query LLM call justified? We study a hierarchy of three rewriting strategies: stylistic rephrasing, NL-enriched PseudoCode, and full Natural-Language transcription, under joint query-corpus (QC, online) and corpus-only (C, offline) augmentation, across six CoIR benchmarks, five encoders, and three rewriters spanning independent model families (Qwen, DeepSeek, Mistral). We are the first to evaluate NL-enriched PseudoCode and snippet-level Natural Language as direct retrieval representations, rather than as transient intermediates. Full NL rewriting with QC yields the largest gains (+0.51 absolute NDCG@10 on CT-Contest for MoSE-18), while corpus-only rewriting degrades retrieval in 56 of 90 configurations, about 62%. We introduce two diagnostics, Delta H (token entropy) and Delta s (embedding cosine), and show that Delta H predicts retrieval gain under QC across all three rewriter families: pooled Spearman rho = +0.436, p < 0.001 on DeepSeek+Codestral; rho = +0.593 on Codestral alone; rho = +0.356 on Qwen. This establishes Delta H as a cheap, rewriter-agnostic proxy for deciding when rewriting pays off before running retrieval. Our analysis reframes LLM rewriting as a cost-benefit decision: it is most effective as a remediation layer for lightweight encoders on code-dominant queries, with diminishing returns for strong encoders or NL-heavy queries.
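
论文的两个诊断量 Delta H(token 熵变化)与 Delta s(重写前后嵌入的余弦相似度)计算成本很低,下面按通用定义给出示意实现(分词与嵌入均用占位方式代替真实的 tokenizer 与检索编码器,属假设):

```python
import numpy as np
from collections import Counter

def token_entropy(tokens):
    """经验 token 分布的香农熵(单位:nat)。"""
    counts = np.array(list(Counter(tokens).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def delta_h(tokens_orig, tokens_rewritten):
    """Delta H:重写前后 token 熵之差。"""
    return token_entropy(tokens_rewritten) - token_entropy(tokens_orig)

def delta_s(emb_orig, emb_rewritten):
    """Delta s:重写前后嵌入的余弦相似度(此处按字面实现为余弦值)。"""
    a, b = np.asarray(emb_orig), np.asarray(emb_rewritten)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

if __name__ == "__main__":
    # 假设:用空白切分代替真实 tokenizer,用随机向量代替检索编码器输出
    orig = "def add(a,b): return a+b".split()
    rewritten = "function that returns the sum of two numbers".split()
    rng = np.random.default_rng(0)
    print("Delta H =", delta_h(orig, rewritten))
    print("Delta s =", delta_s(rng.standard_normal(64), rng.standard_normal(64)))
```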

[AI-375] What Cohort INRs Encode and Where to Freeze Them

【速读】:该论文旨在解决 cohort-trained 基于隐式神经表示(Implicit Neural Representations, INRs)的模型中,早期层在迁移学习时为何能加速并提升信号拟合效果的问题,特别是明确哪些层具备可迁移性以及这些层实际编码了何种信息。解决方案的关键在于:首先,通过系统性地冻结共享编码器不同深度的层,发现最优冻结点与权重稳定秩(weight stable rank)最高的层一致,且该策略在所有实验中表现优于或等同于标准微调方法;其次,引入稀疏自编码器(Sparse Autoencoders, SAEs)对 INR 激活进行分解,首次将激活映射为稀疏字典原子(dictionary atoms),揭示出 SIREN 和 Fourier-feature MLP(FFMLP)虽在 cohort 拟合质量上相当,但其学到的原子结构迥异——SIREN 的原子局部化且独立于训练数据分布,而 FFMLP 的原子则覆盖整张图像并捕捉记忆信号的轮廓。这一机制解析使 INR 的内部表征可解释,并为设计更具泛化能力而非仅记忆能力的架构提供了新路径。

链接: https://arxiv.org/abs/2605.08298
作者: Vasiliki Sideri-Lampretsa,Sophie Starck,Robbie Holland,Julian McGinnis,Daniel Rueckert
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 content pages plus appendix

点击查看摘要

Abstract:Reusing the early layers of cohort-trained INRs as initialization for new signals has been shown to accelerate and improve signal fitting, yet it remains unclear which layers of the shared encoder learn transferable representations and what those representations encode. We address both questions for two standard backbones, SIREN and Fourier-feature MLPs (FFMLP). First, sweeping the freeze depth across the shared encoder at test time, we find that the optimum coincides with the layer of highest weight stable rank. Moreover, freezing at this depth matches or improves on the standard fine-tuning recipe across all our experiments. Second, identifying which layer transfers does not characterize what that layer encodes. To address this we adopt sparse autoencoders (SAEs), the dominant tool in mechanistic interpretability, and present the first SAE decomposition of INR activations into sparse dictionary atoms. Interestingly, SIREN and FFMLP achieve comparable cohort-fitting quality, but learn qualitatively different dictionaries. Cohort SIREN’s atoms are localized, tiling the coordinate plane such that each atom fires in a confined region independent of cohort content. Cohort FFMLP’s atoms are image-spanning, tracing the contours of memorized cohort signals. Single-atom ablations confirm causal use of these dictionaries: a single FFMLP atom out of 4096 can drop PSNR by up to 10.6 dB across the image, while SIREN ablations remain confined to where the atom fires. Together, these results give the first mechanistic account of what transfers in cohort-trained INRs and turn their activations into inspectable dictionary atoms. These tools open a path towards characterizing what INRs encode and towards architectures designed for generalization rather than memorization.
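
摘要中用于选择冻结深度的"权重稳定秩"(weight stable rank)有标准定义 $\|W\|_F^2/\|W\|_2^2$。下面给出逐层计算稳定秩、取最大者作为冻结深度候选的最小示意(层权重为随机占位数据):

```python
import numpy as np

def stable_rank(w):
    """稳定秩 = Frobenius 范数平方 / 谱范数平方。"""
    fro2 = np.sum(w ** 2)
    spec = np.linalg.svd(w, compute_uv=False)[0]   # 最大奇异值
    return float(fro2 / (spec ** 2))

def suggest_freeze_depth(layer_weights):
    """返回稳定秩最高的层索引,作为测试时冻结深度的候选。"""
    ranks = [stable_rank(w) for w in layer_weights]
    return int(np.argmax(ranks)), ranks

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layers = [rng.standard_normal((256, 256)) for _ in range(4)]   # 假设的 4 层共享编码器权重
    depth, ranks = suggest_freeze_depth(layers)
    print("建议冻结到第", depth, "层;各层稳定秩:", np.round(ranks, 2))
```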

[AI-376] A Qualitative Test-Risk Mechanism for Scaling Behavior in Normalized Residual Networks

【速读】:该论文旨在解决深度神经网络中模型规模扩展(如深度增加)如何带来测试风险(test risk)降低的理论机制问题,特别是在归一化残差网络(normalized residual networks)中,通过插入新的残差块进行深度扩展时,能否实现可证明的性能提升。其解决方案的关键在于构建一个统一框架,将问题分解为表示增益(representational gain)、优化增益(optimization gain)和泛化转移(generalization transfer)三部分:首先,在零初始化附近的首阶下降条件下,证明扩展后的假设类包含一个辅助跳跃模型(auxiliary jumpboard model),其总体风险严格低于原模型;其次,基于针对后归一化残差架构设计的范数控制,建立扩展模型类的基于范数的 Rademacher 复杂度界。由此得出两种互补的测试风险保证:一种通过总体风险路径,当存在正总体间隔时更紧致;另一种直接作用于训练/测试层面,避免 Hoeffding 转移,且在退化情形下更具鲁棒性。这为残差深度扩展提供了一个定理驱动的机制,支持“规模扩展是联合性的”这一观点——深度创造新的改进方向,宽度增强弱信号的有限样本可观测性,数据决定扩展的统计代价是否可控。

链接: https://arxiv.org/abs/2605.08297
作者: Daning Cheng,Zeyu Liu,Jun Sun,Fen Xia,Boyang Zhang,Dongping Liu,Yunquan Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The scaling behavior, in which test performance often improves as model size and data increase, is a central empirical phenomenon in modern deep learning, yet its theoretical basis remains incomplete. In this paper, we study depth expansion in normalized residual networks: starting from a trained model in an old hypothesis class, we insert a new residual block at an intermediate layer and ask when such an expansion can yield a provable improvement in test risk. We develop a unified framework that decomposes this question into representational gain, optimization gain, and generalization transfer. First, under a first-order descent condition near zero initialization, we prove that the expanded hypothesis class contains an auxiliary jumpboard model with strictly smaller population risk than the original model. Second, under norm control tailored to post-normalized residual architectures, we establish a norm-based Rademacher complexity bound for the expanded model class. These ingredients lead to two complementary test-risk guarantees: one route passes through population risk and is tighter when a positive population margin is available, while the other works directly at the train/test level, avoids Hoeffding transfer, and is more robust in degenerate regimes. Together, these results provide a theorem-driven mechanism under which residual depth expansion can improve test performance in normalized residual networks. More broadly, they suggest that scaling is inherently joint: depth creates new improving directions, width enhances the finite-sample observability of weak signals, and data determines whether the statistical cost of expansion can be controlled.
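
论文分析的"在中间层插入新残差块"的深度扩展,一个常见的工程化做法是让新块零初始化,使扩展后的网络在初始时刻与原网络函数一致(摘要中的一阶下降条件正是在该初始化附近讨论的)。下面用 PyTorch 给出示意(块结构、归一化位置均为假设,仅演示"插入即恒等"):

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, dim, zero_init=False):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc1, self.fc2 = nn.Linear(dim, dim), nn.Linear(dim, dim)
        if zero_init:                      # 输出层零初始化 => 新块初始等价于恒等映射
            nn.init.zeros_(self.fc2.weight)
            nn.init.zeros_(self.fc2.bias)

    def forward(self, x):
        return x + self.fc2(torch.relu(self.fc1(self.norm(x))))

def expand_depth(blocks, insert_at, dim):
    """在第 insert_at 个位置插入一个零初始化的新残差块。"""
    new_blocks = list(blocks)
    new_blocks.insert(insert_at, ResBlock(dim, zero_init=True))
    return nn.Sequential(*new_blocks)

if __name__ == "__main__":
    torch.manual_seed(0)
    dim = 32
    old = nn.Sequential(*[ResBlock(dim) for _ in range(3)])
    new = expand_depth(list(old), insert_at=1, dim=dim)
    x = torch.randn(4, dim)
    print(torch.allclose(old(x), new(x)))   # True:扩展后的初始函数与原模型一致
```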

[AI-377] Hierarchical Mixture-of-Experts with Two-Stage Optimization

【速读】:该论文旨在解决稀疏混合专家(Sparse Mixture-of-Experts, MoE)模型中路由机制存在的根本性权衡问题:强负载均衡会抑制专家专业化,而追求多样性则易导致路由坍塌(routing collapse)。其解决方案的关键在于提出Hi-MoE框架,通过将路由控制分解为两个耦合层级实现优化:(i) 跨组负载均衡(inter-group balancing),确保专家组间流量公平分配;(ii) 组内专业化(intra-group specialization),促进组内专家互补行为并防止组内坍塌。这一分层设计从理论上重构了路由器的行为,从而稳定专家专业化并有效缓解路由坍塌现象,在自然语言处理和视觉任务上均展现出显著性能提升与鲁棒性。

链接: https://arxiv.org/abs/2605.08292
作者: Gleb Molodtsov,Alexander Miasnikov,Aleksandr Beznosikov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Sparse Mixture-of-Experts (MoE) models scale capacity by routing each token to a small subset of experts. However, their routers exhibit a fundamental trade-off: strong load balancing can suppress expert specialization, while aggressive diversity often causes routing collapse. We propose Hi-MoE, a grouped MoE framework that decomposes routing control into two coupled levels: (i) inter-group balancing that enforces fair traffic across expert groups, and (ii) intra-group specialization that promotes complementary expert behaviors while preventing within-group collapse. Our analysis provides a principled explanation of how our hierarchical objectives reshape the router, thereby promoting stable specialization and mitigating collapse. We observe consistent improvements over recent sparse-routing and grouped-MoE baselines across NLP and vision benchmarks, and confirm robustness via scaling studies (model size, expert count) and targeted ablations. In large-scale pre-training on 58B tokens, Hi-MoE-7B achieves a 5.6% perplexity reduction and a 40% improvement in expert balance over OLMoE-7B across diverse evaluation domains.
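
下面给出"分组路由 + 组间负载均衡正则"的一个最小 PyTorch 示意(组数、专家数与损失形式均为演示假设,并非 Hi-MoE 的原始目标函数):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedRouter(nn.Module):
    def __init__(self, dim, n_groups=4, experts_per_group=4):
        super().__init__()
        self.n_groups, self.k = n_groups, experts_per_group
        self.group_gate = nn.Linear(dim, n_groups)                         # 组级门控
        self.expert_gate = nn.Linear(dim, n_groups * experts_per_group)    # 组内专家门控

    def forward(self, x):
        # x: (tokens, dim)
        g = F.softmax(self.group_gate(x), dim=-1)                          # 组路由概率
        e = F.softmax(self.expert_gate(x).view(-1, self.n_groups, self.k), dim=-1)
        # 组间负载均衡损失:惩罚各组平均流量偏离均匀分布(Switch 式正则的简化版)
        group_load = g.mean(dim=0)
        balance_loss = self.n_groups * torch.sum(group_load * group_load)  # 均匀路由时取最小值 1
        return g, e, balance_loss

if __name__ == "__main__":
    torch.manual_seed(0)
    router = GroupedRouter(dim=64)
    x = torch.randn(128, 64)
    g, e, loss = router(x)
    print(g.shape, e.shape, float(loss))
```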

[AI-378] Graph Computation Meets Circuit Algebra: A Task-Aligned Analysis of Graph Neural Networks for Electronic Design Automation

【速读】:该论文旨在解决图神经网络(GNN)在电子设计自动化(EDA)领域应用中的"架构-任务不匹配"问题,即并非所有图结构问题都适用于相同的GNN计算范式。其解决方案的关键在于强调:成功的GNN-for-EDA方法必须使传播(propagation)、聚合(aggregation)和监督(supervision)机制与目标任务的原生代数结构(native algebra)对齐。例如,静态时序分析对应于拓扑有序有向无环图(DAG)上的最大-加/最小-加递推关系,适合异步DAG-GNN;布图规划则依赖于超图线长和密度惩罚,更适合可微分优化而非单纯的消息传递GNN;而IR压降建模为功率分配网络上的线性系统,需匹配线性代数求解器。通过逐任务分析,论文系统梳理了电路图与通用图的本质差异(如有向性、异构性、多尺度性和时钟结构),识别出当前方法的成功边界及因代数-架构错位导致的局限,并指出未来研究应聚焦于阶段泄漏、代理到签核差距、校准和设计分布漂移等典型失败模式。

链接: https://arxiv.org/abs/2605.08291
作者: Hyunmog Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注:

点击查看摘要

Abstract:EDA problems are graph-structured, but not all graph-structured problems call for the same GNN computation. We argue that successful GNN-for-EDA methods are those whose propagation, aggregation, and supervision align with the native algebra of the target task. Concretely: static timing analysis is a max-plus/min-plus recurrence on a topologically ordered DAG, structurally aligned with asynchronous DAG-GNNs; placement is governed by hypergraph wirelength and density penalties and is exploited by differentiable placers rather than by message-passing GNNs alone; routing congestion is a sparse demand-supply field over a layout grid; switching-activity propagation is a probabilistic recurrence on a directed netlist; IR drop is a linear system on the power-delivery network; and analog symmetry extraction is a discrete constraint-prediction problem on schematic graphs. Through these task-by-task alignments we (i) review the GNN architectural toolkit relevant to circuits, (ii) formalize how circuit graphs differ from generic graphs (directed, heterogeneous, multi-scale, with sequential and clock structure), (iii) characterize where current methods succeed and where the algebra-architecture mismatch limits them, and (iv) identify failure modes–stage leakage, proxy-to-signoff gap, calibration, and design-distribution shift–that we believe are likely to dominate the next phase of work. We position the paper as a GNN-for-EDA, task-aligned analysis rather than a comprehensive AI-for-chip-design survey. Continuous SE(3)-equivariant geometric GNNs are usually mismatched to Manhattan digital layout, and LLM-for-RTL, HLS, and RL/diffusion-based topology generation are outside our scope.
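
摘要将静态时序分析刻画为拓扑有序 DAG 上的 max-plus 递推,这一点可以用几行代码说明(门延迟与连线图为虚构示例):每个节点的到达时间是所有前驱到达时间加边延迟后的最大值。

```python
from collections import defaultdict, deque

def arrival_times(n, edges, sources):
    """max-plus 递推:edges 为 (u, v, delay) 列表,返回各节点的最晚信号到达时间。"""
    adj, indeg = defaultdict(list), [0] * n
    for u, v, d in edges:
        adj[u].append((v, d))
        indeg[v] += 1
    at = [float("-inf")] * n
    for s in sources:
        at[s] = 0.0
    q = deque(u for u in range(n) if indeg[u] == 0)   # Kahn 拓扑排序
    while q:
        u = q.popleft()
        for v, d in adj[u]:
            at[v] = max(at[v], at[u] + d)             # max-plus 更新
            indeg[v] -= 1
            if indeg[v] == 0:
                q.append(v)
    return at

if __name__ == "__main__":
    # 虚构的 5 节点时序图:节点 0、1 为输入端口
    edges = [(0, 2, 1.2), (1, 2, 0.8), (2, 3, 2.0), (1, 4, 3.5), (3, 4, 0.4)]
    print(arrival_times(5, edges, sources=[0, 1]))    # 末端节点的到达时间即关键路径延迟
```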

[AI-379] Toward Optimal Regret in Robust Pricing: Decoupling Corruption and Time

【速读】:该论文旨在解决鲁棒动态定价(robust dynamic pricing)中的 regret 保证问题,具体目标是解耦对扰动数量 $ C $ 和时间跨度 $ T $ 的依赖关系。在传统动态定价场景中,卖家通过逐轮设定价格并仅获得二元反馈(是否成交)来优化收入,而鲁棒设置下允许恶意对手在最多 $ C $ 轮内篡改反馈信息。此前最优的 regret 上界为 $ \mathcal{O}(C \log \log T) $,未能实现 $ C $ 与 $ T $ 的独立控制。本文的关键解决方案是一种鲁棒的二分搜索变体,当已知扰动数 $ C $ 时可达到 $ \mathcal{O}(C + \log T) $ 的 regret,未知时则为 $ \mathcal{O}(C + \log^2 T) $,从而首次实现了对 $ C $ 和 $ T $ 的解耦式控制,解决了长期悬而未决的开放问题。

链接: https://arxiv.org/abs/2605.08290
作者: Kalana Kalupahana,Francesco Emanuele Stradi,Matteo Castiglioni,Alberto Marchesi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We design the first regret guarantees for robust dynamic pricing that decouple the dependence on the corruption $C$ and the time horizon $T$. In dynamic pricing, a seller with unlimited supply of a good interacts with a stream of buyers over $T$ rounds, with the goal of maximizing revenue. At each round $t$, the seller posts a price $p_t$, and the buyer purchases the good only if their unknown valuation $v^\star$ exceeds this price. The seller observes only the binary feedback $\mathbb{I}\{p_t \leq v^\star\}$, indicating whether a sale occurred. In the robust pricing setting, a malicious adversary is allowed to corrupt this feedback in at most $C$ rounds. Even if the learner knows the corruption $C$, the best known regret bound is $\mathcal{O}(C\log\log T)$ by Gupta et al. [2025]. This leaves as an open problem to "decouple" the dependence on $C$ and $T$. In this work, we resolve this open problem. In particular, we develop a robust variant of binary search that achieves regret $\mathcal{O}(C+\log T)$ when the corruption $C$ is known and $\mathcal{O}(C+\log^2 T)$ when it is unknown.

[AI-380] What If We Let Forecasting Forget? A Sparse Bottleneck for Cross-Variable Dependencies

【速读】:该论文旨在解决多变量时间序列预测中跨变量依赖关系建模的可靠性问题,尤其是在特定状态条件下,现有方法因采用密集交互机制而易放大虚假相关性并导致表示过平滑,从而影响预测准确性。其解决方案的关键在于提出一种稀疏瓶颈框架MS-FLOW,通过将全连接通信替换为选择性稀疏路由机制,在严格通信预算下仅保留关键依赖路径,并有选择地注入跨变量信号,从而抑制冗余连接与虚假相关性的传播,实现从“更多交互”到“更有效交互”的范式转变。

链接: https://arxiv.org/abs/2605.08289
作者: Fan Zhang,Shiming Fan,Hua Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multivariate time series forecasting is critical in many real-world systems, and thus modeling cross-channel dependencies is essential. Although existing methods improve overall accuracy by enhancing representations and cross-channel interactions, it remains challenging to reliably capture inter-variable dependencies under specific conditions. We observe that dependencies in real data are often state-dependent and noisy; in such cases, dense interactions can amplify spurious correlations and lead to representation over-smoothing, which may yield unreliable predictions in certain scenarios. Motivated by this, we propose MS-FLOW, a sparse-bottleneck framework that explicitly models inter-variable interaction as capacity-limited information flow. Specifically, MS-FLOW replaces fully connected communication with selective sparse routing, retaining only a few critical dependency paths and injecting cross-variable signals under a strict communication budget, thereby suppressing redundant connections and spurious-correlation propagation. Extensive experiments demonstrate that MS-FLOW learns more reliable multivariate correlations, achieving state-of-the-art forecasting accuracy on 12 real-world benchmarks while producing fewer yet more reliable dependencies, shifting multivariate forecasting from “more interaction” to “more effective interaction”.

[AI-381] UMEDA: Unified Multi-modal Efficient Data Fusion for Privacy-Preserving Graph Federated Learning via Spectral-Gated Attention and Diffusion-Based Operator Alignment

【速读】:该论文旨在解决设备无关定位(Device-free Localization)中联邦学习(Federated Learning)面临的三大挑战:异构传感器模态与分辨率差异、数据分布漂移以及隐私噪声对定位结构信号的破坏。其核心解决方案是提出UMEDA框架,关键在于将客户端建模为全局图结构中的节点,并共享一个连续积分算子;通过线性注意力层对本地传感器进行编码,利用低秩滤波抑制模态特异性残差,使不同传感器的客户端在共同低秩子空间中对齐;服务器则基于该算子的谱系数设计扩散模型进行聚合,将更新视为共享算子的离散化表示而非拓扑绑定的权重,从而无需节点级对应即可适应不同图规模和缺失模态。此外,引入各向异性差分隐私机制,将噪声优先投影至信号子空间的零空间,保留主导特征方向的同时满足(\epsilon, \delta)-差分隐私约束,显著提升高模态异质性和严苛隐私预算下的定位精度、收敛速度与通信效率。

链接: https://arxiv.org/abs/2605.08288
作者: Shih-Yu Lai,Hirozumi Yamaguchi,Shang-Tse Chen,Yu-Lun Liu,Bing-Yu Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Device-free localization trains models from heterogeneous wireless and visual sensors (e.g., Wi-Fi, LiDAR) distributed across edge devices. Federated learning offers a privacy-respecting framework, but is brittle when clients differ in sensor modality and resolution, when their data distributions drift, and when privacy noise destroys the structural signal needed for localization. We propose UMEDA, a graph federated learning framework in which clients form nodes of a global graph that share a continuous integral operator, and aggregation is reformulated as spectral signal processing on this operator. Each client encodes its local sensors with a linear-attention layer whose kernel spectrum is low-rank filtered, suppressing modality-specific residuals so clients with different sensors align in a common low-rank subspace. The server then aggregates client updates via a diffusion model over the kernel’s spectral coefficients, treating updates as discretizations of a shared operator rather than topology-bound weights – this absorbs varying graph sizes and missing modalities without node-wise correspondence. To balance privacy and utility, we add an anisotropic differential-privacy mechanism that projects noise preferentially into the null space of the signal subspace, preserving dominant eigendirections while ensuring formal $(\epsilon, \delta)$-DP under gradient clipping. On MM-Fi and the RELI11D out-of-distribution benchmark, UMEDA outperforms state-of-the-art federated baselines in accuracy, convergence, and communication efficiency, particularly under high modality heterogeneity and tight privacy budgets.
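
摘要中"把噪声优先投影到信号子空间的零空间"这一步,可用一次 SVD 加正交投影来示意(维度、子空间秩、信号方向保留比例均为演示假设;正式的 DP 保证还需配合梯度裁剪与隐私会计):

```python
import numpy as np

def anisotropic_noise(update, signal_basis, sigma, rng):
    """向 update 注入高斯噪声,并把噪声大部分能量投影到 signal_basis 的正交补上。
    signal_basis: (d, k),列为主导特征方向(信号子空间的正交基)。"""
    d = update.shape[0]
    noise = rng.normal(scale=sigma, size=d)
    p_signal = signal_basis @ (signal_basis.T @ noise)   # 噪声在信号子空间内的分量
    p_null = noise - p_signal                            # 零空间(正交补)分量
    return update + p_null + 0.1 * p_signal              # 假设:信号方向只保留 10% 噪声

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    updates = rng.standard_normal((200, 32))             # 假设的历史客户端更新
    u, s, vt = np.linalg.svd(updates, full_matrices=False)
    basis = vt[:4].T                                      # 取前 4 个右奇异向量作为信号子空间
    noisy = anisotropic_noise(updates[0], basis, sigma=1.0, rng=rng)
    print(noisy.shape)
```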

[AI-382] Multi-Armed Bandits With Best-Action Queries

【速读】:该论文旨在解决在bandit-feedback模型中引入最优动作查询(best-action queries)后,对多臂赌博机(multi-armed bandits, MABs)算法的 regret 性能改进问题。此前,Russo 等人(2024)在 full-feedback 模型下证明:在随机和对抗环境中,k 次最优动作查询可将最优 regret 从 O~(T)\widetilde{\mathcal{O}}(\sqrt{T}) 降低至 O~(min{T/k,T})\widetilde{\mathcal{O}}(\min\{T/k, \sqrt{T}\})。然而,bandit-feedback 模型(即仅观测所选动作的奖励)下的性能提升是否成立仍是一个开放问题。本文通过理论分析给出了完整解答:当奖励在各动作间相关时,任何算法的 regret 至少为 Ω(Tk)\Omega(\sqrt{T} - k),且该下界适用于对抗环境;而在奖励独立同分布(i.i.d.)的随机设定下,可通过设计特定算法实现 O~(min{T/k,Tk})\widetilde{\mathcal{O}}(\min\{T/k, \sqrt{T} - k\}) 的 regret 上界,并匹配对应的下界(忽略对数因子)。其关键在于区分奖励的相关性结构,并据此设计适应性的探索策略与查询利用机制,从而揭示了最优动作查询在 bandit-feedback 场景中的实际收益边界。

链接: https://arxiv.org/abs/2605.08287
作者: Francesco Bacchiocchi,Matteo Castiglioni,Alberto Marchesi,Francesco Emanuele Stradi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We study multi-armed bandits (MABs) augmented with best-action queries, in which the learner may additionally query an oracle that reveals the best arm in the current round. This setting was recently characterized by Russo et al. [2024] in the full-feedback model, where the learner observes the rewards of all arms after each round. They show that, in both stochastic and adversarial environments, $k$ best-action queries reduce the optimal $\widetilde{\mathcal{O}}(\sqrt{T})$ regret to $\widetilde{\mathcal{O}}(\min\{T/k,\sqrt{T}\})$. Whether this improvement extends to the more realistic bandit-feedback model – where the learner observes only the reward of the played arm – was left as an open problem. We fully resolve this question. When rewards are stochastic but correlated among arms, we show that the full-feedback result does not carry over: any algorithm must incur regret at least $\Omega(\sqrt{T}-k)$. This lower bound directly extends to adversarial environments. On the positive side, we show that $\widetilde{\mathcal{O}}(\min\{T/k,\sqrt{T}-k\})$ regret is still achievable when rewards are stochastic and i.i.d., and establish a matching lower bound, up to logarithmic factors. Together, these results provide a complete characterization of the benefits of best-action queries in the bandit-feedback model.

[AI-383] Diagnosing Spectral Ceilings in Equivariant Neural Force Fields

【速读】:该论文旨在解决如何量化评估训练好的等变力场骨干网络(equivariant force-field backbone)在分子动力学模拟中对特定角频率(angular frequencies)的保留能力这一问题。其核心挑战在于,现有方法难以精确识别模型在频域上是否能够有效捕捉和恢复输入扰动中的特定频率成分。解决方案的关键在于提出一种谱注入诊断(spectral-injection diagnostic):通过向分子力场注入可控角频率扰动,并在其冻结的骨干网络后连接一个轻量级谱预测网络(Spectral Prediction Network, SPN),从而直接读取哪些频率可被模型恢复。实验表明,该方法能清晰揭示模型在频域上的边界行为——例如,在阿司匹林分子上,L=2的NequIP骨干配合二次SPN仅能恢复l=4的边界信号,而在l=5时性能骤降(p值从0.913降至0.078),且该现象在多个独立训练的骨干网络中保持一致,验证了其可靠性。

链接: https://arxiv.org/abs/2605.08286
作者: Hyunmog Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce a spectral-injection diagnostic for measuring which angular frequencies a trained equivariant force-field backbone preserves: inject a controlled angular-frequency perturbation into a molecular force field, attach a lightweight Spectral Prediction Network (SPN) to the frozen backbone, and read off which frequencies are recoverable. On aspirin, a quadratic SPN attached to an L = 2 NequIP backbone recovers the boundary signal at l = 4 but collapses at l = 5: an 11.7x cliff at the predicted dL boundary, with p dropping from 0.913 to 0.078. The same boundary-vs-above contrast persists across n = 4 independently trained backbones (raw-gain delta contrast, hierarchical cluster bootstrap) and is corroborated by a denominator-free injected-residual metric (R2_inj(4) = 0.374 versus R2_inj(5) = 0.006). A finite-degree span theorem calibrates the diagnostic: for a single marked direction, degree-d polynomials of degree-L spherical-harmonic features span exactly H less than or equal to dL with multiplicity-one saturation at the boundary (scoped to single-direction degree-bounded probes, not a function-class upper bound on multi-atom MPNNs). A synthetic C5 calibration plus capacity, activation, and cross-architecture controls rule out parameter count alone as the explanation.

[AI-384] Beyond the False Trade-off: Adaptive EWC for Stealthy and Generalizable T2I Backdoors

【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)生成模型中后门攻击的隐蔽性问题,即在保证攻击成功率(Attack Success Rate, ASR)的同时维持模型原始生成质量(fidelity)。现有方法如Learning without Forgetting(LwF)依赖输出层面的知识蒸馏,正则化能力有限,难以有效平衡ASR与fidelity,尤其在弱触发器(weak triggers)场景下性能显著下降。解决方案的关键在于引入参数级正则化方法——弹性权重巩固(Elastic Weight Consolidation, EWC),并提出一种基于余弦感知的自适应EWC机制:通过余弦相似度衡量语义一致性作为动态效用指标,并结合自适应调度策略调整EWC的正则化强度,从而将固定惩罚转变为上下文敏感的约束,实现高ASR与高fidelity的协同优化,同时提升对域外(out-of-domain, OOD)数据的鲁棒性。

链接: https://arxiv.org/abs/2605.08280
作者: Lu Bowen,Xinyu Tang,Yin Yin Low,Shu-Min Leong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Preserving model fidelity is essential for stealthy text-to-image (T2I) backdoor attacks. Existing methods such as Learning without Forgetting (LwF) rely on output-based distillation, which provides limited regularization. We introduce Elastic Weight Consolidation (EWC) as a parameter-based alternative for preserving fidelity in backdoor learning. While stronger in principle, we show that standard static EWC with a fixed regularization weight lambda and mean-squared utility loss creates an artificial trade-off between attack success rate (ASR) and fidelity, particularly degrading performance on weak triggers. To address this, we propose Cosine-Aware Adaptive EWC, which dynamically adjusts EWC regularization using a cosine-based semantic utility and adaptive scheduling. This approach transforms EWC from a fixed penalty into a context-sensitive constraint, maintaining high ASR while preserving model fidelity. Experiments demonstrate improved ASR-fidelity balance and enhanced robustness on out-of-domain (OOD) datasets compared to existing baselines.
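
下面给出"余弦感知的自适应 EWC 正则"的一种可能写法(效用定义与 lambda 调度均为假设,并非论文原始公式):EWC 惩罚按 Fisher 重要性加权参数偏移,再用基于余弦相似度的语义效用动态缩放正则强度。

```python
import torch

def ewc_penalty(params, ref_params, fisher):
    """标准 EWC 惩罚:sum_i F_i * (theta_i - theta_i^*)^2。"""
    loss = 0.0
    for name in params:
        loss = loss + (fisher[name] * (params[name] - ref_params[name]) ** 2).sum()
    return loss

def adaptive_lambda(clean_emb, current_emb, lam_max=100.0):
    """余弦语义效用越低(保真度下降越多),正则权重越大;这是一个假设的调度形式。"""
    cos = torch.nn.functional.cosine_similarity(clean_emb, current_emb, dim=-1).mean()
    return lam_max * (1.0 - cos).clamp(min=0.0)

if __name__ == "__main__":
    torch.manual_seed(0)
    params = {"w": torch.randn(8, requires_grad=True)}
    ref = {"w": params["w"].detach() + 0.1 * torch.randn(8)}   # 干净模型的参考参数
    fisher = {"w": torch.rand(8)}                               # 假设的 Fisher 对角近似
    lam = adaptive_lambda(torch.randn(4, 16), torch.randn(4, 16))
    total = lam * ewc_penalty(params, ref, fisher)              # 加到后门训练目标上
    print(float(total))
```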

[AI-385] LaWM: Least Action World Models for Long-Horizon Physical Consistency from Visual Observations

【速读】:该论文旨在解决现有潜空间世界模型在长期预测中因缺乏物理约束而导致的误差累积、能量漂移及物理不一致性问题,这些问题通常源于仅依赖无约束神经过渡函数进行未来状态生成。解决方案的关键在于提出最小作用量世界模型(Least Action World Models, LaWM),其核心思想是将最小作用量原理(Principle of Least Action)引入学习到的视觉潜在空间中:通过构建一个潜变量变分积分器(latent variational integrator),LaWM 将观测编码为广义坐标,学习连续潜状态间的离散拉格朗日量(discrete Lagrangian),并基于此构造离散作用量泛函,最终通过求解对应的离散积分条件来推进预测。由此,物理结构不再是事后评分或正则化手段,而是直接定义了潜空间的转移规则,从而在长时程视觉预测中提供结构保持偏差(structure-preserving bias),显著提升物理不变性、背景一致性、运动平滑性和外观与几何预测精度。

链接: https://arxiv.org/abs/2605.08279
作者: Qixin Xiao,Maani Ghaffari
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Learning predictive world models from visual observations is a core problem in embodied AI, with applications to model-based reinforcement learning and robotic planning. Existing latent world models typically generate future states with unconstrained neural transition functions, while modern video generation systems often prioritize perceptual plausibility or introduce physical structure through auxiliary losses, external guidance, or separate dynamics modules. As a result, long-horizon rollouts can remain weakly grounded in the physical principles that govern real dynamics, leading to compounding error, energy drift, and physically inconsistent futures. We propose Least Action World Models (LaWM), a latent world-modeling framework that operationalizes the Principle of Least Action in learned visual latent space: future rollouts are governed by a learned Lagrangian action functional rather than produced only by an unconstrained transition predictor. Our main technical realization is a latent variational integrator: LaWM encodes observations into learned generalized coordinates, learns a latent discrete Lagrangian over consecutive latent states, constructs a discrete action functional, and advances prediction by solving the corresponding discrete integration condition. Thus, physical structure is not merely used to score, regularize, or constrain a completed trajectory; it defines the latent transition rule itself. Because the transition is induced by a discrete variational principle, LaWM provides a structure-preserving bias for long-horizon visual prediction. Across physics-clean synthetic dynamics and embodied robot interaction benchmarks, LaWM improves physical invariance, background consistency, motion smoothness, and appearance and geometric prediction metrics over video-generation and world-model baselines.
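
作为直观参照,经典(非学习)情形下的变分积分器可以说明"由离散拉格朗日量诱导过渡规则"的含义:取离散拉格朗日量 $L_d(q_k, q_{k+1}) = h\big[\tfrac{1}{2}((q_{k+1}-q_k)/h)^2 - \tfrac{1}{2}(V(q_k)+V(q_{k+1}))\big]$,离散 Euler-Lagrange 条件给出 Störmer-Verlet 形式的更新 $q_{k+1} = 2q_k - q_{k-1} - h^2 V'(q_k)$。下面的示例仅演示这种更新的结构保持特性(能量漂移有界),与论文在潜空间中学习得到的拉格朗日量没有直接对应:

```python
import numpy as np

def variational_rollout(q0, q1, grad_v, h, steps):
    """由离散 Euler-Lagrange 条件导出的结构保持更新(Störmer-Verlet)。"""
    traj = [q0, q1]
    for _ in range(steps):
        q_prev, q = traj[-2], traj[-1]
        traj.append(2 * q - q_prev - h ** 2 * grad_v(q))
    return np.array(traj)

if __name__ == "__main__":
    grad_v = lambda q: q                       # 谐振子势 V(q) = q^2 / 2
    h = 0.05
    traj = variational_rollout(q0=1.0, q1=np.cos(h), grad_v=grad_v, h=h, steps=2000)
    v = np.diff(traj) / h
    energy = 0.5 * v ** 2 + 0.5 * traj[1:] ** 2
    # 长时间推演下能量在固定区间内振荡而不发散,体现结构保持偏差
    print(float(energy.min()), float(energy.max()))
```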

[AI-386] Trapping Attacker in Dilemma: Examining Internal Correlations and External Influences of Trigger for Defending GNN Backdoors

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在关系数据上学习时面临的后门攻击(backdoor attacks)问题,此类攻击能够隐蔽地操纵模型预测结果。现有防御方法通常依赖于检测特定子图模式或节点特征,易被自适应攻击者绕过。论文提出PRAETORIAN防御机制,其核心创新在于不依赖表面线索,而是聚焦于有效GNN后门攻击的内在要求:攻击者需对目标节点施加显著影响,表现为注入大量触发节点或利用少数高影响力节点。PRAETORIAN通过分析潜在触发子图内部相关性以识别异常大的注入结构,并量化外部节点影响力以发现具有不成比例作用的触发器,从而实现高效且鲁棒的防御。实验表明,该方法将平均攻击成功率(ASR)降至0.55%,仅导致0.62%的干净准确率(CA)下降,显著优于当前最优防御方案。

链接: https://arxiv.org/abs/2605.08278
作者: Fan Yang,Binyan Xu,Di Tang,Kehuan Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:GNNs have become a standard tool for learning on relational data, yet they remain highly vulnerable to backdoor attacks. Prior defenses often depend on inspecting specific subgraph patterns or node features, and thus can be circumvented by adaptive attackers. We propose PRAETORIAN, a new defense that targets intrinsic requirements of effective GNN backdoors rather than surface-level cues. Our key observation is that flipping a victim node’s prediction requires substantial influence on the victim: attackers tend to either inject many trigger nodes or rely on a small set of highly influential ones. Building on this observation, PRAETORIAN (i) analyzes internal correlations within potential trigger subgraphs to detect abnormally large injected structures, and (ii) quantifies external node influence to identify triggers with disproportionate impact. Across our evaluations, PRAETORIAN reduces the average attack success rate (ASR) to 0.55% with only a 0.62% drop in clean accuracy (CA), whereas state-of-the-art defenses still yield an average ASR of 20% and a CA drop of 3% under the same conditions. Moreover, PRAETORIAN remains effective against a range of adaptive attacks, forcing adversaries to either inject many trigger nodes to achieve high ASR (80%), which incurs a 10% CA drop, or preserve CA at the cost of limiting ASR to 18.1%. Overall, PRAETORIAN constrains attackers to an unfavorable trade-off between efficacy and detectability.

[AI-387] Mitigating Many-shot Jailbreak Attacks with One Single Demonstration

【速读】:该论文旨在解决多示例越狱攻击(Many-shot Jailbreaking, MSJ)问题,即攻击者通过在有害查询前添加大量有害的问答演示,使原本安全对齐的语言模型产生有害响应。研究发现,随着演示数量增加,MSJ会引发渐进式的表征漂移(activation drift),导致固定有害查询的嵌入表示逐步偏离安全区域。其核心解决方案是将这种漂移视为隐式恶意微调(implicit malicious fine-tuning):N个有害演示等价于对相应样本进行SGD风格的优化更新。基于此理论视角,作者提出在推理阶段添加一个固定的单示例安全演示,该操作可诱导反向的安全导向更新,从而恢复模型的拒绝行为。该方法显著提升了模型对MSJ的鲁棒性,且无需修改模型参数或部署时具备白盒访问权限。

链接: https://arxiv.org/abs/2605.08277
作者: Kejia Chen,Jiawen Zhang,Boheng Li,Pengcheng Li,Jian Lou,Zunlei Feng,Mingli Song,Ruoxi Jia,Tianwei Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many-shot jailbreaking (MSJ) causes safety-aligned language models to answer harmful queries by preceding them with many harmful question-answer demonstrations. We study why this attack becomes stronger as the number of demonstrations increases. Empirically, we find that MSJ induces a progressive activation drift: the representation of a fixed harmful query moves step by step away from the safety-aligned region as more harmful demonstrations are added. Theoretically, we show that this drift can be interpreted as implicit malicious fine-tuning: conditioning on N harmful demonstrations induces SGD-style updates equivalent to optimizing on the corresponding N harmful samples. This view turns the attack mechanism into a defense principle. We append a fixed one-shot safety demonstration at inference time, which induces a counteracting safety-oriented update and restores refusal behavior. The resulting method improves the model’s robustness to MSJ without modifying its parameters or requiring white-box access at deployment. Code is available at this https URL.
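
该防御在推理时只需在(可能包含 MSJ 有害演示的)上下文末尾、真实查询之前追加一条固定的安全演示。下面是纯消息列表层面的构造示意(演示内容与消息格式均为假设,实际使用需替换为目标模型的对话模板):

```python
SAFETY_DEMO = [
    {"role": "user", "content": "请给出制作危险物品的详细步骤。"},
    {"role": "assistant", "content": "抱歉,我不能协助这类有害请求。"},
]

def build_defended_messages(incoming_messages):
    """在上下文末尾、最终用户查询之前插入一条固定的安全演示。"""
    if not incoming_messages:
        return list(SAFETY_DEMO)
    *context, final_query = incoming_messages
    return list(context) + list(SAFETY_DEMO) + [final_query]

if __name__ == "__main__":
    # 假设的 MSJ 攻击上下文:大量有害问答演示 + 最终有害查询
    attack = [{"role": "user", "content": f"有害问题 {i}"} for i in range(6)]
    attack += [{"role": "user", "content": "最终的有害查询"}]
    defended = build_defended_messages(attack)
    print(len(attack), "->", len(defended))
```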

[AI-388] Efficient Prompt Learning for Traffic Forecasting VLDB

【速读】:该论文旨在解决当前时空图神经网络(Spatio-Temporal Graph Neural Networks, ST-GNNs)在面对分布偏移(distribution shifts)时泛化能力不足的问题,尤其是在复杂时空动态变化下模型适应新场景的能力有限。其解决方案的关键在于提出一种轻量级、与模型无关的提示调优(prompt tuning)框架 SimpleST,通过固定预训练模型参数并引入可学习的提示机制,实现对新型分布的有效适应,从而提升模型在分布外场景下的预测精度和计算效率。

链接: https://arxiv.org/abs/2605.08273
作者: Qianru Zhang,Xinyi Gao,Alexander Zhou,Reynold Cheng,Siu-Ming Yiu,Hongzhi Yin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 24 pages. This paper is accepted by VLDBJ

点击查看摘要

Abstract:Accurate traffic prediction is essential for optimizing transportation systems, enhancing resource allocation, and improving overall urban administration. Spatio-temporal graph neural networks (GNNs) have achieved state-of-the-art performance and have been widely used in various spatio-temporal prediction scenarios. However, these prediction methods often exhibit low generalization ability, struggling with distribution shifts caused by spatio-temporal dynamics. To address this challenge, we propose an approach to enhance the generalization and adaptation of spatio-temporal GNNs through efficient prompting. Specifically, we introduce a lightweight and model-agnostic prompt tuning framework for spatio-temporal GNNs, named SimpleST. It facilitates adapting pre-trained spatio-temporal GNNs to novel distributions while keeping the model parameters fixed. This prompt mechanism reduces the overhead and complexity of adaptation, enabling efficient utilization of pre-trained models for out-of-distribution generalization. Extensive experiments conducted on five real-world urban spatio-temporal datasets demonstrate the superiority of our approach in terms of prediction accuracy and computational efficiency.

[AI-389] Execution Envelopes: A Shared Admission Contract for Backend AI Execution Requests

【速读】:该论文旨在解决现代企业级AI后端中因异构执行请求(如模型部署、推理、评估、数据移动及代理工作流)导致的治理与可观测性难以统一的问题。当前系统中,各服务使用各自特定的请求格式,使得在准入阶段添加日志记录、授权策略钩子、资源计费和运行时审查等通用行为变得困难,需重复实现相同逻辑。解决方案的关键在于提出“执行封装体”(execution envelope),这是一个标准化的内部准入对象,用于记录请求方身份、所需资源、策略相关作用域及最终授予的资源,并通过前置路径传递至具体后端处理前。该设计不替代原有服务模型或调度机制,而是提供一个描述性的准入接口,使治理和可观测性能力可在单一位置集中附加,从而提升系统的可管理性和一致性。

链接: https://arxiv.org/abs/2605.08267
作者: Krti Tallam
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET)
备注: Systems paper on backend admission contracts, 12 pages, 4 tables

点击查看摘要

Abstract:Enterprise AI backends increasingly admit heterogeneous execution requests across model deployment, inference, evaluation, data movement, and agentic workflows. In many systems, those requests arrive in service-specific shapes, which makes it difficult to attach shared admission-time behavior such as logging, governance hints, resource accounting, authorization-aware policy hooks, and later runtime review without rebuilding the same contract in each subsystem. This paper introduces the execution envelope, a normalized internal admission object that records who is asking for what kind of execution, what resources were requested, what policy-relevant scope accompanied the request, and what the backend ultimately granted. The proposal is intentionally narrow. It does not replace service-specific request models, perform scheduling, or introduce a new authority token. Instead, it defines a descriptive admission seam that can be threaded through real backend paths before backend-specific resolution begins. I formalize the distinction between requested and granted resources, specify the field families, invariants, and lifecycle of the envelope, work through POST /serving/deploy_model as an initial proving ground, and position the design relative to usage control, analyzable authorization, admission control, and cluster scheduling. The central claim is that a shared execution-admission contract is a useful missing primitive for modern AI backends because it creates one place to attach governance and observability without pretending to solve placement, policy, and runtime execution in a single step.

[AI-390] Computer Use at the Edge of the Statistical Precipice

【速读】:该论文旨在解决当前计算机使用代理(Computer Use Agents, CUAs)评估中存在的方法论缺陷问题,这些问题主要源于非原则性的环境设计(如静态、未沙箱化或不可靠验证的环境)和非原则性的评估方法(如对状态感知用户界面交互中误用pass@k指标及简单聚合)。为应对上述挑战,论文提出两大核心解决方案:其一,设计并实现PRISM框架,包含五个关键环境设计原则(特权验证、现实环境、配置完整性检查、沙箱执行与多因子变异性),并在DigiWorld基准中具体落实,该基准包含15个真实沙箱化的移动应用,支持超过320万种已验证的独特配置;其二,开发一种结合Wilson置信区间与分层自助法(hierarchical bootstrap)的聚合框架,以正确捕捉CUA基准的嵌套结构并生成可靠的置信区间。研究证明,严谨的环境设计与评估方法并非可选优化,而是开展有意义CUA研究的前提条件。

链接: https://arxiv.org/abs/2605.08261
作者: Pierluca D’Oro,Sneha Silwal,William Wong,Yuxuan Sun,Fanyi Xiao,Manchen Wang,Eric Gan,Allen Bolourchi,Joseph Tighe
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Evaluating Computer Use Agents (CUAs) on interactive environments is fraught with methodological pitfalls that the field has yet to systematically address. We show that a 1MB replay script that blindly executes a recorded action sequence without ever observing the screen outperforms frontier models on prominent static benchmarks, and prove that its expected success rate is exactly equal to the source agent’s pass@k in deterministic environments. We trace this and other failures to two root causes: non-principled environment design (static, unsandboxed, or unreliably verified environments) and non-principled evaluation methodology (naive aggregation and misuse of pass@k for stateful UI interactions). To address the first, we propose PRISM, five design principles for CUA environments (privileged verification, realistic environments, integrity-checked configurations, sandboxed execution, and multifactorial variability) and instantiate them in DigiWorld, a benchmark of 15 realistic sandboxed mobile applications able to evaluate agents in over 3.2 million verified unique configurations. To address the second, we develop an aggregation framework pairing Wilson score intervals with hierarchical bootstrap, producing confidence intervals that correctly account for the nested structure of CUA benchmarks, as we empirically demonstrate. All together, we show that principled environment design and rigorous evaluation methodology are not optional refinements but prerequisites for meaningful CUA research.

[AI-391] Research on Security Enhancement Methods for Adversarial Robust Large Language Model Intelligent Agents for Medical Decision-Making Tasks

【速读】:该论文旨在解决医疗决策智能代理(medical decision making intelligent agents)在面对对抗攻击、安全威胁和信任缺失时的鲁棒性不足问题,尤其关注语义扰动、提示注入、药物名称混淆及虚假证据攻击等场景下的安全性与可靠性。其解决方案的核心在于提出一个全链路安全增强框架ARSM-Agent,通过多模块协同机制实现从输入风险感知到输出安全控制的闭环管理,并设计了一个加权联合目标函数,包含决策准确率损失、对抗鲁棒性损失、安全拒绝损失和知识一致性损失,权重分别为0.3、0.3、0.2和0.2。实验表明,该方法在多种攻击下将整体攻击成功率降至8.7%,同时保持高知识一致性得分(0.91),且消融实验证明各模块对性能提升具有显著贡献,尤其是证据约束与一致性验证模块对系统安全性至关重要。

链接: https://arxiv.org/abs/2605.08257
作者: Saisai Hu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 5 pages, 2 figures, 1 table. Accepted for oral presentation at AINIT 2026

点击查看摘要

Abstract:Motivated by the challenge to improve the adversarial robustness, security, and trust of medical decision making intelligent agents, this study develops a full-link security enhancement framework, which describes “input risk perception - medical evidence constraint - knowledge consistency verification - decision confidence reweighting - security output control - adversarial feedback update.” We propose ARSM-Agent and define a weighted joint objective consisting of decision accuracy loss, adversarial robustness loss, safety refusal loss, and knowledge consistency loss, with weights of 0.3, 0.3, 0.2, and 0.2, respectively. The whole medical decision formulation is implemented by multi-module collaborative linkage. We verify that the algorithm is more efficient than four baselines, including LLM-Agent, Retrieval-Agent, Filter-Agent, and Adv-Train-Agent. Under semantic perturbation, prompt injection, drug-name confusion, and false-evidence attacks, ARSM-Agent reduces the overall attack success rate to 8.7% and achieves a knowledge consistency score of 0.91. Ablation experiments quantify each module’s contribution: removing risk perception, evidence retrieval, consistency verification, and confidence reweighting reduces accuracy by 6.7%, 9.1%, 7.6%, and 4.4%, respectively, and increases attack success rate by 13.8%, 11.1%, 8.6%, and 6.9%. The proposed approach addresses key security issues of medical decision making intelligent agents, obtains secure decision making in challenging scenarios, and provides reliable intelligent support for medical decision-making intelligent agents.

[AI-392] Can LLMs Predict Polymer Physics Just by Reading Synthesis and Processing Prose?

【速读】:该论文旨在解决传统聚合物性能预测模型因仅依赖化学结构表示(如SMILES或分子图)而忽略合成路径、加工历史、形貌及测试条件等关键实验背景信息的问题,从而导致预测结果与实际材料性能存在偏差。其解决方案的关键在于提出一个完全基于自然语言的框架PolyLM,该框架直接从全文文献中提取信息进行性能预测,无需任何结构输入,从而保留了领域科学家对合成与处理过程的非结构化描述;通过在包含18.5万篇科学论文和27.6万种独特聚合物样本的大规模数据集上微调90亿参数的语言模型(Qwen3.5-9B),并结合低秩适应(LoRA)和任务级不确定性加权技术,实现了对22种物理、机械和热学性质的高精度预测,验证了自然语言作为真实材料性能预测接口的潜力。

链接: https://arxiv.org/abs/2605.08255
作者: Yuchu Liu,Rui Zhu,Jingwei Xiong,Haixu Tang
机构: 未知
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Can large language models predict physical and mechanical polymer properties simply by reading unstructured scientific prose? Polymer performance is rarely determined by chemical structure alone; identical nominal polymers can exhibit drastically different behaviors depending on their synthesis route, processing history, morphology, and testing conditions. Yet, state-of-the-art polymer property models typically rely on structure-only representations – such as SMILES or molecular graphs – which strip away this vital experimental context. In this work, we introduce PolyLM, a natural-language-only, process- and condition-aware framework that predicts materials performance directly from full-text literature. By circumventing structural inputs entirely, PolyLM preserves the nuanced, unstructured descriptions of synthesis and processing reported by domain scientists. To train this framework, we curated an unprecedented, literature-scale dataset encompassing 185,000 scientific papers and over 276,400 unique polymer samples across 22 physical, mechanical, and thermal properties. We fine-tuned a massive 9-billion-parameter language model (Qwen3.5-9B) using Low-Rank Adaptation (LoRA) and task-level uncertainty weighting. Evaluated on 68,283 held-out observations, the model achieves remarkably high predictive accuracy, establishing new state-of-the-art benchmarks for complex properties. Across the 22 diverse targets, the model achieves a median R^2 of 0.74, with predictions for key thermal, mechanical, and physicochemical properties frequently surpassing an R^2 of 0.80. These results unequivocally demonstrate that natural language is a powerful, highly scalable interface for realistic materials performance prediction.

[AI-393] HyperTransport: Amortized Conditioning of T2I Generative Models

【速读】:该论文旨在解决大模型(如生成式 AI)在行为控制上的效率与稳定性问题。现有方法中,微调成本高,提示工程(prompting)则因对措辞和结构敏感而脆弱;激活调控(activation steering)虽更稳定,但需为每个概念单独优化,导致在概念集庞大、动态变化或按需指定时难以部署。解决方案的关键在于提出 HyperTransport——一种基于超网络(hypernetwork)的框架,通过预训练编码器(如 CLIP)将概念嵌入直接映射到干预参数,利用最优传输损失(optimal transport loss)端到端训练。该方法实现了对任意新概念的快速干预(单次前向传播,比传统方法快 3600–7000 倍),同时具备开放概念集下的可扩展性、连续可解释的强度控制以及跨模态条件生成能力,显著优于现有方法。

链接: https://arxiv.org/abs/2605.08254
作者: Valentino Maiorca,Eleonora Gualdoni,Xavier Suau,Marco Cuturi,Luca Zappella,Pau Rodríguez
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As foundation models grow in capability, the ability to efficiently and reliably control their behavior becomes critical. Fine-tuning these models can be costly, and while prompting can be practical for controllability, it remains fragile due to models’ high sensitivity to exact prompt wording and structure. This brittleness has driven interest in activation steering techniques that offer more stable and predictable control over model behavior. However, existing activation steering methods require per-concept optimization, which makes them ill-suited to deployment scenarios where the concept set is large, evolving, or only specified at request time: each new concept incurs at least minutes of optimization on the target model. We propose HyperTransport, a hypernetwork framework that amortizes this cost by mapping embeddings from a pretrained encoder (CLIP in our instantiation) directly to intervention parameters, trained end-to-end using an optimal transport loss. Once trained, HyperTransport produces each new intervention in a single hypernetwork forward pass, 3600-7000x faster than per-concept fitting. On concepts unseen during training, it matches the strongest per-concept baselines at inducing the target concept. By decoupling concept representation from intervention prediction, HyperTransport combines three capabilities that no existing approach offers as a set: amortized steering for open-ended concept sets, continuous interpretable strength control, and cross-modal conditioning where reference images can directly steer text-based generation. We validate HyperTransport on DMD2 and Nitro-1-PixArt across 167 held-out test concepts via CLIP-based metrics, a VLM-as-a-judge evaluation, and a user study. In pairwise comparisons, both human and VLM judges prefer HyperTransport over prompting ~2x as often.

[AI-394] Path-Coupled Bellman Flows for Distributional Reinforcement Learning ICML2026

【速读】:该论文旨在解决分布强化学习(Distributional Reinforcement Learning, DRL)中现有方法存在的两个核心问题:一是基于有限支撑或分位数的方法依赖投影操作,可能导致分布近似失真;二是基于流匹配(flow-based)的方法在流源处存在边界不匹配问题,或因当前与后继状态噪声独立而导致bootstrapping方差过高。解决方案的关键在于提出路径耦合贝尔曼流(Path-Coupled Bellman Flows, PCBF),其通过构建源一致的贝尔曼耦合路径(source-consistent Bellman-coupled paths)实现连续时间下的分布建模:当前路径从指定先验出发,在时间 $ t=1 $ 达到贝尔曼目标,并在中间时刻保持与后继流的路径级仿射关系(无需所有时间点边际分布满足分布贝尔曼不动点);同时引入 λ\lambda-参数化控制变量目标,其中 λ=0\lambda=0 对应无偏样本贝尔曼目标,λ>0\lambda > 0 则以可控偏差换取方差降低,从而提升分布保真度和训练稳定性。

链接: https://arxiv.org/abs/2605.08253
作者: Boyang Xu,Qing Zou,Siqin Yang,Hao Yan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

点击查看摘要

Abstract:Distributional reinforcement learning (DRL) models the full return distribution, but existing finite-support or quantile-based methods rely on projections, while recent flow-based approaches can suffer from boundary mismatch at the flow source or from high-variance bootstrapping when current and successor noises are independent. We propose Path-Coupled Bellman Flows (PCBF), a continuous-time DRL method that learns return distributions with flow matching using source-consistent Bellman-coupled paths: the current path starts from the required base prior at $t=0$, reaches the Bellman target at $t=1$, and maintains a pathwise affine relation to the successor flow at intermediate times (without requiring time-$t$ marginals to satisfy a distributional Bellman fixed point for all $t$). PCBF couples current and successor return flows through shared base noise and uses a $\lambda$-parameterized control-variate target: $\lambda=0$ recovers an unbiased sample Bellman target, while $\lambda>0$ trades controlled bias for variance reduction. Experiments on analytically tractable MRPs, OGBench, and D4RL show improved distributional fidelity and training stability, and competitive offline RL performance.

[AI-395] LLM Translation of Compiler Intermediate Representation

【速读】:该论文旨在解决不同编译器工具链(如GCC与LLVM)之间因中间表示(Intermediate Representation, IR)语义和结构差异导致的互操作性障碍问题,这种障碍限制了前端、后端及优化流水线在多种编程语言和编译生态系统间的复用。解决方案的关键在于提出IRIS-14B——一个基于Transformer架构、参数规模达140亿的大型语言模型(Large Language Model, LLM),专门针对GIMPLE(GCC生成的IR)到LLVM IR的转换任务进行微调训练。该模型通过从真实C代码中提取的配对IR样本学习跨IR的复杂映射关系,显著优于现有主流开源模型(参数量范围从13亿至1000亿),实现了高精度的IR-to-IR翻译,从而为混合神经符号编译架构提供可插拔的互操作层,无需修改原有编译器模块即可支持跨工具链工作流。

链接: https://arxiv.org/abs/2605.08247
作者: Andrea Valenzuela Ramirez,Cristian Gutierrez-Gomez,Marta Barroso,Dario Garcia-Gasulla,Sara Royuela
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:GCC and LLVM underpin much of modern software infrastructure, relying on distinct Intermediate Representations (IRs) to drive optimizations and code generation. However, the semantic and structural differences between these IRs create significant barriers for cross-toolchain interaction, limiting the reuse of compiler frontends, backends, and optimization pipelines across programming languages and compilation ecosystems. Traditional rule-based translators have attempted to bridge this gap, but their complexity and maintenance cost have hindered practical adoption. In this context, Large Language Models (LLMs) appear to be an emerging technology that offers a data-driven alternative, capable of learning complex mappings between heterogeneous compiler IRs directly from sufficiently representative examples. To explore this approach, this paper presents IRIS-14B, a 14-billion-parameter transformer model fine-tuned to translate GIMPLE (as emitted by GCC) to LLVM IR (as emitted by LLVM). The model is trained on paired IRs extracted from C sources and evaluated on the GIMPLE-to-LLVM IR transformation applied to IRs derived from real-world C code and competitive programming problems. To the best of our knowledge, IRIS-14B is the first model trained explicitly for IR-to-IR translation. It outperforms the accuracy of widely used models, including the largest state-of-the-art open models available today, ranging from 13 to 1,000 billion parameters, by up to 44 percentage points. The proposed transformation supports the integration of LLMs as complementary components within hybrid neuro-symbolic compiler architectures, where models such as IRIS-14B act as interoperability layers enabling cross-toolchain workflows without modifying existing compiler passes, while traditional compiler infrastructure continues to perform deterministic compilation and optimization.

[AI-396] When Does Value-Aware KV Eviction Help? A Fixed-Contract Diagnostic for Non-Monotone Cache Compression

【速读】:该论文旨在解决长上下文大语言模型(Long-context LLM)推理中因解码阶段需频繁读取大规模键值缓存(KV cache)而导致的内存和带宽瓶颈问题。现有KV压缩方法通过仅保留部分缓存来降低开销,但缺乏对选择器(selector)性能失败原因的精细诊断能力。论文提出一种固定合约诊断(fixed-contract diagnostic)机制,其核心在于保持选择器配置不变,逐次调整一个决策槽位以定位失败根源:即是否遗漏未来解码所需证据、错误赋予无关token高分,或在压缩过程中破坏相关证据关联性。该方法通过结合注意力质量与移除区块后的输出变化估计,实现对价值排序(value ranking)的有效探测,在LongBench基准上验证了其准确性,并揭示出最优策略应优先恢复解码侧证据、再评估其输出价值,最后在投影过程中保留耦合证据。

链接: https://arxiv.org/abs/2605.08234
作者: Ruijie Zhang,Haozhe Liang,Da Chang,Li Hu,Fanqi Kong,Huaxiao Yin,Yu Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-context LLM inference is bottlenecked by the memory and bandwidth cost of reading large KV caches during decoding. KV compression reduces this cost by keeping only part of the cache, but task accuracy alone does not identify why a selector succeeds or fails. A selector can fail at three steps: it may miss the evidence future decoding needs, give high scores to tokens that do not affect the output, or break related evidence when fitting scores into a small cache. We introduce a fixed-contract diagnostic that holds the selector’s setup fixed and changes one decision slot at a time. For value ranking, the probe combines a block’s attention mass with the estimated output change from removing it. On LongBench across three models and two budgets, the probe is positive on 72.6% of positive-margin cells and 32.4% of nonpositive-margin cells. NeedleBench M-RT at 32k and a RULER 8k check probe support closure under branched retrieval, and a 264-cell sign evaluation separates support recovery and output-value ranking from leverage effects near the boundary. The resulting order is to recover decode-side evidence, rank its output value, and preserve coupled evidence during projection.

[AI-397] TRAM: Training Approximate Multiplier Structures for Low-Power AI Accelerators

【速读】:该论文旨在解决人工智能加速器中高功耗问题,特别是针对计算密集型的乘法器(multiplier)进行低功耗优化。传统方法通常独立设计近似乘法器(Approximate Multiplier, AxM)与AI模型训练,导致无法实现整体能效最优。本文提出TRAM方案,其关键在于通过联合优化AxM结构与AI模型参数,在保证精度损失可控的前提下显著降低功耗,实现了端到端的低功耗设计。实验表明,相较于现有最优AxM,TRAM在CNNs上最多降低25.05%的AxM功耗,在视觉Transformer上最多降低27.09%的功耗。

链接: https://arxiv.org/abs/2605.08231
作者: Chang Meng,Hanyu Wang,Yuyang Ye,Mingfei Yu,Wayne Burleson,Giovanni De Micheli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注:

点击查看摘要

Abstract:Reducing power consumption in AI accelerators is increasingly important. Approximate computing can reduce power consumption while keeping the accuracy loss small. Since multipliers are power-hungry components in AI models, this paper focuses on synthesizing low-power approximate multipliers (AxMs). Unlike prior works that design AxMs separately from AI model training, we present TRAM, which jointly optimizes the AxM structure and AI model parameters to lower power with small accuracy loss. Experiments show that compared to state-of-the-art AxMs, TRAM achieves up to 25.05% AxM power reduction on CNNs with CIFAR-10, and reduces power by up to 27.09% on vision transformers with ImageNet.

[AI-398] NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在推理过程中缺乏可靠性和不确定性量化的问题,即模型在输出时难以识别自身不确定的预测,从而导致错误率较高。解决方案的关键在于提出一种名为NoisyCoconut的推理时方法,通过在模型内部表示空间中注入可控噪声,生成多样化的推理路径,并利用这些路径之间的一致性作为置信度信号,使模型能够在不确定时选择不回答(abstention)。该方法无需重新训练模型或修改参数,即可在多个推理基准上实现覆盖率与准确率之间的有效权衡,显著降低错误率(从40–70%降至15%以下),并支持模型在数学推理任务中通过选择性拒绝达到95%以上的准确率。

链接: https://arxiv.org/abs/2605.08221
作者: Michael Jerge,David Evans
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents NoisyCoconut, a novel inference-time method that enhances large language model (LLM) reliability by manipulating internal representations. Unlike fine-tuning methods that require extensive retraining, NoisyCoconut operates directly on model representations during inference and requires no retraining. Rather than training models to reason in latent space, we inject controlled noise into latent trajectories to generate diverse reasoning paths. Agreement among these paths provides a confidence signal, enabling models to abstain when uncertain. We demonstrate that this approach achieves effective coverage-accuracy tradeoffs across multiple reasoning benchmarks without requiring access to training data or modification of model parameters. This approach provides a practical pathway to improving the reliability of LLM outputs while maintaining compatibility with existing models. Our experiments show that unanimous agreement among noise-perturbed paths reduces error rates from 40-70% to below 15%, enabling models to exceed 95% accuracy on mathematical reasoning tasks through selective abstention.
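以下是摘要所述“潜空间噪声扰动 + 路径一致性弃答”流程的一个示意性草图。其中 `model.decode_answer` 为假设的接口(把隐状态映射为离散答案),噪声幅度 `sigma`、路径数与一致性阈值均为示例取值,并非论文原始实现。

```python
import torch

def noisy_consensus(model, hidden, num_paths=8, sigma=0.05, min_agree=1.0):
    """Perturb a latent representation with Gaussian noise, decode one
    answer per perturbed path, and abstain unless the paths agree.
    `model.decode_answer(h)` is a hypothetical hook; sigma and min_agree
    are assumed knobs (min_agree=1.0 means unanimous agreement)."""
    answers = []
    for _ in range(num_paths):
        noisy = hidden + sigma * torch.randn_like(hidden)
        answers.append(model.decode_answer(noisy))
    top, count = max(((a, answers.count(a)) for a in set(answers)),
                     key=lambda x: x[1])
    if count / num_paths >= min_agree:
        return top
    return None  # abstain when the noise-perturbed paths disagree
```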

[AI-399] Bangla-WhisperDiar: Fine-Tuning Whisper and PyAnnote for Bangla Long-Form Speech Recognition and Speaker Diarization

【速读】:该论文旨在解决孟加拉语(Bangla)长音频场景下的自动语音识别(ASR)与说话人聚类(speaker diarization)难题,主要挑战包括长时录音、多变的声学环境以及显著的说话人差异。针对ASR问题,关键解决方案是基于tugstugi bengaliai regional asr whisper medium模型,在约15,000段人工标注且对齐的孟加拉语音频上进行全参数微调,并采用多种数据增强策略(如噪声注入、混响模拟、回声、截断失真及音高/时间扰动)提升鲁棒性;对于说话人聚类问题,核心方法是使用PyTorch Lightning微调pyannote/segmentation-3.0模型,并将其替换为pyannote/speaker-diarization-community-1流水线中的分割模块,同时保留预训练的说话人嵌入和聚类组件。最终ASR系统达到24.41%的词错误率(WER),diarization系统实现23.92%的聚类错误率(DER),均显著优于各自基线模型。

链接: https://arxiv.org/abs/2605.08214
作者: Mohammed Aman Bhuiyan,Md Sazzad Hossain Adib,Samiul Basir Bhuiyan,Amit Chakraborty,Aritra Islam Saswato,Ahmed Faizul Haque Dhrubo,Mohammad Ashrafuzzaman Khan
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 3 figures and 5 tables

点击查看摘要

Abstract:Automatic Speech Recognition (ASR) and speaker diarization in Bangla remain challenging due to long form recordings, diverse acoustic conditions, and significant speaker variability. This work addresses these two core tasks in Bangla spoken language understanding by developing robust systems for long form ASR and speaker diarization. For ASR (Problem 1), we fine tune the tugstugi bengaliai regional asr whisper medium model on a custom-curated dataset of approximately 15,000 chunked and aligned Bangla audio segments, employing full weight training with extensive data augmentation including noise injection, reverb simulation, echo, clipping distortion, and pitch/time perturbation. For speaker diarization (Problem 2), we fine-tune the pyannote/segmentation-3.0 model using PyTorch Lightning on the competition annotated diarization dataset, swapping the fine-tuned segmentation backbone into the pyannote/speaker-diarization-community-1 pipeline while retaining the pretrained speaker embedding and clustering components. Our ASR system achieves a Word Error Rate (WER) of 0.2441, while our diarization system achieves a Diarization Error Rate (DER) of 0.2392, both evaluated on the test set, demonstrating notable improvements over the respective pretrained baselines. We describe our complete pipeline, including data preprocessing, text normalization, audio augmentation, training strategies, inference optimization, and post-processing for both tasks.

[AI-400] Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning ICLR2026

【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning, Offline RL)中因对分布外(Out-of-Distribution, OOD)动作的价值高估而导致的策略性能下降问题。现有方法通常采用统一惩罚机制来抑制OOD样本,但难以准确识别OOD动作,且可能误抑制有益探索。其解决方案的关键在于提出DOSER(Diffusion-based OOD Detection and Selective Regularization)框架:通过训练两个扩散模型分别建模行为策略和状态分布,利用单步去噪重构误差作为可靠的OOD检测指标;在策略优化过程中进一步区分有益与有害的OOD动作,基于预测转移结果选择性地抑制风险动作并鼓励高潜力探索。该方法理论上保证了γ-收缩性,从而获得有界价值估计和渐近最优性能保障。

链接: https://arxiv.org/abs/2605.08202
作者: Qingjun Wang,Hongtu Zhou,Hang Yu,Junqiao Zhao,Yanping Zhao,Chen Ye,Ziqiao Wang,Guang Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures. Accepted to ICLR 2026

点击查看摘要

Abstract:Offline reinforcement learning (RL) faces a critical challenge of overestimating the value of out-of-distribution (OOD) actions. Existing methods mitigate this issue by penalizing unseen samples, yet they fail to accurately identify OOD actions and may suppress beneficial exploration beyond the behavioral support. Although several methods have been proposed to differentiate OOD samples with distinct properties, they typically rely on restrictive assumptions about the data distribution and remain limited in discrimination ability. To address this problem, we propose DOSER (Diffusion-based OOD Detection and Selective Regularization), a novel framework that goes beyond uniform penalization. DOSER trains two diffusion models to capture the behavior policy and state distribution, using single-step denoising reconstruction error as a reliable OOD indicator. During policy optimization, it further distinguishes between beneficial and detrimental OOD actions by evaluating predicted transitions, selectively suppressing risky actions while encouraging exploration of high-potential ones. Theoretically, we prove that DOSER is a \gamma -contraction and therefore admits a unique fixed point with bounded value estimates. We further provide an asymptotic performance guarantee relative to the optimal policy under model approximation and OOD detection errors. Across extensive offline RL benchmarks, DOSER consistently attains superior performance to prior methods, especially on suboptimal datasets.
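下面用伪接口勾勒摘要中“单步去噪重构误差作为 OOD 指标”的思路:先用行为策略扩散模型的前向过程给动作加噪,再做一步反向去噪,用重构误差衡量该动作偏离行为分布的程度。`diffusion.q_sample` 与 `diffusion.denoise_step` 为假设的接口,时间步 `t` 亦为示例取值。

```python
import torch

@torch.no_grad()
def single_step_recon_error(diffusion, state, action, t=10):
    """Single-step denoising reconstruction error as an OOD score:
    high error suggests the action lies outside the behavior-policy
    distribution. Interfaces and `t` are illustrative assumptions."""
    noisy = diffusion.q_sample(action, t)             # forward (noising) step
    recon = diffusion.denoise_step(noisy, state, t)   # one reverse step
    return torch.mean((recon - action) ** 2, dim=-1)  # high error => likely OOD
```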

[AI-401] FairHealth: An Open-Source Python Library for Trustworthy Healthcare AI in Low-Resource Settings

【速读】:该论文旨在解决当前医疗人工智能(Artificial Intelligence, AI)工具包在低资源和低收入国家(Low-Resource and Low-Income Country, LMIC)应用场景中存在的四大关键问题:缺乏对生物信号与临床表格数据的集成公平性审计、缺少兼容标准机器学习(Machine Learning, ML)工作流的隐私保护联邦学习(Federated Learning, FL)工具、缺乏适用于低带宽环境的可解释性工具,以及全球南方(Global South)医疗数据集的缺失。解决方案的核心在于构建一个开源的Python库FairHealth,其整合了来自五项同行评审研究的六个模块,涵盖同态加密支持的联邦学习、交叉公平性度量、混合模糊-SHAP可解释性、多语言登革热分诊、公平灾害援助分配及公共数据集加载器,并确保所有数据无需机构数据使用协议即可获取,从而实现可信医疗AI在LMIC场景下的落地应用。

链接: https://arxiv.org/abs/2605.08198
作者: Farjana Yesmin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 8 pages, open-source Python library

点击查看摘要

Abstract:We present FairHealth, an open-source Python library that provides a unified, modular framework for trustworthy machine learning in healthcare applications, with particular focus on low-resource and low-income country (LMIC) settings such as Bangladesh. FairHealth addresses four critical gaps in existing healthcare AI toolkits: (1) the absence of integrated fairness auditing for biosignals and clinical tabular data; (2) the lack of privacy-preserving federated learning tools compatible with standard ML workflows; (3) missing explainability tools tailored for low-bandwidth clinical decision support; and (4) no existing toolkit covering Global South healthcare datasets. Built from five peer-reviewed research contributions, FairHealth provides six modules covering federated learning with homomorphic encryption (this http URL), intersectional fairness metrics (this http URL), hybrid fuzzy-SHAP explainability (this http URL), multilingual dengue triage (this http URL), equitable disaster aid allocation (this http URL), and public dataset loaders (this http URL). All datasets used are publicly available without institutional data use agreements. FairHealth is installable via pip install fairhealth(PyPI: this http URL) and available at this https URL.

[AI-402] ReplaySCM: A Benchmark for Executable Causal Mechanism Induction from Interventions

【速读】:该论文旨在解决当前语言模型在因果推理任务中缺乏对可执行因果机制(executable causal mechanism)从有限干预证据中进行归纳的能力评估问题。现有基准多聚焦于局部答案或图结构的评分,而忽视了机制行为的一致性与泛化能力。其解决方案的关键在于提出ReplaySCM——一个包含1,300个项目的可执行因果机制诱导基准,每个项目基于潜在的全观测、无环布尔结构因果模型(Boolean structural causal model, SCM)生成二值世界;系统需输出受限布尔领域特定语言(DSL)表示的机制映射,并通过重放验证其在训练与保留干预世界中的行为一致性,而非仅比较公式字符串。此设计确保语义正确但语法不同的机制获得合理评分,同时引入Ordered、Block-order、Hidden-order和Hidden-roots等设置以考察结构信息暴露程度的影响,并结合支持审计阶梯(Support-Audit Ladder)提升前驱模式覆盖率至1.0,从而更严格地检验因果机制的可发现性与唯一性。

链接: https://arxiv.org/abs/2605.08197
作者: Serafim Batzoglou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Most causal benchmarks for language models score local answers or graph structure. We introduce ReplaySCM, a 1,300 item benchmark for executable causal mechanism induction from finite interventional evidence. Each item contains binary worlds generated by a latent fully observed acyclic Boolean structural causal model (SCM). A system must output a mechanism map in a restricted Boolean DSL; the submission is parsed, checked for legality and acyclicity, and replayed on training and held-out intervention worlds. Scoring uses replay behavior rather than formula strings, so syntactically different mechanisms receive credit when they behave correctly. ReplaySCM varies the structural information disclosed to the model through Ordered, Block-order, Hidden-order, and Hidden-roots settings, and includes Alternative-SCM tasks that supply a valid reference SCM and ask for a semantically distinct alternative that fits the training worlds, together with a separating intervention and witness. Frontier LLMs infer parts of the functional-parent structure, but held-out replay drops sharply when order or root structure is hidden. We also evaluate a matched support-audit ladder: Original, Extra Worlds, and Counterexample Audit (CEx), that raises mean local predecessor-pattern coverage from 0.8949 to 0.9815 to 1.0; under the audited searches, no discovered semantic alternative remains consistent with the training worlds. The Ordered/Hidden-order gap persists under this stronger evidence. ReplaySCM complements answer-level causal reasoning and graph-discovery benchmarks by evaluating executable replay generalization from finite interventional evidence, without claiming unique identification of the latent SCM.
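下面给出“重放式评分”(replay scoring)思想的一个极简示意:在每个给定根节点取值与干预的世界上执行提交的机制映射,并按行为一致性计分,而不是比较公式字符串。字段名(`values`、`do`)与函数签名均为说明性假设,基准的实际评分协议以论文为准。

```python
def replay_accuracy(mechanisms, worlds, topo_order, roots):
    """Replay a submitted Boolean mechanism map on training / held-out
    intervention worlds and return the fraction reproduced exactly.

    mechanisms : dict var -> callable(assignment dict) -> bool (DSL output)
    worlds     : list of dicts with keys 'values' (observed bits) and
                 'do' (intervened variable -> forced bit); names assumed
    topo_order : acyclic evaluation order over all variables
    roots      : variables whose values are read off each world directly
    """
    hits = 0
    for w in worlds:
        sim = {r: w["values"][r] for r in roots}   # roots come from the world
        sim.update(w.get("do", {}))                # interventions override mechanisms
        for v in topo_order:
            if v not in sim:
                sim[v] = mechanisms[v](sim)
        hits += all(sim[v] == w["values"][v] for v in topo_order)
    return hits / len(worlds)
```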

[AI-403] NeurIPS Should Require Reproducibility Standards for Frontier AI Safety Claims

【速读】:该论文旨在解决当前前沿人工智能(AI)安全声明的可复现性危机问题,即关键的安全评估结论因核心数据和模型细节被隐瞒而难以验证,导致治理决策与公众信任建立在不可靠证据之上。其解决方案的关键在于提出一个三层次披露框架(公共、受控、受限披露),并配套强制性的声明清单、范围说明及分阶段实施路径,将非公开内容纳入受控审查机制(如联邦化合格安全评审机构),同时明确区分“应公开”与“应保密”的边界,确保最重大的安全主张获得与最低限度要求同等严格的评价标准。

链接: https://arxiv.org/abs/2605.08192
作者: Varad Vishwarupe,Nigel Shadbolt,Marina Jirotka,Ivan Flechais
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: Preprint

点击查看摘要

Abstract:Frontier AI safety claims - published assertions that a highly capable general-purpose model is below a threshold of concern, adequately mitigated, or suitable for release - increasingly shape model deployment, governance, and public trust. Yet the artefacts needed to evaluate them are routinely withheld, producing an evidential inversion: the most consequential claims in AI safety are often the least reproducible. This position paper argues that NeurIPS should require reproducibility standards for papers making such claims, treating non-reproducibility not as a transparency preference but as an evaluation-methodology failure. The 2026 International AI Safety Report [Bengio et al., 2026] concludes that reliable pre-deployment safety testing has become harder to conduct and that models now distinguish test from deployment contexts; the 2025 Foundation Model Transparency Index [Wan et al., 2025] reports a sector-average transparency score of 40/100 with no major developer adequately disclosing train-test overlap; contemporaneous measurement-theory work shows that attack-success-rate comparisons across systems are often founded on low-validity measurements [Chouldechova et al., 2025]. We propose a three-tier disclosure framework, distinguishing public, controlled, and claim-restricted disclosure, paired with a mandatory claim inventory, scope statements, and a phased implementation path with graduated sanctions. The framework treats secrecy and openness as endpoints of a spectrum, with controlled review (via a federated colloquium of qualified secure-review hosts) covering claims whose artefacts cannot be released publicly, and right-scaling claims whose artefacts cannot be reviewed even confidentially. The standard the community applies to its most consequential claims should be at least as high as the standard it applies to its least.

[AI-404] From Ontology Conformance to Admissible Reconfiguration: A RoSO/SMGI Adequacy Argument for Robotic Service Governance

【速读】:该论文旨在解决服务机器人领域中,当服务在运行时被重新绑定、重组、修复或重新部署后,如何确保其配置仍能保持为同一受保护服务的合法实现这一问题。传统本体论(ontology)仅能验证静态合规性,无法应对动态变更下的语义一致性挑战。解决方案的关键在于引入结构化通用智能模型(Structural Model of General Intelligence, SMGI),通过其结构接口 $\theta$、诱导的行为语义 $T_\theta$ 以及规范尊重的治理机制,将RoSO(Robotic Service Ontology)嵌入为一个可动态治理的类型化语义层,从而实现服务描述从“语法正确”到“行为可管”的跃迁。此方法不仅提供了RoSO到SMGI的适配定理,还定义了身份保持的重构准则和局部更新全局可接受的组合条件,为服务在运行时的合法性演化提供了形式化保障。

链接: https://arxiv.org/abs/2605.08185
作者: Aomar Osmani
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 26 pages

点击查看摘要

Abstract:The Robotic Service Ontology (RoSO) gives service robotics a typed semantic vocabulary for services, functions, interactions, and deployment-sensitive constraints. Its public revision trail makes visible a harder question than ontology conformance alone can settle: once a service is rebound, recomposed, repaired, or redeployed, under what conditions does the resulting configuration remain an admissible realization of the same protected service? This article argues that the Structural Model of General Intelligence (SMGI) is relevant exactly at that level (Osmani, 2026). SMGI adds not only a structural interface \theta , but an induced behavioral semantics T_\theta and a governance discipline for norm-respecting change. We show that RoSO can be embedded into SMGI as a typed semantic layer, so that service descriptions become dynamically governable rather than merely well formed. This yields a RoSO-to-SMGI adequacy theorem, identity-preserving reconfiguration criteria, and compositional conditions under which locally acceptable updates remain globally admissible. The resulting claim is not that SMGI replaces RoSO, but that it provides a formal account of what admissible runtime change requires once service semantics must survive revision.

[AI-405] Quantile Geometry Regularization for Distributional Reinforcement Learning

【速读】:该论文旨在解决基于分位数的分布强化学习方法(Quantile-based Distributional Reinforcement Learning)中因Bootstrap目标分位数导致的分布估计退化或失真问题。其解决方案的关键在于提出一种轻量级的Wasserstein分布鲁棒增强框架——RQIQN,通过将IQN损失重新解释为一组局部经验分位数估计问题,并对每个局部分位数槽位引入Wasserstein分布鲁棒分位数估计公式,从而得到一个闭式、分数依赖的贝尔曼目标修正项。该修正项直接缓解了分布退化:其中位数反对称性保持风险中性分位数平均不变,而单调性则扩大上下分位数间距,抑制分布坍缩,且无需改变原有价值目标或重构样本集即可实现分位数几何正则化。

链接: https://arxiv.org/abs/2605.08182
作者: Zhaofan Zhang,Minghao Yang,Rufeng Chen,Sihong Xie,Hui Xiong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Quantile-based distributional reinforcement learning methods learn return distributions through sampled quantile regression, but their bootstrapped target quantiles may induce distorted or degenerate distribution estimates. We propose Robust Quantile-based Implicit Quantile Networks (RQIQN), a lightweight Wasserstein distributionally robust enhancement boosted from a quantile estimation perspective. We first reinterpret a snapshot of IQN loss as a collection of local empirical quantile estimation problems over sampled current fractions. We then robustify each local slot with a Wasserstein distributionally robust quantile estimation formulation, yielding a closed-form, fraction-dependent correction to the Bellman target. This correction directly addresses distributional degeneration: its median antisymmetry preserves the risk-neutral quantile average, while its monotonicity enlarges upper-lower quantile gaps and counteracts collapsed distributional spread. RQIQN thus regularizes quantile geometry without changing the underlying value objective or requiring additional sample set reconstruction. Finally, we empirically show that the proposed RQIQN outperforms other existing quantile-based distributional reinforcement learning algorithms in risk-sensitive navigation and Atari games.

[AI-406] Generalized Category Discovery in Federated Graph Learning

【速读】:该论文旨在解决联邦图学习(Federated Graph Learning, FGL)在动态环境中难以应对新类别持续涌现的问题,提出联邦图广义类别发现(Federated Graph Generalized Category Discovery, FGGCD)这一实际场景,目标是在保持已知类别知识的同时,协同发现分布式图客户端中的 novel categories。其核心挑战包括:(1) 邻居吸收效应(Neighborhood Absorption Effect),即结构碎片化导致邻居聚合偏差,使新节点被误判为已知类别;(2) 全局语义不一致(Global Semantic Inconsistency),局部偏差通过服务器聚合并在异构子图分布下放大,阻碍跨客户端知识融合。解决方案的关键在于:在客户端引入拓扑可靠语义对齐与发现机制(Topology-Reliable Semantic Alignment and Discovery),缓解邻居吸收效应;在服务器端采用分层原型对齐策略(Hierarchical Prototype Alignment),消除全局语义不一致,从而实现高效、稳定的联邦广义类别发现。

链接: https://arxiv.org/abs/2605.08178
作者: Zhongzheng Yuan,Lianshuai Guo,Xunkai Li,Wenyu Wang,Meixia Qu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated Graph Learning (FGL) enables collaborative learning over distributed graph data, yet existing approaches largely rely on a closed-world assumption, limiting their applicability in dynamic environments where novel categories continuously emerge. To bridge this gap, we target the practical scenario of Federated Graph Generalized Category Discovery (FGGCD), aiming to collaboratively discover novel categories across decentralized graph clients while retaining knowledge of known categories. We observe that FGGCD introduces two fundamental challenges: (1) the Neighborhood Absorption Effect, where structural fragmentation leads to biased neighborhood aggregation, causing novel nodes to be misclassified as known categories; and (2) Global Semantic Inconsistency, where the aforementioned local biases propagate to the server and are amplified by heterogeneous subgraph distributions, hindering cross-client knowledge integration. To address these issues, we propose GCD-FGL, an FGL framework for GCD that integrates a client-side Topology-Reliable Semantic Alignment and Discovery process to mitigate the neighborhood absorption effect, and a server-side Hierarchical Prototype Alignment strategy to resolve global semantic inconsistency. Extensive experiments on five real-world graph datasets demonstrate that GCD-FGL consistently outperforms state-of-the-art baselines, achieving an average absolute gain of +4.86 in HRScore.

[AI-407] Echo-LoRA: Parameter-Efficient Fine-Tuning via Cross-Layer Representation Injection

【速读】:该论文旨在解决当前参数高效微调(Parameter-efficient fine-tuning, PEFT)方法中,如LoRA类方法在训练过程中仅利用浅层权重更新而忽略深层中间表示(intermediate representations)所导致的性能瓶颈问题。其核心解决方案是提出Echo-LoRA,通过跨层表示注入机制,在训练阶段从深层源层收集边界隐藏状态(boundary hidden states),聚合为样本级的“回声”表示(echo representation),并借助轻量级投影与门控网络将其注入浅层LoRA或DoRA模块中;同时采用答案掩码(answer-only masking)、掩码蒸馏(masked distillation)和随机路由(stochastic routing)策略稳定辅助路径,缩小训练与推理差异。该方法在不增加部署时参数或计算开销的前提下显著提升了模型在常识推理任务上的表现。

链接: https://arxiv.org/abs/2605.08177
作者: Yihang Peng,Peng Jin,Jie Gong,Xingyuan Chen,Lingjiao Xu,Ning Su,Yan Ran
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Parameter-efficient fine-tuning (PEFT) has become a practical route for adapting large language models to downstream tasks, with LoRA-style methods being particularly attractive because they are inexpensive to train and easy to deploy. Most LoRA variants, however, revise the update rule within the weight space of each layer and leave the intermediate representations formed by deeper layers largely unused. We propose Echo-LoRA, a cross-layer representation injection method for parameter-efficient fine-tuning. During training, Echo-LoRA collects boundary hidden states from deeper source layers, aggregates them into a sample-level echo representation, and uses lightweight projection and gating networks to inject the resulting signal into shallow LoRA or DoRA modules. Answer-only masking, masked distillation, and stochastic routing are used to keep this auxiliary path stable and to reduce the gap between training and inference. On eight commonsense reasoning benchmarks, Echo-LoRA exceeds the reported LoRA baselines by 5.7 percentage points on average across LLaMA-7B, LLaMA2-7B, and LLaMA3-8B. Under reproduced LoRA baselines in our unified implementation, the average gain is 3.0 points; when combined with DoRA, the gain is 2.7 points. The Echo path is discarded after training, so the deployed model keeps the original low-rank LoRA/DoRA form and adds neither inference-time parameters nor inference computation.
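下面是摘要所述“深层边界隐状态 → 样本级 echo 表征 → 投影与门控注入浅层”这一机制的极简 PyTorch 草图。维度设置、mean 池化与门控形式均为示意性假设,仅用于说明跨层注入的形态;按论文的做法,训练结束后该辅助路径即可丢弃。

```python
import torch
import torch.nn as nn

class EchoGate(nn.Module):
    """Sketch of cross-layer injection: a deep-layer 'echo' summary is
    projected and gated into a shallow hidden state during training only.
    Pooling and gating choices are assumptions, not the paper's exact design."""
    def __init__(self, deep_dim, shallow_dim):
        super().__init__()
        self.proj = nn.Linear(deep_dim, shallow_dim)
        self.gate = nn.Sequential(nn.Linear(shallow_dim, shallow_dim), nn.Sigmoid())

    def forward(self, shallow_hidden, deep_boundary_states):
        echo = deep_boundary_states.mean(dim=1)          # sample-level summary
        echo = self.proj(echo).unsqueeze(1)              # broadcast over tokens
        return shallow_hidden + self.gate(echo) * echo   # gated residual injection

# illustrative shapes: (batch, seq, hidden)
gate = EchoGate(deep_dim=4096, shallow_dim=4096)
out = gate(torch.randn(2, 16, 4096), torch.randn(2, 16, 4096))
```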

[AI-408] Communication Dynamics Neural Networks: FFT-Diagonalized Layers for Improved Hessian Conditioning at Reduced Parameter Count

【速读】:该论文旨在解决神经网络训练中参数效率低与优化困难的问题,特别是针对传统密集层(dense layer)在高维输入下参数量大、Hessian条件数高导致的收敛慢和泛化差等挑战。其核心解决方案是引入通信动力学(Communication Dynamics, CD)框架下的块循环线性层(CDLinear),该层基于块大小为 $ B = 2l+1 $ 的循环矩阵结构设计,通过离散傅里叶变换(Discrete Fourier Transform, DFT)实现权重空间的谱对角化:一方面使均方损失关于权重的Hessian矩阵可被DFT对角化,且特征值直接由输入统计特性决定;另一方面,在输入预白化条件下,理论证明群体Hessian条件数精确为1,经验条件数仅受样本数量影响,从而显著改善优化稳定性。实验证明,使用CDLinear构建的多层感知机(MLP)在仅需3.8倍少参数的情况下,准确率损失控制在0.65%以内,且Hessian条件数降低至原密集模型的约1/310,验证了该方法在保持性能的同时大幅提升参数效率与优化鲁棒性。

链接: https://arxiv.org/abs/2605.08171
作者: Lurong Pan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Background and motivation. The Communication Dynamics (CD) framework, introduced in two earlier papers for atomic-energy prediction and field-induced superconductivity, treats each physical channel as a (2l+1)-vertex polygon whose discrete Fourier transform yields its energy spectrum. This paper applies the same circulant-spectral machinery to neural-network design. Layer construction. CDLinear is a block-circulant linear layer with block size B = 2l+1 and 1/B the parameter count of a dense layer of equal input/output dimensions. Three properties follow from the construction. (i) The Hessian of mean-squared loss with respect to the weights is diagonalized by the discrete Fourier transform, with eigenvalues |FXj|^2 read directly from the input statistics (Theorem 1). (ii) Under input pre-whitening, the population Hessian condition number satisfies kappa = 1 exactly, with the empirical condition number bounded by 1+O(sqrt(B/N)) on N samples (Theorem 2). (iii) The Shannon noise rate alpha_CD = 0.0118 calibrated in the parent CD papers from the Na D-doublet specifies a transferable, non-arbitrary dropout rate. Empirical evaluation. A CDLinear MLP at B = 4 achieves 97.50% +/- 0.23% test accuracy with 2,380 parameters versus 98.15% +/- 0.47% for a parameter-matched dense MLP at 8,970 parameters, a 3.8x parameter reduction at 0.65% accuracy cost, within one standard deviation of the seed-to-seed spread. The CD-MLP mean Hessian condition number kappa = 1.9x10^4 is 310x smaller than the dense baseline kappa = 5.9x10^6, in quantitative agreement with Theorem 2.
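下面给出块循环线性层(block-circulant linear layer)借助 FFT 做谱对角化乘法的一个极简 PyTorch 草图:参数量为同尺寸稠密层的 1/B,每个循环块的乘法通过频域逐点相乘完成。具体参数化与初始化方式为示意性假设,可能与论文中的 CDLinear 实现不同。

```python
import torch

class CDLinearSketch(torch.nn.Module):
    """Block-circulant linear layer sketch: the weight is a grid of
    circulant blocks of size B, so the parameter count is in*out/B.
    circ(c) x equals a circular convolution, computed via FFT."""
    def __init__(self, in_features, out_features, B):
        super().__init__()
        assert in_features % B == 0 and out_features % B == 0
        self.B = B
        # first column of each circulant block: (out_blocks, in_blocks, B)
        self.blocks = torch.nn.Parameter(
            0.01 * torch.randn(out_features // B, in_features // B, B))

    def forward(self, x):                            # x: (batch, in_features)
        xb = x.reshape(x.shape[0], -1, self.B)       # (batch, in_blocks, B)
        Xf = torch.fft.fft(xb)                       # diagonalize each block
        Wf = torch.fft.fft(self.blocks)
        Yf = torch.einsum("oib,nib->nob", Wf, Xf)    # sum over input blocks
        return torch.fft.ifft(Yf).real.reshape(x.shape[0], -1)

layer = CDLinearSketch(16, 8, B=4)   # 16*8/4 = 32 parameters instead of 128
y = layer(torch.randn(2, 16))
```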

[AI-409] Understanding Asynchronous Inference Methods for Vision-Language-Action Models

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人控制中因推理延迟导致的动作执行与观测不同步问题,即“观察过时”(observation staleness)。其关键解决方案在于系统性地比较四种不同机制:基于推理时动作修补(inpainting)的IT-RTC、训练时延迟模拟(TT-RTC)、未来状态感知条件建模(VLASH)以及轻量级残差修正(A2C2)。其中,A2C2通过每步残差修正策略在Kinetix和LIBERO基准上均展现出最优性能,尤其在高延迟场景下表现稳定;TT-RTC则因其训练阶段引入延迟模拟而具备最强鲁棒性,且无额外推理开销,成为最稳定的训练驱动方法。

链接: https://arxiv.org/abs/2605.08168
作者: Ayoub Agouzoul
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models offer a promising path to generalist robot control, but their inference latency causes observation staleness when generated actions are executed asynchronously. Several methods have been proposed concurrently to mitigate this problem: inference-time inpainting (IT-RTC), training-time delay simulation (TT-RTC), future-state-aware conditioning (VLASH), and lightweight residual correction (A2C2). Each takes a fundamentally different approach, but they have so far been evaluated independently with different codebases, base policies, and protocols. We present a systematic comparison of these four methods under controlled conditions. We develop two unified codebases that integrate all methods with harmonized library and dataset versions, and we benchmark them on the Kinetix suite with MLPMixer policies and on the LIBERO manipulation benchmark with SmolVLA, sweeping inference delays up to d=20 control steps. A2C2’s per-step residual correction is the most effective method on Kinetix, holding above 90% solve rate up to d=8 , and also leads on LIBERO from d=4 onwards. IT-RTC is competitive at low delays but degrades sharply under long chunks ( H=30 ) and high delays. TT-RTC is the most robust training-based method: stable across d_\max choices, generalizes beyond its training delay distribution, and adds zero inference overhead. VLASH exhibits a clear low-delay vs. high-delay trade-off governed by the fine-tuning delay range [0,d_\max] . Code is available at this https URL

[AI-410] parHSOM: A novel parallel Hierarchical Self-Organizing Map implementation

【速读】:该论文旨在解决传统层次自组织映射(Hierarchical Self-Organizing Maps, HSOM)在大规模网络安全数据集上训练速度慢的问题,其关键解决方案是提出一种并行化架构——parHSOM,通过引入并行计算机制显著缩短HSOM的训练时间,同时保持检测性能稳定,实验证明该方法在多个数据集和配置下均实现了更快的训练效率且无明显性能损失。

链接: https://arxiv.org/abs/2605.08164
作者: Rebekah Lane,Logan Cummins,Andy Perkins,George Trawick,Ioana Banicescu,Sudip Mittal
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:The digital age has completely transformed the way that information is processed and stored, which makes cybersecurity a crucial field of research. Cybersecurity contains many different domains, but this work focuses on Intrusion Detection Systems (IDSs). Within the literature, Hierarchical Self-Organizing Maps (HSOMs) have been used to create trustworthy, explainable, and AI-based IDSs. However, HSOMs are trained sequentially, which means that training HSOMs on large datasets is slow. This work presents a novel parallel HSOM architecture, called parHSOM. The purpose of this research is to investigate the effect that parallel computation has on the HSOM training time. parHSOM is tested on two different testbeds, four different output grid sizes, and five different cybersecurity datasets. Performance metrics collected from these experiments show that parHSOM consistently trains faster than the Sequential HSOM algorithm without any significant loss in performance. Additionally, this work provides a platform for further investigation into parallel HSOM implementations.

[AI-411] Privacy-Preserving Federated Learning: Integrating Zero-Knowledge Proofs in Scalable Distributed Architectures

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在分布式边缘网络中面临的两大核心问题:一是模型中毒攻击(model poisoning attacks),即恶意节点通过篡改梯度更新破坏全局模型训练;二是聚合层的计算瓶颈,导致系统扩展性受限。解决方案的关键在于提出一种端到端的分布式架构,融合高级密码学验证与优化的大数据处理框架:首先引入零知识证明(Zero-Knowledge Proof, ZKP)封装器,在不查看原始梯度的前提下对节点计算进行密码学验证,从而有效抵御模型中毒攻击;其次将机器学习损失函数转化为适用于简洁验证的秩-1约束系统(Rank-1 Constraint Systems, R1CS),实现高效且可验证的梯度聚合过程。实验表明,该方案在1000个并行分布式节点下仍能保持94.2%的准确率,显著提升了FL系统的安全性与可扩展性。

链接: https://arxiv.org/abs/2605.08152
作者: Divya Gupta
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The intersection of Artificial Intelligence (AI) and distributed systems has given rise to Federated Learning (FL), a paradigm that enables decentralized model training without compromising local data privacy. As organizational data silos grow, deploying complex machine learning models across highly distributed edge networks becomes a critical infrastructural challenge. Standard FL implementations suffer from severe vulnerabilities related to adversarial gradient updates and computational bottlenecks at the aggregation layer. This paper presents a novel, end-to-end distributed architecture that hardens FL pipelines using advanced cryptographic verification and optimized big data processing frameworks. We introduce a Zero-Knowledge Proof (ZKP) wrapper that cryptographically validates node computations before global aggregation, neutralizing model poisoning attacks without inspecting raw gradients. Additionally, we evaluate the system’s performance using extreme gradient boosting models optimized for distributed edge execution. We formalize the mathematical transformation of the machine learning loss functions into Rank-1 Constraint Systems (R1CS) suitable for succinct verification. Extensive experimental results demonstrate that our hybrid architecture achieves a 94.2% accuracy retention under adversarial conditions while maintaining scalable throughput across 1,000 parallel distributed nodes, effectively bridging the gap between rigorous cryptographic security and high-performance distributed AI.

[AI-412] SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference

【速读】:该论文旨在解决大模型服务在多模型云系统中因用户请求分布呈长尾特性而导致的资源利用率不均问题:即少数热门大模型负载过重,而大量冷门小模型(tail models)则长期处于低利用率状态。解决方案的核心是提出SPECTRE框架,通过**推测解码(speculative decoding)**技术,将未被充分利用的小模型作为远程草稿生成器(remote drafter),为高负载的大模型提供并行化的草稿生成与目标端验证能力。其关键创新在于三项机制:1)基于吞吐量分析设计阈值引导的混合普通-并行推测解码策略;2)推测优先调度策略以维持多租户场景下的草稿-目标重叠;3)草稿端提示压缩技术降低草稿延迟。实验证明,SPECTRE在显著提升大模型服务吞吐量的同时,对小模型原生工作负载干扰极小,最大可实现2.28倍于自回归解码的速度提升,并较最优推测解码基线再提升66%。

链接: https://arxiv.org/abs/2605.08151
作者: Jincheng Xie,Yawen Ling,Qi Xiao,Feiyu Zhang,Zhongyi Huang,Wen Hu,Yu Zheng
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM serving platforms are increasingly deployed as multi-model cloud systems, where user demand is often long-tailed: a few popular large models receive most requests, while many smaller tail models remain underutilized. We propose \textbfSPECTRE (Parallel \textbfSPECulative Decoding with a Multi-\textbfTenant \textbfREmote Drafter), a serving framework that reuses underutilized tail-model services as remote drafters for heavily loaded large-model services through speculative decoding. SPECTRE enables draft generation and target-side verification to run in parallel, and makes such parallelism effective through three techniques: a hybrid ordinary-parallel speculative decoding strategy guided by a threshold derived from throughput analysis, speculative priority scheduling to preserve draft–target overlap under multi-tenant traffic, and draft-side prompt compression to reduce draft latency. We implement SPECTRE in \textttSGLang and evaluate it across multiple draft–target model pairs, reasoning benchmarks, real-world long-context workloads, and a wide range of batch sizes. Results show that SPECTRE consistently improves large-model serving throughput while causing only minor interference to the native workloads of tail-model services. In large-model deployments, including Qwen3-235B-A22B with TP=8, SPECTRE achieves up to \textbf2.28 \times speedup over autoregressive decoding and up to an additional \textbf66% relative improvement over the strongest speculative decoding baselines. Talk is cheap, we show you the code: this https URL.

[AI-413] HoReN: Normalized Hopfield Retrieval for Large-Scale Sequential Model Editing

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在持续学习场景下知识更新的难题:模型部署后会因事实过时或错误而失效,但重新训练成本过高,因此需要一种高效、可扩展且不破坏原有知识的模型编辑方法。现有方法分为两类——直接修改基础权重的“定位-编辑”策略易引发知识干扰,而依赖外部记忆的路由机制在大规模编辑时性能下降明显。论文提出HoReN,其核心创新在于构建一个基于代码本(codebook)的参数保持型编辑器,关键机制包括三点:(1) 将单层MLP与离散键值代码本结合,使每个条目同时作为知识记忆键和现代Hopfield存储模式;(2) 通过投影到单位超球面实现角度相似性驱动的检索,消除编辑提示与其改写版本之间的幅度失配;(3) 利用阻尼Hopfield吸引子动力学对查询进行精炼,使同义表达收敛至正确存储模式的吸引盆,而无关查询不受扰动。该方案实现了跨标准ZsRE、结构化WikiBigEdit及非结构化UnKE基准的稳定编辑性能,并支持高达5万次连续编辑仍保持整体性能高于0.9,显著优于先前方法。

链接: https://arxiv.org/abs/2605.08143
作者: Yuan Fang,Yi Xie,Xuming Ran
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 30 pages, 10 figures

点击查看摘要

Abstract:Large language models encode vast factual knowledge that inevitably becomes outdated or incorrect after deployment, yet retraining is costly prohibitive, motivating model editing in lifelong settings that updates targeted behavior without harming the rest of the model. One line of work installs new facts by directly modifying base weights through locate-then-edit procedures, but accumulated edits progressively disrupt originally preserved knowledge, even with constraint-based projections. A complementary line leaves base weights intact and routes edits through external memory, but it faces routing challenges and its performance degrades at scale. We propose HoReN, a codebook-based parameter-preserving editor with enhanced routing built on three ideas. First, HoReN wraps a single MLP layer with a discrete key-value codebook, where each entry is interpreted simultaneously as a knowledge-memory key and a modern Hopfield stored pattern. Second, both keys and queries are projected onto the unit hypersphere so retrieval is governed by angular similarity, removing magnitude-driven mismatches between an edit prompt and its rephrasings. Third, the query is refined through damped Hopfield attractor dynamics, so paraphrases relax into the correct stored pattern’s basin of attraction while unrelated queries remain undisturbed. HoReN achieves well-edited performance with consistent gains across diverse benchmarks spanning standard ZsRE, structured WikiBigEdit, and unstructured UnKE evaluations. Moreover, HoReN scales to 50K sequential edits on ZsRE with stable overall performance above 0.9, while prior editors collapse or degrade severely before reaching 10K. Our code is available at this https URL.
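摘要中的检索机制可以概括为:键与查询先投影到单位超球面、按角度相似度检索,再用带阻尼的 Hopfield 吸引子动力学对查询做若干步精炼,使改写后的提示收敛到正确存储模式的吸引盆。下面是一个极简 PyTorch 草图,`beta`、阻尼系数与迭代步数均为示例取值。

```python
import torch
import torch.nn.functional as F

def hopfield_retrieve(query, keys, beta=8.0, damping=0.5, steps=3):
    """Angular, damped modern-Hopfield retrieval sketch.
    query: (d,)   keys: (num_entries, d)   returns index of retrieved entry."""
    q = F.normalize(query, dim=-1)
    k = F.normalize(keys, dim=-1)
    for _ in range(steps):
        attn = torch.softmax(beta * (k @ q), dim=-1)     # angular similarity
        target = F.normalize(attn @ k, dim=-1)           # Hopfield update direction
        q = F.normalize((1 - damping) * q + damping * target, dim=-1)
    return torch.argmax(k @ q)
```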

[AI-414] Intelligent Autonomous Orchestration for Distributed Cloud Resources using Complex-Stability Analysis

【速读】:该论文旨在解决现代分布式云环境中因网络延迟导致的传统弹性伸缩机制易引发云抖动(cloud thrashing)的问题,从而实现高效且稳定的资源分配。其解决方案的关键在于提出一种名为C-SAS(Complex-Stability Aware Scaling)的智能自主编排框架,该框架利用复分析方法将遥测噪声转化为s-平面中的确定性“安全边界”(Safety Envelope),通过引入Argument Principle和Rouché定理计算实时的解析稳定性指数(Analytic Stability Index, ASI),从而智能抑制可能导致性能下降的振荡式伸缩操作。实验表明,C-SAS可将虚拟机频繁切换(VM flapping)减少94%,并实现96%的资源利用率,显著优于标准PID及基于机器学习的自治代理。

链接: https://arxiv.org/abs/2605.08139
作者: Gopal Krishna Shyam,Priyanka Bharti
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 7 pages

点击查看摘要

Abstract:In modern distributed cloud environments, efficient resource allocation is required as traditional scaling mechanisms are often subject to cloud thrashing due to network-induced latencies. In this paper, we propose C-SAS (Complex-Stability Aware Scaling), an intelligent autonomous orchestration framework that leverages complex analytic methods to achieve system-wide equilibrium. In contrast to heuristic-based models, C-SAS acts as a stability-aware agent, converting telemetry noise into a deterministic “Safety Envelope” on the s -plane using the Argument Principle and Rouché’s Theorem. The algorithm smartly suppresses oscillatory scaling operations that would otherwise degrade performance, by computing a real-time Analytic Stability Index (ASI). The experimental results show that C-SAS reduces VM flapping by 94%, and achieves 96% resource efficiency, significantly outperforming standard PID and ML-based autonomous agents. Our results suggest that future resilient autonomous cloud infrastructures will require AI-driven orchestrators with built-in formal stability constraints.

[AI-415] Weight Pruning Amplifies Bias: A Multi-Method Study of Compressed LLM s for Edge AI

【速读】:该论文旨在解决权重剪枝(Weight Pruning)在资源受限的物联网(IoT)和边缘设备上部署大语言模型(Large Language Models, LLMs)时,对模型公平性(model fairness)影响尚不明确的问题。研究发现,尽管激活感知剪枝(如Wanda方法)能较好地维持语言建模能力(困惑度仅增加3.5%),却显著放大偏见(Stereotype Reliance Score提升83.7%),且导致47–59%原本无偏的样本出现新的刻板行为;而随机剪枝虽严重破坏语言能力(困惑度达10⁸),但仅产生随机偏见。关键结论是:仅依赖困惑度评估无法反映行为等效性,必须引入偏见感知验证机制,才能保障剪枝后模型在边缘部署中的对齐安全性(alignment safety)。

链接: https://arxiv.org/abs/2605.08137
作者: Plawan Kumar Rath,Rahul Maliakkal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 8 pages, 7 figures, 8 tables. Accepted at the 7th Annual World AIIoT Congress (AIIoT 2026). This is the author’s accepted version; the version of record will appear in IEEE Xplore

点击查看摘要

Abstract:Weight pruning is widely advocated for deploying Large Language Models on resource-constrained IoT and edge devices, yet its impact on model fairness remains poorly understood. We conduct a controlled empirical study of three instruction-tuned models (Gemma-2-9b-it, Mistral-7B-Instruct-v0.3, Phi-3.5-mini-instruct) across three pruning methods (Random, Magnitude, Wanda) at four sparsity levels (10-70%) on 12,148 BBQ bias benchmark items with 5 random seeds, totaling 2,368,860 inference records. Our results reveal a Smart Pruning Paradox: activation-aware pruning (Wanda) preserves perplexity nearly perfectly (just 3.5% increase at 50% sparsity for Mistral-7B), yet produces the highest bias amplification, with Stereotype Reliance Score increasing 83.7% and 47-59% of previously unbiased items developing new stereotypical behaviors at 70% sparsity. Random pruning destroys language capability entirely (perplexity exceeding 10^4 and reaching 10^8 ) but produces only random-chance bias. We further show that unstructured pruning provides zero storage savings and zero inference latency reduction on real edge hardware, undermining the primary motivation for its use in IoT deployment. Of 180 dense-vs-pruned comparisons, 141 (78.3%) are significant ( p 0.05 ) with mean |h| = 0.305 . Published quantization studies report up to 21% of responses flipping between biased and unbiased states; our pruning results show transition rates nearly three times higher (47-59%), suggesting pruning poses a categorically greater risk to alignment than quantization. These findings demonstrate that perplexity-based evaluation provides false assurance of behavioral equivalence, and that IoT deployment pipelines require bias-aware validation before deploying pruned models at the edge.
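作为参照,下面给出研究中所用 Wanda 式激活感知剪枝打分的一个极简草图:每个权重按 |W_ij| * ||X_j||_2 打分,并在每个输出行内剪除得分最低的部分(逐行分组是常见实现中的默认做法,此处作为假设)。

```python
import torch

def wanda_mask(weight, act_norm, sparsity):
    """Activation-aware (Wanda-style) pruning mask sketch.
    weight   : (out, in)
    act_norm : (in,) L2 norm of each input feature over calibration tokens
    Returns a boolean keep-mask; apply with `weight * mask`.
    Dropping `act_norm` recovers plain magnitude pruning."""
    score = weight.abs() * act_norm.unsqueeze(0)
    k = int(weight.shape[1] * sparsity)            # weights to drop per row
    idx = torch.argsort(score, dim=1)[:, :k]       # lowest-scoring entries
    mask = torch.ones_like(weight, dtype=torch.bool)
    mask.scatter_(1, idx, False)
    return mask
```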

[AI-416] DARE: Diffusion Language Model Activation Reuse for Efficient Inference

【速读】:该论文旨在解决扩散语言模型(Diffusion Language Models, dLLMs)在推理效率上的不足问题,尤其是其相较于自回归(auto-regressive, AR)模型在生成速度和计算资源消耗方面的劣势。研究发现,dLLMs中存在一种未被充分探索的特性——token-wise redundancy(令牌级冗余),即双向自注意力机制中不同token之间的激活值高度相关,且查询(query)表示的时间变化可预测对应键(key)、值(value)及输出激活的冗余性。为此,作者提出DARE(Diffusion Language Model Activation Reuse),通过两个互补机制实现高效激活复用:DARE-KV复用缓存的key-value激活以减少冗余计算,DARE-O复用输出激活进一步优化效率。实验表明,DARE可在不显著影响生成质量的前提下,实现每层高达1.20倍的延迟降低,并复用高达87%的注意力激活,为提升扩散类大语言模型的推理效率提供了有效且无需重新训练的解决方案。

链接: https://arxiv.org/abs/2605.08134
作者: Natalia Frumkin,Bokun Wang,Hung-Yueh Chiang,Chi-Chih Chang,Mohamed S. Abdelfattah,Diana Marculescu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to auto-regressive (AR) models, offering greater expressive capacity and potential for parallel generation and faster inference. However, open-source dLLMs remain immature, lagging behind AR models in both efficiency and quality. We identify an underexplored property of dLLMs: token-wise redundancy in bi-directional self-attention. Self-attention activations are highly correlated across tokens, and temporal changes in query representations can predict redundancy in corresponding key, value, and output activations. We introduce DARE, with two complementary mechanisms: DARE-KV, which reuses cached key-value (KV) activations, and DARE-O, which reuses output activations to reduce redundant computation while preserving quality. DARE achieves up to 1.20x per-layer latency reduction and reuses up to 87% of attention activations, with negligible degradation on reasoning and code-generation benchmarks. DARE-KV and DARE-O incur average performance drops of only 2.0% and 1.2%, respectively. Combined with techniques such as prefix caching and Fast-dLLM, DARE provides additive gains without retraining. These results establish token-wise reuse as an effective strategy for improving the efficiency of diffusion-based LLMs while preserving generation fidelity. Code: this https URL
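按照摘要的描述,复用决策可以由相邻去噪步之间查询表征的变化幅度来近似:变化很小的 token 复用缓存的 KV(或输出)激活,其余重新计算。下面是该判定的一个极简示意,阈值 `tau` 为假设取值,并非论文的原始判据。

```python
import torch

def reuse_decisions(q_prev, q_curr, tau=0.02):
    """Per-token reuse test sketch: small relative query drift between
    denoising steps => reuse cached key/value (and output) activations.
    q_prev, q_curr: (num_tokens, d). Returns a boolean reuse mask."""
    drift = (q_curr - q_prev).norm(dim=-1) / (q_prev.norm(dim=-1) + 1e-12)
    return drift < tau
```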

[AI-417] Towards Universal Gene Regulatory Network Inference: Unlocking Generalizable Regulatory Knowledge in Single-cell Foundation Models ICML2026

【速读】:该论文旨在解决单细胞基础模型(single-cell Foundation Models, scFMs)在基因调控网络(Gene Regulatory Network, GRN)推断中表现不佳的问题,其核心症结在于标准的重建类预训练目标难以显式捕捉潜在的调控信号。解决方案的关键在于提出两种新颖的特征提取方法——虚拟值扰动(Virtual Value Perturbation)与梯度轨迹分析(Gradient Trajectory),通过从scFMs中蒸馏出隐含的调控信息,构建高度泛化的基因间特征表示,从而显著提升GRN推断的通用性与准确性,为利用scFMs实现普适性GRN推断树立了新范式。

链接: https://arxiv.org/abs/2605.08128
作者: Jiaxin Qi,Hang Li,Yan Cui,Yuhua Zheng,Jianqiang Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

点击查看摘要

Abstract:Gene Regulatory Network (GRN) inference is essential for understanding complex cellular mechanisms, rendered tractable through single-cell transcriptomic data. With the emergence of single-cell Foundation Models (scFMs), enhanced transcriptomic encoding is widely expected to revolutionize GRN inference. However, we observe that their performance remains far from satisfactory. The primary reason is that the standard reconstruction-based pre-training objectives often fail to explicitly capture latent regulatory signals. To bridge this gap, we first introduce a GRN generalization benchmark designed to evaluate regulatory predictions on unseen genes and datasets, which relies on the zero-shot capabilities of scFMs and is inherently challenging for traditional methods. Furthermore, to unlock the regulatory knowledge within the foundation models, we propose two novel methods, Virtual Value Perturbation and Gradient Trajectory, to distill implicit regulatory information from scFMs into highly generalizable inter-gene features. Extensive experiments demonstrate that our approach significantly outperforms existing methods, establishing a new paradigm for leveraging the potential of scFMs in universal GRN inference.

[AI-418] Feature Repulsion and Spectral Lock-in: An Empirical Study of Two-Layer Network Grokking

【速读】:该论文旨在解决两层神经网络在“grokking”现象中特征学习阶段的机制可解释性问题,特别是矩阵 $ B = (\widetilde{F}^\top \widetilde{F} + \eta I)^{-1} $ 所诱导的特征间排斥力是否具有可观测的谱特征及其与激活函数的关系。解决方案的关键在于通过实验验证 Tian (2025) 提出的排斥定理(Theorem 6),发现尽管特征相似性导致的非对角项为负的符号结构(sign rule)在不同激活函数下均稳定存在(如 $\sigma = x^2$ 与 $\sigma = \operatorname{ReLU}$ 下 sign-match 分别达 0.985 和 1.000),但其对应的参数更新谱结构却强烈依赖于激活函数的导数 $\sigma'$:当 $\sigma = x^2$ 时,参数更新的特征值间隙(eigengap)显著变化并呈现 rank-2 结构,表明特征分离可被检测;而 $\sigma = \operatorname{ReLU}$ 时谱结构保持 rank-1,无明显特征分离信号。这揭示了特征排斥机制向权重空间演化过程中的非线性敏感性,关键区分点在于激活函数的导数如何调控从特征空间到参数空间的信息传递路径。

链接: https://arxiv.org/abs/2605.08119
作者: Yongzhong Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages, 4 figures

点击查看摘要

Abstract:Tian (2025) proves a repulsion theorem (Theorem 6) for the matrix B = (\widetildeF^\top \widetildeF + \eta I)^-1 during the interactive feature-learning stage of grokking: similar features have negative off-diagonal entries B_j\ell , producing an effective repulsive force that drives them apart. However, the theorem does not specify when this mechanism becomes empirically observable, nor whether it leaves a measurable spectral signature in the parameter updates. We test this directly on Tian’s modular addition setup ( M = 71 , K = 2048 , MSE loss) and observe a clear structure-mechanism dissociation. The predicted sign rule holds robustly on the top-200 most-similar feature pairs across activations (empirical sign-match rising from 0.865 to 0.985 on \sigma = x^2 across 5 seeds, and saturating at 1.000 on \sigma = \operatornameReLU ). However, the spectral signature in the parameter updates is strongly activation-dependent. With \sigma = x^2 , a simple slope detector on the rolling eigengap \sigma_2 / \sigma_3 of \Delta W fires in 15/15 grokking seeds at epoch 174 (IQR [173,174]) and in 0/15 non-grokking controls, with 229 \times late-stage magnitude separation; the spectrum is rank-2. In contrast, with \sigma = \operatornameReLU , the detector never fires and the spectrum remains effectively rank-1. This dissociation aligns with Tian’s Theorem 5 distinction between focused (power-law) and spreading (ReLU) memorization: while the sign structure of B depends only on \widetildeF^\top \widetildeF , how feature repulsion translates into weight updates critically depends on the activation derivative \sigma’ .
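摘要中的“斜率检测器”思路很直接:对每个 epoch 的参数更新矩阵 ΔW 做 SVD,跟踪特征值间隙 σ2/σ3,并在其滚动斜率超过阈值时报警。下面给出一个 numpy 草图,窗口大小与阈值为示例取值,并非论文的原始参数。

```python
import numpy as np

def eigengap_series(delta_Ws):
    """sigma_2 / sigma_3 of each parameter-update matrix (rolling eigengap)."""
    gaps = []
    for dW in delta_Ws:
        s = np.linalg.svd(dW, compute_uv=False)   # singular values, descending
        gaps.append(s[1] / (s[2] + 1e-12))
    return np.array(gaps)

def slope_detector(gaps, window=20, threshold=0.05):
    """Fire at the first epoch where the rolling linear-fit slope of the
    eigengap exceeds the threshold (window/threshold are assumptions)."""
    x = np.arange(window)
    for t in range(window, len(gaps)):
        slope = np.polyfit(x, gaps[t - window:t], 1)[0]
        if slope > threshold:
            return t
    return None
```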

[AI-419] The Safety-Aware Denoiser for Text Diffusion Models

【速读】:该论文旨在解决文本扩散模型(text diffusion models)在生成过程中缺乏有效安全控制的问题。现有方法主要针对自回归生成模型设计,依赖事后过滤或推理时干预,难以适配文本扩散模型的安全需求。其解决方案的关键在于提出Safety-Aware Denoiser(SAD),通过修改迭代去噪过程,在最终去噪步骤中引导文本样本进入可证明安全的文本空间区域;该方法在推理阶段集成安全约束,无需重新训练扩散模型,具备计算高效、灵活轻量的特点,从而实现了对文本扩散模型生成内容的有效安全管控。

链接: https://arxiv.org/abs/2605.08116
作者: Amman Yusuf,Zhejun Jiang,Mijung Park
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 27 pages, 12 figures. Code available at: this https URL

点击查看摘要

Abstract:Recent work on text diffusion models offers a promising alternative to autoregressive generation, but controlling their safety remains underexplored. Existing safety approaches are geared toward autoregressive models and typically rely on post-hoc filtering or inference-time interventions. These are inadequate for effectively addressing safety risks in text diffusion models. We propose the Safety-Aware Denoiser (SAD), a safety-guidance framework in text diffusion models. The SAD modifies the iterative denoising process such that the text sample at the final denoising step is steered toward provably safe regions of the text space. This inference-time method can integrate safety constraints into the denoiser, avoiding computationally expensive retraining of the underlying diffusion model and enabling flexible, lightweight safety guidance. We evaluate the safety of the generated text using the SAD, with respect to hazard taxonomy, memorization, and jailbreak. Experimental results show that SAD substantially reduces unsafe generations while preserving generation quality, diversity, and fluency, outperforming existing methods. These results demonstrate that our safety guidance during denoising provides an effective and scalable mechanism for enforcing safety in text diffusion models.

[AI-420] Context-Augmented Code Generation: How Product Context Improves AI Coding Agent Decision Compliance by 49%

【速读】:该论文旨在解决生成式 AI 编码代理(AI coding agents)在软件开发中因缺乏对团队特定产品决策的理解而导致的合规性问题,即这些代理往往无法遵循仅存在于代码库之外的产品、设计和工程决策。解决方案的关键在于引入一个名为 Brief 的产品上下文检索系统,该系统通过规范生成、构建过程中的咨询以及对已记录决策、用户痛点、客户反馈和竞品情报的检索,为 AI 编码代理提供必要的产品上下文信息。实验表明,与仅依赖代码库访问的基线配置相比,加入 Brief 后的增强配置将决策合规率从 46% 提升至 95%,显著改善了 AI 编码代理对隐含产品逻辑的理解与执行能力。

链接: https://arxiv.org/abs/2605.08112
作者: Drew Dillon,Kasyap Varanasi
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注: 16 pages, 3 figures, 16 tables. Benchmark repository: this https URL

点击查看摘要

Abstract:AI coding agents powered by large language models can read codebases and produce functional code, but they routinely violate team-specific product decisions that are invisible in the source code alone. We introduce a controlled benchmark measuring decision compliance, the rate at which an AI coding agent follows established product, design, and engineering decisions, across 8 realistic software engineering tasks containing 41 weighted decision points. We compare a baseline configuration (Claude Code with codebase access only) against an augmented configuration that adds Brief, a product-context retrieval system providing spec generation, mid-build consultation, and retrieval of recorded decisions, persona pain points, customer signals, and competitive intelligence. On identical prompts and the same repository, the augmented configuration achieves 95% decision compliance versus 46% for the baseline, a 49 percentage point improvement. Per-decision analysis reveals that the baseline achieves 100% compliance on decisions visible in the codebase and 0-33% on decisions requiring product context, suggesting that product-context retrieval is a key driver of the improvement. We release the benchmark repository, all 16 pull requests, and scoring harness for independent reproduction.

[AI-421] TTCD: Transformer Integrated Temporal Causal Discovery from Non-Stationary Time Series Data

【速读】:该论文旨在解决非平稳、非线性且噪声复杂的时序数据中因果发现的难题,尤其在有限样本下传统约束型方法因条件独立性检验失效而性能下降,而评分型方法则受限于强统计假设。其核心解决方案是提出Transformer集成时序因果发现(TTCD)框架,关键创新在于通过重建引导的因果信号蒸馏机制——利用Transformer解码器的重构过程提取本质因果信号,从而抑制噪声和虚假相关性,同时保留真实依赖关系;进而由专用因果结构学习器基于蒸馏后的信号推断因果图,无需对噪声分布或数据生成过程施加限制,显著提升了在合成、基准及真实世界数据上的准确性与领域知识一致性。

链接: https://arxiv.org/abs/2605.08111
作者: Omar Faruque,Sahara Ali,Xue Zheng,Jianwu Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
备注: 18 Pages

点击查看摘要

Abstract:The widespread availability of complex time series data in various domains such as environmental science, epidemiology, and economics demands robust causal discovery methods that can identify intricate contemporaneous and lagged relationships in non-stationary, nonlinear, and noisy settings. Existing constraint-based methods often rely heavily on conditional independence tests that degrade for limited data samples and complex distributions, while score-based methods impose strong statistical assumptions. Recent methods address special cases such as change point detection or distribution shifts, but struggle to provide a unified solution. We propose the Transformer Integrated Temporal Causal Discovery (TTCD) Framework, a novel end-to-end approach that learns contemporaneous and lagged causal relations from non-stationary time series. TTCD introduces a Non-Stationary Feature Learner integrating temporal and frequency-domain attention with dynamic non-stationarity profiling, and a custom Causal Structure Learner. A key innovation is reconstruction-guided causal signal distillation, to distill essential causal signals through the reconstruction process of the transformer decoder, which mitigates noise and spurious correlations while preserving meaningful dependencies. The Causal Structure Learner operates on distilled reconstructed signals to infer the underlying causal graph without restrictive assumptions on noise distributions or data generation processes. Experiments on synthetic, benchmark, and real world datasets show that TTCD consistently outperforms state-of-the-art baselines in both accuracy and consistency with domain knowledge, demonstrating the approach’s effectiveness for causal discovery in challenging real world contexts.

[AI-422] BaLoRA: Bayesian Low-Rank Adaptation of Large Scale Models

【速读】:该论文旨在解决低秩适配(Low-Rank Adaptation, LoRA)在微调大规模预训练模型时存在的三个核心问题:表达能力受限、与全参数微调之间存在精度差距,以及缺乏内置的不确定性量化能力,从而限制其在可靠性至关重要的场景中的应用。解决方案的关键在于提出BaLoRA——一种贝叶斯扩展的LoRA方法,其核心创新在于引入了一种新颖的输入自适应贝叶斯参数化方式来建模LoRA矩阵,该方法仅增加少量参数和计算开销,却能同时实现校准良好的不确定性估计,并通过自适应噪声注入显著提升预测准确性,缩小与全参数微调的性能差距,且在金属有机框架材料带隙预测任务中展现出优于训练好的LoRA集成模型的零样本测试时不确定性估计能力。

链接: https://arxiv.org/abs/2605.08110
作者: Dario Coscia,Sindy Löwe,Max Welling
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) has become the standard for fine-tuning large pre-trained models at reduced computational cost. However, its low-rank point-estimate updates limit expressiveness, leave a persistent gap relative to full fine-tuning accuracy, and provide no built-in uncertainty quantification, limiting its applicability in settings where reliability matters as much as accuracy. We introduce BaLoRA, a Bayesian extension of LoRA with a novel input-adaptive Bayesian parameterization of LoRA matrices that adds minimal parameters and compute. Surprisingly, not only does the Bayesian extension yield well-calibrated uncertainty estimates, but the adaptive noise injection underlying our approach also significantly improves prediction accuracy, narrowing the gap with full fine-tuning across both natural language reasoning and vision tasks. When applied to band gap prediction in metal-organic frameworks, BaLoRA produces zero-shot test-time uncertainty estimates that correlate more strongly with model error than a trained ensemble of LoRA models, and improve monotonically with compute without sacrificing accuracy.
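下面给出“输入自适应的贝叶斯低秩更新”这一思路的极简 PyTorch 草图:低秩路径同时给出均值与依赖输入的尺度,并通过重参数化采样注入噪声。具体参数化(尤其是尺度的构造方式)是示意性假设,可能与论文中 BaLoRA 的设计不同。

```python
import torch
import torch.nn as nn

class BayesianLoRASketch(nn.Module):
    """Input-adaptive Bayesian low-rank update sketch on top of a frozen
    linear layer. The base weights stay fixed; only the low-rank factors
    (mean and log-scale) are trained."""
    def __init__(self, base: nn.Linear, rank=8):
        super().__init__()
        self.base = base.requires_grad_(False)
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(0.01 * torch.randn(rank, d_in))
        self.B_mu = nn.Parameter(torch.zeros(d_out, rank))
        self.B_logsig = nn.Parameter(torch.full((d_out, rank), -5.0))

    def forward(self, x, n_samples=1):
        z = x @ self.A.T                              # (batch, rank)
        mu = z @ self.B_mu.T
        sigma = z.abs() @ self.B_logsig.exp().T       # input-adaptive scale (assumed form)
        outs = [self.base(x) + mu + sigma * torch.randn_like(mu)
                for _ in range(n_samples)]
        return torch.stack(outs).mean(0) if n_samples > 1 else outs[0]
```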

[AI-423] MedThink: Enhancing Diagnostic Accuracy in Small Models via Teacher-Guided Reasoning Correction

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在临床诊断中因计算与内存开销过大而难以部署于资源受限环境的问题,同时克服传统知识蒸馏(Knowledge Distillation, KD)仅迁移表面答案模式、无法保留结构化推理能力的局限。其解决方案的关键在于提出一种两阶段蒸馏框架MedThink:第一阶段通过教师模型筛选数据并注入领域知识解释,为学生模型(Small Language Model, SLM)建立知识基础;第二阶段由教师模型分析学生错误,生成连接知识与正确答案的推理链,并通过二次微调强化学生的诊断推理能力。该方法显著提升了SLMs在通用医学基准和胃肠病学数据集上的诊断准确率与泛化性能,同时保持了计算效率。

链接: https://arxiv.org/abs/2605.08094
作者: Xinchun Su,Chunxu Luo,Lipeng Ma,Yixuan Li,Weidong Yang
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate clinical diagnosis requires extensive domain knowledge and complex clinical reasoning capabilities. Although large language models (LLMs) hold great potential for clinical reasoning, their high computational and memory requirements limit their deployment in resource-constrained environments. Knowledge distillation (KD) can compress LLM capabilities into smaller models, but traditional KD merely transfers superficial answer patterns and fails to preserve the structured reasoning required for reliable diagnosis. To address this, we propose a two-stage distillation framework, MedThink, designed to cultivate robust clinical reasoning in small language models (SLMs). In the first stage, a teacher LLM screens data and injects domain-knowledge explanations to fine-tune a student model, establishing a knowledge foundation. In the second stage, the teacher evaluates the student’s errors, generates reasoning chains linking knowledge to correct answers, and refines the student’s diagnostic reasoning through a second round of fine-tuning. We evaluate MedThink on general medical benchmarks and a gastroenterology dataset comprising 955 question-answer pairs. Experiments demonstrate that MedThink outperforms six distillation strategies in all benchmarks: achieving an improvement of up to 12.7% over the student baseline in general tasks, and reaching a total top accuracy of 56.4% in gastroenterology evaluation. This indicates that iterative distillation centered on reasoning can significantly enhance the diagnostic accuracy and generalization capabilities of SLMs whilst maintaining computational efficiency. Our code and data are publicly available at this https URL.

[AI-424] Delulu: A Verified Multi-Lingual Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在填空式代码生成(Fill-in-the-Middle, FIM)任务中普遍存在幻觉(hallucination)的问题,即模型生成看似合理但实际错误的代码片段,如虚构的API方法、无效参数、未定义变量或不存在的导入语句,这类错误难以通过表面审查发现,却会导致运行时异常。解决方案的关键在于构建一个经过验证的多语言基准测试集Delulu,其核心创新包括:采用对抗性流水线(adversarial pipeline)——由前沿LLM生成看似合理的幻觉样本(plausible hallucinations),四种不同判别模型进行评估,基于嵌入聚类挖掘更难样本,利用自包含Docker容器验证黄金补全版本可编译而幻觉版本产生预期运行时错误,并辅以人工专家最终审核去除偏倚或过于简单的样本,从而确保数据集的质量和挑战性。此方法有效揭示了FIM任务内在难度,而非特定模型家族的局限性。

链接: https://arxiv.org/abs/2605.07024
作者: Mahdi Erfanian,Nelson Daniel Troncoso,Aashna Garg,Amabel Gale,Xiaoyu Liu,Pareesa Ameneh Golnari,Shengyu Fu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models for code generation frequently produce hallucinations in Fill-in-the-Middle (FIM) tasks – plausible but incorrect completions such as invented API methods, invalid parameters, undefined variables, or non-existent imports. These failures pass superficial review yet introduce runtime errors. We introduce Delulu, a verified multi-lingual benchmark of 1,951 FIM samples across 7 languages and 4 hallucination types. Samples are curated through an adversarial pipeline: a frontier LLM generates plausible hallucinations, four diverse judge models evaluate them, embedding-based clustering mines progressively harder examples, self-contained Docker containers verify that golden completions compile while hallucinated variants produce the expected runtime error, and a final human-expert review removes any remaining biased or trivially decidable samples. We evaluate 11 open-weight FIM models from five families spanning 0.5B-32B parameters: a six-point Qwen2.5-Coder scaling slate, plus a cross-family slate (CodeLlama, DeepSeek-Coder-V2, StarCoder2). The strongest model reaches only 84.5% pass@1, no family exceeds 0.77 Edit Similarity, and every family produces hallucination-aligned completions on a non-trivial share of samples, confirming that the difficulty exposed by Delulu is task-intrinsic rather than family-specific. We release the benchmark, containers, and evaluation framework at this https URL.

[AI-425] ReasonSTL: Bridging Natural Language and Signal Temporal Logic via Tool-Augmented Process-Rewarded Learning

【速读】:该论文旨在解决自然语言到信号时序逻辑(Signal Temporal Logic, STL)的自动翻译问题,这一任务在自主系统和信息物理系统的形式化验证与合成中至关重要。当前方法存在两大局限:一是人工编写STL公式需要时序逻辑专业知识且难以扩展;二是依赖商业大语言模型(Large Language Models, LLMs)API进行提示推理会带来高昂的token成本并引发敏感系统需求泄露的隐私风险。为此,作者提出ReasonSTL框架,其核心创新在于将翻译过程分解为显式推理、确定性工具调用和结构化公式构建三阶段,并引入基于过程奖励的训练机制,联合监督工具使用轨迹与最终STL公式的生成质量。该方案实现了透明、低成本且隐私安全的正式规范起草,显著提升了翻译性能。

链接: https://arxiv.org/abs/2605.06483
作者: Bowen Ye,Zhijian Li,Junyue Huang,Junkai Ma,Xiang Yin
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Signal Temporal Logic (STL) is an expressive formal language for specifying spatio-temporal requirements over real-valued, real-time signals. It has been widely used for the verification and synthesis of autonomous systems and cyber-physical systems. In practice, however, users often express their requirements in natural language rather than in structured STL formulas, making natural-language-to-STL translation a critical yet challenging task. Manual specification requires temporal-logic expertise and cannot scale, while prompting commercial LLM APIs incurs substantial token costs and may expose sensitive system requirements to third-party services, raising privacy concerns for industrial deployment. To address these challenges, we present ReasonSTL, a tool-augmented framework that adapts local open-source language models for natural-language-to-STL generation. ReasonSTL decomposes the translation process into explicit reasoning, deterministic tool calls, and structured formula construction. We further introduce process-rewarded training to supervise both tool-use trajectories and final formulas, together with STL-Bench, a bilingual, computation-aware benchmark grounded in real-world signals. Experiments show that a 4B model trained with ReasonSTL achieves state-of-the-art performance in both automatic metrics and human evaluations, demonstrating that ReasonSTL provides a transparent, low-cost, and privacy-preserving alternative for formal specification drafting.
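For readers unfamiliar with STL, a toy natural-language-to-STL pair illustrates what the translation target looks like; the requirement and formula below are invented for illustration and are not taken from STL-Bench.

```latex
% "Whenever the speed x exceeds 5, it must drop below 3 within 10 time units."
\varphi \;=\; G\bigl( (x > 5) \rightarrow F_{[0,10]}\, (x < 3) \bigr)
% G = "globally" (always), F_{[0,10]} = "eventually within the next 10 time units"
```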

[AI-426] A Unified Pair-GRPO Family: From Implicit to Explicit Preference Constraints for Stable and General RL Alignment

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)通过人类偏好强化学习(Reinforcement Learning from Human Preferences, RLHF)进行对齐时面临的不稳定策略更新、梯度方向模糊、可解释性差以及梯度方差高等问题。其解决方案的核心在于构建一个统一的理论框架——Pair-GRPO家族,包含Soft-Pair-GRPO与Hard-Pair-GRPO两种紧密耦合的变体:前者通过对Group Relative Policy Optimization (GRPO) 的最小修改,将组归一化的标量奖励替换为二元偏好奖励,在保持原结构的基础上实现了梯度等价性证明(即在当前策略的一阶泰勒展开下,其梯度是标准GRPO梯度的正标量倍数),从而解释了其经验上的稳定性;后者进一步引入显式局部概率约束和受限KL拟合优化,有效抑制梯度噪声并减少全局策略漂移,同时提供完备的理论保障,包括单调策略改进、确定性梯度方向、梯度方差降低及动态步长收敛性。

链接: https://arxiv.org/abs/2605.06375
作者: Hao Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST)
备注:

点击查看摘要

Abstract:Large language model (LLM) alignment via reinforcement learning from human preferences (RLHF) suffers from unstable policy updates, ambiguous gradient directions, poor interpretability, and high gradient variance in mainstream pairwise preference learning paradigms. To systematically address these limitations, we establish a unified theoretical framework for preference-based RL optimization centered on the Pair-GRPO family, comprising two tightly coupled variants: Soft-Pair-GRPO and Hard-Pair-GRPO. Soft-Pair-GRPO is a minimal modification of Group Relative Policy Optimization (GRPO) that replaces group-normalized scalar rewards with binary pairwise preference rewards, retaining GRPO’s clipped surrogate and KL-regularized structure. We prove a critical gradient equivalence theorem: under first-order Taylor expansion around the current policy, Soft-Pair-GRPO’s gradient is a positive scalar multiple of standard GRPO’s gradient, explaining its empirical stability despite discarding continuous reward magnitudes. Building on this foundation, we propose Hard-Pair-GRPO, an advanced variant introducing explicit local probability constraints and constrained KL-fitting optimization to further suppress gradient noise and global policy drift. We provide comprehensive theoretical guarantees for both variants–including monotonic policy improvement, deterministic gradient direction, gradient-variance reduction, and dynamic step-size convergence. Extensive experiments on standard LLM alignment benchmarks (HH-RLHF,UltraFeedback) and the MuJoCo continuous control task HalfCheetah-v4 demonstrate that our Pair-GRPO family consistently outperforms state-of-the-art baselines in alignment quality, human preference win rate, training stability, and generalization to general reinforcement learning. Ablation studies validate the critical contributions of each core component.
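As a rough illustration of the change the abstract describes, the sketch below contrasts the standard GRPO group-normalized advantage with one plausible reading of a binary pairwise preference reward (the win fraction within the sampled group). The paper's exact Soft-Pair-GRPO reward may be defined differently.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # standard GRPO: group-normalized scalar rewards for one prompt's sample group
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def pairwise_preference_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # hypothetical pairwise variant: credit each sample with the fraction of
    # group members it beats (a binary comparison), then recenter to zero mean
    wins = (rewards.unsqueeze(1) > rewards.unsqueeze(0)).float().mean(dim=1)
    return wins - wins.mean()
```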

[AI-427] Attractor-Vascular Coupling Theory: Formal Grounding and Empirical Validation for AAMI-Standard Cuffless Blood Pressure Estimation from Smartphone Photoplethysmography

【速读】:该论文旨在解决无袖带血压(cuffless blood pressure, BP)监测中精度不足与可泛化性差的问题,尤其在临床级连续血压追踪场景下。其核心解决方案是提出吸引子-血管耦合理论(Attractor-Vascular Coupling Theory, AVCT),该理论基于心脏稳定性理论(Cardiac Stability Theory),通过Takens延迟嵌入和吸引子形态提取技术,将光电容积脉搏波(photoplethysmography, PPG)信号中的非线性动力学特征转化为可用于血压估计的几何信息。关键创新在于:利用PPG吸引子特征(如脉搏传导时间PTT与心脏稳定性指数CSI)构建轻量级LightGBM模型,在单点校准条件下即可实现高精度BP估计(收缩压MAE=2.05 mmHg,舒张压MAE=1.67 mmHg),且在46名受试者上严格留一被试交叉验证(LOSO-CV)均满足AAMI/IEEE SP10标准(MAE<5 mmHg)。AVCT不仅预测了特征重要性层级,还首次从非线性动力系统角度提供可解释性机制,优于传统事后可解释AI方法,并证明仅用智能手机摄像头采集的PPG信号即可实现临床级血压跟踪。

链接: https://arxiv.org/abs/2605.10871
作者: Timothy Oladunni,Farouk Ganiyu Adewumi
机构: 未知
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This work proposes Attractor-Vascular Coupling Theory (AVCT), a mathematical framework showing that cardiac attractor geometry encodes blood pressure (BP) information sufficient for AAMI-standard estimation, and validates the theory through a calibrated cuffless BP model using photoplethysmography (PPG). AVCT is grounded in Cardiac Stability Theory and operationalized using Takens delay embedding and attractor morphology extraction. Two theorems, one proposition, and one corollary formally justify the use of PPG attractor features for BP estimation and predict the feature-importance hierarchy. A LightGBM model trained on pulse transit time (PTT) and Cardiac Stability Index (CSI) attractor features under single-point calibration was evaluated using strict leave-one-subject-out cross-validation (LOSO-CV) on 46 subjects from BIDMC ICU (n = 9) and VitalDB surgical data (n = 37), comprising 29,684 windows. The model achieved systolic BP (SBP) mean absolute error (MAE) of 2.05 mmHg and diastolic BP (DBP) MAE of 1.67 mmHg, with correlations r = 0.990 and r = 0.991, satisfying the AAMI/IEEE SP10 requirement of MAE below 5 mmHg. Median per-subject MAE was 1.87/1.54 mmHg, and 70%/76% of subjects individually satisfied AAMI criteria. A PPG-only ablation using nine smartphone attractor features matched the ECG+PPG model within 0.05 mmHg, demonstrating that clinical-grade BP tracking is achievable using only a smartphone camera while surpassing prior generalized LOSO-CV results using fewer sensors. All four AVCT predictions were quantitatively confirmed, with 91.5% error reduction from uncalibrated to calibrated estimation (epsilon_cal = 0.915). Unlike post-hoc explainable AI methods, AVCT predicts features satisfying the architectural faithfulness criterion of the Explainable-AI Trustworthiness (EAT) framework and grounding BP estimation in nonlinear dynamical systems theory.
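The attractor construction mentioned in the abstract relies on Takens delay embedding, which is standard; a minimal NumPy sketch is shown below. The embedding dimension and delay are placeholder values, not those used by AVCT.

```python
import numpy as np

def delay_embed(signal: np.ndarray, dim: int = 3, tau: int = 5) -> np.ndarray:
    """Takens delay embedding: map a 1-D PPG signal to attractor points
    [x(t), x(t + tau), ..., x(t + (dim - 1) * tau)]."""
    n = len(signal) - (dim - 1) * tau
    return np.stack([signal[i * tau : i * tau + n] for i in range(dim)], axis=1)

# each row of delay_embed(ppg) is one point on the reconstructed attractor;
# attractor morphology features are then computed from this point cloud
```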

[AI-428] Switching-Geometry Analysis of Deflated Q-Value Iteration

【速读】:该论文旨在解决折扣马尔可夫决策过程(discounted Markov decision process)中基于秩一修正(rank-one deflated)Q值迭代(Q-VI)算法的收敛性分析问题,尤其是如何更精确地刻画其收敛速率。传统Q-VI的收敛速率由折扣因子 γ(0,1)\gamma \in (0,1) 控制,但该论文指出,由于所有子系统共享全1向量作为不变方向,标准Q-VI切换系统模型的联合谱半径(joint spectral radius, JSR)恰好等于 γ\gamma,这可能高估了实际收敛速度。解决方案的关键在于引入商空间(quotient space)投影,去除冗余的全1方向后得到一个新的投影切换系统模型,其JSR可能严格小于 γ\gamma,从而提供比原空间 γ\gamma-界更精细的收敛速率描述。此外,论文证明该修正等价于对标准Q-VI进行标量重中心化(scalar recentering),因此贪婪策略序列保持不变,说明deflation的优势不在于改变决策问题本身,而在于通过几何重构实现更准确的收敛性分析。

链接: https://arxiv.org/abs/2605.10811
作者: Donghwan Lee
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper develops a joint spectral radius (JSR) framework for analyzing rank-one deflated Q-value iteration (Q-VI) in discounted Markov decision process control. Focusing on an all-ones residual correction, we interpret the resulting algorithm through the geometry of switching systems and, to the best of our knowledge, give the first JSR-based convergence analysis of deflated Q-VI for policy optimization problems. Our analysis reveals that the standard Q-VI switching system model has JSR exactly the discount factor \gamma\in (0,1) , since all admissible subsystems share the all-ones vector as an invariant direction. By passing to the quotient space that removes this direction, we obtain a projected switching system model whose JSR governs the relevant error dynamics and may be strictly smaller than \gamma . Therefore, the deflated Q-VI admits a potentially sharper convergence-rate characterization than the ambient-space \gamma -bound. Finally, we prove that the correction is equivalent to a scalar recentering of standard Q-VI. Hence, the projected trajectory, and therefore the greedy-policy sequence, is unchanged relative to standard Q-VI initialized from the same point. The benefit of deflation is not a change in the induced decision-making problem, but a more precise JSR-based description of the convergence geometry after the redundant all-ones component is removed.
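A small NumPy sketch of the recentering the abstract analyzes is given below. The choice of the backup mean as the deflation scalar is an arbitrary illustration, but any uniform shift along the all-ones direction leaves the greedy policy unchanged, matching the equivalence stated in the abstract.

```python
import numpy as np

def q_value_iteration(P: np.ndarray, r: np.ndarray, gamma: float,
                      iters: int = 200, deflate: bool = True) -> np.ndarray:
    """Q-VI with an optional all-ones recentering (illustrative).
    P: (S, A, S) transition kernel, r: (S, A) rewards."""
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        v = Q.max(axis=1)                              # greedy state values
        Q = r + gamma * np.einsum("sat,t->sa", P, v)   # Bellman optimality backup
        if deflate:
            Q -= Q.mean()                              # shift along the all-ones direction
    return Q.argmax(axis=1)                            # greedy policy: same with or without deflation
```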

[AI-429] An agentic framework for gravitational-wave counterpart association in the multi-messenger era

【速读】:该论文旨在解决多信使天文学中引力波(Gravitational Waves, GWs)事件与电磁(Electromagnetic, EM)对应体关联分析的挑战,尤其是在下一代引力波和电磁探测器时代,事件数量激增对现有数据分析范式带来的压力。解决方案的关键在于开发了一个名为GW-Eyes的代理框架(agentic framework),该框架基于大语言模型(Large Language Models, LLMs),首次实现了领域特定工具的集成,并能自主执行引力波与候选电磁事件之间的对应体关联任务;其核心优势在于利用LLMs复杂的决策能力与可追溯的推理过程,同时支持自然语言交互,辅助人类专家完成目录管理、天区图可视化及快速验证等辅助任务,从而显著提升多信使研究的效率与智能化水平。

链接: https://arxiv.org/abs/2605.10584
作者: Yiming Dong,Yacheng Kang,Junjie Zhao,Xinyuan Zhu,Ziming Wang,Lijing Shao
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the detection of gravitational waves (GWs), multi-messenger astronomy has opened a new window for advancing our understanding of astrophysics, dense matter, gravitation, and cosmology. The GW sources detected to date are from mergers of compact object binaries, which possess the potential to generate detectable electromagnetic (EM) counterparts. Searching for associations between GW signals and their EM counterparts is an essential step toward enabling subsequent multi-messenger studies. In the era of next-generation GW and EM detectors, the rapid increase in the number of events brings not only unprecedented scientific opportunities, but also substantial challenges to the existing data analysis paradigm. To help address these challenges, we develop GW-Eyes, an agentic framework powered by large language models (LLMs). For the first time, GW-Eyes integrates domain-specific tools and autonomously performs counterpart association tasks between GW and candidate EM events. It supports natural language interaction to assist human experts with auxiliary tasks such as catalog management, skymap visualization, and rapid verification. Our framework leverages the complex decision-making capabilities of LLMs and their traceable reasoning processes, offering a new perspective to the multi-messenger astronomy.

[AI-430] Cavity-Enhanced Collective Quantum Processing with Polarization-Encoded Qubits

【速读】:该论文旨在解决如何在腔量子体系中实现可扩展、稳定的集体量子处理问题,特别是如何在不依赖极端物理条件(如极强非线性系数或超长光子寿命)的前提下构建通用量子门集。其解决方案的关键在于提出了一种腔增强的光学架构:通过将逻辑量子比特编码在腔内模式的偏振子空间中,明确分离物理载体与计算自由度——谐振腔束提供稳定共振基底,而可编程偏振变换实现单量子比特操作;同时,在纠缠区域引入偏振选择性的非线性相互作用,生成可调的受控相位门,从而构成通用门集。参数缩放分析表明,厘米尺度腔体结合实验可实现的固态非线性介质即可获得接近单位量级的条件相位,无需苛刻的实验条件,为基于腔的集体量子架构提供了物理可行的平台。

链接: https://arxiv.org/abs/2605.10473
作者: Kamil Wereszczyński(0000-0003-1686-472X),Józef Cyran(0009-0006-5205-8986),Adam Brzezowski(0009-0004-6997-445X),Dawid Załużny(0009-0003-5106-0855),Robert Potoniec(0009-0005-7477-3625),Kasper Wiśniowski(0009-0004-6696-9778),Agnieszka Michalczuk(0000-0002-8963-1030)
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce a cavity-enhanced optical architecture for collective quantum processing in which logical qubits are encoded in the polarization subspace of recirculating intracavity modes. The physical carrier and computational degree of freedom are explicitly separated: harmonic cavity bundles provide a stable resonant substrate, while programmable polarization transformations implement single-qubit operations. A polarization-selective nonlinear interaction in the entanglement region generates tunable controlled-phase gates, enabling a universal gate set. A parameter-scaling analysis shows that order-unity conditional phases are attainable in centimeter-scale cavities using experimentally accessible solid-state nonlinear media, without requiring extreme nonlinear coefficients, millisecond photon lifetimes, or sub-hertz laser stabilization. The results indicate that resonant recirculation provides a physically plausible platform for cavity based collective quantum architectures.

[AI-431] Physical probes expose and alleviate chemical-environment collapse in molecular representations

【速读】:该论文旨在解决核磁共振(NMR)光谱在分子表示学习中因数据异质性和原子级归属不完整而导致的表征崩溃问题,特别是当分子拓扑等价原子在真实化学环境中仍保持实验差异,以及静态构象限制了动态体系中3D描述准确性的瓶颈。解决方案的关键在于提出CLAIM(Contrastive Learning for Atom-to-molecule Inference of Molecular NMR),通过层次化化学先验和跨层级对比学习,将高效的拓扑分子输入与原子分辨的NMR观测对齐,从而恢复丢失的化学分辨率,并显著提升原子级别分子-谱图检索性能,且在柔性分子和互变异构体系中保持鲁棒性,无需显式3D建模即可增强立体异构体区分能力,并可迁移至ADMET预测和荧光估算等更广泛的分子属性任务。

链接: https://arxiv.org/abs/2605.10429
作者: Jiebin Fang,Zidi Yan,Churu Mao,Yongjun Jiang,Xinyi Tang,Lei Miao,Dan Lu,Yun Huang,Wanjing Ding,Zhongjun Ma
机构: 未知
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Nuclear magnetic resonance (NMR) spectroscopy provides an experimental readout of local chemical environments, but its use in molecular representation learning has been constrained by heterogeneous data and incomplete atom-level assignments. Here we construct complementary high-fidelity experimental and computational 13C NMR resources, which reveal a recurrent form of representational collapse: atoms that are equivalent in molecular topology can remain experimentally distinct in their real chemical environments, whereas explicit 3D descriptions are further limited by static conformations in dynamic regimes. To alleviate this bottleneck, we develop CLAIM (Contrastive Learning for Atom-to-molecule Inference of Molecular NMR), a framework that aligns efficient topological molecular inputs with atom-resolved NMR observables. Through hierarchical chemical priors and cross-level contrastive learning, CLAIM restores lost chemical resolution and markedly improves atom-level molecule-spectrum retrieval. CLAIM remains robust in flexible and tautomeric systems for 13C NMR prediction, improves stereoisomer discrimination without explicit 3D modelling, and transfers to broader molecular property tasks including ADMET prediction and fluorescence estimation. These results establish physically grounded spectral alignment as an effective strategy for alleviating chemical-environment collapse and for guiding experimentally grounded molecular representation learning.

[AI-432] Every finite group admits a just finite presentation

【速读】:该论文旨在解决Kourovka笔记本问题21.10中提出的开放性猜想,即是否每个有限群都存在一个“just finite”(恰好有限)的有限表示,亦即移除该表示中的任意一个关系后,所得群变为无限群。论文通过构造性证明,展示了对于任意给定的有限群,均可以构建出满足该性质的有限展示,从而正面回答了这一长期悬而未决的问题。解决方案的关键在于利用群论中的组合方法与几何群论工具,设计出一种系统性的关系添加策略,确保所构造的展示具有“just finite”的特性。

链接: https://arxiv.org/abs/2605.10402
作者: Marc Lackenby
机构: 未知
类目: Group Theory (math.GR); Artificial Intelligence (cs.AI)
备注: 5 pages. Significant assistance was provided by the AI co-mathematician tool developed by Google DeepMind

点击查看摘要

Abstract:A finite presentation \langle X \mid R \rangle of a finite group is called 'just finite' if removing any relation from R results in a presentation for an infinite group. It has been an open question (Kourovka Notebook, Problem 21.10) whether every finite group admits such a presentation. We resolve this conjecture in the affirmative.

[AI-433] SCALAR: A Neurosymbolic Framework for Automated Conjecture and Reasoning in Quantum Circuit Analysis

【速读】:该论文旨在解决量子电路分析中自动推测最优参数与图结构之间关系的问题,特别是针对量子近似优化算法(QAOA)的参数选择和性能预测。其解决方案的关键在于提出了一种神经符号框架 SCALAR(Symbolic Conjecture and LLM-Assisted Reasoning),该框架融合了量子模拟、符号猜想生成与大语言模型(LLM)辅助解释,能够从大量图实例中自动发现并验证参数约束(如相位分离参数 γ 的周期性限制)以及参数迁移现象,并通过不变量描述符刻画图结构特征与优化景观属性之间的相关性。该方法基于 CUDA-Q 张量网络模拟器实现了对最多 77 量子比特规模问题的扩展实验,显著提升了对 QAOA 参数设计的自动化理解和泛化能力。

链接: https://arxiv.org/abs/2605.10327
作者: Sean Feeney,Pooja Rao,Andreas Klappenecker,Reuben Tate,Yuri Alexeev,Stefano Mensa,Elica Kyoseva,Stephan Eidenbenz
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
备注:

点击查看摘要

Abstract:In this paper, we present SCALAR (Symbolic Conjecture and LLM-Assisted Reasoning), a neurosymbolic framework for automated conjecture generation in quantum circuit analysis built on top of the CUDA-Q open source framework. The system integrates quantum simulation, symbolic conjecture generation, and LLM-based interpretation. We evaluate SCALAR on 82 MaxCut instances from the MQLib benchmark dataset and extend the analysis to 2,000 randomly generated graphs across four topologies: regular, Erdos-Renyi, Barabasi-Albert, and Watts-Strogatz. The framework generates conjectured bounds relating optimal QAOA parameters to graph invariants, including known relationships such as periodicity constraints on the phase separation parameter \gamma . SCALAR also recovers previously reported parameter transfer phenomena across structurally similar instances. Additionally, the system identifies correlations between graph structural features and optimization landscape properties, which we characterize through invariant-based descriptors. Using CUDA-Q tensor network simulator, we scale experiments to instances of up to 77 qubits. We discuss the accuracy, generality, and limitations of the generated conjectures, including sensitivity to graph class and quantum circuit depth.

[AI-434] Generative AI Fuels Solo Entrepreneurship but Teams Still Lead at the Top

【速读】:该论文旨在解决生成式 AI(Generative AI)对创业进入模式及其质量分布影响的问题,具体关注其是否改变了创业者的构成结构以及最终高质量成果的分布格局。解决方案的关键在于利用 Product Hunt 平台上超过 160,000 次产品发布数据,实证分析 ChatGPT-3.5 公开发布前后创业行为的变化:发现生成式 AI 显著降低了个体创业者(solo entrepreneurs)的进入门槛,尤其在传统上依赖团队协作的领域中表现突出;但这种增长主要来自低承诺度的试验性进入,并未提升高质产品在平台排名中的代表性;相反,高质量成果仍由团队型创业项目主导。这表明生成式 AI 在降低个体创业壁垒的同时,强化了团队型创业的优势地位。

链接: https://arxiv.org/abs/2605.10291
作者: Hyunso Kim,Hyo Kang,Jaeyong Song
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Applications (stat.AP)
备注:

点击查看摘要

Abstract:Recent advances in generative artificial intelligence (AI) are reshaping who enters entrepreneurship, but not who reaches the top of the quality distribution. Using data on over 160,000 product launches on Product Hunt, we find that entrepreneurial entry increased sharply following the public release of ChatGPT-3.5, driven disproportionately by solo entrepreneurs. This shift toward solo entry is particularly pronounced in categories that historically favored team-based ventures. However, much of this growth reflects low-commitment, experimental entry and does not translate into greater representation among the highest-quality outcomes. Team-based ventures are increasingly dominant in the top tiers of platform rankings. These findings suggest that generative AI lowers barriers to solo entrepreneurship while reinforcing team-based advantages.

[AI-435] Coarsening Linear Non-Gaussian Causal Models with Cycles

【速读】:该论文旨在解决高维因果结构在存在循环(cyclic)情况下如何有效抽象为低维因果有向无环图(DAG)的问题。传统方法假设高维与低维因果结构均为无环,限制了其在复杂系统中的应用;本文在线性非高斯(LiNG)设定下证明,即使高维结构包含循环,仍可恢复出一个不变的低维DAG,该DAG是观测等价类中所有成员的共同表示,从而提供了一个自然且可识别的低维抽象。解决方案的关键在于利用LiNG模型的特殊性质——循环结构仅在观测层面形成等价类(通过定向循环反转生成),而低维DAG在此等价类中保持不变,进而实现高效学习:算法在最坏情况下具有立方时间复杂度,并附带明确的样本复杂度界,显著优于现有基于高维变量的指数级方法。

链接: https://arxiv.org/abs/2605.10163
作者: Francisco Madaleno,Francisco C Pereira,Alex Markham
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent work on causal abstraction, in particular graphical approaches focusing on causal structure between clusters of variables, aims to summarize a high-dimensional causal structure in terms of a low-dimensional one. Existing methods for learning such summaries from data assume that both the high- and low-dimensional structures are acyclic, which is helpful for causal effect identification and reasoning but excludes many high-dimensional models and thus limits applicability. We show that in the linear non-Gaussian (LiNG) setting, the high-dimensional acyclicity assumption can be relaxed while still allowing recovery of a low-dimensional causal directed acyclic graph (DAG). We further connect identifiability of this low-dimensional DAG to existing results: LiNG models with cycles are observationally identifiable only up to an equivalence class whose members differ by reversals of directed cycles; our low-dimensional DAG, which is invariant across all members of a given equivalence class, thus forms a natural representative of the class. While existing approaches for learning this observational equivalence class over high-dimensional variables have exponential time complexity, our low-dimensional summary is learned in worst-case cubic time and comes with explicit bounds on the sample complexity. We provide open source code and experiments on synthetic data to corroborate our theoretical results.

[AI-436] PoDAR: Power-Disentangled Audio Representation for Generative Modeling

【速读】:该论文旨在解决音频潜在扩散模型(audio latent diffusion models)中潜在空间建模能力不足的问题,尤其是在生成式AI(Generative AI)任务中,如何提升模型对潜在表示的可建模性(modelability)以加速收敛并提高最终性能。其解决方案的关键在于通过显式因子解耦(explicit factor disentanglement),将信号功率(signal power)从不变的语义内容中分离出来,提出PoDAR(Power-Disentangled Audio Representation)框架,利用随机功率增强和潜在一致性目标实现这一解耦。该方法不仅使潜在空间更易建模,还支持仅对功率不变的内容应用条件引导(CFG),从而扩展了稳定引导范围至更高尺度,显著提升了训练效率与生成质量。

链接: https://arxiv.org/abs/2605.10084
作者: Alejandro Luebs,Mithilesh Vaidya,Ishaan Kumar,Sumukh Badam,Stephen W. Bailey,Matthew Bendel,Jose Sotelo,Xingzhe He
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
备注: 9 pages, 3 figures

点击查看摘要

Abstract:The performance of audio latent diffusion models is primarily governed by generator expressivity and the modelability of the underlying latent space. While recent research has focused primarily on the former, as well as improving the reconstruction fidelity of audio codecs, we demonstrate that latent modelability can be significantly improved through explicit factor disentanglement. We present PoDAR (Power-Disentangled Audio Representation), a framework that utilizes a randomized power augmentation and latent consistency objective to decouple signal power from invariant semantic content. This factorization makes the latent space easier to model, which both accelerates the convergence of downstream generative models and improves final overall performance. When applied to a Stable Audio 1.0 VAE with an F5-TTS generator, PoDAR achieves about a 2\times acceleration in convergence to match baseline performance, while increasing final speaker similarity by 0.055 and UTMOS by 0.22 on the LibriSpeech-PC dataset. Furthermore, isolating power into dedicated channels enables the application of CFG exclusively to power-invariant content, effectively extending the stable guidance regime to higher scales.

[AI-437] Yeti: A compact protein structure tokenizer for reconstruction and multi-modal generation

【速读】:该论文旨在解决现有蛋白质结构tokenization方法在生成能力上的局限性问题,即当前tokenizers虽能较好地重建蛋白质结构,但难以支持高效的多模态学习与生成任务。其关键解决方案是提出Yeti——一种基于无查找表量化(lookup free quantization)的紧凑型蛋白质结构tokenizer,并采用流匹配(flow matching)目标进行端到端训练,从而在保持低参数量(仅为ESM3的1/10)的同时实现最优的码本利用率和最高的token多样性,同时具备良好的重建精度和生成能力。通过将Yeti与氨基酸序列联合建模,作者构建了一个从零开始训练的轻量级多模态模型,在无预训练初始化条件下即可实现序列与结构的联合生成,生成结果可媲美参数量大10倍的模型,验证了Yeti作为高效、高表达力结构tokenzier的潜力。

链接: https://arxiv.org/abs/2605.09981
作者: Nabin Giri,Steven Farrell,Kristofer E. Bouchard
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal models that jointly reason over protein sequences, structures, and function annotations within a unified representation hold immense potential for integrating multimodal data and generating new proteins with designed functional properties. To utilize transformer architectures, such models require a tokenizer that converts protein structure from continuous atomic coordinates into discrete representations suitable for scalable multimodal training. The quality of such models are fundamentally upper bounded by the fidelity and expressiveness of the underlying tokenized structure. However, existing tokenizers prioritize reconstruction over generative abilities. To address these gaps, we introduce Yeti, a simple and compact protein structure tokenizer based on lookup free quantization and trained end to end with a flow matching objective for multimodal learning. Compared to existing models, Yeti generally achieves the best codebook utilization and token diversity, and second best reconstruction accuracy (with 10x fewer parameters than ESM3) on diverse datasets. To validate Yeti’s generative capability, we trained a compact multimodal model jointly over its structure tokens and amino acid sequence entirely from scratch, with no pretrained initialization. The resulting multimodal model generates plausible structures under unconditional cogeneration of protein sequence and structures, achieving comparable results to 10x larger models. Together, these results demonstrate that Yeti is a compact and expressive protein structure tokenizer suitable for training multimodal models that cogenerates highly plausible sequences and structures.
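The abstract names lookup-free quantization but not its exact form in Yeti; the snippet below shows the generic LFQ step (per-channel sign binarization with a straight-through estimator), which conveys why no embedding table is needed. Yeti's actual codebook and training losses may differ.

```python
import torch

def lookup_free_quantize(z: torch.Tensor) -> torch.Tensor:
    """Generic lookup-free quantization: each latent channel is binarized to +/-1,
    so the implicit codebook is the set of sign patterns (no lookup table).
    The straight-through trick passes gradients around the discretization."""
    q = torch.where(z >= 0, torch.ones_like(z), -torch.ones_like(z))
    return z + (q - z).detach()

def to_token_ids(z: torch.Tensor) -> torch.Tensor:
    """Read the discrete token id directly off the sign pattern."""
    bits = (z >= 0).long()                                       # (..., d) binary code
    weights = 2 ** torch.arange(z.shape[-1], device=z.device)    # positional weights
    return (bits * weights).sum(dim=-1)
```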

[AI-438] Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

【速读】:该论文针对Metric-induced discrete flow matching (MI-DFM) 在实际应用中面临的两个核心问题展开研究:一是启发式调度器需要超参数搜索,二是其一阶连续时间马尔可夫链(CTMC)求解器导致有限步路径追踪误差。解决方案的关键在于:首先,推导出适用于预设标量参数化概率路径的动能最优调度器,并将其具体应用于MI-DFM,形成一种无需训练的数值调度策略,能够在Fisher-Rao几何空间中以恒定速度遍历路径;其次,引入有限步矩校正机制,在保持CTMC跳跃目标分布不变的前提下调整跳跃概率,从而有效降低路径追踪误差。这一系列改进最终促成GibbsTTS方法的提出,在基于编解码器的零样本文本到语音(TTS)任务中展现出优异的自然度和说话人相似性表现。

链接: https://arxiv.org/abs/2605.09386
作者: Dong Yang,Yiyi Cai,Haoyu Zhang,Yuki Saito,Hiroshi Saruwatari
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under Review

点击查看摘要

Abstract:Metric-induced discrete flow matching (MI-DFM) exploits token-latent geometry for discrete generation, but its practical use is limited by two issues: heuristic schedulers requiring hyperparameter search, and finite-step path-tracking error from its first-order continuous-time Markov chain (CTMC) solver. We address both issues. First, we derive a kinetic-optimal scheduler for prescribed scalar-parameterized probability paths, and instantiate it for MI-DFM as a training-free numerical schedule that traverses the path at constant Fisher-Rao speed. Second, we introduce a finite-step moment correction that adjusts the jump probability while preserving the CTMC jump destination distribution. We validate the resulting method, GibbsTTS, on codec-based zero-shot text-to-speech (TTS). Under controlled comparisons with a unified architecture and large-scale dataset, GibbsTTS achieves the best objective naturalness and is preferred in subjective evaluations over masked discrete generative baselines. Additionally, in comparison with the evaluated state-of-the-art TTS systems, GibbsTTS shows strong speaker similarity, achieving the highest similarity on three of four test sets and ranking second on the fourth. Project page: this https URL

[AI-439] Neural Information Causality

【速读】:该论文旨在解决表示学习中如何量化和诊断信息瓶颈(information bottleneck)与查询分离(query-separated computation)之间的因果关系问题,特别是避免将容量限制视为事后定义的参数,而是将其作为可操作的诊断工具。其核心解决方案是引入神经信息因果性(Neural Information Causality, Neural-IC)框架,通过嵌入信息因果性(Information Causality, IC)理论,明确区分两类逻辑上独立的陈述:第一,任何查询分离架构均诱导出随机访问通信实验(random-access communication, RAC),并满足嵌入不等式 $ I_\mathrm{N}\text{-RAC} \le I(\vec a; H, B) $;第二,对界面的独立物理容量约束(如 m-bit 字母表、有限精度寄存器或功率受限噪声信道)直接导致 $ I_\mathrm{N}\text{-RAC} \le C_H $。这一分离机制使得 Neural-IC 能够有效诊断查询泄露(query leakage)、精度泄露(precision leakage)以及特定任务的记忆能力(episode-specific memory),从而为深度神经网络中的信息流提供可解释且可验证的度量标准。

链接: https://arxiv.org/abs/2605.09316
作者: Jeongho Bang,Marcin Pawłowski
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: 32 pages, 15 figures (including Appendix)

点击查看摘要

Abstract:Query-separated computation forces a representation to play an operational role: data are encoded before a query is known, and a later decoder can answer only through the intermediate interface. In this regime the representation functions as a message rather than merely as a feature map. We formalize this observation by embedding information causality (IC) into representation learning, obtaining a framework called neural information causality (Neural-IC). The revised formulation separates two logically distinct statements. First, every query-separated architecture induces a random-access communication experiment and obeys the embedding inequality I_{\mathrm{N}\text{-RAC}} \le I(\vec a : H, B) . Second, any independently certified physical capacity bound on the interface, such as a hard m -bit alphabet, a finite-precision register, or a power-constrained noisy channel, implies I_{\mathrm{N}\text{-RAC}} \le C_H . This separation avoids treating capacity as a post hoc definition and makes Neural-IC an operational diagnostic for query leakage, precision leakage, and episode-specific memory. We also provide an exact one-bit classical RAC benchmark, showing explicitly that the relevant quantum enhancement is not total information beyond the bottleneck, but fair query-conditioned access. For CHSH-type correlation layers, nested Neural-RAC protocols multiply correlation biases across depth; requiring stability of a one-bit bottleneck for arbitrary depth selects the Tsirelson threshold. We extend the analysis to asymmetric seed biases, to multi-capacity finite-depth phase diagrams, and to correlated data via a conditional information score. Controlled simulations, including straight-through binary bottlenecks and deliberately leaky ablations, verify that apparent violations are accounted for by broken query separation or undercounted capacity.

[AI-440] Select-then-differentiate: Solving Bilevel Optimization with Manifold Lower-level Solution Sets

【速读】:该论文旨在解决乐观双层优化(optimistic bilevel optimization)中下层问题存在非孤立最小值流形(non-isolated manifold of minimizers)时的可微性与优化难题。传统方法假设下层最优解唯一,但现实中常出现多解情形,导致超目标函数(hyper-objective)不可微,从而阻碍梯度计算与优化收敛。论文的关键突破在于:在局部Polyak–Łojasiewicz (PŁ) 条件下,证明了仅需“乐观选择”(optimistic selection)唯一即可保证超梯度的存在性,并提出基于伪逆(pseudoinverse)的显式超梯度公式,扩展了经典单点最小值结果。进一步地,论文揭示了超目标函数的正则性条件——所选最小值点沿流形非退化(non-degenerate)时局部光滑,否则可能导致不可微点或破坏超梯度的Hölder正则性。基于此理论,作者设计HG-MS算法,采用“先选择后求导”策略,结合高效伪逆超梯度计算,在下层解流形的内在维度上实现收敛,而非依赖于环境空间维度,显著提升效率。实验证明其在LLM源重加权任务中优于现有方法。

链接: https://arxiv.org/abs/2605.09209
作者: Saeed Masiha,Zebang Shen,Negar Kiyavash,Niao He
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We study optimistic bilevel optimization when the lower-level problem has a non-isolated manifold of minimizers. In this setting, the hyper-objective may be non-differentiable because the upper-level criterion must choose among multiple lower-level solutions. Under a local Polyak–Łojasiewicz (PŁ) condition, we show that differentiability does not require the lower-level solution set to be a singleton: uniqueness of the optimistic selection is sufficient. This yields an explicit pseudoinverse-based hyper-gradient formula extending the classical singleton-minimizer result. We further characterize the regularity of the hyper-objective: non-degeneracy of the selected minimizer along the solution manifold yields local smoothness, while failure of uniqueness can create many non-differentiable points and failure of non-degeneracy can destroy all positive Hölder regularity of the hyper-gradient. Motivated by this theory, we propose HG-MS, a select-then-differentiate method combining explicit optimistic selection with efficient pseudoinverse-based hyper-gradient computation. Despite the nonconvex nature of optimistic selection over the lower-level solution manifold, we show that HG-MS converges to a stationary point of the optimistic objective with complexity governed by the intrinsic dimension of the solution manifold rather than its ambient dimension. Empirically, we test a practical variant of HG-MS for matched-budget LLM source reweighting. This variant preserves the select-then-differentiate principle and obtains the best GSM8K/MATH scores across the tested backbones, along with competitive or best MT-Bench instruction-following results.
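To make the "pseudoinverse-based hyper-gradient" concrete, the block below sketches the classical implicit-function formula with the lower-level Hessian inverse replaced by a Moore-Penrose pseudoinverse; the paper's exact statement under the PŁ condition and optimistic selection may include additional terms.

```latex
% Bilevel problem:  min_x F(x) := f(x, y^\star(x)),  y^\star(x) \in \arg\min_y g(x, y).
% The singleton-minimizer case uses (\nabla^2_{yy} g)^{-1}; on a solution manifold the
% inverse is replaced by the pseudoinverse (sketch, not the paper's exact theorem):
\nabla F(x) \;=\; \nabla_x f(x, y^\star)
  \;-\; \nabla^2_{xy} g(x, y^\star)\,\bigl(\nabla^2_{yy} g(x, y^\star)\bigr)^{+}\,\nabla_y f(x, y^\star)
```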

[AI-441] Core-Halo Decomposition: Decentralizing Large-Scale Fixed-Point Problems

【速读】:该论文旨在解决大规模固定点方程 x=Fˉ(x)x^\star = \bar{F}(x^\star) 在分布式多智能体系统中求解时因严格分解(strict decomposition)导致的结构性偏差问题。严格分解将变量划分为互不重叠的块,每个智能体仅基于自身拥有的坐标进行更新,但多数算子 Fˉ\bar{F} 的块更新依赖于其他块变量,强行截断这些依赖会改变原问题的均值算子,引入无法通过增加样本数、减小步长或额外一致性机制消除的结构偏差。解决方案的关键在于提出Core-Halo分解(Core-Halo decomposition),其核心思想是将写入所有权与读取上下文分离:每个智能体在其“核心”(core)上执行更新,同时从与其重叠的“光环”(halo)区域读取数据,从而保留原算子 Fˉ\bar{F} 的块依赖结构,使固定点问题在去中心化系统中得以忠实实现。理论分析进一步通过贝尔曼闭包条件(Bellman closure condition)和逐块偏差下界刻画了严格分解的根本限制,证明局部更新可能改变原始固定点算子。实验表明,Core-Halo可在保持并行性优势的同时逼近集中式性能。

链接: https://arxiv.org/abs/2605.08681
作者: Haixiang,Yang Xu,Jiefu Zhang,Xudong Wu,Zihan Zhou,Jun He,Jiayu Chen
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA)
备注:

点击查看摘要

Abstract:We study solving large-scale fixed-point equation (x^\star=\bar F(x^\star)) with decomposition. Standard strict decomposition assigns each agent a disjoint block and evaluates updates using only owned coordinates. For most operators, however, a block update may depend on variables outside the block. Truncating these dependencies by strict decomposition changes the mean operator and creates structural bias that cannot be removed by more samples, smaller stepsizes, or additional consensus. We therefore propose Core-Halo decomposition, which separates write ownership from read-only evaluation context: each agent updates its own core and reads from an overlapping halo. By aligning the Core-Halo decomposition with the block-dependence structure of \bar F , the original fixed-point problem can be implemented faithfully in a decentralized multi-agent system. We further characterize the fundamental obstruction faced by strict decomposition through a Bellman closure condition and a blockwise bias lower bound, showing that local-only updates can alter the original fixed-point operator. Finally, we conduct extensive experiments across a range of application settings, and demonstrate that Core-Halo achieves near-centralized performance while retaining the parallelism benefits of decentralization.
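A minimal sketch of the read/write separation described above, assuming each agent's block update, core index set, and halo index set are given; all names here are placeholders rather than the paper's implementation.

```python
import numpy as np

def core_halo_sweep(x: np.ndarray, block_updates, cores, halos, iters: int = 100) -> np.ndarray:
    """Synchronous Core-Halo fixed-point iteration (illustrative).
    Each agent writes only its core coordinates, but evaluates its block of F
    on core + halo coordinates, preserving the operator's block dependencies."""
    for _ in range(iters):
        x_next = x.copy()
        for F_i, core, halo in zip(block_updates, cores, halos):
            context = x[np.concatenate([core, halo])]   # read-only halo context
            x_next[core] = F_i(context)                 # write owned coordinates only
        x = x_next
    return x
```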

[AI-442] Optimal FALQON for Quantum Approximate Optimization via Layer-wise Parameter Tuning

【速读】:该论文旨在解决反馈式自适应量子优化(Feedback-based Adaptive Quantum Optimization, FALQON)在噪声中等规模量子(Noisy Intermediate-Scale Quantum, NISQ)设备上因固定超参数导致收敛速度慢的问题,其典型表现是需要数百至数千层才能获得可接受的解。解决方案的关键在于提出Optimal FALQON,将每层的时间步长(δk\delta_k)和缩放因子(MkM_k)作为决策变量,通过经典优化方法进行联合优化,从而显著提升算法的成功概率、评估效率及深度归一化成本表现。

链接: https://arxiv.org/abs/2605.08332
作者: Michael Mancini,Shabnam Sodagari
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Feedback-based adaptive quantum optimization (FALQON) is a promising approach for solving combinatorial problems on noisy intermediate-scale quantum (NISQ) devices, requiring only single circuit evaluations per layer. However, standard FALQON relies on fixed hyperparameters that severely limit convergence speed, requiring hundreds to thousands of layers for acceptable solutions. This paper proposes Optimal FALQON, an optimization-based formulation that treats the per-layer time step ( \delta_k ) and scaling factor ( M_k ) as decision variables optimized via classical methods. We present a comprehensive empirical study on all 94 non-isomorphic 3-regular graphs with 12 vertices, comparing Optimal FALQON with standard FALQON and multiple QAOA variants. Results demonstrate statistically significant improvements in success probability, evaluation efficiency, and depth-normalized cost across the evaluated benchmarks. Furthermore, initializing QAOA with parameters from Optimal FALQON yields superior warm-start performance compared to fixed initialization.

[AI-443] CAMAL: Improving Attention Alignment and Faithfulness with Segmentation Masks

【速读】:该论文旨在解决视觉模型中注意力机制的对齐性(attention alignment)与忠实性(attention faithfulness)不足的问题,即模型注意力区域与真实判别区域不一致,且注意力对决策的影响缺乏因果意义。解决方案的关键在于提出类激活图注意力学习(Class Activation Map Attention Learning, CAMAL),该方法利用图像对应的分割掩码(segmentation masks)作为监督信号,在训练过程中将模型注意力与真实判别区域进行比对,并通过辅助正则化项引导注意力聚焦于正确区域、抑制无关区域,从而提升注意力的空间准确性与因果有效性。

链接: https://arxiv.org/abs/2605.08325
作者: Rajdeep Singh Hundal,Yan Xiao,Jin Song Dong,Manuel Rigger
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Many vision datasets now provide segmentation masks in addition to annotated images to support a wide range of tasks. In this work, we propose Class Activation Map Attention Learning (CAMAL), an efficient and scalable method that utilizes segmentation masks to improve attention alignment and faithfulness in vision models. Specifically, attention alignment refers to the degree to which a model’s attention aligns with ground-truth discriminative regions, while attention faithfulness refers to the degree to which a model’s attention influences its decision. Improving both attention alignment and faithfulness is essential for ensuring that model attention is both spatially accurate and causally meaningful. To improve attention alignment and faithfulness in vision models, CAMAL first extracts the model’s attention for each image during training and then compares the attention to ground-truth discriminative regions obtained from the corresponding segmentation masks. CAMAL then acts as an auxiliary regularizer, encouraging attention that aligns with ground-truth discriminative regions, while suppressing attention elsewhere. We evaluated CAMAL across two learning paradigms – Deep Learning (DL) and Deep Reinforcement Learning (DRL) – and observed consistent, significant improvements in both attention alignment and faithfulness. In particular, CAMAL yields statistically significant gains in attention alignment across all settings, and improves attention faithfulness by over 35% compared to recent work. Moreover, we show that improved attention alignment and faithfulness enhance explainability, while yielding improved or comparable generalization performance without increasing inference cost. These findings demonstrate that the spatial information contained within segmentation masks can be effectively leveraged to guide model attention across learning tasks.
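The abstract specifies the effect of the regularizer (encourage attention inside the ground-truth regions, suppress it elsewhere) but not its exact form; the term below is therefore only an assumed instantiation.

```python
import torch

def attention_alignment_loss(attn: torch.Tensor, mask: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """Assumed CAMAL-style auxiliary term (not the paper's exact loss).
    attn: (B, H, W) class activation map; mask: (B, H, W) binary ground-truth regions."""
    attn = attn / (attn.flatten(1).sum(dim=1, keepdim=True).unsqueeze(-1) + 1e-8)   # per-image normalization
    inside = (attn * mask).flatten(1).sum(dim=1)            # attention on discriminative regions
    outside = (attn * (1 - mask)).flatten(1).sum(dim=1)     # attention elsewhere
    return lam * (outside - inside).mean()                  # added to the task loss during training
```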

[AI-444] FQPDR: Federated Quantum Neural Network for Privacy-preserving Early Detection of Diabetic Retinopathy

【速读】:该论文旨在解决糖尿病视网膜病变(Diabetic Retinopathy, DR)早期检测中因微动脉瘤点(microaneurysm dots)尺寸小、对比度低而导致的识别困难问题,同时兼顾医疗图像处理中的数据隐私保护需求。其解决方案的关键在于提出一种基于联邦学习(Federated Learning, FL)的量子神经网络(Federated Quantum Neural Network, Federated QNN)框架——FQPDR,该框架通过仅共享模型参数而非原始患者数据,在保障隐私的前提下实现轻量化模型训练,并在E-ophtha和Retina MNIST等有限样本数据集上验证了其对Kaggle图像中DR早期病变的鲁棒检测性能。

链接: https://arxiv.org/abs/2605.08324
作者: Debashis De,Mahua Nandy Pal,Dipankar Hazra
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Diabetic Retinopathy (DR) is a common complication of diabetes that can lead to blindness. Detecting DR at the earliest stage is essential to prevent irreversible eye damage. Microaneurysm dots are the first signs of DR. As the dots are tiny and of low contrast, detecting mild DR is a very challenging task. Federated learning (FL) preserves data privacy, which is a major concern for medical image processing. FL is a collaborative learning method that shares only the model parameters with a server, without sending patient data to a central server. Inspired by classical FL, we propose a federated learning-based quantum neural network (federated QNN) for this task. We implemented the models with limited samples and few learnable parameters from the E-ophtha and Retina MNIST datasets. The cross-evaluation efficiency of the proposed federated quantum neural network system for privacy-preserving early detection of diabetic retinopathy (FQPDR) on Kaggle dataset images indicates the robustness of the lightweight learning models. FQPDR's performance is encouraging when compared with existing non-FL and FL methods.

[AI-445] SLayerGen: a Crystal Generative Model for all Space and Layer Groups

【速读】:该论文旨在解决现有晶体生成模型在处理二维(2D)或薄膜等非周期性材料时的局限性,这些材料属于双周期(diperiodic)系统,其对称性由层群(layer group)描述,而传统生成模型仅考虑三维周期性结构(即空间群)。关键解决方案是提出SLayerGen,一种能够生成满足任意空间群或层群对称性的晶体结构的生成模型:其核心包括从粗到细的离散自回归晶格生成、基于Transformer的Wyckoff位置、元素及对称不等价原子数的自回归采样,以及原子坐标的空间/层群等变扩散过程;其中特别修正了先前工作中因六方晶系在分数坐标下非正交导致的损失函数不一致问题,并构建了针对层群对称性的新型表示方法与评估指标,从而实现了对双周期材料的高效、准确生成。

链接: https://arxiv.org/abs/2605.08262
作者: Rees Chang,Andrew Novick,Ryan P Adams,Elif Ertekin
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Crystal generative models have shown rapid progress for accelerating the discovery of bulk, periodic materials. However, many material systems such as 2D superconductors, thin film semiconductors, and catalytic surfaces are diperiodic, i.e., aperiodic along one of the lattice directions. These systems are invariant under the layer groups, which are known to influence materials properties yet not considered by existing models. In this paper, we propose SLayerGen, a generative model that produces crystals constrained to be invariant to any space or layer group. SLayerGen consists of coarse-to-fine discrete autoregressive lattice generation; transformer-based autoregressive sampling of Wyckoff positions, elements, and numbers of symmetrically unique atoms; and space or layer group equivariant diffusion of atomic coordinates. For the diffusion component, we corrected an inconsistency in the loss from prior work arising from hexagonal groups being non-orthogonal in fractional coordinates. To facilitate progress in generative modeling of diperiodic materials, we assembled and filtered datasets of monolayers and bilayers, propose relevant evaluation metrics, and developed novel representations for layer group symmetries. For de novo generation of diperiodic materials, SLayerGen achieves consistent performance gains over bulk crystal generative models and is competitive when training jointly on bulk and diperiodic materials.

[AI-446] An Explainable Unsupervised-to-Supervised Machine Learning Framework for Dietary Pattern Discovery Using UK National Dietary Survey Data

【速读】:该论文旨在解决临床膳食评估中生成的高维营养素和食物组信息难以快速转化为营养咨询优先级的问题。其核心解决方案是提出了一种可解释的从无监督到有监督的机器学习框架,利用英国国家饮食与营养调查(NDNS)公开数据,通过K-means聚类等方法识别出四种具有饮食学意义的膳食模式,并构建一个高精度的监督代理分类器(macro-F1 = 0.963)来重现聚类结果。关键在于结合聚类稳定性、营养师可解释性与SHAP分析,将模型预测映射到饮食学驱动因素,从而支持营养师参与的膳食评估、优先级排序与随访监测。

链接: https://arxiv.org/abs/2605.08242
作者: Wing Yi Yu,Chun Yin Chiu
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 6 figures, 9 tables. Accepted by the 14th International Conference on Frontiers of Intelligent Computing: Theory and Applications (FICTA 2026)

点击查看摘要

Abstract:Clinical dietary assessment can generate detailed but high-dimensional nutrient and food-group information that is difficult to translate quickly into counselling priorities. This paper proposes an explainable unsupervised-to-supervised machine learning framework for discovering, reproducing and interpreting dietary patterns using public UK National Diet and Nutrition Survey data. Adult participants aged 19 years and above from NDNS Years 12-15 were represented using 25 energy-adjusted nutrient and food-group features. K-means, Gaussian Mixture Models and Agglomerative Clustering were compared across k = 2-8, with stability and dietetic interpretability used alongside internal validation metrics. The selected K-means k = 4 solution identified four interpretable dietary patterns: high fat/meat and sodium, higher fibre fruit-vegetable micronutrient, high free-sugar snacks and sugary drinks, and dairy/cereal calcium-rich saturated-fat. A supervised surrogate classifier reproduced held-out cluster membership with high test performance (macro-F1 = 0.963), but was interpreted only as an explanatory surrogate rather than as an independent clinical prediction model. SHAP analysis linked predictions to dietetically meaningful drivers, suggesting potential value for dietitian-in-the-loop assessment, counselling prioritisation and follow-up monitoring.

[AI-447] Rethinking Entropy Minimization in Test-Time Adaptation for Autoregressive Models INTERSPEECH2026

【速读】:该论文旨在解决生成式自回归模型在测试时适应(Test-Time Adaptation, TTA)过程中缺乏统一理论基础的问题。现有方法多依赖启发式策略,如使用伪标签的教师强制(teacher forcing)或基于策略梯度的强化学习,但这些方法缺乏数学上的严谨性和一致性。论文的关键解决方案是推导出一个专为自回归模型设计的熵最小化(Entropy Minimization, EM)严格公式,证明其目标函数可精确分解为逐标记(token-level)策略梯度损失与逐标记熵损失之和,并将先前方法重新诠释为该统一框架的部分实现。这一理论基础使得TTA在多个领域(如噪声、口音和多语言场景)中表现出稳定且显著的性能提升。

链接: https://arxiv.org/abs/2605.08186
作者: Wei-Ping Huang,Chee-En Yu,Guan-Ting Lin,Hung-yi Lee
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submitted to INTERSPEECH 2026

点击查看摘要

Abstract:Test-Time Adaptation (TTA) via entropy minimization (EM) has proven effective for classification tasks, yet its application to generative autoregressive models remains theoretically fragmented. Existing approaches typically rely on distinct heuristics, such as teacher forcing with pseudo labels or policy-gradient-based reinforcement learning, without a unified mathematical foundation. In this work, we resolve this discrepancy by deriving a rigorous formulation of EM tailored to autoregressive models. We show that the exact objective naturally decomposes into a token-level policy gradient loss and a token-level entropy loss, and we reinterpret prior methods as partial realizations of this unified formulation. Using Whisper ASR as a testbed, we demonstrate that our approach consistently improves performance across more than 20 diverse domains, including acoustic noise, accents, and multilingual settings.
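The full objective derived in the paper combines a token-level policy-gradient term with a token-level entropy term; the sketch below shows only the second, simpler term as it would be computed from decoder logits during test-time adaptation.

```python
import torch

def token_level_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Mean per-token entropy of the decoder's predictive distribution.
    logits: (T, V) scores for the T decoded tokens over a vocabulary of size V."""
    log_p = torch.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1).mean()

# a TTA step would decode once, recompute logits with gradients enabled,
# minimize this entropy (plus the policy-gradient term), and update the model
```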

[AI-448] Improving TMS EEG Signal Quality for Closed-Loop Neuro Stimulation via Source-Domain Denoising

【速读】:该论文旨在解决经颅磁刺激-脑电图(TMS-EEG)信号中伪迹干扰严重、缺乏标准化预处理流程及评估基准的问题,从而影响数据质量和后续分析的可靠性。其解决方案的关键在于构建了一个经过严格预处理的参考数据集,并提出了一套验证过的TMS-EEG去伪迹处理流程,系统评估了两种主流基于源的伪迹去除方法对TMS诱发电位(TMS-evoked potentials, TEPs)的保留效果与信号质量提升能力,为未来自动化伪迹去除算法的发展提供了可比较的基准和可靠的预处理框架。

链接: https://arxiv.org/abs/2605.08184
作者: Zhen Tang,Ameer Hamoodi,Stevie Foglia,Aimee Nelson,Zhen Gao
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This research presents a validated TMS EEG cleaning pipeline and a corresponding benchmark dataset, and evaluates two widely used artifact removal pipelines. A reference dataset of carefully preprocessed EEG signals was established to support future algorithm development and enable systematic comparison of automated artifact removal strategies, despite the absence of a true physiological ground truth. The study evaluates the effectiveness of two widely used source-based artifact removal approaches and examines their impact on signal quality improvement and preservation of TMS-evoked potentials. The results support the robustness of the proposed preprocessing workflow and demonstrate its potential for improving data reliability in both research and clinical applications. A key goal is to integrate TMS EEG within a larger BCI framework. Ultimately, these efforts aim to enhance understanding of cortical dynamics and expand the clinical and research applications of TMS EEG.

[AI-449] Forecasting Source Stability in Scientific Experiments using Temporal Learning Models: A Case Study from Tritium Monitoring

【速读】:该论文旨在解决卡尔斯鲁厄氚中微子实验(KATRIN)中对无窗气态氚源稳定性实时预测的难题,该源是测量中微子绝对质量的关键组件,其活动性波动会影响实验精度。传统漂移检测方法难以应对氚源中不频繁且短暂的不稳定事件。解决方案的关键在于引入深度学习时间序列预测模型(如LSTM、N-BEATS、TFT等),以从复杂的大规模实验数据中学习并预测不稳定事件后的恢复稳定时间。研究发现,准确预测数百个未来时间点的稳定性变化具有重要实验价值,可优化测量调度与维护规划;其中N-BEATS模型在准确性与可重复性方面表现最优,验证了深度学习在提升大型物理实验效率中的潜力。

链接: https://arxiv.org/abs/2605.08140
作者: Nicholas Tan Jerome,Nadia Aouadi,Christoph Koehler,Suren Chilingaryan,Andreas Kopmann
机构: 未知
类目: Instrumentation and Detectors (physics.ins-det); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The Karlsruhe Tritium Neutrino Experiment (KATRIN) aims to measure the absolute neutrino mass with unprecedented sensitivity, requiring precise monitoring of the windowless gaseous tritium source, where tritium beta decay occurs. To track variations of the source activity, beta-induced X-ray spectroscopy provides real-time diagnostics. However, traditional drift detection methods struggle with the infrequent and transient nature of instability events in gaseous tritium. This study bridges the gap between state-of-the-art time-series forecasting models and real-world experimental applications by leveraging deep learning to predict the time to stability after instabilities. Unlike standard benchmarking approaches that emphasize algorithmic performance on fixed datasets, we apply forecasting models – including LSTM, N-BEATS, TFT, NHITS, DLinear, NLinear, TSMixer, and Chronos-LLM – to complex, large-scale experimental data. Our findings highlight two challenges: learning from sparse instability events and forecasting long time horizons (i.e., predicting hundreds of future points), both of which are ongoing challenges in time-series forecasting and remain active areas of research. This prediction task has direct experimental value by enabling better scheduling and maintenance planning. A reliable forecast of stability time allows for more efficient measurement and task management during stabilization periods. Through model selection, we identified N-BEATS as the top performer, excelling in accuracy and repeatability, demonstrating that deep learning can optimize large-scale physics experiments.

机器学习

[LG-0] Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with k-step Policy Gradients

链接: https://arxiv.org/abs/2605.10909
作者: Alex DeWeese,Guannan Qu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This work revisits standard policy gradient methods used on restricted policy classes, which are known to get stuck in suboptimal critical points. We identify an important cause for this phenomenon to be that the policy gradient is itself fundamentally myopic, i.e. it only improves the policy based on the one-step Q -function. In this work, we propose a generalized k -step policy gradient method that couples the randomness within a k -step time window and can escape the myopic local optima in MDPs with restricted policy classes. We show this new method is theoretically guaranteed to converge to a solution that is exponentially close in performance to the optimal deterministic policy with respect to k . Further, we show projected gradient descent and mirror descent with this k -step policy gradient can achieve this exponential guarantee in O(\frac{1}{T}) iterations, despite only assuming smoothness and differentiability of the value function. This will provide near optimal solutions to previously elusive applications like state aggregation and partially observable cooperative multi-agent settings. Moreover, our bounds avoid the ubiquitous distribution mismatch factors \|d_\mu^{\pi^*} / d_\mu^{\pi}\|_\infty and \|d_\mu^{\pi^*} / \mu\|_\infty enabling the k -step policy gradient method to escape suboptimal critical points that emerge from poor exploration in fully observable settings.

[LG-1] Beyond Red-Teaming: Formal Guarantees of LLM Guardrail Classifiers

链接: https://arxiv.org/abs/2605.10901
作者: Nikita Kezins,Urbas Ekka,Pascal Berrang,Luca Arnaboldi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Guardrail Classifiers defend production language models against harmful behavior, but although results seem promising in testing, they provide no formal guarantees. Providing formal guarantees for such models is hard because “harmful behavior” has no natural specification in a discrete input space, and the standard epsilon-ball properties used in other domains do not carry semantic meaning. We close this gap by shifting verification from the discrete input space to the classifier’s pre-activation space, where we define a harmful region as a convex shape enclosing the representations of known harmful prompts. Because the sigmoid classification head is monotonic, certifying the worst-case point is sufficient to certify the entire region, yielding a closed-form soundness proof without approximation in O(d) time. To formally evaluate these classifiers, we propose two constructions of such regions: SVD-aligned hyper-rectangles, which yield exact SAT/UNSAT certificates, and Gaussian Mixture Models, which yield probabilistic certificates over semantically coherent clusters. Applying this framework to three author-trained Guardrail Classifiers on the toxicity domain, every hyper-rectangle configuration returns SAT, exposing verifiable safety holes across all classifiers, despite seemingly high empirical metrics. Probabilistic GMM certificates also expose a divergent structural stability in how these models represent harm. While GPT-2 and Llama-3.1-8B maintain robust coverage of 90% and 80% across varying boundaries, BERT’s safety guarantees prove uniquely volatile. This ‘coverage collapse’ to 55% at the optimal threshold reveals a sparsely populated safety margin in BERT, which only achieves full coverage by adopting an extremely conservative pessimistic threshold. Combined, these approaches provide new insights into how effective Guardrail Classifiers really are, beyond traditional red-teaming.
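
Below is a minimal numpy sketch of the certification idea this abstract describes, under illustrative assumptions: harmful prompts are represented by synthetic pre-activation vectors, an SVD-aligned hyper-rectangle encloses them, and, because the sigmoid head is monotone in the logit, checking the worst-case corner of the box certifies the whole region. The head weights, data, and SAT wording here are placeholders, not the paper's trained classifiers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: pre-activations of known harmful prompts and a linear sigmoid head.
H = rng.normal(size=(200, 16)) + 2.0     # harmful-prompt pre-activations (n, d)
w = rng.normal(size=16)                  # head weights (sigmoid(w @ h + b) = P(harmful))
b = -1.0

# 1) SVD-aligned hyper-rectangle enclosing the harmful representations.
mu = H.mean(axis=0)
_, _, Vt = np.linalg.svd(H - mu, full_matrices=False)    # rows of Vt span the SVD axes
coords = (H - mu) @ Vt.T                                  # coordinates in that basis
lo, hi = coords.min(axis=0), coords.max(axis=0)           # per-axis extent of the box

# 2) Closed-form worst case: the logit is linear along each SVD axis (slope w @ v_k),
#    so its minimum over the box is attained at a corner; sigmoid is monotone,
#    so certifying that corner certifies the whole region, in O(d) time.
slopes = Vt @ w
worst_logit = w @ mu + b + slopes @ np.where(slopes >= 0, lo, hi)

# SAT here = some point of the "harmful" region is scored as safe (a verifiable safety hole).
print("worst-case logit:", round(float(worst_logit), 3))
print("certificate:", "SAT (safety hole)" if worst_logit < 0 else "UNSAT (region certified)")
```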

[LG-2] V4FinBench: Benchmarking Tabular Foundation Models, LLMs, and Standard Methods on Corporate Bankruptcy Prediction

链接: https://arxiv.org/abs/2605.10896
作者: Marcin Kostrzewa,Sebastian Tomczak,Roman Furman,Anna Poberezhna,Michał Furgała,Oleksii Furman,Maciej Zięba
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Corporate bankruptcy prediction is a high-stakes financial task characterized by severe class imbalance and multi-horizon forecasting demands. Public datasets supporting it remain scarce and small: widely used free benchmarks contain between 6,000 and 80,000 company-year observations, while larger resources are behind subscription paywalls. To address this gap, we introduce V4FinBench, a benchmark of over one million company-year records from the Visegrád Group (V4) economies (2006-2021), with 131 financial and non-financial features, six prediction horizons, and a composite distress criterion jointly capturing solvency, profitability, and liquidity deterioration. V4FinBench is designed to support the evaluation of tabular and foundation-model methods under realistic class imbalance, with positive rates between 0.19% and 0.36%. We provide reference evaluations of standard tabular baselines, finetuned TabPFN, and QLoRA-finetuned Llama-3-8B. With imbalance-aware finetuning, TabPFN matches or exceeds gradient boosting at longer time horizons on both F_1 -score and ROC-AUC. In contrast, Llama-3-8B trails gradient boosting on ROC-AUC at every horizon and is generally weaker on F_1 -score, with the gap widening sharply beyond the immediate horizon. In an external evaluation on the American Bankruptcy Dataset, the V4FinBench-finetuned TabPFN checkpoint improves over vanilla TabPFN, suggesting that adaptation captures transferable financial-distress structure rather than only V4-specific patterns. V4FinBench is publicly released to support further evaluation and development of prediction methods on realistic financial data.

[LG-3] Neural Weight Norm = Kolmogorov Complexity

链接: https://arxiv.org/abs/2605.10878
作者: Tiberiu Musat
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:Why does weight decay work? We prove that, in any fixed-precision regime, the smallest weight norm of a looped neural network outputting a binary string equals the Kolmogorov complexity of that string, up to a logarithmic factor. This implies that weight decay induces a prior matching Solomonoff’s universal prior, the optimal prior over computable functions, up to a polynomial factor. The result is norm-agnostic: in fixed precision, every weight norm collapses to the non-zero parameter count up to constants, so the same sandwich bound holds for any norm used as a regulariser. The proof has two short reductions: any program for a universal Turing machine can be encoded into neural weights at unit cost per program bit, and any fixed-precision network can be described by enumerating its non-zero parameters with logarithmic addressing overhead. Both bounds are tight up to constants, with the logarithmic factor realised by permutation encodings: a network whose parameters encode a permutation produces a string whose Kolmogorov complexity is the non-zero parameter count times its logarithm. The fixed-precision assumption is essential: with infinite precision, neural networks can encode non-computable functions and the weight norm loses its relevance.

[LG-4] Conditional anomaly detection methods for patient-management alert systems ICML-2008 ALT

链接: https://arxiv.org/abs/2605.10847
作者: Michal Valko,Gregory Cooper,Amy Seybert,Shyam Visweswaran,Melissa Saul,Miloš Hauskrecht
类目: Machine Learning (cs.LG)
*备注: Published at Workshop on Machine Learning in Health Care Applications ICML-2008 - MLHealth

点击查看摘要

Abstract:Anomaly detection methods can be very useful in identifying unusual or interesting patterns in data. A recently proposed conditional anomaly detection framework extends anomaly detection to the problem of identifying anomalous patterns on a subset of attributes in the data. The anomaly always depends (is conditioned) on the value of the remaining attributes. The work presented in this paper focuses on instance-based methods for detecting conditional anomalies. The methods rely on the distance metric to identify examples in the dataset that are most critical for detecting the anomaly. We investigate various metrics and metric learning methods to optimize the performance of the instance-based anomaly detection methods. We show the benefits of the instance-based methods on two real-world detection problems: detection of unusual admission decisions for patients with community-acquired pneumonia and detection of unusual orders of an HPF4 test that is used to confirm heparin-induced thrombocytopenia, a life-threatening condition caused by heparin therapy.
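
As a concrete illustration of the instance-based idea above, here is a minimal sketch of a nearest-neighbour conditional anomaly score: a case (x, y) is flagged when its label y is rare among the labels of the most similar reference cases. The Euclidean metric and synthetic data are stand-ins; the paper's point is precisely that the metric can be learned.

```python
import numpy as np

def conditional_anomaly_score(X_ref, y_ref, x, y, k=15):
    """Score how unusual label `y` is for context `x`, relative to reference cases.
    Higher score = more conditionally anomalous (label disagrees with similar cases)."""
    d = np.linalg.norm(X_ref - x, axis=1)     # plain Euclidean metric; could be learned
    nn = np.argsort(d)[:k]                     # k most similar reference cases
    p_y_given_x = (y_ref[nn] == y).mean()      # neighbourhood estimate of P(y | x)
    return 1.0 - p_y_given_x

rng = np.random.default_rng(0)
X_ref = rng.normal(size=(500, 4))
y_ref = (X_ref[:, 0] > 0).astype(int)          # reference decisions follow feature 0

x_new = np.array([1.5, 0.0, 0.0, 0.0])
print(conditional_anomaly_score(X_ref, y_ref, x_new, y=1))  # expected decision: low score
print(conditional_anomaly_score(X_ref, y_ref, x_new, y=0))  # unusual decision: high score
```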

[LG-5] NoRIN: Backbone-Adaptive Reversible Normalization for Time-Series Forecasting

链接: https://arxiv.org/abs/2605.10823
作者: Shun Zhang,Yuyang Xiao
类目: Machine Learning (cs.LG)
*备注: 8 pages, 2 figures

点击查看摘要

Abstract:Reversible instance normalization (RevIN) and its successors (Dish-TS, SAN, FAN) have become the de facto plug-in for time-series forecasting, yet the map they apply to each data point is strictly affine, x \mapsto ax+b , so they cannot reshape the underlying distribution – heavy tails remain heavy and skewness remains uncorrected. We propose NoRIN, a non-linear reversible normalization based on the arcsinh-form Johnson S_U transform with two shape parameters (\delta,\varepsilon) that control tailedness and skewness; the linear Z -score used by RevIN is recovered only in the limit \delta \to \infty . Training (\delta,\varepsilon) jointly with the backbone via gradient descent reliably pushes them toward this linear limit within a few epochs – a phenomenon we name the degeneration problem: the forecasting loss is locally indifferent to shape, and the high-capacity backbone compensates for any monotone reparameterization of its input. NoRIN escapes the degeneration by decoupling shape selection from gradient training: (\delta,\varepsilon) are initialized by a closed-form Slifker-Shapiro quantile fit and refined by Bayesian optimization on the validation objective, while the inner training loop is identical to standard RevIN-style training. Across six representative backbones x five real-world datasets x three prediction horizons (90 configurations), decoupled shape optimization recovers (\delta^\star,\varepsilon^\star) that sit systematically far from the linear limit, with values that vary in a backbone-dependent way. This empirically supports the central thesis: different backbones genuinely require different normalization parameters to reach their best performance.
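
A minimal numpy sketch of an arcsinh-shaped reversible instance normalization of the kind described above (an illustrative parameterisation, not necessarily the paper's exact Johnson S_U form): the forward map is exactly invertible for any shape parameters, and letting \delta grow recovers an ordinary linear Z-score.

```python
import numpy as np

def norin_forward(x, delta, eps):
    """Instance-wise reversible normalization: z-score followed by an arcsinh-shaped
    map with tail parameter `delta` and skew parameter `eps` (illustrative form)."""
    mu, sigma = x.mean(), x.std() + 1e-8
    u = (x - mu) / sigma
    z = delta * (np.arcsinh(u / delta + eps) - np.arcsinh(eps))
    return z, (mu, sigma)

def norin_inverse(z, stats, delta, eps):
    mu, sigma = stats
    u = delta * (np.sinh(z / delta + np.arcsinh(eps)) - eps)
    return u * sigma + mu

rng = np.random.default_rng(0)
x = rng.standard_t(df=3, size=512)             # heavy-tailed series window

z, stats = norin_forward(x, delta=1.5, eps=0.3)
x_rec = norin_inverse(z, stats, delta=1.5, eps=0.3)
print("max reconstruction error:", np.abs(x - x_rec).max())   # ~1e-12: exactly reversible

# A very large delta recovers (a rescaled) linear Z-score, the RevIN-style limit.
z_lin, _ = norin_forward(x, delta=1e6, eps=0.0)
print("corr with plain z-score:", np.corrcoef(z_lin, (x - x.mean()) / x.std())[0, 1])
```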

[LG-6] Benchmarking Sensor-Fault Robustness in Forecasting

链接: https://arxiv.org/abs/2605.10822
作者: Alexander Windmann,Philipp Wittenberg,Gianluca Manca,Marcel Dix,Jens U. Brandt,Oliver Niggemann
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Cyber-physical system (CPS) forecasting models depend on sensor streams with noisy, biased, missing, or temporally misaligned readings, yet standard forecasting evaluation often selects models by nominal error without showing whether they remain robust under such faults. We introduce SensorFault-Bench, a shared CPS-grounded sensor-fault stress-test protocol for evaluating forecasting architectures and robustness-improvement methods, and an operational taxonomy organizing the method comparison. Across four real-world datasets and eight scored scenarios governed by a standardized severity model, it reports worst-scenario degradation, clean mean squared error (MSE), and worst-scenario fault-time MSE, separating relative robustness from absolute error. A disjoint fault-transfer split lets explicit fault-training methods train on adjacent fault families while evaluation uses separate benchmark scenarios. Empirically, forecasting architectures favored by clean MSE can degrade sharply under faults, and clean-MSE rankings can disagree with worst-scenario fault-time error rankings. Chronos-2, the evaluated zero-shot foundation-model representative, matches or trails the last-value naive forecaster in clean MSE on the two single-target datasets and has the largest worst-scenario degradation on ETTh1 and Traffic, where all channels are forecast targets. For the evaluated robustness-improvement method set, paired deltas show selective degradation reductions: projected gradient descent adversarial training and randomized training lead where value faults dominate observed degradation, while fault augmentation leads where availability faults dominate. SensorFault-Bench provides open-source code, documented data access, and reproduction and extension guides, so new datasets, architectures, and robustness-improvement methods can be evaluated under the same CPS sensor-fault robustness protocol.

[LG-7] On periodic distributed representations using Fourier embeddings

链接: https://arxiv.org/abs/2605.10818
作者: Jakeb Chouinard
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Periodic signals are critical for representing physical and perceptual phenomena. Scalar, real angular measures, e.g., radians and degrees, result in difficulty processing and distinguishing nearby angles, especially when their absolute difference exceeds \pi. We can avoid this problem by using real-valued, periodic embeddings in high-dimensional space. These representations also allow us to control the nature of their dot-product similarities, making it possible to construct a variety of different kernel shapes. In this work, we aim to highlight how these representations can be constructed and focus on the formalization of Dirichlet and periodic Gaussian kernels using the neurally plausible representation scheme of Spatial Semantic Pointers.
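
A minimal numpy sketch of the wrap-around fix that periodic high-dimensional embeddings provide: embedding an angle as harmonics [cos(k\theta), sin(k\theta)] makes the dot product a Dirichlet-style function of the angular difference only. This is a generic Fourier-feature illustration, not the Spatial Semantic Pointer construction itself.

```python
import numpy as np

def periodic_embed(theta, K=8):
    """Embed an angle into a 2K-dim vector of harmonics [cos(k*theta), sin(k*theta)]."""
    k = np.arange(1, K + 1)
    return np.concatenate([np.cos(k * theta), np.sin(k * theta)])

def similarity(t1, t2, K=8):
    # cos(k t1)cos(k t2) + sin(k t1)sin(k t2) = cos(k (t1 - t2)), so the dot product is
    # sum_k cos(k * (t1 - t2)): a Dirichlet-style kernel of the angular difference only.
    return periodic_embed(t1, K) @ periodic_embed(t2, K)

# 0.1 and 2*pi - 0.1 differ by ~6.08 as scalars but by only 0.2 rad on the circle;
# the embedding similarity reflects the true circular distance.
print(similarity(0.1, 2 * np.pi - 0.1))   # larger: nearby on the circle
print(similarity(0.1, np.pi + 0.1))       # near zero: opposite side of the circle
```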

[LG-8] Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities

链接: https://arxiv.org/abs/2605.10810
作者: Daniel Ranard
类目: Machine Learning (cs.LG)
*备注: 13 pages + appendices, 4 figures

点击查看摘要

Abstract:We introduce an automatically generated benchmark for predicting hidden text in technical papers. A paper supplies visible context X and a hidden continuation Y ; the evaluated model writes an auxiliary forecast string Z , and a separate scorer assigns next-token probability to Y both with and without conditioning on Z . This gives a label-free test of whether Z transmits information about the continuation, compared against controls where Z is recent context rather than a forecast. Our main testbed is equation-suffix prediction: the predictor sees context and the first part of a displayed equation, then forecasts the rest. The task mixes surface-level arXiv/TeX text modeling with reasoning-sensitive inference; the suffix is one of many roughly equivalent continuations, so the benchmark is read statistically rather than item-by-item. On 1363 equation continuations from 138 recent physics and mathematics papers, forecasts from GPT-5.5, Opus 4.7, and GPT-5.4 nano all improve clipped likelihood over the context control under both Qwen3-8B and Kimi K2.6 scorers, distinguishing model families and reasoning-effort settings without human labels. To emulate shortcuts where Z further primes the scorer rather than making a useful forecast, we also fine-tune the scorer on context-only prompts and apply it to held-out papers as a stronger control. GPT-5.5 forecasts still beat this fine-tuned control; GPT-5.4 nano forecasts do not. Longer prose/TeX continuations show positive but noisier lift over controls, concentrated near the beginning of the target. These results support cross-model likelihood scoring as a static benchmark and as a setup for probing shortcut vulnerabilities before reinforcement learning or model-selection optimization is applied.

[LG-9] Mistake-Bounded Language Generation

链接: https://arxiv.org/abs/2605.10809
作者: Jon Kleinberg,Charlotte Peale,Omer Reingold
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:We investigate the learning task of language generation in the limit, but shift focus from the traditional time-of-last-mistake metric of a generator’s success to a new notion of “mistake-bounded generation.” While existing results for language generation in the limit focus on guaranteeing eventual consistency, they are blind to the cumulative error incurred during the learning process. We address this by shifting the goal to minimizing the total number of invalid elements output by a generation algorithm. We establish a formal reduction to the Learning from Correct Demonstrations framework of Joshi et al. (2025), enabling a general recipe for deriving mistake bounds via weighted update rules. For finite classes, we provide an algorithm that simultaneously achieves an optimal last-mistake time of \mathsf{Cdim}(L) and a mistake bound of \lfloor \log_2 |L| \rfloor , whereas for the non-uniform setting of countably infinite streams of languages, we prove a fundamental trade-off: achieving logarithmic mistakes O(\log i) necessarily precludes convergence guarantees established in prior work. Finally, we show that our framework can be extended to accommodate noisy adversaries and guarantee mistake bounds that scale with the adversary’s suboptimality.

[LG-10] LLMs for Secure Hardware Design and Related Problems: Opportunities and Challenges

链接: https://arxiv.org/abs/2605.10807
作者: Johann Knechtel,Ozgur Sinanoglu,Ramesh Karri
类目: Cryptography and Security (cs.CR); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: Accepted for 2026 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)

点击查看摘要

Abstract:The integration of Large Language Models (LLMs) into Electronic Design Automation (EDA) and hardware security is rapidly reshaping the semiconductor industry. While LLMs offer unprecedented capabilities in generating Register Transfer Level (RTL) code, automating testbenches, and bridging the semantic gap between high-level specifications and silicon, they simultaneously introduce severe vulnerabilities. This comprehensive review provides an in-depth analysis of the state-of-the-art in LLM-driven hardware design, organized around key advancements in EDA synthesis, hardware trust, design for security, and education. We systematically expand on the methodologies of recent breakthroughs – from reasoning-driven synthesis and multi-agent vulnerability extraction to data contamination and adversarial machine learning (ML) evasion. We integrate general discussions on critical countermeasures, such as dynamic benchmarking to combat data memorization and aggressive red-teaming for robust security assessment. Finally, we synthesize cross-cutting lessons learned to guide future research toward secure, trustworthy, and autonomous design ecosystems.

[LG-11] Muown: Row-Norm Control for Muon Optimization

链接: https://arxiv.org/abs/2605.10797
作者: Kai Lion,Florian Hübler,Bingcong Li,Antonio Orvieto,Niao He
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Muon has emerged as a strong competitor to AdamW for language model pre-training, yet its behavior at scale is sensitive to weight decay. Recent work has observed that, for Muon without decoupled weight decay, the spectral norm of weight matrices drifts upward over training. Through a decomposition of the spectral norm into a row-magnitude factor and a row-coherence factor, we identify the former as the empirical driver of this drift under Muon, while the latter remains well-behaved along the trajectory. Motivated by this diagnosis, we introduce Muown, a drop-in replacement for Muon that treats the row-magnitude vector as an explicit optimizer variable, updating it under the \ell_\infty geometry induced by the decomposition, while applying Muon unchanged to the remaining direction component. We prove that Muown attains the optimal non-convex rates in both deterministic and stochastic regimes under a dual norm aligned with the underlying geometries and with a stochastic noise coefficient that empirically remains below that of Muon throughout training. Across GPT-style pre-training on FineWeb-Edu with model sizes from 124M up to 2.7B parameters, Muown improves perplexity over Muon, SOAP, AdamW, and Lion. It also widens the plateau of near-optimal learning rates across model scales, reduces sensitivity to weight decay, and avoids the spectral norm drift at negligible step-time overhead when appropriately sharded.

[LG-12] ConQuR: Corner Aligned Activation Quantization via Optimized Rotations for LLMs

链接: https://arxiv.org/abs/2605.10793
作者: Chayne Thrash,Ali Abbasi,Soheil Kolouri
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) are costly to deploy due to their large memory footprint and high inference cost. Weight-activation quantization can reduce these costs, but low-bit activation quantization remains difficult because activation outliers induce large quantization error. Recent rotation-based methods address this by applying orthogonal transformations that redistribute activation magnitude across dimensions, but existing approaches either require expensive end-to-end rotation training or rely on stored activation corpora, introducing significant compute or storage overhead. We propose a lightweight post-training rotation calibration method for LLM activation quantization. Our method learns orthogonal rotations that align normalized activations with the corners of an inscribed hypercube, encouraging activation energy to be distributed more evenly across dimensions. This objective admits an efficient closed-form update via the orthogonal Procrustes problem, avoiding gradient-based optimization over the orthogonal group. We further introduce an online calibration procedure that updates rotations as calibration samples are processed, eliminating the need to store activations on disk and allowing rotations to adapt to quantized activation distributions during calibration. Experiments on Llama-2 and Llama-3 models from 3B to 70B parameters show that our method achieves competitive or improved performance across perplexity benchmarks and common sense reasoning tasks while avoiding both costly end-to-end training and large offline activation storage.
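
A minimal numpy sketch of the closed-form rotation step sketched in the abstract: normalize activations, take the nearest corners of an inscribed hypercube as targets, and solve the orthogonal Procrustes problem via one SVD. The corner construction and synthetic activations are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def corner_aligned_rotation(X):
    """One closed-form rotation step: align row-normalized activations X (n, d)
    with corners of the inscribed hypercube via orthogonal Procrustes."""
    n, d = X.shape
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)   # unit-norm activations
    T = np.sign(Xn) / np.sqrt(d)                                  # nearest hypercube corners
    # Orthogonal Procrustes: argmin_R ||Xn R - T||_F with R^T R = I  ->  R = U V^T
    U, _, Vt = np.linalg.svd(Xn.T @ T)
    return U @ Vt

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 64)) * np.logspace(0, 1.5, 64)   # a few outlier-heavy channels
R = corner_aligned_rotation(X)

before = np.abs(X).max(axis=0).max() / np.abs(X).max(axis=0).mean()
after = np.abs(X @ R).max(axis=0).max() / np.abs(X @ R).max(axis=0).mean()
print("per-channel outlier ratio before/after rotation:", round(before, 2), round(after, 2))
print("R orthogonal:", np.allclose(R.T @ R, np.eye(64), atol=1e-6))
```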

[LG-13] Elucidating Representation Degradation Problem in Diffusion Model Training

链接: https://arxiv.org/abs/2605.10790
作者: Zhipeng Yao,Dazhou Li,Zitong Zhang,Durude Mahee,Fan Zhu,Wenbin Zhang,Xinwei He,Yeying Jin,Rui Yu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models have achieved remarkable success, yet their training remains inefficient due to a severe optimization bottleneck, which we term Representation Degradation. As noise levels increase, the outputs of the trained model exhibit progressive structural distortion, which can destabilize training and impair generation quality. Our analysis suggests that this instability is driven by mismatched target recoverability, which is associated with Neural Tangent Kernel (NTK) spectral weakening and effective low-rank behavior. To address this, we propose Elucidated Representation Diffusion (ERD), a plug-and-play framework that dynamically reallocates optimization effort according to effective recoverability. By stabilizing representation learning without external supervision, ERD accelerates convergence and achieves strong empirical performance across diffusion backbones.

[LG-14] MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization

链接: https://arxiv.org/abs/2605.10784
作者: Rohan Surana,Xintong Li,Sheldon Yu,Yiran Jenny Shen,Chuhan Wang,Tong Yu,Prithviraj Ammanabrolu,Jingbo Shang,Julian McAuley,Junda Wu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-negative preference optimization under the Plackett–Luce (PL) model extends Direct Preference Optimization (DPO) by leveraging comparative signals across one preferred and multiple rejected responses. However, optimizing over large negative pools is costly, and many candidates contribute redundant gradients due to their similar effects on policy updates. We introduce MASS-DPO, a multi-negative active sample selection method that derives a PL-specific Fisher-information objective for selecting compact, informative negative subsets within each prompt. The resulting log-determinant objective selects negatives that contribute complementary information for policy updates, yielding compact subsets that retain the full pool’s information while reducing redundancy. In practice, this favors negatives whose gradients cover different update directions, reducing redundant signal from near-duplicate candidates while preserving the most useful training information. Across four benchmarks spanning recommendation and multiple-choice QA and three model families, MASS-DPO consistently exceeds or matches existing methods in accuracy, improves Recall/NDCG and margin-based optimization dynamics, and delivers stronger alignment with substantially fewer negatives.
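
A minimal numpy sketch of greedy log-determinant subset selection of the kind described above: each candidate negative is summarised by a gradient-feature vector, and candidates are added to maximise log det(I + G_S G_S^T), which rewards complementary directions over near-duplicates. The gradient features here are synthetic placeholders for per-negative gradient information.

```python
import numpy as np

def greedy_logdet_select(G, k):
    """Greedily pick k rows of G (candidates x features) maximising
    log det(I + G_S G_S^T), preferring complementary (non-redundant) directions."""
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(len(G)):
            if i in selected:
                continue
            S = G[selected + [i]]
            _, logdet = np.linalg.slogdet(np.eye(len(S)) + S @ S.T)
            if logdet > best_gain:
                best, best_gain = i, logdet
        selected.append(best)
    return selected

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 32))
# 12 candidate negatives: mostly near-duplicates of 4 underlying directions.
G = np.vstack([base[i % 4] + 0.05 * rng.normal(size=32) for i in range(12)])

chosen = greedy_logdet_select(G, k=4)
print("chosen candidates:", chosen)
print("underlying directions covered:", sorted({i % 4 for i in chosen}))  # ideally all four
```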

[LG-15] Locking Pretrained Weights via Deep Low-Rank Residual Distillation

链接: https://arxiv.org/abs/2605.10777
作者: Keitaro Sakamoto,Pierre Ablin,Federico Danieli,Marco Cuturi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The quality of open-weight language models has dramatically improved in recent years. Sharing weights greatly facilitates model adoption by enabling their use across diverse hardware and software platforms. They also allow for more open research and testing, to the extent that users can use them as checkpoints, fine-tune them according to their needs, and potentially redistribute them. In some cases, however, concerns on modifying these weights towards unauthorized uses may outweigh the pros of giving users such a freedom. Defending against such adaptation is non-trivial: since an adaptive attacker can observe all weights and architectures by definition, they can reverse simple structural defenses, and use optimization to defeat the simplest locking mechanisms. In this work, we exploit the inference-training asymmetry of automatic differentiation as a novel defense axis. We propose DLR-Lock, a method where the purveyor of the model purposely replaces each pretrained MLP in their model with a deep low-rank residual network (DLR-Net) of comparable parameter count, forcing activation memory that grows linearly with depth during backpropagation. DLR-Nets are efficiently trained via module-wise distillation. We show that, beyond this memory overhead, DLR-Lock results in architectural mismatches that complicate the optimization landscape of standard fine-tuning, and a backward pass that incurs disproportionately more overhead than the forward pass. Our defense succeeds in withstanding adaptive attackers with full knowledge of the defense strategy while preserving the original model’s capabilities. Experiments on LLM validate these claims.
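
A minimal PyTorch sketch of a deep low-rank residual network (DLR-Net) of the sort described above, as a drop-in replacement for a dense MLP of comparable parameter count: many rank-limited residual updates keep the forward pass cheap, while backpropagation must store activations for every layer. Dimensions, rank, and depth are illustrative, not the authors' configuration or training recipe.

```python
import torch
import torch.nn as nn

class DLRNet(nn.Module):
    """Deep low-rank residual stack replacing a dense MLP: each layer adds a rank-`rank`
    update, so activation memory during backprop grows linearly with `depth`."""
    def __init__(self, dim=512, rank=32, depth=64):
        super().__init__()
        self.down = nn.ModuleList(nn.Linear(dim, rank, bias=False) for _ in range(depth))
        self.up = nn.ModuleList(nn.Linear(rank, dim, bias=False) for _ in range(depth))
        self.act = nn.GELU()

    def forward(self, x):
        for A, B in zip(self.down, self.up):
            x = x + B(self.act(A(x)))          # rank-limited residual update
        return x

mlp_params = 512 * 2048 * 2                     # a reference dense MLP with 4x hidden width
dlr = DLRNet()
dlr_params = sum(p.numel() for p in dlr.parameters())
print(f"dense MLP params ~{mlp_params:,}  vs  DLR-Net params {dlr_params:,}")

x = torch.randn(4, 512, requires_grad=True)
dlr(x).sum().backward()                         # backprop walks through all 64 residual layers
print("gradient computed through", len(dlr.down), "low-rank residual layers")
```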

[LG-16] DynaMiCS: Fine-tuning LLM s with Performance Constraints using Dynamic Mixtures

链接: https://arxiv.org/abs/2605.10770
作者: Eleonora Gualdoni,Sonia Laguna,Louis Bethune,Joao Monteiro,Pierre Ablin,Marco Cuturi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-domain fine-tuning of large language models requires improving performance on target domains while preserving performance on constrained domains, such as general knowledge, instruction following, or safety evaluations. Existing data mixing strategies rely on fixed heuristics or adaptive rules that cannot explicitly enforce preservation of such capabilities. We propose DynaMiCS, a dynamic mixture optimizer that casts multi-domain fine-tuning as a constrained optimization problem. At each update, DynaMiCS performs short domain-specific probing runs to estimate a slope matrix of local cross-domain effects, capturing how training on each fine-tuning dataset affects each evaluation domain. These estimates are then used to compute mixture weights through optimization over the probability simplex, with the objective of improving target-domain performance while keeping constrained-domain losses below reference levels. Across multi-domain fine-tuning scenarios with varying numbers of target and constrained domains, DynaMiCS achieves stronger target-domain improvements and higher constraint satisfaction than fixed-mixture baselines, at lower computational cost and without reference models, per-example scoring, or manually tuned mixture weights.
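
A minimal sketch of the mixture-weight step described above, assuming the probing runs have already produced a slope matrix S[i, j] that estimates how training on dataset i changes evaluation domain j's loss: choose simplex weights that maximise the predicted target-domain improvement while constrained domains are not predicted to degrade. The slope values and the linear-programming solve are illustrative stand-ins for the paper's optimizer.

```python
import numpy as np
from scipy.optimize import linprog

# Slope matrix from probing runs: rows = fine-tuning datasets, cols = eval domains.
# Negative = training on that dataset reduces that domain's loss. (Illustrative values.)
#              target   general   safety
S = np.array([[-1.00,    +0.30,   +0.05],    # domain-specific data: helps target, hurts others
              [-0.20,    -0.10,   +0.00],    # broad instruction data
              [-0.05,    +0.00,   -0.15]])   # safety data

target_col, constrained_cols = 0, [1, 2]
slack = 0.0                                   # constrained losses must not be predicted to rise

res = linprog(
    c=S[:, target_col],                       # minimize predicted target-domain loss change
    A_ub=S[:, constrained_cols].T,            # predicted change on constrained domains
    b_ub=np.full(len(constrained_cols), slack),
    A_eq=np.ones((1, len(S))), b_eq=[1.0],    # mixture weights live on the simplex
    bounds=[(0.0, 1.0)] * len(S),
)
print("mixture weights:", res.x.round(3))
print("predicted loss changes (target, general, safety):", (res.x @ S).round(3))
```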

[LG-17] AdaPaD: Adaptive Parallel Deflation for PEFT with Self-Correcting Rank Discovery

链接: https://arxiv.org/abs/2605.10741
作者: Barbara Su,Fangshuo Liao,Anastasios Kyrillidis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Fine-tuning large language models with LoRA requires choosing a rank r before training starts. Existing approaches either extract rank-1 components sequentially, freezing each component’s error permanently into every subsequent residual, or optimize the full low-rank factorization jointly with guarantees that describe only the joint update, not individual rank-1 directions. We present AdaPaD (Adaptive Parallel Deflation), which trains all rank-1 components simultaneously: each worker refines its component against a deflation target built from the latest estimates of all predecessors, and as those estimates improve, the targets improve too. We call this property self-correction: deflation errors converge to zero over rounds rather than persisting as fixed residuals. On top of this backbone, AdaPaD adds advance learning (private pre-training before activation) and per-module dynamic rank discovery (importance-based growth until a shared budget is exhausted), making the rank distribution an output rather than an input. We prove that every component’s error decays exponentially after a warm-up period, with a generalization bound that splits into a vanishing algorithmic term and an irreducible statistical floor. Empirically, AdaPaD is competitive with adaptive-rank LoRA baselines on GLUE with DeBERTaV3-base at matched parameter budgets, and competitive with fixed-rank LoRA on Qwen3-0.6B SQuAD/SQuAD v2 while deploying an adapter that is on average 30.7% smaller.

[LG-18] XQCfD: Accelerating Fast Actor-Critic Algorithms with Prior Data and Prior Policies

链接: https://arxiv.org/abs/2605.10734
作者: Daniel Palenicek,Florian Vogt,Joe Watson,Ingmar Posner,Danica Kragic,Jan Peters
类目: Machine Learning (cs.LG)
*备注: 22 pages, 10 figures, 2 tables

点击查看摘要

Abstract:For reinforcement learning in the real world, online exploration is expensive. A common practice in robotic reinforcement learning is to incorporate additional data to improve sample efficiency. Expert demonstration data is often crucial for solving hard exploration tasks with sparse rewards. While prior data is used to augment experience and pretrain models, we show that the design of existing algorithms fails to achieve the sample efficiency that is possible in this setting, due to a failure to use pretrained policies effectively. We propose XQCfD, which extends the sample-efficient XQC actor-critic to learn from demonstrations using augmented replay buffers, pretrained policies, and stationary policy architectures designed to avoid rapidly unlearning the strong initial policy, as prior works do. We show our stationary network architecture enables policy improvement out-of-distribution better than standard network architectures, due to its higher-entropy predictions. XQCfD achieves state-of-the-art performance across a range of complex manipulation tasks with sparse rewards from the popular Adroit, Robomimic, and MimicGen benchmarks – notably with a low update-to-data ratio and no ensemble networks.

[LG-19] Kernel-Gradient Drifting Models

链接: https://arxiv.org/abs/2605.10727
作者: Maria Esteban-Casadevall,Jorge Carrasco-Pollo,Max Welling,Jan-Willem van de Meent,Erik J. Bekkers,Floor Eijkelboom
类目: Machine Learning (cs.LG); Differential Geometry (math.DG)
*备注:

点击查看摘要

Abstract:We propose kernel-gradient drifting, a one-step generative modeling framework that replaces the fixed Euclidean displacement direction in drifting models with directions induced by the kernel itself. Standard drifting is attractive because it enables fast, high-quality generation without distilling a large pretrained diffusion model, but its theory is currently understood mainly for Gaussian kernels, where the drift coincides with smoothed score matching and is identifiable. Our gradient-based reformulation exposes this score-based structure for general kernels: the resulting drift is the score difference between kernel-smoothed data and model distributions, yielding identifiability for characteristic kernels and a smoothed-KL descent interpretation of the drifting dynamics. Since kernel gradients are intrinsic tangent vectors, the same construction extends naturally to Riemannian manifolds and to discrete data via the Fisher-Rao geometry of the probability simplex. Across spherical geospatial data, promoter DNA and molecule generation, kernel-gradient drifting enables state-of-the-art one-step generation beyond the Euclidean setting without distillation.
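
A minimal numpy sketch of the score-difference reading of the drift: with a Gaussian kernel, the drift at a point is the gradient of the log kernel-smoothed data density minus that of the kernel-smoothed model density, and samples are moved along it. Bandwidth, step size, and the 2D toy data are illustrative.

```python
import numpy as np

def smoothed_score(x, pts, h):
    """Gradient at x of log of a Gaussian-kernel-smoothed density built on pts."""
    diff = pts - x                                        # (n, d)
    w = np.exp(-0.5 * (diff ** 2).sum(axis=1) / h ** 2)   # kernel weights
    return (w[:, None] * diff).sum(axis=0) / (w.sum() * h ** 2)

def kernel_gradient_drift(x, data, model_samples, h=1.0):
    # drift = score of smoothed data distribution - score of smoothed model distribution
    return smoothed_score(x, data, h) - smoothed_score(x, model_samples, h)

rng = np.random.default_rng(0)
data = rng.normal(loc=[2.0, -1.0], scale=0.5, size=(500, 2))    # target distribution
model = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(500, 2))    # current generator output

for _ in range(60):                                              # drift the model samples
    model = model + 0.05 * np.array([kernel_gradient_drift(x, data, model) for x in model])

print("data mean :", data.mean(axis=0).round(2))                 # ~[ 2.0, -1.0]
print("model mean:", model.mean(axis=0).round(2))                # drifts toward the data mean
```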

[LG-20] On Improving Graph Neural Networks for QSAR by Pre-training on Extended-Connectivity Fingerprints

链接: https://arxiv.org/abs/2605.10722
作者: Sam Money-Kyrle,Markus Dablander,Thierry Hanser,Stephane Werner,Charlotte M. Deane,Garrett M. Morris
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Molecular Graph Neural Networks (GNNs) are increasingly common in drug discovery, particularly for Quantitative Structure-Activity Relationship (QSAR) studies; yet, their superiority compared to classical molecular featurisation approaches is disputed. We report a general strategy for improving GNNs for QSAR by pre-training to predict Extended-Connectivity Fingerprints (ECFP). We validate our approach with statistical tests and challenging out-of-distribution (OOD) splits. Across five out of six Biogen benchmarks, we observed a statistically significant improvement in standard performance metrics over all evaluated baselines when using ECFP pre-trained GNNs. However, for more heterogeneous datasets and more complex endpoints, such as binding affinity prediction, pre-trained GNNs underperformed in OOD settings. Importantly, we investigated the impact of substructure-level data leakage during pre-training on downstream performance. While we identified scenarios where pre-training on ECFPs was less effective, our findings show that ECFP-based pre-training can enhance downstream OOD performance on a diverse set of practically relevant QSAR tasks.
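
For context, the ECFP pre-training targets can be computed directly from SMILES strings with RDKit; a minimal sketch, assuming RDKit is installed and using radius 2 with 2048 bits as illustrative settings rather than the paper's exact configuration:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def ecfp_bits(smiles, radius=2, n_bits=2048):
    """Extended-Connectivity Fingerprint as a 0/1 numpy vector (the pre-training target)."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(list(fp), dtype=np.float32)

# A GNN encoder is pre-trained to predict these bit vectors from the molecular graph,
# then fine-tuned on the (much smaller) labeled QSAR task.
target = ecfp_bits("CC(=O)Oc1ccccc1C(=O)O")   # aspirin, as an example molecule
print(target.shape, "with", int(target.sum()), "bits set")
```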

[LG-21] What should post-training optimize? A test-time scaling law perspective

链接: https://arxiv.org/abs/2605.10716
作者: Muheng Li,Jian Qian,Wenlong Mou
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Large language models are increasingly deployed with test-time strategies: sample N responses, score them with a reward model or verifier, and return the best. This deployment rule exposes a mismatch in post-training: standard objectives optimize the mean reward of a single response, whereas best-of- N performance is governed by the upper tail of the reward distribution. Recent test-time-aware objectives partly address this mismatch, but typically assume that training can use the same per-prompt rollout budget as deployment, which is impractical when post-training must cover many prompts while deployment can allocate much larger per-prompt test-time compute. We study this budget-mismatch regime, where only m\ll N per-prompt rollouts are available during training but the target objective is best-of- N deployment. Under structural assumptions on the reward tails, we show that the policy gradient of the best-of- N objective can be approximated from a much smaller rollout group by extrapolating upper-tail statistics. This yields a family of Tail-Extrapolated estimators for best-of- N -oriented post-training: a simple direct estimator, Tail-Extrapolated Advantage (TEA), and a fixed-order debiased Prefix-TEA estimator based on moment cancellation. Experiments on instruction-following tasks show that TEA and Prefix-TEA improve best-of- N performance across different language models, reward models and datasets under various training and test-time budget settings.

[LG-22] RelFlexformer: Efficient Attention 3D-Transformers for Integrable Relative Positional Encodings

链接: https://arxiv.org/abs/2605.10706
作者: Byeongchan Kim,Arijit Sehanobish,Avinava Dubey,Min-hwan Oh,Krzysztof Choromanski
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a new class of efficient attention mechanisms applying universal 3D Relative Positional Encoding (RPE) methods given by arbitrary integrable modulation functions f . They lead to the new class of 3D-Transformer models, called \textit{RelFlexformers}, flexibly integrating those RPEs, and characterized by the O(L \log L) time complexity of the attention computation for the L -length input sequences. RelFlexformers build on the theory of the Non-Uniform Fourier Transform (NU-FFT), naturally generalizing several existing efficient RPE-attention methods from structured settings with tokens homogeneously embedded in unweighted grids into general non-structured heterogeneous scenarios, where tokens’ positions are arbitrarily distributed in the corresponding 3D spaces. As such, RelFlexformers can be applied in particular to model point clouds. Our extensive empirical evaluation on a large portfolio of 3D datasets confirms quality improvements provided by the NU-FFT-driven attention modulation techniques in RelFlexformers.

[LG-23] DANCE: Detect and Classify Events in EEG

链接: https://arxiv.org/abs/2605.10688
作者: Jarod Lévy,Hubert Banville,Jérémy Rapin,Jean-Remi King,Thomas Moreau,Stéphane d’Ascoli
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 29 pages

点击查看摘要

Abstract:Event identification in continuous neural recordings is a critical task in neuroscience. Decoding in EEG is dominated by classifying windows aligned to known event onsets. However, while available in controlled experiments, such onsets are absent in continuous real-world monitoring. Here, we introduce DANCE, a deep learning pipeline that frames neural decoding as a set-prediction problem and jointly detects and classifies events directly from raw, unaligned signals. Evaluated separately on ten datasets curated from the literature with a wide variety of event types (ranging from milliseconds to minutes in duration), our model outperforms existing methods on a broad range of cognitive, clinical and BCI tasks. This single architecture establishes a new state of the art in the competitive task of seizure monitoring and matches the accuracy of onset-informed models for BCI tasks. Overall, our method marks a step towards end-to-end asynchronous neural decoding models.

[LG-24] The finite expression method for turbulent dynamics with high-order moment recovery

链接: https://arxiv.org/abs/2605.10687
作者: Xingjian Xu,Di Qi,Chunmei Wang
类目: Machine Learning (cs.LG)
*备注: 20 pages, 8 figures, 1 table

点击查看摘要

Abstract:Turbulent dynamical systems are characterized by nonlinear interactions and stochastic effects that generate coupled statistical quantities, such as non-zero higher-order moments, which are difficult to capture from data with accuracy. We propose a two-stage data-driven modeling framework that combines symbolic regression with generative models to jointly identify the governing dynamics and predict their key statistical quantities. In Stage I of the framework, the Finite Expression Method (FEX) is adopted to discover closed-form expressions of the deterministic dynamics, recovering nonlinear interaction terms and external forcing without predefined libraries. In Stage II, generative models are introduced to learn the residual stochastic components as a refined correction to the model error from the Stage I approximation, enabling accurate characterization of higher-order statistics. Theoretical analysis establishes the consistency of the symbolic estimator and quantifies the estimation error in terms of data size and numerical discretization. The model performance is verified through detailed numerical experiments on the stochastic triad models across multiple regimes, demonstrating that the framework successfully recovers interaction terms and forcing expressions, and accurately predicts statistical moments up to order five. These results highlight the potential of integrating interpretable symbolic discovery with data-driven stochastic modeling for complex turbulent systems.

[LG-25] Scalable Mamba-Based Message-Passing Neural Decoder for Error-Correcting Codes

链接: https://arxiv.org/abs/2605.10681
作者: Rostislav Gusev,Nikita Aleksandrov,Artem Solomkin,Dmitry Artemasov
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Forward error correction is essential for reliable communication over noisy channels. Attention-based model-free neural decoders have shown strong performance for short codes, but their scalability to longer codes is limited by the quadratic memory and computational cost of attention. In this paper, we introduce the Mamba message-passing decoder (MMPD), an attention-free syndrome-based neural decoder for binary linear codes. MMPD retains the Tanner-graph structure of a message-passing decoder by performing local pairwise aggregation along variable-check edges. To enable efficient long-range information propagation, these local updates are combined with bidirectional Mamba state-space blocks. By avoiding dense attention matrices, MMPD scales more favorably for long codes in both memory and computation. Experiments on the (1056, 880) LDPC code show that MMPD achieves a 0.45 dB gain over the state-of-the-art CrossMPT decoder at a specified target bit error rate, while reducing memory consumption by a factor of 1.5. This reduction factor increases substantially for longer codes, demonstrating the applicability of MMPD to scalable neural decoding of practical long codes.

[LG-26] Exact Unlearning from Proxies Induces Closeness Guarantees on Approximate Unlearning

链接: https://arxiv.org/abs/2605.10680
作者: Virgile Dine,Teddy Furon
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper proposes a paradigm shift linking machine unlearning directly to the structure of the data distributions rather than a mere update of the neural network parameters. We show that inferring these distributions with precision enables distilling the exact unlearning signal induced by the modeling. Theoretical bounds on the Kullback-Leibler divergence from the ideal retrained model to our unlearned model, under a verifiable admissibility criterion, reveal the soundness of our framework. This method is experimentally validated over three forgetting scenarios as reaching the classifier closest to the ideal retrained model when compared to competitors.

[LG-27] Compander-Aligned Query Geometry for Quantized Zeroth-Order Optimization

链接: https://arxiv.org/abs/2605.10673
作者: Yao Shu,Zilin Zhu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Low-bit forward evaluation is an attractive route to memory-efficient zeroth-order (ZO) adaptation: the optimizer needs only scalar losses, and the model can be queried near deployment precision. The obstacle is that a quantized ZO query is not a continuous finite difference followed by harmless storage rounding. The query chooses endpoints, the low-precision engine rounds them, and the loss difference is measured along the rounded chord. For nonuniform companding quantizers, this makes the codebook insufficient to predict ZO behavior: a fixed weight-space radius can collapse in dense cells, over-span sparse cells, or assign a rounded chord to an unrounded update direction. We identify the missing object as query geometry and model scalar nonuniform quantization as Q = \phi^{-1} \circ U \circ \phi . CAQ-ZO (Compander-Aligned Queries for Zeroth-Order Optimization) forms one-grid-step Rademacher stencils z \pm \Delta r in z = \phi(x) , maps endpoints back through \phi^{-1} , and updates in z . Our theory proves the grid-span mismatch, decomposes endpoint-rounding estimator residuals, and gives stationarity bounds in which generic off-grid queries retain a \Delta^2/\mu^2 residual channel while CAQ-ZO makes the query-time residual exactly zero. Synthetic experiments isolate this channel, and matched NF4 Qwen/Llama fine-tuning shows that CAQ-ZO improves the trained NF4 baseline under the same quantizer and evaluation budget.
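
A minimal numpy sketch of the query geometry above, with an arcsinh compander standing in for \phi: the quantizer is Q = \phi^{-1} \circ U \circ \phi, and each ZO query takes Rademacher endpoints one grid step away in z = \phi(x), so mapping them back through \phi^{-1} lands exactly on codebook points. Snapping the iterate to the grid each step and the toy objective are illustrative simplifications, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Compander quantizer Q = phi^{-1} o U o phi, with an arcsinh compander as a stand-in.
phi, phi_inv = np.arcsinh, np.sinh
step = 0.25                                            # uniform grid step in z-space
U = lambda z: np.round(z / step) * step                # uniform quantizer in z-space
Q = lambda x: phi_inv(U(phi(x)))                       # nonuniform quantizer in weight space

def caq_zo_grad(loss, z):
    """Two-point ZO estimate at a grid point z; endpoints sit one grid step away in z,
    so phi_inv of each endpoint is itself a codebook point (no query-time rounding error)."""
    r = rng.choice([-1.0, 1.0], size=z.shape)          # Rademacher stencil direction
    f_plus = loss(phi_inv(z + step * r))
    f_minus = loss(phi_inv(z - step * r))
    return (f_plus - f_minus) / (2 * step) * r          # update direction in z-space

loss = lambda w: 0.5 * ((w - 3.0) ** 2).sum()           # toy objective on the weights
z = U(phi(np.array([0.1, 5.0, -2.0])))                  # start from a quantized weight
for _ in range(300):
    z = U(z - 0.1 * caq_zo_grad(loss, z))               # update in z, snap back to the grid
print("final quantized weights:", phi_inv(z).round(3))  # codebook points near the optimum 3.0
```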

[LG-28] Natural Policy Gradient as Doubly Smoothed Policy Iteration: A Bellman-Operator Framework

链接: https://arxiv.org/abs/2605.10671
作者: Phalguni Nanda,Zaiwei Chen
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In this work, we show that natural policy gradient, a core algorithm in reinforcement learning, admits an exact formulation as a smoothed and averaged form of policy iteration. Specifically, we introduce doubly smoothed policy iteration (DSPI), a Bellman-operator framework in which each policy is obtained by applying a regularized greedy step to a weighted average of past Q -functions. DSPI includes policy iteration, dual-averaged policy iteration, natural policy gradient, and more general policy dual averaging methods as special cases. Using only monotonicity and contraction of smoothed Bellman operators, we prove distribution-free global geometric convergence of DSPI. Consequently, standard natural policy gradient and policy dual averaging achieve an iteration complexity of \mathcal{O}((1-\gamma)^{-1}\log((1-\gamma)^{-1}\epsilon^{-1})) for computing an \epsilon -optimal policy, without modifying the MDP, adding regularization beyond the mirror map inherent in the update, or using adaptive, trajectory-dependent stepsizes. For the unregularized greedy case, corresponding to dual-averaged policy iteration, we also prove finite termination. The same Bellman-operator framework further extends to discounted MDPs with linear function approximation and stochastic shortest path problems.
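
A minimal tabular sketch of the DSPI template on a toy 2-state MDP: keep a running average of past Q-functions and obtain each policy by a softmax (regularized greedy) step on that average; a hard argmax without averaging would recover plain policy iteration. Temperature, averaging weights, and the MDP itself are illustrative.

```python
import numpy as np

# Tiny 2-state, 2-action MDP: P[a, s, s'] transition probs, R[s, a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def q_of_policy(pi):
    """Exact Q^pi via the Bellman linear system (pi has shape [state, action])."""
    P_pi = np.einsum('sa,asp->sp', pi, P)          # state-to-state kernel under pi
    r_pi = (pi * R).sum(axis=1)
    v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
    return R + gamma * np.einsum('asp,p->sa', P, v)

def softmax(x, tau):
    e = np.exp((x - x.max(axis=1, keepdims=True)) / tau)
    return e / e.sum(axis=1, keepdims=True)

pi = np.full((2, 2), 0.5)       # uniform initial policy
Q_avg = np.zeros((2, 2))        # running (dual-averaged) Q estimate
for k in range(1, 51):
    Q_avg += (q_of_policy(pi) - Q_avg) / k        # weighted average of past Q-functions
    pi = softmax(Q_avg, tau=0.05)                 # regularized (smoothed) greedy step

print("greedy actions per state:", Q_avg.argmax(axis=1))
print("Q^pi at convergence:\n", q_of_policy(pi).round(3))
```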

[LG-29] A Spectral Framework for Closed-Form Relative Density Estimation

链接: https://arxiv.org/abs/2605.10668
作者: Francis Bach (SIERRA)
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We propose a closed-form spectral framework for relative log-density estimation in linearly parameterized probabilistic models, including unnormalized and conditional models. This is achieved by representing the Kullback-Leibler (KL) divergence as an integral of weighted chi-squared divergences, converting KL estimation into a family of least-squares problems. We derive an explicit spectral formula based only on first- and second-order feature moments, yielding closed-form estimators of both divergences and log-density potentials for fixed features. The framework extends to a broad class of f-divergences and can be combined with kernelization or feature learning with neural networks. We prove convergence guarantees for the resulting estimators and empirically compare them on synthetic data with optimization-based variational formulations, including logistic and softmax regression for normalized conditional models.

[LG-30] Why Zeroth-Order Adaptation May Forget Less: A Randomized Shaping Theory

链接: https://arxiv.org/abs/2605.10658
作者: Yao Shu,Jian Mu,Zhongxiang Dai
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Continual learning requires new-task adaptation without damaging previously acquired capabilities. Recent forward-pass and zeroth-order (ZO) results show that low-query adaptation may retain better than first-order (FO) descent, but the usual view of ZO as noisy FO estimation does not explain why. We give a local randomized gradient-shaping analysis: finite differences expose a raw shape that is mean-aligned with FO, while the norm-matched comparator fixes the expected squared adaptation norm. Under this controlled comparison, forgetting depends on how the adaptation shape exposes retention curvature. For norm-matched ZO, the expected shaped retention curvature obeys an exact identity that preserves the isotropic retention floor while contracting only the anisotropic component. Projecting this identity onto the incoming gradient yields the observable FO–ZO quadratic forgetting gap: ZO improves mean forgetting precisely when the FO direction has above-average retention curvature, by a query-dependent fraction of that curvature excess. A practical finite-query accounting separates the mean mechanism from one-batch sampling and smoothing perturbations. As an algorithmic transfer, RISE applies the calibrated ZO shape to exact FO gradients inside parameter blocks. Its target is a stability–plasticity tradeoff: randomized shaping may reduce the retention exposure paid by FO, exact gradients remove finite-smoothing bias from finite-difference ZO, and blockwise sampling supplies many local shaping directions after one gradient computation. The blockwise analysis separates mean-step damage from centered random exposure, showing how block-diagonal curvature, cross-block coupling, and local shaping diagnostics specify where this exact-gradient transfer is most likely to be visible.

[LG-31] BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization

链接: https://arxiv.org/abs/2605.10655
作者: Venugopalan Iyengar
类目: Machine Learning (cs.LG)
*备注: 26 pages, 4 figures, 4 tables. Code at this https URL . Model weights and trajectory snapshots at this https URL

点击查看摘要

Abstract:Trellis-coded quantization sets the current 2-bit post-training frontier for LLMs (QTIP), but pushing below the PTQ ceiling requires quantization-aware training, and QAT on a trellis is obstructed by the non-differentiable Viterbi argmax. We introduce BCJR-QAT, a relaxation that replaces the argmax with the BCJR forward-backward sum-product algorithm at temperature T , producing a soft codeword equal to the Boltzmann expectation over trellis paths, exactly differentiable, recovering the hard QTIP code as T \to 0 , and mathematically identical to the transfer-matrix computation for a 1D Ising-like spin chain. We contribute (i) a fused Triton kernel making BCJR tractable on a single consumer GPU ( 6.57\times speedup, fp32 parity); (ii) a quantitative drift-budget theory of when BCJR-QAT can escape the QTIP-PTQ Voronoi basin, verified across four experiments; and (iii) a positive empirical result on Llama-3.2-1B at 2 bpw under end-to-end forward-KL distillation: with the right schedule (skip the high- T phase to avoid an overshoot we diagnose), single-layer BCJR-QAT beats QTIP-PTQ by \mathbf{-0.084} PPL on WikiText-2, and multi-layer compounding is super-additive.

[LG-32] A Random-Matrix Criterion for Initializing Gated Recurrent Neural Networks

链接: https://arxiv.org/abs/2605.10650
作者: Tommaso Fioratti,Riccardo Marcaccioli,Francesco Casola
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)
*备注: 10 pages, 5 figures, 2 appendices

点击查看摘要

Abstract:Proper weight initialization prior to training has historically been one of the key factors that helped kick off the deep learning revolution. Initialization is even more crucial in “reservoir computing”, where the weights of a readout layer are learned linearly while the reservoir weights are fixed and largely determine the richness, stability and memory of the resulting dynamics. In the infinite-width limit it has been shown that meaningful initializations are those sitting at an effective critical point of the randomly initialized model. The phase transition is controlled by the weight variance g^2 and separates an ordered phase from a chaotic one where information progressively degrades. Here we derive a simple criterion to estimate the critical g_c for a broad class of recurrent architectures and we show that it closely tracks the gain at which a gated-RNN reservoir achieves peak performance on a chaotic forecasting task. Finally, we argue that our criterion can serve as a design principle for future initialization schemes.
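
A minimal numpy sketch of the underlying order-to-chaos picture for a plain (ungated) random recurrent matrix, which the paper's criterion generalizes to gated architectures: with entries of variance g^2/N, the circular law puts the spectral radius near g, so for this vanilla case the transition sits near g_c = 1.

```python
import numpy as np

def spectral_radius(g, N=500, seed=0):
    """Spectral radius of a random recurrent weight matrix with entries ~ N(0, g^2/N)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=g / np.sqrt(N), size=(N, N))
    return np.abs(np.linalg.eigvals(W)).max()

# The circular law gives radius ~ g, so the ordered/chaotic transition is near g_c = 1.
for g in (0.5, 0.9, 1.0, 1.1, 1.5):
    print(f"g = {g:.1f}  ->  spectral radius ~ {spectral_radius(g):.2f}")
```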

[LG-33] Composing diffusion priors with explicit physical context via generative Gibbs sampling

链接: https://arxiv.org/abs/2605.10642
作者: Weizhou Wang,Jonathan Weare,Aaron R. Dinner
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech)
*备注: 31 pages, 11 figures

点击查看摘要

Abstract:Pretrained diffusion models provide powerful learned priors, but in scientific sampling the target distribution often depends on physical context that is not fully represented by one generative model. We introduce Generative Gibbs for Physics-Aware Sampling (GG-PA), a training-free framework that formulates the composition of learned partial priors and explicit physical context as inference over a joint target distribution in an augmented state space. We derive a Gibbs sampler for this joint target, show that it is asymptotically exact as the diffusion time approaches zero, and prove that in settings with quadratic interactions it remains exact at finite diffusion times. We further introduce replica exchange over diffusion time to accelerate mixing. Experiments on a double-well system, a \phi^4 lattice model, and atomistic peptide systems show that GG-PA recovers context-induced distribution shifts and emergent collective behavior in interacting systems using partial priors without retraining. These results demonstrate GG-PA as a practical approach for combining pretrained generative priors with explicit physical context.

[LG-34] Hierarchical End-to-End Taylor Bounds for Complete Neural Network Verification

链接: https://arxiv.org/abs/2605.10621
作者: Taha Entesari,Mahyar Fazlyab
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Reachability analysis of neural networks, which seeks to compute or bound the set of outputs attainable over a given input domain, is central to certifying safety and robustness in learning-enabled physical systems. Since exact reachable set computation is generally intractable, existing methods typically rely on tractable overapproximations. Examining the state of the art for smooth, twice-differentiable networks, we observe that existing approaches exploit at most second-order information and do not systematically leverage higher-order information. In this work, we introduce \textsc{HiTaB}, a novel verification framework that exploits second-order smoothness through both the Hessian, \nabla^2 f , and its Lipschitz constant, L_{\nabla^2 f} . We further develop a unified hierarchy of zeroth-, first-, and second-order bounds, together with precise conditions under which higher-order approximations yield provable improvements. Our main technical contribution is a compositional procedure for efficiently bounding L_{\nabla^2 f} in deep neural networks via layerwise propagation of curvature bounds. We extend the framework to both \ell_2 - and \ell_\infty -constrained input sets and show how it can be integrated into branch-and-bound verification pipelines. To our knowledge, this is the first practical reachability analysis framework for smooth neural networks that systematically exploits Lipschitz continuity of curvature, leading to tighter and more informative safety certificates.
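
A minimal numpy sketch of the zeroth/first/second-order bound hierarchy on a scalar toy function over an \ell_2 ball: each order adds more local information (gradient, then Hessian plus its Lipschitz constant L_{\nabla^2 f}) to control the Taylor remainder. The function and the assumed Lipschitz constants are illustrative, not a neural network or the paper's propagation procedure.

```python
import numpy as np

# Scalar toy "network" f(x) = sin(x0) * x1 + 0.1 * x0**2, analyzed around x0.
def f(x):      return np.sin(x[0]) * x[1] + 0.1 * x[0] ** 2
def grad(x):   return np.array([np.cos(x[0]) * x[1] + 0.2 * x[0], np.sin(x[0])])
def hess(x):   return np.array([[-np.sin(x[0]) * x[1] + 0.2, np.cos(x[0])],
                                [np.cos(x[0]),               0.0]])

x0, r = np.array([0.5, 1.0]), 0.1
L_f, L_grad, L_hess = 2.0, 1.5, 1.2     # (assumed) Lipschitz constants of f, grad f, hess f

# Upper bounds on max_{||d|| <= r} f(x0 + d); higher orders are typically tighter.
zeroth = f(x0) + L_f * r
first = f(x0) + np.linalg.norm(grad(x0)) * r + 0.5 * L_grad * r ** 2
lam_max = np.linalg.eigvalsh(hess(x0)).max()
second = (f(x0) + np.linalg.norm(grad(x0)) * r
          + 0.5 * max(lam_max, 0.0) * r ** 2 + L_hess * r ** 3 / 6.0)

print(f"order-0 bound: {zeroth:.4f}\norder-1 bound: {first:.4f}\norder-2 bound: {second:.4f}")
```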

[LG-35] Reconfigurable Computing Challenge: Real-Time Graph Neural Networks for Online Event Selection in Big Science

链接: https://arxiv.org/abs/2605.10612
作者: Marc Neu,Frank Baptist,Thomas Lobmaier,Fabio Papagno,Torben Ferber,Jürgen Becker
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: Accepted to FCCM Reconfigurable Computing Challenge 2026

点击查看摘要

Abstract:Graph neural networks are increasingly adopted in trigger systems for collider experiments, where strict latency and throughput constraints render deployment on embedded platforms challenging. As detectors move towards higher granularity, the number of inputs per inference increase and FPGA-only solutions face resource bottlenecks. This work presents an end-to-end demonstrator for the real-time deployment of a dynamic Graph Neural Network for the Belle II electromagnetic calorimeter hardware trigger on the AMD Versal VCK190, leveraging both FPGA fabric and AI Engine tiles. We develop a Python-based semi-automated design flow covering operator fusion, partitioning, mapping, spatial parallelization, and kernel-level optimization. Our design achieves a throughput of 2.94 million events per second at an end-to-end latency of 7.15 microseconds. Compared to the FPGA-only baseline, this represents a 53% throughput improvement while reducing DSP utilization from 99% to 19% at 29% AI Engine tile utilization. To validate the deployment, an interactive visualization pipeline enables real-time monitoring of inference results on the physical demonstrator.

[LG-36] Controllability in preference-conditioned multi-objective reinforcement learning

链接: https://arxiv.org/abs/2605.10585
作者: Pau de las Heras Molins,Beyazit Yalcinkaya,Lasse Peters,David Fridovich-Keil,Georgios Bakirtzis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-objective reinforcement learning (MORL) allows a user to express preference over outcomes in terms of the relative importance of the objectives, but standard metrics cannot capture whether changes in preference reliably change the agent’s behavior in the intended way, a property termed controllability. As a result, preference-conditioned agents can score well on standard MORL metrics while being insensitive to the preference input. If the ability to control agents cannot be reliably assessed, the symbolic interface that MORL provides between user intent and agent behavior is broken. Mainstream MORL metrics alone fail to measure the controllability of preference-conditioned agents, motivating a complementary metric specifically designed to that end. We hope the results spur discussion in the community on existing evaluation protocols to consolidate advances in preference adaptation in MORL to larger and more complex problems.

[LG-37] Online Sharp-Calibrated Bayesian Optimization

链接: https://arxiv.org/abs/2605.10572
作者: Marshal Arijona Sinaga,Julien Martinelli,Teemu Turpeinen,Samuel Kaski
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Bayesian optimization (BO) is a widely used framework for optimizing expensive black-box functions, commonly based on Gaussian process (GP) surrogate models. Its effectiveness relies on uncertainty quantification that is both sharp (informative) and well-calibrated along the BO trajectory. In practice, GP kernel hyperparameters are unknown and are refit online from sequentially collected (non-i.i.d.) data, which can yield miscalibrated or overly conservative uncertainty and lies outside the fixed-kernel assumptions of standard BO regret theory. We propose Online Sharp-Calibrated Bayesian Optimization (OSCBO), a BO algorithm that adaptively balances GP sharpness and calibration by casting hyperparameter selection as a constrained online-learning problem. We also show that OSCBO preserves sublinear regret bounds by leveraging the theoretical guarantees of the underlying online learning algorithm. Empirically, OSCBO performs competitively across synthetic and real-world benchmarks, ranking among the strongest methods in final simple regret while maintaining robust cumulative-regret behavior.

[LG-38] It's All Connected: Topology-Aware Structural Graph Encoding Improves Performance on Polymer Prediction

链接: https://arxiv.org/abs/2605.10551
作者: H. Ibrahim Erdogan(University of Bayreuth, Germany),Punith Raviswamy(University of Bayreuth, Germany),Nikita Agrawal(University of Bayreuth, Germany),Yannik Köster(Friedrich Schiller University Jena, Germany),Stefan Zechel(Friedrich Schiller University Jena, Germany),Ulrich S. Schubert(Friedrich Schiller University Jena, Germany),Ruben Mayer(University of Bayreuth, Germany),Christopher Kuenneth(University of Bayreuth, Germany)
类目: Machine Learning (cs.LG)
*备注: 9 pages, 4 figures

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have achieved strong results in molecular property prediction, but polymers present distinct challenges: labeled datasets are scarce and small (typically in the order of hundreds of polymers) due to the need for expensive experimentation, and complex polymer chain distributions influence polymer properties. Established practice in polymer prediction represents polymers solely by graphs of their repeat units, discarding the chain-scale morphology that governs key properties such as the glass transition temperature ( T_g ). In this work, we propose a principled graph construction that addresses this gap. Given a polymer’s molecular mass distribution (MMD), we sample representative chains from the Schulz-Zimm distribution and construct representative sets of large graphs encoding chain-scale topology directly, with atoms and bonds featurized using rich chemical descriptors. We further pretrain GNN encoders via masked graph modeling on 100,000 unlabeled PSMILES strings before fine-tuning on labeled data. On a dataset of 381 polymers (180 homopolymers and 201 copolymers), we show that graph construction and self-supervised pretraining are jointly necessary: without pretraining, the large graph method matches the repeat-unit baseline (28.40 K vs. 28.36 K RMSE); with pretraining, it achieves 24.76 K +/- 3.30 K, a 5.1% reduction in mean error over the pretrained repeat-unit baseline (26.08 K +/- 4.20 K, p < 0.001, 30 runs). An ablation removing chemical features degrades performance to 36.65 K, confirming both components are essential. Results are architecture-agnostic, holding for both GINE and GATv2 encoders.
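
A minimal sketch of the chain-sampling step described above: the Schulz-Zimm molar-mass distribution coincides with a gamma distribution whose shape parameter is fixed by the dispersity (PDI). The numbers below are illustrative assumptions, and the paper's exact discretization into representative chain graphs may differ.

```python
import numpy as np

def sample_schulz_zimm(n_chains, M_n, pdi, rng):
    """Sample chain molar masses from a Schulz-Zimm distribution, i.e. a gamma
    distribution with shape k = 1/(PDI - 1) and mean M_n (illustrative sketch)."""
    k = 1.0 / (pdi - 1.0)
    return rng.gamma(shape=k, scale=M_n / k, size=n_chains)

rng = np.random.default_rng(0)
masses = sample_schulz_zimm(10_000, M_n=20_000.0, pdi=1.8, rng=rng)
M_n_hat = masses.mean()                         # number-average molar mass
M_w_hat = (masses ** 2).mean() / masses.mean()  # weight-average molar mass
print(f"M_n ~ {M_n_hat:,.0f}, PDI ~ {M_w_hat / M_n_hat:.2f}")
```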

[LG-39] PhysEDA: Physics-Aware Learning Framework for Efficient EDA With Manhattan Distance Decay

链接: https://arxiv.org/abs/2605.10547
作者: Zetao Yang
类目: Machine Learning (cs.LG)
*备注: 9 pages, 4 figures, plus appendix. Code and data to be released upon publication

点击查看摘要

Abstract:Electronic design automation (EDA) addresses placement, routing, timing analysis, and power-integrity verification for integrated circuits. Learning methods – attention (Transformer) and reinforcement learning (RL) – have recently emerged on EDA tasks, yet face two common bottlenecks: vanilla attention’s quadratic complexity limits scaling, and data-scarce models overfit statistical noise and amplify weak long-range correlations against the underlying physics. We observe that EDA tasks share a physical prior – pairwise electrical and routing interactions decay exponentially along Manhattan distance – and integrate it as a unified inductive bias into both architecture and training. We propose PhysEDA, comprising two components Physics-Structured Linear Attention (PSLA) folds the separable Manhattan decay into the linear-attention kernel as a multiplicative bias, reducing complexity from quadratic to linear; Potential-Based Reward Shaping (PBRS) constructs a physical potential from the same kernel, providing dense reward signal under sparse RL while preserving the optimal policy via the policy-invariance theorem. Across three EDA scenarios – decoupling-capacitor placement, macro placement, and IR-drop prediction – PhysEDA improves zero-shot cross-scale transfer by 56.8% and achieves 14x inference speedup with 98.5% memory savings on 100x100 grids; PBRS adds another 10.8% in sparse-reward DPP.
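
The physical prior itself, pairwise interactions decaying exponentially with Manhattan distance, is easy to illustrate as a multiplicative attention bias. The sketch below uses plain quadratic attention for clarity; the paper's contribution of folding the same separable decay into linear attention (and the PBRS reward shaping) is not reproduced here, and the decay length is an arbitrary assumption.

```python
import numpy as np

def manhattan_decay_attention(q, k, v, coords, lam=4.0):
    """Quadratic-attention reference: softmax scores are multiplied by
    exp(-Manhattan distance / lam) before normalization (illustrative only)."""
    scores = np.exp(q @ k.T / np.sqrt(q.shape[-1]))
    dist = np.abs(coords[:, None, :] - coords[None, :, :]).sum(-1)  # |dx| + |dy|
    weighted = scores * np.exp(-dist / lam)
    weights = weighted / weighted.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d = 16, 8
coords = rng.integers(0, 10, size=(n, 2)).astype(float)  # cell positions on a grid
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
print(manhattan_decay_attention(q, k, v, coords).shape)   # (16, 8)
```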

[LG-40] Higher Resolution Better Generalization: Unlocking Visual Scaling in Deep Reinforcement Learning

链接: https://arxiv.org/abs/2605.10546
作者: Raphael Trumpp,Ömer Veysel Çağatan,Barış Akgün,Marco Caccamo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pixel-based deep reinforcement learning agents are typically trained on heavily downsampled visual observations, a convention inherited from early benchmarks rather than grounded in principled design. In this work, we show that observation resolution is a critical yet overlooked variable for policy learning: higher-resolution inputs can substantially improve both performance and generalization, provided the network architecture can process them effectively. We find that the widely used Impala encoder, which flattens spatial features into a vector, suffers from quadratic parameter growth as resolution increases and fails to leverage the additional visual detail. Replacing this operation with global average pooling, as in the Impoola architecture, decouples parameter count from resolution and yields consistent improvements across resolutions and network widths - at their respective best conditions, visual scaling unlocks a 28 % performance gain for Impoola over Impala. These gains are strongest in environments that require precise perception of small or distant objects, and gradient saliency analysis confirms that the underlying mechanism is a more spatially localized visual attention of the policy at higher resolutions. Our results challenge the prevailing practice of aggressive input downsampling and position resolution-independent architectures as a simple, effective path toward scalable visual deep RL. To facilitate future research on resolution scaling in deep RL, we publicly release the open-source code for the Procgen-HD benchmark: this https URL.
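
The architectural claim, that flattening ties the head size to input resolution while global average pooling decouples it, can be checked with a toy encoder. The sketch below uses a simplified two-layer convnet rather than the actual Impala/Impoola architectures.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Toy pixel encoder; `pool` selects flatten (Impala-style readout)
    or global average pooling (Impoola-style readout)."""
    def __init__(self, pool: str, hidden: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.pool = pool
        self.head = nn.LazyLinear(hidden)  # input width inferred on first call

    def forward(self, x):
        f = self.conv(x)
        # GAP yields a 64-dim vector at any resolution; flatten grows with H*W.
        f = f.mean(dim=(2, 3)) if self.pool == "gap" else f.flatten(1)
        return self.head(f)

for res in (64, 128, 256):
    for pool in ("flatten", "gap"):
        model = TinyEncoder(pool)
        model(torch.zeros(1, 3, res, res))  # materialize the lazy head
        n_params = sum(p.numel() for p in model.parameters())
        print(f"resolution {res:3d}  {pool:7s}  params {n_params:,}")
```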

[LG-41] ConfoundingSHAP: Quantifying confounding strength in causal inference

链接: https://arxiv.org/abs/2605.10533
作者: Marie Brockschmidt,Santo M.A.R. Thies,Maresa Schröder,Dennis Frauen,Valentyn Melnychuk,Maximilian Muschalik,Eyke Hüllermeier,Stefan Feuerriegel
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In causal inference, confounders are variables that influence both treatment decisions and outcomes. However, unlike as in randomized clinical trials, the treatment assignment mechanism in observational studies is not known, and it is thus unclear which covariates act as confounders. Here, we aim to generate insight for causal inference and answer: which of the observed covariates act as confounders? We introduce ConfoundingSHAP, a Shapley-based method for attributing confounding strength to individual covariates. Our contributions are twofold. First, we propose a Shapley game targeted to infer the confounding strength of the covariates. Our resulting Shapley values differ from the standard applications of SHAP explanations on causal targets, such as understanding treatment effect heterogeneity, which are ill-suited for our task. Second, as our task requires evaluating the value function over many adjustment sets, we provide a scalable TabPFN-based estimation that avoids exhaustive refitting. We demonstrate the practical value across various datasets, where ConfoundingSHAP provides informative explanations of which observed covariates drive confounding and thereby helps to provide more insight for causal inference in practice.

[LG-42] Priority-Driven Control and Communication in Decentralized Multi-Agent Systems via Reinforcement Learning

链接: https://arxiv.org/abs/2605.10482
作者: Qingyun Guo,Junyi Shi,Tomasz Piotr Kucner,Dominik Baumann
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Accepted to the 23rd IFAC World Congress

点击查看摘要

Abstract:Event-triggered control provides a mechanism for avoiding excessive use of constrained communication bandwidth in networked multi-agent systems. However, most existing methods rely on accurate system models, which may be unavailable in practice. In this work, we propose a model-free, priority-driven reinforcement learning algorithm that learns communication priorities and control policies jointly from data in decentralized multi-agent systems. By learning communication priorities, we circumvent the hybrid action space typical in event-triggered control with binary communication decisions. We evaluate our algorithm on benchmark tasks and demonstrate that it outperforms the baseline method.

[LG-43] Regret Minimization in Bilateral Trade With Perturbed Markets

链接: https://arxiv.org/abs/2605.10475
作者: Anna Lunghi,Matteo Castiglioni,Alberto Marchesi
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We address the problem of maximizing Gain from Trade (GFT) in repeated buyer-seller exchanges subject to global budget balance constraints. While this problem is well-understood in purely adversarial and stochastic settings, these environments exhibit a sharp dichotomy: adversarial environments allow for no-regret learning against the best fixed-price mechanism, whereas stochastic environments allow for no-regret learning against the best distribution over prices that is budget balanced in expectation. This gap is significant, as policies balanced in expectation can increase the GFT by a multiplicative factor of two. In this work, we bridge these extremes by studying perturbed markets, where an underlying stochastic distribution is subject to an adversarial corruption C. We design an algorithm that adaptively scales with the level of corruption, achieving an \tilde{\mathcal{O}}(T^{3/4}) + \mathcal{O}(C\log(T)) regret bound against the best budget-balanced distribution over prices. Simultaneously, our algorithm maintains the worst-case \tilde{\mathcal{O}}(T^{3/4}) regret bound relative to a per-round budget-balanced baseline, ensuring optimality even in fully adversarial environments.

[LG-44] Can Muon Fine-tune Adam-Pretrained Models?

链接: https://arxiv.org/abs/2605.10468
作者: Xingyu Qu,Peigeng Huang,Samuel Horvath
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Muon has emerged as an efficient alternative to Adam for pretraining, yet remains underused for fine-tuning. A key obstacle is that most open models are pretrained with Adam, and naively switching to Muon for fine-tuning leads to degraded performance due to an optimizer mismatch. We investigate this mismatch through controlled experiments and relate it to the distinct implicit biases of Adam and Muon. We provide evidence that the mismatch disrupts pretrained knowledge, and that this disruption scales with update strength. This leads us to hypothesize that constraining updates should mitigate the mismatch. We validate this with LoRA: across language and vision tasks, LoRA reduces the performance gap between Adam and Muon observed under full fine-tuning. Studies on LoRA rank, catastrophic forgetting, and LoRA variants further confirm that mismatch severity correlates with update strength. These results shed light on how optimizer mismatch affects fine-tuning and how it can be mitigated. Our code is available at this https URL.

[LG-45] Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition

链接: https://arxiv.org/abs/2605.10466
作者: Haoren Xu,Guanhua Fang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) exhibit two striking and ostensibly unrelated behaviours: in-context learning (ICL) and repetitive generation. In both, the model behaves as though it had summarised the context into a population-level statistic and discarded token-level detail. We ask whether this "summarisation and forgetting" can be derived from the attention mechanism itself, and answer in the affirmative. Under stationary, ergodic and elliptical inputs, the softmax attention output converges almost surely to \Theta_V\Sigma\Theta_K^\top\Theta_Q x_t , where \Sigma is the input covariance; the long-context limit is therefore a linear readout of the input’s second-order statistics. Two consequences follow. (i) For in-context linear regression, a single softmax head can implement one step of population gradient descent. Stacking such heads with residual connections iterates this update and implements multiple gradient descent steps. (ii) Propagated across an L -layer transformer, this readout drives the terminal hidden state at the parametric 1/t rate to a deterministic function of the current token alone, so that autoregressive generation collapses asymptotically to a first-order Markov chain whose attracting orbits furnish a structural account of repetition and mode collapse. The two phenomena thus emerge as facets of a single covariance-readout principle.

[LG-46] QT-Net: Rethinking Evaluation of AI Models in Atomic Chemical Space

链接: https://arxiv.org/abs/2605.10458
作者: Pablo Martínez Crespo,Stefano Ribes,Martin Rahm,Richard Beckmann,Robert S. Jordan,Marisa Gliege,Santiago Miret,Vijay Kris Narasimhan,Rocío Mercado
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

Abstract:Atomic properties such as partial charges or multipoles encode chemically meaningful information that can inform downstream molecular property prediction, but their evaluation as machine learning targets has been complicated by the absence of a principled out-of-distribution evaluation protocol at the atomic level. In this work, we propose a held-out evaluation protocol that clusters atomic environments by SOAP descriptors and computes metrics accounting only for cluster labels unseen during training. Following this procedure, we use 5 \times 5 cross-validation and Tukey’s HSD to run a statistically rigorous comparison of E(3)-equivariant against non-equivariant, rotationally augmented models for predicting electron populations and multipoles of H, C, N, and O atoms. Building on our results, we introduce the Quantum Topological Neural Network (QT-Net), a rotationally augmented, non-equivariant graph neural network. We show that QT-Net can be used to infer properties of atoms in molecules from QM9 outside our training set, and that these inferred properties can yield improvement when used as input features for downstream molecular property prediction. To further validate the framework, molecular dipole moments computed from QT-Net’s per-atom outputs recover the ground-truth values reported in QM9. We release all code and data, including a JAX implementation of QT-Net, to support the broader use of learned QTA properties as inductive biases for atomic-scale molecular machine learning.

[LG-47] AxiomOcean: Forecasting the Three-Dimensional Structure of the Upper Ocean

链接: https://arxiv.org/abs/2605.10455
作者: Sensen Wu,Yifan Chen,Guantao Pu,Xiaoyao Sun,Yijun Chen,Jin Qi,Ming Kong,Keyi Yang,Lichen Xu,Wenguan Wang,Xiaofeng Li,Zhenhong Du
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Short-term ocean forecast skill depends strongly on the three-dimensional ocean structure of the upper ocean, which governs stratification, subsurface heat storage, and the response of the ocean to atmospheric forcing. However, AI ocean forecasting models often fail to preserve this vertical structure, resulting in over-smoothed subsurface features and weak physical consistency under strong forcing. Here, we present AxiomOcean, a global AI ocean forecasting model that explicitly represents vertical hierarchy and cross-layer dependence within the water column. By combining a fully three-dimensional encoder-backbone-decoder architecture with surface atmospheric forcing, AxiomOcean jointly predicts upper-ocean temperature, salinity, and three-dimensional currents at global 1/12° resolution down to 643 m depth. In 10-day forecasts, AxiomOcean outperforms an advanced AI comparison model across variables and lead times, reducing day-1 RMSE by approximately 20 to 35% while maintaining higher anomaly correlation. The gain is not achieved through excessive smoothing: AxiomOcean better preserves eddy kinetic energy, temperature and salinity variance. Its advantage also extends through the water column and remains evident across the equatorial Pacific, Kuroshio Extension, and Southern Ocean, yielding a more realistic reconstruction of upper-ocean heat content. These results show that explicitly preserving upper-ocean three-dimensional structure can improve both forecast accuracy and physical fidelity in AI ocean prediction.

[LG-48] Don't Fix the Basis – Learn It: Spectral Representation with Adaptive Basis Learning for PDEs

链接: https://arxiv.org/abs/2605.10451
作者: Xuxiang Zhao,Angelica I. Aviles-Rivero
类目: Machine Learning (cs.LG); Functional Analysis (math.FA); Numerical Analysis (math.NA)
*备注: 26 pages, 4 figures

点击查看摘要

Abstract:Spectral neural operators achieve strong performance for PDE learning, but rely on fixed global bases that limit their ability to represent spatially heterogeneous and multiscale dynamics. We propose Adaptive Basis Learning (ABLE), a framework that learns data-dependent spectral representations instead of relying on predefined bases. ABLE constructs a spatially adaptive Parseval frame via a learned ancillary density, enabling the operator to act in a lifted spectral space while preserving invertibility and maintaining O(N\log N) complexity through FFT-based implementation. This shifts the source of expressivity from spectral coefficients to the representation itself, allowing the model to capture localized structures and non-translation-invariant interactions more efficiently. ABLE integrates seamlessly into existing neural operator architectures as a drop-in replacement for spectral layers. Across a range of benchmarks ABLE improves accuracy over strong baselines, with the largest gains in regimes characterized by sharp gradients and multiscale behavior. Moreover, augmenting existing models (e.g., U-FNO, HPM) with ABLE further enhances their performance, demonstrating its role as a general and complementary spectral refinement. Our results highlight that the data-driven choice of representation, rather than operator complexity alone, is a key bottleneck in neural operator design. By learning the basis itself, ABLE provides a principled and efficient framework for improving spectral methods in PDE learning.

[LG-49] DRIFT: Drift-Resilient Invariant-Feature Transformer for DGA Detection DSN2026

链接: https://arxiv.org/abs/2605.10436
作者: Chaeyoung Lee,Chaeri Jung,Seonghoon Jeong
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: 14 pages, 7 figures, 8 tables. Accepted to appear in Proc. of the 56th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2026)

点击查看摘要

Abstract:Domain Generation Algorithms (DGAs) evolve continuously to evade botnet detection, posing a persistent challenge for dependable network defense. While deep learning-based detectors achieve strong performance under static conditions, they suffer severe degradation when facing temporal drift. Through a 9-year longitudinal study (2017-2025), we empirically show that state-of-the-art character- and word-based DGA classifiers rapidly lose effectiveness as new DGA variants emerge. To address this problem, we propose a drift-resilient Transformer-based framework that learns invariant representations through a hybrid tokenization strategy and multi-task self-supervised pre-training. The model integrates (i) character-level encoding to capture stochastic morphological patterns and (ii) subword-level encoding for word-based DGAs. Three pre-training tasks enable the model to learn robust structural and contextual features prior to supervised fine-tuning. Comprehensive evaluations demonstrate that our method significantly mitigates temporal degradation and consistently outperforms state-of-the-art baselines in forward-chaining experiments. The proposed approach offers a dependable foundation for long-term DGA defense in evolving threat landscapes. Our code is available at: this https URL.

[LG-50] Remember to Forget: Gated Adaptive Positional Encoding

链接: https://arxiv.org/abs/2605.10414
作者: Riccardo Ali,Alessio Borgi,Christopher Irwin,Mario Severino,Pietro Liò
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Rotary Positional Encoding (RoPE) is widely used in modern large language models. However, when sequences are extended beyond the range seen during training, rotary phases can enter out-of-distribution regimes, leading to spurious long-range alignments, diffuse attention, and degraded retrieval. Existing remedies only partially address these failures, as they often trade local positional resolution for long-context stability. We propose GAPE (Gated Adaptive Positional Encoding), a drop-in augmentation for positional encodings that introduces a content-aware bias directly into the attention logits while preserving the rotary geometry. GAPE decouples distance-based suppression from token importance through a query-dependent gate that contracts irrelevant context and a key-dependent gate that preserves salient distant tokens. We prove that protected tokens remain accessible, while the attention mass assigned to unprotected distant tokens decays as a function of the query gate. We further show that GAPE can be implemented within standard scaled dot-product attention. We validate these properties empirically, finding that GAPE consistently yields sharper attention and improved long-context robustness over rotary baselines across both synthetic retrieval and long-context benchmarks.

[LG-51] Equilibrium Residuals Expose Three Regimes of Matrix-Game Strategic Reasoning in Language Models

链接: https://arxiv.org/abs/2605.10410
作者: Wenhua Nie,Binhan Luo,Zijie Meng,Jyh-Shing Roger Jang,Ching-Wen Ma
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models can score well on named game-theory benchmarks while failing on the same strategic computation once semantic cues are removed. We show this gap with procedurally generated zero-sum matrix games: a model that recognizes familiar games drops to 34%, 18%, and 2% success on anonymous 2\times2 , 3\times3 , and 5\times5 payoff matrices. The benchmark separates semantic recall, learned approximate Nash computation, and an output-interface bottleneck that limits scale. Training only on 2\times2 and 3\times3 games, supervised fine-tuning raises unseen 5\times5 – 7\times7 success from 2% to 61%, while exploitability-reward training averages 37% with high seed variance. We prove that the exploitability residual is 2 -Lipschitz in payoff perturbations, unlike discontinuous vertex-returning LP equilibrium selectors, explaining why residual training can transfer under payoff shifts even when formatting instability limits mean performance. A dominated-action padding experiment provides causal evidence: trained models solve 3\times3 games embedded in much larger matrices, while random-padded controls fail and dense 12\times12 games remain near failure. Procedural evaluation is therefore necessary for measuring strategic reasoning, and residual rewards expose a real but format-limited route to approximate equilibrium computation.

[LG-52] Identified-Set Geometry of Distributional Model Extraction under Top-K Censored API Access

链接: https://arxiv.org/abs/2605.10407
作者: Wenhua Nie,ZiCheng Zhu,Jianan Wu,Binhan Luo,Haoran Zheng,Jyh-Shing Roger Jang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modern LLM APIs often reveal only top- K logit scores and censor the remaining vocabulary. We study the per-position distribution-recovery limits of this access model. For censoring threshold \tau , the compatible teacher distributions form an identified set whose total-variation diameter is exactly U_K=(V-K)\exp(\tau)/(Z_A+(V-K)\exp(\tau)) , where Z_A is the observed partition function. For KL recovery, we give a computable binary-endpoint lower bound and an asymptotically matching small-ambiguity upper bound, with an extension to reference-aware attackers. Experiments on a Qwen3 math-reasoning teacher reveal a layered extraction hierarchy: on-task top- K distillation recovers 12% of private capability, full-logit distillation recovers 56% despite 99% KL closure, and generation-based extraction recovers 96%. Top- K censoring therefore limits per-position distribution recovery but does not by itself prevent capability extraction, separating fidelity from transfer in prompt-only logit distillation.
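
Since the identified-set diameter is given in closed form, the attacker's residual ambiguity can be computed directly for any API configuration. The values of V, K, tau, and Z_A below are illustrative and not taken from the paper's experiments.

```python
import math

def tv_diameter(V, K, tau, Z_A):
    """U_K = (V - K) * exp(tau) / (Z_A + (V - K) * exp(tau)): total-variation
    diameter of the teacher distributions compatible with top-K censoring."""
    hidden = (V - K) * math.exp(tau)
    return hidden / (Z_A + hidden)

# Illustrative setting: 50k-token vocabulary, top-20 logits revealed.
for tau in (-5.0, -10.0, -15.0):
    print(f"tau = {tau:6.1f}   U_K = {tv_diameter(50_000, 20, tau, Z_A=1.0e4):.6f}")
```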

[LG-53] Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization

链接: https://arxiv.org/abs/2605.10405
作者: Elad Tolochinsky,Yaniv Tenzer,Yaniv Romano
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Selecting the best large language model (LLM) for a fixed benchmark is often expensive, since exhaustive evaluation requires running every model on every example. Multi-armed bandit (MAB) algorithms can reduce the number of LLM calls by sequentially selecting the next model-example pair to evaluate, thereby avoiding wasted evaluations on clearly underperforming models. Further savings can be achieved by predicting model scores from the partially observed model-example score matrix using low-rank factorization. However, such predictions are not ground truth: they can be biased and may therefore lead to incorrect identification of the best model. In this work, we propose a principled framework that combines MAB with cheap predicted scores without compromising statistical validity. Specifically, we derive doubly robust estimators of each model’s performance that use the low-rank predictions to reduce variance. This enables the construction of valid finite-sample confidence intervals in our setting, where models are selected adaptively and examples are sampled without replacement. Empirical results on real-world benchmarks show that our approach reduces the number of required evaluations, yielding meaningful savings in compute and cost while accurately identifying the best-performing model.
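
The variance-reduction idea, predicted scores everywhere plus inverse-propensity-corrected residuals on the evaluated entries, can be sketched in a few lines. This simplified version assumes known and independent observation probabilities, unlike the paper's adaptive, without-replacement sampling, and all quantities are synthetic.

```python
import numpy as np

def doubly_robust_mean(y_obs, observed, y_pred, prop):
    """Doubly robust estimate of a model's mean benchmark score: use the cheap
    predictions everywhere and correct them with importance-weighted residuals
    on the subset of examples that were actually evaluated."""
    correction = observed * (y_obs - y_pred) / prop
    return np.mean(y_pred + correction)

rng = np.random.default_rng(0)
n = 1_000
true_scores = rng.binomial(1, 0.7, size=n).astype(float)         # per-example correctness
y_pred = np.clip(true_scores + rng.normal(0, 0.3, n), 0.0, 1.0)  # noisy low-rank predictions
prop = np.full(n, 0.2)                                           # evaluate ~20% of examples
observed = rng.binomial(1, prop).astype(float)
y_obs = true_scores * observed
print("true mean:", true_scores.mean(),
      " DR estimate:", round(doubly_robust_mean(y_obs, observed, y_pred, prop), 3))
```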

[LG-54] Causal Explanations from the Geometric Properties of ReLU Neural Networks

链接: https://arxiv.org/abs/2605.10396
作者: Hector Woods,Philippa Ryan,Rob Alexander
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 7 pages, 0 figures, Accepted for presentation at the Yorkshire Innovation in Science and Engineering Conference

点击查看摘要

Abstract:Neural networks have proved an effective means of learning control policies for autonomous systems, but these learned policies are difficult to understand due to the black-box nature of neural networks. This lack of interpretability makes safety assurance for such autonomous systems challenging. The fields of eXplainable Artificial Intelligence (XAI) and eXplainable Reinforcement Learning (XRL) aim to interpret the decision making processes of neural networks and autonomous agents, respectively. In particular, work on causal explanations aims to provide “why” and “why not” explanations for why a model made a given decision. However, most of the work on explainability to date utilises a distilled version of the original model. While this distilled policy is interpretable, it necessarily degrades in performance significantly when compared to the original model, and is not guaranteed to be an accurate reflection of the decision making processes in the original model and as such cannot be used to guarantee its safety. Recent work on understanding the geometry of ReLU neural networks shows that a ReLU network corresponds to a piecewise linear function divided into regions defined by an n-dimensional convex polytope. Through this lens, a neural network can be understood as dividing the input space into distinct regions which apply a single linear function for each output neuron. We show that this geometric representation can be used to generate causal explanations for the network’s behaviour similar to previous work, but which extracts rules directly from the geometry of Neural Networks with the ReLU activation function, and is therefore an accurate reflection of the network’s behaviour.

[LG-55] The Polynomial Counting Capabilities of Message Passing Neural Networks

链接: https://arxiv.org/abs/2605.10393
作者: Marco Sälzer,Pascal Bergsträßer,Anthony W. Lin
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

Abstract:The counting power of Message Passing Neural Networks (MPNN) has been the subject of many recent papers, showing that they can express logic that involves counting up to a threshold or more generally satisfy a linear arithmetic constraint. In this paper, we study the counting capabilities of MPNN beyond linear arithmetic, primarily utilising local and global mean aggregations. In particular, our goal is to tease out conditions required to express extensions of graded modal logic with polynomial counting constraints. We show that global polynomial counting constraints in node-labelled graphs can be checked using mean MPNN under mild assumptions. Checking local constraints is also possible, if we consider formulas with no nested modalities and additionally either (i) permit sum/max aggregations, or (ii) only restrict to regular graphs. We also show how formulas with nested modalities can be captured by mean MPNN over graphs with tree-like structures and similar assumptions.

[LG-56] DeepLévy: Learning Heavy-Tailed Uncertainty in Highly Volatile Time Series

链接: https://arxiv.org/abs/2605.10364
作者: Yang Yang,Du Yin,Hao Xue,Flora Salim
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modeling uncertainty in heavy-tailed time series remains a critical challenge for deep probabilistic forecasting models, which often struggle to capture abrupt, extreme events. While Lévy stable distributions offer a natural framework for modeling such non-Gaussian behaviors, the intractability of their probability density functions severely limits conventional likelihood-based inference. To address this, we introduce DeepLévy, a neural framework that learns mixtures of Lévy stable distributions by minimizing the discrepancy between empirical and parametric characteristic functions. DeepLévy incorporates a mixture mechanism that adaptively learns context-dependent weights and parameters over multiple Lévy components, enabling flexible multi-horizon uncertainty modeling. Evaluations on both real and synthetic datasets demonstrate that DeepLévy outperforms state-of-the-art deep probabilistic forecasting approaches in tail risk metrics, especially under extreme volatility.
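
The characteristic-function objective is straightforward to illustrate for a single symmetric stable component; the full model learns context-dependent mixtures, and the parameters below are arbitrary. Because a Cauchy sample is alpha-stable with alpha = 1, the discrepancy should be smaller at alpha = 1 than at alpha = 2.

```python
import numpy as np

def stable_cf(t, alpha, gamma, delta):
    """Characteristic function of a symmetric alpha-stable law (beta = 0)."""
    return np.exp(1j * delta * t - np.abs(gamma * t) ** alpha)

def cf_distance(samples, alpha, gamma, delta, t_grid):
    """Mean squared gap between the empirical characteristic function of the
    samples and the parametric stable CF (single-component sketch)."""
    emp_cf = np.mean(np.exp(1j * np.outer(t_grid, samples)), axis=1)
    return np.mean(np.abs(emp_cf - stable_cf(t_grid, alpha, gamma, delta)) ** 2)

rng = np.random.default_rng(0)
samples = rng.standard_cauchy(5_000)        # Cauchy = stable with alpha = 1
t_grid = np.linspace(-3.0, 3.0, 61)
print("alpha = 1:", cf_distance(samples, 1.0, 1.0, 0.0, t_grid))
print("alpha = 2:", cf_distance(samples, 2.0, 1.0, 0.0, t_grid))
```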

[LG-57] Foundations of Reliable Inference: Reliability-Efficiency Co-Design

链接: https://arxiv.org/abs/2605.10351
作者: Jiayi Huang
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: PhD Thesis

点击查看摘要

Abstract:Reliable inference requires that artificial intelligence (AI) models provide trustworthy uncertainty estimates, not merely accurate predictions. Recent advances in Bayesian learning have made significant progress toward this goal, and growing concerns about computational overhead have jointly shifted the design criterion from reliability alone to the co-design of reliability and efficiency, i.e., reducing computational overhead while preserving trustworthy uncertainty quantification. This thesis develops a unified framework from two perspectives to address the central question: can we efficiently perform reliable inference?

[LG-58] Signature Approach for Contextual Bandits with Nonlinear and Path-dependent Rewards

链接: https://arxiv.org/abs/2605.10313
作者: Xin Guo,Grace He,Xinyu Li
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We study contextual bandits with nonlinear and path-dependent rewards through a novel signature-transform-based approach. Leveraging the universal nonlinearity property of signatures, we approximate continuous path-dependent reward functionals by linear functionals in the signature space. This representation enables the use of efficient linear contextual bandit methods while preserving expressive sequential structure. Building on this framework, we propose \texttt{DisSigUCB}, a signature-based disjoint upper confidence bound (UCB) algorithm. Under boundedness and non-degeneracy assumptions, we prove a high-probability data-dependent sublinear regret bound of order \tilde{\mathcal{O}}(\sqrt{(d+m)KT}), where d is the context dimension and m is the signature feature dimension. Synthetic experiments and numerical applications on temperature sensor monitoring, sleep-stage classification, and hospital nurse staffing demonstrate that \texttt{DisSigUCB} consistently outperforms classical linear and kernelized contextual bandit baselines in nonlinear and path-dependent settings.
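
The feature construction behind this approach, truncating the path signature and feeding it to a linear bandit, can be sketched at a low truncation level. The level-2 computation below uses increment sums for a piecewise-linear context path; the truncation depth and the rest of the \texttt{DisSigUCB} pipeline are simplifications.

```python
import numpy as np

def signature_level2(path):
    """Level-1 and level-2 signature terms of a piecewise-linear path of shape
    (n_points, d), flattened into a single feature vector for a linear model."""
    increments = np.diff(path, axis=0)           # segment increments, shape (n-1, d)
    level1 = increments.sum(axis=0)              # x_T - x_0
    d = path.shape[1]
    level2 = np.zeros((d, d))
    running = np.zeros(d)                        # x_t - x_0 at each segment start
    for delta in increments:
        level2 += np.outer(running, delta) + 0.5 * np.outer(delta, delta)
        running += delta
    return np.concatenate([level1, level2.ravel()])

rng = np.random.default_rng(0)
path = rng.normal(size=(10, 2)).cumsum(axis=0)   # toy 2-d context path
features = signature_level2(path)                # would feed a LinUCB-style model
print(features.shape)                            # (2 + 4,) = (6,)
```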

[LG-59] Follow the Mean: Reference-Guided Flow Matching

链接: https://arxiv.org/abs/2605.10302
作者: Pedro M. P. Curvo,Maksim Zhdanov,Floor Eijkelboom,Jan-Willem van de Meent
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Existing approaches to controllable generation typically rely on fine-tuning, auxiliary networks, or test-time search. We show that flow matching admits a different control interface: adaptation through examples. For deterministic interpolants, the velocity field is solely governed by a conditional endpoint mean; shifting this mean shifts the flow itself. This yields a simple principle for controllable generation: steer a pretrained model by changing the reference set it follows. We instantiate this idea in two forms. Reference-Mean Guidance is training-free: it computes a closed-form endpoint-mean correction from a reference bank and applies it to a frozen FLUX.2-klein (4B) model, enabling control of color, identity, style, and structure while keeping the prompt, seed, and weights fixed. Semi-Parametric Guidance amortizes the same idea through an explicit mean anchor and learned residual refiner, matching unconditional DiT-B/4 quality on AFHQv2 while allowing the reference set to be swapped at inference time. These results point to a broader direction: generative models that adapt through data, not parameter updates.

[LG-60] Nearly-Optimal Algorithm for Adversarial Kernelized Bandits

链接: https://arxiv.org/abs/2605.10299
作者: Shogo Iwazaki
类目: Machine Learning (cs.LG)
*备注: 47 pages

点击查看摘要

Abstract:This paper studies kernelized bandits (also known as Gaussian process bandits) in an adversarial environment, where the reward functions in a known reproducing kernel Hilbert space (RKHS) may be adversarially chosen at each round. We show that the exponential-weight algorithm achieves \tilde{O}(\sqrt{T \gamma_T}) adversarial regret, where T and \gamma_T denote the number of total rounds and the maximum information gain, respectively. For squared exponential (SE) and \nu-Matérn kernels, we also show algorithm-independent lower bounds that guarantee the optimality of our algorithm up to polylogarithmic factors. Furthermore, we present a computationally efficient variant of our algorithm using Nyström approximation while maintaining nearly optimal regret guarantees.

[LG-61] Set Prediction for Next-Day Active Fire Forecasting

链接: https://arxiv.org/abs/2605.10298
作者: Yuchen Bai,Georgios Athanasiou,Xin Yu,Diogenis Antonopoulos,Ioannis Papoutsis,Stijn Hantson,Nuno Carvalhais
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate next-day active fire forecasts can support early warning, disaster response, forest risk assessment, and downstream estimation of fire-related carbon emissions. Existing machine learning approaches to wildfire forecasting typically predict wildfire danger or fire probability on kilometre-scale daily grids, which is useful for regional warning but does not directly represent localized fire events. We propose Wildfire Ignition Set Predictor (WISP), a query-based model that reformulates next-day active fire forecasting as point-set prediction. From 48 hours of covariates including meteorology, satellite vegetation products, static land, and fire history, WISP predicts a fixed-size ranked set of future active fire cluster centres on a 375 m grid across globally distributed regions. The model is trained end-to-end with Hungarian matching; to address the conflicting roles of the classification score in assignment, ranking, and query activation, we use asymmetric classification-localization weighting in matching and loss. We further construct a globally distributed, hourly, multi-source benchmark for this task. On a held-out test set spanning fire regions worldwide, the best WISP variant achieves 38.2% average precision (AP) for ranked fire-centre detections, covers 53.4% of fire cluster mass weighted by fire radiative power (FRP), and localizes 54.1% of observed clusters within 5 km. These results establish sparse set prediction as a viable formulation for high-resolution wildfire forecasting and provide a benchmark for future work in this regime.
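
Set prediction of this kind is trained by matching predicted points to ground-truth cluster centres with the Hungarian algorithm. The sketch below shows a generic matching step; the cost weights and the exact classification/localization terms are illustrative rather than the paper's.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_xy, pred_conf, true_xy, w_cls=1.0, w_loc=1.0):
    """Hungarian matching between predicted fire centres and observed cluster
    centres; cost mixes a localization distance with a (negated) confidence."""
    dist = np.linalg.norm(pred_xy[:, None, :] - true_xy[None, :, :], axis=-1)
    cost = w_loc * dist - w_cls * pred_conf[:, None]
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))

pred_xy = np.array([[0.10, 0.20], [0.80, 0.70], [0.50, 0.50]])
pred_conf = np.array([0.9, 0.2, 0.6])
true_xy = np.array([[0.12, 0.18], [0.52, 0.48]])
print(match_predictions(pred_xy, pred_conf, true_xy))  # [(0, 0), (2, 1)]
```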

[LG-62] Sample-Mean Anchored Thompson Sampling for Offline-to-Online Learning with Distribution Shift

链接: https://arxiv.org/abs/2605.10289
作者: Bochao Li,Yao Fu,Wei Chen,Fang Kong
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Offline-to-online learning aims to improve online decision-making by leveraging offline logged data. A central challenge in this setting is the distribution shift between offline and online environments. While some existing works attempt to leverage shifted offline data, they largely rely on UCB-type algorithms. Thompson sampling (TS) represents another canonical class of bandit algorithms, well known for its strong empirical performance and naturally suited to offline-to-online learning through its Bayesian formulation. However, unlike UCB indices, posterior samples in TS are not guaranteed to be optimistic with respect to the true arm means. This makes indices constructed from purely online and hybrid data difficult to compare and complicates their use. To address this issue, we propose sample-mean anchored TS (Anchor-TS), which introduces a novel median-based anchoring rule that defines the arm index as the median of an online posterior sample, a hybrid posterior sample, and the online sample mean. The median anchoring systematically corrects bias induced by distribution shift by mitigating over-estimation for suboptimal arms and under-estimation for optimal arms, while exploiting offline information to obtain more accurate estimates when the shift is small. We establish theoretical guarantees showing that the proposed algorithm safely leverages offline data to accelerate online learning, and quantifying how the degree of distribution shift and the size of offline data affect the resulting regret reduction. Extensive experiments demonstrate consistent improvements of our algorithm over baselines.
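
The median anchoring rule itself is easy to state in code. Below is a single-arm sketch with Gaussian conjugate posteriors and a toy distribution shift; the prior/noise variances and data are illustrative, and the full algorithm runs this index inside a bandit loop over all arms.

```python
import numpy as np

rng = np.random.default_rng(0)

def anchor_ts_index(online_rewards, offline_rewards, prior_var=1.0, noise_var=1.0):
    """Median of (i) an online posterior sample, (ii) a hybrid posterior sample
    built from online + offline data, and (iii) the online sample mean."""
    def posterior(rewards):
        n = len(rewards)
        var = 1.0 / (1.0 / prior_var + n / noise_var)
        mean = var * rewards.sum() / noise_var
        return mean, var

    mu_on, var_on = posterior(online_rewards)
    mu_hy, var_hy = posterior(np.concatenate([online_rewards, offline_rewards]))
    sample_online = rng.normal(mu_on, np.sqrt(var_on))
    sample_hybrid = rng.normal(mu_hy, np.sqrt(var_hy))
    online_mean = online_rewards.mean() if len(online_rewards) else 0.0
    return np.median([sample_online, sample_hybrid, online_mean])

# Toy shift: offline logs over-estimate this arm relative to the online environment.
online = rng.normal(0.2, 1.0, size=15)
offline = rng.normal(0.8, 1.0, size=200)
print("median-anchored index:", round(float(anchor_ts_index(online, offline)), 3))
```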

[LG-63] BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization

链接: https://arxiv.org/abs/2605.10288
作者: Hengrui Zhang,Boao Kong,Engao Zhang,Kun Yuan
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Stochastic bilevel optimization (SBO) has become a standard framework for hyperparameter learning, data reweighting, representation learning, and data-mixture optimization in deep learning. Existing exact single-loop SBO methods and memory-efficient surrogate SBO methods either create severe memory pressure for large lower-level neural networks or lack competitive convergence guarantees under standard assumptions. In this paper, we propose BROS, a memory-efficient single-loop SBO method with the same convergence rate order as exact single-loop SBO methods. BROS performs lower and auxiliary updates in randomized subspaces with a Rademacher bi-probe correction that recovers an unbiased Hessian-action estimator. We prove that BROS preserves the \mathcal{O}(\varepsilon^{-2}) sample complexity of MA-SOBA for finding an \varepsilon-stationary point under only standard assumptions. Experiments on hyper-data cleaning, data-mixture learning, hyper-representation learning, and ViT sample reweighting show that BROS reduces peak memory by up to 44.9% while closely matching full-space baseline performance.

[LG-64] DeepLog: A Software Framework for Modular Neurosymbolic AI IJCAI2026

链接: https://arxiv.org/abs/2605.10279
作者: Robin Manhaeve,Stefano Colamonaco,Vincent Derkinderen,Rik Adriaensen,Lucas Van Praet,Luc De Raedt,Giuseppe Marra
类目: Machine Learning (cs.LG)
*备注: Preprint accepted at IJCAI2026 Demo Track

点击查看摘要

Abstract:DeepLog is an operational neurosymbolic framework that unifies logic and deep learning within standard PyTorch workflows. While existing neurosymbolic systems focus on a particular paradigm and semantics, DeepLog serves as a universal backend that can emulate many systems in the neurosymbolic alphabet soup. By treating diverse neurosymbolic languages as high-level specifications, the DeepLog software automatically compiles them into optimized arithmetic circuits. This design lowers the barrier for machine learning practitioners by treating logic as composable modules, while providing neurosymbolic developers with a shared, high-performance basis for prototyping new integration strategies. The code is available here: this https URL

[LG-65] Predictive Radiomics for Evaluation of Cancer Immune SignaturE in Glioblastoma: the PRECISE-GBM study

链接: https://arxiv.org/abs/2605.10278
作者: Prajwal Ghimire,Junjie Li,Liu Yaou,Marc Modat,Thomas Booth
类目: Machine Learning (cs.LG)
*备注: Abstract : 226; Importance of study: 109; Manuscript: 5690 (excluding references) Figures: 4, Tables: 2 Supplemental File: 1

点击查看摘要

Abstract:Background: Radiogenomics allows identification of radiological biomarkers for genomic phenotypes. In glioblastoma, these biomarkers could potentially complement patient stratification strategies. We aim to develop and analytically validate radiological biomarkers that capture immune cell signatures within IDH-wildtype glioblastoma microenvironment using radiogenomic analysis. Methods: This was a retrospective multicenter study using curated open-access anonymized imaging and genomic data from TCGA-GBM, CPTAC, IvyGAP, REMBRANDT and CGGA datasets. Imaging data consisted of MRI-based radiomic features extracted from necrotic core, enhancing and edema regions of deep learning-based auto-segmented tumors. Radiomic feature selections were performed using nested cross-validated LASSO. Support vector machine and ensemble models were trained using seventeen immune and cell-specific score labels extracted from deconvoluted transcriptomic data using pan-cancer and glioblastoma immune signature matrices as reference standards. Seventeen classifier models trained in three cross-cohort strategies were validated on three held-out datasets assessing stability and generalizability. Results: One-hundred-and-seventy-six patients were included in the study. The immune-related radiomic signatures obtained after feature selection were shape, first order and higher order radiomic features. Models predicting macrophage subtype immune signature showed stable mean performance on balanced accuracy (0.67) and precision (0.89) metrics for three independent holdout datasets with ensemble model outperforming support vector machine model. Conclusion: Radiogenomic models non-invasively predicted the macrophage subtype M0 immune signature in IDH-wildtype glioblastoma. These biomarkers have the potential to stratify patients for immunotherapy within prospective glioblastoma clinical trials.

[LG-66] Generalization Error Bounds for Picard-Type Operator Learning in Nonlinear Parabolic PDEs

链接: https://arxiv.org/abs/2605.10277
作者: Koichi Taniguchi,Sho Sonoda
类目: Machine Learning (cs.LG); Analysis of PDEs (math.AP); Machine Learning (stat.ML)
*备注: 39 pages

点击查看摘要

Abstract:Operator learning for partial differential equations (PDEs) aims to learn solution operators on infinite-dimensional function spaces from finite-resolution data. In this setting, it is important for the learned model to be discretization-invariant, or resolution-robust, and to reflect PDE-specific structure. It is therefore natural to ask how such structure should be encoded in the model architecture, hypothesis class, or learning procedure. In this paper, we study operator learning for solution operators of nonlinear parabolic PDEs based on Duhamel–Picard iteration. We formulate Picard iteration as an abstract state-transition model and present a theoretical framework for Picard-type operator learning. We derive implementation-agnostic generalization error bounds that separate the implementation error from the estimation error associated with the abstract state-transition model induced by Picard iteration. A key consequence is that increasing the Picard depth reduces the Picard truncation error without causing an unbounded growth of the entropy-based estimation error. We also extend the analysis to long-time prediction by rolling out the same learned local model over successive time blocks. Finally, we illustrate the theory for nonlinear heat equations on the torus using a Picard-type Fourier neural operator as a concrete implementation.

[LG-67] Teaching LLMs to See Graphs: Unifying Text and Structural Reasoning

链接: https://arxiv.org/abs/2605.10247
作者: Dario Vajda
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Using Large Language Models (LLMs) to process graph-structured data is an active research area, yet current state-of-the-art approaches typically rely on multi-step pipelines with Graph Neural Network (GNN) encoders that compress rich textual attributes into solitary tokens, creating a significant semantic bottleneck. In this paper, we introduce the Graph Transformer Language Model (GTLM), a novel architecture that enables pretrained LLMs to natively process graph topologies while entirely eliminating this compressive bottleneck. GTLM is exceptionally parameter-efficient: by injecting graph-aware attention biases directly into the LLM’s attention modules, it introduces only 0.015% additional parameters relative to the base model. We theoretically prove that our bidirectional attention prefix preserves node permutation equivariance while maintaining exact backward compatibility with the pretrained base model. Extensive evaluations demonstrate that a 1B-parameter GTLM matches or exceeds the performance of 7B-parameter state-of-the-art models on standard Text-Attributed Graph benchmarks, while significantly surpassing baselines on GraphQA. Finally, we demonstrate that GTLM attention heads implicitly learn to simulate message passing, explaining its superior performance on algorithmic tasks. This paradigm shift enables true algorithmic reasoning within LLMs and provides a scalable foundation for next-generation GraphRAG and relational deep learning.

[LG-68] MARGIN: Margin-Aware Regularized Geometry for Imbalanced Vulnerability Detection

链接: https://arxiv.org/abs/2605.10240
作者: Yuteng Zhang,Huifang Ma,Jiahui Wei,Qingqing Li,Yafei Yang
类目: Software Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 12 pages.9 figures, 4 tables

点击查看摘要

Abstract:Software vulnerability detection is critical for ensuring software security and reliability. Despite recent advances in deep learning, real-world vulnerability datasets suffer from two severe challenges: frequency imbalance and difficulty imbalance. We reinterpret these challenges from an embedding geometry perspective, observing that such imbalances induce geometric distortions in hyperspherical representation space. To address this issue, we propose MARGIN, a metric-based framework that learns discriminative vulnerability representations through adaptive margin metric learning and hyperspherical prototype modeling. MARGIN dynamically adjusts geometric regularization according to the distribution structure estimated by the von Mises-Fisher concentration, aligning the probability mass of embedding distributions with their corresponding Voronoi cells, thereby reducing geometric distortion and yielding more stable decision boundaries. Extensive experiments on public vulnerability datasets show that MARGIN consistently outperforms strong baselines, achieving notable improvements in classification and detection, especially on challenging, imbalanced datasets. Further analysis demonstrates that MARGIN produces more structured embedding geometries, improving robustness, interpretability, and generalization.
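
The concentration statistic that drives the adaptive regularization can be estimated with a standard von Mises-Fisher approximation. The sketch below uses the classical Banerjee et al. estimator on L2-normalized embeddings; the paper may estimate the concentration differently, and the two synthetic clusters are purely illustrative.

```python
import numpy as np

def vmf_concentration(embeddings):
    """Approximate vMF concentration kappa of normalized embeddings via
    kappa ~= r_bar * (d - r_bar^2) / (1 - r_bar^2), r_bar = ||mean direction||."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    d = x.shape[1]
    r_bar = np.linalg.norm(x.mean(axis=0))
    return r_bar * (d - r_bar ** 2) / (1.0 - r_bar ** 2)

rng = np.random.default_rng(0)
tight = rng.normal(loc=[3.0, 0.0, 0.0, 0.0], scale=0.3, size=(500, 4))  # concentrated class
loose = rng.normal(loc=[1.0, 0.0, 0.0, 0.0], scale=1.0, size=(500, 4))  # diffuse class
print("tight kappa ~", round(float(vmf_concentration(tight)), 1),
      " loose kappa ~", round(float(vmf_concentration(loose)), 1))
```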

[LG-69] he Benefits of Temporal Correlations: SGD Learns k-Juntas from Random Walks Efficiently

链接: https://arxiv.org/abs/2605.10237
作者: Elisabetta Cornacchia,Dan Mikulincer,Elchanan Mossel
类目: Machine Learning (cs.LG)
*备注: 10 pages main body, 3 figures

点击查看摘要

Abstract:We study how temporal correlations in the data can make certain sparse learning problems efficiently learnable by gradient-based methods. Our focus is on Boolean k-juntas, a canonical sparse learning problem known to pose barriers for gradient-based methods under independent uniform samples. We show that this picture changes when the samples are generated by a lazy random walk on the hypercube. In this setting, the temporal dependencies can be exploited by a two-layer ReLU network trained using stylized-SGD with a temporal-difference loss, which compares target and predicted increments across consecutive samples. For every fixed k, the resulting sample complexity is essentially linear in the ambient dimension d. By contrast, we show that for large-batch gradient methods using standard convex pointwise losses, temporal correlations do not provide the same advantage.

[LG-70] FORGE: Frag ment-Oriented Ranking and Generation for Context-Aware Molecular Optimization

链接: https://arxiv.org/abs/2605.10230
作者: Qingchuan Zhang,He Cao,Hao Li,Yanjun Shao,Zhiyuan Liu,Shihang Wang,Shufang Xie,Shenghua Gao,Xinwu Ye
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Molecular optimization seeks to improve a molecule through small structural edits while preserving similarity to the starting compound. Recent language-model approaches typically treat this task as prompt-conditioned sequence generation. However, relying on natural language introduces an inherent data-scaling bottleneck, often leads to chemical hallucinations, and ignores the strong context dependence of fragment effects. We present FORGE, a two-stage framework that reformulates molecular optimization as context-aware local editing. By utilizing automatically mined, verified low-to-high edit pairs instead of expensive human text annotations, Stage 1 ranks candidate fragments by their property contribution under the full molecular context to inject chemical prior, and Stage 2 generates explicit fragment replacements. Built on a compact 0.6B language model, FORGE further adapts to unseen black-box objectives through in-context demonstrations. Across Prompt-MolOpt, PMO-1k and ChemCoTBench, FORGE consistently outperforms prior methods, including substantially larger language models and graph methods. These results highlight the value of explicit fragment-level supervision as a more easily obtainable, scalable, and hallucination-less alternative to natural language training.

[LG-71] Unveiling High-Probability Generalization in Decentralized SGD

链接: https://arxiv.org/abs/2605.10205
作者: Jiahuan Wang,Ping Luo,Ziqing Wen,Dongsheng Li,Tao Sun
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Decentralized stochastic gradient descent (D-SGD) is an efficient method for large-scale distributed learning. Existing generalization studies mainly address expected results, achieving rates limited to \mathcal{O}\left(\frac{1}{\delta \sqrt{mn}}\right), where \delta is the confidence parameter, m the number of workers, and n the sample size. When m=1, D-SGD reduces to traditional SGD, whose optimal high-probability generalization bound is \mathcal{O}\left(\frac{1}{\sqrt{n}}\log (1/\delta)\right). This discrepancy reveals a gap between high-probability guarantees for SGD and those for D-SGD. To close this, we develop a high-probability learning theory for D-SGD, aiming for the optimal \mathcal{O}\left(\frac{1}{\sqrt{mn}}\log (1/\delta)\right) rate. We refine bounds for D-SGD using pointwise uniform stability in distributed learning, a weaker notion than uniform stability, and analyze them across convex, strongly convex, and non-convex settings. We also provide high-probability results for gradient-based measures in non-convex cases where only local minima exist, and derive optimization error and excess risk bounds. Finally, accounting for communication overhead, we analyze generalization bounds for local models within time-varying frameworks.

[LG-72] Many Needles in a Haystack: Active Hit Discovery for Perturbation Experiments ICML

链接: https://arxiv.org/abs/2605.10196
作者: Andrea Rubbi,Arpit Merchant,Samuel Ogden,Amir Akbarnejad,Pietro Liò,Sattar Vakili,Mo Lotfollahi
类目: Machine Learning (cs.LG)
*备注: To be published in International Conference on Machine Learning (ICML) 2026

点击查看摘要

Abstract:High-throughput gene perturbation experiments can test several genetic interventions in parallel, yet experimental budgets remain limited. A central goal is hit discovery: identifying as many perturbations as possible whose phenotypic effect exceeds a predefined threshold. Pure exploration strategies are statistically inefficient, wasting budget on low-value regions. Bayesian optimization methods offer a principled alternative but target a single global optimum, over-exploiting dominant modes while neglecting other high-value regions. We formalize hit discovery as a sequential experimental design problem and propose Probability-of-Hit, an acquisition function that directly targets threshold exceedance by ranking candidates according to their posterior probability of being a hit. We prove asymptotic optimality of this approach and demonstrate strong empirical performance on both synthetic benchmarks and real biological immunology datasets, including up to 6.4% improvement over baselines on the Schmidt IL-2 dataset.
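
Under a Gaussian process posterior, the acquisition has a closed form: the probability that a candidate's latent effect exceeds the hit threshold is a Gaussian tail probability. A minimal ranking sketch with made-up posterior values follows.

```python
import numpy as np
from scipy.stats import norm

def probability_of_hit(mu, sigma, threshold):
    """P(f(x) >= threshold) under a Gaussian posterior N(mu, sigma^2)."""
    return norm.sf((threshold - mu) / np.maximum(sigma, 1e-12))

mu = np.array([0.4, 0.9, 1.4, 0.2])      # posterior means of candidate perturbations
sigma = np.array([0.5, 0.1, 0.6, 0.05])  # posterior standard deviations
scores = probability_of_hit(mu, sigma, threshold=1.0)
batch = np.argsort(-scores)[:2]          # query the top-ranked candidates next
print("hit probabilities:", scores.round(3), " selected:", batch)
```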

[LG-73] Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration OSDI2026

链接: https://arxiv.org/abs/2605.10195
作者: Shuzhang Zhong,Haochen Huang,Shengxuan Qiu,Pengfei Zuo,Runsheng Wang,Meng Li
类目: Machine Learning (cs.LG)
*备注: OSDI 2026

点击查看摘要

Abstract:Tree-of-Thought (ToT) reasoning structures Large Language Model (LLM) inference as a tree-based search, demonstrating strong potential for solving complex mathematical and programming tasks. However, its efficiency is constrained by the reward dependency barrier – a synchronization bottleneck caused by sequential reward-guided exploration that limits search parallelism and introduces substantial latency. Prior system optimizations, mainly designed for linear Chain-of-Thought (CoT) reasoning, cannot address these challenges, leaving the efficiency of ToT underexplored. To enhance ToT reasoning efficiency, we observe that the reasoning paths can be explored speculatively to break the reward synchronization barrier. Therefore, in this paper, we propose SPEX and introduce three key techniques: (i) intra-query speculative path selection to predict and expand high-potential branches of ToT, (ii) inter-query budget allocation to balance speculative resource allocation across queries dynamically, and (iii) adaptive early termination to prune deep and redundant branches for a skewed search tree. We implement SPEX on top of the SGLang framework and evaluate it across diverse ToT algorithms and LLMs. Extensive experiments show that SPEX achieves 1.2 \sim 3 \times speedup for different ToT reasoning algorithms. Moreover, SPEX synergizes with token-level speculative decoding, achieving cumulative speedups of up to 4.1\times . Ablation studies further confirm the contributions of each technique. Overall, SPEX represents a significant step toward efficient and scalable ToT reasoning, unlocking the parallelism required for high-performance inference-time scaling for LLMs.

[LG-74] Fix the Loss Not the Radius: Rethinking the Adversarial Perturbation of Sharpness-Aware Minimization ICML2026

链接: https://arxiv.org/abs/2605.10183
作者: Jinping Wang,Qinhan Liu,Zhiwu Xie,Zhiqiang Gao
类目: Machine Learning (cs.LG)
*备注: Accepted by ICML2026

点击查看摘要

Abstract:Sharpness-Aware Minimization (SAM) improves generalization by minimizing the worst-case loss within a fixed-radius neighborhood in parameter space. SAM and its variants mainly rely on a first-order linearized surrogate, while flat minima are inherently a second-order (curvature) property. We revisit this mismatch and propose Loss-Equated SAM (LE-SAM), which inverts the traditional SAM mechanism, replacing the fixed perturbation radius with a fixed loss-space budget, effectively removing gradient-norm-dominated learning signals and shifting optimization toward curvature-dominated terms. Extensive experiments across diverse benchmarks and tasks demonstrate the strong generalization ability of LE-SAM, which consistently outperforms SAM and its variants, achieving state-of-the-art performance.
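A minimal toy sketch of the "fixed loss budget instead of fixed radius" idea, under a first-order assumption: the perturbation radius is solved from the budget as budget/||g||. The quadratic toy loss and the exact update rule are assumptions for illustration, not the LE-SAM algorithm itself.

```python
# First-order sketch of a fixed loss-budget perturbation: under the
# linearization L(w + r*g/||g||) ~= L(w) + r*||g||, spending a loss budget
# eps implies the radius r = eps / ||g||. This toy is not the paper's update.
import numpy as np

def loss(w, A, b):
    return 0.5 * w @ A @ w - b @ w

def grad(w, A, b):
    return A @ w - b

def le_sam_like_step(w, A, b, lr=0.1, loss_budget=0.01):
    g = grad(w, A, b)
    g_norm = np.linalg.norm(g) + 1e-12
    r = loss_budget / g_norm              # radius implied by the loss-space budget
    w_adv = w + r * g / g_norm            # ascend to the budgeted perturbation
    return w - lr * grad(w_adv, A, b)     # descend using the perturbed gradient

A = np.diag([10.0, 1.0]); b = np.array([1.0, 1.0])
w = np.array([2.0, -2.0])
for _ in range(50):
    w = le_sam_like_step(w, A, b)
print(w, loss(w, A, b))
```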

[LG-75] Balancing Efficiency and Fairness in Traffic Light Control through Deep Reinforcement Learning

链接: https://arxiv.org/abs/2605.10170
作者: Matteo Cederle,Giacomo Scatto,Gian Antonio Susto
类目: Machine Learning (cs.LG)
*备注: Paper accepted to the 2026 IFAC World Congress, held in Busan (KOR), August 23rd-28th, 2026

点击查看摘要

Abstract:Urban traffic congestion presents a significant challenge for modern cities, impacting mobility and sustainability. Traditional traffic light control systems often fail to adapt to dynamic conditions, leading to inefficiencies. This paper proposes a novel deep reinforcement learning agent for traffic light control that addresses this limitation by explicitly integrating fairness considerations for both vehicular and pedestrian traffic. Unlike prior work, our approach dynamically balances these flows based on real-time demand, moving beyond systems focused solely on vehicles. Experimental results demonstrate that our agent effectively reduces congestion while ensuring equitable service for both categories of road users. This research contributes a practical and adaptable solution for intelligent traffic management within the framework of smart cities, paving the way for more efficient and inclusive urban mobility.

[LG-76] Hyperparameter Transfer for Dense Associative Memories

链接: https://arxiv.org/abs/2605.10164
作者: Roi Holtzman,Dmitry Krotov,Boris Hanin
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Dense Associative Memory (DenseAM) is a promising family of AI architectures that is represented by a neural network performing temporal dynamics on an energy landscape. While hyperparameter transfer methods are well-studied for feed-forward networks, these methods have not been developed for settings in which weights are shared across layers and within the layer, which is common in DenseAMs. Additionally, DenseAMs utilize rapidly peaking activation functions that are rarely used in feed-forward architectures. The confluence of these aspects makes DenseAM a challenging framework for using existing methods for hyperparameter transfer. Our work initiates the development of hyperparameter transfer methods for this class of models. We derive explicit prescriptions for how the hyperparameters tuned on small models can be transferred to models trained at scale. We demonstrate excellent agreement between these theoretical findings and empirical results.

[LG-77] OUIDecay: Adaptive Layer-wise Weight Decay for CNNs Using Online Activation Patterns

链接: https://arxiv.org/abs/2605.10161
作者: Alberto Fernández-Hernández,Jose I. Mestre,Cristian Pérez-Corral,Manuel F. Dolz,Jose Duato,Enrique S. Quintana-Ortí
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Weight decay remains one of the most widely used regularization mechanisms for training convolutional neural networks, yet it is still commonly applied as a fixed coefficient shared by all layers throughout training. This uniform treatment ignores that different layers may follow different structural dynamics and therefore may require different regularization strengths. In this work, we propose OUIDecay, an adaptive layer-wise and time-dependent weight decay scheduler for CNNs driven by the Overfitting-Underfitting Indicator (OUI), an activation-based metric previously shown to provide early information about regularization quality. OUIDecay uses a lightweight batch-based formulation of OUI to monitor the structural behavior of each layer online and periodically rescales its weight decay relative to the other layers in the network. Unlike gradient-based adaptive decay methods, our approach relies on functional information extracted from activation patterns and does not require validation data. Experiments on EfficientNet-B0 with Stanford Cars, ResNet50 with Food101, DenseNet121 with CIFAR100, and MobileNetV2 with CIFAR10 show that OUIDecay achieves the best mean best-validation-loss in 7 out of 8 evaluated settings. These results indicate that activation-driven weight decay adaptation is a practical and effective alternative to fixed decay and gradient-based adaptive decay, while keeping the method lightweight and suitable for online use.

[LG-78] jNO: A JAX Library for Neural Operator and Foundation Model Training

链接: https://arxiv.org/abs/2605.10159
作者: Leon Armbruster,Rathan Ramesh,Georg Kruse,Christopher Straub
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:jNO (jax Neural Operators) is a JAX-native library for neural operators and foundation models with unified support for both data-driven and physics-informed training. Its core design is a tracing system in which domains, model calls, residuals, supervised losses, and diagnostics are written in one symbolic language and compiled into one optimization pipeline. This allows users to move between operator regression, mesh-aware residual evaluation, and PDE-constrained training without restructuring the surrounding code. jNO also supports multi-model compositions, fine-grained control at parameter level (model, optimizer, and learning rate), hyperparameter tuning, and JAX-native workflows for translated PDE foundation-model families. The source repository is available at this https URL.

[LG-79] Unsupervised Process Reward Models

链接: https://arxiv.org/abs/2605.10158
作者: Artyom Gadetsky,Maxim Kodryan,Siba Smarak Panigrahi,Hang Guo,Maria Brbic
类目: Machine Learning (cs.LG)
*备注: preprint

点击查看摘要

Abstract:Process Reward Models (PRMs) are a powerful mechanism for steering large language model reasoning by providing fine-grained, step-level supervision. However, this effectiveness comes at a significant cost: PRMs require expert annotations for every reasoning step, making them costly and difficult to scale. Here, we propose a method for training unsupervised PRMs (uPRM) that requires no human supervision, neither at the level of step-by-step annotations nor through ground-truth verification of final answers. The key idea behind our approach is to define a scoring function, derived from LLM next-token probabilities, that jointly assesses candidate positions of first erroneous steps across a batch of reasoning trajectories. We demonstrate the effectiveness of uPRM across diverse scenarios: (i) uPRM achieves up to 15% absolute accuracy improvements over the LLM-as-a-Judge in identifying first erroneous steps on the ProcessBench dataset; (ii) as a verifier for test-time scaling, uPRM performs comparably to supervised PRMs and outperforms the majority voting baseline by up to 6.9%, and (iii) when used as a reward signal in reinforcement learning, uPRM enables more robust policy optimization throughout training compared to a supervised PRM trained using ground-truth labels. Overall, our results open a path toward scalable reward modeling for complex reasoning tasks.

[LG-80] Stable Long-Horizon PDE Forecasting via Latent Structured Spectral Propagators

链接: https://arxiv.org/abs/2605.10154
作者: Xiaoxiao Lu,Ye Yuan,Jiahao Shi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Long-horizon forecasting of time-dependent partial differential equations (PDEs) is critical for characterizing the sustained evolution of physical systems. While neural operators have emerged as efficient surrogates, they typically learn implicit finite-time transitions from discrete observations. When deployed autoregressively, such propagators often suffer from rapid error accumulation and dynamic drift. To address this, we propose a neural forecasting framework that reformulates PDE rollout as learning a Structured Spectral Propagator (SSP) in a propagation-oriented latent space. Following an analysis-propagation-synthesis design, our framework: (i) maps physical states into a shared, time-consistent spatial representation; (ii) projects this space into a compact propagation state to isolate recurrent dynamics from fine-grained spatial details, thereby decoupling reconstruction fidelity from rollout regularity; and (iii) evolves retained spectral modes using a frequency-conditioned linear backbone complemented by a nonlinear spectral closure to account for truncated interactions. This explicit structuring endows the propagator with a strong inductive bias for coherent modal evolution. Extensive experiments demonstrate that SSP significantly outperforms state-of-the-art baselines, reducing relative L_2 errors by up to 48.9% and exhibiting improved stability in temporal extrapolation beyond the supervised horizon.
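The following toy sketch shows what a latent structured spectral propagator can look like in the simplest 1D setting: retained Fourier modes evolve under a frequency-conditioned linear factor plus a small nonlinear closure. The diffusive multiplier and the Burgers-like closure used here are placeholders, not the learned components of SSP.

```python
# Toy structured spectral propagation: analysis (rfft), propagation of the
# retained modes with a frequency-conditioned linear factor plus a nonlinear
# closure, then synthesis (irfft). All specific forms are placeholders.
import numpy as np

def ssp_step(z, n_modes=16, dt=0.01, nu=0.05):
    n = z.shape[-1]
    z_hat = np.fft.rfft(z)
    k = np.arange(z_hat.shape[-1])
    linear = np.exp(-nu * (k ** 2) * dt)            # frequency-conditioned decay (placeholder)
    closure = -dt * 1j * k * np.fft.rfft(0.5 * z ** 2)  # nonlinear closure (placeholder)
    z_hat = linear * z_hat + closure
    z_hat[n_modes:] = 0.0                           # spectral truncation of the latent state
    return np.fft.irfft(z_hat, n=n)

x = np.linspace(0, 2 * np.pi, 128, endpoint=False)
z = np.sin(x)
for _ in range(100):
    z = ssp_step(z)
print(float(z.min()), float(z.max()))
```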

[LG-81] APEX: Audio Prototype EXplanations for Classification Tasks

链接: https://arxiv.org/abs/2605.10153
作者: Piotr Kawa,Kornel Howil,Piotr Borycki,Miłosz Adamczyk,Przemysław Spurek,Piotr Syga
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Explainable AI (XAI) has achieved remarkable success in image classification, yet the audio domain lacks equally mature solutions. Current methods apply vision-based attribution techniques to spectrograms, overlooking fundamental differences between visual and acoustic signals. While prototype reasoning is promising, acoustic similarity remains multidimensional. We introduce APEX (Audio Prototype EXplanations), a post-hoc framework for interpreting pre-trained audio classifiers. Crucially, APEX requires no fine-tuning of the original backbone and strictly preserves output invariance. APEX disentangles explanations into four perspectives: Square-based prototypes to localize transient events, Time-based for temporal patterns, Frequency-based highlighting spectral bands, and Time-Frequency-based integrating both. This yields intuitive, example-based explanations that respect acoustic properties, providing greater semantic clarity than standard gradient-based methods.

[LG-82] Learning to Sparsify Stochastic Linear Bandits IJCAI2026

链接: https://arxiv.org/abs/2605.10151
作者: Zhengmiao Wang,Ming Chi,Zhi-Wei Liu,Lintao Ye,Carla Fabiana Chiasserini
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注: Include all the omitted details and proofs from the conference paper accepted to IJCAI 2026

点击查看摘要

Abstract:This paper addresses the problem of learning to sparsify stochastic linear bandits, where a decision-maker sequentially selects actions from a high-dimensional space subject to a sparsity constraint on the number of nonzero elements in the action vector. The key challenge lies in minimizing cumulative regret while tackling the potential NP-hardness of finding optimal sparse actions due to the inherent combinatorial structure of the problem. We propose an adaptively phased exploration and exploitation algorithmic framework, utilizing ordinary least squares for parameter learning and specialized subroutines for sparse action selection. When the action set is a Euclidean ball, optimal sparse actions can be efficiently computed, enabling us to establish a \tilde{\mathcal{O}}(d\sqrt{T}) regret, where d is the dimension of the action vector and T is the time horizon length. For general convex and compact action sets where finding optimal sparse actions is intractable, we employ a greedy subroutine. For general strongly convex action sets, we derive a \tilde{\mathcal{O}}(d\sqrt{T}) \alpha-regret; for general compact sets lacking strong convexity, we establish a \tilde{\mathcal{O}}(dT^{2/3}) \alpha-regret, where \alpha pertains to the approximation ratio of the greedy algorithm. Finally, we validate the performance of our algorithms using extensive experiments, including an application to a recommendation system.

[LG-83] Per-Loss Adapters for Gradient Conflict in Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2605.10136
作者: Bum Jun Kim,Gnankan Landry Regis N’guessan
类目: Machine Learning (cs.LG)
*备注: 49 pages, 10 figures

点击查看摘要

Abstract:Physics-informed neural networks (PINNs) train a single neural approximation by minimizing multiple physics- and data-derived losses, but the gradients of these losses often interfere and can stall optimization. Existing remedies typically treat this pathology either through scalar loss balancing or full-parameter-space gradient surgery, leaving it unclear which intervention is most appropriate. We show that PINN gradient conflict is not a uniform failure mode with one universal remedy. Instead, we identify distinct PINN gradient-conflict regimes, each associated with a different intervention class. Persistent directional conflict may require separate loss-indexed parameter subspaces, magnitude imbalance often favors scalar reweighting, and low or transient conflict may require no extra mitigation. To select between scalar reweighting and a lightweight architectural intervention, we propose a diagnostic-first framework. It profiles a 1000-step unmodified PINN run and, when intervention is warranted, uses one low-rank adapter per loss to create explicit loss-indexed parameter subspaces attached to a shared PINN trunk, providing each loss with a direct gradient pathway. Across more than 60 PDE configurations, including forward, inverse, multi-physics, parameter-varying, and high-dimensional problems up to 50D, persistent directional conflict dominates standard forward K=3 benchmarks and a natural K=4 thermoelastic system, where adapters combined with reweighting yield significant improvements. In contrast, K=3 inverse problems and natural K=5 and K=6 multi-physics systems are largely magnitude-dominated and often favor reweighting alone, while full-parameter-space gradient surgery can fail on heterogeneous parameter spaces.
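The diagnostic-first idea can be illustrated with a small sketch: given per-loss gradients collected from a short profiling run, pairwise cosine similarities indicate directional conflict and norm ratios indicate magnitude imbalance. The thresholds and the decision rule below are hypothetical, not the paper's calibrated procedure.

```python
# Sketch of a gradient-conflict diagnostic for multi-loss training: report
# pairwise cosine similarities (directional conflict) and norm ratios
# (magnitude imbalance) over per-loss gradients. Thresholds are illustrative.
import numpy as np

def conflict_profile(per_loss_grads):
    G = np.stack([g / (np.linalg.norm(g) + 1e-12) for g in per_loss_grads])
    cos = G @ G.T                                   # pairwise cosine similarities
    norms = np.array([np.linalg.norm(g) for g in per_loss_grads])
    iu = np.triu_indices(len(per_loss_grads), k=1)
    return cos[iu].min(), norms.max() / (norms.min() + 1e-12)

def suggest_intervention(min_cos, imbalance, cos_thr=-0.1, imb_thr=10.0):
    if min_cos < cos_thr:
        return "persistent directional conflict -> per-loss adapters"
    if imbalance > imb_thr:
        return "magnitude imbalance -> scalar reweighting"
    return "low conflict -> no extra mitigation"

rng = np.random.default_rng(1)
grads = [rng.normal(size=1000), rng.normal(size=1000), 50.0 * rng.normal(size=1000)]
print(suggest_intervention(*conflict_profile(grads)))
```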

[LG-84] GELATO: Generative Entropy- and Lyapunov-based Adaptive Token Offloading for Device-Edge Speculative LLM Inference

链接: https://arxiv.org/abs/2605.10124
作者: Zengzipeng Tang,Yuxuan Sun,Wei Chen,Jianwen Ding,Bo Ai
类目: Networking and Internet Architecture (cs.NI); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:The recent growth of on-device Large Language Model (LLM) inference has driven significant interest in device-edge collaborative LLM inference. As a promising architecture, Speculative Decoding (SD) is increasingly adopted where a lightweight draft model rapidly generates candidate tokens to be verified by a powerful target model. However, a fundamental challenge lies in achieving per-token resource scheduling to effectively adapt SD paradigm to resource-constrained edge environment. This paper proposes a Generative Entropy- and Lyapunov-based Adaptive Token Offloading framework, named GELATO, to maximize decoding throughput under energy constraints in a device-edge collaborative SD system. Specifically, an outer drift-plus-penalty loop makes online decisions to establish a reference drafting budget, managing long-term energy-throughput trade-off. Further, a nested entropy-driven generation mechanism executes early exiting to adapt to per-token dynamic generative uncertainty. Theoretical analysis establishes a rigorous performance bound on long-term throughput for GELATO. Extensive evaluations demonstrate that GELATO achieves a globally optimal tradeoff, outperforming state-of-the-art distributed SD architectures by 64.98% in token throughput and reducing energy consumption by 47.47% under resource-constrained environments, while preserving LLM decoding quality.
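A generic drift-plus-penalty sketch in the spirit of the outer loop described above: a virtual queue tracks the long-term energy budget, and the per-round drafting budget minimizes a weighted combination of negative throughput and queue-scaled energy. The toy throughput and energy models are assumptions, not GELATO's formulation.

```python
# Generic Lyapunov drift-plus-penalty loop: pick the draft budget minimizing
# V*(-throughput) + Q*energy, then update the virtual energy queue Q.
# The throughput/energy functions are made-up placeholders.
import numpy as np

def choose_budget(Q, V, candidate_budgets, throughput_fn, energy_fn):
    costs = [V * (-throughput_fn(b)) + Q * energy_fn(b) for b in candidate_budgets]
    return candidate_budgets[int(np.argmin(costs))]

def run(rounds=200, energy_cap_per_round=1.0, V=10.0):
    Q = 0.0
    budgets = np.arange(1, 9)                       # candidate draft lengths
    throughput = lambda b: 1.0 - np.exp(-0.5 * b)   # diminishing returns (toy)
    energy = lambda b: 0.2 * b                      # linear device energy (toy)
    total_tp = 0.0
    for _ in range(rounds):
        b = choose_budget(Q, V, budgets, throughput, energy)
        total_tp += throughput(b)
        Q = max(Q + energy(b) - energy_cap_per_round, 0.0)  # virtual queue update
    return total_tp / rounds, Q

print(run())
```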

[LG-85] Complex-Valued Phase-Coherent Transformer

链接: https://arxiv.org/abs/2605.10123
作者: Leona Hioki
类目: Machine Learning (cs.LG)
*备注: 26 pages, 17 tables (no figures). Companion Lean 4 formalization of Theorems 1 and 2 at this https URL

点击查看摘要

Abstract:Complex-valued Transformers have largely inherited softmax attention from real-valued architectures. However, row-normalised token competition is not necessarily aligned with phase-preserving computation. In this paper, we introduce the Phase-Coherent Transformer (PCT), which applies a real-valued, element-independent, smooth gate to L2-normalised complex query-key similarities. PCT replaces token competition with token-non-competing attention and is designed to preserve phase information across layers. Across mid-scale benchmarks spanning long-range memory, hierarchical long-range reasoning, positional retrieval, phase-based memory and superposition, and image classification, PCT shows strong generalisation across task categories. Under parameter-fair comparison, PCT consistently outperforms both the standard softmax Transformer and its direct complex-valued counterpart. Moreover, even on tasks traditionally considered difficult for complex-valued neural networks, such as NIAH and LRA-Text, PCT remains competitive with Multiscreen, the strongest real-valued NN baseline in our comparison. Experiments introducing gates that deliberately violate the PCT conditions show that the design is not incidental: smooth gates that preserve negatively aligned phase components remain strong, whereas gates that delete such components collapse on long-range retrieval, and gates whose outputs become excessively large suffer clear performance degradation. PCT also shows no depth-related accuracy collapse across the tested depth range. These results support introducing multi-layer phase-coherent structure into attention as a promising design principle for achieving generalisation in complex-valued Transformers.
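A rough sketch of token-non-competing, phase-preserving attention as described in the abstract; the specific sigmoid gate and the way the gated similarities weight the values are assumptions for illustration, since the paper's exact gate is not given here.

```python
# Sketch: L2-normalise complex queries/keys, compute complex similarities,
# apply a real, element-independent smooth gate (assumed: sigmoid of the real
# part), and mix complex values without any row-wise softmax competition.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def pct_like_attention(q, k, v):
    q, k = l2_normalize(q), l2_normalize(k)
    sim = q @ k.conj().T                              # complex query-key similarities
    gate = 1.0 / (1.0 + np.exp(-4.0 * sim.real))      # real, element-independent smooth gate
    weights = gate * sim                              # gated complex weights (assumed form)
    return (weights @ v) / v.shape[0]                 # no softmax: tokens do not compete

rng = np.random.default_rng(0)
T, d = 5, 8
q = rng.normal(size=(T, d)) + 1j * rng.normal(size=(T, d))
k = rng.normal(size=(T, d)) + 1j * rng.normal(size=(T, d))
v = rng.normal(size=(T, d)) + 1j * rng.normal(size=(T, d))
print(pct_like_attention(q, k, v).shape)
```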

[LG-86] Scaling the Memory of Balanced Adam

链接: https://arxiv.org/abs/2605.10119
作者: Alberto Fernández-Hernández,Cristian Pérez-Corral,Jose I. Mestre,Manuel F. Dolz,Enrique S. Quintana-Ortí
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent evidence suggests that Adam performs robustly when its momentum parameters are tied, \beta_1=\beta_2, reducing the optimizer to a single remaining parameter. However, the value of this parameter is still poorly understood. We argue that, in balanced Adam, \beta should not be treated as a dimensionless constant: it defines a statistical memory horizon H_\beta=(1-\beta)^{-1}. In terms of the effective learning horizon T_{\mathrm{ES}}, estimated from the validation trajectory, we study the refresh count R_\beta=(1-\beta)T_{\mathrm{ES}}, which measures how many times Adam renews its internal statistics during the useful phase of training. Across 11 vision and language experiments, we find that choosing \beta so that R_\beta\approx1000 selects different beta values depending on the training scale, yet improves robustness over the best fixed-beta baseline. Compared with the strongest fixed choice \beta=0.94377, the refresh rule improves worst-case robustness, reducing the global maximum validation gap by 33.4%, while bringing all 11 runs within 1% of their validation oracle. These results suggest that the remaining hyperparameter of balanced Adam is better understood as a memory-scale variable than as a fixed constant. This provides a simple budget-aware perspective on optimizer scaling and opens a path toward treating Adam’s momentum as part of the learning dynamics rather than as a static default.
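A worked example of the refresh rule: choosing \beta so that R_\beta=(1-\beta)T_{\mathrm{ES}}\approx1000 amounts to \beta = 1 - 1000/T_{\mathrm{ES}}, so the selected \beta grows with the training horizon. The tiny script below just evaluates this closed-form rule for a few horizons.

```python
# Worked example of the refresh rule: beta = 1 - R / T_ES with R ~ 1000,
# clipped to stay inside (0, 1). Memory horizon is then H = 1 / (1 - beta).
def beta_from_refresh(T_ES, R=1000.0):
    return min(max(1.0 - R / T_ES, 0.0), 0.999999)

for T_ES in [2_000, 20_000, 200_000, 2_000_000]:
    beta = beta_from_refresh(T_ES)
    print(f"T_ES={T_ES:>9d}  beta={beta:.6f}  memory horizon={1/(1-beta):.0f} steps")
```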

[LG-87] Generating Symmetric Materials using Latent Flow Matching

链接: https://arxiv.org/abs/2605.10115
作者: Anmar Karmush,Cedric Mathieu Brandenburg,Soheil Ershadrad,Johanna Rosén,Michael Felsberg,Filip Ekström Kelvinius
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: Preprint

点击查看摘要

Abstract:Tackling the task of materials generation, we aim to enhance the previously proposed All-atom Diffusion Transformer (ADiT) by introducing SymADiT, a symmetry-aware variant. To do so, we use a representation of materials based on Wyckoff positions. We follow ADiT and perform generative modelling in latent space, adapted to our symmetry-aware representation. By forcing the output of the generative model to adhere to the symmetry restrictions imposed by the generated crystal’s space group and each atom’s Wyckoff-position, the generated materials exhibit more realistic symmetry properties. We benchmark our method against both symmetry-aware and symmetry-agnostic models for materials generation and show competitive performance, generating stable, symmetric materials with a simple Transformer architecture.

[LG-88] TopoU-Net: a U-Net architecture for topological domains

链接: https://arxiv.org/abs/2605.10091
作者: Gaurav Gaurav,Ibrahem ALJabea,Yaroslav Zakomornyy,Eric Frank,Mohamed Elhamdadi,Theodore Papamarkou,Mustafa Hajij
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many modern datasets mix points, edges, regions, groups, objects, events, hyperedges, and relations. Yet neural architectures often force such data into grids, graphs, or sequences, obscuring higher-order structure and making encoder-decoder designs domain-specific. We view U-Net not as a grid-specific architecture, but as a hierarchical encoder-decoder principle: representation spaces, transport maps between levels, and skip connections between matched levels. Combinatorial complexes naturally supply these ingredients through cells, incidences, and ranks. We introduce TopoU-Net, a rank-path U-Net for topological domains. Given a path from an input rank to a bottleneck rank and back, the encoder lifts cochains upward along incidence maps, the decoder transports them downward, and skip connections merge features at matched ranks. Rank replaces spatial scale: choosing paths through nodes, edges, faces, hyperedges, or global cells becomes the central architectural decision. A key quantity is the bottleneck support ratio, the number of cells at the bottleneck relative to the number of cells at the input rank. This ratio is fixed by the complex and chosen path rather than by arbitrary pooling, and it clarifies when skip connections are optional, useful, or structurally important. Across node classification, graph classification, hypergraph node classification, mesh classification, and image reconstruction, TopoU-Net provides a reusable encoder-decoder template for higher-order structured data. Among the evaluated baselines, it achieves the strongest mean accuracy on six of eight node-classification datasets and four of five hypergraph datasets, with the largest gains on heterophilic graphs. Ablations show that removing skip connections is most damaging under severe bottleneck compression.

[LG-89] Unlocking air traffic flow prediction through microscopic aircraft-state modeling

链接: https://arxiv.org/abs/2605.10083
作者: Bin Wang,Anqi Liu,Jiangtao Zhao,Yanyong Huang,Peilan He,Guiyuan Jiang,Feng Hong,Yanwei Yu,Tianrui Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Short-term air traffic flow prediction in terminal airspace is essential for proactive air traffic management. Existing approaches predominantly model traffic flow as aggregated time series, despite traffic dynamics being governed by aircraft states and interactions in continuous airspace. Such aggregation obscures fine-grained information including aircraft kinematics, boundary interactions, and control intent. Here we present AeroSense, a state-to-flow modeling framework that predicts future traffic flow directly from instantaneous airspace situations represented as dynamic sets of aircraft states derived from ADS-B trajectories. By establishing an end-to-end mapping from microscopic aircraft states to future regional traffic flow, AeroSense preserves aircraft-level dynamics while naturally accommodating varying traffic density without relying on historical look-back windows. Experiments on a large-scale real-world dataset show that AeroSense consistently improves predictive accuracy over aggregation-based forecasting approaches, particularly during high-density traffic periods. These findings suggest that instantaneous airspace situations provide an effective alternative to conventional time-series-based traffic forecasting paradigms.

[LG-90] TrajDLM: Topology-Aware Block Diffusion Language Model for Trajectory Generation

链接: https://arxiv.org/abs/2605.10020
作者: Wilson Wongso,Lihuan Li,Arian Prabowo,Xiachong Lin,Baiyu Chen,Hao Xue,Flora D. Salim
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generating high-fidelity synthetic GPS trajectories is increasingly important for applications in transportation, urban planning, and what-if scenario simulation, especially as privacy concerns limit access to real-world mobility data. Existing trajectory generation models face a trade-off between efficiency and faithfulness to road network topology: continuous-space methods enable fast generation but ignore the road network, while topology-aware approaches rely on search-based autoregressive decoding that limits generation speed. We propose TrajDLM, a topology-aware trajectory generation framework based on block diffusion language models that bridges this gap. TrajDLM models trajectories as sequences of discrete road segments, combining a block diffusion backbone for efficient denoising, topology-aware embeddings from a road network encoder, and topology-constrained sampling to ensure coherent and realistic trajectories. Across three city-scale datasets, TrajDLM achieves strong performance on fine-grained local similarity metrics while being up to 2.8\times faster than prior work, and demonstrates strong zero-shot transfer across domains, including unseen transportation modes. These results highlight the effectiveness of block-wise discrete diffusion as a scalable approach to accurate and efficient trajectory generation. Our code is available at this https URL
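Topology-constrained sampling can be sketched as masking segment logits to the road-graph neighbors of the previously decoded segment before sampling; the tiny adjacency list and random logits below are stand-ins, not TrajDLM's decoder.

```python
# Sketch of topology-constrained sampling over road segments: logits for
# segments not adjacent to the previous segment are masked out before
# sampling. Adjacency and logits are toy placeholders.
import numpy as np

def constrained_sample(logits, prev_segment, adjacency, rng):
    mask = np.full_like(logits, -np.inf)
    mask[adjacency[prev_segment]] = 0.0          # only road-graph successors allowed
    z = logits + mask
    p = np.exp(z - z.max())
    p /= p.sum()
    return rng.choice(len(logits), p=p)

adjacency = {0: [1, 2], 1: [0, 3], 2: [3], 3: [1, 2]}   # tiny road graph
rng = np.random.default_rng(0)
traj, seg = [0], 0
for _ in range(6):
    logits = rng.normal(size=4)                          # stand-in for model logits
    seg = constrained_sample(logits, seg, adjacency, rng)
    traj.append(int(seg))
print(traj)
```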

[LG-91] The Value of Mechanistic Priors in Sequential Decision Making

链接: https://arxiv.org/abs/2605.10018
作者: Itai Shufaro,Gal Benor,Shie Mannor
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Hybrid mechanistic models, physical priors with learned residuals, promise to reduce the data required for good decisions, but have no computable criterion to test this. We characterize the value of mechanistic priors in sequential decision-making within both asymptotic and burn-in regimes. To formalize this, we introduce the mechanistic information of a model – the mutual information between the model’s recommended policy \hat{\pi} and the true optimal policy \pi^* – quantified via an occupancy-weighted bias B_\mu. In the asymptotic regime (large N), matched bounds reveal that Bayesian regret scales with the residual entropy H_{\mathrm{mech}}, delivering a theoretical sample complexity reduction of H(\mu)/H_{\mathrm{mech}} compared to an uninformed baseline. Furthermore, we provide a model certificate to determine empirical sample efficiency. Complementarily, in the clinically relevant burn-in regime (small N), we establish a lower bound on the penalty incurred by confidently wrong priors. We demonstrate both the asymptotic and burn-in bounds across 5-fluorouracil (5-FU) dosing simulations motivated by published FOLFOX pharmacokinetic data, where a hybrid prior yields large sample-efficiency gains in the burn-in regime. Finally, we contrast these grounded models with LLM priors, demonstrating that LLMs can suffer severe losses in mechanistic information, thereby motivating the exclusive use of physically-grounded priors for safety-critical applications.

[LG-92] Anchor-guided Hypergraph Condensation with Dual-level Discrimination ICML2026

链接: https://arxiv.org/abs/2605.10001
作者: Fan Li,Xiaoyang Wang,Chen Chen,Wenjie Zhang
类目: Machine Learning (cs.LG)
*备注: This paper has been accepted by ICML 2026

点击查看摘要

Abstract:The increasing prevalence of large-scale hypergraphs poses significant computational challenges for hypergraph neural network (HNN) training. To address this, hypergraph condensation (HGC) distills large real hypergraphs into compact yet informative synthetic ones, beyond graph condensation (GC) methods limited to pairwise relations. However, existing HGC methods rely on decoupled training architectures, where structure generators are pre-trained on the original hypergraph but not jointly optimized with condensed features during refinement, resulting in misaligned structures that degrade downstream utility. Moreover, trajectory-based optimization incurs substantial computational overhead in refinement, limiting condensation efficiency. To tackle these issues, we propose Anchor-guided HyperGraph Condensation with Dual-level Discrimination (AHGCDD), which consists of three key components: (1) a node initialization module based on Heat Kernel PageRank (HKPR) to encode structural knowledge into feature semantics; (2) an anchor-guided hyperedge synthesis strategy for joint optimization of condensed features and structure; (3) a theoretically grounded dual-level discrimination objective for utility-preserving condensation without redundant HNN training. Extensive experiments demonstrate the superior effectiveness and efficiency of AHGCDD.

[LG-93] Lakestream: A Consistent and Brokerless Data Plane for Large Foundation Model Training

链接: https://arxiv.org/abs/2605.09994
作者: Ting Sun,Junjie Zhang,Xiao Yan,Songxin Zhang,Zhuoyang Song,Jingyi Xi,Zunyao Mao,Bingyi Jing,Jiaxing Zhang,Zejian Xie
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modern Large Foundation Model (LFM) training has transformed the data pipeline from a static ingestion layer into a dynamic component that must co-evolve with the training process. Existing systems are ill-equipped: colocated dataloaders offer no failure isolation, while message queue-based disaggregated dataloaders operate on a record/offset abstraction that cannot express the batch-level semantics required by distributed training. We present Lakestream, a brokerless, object-store-native training data plane with three key properties. First, it introduces the Transactional Global Batch (TGB), which builds on lakehouse-style ACID storage semantics and extends them with training-specific consistency, including atomic all-rank batch visibility, a globally ordered step sequence, checkpoint-aligned lifecycle management, and end-to-end exactly-once recovery. Second, it realizes recovery and retention directly in the storage layer, by inlining producer state in the manifest and tying reclamation to distributed checkpoint state. Third, its Decentralized Adaptive Commit (DAC) algorithm sustains stable ingestion throughput as the manifest grows, without any inter-producer communication. Evaluations on large-scale multimodal pre-training and SFT workloads using 64 GPUs show that Lakestream outperforms colocated dataloader throughput while providing full failure isolation, outperforms Apache Kafka in ingestion throughput, and achieves lower consumer read latency than Kafka.

[LG-94] Learning Graph Foundation Models on Riemannian Graph-of-Graphs ICML2026

链接: https://arxiv.org/abs/2605.09993
作者: Haokun Liu,Zezhong Ding,Xike Xie
类目: Machine Learning (cs.LG)
*备注: This paper has been accepted by ICML 2026

点击查看摘要

Abstract:Graph foundation models (GFMs), pretrained on massive graph data, have transformed graph machine learning by supporting general-purpose reasoning across diverse graph tasks and domains. Existing GFMs pretrained with fixed-hop subgraph sampling impose a fixed receptive field, causing scale mismatch on diverse tasks, which often require heterogeneous and unknown structural contexts beyond a fixed sampling scale. We propose R-GFM, a Riemannian Graph-of-Graphs (GoG) based foundation model that treats structural scale as a first-class citizen in modeling. R-GFM constructs a multi-scale GoG over sampled subgraphs at different hop distances and learns geometry-adaptive representations from Riemannian manifolds. Theoretical analysis shows that R-GFM reduces structural domain generalization error compared to fixed-scale GFMs. Experiments on various datasets demonstrate that R-GFM achieves state-of-the-art performance, with up to a 49% relative improvement on downstream tasks. Our code is available at this https URL.

[LG-95] Chebyshev Center-Based Direction Selection for Multi-Objective Optimization and Training PINNs

链接: https://arxiv.org/abs/2605.09975
作者: Hoyeol Yoon,Seoungbin Bae,Nam Ho-Nguyen,Dabeen Lee
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Physics-informed neural networks (PINNs) are a promising approach for solving partial differential equations (PDEs). Their training, however, is often difficult because multiple loss terms induced by PDE residuals and boundary or initial conditions must be optimized simultaneously. To address this difficulty, existing approaches often construct update directions by explicitly enforcing particular desirable properties, such as scale robustness and simultaneous descent. While effective in many cases, such property-by-property designs can make it unclear which conditions are essential, what geometric principle determines the selected update direction, and how different methods are structurally related. In this work, we formulate update-direction selection for PINN training as a Chebyshev-center problem in the dual cone. The proposed formulation selects a normalized direction that maximizes the minimum distance to the cone facets. The resulting formulation admits an efficient dual problem in a much lower-dimensional space and yields a convergence guarantee in the nonconvex setting. It also recovers the key desirable properties targeted by existing approaches without imposing them separately; rather, they follow from the single geometric criterion underlying the formulation. This makes the selected direction interpretable through a single geometric rule and provides a unified basis for systematically comparing related direction-selection methods. Experiments on several PINN benchmarks further demonstrate strong empirical performance of the proposed method.

[LG-96] Consolidation-Expansion Operator Mechanics: A Unified Framework for Adaptive Learning

链接: https://arxiv.org/abs/2605.09968
作者: Debashis Guha
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 38 pages

点击查看摘要

Abstract:Every adaptive learning system must alternate between two operations: consolidating what it already knows and expanding into new evidence. We propose Consolidation-Expansion Operator Mechanics (OpMech), a framework that makes this structure precise. The central object is the order-gap \Ogap(\theta; e), the degree to which a consolidation operator Q and an expansion operator P_e fail to commute at a given knowledge state. Because the order-gap is computable from the system’s own trajectory, it serves as a real-time control signal: large values indicate that the system is still sensitive to the ordering of consolidation and expansion; once the order-gap falls and stays small, further processing is unlikely to change the outcome. Three results give the signal precise meaning: the order-gap decays along convergent trajectories; a persistently large order-gap implies the system is far from its settled state; and an order-gap-based stopping rule terminates with provable guarantees in both noiseless and bounded-noise settings. The framework applies across five domains: bandits, reinforcement learning, stochastic optimization, continual learning, and recursive language models. We give conditions under which the order-gap reliably tracks convergence in three representative cases. We develop the recursive language model application in detail, showing how OpMech replaces heuristic stopping rules and fixed recursion budgets with principled, evidence-driven alternatives.
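A small sketch of an order-gap-based stopping rule: measure how far the consolidation and expansion operators are from commuting at the current state, and stop once the gap stays below a tolerance for a few consecutive rounds. The two toy operators below are stand-ins chosen only so that the gap shrinks as the state settles; they are not the paper's constructions.

```python
# Order-gap stopping rule sketch: the gap ||Q(P(theta)) - P(Q(theta))|| is
# tracked along the trajectory and training stops once it stays below `tol`
# for `patience` rounds. Q and P below are illustrative toy operators.
import numpy as np

def order_gap(theta, Q, P):
    return np.linalg.norm(Q(P(theta)) - P(Q(theta)))

def run_until_settled(theta, Q, P, tol=1e-4, patience=5, max_steps=10_000):
    quiet = 0
    for step in range(max_steps):
        theta = Q(P(theta))                      # one consolidation-expansion round
        quiet = quiet + 1 if order_gap(theta, Q, P) < tol else 0
        if quiet >= patience:
            return theta, step
    return theta, max_steps

target = np.array([1.0, -2.0, 0.5])
Q = lambda t: 0.5 * (t + target)                 # consolidation: settle toward knowledge
P = lambda t: t + 0.1 * np.tanh(target - t)      # expansion: bounded uptake of evidence
theta, steps = run_until_settled(np.zeros(3), Q, P)
print(theta, steps)
```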

[LG-97] Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions

链接: https://arxiv.org/abs/2605.09967
作者: Andrew Lee,Fernanda Viégas,Martin Wattenberg
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While researchers are finding concepts represented as linear directions in language models, a bag of linear directions fails to capture relational structure. To better understand this dichotomy, we study a model with known linear representations, but trained in a highly structured domain – the board game Othello. While the model’s internal board-state representation is linearly decodable, we find additional structure in the form of tensor product representations (TPRs). We train TPR probes to recover shared structure amongst the linear probes, yielding a factorization into square-embeddings, color-embeddings, and a binding matrix that composes them to construct the model’s board-state representation. We find geometric signatures within the weights of our TPR probe that align with the structure of the board, but perhaps more importantly, that the linear probes can be recovered directly from the parameters of our TPR probe. Our findings suggest that directional representations may be projections of more structured underlying representations.
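The tensor-product structure described in the abstract can be sketched directly: a board-state vector is the sum over squares of a binding map applied to (square embedding ⊗ color embedding), and the per-(square, color) linear probe direction can be read back off the same factors. The dimensions and the orthonormal random factors below are illustrative, not the trained probe.

```python
# TPR sketch: compose a board-state vector from square/color factors via a
# binding matrix, then recover per-(square, color) linear directions from the
# same parameters. Orthonormal factors make decoding exact in this toy.
import numpy as np

rng = np.random.default_rng(0)
n_squares, n_colors, d_model = 64, 3, 256            # Othello-like board, 3 colors

square_emb = np.linalg.qr(rng.normal(size=(64, 64)))[0]        # orthonormal square roles
color_emb = np.linalg.qr(rng.normal(size=(4, 4)))[0][:3]       # orthonormal color fillers
binding = np.linalg.qr(rng.normal(size=(d_model, 64 * 4)))[0]  # binding map (orthonormal cols)

def probe_direction(square, color):
    """Linear direction for 'square s has color c', read off the TPR factors."""
    return binding @ np.kron(square_emb[square], color_emb[color])

def board_representation(board_colors):
    """Compose the model-space board state from per-square bound fillers."""
    return sum(probe_direction(s, c) for s, c in enumerate(board_colors))

board = rng.integers(0, n_colors, size=n_squares)    # a random board configuration
rep = board_representation(board)
scores = [rep @ probe_direction(27, c) for c in range(n_colors)]
print(board[27], int(np.argmax(scores)))             # decoded color matches the true one
```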

[LG-98] Generating synthetic electronic health record data using agent-based models to evaluate machine learning robustness under mass casualty incidents

链接: https://arxiv.org/abs/2605.09951
作者: Roben Delos Reyes,Daniel Capurro,Nicholas Geard
类目: Machine Learning (cs.LG)
*备注: 14 pages, 1 figure; accepted at CHIL 2026

点击查看摘要

Abstract:ML models in healthcare are typically evaluated using curated real-world EHR data. A key limitation of such evaluations is that they may fail to assess the robustness of ML models to changes in the data at deployment, which is a common issue because EHR data used for ML model development cannot capture all such changes. Mass casualty incidents (MCIs) caused by disasters are critical instances where this will be an issue, as they induce rare, uncertain, and novel changes to routine system conditions. Because real-world EHR data from MCIs are often limited or unavailable, assessing ML robustness under such conditions before deployment remains challenging. Here, we propose an agent-based modelling approach for generating synthetic EHR data to evaluate the robustness of ML models under MCI scenarios. We use real-world EHR data to develop and calibrate an agent-based model (ABM) of an emergency department (ED) that explicitly models patient arrivals, resource capacity, and clinical workflow. By changing these system conditions to reflect plausible MCI scenarios, the ED model generates synthetic versions of the real-world EHR data that exhibit shifts in system behaviour. Using these synthetic data, we test ML models for predicting length of stay. We observed consistent declines in recall under MCI conditions relative to baseline system conditions, resulting in an increase in the number of patients with prolonged length of stay that were missed by the ML models. These results highlight the impact of changes in system conditions on patient outcomes, EHR data, and ML model performance. Our work establishes ABM-based synthetic EHR data generation as a proactive and systematic approach for evaluating the robustness of ML models under MCI or other system conditions not captured in real-world EHR data, supporting the safer and more effective deployment of ML models in healthcare systems.

[LG-99] From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models

链接: https://arxiv.org/abs/2605.09949
作者: Zehao Li,Yasuhiro Yoshikai,Shumpei Nemoto,Hiroyuki Kusuhara,Tadahaya Mizuno
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Understanding how chemical language models (CLMs) learn chemical meaning from molecular string representations, rather than only surface-level string patterns, is an important question in chemical representation learning and machine learning for chemistry. Chirality provides a demanding test case: enantiomers can differ greatly in pharmacological activity and toxicity, yet CLMs often struggle to distinguish chiral configurations reliably. Here we present Pan-CORE (Pan-Chemical Omniscale Representation Engine), a family of autoregressive Transformer-based encoder-decoder models for SMILES translation, and use high-temporal-resolution checkpoint analysis to investigate how chiral information is learned during training. Across all tested Pan-CORE variants, we observe a reproducible jump-up in which chiral-token accuracy rises abruptly after a long plateau, suggesting that chiral learning stagnation is not explained by model capacity alone and instead reflects the complexity of chiral constraints. Analyses of attention dynamics, residual-stream trajectories, and latent-space geometry support an encoder-centered mechanism in which chiral-token representations undergo transient destabilization and reconstruction, seen as a V-shaped drop and recovery in vector norm and directional stability, together with a clear reorganization of chiral molecular representations in the latent space. Encoder-decoder cross-evaluation further supports the encoder-centered nature of the transition, and targeted attention-head ablation identifies a small set of chiral-sensitive heads whose removal selectively reduces chiral-token accuracy even in the fully trained model. These findings show that SMILES translation can serve as a useful experimental system for mechanistic analysis of semantic emergence in CLMs, with implications for interpretable chemical representation learning.

[LG-100] Selection of the Best Policy under Fairness Constraints for Subpopulations

链接: https://arxiv.org/abs/2605.09945
作者: Tingyu Zhu,Yuhang Wu,Zeyu Zheng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many high-stakes decisions in health care, public policy, and clinical development require committing to a single policy that will be applied uniformly across a heterogeneous population. Regulatory and fairness standards sometimes require that the chosen policy performs adequately in every pre-specified subpopulation, not only on average. We formalize this as a Selection of the Best with Fairness Constraints (SBFC) problem: identify the policy with the highest average performance among those policies that meet a minimum per-subpopulation threshold. We establish an instance-specific lower bound on the sample complexity of the SBFC problem. We then develop a Track-and-Stop with Constraints on Subpopulation (T-a-S-CS) algorithm that achieves the lower bound asymptotically. We extend the framework to general closed-set and penalty-based fairness specifications with matching guarantees. Numerical experiments and a case study using the International Stroke Trial demonstrate substantial efficiency gains over policy-level allocation baselines.

[LG-101] TeleResilienceBench: Quantifying Resilience for LLM Reasoning in Telecommunications

链接: https://arxiv.org/abs/2605.09929
作者: Pranshav Gajjar,Emmanuel Ojo,Vijay K Shah
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Deploying large language models in telecommunications requires more than task accuracy. In realistic workflows, a model may inherit partially completed reasoning from a prior step, an upstream agent, or its own earlier generation, and must continue that reasoning even when it is already going wrong. We introduce TeleResilienceBench, a benchmark that quantifies this capability, which we term reasoning resilience, across seven telecom sub-domains drawn from the GSMA Open-Telco LLM suite. Instances are constructed by collecting failures from a weak generator model, truncating the flawed reasoning trace at its midpoint, and asking a target model to continue and correct it. We propose the Correct Flip Rate (CFR) as a direct measure of successful recovery and evaluate eight models spanning the Qwen3.5, Gemma4, and Nemotron-3 families. Our results show that even the strongest model achieves a macro-average CFR of only 29.1%, and scale does not reliably improve resilience within families. Nemotron-3-nano 4b outperforms all Qwen3.5 variants including the 27b model and leads the auxiliary TeleMath numerical evaluation at 23.4% CR%, offering the best resilience-to-cost ratio in the set. A difficulty-stratified analysis further reveals that existing telecom benchmark difficulty labels reflect factual specificity rather than reasoning depth, suggesting that current evaluations measure knowledge coverage more than reasoning ability.

[LG-102] Deep Learning under Fractional-Order Differential Privacy

链接: https://arxiv.org/abs/2605.09890
作者: Mohammad Partohaghighi,Roummel Marcia
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Differentially private stochastic gradient descent (DP-SGD) is a standard approach to privacy-preserving learning based on per-example clipping, subsampling, Gaussian perturbation, and privacy accounting. Classical DP-SGD releases a noisy version of the current clipped subsampled gradient sum. We propose Fractional-Order Differentially Private Stochastic Gradient Descent (FO-DP-SGD), a mechanism-level extension that replaces this current-only query, before Gaussian noise is added, with a fractional recursive query combining the current clipped sum with a finite-window, power-law-weighted aggregation of previously released private sum-level outputs. This injects fractional memory into the release mechanism while preserving the standard sum-then-noise-then-divide structure. Under add/remove adjacency with Poisson subsampling, the current-step sensitivity analysis shows that the only newly data-dependent term is the scaled current clipped sum. Hence, conditioned on the private history, the effective \ell_2-sensitivity is at most \beta C, where C is the clipping threshold and \beta\in(0,1] controls the current-step contribution. Thus, FO-DP-SGD admits standard per-step Rényi differential privacy accounting via a Poisson-subsampled Gaussian mechanism with effective noise-to-sensitivity ratio \sigma/\beta, and composes to yield overall (\varepsilon,\delta)-differential privacy guarantees. FO-DP-SGD provides a framework for studying long-memory effects in private optimization. The fractional order, memory window, and mixing coefficient govern the trade-off among current-step sensitivity, signal retention, and private-history influence. Experiments on SVHN, CIFAR-10, and CIFAR-100 show improved test accuracy and privacy–utility performance over DP-SGD and private baselines including DP-Adam, DP-IS, SA-DP-SGD, ADP-AdamW, DP-SAT, and DP-Adam-AC.
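A toy sketch of a fractional-memory "sum-then-noise-then-divide" release: the current clipped gradient sum is scaled by \beta and mixed with a finite window of power-law-weighted previously released (already private) sums before Gaussian noise is added. The power-law weights and their normalization are illustrative choices, not the paper's fractional coefficients.

```python
# Sketch of a fractional-memory release in the spirit of FO-DP-SGD: mix the
# beta-scaled current clipped sum with power-law-weighted past *private*
# releases, then add Gaussian noise. Weight normalization is illustrative.
import numpy as np

def clip(g, C):
    n = np.linalg.norm(g)
    return g * min(1.0, C / n) if n > 0 else g

def fo_dp_release(per_example_grads, history, beta=0.7, alpha=0.5,
                  window=5, C=1.0, sigma=1.0, rng=None):
    rng = rng or np.random.default_rng(0)
    clipped_sum = sum(clip(g, C) for g in per_example_grads)
    if history:
        past = history[-window:]
        raw = np.array([(k + 1) ** (-(1.0 + alpha)) for k in range(len(past))])
        weights = (1.0 - beta) * raw / raw.sum()      # illustrative normalization
        memory = sum(w * h for w, h in zip(weights, reversed(past)))
    else:
        memory = np.zeros_like(clipped_sum)
    noise = rng.normal(0.0, sigma * C, size=clipped_sum.shape)
    release = beta * clipped_sum + memory + noise
    history.append(release)                           # history stores private outputs only
    return release

rng = np.random.default_rng(1)
history = []
for step in range(10):
    grads = [rng.normal(size=4) for _ in range(8)]    # toy per-example gradients
    noisy_sum = fo_dp_release(grads, history, rng=rng)
print(noisy_sum / 8)                                  # divide after noising, as in DP-SGD
```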

[LG-103] Concordia: Self-Improving Synthetic Tables for Federated LLMs

链接: https://arxiv.org/abs/2605.09855
作者: Jimin Huang,Duanyu Feng,Nuo Chen,Xiaoyu Wang,Zhiqiang Zhang,Xueqing Peng,Mingquan Lin,Prayag Tiwari,Guojun Xiong,Alejandro Lopez-Lira,Sophia Ananiadou
类目: Machine Learning (cs.LG)
*备注: 12 pages

点击查看摘要

Abstract:Federated learning (FL) enables training large language models (LLMs) without sharing raw data, but adapting LLMs under strict data isolation and non-IID client distributions remains challenging in practice. Synthetic data offers a natural privacy-preserving surrogate for local training, yet existing federated pipelines typically treat synthetic generation as static or loosely coupled with downstream optimization, leading to rapidly diminishing utility under heterogeneous clients. We study federated adaptation of LLMs on tabular tasks where raw records and validation data cannot be shared, and local training must rely entirely on synthetic tables. We propose Concordia, a tri-level optimization framework that aligns synthetic data generation with federated validation utility despite these constraints. At the client level, models are adapted via parameter-efficient LoRA training on synthetic tables. Clients additionally learn lightweight utility scorers from private validation feedback to reweight synthetic samples during local training. At the outer level, each client refines its own synthetic table generator using group-relative policy optimization (GRPO), guided by an ensemble of heterogeneous scorers shared across clients, without aggregating generator parameters or exposing validation data. Experiments on privacy-sensitive tabular benchmarks from finance and healthcare demonstrate that Concordia consistently improves federated performance, cross-client stability, and robustness to distribution shift compared to static and decoupled synthetic-data baselines.

[LG-104] Exploration-Driven Optimization for Test-Time Large Language Model Reasoning

链接: https://arxiv.org/abs/2605.09853
作者: Changhao Li,Yuchen Zhuang,Chenxiao Gao,Haotian Sun,Rushi Qiang,Chao Zhang,Bo Dai
类目: Machine Learning (cs.LG)
*备注: Accepted by TMLR 2026

点击查看摘要

Abstract:Post-training techniques combined with inference-time scaling significantly enhance the reasoning and alignment capabilities of large language models (LLMs). However, a fundamental tension arises: inference-time methods benefit from diverse sampling from a relatively flattened probability distribution, whereas reinforcement learning (RL)-based post-training inherently sharpens these distributions. To address this, we propose Exploration-Driven Optimization (EDO), which extends reward-biasing style exploration objectives to iterative post-training and integrates them into standard RL objectives, encouraging greater diversity in sampled solutions while facilitating more effective inference-time computation. We incorporate EDO into iterative Direct Preference Optimization (iDPO) and Group Relative Policy Optimization (GRPO), resulting in two variants: ED-iDPO and ED-GRPO. Extensive experiments demonstrate that both ED-iDPO and ED-GRPO exhibit greater solution diversity and improved reasoning abilities, particularly when combined with test-time computation techniques like self-consistency. Across three in-distribution reasoning benchmarks, EDO achieves a 1.0-1.3% improvement over the strongest baselines, and delivers an additional 1.5% average gain on five out-of-distribution tasks. Beyond accuracy, EDO preserves model entropy and stabilizes RL training dynamics, highlighting its effectiveness in preventing over-optimization collapse. Taken together, these results establish EDO as a practical framework for balancing exploration and exploitation in LLM reasoning, especially in settings that rely on test-time scaling.

[LG-105] Efficient Neural Architectures for Real-Time ECG Interpretation on Limited Hardware

链接: https://arxiv.org/abs/2605.09848
作者: Ashery Mbilinyi,Callum O’Riley,Julia Handra,Ashley Moller-Hansen,Jason Andrade,Marc Deyell,Cameron Hague,Nathaniel Hawkins,Kendall Ho,Jonathan Leipsic,Roger Tam
类目: Machine Learning (cs.LG)
*备注: 9 pages, 6 figures, 3 tables. Published in: 2025 IEEE International Conference on Big Data (BigData), pp. 3275-3284. DOI: https://doi.org/10.1109/BIGDATA66926.2025.11402097

点击查看摘要

Abstract:Electrocardiogram (ECG) interpretation is essential for diagnosing a wide range of cardiac abnormalities. While deep learning has shown strong potential for automating ECG classification, many existing models rely on large, computationally intensive architectures that hinder practical deployment. In this paper, we present an empirical study of convolutional neural network (CNN) architectures, exploring tradeoffs between diagnostic accuracy and computational efficiency. We benchmark two established baselines: AttiaNet, a compact model composed of sequential temporal and spatial blocks, and DeepResidualCNN, the winning architecture of the 2021 PhysioNet/Computing in Cardiology Challenge. Building on these, we propose three lightweight models: (i) ParallelCNN, which employs dual temporal and spatial branches for parallel pattern extraction; (ii) ParallelCNNew, a variant with symmetric weight initialization for balanced feature learning; and (iii) SimpleNet, a streamlined architecture that jointly processes temporal and spatial dimensions. Our experiments span three publicly available 12-lead ECG datasets from Germany, China, and the United States, covering binary, multiclass, and multilabel classification tasks across diverse patient populations. We further evaluate the impact of integrating low-cost demographic metadata (age and sex) to improve performance with minimal overhead. To ensure fair comparison, we introduce a unified Efficiency Score that integrates model size, inference speed, memory usage, and AUC performance. By balancing diagnostic performance and efficiency, our models offer a scalable and viable foundation for next-generation AI systems in cardiovascular care.

[LG-106] Sub-Footprint Effect Correction in FW-LiDAR Point Clouds via Intra-Footprint Target Unmixing

链接: https://arxiv.org/abs/2605.09845
作者: Zhen Xiao,Yanfeng Gu,Xian Li
类目: Machine Learning (cs.LG)
*备注: 11 pages,7 figures

点击查看摘要

Abstract:Sub-footprint target mixing within a laser footprint significantly increases LiDAR intensity uncertainty, especially in complex environments where heterogeneous materials inside one footprint cause nonlinear distortions that impair intensity-based applications. However, the forward mixing inherent to the single-pixel detection mode of LiDAR systems blurs sub-footprint contributions, making sub-footprint effects difficult to address effectively in existing studies. To address this issue, we introduce a novel, physics-based framework that explicitly resolves sub-footprint intensity correction in full-waveform LiDAR (FW-LiDAR) point clouds. The key innovation is to make the otherwise implicit intra-footprint mixing process explicit: we first develop a spatiotemporal laser-beam distribution model to physically characterize within-footprint forward mixing of multi-target returns. Building on this formulation, we incorporate ancillary information including waveform parameters and surface geometry as constraints to pose a well-defined inverse unmixing problem and decompose each footprint into fractional contributions from multiple sub-targets. We then recover sub-footprint-corrected intensities by inverting the observed mixtures through a unified combination of parametric and model-driven approaches. To the best of our knowledge, few prior studies explicitly establish sub-footprint inversion and correction within a single laser footprint, and our framework offers a principled, physics-grounded solution. Experiments on both controlled and real-world LiDAR datasets demonstrate that the proposed method significantly enhances semantic separability across heterogeneous targets and intensity consistency across homogeneous targets.

[LG-107] Cross-Domain Lossy Compression via Constrained Minimum Entropy Coupling

链接: https://arxiv.org/abs/2605.09833
作者: Nam Nguyen,Hassan Tavakoli,An Vuong,Thinh Nguyen,Bella Bose
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper studies cross-domain lossy compression through the lens of minimum entropy coupling (MEC) with rate and classification constraints. In this setting, an encoder observes samples from a degraded source domain, while the decoder is required to generate outputs following a prescribed target distribution and to preserve information relevant to a downstream classification task. Motivated by logarithmic-loss distortion, we adopt an information-based objective that maximizes the coupling strength between the source and reconstruction, rather than minimizing a sample-wise distortion. Under common randomness, we formulate a rate-constrained MEC problem (MEC-B) and show that the intermediate representation can be removed without loss of optimality, yielding an equivalent deterministic coupling formulation. For Bernoulli sources, closed-form expressions are derived with and without classification constraints. In addition, we implement a neural restoration framework using quantization, entropy modeling, distribution matching, and classification regularization. Experiments on MNIST super-resolution and SVHN denoising show that increasing the available rate improves classification accuracy and yields more informative reconstructions.
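
A minimal numerical illustration of the coupling-strength objective for the Bernoulli case mentioned above: with both marginals fixed, every coupling is parameterized by a single joint probability, and maximizing mutual information (equivalently, minimizing joint entropy) reduces to a one-dimensional search. The function name and grid search below are illustrative assumptions; the paper's MEC-B formulation additionally imposes rate and classification constraints, which this sketch omits.

```python
import numpy as np

def max_coupling_strength_bernoulli(p, q, grid=10001):
    # Couplings of Bernoulli(p) and Bernoulli(q) marginals are parameterized by
    # a = P(X=1, Y=1) in [max(0, p+q-1), min(p, q)]; search it for maximal I(X;Y).
    lo, hi = max(0.0, p + q - 1.0), min(p, q)
    h_x = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    h_y = -(q * np.log2(q) + (1 - q) * np.log2(1 - q))
    best_mi, best_joint = -np.inf, None
    for a in np.linspace(lo, hi, grid):
        joint = np.array([[1 - p - q + a, q - a],
                          [p - a, a]])              # rows: X in {0,1}, cols: Y in {0,1}
        mass = joint[joint > 1e-12]
        mi = h_x + h_y + (mass * np.log2(mass)).sum()   # I = H(X) + H(Y) - H(X,Y)
        if mi > best_mi:
            best_mi, best_joint = mi, joint
    return best_mi, best_joint

mi, joint = max_coupling_strength_bernoulli(0.3, 0.6)
print(round(float(mi), 4))
print(joint)
```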

[LG-108] Modeling Atomic Conformational Ensembles of Proteins via Test-Time Supervision of Boltz-2 on Cryo-EM Density Maps

链接: https://arxiv.org/abs/2605.09832
作者: Jay Shenoy,Miro Astore,Axel Levy,Frédéric Poitevin,Sonya M. Hanson,Gordon Wetzstein
类目: Machine Learning (cs.LG)
*备注: Project page: this https URL

点击查看摘要

Abstract:Knowledge of a protein’s atomic conformational ensemble is critical to determining its function, yet state-of-the-art ensemble prediction models are limited by a lack of high-quality conformational data from simulation or experiment. Recent advances in heterogeneous reconstruction for cryo-electron microscopy (cryo-EM) have enabled scientists to visualize ensembles of density maps for larger proteins and complexes not typically accessible through simulation, but building atomic models into these maps remains a challenge. Traditionally, ensemble prediction models are trained via a two-stage process: experimental density maps are converted into atomic structural ensembles through model building, after which these structures are used to train sequence-to-atomic ensemble predictors. In this work, we propose a new principle for fine-tuning pre-trained static structure prediction models such as Boltz-2 directly on raw cryo-EM maps, bypassing the two-stage process. We apply this technique to the problem of atomic model building by fine-tuning Boltz-2 to generate atomic conformations from an input ensemble of cryo-EM maps, achieving superior model building accuracy compared to prior work. Beyond overfitting to individual map ensembles, our method, CryoSampler, also shows preliminary evidence of in-domain generalization after fine-tuning, sampling diverse atomic conformations for unseen sequences within the same protein family without requiring cryo-EM data. These capabilities indicate that CryoSampler holds the potential to train next-generation atomic ensemble prediction models directly on raw cryo-EM measurements.

[LG-109] Dystruct: Dynamically Structured Diffusion Language Model Decoding via Bayesian Inference

链接: https://arxiv.org/abs/2605.09820
作者: Bian Sun,Kevin Zhai,Mubarak Shah,Zhenyi Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion language models (DLMs) have recently emerged as a promising alternative to autoregressive models, primarily due to their ability to enable parallel decoding. Despite this advantage, most existing DLMs rely on a fixed generation length specified prior to decoding, which restricts their flexibility in real-world applications. While a few recent works attempt to support flexible-length generation, they typically suffer from notable limitations: some require costly retraining to accommodate variable-length outputs, while others depend solely on local confidence signals during decoding. Such local criteria fail to capture the evolving structure of the sequence, often resulting in suboptimal generation quality. In this paper, we propose a training-free, Bayesian structured decoding framework that formulates flexible-length generation as a dynamic structural inference problem, jointly computing the expansion length, the block boundaries, and the decoding schedule. At each window expansion step, the method integrates local uncertainty with structural signals via a unified mechanism that supports dynamic structured generation, including both flexible block expansion and block organization, while maintaining coherence. Extensive experiments across multiple benchmarks demonstrate that our approach significantly improves generation quality and flexibility over existing fixed-length and flexible-length baselines. These results highlight the advantage of Bayesian structured decoding for diffusion language models, providing a principled and efficient solution for structured text generation.

[LG-110] Learning to Compress Time-to-Control: A Reinforcement Learning Framework for Chronic Disease Management

链接: https://arxiv.org/abs/2605.09818
作者: Prabhjot Singh,Abhishek Gupta,Chris Betz,Abe Flansburg,Brett Ives,Sudeep Lama,Jung Hoon Son
类目: Machine Learning (cs.LG)
*备注: 26 pages, 3 figures

点击查看摘要

Abstract:Reinforcement learning (RL) in healthcare has had mixed results, with reward sparsity, unreliable off-policy evaluation, and deployment-simulation gap as recurring failure modes. We argue that chronic disease management is structurally a more tractable RL setting than the acute-care problems the field has primarily studied, but only if the problem is formalized to exploit chronic care’s properties. We propose such a formalization. The agent’s objective is to compress time-to-control (TTC) under a tiered reward calibrated to the CMS ACCESS Model. Two quantities from our companion preference-learning paper [Singh et al. 2026] enter as load-bearing structural elements: the execution intensity \epsilon bounds action availability under a constrained Markov Decision Process, and the clinician capability \kappa weights offline-data transitions during RL training. Together they couple preference learning and RL into a two-loop architecture. We present simulation results on synthetic state machines for hypertension and type 2 diabetes. Capability-weighted offline RL outperforms uniform-weighted offline RL and the behavior policy by 15 percentage points on T2D TTC; the uniform-weighted formulation (the standard in existing healthcare RL) underperforms even the heterogeneous behavior policy. \epsilon-aware policies generalize across deployment regimes while \epsilon-naive policies do not.

[LG-111] Optimizing Server Placement for Vertical Federated Learning in Dynamic Edge/Fog Networks

链接: https://arxiv.org/abs/2605.09813
作者: Su Wang,Mung Chiang,H. Vincent Poor
类目: Networking and Internet Architecture (cs.NI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Under revision at IEEE/ACM transactions on networking

点击查看摘要

Abstract:We investigate the control and optimization of vertical federated learning (VFL), a class of distributed machine learning (ML) methods in which edge/fog devices contain separate data features, in dynamic edge/fog networks. Owing to heterogeneous data features and hardware across edge/fog networks, devices’ contributions to VFL vary substantially, and, moreover, dynamic edge/fog networks can lead to the permanent exit or entry of select data features. In this setting, our proposed methodology, server controlled VFL in dynamic networks (SC-DN), first establishes the existence of a global first-order stationary point for every global round, and then leverages this result to jointly optimize ML model training and resource consumption based on four key control variables: (i) server placement, (ii) device-to-server transmit power, (iii) local device processor frequency, and (iv) local training iterations per global round. The resulting optimization formulation contains coupled variables as well as numerous forms of logarithmic constraints; we show it is a mixed-integer signomial program, an NP-hard problem, and develop a general solver for it. Finally, via experiments on both image and multi-modal datasets, we show that our methodology demonstrates superior classification/regression performance and resource consumption savings compared to even greedy methodologies.

[LG-112] Bayesian Optimization with Structured Measurements: A Vector-Valued RKHS Framework

链接: https://arxiv.org/abs/2605.09775
作者: Wenbin Wang,Colin N. Jones
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Bayesian optimization (BO) is an efficient framework for optimizing expensive black-box functions. However, it is typically formulated as learning an end-to-end mapping from inputs to scalar objectives, thereby discarding the potentially rich information whenever a structured system output is available. In this work, we study Bayesian optimization over a vector-valued operator with structured measurements, where each measurement observes multidimensional or functional outputs, e.g., trajectories or spatial fields, rather than a single scalar value. The objective is then defined as a linear functional of these measurements. This allows each observation to reveal substantially richer information about the underlying system compared to scalar observations. Assuming the unknown operator lies in a vector-valued reproducing kernel Hilbert space (RKHS), we derive high-probability concentration bounds for the kernel ridge regression (KRR) estimator directly in the measurement space, characterizing uncertainty in a general Hilbert space. Building on these results, we propose an algorithm based on the upper confidence bound (UCB) acquisition function with regret guarantees under mild assumptions, recovering sublinear rates for common kernels. Empirically, we demonstrate that leveraging structured measurements leads to improved sample efficiency by enabling efficient transfer of information across objectives and adaptation to time-varying settings.

[LG-113] On Uniform Error Bounds for Kernel Regression under Non-Gaussian Noise ICML

链接: https://arxiv.org/abs/2605.09757
作者: Johannes Teutsch,Oleksii Molodchyk,Marion Leibold,Timm Faulwasser,Armin Lederer
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: This paper has been accepted at the 43rd International Conference on Machine Learning (ICML) 2026

点击查看摘要

Abstract:Providing non-conservative uncertainty quantification for function estimates derived from noisy observations remains a fundamental challenge in statistical machine learning, particularly for applications in safety-critical domains. In this work, we propose novel non-asymptotic probabilistic uniform error bounds for kernel-based regression. Compared to related bounds in the literature that are restricted to (conditionally) independent sub-Gaussian noise, our bounds allow to consider a broad class of non-Gaussian distributions, such as sub-Gaussian, bounded, sub-exponential, and variance/moment-bounded noise. Moreover, our results apply to correlated and uncorrelated noise. We compare our proposed error bounds with existing results in terms of the induced uncertainty region and their performance in safe control, demonstrating the tightness of the proposed bounds.
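
For orientation, the sketch below builds the object these bounds apply to: a standard kernel ridge regression estimator together with the usual posterior-style width term that such uniform error bounds scale. The RBF kernel, the synthetic data, and the scaling constant `beta` are placeholders; the paper's contribution is the noise-dependent constants that would replace `beta` under sub-Gaussian, bounded, sub-exponential, and moment-bounded noise, which are not reproduced here.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)   # noisy observations
lam = 1e-2                                            # ridge regularizer

K = rbf_kernel(X, X)
K_reg_inv = np.linalg.inv(K + lam * np.eye(len(X)))
alpha = K_reg_inv @ y                                 # KRR weights

Xs = np.linspace(-3, 3, 200)[:, None]                 # evaluation grid
Ks = rbf_kernel(Xs, X)
mean = Ks @ alpha                                     # KRR prediction
# power-function-style width: sqrt(k(x,x) - k_x^T (K + lam I)^{-1} k_x)
width = np.sqrt(np.maximum(1.0 - np.einsum("ij,jk,ik->i", Ks, K_reg_inv, Ks), 0.0))
beta = 2.0                                            # placeholder scaling constant
upper, lower = mean + beta * width, mean - beta * width
print(float(mean[100]), float(width[100]))
```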

[LG-114] Accelerating Power Method with Fast Sketching for Stronger Low-Rank Approximation

链接: https://arxiv.org/abs/2605.09755
作者: Shabarish Chenakkod,Michał Dereziński
类目: Numerical Analysis (math.NA); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The power method is one of the most fundamental tools for extracting top principal components from data through low-rank matrix approximation. Yet, when the target rank is large, the cost of matrix multiplication associated with this procedure becomes a major bottleneck. We develop an algorithmic and theoretical framework for accelerating the power method using fast sketching, which is a popular paradigm in randomized linear algebra. Our framework leads to simple and provably efficient methods for singular value decomposition, low-rank factorization, and Nyström approximation, which attain strong numerical performance on benchmark problems. The key novelty in our analysis is the use of regularized spectral approximation, a property of fast sketching methods which proves more flexible in generalizing power method guarantees than traditional arguments.
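
The abstract builds on randomized power iteration for low-rank approximation; the plain-Gaussian-sketch baseline is sketched below for context. This uses a dense Gaussian test matrix, whereas the paper's point is to replace that multiplication with fast sketching and analyze it via regularized spectral approximation; the function name and parameters here are illustrative, not the paper's method.

```python
import numpy as np

def randomized_power_lowrank(A, rank, power_iters=2, oversample=10, seed=0):
    # Standard sketch-and-power-iterate low-rank approximation (Halko et al. style).
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, rank + oversample))   # Gaussian test matrix
    Y = A @ Omega
    for _ in range(power_iters):                          # power iterations sharpen the range
        Y, _ = np.linalg.qr(Y)
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)
    B = Q.T @ A                                           # small projected matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :rank], s[:rank], Vt[:rank]

A = np.random.default_rng(1).standard_normal((500, 300))
U, s, Vt = randomized_power_lowrank(A, rank=20)
print(np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A))  # relative approximation error
```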

[LG-115] Learning from Acceptance: Cumulative Regret in the Game of Coding

链接: https://arxiv.org/abs/2605.09754
作者: Hanzaleh Akbari Nodehi,Parsa Moradi,Mohammad Ali Maddah-Ali
类目: Information Theory (cs.IT); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Classical coding-theoretic guarantees often rely on trust assumptions, such as requiring sufficiently many honest nodes compared with adversarial ones. These assumptions are difficult to enforce in open decentralized systems where participants are not centrally certified. At the same time, such environments often contain incentive mechanisms: participants may be rewarded only when their submitted data are accepted and the system remains functional. This changes the role of an adversary. Rather than acting as a pure saboteur, a strategic adversary may submit data that are consistent enough to be accepted while still degrading the quality of the final estimate. The game-of-coding framework models this strategic interaction between a data collector (DC) and an adversary. Existing works on the game of coding mostly consider the complete-information case, where the DC knows how the adversary trades off acceptance and estimation error. In this paper, we study an incomplete-information version of the game of coding in which the DC, acting as a Stackelberg leader, does not know the adversary’s utility trade-off and must learn through repeated interaction. Prior work on the unknown-adversary setting considered an explore-then-commit objective, where only the final selected acceptance rule is evaluated. In contrast, we study the full learning trajectory: every acceptance rule used during the algorithm is executed and contributes to performance. We propose an algorithm that refines its search around promising acceptance rules, prove that it achieves sublinear cumulative regret, and evaluate its performance through numerical experiments.

[LG-116] CALYREX: Cross-Attention LaYeR EXtended Transformers for System Prompt Anchoring

链接: https://arxiv.org/abs/2605.09737
作者: Li Lixing
类目: Machine Learning (cs.LG)
*备注: Preprint. 25 pages, 4 figures, 9 tables

点击查看摘要

Abstract:Modern large language models (LLMs) rely on system prompts to establish behavioral constraints and safety rules. Standard causal self-attention treats privileged instructions and untrusted user content with equal structural priority – a mismatch that leaves models vulnerable to prompt injection and instruction erosion over extended contexts. We propose CALYREX (Cross-Attention LaYeR EXtended transformers), which utilizes cross-attention between input and system prompt to structurally isolate and anchor the rule. A placement ablation on a 1.5B backbone identifies insertion at the final eighth of layers as optimal, confirmed by mechanistic activation analysis showing behavioral constraints are naturally concentrated there. At 8B scale, controlling for training data, backbone, and parameter budget, CALYREX yields +7.4% on instruction-following (IFEval) and +16.3% on multi-turn instruction adherence, while reducing many-shot jailbreaking attack success rate by 13%. This advantage appears to widen with model scale, consistent with larger models more effectively utilizing the dedicated routing pathway.
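
A rough PyTorch sketch of the architectural idea described above: a cross-attention block whose queries come from the running hidden states and whose keys and values come from the system-prompt representation, to be inserted in the final eighth of layers. The class name, dimensions, normalization, and residual wiring are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SystemPromptCrossAttention(nn.Module):
    """Hypothetical cross-attention layer that re-reads frozen system-prompt
    hidden states from the running user/assistant stream."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, hidden, system_hidden):
        # Queries from the running stream, keys/values from the system prompt.
        anchored, _ = self.attn(query=self.norm(hidden),
                                key=system_hidden, value=system_hidden)
        return hidden + anchored   # residual connection back into the stream

x = torch.randn(2, 128, 512)       # (batch, user tokens, d_model)
sys_h = torch.randn(2, 32, 512)    # (batch, system-prompt tokens, d_model)
layer = SystemPromptCrossAttention(d_model=512, n_heads=8)
print(layer(x, sys_h).shape)
```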

[LG-117] RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement

链接: https://arxiv.org/abs/2605.09730
作者: Will LeVine,Brendan Evers,Sam Saltwick,Abhay Venkatesh
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Iterative self-refinement is a popular inference-time reliability technique, but its effectiveness in code-mode tool use depends heavily on the structure of the feedback signal: unstructured critique helps inconsistently across models, and even revision with real execution feedback improves only modestly (0.75 vs. 0.65 baseline). The dominant failures are inter-tool contract violations - wrong output shape, incorrect tool routing, broken argument provenance - that run to completion without raising errors, making runtime feedback insufficient. We introduce RubricRefine, a training-free pre-execution reliability layer that generates task- and registry-specific rubrics, scores candidate code against explicit contract checks, and iteratively repairs failures before any execution occurs. With zero execution attempts, RubricRefine reaches 0.86 on M3ToolEval averaged across seven models, improving over prior inference-time baselines on every model tested on this benchmark at 2.6X lower latency than the strongest non-iterative alternative, and remains flat on the predominantly single-step API-Bank, consistent with the method’s reliance on inter-tool contract structure. A rubric-category ablation and calibration analysis further characterize when and why the method works.

[LG-118] Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds

链接: https://arxiv.org/abs/2605.09724
作者: Yiding Song,Hanming Ye
类目: Machine Learning (cs.LG)
*备注: 23 pages, 10 figures, 12 tables

点击查看摘要

Abstract:Existing accounts of grokking explain the phenomenon in terms of mechanistic frameworks such as circuit efficiency or lazy-to-rich transitions. However, despite a known dependence between grokking and model size, how model capacity shapes grokking remains an open question. We give an information-theoretic account of this relationship on the task of modular arithmetic, showing that grokking does not immediately occur when a model becomes large enough to memorise the training set, but rather emerges as the outcome of a competition between two measurable timescales: a memorisation speed T_{\text{mem}}(P) and a generalisation speed T_{\text{gen}}(P), both of which are functions of model parameter count P. Adapting the information capacity framework of Morris et al. (2025), we estimate T_{\text{mem}}(P) on random-label data of equivalent complexity and T_{\text{gen}}(P) on the modular task itself, and show that grokking emerges close to the parameter scale where these timescales intersect. The framework also suggests an empirical model for predicting memorisation speed given model capacity and dataset complexity, recovering the previously reported empirical observation that larger models memorise faster. Overall, we motivate the formalisation of different learning timescales as important abstractions to study when explaining how model capacity shapes grokking on algorithmic tasks.

[LG-119] Benchmarking Transformer and xLSTM for Time-Series Forecasting of Heat Consumption

链接: https://arxiv.org/abs/2605.09722
作者: Marja Wahl,Daniel R. Bayer,Sven Rausch,Marco Pruckner
类目: Machine Learning (cs.LG)
*备注: Submitted version of the paper submitted to IEEE SusTech, 2026

点击查看摘要

Abstract:Obtaining an accurate short-term forecasting for heat demand is an essential part of operating district heating networks cost-efficient and reliable. Heat consumption time series at the building level are highly dependent on exogenous variables such as outdoor temperature and individual usage patterns, making forecasting in this context a challenging task. Thus, this paper benchmarks novel Transformer-based and xLSTM architectures for short-term heat-demand forecasting. Using hourly data from 25 German buildings (2017-2025), we compare three-hour and 24-hour forecasting horizons relevant for intraday control and day-ahead scheduling. We establish a multi-building benchmark that tests whether models trained on pooled, heterogeneous building data are able to generalize across diverse building stock. The results show that the xLSTM achieves the lowest RMSE (19.88 kWh for three-hour, 21.47 kWh for 24-hour forecasts), while the Temporal Fusion Transformer attains the best MAE (9.16 kWh for three-hour forecasts). As xLSTMs and Transformers require long training times and have a huge number of trainable parameters, their sustainability remains questionable. Therefore, this paper further investigates the trade-off between predictive accuracy and computational resource demand of the evaluated forecasting models. The findings indicate that also low-parameter models like a traditional fully-connected network achieve good predictive results, highlighting that marginal accuracy gains of the novel prediction models come at substantial resource expense for this use case.

[LG-120] Discovery of Nonlinear Dynamics with Automated Basis Function Generation

链接: https://arxiv.org/abs/2605.09696
作者: Mohammad Amin Basiri,Charles Nicholson
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Symbolic Computation (cs.SC)
*备注: 53 pages, 17 figures. Code available at this https URL

点击查看摘要

Abstract:Discovering governing equations from observational data remains a fundamental challenge in scientific modeling, particularly when the underlying mathematical structure is unknown. Traditional sparse identification methods like SINDy excel at discovering parsimonious models but require researchers to specify candidate basis functions a priori, a limitation that often leads to model failure when critical terms are omitted or when systems exhibit unconventional dynamics. Purely symbolic regression approaches offer unlimited flexibility but struggle with noise sensitivity and frequently produce overly complex, unstable equations. We present AutoSINDy, a hybrid Discovery-then-Solve framework that combines the exploratory power of symbolic regression with the robust sparsity-promoting capabilities of SINDy. Our method operates in three stages: (1) PySR-based symbolic regression discovers candidate functional forms from bootstrapped data chunks; (2) a curation pipeline decomposes, expands, and filters these expressions using collinearity analysis to construct a minimal yet comprehensive library; and (3) SINDy identifies sparse governing equations from this custom-tailored library. Extensive experiments across canonical nonlinear systems demonstrate that AutoSINDy consistently recovers ground-truth equations even under high observational noise, achieving a ground-truth recovery rate of 92.8% across all trials. Compared with standard SINDy using enriched libraries and standalone symbolic regression, AutoSINDy achieves higher predictive accuracy, superior generalization to unseen trajectories, and substantially lower symbolic complexity.
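
For context on stage (3), the sketch below shows the classic sequentially thresholded least-squares step that SINDy-style methods apply once a candidate library is in hand, on a toy two-dimensional system. The hand-written polynomial library here is exactly what AutoSINDy would replace with its PySR-derived, curated library; the function names and toy dynamics are illustrative assumptions.

```python
import numpy as np

def stlsq(Theta, dXdt, threshold=0.1, iters=10):
    # Sequentially thresholded least squares: fit, zero out small coefficients,
    # refit on the surviving library columns, and repeat.
    Xi = np.linalg.lstsq(Theta, dXdt, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0
        for k in range(dXdt.shape[1]):
            big = ~small[:, k]
            if big.any():
                Xi[big, k] = np.linalg.lstsq(Theta[:, big], dXdt[:, k], rcond=None)[0]
    return Xi

# Toy data from dx/dt = -2x, dy/dt = x*y with a hand-coded candidate library.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
dX = np.column_stack([-2 * X[:, 0], X[:, 0] * X[:, 1]])
Theta = np.column_stack([np.ones(len(X)), X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])
print(stlsq(Theta, dX))   # should recover coefficients -2 and 1 on the right columns
```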

[LG-121] Quantum Circuit Simulation of Compartmental Drug Dynamics: Leveraging Variational Algorithms for Nonlinear Mixed-Effects Population Pharmacokinetics

链接: https://arxiv.org/abs/2605.09691
作者: Isshaan Singh,Nandan Patel
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Population pharmacokinetic/pharmacodynamic (PK/PD) modeling traditionally relies on classical ordinary differential equations to simulate drug dynamics. In this work, we reformulate a compartmental PK/PD model as an open quantum system and implement it using quantum circuits developed in PennyLane. Four pharmacological compartments (central, peripheral, effect-site, and response) are encoded using twelve qubits, with inter-compartmental transitions represented through controlled quantum operations that emulate stochastic dynamics. The framework is evaluated on Phase 1 clinical data using a quantum-enhanced stochastic approximation expectation-maximization (SAEM) approach. Compared with the classical implementation, the quantum model achieves substantially improved log-likelihood values, indicating stronger statistical fit while preserving identical parameter estimates, thereby validating numerical consistency and model interpretability. The quantum-based optimization converges faster in terms of iterations, although total runtime is increased due to current simulation overhead. The study demonstrates stable large-scale simulation performance and establishes a hybrid quantum-classical approach that maintains biological fidelity while improving statistical modeling capacity. The dataset and problem statement originate from the Quantum Innovation Challenge 2025, and additional details are provided via the associated link.

[LG-122] FreeMOCA: Memory-Free Continual Learning for Malicious Code Analysis

链接: https://arxiv.org/abs/2605.09664
作者: Zahra Asadi,Haeseung Jeon,Sohyun Han,Md Mahmuduzzaman Kamol,Se Eun Oh,Mohammad Saidur Rahman
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 17 pages, 5 figures, 12 tables

点击查看摘要

Abstract:As over 200 million new malware samples are identified each year, antivirus systems must continuously adapt to the evolving threat landscape. However, retraining solely on new samples leads to catastrophic forgetting and exploitable blind spots, while retraining on the entire dataset incurs substantial computational cost. We propose FreeMOCA, a memory- and compute-efficient continual learning framework for malicious code analysis that preserves prior knowledge via adaptive layer-wise interpolation between consecutive task updates, leveraging the fact that warm-started task optima are connected by low-loss paths in parameter space. We evaluate FreeMOCA in both class-incremental (Class-IL) and domain-incremental (Domain-IL) settings on large-scale Windows (EMBER) and Android (AZ) malware benchmarks. FreeMOCA achieves substantial gains in Class-IL, outperforming 11 baselines on both EMBER and AZ benchmarks. It also significantly reduces forgetting, achieving the best retention across baselines, and improving accuracy by up to 42% and 37% on EMBER and AZ, respectively. These results demonstrate that warm-started interpolation in parameter space provides a scalable and effective alternative to replay for continual malware detection. Code is available at: this https URL.

[LG-123] Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

链接: https://arxiv.org/abs/2605.09649
作者: Ngoc Bui,Hieu Trung Nguyen,Arman Cohan,Rex Ying
类目: Machine Learning (cs.LG)
*备注: A learnable KV eviction method for large language models

点击查看摘要

Abstract:The key-value (KV) cache is a major bottleneck in long-context inference, where memory and computation grow with sequence length. Existing KV eviction methods reduce this cost but typically degrade performance relative to full-cache inference. Our key insight is that full-cache attention is not always optimal: in long contexts, irrelevant tokens can dilute attention away from useful evidence, so selective, learnable eviction can improve generation rather than merely approximate the full cache. We introduce a global retention-based KV eviction method that learns each token’s future utility under a unified memory budget. Lightweight retention gates assign utility scores to cached KV entries, and a shared final scoring projection calibrates these scores across all layers and heads. This enables a single global eviction policy in which tokens from different layers, heads, and modalities compete directly for cache capacity. We further provide theoretical analysis showing that preferentially retaining useful tokens reduces attention dilution, and we justify geometric retention as a query-agnostic proxy for future utility. Across diverse long-context language and vision-language reasoning, and multi-turn dialogue benchmarks, our method substantially reduces KV memory while matching or surpassing full-cache inference. These results suggest that learned, globally calibrated KV eviction is not only a compression technique, but also a mechanism for improving long-context reasoning.
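
A minimal sketch of the budgeted-eviction mechanics described above: score every cached key-value entry and retain only the top-scoring entries under a fixed budget. The utility scores below are random stand-ins; in the paper they come from learned retention gates calibrated by a shared scoring projection across layers and heads, which this sketch does not implement.

```python
import numpy as np

def evict_kv(keys, values, scores, budget):
    # Keep the `budget` cached entries with the highest utility score.
    keep = np.argsort(scores)[-budget:]
    keep.sort()                               # preserve original token order
    return keys[keep], values[keep], keep

T, d, budget = 1024, 64, 256
rng = np.random.default_rng(0)
keys, values = rng.standard_normal((T, d)), rng.standard_normal((T, d))
scores = rng.random(T)                        # stand-in for learned retention scores
k_small, v_small, kept = evict_kv(keys, values, scores, budget)
print(k_small.shape, kept[:5])                # compressed cache and which tokens survived
```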

[LG-124] Plan2Cleanse: Test-Time Backdoor Defense via Monte-Carlo Planning in Deep Reinforcement Learning

链接: https://arxiv.org/abs/2605.09638
作者: Sze-Ann Chen,Zhi-Yi Chin,Kui-Yuan Chen,Chi-Yu Li,Ping-Chun Hsieh
类目: Machine Learning (cs.LG)
*备注: Published in Transactions on Machine Learning Research (TMLR)

点击查看摘要

Abstract:Ensuring the security of reinforcement learning (RL) models is critical, particularly when they are trained by third parties and deployed in real-world systems. Attackers can implant backdoors into these models, causing them to behave normally under typical conditions, but execute malicious behaviors when specific triggers are activated. In this work, we propose Plan2Cleanse, a test-time detection and mitigation framework that adapts Monte Carlo Tree Search to efficiently identify and neutralize RL backdoor attacks without requiring model retraining. Our approach recasts backdoor detection as a planning problem, enabling systematic exploration of temporally extended trigger sequences while maintaining black-box access to the target policy. By leveraging the detection results, Plan2Cleanse can further achieve efficient mitigation through tree-search preventive replanning. We evaluated our method in competitive MuJoCo environments, simulated O-RAN wireless networks, and Atari games. Plan2Cleanse achieves substantial improvements, increasing trigger detection success rates by more than 61.4 percentage points in stealthy O-RAN scenarios and improving win rates from 35% to 53% in competitive Humanoid environments. These results demonstrate the effectiveness of our test-time defense approach and highlight the importance of proactive defenses against backdoor threats in RL deployments. Our implementation is publicly available at this https URL.

[LG-125] Minimal Filling Architectures of Polynomial Neural Networks: Counterexamples Frontier Search and Defects

链接: https://arxiv.org/abs/2605.09609
作者: Kevin Dao,Jose Israel Rodriguez
类目: Machine Learning (cs.LG); Algebraic Geometry (math.AG)
*备注:

点击查看摘要

Abstract:We provide a counterexample to the minimal unimodal conjecture for polynomial neural networks (PNNs) with power activation functions. Fixing the input and output widths, the conjecture states that any minimal filling architecture has unimodal widths for the hidden layers. We found a counterexample via a frontier search and certified it using recursive dimension bounds and symbolic computation. Notably, several subarchitectures of this example exhibit large defect, in contrast with the predominantly small-defect behavior observed in prior examples.

[LG-126] Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training

链接: https://arxiv.org/abs/2605.09608
作者: Yuanyi Wang,Yifan Yang,Su Lu,Yanggan Gu,Pengkai Wang,Wenjun Wang,Zhaoyi Yan,Congkai Xie,Jianmin Wu,Jialun Cao,Shing-Chi Cheung,Hongxia Yang
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:Continual post-training aims to extend large language models (LLMs) with new knowledge, skills, and behaviors, yet it remains unclear when sequential updates enable capability transfer and when they cause catastrophic forgetting. Existing methods mitigate forgetting through sequential fine-tuning, replay, regularization, or model merging, but offer limited criteria for determining when incorporating new updates is beneficial or harmful. In this work, we study LLM continual post-training through three questions: What drives forgetting? When do sequentially acquired capabilities transfer or interfere? How can compatibility be used to control update integration? We address these questions through task geometry: we represent each post-training task by its parameter update and study the covariance geometry induced by the update. Our central finding is that: forgetting can be considered as a state-relative update-integration failure, it arises when the covariance geometries induced by tasks misalign with the geometry of the evolving model state. Sequential updates transfer when they remain compatible with the model state shaped by previous updates, and interfere when state-relative geometry conflict becomes high. Motivated by this finding, we propose Geometry-Conflict Wasserstein Merging (GCWM), a data-free update-integration method that constructs a shared Wasserstein metric via Gaussian Wasserstein barycenters and uses geometry conflict to gate geometry-aware correction. Across Qwen3 0.6B–14B on domain-continual and capability-continual settings, GCWM consistently outperforms data-free baselines, improving retention and final performance without replay data. These results identify geometry conflict as both an explanatory signal for forgetting and a practical control signal for LLM continual post-training.

[LG-127] End-to-End Keyword Spotting on FPGA Using Graph Neural Networks with a Neuromorphic Auditory Sensor

链接: https://arxiv.org/abs/2605.09570
作者: Wiktor Matykiewicz,Piotr Wzorek,Kamil Jeziorek,Tomás Muñoz,Antonio Rios-Navarro,Angel Jiménez-Fernández,Tomasz Kryjak
类目: Machine Learning (cs.LG)
*备注: Accepted for the ARC 2026 conference

点击查看摘要

Abstract:With the rapid growth of mobile robotics and embedded intelligence, there is an increasing demand for efficient on-device data processing on edge platforms. A promising research direction is the use of neuromorphic sensors inspired by human sensory systems, which generate sparse, event-based data encoding changes in the environment. In this work, we present the first end-to-end FPGA implementation of a keyword spotting system that integrates a Neuromorphic Auditory Sensor (NAS) and a graph neural network (GNN) on a single FPGA device, enabling real-time processing of raw audio data. The proposed architecture eliminates conventional signal preprocessing and operates directly on event-based audio streams. Leveraging a compute-near-memory network architecture, the system achieves efficient inference with low latency and low power consumption. Experimental results demonstrate an accuracy of 87.43% after quantization on the Google Speech Commands v2 dataset processed through the neuromorphic sensor, with end-to-end latency below 35 us and average power consumption of 1.12 W. The processed datasets, software models, and hardware modules are available at this https URL.

[LG-128] Online Set Learning from Precision and Recall Feedback

链接: https://arxiv.org/abs/2605.09565
作者: Lee Cohen,Yishay Mansour,Shay Moran,Han Shao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider the problem of learning an unknown subset N_{\text{target}} of a domain in an online setting. In each round t, the learner predicts a set of items N_t and receives one of two types of feedback, each with equal probability: precision feedback, in which a randomly chosen item from the predicted set N_t is revealed and the learner is told whether it belongs to N_{\text{target}} (incurring a reward if it does), or recall feedback, in which a randomly chosen item from the target set N_{\text{target}} is revealed and the learner is told whether it belongs to N_t (incurring a reward if it does). The goal is to maximize the cumulative reward over time. This simple online set learning problem abstracts a variety of learning scenarios with precision- and recall-type feedback. We show that a hypothesis class (a family of subsets of the domain) is learnable in this setting if and only if it has finite Vapnik-Chervonenkis (VC) dimension, mirroring the classical PAC characterization. However, the resulting algorithmic structure is markedly more intricate: in contrast to standard Probably Approximately Correct (PAC) learning – where the algorithmic landscape is governed by the simple principle of Empirical Risk Minimization (ERM) – our partial feedback model can invalidate ERM and even all proper learning rules. We develop algorithms to address the dependencies induced by the feedback, obtaining regret guarantees in both the realizable and agnostic settings. Our results provide a qualitative characterization of learnability in this model, addressing its most basic question, while pointing to a range of natural and intriguing open questions, including the determination of optimal regret rates.
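
A small simulation of the feedback model defined above, to make the two feedback types concrete: each round either checks a random predicted item against the target (precision) or a random target item against the prediction (recall), with a 0/1 reward. The learner's regret-minimizing strategy, which is the paper's actual subject, is not shown; the function name and example sets are illustrative.

```python
import random

def feedback_round(predicted, target, rng):
    # With prob 1/2: precision feedback -- a random predicted item is revealed,
    # reward 1 if it lies in the target set. Otherwise: recall feedback -- a
    # random target item is revealed, reward 1 if the prediction contains it.
    if rng.random() < 0.5:
        item = rng.choice(sorted(predicted))
        return ("precision", item, int(item in target))
    item = rng.choice(sorted(target))
    return ("recall", item, int(item in predicted))

target = {2, 3, 5, 7, 11}      # unknown N_{target}
predicted = {1, 2, 3, 4, 5}    # learner's prediction N_t for this round
print([feedback_round(predicted, target, random.Random(seed)) for seed in range(3)])
```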

[LG-129] When Adaptation Fails: A Gradient-Based Diagnosis of Collapsed Gating in Vision-Language Prompt Learning

链接: https://arxiv.org/abs/2605.09549
作者: Yunxuan Fang,Ziwei Zhang,Xinhe Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Adaptive prompting mechanisms have been proposed to enhance vision-language models by dynamically tailoring prompts to inputs. However, in frozen few-shot prompt learning with CLIP-style backbones, we systematically observe that adaptive gates and prompt-selection modules often collapse: they produce nearly constant outputs, contribute negligible gradient signals, and frequently fail to outperform fixed prompts. To further explore this issue, we present a systematic diagnostic study to uncover the underlying causes and conditions of adaptation failure. Through controlled experiments across datasets and multiple prompt learning architectures, we identify two recurring failure modes: gradient magnitude imbalance and gate degradation. Our findings invite a re-examination of indiscriminately adding architectural complexity in parameter-efficient learning and clarify when prompt-level adaptive gating is, and is not, effective in this regime.

[LG-130] HS-FNO: History-Space Fourier Neural Operator for Non-Markovian Partial Differential Equations

链接: https://arxiv.org/abs/2605.09523
作者: Lennon J. Shikhman
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
*备注: 29 pages, 4 figures, 1 table. Under review. Code at this https URL

点击查看摘要

Abstract:Neural operators provide fast surrogate models for time-dependent partial differential equations, but their standard autoregressive use usually assumes that the instantaneous field u(t,\cdot) is a complete state. This assumption fails for delay equations, distributed-memory systems, and other non-Markovian dynamics: two trajectories may agree at time t and nevertheless have different futures because their histories differ. We introduce the History-Space Fourier Neural Operator (HS-FNO), a neural operator for delay and memory-driven PDEs formulated on the lifted state u_t(\theta,x)=u(t+\theta,x), \theta \in [-\tau,0]. The key computational step is to decompose one history-state update into a learned predictor for the newly exposed future slice and an exact shift-append transport for the portion of the history window already known from the previous state. This avoids learning deterministic history coordinates, reduces the learned output dimension, and enforces the natural discrete history update. We test HS-FNO on five benchmark families covering delayed reaction–diffusion, spatial epidemiology, nonlocal neural-field dynamics, delayed waves, and distributed-memory closures. Across ten random seeds, HS-FNO attains the lowest aggregate one-step, history-space, and rollout errors among the principal baselines. The largest gain occurs in autoregressive prediction, where aggregate rollout error decreases from 0.241, 0.188, and 0.185 for current-state, lag-stack, and unconstrained history-to-history operators, respectively, to 0.094. The same model uses fewer parameters than unconstrained history prediction. These results indicate that enforcing the discrete shift structure of history-state evolution is an effective inductive bias for non-Markovian PDE surrogate modeling.
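
The shift-append decomposition described above can be made concrete in a few lines of NumPy: only the newly exposed slice is produced by a learned map, while the rest of the next history window is an exact copy of the current one. The discretization, array shapes, and placeholder predictor below are assumptions for illustration, not the HS-FNO architecture itself.

```python
import numpy as np

def history_step(history, predict_new_slice):
    # Next history window = exact shift of the known part + one predicted slice.
    new_slice = predict_new_slice(history)            # shape (nx,)
    return np.concatenate([history[1:], new_slice[None, :]], axis=0)

H, nx = 16, 64
history = np.random.default_rng(0).standard_normal((H, nx))
dummy_predictor = lambda h: 0.9 * h[-1]               # placeholder for the learned operator
next_history = history_step(history, dummy_predictor)
assert np.allclose(next_history[:-1], history[1:])    # the overlap is transported exactly
print(next_history.shape)
```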

[LG-131] LLM-Driven Performance-Space Augmentation for Meta-Learning-Based Algorithm Selection

链接: https://arxiv.org/abs/2605.09518
作者: Darren Zhu,Daren Ler
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Meta-learning for algorithm selection relies on a meta-dataset in which each row corresponds to a supervised learning dataset described by meta-features and labelled with a target value that is associated with algorithm choice (typically, some function of algorithm performance). A persistent limitation is that the number of curated real-world datasets is small, resulting in sparse meta-datasets that constrain meta-learner generalisation. In this paper, we address this problem by augmenting the meta-dataset with synthetic regression datasets produced via a large language model (LLM), with generation steered toward target regions of a low-dimensionality performance space. In our experiments, we adopt a two-dimensional geometric setting defined by the cross-validated R^2 scores of two anchor algorithms, known as landmarkers. We compare two augmentation strategies: (1) uniform sampling, which distributes synthetic datasets across the performance space; and (2) margin-based sampling, which concentrates them near the decision boundary where landmarker preference is most ambiguous. Across 42 real-world UCI regression datasets and 730 synthetic datasets, both strategies substantially improve meta-learner performance over the unaugmented baseline under regression and multi-label evaluation formulations. However, uniform augmentation consistently outperforms margin-based augmentation, achieving a 17.47% relative reduction in Hamming loss, a 100.41% relative improvement in subset accuracy, and a +6.09% relative gain in pooled out-of-fold R^2. These results lead us to postulate a central thesis: the performance of algorithms resides on a low-dimensional performance manifold, whose reconstruction bias may be minimised by user-guided LLMs that seek to maximise uniform \epsilon-cover, and consequently, lead to improved meta-learning for algorithm selection.

[LG-132] Doubly Robust Proxy Causal Learning with Neural Mean Embeddings

链接: https://arxiv.org/abs/2605.09514
作者: Bariscan Bozkurt,Alexandre Galashov,Dimitri Meunier,Zikai Shen,Arthur Gretton,Houssam Zenati
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Unobserved confounding prevents standard covariate adjustment from identifying causal response functions in observational studies. Proxy causal learning addresses this problem through bridge equations involving treatment- and outcome-inducing proxies, avoiding direct recovery of the latent confounder. Existing doubly robust proxy estimators combine outcome and treatment bridges, but typically rely on fixed kernels, sieves, or low-dimensional semiparametric models; existing neural proxy methods are more flexible, but are largely single-bridge estimators. We develop a neural doubly robust framework for proxy causal learning with continuous and structured treatments. Our method introduces a neural mean-embedding estimator for the treatment bridge, combines it with a neural outcome bridge, and estimates the doubly robust correction through a final regression stage. The framework covers population, heterogeneous, and conditional dose-response functions, yielding full response-curve estimators rather than binary-treatment effects. The algorithms use two stages for each bridge and history-aware updates of the final linear layers to stabilize stochastic multi-stage training. We prove consistency of the algorithms showing that the doubly robust error is controlled by the final averaging and regression errors together with the smaller of the outcome- and treatment-side weak-norm bridge errors. Across synthetic and image-valued benchmarks, the proposed estimators outperform existing baselines and single-bridge neural estimators, showing the benefit of combining learned outcome and treatment bridges in a doubly robust construction. Our implementation is available at this https URL.

[LG-133] Kintsugi: Learning Policies by Repairing Executable Knowledge Bases

链接: https://arxiv.org/abs/2605.09487
作者: Teng Cao,Yu Deng,Hikaru Shindo,Quentin Delfosse,Lanxi Wen,Suli Wang,Jannis Blüml,Christopher Tauchmann,Kristian Kersting
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modern embodied agents achieve impressive performance, but their task knowledge is often stored in neural weights, latent state, or prompt-bound memory, making individual policy knowledge difficult to inspect, validate, recombine, and reuse. We introduce Kintsugi, a white-box policy-learning framework that treats embodied policy improvement as verifier-gated construction of a typed executable Knowledge Base (KB). Kintsugi represents task-level policy knowledge as composable typed entries – predicates, operators, policy schemas, monitors, recovery rules, experience records, and goals – and improves this artifact through localized typed edits induced from rollout evidence, rather than relying on test-time language-model reasoning. Between rollouts, a tool-constrained agentic editing loop diagnoses trajectory failures, localizes them to editable KB layers, and proposes candidate edits. A deterministic verification gate admits an edit only when the candidate type-checks, the resulting KB executes, and focused validation success or trajectory-health metrics improve without violating protected-regression checks. At inference, the accepted KB is executed by a deterministic symbolic executor with zero LLM calls. Across long-horizon text-agent benchmarks and representative object-centric manipulation settings, Kintsugi achieves strong endpoint performance while preserving inspectability, local editability, and verifier-gated deployment. These results suggest that embodied policy improvement can be organized around executable task knowledge.

[LG-134] SEMASIA: A Large-Scale Dataset of Semantically Structured Latent Representations

链接: https://arxiv.org/abs/2605.09485
作者: Mario Edoardo Pandolfo,Enrico Grimaldi,Lorenzo Marinucci,Leonardo Di Nino,Simone Fiorellino,Sergio Barbarossa,Paolo Di Lorenzo
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Latent representations learned by neural networks often exhibit semantic structure, where concept similarity is reflected by geometric proximity in embedding space. However, comparing such spaces across models remains difficult: changes in architecture, pretraining data, objective, or random seed can yield embeddings with similar content but incompatible geometry. This latent space alignment problem is central to interpretability, transfer and multimodal learning, federated systems, and semantic communication; however, progress remains limited by the lack of large-scale, model-diverse, and metadata-rich benchmarks. To address this gap, we introduce SEMASIA, a large-scale collection of latent representations extracted from approximately 1,700 pretrained vision models across eight standard image-classification benchmarks. SEMASIA pairs embeddings with structured metadata describing architectures, training regimes, pretraining sources, and model scale. We demonstrate three applications of the resource. First, we analyze the conceptual organization of individual latent spaces, showing consistent prototype-like clustering and hierarchical semantic neighborhoods across models and datasets. Second, we benchmark supervised alignment mappings between latent spaces using reconstruction error and downstream task performance. Third, we perform a large-scale regression analysis of how pretraining-data complexity, specialization, transfer learning, augmentation, and model scale relate to geometric and probing properties of embeddings. By coupling representational scale with standardized metadata, SEMASIA provides a reproducible foundation for studying latent geometry, evaluating alignment methods, and developing next-generation heterogeneous and interoperable AI systems.

[LG-135] Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases

链接: https://arxiv.org/abs/2605.09472
作者: Daniel Wolfson,Tal Wagner
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:Positional encoding in transformers is commonly implemented through positional embeddings, attention masks, or bias terms, but formal connections between these mechanisms remain limited. We study attention with positional bias through the lens of locality-sensitive hashing (LSH), focusing on Attention with Linear Biases (ALiBi). We show that the ALiBi bias matrix is the expectation of contiguous block-diagonal binary masks induced by a "positional LSH" scheme. The empirical mean of masks sampled from this scheme yields spectral norm and max-norm approximation guarantees with bounded block sizes with high probability. This structural theorem implies a uniform approximation theorem for ALiBi-biased attention: with high probability over the sampled masks, the approximate attention output is accurate simultaneously for all query-key-value inputs and can be computed in near-linear time in the context length, reducing long-context ALiBi to a collection of randomized short-context regular (positionally unbiased) attention operations. Conceptually, this connects positional bias, masks, and positional embeddings in a single formal framework and suggests an approach to efficient ALiBi-biased attention. Experiments on large language models validate our theoretical findings.
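
To ground the objects in the abstract: the first helper below builds the standard causal ALiBi bias matrix, and the second samples a contiguous block-diagonal 0/1 mask of the kind whose expectation the paper relates to that bias. The geometric block-length sampler is an illustrative stand-in, not the paper's positional-LSH construction, and no claim is made that this particular sampler reproduces the expectation result.

```python
import numpy as np

def alibi_bias(T, slope):
    # Standard causal ALiBi bias: -slope * (i - j) for key j <= query i, -inf otherwise.
    i, j = np.meshgrid(np.arange(T), np.arange(T), indexing="ij")
    bias = -slope * (i - j).astype(float)
    bias[j > i] = -np.inf
    return bias

def random_contiguous_block_mask(T, mean_block, rng):
    # Illustrative sampler: cut the sequence into contiguous blocks with
    # geometric lengths; tokens in the same block may attend to each other.
    cuts = np.cumsum(rng.geometric(1.0 / mean_block, size=T))
    block_id = np.searchsorted(cuts, np.arange(T), side="right")
    return (block_id[:, None] == block_id[None, :]).astype(float)

rng = np.random.default_rng(0)
T = 8
print(alibi_bias(T, slope=0.5))
print(random_contiguous_block_mask(T, mean_block=3, rng=rng))
```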

[LG-136] Learning to Bid with Unknown Private Values in Budget-Constrained First-Price Auctions

链接: https://arxiv.org/abs/2605.09448
作者: Zihao Hu,Yuxiao Wen,Yuan Yao,Jiheng Zhang,Zhengyuan Zhou
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The transition to First-Price Auctions (FPA) in digital advertising has spurred significant research, yet existing work typically assumes access to a valuation oracle, ignoring the reality that values must be inferred from censored data. While Linear Treatment Effect (LTE) models address this by learning value uplift, they have not been adapted to realistic settings with hard Budget constraints or Return-on-Spend (RoS) targets requiring regret and violation control. In this work, we propose a unified primal-dual framework for constrained FPAs that jointly learns the latent LTE valuation parameters and the competitor’s bid distribution. This simultaneous learning introduces a critical technical challenge: the estimation error is dynamically scaled by the Lagrangian multiplier, potentially leading to unbounded regret. We resolve this by leveraging a strong Slater condition and a novel adaptive burn-in procedure to stabilize the dual variables. Our approach achieves near-optimal regret guarantees, providing the first theoretically grounded solution for constrained bidding with latent valuations.
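
As background for the primal-dual language in the abstract, the sketch below runs textbook dual-based pacing in a budget-constrained first-price auction: the bid is shaded by a multiplier that is updated by dual ascent on the spend constraint. This is a generic baseline, not the paper's algorithm; in particular it assumes the value is observed directly, whereas the paper's setting requires learning the latent LTE valuation and the competitor distribution online.

```python
import numpy as np

def dual_paced_bid(value, lam):
    # Textbook dual-based bid shading: a larger multiplier shades the bid harder.
    return value / (1.0 + lam)

rng = np.random.default_rng(0)
T, budget = 5000, 400.0
rho = budget / T                   # per-round spend target
lam, eta, spend = 0.0, 0.01, 0.0
for t in range(T):
    value = rng.uniform(0, 1)      # observed here; latent in the paper's setting
    competitor = rng.uniform(0, 1) # highest competing bid
    bid = dual_paced_bid(value, lam)
    cost = bid if bid > competitor else 0.0        # first price: pay your bid on a win
    spend += cost
    lam = max(0.0, lam + eta * (cost - rho))       # dual ascent on the budget constraint
print(f"total spend {spend:.1f} vs budget {budget}, final lambda {lam:.2f}")
```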

[LG-137] Inverse Design for Conditional Distribution Matching

链接: https://arxiv.org/abs/2605.09439
作者: Ori Meidler,Shaul Tolkovsky,Or Zuk
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Generative models are powerful tools for sampling from a learned distribution \mathcal{P}(Y \mid X), and inverse-design methods invert this map to find an input x that produces a desired point output y^*. However, many design goals are naturally distributional rather than pointwise, incorporating the inherent uncertainty of Y and targeting a specific form for it, a task not addressed by standard inverse design. To address this issue we introduce Conditional Distribution Matching (CDM), a new inverse-design problem class in generative modeling: given a joint distribution \mathcal{P}(X, Y) and a target distribution \mathcal{G}(Y), find an input x^* whose induced conditional distribution \mathcal{P}(Y \mid X = x^*) matches \mathcal{G}. We formally define two variants: Conditional Distribution Matching Sampling (CDMS) and Conditional Distribution Matching Optimization (CDMO). To solve these problems, we propose MLGD-F (Matching-Loss Guided Diffusion with a Fast inner sampler), a plug-and-play inference-time algorithm that combines a pretrained score-based diffusion model with a pretrained fast conditional sampler, requiring no additional training or fine-tuning. By leveraging single-step conditional sampling, MLGD-F enables tractable gradient computation, making the estimation of \mathcal{P}(Y \mid X) both memory-efficient and computationally lightweight. We validate MLGD-F on synthetic benchmarks, structured image transformations, and generative editing optimization, demonstrating reliable recovery of inputs whose conditional distributions match diverse user-specified targets, including discrete mixtures and continuous low-rank supports.

[LG-138] fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery

链接: https://arxiv.org/abs/2605.09438
作者: Andreas D. Demou,Panagiotis Koromilas,James Oldfield,Yannis Panagakis,Mihalis A. Nicolaou
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many features in pretrained Transformers span multiple layers: they emerge through stages of inference, persist in the residual stream, or are built jointly by parallel MLPs. Crosscoders (namely, sparse dictionaries trained jointly across layers) aim to recover these cross-layer features in a single shared latent space. We show that standard crosscoders largely fail at this purpose. Although their decoder weight norms spread evenly across layers, a functional coherence metric we introduce reveals that each latent’s activation is effectively driven by only one or two layers on average. While functionally coherent latents act as human-interpretable concept detectors (e.g., US states and cities), the layer-localized latents that crosscoders predominantly learn collapse onto surface-level patterns such as digit detectors. We trace this failure to two structural limitations: unconstrained cross-layer parameterization and unregularized cross-layer dependence. We address both by introducing fmxcoders, which (i) replace the encoder and decoder with low-rank tensor factorizations that draw every latent’s per-layer weights from a shared cross-layer basis, and (ii) apply stochastic layer masking, a denoising regularizer along the layer axis that penalizes latents whose contribution collapses when a single layer is masked. Across GPT2-Small, Pythia-410M, Pythia-1.4B, and Gemma2-2B, fmxcoders lift mean probing F1 by 10-30 points, surpassing per-layer SAE baselines that standard crosscoders fail to reach, reduce reconstruction MSE by 25-50%, and roughly double mean functional coherence. An LLM-as-a-judge evaluation further shows that fmxcoders recover 3-13× more semantically coherent latents than standard crosscoders across all four base LLMs.

[LG-139] FedCIGAR: A Personalized Reconstruction Approach for Federated Graph-level Anomaly Detection IJCAI2026

链接: https://arxiv.org/abs/2605.09428
作者: Yunfeng Zhao,Yixin Liu,Qingfeng Chen,Shiyuan Li,Yue Tan,Shirui Pan
类目: Machine Learning (cs.LG)
*备注: Accepted by IJCAI 2026

点击查看摘要

Abstract:Graph-level anomaly detection (GLAD) is crucial for ensuring the reliability of graph-driven applications by identifying abnormal graphs that deviate from the majority. Considering the privacy concerns in distributed scenarios, federated graph-level anomaly detection (FedGLAD) has emerged as a promising solution to enable collaborative detection without sharing raw data. However, existing methods suffer from poor generalization due to the reliance on unrealistic synthetic anomalies and insufficient personalization capabilities under data heterogeneity. To address these challenges, we propose a novel Federated graph-level anomaly detection approach with Cluster-adaptIve GAted Reconstruction (FedCIGAR). Specifically, we design a reconstruction-based paradigm trained on normal graphs to avoid synthetic data. Furthermore, we introduce a client-side node contribution gating mechanism and a server-side sliding window-based clustering strategy to tackle data heterogeneity. Extensive experiments demonstrate that FedCIGAR achieves superior performance and robustness in contrast to state-of-the-art methods.

[LG-140] Tabular Foundation Model for Generative Modelling

链接: https://arxiv.org/abs/2605.09424
作者: Xiangjian Jiang,Mingxuan Liu,Nikola Simidjievski,Tassilo Klein,Mateja Jamnik
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generative modelling is a demanding test of foundation models, because it requires robust, holistic representation learning for a given data modality, rather than optimisation for a supervised prediction target alone. While recent work on tabular foundation models has achieved remarkable progress in predictive modelling, generative tabular foundation models remain underexplored. Existing tabular foundation generators, in particular, have not yet consistently matched strong dataset-specific generators in synthetic data quality. A key reason is their misalignment with the distinctive causal structural prior of heterogeneous tabular data. In this paper, we address this gap by introducing a novel tabular foundation model, TabFORGE, built on pretrained Tabular FOundational Representations for GEneration. TabFORGE is designed to utilise the implicitly learned causal information underlying diverse tabular datasets in a unified latent space induced by a pretrained causality-aware feature encoder. It further decouples latent modelling from decoding through a two-stage design: we first pretrain a score-based diffusion transformer, and then pretrain a denoising-aligned decoder using the denoised latent embeddings. This design elegantly mitigates the distribution shifts in latent embeddings that typically arise between training and inference. We evaluate TabFORGE comprehensively against 22 benchmark methods on 45 real-world datasets. Our results show that TabFORGE effectively learns and leverages generalisable tabular representations, enabling efficient generation of high-quality synthetic tabular data, particularly with strong structural fidelity.

[LG-141] A Controlled Diagnostic Study of Hardware-Induced Distortions in Hardware-Aware Training

链接: https://arxiv.org/abs/2605.09416
作者: Yunxuan Fang,Xinhe Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Hardware-aware training (HAT) is widely used to improve the robustness of neural networks on non-ideal AI accelerators, such as analog in-memory computing (IMC) systems. However, not all hardware-induced distortions are equally compensable by training. This paper presents a diagnostic framework that models hardware non-idealities as structured perturbations of the forward operator and evaluates their compatibility with gradient-based optimization. We analyze six representative perturbation classes–read noise, variability, drift, stuck-at faults, IR-drop, and ADC discretization–and identify three key diagnostics: gradient expectation consistency, bounded gradient variance, and non-degenerate sensitivity. Our results show a clear separation between perturbations that can be compensated by HAT and those that consistently break optimization. This provides practical guidance for hardware-software co-design, clarifying which non-idealities can be addressed at the training level and which require circuit-, architecture-, or calibration-level mitigation. This study should be interpreted as a controlled empirical analysis under vanilla forward-perturbation HAT, rather than as a universal theory of hardware-aware training.

[LG-142] GravityGraphSAGE: Link Prediction in Directed Attributed Graphs

链接: https://arxiv.org/abs/2605.09408
作者: Riccardo Porcedda,Francesca Chiaromonte,Fabrizio Lillo,Andrea Vandin
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Link prediction (inferring missing or future connections between nodes in a graph) is a fundamental problem in network science with widespread applications in, e.g., biological systems, recommender systems, finance and cybersecurity. The ability to accurately predict links has significant real-world applications, such as detecting fraudulent financial transactions or identifying drug-target interactions in biomedicine. Despite a rich literature, link prediction is still challenging, especially for graphs enriched with information on edges (direction) and nodes (attributes). In fact, research on link prediction, especially the one based on Graph Deep Learning (GDL), has mostly focused on undirected graphs, without fully leveraging node attributes. Here, we fill this gap by proposing Gravity-GraphSAGE (GG-SAGE), a modified version of GraphSAGE, a GDL model for node embeddings, composed of a gravity-inspired decoder. This implementation is the first example in the literature of a GraphSAGE backbone adopted for directed link prediction. Using the benchmark datasets Cora, Citeseer, PubMed and 16 real-world graphs from the online Netzschleuder repository, we show that our proposed model outperforms state-of-the-art GDL link prediction techniques. Using further experimental evidence, we relate the quality of the output of our model with various characteristics of the graph, suggesting that our framework scales well when applied to data of increasing complexity.
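
The abstract does not spell out the exact decoder, but gravity-inspired decoders for directed link prediction are commonly formulated as a sigmoid of a target-node "mass" minus a scaled log of the squared embedding distance. The numpy sketch below shows that generic formulation only, not GG-SAGE itself; the GraphSAGE encoder, the learned mass parameters, and the training loop are all omitted, and the variable names are illustrative.

```python
import numpy as np

def gravity_decoder(z, m, lam=1.0, eps=1e-8):
    """Score every directed edge i -> j from node embeddings.

    z   : (n, d) array of node embeddings (e.g. from a GraphSAGE-style encoder)
    m   : (n,) array of per-node "mass" scalars
    lam : trade-off between the target node's mass and the squared distance
    Returns an (n, n) matrix of edge probabilities; it is asymmetric in general,
    so the score of i -> j can differ from that of j -> i.
    """
    # squared pairwise distances ||z_i - z_j||^2
    sq_dist = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    logits = m[None, :] - lam * np.log(sq_dist + eps)  # mass of the *target* node j
    return 1.0 / (1.0 + np.exp(-logits))

# toy usage with random embeddings and masses
rng = np.random.default_rng(0)
z = rng.normal(size=(5, 16))
m = rng.normal(size=5)
probs = gravity_decoder(z, m)
print(probs.shape)  # (5, 5); probs[i, j] != probs[j, i] in general
```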

[LG-143] D2ACE: Multi-Label Batch Selection Guided by Dual Dynamics and Adaptive Correlation Enhancement

链接: https://arxiv.org/abs/2605.09400
作者: Bin Liu,Haoyu Peng,Zhijia Wei,Jiajing Zhang,Grigorios Tsoumakas
类目: Machine Learning (cs.LG)
*备注: 18 pages

点击查看摘要

Abstract:Batch selection is crucial for improving both training efficiency and predictive performance in deep multi-label classification (MLC). Existing batch selection methods typically rely on a single metric to assess instance importance and use static label weights to distinguish label significance, neglecting the dynamic evolution of metric utility and label significance during training. In addition, the method that explicitly exploits label correlations is largely affected by abundant irrelevant labels and insensitive to local label distributions. To address these issues, we propose D2ACE, a novel multi-label batch selection method guided by Dual Dynamics and Adaptive Correlation Enhancement. D2ACE explicitly captures metric and label-level training dynamics by combining stage-wise Bernoulli mixture sampling, which balances uncertainty and noise-resistant hardness, with dynamic label weighting to recalibrate label priorities at each epoch based on current metric statistics. Furthermore, D2ACE introduces a local context-aware correlation enhancement to focus on relevant labels with instance-adaptive dependencies. Extensive experiments on tabular and image benchmarks demonstrate that D2ACE outperforms existing batch selection approaches across various deep MLC models, achieving stronger predictive performance and more efficient correlation modeling.

[LG-144] Universal Feature Selection with Noisy Observations and Weak Symmetry Conditions

链接: https://arxiv.org/abs/2605.09396
作者: Dier Tang(1),Guangyue Han(1) ((1) Department of Mathematics, The University of Hong Kong, Hong Kong, China)
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 6 pages, 0 figures. This work has been submitted to the 2026 IEEE Information Theory Workshop (ITW) for possible publication

点击查看摘要

Abstract:This paper relaxes the restrictive symmetry conditions adopted in [4], [5] and extends their universal feature selection framework to accommodate noisy observations as well as attribute structures that may exhibit directional preferences. We introduce the notion of weak spherical symmetry, quantified by second-moment distances, which allows controlled deviations from rotational invariance. Under this relaxed condition, we develop a universal feature selection framework based on the singular value decomposition of the canonical dependence matrix computed from noisy data. Our main result shows that the selected features achieve asymptotically optimal error exponents up to a residual term that depends on the symmetry deviation \delta and the noise levels \eta_1, \eta_2 . When \delta, \eta_1, \eta_2 are relatively small, our result recovers that of [5], thereby demonstrating that exact spherical symmetry is unnecessary. Overall, our findings highlight the robustness of the selection framework against second-moment deviations and observation noise, thereby broadening its applicability across diverse inference tasks and providing a theoretically grounded tool for universal feature selection in practical scenarios.

[LG-145] Multi-scale Predictive Representations for Goal-conditioned Reinforcement Learning

链接: https://arxiv.org/abs/2605.09364
作者: Valliappan Chidambaram Adaikkappan,David Meger,Sai Rajeswar,Pietro Mazzaglia
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper investigates robust representation learning in offline goal-conditioned reinforcement learning (GCRL). Particularly in sparse reward scenarios, learning representations that align state and goal latents is a challenge that frequently culminates in representation divergence where the encoder drifts toward a low-dimensional, goal-agnostic subspace that destabilizes policy learning. We address this issue by showing that an agent must acquire a fundamental understanding of its environment across multiple scales, from local physical dynamics to long-horizon goal-directed structure. Building on this insight, we propose this http URL, a framework that leverages multi-scale predictive supervision to enforce goal-directed alignment within the latent space. We demonstrate that this http URL leads to improved representation quality and strong performance on both vision and state-based tasks. Furthermore, we show that our approach is exceptionally resilient under realistic, challenging data regimes, maintaining state-of-the-art performance across a wide variety of tasks, trajectory stitching scenarios, and extreme noise conditions.

[LG-146] Near-Optimal Last-Iterate Convergence for Zero-Sum Games with Bandit Feedback and Opponent Actions

链接: https://arxiv.org/abs/2605.09363
作者: Soumita Hait,Ping Li,Haipeng Luo,Mengxiao Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Last-iterate convergence of learning dynamics in games has attracted significant recent attention. In two-player zero-sum games with bandit feedback, where only the loss of the selected action pair is observed, Fiegel et al. (2025) show a separation between average-iterate and last-iterate convergence in duality gap: while the optimal t^{-1/2} rate after t rounds is achievable for the former via standard no-regret algorithms, the latter cannot converge faster than t^{-1/3} in expectation or t^{-1/4} with high probability. However, in many practical settings, such as preference learning, the players observe not only their loss but also the opponent’s action. This raises a natural question: can such additional information enable faster last-iterate convergence? We answer this question affirmatively, showing that t^{-1/2} last-iterate convergence is achievable with high probability in this setting, via an efficient algorithm that updates its strategy infrequently by solving an estimated log-barrier-regularized game. We identify fundamental obstacles preventing standard analysis for multi-armed bandits, the single-player case, from generalizing to games, and develop a novel analysis to overcome them. Experiments confirm that our algorithm indeed converges faster than naive baselines and prior methods that do not exploit opponent-action feedback. Finally, we note that our results also improve those for dueling bandits, a special case with skew-symmetric game matrices.

[LG-147] Split CNN Inference on Networked Microcontrollers

链接: https://arxiv.org/abs/2605.09357
作者: Junyu Lu,Shashwath Suresh,Hao Liu,Qi Hong,Qing Wang
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 10 pages

点击查看摘要

Abstract:Running deep neural networks on microcontroller units (MCUs) is severely constrained by limited memory resources. While TinyML techniques reduce model size and computation, they often fail in practice due to excessive peak Random Access Memory (RAM) usage during inference, dominated by intermediate activations. As a result, many models remain infeasible on standalone MCUs. In this work, we present a fine-grained split inference system for networked MCUs that enables collaborative inference of Convolutional Neural Networks (CNN) models across multiple devices. Our key insight is that breaking the memory bottleneck requires splitting inference at sub-layer granularity rather than at layer boundaries. We reinterpret pre-trained models to enable kernel-wise and neuron-wise partitioning, and distribute both model parameters and intermediate activations across multiple MCUs. A lightweight, resource-aware coordinator orchestrates the inference across MCU devices with heterogeneous resources. We implement the proposed system on a real testbed and evaluate it on up to 8 MCUs using MobileNetV2, a representative CNN model. Our experimental results show that CNN models infeasible on a single MCU can be executed across networked MCUs, reducing the per-MCU peak RAM usage while maintaining the practical end-to-end inference latency. All the source code of this work can be found here: this https URL.
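
As a minimal illustration of why kernel-wise (output-channel) partitioning is exact, the numpy sketch below splits a convolution's filters across two hypothetical devices and checks that concatenating the partial outputs reproduces the full result. It is only a toy demonstration of the partitioning idea; the paper's coordinator, memory accounting, activation splitting, and neuron-wise partitions are not modeled.

```python
import numpy as np

def conv2d(x, w):
    """Naive valid 2-D convolution.  x: (C_in, H, W), w: (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    h_out, w_out = x.shape[1] - k + 1, x.shape[2] - k + 1
    y = np.zeros((c_out, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[:, i:i + k, j:j + k]                  # (C_in, k, k)
            y[:, i, j] = (w * patch).reshape(c_out, -1).sum(axis=1)
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8, 8))
w = rng.normal(size=(16, 3, 3, 3))

# "Device" 0 holds the first 8 filters, "device" 1 the remaining 8;
# each computes a partial output and a coordinator concatenates the channels.
y_full  = conv2d(x, w)
y_split = np.concatenate([conv2d(x, w[:8]), conv2d(x, w[8:])], axis=0)
print(np.allclose(y_full, y_split))  # True: kernel-wise splitting is exact
```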

[LG-148] Function-Space ADMM for Decentralized Federated Learning: A Control Theoretic Perspective

链接: https://arxiv.org/abs/2605.09356
作者: Akihito Taya,Yuuki Nishiyama,Kaoru Sezaki
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: © 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

点击查看摘要

Abstract:Decentralized federated learning (FL) is a promising approach for training machine learning models on sensor networks, Internet of Things (IoT) devices, and other edge systems where no central server exists. While federated learning offers advantages such as preserving data privacy, it often suffers from non-independent and identically distributed (IID) data distributions across devices, which cause significant performance degradation. This issue is particularly severe when directly optimizing model parameters, because neural network training is inherently non-convex and standard convergence guarantees for convex optimization do not apply. Unlike existing decentralized FL methods that primarily operate in parameter space, we propose federated function-space alternating direction method of multipliers (FedF-ADMM). FedF-ADMM exploits the convexity of loss functionals within function space to derive alternating direction method of multipliers (ADMM)-based update directions, which are subsequently projected onto the parameter space via knowledge distillation. We further introduce a stabilization coefficient to enhance robustness under severe non-IID settings and analyze its behavior from a control-theoretic perspective by interpreting it as a proportional-integral (PI) term. Experiments under challenging non-IID scenarios, including settings where each device has data from only a single label, demonstrate that FedF-ADMM achieves faster and more stable convergence than existing decentralized FL methods, while attaining higher accuracy and better consensus among devices.

[LG-149] FLAME: Adaptive Mixture-of-Experts for Continual Multimodal Multi-Task Learning

链接: https://arxiv.org/abs/2605.09355
作者: Xing Han,Shravan Chaudhari,Tanvi Ranade,Rama Chellappa,Suchi Saria
类目: Machine Learning (cs.LG)
*备注: 37 pages, 25 figures, 6 tables

点击查看摘要

Abstract:Real-world model deployment across multiple domains requires multimodal models to operate under two complementary regimes: (1) multi-task pretraining, tasks are co-available at design time where related tasks could borrow representational strength from one another, (2) continual adaptation, in which new tasks emerge after deployment with previously unseen modality combinations. However, neither regime alone suffices: the pretraining task set is never exhaustive, while bypassing joint training forfeits the transfer gains and efficiency among co-trainable tasks. Sparse Mixture-of-Experts (MoE) is a natural fit for this dual requirement: sparse activation enables modular capacity expansion as new tasks arrive, while routing decouples modality-level computation from task-level composition. In this work, we propose a scalable MoE framework for multitask pretraining and continual learning across flexible modality combinations. The framework is designed to support training on multimodal tasks with diverse modality configurations by leveraging modality-specific routers that process tokens from each modality across tasks. Furthermore, it enables continual learning over sequential multimodal tasks within a fixed-capacity MoE by compressing accumulated expert knowledge into low-rank memory subspaces, while expanding only the lightweight routers. We validate the effectiveness of our method on multiple healthcare multimodal benchmarks. It demonstrates competitive multitask pretraining performance while alleviating catastrophic forgetting and improving parameter efficiency.

[LG-150] Selection Plateau and a Sparsity-Dependent Hierarchy of Pruning Features

链接: https://arxiv.org/abs/2605.09345
作者: Guangqi Li,Yongxin Li
类目: Machine Learning (cs.LG)
*备注: 22 pages, 3 figures, 5 tables. Empirical study + framework hypothesis on ViT-Small/CIFAR-10. Cross-domain validation (vision token pruning, KV cache compression, MoE routing) and cross-architecture extensions deferred to follow-up work

点击查看摘要

Abstract:We identify a Selection Plateau phenomenon in one-shot neural network pruning: all rank-monotone weight scorers converge to identical accuracy at fixed sparsity, independent of functional form. We propose the Sparsity-Information-Complexity Spectrum (SICS) hypothesis: a sparsity-dependent minimum feature complexity \kappa(S) governs plateau escape, with \kappa=0 sufficient at low sparsity (S < 0.65), \kappa=1 dominant at critical sparsity (S ≈ 0.7), and \kappa=2 necessary at extreme sparsity (S > 0.75). On ViT-Small/CIFAR-10, testing nine feature classes across four sparsities, smooth non-monotone features provide +6.6% escape at S=0.7, while only raw features with high-frequency wiggle escape at S=0.8 (+2.6%). A fake non-monotone scorer underperforms the gradient baseline, indicating the requirement is magnitude-independent non-monotonicity. A handcrafted Gaussian bump achieves only +0.006 escape vs. chaos-derived +0.046, indicating rank-alignment is necessary but insufficient. SICS provides a unifying explanation for the performance clustering of diverse pruning methods and suggests that future selection algorithms should adapt feature complexity to target sparsity.

[LG-151] Adversary-Robust Learning from Fully Asynchronous Directional Derivative Estimates

链接: https://arxiv.org/abs/2605.09337
作者: Anik Kumar Paul,Nibedita Roy,Nagesh Talagani,Swetha Ganesh,Gugan Thoppe,Alexandre Reiffers-Masson
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We propose FAR-SIGN (Fully Asynchronous Robust optimization via SIGNed directional projections) for adversary-resilient learning in parameter-server–worker systems. FAR-SIGN achieves robustness through sign-based updates along carefully designed directions and mitigates the resulting bias via a two-timescale mechanism. It admits both first-order and zeroth-order implementations and enables fully asynchronous execution without requiring a private reference dataset at the server. We establish almost-sure convergence of FAR-SIGN to the set of stationary points for smooth, nonconvex objectives. Moreover, we prove the near-optimal rate of O(n^{-1/4+\epsilon}) in the first-order setting and the standard O(n^{-1/6+\epsilon}) in the zeroth-order setting, where n is the iteration count and \epsilon > 0 can be chosen arbitrarily small. Experiments on MNIST show that FAR-SIGN outperforms robust aggregation-based methods in both accuracy and wall-clock time.

[LG-152] Functional Graphs for Predicting and Explaining Goal Failure in Sparse Goal-Conditioned RL

链接: https://arxiv.org/abs/2605.09335
作者: Shalley Dash
类目: Machine Learning (cs.LG)
*备注: 9 pages main, 21 pages appendx, 2 figures in main. 8 figures in appendix, Submitted to a conference

点击查看摘要

Abstract:Sparse goal-conditioned reinforcement learning can produce policies whose failures are hidden by aggregate success rates. We analyze trained goal-conditioned value policies through the deterministic functional graphs induced by greedy evaluation: for each goal, every state maps to a single successor, decomposing behavior into attractors and basins. This reveals a local-to-global structure in learned policies. We define local goal support (LGS), a one-step statistic measuring the fraction of valid neighboring states whose greedy successor is the goal. In deterministic sparse GridWorlds, zero LGS exactly precludes goal entry from non-goal starts. Empirically, weak LGS is a strong diagnostic of goal-level failure across update rules, curricula, larger grids, and bottleneck geometries: the fixed rule LGS ≤ 0.5 identifies low-success goals with precision 0.921, recall 0.929, and F1 0.925 in the main 8x8 TD setting, with similar performance across variants. However, local support is not sufficient for global success: some supported goals still fail because distant states are captured by competing attractors or fragmented basin structure. We therefore introduce a compact post-hoc taxonomy of policy-induced graphs – goal-dominant, competitor-dominated, partial/contested, and fragmented – to characterize residual failure modes beyond local support. These results show that sparse GCRL failures can be understood as structured policy-induced dynamics, and that local one-step policy structure provides a cheap post-training diagnostic for goal-level failure.
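
A minimal sketch of the LGS statistic on a toy deterministic gridworld, assuming a tabular goal-conditioned value function and a greedy successor rule; the function names and the random value table below are illustrative only and do not reproduce the paper's training setup.

```python
import numpy as np

def local_goal_support(goal, neighbors_fn, greedy_next_fn):
    """Fraction of the goal's valid neighbours whose greedy successor is the goal.

    neighbors_fn(goal)      -> list of valid neighbouring states
    greedy_next_fn(s, goal) -> the single greedy successor of s for this goal
    """
    nbrs = neighbors_fn(goal)
    if not nbrs:
        return 0.0
    hits = sum(1 for s in nbrs if greedy_next_fn(s, goal) == goal)
    return hits / len(nbrs)

# toy 4x4 gridworld with a random tabular value table V[goal][state]
size = 4
states = [(r, c) for r in range(size) for c in range(size)]
rng = np.random.default_rng(0)
V = {g: {s: rng.random() for s in states} for g in states}

def neighbors(s):
    r, c = s
    cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [p for p in cand if 0 <= p[0] < size and 0 <= p[1] < size]

def greedy_next(s, goal):
    # move to the neighbouring state with the highest goal-conditioned value
    return max(neighbors(s), key=lambda p: V[goal][p])

g = (2, 2)
print(local_goal_support(g, neighbors, greedy_next))  # a value in {0, 0.25, 0.5, 0.75, 1.0}
```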

[LG-153] Dimension-Free Saddle-Point Escape in Muon

链接: https://arxiv.org/abs/2605.09331
作者: Yanlin Long,Yufei Gu,Zeke Xie
类目: Machine Learning (cs.LG)
*备注: 33 pages, 5 figures. Preprint

点击查看摘要

Abstract:Modern Large Language Model (LLM) training is fundamentally bottlenecked by pathologically flat saddle points in extreme high-dimensional landscapes. Motivated by this challenge, we analyze the saddle-point escape dynamics of the emerging Muon optimizer, demonstrating its resilience against the \mathcal{O}(D) dimensional curse that severely traps element-wise adaptive optimizers like AdamW. By extending generalized matrix perturbation theory, we develop a theoretical framework to capture Muon’s non-equilibrium optimization trajectories. This theoretical machinery mathematically proves that Muon elegantly bypasses the dimensional curse via a non-linear spectral shaping mechanism. By leveraging resolvent functional calculus and macroscopic Cauchy contour integration, we avoid isotropic noise assumptions and Tracy-Widom edge singularities. We establish that structural incoherence securely shields the trajectory from orthogonal drift, enabling a dimension-free saddle-point escape, and triggering a deterministic \mathcal{O}(1) discrete ballistic ejection under sufficient spectral gap. Consequently, we provide an algebraically dimension-free escape bound for Muon, formalizing the underlying mechanics of its non-convex optimization dynamics.

[LG-154] Path-Dependent Denoising: A Non-Conservative Field Perspective on Order Collapse in Diffusion Language Models

链接: https://arxiv.org/abs/2605.09303
作者: Jeonseong Kim
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion language models (DLMs) offer a structural alternative to autoregressive generation: denoising can update tokens in arbitrary orders or in parallel rather than along a fixed left-to-right chain. In practice, fast DLM decoding remains strongly order-sensitive and often drifts toward autoregressive-like trajectories. We trace this tension to compatibility. At each reverse-time step, a DLM provides local denoising conditionals over the unresolved tokens. Arbitrary-order denoising becomes well defined when these local conditionals compose into order-invariant pseudo-joints. We formalize this view by defining order-induced pseudo-joints and a local denoising circulation: the log-ratio between the two pseudo-joints obtained by swapping a pair of unresolved positions. This circulation is zero under compatible conditionals, and global order gaps decompose into sums of local circulations along adjacent swaps. We further separate incompatibility-driven path dependence from conditional-dependence error in parallel updates and from order-specific estimation error. The resulting framework provides inference-only diagnostics for testing when DLM decoding is genuinely order-free.

[LG-155] LagrangianSplats: Divergence-Free Transport of Gaussian Primitives for Fluid Reconstruction

链接: https://arxiv.org/abs/2605.09299
作者: Ningxiao Tao,Baoquan Chen,Mengyu Chu
类目: Graphics (cs.GR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reconstructing 3D fluid velocity fields from sparse 2D video observations is a highly ill-posed inverse problem, demanding both transport consistency with observed motion and physical validity under fluid laws. Existing methods typically impose these constraints through soft penalties, often leading to compromised accuracy and convergence issues. We introduce a reconstruction framework that structurally enforces both constraints. Specifically, we parameterize the reconstructed velocity using a continuous Divergence-Free Kernel representation, driving the advection of a Lagrangian 3D Gaussian Splatting representation. This formulation intrinsically guarantees both flow incompressibility and long-range transport coherence by construction. To enable the efficient optimization of such a constrained system, we introduce a novel Sliding Window scheme that propagates gradients over meaningful temporal horizons while maintaining tractable training costs. Experiments on synthetic and real-world datasets demonstrate that our method outperforms state-of-the-art baselines in both transport consistency and physical accuracy, enabling applications such as high-quality re-simulation and flow analysis.

[LG-156] dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models

链接: https://arxiv.org/abs/2605.09291
作者: Zhengyan Wan,Yidong Ouyang,Panwen Hu,Qiang Sun
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Discrete flow models (DFMs) are a class of flexible generative models for generating discrete data, and diffusion large language models (dLLMs) can be viewed as a special case with a specific choice of mixture path and a masked source distribution. While several recent works have explored reinforcement learning into dLLMs, its application to more general discrete flow models remains underexplored. In this work, we present discrete Flow-GRPO (dFlowGRPO), a unified reinforcement learning framework for discrete flow models that supports a broad family of probability paths and non-masked source distributions. We derive the full trajectory probability for DFMs and formulate denoising as a Markov decision process, enabling dFlowGRPO to incorporate information from both the associated conditional transition rates and the posterior model during reinforcement learning. We apply dFlowGRPO to FUDOKI, a recent multimodal discrete flow model, and evaluate it on both image generation and multimodal understanding tasks. Empirical results show that dFlowGRPO outperforms existing GRPO-type methods for dLLMs on text-to-image generation tasks and achieves performance competitive with continuous flow-based models trained using FlowGRPO, while also demonstrating strong capabilities on understanding tasks.

[LG-157] From Regression to Inference: Meta-Learning Predictors for Neural Architecture Search

链接: https://arxiv.org/abs/2605.09290
作者: Liping Deng,MingQing Xiao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Prediction-based approaches are widely used in neural architecture search (NAS), where a predictor estimates the performance of candidate architectures to guide selection. However, existing predictors are typically trained via supervised regression on limited samples, leading to overfitting and poor generalization to unseen architectures. In this work, we propose a fundamentally different formulation that models performance prediction as a conditional function inference problem using a Convolutional Neural Process (ConvNP) with meta-learning capabilities. Instead of fitting a fixed mapping to limited samples, our approach meta-learns to infer performance from partial observations by training with context-target splits across a group of synthesized tasks, explicitly optimizing for generalization under data scarcity and aligning the training procedure with the deployment setting in NAS. We further design simple yet effective meta-features for cell-based architectures and evaluate our method on NAS-Bench-101 and NAS-Bench-201. Extensive experiments show that our approach consistently improves top-K ranking quality and achieves the state-of-the-art architecture selection using limited samples.

[LG-158] TileQ: Efficient Low-Rank Quantization of Mixture-of-Experts with 2D Tiling

链接: https://arxiv.org/abs/2605.09281
作者: Hongyaoxing Gu,Xinzhe Chen,Lijuan Hu,Fangfang Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mixture-of-Experts (MoE) models achieve remarkable performance by sparsely activating specialized experts, yet their massive parameters in experts pose significant challenges for deployment. While low-rank quantization offers a promising route to compress MoE models, existing methods still incur nonnegligible memory overhead and inference latency. To address these limitations, we propose TileQ, a fine-tuning-free post-training quantization (PTQ) method that employs 2D-tiling structured low-rank quantization to share low-rank factors across both input and output dimensions of MoE experts. Furthermore, we introduce an efficient inference technique for TileQ that fuses multiple low-rank expert computations into a single-pass operation, significantly improving hardware utilization. Experiments show that TileQ cuts down additional memory usage by up to 10× and reduces inference latency to ~5% while preserving state-of-the-art accuracy.

[LG-159] First Worst-Case Regret Bounds for Combinatorial Thompson Sampling in Sleeping Semi-Bandits

链接: https://arxiv.org/abs/2605.09277
作者: Zhiming Huang,Bingshan Hu,Jianping Pan
类目: Machine Learning (cs.LG)
*备注: Accepted by INFOCOM 26 on Dec 2025

点击查看摘要

Abstract:We revisit combinatorial Thompson sampling (CTS) for semi-bandits with sleeping arms, where arm availability varies over time and actions must satisfy combinatorial constraints, as in wireless mesh routing with fluctuating link availability. Despite its practical relevance, CTS has been hindered by several long-standing problems: (i) the absence of worst-case regret guarantees in the semi-bandit setting even without sleeping arms, (ii) the lack of theory under adversarially varying availability, and (iii) the consistently weak empirical performance of CTS with Gaussian priors (CTS-G). This paper resolves these long-standing issues by providing the first worst-case regret analysis of CTS-G, proving an upper bound of \tilde{O}(m\sqrt{NT}) and a matching lower bound of \tilde{\Omega}(m\sqrt{NT}). To bridge the gap between theory and practice, we further propose CL-SG, a simple CTS-G variant that samples a single shared Gaussian seed each round to coordinate exploration across arms. We show that CL-SG achieves an improved regret bound of \tilde{O}(\sqrt{mNT}), together with a matching lower bound \Omega(\sqrt{mNT}). Experiments on real-world datasets demonstrate that CL-SG consistently outperforms strong baselines including CTS-G and CTS-B, and we open-source our implementation for reproducibility.
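
The coordinating idea behind CL-SG, as described in the abstract, is to draw one shared Gaussian seed per round instead of independent per-arm samples. The schematic sketch below shows only that difference; the per-arm means and scales are purely illustrative placeholders, and the actual posterior statistics and combinatorial oracle are not shown.

```python
import numpy as np

def cl_sg_round(mu, sigma, rng):
    """One CL-SG sampling round: a single shared Gaussian seed for all arms.

    mu, sigma : per-arm posterior means and scales (illustrative values here).
    Returns the sampled scores the learner would hand to its combinatorial oracle.
    """
    z = rng.standard_normal()      # one shared seed per round
    return mu + sigma * z          # CTS-G would instead use mu + sigma * rng.standard_normal(mu.size)

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.7, 0.4])
sigma = np.array([0.1, 0.3, 0.05, 0.2])
print(cl_sg_round(mu, sigma, rng))  # scores move up or down together across arms
```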

[LG-160] DiffATS: Diffusion in Aligned Tensor Space

链接: https://arxiv.org/abs/2605.09275
作者: Jinhua Lyu,Tianmin Yu,Brian Kim,Lizhuo Zhou,Chanwook Park,Naichen Shi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Direct diffusion modeling of high-resolution spatiotemporal fields is computationally challenging. Parameter-efficient primitives address this by representing high-dimensional data with a compact set of parameters. In this paper, we construct data-dependent tensor primitives without pretrained compression autoencoders. Our construction starts from Tucker decomposition, which captures low-rank multilinear structure through a core tensor and mode-wise factors. However, Tucker factors are non-unique: the same tensor can be represented by different rotated factors, which complicates generative modeling. We address this issue with orthogonal Procrustes (OP) alignment. Specifically, we select medoid anchor matrices from the data and align the factor matrices to resolve the gauge ambiguity. This yields matrix Grassmannian primitives and tensor Grassmannian primitives that are compact, data-adaptive, and directly decodable by explicit multilinear reconstruction. Theoretically, we prove that the proposed primitive maps are homeomorphisms between low-rank tensors and their corresponding primitive spaces, certifying that the representations are non-degenerate and topologically faithful. Building on these primitives, we propose Diffusion in Aligned Tensor Space (DiffATS), a generative framework that trains diffusion models directly on aligned tensor primitives. Across images, videos, and PDE solutions, DiffATS achieves strong unconditional and conditional generation performance while compressing original data by 3.9\times to 210\times , without relying on any pretrained deep compression autoencoders.
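
The distinctive step here is resolving the rotational gauge freedom of Tucker factors by orthogonal Procrustes (OP) alignment to anchor matrices. The small numpy sketch below shows only that alignment step under the stated assumption of orthonormal factor columns; medoid anchor selection, the Tucker decomposition itself, and the diffusion model are omitted.

```python
import numpy as np

def procrustes_align(factor, anchor):
    """Rotate `factor` (n x r, orthonormal columns) to best match `anchor`.

    Solves min_R ||factor @ R - anchor||_F over orthogonal R (orthogonal
    Procrustes), which removes the rotational gauge freedom of Tucker factors.
    """
    u, _, vt = np.linalg.svd(factor.T @ anchor)
    r_opt = u @ vt
    return factor @ r_opt, r_opt

rng = np.random.default_rng(0)
anchor = np.linalg.qr(rng.normal(size=(32, 4)))[0]   # anchor factor (e.g. a medoid)
rot = np.linalg.qr(rng.normal(size=(4, 4)))[0]       # arbitrary gauge rotation
factor = anchor @ rot                                 # same subspace, rotated basis

aligned, _ = procrustes_align(factor, anchor)
print(np.allclose(aligned, anchor))                   # True: the rotation is removed
```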

[LG-161] Instance-Adaptive Online Multicalibration

链接: https://arxiv.org/abs/2605.09273
作者: Zhiming Huang,Jamie Morgenstern,Aaron Roth,Claire Jie Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study online multicalibration beyond the worst-case. We give a single, efficient algorithm which dynamically interpolates between benign and worst-case sequences by adaptively refining a dyadic grid of prediction values. Its error is controlled by the number of leaves in the refinement tree. Our analysis recovers the known \widetilde{O}(T^{2/3}) worst-case-optimal rate for online multicalibration, while simultaneously automatically adapting to easier instances: in the marginal stochastic setting it obtains a rate of \widetilde{O}(\sqrt{T}), and for piecewise-stationary means with J segments its rate is \widetilde{O}(\sqrt{JT}). More generally, the rate depends on a threshold-complexity measure of the predictable mean process relative to the group family. We show that this dependence is tight up to logarithmic factors.

[LG-162] Privacy-Preserving Distributed Learning in IoT Systems: A Unified Threat Model and Evaluation Framework

链接: https://arxiv.org/abs/2605.09232
作者: John Cartmell,Alexander Williams
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 14 pages, 6 figures

点击查看摘要

Abstract:The increasing deployment of Internet-of-Things (IoT) devices has accelerated the use of distributed learning frameworks, where data remains local while model updates are shared across decentralized systems. Although this reduces centralized data collection, it introduces privacy risks through the exchange of gradients, model parameters, and intermediate representations. A variety of privacy-preserving techniques have been proposed to address these risks, including differential privacy, cryptographic methods, and lightweight system-level approaches. However, existing surveys often evaluate these methods in isolation and lack a unified framework for comparing their effectiveness under realistic attack models and IoT resource constraints. This paper presents a structured analysis of privacy-preserving techniques for distributed learning in IoT environments. A unified threat model is introduced that captures model inversion, membership inference, gradient leakage, and communication-based attacks. Building on this model, an evaluation framework is developed to compare methods in terms of both privacy robustness and system-level efficiency, including computational, memory, and communication overhead. Using this framework, representative approaches including differential privacy, homomorphic encryption, secure multi-party computation, distributed selective stochastic gradient descent, and Bloom Filter-based methods are analyzed. The results highlight a fundamental trade-off between privacy strength and system efficiency. In particular, Bloom Filter-based encodings are shown to provide lightweight privacy through collision-induced ambiguity while maintaining low computational and communication overhead. The paper provides a unified perspective on privacy-preserving design choices for distributed learning in IoT systems.

[LG-163] SMIXAE: Towards Unsupervised Manifold Discovery in Language Models ICML2026

链接: https://arxiv.org/abs/2605.09224
作者: Collin Francel
类目: Machine Learning (cs.LG)
*备注: 20 pages, 10 figures, 11 tables. Submitted to Mechanistic Interpretability Workshop, ICML 2026

点击查看摘要

Abstract:Sparse autoencoders (SAEs) have been used widely to decompose and interpret neural network activations, especially those of transformer language models. One key issue with SAEs is their inability to directly model multidimensional features. Instead, SAEs may tile such features by a set of independent directions that must be grouped together after the SAE training phase, impeding discoverability and interpretation of learned feature representations. We begin to address this issue by introducing the Sparse MIXture of Autoencoders (SMIXAE) architecture. Empirically, we provide evidence that SMIXAE models have success both in directly learning previously identified manifold structures, as well as finding novel structures, within the open source Gemma 2 2B and 9B models. Finally, we discuss several limitations and point towards areas for future work.

[LG-164] Rethinking Ratio-Based Trust Regions for Policy Optimization in Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2605.09212
作者: Chulabhaya Wijesundara,Andrea Baisero,Zhongheng Li,Gregory Castañón,Alan Carlin,Christopher Amato
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Centralized training with decentralized execution (CTDE) is a standard framework for cooperative multi-agent policy-gradient reinforcement learning, allowing agents to learn from joint information while acting from local observations. Ratio-based trust-region methods such as Multi-Agent Proximal Policy Optimization (MAPPO) and Multi-Agent Simple Policy Optimization (MASPO) update decentralized actors using per-agent probability ratios weighted by joint advantage estimates. Teammate non-stationarity increases the variance of these advantages, which in turn increases the variance in the local ratio updates. This exposes two method-specific failure modes: MAPPO’s additive clipping removes gradients for outlier samples and weakens recovery from policy drift, while MASPO’s soft quadratic penalty can allow probability collapse. We introduce Multi-Agent Ratio Symmetry (MARS), a novel policy optimization objective that replaces these additive ratio-based trust-region mechanisms with a multiplicatively symmetric geometric barrier. MARS preserves corrective gradients while assigning unbounded cost as probability ratios approach zero. Across 47 tasks spanning eight multi-agent environments, including novel JAX benchmarks PaxMen and AeroJAX, MARS matches or exceeds MAPPO and MASPO in aggregate environment-level performance. Ablations show that these gains arise from the geometry of the symmetric barrier rather than from flexible trust-region boundaries alone.

[LG-165] TSNN: A Non-parametric and Interpretable Framework for Traffic Time Series Forecasting

链接: https://arxiv.org/abs/2605.09208
作者: Bowen Liu,Haijian Lai,Chan-Tong Lam,Junhao Dong,Benjamin Ng,Wei Ke,Sio-Kei Im
类目: Machine Learning (cs.LG)
*备注: Accepted by IEEE Transactions on Knowledge and Data Engineering

点击查看摘要

Abstract:Although many complex models were proposed to analyze time series data, some studies have demonstrated remarkable performance with simpler structures. A recent study proposed a non-parametric framework for 3D point cloud classification, which has the potential to be adapted for time series forecasting and enable interpretability. Inspired by the previous works, we present TSNN, a non-parametric and interpretable framework for traffic time series forecasting. TSNN consists of multiple layers that decouple the time series by matching the entries in a memory bank, where the memory bank is constructed using a similar matching process within the training set. It leverages the periodicity in traffic data to enhance forecasting accuracy while maintaining a simple model architecture. The proposed model operates without trainable parameters, preserving its inherent interpretability. In the experiments, TSNN achieves competitive performance compared to the typical deep learning models in four real-world traffic flow datasets. We also visualize the decoupling process to show the effectiveness of the components. Finally, we demonstrate the interpretability of the model and illustrate the contribution of each time step within the memory bank.
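
The abstract's memory-bank matching can be illustrated with a generic non-parametric nearest-neighbour forecaster: store (history, future) windows from the training series and average the futures of the closest matching histories. The sketch below is only this generic idea with made-up data, not TSNN's multi-layer decoupling or its memory construction.

```python
import numpy as np

def build_memory(series, hist_len, horizon):
    """Memory bank of (history, future) pairs taken from the training series."""
    keys, values = [], []
    for t in range(hist_len, len(series) - horizon + 1):
        keys.append(series[t - hist_len:t])
        values.append(series[t:t + horizon])
    return np.array(keys), np.array(values)

def forecast(history, keys, values, k=3):
    """Average the futures of the k training windows closest to `history`."""
    dists = np.linalg.norm(keys - history, axis=1)
    idx = np.argsort(dists)[:k]
    return values[idx].mean(axis=0)

# toy daily-periodic "traffic" signal: 30 days of hourly readings
t = np.arange(24 * 30)
series = np.sin(2 * np.pi * t / 24) + 0.05 * np.random.default_rng(0).normal(size=t.size)
keys, values = build_memory(series[:-24], hist_len=24, horizon=12)
print(forecast(series[-24:], keys, values).shape)  # (12,) step-ahead forecast
```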

[LG-166] LBI: Parallel Scan Backpropagation via Latent Bounded Interfaces

链接: https://arxiv.org/abs/2605.09204
作者: Shaun Christopher Lee,Sangeetha Abdu Jyothi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Backpropagation is inherently sequential across depth, creating an O(K)-deep dependency chain that bottlenecks parallel training. While parallel-scan formulations theoretically reduce this depth to O(\log K), they are computationally prohibitive for modern architectures due to the O(d^3) cost of composing full-rank d \times d Jacobians over the entire hidden state. We introduce Latent Bounded Interfaces (LBI), an algorithmic formulation that makes scan-based backpropagation tractable by restricting inter-region communication to a low-dimensional latent interface, m_k \in \mathbb{R}^r, where r \ll d. This reduces the adjoint recursion to a suffix scan over r \times r Jacobians, cutting per-combine cost from O(d^3) to O(r^3) while preserving exact gradients under the bounded-interface model. We demonstrate that LBI maintains model quality across four architectures (Mamba-2, Mamba-3, Transformer, and a Mamba–Transformer hybrid) at 47–61M block parameters. Interfaces of dimension r=16 suffice to preserve training quality within 0.16–0.35 cross entropy of dense baselines. The resulting framework provides an algorithmic foundation for region-parallel training, reducing cross-device backward communication to a single scan over K fixed-size matrices, of approximately 56 KB for our experimental configurations.
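
The key computational point is that suffix products of small r × r interface Jacobians form an associative scan, so the adjoint chain can in principle be evaluated with logarithmic sequential depth. The numpy sketch below verifies a divide-and-conquer suffix scan against the plain sequential recursion; it only illustrates the scan structure, and the low-dimensional interface extraction itself is not modeled.

```python
import numpy as np

def suffix_scan(jacs):
    """All suffix products P_k = J_k @ J_{k+1} @ ... @ J_{K-1} via divide & conquer.

    Matrix product is associative, so the two halves can be processed in parallel
    and merged with one extra matmul per element, giving O(log K) sequential depth;
    with r x r interface Jacobians each combine costs O(r^3).
    """
    if len(jacs) == 1:
        return [jacs[0]]
    mid = len(jacs) // 2
    left = suffix_scan(jacs[:mid])        # suffixes within the left half
    right = suffix_scan(jacs[mid:])       # suffixes within the right half
    # a left suffix must be continued through the product of the whole right half
    return [p @ right[0] for p in left] + right

def suffix_sequential(jacs):
    out, acc = [], np.eye(jacs[0].shape[0])
    for j in reversed(jacs):
        acc = j @ acc
        out.append(acc)
    return out[::-1]

rng = np.random.default_rng(0)
K, r = 8, 4
jacs = [rng.normal(size=(r, r)) / np.sqrt(r) for _ in range(K)]
assert all(np.allclose(a, b) for a, b in zip(suffix_scan(jacs), suffix_sequential(jacs)))
print("scan matches sequential suffix products")
```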

[LG-167] On Characterizing Learnability for Adversarial Noisy Bandits

链接: https://arxiv.org/abs/2605.09200
作者: Steve Hanneke,Kun Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study adversarial noisy bandits given a known function class \mathcal{F}. In each round, the adversary selects a function f \in \mathcal{F}, the learner chooses an arm, and then observes a noisy reward determined by the chosen arm and the function f. The goal is to minimize the cumulative regret R(T), defined as the difference between the learner’s performance and that of the best fixed arm in hindsight over T rounds. We say that a function class \mathcal{F} is learnable if there exists an algorithm achieving sublinear regret. Our main results concern characterizing learnability. The main quantity appearing in our characterization is a convexified variant of the generalized maximin volume introduced by Hanneke and Wang (2025). For oblivious adversaries, we characterize learnability in terms of this convexified generalized maximin volume. For adaptive adversaries, we show that the same quantity characterizes learnability when the arm space is countable. Our analysis builds on a connection between convexified generalized maximin volume and the existence of simple hitting sets. We further conjecture that the same quantity also characterizes learnability when the arm space is uncountable, via its relation to a new complexity measure, which we call the distribution covering number. This notion can be viewed as a strengthened form of the hitting set that still admits efficient learning via the multiplicative weights algorithm. We also pose a number of relevant open questions regarding this problem.

[LG-168] Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World

链接: https://arxiv.org/abs/2605.09189
作者: Christopher M. Bryant,Hao Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The scaling laws guiding modern model training were calibrated for a single regime: data-rich, single-epoch pretraining. The dominant such scaling law form, Chinchilla’s L = E + A/N^\alpha + B/D^\beta, has three structural limitations outside that regime: it diverges as unique data shrinks instead of saturating at the uninformed baseline; it cannot represent overfitting when capacity exceeds the data; and it conflates total examples seen with unique examples available. We propose a closed-form extension, L(N, D, T) = E + (L_0 - E)\,h/(1+h) with h = a/N^\alpha + b/T^\beta + c\,N^\gamma/D^\delta, that decomposes loss into undercapacity, undertraining, and overfitting terms. It saturates between the irreducible loss E and an uninformed baseline L_0 fixed by the loss type, and reduces to Chinchilla in the data-rich, single-epoch limit. We validate it on four multi-epoch experiments spanning four architecture families (MLPs, ResNets, Fourier neural operators, and transformers) across vision, scientific ML, and language domains, and refit it to five published LLM scaling-law grids. Extrapolating to higher compute and larger unique data than seen at fit time, our form achieves state-of-the-art RMSE on every published LLM grid we evaluate and on most cells of our constructed experiments. Once calibrated, the form admits a cost-aware allocation that recovers Chinchilla’s optimum when data is free and shifts toward smaller corpora and more epochs as data grows expensive.
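
The proposed closed form is easy to state in code. The sketch below implements L(N, D, T) exactly as written in the abstract, but with made-up coefficient values chosen only to show the qualitative behaviour (saturation between E and L_0, and diminishing returns from extra epochs at fixed unique data); the fitted coefficients from the paper are not reproduced here.

```python
import numpy as np

def loss(N, D, T, E, L0, a, alpha, b, beta, c, gamma, delta):
    """Closed-form loss model L(N, D, T) from the abstract.

    N : parameters, D : unique examples, T : total examples seen (epochs * D).
    h collects undercapacity (a/N^alpha), undertraining (b/T^beta) and
    overfitting (c * N^gamma / D^delta); L saturates between E and L0.
    """
    h = a / N**alpha + b / T**beta + c * N**gamma / D**delta
    return E + (L0 - E) * h / (1.0 + h)

# illustrative (made-up) coefficients, not the paper's fitted values
kw = dict(E=1.7, L0=10.9, a=8.0, alpha=0.35, b=30.0, beta=0.4, c=0.02, gamma=0.4, delta=0.6)
N, D = 1e8, 1e9
for epochs in (1, 4, 16, 64):
    # only the undertraining term b/T^beta shrinks with more epochs, so returns diminish
    print(epochs, round(loss(N, D, epochs * D, **kw), 4))
```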

[LG-169] Learning When to Stop: Selective Imitation Learning Under Arbitrary Dynamics Shift

链接: https://arxiv.org/abs/2605.09183
作者: Surbhi Goel,Jonathan Pei,James Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Behavior cloning provides strong imitation learning guarantees when training and test environments share the same dynamics. However, in many deployment settings the test environment’s transitions differ from training, and classical offline IL offers no recourse: the learner must commit to an action at every state, even when its demonstrations are uninformative and could lead to arbitrary degradation of performance. This motivates the study of selective imitation, where the learner may choose to stop when it cannot act reliably. We introduce a model for selective imitation under arbitrary dynamics shift: given labeled expert demonstrations from a training environment and unlabeled state trajectories from the same expert in a test environment, the learner outputs a selective policy that is complete (rarely stops in training) and sound (incurs low regret before stopping in test). Our algorithm, SeqRejectron, constructs a stopping rule using a small set of validator policies whose size is independent of the horizon or policy class. For deterministic policies, this yields horizon-free \tilde{O}(\log|\Pi|/\epsilon^2) sample complexity, assuming sparse costs. For stochastic policies, we obtain analogous horizon-free guarantees using a cumulative Hellinger stopping time. We extend the framework to misspecified experts and different expert policies across train and test and obtain results that gracefully degrade with the amount of misspecification.

[LG-170] Objective-Specific Privileged Bases via Full-Prefix Matryoshka Learning

链接: https://arxiv.org/abs/2605.09160
作者: Arghamitra Talukder,Philippe Chlenski,Itsik Pe’er
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learned representations are often invariant to rotational transformations, leaving individual dimensions non-identifiable and interchangeable. We study how Matryoshka Representation Learning (MRL) induces a task-aligned privileged basis distinct from variance-based or regularizer-induced orderings. In the linear setting, we prove that full-prefix MRL recovers the ordered principal directions, and can be computed efficiently using shared statistics. Empirically, we demonstrate that MRL yields consistent per-dimension structure aligned with task signal, where coordinate magnitude reflects informativeness.
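
A schematic of the full-prefix objective structure: the same task loss is summed over every prefix length of the representation, which is what pushes the most useful directions into the earliest coordinates. The heads and loss below are illustrative placeholders; the paper's analysis is in the linear MRL setting with shared statistics and its exact parameterization differs.

```python
import numpy as np

def full_prefix_mrl_loss(z, y, heads, task_loss):
    """Sum a task loss over every prefix length of the representation z.

    z         : (d,) learned representation
    heads     : list of d weight vectors, heads[k] has shape (k+1,)
    task_loss : callable(prediction, target) -> float
    """
    return sum(task_loss(heads[k] @ z[:k + 1], y) for k in range(len(z)))

rng = np.random.default_rng(0)
d = 8
z = rng.normal(size=d)
heads = [rng.normal(size=k + 1) for k in range(d)]
print(full_prefix_mrl_loss(z, 3.0, heads, lambda p, t: (p - t) ** 2))
```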

[LG-171] Predicting Large Model Test Losses with a Noisy Quadratic System ICML2026

链接: https://arxiv.org/abs/2605.09154
作者: Chuning Li,Chris J. Maddison
类目: Machine Learning (cs.LG)
*备注: ICML 2026

点击查看摘要

Abstract:We introduce a predictive model that estimates the pre-training loss of large models from model size (N), batch size (B) and number of weight updates (K). This is the first loss prediction model that can handle changing batch size. The model outperforms Chinchilla’s loss model, a model of the test loss using the batch size and number of tokens, in terms of projecting the loss at extrapolated compute budgets (up to 1000 folds). A natural use of the model is to find optimal N, B, K configurations under explicit and compound resource constraints like time, memory and compute. In our experiments, the model-selected configurations are close to ground-truth optimal. Our work advocates for loss prediction as a better alternative to heuristic-based laws, which are growing in complexity. The implementation is available on this https URL.

[LG-172] AlphaExploitem: Going Beyond the Nash Equilibrium in Poker by Learning to Exploit Suboptimal Play

链接: https://arxiv.org/abs/2605.09150
作者: Vlad Murgoci,Matthijs Spaan,Yaniv Oren
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Poker is an imperfect information game that has served as a long-standing benchmark for decision-making under uncertainty. To maximize utility beyond the Nash equilibrium, an agent can deviate from Nash-equilibrium policies to exploit suboptimal play. We introduce AlphaExploitem, which extends the competitive RL poker agent AlphaHoldem by using a hierarchical transformer encoder that enables reasoning over previously played hands and modifying the training procedure with the inclusion of a diverse pool of exploitable opponents to facilitate learning to exploit. We train and evaluate AlphaExploitem on two standard benchmarks for imperfect-information games. Empirically, AlphaExploitem successfully exploits weak play by both in- and out-of-distribution opponents, without losing performance against NE opponents.

[LG-173] FedVSSAM: Mitigating Flatness Incompatibility in Sharpness-Aware Federated Learning

链接: https://arxiv.org/abs/2605.09144
作者: Bingnan Xiao,Yuan Gao,Bingcong Li,Wei Ni,Xin Wang,Tony Q. S. Quek
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sharpness-aware minimization (SAM) is an effective method for improving the generalization of federated learning (FL) by steering local training toward flat minima. Under data heterogeneity, however, device-side SAM searches for locally flat basins that are incompatible with the flat region preferred by the global objective. We identify this structural failure mode as flatness incompatibility, which explains why improving local flatness alone may provide limited training and generalization improvement for the global model. We reveal that flatness incompatibility arises from data heterogeneity and the friendly adversary phenomenon, and is further amplified by local updates and partial device participation. To mitigate this issue, we propose Federated Learning with variance-suppressed sharpness-aware minimization (FedVSSAM), which constructs a variance-suppressed adjusted direction and uses it consistently in local flatness search, local descent, and global update. FedVSSAM anchors both perturbation and update directions to a more stable global direction, instead of correcting only an isolated local perturbation. We establish non-convex convergence guarantees of FedVSSAM and prove that the mean-square deviation between the adjusted direction and the global gradient is effectively controlled. Experiments demonstrate that FedVSSAM mitigates flatness incompatibility and outperforms the baselines across diverse FL settings.

[LG-174] Evaluating Federated Learning approaches for mammography under breast density heterogeneity

链接: https://arxiv.org/abs/2605.09137
作者: Gonzalo Iñaki Quintana,Franco Martin Di Maria,Laurence Vancamberg
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Breast density is a key factor that influences mammography interpretation and is a major source of heterogeneity in multicenter datasets. Such heterogeneity poses challenges for collaborative machine learning across institutions, particularly in Federated Learning. This study aims to evaluate the impact of breast density-induced heterogeneity on FL for mammography image classification and to assess the robustness of common FL algorithms in realistic clinical settings. We conducted experiments under two scenarios: (1) a strongly heterogeneous setting where each participating site contributed exclusively low- or high-density cases, based on the BI-RADS density score, and (2) a population-based setting simulating breast density distributions in White and Asian populations. For the strongly heterogeneous setting, we evaluated two configurations: one with 2 clients, where the cases were grouped as BI-RADS A-B and C-D, and one with 4 clients, where each site contained cases of a single BI-RADS density. We compared three FL methods (FedAvg, FedProx, SCAFFOLD) against centralized training, local-only training, and naive aggregation approaches, including ensembling and weight averaging. Across both scenarios, FL achieved performance comparable to centralized training, while local models and naive aggregation approaches underperformed in the presence of strong heterogeneity. Notably, FedAvg achieved accuracy on par with or exceeding centralized training, demonstrating resilience to breast density-induced data imbalance without requiring specialized heterogeneity mitigation algorithms. These findings show that FL can address breast density-related heterogeneity, supporting its feasibility for real-world mammography workflows. The demonstrated robustness of FedAvg underscores the potential for broad clinical deployment of FL, enabling collaborative model development while maintaining data privacy.

[LG-175] Cosine-Gated Adam-Decay: Drop-In Staleness-Aware Outer Optimization for Decoupled DiLoCo

链接: https://arxiv.org/abs/2605.09126
作者: Vatsal Shah,Jiahao Sun
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Asynchronous DiLoCo systems may receive pseudo-gradients computed several outer rounds earlier, yet the standard Nesterov outer optimizer does not explicitly condition its update on per-update age. This can make the outer momentum buffer brittle under large controlled delays. We propose Cosine Gated Adam Decay (CGAD), a simple, drop-in, age-aware outer optimizer that scales each incoming pseudo-gradient by \sigma(\tau) = \gamma(\tau) e^{-\alpha\tau} before it enters Adam’s first- and second-moment buffers; the exponential models information decay and the cosine gate \gamma(\tau) smoothly zeroes contributions past a chosen cutoff. CGAD reduces to plain Adam at \tau=0, adds two hyperparameters whose defaults transfer across scales, and extends to partial-sync schedulers via a per-fragment age-aware variant (PA-CGAD). For an idealized gated-adaptive update on smooth nonconvex objectives, we prove a non-asymptotic convergence bound whose staleness-bias term depends on \alpha alone, rather than on the realized maximum delay \tau_{\max}; standard analyses of asynchronous momentum-SGD instead carry a \tau_{\max}^2 factor. Empirically, on Llama-style language model pretraining at 25M, 1B, and 7B parameters, CGAD trains stably across the controlled delays we sweep. The cosine cutoff acts as scale insurance: the closest baseline, Adam Decay (CGAD without the cutoff), is competitive at 25M but its seed-to-seed \sigma at \tau=8 grows 27× from 25M to 7B, pushing its single-shot risk (mean + \sigma) above the chance-level loss while CGAD’s stays well below. The published Nesterov recipe is the least stable method on the full sweep.
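
The scaling rule \sigma(\tau) = \gamma(\tau) e^{-\alpha\tau} is straightforward to drop in front of Adam's moment updates. The sketch below uses one natural cosine gate with \sigma(0)=1 and a hard zero beyond the cutoff, since the abstract does not spell out the exact gate, and the hyperparameter values are illustrative rather than the paper's defaults.

```python
import numpy as np

def staleness_weight(tau, alpha=0.3, cutoff=16):
    """Age-dependent gate: exponential information decay times a smooth cosine cutoff.

    One plausible instantiation (not necessarily the paper's exact gate):
    sigma(0) = 1, and sigma(tau) = 0 for tau >= cutoff.
    """
    if tau >= cutoff:
        return 0.0
    cosine_gate = 0.5 * (1.0 + np.cos(np.pi * tau / cutoff))
    return cosine_gate * np.exp(-alpha * tau)

def cgad_step(theta, m, v, pseudo_grad, tau, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One age-aware Adam outer update: the pseudo-gradient is scaled by its age
    *before* entering the first/second moment buffers (reduces to Adam at tau = 0)."""
    g = staleness_weight(tau) * pseudo_grad
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.zeros(4), np.zeros(4), np.zeros(4)
rng = np.random.default_rng(0)
for t, tau in enumerate([0, 2, 8, 20], start=1):   # pseudo-gradients of varying age
    theta, m, v = cgad_step(theta, m, v, rng.normal(size=4), tau, t)
print(theta)
```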

[LG-176] Transfer Learning of Multiobjective Indirect Low-Thrust Trajectories Using Diffusion Models and Markov Chain Monte Carlo

链接: https://arxiv.org/abs/2605.09125
作者: Jannik Graebner,Ryne Beeson
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Preliminary low-thrust spacecraft mission design is a global search problem characterized by a complex solution landscape, multiple objectives, and numerous local minima. During this phase, mission parameters are often not yet fully defined, requiring new solutions to be generated at a high cadence across varying parameter values. When combined with the indirect approach to optimal control, diffusion models can accelerate this search by learning distributions that represent high-quality initial costates. However, generating training data remains expensive, and opportunities exist to better exploit past data. We propose a transfer-learning framework that combines homotopy in a mission parameter with Markov chain Monte Carlo (MCMC) to generate training data more efficiently. The approach reformulates a multiobjective optimization problem as sampling from an unnormalized target distribution in costate space. We compare three MCMC algorithms on a planar multi-revolution transfer in the circular restricted three-body problem, with homotopy in the system mass parameter. The results show that gradient-based MCMC variants achieve the best trade-off between sample quality and computational cost. For the test transfer, the proposed framework generates 40 % more feasible solutions and achieves a higher-quality Pareto front than a state-of-the-art indirect approach based on adjoint control transformations and gradient-based optimization. Finally, the MCMC-generated samples are used to fine-tune a diffusion model conditioned on the mass parameter, enabling it to learn a global representation of the underlying solution distribution and efficiently generate new solutions. These findings establish the transfer-learning framework as a practical method for efficiently solving indirect trajectory optimization problems with varying parameters.

[LG-177] Bridging Spectral Operator Learning and U-Net Hierarchies: SpectraNet for Stable Autoregressive PDE Surrogates

链接: https://arxiv.org/abs/2605.09096
作者: Enrique Hernández Noguera,Md Meftahul Ferdaus,Elias Ioup,Mahdi Abdelguerfi,Julian Simeonov
类目: Machine Learning (cs.LG)
*备注: 29 pages, 9 figures. Code: this https URL

点击查看摘要

Abstract:Neural operators for time-dependent PDEs face a structural tension: spectral architectures (FNO and descendants) inherit exponential rollout-error growth from their one-step Lipschitz constant, while hierarchical U-Net operators trade resolution invariance for multi-scale detail. We introduce SpectraNet, an autoregressive neural operator that composes truncated spectral convolutions inside a U-Net hierarchy with a Residual-Target Spectral Block trained under a Semigroup-Consistency Loss. The residual-target parametrization replaces L^T stability blow-up with linear T*delta drift, and the spectral path’s parameter count is Theta(L w^2 M^2), independent of grid N. Under a single unified protocol against 16 published neural-operator baselines on Navier-Stokes nu=1e-5 at 64x64, SpectraNet reaches test relative L2 = 0.0822 at 2.04M parameters – 2.33x fewer than canonical FNO at ~20% lower error – and wins five of six rows in a cross-PDE comparison against FNO (NS at nu in 1e-4, 1e-3, PDEBench Shallow-Water 2D and Diffusion-Reaction, with the Active-Matter row going to FNO inside its seed spread). Trained from scratch at native 128^2 under the same protocol, SpectraNet improves to 0.0724 while FNO regresses to 0.3080. Free rollout stays bounded for T=100 where FNO diverges across all 200 test trajectories. On consumer CPU at B=1, SpectraNet runs sub-200ms while the full-attention Transformer that wins raw L2 pays ~60x latency; we do not claim to beat that Transformer on raw L2, only to dominate the lightweight (≤5M parameter, sub-200ms CPU) Pareto frontier. Source code: this https URL
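
Two of the named ingredients, the residual-target parametrization and the semigroup-consistency penalty, can be sketched compactly. The PyTorch snippet below is a minimal reading of those two ideas only; the convolutional body, the step-size conditioning, and the exact loss form are assumptions, and the real model uses truncated spectral convolutions inside a U-Net:

```python
import torch
import torch.nn as nn

class ResidualStepper(nn.Module):
    """One-step operator in residual-target form: predict a drift and advance
    the state as u + dt * drift(u) instead of mapping to u_{t+1} directly."""
    def __init__(self, channels=1, width=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, width, 3, padding=1), nn.GELU(),
            nn.Conv2d(width, channels, 3, padding=1),
        )

    def forward(self, u, dt):
        return u + dt * self.net(u)

def semigroup_consistency(model, u, dt):
    """Semigroup/flow property: two steps of size dt should agree with one
    step of size 2*dt; used here as an auxiliary training penalty."""
    two_small = model(model(u, dt), dt)
    one_big = model(u, 2.0 * dt)
    return ((two_small - one_big) ** 2).mean()

u0 = torch.randn(4, 1, 64, 64)
model = ResidualStepper()
loss = semigroup_consistency(model, u0, dt=0.1)
loss.backward()
```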

[LG-178] A Tale of Two Problems: Multi-Task Bilevel Learning Meets Equality Constrained Multi-Objective Optimization

链接: https://arxiv.org/abs/2605.09094
作者: Zhiyao Zhang,Myeung Suk Oh,Zhen Qin,Jiaxiang Li,Xin Zhang,Jia Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, bilevel optimization (BLO) has attracted significant attention for its broad applications in machine learning. However, most existing works on BLO remain confined to the single-task setting and rely on the lower-level strong convexity assumption, which significantly restricts their applicability to modern machine learning problems of growing complexity. In this paper, we make the first attempt to extend BLO to the multi-task setting under a relaxed lower-level general convexity (LLGC) assumption. To this end, we reformulate the multi-task bilevel learning (MTBL) problem with LLGC into an equality constrained multi-objective optimization (ECMO) problem. However, ECMO itself is a new problem that has not yet been studied in the literature. To address this gap, we first establish a new Karush-Kuhn-Tucker (KKT)-based Pareto stationarity as the convergence criterion for ECMO algorithm design. Based on this foundation, we propose a weighted Chebyshev (WC)-penalty algorithm that achieves a finite-time convergence rate of O(ST^{-\frac{1}{2}}) to KKT-based Pareto stationarity in both deterministic and stochastic settings, where S denotes the number of objectives, and T is the total iterations. Moreover, by varying the preference vector over the S -dimensional simplex, our WC-penalty method systematically explores the Pareto front. Finally, solutions to the ECMO problem translate directly into solutions for the original MTBL problem, thereby closing the loop between these two foundational optimization frameworks.

[LG-179] Towards Trustworthy Audio Deepfake Detection: A Systematic Framework for Diagnosing and Mitigating Gender Bias

链接: https://arxiv.org/abs/2605.09087
作者: Aishwarya Fursule,Shruti Kshirsagar,Anderson R. Avila
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注: Submitted to SMC 2026 conference

点击查看摘要

Abstract:Audio deepfake detection systems are increasingly deployed in high-stakes security applications, yet their fairness across demographic groups remains critically underexamined. Prior work measures gender disparity but does not investigate where it comes from or how to fix it systematically. We present the first diagnosis-first framework that identifies bias source before applying targeted mitigation, evaluated on two models, AASIST and Wav2Vec2+ResNet18, on ASVSpoof5. Our diagnosis shows that bias does not stem from imbalanced training data but from acoustic representation differences, gender leakage in learned features, and structural evaluation asymmetry. We test mitigation strategies across in-processing, post-processing and combined families, including novel methods introduced in this work. Adjusting the decision threshold separately per gender reduces unfairness by 54% to 75% at no cost to detection accuracy, and our new epoch-level fairness regularisation method outperforms existing per-batch approaches. Adversarial debiasing succeeds only when gender leakage is localised, and fails when it is diffuse, an outcome correctly predicted by our diagnosis before training. No single method fully closes the fairness gap, confirming that bias sources must be identified before fixes are applied and that fairer benchmark design is equally important.
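
The most effective mitigation reported, per-gender decision thresholds, is a simple post-processing step. A minimal NumPy sketch is given below; the criterion used to pick each threshold (an equal-error point on calibration data) is an assumption, since the abstract does not specify it:

```python
import numpy as np

def per_group_thresholds(scores, labels, groups, grid=np.linspace(0.05, 0.95, 91)):
    """Pick one decision threshold per demographic group on held-out
    calibration data, here by balancing false-positive and false-negative
    rates within each group."""
    thresholds = {}
    for g in np.unique(groups):
        s, y = scores[groups == g], labels[groups == g]
        best_t, best_gap = 0.5, np.inf
        for t in grid:
            pred = (s >= t).astype(int)
            fpr = ((pred == 1) & (y == 0)).sum() / max((y == 0).sum(), 1)
            fnr = ((pred == 0) & (y == 1)).sum() / max((y == 1).sum(), 1)
            if abs(fpr - fnr) < best_gap:
                best_gap, best_t = abs(fpr - fnr), t
        thresholds[g] = best_t
    return thresholds

def predict(scores, groups, thresholds):
    """Apply the group-specific threshold to each sample."""
    return np.array([scores[i] >= thresholds[groups[i]] for i in range(len(scores))])
```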

[LG-180] Predicting Plasticity in Deep Continual Learning: A Theoretical Perspective

链接: https://arxiv.org/abs/2605.09044
作者: Jiuqi Wang,Jayanth Srinivasa,Claire Chen,Shuze Daniel Liu,Ali Payani,Shangtong Zhang
类目: Machine Learning (cs.LG)
*备注: 21 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Deep continual learning requires models to adapt to new tasks without retraining from scratch. However, neural networks can lose their ability to adapt to new tasks after training on previous ones, a phenomenon known as loss of plasticity. There have been several explanations and diagnostics proposed for plasticity loss. Motivated by the philosophy that “all models are wrong, but some are useful”, we ask: can existing diagnostics predict a neural network’s plasticity? In this work, we take a practical view to interpret plasticity as trainability, i.e., a neural network’s future optimization gain on a target task. We first take a theoretical approach, showing, by constructing a few counterexamples, that some widely adopted diagnostics of plasticity, including representation rank and neural tangent kernel rank, can fail to predict the loss of trainability in both regression and classification settings. We instead propose a novel metric, called optimization readiness, which combines gradient strength and gradient reliability. We prove that optimization readiness lower bounds one-step optimization gain under standard smoothness assumptions, providing a theoretical guarantee for its predictive power. Empirically, we show that across commonly used deep continual learning settings, such as Slowly-Changing Regression and Permuted MNIST, optimization readiness more reliably ranks checkpoints by trainability than prior diagnostics, even with substantially fewer samples.

[LG-181] PACT: Peak-Aware Cross-Attention Graph Transformers for Efficient Storm-Surge Emulation

链接: https://arxiv.org/abs/2605.09036
作者: Zesheng Liu,Doyup Kwon,Ning Lin,Maryam Rahnemoonfar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate and efficient storm-surge emulation is essential for coastal hazard assessment, yet high-fidelity hydrodynamic models remain too expensive for large scenario ensembles and rapid evaluation under heterogeneous climate forcings. We present PACT, a peak-aware cross-attention graph transformer for efficient station-level storm-surge prediction from atmospheric forcing fields. PACT represents each forcing patch as a graph, encodes spatial structure with GraphSAGE, and uses a learned station query to aggregate node information through cross-attention rather than uniform pooling. A Transformer encoder models temporal dependence across the forcing history, and a horizon-query decoder generates lead-specific forecasts from a shared temporal memory. To better capture extreme events, we introduce a peak-aware learning strategy that couples a lightweight auxiliary peak-aware head with a tailored training objective, including a tail-focused loss on peak-dominated samples and a horizon-wise slope regularizer to encourage coherent multi-step evolution. Across multiple tide-gauge stations along the US Northeast coast, PACT outperforms a strong spatio-temporal graph neural network baseline in both RMSE and MAE. Diagnostics show improved peak fidelity and tail preservation for reanalysis and most CMIP6 datasets. PACT is also computationally efficient, requiring about 3.5 s to generate a full winter-season surge trajectory for one year after training. Under distribution shift across five CMIP6 forcings, PACT transfers well within the CMIP6 family but degrades markedly when transferring from reanalysis to climate-model forcings, highlighting a persistent reanalysis–GCM gap.

[LG-182] Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration

链接: https://arxiv.org/abs/2605.09034
作者: Jiahe Chen,Ziye Ma
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Zeroth-order (ZO) optimization has become increasingly popular and important in fine-tuning large language models (LLMs), especially on edge devices due to its ability to adjust the model to local data without the need for memory-intensive back-propagation. Recent works try to reduce ZO variance through low-dimensional subspace search, but subspace restriction alone leaves key optimization geometry under-exploited, motivating additional acceleration. In this work, we focus on the hidden layer training problem in which spectral optimizers like Muon outperform AdamW due to its ability to exploit weak spectral directions by orthogonalization. However, we have discovered that unlike in the first-order setting, full orthogonalization works poorly in the ZO setting since the gradient estimates are highly noisy and unreliable. To address this issue, we propose a key approach we call partial orthogonalization. To do so, we replace the iconic Newton-Schulz procedure in Muon with the faster, more concentrated power-iteration method so that it only amplifies dominant spectral directions. Furthermore, to improve the efficiency and generalization of the algorithm, we adopted a streaming variant of power-iteration that requires low variance in gradients, which was achieved through constraining our search inside a subspace obtained through the projection of momentum, echoing recent advances. Experiments on LLM fine-tuning show that our method can achieve from 1.5x to 4x the convergence speed of ZO-Muon, the current SOTA algorithm, across SuperGlue datasets in the OPT-13B model. Across different models, we also reach competitive final accuracies with less time in most cases compared with strong ZO baselines such as MeZO, LOZO and ZO-Muon. Code is available at this https URL.
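
The core of partial orthogonalization, building the update only from the dominant singular directions of a noisy gradient estimate, can be sketched with a block power iteration. The snippet below is an illustrative NumPy version; the rank, iteration count, and the use of a small exact polar factor are assumptions, and the paper additionally uses a streaming variant constrained to a momentum-projected subspace:

```python
import numpy as np

def polar(A):
    """Orthogonal polar factor U V^T of a small matrix via SVD."""
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt

def partial_orthogonalize(G, rank=8, iters=5, seed=0):
    """Approximate the top-`rank` right singular subspace of G with block
    power iteration, then return ~ U_k V_k^T: the dominant spectral
    directions with their singular values flattened to 1, leaving the
    noisy minor directions out."""
    rng = np.random.default_rng(seed)
    n = G.shape[1]
    Q = np.linalg.qr(rng.standard_normal((n, rank)))[0]
    for _ in range(iters):
        Q = np.linalg.qr(G.T @ (G @ Q))[0]   # power iteration on G^T G
    return polar(G @ Q) @ Q.T

G = np.random.default_rng(1).standard_normal((256, 512))
update = partial_orthogonalize(G, rank=8)
```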

[LG-183] Spherical Boltzmann machines: a solvable theory of learning and generation in energy-based models

链接: https://arxiv.org/abs/2605.09031
作者: Thomas Tulinski,Simona Cocco,Rémi Monasson,Jorge Fernandez-De-Cossio-Diaz
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Energy-based models (EBMs) are flexible generative architectures inspired by statistical physics, but their learning and generative properties remain poorly understood. Here, we analyze a solvable EBM in the high-dimensional limit: the spherical Boltzmann machine (SBM). Combining tools from random matrix theory and dynamical mean-field theory, we: solve exact equations describing the training dynamics of the SBM; compute the Bayesian evidence, which acts as a partition function in parameter space and encodes global properties of the trained model; and uncover cascades of phase transitions that occur both during training and as a function of hyperparameters, related to successive alignment and condensation of the top modes of the coupling matrix to the data. We connect these transitions to sampling-time generative phenomena in a teacher-student scenario, including: sampling temperature tuning, double descent as a function of regularization strength, tempered posterior effects, and out-of-equilibrium effects during training that induce biases in the trained model. We provide numerical evidence demonstrating that all these phenomena appear in standard generative architectures, beyond the SBM.

[LG-184] Diagnosing and Mitigating Domain Shift in Permission-Based Android Malware Detection

链接: https://arxiv.org/abs/2605.09028
作者: Md Rafid Islam
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning-based Android malware detectors often fail in real-world deployment due to domain shift, where models trained on one data source perform poorly on applications from another. This paper presents a comprehensive study on the generalizability and interpretability of permission-based detectors under cross-domain conditions. Using two complementary datasets (PerMalDroid and NATICUSdroid) and five ensemble classifiers, we first establish an intra-domain baseline, where models achieve over 92% accuracy, and then quantify a severe asymmetric performance drop. While models trained on PerMalDroid generalize well to NATICUSdroid (86% accuracy), the reverse direction sees a drastic drop to 73% accuracy. Explainable AI analysis reveals bimodal feature distributions and shows that feature importance is highly unstable, with key permissions losing or gaining influence across domains. The predictive feature sets for different domains are fundamentally mismatched, as models rely on different, dataset-specific permissions. Most importantly, an ablation study demonstrates that for most models, training on a noisy feature set leads to poor generalization, confirming that domain-specific artifacts are a greater obstacle than missing features. To mitigate this, we validate a hybrid training strategy based on the intersection of common features and successfully recover cross-domain performance, achieving 88% accuracy on PerMalDroid and maintaining 97% on NATICUSdroid. These findings highlight the importance of explainable, cross-domain-robust malware detection systems and provide a practical pathway toward improving real-world deployment of permission-based Android malware detectors.
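
The validated mitigation, training on the intersection of features shared by both datasets, is straightforward to express in code. A minimal sketch follows; the random forest stands in for the paper's ensemble classifiers, and the data arrays and column lists are placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def hybrid_train(Xa, ya, cols_a, Xb, yb, cols_b):
    """Train one model on both domains, restricted to the permissions present
    in both feature sets, so it cannot lean on dataset-specific artifacts."""
    common = sorted(set(cols_a) & set(cols_b))
    ia = [cols_a.index(c) for c in common]
    ib = [cols_b.index(c) for c in common]
    X = np.vstack([Xa[:, ia], Xb[:, ib]])
    y = np.concatenate([ya, yb])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X, y)
    return clf, common
```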

[LG-185] Non-Parametric Rehearsal Learning via Conditional Mean Embeddings

链接: https://arxiv.org/abs/2605.08999
作者: Wen-Bo Du,Tian-Zuo Wang,Han-Jia Ye,Zhi-Hua Zhou
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In machine learning, a critical class of decision-related problems concerns preventing predicted undesirable outcomes, referred to as the \textit{avoiding undesired future} (AUF) problem. To address this, the \textit{rehearsal learning} framework has been proposed to model influence relations for effective decisions. However, existing rehearsal methods rely on restrictive parametric assumptions such as linear systems or additive noise, limiting their practical applicability. In this paper, we propose the first non-parametric rehearsal learning approach for AUF without assuming specific functional forms of data generation processes. Specifically, we use kernel machinery to reformulate the AUF objective into a unified representation that disentangles desirability modeling from action-induced distributional changes. To handle the discontinuity of desirability indicator, we present a smooth Probit surrogate and provide an approximation error bound. Meanwhile, we capture the action-induced changes via conditional mean embeddings, and develop a kernel ridge regression based nested estimator for AUF objective with consistency guarantees. Such a formulation naturally accommodates nonlinear systems and non-additive noise, and empirical results on synthetic and real-data-derived semi-synthetic benchmarks demonstrate the effectiveness and flexibility of our approach.

[LG-186] Machine Learning-Based Graph Simplification for Symbolic Accelerators

链接: https://arxiv.org/abs/2605.08996
作者: Tiffany Yu,Rye Stahle-Smith,Darssan Eswaramoorthi,Rasha Karakchi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph-based accelerators have been widely adopted in symbolic data processing applications such as genomics, cybersecurity, and artificial intelligence. However, these systems often suffer from excessive memory usage and inefficiencies stemming from redundant graph structures. We present AutoSlim, a machine learning-based framework that leverages data-driven methods to prune automata graphs for hardware accelerators. Using features extracted from prior graph executions and a Random Forest classifier, AutoSlim identifies and removes low-impact nodes and edges. When applied to a Non-deterministic Finite Automata overlay architecture (NAPOLY+), AutoSlim reduces FPGA resource usage by up to 40%, with corresponding improvements in throughput and power efficiency. The framework includes a verification step to ensure functional equivalence after pruning and suggests promising directions for both hardware optimization and security.

[LG-187] When More Parameters Hurt: Foundation Model Priors Amplify Worst-Client Disparity Under Extreme Federated Heterogeneity IJCAI2026

链接: https://arxiv.org/abs/2605.08992
作者: Kiran Naseer,Umar Shoaib
类目: Machine Learning (cs.LG)
*备注: 7 pages, 5 figures. Submitted to FL@FM-IJCAI 2026 Workshop

点击查看摘要

Abstract:Federated learning (FL) is increasingly used to fine-tune foundation models (FMs) on distributed private data. The community largely assumes that large-scale pretraining serves as a ‘rising tide that lifts all boats’ in federated settings. However, our experiments reveal that these powerful priors can hinder rather than help the most disadvantaged clients under extreme heterogeneity. Through controlled experiments on federated text classification, we compare worst-client accuracy between TextCNN (2.7M parameters) and DistilBERT with Low-Rank Adaptation (LoRA, 66M parameters) across four Non-IID heterogeneity levels. Under extreme label skew (alpha = 0.1), DistilBERT+LoRA produces a worst-client accuracy gap of 50.1%, which is 56% larger than TextCNN’s 32.2% gap, despite having 25x more parameters and extensive pretraining. Under moderate heterogeneity (alpha = 0.5), the pattern reverses: the FM nearly eliminates the gap. We call this the FM Fairness Paradox. We further show that an inverse-weighted LoRA aggregation method (FedAvgW) does not resolve the disparity, suggesting aggregation reweighting alone may be insufficient. Our results highlight the need for mechanisms that explicitly protect minority clients before deploying foundation models in high-stakes federated contexts such as healthcare and education.

[LG-188] PMCTS: Particle Monte Carlo Tree Search for Principled Parallelized Inference Time Scaling

链接: https://arxiv.org/abs/2605.08982
作者: Yaniv Oren,Viliam Vadocz,Joery A. de Vries,Wendelin Böhmer,Matthijs T. J. Spaan,Hendrik Baier
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Monte Carlo Tree Search (MCTS) is a widely used approach for policy improvement through search with increasing popularity for real world applications. Due to the sequential and deterministic nature of its search, runtime-scaling of MCTS with parallel compute remains a major challenge. We introduce Particle MCTS (PMCTS), to our knowledge the first principled parallel MCTS algorithm which is suited for neural network evaluations and can preserve formal policy improvement guarantees. Empirically, PMCTS scales well with parallel compute and significantly outperforms the popular heuristic-based baselines across domains.

[LG-189] Muon Does Not Converge on Convex Lipschitz Functions

链接: https://arxiv.org/abs/2605.08980
作者: Tetiana Parshakova,Ahmed Khaled,Michael Crawshaw,Guillaume Garrigos,Robert M. Gower
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Muon and its variants have shown strong empirical performance in a variety of deep learning tasks. Existing convergence analyses of Muon rely on smoothness assumptions, though arguably the most successful function class for developing deep learning methods (such as AdaGrad, Shampoo, Schedule-Free and more) has been the class of convex and Lipschitz functions. In this paper we question whether the classical convex Lipschitz model is a useful one for understanding Muon. Our answer is no. We show that Muon does not converge on the class of convex and Lipschitz functions, regardless of the choice of learning rate schedule. We also show that error feedback restores convergence of Muon and all the non-Euclidean subgradient methods with momentum. However, this theoretical fix using error feedback degrades the performance of Muon in two representative settings for image classification (CIFAR-10) and language modeling (nanoGPT on FineWeb-Edu 10B). Our conclusion is that convex Lipschitz theory, despite having a prominent role in the design of practical methods for deep learning, is not the most suited one for Muon. This suggests that Muon’s success must come from structure absent from this model, most plausibly related to smoothness.

[LG-190] VORT: Adaptive Power-Law Memory for NLP Transformers

链接: https://arxiv.org/abs/2605.08966
作者: Nabil Mlaiki
类目: Machine Learning (cs.LG)
*备注: 18 pages, 5 figures

点击查看摘要

Abstract:Standard Transformers impose near-exponential decay on the influence of distant tokens, conflicting with the power-law structure of long-range dependencies in natural language. We introduce the \emph{Variable-Order Retention Transformer} (VORT), a memory architecture in which each ingested token is assigned a learnable fractional order \alpha_i\in[\delta,1] that governs a Grünwald–Letnikov power-law retention kernel. Because the fractional weighted sum is non-Markovian, we approximate it through a sum-of-exponentials (SOE) decomposition computed by Gauss–Laguerre quadrature on a Laplace-type integral representation of the kernel weights. Each exponential component admits a one-step Markovian recurrence at O(Sd_v) per step, where S=O(\log(T/\varepsilon)) terms suffice for \varepsilon-uniform accuracy on horizon [1,T]. Retrieval is keyed and associative via a linear-attention accumulator with an exact O(KSd_\phi d_v) -per-step recurrence. Four results are established: (i) an SOE approximation theorem with geometric convergence rate from the analyticity of the integrand after a log-change of variables; (ii) a quantisation bound valid on [\delta,1] with correct analysis near \alpha=0 ; (iii) a direct L^2 energy argument (Proposition) showing that for \alpha < 1/2 any mixture with fixed minimum decay rate \Lambda > 0 incurs L^2([1,T]) error at least N_\alpha(T)-C(\Lambda)\to\infty, with the \Lambda-dependence made explicit; and (iv) linear convergence of a gradient plasticity rule under the Polyak–Łojasiewicz condition. Two synthetic experiments confirm the architectural advantage: a Zipf-distributed retrieval benchmark and an entity label-copy task with uniform lag distribution, the latter ruling out prior-matching as an explanation for the power-law kernel’s advantage.
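
The sum-of-exponentials idea that makes the power-law memory Markovian can be illustrated in a few lines. The sketch below approximates a plain power-law kernel K(t) = t^{-alpha} (rather than the paper's Grünwald–Letnikov weights or its specific change of variables) via the Laplace representation and generalized Gauss–Laguerre quadrature, then runs the per-component one-step recurrence; the node count is an illustrative choice, and uniform long-horizon accuracy needs the paper's construction:

```python
import math
import numpy as np
from scipy.special import roots_genlaguerre

def soe_powerlaw_kernel(alpha, n_nodes=24):
    """K(t) = t**(-alpha) ~ sum_i coef_i * exp(-s_i * t) for t >= 1, from
    t^{-a} = (1/Gamma(a)) * int_0^inf s^{a-1} e^{-s t} ds, discretised with
    quadrature nodes/weights for the weight function s^{a-1} e^{-s}."""
    s, w = roots_genlaguerre(n_nodes, alpha - 1.0)
    coef = w * np.exp(s) / math.gamma(alpha)
    return coef, s

def powerlaw_memory(signal, alpha, n_nodes=24):
    """Markovian O(n_nodes)-per-step recurrence for y_t = sum_{j>=0} K(j+1) x_{t-j}:
    each exponential component keeps one scalar state."""
    coef, rate = soe_powerlaw_kernel(alpha, n_nodes)
    decay = np.exp(-rate)
    m = np.zeros(n_nodes)
    out = np.empty(len(signal))
    for t, x_t in enumerate(signal):
        m = decay * (m + coef * x_t)       # m_i[t] = e^{-s_i} (m_i[t-1] + c_i x_t)
        out[t] = m.sum()
    return out

x = np.random.default_rng(0).standard_normal(128)
y = powerlaw_memory(x, alpha=0.6)
```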

[LG-191] Trustworthy AI: Ensuring Reliability and Accountability from Models to Agents

链接: https://arxiv.org/abs/2605.08964
作者: Carol Xuan Long
类目: Machine Learning (cs.LG)
*备注: PhD thesis

点击查看摘要

Abstract:In this thesis, we develop algorithms with theoretical guarantees for ensuring reliability and accountability of Machine Learning (ML) systems. As ML systems evolve from predictive models to generative models and autonomous agents, the landscape of trustworthy AI has shifted. This thesis introduces tools grounded in information theory, optimization, and statistical learning to mitigate bias, reduce arbitrary decisions, ensure content provenance, and evaluate LLM-driven agents in autonomous settings. Towards mitigating bias and arbitrariness in traditional ML models, we introduce a kernel-based method to achieve multiaccuracy across complex subpopulations that traditional demographic categories may overlook. We also develop methods to address predictive multiplicity, where equally accurate models yield conflicting individual predictions. We ensure the accountability in generative AI through watermarking large language models (LLMs). We characterize the information-theoretic trade-off between watermark detection and text distortion and derive optimal watermarking strategies by leveraging optimal transport and coding theory. Empirical evaluations show our watermarks achieve a superior detection-quality tradeoff across language generation and coding tasks. Finally, we evaluate autonomous LLM agents in multi-agent environments through the first simulator of a fully LLM-driven supply chain. LLM agents offer significant performance gains, outperforming human teams and reducing costs by up to 67%, but also introduce systemic risks, including costly tail events.

[LG-192] Learning predictive models for combinations of heterogeneous proteomic data sources

链接: https://arxiv.org/abs/2605.08958
作者: Michal Valko,Richard Pelikan,Miloš Hauskrecht
类目: Machine Learning (cs.LG)
*备注: Published in AMIA Summit on Translational Bioinformatics (STB 2008)

点击查看摘要

Abstract:Multiple technologies that measure expression levels of protein mixtures in the human body offer potential for detecting and understanding disease. The recent increase in these technologies prompts researchers to evaluate the individual and combined utility of data generated by the technologies. In this work, we study two data sources to measure the expression of protein mixtures in the human body: whole-sample MS profiling and multiplexed protein arrays. We investigate the individual and combined utility of these technologies by learning and testing a variety of classification models on the data from a pancreatic cancer study. We show that for the combination of these two (heterogeneous) datasets, classification models that work well on one of them individually fail on the combination of the two datasets. We study and propose a class of model fusion methods that acknowledge the differences and try to reap most of the benefits from their combination.

[LG-193] Outlier detection for patient monitoring and alerting

链接: https://arxiv.org/abs/2605.08955
作者: Miloš Hauskrecht,Iyad Batal,Michal Valko,Shyam Visweswaran,Gregory F. Cooper,Gilles Clermont
类目: Machine Learning (cs.LG)
*备注: Published at JBI 2013

点击查看摘要

Abstract:We develop and evaluate a data-driven approach for detecting unusual (anomalous) patient-management decisions using past patient cases stored in electronic health records (EHRs). Our hypothesis is that a patient-management decision that is unusual with respect to past patient care may be due to an error and that it is worthwhile to generate an alert if such a decision is encountered. We evaluate this hypothesis using data obtained from EHRs of 4486 post-cardiac surgical patients and a subset of 222 alerts generated from the data. We base the evaluation on the opinions of a panel of experts. The results of the study support our hypothesis that the outlier-based alerting can lead to promising true alert rates. We observed true alert rates that ranged from 25% to 66% for a variety of patient-management actions, with 66% corresponding to the strongest outliers.

[LG-194] Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning

链接: https://arxiv.org/abs/2605.08949
作者: Binghang Lu,Zheyuan Deng,Runyu Zhang,Bing Hu,Yunhan Zhao,Yuan Tian,Changhong Mou,Guang Lin,Xiaomin Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A central challenge in continual learning for large language models (LLMs) is catastrophic forgetting, where adapting to new tasks can substantially degrade performance on previously learned ones. Existing projection-based methods mitigate such interference by restricting parameter updates to subspaces that are orthogonal to directions associated with past tasks. However, these methods are typically formulated under Euclidean parameter geometry, with update magnitudes and projections governed by the Frobenius norm. The recent empirical success of the Muon optimizer, which applies orthogonalized matrix updates and admits a spectral-norm interpretation, suggests that Frobenius geometry may not be the most effective choice for matrix-valued LLM parameters. Motivated by this observation, we propose Muon-OGD, a spectral-norm-aware continual learning framework that integrates Muon-style operator-norm geometry with orthogonal projection constraints. Our method formulates each update as a spectral-norm-constrained optimization problem with linear non-interference constraints, and solves it efficiently through dual iterations and Newton–Schulz matrix-sign approximations. By applying orthogonalized momentum updates that avoid protected directions associated with prior tasks, Muon-OGD aims to improve the stability–plasticity trade-off in sequential LLM adaptation. We evaluate the proposed method on standard continual learning benchmarks, TRACE, and domain-specific Coding–Math–Medical curricula using both encoder–decoder and decoder-only architectures. Empirically, Muon-OGD consistently improves over sequential fine-tuning and competitive orthogonal-gradient baselines, while remaining computationally scalable. These results suggest that spectral-norm-aware update geometry provides a practical and effective alternative to Frobenius-norm projection for continual learning in LLMs.
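
The combination described, an orthogonal-gradient projection away from directions protected for earlier tasks followed by a Newton–Schulz orthogonalization of the projected momentum, can be sketched directly. The NumPy snippet below uses the simpler cubic Newton–Schulz iteration; the coefficients, iteration counts, and the exact placement of the projection relative to the paper's dual iterations are assumptions:

```python
import numpy as np

def newton_schulz_orth(G, iters=5):
    """Cubic Newton-Schulz iteration approximating the orthogonal polar
    factor U V^T of G (a simple stand-in for Muon-style matrix-sign steps)."""
    X = G / (np.linalg.norm(G) + 1e-12)      # scale so singular values < sqrt(3)
    for _ in range(iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def muon_ogd_step(W, momentum, grad, protected, lr=0.02, beta=0.95):
    """One sketched update: accumulate momentum, remove its components along
    `protected` (an orthonormal basis of input directions from prior tasks),
    then orthogonalize the remainder and take a step."""
    momentum = beta * momentum + grad
    M = momentum - momentum @ protected @ protected.T
    W = W - lr * newton_schulz_orth(M)
    return W, momentum
```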

[LG-195] A Single Deep Preference-Conditioned Policy for Learning Pareto Coverage Sets

链接: https://arxiv.org/abs/2605.08946
作者: Akihiro Kubo,Kosuke Nakanishi,Shin Ishii
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Preference-conditioned multi-objective reinforcement learning aims to learn a single policy that captures trade-offs across preferences, but under nonlinear scalarization the uniqueness and continuity of the preference-to-solution correspondence remain unclear. We study this problem in tabular multi-objective Markov decision processes (MDPs) using smooth Tchebycheff scalarization as a monotone utility. Under mild interior conditions on the preference set, we prove that each preference induces a unique Pareto-optimal return vector and that this vector depends Lipschitz-continuously on the preference, providing a principled foundation for preference sweeping toward dense Pareto-front coverage. To compute these targets, we formulate the problem over occupancy measures and derive Concave Mirror Descent Policy Iteration (CMDPI), which achieves an O(1/k) objective-suboptimality rate. We further show that each update is equivalent to solving a Kullback-Leibler-regularized MDP with the previous policy as reference, yielding a policy-iteration interpretation and finite-iterate policy continuity across preferences. We instantiate the update as a deep actor-critic algorithm preserving previous-policy regularization. On eight MO-Gymnasium tasks, it achieves the best average hypervolume rank among recent baselines and strong expected-utility performance. Continuous-control experiments indicate gains beyond the discrete-action setting.
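
The smooth Tchebycheff scalarization used as the monotone utility has a standard log-sum-exp form, shown below for reference; the sign convention (objectives treated as costs relative to a reference point) and the preference sweep are illustrative:

```python
import numpy as np

def smooth_tchebycheff(f, pref, z_ref, mu=0.1):
    """Log-sum-exp smoothing of max_i pref_i * (f_i - z_ref_i);
    as mu -> 0 this recovers the hard (non-smooth) Tchebycheff value."""
    t = pref * (np.asarray(f) - np.asarray(z_ref)) / mu
    t_max = np.max(t)
    return mu * (t_max + np.log(np.sum(np.exp(t - t_max))))   # numerically stable

# sweeping the preference vector over the simplex targets different Pareto points
f_values = np.array([0.8, 0.3])                 # two objective (cost) values
for lam in np.linspace(0.1, 0.9, 5):
    pref = np.array([lam, 1.0 - lam])
    print(pref, smooth_tchebycheff(f_values, pref, z_ref=np.zeros(2)))
```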

[LG-196] From Mechanistic to Compositional Interpretability

链接: https://arxiv.org/abs/2605.08934
作者: Ward Gauderis,Thomas Dooms,Steven T. Holmer,Kola Ayonrinde,Geraint A. Wiggins
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mechanistic interpretability aims to explain neural model behaviour by reverse-engineering learned computational structure into human-understandable components. Without a formal framework, however, mechanistic explanations cannot be objectively verified, compared, or composed. We introduce compositional interpretability, a category-theoretic framework grounded in the principles of compositionality and minimum description length. Compositional interpretations are pairs of syntactic and semantic mappings that must commute to enforce consistency between a model’s decomposition and its observed behaviour. We deconstruct explanation quality into measures of faithfulness and complexity to cast interpretability as a constrained optimisation problem, and introduce compressive refinement to systematically restructure models into simpler parts without altering their function. Finally, we prove a parsimony criterion under which syntactic compression theoretically guarantees more concise, human-aligned explanations. Our framework situates prominent mechanistic methods as subclasses of refinement, and clarifies why their compressibility heuristics tend to align with human interpretability. Our work provides a measurable, optimisable foundation for automating the discovery and evaluation of mechanistic explanations.

[LG-197] When and Why Grouping Attention Heads Accelerates Muon Optimization

链接: https://arxiv.org/abs/2605.08933
作者: Hongtao Zhang,Wenjie Zhou,Wei Chen,Xueqi Cheng
类目: Machine Learning (cs.LG)
*备注: 16 pages, 4 figures

点击查看摘要

Abstract:Muon orthogonalizes matrix updates, but multi-head attention naturally operates at the level of heads. This granularity mismatch raises the question of whether Muon should be applied to the full attention projection, to individual heads, or to intermediate head groups. We study this question through a one-step descent comparison between full-matrix Muon and group-wise Muon. Our analysis reveals a trade-off between the \textbf{group-wise whitening gain} from group-wise updates and the \textbf{grouping-induced norm cost}, an additional update-norm cost caused by replacing full-matrix whitening with group-wise whitening. Motivated by this trade-off, we propose \textbf{Group Muon}, which treats head group size and grouping rule as optimizer hyperparameters. On GPT-2 Small trained on FineWeb, appropriate grouping improves validation loss over both full-QKV Muon and fully head-wise MuonSplit.
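
The grouping itself is easy to picture: the attention projection's gradient (or momentum) is split along the head dimension into blocks, and each block is whitened on its own. A small NumPy sketch follows, using an exact SVD-based polar factor as the whitening step for brevity; how Q/K/V and momentum are handled in the actual optimizer is simplified away:

```python
import numpy as np

def orthogonalize(M):
    """Polar factor U V^T: flattens the block's singular values to 1."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def group_muon_update(G, n_heads, group_size):
    """Whiten each head group's block of G (rows ordered head-by-head) on its
    own. group_size = n_heads recovers full-matrix whitening; group_size = 1
    is fully head-wise."""
    d_head = G.shape[0] // n_heads
    blocks = []
    for start in range(0, n_heads, group_size):
        rows = slice(start * d_head, (start + group_size) * d_head)
        blocks.append(orthogonalize(G[rows]))
    return np.vstack(blocks)

G = np.random.default_rng(0).standard_normal((12 * 64, 768))   # 12 heads of width 64
update = group_muon_update(G, n_heads=12, group_size=4)
```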

[LG-198] Physics-Informed Neural PDE Solvers via Spatio-Temporal MeanFlow

链接: https://arxiv.org/abs/2605.08915
作者: Hanru Bai,Yuncheng Zhou,Difan Zou
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning paradigms, such as PINNs and neural operators, have significantly advanced the solving of PDEs. However, they often struggle to capture the continuous integral nature of physical systems, relying either on pointwise residuals that ignore the integral perspective or on pre-discretized temporal grids. Drawing inspiration from MeanFlow, a continuous-time integrator recently developed to efficiently solve generative ODEs, we introduce Spatio-Temporal MeanFlow, which functions as a novel PDE solver learning the finite-interval evolution of physical states. By substituting the generative velocity field with the physical PDE operator, we transform multi-step numerical integration into an efficient prediction with a freely controllable integration length. Crucially, we extend the original MeanFlow constraint from the temporal to the spatio-temporal domain, coupling time evolution with spatial consistency. This yields a unified framework naturally accommodating both time-dependent and stationary PDEs. Comprehensive experiments on benchmarks demonstrate that our approach achieves superior accuracy and inference efficiency over representative baselines. Furthermore, the proposed integral constraint enables excellent generalization to out-of-distribution initial conditions and varying spatial resolutions.

[LG-199] Enhancing Adversarial Robustness in Network Intrusion Detection: A Layer-wise Adaptive Regularization Approach

链接: https://arxiv.org/abs/2605.08910
作者: Hira Nasir,Eiman Javed,Balawal Shabir,Zunera Jalil,Ahmad Mohsin
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The new wave of adversarial attacks that utilize gradient-related vulnerabilities in neural network-based classifiers makes Network Intrusion Detection Systems more open to such threats. Although state-of-the-art adversarial training methods have shown promising results in producing more robust classifiers, their interpretability and defense ability are limited due to their lack of understanding of how adversarial attacks propagate in different layers of network classifiers. In this paper, we present an insightful approach, called LARAR (Layer-wise Adversarial Robustness using Adaptive Regularization), that incorporates additional layer-wise vulnerability analysis and adaptive weighting in conventional adversarial training methods. Additionally, we utilize ‘Auxiliary Classifiers’ in our approach. LARAR provides interpretable layer-wise vulnerability scores, achieves a clean accuracy of 95.01%, and provides better robustness against adversarial attacks (FGSM, PGD, and transfer attacks) on the UNSW-NB15 dataset. Through the identification of vulnerable layers, the proposed framework reduces computational complexity and enables the early detection of adversarial samples, thus enhancing the effectiveness and interpretability of adversarial defense mechanisms in NIDS.

[LG-200] Bilinear autoencoders find interpretable manifolds

链接: https://arxiv.org/abs/2605.08891
作者: Thomas Dooms,Ward Gauderis,Geraint Wiggins,Jose Oramas
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sparse autoencoders have become a standard tool for uncovering interpretable latent representations in neural networks. Yet salient concepts often span manifolds that current linear methods cannot capture without post hoc analysis. This paper uses quadratic latents to close this gap: we implement these with bilinear autoencoders, which decompose activations into low-rank quadratic forms, compose linearly in weight space, and admit input-independent geometric analysis. This qualitative difference in what concepts quadratic latents can detect challenges the standard linear representation hypothesis. Our experiments and visualisations show that multi-dimensional geometries are highly prevalent and that composite latents capture them well, systematically improving reconstruction error in language models. Furthermore, we show that autoencoders with varying geometric priors recover the same input subspace despite their dictionary entries being distinct. Practically, these models serve as an unsupervised tool for manifold discovery, which we demonstrate through an interactive online visualizer for Qwen 3.5. This is a step toward nonlinear but mathematically tractable latent representations whose composition is expressive and interpretable by design.

[LG-201] Compact SO(3) Equivariant Atomistic Foundation Models via Structural Pruning

链接: https://arxiv.org/abs/2605.08885
作者: Chen Wang,Siyu Hu,Guangming Tan,Weile Jia
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:SO(3) equivariant graph neural networks have become the dominant paradigm for atomistic foundation models, achieving high accuracy and data efficiency by building rotational symmetry directly into the architecture. Yet the computational cost of their higher-order tensor operations creates a tough trade-off between model accuracy and inference efficiency. In this paper, we propose a structural pruning method for SO(3) equivariant atomistic foundation models to bridge this accuracy-efficiency gap. The pruning is applied along the channel and order dimensions, with each irreducible representation kept or removed as a complete block, thereby retaining SO(3) equivariance. Starting from a large checkpoint, the pruned model substantially reduces the inference cost while retaining higher accuracy than an independently trained small model. The pruned MACE-MP model outperforms the official from-scratch trained small model on 7 of 9 metrics on the Matbench Discovery leaderboard. In terms of efficiency, compressed MACE-MP and MACE-OFF models contain 1.5 \times to 4 \times fewer parameters and require 2.5 \times to 4 \times less pre-training compute than training a small model from scratch. For downstream applications, fine-tuning the pruned model reduces energy and force errors by 70.1% and 34.4% compared to training task-specific models from scratch across eight representative downstream datasets. We demonstrate that the method generalizes to other SO(3) equivariant architectures (SevenNet, eSCN) and can be combined with quantization and knowledge distillation for further gains.

[LG-202] Discrete Flow Matching: Convergence Guarantees Under Minimal Assumptions

链接: https://arxiv.org/abs/2605.08882
作者: Le-Tuyet-Nhi Pham,Giovanni Conforti,Zhenjie Ren,Alain Durmus
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Flow Matching has recently emerged as a popular class of generative models for simulating a target distribution \mu_1 from samples drawn from a source distribution \mu_0 . This framework relies on a fixed coupling between \mu_0 and \mu_1 , and on a deterministic or stochastic bridge to define an interpolating process between the two distributions. The time marginals of this process can then be approximately sampled by estimating the transition rates, or more generally the generator, of its Markovian projection. This framework has recently been extended to the case of discrete source and target distributions, under the name Discrete Flow Matching (DFM). However, theoretical guarantees for such models remain scarce. In this paper, we study two DFM models on \mathbb{Z}_m^d = \{0,\ldots,m-1\}^d , sampled through time discretization, and derive non-asymptotic associated bounds for both of them. In contrast to previous work, we establish non-asymptotic bounds in Kullback–Leibler divergence for the early-stopped version of the target distribution. We also derive explicit convergence guarantees in total variation distance with respect to the true target distribution. Importantly, these bounds rely only on an approximation error assumption, relaxing standard score assumptions used in earlier works, while also yielding improved dependence on the vocabulary size m and the dimension d .

[LG-203] OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents ICML2026

链接: https://arxiv.org/abs/2605.08876
作者: Xinyu Li,Ronghui Mu,Lin Li,Tianjin Huang,Gaojie Jin
类目: Machine Learning (cs.LG)
*备注: Accepted to ICML 2026

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly deployed as autonomous agents that execute tool-augmented, multi-step tasks, where latency is a critical factor for real-world applications. Yet an overlooked threat is Reasoning-Level Denial-of-Service (R-DoS), in which an attacker preserves task correctness but degrades availability by inflating an agent’s reasoning depth or tool-use budget. We introduce OTora, the first unified, two-stage red-teaming framework for instantiating R-DoS attacks. Stage I optimizes an adversarial trigger that induces targeted tool invocations using insertion-aware scoring and dynamic target co-evolution, supporting both black-box and white-box settings. Stage II generates agent-aware reasoning payloads via an ICL-guided genetic search that amplifies overthinking while maintaining correct task outcomes. Across WebShop, Email, and OS agents built on multiple backbone models such as LLaMA-70B and GPT-OSS-120B, OTora achieves up to 10 times increases in reasoning tokens and order-of-magnitude latency slowdowns, all while preserving near-baseline task accuracy. Finally, we discuss mitigation strategies for detecting and constraining abnormal reasoning and latency spikes. The code is available at this https URL.

[LG-204] CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization

链接: https://arxiv.org/abs/2605.08873
作者: Soo Min Kwon,Ziteng Sun,Ananda Theertha Suresh,Himanshu Jain,Sanjiv Kumar
类目: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Group Relative Policy Optimization (GRPO) has emerged as a powerful algorithm for improving the reasoning capabilities of language models, but often fails to improve small models due to sparse rewards on difficult tasks. Existing works mitigate this issue by leveraging a larger model, either to provide hints for rollouts or to provide dense reward signals through knowledge distillation (KD). However, this assumes the existence of such an oracle, and training one can significantly increase total training time. In this work, we propose CoDistill-GRPO, a co-distillation algorithm that simultaneously trains a large and a small model by maximizing carefully designed GRPO objectives. The two models learn from each other: the small model uses an on-policy KD reward to learn from the large model’s distribution, while the large model is updated using rollouts generated by the small model with importance reweighting, reducing the computational overhead of rollout generation. We show that CoDistill-GRPO substantially improves small model performance over standard GRPO on mathematical benchmarks across both Qwen and Llama models. Specifically, with Qwen2.5-Math-1.5B, we observe an accuracy increase of over 11.6 percentage points over the base model and an additional 6.0 percentage points over GRPO on the Minerva dataset. Interestingly, the larger model (Qwen2.5-Math-7B) trained with CoDistill-GRPO nearly matches standard GRPO performance despite training on small-model rollouts. This highlights CoDistill-GRPO as a cost-effective alternative to GRPO for larger models, yielding an approximate 18% speedup, which may be of independent interest.

[LG-205] TopoGeoScore: A Self-Supervised Source-Only Geometric Framework for OOD Checkpoint Selection

链接: https://arxiv.org/abs/2605.08870
作者: Farid Hazratian,Ali Zia,Hien Duy Nguyen
类目: Machine Learning (cs.LG); Algebraic Topology (math.AT); Differential Geometry (math.DG)
*备注:

点击查看摘要

Abstract:Out-of-distribution (OOD) robustness is difficult to diagnose when target-domain labels are unavailable. We consider a more restrictive source-only variant of unsupervised accuracy estimation: selecting robust checkpoints using only source-domain representations, with no target samples or target labels. We propose \textbf{TopoGeoScore}, a source-only geometric scorer for label-free OOD checkpoint selection. Given a trained checkpoint, we construct class-conditional mutual k -nearest-neighbour graphs from source embeddings and extract three interpretable signals: a torsion-inspired reduced Laplacian log-determinant for global class-manifold complexity, Ollivier–Ricci curvature for local neighbourhood regularity, and higher-order topological summaries for fragmented connectivity, loops, and global–local inconsistency. Instead of fixing their weights by hand, TopoGeoScore learns a non-negative linear score through a self-supervised objective that enforces invariance under approximately geometry-preserving embedding views and separation from structure-breaking views. The score remains interpretable and uses no target-domain samples or labels. Results across CIFAR-based corruption and distribution-shift benchmarks, ImageNet-C, MNLI \to HANS transfer, and OGBN-Arxiv suggest that source representations contain measurable global–local–topological evidence of robustness, supporting practical checkpoint selection before deployment under distribution shift.
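
Two of the three signals, the mutual k-NN graph and the reduced-Laplacian log-determinant, are compact enough to sketch. The NumPy example below is illustrative only: k, the brute-force distance computation, and the handling of disconnected graphs are assumptions, and the full score also uses curvature, topological summaries, and a learned weighting:

```python
import numpy as np

def mutual_knn_adjacency(X, k=10):
    """Mutual k-NN graph: an edge i~j exists only if each point is among the
    other's k nearest neighbours."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]
    A = np.zeros_like(d, dtype=bool)
    rows = np.repeat(np.arange(len(X)), k)
    A[rows, nn.ravel()] = True
    return (A & A.T).astype(float)

def reduced_laplacian_logdet(A):
    """Log-determinant of the reduced graph Laplacian (one node grounded);
    a global connectivity/complexity summary of the class manifold.
    Disconnected graphs have a zero determinant, returned here as -inf."""
    L = np.diag(A.sum(1)) - A
    sign, logdet = np.linalg.slogdet(L[1:, 1:])
    return logdet if sign > 0 else -np.inf

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))           # placeholder source embeddings
y = rng.integers(0, 4, 200)                  # placeholder class labels
scores = {c: reduced_laplacian_logdet(mutual_knn_adjacency(X[y == c])) for c in np.unique(y)}
```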

[LG-206] Higher-Order Equilibrium Tracking for EM-Compressible Online Estimation

链接: https://arxiv.org/abs/2605.08864
作者: ZhiMing Li,Yue Song
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 41 pages, 6 figures

点击查看摘要

Abstract:We study online estimation in latent-variable models by recasting the problem as tracking a moving empirical equilibrium. Standard online EM and stochastic approximation analyses primarily study convergence toward the population parameter and typically do not isolate the empirical batch optimum from the online tracking error at finite horizon. Our framework decomposes the online estimate into the frozen batch equilibrium at the current running statistic and a tracking lag that captures the algorithm’s delay behind this moving target. We prove a batch-to-online transfer theorem: provided \lVert e_T \rVert_{L^2} = o(T^{-1/2}) , the online estimator inherits the batch central limit theorem and the sharp first-order risk constant. Our key observation is that the empirical optimum evolves on a smooth equilibrium manifold indexed by the running statistic. An m -th order equilibrium-jet predictor combined with an order- \nu frozen corrector yields localized tracking rates O(T^{-\nu(m+1)}) . We formalize EM-compressibility and EM-jet ^R -compressibility as the structural conditions that make the equilibrium response and the Newton corrector evaluable from a retained streaming statistic. The theory is instantiated in latent linear Gaussian covariance estimation, where the first-order scheme operates on a compressed d \times d statistic with explicit finite-sample risk envelopes and a certified restart rule.

[LG-207] RareCP: Regime-Aware Retrieval for Efficient Conformal Prediction

链接: https://arxiv.org/abs/2605.08857
作者: Manuel Heurich,Maximilian Granz,Tim Landgraf
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in uncertainty quantification for time series forecasting show that conformal prediction can provide reliable prediction intervals, yet standard conformal methods are often inefficient under temporal dependence, drift, and heterogeneous error behavior. Existing methods typically either update miscoverage rates over time or learn unconstrained calibration weights, without explicitly separating two central sources of nonstationarity: smoothly drifting error distributions and co-existing distinct error regimes. We introduce RareCP, a regime-aware retrieval method for adaptive conformal time series prediction. RareCP learns local calibration representations through a mixture of cosine-attention experts that each capture distinct error regimes, while a compact hypernetwork adapts the kernel parameters to track temporal drift. Given a new forecasting context, RareCP retrieves the top-k most relevant calibration examples, assigns similarity weights, and forms a weighted conformal quantile over their signed residuals, yielding asymmetric prediction intervals. The adaptive kernel is trained using a smooth interval score objective, with a parameter-space anchor to a lightweight teacher kernel to preserve stable local representations. On the GIFT-Eval benchmark, RareCP improves interval efficiency over recent conformal baselines and foundation model uncertainty estimates while maintaining empirical coverage. Ablations confirm that regime-specific experts, drift-adaptive kernels, sparse retrieval, and teacher anchoring each contribute to the final performance.
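
The final interval construction, a similarity-weighted conformal quantile over retrieved calibration residuals, is simple to sketch once the retrieval kernel is fixed. The NumPy snippet below uses a plain cosine kernel with a softmax weighting as a stand-in for the learned mixture-of-experts kernel; k, the temperature, and the feature/residual arrays are placeholders:

```python
import numpy as np

def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative normalized weight reaches q."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w) / w.sum()
    return v[np.searchsorted(cum, q)]

def retrieval_interval(query, cal_feats, cal_resid, point_pred, k=50, alpha=0.1, temp=0.1):
    """Retrieve the k most similar calibration contexts, weight them by a
    softmax over cosine similarity, and take weighted quantiles of their
    signed residuals to form an asymmetric prediction interval."""
    sims = cal_feats @ query / (np.linalg.norm(cal_feats, axis=1) * np.linalg.norm(query) + 1e-12)
    top = np.argsort(sims)[-k:]
    w = np.exp((sims[top] - sims[top].max()) / temp)
    lo = weighted_quantile(cal_resid[top], w, alpha / 2)
    hi = weighted_quantile(cal_resid[top], w, 1 - alpha / 2)
    return point_pred + lo, point_pred + hi
```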

[LG-208] Controlling Transient Amplification Improves Long-horizon Rollouts

链接: https://arxiv.org/abs/2605.08856
作者: Adeel Pervez,Francesco Locatello
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Autoregressive neural simulators now match classical solvers on short-horizon prediction of physical systems, yet their accuracy degrades rapidly when rolled out over long horizons. In this work, we identify transient amplification of perturbations around rollout trajectories as a structural mechanism driving rollout error. Using a linearization analysis we show that when the Jacobians along an autoregressive trajectory are non-normal and non-commuting, the model amplifies errors transiently, resulting in model rollout drift even when the overall system is asymptotically stable. Building on the analysis, we propose commutativity regularization: a combination of two penalties designed to reduce the normality defect of individual Jacobians and the commutator norm of Jacobians across steps. The penalties are estimated with Jacobian-vector products and have no inference-time cost. We show a propagator bound that quantifies rollout error under approximate commutativity and normality. We evaluate UNet and FNO variants with commutativity regularization on 1D and 2D spatio-temporal data in synthetic and real settings, showing successful long-horizon rollouts over thousands of steps. Further, we show that the method improves FourCastNet climate forecasts on ERA5 without using any new data. The gain is most pronounced out-of-distribution: trained on trajectories of a few hundred steps, regularized models remain in-distribution for thousands of rollout steps on initial conditions where baselines diverge.
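
The commutator part of the proposed penalty only needs Jacobian-vector products, so it can be estimated with random probes and no explicit Jacobians. The PyTorch sketch below shows that piece alone; the normality-defect term is omitted, the probe count and tiny stand-in model are assumptions, and the rollout states are detached so gradients flow only through the Jacobians:

```python
import torch
from torch.autograd.functional import jvp

def commutator_penalty(step, u0, n_probes=2):
    """Probe-based estimate of || J1 J0 v - J0 J1 v ||^2, where J0 and J1 are
    Jacobians of the one-step model at u0 and at u1 = step(u0)."""
    u1 = step(u0).detach()
    pen = 0.0
    for _ in range(n_probes):
        v = torch.randn_like(u0)
        _, j0v = jvp(step, (u0,), (v,), create_graph=True)
        _, j1j0v = jvp(step, (u1,), (j0v,), create_graph=True)
        _, j1v = jvp(step, (u1,), (v,), create_graph=True)
        _, j0j1v = jvp(step, (u0,), (j1v,), create_graph=True)
        pen = pen + ((j1j0v - j0j1v) ** 2).mean()
    return pen / n_probes

model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.Tanh(), torch.nn.Linear(64, 32))
u0 = torch.randn(8, 32)
loss = commutator_penalty(lambda u: model(u), u0)
loss.backward()   # added to the usual rollout loss during training
```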

[LG-209] Inpainting physics: self-supervised learning for context-driven fluid simulation

链接: https://arxiv.org/abs/2605.08832
作者: Jonas Weidner,Yeray Martin-Ruisanchez,Daniel Rückert,Benedikt Wiestler,Julian Suk
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:Neural surrogate models for computational fluid dynamics (CFD) are typically trained as forward operators that map explicit problem specifications, such as geometry and boundary conditions, to solution fields. This ties the model to the conditioning variables seen during training and limits reuse under boundary-condition shifts or local geometry changes. We propose to reformulate steady CFD inference as an inpainting problem: instead of training on explicit boundary conditions, we learn a self-supervised prior over velocity fields and impose boundary constraints only during inference by fixing known regions such as inlet, outlet or unchanged regions from previous simulations. To scale this idea to large 3D meshes, we introduce a local neighbourhood tokeniser that represents high-resolution velocity fields as compact spatial latent tokens and train latent flow-matching and masked-autoencoder models on these tokens. On intracranial aneurysm hemodynamics, our method reconstructs full velocity fields from sparse boundary context, outperforms supervised neural surrogates under boundary-condition and dataset shift and enables local geometry editing by reusing unchanged simulation context. These results suggest that viewing CFD inference as context-conditioned inpainting can turn neural surrogates from task-specific predictors into reusable flow priors.

[LG-210] MicroFuse: Protein-to-Genome Expert Fusion for Microbial Operon Reasoning

链接: https://arxiv.org/abs/2605.08815
作者: Seungik Cho
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM); Genomics (q-bio.GN); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Predicting microbial operon co-membership requires integrating two complementary biological signals: protein-scale molecular identity and genome-context organization. While recent biological foundation models provide powerful representations of each view independently, naive concatenation of these modalities ignores a key biological property – protein identity and genomic context may agree when adjacent genes form a coherent functional module, or conflict when sequence similarity is misleading but genomic layout indicates independent regulation. We present MicroFuse, a protein-to-genome expert fusion framework that integrates structure-aware protein representations from ProstT5 with genome-context representations from Bacformer through a four-expert Mixture-of-Experts module (protein, genome-context, agreement, and conflict experts) with a learned soft router. Training combines binary cross-entropy with symmetric cross-modal InfoNCE alignment and disagreement-weighted supervised contrastive shaping. We further construct OG-Operon100K, a 100,000-pair scaffold-level benchmark from the OMG metagenomic corpus with biologically grounded positive and negative criteria. On OG-Operon100K, MicroFuse achieves the strongest AUROC, AUPRC, mAP, and mAR among ProstT5-only, Bacformer-only, and Concat MLP baselines. Ablations identify cross-modal contrastive alignment as the dominant component, and a hard sequence-conflict subset reveals MicroFuse’s largest gains precisely in biologically ambiguous cases where protein identity alone is misleading.

[LG-211] AgentSlimming: Towards Efficient and Cost-Aware Multi-Agent Systems

链接: https://arxiv.org/abs/2605.08813
作者: Yulang Chen,Haoxuan Peng,Jinyan Liu,Zichen Wen,Dongrui Liu,Linfeng Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Model-based Multi-Agent Systems (MAS) have demonstrated remarkable capabilities in complex tasks. However, manually designing optimal communication topologies is labor-intensive, while automated expansion methods often result in bloated structures with redundant agents, leading to excessive token consumption. To address this problem, we introduce AgentSlimming, a plug-and-play compression framework for graph-structured multi-agent workflows. Motivated by pruning and quantization in neural networks, AgentSlimming compresses workflows by first estimating the importance score of each agent with a hybrid mechanism and then removing redundant agents or replacing them with low-cost ones, where each operation is validated using a baseline-anchored acceptance rule to prevent performance collapse. Experiments show that AgentSlimming reduces average token cost by up to 78.9% with negligible performance degradation, and sometimes even improves accuracy, achieving a strong Pareto-optimal trade-off between cost and quality. Our code is publicly available at this https URL
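
The pruning loop sketched below illustrates the baseline-anchored acceptance rule described in the abstract: agents are ranked by an importance score, and a removal is kept only if a validation metric stays within a tolerance of the original workflow. The `workflow`, `eval_fn`, and `importance_fn` interfaces are hypothetical placeholders, not the released API.

```python
import copy

def slim_workflow(workflow, eval_fn, importance_fn, tol=0.01):
    """Greedy compression sketch: try removing the least-important agents first,
    accepting each removal only if performance stays within `tol` of the baseline."""
    baseline = eval_fn(workflow)                     # anchor score of the full workflow
    for agent in sorted(workflow.agents, key=importance_fn):
        candidate = copy.deepcopy(workflow)
        candidate.remove(agent)
        if eval_fn(candidate) >= baseline - tol:     # baseline-anchored acceptance rule
            workflow = candidate                     # keep the slimmer workflow
    return workflow
```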

[LG-212] Data-driven transport modelling without overfit

链接: https://arxiv.org/abs/2605.08801
作者: Peter Vanya,Katarína Šimková,Rastislav Farkaš
类目: Machine Learning (cs.LG)
*备注: 6 pages, 6 figures

点击查看摘要

Abstract:Macroscopic transport modelling aims to predict traffic flows after proposed public policy interventions, such as a new road or railway section or a temporary road closure. As such, it is a vital step in infrastructure planning and development. Traditionally, building a transport model has relied on complex understanding of socio-economic characteristics of the population requiring expensive data collection via surveys, which are prone to biases. Previous numerical frameworks to optimize transport models to fit observed traffic flows are not easily-interpretable and can lead to overfit. We present here an alternative: a data-driven modelling protocol with objective function based on traffic counts, which can be nowadays cheaply and reliably obtained; explainable model weights; and a controlled path to increase model complexity and accuracy. We demonstrate our approach on several toy and realistic examples, and suggest ways to generalize to multimodal systems including public transport.

[LG-213] PRIM: Meta-Learned Bayesian Root Cause Analysis

链接: https://arxiv.org/abs/2605.08786
作者: Christopher Lohse,Anish Dhir,Amadou Ba,Bradley Eck,Marco Ruffini,Jonas Wahl
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Root cause analysis (RCA) in complex systems is challenging due to error propagation across multiple variables, the need for structural causal knowledge, and the computational cost of inference at test time. We introduce PRIM (Prior-fitted Root cause Identification with Meta-learning), a causal meta-learning approach that frames RCA as a Bayesian inference task over a synthetic prior of causal models. By marginalising out structural uncertainty, PRIM implicitly identifies changes in the data-generating mechanism between baseline and anomalous periods. In doing so, PRIM infers distributional differences without explicit statistical testing, and implicitly learns causal structure without model fitting at test time. Following the simulation-based meta-learning paradigm of prior-fitted networks, PRIM uses a Model-Averaged Causal Estimation (MACE) transformer neural process that jointly attends over observational and anomalous samples and the causal structure of nodes, enabling zero-shot inference in 17 ms for systems with up to 100 variables. Across synthetic benchmarks and two realistic benchmark datasets, PetShop and CausRCA, PRIM is competitive with methods that are aware of the system’s causal graphical structure a priori while outperforming graph-unaware methods on several tasks. Lightweight fine-tuning to specific domains and data dynamics improves performance further.

[LG-214] ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation

链接: https://arxiv.org/abs/2605.08774
作者: Youhe Feng,Hansen Shi,Haoyang Li,Xinlei Guo,Yang Wang,Chengyang Zhang,Jinkai Zhang,Xiaohan Zhang,Jie Tang,Jing Zhang
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Long-horizon robotic manipulation requires dense feedback that reflects how a task advances through its procedural stages, not merely whether the final outcome is successful. Existing reward models often rely on trajectory-level success labels or time-based interpolation, which can conflate elapsed time with true task progress and therefore fail to capture unfinished steps, stagnation, and failure states. We present ProcVLM, a progress-aware vision-language model that learns procedure-grounded progress as a dense reward signal for manipulation. Rather than deriving progress from terminal outcomes or temporal proxies, ProcVLM grounds progress estimation in procedural structure and intra-stage visual change, and further adopts a reasoning-before-estimation paradigm that infers the remaining atomic actions before estimating task progress. Specifically, we construct this supervision by synthesizing frame-level subtask-semantic annotations, assigning progress budgets according to subtask structure, and distributing each budget based on intra-subtask visual change. To train ProcVLM at scale, we build a standardized procedural supervision synthesis pipeline and construct ProcCorpus-60M from 30 embodied datasets with 60M annotated frames, from which we derive ProcVQA for procedure-aware pretraining, with progress estimation as the central task alongside action segmentation and future planning. Experiments on ProcVQA and reward-model benchmarks show that ProcVLM improves embodied procedural reasoning and yields more discriminative trajectory-internal progress estimates than representative baselines, supporting its use as a dense reward model for downstream reward-guided policy optimization. Project page: this https URL

[LG-215] Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

链接: https://arxiv.org/abs/2605.08762
作者: Tao Yu,yiming ding,Shenghua Chai,Minghui Zhang,Zhongtian Luo,Xinming Wang,Xinlong Chen,Zhaolu Kang,Junhao Gong,Yuxuan Zhou,Haopeng Jin,Zhiqing Cui,Jiabing Yang,YiFan Zhang,Hongzhu Yi,Zheqi He,Xi Yang,Yan Huang,Liang Wang
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注: 43 pages

点击查看摘要

Abstract:Current omni-modal benchmarks mainly evaluate models under settings where multiple modalities are provided simultaneously, while the ability to start from audio alone and actively search for cross-modal evidence remains underexplored. In this paper, we introduce Omni-DeepSearch, a benchmark for audio-driven omni-modal deep search. Given one or more audio clips and a related question, models must infer useful clues from audio, invoke text, image, and video search tools, and perform multi-hop reasoning to produce a short, objective, and verifiable answer. Omni-DeepSearch contains 640 samples across 15 fine-grained categories, covering four retrieval target modalities and four audio content types. A multi-stage filtering pipeline ensures audio dependence, retrieval necessity, visual modality necessity, and answer uniqueness. Experiments on recent closed-source and open-source omni-modal models show that this task remains highly challenging: the strongest evaluated model, Gemini-3-Pro, achieves only 43.44% average accuracy. Further analyses illustrate key bottlenecks in audio entity inference, query formulation, tool-use reliability, multi-hop retrieval, and cross-modal verification. These results highlight audio-driven omni-modal deep search as an important and underexplored direction for future multimodal agents.

[LG-216] FedGMI: Generative Model-Driven Federated Learning for Probabilistic Mixture Inference

链接: https://arxiv.org/abs/2605.08760
作者: Qijun Hou,Yuchen Shi,Pingyi Fan,Khaled B. Letaief
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) facilitates collaborative model training across decentralized clients while preserving data privacy by avoiding raw data exchange. Despite its potential, FL performance is often compromised by data heterogeneity across clients. To address this, Clustered Federated Learning (CFL) groups clients with similar data distributions to improve model performance, but is constrained by intra-cluster heterogeneity. Conversely, Personalized Federated Learning (PFL) tailors models to individual clients, but usually neglects the underlying structural similarities among clients. In this work, we investigate a probabilistic mixture (PM) scenario, where each client’s local data distribution is modeled as a convex combination of several shared inherent distributions. To effectively model this structure, we propose FedGMI, a framework that utilizes Variational Autoencoders (VAEs) as generative density estimators to represent these inherent distributions and infer the mixture components of clients’ local data distributions. This approach enables structured personalization without sacrificing the benefits of collaborative learning. Extensive experiments demonstrate that FedGMI effectively characterizes and discriminates the inherent distributions and accurately estimates mixture proportions. Furthermore, FedGMI maintains robust performance even under communication cost constraints.

[LG-217] MDL-GBG: A Non-parametric and Interpretable Granular-Ball Generation Method for Clustering

链接: https://arxiv.org/abs/2605.08759
作者: Zeqiang Xian,Caihui Liu,Yong Zhang,Wenjing Qiu,Duoqian Miao,Witold Pedrycz
类目: Machine Learning (cs.LG)
*备注: 12 pages, 5 figures

点击查看摘要

Abstract:Existing granular-ball generation methods are still mainly driven by handcrafted quality measures and heuristic splitting or stopping criteria, which weakens the transparency of local generation decisions in clustering. To address this issue, this paper proposes Minimum Description Length based Granular-Ball Generation (MDL-GBG), a non-parametric and interpretable granular-ball generation method for clustering. MDL-GBG reformulates granular-ball generation as a local model selection problem under the Minimum Description Length principle. For each granular ball, three candidate explanations are compared, namely a single-ball model, a two-ball model, and a core-ball-plus-residual model, and the model with the shortest description length is selected. In this way, ball retention, splitting, and residual peeling are unified within a common coding-theoretic framework. A residual reassignment mechanism is further introduced to globally re-evaluate peeled-off boundary samples after stable granular-balls are formed. Experiments on 20 UCI datasets show that the stable granular-balls generated by MDL-GBG provide a highly competitive upstream representation for clustering, with MDL-GBG+AC achieving the best overall average ranks in ARI, ACC, and NMI among the compared methods. These results demonstrate that MDL-GBG offers an effective and interpretable alternative to conventional heuristic granular-ball generation strategies.
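
As a rough illustration of the description-length comparison described above, the sketch below scores each candidate explanation of a ball with a simple two-part Gaussian code and keeps the shortest one. The coding scheme, the `split_fn`/`peel_fn` callbacks, and the constants are assumptions made for illustration; the paper's actual code-length terms are not reproduced here.

```python
import numpy as np

def gaussian_code_length(points):
    """Toy two-part code length (in nats) for a point set; a simple
    stand-in for the paper's description-length terms."""
    if len(points) < 2:
        return float("inf")                 # disallow degenerate explanations here
    n, d = points.shape
    var = points.var(axis=0).mean() + 1e-9
    nll = 0.5 * n * d * (np.log(2 * np.pi * var) + 1.0)   # data given the ball
    return nll + 0.5 * (d + 1) * np.log(n)                 # model cost (center + radius)

def select_ball_model(points, split_fn, peel_fn):
    """Compare single-ball, two-ball, and core-plus-residual explanations and
    return the one with the shortest total description length."""
    single = gaussian_code_length(points)
    left, right = split_fn(points)          # e.g. a 2-means style split (assumed)
    two = gaussian_code_length(left) + gaussian_code_length(right)
    core, residual = peel_fn(points)        # e.g. distance-based peeling (assumed)
    peel = gaussian_code_length(core) + gaussian_code_length(residual)
    choices = {"keep": single, "split": two, "peel": peel}
    return min(choices, key=choices.get)
```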

[LG-218] LAQuant: A Simple Overhead-free Large Reasoning Model Quantization by Layer-wise Lookahead Loss

链接: https://arxiv.org/abs/2605.08755
作者: Euntae Choi,Sumin Song,Sungjoo Yoo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large reasoning models (LRMs) reach competition-level math and coding accuracy via long autoregressive decoding, making per-token decoding cost a primary deployment concern. Weight quantization is the standard tool for acceleration, but representative recipes – including state-of-the-art end-to-end (E2E) QAT – lose accuracy on long-decoding reasoning benchmarks despite preserving perplexity and short-decode accuracy. Through a systematic gradient-direction analysis, we identify two factors driving this gap: (i) KV-cache fidelity preservation under the QAT loss, which E2E supervision attenuates via the softmax Fisher metric; and (ii) Hessian-subspace alignment between calibration data and the deployment distribution. We propose LookAhead Quantization (LAQuant), a layer-wise weight-only QAT method that addresses both factors without online-transform overhead by combining reasoning-domain calibration with a one-layer lookahead loss whose implicit cross-layer co-adaptation preserves the next-layer residual stream. For Qwen3-4B under W3G128 quantization, LAQuant improves AIME25 Pass@1 over ParoQuant by 15.11pp (1.93pp over ParoQuant++ at matched calibration) while achieving a 3.42x decoding speedup over FP16 on RTX A6000, compared with ParoQuant’s 3.01x.

[LG-219] The Wristband Gaussian Loss: Deterministic Composable Latents via a Sphere-Interval Decomposition

链接: https://arxiv.org/abs/2605.08749
作者: Mikhail Parakhin,André M. Carvalho,Patrick Haluptzok
类目: Machine Learning (cs.LG)
*备注: preprint

点击查看摘要

Abstract:We present the Wristband Gaussian Loss, a deterministic batch loss for Gaussianizing point embeddings without sampling, KL terms, or iterative transport. Each x \in \mathbb{R}^d is mapped to a direction u = x/|x| and a CDF-transformed radius t = F_{\chi^2_d}(|x|^2) on the wristband S^{d-1} \times [0,1]. We prove (and machine-verify in Lean 4) that for d \ge 2 the pushforward wristband map equals \sigma_{d-1} \otimes \mathrm{Unif}[0,1] iff the source is \mathcal{N}(0, I_d), and that the Neumann-reflected wristband repulsion energy is uniquely minimized at the uniform target. We compute this reflected-kernel objective in two ways: a nearest three-image pairwise truncation at O(N^2 d), and a spectral Neumann path joining angular and radial Mercer modes (spherical-harmonic and cosine) at O(N d K), with empirically matched gradients. A 1D Wasserstein radial term and a moment penalty serve as finite-sample accelerators with the same optimum, and Monte-Carlo null calibration turns the components into a single standardized statistic. We evaluate direct point-cloud Gaussianization with a calibrated barycentric W_2 score: a deterministic Gaussian reference batch is built by recursive Hungarian averaging, with each method reported as a z-score against same-size Gaussian batches. On the axis-uniform X benchmark, Wristband is competitive in 2D and gives the best 10D score. On a harder radial–angular-copula impostor whose Gaussian radial and angular marginals are correct but dependent, Wristband gives the best 10D and 128D scores. Coupled with learnable-key Euclidean attention and exact invertible flows, the resulting Deterministic Gaussian Autoencoder delivers a Gaussian-latent interface for counterfactual sampling with independent factors and a context/residual construction for dependent factors.
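
The wristband map itself is easy to reproduce numerically. The sketch below (my own illustration, using scipy's chi-square CDF) maps samples to (u, t) and checks that a standard Gaussian source yields an approximately uniform radial coordinate, as the characterization above states.

```python
import numpy as np
from scipy.stats import chi2

def wristband_map(x):
    """Map points x in R^d to the wristband S^{d-1} x [0,1]:
    direction u = x/|x| and CDF-transformed radius t = F_{chi^2_d}(|x|^2)."""
    x = np.atleast_2d(x)
    d = x.shape[1]
    r2 = np.sum(x ** 2, axis=1)
    u = x / np.sqrt(r2[:, None] + 1e-12)
    t = chi2.cdf(r2, df=d)
    return u, t

# For x ~ N(0, I_d), u is uniform on the sphere and t ~ Unif[0, 1].
rng = np.random.default_rng(0)
u, t = wristband_map(rng.standard_normal((10000, 8)))
print(t.mean(), t.std())   # approx 0.5 and 1/sqrt(12) ~ 0.289
```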

[LG-220] The Global Empirical NTK: Self-Referential Bias and Dimensionality of Gradient Descent Learning

链接: https://arxiv.org/abs/2605.08746
作者: James Hazelden,Laura Driscoll,Eli Shlizerman,Eric Shea-Brown
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Optimization and Control (math.OC)
*备注: Submitted to TMLR

点击查看摘要

Abstract:In training a neural network with gradient descent (GD), each iteration induces a linear operator that governs first-order updates to a model’s internal state variables. We define this operator as the Global Empirical Neural Tangent Kernel (NTK). In finite-width networks, the NTK is typically intractable to form, leading prior work to focus on restrictive settings such as tracking outputs only or taking infinite-width limits. Here, we study the structure of the NTK for a range of models. Formulating the model state as the solution to a single global implicit constraint, we derive the NTK as a product of two operators: K, accounting for immediate parameter-to-state interactions, and P, describing internal state-to-state dependencies. For a broad class of weight-based models, including RNNs and transformers, we prove a universal Kronecker-core theorem showing that K admits an exact, computable form given by the Gram matrix of weight-site variables. This core structure reveals that the NTK is structurally bottlenecked, constraining its effective rank and giving rise to a self-referential bias whereby GD preferentially learns within dominant modes of joint hidden and input activity. For recurrent models, we examine the spectrum of the NTK and show when it is biased and low-rank in space or time under the proposed decomposition. We further demonstrate that model dynamics at initialization bias the NTK, restricting learning and preventing task components from being learned effectively. Finally, we show that the NTK associated with a self-attention transformer is likewise structurally constrained to be low-rank. Overall, we show that the NTK possesses tractable structure that explains GD bias toward task solutions and the emergence of low-rank representations. To enable use of the NTK as a practical metric, we build kpflow, a library relying on randomized matrix-free numerical linear algebra.

[LG-221] Generative Actor-Critic with Soft Bridge Policies

链接: https://arxiv.org/abs/2605.08733
作者: Ke He,Le He,Shunpu Tang,Yafei Wang,Lisheng Fan
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:Expressive generative policies such as diffusion and flow models are appealing for MaxEnt online reinforcement learning because of their ability to model multimodal and highly non-Gaussian action distributions. However, training effective soft generative policies faces two obstacles that often arise together. First, marginal action densities are often unavailable, so existing methods typically rely on entropy bounds, heuristic proxies or approximations. Second, iterative shared-parameter samplers raise inference cost and require backpropagation through time over repeated network evaluations, increasing memory cost and destabilizing policy optimization. These obstacles motivate us to seek a generative policy that exposes a tractable MaxEnt objective while requiring only a single sampled actor forward pass for action generation. To this end, we propose soft generative actor-critic (SoftGAC), whose actor defines a stochastic bridge from a fixed base latent to a terminal action latent in pre-tanh space. This structured bridge allows us to lift the MaxEnt objective as an analytically tractable path-wise relative-entropy objective against a high-entropy reference process. In practical finite-step implementation, this relative entropy reduces exactly to sampled transition control energy and thus provides principled soft regularization. Moreover, we keep the single-pass actor lightweight by using small step-specific bridge transitions, each evaluated only once per sampled action, while maintaining a parameter budget comparable to strong actor baselines. Extensive experiments on challenging continuous-control benchmarks show that SoftGAC attains higher or competitive returns than strong generative policy baselines, including diffusion and flow-matching policies, while staying in the low-latency regime of one-pass actors and showing considerable improvements in the compute-return tradeoff.

[LG-222] Latent Geometry Beyond Search: Amortizing Planning in World Models

链接: https://arxiv.org/abs/2605.08732
作者: Hoang Nguyen,Xiaohao Xu,Xiaonan Huang
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 31 pages

点击查看摘要

Abstract:Modern vision-based world models can represent observations as compact yet expressive latent manifolds, but fast goal-oriented planning in these spaces remains challenging. This raises a central question: when does a learned representation simplify control, rather than merely enabling prediction? We study this question in a pretrained LeWorldModel, whose latent geometry is regularized for smoothness and uniformity. Our key insight is that, under such geometry, planning can be amortized into a latent inverse-dynamics mapping instead of requiring online search. We therefore replace iterative planning with a lightweight Goal-Conditioned Inverse Dynamics Model (GC-IDM) that maps the current latent state, goal latent state, and remaining horizon directly to the next action. Empirically, across four benchmark environments spanning navigation, contact-rich manipulation, and continuous control, our controller matches or exceeds CEM in seven of eight environment-protocol settings while reducing per-decision cost by 100-130x. A broader sweep over test-time planners (CEM, MPPI, iCEM, and gradient-based methods) shows that this result is not specific to a particular optimizer. These findings suggest that much of the structure recovered by test-time planning is already locally encoded in the latent representation. More broadly, our results indicate that sufficiently structured latent spaces can shift part of the planning burden from online optimization to learned inference.

[LG-223] Single-Thread JPEG Decoder Benchmarks Mis-Evaluate ML Data Loaders

链接: https://arxiv.org/abs/2605.08731
作者: Vladimir Iglovikov
类目: Performance (cs.PF); Machine Learning (cs.LG)
*备注: 9 pages, 4 figures. Code and data: this https URL

点击查看摘要

Abstract:JPEG decode is routine ML infrastructure, but Python decoder choices are often justified by single-process, single-thread microbenchmarks. We audit this evaluation assumption with twelve Python-accessible JPEG decode paths on five matched 16 vCPU Google Cloud CPUs: Intel Emerald Rapids, AMD Zen 4, AMD Zen 5, ARM Neoverse V2, and ARM Neoverse N1. ImageNet validation is the workload, not a new dataset contribution: each run decodes the full 50,000-image split from memory and reports single-thread throughput for all decoders, PyTorch DataLoader throughput for eligible decoders at worker counts 0,2,4,8, and decoder skip behavior. The evaluation protocol changes the supported conclusion. On Neoverse V2, imageio is ninth in single-thread throughput yet lands in the top DataLoader tier with torchvision; on Zen 4, torchvision rises from seventh single-thread to the top measured DataLoader tier; on Neoverse N1, imagecodecs is the single-thread leader but fourth at peak DataLoader throughput. We also find that worker-count conclusions differ between Zen 4 and Zen 5, TensorFlow has a large single-thread ARM penalty, and strict libjpeg-turbo-family wrappers reject the same rare ImageNet JPEG. For PyTorch DataLoader workloads, torchvision and simplejpeg form the strongest measured zero-skip tier: torchvision has the highest mean normalized throughput, while simplejpeg has the highest minimum. OpenCV remains a robust general-purpose fallback above 90% of the platform-local winner on every tested CPU. We release raw JSON, generated tables/figures, and an executable local/cloud benchmark framework.

[LG-224] Classification-Head Bias in Class-Level Machine Unlearning: Diagnosis Mitigation and Evaluation

链接: https://arxiv.org/abs/2605.08730
作者: Weidong Zheng,Kongyang Chen,Yuanwei Guo,Yatie Xiao
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Class-level machine unlearning aims to remove the influence of specified classes while preserving model utility on retained classes. Existing methods are commonly evaluated by retain-set accuracy, forget-set accuracy, and unlearning time, but these metrics provide limited insight into how forgetting is achieved internally. In this paper, we reveal a bias-dominated shortcut in class-level unlearning: the prediction of forgotten classes can be suppressed by decreasing the corresponding bias terms in the final classification head. We first analyze the gradient dynamics of classification-head biases under softmax cross-entropy training, explaining why retain-set-only optimization tends to reduce the biases of absent classes. Based on this observation, we introduce BiasShift as a diagnostic baseline, showing that simple bias manipulation can satisfy conventional unlearning metrics while leaving abnormal bias patterns that reveal forgotten labels. To mitigate excessive forgotten-class bias suppression, we propose two bias-aware mechanisms, namely Two-Stage Bias Gradient Reversal Mechanism (TS-BGRM) and Lower-Bound Hinge Regularization (LB-HR). We further introduce three bias-oriented metrics, including Bias Stability Coefficient (BSC), Median Bias Gap (MBG), and Minimal Bias Score (MBS), to quantify bias dependence and potential leakage. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that the proposed methods maintain competitive unlearning performance while producing more stable bias distributions. We have released our code at this https URL.
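
The bias-dominated shortcut is simple to reproduce: lowering the final-layer bias of the forgotten class suppresses its predictions without touching any other weights. The snippet below is a minimal diagnostic in the spirit of the BiasShift baseline, with the attribute name `model.fc` and the shift magnitude chosen as illustrative assumptions.

```python
import torch

@torch.no_grad()
def bias_shift(model, forget_class, delta=10.0):
    """Suppress one class purely through the classification-head bias.
    `model.fc` is assumed to be the final nn.Linear layer of the classifier."""
    model.fc.bias[forget_class] -= delta
    return model

# After this shift, forget-class predictions collapse while all other logits are
# unchanged, satisfying the usual accuracy-based unlearning metrics while leaving
# an abnormal, detectable bias pattern in the head.
```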

[LG-225] METBRA25Y: Brazil Surface Meteorology Archive with Harmonized Variables and Quality Control

链接: https://arxiv.org/abs/2605.08701
作者: Matheus Lima Castro,William Dantas Vichete,Leopoldo Lusquino Filho
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: 12 pages, 5 figures. Dataset paper describing METBRA25Y, a harmonized archive of hourly Brazilian surface meteorological observations derived from INMET records. Dataset available at Zenodo: https://doi.org/10.5281/zenodo.19964979

点击查看摘要

Abstract:This data paper describes METBRA25Y, a harmonized archive of hourly surface meteorological observations from Brazil derived from public historical records of the Instituto Nacional de Meteorologia (INMET). The dataset was designed to support reproducible environmental, climatological, hydrological, agricultural, urban-risk, and machine-learning studies that require station-level meteorological time series with standardized variable names and explicit quality-control metadata. The processing workflow ingests annual INMET archives, parses station metadata from raw file headers, normalizes heterogeneous Portuguese column names into a canonical schema, constructs hourly timestamps, consolidates observations by city and station, and exports compressed CSV files together with station manifests, per-station quality flags, daily precipitation aggregates, variable-level failure summaries, and missing-data audits. The quality-control protocol follows a two-stage strategy: first, physically implausible values are converted to missing values and flagged; second, temporal and cross-variable consistency checks generate diagnostic flags without necessarily overwriting the original measurements. The resulting package covers observations between 2000 and 2025, with station-specific temporal coverage, and includes key meteorological variables such as precipitation, air temperature, dew point, relative humidity, atmospheric pressure, wind speed, wind gust, wind direction, and global solar radiation. Based on the summary files included in the current release snapshot, the archive contains 616 unique station codes across variable summaries, of which 605 have coordinates within a broad Brazil plausibility envelope. This paper documents the dataset provenance, file organization, harmonized schema, quality-control rules, technical validation outputs, limitations, and recommended usage practices.

[LG-226] MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

链接: https://arxiv.org/abs/2605.08678
作者: Bohan Lyu,Yucheng Yang,Siqiao Huang,Jiaru Zhang,Qixin Xu,Xinghan Li,Xinyang Han,Yicheng Zhang,Huaqing Zhang,Runhan Huang,Kaicheng Yang,Zitao Chen,Wentao Guo,Junlin Yang,Xinyue Ai,Wenhao Chai,Yadi Cao,Ziran Yang,Kun Wang,Dapeng Jiang,Huan-ang Gao,Shange Tang,Chengshuai Shi,Simon S. Du,Max Simchowitz,Jiantao Jiao,Dawn Song,Chi Jin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modern AI progress has been driven by ML methods that are generalizable across settings and scalable to larger regimes. As large language models demonstrate advanced capabilities in reasoning, coding, and engineering tasks, it is increasingly important to understand whether they can discover such methods rather than only apply existing ones. We introduce MLS-Bench, a benchmark for evaluating whether AI systems can invent generalizable and scalable ML methods. MLS-Bench contains 140 tasks across 12 domains, each requiring an agent to improve one targeted component of an ML system or algorithm and demonstrate that the improvement generalizes across controlled settings and scales. We find that current agents remain far from reliably surpassing human-designed methods, and that engineering-style tuning is easier for them than genuine method invention. We further study the effects of test-time scaling, adaptive compute allocation, and context provision on agents’ discovery performance, together with case studies of their behavior. Our analyses suggest that the bottleneck is not only in proposing new methods, but also in the scientific insight needed to plan, validate, and scale claims about them. More search, compute, or context alone does not remove this bottleneck. We build and maintain a community platform for cumulative and comparable iteration, and release the data and code at this https URL.

[LG-227] PHIDA: Persistence-Guided Node-to-Cluster Mapping for Online Clustering

链接: https://arxiv.org/abs/2605.08673
作者: Naoki Masuyama,Yusuke Nojima,Stefan Wermter,Yuichiro Toda,Hisao Ishibuchi,Chu Kiong Loo
类目: Machine Learning (cs.LG)
*备注: This paper is currently under review

点击查看摘要

Abstract:Online clustering methods that adaptively create and update nodes as data arrive often make node learning explicit, whereas the mapping from the learned node state to output clusters often remains implicit or simplified. Implicit mappings make output clusters sensitive to weak graph bridges or local relations based on distance in the graph over learned nodes, leaving no explicit constraint on which node groups remain intact during mapping. This paper addresses this gap by proposing PHIDA, a persistence-guided node-to-cluster mapping method for online clustering with learned nodes. PHIDA implements this mapping within Adaptive Resonance Theory (ART)-based online clustering by combining Inverse-Distance ART (IDA) node learning with node-to-cluster mapping constrained by Persistent Homology (PH). Experiments on 24 benchmark datasets show that PHIDA achieves the best average ranks in stationary comparisons that include the recent stationary-only clustering methods, while also improving aggregate performance in the nonstationary setting over the evaluated online methods that adaptively create and update nodes. Ablations and comparisons with conventional node-to-cluster mappings indicate that the observed gains are associated with PH-constrained mapping that preserves raw PH components, together with the use of the PH component view during node learning. Source code is available at this https URL

[LG-228] The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits

链接: https://arxiv.org/abs/2605.08666
作者: Tianhao Cheng,Zeyu Huang,Zihan Qiu,Yu Cheng,Edoardo Ponti,Yinghui Xu,Ivan Titov,Zenglin Xu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A commonly accepted explanation of critic-free RL for LLMs, based on sequence-level rewards, is that it reinforces successful rollouts with a positive advantage while penalizing failed ones. In contrast, we study critic-free RL from a token-level perspective, revealing the token-flipping phenomenon: positive and negative rollouts exhibit remarkably similar proportions of tokens whose probabilities are boosted or suppressed during RL training. To explain this phenomenon, we further show that a token’s change in probability is not fully determined by its own advantage; coupled gradient interactions with other tokens also play a non-negligible role. Specifically, these token coupling effects occur primarily between identical tokens that are both predicted with low confidence. Building upon this analysis, we propose the cancellation hypothesis: as a result of coupling, opposing signals cancel out for tokens shared by positive and negative rollouts, while tokens more specific to successful rollouts receive stronger reinforcement, thereby inducing hidden token-level credit assignment from rollout-level rewards. We support this hypothesis with complementary empirical evidence. (1) Compared with training on only positive rollouts, critic-free RL shifts updates from template and formatting tokens toward reasoning tokens; (2) Tokens boosted by critic-free RL consistently demonstrate higher value than suppressed tokens, regardless of whether they originate from positive or negative rollouts. Guided by this view, we implement two batching interventions to encourage or preserve cancellation in critic-free RL training: query-preserved mini-batching and reward-balanced batching. Despite their simplicity, these interventions improve RLVR training across multiple model scales, supporting cancellation as both an explanatory principle and a practical design criterion for critic-free RL training.

[LG-229] Optimised Support Vector Regression for California Housing Price Prediction: The Critical Role of Feature Engineering and Hyperparameter Tuning

链接: https://arxiv.org/abs/2605.08660
作者: Emmanuel Adutwum
类目: Machine Learning (cs.LG)
*备注: 25 pages, 13 figures, 10 tables

点击查看摘要

Abstract:In the recent literature, Support Vector Regression (SVR) has been cited as one of the weakest performers on the California Housing benchmark dataset, with Preethi et al. (2025) specifically ranking it last among the algorithms they tested, reporting an R2 of only 0.60. This paper examines whether the previously reported performance reflects experimental configuration choices rather than an inherent algorithmic limitation. A structured experimental workflow is applied: ten domain-motivated derived features are constructed from the eight raw inputs, an exploratory ensemble feature importance analysis identifies the most predictive candidates, and a randomised search over hyperparameter combinations with three-fold cross-validation selects the optimal SVR configuration within a leakage-safe scikit-learn Pipeline. A formal four-stage ablation study isolates the contribution of each component: scaling alone accounts for +0.744 in R2 (from -0.054 to 0.690), feature engineering adds +0.026 (to 0.716), and hyperparameter tuning contributes +0.008 (to 0.723). The resulting tuned SVR achieves a test R2 of 0.723, a 0.123-point absolute improvement over the previously reported SVR result (from 0.60 to 0.723, approximately 20% relative gain). In the ten-model comparison, the tuned SVR ranks fourth with R2 = 0.723, below XGBoost (0.832), Random Forest (0.814) and Gradient Boosting (0.783), while substantially outperforming simpler baselines. Ten-fold cross-validation yields a mean R2 of 0.703 (95% CI: [0.630, 0.775]), confirming robust generalisation. The observed improvement from R2 = 0.60 to R2 = 0.723 is associated primarily with proper feature scaling within a unified preprocessing pipeline, with domain-motivated feature engineering and systematic hyperparameter tuning providing further incremental gains.
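
The leakage-safe pipeline described above (scaling fit inside the CV folds, followed by a randomized hyperparameter search) is straightforward to reproduce with scikit-learn. The sketch below uses only the raw eight features and an illustrative search space; it omits the paper's ten derived features, so it should be read as a structural outline rather than a reproduction of the reported R2 = 0.723.

```python
from scipy.stats import loguniform
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = fetch_california_housing(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Leakage-safe pipeline: the scaler is refit only on training folds inside CV.
pipe = Pipeline([("scale", StandardScaler()), ("svr", SVR(kernel="rbf"))])
search = RandomizedSearchCV(
    pipe,
    param_distributions={
        "svr__C": loguniform(1e-1, 1e3),
        "svr__gamma": loguniform(1e-3, 1e0),
        "svr__epsilon": loguniform(1e-2, 1e0),
    },
    n_iter=20, cv=3, scoring="r2", random_state=42, n_jobs=-1,
)
search.fit(X_tr, y_tr)
print("test R2:", search.score(X_te, y_te))
```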

[LG-230] FLUX: Geometry-Aware Longitudinal Flow Matching with Mixture of Experts

链接: https://arxiv.org/abs/2605.08648
作者: Josue Ortega Caro,Yongxu Zhang,Hannah M Batchelor,Sizhuang He,Jessica Cardin,Shreya Saxena
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Many biological systems evolve through continuous local dynamics while switching between latent regimes defined by learning, stimulus context, internal state, or developmental stage. These processes are often observed only as unpaired longitudinal snapshots: the same cells, neurons, or animals are not tracked as matched trajectories, even though population states are sampled across successive stages. This creates two coupled challenges. First, trajectories must respect curved low-dimensional manifolds embedded in high-dimensional biological measurements. Second, the model must identify when the transport mechanism itself changes. We introduce FLUX (FLow matching for Unpaired longitudinal data with miXture-of-experts), a geometry-aware longitudinal flow-matching framework for joint transport modeling and unsupervised regime discovery. FLUX learns a data-dependent metric from pooled labeled and unlabeled observations, uses that metric to construct geometry-aware conditional paths between adjacent marginals, and decomposes the resulting velocity field into sparse expert vector fields selected by a Straight-Through Gumbel-Softmax router. Across manifold controls, a regime-switching Lorenz system, widefield cortical calcium imaging during associative learning, and embryoid body single-cell differentiation, FLUX reconstructs longitudinal transport while recovering interpretable regime structure. Ablations show that mixture-of-experts routing alone is insufficient: FLUX without geometric learning can fit local transport but fails or weakens regime discovery when regimes are encoded in local dynamics. These results suggest that geometry-aware velocity decomposition provides a general strategy for discovering latent biological state transitions from unpaired longitudinal snapshots.

[LG-231] ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

链接: https://arxiv.org/abs/2605.08639
作者: Chao Jin,Xinming Wei,Yinmin Zhong,Chengxu Yang,Bingyang Wu,Ruidong Zhu,Zili Zhang,Yuliang Liu,Xin Jin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Load imbalance is a long-standing challenge in Mixture-of-Experts (MoE) training and is exacerbated in reinforcement learning (RL) for LLMs, where hot experts can shift frequently across micro-batches. Existing MoE training systems rely on historical loads to predict future expert demand, making them less effective under sharp fluctuations. We propose ReLibra, an MoE RL training system that exploits a unique opportunity in RL’s rollout-training workflow, routing replay, to enable fine-grained load balancing at micro-batch granularity. Because rollout and training process the same tokens with the same MoE parameters, the token-to-expert routing decisions are known before training starts. Leveraging this information, ReLibra places two MoE load-balancing mechanisms at inter- and intra-batch timescales, matching their communication patterns to hierarchical network bandwidths. At the inter-batch timescale, ReLibra performs expert reordering to redistribute experts for batch-level cross-node balancing; at the intra-batch timescale, it dynamically performs expert replication within a node to absorb micro-batch-level load fluctuations. Experiments on diverse MoE LLMs and RL workloads show that ReLibra improves training throughput by up to 1.6 \times over Megatron-LM and by up to 1.2 \times over EPLB, even when EPLB is given oracle loads. Moreover, ReLibra remains within 6%-10% of the throughput of an idealized balanced baseline.

[LG-232] Robust Server Defense Against Unreliable Clients in One-Shot Fair Collaborative Machine Learning

链接: https://arxiv.org/abs/2605.08616
作者: Chia-Yuan Wu,Frank E. Curtis,Daniel P. Robinson
类目: Machine Learning (cs.LG)
*备注: Accepted at the 2nd International Conference on Federated Learning and Intelligent Computing Systems (FLICS 2026)

点击查看摘要

Abstract:Collaborative machine learning (CML) enables multiple clients to train a global model jointly in a data-distributed setting. To address data privacy and communication efficiency, one-shot CML has been increasingly adopted, where clients communicate with the server only once by sharing synthetic or processed proxy data. This single-round communication, however, eliminates the possibility of iterative correction at the server, making the learning process particularly vulnerable to client unreliability. In this setting, unreliable clients, whether malicious or non-malicious, may provide biased proxy data that favors certain groups, thereby degrading the fairness of the global model and harming minority or unprivileged groups. In this work, we propose a server-side defense framework based on a bilevel optimization formulation. The proposed approach learns client-level weights to mitigate the influence of biased client proxy data while enforcing fairness constraints by using a very small trusted root dataset available at the server. Experimental results on benchmark datasets show that our method improves fairness with little accuracy loss under biased proxy data contributions from unreliable clients. Moreover, the proposed approach remains effective even when unreliable clients make up a majority of the system, consistently outperforming other existing methods.

[LG-233] FLARE: One-Shot PE-Level Fault Localization in Systolic Arrays via Algebraic Test Vectors

链接: https://arxiv.org/abs/2605.08594
作者: Logashree Venkatasubramanian(1),Zishen Wan(1),Viveck Cadambe(1) ((1) Georgia Institute of Technology)
类目: Hardware Architecture (cs.AR); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Systolic arrays are the dominant compute fabric for neural network inference. Prior work has addressed column-level fault detection efficiently with uniform test patterns, but row-level (PE-level) fault localization within a faulty column remains open without resorting to hardware redundancy. The fundamental obstacle is that uniform test inputs destroy per-row signatures: any test that activates every row equally cannot distinguish which row is the source of an observed deviation. In this paper, we propose a lightweight, purely algorithmic remedy based on coprime test vectors. By assigning pairwise coprime integers as test-input entries, a permanent weight-register fault produces a deviation whose divisibility signature uniquely identifies the faulty row. Under a general bounded error model, a single test pass localizes the faulty row with high probability. This error model covers a broader class of faults than what prior dataflow-aware testing work has primarily emphasized. When one round is insufficient, a second pass using a ratio computation achieves exact localization; for the special case of single-bit errors, odd coprime entries guarantee exact localization in one round. For INT16 arithmetic, a single test pass covers array sizes up to 256×256 with localization probability above 0.98, at a test cost under 1% of one inference GEMM tile.
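
A toy version of the divisibility argument, as the abstract describes it: if row i of a weight column is perturbed by a bounded integer fault, the output deviation under a coprime test vector is a multiple of the i-th test entry, and checking divisibility recovers the row. The integer weights, the prime-valued test entries, and the single-column setting below are simplifying assumptions for illustration only.

```python
import numpy as np

def first_primes(n):
    """Small pairwise-coprime test entries (consecutive primes are the simplest choice)."""
    primes, k = [], 2
    while len(primes) < n:
        if all(k % p for p in primes):
            primes.append(k)
        k += 1
    return np.array(primes, dtype=np.int64)

def localize_row(w_ref, w_faulty, test_vec):
    """A single-fault deviation equals delta * test_vec[i]; divisibility flags row i.
    With a small (bounded) delta, no other coprime entry divides the deviation."""
    deviation = int(test_vec @ (w_faulty - w_ref))
    if deviation == 0:
        return None
    hits = [i for i, c in enumerate(test_vec) if deviation % int(c) == 0]
    return hits[0] if len(hits) == 1 else hits   # ambiguity -> second-pass territory

rows = 8
c = first_primes(rows)                 # coprime test inputs, one per row
w = np.ones(rows, dtype=np.int64)      # one weight column
w_bad = w.copy()
w_bad[5] += 1                          # bounded fault injected in row 5
print(localize_row(w, w_bad, c))       # -> 5
```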

[LG-234] PRISM: Fast Online LLM Serving via Scheduling-Memory Co-design

链接: https://arxiv.org/abs/2605.08581
作者: Xingyu Qu,Tianhao Lin,Yiqi Li,Zhiyu Chen,Sheng Wang
类目: Machine Learning (cs.LG)
*备注: 25 pages, 9 figures, Preprint

点击查看摘要

Abstract:Modern online large language model (LLM) services, such as Retrieval-Augmented Generation (RAG) and agent systems, increasingly expose two prominent characteristics: prompt segmentation (e.g., system instructions, retrieved passages, tool outputs) and hotspot skew, where a small set of these segments recurs frequently across user requests. Failing to jointly exploit these patterns could lead to repeated prefill of hot segments and prolonged TTFT, undermining both throughput and user-perceived responsiveness. However, existing work tackles these patterns independently: KV-cache management mainly exploits segment reuse while scheduling reorders requests to improve cache locality, yet neither aligns request admission with KV-cache retention. To address this gap, we first analyze how scheduling and KV-cache management jointly affect TTFT. Guided by this, we present PRISM (Prefix Reuse Optimization Integrated Scheduling and Memory), which co-designs a query-aware scheduler (QAS) with a demand-aware radix tree (DART) to align request admission with exact-prefix KV retention. Our evaluation results show that, versus the strongest baseline, PRISM reduces average per-QPS P99 TTFT by 23.3% and 37.1% while increasing exact-prefix KV-cache hit rate by 5.9 and 12.2 percentage points on 4B and 13B models, respectively.

[LG-235] Different Prompts Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression

链接: https://arxiv.org/abs/2605.08568
作者: Hengyi Zhu,Zhendong Mi,Grace Li Zhang,Shaoyi Huang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have rapidly grown in scale, creating substantial memory and computational costs that hinder efficient deployment. Singular value decomposition (SVD) has emerged as an effective post-training compression technique, but existing SVD-based methods rely on static rank truncation, applying a fixed prefix of singular components to all inputs regardless of their diversity. We identify two limitations of this static design: the optimal rank varies across individual prompts, and the selected rank is sensitive to the choice of calibration set, leading to suboptimal performance across diverse inputs. To address these challenges, we propose PARSE, a post-training framework for Prompt-Aware Rank Selection as Experts in SVD-compressed LLMs. PARSE trains a linear router offline to perform prompt-aware rank selection, decoupling it from calibration information by supervising the router against dense-model outputs on a large-scale corpus. We further observe that rank-selection patterns are shared across semantically similar prompts and remain stable across decoding steps, allowing appropriate rank subsets to be served directly from a pattern cache at inference. Complemented by expert memory aggregation and kernel fusion for system-level efficiency, PARSE is orthogonal to existing SVD-based pipelines and consistently improves both model quality and inference efficiency. Integrated with four representative SVD-based methods, PARSE improves average task accuracy by up to 10% at a compression ratio of 0.6 on LLaMA-7B, and achieves up to 2.5× prefill and 2.4× decode speedup over native SVD execution.

[LG-236] Finer is Better (with the Right Scaling)

链接: https://arxiv.org/abs/2605.08565
作者: Clemens Schaefer,Gil Tabak
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Microscaling is a critical technique for preserving the quality of Large Language Models (LLMs) quantized to ultra-low precision formats. Intuitively, finer block sizes should yield lower quantization error; however, a paradox recently identified in the literature demonstrates that standard abs-max scaling can actually degrade model quality as block sizes shrink. In this work, we investigate the underlying mechanics of this phenomenon. We demonstrate that this degradation is not an inherent limitation of finer granularity, but is primarily driven by heavy-tailed tensor distributions interacting poorly with the coarse upper quantization bins of the FP4 element format. Specifically, we show that i) preventing the scaling factor from underflowing to zero mitigates localized errors, ii) targeted algorithmic interventions like the 4-over-6 methodology effectively correct the quantization geometry for large elements, and iii) a brute-force search establishes an optimal baseline, confirming that the theoretical Mean Squared Error (MSE) strictly improves with finer block sizes. Ultimately, our findings reveal a valuable interchangeability: applying the correct algorithmic recipe allows standard, hardware-compliant formats (like OCP E4M3) to match the performance of custom, wider-exponent formats (like UE5M3). We validate these results across several large language models, fully resolving the block size paradox and achieving robust downstream perplexity improvements.
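
To make the block-size discussion concrete, here is a small round-to-nearest abs-max sketch over the standard FP4 (E2M1) magnitude grid, using an unrestricted floating-point per-block scale (a simplification; it does not reproduce the paper's scale formats, its underflow intervention, or the 4-over-6 recipe). It only illustrates the mechanics of per-block abs-max scaling and how quantization error can be measured as a function of block size.

```python
import numpy as np

# Representable magnitudes of the FP4 (E2M1) element format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_blockwise(x, block=32):
    """Abs-max microscaling sketch: per-block scale = absmax / 6, elements snapped
    to the nearest FP4 grid point. Returns the dequantized tensor."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale = np.maximum(scale, 1e-12)          # guard against all-zero blocks
    q = np.sign(x) * FP4_GRID[np.argmin(
        np.abs(np.abs(x)[..., None] / scale[..., None] - FP4_GRID), axis=-1)]
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096) * np.exp(rng.standard_normal(4096))  # heavy-tailed values
for b in (128, 32, 8):
    err = np.mean((quantize_blockwise(w, b) - w) ** 2)
    print(f"block {b:4d}  MSE {err:.4e}")
```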

[LG-237] Beyond Static Bias: Adaptive Multi-Fidelity Bandits with Improving Proxies

链接: https://arxiv.org/abs/2605.08558
作者: Muyun Lu,Haoyang Hong,Huazheng Wang,Ying Lin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As an extension of the classical multi-armed bandit problem, multi-fidelity multi-armed bandits (MF-MAB) enable individual arms to be evaluated using diverse feedback sources that vary in both cost and accuracy. Prior stochastic models typically assume fixed low-to-high fidelity discrepancies, whereas modern proxy sources, such as learning-based simulators and Large Language Models (LLMs), can be improved using additional calibration. We investigate adaptive MF-MAB with improving proxy sources, and focus on the canonical two-fidelity case in which the low-fidelity source becomes more informative with repeated use. To capture this dynamic, we introduce a selected-average mismatch bound that converts dynamic low-fidelity observations into improvement-aware confidence bounds for the high-fidelity target. We propose the Threshold-Based Adaptive Continuation Companion (TACC), an optimistic algorithm that uses a bounded continuation rule to decide when low-fidelity sampling remains cost-effective and when to escalate. We prove an instance-dependent regret bound showing that, for detected intermediate arms, adaptive continuation replaces logarithmic high-fidelity confirmation with bounded low-fidelity continuation. Experiments on synthetic bandits and an LLM-as-a-judge policy-evaluation task examine when continuation improves cost-weighted regret.

[LG-238] Can Revealed Preferences Clarify LLM Alignment and Steering?

链接: https://arxiv.org/abs/2605.08556
作者: Khurram Yamin,Jingjing Tang,Eric Horvitz,Bryan Wilder
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:LLMs are increasingly used to make or support high-stakes decisions under uncertainty, where alignment depends not only on factual accuracy but on how models weigh tradeoffs between different outcomes. We present an empirical pipeline for estimating the implied preferences that an LLM’s observed choices optimize: we elicit the model’s probability distribution over unknowns along with the choice it would make for the decision task and then fit a discrete choice model to recover the cost function that best rationalizes the model’s decisions. We show how this revealed-preference description allows rigorous evaluation of whether models behave in a consistently goal-directed way, whether they can verbalize a description of their objectives which matches their revealed decision policy, and whether prompting can reliably steer those policies to implement a user-specified cost function. We apply this evaluation across four medical diagnosis domains and multiple frontier and open-source models. We find that while many models have a nontrivial degree of internal coherence, they also have significant weaknesses in faithfully reporting or adopting preferences in response to user direction.

[LG-239] A Call to Lagrangian Action: Learning Population Mechanics from Temporal Snapshots ICML2026

链接: https://arxiv.org/abs/2605.08550
作者: Vincent Guan,Lazar Atanackovic,Kirill Neklyudov
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at ICML 2026 (spotlight)

点击查看摘要

Abstract:The population dynamics of molecules, cells, and organisms are governed by a number of unknown forces. In the last decade, population dynamics have predominantly been modeled with Wasserstein gradient flows. However, since gradient flows minimize free energy, they fail to capture important dynamical properties, such as periodicity. In this work, we propose a change in perspective by considering dynamics that minimize a population-level action under a damped Wasserstein Lagrangian. By deriving the corresponding Hamiltonian equations of motion, we formalize Wasserstein Lagrangian Mechanics, a structured class of second-order dynamics that encompasses classical mechanics, quantum mechanics, and gradient flows. We then propose WLM as the first algorithm that learns these second-order dynamics from observed marginals, without specifying the Lagrangian. By directly learning the population mechanics, WLM can both forecast and interpolate unseen marginals, and outperforms existing gradient flow and flow matching methods across a wide range of dynamics, including vortex dynamics, embryonic development, and flocking.

[LG-240] Tokens-per-Parameter Coverage Is Critical for Robust LLM Scaling Law Extrapolation

链接: https://arxiv.org/abs/2605.08541
作者: Joshua Shay Kricheli,Alexander Lawrence Reid,Soumajyoti Sarkar,Venkata Gandikota,Paulo Shakarian
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural scaling laws approximate a language model’s loss as a power-law function of parameter count N and token count D. Following Chinchilla-style compute-optimal training, many studies fit scaling laws from runs performed under a fixed tokens-per-parameter (TPP) ratio k and set D = kN. We show that this collinear design, combined with the empirically common near-equality of the exponents governing N and D, induces an inherent ill-conditioning in the Gauss-Newton least-squares problem: the condition number of the design grows as the inverse square of the gap between the N- and D-exponents. The scale coefficients become practically unidentifiable, with confidence intervals inflating by an order of magnitude or more, yielding a "sloppy" model whose extrapolations degrade sharply off the training ray. We prove this for four scaling-law formalisms and derive a closed-form TPP-diversity threshold that is necessary and sufficient for well-conditioned estimation. Empirically, non-collinear designs outperform collinear ones on held-out splits with a 97.3% win rate across four laws, five corpora, and multiple floating-point precision modes. We further show the degeneracy is rooted in Jacobian geometry and is not an artifact of the loss function: any smooth estimation objective whose curvature involves the Jacobian inherits the same ill-conditioning.
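
The conditioning claim is easy to probe numerically: build the Gauss-Newton Jacobian of a Chinchilla-style law L(N, D) = E + A/N^alpha + B/D^beta at some coefficient values and compare its condition number on a fixed-TPP ray against a design with varied TPP. The coefficients below are assumed, Chinchilla-like placeholders, and the script is my own illustration rather than the paper's fitting code.

```python
import numpy as np

def gn_condition(Ns, Ds, E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28):
    """Condition number of the Gauss-Newton Jacobian of
    L(N, D) = E + A/N^alpha + B/D^beta w.r.t. (E, A, B, alpha, beta)."""
    fN, fD = A * Ns ** (-alpha), B * Ds ** (-beta)
    J = np.stack([
        np.ones_like(Ns),          # dL/dE
        Ns ** (-alpha),            # dL/dA
        Ds ** (-beta),             # dL/dB
        -fN * np.log(Ns),          # dL/dalpha
        -fD * np.log(Ds),          # dL/dbeta
    ], axis=1)
    return np.linalg.cond(J)

N = np.logspace(7, 9, 12)                       # parameter counts
print("collinear (D = 20 N)  :", gn_condition(N, 20.0 * N))
tpp = np.logspace(0.5, 2.5, 12)                 # varied tokens-per-parameter
print("non-collinear design  :", gn_condition(N, tpp * N))
```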

[LG-241] The Propagation Field: A Geometric Substrate Theory of Deep Learning ICML4

链接: https://arxiv.org/abs/2605.08529
作者: Xingrui Gu
类目: Machine Learning (cs.LG)
*备注: Technical notes on exploring the nature of deep learning propagation, Under review by the ICML 4th Workshop on High-dimensional Learning Dynamics (HiLD) 2026

点击查看摘要

Abstract:Modern deep learning treats neural networks primarily as endpoint functions from inputs to outputs. Inspired by the shift from force to geometry in physics, we ask whether a network should instead be understood through the geometry of its internal propagation. We define a neural propagation field as the collection of hidden-state trajectories and local Jacobian operators across depth. Endpoint losses constrain only the boundary behavior of this field, leaving its interior geometry underdetermined. We show that endpoint-equivalent models can differ by orders of magnitude in trajectory and Jacobian structure, and introduce observable field metrics such as path sensitivity, solver consistency, and trajectory/Jacobian retention. In controlled teacher-flow and PDE systems, endpoint fitting fails to recover the underlying propagation law. In real multi-path tasks, field-aware objectives improve unseen-path generalization, OOD robustness, and calibration when aligned with the observation structure, but can collapse when over-constrained. In continual learning, field-preservation regularization complements replay and distillation: on Split CIFAR-100, DER++ with field preservation improves average accuracy, backward transfer, and field-retention metrics. These results identify propagation-field quality as a measurable and trainable property of neural networks beyond endpoint performance.

[LG-242] Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck

链接: https://arxiv.org/abs/2605.08526
作者: Zihan Huang,Junda Wu,Tong Yu,Qianqi Yan,Rohan Surana,Uttaran Bhattacharya,Lina Yao,Xin Eric Wang,Julian McAuley
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While LLM-based agents excel at planning and executing long action sequences, their execution often remains inconsistent across trials, limiting reliability. Consolidating agent consistency requires distilling trial-error trajectories into reusable skills that preserve task-relevant invariants while discarding trajectory-specific noise. However, in multimodal settings, the key challenge is not only that useful invariants are distributed across vision and language information, but that different modalities support different kinds of reusable skill content: while some skills are verbalizable and interpretable, others reside in perceptual evidence beyond text. Text-only skills may lose perceptual cues, whereas storing text and perception naively introduces redundancy and noise. Existing inference-time methods, such as self-consistency, improve reliability through costly multi-sample decoding, while internalization strategies lack a way to separate verbalizable skill content from residual perceptual information. To address this, we introduce Conditional Multimodal Information Bottleneck (CMIB), a method for multimodal skill construction. CMIB begins with a joint bottleneck over multimodal skills and derives an exact sequential decomposition: (1) a text-stage bottleneck distilling interpretable skill cards, and (2) a conditional multimodal bottleneck compressing only residual information in perception that remains predictive beyond text. Unlike naive two-stream formulations, CMIB explicitly conditions the multimodal latent on the text skill, thus structurally reducing cross-modal redundancy and enabling independent control over textual and perceptual compression. We instantiate CMIB with a variational objective that makes its conditional decomposition tractable to optimize, yielding reusable multimodal skills that improve execution stability without incurring multi-sample inference overhead.

[LG-243] FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration

链接: https://arxiv.org/abs/2605.08520
作者: Zhengding Hu,Mingge Lu,Zhen Wang,Jixuan Ruan,Chang Chen,Zaifeng Pan,Yue Guan,Ruiyi Wang,Zhongkai Yu,Chao Zhang,Yufei Ding
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:LLM-based evolution has emerged as a promising way to improve agents by refining non-parametric artifacts, but its wall-clock cost remains a major bottleneck. We identify that this cost comes from synchronized stage execution and imbalance inside each LLM-heavy stage. We present FlashEvolve, an efficient framework that replaces synchronized execution with asynchronous workers and queues, allowing different stages and steps to overlap. To handle data staleness introduced by asynchrony, FlashEvolve tracks artifact versions and applies different policies to update, discard, or patch stale artifacts. Unlike weight-space staleness in asynchronous RL, language-space staleness is inspectable and repairable: a stale artifact is not just delayed work, but readable evidence that the LLM can reflect on, revise, and turn into useful evolution signal. FlashEvolve further improves throughput and token efficiency with speculative stage completion and adaptive workflow control. On GEPA workloads, FlashEvolve improves proposal throughput by 3.5\times on local vLLM and 4.9\times on API serving over synchronous GEPA. The same design also applies to ACE and Meta-Harness.
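
The core scheduling idea, asynchronous stages connected by queues with version tags that expose stale artifacts, can be sketched in a few lines. The snippet below is our own simplified illustration; the worker roles, version policy, and timings are invented and do not reflect FlashEvolve's actual implementation.

```python
import queue
import threading
import time

# Minimal sketch (not the FlashEvolve implementation): asynchronous stage
# workers connected by a queue, with version tags used to detect artifacts
# that were produced before the latest accepted update.

proposals = queue.Queue()
current_version = {"v": 0}
lock = threading.Lock()

def proposer(n_rounds):
    for i in range(n_rounds):
        with lock:
            v = current_version["v"]
        time.sleep(0.01)                     # stands in for an LLM call
        proposals.put({"artifact": f"prompt-variant-{i}", "version": v})

def evaluator(n_rounds):
    for _ in range(n_rounds):
        item = proposals.get()
        with lock:
            if item["version"] < current_version["v"]:
                # Stale artifact: in language space it is still readable
                # evidence, so a real system could patch or reflect on it
                # instead of discarding; here we only mark it.
                item["status"] = "stale"
            else:
                item["status"] = "accepted"
                current_version["v"] += 1
            print(item)

n = 5
t1 = threading.Thread(target=proposer, args=(n,))
t2 = threading.Thread(target=evaluator, args=(n,))
t1.start(); t2.start(); t1.join(); t2.join()
```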

[LG-244] SeBA: Semi-supervised few-shot learning via Separated-at-Birth Alignment for tabular data

链接: https://arxiv.org/abs/2605.08519
作者: Kacper Jurek,Wojciech Batko,Marek Śmieja,Marcin Przewięźlikowski
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning from scarce labeled data with a larger pool of unlabeled samples, known as semi-supervised few-shot learning (SS-FSL), remains critical for applications involving tabular data in domains like medicine, finance, and science. The existing SS-FSL methods often rely on self-supervised learning (SSL) frameworks developed for vision or language, which assume the availability of a natural form of data augmentations. For tabular data, defining meaningful augmentations is non-trivial and can easily distort semantics, limiting the effectiveness of conventional SSL. In this work, we rethink SSL for tabular data and propose Separated-at-Birth Alignment (SeBA), a joint-embedding framework for SS-FSL that eliminates the dependence on augmentations. Our core idea is to separate the data into two independent, but complementary views and align the representations of one view to mirror the nearest-neighbor correspondence of the data in the second view. Our experimental evaluation supported by a theoretical analysis justifies that SeBA generates an output space, which improves the feature-label relationship. An experimental study conducted in various benchmark datasets demonstrates that SeBA achieves the state-of-the-art performance in the majority of cases, opening a new avenue for SS-FSL paradigm in the domain of tabular data.

[LG-245] Quantile-Coupled Flow Matching for Distributional Reinforcement Learning

链接: https://arxiv.org/abs/2605.08515
作者: Michael Groom,Victor-Alexandru Darvariu,Lars Kunze,James Wilson,Nick Hawes
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Unlike standard expected-return Reinforcement Learning (RL), Distributional RL (DRL) models the full return distribution, making it better-suited for uncertainty-aware and risk-sensitive decision-making. Conditional Flow Matching (CFM) critics have recently attracted attention for modelling continuous, multi-modal return distributions. Despite this interest, there remains a substantial metric mismatch: DRL theory relies on the distributional Bellman operator being contractive in the p -Wasserstein distance, yet existing CFM critics are trained with arbitrary source-target couplings, so their flow-matching losses are not Wasserstein-aligned surrogates for matching Bellman target return distributions. In this work, we address this mismatch by proposing FlowIQN, a CFM critic that sorts source and Bellman target samples within each mini-batch to approximate the monotone optimal transport coupling, replacing arbitrary pairings with quantile-aligned flow paths. We prove that the loss of our quantile-coupled CFM critic yields a Wasserstein-aligned approximate projection compatible with the foundations of DRL. To our knowledge, FlowIQN is the first flow-matching distributional critic with an explicit Wasserstein-aligned projection guarantee. We further extend FlowIQN with shortcut models for efficient inference. Empirical results show that FlowIQN improves Wasserstein return-distribution accuracy over other CFM critics. It also yields competitive performance on offline RL benchmarks across multiple policy extraction methods, providing a theoretically grounded CFM critic that is readily compatible with DRL pipelines. Code: this https URL.
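
The quantile coupling itself is easy to picture in one dimension: sorting source and target samples within a mini-batch pairs them by rank, which is exactly the monotone optimal-transport coupling. The sketch below is our own toy illustration, with made-up distributions standing in for Bellman target returns, showing how this replaces arbitrary pairings when forming flow-matching regression targets:

```python
import numpy as np

# Illustrative sketch (not the paper's code): rank-matched (quantile) coupling
# of source and target samples for a 1-D flow-matching regression target.

rng = np.random.default_rng(0)
batch = 256

x0 = rng.normal(0.0, 1.0, size=batch)          # source samples (e.g. base noise)
x1 = rng.gamma(2.0, 2.0, size=batch)           # stand-in Bellman target returns

# An arbitrary coupling pairs x0[i] with x1[i]; the quantile coupling instead
# pairs the i-th smallest source sample with the i-th smallest target sample.
x0_sorted = np.sort(x0)
x1_sorted = np.sort(x1)

t = rng.uniform(0.0, 1.0, size=batch)          # interpolation times
x_t = (1.0 - t) * x0_sorted + t * x1_sorted    # points on the straight paths
v_target = x1_sorted - x0_sorted               # flow-matching velocity target

# A critic v_theta(x_t, t) would be regressed onto v_target; here we only show
# that quantile-coupled paths are shorter (straighter transport) on average.
print("mean |target| (arbitrary coupling):", np.abs(x1 - x0).mean())
print("mean |target| (quantile coupling) :", np.abs(x1_sorted - x0_sorted).mean())
```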

[LG-246] MoMo: Conditioned Contrastive Representation Learning for Preference-Modulated Planning

链接: https://arxiv.org/abs/2605.08512
作者: Yusuf Syed,Viraj Parimi,Brian Williams
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Temporally contrastive representation learning induces a latent structure capable of reducing long-horizon planning to inference in a low-dimensional linear system. However, existing contrastive planning work learns a single latent geometry which cannot distinguish multiple valid behaviors trading task efficiency against risk exposure for the same start-goal query. We introduce MoMo, a preference-conditioned contrastive planner allowing a scalar user preference to continuously modulate plan conservativeness at inference time, without retraining. MoMo learns a joint conditioning of the representation geometry and latent prediction operator via Feature-Wise Linear Modulation and low-rank neural modulation, respectively. We show that our formulation preserves the probability density ratio encoded in the representation space that is required for inference-driven contrastive planning, further retaining its inference-time efficiency. Across six environments, MoMo smoothly adapts plan safety according to user preferences, yielding improved temporal and preferential consistency over state augmentation baselines.

[LG-247] Learning Polyhedral Conformal Sets for Robust Optimization

链接: https://arxiv.org/abs/2605.08506
作者: Shuyi Chen,Wenbin Zhou,Shixiang Zhu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Robust optimization (RO) provides a principled framework for decision-making under uncertainty, but its performance critically depends on the choice of the uncertainty set. While large sets ensure reliability, they often lead to overly conservative decisions, whereas small sets risk excluding the true outcome. Recent data-driven approaches, particularly conformal prediction, offer finite-sample validity guarantees but remain largely task-agnostic, ignoring the downstream decision structure. In this paper, we propose a decision-aware conformal framework that learns uncertainty sets tailored to robust optimization objectives. Our approach parameterizes a flexible family of polyhedral sets via data-driven hyperplanes and learns their geometry by directly minimizing the induced robust loss, while preserving statistical validity through conformal calibration. To correct for data-dependent selection, we incorporate a re-calibration step on an independent dataset to restore coverage. The resulting sets capture directional and anisotropic uncertainty aligned with the decision objective while remaining computationally tractable. We provide finite-sample coverage guarantees and bounds on the sub-optimality gap to an oracle decision. This work bridges the gap between statistical validity and decision optimality, providing a principled framework for data-driven robust optimization.

[LG-248] NeuralBench: A Unifying Framework to Benchmark NeuroAI Models

链接: https://arxiv.org/abs/2605.08495
作者: Hubert Banville,Stéphane d’Ascoli,Simon Dahan,Jérémy Rapin,Marlène Careil,Yohann Benchetrit,Jarod Lévy,Saarang Panchavati,Antoine Ratouchniak,Mingfang(Lucy)Zhang,Elisa Cascardi,Katelyn Begany,Teon Brooks,Jean-Rémi King
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 31 pages, 9 figures

点击查看摘要

Abstract:Deep learning and large public datasets have recently catalyzed the proliferation of AI models for processing brain recordings. However, systematically evaluating these models remains a challenge: not only do the preprocessing pipelines, training and finetuning approaches largely vary across studies, but their downstream evaluation is often limited to small sets of tasks and/or datasets. Here, we present NeuralBench: a unified framework for benchmarking AI models of brain activity. We accompany this framework with NeuralBench-EEG v1.0 – a large EEG benchmark that includes 36 electroencephalography (EEG) tasks and 14 deep learning architectures, and is evaluated on 94 datasets accessed through a standardized interface. This first EEG-focused release already highlights two main findings. First, current foundation models only marginally outperform task-specific models. Second, a large set of tasks (e.g. cognitive decoding, clinical predictions) remain highly challenging, even for the best models. Critically, NeuralBench is designed for the integration of new tasks, datasets, models, and neuroimaging modalities, as illustrated by preliminary extensions to MEG and fMRI datasets and models. Through this white paper, we invite the community to expand this open-source framework and work together toward a unified benchmarking standard for neuroimaging models.

[LG-249] When Independent Sampling Outperforms Agentic Reasoning

链接: https://arxiv.org/abs/2605.08478
作者: Yihe Dong,Boris Shigida
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study how to allocate inference-time compute for competitive programming under fixed budgets. Evaluating 216 Codeforces problems across Divisions 1-3, we compare agent-based reasoning with repeated independent sampling (k-shot) as a function of both cost and number of model calls. Across models and difficulty levels, k-shot consistently achieves a better accuracy-cost and accuracy-query tradeoff. This gap persists despite prompt caching in agent frameworks, indicating lower per-call effectiveness. Our results show that, for self-contained algorithmic tasks, independent exploration can outperform deeper agentic reasoning under realistic resource constraints. We also provide a budget-allocation analysis when the inference budget is fixed, and prove that a cost-optimal solver minimizes the principled metric log failure likelihood per dollar.
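
A back-of-the-envelope version of the comparison, with invented per-call solve rates and costs (the paper's measurements and its exact metric definition are not reproduced here), looks like this:

```python
import math

# Hypothetical numbers only: compare k independent samples against a single,
# more expensive agentic attempt under a fixed dollar budget.

p_shot, cost_shot = 0.15, 0.02     # assumed per-call solve rate / cost
p_agent, cost_agent = 0.40, 0.30   # assumed agentic run

budget = 0.30
k = int(budget // cost_shot)                      # affordable independent shots
p_kshot = 1.0 - (1.0 - p_shot) ** k               # P(at least one success)

print(f"k-shot  (k={k}): success ~ {p_kshot:.3f}")
print(f"agentic (k=1) : success ~ {p_agent:.3f}")

# One reading of "log failure likelihood per dollar": how much the log of the
# failure probability drops per dollar spent.  For independent sampling this is
# constant in k, so it ranks solvers by -log(1 - p) / cost per call.
print("k-shot  -log(1-p)/cost:", -math.log(1.0 - p_shot) / cost_shot)
print("agentic -log(1-p)/cost:", -math.log(1.0 - p_agent) / cost_agent)
```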

[LG-250] CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs

链接: https://arxiv.org/abs/2605.08467
作者: Shiyang Li,Zijian Zhang,Guangyan Sun,Yuebo Luo,Winson Chen,Yanzhi Wang,Mingyi Hong,Caiwen Ding
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models show promise for automated CUDA programming, however even the strongest coding models (e.g., Claude-Opus-4.6) may still fall short of expert-level, architecture-aware optimization. We introduce CUDAHercules, a benchmark that evaluates generated CUDA against end-to-end human-expert SOTA systems. It spans single kernels, module-level operators, full applications, and unsolved challenge tasks across Ampere, Hopper, and Blackwell GPUs, with end-to-end tasks gated by domain-specific semantic validators. Evaluating models such as Claude-Opus-4.6 and GPT-5.4 shows a large gap between runnable CUDA and expert CUDA engineering: models often compile and pass tests, but rarely recover the optimization strategies needed to match expert performance. Application semantics further reduce success, and iterative or tool-augmented feedback can improve correctness while drifting toward slow fallback implementations. These results show that automated CUDA programming remains far from fully solved and requires stronger hardware reasoning, better tool use, and training objectives that connect code understanding to hardware architecture-grounded intelligence.

[LG-251] The Geometric Structure of Models Learning Sparse Data

链接: https://arxiv.org/abs/2605.08464
作者: Thomas Walker,T. Mitchell Roddenberry,Ahmed Imtiaz Humayun,Randall Balestriero,Richard Baraniuk
类目: Machine Learning (cs.LG)
*备注: 27 pages, 7 figures, 5 tables

点击查看摘要

Abstract:The manifold hypothesis (MH) is often used to explain how machine learning can overcome the curse of dimensionality. However, the MH is only applicable in regimes where the training data provides a sufficiently dense sample of the underlying low-dimensional data manifold, or where such a low-dimensional manifold is conceivably present. We describe the regimes where the MH is not applicable as sparse. In this paper, we demonstrate that models succeed in the sparse regime by exploiting a highly structured local geometry, a property we formalize as normal alignment. We prove that normal-aligned classifiers – whose input-output Jacobians are rank-one and align perfectly with the training data – minimize the training objective under norm constraints and achieve maximal local robustness under a non-zero Jacobian constraint. For continuous piecewise-affine deep networks, normal alignment manifests geometrically as centroid alignment within the network’s induced power diagram partition and results from the feature-learning regime. Motivated by these theoretical insights, we introduce GrokAlign, a regularization strategy that actively induces normal alignment. We demonstrate that GrokAlign significantly accelerates the training dynamics of deep networks relevant to the grokking phenomenon. Furthermore, we apply the principle of normal alignment to Recursive Feature Machines (RFMs) to introduce Recursive Feature Alignment Machines (RFAMs). We show that RFAMs exhibit greater adversarial robustness compared to RFMs when trained on tabular data.

[LG-252] Neurally-plausible radial basis kernels using distributed Fourier embeddings

链接: https://arxiv.org/abs/2605.08458
作者: Jakeb Chouinard
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Coherent, continuous spatial representations are critical for synthesizing physical and perceptual phenomena into a single representational space. Radial basis kernels provide a path forward for this type of distributed representation. In this work, we aim to characterize and analyze common radial basis kernels realizable in the neurally-plausible framework of spatial semantic pointers. Further, we analyze previous radial basis kernel work based on grid cell-like representations and demonstrate that such representations are both capable of and optimal for realizing radial basis kernels.

[LG-253] HEART: A High-Efficiency Adaptive Real-Time Telemonitoring Framework for Secure Electrocardiogram Signal Transmission Using Chaotic Encryption

链接: https://arxiv.org/abs/2605.08456
作者: Beyazıt Bestami Yuksel
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 15 pages, 4 figure, 3 table

点击查看摘要

Abstract:The real-time analysis and secure transmission of electrocardiogram (ECG) signals are critical for accurate diagnosis and safeguarding patient privacy in telemedicine applications. This study presents a novel real-time ECG monitoring system that employs a learnable key generator (LKG), derived from each patient's own ECG signal characteristics, to dynamically produce unique encryption keys. These keys determine the parameters r and x0 of a logistic map used for chaotic encryption. The system securely encrypts real-time ECG data immediately after acquisition, ensuring confidential transmission and storage in the cloud. For remote clinical access, the encrypted data is downloaded and decrypted on the doctor's side using the matching key generated at the source or securely stored in the cloud. This approach eliminates the need for traditional key exchange and substantially raises the cost of exhaustive key search in practice through per-segment biometric key refresh and combined permutation and XOR diffusion, supported by min-entropy evaluation. Compared to static-key methods, the learnable biometric key design offers greater unpredictability and individualization. A comprehensive set of security assessments, including Shannon entropy (7.678 bits), correlation and autocorrelation disruption, histogram statistics, NIST SP 800-22 frequency testing, plaintext/key sensitivity (avalanche effect), FFT-based spectral flatness, and robustness to noise and occlusion, confirms the method's strength. Reconstruction fidelity (MSE approximately 5x10^-6, PSNR greater than 52 dB, MAE approximately 0.002) demonstrates near-lossless decryption and preserved diagnostic features. Encryption latency remains low, preserving real-time performance.
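
The chaotic-encryption core described above is simple to sketch. The snippet below is a minimal illustration on synthetic data: it uses a logistic-map keystream keyed by (r, x0) and XOR diffusion only, and omits the permutation stage, per-segment key refresh, and the learnable key generator of the actual system.

```python
import numpy as np

# Minimal sketch (assumptions ours, not the paper's system): logistic-map
# keystream keyed by (r, x0), XOR-applied to an 8-bit quantized ECG segment.

def logistic_keystream(r, x0, n, burn_in=100):
    x = x0
    out = np.empty(n, dtype=np.uint8)
    for i in range(n + burn_in):
        x = r * x * (1.0 - x)              # logistic map iteration
        if i >= burn_in:
            out[i - burn_in] = int(x * 256) % 256
    return out

rng = np.random.default_rng(1)
ecg = np.sin(np.linspace(0, 20 * np.pi, 1000)) + 0.05 * rng.normal(size=1000)

# Quantize the segment to bytes (a real system would keep the scale metadata).
lo, hi = ecg.min(), ecg.max()
ecg_q = np.round(255 * (ecg - lo) / (hi - lo)).astype(np.uint8)

r, x0 = 3.99, 0.613                        # stand-in for the biometric key
ks = logistic_keystream(r, x0, ecg_q.size)
cipher = ecg_q ^ ks                        # XOR encryption
recovered = cipher ^ ks                    # decryption with the same keystream

print("round-trip exact:", np.array_equal(recovered, ecg_q))
```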

[LG-254] CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging

链接: https://arxiv.org/abs/2605.08455
作者: Shiyang Li,Haoyang Chen,Mattia Fazzini,Caiwen Ding
类目: Machine Learning (cs.LG); Programming Languages (cs.PL); Software Engineering (cs.SE)
*备注: 25 pages, 5 figures

点击查看摘要

Abstract:Debugging CUDA programs has long been challenging because failures often arise from subtle interactions among hardware behavior, compiler decisions, memory hierarchy, and asynchronous execution. More importantly, with the rapid expansion of GPU usage across scientific computing, machine learning, graphics, and systems workloads, CUDA debugging has become more challenging than ever. Current evaluations of LLM-based CUDA programming largely miss this setting: a model can pass correctness tests with repair by degeneration, simplifying the CUDA code into a safer but slower program that abandons the original optimization structure. We introduce CUDABEAVER, a benchmark for CUDA debugging from real failing workspaces produced during LLM-based CUDA generation. Each task provides the broken candidate, native build/test commands, raw error evidence, and a single editable file. CUDABEAVER evaluates whether a fixer truly repairs the failing CUDA code or merely finds a slower test-passing replacement, reporting results by failure category, debugging trajectory, stagnation mode, and performance preservation. We further propose pass@k(M,C,A), a protocol-conditional CUDA debugging metric that makes the fixer M, corpus C, and protocol axes A explicit. Using this metric across 213 tasks and seven frontier LLMs, we show that protocol-aware evaluation gives a more faithful view of CUDA debugging ability: when performance-loss tolerance is high, fixers appear much stronger, but even a slightly stricter performance requirement can sharply reduce measured success, shifting scores by up to 40 percentage points.

[LG-255] RubiConv – Efficient Boundary-Respecting Convolutions

链接: https://arxiv.org/abs/2605.08451
作者: Linda Friso,Annie Marsden,Xinyi Chen,Arushi Gupta,Peter Bartlett,Mark Braverman,Elad Hazan
类目: Machine Learning (cs.LG)
*备注: 19 pages, 12 figures

点击查看摘要

Abstract:Convolutional architectures have emerged as powerful alternatives to Transformers for sequence modeling. The primary advantage is that they offer improved theoretical sequence length complexity by leveraging the Fast Fourier Transform (FFT). However, this theoretical improvement does not always meaningfully land in practice. One critical obstacle is that applying standard FFTs is not amenable to the large-scale training pipeline wherein data is packed from different sources into a single sequence for hardware efficiency. Indeed, standard FFT algorithms are not easily amenable to document packing. Existing workarounds suffer from severe inefficiencies, crippling the practical performance of convolutional architectures. We close this gap with RubiConv, a novel algorithm for performing hardware-efficient, boundary-respecting convolutions on packed sequences. Extensive experiments show that RubiConv achieves significant speedups over both attention and standard FFT-based baselines. This work makes the theoretical efficiency of long convolutional models a practical reality for large-scale, real-world data packing.
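
The problem being solved, convolving a packed sequence without leaking information across document boundaries, can be illustrated with a naive per-segment FFT convolution. This is our own reference construction, not the RubiConv kernel, which fuses the same behavior into a hardware-efficient form:

```python
import numpy as np

# Illustrative sketch (not the RubiConv kernel): boundary-respecting causal
# convolution over a packed sequence, applied independently per document.

def causal_fft_conv(x, k):
    # Linear (non-circular) causal convolution of a 1-D signal with kernel k.
    n = len(x) + len(k) - 1
    y = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)
    return y[: len(x)]

def packed_conv(x, kernel, boundaries):
    # `boundaries` lists the start index of each packed document.
    out = np.empty_like(x)
    edges = list(boundaries) + [len(x)]
    for s, e in zip(edges[:-1], edges[1:]):
        out[s:e] = causal_fft_conv(x[s:e], kernel)
    return out

rng = np.random.default_rng(0)
packed = rng.normal(size=512)              # two documents packed together
kernel = np.exp(-0.1 * np.arange(64))      # long decaying convolution filter

y_packed = packed_conv(packed, kernel, boundaries=[0, 300])
y_naive = causal_fft_conv(packed, kernel)  # ignores the document boundary

# The naive version leaks document 1 into the start of document 2:
print("max leak at doc-2 start:", np.abs(y_naive[300:364] - y_packed[300:364]).max())
```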

[LG-256] Direct Bethe Free Energy Minimization for Bayesian Neural Network

链接: https://arxiv.org/abs/2605.08446
作者: Pavel Prochazka
类目: Machine Learning (cs.LG)
*备注: Submitted to conference

点击查看摘要

Abstract:We propose training Bayesian neural networks by directly minimizing the Bethe free energy rather than maximizing a variational lower bound. On tree-structured factor graphs the Bethe free energy is exact; deterministic layers drop out of the objective and are trained by standard backpropagation, so the framework accommodates any mixture of probabilistic and deterministic subgraphs without modification. Restricting the weight posterior to a last-layer Gaussian yields analytically tractable losses: for a Gaussian likelihood the Bethe loss equals the exact marginal likelihood, and for a probit likelihood it reduces to a closed form via the probit-Gaussian convolution. Both objectives sit strictly between MAP and the ELBO ( L_\text{MAP} \leq L_\text{Bethe} \leq L_\text{ELBO} ), removing the structural Jensen gap that no choice of variational family can close. The Z-consistent prior formulation makes the prior precision a differentiable parameter, enabling empirical Bayes - joint optimization of weights, covariance, and hyperparameters - in a single gradient pass, with no cross-validation or outer loop. All variants admit a closed-form predictive at MAP-equivalent inference cost, in contrast to ensemble and sampling-based methods. On 8 UCI regression and 12 UCI classification benchmarks evaluated under a single shared hyperparameter regime, Bethe is competitive with standard reference methods at single-pass cost. Independently, joint single-pass empirical Bayes matches grid-search cross-validation of the prior precision on essentially all dataset-variant combinations, eliminating the outer hyperparameter loop without measurable cost. Isolated optimization gaps on a few datasets reflect numerical rather than principled limitations of the framework.

[LG-257] Generalized Wasserstein Flow Matching: Transport Plans Everywhere All at Once

链接: https://arxiv.org/abs/2605.08424
作者: Moritz Piening,Richard Duong,Gabriele Steidl
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR)
*备注:

点击查看摘要

Abstract:Flow matching has recently emerged as a flexible and efficient framework for generative modelling by learning deterministic transport dynamics between probability measures. In this work, we extend flow matching to the space of probability measures over probability measures, introducing a Wasserstein-on-Wasserstein (WoW) formulation. Leveraging the nested Wasserstein geometry, we show that measures over transport plans naturally induce velocity fields that realize metameasure flows. This yields a principled generalization of Wasserstein flow matching via coupled outer and inner transport plans. To address the substantial computational cost of WoW transport, we propose scalable approximations based on sliced and linear Wasserstein distances, enabling efficient training while promoting numerically stable, near-straight trajectories. Our framework unifies and extends existing approaches to point cloud and set generation, providing a practical and theoretically grounded method for generative modelling in WoW spaces.
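
For readers unfamiliar with the sliced approximation mentioned above, the following is a standard Monte-Carlo sliced Wasserstein-2 estimator between two equal-sized point clouds (a generic textbook construction, not the paper's WoW implementation):

```python
import numpy as np

# Generic sketch: sliced Wasserstein-2 distance between two empirical point
# clouds, using random 1-D projections where optimal transport is just sorting.

def sliced_w2(x, y, n_proj=128, seed=0):
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)
        px, py = np.sort(x @ theta), np.sort(y @ theta)   # 1-D OT = sort
        total += np.mean((px - py) ** 2)
    return np.sqrt(total / n_proj)

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=(500, 3))
b = rng.normal(0.5, 1.0, size=(500, 3))
print("sliced W2:", sliced_w2(a, b))
```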

[LG-258] Central Limit Theorem for Two-Time-Scale Approximate Distributionally Robust RL

链接: https://arxiv.org/abs/2605.08417
作者: Shengbo Wang,Zexi Zhang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Designing model-free algorithms for distributionally robust reinforcement learning (DRRL) poses fundamental challenges. The robust Bellman operator is nonlinear in the transition kernel, which makes one-sample Bellman updates biased, while the adversarial optimization underlying robustness makes robust evaluation computationally demanding. To address these difficulties, we consider the natural small-ambiguity regime under Kullback–Leibler ambiguity sets and propose an approximate DRRL framework based on a first-order expansion of the relevant robust functional. This yields an approximate robust Bellman equation that removes the adversarial optimization while remaining first-order accurate in the ambiguity radius. To learn the fixed point of this approximate equation, we propose Mean-Variance Stochastic Approximation (MVSA), a model-free algorithm that uses only one-sample updates. This is achieved via a lifted stochastic approximation dynamics and a two-time-scale design. We then prove convergence and a central limit theorem for MVSA: its main iterate satisfies a central limit theorem at the canonical n^{-1/2} scale, with explicitly characterized asymptotic covariances. Finally, we validate our theoretical findings with a numerical experiment.
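
For orientation, the kind of first-order expansion invoked here is the standard one for KL ambiguity sets; the statement below is a generic textbook-style form consistent with the "mean-variance" name, not a quote of the paper's result:

```latex
% Small-radius expansion of the KL-robust value of a bounded return V under
% nominal distribution P (standard result; notation is ours, not the paper's):
\inf_{Q:\,\mathrm{KL}(Q\|P)\le\delta} \mathbb{E}_Q[V]
  \;=\; \mathbb{E}_P[V] \;-\; \sqrt{2\delta\,\mathrm{Var}_P[V]} \;+\; o(\sqrt{\delta}),
% so an approximate robust Bellman backup only needs running estimates of a mean
% and a variance -- exactly the two quantities a two-time-scale, one-sample
% scheme can track without solving the inner adversarial problem.
```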

[LG-259] AdamFLIP: Adaptive Momentum Feedback Linearization Optimization for Hard Constrained PINN Training

链接: https://arxiv.org/abs/2605.08408
作者: Binghang Lu,Runyu Zhang,Changhong Mou,Na Li,Guang Lin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Physics-informed neural networks (PINNs) provide a flexible framework for solving forward and inverse problems governed by partial differential equations (PDEs), but standard PINN training typically relies on soft penalty formulations that combine PDE residuals, data mismatch, and initial/boundary conditions using manually chosen weights. This often leads to ill-conditioning, sensitivity to loss weights, and poor constraint satisfaction. In this work, we reformulate PINN training as an equality-constrained optimization problem and propose a novel Adaptive Momentum Feedback Linearization Optimization for Hard Constrained PINN (AdamFLIP). The key idea is to view the constraint residuals as the output of a controlled dynamical system and to compute the Lagrange multiplier as a feedback input that locally drives these residuals toward stable linear contraction dynamics. AdamFLIP then applies Adam-style first- and second-moment adaptation to the resulting feedback-linearized Lagrangian gradient, combining principled constraint handling with the scalability and robustness of adaptive neural-network optimization. We test AdamFLIP on a range of benchmark forward and inverse PDE problems, and it consistently outperforms both the standard soft-constrained PINN and state-of-the-art constrained optimizers. Specifically, on the Navier–Stokes equations benchmark, AdamFLIP reduces relative L_2 error by more than two thirds for the predicted solution compared to the next best method. Our AdamFLIP framework provides an effective and computationally scalable hard-constraint optimization method for PINN training.

[LG-260] Geometry-Aware Discretization Error of Diffusion Models

链接: https://arxiv.org/abs/2605.08392
作者: Samuel Hurault,Thomas Moreau,Gabriel Peyré
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Practical diffusion sampling is a numerical approximation problem: under a fixed inference budget, one must simulate a reverse-time ODE or SDE using only a limited number of denoising steps, so discretization error is often the dominant source of error. Existing non-asymptotic analyses provide convergence guarantees, but are typically too loose and too insensitive to diffusion parameters to guide practical design: broad families of schedules receive the same rates, which depend on coarse worst-case quantities such as the dimension or the drift Lipschitz constant. We take a less ambitious but more informative route. In the exact-score setting, we derive first-order asymptotic expansions of the Euler-Maruyama weak and Fréchet discretization errors. These formulas hold for general smooth reverse diffusions and become fully explicit under Gaussian data. They show how discretization error adapts to the geometry of the data through the covariance spectrum, and how this geometry interacts with key diffusion parameters, including the diffusion schedules and the diffusion-term coefficient. This yields tractable objectives for geometry-aware parameter optimization. Finally, we show that the qualitative predictions of the Gaussian formulas remain robust across diffusion sampling problems with different geometries, including image generation on different datasets and image posterior sampling.

[LG-261] SACHI: Structured Agent Coordination via Holistic Information Integration in Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2605.08391
作者: Nikunj Gupta,James Zachary Hare,Jesse Milzman,Rajgopal Kannan,Viktor Prasanna
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cooperative multi-agent reinforcement learning agents that act on partial local observations face a fundamental information bottleneck: the knowledge needed to select jointly optimal actions is scattered across the team, yet each agent must commit to a decision without access to its teammates’ observations, intentions, or chosen actions. Existing methods either ignore this bottleneck, compress it into a scalar mixing signal, or route around it with learned communication channels. Framing action coordination as a problem of structured information integration among agents, we propose \textitstructured agent coordination via holistic information integration, or SACHI, in which graph transformer convolutions over an inter-agent coordination graph enrich each agent’s representation with receiver-sensitive, content-dependent signals from teammates prior to action selection. We evaluate SACHI across five cooperative tasks spanning spatial, communicative, and adversarial coordination challenges against twelve baselines. SACHI consistently matches or outperforms the best baseline on every task, and rigorous aggregate statistical analyses, including normalized metrics with bootstrap confidence intervals, Friedman ranking, and performance profiling, confirm that this advantage is statistically significant, robust across environments, and not attributable to increased model capacity. Parameter-matched ablations further trace the source of the gains to a single architectural property: the degree of content-dependence in the message-passing operator.

[LG-262] he Power of Second Order Methods for Sequence Preconditioning

链接: https://arxiv.org/abs/2605.08390
作者: Annie Marsden,Elad Hazan
类目: Machine Learning (cs.LG)
*备注: 14 pages, 5 figures

点击查看摘要

Abstract:Sequence prediction methods for dynamical systems with long memory, i.e. marginally stable systems, typically achieve regret that grows polynomially with the hidden dimension of the underlying generative model. Universal Sequence Preconditioning (USP) is a method that compresses any sequence which comes from a linear dynamical system into a “preconditioned” sequence which requires exponentially shorter memory for accurate prediction. However, the preconditioned sequence yields exponentially larger diameters and gradients, hindering USP from unlocking optimal regret bounds. Inspired by the minimum description length principle, we show that the Vovk-Azoury-Warmuth (VAW) algorithm is naturally matched to the USP regime. Indeed, it takes advantage of the memory compression while remaining robust to the exponential explosion of the diameter. We prove that combining USP with VAW achieves astoundingly strong results: for any marginally-stable linear dynamical system, this algorithm achieves polylogarithmic regret O \left( \log^3 T \right) even in the presence of asymmetric hidden transition matrices. Finally, we extend the applicability of USP beyond bounded-spectrum systems by providing new complex-analytic bounds on Chebyshev polynomials, allowing for systems with constant complex arguments.
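
For reference, the Vovk-Azoury-Warmuth forecaster used here has the standard closed form below (textbook statement, not copied from the paper); its defining feature is that the current covariate enters the regularized second-moment matrix before the prediction is made:

```latex
% Vovk-Azoury-Warmuth prediction at round t (standard form, notation ours):
\hat{y}_t \;=\; x_t^{\top} A_t^{-1} b_{t-1},
\qquad
A_t \;=\; \lambda I + \sum_{s=1}^{t} x_s x_s^{\top},
\qquad
b_{t-1} \;=\; \sum_{s=1}^{t-1} y_s\, x_s,
% which is what keeps the predictor stable even when the preconditioned
% sequence has exponentially large diameter and gradients.
```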

[LG-263] Embedding Dimension Lower Bounds for Universality of Deep Sets and Janossy Pooling

链接: https://arxiv.org/abs/2605.08377
作者: Ali Syed,Aditya Nambiar,Jonathan W. Siegel
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In many practical applications it is important to build symmetries into neural network architectures. Consider the important case of permutation symmetry on point clouds consisting of n points in d dimensions. In this case the network learns a function on a set of n points in \mathbb{R}^d , and a natural paradigm for constructing invariant networks is Janossy pooling, which generalizes the popular Deep Sets architecture. We study the universality of this approach, in particular the important question of how large the embedding dimension must be to guarantee universality of this architecture. Specifically, using a novel technique, we prove new lower bounds on the required size of this embedding dimension. For Deep Sets, this gives the correct minimal dimension up to a constant factor for all d > 1 . For k -ary Janossy pooling, we prove the first non-trivial lower bound on the required embedding dimension when k > 1 .
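
For context, the two architectures whose embedding dimension is being bounded have the standard forms below (generic definitions, not the paper's notation):

```latex
% Deep Sets with embedding dimension m on a set X = \{x_1,\dots,x_n\} \subset \mathbb{R}^d:
f(X) \;=\; \rho\Big(\sum_{i=1}^{n} \phi(x_i)\Big),
\qquad \phi:\mathbb{R}^{d}\to\mathbb{R}^{m},\;\; \rho:\mathbb{R}^{m}\to\mathbb{R},
% and k-ary Janossy pooling, which instead sums \phi over k-tuples of distinct elements:
f(X) \;=\; \rho\Big(\sum_{(i_1,\dots,i_k)\ \mathrm{distinct}} \phi(x_{i_1},\dots,x_{i_k})\Big).
% The question studied above is how large m must be for such models to be universal.
```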

[LG-264] SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution

链接: https://arxiv.org/abs/2605.08366
作者: Mohit Raghavendra,Soham Dan,Miguel Romero Calvo,Yannis Yiming He,Johannes Baptist Mols,Gautam Anand,Cole McCollum,Edgar Arakelyan,Vijay Bharadwaj,Andrew Park,Jeff Da,MohammadHossein Rezaei,Bing Liu,Brad Kenstler,Yunzhong He
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: 10 pages

点击查看摘要

Abstract:We introduce SWE Atlas, a benchmark suite for coding agents spanning three professional software engineering workflows: Codebase QA (124 tasks), Test Writing (90 tasks), and Refactoring (70 tasks). SWE Atlas differs from prior SWE benchmarks in three key ways: it targets underrepresented but practically important task categories, uses comprehensive category-specific evaluation protocols, and adopts under-specified, agentic task formulations that better reflect real-world usage. Its evaluation framework combines programmatic checks with rubric-based assessment. This goes beyond functional correctness, evaluating software engineering quality, including test and refactor completeness, maintainability, reusable abstractions, and codebase hygiene. We evaluate a range of frontier and open-weight models on SWE Atlas and find that GPT-5.4 and Opus 4.7 achieve the strongest overall performance, while even the best open-weight models score poorly. Our analysis suggests that top models rely on extensive codebase exploration and runtime-driven reasoning. However, even top models consistently struggle with subtle edge cases, complex runtime analysis, and adherence to software engineering best practices. Overall, SWE Atlas provides a complementary evaluation suite for measuring both correctness and engineering quality in coding agents.

[LG-265] Convergence Analysis of Newtons Method for Neural Networks in the Overparameterized Limit

链接: https://arxiv.org/abs/2605.08352
作者: Konstantin Riedl,Konstantinos Spiliopoulos,Justin Sirignano
类目: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:A convergence analysis is developed for the regularized Newton method for training neural networks (NNs) in the overparameterized limit. As the number of hidden units tends to infinity, the NN training dynamics converge in probability to the solution of a deterministic limit equation involving a ``Newton neural tangent kernel’’ (NNTK). Explicit rates characterizing this convergence are provided and, in the infinite-width limit, we prove that the NN converges exponentially fast to the target data (i.e., a global minimizer with zero loss). We show that this convergence is uniform across the frequency spectrum, addressing the spectral bias inherent in gradient descent. The eigenvalues of the NTK for gradient descent accumulate at zero, leading to slow convergence for target data with high-frequency components. In contrast, the NNTK has uniformly lower bounded eigenvalues if the regularization parameter is selected appropriately, allowing Newton’s method to converge more quickly for data with high-frequency components. Mathematical challenges that need to be addressed in our analysis include the implicit parameter update of the Newton method with a potentially indefinite Hessian matrix and the fact that the dimension of this linear system of equations tends to infinity as the NN width grows. This complicates deriving the training dynamics in the overparameterized limit as well as proving the convergence of the finite-width dynamics thereto. The analysis identifies a scaling formula for selecting the regularization parameter, which we show can vanish at a suitable rate as the number of hidden units becomes larger. We prove that, for sufficiently large numbers of hidden units, the regularized Hessian remains positive definite during training and the Newton updates for individual NN parameters converge to zero, showing that the model behaves as a linearization around the initialization.
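
For orientation, the update being analyzed is of the generic regularized-Newton form below (our paraphrase of the standard scheme, not the paper's exact statement):

```latex
% Regularized Newton step on the training loss L with (possibly indefinite) Hessian H:
\theta_{k+1} \;=\; \theta_k \;-\; \eta\,\big(H(\theta_k) + \rho I\big)^{-1}\,\nabla L(\theta_k),
% where the analysis above characterizes how the regularization \rho can be scaled
% down as the network width grows while keeping H + \rho I positive definite
% throughout training.
```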

[LG-266] What Time Is It? How Data Geometry Makes Time Conditioning Optional for Flow Matching

链接: https://arxiv.org/abs/2605.08344
作者: Alec Helbling,Sebastian Gutierrez Hernandez,Benjamin Hoover,Duen Horng Chau,Parikshit Ram
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent work has shown that flow matching models can be trained without explicit time conditioning, challenging the standard view that the interpolation time is needed to disambiguate velocity targets. But why should a time-blind model work at all? Decomposing the time-blind flow matching loss, we identify two sources of irreducible error: a coupling variance, which arises from ambiguous velocity targets induced by how noise and data points are paired, and the time-blindness gap, which is the additional error caused by ignoring time. This gap shows that time-blind training is strictly harder than conventional training, reinforcing the puzzle that time-blind models work so well in practice. We resolve this tension by showing that the geometry of high-dimensional data makes time identifiable directly from noisy observations. When data concentrates near a k -dimensional subspace, time can be recovered from the statistical structure of noisy interpolants in directions orthogonal to the data; under a spiked-covariance model, this yields a closed-form estimator that recovers t from a single observation z at rate O(1/\sqrt{d-k}) for ambient dimension d . As a consequence, we prove that the time-blindness gap is asymptotically negligible relative to the coupling variance. We empirically demonstrate our identifiability result on real-world data and show that changing the coupling has a much larger effect on loss and sample quality than removing time conditioning across CIFAR-10, CelebA-HQ, and FFHQ. These results explain why time-blind flow matching works and show that the main practical lever is the choice of coupling, not explicit time conditioning.
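
The identifiability argument can be reproduced in a toy setting: when the data occupies only k of d coordinates, the coordinates of z_t = (1-t)·noise + t·x orthogonal to the data have standard deviation (1-t). The sketch below is our own illustration (known subspace, Gaussian noise), not the paper's spiked-covariance estimator:

```python
import numpy as np

# Toy illustration (our own construction): estimating the interpolation time
# of a single noisy interpolant from the coordinates orthogonal to the data.

rng = np.random.default_rng(0)
d, k = 4096, 32
t_true = 0.37

x = np.zeros(d)
x[:k] = 3.0 * rng.normal(size=k)            # data supported on the first k coords
eps = rng.normal(size=d)                    # standard Gaussian noise
z = (1.0 - t_true) * eps + t_true * x       # a single noisy interpolant

# Orthogonal coordinates are (1 - t) * noise, so their RMS reveals t.
t_hat = 1.0 - np.sqrt(np.mean(z[k:] ** 2))
print(f"t_true={t_true:.3f}  t_hat={t_hat:.3f}  "
      f"typical error ~ O(1/sqrt(d-k)) = {1.0/np.sqrt(d - k):.3f}")
```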

[LG-267] Private Vertical Federated Inference for Time-Series

链接: https://arxiv.org/abs/2605.08343
作者: Lucas Fenaux,Larris Xie,Aditya Bang,Alex Zhang,Kevin Wilson,Florian Kerschbaum
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Institutions may benefit from collaborative inference on time-series data. In settings where privacy is necessary, multi-party computation (MPC) is a straightforward approach to providing strong guarantees, yet it remains prohibitively expensive and scales poorly with modern transformer architectures. Vertical Federated Learning (VFL) offers efficiency but suffers from privacy leakage at the embedding level, and securing the entire VFL model head via MPC remains prohibitively slow and communication-heavy for larger models. To enable practical, secure inference at scale, we propose “Public/Private Hybrid Head-VFL” (PPHH-VFL). This hybrid architecture splits the model head into an efficient plaintext public head and a secure, lightweight MPC private head. By applying adversarial training to the public embeddings, we mitigate privacy leakage; concurrently, the small private head securely preserves the flow of sensitive information needed for high downstream utility. Empirical evaluations on models ranging up to 86 million parameters demonstrate that PPHH-VFL accelerates inference by up to six orders of magnitude compared to end-to-end MPC. Compared to a standard VFL+MPC baseline, our approach scales significantly better, achieving a speedup of up to 44.4x in WAN and a 91.2x reduction in communication costs (dropping from 1.7 GB to 19 MB per batch), while simultaneously improving downstream classification accuracy by 2.50% and regression RMSE by 40.7%.

[LG-268] Reflective Prompted Policy Optimization: Trajectory-Grounded Revision and Salience Bias

链接: https://arxiv.org/abs/2605.08315
作者: Rahaf Abu Hara,Vaibbhav Murarri,Claudio Zito
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Existing LLM-based policy optimizers see only scalar rewards: that a policy scored 0.45, but not whether the agent got stuck in a loop, fell into a hole on the third step, or performed well on 19 out of 20 rollouts and failed catastrophically on one. We propose Reflective Prompted Policy Optimization (R2PO), a two-stage LLM framework for policy search over compact policy classes that augments scalar reward feedback with trajectory-level behavioral evidence. A Search-LLM proposes candidate policy parameters; the environment executes them; a Critic-LLM inspects the resulting rollouts and proposes targeted revisions grounded in observed states, actions, and rewards. Across ten environments, ablations show R2PO’s gains require separating global search from behavior-grounded revision and using selection to filter high-variance edits. We further identify a dominant failure mode, salience bias: when presented with multiple rollouts, the Critic-LLM fixates on improving a single failure even when most trajectories succeed. In a three-trajectory variant where the Critic-LLM sees the best, worst, and median rollout, this behavior explains 76.6% of regressions on CartPole. R2PO mitigates this by reasoning over aggregate rollout statistics, median-trajectory selection, and a revision rule. Using a 20B open-weight model, R2PO achieves the highest mean best reward across all ten environments, reaches near-optimal performance substantially earlier (e.g., near-maximum CartPole reward within ~500 episodes), and trains far more stably than both deep RL and prior LLM-based methods. These results show that treating trajectories as first-class in-context evidence, rather than artifacts reduced to scalar returns, changes how even comparatively small LLMs search over policy spaces, enabling them to learn faster, diagnose more precisely, and reliably improve external controllers.

[LG-269] Exactness Matters for Physical Rule Enforcement

链接: https://arxiv.org/abs/2605.08285
作者: Bum Jun Kim
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注: 28 pages, 6 figures

点击查看摘要

Abstract:Autoregressive scientific forecasters often enforce physical or structural constraints by repairing each predicted state before feeding it back into the model. However, it remains unclear when stronger physical rule enforcement becomes reliable and when it becomes a source of distribution shift. We study this question through operator exactness, meaning whether the repair map is the identity on the target manifold and is aligned with the target geometry. We compare raw forecasting, post hoc repair, and in-loop repair across periodic incompressible Navier–Stokes, non-periodic CFDBench flows, and a hierarchical-forecasting support task. In the exact periodic regime, Fourier projection substantially improves rollout accuracy. On the NS-128 benchmark, a strong Raw-FNO has a final-step rollout MSE at horizon 100 of (9.390 \pm 6.290)\times 10^{-5} , and post hoc and in-loop projection reduce it to (1.130 \pm 0.165)\times 10^{-6} and (5.370 \pm 0.113)\times 10^{-7} . However, once an exact projection is unavailable and only approximate boundary-preserving cleanup is available, the ordering changes. Across cavity, tube, dam, and cylinder flow, stronger Poisson-based cleanup can reduce divergence while worsening rollout error; target-distortion MSE predicts this harm far better than a linear-system residual. Controlled mismatch, screened cleanup, adaptive gating, and external-backbone checks show that the best approximate-regime operating point can be raw or near-identity. Hierarchical forecasting gives the same broader pattern. Exact forecast reconciliation is a stable baseline, whereas blended top-down repair, a validation-tuned interpolation toward historical-proportion top-down reconciliation, is dataset-dependent. Thus, constraint enforcement should be benchmarked by operator–data alignment before enforcement strength.
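
In the exact periodic regime, the repair operator referred to above is the spectral (Leray) projection onto divergence-free fields. A minimal sketch of that projection for a periodic 2-D velocity field, under our own grid and normalization assumptions, is:

```python
import numpy as np

# Standard construction (assumptions ours): Fourier (Leray) projection of a
# periodic 2-D velocity field (u, v) onto divergence-free fields.

def project_div_free(u, v):
    n = u.shape[0]
    kx = np.fft.fftfreq(n, d=1.0 / n)
    KX, KY = np.meshgrid(kx, kx, indexing="ij")
    k2 = KX**2 + KY**2
    k2[0, 0] = 1.0                          # avoid division by zero at k = 0
    uh, vh = np.fft.fft2(u), np.fft.fft2(v)
    div = KX * uh + KY * vh                 # spectral divergence (up to factor i)
    uh -= KX * div / k2                     # remove the curl-free component
    vh -= KY * div / k2
    return np.fft.ifft2(uh).real, np.fft.ifft2(vh).real

rng = np.random.default_rng(0)
u, v = rng.normal(size=(64, 64)), rng.normal(size=(64, 64))
up, vp = project_div_free(u, v)

# Spectral divergence of the projected field is numerically zero.
kx = np.fft.fftfreq(64, d=1.0 / 64)
KX, KY = np.meshgrid(kx, kx, indexing="ij")
div_hat = KX * np.fft.fft2(up) + KY * np.fft.fft2(vp)
print("max |div_hat| after projection:", np.abs(div_hat).max())
```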

[LG-270] GPU-Accelerated Synthesis of Mixed-Boolean Arithmetic: Beyond Caching

链接: https://arxiv.org/abs/2605.08243
作者: Gabriel Bathie,Baptiste Mouillon,Nathanaël Fijalkow
类目: Programming Languages (cs.PL); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Synthesizing Mixed-Boolean Arithmetic (MBA) expressions from input-output examples is central to program deobfuscation and also useful for compiler optimization, reverse engineering, and cryptanalysis. Existing MBA synthesizers are typically CPU-based and scale poorly on large specifications or complex targets. Recent GPU-accelerated synthesis methods achieve large speedups in qualitative settings, but they depend on caching observationally equivalent candidates; this strategy breaks down for MBA because candidate outputs are quantitative bitvectors and the behavioral space is enormous. We present SIMBA (Synthesis of Mixed-Boolean Arithmetic), a GPU-accelerated MBA synthesizer built around cache-free bottom-up enumeration. SIMBA avoids language caches entirely and uses a GPU-oriented enumeration design that keeps work local and highly parallel. In experiments, SIMBA is substantially faster than prior MBA synthesis tools, handles larger specifications, and reaches expression sizes that existing methods fail to solve. These results establish cache-free GPU synthesis as a practical and scalable approach for quantitative domains, and identify it as a strong alternative to cache-centric designs.

[LG-271] Distributional Spectral Diagnostics for Localizing Grokking Transitions

链接: https://arxiv.org/abs/2605.08237
作者: Ziyue Wang,Yufeng Ying,Takafumi Kanamori
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In grokking, a model first fits the training data while test accuracy remains low, and only later begins to generalize. We ask whether this transition can be localized from observed training trajectories before the test accuracy rises, and formulate grokking transition localization as a diagnostic problem with an explicit threshold/FPR/lead-time trade-off. Task-dependent observables are summarized as empirical distributions, mapped to Wasserstein/quantile coordinates, and analyzed by Hankel dynamic mode decomposition (DMD); the resulting reconstruction residual, together with spectrum and effective rank, forms the diagnostic output. On held-out modular-addition Transformer runs, the residual achieves AUROC \approx 0.93 for grokking-vs-non-grokking discrimination at the run level; under a fixed sustained-threshold operating rule, true-positive alarms can precede onset, with lead time reported jointly with false-alarm rate and uncertainty intervals. Perturbation experiments show that, in the tested wd=1 pool, high-residual windows exhibit about 3\times larger short-horizon perturbation deviation than low-residual windows. In a same-data norm-window control, perturbation sensitivity aligns with the residual ordering rather than total-parameter-norm ordering, suggesting that the residual is not merely a total-norm proxy at the window level in the studied wd=1 dynamics. Norm signals remain strong run-level regime indicators, and log-probability performs best among the observables tested under the current protocol. We position the residual as a window-level monitoring and localization signal in the studied modular-arithmetic Transformer settings, not a universal early-warning predictor or an intervention rule.
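
The diagnostic pipeline's core step, Hankel DMD with a reconstruction residual, can be illustrated on a scalar signal. The sketch below is a deliberately simplified stand-in (made-up signals, no Wasserstein/quantile mapping), showing only why the residual grows when the underlying dynamics change:

```python
import numpy as np

# Simplified sketch (not the paper's pipeline): Hankel DMD on a scalar
# observable, with the one-step reconstruction residual as the diagnostic.

def hankel_dmd_residual(signal, delay=20, rank=5):
    cols = len(signal) - delay
    H = np.stack([signal[i:i + delay] for i in range(cols)], axis=1)  # Hankel
    X, Y = H[:, :-1], H[:, 1:]
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U, s, V = U[:, :rank], s[:rank], Vt[:rank].T
    A_tilde = U.T @ Y @ V @ np.diag(1.0 / s)     # reduced DMD operator
    Y_hat = U @ (A_tilde @ (U.T @ X))            # one-step reconstruction
    return np.linalg.norm(Y - Y_hat) / np.linalg.norm(Y)

t = np.linspace(0, 20, 400)
smooth = np.exp(-0.05 * t) * np.sin(2 * t)                 # near-linear dynamics
kinked = np.where(t < 10, np.sin(2 * t), np.sin(6 * t))    # regime change at t=10

print("residual (smooth)       :", hankel_dmd_residual(smooth))
print("residual (regime change):", hankel_dmd_residual(kinked))
```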

[LG-272] Hierarchical Multi-Fidelity Learning for Predicting Three-Dimensional Flame Wrinkling and Turbulent Burning Velocity

链接: https://arxiv.org/abs/2605.08232
作者: Saghar Zolfaghari,Yu Xie,Junfeng Yang,Safa Jamali
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:High-fidelity experimental characterization of turbulent premixed flames remains limited by the cost and complexity of advanced diagnostics, particularly under elevated pressures and intense turbulence where measurements of coupled flame morphology and burning dynamics are sparse. Here, we develop a hierarchical multi-fidelity neural network framework (MuFiNNs) to address this challenge by integrating sparse high-fidelity experimental data with structured low-fidelity representations encoding dominant physical trends. The framework combines hierarchical low-fidelity construction with nonlinear multi-fidelity correction to learn coupled geometric and reactive flame behavior while recovering discrepancies that simplified models alone cannot capture. The methodology is applied to expanding turbulent premixed flames to predict three-dimensional flame wrinkling dynamics and turbulent mass burning velocity across varying fuels, pressures, and turbulence intensities. Using experimentally informed low-fidelity trend models with sparse high-fidelity measurements, MuFiNNs accurately reconstruct observed flame behavior, enable interpolation across unseen operating conditions, and demonstrate robust extrapolation beyond the training domain. Importantly, the framework remains effective in noisy, weakly structured, or experimentally inaccessible regimes where conventional data-driven approaches often fail. These results show that hierarchical multi-fidelity learning provides a scalable and physically grounded strategy for predictive combustion modeling in data-limited regimes. More broadly, this work establishes multi-fidelity scientific machine learning as a practical framework for extracting physically meaningful predictive models from sparse experiments, particularly for instability-dominated and turbulence-sensitive reactive flows where high-fidelity data acquisition is demanding.

[LG-273] Social Determinants of Health and Fentanyl Overdose Mortality Across US Counties: An XGBoost and SHAP Analysis Identifying Silent Risk Counties and Treatment Deserts

链接: https://arxiv.org/abs/2605.08230
作者: Kabi Raj Tiruwa(Clark University),Abhisan Ghimire(Clark University),Anuj Kumar Shah(Yeshiva University)
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: 21 pages, 7 figures, 4 tables

点击查看摘要

Abstract:Background: Fentanyl overdose deaths are still increasing across the U.S. We do not fully understand which county-level social and structural conditions lead to higher overdose death rates. Social determinants of health, including disability, treatment access, and behavioral health issues, may help identify vulnerable counties before deaths become severe. No earlier study has used explainable machine learning with SHAP attribution on 2022 CDC WONDER data to study treatment access gaps and silent risk counties. Methods: We combined data from four government sources for 975 U.S. counties, including CDC WONDER (2022) overdose mortality data, CDC Social Vulnerability Index (SVI), CDC PLACES health behavior data, and Area Health Resources Files. An XGBoost model was used to predict overdose mortality risk using Standardized Mortality Ratio (SMR). Five-fold cross-validation was used to test model accuracy, and SHAP values were used to show which factors increase or decrease risk. Results: XGBoost outperformed all tested models (Spearman rho=0.67, R2=0.457, MAE=0.409, high-risk recall=71.1%). Top predictors were disability rate, hypertension, smoking, and lack of vehicle access. Treatment desert counties had 52.6% higher overdose mortality (SMR 1.786 vs 1.170; p < 0.0001). K-means identified 143 silent risk counties. Overdose deaths were spatially clustered (Moran’s I=0.505, p=0.001) with 75 hotspots and 136 coldspots. Suppressed counties were 58.2% of WONDER counties, mostly rural (72%) and treatment deserts (65%). Conclusions: County-level SDOH factors predict overdose deaths, especially disability, treatment access, and behavioral health burden. MOUD expansion should prioritize treatment desert counties, and silent risk counties need early intervention before mortality worsens.
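
The modeling recipe, gradient-boosted trees explained with SHAP, is standard. The sketch below uses synthetic stand-in data and invented column names (not the study's dataset, features, or hyperparameters) just to show the shape of the pipeline:

```python
import numpy as np
import xgboost as xgb
import shap

# Synthetic stand-in for the county-level pipeline; feature names and the
# relationship to the target are hypothetical, not the study's data.

rng = np.random.default_rng(0)
n = 975
features = ["disability_rate", "hypertension", "smoking", "no_vehicle_access",
            "poverty_rate", "moud_providers_per_100k"]
X = rng.uniform(0, 1, size=(n, len(features)))
# Hypothetical signal: SMR rises with disability/smoking, falls with treatment access.
y = 1.0 + 1.5 * X[:, 0] + 0.8 * X[:, 2] - 0.9 * X[:, 5] + rng.normal(0, 0.2, n)

model = xgb.XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
model.fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
mean_abs = np.abs(shap_values).mean(axis=0)   # global feature importance
for name, v in sorted(zip(features, mean_abs), key=lambda kv: -kv[1]):
    print(f"{name:28s} {v:.3f}")
```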

[LG-274] A Simulated Federated Analysis of MS-Induced Brain Lesions

链接: https://arxiv.org/abs/2605.08223
作者: Evelyn Trautmann,Joël Federer-Gsponer,Markus C. Elze,José-Tomás Prieto
类目: Machine Learning (cs.LG)
*备注: Accepted for publication at The 39th IEEE International Symposium on Computer-Based Medical Systems

点击查看摘要

Abstract:Federated techniques such as federated learning and federated analysis have emerged as a powerful paradigm for enabling multi-center research on sensitive clinical data while preserving patient privacy. In this study, we introduce a simulation framework that emulates a real-world federated research project focused on the analysis of multiple sclerosis (MS) patient data. The project comprises two components: an image segmentation task and a clinical data analysis task, where federated variants of survival analysis and Principal Component Analysis (PCA) are employed. To capture the complexity and heterogeneity of real clinical datasets, we construct a federation of high-fidelity synthetic cohorts designed to mirror MS-related clinical and demographic characteristics, while the imaging component leverages publicly available real-world datasets. Our simulation replicates key elements of authentic federated workflows, including distributed data governance, site-specific preprocessing, model training across isolated nodes, and the secure aggregation of analytical outputs. This framework provides a realistic testbed for developing, evaluating, and benchmarking federated learning methods in the context of MS research.

[LG-275] Learngene Search Across Multiple Datasets for Building Variable-Sized Models

链接: https://arxiv.org/abs/2605.08209
作者: Boyu Shi,Junbo Zhou,Chang Liu,Xu Yang,Qiufeng Wang,Xin Geng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning methods are widely used under diverse resource constraints, resulting in models of varying sizes, such as the Vision Transformer (ViT) series. Deploying these models typically requires costly pretraining and finetuning. The Learngene paradigm addresses this issue by extracting transferable components, called learngenes, from a pretrained ancestry model (Ans-Net) to initialize variable-sized descendant models (Des-Nets). Existing learngene extraction methods rely on a single dataset, limiting downstream performance. To address this limitation, we propose Learngene Search Across Multiple Datasets for Building Variable-Sized Models (LSAMD). LSAMD expands the Ans-Net into a searchable super Ans-Net with dataset-specific blocks and dataset adapters (DADs). During training, LSAMD searches for an optimal architecture path for each dataset. The base blocks most frequently selected across datasets are extracted as learngenes for initializing Des-Nets. Experiments on multiple datasets show that LSAMD achieves performance comparable to pretrain-finetune methods while significantly reducing storage and training costs.

[LG-276] ExecuTorch – A Unified PyTorch Solution to Run AI Models On-Device

链接: https://arxiv.org/abs/2605.08195
作者: Mergen Nachin,Digant Desai,Sicheng Stephen Jia,Chen Lai,Mengwei Liu,Jacob Szwejbka,Raziel Alvarez,RJ Ascani,Dave Bort,Manuel Candales,Andrew Caples,Yanan Cao,Zhengxu Chen,Soumith Chintala,Gregory Comer,Tanvir Islam,Songhao Jia,Tarun Karuturi,Jack Khuu,Abhinay Kukkadapu,Tugsbayasgalan Manlaibaatar,Andrew Or,Kimish Patel,Siddartha Pothapragada,Lucy Qiu,Supriya Rao,Orion Reblitz-Richardson,Max Ren,Scott Roy,Anthony Shoumikhin,Scott Wolchok,Guang Yang,Angela Yi,Martin Yuan,Hansong Zhang,Jack Zhang,Jerry Zhang,Shunting Zhang,C. Cagatay Bilgin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Local execution of AI on edge devices is important for low latency and offline operation. However, deploying models on diverse hardware remains fragmented, often requiring model conversion or complete reimplementation outside the PyTorch ecosystem where the model was originally authored. We introduce ExecuTorch, a unified PyTorch-native deployment framework for edge AI. ExecuTorch enables seamless deployment of machine learning models across heterogeneous compute environments. It scales from embedded microcontrollers to complex system-on-chips (SoCs) with dedicated accelerators, powering devices ranging from wearables and smartphones to large compute clusters. ExecuTorch preserves PyTorch semantics while allowing customization, support for optimizations like quantization, and pluggable execution “backends”. These features together enable fast experimentation, allowing researchers to validate deployment behavior entirely within PyTorch, bridging the gap between research and production.

[LG-277] Synergistic Simplex: Cooperative Runtime Assurance for Safety-Critical Autonomous Systems

链接: https://arxiv.org/abs/2605.08190
作者: Ayoosh Bansal,Mikael Yeghiazaryan,Artyom Khachatryan,Tianyi Zhu,Hunmin Kim,Naira Hovakimyan,Lui Sha
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Autonomous systems increasingly rely on machine-learning (ML) components for safety-critical tasks such as perception and control in autonomous vehicles (AVs). While ML enables essential capabilities, it inevitably exhibits long-tail faults that make it unsuitable for safety-critical tasks. Runtime assurance (RTA) mitigates this issue by pairing ML components with verifiable safety monitors, e.g., Control Simplex and Perception Simplex architectures. However, the limited performance of safety monitors remains a major bottleneck. The Synergistic Simplex (SS) architecture improves system performance by enabling bidirectional integration between ML components and safety monitors while preserving formal safety guarantees. The key innovation here is allowing safety monitors to use ML outputs, which is typically prohibited in RTA systems. We formally derive conditions under which this integration preserves safety and demonstrate the performance benefits. We present the design, analysis, and evaluation of SS for AV obstacle detection.

[LG-278] Physics-Modeled Neural Networks

链接: https://arxiv.org/abs/2605.08176
作者: Raul Felipe-Sosa,Angel Martin del Rey,Maria Flores Ceballos
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:We introduce \emph{Dynamical Physics-Modeled Neural Networks} (DynPMNNs), a continuous-time deep learning architecture in which each hidden layer is defined as the solution of an ordinary differential equation. Unlike classical feed-forward networks, this approach replaces static activation functions with time-evolving dynamical systems, providing a biologically inspired interpretation of hidden-layer behavior and enabling the integration of physically meaningful models. The framework is rigorously grounded in Reproducing Kernel Banach Spaces (RKBSs), allowing DynPMNNs to be characterized as finite-dimensional solutions of an abstract training problem and revealing structural connections with standard neural networks. We present a concrete implementation based on the FitzHugh–Nagumo model for neuronal activation, where numerical ODE solvers are embedded into the computational graph via Euler-type schemes. Both network weights and dynamical parameters are trained jointly. Through experiments on the California Housing dataset, we compare DynPMNNs with Neural ODEs (NODEs) and Closed-form Continuous-Time Networks (CfCs). Despite using fewer trainable parameters, DynPMNNs achieve competitive performance. These results position DynPMNNs as a principled bridge between dynamical systems and deep learning, with promising directions for further research in expressivity, stability, and physics-based modeling.

[LG-279] Quantitative Sobolev Approximation Bounds for Neural Operators with Empirical Validation on Burgers Equation

链接: https://arxiv.org/abs/2605.08170
作者: Nicole Hao
类目: Machine Learning (cs.LG); Functional Analysis (math.FA)
*备注:

点击查看摘要

Abstract:Neural operators have emerged as a powerful tool for learning mappings between infinite-dimensional function spaces. However, their approximation properties in Sobolev norms remain poorly quantified, even though these norms control both function values and derivatives and are the natural metrics for PDE well-posedness, stability, and generalization. We develop a functional-analytic framework for operator learning in Sobolev spaces and connect it to the numerical behavior of Fourier Neural Operators (FNOs) on a prototypical PDE. First, for a continuous nonlinear operator \mathcal{G}: H^s(D)\to H^t(D') with s > d/2 and inputs restricted to a compact subset of H^s(D), we prove that \mathcal{G} can be uniformly approximated in H^t-norm by a neural operator with \mathcal{O}(\varepsilon^{-d/s}) trainable parameters. This yields an explicit complexity–error relation of the form \|\mathcal{G}-\mathcal{G}_\theta\|_{H^t} \lesssim C N^{-s/d}. We then study the one-dimensional viscous Burgers solution operator \mathcal{G}: u_0\mapsto u(\cdot,1) on a bounded H^1-ball and train FNOs with an H^1-loss. Across a sweep of model sizes, we obtain test H^1-errors down to \mathcal{O}(10^{-7}) and relative errors of order 10^{-3}, with predictions accurately matching both solutions and spatial derivatives on held-out data. A log-log plot of Sobolev error versus parameter count exhibits an approximate power law \|\mathcal{G}-\mathcal{G}_\theta\|_{H^1} \approx C N^{-\alpha} with empirical exponent \alpha \approx 1.4, and long-horizon training reveals optimization instabilities in large FNOs, providing quantitative evidence that Sobolev-space approximation theory meaningfully predicts neural-operator scaling behavior.

[LG-280] Temporal-Decay Shapley: A Time-Aware Data Valuation Framework for Time-Series Data

链接: https://arxiv.org/abs/2605.08153
作者: Chuwen Pang,Bing Mi,Kongyang Chen
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:With the rapid development of machine learning applications on time-series data, accurately assessing the value of training samples has become essential for data selection, noise detection, and model optimization. However, traditional data valuation methods usually assume that samples are independent and identically distributed, and thus ignore the time-varying nature of sample value in time-series data. This paper proposes an improved temporal Shapley data valuation method that enables accurate sample valuation for time-series data through a temporal decay mechanism and a multi-scale fusion strategy. Specifically, we propose three progressively enhanced temporal Shapley methods. Temporal-Decay Shapley (TDS) incorporates temporal information into Shapley value computation through exponential decay weights; the improved TDS adopts power exponential decay to better adapt to nonlinear temporal drift; and Multi-Scale Temporal-Decay Shapley (MS-TDS) constructs a multi-scale fusion mechanism that balances the value of short-term hotspot samples and long-term foundational samples through parallel multi-scale valuation and sample-level adaptive fusion. Experimental results show that the proposed methods generally outperform traditional methods in noise detection and high-value data identification tasks, with more evident advantages under most strongly temporal settings, thereby effectively improving the accuracy and robustness of data valuation.
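
Below is a minimal illustrative sketch of the basic idea (down-weighting permutation-based Shapley marginal contributions with an exponential temporal decay), not the paper's actual TDS or MS-TDS implementation; the toy data, the logistic-regression utility, and the decay rate `lam` are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy time-stamped training data (older samples have smaller t) and a validation split.
n_train, n_val, d = 40, 60, 5
X_tr, X_va = rng.normal(size=(n_train, d)), rng.normal(size=(n_val, d))
w_true = rng.normal(size=d)
y_tr = (X_tr @ w_true > 0).astype(int)
y_va = (X_va @ w_true > 0).astype(int)
t = np.sort(rng.uniform(0, 1, size=n_train))       # timestamps in [0, 1]

def utility(idx):
    """Validation accuracy of a model trained on the subset `idx`."""
    if len(idx) == 0 or len(np.unique(y_tr[idx])) < 2:
        return 0.5
    clf = LogisticRegression().fit(X_tr[idx], y_tr[idx])
    return clf.score(X_va, y_va)

# Permutation-sampling Shapley estimates, with each marginal contribution
# down-weighted by an exponential decay in the sample's age (lam controls the decay).
lam, n_perm = 3.0, 100
phi = np.zeros(n_train)
for _ in range(n_perm):
    perm = rng.permutation(n_train)
    prev, current = 0.5, []
    for i in perm:
        current.append(i)
        new = utility(np.array(current))
        phi[i] += np.exp(-lam * (t.max() - t[i])) * (new - prev)
        prev = new
phi /= n_perm

print("top-5 most valuable samples:", np.argsort(phi)[-5:])
```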

[LG-281] A PyTorch Library of Turing-Complete Neural Networks

链接: https://arxiv.org/abs/2605.08150
作者: Jonathan Bates
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a PyTorch package that compiles neural networks and their weights from Turing machine descriptions, producing models that exactly simulate the specified machine without any training. Given a transition function and a set of terminal states, the package constructs a model whose forward pass corresponds to one step of the Turing machine. Two architectures are implemented, each realizing a different theoretical result: (1) a transformer with self-attention, cross-attention, and feedforward layers based on Wei, Chen, and Ma (2021), and (2) a recurrent network based on Siegelmann and Sontag (1995) that encodes the stack in a Cantor set. We develop the constructions from first principles, showing how ReLU networks implement Boolean circuits (AND, OR, NOT, XOR gates and their composition into DNF formulas and binary adders) and how hard attention implements positional lookup on the tape. The package serves as a concrete, runnable reference for the symbolic-neural bridge, and as a foundation for future work on the stability of constructed solutions under gradient-based optimization. Code is available at this https URL.
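
The gate constructions mentioned in the abstract are easy to reproduce. Here is a small stand-alone sketch (in NumPy rather than PyTorch, and not taken from the package itself) showing how single ReLU units realize AND, OR, NOT, and XOR exactly on inputs in {0, 1}.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# For x, y in {0, 1}, single ReLU units realize Boolean gates exactly.
def NOT(x):
    return relu(1.0 - x)                      # 1 - x

def AND(x, y):
    return relu(x + y - 1.0)                  # 1 only when x = y = 1

def OR(x, y):
    return 1.0 - relu(1.0 - x - y)            # 0 only when x = y = 0

def XOR(x, y):
    return relu(x + y - 2.0 * AND(x, y))      # x + y - 2*(x AND y)

for x in (0.0, 1.0):
    for y in (0.0, 1.0):
        print(int(x), int(y), "AND", int(AND(x, y)),
              "OR", int(OR(x, y)), "XOR", int(XOR(x, y)))
```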

[LG-282] DataArc-SynData-Toolkit: A Unified Closed-Loop Framework for Multi-Path Multimodal and Multilingual Data Synthesis

链接: https://arxiv.org/abs/2605.08138
作者: Zhichao Shi,Cehao Yang,Hao Zhou,Xiaojun Wu,Huajie Li,Xuhui Jiang,Chengjin Xu,Yuanzhuo Wang,Jian Guo
类目: Machine Learning (cs.LG)
*备注: 6 pages

点击查看摘要

Abstract:Synthetic data has emerged as a crucial solution to the data scarcity bottleneck in large language models (LLMs), particularly for specialized domains and low-resource languages. However, the broader adoption of existing synthetic data tools is severely hindered by convoluted workflows, fragmented data standards, and limited scalability across modalities. To address these limitations, we develop DataArc-SynData-Toolkit, an open-source framework featuring: (1) a configuration-driven, end-to-end pipeline equipped with an intuitive visual interface and simplified CLI for exceptional usability; (2) a unified, quality-controllable synthesis paradigm that standardizes multi-source data generation to ensure high reusability; and (3) a highly modular architecture designed for seamless multimodal, multilingual, and multi-task adaptation. We apply the toolkit in multiple application scenarios. Experimental results demonstrate that our toolkit achieves an optimal balance between generation efficiency and data quality. By offering an end-to-end and visually interactive pipeline, DataArc-SynData-Toolkit significantly lowers the technical barrier to synthetic data generation and subsequent model training, accelerating its practical deployment in real-world applications.

[LG-283] Dendritic Neural Networks with Equilibrium Propagation

链接: https://arxiv.org/abs/2605.08135
作者: Yoshimasa Kubo
类目: Machine Learning (cs.LG)
*备注: 8 pages

点击查看摘要

Abstract:Equilibrium propagation (EP) is a biologically plausible alternative to backpropagation (BP), but its effectiveness can degrade in deeper and more challenging learning settings. In parallel, dendritic neural networks have demonstrated improved performance and generalization when trained with BP, suggesting that structured, biologically inspired architectures may enhance learning. In this work, we investigate the integration of dendritic neural networks with equilibrium propagation using an advanced EP framework. We evaluate the proposed dendritic EP model on MNIST, Kuzushiji-MNIST (KMNIST), and Fashion-MNIST (FMNIST), considering both shallow and deeper architectures. Our results show that dendritic EP achieves performance comparable to standard EP on simple tasks, while providing consistent improvements on more challenging datasets and deeper models. In particular, dendritic EP significantly outperforms standard EP on KMNIST and FMNIST, and approaches the performance of dendritic networks trained with backpropagation through time. To further understand these improvements, we analyze the evolution of hidden states during the free phase. We observe that dendritic EP exhibits higher activation magnitudes and more distributed hidden-state activity compared to standard EP, indicating that dendritic structure alters the internal network dynamics. These findings suggest that incorporating dendritic structure can enhance the effectiveness of biologically plausible learning algorithms, especially in regimes where standard EP struggles. Our work highlights the importance of architectural design for improving biologically inspired training methods.

[LG-284] Interactive Inverse Reinforcement Learning of Interaction Scenarios via Bi-level Optimization

链接: https://arxiv.org/abs/2605.08131
作者: Yue Mao,Shicheng Liu,Siyuan Xu,Minghui Zhu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Inverse reinforcement learning (IRL) learns a reward function and a corresponding policy that best fit the demonstration data of an expert. However, in the current IRL setting, the learner is isolated from the expert and can only passively observe the expert demonstrations. This limits the applicability of IRL to interactive settings, where the learner actively interacts with the expert and needs to infer the expert’s reward function from the interactions. To bridge the gap, this paper studies interactive IRL (IIRL) where a learner aims to learn the reward function of an expert and a policy to interact with the expert during its interactions with the expert. We formulate IIRL as a stochastic bi-level optimization problem where the lower level learns a reward function to explain the behaviors of the expert, and the upper level learns a policy to interact with the expert. We develop a double-loop algorithm, Bi-level Interactive Scenarios Inverse Reinforcement Learning (BISIRL), which solves the lower-level problem in the inner loop and the upper-level problem in the outer loop. We formally guarantee that BISIRL converges and validate our algorithm through extensive experiments.

[LG-285] Additive Atomic Forests for Symbolic Function and Antiderivative Discovery

链接: https://arxiv.org/abs/2605.08130
作者: Reda Belaiche
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a framework for the simultaneous symbolic recovery of a function and its antiderivative from data. The framework rests on three ideas. First, a derivative algebra: the observation that the product rule \frac{d}{dx}[f \cdot g] = f'g + fg' and the chain rule, applied to a seed set of elementary functions, generate a self-expanding system of function-derivative pairs – a living library that grows each time a new function is discovered. Second, two complementary primitives – EML (e^u - \ln v), which is theoretically complete for all elementary functions, and SOL (\sin u - \cos v), introduced here, which makes trigonometric atoms available at depth 1 instead of depth \sim 8 – that seed the library with core atoms cheaply. Third, additive atomic forests: finite sums of primitive trees, optionally composed via multiplicative nodes, whose derivatives are fitted to data by continuous optimisation or by exhaustive search over the library. Because differentiation of each atom is determined by construction, the forest simultaneously encodes a symbolic expression F and its derivative F'; no symbolic integration step is required. The library is not a fixed object: it self-constructs from a small seed set by recursive application of the product rule, chain rule, and the two primitives, and it can grow as newly discovered functions are folded back in. The larger the library, the richer the expressible class of candidate functions. We give conditional completeness, additive-depth, and analytic simultaneous-recovery results for the framework. Empirically, in our reported runs on 17 classification benchmarks, sparse atom combinations match or exceed XGBoost on 13 datasets while producing interpretable formulas.
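
A minimal sketch of the function-derivative pair idea described above: each atom stores a callable together with its derivative, and product/chain/sum combinators propagate derivatives by construction, so any expression built from atoms carries its own derivative. The specific atoms and names below are my own choices for illustration, not the paper's library.

```python
import math

# Each "atom" is a pair (f, df): the function and its derivative,
# so any expression built from atoms carries its derivative by construction.
def product(a, b):
    f1, d1 = a; f2, d2 = b
    return (lambda x: f1(x) * f2(x),
            lambda x: d1(x) * f2(x) + f1(x) * d2(x))      # product rule

def compose(outer, inner):
    f1, d1 = outer; f2, d2 = inner
    return (lambda x: f1(f2(x)),
            lambda x: d1(f2(x)) * d2(x))                   # chain rule

def add(a, b):
    f1, d1 = a; f2, d2 = b
    return (lambda x: f1(x) + f2(x),
            lambda x: d1(x) + d2(x))                       # additive forest

# Seed library of elementary pairs; discovered functions can be folded back in.
EXP = (math.exp, math.exp)
SIN = (math.sin, math.cos)
SQ  = (lambda x: x * x, lambda x: 2 * x)

# Build F(x) = exp(x) * sin(x^2); F'(x) comes for free from the same object.
F, dF = product(EXP, compose(SIN, SQ))
x, h = 0.7, 1e-6
print(dF(x), (F(x + h) - F(x - h)) / (2 * h))   # analytic vs. finite-difference check
```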

[LG-286] Towards Customized Multimodal Role-Play

链接: https://arxiv.org/abs/2605.08129
作者: Chao Tang,Jianzong Wu,Qingyu Shi,Ye Tian,Aixi Zhang,Hao Jiang,Jiangning Zhang,Yunhai Tong
类目: Machine Learning (cs.LG)
*备注: Code available at this https URL Project page available at this https URL

点击查看摘要

Abstract:Unified multimodal understanding and generation models enable richer human-AI interaction. Yet jointly customizing a character’s persona, dialogue style, and visual identity while maintaining output consistency across modalities remains largely unexplored. To mitigate this gap, we introduce a new task, Customized Multimodal Role-Play (CMRP). We construct the RoleScape-20 dataset comprising 20 characters, including training and evaluation data that cover persona, stylistic descriptions, visual/expressive cues, and text-image interactions. Building on a unified model, we devise UniCharacter, a two-stage training framework containing Unified Supervised Finetuning (Unified-SFT) and character-specific group relative policy optimization (Character-GRPO). Given only 10 images plus corresponding interaction examples, the model acquires the target character and exhibits coherent persona, style, and visual identity in both generated text and images. This process takes about 100 GPU hours. Experiments on the RoleScape-20 dataset show that the proposed method substantially outperforms prior approaches. Ablation studies further validate the effectiveness of our cross-modal consistency design and few-shot customization strategy. We argue that CMRP, coupled with unified modeling, provides a basis for next-generation characterful and immersive interactive agents.

[LG-287] Performance and Energy Trade-Off Analysis of Hierarchical Federated Learning for Plant Disease Classification

链接: https://arxiv.org/abs/2605.08121
作者: Athanasios Papanikolaou,Athanasios Tziouvaras,Pavlos Stoikos,Apostolos Xenakis,Shameem A Puthiya Parambath,George Floros,Enrica Zereik,Ivan Petrovic,Fabio Bonsignorio
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: Accepted for publication at the 2026 ERAS Conference

点击查看摘要

Abstract:Early detection of plant diseases is critical for improving crop productivity, while it also facilitates the foundations of precision agriculture. Recent advances in distributed deep learning have enabled plant disease classification models to be trained across geographically distributed agricultural sensing infrastructures. However, deploying such systems in large-scale Internet of Things (IoT) environments, introduces significant challenges related to computational cost, energy consumption, and system efficiency. In this paper, we present a design-space exploration of hierarchical federated learning architectures for plant disease classification, with a particular focus on the trade-offs between predictive performance and energy efficiency. We further introduce a power- and energy-aware optimization framework that enables the systematic evaluation and selection of model-aggregator configurations under varying deployment constraints. The hierarchical federated architecture organizes distributed clients through intermediate aggregation layers, reducing communication and computational overhead. We evaluate multiple convolutional neural network architectures, including EfficientNet-B0, ResNet-50, and MobileNetV3-Large, in combination with different federated aggregation strategies such as FedAvg, FedProx, and FedAvgM. Experimental results demonstrate that different model-aggregator combinations exhibit distinct performance-energy trade-offs. Consequently, we highlight configurations that achieve competitive diagnostic accuracy and significantly reduce system resource requirements.
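
As a concrete reference point for the aggregation strategies mentioned above, here is a minimal FedAvg weighted-averaging step over client parameter dictionaries; in a hierarchical setup the same rule can be applied again at an edge aggregator. This is a generic sketch, not the paper's pipeline, and the layer names and client sizes are placeholders.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted average of client model parameters (FedAvg).

    client_weights: list of dicts mapping layer name -> np.ndarray
    client_sizes:   number of local samples per client (aggregation weights)
    """
    total = float(sum(client_sizes))
    keys = client_weights[0].keys()
    return {k: sum(w[k] * (n / total) for w, n in zip(client_weights, client_sizes))
            for k in keys}

# Two clients with different data volumes; an edge aggregator could apply the same
# rule again over several such intermediate models in a hierarchical setup.
c1 = {"conv.w": np.ones((3, 3)), "fc.b": np.zeros(4)}
c2 = {"conv.w": np.zeros((3, 3)), "fc.b": np.ones(4)}
edge_model = fedavg([c1, c2], client_sizes=[300, 100])
print(edge_model["conv.w"][0, 0], edge_model["fc.b"][0])   # 0.75 and 0.25
```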

[LG-288] Statistical Inference and Quality Measures of KV Cache Quantisations Inspired by TurboQuant

链接: https://arxiv.org/abs/2605.08114
作者: Paolo D’Alberto
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Mathematical Software (cs.MS)
*备注: 23 pages, 7 Figures, multiple tables, the process is highly assisted by AI

点击查看摘要

Abstract:We analyse three KV cache quantization schemes under a fair bit budget: \textbf{KV} (scalar MSE baseline), \textbf{KQV} (WHT + MSE on K; WHT + MSE + QJL on V), and \textbf{QKQV} (WHT + MSE + QJL on both). Starting from the Beta distribution on the hypersphere, we trace how QJL on K inflates inner product variance by \pi/2, which softmax amplifies nonlinearly via Jensen's inequality, and we present statistical inference and information metrics to highlight practical differences. Three empirical findings emerge. (1) At n=4 (the practically dominant budget), KQV wins on every measure – KL divergence, geometric K error, and 6D distance – across all distributions and ranks tested. (2) The K–V asymmetry is unconditional: QKQV is consistently worse than KQV in KL divergence at every budget and distribution. (3) A budget-dependent crossover exists: QKQV achieves better geometric K reconstruction at n \in \{2,3,5\}, KQV at n \in \{4,6\}, invariant to rank and tail weight – an open rate-distortion problem. \mathrm{KL}(p_\mathrm{ref} \| p_\mathrm{quant}), K-only by construction, bridges K direction error to routing corruption and output collapse. We present a sufficient condition when the Jensen mechanism amplifies superlinearly through the softmax. At n \in \{2,3,5\}, QKQV wins geometrically because this assumption does not bind. At n=4, elevated K error and KL divergence for QKQV strongly suggest the Jensen mechanism is the operative cause of the crossover, providing a new perspective and explanation.
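
For readers unfamiliar with the building blocks, here is a small illustrative sketch of per-row uniform scalar quantization of a toy key cache, with and without an orthonormal Walsh-Hadamard rotation (WHT) applied before quantizing. It omits QJL and is not the paper's KV/KQV/QKQV pipeline; the tensor shapes and bit widths are arbitrary choices for the example.

```python
import numpy as np
from scipy.linalg import hadamard

def quantize_uniform(x, n_bits):
    """Per-row uniform (min-max) scalar quantization followed by dequantization."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    levels = 2 ** n_bits - 1
    scale = (hi - lo) / levels
    q = np.round((x - lo) / scale)
    return q * scale + lo

def quantize_with_wht(x, n_bits):
    """Rotate with a normalized Walsh-Hadamard transform, quantize, rotate back."""
    d = x.shape[-1]
    H = hadamard(d) / np.sqrt(d)               # orthonormal, so H @ H.T = I
    return quantize_uniform(x @ H, n_bits) @ H.T

rng = np.random.default_rng(0)
keys = rng.standard_normal((128, 64))          # toy cache: 128 key vectors, head dim 64
for n in (2, 3, 4):
    e_plain = np.mean((keys - quantize_uniform(keys, n)) ** 2)
    e_wht = np.mean((keys - quantize_with_wht(keys, n)) ** 2)
    print(f"{n}-bit  plain MSE {e_plain:.4f}   WHT MSE {e_wht:.4f}")
```

The rotation mainly helps when activations have outlier channels or heavy tails; on the isotropic Gaussian toy data above the two variants are expected to look similar.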

[LG-289] Geometry-free prediction of inertial lift forces in microfluidic devices using deep learning

链接: https://arxiv.org/abs/2605.08109
作者: Jesse Ward-Bond,Ali Mashadian,Timothy C. Y. Chan,Edmond W. K. Young
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:Inertial microfluidic devices (IMDs) offer low-cost, high-throughput alternative techniques for many traditional particle- (or cell-) manipulation tasks, but simulating them requires being able to predict particle migration, and thus particle lift forces, under a variety of possible channel geometries. Recent work has demonstrated that machine learning models can be used to drastically speed up these numerical simulations, but doing so required training individual models for every unique channel cross-section type (e.g., rectangular, triangular) – shifting the burden from the simulation step to the training step. In this paper, we develop a novel approach for predicting particle lift forces that contains no explicit geometric parameters. We train a neural network model using a new parameter set and show that while it performs comparably to existing models on channel geometries in the training set, it is able to generalize to unseen channel geometries far more effectively. We show that the lift force model developed herein can be easily transferred to particle tracing simulation software, where it is capable of predicting particle migration patterns consistent with the literature across a variety of channel designs.

[LG-290] Distributional Reinforcement Learning via the Cramér Distance

链接: https://arxiv.org/abs/2605.08104
作者: Vanya Aziz,Ivo Nowak,E.M.T Hendrix
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper explores the application of the Soft Actor-Critic (SAC) algorithm within a Distributional Reinforcement Learning setting and introduces an implementation of such algorithm named Cramér-based Distributional Soft Actor-Critic (C-DSAC). The novel approach employs distributional reinforcement learning to represent state-action values, and minimizes the squared Cramér distance for learning the distribution. Empirical results across various robotic benchmarks indicate that our algorithm surpasses the performance of baseline SAC and contemporary distributional methods, with the performance advantage becoming increasingly pronounced in high-complexity environments. To explain the efficiency of the new approach, we conduct an analysis showing that its superior performance is partly due to \textit{confidence-driven} Q-value updates: High-variance target distributions (low confidence in target) lead to more conservative model updates, thereby attenuating the impact of overestimated values. This work deepens the understanding of distributional reinforcement learning, offering insights into the algorithmic mechanisms governing convergence and value estimation.
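
The squared Cramér distance used as the learning objective is simple to compute for distributions supported on a common set of atoms (as in categorical distributional critics): it is the integral of the squared gap between the two CDFs. A small sketch, with made-up atom grids and Gaussian-shaped toy distributions standing in for the critic's target and prediction:

```python
import numpy as np

def cramer_distance_sq(p, q, support):
    """Squared Cramér distance between two distributions on a common 1-D support.

    p, q: probability vectors; support: sorted atom locations.
    The distance is the integral of (F_p - F_q)^2, here a sum over bins.
    """
    cdf_gap = np.cumsum(p) - np.cumsum(q)
    widths = np.diff(support, append=support[-1])   # last bin contributes 0
    return float(np.sum(cdf_gap ** 2 * widths))

# Categorical return distributions on fixed atoms, as in C51-style critics.
atoms = np.linspace(-10.0, 10.0, 51)
target = np.exp(-0.5 * ((atoms - 2.0) / 1.5) ** 2); target /= target.sum()
predict = np.exp(-0.5 * ((atoms + 1.0) / 3.0) ** 2); predict /= predict.sum()
print("squared Cramér distance:", cramer_distance_sq(predict, target, atoms))
```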

[LG-291] Path-Based Gradient Boosting for Graph-Level Prediction

链接: https://arxiv.org/abs/2605.08102
作者: Claudio Meggio,Johan Pensar,Riccardo De Bin
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 20 Pages, 1 figure

点击查看摘要

Abstract:We propose PathBoost, a gradient tree boosting method for graph-level classification and regression that learns discriminative path-based features directly from the input graph structure. Building on a previous work, which was tailored to a specific chemistry application, PathBoost introduces three key extensions: (i) adaptation to binary classification through gradient boosting with a logistic loss, (ii) incorporation of multiple node and edge attributes into the path feature space via a prefix-based decomposition, and (iii) automatic anchor node selection based on categorical attribute diversity, eliminating the need for the user to specify the starting point of the considered path features. We compared PathBoost to graph neural networks and graph kernel approaches on several benchmark datasets, obtaining better results in half of them, and comparable results in the rest. PathBoost shows better performances on graphs with larger average node counts. Overall, the results demonstrate that path-based boosting methods can be competitive with more complex black-box approaches.
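
A rough sketch of the path-feature idea: enumerate label sequences of simple paths from an anchor node, turn their counts into a feature vector, and fit a gradient-boosted classifier. This is a generic illustration, not PathBoost itself; in particular the anchor is chosen here by degree, whereas the paper selects anchors by categorical attribute diversity, and the toy graphs and parameters are invented for the example.

```python
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import GradientBoostingClassifier

def path_features(adj, labels, anchor, max_len=3):
    """Count label sequences of simple paths from `anchor` with up to `max_len` nodes."""
    counts = Counter()
    def dfs(node, seq, visited):
        counts["-".join(seq)] += 1
        if len(seq) >= max_len:
            return
        for nxt in adj[node]:
            if nxt not in visited:
                dfs(nxt, seq + [labels[nxt]], visited | {nxt})
    dfs(anchor, [labels[anchor]], {anchor})
    return counts

# Two tiny labeled graphs (adjacency lists) with binary class labels.
g0 = ({0: [1, 2], 1: [0, 2], 2: [0, 1]}, {0: "C", 1: "C", 2: "O"})
g1 = ({0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}, {0: "C", 1: "N", 2: "C", 3: "O"})
graphs, y = [g0, g1, g0, g1], [0, 1, 0, 1]

# Anchor = highest-degree node here; the paper instead selects anchors by
# categorical attribute diversity.
feats = [path_features(adj, lab, max(adj, key=lambda v: len(adj[v]))) for adj, lab in graphs]
X = DictVectorizer(sparse=False).fit_transform(feats)
clf = GradientBoostingClassifier(n_estimators=20).fit(X, y)
print(clf.predict(X))
```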

[LG-292] Reinforcement learning for inverse structural design and rapid laser cutting of kirigami prototypes

链接: https://arxiv.org/abs/2605.08098
作者: Milad Yazdani,Shahriar Shalileh,Dena Shahriari
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Kirigami is an increasingly useful fabrication method to produce shape-programmable metamaterial structures. However, inverse design remains difficult because deployment is nonlinear, and feasible cut layouts must satisfy discrete compatibility rules, avoid overlap, and map one target shape to valid designs. We present RL-Kirigami, an inverse design framework that combines optimal-transport conditional flow matching (OT-CFM) with reinforcement learning to generate compatible ratio fields for compact reconfigurable parallelogram quad kirigami. A marching decoder enforces global geometric compatibility, and Group Relative Policy Optimization (GRPO) aligns the generator with nondifferentiable rewards for silhouette matching, feasibility, and ratio-field regularity. Across procedurally generated target shape instances, a single sample from the pretrained OT-CFM prior reached 94.2% sIoU and outperformed solver baselines while reducing forward simulator evaluations from hundreds to 1. GRPO improved accuracy to 94.91% sIoU and, with regularity included, reduced \mathrm{TV}(\mathbf{x}) from 0.95 to 0.81 while maintaining 94.83% sIoU. Generated layouts were exported to DXF and laser-cut in 50~\mu\mathrm{m} polymeric sheets to produce deployable prototypes in 8.0 \pm 1.0 minutes per part. These results support a manufacturing-aware inverse design workflow for deployable kirigami metamaterials under hard geometric feasibility constraints.

[LG-293] Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime

链接: https://arxiv.org/abs/2605.10931
作者: Albert Alcalde,Leon Bungert,Konstantin Riedl,Tim Roith
类目: Analysis of PDEs (math.AP); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注: 30 pages, 10 figures

点击查看摘要

Abstract:Transformers with self-attention modules as their core components have become an integral architecture in modern large language and foundation models. In this paper, we study the evolution of tokens in deep encoder-only transformers at inference time which is described in the large-token limit by a mean-field continuity equation. Leveraging ideas from the convergence analysis of interacting multi-particle systems, with particles corresponding to tokens, we prove that the token distribution rapidly concentrates onto the push-forward of the initial distribution under a projection map induced by the key, query, and value matrices, and remains metastable for moderate times. Specifically, we show that the Wasserstein distance of the two distributions scales like \sqrt{\log(\beta+1)/\beta}\exp(Ct)+\exp(-ct) in terms of the temperature parameter \beta^{-1}\to 0 and inference time t\geq 0. For the proof, we establish Lyapunov-type estimates for the zero-temperature equation, identify its limit as t\to\infty, and employ a stability estimate in Wasserstein space together with a quantitative Laplace principle to couple the two equations. Our result implies that for time scales of order \log\beta the token distribution concentrates at the identified limiting distribution. Numerical experiments confirm this and, beyond that, complement our theory by showing that for finite \beta and large t the dynamics enter a different terminal phase, dominated by the spectrum of the value matrix.
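
As a purely illustrative companion to the theory, the toy simulation below runs a discretized self-attention dynamic on tokens constrained to the unit sphere and shows them concentrating at low temperature (large beta). It is not the exact mean-field equation analyzed in the paper (key, query, and value matrices are set to the identity here), and all constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, beta, dt, steps = 64, 3, 20.0, 0.05, 400

# Tokens on the unit sphere; at each step every token moves toward an
# attention-weighted average of all tokens and is renormalized.
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

for _ in range(steps):
    logits = beta * (X @ X.T)                        # pairwise similarities
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)                # row-wise softmax
    X = X + dt * (A @ X)                             # Euler step of the dynamics
    X /= np.linalg.norm(X, axis=1, keepdims=True)    # project back to the sphere

# A mean pairwise cosine similarity near 1 indicates the tokens have clustered.
print("mean pairwise cosine similarity:", float(np.mean(X @ X.T)))
```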

[LG-294] Equivariant Reinforcement Learning for Clifford Quantum Circuit Synthesis

链接: https://arxiv.org/abs/2605.10910
作者: Richie Yeung,Aleks Kissinger,Rob Cornish
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider the problem of synthesizing Clifford quantum circuits for devices with all-to-all qubit connectivity. We approach this task as a reinforcement learning problem in which an agent learns to discover a sequence of elementary Clifford gates that reduces a given symplectic matrix representation of a Clifford circuit to the identity. This formulation permits a simple learning curriculum based on random walks from the identity. We introduce a novel neural network architecture that is equivariant to qubit relabelings of the symplectic matrix representation, and which is size-agnostic, allowing a single learned policy to be applied across different qubit counts without circuit splicing or network reparameterization. On six-qubit Clifford circuits, the largest regime for which optimal references are available, our agent finds circuits within one two-qubit gate of optimality in milliseconds per instance, and finds optimal circuits in 99.2% of instances within seconds per instance. After continued training on ten-qubit instances, the agent scales to unseen Clifford tableaus with up to thirty qubits, including targets generated from circuits with over a thousand Clifford gates, where it achieves lower average two-qubit gate counts than Qiskit’s Aaronson-Gottesman and greedy Clifford synthesizers.

[LG-295] Factual recall in linear associative memories: sharp asymptotics and mechanistic insights

链接: https://arxiv.org/abs/2605.10795
作者: Alessio Giorlandino,Sebastian Goldt,Antoine Maillard
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models demonstrate remarkable ability in factual recall, yet the fundamental limits of storing and retrieving input–output associations with neural networks remain unclear. We study these limits in a minimal setting: a linear associative memory that maps p input embeddings in \mathbb{R}^d to their corresponding d-dimensional targets via a single layer, requiring each mapped input to be well separated from all other targets. Unlike in supervised classification, this strict separation induces p constraints per association and produces strong correlations between constraints that make a direct characterisation of the storage capacity difficult. Here, we provide a precise characterisation of this capacity in the following way. We first introduce a decoupled model in which each input has its own independent set of competing outputs, and provide numerical and analytical evidence that this decoupled model is equivalent to the original model in terms of storage capacity, spectra of the learnt weights, and storage mechanism. Using tools from statistical physics, we show that the decoupled model can store up to p_c \log p_c / d^2 = 1/2 associations, and generalise the computation of p_c to linear two-layer architectures. Our analysis also gives mechanistic insight into how the optimal solution improves over a naïve Hebbian learning rule: rather than boosting input-output alignments with broad fluctuations, the optimal solution raises the correct scores just above the extreme-value threshold set by the competing outputs. These findings give a sharp statistical-physics characterisation of factual storage in linear networks and provide a baseline for understanding the memory capacity of more realistic neural architectures.
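
A quick numerical illustration of the Hebbian baseline discussed at the end of the abstract: store associations with W = sum_i y_i x_i^T and judge recall by whether W x_i scores the correct target highest among all targets. The dimensions and Gaussian embeddings are arbitrary choices; the sketch only shows how recall degrades as p grows relative to d, not the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128

def hebbian_recall(p):
    """Fraction of inputs whose Hebbian readout is closest to the correct target."""
    X = rng.standard_normal((p, d)) / np.sqrt(d)     # input embeddings
    Y = rng.standard_normal((p, d)) / np.sqrt(d)     # target embeddings
    W = Y.T @ X                                      # Hebbian rule: sum_i y_i x_i^T
    scores = (X @ W.T) @ Y.T                         # score of every target for every input
    return float(np.mean(scores.argmax(axis=1) == np.arange(p)))

for p in (50, 200, 800, 3200):
    print(f"p = {p:5d}   recall = {hebbian_recall(p):.3f}")
```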

[LG-296] Fixed-Point Neural Optimal Transport without Implicit Differentiation

链接: https://arxiv.org/abs/2605.10792
作者: Yesom Park,Eric Gelphman,Stanley Osher,Samy Wu Fung
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 37 pages, submitted to SIAM Journal on Mathematical Data Science (currently under review)

点击查看摘要

Abstract:We propose an implicit neural formulation of optimal transport that eliminates adversarial min–max optimization and multi-network architectures commonly used in existing approaches. Our key idea is to parameterize a single potential in the Kantorovich dual and reformulate the associated c-transform as a proximal fixed-point problem. This yields a stable single-network framework in which dual feasibility is enforced exactly through proximal optimality conditions rather than adversarial training. Despite the inner fixed-point computation, gradients can be computed without differentiating through the fixed-point iterations, enabling efficient training without requiring implicit differentiation. We further establish convergence of stochastic gradient descent. The resulting framework is efficient, scalable, and broadly applicable: it simultaneously recovers forward and backward transport maps and naturally extends to class-conditional settings. Experiments on high-dimensional Gaussian benchmarks, physical datasets, and image translation tasks demonstrate strong transport accuracy together with improved training stability and favorable computational and memory efficiency.

[LG-297] On the global convergence of gradient descent for wide shallow models with bounded nonlinearities

链接: https://arxiv.org/abs/2605.10775
作者: Romain Petit,Clarice Poon,Gabriel Peyré
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A surprising phenomenon in the training of neural networks is the ability of gradient descent to find global minimizers of the training loss despite its non-convexity. Following earlier works, we investigate this behavior for wide shallow networks. Existing results essentially cover the case of ReLU activations and the case of sigmoid activations with scalar output weights. We study a large class of models that includes multi-head attention layers and two-layer sigmoid networks with vector output weights. Building upon [Chizat and Bach, 2018], we prove that all non-global minimizers of the training loss are unstable under gradient descent dynamics. Thus, when the initial distribution of the parameters has full support (which includes the popular Gaussian case), and in the many hidden neurons or attention heads limit, continuous-time gradient descent can only converge to global minimizers. Establishing the instability of non-global minimizers corresponds to the construction of an ``escaping active set’’ – we complete the proof of [Chizat and Bach, 2018] to construct this set for models with bounded nonlinearities and scalar output weights. We also extend this construction to new cases for models with vector output weights. Finally, we show the well-posedness and the stability with respect to discretization of the mean field training dynamic for sub-Gaussian initializations.

[LG-298] Price of Quality: Sufficient Conditions for Sparse Recovery using Mixed-Quality Data ICLR2026

链接: https://arxiv.org/abs/2605.10713
作者: Youssef Chaabouni,David Gamarnik
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: Published as a conference paper at ICLR 2026

点击查看摘要

Abstract:We study sparse recovery when observations come from mixed-quality sources: a small collection of high-quality measurements with small noise variance and a larger collection of lower-quality measurements with higher variance. For this heterogeneous-noise setting, we establish sample-size conditions for information-theoretic and algorithmic recovery. On the information-theoretic side, we show that it is sufficient for (n_1, n_2) to satisfy a linear trade-off defining the Price of Quality: the number of low-quality samples needed to replace one high-quality sample. In the agnostic setting, where the decoder is completely agnostic to the quality of the data, it is uniformly bounded, and in particular one high-quality sample is never worth more than two low-quality samples for this sufficient condition to hold. In the informed setting, where the decoder is informed of per-sample variances, the price of quality can grow arbitrarily large. On the algorithmic side, we analyze the LASSO in the agnostic setting and show that the recovery threshold matches the homogeneous-noise case and only depends on the average noise level, revealing a striking robustness of computational recovery to data heterogeneity. Together, these results give the first conditions for sparse recovery with mixed-quality data and expose a fundamental difference between how the information-theoretic and algorithmic thresholds adapt to changes in data quality.
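
A small sketch of the agnostic setting: stack a few low-noise rows and many high-noise rows into one design matrix and run a single quality-agnostic Lasso. The dimensions, noise levels, and regularization strength are arbitrary choices for illustration, not values from the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
d, k = 200, 5                      # ambient dimension and sparsity
n_hi, n_lo = 30, 120               # few high-quality rows, many low-quality rows
sigma_hi, sigma_lo = 0.05, 0.5     # noise levels of the two sources

beta = np.zeros(d)
beta[rng.choice(d, size=k, replace=False)] = rng.normal(0, 2, size=k)

A_hi, A_lo = rng.standard_normal((n_hi, d)), rng.standard_normal((n_lo, d))
y_hi = A_hi @ beta + sigma_hi * rng.standard_normal(n_hi)
y_lo = A_lo @ beta + sigma_lo * rng.standard_normal(n_lo)

# Quality-agnostic decoder: stack all measurements and run a single Lasso.
A = np.vstack([A_hi, A_lo])
y = np.concatenate([y_hi, y_lo])
est = Lasso(alpha=0.05, max_iter=50_000).fit(A, y).coef_

support_true = set(np.flatnonzero(beta))
support_est = set(np.flatnonzero(np.abs(est) > 1e-2))
print("support recovered:", support_est == support_true,
      "  l2 error:", float(np.linalg.norm(est - beta)))
```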

[LG-299] Exact Fixed-Point Constraints in Neural-ODEs with Provable Universality

链接: https://arxiv.org/abs/2605.10613
作者: Feliciano Giuseppe Pacifico,Duccio Fanelli,Lorenzo Buffoni,Lorenzo Chicchi,Diego Febbe,Raffaele Marino
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注: 15 pages, 3 figures

点击查看摘要

Abstract:We introduce a technique that enables Neural-ODEs to approximate arbitrary velocity fields with a priori planted fixed-points. Specifically, a recipe is given to explicitly accommodate for a finite collection of points in the reference multi-dimensional space of the Neural-ODE where the velocity field is exactly equal to zero. In this way, the gradient-based training is rigorously constrained inside the prescribed hypothesis class while leaving the expressive power of the Neural-ODE unaltered. We rigorously prove the universality of the Neural-ODE under any local constraints in the velocity field and give a computationally convenient way of imposing the fixed points. Our method is then tested on two paradigmatic physical models.
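
One simple way to plant exact zeros in a learned velocity field is to multiply a free network by a smooth scalar gate that vanishes exactly at each prescribed point, as sketched below; this illustrates the general idea only and is not necessarily the recipe given in the paper. The class name, network shape, and example points are invented for the sketch.

```python
import torch
import torch.nn as nn

class PlantedFixedPointField(nn.Module):
    """Velocity field that is exactly zero at a prescribed set of points.

    Illustration only: a free network g(x) is multiplied by a smooth scalar
    gate that vanishes exactly at each planted point p_k.
    """
    def __init__(self, dim, fixed_points):
        super().__init__()
        self.register_buffer("P", torch.as_tensor(fixed_points, dtype=torch.float32))
        self.g = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, t, x):
        d2 = ((x.unsqueeze(1) - self.P) ** 2).sum(-1)       # squared distances to each p_k
        gate = torch.prod(d2 / (1.0 + d2), dim=1, keepdim=True)
        return gate * self.g(x)                              # zero exactly at every p_k

field = PlantedFixedPointField(dim=2, fixed_points=[[0.0, 0.0], [1.0, 1.0]])
x = torch.tensor([[0.0, 0.0], [1.0, 1.0], [0.3, -0.2]])
print(field(0.0, x))   # first two rows are exactly zero
```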

[LG-300] Amortizing Causal Sensitivity Analysis via Prior Data-Fitted Networks

链接: https://arxiv.org/abs/2605.10590
作者: Emil Javurek,Dennis Frauen,Marie Brockschmidt,Jonas Schweisthal,Stefan Feuerriegel
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Causal sensitivity analysis aims to provide bounds for causal effect estimates in the presence of unobserved confounding. However, existing methods for causal sensitivity analysis are per-instance procedures, meaning that changes to the dataset, causal query, sensitivity level, or treatment require new computation. Here, we instead present an in-context learning approach. Specifically, we propose an amortized approach to causal sensitivity analysis based on prior-data fitted networks. A key challenge is that the sensitivity bounds are not directly available when sampling training data. To address this, we develop a general prior-data construction that is applicable across the class of generalized treatment sensitivity models. Our construction involves a Lagrangian scalarization of the objective to generate training labels for the bounds through a tradeoff between causal effect min/max-imization and sensitivity model violation, which avoids model-specific analytical derivations. We further show that, under standard convexity and linearity conditions, our objective recovers the full Pareto frontier of solutions. Empirically, we demonstrate our amortized approach across various datasets, causal queries, and sensitivity levels, where our approach achieves a test-time computation that is orders of magnitude faster than per-instance methods. To the best of our knowledge, ours is the first foundation model for in-context learning for causal sensitivity analysis.

[LG-301] Affine Tracing: A New Paradigm for Probabilistic Linear Solvers

链接: https://arxiv.org/abs/2605.10566
作者: Disha Hegde,Marvin Pförtner,Jon Cockayne
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Probabilistic linear solvers (PLSs) return probability distributions that quantify uncertainty due to limited computation in the solution of linear systems. The literature has traditionally distinguished between Bayesian PLSs, which condition a prior on information obtained from projections of the linear system, and probabilistic iterative methods (PIMs), which lift classical iterative solvers to probability space. In this work we show this dichotomy to be false: Bayesian PLSs are a special case of non-stationary affine PIMs. In addition, we prove that any realistic affine PIM is calibrated. These results motivate a focus on (non-stationary) affine PIMs, but their practical adoption has been limited by the significant manual effort required to implement them. To address this, we introduce affine tracing, an algorithmic framework that automatically constructs a PIM from a standard implementation of an affine iterative method by passing symbolic tracers through the computation to build an affine computational graph. We show how this graph can be transformed to compute posterior covariances, and how equality saturation can be used to perform algebraic simplifications required for computation under specific prior choices. We demonstrate the framework by automatically generating a probabilistic multigrid solver and evaluate its performance in the context of Gaussian process approximation.

[LG-302] Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks

链接: https://arxiv.org/abs/2605.10395
作者: Minh-Toan Nguyen,Jean Barbier
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the information-theoretic limits of learning a one-hidden-layer teacher network with hierarchical features from noisy queries, in the context of knowledge transfer to a smaller student model. We work in the high-dimensional regime where the teacher width k scales linearly with the input dimension d – a setting that captures large-but-finite-width networks and has only recently become analytically tractable. Using a heuristic leave-one-out decoupling argument, validated numerically throughout, we derive asymptotically sharp characterizations of the Bayes-optimal generalization error and individual feature overlaps via a system of closed fixed-point equations. These equations reveal that feature learnability is governed by a sequence of sharp phase transitions: as data grows, teacher features become recoverable sequentially, each through a discontinuous jump in overlap. This sequential acquisition underlies a precise notion of \textit{effective width} k_c – the number of learnable features at a given data budget n – which unifies two distinct scaling regimes: a feature-learning regime in which the Bayes-optimal generalization error \varepsilon^{\rm BO} scales as n^{1/(2\beta)-1}, and a refinement regime in which it scales as n^{-1}, where \beta > 1/2 is the exponent of the power-law feature hierarchy. Both laws collapse to the single relation \varepsilon^{\rm BO}=\Theta(k_c d/n). We further show empirically that a student trained with \textsc{Adam} near the effective width k_c achieves these optimal scaling laws (up to a small algorithmic gap), and provide an information-theoretic account of the associated scaling in model size.

[LG-303] Regret Analysis of Guided Diffusion for Black-Box Optimization over Structured Inputs

链接: https://arxiv.org/abs/2605.10385
作者: Masaki Adachi,Anita Yang,Yakun Wang,Song Liu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 48 pages, 12 figures

点击查看摘要

Abstract:Guided-diffusion black-box optimization (BO) has shown strong empirical performance on structured design problems such as molecules and crystals, but its regret behavior remains poorly understood. Existing BO regret analyses typically rely on maximum information gain, non-pretrained surrogate models, or exact acquisition maximization – assumptions that break down in modern diffusion–BO pipelines, where pretrained diffusion models serve as powerful priors over valid structures and acquisition maximization is replaced by approximate sampling over astronomically large discrete spaces. We develop a first certificate-based expected simple-regret framework for guided-diffusion BO that avoids maximum-information-gain bounds, RKHS assumptions, and exact acquisition maximization. The central quantity in our analysis is mass lift: the increase in probability mass assigned to near-optimal designs relative to the pretrained generator. This view explains how exponential-looking finite-budget convergence and polynomial acceleration can both arise from the same mechanism. We also give practical diagnostics for estimating search exponents from finite candidate pools and a proposal-corrected resampling construction that provides a fully certified sampler instance.

[LG-304] Multifidelity Gaussian process regression for solving nonlinear partial differential equations

链接: https://arxiv.org/abs/2605.10383
作者: Fatima-Zahrae El-Boukkouri,Josselin Garnier,Olivier Roustant
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 31 pages, 20 figures

点击查看摘要

Abstract:Solving nonlinear partial differential equations (PDEs) using kernel methods offers a compelling alternative to traditional numerical solvers. However, the performance of these methods strongly depends on the choice of kernel. In this work, as the available information is inherently multifidelity, we propose a kernel learning approach based on cokriging, leveraging empirical information from multifidelity simulations. In the first step, we fit a differentiable non-stationary kernel to an empirical kernel obtained from low-fidelity simulations. In the second step, we derive a high-fidelity kernel with estimated hyperparameters, and construct a corresponding high-fidelity mean using the multifidelity framework. These components can then be used within a Gaussian process framework for solving PDEs. Finally, we demonstrate the performance of the proposed physics-informed method on the Burgers’ equation.

[LG-305] Fast Training of Mixture-of-Experts for Time Series Forecasting via Expert Loss Integration

链接: https://arxiv.org/abs/2605.10330
作者: Btissame El Mahtout,Florian Ziel
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:We propose a novel adaptive Mixture-of-Experts (MoE) framework for time series forecasting that enhances expert specialization by incorporating expert-specific loss information directly into the training process. Notably, the overall objective comprises the base forecasting loss and expert-specific losses, allowing expert-level prediction errors to jointly shape training alongside the global forecasting loss. This framework is further combined with a partial online learning strategy, enabling incremental updates of both the gating mechanism and expert parameters. This approach significantly reduces computational cost by eliminating the need for repeated full model retraining. By integrating expert-level loss awareness with efficient online optimization, the proposed method achieves improved learning efficiency while maintaining strong predictive performance. Empirical results across economic, tourism, and energy datasets with varying frequencies demonstrate that the proposed approach generally outperforms both statistical methods and state-of-the-art neural network models, such as Transformers and WaveNet, in forecasting accuracy and computational efficiency. Furthermore, ablation studies confirm the effectiveness of the expert-specific loss integration strategy, highlighting its contribution to enhancing predictive performance.
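
A minimal reading of "expert loss integration" is sketched below: a gated mixture produces the forecast, and the training loss adds per-expert errors against the same target to the base forecasting loss. The architecture, the weighting factor 0.5, and the synthetic batch are assumptions for the example, not the paper's model.

```python
import torch
import torch.nn as nn

class MoEForecaster(nn.Module):
    def __init__(self, n_features, n_experts=4, hidden=32, horizon=1):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(), nn.Linear(hidden, horizon))
            for _ in range(n_experts))
        self.gate = nn.Sequential(nn.Linear(n_features, n_experts), nn.Softmax(dim=-1))

    def forward(self, x):
        preds = torch.stack([e(x) for e in self.experts], dim=1)   # (B, E, H)
        w = self.gate(x).unsqueeze(-1)                              # (B, E, 1)
        return (w * preds).sum(dim=1), preds                        # mixture + per-expert preds

model = MoEForecaster(n_features=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()

x, y = torch.randn(64, 8), torch.randn(64, 1)
mixture, expert_preds = model(x)
base_loss = mse(mixture, y)
# Expert-specific losses: each expert is also scored against the target,
# so expert-level errors shape training alongside the global forecasting loss.
expert_loss = mse(expert_preds, y.unsqueeze(1).expand_as(expert_preds))
loss = base_loss + 0.5 * expert_loss
loss.backward()
opt.step()
print(float(base_loss), float(expert_loss))
```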

[LG-306] Characterizing the Generalization Error of Random Feature Regression with Arbitrary Data-Augmentation

链接: https://arxiv.org/abs/2605.10290
作者: Lucas Morisset,Alain Durmus,Adrien Hardy
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:This paper aims at analyzing the regularization effect that data augmentation induces on supervised regression methods in the proportional regime, where the number of covariates grows proportionally to the number of samples. We provide a tight characterization of the test error, measured in mean squared error, in terms only of the population quantities of the true data, as well as first and second order statistics of the augmentation scheme. Our results are valid under misspecified feature maps, and for any network architecture where only the last readout layer is trained, and the rest of the network is either frozen or randomly initialized. We specify our results in the case of Gaussian data, and show that our asymptotic characterization is tight in this setting.

[LG-307] Scalable Gaussian process inference via neural feature maps

链接: https://arxiv.org/abs/2605.10285
作者: Anthony Stephenson
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 27 pages

点击查看摘要

Abstract:We present a theoretically grounded Gaussian process framework that leverages neural feature maps to construct expressive kernels. We show that the learned feature map can be interpreted as an optimal low-rank approximation to a Gram matrix derived from an implied RKHS, from which we establish consistency of the GP posterior. We further analyse the spectral properties of the induced kernels and introduce product feature-map kernels to address oversmoothing. This simple yet powerful approach enables fast, scalable, and accurate exact GP inference with minimal upfront work. The flexibility of kernel design supports seamless application to both regression and classification tasks across diverse data modalities, including tabular inputs and structured domains such as images. On benchmark datasets, this approach surpasses pre-existing methods in terms of accuracy and training and prediction efficiency.
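
The induced kernel k(x, x') = phi(x) . phi(x') makes exact GP inference a Bayesian linear regression in feature space, so the expensive solve is m x m rather than n x n. A sketch with a fixed random feature map standing in for a learned one (all sizes and the noise level are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed (here random) neural feature map phi: R^d -> R^m; in practice the map
# would be learned, after which the induced kernel is k(x, x') = phi(x) . phi(x').
d, m = 1, 64
W1, b1 = rng.standard_normal((d, m)), rng.standard_normal(m)
def phi(X):
    return np.tanh(X @ W1 + b1)

def gp_posterior(X_tr, y_tr, X_te, noise=0.1):
    """Exact GP posterior mean/variance for the feature-map kernel, in O(m^3)."""
    P_tr, P_te = phi(X_tr), phi(X_te)
    A = P_tr.T @ P_tr + noise ** 2 * np.eye(m)       # (m, m) system instead of (n, n)
    mean_w = np.linalg.solve(A, P_tr.T @ y_tr)
    mean = P_te @ mean_w
    cov_w = noise ** 2 * np.linalg.inv(A)
    var = np.einsum("ij,jk,ik->i", P_te, cov_w, P_te)
    return mean, var

X_tr = np.linspace(-3, 3, 40)[:, None]
y_tr = np.sin(X_tr[:, 0]) + 0.1 * rng.standard_normal(40)
X_te = np.linspace(-4, 4, 5)[:, None]
mean, var = gp_posterior(X_tr, y_tr, X_te)
print(np.round(mean, 2), np.round(var, 3))
```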

[LG-308] Stellar Age Compression Reshapes Interpretations of the Milky Way Thick-Disk Formation History

链接: https://arxiv.org/abs/2605.10220
作者: Zhipeng Zhang
类目: Astrophysics of Galaxies (astro-ph.GA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The formation timescale of the Milky Way thick disk is one of the central debates in Galactic archaeology. The age-metallicity relation (AMR), formation timescale, and chemical evolution gradients are frequently used to infer a rapid assembly, short-timescale enrichment, and bursty formation history of the thick disk. However, stellar ages are not directly observable, introducing the potential risk that inferred ages may harbor a systematic compression tied to observational quality. In this paper, we use the same stellar sample and identical physical covariate matching conditions, but two independent age scales–spectroscopic inferred ages (astroNN) and asteroseismic ages (APOKASC-3)–to compare the observable signatures of the thick-disk formation history. We find that several key observables previously supporting a rapid thick-disk formation are systematically weakened under seismic anchoring: the AMR slope flattens from -3.29 to -1.86 Gyr dex^{-1} (\Delta a = +1.43), the formation timescale widens from 3.04 to 3.55 Gyr, and the peak formation age shifts from 9.1 to 6.0 Gyr. Through transport inversion experiments, we further show that additive noise can only broaden the age distribution and cannot reproduce the above pattern, whereas a compressive transport map (\lambda < 1) simultaneously reproduces a narrower age distribution, a steeper AMR, and rapid-formation-like observables. This result indicates that the compression transformation itself is sufficient to generate rapid-formation-friendly observables without requiring an intrinsically bursty formation history. Our findings reveal that statistical interpretations of the Milky Way formation history may depend sensitively on the stellar age definition itself.

[LG-309] Parameterized Complexity of Stationarity Testing for Piecewise-Affine Functions and Shallow CNN Losses

链接: https://arxiv.org/abs/2605.10219
作者: Yuhan Ye
类目: Optimization and Control (math.OC); Computational Complexity (cs.CC); Machine Learning (cs.LG)
*备注: 32 pages, 1 figure, 1 table

点击查看摘要

Abstract:We study the parameterized complexity of testing approximate first-order stationarity at a prescribed point for continuous piecewise-affine (PA) functions, a basic task in nonsmooth optimization. PA functions form a canonical model for nonsmooth stationarity testing and capture the local polyhedral geometry that appears in ReLU-type training losses. Recent work by Tian and So (SODA 2025) shows that testing approximate stationarity notions for PA functions is computationally intractable in the worst case, and identifies fixed-dimensional tractability as an open direction. We address this direction from the viewpoint of parameterized complexity, with the ambient dimension d as the parameter. In this paper, we give XP algorithms in fixed dimension for the tractable sides, and prove W[1]-hardness for the complementary sides. Moreover, lower bounds under the Exponential Time Hypothesis rule out algorithms running in time \rho(d)\cdot\mathrm{size}^{o(d)} for any computable function \rho, where \mathrm{size} denotes the total binary encoding length of the stationarity-testing instance. As a further consequence, our results yield the corresponding parameterized complexity picture for testing local minimality of continuous PA functions. We further extend our hardness results to a family of shallow ReLU CNN training losses, with stationarity tested in the trainable weight space. Thus, the same parameterized-complexity picture also appears for simple CNN training losses.

[LG-310] Extended Wasserstein-GAN Approach to Causal Distribution Learning: Density-Free Estimation and Minimax Optimality

链接: https://arxiv.org/abs/2605.10206
作者: Shu Tamano,Masaaki Imaizumi
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Distributional causal inference requires estimating not only average treatment effects but also interventional outcome distributions, including quantiles, tail risks, and policy-dependent uncertainty. As a method for distributional causal inference, generative adversarial network (GAN)-based counterfactual methods are flexible tools for this task. However, these methods have several limitations. First, the objectives of certain techniques do not coincide with the statistical risk of the identifiable causal target, and therefore provide limited theoretical guarantees regarding estimable counterfactual distributions or optimality. Second, they tend to rely on unstable density-based methods, such as density ratio estimation. In this paper, we propose GANICE (GAN for Interventional Conditional Estimation) with several advantages: it (i) clarifies the conditional interventional distribution for each treatment–covariate state as the causal estimation target; (ii) estimates the conditional distribution such that its averaged Wasserstein risk is minimized; (iii) establishes minimax optimality. GANICE achieves these advantages through the introduction of the extended Wasserstein distance, the incorporation of a cellwise critic in its dual, and an optimality proof based on Besov space theory. Our experiments demonstrate that GANICE consistently outperforms existing methods.

[LG-311] Joint sparse coding and temporal dynamics support context reconfiguration

链接: https://arxiv.org/abs/2605.10178
作者: Qianqian Shi,Yue Che,Faqiang Liu,Hongyi Li,Mingkun Xu,Sandra Reinert,Pieter M. Goltstein,Rong Zhao,Luping Shi
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 37 pages, 6 figures, 6 extended data figures. Preprint version

点击查看摘要

Abstract:Adaptive behavior requires the brain to transition between distinct contexts while maintaining representations of prior experience. The ability to reconfigure neural representations without erasing previously acquired knowledge is central to learning in dynamic environments, yet the neural mechanisms that support this balance remain unclear. Understanding these mechanisms is also critical for addressing catastrophic forgetting in artificial systems designed for lifelong learning. Here, we identify joint sparse coding and temporal dynamics in both the mouse medial prefrontal cortex (mPFC) and computational networks as mechanisms that help preserve prior representations during context transitions. Specifically, sparsity in context-dependent representations reduces cross-context interference, whereas temporal dynamics within the network activity further enhance context separability across time. Strikingly, networks endowed with both properties, such as spiking neural networks, exhibit improved retention during lifelong learning without auxiliary heuristics. These findings establish joint sparse coding and temporal dynamics as a core mechanism supporting flexible context reconfiguration in lifelong learning and, through their activity constraining nature, as an energy-efficient architectural principle for stable adaptation. Together, they provide a mechanistic framework for understanding how the brain preserves prior knowledge while flexibly adapting to new contexts.

[LG-312] PFN-TS: Thompson Sampling for Contextual Bandits via Prior-Data Fitted Networks

链接: https://arxiv.org/abs/2605.10137
作者: Yan Shuo Tan,Kenyon Ng,Ruizhe Deng,Sumetha Loganathan,Qiong Zhang,Bibhas Chakraborty
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Thompson sampling is a widely used strategy for contextual bandits: at each round, it samples a reward function from a Bayesian posterior and acts greedily under that sample. Prior-data fitted networks (PFNs), such as TabPFN v2+ and TabICL v2, are attractive candidates for this purpose because they approximate Bayesian posterior predictive distributions in a single forward pass. However, PFNs predict noisy future rewards, while Thompson sampling requires uncertainty over the latent mean reward function. We propose PFN-TS, a Thompson sampling algorithm that converts PFN posterior predictives into mean-reward samples using a subsampled predictive central limit theorem. The method estimates posterior variance from a geometric grid of O(\log n) dataset prefixes rather than the full O(n) predictive sequence used in previous predictive-sequence approaches, and reuses TabICL’s cached representations across rounds. We prove consistency of the subsampled variance estimator and give a Bayesian regret bound that decomposes PFN-TS regret into exact posterior-sampling regret under the PFN prior plus approximation terms. Empirically, PFN-TS achieves the best average rank across nonlinear synthetic and OpenML classification-to-bandit benchmarks, remains competitive on linear and BART-generated rewards, and attains the highest estimated policy value in an offline mobile-health evaluation. Code is available at this https URL.
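
A schematic of the Thompson-sampling loop described above, hedged heavily: `pfn_predict_mean` is a hypothetical stand-in for a PFN's posterior-predictive mean (TabPFN/TabICL would play this role in practice; here a ridge-regression predictive is used so the snippet runs end to end), and the prefix-based dispersion used as posterior variance is a crude simplification, not the paper's subsampled predictive-CLT estimator.

```python
# Hypothetical sketch: Thompson sampling with a PFN-style predictive model.
# `pfn_predict_mean` stands in for a PFN posterior-predictive mean; the
# prefix-based std below is a crude dispersion proxy, NOT the paper's
# subsampled predictive-CLT variance estimator.
import numpy as np

rng = np.random.default_rng(0)
d, n_arms, horizon = 5, 10, 300
theta_true = rng.normal(size=d)

def pfn_predict_mean(X_hist, y_hist, x_query, ridge=1.0):
    """Stand-in predictive mean: ridge-regression posterior mean."""
    A = X_hist.T @ X_hist + ridge * np.eye(X_hist.shape[1])
    return float(x_query @ np.linalg.solve(A, X_hist.T @ y_hist))

X_hist, y_hist = np.zeros((0, d)), np.zeros(0)
for t in range(horizon):
    arms = rng.normal(size=(n_arms, d))                  # fresh contexts each round
    draws = []
    for x in arms:
        if len(y_hist) < 4:                              # warm-up: random scores
            draws.append(rng.normal())
            continue
        n = len(y_hist)
        # geometric grid of prefix sizes -> O(log n) predictive calls per arm
        sizes = sorted({max(2, n // (2 ** j)) for j in range(int(np.log2(n)) + 1)})
        means = [pfn_predict_mean(X_hist[:m], y_hist[:m], x) for m in sizes]
        mu, sigma = means[-1], np.std(means) + 1e-3       # crude uncertainty proxy
        draws.append(rng.normal(mu, sigma))               # Thompson draw of the mean reward
    a = int(np.argmax(draws))
    reward = arms[a] @ theta_true + 0.1 * rng.normal()
    X_hist, y_hist = np.vstack([X_hist, arms[a]]), np.append(y_hist, reward)

print("mean reward over the last 50 rounds:", y_hist[-50:].mean())
```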

[LG-313] A Stability Benchmark of Generative Regularizers for Inverse Problems

链接: https://arxiv.org/abs/2605.10076
作者: Alexander Denker,Johannes Hertrich,Sebastian Neumayer
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generative (diffusion) priors demonstrate remarkable performance in addressing inverse problems in imaging. Yet, for scientific and medical imaging, it is crucial that reconstruction techniques remain stable and reliable under imperfect settings. Typical definitions of stability encompass the notion of ‘‘convergent regularization’’, robustness to out-of-distribution data, and to inaccuracies in the forward operator or noise model. We evaluate these properties numerically. Furthermore, we benchmark generative approaches against modern optimization-based methods inspired by the widely used variational techniques. Our results give insights for which settings and applications generative priors can deliver state-of-the-art reconstructions, and on those in which they fall short or may even be problematic.

[LG-314] Differentially Private Sampling from Distributions via Wasserstein Projection

链接: https://arxiv.org/abs/2605.10015
作者: Shokichi Takakura,Seng Pei Liew,Satoshi Hasegawa
类目: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we study the problem of sampling from a distribution under the constraint of differential privacy (DP). Prior works measure the utility of DP sampling with density ratio-based measures such as KL divergence. However, such formulations suffer from two key limitations: 1) they fail to capture the geometric structure of the support, and 2) they are not applicable when the supports of the distributions differ. To deal with these issues, we develop a novel framework for DP sampling with Wasserstein distance as the utility measure. In this formulation, we propose Wasserstein Projection Mechanism (WPM), a minimax optimal mechanism based on Wasserstein projection. Furthermore, we develop efficient algorithms for computing the proposed mechanisms approximately and provide convergence guarantees.

[LG-315] Total Generalized Variation regularization closes the gap between neural-field and classical methods in seismic travel-time tomography

链接: https://arxiv.org/abs/2605.09960
作者: Isao Kurosawa
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 15 pages, 6 figures. Manuscript submitted to Geophysical Journal International

点击查看摘要

Abstract:Travel-time tomography forces a trade-off between mesh resolution and stability in which the regularizer choice dominates what can be recovered. We introduce MIMIR, a differentiable framework that represents the 2D velocity field as a Fourier-feature neural network, replacing the grid-based slowness vector with a continuous, infinitely differentiable function. Prior neural-field tomography has staircased smooth fields under total-variation (TV) priors or oscillated near interfaces under $L^2$ Laplacian smoothing. We adopt second-order total generalized variation (TGV$^2$) and parametrize its auxiliary vector field as a second neural network jointly optimized with the velocity field, eliminating the inner Chambolle-Pock primal-dual loop that classically dominates TGV computation. On three synthetic benchmarks (Gaussian, horizontally layered, curved-fault inspired by OpenFWI) using cross-well acquisition, 5% travel-time noise, and five seeds, MIMIR-TGV$^2$ ties a classical FMM-LSMR baseline with auto-tuned hyperparameters on the Gaussian (p = 0.134, paired t-test) and significantly outperforms it on layered (p < 0.0001, 44% RMSE reduction) and curved-fault (p = 0.0002, 33% reduction). Replacing TGV$^2$ with TV degrades performance on Gaussian (p = 0.004) and layered (p = 0.003); curriculum-annealed TV improves Gaussian RMSE by only 5.4%, confirming that TV’s staircase bias is intrinsic to the regularizer rather than a scheduling artifact. The results empirically validate the Bredies-Kunisch-Pock prediction that piecewise-affine priors are better suited to subsurface velocity recovery than piecewise-constant TV priors. We argue that the central design choice in physics-informed neural-field inversion is not the network architecture but the regularizer. The full pipeline reproduces in under one hour on consumer hardware.

[LG-316] The Observable Wasserstein Distance

链接: https://arxiv.org/abs/2605.09916
作者: Edivaldo Lopes dos Santos,Leandro Vicente Mauri,Washington Mio,Tom Needham
类目: Metric Geometry (math.MG); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce the observable Wasserstein distance, a framework for deriving lower bounds on the Wasserstein distance between probability measures on Polish metric spaces, designed to bypass the computational intractability of exact optimal transport in large-scale, non-Euclidean datasets. Analogous to the sliced Wasserstein distance in $\mathbb{R}^d$, our approach projects measures onto the real line via 1-Lipschitz observables and computes the Wasserstein distances between the resulting pushforward distributions. We define a hierarchy of pseudo-metrics by restricting observables to a nested chain of subspaces. A central theoretical contribution is an injectivity result linking the metric covering dimension of the support of a measure to the specific order in the hierarchy that guarantees unique recovery. This serves as a metric-space analogue to the Cramér-Wold Device for Euclidean distributions. We demonstrate that this hierarchy offers a tunable trade-off between sharpness as a lower bound on the Wasserstein distance and computational efficiency. We also present a discrete computational model for finite grids and numerical experiments validating the efficacy and utility of these approximations.
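
A minimal numerical sketch of the lower-bound idea, under assumptions of our own choosing: the anchor-distance observables and sample sizes below are illustrative, not the paper's hierarchy of observable subspaces.

```python
# Observable lower bound on W1 between two empirical measures: push samples
# through 1-Lipschitz observables (distances to anchor points are 1-Lipschitz on
# any metric space) and take the largest 1D Wasserstein distance between the
# pushforwards. Anchors are an illustrative choice, not the paper's construction.
import numpy as np

def w1_1d(a, b):
    """Exact W1 between equal-size 1D empirical measures (sorted coupling)."""
    return np.abs(np.sort(a) - np.sort(b)).mean()

rng = np.random.default_rng(1)
X = rng.normal(loc=0.0, scale=1.0, size=(2000, 10))   # samples from mu
Y = rng.normal(loc=0.5, scale=1.0, size=(2000, 10))   # samples from nu (shifted)

anchors = rng.normal(size=(32, 10))                   # observables f_a(x) = ||x - a||
bound = max(w1_1d(np.linalg.norm(X - a, axis=1),
                  np.linalg.norm(Y - a, axis=1)) for a in anchors)
# Each projected distance lower-bounds W1(mu, nu); here the true W1 is the norm
# of the mean shift, 0.5 * sqrt(10) ≈ 1.58, so the bound should sit below it.
print("observable lower bound on W1:", bound)
```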

[LG-317] Dissecting Jet-Tagger Through Mechanistic Interpretability

链接: https://arxiv.org/abs/2605.09881
作者: Saurabh Rai,Sanmay Ganguly
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注: 40 pages, 14 figures, 12 tables. Comments are welcome

点击查看摘要

Abstract:Mechanistic interpretability seeks to reverse engineer a trained neural network by identifying the minimal subset of internal components. We perform a mechanistic interpretability analysis of the Particle Transformer architecture, trained on the Top Quark Tagging reference dataset, with the goal of identifying the computational circuit responsible for jet classification and characterizing the physical content of its internal representations. Combining zero ablation, path patching with two complementary on-manifold corruption strategies and linear probing of the residual stream, we identify a sparse six-head circuit that recovers the great majority of the full model performance while admitting a clean source-relay-readout interpretation. In this circuit, a single early layer head serves as the primary causal source, a cluster of middle-layer heads acts as relays selectively attending to hard pairwise substructure and a single late-layer head reads out the aggregated signal. Linear probes show that the residual stream is preferentially aligned with the energy correlator basis over the N -subjettiness basis. Within the energy correlator basis, the model preferentially encodes 2-prong substructure observables over the 3-prong observables. A per-layer trained probe further reveals that the apparent single step commitment of the model to a classification decision in the first class attention block is in fact a basis rotation, with the discriminating signal already saturating in the particle attention stack. These results demonstrate that mechanistic interpretability methods developed for natural language models can be used for jet physics classifiers and indicate that gradient descent may rediscover physically meaningful aspects of jet tagging without supervision.

[LG-318] Unified Approach for Weakly Supervised Multicalibration

链接: https://arxiv.org/abs/2605.09857
作者: Futoshi Futami,Takashi Ishida
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multicalibration requires predicted scores to agree with label probabilities across rich families of subgroups and score-dependent tests, but existing methods require clean input-label pairs for evaluation and post-processing. This assumption fails in weakly supervised learning (WSL) regimes – including positive-unlabeled, unlabeled-unlabeled, and positive-confidence learning – where clean labels are costly or unavailable even though reliable uncertainty estimates may be crucial. We address this gap by developing estimators of multicalibration error and post-hoc correction methods for WSL settings in which clean input-label pairs are unavailable. We propose a unified framework for estimating and correcting multicalibration under weak supervision by combining contamination-matrix risk rewrites with witness-based calibration constraints, yielding corrected multicalibration moments with finite-sample guarantees. We further propose weak-label multicalibration boost (WLMC), a generic post-hoc recalibration algorithm under weak supervision. Finally, we conduct experiments across multiple weak-supervision settings to evaluate multicalibration behavior and offer empirical insight into uncertainty estimation under weak supervision.

[LG-319] Supercharging Bayesian Inference with Reliable AI-Informed Priors

链接: https://arxiv.org/abs/2605.09834
作者: Jongwoo Choi,Sean O’Hagan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modern predictive systems encode beliefs that can act as useful prior information for statistical inference in data-limited settings. Using them for prior construction introduces a tradeoff: an informative prior built from a predictive model can sharpen inference from limited data, but also risks propagating error from the model into the posterior. We propose a framework for AI-informed prior elicitation that mitigates this tension by rectifying the AI-induced law that generates synthetic data before using it to inform a prior. The rectified law can be embedded into synthetic data-driven prior elicitation techniques, including as a base measure in a Dirichlet process (DP) prior on the data-generating process. We refer to the resulting prior and corresponding posterior as the rectified AI prior and rectified AI posterior. We establish Gaussian asymptotics for the rectified AI posterior under non-vanishing prior strength and derive a first-order expression for its centering bias. Our rectified AI priors substantially reduce bias compared to standard approaches, improve the coverage of credible intervals, and make AI-powered prior information more reliable. We additionally apply the rectified AI prior to a real skin disease classification task and show that it can meaningfully boost predictive performance.

[LG-320] D3B: Transition-Directed Discrete Diffusion for Allosteric Binder Generation ICML2026

链接: https://arxiv.org/abs/2605.09810
作者: Hanqun Cao,Aastha Pal,Sophia Tang,Yinuo Zhang,Jingjie Zhang,Pheng Ann Heng,Pranam Chatterjee
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注: Published as a Spotlight at ICML 2026 (Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea)

点击查看摘要

Abstract:Protein function is often controlled by ligands that bias the direction of state transitions, such as agonists and antagonists, rather than stabilizing a single conformation. This is especially important for clinically relevant G protein-coupled receptors (GPCRs), where therapeutic efficacy depends on functional directionality. Structure-based design methods optimize binding to static conformations and cannot represent non-reversible, directional effects or systematically distinguish agonist from antagonist behavior. To address this gap, we introduce Transition-Directed Discrete Diffusion for Allosteric Binder Design (TD3B), a sequence-based generative framework that designs binders with specified agonist or antagonist behavior via a directional transition control objective. TD3B combines a target-aware Direction Oracle, a soft binding-affinity gate, and amortized fine-tuning of a pre-trained discrete diffusion model, enabling targeted agonist and antagonist generation decoupled from binding affinity and unattainable by equilibrium-based or inference-only guidance baselines. The code and checkpoints are available at this https URL.

[LG-321] Learning stochastic multiscale models through normalizing flows

链接: https://arxiv.org/abs/2605.09718
作者: Anan Saha,Arnab Ganguly
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*备注: 17 pages, 4 figures

点击查看摘要

Abstract:Many systems in physics, engineering, and biology exhibit multiscale stochastic dynamics, where low-dimensional slow variables evolve under the influence of high-dimensional fast processes. In practice, observations are often limited to a single trajectory of the slow component, while the fast dynamics remain unobserved, making statistical learning challenging. Approaches based on partial differential equations (PDE), such as Fokker-Planck formulations, aim to characterize the evolution of probability densities, typically requiring dense space-time data or grid-based solvers. In contrast, we adopt a trajectory-based perspective and develop a data-driven framework for learning effective stochastic dynamics from a single observed path. We model the dynamics by coupled multiscale stochastic differential equations (SDEs) and first obtain a principled model reduction through stochastic averaging. Unlike generic model reduction techniques such as PCA, this respects the dynamical structure of the original system and explicitly incorporates the interaction between slow and fast scales. A central challenge, however, is that the reduced model depends on the invariant distribution of the fast process, which is a solution to an intractable and often unknown PDE. We introduce a novel learning framework that parameterizes the invariant distribution using normalizing flows, enabling expressive density modeling in the latent fast-variable space. The flow is trained end-to-end by optimizing a penalized likelihood objective induced by the reduced stochastic dynamics. Furthermore, we develop a Bayesian variational inference procedure for uncertainty quantification, employing a second normalizing flow to approximate the posterior distribution over model parameters. This yields a scalable approach to capturing epistemic uncertainty in multiscale systems.

[LG-322] Metropolis-Adjusted Diffusion Models

链接: https://arxiv.org/abs/2605.09654
作者: Kevin H. Lam,Tyler Farghly,Christopher Williams,Jun Yang,Yee Whye Teh,Arnaud Doucet
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:Sampling from score-based diffusion models incurs bias due to both time discretisation and the approximation of the score function. A common strategy for reducing this bias is to apply corrector steps based on the unadjusted Langevin algorithm (ULA) at each noise level within a predictor-corrector framework. However, ULA is itself a biased sampler, as it discretises a continuous diffusion process. In this work, we consider adjusted Langevin correctors that employ Metropolis–Hastings (MH) or Barker’s accept-reject steps to correct for this bias. Since the target density ratio typically required by MH-based algorithms is unavailable, we propose methods that instead utilise the score function to compute the correct acceptance probability. We introduce the first exact method for adjusting Langevin corrections in diffusion models, based on a two-coin Bernoulli factory algorithm. We also propose an efficient approximation based on Simpson’s rule that achieves accuracy of order 5/2 in the step size at near-zero marginal cost. We demonstrate that these procedures improve sample quality on both synthetic and image datasets, yielding consistent gains in Fréchet Inception Distance (FID) on the latter.
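
A sketch of the score-only log-density-ratio estimate that such an adjusted corrector needs, plugged into a standard MALA acceptance; the Gaussian score is a stand-in for a learned score network, and the paper's exact two-coin Bernoulli-factory construction is not reproduced.

```python
# Score-only log-density-ratio via Simpson's rule, inside a standard MALA step:
#   log p(y) - log p(x) = \int_0^1 <s(x + t(y - x)), y - x> dt,
# approximated with three score evaluations. A Gaussian score stands in for a
# learned score network; the paper's Bernoulli-factory variant is not shown.
import numpy as np

def score(x):                              # stand-in: score of N(0, I)
    return -x

def log_ratio_simpson(x, y):
    d = y - x
    g = lambda t: np.dot(score(x + t * d), d)
    return (g(0.0) + 4.0 * g(0.5) + g(1.0)) / 6.0

def mala_step(x, step, rng):
    """One Metropolis-adjusted Langevin step using the Simpson log-ratio."""
    prop = x + step * score(x) + np.sqrt(2.0 * step) * rng.normal(size=x.shape)
    fwd = -np.sum((prop - x - step * score(x)) ** 2) / (4.0 * step)    # log q(prop | x)
    bwd = -np.sum((x - prop - step * score(prop)) ** 2) / (4.0 * step) # log q(x | prop)
    log_alpha = log_ratio_simpson(x, prop) + bwd - fwd
    return prop if np.log(rng.uniform()) < log_alpha else x

rng = np.random.default_rng(0)
x = np.full(2, 5.0)                        # start far from the mode
for _ in range(2000):
    x = mala_step(x, step=0.1, rng=rng)
print("final iterate (should sit near the origin):", x)
```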

[LG-323] Phases of Muon: When Muon Eclipses SignSGD

链接: https://arxiv.org/abs/2605.09552
作者: Elliot Paquette,Noah Marshall,Lucas Benigni,Guangyuan Wang,Atish Agarwala,Courtney Paquette
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Recently, Muon and related spectral optimizers have demonstrated strong empirical performance as scalable stochastic methods, often outperforming Adam. Yet their behaviour remains poorly understood. We analyze stochastic spectral optimizers, including Muon, on a high-dimensional matrix-valued least squares problem. We derive explicit deterministic dynamics that provide a tractable framework for studying learning behaviour with a focus on (stochastic) SignSVD, which Muon approximates, and (stochastic) SignSGD, the latter serving as a proxy for Adam. Our analysis shows that for large batch size, SignSVD performs a square-root preconditioning with respect to the data covariance spectrum, while for small batch size smaller eigenmodes behave like SGD, slowing down convergence. We contrast with SignSGD which for generic covariance performs no preconditioning and has no transition, leading to different optimal learning rates and convergence characteristics. The two methods match up to a constant factor with isotropic data, but behave differently with anisotropic data. An analysis of a power law covariance model with data exponent \alpha and target exponent \beta shows there are three phases in the (\alpha,\beta) plane: one where SignSGD is uniformly favored, one where SignSVD is uniformly favored, and a third where the two methods exhibit a trade-off in performance.
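
A toy contrast of the two updates on a matrix least-squares problem; this omits momentum and the Newton-Schulz orthogonalization used by practical Muon, and makes no claim about which method wins on this particular instance.

```python
# The two matrix updates contrasted above, on a toy matrix least-squares problem:
# SignSGD takes an elementwise sign of the stochastic gradient, while SignSVD
# (the idealized update Muon approximates) replaces it by its polar factor U V^T,
# i.e. sets every singular value to one. Momentum and Newton-Schulz are omitted.
import numpy as np

def sign_svd(G):
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt                          # orthogonalized (spectral-sign) direction

rng = np.random.default_rng(0)
n, d = 64, 32
X = rng.normal(size=(256, n))
W_true = rng.normal(size=(n, d))
Y = X @ W_true

def stoch_grad(W, batch):
    xb, yb = X[batch], Y[batch]
    return xb.T @ (xb @ W - yb) / len(batch)

for name, direction in [("SignSGD", np.sign), ("SignSVD", sign_svd)]:
    W = np.zeros((n, d))
    for _ in range(500):
        batch = rng.integers(0, len(X), size=32)
        W -= 0.02 * direction(stoch_grad(W, batch))
    print(name, "final loss:", 0.5 * np.mean((X @ W - Y) ** 2))
```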

[LG-324] Empirical Bayes 1-bit matrix completion

链接: https://arxiv.org/abs/2605.09509
作者: Takeru Matsuda
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:The problem of predicting unobserved entries in a binary matrix, known as 1-bit matrix completion, has found diverse applications in fields such as recommendation systems. In this study, we develop an empirical Bayes method for 1-bit matrix completion motivated by the Efron–Morris estimator, a matrix generalization of the James–Stein estimator that shrinks singular values toward zero. The proposed method exploits the underlying low-rank structure of binary matrices, drawing parallels with multidimensional item response theory. Simulation studies and real-data applications demonstrate that the proposed method achieves a superior balance of predictive accuracy, calibration reliability (uncertainty quantification), and computational efficiency compared to existing methods.

[LG-325] Enabling Structure-Only Initialization and Out-of-Distribution Generalization in GNN-based Molecular Dynamics Simulators

链接: https://arxiv.org/abs/2605.09495
作者: S. A. Shteingolts,Salman N. Salman,Dan Mendels
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 10 pages, 7 figures

点击查看摘要

Abstract:Machine learning-based simulators offer the potential to model the dynamics of complex systems more efficiently than classical approaches, while retaining differentiability, a key property for materials design. Graph neural network (GNN)-based simulators have shown strong performance across a range of physical domains, including molecular dynamics. However, their reliance on temporal context for accurate prediction limits their use in inverse design settings, where simulations must be initialized from a single static configuration. Moreover, inverse design requires robust out-of-distribution (OOD) generalization, as candidate structures typically lie outside the training domain. Here, we address both challenges by introducing two complementary strategies that enable stable and accurate structure-only initialization of GNN-based simulations. To directly target OOD generalization, we propose an inference-time physics-based optimization framework that constrains model predictions to remain physically consistent during rollout. In addition, we introduce a differentiable, GNN-based barostat that enables accurate tracking of system dimensions and pressure, critical for capturing macroscopic responses and supporting OOD generalization. We evaluate these approaches in the context of uniaxial compression of disordered elastic networks spanning a broad range of geometries, Poisson ratios, and microscopic behaviors. We find that, together, these methods substantially improve rollout stability and enable reliable OOD generalization, including regimes with distinct, more complex dynamics than those in the training data. These results show that, when properly initialized and constrained, GNN-based simulators can serve as efficient and generalizable tools for materials discovery and structural optimization, advancing their use in materials, molecular, and dynamical system design.

[LG-326] Quantitative Local Convergence of Mean-Field Stein Variational Gradient Flow

链接: https://arxiv.org/abs/2605.09456
作者: Lénaïc Chizat,Maria Colombo,Roberto Colombo,Xavier Fernández-Real
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Analysis of PDEs (math.AP); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Stein Variational Gradient Descent (SVGD) is a deterministic interacting-particle method for sampling from a target probability measure given access to its score function. In the mean-field and continuous-time limit, it is known that the flow converges weakly toward the target, but no quantitative rate is known for the last iterate. In this paper, we establish quantitative local convergence in strong norms for this dynamics, when the interaction kernel is of Riesz type on the d -dimensional torus. Specifically, assuming that the initial density and the target are smooth and close in L^2 -norm, we obtain explicit polynomial convergence rates in L^2 -norm that depend on the dimension and on the regularity parameters of the kernel, the initialization and the target. We further show that these rates are sharp in certain regimes, and support the theory with numerical experiments. In the edge case of kernels with a Coulomb singularity, we recover the global exponential convergence result established in prior work. Our analysis is inspired by recent results on Wasserstein gradient flows of kernel mean discrepancies.

[LG-327] Optimal Regret for Single Index Bandits

链接: https://arxiv.org/abs/2605.09454
作者: Devdan Dey,Sujoy Bhore,Avishek Ghosh
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 27 pages, 9 figures

点击查看摘要

Abstract:We study the single-index bandit problem, where rewards depend on an unknown one-dimensional projection of high-dimensional contexts through an unknown reward function. This model extends linear and generalized linear bandits to a nonparametric setting, and is particularly relevant when the reward function is not known in advance. While optimal regret guarantees are known for monotone reward functions, the general non-monotone case remains poorly understood, with the best known bound being $\tilde{\mathcal{O}}(T^{3/4})$ (under standard boundedness and Lipschitz assumptions on the reward function [Kang et al., 2025]). We close this gap by establishing the optimal regret for general single-index bandits. We propose a simple two-phase algorithm, namely, Zoomed Single Index Bandit with Upper Confidence Bound (ZoomSIB-UCB), that first estimates the projection direction via a normalized Stein estimator, then reduces the problem to a one-dimensional bandit using discretization, and finally applies UCB. This approach achieves a regret of $\tilde{\mathcal{O}}(T^{2/3})$, and improves significantly upon prior work without any additional assumptions. We also prove a matching minimax lower bound of $\tilde{\Omega}(T^{2/3})$, showing that the upper bound is essentially tight. Our upper and lower bounds together provide a sharp characterization of the regret in single-index bandits. Moreover, the empirical results further demonstrate the effectiveness and robustness of our approach.
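
A sketch of the direction-estimation phase only, assuming Gaussian contexts and a synthetic non-monotone link chosen for illustration; the zooming, discretization, and UCB phase of ZoomSIB-UCB are not reproduced.

```python
# Direction estimation via the normalized Stein estimator: for Gaussian contexts
# x ~ N(0, I) and rewards y = f(<theta*, x>) + noise, Stein's lemma gives
# E[y x] = E[f'(<theta*, x>)] * theta*, so averaging reward-weighted contexts and
# normalizing recovers the direction (up to sign). The link f is an arbitrary
# non-monotone choice; the UCB phase is not shown.
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 5000
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)
f = lambda u: np.sin(u) + 0.5 * u ** 2

X = rng.normal(size=(n, d))                      # exploration-phase contexts
y = f(X @ theta_star) + 0.1 * rng.normal(size=n)

stein = (y[:, None] * X).mean(axis=0)            # reward-weighted context average
theta_hat = stein / np.linalg.norm(stein)
print("|<theta_hat, theta_star>| =", abs(theta_hat @ theta_star))   # close to 1
```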

[LG-328] Mutual Information Optimal Density Control of Linear Systems and Generalized Schrödinger Bridges with Reference Refinement

链接: https://arxiv.org/abs/2605.09349
作者: Shoju Enami,Kenji Kashima
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 19 pages, 5 figures

点击查看摘要

Abstract:We consider a mutual information (MI) regularized version of optimal density control of a discrete-time linear system. MI optimal control has been proposed as an extension of maximum entropy optimal control to trade off between control performance and benefits provided by stochastic inputs. MI regularization induces stochasticity in the policy, which poses challenges for applications of MI optimal control in safety-critical scenarios. To remedy this situation, we impose Gaussian density constraints at specified times to directly control state uncertainty. For this MI optimal density control problem, we propose an alternating optimization algorithm and derive the closed form of each step in the algorithm. In addition, we reveal that the alternating optimization of the MI optimal density control problem coincides with that of the so-called generalized Schrödinger bridge problem associated with the discrete-time linear system.

[LG-329] Kinetic theory for Transformers and the lost-in-the-middle phenomenon

链接: https://arxiv.org/abs/2605.09213
作者: Mitia Duerinckx,Borjan Geshkovski,Stefano Rossi
类目: Analysis of PDEs (math.AP); Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:We study causal self-attention dynamics – a toy model for decoder Transformers – which we interpret as a non-exchangeable interacting particle system. Adapting cumulant expansions to the triangular causal dependency structure of the model, and appealing to non-hierarchical methods to estimate correlations using Glauber calculus, we prove a quantitative mean-field limit result and a next-order characterization of correlations. For iid uniformly distributed tokens, the limiting correlation equation can be solved in closed form and we obtain a rigorous explanation of the empirically observed lost-in-the-middle phenomenon: the token retrieval profile, as a function of the source position in the prompt, is U-shaped, with primacy, recency, and a unique interior minimum under an explicit smallness condition.

[LG-330] Quantum Transfer Learning Shows Improved Robustness in Low-Data Regimes

链接: https://arxiv.org/abs/2605.09118
作者: Li-An Lo,Li-Yi Hsu,Hsien-Yi Hsieh
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 22 pages, 5 figures

点击查看摘要

Abstract:Transfer learning under limited data is a challenging setting, where models must adapt to new tasks with minimal supervision. Prior work has primarily focused on improving absolute accuracy in transfer learning. However, empirical evidence comparing quantum and classical models in realistic transfer learning settings remains limited, especially in low-data regimes. In this work, we systematically study the robustness of quantum models under reduced training data. We evaluate multiple quantum and classical architectures across diverse transfer tasks and retraining configurations, and quantify robustness using accuracy degradation and relative performance retention (RPR). Our results show that, although classical models often achieve higher peak performance, they exhibit significantly larger degradation when training data is limited. In contrast, quantum models maintain more stable performance across data regimes, indicating improved robustness and data efficiency. These findings provide empirical evidence that quantum models can offer improved robustness in low-resource transfer learning scenarios.

[LG-331] Optimality of Sub-network Laplace Approximations: New Results and Methods

链接: https://arxiv.org/abs/2605.09075
作者: Swarnali Raha,Kshitij Khare,Rohit K Patra
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 34 Pages, 8 Figures, 2 Tables

点击查看摘要

Abstract:Although the Laplace approximation offers a simple route to uncertainty quantification in deep neural networks, its reliance on inverting large Hessian matrices has motivated a range of computationally feasible low-dimensional or sparse approximations. A prominent class of such methods, sub-network Laplace approximations, constructs surrogates by restricting attention to a small subset of parameters. Existing approaches in this family typically rely on diagonal, layer-wise, or other architectural heuristics for subset selection, which ignore cross-parameter interactions and lack formal optimality guarantees. In this paper, we provide a rigorous theoretical analysis of the sub-network Laplace paradigm. We prove that all sub-network Laplace methods systematically underestimate the predictive variance of the full Laplace posterior, and that this bias decreases monotonically as the retained sub-matrix expands. Leveraging this insight, we propose two principled, analytically grounded sub-network Hessian approximations: Gradient-Laplace selects parameters with the largest average squared gradients of the model output with respect to the parameters over a reference dataset, while Greedy-Laplace iteratively refines this selection by accounting for off-diagonal interactions in the precision matrix. We establish theoretical guarantees characterizing their optimality properties and show that Gradient-Laplace provably outperforms existing heuristic approaches. Extensive numerical studies across diverse settings indicate that these methods perform strongly relative to existing benchmarks.

[LG-332] A Market-Rule-Informed Neural Network for Efficient Imbalance Electricity Price Forecasting

链接: https://arxiv.org/abs/2605.09061
作者: Runyao Yu,Julia Lin,Derek W. Bunn,Jochen Stiasny,Wentao Wang,Yujie Chen,Tara Esterl,Peter Palensky,Jochen L. Cremer
类目: Computational Finance (q-fin.CP); Machine Learning (cs.LG)
*备注: 10 pages, 3 figures, 3 tables

点击查看摘要

Abstract:Accurate and efficient imbalance electricity price forecasting is critical for industrial energy trading systems, especially as battery assets and automated bidding pipelines increasingly participate in balancing markets. However, real-time forecasting is complicated by nonlinear market-rule-based price formation, heterogeneous input signals, and incomplete data availability caused by communication delays, publication lags, and measurement outages. This paper proposes a market-rule-informed neural forecasting framework that embeds imbalance price formation rules into the latent space of an expressive neural network. The proposed framework preserves raw signal information while exploiting transparent market-rule priors. We further analyze operational robustness by removing price-component information and characterize how forecasting performance scales with input length and forecasting horizon. Experimental results show that the proposed model achieves competitive forecasting performance with substantially fewer trainable parameters and shorter training time than generic deep learning baselines, demonstrating that market-rule priors and expressive neural networks should be jointly used for accurate and computationally sustainable forecasting in industrial energy trading applications. The implementation is publicly available at this https URL.

[LG-333] Nonlinear GENERIC Informed Neural Networks (N-GINNs): learning GENERIC dynamics with non-quadratic dissipation potentials

链接: https://arxiv.org/abs/2605.09058
作者: Vojtěch Votruba,Zequn He,Weilun Qiu,Celia Reina,Michal Pavelka
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
*备注: 26 pages, 7 figures, 4 tables

点击查看摘要

Abstract:We introduce Nonlinear GENERIC Informed Neural Networks (N-GINNs), a deep learning framework for discovering evolution equations of systems governed by the nonlinear GENERIC formalism (General Equation for Non-Equilibrium Reversible-Irreversible Coupling). Such systems exhibit coupled conservative and dissipative dynamics, and can be described via the superposition of a Hamiltonian flow and a generalized gradient flow. In contrast to existing approaches, our formulation incorporates generalized gradient flows via convex dissipation potentials, enabling the identification of a broader class of thermodynamically consistent dynamics, including systems with non-quadratic dissipation potentials. Thermodynamic structure is strongly enforced by construction through suitable reparameterizations of both the bivector operator and the dissipation potential, ensuring exact compliance with the first and second laws of thermodynamics. We validate the proposed approach on three representative examples: a harmonic oscillator coupled to a heat bath, an idealized chemical motor, and a one-dimensional viscoplastic model of Perzyna type. These results demonstrate the method’s ability to accurately infer thermodynamically consistent models from data for systems incorporating both conservative and nonlinear dissipative dynamics.

[LG-334] Learning Pure Quantum States in Any Dimension (Almost) Without Regret

链接: https://arxiv.org/abs/2605.09019
作者: Josep Lumbreras,Marco Tomamichel
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 43 pages

点击查看摘要

Abstract:We extend quantum state tomography with minimal cumulative disturbance, first investigated in [arXiv:2406.18370], to arbitrary finite-dimensional pure states. A learner sequentially receives fresh copies of an unknown pure state, chooses a rank-one projector for each copy using the previous outcomes, and performs the corresponding two-outcome projective measurement. The goal is to learn the state while keeping the chosen projectors close to the unknown state in order to minimize disturbance. The qubit solution relies on the special geometry of the Bloch sphere and does not extend directly to qudits, where pure states form a curved manifold. We show that this obstruction can be overcome by working locally on the pure-state manifold. The algorithm proceeds in epochs. In each epoch, it fixes a current estimate, measures pairs of nearby rank-one projectors obtained by moving in opposite tangent directions, and takes differences of the corresponding outcomes. This gives an exact linear observation of the tangent component of the error. The resulting local linear models are combined with a robust variance-adaptive estimator and a hot-start regularization that transfers precision across epochs. For every unknown pure state in dimension $d$, after $T$ measured copies, our protocol achieves cumulative regret $\mathcal{O}(d^3 \log^2 T)$, and at each intermediate time $t \leq T$ its current estimate has online infidelity $\mathcal{O}(d^3 \log(T)/t)$. Hence, pure-state tomography with essentially no cumulative disturbance is not a peculiarity of qubits but a geometric phenomenon that persists for qudits.

[LG-335] Beyond the Black Box: An Interpretable Machine Learning Framework for Predicting Electronic Structure Microdescriptors and Structure-Performance Relationships in Fe-based Catalytic Systems

链接: https://arxiv.org/abs/2605.08994
作者: Oyinkansola Romiluyi
类目: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 27 pages, 10 figures

点击查看摘要

Abstract:The current catalyst discovery and development pipeline for energy-intensive applications like methane conversion remains bottlenecked by expensive trial-and-error experimentation, irreproducible chemical intuition, and a lack of frameworks linking complex catalytic design spaces to performance. This work presents an interpretable machine learning framework that integrates SHAP-based feature importance analysis (Explainable AI) with tree-based ensembles (Random Forest and Bayesian-optimized CatBoost) to characterize Fe-zeolite and oxide-supported catalysts for the partial oxidation of methane (POM). Despite limited data, the framework decodes complex structure-performance relationships by identifying and ranking thermodynamic, structural, and geometric microdescriptors that influence the electronic band gap and govern macroscale performance metrics such as selectivity, activity, and stability. This work explicitly demonstrates that thermodynamic lattice stability and geometric factors are the primary drivers of electronic band gap (a critical proxy for redox reactivity) rather than bulk stoichiometry. Non-linear models achieve an R² of 0.61–0.77, significantly outperforming traditional linear baselines (R² = 0.32). This workflow provides both a light-weight generalizable methodology and a prioritized list of physical features for accelerated catalyst screening - and these features can subsequently be integrated into microkinetic and reaction engineering models to create digital twins of complex reactor systems and to enable predictive optimization in autonomous R&D laboratories.

[LG-336] Survey-aware Machine Learning: A Guideline for Valid Population Health Inference based on Scoping Review

链接: https://arxiv.org/abs/2605.08963
作者: YongKyung Oh,Henry W. Zheng,Jeffrey Feng,Alex A. T. Bui
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine Learning (ML) models trained on complex health surveys such as the National Health and Nutrition Examination Survey (NHANES) often ignore primary sampling units, stratification variables, and sampling weights. This practice violates the independence assumptions of standard evaluation methods. As a result, estimates become biased, uncertainty is underestimated, and fairness assessments fail to reflect population-level disparities. We propose Survey-aware Machine Learning (SaML), a nine-step guideline that incorporates survey design metadata across the ML lifecycle. Through a scoping review of 16 methodological papers, we summarize existing work on weighted model training, design-based cross-validation, and survey-adjusted performance evaluation. We also identify gaps in hyperparameter tuning and deployment. We provide task-specific guidance that clarifies which steps are required for different analytical objectives. SaML provides a checklist for valid population inference from survey data.

[LG-337] CrystalREPA: Transferring Physical Priors from Universal MLIPs to Crystal Generative Models

链接: https://arxiv.org/abs/2605.08960
作者: Chengqian Zhang,Yucheng Jin,Duo Zhang,Tiejun Li,Han Wang
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Crystal generative models mainly learn what stable crystals look like, with little explicit supervision for what makes them stable. We reveal a substantial representation gap between state-of-the-art crystal generative models and pretrained universal machine learning interatomic potentials (MLIPs) via energy probing, and show this gap can be closed by a simple training-time alignment. We propose Crystal REPresentation Alignment (CrystalREPA), a plug-and-play framework that aligns the atom-wise hidden states of generative encoders with frozen MLIP representations through an element-aware contrastive objective, transferring stability-aware atomistic priors with marginal training overhead and no additional inference cost. Across three generative frameworks, ten MLIP teachers, and two benchmark datasets, CrystalREPA consistently improves the thermodynamic stability, structural validity, and structural fidelity of generated crystals. Equally important, we find that an MLIP’s transfer effectiveness is poorly predicted by its accuracy on standard leaderboards (e.g., Matbench Discovery) but strongly predicted by the distinguishability of its atom-wise representation space, yielding a practical, accuracy-independent criterion for selecting MLIP teachers for generative transfer.

[LG-338] Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction

链接: https://arxiv.org/abs/2605.08871
作者: Zhirayr Tovmasyan,Artavazd Maranjyan,Peter Richtárik
类目: Optimization and Control (math.OC); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Large-scale machine learning models are trained on clusters of machines that exhibit heterogeneous performance due to hardware variability, network delays, and system-level instabilities. In such environments, time complexity rather than iteration complexity becomes the relevant performance metric for optimization algorithms. Recent work by Tyurin and Richtárik (2023) established the first time complexity analysis for parallel first-order stochastic optimization, proposing Rennala SGD as a time-optimal method for smooth nonconvex optimization. However, Rennala SGD is fundamentally a modification of SGD, and variance reduction techniques are known to improve the iteration complexity of SGD. In this work, we investigate whether variance reduction can also improve time complexity in heterogeneous systems. We show that, under a mean-squared smoothness assumption, variance reduction can improve time complexity in relevant parameter regimes. To this end, we propose Rennala MVR, a variance-reduced extension of Rennala SGD based on momentum-based variance reduction, and analyze its oracle and time complexity. We establish lower bounds for time complexity under these assumptions. On a stochastic quadratic benchmark, experiments with the exact method support the theory, while neural-network experiments with a practical inexact variant show similar empirical gains over Rennala SGD.
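
A sketch of the momentum-based variance-reduction (MVR/STORM-style) gradient estimator the method builds on, shown on a toy stochastic quadratic; the asynchronous, time-aware gradient collection across heterogeneous workers that defines the Rennala variants is not modeled here.

```python
# Momentum-based variance reduction (MVR / STORM-style) on a toy stochastic
# quadratic:  d_t = g(x_t; xi_t) + (1 - a) * (d_{t-1} - g(x_{t-1}; xi_t)),
# with the SAME sample xi_t evaluated at both iterates. The heterogeneous-worker
# batch collection of Rennala MVR is not modeled.
import numpy as np

rng = np.random.default_rng(0)
d = 50
A = rng.normal(size=(d, d))
A = A.T @ A / d + np.eye(d)                     # fixed positive-definite quadratic
x_star = rng.normal(size=d)

def stoch_grad(x, xi):
    return A @ (x - x_star) + xi                # gradient with additive noise xi

x = np.zeros(d)
dvec = stoch_grad(x, rng.normal(size=d))        # initial gradient estimator
lr, a = 0.05, 0.1
for _ in range(2000):
    x_new = x - lr * dvec
    xi = rng.normal(size=d)                     # one sample, reused at both points
    dvec = stoch_grad(x_new, xi) + (1.0 - a) * (dvec - stoch_grad(x, xi))
    x = x_new
print("distance to the optimum:", np.linalg.norm(x - x_star))
```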

[LG-339] Tight Generalization Bounds for Noiseless Inverse Optimization

链接: https://arxiv.org/abs/2605.08866
作者: Pouria Fatemi,Hoomaan Maskan,Suvrit Sra,Peyman Mohajerin Esfahani
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 29 pages, 2 figures

点击查看摘要

Abstract:Inverse optimization (IO) seeks to infer the parameters of a decision-maker’s objective from observed context–action data. We study noiseless IO, where demonstrations are generated by a ground-truth objective. We provide a high-probability $O(d/T)$ generalization bound for the induced action set, where d is the number of unknown parameters and T is the size of the training dataset. We strengthen these guarantees under additional conditions that ensure uniqueness of the chosen action, bringing our IO guarantees in line with best-arm identification results in the bandit literature. We further show that the $O(d/T)$ rate is tight over all consistent estimators considered here, and extend the result to both instantaneous and cumulative regret. Notably, the resulting regret lower bound matches the corresponding upper bounds in the adversarial setting, indicating that the stochastic IO setting is effectively adversarial for the class of estimators studied here. Finally, we propose a parameter-free algorithm with lower per-iteration complexity than generic solvers. Experiments validate the predicted rates and illustrate the tightness of our bounds.

[LG-340] Local LMO: Constrained Gradient Optimization via a Local Linear Minimization Oracle

链接: https://arxiv.org/abs/2605.08850
作者: Peter Richtárik,Kaja Gruntkowska,Hanmin Li
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 71 pages, 8 figures

点击查看摘要

Abstract:We design Local LMO - a new projection-free gradient-type method for constrained optimization. The key algorithmic idea is to replace the global linear minimization oracle over the constraint set used by Frank-Wolfe (FW) with a local linear minimization oracle over the intersection of the constraint set and a “small” ball centered at the current iterate. In particular, when minimizing $f:\mathbb{R}^d \to \mathbb{R}$ over a constraint $\emptyset \neq \mathcal{X} \subseteq \mathbb{R}^d$, Local LMO performs the iteration \[ x_{k+1} \in \arg\min_{z \in \mathcal{X} \cap \mathcal{B}(x_k, t_k)} \langle \nabla f(x_k), z \rangle, \] where $x_0 \in \mathcal{X}$, and $t_k > 0$ is a suitably chosen radius which can be interpreted as an effective stepsize. While designed as an alternative to FW, Local LMO is perhaps best viewed as a generalization of Gradient Descent (GD) rather than a modification of FW. Indeed, it is easy to see that Local LMO reduces to GD in the unconstrained setting and, more generally, to GD restricted to an affine subspace if the constraint $\mathcal{X}$ is affine. We prove that this simple algorithmic scheme transfers the known (unaccelerated) convergence rates of Projected Gradient Descent (PGD) to the projection-free world in several important regimes, some of which are beyond the reach of FW. In contrast to FW theory, i) our guarantees hold without requiring the feasible set $\mathcal{X}$ to be bounded, ii) our theory does not require the “curvature” assumption, which allows us to establish a standard sublinear rate for convex functions with bounded gradients, iii) we obtain a linear rate in the smooth strongly convex regime. Furthermore, we obtain sharp sublinear rates in the smooth convex and non-convex regimes, in the $(L_0, L_1)$-smooth convex regime, and in stochastic and non-differentiable settings.
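
A toy instance of the iteration above in a special case chosen purely for tractability: both the constraint set and the locality region are boxes (an ℓ∞ ball rather than a Euclidean one), so the local linear minimization has a coordinate-wise closed form, and a decaying radius stands in for the paper's radius schedule.

```python
# Local LMO on a box constraint with an l_inf locality ball: the intersection is
# again a box, so argmin_{z} <grad, z> over it is solved coordinate-wise. This is
# a toy special case, not the paper's general setting.
import numpy as np

def local_lmo_box(grad, x, t, lower, upper):
    """Linear minimization over box(lower, upper) intersected with the l_inf ball B(x, t)."""
    lo, hi = np.maximum(lower, x - t), np.minimum(upper, x + t)
    return np.where(grad > 0, lo, hi)              # minimize a linear function over a box

rng = np.random.default_rng(0)
d = 10
c = rng.uniform(-0.5, 1.5, size=d)                 # f(x) = 0.5 ||x - c||^2, c partly outside the box
lower, upper = np.zeros(d), np.ones(d)

x = np.full(d, 0.5)
for k in range(500):
    g = x - c                                      # gradient of f at x
    x = local_lmo_box(g, x, t=0.5 / (k + 1), lower=lower, upper=upper)

print("final objective:", 0.5 * np.sum((x - c) ** 2))
print("distance to the box-projection of c:", np.linalg.norm(x - np.clip(c, 0.0, 1.0)))
```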

[LG-341] Learning Theory of Transformers: Local-to-Global Approximation via Softmax Partition of Unity

链接: https://arxiv.org/abs/2605.08811
作者: Zhongjie Shi,Wenjing Liao
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper investigates the learning theory of Transformer networks for regression tasks on the compact Euclidean domain $[0,1]^d$ and $d$-dimensional compact Riemannian manifolds. We propose a novel constructive approximation framework for Transformers that builds local approximations of the target function and aggregates them into a global approximation via softmax partition of unity. This approach leverages the attention mechanism to achieve spatial localization through affine transformations of the input. The softmax activation plays a crucial role in aggregating local approximations to a global output. From an approximation perspective, we prove that a dense Transformer equipped with only two encoder blocks and standard single-hidden-layer point-wise feed-forward networks can achieve a uniform $\varepsilon$-approximation error for $\alpha$-Hölder continuous functions with $\alpha \in (0,1]$ using $\mathcal{O}(\varepsilon^{-d/\alpha})$ total parameters. Building upon this approximation guarantee, we establish a near minimax-optimal generalization error bound of order $\mathcal{O}\big(n^{-\frac{2\alpha}{2\alpha+d}} \log n\big)$ for the empirical risk minimizer, where $n$ is the training data size. The Transformer architecture studied in this paper is dense, shallow and wide, and employs softmax activation and sinusoidal positional encodings, closely reflecting practical implementations.
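
A one-dimensional illustration of the aggregation device (a softmax partition of unity over grid anchors); the anchors, bandwidth, and target function are arbitrary choices, and the paper's construction of the local pieces by attention and feed-forward layers is not reproduced.

```python
# Softmax partition of unity in 1D: grid anchors c_j on [0, 1], softmax weights
# that concentrate on anchors near x, and a weighted aggregation of local values.
import numpy as np

def f(x):                                        # a smooth (hence Hölder) target on [0, 1]
    return np.sin(3 * np.pi * x)

def softmax_pou_approx(x, n_anchors=128, beta=4000.0):
    c = np.linspace(0.0, 1.0, n_anchors)                     # grid anchors
    logits = -beta * (x[:, None] - c[None, :]) ** 2          # favor anchors near x
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                        # softmax partition of unity
    return w @ f(c)                                          # aggregate local values

x = np.linspace(0.0, 1.0, 2001)
print("uniform error:", np.max(np.abs(softmax_pou_approx(x) - f(x))))
```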

[LG-342] Measuring and Decomposing Mode Separation via the Canonical Diffusion

链接: https://arxiv.org/abs/2605.08777
作者: Shaul Tolkovsky,Ori Meidler,Or Zuk
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:Mode separation, namely how sharply a distribution fragments into barrier-separated clusters, is a fundamental geometric property of densities, difficult to quantify in high dimensions. It is structurally distinct from dispersion, yet existing tools fall short: differential entropy rises with spread regardless of fragmentation, PCA orders directions by variance regardless of barriers, and mutual information requires a mixture decomposition one usually does not have. We measure mode separation through a single stochastic process intrinsic to the density: a unique reversible diffusion with f as its stationary distribution and constant scalar diffusion coefficient. We extract two readouts from its autocovariance matrix: SSA (Sum of Squared Autocorrelations), a scalar barrier-sensitive measure; and DA (Dominant Autocorrelation directions), linear projections ordered by metastability rather than variance. Under an isotropic-Gaussian null, we derive a closed-form spectrum for the empirical autocovariance that generalizes Marchenko–Pastur, with an analytic upper edge that selects the lag at which DA is read off. Both readouts use only samples and a score function, scaling to high dimensions through pretrained score-based generative models via Tweedie’s identity. We apply our framework to three settings: (i) synthetic Gaussian mixtures, where SSA tracks mutual information; (ii) SDXL text-to-image generations, where SSA and DA capture structure that entropy and PCA miss; and (iii) molecular dynamics of alanine dipeptide, where DA recovers the known slow backbone dihedrals from static samples alone.
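
A one-dimensional toy of the barrier-sensitive readout: simulate the reversible overdamped Langevin diffusion whose stationary law is the target, using only its score, then sum squared autocorrelations of the trajectory. The mixture parameters, step size, and lag grid are arbitrary choices, and the paper's matrix-valued autocovariance readouts and spectral theory are not reproduced.

```python
# SSA-style readout on a 1D Langevin trajectory: a well-separated bimodal target
# decorrelates more slowly, hence should yield a larger sum of squared
# autocorrelations than a unimodal target of comparable spread.
import numpy as np

def mixture_score(x, mus, sigma):
    w = np.exp(-0.5 * ((x - mus) / sigma) ** 2)
    w /= w.sum()
    return np.sum(w * (mus - x)) / sigma ** 2      # score of an equal-weight Gaussian mixture

def ssa(score_fn, n_steps=150_000, dt=0.01, max_lag=1000, seed=0):
    rng = np.random.default_rng(seed)
    x, traj = 0.0, np.empty(n_steps)
    for t in range(n_steps):                        # Euler-Maruyama for dX = s(X) dt + sqrt(2) dW
        x = x + score_fn(x) * dt + np.sqrt(2 * dt) * rng.normal()
        traj[t] = x
    traj -= traj.mean()
    var = traj.var()
    acs = [np.mean(traj[:-lag] * traj[lag:]) / var for lag in range(1, max_lag + 1)]
    return float(np.sum(np.square(acs)))

bimodal = lambda x: mixture_score(x, mus=np.array([-2.0, 2.0]), sigma=0.75)
unimodal = lambda x: mixture_score(x, mus=np.array([0.0]), sigma=2.1)
print("SSA, bimodal target :", ssa(bimodal))
print("SSA, unimodal target:", ssa(unimodal))
```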

[LG-343] Energy-based models for diagnostic reconstruction and analysis in a laboratory plasma device

链接: https://arxiv.org/abs/2605.08645
作者: Phil Travis,Troy Carter
类目: Plasma Physics (physics.plasm-ph); Machine Learning (cs.LG)
*备注: 15 pages, 10 figures

点击查看摘要

Abstract:Energy-based models (EBMs) provide a powerful and flexible way of learning a joint probability distribution over data by constructing an energy surface. This energy surface enables insight extraction and conditional sampling. We apply EBMs to laboratory plasma physics, a domain characterized by highly nonlinear phenomena. These phenomena are studied using plasma diagnostics, which are often difficult to analyze and subject to hardware degradation. In addition, the possible configuration space of a plasma device is sufficiently large that it cannot be efficiently searched using conventional analysis techniques. EBMs address these issues. At the Large Plasma Device (LAPD), a CNN- and attention-based EBM is trained on a set of randomly generated machine conditions and their corresponding diagnostic time series. We demonstrate diagnostic reconstruction using this EBM on real data and show that additional diagnostics improves reconstruction error and generation quality. The energy surface is directly evaluated for an ill-posed inverse problem: inferring probe position from a time-series measurement. This inference illuminates symmetries in the data, potentially leading to a method of inquiry to supplement conventional data analysis. Trends in diagnostic signals are inferred via conditional sampling over machine inputs. In addition, this multimodal EBM is able to unconditionally reproduce all distributional modes, suggesting future potential in anomaly detection on the LAPD. Fundamentally, this work demonstrates the flexibility and efficacy of EBM-based generative modeling of laboratory plasma data, and showcases multiple practical uses of just a single trained EBM in the physical sciences.

[LG-344] CONTRA: Conformal Prediction Region via Normalizing Flow Transformation

Link: https://arxiv.org/abs/2605.08561
Authors: Zhenhan Fang, Aixin Tan, Jian Huang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 18 pages, 7 figures and 5 tables

Abstract:Density estimation and reliable prediction regions for outputs are crucial in supervised and unsupervised learning. While conformal prediction effectively generates coverage-guaranteed regions, it struggles with multi-dimensional outputs due to reliance on one-dimensional nonconformity scores. To address this, we introduce CONTRA: CONformal prediction region via normalizing flow TRAnsformation. CONTRA utilizes the latent spaces of normalizing flows to define nonconformity scores based on distances from the center. This allows for the mapping of high-density regions in latent space to sharp prediction regions in the output space, surpassing traditional hyperrectangular or elliptical conformal regions. Further, for scenarios where other predictive models are favored over flow-based models, we extend CONTRA to enhance any such model with a reliable prediction region by training a simple normalizing flow on the residuals. We demonstrate that both CONTRA and its extension maintain guaranteed coverage probability and outperform existing methods in generating accurate prediction regions across various datasets. We conclude that CONTRA is an effective tool for (conditional) density estimation, addressing the under-explored challenge of delivering multi-dimensional prediction regions.
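
The core mechanism, a latent-distance nonconformity score combined with a split-conformal quantile, can be sketched in a few lines. The snippet assumes a fitted normalizing flow `flow_to_latent` mapping outputs to a standard-normal latent space; it is a simplified, unconditional version for illustration rather than CONTRA's full conditional construction or its residual-flow extension.

```python
import numpy as np

def conformal_latent_region(flow_to_latent, calib_y, alpha=0.1):
    """Split-conformal region with a latent-distance nonconformity score:
    a point is in the region iff its latent norm is at most the calibrated
    quantile; marginal coverage >= 1 - alpha holds by exchangeability."""
    scores = np.linalg.norm(flow_to_latent(calib_y), axis=1)
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))        # finite-sample quantile index
    q = np.sort(scores)[min(k, n) - 1]
    def contains(y):
        return np.linalg.norm(flow_to_latent(y), axis=1) <= q
    return contains
```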

[LG-345] Structure-Preserving Reconstruction of Convex Lipschitz Functionals on Hilbert Spaces from Finite Samples

Link: https://arxiv.org/abs/2605.08559
Authors: Anastasis Kratsios
Subjects: Functional Analysis (math.FA); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Numerical Analysis (math.NA); Optimization and Control (math.OC)
Comments:

Abstract:Convex functionals are ubiquitous in applied analysis, appearing as value functions, risk measures, super-hedging prices, and loss functionals in machine learning. In many applications, however, the functional is only observed through finitely many exact pointwise evaluations. We ask whether a convex functional on a separable Hilbert space H can be reconstructed, up to arbitrary uniform accuracy, by an explicit formula which preserves convexity and Lipschitz regularity and is finitely computable. We answer this affirmatively. For every compact convex C \subseteq H , every L -Lipschitz convex functional \rho: C \to \mathbb{R} , and every \varepsilon > 0 , we construct an explicit finite-sample reconstruction which is convex, L -Lipschitz, and uniformly \varepsilon -accurate on C . The construction uses only finitely many linear measurements \langle b, \cdot \rangle_H , with b lying in a finite-dimensional subspace of H , and is exactly implementable by a \operatorname{ReLU} -MLP. Building on this, we introduce convex neural functionals (CNFs), a structured trainable architecture class containing our reconstruction, whose every admissible parameter configuration is automatically convex and Lipschitz, providing a principled foundation for learning convex functionals from finite data.
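
For intuition about structure preservation (this is not the authors' explicit formula), note that any max-affine function f(x) = max_k (a_k @ x + b_k) is automatically convex, Lipschitz with constant max_k ||a_k||, and exactly representable by a small ReLU network. The standard alternating partition / least-squares heuristic below fits such a function to finitely many pointwise samples; all names and hyperparameters are illustrative.

```python
import numpy as np

def fit_max_affine(X, y, n_pieces=8, n_iters=30, seed=0):
    """Fit f(x) = max_k (a_k @ x + b_k) by alternating between assigning each
    sample to its active affine piece and refitting that piece by least squares.
    The result is convex and Lipschitz by construction."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    A = 0.01 * rng.standard_normal((n_pieces, d))
    b = np.full(n_pieces, float(y.mean()))
    for _ in range(n_iters):
        assign = np.argmax(X @ A.T + b, axis=1)      # active piece per sample
        for k in range(n_pieces):
            idx = assign == k
            if idx.sum() <= d:                       # too few points: keep piece as-is
                continue
            Z = np.hstack([X[idx], np.ones((idx.sum(), 1))])
            coef, *_ = np.linalg.lstsq(Z, y[idx], rcond=None)
            A[k], b[k] = coef[:-1], coef[-1]
    return lambda Xq: np.max(Xq @ A.T + b, axis=1)
```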

[LG-346] Learnability and Competition in High-Dimensional Multi-Component ICA

Link: https://arxiv.org/abs/2605.08552
Authors: Eser Ilke Genc, Samet Demir, Zafer Dogan
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 56 pages, 9 figures

Abstract:Independent Component Analysis (ICA) is a foundational tool for unsupervised representation learning, yet its high-dimensional theory remains largely limited to single-component recovery. We develop an asymptotically exact mean-field theory for multi-component online ICA, capturing the coupling induced by simultaneous learning and orthogonalization. In the high-dimensional limit, the joint empirical distribution of learned estimates and ground-truth components converges to a deterministic process, yielding a closed ODE system for the overlap matrix between learned directions and true components. This characterization reveals a genuinely multi-component, initialization-driven phase structure: a decoupled regime, where estimates align with distinct components and evolve nearly independently, and a competition regime, where overlapping initializations induce orthogonality-driven conflicts, slow reorientation, and delayed convergence. Our steady-state analysis gives explicit learnability boundaries and competition conditions linking step size, data moments, and initialization. These conditions show that larger higher-order moments and competition shrink the stable learning-rate window, increase convergence times, and predict a staircase phenomenon in which the number of recoverable components changes discretely with the learning rate. Experiments on synthetic data and hyperspectral remote sensing data validate the predicted trajectories and phase behavior.
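
The coupled dynamics being analyzed are those of an ordinary online multi-component ICA update followed by orthonormalization, as in the sketch below. The kurtosis-type nonlinearity and learning rate are illustrative choices, not necessarily the contrast assumed in the paper's analysis.

```python
import numpy as np

def online_ica_step(W, x, lr=0.01):
    """One online update of k unmixing rows (fourth-moment contrast) followed
    by symmetric orthonormalization W <- (W W^T)^{-1/2} W, which is what
    couples the components to each other."""
    W = W + lr * np.outer((W @ x) ** 3, x)
    u, _, vt = np.linalg.svd(W, full_matrices=False)
    return u @ vt                            # rows are orthonormal again
```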

[LG-347] Sliced Inner Product Gromov-Wasserstein Distances

Link: https://arxiv.org/abs/2605.08546
Authors: Xiaoyun Gong, Gabriel Rioux, Ziv Goldfeld
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 49 pages, 8 figures

Abstract:The Gromov-Wasserstein (GW) problem provides a framework for aligning heterogeneous datasets by matching their intrinsic geometry, but its statistical and computational scaling remains an issue for high-dimensional problems. Slicing techniques offer an appealing route to scalability, but, unlike Wasserstein distances, GW problems do not generally admit closed-form solutions in one dimension. We resolve this problem for the GW problem with inner product cost (IGW), propose a sliced IGW distance that enjoys a natural rotational invariance property, and comprehensively study its structural and computational properties. Numerical experiments validating our theory are presented, followed by applications to heterogeneous clustering of text data and language model representation comparison.
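
The slicing pattern itself is easy to prototype: project both point clouds onto shared random directions and aggregate a one-dimensional cost. The sketch below assumes equal-size, uniformly weighted samples, and for the 1D step it simply evaluates the two monotone rearrangements and keeps the cheaper one; the paper's closed-form 1D solution and rotational-invariance analysis are what make the actual estimator principled.

```python
import numpy as np

def igw_1d(x, y):
    """Inner-product GW cost of the comonotone and anticomonotone couplings
    of two equal-size 1D samples; the smaller of the two is returned."""
    xs, ys = np.sort(x), np.sort(y)
    def matched_cost(u, v):
        return np.mean((np.outer(u, u) - np.outer(v, v)) ** 2)
    return min(matched_cost(xs, ys), matched_cost(xs, ys[::-1]))

def sliced_igw(X, Y, n_proj=100, seed=0):
    """Average the 1D cost over random shared projection directions."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)
        total += igw_1d(X @ theta, Y @ theta)
    return total / n_proj
```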

[LG-348] A Unified Lyapunov-IQC Framework for Uniform Stability of Smooth Quadratic First-Order Accelerated Optimizers

Link: https://arxiv.org/abs/2605.08488
Authors: Don Li, Dacian Daescu
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:

Abstract:We develop a unified Lyapunov-integral quadratic constraint (IQC) framework for establishing uniform stability of first-order accelerated optimization algorithms in the \beta-smooth and \gamma-strongly convex regime. Classical analyses of uniform stability, such as the work of Hardt, Recht, and Singer for stochastic gradient descent (SGD), rely on direct coupling arguments and case-by-case control of iterate differences under random sampling. Extending such arguments to accelerated methods, such as Nesterov Accelerated Gradient (NAG), is complicated by the presence of higher-order state dynamics induced by momentum. We first extend this classical approach with the use of Lyapunov functions to provide a uniform stability bound for smooth quadratic NAG, and supplement this result with small-scale numerical experiments. We then extend this framework by modeling first-order accelerated optimizers as Lur’e-type feedback interconnections between a linear dynamical system and a (non-linear) gradient operator. \beta-Smoothness and \gamma-strong convexity are encoded as a sector IQC inequality. Under this representation, uniform stability is certified via the existence of a quadratic Lyapunov function satisfying a finite-dimensional linear matrix inequality (LMI) in the form of a feasibility problem, which can be solved via semi-definite programming (SDP). We instantiate this framework for NAG and show how classical uniform stability bounds can be recovered within it. These results underscore a structural connection between optimization dynamics and robust control theory, providing a modular methodology for reliable and reproducible numerical certification of uniform stability and generalization behavior of first-order methods via convex optimization tools that is adaptable to increasingly complex optimization algorithms.
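
Numerically, certificates of this kind reduce to a small semidefinite feasibility problem. The CVXPY sketch below searches for a positive definite P and a nonnegative multiplier satisfying a Lessard-Recht-Packard-style LMI for a first-order method written as a Lur'e interconnection; the matrices, margins, rate, and the gradient-descent example at the end are illustrative assumptions rather than the paper's exact LMI.

```python
import numpy as np
import cvxpy as cp

def certify_rate(A, B, C, beta, gamma, rho):
    """Feasibility of an LMI certifying linear rate `rho` for the Lur'e system
    x_{k+1} = A x_k + B u_k, y_k = C x_k, u_k = grad f(y_k),
    with f beta-smooth and gamma-strongly convex (sector IQC)."""
    n, m = A.shape[0], B.shape[1]
    P = cp.Variable((n, n), symmetric=True)
    lam = cp.Variable(nonneg=True)
    # Sector bound for (y, u): -gamma*beta*|y|^2 + (beta+gamma) y'u - |u|^2 >= 0
    M = np.block([[-gamma * beta * np.eye(m), 0.5 * (beta + gamma) * np.eye(m)],
                  [0.5 * (beta + gamma) * np.eye(m), -np.eye(m)]])
    lift = np.block([[C, np.zeros((m, m))],
                     [np.zeros((m, n)), np.eye(m)]])
    lmi = cp.bmat([[A.T @ P @ A - rho ** 2 * P, A.T @ P @ B],
                   [B.T @ P @ A, B.T @ P @ B]]) + lam * (lift.T @ M @ lift)
    lmi = 0.5 * (lmi + lmi.T)                        # enforce symmetry explicitly
    prob = cp.Problem(cp.Minimize(0), [P >> 1e-6 * np.eye(n), lmi << 0])
    prob.solve(solver=cp.SCS)                        # any SDP-capable solver works
    return prob.status in ("optimal", "optimal_inaccurate")

# Example: per-coordinate gradient descent with step 1/beta and condition number 10.
# rho = 0.95 leaves slack above the known 1 - 1/kappa = 0.9 rate, so this should
# report feasible; for NAG one plugs in its two-dimensional state-space matrices.
beta_s, gamma_s = 10.0, 1.0
A = np.array([[1.0]]); B = np.array([[-1.0 / beta_s]]); C = np.array([[1.0]])
print(certify_rate(A, B, C, beta=beta_s, gamma=gamma_s, rho=0.95))
```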

[LG-349] Sinkhorn Treatment Effects: A Causal Optimal Transport Measure

Link: https://arxiv.org/abs/2605.08485
Authors: Medha Agarwal, Alex Luedtke
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Comments: 55 pages, 6 figures

Abstract:We introduce the Sinkhorn treatment effect, an entropic optimal transport measure of divergence between counterfactual distributions. Unlike classical quantities such as the average treatment effect, this measure captures differences across entire distributions. We analyze this divergence as a statistical functional and show it can be written as a smooth transformation of counterfactual mean embeddings with an appropriate kernel. This characterization allows us to establish first-order pathwise differentiability in general, and second-order pathwise differentiability under the null hypothesis of equal counterfactual distributions. Leveraging this smoothness, we construct debiased estimators and use them to obtain asymptotically valid tests for distributional treatment effects with a fixed entropic regularization parameter. Because the power of the test depends on this unknown parameter, we further propose an aggregated test that combines evidence across a grid of regularization choices. Experiments on simulated and image data demonstrate the practical advantages of our estimator and testing procedure.
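
The divergence being estimated is the familiar entropic OT cost; a plug-in version between two outcome samples can be computed with plain Sinkhorn iterations as below. This ignores the paper's debiasing, confounding adjustment, and aggregation over regularization parameters; it only shows the underlying quantity.

```python
import numpy as np

def sinkhorn_cost(X, Y, reg=0.1, n_iters=200):
    """Entropic OT transport cost between the empirical distributions of two
    samples (squared-Euclidean ground cost, uniform weights)."""
    n, m = len(X), len(Y)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    C = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    K = np.exp(-C / reg)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]          # entropic optimal coupling
    return float(np.sum(P * C))
```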

[LG-350] Active Multiple-Prediction-Powered Inference

Link: https://arxiv.org/abs/2605.08429
Authors: Nicholas Brawand, Nima Leclerc, Anhthy Ngo, Matthew Peterson, Sriram Vishwanath, Laith Alhussein, Ben Wellner
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Abstract:Post-deployment monitoring of healthcare AI requires statistically valid, label-efficient methods, but gold-standard labels from clinician chart review are expensive. Prediction-powered inference (PPI) and active statistical inference (ASI) reduce label cost by combining a small labeled sample with abundant model predictions, but both are restricted to a single predictor, a poor fit for modern clinical pipelines that have multiple predictors of differing cost and accuracy available at inference time. We propose Active Multiple-Prediction-Powered Inference (AM-PPI), which routes each instance to a cost-appropriate predictor subset, samples gold-standard labels in proportion to the chosen subset’s residual uncertainty, and reweights predictions to minimize estimator variance, all under a single deployment-time budget. AM-PPI generalizes ASI to leverage multiple predictors and extends Multiple-PPI from global per-predictor allocation to per-instance adaptive routing. We derive closed-form Karush-Kuhn-Tucker (KKT) conditions for all three decisions and prove, via biconvexity and strong duality, that the resulting fixed point is a global optimum despite the joint problem being non-jointly-convex. We establish asymptotic normality with valid coverage, minimum-variance unbiasedness within the linear-prediction augmented inverse propensity weighted (AIPW) class, and a closed-form criterion identifying when multiple predictors help. On synthetic data and three healthcare monitoring tasks, AM-PPI produces 10 to 40 percent narrower confidence intervals (CIs) than single-predictor ASI in the budget regime where routing matters, and matches the better baseline elsewhere.
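
For orientation, the single-predictor prediction-powered estimate of a mean, the building block that AM-PPI extends with per-instance routing and adaptive label sampling, looks like the sketch below. The variance formula and normal CI are the standard PPI ones; the function name and inputs are illustrative.

```python
import numpy as np
from scipy.stats import norm

def ppi_mean(y_labeled, f_labeled, f_unlabeled, alpha=0.05):
    """Prediction-powered mean estimate: predictions on the large unlabeled
    pool plus a bias correction (rectifier) from the small labeled sample."""
    n, N = len(y_labeled), len(f_unlabeled)
    rectifier = y_labeled - f_labeled
    theta = f_unlabeled.mean() + rectifier.mean()
    var = f_unlabeled.var(ddof=1) / N + rectifier.var(ddof=1) / n
    half = norm.ppf(1 - alpha / 2) * np.sqrt(var)
    return theta, (theta - half, theta + half)
```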

[LG-351] On Observation Time for Recovering Latent Hawkes Networks

Link: https://arxiv.org/abs/2605.08400
Authors: Jonas Linkerhägner, Michele Bortolasi, Lorenzo Baldassari, Maarten V. de Hoop, Ivan Dokmanić
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Dynamics of interacting systems in engineering, society, and nature often evolve over latent networks that govern which entities can interact. We study the problem of inferring these networks from event-based observations, which arise naturally in finance, seismology, and neuroscience. While there is substantial algorithmic work addressing this important problem, theoretical results are scarce. In this paper we ask the following fundamental question: what is the minimum time that one must observe the dynamics in order to exactly recover the underlying network, as a function of the number d of interacting entities? For a class of stationary Hawkes processes with sparse, weak interactions, we prove that an observation time of order \log d is sufficient and necessary. For the upper bound we construct a two-stage estimator that uses clipped and binned event data for screening, followed by a least-squares refinement, and apply concentration bounds derived from the Poisson cluster representation. For the lower bound we combine Fano’s inequality with Jacod’s Girsanov formula for point processes on a suitable subclass of networks.

[LG-352] Transfer Learning for Dead Fuel Moisture Prediction Using Time-Warping Recurrent Neural Networks

Link: https://arxiv.org/abs/2605.08379
Authors: Jonathon Hirschi, Jan Mandel, Adam Kochanski
Subjects: Applications (stat.AP); Machine Learning (cs.LG)
Comments: Preprint. Related to PhD thesis work that is also available for preprint at this https URL

Abstract:This paper proposes a time-warping transfer learning method, a technique for temporally rescaling the learned dynamics of a recurrent neural network (RNN) with a Long Short-Term Memory (LSTM) layer to enable task transfer across fuel moisture classes. Fuel moisture content (FMC) is divided into idealized classes based on characteristic lag time. Large quantities of real-time data are available for 10h fuels from sensors on weather stations, but observations of other fuel classes are sparse in space and time. We use transfer learning to adapt an RNN pretrained on 10h FMC to predict FMC for 1h, 100h, and 1000h fuels. We validate this method using data from a landmark field study conducted in Oklahoma that was used to calibrate the state-of-the-art Nelson fuel moisture model.

[LG-353] Non-intrusive Body Composition Assessment from Full-body mmWave Scans

Link: https://arxiv.org/abs/2605.08306
Authors: Miriam Senne, Benjamin D. Killeen, Tony Wang, Nassir Navab
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Comments:

Abstract:Body composition assessment (BCA) provides detailed information about the distribution of different tissue types in the body, enabling more precise characterization of individuals than BMI or weight alone. Consistent and frequent BCA would be valuable for personalized medicine, but the gold standard methods for BCA, such as CT and MRI, are only practical for opportunistic monitoring of patients with clinical indications for imaging and are not suitable for routine use in the general population. Here, we consider an imaging modality which is not currently used in medical applications: millimeter wave (mmWave) radar. Commonly used in security settings, mmWave scans enable fast, non-intrusive, and privacy-preserving reconstruction of full body shape without the need to remove clothing. To demonstrate the feasibility of fast and convenient BCA from mmWave scans, we present a method for BCA value regression using a multi-task learning strategy that leverages synthetic mmWave-like point clouds derived from clinical imaging and parametric human models. We evaluate the model on a pilot cohort of real mmWave scans with bioimpedance-derived body fat measurements, supporting the feasibility of estimating visceral adipose tissue (VAT) and body fat percentage (BFP) from mmWave data acquired through clothing in a standing posture. We find that the model can predict VAT and BFP with a mean absolute error of 1.0 L and 3.2%, respectively, demonstrating the potential of mmWave scanning for routine BCA in a wide range of settings.

[LG-354] Decentralized Conformal Novelty Detection via Quantized Model Exchange

Link: https://arxiv.org/abs/2605.08263
Authors: Kyle Loh, Yu Xiang
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP); Methodology (stat.ME)
Comments:

Abstract:This work studies decentralized novelty detection with global false discovery rate (FDR) control across heterogeneous composite null distributions, without sharing the raw data due to privacy and bandwidth considerations. We propose a framework based on the exchange of quantized surrogate models, allowing independent agents to share low-precision representations of locally learned non-conformity score functions. We prove that evaluating data against these quantized composite scores preserves conditional exchangeability, providing rigorous finite-sample guarantees for global FDR control. Empirical studies on synthetic datasets confirm our theoretical results, demonstrating that the proposed approach maintains competitive statistical power while drastically reducing the communication cost.
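
The single-agent building block, conformal p-values computed against a null calibration set followed by Benjamini-Hochberg for FDR control, is sketched below. The paper's contribution is making this work across agents that exchange only quantized surrogate score models; that exchange is not shown here.

```python
import numpy as np

def conformal_pvalues(test_scores, null_calib_scores):
    """Conformal p-value per test point: rank of its nonconformity score among
    calibration scores drawn under the null (larger score = more novel)."""
    n = len(null_calib_scores)
    geq = np.sum(null_calib_scores[None, :] >= test_scores[:, None], axis=1)
    return (1.0 + geq) / (n + 1.0)

def benjamini_hochberg(pvals, fdr=0.1):
    """Standard BH step-up procedure; returns a boolean mask of discoveries."""
    m = len(pvals)
    order = np.argsort(pvals)
    below = pvals[order] <= fdr * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject
```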

[LG-355] Inverse Design of Multi-Layer Sub-Pixel-Resolution RF Passives Through Grayscale Diffusion with Flexible S-Parameter Conditioning

Link: https://arxiv.org/abs/2605.08233
Authors: Tommaso Dreossi, Christopher M. Bryant, Hao Liu, Nathan Mirman, Noah Kessler, Michael Frei, Harish Krishnaswamy
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments:

Abstract:Inverse design of RF passive components from S-parameters is a high-dimensional, ill-posed problem, and prior generative approaches are limited to single-layer binary-metallization structures. This paper presents an inverse design approach that generates passive components from partial S-parameter inputs on an 8\times8 mm board discretized at 64\times64 pixels with sub-pixel grayscale metallization across 1-20 GHz. The framework generates two-layer copper layouts with vias, with hard physical constraints on feed locations enforced through annealed Langevin projection, flexible multi-modal conditioning on partial S-parameter specifications, port locations, dielectric properties, reference topology, and variable port placement. Candidate designs are generated in seconds, with surrogate-predicted S-parameters matching targets to within 0.77 \pm 1.28 dB weighted mean absolute error. We validate the approach with two fabricated designs on RO4003C: a manufacturable alternative to a hairpin filter whose coupling gaps violate fabrication rules, and a combline bandpass filter designed from scratch given only target S-parameters.

[LG-356] Learning the Channel Gain from Anywhere to Anywhere via Cross-environment Transformer Estimators

Link: https://arxiv.org/abs/2605.08211
Authors: Prasenjit Dhara, Daniel Romero
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:

Abstract:Channel-gain maps provide the channel gain between any two locations in a geographical region. They find numerous applications, from resource allocation and interference control to path planning for autonomous vehicles. Channel-gain map estimation (CGME) is considerably more challenging than conventional radio map estimation (RME) because channel-gain maps are functions over a 6-dimensional input space. This calls for specialized methods, which currently rely on the (inaccurate) radio tomographic model or require a prohibitively large number of measurements since they do not exploit any spatial structure. This paper overcomes this issue by leveraging spatial patterns that channel-gain maps exhibit across environments, as dictated by the laws of physics and typical environmental characteristics (e.g. building materials and layouts). Adopting a metalearning perspective, a transformer-based estimator is proposed to implicitly learn this common structure from measurements collected in multiple environments. This enables CGME in new environments from significantly fewer measurements (five times less in our experiments). To maximize learning efficiency, the transformer is composed with a feature map that enforces the invariances of CGME, such as those following from reciprocity. Numerical experiments corroborate the merits of the proposed estimator relative to existing methods.

[LG-357] Domain-Adaptive Arrhythmia Classification Using a Hybrid Transformer on Wearable Heart Signals

Link: https://arxiv.org/abs/2605.08199
Authors: Maedeh H. Toosi, Siamak Mohammadi
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments:

Abstract:Cardiovascular disease remains the leading cause of death globally, underscoring the need for effective, accessible monitoring solutions, particularly through wearable devices that enable continuous, real-time tracking of heart rhythms in home settings. However, deploying deep learning models trained on clinical electrocardiogram (ECG) datasets to wearable devices remains challenging, as differences in recording equipment, signal quality, and patient populations introduce domain shifts that degrade model performance. We propose a hybrid transformer model that processes continuous ECG signals alongside seven heart rate variability (HRV) features, where the raw signal path captures beat-level morphological patterns and the HRV path encodes rhythm regularity statistics, allowing the model to jointly leverage complementary information from both representations. To enhance the model’s ability to generalize across domains, we employ representation learning techniques, including Maximum Mean Discrepancy (MMD), a non-parametric kernel-based metric that quantifies the distance between feature distributions of different domains, to align feature distributions between source and target domains, addressing the challenge of domain shifts between public datasets and wearable device data. By leveraging five public ECG datasets for training, the model learns robust, generalized representations that mitigate domain-specific biases. When tested on wearable device data from an unseen domain, the model achieved an F1-macro of 95% and a balanced accuracy of 96.15%. These results demonstrate minimal performance degradation, with only a 2% drop in F1-macro compared to seen-domain evaluation, highlighting the model’s generalization capabilities and its potential for reliable, real-time heart monitoring applications in home and ambulatory settings.
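
The MMD alignment term mentioned above is simple to add to a training loss; a standard Gaussian-kernel, biased (V-statistic) version is sketched below. The kernel bandwidth and loss weighting are assumptions, not the paper's settings.

```python
import torch

def gaussian_mmd(source_feats, target_feats, sigma=1.0):
    """Squared MMD between source- and target-domain feature batches with a
    Gaussian kernel; minimized jointly with the task loss to align domains."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2.0 * sigma ** 2))
    return (k(source_feats, source_feats).mean()
            + k(target_feats, target_feats).mean()
            - 2.0 * k(source_feats, target_feats).mean())
```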

[LG-358] Towards Interpretable Damage Detection based on Aerodynamic Pressure Measurements

Link: https://arxiv.org/abs/2605.08187
Authors: Philip Franz, Max von Danwitz, Gregory Duthé, Alexander Popp, Eleni Chatzi
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 28 pages, 30 figures

Abstract:The increasing flexibility of modern large wind turbine blades necessitates cost-efficient and reliable structural monitoring solutions. For this purpose, we propose to use aerodynamic pressure measurements obtained via Aerosense, a novel, non-intrusive and economical sensing system. In former work [Franz et al., 2025], we investigated the potential of aerodynamic pressure measurements for structural damage detection on elastic and aerodynamically loaded structures. An experimental campaign was conducted on a NACA 633418 airfoil mounted on a vertically vibrating cantilever beam within an open wind tunnel. Structural damage was introduced progressively through controlled saw cuts near the beam support. Aerodynamic pressure distributions were recorded under varying inflow conditions and structural states. Based on this data set, we developed a convolutional neural network to detect structural damage and classify its severity using only aerodynamic pressure signals. The results demonstrate that pressure measurements can effectively enable real-time detection and quantification of damage in elastic, beam-like structures subjected to mildly turbulent flow and varying operational conditions. Recognizing the limitations of pure black-box classification, in this study, we further incorporate physics-based insights and explainable machine learning methods to interpret how structural damage influences both the dynamic response and the aerodynamic pressure field. This leads to an enhanced damage detection pipeline, aiming to improve transparency, robustness, and physical consistency in data-driven monitoring of elastic, aerodynamically loaded structures.

[LG-359] Neural Posterior Estimation of Terrain Parameters from Radar Sounder Data

Link: https://arxiv.org/abs/2605.08179
Authors: Jordy Dal Corso, Annalena Kofler, Marco Cortellazzi, Lorenzo Bruzzone, Bernhard Schölkopf
Subjects: Signal Processing (eess.SP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: 5 pages, 3 figures; accepted at IGARSS 2026, 9 - 14 August 2026, Washington D.C., USA

Abstract:Radar sounders are electromagnetic instruments that can probe deep into the subsurface of Earth and other planetary bodies by processing the echo of transmitted radar waves. Conventional approaches for analyzing such data rely on approximate assumptions and often produce point estimates that ignore parameter correlations as well as galactic and measurement noise. We propose a simulation-based inference approach to terrain parameter inversion from radar sounder data, where synthetic observations from a GPU-based simulator are used to train a neural network-based density estimator for neural posterior estimation (NPE). By explicitly conditioning on reference surface assumptions, the proposed framework allows systematic evaluation of posterior robustness to reference surface variability. We demonstrate that our NPE model is well calibrated on simulated data and transferable to real Mars radar profiles, where we analyze terrain parameters using literature-informed reference values.
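
Neural posterior estimation follows a simple amortized recipe: draw (theta, x) pairs from the simulator and fit a conditional density q(theta | x) by maximum likelihood. The sketch below uses a diagonal-Gaussian head purely for illustration; the paper's density estimator, its conditioning on reference-surface assumptions, and its calibration checks are not reproduced.

```python
import torch
import torch.nn as nn

class GaussianNPE(nn.Module):
    """Minimal neural posterior estimator: map an observation x to the mean and
    log-std of a diagonal-Gaussian posterior over simulator parameters theta."""
    def __init__(self, x_dim, theta_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * theta_dim))

    def log_prob(self, theta, x):
        mu, log_std = self.net(x).chunk(2, dim=-1)
        return torch.distributions.Normal(mu, log_std.exp()).log_prob(theta).sum(-1)

def train_npe(model, theta_sim, x_sim, n_epochs=100, lr=1e-3):
    """Maximize log q(theta | x) over simulator draws (theta_sim, x_sim)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(n_epochs):
        opt.zero_grad()
        loss = -model.log_prob(theta_sim, x_sim).mean()
        loss.backward()
        opt.step()
    return model
```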

Attachments

Download the full list of today's papers