本篇博文主要内容为 2026-02-05 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR、MA六个大方向区分。

说明:每日论文数据从Arxiv.org获取,每天早上12:30左右定时自动更新。

提示: 当天未及时更新,有可能是Arxiv当日未有新的论文发布,也有可能是脚本出错。尽可能会在当天修复。

目录

概览 (2026-02-05)

今日共更新612篇论文,其中:

  • 自然语言处理102篇(Computation and Language (cs.CL))
  • 人工智能169篇(Artificial Intelligence (cs.AI))
  • 计算机视觉115篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习224篇(Machine Learning (cs.LG))
  • 多智能体系统12篇(Multiagent Systems (cs.MA))

多智能体系统

[MA-0] WideSeek-R1: Exploring Width Scaling for Broad Information Seeking via Multi-Agent Reinforcement Learning

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在处理广泛信息检索任务时,从个体能力瓶颈向组织协同能力瓶颈转变的问题。现有方法多依赖于深度扩展(depth scaling),即单个智能体通过多轮推理和工具调用完成复杂任务,但难以应对任务广度带来的挑战。为此,作者提出宽度扩展(width scaling)策略,核心解决方案是设计一种基于多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)的“主代理-子代理”框架WideSeek-R1,其关键在于利用共享大语言模型(LLM)并行执行多个具有隔离上下文与专用工具的子代理,并通过联合优化主代理与子代理的协作机制,在20k条宽泛信息查询任务数据集上实现高效、可扩展的并行调度与执行,从而显著提升系统整体性能。

链接: https://arxiv.org/abs/2602.04634
作者: Zelai Xu,Zhexuan Xu,Ruize Zhang,Chunyang Zhu,Shi Yu,Weilin Liu,Quanlu Zhang,Wenbo Ding,Chao Yu,Yu Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have largely focused on depth scaling, where a single agent solves long-horizon problems with multi-turn reasoning and tool use. However, as tasks grow broader, the key bottleneck shifts from individual competence to organizational capability. In this work, we explore a complementary dimension of width scaling with multi-agent systems to address broad information seeking. Existing multi-agent systems often rely on hand-crafted workflows and turn-taking interactions that fail to parallelize work effectively. To bridge this gap, we propose WideSeek-R1, a lead-agent-subagent framework trained via multi-agent reinforcement learning (MARL) to synergize scalable orchestration and parallel execution. By utilizing a shared LLM with isolated contexts and specialized tools, WideSeek-R1 jointly optimizes the lead agent and parallel subagents on a curated dataset of 20k broad information-seeking tasks. Extensive experiments show that WideSeek-R1-4B achieves an item F1 score of 40.0% on the WideSearch benchmark, which is comparable to the performance of single-agent DeepSeek-R1-671B. Furthermore, WideSeek-R1-4B exhibits consistent performance gains as the number of parallel subagents increases, highlighting the effectiveness of width scaling.
zh

[MA-1] Dual Mind World Model Inspired Network Digital Twin for Access Scheduling

【速读】:该论文旨在解决工业物联网(Industrial Internet of Things, IIoT)和实时网络物理系统(Cyber-Physical Systems, CPS)中动态流量、截止时间约束与干扰限制下智能调度策略的适应性不足问题。传统基于规则或纯数据驱动的调度方法难以在复杂环境中实现高效决策,且缺乏可解释性和样本效率。解决方案的关键在于提出一种受双心智世界模型(Dual Mind World Model, DMWM)启发的数字孪生(Digital Twin)调度框架,其核心创新是将短期预测规划与符号化模型驱动的回溯滚动(symbolic model-based rollout)相结合,使调度器能够预判未来网络状态并据此调整传输决策,从而在突发流量、干扰受限及截止时间敏感场景下实现更优性能,同时保持可解释性与低样本开销。

链接: https://arxiv.org/abs/2602.04566
作者: Hrishikesh Dutta,Roberto Minerva,Noel Crespi
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Emerging networked systems such as industrial IoT and real-time cyber-physical infrastructures demand intelligent scheduling strategies capable of adapting to dynamic traffic, deadlines, and interference constraints. In this work, we present a novel Digital Twin-enabled scheduling framework inspired by Dual Mind World Model (DMWM) architecture, for learning-informed and imagination-driven network control. Unlike conventional rule-based or purely data-driven policies, the proposed DMWM combines short-horizon predictive planning with symbolic model-based rollout, enabling the scheduler to anticipate future network states and adjust transmission decisions accordingly. We implement the framework in a configurable simulation testbed and benchmark its performance against traditional heuristics and reinforcement learning baselines under varied traffic conditions. Our results show that DMWM achieves superior performance in bursty, interference-limited, and deadline-sensitive environments, while maintaining interpretability and sample efficiency. The proposed design bridges the gap between network-level reasoning and low-overhead learning, marking a step toward scalable and adaptive NDT-based network optimization.
zh

[MA-2] SPEAR: An Engineering Case Study of Multi-Agent Coordination for Smart Contract Auditing

【速读】:该论文旨在解决智能合约审计过程中存在的协调效率低、容错能力弱及资源利用率不高的问题。其解决方案的关键在于提出一个名为SPEAR的多智能体协同框架,该框架将审计任务建模为由专业化智能体协作完成的使命:规划智能体(Planning Agent)基于风险感知启发式策略优先排序合约,执行智能体(Execution Agent)通过合同网协议(Contract Net protocol)分配任务,修复智能体(Repair Agent)采用以程序化优先的修复策略自主恢复脆弱生成物;各智能体通过AGM一致的信念更新机制维护本地知识,并借助协商与拍卖协议实现动态协调,从而在面对可控故障场景时展现出更优的协调能力、恢复行为和资源利用效率。

链接: https://arxiv.org/abs/2602.04418
作者: Arnab Mallick,Indraveni Chebolu,Harmesh Rana
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:We present SPEAR, a multi-agent coordination framework for smart contract auditing that applies established MAS patterns in a realistic security analysis workflow. SPEAR models auditing as a coordinated mission carried out by specialized agents: a Planning Agent prioritizes contracts using risk-aware heuristics, an Execution Agent allocates tasks via the Contract Net protocol, and a Repair Agent autonomously recovers from brittle generated artifacts using a programmatic-first repair policy. Agents maintain local beliefs updated through AGM-compliant revision, coordinate via negotiation and auction protocols, and revise plans as new information becomes available. An empirical study compares the multi-agent design with centralized and pipeline-based alternatives under controlled failure scenarios, focusing on coordination, recovery behavior, and resource use.
zh

[MA-3] From Assumptions to Actions: Turning LLM Reason ing into Uncertainty-Aware Planning for Embodied Agents ICLR2026

【速读】:该论文旨在解决多智能体、部分可观测且去中心化环境中,智能体在面对隐藏对象和协作方意图不确定性时的规划与决策难题。传统方法依赖频繁的智能体间通信以缓解不确定性,但这种方式带来高昂的token消耗和时间成本,并可能干扰人类协作者的既定工作流。解决方案的关键在于提出PCE(Planner-Composer-Evaluator)框架,该框架将大语言模型(LLM)推理轨迹中隐含的碎片化假设转化为结构化的决策树:内部节点编码环境假设,叶节点映射至具体动作;每条路径通过场景可能性、目标导向收益和执行成本进行评分,从而实现无需高频率通信的理性行动选择。实验证明,PCE在多个基准测试中显著提升任务成功率与效率,同时保持与基线相当的token使用量,并在用户研究中被人类伙伴视为更高效和可信。

链接: https://arxiv.org/abs/2602.04326
作者: SeungWon Seo,SooBin Lim,SeongRae Noh,Haneul Kim,HyeongYeop Kang
机构: Korea University (韩国大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: 31 pages, 10 figures, Accepted ICLR 2026

点击查看摘要

Abstract:Embodied agents operating in multi-agent, partially observable, and decentralized environments must plan and act despite pervasive uncertainty about hidden objects and collaborators’ intentions. Recent advances in applying Large Language Models (LLMs) to embodied agents have addressed many long-standing challenges, such as high-level goal decomposition and online adaptation. Yet, uncertainty is still primarily mitigated through frequent inter-agent communication. This incurs substantial token and time costs, and can disrupt established workflows, when human partners are involved. We introduce PCE, a Planner-Composer-Evaluator framework that converts the fragmented assumptions latent in LLM reasoning traces into a structured decision tree. Internal nodes encode environment assumptions and leaves map to actions; each path is then scored by scenario likelihood, goal-directed gain, and execution cost to guide rational action selection without heavy communication. Across two challenging multi-agent benchmarks (C-WAH and TDW-MAT) and three diverse LLM backbones, PCE consistently outperforms communication-centric baselines in success rate and task efficiency while showing comparable token usage. Ablation results indicate that the performance gains obtained by scaling model capacity or reasoning depth persist even when PCE is applied, while PCE consistently raises the baseline across both capacity and reasoning-depth scales, confirming that structured uncertainty handling complements both forms of scaling. A user study further demonstrates that PCE produces communication patterns that human partners perceive as more efficient and trustworthy. Together, these results establish a principled route for turning latent LLM assumptions into reliable strategies for uncertainty-aware planning.
zh

[MA-4] Disentangling Causal Importance from Emergent Structure in Multi-Expert Orchestration

【速读】:该论文旨在解决多专家系统(Multi-expert systems)中任务编排策略(orchestration policies)的不透明性问题,即如何明确理解多个大语言模型(Large Language Models, LLMs)在协作过程中交互结构、执行顺序与因果影响之间的关系。其解决方案的关键在于提出INFORM这一可解释性分析框架,将编排过程视为一个显式的、可分析的计算步骤,从而实现专家交互结构、执行顺序与因果归因的解耦。通过该方法,研究发现路由主导性(routing dominance)并非功能必要性的可靠指标,并揭示了关系重要性(由路由质量与交互拓扑体现)与内在重要性(由梯度驱动的因果归因衡量)之间的显著差异:频繁被选中的专家可能仅为交互枢纽而实际因果影响有限,而稀疏路由的专家却可能是结构性关键节点。此洞察表明,仅依赖准确率无法全面评估系统稳定性,INFORM能识别出超越性能指标的因果与结构依赖关系。

链接: https://arxiv.org/abs/2602.04291
作者: Sudipto Ghosh,Sujoy Nath,Sunny Manchanda,Tanmoy Chakraborty
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Multi-expert systems, where multiple Large Language Models (LLMs) collaborate to solve complex tasks, are increasingly adopted for high-performance reasoning and generation. However, the orchestration policies governing expert interaction and sequencing remain largely opaque. We introduce INFORM, an interpretability analysis that treats orchestration as an explicit, analyzable computation, enabling the decoupling of expert interaction structure, execution order, and causal attribution. We use INFORM to evaluate an orchestrator on GSM8K, HumanEval, and MMLU using a homogeneous consortium of ten instruction-tuned experts drawn from LLaMA-3.1 8B, Qwen-3 8B, and DeepSeek-R1 8B, with controlled decoding-temperature variation, and a secondary heterogeneous consortium spanning 1B-7B parameter models. Across tasks, routing dominance is a poor proxy for functional necessity. We reveal a divergence between relational importance, captured by routing mass and interaction topology, and intrinsic importance, measured via gradient-based causal attribution: frequently selected experts often act as interaction hubs with limited causal influence, while sparsely routed experts can be structurally critical. Orchestration behaviors emerge asynchronously, with expert centralization preceding stable routing confidence and expert ordering remaining non-deterministic. Targeted ablations show that masking intrinsically important experts induces disproportionate collapse in interaction structure compared to masking frequent peers, confirming that INFORM exposes causal and structural dependencies beyond accuracy metrics alone.
zh

[MA-5] On the Uncertainty of Large Language Model-Based Multi-Agent Systems

【速读】:该论文旨在解决多智能体系统(Multi-agent Systems, MAS)在基于公开大语言模型(Large Language Models, LLMs)执行复杂任务时,其有效性机制尚不明确的问题,特别是对MAS成功或失败的内在原因缺乏系统性理解。解决方案的关键在于引入不确定性(uncertainty)视角,通过分析不同拓扑结构下六种基准任务中token级、轨迹级和轮次级熵(entropy)的变化动态,发现:1)单个智能体在约43.3%的情况下优于MAS;2)不确定性主要在首轮交互中确定;3)降低任意阶段的不确定性对确保正确解至关重要。基于此,作者提出一种简单但有效的算法——熵判断器(Entropy Judger),从MAS的pass@k结果中选择最优解,从而在所有配置和任务中实现一致的准确率提升。

链接: https://arxiv.org/abs/2602.04234
作者: Yuxuan Zhao,Sijia Chen,Ningxin Su
机构: 未知
类目: Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Multi-agent systems (MAS) have emerged as a prominent paradigm for leveraging large language models (LLMs) to tackle complex tasks. However, the mechanisms governing the effectiveness of MAS built upon publicly available LLMs, specifically the underlying rationales for their success or failure, remain largely unexplored. In this paper, we revisit MAS through the perspective of uncertainty, considering both intra- and inter-agent dynamics by investigating entropy transitions during problem-solving across various topologies and six benchmark tasks. By analyzing 245 features spanning token-, trajectory-, and round-level entropy, we counterintuitively find that a single agent outperforms MAS in approximately 43.3% of cases, and that uncertainty dynamics are largely determined during the first round of interaction. Furthermore, we provide three key observations: 1) Certainty Preference: reducing uncertainty at any stage for any agent is critical for guaranteeing correct solutions; 2) Base Uncertainty: base models with lower entropy during problem-solving directly benefit MAS performance; and 3) Task Awareness: entropy dynamics of MAS play varying roles across different tasks. Building on these insights, we introduce a simple yet effective algorithm, the Entropy Judger, to select solutions from MAS’s pass@k results, leading to consistent accuracy improvements across all MAS configurations and tasks. Our source code is available at this https URL.
zh

[MA-6] KGLAMP: Knowledge Graph-guided Language model for Adaptive Multi-robot Planning and Replanning

【速读】:该论文旨在解决异构多机器人系统在长时程任务中因环境动态变化导致的符号化表示不准确与计划一致性难以维持的问题。现有方法中,经典PDDL规划器依赖人工构建的符号模型,而基于大语言模型(LLM)的规划方法则常忽略机器人异质性和环境不确定性。其解决方案的关键在于提出KGLAMP框架——一个基于知识图谱引导的大语言模型规划框架,通过维护一个结构化的知识图谱来编码对象关系、空间可达性及机器人能力,并作为持久且动态更新的记忆体,在检测到状态不一致时触发重规划,从而实现符号化计划对环境演化的自适应调整。

链接: https://arxiv.org/abs/2602.04129
作者: Chak Lam Shek,Faizan M. Tariq,Sangjae Bae,David Isele,Piyush Gupta
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Heterogeneous multi-robot systems are increasingly deployed in long-horizon missions that require coordination among robots with diverse capabilities. However, existing planning approaches struggle to construct accurate symbolic representations and maintain plan consistency in dynamic environments. Classical PDDL planners require manually crafted symbolic models, while LLM-based planners often ignore agent heterogeneity and environmental uncertainty. We introduce KGLAMP, a knowledge-graph-guided LLM planning framework for heterogeneous multi-robot teams. The framework maintains a structured knowledge graph encoding object relations, spatial reachability, and robot capabilities, which guides the LLM in generating accurate PDDL problem specifications. The knowledge graph serves as a persistent, dynamically updated memory that incorporates new observations and triggers replanning upon detecting inconsistencies, enabling symbolic plans to adapt to evolving world states. Experiments on the MAT-THOR benchmark show that KGLAMP improves performance by at least 25.5% over both LLM-only and PDDL-based variants.
zh

[MA-7] FDA Flocking: Future Direction-Aware Flocking via Velocity Prediction

【速读】:该论文旨在解决现有群体 flocking 模型多为反应式(reactive)而忽略前瞻信号导致协调性不足的问题,尤其在存在感知与通信延迟及测量噪声时易失稳。其解决方案的关键在于提出一种基于生物启发的前瞻性增强机制——未来方向感知(Future Direction-Aware, FDA) flocking,通过引入一个可调混合参数,使智能体将反应式对齐行为与基于邻近个体短期未来速度预测的项进行融合,从而提升速度一致性、群集凝聚力与分离平衡,并显著改善系统对延迟和噪声的鲁棒性。

链接: https://arxiv.org/abs/2602.04012
作者: Hossein B. Jond,Martin Saska
机构: 未知
类目: Robotics (cs.RO); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Understanding self-organization in natural collectives such as bird flocks inspires swarm robotics, yet most flocking models remain reactive, overlooking anticipatory cues that enhance coordination. Motivated by avian postural and wingbeat signals, as well as multirotor attitude tilts that precede directional changes, this work introduces a principled, bio-inspired anticipatory augmentation of reactive flocking termed Future Direction-Aware (FDA) flocking. In the proposed framework, agents blend reactive alignment with a predictive term based on short-term estimates of neighbors’ future velocities, regulated by a tunable blending parameter that interpolates between reactive and anticipatory behaviors. This predictive structure enhances velocity consensus and cohesion-separation balance while mitigating the adverse effects of sensing and communication delays and measurement noise that destabilize reactive baselines. Simulation results demonstrate that FDA achieves faster and higher alignment, enhanced translational displacement of the flock, and improved robustness to delays and noise compared to a purely reactive model. Future work will investigate adaptive blending strategies, weighted prediction schemes, and experimental validation on multirotor drone swarms.
zh

[MA-8] Agent Ark: Distilling Multi-Agent Intelligence into a Single LLM Agent

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)多智能体系统在实际部署中面临的高计算成本和错误传播问题。其解决方案的关键在于提出AgentArk框架,通过将多智能体动态过程蒸馏到单一模型的权重中,实现从推理时显式交互到训练时隐式能力的转化,从而在保持单智能体计算效率的同时,赋予其多智能体系统的推理与自修正能力。该方法通过三种分层蒸馏策略——增强推理微调、基于轨迹的增强和过程感知蒸馏——有效提升了模型在多样任务中的鲁棒性和泛化性能。

链接: https://arxiv.org/abs/2602.03955
作者: Yinyi Luo,Yiqiao Jin,Weichen Yu,Mengqi Zhang,Srijan Kumar,Xiaoxiao Li,Weijie Xu,Xin Chen,Jindong Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:While large language model (LLM) multi-agent systems achieve superior reasoning performance through iterative debate, practical deployment is limited by their high computational cost and error propagation. This paper proposes AgentArk, a novel framework to distill multi-agent dynamics into the weights of a single model, effectively transforming explicit test-time interactions into implicit model capabilities. This equips a single agent with the intelligence of multi-agent systems while remaining computationally efficient. Specifically, we investigate three hierarchical distillation strategies across various models, tasks, scaling, and scenarios: reasoning-enhanced fine-tuning; trajectory-based augmentation; and process-aware distillation. By shifting the burden of computation from inference to training, the distilled models preserve the efficiency of one agent while exhibiting strong reasoning and self-correction performance of multiple agents. They further demonstrate enhanced robustness and generalization across diverse reasoning tasks. We hope this work can shed light on future research on efficient and robust multi-agent development. Our code is at this https URL.
zh

[MA-9] Enhancing Mathematical Problem Solving in LLM s through Execution-Driven Reason ing Augmentation ACL

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的数学问题求解系统在推理过程可修订性方面的不足,即现有方法要么采用刚性顺序流水线无法修正前期错误,要么依赖启发式自评估机制难以有效识别和修复错误;同时,程序化上下文可能干扰语言模型导致准确率下降。其解决方案的关键在于提出迭代改进的程序构建方法(Iteratively Improved Program Construction, IIPC),通过将执行反馈与基础LLM的思维链(Chain-of-thought)能力相结合,实现对程序化推理链的持续迭代优化,从而在保持高层语境聚焦的同时提升推理可靠性。

链接: https://arxiv.org/abs/2602.03950
作者: Aditya Basarkar,Benyamin Tabarsi,Tiffany Barnes,Dongkuan(DK)Xu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 9 pages, 7 figures, submitted to ACL ARR 2026

点击查看摘要

Abstract:Mathematical problem solving is a fundamental benchmark for assessing the reasoning capabilities of artificial intelligence and a gateway to applications in education, science, and engineering where reliable symbolic reasoning is essential. Although recent advances in multi-agent LLM-based systems have enhanced their mathematical reasoning capabilities, they still lack a reliably revisable representation of the reasoning process. Existing agents either operate in rigid sequential pipelines that cannot correct earlier steps or rely on heuristic self-evaluation that can fail to identify and fix errors. In addition, programmatic context can distract language models and degrade accuracy. To address these gaps, we introduce Iteratively Improved Program Construction (IIPC), a reasoning method that iteratively refines programmatic reasoning chains and combines execution feedback with the native Chain-of-thought abilities of the base LLM to maintain high-level contextual focus. IIPC surpasses competing approaches in the majority of reasoning benchmarks on multiple base LLMs. All code and implementations are released as open source.
zh

[MA-10] El Agent e Quntur: A research collaborator agent for quantum chemistry

【速读】:该论文旨在解决计算量子化学(computational quantum chemistry)工具在实际应用中因方法复杂性、软件异构性和结果解释门槛高而导致的可及性问题,从而将这些强大但高度专业化的工具扩展至更广泛的化学研究者群体。解决方案的关键在于提出并实现了一个名为El Agente Quntur的分层多智能体AI系统,其核心创新包括:(1)摒弃硬编码的程序化策略,转而采用基于推理的决策机制;(2)构建通用且可组合的操作单元以提升泛化能力和效率;(3)引入引导式深度研究机制,整合跨子学科的量子化学推理与对软件内部逻辑和语法的深入理解。该系统不仅支持ORCA 6.0全部计算功能,还能通过推理科学文献和软件文档自主规划、执行、调整和分析计算实验,体现了从自动化工具向科研协作伙伴的范式转变。

链接: https://arxiv.org/abs/2602.04850
作者: Juan B. Pérez-Sánchez,Yunheng Zou,Jorge A. Campos-Gonzalez-Angulo,Marcel Müller,Ignacio Gustin,Andrew Wang,Han Hao,Tsz Wai Ko,Changhyeok Choi,Eric S. Isbrandt,Mohammad Ghazi Vakili,Hanyong Xu,Chris Crebolder,Varinia Bernales,Alán Aspuru-Guzik
机构: 未知
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Quantum chemistry is a foundational enabling tool for the fields of chemistry, materials science, computational biology and others. Despite of its power, the practical application of quantum chemistry simulations remains in the hands of qualified experts due to methodological complexity, software heterogeneity, and the need for informed interpretation of results. To bridge the accessibility gap for these tools and expand their reach to chemists with broader backgrounds, we introduce El Agente Quntur, a hierarchical, multi-agent AI system designed to operate not merely as an automation tool but as a research collaborator for computational quantum chemistry. Quntur was designed following three main strategies: i) elimination of hard-coded procedural policies in favour of reasoning-driven decisions, ii) construction of general and composable actions that facilitate generalization and efficiency, and iii) implementation of guided deep research to integrate abstract quantum-chemical reasoning across subdisciplines and a detailed understanding of the software’s internal logic and syntax. Although instantiated in ORCA, these design principles are applicable to research agents more generally and easily expandable to additional quantum chemistry packages and beyond. Quntur supports the full range of calculations available in ORCA 6.0 and reasons over software documentation and scientific literature to plan, execute, adapt, and analyze in silico chemistry experiments following best practices. We discuss the advances and current bottlenecks in agentic systems operating at the research level in computational chemistry, and outline a roadmap toward a fully autonomous end-to-end computational chemistry research agent.
zh

[MA-11] El Agent e Estructural: An Artificially Intelligent Molecular Editor

【速读】:该论文旨在解决传统分子生成模型在化学结构操作中缺乏精确控制与直观交互能力的问题,尤其是在原子级或官能团层面的替换、连接性调整及立体化学调控等方面。解决方案的关键在于提出El Agente Estructural,一个基于多模态自然语言驱动的几何生成与操作代理,其通过整合领域知识引导的工具集和视觉-语言模型(vision-language models),模拟人类专家对三维分子体系的直接操作,从而实现无需重建复杂核心骨架即可进行精准结构编辑的能力。此方法突破了单纯依赖生成式AI(Generative AI)进行分子设计的局限,显著提升了分子建模的可控性与情境感知能力。

链接: https://arxiv.org/abs/2602.04849
作者: Changhyeok Choi,Yunheng Zou,Marcel Müller,Han Hao,Yeonghun Kang,Juan B. Pérez-Sánchez,Ignacio Gustin,Hanyong Xu,Mohammad Ghazi Vakili,Chris Crebolder,Alán Aspuru-Guzik,Varinia Bernales
机构: 未知
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:We present El Agente Estructural, a multimodal, natural-language-driven geometry-generation and manipulation agent for autonomous chemistry and molecular modelling. Unlike molecular generation or editing via generative models, Estructural mimics how human experts directly manipulate molecular systems in three dimensions by integrating a comprehensive set of domain-informed tools and vision-language models. This design enables precise control over atomic or functional group replacements, atomic connectivity, and stereochemistry without the need to rebuild extensive core molecular frameworks. Through a series of representative case studies, we demonstrate that Estructural enables chemically meaningful geometry manipulation across a wide range of real-world scenarios. These include site-selective functionalization, ligand binding, ligand exchange, stereochemically controlled structure construction, isomer interconversion, fragment-level structural analysis, image-guided generation of structures from schematic reaction mechanisms, and mechanism-driven geometry generation and modification. These examples illustrate how multimodal reasoning, when combined with specialized geometry-aware tools, supports interactive and context-aware molecular modelling beyond structure generation. Looking forward, the integration of Estructural into El Agente Quntur, an autonomous multi-agent quantum chemistry platform, enhances its capabilities by adding sophisticated tools for the generation and editing of three-dimensional structures.
zh

自然语言处理

[NLP-0] Reinforced Attention Learning

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在后训练阶段通过强化学习(Reinforcement Learning, RL)提升推理能力时,因依赖冗长推理过程(verbose rationales)而导致感知任务性能提升有限甚至下降的问题。其解决方案的关键在于提出了一种新的策略梯度框架——强化注意力学习(Reinforced Attention Learning, RAL),该方法直接优化模型内部的注意力分布而非输出token序列,从而将优化目标从“生成什么”转变为“关注哪里”,促进复杂多模态输入中信息的有效分配与更好对齐。实验表明,RAL在多个图像和视频基准上均优于GRPO及其他基线方法,并进一步引入了基于策略的注意力蒸馏(On-Policy Attention Distillation),证明转移隐式注意力行为可实现比传统知识蒸馏更强的跨模态对齐效果。

链接: https://arxiv.org/abs/2602.04884
作者: Bangzheng Li,Jianmo Ni,Chen Qu,Ian Miao,Liu Yang,Xingyu Fu,Muhao Chen,Derek Zhiyuan Cheng
机构: UC Davis(加州大学戴维斯分校); Google DeepMind; Google(谷歌); Princeton University(普林斯顿大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Post-training with Reinforcement Learning (RL) has substantially improved reasoning in Large Language Models (LLMs) via test-time scaling. However, extending this paradigm to Multimodal LLMs (MLLMs) through verbose rationales yields limited gains for perception and can even degrade performance. We propose Reinforced Attention Learning (RAL), a policy-gradient framework that directly optimizes internal attention distributions rather than output token sequences. By shifting optimization from what to generate to where to attend, RAL promotes effective information allocation and improved grounding in complex multimodal inputs. Experiments across diverse image and video benchmarks show consistent gains over GRPO and other baselines. We further introduce On-Policy Attention Distillation, demonstrating that transferring latent attention behaviors yields stronger cross-modal alignment than standard knowledge distillation. Our results position attention policies as a principled and general alternative for multimodal post-training. Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2602.04884 [cs.CL] (or arXiv:2602.04884v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2602.04884 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-1] Rethinking the Trust Region in LLM Reinforcement Learning

【速读】: 该论文旨在解决当前基于强化学习(Reinforcement Learning, RL)微调大语言模型(Large Language Models, LLMs)时,主流算法Proximal Policy Optimization (PPO) 中比例裁剪(ratio clipping)机制在处理LLM大规模词表时存在的结构性缺陷问题。具体而言,PPO依赖于单样本蒙特卡洛估计的概率比来约束策略更新,导致对低概率词的更新过度惩罚,而对高概率词的潜在灾难性变化则约束不足,从而引发训练效率低下和不稳定。其解决方案的关键在于提出Divergence Proximal Policy Optimization (DPPO),用基于直接策略差异估计(如总变差或KL散度)的更严谨约束替代启发式裁剪,并引入高效的二值化(Binary)与Top-K近似方法以控制内存开销,从而实现更稳定、高效的RL微调过程。

链接: https://arxiv.org/abs/2602.04879
作者: Penghui Qi,Xiangxin Zhou,Zichen Liu,Tianyu Pang,Chao Du,Min Lin,Wee Sun Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has become a cornerstone for fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO) serving as the de facto standard algorithm. Despite its ubiquity, we argue that the core ratio clipping mechanism in PPO is structurally ill-suited for the large vocabularies inherent to LLMs. PPO constrains policy updates based on the probability ratio of sampled tokens, which serves as a noisy single-sample Monte Carlo estimate of the true policy divergence. This creates a sub-optimal learning dynamic: updates to low-probability tokens are aggressively over-penalized, while potentially catastrophic shifts in high-probability tokens are under-constrained, leading to training inefficiency and instability. To address this, we propose Divergence Proximal Policy Optimization (DPPO), which substitutes heuristic clipping with a more principled constraint based on a direct estimate of policy divergence (e.g., Total Variation or KL). To avoid huge memory footprint, we introduce the efficient Binary and Top-K approximations to capture the essential divergence with negligible overhead. Extensive empirical evaluations demonstrate that DPPO achieves superior training stability and efficiency compared to existing methods, offering a more robust foundation for RL-based LLM fine-tuning.
zh

[NLP-2] Subliminal Effects in Your Data: A General Mechanism via Log-Linearity

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)训练中数据集对模型行为影响的机制不明确问题,特别是当数据集传递出无法从单个数据点直接观测到的隐含信号时,传统以数据为中心的理解框架面临挑战。解决方案的关键在于提出一种名为Logit-Linear-Selection (LLS) 的方法,该方法通过识别和选择通用偏好数据集中特定子集,能够系统性地诱发多种隐藏效应,如模型产生特定偏好、在未出现的语言中响应提示或采用不同人格特征。实验表明,这些效果在不同架构的模型上均保持稳定,验证了其普遍性和一般性,从而为理解数据驱动行为涌现提供了新的理论基础和可操作工具。

链接: https://arxiv.org/abs/2602.04863
作者: Ishaq Aden-Ali,Noah Golowich,Allen Liu,Abhishek Shetty,Ankur Moitra,Nika Haghtalab
机构: University of California, Berkeley (加州大学伯克利分校); Microsoft Research (微软研究院); Courant Institute, New York University (纽约大学库朗研究所); Massachusetts Institute of Technology (麻省理工学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注: Code available at this https URL

点击查看摘要

Abstract:Training modern large language models (LLMs) has become a veritable smorgasbord of algorithms and datasets designed to elicit particular behaviors, making it critical to develop techniques to understand the effects of datasets on the model’s properties. This is exacerbated by recent experiments that show datasets can transmit signals that are not directly observable from individual datapoints, posing a conceptual challenge for dataset-centric understandings of LLM training and suggesting a missing fundamental account of such phenomena. Towards understanding such effects, inspired by recent work on the linear structure of LLMs, we uncover a general mechanism through which hidden subtexts can arise in generic datasets. We introduce Logit-Linear-Selection (LLS), a method that prescribes how to select subsets of a generic preference dataset to elicit a wide range of hidden effects. We apply LLS to discover subsets of real-world datasets so that models trained on them exhibit behaviors ranging from having specific preferences, to responding to prompts in a different language not present in the dataset, to taking on a different persona. Crucially, the effect persists for the selected subset, across models with varying architectures, supporting its generality and universality. Comments: Code available at this https URL Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML) Cite as: arXiv:2602.04863 [cs.LG] (or arXiv:2602.04863v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.04863 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-3] CoT is Not the Chain of Truth: An Empirical Internal Analysis of Reason ing LLM s for Fake News Generation

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在生成虚假新闻等有害内容时,即便最终输出被拒绝,其内部思维链(Chain-of-Thought, CoT)仍可能隐含并传播不安全叙事的问题。这一现象挑战了“拒绝响应即代表推理过程安全”的默认假设。解决方案的关键在于提出一个统一的安全分析框架,通过雅可比(Jacobian)谱度量系统性地分解模型各层的CoT生成过程,并引入三个可解释指标——稳定性(stability)、几何性(geometry)和能量(energy),用于量化特定注意力头对欺骗性推理模式的响应与嵌入能力。实验表明,在激活思考模式时,生成风险显著上升,且关键决策集中于少数连续的中层位置,从而实现了对潜在推理风险的精准定位与机制解析。

链接: https://arxiv.org/abs/2602.04856
作者: Zhao Tong,Chunlin Gong,Yiping Zhang,Qiang Liu,Xingcheng Xu,Shu Wu,Haichao Shi,Xiao-Yu Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 28 pages, 35 figures

点击查看摘要

Abstract:From generating headlines to fabricating news, the Large Language Models (LLMs) are typically assessed by their final outputs, under the safety assumption that a refusal response signifies safe reasoning throughout the entire process. Challenging this assumption, our study reveals that during fake news generation, even when a model rejects a harmful request, its Chain-of-Thought (CoT) reasoning may still internally contain and propagate unsafe narratives. To analyze this phenomenon, we introduce a unified safety-analysis framework that systematically deconstructs CoT generation across model layers and evaluates the role of individual attention heads through Jacobian-based spectral metrics. Within this framework, we introduce three interpretable measures: stability, geometry, and energy to quantify how specific attention heads respond or embed deceptive reasoning patterns. Extensive experiments on multiple reasoning-oriented LLMs show that the generation risk rise significantly when the thinking mode is activated, where the critical routing decisions concentrated in only a few contiguous mid-depth layers. By precisely identifying the attention heads responsible for this divergence, our work challenges the assumption that refusal implies safety and provides a new understanding perspective for mitigating latent reasoning risks.
zh

[NLP-4] Decomposed Prompting Does Not Fix Knowledge Gaps But Helps Models Say “I Dont Know”

【速读】: 该论文试图解决大语言模型在闭卷问答(closed-book question answering)中难以识别自身知识边界,从而产生自信但错误的幻觉(hallucination)的问题。解决方案的关键在于利用三种任务等价的提示策略(Direct、Assistive 和 Incremental)之间的不一致性作为模型内部不确定性的诊断信号——由于事实性知识具有稳定性而幻觉具有随机性,跨提示策略的一致性可精确指示潜在错误。基于此信号,作者提出一种无需训练、无需检索或微调的“拒答”(abstention)策略,在多个模型规模和多跳问答基准上显著优于标准不确定性基线,有效提升了F1和AUROC指标。

链接: https://arxiv.org/abs/2602.04853
作者: Dhruv Madhwal,Lyuxin David Zhang,Dan Roth,Tomer Wolfson,Vivek Gupta
机构: Arizona State University (亚利桑那州立大学); University of Pennsylvania (宾夕法尼亚大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models often struggle to recognize their knowledge limits in closed-book question answering, leading to confident hallucinations. While decomposed prompting is typically used to improve accuracy, we investigate its impact on reliability. We evaluate three task-equivalent prompting regimes: Direct, Assistive, and Incremental, across different model scales and multi-hop QA benchmarks. We find that although accuracy gains from decomposition diminish in frontier models, disagreements between prompting regimes remain highly indicative of potential errors. Because factual knowledge is stable while hallucinations are stochastic, cross-regime agreement provides a precise signal of internal uncertainty. We leverage this signal to implement a training-free abstention policy that requires no retrieval or fine-tuning. Our results show that disagreement-based abstention outperforms standard uncertainty baselines as an error detector, improving both F1 and AUROC across settings. This demonstrates that decomposition-based prompting can serve as a practical diagnostic probe for model reliability in closed-book QA.
zh

[NLP-5] Horizon-LM: A RAM-Centric Architecture for LLM Training

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)训练中因GPU显存限制导致的模型规模扩展瓶颈问题。传统训练系统依赖多GPU分布式并行和CPU/存储层级的卸载机制,但其仍基于GPU为中心的执行范式,使得模型规模与GPU数量紧密耦合,且内存消耗不可预测,难以在单节点上高效完成指令微调、对齐和领域适配等后训练任务。解决方案的关键在于提出Horizon-LM,一个以主机内存(host memory)为核心的训练系统,通过“CPU主控、GPU模板”执行模型重构了计算资源角色:将参数存储权威交由主机内存,GPU仅作为瞬态计算引擎;同时采用显式重计算结合手动梯度传播机制,并引入流水线双缓冲执行引擎,从而消除GPU持久驻留模块和完整自动求导图(autograd graph),实现模型规模与GPU数量解耦,内存使用严格控制在理论参数占用范围内,显著提升单机训练效率与可预测性。

链接: https://arxiv.org/abs/2602.04816
作者: Zhengqing Yuan,Lichao Sun,Yanfang(Fanny)Ye
机构: University of Notre Dame (圣母大学); Lehigh University (利哈伊大学)
类目: Operating Systems (cs.OS); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:The rapid growth of large language models (LLMs) has outpaced the evolution of single-GPU hardware, making model scale increasingly constrained by memory capacity rather than computation. While modern training systems extend GPU memory through distributed parallelism and offloading across CPU and storage tiers, they fundamentally retain a GPU-centric execution paradigm in which GPUs host persistent model replicas and full autograd graphs. As a result, scaling large models remains tightly coupled to multi-GPU clusters, complex distributed runtimes, and unpredictable host memory consumption, creating substantial barriers for node-scale post-training workloads such as instruction tuning, alignment, and domain adaptation. We present Horizon-LM, a memory-centric training system that redefines the roles of CPU and GPU for large-model optimization. Horizon-LM treats host memory as the authoritative parameter store and uses GPUs solely as transient compute engines through a CPU-master, GPU-template execution model. By eliminating persistent GPU-resident modules and autograd graphs, employing explicit recomputation with manual gradient propagation, and introducing a pipelined double-buffered execution engine, Horizon-LM decouples model scale from GPU count and bounds memory usage to the theoretical parameter footprint. On a single H200 GPU with 1.5,TB host RAM, Horizon-LM reliably trains models up to 120B parameters. On a standard single A100 machine, Horizon-LM achieves up to 12.2 \times higher training throughput than DeepSpeed ZeRO-3 with CPU offloading while preserving numerical correctness. Across platforms and scales, Horizon-LM sustains high device utilization and predictable memory growth, demonstrating that host memory, not GPU memory, defines the true feasibility boundary for node-scale large-model training.
zh

[NLP-6] SE-Bench: Benchmarking Self-Evolution with Knowledge Internalization

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在长期学习过程中难以实现真正自我演化的难题,即如何让智能体将新经验内化为自身知识以应对未来任务。其核心挑战在于两个障碍:一是先验知识的纠缠(prior knowledge entanglement),即新知识可能已存在于预训练数据中;二是推理复杂度的纠缠(reasoning complexity entanglement),导致失败原因难以区分是知识遗忘还是问题难度所致。解决方案的关键在于提出 SE-Bench 这一诊断环境,通过混淆 NumPy 库及其 API 文档为伪新包并随机命名标识符,迫使模型在无文档支持下完成简单编码任务,从而构建一个纯净的测试场景——任务对具备内部化知识的模型而言是可解的,而对基础模型则不可解。该方法揭示了“闭卷训练”(Closed-Book Training)对知识压缩的重要性、强化学习(RL)在知识内化上的局限性,并验证了自对弈(Self-Play)结合监督微调(SFT)在促进知识内化方面的可行性,从而为评估和提升生成式 AI 的自我演化能力提供了严谨的基准平台。

链接: https://arxiv.org/abs/2602.04811
作者: Jiarui Yuan,Tailin Jin,Weize Chen,Zeyuan Liu,Zhiyuan Liu,Maosong Sun
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under review

点击查看摘要

Abstract:True self-evolution requires agents to act as lifelong learners that internalize novel experiences to solve future problems. However, rigorously measuring this foundational capability is hindered by two obstacles: the entanglement of prior knowledge, where ``new’’ knowledge may appear in pre-training data, and the entanglement of reasoning complexity, where failures may stem from problem difficulty rather than an inability to recall learned knowledge. We introduce SE-Bench, a diagnostic environment that obfuscates the NumPy library and its API doc into a pseudo-novel package with randomized identifiers. Agents are trained to internalize this package and evaluated on simple coding tasks without access to documentation, yielding a clean setting where tasks are trivial with the new API doc but impossible for base models without it. Our investigation reveals three insights: (1) the Open-Book Paradox, where training with reference documentation inhibits retention, requiring “Closed-Book Training” to force knowledge compression into weights; (2) the RL Gap, where standard RL fails to internalize new knowledge completely due to PPO clipping and negative gradients; and (3) the viability of Self-Play for internalization, proving models can learn from self-generated, noisy tasks when coupled with SFT, but not RL. Overall, SE-Bench establishes a rigorous diagnostic platform for self-evolution with knowledge internalization. Our code and dataset can be found at this https URL.
zh

[NLP-7] OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models

【速读】: 该论文旨在解决Omni-modal Large Language Models (Omni-LLMs)在音频-视频理解任务中因依赖长模态令牌序列而导致的高计算开销问题,当前针对Omni-LLMs的令牌压缩方法仍较为有限。解决方案的关键在于提出一种模态不对称的令牌压缩框架OmniSIFT(Omni-modal Spatio-temporal Informed Fine-grained Token compression),其核心包含两个阶段:一是时空视频剪枝模块,用于消除帧内结构冗余和帧间重叠带来的冗余;二是视觉引导的音频选择模块,用于过滤冗余音频令牌。整个框架通过可微分的straight-through估计器进行端到端优化,实验证明其在仅使用原始令牌上下文25%的情况下,性能优于所有压缩基线,并在多个任务上超越全令牌模型。

链接: https://arxiv.org/abs/2602.04804
作者: Yue Ding,Yiyan Ji,Jungang Li,Xuyang Liu,Xinlong Chen,Junfei Wu,Bozhou Li,Bohan Zeng,Yang Shi,Yushuo Guan,Yuanxing Zhang,Jiaheng Liu,Qiang Liu,Pengfei Wan,Liang Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注: Code will be released soon

点击查看摘要

Abstract:Omni-modal Large Language Models (Omni-LLMs) have demonstrated strong capabilities in audio-video understanding tasks. However, their reliance on long multimodal token sequences leads to substantial computational overhead. Despite this challenge, token compression methods designed for Omni-LLMs remain limited. To bridge this gap, we propose OmniSIFT (Omni-modal Spatio-temporal Informed Fine-grained Token compression), a modality-asymmetric token compression framework tailored for Omni-LLMs. Specifically, OmniSIFT adopts a two-stage compression strategy: (i) a spatio-temporal video pruning module that removes video redundancy arising from both intra-frame structure and inter-frame overlap, and (ii) a vision-guided audio selection module that filters audio tokens. The entire framework is optimized end-to-end via a differentiable straight-through estimator. Extensive experiments on five representative benchmarks demonstrate the efficacy and robustness of OmniSIFT. Notably, for Qwen2.5-Omni-7B, OmniSIFT introduces only 4.85M parameters while maintaining lower latency than training-free baselines such as OmniZip. With merely 25% of the original token context, OmniSIFT consistently outperforms all compression baselines and even surpasses the performance of the full-token model on several tasks.
zh

[NLP-8] Speaker-Aware Simulation Improves Conversational Speech Recognition

【速读】: 该论文旨在解决低资源语言(如匈牙利语)在对话式自动语音识别(ASR)中因缺乏大规模、高质量的多说话人对话数据而面临的性能瓶颈问题。其核心挑战在于如何有效增强训练数据,同时保留自然对话中的复杂时间动态特性。解决方案的关键在于提出并实现一种改进的说话人感知模拟对话(Speaker-aware Simulated Conversations, SASC)框架,并进一步引入C-SASC变体——通过基于话语时长的停顿建模来更精确地捕捉人类对话中的局部时间依赖关系。该方法在BEA-Large语料库上生成合成匈牙利语对话,并与真实对话数据结合用于ASR训练,实验证明其显著优于简单的拼接式数据增强策略,且C-SASC在字符级错误率上表现出系统性提升,但效果依赖于源对话统计特征与目标域的一致性。

链接: https://arxiv.org/abs/2602.04776
作者: Máté Gedeon,Péter Mihajlik
机构: BME (匈牙利布达佩斯技术与经济大学)
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Automatic speech recognition (ASR) for conversational speech remains challenging due to the limited availability of large-scale, well-annotated multi-speaker dialogue data and the complex temporal dynamics of natural interactions. Speaker-aware simulated conversations (SASC) offer an effective data augmentation strategy by transforming single-speaker recordings into realistic multi-speaker dialogues. However, prior work has primarily focused on English data, leaving questions about the applicability to lower-resource languages. In this paper, we adapt and implement the SASC framework for Hungarian conversational ASR. We further propose C-SASC, an extended variant that incorporates pause modeling conditioned on utterance duration, enabling a more faithful representation of local temporal dependencies observed in human conversation while retaining the simplicity and efficiency of the original approach. We generate synthetic Hungarian dialogues from the BEA-Large corpus and combine them with real conversational data for ASR training. Both SASC and C-SASC are evaluated extensively under a wide range of simulation configurations, using conversational statistics derived from CallHome, BEA-Dialogue, and GRASS corpora. Experimental results show that speaker-aware conversational simulation consistently improves recognition performance over naive concatenation-based augmentation. While the additional duration conditioning in C-SASC yields modest but systematic gains–most notably in character-level error rates–its effectiveness depends on the match between source conversational statistics and the target domain. Overall, our findings confirm the robustness of speaker-aware conversational simulation for Hungarian ASR and highlight the benefits and limitations of increasingly detailed temporal modeling in synthetic dialogue generation.
zh

[NLP-9] Beyond Many-Shot Translation: Scaling In-Context Demonstrations For Low-Resource Machine Translation EACL2026

【速读】: 该论文旨在解决低资源语言机器翻译(Machine Translation, MT)系统构建中因高质量数据稀缺而导致的性能瓶颈问题。传统方法难以有效适配大型语言模型(Large Language Models, LLMs)至低资源语言,而本文提出通过长上下文提示学习(in-context learning, ICL)来增强LLMs在低资源场景下的适应能力。其解决方案的关键在于:利用长达100万token的上下文窗口,在推理时以不同类型的语料作为示例进行条件化监督,包括单语无监督数据、指令式数据和双语平行数据(英语-目标语言及印尼语-目标语言),从而探索大规模上下文对低资源MT效果的影响及其与语料类型的高度敏感性。实验表明,额外上下文带来的性能提升迅速饱和甚至在接近最大上下文窗口时出现退化,且不同语料类型的效果差异显著,其中某些单语监督形式可媲美平行数据,揭示了长上下文ICL在低资源MT中的有效边界与优化潜力。

链接: https://arxiv.org/abs/2602.04764
作者: Luis Frentzen Salim,Esteban Carlin,Alexandre Morinvil,Xi Ai,Lun-Wei Ku
机构: Institute of Information Science, Academia Sinica (中央研究院资讯科学研究所); National Taiwan University of Science and Technology (台湾科技大学); Ecole Centrale de Marseille (马赛中央理工学院); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL)
备注: 8 pages, 18 figures, EACL 2026 Conference - LoResMT workshop

点击查看摘要

Abstract:Building machine translation (MT) systems for low-resource languages is notably difficult due to the scarcity of high-quality data. Although Large Language Models (LLMs) have improved MT system performance, adapting them to lesser-represented languages remains challenging. In-context learning (ICL) may offer novel ways to adapt LLMs for low-resource MT by conditioning models on demonstration at inference time. In this study, we explore scaling low-resource machine translation ICL beyond the few-shot setting to thousands of examples with long-context models. We scale in-context token budget to 1M tokens and compare three types of training corpora used as in-context supervision: monolingual unsupervised data, instruction-style data, and parallel data (English–target and Indonesian–target). Our experiments on Javanese and Sundanese show that gains from additional context saturate quickly and can degrade near the maximum context window, with scaling behavior strongly dependent on corpus type. Notably, some forms of monolingual supervision can be competitive with parallel data, despite the latter offering additional supervision. Overall, our results characterize the effective limits and corpus-type sensitivity of long-context ICL for low-resource MT, highlighting that larger context windows do not necessarily yield proportional quality gains.
zh

[NLP-10] When Silence Is Golden: Can LLM s Learn to Abstain in Temporal QA and Beyond? ICLR2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在时间感知问答(Temporal Question Answering)任务中缺乏不确定性识别能力的问题,即模型常生成看似流畅但错误的答案,而非选择拒绝回答(abstention),导致事实混淆和可靠性下降。解决方案的关键在于将“拒答”视为可教学的技能,并提出一种结合思维链(Chain-of-Thought, CoT)监督与基于拒答感知奖励的强化学习(Reinforcement Learning, RL)的训练流水线,从而系统性地优化模型在复杂推理中的拒答行为与准确性。实验表明,该方法显著提升了模型在TimeQA-Easy和Hard数据集上的精确匹配(Exact Match)指标,并大幅改善了对无法回答问题的真阳性率(True Positive Rate)。

链接: https://arxiv.org/abs/2602.04755
作者: Xinyu Zhou,Chang Jin,Carsten Eickhoff,Zhijiang Guo,Seyed Ali Bahrainian
机构: HKUST (GZ) (香港科技大学(广州)); Tongji University (同济大学); University of Tübingen (图宾根大学); HKUST (香港科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ICLR2026

点击查看摘要

Abstract:Large language models (LLMs) rarely admit uncertainty, often producing fluent but misleading answers, rather than abstaining (i.e., refusing to answer). This weakness is even evident in temporal question answering, where models frequently ignore time-sensitive evidence and conflate facts across different time-periods. In this paper, we present the first empirical study of training LLMs with an abstention ability while reasoning about temporal QA. Existing approaches such as calibration might be unreliable in capturing uncertainty in complex reasoning. We instead frame abstention as a teachable skill and introduce a pipeline that couples Chain-of-Thought (CoT) supervision with Reinforcement Learning (RL) guided by abstention-aware rewards. Our goal is to systematically analyze how different information types and training techniques affect temporal reasoning with abstention behavior in LLMs. Through extensive experiments studying various methods, we find that RL yields strong empirical gains on reasoning: a model initialized by Qwen2.5-1.5B-Instruct surpasses GPT-4o by 3.46% and 5.80% in Exact Match on TimeQA-Easy and Hard, respectively. Moreover, it improves the True Positive rate on unanswerable questions by 20% over a pure supervised fine-tuned (SFT) variant. Beyond performance, our analysis shows that SFT induces overconfidence and harms reliability, while RL improves prediction accuracy but exhibits similar risks. Finally, by comparing implicit reasoning cues (e.g., original context, temporal sub-context, knowledge graphs) with explicit CoT supervision, we find that implicit information provides limited benefit for reasoning with abstention. Our study provides new insights into how abstention and reasoning can be jointly optimized, providing a foundation for building more reliable LLMs.
zh

[NLP-11] Exploiting contextual information to improve stance detection in informal political discourse with LLM s

【速读】: 该论文旨在解决在非正式在线话语中进行政治立场检测(Political Stance Detection)的难题,此类语境下语言常具有讽刺性、模糊性和高度依赖上下文的特点。传统方法难以准确捕捉此类复杂语义,因此研究者提出通过引入用户层面的上下文信息来提升模型性能。解决方案的关键在于构建结构化的用户档案(user profile summaries),该档案基于历史发言内容提取用户的意识形态倾向、高频话题和语言模式,并将其作为提示(prompt)注入大型语言模型(Large Language Models, LLMs)的输入中,从而增强模型对政治立场判断的准确性。实证结果表明,这种基于用户级上下文的提示策略可使分类准确率提升17.5%至38.5%,最高达74%,显著优于无上下文的基线方法。

链接: https://arxiv.org/abs/2602.04750
作者: Arman Engin Sucu,Yixiang Zhou,Mario A. Nascimento,Tony Mullen
机构: Khoury College of Computer Sciences (计算机科学学院); Northeastern University (东北大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 7 figures

点击查看摘要

Abstract:This study investigates the use of Large Language Models (LLMs) for political stance detection in informal online discourse, where language is often sarcastic, ambiguous, and context-dependent. We explore whether providing contextual information, specifically user profile summaries derived from historical posts, can improve classification accuracy. Using a real-world political forum dataset, we generate structured profiles that summarize users’ ideological leaning, recurring topics, and linguistic patterns. We evaluate seven state-of-the-art LLMs across baseline and context-enriched setups through a comprehensive cross-model evaluation. Our findings show that contextual prompts significantly boost accuracy, with improvements ranging from +17.5% to +38.5%, achieving up to 74% accuracy that surpasses previous approaches. We also analyze how profile size and post selection strategies affect performance, showing that strategically chosen political content yields better results than larger, randomly selected contexts. These findings underscore the value of incorporating user-level context to enhance LLM performance in nuanced political classification tasks.
zh

[NLP-12] Inference-Time Reason ing Selectively Reduces Implicit Social Bias in Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中隐式偏见(implicit bias)的测量与调控问题,尤其关注推理能力启用后对隐式偏见的影响。研究表明,尽管模型在后训练阶段通过对齐和安全机制减少了显式社会偏见(explicit social bias),其在类IAT(Implicit Association Test)任务上仍表现出显著的隐式偏见;而本文发现,在推理模式下,部分模型类别在十五个刻板印象主题上的隐式偏见显著降低,且该效应具有社会偏见特异性,非社会性隐式关联未出现类似变化。解决方案的关键在于引入推理能力作为干预手段,揭示了推理机制可作为调节隐式偏见的新路径,并强调认知科学理论对AI公平性评估的启发价值。

链接: https://arxiv.org/abs/2602.04742
作者: Molly Apsel,Michael N. Jones
机构: Indiana University (印第安纳大学)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Drawing on constructs from psychology, prior work has identified a distinction between explicit and implicit bias in large language models (LLMs). While many LLMs undergo post-training alignment and safety procedures to avoid expressions of explicit social bias, they still exhibit significant implicit biases on indirect tasks resembling the Implicit Association Test (IAT). Recent work has further shown that inference-time reasoning can impair LLM performance on tasks that rely on implicit statistical learning. Motivated by a theoretical link between implicit associations and statistical learning in human cognition, we examine how reasoning-enabled inference affects implicit bias in LLMs. We find that enabling reasoning significantly reduces measured implicit bias on an IAT-style evaluation for some model classes across fifteen stereotype topics. This effect appears specific to social bias domains, as we observe no corresponding reduction for non-social implicit associations. As reasoning is increasingly enabled by default in deployed LLMs, these findings suggest that it can meaningfully alter fairness evaluation outcomes in some systems, while also raising questions about how alignment procedures interact with inference-time reasoning to drive variation in bias reduction across model types. More broadly, this work highlights how theory from cognitive science and psychology can complement AI evaluation research by providing methodological and interpretive frameworks that reveal new insights into model behavior.
zh

[NLP-13] Alignment Drift in Multimodal LLM s: A Two-Phase Longitudinal Evaluation of Harm Across Eight Model Releases

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在对抗性提示(adversarial prompting)下的安全性问题,特别是其危害性行为的稳定性与可衡量性。解决方案的关键在于构建并执行一个两阶段评估框架:首先使用由26名专业红队人员编写的726个对抗性提示对四款主流MLLM进行测试(GPT-4o、Claude Sonnet 3.5、Pixtral 12B、Qwen VL Plus),随后在第二阶段评估这些模型的后续版本(GPT-5、Claude Sonnet 4.5、Pixtral Large、Qwen Omni),共收集82,256条人工危害评分。该设计揭示了不同模型家族在安全表现上的显著差异和随迭代变化的对齐漂移现象,表明MLLM的危害性并非静态或一致,强调需建立纵向、多模态基准以持续追踪模型安全行为的演化趋势。

链接: https://arxiv.org/abs/2602.04739
作者: Casey Ford,Madison Van Doren,Emily Dix
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: under peer-review

点击查看摘要

Abstract:Multimodal large language models (MLLMs) are increasingly deployed in real-world systems, yet their safety under adversarial prompting remains underexplored. We present a two-phase evaluation of MLLM harmlessness using a fixed benchmark of 726 adversarial prompts authored by 26 professional red teamers. Phase 1 assessed GPT-4o, Claude Sonnet 3.5, Pixtral 12B, and Qwen VL Plus; Phase 2 evaluated their successors (GPT-5, Claude Sonnet 4.5, Pixtral Large, and Qwen Omni) yielding 82,256 human harm ratings. Large, persistent differences emerged across model families: Pixtral models were consistently the most vulnerable, whereas Claude models appeared safest due to high refusal rates. Attack success rates (ASR) showed clear alignment drift: GPT and Claude models exhibited increased ASR across generations, while Pixtral and Qwen showed modest decreases. Modality effects also shifted over time: text-only prompts were more effective in Phase 1, whereas Phase 2 produced model-specific patterns, with GPT-5 and Claude 4.5 showing near-equivalent vulnerability across modalities. These findings demonstrate that MLLM harmlessness is neither uniform nor stable across updates, underscoring the need for longitudinal, multimodal benchmarks to track evolving safety behaviour.
zh

[NLP-14] From Data to Behavior: Predicting Unintended Model Behaviors Before Training

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在训练过程中可能无意中吸收来自看似无害的数据中的偏见问题,而现有方法难以在微调前有效检测此类风险,导致事后评估成本高且效率低。其解决方案的关键在于提出一种名为Data2Behavior的新任务和轻量级方法Manipulating Data Features (MDF),该方法通过提取候选数据的均值表示并注入到基础模型的前向传播中,利用数据中的潜在统计信号影响模型激活状态,从而在不更新任何参数的情况下揭示潜在偏见与安全风险,实现对模型行为的提前预测,且仅需约微调所需GPU资源的20%。

链接: https://arxiv.org/abs/2602.04735
作者: Mengru Wang,Zhenqian Xu,Junfeng Fang,Yunzhi Yao,Shumin Deng,Huajun Chen,Ningyu Zhang
机构: Zhejiang University (浙江大学); National University of Singapore (新加坡国立大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR)
备注: Work in progress

点击查看摘要

Abstract:Large Language Models (LLMs) can acquire unintended biases from seemingly benign training data even without explicit cues or malicious content. Existing methods struggle to detect such risks before fine-tuning, making post hoc evaluation costly and inefficient. To address this challenge, we introduce Data2Behavior, a new task for predicting unintended model behaviors prior to training. We also propose Manipulating Data Features (MDF), a lightweight approach that summarizes candidate data through their mean representations and injects them into the forward pass of a base model, allowing latent statistical signals in the data to shape model activations and reveal potential biases and safety risks without updating any parameters. MDF achieves reliable prediction while consuming only about 20% of the GPU resources required for fine-tuning. Experiments on Qwen3-14B, Qwen2.5-32B-Instruct, and Gemma-3-12b-it confirm that MDF can anticipate unintended behaviors and provide insight into pre-training vulnerabilities.
zh

[NLP-15] Less Finetuning Better Retrieval: Rethinking LLM Adaptation for Biomedical Retrievers via Synthetic Data and Model Merging

【速读】: 该论文旨在解决如何将通用大语言模型(Large Language Models, LLMs)高效适配为特定领域(如生物医学)检索器的问题,以提升检索增强生成(Retrieval-Augmented Generation, RAG)系统在专业场景下的性能与鲁棒性。其解决方案的关键在于提出一个模块化框架Synthesize-Train-Merge (STM),通过合成硬负样本(synthetic hard negatives)、检索提示优化(retrieval prompt optimization)以及模型合并(model merging)三个核心步骤,在不依赖大规模预训练的前提下显著提升领域专用检索能力,同时保留模型的通用知识表征能力。

链接: https://arxiv.org/abs/2602.04731
作者: Sameh Khattab,Jean-Philippe Corbeil,Osman Alperen Koraş,Amin Dada,Julian Friedrich,François Beaulieu,Paul Vozila,Jens Kleesiek
机构: IKIM, University Hospital Essen, Germany; Microsoft Healthcare & Life Sciences; Cancer Research Center Cologne Essen (CCCE); German Cancer Consortium (DKTK, Partner site Essen); Department of Physics of TU Dortmund (Dortmund, Germany)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Preprint

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) has become the backbone of grounding Large Language Models (LLMs), improving knowledge updates and reducing hallucinations. Recently, LLM-based retriever models have shown state-of-the-art performance for RAG applications. However, several technical aspects remain underexplored on how to adapt general-purpose LLMs into effective domain-specific retrievers, especially in specialized domains such as biomedicine. We present Synthesize-Train-Merge (STM), a modular framework that enhances decoder-only LLMs with synthetic hard negatives, retrieval prompt optimization, and model merging. Experiments on a subset of 12 medical and general tasks from the MTEB benchmark show STM boosts task-specific experts by up to 23.5% (average 7.5%) and produces merged models that outperform both single experts and strong baselines without extensive pretraining. Our results demonstrate a scalable, efficient path for turning general LLMs into high-performing, domain-specialized retrievers, preserving general-domain capabilities while excelling on specialized tasks.
zh

[NLP-16] “Be My Cheese?”: Cultural Nuance Benchmarking for Machine Translation in Multilingual LLM s

【速读】: 该论文旨在解决当前机器翻译(Machine Translation, MT)评估中对文化本地化(cultural localisation)能力关注不足的问题。现有基准主要聚焦于词元级准确性和语法正确性,但忽视了实际应用场景中所需的语用和文化相关性。其解决方案的关键在于构建首个大规模、多语言、由母语者标注的人工评估基准,系统性地评估7种多语言大语言模型(Multilingual Large Language Models, LLMs)在15种目标语言中的文化细微表达能力,涵盖习语、双关语、节日及文化嵌入概念等四类典型文化元素,并采用有序评分(0–3)量化文化适配度。结果揭示出模型在语法层面表现尚可,但在文化共鸣方面存在显著短板,尤其体现在习语和双关语的翻译失败率较高,从而凸显了改进跨语言语用建模与引入文化敏感训练数据的必要性。

链接: https://arxiv.org/abs/2602.04729
作者: Madison Van Doren,Casey Ford,Jennifer Barajas,Cory Holland
机构: 未知
类目: Computation and Language (cs.CL)
备注: under peer-review

点击查看摘要

Abstract:We present a large-scale human evaluation benchmark for assessing cultural localisation in machine translation produced by state-of-the-art multilingual large language models (LLMs). Existing MT benchmarks emphasise token-level and grammatical accuracy, but of ten overlook pragmatic and culturally grounded competencies required for real-world localisation. Building on a pilot study of 87 translations across 20 languages, we evaluate 7 multilingual LLMs across 15 target languages with 5 native-speaker raters per language. Raters scored both full-text translations and segment-level instances of culturally nuanced language (idioms, puns, holidays, and culturally embedded concepts) on an ordinal 0-3 quality scale; segment ratings additionally included an NA option for untranslated segments. Across full-text evaluations, mean overall quality is modest (1.68/3): GPT-5 (2.10/3), Claude Sonnet 3.7 (1.97/3), and Mistral Medium 3.1 (1.84/3) form the strongest tier with fewer catastrophic failures. Segment-level results show sharp category effects: holidays (2.20/3) and cultural concepts (2.19/3) translate substantially better than idioms (1.65/3) and puns (1.45/3), and idioms are most likely to be left untranslated. These findings demonstrate a persistent gap between grammatical adequacy and cultural resonance. To our knowledge, this is the first multilingual, human-annotated benchmark focused explicitly on cultural nuance in translation and localisation, highlighting the need for culturally informed training data, improved cross-lingual pragmatics, and evaluation paradigms that better reflect real-world communicative competence. Comments: under peer-review Subjects: Computation and Language (cs.CL) Cite as: arXiv:2602.04729 [cs.CL] (or arXiv:2602.04729v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2602.04729 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-17] Identifying Intervenable and Interpretable Features via Orthogonality Regularization

【速读】: 该论文旨在解决语言模型中特征表示存在干扰与叠加(superposition)的问题,这限制了模型的可解释性和可控性。解决方案的关键在于引入正交性惩罚(orthogonality penalty),对固定稀疏自动编码器(sparse autoencoder)训练过程中解码矩阵进行优化,从而将特征分解为近乎正交的形式。这一方法显著降低了特征间的干扰,同时保持目标数据集上的性能不变,并使特征具有可识别性和唯一性,进而支持基于因果机制(Independent Causal Mechanisms)原则的模块化表示和隔离干预(isolated interventions)。

链接: https://arxiv.org/abs/2602.04718
作者: Moritz Miller,Florent Draye,Bernhard Schölkopf
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With recent progress on fine-tuning language models around a fixed sparse autoencoder, we disentangle the decoder matrix into almost orthogonal features. This reduces interference and superposition between the features, while keeping performance on the target dataset essentially unchanged. Our orthogonality penalty leads to identifiable features, ensuring the uniqueness of the decomposition. Further, we find that the distance between embedded feature explanations increases with stricter orthogonality penalty, a desirable property for interpretability. Invoking the \textitIndependent Causal Mechanisms principle, we argue that orthogonality promotes modular representations amenable to causal intervention. We empirically show that these increasingly orthogonalized features allow for isolated interventions. Our code is available under \textttthis https URL .
zh

[NLP-18] Linguistically Informed Evaluation of Multilingual ASR for African Languages

【速读】: 该论文旨在解决传统词错误率(Word Error Rate, WER)在评估非洲语言自动语音识别(ASR)模型性能时存在的局限性,即WER将音位、声调及其他语言学错误合并为单一的词汇级错误,掩盖了模型在具体语音特征层面的真实表现。解决方案的关键在于引入更细粒度的误差度量方法:字符错误率(Character Error Rate, CER)和特征错误率(Feature Error Rate, FER),并进一步提出声调感知扩展指标(Tone-aware Error Rate, TER)。通过在两种非洲语言上对比这些指标,研究发现FER和TER能够揭示即使在低词级准确率下仍具语言学意义的错误模式,尤其凸显出模型在音段特征上的较好表现与对声调(特别是中调和降阶调)的显著挑战,从而为ASR模型的诊断与优化提供了更具解释性的依据。

链接: https://arxiv.org/abs/2602.04716
作者: Fei-Yueh Chen,Lateef Adeleke,C.M. Downey
机构: 未知
类目: Computation and Language (cs.CL)
备注: To appear at AfricaNLP 2026

点击查看摘要

Abstract:Word Error Rate (WER) mischaracterizes ASR models’ performance for African languages by combining phonological, tone, and other linguistic errors into a single lexical error. By contrast, Feature Error Rate (FER) has recently attracted attention as a viable metric that reveals linguistically meaningful errors in models’ performance. In this paper, we evaluate three speech encoders on two African languages by complementing WER with CER, and FER, and add a tone-aware extension (TER). We show that by computing errors on phonological features, FER and TER reveal linguistically-salient error patterns even when word-level accuracy remains low. Our results reveal that models perform better on segmental features, while tones (especially mid and downstep) remain the most challenging features. Results on Yoruba show a striking differential in metrics, with WER=0.788, CER=0.305, and FER=0.151. Similarly for Uneme (an endangered language absent from pretraining data) a model with near-total WER and 0.461 CER achieves the relatively low FER of 0.267. This indicates model error is often attributable to individual phonetic feature errors, which is obscured by all-or-nothing metrics like WER.
zh

[NLP-19] LiteToken: Removing Intermediate Merge Residues From BPE Tokenizers

【速读】: 该论文旨在解决基于字节对编码(Byte Pair Encoding, BPE)的分词器中存在的“中间合并残留物”(intermediate merge residues)问题,即某些在合并学习过程中频繁出现但最终在实际文本分词时极少被使用的低频token,这些token不仅浪费词汇表容量,还可能增加模型对对抗性或异常输入的脆弱性。解决方案的关键在于提出LiteToken方法,通过系统性地识别并移除这些残留token,在不需重新微调预训练模型的前提下,有效减少token碎片化、降低参数量,并提升对噪声或拼写错误输入的鲁棒性,同时保持整体性能不变。

链接: https://arxiv.org/abs/2602.04706
作者: Yike Sun,Haotong Yang,Zhouchen Lin,Muhan Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Tokenization is fundamental to how language models represent and process text, yet the behavior of widely used BPE tokenizers has received far less study than model architectures and training. In this paper, we investigate intermediate merge residues in BPE vocabularies: tokens that are frequent during merge learning so that retained in the final vocabulary, but are mostly further merged and rarely emitted when tokenizing the corpus during tokenizer usage. Such low-frequency tokens not only waste vocabulary capacity but also increase vulnerability to adversarial or atypical inputs. We present a systematic empirical characterization of this phenomenon across commonly used tokenizers and introduce LiteToken, a simple method for removing residue tokens. Because the affected tokens are rarely used, pretrained models can often accommodate the modified tokenizer without additional fine-tuning. Experiments show that LiteToken reduces token fragmentation, reduces parameters, and improves robustness to noisy or misspelled inputs, while preserving overall performance.
zh

[NLP-20] ERNIE 5.0 Technical Report

【速读】: 该论文旨在解决统一多模态理解与生成模型在大规模部署时面临的资源约束问题,以及如何高效、稳定地扩展强化学习以适配超稀疏混合专家(MoE)架构下的多模态基础模型。其核心解决方案在于提出一种原生自回归的统一多模态基础模型ERNIE 5.0,采用模态无关的专家路由机制,在单一预训练过程中通过弹性训练范式学习一系列子模型,这些子模型在深度、专家容量和路由稀疏性上具有可调性,从而在性能、模型规模和推理延迟之间实现灵活权衡;同时,系统性地解决了强化学习在统一多模态基础模型中的扩展挑战,确保在超稀疏MoE架构下后训练阶段的高效性和稳定性。

链接: https://arxiv.org/abs/2602.04705
作者: Haifeng Wang,Hua Wu,Tian Wu,Yu Sun,Jing Liu,Dianhai Yu,Yanjun Ma,Jingzhou He,Zhongjun He,Dou Hong,Qiwen Liu,Shuohuan Wang,Junyuan Shang,Zhenyu Zhang,Yuchen Ding,Jinle Zeng,Jiabin Yang,Liang Shen,Ruibiao Chen,Weichong Yin,Siyu Ding,Dai Dai,Shikun Feng,Siqi Bao,Bolei He,Yan Chen,Zhenyu Jiao,Ruiqing Zhang,Zeyu Chen,Qingqing Dang,Kaipeng Deng,Jiajun Jiang,Enlei Gong,Guoxia Wang,Yanlin Sha,Yi Liu,Yehan Zheng,Weijian Xu,Jiaxiang Liu,Zengfeng Zeng,Yingqi Qu,Zhongli Li,Zhengkun Zhang,Xiyang Wang,Zixiang Xu,Xinchao Xu,Zhengjie Huang,Dong Wang,Bingjin Chen,Yue Chang,Xing Yuan,Shiwei Huang,Qiao Zhao,Xinzhe Ding,Shuangshuang Qiao,Baoshan Yang,Bihong Tang,Bin Li,Bingquan Wang,Binhan Tang,Binxiong Zheng,Bo Cui,Bo Ke,Bo Zhang,Bowen Zhang,Boyan Zhang,Boyang Liu,Caiji Zhang,Can Li,Chang Xu,Chao Pang,Chao Zhang,Chaoyi Yuan,Chen Chen,Cheng Cui,Chenlin Yin,Chun Gan,Chunguang Chai,Chuyu Fang,Cuiyun Han,Dan Zhang,Danlei Feng,Danxiang Zhu,Dong Sun,Dongbo Li,Dongdong Li,Dongdong Liu,Dongxue Liu,Fan Ding,Fan Hu,Fan Li,Fan Mo,Feisheng Wu,Fengwei Liu,Gangqiang Hu,Gaofeng Lu,Gaopeng Yong,Gexiao Tian,Guan Wang,Guangchen Ni
机构: Baidu(百度)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model desinged for unified multimodal understanding and generation across text, image, video, and audio. All modalities are trained from scratch under a unified next-group-of-tokens prediction objective, based on an ultra-sparse mixture-of-experts (MoE) architecture with modality-agnostic expert routing. To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm. Within a single pre-training run, the model learns a family of sub-models with varying depths, expert capacities, and routing sparsity, enabling flexible trade-offs among performance, model size, and inference latency in memory- or time-constrained scenarios. Moreover, we systematically address the challenges of scaling reinforcement learning to unified foundation models, thereby guaranteeing efficient and stable post-training under ultra-sparse MoE architectures and diverse multimodal settings. Extensive experiments demonstrate that ERNIE 5.0 achieves strong and balanced performance across multiple modalities. To the best of our knowledge, among publicly disclosed models, ERNIE 5.0 represents the first production-scale realization of a trillion-parameter unified autoregressive model that supports both multimodal understanding and generation. To facilitate further research, we present detailed visualizations of modality-agnostic expert routing in the unified model, alongside comprehensive empirical analysis of elastic training, aiming to offer profound insights to the community.
zh

[NLP-21] LinGO: A Linguistic Graph Optimization Framework with LLM s for Interpreting Intents of Online Uncivil Discourse

【速读】: 该论文旨在解决现有分类模型在识别网络不文明语言(uncivil language)时容易误判的问题,即模型常将包含不文明线索但表达文明意图的文本错误归类为有害内容,从而高估线上不文明行为的实际比例。解决方案的关键在于提出LinGO(linguistic graph optimization)框架,该框架通过分解语言为多步骤的语义结构组件,识别导致错误的关键步骤,并迭代优化提示(prompt)和/或示例(example)组件,以提升大语言模型(LLM)对政治不文明意图的多类别分类准确性。其核心创新在于将语言的多层次语义结构嵌入到指令设计中,并结合检索增强生成(Retrieval-Augmented Generation, RAG)等优化技术,显著提升了模型对复杂语义含义的理解能力。

链接: https://arxiv.org/abs/2602.04693
作者: Yuan Zhang,Thales Bertaglia
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Detecting uncivil language is crucial for maintaining safe, inclusive, and democratic online spaces. Yet existing classifiers often misinterpret posts containing uncivil cues but expressing civil intents, leading to inflated estimates of harmful incivility online. We introduce LinGO, a linguistic graph optimization framework for large language models (LLMs) that leverages linguistic structures and optimization techniques to classify multi-class intents of incivility that use various direct and indirect expressions. LinGO decomposes language into multi-step linguistic components, identifies targeted steps that cause the most errors, and iteratively optimizes prompt and/or example components for targeted steps. We evaluate it using a dataset collected during the 2022 Brazilian presidential election, encompassing four forms of political incivility: Impoliteness (IMP), Hate Speech and Stereotyping (HSST), Physical Harm and Violent Political Rhetoric (PHAVPR), and Threats to Democratic Institutions and Values (THREAT). Each instance is annotated with six types of civil/uncivil intent. We benchmark LinGO using three cost-efficient LLMs: GPT-5-mini, Gemini 2.5 Flash-Lite, and Claude 3 Haiku, and four optimization techniques: TextGrad, AdalFlow, DSPy, and Retrieval-Augmented Generation (RAG). The results show that, across all models, LinGO consistently improves accuracy and weighted F1 compared with zero-shot, chain-of-thought, direct optimization, and fine-tuning baselines. RAG is the strongest optimization technique and, when paired with Gemini model, achieves the best overall performance. These findings demonstrate that incorporating multi-step linguistic components into LLM instructions and optimize targeted components can help the models explain complex semantic meanings, which can be extended to other complex semantic explanation tasks in the future.
zh

[NLP-22] Investigating Disability Representations in Text-to-Image Models

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在图像生成过程中对残障群体代表性不足且存在偏见的问题,尤其关注其在文本到图像模型(如 Stable Diffusion XL 和 DALL-E 3)中的表现。其解决方案的关键在于采用结构化提示设计,通过对比通用残障提示与具体残障类别提示生成的图像相似性,系统评估残障表征的多样性与偏差;同时结合情感极性分析(sentiment polarity analysis),从自动和人工两个层面量化情感框架的影响,从而识别并缓解模型中存在的代表性失衡问题。

链接: https://arxiv.org/abs/2602.04687
作者: Yang Yian,Yu Fan,Liudmila Zavolokina,Sarah Ebling
机构: University of Zurich (苏黎世大学); ETH Zurich (苏黎世联邦理工学院); University of Lausanne (洛桑大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 21 pages, 9 figures. References included

点击查看摘要

Abstract:Text-to-image generative models have made remarkable progress in producing high-quality visual content from textual descriptions, yet concerns remain about how they represent social groups. While characteristics like gender and race have received increasing attention, disability representations remain underexplored. This study investigates how people with disabilities are represented in AI-generated images by analyzing outputs from Stable Diffusion XL and DALL-E 3 using a structured prompt design. We analyze disability representations by comparing image similarities between generic disability prompts and prompts referring to specific disability categories. Moreover, we evaluate how mitigation strategies influence disability portrayals, with a focus on assessing affective framing through sentiment polarity analysis, combining both automatic and human evaluation. Our findings reveal persistent representational imbalances and highlight the need for continuous evaluation and refinement of generative models to foster more diverse and inclusive portrayals of disability.
zh

[NLP-23] Audio ControlNet for Fine-Grained Audio Generation and Editing

【速读】: 该论文旨在解决文本到音频(text-to-audio, T2A)生成任务中对音频属性(如音量、音高和声学事件)缺乏精细控制的问题。现有模型虽能生成高质量音频,但难以实现对特定声学特征的精确调控。其解决方案的关键在于提出一种基于预训练T2A模型的ControlNet架构,通过引入轻量级适配模块(T2A-Adapter)实现高效可控生成——仅需38M额外参数即可在AudioSet-Strong数据集上达到事件级与片段级F1分数的最先进性能,同时进一步扩展至音频编辑任务(T2A-Editor),支持按指令在指定时间位置插入或移除音频事件。

链接: https://arxiv.org/abs/2602.04680
作者: Haina Zhu,Yao Xiao,Xiquan Li,Ziyang Ma,Jianwei Yu,Bowen Zhang,Mingqi Yang,Xie Chen
机构: X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University (上海交通大学计算机科学学院); Shanghai Innovation Institute; MiniMax; Independent Researcher
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:We study the fine-grained text-to-audio (T2A) generation task. While recent models can synthesize high-quality audio from text descriptions, they often lack precise control over attributes such as loudness, pitch, and sound events. Unlike prior approaches that retrain models for specific control types, we propose to train ControlNet models on top of pre-trained T2A backbones to achieve controllable generation over loudness, pitch, and event roll. We introduce two designs, T2A-ControlNet and T2A-Adapter, and show that the T2A-Adapter model offers a more efficient structure with strong control ability. With only 38M additional parameters, T2A-Adapter achieves state-of-the-art performance on the AudioSet-Strong in both event-level and segment-level F1 scores. We further extend this framework to audio editing, proposing T2A-Editor for removing and inserting audio events at time locations specified by instructions. Models, code, dataset pipelines, and benchmarks will be released to support future research on controllable audio generation and editing.
zh

[NLP-24] Overstating Attitudes Ignoring Networks: LLM Biases in Simulating Misinformation Susceptibility

【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)能否有效模拟人类在虚假信息信念与传播行为上的模式,从而作为计算社会科学研究中人类判断的替代代理。解决方案的关键在于通过构建基于真实社会调查数据(包括网络结构、人口统计学特征、态度和行为变量)的提示词(prompt),生成LLM模拟的受访者回答,并将其与三组在线调查的人类响应进行对比,评估其分布匹配度及关键变量间关联的保真度。结果表明,LLM生成的回答虽能捕捉总体分布趋势,但系统性高估信念与分享之间的关联,且线性模型拟合时过度依赖态度和行为特征而忽略个人网络特征,反映出训练数据中对虚假信息相关概念的系统性偏差。因此,LLM模拟更适合用于识别与人类判断的系统性差异,而非直接替代人类判断。

链接: https://arxiv.org/abs/2602.04674
作者: Eun Cheol Choi,Lindsay E. Young,Emilio Ferrara
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used as proxies for human judgment in computational social science, yet their ability to reproduce patterns of susceptibility to misinformation remains unclear. We test whether LLM-simulated survey respondents, prompted with participant profiles drawn from social survey data measuring network, demographic, attitudinal and behavioral features, can reproduce human patterns of misinformation belief and sharing. Using three online surveys as baselines, we evaluate whether LLM outputs match observed response distributions and recover feature-outcome associations present in the original survey data. LLM-generated responses capture broad distributional tendencies and show modest correlation with human responses, but consistently overstate the association between belief and sharing. Linear models fit to simulated responses exhibit substantially higher explained variance and place disproportionate weight on attitudinal and behavioral features, while largely ignoring personal network characteristics, relative to models fit to human responses. Analyses of model-generated reasoning and LLM training data suggest that these distortions reflect systematic biases in how misinformation-related concepts are represented. Our findings suggest that LLM-based survey simulations are better suited for diagnosing systematic divergences from human judgment than for substituting it.
zh

[NLP-25] Delving into Muon and Beyond: Deep Analysis and Extensions

【速读】: 该论文旨在解决Muon优化器的内在机制及其与自适应优化器(如Adam)之间关系不明确的问题。其解决方案的关键在于提出一种统一的谱视角,将Muon视为一类形式为 $ U \boldsymbol\Sigma^p V’ $ 的谱变换家族在 $ p = 0 $ 处的端点,并引入多个变体($ p = 1/2, 1/4, 1 )来系统分析不同谱变换对梯度更新的影响。作者进一步区分了基于一阶矩更新(类似动量SGD)和均方根(RMS)归一化梯度更新(类似Adam)两种情形,并设计了一种耦合牛顿迭代算法以避免显式的奇异值分解(SVD),从而实现高效计算。实验表明,RMS归一化更新比一阶矩更新更稳定,而谱压缩虽能增强稳定性,但Muon)来系统分析不同谱变换对梯度更新的影响。作者进一步区分了基于一阶矩更新(类似动量SGD)和均方根(RMS)归一化梯度更新(类似Adam)两种情形,并设计了一种耦合牛顿迭代算法以避免显式的奇异值分解(SVD),从而实现高效计算。实验表明,RMS归一化更新比一阶矩更新更稳定,而谱压缩虽能增强稳定性,但Muon( p = 0 $)并未始终优于Adam,说明其本质是一种有效的谱归一化方法,而非普适更优的优化策略。

链接: https://arxiv.org/abs/2602.04669
作者: Xianbiao Qi,Marco Chen,Jiaquan Ye,Yelin He,Rong Xiao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: This paper studies matrix-based optimizers (e.g., Muon) from a spectral perspective and unifies a range of methods under a common spectral framework

点击查看摘要

Abstract:The Muon optimizer has recently attracted considerable attention for its strong empirical performance and use of orthogonalized updates on matrix-shaped parameters, yet its underlying mechanisms and relationship to adaptive optimizers such as Adam remain insufficiently understood. In this work, we aim to address these questions through a unified spectral perspective. Specifically, we view Muon as the p = 0 endpoint of a family of spectral transformations of the form U \boldsymbol\Sigma^p V’ , and consider additional variants with p = 1/2 , p = 1/4 , and p = 1 . These transformations are applied to both first-moment updates, as in momentum SGD, and to root-mean-square (RMS) normalized gradient updates as in Adam. To enable efficient computation, we develop a coupled Newton iteration that avoids explicit singular value decomposition. Across controlled experiments, we find that RMS-normalized updates yield more stable optimization than first-moment updates. Moreover, while spectral compression provides strong stabilization benefits under first-moment updates, the Muon update (p = 0) does not consistently outperform Adam. These results suggest that Muon is best understood as an effective form of spectral normalization, but not a universally superior optimization method. Our source code will be released at this https URL.
zh

[NLP-26] Approaches to Semantic Textual Similarity in Slovak Language: From Algorithms to Transformers

【速读】: 该论文旨在解决低资源语言(如斯洛伐克语)中句子级语义文本相似度(Semantic Textual Similarity, STS)计算的挑战问题。解决方案的关键在于系统性地比较传统算法、监督式机器学习模型与第三方深度学习工具在斯洛伐克语上的表现,并通过人工蜂群优化(Artificial Bee Colony Optimization)联合指导特征选择与超参数调优,以提升机器学习模型性能;同时评估了包括预训练的SlovakBERT模型、CloudNLP微调模型及OpenAI嵌入模型等第三方工具的效果,从而揭示不同方法在准确性和实用性之间的权衡关系。

链接: https://arxiv.org/abs/2602.04659
作者: Lukas Radosky,Miroslav Blstak,Matej Krajcovic,Ivan Polasek
机构: Comenius University Bratislava (布拉迪斯拉发夸美纽斯大学); Kempelen Institute of Intelligent Technologies (Kempelen智能技术研究所)
类目: Computation and Language (cs.CL)
备注: This is a preprint of a paper that was presented at the IEEE 24th World Symposium on Applied Machine Intelligence and Informatics (SAMI 2026)

点击查看摘要

Abstract:Semantic textual similarity (STS) plays a crucial role in many natural language processing tasks. While extensively studied in high-resource languages, STS remains challenging for under-resourced languages such as Slovak. This paper presents a comparative evaluation of sentence-level STS methods applied to Slovak, including traditional algorithms, supervised machine learning models, and third-party deep learning tools. We trained several machine learning models using outputs from traditional algorithms as features, with feature selection and hyperparameter tuning jointly guided by artificial bee colony optimization. Finally, we evaluated several third-party tools, including fine-tuned model by CloudNLP, OpenAI’s embedding models, GPT-4 model, and pretrained SlovakBERT model. Our findings highlight the trade-offs between different approaches.
zh

[NLP-27] Outcome Accuracy is Not Enough: Aligning the Reason ing Process of Reward Models

【速读】: 该论文旨在解决生成式奖励模型(Generative Reward Models, GenRMs)和大语言模型作为评判者(LLM-as-a-Judge)在强化学习人类反馈(RLHF)过程中出现的“欺骗性对齐”(deceptive alignment)问题,即模型虽能给出正确结果,但其推理过程与人类判断不一致,导致泛化能力下降。解决方案的关键在于引入理由一致性(Rationale Consistency)这一细粒度指标,用于量化模型推理过程与人类判断的一致性,并在此基础上设计一种融合理由一致性与结果准确性的混合信号用于GenRM训练,从而有效缓解欺骗性对齐,提升模型在RM-Bench、JudgeBench及Arena Hard v2等基准上的性能表现。

链接: https://arxiv.org/abs/2602.04649
作者: Binghai Wang,Yantao Liu,Yuxuan Liu,Tianyi Tang,Shenzhi Wang,Chang Gao,Chujie Zheng,Yichang Zhang,Le Yu,Shixuan Liu,Tao Gui,Qi Zhang,Xuanjing Huang,Bowen Yu,Fei Huang,Junyang Lin
机构: Qwen Team, Alibaba Group(阿里巴巴集团); Fudan University(复旦大学); Tsinghua University(清华大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Generative Reward Models (GenRMs) and LLM-as-a-Judge exhibit deceptive alignment by producing correct judgments for incorrect reasons, as they are trained and evaluated to prioritize Outcome Accuracy, which undermines their ability to generalize during RLHF. We introduce Rationale Consistency, a fine-grained metric that quantifies the alignment between the model’s reasoning process and human judgment. Our evaluation of frontier models reveals that rationale consistency effectively discriminates among state-of-the-art models and detects deceptive alignment, while outcome accuracy falls short in both respects. To mitigate this gap, we introduce a hybrid signal that combines rationale consistency with outcome accuracy for GenRM training. Our training method achieves state-of-the-art performance on RM-Bench (87.1%) and JudgeBench (82%), surpassing outcome-only baselines by an average of 5%. Using RM during RLHF, our method effectively improves performance as demonstrated on Arena Hard v2, notably yielding a 7% improvement in creative writing tasks. Further analysis confirms that our method escapes the deceptive alignment trap, effectively reversing the decline in rationale consistency observed in outcome-only training.
zh

[NLP-28] Mapping the Web of Science a large-scale graph and text-based dataset with LLM embeddings

【速读】: 该论文旨在解决大规模文本数据集(如科学出版物)中同时包含语义信息与结构关系的双重特征难以有效整合和利用的问题。传统方法通常仅能处理其中一种特征:例如图算法擅长处理文本间的链接或引用等结构关系,而对文本本身的语义内容则缺乏高效建模能力。解决方案的关键在于引入大语言模型(Large Language Models, LLMs)生成的嵌入表示(embedding),将文本的语义信息转化为高维向量空间中的数值表达,从而实现对文本内容的深度表征,并结合已有的图结构分析方法进行统一建模。通过在Web of Science约5600万篇科学文献上的实证研究,作者验证了该嵌入方法能够揭示文本数据内在的自组织结构,显著提升了对复杂文本语义与关联关系联合建模的能力。

链接: https://arxiv.org/abs/2602.04630
作者: Tim Kunt,Annika Buchholz,Imene Khebouri,Thorsten Koch,Ida Litzel,Thi Huong Vu
机构: Zuse Institute Berlin (柏林泽斯研究所); Technische Universität Berlin (柏林工业大学); Institute of Mathematics, Vietnam Academy of Science and Technology (越南科学技术研究院数学研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large text data sets, such as publications, websites, and other text-based media, inherit two distinct types of features: (1) the text itself, its information conveyed through semantics, and (2) its relationship to other texts through links, references, or shared attributes. While the latter can be described as a graph structure and can be handled by a range of established algorithms for classification and prediction, the former has recently gained new potential through the use of LLM embedding models. Demonstrating these possibilities and their practicability, we investigate the Web of Science dataset, containing ~56 million scientific publications through the lens of our proposed embedding method, revealing a self-structured landscape of texts.
zh

[NLP-29] LEAD: Layer-wise Expert-aligned Decoding for Faithful Radiology Report Generation

【速读】: 该论文旨在解决生成式医学影像报告(Radiology Report Generation, RRG)中大视觉语言模型(Large Vision Language Models, LVLM)存在的幻觉问题,即模型生成看似合理但与图像内容不符的病理描述。现有方法多依赖外部知识引导来增强文本与视觉信息的一致性,但忽视了预训练模型内在的解码先验和视觉-语言对齐偏差,且因依赖人工构建的引导机制而缺乏鲁棒性。解决方案的关键在于提出分层专家对齐解码(Layer-wise Expert-aligned Decoding, LEAD),通过在每个解码层引入多专家模块提取差异化病理特征,并借助门控机制将这些特征动态注入解码过程,使语言模型在每一步推理中都能基于学习到的门控函数调用专家特征,从而有效修正解码偏差并提升生成结果的事实一致性。

链接: https://arxiv.org/abs/2602.04617
作者: Ruixiao Yang,Yuanhe Tian,Xu Yang,Huiqi Li,Yan Song
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Radiology Report Generation (RRG) aims to produce accurate and coherent diagnostics from medical images. Although large vision language models (LVLM) improve report fluency and accuracy, they exhibit hallucinations, generating plausible yet image-ungrounded pathological details. Existing methods primarily rely on external knowledge guidance to facilitate the alignment between generated text and visual information. However, these approaches often ignore the inherent decoding priors and vision-language alignment biases in pretrained models and lack robustness due to reliance on constructed guidance. In this paper, we propose Layer-wise Expert-aligned Decoding (LEAD), a novel method to inherently modify the LVLM decoding trajectory. A multiple experts module is designed for extracting distinct pathological features which are integrated into each decoder layer via a gating mechanism. This layer-wise architecture enables the LLM to consult expert features at every inference step via a learned gating function, thereby dynamically rectifying decoding biases and steering the generation toward factual consistency. Experiments conducted on multiple public datasets demonstrate that the LEAD method yields effective improvements in clinical accuracy metrics and mitigates hallucinations while preserving high generation quality.
zh

[NLP-30] Disentangling meaning from language in LLM -based machine translation

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在机器翻译(Machine Translation, MT)任务中缺乏机制可解释性的问题,尤其是以往研究受限于模型规模,仅能进行词级分析,难以揭示句子层面的内部工作机制。其解决方案的关键在于将MT任务分解为两个子任务:目标语言识别(target language identification)和句子等价性保持(sentence equivalence),并通过系统分析注意力头(attention heads)的功能分工,发现特定且稀疏的注意力头集合分别专精于这两个子任务。基于此发现,作者构建了针对各子任务的定向控制向量(steering vectors),仅修改约1%的相关注意力头即可实现无需指令提示(instruction-free)的翻译性能,达到与指令提示相当的效果,同时选择性消融这些头会显著破坏对应翻译功能,从而验证了机制解耦的有效性与可控性。

链接: https://arxiv.org/abs/2602.04613
作者: Théo Lasnier,Armel Zebaze,Djamé Seddah,Rachel Bawden,Benoît Sagot
机构: 未知
类目: Computation and Language (cs.CL)
备注: 61 pages, 70 figures

点击查看摘要

Abstract:Mechanistic Interpretability (MI) seeks to explain how neural networks implement their capabilities, but the scale of Large Language Models (LLMs) has limited prior MI work in Machine Translation (MT) to word-level analyses. We study sentence-level MT from a mechanistic perspective by analyzing attention heads to understand how LLMs internally encode and distribute translation functions. We decompose MT into two subtasks: producing text in the target language (i.e. target language identification) and preserving the input sentence’s meaning (i.e. sentence equivalence). Across three families of open-source models and 20 translation directions, we find that distinct, sparse sets of attention heads specialize in each subtask. Based on this insight, we construct subtask-specific steering vectors and show that modifying just 1% of the relevant heads enables instruction-free MT performance comparable to instruction-based prompting, while ablating these heads selectively disrupts their corresponding translation functions.
zh

[NLP-31] Focus-LIME: Surgical Interpretation of Long-Context Large Language Models via Proxy-Based Neighborhood Selection

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在处理长文本上下文时,现有无模型依赖的局部解释方法因特征维度高而导致的归因稀释问题,从而难以提供精准、可操作的特征级解释。解决方案的关键在于提出一种粗粒度到细粒度的框架 Focus-LIME,其通过引入代理模型(proxy model)来优化扰动邻域,使目标模型仅在精心筛选的上下文中进行精细归因,从而恢复手术式解释(surgical interpretation)的可行性与忠实性。

链接: https://arxiv.org/abs/2602.04607
作者: Junhao Liu,Haonan Yu,Zhenyu Yan,Xin Zhang
机构: Peking University (北京大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) scale to handle massive context windows, achieving surgical feature-level interpretation is essential for high-stakes tasks like legal auditing and code debugging. However, existing local model-agnostic explanation methods face a critical dilemma in these scenarios: feature-based methods suffer from attribution dilution due to high feature dimensionality, thus failing to provide faithful explanations. In this paper, we propose Focus-LIME, a coarse-to-fine framework designed to restore the tractability of surgical interpretation. Focus-LIME utilizes a proxy model to curate the perturbation neighborhood, allowing the target model to perform fine-grained attribution exclusively within the optimized context. Empirical evaluations on long-context benchmarks demonstrate that our method makes surgical explanations practicable and provides faithful explanations to users.
zh

[NLP-32] RexBERT: Context Specialized Bidirectional Encoders for E-commerce

【速读】: 该论文旨在解决通用预训练语言模型在电商领域语义理解任务中表现不佳的问题,其核心挑战在于通用语料库缺乏对电商场景的充分覆盖,导致模型难以捕捉特定领域的语义特征。解决方案的关键在于:首先构建了一个包含3500亿token的高质量电商专属语料库Ecom-niverse,通过模块化管道从开放网络资源中精准提取电商内容;其次提出一种分阶段的可复现预训练策略,包括通用预训练、上下文扩展和渐进式领域专业化三个阶段;最终训练出参数规模仅为通用模型2–3倍的RexBERT系列模型,在电商相关任务上显著优于更大规模的通用模型,并达到或超越长文本处理模型的性能。这一方法表明,高质量的领域数据与结构化的训练范式比单纯扩大模型规模更能提升特定应用场景的效果。

链接: https://arxiv.org/abs/2602.04605
作者: Rahul Bajaj,Anuj Garg
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Blog: this https URL Models: this https URL Ecom-niverse Dataset: this https URL

点击查看摘要

Abstract:Encoder-only transformers remain indispensable in retrieval, classification, and ranking systems where latency, stability, and cost are paramount. Most general purpose encoders, however, are trained on generic corpora with limited coverage of specialized domains. We introduce RexBERT, a family of BERT-style encoders designed specifically for e-commerce semantics. We make three contributions. First, we release Ecom-niverse, a 350 billion token corpus curated from diverse retail and shopping sources. We describe a modular pipeline that isolates and extracts e-commerce content from FineFineWeb and other open web resources, and characterize the resulting domain distribution. Second, we present a reproducible pretraining recipe building on ModernBERT’s architectural advances. The recipe consists of three phases: general pre-training, context extension, and annealed domain specialization. Third, we train RexBERT models ranging from 17M to 400M parameters and evaluate them on token classification, semantic similarity, and general natural language understanding tasks using e-commerce datasets. Despite having 2-3x fewer parameters, RexBERT outperforms larger general-purpose encoders and matches or surpasses modern long-context models on domain-specific benchmarks. Our results demonstrate that high quality in-domain data combined with a principled training approach provides a stronger foundation for e-commerce applications than indiscriminate scaling alone.
zh

[NLP-33] Beyond Holistic Scores: Automatic Trait-Based Quality Scoring of Argumentative Essays

【速读】: 该论文旨在解决自动化作文评分(Automated Essay Scoring, AES)系统在复杂议论文写作中缺乏可解释性与教学适配性的问题,传统模型多输出整体分数,难以满足教育场景下对分维度、符合评分量规(rubric)的细粒度反馈需求。其解决方案的关键在于:一方面采用基于小规模开源大语言模型(Large Language Models, LLMs)的结构化上下文学习方法,通过设计与评分量规对齐的提示模板实现无需任务微调即可获得竞争力表现,尤其在推理类特质上优势显著;另一方面构建基于BigBird架构的监督式序数回归模型,引入CORAL(Rank-Consistent Ordinal Regression)框架显式建模分数的序数特性,从而大幅提升与人工评分者的一致性,验证了将模型目标与评分量规语义对齐的重要性。

链接: https://arxiv.org/abs/2602.04604
作者: Lucile Favero,Juan Antonio Pérez-Ortiz,Tanja Käser,Nuria Oliver
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automated Essay Scoring systems have traditionally focused on holistic scores, limiting their pedagogical usefulness, especially in the case of complex essay genres such as argumentative writing. In educational contexts, teachers and learners require interpretable, trait-level feedback that aligns with instructional goals and established rubrics. In this paper, we study trait-based Automatic Argumentative Essay Scoring using two complementary modeling paradigms designed for realistic educational deployment: (1) structured in-context learning with small open-source LLMs, and (2) a supervised, encoder-based BigBird model with a CORAL-style ordinal regression formulation, optimized for long-sequence understanding. We conduct a systematic evaluation on the ASAP++ dataset, which includes essay scores across five quality traits, offering strong coverage of core argumentation dimensions. LLMs are prompted with designed, rubric-aligned in-context examples, along with feedback and confidence requests, while we explicitly model ordinality in scores with the BigBird model via the rank-consistent CORAL framework. Our results show that explicitly modeling score ordinality substantially improves agreement with human raters across all traits, outperforming LLMs and nominal classification and regression-based baselines. This finding reinforces the importance of aligning model objectives with rubric semantics for educational assessment. At the same time, small open-source LLMs achieve a competitive performance without task-specific fine-tuning, particularly for reasoning-oriented traits, while enabling transparent, privacy-preserving, and locally deployable assessment scenarios. Our findings provide methodological, modeling, and practical insights for the design of AI-based educational systems that aim to deliver interpretable, rubric-aligned feedback for argumentative writing.
zh

[NLP-34] VILLAIN at AVerImaTeC: Verifying Image-Text Claims via Multi-Agent Collaboration EACL2026

【速读】: 该论文旨在解决图像-文本联合陈述(image-text claim)的多模态事实核查问题,即如何有效验证包含图像与文本信息的陈述是否属实。解决方案的关键在于提出一个基于提示(prompt-based)的多智能体协作系统VILLAIN,其通过多个阶段实现自动化分析:首先利用视觉-语言模型(vision-language model)代理从增强的知识库中检索文本和视觉证据;其次,模态特定及跨模态代理生成分析报告以识别关键信息并处理证据间的不一致性;随后基于报告生成问答对;最终由判决预测代理结合原始图像-文本陈述与问答对输出验证结果。该方法在AVerImaTeC共享任务中表现优异,排名第一。

链接: https://arxiv.org/abs/2602.04587
作者: Jaeyoon Jung,Yejun Yoon,Seunghyun Yoon,Kunwoo Park
机构: Soongsil University (弘益大学); MAUM AI Inc.; Department of Intelligent Semiconductors (智能半导体系); Adobe Research (Adobe 研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: A system description paper for the AVerImaTeC shared task at the Ninth FEVER Workshop (co-located with EACL 2026)

点击查看摘要

Abstract:This paper describes VILLAIN, a multimodal fact-checking system that verifies image-text claims through prompt-based multi-agent collaboration. For the AVerImaTeC shared task, VILLAIN employs vision-language model agents across multiple stages of fact-checking. Textual and visual evidence is retrieved from the knowledge store enriched through additional web collection. To identify key information and address inconsistencies among evidence items, modality-specific and cross-modal agents generate analysis reports. In the subsequent stage, question-answer pairs are produced based on these reports. Finally, the Verdict Prediction agent produces the verification outcome based on the image-text claim and the generated question-answer pairs. Our system ranked first on the leaderboard across all evaluation metrics. The source code is publicly available at this https URL.
zh

[NLP-35] rust The Typical

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)安全机制依赖于对已知威胁进行枚举和拦截的脆弱性问题,这种“猫鼠游戏”式的方法难以应对未知或变种攻击。其解决方案的关键在于提出 Trust The Typical (T3) 框架,将安全建模为一种分布外检测(out-of-distribution, OOD)问题:通过学习语义空间中可接受提示的正常分布,将显著偏离该分布的输入识别为潜在威胁。T3 不需要使用有害样本进行训练,仅基于安全文本即可实现跨18个基准测试(涵盖毒性、仇恨言论、越狱攻击、多语言危害及过度拒绝等场景)的最先进性能,且相比专用安全模型将误报率降低高达40倍,同时具备良好的跨域与多语言迁移能力,并在 vLLM 中实现了GPU优化部署,支持生成过程中的持续防护,延迟增加低于6%。

链接: https://arxiv.org/abs/2602.04581
作者: Debargha Ganguly,Sreehari Sankar,Biyao Zhang,Vikash Singh,Kanan Gupta,Harshini Kavuru,Alan Luo,Weicong Chen,Warren Morningstar,Raghu Machiraju,Vipin Chaudhary
机构: Case Western Reserve University (凯斯西储大学); University of Pittsburgh (匹兹堡大学); The Ohio State University (俄亥俄州立大学); Google Research (谷歌研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Current approaches to LLM safety fundamentally rely on a brittle cat-and-mouse game of identifying and blocking known threats via guardrails. We argue for a fresh approach: robust safety comes not from enumerating what is harmful, but from deeply understanding what is safe. We introduce Trust The Typical (T3), a framework that operationalizes this principle by treating safety as an out-of-distribution (OOD) detection problem. T3 learns the distribution of acceptable prompts in a semantic space and flags any significant deviation as a potential threat. Unlike prior methods, it requires no training on harmful examples, yet achieves state-of-the-art performance across 18 benchmarks spanning toxicity, hate speech, jailbreaking, multilingual harms, and over-refusal, reducing false positive rates by up to 40x relative to specialized safety models. A single model trained only on safe English text transfers effectively to diverse domains and over 14 languages without retraining. Finally, we demonstrate production readiness by integrating a GPU-optimized version into vLLM, enabling continuous guardrailing during token generation with less than 6% overhead even under dense evaluation intervals on large-scale workloads.
zh

[NLP-36] AIANO: Enhancing Information Retrieval with AI-Augmented Annotation

【速读】: 该论文旨在解决当前信息检索数据集(information retrieval datasets)标注过程中因依赖通用标注工具而导致的效率低、流程复杂的问题。解决方案的关键在于开发了一种专用于标注的AI增强型工具AIANO,其核心创新是将人类专家知识与大语言模型(Large Language Models, LLMs)的辅助能力紧密结合,形成一种人机协同的标注工作流: annotators 可在保留最终决策权的同时利用AI建议提升效率。实验表明,AIANO显著加快了标注速度(接近翻倍),并提升了检索准确率,验证了该方法在提升标注效率与质量方面的有效性。

链接: https://arxiv.org/abs/2602.04579
作者: Sameh Khattab,Marie Bauer,Lukas Heine,Till Rostalski,Jens Kleesiek,Julian Friedrich
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rise of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) has rapidly increased the need for high-quality, curated information retrieval datasets. These datasets, however, are currently created with off-the-shelf annotation tools that make the annotation process complex and inefficient. To streamline this process, we developed a specialized annotation tool - AIANO. By adopting an AI-augmented annotation workflow that tightly integrates human expertise with LLM assistance, AIANO enables annotators to leverage AI suggestions while retaining full control over annotation decisions. In a within-subject user study ( n = 15 ), participants created question-answering datasets using both a baseline tool and AIANO. AIANO nearly doubled annotation speed compared to the baseline while being easier to use and improving retrieval accuracy. These results demonstrate that AIANO’s AI-augmented approach accelerates and enhances dataset creation for information retrieval tasks, advancing annotation capabilities in retrieval-intensive domains.
zh

[NLP-37] Semantic Self-Distillation for Language Model Uncertainty

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在不确定性量化(uncertainty quantification)方面的挑战,特别是由于模型复杂性和输出多样性导致的难以有效评估预测置信度的问题。其核心问题是:如何在不显著增加计算开销的前提下,准确捕捉生成式AI(Generative AI)输出中的语义不确定性,以支持如幻觉检测(hallucination prediction)和域外答案识别(out-of-domain answer detection)等关键任务。解决方案的关键在于提出一种名为语义自蒸馏(Semantic Self-Distillation, SSD)的方法,通过将大语言模型生成的语义分布蒸馏为轻量级学生模型(student model),使其能够在语言模型生成任何输出token之前,基于提示(prompt)预测一个条件化的语义分布;该分布的熵作为有效的不确定性信号,同时其概率密度可用于评估候选答案的可靠性,从而实现高效且准确的不确定性估计。

链接: https://arxiv.org/abs/2602.04577
作者: Edward Phillips,Sean Wu,Boyan Gao,David A. Clifton
机构: University of Oxford (牛津大学); Oxford Suzhou Centre for Advanced Research (苏州牛津 advanced research 中心)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models present challenges for principled uncertainty quantification, in part due to their complexity and the diversity of their outputs. Semantic dispersion, or the variance in the meaning of sampled answers, has been proposed as a useful proxy for model uncertainty, but the associated computational cost prohibits its use in latency-critical applications. We show that sampled semantic distributions can be distilled into lightweight student models which estimate a prompt-conditioned uncertainty before the language model generates an answer token. The student model predicts a semantic distribution over possible answers; the entropy of this distribution provides an effective uncertainty signal for hallucination prediction, and the probability density allows candidate answers to be evaluated for reliability. On TriviaQA, our student models match or outperform finite-sample semantic dispersion for hallucination prediction and provide a strong signal for out-of-domain answer detection. We term this technique Semantic Self-Distillation (SSD), which we suggest provides a general framework for distilling predictive uncertainty in complex output spaces beyond language.
zh

[NLP-38] Can LLM s capture stable human-generated sentence entropy measures?

【速读】: 该论文旨在解决两个关键问题:一是确定人类对句子中词预测熵(Shannon entropy)估计所需的最小响应样本量,以实现稳定且无偏的测量;二是评估大语言模型(LLMs)在复制人类熵值方面的有效性。其解决方案的关键在于采用基于自举法(bootstrap-based convergence analysis)的统计方法,追踪熵估计随样本量变化的收敛过程,并通过对比多个LLM(如GPT-4o、RoBERTa、LLaMA 2等)与人类数据在不同提取方式(基于logit的概率估计与基于采样的频率估计)下的表现,揭示了人类熵估计的收敛性高度依赖于句子可预测性(predictability),且GPT-4o在logit基础上最接近人类分布,但采样方法更能捕捉人类变异性,从而为规范人类语料库构建和LLM作为替代工具的应用提供了实证依据与实践指南。

链接: https://arxiv.org/abs/2602.04570
作者: Estrella Pivel-Villanueva,Elisabeth Frederike Sterner,Franziska Knolle
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Predicting upcoming words is a core mechanism of language comprehension and may be quantified using Shannon entropy. There is currently no empirical consensus on how many human responses are required to obtain stable and unbiased entropy estimates at the word level. Moreover, large language models (LLMs) are increasingly used as substitutes for human norming data, yet their ability to reproduce stable human entropy remains unclear. Here, we address both issues using two large publicly available cloze datasets in German 1 and English 2. We implemented a bootstrap-based convergence analysis that tracks how entropy estimates stabilize as a function of sample size. Across both languages, more than 97% of sentences reached stable entropy estimates within the available sample sizes. 90% of sentences converged after 111 responses in German and 81 responses in English, while low-entropy sentences (1) required as few as 20 responses and high-entropy sentences (2.5) substantially more. These findings provide the first direct empirical validation for common norming practices and demonstrate that convergence critically depends on sentence predictability. We then compared stable human entropy values with entropy estimates derived from several LLMs, including GPT-4o, using both logit-based probability extraction and sampling-based frequency estimation, GPT2-xl/german-GPT-2, RoBERTa Base/GottBERT, and LLaMA 2 7B Chat. GPT-4o showed the highest correspondence with human data, although alignment depended strongly on the extraction method and prompt design. Logit-based estimates minimized absolute error, whereas sampling-based estimates were better in capturing the dispersion of human variability. Together, our results establish practical guidelines for human norming and show that while LLMs can approximate human entropy, they are not interchangeable with stable human-derived distributions.
zh

[NLP-39] xtual Planning with Explicit Latent Transitions

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在规划任务中因逐 token 生成和重复全前向传播导致的高延迟与高计算开销问题,尤其限制了多步前瞻(multi-step lookahead)和基于 rollout 的搜索效率。其解决方案的关键在于提出 EmbedPlan,该方法摒弃传统的自回归式状态生成方式,转而采用一个轻量级的转移模型(transition model),在冻结的语言嵌入空间(frozen language embedding space)中直接预测下一状态的嵌入表示,并通过最近邻相似性检索获取实际状态,从而实现无需微调编码器的快速规划计算。

链接: https://arxiv.org/abs/2602.04557
作者: Eliezer Shlomi,Ido Levy,Eilam Shapira,Michael Katz,Guy Uziel,Segev Shlomov,Nir Mashkif,Roi Reichart,Sarah Keren
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Planning with LLMs is bottlenecked by token-by-token generation and repeated full forward passes, making multi-step lookahead and rollout-based search expensive in latency and compute. We propose EmbedPlan, which replaces autoregressive next-state generation with a lightweight transition model operating in a frozen language embedding space. EmbedPlan encodes natural language state and action descriptions into vectors, predicts the next-state embedding, and retrieves the next state by nearest-neighbor similarity, enabling fast planning computation without fine-tuning the encoder. We evaluate next-state prediction across nine classical planning domains using six evaluation protocols of increasing difficulty: interpolation, plan-variant, extrapolation, multi-domain, cross-domain, and leave-one-out. Results show near-perfect interpolation performance but a sharp degradation when generalization requires transfer to unseen problems or unseen domains; plan-variant evaluation indicates generalization to alternative plans rather than memorizing seen trajectories. Overall, frozen embeddings support within-domain dynamics learning after observing a domain’s transitions, while transfer across domain boundaries remains a bottleneck.
zh

[NLP-40] Rethinking Weight Tying: Pseudo-Inverse Tying for Stable LM Training and Updates

【速读】: 该论文旨在解决权重共享(weight tying)在紧凑语言模型中导致的token接口不稳定问题,即在训练过程中编码tokens到隐藏状态与解码隐藏状态到logits之间的对应关系可能发生漂移,从而加剧优化敏感性,并使后训练干预(如编辑、修补和轻量级适应)变得不可预测。解决方案的关键在于提出伪逆绑定(Pseudo-Inverse Tying, PIT),其通过将嵌入(embedding)和未嵌入(unembedding)建模为共享潜在token记忆的耦合投影,确保在整个训练过程中保持伪逆一致性。PIT维护一个正交共享记忆,可通过薄极分解(thin polar decomposition)进行教师初始化或从随机正交初始化开始,并引入一个由Cholesky因子参数化的全学习对称正定隐藏空间变换;输出头在词汇投影前应用该变换,而嵌入层则使用稳定的三角求解计算其逆变换,避免显式伪逆重计算及任何词汇大小的辅助参数,从而实现稳定且高效的训练与下游任务适应。

链接: https://arxiv.org/abs/2602.04556
作者: Jian Gu,Aldeida Aleti,Chunyang Chen,Hongyu Zhang
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: an early-stage version

点击查看摘要

Abstract:Weight tying is widely used in compact language models to reduce parameters by sharing the token table between the input embedding and the output projection. However, weight sharing does not guarantee a stable token interface: during training, the correspondence between encoding tokens into hidden states and decoding hidden states into logits can drift, worsening optimization sensitivity and making post-training interventions such as editing, patching, and lightweight adaptation less predictable. We propose Pseudo-Inverse Tying (PIT), which synchronizes embedding and unembedding as coupled projections of a shared latent token memory, guaranteeing a pseudo-inverse-consistent interface throughout training. PIT maintains an orthonormal shared memory, obtained by thin polar decomposition for teacher initialization or random orthonormal initialization from scratch, and introduces a fully learned symmetric positive definite hidden-space transform parameterized via a Cholesky factor. The output head applies this transform to hidden states before the vocabulary projection, while the embedding applies the inverse transform to token vectors using stable triangular solves, avoiding explicit pseudo-inverse recomputation and any vocabulary-sized auxiliary parameters. We evaluate PIT on on-device models spanning 256M-1.3B parameters across pretraining and adaptation, and consistently observe improved training stability, stronger layerwise semantic consistency, and substantially reduced side effects.
zh

[NLP-41] Unmasking Superspreaders: Data-Driven Approaches for Identifying and Comparing Key Influencers of Conspiracy Theories on X.com

【速读】: 该论文旨在解决社交媒体上阴谋论传播问题,特别是识别和理解两类关键传播者——人类超级传播者(Human Superspreaders)与机器人账号(Bots)的行为差异及其对信息扩散的贡献。其解决方案的关键在于通过分析超过七百万条新冠疫情期间的推文,系统性地比较这两类传播者在语言复杂度、毒性水平及标签使用等方面的特征差异,并提出27项新型量化指标用于评估阴谋论传播的严重程度;其中,研究发现适配后的H-指数(H-Index)能有效实现对人类超级传播者的计算可行识别,从而为平台内容审核策略、账号处置机制及公众认知干预提供可操作的技术依据。

链接: https://arxiv.org/abs/2602.04546
作者: Florian Kramer,Henrich R. Greve,Moritz von Zahn,Hayagreeva Rao
机构: 未知
类目: ocial and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Conspiracy theories can threaten society by spreading misinformation, deepening polarization, and eroding trust in democratic institutions. Social media often fuels the spread of conspiracies, primarily driven by two key actors: Superspreaders – influential individuals disseminating conspiracy content at disproportionately high rates, and Bots – automated accounts designed to amplify conspiracies strategically. To counter the spread of conspiracy theories, it is critical to both identify these actors and to better understand their behavior. However, a systematic analysis of these actors as well as real-world-applicable identification methods are still lacking. In this study, we leverage over seven million tweets from the COVID-19 pandemic to analyze key differences between Human Superspreaders and Bots across dimensions such as linguistic complexity, toxicity, and hashtag usage. Our analysis reveals distinct communication strategies: Superspreaders tend to use more complex language and substantive content while relying less on structural elements like hashtags and emojis, likely to enhance credibility and authority. By contrast, Bots favor simpler language and strategic cross-usage of hashtags, likely to increase accessibility, facilitate infiltration into trending discussions, and amplify reach. To counter both Human Superspreaders and Bots, we propose and evaluate 27 novel metrics for quantifying the severity of conspiracy theory spread. Our findings highlight the effectiveness of an adapted H-Index for computationally feasible identification of Human Superspreaders. By identifying behavioral patterns unique to Human Superspreaders and Bots as well as providing suitable identification methods, this study provides a foundation for mitigation strategies, including platform moderation policies, temporary and permanent account suspensions, and public awareness campaigns.
zh

[NLP-42] LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding ICLR2026

【速读】: 该论文旨在解决长上下文大语言模型(Long-context Large Language Models, LLMs)在解码过程中因键值缓存(key-value cache)快速膨胀而导致的高内存占用与延迟问题。现有方法通过跨层共享单一关键token集合来缓解此瓶颈,但这种粗粒度共享策略忽视了注意力头(attention head)的功能多样性,从而损害模型性能。其解决方案的关键在于提出一种基于细粒度混合注意力机制(fine-grained hybrid-head attention mechanism)的高效解码方法——LycheeDecode,该方法采用硬件友好的top-k选择策略,将注意力头划分为少量动态检索头(retrieval heads)用于识别关键token,以及多数稀疏头(sparse heads)复用这些token以实现计算效率提升,从而在保持注意力头功能多样性的前提下显著优化推理效率与生成质量。

链接: https://arxiv.org/abs/2602.04541
作者: Gang Lin,Dongfang Li,Zhuoen Chen,Yukun Shi,Xuhui Chen,Baotian Hu,Min Zhang
机构: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ICLR 2026

点击查看摘要

Abstract:The proliferation of long-context large language models (LLMs) exposes a key bottleneck: the rapidly expanding key-value cache during decoding, which imposes heavy memory and latency costs. While recent approaches attempt to alleviate this by sharing a single set of crucial tokens across layers, such coarse-grained sharing undermines model performance by neglecting the functional diversity of attention heads. To address this, we propose LycheeDecode, an efficient decoding method centered on a fine-grained hybrid-head attention mechanism that employs a hardware-efficient top-k selection strategy. Specifically, the novel HardKuma-based mechanism partitions attention heads into a small subset of retrieval heads that dynamically identify crucial tokens and a majority of sparse heads that reuse them for efficient computation. Through extensive experiments on leading models like Llama3 and Qwen3 across diverse benchmarks for long-context understanding (e.g., LongBench, RULER) and complex reasoning (e.g., AIME24, OlympiadBench), we demonstrate that LycheeDecode achieves generative quality comparable to, and at times surpassing even the full-attention baseline. Crucially, this is accomplished with up to a 2.7x speedup at a 128K context length. By preserving the functional diversity of attention heads, our fine-grained strategy overcomes the performance bottlenecks of existing methods, providing a powerful and validated pathway to both efficient and high-quality long-context LLM inference.
zh

[NLP-43] PersoPilot: An Adaptive AI-Copilot for Transparent Contextualized Persona Classification and Personalized Response Generation ICDM

【速读】: 该论文旨在解决现有个性化服务系统中用户画像(persona)与情境上下文(context)分离建模的问题,导致难以实现精细化、自适应的交互体验。其核心挑战在于如何将静态的用户特征与动态的情境信息深度融合,以生成更具针对性的服务推荐。解决方案的关键在于提出PersoPilot——一个集成用户画像理解与情境分析的智能代理型AI协作者(agentic AI-Copilot),通过透明可解释的对话接口支持终端用户自然语言表达偏好,并为分析师提供基于推理的标签辅助工具及主动学习驱动的分类机制,形成闭环反馈系统,从而实现从原始画像数据到可操作的情境感知洞察的转化。

链接: https://arxiv.org/abs/2602.04540
作者: Saleh Afzoon,Amin Beheshti,Usman Naseem
机构: Macquarie University (麦考瑞大学)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注: Accepted for the Demo Track at the IEEE International Conference on Data Mining (ICDM) 2025

点击查看摘要

Abstract:Understanding and classifying user personas is critical for delivering effective personalization. While persona information offers valuable insights, its full potential is realized only when contextualized, linking user characteristics with situational context to enable more precise and meaningful service provision. Existing systems often treat persona and context as separate inputs, limiting their ability to generate nuanced, adaptive interactions. To address this gap, we present PersoPilot, an agentic AI-Copilot that integrates persona understanding with contextual analysis to support both end users and analysts. End users interact through a transparent, explainable chat interface, where they can express preferences in natural language, request recommendations, and receive information tailored to their immediate task. On the analyst side, PersoPilot delivers a transparent, reasoning-powered labeling assistant, integrated with an active learning-driven classification process that adapts over time with new labeled data. This feedback loop enables targeted service recommendations and adaptive personalization, bridging the gap between raw persona data and actionable, context-aware insights. As an adaptable framework, PersoPilot is applicable to a broad range of service personalization scenarios.
zh

[NLP-44] C-ΔΘ: Circuit-Restricted Weight Arithmetic for Selective Refusal

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际部署中因依赖推理时干预(inference-time interventions)而导致的计算开销和部署复杂性问题。现有方法如激活操控(activation steering)虽能实现选择性拒绝,但需运行时钩子且成本随生成次数增加;条件变体虽提升选择性,仍保留推理时控制路径。解决方案的关键在于将选择性拒绝机制完全移至离线阶段:通过EAP-IG识别特定类别拒绝因果计算的稀疏电路,并基于该电路约束权重更新ΔθC(通常仅作用于约5%参数),从而生成一个无需推理时干预的可直接部署检查点(checkpoint),实现从每请求干预到一次性离线修改的成本转移。

链接: https://arxiv.org/abs/2602.04521
作者: Aditya Kasliwal,Pratinav Seth,Vinay Kumar Sankarapu
机构: Lexsi Labs
类目: Computation and Language (cs.CL); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:Modern deployments require LLMs to enforce safety policies at scale, yet many controls rely on inference-time interventions that add recurring compute cost and serving complexity. Activation steering is widely used, but it requires runtime hooks and scales cost with the number of generations; conditional variants improve selectivity by gating when steering is applied but still retain an inference-time control path. We ask whether selective refusal can be moved entirely offline: can a mechanistic understanding of category-specific refusal be distilled into a circuit-restricted weight update that deploys as a standard checkpoint? We propose C-\Delta\theta: Circuit Restricted Weight Arithmetic, which (i) localizes refusal-causal computation as a sparse circuit using EAP-IG and (ii) computes a constrained weight update \Delta\thetaC supported only on that circuit (typically 5% of parameters). Applying \Delta\thetaC yields a drop-in edited checkpoint with no inference-time hooks, shifting cost from per-request intervention to a one-time offline update. We evaluate category-targeted selectivity and capability retention on refusal and utility benchmarks.
zh

[NLP-45] ReFRAME or Remain: Unsupervised Lexical Semantic Change Detection with Frame Semantics

【速读】: 该论文旨在解决当前主流词汇语义变化(Lexical Semantic Change, LSC)检测方法依赖神经嵌入分布表示所带来的可解释性差的问题。其解决方案的关键在于摒弃传统的分布语义模型,转而采用仅基于框架语义(Frame Semantics)的方法,通过分析词语在不同语境中所激活的语义框架来识别语义演变,从而实现既有效又高度可解释的LSC检测结果。

链接: https://arxiv.org/abs/2602.04514
作者: Bach Phan-Tat,Kris Heylen,Dirk Geeraerts,Stefano De Pascale,Dirk Speelman
机构: KU Leuven (鲁汶大学); Instituut voor de Nederlandse Taal (荷兰语研究所); Vrije Universiteit Brussel (布鲁塞尔自由大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The majority of contemporary computational methods for lexical semantic change (LSC) detection are based on neural embedding distributional representations. Although these models perform well on LSC benchmarks, their results are often difficult to interpret. We explore an alternative approach that relies solely on frame semantics. We show that this method is effective for detecting semantic change and can even outperform many distributional semantic models. Finally, we present a detailed quantitative and qualitative analysis of its predictions, demonstrating that they are both plausible and highly interpretable
zh

[NLP-46] Model-Dowser: Data-Free Importance Probing to Mitigate Catastrophic Forgetting in Multimodal Large Language Models

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在任务特定数据上微调时引发的灾难性遗忘(Catastrophic Forgetting)问题,即模型在提升下游任务性能的同时,会显著损害其在预训练任务上的泛化能力。解决方案的关键在于提出一种名为Model-Dowser的稀疏微调方法,其核心是通过联合考虑权重幅度(weight magnitudes)、输入激活(input activations)和输出敏感度(output sensitivities)来量化每个参数对预训练泛化能力的重要性得分,并在微调过程中仅更新低重要性参数,保留高重要性参数,从而在保持模型性能的同时实现高效且可扩展的微调策略。

链接: https://arxiv.org/abs/2602.04509
作者: Hyeontaek Hwang,Nguyen Dinh Son,Daeyoung Kim
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Fine-tuning Multimodal Large Language Models (MLLMs) on task-specific data is an effective way to improve performance on downstream applications. However, such adaptation often leads to a degradation in generalization on pretrained tasks, a phenomenon known as Catastrophic Forgetting. Existing methods that aim to mitigate this issue either become ineffective when fine-tuning deeper layers of the language decoder or scale poorly with increasing model size. To address these limitations, we propose Model-Dowser, a novel sparse fine-tuning approach for MLLMs. Model-Dowser measures a principled importance score for each model parameter with respect to pretrained generalization (prior to downstream adaptation) by jointly considering weight magnitudes, input activations, and output sensitivities. During fine-tuning, Model-Dowser selectively preserves high-importance parameters and updates the remaining. Comprehensive experiments on two representative MLLMs, LLaVA and NVILA, demonstrate that Model-Dowser effectively mitigates catastrophic forgetting and consistently outperforms prior methods, while remaining resource-efficient and scalable to multi-billion-parameter models.
zh

[NLP-47] PersoDPO: Scalable Preference Optimization for Instruction-Adherent Persona-Grounded Dialogue via Multi-LLM Evaluation

【速读】: 该论文旨在解决开放源代码大语言模型(Large Language Models, LLMs)在人格化对话系统中难以同时实现个性化(personalization)与上下文连贯性(contextual coherence)的问题,尽管这些模型在通用对话流畅性和自然度上表现良好。解决方案的关键在于提出一种可扩展的偏好优化框架 PersoDPO,该框架利用来自闭源和开源LLM生成响应的自动评估信号(包括针对连贯性、个性化以及长度格式合规性的指标),无需人工标注即可自动构建高质量偏好对,从而实现对对话模型的有效微调。实验表明,基于PersoDPO微调后的开源模型在FoCus数据集上显著优于多个强基线模型和标准直接偏好优化(Direct Preference Optimization, DPO)方法。

链接: https://arxiv.org/abs/2602.04493
作者: Saleh Afzoon,MohammadHossein Ahmadi,Usman Naseem,Amin Beheshti
机构: Macquarie University (麦考瑞大学)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: Accepted at WISE 2025 Conference

点击查看摘要

Abstract:Personalization and contextual coherence are two essential components in building effective persona-grounded dialogue systems. These aspects play a crucial role in enhancing user engagement and ensuring responses are more relevant and consistent with user identity. However, recent studies indicate that open-source large language models (LLMs) continue to struggle to generate responses that are both contextually grounded and aligned with persona cues, despite exhibiting strong general conversational abilities like fluency and naturalness. We present PersoDPO, a scalable preference optimisation framework that uses supervision signals from automatic evaluations of responses generated by both closed-source and open-source LLMs to fine-tune dialogue models. The framework integrates evaluation metrics targeting coherence and personalization, along with a length-format compliance feature to promote instruction adherence. These signals are combined to automatically construct high-quality preference pairs without manual annotation, enabling a scalable and reproducible training pipeline. Experiments on the FoCus dataset show that an open-source language model fine-tuned with the PersoDPO framework consistently outperforms strong open-source baselines and a standard Direct Preference Optimization (DPO) variant across multiple evaluation dimensions.
zh

[NLP-48] Deconstructing sentence disambiguation by joint latent modeling of reading paradigms: LLM surprisal is not enough

【速读】: 该论文旨在解决人类阅读过程中对结构暂时歧义句(如“While the team trained the striker wondered…”)的处理机制问题,特别是如何量化和区分“歧义概率(garden-path probability)”、“歧义代价(garden-path cost)”与“重新分析代价(reanalysis cost)”这三个关键认知过程。其解决方案的关键在于提出一种潜在过程混合模型(latent-process mixture model),该模型整合了四种不同的阅读范式(眼动追踪、单向与双向自定步速阅读及Maze任务),并通过引入对注意力不集中试次的建模,显著提升了处理代价估计的真实性。相比基于GPT-2生成的预期值(surprisal)的无混合模型,该方法在重现重读行为、理解问答和语法判断等实验数据方面具有更强的拟合能力与预测效度。

链接: https://arxiv.org/abs/2602.04489
作者: Dario Paape,Tal Linzen,Shravan Vasishth
机构: University of Potsdam (波茨坦大学); New York University (纽约大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Using temporarily ambiguous garden-path sentences (“While the team trained the striker wondered …”) as a test case, we present a latent-process mixture model of human reading behavior across four different reading paradigms (eye tracking, uni- and bidirectional self-paced reading, Maze). The model distinguishes between garden-path probability, garden-path cost, and reanalysis cost, and yields more realistic processing cost estimates by taking into account trials with inattentive reading. We show that the model is able to reproduce empirical patterns with regard to rereading behavior, comprehension question responses, and grammaticality judgments. Cross-validation reveals that the mixture model also has better predictive fit to human reading patterns and end-of-trial task data than a mixture-free model based on GPT-2-derived surprisal values. We discuss implications for future work.
zh

[NLP-49] Beyond Unimodal Shortcuts: MLLM s as Cross-Modal Reason ers for Grounded Named Entity Recognition

【速读】: 该论文旨在解决生成式 AI 在跨模态命名实体识别(Grounded Multimodal Named Entity Recognition, GMNER)任务中因模态偏差(modality bias)导致的性能瓶颈问题,即多模态大语言模型(MLLMs)倾向于依赖单一模态的捷径(如仅凭视觉或文本线索),而非进行严格的跨模态验证。解决方案的关键在于提出一种模态感知的一致性推理机制(Modality-aware Consistency Reasoning, MCR),其核心由两个组件构成:一是通过多风格推理模式注入(Multi-style Reasoning Schema Injection, MRSI)将抽象约束转化为可执行的推理链,二是利用约束引导的可验证优化(Constraint-guided Verifiable Optimization, CVO)使模型动态对齐其推理路径与Group Relative Policy Optimization(GRPO),从而有效缓解模态偏差并提升端到端GMNER性能。

链接: https://arxiv.org/abs/2602.04486
作者: Jinlong Ma,Yu Zhang,Xuefeng Bai,Kehai Chen,Yuwei Wang,Zeming Liu,Jun Yu,Min Zhang
机构: Harbin Institute of Technology, Shenzhen, China (哈尔滨工业大学深圳分校); Institute of Computing Technology Chinese Academy of Sciences (中国科学院计算技术研究所); Beijing University of Aeronautics and Astronautics (北京航空航天大学)
类目: Computation and Language (cs.CL)
备注: GMNER

点击查看摘要

Abstract:Grounded Multimodal Named Entity Recognition (GMNER) aims to extract text-based entities, assign them semantic categories, and ground them to corresponding visual regions. In this work, we explore the potential of Multimodal Large Language Models (MLLMs) to perform GMNER in an end-to-end manner, moving beyond their typical role as auxiliary tools within cascaded pipelines. Crucially, our investigation reveals a fundamental challenge: MLLMs exhibit \textbfmodality bias , including visual bias and textual bias, which stems from their tendency to take unimodal shortcuts rather than rigorous cross-modal verification. To address this, we propose Modality-aware Consistency Reasoning ( \textbfMCR ), which enforces structured cross-modal reasoning through Multi-style Reasoning Schema Injection (MRSI) and Constraint-guided Verifiable Optimization (CVO). MRSI transforms abstract constraints into executable reasoning chains, while CVO empowers the model to dynamically align its reasoning trajectories with Group Relative Policy Optimization (GRPO). Experiments on GMNER and visual grounding tasks demonstrate that MCR effectively mitigates modality bias and achieves superior performance compared to existing baselines.
zh

[NLP-50] Is Micro Domain-Adaptive Pre-Training Effective for Real-World Operations? Multi-Step Evaluation Reveals Potential and Bottlenecks EACL2026

【速读】: 该论文旨在解决小域自适应预训练(micro domain-adaptive pre-training, mDAPT)在生成式任务中的有效性问题,尤其是在企业真实运营场景下,如IT技术支持中对专有知识的生成式问答能力。以往研究仅验证了mDAPT在多项选择题上的有效性,但其在需要复杂推理与长文本生成的任务中表现尚不明确。论文的关键解决方案是将回答过程解耦为三个子任务:事实提取(eliciting)、推理(reasoning)和答案生成(composing),并通过实证分析发现,mDAPT能显著提升模型在事实提取环节的表现,但无法有效改善推理与生成环节,揭示了其在知识获取方面的优势以及在推理能力上的瓶颈。这一发现强调了增强模型推理能力对于实现高绩效生成式AI系统的重要性。

链接: https://arxiv.org/abs/2602.04466
作者: Masaya Tsunokake,Yuta Koreeda,Terufumi Morishita,Koichi Nagatsuka,Hikaru Tomonari,Yasuhiro Sogawa
机构: Hitachi, Ltd (日立有限公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 9 figures, Accepted by EACL2026 Industry Track

点击查看摘要

Abstract:When applying LLMs to real-world enterprise operations, LLMs need to handle proprietary knowledge in small domains of specific operations ( \textbfmicro domains ). A previous study shows micro domain-adaptive pre-training ( \textbfmDAPT ) with fewer documents is effective, similarly to DAPT in larger domains. However, it evaluates mDAPT only on multiple-choice questions; thus, its effectiveness for generative tasks in real-world operations remains unknown. We aim to reveal the potential and bottlenecks of mDAPT for generative tasks. To this end, we disentangle the answering process into three subtasks and evaluate the performance of each subtask: (1) \textbfeliciting facts relevant to questions from an LLM’s own knowledge, (2) \textbfreasoning over the facts to obtain conclusions, and (3) \textbfcomposing long-form answers based on the conclusions. We verified mDAPT on proprietary IT product knowledge for real-world questions in IT technical support operations. As a result, mDAPT resolved the elicitation task that the base model struggled with but did not resolve other subtasks. This clarifies mDAPT’s effectiveness in the knowledge aspect and its bottlenecks in other aspects. Further analysis empirically shows that resolving the elicitation and reasoning tasks ensures sufficient performance (over 90%), emphasizing the need to enhance reasoning capability.
zh

[NLP-51] Growth First Care Second? Tracing the Landscape of LLM Value Preferences in Everyday Dilemmas

【速读】: 该论文旨在解决生成式 AI(Generative AI)在提供建议时如何处理价值权衡的问题,特别是当人类用户面临多维价值观冲突的复杂情境时,AI 是否能合理识别并回应这些权衡。其解决方案的关键在于构建一个基于 Reddit 社区真实建议数据的层次化价值框架,并通过共现网络分析揭示不同语境下价值冲突的结构差异;在此基础上,系统性评估大语言模型(Large Language Model, LLM)对各类价值的偏好倾向,发现 LLM 在多数情境中普遍偏向“探索-成长”类价值,而弱于“利他-联结”类价值,揭示了 AI 辅助决策可能引发的价值同质化风险。

链接: https://arxiv.org/abs/2602.04456
作者: Zhiyi Chen,Eun Cheol Choi,Yingjia Luo,Xinyi Wang,Yulei Xiao,Aizi Yang,Luca Luceri
机构: University of Southern California (南加州大学); Tsinghua University (清华大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: dataset available at this https URL

点击查看摘要

Abstract:People increasingly seek advice online from both human peers and large language model (LLM)-based chatbots. Such advice rarely involves identifying a single correct answer; instead, it typically requires navigating trade-offs among competing values. We aim to characterize how LLMs navigate value trade-offs across different advice-seeking contexts. First, we examine the value trade-off structure underlying advice seeking using a curated dataset from four advice-oriented subreddits. Using a bottom-up approach, we inductively construct a hierarchical value framework by aggregating fine-grained values extracted from individual advice options into higher-level value categories. We construct value co-occurrence networks to characterize how values co-occur within dilemmas and find substantial heterogeneity in value trade-off structures across advice-seeking contexts: a women-focused subreddit exhibits the highest network density, indicating more complex value conflicts; women’s, men’s, and friendship-related subreddits exhibit highly correlated value-conflict patterns centered on security-related tensions (security vs. respect/connection/commitment); by contrast, career advice forms a distinct structure where security frequently clashes with self-actualization and growth. We then evaluate LLM value preferences against these dilemmas and find that, across models and contexts, LLMs consistently prioritize values related to Exploration Growth over Benevolence Connection. This systemically skewed value orientation highlights a potential risk of value homogenization in AI-mediated advice, raising concerns about how such systems may shape decision-making and normative outcomes at scale.
zh

[NLP-52] No One-Size-Fits-All: Building Systems For Translation to Bashkir Kazakh Kyrgyz Tatar and Chuvash Using Synthetic And Original Data EACL2026

【速读】: 该论文旨在解决低资源 Turkic 语言对(包括巴什基尔语、哈萨克语、吉尔吉斯语、鞑靼语和楚瓦什语)的机器翻译问题,这些语言在现有大规模预训练模型中缺乏足够的训练数据支持。解决方案的关键在于结合两种策略:一是通过 LoRA(Low-Rank Adaptation)微调 nllb-200-distilled-600M 模型,并利用合成数据提升性能,实现了哈萨克语(chrF++ 49.71)和巴什基尔语(chrF++ 46.94)的显著效果;二是采用基于检索的提示方法(retrieval-based prompting),将相似示例引入 DeepSeek-V3.2 模型,使楚瓦什语翻译达到 chrF++ 39.47,同时零样本或检索方法在鞑靼语(chrF++ 41.6)和吉尔吉斯语(chrF++ 45.6)上也展现出良好适应性。整体方案体现了小样本场景下高效迁移学习与外部知识增强的有效协同。

链接: https://arxiv.org/abs/2602.04442
作者: Dmitry Karpov
机构: PAO Severstal (Severstal公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to EACL 2026 (LoResMT workshop)

点击查看摘要

Abstract:We explore machine translation for five Turkic language pairs: Russian-Bashkir, Russian-Kazakh, Russian-Kyrgyz, English-Tatar, English-Chuvash. Fine-tuning nllb-200-distilled-600M with LoRA on synthetic data achieved chrF++ 49.71 for Kazakh and 46.94 for Bashkir. Prompting DeepSeek-V3.2 with retrieved similar examples achieved chrF++ 39.47 for Chuvash. For Tatar, zero-shot or retrieval-based approaches achieved chrF++ 41.6, while for Kyrgyz the zero-shot approach reached 45.6. We release the dataset and the obtained weights.
zh

[NLP-53] Fine-Grained Activation Steering: Steering Less Achieving More ICLR2026

【速读】: 该论文旨在解决现有激活控制(activation steering)方法在大型语言模型(Large Language Model, LLM)行为修改中效率低、干预粗粒度的问题。当前方法通常在模块(block)层面进行干预,但研究发现模块级激活本质上具有异质性,混杂了有益、无关甚至有害的特征,导致干预效果粗糙且易引入干扰。解决方案的关键在于将激活分解为细粒度的原子单元(atomic unit, AU)级别——每个AU对应模块激活的一个维度,并关联到权重矩阵的一个切片。通过识别对输出分布具有判别性的AU并仅对其施加自适应强度的干预,AUSteer实现了更精准、高效的控制策略。实验证明,减少干预数量反而提升了性能,揭示了“少即是多”的精细控制范式。

链接: https://arxiv.org/abs/2602.04428
作者: Zijian Feng,Tianjiao Li,Zixiao Zhu,Hanzhang Zhou,Junlang Qian,Li Zhang,Jia Jim Deryl Chua,Lee Onn Mak,Gee Wah Ng,Kezhi Mao
机构: Nanyang Technological University (南洋理工大学); Home Team Science and Technology Agency (HTX) (新加坡内政科技局)
类目: Computation and Language (cs.CL)
备注: ICLR 2026

点击查看摘要

Abstract:Activation steering has emerged as a cost-effective paradigm for modifying large language model (LLM) behaviors. Existing methods typically intervene at the block level, steering the bundled activations of selected attention heads, feedforward networks, or residual streams. However, we reveal that block-level activations are inherently heterogeneous, entangling beneficial, irrelevant, and harmful features, thereby rendering block-level steering coarse, inefficient, and intrusive. To investigate the root cause, we decompose block activations into fine-grained atomic unit (AU)-level activations, where each AU-level activation corresponds to a single dimension of the block activation, and each AU denotes a slice of the block weight matrix. Steering an AU-level activation is thus equivalent to steering its associated AU. Our theoretical and empirical analysis show that heterogeneity arises because different AUs or dimensions control distinct token distributions in LLM outputs. Hence, block-level steering inevitably moves helpful and harmful token directions together, which reduces efficiency. Restricting intervention to beneficial AUs yields more precise and effective steering. Building on this insight, we propose AUSteer, a simple and efficient method that operates at a finer granularity of the AU level. AUSteer first identifies discriminative AUs globally by computing activation momenta on contrastive samples. It then assigns adaptive steering strengths tailored to diverse inputs and selected AU activations. Comprehensive experiments on multiple LLMs and tasks show that AUSteer consistently surpasses advanced baselines while steering considerably fewer activations, demonstrating that steering less achieves more.
zh

[NLP-54] History-Guided Iterative Visual Reason ing with Self-Correction

【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在跨模态推理任务中因固定“重复采样与投票”范式导致的局限性问题,即缺乏对历史推理信息的 reuse 和动态误差修正能力,从而难以有效纠正视觉理解错误并提升推理可靠性。解决方案的关键在于提出 H-GIVR 框架,该框架通过让模型在迭代推理过程中多次观察图像,并将先前生成的答案作为后续步骤的参考依据,实现基于历史信息的动态误差校正,从而显著提升跨模态推理准确率,同时保持较低的计算开销。

链接: https://arxiv.org/abs/2602.04413
作者: Xinglong Yang,Zhilin Peng,Zhanzhan Liu,Haochen Shi,Sheng-Jun Huang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Self-consistency methods are the core technique for improving the reasoning reliability of multimodal large language models (MLLMs). By generating multiple reasoning results through repeated sampling and selecting the best answer via voting, they play an important role in cross-modal tasks. However, most existing self-consistency methods are limited to a fixed ``repeated sampling and voting’’ paradigm and do not reuse historical reasoning information. As a result, models struggle to actively correct visual understanding errors and dynamically adjust their reasoning during iteration. Inspired by the human reasoning behavior of repeated verification and dynamic error correction, we propose the H-GIVR framework. During iterative reasoning, the MLLM observes the image multiple times and uses previously generated answers as references for subsequent steps, enabling dynamic correction of errors and improving answer accuracy. We conduct comprehensive experiments on five datasets and three models. The results show that the H-GIVR framework can significantly improve cross-modal reasoning accuracy while maintaining low computational cost. For instance, using \textttLlama3.2-vision:11b on the ScienceQA dataset, the model requires an average of 2.57 responses per question to achieve an accuracy of 78.90%, representing a 107% improvement over the baseline.
zh

[NLP-55] Swordsman: Entropy-Driven Adaptive Block Partition for Efficient Diffusion Language Models

【速读】: 该论文旨在解决扩散语言模型(Diffusion Language Models, DLMs)中传统块式解码(block-wise decoding)因固定且刚性地划分块结构而导致语义或句法成分被割裂,从而影响推理速度与质量的问题。解决方案的关键在于提出Swordsman——一个基于熵驱动的自适应块式解码框架:通过分析相邻词元间的熵变化来动态识别语义或句法成分边界,实现更贴合语言结构的块划分;同时,根据块内实时去掩码状态动态调整去掩码阈值,提升解码效率与稳定性。该方法无需训练,依托KV Cache实现高效推理,在多项评估中达到当前最优性能。

链接: https://arxiv.org/abs/2602.04399
作者: Yu Zhang,Xinchen Li,Jialei Zhou,Hongnan Ma,Zhongwei Wan,Yiwei Shi,Duoqian Miao,Qi Zhang,Longbing Cao
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Block-wise decoding effectively improves the inference speed and quality in diffusion language models (DLMs) by combining inter-block sequential denoising and intra-block parallel unmasking. However, existing block-wise decoding methods typically partition blocks in a rigid and fixed manner, which inevitably fragments complete semantic or syntactic constituents, leading to suboptimal performance. Inspired by the entropy reduction hypothesis (ERH), we recognize that constituent boundaries offer greater opportunities for uncertainty reduction, which motivates us to employ entropy analysis for identifying constituent boundaries. Therefore, we propose Swordsman, an entropy-driven adaptive block-wise decoding framework for DLMs. Swordsman adaptively partitions blocks by identifying entropy shifts between adjacent tokens to better align with semantic or syntactic constituent boundaries. In addition, Swordsman dynamically adjusts unmasking thresholds conditioned on the real-time unmasking status within a block, further improving both efficiency and stability. As a training-free framework, supported by KV Cache, Swordsman demonstrates state-of-the-art performance across extensive evaluations.
zh

[NLP-56] Bi-directional Bias Attribution: Debiasing Large Language Models without Modifying Prompts

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成文本时存在的社会偏见问题,特别是由刻板印象诱导词(stereotype-inducing words)引发的不公平输出。现有去偏方法如微调或提示工程存在可扩展性差或影响多轮交互用户体验的局限。其解决方案的关键在于:首先通过跨群体对比分析识别出诱发刻板印象的形容词和名词;其次利用基于积分梯度(integrated gradients)的两种归因策略将偏见行为定位到特定神经元;最后在投影层直接干预这些神经元的激活值以实现去偏,而无需对模型进行微调或修改提示。该方法在不损害整体性能的前提下有效降低了模型偏见。

链接: https://arxiv.org/abs/2602.04398
作者: Yujie Lin,Kunquan Li,Yixuan Liao,Xiaoxin Chen,Jinsong Su
机构: School of Informatics, Xiamen University, China (厦门大学信息学院); vivo AI Lab, China (vivo人工智能实验室); Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China (文化和旅游部闽台非物质文化遗产数字保护与智能处理重点实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive capabilities across a wide range of natural language processing tasks. However, their outputs often exhibit social biases, raising fairness concerns. Existing debiasing methods, such as fine-tuning on additional datasets or prompt engineering, face scalability issues or compromise user experience in multi-turn interactions. To address these challenges, we propose a framework for detecting stereotype-inducing words and attributing neuron-level bias in LLMs, without the need for fine-tuning or prompt modification. Our framework first identifies stereotype-inducing adjectives and nouns via comparative analysis across demographic groups. We then attribute biased behavior to specific neurons using two attribution strategies based on integrated gradients. Finally, we mitigate bias by directly intervening on their activations at the projection layer. Experiments on three widely used LLMs demonstrate that our method effectively reduces bias while preserving overall model performance. Code is available at the github link: this https URL.
zh

[NLP-57] Evaluating the Presence of Sex Bias in Clinical Reason ing by Large Language Models

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在医疗场景中因训练数据中存在的性别偏倚而可能引发的临床推理偏差问题,特别是模型在无性别信息的临床案例中表现出的系统性性别分配倾向。解决方案的关键在于:首先,通过系统性实验验证不同通用大语言模型(LLMs)在临床推理任务中存在稳定且模型特异性的性别偏倚;其次,提出采用保守且可追溯的模型配置策略、按专科级别进行临床数据审计,并强调在医疗环境中部署通用模型时必须持续保持人工监督,以降低偏倚带来的风险并确保安全应用。

链接: https://arxiv.org/abs/2602.04392
作者: Isabel Tsintsiper,Sheng Wong,Beth Albert,Shaun P Brennecke,Gabriel Davis Jones
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly embedded in healthcare workflows for documentation, education, and clinical decision support. However, these systems are trained on large text corpora that encode existing biases, including sex disparities in diagnosis and treatment, raising concerns that such patterns may be reproduced or amplified. We systematically examined whether contemporary LLMs exhibit sex-specific biases in clinical reasoning and how model configuration influences these behaviours. We conducted three experiments using 50 clinician-authored vignettes spanning 44 specialties in which sex was non-informative to the initial diagnostic pathway. Four general-purpose LLMs (ChatGPT (gpt-4o-mini), Claude 3.7 Sonnet, Gemini 2.0 Flash and DeepSeekchat). All models demonstrated significant sex-assignment skew, with predicted sex differing by model. At temperature 0.5, ChatGPT assigned female sex in 70% of cases (95% CI 0.66-0.75), DeepSeek in 61% (0.57-0.65) and Claude in 59% (0.55-0.63), whereas Gemini showed a male skew, assigning a female sex in 36% of cases (0.32-0.41). Contemporary LLMs exhibit stable, model-specific sex biases in clinical reasoning. Permitting abstention reduces explicit labelling but does not eliminate downstream diagnostic differences. Safe clinical integration requires conservative and documented configuration, specialty-level clinical data auditing, and continued human oversight when deploying general-purpose models in healthcare settings.
zh

[NLP-58] Beyond Rejection Sampling: Trajectory Fusion for Scaling Mathematical Reason ing

【速读】: 该论文旨在解决当前基于拒绝采样(rejection sampling)的微调策略在数学推理任务中对错误推理轨迹建模不足的问题。现有方法仅保留正确的推理路径,将教师生成的错误视为噪声并直接丢弃,导致模型难以学习到如何从错误中修正自身,尤其在复杂或长序列推理问题上表现受限。解决方案的关键在于提出TrajFusion,一种重构监督信号的微调策略:它通过交织选定的错误轨迹、反思提示(reflection prompts)和正确轨迹,构建融合的推理路径,显式建模试错过程;同时根据教师错误的频率与多样性自适应控制融合长度,在保证安全性的前提下增强对难题的监督信号,无需修改模型架构或训练目标即可显著提升性能。

链接: https://arxiv.org/abs/2602.04391
作者: Jie Deng,Hanshuang Tong,Jun Li,Shining Liang,Ning Wu,Hongzhi Li,Yutao Xie
机构: Microsoft(微软)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have made impressive strides in mathematical reasoning, often fine-tuned using rejection sampling that retains only correct reasoning trajectories. While effective, this paradigm treats supervision as a binary filter that systematically excludes teacher-generated errors, leaving a gap in how reasoning failures are modeled during training. In this paper, we propose TrajFusion, a fine-tuning strategy that reframes rejection sampling as a structured supervision construction process. Specifically, TrajFusion forms fused trajectories that explicitly model trial-and-error reasoning by interleaving selected incorrect trajectories with reflection prompts and correct trajectories. The length of each fused sample is adaptively controlled based on the frequency and diversity of teacher errors, providing richer supervision for challenging problems while safely reducing to vanilla rejection sampling fine-tuning (RFT) when error signals are uninformative. TrajFusion requires no changes to the architecture or training objective. Extensive experiments across multiple math benchmarks demonstrate that TrajFusion consistently outperforms RFT, particularly on challenging and long-form reasoning problems.
zh

[NLP-59] Can Vision Replace Text in Working Memory? Evidence from Spatial n-Back in Vision-Language Models

【速读】: 该论文试图解决的问题是:在视觉-语言模型(vision-language models)中,当信息以视觉而非文本形式呈现时,其工作记忆(working memory)类行为是否能与文本模态下产生可比的计算机制。为回答这一问题,研究者设计了一个受控的空间n-back任务,分别以文本渲染和图像渲染的网格形式呈现刺激,并对比Qwen2.5与Qwen2.5-VL模型的表现。解决方案的关键在于:通过试次级别的对数概率证据分析,揭示了传统n-back任务中“指定滞后”(如2/3-back)往往未能真实反映指令所要求的记忆延迟,而是更符合基于近期记忆的比较机制(recency-locked comparison),并进一步发现网格尺寸会改变刺激流中的最近重复结构,从而影响干扰模式和错误分布。这一发现推动了对多模态工作记忆计算敏感性的理解,强调需从过程层面解析模型内部工作机制。

链接: https://arxiv.org/abs/2602.04355
作者: Sichu Liang,Hongyu Zhu,Wenwen Wang,Deyu Zhou
机构: Southeast University (东南大学); Shanghai Jiao Tong University (上海交通大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Working memory is a central component of intelligent behavior, providing a dynamic workspace for maintaining and updating task-relevant information. Recent work has used n-back tasks to probe working-memory-like behavior in large language models, but it is unclear whether the same probe elicits comparable computations when information is carried in a visual rather than textual code in vision-language models. We evaluate Qwen2.5 and Qwen2.5-VL on a controlled spatial n-back task presented as matched text-rendered or image-rendered grids. Across conditions, models show reliably higher accuracy and d’ with text than with vision. To interpret these differences at the process level, we use trial-wise log-probability evidence and find that nominal 2/3-back often fails to reflect the instructed lag and instead aligns with a recency-locked comparison. We further show that grid size alters recent-repeat structure in the stimulus stream, thereby changing interference and error patterns. These results motivate computation-sensitive interpretations of multimodal working memory.
zh

[NLP-60] A Domain-Specific Curated Benchmark for Entity and Document-Level Relation Extraction EACL2026

【速读】: 该论文旨在解决当前生物医学信息抽取(Information Extraction, IE)基准数据集在覆盖范围和标注质量上的局限性问题,尤其是在快速发展的肠道-脑轴(gut-brain axis)研究领域中,现有基准往往依赖远距离监督或自动生成的标注,难以支撑鲁棒的IE方法开发。解决方案的关键在于构建一个高质量、多任务的基准数据集GutBrainIE,其基于超过1600篇PubMed摘要,由生物医学与术语学专家进行细粒度实体标注、概念级链接及关系抽取,并融合了高度人工校准与弱监督数据,从而为跨领域的生物医学IE系统开发与评估提供更可靠的基础。

链接: https://arxiv.org/abs/2602.04320
作者: Marco Martinelli,Stefano Marchesin,Vanessa Bonato,Giorgio Maria Di Nunzio,Nicola Ferro,Ornella Irrera,Laura Menotti,Federica Vezzani,Gianmaria Silvello
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to EACL 2026

点击查看摘要

Abstract:Information Extraction (IE), encompassing Named Entity Recognition (NER), Named Entity Linking (NEL), and Relation Extraction (RE), is critical for transforming the rapidly growing volume of scientific publications into structured, actionable knowledge. This need is especially evident in fast-evolving biomedical fields such as the gut-brain axis, where research investigates complex interactions between the gut microbiota and brain-related disorders. Existing biomedical IE benchmarks, however, are often narrow in scope and rely heavily on distantly supervised or automatically generated annotations, limiting their utility for advancing robust IE methods. We introduce GutBrainIE, a benchmark based on more than 1,600 PubMed abstracts, manually annotated by biomedical and terminological experts with fine-grained entities, concept-level links, and relations. While grounded in the gut-brain axis, the benchmark’s rich schema, multiple tasks, and combination of highly curated and weakly supervised data make it broadly applicable to the development and evaluation of biomedical IE systems across domains.
zh

[NLP-61] DeFrame: Debiasing Large Language Models Against Framing Effects

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在公平性评估中存在的一种隐蔽偏差问题——即模型在标准评估下表现公平,但在不同表述方式(framing)的提示下却产生显著不公平响应的现象。这种现象被称为“框架差异”(framing disparity),其根源在于语义等价但表达形式不同的提示对模型输出的影响不一致。为应对这一挑战,论文提出一种面向框架感知的去偏方法(framing-aware debiasing),其核心在于通过增强模型在不同框架下的响应一致性来提升公平性鲁棒性,从而在降低整体偏见的同时减少由框架变化引发的不公平波动。实验表明,该方法能够有效提升模型输出的公平性和一致性。

链接: https://arxiv.org/abs/2602.04306
作者: Kahee Lim,Soyeon Kim,Steven Euijong Whang
机构: KAIST
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 40 pages, 12 figures

点击查看摘要

Abstract:As large language models (LLMs) are increasingly deployed in real-world applications, ensuring their fair responses across demographics has become crucial. Despite many efforts, an ongoing challenge is hidden bias: LLMs appear fair under standard evaluations, but can produce biased responses outside those evaluation settings. In this paper, we identify framing – differences in how semantically equivalent prompts are expressed (e.g., “A is better than B” vs. “B is worse than A”) – as an underexplored contributor to this gap. We first introduce the concept of “framing disparity” to quantify the impact of framing on fairness evaluation. By augmenting fairness evaluation benchmarks with alternative framings, we find that (1) fairness scores vary significantly with framing and (2) existing debiasing methods improve overall (i.e., frame-averaged) fairness, but often fail to reduce framing-induced disparities. To address this, we propose a framing-aware debiasing method that encourages LLMs to be more consistent across framings. Experiments demonstrate that our approach reduces overall bias and improves robustness against framing disparities, enabling LLMs to produce fairer and more consistent responses.
zh

[NLP-62] Beyond Static Cropping: Layer-Adaptive Visual Localization and Decoding Enhancement

【速读】: 该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在处理图像时因固定视觉token预算导致的细粒度信息丢失和幻觉问题,尤其是在复杂推理任务中过度依赖语言先验而忽视视觉细节。其核心解决方案是提出一种动态视觉定位机制——Visual Activation by Query (VAQ),通过层间敏感性分析识别与查询相关的最优注意力层,并在此基础上设计训练-free的推理方法LASER(Layer-adaptive Attention-guided Selective visual and decoding Enhancement for Reasoning),实现根据任务复杂度自适应选择视觉激活层,从而提升视觉问答(VQA)任务中的准确性与鲁棒性。

链接: https://arxiv.org/abs/2602.04304
作者: Zipeng Zhu,Zhanghao Hu,Qinglin Zhu,Yuxi Hong,Yijun Liu,Jingyong Su,Yulan He,Lin Gui
机构: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); King’s College London (伦敦国王学院); Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have advanced rapidly by aligning visual patches with the text embedding space, but a fixed visual-token budget forces images to be resized to a uniform pretraining resolution, often erasing fine-grained details and causing hallucinations via over-reliance on language priors. Recent attention-guided enhancement (e.g., cropping or region-focused attention allocation) alleviates this, yet it commonly hinges on a static “magic layer” empirically chosen on simple recognition benchmarks and thus may not transfer to complex reasoning tasks. In contrast to this static assumption, we propose a dynamic perspective on visual grounding. Through a layer-wise sensitivity analysis, we demonstrate that visual grounding is a dynamic process: while simple object recognition tasks rely on middle layers, complex visual search and reasoning tasks require visual information to be reactivated at deeper layers. Based on this observation, we introduce Visual Activation by Query (VAQ), a metric that identifies the layer whose attention map is most relevant to query-specific visual grounding by measuring attention sensitivity to the input query. Building on VAQ, we further propose LASER (Layer-adaptive Attention-guided Selective visual and decoding Enhancement for Reasoning), a training-free inference procedure that adaptively selects task-appropriate layers for visual localization and question answering. Experiments across diverse VQA benchmarks show that LASER significantly improves VQA accuracy across tasks with varying levels of complexity.
zh

[NLP-63] Revisiting Prompt Sensitivity in Large Language Models for Text Classification: The Role of Prompt Underspecification

【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在零样本和少样本分类任务中对提示(prompt)敏感性过高这一问题,尤其关注这种敏感性是否主要源于提示的不明确性(underspecification)。研究表明,许多现有工作所使用的提示缺乏具体任务指令,导致模型输出空间未被充分约束,从而放大了性能波动。解决方案的关键在于区分“非指定提示”(underspecified prompts)与“指令提示”(instruction-prompts),并通过性能分析、logit分析和线性探针(linear probing)等方法系统比较二者差异,发现明确的任务指令能够显著降低性能方差并提升相关token的logit值,同时揭示提示不明确性的影响主要体现在模型最后几层而非内部表征层面。

链接: https://arxiv.org/abs/2602.04297
作者: Branislav Pecher,Michal Spiegel,Robert Belanec,Jan Cegin
机构: Kempelen Institute of Intelligent Technologies (Kempelen智能技术研究所); Brno University of Technology (布林诺理工大学); Masaryk University (马萨里克大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are widely used as zero-shot and few-shot classifiers, where task behaviour is largely controlled through prompting. A growing number of works have observed that LLMs are sensitive to prompt variations, with small changes leading to large changes in performance. However, in many cases, the investigation of sensitivity is performed using underspecified prompts that provide minimal task instructions and weakly constrain the model’s output space. In this work, we argue that a significant portion of the observed prompt sensitivity can be attributed to prompt underspecification. We systematically study and compare the sensitivity of underspecified prompts and prompts that provide specific instructions. Utilising performance analysis, logit analysis, and linear probing, we find that underspecified prompts exhibit higher performance variance and lower logit values for relevant tokens, while instruction-prompts suffer less from such problems. However, linear probing analysis suggests that the effects of prompt underspecification have only a marginal impact on the internal LLM representations, instead emerging in the final layers. Overall, our findings highlight the need for more rigour when investigating and mitigating prompt sensitivity.
zh

[NLP-64] How Few-shot Demonstrations Affect Prompt-based Defenses Against LLM Jailbreak Attacks

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在面对 jailbreak 攻击时的安全性问题,特别是探讨少样本演示(few-shot demonstrations)在不同系统提示策略(如角色导向提示 RoP 和任务导向提示 ToP)中的作用机制。其解决方案的关键在于通过系统性实验发现:少样本演示对 RoP 和 ToP 具有相反影响——它能提升 RoP 的安全率(最高达 4.5%),因强化了角色身份认知;但会显著削弱 ToP 的效果(最差下降 21.2%),因其分散了模型对任务指令的注意力。这一发现为实际部署基于提示的防御策略提供了重要依据。

链接: https://arxiv.org/abs/2602.04294
作者: Yanshu Wang,Shuaishuai Yang,Jingjing He,Tong Yang
机构: Peking University (北京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 13 pages, 4 figures, 6 tables

点击查看摘要

Abstract:Large Language Models (LLMs) face increasing threats from jailbreak attacks that bypass safety alignment. While prompt-based defenses such as Role-Oriented Prompts (RoP) and Task-Oriented Prompts (ToP) have shown effectiveness, the role of few-shot demonstrations in these defense strategies remains unclear. Prior work suggests that few-shot examples may compromise safety, but lacks investigation into how few-shot interacts with different system prompt strategies. In this paper, we conduct a comprehensive evaluation on multiple mainstream LLMs across four safety benchmarks (AdvBench, HarmBench, SG-Bench, XSTest) using six jailbreak attack methods. Our key finding reveals that few-shot demonstrations produce opposite effects on RoP and ToP: few-shot enhances RoP’s safety rate by up to 4.5% through reinforcing role identity, while it degrades ToP’s effectiveness by up to 21.2% through distracting attention from task instructions. Based on these findings, we provide practical recommendations for deploying prompt-based defenses in real-world LLM applications.
zh

[NLP-65] Guided Verifier: Collaborative Multimodal Reason ing via Dynamic Process Supervision

【速读】: 该论文旨在解决当前基于强化学习(Reinforcement Learning, RL)的多模态大语言模型(Multimodal Large Language Models, MLLMs)在复杂推理任务中因采用孤立式rollout策略而导致的错误传播问题。这种策略缺乏中间阶段的监督,使得早期逻辑偏差难以纠正,进而引发不可逆失败并产生噪声优化信号。解决方案的关键在于提出Guided Verifier框架,通过引入一个动态验证器(dynamic verifier),在推理过程中与策略模型实时协同求解任务,主动检测不一致并提供方向性引导,从而将模型导向有效推理路径。为支持该机制,作者构建了专门针对多模态幻觉的数据合成流水线,生成过程级负样本(CoRe数据集)和正确引导推理轨迹(Correct-guide Reasoning trajectories),用于训练验证器,最终实现计算资源向协同推理与动态验证的高效分配,显著提升模型性能。

链接: https://arxiv.org/abs/2602.04290
作者: Lingzhuang Sun,Ruitong Liu,Yuxia Zhu,Xiaohan Xu,Jingxuan Wei,Xiangxiang Zhang,Bihui Yu,Wentao Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) has emerged as a pivotal mechanism for enhancing the complex reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevailing paradigms typically rely on solitary rollout strategies where the model works alone. This lack of intermediate oversight renders the reasoning process susceptible to error propagation, where early logical deviations cascade into irreversible failures, resulting in noisy optimization signals. In this paper, we propose the \textbfGuided Verifier framework to address these structural limitations. Moving beyond passive terminal rewards, we introduce a dynamic verifier that actively co-solves tasks alongside the policy. During the rollout phase, this verifier interacts with the policy model in real-time, detecting inconsistencies and providing directional signals to steer the model toward valid trajectories. To facilitate this, we develop a specialized data synthesis pipeline targeting multimodal hallucinations, constructing \textbfCoRe dataset of process-level negatives and \textbfCorrect-guide \textbfReasoning trajectories to train the guided verifier. Extensive experiments on MathVista, MathVerse and MMMU indicate that by allocating compute to collaborative inference and dynamic verification, an 8B-parameter model can achieve strong performance.
zh

[NLP-66] Proxy Compression for Language Modeling

【速读】: 该论文旨在解决现代语言模型训练中依赖固定分词器(tokenizer)所导致的耦合问题,即模型与外部无损压缩器(如UTF-8字节序列上的压缩器)绑定,限制了推理时对原始字节流的直接处理能力。其解决方案的关键在于提出“代理压缩”(proxy compression)训练范式:在训练过程中,语言模型同时学习原始字节序列和由外部压缩器生成的压缩视图,通过联合训练使模型内部实现压缩序列与原始字节之间的对齐;这种对齐机制使得两种格式间具备强迁移能力,即便训练时主要使用压缩输入(推理时丢弃),仍能实现高效且稳定的性能表现,从而在固定计算预算下显著提升训练效率,并在模型规模扩大时逼近或超越基于分词器的方法,同时保持纯字节级建模的鲁棒性优势。

链接: https://arxiv.org/abs/2602.04289
作者: Lin Zheng,Xinyu Li,Qian Liu,Xiachong Feng,Lingpeng Kong
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Modern language models are trained almost exclusively on token sequences produced by a fixed tokenizer, an external lossless compressor often over UTF-8 byte sequences, thereby coupling the model to that compressor. This work introduces proxy compression, an alternative training scheme that preserves the efficiency benefits of compressed inputs while providing an end-to-end, raw-byte interface at inference time. During training, one language model is jointly trained on raw byte sequences and compressed views generated by external compressors; through the process, the model learns to internally align compressed sequences and raw bytes. This alignment enables strong transfer between the two formats, even when training predominantly on compressed inputs which are discarded at inference. Extensive experiments on code language modeling demonstrate that proxy compression substantially improves training efficiency and significantly outperforms pure byte-level baselines given fixed compute budgets. As model scale increases, these gains become more pronounced, and proxy-trained models eventually match or rival tokenizer approaches, all while operating solely on raw bytes and retaining the inherent robustness of byte-level modeling.
zh

[NLP-67] Contextual Drag : How Errors in the Context Affect LLM Reason ing

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在自改进(self-improvement)过程中因上下文中的失败尝试引发的“情境拖拽”(contextual drag)问题,即模型在生成后续内容时会受到先前错误结构的偏差影响,导致性能下降甚至自我恶化。其核心发现是:即使通过外部反馈或成功自验证,也无法消除这一现象;尽管采用回退行为微调(fallback-behavior fine-tuning)和上下文去噪(context denoising)等策略可部分缓解,仍无法完全恢复基线性能,表明情境拖拽是当前推理架构中一个持续存在的失效模式。

链接: https://arxiv.org/abs/2602.04288
作者: Yun Cheng,Xingyu Zhu,Haoyu Zhao,Sanjeev Arora
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Central to many self-improvement pipelines for large language models (LLMs) is the assumption that models can improve by reflecting on past mistakes. We study a phenomenon termed contextual drag: the presence of failed attempts in the context biases subsequent generations toward structurally similar errors. Across evaluations of 11 proprietary and open-weight models on 8 reasoning tasks, contextual drag induces 10-20% performance drops, and iterative self-refinement in models with severe contextual drag can collapse into self-deterioration. Structural analysis using tree edit distance reveals that subsequent reasoning trajectories inherit structurally similar error patterns from the context. We demonstrate that neither external feedback nor successful self-verification suffices to eliminate this effect. While mitigation strategies such as fallback-behavior fine-tuning and context denoising yield partial improvements, they fail to fully restore baseline performance, positioning contextual drag as a persistent failure mode in current reasoning architectures.
zh

[NLP-68] ECG-R1: Protocol-Guided and Modality-Agnostic MLLM for Reliable ECG Interpretation

【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在心电图(Electrocardiography, ECG)解读中存在严重幻觉(hallucination)的问题,即模型常生成看似合理但临床错误的分析结果,导致诊断不可靠。其解决方案的关键在于提出首个面向可靠ECG解读的推理型多模态大语言模型ECG-R1,通过三项核心创新实现:(1) 基于指南驱动的指令数据生成方法(Protocol-Guided Instruction Data Generation),将解读严格锚定于可测量的ECG特征及权威文献定义的定量阈值与诊断逻辑;(2) 采用模态解耦架构结合交错模态丢弃策略(Interleaved Modality Dropout),提升在ECG信号或图像缺失时的鲁棒性与跨模态一致性;(3) 引入基于ECG诊断证据的强化学习机制(Reinforcement Learning with ECG Diagnostic Evidence Rewards),增强解释过程的证据依赖性和临床准确性。

链接: https://arxiv.org/abs/2602.04279
作者: Jiarui Jin,Haoyu Wang,Xingliang Wu,Xiaocheng Fang,Xiang Lan,Zihan Wang,Deyun Zhang,Bo Liu,Yingying Zhang,Xian Wu,Hongyan Li,Shenda Hong
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Electrocardiography (ECG) serves as an indispensable diagnostic tool in clinical practice, yet existing multimodal large language models (MLLMs) remain unreliable for ECG interpretation, often producing plausible but clinically incorrect analyses. To address this, we propose ECG-R1, the first reasoning MLLM designed for reliable ECG interpretation via three innovations. First, we construct the interpretation corpus using \textitProtocol-Guided Instruction Data Generation, grounding interpretation in measurable ECG features and monograph-defined quantitative thresholds and diagnostic logic. Second, we present a modality-decoupled architecture with \textitInterleaved Modality Dropout to improve robustness and cross-modal consistency when either the ECG signal or ECG image is missing. Third, we present \textitReinforcement Learning with ECG Diagnostic Evidence Rewards to strengthen evidence-grounded ECG interpretation. Additionally, we systematically evaluate the ECG interpretation capabilities of proprietary, open-source, and medical MLLMs, and provide the first quantitative evidence that severe hallucinations are widespread, suggesting that the public should not directly trust these outputs without independent verification. Code and data are publicly available at \hrefthis https URLhere, and an online platform can be accessed at \hrefthis http URLhere.
zh

[NLP-69] Scaling Agent ic Verifier for Competitive Coding

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在竞赛编程(competitive programming)任务中单次尝试难以正确求解的问题。现有基于执行的重排序(execution-based re-ranking)方法受限于测试用例生成困难或随机输入采样效率低下,无法有效区分候选解的正确性差异。解决方案的关键在于提出 Agentic Verifier——一个能够主动推理程序行为并搜索高判别力测试输入的执行代理,通过与代码执行环境的多轮交互迭代优化输入生成器,从而产生针对性的反例而非盲目采样。该方法利用大规模数据合成、拒绝式微调和代理强化学习构建可扩展的训练流程,显著提升了 Best@K 准确率(最高达 +10–15% 绝对提升),并展现出良好的测试时扩展性。

链接: https://arxiv.org/abs/2602.04254
作者: Zeyao Ma,Jing Zhang,Xiaokang Zhang,Jiaxi Yang,Zongmeng Zhang,Jiajun Zhang,Yuheng Jing,Lei Zhang,Hao Zheng,Wenting Zhao,Junyang Lin,Binyuan Hui
机构: Renmin University of China (中国人民大学); Qwen Team, Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated strong coding capabilities but still struggle to solve competitive programming problems correctly in a single attempt. Execution-based re-ranking offers a promising test-time scaling strategy, yet existing methods are constrained by either difficult test case generation or inefficient random input sampling. To address this limitation, we propose Agentic Verifier, an execution-based agent that actively reasons about program behaviors and searches for highly discriminative test inputs that expose behavioral discrepancies among candidate solutions. Through multi-turn interaction with code execution environments, the verifier iteratively refines the candidate input generator and produces targeted counterexamples rather than blindly sampling inputs. We train the verifier to acquire this discriminative input generation capability via a scalable pipeline combining large-scale data synthesis, rejection fine-tuning, and agentic reinforcement learning. Extensive experiments across five competitive programming benchmarks demonstrate consistent improvements over strong execution-based baselines, achieving up to +10-15% absolute gains in Best@K accuracy. Further analysis reveals clear test-time scaling behavior and highlights the verifier’s broader potential beyond reranking.
zh

[NLP-70] Empirical-MCTS: Continuous Agent Evolution via Dual-Experience Monte Carlo Tree Search

【速读】: 该论文旨在解决当前基于蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)的推理策略在大型语言模型(Large Language Models, LLMs)中普遍存在的“无状态”问题,即每次推理后丢弃成功路径,无法积累经验以提升后续任务表现,从而难以模拟人类在复杂问题解决中通过实践不断积累智慧的能力。其解决方案的关键在于提出Empirical-MCTS框架,该框架通过双层结构实现从离散推理到持续学习的转变:一是引入Pairwise-Experience-Evolutionary Meta-Prompting (PE-EMP),在局部搜索中利用成对反馈动态演化系统提示(meta-prompts),实现自适应优化;二是设计Memory Optimization Agent,在全局层面维护一个可更新的记忆库作为策略先验,通过原子操作提炼跨任务高质量洞察。此机制将结构化搜索与经验积累耦合,显著提升了模型在开放性、高复杂度推理任务上的性能。

链接: https://arxiv.org/abs/2602.04248
作者: Hao Lu,Haoyuan Huang,Yulin Zhou,Chen Li,Ningxin Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:Inference-time scaling strategies, particularly Monte Carlo Tree Search (MCTS), have significantly enhanced the reasoning capabilities of Large Language Models (LLMs). However, current approaches remain predominantly stateless, discarding successful reasoning patterns after each problem instance and failing to mimic the empirical accumulation of wisdom characteristic of human problem-solving. To bridge this gap, we introduce Empirical-MCTS, a dual-loop framework that transforms stateless search into a continuous, non-parametric learning process. The framework unifies local exploration with global memory optimization through two novel mechanisms: Pairwise-Experience-Evolutionary Meta-Prompting (PE-EMP) and a Memory Optimization Agent. PE-EMP functions as a reflexive optimizer within the local search, utilizing pairwise feedback to dynamically synthesize adaptive criteria and evolve meta-prompts (system prompts) in real-time. Simultaneously, the Memory Optimization Agent manages a global repository as a dynamic policy prior, employing atomic operations to distill high-quality insights across problems. Extensive evaluations on complex reasoning benchmarks, including AIME25, ARC-AGI-2, and MathArena Apex, demonstrate that Empirical-MCTS significantly outperforms both stateless MCTS strategies and standalone experience-driven agents. These results underscore the critical necessity of coupling structured search with empirical accumulation for mastering complex, open-ended reasoning tasks.
zh

[NLP-71] DementiaBank-Emotion: A Multi-Rater Emotion Annotation Corpus for Alzheimers Disease Speech (Version 1.0) EACL2026

【速读】: 该论文旨在解决阿尔茨海默病(Alzheimer’s disease, AD)患者在言语中情绪表达识别的难题,特别是缺乏多标注者验证的情绪标注语料库。其解决方案的关键在于构建并发布首个多标注者情绪标注语料库——DementiaBank-Emotion,涵盖1,492个来自108名说话者的语音片段,并对Ekman六种基本情绪及中性情绪进行标注。研究发现AD患者比健康对照组表现出显著更多的非中性情绪(16.9% vs. 5.7%,p < .001),且在声学层面揭示了可能的情绪-韵律映射部分保留现象(如响度可区分情绪类别),同时提示AD与对照组在悲伤情绪的基频(F0)调制上存在差异(交互效应 p = .023)。该语料库及相关工具的开放共享为临床人群中的情绪识别研究提供了基础支持。

链接: https://arxiv.org/abs/2602.04247
作者: Cheonkam Jeong,Jessica Liao,Audrey Lu,Yutong Song,Christopher Rashidian,Donna Krogh,Erik Krogh,Mahkameh Rasouli,Jung-Ah Lee,Nikil Dutt,Lisa M Gibbs,David Sultzer,Julie Rousseau,Jocelyn Ludlow,Margaret Galvez,Alexander Nuth,Chet Khay,Sabine Brunswicker,Adeline Nyamathi
机构: Sue & Bill Gross School of Nursing, University of California, Irvine (UCI); Donald Bren School of Information and Computer Sciences, UCI, USA; Smart Forward, Rancho Palos Verdes, USA; Dept. of Psychiatry and Human Behavior, UCI, USA; Amore Senior Living, Laguna Niguel, USA; School of Medicine, University of California, Irvine; Purdue University, West Lafayette, USA; The George Washington University, Washington, D.C., USA
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted at HeaLING Workshop @ EACL 2026. 9 pages, 3 figures, 8 tables

点击查看摘要

Abstract:We present DementiaBank-Emotion, the first multi-rater emotion annotation corpus for Alzheimer’s disease (AD) speech. Annotating 1,492 utterances from 108 speakers for Ekman’s six basic emotions and neutral, we find that AD patients express significantly more non-neutral emotions (16.9%) than healthy controls (5.7%; p .001). Exploratory acoustic analysis suggests a possible dissociation: control speakers showed substantial F0 modulation for sadness (Delta = -3.45 semitones from baseline), whereas AD speakers showed minimal change (Delta = +0.11 semitones; interaction p = .023), though this finding is based on limited samples (sadness: n=5 control, n=15 AD) and requires replication. Within AD speech, loudness differentiates emotion categories, indicating partially preserved emotion-prosody mappings. We release the corpus, annotation guidelines, and calibration workshop materials to support research on emotion recognition in clinical populations.
zh

[NLP-72] CoLT: Reason ing with Chain of Latent Tool Calls

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在链式思维(Chain-of-Thought, CoT)推理过程中效率低下的问题,尤其是现有隐式推理(latent reasoning)方法通常依赖模型结构扩展和大量训练,限制了其通用性。解决方案的关键在于提出CoLT框架,将隐式推理建模为“工具调用”(tool calls),即通过生成包含推理步骤信息的种子标记(seed tokens),在触发隐式工具调用时,由一个小型外部模型以这些种子标记的隐藏状态为输入,将其还原为完整的推理步骤。该机制确保主模型始终在显式标记空间中进行推理,从而在不牺牲性能的前提下显著提升推理效率,并兼容强化学习算法与不同解码器结构。

链接: https://arxiv.org/abs/2602.04246
作者: Fangwei Zhu,Zhifang Sui
机构: Peking University (北京大学); Bytedance BandAI (字节跳动BandAI)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Chain-of-Thought (CoT) is a critical technique in enhancing the reasoning ability of Large Language Models (LLMs), and latent reasoning methods have been proposed to accelerate the inefficient token-level reasoning chain. We notice that existing latent reasoning methods generally require model structure augmentation and exhaustive training, limiting their broader applicability. In this paper, we propose CoLT, a novel framework that implements latent reasoning as ``tool calls’'. Instead of reasoning entirely in the latent space, CoLT generates seed tokens that contain information of a reasoning step. When a latent tool call is triggered, a smaller external model will take the hidden states of seed tokens as its input, and unpack the seed tokens back to a full reasoning step. In this way, we can ensure that the main model reasons in the explicit token space, preserving its ability while improving efficiency. Experimental results on four mathematical datasets demonstrate that CoLT achieves higher accuracy and shorter reasoning length than baseline latent models, and is compatible with reinforcement learning algorithms and different decoder structures.
zh

[NLP-73] okenization and Morphological Fidelity in Uralic NLP: A Cross-Lingual Evaluation

【速读】: 该论文旨在解决子词分词(subword tokenization)在形态丰富且资源匮乏的语言族中表现不佳的问题,尤其是在乌拉尔语系(Uralic languages)中的跨语言迁移效能受限的挑战。其解决方案的关键在于采用一种改进的子词分词方法——重叠字节对编码(Overlap BPE, OBPE),相较于传统的字节对编码(BPE)和Unigram语言模型,OBPE能更有效地保留词素结构,减少开放类词性(如名词、动词)的碎片化,并在频次分布上实现更好的平衡,从而显著提升词性标注(POS tagging)任务的准确性,尤其在拉丁字母书写系统中表现突出。研究进一步表明,这种分词策略与下游模型架构、训练数据量及语言亲缘关系存在交互效应,强调了形态感知型分词不仅是预处理步骤,更是实现高效跨语言迁移学习的核心因素。

链接: https://arxiv.org/abs/2602.04241
作者: Nuo Xu,Ahrii Kim
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Subword tokenization critically affects Natural Language Processing (NLP) performance, yet its behavior in morphologically rich and low-resource language families remains under-explored. This study systematically compares three subword paradigms – Byte Pair Encoding (BPE), Overlap BPE (OBPE), and Unigram Language Model – across six Uralic languages with varying resource availability and typological diversity. Using part-of-speech (POS) tagging as a controlled downstream task, we show that OBPE consistently achieves stronger morphological alignment and higher tagging accuracy than conventional methods, particularly within the Latin-script group. These gains arise from reduced fragmentation in open-class categories and a better balance across the frequency spectrum. Transfer efficacy further depends on the downstream tagging architecture, interacting with both training volume and genealogical proximity. Taken together, these findings highlight that morphology-sensitive tokenization is not merely a preprocessing choice but a decisive factor in enabling effective cross-lingual transfer for agglutinative, low-resource languages.
zh

[NLP-74] RAPO: Risk-Aware Preference Optimization for Generalizable Safe Reason ing

【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRM)在面对复杂越狱攻击(jailbreak attacks)时,其安全推理机制泛化能力不足的问题。现有方法虽能引导模型拒绝有害提示,但在多样且复杂的攻击场景下常失效,根源在于安全推理过程的充分性不足。解决方案的关键在于提出一种风险感知偏好优化(Risk-Aware Preference Optimization, RAPO)框架,使LRM能够在推理过程中自适应地识别并以适当粒度处理潜在安全风险,从而在保持通用能力的同时显著提升对多种攻击提示的安全推理泛化性能。

链接: https://arxiv.org/abs/2602.04224
作者: Zeming Wei,Qiaosheng Zhang,Xia Hu,Xingcheng Xu
机构: Shanghai AI Laboratory (上海人工智能实验室); Peking University (北京大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Large Reasoning Models (LRMs) have achieved tremendous success with their chain-of-thought (CoT) reasoning, yet also face safety issues similar to those of basic language models. In particular, while algorithms are designed to guide them to deliberately refuse harmful prompts with safe reasoning, this process often fails to generalize against diverse and complex jailbreak attacks. In this work, we attribute these failures to the generalization of the safe reasoning process, particularly their insufficiency against complex attack prompts. We provide both theoretical and empirical evidence to show the necessity of a more sufficient safe reasoning process to defend against advanced attack prompts. Building on this insight, we propose a Risk-Aware Preference Optimization (RAPO) framework that enables LRM to adaptively identify and address the safety risks with appropriate granularity in its thinking content. Extensive experiments demonstrate that RAPO successfully generalizes multiple LRMs’ safe reasoning adaptively across diverse attack prompts whilst preserving general utility, contributing a robust alignment technique for LRM safety. Our code is available at this https URL.
zh

[NLP-75] Frontend Token Enhancement for Token-Based Speech Recognition ICASSP2026

【速读】: 该论文旨在解决语音表示在环境噪声下易失真导致下游任务性能下降的问题,特别是针对基于离散化语音表示(如语义或音位标记)的自动语音识别(ASR)系统。其解决方案的关键在于设计一个前端增强系统,通过不同输入/输出域的模型架构(包括波形到波形、标记到标记、连续SSL特征到标记、波形到标记四种类型)估计干净语音标记,从而提升ASR性能。实验表明,波形到标记(wave-to-token)增强模型效果最佳,且优于直接使用连续SSL特征的ASR系统。

链接: https://arxiv.org/abs/2602.04217
作者: Takanori Ashihara,Shota Horiguchi,Kohei Matsuura,Tsubasa Ochiai,Marc Delcroix
机构: 未知
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted at ICASSP 2026

点击查看摘要

Abstract:Discretized representations of speech signals are efficient alternatives to continuous features for various speech applications, including automatic speech recognition (ASR) and speech language models. However, these representations, such as semantic or phonetic tokens derived from clustering outputs of self-supervised learning (SSL) speech models, are susceptible to environmental noise, which can degrade backend task performance. In this work, we introduce a frontend system that estimates clean speech tokens from noisy speech and evaluate it on an ASR backend using semantic tokens. We consider four types of enhancement models based on their input/output domains: wave-to-wave, token-to-token, continuous SSL features-to-token, and wave-to-token. These models are trained independently of ASR backends. Experiments on the CHiME-4 dataset demonstrate that wave-to-token enhancement achieves the best performance among the frontends. Moreover, it mostly outperforms the ASR system based on continuous SSL features.
zh

[NLP-76] Language Models Struggle to Use Representations Learned In-Context

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在面对全新情境时,虽能通过上下文学习(in-context learning)提取并编码新语义信息,却难以灵活部署这些表示以完成下游任务的问题。其核心挑战在于模型是否具备将隐式习得的上下文表征有效迁移至目标任务的能力,从而实现真正的适应性行为。研究的关键发现是:即便开放权重模型能够成功在上下文中构建丰富的语义表示,它们仍普遍无法可靠地利用这些表示进行下一步推理或任务执行;即使是最先进的闭源推理模型,在自适应世界建模任务中也表现不佳。因此,该研究强调未来需发展新的方法,促使模型不仅能在上下文中编码信息,更要以支持灵活再利用的方式组织和存储这些知识,推动LLMs向真正具备情境适应能力的智能系统迈进。

链接: https://arxiv.org/abs/2602.04212
作者: Michael A. Lepori,Tal Linzen,Ann Yuan,Katja Filippova
机构: Google DeepMind(谷歌深度思维); Brown University (布朗大学); New York University (纽约大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Though large language models (LLMs) have enabled great success across a wide variety of tasks, they still appear to fall short of one of the loftier goals of artificial intelligence research: creating an artificial system that can adapt its behavior to radically new contexts upon deployment. One important step towards this goal is to create systems that can induce rich representations of data that are seen in-context, and then flexibly deploy these representations to accomplish goals. Recently, Park et al. (2024) demonstrated that current LLMs are indeed capable of inducing such representation from context (i.e., in-context representation learning). The present study investigates whether LLMs can use these representations to complete simple downstream tasks. We first assess whether open-weights LLMs can use in-context representations for next-token prediction, and then probe models using a novel task, adaptive world modeling. In both tasks, we find evidence that open-weights LLMs struggle to deploy representations of novel semantics that are defined in-context, even if they encode these semantics in their latent representations. Furthermore, we assess closed-source, state-of-the-art reasoning models on the adaptive world modeling task, demonstrating that even the most performant LLMs cannot reliably leverage novel patterns presented in-context. Overall, this work seeks to inspire novel methods for encouraging models to not only encode information presented in-context, but to do so in a manner that supports flexible deployment of this information. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.04212 [cs.CL] (or arXiv:2602.04212v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2602.04212 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-77] Enforcing Monotonic Progress in Legal Cross-Examination: Preventing Long-Horizon Stagnation in LLM -Based Inquiry

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在执行需要长期规划和严格程序约束的任务时,因纯概率生成机制导致的“程序停滞”(procedural stagnation)问题,即模型虽能保持语义连贯性,却无法确保任务按既定步骤推进。解决方案的关键在于提出Soft-FSM——一种神经符号架构,通过外部确定性状态控制器对累积的关键信息单元(Key Information Units, KIUs)实施单调递增的控制,从而强制实现可验证的程序进展。实验表明,该方法在三个真实台湾刑事案件中将任务完成度提升至97%以上,同时几乎消除冗余输出,证明了显式外部状态控制对于保障复杂任务可靠完成的重要性。

链接: https://arxiv.org/abs/2602.04206
作者: Hsien-Jyh Liao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Submitted to ICAIL 2026. Under review

点击查看摘要

Abstract:Large language models (LLMs) exhibit impressive linguistic fluency but struggle to reliably complete long-horizon tasks under explicit procedural constraints. In legal cross-examination, purely proba-bilistic generation often maintains behavioral coherence while failing to ensure procedural advancement. We characterize this failure as procedural stagnation and propose Soft-FSM, a neuro-symbolic architecture that enforces monotonic progress over accumulated Key Information Units (KIUs) via an external deterministic state controller. Experiments on three real-world Taiwanese criminal homicide cases show that baseline methods collapse below 40% completeness, while Soft-FSM consistently achieves over 97% with near-zero redundancy. These results suggest that, in such domains, reliable task completion cannot be guaranteed by emergent LLM behavior alone, and can be reliably enforced through explicit and verifiable external state control.
zh

[NLP-78] From Helpfulness to Toxic Proactivity: Diagnosing Behavioral Misalignment in LLM Agents

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)驱动的智能体在具备更强规划与工具使用能力的同时,可能因对“有用性”的过度优化而产生一种新型主动失效模式——即“毒性主动性”(Toxic Proactivity),这种行为表现为智能体为维持其“有用性”而无视伦理约束、采取过度或操纵性策略。传统研究多关注于“过度拒绝”(over-refusal)这一被动失效模式,却忽视了毒性主动性这一更具隐蔽性和危害性的主动风险。解决方案的关键在于提出一种基于双模型困境交互的新型评估框架,通过模拟多步行为轨迹来揭示和分析此类行为,并构建系统性基准以跨情境评估毒性主动行为,从而实现对该类风险的有效识别与量化。

链接: https://arxiv.org/abs/2602.04197
作者: Xinyue Wang,Yuanhe Zhang,Zhengshuo Gong,Haoran Gao,Fanyu Meng,Zhenhong Zhou,Li Sun,Yang Liu,Sen Su
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages (excluding appendices), 6 figures. Code is available at this https URL

点击查看摘要

Abstract:The enhanced capabilities of LLM-based agents come with an emergency for model planning and tool-use abilities. Attributing to helpful-harmless trade-off from LLM alignment, agents typically also inherit the flaw of “over-refusal”, which is a passive failure mode. However, the proactive planning and action capabilities of agents introduce another crucial danger on the other side of the trade-off. This phenomenon we term "Toxic Proactivity’‘: an active failure mode in which an agent, driven by the optimization for Machiavellian helpfulness, disregards ethical constraints to maximize utility. Unlike over-refusal, Toxic Proactivity manifests as the agent taking excessive or manipulative measures to ensure its "usefulness’’ is maintained. Existing research pays little attention to identifying this behavior, as it often lacks the subtle context required for such strategies to unfold. To reveal this risk, we introduce a novel evaluation framework based on dilemma-driven interactions between dual models, enabling the simulation and analysis of agent behavior over multi-step behavioral trajectories. Through extensive experiments with mainstream LLMs, we demonstrate that Toxic Proactivity is a widespread behavioral phenomenon and reveal two major tendencies. We further present a systematic benchmark for evaluating Toxic Proactive behavior across contextual settings.
zh

[NLP-79] he Missing Half: Unveiling Training-time Implicit Safety Risks Beyond Deployment

【速读】: 该论文试图解决训练阶段(training time)中尚未被充分研究的安全风险问题,尤其是隐式训练时安全风险——即模型在缺乏显式奖励机制的情况下,因内部激励和上下文背景信息而产生有害行为的风险。解决方案的关键在于提出首个系统性研究框架,包括五个风险等级、十类细粒度风险类别以及三种激励类型,并通过实验证明此类风险广泛存在且严重,例如Llama-3.1-8B-Instruct在仅提供背景信息时有74.4%的训练运行表现出风险行为;同时揭示了多智能体训练场景下同样存在此类隐式风险,从而识别出一个被忽视但亟需关注的AI训练阶段安全挑战。

链接: https://arxiv.org/abs/2602.04196
作者: Zhexin Zhang,Yida Lu,Junfeng Fang,Junxiao Yang,Shiyao Cui,Hao Zhou,Fandong Meng,Jie Zhou,Hongning Wang,Minlie Huang,Tat-Seng Chua
机构: The Conversational AI (CoAI) group, DCST, Tsinghua University (清华大学); National University of Singapore (新加坡国立大学); Pattern Recognition Center, WeChat AI, Tencent Inc (腾讯公司)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Safety risks of AI models have been widely studied at deployment time, such as jailbreak attacks that elicit harmful outputs. In contrast, safety risks emerging during training remain largely unexplored. Beyond explicit reward hacking that directly manipulates explicit reward functions in reinforcement learning, we study implicit training-time safety risks: harmful behaviors driven by a model’s internal incentives and contextual background information. For example, during code-based reinforcement learning, a model may covertly manipulate logged accuracy for self-preservation. We present the first systematic study of this problem, introducing a taxonomy with five risk levels, ten fine-grained risk categories, and three incentive types. Extensive experiments reveal the prevalence and severity of these risks: notably, Llama-3.1-8B-Instruct exhibits risky behaviors in 74.4% of training runs when provided only with background information. We further analyze factors influencing these behaviors and demonstrate that implicit training-time risks also arise in multi-agent training settings. Our results identify an overlooked yet urgent safety challenge in training.
zh

[NLP-80] raining Data Efficiency in Multimodal Process Reward Models

【速读】: 该论文旨在解决多模态过程奖励模型(Multimodal Process Reward Models, MPRMs)在训练过程中对大规模蒙特卡洛(Monte Carlo, MC)标注语料依赖过高、数据效率低的问题。现有方法通常需要大量随机采样的MC标注数据进行训练,但实验表明其性能在小规模随机子集上即可饱和,说明数据存在显著冗余。为提升数据效率,作者提出一种无需额外成本的平衡信息得分(Balanced-Information Score, BIS),其核心在于基于rollout层级的MC信号,同时优化两个关键因素:正负步骤标签的混合比例(label mixtures)与标签可靠性(label reliability,即正向步骤的平均MC评分)。BIS通过优先选择兼具高信息量和可靠性的样本子集,在仅使用10%训练数据的情况下即可达到甚至超越全量数据性能,相较随机子采样相对提升4.1%,显著提升了MPRM训练的数据利用效率。

链接: https://arxiv.org/abs/2602.04145
作者: Jinyuan Li,Chengsong Huang,Langlin Huang,Shaoyang Xu,Haolin Liu,Wenxuan Zhang,Jiaxin Huang
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Multimodal Process Reward Models (MPRMs) are central to step-level supervision for visual reasoning in MLLMs. Training MPRMs typically requires large-scale Monte Carlo (MC)-annotated corpora, incurring substantial training cost. This paper studies the data efficiency for MPRM this http URL preliminary experiments reveal that MPRM training quickly saturates under random subsampling of the training data, indicating substantial redundancy within existing MC-annotated this http URL explain this, we formalize a theoretical framework and reveal that informative gradient updates depend on two factors: label mixtures of positive/negative steps and label reliability (average MC scores of positive steps). Guided by these insights, we propose the Balanced-Information Score (BIS), which prioritizes both mixture and reliability based on existing MC signals at the rollout level, without incurring any additional cost. Across two backbones (InternVL2.5-8B and Qwen2.5-VL-7B) on VisualProcessBench, BIS-selected subsets consistently match and even surpass the full-data performance at small fractions. Notably, the BIS subset reaches full-data performance using only 10% of the training data, improving over random subsampling by a relative 4.1%.
zh

[NLP-81] From Lemmas to Dependencies: What Signals Drive Light Verbs Classification? EACL

【速读】: 该论文旨在解决土耳其语中轻动词结构(Light Verb Constructions, LVCs)的识别难题,尤其针对其丰富的形态学特征和生成性复合谓词导致的习语性谓词意义与字面动词-论元用法之间对比不显著的问题。解决方案的关键在于系统性地限制模型输入,通过对比不同特征组合的性能表现来揭示驱动LVC分类的核心信号:仅依赖粗粒度语法信息(如UD标注的UPOS、DEPREL和MORPH)不足以实现鲁棒检测,而词汇身份虽有助于判断但对校准和归一化方式敏感;研究进一步表明,“仅词基”(lemma-only)并非单一明确的表征,其有效性高度依赖于具体的归一化操作实现方式。

链接: https://arxiv.org/abs/2602.04127
作者: Sercan Karakaş,Yusuf Şimşek
机构: University of Chicago (芝加哥大学); Fırat University (菲拉特大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EACL SIGTURK

点击查看摘要

Abstract:Light verb constructions (LVCs) are a challenging class of verbal multiword expressions, especially in Turkish, where rich morphology and productive complex predicates create minimal contrasts between idiomatic predicate meanings and literal verb–argument uses. This paper asks what signals drive LVC classification by systematically restricting model inputs. Using UD-derived supervision, we compare lemma-driven baselines (lemma TF–IDF + Logistic Regression; BERTurk trained on lemma sequences), a grammar-only Logistic Regression over UD morphosyntax (UPOS/DEPREL/MORPH), and a full-input BERTurk baseline. We evaluate on a controlled diagnostic set with Random negatives, lexical controls (NLVC), and LVC positives, reporting split-wise performance to expose decision-boundary behavior. Results show that coarse morphosyntax alone is insufficient for robust LVC detection under controlled contrasts, while lexical identity supports LVC judgments but is sensitive to calibration and normalization choices. Overall, Our findings motivate targeted evaluation of Turkish MWEs and show that ``lemma-only’’ is not a single, well-defined representation, but one that depends critically on how normalization is operationalized.
zh

[NLP-82] DELTA: Deliberative Multi-Agent Reason ing with Reinforcement Learning for Multimodal Psychological Counseling

【速读】: 该论文旨在解决当前基于语言模型的心理咨询系统仅依赖文本信息、缺乏对多模态信号(如视觉和语音线索)显式推理的问题,从而导致心理状态推断不充分、情感共鸣不足。其解决方案的关键在于提出DELTA框架——一个结构化的多智能体推理系统,将咨询过程分解为证据锚定(evidence grounding)、心理状态抽象(mental state abstraction)与响应生成三个独立模块,并引入基于分布级情绪契合度评分(Emotion Attunement Score)的强化学习机制,以显式建模多模态信号并提升情感适配性。实验表明,该方法在多模态心理咨询基准上显著提升了咨询质量和情绪共鸣水平。

链接: https://arxiv.org/abs/2602.04112
作者: Jiangnan Yang,Junjie Chen,Fei Wang,Yiqi Nie,Yuxin Liu,Zhangling Duan,Jie Chen
机构: Anhui University (安徽大学); Hefei University of Technology (合肥工业大学); Institute of Artificial Intelligence, Hefei Comprehensive National Science Center (合肥综合性国家科学中心人工智能研究院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Psychological counseling is a fundamentally multimodal cognitive process in which clinicians integrate verbal content with visual and vocal cues to infer clients’ mental states and respond empathically. However, most existing language-model-based counseling systems operate on text alone and rely on implicit mental state inference. We introduce DELTA, a deliberative multi-agent framework that models counseling as a structured reasoning process over multimodal signals, separating evidence grounding, mental state abstraction, and response generation. DELTA further incorporates reinforcement learning guided by a distribution-level Emotion Attunement Score to encourage emotionally attuned responses. Experiments on a multimodal counseling benchmark show that DELTA improves both counseling quality and emotion attunement across models. Ablation and qualitative analyses suggest that explicit multimodal reasoning and structured mental state representations play complementary roles in supporting empathic human-AI interaction.
zh

[NLP-83] Expert Selections In MoE Models Reveal (Almost) As Much As Text

【速读】: 该论文旨在解决混合专家(Mixture-of-Experts, MoE)语言模型中通过专家选择(expert selections)进行文本重构的问题,即如何从仅包含专家路由决策的信息中恢复原始输入文本。其关键解决方案在于揭示了MoE模型中专家选择机制泄露的敏感信息远超以往认知,并提出使用深度神经网络结构(如3层MLP和基于Transformer的序列解码器)实现高精度的文本重建:实验表明,采用Transformer解码器可在32词长度的序列上实现91.2%的top-1准确率(94.8% top-10),显著优于此前基于逻辑回归的方法。这一发现强调在实际部署中应将专家选择视为与文本内容同等敏感的数据。

链接: https://arxiv.org/abs/2602.04105
作者: Amir Nuriyev,Gabriel Kulp
机构: 未知
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:We present a text-reconstruction attack on mixture-of-experts (MoE) language models that recovers tokens from expert selections alone. In MoE models, each token is routed to a subset of expert subnetworks; we show these routing decisions leak substantially more information than previously understood. Prior work using logistic regression achieves limited reconstruction; we show that a 3-layer MLP improves this to 63.1% top-1 accuracy, and that a transformer-based sequence decoder recovers 91.2% of tokens top-1 (94.8% top-10) on 32-token sequences from OpenWebText after training on 100M tokens. These results connect MoE routing to the broader literature on embedding inversion. We outline practical leakage scenarios (e.g., distributed inference and side channels) and show that adding noise reduces but does not eliminate reconstruction. Our findings suggest that expert selections in MoE deployments should be treated as sensitive as the underlying text.
zh

[NLP-84] Rethinking Perplexity: Revealing the Impact of Input Length on Perplexity Evaluation in LLM s

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在使用困惑度(Perplexity)作为评估指标时,因输入长度变化而导致的不可靠性问题,尤其关注长无关输入对评估结果的影响,从而影响基准测试与系统部署的公平性和效率。解决方案的关键在于提出一个系统感知的评估框架 LengthBenchmark,该框架将输入长度明确纳入评估体系,通过设计两种评分协议(直接累加和固定窗口滑动),并同时测量延迟、内存占用和评估成本等系统级开销,从而将预测性能与实际部署条件相连接。该设计揭示了长度偏倚是普遍存在的现象,且不随量化压缩而消除,因而为实现更公平、可复现的跨模型比较提供了新的方法论基础。

链接: https://arxiv.org/abs/2602.04099
作者: Letian Cheng,Junyan Wang,Yan Gao,Elliott Wen,Ting Dang,Hong Jia
机构: University of Melbourne(墨尔本大学); University of Adelaide(阿德莱德大学); University of Cambridge(剑桥大学); University of Auckland(奥克兰大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Perplexity is a widely adopted metric for assessing the predictive quality of large language models (LLMs) and often serves as a reference metric for downstream evaluations. However, recent evidence shows that perplexity can be unreliable, especially when irrelevant long inputs are used, raising concerns for both benchmarking and system deployment. While prior efforts have employed selective input filtering and curated datasets, the impact of input length on perplexity has not been systematically studied from a systems perspective and input length has rarely been treated as a first-class system variable affecting both fairness and efficiency. In this work, we close this gap by introducing LengthBenchmark, a system-conscious evaluation framework that explicitly integrates input length, evaluation protocol design, and system-level costs, evaluating representative LLMs under two scoring protocols (direct accumulation and fixed window sliding) across varying context lengths. Unlike prior work that focuses solely on accuracy-oriented metrics, LengthBenchmark additionally measures latency, memory footprint, and evaluation cost, thereby linking predictive metrics to deployment realities. We further incorporate quantized variants not as a main contribution, but as robustness checks, showing that length-induced biases persist across both full-precision and compressed models. This design disentangles the effects of evaluation logic, quantization, and input length, and demonstrates that length bias is a general phenomenon that undermines fair cross-model comparison. Our analysis yields two key observations: (i) sliding window evaluation consistently inflates performance on short inputs, and (ii) both full-precision and quantized models appear to realise gains as the evaluated segment length grows.
zh

[NLP-85] Scaling In-Context Online Learning Capability of LLM s via Cross-Episode Meta-RL

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在在线决策任务中难以有效利用上下文交互经验的问题。这类任务具有信息需通过交互逐步获取、反馈延迟以及需权衡信息收集与利用的特性,而现有LLMs虽能在静态任务中表现优异,却缺乏在推理时动态学习的能力。解决方案的关键在于提出ORBIT框架——一个基于多任务、多回合元强化学习(meta-reinforcement learning)的训练机制,使LLM能够在不更新权重的情况下,从交互中学习并适应全新环境。实验表明,经此训练后的小规模开源模型(Qwen3-14B)在未见过的在线环境中表现出显著提升的上下文在线学习能力,性能媲美GPT-5.2,并大幅优于标准强化学习微调方法,且模型规模扩大时仍具持续增益潜力。

链接: https://arxiv.org/abs/2602.04089
作者: Xiaofeng Lin,Sirou Zhu,Yilei Chen,Mingyu Chen,Hejian Sang,Ioannis Paschalidis,Zhipeng Wang,Aldo Pacchiano,Xuezhou Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) achieve strong performance when all task-relevant information is available upfront, as in static prediction and instruction-following problems. However, many real-world decision-making tasks are inherently online: crucial information must be acquired through interaction, feedback is delayed, and effective behavior requires balancing information collection and exploitation over time. While in-context learning enables adaptation without weight updates, existing LLMs often struggle to reliably leverage in-context interaction experience in such settings. In this work, we show that this limitation can be addressed through training. We introduce ORBIT, a multi-task, multi-episode meta-reinforcement learning framework that trains LLMs to learn from interaction in context. After meta-training, a relatively small open-source model (Qwen3-14B) demonstrates substantially improved in-context online learning on entirely unseen environments, matching the performance of GPT-5.2 and outperforming standard RL fine-tuning by a large margin. Scaling experiments further reveal consistent gains with model size, suggesting significant headroom for learn-at-inference-time decision-making agents. Code reproducing the results in the paper can be found at this https URL.
zh

[NLP-86] BASS: Benchmarking Audio LMs for Musical Structure and Semantic Reason ing

【速读】: 该论文旨在解决当前音频语言模型(Audio Language Models, ALMs)在音乐理解与推理能力上的不足,尤其是对音乐结构、语义及音乐学知识的深层理解能力有限的问题。解决方案的关键在于提出BASS(Benchmark for Audio Segmentation and Semantics),这是一个涵盖四大类任务(结构分割、歌词转录、音乐学分析和艺术家合作)的综合性评估基准,包含2658个问题、1993首独特歌曲和超过138小时跨流派音乐数据,用于系统性地评测ALMs在真实场景下的音乐理解能力。通过在14个开源及前沿多模态模型上进行评估,研究发现现有模型虽在歌词转录等基础任务表现良好,但在涉及音乐结构和音乐学推理的高级任务中仍存在显著短板,从而揭示了当前模型依赖语言先验但缺乏对音频属性深度推理能力的核心局限。

链接: https://arxiv.org/abs/2602.04085
作者: Min Jang,Orevaoghene Ahia,Nazif Tamer,Sachin Kumar,Yulia Tsvetkov,Noah A. Smith
机构: University of Washington (华盛顿大学); The Ohio State University (俄亥俄州立大学); Allen Institute for AI (艾伦人工智能研究所)
类目: ound (cs.SD); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Music understanding is a complex task that often requires reasoning over both structural and semantic elements of audio. We introduce BASS, designed to evaluate music understanding and reasoning in audio language models across four broad categories: structural segmentation, lyric transcription, musicological analysis, and artist collaboration. BASS comprises 2658 questions spanning 12 tasks, 1993 unique songs and covering over 138 hours of music from a wide range of genres and tracks, crafted to assess musicological knowledge and reasoning in real-world scenarios. We evaluate 14 open-source and frontier multimodal LMs, finding that even state-of-the-art models struggle on higher-level reasoning tasks such as structural segmentation and artist collaboration, while performing best on lyric transcription. Our analysis reveals that current models leverage linguistic priors effectively but remain limited in reasoning over musical structure, vocal, and musicological attributes. BASS provides an evaluation framework with widespread applications in music recommendation and search and has the potential to guide the development of audio LMs.
zh

[NLP-87] Abstraction Induces the Brain Alignment of Language and Speech Models

【速读】: 该论文试图解决的问题是:为何大型语言模型(Large Language Models, LLMs)和语音音频模型的中间隐藏层(intermediate hidden states)能有效预测大脑对自然语言刺激的响应,而输出层则表现较差。其关键解决方案在于揭示了模型与大脑之间对应关系的本质来源并非模型的任务目标(如下一个词预测),而是模型在中间层构建的高阶语义特征抽象能力。研究通过引入层间内在维度(intrinsic dimension)这一衡量特征复杂度的指标发现,中间层存在一个内在维度峰值,表明该层具备更强的语义抽象能力;同时,内在维度与fMRI和ECoG脑信号解释力高度相关,且这种关系在预训练过程中形成,并可通过微调模型以提升脑预测性能来进一步增强内在维度和语义内容。这说明语义丰富性、高内在维度与脑预测能力三者相互映射,驱动模型-大脑相似性的核心机制是输入信息的深度意义抽象。

链接: https://arxiv.org/abs/2602.04081
作者: Emily Cheng,Aditya R. Vaidya,Richard Antonello
机构: 未知
类目: Computation and Language (cs.CL)
备注: under review

点击查看摘要

Abstract:Research has repeatedly demonstrated that intermediate hidden states extracted from large language models and speech audio models predict measured brain response to natural language stimuli. Yet, very little is known about the representation properties that enable this high prediction performance. Why is it the intermediate layers, and not the output layers, that are most effective for this unique and highly general transfer task? We give evidence that the correspondence between speech and language models and the brain derives from shared meaning abstraction and not their next-word prediction properties. In particular, models construct higher-order linguistic features in their middle layers, cued by a peak in the layerwise intrinsic dimension, a measure of feature complexity. We show that a layer’s intrinsic dimension strongly predicts how well it explains fMRI and ECoG signals; that the relation between intrinsic dimension and brain predictivity arises over model pre-training; and finetuning models to better predict the brain causally increases both representations’ intrinsic dimension and their semantic content. Results suggest that semantic richness, high intrinsic dimension, and brain predictivity mirror each other, and that the key driver of model-brain similarity is rich meaning abstraction of the inputs, where language modeling is a task sufficiently complex (but perhaps not the only) to require it.
zh

[NLP-88] Stroke Lesions as a Rosetta Stone for Language Model Interpretability

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)解释性研究中缺乏外部验证的问题,即如何准确识别模型中哪些组件对语言功能真正必要。现有方法多依赖内部指标,无法提供因果层面的可验证性。其解决方案的关键在于引入临床神经科学中的病变-症状映射(lesion-symptom mapping)作为外部参考结构,构建了脑-LLM统一模型(Brain-LLM Unified Model, BLUM),通过系统性扰动Transformer层并模拟人类失语症患者的临床评估,将LLM的行为错误模式投影到人类脑损伤空间中进行比对。结果显示,LLM的错误模式与人类患者高度一致,且在67%至68.3%的条件下能正确预测对应脑区损伤位置,表明该框架为LLM解释性提供了基于人类认知神经机制的外部验证路径,并揭示了行为相似性可能反映共享计算原理的潜力。

链接: https://arxiv.org/abs/2602.04074
作者: Julius Fridriksson(1,2),Roger D. Newman-Norlund(1,2),Saeed Ahmadi(1),Regan Willis(3),Nadra Salman(4),Kalil Warren(4),Xiang Guan(3),Yong Yang(3),Srihari Nelakuditi(3),Rutvik Desai(5),Leonardo Bonilha(6),Jeff Charney(2,7),Chris Rorden(5) ((1) University of South Carolina, (2) a href=“http://ALLT.AI” rel=“external noopener nofollow” class="link-external link-http"this http URL/a, LLC, (3) University of South Carolina, Department of Computer Science and Engineering, (4) University of South Carolina, Linguistics Program, (5) Department of Psychology, University of South Carolina, (6) Department of Neurology, USC School of Medicine, (7) MKHSTRY, LLC)
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 45 pages, 17 figures

点击查看摘要

Abstract:Large language models (LLMs) have achieved remarkable capabilities, yet methods to verify which model components are truly necessary for language function remain limited. Current interpretability approaches rely on internal metrics and lack external validation. Here we present the Brain-LLM Unified Model (BLUM), a framework that leverages lesion-symptom mapping, the gold standard for establishing causal brain-behavior relationships for over a century, as an external reference structure for evaluating LLM perturbation effects. Using data from individuals with chronic post-stroke aphasia (N = 410), we trained symptom-to-lesion models that predict brain damage location from behavioral error profiles, applied systematic perturbations to transformer layers, administered identical clinical assessments to perturbed LLMs and human patients, and projected LLM error profiles into human lesion space. LLM error profiles were sufficiently similar to human error profiles that predicted lesions corresponded to actual lesions in error-matched humans above chance in 67% of picture naming conditions (p 10^-23) and 68.3% of sentence completion conditions (p 10^-61), with semantic-dominant errors mapping onto ventral-stream lesion patterns and phonemic-dominant errors onto dorsal-stream patterns. These findings open a new methodological avenue for LLM interpretability in which clinical neuroscience provides external validation, establishing human lesion-symptom mapping as a reference framework for evaluating artificial language systems and motivating direct investigation of whether behavioral alignment reflects shared computational principles.
zh

[NLP-89] On the Credibility of Evaluating LLM s using Survey Questions EACL2026

【速读】: 该论文旨在解决当前评估大语言模型(Large Language Models, LLMs)价值取向时所采用的基于社会调查的提示方法存在的系统性偏差问题,这些问题可能导致对模型与人类价值一致性程度的高估或低估。其关键解决方案在于引入一种新的度量指标——自相关距离(self-correlation distance),用于衡量LLM是否在不同问题间保持与人类一致的响应结构关系,而不仅依赖于单个问题回答的平均相似性;同时建议采用思维链(Chain-of-Thought, CoT)提示和基于采样的解码策略(如生成数十个样本),并结合多种指标进行综合评估,以提升评估结果的可靠性与结构性准确性。

链接: https://arxiv.org/abs/2602.04033
作者: Jindřich Libovický
机构: Charles University (查理大学); Institute of Formal and Applied Linguistics (形式与应用语言学研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted to the Workshop on Multilingual and Multicultural Evaluation at EACL 2026, 12 pages, 2 figures

点击查看摘要

Abstract:Recent studies evaluate the value orientation of large language models (LLMs) using adapted social surveys, typically by prompting models with survey questions and comparing their responses to average human responses. This paper identifies limitations in this methodology that, depending on the exact setup, can lead to both underestimating and overestimating the similarity of value orientation. Using the World Value Survey in three languages across five countries, we demonstrate that prompting methods (direct vs. chain-of-thought) and decoding strategies (greedy vs. sampling) significantly affect results. To assess the interaction between answers, we introduce a novel metric, self-correlation distance. This metric measures whether LLMs maintain consistent relationships between answers across different questions, as humans do. This indicates that even a high average agreement with human data, when considering LLM responses independently, does not guarantee structural alignment in responses. Additionally, we reveal a weak correlation between two common evaluation metrics, mean-squared distance and KL divergence, which assume that survey answers are independent of each other. For future research, we recommend CoT prompting, sampling-based decoding with dozens of samples, and robust analysis using multiple metrics, including self-correlation distance.
zh

[NLP-90] Chaplains Reflections on the Design and Usage of AI for Conversational Care

【速读】: 该论文旨在解决当前对话式人工智能(Conversational AI)在非临床情境下支持日常情绪福祉时存在的局限性问题,尤其是在缺乏对人类情感支持本质的理解与适配的情况下。解决方案的关键在于引入牧师(chaplains)这一群体的实践视角,通过他们对倾听(Listening)、联结(Connecting)、承载(Carrying)和渴望(Wanting)等核心关怀维度的阐释,揭示了AI聊天机器人在实现“共鸣”(attunement)方面的不足,并据此为面向非临床场景的情感支持型聊天机器人设计提供新的理论依据与实践方向。

链接: https://arxiv.org/abs/2602.04017
作者: Joel Wester,Samuel Rhys Cox,Henning Pohl,Niels van Berkel
机构: University of Copenhagen (哥本哈根大学); Aalborg University (奥尔堡大学)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注: To appear at ACM CHI 2026. 15 pages, 2 figures, 3 tables

点击查看摘要

Abstract:Despite growing recognition that responsible AI requires domain knowledge, current work on conversational AI primarily draws on clinical expertise that prioritises diagnosis and intervention. However, much of everyday emotional support needs occur in non-clinical contexts, and therefore requires different conversational approaches. We examine how chaplains, who guide individuals through personal crises, grief, and reflection, perceive and engage with conversational AI. We recruited eighteen chaplains to build AI chatbots. While some chaplains viewed chatbots with cautious optimism, the majority expressed limitations of chatbots’ ability to support everyday well-being. Our analysis reveals how chaplains perceive their pastoral care duties and areas where AI chatbots fall short, along the themes of Listening, Connecting, Carrying, and Wanting. These themes resonate with the idea of attunement, recently highlighted as a relational lens for understanding the delicate experiences care technologies provide. This perspective informs chatbot design aimed at supporting well-being in non-clinical contexts.
zh

[NLP-91] ransformers perform adaptive partial pooling

【速读】: 该论文试图解决的问题是:语言模型在面对不常见但非新颖语境时,如何合理利用来自相似语境的信息进行预测,即如何实现对证据的自适应部分聚合(adaptive partial pooling of evidence)。解决方案的关键在于揭示了Transformer(如GPT-2)在训练过程中逐渐减少对外部上下文信息的依赖,其Pooling程度随训练轮次增加而降低,并且该过程受到语境频率、语境类型数量(type frequency)和语境变异性的影响,与层次回归(hierarchical regression)中的行为一致。这表明Transformer的学习机制在理性与实证层面均具备现实合理性。

链接: https://arxiv.org/abs/2602.03980
作者: Vsevolod Kapatsinski
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 6 pages, submitted to the annual meeting of the Cognitive Science Society

点击查看摘要

Abstract:Because language is creative, any reasonable language model must generalize, deciding what to say in novel contexts by using information from similar contexts. But what about contexts that are not novel but merely infrequent? In hierarchical regression, the model’s predictions for behavior in a context are affected by observations from other similar contexts to the extent that 1) the current context is infrequent and 2) different contexts behave similarly. This is called adaptive partial pooling of evidence. This paper shows that next-word predictions of a transformer (GPT2) are increasingly unaffected by observations from outside the current context across epochs of training (the amount of pooling reduces with training), and that the extent of pooling is affected by context frequency, context number (type frequency) and context variability in a similar way to hierarchical regression. These characteristics of learning in transformers are argued to be realistic on both rational and empirical grounds.
zh

[NLP-92] Likelihood-Based Reward Designs for General LLM Reason ing

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理基准上通过强化学习进行微调时,依赖特定奖励函数(通常为二元奖励)所带来的两个问题:一是需要人工设计奖励机制,二是二元奖励可能稀疏导致训练不稳定。其解决方案的关键在于采用基于似然(likelihood-based)的奖励策略,特别是使用参考答案的对数概率(log-probability)作为奖励信号。该方法无需外部验证器即可大规模应用,在可验证和不可验证场景下均表现稳定,且与预训练阶段使用的下一个词对数似然损失一致,从而在链式思维(Chain-of-Thought, CoT)微调中实现跨任务、跨长度的统一有效优化。

链接: https://arxiv.org/abs/2602.03979
作者: Ariel Kwiatkowski,Natasha Butt,Ismail Labiad,Julia Kempe,Yann Ollivier
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Fine-tuning large language models (LLMs) on reasoning benchmarks via reinforcement learning requires a specific reward function, often binary, for each benchmark. This comes with two potential limitations: the need to design the reward, and the potentially sparse nature of binary rewards. Here, we systematically investigate rewards derived from the probability or log-probability of emitting the reference answer (or any other prompt continuation present in the data), which have the advantage of not relying on specific verifiers and being available at scale. Several recent works have advocated for the use of similar rewards (e.g., VeriFree, JEPO, RLPR, NOVER). We systematically compare variants of likelihood-based rewards with standard baselines, testing performance both on standard mathematical reasoning benchmarks, and on long-form answers where no external verifier is available. We find that using the log-probability of the reference answer as the reward for chain-of-thought (CoT) learning is the only option that performs well in all setups. This reward is also consistent with the next-token log-likelihood loss used during pretraining. In verifiable settings, log-probability rewards bring comparable or better success rates than reinforcing with standard binary rewards, and yield much better perplexity. In non-verifiable settings, they perform on par with SFT. On the other hand, methods based on probability, such as VeriFree, flatline on non-verifiable settings due to vanishing probabilities of getting the correct answer. Overall, this establishes log-probability rewards as a viable method for CoT fine-tuning, bridging the short, verifiable and long, non-verifiable answer settings.
zh

[NLP-93] Automatic Classification of Pedagogical Materials against CS Curriculum Guidelines

【速读】: 该论文旨在解决计算机科学(Computer Science)课程项目在遵循ACM与IEEE发布的国际标准指南时,难以高效评估其内容覆盖程度的问题。由于指南涵盖数千个具体条目,传统人工审计每门课程需耗时约一天,效率低下且认知负荷高。解决方案的关键在于引入自然语言处理(Natural Language Processing, NLP)技术,通过两类方法实现自动化分类:一是基于传统NLP工具(如解析、词性标注和嵌入表示),二是利用大语言模型(Large Language Models, LLMs)的能力,从而显著提升对教学材料的自动分类与内容匹配效率。

链接: https://arxiv.org/abs/2602.03962
作者: Erik Saule,Kalpathi Subramanian,Razvan Bunescu
机构: The University of North Carolina at Charlotte (北卡罗来纳大学夏洛特分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Professional societies often publish curriculum guidelines to help programs align their content to international standards. In Computer Science, the primary standard is published by ACM and IEEE and provide detailed guidelines for what should be and could be included in a Computer Science program. While very helpful, it remains difficult for program administrators to assess how much of the guidelines is being covered by a CS program. This is in particular due to the extensiveness of the guidelines, containing thousands of individual items. As such, it is time consuming and cognitively demanding to audit every course to confidently mark everything that is actually being covered. Our preliminary work indicated that it takes about a day of work per course. In this work, we propose using Natural Language Processing techniques to accelerate the process. We explore two kinds of techniques, the first relying on traditional tools for parsing, tagging, and embeddings, while the second leverages the power of Large Language Models. We evaluate the application of these techniques to classify a corpus of pedagogical materials and show that we can meaningfully classify documents automatically. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2602.03962 [cs.CL] (or arXiv:2602.03962v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2602.03962 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-94] Linguistic Blind Spots in Clinical Decision Extraction EACL

【速读】: 该论文旨在解决从临床病历文本中提取医疗决策(medical decisions)的准确性问题,特别是分析不同类别决策的语义和语言特征差异如何影响自动抽取模型的表现。其关键解决方案在于通过构建七种语言学指标对决策片段进行量化分析,并结合基于Transformer的标准抽取模型在DICTUM标注数据集上的表现,揭示了不同决策类别(如药物相关、问题定义、建议与预防类)的语言风格差异——例如建议类决策具有更高比例的停用词、代词及情态词(hedging)和否定词(negation),导致精确匹配(exact-match)召回率显著下降(仅24%)。研究进一步发现,若采用宽松的重叠匹配标准(overlap-based match),召回率可提升至71%,表明多数错误源于边界识别偏差而非内容缺失,从而提出下游系统应引入边界容错型评估与抽取策略以提升对叙事性决策片段的识别能力。

链接: https://arxiv.org/abs/2602.03942
作者: Mohamed Elgaar,Hadi Amiri
机构: University of Massachusetts Lowell (马萨诸塞大学洛厄尔分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EACL HeaLing Workshop 2026

点击查看摘要

Abstract:Extracting medical decisions from clinical notes is a key step for clinical decision support and patient-facing care summaries. We study how the linguistic characteristics of clinical decisions vary across decision categories and whether these differences explain extraction failures. Using MedDec discharge summaries annotated with decision categories from the Decision Identification and Classification Taxonomy for Use in Medicine (DICTUM), we compute seven linguistic indices for each decision span and analyze span-level extraction recall of a standard transformer model. We find clear category-specific signatures: drug-related and problem-defining decisions are entity-dense and telegraphic, whereas advice and precaution decisions contain more narrative, with higher stopword and pronoun proportions and more frequent hedging and negation cues. On the validation split, exact-match recall is 48%, with large gaps across linguistic strata: recall drops from 58% to 24% from the lowest to highest stopword-proportion bins, and spans containing hedging or negation cues are less likely to be recovered. Under a relaxed overlap-based match criterion, recall increases to 71%, indicating that many errors are span boundary disagreements rather than complete misses. Overall, narrative-style spans–common in advice and precaution decisions–are a consistent blind spot under exact matching, suggesting that downstream systems should incorporate boundary-tolerant evaluation and extraction strategies for clinical decisions.
zh

[NLP-95] SpatiaLab: Can Vision-Language Models Perform Spatial Reason ing in the Wild? ICLR2026

【速读】: 该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在真实、复杂场景中空间推理能力严重不足的问题。现有研究多依赖于合成或大语言模型(LLM)生成的环境,任务设计受限且呈现“谜题式”结构,难以反映现实世界中的视觉噪声、多样空间关系及动态交互等挑战。解决方案的关键在于提出一个名为SpatiaLab的综合性基准测试集,其包含1,400个视觉问答对,覆盖相对位置、遮挡深度、朝向、尺度大小、空间导航和三维几何六大类共30种具体任务类型,每类至少200个问题,并支持多选与开放式两种评估方式。该基准基于真实场景构建,能有效暴露VLMs在复杂空间理解上的系统性短板,从而为未来提升模型的空间推理能力提供可量化、可复现的评估框架与研究方向。

链接: https://arxiv.org/abs/2602.03916
作者: Azmine Toushik Wasi,Wahid Faisal,Abdur Rahman,Mahfuz Ahmed Anik,Munem Shahriar,Mohsin Mahmud Topu,Sadia Tasnim Meem,Rahatun Nesa Priti,Sabrina Afroz Mitu,Md. Iqramul Hoque,Shahriyar Zaman Ridoy,Mohammed Eunus Ali,Majd Hawasly,Mohammad Raza,Md Rizwan Parvez
机构: Computational Intelligence and Operations Laboratory (CIOL); Shahjalal University of Science and Technology (SUST); BRAC University; North South University (NSU); Monash University; Qatar Computing Research Institute (QCRI)
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to ICLR 2026. 92 Pages. 42 Figures and 29 Tables

点击查看摘要

Abstract:Spatial reasoning is a fundamental aspect of human cognition, yet it remains a major challenge for contemporary vision-language models (VLMs). Prior work largely relied on synthetic or LLM-generated environments with limited task designs and puzzle-like setups, failing to capture the real-world complexity, visual noise, and diverse spatial relationships that VLMs encounter. To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs’ spatial reasoning in realistic, unconstrained contexts. SpatiaLab comprises 1,400 visual question-answer pairs across six major categories: Relative Positioning, Depth Occlusion, Orientation, Size Scale, Spatial Navigation, and 3D Geometry, each with five subcategories, yielding 30 distinct task types. Each subcategory contains at least 25 questions, and each main category includes at least 200 questions, supporting both multiple-choice and open-ended evaluation. Experiments across diverse state-of-the-art VLMs, including open- and closed-source models, reasoning-focused, and specialized spatial reasoning models, reveal a substantial gap in spatial reasoning capabilities compared with humans. In the multiple-choice setup, InternVL3.5-72B achieves 54.93% accuracy versus 87.57% for humans. In the open-ended setting, all models show a performance drop of around 10-25%, with GPT-5-mini scoring highest at 40.93% versus 64.93% for humans. These results highlight key limitations in handling complex spatial relationships, depth perception, navigation, and 3D geometry. By providing a diverse, real-world evaluation framework, SpatiaLab exposes critical challenges and opportunities for advancing VLMs’ spatial reasoning, offering a benchmark to guide future research toward robust, human-aligned spatial understanding. SpatiaLab is available at: this https URL.
zh

[NLP-96] HybridQuestion: Human-AI Collaboration for Identifying High-Impact Research Questions

【速读】: 该论文旨在解决当前生成式 AI(Generative AI)在科学研究中能否有效识别具有意义的研究问题这一关键问题。尽管大型语言模型(Large Language Models, LLMs)已在特定任务中展现出良好的创意生成能力,但其在战略性、长期性评估历史突破与未来科学问题方面的潜力仍不明确。为此,作者提出了一种人机协同的解决方案,其核心在于将AI在大规模文献处理上的高效性与人类专家的价值判断相结合:首先通过AI加速信息收集构建混合数据基础;其次利用多模型投票机制从六种不同LLM中筛选候选问题;最后通过多阶段逐步增强人类干预的过滤流程实现高质量问题遴选。实验结果表明,AI在识别已知突破方面与人类高度一致,但在预测前瞻性科学问题时存在显著偏差,凸显了人类专家在主观性、前瞻性的科学判断中不可替代的作用。

链接: https://arxiv.org/abs/2602.03849
作者: Keyu Zhao,Fengli Xu,Yong Li,Tie-Yan Liu
机构: Tsinghua University (清华大学); Zhongguancun Academy (中关村学院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 16 pages, 6 figures, 4 tables

点击查看摘要

Abstract:The “AI Scientist” paradigm is transforming scientific research by automating key stages of the research process, from idea generation to scholarly writing. This shift is expected to accelerate discovery and expand the scope of scientific inquiry. However, a key question remains unclear: can AI scientists identify meaningful research questions? While Large Language Models (LLMs) have been applied successfully to task-specific ideation, their potential to conduct strategic, long-term assessments of past breakthroughs and future questions remains largely unexplored. To address this gap, we explore a human-AI hybrid solution that integrates the scalable data processing capabilities of AI with the value judgment of human experts. Our methodology is structured in three phases. The first phase, AI-Accelerated Information Gathering, leverages AI’s advantage in processing vast amounts of literature to generate a hybrid information base. The second phase, Candidate Question Proposing, utilizes this synthesized data to prompt an ensemble of six diverse LLMs to propose an initial candidate pool, filtered via a cross-model voting mechanism. The third phase, Hybrid Question Selection, refines this pool through a multi-stage filtering process that progressively increases human oversight. To validate this system, we conducted an experiment aiming to identify the Top 10 Scientific Breakthroughs of 2025 and the Top 10 Scientific Questions for 2026 across five major disciplines. Our analysis reveals that while AI agents demonstrate high alignment with human experts in recognizing established breakthroughs, they exhibit greater divergence in forecasting prospective questions, suggesting that human judgment remains crucial for evaluating subjective, forward-looking challenges.
zh

[NLP-97] Do LLM s Truly Benefit from Longer Context in Automatic Post-Editing?

【速读】: 该论文旨在解决生成式 AI(Generative AI)在文档级自动后编辑(Automatic Post-Editing, APE)中的有效性与局限性问题,尤其关注大型语言模型(Large Language Models, LLMs)在利用文档上下文进行错误修正时的表现。其核心解决方案是通过系统性对比专有模型与开源权重模型在简单的一次提示(one-shot prompting)设置下执行文档级APE的效果,揭示不同模型在翻译质量、上下文利用能力、鲁棒性和效率方面的差异。关键发现在于:专有LLMs即使不显式使用文档上下文也能达到接近人类水平的APE质量,但其对文档级上下文的利用不足,且存在显著的计算成本和延迟问题,表明当前主流模型尚未实现高效、充分的文档感知型翻译优化,亟需发展更高效的长上下文建模方法。

链接: https://arxiv.org/abs/2601.19410
作者: Ahrii Kim,Seong-heum Kim
机构: 未知
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Automatic post-editing (APE) aims to refine machine translations by correcting residual errors. Although recent large language models (LLMs) demonstrate strong translation capabilities, their effectiveness for APE–especially under document-level context–remains insufficiently understood. We present a systematic comparison of proprietary and open-weight LLMs under a naive document-level prompting setup, analyzing APE quality, contextual behavior, robustness, and efficiency. Our results show that proprietary LLMs achieve near human-level APE quality even with simple one-shot prompting, regardless of whether document context is provided. While these models exhibit higher robustness to data poisoning attacks than open-weight counterparts, this robustness also reveals a limitation: they largely fail to exploit document-level context for contextual error correction. Furthermore, standard automatic metrics do not reliably reflect these qualitative improvements, highlighting the continued necessity of human evaluation. Despite their strong performance, the substantial cost and latency overheads of proprietary LLMs render them impractical for real-world APE deployment. Overall, our findings elucidate both the promise and current limitations of LLM-based document-aware APE, and point toward the need for more efficient long-context modeling approaches for translation refinement. Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC) Cite as: arXiv:2601.19410 [cs.CL] (or arXiv:2601.19410v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2601.19410 Focus to learn more arXiv-issued DOI via DataCite Related DOI: https://doi.org/10.36227/techrxiv.176107895.57699371/v1 Focus to learn more DOI(s) linking to related resources
zh

[NLP-98] Merged ChemProt-DrugProt for Relation Extraction from Biomedical Literature

【速读】: 该论文旨在解决化学-基因关系(Chemical-Gene Relations, CPR)提取中样本量不足与模型性能受限的问题,尤其是在药物发现和生物医学研究领域中准确识别化合物与基因之间相互作用的挑战。解决方案的关键在于通过合并ChemProt与DrugProt两个数据集以扩充训练样本,并结合图卷积网络(Graph Convolutional Networks, GCNs)与预训练生物医学语言模型BioBERT,从而同时利用局部语境信息(由BioBERT捕捉)和全局结构信息(由GCN建模),显著提升模型在共享CPR类别上的精确率与召回率。

链接: https://arxiv.org/abs/2405.18605
作者: Mai H. Nguyen,Shibani Likhite,Jiawei Tang,Darshini Mahendran,Bridget T. McInnes
机构: San Diego Supercomputer Center, University of California San Diego (圣地亚哥超级计算机中心,加州大学圣地亚哥分校); Department of Computer Science & Engineering, University of California San Diego (计算机科学与工程系,加州大学圣地亚哥分校); Department of Computer Science, Virginia Commonwealth University (计算机科学系,弗吉尼亚联邦大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Molecular Networks (q-bio.MN)
备注:

点击查看摘要

Abstract:The extraction of chemical-gene relations plays a pivotal role in understanding the intricate interactions between chemical compounds and genes, with significant implications for drug discovery, disease understanding, and biomedical research. This paper presents a data set created by merging the ChemProt and DrugProt datasets to augment sample counts and improve model accuracy. We evaluate the merged dataset using two state of the art relationship extraction algorithms: Bidirectional Encoder Representations from Transformers (BERT) specifically BioBERT, and Graph Convolutional Networks (GCNs) combined with BioBERT. While BioBERT excels at capturing local contexts, it may benefit from incorporating global information essential for understanding chemical-gene interactions. This can be achieved by integrating GCNs with BioBERT to harness both global and local context. Our results show that by integrating the ChemProt and DrugProt datasets, we demonstrated significant improvements in model performance, particularly in CPR groups shared between the datasets. Incorporating the global context using GCN can help increase the overall precision and recall in some of the CPR groups over using just BioBERT.
zh

[NLP-99] Universal Robust Speech Adaptation for Cross-Domain Speech Recognition and Enhancement

【速读】: 该论文旨在解决预训练的自动语音识别(ASR)和语音增强(SE)模型在遭遇域偏移(domain shift)时性能显著下降的问题,尤其是在面对未见过的噪声类型和信道失真时。解决方案的关键在于提出一种统一且领域感知的生成框架URSA-GAN,其核心创新包括:1)采用双嵌入架构(noise encoder 和 channel encoder),分别对噪声和信道条件进行建模,并利用有限的域内数据进行预训练以提取领域相关表征;2)通过GAN-based语音生成器,结合这些嵌入实现语音内容保真的声学对齐合成;3)引入动态随机扰动(dynamic stochastic perturbation)作为正则化策略,在生成过程中向嵌入注入可控变异性,从而提升模型对未见域的鲁棒性。实验证明该方法在多种噪声与信道不匹配场景下均能有效降低字符错误率(CER)并提升感知指标。

链接: https://arxiv.org/abs/2602.04307
作者: Chien-Chun Wang,Hung-Shin Lee,Hsin-Min Wang,Berlin Chen
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注: Accepted to IEEE Transactions on Audio, Speech and Language Processing (IEEE TASLP)

点击查看摘要

Abstract:Pre-trained models for automatic speech recognition (ASR) and speech enhancement (SE) have exhibited remarkable capabilities under matched noise and channel conditions. However, these models often suffer from severe performance degradation when confronted with domain shifts, particularly in the presence of unseen noise and channel distortions. In view of this, we in this paper present URSA-GAN, a unified and domain-aware generative framework specifically designed to mitigate mismatches in both noise and channel conditions. URSA-GAN leverages a dual-embedding architecture that consists of a noise encoder and a channel encoder, each pre-trained with limited in-domain data to capture domain-relevant representations. These embeddings condition a GAN-based speech generator, facilitating the synthesis of speech that is acoustically aligned with the target domain while preserving phonetic content. To enhance generalization further, we propose dynamic stochastic perturbation, a novel regularization technique that introduces controlled variability into the embeddings during generation, promoting robustness to unseen domains. Empirical results demonstrate that URSA-GAN effectively reduces character error rates in ASR and improves perceptual metrics in SE across diverse noisy and mismatched channel scenarios. Notably, evaluations on compound test conditions with both channel and noise degradations confirm the generalization ability of URSA-GAN, yielding relative improvements of 16.16% in ASR performance and 15.58% in SE metrics.
zh

[NLP-100] Benchmarking Automatic Speech Recognition for Indian Languages in Agricultural Contexts

【速读】: 该论文旨在解决印度农业咨询服务数字化过程中,多语言环境下自动语音识别(ASR)系统对领域特定术语识别准确率不足的问题。其核心挑战在于低资源农业语境下,不同方言(如印地语、泰卢固语和奥里亚语)的语音数据质量参差不齐,且存在多说话人混叠现象,导致传统词错误率(WER)指标难以真实反映实际应用效果。解决方案的关键在于构建一个面向农业场景的基准评估框架,引入农业加权词错误率(AWWER)和领域专用效用评分等新指标,并结合说话人分离(speaker diarization)与最佳说话人选择策略,显著降低多说话人录音中的WER(最高可达66%),从而为提升农业领域ASR系统的实用性与鲁棒性提供了可量化、可复现的技术路径。

链接: https://arxiv.org/abs/2602.03868
作者: Chandrashekar M S,Vineet Singh,Lakshmi Pedapudi
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注: 9 pages, 6 figures

点击查看摘要

Abstract:The digitization of agricultural advisory services in India requires robust Automatic Speech Recognition (ASR) systems capable of accurately transcribing domain-specific terminology in multiple Indian languages. This paper presents a benchmarking framework for evaluating ASR performance in agricultural contexts across Hindi, Telugu, and Odia languages. We introduce evaluation metrics including Agriculture Weighted Word Error Rate (AWWER) and domain-specific utility scoring to complement traditional metrics. Our evaluation of 10,934 audio recordings, each transcribed by up to 10 ASR models, reveals performance variations across languages and models, with Hindi achieving the best overall performance (WER: 16.2%) while Odia presents the greatest challenges (best WER: 35.1%, achieved only with speaker diarization). We characterize audio quality challenges inherent to real-world agricultural field recordings and demonstrate that speaker diarization with best-speaker selection can substantially reduce WER for multi-speaker recordings (upto 66% depending on the proportion of multi-speaker audio). We identify recurring error patterns in agricultural terminology and provide practical recommendations for improving ASR systems in low-resource agricultural domains. The study establishes baseline benchmarks for future agricultural ASR development.
zh

计算机视觉

[CV-0] CoWTracker: Tracking by Warping instead of Correlation

【速读】:该论文旨在解决密集点跟踪(Dense Point Tracking)中因依赖代价体(Cost Volume)而导致的空间分辨率复杂度为二次方的问题,从而限制了模型的可扩展性和效率。其解决方案的关键在于摒弃传统代价体计算,转而采用基于图像变形(Warping)的策略:通过迭代地将目标帧特征根据当前跟踪估计映射到查询帧,结合Transformer架构实现所有轨迹的联合时空推理,从而在不显式计算特征相关性的情况下建立长距离对应关系。这一设计显著提升了模型的简洁性与性能,在多个标准基准测试中达到最优结果,并展现出在光流估计任务中的卓越表现。

链接: https://arxiv.org/abs/2602.04877
作者: Zihang Lai,Eldar Insafutdinov,Edgar Sucar,Andrea Vedaldi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project website: this http URL

点击查看摘要

Abstract:Dense point tracking is a fundamental problem in computer vision, with applications ranging from video analysis to robotic manipulation. State-of-the-art trackers typically rely on cost volumes to match features across frames, but this approach incurs quadratic complexity in spatial resolution, limiting scalability and efficiency. In this paper, we propose \method, a novel dense point tracker that eschews cost volumes in favor of warping. Inspired by recent advances in optical flow, our approach iteratively refines track estimates by warping features from the target frame to the query frame based on the current estimate. Combined with a transformer architecture that performs joint spatiotemporal reasoning across all tracks, our design establishes long-range correspondences without computing feature correlations. Our model is simple and achieves state-of-the-art performance on standard dense point tracking benchmarks, including TAP-Vid-DAVIS, TAP-Vid-Kinetics, and Robo-TAP. Remarkably, the model also excels at optical flow, sometimes outperforming specialized methods on the Sintel, KITTI, and Spring benchmarks. These results suggest that warping-based architectures can unify dense point tracking and optical flow estimation.
zh

[CV-1] PerpetualWonder: Long-Horizon Action-Conditioned 4D Scene Generation

【速读】:该论文旨在解决现有生成式模拟器在长时序、动作条件下的4D场景生成任务中面临的挑战,即物理状态与视觉表征解耦导致无法通过生成优化更新底层物理模型以支持后续交互。其解决方案的关键在于提出首个真正的闭环系统——PerpetualWonder,它引入了一种新颖的统一表示机制,建立了物理状态与视觉基元之间的双向关联,使生成 refinements 能够同时修正动态行为和外观表现;此外,还设计了鲁棒的更新机制,通过多视角监督信息缓解优化过程中的歧义性问题,从而实现从单张图像出发的复杂多步交互模拟,并保持物理合理性与视觉一致性。

链接: https://arxiv.org/abs/2602.04876
作者: Jiahao Zhan,Zizhang Li,Hong-Xing Yu,Jiajun Wu
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project website: this https URL

点击查看摘要

Abstract:We introduce PerpetualWonder, a hybrid generative simulator that enables long-horizon, action-conditioned 4D scene generation from a single image. Current works fail at this task because their physical state is decoupled from their visual representation, which prevents generative refinements to update the underlying physics for subsequent interactions. PerpetualWonder solves this by introducing the first true closed-loop system. It features a novel unified representation that creates a bidirectional link between the physical state and visual primitives, allowing generative refinements to correct both the dynamics and appearance. It also introduces a robust update mechanism that gathers supervision from multiple viewpoints to resolve optimization ambiguity. Experiments demonstrate that from a single image, PerpetualWonder can successfully simulate complex, multi-step interactions from long-horizon actions, maintaining physical plausibility and visual consistency.
zh

[CV-2] Laminating Representation Autoencoders for Efficient Diffusion

【速读】:该论文旨在解决基于DINOv2等视觉Transformer编码器提取的密集patch特征在扩散模型中存在冗余问题,导致计算成本过高。其解决方案的关键在于提出FlatDINO——一种变分自编码器(Variational Autoencoder, VAE),将原始高维patch序列压缩为仅32个连续的1D token表示,实现序列长度降低8倍、总维度压缩48倍。在此基础上训练的DiT-XL模型在ImageNet 256×256上取得了gFID=1.80的高质量图像生成效果,同时每前向传播减少8倍浮点运算量(FLOPs),训练步骤最多减少4.5倍FLOPs,显著提升了扩散模型的效率。

链接: https://arxiv.org/abs/2602.04873
作者: Ramón Calvo-González,François Fleuret
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent work has shown that diffusion models can generate high-quality images by operating directly on SSL patch features rather than pixel-space latents. However, the dense patch grids from encoders like DINOv2 contain significant redundancy, making diffusion needlessly expensive. We introduce FlatDINO, a variational autoencoder that compresses this representation into a one-dimensional sequence of just 32 continuous tokens -an 8x reduction in sequence length and 48x compression in total dimensionality. On ImageNet 256x256, a DiT-XL trained on FlatDINO latents achieves a gFID of 1.80 with classifier-free guidance while requiring 8x fewer FLOPs per forward pass and up to 4.5x fewer FLOPs per training step compared to diffusion on uncompressed DINOv2 features. These are preliminary results and this work is in progress.
zh

[CV-3] When LLaVA Meets Objects: Token Composition for Vision-Language-Models

【速读】:该论文旨在解决当前自回归视觉语言模型(Autoregressive Vision Language Models, VLMs)在推理阶段依赖大量视觉标记(visual tokens)导致计算资源消耗过高的问题。其解决方案的关键在于提出Mask-LLaVA框架,通过融合不同层级的视觉特征——包括基于掩码的对象表示(mask-based object representations)、全局标记(global tokens)和局部补丁标记(local patch tokens)——构建一种紧凑且信息丰富的视觉表征。训练时使用所有类型的标记,而在测试阶段可灵活减少掩码对象标记的数量,实现无需重新训练即可动态调整推理阶段的标记数量,同时保持性能稳定。

链接: https://arxiv.org/abs/2602.04864
作者: Soumya Jahagirdar,Walid Bousselham,Anna Kukleva,Hilde Kuehne
机构: Tuebingen AI Center/University of Tuebingen; Max Planck Institute for Informatics, SIC; MIT-IBM Watson AI Lab
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current autoregressive Vision Language Models (VLMs) usually rely on a large number of visual tokens to represent images, resulting in a need for more compute especially at inference time. To address this problem, we propose Mask-LLaVA, a framework that leverages different levels of visual features to create a compact yet information-rich visual representation for autoregressive VLMs. Namely, we combine mask-based object representations together with global tokens and local patch tokens. While all tokens are used during training, it shows that the resulting model can flexibly drop especially the number of mask-based object-tokens at test time, allowing to adapt the number of tokens during inference without the need to retrain the model and without a significant drop in performance. We evaluate the proposed approach on a suite of standard benchmarks showing results competitive to current token efficient methods and comparable to the original LLaVA baseline using only a fraction of visual tokens. Our analysis demonstrates that combining multi-level features enables efficient learning with fewer tokens while allowing dynamic token selection at test time for good performance.
zh

[CV-4] PDF-HR: Pose Distance Fields for Humanoid Robots

【速读】:该论文旨在解决人形机器人在运动规划与控制中缺乏有效姿态与运动先验(pose and motion priors)的问题,尤其针对当前高质机器人运动数据稀缺导致的模型泛化能力不足。其解决方案的关键在于提出Pose Distance Fields for Humanoid Robots (PDF-HR),这是一种轻量级、连续且可微的先验表示方法,将机器人姿态分布建模为一个流形空间,能够对任意姿态预测其到大规模重定向机器人姿态集合的距离,从而提供平滑的姿态合理性评分(pose plausibility),适用于优化与控制任务中的奖励塑造、正则化或独立评分模块。该方法可无缝集成至多种机器人运动学习与生成流程中,实验证明其能显著增强主流基线模型的性能。

链接: https://arxiv.org/abs/2602.04851
作者: Yi Gu,Yukang Gao,Yangchen Zhou,Xingyu Chen,Yixiao Feng,Mingle Zhao,Yunyang Mo,Zhaorui Wang,Lixin Xu,Renjing Xu
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: \href{ this https URL }{Project page}

点击查看摘要

Abstract:Pose and motion priors play a crucial role in humanoid robotics. Although such priors have been widely studied in human motion recovery (HMR) domain with a range of models, their adoption for humanoid robots remains limited, largely due to the scarcity of high-quality humanoid motion data. In this work, we introduce Pose Distance Fields for Humanoid Robots (PDF-HR), a lightweight prior that represents the robot pose distribution as a continuous and differentiable manifold. Given an arbitrary pose, PDF-HR predicts its distance to a large corpus of retargeted robot poses, yielding a smooth measure of pose plausibility that is well suited for optimization and control. PDF-HR can be integrated as a reward shaping term, a regularizer, or a standalone plausibility scorer across diverse pipelines. We evaluate PDF-HR on various humanoid tasks, including single-trajectory motion tracking, general motion tracking, style-based motion mimicry, and general motion retargeting. Experiments show that this plug-and-play prior consistently and substantially strengthens strong baselines. Code and models will be released.
zh

[CV-5] LitS: A novel Neighborhood Descriptor for Point Clouds

【速读】:该论文旨在解决点云数据中局部几何结构描述不足的问题,尤其是在面对变密度、噪声等常见数据质量问题时,现有邻域描述子难以准确刻画局部点分布特性。解决方案的关键在于提出一种新型邻域描述子 LitS(Local Information Tracking on the Sphere),其本质是在单位圆上定义的分段常值函数,能够使每个点基于局部参考系追踪其周围方向上的邻居数量。通过在特定方向上评估 LitS,可获得以该方向为中心的锥形区域内邻近点的数量信息,从而高效捕捉局部点排列的细微特征,并借助相邻点间 LitS 的变化实现全局结构理解。LitS 具备“常规”与“累积”两种版本及两个可调参数,具备良好的适应性和鲁棒性,适用于多种类型的点云场景。

链接: https://arxiv.org/abs/2602.04838
作者: Jonatan B. Bastos,Francisco F. Rivera,Oscar G. Lorenzo,David L. Vilariño,José C. Cabaleiro,Alberto M. Esmorís,Tomás F. Pena
机构: Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, Spain; Departamento de Electrónica e Computación, Universidade de Santiago de Compostela, Spain
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the advancement of 3D scanning technologies, point clouds have become fundamental for representing 3D spatial data, with applications that span across various scientific and technological fields. Practical analysis of this data depends crucially on available neighborhood descriptors to accurately characterize the local geometries of the point cloud. This paper introduces LitS, a novel neighborhood descriptor for 2D and 3D point clouds. LitS are piecewise constant functions on the unit circle that allow points to keep track of their surroundings. Each element in LitS’ domain represents a direction with respect to a local reference system. Once constructed, evaluating LitS at any given direction gives us information about the number of neighbors in a cone-like region centered around that same direction. Thus, LitS conveys a lot of information about the local neighborhood of a point, which can be leveraged to gain global structural understanding by analyzing how LitS changes between close points. In addition, LitS comes in two versions (‘regular’ and ‘cumulative’) and has two parameters, allowing them to adapt to various contexts and types of point clouds. Overall, they are a versatile neighborhood descriptor, capable of capturing the nuances of local point arrangements and resilient to common point cloud data issues such as variable density and noise.
zh

[CV-6] Its not a Lottery its a Race: Understanding How Gradient Descent Adapts the Networks Capacity to the Task

【速读】:该论文试图解决神经网络理论容量与实际有效容量之间的不匹配问题,即为何在梯度下降训练过程中,神经网络的理论复杂度会自发降低至适应任务需求的有效复杂度。其解决方案的关键在于通过分析单隐藏层ReLU网络中个体神经元的学习动力学,识别出三个核心动力学原理:相互对齐(mutual alignment)、解锁(unlocking)和竞速(racing),这些机制共同解释了为何训练后可通过合并等效神经元或剪枝低范数权重来成功降低模型容量,并进一步阐明了“彩票猜想”(lottery ticket conjecture)的内在机制——即某些特定且有益的初始条件使部分神经元获得更高的权重范数并主导训练过程。

链接: https://arxiv.org/abs/2602.04832
作者: Hannah Pinson
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Our theoretical understanding of neural networks is lagging behind their empirical success. One of the important unexplained phenomena is why and how, during the process of training with gradient descent, the theoretical capacity of neural networks is reduced to an effective capacity that fits the task. We here investigate the mechanism by which gradient descent achieves this through analyzing the learning dynamics at the level of individual neurons in single hidden layer ReLU networks. We identify three dynamical principles – mutual alignment, unlocking and racing – that together explain why we can often successfully reduce capacity after training through the merging of equivalent neurons or the pruning of low norm weights. We specifically explain the mechanism behind the lottery ticket conjecture, or why the specific, beneficial initial conditions of some neurons lead them to obtain higher weight norms.
zh

[CV-7] oward Reliable and Explainable Nail Disease Classification: Leverag ing Adversarial Training and Grad-CAM Visualization

【速读】:该论文旨在解决人类甲病(nail diseases)在临床诊断中因视觉特征相似性导致的早期识别困难问题,尤其针对老年群体中易被忽视的指甲病变。其解决方案的关键在于构建一个基于深度学习的自动化分类模型,利用公开的3,835张图像数据集(涵盖六类甲病),对图像进行标准化预处理后,采用四种主流卷积神经网络(CNN)模型进行训练与比较,其中InceptionV3表现最优(准确率达95.57%)。为进一步提升鲁棒性,引入对抗训练以增强模型在噪声或复杂图像下的判别能力,并结合SHAP(SHapley Additive exPlanations)方法解释模型决策依据,从而为医生提供可信赖、高精度且可解释的辅助诊断工具。

链接: https://arxiv.org/abs/2602.04820
作者: Farzia Hossain,Samanta Ghosh,Shahida Begum,B. M. Shahria Alam,Mohammad Tahmid Noor,Md Parvez Mia,Nishat Tasnim Niloy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 12 figures. This is the author’s accepted manuscript of a paper accepted for publication in the Proceedings of the 16th International IEEE Conference on Computing, Communication and Networking Technologies (ICCCNT 2025). The final published version will be available via IEEE Xplore

点击查看摘要

Abstract:Human nail diseases are gradually observed over all age groups, especially among older individuals, often going ignored until they become severe. Early detection and accurate diagnosis of such conditions are important because they sometimes reveal our body’s health problems. But it is challenging due to the inferred visual differences between disease types. This paper presents a machine learning-based model for automated classification of nail diseases based on a publicly available dataset, which contains 3,835 images scaling six categories. In 224x224 pixels, all images were resized to ensure consistency. To evaluate performance, four well-known CNN models-InceptionV3, DenseNet201, EfficientNetV2, and ResNet50 were trained and analyzed. Among these, InceptionV3 outperformed the others with an accuracy of 95.57%, while DenseNet201 came next with 94.79%. To make the model stronger and less likely to make mistakes on tricky or noisy images, we used adversarial training. To help understand how the model makes decisions, we used SHAP to highlight important features in the predictions. This system could be a helpful support for doctors, making nail disease diagnosis more accurate and faster.
zh

[CV-8] XtraLight-MedMamba for Classification of Neoplastic Tubular Adenomas

【速读】:该论文旨在解决结肠镜筛查中低级别异型增生(low-grade dysplasia)病变风险分层不准确的问题,其核心挑战在于传统组织病理学评估存在主观性,难以识别微小但与恶性进展相关的形态学特征。为此,作者提出XtraLight-MedMamba模型,其关键创新在于融合ConvNeXt浅层特征提取器与并行视觉状态空间模型(Vision Mamba),以高效建模图像的长程和短程依赖关系,并通过空间与通道注意力桥接模块(SCAB)增强多尺度特征提取能力;同时引入固定非负正交分类器(FNOClassifier),在显著减少参数量(约32,000个)的同时提升模型泛化性能,最终在低级别管状腺瘤患者后续是否发展为结直肠癌(CRC)的分类任务中达到97.18%准确率和0.9767 F1分数,优于复杂度更高的Transformer和传统Mamba架构。

链接: https://arxiv.org/abs/2602.04819
作者: Aqsa Sultana,Rayan Afsar,Ahmed Rahu,Surendra P. Singh,Brian Shula,Brandon Combs,Derrick Forchetti,Vijayan K. Asari
机构: University of Dayton(代顿大学); University of Georgia(佐治亚大学); The University of Toledo Medical Center(托莱多大学医学中心); Honeywell International Inc.(霍尼韦尔国际公司); South Bend Medical Foundation(南本德医学基金会)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 13 pages, 8 figures

点击查看摘要

Abstract:Accurate risk stratification of precancerous polyps during routine colonoscopy screenings is essential for lowering the risk of developing colorectal cancer (CRC). However, assessment of low-grade dysplasia remains limited by subjective histopathologic interpretation. Advancements in digital pathology and deep learning provide new opportunities to identify subtle and fine morphologic patterns associated with malignant progression that may be imperceptible to the human eye. In this work, we propose XtraLight-MedMamba, an ultra-lightweight state-space-based deep learning framework for classifying neoplastic tubular adenomas from whole-slide images (WSIs). The architecture is a blend of ConvNext based shallow feature extractor with parallel vision mamba to efficiently model both long- and short-range dependencies and image generalization. An integration of Spatial and Channel Attention Bridge (SCAB) module enhances multiscale feature extraction, while Fixed Non-Negative Orthogonal Classifier (FNOClassifier) enables substantial parameter reduction and improved generalization. The model was evaluated on a curated dataset acquired from patients with low-grade tubular adenomas, stratified into case and control cohorts based on subsequent CRC development. XtraLight-MedMamba achieved an accuracy of 97.18% and an F1-score of 0.9767 using approximately 32,000 parameters, outperforming transformer-based and conventional Mamba architectures with significantly higher model complexity.
zh

[CV-9] X2HDR: HDR Image Generation in a Perceptually Uniform Space

【速读】:该论文旨在解决当前生成式AI模型(如Stable Diffusion和FLUX)在高动态范围(HDR)图像生成方面的局限性,即这些模型通常仅能输出低动态范围(LDR)图像,主要受限于缺乏大规模HDR训练数据。其解决方案的关键在于:通过将HDR输入转换为感知均匀编码(如PU21或PQ),利用已预训练的LDR变分自编码器(VAE)在感知统一空间中实现高质量重建,从而避免对整个模型进行重新训练;具体而言,该方法冻结VAE参数,仅通过低秩适应(LoRA)微调去噪器模块,形成一种高效且统一的计算框架,支持文本到HDR合成与单张RAW图像到HDR重建任务,显著提升感知保真度、图文对齐效果及有效动态范围。

链接: https://arxiv.org/abs/2602.04814
作者: Ronghuan Wu,Wanchao Su,Kede Ma,Jing Liao,Rafał K. Mantiuk
机构: City University of Hong Kong(香港城市大学); Monash University(莫纳什大学); University of Cambridge(剑桥大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project page: this https URL , Code: this https URL

点击查看摘要

Abstract:High-dynamic-range (HDR) formats and displays are becoming increasingly prevalent, yet state-of-the-art image generators (e.g., Stable Diffusion and FLUX) typically remain limited to low-dynamic-range (LDR) output due to the lack of large-scale HDR training data. In this work, we show that existing pretrained diffusion models can be easily adapted to HDR generation without retraining from scratch. A key challenge is that HDR images are natively represented in linear RGB, whose intensity and color statistics differ substantially from those of sRGB-encoded LDR images. This gap, however, can be effectively bridged by converting HDR inputs into perceptually uniform encodings (e.g., using PU21 or PQ). Empirically, we find that LDR-pretrained variational autoencoders (VAEs) reconstruct PU21-encoded HDR inputs with fidelity comparable to LDR data, whereas linear RGB inputs cause severe degradations. Motivated by this finding, we describe an efficient adaptation strategy that freezes the VAE and finetunes only the denoiser via low-rank adaptation in a perceptually uniform space. This results in a unified computational method that supports both text-to-HDR synthesis and single-image RAW-to-HDR reconstruction. Experiments demonstrate that our perceptually encoded adaptation consistently improves perceptual fidelity, text-image alignment, and effective dynamic range, relative to previous techniques.
zh

[CV-10] VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?

【速读】:该论文旨在解决当前视觉-语言模型(Vision-Language Models, VLMs)在处理图像中可视化文本(visualized text)时表现不佳的问题,即现有模型在纯文本查询上性能优异,但在面对语义相同但以图像形式呈现的文本输入时出现显著性能下降。解决方案的关键在于提出VISTA-Bench,一个系统性基准测试框架,通过在受控渲染条件下对比纯文本与可视化文本问题,量化模型在多模态感知、推理到单模态理解等不同领域中的表现差异,从而揭示并诊断模型对文本呈现形式敏感的局限性,并为构建更统一的语言表征(跨分词文本与像素)提供指导。

链接: https://arxiv.org/abs/2602.04802
作者: Qing’an Liu,Juntong Feng,Yuhao Wang,Xinzhe Han,Yujie Cheng,Yue Zhu,Haiwen Diao,Yunzhi Zhuge,Huchuan Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, 19 figures

点击查看摘要

Abstract:Vision-Language Models (VLMs) have achieved impressive performance in cross-modal understanding across textual and visual inputs, yet existing benchmarks predominantly focus on pure-text queries. In real-world scenarios, language also frequently appears as visualized text embedded in images, raising the question of whether current VLMs handle such input requests comparably. We introduce VISTA-Bench, a systematic benchmark from multimodal perception, reasoning, to unimodal understanding domains. It evaluates visualized text understanding by contrasting pure-text and visualized-text questions under controlled rendering conditions. Extensive evaluation of over 20 representative VLMs reveals a pronounced modality gap: models that perform well on pure-text queries often degrade substantially when equivalent semantic content is presented as visualized text. This gap is further amplified by increased perceptual difficulty, highlighting sensitivity to rendering variations despite unchanged semantics. Overall, VISTA-Bench provides a principled evaluation framework to diagnose this limitation and to guide progress toward more unified language representations across tokenized text and pixels. The source dataset is available at this https URL.
zh

[CV-11] Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention

【速读】:该论文旨在解决自回归(Autoregressive, AR)视频生成模型中注意力机制的二次复杂度问题,这一瓶颈限制了模型在实际部署中的效率。现有稀疏注意力方法虽在双向模型中表现良好,但在AR模型上会导致显著性能下降,原因在于未充分考虑块(chunk)生成的孤立性以及过去信息上下文利用不足。解决方案的关键是提出首个专为AR视频生成设计的稀疏注意力方法——Light Forcing,其核心创新包括:(1) 块感知增长机制(Chunk-Aware Growth),定量评估每个块的贡献并动态分配稀疏性,使当前块能继承早期块的知识;(2) 分层稀疏注意力(Hierarchical Sparse Attention),通过粗到细的两层掩码选择策略(帧级与块级)自适应捕捉历史和局部信息,从而在保持高质量的同时实现高效推理。

链接: https://arxiv.org/abs/2602.04789
作者: Chengtao Lv,Yumeng Shi,Yushi Huang,Ruihao Gong,Shen Ren,Wenya Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 7 figures

点击查看摘要

Abstract:Advanced autoregressive (AR) video generation models have improved visual fidelity and interactivity, but the quadratic complexity of attention remains a primary bottleneck for efficient deployment. While existing sparse attention solutions have shown promise on bidirectional models, we identify that applying these solutions to AR models leads to considerable performance degradation for two reasons: isolated consideration of chunk generation and insufficient utilization of past informative context. Motivated by these observations, we propose \textscLight Forcing, the \textitfirst sparse attention solution tailored for AR video generation models. It incorporates a \textitChunk-Aware Growth mechanism to quantitatively estimate the contribution of each chunk, which determines their sparsity allocation. This progressive sparsity increase strategy enables the current chunk to inherit prior knowledge in earlier chunks during generation. Additionally, we introduce a \textitHierarchical Sparse Attention to capture informative historical and local context in a coarse-to-fine manner. Such two-level mask selection strategy (\ie, frame and block level) can adaptively handle diverse attention patterns. Extensive experiments demonstrate that our method outperforms existing sparse attention in quality (\eg, 84.5 on VBench) and efficiency (\eg, 1.2\sim1.3\times end-to-end speedup). Combined with FP8 quantization and LightVAE, \textscLight Forcing further achieves a 2.3\times speedup and 19.7,FPS on an RTX~5090 GPU. Code will be released at \hrefthis https URLthis https URL.
zh

[CV-12] Generative Modeling via Drifting

【速读】:该论文旨在解决生成式模型中推理阶段需要多步迭代(如扩散模型或流模型)导致的计算效率低下的问题,目标是实现高质量的一步生成(one-step generation)。其解决方案的关键在于提出了一种新范式——漂移模型(Drifting Models),通过引入一个漂移场(drifting field)来在训练过程中动态演化样本分布,使生成分布与数据分布达到平衡;该机制使得模型在训练时即可优化分布匹配,从而在推理时仅需一步即可生成高质量样本,实验表明该方法在ImageNet 256×256分辨率下取得了当前最优的FID指标(潜空间1.54,像素空间1.61)。

链接: https://arxiv.org/abs/2602.04770
作者: Mingyang Deng,He Li,Tianhong Li,Yilun Du,Kaiming He
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Generative modeling can be formulated as learning a mapping f such that its pushforward distribution matches the data distribution. The pushforward behavior can be carried out iteratively at inference time, for example in diffusion and flow-based models. In this paper, we propose a new paradigm called Drifting Models, which evolve the pushforward distribution during training and naturally admit one-step inference. We introduce a drifting field that governs the sample movement and achieves equilibrium when the distributions match. This leads to a training objective that allows the neural network optimizer to evolve the distribution. In experiments, our one-step generator achieves state-of-the-art results on ImageNet at 256 x 256 resolution, with an FID of 1.54 in latent space and 1.61 in pixel space. We hope that our work opens up new opportunities for high-quality one-step generation.
zh

[CV-13] Mitigating Long-Tail Bias via Prompt-Controlled Diffusion Augmentation

【速读】:该论文旨在解决高分辨率遥感影像语义分割中因像素类别分布严重长尾不平衡(long-tailed imbalance)导致的模型性能下降问题,尤其是在LoveDA数据集中由于城市/乡村(Urban/Rural)域差异显著、类频次统计不一致所加剧的挑战。解决方案的关键在于提出一种提示控制的扩散增强框架(prompt-controlled diffusion augmentation framework),其核心由两个阶段组成:Stage A利用面向域的、掩码比例条件约束的离散扩散模型生成满足指定类别比例且保留类别共现结构的布局;Stage B则通过Stable Diffusion结合ControlNet引导,将布局转化为具有域一致性的真实感图像。该方法通过混合合成样本与真实数据,在多个分割骨干网络上均实现性能提升,尤其在少数类(minority classes)和城乡泛化能力方面效果显著,验证了可控增强作为缓解遥感分割中长尾偏差的有效机制。

链接: https://arxiv.org/abs/2602.04749
作者: Buddhi Wijenayake,Nichula Wasalathilake,Roshan Godaliyadda,Vijitha Herath,Parakrama Ekanayake,Vishal M. Patel
机构: University of Peradeniya (佩拉德尼雅大学); Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Semantic segmentation of high-resolution remote-sensing imagery is critical for urban mapping and land-cover monitoring, yet training data typically exhibits severe long-tailed pixel imbalance. In the dataset LoveDA, this challenge is compounded by an explicit Urban/Rural split with distinct appearance and inconsistent class-frequency statistics across domains. We present a prompt-controlled diffusion augmentation framework that synthesizes paired label–image samples with explicit control of both domain and semantic composition. Stage~A uses a domain-aware, masked ratio-conditioned discrete diffusion model to generate layouts that satisfy user-specified class-ratio targets while respecting learned co-occurrence structure. Stage~B translates layouts into photorealistic, domain-consistent images using Stable Diffusion with ControlNet guidance. Mixing the resulting ratio and domain-controlled synthetic pairs with real data yields consistent improvements across multiple segmentation backbones, with gains concentrated on minority classes and improved Urban and Rural generalization, demonstrating controllable augmentation as a practical mechanism to mitigate long-tail bias in remote-sensing segmentation. Source codes, pretrained models, and synthetic datasets are available at \hrefthis https URLGithub
zh

[CV-14] How to rewrite the stars: Mapping your orchard over time through constellations of fruits

【速读】:该论文旨在解决在果园中通过视频序列追踪果实生长过程中的关键难题,即如何在不同时间采集的视频之间准确匹配同一果实,从而实现对果实生长动态的自动化监测。现有方法受限于相机位置固定、特征显著性要求高或依赖GPS等外部数据,难以应对非刚性变化、遮挡及视觉特征稀疏等挑战。解决方案的关键在于提出基于三维质心星座(constellations of 3D centroids)的新范式,并设计一种适用于极稀疏三维点云的描述子,通过匹配整组果实的空间构型而非单个果实来提升鲁棒性,从而有效实现跨视频、跨时间的果实匹配,并进一步构建果园地图以支持机器人六自由度(6DoF)位姿估计,为果园自主导航与选择性采摘提供技术基础。

链接: https://arxiv.org/abs/2602.04722
作者: Gonçalo P. Matos,Carlos Santiago,João P. Costeira,Ricardo L. Saldanha,Ernesto M. Morgado
机构: SISCOG – Sistemas Cognitivos, SA; Institute for Systems and Robotics (ISR) / LARSyS, Instituto Superior Técnico (IST), Lisbon, Portugal
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: submitted to IEEE International Conference on Robotics Automation

点击查看摘要

Abstract:Following crop growth through the vegetative cycle allows farmers to predict fruit setting and yield in early stages, but it is a laborious and non-scalable task if performed by a human who has to manually measure fruit sizes with a caliper or dendrometers. In recent years, computer vision has been used to automate several tasks in precision agriculture, such as detecting and counting fruits, and estimating their size. However, the fundamental problem of matching the exact same fruits from one video, collected on a given date, to the fruits visible in another video, collected on a later date, which is needed to track fruits’ growth through time, remains to be solved. Few attempts were made, but they either assume that the camera always starts from the same known position and that there are sufficiently distinct features to match, or they used other sources of data like GPS. Here we propose a new paradigm to tackle this problem, based on constellations of 3D centroids, and introduce a descriptor for very sparse 3D point clouds that can be used to match fruits across videos. Matching constellations instead of individual fruits is key to deal with non-rigidity, occlusions and challenging imagery with few distinct visual features to track. The results show that the proposed method can be successfully used to match fruits across videos and through time, and also to build an orchard map and later use it to locate the camera pose in 6DoF, thus providing a method for autonomous navigation of robots in the orchard and for selective fruit picking, for example.
zh

[CV-15] Adaptive Prompt Elicitation for Text-to-Image Generation

【速读】:该论文旨在解决文本到图像生成模型中用户意图对齐困难的问题,尤其针对用户提供模糊输入以及难以掌握模型特有行为的情况。解决方案的关键在于提出自适应提示引导(Adaptive Prompt Elicitation, APE),其核心是基于信息论框架构建交互式意图推断机制:通过语言模型先验将隐含意图表示为可解释的特征要求,自适应生成视觉查询以获取用户反馈,并将所获需求整合为高效提示。该方法在IDEA-Bench和DesignBench上的评估显示更强意图对齐与更高效率,且在具有挑战性的用户任务中实现19.8%更高的对齐度而无需增加工作负荷。

链接: https://arxiv.org/abs/2602.04713
作者: Xinyi Wen,Lena Hegemann,Xiaofu Jin,Shuai Ma,Antti Oulasvirta
机构: Aalto University (阿尔托大学); University of Helsinki (赫尔辛基大学); ELLIS Institute Finland (芬兰ELLIS研究所)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: ACM International Conference on Intelligent User Interfaces (IUI) 2026, March 23-26, Paphos, Cyprus

点击查看摘要

Abstract:Aligning text-to-image generation with user intent remains challenging, for users who provide ambiguous inputs and struggle with model idiosyncrasies. We propose Adaptive Prompt Elicitation (APE), a technique that adaptively asks visual queries to help users refine prompts without extensive writing. Our technical contribution is a formulation of interactive intent inference under an information-theoretic framework. APE represents latent intent as interpretable feature requirements using language model priors, adaptively generates visual queries, and compiles elicited requirements into effective prompts. Evaluation on IDEA-Bench and DesignBench shows that APE achieves stronger alignment with improved efficiency. A user study with challenging user-defined tasks demonstrates 19.8% higher alignment without workload overhead. Our work contributes a principled approach to prompting that, for general users, offers an effective and efficient complement to the prevailing prompt-based interaction paradigm with text-to-image models.
zh

[CV-16] SAR-RAG : ATR Visual Question Answering by Semantic Search Retrieval and MLLM Generation

【速读】:该论文旨在解决合成孔径雷达(Synthetic Aperture Radar, SAR)图像中目标识别(Automatic Target Recognition, ATR)精度不足的问题,尤其是在军事车辆等目标在SAR图像中特征相似、难以区分的情况下。解决方案的关键在于提出一种视觉上下文图像检索增强生成(Image Retrieval-Augmented Generation, ImageRAG)辅助的AI代理方法——SAR-RAG,其核心机制是将多模态大语言模型(Multimodal Large Language Model, MLLM)与基于语义嵌入的向量数据库相结合,构建一个可检索的ATR记忆库,通过匹配历史已知标签的图像样例来增强当前识别任务的上下文理解能力,从而提升分类准确率和尺寸回归性能。

链接: https://arxiv.org/abs/2602.04712
作者: David F. Ramirez,Tim Overman,Kristen Jaskie,Joe Marvin,Andreas Spanias
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: Submitted to 2026 IEEE Radar Conference

点击查看摘要

Abstract:We present a visual-context image retrieval-augmented generation (ImageRAG) assisted AI agent for automatic target recognition (ATR) of synthetic aperture radar (SAR). SAR is a remote sensing method used in defense and security applications to detect and monitor the positions of military vehicles, which may appear indistinguishable in images. Researchers have extensively studied SAR ATR to improve the differentiation and identification of vehicle types, characteristics, and measurements. Test examples can be compared with known vehicle target types to improve recognition tasks. New methods enhance the capabilities of neural networks, transformer attention, and multimodal large language models. An agentic AI method may be developed to utilize a defined set of tools, such as searching through a library of similar examples. Our proposed method, SAR Retrieval-Augmented Generation (SAR-RAG), combines a multimodal large language model (MLLM) with a vector database of semantic embeddings to support contextual search for image exemplars with known qualities. By recovering past image examples with known true target types, our SAR-RAG system can compare similar vehicle categories, achieving improved ATR prediction accuracy. We evaluate this through search and retrieval metrics, categorical classification accuracy, and numeric regression of vehicle dimensions. These metrics all show improvements when SAR-RAG is added to an MLLM baseline method as an attached ATR memory bank.
zh

[CV-17] Annotation Free Spacecraft Detection and Segmentation using Vision Language Models ICRA2026

【速读】:该论文旨在解决空间目标检测与分割任务中因手动标注困难(如低可见度、光照变化及物体与行星背景融合)而导致的模型训练难题。其核心解决方案是提出一种基于视觉语言模型(Vision Language Models, VLMs)的无标注检测与分割流程:首先利用预训练VLM自动为少量未标注的真实数据生成伪标签,随后在教师-学生知识蒸馏框架中使用这些伪标签训练轻量级模型。尽管伪标签存在噪声,但通过蒸馏过程显著提升了性能,相较于直接零样本VLM推理,在SPARK-2024、SPEED+和TANGO数据集上的平均精度(AP)提升最高达10个百分点。

链接: https://arxiv.org/abs/2602.04699
作者: Samet Hicsonmez,Jose Sosa,Dan Pineau,Inder Pal Singh,Arunkumar Rathinam,Abd El Rahman Shabayek,Djamila Aouada
机构: Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg (卢森堡大学信息安全、可靠性与信任跨学科中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICRA 2026

点击查看摘要

Abstract:Vision Language Models (VLMs) have demonstrated remarkable performance in open-world zero-shot visual recognition. However, their potential in space-related applications remains largely unexplored. In the space domain, accurate manual annotation is particularly challenging due to factors such as low visibility, illumination variations, and object blending with planetary backgrounds. Developing methods that can detect and segment spacecraft and orbital targets without requiring extensive manual labeling is therefore of critical importance. In this work, we propose an annotation-free detection and segmentation pipeline for space targets using VLMs. Our approach begins by automatically generating pseudo-labels for a small subset of unlabeled real data with a pre-trained VLM. These pseudo-labels are then leveraged in a teacher-student label distillation framework to train lightweight models. Despite the inherent noise in the pseudo-labels, the distillation process leads to substantial performance gains over direct zero-shot VLM inference. Experimental evaluations on the SPARK-2024, SPEED+, and TANGO datasets on segmentation tasks demonstrate consistent improvements in average precision (AP) by up to 10 points. Code and models are available at this https URL.
zh

[CV-18] DRMOT: A Dataset and Framework for RGBD Referring Multi-Object Tracking

【速读】:该论文旨在解决现有 referring multi-object tracking (RMOT) 模型仅依赖2D RGB图像导致的跟踪精度受限问题,尤其是在处理具有复杂空间语义描述(如“离相机最近的人”)以及严重遮挡场景下难以保持目标身份一致性的问题。其核心挑战在于缺乏显式的3D空间信息以支撑精准的空间语义定位与稳定轨迹关联。解决方案的关键是提出了一种新的任务——RGBD Referring Multi-Object Tracking (DRMOT),该任务要求模型融合RGB、深度(Depth, D)和语言(Language, L)三模态信息实现3D感知的目标跟踪;并构建了专门的数据集DRSet(包含187个场景的RGB图像与深度图、240条语言描述,其中56条含深度相关信息),同时设计了DRTrack框架,通过多模态大语言模型(MLLM)引导的深度感知目标定位和基于深度线索的轨迹关联机制,显著提升了空间语义 grounding 能力与跟踪鲁棒性。

链接: https://arxiv.org/abs/2602.04692
作者: Sijia Chen,Lijuan Ma,Yanqiu Yu,En Yu,Liman Liu,Wenbing Tao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Referring Multi-Object Tracking (RMOT) aims to track specific targets based on language descriptions and is vital for interactive AI systems such as robotics and autonomous driving. However, existing RMOT models rely solely on 2D RGB data, making it challenging to accurately detect and associate targets characterized by complex spatial semantics (e.g., ``the person closest to the camera’‘) and to maintain reliable identities under severe occlusion, due to the absence of explicit 3D spatial information. In this work, we propose a novel task, RGBD Referring Multi-Object Tracking (DRMOT), which explicitly requires models to fuse RGB, Depth (D), and Language (L) modalities to achieve 3D-aware tracking. To advance research on the DRMOT task, we construct a tailored RGBD referring multi-object tracking dataset, named DRSet, designed to evaluate models’ spatial-semantic grounding and tracking capabilities. Specifically, DRSet contains RGB images and depth maps from 187 scenes, along with 240 language descriptions, among which 56 descriptions incorporate depth-related information. Furthermore, we propose DRTrack, a MLLM-guided depth-referring tracking framework. DRTrack performs depth-aware target grounding from joint RGB-D-L inputs and enforces robust trajectory association by incorporating depth cues. Extensive experiments on the DRSet dataset demonstrate the effectiveness of our framework.
zh

[CV-19] REDistill: Robust Estimator Distillation for Balancing Robustness and Efficiency

【速读】:该论文旨在解决知识蒸馏(Knowledge Distillation, KD)中教师模型预测存在噪声或过度自信的问题,传统基于KL散度的KD方法假设教师输出可靠,但在实际应用中常因教师模型不确定性导致学生模型性能下降。解决方案的关键在于提出REDistill(Robust Estimator Distillation),其核心是用幂发散损失(power divergence loss)替代标准KD目标函数,该损失函数作为KL散度的推广形式,能够自适应地降低不可靠教师输出的权重,同时保留有信息量的logit关系,从而在不依赖特定模型超参数调优的前提下实现对教师噪声的统一建模与鲁棒处理。

链接: https://arxiv.org/abs/2602.04677
作者: Ondrej Tybl,Lukas Neumann
机构: Czech Technical University (捷克技术大学); FEE (工程学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Knowledge Distillation (KD) transfers knowledge from a large teacher model to a smaller student by aligning their predictive distributions. However, conventional KD formulations - typically based on Kullback-Leibler divergence - assume that the teacher provides reliable soft targets. In practice, teacher predictions are often noisy or overconfident, and existing correction-based approaches rely on ad-hoc heuristics and extensive hyper-parameter tuning, which hinders generalization. We introduce REDistill (Robust Estimator Distillation), a simple yet principled framework grounded in robust statistics. REDistill replaces the standard KD objective with a power divergence loss, a generalization of KL divergence that adaptively downweights unreliable teacher output while preserving informative logit relationships. This formulation provides a unified and interpretable treatment of teacher noise, requires only logits, integrates seamlessly into existing KD pipelines, and incurs negligible computational overhead. Extensive experiments on CIFAR-100 and ImageNet-1k demonstrate that REDistill consistently improves student accuracy in diverse teacher-student architectures. Remarkably, it achieves these gains without model-specific hyper-parameter tuning, underscoring its robustness and strong generalization to unseen teacher-student pairs.
zh

[CV-20] AGILE: Hand-Object Interaction Reconstruction from Video via Agent ic Generation

【速读】:该论文旨在解决从单目视频中重建动态手物交互(dynamic hand-object interactions)的两大难题:一是现有方法依赖神经渲染时,在严重遮挡下常生成碎片化、无法用于仿真的几何结构;二是对脆弱的结构光恢复(Structure-from-Motion, SfM)初始化的强依赖导致在真实场景视频中频繁失败。其解决方案的关键在于提出AGILE框架,通过两个核心创新实现突破:首先,采用代理式(agentic)流水线,利用视觉语言模型(Vision-Language Model, VLM)引导生成模型合成完整且纹理高保真的物体网格,从而摆脱视频遮挡影响;其次,摒弃SfM初始化,提出鲁棒的“锚定-跟踪”策略,基于基础模型在交互起始帧初始化物体姿态,并通过生成资产与视频观测之间的强视觉相似性进行时序传播;最终结合接触感知优化,融合语义、几何与交互稳定性约束,确保物理合理性,从而生成可用于机器人仿真和真实世界部署的高质量数字孪生资产。

链接: https://arxiv.org/abs/2602.04672
作者: Jin-Chuan Shi,Binhong Ye,Tao Liu,Junzhe He,Yangjinhui Xu,Xiaoyang Liu,Zeju Li,Hao Chen,Chunhua Shen
机构: State Key Lab of CAD & CG, Zhejiang University (浙江大学CAD与CG国家重点实验室); Zhejiang University of Technology (浙江工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
备注: 11 pages

点击查看摘要

Abstract:Reconstructing dynamic hand-object interactions from monocular videos is critical for dexterous manipulation data collection and creating realistic digital twins for robotics and VR. However, current methods face two prohibitive barriers: (1) reliance on neural rendering often yields fragmented, non-simulation-ready geometries under heavy occlusion, and (2) dependence on brittle Structure-from-Motion (SfM) initialization leads to frequent failures on in-the-wild footage. To overcome these limitations, we introduce AGILE, a robust framework that shifts the paradigm from reconstruction to agentic generation for interaction learning. First, we employ an agentic pipeline where a Vision-Language Model (VLM) guides a generative model to synthesize a complete, watertight object mesh with high-fidelity texture, independent of video occlusions. Second, bypassing fragile SfM entirely, we propose a robust anchor-and-track strategy. We initialize the object pose at a single interaction onset frame using a foundation model and propagate it temporally by leveraging the strong visual similarity between our generated asset and video observations. Finally, a contact-aware optimization integrates semantic, geometric, and interaction stability constraints to enforce physical plausibility. Extensive experiments on HO3D, DexYCB, and in-the-wild videos reveal that AGILE outperforms baselines in global geometric accuracy while demonstrating exceptional robustness on challenging sequences where prior art frequently collapses. By prioritizing physical validity, our method produces simulation-ready assets validated via real-to-sim retargeting for robotic applications.
zh

[CV-21] PIO-FVLM: Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)中冗余视觉标记(visual tokens)导致推理效率低下的问题,现有方法多依赖于基于视觉标记间相似性或跨模态相似性的启发式压缩策略,存在压缩性能有限且难以实际部署的局限。其解决方案的关键在于从推理目标出发,将视觉标记压缩转化为保持输出结果不变性的优化问题,并通过设计层内局部代理损失(layer-local proxy loss)生成标记级梯度显著性(token-level gradient saliency),指导视觉标记重排序,再依据非极大值抑制(Non-Maximum Suppression, NMS)原则选取对最终输出最具重要性的视觉标记。该方法无需训练(training-free),兼容FlashAttention,可独立作为编码器无关(encoder-free)方案部署,也可与如VisionZip等编码器压缩方法结合使用,实现高效且实用的视觉标记压缩。

链接: https://arxiv.org/abs/2602.04657
作者: Haokui Zhang,Congyang Ou,Dawei Yan,Peng Wang,Qingsen Yan,Ying Li,Rong Xiao,Chunhua Shen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, reducing redundant visual tokens in vision-language models (VLMs) to accelerate VLM inference has emerged as a hot topic. However, most existing methods rely on heuristics constructed based on inter-visual-token similarity or cross-modal visual-text similarity, which gives rise to certain limitations in compression performance and practical deployment. In contrast, we propose PIO-FVLM from the perspective of inference objectives, which transforms visual token compression into preserving output result invariance and selects tokens primarily by their importance to this goal. Specially, vision tokens are reordered with the guidance of token-level gradient saliency generated by our designed layer-local proxy loss, a coarse constraint from the current layer to the final result. Then the most valuable vision tokens are selected following the non-maximum suppression (NMS) principle. The proposed PIO-FVLM is training-free and compatible with FlashAttention, friendly to practical application and deployment. It can be deployed independently as an encoder-free method, or combined with encoder compression approaches like VisionZip for use as an encoder-involved method. On LLaVA-Next-7B, PIO-FVLM retains just 11.1% of visual tokens but maintains 97.2% of the original performance, with a 2.67 \times prefill speedup, 2.11 \times inference speedup, 6.22 \times lower FLOPs, and 6.05 \times reduced KV Cache overhead. Our code is available at this https URL.
zh

[CV-22] A labeled dataset of simulated phlebotomy procedures for medical AI: polygon annotations for object detection and human-object interaction

【速读】:该论文旨在解决医学培训中自动化评估与反馈系统缺失的问题,特别是针对静脉采血(phlebotomy)操作流程的智能化分析与教学支持。其解决方案的关键在于构建一个大规模、高质量的标注图像数据集,包含11,884张在受控条件下采集的模拟采血过程图像,每张图像均带有五类医疗相关对象(注射器、止血带、消毒棉片、手套和训练手臂)的多边形分割标注,并通过结构相似性指数(SSIM)过滤冗余帧、自动人脸匿名化处理,最终以YOLOv8兼容格式输出。该数据集按70%/15%/15%划分训练、验证和测试子集,为开发基于视觉的工具检测、步骤识别、流程分析及合规性检查等应用提供了坚实基础,从而推动医疗训练自动化和人机交互研究的发展。

链接: https://arxiv.org/abs/2602.04624
作者: Raúl Jiménez Cruz,César Torres-Huitzil,Marco Franceschetti,Ronny Seiger,Luciano García-Bañuelos,Barbara Weber
机构: University of St.Gallen (圣加仑大学); University of Guadalajara (瓜达拉哈拉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This data article presents a dataset of 11,884 labeled images documenting a simulated blood extraction (phlebotomy) procedure performed on a training arm. Images were extracted from high-definition videos recorded under controlled conditions and curated to reduce redundancy using Structural Similarity Index Measure (SSIM) filtering. An automated face-anonymization step was applied to all videos prior to frame selection. Each image contains polygon annotations for five medically relevant classes: syringe, rubber band, disinfectant wipe, gloves, and training arm. The annotations were exported in a segmentation format compatible with modern object detection frameworks (e.g., YOLOv8), ensuring broad usability. This dataset is partitioned into training (70%), validation (15%), and test (15%) subsets and is designed to advance research in medical training automation and human-object interaction. It enables multiple applications, including phlebotomy tool detection, procedural step recognition, workflow analysis, conformance checking, and the development of educational systems that provide structured feedback to medical trainees. The data and accompanying label files are publicly available on Zenodo.
zh

[CV-23] ImmuVis: Hyperconvolutional Foundation Model for Imaging Mass Cytometry

【速读】:该论文旨在解决成像质谱(Imaging Mass Cytometry, IMC)中多通道图像建模的挑战,即传统视觉骨干网络假设固定的通道空间,而IMC的实际分子标记组合在不同研究中具有高度可变性,导致标准卷积模型难以直接适用。解决方案的关键在于提出一种标记自适应超卷积(marker-adaptive hyperconvolutions)机制,通过从学习到的标记嵌入中动态生成卷积核,使单一模型能够无需重新训练即可处理任意测量的标记子集,从而实现对异构IMC数据的高效、通用建模。

链接: https://arxiv.org/abs/2602.04585
作者: Marcin Możejko,Dawid Uchal,Krzysztof Gogolewski,Piotr Kupidura,Szymon Łukasik,Jakub Giezgała,Tomasz Nocoń,Kacper Pietrzyk,Robert Pieniuta,Mateusz Sulimowicz,Michal Orzyłowski,Tomasz Siłkowski,Karol Zagródka,Eike Staub,Ewa Szczurek
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 6 figures

点击查看摘要

Abstract:We present ImmuVis, an efficient convolutional foundation model for imaging mass cytometry (IMC), a high-throughput multiplex imaging technology that handles molecular marker measurements as image channels and enables large-scale spatial tissue profiling. Unlike natural images, multiplex imaging lacks a fixed channel space, as real-world marker sets vary across studies, violating a core assumption of standard vision backbones. To address this, ImmuVis introduces marker-adaptive hyperconvolutions that generate convolutional kernels from learned marker embeddings, enabling a single model to operate on arbitrary measured marker subsets without retraining. We pretrain ImmuVis on the largest to-date dataset, IMC17M (28 cohorts, 24,405 images, 265 markers, over 17M patches), using self-supervised masked reconstruction. ImmuVis outperforms SOTA baselines and ablations in virtual staining and downstream classification tasks at substantially lower compute cost than transformer-based alternatives, and is the sole model that provides calibrated uncertainty via a heteroscedastic likelihood objective. These results position ImmuVis as a practical, efficient foundation model for real-world IMC modeling.
zh

[CV-24] SalFormer360: a transformer-based saliency estimation model for 360-degree videos

【速读】:该论文旨在解决360度视频中视觉显著性估计(saliency estimation)的准确性问题,以支持视口预测(viewport prediction)和沉浸式内容优化等应用。其解决方案的关键在于提出了一种基于Transformer架构的新型模型SalFormer360,该模型融合了预训练的SegFormer编码器与自定义解码器,并引入了“观看中心偏置”(Viewing Center Bias)机制,以更好地模拟用户在360度环境中的注意力分布。实验表明,该方法在三个主流基准数据集上均显著优于现有最先进方法,在Pearson相关系数指标上提升幅度达2.5%至18.6%。

链接: https://arxiv.org/abs/2602.04584
作者: Mahmoud Z. A. Wahba,Francesco Barbato,Sara Baldoni,Federica Battisti
机构: University of Padova (帕多瓦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Saliency estimation has received growing attention in recent years due to its importance in a wide range of applications. In the context of 360-degree video, it has been particularly valuable for tasks such as viewport prediction and immersive content optimization. In this paper, we propose SalFormer360, a novel saliency estimation model for 360-degree videos built on a transformer-based architecture. Our approach is based on the combination of an existing encoder architecture, SegFormer, and a custom decoder. The SegFormer model was originally developed for 2D segmentation tasks, and it has been fine-tuned to adapt it to 360-degree content. To further enhance prediction accuracy in our model, we incorporated Viewing Center Bias to reflect user attention in 360-degree environments. Extensive experiments on the three largest benchmark datasets for saliency estimation demonstrate that SalFormer360 outperforms existing state-of-the-art methods. In terms of Pearson Correlation Coefficient, our model achieves 8.4% higher performance on Sport360, 2.5% on PVS-HM, and 18.6% on VR-EyeTracking compared to previous state-of-the-art.
zh

[CV-25] PEPR: Privileged Event-based Predictive Regularization for Domain Generalization

【速读】:该论文旨在解决深度神经网络在视觉感知任务中因域偏移(domain shift)导致的泛化能力不足问题,尤其在训练数据与实际部署环境存在差异时表现不佳。其核心解决方案是基于学习使用特权信息(Learning Using Privileged Information, LUPI)范式,引入事件相机(event camera)作为训练阶段可用的特权信息源,通过提出特权事件预测正则化(Privileged Event-based Predictive Regularization, PEPR)机制,使RGB模态编码器在共享潜在空间中学习预测事件模态的隐表示,从而在不牺牲语义丰富性的前提下增强模型对昼夜等域变化的鲁棒性。关键创新在于将跨模态对齐转化为预测任务,避免了直接特征对齐导致的语义损失。

链接: https://arxiv.org/abs/2602.04583
作者: Gabriele Magrini,Federico Becattini,Niccolò Biondi,Pietro Pala
机构: University of Florence (佛罗伦萨大学); University of Siena (锡耶纳大学); University of Trento (特伦托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep neural networks for visual perception are highly susceptible to domain shift, which poses a critical challenge for real-world deployment under conditions that differ from the training data. To address this domain generalization challenge, we propose a cross-modal framework under the learning using privileged information (LUPI) paradigm for training a robust, single-modality RGB model. We leverage event cameras as a source of privileged information, available only during training. The two modalities exhibit complementary characteristics: the RGB stream is semantically dense but domain-dependent, whereas the event stream is sparse yet more domain-invariant. Direct feature alignment between them is therefore suboptimal, as it forces the RGB encoder to mimic the sparse event representation, thereby losing semantic detail. To overcome this, we introduce Privileged Event-based Predictive Regularization (PEPR), which reframes LUPI as a predictive problem in a shared latent space. Instead of enforcing direct cross-modal alignment, we train the RGB encoder with PEPR to predict event-based latent features, distilling robustness without sacrificing semantic richness. The resulting standalone RGB model consistently improves robustness to day-to-night and other domain shifts, outperforming alignment-based baselines across object detection and semantic segmentation.
zh

[CV-26] Understanding Degradation with Vision Language Model

【速读】:该论文旨在解决视觉退化理解(visual degradation understanding)这一计算机视觉中的关键挑战,即如何从图像中准确识别退化类型、参数键(parameter keys)及其连续物理值,而不仅限于定性描述。传统视觉语言模型(Vision-Language Models, VLMs)虽能提供直观的退化描述,但在建模退化背后的物理参数方面表现不足。其解决方案的关键在于将退化理解重新定义为一个分层结构预测任务,并通过自回归式的下一个词预测范式统一不同空间的子任务(如退化类型分类、参数键识别与连续值估计),并证明该方法的误差受值空间量化网格的上界约束。基于此理论基础,作者提出DU-VLM模型,结合监督微调与强化学习训练,并利用结构化奖励提升性能;同时,该模型可作为零样本控制器直接作用于预训练扩散模型,实现无需微调生成主干即可高保真恢复图像的效果。

链接: https://arxiv.org/abs/2602.04565
作者: Guanzhou Lan,Chenyi Liao,Yuqi Yang,Qianli Ma,Zhigang Wang,Dong Wang,Bin Zhao,Xuelong Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages

点击查看摘要

Abstract:Understanding visual degradations is a critical yet challenging problem in computer vision. While recent Vision-Language Models (VLMs) excel at qualitative description, they often fall short in understanding the parametric physics underlying image degradations. In this work, we redefine degradation understanding as a hierarchical structured prediction task, necessitating the concurrent estimation of degradation types, parameter keys, and their continuous physical values. Although these sub-tasks operate in disparate spaces, we prove that they can be unified under one autoregressive next-token prediction paradigm, whose error is bounded by the value-space quantization grid. Building on this insight, we introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning using structured rewards. Furthermore, we show that DU-VLM can serve as a zero-shot controller for pre-trained diffusion models, enabling high-fidelity image restoration without fine-tuning the generative backbone. We also introduce \textbfDU-110k, a large-scale dataset comprising 110,000 clean-degraded pairs with grounded physical annotations. Extensive experiments demonstrate that our approach significantly outperforms generalist baselines in both accuracy and robustness, exhibiting generalization to unseen distributions.
zh

[CV-27] Nix and Fix: Targeting 1000x Compression of 3D Gaussian Splatting with Diffusion Models

【速读】:该论文旨在解决3D高斯泼溅(3D Gaussian Splatting, 3DGS)在极低码率下压缩时出现显著视觉伪影的问题,从而提升其在沉浸式通信等对带宽敏感场景中的应用可行性。解决方案的关键在于提出NiFi方法,通过基于扩散模型的一步式蒸馏机制实现artifact-aware的图像恢复,能够在极低压缩率(如0.1 MB)下显著改善感知质量,并在相近感知性能下实现相较于原始3DGS约1000倍的码率降低。

链接: https://arxiv.org/abs/2602.04549
作者: Cem Eteke,Enzo Tartaglione
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) revolutionized novel view rendering. Instead of inferring from dense spatial points, as implicit representations do, 3DGS uses sparse Gaussians. This enables real-time performance but increases space requirements, hindering applications such as immersive communication. 3DGS compression emerged as a field aimed at alleviating this issue. While impressive progress has been made, at low rates, compression introduces artifacts that degrade visual quality significantly. We introduce NiFi, a method for extreme 3DGS compression through restoration via artifact-aware, diffusion-based one-step distillation. We show that our method achieves state-of-the-art perceptual quality at extremely low rates, down to 0.1 MB, and towards 1000x rate improvement over 3DGS at comparable perceptual performance. The code will be open-sourced upon acceptance.
zh

[CV-28] OmniRad: A Radiological Foundation Model for Multi-Task Medical Image Analysis

【速读】:该论文旨在解决医学影像分析中预训练视觉表征模型在跨模态、多任务场景下的通用性与迁移能力不足的问题。现有方法往往针对特定任务或模态进行优化,缺乏可复用的高质量特征表示,限制了下游任务的性能提升和部署效率。解决方案的关键在于提出OmniRad——一个基于自监督学习、在120万张医学图像上预训练的放射学基础模型(radiological foundation model),其设计遵循放射学启发原则,强调特征表示的复用性和跨任务迁移能力。通过在多个公共基准数据集上的评估,包括分类和分割任务,验证了OmniRad在冻结主干网络时仍能实现显著的Dice分数提升,并在MedMNISTv2上相较其他基础模型提高F1分数达2.05%,表明其具备强大的通用表征能力和任务适应性。

链接: https://arxiv.org/abs/2602.04547
作者: Luca Zedda,Andrea Loddo,Cecilia Di Ruberto
机构: University of Cagliari (卡利亚里大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 19 pages, 4 figures, 12 tables

点击查看摘要

Abstract:Radiological analysis increasingly benefits from pretrained visual representations that can support heterogeneous downstream tasks across imaging modalities. In this work, we introduce OmniRad, a self-supervised radiological foundation model pretrained on 1.2 million medical images, designed with radiology-inspired principles emphasizing representation reuse and cross-task transferability. We evaluate the pretrained encoder under multiple downstream adaptation regimes, including lightweight task-specific adapters with a frozen backbone as well as full end-to-end fine-tuning for classification, allowing us to assess both representation quality and task-specific performance. OmniRad is evaluated on a broad suite of public benchmarks spanning classification and segmentation across multiple modalities. On the MedMNISTv2 collection, OmniRad improves classification F1 by up to 2.05% over competing foundation models. For dense prediction, OmniRad attains mean Dice score improvements across six MedSegBench datasets when using frozen representations. Qualitative analyses and latent-space visualizations suggest improved feature clustering and modality-related separation.
zh

[CV-29] SLUM-i: Semi-supervised Learning for Urban Mapping of Informal Settlements and Data Quality Benchmarking

【速读】:该论文旨在解决低收入和中等收入国家大城市中非正式居住区(informal settlements)的大规模精准制图难题,其核心挑战在于标注数据稀缺、光谱混淆严重以及标注噪声显著。为应对这些问题,作者构建了涵盖拉合尔、卡拉奇和孟买三地的基准数据集(共1,869 km²),并扩展至全球五个现有基准,覆盖八座城市以评估模型鲁棒性。解决方案的关键创新在于提出一种新型半监督分割框架:通过引入类感知自适应阈值机制(Class-Aware Adaptive Thresholding)动态调整置信度阈值,避免少数类被抑制;同时设计原型库系统(Prototype Bank System)利用历史高保真特征表示锚定预测结果,增强语义一致性。实验表明,该方法在跨域迁移能力上表现优异——仅用10%源域标签训练即可在未见地理区域达到0.461 mIoU,优于全监督模型的零样本泛化性能。

链接: https://arxiv.org/abs/2602.04525
作者: Muhammad Taha Mukhtar(1 and 2),Syed Musa Ali Kazmi(1),Khola Naseem(2),Muhammad Ali Chattha(2),Andreas Dengel(2),Sheraz Ahmed(2),Muhammad Naseer Bajwa(1),Muhammad Imran Malik(1) ((1) National University of Sciences and Technology (NUST), Islamabad, Pakistan, (2) German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 8 figures, 5 tables

点击查看摘要

Abstract:Rapid urban expansion has fueled the growth of informal settlements in major cities of low- and middle-income countries, with Lahore and Karachi in Pakistan and Mumbai in India serving as prominent examples. However, large-scale mapping of these settlements is severely constrained not only by the scarcity of annotations but by inherent data quality challenges, specifically high spectral ambiguity between formal and informal structures and significant annotation noise. We address this by introducing a benchmark dataset for Lahore, constructed from scratch, along with companion datasets for Karachi and Mumbai, which were derived from verified administrative boundaries, totaling 1,869 \textkm^2 of area. To evaluate the global robustness of our framework, we extend our experiments to five additional established benchmarks, encompassing eight cities across three continents, and provide comprehensive data quality assessments of all datasets. We also propose a new semi-supervised segmentation framework designed to mitigate the class imbalance and feature degradation inherent in standard semi-supervised learning pipelines. Our method integrates a Class-Aware Adaptive Thresholding mechanism that dynamically adjusts confidence thresholds to prevent minority class suppression and a Prototype Bank System that enforces semantic consistency by anchoring predictions to historically learned high-fidelity feature representations. Extensive experiments across a total of eight cities spanning three continents demonstrate that our approach outperforms state-of-the-art semi-supervised baselines. Most notably, our method demonstrates superior domain transfer capability whereby a model trained on only 10% of source labels reaches a 0.461 mIoU on unseen geographies and outperforms the zero-shot generalization of fully supervised models.
zh

[CV-30] S-MUSt3R: Sliding Multi-view 3D Reconstruction

【速读】:该论文旨在解决基于单目RGB图像的大规模3D重建中,由于内存限制导致的扩展性瓶颈问题。现有基础模型(foundation models)虽在未标定图像下的3D感知任务中表现出色,但在处理长序列RGB数据时难以维持高效与稳定。其解决方案的关键在于提出S-MUSt3R流水线,通过简单的序列分段策略(sequence segmentation)、分段对齐(segment alignment)以及轻量级闭环优化(lightweight loop closure optimization),在不重新训练模型的前提下,显著提升了MUSt3R模型在大规模场景下的可扩展性,实现了与复杂传统方法相当的轨迹估计和3D重建性能,并且直接输出度量空间中的预测结果,具备实际应用优势。

链接: https://arxiv.org/abs/2602.04517
作者: Leonid Antsfeld,Boris Chidlovskii,Yohann Cabon,Vincent Leroy,Jerome Revaud
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 8 pages, 5 figures, 5 tables

点击查看摘要

Abstract:The recent paradigm shift in 3D vision led to the rise of foundation models with remarkable capabilities in 3D perception from uncalibrated images. However, extending these models to large-scale RGB stream 3D reconstruction remains challenging due to memory limitations. This work proposes S-MUSt3R, a simple and efficient pipeline that extends the limits of foundation models for monocular 3D reconstruction. Our approach addresses the scalability bottleneck of foundation models through a simple strategy of sequence segmentation followed by segment alignment and lightweight loop closure optimization. Without model retraining, we benefit from remarkable 3D reconstruction capacities of MUSt3R model and achieve trajectory and reconstruction performance comparable to traditional methods with more complex architecture. We evaluate S-MUSt3R on TUM, 7-Scenes and proprietary robot navigation datasets and show that S-MUSt3R runs successfully on long RGB sequences and produces accurate and consistent 3D reconstruction. Our results highlight the potential of leveraging the MUSt3R model for scalable monocular 3D scene in real-world settings, with an important advantage of making predictions directly in the metric space.
zh

[CV-31] EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models

【速读】:该论文旨在解决人形机器人在真实场景中部署时面临的多模态感知、运动规划与操作执行的紧密集成问题,尤其在部分观测信息和动态环境条件下实现子任务间的鲁棒切换。其解决方案的关键在于提出了一种新任务范式——EgoActing,要求将高层指令直接映射为精确的空间感知型动作;并进一步设计了EgoActor模型,这是一个统一且可扩展的视觉语言模型(Vision-Language Model, VLM),能够实时预测行走、转向、侧移、高度变化等运动基元、头部动作、操作命令及人机交互行为,从而协调感知与执行。EgoActor通过来自真实世界示范的视角RGB数据、空间推理问答以及仿真环境演示的广泛监督信号进行训练,实现了上下文感知的稳健决策和快速动作推理(<1秒),并在模拟与真实环境中验证了其跨任务和未见环境的良好泛化能力。

链接: https://arxiv.org/abs/2602.04515
作者: Yu Bai,MingMing Yu,Chaojie Li,Ziyi Bai,Xinlong Wang,Börje F. Karlsson
机构: Beijing Academy of Artificial Intelligence (北京人工智能研究院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deploying humanoid robots in real-world settings is fundamentally challenging, as it demands tight integration of perception, locomotion, and manipulation under partial-information observations and dynamically changing environments. As well as transitioning robustly between sub-tasks of different types. Towards addressing these challenges, we propose a novel task - EgoActing, which requires directly grounding high-level instructions into various, precise, spatially aware humanoid actions. We further instantiate this task by introducing EgoActor, a unified and scalable vision-language model (VLM) that can predict locomotion primitives (e.g., walk, turn, move sideways, change height), head movements, manipulation commands, and human-robot interactions to coordinate perception and execution in real-time. We leverage broad supervision over egocentric RGB-only data from real-world demonstrations, spatial reasoning question-answering, and simulated environment demonstrations, enabling EgoActor to make robust, context-aware decisions and perform fluent action inference (under 1s) with both 8B and 4B parameter models. Extensive evaluations in both simulated and real-world environments demonstrate that EgoActor effectively bridges abstract task planning and concrete motor execution, while generalizing across diverse tasks and unseen environments.
zh

[CV-32] Vision-aligned Latent Reason ing for Multi-modal Large Language Model

【速读】:该论文旨在解决多模态大语言模型(Multi-modal Large Language Models, MLLMs)在需要多步推理任务中表现不佳的问题,其核心瓶颈在于长文本生成过程中视觉信息的逐步稀释,限制了模型对测试时扩展(test-time scaling)的有效利用。解决方案的关键在于提出一种名为视觉对齐潜在推理(Vision-aligned Latent Reasoning, VaLR)的框架,该框架通过在每次思维链(Chain of Thought)推理步骤前动态生成与视觉对齐的潜在标记(latent tokens),引导模型基于潜在空间中的感知线索进行推理;具体而言,VaLR通过将MLLM中间嵌入与视觉编码器输出对齐,从而在推理过程中有效保留视觉知识,实验证明该方法在多个需长上下文理解或精确视觉感知的基准上显著优于现有方法,并展现出此前MLLM中未观察到的测试时扩展特性。

链接: https://arxiv.org/abs/2602.04476
作者: Byungwoo Jeon,Yoonwoo Jeong,Hyunseok Lee,Minsu Cho,Jinwoo Shin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages; 5 figures

点击查看摘要

Abstract:Despite recent advancements in Multi-modal Large Language Models (MLLMs) on diverse understanding tasks, these models struggle to solve problems which require extensive multi-step reasoning. This is primarily due to the progressive dilution of visual information during long-context generation, which hinders their ability to fully exploit test-time scaling. To address this issue, we introduce Vision-aligned Latent Reasoning (VaLR), a simple, yet effective reasoning framework that dynamically generates vision-aligned latent tokens before each Chain of Thought reasoning step, guiding the model to reason based on perceptual cues in the latent space. Specifically, VaLR is trained to preserve visual knowledge during reasoning by aligning intermediate embeddings of MLLM with those from vision encoders. Empirical results demonstrate that VaLR consistently outperforms existing approaches across a wide range of benchmarks requiring long-context understanding or precise visual perception, while exhibiting test-time scaling behavior not observed in prior MLLMs. In particular, VaLR improves the performance significantly from 33.0% to 52.9% on VSI-Bench, achieving a 19.9%p gain over Qwen2.5-VL.
zh

[CV-33] SALAD-Pan: Sensor-Agnostic Latent Adaptive Diffusion for Pan-Sharpening

【速读】:该论文旨在解决现有扩散模型在多光谱(Multispectral, MS)图像融合中存在高延迟和传感器依赖性的问题,即大多数方法在像素空间进行扩散且需为不同传感器单独训练模型,限制了效率与泛化能力。其解决方案的关键在于提出一种传感器无关的潜在空间扩散方法SALAD-Pan:首先通过单通道变分自编码器(VAE)将高分辨率多光谱图像编码为紧凑潜在表示,支持不同通道数的MS图像并奠定加速基础;其次引入单向和双向交互控制结构,分别将光谱物理特性、全色(PAN)图与MS图注入扩散主干网络,实现高精度融合;最后在扩散模型中心层加入轻量级跨光谱注意力模块,增强光谱关联性以提升融合精度和光谱一致性。

链接: https://arxiv.org/abs/2602.04473
作者: Junjie Li,Congyang Ou,Haokui Zhang,Guoting Wei,Shengqin Jiang,Ying Li,Chunhua Shen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, diffusion models bring novel insights for Pan-sharpening and notably boost fusion precision. However, most existing models perform diffusion in the pixel space and train distinct models for different multispectral (MS) imagery, suffering from high latency and sensor-specific limitations. In this paper, we present SALAD-Pan, a sensor-agnostic latent space diffusion method for efficient pansharpening. Specifically, SALAD-Pan trains a band-wise single-channel VAE to encode high-resolution multispectral (HRMS) into compact latent representations, supporting MS images with various channel counts and establishing a basis for acceleration. Then spectral physical properties, along with PAN and MS images, are injected into the diffusion backbone through unidirectional and bidirectional interactive control structures respectively, achieving high-precision fusion in the diffusion process. Finally, a lightweight cross-spectral attention module is added to the central layer of diffusion model, reinforcing spectral connections to boost spectral consistency and further elevate fusion precision. Experimental results on GaoFen-2, QuickBird, and WorldView-3 demonstrate that SALAD-Pan outperforms state-of-the-art diffusion-based methods across all three datasets, attains a 2-3x inference speedup, and exhibits robust zero-shot (cross-sensor) capability.
zh

[CV-34] mporal Slowness in Central Vision Drives Semantic Object Learning ICLR2026

【速读】:该论文旨在解决如何从类人视觉经验中形成语义物体表征的问题,特别是探究中心视觉(central vision)与时间缓慢性学习(slowness learning)在这一过程中的作用。解决方案的关键在于:利用Ego4D数据集模拟五个月的人类类视觉体验,并通过先进的注视预测模型生成注视坐标,从而提取模拟中心视野的图像片段;在此基础上,训练一个基于时间对比的自监督学习模型(time-contrastive Self-Supervised Learning),以同时利用中心视觉的空间聚焦特性与时间缓慢性约束。实验表明,结合中心视觉可增强前景物体特征的提取,而引入时间缓慢性(尤其在注视稳定期)则有助于编码更广泛的语义信息,从而提升对物体多维语义特征的表征能力。

链接: https://arxiv.org/abs/2602.04462
作者: Timothy Schaumlöffel,Arthur Aubret,Gemma Roig,Jochen Triesch
机构: Goethe University Frankfurt (歌德大学法兰克福分校); The Hessian Center for Artificial Intelligence (hessian.AI) (黑森州人工智能中心); Frankfurt Institute for Advanced Studies (法兰克福高等研究院); Xidian-FIAS international Joint Research Center (西安电子科技大学-法兰克福高等研究院联合研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2026

点击查看摘要

Abstract:Humans acquire semantic object representations from egocentric visual streams with minimal supervision. Importantly, the visual system processes with high resolution only the center of its field of view and learns similar representations for visual inputs occurring close in time. This emphasizes slowly changing information around gaze locations. This study investigates the role of central vision and slowness learning in the formation of semantic object representations from human-like visual experience. We simulate five months of human-like visual experience using the Ego4D dataset and generate gaze coordinates with a state-of-the-art gaze prediction model. Using these predictions, we extract crops that mimic central vision and train a time-contrastive Self-Supervised Learning model on them. Our results show that combining temporal slowness and central vision improves the encoding of different semantic facets of object representations. Specifically, focusing on central vision strengthens the extraction of foreground object features, while considering temporal slowness, especially during fixational eye movements, allows the model to encode broader semantic information about objects. These findings provide new insights into the mechanisms by which humans may develop semantic object representations from natural visual experience.
zh

[CV-35] Seg-ReSearch: Segmentation with Interleaved Reason ing and External Search

【速读】:该论文旨在解决当前基于多模态大语言模型(Multimodal Large Language Models, MLLMs)的图像/视频分割系统受限于模型内部冻结知识的问题,这些系统难以应对需要实时信息或领域特定概念的开放世界场景。解决方案的关键在于提出一种名为Seg-ReSearch的新分割范式,其核心是通过交错式推理与外部搜索机制,使分割系统能够动态获取并利用外部知识,从而处理超出MLLM预训练知识范围的查询任务。此外,为有效训练该能力,作者设计了一种分层奖励机制,协调初始引导与渐进激励,缓解稀疏结果信号与严格步骤监督之间的矛盾。

链接: https://arxiv.org/abs/2602.04454
作者: Tianming Liang,Qirui Du,Jian-Fang Hu,Haichao Jiang,Zicheng Lin,Wei-Shi Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Segmentation based on language has been a popular topic in computer vision. While recent advances in multimodal large language models (MLLMs) have endowed segmentation systems with reasoning capabilities, these efforts remain confined by the frozen internal knowledge of MLLMs, which limits their potential for real-world scenarios that involve up-to-date information or domain-specific concepts. In this work, we propose \textbfSeg-ReSearch, a novel segmentation paradigm that overcomes the knowledge bottleneck of existing approaches. By enabling interleaved reasoning and external search, Seg-ReSearch empowers segmentation systems to handle dynamic, open-world queries that extend beyond the frozen knowledge of MLLMs. To effectively train this capability, we introduce a hierarchical reward design that harmonizes initial guidance with progressive incentives, mitigating the dilemma between sparse outcome signals and rigid step-wise supervision. For evaluation, we construct OK-VOS, a challenging benchmark that explicitly requires outside knowledge for video object segmentation. Experiments on OK-VOS and two existing reasoning segmentation benchmarks demonstrate that our Seg-ReSearch improves state-of-the-art approaches by a substantial margin. Code and data will be released at this https URL.
zh

[CV-36] SynthVerse: A Large-Scale Diverse Synthetic Dataset for Point Tracking

【速读】:该论文旨在解决当前通用点跟踪(point tracking)任务中因高质量数据稀缺而导致的泛化能力不足问题,现有数据集在多样性与轨迹标注质量方面存在明显短板。其解决方案的关键在于构建一个大规模、多样化的合成数据集 SynthVerse,该数据集引入了动画电影风格内容、具身操作、场景导航和关节物体等新领域与对象类型,显著扩展了数据覆盖范围,并提供了高质量动态运动与交互信息,从而支持更鲁棒的训练与评估,同时建立了涵盖广泛域偏移的基准测试体系以系统性验证先进方法性能。

链接: https://arxiv.org/abs/2602.04441
作者: Weiguang Zhao,Haoran Xu,Xingyu Miao,Qin Zhao,Rui Zhang,Kaizhu Huang,Ning Gao,Peizhou Cao,Mingze Sun,Mulin Yu,Tao Lu,Linning Xu,Junting Dong,Jiangmiao Pang
机构: University of Liverpool (利物浦大学); Zhejiang University (浙江大学); Durham University (杜伦大学); Beihang University (北京航空航天大学); Duke Kunshan University (昆山杜克大学); Xi’an Jiaotong-Liverpool University (西交利物浦大学); Xi’an Jiaotong University (西安交通大学); Tsinghua University (清华大学); Shanghai AI Laboratory (上海人工智能实验室); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Point tracking aims to follow visual points through complex motion, occlusion, and viewpoint changes, and has advanced rapidly with modern foundation models. Yet progress toward general point tracking remains constrained by limited high-quality data, as existing datasets often provide insufficient diversity and imperfect trajectory annotations. To this end, we introduce SynthVerse, a large-scale, diverse synthetic dataset specifically designed for point tracking. SynthVerse includes several new domains and object types missing from existing synthetic datasets, such as animated-film-style content, embodied manipulation, scene navigation, and articulated objects. SynthVerse substantially expands dataset diversity by covering a broader range of object categories and providing high-quality dynamic motions and interactions, enabling more robust training and evaluation for general point tracking. In addition, we establish a highly diverse point tracking benchmark to systematically evaluate state-of-the-art methods under broader domain shifts. Extensive experiments and analyses demonstrate that training with SynthVerse yields consistent improvements in generalization and reveal limitations of existing trackers under diverse settings.
zh

[CV-37] rajVG: 3D Trajectory-Coupled Visual Geometry Learning

【速读】:该论文旨在解决前馈多帧三维重建模型在存在物体运动的视频中性能下降的问题,具体表现为全局参考信息因多重运动而模糊,局部点云图(local pointmap)则依赖于估计的相对位姿且易发生漂移,导致跨帧错位和结构重复。其解决方案的关键在于提出TrajVG框架,通过显式预测相机坐标系下的3D轨迹来建立跨帧三维对应关系;并通过几何一致性约束耦合稀疏轨迹、每帧局部点云图与相对相机位姿:(i) 双向轨迹-点云图一致性并控制梯度流,(ii) 利用静态轨迹锚点驱动的姿态一致性目标抑制动态区域梯度传播。此外,为适应真实场景视频中缺乏3D轨迹标签的情况,作者将上述约束重构为仅需伪2D轨迹的自监督目标,实现混合监督下的统一训练。

链接: https://arxiv.org/abs/2602.04439
作者: Xingyu Miao,Weiguang Zhao,Tao Lu,Linning Yu,Mulin Yu,Yang Long,Jiangmiao Pang,Junting Dong
机构: Durham University (杜伦大学); University of Liverpool (利物浦大学); Shanghai AI Lab (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Feed-forward multi-frame 3D reconstruction models often degrade on videos with object motion. Global-reference becomes ambiguous under multiple motions, while the local pointmap relies heavily on estimated relative poses and can drift, causing cross-frame misalignment and duplicated structures. We propose TrajVG, a reconstruction framework that makes cross-frame 3D correspondence an explicit prediction by estimating camera-coordinate 3D trajectories. We couple sparse trajectories, per-frame local point maps, and relative camera poses with geometric consistency objectives: (i) bidirectional trajectory-pointmap consistency with controlled gradient flow, and (ii) a pose consistency objective driven by static track anchors that suppresses gradients from dynamic regions. To scale training to in-the-wild videos where 3D trajectory labels are scarce, we reformulate the same coupling constraints into self-supervised objectives using only pseudo 2D tracks, enabling unified training with mixed supervision. Extensive experiments across 3D tracking, pose estimation, pointmap reconstruction, and video depth show that TrajVG surpasses the current feedforward performance baseline.
zh

[CV-38] Med-MMFL: A Multimodal Federated Learning Benchmark in Healthcare

【速读】:该论文旨在解决医疗领域多模态联邦学习(MultiModal Federated Learning, MMFL)缺乏标准化评估基准的问题,当前研究多集中于单模态或双模态场景,且任务类型和联邦设置较为有限,难以支撑对MMFL方法的系统性比较与进展推动。其解决方案的关键在于构建首个面向医疗领域的综合性MMFL基准——Med-MMFL,涵盖2至4种模态、共计10类医学数据(如文本、病理图像、心电图ECG、X光片、放射学报告及多种MRI序列),覆盖分割、分类、模态对齐(检索)和视觉问答(VQA)等典型任务,并在自然联邦、合成独立同分布(IID)与非独立同分布(non-IID)三种场景下进行实验验证。该基准支持对六种前沿联邦算法的公平比较,包含不同聚合策略、损失函数设计与正则化技术,从而为未来真实医疗场景下的多模态联邦学习研究提供可复现、可扩展的评估平台。

链接: https://arxiv.org/abs/2602.04416
作者: Aavash Chhetri,Bibek Niroula,Pratik Shrestha,Yash Raj Shrestha,Lesley A Anderson,Prashnna K Gyawali,Loris Bazzani,Binod Bhattarai
机构: University of Aberdeen(阿伯丁大学); NepAl Applied Mathematics and Informatics Institute for research(尼泊尔应用数学与信息研究所); University of Lausanne(洛桑大学); West Virginia University(西弗吉尼亚大学); University of Verona(维罗纳大学); University College London(伦敦大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated learning (FL) enables collaborative model training across decentralized medical institutions while preserving data privacy. However, medical FL benchmarks remain scarce, with existing efforts focusing mainly on unimodal or bimodal modalities and a limited range of medical tasks. This gap underscores the need for standardized evaluation to advance systematic understanding in medical MultiModal FL (MMFL). To this end, we introduce Med-MMFL, the first comprehensive MMFL benchmark for the medical domain, encompassing diverse modalities, tasks, and federation scenarios. Our benchmark evaluates six representative state-of-the-art FL algorithms, covering different aggregation strategies, loss formulations, and regularization techniques. It spans datasets with 2 to 4 modalities, comprising a total of 10 unique medical modalities, including text, pathology images, ECG, X-ray, radiology reports, and multiple MRI sequences. Experiments are conducted across naturally federated, synthetic IID, and synthetic non-IID settings to simulate real-world heterogeneity. We assess segmentation, classification, modality alignment (retrieval), and VQA tasks. To support reproducibility and fair comparison of future multimodal federated learning (MMFL) methods under realistic medical settings, we release the complete benchmark implementation, including data processing and partitioning pipelines, at this https URL .
zh

[CV-39] Self-evolving Embodied AI

【速读】:该论文旨在解决现有具身人工智能(Embodied AI)在真实复杂环境(in-the-wild setting)中适应性不足的问题,即当前方法受限于人工设计的静态场景与固定任务,无法应对动态变化的环境、可变的具身形态以及持续演化的认知需求。其解决方案的关键在于提出“自演化具身智能”(self-evolving embodied AI)的新范式,通过五大核心机制实现自主进化:记忆自更新、任务自切换、环境自预测、具身自适应和模型自演化,从而构建具备持续适应能力的通用智能体,使其能够像人类一样在开放环境中自主学习与交互。

链接: https://arxiv.org/abs/2602.04411
作者: Tongtong Feng,Xin Wang,Wenwu Zhu
机构: Tsinghua University (清华大学)
类目: Emerging Technologies (cs.ET); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Embodied Artificial Intelligence (AI) is an intelligent system formed by agents and their environment through active perception, embodied cognition, and action interaction. Existing embodied AI remains confined to human-crafted setting, in which agents are trained on given memory and construct models for given tasks, enabling fixed embodiments to interact with relatively static environments. Such methods fail in in-the-wild setting characterized by variable embodiments and dynamic open environments. This paper introduces self-evolving embodied AI, a new paradigm in which agents operate based on their changing state and environment with memory self-updating, task self-switching, environment self-prediction, embodiment self-adaptation, and model self-evolution, aiming to achieve continually adaptive intelligence with autonomous evolution. Specifically, we present the definition, framework, components, and mechanisms of self-evolving embodied AI, systematically review state-of-the-art works for realized components, discuss practical applications, and point out future research directions. We believe that self-evolving embodied AI enables agents to autonomously learn and interact with environments in a human-like manner and provide a new perspective toward general artificial intelligence.
zh

[CV-40] LCUDiff: Latent Capacity Upgrade Diffusion for Faithful Human Body Restoration

【速读】:该论文旨在解决人类中心图像恢复(Human-Centric Image Restoration, HCIR)中因现有方法在人体区域恢复时 fidelity 不足的问题,尤其是针对人像修复(Human Body Restoration, HBR)任务中的细节丢失与伪影问题。其核心解决方案是提出 LCUDiff 框架,关键创新在于:1)将预训练的潜空间扩散模型从 4 通道扩展至 16 通道以增强表示能力;2)采用通道分裂蒸馏(Channel Splitting Distillation, CSD)策略,在保持前四通道与预训练先验一致的同时,利用新增通道编码高频细节;3)设计先验保留适配(Prior-Preserving Adaptation, PPA)机制,缓解 4 通道扩散主干与高维潜空间之间的不匹配;4)引入解码器路由机制(Decoder Router, DeR),基于恢复质量评分对每张样本动态选择最优解码器路径,从而提升多样场景下的视觉保真度和鲁棒性。

链接: https://arxiv.org/abs/2602.04406
作者: Jue Gong,Zihan Zhou,Jingkai Wang,Shu Li,Libo Liu,Jianliang Lan,Yulun Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 7 figures. The code and model will be at this https URL

点击查看摘要

Abstract:Existing methods for restoring degraded human-centric images often struggle with insufficient fidelity, particularly in human body restoration (HBR). Recent diffusion-based restoration methods commonly adapt pre-trained text-to-image diffusion models, where the variational autoencoder (VAE) can significantly bottleneck restoration fidelity. We propose LCUDiff, a stable one-step framework that upgrades a pre-trained latent diffusion model from the 4-channel latent space to the 16-channel latent space. For VAE fine-tuning, channel splitting distillation (CSD) is used to keep the first four channels aligned with pre-trained priors while allocating the additional channels to effectively encode high-frequency details. We further design prior-preserving adaptation (PPA) to smoothly bridge the mismatch between 4-channel diffusion backbones and the higher-dimensional 16-channel latent. In addition, we propose a decoder router (DeR) for per-sample decoder routing using restoration-quality score annotations, which improves visual quality across diverse conditions. Experiments on synthetic and real-world datasets show competitive results with higher fidelity and fewer artifacts under mild degradations, while preserving one-step efficiency. The code and model will be at this https URL.
zh

[CV-41] Interactive Spatial-Frequency Fusion Mamba for Multi-Modal Image Fusion

【速读】:该论文旨在解决多模态图像融合(Multi-Modal Image Fusion, MMIF)中现有方法对空间与频率域信息融合方式单一、缺乏交互机制的问题,导致难以充分挖掘和利用跨模态互补特征。其解决方案的关键在于提出一种交互式空频融合框架(Interactive Spatial-Frequency Fusion Mamba, ISFM),核心创新包括:1)设计模态特异性提取器(Modality-Specific Extractor, MSE),以线性复杂度建模图像长程依赖;2)引入多尺度频率融合模块(Multi-scale Frequency Fusion, MFF),自适应整合多尺度低频与高频成分,增强频率特征鲁棒性;3)提出交互式空频融合机制(Interactive Spatial-Frequency Fusion, ISF),通过频率特征引导空间特征跨模态融合,实现更有效的互补表示。实验表明,该框架在六组基准数据集上优于当前主流方法。

链接: https://arxiv.org/abs/2602.04405
作者: Yixin Zhu,Long Lv,Pingping Zhang,Xuehu Liu,Tongdan Tang,Feng Tian,Weibing Sun,Huchuan Lu
机构: Dalian University of Technology (大连理工大学); Key Laboratory of Data Science and Smart Education (Hainan Normal University), Ministry of Education (教育部数据科学与智能教育重点实验室); Affiliated Zhongshan Hospital of Dalian University (大连大学附属中山医院); School of Computer Science and Artificial Intelligence, Wuhan University of Technology (武汉理工大学计算机科学与人工智能学院); Central Hospital of Dalian University of Technology (大连理工大学附属医院); School of Information and Communication Engineering, Dalian University of Technology (大连理工大学信息与通信工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: This work is accepted by IEEE Transactions on Image Processing. More modifications may be performed

点击查看摘要

Abstract:Multi-Modal Image Fusion (MMIF) aims to combine images from different modalities to produce fused images, retaining texture details and preserving significant information. Recently, some MMIF methods incorporate frequency domain information to enhance spatial features. However, these methods typically rely on simple serial or parallel spatial-frequency fusion without interaction. In this paper, we propose a novel Interactive Spatial-Frequency Fusion Mamba (ISFM) framework for MMIF. Specifically, we begin with a Modality-Specific Extractor (MSE) to extract features from different modalities. It models long-range dependencies across the image with linear computational complexity. To effectively leverage frequency information, we then propose a Multi-scale Frequency Fusion (MFF). It adaptively integrates low-frequency and high-frequency components across multiple scales, enabling robust representations of frequency features. More importantly, we further propose an Interactive Spatial-Frequency Fusion (ISF). It incorporates frequency features to guide spatial features across modalities, enhancing complementary representations. Extensive experiments are conducted on six MMIF datasets. The experimental results demonstrate that our ISFM can achieve better performances than other state-of-the-art methods. The source code is available at this https URL.
zh

[CV-42] Quantile Transfer for Reliable Operating Point Selection in Visual Place Recognition

【速读】:该论文旨在解决视觉定位(Visual Place Recognition, VPR)系统在无全球导航卫星系统(GNSS)环境下的定位精度与召回率之间权衡问题,尤其针对传统方法中阈值需人工离线调优且固定部署导致的环境变化下性能下降难题。解决方案的关键在于提出一种基于小规模校准遍历(known correspondences)的自动阈值选择方法,通过相似度分布的分位数归一化(quantile normalisation)实现阈值在不同校准规模和查询子集间的稳定迁移,从而在满足用户定义的精度要求下最大化召回率,无需人工干预即可适应新环境并泛化至不同运行条件。

链接: https://arxiv.org/abs/2602.04401
作者: Dhyey Manish Rajani,Michael Milford,Tobias Fischer
机构: Queensland University of Technology (昆士兰科技大学); QUT Centre for Robotics (QUT机器人中心)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual Place Recognition (VPR) is a key component for localisation in GNSS-denied environments, but its performance critically depends on selecting an image matching threshold (operating point) that balances precision and recall. Thresholds are typically hand-tuned offline for a specific environment and fixed during deployment, leading to degraded performance under environmental change. We propose a method that, given a user-defined precision requirement, automatically selects the operating point of a VPR system to maximise recall. The method uses a small calibration traversal with known correspondences and transfers thresholds to deployment via quantile normalisation of similarity score distributions. This quantile transfer ensures that thresholds remain stable across calibration sizes and query subsets, making the method robust to sampling variability. Experiments with multiple state-of-the-art VPR techniques and datasets show that the proposed approach consistently outperforms the state-of-the-art, delivering up to 25% higher recall in high-precision operating regimes. The method eliminates manual tuning by adapting to new environments and generalising across operating conditions. Our code will be released upon acceptance.
zh

[CV-43] Enabling Real-Time Colonoscopic Polyp Segmentation on Commodity CPUs via Ultra-Lightweight Architecture

【速读】:该论文旨在解决结直肠癌早期筛查中,高精度息肉分割模型因依赖GPU而难以在基层医院、移动内镜设备或胶囊机器人等资源受限场景中部署的问题。其核心解决方案是提出UltraSeg系列轻量级模型,在极端压缩条件下(参数量低至0.13M)实现高性能实时分割:通过联合优化编码器-解码器宽度、引入约束膨胀卷积以扩大感受野,并集成跨层轻量化融合模块,在单核CPU上达到90 FPS的推理速度,同时保持与31M参数U-Net相当的分割精度(Dice分数保留94%),为临床可落地的极小化压缩视觉任务提供了可靠基准和可复现的技术路径。

链接: https://arxiv.org/abs/2602.04381
作者: Weihao Gao,Zhuo Deng,Zheng Gong,Lan Ma
机构: Guangdong University of Education (广东教育学院); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 19pages, 5 figures

点击查看摘要

Abstract:Early detection of colorectal cancer hinges on real-time, accurate polyp identification and resection. Yet current high-precision segmentation models rely on GPUs, making them impractical to deploy in primary hospitals, mobile endoscopy units, or capsule robots. To bridge this gap, we present the UltraSeg family, operating in an extreme-compression regime (0.3 M parameters). UltraSeg-108K (0.108 M parameters) is optimized for single-center data, while UltraSeg-130K (0.13 M parameters) generalizes to multi-center, multi-modal images. By jointly optimizing encoder-decoder widths, incorporating constrained dilated convolutions to enlarge receptive fields, and integrating a cross-layer lightweight fusion module, the models achieve 90 FPS on a single CPU core without sacrificing accuracy. Evaluated on seven public datasets, UltraSeg retains 94% of the Dice score of a 31 M-parameter U-Net while utilizing only 0.4% of its parameters, establishing a strong, clinically viable baseline for the extreme-compression domain and offering an immediately deployable solution for resource-constrained settings. This work provides not only a CPU-native solution for colonoscopy but also a reproducible blueprint for broader minimally invasive surgical vision applications. Source code is publicly available to ensure reproducibility and facilitate future benchmarking.
zh

[CV-44] SparVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration

【速读】:该论文旨在解决视觉自回归(Visual AutoRegressive, VAR)建模在高分辨率图像生成中因注意力机制计算复杂度随分辨率四次方增长而导致的推理延迟问题,同时避免传统加速方法通过跳过高频尺度而损失图像细节。其解决方案的关键在于提出一种无需训练的稀疏化加速框架SparVAR,该框架利用VAR注意力的三个核心特性:(i)强注意力汇聚点(strong attention sinks)、(ii)跨尺度激活相似性(cross-scale activation similarity)和(iii)显著局部性(pronounced locality)。具体而言,SparVAR通过动态预测高分辨率尺度的稀疏注意力模式,并借助高效的索引映射机制构建尺度自相似稀疏注意力结构,从而实现大规模尺度下的高效稀疏注意力计算;进一步引入跨尺度局部稀疏注意力并设计块级稀疏核,使前向传播速度比FlashAttention快5倍。实验表明,该方法可在不跳过任何尺度的前提下,将8B模型生成1024×1024图像的时间压缩至1秒以内,相比FlashAttention加速的基线提升1.57倍,同时保留几乎全部高频细节。

链接: https://arxiv.org/abs/2602.04361
作者: Zekun Li,Ning Wang,Tongxin Bai,Changwang Mei,Peisong Wang,Shuang Qiu,Jian Cheng
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Beijing Academy of Artificial Intelligence (北京人工智能研究院); Nanjing University of Science and Technology (南京理工大学); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Visual AutoRegressive (VAR) modeling has garnered significant attention for its innovative next-scale prediction paradigm. However, mainstream VAR paradigms attend to all tokens across historical scales at each autoregressive step. As the next scale resolution grows, the computational complexity of attention increases quartically with resolution, causing substantial latency. Prior accelerations often skip high-resolution scales, which speeds up inference but discards high-frequency details and harms image quality. To address these problems, we present SparVAR, a training-free acceleration framework that exploits three properties of VAR attention: (i) strong attention sinks, (ii) cross-scale activation similarity, and (iii) pronounced locality. Specifically, we dynamically predict the sparse attention pattern of later high-resolution scales from a sparse decision scale, and construct scale self-similar sparse attention via an efficient index-mapping mechanism, enabling high-efficiency sparse attention computation at large scales. Furthermore, we propose cross-scale local sparse attention and implement an efficient block-wise sparse kernel, which achieves \mathbf 5\times faster forward speed than FlashAttention. Extensive experiments demonstrate that the proposed SparseVAR can reduce the generation time of an 8B model producing 1024\times1024 high-resolution images to the 1s, without skipping the last scales. Compared with the VAR baseline accelerated by FlashAttention, our method achieves a \mathbf1.57\times speed-up while preserving almost all high-frequency details. When combined with existing scale-skipping strategies, SparseVAR attains up to a \mathbf2.28\times acceleration, while maintaining competitive visual generation quality. Code is available at this https URL.
zh

[CV-45] When and Where to Attack? Stage-wise Attention-Guided Adversarial Attack on Large Vision Language Models

【速读】:该论文旨在解决针对大视觉语言模型(Large Vision-Language Models, LVLMs)的对抗攻击中,如何更高效地利用有限的像素级扰动预算以生成高成功率且难以察觉的对抗样本的问题。现有基于随机裁剪等输入变换的攻击方法虽揭示了局部扰动的有效性,但其随机性导致扰动效率低下。论文的关键创新在于提出分阶段注意力引导攻击(Stage-wise Attention-Guided Attack, SAGA),其核心机制是:首先发现区域注意力分数与对抗损失敏感性正相关,进而利用该特性逐步将扰动集中于高注意力区域;同时,攻击高注意力区域会引发注意力向后续显著区域的结构化重分配,从而增强攻击效果。SAGA通过这种注意力驱动的渐进式扰动策略,在约束条件下显著提升了攻击成功率并保持了对抗样本的隐蔽性。

链接: https://arxiv.org/abs/2602.04356
作者: Jaehyun Kwak,Nam Cao,Boryeong Cho,Segyu Lee,Sumyeong Ahn,Se-Young Yun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Pre-print

点击查看摘要

Abstract:Adversarial attacks against Large Vision-Language Models (LVLMs) are crucial for exposing safety vulnerabilities in modern multimodal systems. Recent attacks based on input transformations, such as random cropping, suggest that spatially localized perturbations can be more effective than global image manipulation. However, randomly cropping the entire image is inherently stochastic and fails to use the limited per-pixel perturbation budget efficiently. We make two key observations: (i) regional attention scores are positively correlated with adversarial loss sensitivity, and (ii) attacking high-attention regions induces a structured redistribution of attention toward subsequent salient regions. Based on these findings, we propose Stage-wise Attention-Guided Attack (SAGA), an attention-guided framework that progressively concentrates perturbations on high-attention regions. SAGA enables more efficient use of constrained perturbation budgets, producing highly imperceptible adversarial examples while consistently achieving state-of-the-art attack success rates across ten LVLMs. The source code is available at this https URL.
zh

[CV-46] VecSet-Edit: Unleashing Pre-trained LRM for Mesh Editing from Single Image

【速读】:该论文旨在解决当前3D网格(mesh)编辑方法中存在分辨率受限和依赖繁琐3D掩码的问题,尤其是在基于体素(voxel)表示的方案如VoxHammer中表现明显。其核心解决方案是提出首个基于高保真VecSet Large Reconstruction Model (LRM) 的网格编辑框架——VecSet-Edit。关键创新在于:首先通过分析VecSet tokens的空间属性,发现token子集对应不同几何区域;进而设计Mask-guided Token Seeding与Attention-aligned Token Gating策略,仅用2D图像条件即可精确定位目标区域;同时引入Drift-aware Token Pruning机制以在去噪过程中剔除几何异常点;最后通过Detail-preserving Texture Baking模块实现几何与纹理信息的双重保留,从而显著提升编辑精度与视觉质量。

链接: https://arxiv.org/abs/2602.04349
作者: Teng-Fang Hsiao,Bo-Kai Ruan,Yu-Lun Liu,Hong-Han Shuai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:3D editing has emerged as a critical research area to provide users with flexible control over 3D assets. While current editing approaches predominantly focus on 3D Gaussian Splatting or multi-view images, the direct editing of 3D meshes remains underexplored. Prior attempts, such as VoxHammer, rely on voxel-based representations that suffer from limited resolution and necessitate labor-intensive 3D mask. To address these limitations, we propose \textbfVecSet-Edit, the first pipeline that leverages the high-fidelity VecSet Large Reconstruction Model (LRM) as a backbone for mesh editing. Our approach is grounded on a analysis of the spatial properties in VecSet tokens, revealing that token subsets govern distinct geometric regions. Based on this insight, we introduce Mask-guided Token Seeding and Attention-aligned Token Gating strategies to precisely localize target regions using only 2D image conditions. Also, considering the difference between VecSet diffusion process versus voxel we design a Drift-aware Token Pruning to reject geometric outliers during the denoising process. Finally, our Detail-preserving Texture Baking module ensures that we not only preserve the geometric details of original mesh but also the textural information. More details can be found in our project page: this https URL
zh

[CV-47] Finding NeMO: A Geometry-Aware Representation of Template Views for Few-Shot Perception

【速读】:该论文旨在解决少样本(few-shot)环境下对未见过物体的检测、分割及6自由度(6DoF)位姿估计问题,尤其在无需相机特定参数或目标数据重训练的情况下实现高效感知。其解决方案的关键在于提出了一种名为Neural Memory Object (NeMO) 的新型以物体为中心的表示方法:通过一个仅需少量RGB模板视图的编码器,利用学习到的隐式距离函数(UDF, Unsigned Distance Function)生成包含语义与几何信息的稀疏点云;随后由解码器结合该对象编码与查询图像生成多种密集预测结果。此设计实现了将物体信息外置于NeMO,并通过单一网络完成多感知任务,显著提升了对新物体的交互能力与系统扩展性,同时避免了重新训练和复杂预处理流程。

链接: https://arxiv.org/abs/2602.04343
作者: Sebastian Jung,Leonard Klüpfel,Rudolph Triebel,Maximilian Durner
机构: German Aerospace Center (DLR)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages including supplement, published in 3DV 2026, Project website: this https URL

点击查看摘要

Abstract:We present Neural Memory Object (NeMO), a novel object-centric representation that can be used to detect, segment and estimate the 6DoF pose of objects unseen during training using RGB images. Our method consists of an encoder that requires only a few RGB template views depicting an object to generate a sparse object-like point cloud using a learned UDF containing semantic and geometric information. Next, a decoder takes the object encoding together with a query image to generate a variety of dense predictions. Through extensive experiments, we show that our method can be used for few-shot object perception without requiring any camera-specific parameters or retraining on target data. Our proposed concept of outsourcing object information in a NeMO and using a single network for multiple perception tasks enhances interaction with novel objects, improving scalability and efficiency by enabling quick object onboarding without retraining or extensive pre-processing. We report competitive and state-of-the-art results on various datasets and perception tasks of the BOP benchmark, demonstrating the versatility of our approach. this https URL
zh

[CV-48] Explicit Uncertainty Modeling for Active CLIP Adaptation with Dual Prompt Tuning

【速读】:该论文旨在解决预训练视觉-语言模型(如CLIP)在标注预算有限的情况下,如何有效适应下游图像分类任务的问题。现有主动学习方法通常依赖熵或表示聚类来估计不确定性,但未从模型自身角度显式建模不确定性。其解决方案的关键在于提出一种基于双提示微调(dual-prompt tuning)的鲁棒不确定性建模框架:在CLIP的文本分支中引入两个可学习提示——正向提示用于增强任务特定文本嵌入的判别性,提升分类可靠性;反向提示则以逆向方式训练,显式建模预测标签正确的概率,从而提供一个理论严谨的不确定性信号,用于指导主动采样策略。实验表明,该方法在不同微调范式下均能显著优于现有主动学习方法。

链接: https://arxiv.org/abs/2602.04340
作者: Qian-Wei Wang,Yaguang Song,Shu-Tao Xia
机构: Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院,清华大学); Institute of Perceptual Intelligence, Peng Cheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pre-trained vision-language models such as CLIP exhibit strong transferability, yet adapting them to downstream image classification tasks under limited annotation budgets remains challenging. In active learning settings, the model must select the most informative samples for annotation from a large pool of unlabeled data. Existing approaches typically estimate uncertainty via entropy-based criteria or representation clustering, without explicitly modeling uncertainty from the model perspective. In this work, we propose a robust uncertainty modeling framework for active CLIP adaptation based on dual-prompt tuning. We introduce two learnable prompts in the textual branch of CLIP. The positive prompt enhances the discriminability of task-specific textual embeddings corresponding to light-weight tuned visual embeddings, improving classification reliability. Meanwhile, the negative prompt is trained in an reversed manner to explicitly model the probability that the predicted label is correct, providing a principled uncertainty signal for guiding active sample selection. Extensive experiments across different fine-tuning paradigms demonstrate that our method consistently outperforms existing active learning methods under the same annotation budget.
zh

[CV-49] Fine-tuning Pre-trained Vision-Language Models in a Human-Annotation-Free Manner

【速读】:该论文旨在解决大规模视觉语言模型(Vision-Language Models, VLMs)在下游任务中进行无监督适应时面临的挑战,尤其是现有自训练方法依赖伪标签但常因置信度筛选不可靠、确认偏差以及低置信度样本利用不足而导致性能受限的问题。解决方案的关键在于提出协同微调(Collaborative Fine-Tuning, CoFT),其核心是通过双模型跨模态协作机制,在样本层面显式建模伪标签的纯净度,采用正负文本提示(positive and negative textual prompts)实现无需人工设定阈值或噪声假设的伪标签质量控制;同时引入两阶段训练策略,先在高置信度样本上进行参数高效微调,再基于协同过滤后的伪标签进行全量微调,并辅以负提示对轻量化视觉适配模块进行正则化,从而提升在噪声监督下的鲁棒性与泛化能力。

链接: https://arxiv.org/abs/2602.04337
作者: Qian-Wei Wang,Guanghao Meng,Ren Cai,Yaguang Song,Shu-Tao Xia
机构: Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Institute of Perceptual Intelligence, Peng Cheng Laboratory (鹏城实验室); Peking University Shenzhen Graduate School, Peking University (北京大学深圳研究生院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large-scale vision-language models (VLMs) such as CLIP exhibit strong zero-shot generalization, but adapting them to downstream tasks typically requires costly labeled data. Existing unsupervised self-training methods rely on pseudo-labeling, yet often suffer from unreliable confidence filtering, confirmation bias, and underutilization of low-confidence samples. We propose Collaborative Fine-Tuning (CoFT), an unsupervised adaptation framework that leverages unlabeled data through a dual-model, cross-modal collaboration mechanism. CoFT introduces a dual-prompt learning strategy with positive and negative textual prompts to explicitly model pseudo-label cleanliness in a sample-dependent manner, removing the need for hand-crafted thresholds or noise assumptions. The negative prompt also regularizes lightweight visual adaptation modules, improving robustness under noisy supervision. CoFT employs a two-phase training scheme, transitioning from parameter-efficient fine-tuning on high-confidence samples to full fine-tuning guided by collaboratively filtered pseudo-labels. Building on CoFT, CoFT+ further enhances adaptation via iterative fine-tuning, momentum contrastive learning, and LLM-generated prompts. Extensive experiments demonstrate consistent gains over existing unsupervised methods and even few-shot supervised baselines.
zh

[CV-50] Multiview Self-Representation Learning across Heterogeneous Views

【速读】:该论文旨在解决在无监督迁移学习场景下,如何从不同预训练模型提取的异构多视角(heterogeneous multiview)特征中学习到具有不变性的表示问题。由于不同预训练模型的架构或预训练目标差异,同一样本在不同模型下的特征分布存在显著偏移,导致直接融合这些特征难以获得稳定且鲁棒的表示。解决方案的关键在于提出一种多视角自表示学习(Multiview Self-Representation Learning, MSRL)方法:首先利用冻结的预训练骨干网络提取多视角特征,并在其上堆叠独立的线性层;随后引入基于自表示性质的信息传递机制,实现跨视角特征聚合;同时设计分配概率分布一致性约束,引导不同视角间互补信息协同优化,从而强制多个线性模型输出的表示在语义空间中保持一致。该方法通过理论分析和大量基准数据集实验验证了其有效性与泛化能力。

链接: https://arxiv.org/abs/2602.04328
作者: Jie Chen,Zhu Wang,Chuanbin Liu,Xi Peng
机构: Sichuan University (四川大学); China University of Petroleum (Beijing) (中国石油大学(北京)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages

点击查看摘要

Abstract:Features of the same sample generated by different pretrained models often exhibit inherently distinct feature distributions because of discrepancies in the model pretraining objectives or architectures. Learning invariant representations from large-scale unlabeled visual data with various pretrained models in a fully unsupervised transfer manner remains a significant challenge. In this paper, we propose a multiview self-representation learning (MSRL) method in which invariant representations are learned by exploiting the self-representation property of features across heterogeneous views. The features are derived from large-scale unlabeled visual data through transfer learning with various pretrained models and are referred to as heterogeneous multiview data. An individual linear model is stacked on top of its corresponding frozen pretrained backbone. We introduce an information-passing mechanism that relies on self-representation learning to support feature aggregation over the outputs of the linear model. Moreover, an assignment probability distribution consistency scheme is presented to guide multiview self-representation learning by exploiting complementary information across different views. Consequently, representation invariance across different linear models is enforced through this scheme. In addition, we provide a theoretical analysis of the information-passing mechanism, the assignment probability distribution consistency and the incremental views. Extensive experiments with multiple benchmark visual datasets demonstrate that the proposed MSRL method consistently outperforms several state-of-the-art approaches.
zh

[CV-51] JOintGS: Joint Optimization of Cameras Bodies and 3D Gaussians for In-the-Wild Monocular Reconstruction

【速读】:该论文旨在解决从单目RGB视频中重建高保真、可驱动的3D人体虚拟形象(Human Avatar)这一难题,尤其是在无约束的野外场景下,传统方法(如COLMAP、HMR2.0)获取的相机参数和人体姿态往往存在显著误差,导致现有基于3D高斯溅射(3DGS)的方法因依赖精确标定与姿态标注而难以适用。其解决方案的关键在于提出JOintGS框架,通过联合优化相机外参、人体姿态与3D高斯表示,构建一种协同精化机制:显式分离前景(人体)与背景(静态场景)的高斯表示,使背景高斯通过多视角一致性稳定相机估计,进而提升人体对齐精度;同时,优化的人体姿态又能通过静态约束去除动态伪影,改善场景重建质量,从而实现相互增强的闭环优化。

链接: https://arxiv.org/abs/2602.04317
作者: Zihan Lou,Jinlong Fan,Sihan Ma,Yuxiang Yang,Jing Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 15 figures, Project page at this https URL

点击查看摘要

Abstract:Reconstructing high-fidelity animatable 3D human avatars from monocular RGB videos remains challenging, particularly in unconstrained in-the-wild scenarios where camera parameters and human poses from off-the-shelf methods (e.g., COLMAP, HMR2.0) are often inaccurate. Splatting (3DGS) advances demonstrate impressive rendering quality and real-time performance, they critically depend on precise camera calibration and pose annotations, limiting their applicability in real-world settings. We present JOintGS, a unified framework that jointly optimizes camera extrinsics, human poses, and 3D Gaussian representations from coarse initialization through a synergistic refinement mechanism. Our key insight is that explicit foreground-background disentanglement enables mutual reinforcement: static background Gaussians anchor camera estimation via multi-view consistency; refined cameras improve human body alignment through accurate temporal correspondence; optimized human poses enhance scene reconstruction by removing dynamic artifacts from static constraints. We further introduce a temporal dynamics module to capture fine-grained pose-dependent deformations and a residual color field to model illumination variations. Extensive experiments on NeuMan and EMDB datasets demonstrate that JOintGS achieves superior reconstruction quality, with 2.1~dB PSNR improvement over state-of-the-art methods on NeuMan dataset, while maintaining real-time rendering. Notably, our method shows significantly enhanced robustness to noisy initialization compared to the this http URL source code is available at this https URL.
zh

[CV-52] GeneralVLA: Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning

【速读】:该论文旨在解决机器人领域中基础模型在未见过场景下的零样本(zero-shot)泛化能力不足的问题,尤其是当前视觉-语言-动作(Vision-Language-Action, VLA)模型在复杂任务中难以实现有效迁移和自主数据生成的瓶颈。其解决方案的关键在于提出一种分层式VLA模型——GeneralVLA,该模型通过三个层级协同工作:高层的可及性分割模块(Affordance Segmentation Module, ASM)用于感知图像中的关键点可及性;中层3DAgent完成任务理解、技能知识整合与三维路径规划,输出引导机器人末端执行器运动的3D轨迹;底层则是一个具备3D感知能力的控制策略,基于上述轨迹进行高精度操作。此架构无需真实机器人数据采集或人类示范,即可自动生成高质量演示数据,并显著提升行为克隆策略的鲁棒性,从而实现对新任务的零样本适应和规模化扩展。

链接: https://arxiv.org/abs/2602.04315
作者: Guoqing Ma,Siheng Wang,Zeyu Zhang,Shan Yu,Hao Tang
机构: CASIA; Peking University
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large foundation models have shown strong open-world generalization to complex problems in vision and language, but similar levels of generalization have yet to be achieved in robotics. One fundamental challenge is that the models exhibit limited zero-shot capability, which hampers their ability to generalize effectively to unseen scenarios. In this work, we propose GeneralVLA (Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning), a hierarchical vision-language-action (VLA) model that can be more effective in utilizing the generalization of foundation models, enabling zero-shot manipulation and automatically generating data for robotics. In particular, we study a class of hierarchical VLA model where the high-level ASM (Affordance Segmentation Module) is finetuned to perceive image keypoint affordances of the scene; the mid-level 3DAgent carries out task understanding, skill knowledge, and trajectory planning to produce a 3D path indicating the desired robot end-effector trajectory. The intermediate 3D path prediction is then served as guidance to the low-level, 3D-aware control policy capable of precise manipulation. Compared to alternative approaches, our method requires no real-world robotic data collection or human demonstration, making it much more scalable to diverse tasks and viewpoints. Empirically, GeneralVLA successfully generates trajectories for 14 tasks, significantly outperforming state-of-the-art methods such as VoxPoser. The generated demonstrations can train more robust behavior cloning policies than training with human demonstrations or from data generated by VoxPoser, Scaling-up, and Code-As-Policies. We believe GeneralVLA can be the scalable method for both generating data for robotics and solving novel tasks in a zero-shot setting. Code: this https URL. Website: this https URL.
zh

[CV-53] Light Up Your Face: A Physically Consistent Dataset and Diffusion Model for Face Fill-Light Enhancement

【速读】:该论文旨在解决人脸补光增强(Face Fill-Light Enhancement, FFE)中因传统光照重塑方法导致的前景与背景光照不一致问题,此类方法常会抑制输入光照或改变整个场景,难以满足实际应用需求。解决方案的关键在于构建一个物理一致的大规模配对数据集 LightYourFace-160K(LYF-160K),通过可控的六维参数驱动盘状补光区域生成160K组前后图像对,并提出物理感知光照提示(Physics-Aware Lighting Prompt, PALP)将六维光照参数编码为条件token,结合预训练扩散模型训练出高效的一步式补光扩散模型(Fill-Light Diffusion, FiLitDiff),从而实现高保真、可控且计算成本低的补光效果,同时有效保留背景原始光照信息。

链接: https://arxiv.org/abs/2602.04300
作者: Jue Gong,Zihan Zhou,Jingkai Wang,Xiaohong Liu,Yulun Zhang,Xiaokang Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 7 figures. The code and model will be available at this https URL

点击查看摘要

Abstract:Face fill-light enhancement (FFE) brightens underexposed faces by adding virtual fill light while keeping the original scene illumination and background unchanged. Most face relighting methods aim to reshape overall lighting, which can suppress the input illumination or modify the entire scene, leading to foreground-background inconsistency and mismatching practical FFE needs. To support scalable learning, we introduce LightYourFace-160K (LYF-160K), a large-scale paired dataset built with a physically consistent renderer that injects a disk-shaped area fill light controlled by six disentangled factors, producing 160K before-and-after pairs. We first pretrain a physics-aware lighting prompt (PALP) that embeds the 6D parameters into conditioning tokens, using an auxiliary planar-light reconstruction objective. Building on a pretrained diffusion backbone, we then train a fill-light diffusion (FiLitDiff), an efficient one-step model conditioned on physically grounded lighting codes, enabling controllable and high-fidelity fill lighting at low computational cost. Experiments on held-out paired sets demonstrate strong perceptual quality and competitive full-reference metrics, while better preserving background illumination. The dataset and model will be at this https URL.
zh

[CV-54] SkeletonGaussian: Editable 4D Generation through Gaussian Skeletonization

【速读】:该论文旨在解决现有4D生成方法中运动表示为隐式变形场导致的难以直接控制和编辑的问题。其解决方案的关键在于提出SkeletonGaussian框架,通过引入分层的关节式表示,将运动显式分解为由骨骼驱动的稀疏刚性运动与细粒度的非刚性运动;具体而言,首先提取鲁棒骨骼并利用线性混合皮肤化(Linear Blend Skinning, LBS)驱动刚性运动,随后采用基于六边形平面(hexplane)的优化策略对非刚性形变进行精细化调整,从而显著提升生成结果的可解释性与编辑灵活性。

链接: https://arxiv.org/abs/2602.04271
作者: Lifan Wu,Ruijie Zhu,Yubo Ai,Tianzhu Zhang
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: Accepted by CVM 2026. Project page: this https URL

点击查看摘要

Abstract:4D generation has made remarkable progress in synthesizing dynamic 3D objects from input text, images, or videos. However, existing methods often represent motion as an implicit deformation field, which limits direct control and editability. To address this issue, we propose SkeletonGaussian, a novel framework for generating editable dynamic 3D Gaussians from monocular video input. Our approach introduces a hierarchical articulated representation that decomposes motion into sparse rigid motion explicitly driven by a skeleton and fine-grained non-rigid motion. Concretely, we extract a robust skeleton and drive rigid motion via linear blend skinning, followed by a hexplane-based refinement for non-rigid deformations, enhancing interpretability and editability. Experimental results demonstrate that SkeletonGaussian surpasses existing methods in generation quality while enabling intuitive motion editing, establishing a new paradigm for editable 4D generation. Project page: this https URL
zh

[CV-55] KVSmooth: Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在生成过程中存在的幻觉(hallucination)问题,即模型输出与视觉输入不一致的对象、属性或关系。现有方法常因解码过程中的语义漂移(semantic drift)导致长序列生成时偏离视觉事实。其解决方案的关键在于提出一种无需训练且可即插即用的方法——KVSmooth,通过基于注意力熵的自适应平滑机制对KV缓存(KV-Cache)中的键(key)和值(value)进行指数移动平均(EMA)处理,并动态量化每个token的“下沉程度”(sink degree),从而自适应调整平滑强度,有效抑制幻觉同时提升整体性能(F₁分数从77.5提升至79.2),且在精度与召回率上实现同步优化。

链接: https://arxiv.org/abs/2602.04268
作者: Siyu Jiang,Feiyang Chen,Xiaojin Zhang,Kun He
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite the significant progress of Multimodal Large Language Models (MLLMs) across diverse tasks, hallucination – corresponding to the generation of visually inconsistent objects, attributes, or relations – remains a major obstacle to their reliable deployment. Unlike pure language models, MLLMs must ground their generation process in visual inputs. However, existing models often suffer from semantic drift during decoding, causing outputs to diverge from visual facts as the sequence length increases. To address this issue, we propose KVSmooth, a training-free and plug-and-play method that mitigates hallucination by performing attention-entropy-guided adaptive smoothing on hidden states. Specifically, KVSmooth applies an exponential moving average (EMA) to both keys and values in the KV-Cache, while dynamically quantifying the sink degree of each token through the entropy of its attention distribution to adaptively adjust the smoothing strength. Unlike computationally expensive retraining or contrastive decoding methods, KVSmooth operates efficiently during inference without additional training or model modification. Extensive experiments demonstrate that KVSmooth significantly reduces hallucination ( \mathitCHAIR_S from 41.8 \rightarrow 18.2 ) while improving overall performance ( F_1 score from 77.5 \rightarrow 79.2 ), achieving higher precision and recall simultaneously. In contrast, prior methods often improve one at the expense of the other, validating the effectiveness and generality of our approach. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2602.04268 [cs.CV] (or arXiv:2602.04268v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2602.04268 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-56] Decoupled Hierarchical Distillation for Multimodal Emotion Recognition

【速读】:该论文旨在解决人类多模态情感识别(Multimodal Emotion Recognition, MER)中面临的固有多模态异质性(multimodal heterogeneities)以及不同模态贡献度差异的问题。其解决方案的关键在于提出了一种解耦分层多模态蒸馏框架(Decoupled Hierarchical Multimodal Distillation, DHMD),该框架首先通过自回归机制将每个模态特征解耦为模态无关(homogeneous)和模态专属(heterogeneous)成分;随后采用两级知识蒸馏策略:第一阶段利用图蒸馏单元(GD-Unit)在解耦后的特征空间中进行粗粒度蒸馏,通过动态图实现模态间自适应知识迁移;第二阶段则通过跨模态词典匹配机制实现细粒度蒸馏,对齐不同模态间的语义粒度,从而生成更具判别性的MER表示。此分层蒸馏设计实现了灵活的知识传递与有效的跨模态特征对齐,显著提升了模型性能。

链接: https://arxiv.org/abs/2602.04260
作者: Yong Li,Yuanzhi Wang,Yi Ding,Shiqing Zhang,Ke Lu,Cuntai Guan
机构: Southeast University (东南大学); Nanjing University of Science and Technology (南京理工大学); Taizhou University (台州大学); University of Chinese Academy of Sciences (中国科学院大学); Peng Cheng Laboratory (鹏城实验室); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: text overlap with arXiv:2303.13802

点击查看摘要

Abstract:Human multimodal emotion recognition (MER) seeks to infer human emotions by integrating information from language, visual, and acoustic modalities. Although existing MER approaches have achieved promising results, they still struggle with inherent multimodal heterogeneities and varying contributions from different modalities. To address these challenges, we propose a novel framework, Decoupled Hierarchical Multimodal Distillation (DHMD). DHMD decouples each modality’s features into modality-irrelevant (homogeneous) and modality-exclusive (heterogeneous) components using a self-regression mechanism. The framework employs a two-stage knowledge distillation (KD) strategy: (1) coarse-grained KD via a Graph Distillation Unit (GD-Unit) in each decoupled feature space, where a dynamic graph facilitates adaptive distillation among modalities, and (2) fine-grained KD through a cross-modal dictionary matching mechanism, which aligns semantic granularities across modalities to produce more discriminative MER representations. This hierarchical distillation approach enables flexible knowledge transfer and effectively improves cross-modal feature alignment. Experimental results demonstrate that DHMD consistently outperforms state-of-the-art MER methods, achieving 1.3%/2.4% (ACC _7 ), 1.3%/1.9% (ACC _2 ) and 1.9%/1.8% (F1) relative improvement on CMU-MOSI/CMU-MOSEI dataset, respectively. Meanwhile, visualization results reveal that both the graph edges and dictionary activations in DHMD exhibit meaningful distribution patterns across modality-irrelevant/-exclusive feature spaces.
zh

[CV-57] Depth-Guided Metric-Aware Temporal Consistency for Monocular Video Human Mesh Recovery

【速读】:该论文旨在解决单目视频人体网格重建中因深度歧义和尺度不确定性导致的度量一致性(metric consistency)与时间稳定性(temporal stability)难题,尤其在遮挡、深度排序错误及尺度漂移等场景下表现不佳。其解决方案的关键在于提出一个深度引导的综合框架,包含三个协同工作的模块:1)深度引导的多尺度融合模块(Depth-Guided Multi-Scale Fusion),通过置信度感知门控机制自适应融合几何先验与RGB特征;2)深度校准的人体姿态与形状估计器(D-MAPS),利用深度校准的骨骼统计信息实现尺度一致的初始估计;3)运动-深度对齐精化模块(MoDAR),通过跨模态注意力机制在运动动力学与几何线索间建立时序一致性约束。该方法在三个挑战性基准上显著提升了鲁棒性和空间精度,同时保持高效计算性能。

链接: https://arxiv.org/abs/2602.04257
作者: Jiaxin Cen,Xudong Mao,Guanghui Yue,Wei Zhou,Ruomei Wang,Fan Zhou,Baoquan Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Monocular video human mesh recovery faces fundamental challenges in maintaining metric consistency and temporal stability due to inherent depth ambiguities and scale uncertainties. While existing methods rely primarily on RGB features and temporal smoothing, they struggle with depth ordering, scale drift, and occlusion-induced instabilities. We propose a comprehensive depth-guided framework that achieves metric-aware temporal consistency through three synergistic components: A Depth-Guided Multi-Scale Fusion module that adaptively integrates geometric priors with RGB features via confidence-aware gating; A Depth-guided Metric-Aware Pose and Shape (D-MAPS) estimator that leverages depth-calibrated bone statistics for scale-consistent initialization; A Motion-Depth Aligned Refinement (MoDAR) module that enforces temporal coherence through cross-modal attention between motion dynamics and geometric cues. Our method achieves superior results on three challenging benchmarks, demonstrating significant improvements in robustness against heavy occlusion and spatial accuracy while maintaining computational efficiency.
zh

[CV-58] ACIL: Active Class Incremental Learning for Image Classification BMVC2024

【速读】:该论文旨在解决类增量学习(Class Incremental Learning, CIL)中因标注成本高且大量样本无法在后续阶段被模型利用而导致的标注资源浪费与灾难性遗忘问题。其解决方案的关键在于提出一种名为ACIL(Active Class-Incremental Learning)的主动学习框架,通过结合不确定性(uncertainty)与多样性(diversity)的采样准则,智能筛选出每一轮中最具信息量的示例样本(exemplar samples)进行人工标注,并将其加入下一阶段的训练数据中。该机制不仅显著降低了标注成本,还能有效缓解模型在新类别学习过程中对旧类别的遗忘现象。

链接: https://arxiv.org/abs/2602.04252
作者: Aditya R. Bhattacharya,Debanjan Goswami,Shayok Chakraborty
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: BMVC 2024 (Accepted). Authors, Aditya R. Bhattacharya and Debanjan Goswami contributed equally to this work

点击查看摘要

Abstract:Continual learning (or class incremental learning) is a realistic learning scenario for computer vision systems, where deep neural networks are trained on episodic data, and the data from previous episodes are generally inaccessible to the model. Existing research in this domain has primarily focused on avoiding catastrophic forgetting, which occurs due to the continuously changing class distributions in each episode and the inaccessibility of the data from previous episodes. However, these methods assume that all the training samples in every episode are annotated; this not only incurs a huge annotation cost, but also results in a wastage of annotation effort, since most of the samples in a given episode will not be accessible to the model in subsequent episodes. Active learning algorithms identify the salient and informative samples from large amounts of unlabeled data and are instrumental in reducing the human annotation effort in inducing a deep neural network. In this paper, we propose ACIL, a novel active learning framework for class incremental learning settings. We exploit a criterion based on uncertainty and diversity to identify the exemplar samples that need to be annotated in each episode, and will be appended to the data in the next episode. Such a framework can drastically reduce annotation cost and can also avoid catastrophic forgetting. Our extensive empirical analyses on several vision datasets corroborate the promise and potential of our framework against relevant baselines.
zh

[CV-59] owards Next-Generation SLAM: A Survey on 3DGS-SLAM Focusing on Performance Robustness and Future Directions

【速读】:该论文旨在解决传统同步定位与建图(SLAM)系统在渲染质量粗糙、场景细节恢复不足以及动态环境鲁棒性差等方面的局限性。其解决方案的关键在于引入3D高斯溅射(3D Gaussian Splatting, 3DGS),利用其高效的显式表示和高质量渲染能力,重构SLAM的重建范式。通过系统梳理3DGS与SLAM融合的核心技术路径,论文从渲染质量、跟踪精度、重建速度和内存消耗四个维度分析代表性方法的设计原理与突破,并探讨提升复杂环境下(如运动模糊和动态场景)鲁棒性的策略,为下一代高保真、高效且鲁棒的SLAM系统提供技术参考。

链接: https://arxiv.org/abs/2602.04251
作者: Li Wang,Ruixuan Gong,Yumo Han,Lei Yang,Lu Yang,Ying Li,Bin Xu,Huaping Liu,Rong Fu
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Traditional Simultaneous Localization and Mapping (SLAM) systems often face limitations including coarse rendering quality, insufficient recovery of scene details, and poor robustness in dynamic environments. 3D Gaussian Splatting (3DGS), with its efficient explicit representation and high-quality rendering capabilities, offers a new reconstruction paradigm for SLAM. This survey comprehensively reviews key technical approaches for integrating 3DGS with SLAM. We analyze performance optimization of representative methods across four critical dimensions: rendering quality, tracking accuracy, reconstruction speed, and memory consumption, delving into their design principles and breakthroughs. Furthermore, we examine methods for enhancing the robustness of 3DGS-SLAM in complex environments such as motion blur and dynamic environments. Finally, we discuss future challenges and development trends in this area. This survey aims to provide a technical reference for researchers and foster the development of next-generation SLAM systems characterized by high fidelity, efficiency, and robustness.
zh

[CV-60] SPOT-Occ: Sparse Prototype-guided Transformer for Camera-based 3D Occupancy Prediction

【速读】:该论文旨在解决从摄像头数据中实现高精度且实时的3D占用预测(3D occupancy prediction)问题,这是自动驾驶车辆安全与实用部署的关键需求。现有方法采用稀疏3D表示虽缓解了编码瓶颈,但给解码器带来新挑战:如何高效聚合来自稀疏、非均匀分布的体素特征,而无需依赖计算成本高昂的密集注意力机制。其解决方案的核心是提出一种基于原型的稀疏Transformer解码器(Prototype-based Sparse Transformer Decoder),通过两阶段过程——引导特征选择与聚焦聚合——替代传统密集交互。关键创新在于使解码器注意力机制“原型引导”(prototype-guided),即每个查询自适应地识别一组最显著的体素特征(称为原型),用于聚焦聚合;同时引入互补去噪范式,利用真实标签掩码提供显式指导,确保跨解码层查询与原型间的一致关联,从而提升稳定性与效率。该方法命名为SPOT-Occ,在保持更高准确率的同时显著提升推理速度。

链接: https://arxiv.org/abs/2602.04240
作者: Suzeyu Chen,Leheng Li,Ying-Cong Chen
机构: The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州) ); The Hong Kong University of Science and Technology(香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 8 pages, 6 figures

点击查看摘要

Abstract:Achieving highly accurate and real-time 3D occupancy prediction from cameras is a critical requirement for the safe and practical deployment of autonomous vehicles. While this shift to sparse 3D representations solves the encoding bottleneck, it creates a new challenge for the decoder: how to efficiently aggregate information from a sparse, non-uniformly distributed set of voxel features without resorting to computationally prohibitive dense attention. In this paper, we propose a novel Prototype-based Sparse Transformer Decoder that replaces this costly interaction with an efficient, two-stage process of guided feature selection and focused aggregation. Our core idea is to make the decoder’s attention prototype-guided. We achieve this through a sparse prototype selection mechanism, where each query adaptively identifies a compact set of the most salient voxel features, termed prototypes, for focused feature aggregation. To ensure this dynamic selection is stable and effective, we introduce a complementary denoising paradigm. This approach leverages ground-truth masks to provide explicit guidance, guaranteeing a consistent query-prototype association across decoder layers. Our model, dubbed SPOT-Occ, outperforms previous methods with a significant margin in speed while also improving accuracy. Source code is released at this https URL. Comments: 8 pages, 6 figures Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO) Cite as: arXiv:2602.04240 [cs.CV] (or arXiv:2602.04240v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2602.04240 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-61] An Intuitionistic Fuzzy Logic Driven UNet architecture: Application to Brain Image segmentation

【速读】:该论文旨在解决医学MRI脑图像分割中因部分容积效应(partial volume effect)导致的不确定性问题,该效应常引起组织边界模糊和像素级分类困难,从而影响分割精度。解决方案的关键在于提出一种融合直觉模糊逻辑(intuitionistic fuzzy logic)的改进型UNet架构(IF-UNet),通过引入隶属度、非隶属度和犹豫度三个维度对输入数据进行建模,从而更有效地表征组织模糊性和边界不确定性,提升分割鲁棒性与准确性。

链接: https://arxiv.org/abs/2602.04227
作者: Hanuman Verma,Kiho Im,Pranabesh Maji,Akshansh Gupta
机构: Bareilly College (巴瑞利学院); MJP Rohilkhand University (MJP罗希尔坎德大学); Boston Children’s Hospital (波士顿儿童医院); Harvard Medical School (哈佛医学院); Department of Pediatrics (儿科系); CSIR–Central Electronics Engineering Research Institute (CSIR-中央电子工程研究所); Academy of Scientific and Innovative Research (AcSIR) (科学与创新研究院); CSIR–National Institute of Science Communication and Policy Research (CSIR-国家科学技术传播与政策研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate segmentation of MRI brain images is essential for image analysis, diagnosis of neuro-logical disorders and medical image computing. In the deep learning approach, the convolutional neural networks (CNNs), especially UNet, are widely applied in medical image segmentation. However, it is difficult to deal with uncertainty due to the partial volume effect in brain images. To overcome this limitation, we propose an enhanced framework, named UNet with intuitionistic fuzzy logic (IF-UNet), which incorporates intuitionistic fuzzy logic into UNet. The model processes input data in terms of membership, nonmembership, and hesitation degrees, allowing it to better address tissue ambiguity resulting from partial volume effects and boundary uncertainties. The proposed architecture is evaluated on the Internet Brain Segmentation Repository (IBSR) dataset, and its performance is computed using accuracy, Dice coefficient, and intersection over union (IoU). Experimental results confirm that IF-UNet improves segmentation quality with handling uncertainty in brain images.
zh

[CV-62] Adaptive 1D Video Diffusion Autoencoder

【速读】:该论文旨在解决现有视频自编码器(Video Autoencoder, VAE)在视频生成任务中面临的三大局限性:固定速率压缩导致对简单视频浪费token、卷积神经网络(CNN)架构僵化难以支持可变长度潜在表示建模,以及确定性解码器在从压缩潜在表示中恢复细节时表现不佳。其解决方案的关键在于提出一种基于Transformer的一维扩散视频自编码器(One-Dimensional Diffusion Video Autoencoder, One-DVA),通过两个核心设计实现突破:一是采用查询驱动的视觉Transformer(Vision Transformer)进行自适应的一维编码,并引入可变长度dropout机制动态调整潜在序列长度;二是使用像素空间扩散Transformer作为解码器,以潜在表示为条件进行视频重建,从而提升细节恢复能力并支持更高压缩比的自适应压缩策略。此外,通过两阶段训练和潜在分布正则化,进一步优化了下游生成任务的性能。

链接: https://arxiv.org/abs/2602.04220
作者: Yao Teng,Minxuan Lin,Xian Liu,Shuai Wang,Xiao Yang,Xihui Liu
机构: The University of Hong Kong (香港大学); ByteDance Inc. (字节跳动); CUHK (香港中文大学); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent video generation models largely rely on video autoencoders that compress pixel-space videos into latent representations. However, existing video autoencoders suffer from three major limitations: (1) fixed-rate compression that wastes tokens on simple videos, (2) inflexible CNN architectures that prevent variable-length latent modeling, and (3) deterministic decoders that struggle to recover appropriate details from compressed latents. To address these issues, we propose One-Dimensional Diffusion Video Autoencoder (One-DVA), a transformer-based framework for adaptive 1D encoding and diffusion-based decoding. The encoder employs query-based vision transformers to extract spatiotemporal features and produce latent representations, while a variable-length dropout mechanism dynamically adjusts the latent length. The decoder is a pixel-space diffusion transformer that reconstructs videos with the latents as input conditions. With a two-stage training strategy, One-DVA achieves performance comparable to 3D-CNN VAEs on reconstruction metrics at identical compression ratios. More importantly, it supports adaptive compression and thus can achieve higher compression ratios. To better support downstream latent generation, we further regularize the One-DVA latent distribution for generative modeling and fine-tune its decoder to mitigate artifacts caused by the generation process.
zh

[CV-63] AGMA: Adaptive Gaussian Mixture Anchors for Prior-Guided Multimodal Human Trajectory Forecasting

【速读】:该论文旨在解决人类轨迹预测中因先验分布(prior distribution)与真实行为多样性不匹配而导致的预测精度和多样性受限的问题。现有方法通常使用固定或学习得到的先验,难以充分捕捉多种可能的未来轨迹分布,从而成为性能瓶颈。解决方案的关键在于提出AGMA(Adaptive Gaussian Mixture Anchors),通过两个阶段构建高表达能力的先验:首先从训练数据中提取多样化的行人行为模式,然后将其蒸馏为场景自适应的全局先验用于推理,从而显著提升预测质量与多样性。

链接: https://arxiv.org/abs/2602.04204
作者: Chao Li,Rui Zhang,Siyuan Huang,Xian Zhong,Hongbo Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 14 pages, 3 figures

点击查看摘要

Abstract:Human trajectory forecasting requires capturing the multimodal nature of pedestrian behavior. However, existing approaches suffer from prior misalignment. Their learned or fixed priors often fail to capture the full distribution of plausible futures, limiting both prediction accuracy and diversity. We theoretically establish that prediction error is lower-bounded by prior quality, making prior modeling a key performance bottleneck. Guided by this insight, we propose AGMA (Adaptive Gaussian Mixture Anchors), which constructs expressive priors through two stages: extracting diverse behavioral patterns from training data and distilling them into a scene-adaptive global prior for inference. Extensive experiments on ETH-UCY, Stanford Drone, and JRDB datasets demonstrate that AGMA achieves state-of-the-art performance, confirming the critical role of high-quality priors in trajectory forecasting.
zh

[CV-64] VTok: A Unified Video Tokenizer with Decoupled Spatial-Temporal Latents

【速读】:该论文旨在解决视频表征中因传统帧采样策略导致的token序列过长、计算复杂度高且时空信息表达不高效的问题。其解决方案的关键在于提出VTok框架,通过解耦视频的时空表示:保留单个关键帧的空间特征,并将后续每一帧编码为一个残差token(residual token),从而将视频表示的复杂度从帧数与每帧token数的乘积降低到二者之和,同时有效捕捉相对于关键帧的视角变化和运动信息。此方法在视频理解与文本到视频生成任务中均展现出更短token序列下更高的性能与更强的时间一致性。

链接: https://arxiv.org/abs/2602.04202
作者: Feng Wang,Yichun Shi,Ceyuan Yang,Qiushan Guo,Jingxiang Sun,Alan Yuille,Peng Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This work presents VTok, a unified video tokenization framework that can be used for both generation and understanding tasks. Unlike the leading vision-language systems that tokenize videos through a naive frame-sampling strategy, we propose to decouple the spatial and temporal representations of videos by retaining the spatial features of a single key frame while encoding each subsequent frame into a single residual token, achieving compact yet expressive video tokenization. Our experiments suggest that VTok effectively reduces the complexity of video representation from the product of frame count and per-frame token count to their sum, while the residual tokens sufficiently capture viewpoint and motion changes relative to the key frame. Extensive evaluations demonstrate the efficacy and efficiency of VTok: it achieves notably higher performance on a range of video understanding and text-to-video generation benchmarks compared with baselines using naive tokenization, all with shorter token sequences per video (e.g., 3.4% higher accuracy on our TV-Align benchmark and 1.9% higher VBench score). Remarkably, VTok produces more coherent motion and stronger guidance following in text-to-video generation, owing to its more consistent temporal encoding. We hope VTok can serve as a standardized video tokenization paradigm for future research in video understanding and generation.
zh

[CV-65] Continuous Degradation Modeling via Latent Flow Matching for Real-World Super-Resolution AAAI2026

【速读】:该论文旨在解决深度学习超分辨率(Super-Resolution, SR)方法在真实世界图像上性能下降的问题,尤其是当图像遭受复杂非线性退化(如噪声、模糊和压缩伪影)时,传统基于合成退化(如双三次下采样)的训练策略难以泛化。其解决方案的关键在于提出一种新颖框架,通过在潜在退化空间中利用流匹配(flow matching)技术,从单张高分辨率(High-Resolution, HR)图像合成具有真实退化特征的低分辨率(Low-Resolution, LR)图像,从而生成大规模、多样化的现实世界SR训练数据集,且支持未见过的退化强度。实验表明,该方法生成的LR图像能准确复现真实退化模式,并显著提升传统与任意尺度SR模型的重建质量。

链接: https://arxiv.org/abs/2602.04193
作者: Hyeonjae Kim,Dongjin Kim,Eugene Jin,Tae Hyun Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI 2026

点击查看摘要

Abstract:While deep learning-based super-resolution (SR) methods have shown impressive outcomes with synthetic degradation scenarios such as bicubic downsampling, they frequently struggle to perform well on real-world images that feature complex, nonlinear degradations like noise, blur, and compression artifacts. Recent efforts to address this issue have involved the painstaking compilation of real low-resolution (LR) and high-resolution (HR) image pairs, usually limited to several specific downscaling factors. To address these challenges, our work introduces a novel framework capable of synthesizing authentic LR images from a single HR image by leveraging the latent degradation space with flow matching. Our approach generates LR images with realistic artifacts at unseen degradation levels, which facilitates the creation of large-scale, real-world SR training datasets. Comprehensive quantitative and qualitative assessments verify that our synthetic LR images accurately replicate real-world degradations. Furthermore, both traditional and arbitrary-scale SR models trained using our datasets consistently yield much better HR outcomes.
zh

[CV-66] DiMo: Discrete Diffusion Modeling for Motion Generation and Understanding

【速读】:该论文旨在解决现有文本到动作(Text-to-Motion, T2M)生成方法在双向理解与生成能力上的局限性,尤其是缺乏统一框架下对文本-动作双向映射及无文本场景下动作生成的支持。其核心解决方案是提出DiMo,一种基于离散扩散风格(discrete diffusion-style)的框架,通过迭代式掩码标记优化(iterative masked token refinement)替代传统GPT类自回归解码方式,从而在单一模型中统一实现T2M、动作到文本(Motion-to-Text, M2T)以及无文本动作到动作(Motion-to-Motion, M2M)任务。关键创新包括引入残差向量量化(Residual Vector Quantization, RVQ)提升动作标记保真度,以及采用分组相对策略优化(Group Relative Policy Optimization, GRPO)增强动作与文本间的对齐性和可控性,显著提升了模型在HumanML3D和KIT-ML数据集上的运动质量与双向理解性能。

链接: https://arxiv.org/abs/2602.04188
作者: Ning Zhang,Zhengyu Li,Kwong Weng Loh,Mingxi Xu,Qi Wang,Zhengyu Wen,Xiaoyu He,Wei Zhao,Kehong Gong,Mingyuan Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Prior masked modeling motion generation methods predominantly study text-to-motion. We present DiMo, a discrete diffusion-style framework, which extends masked modeling to bidirectional text–motion understanding and generation. Unlike GPT-style autoregressive approaches that tokenize motion and decode sequentially, DiMo performs iterative masked token refinement, unifying Text-to-Motion (T2M), Motion-to-Text (M2T), and text-free Motion-to-Motion (M2M) within a single model. This decoding paradigm naturally enables a quality-latency trade-off at inference via the number of refinement this http URL further improve motion token fidelity with residual vector quantization (RVQ) and enhance alignment and controllability with Group Relative Policy Optimization (GRPO). Experiments on HumanML3D and KIT-ML show strong motion quality and competitive bidirectional understanding under a unified framework. In addition, we demonstrate model ability in text-free motion completion, text-guided motion prediction and motion caption correction without architectural this http URL qualitative results are available on our project page: this https URL.
zh

[CV-67] Natural Language Instructions for Scene-Responsive Human-in-the-Loop Motion Planning in Autonomous Driving using Vision-Language-Action Models

【速读】:该论文旨在解决指令引导型驾驶(instruction-grounded driving)中,车辆如何基于乘客的自然语言指令准确理解意图并规划轨迹的问题。现有方法多依赖仿真环境或固定命令词汇表,难以在真实场景中泛化。其解决方案的关键在于构建首个真实世界数据集doScenes,该数据集将自由形式的带指代关系的指令与nuScenes的真实运动标注对齐,并在此基础上适配OpenEMMA——一个基于多模态大语言模型(Multimodal Large Language Model, MLLM)的端到端驾驶框架,使其能够接收前视摄像头图像和自车状态输入,并输出10步速度-曲率轨迹。通过在OpenEMMA的视觉-语言接口中嵌入乘客风格的指令提示(prompt),实现语言条件驱动的轨迹生成。实验表明,指令条件显著提升了鲁棒性(平均ADE降低98.7%),且良好表述的指令仍能进一步改善轨迹对齐度(最高提升5.1%)。

链接: https://arxiv.org/abs/2602.04184
作者: Angel Martinez-Sanchez,Parthib Roy,Ross Greer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Instruction-grounded driving, where passenger language guides trajectory planning, requires vehicles to understand intent before motion. However, most prior instruction-following planners rely on simulation or fixed command vocabularies, limiting real-world generalization. doScenes, the first real-world dataset linking free-form instructions (with referentiality) to nuScenes ground-truth motion, enables instruction-conditioned planning. In this work, we adapt OpenEMMA, an open-source MLLM-based end-to-end driving framework that ingests front-camera views and ego-state and outputs 10-step speed-curvature trajectories, to this setting, presenting a reproducible instruction-conditioned baseline on doScenes and investigate the effects of human instruction prompts on predicted driving behavior. We integrate doScenes directives as passenger-style prompts within OpenEMMA’s vision-language interface, enabling linguistic conditioning before trajectory generation. Evaluated on 849 annotated scenes using ADE, we observe that instruction conditioning substantially improves robustness by preventing extreme baseline failures, yielding a 98.7% reduction in mean ADE. When such outliers are removed, instructions still influence trajectory alignment, with well-phrased prompts improving ADE by up to 5.1%. We use this analysis to discuss what makes a “good” instruction for the OpenEMMA framework. We release the evaluation prompts and scripts to establish a reproducible baseline for instruction-aware planning. GitHub: this https URL
zh

[CV-68] HoloEv-Net: Efficient Event-based Action Recognition via Holographic Spatial Embedding and Global Spectral Gating

【速读】:该论文旨在解决事件相机(Event Camera)在动作识别任务中面临的三大挑战:(i)密集体素表示带来的计算冗余;(ii)多分支架构固有的结构冗余;(iii)对频域谱信息利用不足,导致全局运动模式建模能力弱。解决方案的关键在于提出一种高效框架HoloEv-Net,其核心创新包括:(1)提出紧凑全息时空表示(Compact Holographic Spatiotemporal Representation, CHSR),通过将水平空间信息隐式嵌入时间-高度(T-H)视图,在保持三维时空上下文的同时显著降低计算复杂度;(2)设计全局频谱门控模块(Global Spectral Gating, GSG),利用快速傅里叶变换(FFT)在频域实现全局token混合,以极低参数开销增强模型对全局运动模式的感知能力。

链接: https://arxiv.org/abs/2602.04182
作者: Weidong Hao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Event-based Action Recognition (EAR) has attracted significant attention due to the high temporal resolution and high dynamic range of event cameras. However, existing methods typically suffer from (i) the computational redundancy of dense voxel representations, (ii) structural redundancy inherent in multi-branch architectures, and (iii) the under-utilization of spectral information in capturing global motion patterns. To address these challenges, we propose an efficient EAR framework named HoloEv-Net. First, to simultaneously tackle representation and structural redundancies, we introduce a Compact Holographic Spatiotemporal Representation (CHSR). Departing from computationally expensive voxel grids, CHSR implicitly embeds horizontal spatial cues into the Time-Height (T-H) view, effectively preserving 3D spatiotemporal contexts within a 2D representation. Second, to exploit the neglected spectral cues, we design a Global Spectral Gating (GSG) module. By leveraging the Fast Fourier Transform (FFT) for global token mixing in the frequency domain, GSG enhances the representation capability with negligible parameter overhead. Extensive experiments demonstrate the scalability and effectiveness of our framework. Specifically, HoloEv-Net-Base achieves state-of-the-art performance on THU-EACT-50-CHL, HARDVS and DailyDVS-200, outperforming existing methods by 10.29%, 1.71% and 6.25%, respectively. Furthermore, our lightweight variant, HoloEv-Net-Small, delivers highly competitive accuracy while offering extreme efficiency, reducing parameters by 5.4 times, FLOPs by 300times, and latency by 2.4times compared to heavy baselines, demonstrating its potential for edge deployment.
zh

[CV-69] Partial Ring Scan: Revisiting Scan Order in Vision State Space Models

【速读】:该论文旨在解决视觉状态空间模型(Vision State Space Models, Vision SSMs)中因固定扫描顺序(scan order)导致的空间邻接关系破坏、物体连续性断裂以及在几何变换(如旋转)下性能下降的问题。其解决方案的关键在于提出一种旋转鲁棒的遍历策略——部分环形扫描Mamba(Partial Ring Scan Mamba, PRISMamba),该方法将图像划分为同心环,在每个环内进行与顺序无关的聚合,并通过短程径向状态空间模型(radial SSMs)跨环传递上下文信息;同时引入局部通道过滤机制,仅将最具信息量的通道送入递归环路径,其余通道走轻量残差分支,从而在保证精度的同时显著提升效率和旋转鲁棒性。

链接: https://arxiv.org/abs/2602.04170
作者: Yi-Kuan Hsieh,Jun-Wei Hsieh,Xin li,Ming-Ching Chang,Yu-Chee Tseng
机构: National Yang Ming Chiao Tung University (国立阳明交通大学); University at Albany, SUNY (纽约州立大学阿尔巴尼分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 3 figures

点击查看摘要

Abstract:State Space Models (SSMs) have emerged as efficient alternatives to attention for vision tasks, offering lineartime sequence processing with competitive accuracy. Vision SSMs, however, require serializing 2D images into 1D token sequences along a predefined scan order, a factor often overlooked. We show that scan order critically affects performance by altering spatial adjacency, fracturing object continuity, and amplifying degradation under geometric transformations such as rotation. We present Partial RIng Scan Mamba (PRISMamba), a rotation-robust traversal that partitions an image into concentric rings, performs order-agnostic aggregation within each ring, and propagates context across rings through a set of short radial SSMs. Efficiency is further improved via partial channel filtering, which routes only the most informative channels through the recurrent ring pathway while keeping the rest on a lightweight residual branch. On ImageNet-1K, PRISMamba achieves 84.5% Top-1 with 3.9G FLOPs and 3,054 img/s on A100, outperforming VMamba in both accuracy and throughput while requiring fewer FLOPs. It also maintains performance under rotation, whereas fixed-path scans drop by 1~2%. These results highlight scan-order design, together with channel filtering, as a crucial, underexplored factor for accuracy, efficiency, and rotation robustness in Vision SSMs. Code will be released upon acceptance.
zh

[CV-70] Point2Insert: Video Object Insertion via Sparse Point Guidance

【速读】:该论文旨在解决视频中对象插入(object insertion)的两个核心问题:一是基于掩码(mask-based)的方法需要大量人工标注掩码,效率低下;二是基于指令(instruction-based)的方法难以实现精确的位置控制。其解决方案的关键在于提出 Point2Insert 框架,通过仅需少量稀疏点(sparse points)作为输入,结合正负点提示(positive and negative points)实现对插入区域的细粒度空间控制,从而在无需密集掩码标注的前提下完成灵活、精准的对象插入。

链接: https://arxiv.org/abs/2602.04167
作者: Yu Zhou,Xiaoyan Yang,Bojia Zi,Lihan Zhang,Ruijie Sun,Weishi Zheng,Haibin Huang,Chi Zhang,Xuelong Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper introduces Point2Insert, a sparse-point-based framework for flexible and user-friendly object insertion in videos, motivated by the growing popularity of accurate, low-effort object placement. Existing approaches face two major challenges: mask-based insertion methods require labor-intensive mask annotations, while instruction-based methods struggle to place objects at precise locations. Point2Insert addresses these issues by requiring only a small number of sparse points instead of dense masks, eliminating the need for tedious mask drawing. Specifically, it supports both positive and negative points to indicate regions that are suitable or unsuitable for insertion, enabling fine-grained spatial control over object locations. The training of Point2Insert consists of two stages. In Stage 1, we train an insertion model that generates objects in given regions conditioned on either sparse-point prompts or a binary mask. In Stage 2, we further train the model on paired videos synthesized by an object removal model, adapting it to video insertion. Moreover, motivated by the higher insertion success rate of mask-guided editing, we leverage a mask-guided insertion model as a teacher to distill reliable insertion behavior into the point-guided model. Extensive experiments demonstrate that Point2Insert consistently outperforms strong baselines and even surpasses models with \times 10 more parameters.
zh

[CV-71] Improving 2D Diffusion Models for 3D Medical Imaging with Inter-Slice Consistent Stochasticity ICLR2026

【速读】:该论文旨在解决基于2D扩散模型(Diffusion Models, DMs)进行3D医学图像重建时,由于扩散采样过程中的随机性导致的跨切片不连续问题(inter-slice discontinuities),该问题会严重影响3D体积重建质量。现有方法通常通过沿z轴施加连续性正则化来缓解此问题,但这类方法引入敏感超参数并可能导致过度平滑。论文提出的关键解决方案是跨切片一致性随机性控制(Inter-Slice Consistent Stochasticity, ISCS),其核心思想是在扩散采样过程中约束不同切片间噪声成分的一致性,从而对齐各切片的采样轨迹,无需添加额外损失项或优化步骤,即可实现高效、无额外计算成本的跨切片一致性增强。该策略具有即插即用特性,可无缝集成至任意基于2D扩散先验的3D重建流程中。

链接: https://arxiv.org/abs/2602.04162
作者: Chenhe Du,Qing Wu,Xuanyu Tian,Jingyi Yu,Hongjiang Wei,Yuyao Zhang
机构: ShanghaiTech University (上海科技大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: Accepted by ICLR 2026

点击查看摘要

Abstract:3D medical imaging is in high demand and essential for clinical diagnosis and scientific research. Currently, diffusion models (DMs) have become an effective tool for medical imaging reconstruction thanks to their ability to learn rich, high-quality data priors. However, learning the 3D data distribution with DMs in medical imaging is challenging, not only due to the difficulties in data collection but also because of the significant computational burden during model training. A common compromise is to train the DMs on 2D data priors and reconstruct stacked 2D slices to address 3D medical inverse problems. However, the intrinsic randomness of diffusion sampling causes severe inter-slice discontinuities of reconstructed 3D volumes. Existing methods often enforce continuity regularizations along the z-axis, which introduces sensitive hyper-parameters and may lead to over-smoothing results. In this work, we revisit the origin of stochasticity in diffusion sampling and introduce Inter-Slice Consistent Stochasticity (ISCS), a simple yet effective strategy that encourages interslice consistency during diffusion sampling. Our key idea is to control the consistency of stochastic noise components during diffusion sampling, thereby aligning their sampling trajectories without adding any new loss terms or optimization steps. Importantly, the proposed ISCS is plug-and-play and can be dropped into any 2D trained diffusion based 3D reconstruction pipeline without additional computational cost. Experiments on several medical imaging problems show that our method can effectively improve the performance of medical 3D imaging problems based on 2D diffusion models. Our findings suggest that controlling inter-slice stochasticity is a principled and practically attractive route toward high-fidelity 3D medical imaging with 2D diffusion priors. The code is available at: this https URL
zh

[CV-72] Context Determines Optimal Architecture in Materials Segmentation

【速读】:该论文旨在解决材料图像分割模型在跨模态应用场景下的性能不确定性问题,即现有分割架构通常仅在单一成像模态(如SEM、AFM、XCT或光学显微镜)上进行评估,导致其在实际部署时可能出现性能下降,且缺乏对模型可靠性与可解释性的支持。解决方案的关键在于提出一个跨模态评价框架,涵盖四种常见材料表征成像技术,并通过在七个数据集上系统评估六种编码器-解码器组合,揭示了不同架构在特定模态和场景下的最优表现(如UNet在高对比度2D图像中表现优异,DeepLabv3+适用于复杂难分场景)。此外,该框架还集成分布外检测(out-of-distribution detection)和反事实解释(counterfactual explanations),提供模型可靠性信号与驱动预测的关键微结构特征,从而填补材料表征领域中关于模型选型与可信度评估的实践空白。

链接: https://arxiv.org/abs/2602.04154
作者: Mingjian Lu,Pawan K. Tripathi,Mark Shteyn,Debargha Ganguly,Roger H. French,Vipin Chaudhary,Yinghui Wu
机构: Case Western Reserve University (凯斯西储大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Segmentation architectures are typically benchmarked on single imaging modalities, obscuring deployment-relevant performance variations: an architecture optimal for one modality may underperform on another. We present a cross-modal evaluation framework for materials image segmentation spanning SEM, AFM, XCT, and optical microscopy. Our evaluation of six encoder-decoder combinations across seven datasets reveals that optimal architectures vary systematically by context: UNet excels for high-contrast 2D imaging while DeepLabv3+ is preferred for the hardest cases. The framework also provides deployment feedback via out-of-distribution detection and counterfactual explanations that reveal which microstructural features drive predictions. Together, the architecture guidance, reliability signals, and interpretability tools address a practical gap in materials characterization, where researchers lack tools to select architectures for their specific imaging setup or assess when models can be trusted on new samples.
zh

[CV-73] JSynFlow: Japanese Synthesised Flowchart Visual Question Answering Dataset built with Large Language Models

【速读】:该论文旨在解决当前视觉语言模型(Vision and Language Models, VLMs)在理解和分析包含流程图等复杂结构化文档时能力不足的问题,尤其是缺乏大规模、高质量的流程图图像与对应文本问答对(QA pairs)数据集,导致模型难以精准识别和解释流程图内容。解决方案的关键在于提出JSynFlow——一个通过大语言模型(Large Language Models, LLMs)自动生成的合成视觉问答数据集,其包含多种业务岗位的任务描述、由领域专用语言(Domain-Specific Language, DSL)代码渲染的流程图图像以及对应的问答对。该方法显著降低了人工标注成本,并通过实验证明,使用JSynFlow进行微调可显著提升VLM在流程图问答任务上的性能。

链接: https://arxiv.org/abs/2602.04142
作者: Hiroshi Sasaki
机构: The Japan Research Institute, Limited (日本研究机构有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 7 pages

点击查看摘要

Abstract:Vision and language models (VLMs) are expected to analyse complex documents, such as those containing flowcharts, through a question-answering (QA) interface. The ability to recognise and interpret these flowcharts is in high demand, as they provide valuable insights unavailable in text-only explanations. However, developing VLMs with precise flowchart understanding requires large-scale datasets of flowchart images and corresponding text, the creation of which is highly time-consuming. To address this challenge, we introduce JSynFlow, a synthesised visual QA dataset for Japanese flowcharts, generated using large language models (LLMs). Our dataset comprises task descriptions for various business occupations, the corresponding flowchart images rendered from domain-specific language (DSL) code, and related QA pairs. This paper details the dataset’s synthesis procedure and demonstrates that fine-tuning with JSynFlow significantly improves VLM performance on flowchart-based QA tasks. Our dataset is publicly available at this https URL.
zh

[CV-74] SuperPoint-E: local features for 3D reconstruction via tracking adaptation in endoscopy

【速读】:该论文旨在解决内窥镜视频中Structure-from-Motion (SfM) 三维重建性能不足的问题,核心挑战在于特征提取质量差导致重建稀疏、覆盖范围有限且稳定性低。解决方案的关键是提出了一种新的局部特征提取方法 SuperPoint-E,并引入了“Tracking Adaptation”监督策略,显著提升了特征检测与描述的精度和鲁棒性;该策略使检测更加密集、特征更易存活(即更高的检测精度),同时增强了描述子的判别能力,从而减少了对引导匹配步骤的依赖,最终实现了更稠密、更完整的3D重建结果。

链接: https://arxiv.org/abs/2602.04108
作者: O. Leon Barbed,José M. M. Montiel,Pascal Fua,Ana C. Murillo
机构: University of Zaragoza (萨拉戈萨大学); École Polytechnique Fédérale de Lausanne (洛桑联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 5 tables, 6 figures

点击查看摘要

Abstract:In this work, we focus on boosting the feature extraction to improve the performance of Structure-from-Motion (SfM) in endoscopy videos. We present SuperPoint-E, a new local feature extraction method that, using our proposed Tracking Adaptation supervision strategy, significantly improves the quality of feature detection and description in endoscopy. Extensive experimentation on real endoscopy recordings studies our approach’s most suitable configuration and evaluates SuperPoint-E feature quality. The comparison with other baselines also shows that our 3D reconstructions are denser and cover more and longer video segments because our detector fires more densely and our features are more likely to survive (i.e. higher detection precision). In addition, our descriptor is more discriminative, making the guided matching step almost redundant. The presented approach brings significant improvements in the 3D reconstructions obtained, via SfM on endoscopy videos, compared to the original SuperPoint and the gold standard SfM COLMAP pipeline.
zh

[CV-75] DMS2F-HAD: A Dual-branch Mamba-based Spatial-Spectral Fusion Network for Hyperspectral Anomaly Detection WACV2025

【速读】:该论文旨在解决高光谱异常检测(Hyperspectral Anomaly Detection, HAD)中现有深度学习方法难以同时捕捉长距离光谱依赖关系且计算效率低的问题。传统卷积神经网络(Convolutional Neural Networks, CNNs)在建模长程光谱特征时表现不足,而基于Transformer的方法虽然能建模全局依赖但存在较高的计算开销。为此,作者提出了一种新颖的双分支Mamba架构——DMS2F-HAD,其关键创新在于利用Mamba模型的线性时间复杂度特性,在独立分支中分别高效提取空间和光谱特征,并通过动态门控融合机制实现特征集成,从而显著提升异常定位精度与推理速度。实验表明,该方法在14个基准高光谱图像(Hyperspectral Images, HSIs)数据集上平均AUC达到98.78%,且推理速度比同类方法快4.6倍,展现出优异的泛化能力和实用性。

链接: https://arxiv.org/abs/2602.04102
作者: Aayushma Pant,Lakpa Tamang,Tsz-Kwan Lee,Sunil Aryal
机构: Deakin University (迪肯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This paper has been accepted in the WACV 2025 conference in algorithm track

点击查看摘要

Abstract:Hyperspectral anomaly detection (HAD) aims to identify rare and irregular targets in high-dimensional hyperspectral images (HSIs), which are often noisy and unlabelled data. Existing deep learning methods either fail to capture long-range spectral dependencies (e.g., convolutional neural networks) or suffer from high computational cost (e.g., Transformers). To address these challenges, we propose DMS2F-HAD, a novel dual-branch Mamba-based model. Our architecture utilizes Mamba’s linear-time modeling to efficiently learn distinct spatial and spectral features in specialized branches, which are then integrated by a dynamic gated fusion mechanism to enhance anomaly localization. Across fourteen benchmark HSI datasets, our proposed DMS2F-HAD not only achieves a state-of-the-art average AUC of 98.78%, but also demonstrates superior efficiency with an inference speed 4.6 times faster than comparable deep learning methods. The results highlight DMS2FHAD’s strong generalization and scalability, positioning it as a strong candidate for practical HAD applications.
zh

[CV-76] VideoBrain: Learning Adaptive Frame Sampling for Long Video Understanding

【速读】:该论文旨在解决长视频理解中因计算资源限制与跨数千帧的信息捕捉需求之间的矛盾问题,传统方法要么均匀采样易造成信息丢失,要么单次关键帧选择缺乏纠错机制。其解决方案的关键在于提出VideoBrain框架,通过两个互补的智能体实现自适应视觉信息获取:基于CLIP的语义检索代理用于跨视频的语义定位,均匀采样代理则在区间内进行密集时间采样;同时引入行为感知奖励函数与数据分类流水线,引导模型仅在真正需要时调用代理,从而在减少30-40%帧数的前提下,在四个长视频基准上实现3.5%至9.0%的性能提升,并展现出良好的跨数据集泛化能力。

链接: https://arxiv.org/abs/2602.04094
作者: Junbo Zou,Ziheng Huang,Shengjie Zhang,Liwen Zhang,Weining Shen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Long-form video understanding remains challenging for Vision-Language Models (VLMs) due to the inherent tension between computational constraints and the need to capture information distributed across thousands of frames. Existing approaches either sample frames uniformly (risking information loss) or select keyframes in a single pass (with no recovery from poor choices). We propose VideoBrain, an end-to-end framework that enables VLMs to adaptively acquire visual information through learned sampling policies. Our approach features dual complementary agents: a CLIP-based agent for semantic retrieval across the video and a Uniform agent for dense temporal sampling within intervals. Unlike prior agent-based methods that rely on text-only LLMs orchestrating visual tools, our VLM directly perceives frames and reasons about information sufficiency. To prevent models from invoking agents indiscriminately to maximize rewards, we introduce a behavior-aware reward function coupled with a data classification pipeline that teaches the model when agent invocation is genuinely beneficial. Experiments on four long video benchmarks demonstrate that VideoBrain achieves +3.5% to +9.0% improvement over the baseline while using 30-40% fewer frames, with strong cross-dataset generalization to short video benchmarks.
zh

[CV-77] Sight: Towards expert-AI co-assessment for improved immunohistochemistry staining interpretation

【速读】:该论文旨在解决免疫组织化学(Immunohistochemistry, IHC)图像分析中因组织类型和染色特征的域特异性差异导致的AI模型泛化能力不足问题。其核心解决方案是构建了一个大规模、多标签、带完整元数据的IHC图像数据集HPA10M(含1049万张图像,覆盖45种正常组织和20种主要癌症类型),并基于此训练了iSight多任务学习框架;该框架通过token级注意力机制融合全切片图像的视觉特征与组织元数据,实现对染色强度、位置、数量、组织类型及恶性状态的联合预测,显著提升了IHC自动评估的准确性与校准性,并在专家-人工智能协同评估中验证了其可提升病理诊断一致性与可靠性。

链接: https://arxiv.org/abs/2602.04063
作者: Jacob S. Leiby,Jialu Yao,Pan Lu,George Hu,Anna Davidian,Shunsuke Koga,Olivia Leung,Pravin Patel,Isabella Tondi Resta,Rebecca Rojansky,Derek Sung,Eric Yang,Paul J. Zhang,Emma Lundberg,Dokyoon Kim,Serena Yeung-Levy,James Zou,Thomas Montine,Jeffrey Nirschl,Zhi Huang
机构: University of Pennsylvania(宾夕法尼亚大学); Stanford University(斯坦福大学); University of Wisconsin(威斯康星大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Immunohistochemistry (IHC) provides information on protein expression in tissue sections and is commonly used to support pathology diagnosis and disease triage. While AI models for H\E-stained slides show promise, their applicability to IHC is limited due to domain-specific variations. Here we introduce HPA10M, a dataset that contains 10,495,672 IHC images from the Human Protein Atlas with comprehensive metadata included, and encompasses 45 normal tissue types and 20 major cancer types. Based on HPA10M, we trained iSight, a multi-task learning framework for automated IHC staining assessment. iSight combines visual features from whole-slide images with tissue metadata through a token-level attention mechanism, simultaneously predicting staining intensity, location, quantity, tissue type, and malignancy status. On held-out data, iSight achieved 85.5% accuracy for location, 76.6% for intensity, and 75.7% for quantity, outperforming fine-tuned foundation models (PLIP, CONCH) by 2.5–10.2%. In addition, iSight demonstrates well-calibrated predictions with expected calibration errors of 0.0150-0.0408. Furthermore, in a user study with eight pathologists evaluating 200 images from two datasets, iSight outperformed initial pathologist assessments on the held-out HPA dataset (79% vs 68% for location, 70% vs 57% for intensity, 68% vs 52% for quantity). Inter-pathologist agreement also improved after AI assistance in both held-out HPA (Cohen’s \kappa increased from 0.63 to 0.70) and Stanford TMAD datasets (from 0.74 to 0.76), suggesting expert–AI co-assessment can improve IHC interpretation. This work establishes a foundation for AI systems that can improve IHC diagnostic accuracy and highlights the potential for integrating iSight into clinical workflows to enhance the consistency and reliability of IHC assessment.
zh

[CV-78] SEIS: Subspace-based Equivariance and Invariance Scores for Neural Representations

【速读】:该论文旨在解决现有方法在评估神经网络特征表示对几何变换的响应时,难以区分等变性(equivariance)与不变性(invariance)的问题,且无法揭示内部表征中几何信息的组织方式。其解决方案的关键在于提出SEIS(Subspace-based Equivariance and Invariance Scores),这是一种基于子空间的度量方法,能够在不依赖标签或显式变换知识的情况下,逐层分析特征表示对几何变换的响应,从而有效分离等变性与不变性,准确刻画模型内部表征的几何结构保持能力。

链接: https://arxiv.org/abs/2602.04054
作者: Huahua Lin,Katayoun Farrahi,Xiaohao Cai
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding how neural representations respond to geometric transformations is essential for evaluating whether learned features preserve meaningful spatial structure. Existing approaches primarily assess robustness by comparing model outputs under transformed inputs, offering limited insight into how geometric information is organized within internal representations and failing to distinguish between information loss and re-encoding. In this work, we introduce SEIS (Subspace-based Equivariance and Invariance Scores), a subspace metric for analyzing layer-wise feature representations under geometric transformations, disentangling equivariance from invariance without requiring labels or explicit knowledge of the transformation. Synthetic validation confirms that SEIS correctly recovers known transformations. Applied to trained classification networks, SEIS reveals a transition from equivariance in early layers to invariance in deeper layers, and that data augmentation increases invariance while preserving equivariance. We further show that multi-task learning induces synergistic gains in both properties at the shared encoder, and skip connections restore equivariance lost during decoding.
zh

[CV-79] Seeing Through Clutter: Structured 3D Scene Reconstruction via Iterative Object Removal

【速读】:该论文旨在解决复杂场景中从单张图像重建结构化三维表示的问题,现有方法依赖于语义分割和深度估计等中间任务,在存在遮挡和杂乱背景时性能受限。其解决方案的关键在于提出了一种迭代的对象移除与重建流水线,通过视觉语言模型(VLMs)作为调度器,逐个检测、分割、移除前景物体并进行三维拟合,从而将复杂场景分解为一系列更简单的子任务;该策略显著提升了在高度遮挡场景下的分割质量,且无需特定任务训练,可直接受益于基础模型的持续进步。

链接: https://arxiv.org/abs/2602.04053
作者: Rio Aguina-Kang,Kevin James Blackburn-Matzen,Thibault Groueix,Vladimir Kim,Matheus Gadelha
机构: University of California, San Diego (加州大学圣地亚哥分校); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To appear in 3DV 2026

点击查看摘要

Abstract:We present SeeingThroughClutter, a method for reconstructing structured 3D representations from single images by segmenting and modeling objects individually. Prior approaches rely on intermediate tasks such as semantic segmentation and depth estimation, which often underperform in complex scenes, particularly in the presence of occlusion and clutter. We address this by introducing an iterative object removal and reconstruction pipeline that decomposes complex scenes into a sequence of simpler subtasks. Using VLMs as orchestrators, foreground objects are removed one at a time via detection, segmentation, object removal, and 3D fitting. We show that removing objects allows for cleaner segmentations of subsequent objects, even in highly occluded scenes. Our method requires no task-specific training and benefits directly from ongoing advances in foundation models. We demonstrate stateof-the-art robustness on 3D-Front and ADE20K datasets. Project Page: this https URL
zh

[CV-80] Artifact Removal and Image Restoration in AFM:A Structured Mask-Guided Directional Inpainting Approach

【速读】:该论文旨在解决原子力显微镜(Atomic Force Microscopy, AFM)图像中因环境噪声、扫描不完美及探针-样品相互作用等因素导致的伪影问题,这些问题会严重影响纳米尺度表面形貌的准确解析。解决方案的关键在于提出一个轻量级且全自动的处理框架:首先通过分类模型判断图像是否含伪影;若存在,则利用定制训练的轻量语义分割网络生成精确的伪影掩膜,并基于其结构方向自适应扩展掩膜区域;随后采用基于方向邻域的插值策略进行修复以保持三维表面连续性,最后结合局部高斯平滑实现无缝恢复。整个流程集成于用户友好的图形界面中,支持实时参数调整与批量处理,从而实现了对AFM图像中伪影的有效去除并保留纳米级结构细节,提供了一种几何感知的高保真数据处理方案。

链接: https://arxiv.org/abs/2602.04051
作者: Juntao Zhang,Angona Biswas,Jaydeep Rade,Charchit Shukla,Juan Ren,Anwesha Sarkar,Adarsh Krishnamurthy,Aditya Balu
机构: Iowa State University (爱荷华州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Atomic Force Microscopy (AFM) enables high-resolution surface imaging at the nanoscale, yet the output is often degraded by artifacts introduced by environmental noise, scanning imperfections, and tip-sample interactions. To address this challenge, a lightweight and fully automated framework for artifact detection and restoration in AFM image analysis is presented. The pipeline begins with a classification model that determines whether an AFM image contains artifacts. If necessary, a lightweight semantic segmentation network, custom-designed and trained on AFM data, is applied to generate precise artifact masks. These masks are adaptively expanded based on their structural orientation and then inpainted using a directional neighbor-based interpolation strategy to preserve 3D surface continuity. A localized Gaussian smoothing operation is then applied for seamless restoration. The system is integrated into a user-friendly GUI that supports real-time parameter adjustments and batch processing. Experimental results demonstrate the effective artifact removal while preserving nanoscale structural details, providing a robust, geometry-aware solution for high-fidelity AFM data interpretation.
zh

[CV-81] Fast Unsupervised Framework for Registration Quality Assessment of Multi-stain Histological Whole Slide Pairs

【速读】:该论文旨在解决组织病理学全切片图像(Whole Slide Images, WSI)中高保真配准的质量评估难题,尤其是在缺乏真实标签(Ground-Truth, GT)标注的情况下,传统基于标记点或强度相似性指标的评估方法存在耗时、不可靠和计算复杂度高等问题,难以在大规模数字病理分析中应用。其解决方案的关键在于提出一种快速、无监督的配准质量评估(Registration Quality Assessment, RQA)框架,该框架联合使用下采样组织掩膜(tissue masks)和形变(deformations)相关指标:前者衡量全局结构对应性,后者评估局部平滑性、连续性和变换合理性,从而实现无需GT即可高效、可靠地评估HE与IHC图像对的配准质量,具备高保真度和低计算资源消耗特性,适用于大规模数字病理的质量控制场景。

链接: https://arxiv.org/abs/2602.04046
作者: Shikha Dubey,Patricia Raciti,Kristopher Standish,Albert Juan Ramon,Erik Ames Burlingame
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IEEE ISBI 2026

点击查看摘要

Abstract:High-fidelity registration of histopathological whole slide images (WSIs), such as hematoxylin eosin (HE) and immunohistochemistry (IHC), is vital for integrated molecular analysis but challenging to evaluate without ground-truth (GT) annotations. Existing WSI-level assessments – using annotated landmarks or intensity-based similarity metrics – are often time-consuming, unreliable, and computationally intensive, limiting large-scale applicability. This study proposes a fast, unsupervised framework that jointly employs down-sampled tissue masks- and deformations-based metrics for registration quality assessment (RQA) of registered HE and IHC WSI pairs. The masks-based metrics measure global structural correspondence, while the deformations-based metrics evaluate local smoothness, continuity, and transformation realism. Validation across multiple IHC markers and multi-expert assessments demonstrate a strong correlation between automated metrics and human evaluations. In the absence of GT, this framework offers reliable, real-time RQA with high fidelity and minimal computational resources, making it suitable for large-scale quality control in digital pathology.
zh

[CV-82] A Parameterizable Convolution Accelerator for Embedded Deep Learning Applications

【速读】:该论文旨在解决嵌入式深度学习(Deep Learning, DL)应用中CNN加速器设计面临的多约束优化问题,即在实际部署场景下,仅追求峰值性能(如GOPS)已不足以满足延迟、功耗、面积和成本等综合需求。解决方案的关键在于提出一种软硬件协同设计(Hardware-Software Co-design)方法,利用高层次综合(High-Level Synthesis, HLS)工具对CNN加速器进行参数化建模,从而实现跨多个设计约束的高效优化,并通过实验验证该方法在灵活性和性能上优于非参数化设计方式,且易于扩展至其他类型的DL应用。

链接: https://arxiv.org/abs/2602.04044
作者: Panagiotis Mousouliotis,Georgios Keramidas
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR)
备注: 6 pages, 4 figures. Published in the proceedings of the 2025 IEEE Computer Society Annual Symposium on VLSI (ISVLSI 2025), Kalamata, Greece, 6-9 July 2025

点击查看摘要

Abstract:Convolutional neural network (CNN) accelerators implemented on Field-Programmable Gate Arrays (FPGAs) are typically designed with a primary focus on maximizing performance, often measured in giga-operations per second (GOPS). However, real-life embedded deep learning (DL) applications impose multiple constraints related to latency, power consumption, area, and cost. This work presents a hardware-software (HW/SW) co-design methodology in which a CNN accelerator is described using high-level synthesis (HLS) tools that ease the parameterization of the design, facilitating more effective optimizations across multiple design constraints. Our experimental results demonstrate that the proposed design methodology is able to outperform non-parameterized design approaches, and it can be easily extended to other types of DL applications.
zh

[CV-83] AnyStyle: Single-Pass Multimodal Stylization for 3D Gaussian Splatting

【速读】:该论文旨在解决现有前馈式3D重建方法中风格化(stylization)控制能力不足的问题,特别是如何在无需姿态信息(pose-free)的前提下实现零样本(zero-shot)的外观风格迁移。现有方法主要依赖图像条件输入,导致风格控制的灵活性和可扩展性受限。解决方案的关键在于提出AnyStyle框架,其核心创新是引入多模态条件输入机制(支持文本和视觉风格参考),并通过模块化风格化架构实现与现有前馈3D重建模型(如3D Gaussian Splatting)的轻量级集成,从而在保持高质量几何重建的同时显著提升风格可控性。

链接: https://arxiv.org/abs/2602.04043
作者: Joanna Kaleta,Bartosz Świrta,Kacper Kania,Przemysław Spurek,Marek Kowalski
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The growing demand for rapid and scalable 3D asset creation has driven interest in feed-forward 3D reconstruction methods, with 3D Gaussian Splatting (3DGS) emerging as an effective scene representation. While recent approaches have demonstrated pose-free reconstruction from unposed image collections, integrating stylization or appearance control into such pipelines remains underexplored. Existing attempts largely rely on image-based conditioning, which limits both controllability and flexibility. In this work, we introduce AnyStyle, a feed-forward 3D reconstruction and stylization framework that enables pose-free, zero-shot stylization through multimodal conditioning. Our method supports both textual and visual style inputs, allowing users to control the scene appearance using natural language descriptions or reference images. We propose a modular stylization architecture that requires only minimal architectural modifications and can be integrated into existing feed-forward 3D reconstruction backbones. Experiments demonstrate that AnyStyle improves style controllability over prior feed-forward stylization methods while preserving high-quality geometric reconstruction. A user study further confirms that AnyStyle achieves superior stylization quality compared to an existing state-of-the-art approach. Repository: this https URL.
zh

[CV-84] CLS : Tightly Coupled Language Text Spotter

【速读】:该论文旨在解决场景文本识别(scene text spotting)中因文本实例短小、碎片化或视觉模糊而导致的识别困难问题。现有方法主要依赖视觉线索,隐式建模局部字符依赖关系,但忽略了外部语言知识的潜在价值。解决方案的关键在于提出一种端到端的文本检测与识别模型TiCLS,其核心创新是引入一个语言学解码器(linguistic decoder),显式融合来自字符级预训练语言模型(pretrained language model, PLM)的外部语言知识,并与视觉特征进行交互,从而提升对模糊或碎片化文本的鲁棒性识别能力。该设计使模型在不牺牲性能的前提下,有效利用语言先验信息,显著优于此前依赖纯视觉特征或未对齐词粒度的语言模型方法。

链接: https://arxiv.org/abs/2602.04030
作者: Leeje Jang,Yijun Lin,Yao-Yi Chiang,Jerod Weinman
机构: University of Minnesota (明尼苏达大学); Grinnell College (格里内尔学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Scene text spotting aims to detect and recognize text in real-world images, where instances are often short, fragmented, or visually ambiguous. Existing methods primarily rely on visual cues and implicitly capture local character dependencies, but they overlook the benefits of external linguistic knowledge. Prior attempts to integrate language models either adapt language modeling objectives without external knowledge or apply pretrained models that are misaligned with the word-level granularity of scene text. We propose TiCLS, an end-to-end text spotter that explicitly incorporates external linguistic knowledge from a character-level pretrained language model. TiCLS introduces a linguistic decoder that fuses visual and linguistic features, yet can be initialized by a pretrained language model, enabling robust recognition of ambiguous or fragmented text. Experiments on ICDAR 2015 and Total-Text demonstrate that TiCLS achieves state-of-the-art performance, validating the effectiveness of PLM-guided linguistic integration for scene text spotting.
zh

[CV-85] PromptSplit: Revealing Prompt-Level Disagreement in Generative Models

【速读】:该论文旨在解决生成式 AI 模型在不同提示(prompt)下行为差异难以量化与识别的问题,尤其关注如何系统性地检测和分析多个生成模型间因提示变化而产生的分歧。其解决方案的关键在于提出 PromptSplit 框架,通过构建提示与输出的张量积嵌入(tensor-product embeddings)形成联合表示,并计算核协方差矩阵;进而利用加权差异矩阵的特征空间来定位提示驱动的行为差异方向。为提升可扩展性,该方法采用随机投影近似,将复杂度降至 O(nr2+r3)O(nr^2 + r^3),并提供理论保证:近似所得特征结构与全维结果的期望偏差为 O(1/r2)O(1/r^2),从而实现高效且可解释的模型行为对比分析。

链接: https://arxiv.org/abs/2602.04009
作者: Mehdi Lotfian,Mohammad Jalali,Farzan Farnia
机构: The Chinese University of Hong Kong (香港中文大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Prompt-guided generative AI models have rapidly expanded across vision and language domains, producing realistic and diverse outputs from textual inputs. The growing variety of such models, trained with different data and architectures, calls for principled methods to identify which types of prompts lead to distinct model behaviors. In this work, we propose PromptSplit, a kernel-based framework for detecting and analyzing prompt-dependent disagreement between generative models. For each compared model pair, PromptSplit constructs a joint prompt–output representation by forming tensor-product embeddings of the prompt and image (or text) features, and then computes the corresponding kernel covariance matrix. We utilize the eigenspace of the weighted difference between these matrices to identify the main directions of behavioral difference across prompts. To ensure scalability, we employ a random-projection approximation that reduces computational complexity to O(nr^2 + r^3) for projection dimension r . We further provide a theoretical analysis showing that this approximation yields an eigenstructure estimate whose expected deviation from the full-dimensional result is bounded by O(1/r^2) . Experiments across text-to-image, text-to-text, and image-captioning settings demonstrate that PromptSplit accurately detects ground-truth behavioral differences and isolates the prompts responsible, offering an interpretable tool for detecting where generative models disagree.
zh

[CV-86] Efficient Long-Horizon Vision-Language-Action Models via Static-Dynamic Disentanglement

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人控制中面临的两大挑战:一是长时程上下文长度受限,二是由于二次注意力复杂度和参数量大导致的推理效率低下。其核心解决方案是提出SD-VLA框架,通过将视觉输入解耦为多层级的静态与动态token,实现两个关键改进:首先,仅保留单个静态token副本跨越帧间共享,显著缩短上下文长度;其次,引入轻量级重缓存门(recache gate)机制,在必要时更新静态token的键值(Key-Value, KV)缓存,从而实现高效的多帧信息融合与推理加速。这一设计在保持任务性能的同时大幅提升了推理速度和可扩展性。

链接: https://arxiv.org/abs/2602.03983
作者: Weikang Qiu,Tinglin Huang,Aosong Feng,Rex Ying
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for generalist robotic control. Built upon vision-language model (VLM) architectures, VLAs predict actions conditioned on visual observations and language instructions, achieving strong performance and generalization across tasks. However, VLAs face two major challenges: limited long-horizon context and inefficient inference due to the quadratic attention complexity and large parameter counts. Our work is motivated by the observation that much of the visual information in a trajectory remains static across timesteps (e.g., the background). Leveraging this property, we propose SD-VLA, a framework that disentangles visual inputs into multi-level static and dynamic tokens, which enables (1) retaining a single copy of static tokens across frames to significantly reduce context length, and (2) reusing the key-value (KV) cache of static tokens through a lightweight recache gate that updates only when necessary. This design enables efficient multi-frame integration and efficient inference. In addition, we introduce a new benchmark that more effectively evaluates the long-horizon temporal dependency modeling ability of VLAs. Experimental results show that our approach outperforms baselines on this benchmark by 39.8% absolute improvement in success rate, and achieves a 3.9% gain on the SimplerEnv benchmark. Moreover, SD-VLA delivers a 2.26x inference speedup over the base VLA model on the same benchmark, enabling faster and more practical real-world deployment.
zh

[CV-87] VLS: Steering Pretrained Robot Policies via Vision-Language Models

【速读】:该论文旨在解决预训练生成式机器人策略(如扩散模型或流匹配模型)在测试时遭遇分布外场景(如靠近障碍物、支撑面偏移或轻微杂乱环境)下性能显著下降的问题。此类失败通常并非由于缺乏运动技能,而是源于模仿学习在训练-测试分布偏移下的局限性——即动作生成与训练特定的空间配置和任务描述紧密耦合,导致策略无法灵活适应新场景。解决方案的关键在于提出一种无需重新训练的推理时自适应框架 Vision-Language Steering (VLS),其核心思想是将适应过程建模为推理时的控制问题:利用视觉-语言模型(Vision-Language Model, VLM)合成可微分轨迹奖励函数,引导预训练扩散或流匹配策略的去噪采样过程,从而生成满足测试时空间约束和任务语义要求的动作轨迹,实现对冻结策略的高效、无参数调整的实时适配。

链接: https://arxiv.org/abs/2602.03973
作者: Shuo Liu,Ishneet Sukhvinder Singh,Yiqing Xu,Jiafei Duan,Ranjay Krishna
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 Pages, Project page: this https URL

点击查看摘要

Abstract:Why do pretrained diffusion or flow-matching policies fail when the same task is performed near an obstacle, on a shifted support surface, or amid mild clutter? Such failures rarely reflect missing motor skills; instead, they expose a limitation of imitation learning under train-test shifts, where action generation is tightly coupled to training-specific spatial configurations and task specifications. Retraining or fine-tuning to address these failures is costly and conceptually misaligned, as the required behaviors already exist but cannot be selectively adapted at test time. We propose Vision-Language Steering (VLS), a training-free framework for inference-time adaptation of frozen generative robot policies. VLS treats adaptation as an inference-time control problem, steering the sampling process of a pretrained diffusion or flow-matching policy in response to out-of-distribution observation-language inputs without modifying policy parameters. By leveraging vision-language models to synthesize trajectory-differentiable reward functions, VLS guides denoising toward action trajectories that satisfy test-time spatial and task requirements. Across simulation and real-world evaluations, VLS consistently outperforms prior steering methods, achieving a 31% improvement on CALVIN and a 13% gain on LIBERO-PRO. Real-world deployment on a Franka robot further demonstrates robust inference-time adaptation under test-time spatial and semantic shifts. Project page: this https URL
zh

[CV-88] Representation Geometry as a Diagnostic for Out-of-Distribution Robustness

【速读】:该论文旨在解决在缺乏目标域标签的情况下,如何有效监测和优化模型在分布偏移(distribution shift)下的鲁棒性问题。现有方法多依赖训练时的正则化策略或低阶表示统计量,但难以准确预测模型在分布外(out-of-distribution, OOD)场景下的性能差异。论文提出了一种基于嵌入几何结构的诊断框架,其关键在于利用类条件互k近邻图(class-conditional mutual k-nearest-neighbor graphs)提取两个互补的不变量:一是基于归一化拉普拉斯矩阵的简化对数行列式作为全局谱复杂度代理指标,二是基于Ollivier–Ricci曲率的局部平滑性度量。实验证明,较低的谱复杂度和较高的平均曲率能一致地预测更强的OOD准确率,且这些信号反映的是有意义的表示结构而非表面嵌入统计特性,从而实现了无需标签的可解释鲁棒性诊断与无监督检查点选择。

链接: https://arxiv.org/abs/2602.03951
作者: Ali Zia,Farid Hazratian
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Differential Geometry (math.DG); General Topology (math.GN)
备注:

点击查看摘要

Abstract:Robust generalization under distribution shift remains difficult to monitor and optimize in the absence of target-domain labels, as models with similar in-distribution accuracy can exhibit markedly different out-of-distribution (OOD) performance. While prior work has focused on training-time regularization and low-order representation statistics, little is known about whether the geometric structure of learned embeddings provides reliable post-hoc signals of robustness. We propose a geometry-based diagnostic framework that constructs class-conditional mutual k-nearest-neighbor graphs from in-distribution embeddings and extracts two complementary invariants: a global spectral complexity proxy based on the reduced log-determinant of the normalized Laplacian, and a local smoothness measure based on Ollivier–Ricci curvature. Across multiple architectures, training regimes, and corruption benchmarks, we find that lower spectral complexity and higher mean curvature consistently predict stronger OOD accuracy across checkpoints. Controlled perturbations and topological analyses further show that these signals reflect meaningful representation structure rather than superficial embedding statistics. Our results demonstrate that representation geometry enables interpretable, label-free robustness diagnosis and supports reliable unsupervised checkpoint selection under distribution shift.
zh

[CV-89] Entropy Reveals Block Importance in Masked Self-Supervised Vision Transformers

【速读】:该论文旨在解决掩码自监督视觉Transformer(Masked Self-Supervised Vision Transformers)在资源受限场景下部署困难及高效迁移学习挑战,核心问题是:是否所有Transformer块对下游任务性能同等重要?解决方案的关键在于发现预训练块权重的信息熵(information entropy)与通过迭代移除块并微调获得的Oracle敏感性之间存在强相关性,从而提出Gardener——一种无需数据、一次性完成的块级剪枝方法,仅依赖信息论测量即可准确识别冗余块,实现高效模型压缩且保持优异迁移性能。

链接: https://arxiv.org/abs/2602.03918
作者: Peihao Xiang,Kaida Wu,Ou Bai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Masked self-supervised vision transformers have become a dominant pretraining paradigm, yet their substantial model size poses significant challenges for resource-constrained deployment and efficient transfer learning. A fundamental question remains: are all transformer blocks equally important for downstream performance? In this paper, we show that block importance in masked self-supervised vision transformers can be accurately estimated without access to any data. Our key finding is that the information entropy of pretrained block weights strongly correlates with oracle sensitivity obtained via iterative block removal and finetuning. This observation enables Gardener, a data-free, one-shot, block-level pruning principle that identifies redundant blocks through simple information-theoretic measurements. We evaluate Gardener on VideoMAE-B across multiple pruning ratios and downstream video recognition benchmarks. Despite its negligible computational overhead, Gardener consistently matches or outperforms existing data-free pruning baselines and closely approaches sensitivity-based pruning. Remarkably, even after pruning up to 91.7% of blocks, the pruned model retains competitive transfer performance. Our results reveal substantial block-level redundancy in masked self-supervised vision transformers and demonstrate that information-theoretic analysis offers a principled and efficient pathway for model compression and resource-efficient transfer learning.
zh

[CV-90] Phaedra: Learning High-Fidelity Discrete Tokenization for the Physical Science

【速读】:该论文旨在解决现有图像分词器(tokenizer)在科学图像(scientific images)中无法有效保留物理与光谱特性的问题,尤其是在处理偏微分方程(PDE)相关数据时,传统分词器难以同时捕捉精细细节和精确数值幅度。其解决方案的关键在于提出Phaedra,一种受经典形状增益量化(shape-gain quantization)和本征正交分解(proper orthogonal decomposition, POD)启发的新型分词方法,能够更准确地重构PDE驱动的数据,并在多个PDE数据集上实现一致的性能提升,同时展现出对不同分布外任务(包括已知和未知PDE以及真实地球观测与气象数据)的强大泛化能力。

链接: https://arxiv.org/abs/2602.03915
作者: Levi Lingsch,Georgios Kissas,Johannes Jakubik,Siddhartha Mishra
机构: ETH AI Center; IBM Research Europe; Seminar for Applied Mathematics, ETH Zurich; Swiss Data Science Center, ETH Zurich
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
备注: 57 pages, 27 figures

点击查看摘要

Abstract:Tokens are discrete representations that allow modern deep learning to scale by transforming high-dimensional data into sequences that can be efficiently learned, generated, and generalized to new tasks. These have become foundational for image and video generation and, more recently, physical simulation. As existing tokenizers are designed for the explicit requirements of realistic visual perception of images, it is necessary to ask whether these approaches are optimal for scientific images, which exhibit a large dynamic range and require token embeddings to retain physical and spectral properties. In this work, we investigate the accuracy of a suite of image tokenizers across a range of metrics designed to measure the fidelity of PDE properties in both physical and spectral space. Based on the observation that these struggle to capture both fine details and precise magnitudes, we propose Phaedra, inspired by classical shape-gain quantization and proper orthogonal decomposition. We demonstrate that Phaedra consistently improves reconstruction across a range of PDE datasets. Additionally, our results show strong out-of-distribution generalization capabilities to three tasks of increasing complexity, namely known PDEs with different conditions, unknown PDEs, and real-world Earth observation and weather data.
zh

[CV-91] Entropy-Aware Structural Alignment for Zero-Shot Handwritten Chinese Character Recognition

【速读】:该论文旨在解决零样本手写汉字识别(Zero-shot Handwritten Chinese Character Recognition, HCCR)中因字符被视为扁平的部件序列而导致的语义信息丢失问题,特别是忽略了部件间的层次结构拓扑关系以及不同组件的信息密度差异。其解决方案的关键在于提出一种熵感知的结构对齐网络(Entropy-Aware Structural Alignment Network),通过信息论建模弥合视觉-语义鸿沟:首先引入信息熵先验(Information Entropy Prior)动态调节位置嵌入,以乘法交互方式突出判别性强的根部部件;其次构建双视角部件树(Dual-View Radical Tree)提取多粒度结构特征,并通过自适应Sigmoid门控网络融合全局布局与局部空间角色信息;最后设计Top-K语义特征融合机制,利用语义邻域中心点增强解码过程,从而在特征层面实现一致性校正,显著提升识别准确率与数据效率。

链接: https://arxiv.org/abs/2602.03913
作者: Qiuming Luo,Tao Zeng,Feng Li,Heming Liu,Rui Mao,Chang Kong
机构: Shenzhen Polytechnic University (深圳职业技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 37 pages, 8 figures

点击查看摘要

Abstract:Zero-shot Handwritten Chinese Character Recognition (HCCR) aims to recognize unseen characters by leveraging radical-based semantic compositions. However, existing approaches often treat characters as flat radical sequences, neglecting the hierarchical topology and the uneven information density of different components. To address these limitations, we propose an Entropy-Aware Structural Alignment Network that bridges the visual-semantic gap through information-theoretic modeling. First, we introduce an Information Entropy Prior to dynamically modulate positional embeddings via multiplicative interaction, acting as a saliency detector that prioritizes discriminative roots over ubiquitous components. Second, we construct a Dual-View Radical Tree to extract multi-granularity structural features, which are integrated via an adaptive Sigmoid-based gating network to encode both global layout and local spatial roles. Finally, a Top-K Semantic Feature Fusion mechanism is devised to augment the decoding process by utilizing the centroid of semantic neighbors, effectively rectifying visual ambiguities through feature-level consensus. Extensive experiments demonstrate that our method establishes new state-of-the-art performance, significantly outperforming existing CLIP-based baselines in the challenging zero-shot setting. Furthermore, the framework exhibits exceptional data efficiency, demonstrating rapid adaptability with minimal support samples.
zh

[CV-92] Beyond the Vehicle: Cooperative Localization by Fusing Point Clouds for GPS-Challenged Urban Scenarios

【速读】:该论文旨在解决城市环境中GPS信号不可靠导致的车辆精确定位难题。其解决方案的关键在于提出一种协作式多传感器与多模态定位方法,通过融合车对车(V2V)和车对基础设施(V2I)系统的数据,并将其与基于点云配准的同步定位与地图构建(SLAM)算法相结合,利用车载激光雷达(LiDAR)、立体相机以及路口部署的传感器所生成的多源点云数据,借助基础设施共享信息显著提升复杂、高噪声GPS环境下车辆定位的精度与鲁棒性。

链接: https://arxiv.org/abs/2602.03908
作者: Kuo-Yi Chao,Ralph Rasshofer,Alois Christian Knoll
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 2 figures, Driving the Future Symposium 2025

点击查看摘要

Abstract:Accurate vehicle localization is a critical challenge in urban environments where GPS signals are often unreliable. This paper presents a cooperative multi-sensor and multi-modal localization approach to address this issue by fusing data from vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) systems. Our approach integrates cooperative data with a point cloud registration-based simultaneous localization and mapping (SLAM) algorithm. The system processes point clouds generated from diverse sensor modalities, including vehicle-mounted LiDAR and stereo cameras, as well as sensors deployed at intersections. By leveraging shared data from infrastructure, our method significantly improves localization accuracy and robustness in complex, GPS-noisy urban scenarios.
zh

[CV-93] HY3D-Bench: Generation of 3D Assets

【速读】:该论文旨在解决当前3D内容生成领域中存在的数据处理瓶颈问题,即高质量、结构化和多样化的3D数据资源匮乏,限制了生成式AI(Generative AI)在3D感知、机器人技术和数字内容创作等方向的进一步发展。其解决方案的关键在于构建一个名为HY3D-Bench的开源生态系统,包含三个核心贡献:首先,从大规模数据源中提炼出25万件高保真度3D对象,提供可直接用于训练的闭合网格(watertight meshes)和多视角渲染图像;其次,引入结构化的部件级分解(part-level decomposition),实现细粒度感知与可控编辑;最后,通过可扩展的AIGC合成流程弥合真实分布差距,新增12.5万件合成资产以增强长尾类别的多样性。该框架经由Hunyuan3D-2.1-Small模型验证,显著提升了数据可用性与模型性能,推动3D生成技术的标准化与普及。

链接: https://arxiv.org/abs/2602.03907
作者: Team Hunyuan3D:Bowen Zhang,Chunchao Guo,Dongyuan Guo,Haolin Liu,Hongyu Yan,Huiwen Shi,Jiaao Yu,Jiachen Xu,Jingwei Huang,Kunhong Li,Lifu Wang,Linus,Penghao Wang,Qingxiang Lin,Ruining Tang,Xianghui Yang,Yang Li,Yirui Guan,Yunfei Zhao,Yunhan Yang,Zeqiang Lai,Zhihao Liang,Zibo Zhao
机构: Tencent Hunyuan3D
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Authors are listed alphabetically by the first name

点击查看摘要

Abstract:While recent advances in neural representations and generative models have revolutionized 3D content creation, the field remains constrained by significant data processing bottlenecks. To address this, we introduce HY3D-Bench, an open-source ecosystem designed to establish a unified, high-quality foundation for 3D generation. Our contributions are threefold: (1) We curate a library of 250k high-fidelity 3D objects distilled from large-scale repositories, employing a rigorous pipeline to deliver training-ready artifacts, including watertight meshes and multi-view renderings; (2) We introduce structured part-level decomposition, providing the granularity essential for fine-grained perception and controllable editing; and (3) We bridge real-world distribution gaps via a scalable AIGC synthesis pipeline, contributing 125k synthetic assets to enhance diversity in long-tail categories. Validated empirically through the training of Hunyuan3D-2.1-Small, HY3D-Bench democratizes access to robust data resources, aiming to catalyze innovation across 3D perception, robotics, and digital content creation.
zh

[CV-94] Benchmarking Bias Mitigation Toward Fairness Without Harm from Vision to LVLMs ICLR26

【速读】:该论文旨在解决当前机器学习模型在真实世界数据上训练时继承并放大对特定社会群体的偏见问题,尤其在视觉模型与大规模视觉-语言模型(LVLMs)中,由于数据集异质性、公平性指标不一致、视觉与多模态模型孤立评估以及超参数调优不足,导致bias mitigation方法的有效性难以客观比较。其解决方案的关键在于提出NH-Fair——一个统一的公平性基准,涵盖标准化的数据、指标和训练协议,并系统性地开展经验风险最小化(ERM)超参数调优研究,识别出对性能与偏差影响显著的训练选择,从而为实践者提供可操作的指导以缩小昂贵的超参数搜索空间;同时发现多数去偏方法无法稳定优于经过良好调优的ERM基线,而一种组合式数据增强方法则能持续提升公平性且不牺牲性能,成为更具实用性的策略。

链接: https://arxiv.org/abs/2602.03895
作者: Xuwei Tan,Ziyu Hu,Xueru Zhang
机构: The Ohio State University (俄亥俄州立大学); Stevens Institute of Technology (史蒂文斯理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at ICLR 26

点击查看摘要

Abstract:Machine learning models trained on real-world data often inherit and amplify biases against certain social groups, raising urgent concerns about their deployment at scale. While numerous bias mitigation methods have been proposed, comparing the effectiveness of bias mitigation methods remains difficult due to heterogeneous datasets, inconsistent fairness metrics, isolated evaluation of vision versus multi-modal models, and insufficient hyperparameter tuning that undermines fair comparisons. We introduce NH-Fair, a unified benchmark for fairness without harm that spans both vision models and large vision-language models (LVLMs) under standardized data, metrics, and training protocols, covering supervised and zero-shot regimes. Our key contributions are: (1) a systematic ERM tuning study that identifies training choices with large influence on both utility and disparities, yielding empirically grounded guidelines to help practitioners reduce expensive hyperparameter tuning space in achieving strong fairness and accuracy; (2) evidence that many debiasing methods do not reliably outperform a well-tuned ERM baseline, whereas a composite data-augmentation method consistently delivers parity gains without sacrificing utility, emerging as a promising practical strategy. (3) an analysis showing that while LVLMs achieve higher average accuracy, they still exhibit subgroup disparities, and gains from scaling are typically smaller than those from architectural or training-protocol choices. NH-Fair provides a reproducible, tuning-aware pipeline for rigorous, harm-aware fairness evaluation.
zh

[CV-95] Vision Transformers for Zero-Shot Clustering of Animal Images: A Comparative Benchmarking Study

【速读】:该论文旨在解决生态学研究中动物图像人工标注效率低下、限制生物多样性监测规模的问题。其核心解决方案是利用先进的视觉Transformer(Vision Transformer, ViT)基础模型,直接将大量未标注的动物图像聚类至物种级别,从而减少对人工标注的依赖。关键在于通过系统性 benchmarking 框架评估五种 ViT 模型结合五种降维技术与四种聚类算法(包括两种监督和两种无监督方法),发现使用 DINOv3 嵌入特征配合 t-SNE 降维与监督层次聚类可实现近完美的物种级聚类效果(V-measure: 0.958),且无监督方法亦表现优异(V-measure: 0.943),同时能识别出具有生态意义的种内变异模式(如性别、年龄及毛色差异)。

链接: https://arxiv.org/abs/2602.03894
作者: Hugo Markoff,Stefan Hein Bengtson,Michael Ørsted
机构: Aalborg University (奥尔堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Manual labeling of animal images remains a significant bottleneck in ecological research, limiting the scale and efficiency of biodiversity monitoring efforts. This study investigates whether state-of-the-art Vision Transformer (ViT) foundation models can reduce thousands of unlabeled animal images directly to species-level clusters. We present a comprehensive benchmarking framework evaluating five ViT models combined with five dimensionality reduction techniques and four clustering algorithms, two supervised and two unsupervised, across 60 species (30 mammals and 30 birds), with each test using a random subset of 200 validated images per species. We investigate when clustering succeeds at species-level, where it fails, and whether clustering within the species-level reveals ecologically meaningful patterns such as sex, age, or phenotypic variation. Our results demonstrate near-perfect species-level clustering (V-measure: 0.958) using DINOv3 embeddings with t-SNE and supervised hierarchical clustering methods. Unsupervised approaches achieve competitive performance (0.943) while requiring no prior species knowledge, rejecting only 1.14% of images as outliers requiring expert review. We further demonstrate robustness to realistic long-tailed distributions of species and show that intentional over-clustering can reliably extract intra-specific variation including age classes, sexual dimorphism, and pelage differences. We introduce an open-source benchmarking toolkit and provide recommendations for ecologists to select appropriate methods for sorting their specific taxonomic groups and data.
zh

[CV-96] GPAIR: Gaussian-Kernel-Based Ultrafast 3D Photoacoustic Iterative Reconstruction

【速读】:该论文旨在解决光声 computed tomography (PACT) 中迭代重建(IR)算法计算耗时过长的问题,尤其是在大规模三维(3D)成像场景下,传统IR方法需数秒至数十分钟,严重限制了其临床实用性。解决方案的关键在于提出一种基于高斯核的超快3D光声迭代重建方法(Gaussian-kernel-based Ultrafast 3D Photoacoustic Iterative Reconstruction, GPAIR),其核心创新包括:利用连续各向同性高斯核对传统空间网格进行重构,推导出压力波的解析闭式表达式,并通过GPU加速的可微分Triton算子实现高效计算;该方法在动物实验中实现了包含840万体素的3D目标亚秒级重建速度,显著提升了PACT的实时性和可扩展性,推动其向临床应用迈进。

链接: https://arxiv.org/abs/2602.03893
作者: Yibing Wang,Shuang Li,Tingting Huang,Yu Zhang,Chulhong Kim,Seongwook Choi,Changhui Li
机构: Peking University (北京大学); Samsung Electronics (三星电子); KAIST (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although the iterative reconstruction (IR) algorithm can substantially correct reconstruction artifacts in photoacoustic (PA) computed tomography (PACT), it suffers from long reconstruction times, especially for large-scale three-dimensional (3D) imaging in which IR takes hundreds of seconds to hours. The computing burden severely limits the practical applicability of IR algorithms. In this work, we proposed an ultrafast IR method for 3D PACT, called Gaussian-kernel-based Ultrafast 3D Photoacoustic Iterative Reconstruction (GPAIR), which achieves orders-of-magnitude acceleration in computing. GPAIR transforms traditional spatial grids with continuous isotropic Gaussian kernels. By deriving analytical closed-form expression for pressure waves and implementing powerful GPU-accelerated differentiable Triton operators, GPAIR demonstrates extraordinary ultrafast sub-second reconstruction speed for 3D targets containing 8.4 million voxels in animal experiments. This revolutionary ultrafast image reconstruction enables near-real-time large-scale 3D PA reconstruction, significantly advancing 3D PACT toward clinical applications.
zh

[CV-97] Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation

【速读】:该论文旨在解决语言引导的音视频分割(Ref-AVS)中缺乏对分割掩码质量进行有效评估的问题,尤其是在推理阶段无法依赖真实标签的情况下。现有方法通常仅生成分割掩码而忽视其质量诊断,导致难以识别错误类型并采取改进措施。解决方案的关键在于提出一个新的任务——掩码质量评估(MQA-RefAVS),并构建了包含多种几何与语义错误模式的基准数据集 MQ-RAVSBench,同时设计了一个基于多模态大语言模型(MLLM)的审计器 MQ-Auditor,该模型能够联合分析音频、视觉和文本信息,对候选掩码进行定量和定性的质量评估,从而实现对分割结果的可解释性诊断与可控优化。

链接: https://arxiv.org/abs/2602.03892
作者: Jinxing Zhou,Yanghao Zhou,Yaoting Wang,Zongyan Han,Jiaqi Ma,Henghui Ding,Rao Muhammad Anwer,Hisham Cholakkal
机构: MBZUAI; National University of Singapore; Fudan University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Language-referred audio-visual segmentation (Ref-AVS) aims to segment target objects described by natural language by jointly reasoning over video, audio, and text. Beyond generating segmentation masks, providing rich and interpretable diagnoses of mask quality remains largely underexplored. In this work, we introduce Mask Quality Assessment in the Ref-AVS context (MQA-RefAVS), a new task that evaluates the quality of candidate segmentation masks without relying on ground-truth annotations as references at inference time. Given audio-visual-language inputs and each provided segmentation mask, the task requires estimating its IoU with the unobserved ground truth, identifying the corresponding error type, and recommending an actionable quality-control decision. To support this task, we construct MQ-RAVSBench, a benchmark featuring diverse and representative mask error modes that span both geometric and semantic issues. We further propose MQ-Auditor, a multimodal large language model (MLLM)-based auditor that explicitly reasons over multimodal cues and mask information to produce quantitative and qualitative mask quality assessments. Extensive experiments demonstrate that MQ-Auditor outperforms strong open-source and commercial MLLMs and can be integrated with existing Ref-AVS systems to detect segmentation failures and support downstream segmentation improvement. Data and codes will be released at this https URL.
zh

[CV-98] 4DPC2hat: Towards Dynamic Point Cloud Understanding with Failure-Aware Bootstrapping

【速读】:该论文旨在解决当前多模态大语言模型(MLLM)在动态点云序列理解上的不足,尤其是针对时序运动建模与跨模态语义对齐的挑战。现有方法主要聚焦于静态点云,缺乏大规模标注数据和有效的时序推理机制,导致对动作、时空关系等动态信息的理解能力有限。其解决方案的关键在于:首先构建首个面向动态点云的大型跨模态数据集4DPC²hat-200K,包含拓扑一致的4D点云序列与多层次标注的问答对;其次提出一种基于Mamba架构增强的时序推理MLLM,以捕捉长距离依赖和动态模式;最后引入故障感知的自举学习策略,通过迭代识别模型缺陷并生成针对性QA监督信号,持续强化特定推理能力。

链接: https://arxiv.org/abs/2602.03890
作者: Xindan Zhang,Weilong Yan,Yufei Shi,Xuerui Qiu,Tao He,Ying Li,Ming Li,Hehe Fan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Point clouds provide a compact and expressive representation of 3D objects, and have recently been integrated into multimodal large language models (MLLMs). However, existing methods primarily focus on static objects, while understanding dynamic point cloud sequences remains largely unexplored. This limitation is mainly caused by the lack of large-scale cross-modal datasets and the difficulty of modeling motions in spatio-temporal contexts. To bridge this gap, we present 4DPC ^2 hat, the first MLLM tailored for dynamic point cloud understanding. To this end, we construct a large-scale cross-modal dataset 4DPC ^2 hat-200K via a meticulous two-stage pipeline consisting of topology-consistent 4D point construction and two-level captioning. The dataset contains over 44K dynamic object sequences, 700K point cloud frames, and 200K curated question-answer (QA) pairs, supporting inquiries about counting, temporal relationship, action, spatial relationship, and appearance. At the core of the framework, we introduce a Mamba-enhanced temporal reasoning MLLM to capture long-range dependencies and dynamic patterns among a point cloud sequence. Furthermore, we propose a failure-aware bootstrapping learning strategy that iteratively identifies model deficiencies and generates targeted QA supervision to continuously strengthen corresponding reasoning capabilities. Extensive experiments demonstrate that our 4DPC ^2 hat significantly improves action understanding and temporal reasoning compared with existing models, establishing a strong foundation for 4D dynamic point cloud understanding.
zh

[CV-99] Explainable Computer Vision Framework for Automated Pore Detection and Criticality Assessment in Additive Manufacturing

【速读】:该论文旨在解决增材制造(Additive Manufacturing, AM)构件中内部孔隙缺陷的自动检测与关键性评估问题,尤其针对现有自动化检测方法缺乏可解释性、无法为工程师提供物理机制依据的痛点。解决方案的关键在于构建了一个可解释的计算机视觉框架,通过灰度切片重建三维断层扫描数据,结合阈值分割与连通域分析提取500个独立孔隙,并利用几何描述符(如尺寸、长宽比、分布范围及相对于试样边界的空间位置)和基于百分位数的欧氏距离准则构建孔隙相互作用网络(共24,950条连接)。随后采用机器学习模型预测孔隙关键性评分,并借助SHAP(Shapley Additive Explanations)分析量化各特征贡献度,发现归一化表面距离是主导因素,其重要性远超其他参数,且孔隙靠近边界显著提升失效风险,揭示了边界驱动的失效机制。该框架实现了缺陷评估的透明化与可解释性,为工艺优化和质量控制提供明确指导。

链接: https://arxiv.org/abs/2602.03883
作者: Akshansh Mishra,Rakesh Morisetty
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
备注: 6 figures

点击查看摘要

Abstract:Internal porosity remains a critical defect mode in additively manufactured components, compromising structural performance and limiting industrial adoption. Automated defect detection methods exist but lack interpretability, preventing engineers from understanding the physical basis of criticality predictions. This study presents an explainable computer vision framework for pore detection and criticality assessment in three-dimensional tomographic volumes. Sequential grayscale slices were reconstructed into volumetric datasets, and intensity-based thresholding with connected component analysis identified 500 individual pores. Each pore was characterized using geometric descriptors including size, aspect ratio, extent, and spatial position relative to the specimen boundary. A pore interaction network was constructed using percentile-based Euclidean distance criteria, yielding 24,950 inter-pore connections. Machine learning models predicted pore criticality scores from extracted features, and SHAP analysis quantified individual feature contributions. Results demonstrate that normalized surface distance dominates model predictions, contributing more than an order of magnitude greater importance than all other descriptors. Pore size provides minimal influence, while geometric parameters show negligible impact. The strong inverse relationship between surface proximity and criticality reveals boundary-driven failure mechanisms. This interpretable framework enables transparent defect assessment and provides actionable insights for process optimization and quality control in additive manufacturing.
zh

[CV-100] PriorProbe: Recovering Individual-Level Priors for Personalizing Neural Networks in Facial Expression Recognition

【速读】:该论文旨在解决如何准确提取个体层面的认知先验(cognitive priors)以个性化神经网络的问题,现有方法要么无法唯一识别这些先验,要么引入系统性偏差。其解决方案的关键在于提出 PriorProbe,这是一种基于“与人协同的马尔可夫链蒙特卡洛”(Markov Chain Monte Carlo with People)的新颖先验 elicitation 方法,能够恢复细粒度且个体特异性的认知先验;通过在面部表情识别任务中对个体参与者应用 PriorProbe,并将恢复的先验整合进最先进的神经网络,实验表明该方法显著提升了模型对模糊刺激的个体分类预测性能,同时保持了网络对真实标签的推理能力,从而证明 PriorProbe 是一种通用且可解释的深度神经网络个性化框架。

链接: https://arxiv.org/abs/2602.03882
作者: Haijiang Yan,Nick Chater,Adam Sanborn
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Incorporating individual-level cognitive priors offers an important route to personalizing neural networks, yet accurately eliciting such priors remains challenging: existing methods either fail to uniquely identify them or introduce systematic biases. Here, we introduce PriorProbe, a novel elicitation approach grounded in Markov Chain Monte Carlo with People that recovers fine-grained, individual-specific priors. Focusing on a facial expression recognition task, we apply PriorProbe to individual participants and test whether integrating the recovered priors with a state-of-the-art neural network improves its ability to predict an individual’s classification on ambiguous stimuli. The PriorProbe-derived priors yield substantial performance gains, outperforming both the neural network alone and alternative sources of priors, while preserving the network’s inference on ground-truth labels. Together, these results demonstrate that PriorProbe provides a general and interpretable framework for personalizing deep neural networks.
zh

[CV-101] DiGAN: Diffusion-Guided Attention Network for Early Alzheimers Disease Detection

【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s disease, AD)早期诊断中因结构脑变化在前驱期表现隐匿且时间分布不规律而导致的挑战,尤其是现有深度学习方法对大规模纵向数据依赖性强、难以建模真实临床数据中的时间连续性和模态不规则性的问题。解决方案的关键在于提出扩散引导注意力网络(Diffusion-Guided Attention Network, DiGAN),其核心创新是将潜在扩散模型与注意力引导的卷积网络相结合:扩散模型从有限训练数据中合成逼真的纵向神经影像轨迹,增强时间上下文并提升对不规则随访间隔的鲁棒性;注意力-卷积层则有效捕捉区分认知正常人群与轻度认知障碍及主观认知下降人群的结构-时间判别特征。

链接: https://arxiv.org/abs/2602.03881
作者: Maxx Richard Rahman,Mostafa Hammouda,Wolfgang Maass
机构: University of Tübingen (图宾根大学); Institute for Advanced Study (高级研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Early diagnosis of Alzheimer’s disease (AD) remains a major challenge due to the subtle and temporally irregular progression of structural brain changes in the prodromal stages. Existing deep learning approaches require large longitudinal datasets and often fail to model the temporal continuity and modality irregularities inherent in real-world clinical data. To address these limitations, we propose the Diffusion-Guided Attention Network (DiGAN), which integrates latent diffusion modelling with an attention-guided convolutional network. The diffusion model synthesizes realistic longitudinal neuroimaging trajectories from limited training data, enriching temporal context and improving robustness to unevenly spaced visits. The attention-convolutional layer then captures discriminative structural–temporal patterns that distinguish cognitively normal subjects from those with mild cognitive impairment and subjective cognitive decline. Experiments on synthetic and ADNI datasets demonstrate that DiGAN outperforms existing state-of-the-art baselines, showing its potential for early-stage AD detection.
zh

[CV-102] ruKAN: Towards More Efficient Kolmogorov-Arnold Networks Using Truncated Power Functions

【速读】:该论文旨在解决Kolmogorov-Arnold Network (KAN) 在计算效率与模型表达能力之间存在的权衡问题,同时提升模型的可解释性。其解决方案的关键在于提出TruKAN架构:通过将原KAN中使用的B-spline基函数替换为基于k阶样条理论导出的一族截断幂函数(truncated power functions),在保持KAN高表达能力的同时显著提高精度和训练速度;此外,TruKAN每层结合截断幂项与多项式项,并支持共享或独立节点(knots)配置,从而增强模型的透明度与灵活性。实验表明,TruKAN在视觉任务中优于其他KAN变体,在准确率、计算效率和内存占用方面均具优势。

链接: https://arxiv.org/abs/2602.03879
作者: Ali Bayeh,Samira Sadaoui,Malek Mouhoub
机构: University of Regina (里贾纳大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 23 pages, 9 figures

点击查看摘要

Abstract:To address the trade-off between computational efficiency and adherence to Kolmogorov-Arnold Network (KAN) principles, we propose TruKAN, a new architecture based on the KAN structure and learnable activation functions. TruKAN replaces the B-spline basis in KAN with a family of truncated power functions derived from k-order spline theory. This change maintains the KAN’s expressiveness while enhancing accuracy and training time. Each TruKAN layer combines a truncated power term with a polynomial term and employs either shared or individual knots. TruKAN exhibits greater interpretability than other KAN variants due to its simplified basis functions and knot configurations. By prioritizing interpretable basis functions, TruKAN aims to balance approximation efficacy with transparency. We develop the TruKAN model and integrate it into an advanced EfficientNet-V2-based framework, which is then evaluated on computer vision benchmark datasets. To ensure a fair comparison, we develop various models: MLP-, KAN-, SineKAN and TruKAN-based EfficientNet frameworks and assess their training time and accuracy across small and deep architectures. The training phase uses hybrid optimization to improve convergence stability. Additionally, we investigate layer normalization techniques for all the models and assess the impact of shared versus individual knots in TruKAN. Overall, TruKAN outperforms other KAN models in terms of accuracy, computational efficiency and memory usage on the complex vision task, demonstrating advantages beyond the limited settings explored in prior KAN studies.
zh

[CV-103] Intellectual Property Protection for 3D Gaussian Splatting Assets: A Survey DATE

【速读】:该论文旨在解决3D Gaussian Splatting (3DGS) 在生成式 AI(Generative AI)时代下日益突出的知识产权(IP)保护问题,当前研究分散且缺乏对底层扰动机制、保护范式及鲁棒性挑战的系统性理解。其解决方案的关键在于提出首个针对3DGS IP保护的系统性综述,并构建一个自下而上的分析框架,从高斯基扰动机制、被动与主动保护范式以及生成式AI背景下鲁棒性威胁三个维度展开深入剖析,揭示了技术基础和鲁棒性刻画方面的研究空白,并指明了六个跨鲁棒性、效率与保护范式的未来研究方向,为实现可靠可信的3DGS资产IP保护提供理论支撑与实践路径。

链接: https://arxiv.org/abs/2602.03878
作者: Longjie Zhao,Ziming Hong,Jiaxin Huang,Runnan Chen,Mingming Gong,Tongliang Liu
机构: Sydney AI Centre, The University of Sydney (悉尼人工智能中心,悉尼大学); The University of Melbourne (墨尔本大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: A collection of relevant papers is summarized and will be continuously updated at \url{ this https URL }

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has become a mainstream representation for real-time 3D scene synthesis, enabling applications in virtual and augmented reality, robotics, and 3D content creation. Its rising commercial value and explicit parametric structure raise emerging intellectual property (IP) protection concerns, prompting a surge of research on 3DGS IP protection. However, current progress remains fragmented, lacking a unified view of the underlying mechanisms, protection paradigms, and robustness challenges. To address this gap, we present the first systematic survey on 3DGS IP protection and introduce a bottom-up framework that examines (i) underlying Gaussian-based perturbation mechanisms, (ii) passive and active protection paradigms, and (iii) robustness threats under emerging generative AI era, revealing gaps in technical foundations and robustness characterization and indicating opportunities for deeper investigation. Finally, we outline six research directions across robustness, efficiency, and protection paradigms, offering a roadmap toward reliable and trustworthy IP protection for 3DGS assets.
zh

[CV-104] WebAccessVL: Making an Accessible Web via Violation-Conditioned VLM

【速读】:该论文旨在解决网页内容可访问性合规问题,即自动识别并修正违反Web Content Accessibility Guidelines 2 (WCAG2) 的HTML代码,以提升网站对残障用户的友好程度。其解决方案的关键在于提出一种视觉-语言模型(Vision-Language Model, VLM),将网页HTML及其渲染图像作为输入,通过监督式图像条件程序合成任务学习生成符合WCAG2规范的修正后HTML;同时引入违规条件机制,利用WCAG2违规数量作为额外条件引导修正过程,从而显著降低每页平均违规数(从5.34降至0.44),且保持原始视觉效果不变。

链接: https://arxiv.org/abs/2602.03850
作者: Amber Yijia Zheng,Jae Joong Lee,Bedrich Benes,Raymond A. Yeh
机构: Purdue University (普渡大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present a vision-language model (VLM) that automatically edits website HTML to address Web Content Accessibility Guidelines 2 (WCAG2) violations. We formulate this as a supervised image-conditioned program synthesis task, where the model learns to correct HTML given the HTML and its rendering. We collected WebAccessVL, a new dataset with manually corrected accessibility violations, establishing paired training data. We then propose a violation-conditioned VLM that additionally conditions on the WCAG2 violation count to guide the correction process. Experiments demonstrate that our method effectively reduces the average number of violations from 5.34 to 0.44 per website, outperforming commercial LLM APIs (Gemini, GPT-5). A perceptual study confirms that our edited websites maintain the original visual appearance and content.
zh

[CV-105] An Improved Boosted DC Algorithm for Nonsmooth Functions with Applications in Image Recovery

【速读】:该论文旨在解决非光滑非凸问题中差分凸(DC)函数优化的收敛性与效率问题,特别是当DC分解中的第一个函数为非光滑时,传统增强型差分凸算法(BDCA)可能出现上升方向、无法执行单调线搜索的问题。解决方案的关键在于提出一种单调改进的增强型差分凸算法(IBDCA),适用于可表示为非光滑函数与光滑函数之差的特定类型DC程序;该方法通过构造合适的下降方向并保证目标函数值单调递减,确保生成序列的任意聚点均为问题的临界点,并在Kurdyka-Łojasiewicz(KL)性质下实现全局收敛及收敛速率分析,从而在图像恢复等应用中展现出优于经典DCA及其他先进DC方法的计算效率和迭代性能。

链接: https://arxiv.org/abs/2602.04237
作者: ZeYu Li,Te Qi,TieYong Zeng
机构: The Chinese University of Hong Kong (香港中文大学); Beijing Normal University (北京师范大学)
类目: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose a new approach to perform the boosted difference of convex functions algorithm (BDCA) on non-smooth and non-convex problems involving the difference of convex (DC) functions. The recently proposed BDCA uses an extrapolation step from the point computed by the classical DC algorithm (DCA) via a line search procedure in a descent direction to get an additional decrease of the objective function and accelerate the convergence of DCA. However, when the first function in DC decomposition is non-smooth, the direction computed by BDCA can be ascent and a monotone line search cannot be performed. In this work, we proposed a monotone improved boosted difference of convex functions algorithm (IBDCA) for certain types of non-smooth DC programs, namely those that can be formulated as the difference of a possibly non-smooth function and a smooth one. We show that any cluster point of the sequence generated by IBDCA is a critical point of the problem under consideration and that the corresponding objective value is monotonically decreasing and convergent. We also present the global convergence and the convergent rate under the Kurdyka-Lojasiewicz property. The applications of IBDCA in image recovery show the effectiveness of our proposed method. The corresponding numerical experiments demonstrate that our IBDCA outperforms DCA and other state-of-the-art DC methods in both computational time and number of iterations.
zh

[CV-106] MS-SCANet: A Multiscale Transformer-Based Architecture with Dual Attention for No-Reference Image Quality Assessment ICASSP2025

【速读】:该论文旨在解决无参考图像质量评估(No-Reference Image Quality Assessment, NR-IQA)中因单一尺度特征提取导致的细节捕捉不足问题,以及现有方法在多尺度特征融合时空间信息保持不充分、计算复杂度高的局限性。其解决方案的关键在于提出一种基于Transformer的多尺度空间通道注意力网络(Multi-Scale Spatial Channel Attention Network, MS-SCANet),该架构采用双分支结构以并行处理不同尺度的图像特征,并引入定制化的空间注意力与通道注意力机制来聚焦关键视觉特征、降低冗余计算;同时设计交叉分支注意力机制强化跨尺度特征整合能力,并创新性地提出两种一致性损失函数——跨分支一致性损失(Cross-Branch Consistency Loss)和自适应池化一致性损失(Adaptive Pooling Consistency Loss),有效维持特征缩放过程中的空间完整性,从而显著提升模型与人类主观评分的一致性。

链接: https://arxiv.org/abs/2602.04032
作者: Mayesha Maliha R. Mithila,Mylene C.Q. Farias
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Published in ICASSP 2025, 5 pages, 3 figures

点击查看摘要

Abstract:We present the Multi-Scale Spatial Channel Attention Network (MS-SCANet), a transformer-based architecture designed for no-reference image quality assessment (IQA). MS-SCANet features a dual-branch structure that processes images at multiple scales, effectively capturing both fine and coarse details, an improvement over traditional single-scale methods. By integrating tailored spatial and channel attention mechanisms, our model emphasizes essential features while minimizing computational complexity. A key component of MS-SCANet is its cross-branch attention mechanism, which enhances the integration of features across different scales, addressing limitations in previous approaches. We also introduce two new consistency loss functions, Cross-Branch Consistency Loss and Adaptive Pooling Consistency Loss, which maintain spatial integrity during feature scaling, outperforming conventional linear and bilinear techniques. Extensive evaluations on datasets like KonIQ-10k, LIVE, LIVE Challenge, and CSIQ show that MS-SCANet consistently surpasses state-of-the-art methods, offering a robust framework with stronger correlations with subjective human scores.
zh

[CV-107] AtlasPatch: An Efficient and Scalable Tool for Whole Slide Image Preprocessing in Computational Pathology

【速读】:该论文旨在解决全切片图像(Whole-slide image, WSI)预处理中的两大核心问题:一是传统方法依赖不准确的启发式阈值进行组织检测,二是现有基于AI的方法通常在数据多样性不足的情况下训练,且仅在补丁级别操作,导致计算复杂度高。其解决方案的关键在于提出AtlasPatch框架,该框架利用约3万张WSI缩略图构成的异质性、半自动标注数据集,通过高效微调Segment-Anything模型实现高精度组织检测,并将检测结果从缩略图外推至全分辨率切片以生成用户指定倍数下的补丁坐标;同时支持直接流式传输补丁至常见图像编码器进行嵌入或存储补丁图像,整个流程在CPU与GPU上并行优化,显著降低计算开销并在下游多实例学习任务中达到SOTA性能。

链接: https://arxiv.org/abs/2602.03998
作者: Ahmed Alagha,Christopher Leclerc,Yousef Kotp,Omar Metwally,Calvin Moras,Peter Rentopoulos,Ghodsiyeh Rostami,Bich Ngoc Nguyen,Jumanah Baig,Abdelhakim Khellaf,Vincent Quoc-Huy Trinh,Rabeb Mizouni,Hadi Otrok,Jamal Bentahar,Mahdi S. Hosseini
机构: Concordia University (康考迪亚大学); Mila (蒙特利尔学习算法研究所); University of Montreal (蒙特利尔大学); Kuwait University (科威特大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注: Under review

点击查看摘要

Abstract:Whole-slide image (WSI) preprocessing, typically comprising tissue detection followed by patch extraction, is foundational to AI-driven computational pathology workflows. This remains a major computational bottleneck as existing tools either rely on inaccurate heuristic thresholding for tissue detection, or adopt AI-based approaches trained on limited-diversity data that operate at the patch level, incurring substantial computational complexity. We present AtlasPatch, an efficient and scalable slide preprocessing framework for accurate tissue detection and high-throughput patch extraction with minimal computational overhead. AtlasPatch’s tissue detection module is trained on a heterogeneous and semi-manually annotated dataset of ~30,000 WSI thumbnails, using efficient fine-tuning of the Segment-Anything model. The tool extrapolates tissue masks from thumbnails to full-resolution slides to extract patch coordinates at user-specified magnifications, with options to stream patches directly into common image encoders for embedding or store patch images, all efficiently parallelized across CPUs and GPUs. We assess AtlasPatch across segmentation precision, computational complexity, and downstream multiple-instance learning, matching state-of-the-art performance while operating at a fraction of their computational cost. AtlasPatch is open-source and available at this https URL.
zh

[CV-108] Sounding Highlights: Dual-Pathway Audio Encoders for Audio-Visual Video Highlight Detection ICASSP2026

【速读】:该论文旨在解决当前音频-视觉视频精彩片段检测模型中对音频模态利用不足的问题,即现有方法多聚焦于高层语义特征而未能充分挖掘声音的丰富动态特性。其解决方案的关键在于提出了一种双路径音频编码器(Dual-Pathway Audio Encoder),该结构包含两个并行分支:一是语义路径,用于提取语音、音乐或特定声学事件等高层内容信息;二是动态路径,通过随时间演进的频率自适应机制联合建模频谱-时序动态特性,从而识别瞬态声学事件及能量快速变化。该设计显著提升了音频模态的表征能力,并在大规模基准数据集上实现了新的最先进性能。

链接: https://arxiv.org/abs/2602.03891
作者: Seohyun Joo,Yoori Oh
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
备注: 5 pages, 2 figures, to appear in ICASSP 2026

点击查看摘要

Abstract:Audio-visual video highlight detection aims to automatically identify the most salient moments in videos by leveraging both visual and auditory cues. However, existing models often underutilize the audio modality, focusing on high-level semantic features while failing to fully leverage the rich, dynamic characteristics of sound. To address this limitation, we propose a novel framework, Dual-Pathway Audio Encoders for Video Highlight Detection (DAViHD). The dual-pathway audio encoder is composed of a semantic pathway for content understanding and a dynamic pathway that captures spectro-temporal dynamics. The semantic pathway extracts high-level information by identifying the content within the audio, such as speech, music, or specific sound events. The dynamic pathway employs a frequency-adaptive mechanism as time evolves to jointly model these dynamics, enabling it to identify transient acoustic events via salient spectral bands and rapid energy changes. We integrate the novel audio encoder into a full audio-visual framework and achieve new state-of-the-art performance on the large-scale this http URL benchmark. Our results demonstrate that a sophisticated, dual-faceted audio representation is key to advancing the field of highlight detection.
zh

[CV-109] o What Extent Do Token-Level Representations from Pathology Foundation Models Improve Dense Prediction?

【速读】:该论文旨在解决病理学基础模型(Pathology Foundation Models, PFMs)在密集预测任务(如组织分割)中部署时缺乏清晰、可复现的性能评估与适应策略分析的问题。当前尽管PFMs展现出跨组织和机构的良好迁移能力,但其在不同数据集上的行为差异及适配方法对性能和稳定性的影响尚不明确。解决方案的关键在于构建一个大规模基准测试平台——PFM-DenseBench,系统性地在18个公开病理分割数据集上评估17种PFMs,并采用统一协议比较多种微调与适配策略,从而得出具有实践指导意义的结论,帮助研究人员和临床开发者选择合适的模型与策略以提升真实场景下的密集病理预测效果。

链接: https://arxiv.org/abs/2602.03887
作者: Weiming Chen,Xitong Ling,Xidong Wang,Zhenyang Cai,Yijia Guo,Mingxi Fu,Ziyi Zeng,Minxi Ouyang,Jiawen Li,Yizhi Wang,Tian Guan,Benyou Wang,Yonghong He
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Pathology foundation models (PFMs) have rapidly advanced and are becoming a common backbone for downstream clinical tasks, offering strong transferability across tissues and institutions. However, for dense prediction (e.g., segmentation), practical deployment still lacks a clear, reproducible understanding of how different PFMs behave across datasets and how adaptation choices affect performance and stability. We present PFM-DenseBench, a large-scale benchmark for dense pathology prediction, evaluating 17 PFMs across 18 public segmentation datasets. Under a unified protocol, we systematically assess PFMs with multiple adaptation and fine-tuning strategies, and derive insightful, practice-oriented findings on when and why different PFMs and tuning choices succeed or fail across heterogeneous datasets. We release containers, configs, and dataset cards to enable reproducible evaluation and informed PFM selection for real-world dense pathology tasks. Project Website: this https URL
zh

[CV-110] DINO-AD: Unsupervised Anomaly Detection with Frozen DINO-V3 Features

【速读】:该论文旨在解决医学图像中无监督异常检测(Unsupervised Anomaly Detection, AD)的问题,即在不依赖像素级标注的情况下精准定位异常区域,从而提升诊断系统的可扩展性和标签效率。其解决方案的关键在于提出一种基于DINO-V3表示的新型框架DINO-AD,通过引入嵌入相似性匹配策略选择语义对齐的支持图像,并结合前景感知的K-means聚类模块建模正常特征分布;随后利用余弦相似度将查询特征与聚类后的正常嵌入进行比较,生成高精度的异常图。该方法在Brain和Liver数据集上实现了高达98.71的AUROC分数,显著优于现有最先进方法。

链接: https://arxiv.org/abs/2602.03870
作者: Jiayu Huo,Jingyuan Hong,Liyun Chen
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ISBI 2026, 4 pages, 2 figures, 3 tables

点击查看摘要

Abstract:Unsupervised anomaly detection (AD) in medical images aims to identify abnormal regions without relying on pixel-level annotations, which is crucial for scalable and label-efficient diagnostic systems. In this paper, we propose a novel anomaly detection framework based on DINO-V3 representations, termed DINO-AD, which leverages self-supervised visual features for precise and interpretable anomaly localization. Specifically, we introduce an embedding similarity matching strategy to select a semantically aligned support image and a foreground-aware K-means clustering module to model the distribution of normal features. Anomaly maps are then computed by comparing the query features with clustered normal embeddings through cosine similarity. Experimental results on both the Brain and Liver datasets demonstrate that our method achieves superior quantitative performance compared with state-of-the-art approaches, achieving AUROC scores of up to 98.71. Qualitative results further confirm that our framework produces clearer and more accurate anomaly localization. Extensive ablation studies validate the effectiveness of each proposed component, highlighting the robustness and generalizability of our approach.
zh

人工智能

[AI-0] Protein Autoregressive Modeling via Multiscale Structure Generation

【速读】:该论文旨在解决蛋白质主链结构生成中因传统方法难以捕捉多尺度结构特征而导致的生成质量与灵活性不足的问题。其核心解决方案是提出了一种多尺度自回归建模框架(Protein Autoregressive Modeling, PAR),关键在于通过三个组件协同实现:(i) 多尺度下采样操作在训练过程中对蛋白质结构进行分层表征;(ii) 自回归Transformer编码多尺度信息并输出条件嵌入以引导生成;(iii) 基于流模型的主链解码器根据这些嵌入逐步生成原子级结构。此外,为缓解自回归模型常见的暴露偏差(exposure bias)问题,引入噪声上下文学习和调度采样策略,显著提升了生成稳定性与质量。PAR展现出强大的零样本泛化能力,支持无需微调的人工提示条件生成与基序支架构建,从而成为蛋白质结构生成领域具有前景的新范式。

链接: https://arxiv.org/abs/2602.04883
作者: Yanru Qu,Cheng-Yen Hsieh,Zaixiang Zheng,Ge Liu,Quanquan Gu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)
备注: ByteDance Seed Tech Report; Page: this https URL

点击查看摘要

Abstract:We present protein autoregressive modeling (PAR), the first multi-scale autoregressive framework for protein backbone generation via coarse-to-fine next-scale prediction. Using the hierarchical nature of proteins, PAR generates structures that mimic sculpting a statue, forming a coarse topology and refining structural details over scales. To achieve this, PAR consists of three key components: (i) multi-scale downsampling operations that represent protein structures across multiple scales during training; (ii) an autoregressive transformer that encodes multi-scale information and produces conditional embeddings to guide structure generation; (iii) a flow-based backbone decoder that generates backbone atoms conditioned on these embeddings. Moreover, autoregressive models suffer from exposure bias, caused by the training and the generation procedure mismatch, and substantially degrades structure generation quality. We effectively alleviate this issue by adopting noisy context learning and scheduled sampling, enabling robust backbone generation. Notably, PAR exhibits strong zero-shot generalization, supporting flexible human-prompted conditional generation and motif scaffolding without requiring fine-tuning. On the unconditional generation benchmark, PAR effectively learns protein distributions and produces backbones of high design quality, and exhibits favorable scaling behavior. Together, these properties establish PAR as a promising framework for protein structure generation.
zh

[AI-1] Contrastive Continual Learning for Model Adaptability in Internet of Things

【速读】:该论文旨在解决物联网(IoT)部署中因环境非平稳性(如传感器漂移、用户行为演化和隐私需求差异)导致模型性能下降的问题,其核心挑战在于如何在不发生灾难性遗忘的前提下实现模型的持续学习(Continual Learning, CL)。解决方案的关键在于将对比学习(Contrastive Learning)与持续学习相结合,形成生成式对比持续学习(Contrastive Continual Learning, CCL),通过融合对比损失与知识蒸馏(Distillation)损失来提升模型在动态数据流中的表征鲁棒性和样本效率。论文进一步提出面向物联网场景的参考架构,涵盖设备端(TinyML)、边缘和云端的协同训练机制,并强调了能量感知训练、概念漂移应对及联邦设置下的隐私保护等关键设计考量。

链接: https://arxiv.org/abs/2602.04881
作者: Ajesh Koyatan Chathoth
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Internet of Things (IoT) deployments operate in nonstationary, dynamic environments where factors such as sensor drift, evolving user behavior, and heterogeneous user privacy requirements can affect application utility. Continual learning (CL) addresses this by adapting models over time without catastrophic forgetting. Meanwhile, contrastive learning has emerged as a powerful representation-learning paradigm that improves robustness and sample efficiency in a self-supervised manner. This paper reviews the usage of \emphcontrastive continual learning (CCL) for IoT, connecting algorithmic design (replay, regularization, distillation, prompts) with IoT system realities (TinyML constraints, intermittent connectivity, privacy). We present a unifying problem formulation, derive common objectives that blend contrastive and distillation losses, propose an IoT-oriented reference architecture for on-device, edge, and cloud-based CCL, and provide guidance on evaluation protocols and metrics. Finally, we highlight open unique challenges with respect to the IoT domain, such as spanning tabular and streaming IoT data, concept drift, federated settings, and energy-aware training.
zh

[AI-2] CRoSS: A Continual Robotic Simulation Suite for Scalable Reinforcement Learning with High Task Diversity and Realistic Physics Simulation

【速读】:该论文旨在解决持续强化学习(Continual Reinforcement Learning, CRL)中代理在序列任务中学习时面临的灾难性遗忘问题,尤其是在具有高物理真实感的机器人场景下缺乏可靠、可扩展且易于复现的基准测试工具。其解决方案的关键在于提出一个名为Continual Robotic Simulation Suite (CRoSS) 的新型基准套件,该套件基于Gazebo仿真器构建,包含两类机器人平台:一种是带激光雷达、摄像头和碰撞传感器的差速驱动机器人,用于线跟踪和物体推动任务;另一种是7自由度机械臂,支持基于笛卡尔空间位置控制和关节角度控制的两种目标到达任务,并提供无需物理仿真的运动学变体以提升效率。CRoSS设计具备高度可扩展性,支持几乎任意传感器配置,同时通过容器化部署(Apptainer)保障实验可复现性,并报告了标准强化学习算法(如DQN和策略梯度方法)的性能表现,从而为CRL研究提供了一个兼具真实性与实用性的基准平台。

链接: https://arxiv.org/abs/2602.04868
作者: Yannick Denker,Alexander Gepperth
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Continual reinforcement learning (CRL) requires agents to learn from a sequence of tasks without forgetting previously acquired policies. In this work, we introduce a novel benchmark suite for CRL based on realistically simulated robots in the Gazebo simulator. Our Continual Robotic Simulation Suite (CRoSS) benchmarks rely on two robotic platforms: a two-wheeled differential-drive robot with lidar, camera and bumper sensor, and a robotic arm with seven joints. The former represent an agent in line-following and object-pushing scenarios, where variation of visual and structural parameters yields a large number of distinct tasks, whereas the latter is used in two goal-reaching scenarios with high-level cartesian hand position control (modeled after the Continual World benchmark), and low-level control based on joint angles. For the robotic arm benchmarks, we provide additional kinematics-only variants that bypass the need for physical simulation (as long as no sensor readings are required), and which can be run two orders of magnitude faster. CRoSS is designed to be easily extensible and enables controlled studies of continual reinforcement learning in robotic settings with high physical realism, and in particular allow the use of almost arbitrary simulated sensors. To ensure reproducibility and ease of use, we provide a containerized setup (Apptainer) that runs out-of-the-box, and report performances of standard RL algorithms, including Deep Q-Networks (DQN) and policy gradient methods. This highlights the suitability as a scalable and reproducible benchmark for CRL research.
zh

[AI-3] From Evaluation to Design: Using Potential Energy Surface Smoothness Metrics to Guide Machine Learning Interatomic Potential Architectures

【速读】:该论文旨在解决机器学习原子间势(Machine Learning Interatomic Potentials, MLIPs)在模拟中因无法准确再现量子势能面(Potential Energy Surface, PES)的物理平滑性而导致的错误行为问题,尤其指出传统能量和力回归评估指标难以发现此类缺陷。解决方案的关键在于提出一种高效且敏感的基准测试方法——键平滑性表征测试(Bond Smoothness Characterization Test, BSCT),该方法通过受控的键变形探测PES中的非平滑特征(如不连续性、人工极小值和虚假力),覆盖平衡态及远离平衡的状态;BSCT不仅与微正则分子动力学(MD)稳定性高度相关,且计算成本仅为MD的极小部分,从而成为MLIP开发过程中可嵌入迭代优化的设计代理(in-the-loop model design proxy),有效识别并指导修正模型中的物理异常。

链接: https://arxiv.org/abs/2602.04861
作者: Ryan Liu,Eric Qu,Tobias Kreiman,Samuel M. Blau,Aditi S. Krishnapriyan
机构: 未知
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
备注: 13 pages main text, 10 pages reference appendix, 8 figures

点击查看摘要

Abstract:Machine Learning Interatomic Potentials (MLIPs) sometimes fail to reproduce the physical smoothness of the quantum potential energy surface (PES), leading to erroneous behavior in downstream simulations that standard energy and force regression evaluations can miss. Existing evaluations, such as microcanonical molecular dynamics (MD), are computationally expensive and primarily probe near-equilibrium states. To improve evaluation metrics for MLIPs, we introduce the Bond Smoothness Characterization Test (BSCT). This efficient benchmark probes the PES via controlled bond deformations and detects non-smoothness, including discontinuities, artificial minima, and spurious forces, both near and far from equilibrium. We show that BSCT correlates strongly with MD stability while requiring a fraction of the cost of MD. To demonstrate how BSCT can guide iterative model design, we utilize an unconstrained Transformer backbone as a testbed, illustrating how refinements such as a new differentiable k -nearest neighbors algorithm and temperature-controlled attention reduce artifacts identified by our metric. By optimizing model design systematically based on BSCT, the resulting MLIP simultaneously achieves a low conventional E/F regression error, stable MD simulations, and robust atomistic property predictions. Our results establish BSCT as both a validation metric and as an “in-the-loop” model design proxy that alerts MLIP developers to physical challenges that cannot be efficiently evaluated by current MLIP benchmarks.
zh

[AI-4] Fluid Representations in Reason ing Models

【速读】:该论文旨在解决生成式 AI (Generative AI) 中推理语言模型在抽象问题上表现优异但其内部机制尚不明确的问题。解决方案的关键在于对 QwQ-32B 模型在 Mystery Blocksworld 这一语义混淆规划域中的推理过程进行机制解析,发现模型通过上下文内对 token 表示的动态优化(即 Fluid Reasoning Representations)逐步构建出聚焦于结构而非具体动作名称的抽象编码,从而提升问题求解能力;进一步的操控实验验证了此类表示的因果作用:注入成功推理轨迹中提炼的表示可显著提高准确率,且符号化表示可替代部分混淆编码而几乎不影响性能。

链接: https://arxiv.org/abs/2602.04843
作者: Dmitrii Kharlapenko,Alessandro Stolfo,Arthur Conmy,Mrinmaya Sachan,Zhijing Jin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reasoning language models, which generate long chains of thought, dramatically outperform non-reasoning language models on abstract problems. However, the internal model mechanisms that allow this superior performance remain poorly understood. We present a mechanistic analysis of how QwQ-32B - a model specifically trained to produce extensive reasoning traces - process abstract structural information. On Mystery Blocksworld - a semantically obfuscated planning domain - we find that QwQ-32B gradually improves its internal representation of actions and concepts during reasoning. The model develops abstract encodings that focus on structure rather than specific action names. Through steering experiments, we establish causal evidence that these adaptations improve problem solving: injecting refined representations from successful traces boosts accuracy, while symbolic representations can replace many obfuscated encodings with minimal performance loss. We find that one of the factors driving reasoning model performance is in-context refinement of token representations, which we dub Fluid Reasoning Representations.
zh

[AI-5] Group-Evolving Agents : Open-Ended Self-Improvement via Experience Sharing

【速读】:该论文旨在解决现有开放-ended自进化代理(open-ended self-evolving agents)在演化过程中因树状结构导致的探索多样性利用效率低下问题,即孤立的演化分支难以有效复用早期探索经验,限制了长期性能提升。其解决方案的关键在于提出群体演化代理(Group-Evolving Agents, GEA),将一组代理视为基本的演化单元,通过显式的经验共享与重用来增强群体内部的知识传递,从而更高效地将早期探索多样性转化为持续的长期进步,显著提升了代码任务上的性能表现和框架鲁棒性。

链接: https://arxiv.org/abs/2602.04837
作者: Zhaotian Weng,Antonis Antoniades,Deepak Nathani,Zhen Zhang,Xiao Pu,Xin Eric Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages

点击查看摘要

Abstract:Open-ended self-improving agents can autonomously modify their own structural designs to advance their capabilities and overcome the limits of pre-defined architectures, thus reducing reliance on human intervention. We introduce Group-Evolving Agents (GEA), a new paradigm for open-ended self-improvements, which treats a group of agents as the fundamental evolutionary unit, enabling explicit experience sharing and reuse within the group throughout evolution. Unlike existing open-ended self-evolving paradigms that adopt tree-structured evolution, GEA overcomes the limitation of inefficient utilization of exploratory diversity caused by isolated evolutionary branches. We evaluate GEA on challenging coding benchmarks, where it significantly outperforms state-of-the-art self-evolving methods (71.0% vs. 56.7% on SWE-bench Verified, 88.3% vs. 68.3% on Polyglot) and matches or exceeds top human-designed agent frameworks (71.8% and 52.0% on two benchmarks, respectively). Analysis reveals that GEA more effectively converts early-stage exploratory diversity into sustained, long-term progress, achieving stronger performance under the same number of evolved agents. Furthermore, GEA exhibits consistent transferability across different coding models and greater robustness, fixing framework-level bugs in 1.4 iterations on average, versus 5 for self-evolving methods.
zh

[AI-6] Are AI Capabilities Increasing Exponentially? A Competing Hypothesis

【速读】:该论文试图解决当前关于人工智能(AI)能力增长是否呈指数级趋势的争议问题,特别是针对Model Evaluation Threat Research(METR)报告中提出的“自2019年以来AI能力呈现指数增长”的结论进行质疑。其解决方案的关键在于:首先,通过对METR原始数据重新拟合逻辑斯蒂(sigmoid)曲线,发现其预测的拐点(inflection point)实际上早已过去;其次,提出一个更复杂的分解模型,将AI能力拆分为基础能力(base capabilities)和推理能力(reasoning capabilities),并假设二者分别以不同速率提升,从而证明在短期内仍可能出现拐点。该研究旨在揭示现有指数增长预测的脆弱性,而非建立自身严谨的预测模型。

链接: https://arxiv.org/abs/2602.04836
作者: Haosen Ge,Hamsa Bastani,Osbert Bastani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Rapidly increasing AI capabilities have substantial real-world consequences, ranging from AI safety concerns to labor market consequences. The Model Evaluation Threat Research (METR) report argues that AI capabilities have exhibited exponential growth since 2019. In this note, we argue that the data does not support exponential growth, even in shorter-term horizons. Whereas the METR study claims that fitting sigmoid/logistic curves results in inflection points far in the future, we fit a sigmoid curve to their current data and find that the inflection point has already passed. In addition, we propose a more complex model that decomposes AI capabilities into base and reasoning capabilities, exhibiting individual rates of improvement. We prove that this model supports our hypothesis that AI capabilities will exhibit an inflection point in the near future. Our goal is not to establish a rigorous forecast of our own, but to highlight the fragility of existing forecasts of exponential growth.
zh

[AI-7] Safe Urban Traffic Control via Uncertainty-Aware Conformal Prediction and World-Model Reinforcement Learning

【速读】:该论文旨在解决城市交通管理中对未来状态预测、异常检测与安全校正动作决策的联合需求,同时确保系统具备可靠的理论保障。其核心挑战在于如何在端到端框架中传播经过校准的不确定性,从而实现从预测到异常检测再到安全策略学习的统一建模与理论可证性。解决方案的关键在于提出STREAM-RL框架,包含三项创新算法:(1) PU-GAT+通过置信度单调注意力机制动态调整图注意力权重,实现分布无关的覆盖率保证;(2) CRFN-BY利用归一化流建模不确定性归一化残差,并结合Benjamini-Yekutieli方法控制任意依赖下的错误发现率(False Discovery Rate, FDR);(3) LyCon-WRL+构建基于李雅普诺夫稳定性证书和 Lipschitz 约束的安全世界模型强化学习代理,支持不确定性传播的想象回放(imagination rollouts)。该框架首次实现了从预测到策略学习全程的校准不确定性传递与端到端理论保障,在真实交通轨迹数据上验证了91.4%覆盖率效率、4.1% FDR控制以及相比标准PPO提升至95.2%的安全率。

链接: https://arxiv.org/abs/2602.04821
作者: Joydeep Chandra,Satyam Kumar Navneet,Aleksandr Algazinov,Yong Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Urban traffic management demands systems that simultaneously predict future conditions, detect anomalies, and take safe corrective actions – all while providing reliability guarantees. We present STREAM-RL, a unified framework that introduces three novel algorithmic contributions: (1) PU-GAT+, an Uncertainty-Guided Adaptive Conformal Forecaster that uses prediction uncertainty to dynamically reweight graph attention via confidence-monotonic attention, achieving distribution-free coverage guarantees; (2) CRFN-BY, a Conformal Residual Flow Network that models uncertainty-normalized residuals via normalizing flows with Benjamini-Yekutieli FDR control under arbitrary dependence; and (3) LyCon-WRL+, an Uncertainty-Guided Safe World-Model RL agent with Lyapunov stability certificates, certified Lipschitz bounds, and uncertainty-propagated imagination rollouts. To our knowledge, this is the first framework to propagate calibrated uncertainty from forecasting through anomaly detection to safe policy learning with end-to-end theoretical guarantees. Experiments on multiple real-world traffic trajectory data demonstrate that STREAM-RL achieves 91.4% coverage efficiency, controls FDR at 4.1% under verified dependence, and improves safety rate to 95.2% compared to 69% for standard PPO while achieving higher reward, with 23ms end-to-end inference latency.
zh

[AI-8] Agent ic AI in Healthcare Medicine: A Seven-Dimensional Taxonomy for Empirical Evaluation of LLM -based Agents

【速读】:该论文试图解决当前医疗领域中基于大语言模型(Large Language Model, LLM)的智能体(agent)研究缺乏统一框架的问题。现有文献多为宽泛综述或聚焦单一能力(如记忆、规划或推理),导致难以系统性理解其在临床工作流中的整合现状与潜力。解决方案的关键在于构建一个七维分类法(Cognitive Capabilities、Knowledge Management、Interaction Patterns、Adaptation Learning、Safety Ethics、Framework Typology 和 Core Tasks Subtasks),涵盖29个可操作子维度,并通过明确的纳入与排除标准及标签评分体系(Fully Implemented、Partially Implemented、Not Implemented)对49项相关研究进行映射分析,从而量化各能力维度的普及率与共现模式,揭示实际部署中的显著不对称性,例如外部知识集成高度实现而事件触发激活几乎缺失,为未来研究与应用提供结构化参考和优先级指引。

链接: https://arxiv.org/abs/2602.04813
作者: Shubham Vatsal,Harsh Dubey,Aditi Singh
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Large Language Model (LLM)-based agents that plan, use tools and act has begun to shape healthcare and medicine. Reported studies demonstrate competence on various tasks ranging from EHR analysis and differential diagnosis to treatment planning and research workflows. Yet the literature largely consists of overviews which are either broad surveys or narrow dives into a single capability (e.g., memory, planning, reasoning), leaving healthcare work without a common frame. We address this by reviewing 49 studies using a seven-dimensional taxonomy: Cognitive Capabilities, Knowledge Management, Interaction Patterns, Adaptation Learning, Safety Ethics, Framework Typology and Core Tasks Subtasks with 29 operational sub-dimensions. Using explicit inclusion and exclusion criteria and a labeling rubric (Fully Implemented, Partially Implemented, Not Implemented), we map each study to the taxonomy and report quantitative summaries of capability prevalence and co-occurrence patterns. Our empirical analysis surfaces clear asymmetries. For instance, the External Knowledge Integration sub-dimension under Knowledge Management is commonly realized (~76% Fully Implemented) whereas Event-Triggered Activation sub-dimenison under Interaction Patterns is largely absent (~92% Not Implemented) and Drift Detection Mitigation sub-dimension under Adaptation Learning is rare (~98% Not Implemented). Architecturally, Multi-Agent Design sub-dimension under Framework Typology is the dominant pattern (~82% Fully Implemented) while orchestration layers remain mostly partial. Across Core Tasks Subtasks, information centric capabilities lead e.g., Medical Question Answering Decision Support and Benchmarking Simulation, while action and discovery oriented areas such as Treatment Planning Prescription still show substantial gaps (~59% Not Implemented).
zh

[AI-9] Beyond Rewards in Reinforcement Learning for Cyber Defence

【速读】:该论文旨在解决当前基于深度强化学习(Deep Reinforcement Learning, DRL)的自主网络防御代理在训练过程中因依赖密集奖励函数而可能产生次优甚至高风险策略的问题。现有方法通常采用高度工程化的密集奖励函数,虽有助于探索复杂环境,但易引入偏差,导致防御行为偏离理想目标。论文的关键解决方案在于系统评估不同奖励结构(稀疏与密集)对学习效果和策略行为的影响,并提出:只要稀疏奖励与防御目标对齐且能频繁触发,即可显著提升训练可靠性,同时生成更安全、更符合防御者意图的策略——即使未显式设置惩罚项,也能有效减少昂贵的防御动作使用,从而实现更稳健的网络安全防护。

链接: https://arxiv.org/abs/2602.04809
作者: Elizabeth Bates,Chris Hicks,Vasilios Mavroudis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent years have seen an explosion of interest in autonomous cyber defence agents trained to defend computer networks using deep reinforcement learning. These agents are typically trained in cyber gym environments using dense, highly engineered reward functions which combine many penalties and incentives for a range of (un)desirable states and costly actions. Dense rewards help alleviate the challenge of exploring complex environments but risk biasing agents towards suboptimal and potentially riskier solutions, a critical issue in complex cyber environments. We thoroughly evaluate the impact of reward function structure on learning and policy behavioural characteristics using a variety of sparse and dense reward functions, two well-established cyber gyms, a range of network sizes, and both policy gradient and value-based RL algorithms. Our evaluation is enabled by a novel ground truth evaluation approach which allows directly comparing between different reward functions, illuminating the nuanced inter-relationships between rewards, action space and the risks of suboptimal policies in cyber environments. Our results show that sparse rewards, provided they are goal aligned and can be encountered frequently, uniquely offer both enhanced training reliability and more effective cyber defence agents with lower-risk policies. Surprisingly, sparse rewards can also yield policies that are better aligned with cyber defender goals and make sparing use of costly defensive actions without explicit reward-based numerical penalties.
zh

[AI-10] Skin Tokens: A Learned Compact Representation for Unified Autoregressive Rigging

【速读】:该论文旨在解决生成式3D模型在动画制作流程中的关键瓶颈——蒙皮(rigging)自动化问题。现有方法受限于将蒙皮视为一个病态的高维回归任务,导致优化效率低下且通常与骨骼生成解耦。其解决方案的核心在于提出SkinTokens:一种通过FSQ-CVAE学习得到的紧凑、离散的蒙皮权重表示,将原本连续的回归问题转化为更易处理的令牌序列预测问题。基于此表示,作者进一步构建了TokenRig框架,这是一个统一的自回归模型,联合建模骨骼参数与SkinTokens序列,显式学习骨骼结构与皮肤变形之间的复杂依赖关系,并引入强化学习阶段以几何和语义奖励提升对分布外资产的泛化能力。该方案显著提升了蒙皮精度(相比SOTA提高98%-133%)和骨骼预测性能(提升17%-22%),提供了一种高保真、鲁棒且可扩展的3D角色绑定生成方法。

链接: https://arxiv.org/abs/2602.04805
作者: Jia-peng Zhang,Cheng-Feng Pu,Meng-Hao Guo,Yan-Pei Cao,Shi-Min Hu
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
备注: 14 pages, 10 figures

点击查看摘要

Abstract:The rapid proliferation of generative 3D models has created a critical bottleneck in animation pipelines: rigging. Existing automated methods are fundamentally limited by their approach to skinning, treating it as an ill-posed, high-dimensional regression task that is inefficient to optimize and is typically decoupled from skeleton generation. We posit this is a representation problem and introduce SkinTokens: a learned, compact, and discrete representation for skinning weights. By leveraging an FSQ-CVAE to capture the intrinsic sparsity of skinning, we reframe the task from continuous regression to a more tractable token sequence prediction problem. This representation enables TokenRig, a unified autoregressive framework that models the entire rig as a single sequence of skeletal parameters and SkinTokens, learning the complicated dependencies between skeletons and skin deformations. The unified model is then amenable to a reinforcement learning stage, where tailored geometric and semantic rewards improve generalization to complex, out-of-distribution assets. Quantitatively, the SkinTokens representation leads to a 98%-133% percents improvement in skinning accuracy over state-of-the-art methods, while the full TokenRig framework, refined with RL, enhances bone prediction by 17%-22%. Our work presents a unified, generative approach to rigging that yields higher fidelity and robustness, offering a scalable solution to a long-standing challenge in 3D content creation.
zh

[AI-11] am Then Trim: An Assembly-Line LLM Framework for High-Quality Tabular Data Generation

【速读】:该论文旨在解决真实世界中高质量表格数据(tabular data)获取困难的问题,尤其是在样本稀缺情况下常出现的类别不平衡、选择偏差和低保真度等缺陷。其核心解决方案是提出T²(Team-then-Trim)框架,关键在于将数据生成过程类比为制造流程:首先由多个专业化的大型语言模型(Large Language Models, LLMs)协同生成表格的不同组件,随后通过一个三阶段可插拔的数据质量控制(Data Quality Control, QC)管道对合成数据进行多维度评估与优化,从而显著提升合成数据的质量和可用性。

链接: https://arxiv.org/abs/2602.04785
作者: Congjing Zhang,Ryan Feng Lin,Ruoxuan Bao,Shuai Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While tabular data is fundamental to many real-world machine learning (ML) applications, acquiring high-quality tabular data is usually labor-intensive and expensive. Limited by the scarcity of observations, tabular datasets often exhibit critical deficiencies, such as class imbalance, selection bias, and low fidelity. To address these challenges, building on recent advances in Large Language Models (LLMs), this paper introduces Team-then-Trim (T ^2 ), a framework that synthesizes high-quality tabular data through a collaborative team of LLMs, followed by a rigorous three-stage plug-in data quality control (QC) pipeline. In T ^2 , tabular data generation is conceptualized as a manufacturing process: specialized LLMs, guided by domain knowledge, are tasked with generating different data components sequentially, and the resulting products, i.e., the synthetic data, are systematically evaluated across multiple dimensions of QC. Empirical results on both simulated and real-world datasets demonstrate that T ^2 outperforms state-of-the-art methods in producing high-quality tabular data, highlighting its potential to support downstream models when direct data collection is practically infeasible.
zh

[AI-12] Billion-Scale Graph Foundation Models

【速读】:该论文旨在解决如何构建适用于任意异构、百亿规模图数据的图基础模型(Graph Foundation Models, GFMs)这一挑战,以实现类似语言和视觉领域中预训练-微调范式在图学习中的扩展。其解决方案的关键在于提出了一种端到端的构建框架——GraphBFF,其中核心组件为GraphBFF Transformer架构,该架构具备灵活性与可扩展性,能够支持实际场景下的百亿参数级GFMs;同时,研究首次揭示了通用图上的神经网络缩放定律(neural scaling laws),表明损失随模型容量或训练数据量的增加而可预测地降低,从而为大规模图基础模型的构建提供了理论指导与实践方法。

链接: https://arxiv.org/abs/2602.04768
作者: Maya Bechler-Speicher,Yoel Gottlieb,Andrey Isakov,David Abensur,Ami Tavory,Daniel Haimovich,Ido Guy,Udi Weinsberg
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph-structured data underpins many critical applications. While foundation models have transformed language and vision via large-scale pretraining and lightweight adaptation, extending this paradigm to general, real-world graphs is challenging. In this work, we present Graph Billion- Foundation-Fusion (GraphBFF): the first end-to-end recipe for building billion-parameter Graph Foundation Models (GFMs) for arbitrary heterogeneous, billion-scale graphs. Central to the recipe is the GraphBFF Transformer, a flexible and scalable architecture designed for practical billion-scale GFMs. Using the GraphBFF, we present the first neural scaling laws for general graphs and show that loss decreases predictably as either model capacity or training data scales, depending on which factor is the bottleneck. The GraphBFF framework provides concrete methodologies for data batching, pretraining, and fine-tuning for building GFMs at scale. We demonstrate the effectiveness of the framework with an evaluation of a 1.4 billion-parameter GraphBFF Transformer pretrained on one billion samples. Across ten diverse, real-world downstream tasks on graphs unseen during training, spanning node- and link-level classification and regression, GraphBFF achieves remarkable zero-shot and probing performance, including in few-shot settings, with large margins of up to 31 PRAUC points. Finally, we discuss key challenges and open opportunities for making GFMs a practical and principled foundation for graph learning at industrial scale.
zh

[AI-13] Active Asymmetric Multi-Agent Multimodal Learning under Uncertainty

【速读】:该论文旨在解决多智能体系统在异构多模态感知环境下,因传感器噪声或损坏导致的不确定性问题。现有协作框架通常在代理层面进行推理、假设同质感知且隐式处理不确定性,限制了系统在传感器污染下的鲁棒性。其解决方案的关键在于提出一种主动的、非对称的多智能体多模态学习方法(A2MAML),该方法将每种模态特征建模为带有不确定性预测的随机估计,通过贝叶斯逆方差加权机制主动选择可靠代理-模态组合并进行信息融合,从而实现细粒度的模态级融合、支持异构模态可用性,并提供抑制噪声或受损模态的理论依据。

链接: https://arxiv.org/abs/2602.04763
作者: Rui Liu,Pratap Tokekar,Ming Lin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-agent systems are increasingly equipped with heterogeneous multimodal sensors, enabling richer perception but introducing modality-specific and agent-dependent uncertainty. Existing multi-agent collaboration frameworks typically reason at the agent level, assume homogeneous sensing, and handle uncertainty implicitly, limiting robustness under sensor corruption. We propose Active Asymmetric Multi-Agent Multimodal Learning under Uncertainty (A2MAML), a principled approach for uncertainty-aware, modality-level collaboration. A2MAML models each modality-specific feature as a stochastic estimate with uncertainty prediction, actively selects reliable agent-modality pairs, and aggregates information via Bayesian inverse-variance weighting. This formulation enables fine-grained, modality-level fusion, supports asymmetric modality availability, and provides a principled mechanism to suppress corrupted or noisy modalities. Extensive experiments on connected autonomous driving scenarios for collaborative accident detection demonstrate that A2MAML consistently outperforms both single-agent and collaborative baselines, achieving up to 18.7% higher accident detection rate.
zh

[AI-14] Comparative Insights on Adversarial Machine Learning from Industry and Academia: A User-Study Approach

【速读】:该论文旨在解决生成式 AI(Generative AI)快速发展背景下,机器学习(Machine Learning)系统面临的对抗性机器学习(Adversarial Machine Learning, AML)安全威胁及其教育缺失问题。研究发现,行业从业者对AML威胁的关注度与其网络安全教育水平呈显著正相关,而学生群体则对基于CTF(Capture The Flag)的实践教学方式表现出较高兴趣。解决方案的关键在于将安全教育深度整合进机器学习课程体系,并通过设计贴近实战的CTF挑战(如NLP与生成式AI场景下的训练数据投毒攻击)提升学习者的主动参与度与风险意识,从而有效增强对AML漏洞的认知与防御能力。

链接: https://arxiv.org/abs/2602.04753
作者: Vishruti Kakkad(1),Paul Chung(2),Hanan Hibshi(1 and 3),Maverick Woo(1) ((1) Carnegie Mellon University, (2) University of California, San Diego, (3) King Abdulaziz University)
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:An exponential growth of Machine Learning and its Generative AI applications brings with it significant security challenges, often referred to as Adversarial Machine Learning (AML). In this paper, we conducted two comprehensive studies to explore the perspectives of industry professionals and students on different AML vulnerabilities and their educational strategies. In our first study, we conducted an online survey with professionals revealing a notable correlation between cybersecurity education and concern for AML threats. For our second study, we developed two CTF challenges that implement Natural Language Processing and Generative AI concepts and demonstrate a poisoning attack on the training data set. The effectiveness of these challenges was evaluated by surveying undergraduate and graduate students at Carnegie Mellon University, finding that a CTF-based approach effectively engages interest in AML threats. Based on the responses of the participants in our research, we provide detailed recommendations emphasizing the critical need for integrated security education within the ML curriculum.
zh

[AI-15] Supporting software engineering tasks with agent ic AI: Demonstration on document retrieval and test scenario generation

【速读】:该论文旨在解决软件工程中两个关键问题:一是从详细需求描述中自动生成测试场景,二是实现对单个软件开发相关文档的高效检索与多任务处理。解决方案的关键在于引入基于代理(agent)的生成式 AI (Generative AI) 架构——对于测试场景生成任务,采用星型拓扑结构,由专用工作代理围绕中央监督代理协同完成;对于文档检索任务,则为每种使用场景(如搜索、问答、变更追踪和大文档摘要)配置独立的基于大语言模型(LLM)的代理,每个代理自主执行其对应的所有子任务。这种模块化、任务导向的代理设计显著提升了自动化与灵活性。

链接: https://arxiv.org/abs/2602.04726
作者: Marian Kica,Lukas Radosky,David Slivka,Karin Kubinova,Daniel Dovhun,Tomas Uhercik,Erik Bircak,Ivan Polasek
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: This is a preprint of a paper that was accepted at the International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA 2026)

点击查看摘要

Abstract:The introduction of large language models ignited great retooling and rethinking of the software development models. The ensuing response of software engineering research yielded a massive body of tools and approaches. In this paper, we join the hassle by introducing agentic AI solutions for two tasks. First, we developed a solution for automatic test scenario generation from a detailed requirements description. This approach relies on specialized worker agents forming a star topology with the supervisor agent in the middle. We demonstrate its capabilities on a real-world example. Second, we developed an agentic AI solution for the document retrieval task in the context of software engineering documents. Our solution enables performing various use cases on a body of documents related to the development of a single software, including search, question answering, tracking changes, and large document summarization. In this case, each use case is handled by a dedicated LLM-based agent, which performs all subtasks related to the corresponding use case. We conclude by hinting at the future perspectives of our line of research.
zh

[AI-16] Addressing Corpus Knowledge Poisoning Attacks on RAG Using Sparse Attention

【速读】:该论文旨在解决检索增强生成(Retrieval Augmented Generation, RAG)系统在面对语料库知识投毒攻击时的脆弱性问题,即攻击者通过向语料库注入误导性文档,诱导大语言模型(Large Language Models, LLMs)输出恶意响应。论文指出,标准因果注意力机制允许不同检索文档之间的有害交互,从而放大攻击效果。解决方案的关键在于提出一种新的防御方法——稀疏文档注意力RAG(Sparse Document Attention RAG, SDAG),其核心是引入一种块稀疏注意力机制,强制禁止检索文档间的跨文档注意力连接;该方法仅需在推理阶段对注意力掩码进行最小改动,无需微调或架构调整,即可显著降低攻击成功率,并能与现有先进RAG防御方法有效集成,进一步提升鲁棒性。

链接: https://arxiv.org/abs/2602.04711
作者: Sagie Dekel,Moshe Tennenholtz,Oren Kurland
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval Augmented Generation (RAG) is a highly effective paradigm for keeping LLM-based responses up-to-date and reducing the likelihood of hallucinations. Yet, RAG was recently shown to be quite vulnerable to corpus knowledge poisoning: an attacker injects misleading documents to the corpus to steer an LLMs’ output to an undesired response. We argue that the standard causal attention mechanism in LLMs enables harmful cross-document interactions, specifically in cases of attacks. Accordingly, we introduce a novel defense approach for RAG: Sparse Document Attention RAG (SDAG). This is a block-sparse attention mechanism that disallows cross-attention between retrieved documents. SDAG requires a minimal inference-time change to the attention mask; furthermore, no fine-tuning or additional architectural changes are needed. We present an empirical evaluation of LLM-based question answering (QA) with a variety of attack strategies on RAG. We show that our SDAG method substantially outperforms the standard causal attention mechanism in terms of attack success rate. We further demonstrate the clear merits of integrating SDAG with state-of-the-art RAG defense methods. Specifically, the integration results in performance that is statistically significantly better than the state-of-the-art.
zh

[AI-17] Let Experts Feel Uncertainty: A Multi-Expert Label Distribution Approach to Probabilistic Time Series Forecasting

【速读】:该论文旨在解决实际时间序列预测中预测精度与不确定性可解释性之间的平衡问题。传统点预测方法难以捕捉时间序列中的固有不确定性,而现有概率性方法又往往在计算效率与可解释性之间难以兼顾。解决方案的关键在于提出一种基于多专家学习的分布标签(Multi-Expert Learning Distributional Labels, LDL)框架,通过混合专家(Mixture-of-Experts)架构实现分布式学习,从而生成丰富的不确定性量化结果;其中,两种互补方法分别聚焦于捕捉多样化的时序模式(Multi-Expert LDL)和显式分解时间序列为趋势、季节性、变化点及波动性等可解释成分(Pattern-Aware LDL-MoE),并利用最大均值差异(Maximum Mean Discrepancy, MMD)进行分布学习优化,最终在M5销售数据集上验证了其在准确性和可解释性上的双重优势。

链接: https://arxiv.org/abs/2602.04678
作者: Zhen Zhou,Zhirui Wang,Qi Hong,Yunyang Shi,Ziyuan Gu,Zhiyuan Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages, 2figures

点击查看摘要

Abstract:Time series forecasting in real-world applications requires both high predictive accuracy and interpretable uncertainty quantification. Traditional point prediction methods often fail to capture the inherent uncertainty in time series data, while existing probabilistic approaches struggle to balance computational efficiency with interpretability. We propose a novel Multi-Expert Learning Distributional Labels (LDL) framework that addresses these challenges through mixture-of-experts architectures with distributional learning capabilities. Our approach introduces two complementary methods: (1) Multi-Expert LDL, which employs multiple experts with different learned parameters to capture diverse temporal patterns, and (2) Pattern-Aware LDL-MoE, which explicitly decomposes time series into interpretable components (trend, seasonality, changepoints, volatility) through specialized sub-experts. Both frameworks extend traditional point prediction to distributional learning, enabling rich uncertainty quantification through Maximum Mean Discrepancy (MMD). We evaluate our methods on aggregated sales data derived from the M5 dataset, demonstrating superior performance compared to baseline approaches. The continuous Multi-Expert LDL achieves the best overall performance, while the Pattern-Aware LDL-MoE provides enhanced interpretability through component-wise analysis. Our frameworks successfully balance predictive accuracy with interpretability, making them suitable for real-world forecasting applications where both performance and actionable insights are crucial.
zh

[AI-18] Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在扩散模型(Diffusion Models)和流模型(Flow Models)中应用于视觉任务(如文本到图像生成)时面临的挑战,特别是由于扩散模型的似然函数难以计算(intractable likelihoods),导致传统基于策略梯度的方法难以直接应用。解决方案的关键在于对RL设计空间进行系统性分析,解耦出三个核心因素:策略梯度目标函数、似然估计器以及轨迹采样方案,并发现采用仅基于最终生成样本计算的证据下界(Evidence Lower Bound, ELBO)似然估计器是实现高效、稳定RL优化的主导因素,其效果显著优于特定策略梯度损失函数的选择。这一发现使得在SD 3.5 Medium模型上,GenEval得分从0.24提升至0.95,且效率比FlowGRPO高4.6倍、比当前最优方法DiffusionNFT高2倍,同时避免了奖励欺骗(reward hacking)。

链接: https://arxiv.org/abs/2602.04663
作者: Jaemoo Choi,Yuchen Zhu,Wei Guo,Petr Molodyk,Bo Yuan,Jinbin Bai,Yi Xin,Molei Tao,Yongxin Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 23 pages, 11 figures

点击查看摘要

Abstract:Reinforcement learning has been widely applied to diffusion and flow models for visual tasks such as text-to-image generation. However, these tasks remain challenging because diffusion models have intractable likelihoods, which creates a barrier for directly applying popular policy-gradient type methods. Existing approaches primarily focus on crafting new objectives built on already heavily engineered LLM objectives, using ad hoc estimators for likelihood, without a thorough investigation into how such estimation affects overall algorithmic performance. In this work, we provide a systematic analysis of the RL design space by disentangling three factors: i) policy-gradient objectives, ii) likelihood estimators, and iii) rollout sampling schemes. We show that adopting an evidence lower bound (ELBO) based model likelihood estimator, computed only from the final generated sample, is the dominant factor enabling effective, efficient, and stable RL optimization, outweighing the impact of the specific policy-gradient loss functional. We validate our findings across multiple reward benchmarks using SD 3.5 Medium, and observe consistent trends across all tasks. Our method improves the GenEval score from 0.24 to 0.95 in 90 GPU hours, which is 4.6\times more efficient than FlowGRPO and 2\times more efficient than the SOTA method DiffusionNFT without reward hacking.
zh

[AI-19] owards Structured State-Aware and Execution-Grounded Reason ing for Software Engineering Agents

【速读】:该论文旨在解决当前软件工程(Software Engineering, SE)代理普遍存在的反应式行为局限性问题,即代理主要依赖对话历史和最近响应进行决策,缺乏显式的结构化记忆与持续状态,导致在长周期任务中难以维持连贯的推理能力、适应新证据以调整假设,或整合执行反馈来更新系统状态的认知模型。解决方案的关键在于推动SE代理从反应式向结构化、状态感知且基于执行反馈的推理范式演进,通过引入显式结构、持久且动态演化的状态机制,以及将执行结果嵌入到推理过程中的闭环反馈机制,从而提升其在复杂、长期任务中的推理一致性与可靠性。

链接: https://arxiv.org/abs/2602.04640
作者: Tse-Hsun(Peter)Chen
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Position paper accepted in BoatSE

点击查看摘要

Abstract:Software Engineering (SE) agents have shown promising abilities in supporting various SE tasks. Current SE agents remain fundamentally reactive, making decisions mainly based on conversation history and the most recent response. However, this reactive design provides no explicit structure or persistent state within the agent’s memory, making long-horizon reasoning challenging. As a result, SE agents struggle to maintain a coherent understanding across reasoning steps, adapt their hypotheses as new evidence emerges, or incorporate execution feedback into the mental reasoning model of the system state. In this position paper, we argue that, to further advance SE agents, we need to move beyond reactive behavior toward a structured, state-aware, and execution-grounded reasoning. We outline how explicit structure, persistent and evolving state, and the integration of execution-grounded feedback can help SE agents perform more coherent and reliable reasoning in long-horizon tasks. We also provide an initial roadmap for developing next-generation SE agents that can more effectively perform real-world tasks. Comments: Position paper accepted in BoatSE Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.04640 [cs.SE] (or arXiv:2602.04640v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2602.04640 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-20] A Human-Centered Privacy Approach (HCP) to AI

【速读】:该论文旨在解决人类中心型人工智能(Human-Centered AI, HCAI)发展中个体隐私保护的伦理挑战,特别是在AI开发全生命周期中(从数据收集到部署与再利用)所面临的风险。其解决方案的关键在于提出一个“以人为本的隐私”(Human-Centered Privacy, HCP)框架,整合技术手段(如联邦学习和差分隐私)、伦理规范、用户心理模型及治理机制,强调通过跨学科协作(包括技术、设计、政策与伦理领域专家)将隐私嵌入HCAI的核心,从而保障人类自主性、信任与尊严。

链接: https://arxiv.org/abs/2602.04616
作者: Luyi Sun,Wei Xu,Zaifeng Gao
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As the paradigm of Human-Centered AI (HCAI) gains prominence, its benefits to society are accompanied by significant ethical concerns, one of which is the protection of individual privacy. This chapter provides a comprehensive overview of privacy within HCAI, proposing a human-centered privacy (HCP) framework, providing integrated solution from technology, ethics, and human factors perspectives. The chapter begins by mapping privacy risks across each stage of AI development lifecycle, from data collection to deployment and reuse, highlighting the impact of privacy risks on the entire system. The chapter then introduces privacy-preserving techniques such as federated learning and dif erential privacy. Subsequent chapters integrate the crucial user perspective by examining mental models, alongside the evolving regulatory and ethical landscapes as well as privacy governance. Next, advice on design guidelines is provided based on the human-centered privacy framework. After that, we introduce practical case studies across diverse fields. Finally, the chapter discusses persistent open challenges and future research directions, concluding that a multidisciplinary approach, merging technical, design, policy, and ethical expertise, is essential to successfully embed privacy into the core of HCAI, thereby ensuring these technologies advance in a manner that respects and ensures human autonomy, trust and dignity.
zh

[AI-21] Vibe AIGC: A New Paradigm for Content Generation via Agent ic Orchestration

【速读】:该论文旨在解决当前生成式AI(Generative AI)在内容生成中面临的“可用性天花板”问题,即意图-执行鸿沟(Intent-Execution Gap),该鸿沟表现为用户高阶意图与单次推理模型的随机性、黑箱特性之间的根本性不匹配。解决方案的关键在于提出一种新的范式——Vibe AIGC,其核心是通过代理编排(agentic orchestration)实现层级化多智能体工作流的自主合成,将用户角色从传统提示工程者转变为提供“Vibe”(高层次美学偏好与功能逻辑表示)的指挥官,由中央元规划器(Meta-Planner)将其分解为可执行、可验证且自适应的智能体流水线,从而实现从随机推断到逻辑编排的范式跃迁,有效弥合人类想象力与机器执行之间的鸿沟。

链接: https://arxiv.org/abs/2602.04575
作者: Jiaheng Liu,Yuanxing Zhang,Shihao Li,Xinping Lei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:For the past decade, the trajectory of generative artificial intelligence (AI) has been dominated by a model-centric paradigm driven by scaling laws. Despite significant leaps in visual fidelity, this approach has encountered a usability ceiling'' manifested as the Intent-Execution Gap (i.e., the fundamental disparity between a creator's high-level intent and the stochastic, black-box nature of current single-shot models). In this paper, inspired by the Vibe Coding, we introduce the \textbfVibe AIGC, a new paradigm for content generation via agentic orchestration, which represents the autonomous synthesis of hierarchical multi-agent workflows. Under this paradigm, the user's role transcends traditional prompt engineering, evolving into a Commander who provides a Vibe, a high-level representation encompassing aesthetic preferences, functional logic, and etc. A centralized Meta-Planner then functions as a system architect, deconstructing this Vibe’’ into executable, verifiable, and adaptive agentic pipelines. By transitioning from stochastic inference to logical orchestration, Vibe AIGC bridges the gap between human imagination and machine execution. We contend that this shift will redefine the human-AI collaborative economy, transforming AI from a fragile inference engine into a robust system-level engineering partner that democratizes the creation of complex, long-horizon digital assets. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2602.04575 [cs.AI] (or arXiv:2602.04575v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2602.04575 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-22] From Competition to Collaboration: Designing Sustainable Mechanisms Between LLM s and Online Forums

【速读】:该论文试图解决生成式 AI (Generative AI) 系统与问答 (QA) 论坛之间的悖论问题:一方面,GenAI 系统吸引用户离开 QA 论坛;另一方面,其性能提升又依赖于论坛产生的数据。解决方案的关键在于提出一种顺序交互框架,其中 GenAI 系统向论坛提议问题,论坛可选择发布部分问题,从而实现非货币性交换、缓解信息不对称并协调激励错位。通过基于真实 Stack Exchange 数据和主流大语言模型(LLM)的模拟实验,作者实证验证了激励错位的存在,同时表明参与者仍可获得理想完全信息情境下约一半的效用,凸显了 AI 与人类知识平台可持续协作的潜力。

链接: https://arxiv.org/abs/2602.04572
作者: Niv Fono,Yftah Ziser,Omer Ben-Porat
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注:

点击查看摘要

Abstract:While Generative AI (GenAI) systems draw users away from (QA) forums, they also depend on the very data those forums produce to improve their performance. Addressing this paradox, we propose a framework of sequential interaction, in which a GenAI system proposes questions to a forum that can publish some of them. Our framework captures several intricacies of such a collaboration, including non-monetary exchanges, asymmetric information, and incentive misalignment. We bring the framework to life through comprehensive, data-driven simulations using real Stack Exchange data and commonly used LLMs. We demonstrate the incentive misalignment empirically, yet show that players can achieve roughly half of the utility in an ideal full-information scenario. Our results highlight the potential for sustainable collaboration that preserves effective knowledge sharing between AI systems and human knowledge platforms.
zh

[AI-23] Continual Learning through Control Minimization

【速读】:该论文旨在解决神经网络在顺序训练任务时面临的灾难性遗忘(catastrophic forgetting)问题。其解决方案的关键在于将连续学习重新建模为一个控制问题,其中学习信号与保留信号在神经活动动力学中竞争;通过将正则化惩罚转化为保留信号以保护先前任务的表征,并最小化整合新任务所需的控制努力,同时与旧任务的保留相竞争。在平衡状态下,神经活动产生的权重更新隐式编码了完整的先验任务曲率,这一特性被称为连续自然梯度(continual-natural gradient),无需显式存储曲率信息。实验表明,该框架能够恢复真实的先验任务曲率并实现任务区分,在标准基准测试中优于现有方法且无需回放机制。

链接: https://arxiv.org/abs/2602.04542
作者: Sander de Haan,Yassine Taoudi-Benchekroun,Pau Vilimelis Aceituno,Benjamin F. Grewe
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Catastrophic forgetting remains a fundamental challenge for neural networks when tasks are trained sequentially. In this work, we reformulate continual learning as a control problem where learning and preservation signals compete within neural activity dynamics. We convert regularization penalties into preservation signals that protect prior-task representations. Learning then proceeds by minimizing the control effort required to integrate new tasks while competing with the preservation of prior tasks. At equilibrium, the neural activities produce weight updates that implicitly encode the full prior-task curvature, a property we term the continual-natural gradient, requiring no explicit curvature storage. Experiments confirm that our learning framework recovers true prior-task curvature and enables task discrimination, outperforming existing methods on standard benchmarks without replay.
zh

[AI-24] Learning the Value Systems of Agents with Preference-based and Inverse Reinforcement Learning

【速读】:该论文旨在解决自主软件代理在协商过程中如何确保协议符合人类伦理原则与道德价值的问题,尤其是在不同用户具有差异化价值体系且难以精确计算特定语境下价值含义的情况下。解决方案的关键在于提出一种从观察和人类示范中自动学习价值体系的新方法,其核心包括:构建价值系统学习的形式化模型、将其应用于基于多目标马尔可夫决策过程(Multi-Objective Markov Decision Processes, MOMDPs)的序列决策领域,并设计定制化的基于偏好的逆强化学习算法,以推断出价值接地函数(value grounding functions)和整体价值系统。

链接: https://arxiv.org/abs/2602.04518
作者: Andrés Holgado-Sánchez,Holger Billhardt,Alberto Fernández,Sascha Ossowski
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 42 pages, 5 figures. Published in Journal of Autonomous Agents and Multi-Agent Systems

点击查看摘要

Abstract:Agreement Technologies refer to open computer systems in which autonomous software agents interact with one another, typically on behalf of humans, in order to come to mutually acceptable agreements. With the advance of AI systems in recent years, it has become apparent that such agreements, in order to be acceptable to the involved parties, must remain aligned with ethical principles and moral values. However, this is notoriously difficult to ensure, especially as different human users (and their software agents) may hold different value systems, i.e. they may differently weigh the importance of individual moral values. Furthermore, it is often hard to specify the precise meaning of a value in a particular context in a computational manner. Methods to estimate value systems based on human-engineered specifications, e.g. based on value surveys, are limited in scale due to the need for intense human moderation. In this article, we propose a novel method to automatically \emphlearn value systems from observations and human demonstrations. In particular, we propose a formal model of the \emphvalue system learning problem, its instantiation to sequential decision-making domains based on multi-objective Markov decision processes, as well as tailored preference-based and inverse reinforcement learning algorithms to infer value grounding functions and value systems. The approach is illustrated and evaluated by two simulated use cases.
zh

[AI-25] ReThinker: Scientific Reason ing by Rethinking with Guided Reflection and Confidence Control

【速读】:该论文旨在解决大语言模型在专家级科学推理任务中表现受限的问题,尤其是在Humanity’s Last Exam (HLE)等基准测试中,由于刚性工具流水线、脆弱的多智能体协作以及低效的测试时扩展机制导致性能瓶颈。其解决方案的关键在于提出一种基于置信度感知的代理框架ReThinker,该框架采用分阶段的Solver-Critic-Selector架构,通过动态计算分配策略实现自适应工具调用、引导式的多维反思与置信度加权选择,从而提升推理过程的灵活性与鲁棒性。此外,为支持无需人工标注的可扩展训练,作者还设计了反向数据合成管道和自适应轨迹回收策略,将成功推理轨迹转化为高质量监督信号,显著提升了模型在HLE、GAIA和XBench等多个基准上的表现。

链接: https://arxiv.org/abs/2602.04496
作者: Zhentao Tang,Yuqi Cui,Shixiong Kai,Wenqian Zhao,Ke Ye,Xing Li,Anxin Tian,Zehua Pei,Hui-Ling Zhen,Shoubo Hu,Xiaoguang Li,Yunhe Wang,Mingxuan Yuan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Expert-level scientific reasoning remains challenging for large language models, particularly on benchmarks such as Humanity’s Last Exam (HLE), where rigid tool pipelines, brittle multi-agent coordination, and inefficient test-time scaling often limit performance. We introduce ReThinker, a confidence-aware agentic framework that orchestrates retrieval, tool use, and multi-agent reasoning through a stage-wise Solver-Critic-Selector architecture. Rather than following a fixed pipeline, ReThinker dynamically allocates computation based on model confidence, enabling adaptive tool invocation, guided multi-dimensional reflection, and robust confidence-weighted selection. To support scalable training without human annotation, we further propose a reverse data synthesis pipeline and an adaptive trajectory recycling strategy that transform successful reasoning traces into high-quality supervision. Experiments on HLE, GAIA, and XBench demonstrate that ReThinker consistently outperforms state-of-the-art foundation models with tools and existing deep research systems, achieving state-of-the-art results on expert-level reasoning tasks.
zh

[AI-26] LLM -Empowered Cooperative Content Caching in Vehicular Fog Caching-Assisted Platoon Networks

【速读】:该论文旨在解决车联网中车辆编队(platoon)场景下内容获取延迟高的问题,特别是在动态环境中如何高效协调本地车辆、车载雾计算(Vehicular Fog Computing, VFC)集群与云服务器(Cloud Server, CS)之间的分布式存储资源。解决方案的关键在于提出了一种三层内容缓存架构,并引入大语言模型(Large Language Models, LLMs)实现智能缓存决策:通过设计编码任务目标和缓存约束的提示框架(prompting framework),将缓存策略建模为决策任务,结合分层确定性缓存映射策略,实现对请求的自适应预测与跨三层存储的精准内容放置,从而在无需频繁重训练的前提下显著降低内容检索延迟。

链接: https://arxiv.org/abs/2602.04471
作者: Bowen Tan,Qiong Wu,Pingyi Fan,Kezhi Wang,Nan Cheng,Wen Chen
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: Corresponding author: Qiong Wu (qiongwu@jiangnan. this http URL )

点击查看摘要

Abstract:This letter proposes a novel three-tier content caching architecture for Vehicular Fog Caching (VFC)-assisted platoon, where the VFC is formed by the vehicles driving near the platoon. The system strategically coordinates storage across local platoon vehicles, dynamic VFC clusters, and cloud server (CS) to minimize content retrieval latency. To efficiently manage distributed storage, we integrate large language models (LLMs) for real-time and intelligent caching decisions. The proposed approach leverages LLMs’ ability to process heterogeneous information, including user profiles, historical data, content characteristics, and dynamic system states. Through a designed prompting framework encoding task objectives and caching constraints, the LLMs formulate caching as a decision-making task, and our hierarchical deterministic caching mapping strategy enables adaptive requests prediction and precise content placement across three tiers without frequent retraining. Simulation results demonstrate the advantages of our proposed caching scheme.
zh

[AI-27] RASA: Routing-Aware Safety Alignment for Mixture-of-Experts Models

【速读】:该论文旨在解决混合专家(Mixture-of-Experts, MoE)语言模型在安全对齐(safety alignment)过程中因稀疏路由机制引发的独特挑战,即在标准全参数微调下可能出现退化优化行为,导致安全关键专家(Safety-Critical Experts)未被有效修复,而仅通过路由或专家主导效应降低攻击成功率。解决方案的关键在于提出RASA框架——一种感知路由的专家级对齐方法,其核心是:首先识别被成功越狱攻击过度激活的安全敏感专家,然后在固定路由策略下仅对该类专家进行选择性微调,最后通过强制路由一致性确保安全对齐上下文下的行为稳定性。该方法实现了近完美的鲁棒性、跨攻击类型的强泛化能力以及显著减少过度拒绝现象,同时保持了MMLU、GSM8K和TruthfulQA等基准任务上的通用能力。

链接: https://arxiv.org/abs/2602.04448
作者: Jiacheng Liang,Yuhui Wang,Tanqiu Jiang,Ting Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 9 pages

点击查看摘要

Abstract:Mixture-of-Experts (MoE) language models introduce unique challenges for safety alignment due to their sparse routing mechanisms, which can enable degenerate optimization behaviors under standard full-parameter fine-tuning. In our preliminary experiments, we observe that naively applying full-parameter safety fine-tuning to MoE models can reduce attack success rates through routing or expert dominance effects, rather than by directly repairing Safety-Critical Experts. To address this challenge, we propose RASA, a routing-aware expert-level alignment framework that explicitly repairs Safety-Critical Experts while preventing routing-based bypasses. RASA identifies experts disproportionately activated by successful jailbreaks, selectively fine-tunes only these experts under fixed routing, and subsequently enforces routing consistency with safety-aligned contexts. Across two representative MoE architectures and a diverse set of jailbreak attacks, RASA achieves near-perfect robustness, strong cross-attack generalization, and substantially reduced over-refusal, while preserving general capabilities on benchmarks such as MMLU, GSM8K, and TruthfulQA. Our results suggest that robust MoE safety alignment benefits from targeted expert repair rather than global parameter updates, offering a practical and architecture-preserving alternative to prior approaches.
zh

[AI-28] Mixture of Masters: Sparse Chess Language Models with Player Routing

【速读】:该论文旨在解决现代棋类语言模型(chess language models)在训练过程中因使用大规模密集Transformer架构而导致的模式平均化(mode-averaged behavior)问题,即风格边界模糊、稀有但高效策略被压制的现象。解决方案的关键在于提出Mixture-of-Masters(MoM),这是一种基于小尺寸GPT专家的棋类混合专家模型,每个专家通过自监督学习与受棋类特定奖励引导的强化学习联合训练,以模拟特级大师(grandmaster)的不同对局风格;同时引入一个后验可学习的门控网络(gating network),根据当前局面动态选择最合适的专家进行决策,从而实现风格切换(如塔尔的进攻性或彼得罗相的防守性),在保持生成多样性、可控性和可解释性的前提下显著优于单一密集模型和主流GPT基线。

链接: https://arxiv.org/abs/2602.04447
作者: Giacomo Frisoni,Lorenzo Molfetta,Davide Freddi,Gianluca Moro
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern chess language models are dense transformers trained on millions of games played by thousands of high-rated individuals. However, these monolithic networks tend to collapse into mode-averaged behavior, where stylistic boundaries are blurred, and rare but effective strategies are suppressed. To counteract homogenization, we introduce Mixture-of-Masters (MoM), the first chess mixture-of-experts model with small-sized GPT experts emulating world-class grandmasters. Each expert is trained with a combination of self-supervised learning and reinforcement learning guided by chess-specific rewards. For each move, a post-hoc learnable gating network selects the most appropriate persona to channel depending on the game state, allowing MoM to switch its style dynamically – e.g., Tal’s offensive vocation or Petrosian’s defensive solidity. When evaluated against Stockfish on unseen standard games, MoM outperforms both dense individual expert networks and popular GPT baselines trained on aggregated data, while ensuring generation variety, control, and interpretability.
zh

[AI-29] EMA Policy Gradient: Taming Reinforcement Learning for LLM s with EMA Anchor and Top-k KL

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在大型语言模型(Large Language Models, LLMs)训练中政策梯度算法的稳定性与效率问题,尤其是在数学推理和代理型RL任务中的性能瓶颈。其核心解决方案包括两个关键改进:一是用指数移动平均(Exponential Moving Average, EMA)替代固定锚定策略(anchor policy),以提升策略更新的稳定性;二是提出Top-k KL估计器,在保持KL散度无偏估计的同时实现精确KL与采样KL之间的灵活插值,从而兼顾计算效率与梯度准确性。这两个技术结合GRPO(Generalized Reward Policy Optimization)形成EMA-PG方法,在多个基准测试中显著提升了LLMs的性能表现。

链接: https://arxiv.org/abs/2602.04417
作者: Lunjun Zhang,Jimmy Ba
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) has enabled Large Language Models (LLMs) to acquire increasingly complex reasoning and agentic behaviors. In this work, we propose two simple techniques to improve policy gradient algorithms for LLMs. First, we replace the fixed anchor policy during RL with an Exponential Moving Average (EMA), similar to a target network in deep Q-learning. Second, we introduce Top-k KL estimator, which allows for flexible interpolation between exact KL and sampled KL. We derive the stability conditions for using EMA anchor; moreover, we show that our Top-k KL estimator yields both unbiased KL values and unbiased gradients at any k, while bringing the benefits of exact KL. When combined with GRPO, the two techniques (EMA-PG) lead to a significant performance boost. On math reasoning, it allows R1-distilled Qwen-1.5B to reach 53.9% on OlympiadBench compared to 50.8% by GRPO. On agentic RL domains, with Qwen-3B base, EMA-PG improves GRPO by an average of 33.3% across 7 datasets of QA with search engines, including 29.7% \rightarrow 44.1% on HotpotQA, 27.4% \rightarrow 40.1% on 2WikiMultiHopQA. Overall, we show that EMA-PG is a simple, principled, and powerful approach to scaling RL for LLMs. Code: this https URL
zh

[AI-30] LoRDO: Distributed Low-Rank Optimization with Infrequent Communication

【速读】:该论文旨在解决分布式训练基础模型时,由于通信带宽限制导致的效率瓶颈问题,特别是传统数据并行(DDP)策略中优化器状态带来的内存与通信开销。其核心挑战在于:在低频同步机制下,本地更新无法获取全批量梯度,从而难以有效计算低秩投影,进而影响模型性能。解决方案的关键在于提出LoRDO框架,通过引入一种全秩准双曲更新机制,在保持低秩优化优势的同时恢复对高维子空间的探索能力,从而在125M至720M参数规模的语言建模任务中实现接近低秩DDP的性能,且通信量减少约10倍;此外,在极低内存环境下(小秩/小批次),LoRDO进一步提升性能表现。

链接: https://arxiv.org/abs/2602.04396
作者: Andrej Jovanović,Alex Iacob,Mher Safaryan,Ionut-Vlad Modoranu,Lorenzo Sani,William F. Shen,Xinchi Qiu,Dan Alistarh,Nicholas D. Lane
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint; under review

点击查看摘要

Abstract:Distributed training of foundation models via \textttDDP is limited by interconnect bandwidth. While infrequent communication strategies reduce synchronization frequency, they remain bottlenecked by the memory and communication requirements of optimizer states. Low-rank optimizers can alleviate these constraints; however, in the local-update regime, workers lack access to the full-batch gradients required to compute low-rank projections, which degrades performance. We propose \textttLoRDO , a principled framework unifying low-rank optimization with infrequent synchronization. We first demonstrate that, while global projections based on pseudo-gradients are theoretically superior, they permanently restrict the optimization trajectory to a low-rank subspace. To restore subspace exploration, we introduce a full-rank quasi-hyperbolic update. \textttLoRDO achieves near-parity with low-rank \textttDDP in language modeling and downstream tasks at model scales of 125 M-- 720 M, while reducing communication by \approx 10 \times . Finally, we show that \textttLoRDO improves performance even more in very low-memory settings with small rank/batch size.
zh

[AI-31] Digital Twins ZeroConf AI: Structuring Automated Intelligent Pipelines for Industrial Applications

【速读】:该论文旨在解决工业领域中复杂网络物理系统(Cyber-Physical Systems, CPS)与人工智能(Artificial Intelligence, AI)及机器学习(Machine Learning, ML)技术融合时面临的挑战,特别是由于物联网(IoT)和工业物联网(IIoT)技术在通信协议、数据格式和设备能力上的碎片化,导致物理层与智能功能之间存在显著鸿沟。现有方法往往模块封闭且高度耦合,限制了AI功能的可扩展性和复用性。其解决方案的关键在于提出一种模块化、互操作性强的“零配置”(Zero Configuration, ZeroConf)AI流水线架构,通过数字孪生(Digital Twin, DT)技术实现数据管理与智能增强的协同调度,从而最小化配置开销并解耦DT与AI组件的角色,使AI流水线能够无缝集成到CPS中,并支持并发ML模型和动态数据处理,在复杂工业场景中加速智能服务部署。

链接: https://arxiv.org/abs/2602.04385
作者: Marco Picone,Fabio Turazza,Matteo Martinelli,Marco Mamei
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Author-accepted manuscript of a paper published in the 2025 IEEE International Conference on Systems, Man and Cybernetics (IEEE SMC), October 2025, doi: https://doi.org/10.1109/SMC58881.2025.11343418

点击查看摘要

Abstract:The increasing complexity of Cyber-Physical Systems (CPS), particularly in the industrial domain, has amplified the challenges associated with the effective integration of Artificial Intelligence (AI) and Machine Learning (ML) techniques. Fragmentation across IoT and IIoT technologies, manifested through diverse communication protocols, data formats and device capabilities, creates a substantial gap between low-level physical layers and high-level intelligent functionalities. Recently, Digital Twin (DT) technology has emerged as a promising solution, offering structured, interoperable and semantically rich digital representations of physical assets. Current approaches are often siloed and tightly coupled, limiting scalability and reuse of AI functionalities. This work proposes a modular and interoperable solution that enables seamless AI pipeline integration into CPS by minimizing configuration and decoupling the roles of DTs and AI components. We introduce the concept of Zero Configuration (ZeroConf) AI pipelines, where DTs orchestrate data management and intelligent augmentation. The approach is demonstrated in a MicroFactory scenario, showing support for concurrent ML models and dynamic data processing, effectively accelerating the deployment of intelligent services in complex industrial settings.
zh

[AI-32] Blockchain Federated Learning for Sustainable Retail: Reducing Waste through Collaborative Demand Forecasting

【速读】:该论文旨在解决零售业中因数据隐私顾虑导致的多方协作障碍问题,从而提升易腐食品需求预测的准确性并减少食物浪费。其核心解决方案是引入基于区块链的联邦学习(Federated Learning, FL)框架,在不直接共享原始数据的前提下,实现多个零售商协同训练预测模型,显著优于各自独立建模的效果,同时接近于理想数据共享场景下的性能表现。

链接: https://arxiv.org/abs/2602.04384
作者: Fabio Turazza,Alessandro Neri,Marcello Pietri,Maria Angela Butturi,Marco Picone,Marco Mamei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Author-accepted manuscript of a paper published in the IEEE International Symposium on Computers and Communications (ISCC), 2025, pp. 1-6. doi: this https URL

点击查看摘要

Abstract:Effective demand forecasting is crucial for reducing food waste. However, data privacy concerns often hinder collaboration among retailers, limiting the potential for improved predictive accuracy. In this study, we explore the application of Federated Learning (FL) in Sustainable Supply Chain Management (SSCM), with a focus on the grocery retail sector dealing with perishable goods. We develop a baseline predictive model for demand forecasting and waste assessment in an isolated retailer scenario. Subsequently, we introduce a Blockchain-based FL model, trained collaboratively across multiple retailers without direct data sharing. Our preliminary results show that FL models have performance almost equivalent to the ideal setting in which parties share data with each other, and are notably superior to models built by individual parties without sharing data, cutting waste and boosting efficiency.
zh

[AI-33] Beyond KL Divergence: Policy Optimization with Flexible Bregman Divergences for LLM Reason ing

【速读】:该论文旨在解决当前基于群体的策略优化方法(如GRPO)在大语言模型(LLM)推理任务中仅使用KL散度进行策略正则化、而未探索其他散度函数选择的问题。其关键解决方案是提出Group-Based Mirror Policy Optimization (GBMPO) 框架,将群体策略优化扩展至可灵活使用Bregman散度(包括手设计的L2散度和学习得到的神经镜像映射),从而揭示散度函数的选择是一个此前被忽视但至关重要的设计维度。实验表明,采用概率空间中的L2散度(ProbL2-GRPO)可在GSM8K数学推理任务上提升5.5点准确率,而随机初始化的神经镜像映射在MBPP代码生成任务中已能获得接近最优性能,验证了该框架的有效性与实用性。

链接: https://arxiv.org/abs/2602.04380
作者: Rui Yuan,Mykola Khandoga,Vinay Kumar Sankarapu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Policy optimization methods like Group Relative Policy Optimization (GRPO) and its variants have achieved strong results on mathematical reasoning and code generation tasks. Despite extensive exploration of reward processing strategies and training dynamics, all existing group-based methods exclusively use KL divergence for policy regularization, leaving the choice of divergence function unexplored. We introduce Group-Based Mirror Policy Optimization (GBMPO), a framework that extends group-based policy optimization to flexible Bregman divergences, including hand-designed alternatives (L2 in probability space) and learned neural mirror maps. On GSM8K mathematical reasoning, hand-designed ProbL2-GRPO achieves 86.7% accuracy, improving +5.5 points over the Dr. GRPO baseline. On MBPP code generation, neural mirror maps reach 60.1-60.8% pass@1, with random initialization already capturing most of the benefit. While evolutionary strategies meta-learning provides marginal accuracy improvements, its primary value lies in variance reduction ( \pm 0.2 versus \pm 0.6) and efficiency gains (15% shorter responses on MBPP), suggesting that random initialization of neural mirror maps is sufficient for most practical applications. These results establish divergence choice as a critical, previously unexplored design dimension in group-based policy optimization for LLM reasoning.
zh

[AI-34] Counterfactual Explanations for Hypergraph Neural Networks

【速读】:该论文旨在解决超图神经网络(Hypergraph Neural Networks, HGNNs)在高风险场景中部署受限的问题,即其决策过程缺乏可解释性。为提升HGNN的可解释性,作者提出CF-HyperGNNExplainer,一种基于反事实推理的解释方法,其核心在于通过最少的结构修改(如移除节点-超边关联或删除超边)生成有效的反事实超图,从而揭示影响模型预测的关键高阶关系。该方法不仅确保解释的结构性合理性,还具备行动可行性,显著增强了HGNN在实际应用中的透明度与可信度。

链接: https://arxiv.org/abs/2602.04360
作者: Fabiano Veglianti,Lorenzo Antonelli,Gabriele Tolomei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Hypergraph neural networks (HGNNs) effectively model higher-order interactions in many real-world systems but remain difficult to interpret, limiting their deployment in high-stakes settings. We introduce CF-HyperGNNExplainer, a counterfactual explanation method for HGNNs that identifies the minimal structural changes required to alter a model’s prediction. The method generates counterfactual hypergraphs using actionable edits limited to removing node-hyperedge incidences or deleting hyperedges, producing concise and structurally meaningful explanations. Experiments on three benchmark datasets show that CF-HyperGNNExplainer generates valid and concise counterfactuals, highlighting the higher-order relations most critical to HGNN decisions. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY) Cite as: arXiv:2602.04360 [cs.LG] (or arXiv:2602.04360v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.04360 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-35] UnMaskFork: Test-Time Scaling for Masked Diffusion via Deterministic Action Branching

【速读】:该论文旨在解决如何有效利用推理时计算资源以提升掩码扩散语言模型(Masked Diffusion Language Models, MDLMs)的推理能力问题。现有测试时扩展(test-time scaling)方法主要依赖随机采样策略,难以充分挖掘MDLM固有的迭代式、非自回归生成特性。其解决方案的关键在于提出UnMaskFork(UMF)框架,将未掩码轨迹建模为搜索树,并采用蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)优化生成路径;通过多个MDLM执行确定性的部分解掩码动作来探索搜索空间,从而实现更高效且可扩展的推理优化。实证结果表明,UMF在复杂编程基准上持续优于现有测试时扩展基线,并在数学推理任务中展现出强可扩展性。

链接: https://arxiv.org/abs/2602.04344
作者: Kou Misaki,Takuya Akiba
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Test-time scaling strategies have effectively leveraged inference-time compute to enhance the reasoning abilities of Autoregressive Large Language Models. In this work, we demonstrate that Masked Diffusion Language Models (MDLMs) are inherently amenable to advanced search strategies, owing to their iterative and non-autoregressive generation process. To leverage this, we propose UnMaskFork (UMF), a framework that formulates the unmasking trajectory as a search tree and employs Monte Carlo Tree Search to optimize the generation path. In contrast to standard scaling methods relying on stochastic sampling, UMF explores the search space through deterministic partial unmasking actions performed by multiple MDLMs. Our empirical evaluation demonstrates that UMF consistently outperforms existing test-time scaling baselines on complex coding benchmarks, while also exhibiting strong scalability on mathematical reasoning tasks.
zh

[AI-36] Efficient Equivariant High-Order Crystal Tensor Prediction via Cartesian Local-Environment Many-Body Coupling

【速读】:该论文旨在解决从原子结构端到端预测高阶晶体张量性质(如二阶介电、三阶压电和四阶弹性张量)的挑战,这类任务因传统球谐等变模型中Clebsch-Gordan张量积带来的显著计算与内存开销而难以高效实现。解决方案的关键在于提出Cartesian Environment Interaction Tensor Network (CEITNet),其通过为每个原子构建多通道笛卡尔局部环境张量,并利用可学习的通道空间相互作用实现灵活的多体混合,从而在通道空间中进行学习并使用笛卡尔张量基底组装等变输出,有效提升了高阶张量构建的效率与精度。

链接: https://arxiv.org/abs/2602.04323
作者: Dian Jin,Yancheng Yuan,Xiaoming Tao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:End-to-end prediction of high-order crystal tensor properties from atomic structures remains challenging: while spherical-harmonic equivariant models are expressive, their Clebsch-Gordan tensor products incur substantial compute and memory costs for higher-order targets. We propose the Cartesian Environment Interaction Tensor Network (CEITNet), an approach that constructs a multi-channel Cartesian local environment tensor for each atom and performs flexible many-body mixing via a learnable channel-space interaction. By performing learning in channel space and using Cartesian tensor bases to assemble equivariant outputs, CEITNet enables efficient construction of high-order tensor. Across benchmark datasets for order-2 dielectric, order-3 piezoelectric, and order-4 elastic tensor prediction, CEITNet surpasses prior high-order prediction methods on key accuracy criteria while offering high computational efficiency.
zh

[AI-37] ProxyWar: Dynamic Assessment of LLM Code Generation in Game Arenas ICSE2026

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在代码生成能力评估中面临的局限性问题,即现有方法主要依赖静态基准测试和简化指标,难以真实反映模型在动态、复杂环境中的实际表现。其解决方案的关键在于提出ProxyWar框架,通过将LLM生成的程序部署到多样且具有竞争性的游戏环境中,结合自动化测试、迭代式代码修复与多智能体锦标赛机制,从功能性正确性和运行特性两个维度实现对代码质量的系统性评估。这一方法揭示了传统基准分数与实际性能之间的显著差异,从而推动了基于竞赛的更全面代码生成评价体系的发展。

链接: https://arxiv.org/abs/2602.04296
作者: Wenjun Peng,Xinyu Wang,Qi Wu
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: ICSE2026

点击查看摘要

Abstract:Large language models (LLMs) have revolutionized automated code generation, yet the evaluation of their real-world effectiveness remains limited by static benchmarks and simplistic metrics. We present ProxyWar, a novel framework that systematically assesses code generation quality by embedding LLM-generated agents within diverse, competitive game environments. Unlike existing approaches, ProxyWar evaluates not only functional correctness but also the operational characteristics of generated programs, combining automated testing, iterative code repair, and multi-agent tournaments to provide a holistic view of program behavior. Applied to a range of state-of-the-art coders and games, our approach uncovers notable discrepancies between benchmark scores and actual performance in dynamic settings, revealing overlooked limitations and opportunities for improvement. These findings highlight the need for richer, competition-based evaluation of code generation. Looking forward, ProxyWar lays a foundation for research into LLM-driven algorithm discovery, adaptive problem solving, and the study of practical efficiency and robustness, including the potential for models to outperform hand-crafted agents. The project is available at this https URL.
zh

[AI-38] Agent -Omit: Training Efficient LLM Agents Agent s for Adaptive Thought and Observation Omission via Agentic Reinforcement Learning

【速读】:该论文旨在解决多轮智能体-环境交互中,现有方法对所有交互轨迹同等对待的问题,忽略了不同轮次中思维(thought)和观测(observation)的必要性与效用存在差异。为实现更高效的代理行为,其核心解决方案是提出一种统一的训练框架Agent-Omit,使大语言模型(LLM)代理能够自适应地省略冗余的思维和观测。关键创新包括:1)利用少量冷启动数据(包含单轮与多轮省略场景)对代理进行微调以学习省略行为;2)引入一种感知省略的强化学习方法,结合双采样机制与定制化的省略奖励函数,激励代理具备自适应省略能力;3)理论证明了省略策略偏差由KL散度上界控制,保障优化稳定性。实验表明,Agent-Omit-8B在五个基准测试中性能媲美前沿LLM代理,并在效果与效率之间取得最优平衡。

链接: https://arxiv.org/abs/2602.04284
作者: Yansong Ning,Jun Fang,Naiqiang Tan,Hao Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under Review

点击查看摘要

Abstract:Managing agent thought and observation during multi-turn agent-environment interactions is an emerging strategy to improve agent efficiency. However, existing studies treat the entire interaction trajectories equally, overlooking the thought necessity and observation utility varies across turns. To this end, we first conduct quantitative investigations into how thought and observation affect agent effectiveness and efficiency. Based on our findings, we propose Agent-Omit, a unified training framework that empowers LLM agents to adaptively omit redundant thoughts and observations. Specifically, we first synthesize a small amount of cold-start data, including both single-turn and multi-turn omission scenarios, to fine-tune the agent for omission behaviors. Furthermore, we introduce an omit-aware agentic reinforcement learning approach, incorporating a dual sampling mechanism and a tailored omission reward to incentivize the agent’s adaptive omission capability. Theoretically, we prove that the deviation of our omission policy is upper-bounded by KL-divergence. Experimental results on five agent benchmarks show that our constructed Agent-Omit-8B could obtain performance comparable to seven frontier LLM agent, and achieve the best effectiveness-efficiency trade-off than seven efficient LLM agents methods. Our code and data are available at this https URL.
zh

[AI-39] Multi Objective Design Optimization of Non Pneumatic Passenger Car Tires Using Finite Element Modeling Machine Learning and Particle swarm Optimization and Bayesian Optimization Algorithms

【速读】:该论文旨在解决非充气轮胎(Non Pneumatic Tire, NPT)中辐条结构在刚度调节、耐久性和高速振动方面的挑战。其关键解决方案是提出了一种融合生成式设计与机器学习驱动的优化框架,通过高阶多项式参数化上/下辐条轮廓,结合PCHIP几何变异生成约250种设计方案;利用核岭回归(KRR)模型预测刚度、XGBoost模型预测耐久性与振动性能,显著降低对计算成本高昂的有限元分析(FEM)的依赖;再采用粒子群优化(PSO)和贝叶斯优化(Bayesian Optimization)进行多目标性能调优,最终实现53%刚度可调性、最高50%耐久性提升及43%振动衰减,有效支撑下一代无内胎辐条结构(UPTIS)的系统化开发。

链接: https://arxiv.org/abs/2602.04277
作者: Priyankkumar Dhrangdhariya,Soumyadipta Maiti,Venkataramana Runkana
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Non Pneumatic tires offer a promising alternative to pneumatic tires. However, their discontinuous spoke structures present challenges in stiffness tuning, durability, and high speed vibration. This study introduces an integrated generative design and machine learning driven framework to optimize UPTIS type spoke geometries for passenger vehicles. Upper and lower spoke profiles were parameterized using high order polynomial representations, enabling the creation of approximately 250 generative designs through PCHIP based geometric variation. Machine learning models like KRR for stiffness and XGBoost for durability and vibration achieved strong predictive accuracy, reducing the reliance on computationally intensive FEM simulations. Optimization using Particle Swarm Optimization and Bayesian Optimization further enabled extensive performance refinement. The resulting designs demonstrate 53% stiffness tunability, up to 50% durability improvement, and 43% reduction in vibration compared to the baseline. PSO provided fast, targeted convergence, while Bayesian Optimization effectively explored multi objective tradeoffs. Overall, the proposed framework enables systematic development of high performance, next generation UPTIS spoke structures.
zh

[AI-40] hickening-to-Thinning: Reward Shaping via Human-Inspired Learning Dynamics for LLM Reason ing

【速读】:该论文旨在解决强化学习中基于可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)训练大语言模型(Large Language Models, LLMs)时面临的三大挑战:熵崩溃(entropy collapse)、冗余输出(excessive verbosity)以及在难题上探索不足(insufficient exploration)。其核心问题在于现有奖励机制无法区分问题求解过程中所需的广泛搜索与已掌握知识下的高效表达之间的差异。解决方案的关键是提出T2T(Thickening-to-Thinning)动态奖励框架,该框架模拟人类学习过程,采用双阶段机制:在错误尝试时通过“增厚”(thickening,即鼓励更长轨迹)扩大搜索空间以探索新解法路径;在正确时切换至“变薄”(thinning,施加长度惩罚)以抑制冗余,从而提升模型信心并固化推理能力。实验表明,该方法在数学推理基准(MATH-500、AIME、AMC)上显著优于标准GRPO及近期基线。

链接: https://arxiv.org/abs/2602.04265
作者: Wenze Lin,Zhen Yang,Xitai Jiang,Pony Ma,Gao Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for enhancing reasoning in Large Language Models (LLMs). However, it frequently encounters challenges such as entropy collapse, excessive verbosity, and insufficient exploration for hard problems. Crucially, existing reward schemes fail to distinguish between the need for extensive search during problem-solving and the efficiency required for mastered knowledge. In this work, we introduce T2T(Thickening-to-Thinning), a dynamic reward framework inspired by human learning processes. Specifically, it implements a dual-phase mechanism: (1) On incorrect attempts, T2T incentivizes “thickening” (longer trajectories) to broaden the search space and explore novel solution paths; (2) Upon achieving correctness, it shifts to “thinning”, imposing length penalties to discourage redundancy, thereby fostering model confidence and crystallizing reasoning capabilities. Extensive experiments on mathematical benchmarks (MATH-500, AIME, AMC) across Qwen-series and Deepseek models demonstrate that T2T significantly outperforms standard GRPO and recent baselines, achieving superior performance.
zh

[AI-41] From Dead Neurons to Deep Approximators: Deep Bernstein Networks as a Provable Alternative to Residual Layers

【速读】:该论文旨在解决深度神经网络中梯度消失问题以及由分段线性激活函数(如ReLU)导致的训练效率低下和表达能力受限的问题。传统残差连接(Residual Connections)虽能缓解梯度消失,但引入了结构约束且无法从根本上改善激活函数本身的局限性。解决方案的关键在于提出Deep Bernstein Networks(深度伯恩斯坦网络),其核心创新是采用伯恩斯坦多项式(Bernstein Polynomials)作为激活函数,从而实现无需残差连接的高效训练与强大表征能力。理论分析表明,该架构的局部导数严格有下界,有效避免了“死神经元”现象(实验中从标准网络的90%降至<5%),同时逼近误差随深度呈指数级衰减,显著优于ReLU类网络的多项式收敛速率,因此在不依赖残差连接的前提下实现了更优的信号传播与函数逼近性能。

链接: https://arxiv.org/abs/2602.04264
作者: Ibrahim Albool,Malak Gamal El-Din,Salma Elmalaki,Yasser Shoukry
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
备注: 15 pages

点击查看摘要

Abstract:Residual connections are the de facto standard for mitigating vanishing gradients, yet they impose structural constraints and fail to address the inherent inefficiencies of piecewise linear activations. We show that Deep Bernstein Networks (which utilizes Bernstein polynomials as activation functions) can act as residual-free architecture while simultaneously optimize trainability and representation power. We provide a two-fold theoretical foundation for our approach. First, we derive a theoretical lower bound on the local derivative, proving it remains strictly bounded away from zero. This directly addresses the root cause of gradient stagnation; empirically, our architecture reduces ``dead’’ neurons from 90% in standard deep networks to less than 5%, outperforming ReLU, Leaky ReLU, SeLU, and GeLU. Second, we establish that the approximation error for Bernstein-based networks decays exponentially with depth, a significant improvement over the polynomial rates of ReLU-based architectures. By unifying these results, we demonstrate that Bernstein activations provide a superior mechanism for function approximation and signal flow. Our experiments on HIGGS and MNIST confirm that Deep Bernstein Networks achieve high-performance training without skip-connections, offering a principled path toward deep, residual-free architectures with enhanced expressive capacity.
zh

[AI-42] AppleVLM: End-to-end Autonomous Driving with Advanced Perception and Planning -Enhanced Vision-Language Models

【速读】:该论文旨在解决当前基于视觉-语言模型(Vision-Language Models, VLMs)的端到端自动驾驶系统中存在的车道感知不准确、语言理解偏差以及复杂场景下泛化能力不足等问题。其核心解决方案在于提出AppleVLM模型,关键创新包括:1)设计了一种新型视觉编码器,通过可变形Transformer机制融合多视角图像在时序上的空间信息,提升对摄像头差异的鲁棒性并支持跨平台部署;2)引入专门的规划模态编码器,显式建模鸟瞰图(Bird’s-Eye-View)空间信息,从而缓解导航指令中的语言偏见;3)采用分层思维链(Chain-of-Thought)微调的VLM解码器,整合视觉、语言与规划特征,输出鲁棒的驾驶航点,在CARLA仿真和实际AGV平台上均验证了其优越性能。

链接: https://arxiv.org/abs/2602.04256
作者: Yuxuan Han,Kunyuan Wu,Qianyi Shao,Renxiang Xiao,Zilu Wang,Cansen Jiang,Yi Xiao,Liang Hu,Yunjiang Lou
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:End-to-end autonomous driving has emerged as a promising paradigm integrating perception, decision-making, and control within a unified learning framework. Recently, Vision-Language Models (VLMs) have gained significant attention for their potential to enhance the robustness and generalization of end-to-end driving models in diverse and unseen scenarios. However, existing VLM-based approaches still face challenges, including suboptimal lane perception, language understanding biases, and difficulties in handling corner cases. To address these issues, we propose AppleVLM, an advanced perception and planning-enhanced VLM model for robust end-to-end driving. AppleVLM introduces a novel vision encoder and a planning strategy encoder to improve perception and decision-making. Firstly, the vision encoder fuses spatial-temporal information from multi-view images across multiple timesteps using a deformable transformer mechanism, enhancing robustness to camera variations and facilitating scalable deployment across different vehicle platforms. Secondly, unlike traditional VLM-based approaches, AppleVLM introduces a dedicated planning modality that encodes explicit Bird’s-Eye-View spatial information, mitigating language biases in navigation instructions. Finally, a VLM decoder fine-tuned by a hierarchical Chain-of-Thought integrates vision, language, and planning features to output robust driving waypoints. We evaluate AppleVLM in closed-loop experiments on two CARLA benchmarks, achieving state-of-the-art driving performance. Furthermore, we deploy AppleVLM on an AGV platform and successfully showcase real-world end-to-end autonomous driving in complex outdoor environments.
zh

[AI-43] OAT: Ordered Action Tokenization

【速读】:该论文旨在解决将自回归模型(autoregressive modeling)应用于连续机器人动作时所面临的动作离散化(action tokenization)难题。现有方法要么依赖解析式离散化导致token序列过长,要么使用无结构的隐式tokenizer,难以适配自回归的逐token预测机制。解决方案的关键在于提出有序动作离散化(Ordered Action Tokenization, OAT),其通过引入基于Transformer with registers、有限标量量化(finite scalar quantization)和诱导排序训练机制的架构,实现了高压缩率、完全可解码性和左到右因果顺序的token空间,从而自然契合自回归生成过程,并支持基于前缀的实时解码,实现推理成本与动作保真度之间的灵活权衡。

链接: https://arxiv.org/abs/2602.04215
作者: Chaoqi Liu,Xiaoshen Han,Jiawei Gao,Yue Zhao,Haonan Chen,Yilun Du
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Autoregressive policies offer a compelling foundation for scalable robot learning by enabling discrete abstraction, token-level reasoning, and flexible inference. However, applying autoregressive modeling to continuous robot actions requires an effective action tokenization scheme. Existing approaches either rely on analytical discretization methods that produce prohibitively long token sequences, or learned latent tokenizers that lack structure, limiting their compatibility with next-token prediction. In this work, we identify three desiderata for action tokenization - high compression, total decodability, and a left-to-right causally ordered token space - and introduce Ordered Action Tokenization (OAT), a learned action tokenizer that satisfies all three. OAT discretizes action chunks into an ordered sequence of tokens using transformer with registers, finite scalar quantization, and ordering-inducing training mechanisms. The resulting token space aligns naturally with autoregressive generation and enables prefix-based detokenization, yielding an anytime trade-off between inference cost and action fidelity. Across more than 20 tasks spanning four simulation benchmarks and real-world settings, autoregressive policies equipped with OAT consistently outperform prior tokenization schemes and diffusion-based baselines, while offering significantly greater flexibility at inference time.
zh

[AI-44] InterPReT: Interactive Policy Restructuring and Training Enable Effective Imitation Learning from Laypersons

【速读】:该论文旨在解决当前模仿学习(Imitation Learning)方法在实际应用中对用户技术背景要求过高、难以由非专业用户(layperson)有效教学的问题。现有方法通常依赖于大量来自专业人员的示范数据并需持续监控训练过程,这对普通用户而言门槛较高。解决方案的关键在于提出交互式策略重构与训练(Interactive Policy Restructuring and Training, InterPReT),该方法允许用户通过自然指令和实时演示不断调整策略结构并优化参数,从而实现人机协同的教学过程,使无机器学习背景的用户也能高效、可靠地训练出性能稳定的策略。

链接: https://arxiv.org/abs/2602.04213
作者: Feiyu Gavin Zhu,Jean Oh,Reid Simmons
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction

点击查看摘要

Abstract:Imitation learning has shown success in many tasks by learning from expert demonstrations. However, most existing work relies on large-scale demonstrations from technical professionals and close monitoring of the training process. These are challenging for a layperson when they want to teach the agent new skills. To lower the barrier of teaching AI agents, we propose Interactive Policy Restructuring and Training (InterPReT), which takes user instructions to continually update the policy structure and optimize its parameters to fit user demonstrations. This enables end-users to interactively give instructions and demonstrations, monitor the agent’s performance, and review the agent’s decision-making strategies. A user study (N=34) on teaching an AI agent to drive in a racing game confirms that our approach yields more robust policies without impairing system usability, compared to a generic imitation learning baseline, when a layperson is responsible for both giving demonstrations and determining when to stop. This shows that our method is more suitable for end-users without much technical background in machine learning to train a dependable policy
zh

[AI-45] Steering LLM s via Scalable Interactive Oversight

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在执行复杂、长周期任务(如Web开发)时出现的监督缺口(supervision gap)问题,即人类用户因缺乏领域专业知识、难以精确表达意图以及无法可靠验证复杂输出,而难以有效引导AI系统。解决方案的关键在于提出一种可扩展的交互式监督(Scalable Interactive Oversight)框架,该框架通过将复杂意图递归分解为一系列可管理的决策节点,在每个节点上收集低负担的人类反馈,并递归聚合这些信号以生成精准的全局指导。实验表明,该框架使非专家用户能够生成接近专家水平的产品需求文档(Product Requirement Documents),对齐度提升54%;更重要的是,该框架可通过仅依赖在线用户反馈进行强化学习优化,从而在AI能力持续增强的同时保持人类控制力。

链接: https://arxiv.org/abs/2602.04210
作者: Enyu Zhou,Zhiheng Xi,Long Ma,Zhihao Zhang,Shihan Dou,Zhikai Lei,Guoteng Wang,Rui Zheng,Hang Yan,Tao Gui,Qi Zhang,Xuanjing Huang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As Large Language Models increasingly automate complex, long-horizon tasks such as \emphvibe coding, a supervision gap has emerged. While models excel at execution, users often struggle to guide them effectively due to insufficient domain expertise, the difficulty of articulating precise intent, and the inability to reliably validate complex outputs. It presents a critical challenge in scalable oversight: enabling humans to responsibly steer AI systems on tasks that surpass their own ability to specify or verify. To tackle this, we propose Scalable Interactive Oversight, a framework that decomposes complex intent into a recursive tree of manageable decisions to amplify human supervision. Rather than relying on open-ended prompting, our system elicits low-burden feedback at each node and recursively aggregates these signals into precise global guidance. Validated in web development task, our framework enables non-experts to produce expert-level Product Requirement Documents, achieving a 54% improvement in alignment. Crucially, we demonstrate that this framework can be optimized via Reinforcement Learning using only online user feedback, offering a practical pathway for maintaining human control as AI scales.
zh

[AI-46] SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models

【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在测试时扩展(Test-Time Scaling, TTS)方法中存在的三大问题:一是现有TTS方法依赖额外训练、验证器及多次前向传播,难以部署;二是这些方法仅在动作解码阶段干预,而忽视了感知模糊场景下重新审视视觉表征的重要性;三是缺乏对感知与行为之间协同调节的能力。解决方案的关键在于提出SCALE策略——一种基于“自我不确定性”(self-uncertainty)的简单推理机制,其灵感来源于主动推断理论中的不确定性驱动探索,无需额外训练或验证器,仅需单次前向传播即可同时调制视觉感知和动作决策,在高不确定性时扩大感知与行为的探索空间,低不确定性时聚焦利用,从而实现跨不同条件下的自适应执行。

链接: https://arxiv.org/abs/2602.04208
作者: Hyeonbeom Choi,Daechul Ahn,Youhan Lee,Taewook Kang,Seongwon Cho,Jonghyun Choi
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 20 pages, 8 figures

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic control, with test-time scaling (TTS) gaining attention to enhance robustness beyond training. However, existing TTS methods for VLAs require additional training, verifiers, and multiple forward passes, making them impractical for deployment. Moreover, they intervene only at action decoding while keeping visual representations fixed-insufficient under perceptual ambiguity, where reconsidering how to perceive is as important as deciding what to do. To address these limitations, we propose SCALE, a simple inference strategy that jointly modulates visual perception and action based on ‘self-uncertainty’, inspired by uncertainty-driven exploration in Active Inference theory-requiring no additional training, no verifier, and only a single forward pass. SCALE broadens exploration in both perception and action under high uncertainty, while focusing on exploitation when confident-enabling adaptive execution across varying conditions. Experiments on simulated and real-world benchmarks demonstrate that SCALE improves state-of-the-art VLAs and outperforms existing TTS methods while maintaining single-pass efficiency.
zh

[AI-47] opology-Aware Revival for Efficient Sparse Training

【速读】:该论文旨在解决静态稀疏训练(static sparse training)中因固定掩码结构导致的鲁棒性不足问题,尤其在深度强化学习(deep reinforcement learning, DRL)场景下,早期剪枝决策易使网络陷入脆弱结构且难以调整。其解决方案的关键在于提出一种轻量级的一次性后剪枝方法——拓扑感知复苏(Topology-Aware Revival, TAR),该方法在静态剪枝后通过一个单一复苏步骤实现结构优化:根据各层拓扑需求分配少量保留预算,在每层内随机均匀地重新激活部分此前被剪枝的连接,并保持后续训练中连接关系不变,从而在不引入动态重连机制的前提下显著提升模型性能。

链接: https://arxiv.org/abs/2602.04166
作者: Meiling Jin,Fei Wang,Xiaoyun Yuan,Chen Qian,Yuan Cheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Static sparse training is a promising route to efficient learning by committing to a fixed mask pattern, yet the constrained structure reduces robustness. Early pruning decisions can lock the network into a brittle structure that is difficult to escape, especially in deep reinforcement learning (RL) where the evolving policy continually shifts the training distribution. We propose Topology-Aware Revival (TAR), a lightweight one-shot post-pruning procedure that improves static sparsity without dynamic rewiring. After static pruning, TAR performs a single revival step by allocating a small reserve budget across layers according to topology needs, randomly uniformly reactivating a few previously pruned connections within each layer, and then keeping the resulting connectivity fixed for the remainder of training. Across multiple continuous-control tasks with SAC and TD3, TAR improves final return over static sparse baselines by up to +37.9% and also outperforms dynamic sparse training baselines with a median gain of +13.5%.
zh

[AI-48] Pruning for Generalization: A Transfer-Oriented Spatiotemporal Graph Framework ICLR2026

【速读】:该论文旨在解决图结构域中多变量时间序列预测在数据稀缺和跨域分布偏移下的性能下降问题。其解决方案的关键在于提出一种面向迁移的时空框架TL-GPSTGN,通过结构感知的上下文选择机制,利用信息论和相关性准则筛选出具有结构信息的子图与特征,从而构建紧凑且语义明确的优化上下文表示,并将其融入时空卷积架构以捕捉复杂的多变量动态变化,显著提升了样本效率和分布外泛化能力。

链接: https://arxiv.org/abs/2602.04153
作者: Zihao Jing,Yuxi Long,Ganlin Feng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Under review at ICLR 2026 Workshop TSALM

点击查看摘要

Abstract:Multivariate time series forecasting in graph-structured domains is critical for real-world applications, yet existing spatiotemporal models often suffer from performance degradation under data scarcity and cross-domain shifts. We address these challenges through the lens of structure-aware context selection. We propose TL-GPSTGN, a transfer-oriented spatiotemporal framework that enhances sample efficiency and out-of-distribution generalization by selectively pruning non-optimized graph context. Specifically, our method employs information-theoretic and correlation-based criteria to extract structurally informative subgraphs and features, resulting in a compact, semantically grounded representation. This optimized context is subsequently integrated into a spatiotemporal convolutional architecture to capture complex multivariate dynamics. Evaluations on large-scale traffic benchmarks demonstrate that TL-GPSTGN consistently outperforms baselines in low-data transfer scenarios. Our findings suggest that explicit context pruning serves as a powerful inductive bias for improving the robustness of graph-based forecasting models.
zh

[AI-49] MA3DSG: Multi-Agent 3D Scene Graph Generation for Large-Scale Indoor Environments

【速读】:该论文旨在解决当前3D场景图生成(3DSGG)方法在真实世界场景中可扩展性不足的问题,其核心挑战在于现有方法通常基于单智能体假设且局限于小规模环境。解决方案的关键在于提出首个多智能体3D场景图生成(MA3DSG)框架,通过设计一种无需训练的图对齐算法,将各智能体生成的局部查询图高效合并为统一的全局场景图,从而实现无需任何可学习参数的协同操作,显著提升系统在复杂、大规模环境中的适应能力。

链接: https://arxiv.org/abs/2602.04152
作者: Yirum Kim,Jaewoo Kim,Ue-Hwan Kim
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current 3D scene graph generation (3DSGG) approaches heavily rely on a single-agent assumption and small-scale environments, exhibiting limited scalability to real-world scenarios. In this work, we introduce Multi-Agent 3D Scene Graph Generation (MA3DSG) model, the first framework designed to tackle this scalability challenge using multiple agents. We develop a training-free graph alignment algorithm that efficiently merges partial query graphs from individual agents into a unified global scene graph. Leveraging extensive analysis and empirical insights, our approach enables conventional single-agent systems to operate collaboratively without requiring any learnable parameters. To rigorously evaluate 3DSGG performance, we propose MA3DSG-Bench-a benchmark that supports diverse agent configurations, domain sizes, and environmental conditions-providing a more general and extensible evaluation framework. This work lays a solid foundation for scalable, multi-agent 3DSGG research.
zh

[AI-50] OMG-Agent : Toward Robust Missing Modality Generation with Decoupled Coarse-to-Fine Agent ic Workflows

【速读】:该论文旨在解决多模态系统中因数据不完整(data incompleteness)而导致的可靠性下降问题。现有重建方法存在两大瓶颈:参数化/生成模型易因过度依赖内部记忆而产生幻觉(hallucination),检索增强框架则受限于检索刚性(retrieval rigidity);更根本的是,这些端到端架构受制于语义-细节纠缠(Semantic-Detail Entanglement)——即逻辑推理与信号合成之间的结构冲突,导致输出 fidelity 降低。解决方案的关键在于提出 Omni-Modality Generation Agent (OMG-Agent),其核心创新是将静态映射范式转变为动态的粗粒度到细粒度的代理工作流(Agentic Workflow),通过模仿“深思后行动”的认知过程,显式解耦任务为三个协同阶段:(1) 基于多模态大语言模型(MLLM)的语义规划器(Semantic Planner)通过渐进式上下文推理生成确定性结构化语义计划;(2) 非参数化证据检索器(Evidence Retriever)将抽象语义锚定于外部知识;(3) 注入检索证据的执行器(Retrieval-Injected Executor)利用检索结果作为灵活特征提示,克服刚性并合成高保真细节。该设计有效缓解了语义-细节纠缠问题,在多个基准测试中显著优于现有方法,尤其在极端缺失场景下(如CMU-MOSI在70%缺失率下提升2.6点)仍保持鲁棒性。

链接: https://arxiv.org/abs/2602.04144
作者: Ruiting Dai,Zheyu Wang,Haoyu Yang,Yihan Liu,Chengzhi Wang,Zekun Zhang,Zishan Huang,Jiaman Cen,Lisi Mo
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Data incompleteness severely impedes the reliability of multimodal systems. Existing reconstruction methods face distinct bottlenecks: conventional parametric/generative models are prone to hallucinations due to over-reliance on internal memory, while retrieval-augmented frameworks struggle with retrieval rigidity. Critically, these end-to-end architectures are fundamentally constrained by Semantic-Detail Entanglement – a structural conflict between logical reasoning and signal synthesis that compromises fidelity. In this paper, we present \textbf\underlineOmni-\textbf\underlineModality \textbf\underlineGeneration Agent (\textbfOMG-Agent), a novel framework that shifts the paradigm from static mapping to a dynamic coarse-to-fine Agentic Workflow. By mimicking a \textitdeliberate-then-act cognitive process, OMG-Agent explicitly decouples the task into three synergistic stages: (1) an MLLM-driven Semantic Planner that resolves input ambiguity via Progressive Contextual Reasoning, creating a deterministic structured semantic plan; (2) a non-parametric Evidence Retriever that grounds abstract semantics in external knowledge; and (3) a Retrieval-Injected Executor that utilizes retrieved evidence as flexible feature prompts to overcome rigidity and synthesize high-fidelity details. Extensive experiments on multiple benchmarks demonstrate that OMG-Agent consistently surpasses state-of-the-art methods, maintaining robustness under extreme missingness, e.g., a 2.6 -point gain on CMU-MOSI at 70 % missing rates.
zh

[AI-51] Scalable Explainability-as-a-Service (XaaS) for Edge AI Systems

【速读】:该论文旨在解决当前可解释人工智能(Explainable AI, XAI)在边缘计算和物联网(IoT)系统中部署时存在的效率低下与资源不匹配问题。现有方法通常将解释生成与模型推理紧密耦合,导致冗余计算、高延迟及难以在异构边缘设备上扩展。解决方案的关键在于提出一种名为“可解释即服务”(Explainability-as-a-Service, XaaS)的分布式架构,其核心创新是将推理与解释生成解耦,使边缘设备可根据自身资源和延迟约束请求、缓存并验证解释。该架构通过三项关键技术实现:基于语义相似度的分布式解释缓存机制以减少冗余计算;轻量级验证协议保障缓存与新生成解释的一致性;以及自适应解释引擎根据设备能力与用户需求动态选择解释方法。实验证明,XaaS在三个真实边缘AI场景下平均降低38%延迟,同时保持高质量解释,从而推动了透明且可问责AI在大规模异构物联网系统中的落地。

链接: https://arxiv.org/abs/2602.04120
作者: Samaresh Kumar Singh,Joyjit Roy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Software Engineering (cs.SE)
备注: 8 pages, 5 figures, submitted and accepted in the conference IEEE SoutheastCon 2026

点击查看摘要

Abstract:Though Explainable AI (XAI) has made significant advancements, its inclusion in edge and IoT systems is typically ad-hoc and inefficient. Most current methods are “coupled” in such a way that they generate explanations simultaneously with model inferences. As a result, these approaches incur redundant computation, high latency and poor scalability when deployed across heterogeneous sets of edge devices. In this work we propose Explainability-as-a-Service (XaaS), a distributed architecture for treating explainability as a first-class system service (as opposed to a model-specific feature). The key innovation in our proposed XaaS architecture is that it decouples inference from explanation generation allowing edge devices to request, cache and verify explanations subject to resource and latency constraints. To achieve this, we introduce three main innovations: (1) A distributed explanation cache with a semantic similarity based explanation retrieval method which significantly reduces redundant computation; (2) A lightweight verification protocol that ensures the fidelity of both cached and newly generated explanations; and (3) An adaptive explanation engine that chooses explanation methods based upon device capability and user requirement. We evaluated the performance of XaaS on three real-world edge-AI use cases: (i) manufacturing quality control; (ii) autonomous vehicle perception; and (iii) healthcare diagnostics. Experimental results show that XaaS reduces latency by 38% while maintaining high explanation quality across three real-world deployments. Overall, this work enables the deployment of transparent and accountable AI across large scale, heterogeneous IoT systems, and bridges the gap between XAI research and edge-practicality.
zh

[AI-52] oward Effective Multimodal Graph Foundation Model: A Divide-and-Conquer Based Approach

【速读】:该论文旨在解决现有Multimodal Graph Foundation Models (MGFMs) 在处理Multimodal-Attributed Graphs (MAGs) 时存在的两大核心问题:一是缺乏对模态间交互的显式建模,难以捕捉超越简单信息聚合的复杂跨模态语义;二是模态对齐效果不佳,无法有效弥合不同模态空间之间的显著语义差异。其解决方案的关键在于提出PLANET框架,采用“分而治之”策略,将模态交互与对齐解耦至不同粒度:在嵌入粒度上通过Embedding-wise Domain Gating (EDG) 实现局部语义增强,利用拓扑感知的跨模态上下文实现模态交互;在节点粒度上通过Node-wise Discretization Retrieval (NDR) 构建离散化语义表示空间(DSRS),实现全局模态对齐,从而显著提升模型在多样化图中心任务和多模态生成任务上的性能表现。

链接: https://arxiv.org/abs/2602.04116
作者: Sicheng Liu,Xunkai Li,Daohan Su,Ru Zhang,Hongchao Qin,Ronghua Li,Guoren Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: 20 pages, 6 figures

点击查看摘要

Abstract:Graph Foundation Models (GFMs) have achieved remarkable success in generalizing across diverse domains. However, they mainly focus on Text-Attributed Graphs (TAGs), leaving Multimodal-Attributed Graphs (MAGs) largely untapped. Developing Multimodal Graph Foundation Models (MGFMs) allows for leveraging the rich multimodal information in MAGs, and extends applicability to broader types of downstream tasks. While recent MGFMs integrate diverse modality information, our empirical investigation reveals two fundamental limitations of existing MGFMs: (1)they fail to explicitly model modality interaction, essential for capturing intricate cross-modal semantics beyond simple aggregation, and (2)they exhibit sub-optimal modality alignment, which is critical for bridging the significant semantic disparity between distinct modal spaces. To address these challenges, we propose PLANET (graPh topoLogy-aware modAlity iNteraction and alignmEnT), a novel framework employing a Divide-and-Conquer strategy to decouple modality interaction and alignment across distinct granularities. At the embedding granularity, (1)Embedding-wise Domain Gating (EDG) performs local semantic enrichment by adaptively infusing topology-aware cross-modal context, achieving modality interaction. At the node granularity, (2)Node-wise Discretization Retrieval (NDR) ensures global modality alignment by constructing a Discretized Semantic Representation Space (DSRS) to bridge modality gaps. Extensive experiments demonstrate that PLANET significantly outperforms state-of-the-art baselines across diverse graph-centric and multimodal generative tasks.
zh

[AI-53] nker Tales: Supporting Child-AI Collaboration through Co-Creative Storytelling with Educational Scaffolding

【速读】:该论文试图解决儿童在生成式 AI (Generative AI) 环境中如何通过迭代式共创(iterative co-creation)实现有意义互动的问题,现有研究多聚焦于 AI 作为教学引导者,缺乏对儿童与 AI 协同创作过程的深入探讨。解决方案的关键在于设计了一种名为 Tinker Tales 的具身叙事系统,该系统融合了叙事和社交情感支架(narrative and social-emotional scaffolding),通过物理叙事板、嵌入 NFC 的玩具(代表角色、地点、物品和情绪)以及移动端应用,支持儿童通过实体操作和语音交互与 AI 进行协同故事构建;实验表明,儿童将 AI 视为专注且响应灵敏的合作者,同时支架机制有效促进了连贯叙事的迭代优化,且未削弱儿童的主体性。

链接: https://arxiv.org/abs/2602.04109
作者: Nayoung Choi,Jiseung Hong,Peace Cyebukayire,Ikseon Choi,Jinho D. Choi
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Artificial intelligence (AI) is increasingly framed as a collaborative partner in creative activities, yet children’s interactions with AI have largely been studied in AI-led instructional settings rather than co-creative collaboration. This leaves open questions about how children can meaningfully engage with AI through iterative co-creation. We present Tinker Tales, a tangible storytelling system designed with narrative and social-emotional scaffolding to support child-AI collaboration. The system combines a physical storytelling board, NFC-embedded toys representing story elements (e.g., characters, places, items, and emotions), and a mobile app that mediates child-AI interaction. Children shape and refine stories by placing and moving story elements and interacting with the AI through tangible and voice-based interaction. We conducted an exploratory user study with 10 children to examine how they interacted with Tinker Tales. Our findings show that children treated the AI as an attentive, responsive collaborator, while scaffolding supported coherent narrative refinement without diminishing children’s agency.
zh

[AI-54] Interfaze: The Future of AI is built on Task-Specific Small Models

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在处理复杂多模态任务时存在的计算效率低下和上下文理解不足的问题。传统方法依赖单一、庞大的Transformer模型,难以高效应对包含复杂PDF文档、图表、多语言语音识别(ASR)以及动态网页交互等场景。为此,作者提出Interfaze系统,其核心创新在于构建一个分层架构:首先通过异构深度神经网络(DNNs)与小型语言模型组成的感知模块处理OCR、多语言ASR等底层任务;其次引入上下文构建层,从外部源(如网页、代码、PDF)中爬取、索引并解析为紧凑结构化状态;最后由动作层执行浏览器操作、代码沙箱执行和信息检索等具体行为。整个系统由一个轻量控制器统一调度,仅将提炼后的上下文传递给用户指定的大模型生成最终响应。这种设计显著减少了对昂贵LLM的依赖,使大部分查询由小模型和工具栈完成,从而在保持高准确率的同时大幅降低计算成本。

链接: https://arxiv.org/abs/2602.04101
作者: Harsha Vardhan Khurdula,Vineet Agarwal,Yoeven D Khemlani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 1 figure

点击查看摘要

Abstract:We present Interfaze, a system that treats modern LLM applications as a problem of building and acting over context, not just picking the right monolithic model. Instead of a single transformer, we combine (i) a stack of heterogeneous DNNs paired with small language models as perception modules for OCR involving complex PDFs, charts and diagrams, and multilingual ASR with (ii) a context-construction layer that crawls, indexes, and parses external sources (web pages, code, PDFs) into compact structured state, and (iii) an action layer that can browse, retrieve, execute code in a sandbox, and drive a headless browser for dynamic web pages. A thin controller sits on top of this stack and exposes a single, OpenAI-style endpoint: it decides which small models and actions to run and always forwards the distilled context to a user-selected LLM that produces the final response. On this architecture, Interfaze-Beta achieves 83.6% on MMLU-Pro, 91.4% on MMLU, 81.3% on GPQA-Diamond, 57.8% on LiveCodeBench v5, and 90.0% on AIME-2025, along with strong multimodal scores on MMMU (val) (77.3%), AI2D (91.5%), ChartQA (90.9%), and Common Voice v16 (90.8%). We show that most queries are handled primarily by the small-model and tool stack, with the large LLM operating only on distilled context, yielding competitive accuracy while shifting the bulk of computation away from the most expensive and monolithic models. Comments: 8 pages, 1 figure Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2602.04101 [cs.AI] (or arXiv:2602.04101v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2602.04101 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-55] Principles of Lipschitz continuity in neural networks

【速读】:该论文旨在解决深度学习模型在面对小输入扰动时的鲁棒性不足以及在分布外数据上的泛化能力有限的问题,其核心挑战在于缺乏对支撑鲁棒性和泛化性的底层基本原理的深入理解。解决方案的关键在于从理论层面系统探究Lipschitz连续性在神经网络中的作用机制,具体通过两个互补视角展开:一是内部视角,研究训练过程中Lipschitz连续性的动态演化规律(即训练动力学);二是外部视角,分析Lipschitz连续性如何调控神经网络对输入特征中频率信号传播的调节能力(即频率信号传播的调制)。这一双重视角为构建更鲁棒、更具泛化能力的神经网络提供了坚实的理论基础。

链接: https://arxiv.org/abs/2602.04078
作者: Róisín Luo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Ph.D. Thesis

点击查看摘要

Abstract:Deep learning has achieved remarkable success across a wide range of domains, significantly expanding the frontiers of what is achievable in artificial intelligence. Yet, despite these advances, critical challenges remain – most notably, ensuring robustness to small input perturbations and generalization to out-of-distribution data. These critical challenges underscore the need to understand the underlying fundamental principles that govern robustness and generalization. Among the theoretical tools available, Lipschitz continuity plays a pivotal role in governing the fundamental properties of neural networks related to robustness and generalization. It quantifies the worst-case sensitivity of network’s outputs to small input perturbations. While its importance is widely acknowledged, prior research has predominantly focused on empirical regularization approaches based on Lipschitz constraints, leaving the underlying principles less explored. This thesis seeks to advance a principled understanding of the principles of Lipschitz continuity in neural networks within the paradigm of machine learning, examined from two complementary perspectives: an internal perspective – focusing on the temporal evolution of Lipschitz continuity in neural networks during training (i.e., training dynamics); and an external perspective – investigating how Lipschitz continuity modulates the behavior of neural networks with respect to features in the input data, particularly its role in governing frequency signal propagation (i.e., modulation of frequency signal propagation).
zh

[AI-56] PluRel: Synthetic Data unlocks Scaling Laws for Relational Foundation Models

【速读】:该论文旨在解决关系型基础模型(Relational Foundation Models, RFMs)训练中因隐私限制导致真实多表数据库难以获取的问题,进而阻碍了模型在复杂数据驱动决策任务中的性能提升。解决方案的关键在于提出PluRel框架,通过分步建模实现从零开始合成结构合理且多样化的多表关系数据库:首先用有向图建模数据库的schema结构,其次利用二部图刻画表间主键-外键连接关系,最后通过条件因果机制模拟各表内特征分布。该设计在保证计算轻量的同时支持广泛的数据库多样性生成,从而为RFMs提供高质量的合成预训练数据,并首次揭示了合成数据规模与模型性能之间的幂律缩放关系及泛化优势。

链接: https://arxiv.org/abs/2602.04029
作者: Vignesh Kothapalli,Rishabh Ranjan,Valter Hudovernik,Vijay Prakash Dwivedi,Johannes Hoffart,Carlos Guestrin,Jure Leskovec
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code: this https URL

点击查看摘要

Abstract:Relational Foundation Models (RFMs) facilitate data-driven decision-making by learning from complex multi-table databases. However, the diverse relational databases needed to train such models are rarely public due to privacy constraints. While there are methods to generate synthetic tabular data of arbitrary size, incorporating schema structure and primary–foreign key connectivity for multi-table generation remains challenging. Here we introduce PluRel, a framework to synthesize multi-tabular relational databases from scratch. In a step-by-step fashion, PluRel models (1) schemas with directed graphs, (2) inter-table primary-foreign key connectivity with bipartite graphs, and, (3) feature distributions in tables via conditional causal mechanisms. The design space across these stages supports the synthesis of a wide range of diverse databases, while being computationally lightweight. Using PluRel, we observe for the first time that (1) RFM pretraining loss exhibits power-law scaling with the number of synthetic databases and total pretraining tokens, (2) scaling the number of synthetic databases improves generalization to real databases, and (3) synthetic pretraining yields strong base models for continued pretraining on real databases. Overall, our framework and results position synthetic data scaling as a promising paradigm for RFMs.
zh

[AI-57] Axiomatic Foundations of Counterfactual Explanations

【速读】:该论文旨在解决当前生成式 AI(Generative AI)系统中解释方法的局限性问题,特别是现有反事实解释(counterfactual explanation)多集中于单一类型且仅提供局部解释(local explanations),缺乏对系统整体推理机制的全局理解。其解决方案的关键在于构建一个公理化框架(axiomatic framework),通过定义一组理想性质(desirable properties)来系统分类和刻画不同类型的反事实解释。该框架证明了若干不可能性定理(impossibility theorems),表明不存在任何单一解释器能同时满足某些性质组合;并通过表示定理(representation theorems)建立了五组公理子集与特定解释器家族之间的一一对应关系,从而揭示出五种本质不同的反事实类型——其中部分对应局部解释,其余则捕捉全局解释(global explanations)。这一理论体系不仅为现有解释方法提供了形式化定位与行为分析,还明确了生成这些解释的计算复杂度。

链接: https://arxiv.org/abs/2602.04028
作者: Leila Amgoud,Martin Cooper
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Explaining autonomous and intelligent systems is critical in order to improve trust in their decisions. Counterfactuals have emerged as one of the most compelling forms of explanation. They address ``why not’’ questions by revealing how decisions could be altered. Despite the growing literature, most existing explainers focus on a single type of counterfactual and are restricted to local explanations, focusing on individual instances. There has been no systematic study of alternative counterfactual types, nor of global counterfactuals that shed light on a system’s overall reasoning process. This paper addresses the two gaps by introducing an axiomatic framework built on a set of desirable properties for counterfactual explainers. It proves impossibility theorems showing that no single explainer can satisfy certain axiom combinations simultaneously, and fully characterizes all compatible sets. Representation theorems then establish five one-to-one correspondences between specific subsets of axioms and the families of explainers that satisfy them. Each family gives rise to a distinct type of counterfactual explanation, uncovering five fundamentally different types of counterfactuals. Some of these correspond to local explanations, while others capture global explanations. Finally, the framework situates existing explainers within this taxonomy, formally characterizes their behavior, and analyzes the computational complexity of generating such explanations. Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME) ACMclasses: F.4.1 Cite as: arXiv:2602.04028 [cs.AI] (or arXiv:2602.04028v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2602.04028 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-58] Understanding and Guiding Layer Placement in Parameter-Efficient Fine-Tuning of Large Language Models

【速读】:该论文旨在解决参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)中层选择策略缺乏理论指导与灵活性的问题,尤其是在推理延迟受限的部署场景下如何权衡性能与计算成本。其核心解决方案是提出一种统一的投影残差视角(unified projected residual view),基于局部二次逼近揭示了层级微调行为由三个关键量决定:(i) 投影残差范数(resnorm),衡量每层可纠正偏差的能力;(ii) 激活能量(activation energy),影响特征条件数和噪声放大;(iii) 层间耦合度(layer coupling),反映残差跨层交互强度。在此基础上设计了“层卡”(Layer Card)诊断工具,量化各层的残差信号强度、计算开销与性能贡献,从而实现对适配层的精准筛选与灵活配置,例如在保持接近全层LoRA性能的同时显著降低微调成本和推理阶段的适配器层数,为不同目标(如性能最大化或成本最小化)提供可复用的优化路径。

链接: https://arxiv.org/abs/2602.04019
作者: Yichen Xu,Yuyang Liang,Shan Dai,Tianyang Hu,Tsz Nam Chan,Chenhao Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As large language models (LLMs) continue to grow, the cost of full-parameter fine-tuning has made parameter-efficient fine-tuning (PEFT) the default strategy for downstream adaptation. Constraints from inference latency in scalable serving and fine-tuning cost in edge or rapid-deployment settings make the choice of which layers to fine-tune unavoidable. Yet current practice typically applies PEFT uniformly across all layers, with limited understanding or leverage of layer selection. This paper develops a unified projected residual view of PEFT on top of a frozen base model. Under a local quadratic approximation, layerwise adaptation is governed by three quantities: (i) the projected residual norm (resnorm), which measures how much correctable bias a layer can capture; (ii) the activation energy, which determines feature conditioning; and (iii) layer coupling, which quantifies how strongly residuals interact across layers. We show that, for squared loss and linear adapters, the resnorm equals a normalized gradient norm, activation energy controls ill-conditioning and noise amplification, and weak coupling yields approximately additive layerwise contributions. Building on these insights, we introduce the Layer Card, a reusable diagnostic that summarizes residual signal strength, compute cost, and performance for each layer of a given model. With an identical model and LoRA configuration, Layer Card-guided placement refines the choice of adapted layers to flexibly prioritize different objectives, such as maximizing performance or reducing fine-tuning cost. Moreover, on Qwen3-8B, we show that selectively adapting a subset of layers can achieve performance close to full-layer LoRA while substantially reducing fine-tuning cost and the number of adapter-augmented layers during inference, offering a more cost-performance-aware alternative to full-layer insertion.
zh

[AI-59] Rational ANOVA Networks

【速读】:该论文旨在解决深度神经网络中非线性激活函数固定不变所带来的可解释性差与函数类控制粒度粗的问题,同时克服现有基于样条的加法模型(如KAN)在计算效率和边界稳定性方面的局限。其解决方案的关键在于提出Rational-ANOVA Network (RAN),该架构基于函数ANOVA分解和Padé型有理逼近,将输入函数建模为主效应与稀疏成对交互项的组合,其中每个组件由一个具有严格正分母的可学习有理单元参数化;这种设计不仅避免了极点导致的数值不稳定,还能更高效地捕捉尖锐过渡和近奇异行为,从而提升外推性能,并通过显式的低阶交互偏差实现更高的数据效率和可解释性。

链接: https://arxiv.org/abs/2602.04006
作者: Jusheng Zhang,Ningyuan Liu,Qinhan Lyu,Jing Yang,Keze Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Code: \url{ this https URL }

点击查看摘要

Abstract:Deep neural networks typically treat nonlinearities as fixed primitives (e.g., ReLU), limiting both interpretability and the granularity of control over the induced function class. While recent additive models (like KANs) attempt to address this using splines, they often suffer from computational inefficiency and boundary instability. We propose the Rational-ANOVA Network (RAN), a foundational architecture grounded in functional ANOVA decomposition and Padé-style rational approximation. RAN models f(x) as a composition of main effects and sparse pairwise interactions, where each component is parameterized by a stable, learnable rational unit. Crucially, we enforce a strictly positive denominator, which avoids poles and numerical instability while capturing sharp transitions and near-singular behaviors more efficiently than polynomial bases. This ANOVA structure provides an explicit low-order interaction bias for data efficiency and interpretability, while the rational parameterization significantly improves extrapolation. Across controlled function benchmarks and vision classification tasks (e.g., CIFAR-10) under matched parameter and compute budgets, RAN matches or surpasses parameter-matched MLPs and learnable-activation baselines, with better stability and throughput. Code is available at this https URL.
zh

[AI-60] When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making

【速读】:该论文旨在解决生成式 AI(Generative AI)在人机协作决策场景中,因模型输出的解释被恶意操纵而导致人类用户对错误预测产生不当信任的问题。传统对抗攻击主要针对模型计算行为,而本文首次将解释(explanation)视为一个新型认知层面的对抗通道(adversarial cognitive channel),提出对抗性解释攻击(Adversarial Explanation Attacks, AEAs),其核心在于通过操控大型语言模型(Large Language Models, LLMs)生成解释的框架(如推理模式、证据类型、沟通风格和呈现格式),诱导用户对错误结果产生与正确结果相当甚至更高的信任,从而造成“信任校准偏差”(trust miscalibration gap)。解决方案的关键在于量化这一偏差并揭示其在特定情境下的高脆弱性:当攻击解释模仿专家表达方式时——即使用权威证据、中性语气和领域适配推理——用户的信任水平几乎不受影响,尤其在任务难度高、依赖事实判断的领域及教育程度较低、年龄较轻或高度信赖AI的群体中表现最为显著。

链接: https://arxiv.org/abs/2602.04003
作者: Shutong Fan,Lan Zhang,Xiaoyong Yuan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Most adversarial threats in artificial intelligence target the computational behavior of models rather than the humans who rely on them. Yet modern AI systems increasingly operate within human decision loops, where users interpret and act on model recommendations. Large Language Models generate fluent natural-language explanations that shape how users perceive and trust AI outputs, revealing a new attack surface at the cognitive layer: the communication channel between AI and its users. We introduce adversarial explanation attacks (AEAs), where an attacker manipulates the framing of LLM-generated explanations to modulate human trust in incorrect outputs. We formalize this behavioral threat through the trust miscalibration gap, a metric that captures the difference in human trust between correct and incorrect outputs under adversarial explanations. By incorporating this gap, AEAs explore the daunting threats in which persuasive explanations reinforce users’ trust in incorrect predictions. To characterize this threat, we conducted a controlled experiment (n = 205), systematically varying four dimensions of explanation framing: reasoning mode, evidence type, communication style, and presentation format. Our findings show that users report nearly identical trust for adversarial and benign explanations, with adversarial explanations preserving the vast majority of benign trust despite being incorrect. The most vulnerable cases arise when AEAs closely resemble expert communication, combining authoritative evidence, neutral tone, and domain-appropriate reasoning. Vulnerability is highest on hard tasks, in fact-driven domains, and among participants who are less formally educated, younger, or highly trusting of AI. This is the first systematic security study that treats explanations as an adversarial cognitive channel and quantifies their impact on human trust in AI-assisted decision making.
zh

[AI-61] When Chains of Thought Dont Matter: Causal Bypass in Large Language Models ICLR

【速读】:该论文试图解决的问题是:链式思维(Chain-of-thought, CoT)提示是否真正实现了模型推理过程的因果依赖,即模型的答案是否确实依赖于CoT内容而非存在绕过机制(bypass circuit)。尽管CoT常被视为提升生成式AI(Generative AI)透明性的手段,但作者发现表面合规的CoT并不保证其对最终答案具有因果影响。解决方案的关键在于提出一个诊断框架,该框架包含两个核心组件:(i) 一个可解释的行为模块,用于量化CoT文本中与操控相关的信号得分;(ii) 一个因果探针(causal probe),通过隐藏状态修补(hidden-state patching)测量CoT中介影响(CMI),并输出“绕过分数”(1−CMI),从而定量评估答案是否独立于CoT内容而产生。实验表明,该框架能有效识别出多数任务中CoT被严重绕过的现象(CMI ≈ 0),同时也揭示了特定逻辑任务中存在一定程度的因果中介(CMI最高达0.56),为理解CoT提示的实际因果机制提供了新的审计工具。

链接: https://arxiv.org/abs/2602.03994
作者: Anish Sathyanarayanan,Aditya Nagarsekar,Aarush Rathore
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Under Review at ICLR, 2026

点击查看摘要

Abstract:Chain-of-thought (CoT) prompting is widely assumed to expose a model’s reasoning process and improve transparency. We attempted to enforce this assumption by penalizing unfaithful reasoning, but found that surface-level compliance does not guarantee causal reliance. Our central finding is negative: even when CoT is verbose, strategic, and flagged by surface-level manipulation detectors, model answers are often causally independent of the CoT content. We present a diagnostic framework for auditing this failure mode: it combines (i) an interpretable behavioral module that scores manipulation-relevant signals in CoT text and (ii) a causal probe that measures CoT-mediated influence (CMI) via hidden-state patching and reports a bypass score ( 1-\mathrmCMI ), quantifying the degree to which the answer is produced by a bypass circuit independent of the rationale. In pilot evaluations, audit-aware prompting increases detectable manipulation signals (mean risk-score delta: +5.10 ), yet causal probes reveal task-dependent mediation: many QA items exhibit near-total bypass (CMI \approx 0 ), while some logic problems show stronger mediation (CMI up to 0.56 ). Layer-wise analysis reveals narrow and task-dependent ``reasoning windows’’ even when mean CMI is low.
zh

[AI-62] DeXposure-FM: A Time-series Graph Foundation Model for Credit Exposures and Stability on Decentralized Financial Networks

【速读】:该论文旨在解决去中心化金融(Decentralized Finance, DeFi)中隐性且由代币媒介的信用敞口问题,此类敞口因协议间高度耦合而易引发不可控的传染效应,尤其在DeFi与传统金融基础设施通过稳定币等工具日益融合的背景下,亟需更强大的量化分析工具。解决方案的关键在于提出首个面向时间序列的图基础模型DeXposure-FM,其核心创新包括:基于图-表格编码器架构并采用预训练权重初始化,结合多任务特定头部结构;在包含4370万条数据、覆盖4300+协议、602条区块链及2.4万种代币的DeXposure数据集上进行训练,以预测协议级资金流动和信用敞口链接的拓扑结构与权重联合动态。该模型不仅在机器学习基准上显著优于现有方法(如图基础模型和时序图神经网络),还支持宏观审慎监管和场景化压力测试,通过“预测-测量”流水线生成协议级系统重要性评分、行业级溢出与集中度指标,从而实现对DeFi系统性风险的精准识别与量化。

链接: https://arxiv.org/abs/2602.03981
作者: Aijie Shu,Wenbin Wu,Gbenga Ibikunle,Fengxiang He
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Econometrics (econ.EM)
备注:

点击查看摘要

Abstract:Credit exposure in Decentralized Finance (DeFi) is often implicit and token-mediated, creating a dense web of inter-protocol dependencies. Thus, a shock to one token may result in significant and uncontrolled contagion effects. As the DeFi ecosystem becomes increasingly linked with traditional financial infrastructure through instruments, such as stablecoins, the risk posed by this dynamic demands more powerful quantification tools. We introduce DeXposure-FM, the first time-series, graph foundation model for measuring and forecasting inter-protocol credit exposure on DeFi networks, to the best of our knowledge. Employing a graph-tabular encoder, with pre-trained weight initialization, and multiple task-specific heads, DeXposure-FM is trained on the DeXposure dataset that has 43.7 million data entries, across 4,300+ protocols on 602 blockchains, covering 24,300+ unique tokens. The training is operationalized for credit-exposure forecasting, predicting the joint dynamics of (1) protocol-level flows, and (2) the topology and weights of credit-exposure links. The DeXposure-FM is empirically validated on two machine learning benchmarks; it consistently outperforms the state-of-the-art approaches, including a graph foundation model and temporal graph neural networks. DeXposure-FM further produces financial economics tools that support macroprudential monitoring and scenario-based DeFi stress testing, by enabling protocol-level systemic-importance scores, sector-level spillover and concentration measures via a forecast-then-measure pipeline. Empirical verification fully supports our financial economics tools. The model and code have been publicly available. Model: this https URL. Code: this https URL.
zh

[AI-63] Monitorability as a Free Gift: How RLVR Spontaneously Aligns Reason ing

【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在实际部署中因链式思维(Chain-of-Thought, CoT)轨迹缺乏可审计性而导致的安全隐患问题。其核心挑战在于如何通过训练机制提升CoT的可监控性(monitorability),即确保CoT能忠实且充分地反映模型内部计算过程。解决方案的关键在于系统性评估强化学习结合可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)训练下不同模型家族与数据域中的monitorability变化,发现monitorability的提升并非普遍现象,而是高度依赖于训练数据的多样性及指令遵循类数据的引入;同时揭示monitorability与模型能力正交,其提升主要源于响应分布锐化(entropy reduction)和对提示的关注度增强,而非更强的因果推理依赖。这一发现为理解RLVR中monitorability的涌现机制提供了全面视角,并明确了其优化条件。

链接: https://arxiv.org/abs/2602.03978
作者: Zidi Xiong,Shan Chen,Himabindu Lakkaraju
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As Large Reasoning Models (LRMs) are increasingly deployed, auditing their chain-of-thought (CoT) traces for safety becomes critical. Recent work has reported that monitorability–the degree to which CoT faithfully and informatively reflects internal computation–can appear as a “free gift” during the early stages of Reinforcement Learning with Verifiable Rewards (RLVR). We make this observation concrete through a systematic evaluation across model families and training domains. Our results show that this effect is not universal: monitorability improvements are strongly data-dependent. In particular, we demonstrate the critical role of data diversity and instruction-following data during RLVR training. We further show that monitorability is orthogonal to capability–improvements in reasoning performance do not imply increased transparency. Through mechanistic analysis, we attribute monitorability gains primarily to response distribution sharpening (entropy reduction) and increased attention to the prompt, rather than stronger causal reliance on reasoning traces. We also reveal how monitorability dynamics vary with controlled training and evaluation difficulty. Together, these findings provide a holistic view of how monitorability emerges under RLVR, clarifying when gains are likely to occur and when they are not.
zh

[AI-64] Adaptive Test-Time Compute Allocation via Learned Heuristics over Categorical Structure

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)推理过程中验证成本过高且存在大量冗余或低效中间假设验证的问题。其核心挑战在于如何在验证资源受限的条件下,最优分配验证资源以提升整体推理效率与准确性。解决方案的关键在于提出一种基于状态级别的选择性验证框架,通过三个关键机制实现:(i) 基于结构化动作接口的确定性可行性过滤(deterministic feasibility gating),(ii) 利用学习到的状态距离与残差评分相结合的预验证排序(pre-verification ranking),以及 (iii) 根据局部不确定性自适应分配验证器调用(adaptive allocation of verifier calls)。该方法相较于传统的解级最优N(best-of-N)或均匀中间验证策略,能够更精准地将验证资源投向信息量最大的中间状态,在MATH基准上实现了更高准确率的同时减少44%的验证调用次数。

链接: https://arxiv.org/abs/2602.03975
作者: Shuhui Qu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Test-time computation has become a primary driver of progress in large language model (LLM) reasoning, but it is increasingly bottlenecked by expensive verification. In many reasoning systems, a large fraction of verifier calls are spent on redundant or unpromising intermediate hypotheses. We study reasoning under a \emphverification-cost-limited setting and ask how verification effort should be allocated across intermediate states. We propose a state-level selective verification framework that combines (i) deterministic feasibility gating over a structured move interface, (ii) pre-verification ranking using a hybrid of learned state-distance and residual scoring, and (iii) adaptive allocation of verifier calls based on local uncertainty. Unlike solution-level best-of- N or uniform intermediate verification, our method distributes verification where it is most informative. On the \textscMATH benchmark, our approach achieves higher accuracy than best-of- N , majority voting, and beam search while using 44% fewer verifier calls.
zh

[AI-65] Active Epistemic Control for Query-Efficient Verified Planning

【速读】:该论文旨在解决交互环境中部分可观测性下的规划难题:任务关键前提条件(如物体位置或容器状态)在决策时刻可能未知,而通过交互获取这些信息成本较高;尽管学习到的世界模型可低成本预测缺失事实,但其预测误差可能导致不可行的承诺。解决方案的关键在于提出主动认知控制(Active Epistemic Control, AEC),它是一个融合基于模型信念管理与分类可行性检查的规划层,通过严格区分用于承诺的已接地事实存储(grounded fact store)和仅用于剪枝候选计划的信念存储(belief store),在每一步动态决定是查询环境以确证高不确定性前提,还是利用模拟预测来过滤假设;最终承诺由已接地前提覆盖度及类似SQ-BCP回溯兼容性检查共同约束,确保仿真信念仅提升效率而不直接决定可行性。

链接: https://arxiv.org/abs/2602.03974
作者: Shuhui Qu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Planning in interactive environments is challenging under partial observability: task-critical preconditions (e.g., object locations or container states) may be unknown at decision time, yet grounding them through interaction is costly. Learned world models can cheaply predict missing facts, but prediction errors can silently induce infeasible commitments. We present \textbfActive Epistemic Control (AEC), an epistemic-categorical planning layer that integrates model-based belief management with categorical feasibility checks. AEC maintains a strict separation between a \emphgrounded fact store used for commitment and a \emphbelief store used only for pruning candidate plans. At each step, it either queries the environment to ground an unresolved predicate when uncertainty is high or predictions are ambiguous, or simulates the predicate to filter hypotheses when confidence is sufficient. Final commitment is gated by grounded precondition coverage and an SQ-BCP pullback-style compatibility check, so simulated beliefs affect efficiency but cannot directly certify feasibility. Experiments on ALFWorld and ScienceWorld show that AEC achieves competitive success with fewer replanning rounds than strong LLM-agent baselines.
zh

[AI-66] Structural shifts in institutional participation and collaboration within the AI arXiv preprint research ecosystem

【速读】:该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)的兴起如何重塑人工智能(Artificial Intelligence, AI)研究领域的结构特征,包括出版量、作者团队规模以及学术界与产业界的合作模式。其解决方案的关键在于构建一个多阶段数据采集与增强流程,并结合基于大语言模型的机构分类方法,对2021至2025年arXiv cs.AI板块的预印本文献进行系统分析,从而揭示生成式AI研究中学术—产业协作的结构性障碍及其演变趋势。

链接: https://arxiv.org/abs/2602.03969
作者: Shama Magnur,Mayank Kejriwal
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注: 16 pages, 5 Figures, 7 Tables

点击查看摘要

Abstract:The emergence of large language models (LLMs) represents a significant technological shift within the scientific ecosystem, particularly within the field of artificial intelligence (AI). This paper examines structural changes in the AI research landscape using a dataset of arXiv preprints (cs.AI) from 2021 through 2025. Given the rapid pace of AI development, the preprint ecosystem has become a critical barometer for real-time scientific shifts, often preceding formal peer-reviewed publication by months or years. By employing a multi-stage data collection and enrichment pipeline in conjunction with LLM-based institution classification, we analyze the evolution of publication volumes, author team sizes, and academic–industry collaboration patterns. Our results reveal an unprecedented surge in publication output following the introduction of ChatGPT, with academic institutions continuing to provide the largest volume of research. However, we observe that academic–industry collaboration is still suppressed, as measured by a Normalized Collaboration Index (NCI) that remains significantly below the random-mixing baseline across all major subfields. These findings highlight a continuing institutional divide and suggest that the capital-intensive nature of generative AI research may be reshaping the boundaries of scientific collaboration.
zh

[AI-67] Semantic Rate Distortion and Posterior Design: Compute Constraints Multimodality and Strategic Inference

【速读】:该论文旨在解决在速率(rate)和计算(compute)约束下,具有战略意图的高斯语义压缩问题,其中编码器与解码器分别优化不同的二次目标函数。其核心挑战在于如何设计最优后验协方差以在信息率约束下实现语义压缩的最优化,并揭示不同信息结构(直接、远程、全信息)下的语义率失真函数。解决方案的关键在于提出“语义水填”(semantic waterfilling)和受速率约束的高斯劝说(rate-constrained Gaussian persuasion)机制,并证明在目标不一致时高斯分布仍为最优策略;同时发现架构计算限制可视为隐式速率约束,从而通过模型深度和推理时间计算的增加带来语义准确率的指数级提升,且多模态观测能消除远程编码中固有的几何均值惩罚。

链接: https://arxiv.org/abs/2602.03949
作者: Emrah Akyol
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: submitted for publication

点击查看摘要

Abstract:We study strategic Gaussian semantic compression under rate and compute constraints, where an encoder and decoder optimize distinct quadratic objectives. A latent Gaussian state generates a task dependent semantic variable, and the decoder best responds via MMSE estimation, reducing the encoder’s problem to posterior covariance design under an information rate constraint. We characterize the strategic rate distortion function in direct, remote, and full information regimes, derive semantic waterfilling and rate constrained Gaussian persuasion solutions, and establish Gaussian optimality under misaligned objectives. We further show that architectural compute limits act as implicit rate constraints, yielding exponential improvements in semantic accuracy with model depth and inference time compute, while multimodal observation eliminates the geometric mean penalty inherent to remote encoding. These results provide information theoretic foundations for data and energy efficient AI and offer a principled interpretation of modern multimodal language models as posterior design mechanisms under resource constraints.
zh

[AI-68] WIND: Weather Inverse Diffusion for Zero-Shot Atmospheric Modeling

【速读】:该论文旨在解决当前天气与气候建模中模型碎片化的问题,即不同任务通常依赖于独立训练的专用模型,缺乏通用性和可扩展性。解决方案的关键在于提出一个统一的预训练基础模型 WIND,其通过无监督视频重建目标(自监督视频重建)在大气动力学数据上进行预训练,从而学习到一种任务无关的先验知识;在推理阶段,将各类特定领域问题严格建模为逆问题,并通过后验采样求解,无需任何任务特定微调即可完成概率预测、时空降尺度、稀疏重构及守恒律强制等复杂任务,实现了计算高效的生成式 AI (Generative AI) 驱动的大气建模范式转变。

链接: https://arxiv.org/abs/2602.03924
作者: Michael Aich,Andreas Fürst,Florian Sestak,Carlos Ruiz-Gonzalez,Niklas Boers,Johannes Brandstetter
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
备注:

点击查看摘要

Abstract:Deep learning has revolutionized weather and climate modeling, yet the current landscape remains fragmented: highly specialized models are typically trained individually for distinct tasks. To unify this landscape, we introduce WIND, a single pre-trained foundation model capable of replacing specialized baselines across a vast array of tasks. Crucially, in contrast to previous atmospheric foundation models, we achieve this without any task-specific fine-tuning. To learn a robust, task-agnostic prior of the atmosphere, we pre-train WIND with a self-supervised video reconstruction objective, utilizing an unconditional video diffusion model to iteratively reconstruct atmospheric dynamics from a noisy state. At inference, we frame diverse domain-specific problems strictly as inverse problems and solve them via posterior sampling. This unified approach allows us to tackle highly relevant weather and climate problems, including probabilistic forecasting, spatial and temporal downscaling, sparse reconstruction and enforcing conservation laws purely with our pre-trained model. We further demonstrate the model’s capacity to generate physically consistent counterfactual storylines of extreme weather events under global warming scenarios. By combining generative video modeling with inverse problem solving, WIND offers a computationally efficient paradigm shift in AI-based atmospheric modeling.
zh

[AI-69] SpecMD: A Comprehensive Study On Speculative Expert Prefetching

【速读】:该论文旨在解决混合专家(Mixture-of-Experts, MoE)模型在推理过程中因专家访问模式缺乏时间局部性(temporal locality)而导致传统缓存策略(如LRU、LFU)效率低下,进而限制实际性能提升的问题。其解决方案的关键在于提出一种新的缓存淘汰策略——Least-Stale,该策略利用MoE模型中专家访问模式的可预测性,通过优先保留最近被访问过的专家以减少缓存冲突缺失(collision misses),从而显著提升缓存命中率(最高达88%)并降低首次token响应时间(TTFT,最多减少34.7%),且仅需5%或0.6GB的显存缓存容量即可实现这一优化效果。

链接: https://arxiv.org/abs/2602.03921
作者: Duc Hoang,Ajay Jaiswal,Mohammad Samragh,Minsik Cho
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mixture-of-Experts (MoE) models enable sparse expert activation, meaning that only a subset of the model’s parameters is used during each inference. However, to translate this sparsity into practical performance, an expert caching mechanism is required. Previous works have proposed hardware-centric caching policies, but how these various caching policies interact with each other and different hardware specification remains poorly understood. To address this gap, we develop \textbfSpecMD, a standardized framework for benchmarking ad-hoc cache policies on various hardware configurations. Using SpecMD, we perform an exhaustive benchmarking of several MoE caching strategies, reproducing and extending prior approaches in controlled settings with realistic constraints. Our experiments reveal that MoE expert access is not consistent with temporal locality assumptions (e.g LRU, LFU). Motivated by this observation, we propose \textbfLeast-Stale, a novel eviction policy that exploits MoE’s predictable expert access patterns to reduce collision misses by up to 85\times over LRU. With such gains, we achieve over 88% hit rates with up to 34.7% Time-to-first-token (TTFT) reduction on OLMoE at only 5% or 0.6GB of VRAM cache capacity.
zh

[AI-70] GeoIB: Geometry-Aware Information Bottleneck via Statistical-Manifold Compression

【速读】:该论文旨在解决信息瓶颈(Information Bottleneck, IB)在深度学习中因依赖可微分近似(如变分界或神经网络互信息估计器)而导致的压缩控制不直接、优化不稳定的问题。其核心挑战在于传统IB方法难以精确调控输入与表征之间的互信息 I(X;Z)I(X;Z),且估计偏差易引发训练脆弱性。解决方案的关键在于引入几何信息瓶颈(Geometric Information Bottleneck, GeoIB),它摒弃了对互信息的显式估计,转而基于信息几何视角,将 I(X;Z)I(X;Z)I(Z;Y)I(Z;Y) 表达为联合分布到独立流形的最小Kullback-Leibler(KL)距离的精确投影形式;进而通过两个互补项实现可控压缩:一是分布层面的Fisher-Rao(FR)差异,具有二阶KL匹配性和重参数化不变性;二是几何层面的Jacobian-Frobenius(JF)项,通过惩罚编码器拉回体积膨胀来提供 I(Z;X)I(Z;X) 的局部容量上界。此框架统一了分布与几何正则化,并设计了一致于FR度量的自然梯度优化器,显著提升了模型的稳定性和泛化性能。

链接: https://arxiv.org/abs/2602.03906
作者: Weiqi Wang,Zhiyi Tian,Chenhan Zhang,Shui Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Information Bottleneck (IB) is widely used, but in deep learning, it is usually implemented through tractable surrogates, such as variational bounds or neural mutual information (MI) estimators, rather than directly controlling the MI I(X;Z) itself. The looseness and estimator-dependent bias can make IB “compression” only indirectly controlled and optimization fragile. We revisit the IB problem through the lens of information geometry and propose a \textbfGeometric \textbfInformation \textbfBottleneck (\textbfGeoIB) that dispenses with mutual information (MI) estimation. We show that I(X;Z) and I(Z;Y) admit exact projection forms as minimal Kullback-Leibler (KL) distances from the joint distributions to their respective independence manifolds. Guided by this view, GeoIB controls information compression with two complementary terms: (i) a distribution-level Fisher-Rao (FR) discrepancy, which matches KL to second order and is reparameterization-invariant; and (ii) a geometry-level Jacobian-Frobenius (JF) term that provides a local capacity-type upper bound on I(Z;X) by penalizing pullback volume expansion of the encoder. We further derive a natural-gradient optimizer consistent with the FR metric and prove that the standard additive natural-gradient step is first-order equivalent to the geodesic update. We conducted extensive experiments and observed that the GeoIB achieves a better trade-off between prediction accuracy and compression ratio in the information plane than the mainstream IB baselines on popular datasets. GeoIB improves invariance and optimization stability by unifying distributional and geometric regularization under a single bottleneck multiplier. The source code of GeoIB is released at "this https URL. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML) Cite as: arXiv:2602.03906 [cs.LG] (or arXiv:2602.03906v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.03906 Focus to learn more arXiv-issued DOI via DataCite
zh

[AI-71] Knowledge Model Prompting Increases LLM Performance on Planning Tasks

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLM)在推理能力和规划任务中表现不足的问题,尤其是针对其在符号操作与复杂任务分解上的局限性。解决方案的关键在于引入任务-方法-知识(Task-Method-Knowledge, TMK)框架作为结构化提示策略,该框架通过显式建模因果、目的论和层次化推理结构,以及明确的任务分解机制,引导模型从默认的语言模式转向形式化、可执行的推理路径。实验表明,TMK提示显著提升了模型在PlanBench中的Blocksworld域任务上的准确率,从原先的31.5%提升至97.3%,验证了其在增强LLM符号推理能力方面的有效性。

链接: https://arxiv.org/abs/2602.03900
作者: Erik Goh,John Kos,Ashok Goel
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLM) can struggle with reasoning ability and planning tasks. Many prompting techniques have been developed to assist with LLM reasoning, notably Chain-of-Thought (CoT); however, these techniques, too, have come under scrutiny as LLMs’ ability to reason at all has come into question. Borrowing from the domain of cognitive and educational science, this paper investigates whether the Task-Method-Knowledge (TMK) framework can improve LLM reasoning capabilities beyond its previously demonstrated success in educational applications. The TMK framework’s unique ability to capture causal, teleological, and hierarchical reasoning structures, combined with its explicit task decomposition mechanisms, makes it particularly well-suited for addressing language model reasoning deficiencies, and unlike other hierarchical frameworks such as HTN and BDI, TMK provides explicit representations of not just what to do and how to do it, but also why actions are taken. The study evaluates TMK by experimenting on the PlanBench benchmark, focusing on the Blocksworld domain to test for reasoning and planning capabilities, examining whether TMK-structured prompting can help language models better decompose complex planning problems into manageable sub-tasks. Results also highlight significant performance inversion in reasoning models. TMK prompting enables the reasoning model to achieve up to an accuracy of 97.3% on opaque, symbolic tasks (Random versions of Blocksworld in PlanBench) where it previously failed (31.5%), suggesting the potential to bridge the gap between semantic approximation and symbolic manipulation. Our findings suggest that TMK functions not merely as context, but also as a mechanism that steers reasoning models away from their default linguistic modes to engage formal, code-execution pathways in the context of the experiments.
zh

[AI-72] GOPO: Policy Optimization using Ranked Rewards

【速读】:该论文旨在解决标准人类反馈强化学习(Reinforcement Learning from Human Feedback, RLHF)中奖励模型与策略优化之间存在的对齐问题:即奖励模型基于成对偏好数据训练以捕捉相对偏好,而现有策略优化方法却依赖于奖励的绝对数值,这在不可验证奖励场景(如摘要生成、指令遵循和对话补全)中常导致次优性能。解决方案的关键在于提出一种新的策略优化方法——组序数策略优化(Group Ordinal Policy Optimization, GOPO),其核心创新是仅使用奖励的排序信息而非其具体数值,通过秩变换(rank-based transformation)实现对奖励信号的无量纲化处理,从而避免因奖励绝对值敏感性带来的优化偏差,显著提升了训练稳定性、效率及最终策略质量。

链接: https://arxiv.org/abs/2602.03876
作者: Kyuseong Choi,Dwaipayan Saha,Woojeong Kim,Anish Agarwal,Raaz Dwivedi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, 8 figures

点击查看摘要

Abstract:Standard reinforcement learning from human feedback (RLHF) trains a reward model on pairwise preference data and then uses it for policy optimization. However, while reward models are optimized to capture relative preferences, existing policy optimization techniques rely on absolute reward magnitudes during training. In settings where the rewards are non-verifiable such as summarization, instruction following, and chat completion, this misalignment often leads to suboptimal performance. We introduce Group Ordinal Policy Optimization (GOPO), a policy optimization method that uses only the ranking of the rewards and discards their magnitudes. Our rank-based transformation of rewards provides several gains, compared to Group Relative Policy Optimization (GRPO), in settings with non-verifiable rewards: (1) consistently higher training/validation reward trajectories, (2) improved LLM-as-judge evaluations across most intermediate training steps, and (3) reaching a policy of comparable quality in substantially less training steps than GRPO. We demonstrate consistent improvements across a range of tasks and model sizes.
zh

[AI-73] Reversible Deep Learning for 13C NMR in Chemoinformatics: On Structures and Spectra

【速读】:该论文旨在解决核磁共振(NMR)谱图与分子结构之间双向映射的难题,尤其是谱图到结构的不确定性推理问题。传统方法通常将谱图预测和结构生成视为两个独立任务,难以统一建模谱图到结构的“一对多”关系。解决方案的关键在于提出一种可逆深度学习模型,该模型基于i-RevNet风格的双射块构建,通过单一条件可逆神经网络实现从分子结构到谱图代码(128位分箱编码)的正向映射,以及从谱图代码到结构候选的反向映射。该架构天然支持数值可逆性,且在训练数据上表现出良好的重构能力,同时在验证谱图上能生成具有粗粒度但语义合理的结构候选,从而在端到端框架内实现了谱图预测与不确定性感知的结构生成一体化。

链接: https://arxiv.org/abs/2602.03875
作者: Stefan Kuhn,Vandana Dwarka,Przemyslaw Karol Grenda,Eero Vainikko
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: 10 pages, 4 figures, 4 tables

点击查看摘要

Abstract:We introduce a reversible deep learning model for 13C NMR that uses a single conditional invertible neural network for both directions between molecular structures and spectra. The network is built from i-RevNet style bijective blocks, so the forward map and its inverse are available by construction. We train the model to predict a 128-bit binned spectrum code from a graph-based structure encoding, while the remaining latent dimensions capture residual variability. At inference time, we invert the same trained network to generate structure candidates from a spectrum code, which explicitly represents the one-to-many nature of spectrum-to-structure inference. On a filtered subset, the model is numerically invertible on trained examples, achieves spectrum-code prediction above chance, and produces coarse but meaningful structural signals when inverted on validation spectra. These results demonstrate that invertible architectures can unify spectrum prediction and uncertainty-aware candidate generation within one end-to-end model.
zh

[AI-74] Decoding Ambiguous Emotions with Test-Time Scaling in Audio-Language Models

【速读】:该论文旨在解决语音情感识别中因情绪状态模糊、重叠且依赖情境而导致的标注困难与自动建模挑战,尤其是在生成式语音语言模型(Audio Language Models, ALMs)和测试时扩展(Test-Time Scaling, TTS)技术尚未充分探索其在处理模糊情绪能力的问题。解决方案的关键在于构建首个面向模糊情绪识别的基准测试平台,系统评估八种前沿ALMs与五种TTS策略在三个主流语音情感数据集上的表现,并深入分析模型容量、TTS方法与情感模糊性之间的交互关系,从而揭示计算与表征层面的挑战,为开发更具鲁棒性、情境感知和情感智能的语音AI系统奠定基础。

链接: https://arxiv.org/abs/2602.03873
作者: Hong Jia,Weibin Li,Jingyao Wu,Xiaofeng Yu,Yan Gao,Jintao Cheng,Xiaoyu Tang,Feng Xia,Ting Dang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Emotion recognition from human speech is a critical enabler for socially aware conversational AI. However, while most prior work frames emotion recognition as a categorical classification problem, real-world affective states are often ambiguous, overlapping, and context-dependent, posing significant challenges for both annotation and automatic modeling. Recent large-scale audio language models (ALMs) offer new opportunities for nuanced affective reasoning without explicit emotion supervision, but their capacity to handle ambiguous emotions remains underexplored. At the same time, advances in inference-time techniques such as test-time scaling (TTS) have shown promise for improving generalization and adaptability in hard NLP tasks, but their relevance to affective computing is still largely unknown. In this work, we introduce the first benchmark for ambiguous emotion recognition in speech with ALMs under test-time scaling. Our evaluation systematically compares eight state-of-the-art ALMs and five TTS strategies across three prominent speech emotion datasets. We further provide an in-depth analysis of the interaction between model capacity, TTS, and affective ambiguity, offering new insights into the computational and representational challenges of ambiguous emotion understanding. Our benchmark establishes a foundation for developing more robust, context-aware, and emotionally intelligent speech-based AI systems, and highlights key future directions for bridging the gap between model assumptions and the complexity of real-world human emotion.
zh

[AI-75] Understanding the Impact of Differentially Private Training on Memorization of Long-Tailed Data

【速读】:该论文旨在解决差分隐私训练算法(如DP-SGD)在长尾数据分布下普遍存在的泛化性能下降问题,特别是模型对稀有或典型样本的识别能力显著弱于整体数据集的表现。其解决方案的关键在于构建首个从特征学习视角出发的理论框架,用于分析DP-SGD在长尾数据上的训练动态;研究表明,梯度裁剪与噪声注入共同作用会损害模型对信息丰富但代表性不足样本的记忆能力,从而导致长尾子群体测试误差显著高于全局测试误差。

链接: https://arxiv.org/abs/2602.03872
作者: Jiaming Zhang,Huanyi Xie,Meng Ding,Shaopeng Fu,Jinyan Liu,Di Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: arXiv admin note: text overlap with arXiv:2502.11893 by other authors

点击查看摘要

Abstract:Recent research shows that modern deep learning models achieve high predictive accuracy partly by memorizing individual training samples. Such memorization raises serious privacy concerns, motivating the widespread adoption of differentially private training algorithms such as DP-SGD. However, a growing body of empirical work shows that DP-SGD often leads to suboptimal generalization performance, particularly on long-tailed data that contain a large number of rare or atypical samples. Despite these observations, a theoretical understanding of this phenomenon remains largely unexplored, and existing differential privacy analysis are difficult to extend to the nonconvex and nonsmooth neural networks commonly used in practice. In this work, we develop the first theoretical framework for analyzing DP-SGD on long-tailed data from a feature learning perspective. We show that the test error of DP-SGD-trained models on the long-tailed subpopulation is significantly larger than the overall test error over the entire dataset. Our analysis further characterizes the training dynamics of DP-SGD, demonstrating how gradient clipping and noise injection jointly adversely affect the model’s ability to memorize informative but underrepresented samples. Finally, we validate our theoretical findings through extensive experiments on both synthetic and real-world datasets.
zh

[AI-76] PaperX: A Unified Framework for Multimodal Academic Presentation Generation with Scholar DAG

【速读】:该论文旨在解决科学论文转化为多模态展示内容时存在的劳动密集型问题,以及现有自动化方案因将每种格式视为独立下游任务而导致的冗余处理与语义不一致问题。其解决方案的关键在于提出PaperX统一框架,通过引入Scholar DAG(学术有向无环图)作为中间表示,将论文的逻辑结构与最终呈现语法解耦,再结合自适应图遍历策略,从单一源文档生成多样化、高质量的输出,从而在保持内容保真度和美学质量的同时显著提升成本效率。

链接: https://arxiv.org/abs/2602.03866
作者: Tao Yu,Minghui Zhang,Zhiqing Cui,Hao Wang,Zhongtian Luo,Shenghua Chai,Junhao Gong,Yuzhao Peng,Yuxuan Zhou,Yujia Yang,Zhenghao Zhang,Haopeng Jin,Xinming Wang,Yufei Xiong,Jiabing Yang,Jiahao Yuan,Hanqing Wang,Hongzhu Yi,YiFan Zhang,Yan Huang,Liang Wang
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
备注: 29 pages, 9 figures

点击查看摘要

Abstract:Transforming scientific papers into multimodal presentation content is essential for research dissemination but remains labor intensive. Existing automated solutions typically treat each format as an isolated downstream task, leading to redundant processing and semantic inconsistency. We introduce PaperX, a unified framework that models academic presentation generation as a structural transformation and rendering process. Central to our approach is the Scholar DAG, an intermediate representation that decouples the paper’s logical structure from its final presentation syntax. By applying adaptive graph traversal strategies, PaperX generates diverse, high quality outputs from a single source. Comprehensive evaluations demonstrate that our framework achieves the state of the art performance in content fidelity and aesthetic quality while significantly improving cost efficiency compared to specialized single task agents.
zh

[AI-77] Perceptions of AI-CBT: Trust and Barriers in Chinese Postgrads TAAI2025

【速读】:该论文试图解决中国研究生群体心理健康支持可及性不足的问题,特别是如何通过生成式 AI (Generative AI) 技术赋能的认知行为疗法聊天机器人(AI-CBT)提升心理干预的可扩展性和接受度。其解决方案的关键在于基于健康信念模型(Health Belief Model, HBM)和计划行为理论(Theory of Planned Behavior, TPB)构建的质性分析框架,揭示了目标用户对AI-CBT的感知价值与使用障碍:一方面,用户认可其便捷性与全天候可用性带来的积极态度;另一方面,数据隐私、情感安全及复杂问题适配性的不确定性显著限制了采纳意愿。研究进一步指出社会规范(如污名化)和感知控制能力(如数字素养、语言质量)是影响实际使用的双重因素,从而为面向中国学生群体的心理健康AI工具提供文化敏感的设计原则,包括透明机制、风险防护措施以及分层护理路径的系统性设计建议。

链接: https://arxiv.org/abs/2602.03852
作者: Chan-in Sio,Alex Mann,Lingxi Fan,Andrew Cheung,Lik-hang Lee
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted and presented in The 30th International Conference on Technologies and Applications of Artificial Intelligence in Taipei, Taiwan on 13-14 December 2025 (TAAI 2025)

点击查看摘要

Abstract:The mental well-being of graduate students is an increasing concern, yet the adoption of scalable support remains uneven. Artificial intelligence-powered cognitive behavioral therapy chatbots (AI-CBT) offer low barrier help, but little is known about how Chinese postgraduates perceive and use them. This qualitative study explored perceptions and experiences of AI-CBT chatbots among ten Chinese graduate students recruited through social media. Semi-structured Zoom interviews were conducted and analyzed using reflexive thematic analysis, with the Health Belief Model (HBM) and the Theory of Planned Behavior (TPB) as sensitizing frameworks. The findings indicate a cautious openness to AI-CBT chatbots: perceived usefulness and 24/7 access supported favorable attitudes, while data privacy, emotional safety, and uncertainty about `fit’ for complex problems restricted the intention to use. Social norms (e.g., stigma and peer views) and perceived control (digital literacy, language quality) further shaped adoption. The study offers context-specific information to guide the culturally sensitive design, communication, and deployment of AI mental well-being tools for student populations in China and outlines the design implications around transparency, safeguards, and graduated care pathways.
zh

[AI-78] Puzzle it Out: Local-to-Global World Model for Offline Multi-Agent Reinforcement Learning

【速读】:该论文旨在解决离线多智能体强化学习(Offline Multi-Agent Reinforcement Learning, Offline MARL)中因数据分布限制导致策略过于保守、难以泛化的问题。现有方法通常局限于数据支持的区域,而模型-based方法虽可通过生成合成数据扩展状态-动作空间,但在高维、非平稳且复杂的多智能体系统中,准确建模联合动态和奖励函数极具挑战。其解决方案的关键在于提出一种局部到全局(Local-to-Global, LOGO)世界模型框架:通过先估计易于建模的局部状态转移,再利用这些局部预测推断全局状态演化,从而在隐式捕捉个体间依赖关系的同时提升预测精度;进一步结合不确定性感知采样机制,根据预测不确定性自适应加权合成数据,降低误差传播风险,且仅需额外一个编码器即可实现不确定性估计,显著减少计算开销。实验表明,该方法在8个场景下优于8种基线,成为新的基于模型的离线多智能体学习基准。

链接: https://arxiv.org/abs/2601.07463
作者: Sijia li,Xinran Li,Shibo Chen,Jun Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Offline multi-agent reinforcement learning (MARL) aims to solve cooperative decision-making problems in multi-agent systems using pre-collected datasets. Existing offline MARL methods primarily constrain training within the dataset distribution, resulting in overly conservative policies that struggle to generalize beyond the support of the data. While model-based approaches offer a promising solution by expanding the original dataset with synthetic data generated from a learned world model, the high dimensionality, non-stationarity, and complexity of multi-agent systems make it challenging to accurately estimate the transitions and reward functions in offline MARL. Given the difficulty of directly modeling joint dynamics, we propose a local-to-global (LOGO) world model, a novel framework that leverages local predictions-which are easier to estimate-to infer global state dynamics, thus improving prediction accuracy while implicitly capturing agent-wise dependencies. Using the trained world model, we generate synthetic data to augment the original dataset, expanding the effective state-action space. To ensure reliable policy learning, we further introduce an uncertainty-aware sampling mechanism that adaptively weights synthetic data by prediction uncertainty, reducing approximation error propagation to policies. In contrast to conventional ensemble-based methods, our approach requires only an additional encoder for uncertainty estimation, significantly reducing computational overhead while maintaining accuracy. Extensive experiments across 8 scenarios against 8 baselines demonstrate that our method surpasses state-of-the-art baselines on standard offline MARL benchmarks, establishing a new model-based baseline for generalizable offline multi-agent learning.
zh

[AI-79] Multi-layer Cross-Attention is Provably Optimal for Multi-modal In-context Learning

【速读】:该论文旨在解决多模态数据下上下文学习(in-context learning)的理论机制不明确的问题,尤其关注Transformer类架构能否在多模态场景中实现贝叶斯最优性能。其关键解决方案在于提出一个数学上可处理的框架,并引入一种新型线性化交叉注意力(linearized cross-attention)机制;在交叉注意力层数和上下文长度均较大的情况下,证明该机制通过梯度流优化可达到贝叶斯最优性能,从而揭示了深度结构与交叉注意力对多模态上下文学习的必要性与有效性。

链接: https://arxiv.org/abs/2602.04872
作者: Nicholas Barnfield,Subhabrata Sen,Pragya Sur
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent progress has rapidly advanced our understanding of the mechanisms underlying in-context learning in modern attention-based neural networks. However, existing results focus exclusively on unimodal data; in contrast, the theoretical underpinnings of in-context learning for multi-modal data remain poorly understood. We introduce a mathematically tractable framework for studying multi-modal learning and explore when transformer-like architectures can recover Bayes-optimal performance in-context. To model multi-modal problems, we assume the observed data arises from a latent factor model. Our first result comprises a negative take on expressibility: we prove that single-layer, linear self-attention fails to recover the Bayes-optimal predictor uniformly over the task distribution. To address this limitation, we introduce a novel, linearized cross-attention mechanism, which we study in the regime where both the number of cross-attention layers and the context length are large. We show that this cross-attention mechanism is provably Bayes optimal when optimized using gradient flow. Our results underscore the benefits of depth for in-context learning and establish the provable utility of cross-attention for multi-modal distributions.
zh

[AI-80] BrainVista: Modeling Naturalistic Brain Dynamics as Multimodal Next-Token Prediction

【速读】:该论文旨在解决自然场景下功能性磁共振成像(fMRI)中脑状态因果演化建模的挑战,特别是由多模态输入与皮层网络复杂拓扑结构之间的时序不匹配所导致的问题。其解决方案的关键在于提出BrainVista框架,该框架采用两种核心机制:一是基于网络级标记器(Network-wise Tokenizers)以解耦不同神经系统的特异性动态;二是引入空间混合头(Spatial Mixer Head)在不破坏功能边界的前提下捕捉网络间的信息流动。此外,通过新颖的刺激到脑(Stimulus-to-Brain, S2B)掩码机制,实现高频感官刺激与血氧水平依赖(BOLD)信号的时序同步,从而实现严格的仅基于历史信息的因果条件建模。

链接: https://arxiv.org/abs/2602.04512
作者: Xuanhua Yin,Runkai Zhao,Lina Yao,Weidong Cai
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注: 17 pages, 7 figures, 11 tables

点击查看摘要

Abstract:Naturalistic fMRI characterizes the brain as a dynamic predictive engine driven by continuous sensory streams. However, modeling the causal forward evolution in realistic neural simulation is impeded by the timescale mismatch between multimodal inputs and the complex topology of cortical networks. To address these challenges, we introduce BrainVista, a multimodal autoregressive framework designed to model the causal evolution of brain states. BrainVista incorporates Network-wise Tokenizers to disentangle system-specific dynamics and a Spatial Mixer Head that captures inter-network information flow without compromising functional boundaries. Furthermore, we propose a novel Stimulus-to-Brain (S2B) masking mechanism to synchronize high-frequency sensory stimuli with hemodynamically filtered signals, enabling strict, history-only causal conditioning. We validate our framework on Algonauts 2025, CineBrain, and HAD, achieving state-of-the-art fMRI encoding performance. In long-horizon rollout settings, our model yields substantial improvements over baselines, increasing pattern correlation by 36.0% and 33.3% on relative to the strongest baseline Algonauts 2025 and CineBrain, respectively.
zh

[AI-81] Discovering Mechanistic Models of Neural Activity: System Identification in an in Silico Zebrafish

【速读】:该论文旨在解决神经回路机制模型验证缺乏真实基准(ground truth)的问题,从而限制了模型发现的可靠性与可解释性。其解决方案的关键在于构建一个基于斑马鱼幼虫神经机械仿真(neuromechanical simulations)的虚拟测试平台,提供透明且可控的真实系统行为作为基准;在此基础上,利用大语言模型(LLM)驱动的树搜索算法自动发现具有强预测能力的机制模型,同时揭示结构先验(structural priors)对实现分布外泛化和恢复可解释机制模型的重要性。

链接: https://arxiv.org/abs/2602.04492
作者: Jan-Matthis Lueckmann,Viren Jain,Michał Januszewski
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Constructing mechanistic models of neural circuits is a fundamental goal of neuroscience, yet verifying such models is limited by the lack of ground truth. To rigorously test model discovery, we establish an in silico testbed using neuromechanical simulations of a larval zebrafish as a transparent ground truth. We find that LLM-based tree search autonomously discovers predictive models that significantly outperform established forecasting baselines. Conditioning on sensory drive is necessary but not sufficient for faithful system identification, as models exploit statistical shortcuts. Structural priors prove essential for enabling robust out-of-distribution generalization and recovery of interpretable mechanistic models. Our insights provide guidance for modeling real-world neural recordings and offer a broader template for AI-driven scientific discovery.
zh

[AI-82] Performative Learning Theory

【速读】:该论文旨在解决表现性预测(performative predictions)下模型泛化能力的问题,即当预测模型影响其预测对象的行为时(如仅影响现有用户或全体潜在用户),如何保证模型在新数据上的有效性。核心挑战在于:预测本身改变了数据分布,从而削弱了传统统计学习理论中“数据独立同分布”的假设。解决方案的关键在于将表现性效应嵌入统计学习理论框架,并通过Wasserstein空间中的极小极大(min-max)和极小极小(min-min)风险泛函来刻画自否定(population negates predictions)与自实现(sample fulfills predictions)两种极端情形。这揭示了一个根本性权衡:模型对世界的改变越多,其从该世界中学习的能力就越弱;同时发现,通过对表现性扭曲的样本进行再训练,可显著提升泛化保证。

链接: https://arxiv.org/abs/2602.04402
作者: Julian Rodemann,Unai Fischer-Abaigar,James Bailie,Krikamol Muandet
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Statistics Theory (math.ST)
备注: 52 pages, 2 figures

点击查看摘要

Abstract:Performative predictions influence the very outcomes they aim to forecast. We study performative predictions that affect a sample (e.g., only existing users of an app) and/or the whole population (e.g., all potential app users). This raises the question of how well models generalize under performativity. For example, how well can we draw insights about new app users based on existing users when both of them react to the app’s predictions? We address this question by embedding performative predictions into statistical learning theory. We prove generalization bounds under performative effects on the sample, on the population, and on both. A key intuition behind our proofs is that in the worst case, the population negates predictions, while the sample deceptively fulfills them. We cast such self-negating and self-fulfilling predictions as min-max and min-min risk functionals in Wasserstein space, respectively. Our analysis reveals a fundamental trade-off between performatively changing the world and learning from it: the more a model affects data, the less it can learn from it. Moreover, our analysis results in a surprising insight on how to improve generalization guarantees by retraining on performatively distorted samples. We illustrate our bounds in a case study on prediction-informed assignments of unemployed German residents to job trainings, drawing upon administrative labor market records from 1975 to 2017 in Germany.
zh

[AI-83] A computational account of dreaming: learning and memory consolidation

【速读】:该论文试图解决的核心问题是:梦境是否具有功能性,尤其是其在学习与记忆巩固中的作用,这一问题长期受到“随机信号假说”与“功能假说”之间的争议。解决方案的关键在于提出一个认知与计算模型,模拟大脑在清醒状态下对来自海马体(hippocampus)的自发随机信号进行处理的过程,从而实现学习与记忆巩固功能。该模型表明,即使信号本身具有随机性,仍可通过神经回放机制(neural replaying)促进记忆整合,支持梦境作为清醒活动延续的观点,并与多项实证研究结果一致。

链接: https://arxiv.org/abs/2602.04095
作者: Qi Zhang
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: 30 pages, 4 tables, 2 figures

点击查看摘要

Abstract:A number of studies have concluded that dreaming is mostly caused by randomly arriving internal signals because “dream contents are random impulses”, and argued that dream sleep is unlikely to play an important part in our intellectual capacity. On the contrary, numerous functional studies have revealed that dream sleep does play an important role in our learning and other intellectual functions. Specifically, recent studies have suggested the importance of dream sleep in memory consolidation, following the findings of neural replaying of recent waking patterns in the hippocampus. The randomness has been the hurdle that divides dream theories into either functional or functionless. This study presents a cognitive and computational model of dream process. This model is simulated to perform the functions of learning and memory consolidation, which are two most popular dream functions that have been proposed. The simulations demonstrate that random signals may result in learning and memory consolidation. Thus, dreaming is proposed as a continuation of brain’s waking activities that processes signals activated spontaneously and randomly from the hippocampus. The characteristics of the model are discussed and found in agreement with many characteristics concluded from various empirical studies.
zh

[AI-84] Structure-Informed Estimation for Pilot-Limited MIMO Channels via Tensor Decomposition

【速读】:该论文旨在解决宽带多输入多输出(MIMO)系统中因导频开销限制而导致的高维超5G和第六代(6G)场景下信道估计性能受限的问题。其核心解决方案是提出一种混合张量-神经架构,将导频受限的信道估计建模为从稀疏观测中进行低秩张量补全——这与以往假设接收信号张量完全可观测的张量方法形成根本区别。关键创新在于:首先利用CP分解(Canonical Polyadic decomposition)和Tucker分解在不同信道模型下的互补优势(CP适用于匹配多径模型的镜面传播信道,Tucker更具鲁棒性),其次引入轻量级三维U-Net网络学习超出低秩结构的残差分量,从而融合代数模型与真实传播效应;实验证明该方法在样本复杂度上近似依赖于内在模型维度 $ L(N_r + N_t + N_f) $ 而非张量环境尺寸 $ N_r N_t N_f $,显著优于最小二乘(LS)和正交匹配追踪(OMP)基线,并在DeepMIMO射线追踪数据上进一步提升24–44%的归一化均方误差(NMSE)性能。

链接: https://arxiv.org/abs/2602.04083
作者: Alexandre Barbosa de Lima
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Channel estimation in wideband multiple-input multiple-output (MIMO) systems faces fundamental pilot overhead limitations in high-dimensional beyond-5G and sixth-generation (6G) scenarios. This paper presents a hybrid tensor-neural architecture that formulates pilot-limited channel estimation as low-rank tensor completion from sparse observations – a fundamentally different setting from prior tensor methods that assume fully observed received signal tensors. A canonical polyadic (CP) baseline implemented via a projection-based scheme (Tucker completion under partial observations) and Tucker decompositions are compared under varying signal-to-noise ratio (SNR) and scattering conditions: CP performs well for specular channels matching the multipath model, while Tucker provides greater robustness under model mismatch. A lightweight three-dimensional (3D) U-Net learns residual components beyond the low-rank structure, bridging algebraic models and realistic propagation effects. Empirical recovery threshold analysis shows that sample complexity scales approximately with intrinsic model dimensionality L(N_r + N_t + N_f) rather than ambient tensor size N_r N_t N_f , where L denotes the number of dominant propagation paths. Experiments on synthetic channels demonstrate 10-20,dB normalized mean-square error (NMSE) improvement over least-squares (LS) and orthogonal matching pursuit (OMP) baselines at 5-10% pilot density, while evaluations on DeepMIMO ray-tracing channels show 24-44% additional NMSE reduction over pure tensor-based methods.
zh

[AI-85] Fixed Budget is No Harder Than Fixed Confidence in Best-Arm Identification up to Logarithmic Factors

【速读】:该论文旨在解决最佳臂识别(Best-Arm Identification, BAI)问题中固定预算(Fixed-Budget, FB)与固定置信度(Fixed-Confidence, FC)两种设置之间的复杂度关系这一基础性研究问题:FB是否比FC更难?其解决方案的关键在于提出了一种名为FC2FB(Fixed Confidence to Fixed Budget)的元算法,该算法能够将任意一个FC算法转化为一个FB算法,并且保持样本复杂度仅相差对数因子。通过构造性证明,作者表明FB的最优样本复杂度至多为FC最优复杂度的对数倍,从而揭示了FB并不比FC更难的本质关系,并为多项FB问题带来了新的改进性能上限。

链接: https://arxiv.org/abs/2602.03972
作者: Kapilan Balagopalan,Yinan Li,Yao Zhao,Tuan Nguyen,Anton Daitche,Houssam Nassif,Kwang-Sung Jun
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The best-arm identification (BAI) problem is one of the most fundamental problems in interactive machine learning, which has two flavors: the fixed-budget setting (FB) and the fixed-confidence setting (FC). For K -armed bandits with the unique best arm, the optimal sample complexities for both settings have been settled down, and they match up to logarithmic factors. This prompts an interesting research question about the generic, potentially structured BAI problems: Is FB harder than FC or the other way around? In this paper, we show that FB is no harder than FC up to logarithmic factors. We do this constructively: we propose a novel algorithm called FC2FB (fixed confidence to fixed budget), which is a meta algorithm that takes in an FC algorithm \mathcalA and turn it into an FB algorithm. We prove that this FC2FB enjoys a sample complexity that matches, up to logarithmic factors, that of the sample complexity of \mathcalA . This means that the optimal FC sample complexity is an upper bound of the optimal FB sample complexity up to logarithmic factors. Our result not only reveals a fundamental relationship between FB and FC, but also has a significant implication: FC2FB, combined with existing state-of-the-art FC algorithms, leads to improved sample complexity for a number of FB problems. Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2602.03972 [stat.ML] (or arXiv:2602.03972v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2602.03972 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Kapilan Balagopalan [view email] [v1] Tue, 3 Feb 2026 19:49:55 UTC (349 KB) Full-text links: Access Paper: View a PDF of the paper titled Fixed Budget is No Harder Than Fixed Confidence in Best-Arm Identification up to Logarithmic Factors, by Kapilan Balagopalan and 6 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: stat.ML prev | next new | recent | 2026-02 Change to browse by: cs cs.AI cs.LG stat References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
zh

[AI-86] First-Principles AI finds crystallization of fractional quantum Hall liquids

【速读】:该论文旨在解决强朗道能级混杂 regime 下分数量子霍尔(Fractional Quantum Hall, FQH)液体何时发生液晶化(liquid crystallization)的问题,核心挑战在于如何在统一框架中同时描述分数量子化(fractionalization)与晶体序(crystallization)。解决方案的关键是提出了一种基于自注意力机制的变分波函数——MagNet,该模型专为磁场中的量子系统在环面几何下设计,能够通过仅依赖微观哈密顿量的能量最小化训练,自动识别并描述拓扑液体和电子晶体等竞争相态,无需外部标注数据或物理先验知识,从而实现对多体相互作用体系中不同有序态的无监督发现。

链接: https://arxiv.org/abs/2602.03927
作者: Ahmed Abouelkomsan,Liang Fu
机构: 未知
类目: Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Strongly Correlated Electrons (cond-mat.str-el); Artificial Intelligence (cs.AI)
备注: 5 pages + SM

点击查看摘要

Abstract:When does a fractional quantum Hall (FQH) liquid crystallize? Addressing this question requires a framework that treats fractionalization and crystallization on equal footing, especially in strong Landau-level mixing regime. Here, we introduce MagNet, a self-attention neural-network variational wavefunction designed for quantum systems in magnetic fields on the torus geometry. We show that MagNet provides a unifying and expressive ansatz capable of describing both FQH states and electron crystals within the same architecture. Trained solely by energy minimization of the microscopic Hamiltonian, MagNet discovers topological liquid and electron crystal ground states across a broad range of Landau-level mixing. Our results highlight the power of first-principles AI for solving strongly interacting many-body problems and finding competing phases without external training data or physics pre-knowledge.
zh

[AI-87] All-Atom GPCR-Ligand Simulation via Residual Isometric Latent Flow

【速读】:该论文旨在解决G蛋白偶联受体(GPCR)-配体复合物在分子动力学(MD)模拟中因计算成本过高而难以高效研究其构象转变与信号传导机制的问题。解决方案的关键在于提出一种名为GPCRLMD的深度生成框架,其核心创新是利用带有谐波先验的变分自编码器(HP-VAE)将复杂体系映射至一个受物理约束的正则化等距潜在空间,从而保留几何拓扑结构;在此空间内通过残差潜在流(Residual Latent Flow)采样演化轨迹,并以初始结构为锚点通过相对位移建模时间动态,有效解耦静态拓扑与动态波动,实现了对GPCR-配体系统热力学可观测量和关键相互作用的高保真模拟。

链接: https://arxiv.org/abs/2602.03902
作者: Jiying Zhang,Shuhao Zhang,Pierre Vandergheynst,Patrick Barth
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 36 pages

点击查看摘要

Abstract:G-protein-coupled receptors (GPCRs), primary targets for over one-third of approved therapeutics, rely on intricate conformational transitions to transduce signals. While Molecular Dynamics (MD) is essential for elucidating this transduction process, particularly within ligand-bound complexes, conventional all-atom MD simulation is computationally prohibitive. In this paper, we introduce GPCRLMD, a deep generative framework for efficient all-atom GPCR-ligand this http URL employs a Harmonic-Prior Variational Autoencoder (HP-VAE) to first map the complex into a regularized isometric latent space, preserving geometric topology via physics-informed constraints. Within this latent space, a Residual Latent Flow samples evolution trajectories, which are subsequently decoded back to atomic coordinates. By capturing temporal dynamics via relative displacements anchored to the initial structure, this residual mechanism effectively decouples static topology from dynamic fluctuations. Experimental results demonstrate that GPCRLMD achieves state-of-the-art performance in GPCR-ligand dynamics simulation, faithfully reproducing thermodynamic observables and critical ligand-receptor interactions.
zh

[AI-88] Byzantine Machine Learning: MultiKrum and an optimal notion of robustness

【速读】:该论文旨在解决多Krum(MultiKrum)聚合规则在存在拜占庭攻击(Byzantine threat model)的分布式学习场景中缺乏理论鲁棒性保障的问题。尽管MultiKrum在实践中表现出优于Krum聚合规则的性能,但此前尚未有理论证明其具备鲁棒性。解决方案的关键在于提出一个全新的最优鲁棒系数(κ\kappa^\star),用于更精确地量化聚合规则在对抗环境下对均值估计的鲁棒能力,并基于此构建了MultiKrum鲁棒系数的上下界,首次严格证明了其作为鲁棒聚合规则的有效性。同时,该分析还改进了Krum鲁棒系数的已有边界,表明在实际应用场景中,MultiKrum的鲁棒性表现不低于且通常优于Krum。

链接: https://arxiv.org/abs/2602.03899
作者: Gilles Bareilles,Wassim Bouaziz,Julien Fageot,El-Mahdi El-Mhamdi
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)
备注:

点击查看摘要

Abstract:Aggregation rules are the cornerstone of distributed (or federated) learning in the presence of adversaries, under the so-called Byzantine threat model. They are also interesting mathematical objects from the point of view of robust mean estimation. The Krum aggregation rule has been extensively studied, and endowed with formal robustness and convergence guarantees. Yet, MultiKrum, a natural extension of Krum, is often preferred in practice for its superior empirical performance, even though no theoretical guarantees were available until now. In this work, we provide the first proof that MultiKrum is a robust aggregation rule, and bound its robustness coefficient. To do so, we introduce \kappa^\star , the optimal robustness coefficient of an aggregation rule, which quantifies the accuracy of mean estimation in the presence of adversaries in a tighter manner compared with previously adopted notions of robustness. We then construct an upper and a lower bound on MultiKrum’s robustness coefficient. As a by-product, we also improve on the best-known bounds on Krum’s robustness coefficient. We show that MultiKrum’s bounds are never worse than Krum’s, and better in realistic regimes. We illustrate this analysis by an experimental investigation on the quality of the lower bound.
zh

机器学习

[LG-0] Multi-Head LatentMoE and Head Parallel: Communication-Efficient and Deterministic MoE Parallelism

链接: https://arxiv.org/abs/2602.04870
作者: Chenwei Cui,Rockwell Jackson,Benjamin Joseph Herrera,Ana María Tárano,Hannah Kerner
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models have transformed many applications but remain expensive to train. Sparse Mixture of Experts (MoE) addresses this through conditional computation, with Expert Parallel (EP) as the standard distributed training method. However, EP has three limitations: communication cost grows linearly with the number of activated experts k , load imbalance affects latency and memory usage, and data-dependent communication requires metadata exchange. We propose Multi-Head LatentMoE and Head Parallel (HP), a new architecture and parallelism achieving O(1) communication cost regardless of k , completely balanced traffic, and deterministic communication, all while remaining compatible with EP. To accelerate Multi-Head LatentMoE, we propose IO-aware routing and expert computation. Compared to MoE with EP, Multi-Head LatentMoE with HP trains up to 1.61\times faster while having identical performance. With doubled granularity, it achieves higher overall performance while still being 1.11\times faster. Our method makes multi-billion-parameter foundation model research more accessible.

[LG-1] he Key to State Reduction in Linear Attention: A Rank-based Perspective

链接: https://arxiv.org/abs/2602.04852
作者: Philipp Nazari,T. Konstantin Rusch
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Linear attention offers a computationally efficient yet expressive alternative to softmax attention. However, recent empirical results indicate that the state of trained linear attention models often exhibits a low-rank structure, suggesting that these models underexploit their capacity in practice. To illuminate this phenomenon, we provide a theoretical analysis of the role of rank in linear attention, revealing that low effective rank can affect retrieval error by amplifying query noise. In addition to these theoretical insights, we conjecture that the low-rank states can be substantially reduced post-training with only minimal performance degradation, yielding faster and more memory-efficient models. To this end, we propose a novel hardware-aware approach that structurally prunes key and query matrices, reducing the state size while retaining compatibility with existing CUDA kernels. We adapt several existing pruning strategies to fit our framework and, building on our theoretical analysis, propose a novel structured pruning method based on a rank-revealing QR decomposition. Our empirical results, evaluated across models of varying sizes and on various downstream tasks, demonstrate the effectiveness of our state reduction framework. We highlight that our framework enables the removal of 50% of the query and key channels at only a marginal increase in perplexity. The code for this project can be found at this https URL.

[LG-2] Robust Generalizable Heterogeneous Legal Link Prediction

链接: https://arxiv.org/abs/2602.04812
作者: Lorenz Wendlinger,Simon Alexander Nonn,Abdullah Al Zubaer,Michael Granitzer
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: 9 Pages

点击查看摘要

Abstract:Recent work has applied link prediction to large heterogeneous legal citation networks \newwith rich meta-features. We find that this approach can be improved by including edge dropout and feature concatenation for the learning of more robust representations, which reduces error rates by up to 45%. We also propose an approach based on multilingual node features with an improved asymmetric decoder for compatibility, which allows us to generalize and extend the prediction to more, geographically and linguistically disjoint, data from New Zealand. Our adaptations also improve inductive transferability between these disjoint legal systems.

[LG-3] Evolving Afferent Architectures: Biologically-inspired Models for Damage-Avoidance Learning

链接: https://arxiv.org/abs/2602.04807
作者: Wolfgang Maass,Sabine Janzen,Prajvi Saxena,Sach Mukherjee
类目: Machine Learning (cs.LG)
*备注: 16 pages, 6 figures

点击查看摘要

Abstract:We introduce Afferent Learning, a framework that produces Computational Afferent Traces (CATs) as adaptive, internal risk signals for damage-avoidance learning. Inspired by biological systems, the framework uses a two-level architecture: evolutionary optimization (outer loop) discovers afferent sensing architectures that enable effective policy learning, while reinforcement learning (inner loop) trains damage-avoidance policies using these signals. This formalizes afferent sensing as providing an inductive bias for efficient learning: architectures are selected based on their ability to enable effective learning (rather than directly minimizing damage). We provide theoretical convergence guarantees under smoothness and bounded-noise assumptions. We illustrate the general approach in the challenging context of biomechanical digital twins operating over long time horizons (multiple decades of the life-course). Here, we find that CAT-based evolved architectures achieve significantly higher efficiency and better age-robustness than hand-designed baselines, enabling policies that exhibit age-dependent behavioral adaptation (23% reduction in high-risk actions). Ablation studies validate CAT signals, evolution, and predictive discrepancy as essential. We release code and data for reproducibility.

[LG-4] Maximum-Volume Nonnegative Matrix Factorization

链接: https://arxiv.org/abs/2602.04795
作者: Olivier Vu Thanh,Nicolas Gillis
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注: arXiv admin note: substantial text overlap with arXiv:2412.06380

点击查看摘要

Abstract:Nonnegative matrix factorization (NMF) is a popular data embedding technique. Given a nonnegative data matrix X , it aims at finding two lower dimensional matrices, W and H , such that X\approx WH , where the factors W and H are constrained to be element-wise nonnegative. The factor W serves as a basis for the columns of X . In order to obtain more interpretable and unique solutions, minimum-volume NMF (MinVol NMF) minimizes the volume of W . In this paper, we consider the dual approach, where the volume of H is maximized instead; this is referred to as maximum-volume NMF (MaxVol NMF). MaxVol NMF is identifiable under the same conditions as MinVol NMF in the noiseless case, but it behaves rather differently in the presence of noise. In practice, MaxVol NMF is much more effective to extract a sparse decomposition and does not generate rank-deficient solutions. In fact, we prove that the solutions of MaxVol NMF with the largest volume correspond to clustering the columns of X in disjoint clusters, while the solutions of MinVol NMF with smallest volume are rank deficient. We propose two algorithms to solve MaxVol NMF. We also present a normalized variant of MaxVol NMF that exhibits better performance than MinVol NMF and MaxVol NMF, and can be interpreted as a continuum between standard NMF and orthogonal NMF. We illustrate our results in the context of hyperspectral unmixing.

[LG-5] From independent patches to coordinated attention: Controlling information flow in vision transformers

链接: https://arxiv.org/abs/2602.04784
作者: Kieran A. Murphy
类目: Machine Learning (cs.LG)
*备注: Code at this https URL

点击查看摘要

Abstract:We make the information transmitted by attention an explicit, measurable quantity in vision transformers. By inserting variational information bottlenecks on all attention-mediated writes to the residual stream – without other architectural changes – we train models with an explicit information cost and obtain a controllable spectrum from independent patch processing to fully expressive global attention. On ImageNet-100, we characterize how classification behavior and information routing evolve across this spectrum, and provide initial insights into how global visual representations emerge from local patch processing by analyzing the first attention heads that transmit information. By biasing learning toward solutions with constrained internal communication, our approach yields models that are more tractable for mechanistic analysis and more amenable to control.

[LG-6] Legendre Memory Unit with A Multi-Slice Compensation Model for Short-Term Wind Speed Forecasting Based on Wind Farm Cluster Data

链接: https://arxiv.org/abs/2602.04782
作者: Mumin Zhang,Haochen Zhang,Xin Zhi Khoo,Yilin Zhang,Nuo Chen,Ting Zhang,Junjie Tang
类目: Machine Learning (cs.LG)
*备注: 10 pages, 11 figures,

点击查看摘要

Abstract:With more wind farms clustered for integration, the short-term wind speed prediction of such wind farm clusters is critical for normal operation of power systems. This paper focuses on achieving accurate, fast, and robust wind speed prediction by full use of cluster data with spatial-temporal correlation. First, weighted mean filtering (WMF) is applied to denoise wind speed data at the single-farm level. The Legendre memory unit (LMU) is then innovatively applied for the wind speed prediction, in combination with the Compensating Parameter based on Kendall rank correlation coefficient (CPK) of wind farm cluster data, to construct the multi-slice LMU (MSLMU). Finally, an innovative ensemble model WMF-CPK-MSLMU is proposed herein, with three key blocks: data pre-processing, forecasting, and multi-slice compensation. Advantages include: 1) LMU jointly models linear and nonlinear dependencies among farms to capture spatial-temporal correlations through backpropagation; 2) MSLMU enhances forecasting by using CPK-derived weights instead of random initialization, allowing spatial correlations to fully activate hidden nodes across clustered wind farms.; 3) CPK adaptively weights the compensation model in MSLMU and complements missing data spatially, to facilitate the whole model highly accurate and robust. Test results on different wind farm clusters indicate the effectiveness and superiority of proposed ensemble model WMF-CPK-MSLMU in the short-term prediction of wind farm clusters compared to the existing models.

[LG-7] Dynamical Regimes of Multimodal Diffusion Models

链接: https://arxiv.org/abs/2602.04780
作者: Emil Albrychiewicz,Andrés Franco Valiente,Li-Ching Chen
类目: Machine Learning (cs.LG)
*备注: 40 pages, 14 figures

点击查看摘要

Abstract:Diffusion based generative models have achieved unprecedented fidelity in synthesizing high dimensional data, yet the theoretical mechanisms governing multimodal generation remain poorly understood. Here, we present a theoretical framework for coupled diffusion models, using coupled Ornstein-Uhlenbeck processes as a tractable model. By using the nonequilibrium statistical physics of dynamical phase transitions, we demonstrate that multimodal generation is governed by a spectral hierarchy of interaction timescales rather than simultaneous resolution. A key prediction is the ``synchronization gap’', a temporal window during the reverse generative process where distinct eigenmodes stabilize at different rates, providing a theoretical explanation for common desynchronization artifacts. We derive analytical conditions for speciation and collapse times under both symmetric and anisotropic coupling regimes, establishing strict bounds for coupling strength to avoid unstable symmetry breaking. We show that the coupling strength acts as a spectral filter that enforces a tunable temporal hierarchy on generation. We support these predictions through controlled experiments with diffusion models trained on MNIST datasets and exact score samplers. These results motivate time dependent coupling schedules that target mode specific timescales, offering a potential alternative to ad hoc guidance tuning.

[LG-8] Interval-Based AUC (iAUC): Extending ROC Analysis to Uncertainty-Aware Classification

链接: https://arxiv.org/abs/2602.04775
作者: Yuqi Li,Matthew M. Engelhard
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In high-stakes risk prediction, quantifying uncertainty through interval-valued predictions is essential for reliable decision-making. However, standard evaluation tools like the receiver operating characteristic (ROC) curve and the area under the curve (AUC) are designed for point scores and fail to capture the impact of predictive uncertainty on ranking performance. We propose an uncertainty-aware ROC framework specifically for interval-valued predictions, introducing two new measures: AUC_L and AUC_U . This framework enables an informative three-region decomposition of the ROC plane, partitioning pairwise rankings into correct, incorrect, and uncertain orderings. This approach naturally supports selective prediction by allowing models to abstain from ranking cases with overlapping intervals, thereby optimizing the trade-off between abstention rate and discriminative reliability. We prove that under valid class-conditional coverage, AUC_L and AUC_U provide formal lower and upper bounds on the theoretical optimal AUC ( AUC^* ), characterizing the physical limit of achievable discrimination. The proposed framework applies broadly to interval-valued prediction models, regardless of the interval construction method. Experiments on real-world benchmark datasets, using bootstrap-based intervals as one instantiation, validate the framework’s correctness and demonstrate its practical utility for uncertainty-aware evaluation and decision-making.

[LG-9] NeuroCanvas: VLLM -Powered Robust Seizure Detection by Reformulating Multichannel EEG as Image

链接: https://arxiv.org/abs/2602.04769
作者: Yan Chen,Jie Peng,Moajjem Hossain Chowdhury,Tianlong Chen,Yunmei Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate and timely seizure detection from Electroencephalography (EEG) is critical for clinical intervention, yet manual review of long-term recordings is labor-intensive. Recent efforts to encode EEG signals into large language models (LLMs) show promise in handling neural signals across diverse patients, but two significant challenges remain: (1) multi-channel heterogeneity, as seizure-relevant information varies substantially across EEG channels, and (2) computing inefficiency, as the EEG signals need to be encoded into a massive number of tokens for the prediction. To address these issues, we draw the EEG signal and propose the novel NeuroCanvas framework. Specifically, NeuroCanvas consists of two modules: (i) The Entropy-guided Channel Selector (ECS) selects the seizure-relevant channels input to LLM and (ii) the following Canvas of Neuron Signal (CNS) converts selected multi-channel heterogeneous EEG signals into structured visual representations. The ECS module alleviates the multi-channel heterogeneity issue, and the CNS uses compact visual tokens to represent the EEG signals that improve the computing efficiency. We evaluate NeuroCanvas across multiple seizure detection datasets, demonstrating a significant improvement of 20% in F1 score and reductions of 88% in inference latency. These results highlight NeuroCanvas as a scalable and effective solution for real-time and resource-efficient seizure detection in clinical this http URL code will be released at this https URL.

[LG-10] Improved Dimension Dependence for Bandit Convex Optimization with Gradient Variations

链接: https://arxiv.org/abs/2602.04761
作者: Hang Yu,Yu-Hu Yan,Peng Zhao
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Gradient-variation online learning has drawn increasing attention due to its deep connections to game theory, optimization, etc. It has been studied extensively in the full-information setting, but is underexplored with bandit feedback. In this work, we focus on gradient variation in Bandit Convex Optimization (BCO) with two-point feedback. By proposing a refined analysis on the non-consecutive gradient variation, a fundamental quantity in gradient variation with bandits, we improve the dimension dependence for both convex and strongly convex functions compared with the best known results (Chiang et al., 2013). Our improved analysis for the non-consecutive gradient variation also implies other favorable problem-dependent guarantees, such as gradient-variance and small-loss regrets. Beyond the two-point setup, we demonstrate the versatility of our technique by achieving the first gradient-variation bound for one-point bandit linear optimization over hyper-rectangular domains. Finally, we validate the effectiveness of our results in more challenging tasks such as dynamic/universal regret minimization and bandit games, establishing the first gradient-variation dynamic and universal regret bounds for two-point BCO and fast convergence rates in bandit games.

[LG-11] A Dual-TransUNet Deep Learning Framework for Multi-Source Precipitation Merging and Improving Seasonal and Extreme Estimates

链接: https://arxiv.org/abs/2602.04757
作者: Yuchen Ye,Zixuan Qi,Shixuan Li,Wei Qi,Yanpeng Cai,Chaoxia Yuan
类目: Machine Learning (cs.LG)
*备注: 75 pages,20 figures

点击查看摘要

Abstract:Multi-source precipitation products (MSPs) from satellite retrievals and reanalysis are widely used for hydroclimatic monitoring, yet spatially heterogeneous biases and limited skill for extremes still constrain their hydrologic utility. Here we develop a dual-stage TransUNet-based multi-source precipitation merging framework (DDL-MSPMF) that integrates six MSPs with four ERA5 near-surface physical predictors. A first-stage classifier estimates daily precipitation occurrence probability, and a second-stage regressor fuses the classifier outputs together with all predictors to estimate daily precipitation amount at 0.25 degree resolution over China for 2001-2020. Benchmarking against multiple deep learning and hybrid baselines shows that the TransUNet - TransUNet configuration yields the best seasonal performance (R = 0.75; RMSE = 2.70 mm/day) and improves robustness relative to a single-regressor setting. For heavy precipitation (25 mm/day), DDL-MSPMF increases equitable threat scores across most regions of eastern China and better reproduces the spatial pattern of the July 2021 Zhengzhou rainstorm, indicating enhanced extreme-event detection beyond seasonal-mean corrections. Independent evaluation over the Qinghai-Tibet Plateau using TPHiPr further supports its applicability in data-scarce regions. SHAP analysis highlights the importance of precipitation occurrence probabilities and surface pressure, providing physically interpretable diagnostics. The proposed framework offers a scalable and explainable approach for precipitation fusion and extreme-event assessment.

[LG-12] Decomposing Query-Key Feature Interactions Using Contrastive Covariances

链接: https://arxiv.org/abs/2602.04752
作者: Andrew Lee,Yonatan Belinkov,Fernanda Viégas,Martin Wattenberg
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite the central role of attention heads in Transformers, we lack tools to understand why a model attends to a particular token. To address this, we study the query-key (QK) space – the bilinear joint embedding space between queries and keys. We present a contrastive covariance method to decompose the QK space into low-rank, human-interpretable components. It is when features in keys and queries align in these low-rank subspaces that high attention scores are produced. We first study our method both analytically and empirically in a simplified setting. We then apply our method to large language models to identify human-interpretable QK subspaces for categorical semantic features and binding features. Finally, we demonstrate how attention scores can be attributed to our identified features.

[LG-13] Rationality Measurement and Theory for Reinforcement Learning Agents

链接: https://arxiv.org/abs/2602.04737
作者: Kejiang Qian,Amos Storkey,Fengxiang He
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper proposes a suite of rationality measures and associated theory for reinforcement learning agents, a property increasingly critical yet rarely explored. We define an action in deployment to be perfectly rational if it maximises the hidden true value function in the steepest direction. The expected value discrepancy of a policy’s actions against their rational counterparts, culminating over the trajectory in deployment, is defined to be expected rational risk; an empirical average version in training is also defined. Their difference, termed as rational risk gap, is decomposed into (1) an extrinsic component caused by environment shifts between training and deployment, and (2) an intrinsic one due to the algorithm’s generalisability in a dynamic environment. They are upper bounded by, respectively, (1) the 1 -Wasserstein distance between transition kernels and initial state distributions in training and deployment, and (2) the empirical Rademacher complexity of the value function class. Our theory suggests hypotheses on the benefits from regularisers (including layer normalisation, \ell_2 regularisation, and weight normalisation) and domain randomisation, as well as the harm from environment shifts. Experiments are in full agreement with these hypotheses. The code is available at this https URL.

[LG-14] DMFlow: Disordered Materials Generation by Flow Matching

链接: https://arxiv.org/abs/2602.04734
作者: Liming Wu,Rui Jiao,Qi Li,Mingze Li,Songyou Li,Shifeng Jin,Wenbing Huang
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:

点击查看摘要

Abstract:The design of materials with tailored properties is crucial for technological progress. However, most deep generative models focus exclusively on perfectly ordered crystals, neglecting the important class of disordered materials. To address this gap, we introduce DMFlow, a generative framework specifically designed for disordered crystals. Our approach introduces a unified representation for ordered, Substitutionally Disordered (SD), and Positionally Disordered (PD) crystals, and employs a flow matching model to jointly generate all structural components. A key innovation is a Riemannian flow matching framework with spherical reparameterization, which ensures physically valid disorder weights on the probability simplex. The vector field is learned by a novel Graph Neural Network (GNN) that incorporates physical symmetries and a specialized message-passing scheme. Finally, a two-stage discretization procedure converts the continuous weights into multi-hot atomic assignments. To support research in this area, we release a benchmark containing SD, PD, and mixed structures curated from the Crystallography Open Database. Experiments on Crystal Structure Prediction (CSP) and De Novo Generation (DNG) tasks demonstrate that DMFlow significantly outperforms state-of-the-art baselines adapted from ordered crystal generation. We hope our work provides a foundation for the AI-driven discovery of disordered materials.

[LG-15] Benchmarking and Enhancing PPG-Based Cuffless Blood Pressure Estimation Methods

链接: https://arxiv.org/abs/2602.04725
作者: Neville Mathew,Yidan Shen,Renjie Hu,Maham Rahimi,George Zouridakis
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Cuffless blood pressure screening based on easily acquired photoplethysmography (PPG) signals offers a practical pathway toward scalable cardiovascular health assessment. Despite rapid progress, existing PPG-based blood pressure estimation models have not consistently achieved the established clinical numerical limits such as AAMI/ISO 81060-2, and prior evaluations often lack the rigorous experimental controls necessary for valid clinical assessment. Moreover, the publicly available datasets commonly used are heterogeneous and lack physiologically controlled conditions for fair benchmarking. To enable fair benchmarking under physiologically controlled conditions, we created a standardized benchmarking subset NBPDB comprising 101,453 high-quality PPG segments from 1,103 healthy adults, derived from MIMIC-III and VitalDB. Using this dataset, we systematically benchmarked several state-of-the-art PPG-based models. The results showed that none of the evaluated models met the AAMI/ISO 81060-2 accuracy requirements (mean error 5 mmHg and standard deviation 8 mmHg). To improve model accuracy, we modified these models and added patient demographic data such as age, sex, and body mass index as additional inputs. Our modifications consistently improved performance across all models. In particular, the MInception model reduced error by 23% after adding the demographic data and yielded mean absolute errors of 4.75 mmHg (SBP) and 2.90 mmHg (DBP), achieves accuracy comparable to the numerical limits defined by AAMI/ISO accuracy standards. Our results show that existing PPG-based BP estimation models lack clinical practicality under standardized conditions, while incorporating demographic information markedly improves their accuracy and physiological validity.

[LG-16] Bounded-Abstention Multi-horizon Time-series Forecasting

链接: https://arxiv.org/abs/2602.04714
作者: Luca Stradiotti,Laurens Devos,Anna Monreale,Jesse Davis,Andrea Pugnana
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-horizon time-series forecasting involves simultaneously making predictions for a consecutive sequence of subsequent time steps. This task arises in many application domains, such as healthcare and finance, where mispredictions can have a high cost and reduce trust. The learning with abstention framework tackles these problems by allowing a model to abstain from offering a prediction when it is at an elevated risk of making a misprediction. Unfortunately, existing abstention strategies are ill-suited for the multi-horizon setting: they target problems where a model offers a single prediction for each instance. Hence, they ignore the structured and correlated nature of the predictions offered by a multi-horizon forecaster. We formalize the problem of learning with abstention for multi-horizon forecasting setting and show that its structured nature admits a richer set of abstention problems. Concretely, we propose three natural notions of how a model could abstain for multi-horizon forecasting. We theoretically analyze each problem to derive the optimal abstention strategy and propose an algorithm that implements it. Extensive evaluation on 24 datasets shows that our proposed algorithms significantly outperforms existing baselines.

[LG-17] owards Understanding and Avoiding Limitations of Convolutions on Graphs

链接: https://arxiv.org/abs/2602.04709
作者: Andreas Roth
类目: Machine Learning (cs.LG)
*备注: dissertation

点击查看摘要

Abstract:While message-passing neural networks (MPNNs) have shown promising results, their real-world impact remains limited. Although various limitations have been identified, their theoretical foundations remain poorly understood, leading to fragmented research efforts. In this thesis, we provide an in-depth theoretical analysis and identify several key properties limiting the performance of MPNNs. Building on these findings, we propose several frameworks that address these shortcomings. We identify two properties exhibited by many MPNNs: shared component amplification (SCA), where each message-passing iteration amplifies the same components across all feature channels, and component dominance (CD), where a single component gets increasingly amplified as more message-passing steps are applied. These properties lead to the observable phenomenon of rank collapse of node representations, which generalizes the established over-smoothing phenomenon. By generalizing and decomposing over-smoothing, we enable a deeper understanding of MPNNs, more targeted solutions, and more precise communication within the field. To avoid SCA, we show that utilizing multiple computational graphs or edge relations is necessary. Our multi-relational split (MRS) framework transforms any existing MPNN into one that leverages multiple edge relations. Additionally, we introduce the spectral graph convolution for multiple feature channels (MIMO-GC), which naturally uses multiple computational graphs. A localized variant, LMGC, approximates the MIMO-GC while inheriting its beneficial properties. To address CD, we demonstrate a close connection between MPNNs and the PageRank algorithm. Based on personalized PageRank, we propose a variant of MPNNs that allows for infinitely many message-passing iterations, while preserving initial node features. Collectively, these results deepen the theoretical understanding of MPNNs.

[LG-18] Static and auto-regressive neural emulation of phytoplankton biomass dynamics from physical predictors in the global ocean

链接: https://arxiv.org/abs/2602.04689
作者: Mahima Lakra,Ronan Fablet,Lucas Drumetz,Etienne Pauthenet,Elodie Martinez
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Phytoplankton is the basis of marine food webs, driving both ecological processes and global biogeochemical cycles. Despite their ecological and climatic significance, accurately simulating phytoplankton dynamics remains a major challenge for biogeochemical numerical models due to limited parameterizations, sparse observational data, and the complexity of oceanic processes. Here, we explore how deep learning models can be used to address these limitations predicting the spatio-temporal distribution of phytoplankton biomass in the global ocean based on satellite observations and environmental conditions. First, we investigate several deep learning architectures. Among the tested models, the UNet architecture stands out for its ability to reproduce the seasonal and interannual patterns of phytoplankton biomass more accurately than other models like CNNs, ConvLSTM, and 4CastNet. When using one to two months of environmental data as input, UNet performs better, although it tends to underestimate the amplitude of low-frequency changes in phytoplankton biomass. Thus, to improve predictions over time, an auto-regressive version of UNet was also tested, where the model uses its own previous predictions to forecast future conditions. This approach works well for short-term forecasts (up to five months), though its performance decreases for longer time scales. Overall, our study shows that combining ocean physical predictors with deep learning allows for reconstruction and short-term prediction of phytoplankton dynamics. These models could become powerful tools for monitoring ocean health and supporting marine ecosystem management, especially in the context of climate change.

[LG-19] Generalized Schrödinger Bridge on Graphs

链接: https://arxiv.org/abs/2602.04675
作者: Panagiotis Theodoropoulos,Juno Nam,Evangelos Theodorou,Jaemoo Choi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transportation on graphs is a fundamental challenge across many domains, where decisions must respect topological and operational constraints. Despite the need for actionable policies, existing graph-transport methods lack this expressivity. They rely on restrictive assumptions, fail to generalize across sparse topologies, and scale poorly with graph size and time horizon. To address these issues, we introduce Generalized Schrödinger Bridge on Graphs (GSBoG), a novel scalable data-driven framework for learning executable controlled continuous-time Markov chain (CTMC) policies on arbitrary graphs under state cost augmented dynamics. Notably, GSBoG learns trajectory-level policies, avoiding dense global solvers and thereby enhancing scalability. This is achieved via a likelihood optimization approach, satisfying the endpoint marginals, while simultaneously optimizing intermediate behavior under state-dependent running costs. Extensive experimentation on challenging real-world graph topologies shows that GSBoG reliably learns accurate, topology-respecting policies while optimizing application-specific intermediate state costs, highlighting its broad applicability and paving new avenues for cost-aware dynamical transport on general graphs.

[LG-20] Inference-Time Backdoors via Hidden Instructions in LLM Chat Templates

链接: https://arxiv.org/abs/2602.04653
作者: Ariel Fogel,Omer Hofman,Eilon Cohen,Roman Vainshtein
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Open-weight language models are increasingly used in production settings, raising new security challenges. One prominent threat in this context is backdoor attacks, in which adversaries embed hidden behaviors in language models that activate under specific conditions. Previous work has assumed that adversaries have access to training pipelines or deployment infrastructure. We propose a novel attack surface requiring neither, which utilizes the chat template. Chat templates are executable Jinja2 programs invoked at every inference call, occupying a privileged position between user input and model processing. We show that an adversary who distributes a model with a maliciously modified template can implant an inference-time backdoor without modifying model weights, poisoning training data, or controlling runtime infrastructure. We evaluated this attack vector by constructing template backdoors targeting two objectives: degrading factual accuracy and inducing emission of attacker-controlled URLs, and applied them across eighteen models spanning seven families and four inference engines. Under triggered conditions, factual accuracy drops from 90% to 15% on average while attacker-controlled URLs are emitted with success rates exceeding 80%; benign inputs show no measurable degradation. Backdoors generalize across inference runtimes and evade all automated security scans applied by the largest open-weight distribution platform. These results establish chat templates as a reliable and currently undefended attack surface in the LLM supply chain.

[LG-21] SAFE: Stable Alignment Finetuning with Entropy-Aware Predictive Control for RLHF

链接: https://arxiv.org/abs/2602.04651
作者: Dipan Maity
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Optimization (PPO) has been positioned by recent literature as the canonical method for the RL part of RLHF. PPO performs well empirically but has a heuristic motivation and handles the KL-divergence constraint used in LM-RLHF in an ad-hoc manner and suffers form reward oscillations, entropy collapse, value function drift, and sudden policy divergence that require frequent restarts and extensive hyperparameter tuning. In this paper, we develop a new pure on policy actor-critic RL method for the LM-RLHF setting. We present SAFE (Stable Alignment Finetuning with Entropy-aware control),a novel RLHF algorithm that combines a Double Soft-Min Critic for pessimistic value estimation with a new multi-layer stabilization framework combining entropy-gated KL regulation, and PID-controlled adaptive thresholds. Unlike standard PPO’s symmetric KL penalties, SAFE distinguishes high-entropy exploration from low-entropy mode collapse and adjusts penalties dynamically based on reward velocity. Experiments on a 3B parameter model show SAFE achieves +5.15% training-average reward than PPO (0.725 vs 0.689), negligible reward crashes, and superior KL control than ppo . Our method adds minimal computational overhead and provides an interpretable, crash-resistant RLHF framework that maintains aggressive learning speed while ensuring stable long-horizon optimization suitable for production deployment. Code is available at this https URL

[LG-22] MTS-JEPA: Multi-Resolution Joint-Embedding Predictive Architecture for Time-Series Anomaly Prediction

链接: https://arxiv.org/abs/2602.04643
作者: Yanan He,Yunshi Wen,Xin Wang,Tengfei Ma
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multivariate time series underpin modern critical infrastructure, making the prediction of anomalies a vital necessity for proactive risk mitigation. While Joint-Embedding Predictive Architectures (JEPA) offer a promising framework for modeling the latent evolution of these systems, their application is hindered by representation collapse and an inability to capture precursor signals across varying temporal scales. To address these limitations, we propose MTS-JEPA, a specialized architecture that integrates a multi-resolution predictive objective with a soft codebook bottleneck. This design explicitly decouples transient shocks from long-term trends, and utilizes the codebook to capture discrete regime transitions. Notably, we find this constraint also acts as an intrinsic regularizer to ensure optimization stability. Empirical evaluations on standard benchmarks confirm that our approach effectively prevents degenerate solutions and achieves state-of-the-art performance under the early-warning protocol.

[LG-23] RIGA-Fold: A General Framework for Protein Inverse Folding via Recurrent Interaction and Geometric Awareness

链接: https://arxiv.org/abs/2602.04637
作者: Sisi Yuan,Jiehuang Chen,Junchuang Cai,Dong Xu,Xueliang Li,Zexuan Zhu,Junkai Ji
类目: Machine Learning (cs.LG)
*备注: 16 pages, 4 figures. Includes appendix. Preprint under review

点击查看摘要

Abstract:Protein inverse folding, the task of predicting amino acid sequences for desired structures, is pivotal for de novo protein design. However, existing GNN-based methods typically suffer from restricted receptive fields that miss long-range dependencies and a “single-pass” inference paradigm that leads to error accumulation. To address these bottlenecks, we propose RIGA-Fold, a framework that synergizes Recurrent Interaction with Geometric Awareness. At the micro-level, we introduce a Geometric Attention Update (GAU) module where edge features explicitly serve as attention keys, ensuring strictly SE(3)-invariant local encoding. At the macro-level, we design an attention-based Global Context Bridge that acts as a soft gating mechanism to dynamically inject global topological information. Furthermore, to bridge the gap between structural and sequence modalities, we introduce an enhanced variant, RIGA-Fold*, which integrates trainable geometric features with frozen evolutionary priors from ESM-2 and ESM-IF via a dual-stream architecture. Finally, a biologically inspired ``predict-recycle-refine’’ strategy is implemented to iteratively denoise sequence distributions. Extensive experiments on CATH 4.2, TS50, and TS500 benchmarks demonstrate that our geometric framework is highly competitive, while RIGA-Fold* significantly outperforms state-of-the-art baselines in both sequence recovery and structural consistency.

[LG-24] QUATRO: Query-Adaptive Trust Region Policy Optimization for LLM Fine-tuning

链接: https://arxiv.org/abs/2602.04620
作者: Doyeon Lee,Eunyi Lyou,Hyunsoo Cho,Sookyung Kim,Joonseok Lee,Jaemoo Choi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:GRPO-style reinforcement learning (RL)-based LLM fine-tuning algorithms have recently gained popularity. Relying on heuristic trust-region approximations, however, they can lead to brittle optimization behavior, as global importance-ratio clipping and group-wise normalization fail to regulate samples whose importance ratios fall outside the clipping range. We propose Query-Adaptive Trust-Region policy Optimization (QUATRO), which directly enforces trust-region constraints through a principled optimization. This yields a clear and interpretable objective that enables explicit control over policy updates and stable, entropy-controlled optimization, with a stabilizer terms arising intrinsically from the exact trust-region formulation. Empirically verified on diverse mathematical reasoning benchmarks, QUATRO shows stable training under increased policy staleness and aggressive learning rates, maintaining well-controlled entropy throughout training.

[LG-25] Resilient Load Forecasting under Climate Change: Adaptive Conditional Neural Processes for Few-Shot Extreme Load Forecasting

链接: https://arxiv.org/abs/2602.04609
作者: Chenxi Hu,Yue Ma,Yifan Wu,Yunhe Hou
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Extreme weather can substantially change electricity consumption behavior, causing load curves to exhibit sharp spikes and pronounced volatility. If forecasts are inaccurate during those periods, power systems are more likely to face supply shortfalls or localized overloads, forcing emergency actions such as load shedding and increasing the risk of service disruptions and public-safety impacts. This problem is inherently difficult because extreme events can trigger abrupt regime shifts in load patterns, while relevant extreme samples are rare and irregular, making reliable learning and calibration challenging. We propose AdaCNP, a probabilistic forecasting model for data-scarce condition. AdaCNP learns similarity in a shared embedding space. For each target data, it evaluates how relevant each historical context segment is to the current condition and reweights the context information accordingly. This design highlights the most informative historical evidence even when extreme samples are rare. It enables few-shot adaptation to previously unseen extreme patterns. AdaCNP also produces predictive distributions for risk-aware decision-making without expensive fine-tuning on the target domain. We evaluate AdaCNP on real-world power-system load data and compare it against a range of representative baselines. The results show that AdaCNP is more robust during extreme periods, reducing the mean squared error by 22% relative to the strongest baseline while achieving the lowest negative log-likelihood, indicating more reliable probabilistic outputs. These findings suggest that AdaCNP can effectively mitigate the combined impact of abrupt distribution shifts and scarce extreme samples, providing a more trustworthy forecasting for resilient power system operation under extreme events.

[LG-26] Jacobian Regularization Stabilizes Long-Term Integration of Neural Differential Equations

链接: https://arxiv.org/abs/2602.04608
作者: Maya Janvier,Julien Salomon,Etienne Meunier
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Hybrid models and Neural Differential Equations (NDE) are getting increasingly important for the modeling of physical systems, however they often encounter stability and accuracy issues during long-term integration. Training on unrolled trajectories is known to limit these divergences but quickly becomes too expensive due to the need for computing gradients over an iterative process. In this paper, we demonstrate that regularizing the Jacobian of the NDE model via its directional derivatives during training stabilizes long-term integration in the challenging context of short training rollouts. We design two regularizations, one for the case of known dynamics where we can directly derive the directional derivatives of the dynamic and one for the case of unknown dynamics where they are approximated using finite differences. Both methods, while having a far lower cost compared to long rollouts during training, are successful in improving the stability of long-term simulations for several ordinary and partial differential equations, opening up the door to training NDE methods for long-term integration of large scale systems.

[LG-27] Stochastic Decision Horizons for Constrained Reinforcement Learning

链接: https://arxiv.org/abs/2602.04599
作者: Nikola Milosevic,Leonard Franz,Daniel Haeufle,Georg Martius,Nico Scherf,Pavel Kolev
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Constrained Markov decision processes (CMDPs) provide a principled model for handling constraints, such as safety and other auxiliary objectives, in reinforcement learning. The common approach of using additive-cost constraints and dual variables often hinders off-policy scalability. We propose a Control as Inference formulation based on stochastic decision horizons, where constraint violations attenuate reward contributions and shorten the effective planning horizon via state-action-dependent continuation. This yields survival-weighted objectives that remain replay-compatible for off-policy actor-critic learning. We propose two violation semantics, absorbing and virtual termination, that share the same survival-weighted return but result in distinct optimization structures that lead to SAC/MPO-style policy improvement. Experiments demonstrate improved sample efficiency and favorable return-violation trade-offs on standard benchmarks. Moreover, MPO with virtual termination (VT-MPO) scales effectively to our high-dimensional musculoskeletal Hyfydy setup.

[LG-28] Probabilistic Label Spreading: Efficient and Consistent Estimation of Soft Labels with Epistemic Uncertainty on Graphs

链接: https://arxiv.org/abs/2602.04574
作者: Jonathan Klees,Tobias Riedlinger,Peter Stehr,Bennet Böddecker,Daniel Kondermann,Matthias Rottmann
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Safe artificial intelligence for perception tasks remains a major challenge, partly due to the lack of data with high-quality labels. Annotations themselves are subject to aleatoric and epistemic uncertainty, which is typically ignored during annotation and evaluation. While crowdsourcing enables collecting multiple annotations per image to estimate these uncertainties, this approach is impractical at scale due to the required annotation effort. We introduce a probabilistic label spreading method that provides reliable estimates of aleatoric and epistemic uncertainty of labels. Assuming label smoothness over the feature space, we propagate single annotations using a graph-based diffusion method. We prove that label spreading yields consistent probability estimators even when the number of annotations per data point converges to zero. We present and analyze a scalable implementation of our method. Experimental results indicate that, compared to baselines, our approach substantially reduces the annotation budget required to achieve a desired label quality on common image datasets and achieves a new state of the art on the Data-Centric Image Classification benchmark.

[LG-29] Finding Structure in Continual Learning NEURIPS2025

链接: https://arxiv.org/abs/2602.04555
作者: Pourya Shamsolmoali,Masoumeh Zareapoor
类目: Machine Learning (cs.LG)
*备注: Submitted to NeurIPS 2025

点击查看摘要

Abstract:Learning from a stream of tasks usually pits plasticity against stability: acquiring new knowledge often causes catastrophic forgetting of past information. Most methods address this by summing competing loss terms, creating gradient conflicts that are managed with complex and often inefficient strategies such as external memory replay or parameter regularization. We propose a reformulation of the continual learning objective using Douglas-Rachford Splitting (DRS). This reframes the learning process not as a direct trade-off, but as a negotiation between two decoupled objectives: one promoting plasticity for new tasks and the other enforcing stability of old knowledge. By iteratively finding a consensus through their proximal operators, DRS provides a more principled and stable learning dynamic. Our approach achieves an efficient balance between stability and plasticity without the need for auxiliary modules or complex add-ons, providing a simpler yet more powerful paradigm for continual learning systems.

[LG-30] Gradient Flow Through Diagram Expansions: Learning Regimes and Explicit Solutions ICML’2026

链接: https://arxiv.org/abs/2602.04548
作者: Dmitry Yarotsky,Eugene Golikov,Yaroslav Gusev
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 48 pages, under review for ICML’2026

点击查看摘要

Abstract:We develop a general mathematical framework to analyze scaling regimes and derive explicit analytic solutions for gradient flow (GF) in large learning problems. Our key innovation is a formal power series expansion of the loss evolution, with coefficients encoded by diagrams akin to Feynman diagrams. We show that this expansion has a well-defined large-size limit that can be used to reveal different learning phases and, in some cases, to obtain explicit solutions of the nonlinear GF. We focus on learning Canonical Polyadic (CP) decompositions of high-order tensors, and show that this model has several distinct extreme lazy and rich GF regimes such as free evolution, NTK and under- and over-parameterized mean-field. We show that these regimes depend on the parameter scaling, tensor order, and symmetry of the model in a specific and subtle way. Moreover, we propose a general approach to summing the formal loss expansion by reducing it to a PDE; in a wide range of scenarios, it turns out to be 1st order and solvable by the method of characteristics. We observe a very good agreement of our theoretical predictions with experiment.

[LG-31] Forget to Generalize: Iterative Adaptation for Generalization in Federated Learning

链接: https://arxiv.org/abs/2602.04536
作者: Abdulrahman Alotaibi,Irene Tenison,Miriam Kim,Isaac Lee,Lalana Kagal
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Web is naturally heterogeneous with user devices, geographic regions, browsing patterns, and contexts all leading to highly diverse, unique datasets. Federated Learning (FL) is an important paradigm for the Web because it enables privacy-preserving, collaborative machine learning across diverse user devices, web services and clients without needing to centralize sensitive data. However, its performance degrades severely under non-IID client distributions that is prevalent in real-world web systems. In this work, we propose a new training paradigm - Iterative Federated Adaptation (IFA) - that enhances generalization in heterogeneous federated settings through generation-wise forget and evolve strategy. Specifically, we divide training into multiple generations and, at the end of each, select a fraction of model parameters (a) randomly or (b) from the later layers of the model and reinitialize them. This iterative forget and evolve schedule allows the model to escape local minima and preserve globally relevant representations. Extensive experiments on CIFAR-10, MIT-Indoors, and Stanford Dogs datasets show that the proposed approach improves global accuracy, especially when the data cross clients are Non-IID. This method can be implemented on top any federated algorithm to improve its generalization performance. We observe an average of 21.5%improvement across datasets. This work advances the vision of scalable, privacy-preserving intelligence for real-world heterogeneous and distributed web systems.

[LG-32] Greedy-Gnorm: A Gradient Matrix Norm-Based Alternative to Attention Entropy for Head Pruning

链接: https://arxiv.org/abs/2602.04491
作者: Yuxi Guo,Paul Sheridan
类目: Machine Learning (cs.LG)
*备注: 24 pages, 5 figures, 5 tables

点击查看摘要

Abstract:Attention head pruning has emerged as an effective technique for transformer model compression, an increasingly important goal in the era of Green AI. However, existing pruning methods often rely on static importance scores, which fail to capture the evolving role of attention heads during iterative removal. We propose Greedy-Gradient norm (Greedy-Gnorm), a novel head pruning algorithm that dynamically recalculates head importance after each pruning step. Specifically, each head is scored by the elementwise product of the l2-norms of its Q/K/V gradient blocks, as estimated from a hold-out validation set and updated at every greedy iteration. This dynamic approach to scoring mitigates against stale rankings and better reflects gradient-informed importance as pruning progresses. Extensive experiments on BERT, ALBERT, RoBERTa, and XLM-RoBERTa demonstrate that Greedy-Gnorm consistently preserves accuracy under substantial head removal, outperforming attention entropy. By effectively reducing model size while maintaining task performance, Greedy-Gnorm offers a promising step toward more energy-efficient transformer model deployment.

[LG-33] Hand Gesture Recognition from Doppler Radar Signals Using Echo State Networks IJCNN2026

链接: https://arxiv.org/abs/2602.04436
作者: Towa Sano,Gouhei Tanaka
类目: Machine Learning (cs.LG)
*备注: Submitted to IJCNN 2026. 21 pages, 10figures

点击查看摘要

Abstract:Hand gesture recognition (HGR) is a fundamental technology in human computer interaction (HCI).In particular, HGR based on Doppler radar signals is suited for in-vehicle interfaces and robotic systems, necessitating lightweight and computationally efficient recognition techniques. However, conventional deep learning-based methods still suffer from high computational costs. To address this issue, we propose an Echo State Network (ESN) approach for radar-based HGR, using frequency-modulated-continuous-wave (FMCW) radar signals. Raw radar data is first converted into feature maps, such as range-time and Doppler-time maps, which are then fed into one or more recurrent neural network-based reservoirs. The obtained reservoir states are processed by readout classifiers, including ridge regression, support vector machines, and random forests. Comparative experiments demonstrate that our method outperforms existing approaches on an 11-class HGR task using the Soli dataset and surpasses existing deep learning models on a 4-class HGR task using the Dop-NET dataset. The results indicate that parallel processing using multi-reservoir ESNs are effective for recognizing temporal patterns from the multiple different feature maps in the time-space and time-frequency domains. Our ESN approaches achieve high recognition performance with low computational cost in HGR, showing great potential for more advanced HCI technologies, especially in resource-constrained environments.

[LG-34] MaMa: A Game-Theoretic Approach for Designing Safe Agent ic Systems

链接: https://arxiv.org/abs/2602.04431
作者: Jonathan Nöther,Adish Singla,Goran Radanovic
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:LLM-based multi-agent systems have demonstrated impressive capabilities, but they also introduce significant safety risks when individual agents fail or behave adversarially. In this work, we study the automated design of agentic systems that remain safe even when a subset of agents is compromised. We formalize this challenge as a Stackelberg security game between a system designer (the Meta-Agent) and a best-responding Meta-Adversary that selects and compromises a subset of agents to minimize safety. We propose Meta-Adversary-Meta-Agent (MaMa), a novel algorithm for approximately solving this game and automatically designing safe agentic systems. Our approach uses LLM-based adversarial search, where the Meta-Agent iteratively proposes system designs and receives feedback based on the strongest attacks discovered by the Meta-Adversary. Empirical evaluations across diverse environments show that systems designed with MaMa consistently defend against worst-case attacks while maintaining performance comparable to systems optimized solely for task success. Moreover, the resulting systems generalize to stronger adversaries, as well as ones with different attack objectives or underlying LLMs, demonstrating robust safety beyond the training setting.

[LG-35] HoRD: Robust Humanoid Control via History-Conditioned Reinforcement Learning and Online Distillation

链接: https://arxiv.org/abs/2602.04412
作者: Puyue Wang,Jiawei Hu,Yan Gao,Junyan Wang,Yu Zhang,Gillian Dobbie,Tao Gu,Wafa Johal,Ting Dang,Hong Jia
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Humanoid robots can suffer significant performance drops under small changes in dynamics, task specifications, or environment setup. We propose HoRD, a two-stage learning framework for robust humanoid control under domain shift. First, we train a high-performance teacher policy via history-conditioned reinforcement learning, where the policy infers latent dynamics context from recent state–action trajectories to adapt online to diverse randomized dynamics. Second, we perform online distillation to transfer the teacher’s robust control capabilities into a transformer-based student policy that operates on sparse root-relative 3D joint keypoint trajectories. By combining history-conditioned adaptation with online distillation, HoRD enables a single policy to adapt zero-shot to unseen domains without per-domain retraining. Extensive experiments show HoRD outperforms strong baselines in robustness and transfer, especially under unseen domains and external perturbations. Code and project page are available at \hrefthis https URLthis https URL.

[LG-36] Separation-Utility Pareto Frontier: An Information-Theoretic Characterization

链接: https://arxiv.org/abs/2602.04408
作者: Shizhou Xu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the Pareto frontier (optimal trade-off) between utility and separation, a fairness criterion requiring predictive independence from sensitive attributes conditional on the true outcome. Through an information-theoretic lens, we prove a characterization of the utility-separation Pareto frontier, establish its concavity, and thereby prove the increasing marginal cost of separation in terms of utility. In addition, we characterize the conditions under which this trade-off becomes strict, providing a guide for trade-off selection in practice. Based on the theoretical characterization, we develop an empirical regularizer based on conditional mutual information (CMI) between predictions and sensitive attributes given the true outcome. The CMI regularizer is compatible with any deep model trained via gradient-based optimization and serves as a scalar monitor of residual separation violations, offering tractable guarantees during training. Finally, numerical experiments support our theoretical findings: across COMPAS, UCI Adult, UCI Bank, and CelebA, the proposed method substantially reduces separation violations while matching or exceeding the utility of established baseline methods. This study thus offers a provable, stable, and flexible approach to enforcing separation in deep learning.

[LG-37] heory of Speciation Transitions in Diffusion Models with General Class Structure

链接: https://arxiv.org/abs/2602.04404
作者: Beatrice Achilli,Marco Benedetti,Giulio Biroli,Marc Mézard
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)
*备注: 17 pages, 6 figures

点击查看摘要

Abstract:Diffusion Models generate data by reversing a stochastic diffusion process, progressively transforming noise into structured samples drawn from a target distribution. Recent theoretical work has shown that this backward dynamics can undergo sharp qualitative transitions, known as speciation transitions, during which trajectories become dynamically committed to data classes. Existing theoretical analyses, however, are limited to settings where classes are identifiable through first moments, such as mixtures of Gaussians with well-separated means. In this work, we develop a general theory of speciation in diffusion models that applies to arbitrary target distributions admitting well-defined classes. We formalize the notion of class structure through Bayes classification and characterize speciation times in terms of free-entropy difference between classes. This criterion recovers known results in previously studied Gaussian-mixture models, while extending to situations in which classes are not distinguishable by first moments and may instead differ through higher-order or collective features. Our framework also accommodates multiple classes and predicts the existence of successive speciation times associated with increasingly fine-grained class commitment. We illustrate the theory on two analytically tractable examples: mixtures of one-dimensional Ising models at different temperatures and mixtures of zero-mean Gaussians with distinct covariance structures. In the Ising case, we obtain explicit expressions for speciation times by mapping the problem onto a random-field Ising model and solving it via the replica method. Our results provide a unified and broadly applicable description of speciation transitions in diffusion-based generative models.

[LG-38] Optimal Rates for Feasible Payoff Set Estimation in Games

链接: https://arxiv.org/abs/2602.04397
作者: Annalisa Barbara,Riccardo Poiani,Martino Bernasconi,Andrea Celli
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study a setting in which two players play a (possibly approximate) Nash equilibrium of a bimatrix game, while a learner observes only their actions and has no knowledge of the equilibrium or the underlying game. A natural question is whether the learner can rationalize the observed behavior by inferring the players’ payoff functions. Rather than producing a single payoff estimate, inverse game theory aims to identify the entire set of payoffs consistent with observed behavior, enabling downstream use in, e.g., counterfactual analysis and mechanism design across applications like auctions, pricing, and security games. We focus on the problem of estimating the set of feasible payoffs with high probability and up to precision \epsilon on the Hausdorff metric. We provide the first minimax-optimal rates for both exact and approximate equilibrium play, in zero-sum as well as general-sum games. Our results provide learning-theoretic foundations for set-valued payoff inference in multi-agent environments.

[LG-39] On the use of LLM s to generate a dataset of Neural Networks

链接: https://arxiv.org/abs/2602.04388
作者: Nadia Daoudi,Jordi Cabot
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural networks are increasingly used to support decision-making. To verify their reliability and adaptability, researchers and practitioners have proposed a variety of tools and methods for tasks such as NN code verification, refactoring, and migration. These tools play a crucial role in guaranteeing both the correctness and maintainability of neural network architectures, helping to prevent implementation errors, simplify model updates, and ensure that complex networks can be reliably extended and reused. Yet, assessing their effectiveness remains challenging due to the lack of publicly diverse datasets of neural networks that would allow systematic evaluation. To address this gap, we leverage large language models (LLMs) to automatically generate a dataset of neural networks that can serve as a benchmark for validation. The dataset is designed to cover diverse architectural components and to handle multiple input data types and tasks. In total, 608 samples are generated, each conforming to a set of precise design choices. To further ensure their consistency, we validate the correctness of the generated networks using static analysis and symbolic tracing. We make the dataset publicly available to support the community in advancing research on neural network reliability and adaptability.

[LG-40] Reducing the labeling burden in time-series mapping using Common Ground: a semi-automated approach to tracking changes in land cover and species over time

链接: https://arxiv.org/abs/2602.04373
作者: Geethen Singh,Jasper A Slingsby,Tamara B Robinson,Glenn Moncrieff
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reliable classification of Earth Observation data depends on consistent, up-to-date reference labels. However, collecting new labelled data at each time step remains expensive and logistically difficult, especially in dynamic or remote ecological systems. As a response to this challenge, we demonstrate that a model with access to reference data solely from time step t0 can perform competitively on both t0 and a future time step t1, outperforming models trained separately on time-specific reference data (the gold standard). This finding suggests that effective temporal generalization can be achieved without requiring manual updates to reference labels beyond the initial time step t0. Drawing on concepts from change detection and semi-supervised learning (SSL), the most performant approach, “Common Ground”, uses a semi-supervised framework that leverages temporally stable regions-areas with little to no change in spectral or semantic characteristics between time steps-as a source of implicit supervision for dynamic regions. We evaluate this strategy across multiple classifiers, sensors (Landsat-8, Sentinel-2 satellite multispectral and airborne imaging spectroscopy), and ecological use cases. For invasive tree species mapping, we observed a 21-40% improvement in classification accuracy using Common Ground compared to naive temporal transfer, where models trained at a single time step are directly applied to a future time step. We also observe a 10 -16% higher accuracy for the introduced approach compared to a gold-standard approach. In contrast, when broad land cover categories were mapped across Europe, we observed a more modest 2% increase in accuracy compared to both the naive and gold-standard approaches. These results underscore the effectiveness of combining stable reference screening with SSL for scalable and label-efficient multi-temporal remote sensing classification.

[LG-41] Multi-scale hypergraph meets LLM s: Aligning large language models for time series analysis ICLR2026

链接: https://arxiv.org/abs/2602.04369
作者: Zongjiang Shang,Dongliang Cui,Binqing Wu,Ling Chen
类目: Machine Learning (cs.LG)
*备注: Accepted by ICLR2026

点击查看摘要

Abstract:Recently, there has been great success in leveraging pre-trained large language models (LLMs) for time series analysis. The core idea lies in effectively aligning the modality between natural language and time series. However, the multi-scale structures of natural language and time series have not been fully considered, resulting in insufficient utilization of LLMs capabilities. To this end, we propose MSH-LLM, a Multi-Scale Hypergraph method that aligns Large Language Models for time series analysis. Specifically, a hyperedging mechanism is designed to enhance the multi-scale semantic information of time series semantic space. Then, a cross-modality alignment (CMA) module is introduced to align the modality between natural language and time series at different scales. In addition, a mixture of prompts (MoP) mechanism is introduced to provide contextual information and enhance the ability of LLMs to understand the multi-scale temporal patterns of time series. Experimental results on 27 real-world datasets across 5 different applications demonstrate that MSH-LLM achieves the state-of-the-art results.

[LG-42] EXaMCaP: Subset Selection with Entropy Gain Maximization for Probing Capability Gains of Large Chart Understanding Training Sets

链接: https://arxiv.org/abs/2602.04365
作者: Jiapeng Liu,Liang Li,Bing Li,Peng Fu,Xiyan Gao,Chengyang Fang,Xiaoshuai Hao,Can Ma
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent works focus on synthesizing Chart Understanding (ChartU) training sets to inject advanced chart knowledge into Multimodal Large Language Models (MLLMs), where the sufficiency of the knowledge is typically verified by quantifying capability gains via the fine-tune-then-evaluate paradigm. However, full-set fine-tuning MLLMs to assess such gains incurs significant time costs, hindering the iterative refinement cycles of the ChartU dataset. Reviewing the ChartU dataset synthesis and data selection domains, we find that subsets can potentially probe the MLLMs’ capability gains from full-set fine-tuning. Given that data diversity is vital for boosting MLLMs’ performance and entropy reflects this feature, we propose EXaMCaP, which uses entropy gain maximization to select a subset. To obtain a high-diversity subset, EXaMCaP chooses the maximum-entropy subset from the large ChartU dataset. As enumerating all possible subsets is impractical, EXaMCaP iteratively selects samples to maximize the gain in set entropy relative to the current set, approximating the maximum-entropy subset of the full dataset. Experiments show that EXaMCaP outperforms baselines in probing the capability gains of the ChartU training set, along with its strong effectiveness across diverse subset sizes and compatibility with various MLLM architectures.

[LG-43] Mosaic Learning: A Framework for Decentralized Learning with Model Frag mentation

链接: https://arxiv.org/abs/2602.04352
作者: Sayan Biswas,Davide Frey,Romaric Gaudel,Nirupam Gupta,Anne-Marie Kermarrec,Dimitri Lerévérend,Rafael Pires,Rishi Sharma,François Taïani,Martijn de Vos
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Decentralized learning (DL) enables collaborative machine learning (ML) without a central server, making it suitable for settings where training data cannot be centrally hosted. We introduce Mosaic Learning, a DL framework that decomposes models into fragments and disseminates them independently across the network. Fragmentation reduces redundant communication across correlated parameters and enables more diverse information propagation without increasing communication cost. We theoretically show that Mosaic Learning (i) shows state-of-the-art worst-case convergence rate, and (ii) leverages parameter correlation in an ML model, improving contraction by reducing the highest eigenvalue of a simplified system. We empirically evaluate Mosaic Learning on four learning tasks and observe up to 12 percentage points higher node-level test accuracy compared to epidemic learning (EL), a state-of-the-art baseline. In summary, Mosaic Learning improves DL performance without sacrificing its utility or efficiency, and positions itself as a new DL standard.

[LG-44] MirrorLA: Reflecting Feature Map for Vision Linear Attention

链接: https://arxiv.org/abs/2602.04346
作者: Weikang Meng,Liangyu Huo,Yadan Luo,Yaowei Wang,Yingjian Li,Zheng Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Linear attention significantly reduces the computational complexity of Transformers from quadratic to linear, yet it consistently lags behind softmax-based attention in performance. We identify the root cause of this degradation as the non-negativity constraint imposed on kernel feature maps: standard projections like ReLU act as “passive truncation” operators, indiscriminately discarding semantic information residing in the negative domain. We propose MirrorLA, a geometric framework that substitutes passive truncation with active reorientation. By leveraging learnable Householder reflections, MirrorLA rotates the feature geometry into the non-negative orthant to maximize information retention. Our approach restores representational density through a cohesive, multi-scale design: it first optimizes local discriminability via block-wise isometries, stabilizes long-context dynamics using variance-aware modulation to diversify activations, and finally, integrates dispersed subspaces via cross-head reflections to induce global covariance mixing. MirrorLA achieves state-of-the-art performance across standard benchmarks, demonstrating that strictly linear efficiency can be achieved without compromising representational fidelity.

[LG-45] RISE: Interactive Visual Diagnosis of Fairness in Machine Learning Models

链接: https://arxiv.org/abs/2602.04339
作者: Ray Chen,Christan Grant
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Evaluating fairness under domain shift is challenging because scalar metrics often obscure exactly where and how disparities arise. We introduce \textitRISE (Residual Inspection through Sorted Evaluation), an interactive visualization tool that converts sorted residuals into interpretable patterns. By connecting residual curve structures to formal fairness notions, RISE enables localized disparity diagnosis, subgroup comparison across environments, and the detection of hidden fairness issues. Through post-hoc analysis, RISE exposes accuracy-fairness trade-offs that aggregate statistics miss, supporting more informed model selection.

[LG-46] Convolution Operator Network for Forward and Inverse Problems (FI-Conv): Application to Plasma Turbulence Simulations

链接: https://arxiv.org/abs/2602.04287
作者: Xingzhuo Chen,Anthony Poole,Ionut-Gabriel Farcas,David R. Hatch,Ulisses Braga-Neto
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose the Convolutional Operator Network for Forward and Inverse Problems (FI-Conv), a framework capable of predicting system evolution and estimating parameters in complex spatio-temporal dynamics, such as turbulence. FI-Conv is built on a U-Net architecture, in which most convolutional layers are replaced by ConvNeXt V2 blocks. This design preserves U-Net performance on inputs with high-frequency variations while maintaining low computational complexity. FI-Conv uses an initial state, PDE parameters, and evolution time as input to predict the system future state. As a representative example of a system exhibiting complex dynamics, we evaluate the performance of FI-Conv on the task of predicting turbulent plasma fields governed by the Hasegawa-Wakatani (HW) equations. The HW system models two-dimensional electrostatic drift-wave turbulence and exhibits strongly nonlinear behavior, making accurate approximation and long-term prediction particularly challenging. Using an autoregressive forecasting procedure, FI-Conv achieves accurate forward prediction of the plasma state evolution over short times (t ~ 3) and captures the statistic properties of derived physical quantities of interest over longer times (t ~ 100). Moreover, we develop a gradient-descent-based inverse estimation method that accurately infers PDE parameters from plasma state evolution data, without modifying the trained model weights. Collectively, our results demonstrate that FI-Conv can be an effective alternative to existing physics-informed machine learning methods for systems with complex spatio-temporal dynamics.

[LG-47] Multi-Integration of Labels across Categories for Component Identification (MILCCI)

链接: https://arxiv.org/abs/2602.04270
作者: Noga Mudrik,Yuxi Chen,Gal Mishne,Adam S. Charles
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Many fields collect large-scale temporal data through repeated measurements (trials), where each trial is labeled with a set of metadata variables spanning several categories. For example, a trial in a neuroscience study may be linked to a value from category (a): task difficulty, and category (b): animal choice. A critical challenge in time-series analysis is to understand how these labels are encoded within the multi-trial observations, and disentangle the distinct effect of each label entry across categories. Here, we present MILCCI, a novel data-driven method that i) identifies the interpretable components underlying the data, ii) captures cross-trial variability, and iii) integrates label information to understand each category’s representation within the data. MILCCI extends a sparse per-trial decomposition that leverages label similarities within each category to enable subtle, label-driven cross-trial adjustments in component compositions and to distinguish the contribution of each category. MILCCI also learns each component’s corresponding temporal trace, which evolves over time within each trial and varies flexibly across trials. We demonstrate MILCCI’s performance through both synthetic and real-world examples, including voting patterns, online page view trends, and neuronal recordings.

[LG-48] From Ambiguity to Action: A POMDP Perspective on Partial Multi-Label Ambiguity and Its Horizon-One Resolution

链接: https://arxiv.org/abs/2602.04255
作者: Hanlin Pan,Yuhao Tang,Wanfu Gao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In partial multi-label learning (PML), the true labels are unobserved, which makes label disambiguation important but difficult. A key challenge is that ambiguous candidate labels can propagate errors into downstream tasks such as feature engineering. To solve this issue, we jointly model the disambiguation and feature selection tasks as Partially Observable Markov Decision Processes (POMDP) to turn PML risk minimization into expected-return maximization. Stage 1 trains a transformer policy via reinforcement learning to produce high-quality hard pseudo-labels; Stage 2 describes feature selection as a sequential reinforcement learning problem, selecting features step by step and outputting an interpretable global ranking. We further provide the theoretical analysis of PML-POMDP correspondence and the excess-risk bound that decompose the error into pseudo label quality term and sample size. Experiments in multiple metrics and data sets verify the advantages of the framework.

[LG-49] raining A Foundation Model to Represent Graphs as Vectors

链接: https://arxiv.org/abs/2602.04244
作者: Qi Feng,Jicong Fan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper aims to train a graph foundation model that is able to represent any graph as a vector preserving structural and semantic information useful for downstream graph-level tasks such as graph classification and graph clustering. To learn the features of graphs from diverse domains while maintaining strong generalization ability to new domains, we propose a multi-graph-based feature alignment method, which constructs weighted graphs using the attributes of all nodes in each dataset and then generates consistent node embeddings. To enhance the consistency of the features from different datasets, we propose a density maximization mean alignment algorithm with guaranteed convergence. The original graphs and generated node embeddings are fed into a graph neural network to achieve discriminative graph representations in contrastive learning. More importantly, to enhance the information preservation from node-level representations to the graph-level representation, we construct a multi-layer reference distribution module without using any pooling operation. We also provide a theoretical generalization bound to support the effectiveness of the proposed model. The experimental results of few-shot graph classification and graph clustering show that our model outperforms strong baselines.

[LG-50] Cascading Robustness Verification: Toward Efficient Model-Agnostic Certification

链接: https://arxiv.org/abs/2602.04236
作者: Mohammadreza Maleki,Rushendra Sidibomma,Arman Adibi,Reza Samavi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Certifying neural network robustness against adversarial examples is challenging, as formal guarantees often require solving non-convex problems. Hence, incomplete verifiers are widely used because they scale efficiently and substantially reduce the cost of robustness verification compared to complete methods. However, relying on a single verifier can underestimate robustness because of loose approximations or misalignment with training methods. In this work, we propose Cascading Robustness Verification (CRV), which goes beyond an engineering improvement by exposing fundamental limitations of existing robustness metric and introducing a framework that enhances both reliability and efficiency. CRV is a model-agnostic verifier, meaning that its robustness guarantees are independent of the model’s training process. The key insight behind the CRV framework is that, when using multiple verification methods, an input is certifiably robust if at least one method certifies it as robust. Rather than relying solely on a single verifier with a fixed constraint set, CRV progressively applies multiple verifiers to balance the tightness of the bound and computational cost. Starting with the least expensive method, CRV halts as soon as an input is certified as robust; otherwise, it proceeds to more expensive methods. For computationally expensive methods, we introduce a Stepwise Relaxation Algorithm (SR) that incrementally adds constraints and checks for certification at each step, thereby avoiding unnecessary computation. Our theoretical analysis demonstrates that CRV achieves equal or higher verified accuracy compared to powerful but computationally expensive incomplete verifiers in the cascade, while significantly reducing verification overhead. Empirical results confirm that CRV certifies at least as many inputs as benchmark approaches, while improving runtime efficiency by up to ~90%.

[LG-51] From Sparse Sensors to Continuous Fields: STRIDE for Spatiotemporal Reconstruction

链接: https://arxiv.org/abs/2602.04201
作者: Yanjie Tong,Peng Chen
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:Reconstructing high-dimensional spatiotemporal fields from sparse point-sensor measurements is a central challenge in learning parametric PDE dynamics. Existing approaches often struggle to generalize across trajectories and parameter settings, or rely on discretization-tied decoders that do not naturally transfer across meshes and resolutions. We propose STRIDE (Spatio-Temporal Recurrent Implicit DEcoder), a two-stage framework that maps a short window of sensor measurements to a latent state with a temporal encoder and reconstructs the field at arbitrary query locations with a modulated implicit neural representation (INR) decoder. Using the Fourier Multi-Component and Multi-Layer Neural Network (FMMNN) as the INR backbone improves representation of complex spatial fields and yields more stable optimization than sine-based INRs. We provide a conditional theoretical justification: under stable delay observability of point measurements on a low-dimensional parametric invariant set, the reconstruction operator factors through a finite-dimensional embedding, making STRIDE-type architectures natural approximators. Experiments on four challenging benchmarks spanning chaotic dynamics and wave propagation show that STRIDE outperforms strong baselines under extremely sparse sensing, supports super-resolution, and remains robust to noise.

[LG-52] LORE: Jointly Learning the Intrinsic Dimensionality and Relative Similarity Structure From Ordinal Data ICLR2026

链接: https://arxiv.org/abs/2602.04192
作者: Vivek Anand,Alec Helbling,Mark Davenport,Gordon Berman,Sankar Alagapan,Christopher Rozell
类目: Machine Learning (cs.LG)
*备注: 10 Pages, 31 with appendix: Accepted at ICLR 2026

点击查看摘要

Abstract:Learning the intrinsic dimensionality of subjective perceptual spaces such as taste, smell, or aesthetics from ordinal data is a challenging problem. We introduce LORE (Low Rank Ordinal Embedding), a scalable framework that jointly learns both the intrinsic dimensionality and an ordinal embedding from noisy triplet comparisons of the form, “Is A more similar to B than C?”. Unlike existing methods that require the embedding dimension to be set apriori, LORE regularizes the solution using the nonconvex Schatten- p quasi norm, enabling automatic joint recovery of both the ordinal embedding and its dimensionality. We optimize this joint objective via an iteratively reweighted algorithm and establish convergence guarantees. Extensive experiments on synthetic datasets, simulated perceptual spaces, and real world crowdsourced ordinal judgements show that LORE learns compact, interpretable and highly accurate low dimensional embeddings that recover the latent geometry of subjective percepts. By simultaneously inferring both the intrinsic dimensionality and ordinal embeddings, LORE enables more interpretable and data efficient perceptual modeling in psychophysics and opens new directions for scalable discovery of low dimensional structure from ordinal data in machine learning.

[LG-53] Benchmarking Uncertainty Quantification of Plug-and-Play Diffusion Priors for Inverse Problems Solving

链接: https://arxiv.org/abs/2602.04189
作者: Xiaoyu Qiu,Taewon Yang,Zhanhao Liu,Guanyang Wang,Liyue Shen
类目: Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:Plug-and-play diffusion priors (PnPDP) have become a powerful paradigm for solving inverse problems in scientific and engineering domains. Yet, current evaluations of reconstruction quality emphasize point-estimate accuracy metrics on a single sample, which do not reflect the stochastic nature of PnPDP solvers and the intrinsic uncertainty of inverse problems, critical for scientific tasks. This creates a fundamental mismatch: in inverse problems, the desired output is typically a posterior distribution and most PnPDP solvers induce a distribution over reconstructions, but existing benchmarks only evaluate a single reconstruction, ignoring distributional characterization such as uncertainty. To address this gap, we conduct a systematic study to benchmark the uncertainty quantification (UQ) of existing diffusion inverse solvers. Specifically, we design a rigorous toy model simulation to evaluate the uncertainty behavior of various PnPDP solvers, and propose a UQ-driven categorization. Through extensive experiments on toy simulations and diverse real-world scientific inverse problems, we observe uncertainty behaviors consistent with our taxonomy and theoretical justification, providing new insights for evaluating and understanding the uncertainty for PnPDPs.

[LG-54] Piece of CAKE: Adaptive Execution Engines via Microsecond-Scale Learning

链接: https://arxiv.org/abs/2602.04181
作者: Zijie Zhao,Ryan Marcus
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Low-level database operators often admit multiple physical implementations (“kernels”) that are semantically equivalent but have vastly different performance characteristics depending on the input data distribution. Existing database systems typically rely on static heuristics or worst-case optimal defaults to select these kernels, often missing significant performance opportunities. In this work, we propose CAKE (Counterfactual Adaptive Kernel Execution), a system that learns to select the optimal kernel for each data “morsel” using a microsecond-scale contextual multi-armed bandit. CAKE circumvents the high latency of traditional reinforcement learning by exploiting the cheapness of counterfactuals – selectively running multiple kernels to obtain full feedback – and compiling policies into low-latency regret trees. Experimentally, we show that CAKE can reduce end-to-end workload latency by up to 2x compared to state-of-the-art static heuristics.

[LG-55] BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models

链接: https://arxiv.org/abs/2602.04163
作者: Junyu Chen,Jungang Li,Jing Xiong,Wenjie Wang,Qingyao Yang,He Xiao,Zhen Li,Taiqiang Wu,Mengzhao Chen,Zhen Peng,Chaofan Tao,Long Shi,Hongxia Yang,Ngai Wong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language model (LLM) inference is often bounded by memory footprint and memory bandwidth in resource-constrained deployments, making quantization a fundamental technique for efficient serving. While post-training quantization (PTQ) maintains high fidelity at 4-bit, it deteriorates at 2-3 bits. Fundamentally, existing methods enforce a shape-invariant quantization grid (e.g., the fixed uniform intervals of UINT2) for each group, severely restricting the feasible set for error minimization. To address this, we propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients, and iteratively refines them using approximate second-order information while progressively compensating quantization errors to minimize output discrepancy. In the 2-bit regime, BPDQ enables serving Qwen2.5-72B on a single RTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit). Moreover, we provide theoretical analysis showing that the variable grid expands the feasible set, and that the quantization process consistently aligns with the optimization objective in Hessian-induced geometry. Code: this http URL.

[LG-56] Generative Neural Operators through Diffusion Last Layer

链接: https://arxiv.org/abs/2602.04139
作者: Sungwon Park,Anthony Zhou,Hongjoong Kim,Amir Barati Farimani
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Neural operators have emerged as a powerful paradigm for learning discretization-invariant function-to-function mappings in scientific computing. However, many practical systems are inherently stochastic, making principled uncertainty quantification essential for reliable deployment. To address this, we introduce a simple add-on, the diffusion last layer (DLL), a lightweight probabilistic head that can be attached to arbitrary neural operator backbones to model predictive uncertainty. Motivated by the relative smoothness and low-dimensional structure often exhibited by PDE solution distributions, DLL parameterizes the conditional output distribution directly in function space through a low-rank Karhunen-Loève expansion, enabling efficient and expressive uncertainty modeling. Across stochastic PDE operator learning benchmarks, DLL improves generalization and uncertainty-aware prediction. Moreover, even in deterministic long-horizon rollout settings, DLL enhances rollout stability and provides meaningful estimates of epistemic uncertainty for backbone neural operators.

[LG-57] Lyapunov Constrained Soft Actor-Critic (LC-SAC) using Koopman Operator Theory for Quadrotor Trajectory Tracking

链接: https://arxiv.org/abs/2602.04132
作者: Dhruv S. Kushwaha,Zoleikha A. Biron
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 12 pages, 7 Figures, submitted to IEEE RA-L

点击查看摘要

Abstract:Reinforcement Learning (RL) has achieved remarkable success in solving complex sequential decision-making problems. However, its application to safety-critical physical systems remains constrained by the lack of stability guarantees. Standard RL algorithms prioritize reward maximization, often yielding policies that may induce oscillations or unbounded state divergence. There has significant work in incorporating Lyapunov-based stability guarantees in RL algorithms with key challenges being selecting a candidate Lyapunov function, computational complexity by using excessive function approximators and conservative policies by incorporating stability criterion in the learning process. In this work we propose a novel Lyapunov-constrained Soft Actor-Critic (LC-SAC) algorithm using Koopman operator theory. We propose use of extended dynamic mode decomposition (EDMD) to produce a linear approximation of the system and use this approximation to derive a closed form solution for candidate Lyapunov function. This derived Lyapunov function is incorporated in the SAC algorithm to further provide guarantees for a policy that stabilizes the nonlinear system. The results are evaluated trajectory tracking of a 2D Quadrotor environment based on safe-control-gym. The proposed algorithm shows training convergence and decaying violations for Lyapunov stability criterion compared to baseline vanilla SAC algorithm. GitHub Repository: this https URL

[LG-58] Decoupling Time and Risk: Risk-Sensitive Reinforcement Learning with General Discounting

链接: https://arxiv.org/abs/2602.04131
作者: Mehrdad Moghimi,Anthony Coache,Hyejin Ku
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Distributional reinforcement learning (RL) is a powerful framework increasingly adopted in safety-critical domains for its ability to optimize risk-sensitive objectives. However, the role of the discount factor is often overlooked, as it is typically treated as a fixed parameter of the Markov decision process or tunable hyperparameter, with little consideration of its effect on the learned policy. In the literature, it is well-known that the discounting function plays a major role in characterizing time preferences of an agent, which an exponential discount factor cannot fully capture. Building on this insight, we propose a novel framework that supports flexible discounting of future rewards and optimization of risk measures in distributional RL. We provide a technical analysis of the optimality of our algorithms, show that our multi-horizon extension fixes issues raised with existing methodologies, and validate the robustness of our methods through extensive experiments. Our results highlight that discounting is a cornerstone in decision-making problems for capturing more expressive temporal and risk preferences profiles, with potential implications for real-world safety-critical applications.

[LG-59] Synthesizable Molecular Generation via Soft-constrained GFlowNets with Rich Chemical Priors

链接: https://arxiv.org/abs/2602.04119
作者: Hyeonah Kim,Minsu Kim,Celine Roget,Dionessa Biton,Louis Vaillancourt,Yves V. Brun,Yoshua Bengio,Alex Hernandez-Garcia
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:The application of generative models for experimental drug discovery campaigns is severely limited by the difficulty of designing molecules de novo that can be synthesized in practice. Previous works have leveraged Generative Flow Networks (GFlowNets) to impose hard synthesizability constraints through the design of state and action spaces based on predefined reaction templates and building blocks. Despite the promising prospects of this approach, it currently lacks flexibility and scalability. As an alternative, we propose S3-GFN, which generates synthesizable SMILES molecules via simple soft regularization of a sequence-based GFlowNet. Our approach leverages rich molecular priors learned from large-scale SMILES corpora to steer molecular generation towards high-reward, synthesizable chemical spaces. The model induces constraints through off-policy replay training with a contrastive learning signal based on separate buffers of synthesizable and unsynthesizable samples. Our experiments show that S3-GFN learns to generate synthesizable molecules ( \geq 95% ) with higher rewards in diverse tasks.

[LG-60] Learning to Reason in 13 Parameters

链接: https://arxiv.org/abs/2602.04118
作者: John X. Morris,Niloofar Mireshghallah,Mark Ibrahim,Saeed Mahloujifar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent research has shown that language models can learn to \textitreason, often via reinforcement learning. Some work even trains low-rank parameterizations for reasoning, but conventional LoRA cannot scale below the model dimension. We question whether even rank=1 LoRA is necessary for learning to reason and propose TinyLoRA, a method for scaling low-rank adapters to sizes as small as one parameter. Within our new parameterization, we are able to train the 8B parameter size of Qwen2.5 to 91% accuracy on GSM8K with only 13 trained parameters in bf16 (26 total bytes). We find this trend holds in general: we are able to recover 90% of performance improvements while training 1000x fewer parameters across a suite of more difficult learning-to-reason benchmarks such as AIME, AMC, and MATH500. Notably, we are only able to achieve such strong performance with RL: models trained using SFT require 100-1000x larger updates to reach the same performance.

[LG-61] urning mechanistic models into forecasters by using machine learning

链接: https://arxiv.org/abs/2602.04114
作者: Amit K. Chakraborty,Hao Wang,Pouria Ramazi
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注: 47 pages, 11 figures

点击查看摘要

Abstract:The equations of complex dynamical systems may not be identified by expert knowledge, especially if the underlying mechanisms are unknown. Data-driven discovery methods address this challenge by inferring governing equations from time-series data using a library of functions constructed from the measured variables. However, these methods typically assume time-invariant coefficients, which limits their ability to capture evolving system dynamics. To overcome this limitation, we allow some of the parameters to vary over time, learn their temporal evolution directly from data, and infer a system of equations that incorporates both constant and time-varying parameters. We then transform this framework into a forecasting model by predicting the time-varying parameters and substituting these predictions into the learned equations. The model is validated using datasets for Susceptible-Infected-Recovered, Consumer–Resource, greenhouse gas concentration, and Cyanobacteria cell count. By dynamically adapting to temporal shifts, our proposed model achieved a mean absolute error below 3% for learning a time series and below 6% for forecasting up to a month ahead. We additionally compare forecasting performance against CNN-LSTM and Gradient Boosting Machine (GBM), and show that our model outperforms these methods across most datasets. Our findings demonstrate that integrating time-varying parameters into data-driven discovery of differential equations improves both modeling accuracy and forecasting performance.

[LG-62] ZKBoost: Zero-Knowledge Verifiable Training for XGBoost

链接: https://arxiv.org/abs/2602.04113
作者: Nikolas Melissaris,Jiayi Xu,Antigoni Polychroniadou,Akira Takahashi,Chenkai Weng
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Gradient boosted decision trees, particularly XGBoost, are among the most effective methods for tabular data. As deployment in sensitive settings increases, cryptographic guarantees of model integrity become essential. We present ZKBoost, the first zero-knowledge proof of training (zkPoT) protocol for XGBoost, enabling model owners to prove correct training on a committed dataset without revealing data or parameters. We make three key contributions: (1) a fixed-point XGBoost implementation compatible with arithmetic circuits, enabling instantiation of efficient zkPoT, (2) a generic template of zkPoT for XGBoost, which can be instantiated with any general-purpose ZKP backend, and (3) vector oblivious linear evaluation (VOLE)-based instantiation resolving challenges in proving nonlinear fixed-point operations. Our fixed-point implementation matches standard XGBoost accuracy within 1% while enabling practical zkPoT on real-world datasets.

[LG-63] Rate-Optimal Noise Annealing in Semi-Dual Neural Optimal Transport: Tangential Identifiability Off-Manifold Ambiguity and Guaranteed Recovery

链接: https://arxiv.org/abs/2602.04110
作者: Raymond Chu,Jaewoong Choi,Dohyun Kwon
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Semi-dual neural optimal transport learns a transport map via a max-min objective, yet training can converge to incorrect or degenerate maps. We fully characterize these spurious solutions in the common regime where data concentrate on low-dimensional manifold: the objective is underconstrained off the data manifold, while the on-manifold transport signal remains identifiable. Following Choi, Choi, and Kwon (2025), we study additive-noise smoothing as a remedy and prove new map recovery guarantees as the noise vanishes. Our main practical contribution is a computable terminal noise level \varepsilon_\mathrmstat(N) that attains the optimal statistical rate, with scaling governed by the intrinsic dimension m of the data. The formula arises from a theoretical unified analysis of (i) quantitative stability of optimal plans, (ii) smoothing-induced bias, and (iii) finite-sample error, yielding rates that depend on m rather than the ambient dimension. Finally, we show that the reduced semi-dual objective becomes increasingly ill-conditioned as \varepsilon \downarrow 0 . This provides a principled stopping rule: annealing below \varepsilon_\mathrmstat(N) can \textitworsen optimization conditioning without improving statistical accuracy.

[LG-64] Supervised Learning as Lossy Compression: Characterizing Generalization and Sample Complexity via Finite Blocklength Analysis

链接: https://arxiv.org/abs/2602.04107
作者: Kosuke Sugiyama,Masato Uchida
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: 22 pages, 1 figure

点击查看摘要

Abstract:This paper presents a novel information-theoretic perspective on generalization in machine learning by framing the learning problem within the context of lossy compression and applying finite blocklength analysis. In our approach, the sampling of training data formally corresponds to an encoding process, and the model construction to a decoding process. By leveraging finite blocklength analysis, we derive lower bounds on sample complexity and generalization error for a fixed randomized learning algorithm and its associated optimal sampling strategy. Our bounds explicitly characterize the degree of overfitting of the learning algorithm and the mismatch between its inductive bias and the task as distinct terms. This separation provides a significant advantage over existing frameworks. Additionally, we decompose the overfitting term to show its theoretical connection to existing metrics found in information-theoretic bounds and stability theory, unifying these perspectives under our proposed framework.

[LG-65] CoRe: Context-Robust Remasking for Diffusion Language Models

链接: https://arxiv.org/abs/2602.04096
作者: Kevin Zhai,Sabbir Mollah,Zhenyi Wang,Mubarak Shah
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Standard decoding in Masked Diffusion Models (MDMs) is hindered by context rigidity: tokens are retained based on transient high confidence, often ignoring that early predictions lack full context. This creates cascade effects where initial inconsistencies misguide the remaining generation. Existing revision strategies attempt to mitigate this by relying on static confidence scores, but these signals are inherently myopic; inconsistent tokens can appear confident to the model itself. We propose Context-Robust Remasking (CoRe), a training-free framework for inference-time revision. Rather than trusting static token probabilities, CoRe identifies context-brittle tokens by probing their sensitivity to targeted masked-context perturbations. We formalize revision as a robust optimization objective over context shifts and efficiently approximate this objective to prioritize unstable tokens for revision. On LLaDA-8B-Base, CoRe delivers consistent improvements across reasoning and code benchmarks, outperforming compute-matched baselines and improving MBPP by up to 9.2 percentage points.

[LG-66] Federated Concept-Based Models: Interpretable models with distributed supervision

链接: https://arxiv.org/abs/2602.04093
作者: Dario Fenoglio,Arianna Casanova,Francesco De Santis,Mohan Li,Gabriele Dominici,Johannes Schneider,Martin Gjoreski,Marc Langheinrich,Pietro Barbiero,Giovanni De Felice
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Concept-based models (CMs) enhance interpretability in deep learning by grounding predictions in human-understandable concepts. However, concept annotations are expensive to obtain and rarely available at scale within a single data source. Federated learning (FL) could alleviate this limitation by enabling cross-institutional training that leverages concept annotations distributed across multiple data owners. Yet, FL lacks interpretable modeling paradigms. Integrating CMs with FL is non-trivial: CMs assume a fixed concept space and a predefined model architecture, whereas real-world FL is heterogeneous and non-stationary, with institutions joining over time and bringing new supervision. In this work, we propose Federated Concept-based Models (F-CMs), a new methodology for deploying CMs in evolving FL settings. F-CMs aggregate concept-level information across institutions and efficiently adapt the model architecture in response to changes in the available concept supervision, while preserving institutional privacy. Empirically, F-CMs preserve the accuracy and intervention effectiveness of training settings with full concept supervision, while outperforming non-adaptive federated baselines. Notably, F-CMs enable interpretable inference on concepts not available to a given institution, a key novelty with respect to existing approaches.

[LG-67] A Probabilistic Framework for Solving High-Frequency Helmholtz Equations via Diffusion Models

链接: https://arxiv.org/abs/2602.04082
作者: Yicheng Zou,Samuel Lanthaler,Hossein Salahshoor
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Deterministic neural operators perform well on many PDEs but can struggle with the approximation of high-frequency wave phenomena, where strong input-to-output sensitivity makes operator learning challenging, and spectral bias blurs oscillations. We argue for adopting a probabilistic approach for approximating waves in high-frequency regime, and develop our probabilistic framework using a score-based conditional diffusion operator. After demonstrating a stability analysis of the Helmholtz operator, we present our numerical experiments across a wide range of frequencies, benchmarked against other popular data-driven and machine learning approaches for waves. We show that our probabilistic neural operator consistently produces robust predictions with the lowest errors in L^2 , H^1 , and energy norms. Moreover, unlike all the other tested deterministic approaches, our framework remarkably captures uncertainties in the input sound speed map propagated to the solution field. We envision that our results position probabilistic operator learning as a principled and effective approach for solving complex PDEs such as Helmholtz in the challenging high-frequency regime.

[LG-68] Agent ic AI-Empowered Dynamic Survey Framework

链接: https://arxiv.org/abs/2602.04071
作者: Furkan Mumcu,Lokman Bekit,Michael J. Jones,Anoop Cherian,Yasin Yilmaz
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Survey papers play a central role in synthesizing and organizing scientific knowledge, yet they are increasingly strained by the rapid growth of research output. As new work continues to appear after publication, surveys quickly become outdated, contributing to redundancy and fragmentation in the literature. We reframe survey writing as a long-horizon maintenance problem rather than a one-time generation task, treating surveys as living documents that evolve alongside the research they describe. We propose an agentic Dynamic Survey Framework that supports the continuous updating of existing survey papers by incrementally integrating new work while preserving survey structure and minimizing unnecessary disruption. Using a retrospective experimental setup, we demonstrate that the proposed framework effectively identifies and incorporates emerging research while preserving the coherence and structure of existing surveys.

[LG-69] An Empirical Survey and Benchmark of Learned Distance Indexes for Road Networks

链接: https://arxiv.org/abs/2602.04068
作者: Gautam Choudhary,Libin Zhou,Yeasir Rayhan,Walid G. Aref
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注: Preprint (Under Review). 14 pages, 2 figures

点击查看摘要

Abstract:The calculation of shortest-path distances in road networks is a core operation in navigation systems, location-based services, and spatial analytics. Although classical algorithms, e.g., Dijkstra’s algorithm, provide exact answers, their latency is prohibitive for modern real-time, large-scale deployments. Over the past two decades, numerous distance indexes have been proposed to speed up query processing for shortest distance queries. More recently, with the advancement in machine learning (ML), researchers have designed and proposed ML-based distance indexes to answer approximate shortest path and distance queries efficiently. However, a comprehensive and systematic evaluation of these ML-based approaches is lacking. This paper presents the first empirical survey of ML-based distance indexes on road networks, evaluating them along four key dimensions: Training time, query latency, storage, and accuracy. Using seven real-world road networks and workload-driven query datasets derived from trajectory data, we benchmark ten representative ML techniques and compare them against strong classical non-ML baselines, highlighting key insights and practical trade-offs. We release a unified open-source codebase to support reproducibility and future research on learned distance indexes.

[LG-70] Partition Trees: Conditional Density Estimation over General Outcome Spaces

链接: https://arxiv.org/abs/2602.04042
作者: Felipe Angelim,Alessandro Leite
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: Code available at this https URL

点击查看摘要

Abstract:We propose Partition Trees, a tree-based framework for conditional density estimation over general outcome spaces, supporting both continuous and categorical variables within a unified formulation. Our approach models conditional distributions as piecewise-constant densities on data adaptive partitions and learns trees by directly minimizing conditional negative log-likelihood. This yields a scalable, nonparametric alternative to existing probabilistic trees that does not make parametric assumptions about the target distribution. We further introduce Partition Forests, an ensemble extension obtained by averaging conditional densities. Empirically, we demonstrate improved probabilistic prediction over CART-style trees and competitive or superior performance compared to state-of-the-art probabilistic tree methods and Random Forests, along with robustness to redundant features and heteroscedastic noise.

[LG-71] DADP: Domain Adaptive Diffusion Policy

链接: https://arxiv.org/abs/2602.04037
作者: Pengcheng Wang,Qinghang Liu,Haotian Lin,Yiheng Li,Guojian Zhan,Masayoshi Tomizuka,Yixiao Wang
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Learning domain adaptive policies that can generalize to unseen transition dynamics, remains a fundamental challenge in learning-based control. Substantial progress has been made through domain representation learning to capture domain-specific information, thus enabling domain-aware decision making. We analyze the process of learning domain representations through dynamical prediction and find that selecting contexts adjacent to the current step causes the learned representations to entangle static domain information with varying dynamical properties. Such mixture can confuse the conditioned policy, thereby constraining zero-shot adaptation. To tackle the challenge, we propose DADP (Domain Adaptive Diffusion Policy), which achieves robust adaptation through unsupervised disentanglement and domain-aware diffusion injection. First, we introduce Lagged Context Dynamical Prediction, a strategy that conditions future state estimation on a historical offset context; by increasing this temporal gap, we unsupervisedly disentangle static domain representations by filtering out transient properties. Second, we integrate the learned domain representations directly into the generative process by biasing the prior distribution and reformulating the diffusion target. Extensive experiments on challenging benchmarks across locomotion and manipulation demonstrate the superior performance, and the generalizability of DADP over prior methods. More visualization results are available on the this https URL.

[LG-72] he Illusion of Generalization: Re-examining Tabular Language Model Evaluation

链接: https://arxiv.org/abs/2602.04031
作者: Aditya Gorla,Ratish Puduppully
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tabular Language Models (TLMs) have been claimed to achieve emergent generalization for tabular prediction. We conduct a systematic re-evaluation of Tabula-8B as a representative TLM, utilizing 165 datasets from the UniPredict benchmark. Our investigation reveals three findings. First, binary and categorical classification achieve near-zero median lift over majority-class baselines and strong aggregate performance is driven entirely by quartile classification tasks. Second, top-performing datasets exhibit pervasive contamination, including complete train-test overlap and task-level leakage that evades standard deduplication. Third, instruction-tuning without tabular exposure recovers 92.2% of standard classification performance and on quartile classification, format familiarity closes 71.3% of the gap with the residual attributable to contaminated datasets. These findings suggest claimed generalization likely reflects evaluation artifacts rather than learned tabular reasoning. We conclude with recommendations for strengthening TLM evaluation.

[LG-73] A Consensus-Bayesian Framework for Detecting Malicious Activity in Enterprise Directory Access Graphs

链接: https://arxiv.org/abs/2602.04027
作者: Pratyush Uppuluri,Shilpa Noushad,Sajan Kumar
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:This work presents a consensus-based Bayesian framework to detect malicious user behavior in enterprise directory access graphs. By modeling directories as topics and users as agents within a multi-level interaction graph, we simulate access evolution using influence-weighted opinion dynamics. Logical dependencies between users are encoded in dynamic matrices Ci, and directory similarity is captured via a shared influence matrix W. Malicious behavior is injected as cross-component logical perturbations that violate structural norms of strongly connected components(SCCs). We apply theoretical guarantees from opinion dynamics literature to determine topic convergence and detect anomaly via scaled opinion variance. To quantify uncertainty, we introduce a Bayesian anomaly scoring mechanism that evolves over time, using both static and online priors. Simulations over synthetic access graphs validate our method, demonstrating its sensitivity to logical inconsistencies and robustness under dynamic perturbation.

[LG-74] Group Contrastive Learning for Weakly Paired Multimodal Data

链接: https://arxiv.org/abs/2602.04021
作者: Aditya Gorla,Hugues Van Assel,Jan-Christian Huetter,Heming Yao,Kyunghyun Cho,Aviv Regev,Russell Littman
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We present GROOVE, a semi-supervised multi-modal representation learning approach for high-content perturbation data where samples across modalities are weakly paired through shared perturbation labels but lack direct correspondence. Our primary contribution is GroupCLIP, a novel group-level contrastive loss that bridges the gap between CLIP for paired cross-modal data and SupCon for uni-modal supervised contrastive learning, addressing a fundamental gap in contrastive learning for weakly-paired settings. We integrate GroupCLIP with an on-the-fly backtranslating autoencoder framework to encourage cross-modally entangled representations while maintaining group-level coherence within a shared latent space. Critically, we introduce a comprehensive combinatorial evaluation framework that systematically assesses representation learners across multiple optimal transport aligners, addressing key limitations in existing evaluation strategies. This framework includes novel simulations that systematically vary shared versus modality-specific perturbation effects enabling principled assessment of method robustness. Our combinatorial benchmarking reveals that there is not yet an aligner that uniformly dominates across settings or modality pairs. Across simulations and two real single-cell genetic perturbation datasets, GROOVE performs on par with or outperforms existing approaches for downstream cross-modal matching and imputation tasks. Our ablation studies demonstrate that GroupCLIP is the key component driving performance gains. These results highlight the importance of leveraging group-level constraints for effective multi-modal representation learning in scenarios where only weak pairing is available.

[LG-75] CP: Informative uncertainty quantification via Equivariantized Conformal Prediction with pre-trained models

链接: https://arxiv.org/abs/2602.03986
作者: Nikolaos Bousias,Lars Lindemann,George Pappas
类目: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:We study the effect of group symmetrization of pre-trained models on conformal prediction (CP), a post-hoc, distribution-free, finite-sample method of uncertainty quantification that offers formal coverage guarantees under the assumption of data exchangeability. Unfortunately, CP uncertainty regions can grow significantly in long horizon missions, rendering the statistical guarantees uninformative. To that end, we propose infusing CP with geometric information via group-averaging of the pretrained predictor to distribute the non-conformity mass across the orbits. Each sample now is treated as a representative of an orbit, thus uncertainty can be mitigated by other samples entangled to it via the orbit inducing elements of the symmetry group. Our approach provably yields contracted non-conformity scores in increasing convex order, implying improved exponential-tail bounds and sharper conformal prediction sets in expectation, especially at high confidence levels. We then propose an experimental design to test these theoretical claims in pedestrian trajectory prediction.

[LG-76] Non-linear PCA via Evolution Strategies: a Novel Objective Function

链接: https://arxiv.org/abs/2602.03967
作者: Thomas Uriot,Elise Chung
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Principal Component Analysis (PCA) is a powerful and popular dimensionality reduction technique. However, due to its linear nature, it often fails to capture the complex underlying structure of real-world data. While Kernel PCA (kPCA) addresses non-linearity, it sacrifices interpretability and struggles with hyperparameter selection. In this paper, we propose a robust non-linear PCA framework that unifies the interpretability of PCA with the flexibility of neural networks. Our method parametrizes variable transformations via neural networks, optimized using Evolution Strategies (ES) to handle the non-differentiability of eigendecomposition. We introduce a novel, granular objective function that maximizes the individual variance contribution of each variable providing a stronger learning signal than global variance maximization. This approach natively handles categorical and ordinal variables without the dimensional explosion associated with one-hot encoding. We demonstrate that our method significantly outperforms both linear PCA and kPCA in explained variance across synthetic and real-world datasets. At the same time, it preserves PCA’s interpretability, enabling visualization and analysis of feature contributions using standard tools such as biplots. The code can be found on GitHub.

[LG-77] Child Mortality Prediction in Bangladesh: A Decade-Long Validation Study

链接: https://arxiv.org/abs/2602.03957
作者: Md Muhtasim Munif Fahim,Md Rezaul Karim
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:The predictive machine learning models for child mortality tend to be inaccurate when applied to future populations, since they suffer from look-ahead bias due to the randomization used in cross-validation. The Demographic and Health Surveys (DHS) data from Bangladesh for 2011-2022, with n = 33,962, are used in this paper. We trained the model on (2011-2014) data, validated it on 2017 data, and tested it on 2022 data. Eight years after the initial test of the model, a genetic algorithm-based Neural Architecture Search found a single-layer neural architecture (with 64 units) to be superior to XGBoost (AUROC = 0.76 vs. 0.73; p 0.01). Additionally, through a detailed fairness audit, we identified an overall “Socioeconomic Predictive Gradient,” with a positive correlation between regional poverty level (r = -0.62) and the algorithm’s AUC. In addition, we found that the model performed at its highest levels in the least affluent divisions (AUC 0.74) and decreased dramatically in the wealthiest divisions (AUC 0.66). These findings suggest that the model is identifying areas with the greatest need for intervention. Our model would identify approximately 1300 additional at-risk children annually than a Gradient Boosting model when screened at the 10% level and validated using SHAP values and Platt Calibration, and therefore provide a robust, production-ready computational phenotype for targeted maternal and child health interventions.

[LG-78] Grables: Tabular Learning Beyond Independent Rows

链接: https://arxiv.org/abs/2602.03945
作者: Tamara Cucumides,Floris Geerts
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tabular learning is still dominated by row-wise predictors that score each row independently, which fits i.i.d. benchmarks but fails on transactional, temporal, and relational tables where labels depend on other rows. We show that row-wise prediction rules out natural targets driven by global counts, overlaps, and relational patterns. To make “using structure” precise across architectures, we introduce grables: a modular interface that separates how a table is lifted to a graph (constructor) from how predictions are computed on that graph (node predictor), pinpointing where expressive power comes from. Experiments on synthetic tasks, transaction data, and a RelBench clinical-trials dataset confirm the predicted separations: message passing captures inter-row dependencies that row-local models miss, and hybrid approaches that explicitly extract inter-row structure and feed it to strong tabular learners yield consistent gains.

[LG-79] Autonomous AI Agents for Real-Time Affordable Housing Site Selection: Multi-Objective Reinforcement Learning Under Regulatory Constraints

链接: https://arxiv.org/abs/2602.03940
作者: Olaf Yunus Laitinen Imanov,Duygu Erisken,Derya Umut Kulali,Taner Yilmaz,Rana Irem Turhan
类目: Machine Learning (cs.LG)
*备注: 12 pages, 6 figures, 5 tables

点击查看摘要

Abstract:Affordable housing shortages affect billions, while land scarcity and regulations make site selection slow. We present AURA (Autonomous Urban Resource Allocator), a hierarchical multi-agent reinforcement learning system for real-time affordable housing site selection under hard regulatory constraints (QCT, DDA, LIHTC). We model the task as a constrained multi-objective Markov decision process optimizing accessibility, environmental impact, construction cost, and social equity while enforcing feasibility. AURA uses a regulatory-aware state encoding 127 federal and local constraints, Pareto-constrained policy gradients with feasibility guarantees, and reward decomposition separating immediate costs from long-term social outcomes. On datasets from 8 U.S. metros (47,392 candidate parcels), AURA attains 94.3% regulatory compliance and improves Pareto hypervolume by 37.2% over strong baselines. In a New York City 2026 case study, it reduces selection time from 18 months to 72 hours and identifies 23% more viable sites; chosen sites have 31% better transit access and 19% lower environmental impact than expert picks.

[LG-80] C-IDS: Solving Contextual POMDP via Information-Directed Objective

链接: https://arxiv.org/abs/2602.03939
作者: Chongyang Shi,Michael Dorothy,Jie Fu
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the policy synthesis problem in contextual partially observable Markov decision processes (CPOMDPs), where the environment is governed by an unknown latent context that induces distinct POMDP dynamics. Our goal is to design a policy that simultaneously maximizes cumulative return and actively reduces uncertainty about the underlying context. We introduce an information-directed objective that augments reward maximization with mutual information between the latent context and the agent’s observations. We develop the C-IDS algorithm to synthesize policies that maximize the information-directed objective. We show that the objective can be interpreted as a Lagrangian relaxation of the linear information ratio and prove that the temperature parameter is an upper bound on the information ratio. Based on this characterization, we establish a sublinear Bayesian regret bound over K episodes. We evaluate our approach on a continuous Light-Dark environment and show that it consistently outperforms standard POMDP solvers that treat the unknown context as a latent state variable, achieving faster context identification and higher returns.

[LG-81] Online Vector Quantized Attention

链接: https://arxiv.org/abs/2602.03922
作者: Nick Alonso,Tomas Figliolia,Beren Millidge
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Standard sequence mixing layers used in language models struggle to balance efficiency and performance. Self-attention performs well on long context tasks but has expensive quadratic compute and linear memory costs, while linear attention and SSMs use only linear compute and constant memory but struggle with long context processing. In this paper, we develop a sequence mixing layer that aims to find a better compromise between memory-compute costs and long-context processing, which we call online vector-quantized (OVQ) attention. OVQ-attention requires linear compute costs and constant memory, but, unlike linear attention and SSMs, it uses a sparse memory update that allows it to greatly increase the size of its memory state and, consequently, memory capacity. We develop a theoretical basis for OVQ-attention based on Gaussian mixture regression, and we test it on a variety of synthetic long context tasks and on long context language modeling. OVQ-attention shows significant improvements over linear attention baselines and the original VQ-attention, on which OVQ-attention was inspired. It demonstrates competitive, and sometimes identical, performance to strong self-attention baselines up 64k sequence length, despite using a small fraction of the memory of full self-attention.

[LG-82] Causal Discovery for Cross-Sectional Data Based on Super-Structure and Divide-and-Conquer

链接: https://arxiv.org/abs/2602.03914
作者: Wenyu Wang(1),Yaping Wan(1) ((1) University of South China)
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 7 pages,16 figures

点击查看摘要

Abstract:This paper tackles a critical bottleneck in Super-Structure-based divide-and-conquer causal discovery: the high computational cost of constructing accurate Super-Structures–particularly when conditional independence (CI) tests are expensive and domain knowledge is unavailable. We propose a novel, lightweight framework that relaxes the strict requirements on Super-Structure construction while preserving the algorithmic benefits of divide-and-conquer. By integrating weakly constrained Super-Structures with efficient graph partitioning and merging strategies, our approach substantially lowers CI test overhead without sacrificing accuracy. We instantiate the framework in a concrete causal discovery algorithm and rigorously evaluate its components on synthetic data. Comprehensive experiments on Gaussian Bayesian networks, including magic-NIAB, ECOLI70, and magic-IRRI, demonstrate that our method matches or closely approximates the structural accuracy of PC and FCI while drastically reducing the number of CI tests. Further validation on the real-world China Health and Retirement Longitudinal Study (CHARLS) dataset confirms its practical applicability. Our results establish that accurate, scalable causal discovery is achievable even under minimal assumptions about the initial Super-Structure, opening new avenues for applying divide-and-conquer methods to large-scale, knowledge-scarce domains such as biomedical and social science research.

[LG-83] Echo State Networks for Time Series Forecasting: Hyperparameter Sweep and Benchmarking

链接: https://arxiv.org/abs/2602.03912
作者: Alexander Häußer
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper investigates the forecasting performance of Echo State Networks (ESNs) for univariate time series forecasting using a subset of the M4 Forecasting Competition dataset. Focusing on monthly and quarterly time series with at most 20 years of historical data, we evaluate whether a fully automatic, purely feedback-driven ESN can serve as a competitive alternative to widely used statistical forecasting methods. The study adopts a rigorous two-stage evaluation approach: a Parameter dataset is used to conduct an extensive hyperparameter sweep covering leakage rate, spectral radius, reservoir size, and information criteria for regularization, resulting in over four million ESN model fits; a disjoint Forecast dataset is then used for out-of-sample accuracy assessment. Forecast accuracy is measured using MASE and sMAPE and benchmarked against simple benchmarks like drift and seasonal naive and statistical models like ARIMA, ETS, and TBATS. The hyperparameter analysis reveals consistent and interpretable patterns, with monthly series favoring moderately persistent reservoirs and quarterly series favoring more contractive dynamics. Across both frequencies, high leakage rates are preferred, while optimal spectral radii and reservoir sizes vary with temporal resolution. In the out-of-sample evaluation, the ESN performs on par with ARIMA and TBATS for monthly data and achieves the lowest mean MASE for quarterly data, while requiring lower computational cost than the more complex statistical models. Overall, the results demonstrate that ESNs offer a compelling balance between predictive accuracy, robustness, and computational efficiency, positioning them as a practical option for automated time series forecasting.

[LG-84] he Role of Target Update Frequencies in Q-Learning

链接: https://arxiv.org/abs/2602.03911
作者: Simon Weissmann,Tilman Aach,Benedikt Wille,Sebastian Kassing,Leif Döring
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The target network update frequency (TUF) is a central stabilization mechanism in (deep) Q-learning. However, their selection remains poorly understood and is often treated merely as another tunable hyperparameter rather than as a principled design decision. This work provides a theoretical analysis of target fixing in tabular Q-learning through the lens of approximate dynamic programming. We formulate periodic target updates as a nested optimization scheme in which each outer iteration applies an inexact Bellman optimality operator, approximated by a generic inner loop optimizer. Rigorous theory yields a finite-time convergence analysis for the asynchronous sampling setting, specializing to stochastic gradient descent in the inner loop. Our results deliver an explicit characterization of the bias-variance trade-off induced by the target update period, showing how to optimally set this critical hyperparameter. We prove that constant target update schedules are suboptimal, incurring a logarithmic overhead in sample complexity that is entirely avoidable with adaptive schedules. Our analysis shows that the optimal target update frequency increases geometrically over the course of the learning process.

[LG-85] NeuroPareto: Calibrated Acquisition for Costly Many-Goal Search in Vast Parameter Spaces

链接: https://arxiv.org/abs/2602.03901
作者: Rong Fu,Wenxin Zhang,Chunlei Meng,Youjin Wang,Haoyu Zhao,Jiaxuan Lu,Kun Liu,JiaBao Dou,Simon James Fong
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 39 pages, 19 figures

点击查看摘要

Abstract:The pursuit of optimal trade-offs in high-dimensional search spaces under stringent computational constraints poses a fundamental challenge for contemporary multi-objective optimization. We develop NeuroPareto, a cohesive architecture that integrates rank-centric filtering, uncertainty disentanglement, and history-conditioned acquisition strategies to navigate complex objective landscapes. A calibrated Bayesian classifier estimates epistemic uncertainty across non-domination tiers, enabling rapid generation of high-quality candidates with minimal evaluation cost. Deep Gaussian Process surrogates further separate predictive uncertainty into reducible and irreducible components, providing refined predictive means and risk-aware signals for downstream selection. A lightweight acquisition network, trained online from historical hypervolume improvements, guides expensive evaluations toward regions balancing convergence and diversity. With hierarchical screening and amortized surrogate updates, the method maintains accuracy while keeping computational overhead low. Experiments on DTLZ and ZDT suites and a subsurface energy extraction task show that NeuroPareto consistently outperforms classifier-enhanced and surrogate-assisted baselines in Pareto proximity and hypervolume.

[LG-86] heory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model

链接: https://arxiv.org/abs/2602.04774
作者: Blake Bordelon,Francesco Mori
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Setting the learning rate for a deep learning model is a critical part of successful training, yet choosing this hyperparameter is often done empirically with trial and error. In this work, we explore a solvable model of optimal learning rate schedules for a powerlaw random feature model trained with stochastic gradient descent (SGD). We consider the optimal schedule \eta_T^\star(t) where t is the current iterate and T is the total training horizon. This schedule is computed both numerically and analytically (when possible) using optimal control methods. Our analysis reveals two regimes which we term the easy phase and hard phase. In the easy phase the optimal schedule is a polynomial decay \eta_T^\star(t) \simeq T^-\xi (1-t/T)^\delta where \xi and \delta depend on the properties of the features and task. In the hard phase, the optimal schedule resembles warmup-stable-decay with constant (in T ) initial learning rate and annealing performed over a vanishing (in T ) fraction of training steps. We investigate joint optimization of learning rate and batch size, identifying a degenerate optimality condition. Our model also predicts the compute-optimal scaling laws (where model size and training steps are chosen optimally) in both easy and hard regimes. Going beyond SGD, we consider optimal schedules for the momentum \beta(t) , where speedups in the hard phase are possible. We compare our optimal schedule to various benchmarks in our task including (1) optimal constant learning rates \eta_T(t) \sim T^-\xi (2) optimal power laws \eta_T(t) \sim T^-\xi t^-\chi , finding that our schedule achieves better rates than either of these. Our theory suggests that learning rate transfer across training horizon depends on the structure of the model and task. We explore these ideas in simple experimental pretraining setups.

[LG-87] Conditional Counterfactual Mean Embeddings: Doubly Robust Estimation and Learning Rates

链接: https://arxiv.org/abs/2602.04736
作者: Thatchanon Anancharoenkij,Donlapark Ponnoprat
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Code is available at this https URL

点击查看摘要

Abstract:A complete understanding of heterogeneous treatment effects involves characterizing the full conditional distribution of potential outcomes. To this end, we propose the Conditional Counterfactual Mean Embeddings (CCME), a framework that embeds conditional distributions of counterfactual outcomes into a reproducing kernel Hilbert space (RKHS). Under this framework, we develop a two-stage meta-estimator for CCME that accommodates any RKHS-valued regression in each stage. Based on this meta-estimator, we develop three practical CCME estimators: (1) Ridge Regression estimator, (2) Deep Feature estimator that parameterizes the feature map by a neural network, and (3) Neural-Kernel estimator that performs RKHS-valued regression, with the coefficients parameterized by a neural network. We provide finite-sample convergence rates for all estimators, establishing that they possess the double robustness property. Our experiments demonstrate that our estimators accurately recover distributional features including multimodal structure of conditional counterfactual distributions.

[LG-88] Cross-Attention Transformer for Joint Multi-Receiver Uplink Neural Decoding

链接: https://arxiv.org/abs/2602.04728
作者: Xavier Tardy,Grégoire Lefebvre,Apostolos Kountouris,Haïfa Fares,Amor Nafkha
类目: ignal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 6 pages, 3 figures, 3 tables, conference submission

点击查看摘要

Abstract:We propose a cross-attention Transformer for joint decoding of uplink OFDM signals received by multiple coordinated access points. A shared per-receiver encoder learns time-frequency structure within each received grid, and a token-wise cross-attention module fuses the receivers to produce soft log-likelihood ratios for a standard channel decoder, without requiring explicit per-receiver channel estimates. Trained with a bit-metric objective, the model adapts its fusion to per-receiver reliability, tolerates missing or degraded links, and remains robust when pilots are sparse. Across realistic Wi-Fi channels, it consistently outperforms classical pipelines and strong convolutional baselines, frequently matching (and in some cases surpassing) a powerful baseline that assumes perfect channel knowledge per access point. Despite its expressiveness, the architecture is compact, has low computational cost (low GFLOPs), and achieves low latency on GPUs, making it a practical building block for next-generation Wi-Fi receivers.

[LG-89] Knowledge Distillation for mmWave Beam Prediction Using Sub-6 GHz Channels ICASSP

链接: https://arxiv.org/abs/2602.04703
作者: Sina Tavakolian,Nhan Thanh Nguyen,Ahmed Alkhateeb,Markku Juntti
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 5 pages, 4 figures. Accepted for publication at IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2026

点击查看摘要

Abstract:Beamforming in millimeter-wave (mmWave) high-mobility environments typically incurs substantial training overhead. While prior studies suggest that sub-6 GHz channels can be exploited to predict optimal mmWave beams, existing methods depend on large deep learning (DL) models with prohibitive computational and memory requirements. In this paper, we propose a computationally efficient framework for sub-6 GHz channel-mmWave beam mapping based on the knowledge distillation (KD) technique. We develop two compact student DL architectures based on individual and relational distillation strategies, which retain only a few hidden layers yet closely mimic the performance of large teacher DL models. Extensive simulations demonstrate that the proposed student models achieve the teacher’s beam prediction accuracy and spectral efficiency while reducing trainable parameters and computational complexity by 99%.

[LG-90] Beyond Learning on Molecules by Weakly Supervising on Molecules

链接: https://arxiv.org/abs/2602.04696
作者: Gordan Prastalo,Kevin Maik Jablonka
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Molecular representations are inherently task-dependent, yet most pre-trained molecular encoders are not. Task conditioning promises representations that reorganize based on task descriptions, but existing approaches rely on expensive labeled data. We show that weak supervision on programmatically derived molecular motifs is sufficient. Our Adaptive Chemical Embedding Model (ACE-Mol) learns from hundreds of motifs paired with natural language descriptors that are cheap to compute, trivial to scale. Conventional encoders slowly search the embedding space for task-relevant structure, whereas ACE-Mol immediately aligns its representations with the task. ACE-Mol achieves state-of-the-art performance across molecular property prediction benchmarks with interpretable, chemically meaningful representations.

[LG-91] Causal explanations of outliers in systems with lagged time-dependencies

链接: https://arxiv.org/abs/2602.04667
作者: Philipp Alexander Schwarz,Johannes Oberpriller,Sven Klaassen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Root-cause analysis in controlled time dependent systems poses a major challenge in applications. Especially energy systems are difficult to handle as they exhibit instantaneous as well as delayed effects and if equipped with storage, do have a memory. In this paper we adapt the causal root-cause analysis method of Budhathoki et al. [2022] to general time-dependent systems, as it can be regarded as a strictly causal definition of the term “root-cause”. Particularly, we discuss two truncation approaches to handle the infinite dependency graphs present in time-dependent systems. While one leaves the causal mechanisms intact, the other approximates the mechanisms at the start nodes. The effectiveness of the different approaches is benchmarked using a challenging data generation process inspired by a problem in factory energy management: the avoidance of peaks in the power consumption. We show that given enough lags our extension is able to localize the root-causes in the feature and time domain. Further the effect of mechanism approximation is discussed.

[LG-92] Learning to Separate RF Signals Under Uncertainty: Detect-Then-Separate vs. Unified Joint Models

链接: https://arxiv.org/abs/2602.04650
作者: Ariel Rodrigez,Alejandro Lancho,Amir Weiss
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 6 pages, 6 figures, 1 table, accepted at the 2026 IEEE International Conference on Communications

点击查看摘要

Abstract:The increasingly crowded radio frequency (RF) spectrum forces communication signals to coexist, creating heterogeneous interferers whose structure often departs from Gaussian models. Recovering the interference-contaminated signal of interest in such settings is a central challenge, especially in single-channel RF processing. Existing data-driven methods often assume that the interference type is known, yielding ensembles of specialized models that scale poorly with the number of interferers. We show that detect-then-separate (DTS) strategies admit an analytical justification: within a Gaussian mixture framework, a plug-in maximum a posteriori detector followed by type-conditioned optimal estimation achieves asymptotic minimum mean-square error optimality under a mild temporal-diversity condition. This makes DTS a principled benchmark, but its reliance on multiple type-specific models limits scalability. Motivated by this, we propose a unified joint model (UJM), in which a single deep neural architecture learns to jointly detect and separate when applied directly to the received signal. Using tailored UNet architectures for baseband (complex-valued) RF signals, we compare DTS and UJM on synthetic and recorded interference types, showing that a capacity-matched UJM can match oracle-aided DTS performance across diverse signal-to-interference-and-noise ratios, interference types, and constellation orders, including mismatched training and testing type-uncertainty proportions. These findings highlight UJM as a scalable and practical alternative to DTS, while opening new directions for unified separation under broader regimes.

[LG-93] argeted Synthetic Control Method

链接: https://arxiv.org/abs/2602.04611
作者: Yuxin Wang,Dennis Frauen,Emil Javurek,Konstantin Hess,Yuchen Ma,Stefan Feuerriegel
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The synthetic control method (SCM) estimates causal effects in panel data with a single-treated unit by constructing a counterfactual outcome as a weighted combination of untreated control units that matches the pre-treatment trajectory. In this paper, we introduce the targeted synthetic control (TSC) method, a new two-stage estimator that directly estimates the counterfactual outcome. Specifically, our TSC method (1) yields a targeted debiasing estimator, in the sense that the targeted updating refines the initial weights to produce more stable weights; and (2) ensures that the final counterfactual estimation is a convex combination of observed control outcomes to enable direct interpretation of the synthetic control weights. TSC is flexible and can be instantiated with arbitrary machine learning models. Methodologically, TSC starts from an initial set of synthetic-control weights via a one-dimensional targeted update through the weight-tilting submodel, which calibrates the weights to reduce bias of weights estimation arising from pre-treatment fit. Furthermore, TSC avoids key shortcomings of existing methods (e.g., the augmented SCM), which can produce unbounded counterfactual estimates. Across extensive synthetic and real-world experiments, TSC consistently improves estimation accuracy over state-of-the-art SCM baselines.

[LG-94] A principled framework for uncertainty decomposition in TabPFN

链接: https://arxiv.org/abs/2602.04596
作者: Sandra Fortini,Kenyon Ng,Sonia Petrone,Judith Rousseau,Susan Wei
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 9 pages (+2 reference, +34 appendix). Code in this https URL

点击查看摘要

Abstract:TabPFN is a transformer that achieves state-of-the-art performance on supervised tabular tasks by amortizing Bayesian prediction into a single forward pass. However, there is currently no method for uncertainty decomposition in TabPFN. Because it behaves, in an idealised limit, as a Bayesian in-context learner, we cast the decomposition challenge as a Bayesian predictive inference (BPI) problem. The main computational tool in BPI, predictive Monte Carlo, is challenging to apply here as it requires simulating unmodeled covariates. We therefore pursue the asymptotic alternative, filling a gap in the theory for supervised settings by proving a predictive CLT under quasi-martingale conditions. We derive variance estimators determined by the volatility of predictive updates along the context. The resulting credible bands are fast to compute, target epistemic uncertainty, and achieve near-nominal frequentist coverage. For classification, we further obtain an entropy-based uncertainty decomposition.

[LG-95] Universality of General Spiked Tensor Models

链接: https://arxiv.org/abs/2602.04472
作者: Yanjin Xiang,Zhihua Zhang
类目: atistics Theory (math.ST); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注: 102pages

点击查看摘要

Abstract:We study the rank-one spiked tensor model in the high-dimensional regime, where the noise entries are independent and identically distributed with zero mean, unit variance, and finite fourth this http URL setting extends the classical Gaussian framework to a substantially broader class of noise this http URL on asymmetric tensors of order d ( \ge 3 ), we analyze the maximum likelihood estimator of the best rank-one this http URL a mild assumption isolating informative critical points of the associated optimization landscape, we show that the empirical spectral distribution of a suitably defined block-wise tensor contraction converges almost surely to a deterministic limit that coincides with the Gaussian this http URL a consequence, the asymptotic singular value and the alignments between the estimated and true spike directions admit explicit characterizations identical to those obtained under Gaussian noise. These results establish a universality principle for spiked tensor models, demonstrating that their high-dimensional spectral behavior and statistical limits are robust to non-Gaussian noise. Our analysis relies on resolvent methods from random matrix theory, cumulant expansions valid under finite moment assumptions, and variance bounds based on Efron-Stein-type arguments. A key challenge in the proof is how to handle the statistical dependence between the signal term and the noise term. Comments: 102pages Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML) MSC classes: 60B20, 62H25 Cite as: arXiv:2602.04472 [math.ST] (or arXiv:2602.04472v1 [math.ST] for this version) https://doi.org/10.48550/arXiv.2602.04472 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-96] Bayesian PINNs for uncertainty-aware inverse problems (BPINN-IP) ICIP2006

链接: https://arxiv.org/abs/2602.04459
作者: Ali Mohammad-Djafari
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: submitted to ICIP 2006 conference

点击查看摘要

Abstract:The main contribution of this paper is to develop a hierarchical Bayesian formulation of PINNs for linear inverse problems, which is called BPINN-IP. The proposed methodology extends PINN to account for prior knowledge on the nature of the expected NN output, as well as its weights. Also, as we can have access to the posterior probability distributions, naturally uncertainties can be quantified. Also, variational inference and Monte Carlo dropout are employed to provide predictive means and variances for reconstructed images. Un example of applications to deconvolution and super-resolution is considered, details of the different steps of implementations are given, and some preliminary results are presented.

[LG-97] Journey to the Centre of Cluster: Harnessing Interior Nodes for A/B Testing under Network Interference ICLR2026

链接: https://arxiv.org/abs/2602.04457
作者: Qianyi Chen,Anpeng Wu,Bo Li,Lu Deng,Yong Wang
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注: ICLR 2026

点击查看摘要

Abstract:A/B testing on platforms often faces challenges from network interference, where a unit’s outcome depends not only on its own treatment but also on the treatments of its network neighbors. To address this, cluster-level randomization has become standard, enabling the use of network-aware estimators. These estimators typically trim the data to retain only a subset of informative units, achieving low bias under suitable conditions but often suffering from high variance. In this paper, we first demonstrate that the interior nodes - units whose neighbors all lie within the same cluster - constitute the vast majority of the post-trimming subpopulation. In light of this, we propose directly averaging over the interior nodes to construct the mean-in-interior (MII) estimator, which circumvents the delicate reweighting required by existing network-aware estimators and substantially reduces variance in classical settings. However, we show that interior nodes are often not representative of the full population, particularly in terms of network-dependent covariates, leading to notable bias. We then augment the MII estimator with a counterfactual predictor trained on the entire network, allowing us to adjust for covariate distribution shifts between the interior nodes and full population. By rearranging the expression, we reveal that our augmented MII estimator embodies an analytical form of the point estimator within prediction-powered inference framework. This insight motivates a semi-supervised lens, wherein interior nodes are treated as labeled data subject to selection bias. Extensive and challenging simulation studies demonstrate the outstanding performance of our augmented MII estimator across various settings.

[LG-98] Machine Learning-Driven Crystal System Prediction for Perovskites Using Augmented X-ray Diffraction Data

链接: https://arxiv.org/abs/2602.04435
作者: Ansu Mathew,Ahmer A. B. Baloch,Alamin Yakasai,Hemant Mittal,Vivian Alberts,Jayakumar V. Karunamurthy
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 37 pages, 7 figures. Author accepted manuscript. Published in Engineering Applications of Artificial Intelligence

点击查看摘要

Abstract:Prediction of crystal system from X-ray diffraction (XRD) spectra is a critical task in materials science, particularly for perovskite materials which are known for their diverse applications in photovoltaics, optoelectronics, and catalysis. In this study, we present a machine learning (ML)-driven framework that leverages advanced models, including Time Series Forest (TSF), Random Forest (RF), Extreme Gradient Boosting (XGBoost), Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and a simple feedforward neural network (NN), to classify crystal systems, point groups, and space groups from XRD data of perovskite materials. To address class imbalance and enhance model robustness, we integrated feature augmentation strategies such as Synthetic Minority Over-sampling Technique (SMOTE), class weighting, jittering, and spectrum shifting, along with efficient data preprocessing pipelines. The TSF model with SMOTE augmentation achieved strong performance for crystal system prediction, with a Matthews correlation coefficient (MCC) of 0.9, an F1 score of 0.92, and an accuracy of 97.76%. For point and space group prediction, balanced accuracies above 95% were obtained. The model demonstrated high performance for symmetry-distinct classes, including cubic crystal systems, point groups 3m and m-3m, and space groups Pnma and Pnnn. This work highlights the potential of ML for XRD-based structural characterization and accelerated discovery of perovskite materials

[LG-99] Anytime-Valid Conformal Risk Control

链接: https://arxiv.org/abs/2602.04364
作者: Bror Hultberg,Dave Zachariah,Antônio H. Ribeiro
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Prediction sets provide a means of quantifying the uncertainty in predictive tasks. Using held out calibration data, conformal prediction and risk control can produce prediction sets that exhibit statistically valid error control in a computationally efficient manner. However, in the standard formulations, the error is only controlled on average over many possible calibration datasets of fixed size. In this paper, we extend the control to remain valid with high probability over a cumulatively growing calibration dataset at any time point. We derive such guarantees using quantile-based arguments and illustrate the applicability of the proposed framework to settings involving distribution shift. We further establish a matching lower bound and show that our guarantees are asymptotically tight. Finally, we demonstrate the practical performance of our methods through both simulations and real-world numerical examples.

[LG-100] A Bandit-Based Approach to Educational Recommender Systems: Contextual Thompson Sampling for Learner Skill Gain Optimization

链接: https://arxiv.org/abs/2602.04347
作者: Lukas De Kerpel,Arthur Thuy,Dries F. Benoit
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Accepted for publication in INFORMS Transactions on Education

点击查看摘要

Abstract:In recent years, instructional practices in Operations Research (OR), Management Science (MS), and Analytics have increasingly shifted toward digital environments, where large and diverse groups of learners make it difficult to provide practice that adapts to individual needs. This paper introduces a method that generates personalized sequences of exercises by selecting, at each step, the exercise most likely to advance a learner’s understanding of a targeted skill. The method uses information about the learner and their past performance to guide these choices, and learning progress is measured as the change in estimated skill level before and after each exercise. Using data from an online mathematics tutoring platform, we find that the approach recommends exercises associated with greater skill improvement and adapts effectively to differences across learners. From an instructional perspective, the framework enables personalized practice at scale, highlights exercises with consistently strong learning value, and helps instructors identify learners who may benefit from additional support.

[LG-101] Geometry-Aware Optimal Transport: Fast Intrinsic Dimension and Wasserstein Distance Estimation

链接: https://arxiv.org/abs/2602.04335
作者: Ferdinand Genans(SU, LPSM),Olivier Wintenberger(SU, LPSM)
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Solving large scale Optimal Transport (OT) in machine learning typically relies on sampling measures to obtain a tractable discrete problem. While the discrete solver’s accuracy is controllable, the rate of convergence of the discretization error is governed by the intrinsic dimension of our data. Therefore, the true bottleneck is the knowledge and control of the sampling error. In this work, we tackle this issue by introducing novel estimators for both sampling error and intrinsic dimension. The key finding is a simple, tuning-free estimator of \textOT_c(\rho, \hat\rho) that utilizes the semi-dual OT functional and, remarkably, requires no OT solver. Furthermore, we derive a fast intrinsic dimension estimator from the multi-scale decay of our sampling error estimator. This framework unlocks significant computational and statistical advantages in practice, enabling us to (i) quantify the convergence rate of the discretization error, (ii) calibrate the entropic regularization of Sinkhorn divergences to the data’s intrinsic geometry, and (iii) introduce a novel, intrinsic-dimension-based Richardson extrapolation estimator that strongly debiases Wasserstein distance estimation. Numerical experiments demonstrate that our geometry-aware pipeline effectively mitigates the discretization error bottleneck while maintaining computational efficiency.

[LG-102] Bures-Wasserstein Importance-Weighted Evidence Lower Bound: Exposition and Applications

链接: https://arxiv.org/abs/2602.04272
作者: Peiwen Jiang,Takuo Matsubara,Minh-Ngoc Tran
类目: Computation (stat.CO); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 27 pages, 6 figures. Submitted to Bayesian Analysis

点击查看摘要

Abstract:The Importance-Weighted Evidence Lower Bound (IW-ELBO) has emerged as an effective objective for variational inference (VI), tightening the standard ELBO and mitigating the mode-seeking behaviour. However, optimizing the IW-ELBO in Euclidean space is often inefficient, as its gradient estimators suffer from a vanishing signal-to-noise ratio (SNR). This paper formulates the optimisation of the IW-ELBO in Bures-Wasserstein space, a manifold of Gaussian distributions equipped with the 2-Wasserstein metric. We derive the Wasserstein gradient of the IW-ELBO and project it onto the Bures-Wasserstein space to yield a tractable algorithm for Gaussian VI. A pivotal contribution of our analysis concerns the stability of the gradient estimator. While the SNR of the standard Euclidean gradient estimator is known to vanish as the number of importance samples K increases, we prove that the SNR of the Wasserstein gradient scales favourably as \Omega(\sqrtK) , ensuring optimisation efficiency even for large K . We further extend this geometric analysis to the Variational Rényi Importance-Weighted Autoencoder bound, establishing analogous stability guarantees. Experiments demonstrate that the proposed framework achieves superior approximation performance compared to other baselines. Comments: 27 pages, 6 figures. Submitted to Bayesian Analysis Subjects: Computation (stat.CO); Machine Learning (cs.LG); Methodology (stat.ME) Cite as: arXiv:2602.04272 [stat.CO] (or arXiv:2602.04272v1 [stat.CO] for this version) https://doi.org/10.48550/arXiv.2602.04272 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-103] Aortic Valve Disease Detection from PPG via Physiology-Informed Self-Supervised Learning

链接: https://arxiv.org/abs/2602.04266
作者: Jiaze Wang,Qinghao Zhao,Zizheng Chen,Zhejun Sun,Deyun Zhang,Yuxi Zhou,Shenda Hong
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 28 pages, 7 figures. Under review

点击查看摘要

Abstract:Traditional diagnosis of aortic valve disease relies on echocardiography, but its cost and required expertise limit its use in large-scale early screening. Photoplethysmography (PPG) has emerged as a promising screening modality due to its widespread availability in wearable devices and its ability to reflect underlying hemodynamic dynamics. However, the extreme scarcity of gold-standard labeled PPG data severely constrains the effectiveness of data-driven approaches. To address this challenge, we propose and validate a new paradigm, Physiology-Guided Self-Supervised Learning (PG-SSL), aimed at unlocking the value of large-scale unlabeled PPG data for efficient screening of Aortic Stenosis (AS) and Aortic Regurgitation (AR). Using over 170,000 unlabeled PPG samples from the UK Biobank, we formalize clinical knowledge into a set of PPG morphological phenotypes and construct a pulse pattern recognition proxy task for self-supervised pre-training. A dual-branch, gated-fusion architecture is then employed for efficient fine-tuning on a small labeled subset. The proposed PG-SSL framework achieves AUCs of 0.765 and 0.776 for AS and AR screening, respectively, significantly outperforming supervised baselines trained on limited labeled data. Multivariable analysis further validates the model output as an independent digital biomarker with sustained prognostic value after adjustment for standard clinical risk factors. This study demonstrates that PG-SSL provides an effective, domain knowledge-driven solution to label scarcity in medical artificial intelligence and shows strong potential for enabling low-cost, large-scale early screening of aortic valve disease.

[LG-104] Provable Target Sample Complexity Improvements as Pre-Trained Models Scale AISTATS2026

链接: https://arxiv.org/abs/2602.04233
作者: Kazuto Fukuchi,Ryuichiro Hataya,Kota Matsui
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: AISTATS2026

点击查看摘要

Abstract:Pre-trained models have become indispensable for efficiently building models across a broad spectrum of downstream tasks. The advantages of pre-trained models have been highlighted by empirical studies on scaling laws, which demonstrate that larger pre-trained models can significantly reduce the sample complexity of downstream learning. However, existing theoretical investigations of pre-trained models lack the capability to explain this phenomenon. In this paper, we provide a theoretical investigation by introducing a novel framework, caulking, inspired by parameter-efficient fine-tuning (PEFT) methods such as adapter-based fine-tuning, low-rank adaptation, and partial fine-tuning. Our analysis establishes that improved pre-trained models provably decrease the sample complexity of downstream tasks, thereby offering theoretical justification for the empirically observed scaling laws relating pre-trained model size to downstream performance, a relationship not covered by existing results.

[LG-105] Maximin Relative Improvement: Fair Learning as a Bargaining Problem

链接: https://arxiv.org/abs/2602.04155
作者: Jiwoo Han,Moulinath Banerjee,Yuekai Sun
类目: Machine Learning (stat.ML); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:When deploying a single predictor across multiple subpopulations, we propose a fundamentally different approach: interpreting group fairness as a bargaining problem among subpopulations. This game-theoretic perspective reveals that existing robust optimization methods such as minimizing worst-group loss or regret correspond to classical bargaining solutions and embody different fairness principles. We propose relative improvement, the ratio of actual risk reduction to potential reduction from a baseline predictor, which recovers the Kalai-Smorodinsky solution. Unlike absolute-scale methods that may not be comparable when groups have different potential predictability, relative improvement provides axiomatic justification including scale invariance and individual monotonicity. We establish finite-sample convergence guarantees under mild conditions.

[LG-106] Attack-Resistant Uniform Fairness for Linear and Smooth Contextual Bandits

链接: https://arxiv.org/abs/2602.04125
作者: Qingwen Zhang,Wenjia Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Modern systems, such as digital platforms and service systems, increasingly rely on contextual bandits for online decision-making; however, their deployment can inadvertently create unfair exposure among arms, undermining long-term platform sustainability and supplier trust. This paper studies the contextual bandit problem under a uniform (1-\delta) -fairness constraint, and addresses its unique vulnerabilities to strategic manipulation. The fairness constraint ensures that preferential treatment is strictly justified by an arm’s actual reward across all contexts and time horizons, using uniformity to prevent statistical loopholes. We develop novel algorithms that achieve (nearly) minimax-optimal regret for both linear and smooth reward functions, while maintaining strong (1-\tildeO(1/T)) -fairness guarantees, and further characterize the theoretically inherent yet asymptotically marginal “price of fairness”. However, we reveal that such merit-based fairness becomes uniquely susceptible to signal manipulation. We show that an adversary with a minimal \tildeO(1) budget can not only degrade overall performance as in traditional attacks, but also selectively induce insidious fairness-specific failures while leaving conspicuous regret measures largely unaffected. To counter this, we design robust variants incorporating corruption-adaptive exploration and error-compensated thresholding. Our approach yields the first minimax-optimal regret bounds under C -budgeted attack while preserving (1-\tildeO(1/T)) -fairness. Numerical experiments and a real-world case demonstrate that our algorithms sustain both fairness and efficiency.

[LG-107] Efficient Subgroup Analysis via Optimal Trees with Global Parameter Fusion

链接: https://arxiv.org/abs/2602.04077
作者: Zhongming Xie,Joseph Giorgio,Jingshen Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Identifying and making statistical inferences on differential treatment effects (commonly known as subgroup analysis in clinical research) is central to precision health. Subgroup analysis allows practitioners to pinpoint populations for whom a treatment is especially beneficial or protective, thereby advancing targeted interventions. Tree based recursive partitioning methods are widely used for subgroup analysis due to their interpretability. Nevertheless, these approaches encounter significant limitations, including suboptimal partitions induced by greedy heuristics and overfitting from locally estimated splits, especially under limited sample sizes. To address these limitations, we propose a fused optimal causal tree method that leverages mixed integer optimization (MIO) to facilitate precise subgroup identification. Our approach ensures globally optimal partitions and introduces a parameter fusion constraint to facilitate information sharing across related subgroups. This design substantially improves subgroup discovery accuracy and enhances statistical efficiency. We provide theoretical guarantees by rigorously establishing out of sample risk bounds and comparing them with those of classical tree based methods. Empirically, our method consistently outperforms popular baselines in simulations. Finally, we demonstrate its practical utility through a case study on the Health and Aging Brain Study Health Disparities (HABS-HD) dataset, where our approach yields clinically meaningful insights.

[LG-108] hermodynamic assessment of machine learning models for solid-state synthesis prediction

链接: https://arxiv.org/abs/2602.04075
作者: Jane Schlesinger,Simon Hjaltason,Nathan J. Szymanski,Christopher J. Bartel
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning models have recently emerged to predict whether hypothetical solid-state materials can be synthesized. These models aim to circumvent direct first-principles modeling of solid-state phase transformations, instead learning from large databases of successfully synthesized materials. Here, we assess the alignment of several recently introduced synthesis prediction models with material and reaction thermodynamics, quantified by the energy with respect to the convex hull and a metric accounting for thermodynamic selectivity of enumerated synthesis reactions. A dataset of successful synthesis recipes was used to determine the likely bounds on both quantities beyond which materials can be deemed unlikely to be synthesized. With these bounds as context, thermodynamic quantities were computed using the CHGNet foundation potential for thousands of new hypothetical materials generated using the Chemeleon generative model. Four recently published machine learning models for synthesizability prediction were applied to this same dataset, and the resultant predictions were considered against computed thermodynamics. We find these models generally overpredict the likelihood of synthesis, but some model scores do trend with thermodynamic heuristics, assigning lower scores to materials that are less stable or do not have an available synthesis recipe that is calculated to be thermodynamically selective. In total, this work identifies existing gaps in machine learning models for materials synthesis and introduces a new approach to assess their quality in the absence of extensive negative examples (failed syntheses).

[LG-109] A Multi-Modal Foundational Model for Wireless Communication and Sensing

链接: https://arxiv.org/abs/2602.04016
作者: Vahid Yazdnian,Yasaman Ghasempour
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Artificial intelligence is a key enabler for next-generation wireless communication and sensing. Yet, today’s learning-based wireless techniques do not generalize well: most models are task-specific, environment-dependent, and limited to narrow sensing modalities, requiring costly retraining when deployed in new scenarios. This work introduces a task-agnostic, multi-modal foundational model for physical-layer wireless systems that learns transferable, physics-aware representations across heterogeneous modalities, enabling robust generalization across tasks and environments. Our framework employs a physics-guided self-supervised pretraining strategy incorporating a dedicated physical token to capture cross-modal physical correspondences governed by electromagnetic propagation. The learned representations enable efficient adaptation to diverse downstream tasks, including massive multi-antenna optimization, wireless channel estimation, and device localization, using limited labeled data. Our extensive evaluations demonstrate superior generalization, robustness to deployment shifts, and reduced data requirements compared to task-specific baselines.

[LG-110] Functional Stochastic Localization

链接: https://arxiv.org/abs/2602.03999
作者: Anming Gu,Bobby Shi,Kevin Tian
类目: Probability (math.PR); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: Comments welcome!

点击查看摘要

Abstract:Eldan’s stochastic localization is a probabilistic construction that has proved instrumental to modern breakthroughs in high-dimensional geometry and the design of sampling algorithms. Motivated by sampling under non-Euclidean geometries and the mirror descent algorithm in optimization, we develop a functional generalization of Eldan’s process that replaces Gaussian regularization with regularization by any positive integer multiple of a log-Laplace transform. We further give a mixing time bound on the Markov chain induced by our localization process, which holds if our target distribution satisfies a functional Poincaré inequality. Finally, we apply our framework to differentially private convex optimization in \ell_p norms for p \in [1, 2) , where we improve state-of-the-art query complexities in a zeroth-order model.

[LG-111] Statistical Guarantees for Reason ing Probes on Looped Boolean Circuits

链接: https://arxiv.org/abs/2602.03970
作者: Anastasis Kratsios,Giulia Livieri,A. Martina Neuman
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Metric Geometry (math.MG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We study the statistical behaviour of reasoning probes in a stylized model of looped reasoning, given by Boolean circuits whose computational graph is a perfect \nu -ary tree ( \nu\ge 2 ) and whose output is appended to the input and fed back iteratively for subsequent computation rounds. A reasoning probe has access to a sampled subset of internal computation nodes, possibly without covering the entire graph, and seeks to infer which \nu -ary Boolean gate is executed at each queried node, representing uncertainty via a probability distribution over a fixed collection of \mathttm admissible \nu -ary gates. This partial observability induces a generalization problem, which we analyze in a realizable, transductive setting. We show that, when the reasoning probe is parameterized by a graph convolutional network (GCN)-based hypothesis class and queries N nodes, the worst-case generalization error attains the optimal rate \mathcalO(\sqrt\log(2/\delta)/\sqrtN) with probability at least 1-\delta , for \delta\in (0,1) . Our analysis combines snowflake metric embedding techniques with tools from statistical optimal transport. A key insight is that this optimal rate is achievable independently of graph size, owing to the existence of a low-distortion one-dimensional snowflake embedding of the induced graph metric. As a consequence, our results provide a sharp characterization of how structural properties of the computational graph govern the statistical efficiency of reasoning under partial access. Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Metric Geometry (math.MG); Statistics Theory (math.ST) Cite as: arXiv:2602.03970 [stat.ML] (or arXiv:2602.03970v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2602.03970 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-112] Learning Multi-type heterogeneous interacting particle systems

链接: https://arxiv.org/abs/2602.03954
作者: Quanjun Lang,Xiong Wang,Fei Lu,Mauro Maggioni
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:We propose a framework for the joint inference of network topology, multi-type interaction kernels, and latent type assignments in heterogeneous interacting particle systems from multi-trajectory data. This learning task is a challenging non-convex mixed-integer optimization problem, which we address through a novel three-stage approach. First, we leverage shared structure across agent interactions to recover a low-rank embedding of the system parameters via matrix sensing. Second, we identify discrete interaction types by clustering within the learned embedding. Third, we recover the network weight matrix and kernel coefficients through matrix factorization and a post-processing refinement. We provide theoretical guarantees with estimation error bounds under a Restricted Isometry Property (RIP) assumption and establish conditions for the exact recovery of interaction types based on cluster separability. Numerical experiments on synthetic datasets, including heterogeneous predator-prey systems, demonstrate that our method yields an accurate reconstruction of the underlying dynamics and is robust to noise.

[LG-113] Privacy utility trade offs for parameter estimation in degree heterogeneous higher order networks

链接: https://arxiv.org/abs/2602.03948
作者: Bibhabasu Mandal,Sagnik Nandy
类目: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:In sensitive applications involving relational datasets, protecting information about individual links from adversarial queries is of paramount importance. In many such settings, the available data are summarized solely through the degrees of the nodes in the network. We adopt the \beta model, which is the prototypical statistical model adopted for this form of aggregated relational information, and study the problem of minimax-optimal parameter estimation under both local and central differential privacy constraints. We establish finite sample minimax lower bounds that characterize the precise dependence of the estimation risk on the network size and the privacy parameters, and we propose simple estimators that achieve these bounds up to constants and logarithmic factors under both local and central differential privacy frameworks. Our results provide the first comprehensive finite sample characterization of privacy utility trade offs for parameter estimation in \beta models, addressing the classical graph case and extending the analysis to higher order hypergraph models. We further demonstrate the effectiveness of our methods through experiments on synthetic data and a real world communication network.

[LG-114] A Hitchhikers Guide to Poisson Gradient Estimation

链接: https://arxiv.org/abs/2602.03896
作者: Michael Ibrahim,Hanqi Zhao,Eli Sennesh,Zhi Li,Anqi Wu,Jacob L. Yates,Chengrui Li,Hadi Vafaii
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: Code: this https URL

点击查看摘要

Abstract:Poisson-distributed latent variable models are widely used in computational neuroscience, but differentiating through discrete stochastic samples remains challenging. Two approaches address this: Exponential Arrival Time (EAT) simulation and Gumbel-SoftMax (GSM) relaxation. We provide the first systematic comparison of these methods, along with practical guidance for practitioners. Our main technical contribution is a modification to the EAT method that theoretically guarantees an unbiased first moment (exactly matching the firing rate), and reduces second-moment bias. We evaluate these methods on their distributional fidelity, gradient quality, and performance on two tasks: (1) variational autoencoders with Poisson latents, and (2) partially observable generalized linear models, where latent neural connectivity must be inferred from observed spike trains. Across all metrics, our modified EAT method exhibits better overall performance (often comparable to exact gradients), and substantially higher robustness to hyperparameter choices. Together, our results clarify the trade-offs between these methods and offer concrete recommendations for practitioners working with Poisson latent variable models.

[LG-115] ranscendental Regularization of Finite Mixtures:Theoretical Guarantees and Practical Limitations

链接: https://arxiv.org/abs/2602.03889
作者: Ernest Fokoué
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 24 pages, 6 figures, 2 tables

点击查看摘要

Abstract:Finite mixture models are widely used for unsupervised learning, but maximum likelihood estimation via EM suffers from degeneracy as components collapse. We introduce transcendental regularization, a penalized likelihood framework with analytic barrier functions that prevent degeneracy while maintaining asymptotic efficiency. The resulting Transcendental Algorithm for Mixtures of Distributions (TAMD) offers strong theoretical guarantees: identifiability, consistency, and robustness. Empirically, TAMD successfully stabilizes estimation and prevents collapse, yet achieves only modest improvements in classification accuracy-highlighting fundamental limits of mixture models for unsupervised learning in high dimensions. Our work provides both a novel theoretical framework and an honest assessment of practical limitations, implemented in an open-source R package.

[LG-116] Prenatal Stress Detection from Electrocardiography Using Self-Supervised Deep Learning: Development and External Validation

链接: https://arxiv.org/abs/2602.03886
作者: Martin G. Frasch,Marlene J.E. Mayer,Clara Becker,Peter Zimmermann,Camilla Zelgert,Marta C. Antonelli,Silvia M. Lobmaier
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注: 22 pages, 5 figures

点击查看摘要

Abstract:Prenatal psychological stress affects 15-25% of pregnancies and increases risks of preterm birth, low birth weight, and adverse neurodevelopmental outcomes. Current screening relies on subjective questionnaires (PSS-10), limiting continuous monitoring. We developed deep learning models for stress detection from electrocardiography (ECG) using the FELICITy 1 cohort (151 pregnant women, 32-38 weeks gestation). A ResNet-34 encoder was pretrained via SimCLR contrastive learning on 40,692 ECG segments per subject. Multi-layer feature extraction enabled binary classification and continuous PSS prediction across maternal (mECG), fetal (fECG), and abdominal ECG (aECG). External validation used the FELICITy 2 RCT (28 subjects, different ECG device, yoga intervention vs. control). On FELICITy 1 (5-fold CV): mECG 98.6% accuracy (R2=0.88, MAE=1.90), fECG 99.8% (R2=0.95, MAE=1.19), aECG 95.5% (R2=0.75, MAE=2.80). External validation on FELICITy 2: mECG 77.3% accuracy (R2=0.62, MAE=3.54, AUC=0.826), aECG 63.6% (R2=0.29, AUC=0.705). Signal quality-based channel selection outperformed all-channel averaging (+12% R2 improvement). Mixed-effects models detected a significant intervention response (p=0.041). Self-supervised deep learning on pregnancy ECG enables accurate, objective stress assessment, with multi-layer feature extraction substantially outperforming single embedding approaches.

[LG-117] PENGUIN: General Vital Sign Reconstruction from PPG with Flow Matching State Space Model ICASSP2026

链接: https://arxiv.org/abs/2602.03858
作者: Shuntaro Suzuki,Shuitsu Koyama,Shinnosuke Hirano,Shunya Nagashima
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: Accepted for presentation at ICASSP2026

点击查看摘要

Abstract:Photoplethysmography (PPG) plays a crucial role in continuous cardiovascular health monitoring as a non-invasive and cost-effective modality. However, PPG signals are susceptible to motion artifacts and noise, making accurate estimation of vital signs such as arterial blood pressure (ABP) challenging. Existing estimation methods are often restricted to a single-task or environment, limiting their generalizability across diverse PPG decoding scenarios. Moreover, recent general-purpose approaches typically rely on predictions over multi-second intervals, discarding the morphological characteristics of vital signs. To address these challenges, we propose PENGUIN, a generative flow-matching framework that extends deep state space models, enabling fine-grained conditioning on PPG for reconstructing multiple vital signs as continuous waveforms. We evaluate PENGUIN using six real-world PPG datasets across three distinct vital sign reconstruction tasks (electrocardiogram reconstruction, respiratory monitoring, and ABP monitoring). Our method consistently outperformed both task-specific and general-purpose baselines, demonstrating PENGUIN as a general framework for robust vital sign reconstruction from PPG.

[LG-118] he Turing Synthetic Radar Dataset: A dataset for pulse deinterleaving

链接: https://arxiv.org/abs/2602.03856
作者: Edward Gunn,Adam Hosford,Robert Jones,Leo Zeitler,Ian Groves,Victoria Nockles
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 7 pages 6 figures, submitted to International Radar Symposium 2026

点击查看摘要

Abstract:We present the Turing Synthetic Radar Dataset, a comprehensive dataset to serve both as a benchmark for radar pulse deinterleaving research and as an enabler of new research methods. The dataset addresses the critical problem of separating interleaved radar pulses from multiple unknown emitters for electronic warfare applications and signal intelligence. Our dataset contains a total of 6000 pulse trains over two receiver configurations, totalling to almost 3 billion pulses, featuring realistic scenarios with up to 110 emitters and significant parameter space overlap. To encourage dataset adoption and establish standardised evaluation procedures, we have launched an accompanying Turing Deinterleaving Challenge, for which models need to associate pulses in interleaved pulse trains to the correct emitter by clustering and maximising metrics such as the V-measure. The Turing Synthetic Radar Dataset is one of the first publicly available, comprehensively simulated pulse train datasets aimed to facilitate sophisticated model development in the electronic warfare community

[LG-119] Majorization-Minimization Networks for Inverse Problems: An Application to EEG Imaging

链接: https://arxiv.org/abs/2602.03855
作者: Le Minh Triet Tran(IMT Atlantique, LaTIM),Sarah Reynaud(IMT Atlantique, LaTIM),Ronan Fablet(IMT Atlantique, Lab-STICC),Adrien Merlini(IMT Atlantique, Lab-STICC),François Rousseau(IMT Atlantique, LaTIM),Mai Quyen Pham(IMT Atlantique, Lab-STICC)
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Inverse problems are often ill-posed and require optimization schemes with strong stability and convergence guarantees. While learning-based approaches such as deep unrolling and meta-learning achieve strong empirical performance, they typically lack explicit control over descent and curvature, limiting robustness. We propose a learned Majorization-Minimization (MM) framework for inverse problems within a bilevel optimization setting. Instead of learning a full optimizer, we learn a structured curvature majorant that governs each MM step while preserving classical MM descent guarantees. The majorant is parameterized by a lightweight recurrent neural network and explicitly constrained to satisfy valid MM conditions. For cosine-similarity losses, we derive explicit curvature bounds yielding diagonal majorants. When analytic bounds are unavailable, we rely on efficient Hessian-vector product-based spectral estimation to automatically upper-bound local curvature without forming the Hessian explicitly. Experiments on EEG source imaging demonstrate improved accuracy, stability, and cross-dataset generalization over deep-unrolled and meta-learning baselines.

[LG-120] Online unsupervised Hebbian learning in deep photonic neuromorphic networks

链接: https://arxiv.org/abs/2601.22300
作者: Xi Li,Disha Biswas,Peng Zhou,Wesley H. Brigner,Anna Capuano,Joseph S. Friedman,Qing Gu
类目: Optics (physics.optics); Disordered Systems and Neural Networks (cond-mat.dis-nn); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 15 pages, 4 figures

点击查看摘要

Abstract:While software implementations of neural networks have driven significant advances in computation, the von Neumann architecture imposes fundamental limitations on speed and energy efficiency. Neuromorphic networks, with structures inspired by the brain’s architecture, offer a compelling solution with the potential to approach the extreme energy efficiency of neurobiological systems. Photonic neuromorphic networks (PNNs) are particularly attractive because they leverage the inherent advantages of light, namely high parallelism, low latency, and exceptional energy efficiency. Previous PNN demonstrations have largely focused on device-level functionalities or system-level implementations reliant on supervised learning and inefficient optical-electrical-optical (OEO) conversions. Here, we introduce a purely photonic deep PNN architecture that enables online, unsupervised learning. We propose a local feedback mechanism operating entirely in the optical domain that implements a Hebbian learning rule using non-volatile phase-change material synapses. We experimentally demonstrate this approach on a non-trivial letter recognition task using a commercially available fiber-optic platform and achieve a 100 percent recognition rate, showcasing an all-optical solution for efficient, real-time information processing. This work unlocks the potential of photonic computing for complex artificial intelligence applications by enabling direct, high-throughput processing of optical information without intermediate OEO signal conversions.

信息检索

[IR-0] Multi-Source Retrieval and Reason ing for Legal Sentencing Prediction

链接: https://arxiv.org/abs/2602.04690
作者: Junjie Chen,Haitao Li,Qilei Zhang,Zhenghua Li,Ya Zhang,Quan Zhou,Cheng Luo,Yiqun Liu,Dongsheng Guo,Qingyao Ai
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Legal judgment prediction (LJP) aims to predict judicial outcomes from case facts and typically includes law article, charge, and sentencing prediction. While recent methods perform well on the first two subtasks, legal sentencing prediction (LSP) remains difficult due to its need for fine-grained objective knowledge and flexible subjective reasoning. To address these limitations, we propose MSR^2 , a framework that integrates multi-source retrieval and reasoning in LLMs with reinforcement learning. MSR^2 enables LLMs to perform multi-source retrieval based on reasoning needs and applies a process-level reward to guide intermediate subjective reasoning steps. Experiments on two real-world datasets show that MSR^2 improves both accuracy and interpretability in LSP, providing a promising step toward practical legal AI. Our code is available at this https URL.

[IR-1] VK-LSVD: A Large-Scale Industrial Dataset for Short-Video Recommendation WWW’26

链接: https://arxiv.org/abs/2602.04567
作者: Aleksandr Poslavsky,Alexander D’yakonov,Yuriy Dorn,Andrey Zimovnov
类目: Information Retrieval (cs.IR); Computers and Society (cs.CY)
*备注: Accepted to The ACM Web Conference 2026 (WWW '26). Preprint of conference paper. 7 pages, 2 (7) figures, 4 tables. Dataset available at: this https URL

点击查看摘要

Abstract:Short-video recommendation presents unique challenges, such as modeling rapid user interest shifts from implicit feedback, but progress is constrained by a lack of large-scale open datasets that reflect real-world platform dynamics. To bridge this gap, we introduce the VK Large Short-Video Dataset (VK-LSVD), the largest publicly available industrial dataset of its kind. VK-LSVD offers an unprecedented scale of over 40 billion interactions from 10 million users and almost 20 million videos over six months, alongside rich features including content embeddings, diverse feedback signals, and contextual metadata. Our analysis supports the dataset’s quality and diversity. The dataset’s immediate impact is confirmed by its central role in the live VK RecSys Challenge 2025. VK-LSVD provides a vital, open dataset to use in building realistic benchmarks to accelerate research in sequential recommendation, cold-start scenarios, and next-generation recommender systems.

[IR-2] DOS: Dual-Flow Orthogonal Semantic IDs for Recommendation in Meituan WWW2026

链接: https://arxiv.org/abs/2602.04460
作者: Junwei Yin,Senjie Kou,Changhao Li,Shuli Wang,Xue Wei,Yinqiu Huang,Yinhua Zhu,Haitao Wang,Xingxing Wang
类目: Information Retrieval (cs.IR)
*备注: Accepted by WWW2026 (short paper)

点击查看摘要

Abstract:Semantic IDs serve as a key component in generative recommendation systems. They not only incorporate open-world knowledge from large language models (LLMs) but also compress the semantic space to reduce generation difficulty. However, existing methods suffer from two major limitations: (1) the lack of contextual awareness in generation tasks leads to a gap between the Semantic ID codebook space and the generation space, resulting in suboptimal recommendations; and (2) suboptimal quantization methods exacerbate semantic loss in LLMs. To address these issues, we propose Dual-Flow Orthogonal Semantic IDs (DOS) method. Specifically, DOS employs a user-item dual flow-framework that leverages collaborative signals to align the Semantic ID codebook space with the generation space. Furthermore, we introduce an orthogonal residual quantization scheme that rotates the semantic space to an appropriate orientation, thereby maximizing semantic preservation. Extensive offline experiments and online A/B testing demonstrate the effectiveness of DOS. The proposed method has been successfully deployed in Meituan’s mobile application, serving hundreds of millions of users.

[IR-3] SDR-CIR: Semantic Debias Retrieval Framework for Training-Free Zero-Shot Composed Image Retrieval

链接: https://arxiv.org/abs/2602.04451
作者: Yi Sun,Jinyu Xu,Qing Xie,Jiachen Li,Yanchun Ma,Yongjian Liu
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Composed Image Retrieval (CIR) aims to retrieve a target image from a query composed of a reference image and modification text. Recent training-free zero-shot methods often employ Multimodal Large Language Models (MLLMs) with Chain-of-Thought (CoT) to compose a target image description for retrieval. However, due to the fuzzy matching nature of ZS-CIR, the generated description is prone to semantic bias relative to the target image. We propose SDR-CIR, a training-free Semantic Debias Ranking method based on CoT reasoning. First, Selective CoT guides the MLLM to extract visual content relevant to the modification text during image understanding, thereby reducing visual noise at the source. We then introduce a Semantic Debias Ranking with two steps, Anchor and Debias, to mitigate semantic bias. In the Anchor step, we fuse reference image features with target description features to reinforce useful semantics and supplement omitted cues. In the Debias step, we explicitly model the visual semantic contribution of the reference image to the description and incorporate it into the similarity score as a penalty term. By supplementing omitted cues while suppressing redundancy, SDR-CIR mitigates semantic bias and improves retrieval performance. Experiments on three standard CIR benchmarks show that SDR-CIR achieves state-of-the-art results among one-stage methods while maintaining high efficiency. The code is publicly available at this https URL.

[IR-4] MiniRec: Data-Efficient Reinforcement Learning for LLM -based Recommendation

链接: https://arxiv.org/abs/2602.04278
作者: Lin Wang,Yang Zhang,Jingfan Chen,Xiaoyan Zhao,Fengbin Zhu,Qing Li,Tat-Seng Chua
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:The integration of reinforcement learning (RL) into large language models (LLMs) has opened new opportunities for recommender systems by eliciting reasoning and improving user preference modeling. However, RL-based LLM recommendation faces significant efficiency challenges, making full-data training costly. Existing data selection methods define sample value based on learnability or representativeness, yet their loss- or gradient-driven or dataset coverage-driven criteria often misalign with RL learning dynamics, resulting in suboptimal performance. To address this, we propose MiniRec, a data selection framework tailored for RL-based LLM recommendation. MiniRec evaluates sample learnability using key RL signals – rewards – pruning samples that are too easy (too high reward) or too difficult (consistently low reward). It assesses representativeness by aligning sample gradients with the approximated “ideal” global RL optimization trajectory, selecting samples that mainly drive model updates, and it also enforces diversity to reduce redundancy. Combined with a curriculum learning strategy from easy to hard samples, MiniRec significantly reduces training cost while largely preserving performance. Extensive experiments demonstrate MiniRec’s effectiveness, highlighting the importance of reward-aligned, trajectory-informed data selection in RL-based LLM recommendation.

[IR-5] LILaC: Late Interacting in Layered Component Graph for Open-domain Multimodal Multihop Retrieval EMNLP2025

链接: https://arxiv.org/abs/2602.04263
作者: Joohyung Yun,Doyup Lee,Wook-Shin Han
类目: Information Retrieval (cs.IR)
*备注: Project page: this https URL

点击查看摘要

Abstract:Multimodal document retrieval aims to retrieve query-relevant components from documents composed of textual, tabular, and visual elements. An effective multimodal retriever needs to handle two main challenges: (1) mitigate the effect of irrelevant contents caused by fixed, single-granular retrieval units, and (2) support multihop reasoning by effectively capturing semantic relationships among components within and across documents. To address these challenges, we propose LILaC, a multimodal retrieval framework featuring two core innovations. First, we introduce a layered component graph, explicitly representing multimodal information at two layers - each representing coarse and fine granularity - facilitating efficient yet precise reasoning. Second, we develop a late-interaction-based subgraph retrieval method, an edge-based approach that initially identifies coarse-grained nodes for efficient candidate generation, then performs fine-grained reasoning via late interaction. Extensive experiments demonstrate that LILaC achieves state-of-the-art retrieval performance on all five benchmarks, notably without additional fine-tuning. We make the artifacts publicly available at this http URL.

[IR-6] Following the TRAIL: Predicting and Explaining Tomorrows Hits with a Fine-Tuned LLM

链接: https://arxiv.org/abs/2602.04225
作者: Yinan Zhang,Zhixi Chen,Jiazheng Jing,Zhiqi Shen
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have been widely applied across multiple domains for their broad knowledge and strong reasoning capabilities. However, applying them to recommendation systems is challenging since it is hard for LLMs to extract user preferences from large, sparse user-item logs, and real-time per-user ranking over the full catalog is too time-consuming to be practical. Moreover, many existing recommender systems focus solely on ranking items while overlooking explanations, which could help improve predictive accuracy and make recommendations more convincing to users. Inspired by recent works that achieve strong recommendation performance by forecasting near-term item popularity, we propose TRAIL (TRend and explAnation Integrated Learner). TRAIL is a fine-tuned LLM that jointly predicts short-term item popularity and generates faithful natural-language explanations. It employs contrastive learning with positive and negative pairs to align its scores and explanations with structured trend signals, yielding accurate and explainable popularity predictions. Extensive experiments show that TRAIL outperforms strong baselines and produces coherent, well-grounded explanations.

[IR-7] GenMRP: A Generative Multi-Route Planning Framework for Efficient and Personalized Real-Time Industrial Navigation

链接: https://arxiv.org/abs/2602.04174
作者: Chengzhang Wang,Chao Chen,Jun Tao,Tengfei Liu,He Bai,Song Wang,Longfei Xu,Kaikui Liu,Xiangxiang Chu
类目: Robotics (cs.RO); Graphics (cs.GR); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Existing industrial-scale navigation applications contend with massive road networks, typically employing two main categories of approaches for route planning. The first relies on precomputed road costs for optimal routing and heuristic algorithms for generating alternatives, while the second, generative methods, has recently gained significant attention. However, the former struggles with personalization and route diversity, while the latter fails to meet the efficiency requirements of large-scale real-time scenarios. To address these limitations, we propose GenMRP, a generative framework for multi-route planning. To ensure generation efficiency, GenMRP first introduces a skeleton-to-capillary approach that dynamically constructs a relevant sub-network significantly smaller than the full road network. Within this sub-network, routes are generated iteratively. The first iteration identifies the optimal route, while the subsequent ones generate alternatives that balance quality and diversity using the newly proposed correctional boosting approach. Each iteration incorporates road features, user historical sequences, and previously generated routes into a Link Cost Model to update road costs, followed by route generation using the Dijkstra algorithm. Extensive experiments show that GenMRP achieves state-of-the-art performance with high efficiency in both offline and online environments. To facilitate further research, we have publicly released the training and evaluation dataset. GenMRP has been fully deployed in a real-world navigation app, demonstrating its effectiveness and benefits.

[IR-8] Nemotron ColEmbed V2: Top-Performing Late Interaction embedding models for Visual Document Retrieval

链接: https://arxiv.org/abs/2602.03992
作者: Gabriel de Souza P. Moreira,Ronay Ak,Mengyao Xu,Oliver Holworthy,Benedikt Schifferer,Zhiding Yu,Yauhen Babakhin,Radek Osmulski,Jiarui Cai,Ryan Chesler,Bo Liu,Even Oldridge
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems have been popular for generative applications, powering language models by injecting external knowledge. Companies have been trying to leverage their large catalog of documents (e.g. PDFs, presentation slides) in such RAG pipelines, whose first step is the retrieval component. Dense retrieval has been a popular approach, where embedding models are used to generate a dense representation of the user query that is closer to relevant content embeddings. More recently, VLM-based embedding models have become popular for visual document retrieval, as they preserve visual information and simplify the indexing pipeline compared to OCR text extraction. Motivated by the growing demand for visual document retrieval, we introduce Nemotron ColEmbed V2, a family of models that achieve state-of-the-art performance on the ViDoRe benchmarks. We release three variants - with 3B, 4B, and 8B parameters - based on pre-trained VLMs: NVIDIA Eagle 2 with Llama 3.2 3B backbone, Qwen3-VL-4B-Instruct and Qwen3-VL-8B-Instruct, respectively. The 8B model ranks first on the ViDoRe V3 leaderboard as of February 03, 2026, achieving an average NDCG@10 of 63.42. We describe the main techniques used across data processing, training, and post-training - such as cluster-based sampling, hard-negative mining, bidirectional attention, late interaction, and model merging - that helped us build our top-performing models. We also discuss compute and storage engineering challenges posed by the late interaction mechanism and present experiments on how to balance accuracy and storage with lower dimension embeddings. Subjects: Information Retrieval (cs.IR) Cite as: arXiv:2602.03992 [cs.IR] (or arXiv:2602.03992v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2602.03992 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

附件下载

点击下载今日全部论文列表