This post contains the latest paper listing retrieved from Arxiv.org on 2026-02-03, updated automatically and organized into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.
Note: paper data is retrieved from Arxiv.org daily and updated automatically around 12:00 each day.
Reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.
Table of Contents
Overview (2026-02-03)
A total of 1,500 papers were updated today, including:
- Natural Language Processing: 234 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 537 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 323 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 564 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] Reward-free Alignment for Conflicting Objectives
【Quick Read】: This paper tackles the training instability and poor trade-offs that arise in multi-objective alignment when objectives conflict, where weighted-loss methods may fail to find update directions that improve all objectives simultaneously, and existing approaches rely on explicit reward models that introduce bias and extra complexity. The key to its solution is RACO, a reward-free alignment framework for conflicted objectives that resolves gradient conflicts through a novel clipped variant of conflict-averse gradient descent, comes with convergence guarantees to Pareto-critical points, and shows that clipping can strictly improve the convergence rate in the two-objective setting.
Link: https://arxiv.org/abs/2602.02495
Authors: Peter Chen,Xiaopeng Li,Xi Chen,Tianyi Lin
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 27 pages
Abstract:Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many real-world alignment problems involve multiple conflicting objectives, where naive aggregation of preferences can lead to unstable training and poor trade-offs. In particular, weighted loss methods may fail to identify update directions that simultaneously improve all objectives, and existing multi-objective approaches often rely on explicit reward models, introducing additional complexity and distorting user-specified preferences. The contributions of this paper are two-fold. First, we propose a Reward-free Alignment framework for Conflicted Objectives (RACO) that directly leverages pairwise preference data and resolves gradient conflicts via a novel clipped variant of conflict-averse gradient descent. We provide convergence guarantees to Pareto-critical points that respect user-specified objective weights, and further show that clipping can strictly improve convergence rate in the two-objective setting. Second, we improve our method using some heuristics and conduct experiments to demonstrate the compatibility of the proposed framework for LLM alignment. Both qualitative and quantitative evaluations on multi-objective summarization and safety alignment tasks across multiple LLM families (Qwen 3, Llama 3, Gemma 3) show that our method consistently achieves better Pareto trade-offs compared to existing multi-objective alignment baselines.
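To make the gradient-conflict idea concrete, here is a minimal numpy sketch of a clipped, conflict-averse aggregation step. It uses a simplified PCGrad-style projection with a norm cap on the correction term; RACO's actual clipped conflict-averse update and its guarantees are defined in the paper, so treat every name and constant below as illustrative.

```python
import numpy as np

def clipped_conflict_averse_step(grads, weights, clip=1.0):
    """Toy conflict-averse aggregation with a clipped correction term.

    grads:   per-objective gradient vectors (list of np.ndarray)
    weights: user-specified objective weights
    clip:    cap on the norm of the conflict correction

    NOTE: a simplified PCGrad-style projection, not RACO's exact update.
    """
    g_avg = sum(w * g for w, g in zip(weights, grads))
    correction = np.zeros_like(g_avg)
    for g in grads:
        dot = float(g_avg @ g)
        if dot < 0:  # this objective conflicts with the weighted direction
            # project out the conflicting component
            correction -= dot / (float(g @ g) + 1e-12) * g
    norm = np.linalg.norm(correction)
    if norm > clip:  # clipping keeps the user-specified weights dominant
        correction *= clip / norm
    return g_avg + correction

g1 = np.array([1.0, 0.0])    # gradient of objective 1
g2 = np.array([-2.0, 0.1])   # gradient of objective 2 (conflicting)
print(clipped_conflict_averse_step([g1, g2], [0.5, 0.5]))
```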
[NLP-1] RLAnything: Forge Environment, Policy and Reward Model in Completely Dynamic RL System
【Quick Read】: This paper targets weak reinforcement learning (RL) signals, low training efficiency, and dependence on human-labeled rewards in large language model (LLM) and agentic scenarios. Its core solution is the RLAnything framework, which dynamically forges the environment, policy, and reward model through closed-loop optimization so that the three strengthen one another: the policy is trained on integrated step-wise and outcome feedback, while the reward model is jointly optimized via consistency feedback, which in turn further improves policy training. A theory-motivated automatic environment adaptation mechanism additionally leverages critic feedback to enable learning from experience, markedly improving training stability and performance. Experiments show substantial gains across a range of LLM and agentic tasks, with optimized reward signals outperforming outcome signals that rely on human labels.
Link: https://arxiv.org/abs/2602.02488
Authors: Yinjie Wang,Tianbao Xie,Ke Shen,Mengdi Wang,Ling Yang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Code: this https URL
Abstract:We propose RLAnything, a reinforcement learning framework that dynamically forges environment, policy, and reward models through closed-loop optimization, amplifying learning signals and strengthening the overall RL system for any LLM or agentic scenario. Specifically, the policy is trained with integrated feedback from step-wise and outcome signals, while the reward model is jointly optimized via consistency feedback, which in turn further improves policy training. Moreover, our theory-motivated automatic environment adaptation improves training for both the reward and policy models by leveraging critic feedback from each, enabling learning from experience. Empirically, each added component consistently improves the overall system, and RLAnything yields substantial gains across various representative LLM and agentic tasks, boosting Qwen3-VL-8B-Thinking by 9.1% on OSWorld and Qwen2.5-7B-Instruct by 18.7% and 11.9% on AlfWorld and LiveBench, respectively. We also show that optimized reward-model signals outperform outcomes that rely on human labels. Code: this https URL
[NLP-2] RE-TRAC: REcursive TRAjectory Compression for Deep Search Agents
【Quick Read】: This paper addresses the limitations of the ReAct framework underlying LLM-based deep research agents: difficulty revisiting earlier states, no way to branch into alternative search directions, and a lack of global awareness under long contexts, which lead to local optima, redundant exploration, and inefficient search. The key to its solution, the Re-TRAC framework, is to generate a structured state representation after each trajectory that summarizes evidence, uncertainties, failure causes, and future plans, and to condition subsequent trajectories on that state, enabling cross-trajectory exploration and iterative reflection that reframes research as a progressive, globally informed optimization process.
Link: https://arxiv.org/abs/2602.02486
Authors: Jialiang Zhu,Gongrui Zhang,Xiaolong Ma,Lin Xu,Miaosen Zhang,Ruiqi Yang,Song Wang,Kai Qiu,Zhirong Wu,Qi Dai,Ruichun Ma,Bei Liu,Yifan Yang,Chong Luo,Zhengyuan Yang,Linjie Li,Lijuan Wang,Weizhu Chen,Xin Geng,Baining Guo
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:LLM-based deep research agents are largely built on the ReAct framework. This linear design makes it difficult to revisit earlier states, branch into alternative search directions, or maintain global awareness under long contexts, often leading to local optima, redundant exploration, and inefficient search. We propose Re-TRAC, an agentic framework that performs cross-trajectory exploration by generating a structured state representation after each trajectory to summarize evidence, uncertainties, failures, and future plans, and conditioning subsequent trajectories on this state representation. This enables iterative reflection and globally informed planning, reframing research as a progressive process. Empirical results show that Re-TRAC consistently outperforms ReAct by 15-20% on BrowseComp with frontier LLMs. For smaller models, we introduce Re-TRAC-aware supervised fine-tuning, achieving state-of-the-art performance at comparable scales. Notably, Re-TRAC shows a monotonic reduction in tool calls and token usage across rounds, indicating progressively targeted exploration driven by cross-trajectory reflection rather than redundant search.
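A minimal sketch of the cross-trajectory loop described above, with a stubbed `llm` client and hypothetical prompts; Re-TRAC's actual state schema and prompting are not reproduced here.

```python
# Hypothetical client and prompts; the paper's state schema differs.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def compress_trajectory(trajectory: str) -> str:
    """Distill a finished trajectory into a structured state."""
    return llm(
        "Summarize this research trajectory as JSON with keys "
        "evidence, uncertainties, failures, next_plans:\n" + trajectory
    )

def re_trac(question: str, rounds: int = 3) -> str:
    state, answer = "", ""
    for _ in range(rounds):
        trajectory = llm(
            f"Question: {question}\nState from previous rounds: {state}\n"
            "Run a ReAct-style search and finish with 'ANSWER: ...'"
        )
        state = compress_trajectory(trajectory)  # conditions the next round
        answer = trajectory.split("ANSWER:")[-1].strip()
    return answer
```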
[NLP-3] Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability
【Quick Read】: This paper addresses the reasoning ceiling and limited test-time scalability imposed by the strictly sequential nature of step-by-step chain-of-thought (CoT) reasoning when large language models (LLMs) face complex tasks. The key to its solution is an end-to-end reinforcement learning (RL) framework for divide-and-conquer (DAC) reasoning: at each step the policy both decomposes the problem into subproblems and solves them, then answers the original problem conditioned on the subproblem solutions, so decomposition and solving are trained and optimized jointly. This substantially raises the performance ceiling and test-time scalability on competition-level benchmarks.
Link: https://arxiv.org/abs/2602.02477
Authors: Xiao Liang,Zhong-Zhi Li,Zhenghao Lin,Eric Hancheng Jiang,Hengyuan Zhang,Yelong Shen,Kai-Wei Chang,Ying Nian Wu,Yeyun Gong,Weizhu Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) have demonstrated strong reasoning capabilities through step-by-step chain-of-thought (CoT) reasoning. Nevertheless, at the limits of model capability, CoT often proves insufficient, and its strictly sequential nature constrains test-time scalability. A potential alternative is divide-and-conquer (DAC) reasoning, which decomposes a complex problem into subproblems to facilitate more effective exploration of the solution. Although promising, our analysis reveals a fundamental misalignment between general-purpose post-training and DAC-style inference, which limits the model’s capacity to fully leverage this potential. To bridge this gap and fully unlock LLMs’ reasoning capabilities on the most challenging tasks, we propose an end-to-end reinforcement learning (RL) framework to enhance their DAC-style reasoning capacity. At each step, the policy decomposes a problem into a group of subproblems, solves them sequentially, and addresses the original one conditioned on the subproblem solutions, with both decomposition and solution integrated into RL training. Under comparable training, our DAC-style framework endows the model with a higher performance ceiling and stronger test-time scalability, surpassing CoT by 8.6% in Pass@1 and 6.3% in Pass@32 on competition-level benchmarks.
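The decompose-solve-conquer step can be sketched as plain prompting. In the paper both stages live inside a single policy trained end-to-end with RL; the prompts and the stubbed `llm` client below are illustrative assumptions only.

```python
def llm(prompt: str) -> str:
    raise NotImplementedError("policy model client goes here")

def dac_solve(problem: str) -> str:
    """One divide-and-conquer pass: decompose, solve subproblems in
    order, then answer the original problem conditioned on the
    sub-solutions (a prompting sketch, not the trained RL policy)."""
    subs = llm(f"Decompose into subproblems, one per line:\n{problem}").splitlines()
    solved: list[tuple[str, str]] = []
    for sub in subs:
        known = "\n".join(f"{q} -> {a}" for q, a in solved)
        solved.append((sub, llm(f"Known results:\n{known}\nSolve: {sub}")))
    known = "\n".join(f"{q} -> {a}" for q, a in solved)
    return llm(f"Subproblem solutions:\n{known}\nNow solve: {problem}")
```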
[NLP-4] MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents
【Quick Read】: This paper addresses the rigidity and inefficiency of current large language model (LLM) agent memory systems, which rely on a small set of static, hand-designed operations for extracting memory, making them inflexible under diverse interaction patterns and inefficient on long histories. The key to its solution, MemSkill, is to reframe memory operations as learnable and evolvable "memory skills": structured, reusable routines for extracting, consolidating, and pruning information from interaction traces. MemSkill learns a controller that selects a small set of relevant skills, paired with an LLM-based executor that produces skill-guided memories, and introduces a designer that periodically reviews hard cases where selected skills yield incorrect or incomplete memories and evolves the skill set by proposing refinements and new skills. This closed loop jointly improves the skill-selection policy and the skill set itself, significantly boosting task performance and generalization across settings.
Link: https://arxiv.org/abs/2602.02474
Authors: Haozhen Zhang,Quanyu Long,Jianzhu Bao,Tao Feng,Weizhi Zhang,Haodong Yue,Wenya Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code is available at this https URL
Abstract:Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand-designed operations for extracting memory. These fixed procedures hard-code human priors about what to store and how to revise memory, making them rigid under diverse interaction patterns and inefficient on long histories. To this end, we present MemSkill, which reframes these operations as learnable and evolvable memory skills: structured and reusable routines for extracting, consolidating, and pruning information from interaction traces. Inspired by the design philosophy of agent skills, MemSkill employs a controller that learns to select a small set of relevant skills, paired with an LLM-based executor that produces skill-guided memories. Beyond learning skill selection, MemSkill introduces a designer that periodically reviews hard cases where selected skills yield incorrect or incomplete memories, and evolves the skill set by proposing refinements and new skills. Together, MemSkill forms a closed-loop procedure that improves both the skill-selection policy and the skill set itself. Experiments on LoCoMo, LongMemEval, HotpotQA, and ALFWorld demonstrate that MemSkill improves task performance over strong baselines and generalizes well across settings. Further analyses shed light on how skills evolve, offering insights toward more adaptive, self-evolving memory management for LLM agents.
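The closed loop reads naturally as three cooperating components. Below is a schematic sketch with stubbed calls; the selection policy, prompts, and review trigger are all assumptions, not MemSkill's actual implementation.

```python
def controller_select(skills: list[str], trace: str) -> list[str]:
    raise NotImplementedError("learned skill-selection policy goes here")

def executor_apply(selected: list[str], trace: str) -> list[str]:
    raise NotImplementedError("LLM produces skill-guided memory entries")

def designer_evolve(skills: list[str], hard_cases: list[str]) -> list[str]:
    raise NotImplementedError("LLM reviews failures, proposes refined/new skills")

def memskill_update(skills, trace, hard_cases, step, review_every=100):
    """One pass of a MemSkill-style loop over an interaction trace."""
    selected = controller_select(skills, trace)   # pick a few relevant skills
    memories = executor_apply(selected, trace)    # extract / consolidate / prune
    if step % review_every == 0 and hard_cases:   # periodic skill evolution
        skills = designer_evolve(skills, hard_cases)
    return memories, skills
```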
[NLP-5] SPARKLING: Balancing Signal Preservation and Symmetry Breaking for Width-Progressive Learning
【Quick Read】: This paper addresses the training instabilities of width expansion during the mid-stage of pre-training, a key bottleneck for compute-efficient progressive learning: existing methods are limited to the early stage, while mid-stage expansion suffers loss spikes from disrupted activation statistics and insufficient feature diversity from gradient symmetry. The key to its solution, the SPARKLING framework, is to preserve signal propagation via RMS-scale consistency, stabilizing activation statistics, while breaking gradient symmetry through asymmetric optimizer-state resetting and learning-rate re-warmup to promote feature diversity. Experiments on Mixture-of-Experts (MoE) models show stable width expansion across multiple axes, reducing training cost by up to 35% compared with training from scratch.
Link: https://arxiv.org/abs/2602.02472
Authors: Qifan Yu,Xinyu Ma,Zhijian Zhuo,Minrui Wang,Deyi Liu,Shiyi Zhan,Yiyuan Ma,Liang Xiang,Xingyan Bin,Di He
Affiliations: Peking University; ByteDance
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Progressive Learning (PL) reduces pre-training computational overhead by gradually increasing model scale. While prior work has extensively explored depth expansion, width expansion remains significantly understudied, with the few existing methods limited to the early stages of training. However, expanding width during the mid-stage is essential for maximizing computational savings, yet it remains a formidable challenge due to severe training instabilities. Empirically, we show that naive initialization at this stage disrupts activation statistics, triggering loss spikes, while copy-based initialization introduces gradient symmetry that hinders feature diversity. To address these issues, we propose SPARKLING (balancing Signal Preservation And symmetRy breaKing for width-progressive LearnING), a novel framework for mid-stage width expansion. Our method achieves signal preservation via RMS-scale consistency, stabilizing activation statistics during expansion. Symmetry breaking is ensured through asymmetric optimizer state resetting and learning rate re-warmup. Extensive experiments on Mixture-of-Experts (MoE) models demonstrate that, across multiple width axes and optimizer families, SPARKLING consistently outperforms training from scratch and reduces training cost by up to 35% under 2× width expansion.
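The two stated principles, keeping the RMS scale of the weights unchanged and avoiding exact copies that receive identical gradients, can be illustrated on a single linear layer. This is a toy sketch only; SPARKLING's full recipe (MoE width axes, optimizer-state resets, learning-rate re-warmup) is not reproduced.

```python
import torch
import torch.nn as nn

def expand_linear_width(layer: nn.Linear, new_out: int,
                        noise_std: float = 1e-3) -> nn.Linear:
    """Widen a Linear layer's output dimension while (a) matching the RMS
    scale of the existing weights and (b) breaking symmetry so new rows do
    not receive identical gradients."""
    old_out, d_in = layer.weight.shape
    assert new_out >= old_out
    new = nn.Linear(d_in, new_out, bias=layer.bias is not None)
    with torch.no_grad():
        new.weight[:old_out] = layer.weight          # keep existing units
        rms = layer.weight.pow(2).mean().sqrt()      # match the RMS scale
        new.weight[old_out:] = torch.randn(new_out - old_out, d_in) * rms
        new.weight += noise_std * torch.randn_like(new.weight)  # break symmetry
        if layer.bias is not None:
            new.bias[:old_out] = layer.bias
            new.bias[old_out:] = 0.0
    return new

print(expand_linear_width(nn.Linear(64, 128), new_out=192).weight.shape)
# torch.Size([192, 64])
```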
[NLP-6] Avenir-Web: Human-Experience-Imitating Multimodal Web Agents with Mixture of Grounding Experts
【Quick Read】: This paper addresses three core problems that autonomous web agents face when executing long-horizon tasks on complex, dynamic web interfaces: inaccurate element grounding, missing site-specific procedural knowledge, and unstable long-term task tracking and memory, especially over complex Document Object Model (DOM) structures. The key to its solution, Avenir-Web, is to combine three components: a Mixture of Grounding Experts for more accurate element grounding, Experience-Imitation Planning to inject procedural priors into task reasoning, and a task-tracking checklist with adaptive memory for robust, seamless interaction across diverse user-interface paradigms. Experiments show Avenir-Web sets a new open-source state of the art on the Online-Mind2Web benchmark and approaches top proprietary models.
Link: https://arxiv.org/abs/2602.02468
Authors: Aiden Yiliu Li,Xinyue Hao,Shilong Liu,Mengdi Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Despite advances in multimodal large language models, autonomous web agents still struggle to reliably execute long-horizon tasks on complex and dynamic web interfaces. Existing agents often suffer from inaccurate element grounding, the absence of site-specific procedural knowledge, and unstable long-term task tracking and memory, particularly when operating over complex Document Object Model structures. To address these limitations, we introduce Avenir-Web, a web agent that achieves a new open-source state of the art on the Online-Mind2Web benchmark in real-world deployment. Avenir-Web leverages a Mixture of Grounding Experts, Experience-Imitation Planning for incorporating procedural priors, and a task-tracking checklist combined with adaptive memory to enable robust and seamless interaction across diverse user interface paradigms. We evaluate Avenir-Web on Online-Mind2Web, a rigorous benchmark of live and user-centered web tasks. Our results demonstrate that Avenir-Web significantly surpasses prior open-source agents and attains performance parity with top-tier proprietary models, thereby establishing a new open-source state of the art for reliable web agents on live websites.
[NLP-7] Indications of Belief-Guided Agency and Meta-Cognitive Monitoring in Large Language Models
【Quick Read】: This paper asks whether large language models (LLMs) possess some form of consciousness, specifically whether they exhibit agency guided by belief formation and meta-cognitive monitoring. Building on neuroscience-based indicators of consciousness in artificial systems, the authors evaluate the key indicator HOT-3, which tests for agency guided by a general belief-formation and action-selection mechanism that updates beliefs based on meta-cognitive monitoring. The key to the approach is to view beliefs as representations that emerge in the model's latent space in response to an input, to introduce a metric quantifying belief dominance, and to analyze the dynamics between competing beliefs across models and tasks. The findings show that external manipulations systematically modulate internal belief formation, that belief formation causally drives the model's action selection, and that models can monitor and report their own belief states, providing empirical support for belief-guided agency and meta-cognitive monitoring in LLMs.
Link: https://arxiv.org/abs/2602.02467
Authors: Noam Steinmetz Yalon,Ariel Goldstein,Liad Mudrik,Mor Geva
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Rapid advancements in large language models (LLMs) have sparked the question whether these models possess some form of consciousness. To tackle this challenge, Butlin et al. (2023) introduced a list of indicators for consciousness in artificial systems based on neuroscientific theories. In this work, we evaluate a key indicator from this list, called HOT-3, which tests for agency guided by a general belief-formation and action selection system that updates beliefs based on meta-cognitive monitoring. We view beliefs as representations in the model’s latent space that emerge in response to a given input, and introduce a metric to quantify their dominance during generation. Analyzing the dynamics between competing beliefs across models and tasks reveals three key findings: (1) external manipulations systematically modulate internal belief formation, (2) belief formation causally drives the model’s action selection, and (3) models can monitor and report their own belief states. Together, these results provide empirical support for the existence of belief-guided agency and meta-cognitive monitoring in LLMs. More broadly, our work lays methodological groundwork for investigating the emergence of agency, beliefs, and meta-cognition in LLMs.
[NLP-8] From Directions to Regions: Decomposing Activations in Language Models via Local Geometry
【Quick Read】: This paper addresses the overly simple structural assumptions of existing activation-decomposition methods for language models, in particular their reliance on searching for individual global linear directions, which fails to capture concepts with nonlinear or multi-dimensional structure. The key to its solution is a scalable unsupervised model, the Mixture of Factor Analyzers (MFA), which models the activation space as a collection of Gaussian regions with local covariance structure, decomposing activations into two compositional geometric objects: the region's centroid in activation space and the local variation relative to that centroid. By explicitly modeling local geometry, the method captures complex concepts more faithfully; on localization and steering benchmarks it outperforms unsupervised baselines, is competitive with supervised localization methods, and often achieves stronger steering performance than sparse autoencoders, positioning local geometry expressed through subspaces as a scalable unit for concept discovery and model control.
Link: https://arxiv.org/abs/2602.02464
Authors: Or Shafran,Shaked Ronen,Omri Fahn,Shauli Ravfogel,Atticus Geiger,Mor Geva
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Activation decomposition methods in language models are tightly coupled to geometric assumptions on how concepts are realized in activation space. Existing approaches search for individual global directions, implicitly assuming linear separability, which overlooks concepts with nonlinear or multi-dimensional structure. In this work, we leverage Mixture of Factor Analyzers (MFA) as a scalable, unsupervised alternative that models the activation space as a collection of Gaussian regions with their local covariance structure. MFA decomposes activations into two compositional geometric objects: the region’s centroid in activation space, and the local variation from the centroid. We train large-scale MFAs for Llama-3.1-8B and Gemma-2-2B, and show they capture complex, nonlinear structures in activation space. Moreover, evaluations on localization and steering benchmarks show that MFA outperforms unsupervised baselines, is competitive with supervised localization methods, and often achieves stronger steering performance than sparse autoencoders. Together, our findings position local geometry, expressed through subspaces, as a promising unit of analysis for scalable concept discovery and model control, accounting for complex structures that isolated directions fail to capture.
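As a rough illustration of the centroid-plus-local-variation decomposition, one can approximate an MFA in two stages with scikit-learn: a Gaussian mixture supplies the regions and centroids, and a factor model fit inside each region captures the local variation. The paper fits a proper MFA jointly over large-scale activations; this staged version on toy data is only a sketch.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import FactorAnalysis

# Toy stand-in for residual-stream activations (n_tokens x hidden_dim)
rng = np.random.default_rng(0)
acts = np.concatenate([
    rng.normal(loc=-3, size=(500, 16)),
    rng.normal(loc=+3, size=(500, 16)),
])

# Stage 1: regions as Gaussian components (their means are the centroids)
gmm = GaussianMixture(n_components=2, random_state=0).fit(acts)
regions = gmm.predict(acts)

# Stage 2: a low-rank factor model of the local variation inside each region
factor_models = {}
for k in range(gmm.n_components):
    fa = FactorAnalysis(n_components=4, random_state=0)
    fa.fit(acts[regions == k] - gmm.means_[k])
    factor_models[k] = fa

# Decompose one activation: centroid + local variation along learned factors
x = acts[0]
k = gmm.predict(x[None])[0]
local = factor_models[k].transform((x - gmm.means_[k])[None])
print("region", k, "local factor coordinates:", local.round(2))
```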
[NLP-9] Abstract Activation Spaces for Content-Invariant Reasoning in Large Language Models
【Quick Read】: This paper addresses the systematic bias in syllogistic reasoning where large language models (LLMs) conflate semantic plausibility with formal validity (the "content effect"), a bias that persists even when models generate step-wise explanations, indicating that intermediate rationales inherit the same semantic shortcuts. The key to its solution is an abstraction-guided reasoning framework that explicitly separates structural inference from lexical semantics: it builds an abstract reasoning space, trains lightweight Abstractors that predict representations aligned with this space from content-conditioned residual-stream states, and integrates these predictions via multi-layer interventions during the forward pass, improving the robustness of formal reasoning and reducing errors driven by semantic interference.
Link: https://arxiv.org/abs/2602.02462
Authors: Gabriele Maraia,Marco Valentino,Fabio Massimo Zanzotto,Leonardo Ranaldi
Affiliations: Human Centric ART, University of Rome Tor Vergata; ILCC, School of Informatics, University of Edinburgh; School of Computer Science, University of Sheffield; Almawave S.p.A.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) often struggle with deductive judgment in syllogistic reasoning, systematically conflating semantic plausibility with formal validity, a phenomenon known as the content effect. This bias persists even when models generate step-wise explanations, indicating that intermediate rationales may inherit the same semantic shortcuts that affect answers. Recent approaches propose mitigating this issue by increasing inference-time structural constraints, either by encouraging abstract intermediate representations or by intervening directly in the model’s internal computations; however, reliably suppressing semantic interference remains an open challenge. To make formal deduction less sensitive to semantic content, we introduce a framework for abstraction-guided reasoning that explicitly separates structural inference from lexical semantics. We construct paired content-laden and abstract syllogisms and use the model’s activations on abstract inputs to define an abstract reasoning space. We then learn lightweight Abstractors that, from content-conditioned residual-stream states, predict representations aligned with this space and integrate these predictions via multi-layer interventions during the forward pass. Using cross-lingual transfer as a test bed, we show that abstraction-aligned steering reduces content-driven errors and improves validity-sensitive performance. Our results position activation-level abstraction as a scalable mechanism for enhancing the robustness of formal reasoning in LLMs against semantic interference.
[NLP-10] Drift-Bench: Diagnosing Cooperative Breakdowns in LLM Agents under Input Faults via Multi-Turn Interaction
【Quick Read】: This paper addresses the execution risks that arise as large language models (LLMs) transition to autonomous agents, where user inputs frequently violate cooperative assumptions (e.g., implicit intent, missing parameters, false presuppositions, or ambiguous expressions). Existing benchmarks assume well-specified instructions or restrict evaluation to single-turn, text-only clarification, and thus cannot measure multi-turn disambiguation under grounded execution risk. The key to its solution is Drift-Bench, the first diagnostic benchmark of its kind: grounded in classical theories of communication, it provides a unified taxonomy of cooperative breakdowns and employs a persona-driven user simulator with the Rise evaluation protocol to systematically evaluate agents' pragmatic adaptation under multi-turn clarification across state-oriented and service-oriented execution environments, enabling structured diagnosis and quantification of potentially unsafe executions.
Link: https://arxiv.org/abs/2602.02455
Authors: Han Bao,Zheyuan Zhang,Pengcheng Jing,Zhengqing Yuan,Kaiwen Shi,Yanfang Ye
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: 65 pages, 40 figures
Abstract:As Large Language Models transition to autonomous agents, user inputs frequently violate cooperative assumptions (e.g., implicit intent, missing parameters, false presuppositions, or ambiguous expressions), creating execution risks that text-only evaluations do not capture. Existing benchmarks typically assume well-specified instructions or restrict evaluation to text-only, single-turn clarification, and thus do not measure multi-turn disambiguation under grounded execution risk. We introduce Drift-Bench, the first diagnostic benchmark that evaluates agentic pragmatics under input faults through multi-turn clarification across state-oriented and service-oriented execution environments. Grounded in classical theories of communication, Drift-Bench provides a unified taxonomy of cooperative breakdowns and employs a persona-driven user simulator with the Rise evaluation protocol. Experiments show substantial performance drops under these faults, with clarification effectiveness varying across user personas and fault types. Drift-Bench bridges clarification research and agent safety evaluation, enabling systematic diagnosis of failures that can lead to unsafe executions.
[NLP-11] Large Language Models for Mental Health: A Multilingual Evaluation
【Quick Read】: This paper addresses the open question of how well large language models (LLMs) perform on multilingual mental-health tasks, particularly their generalization across languages and machine-translated (MT) data. The key to its approach is a systematic comparison of proprietary and open-source LLMs in zero-shot, few-shot, and fine-tuned settings on eight mental-health datasets in different languages and their machine-translated counterparts, against conventional NLP baselines, together with an analysis of how translation quality affects performance, revealing both the strengths and the limitations of LLMs for cross-lingual mental-health applications.
Link: https://arxiv.org/abs/2602.02440
Authors: Nishat Raihan,Sadiya Sayara Chowdhury Puspo,Ana-Maria Bucur,Stevie Chancellor,Marcos Zampieri
Affiliations: George Mason University, USA; University of Bucharest, Romania; Universitat Politècnica de València, Spain; University of Minnesota, USA
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) have remarkable capabilities across NLP tasks. However, their performance in multilingual contexts, especially within the mental health domain, has not been thoroughly explored. In this paper, we evaluate proprietary and open-source LLMs on eight mental health datasets in various languages, as well as their machine-translated (MT) counterparts. We compare LLM performance in zero-shot, few-shot, and fine-tuned settings against conventional NLP baselines that do not employ LLMs. In addition, we assess translation quality across language families and typologies to understand its influence on LLM performance. Proprietary LLMs and fine-tuned open-source LLMs achieve competitive F1 scores on several datasets, often surpassing state-of-the-art results. However, performance on MT data is generally lower, and the extent of this decline varies by language and typology. This variation highlights both the strengths of LLMs in handling mental health tasks in languages other than English and their limitations when translation quality introduces structural or lexical mismatches.
[NLP-12] Misconception Diagnosis From Student-Tutor Dialogue: Generate, Retrieve, Rerank
【Quick Read】: This paper addresses the inefficiency of identifying student misconceptions in education, a task that traditionally depends heavily on teacher effort and intuition, making timely, accurate diagnosis difficult. The key to its solution is a two-stage automatic detection pipeline built on large language models (LLMs): a fine-tuned LLM first generates plausible misconceptions, candidates are then retrieved by embedding similarity with the input dialogue, and a second fine-tuned LLM assesses and re-ranks the candidates for relevance, improving the accuracy and practicality of misconception identification. Experiments on real tutoring dialogues show the approach outperforms baselines, and that fine-tuning improves generation quality enough to beat larger closed-source models.
Link: https://arxiv.org/abs/2602.02414
Authors: Joshua Mitton,Prarthana Bhattacharyya,Digory Smith,Thomas Christie,Ralph Abboud,Simon Woodhead
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 21 pages, 8 figures, 8 tables. Joshua Mitton and Prarthana Bhattacharyya contributed equally to this paper
Abstract:Timely and accurate identification of student misconceptions is key to improving learning outcomes and pre-empting the compounding of student errors. However, this task is highly dependent on the effort and intuition of the teacher. In this work, we present a novel approach for detecting misconceptions from student-tutor dialogues using large language models (LLMs). First, we use a fine-tuned LLM to generate plausible misconceptions, and then retrieve the most promising candidates among these using embedding similarity with the input dialogue. These candidates are then assessed and re-ranked by another fine-tuned LLM to improve misconception relevance. Empirically, we evaluate our system on real dialogues from an educational tutoring platform. We consider multiple base LLM models including LLaMA, Qwen and Claude on zero-shot and fine-tuned settings. We find that our approach improves predictive performance over baseline models and that fine-tuning improves both generated misconception quality and can outperform larger closed-source models. Finally, we conduct ablation studies to both validate the importance of our generation and reranking steps on misconception generation quality.
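The generate-retrieve-rerank flow can be written down in a few lines. Everything below is a hypothetical skeleton: the two fine-tuned models and the embedding model are stubs to be swapped in, and the candidate counts are arbitrary.

```python
import numpy as np

# Stubs for the three learned components; the paper fine-tunes LLaMA/Qwen
# variants for generation and reranking, which are not reproduced here.
def generate_misconceptions(dialogue: str, n: int = 20) -> list[str]:
    raise NotImplementedError("fine-tuned generator LLM goes here")

def rerank_score(dialogue: str, misconception: str) -> float:
    raise NotImplementedError("fine-tuned reranker LLM scores relevance")

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("sentence-embedding model goes here")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def diagnose(dialogue: str, k_retrieve: int = 5, k_final: int = 3) -> list[str]:
    candidates = generate_misconceptions(dialogue)        # 1) generate
    d = embed(dialogue)
    by_sim = sorted(candidates, key=lambda c: cosine(d, embed(c)), reverse=True)
    shortlist = by_sim[:k_retrieve]                       # 2) retrieve
    return sorted(shortlist,                              # 3) rerank
                  key=lambda c: rerank_score(dialogue, c),
                  reverse=True)[:k_final]
```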
[NLP-13] ROG: Retrieval-Augmented LLM Reasoning for Complex First-Order Queries over Knowledge Graphs
【Quick Read】: This paper addresses the difficulty of answering first-order logic (FOL) queries over incomplete knowledge graphs (KGs), especially query structures that compose projection, intersection, union, and negation. The key to its solution, the retrieval-augmented framework ROG, is to combine query-aware neighborhood retrieval with large language model (LLM) chain-of-thought reasoning: a multi-operator query is decomposed into a sequence of single-operator sub-queries, each grounded in compact, query-relevant neighborhood evidence. Intermediate answer sets are cached and reused across steps, improving consistency on deep reasoning chains and reducing error accumulation, with the largest gains on high-complexity and negation-heavy queries over embedding-based logical reasoning methods.
Link: https://arxiv.org/abs/2602.02382
Authors: Ziyan Zhang,Chao Wang,Zhuo Chen,Chiyi Li,Kai Song
Affiliations: Chongqing Jiaotong University; State Grid Chongqing Electric Power Company
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Answering first-order logic (FOL) queries over incomplete knowledge graphs (KGs) is difficult, especially for complex query structures that compose projection, intersection, union, and negation. We propose ROG, a retrieval-augmented framework that combines query-aware neighborhood retrieval with large language model (LLM) chain-of-thought reasoning. ROG decomposes a multi-operator query into a sequence of single-operator sub-queries and grounds each step in compact, query-relevant neighborhood evidence. Intermediate answer sets are cached and reused across steps, improving consistency on deep reasoning chains. This design reduces compounding errors and yields more robust inference on complex and negation-heavy queries. Overall, ROG provides a practical alternative to embedding-based logical reasoning by replacing learned operators with retrieval-grounded, step-wise inference. Experiments on standard KG reasoning benchmarks show consistent gains over strong embedding-based baselines, with the largest improvements on high-complexity and negation-heavy query types.
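The step-wise composition with cached intermediate answer sets can be sketched as a tiny executor over single-operator steps. In ROG each step is grounded in retrieved neighborhood evidence and resolved by LLM chain-of-thought; the sketch below shows only the set-composition and caching skeleton, over a toy triple store.

```python
def run_query(kg, plan):
    """Execute a decomposed FOL query as single-operator steps.

    kg:   set of (head, relation, tail) triples
    plan: list of (op, args) tuples, where later steps reference
          earlier cached results by name ("step0", "step1", ...)
    """
    cache = {}
    for i, (op, args) in enumerate(plan):
        if op == "project":                      # anchors --relation--> ?
            relation, anchors = args
            answers = {t for (h, r, t) in kg if r == relation and h in anchors}
        elif op == "intersection":
            answers = cache[args[0]] & cache[args[1]]
        elif op == "union":
            answers = cache[args[0]] | cache[args[1]]
        elif op == "negation":                   # complement over all entities
            entities = {h for (h, _, _) in kg} | {t for (_, _, t) in kg}
            answers = entities - cache[args]
        cache[f"step{i}"] = answers              # cached for reuse downstream
    return cache[f"step{len(plan) - 1}"]

kg = {("A", "friend", "B"), ("A", "friend", "C"), ("B", "likes", "C")}
print(run_query(kg, [("project", ("friend", {"A"}))]))  # {'B', 'C'} (set order varies)
```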
[NLP-14] From Sycophancy to Sensemaking: Premise Governance for Human-AI Decision Making
【Quick Read】: This paper addresses the "fluent agreement" trap that emerges as large language models (LLMs) move from assistance to decision support: outputs that sound plausible without calibrated judgment can mislead decisions, and in deep-uncertainty settings this sycophancy amplifies poor commitments faster than experts can correct them. The key to its solution is a shift from answer generation to collaborative premise governance over a knowledge substrate: a discrepancy-driven control loop detects and localizes typed discrepancies (teleological, epistemic, procedural) and triggers bounded negotiation, while commitment gating and value-gated challenge allocation anchor trust in auditable premises and evidence standards rather than conversational fluency.
Link: https://arxiv.org/abs/2602.02378
Authors: Raunak Jain,Mudita Khurana,John Stephens,Srinivas Dharmasanam,Shankar Venkataraman
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:As LLMs expand from assistance to decision support, a dangerous pattern emerges: fluent agreement without calibrated judgment. Low-friction assistants can become sycophantic, baking in implicit assumptions and pushing verification costs onto experts, while outcomes arrive too late to serve as reward signals. In deep-uncertainty decisions (where objectives are contested and reversals are costly), scaling fluent agreement amplifies poor commitments faster than it builds expertise. We argue reliable human-AI partnership requires a shift from answer generation to collaborative premise governance over a knowledge substrate, negotiating only what is decision-critical. A discrepancy-driven control loop operates over this substrate: detecting conflicts, localizing misalignment via typed discrepancies (teleological, epistemic, procedural), and triggering bounded negotiation through decision slices. Commitment gating blocks action on uncommitted load-bearing premises unless overridden under logged risk; value-gated challenge allocates probing under interaction cost. Trust then attaches to auditable premises and evidence standards, not conversational fluency. We illustrate with tutoring and propose falsifiable evaluation criteria.
[NLP-15] Proof-RM: A Scalable and Generalizable Reward Model for Math Proof
【Quick Read】: This paper addresses the lack of reliable automatic verification for proof-based mathematical problems, where existing reinforcement learning with verifiable rewards (RLVR) cannot evaluate full proofs by simple answer matching. The key to its solution is a scalable data-construction pipeline that leverages large language models (LLMs) to produce high-quality "question-proof-check" triplets at minimal human cost, with hierarchical human review ensuring label alignment; on these data it trains a proof-checking reward model (RM), adding a process reward term and token-weight balancing to stabilize RL training, enabling effective automatic verification and guidance for complex mathematical proofs.
Link: https://arxiv.org/abs/2602.02377
Authors: Haotong Yang,Zitong Wang,Shijia Kang,Siqi Yang,Wenkai Yu,Xu Niu,Yike Sun,Yi Hu,Zhouchen Lin,Muhan Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Under review
Abstract:While Large Language Models (LLMs) have demonstrated strong math reasoning abilities through Reinforcement Learning with Verifiable Rewards (RLVR), many advanced mathematical problems are proof-based, with no guaranteed way to determine the authenticity of a proof by simple answer matching. To enable automatic verification, a Reward Model (RM) capable of reliably evaluating full proof processes is required. In this work, we design a scalable data-construction pipeline that, with minimal human effort, leverages LLMs to generate a large quantity of high-quality “question-proof-check” triplet data. By systematically varying problem sources, generation methods, and model configurations, we create diverse problem-proof pairs spanning multiple difficulty levels, linguistic styles, and error types, subsequently filtered through hierarchical human review for label alignment. Utilizing these data, we train a proof-checking RM, incorporating additional process reward and token weight balance to stabilize the RL process. Our experiments validate the model’s scalability and strong performance from multiple perspectives, including reward accuracy, generalization ability and test-time guidance, providing important practical recipes and tools for strengthening LLM mathematical capabilities.
[NLP-16] Automated Multiple Mini Interview (MMI) Scoring
【Quick Read】: This paper addresses the inconsistency and bias of automated assessment of soft skills (such as empathy, ethical judgment, and communication) in competitive selection, where existing rationale-based fine-tuning methods struggle with the highly abstract, context-dependent nature of Multiple Mini-Interviews (MMIs) and miss implicit signals in candidate narratives. The key to its solution is a multi-agent prompting framework that decomposes evaluation into transcript refinement and criterion-specific scoring, using 3-shot in-context learning with a large instruct-tuned model; without any additional training it markedly improves reliability (average QWK 0.62 vs. 0.32 for specialized fine-tuned baselines) and generalizes well across tasks.
Link: https://arxiv.org/abs/2602.02360
Authors: Ryan Huynh,Frank Guerin,Alison Callwood
Affiliations: University of Surrey
Categories: Computation and Language (cs.CL)
Comments: 18 pages, 2 figures
Abstract:Assessing soft skills such as empathy, ethical judgment, and communication is essential in competitive selection processes, yet human scoring is often inconsistent and biased. While Large Language Models (LLMs) have improved Automated Essay Scoring (AES), we show that state-of-the-art rationale-based fine-tuning methods struggle with the abstract, context-dependent nature of Multiple Mini-Interviews (MMIs), missing the implicit signals embedded in candidate narratives. We introduce a multi-agent prompting framework that breaks down the evaluation process into transcript refinement and criterion-specific scoring. Using 3-shot in-context learning with a large instruct-tuned model, our approach outperforms specialised fine-tuned baselines (Avg QWK 0.62 vs 0.32) and achieves reliability comparable to human experts. We further demonstrate the generalisability of our framework on the ASAP benchmark, where it rivals domain-specific state-of-the-art models without additional training. These findings suggest that for complex, subjective reasoning tasks, structured prompt engineering may offer a scalable alternative to data-intensive fine-tuning, altering how LLMs can be applied to automated assessment.
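A compressed sketch of the two-stage prompting flow, with a stubbed model client. The criteria list, prompts, and score parsing below are placeholders rather than the paper's rubric.

```python
def llm(prompt: str) -> str:
    raise NotImplementedError("large instruct-tuned model client goes here")

CRITERIA = ["empathy", "ethical judgment", "communication"]  # illustrative

def score_interview(transcript: str, fewshot: dict[str, str]) -> dict[str, int]:
    # Stage 1: a refinement agent cleans the transcript before any scoring
    refined = llm("Clean up this interview transcript, preserving all "
                  "content:\n" + transcript)
    # Stage 2: one scoring agent per criterion, each given 3 worked examples
    scores = {}
    for criterion in CRITERIA:
        prompt = (f"Score the candidate's {criterion} from 1-5.\n"
                  f"Examples:\n{fewshot[criterion]}\n"
                  f"Transcript:\n{refined}\nScore:")
        scores[criterion] = int(llm(prompt).strip()[0])  # naive score parse
    return scores
```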
[NLP-17] Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics
【Quick Read】: This paper addresses the fragmentation of research on methods for controlling large language models (LLMs), such as local weight fine-tuning, LoRA adaptation, and activation-based interventions, which are studied in isolation without a unifying framework, making comparison and understanding difficult. The key to its solution is a unified view that frames these interventions as dynamic weight updates induced by a control signal, on top of which a preference-utility analysis separates control effects into "preference" (tendency toward a target concept) and "utility" (coherent, task-valid generation), both measured on a shared log-odds scale using polarity-paired contrastive examples. The analysis reveals a systematic trade-off: stronger control increases preference while predictably reducing utility. An activation-manifold perspective explains this: control shifts representations along target-concept directions to raise preference, while utility drops mainly when interventions push representations off the model's valid-generation manifold. Guided by this analysis, the paper introduces SPLIT, a new steering method that improves preference while better preserving utility.
Link: https://arxiv.org/abs/2602.02343
Authors: Ziwen Xu,Chenyan Wu,Hengyu Sun,Haiwen Hong,Mengru Wang,Yunzhi Yao,Longtao Huang,Hui Xue,Shumin Deng,Zhixuan Chu,Huajun Chen,Ningyu Zhang
Affiliations: Zhejiang University; Alibaba Group; National University of Singapore, NUS-NCS Joint Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Work in progress
Abstract:Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis that separates control effects into preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation, and measures both on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model’s valid-generation manifold. Finally, we introduce a new steering approach SPLIT guided by this analysis that improves preference while better preserving utility. Code is available at this https URL.
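The shared log-odds scale is easy to state in code: for a polarity-paired pair of continuations, preference is the gap in sequence log-probability, and utility uses the same scale with a coherent-versus-degraded pair. A minimal sketch, with the log-prob routine stubbed and no claim to match the paper's exact estimator:

```python
def seq_logprob(model, prompt: str, continuation: str) -> float:
    """Sum of token log-probs of `continuation` given `prompt` (stub)."""
    raise NotImplementedError("score with your language model of choice")

def preference_log_odds(model, prompt: str, pos: str, neg: str) -> float:
    # > 0 means the model leans toward the concept-positive continuation
    return seq_logprob(model, prompt, pos) - seq_logprob(model, prompt, neg)

def utility_log_odds(model, prompt: str, coherent: str, degraded: str) -> float:
    # same scale, but the pair contrasts task-valid vs. corrupted text
    return seq_logprob(model, prompt, coherent) - seq_logprob(model, prompt, degraded)
```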
[NLP-18] Language Steering for Multilingual In-Context Learning
【Quick Read】: This paper addresses the substantial performance gap between English and non-English tasks in multilingual large language models, especially in in-context learning (ICL), where English demonstrations paired with non-English test inputs cause severe degradation. The key to its solution is a training-free "language vectors" approach grounded in the hypothesis that LLMs develop a universal semantic space in which different languages are encoded as distinct directions: a vector is built from activation differences between source and target languages at intermediate layers and added to the model's intermediate activations at inference time, shifting internal representations toward the target-language space without any parameter updates. Experiments across 19 languages, multiple tasks, and three models show consistent gains over baselines; the vectors are task-agnostic and transferable, and their hierarchical clustering aligns with language families, revealing structured internal language representations.
Link: https://arxiv.org/abs/2602.02326
Authors: Neeraja Kirtane,Kuan-Hao Huang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:While multilingual large language models have gained widespread adoption, their performance on non-English languages remains substantially inferior to English. This disparity is particularly evident in in-context learning scenarios, where providing demonstrations in English but testing on non-English inputs leads to significant performance degradation. In this paper, we hypothesize that LLMs develop a universal semantic space for understanding languages, where different languages are encoded as distinct directions within this space. Based on this hypothesis, we propose language vectors – a training-free language steering approach that leverages activation differences between source and target languages to guide model behavior. We steer the model generations by adding the vector to the intermediate model activations during inference. This is done to make the model’s internal representations shift towards the target language space without any parameter updates. We evaluate our method across three datasets and test on a total of 19 languages on three different models. Our results show consistent improvements on multilingual in-context learning over baselines across all tasks and languages tested. Beyond performance gains, hierarchical clustering of steering vectors reveals meaningful linguistic structure aligned with language families. These vectors also successfully transfer across tasks, demonstrating that these representations are task-agnostic.
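Activation-addition steering of this kind is straightforward to sketch in PyTorch. The layer choice, mean pooling, helper `mean_hidden`, and strength `alpha` below are assumptions; the paper's exact construction of language vectors may differ.

```python
import torch

def mean_hidden(model, text: str, layer: int) -> torch.Tensor:
    """Mean-pooled hidden state of `text` at `layer` (stub; e.g. via
    Hugging Face's output_hidden_states=True)."""
    raise NotImplementedError

def build_language_vector(model, pairs: list[tuple[str, str]], layer: int):
    """Language vector = mean activation difference over translation pairs
    (source-language text, target-language text)."""
    diffs = [mean_hidden(model, tgt, layer) - mean_hidden(model, src, layer)
             for src, tgt in pairs]
    return torch.stack(diffs).mean(dim=0)

def add_steering_hook(block: torch.nn.Module, vector: torch.Tensor,
                      alpha: float = 4.0):
    """Shift the block's output by alpha * vector during generation."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vector.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return block.register_forward_hook(hook)  # call .remove() to undo
```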
[NLP-19] A Large-Scale Dataset for Molecular Structure-Language Description via a Rule-Regularized Method
【Quick Read】: This paper addresses the problem of precisely aligning molecular structure with natural language, a prerequisite for enabling large language models (LLMs) to reason about downstream chemical tasks, where the high cost of human annotation has prevented large-scale, high-quality structure-grounded description datasets. The key to its solution is a fully automated annotation framework: a rule-based chemical nomenclature parser interprets IUPAC names and produces structure-rich XML metadata, which then guides LLMs to generate accurate natural-language descriptions. The framework yields a dataset of roughly 163k molecule-description pairs; joint LLM-based and expert human validation measures 98.6% description precision, providing a reliable foundation for molecule-language alignment and an annotation method readily extensible to larger datasets and broader chemical tasks.
Link: https://arxiv.org/abs/2602.02320
Authors: Feiyang Cai,Guijuan He,Yi Hu,Jingjing Wang,Joshua Luo,Tianyu Zhu,Srikanth Pilla,Gang Li,Ling Liu,Feng Luo
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Comments:
Abstract:Molecular function is largely determined by structure. Accurately aligning molecular structure with natural language is therefore essential for enabling large language models (LLMs) to reason about downstream chemical tasks. However, the substantial cost of human annotation makes it infeasible to construct large-scale, high-quality datasets of structure-grounded descriptions. In this work, we propose a fully automated annotation framework for generating precise molecular structure descriptions at scale. Our approach builds upon and extends a rule-based chemical nomenclature parser to interpret IUPAC names and construct enriched, structured XML metadata that explicitly encodes molecular structure. This metadata is then used to guide LLMs in producing accurate natural-language descriptions. Using this framework, we curate a large-scale dataset of approximately 163k molecule-description pairs. A rigorous validation protocol combining LLM-based and expert human evaluation on a subset of 2,000 molecules demonstrates a high description precision of 98.6%. The resulting dataset provides a reliable foundation for future molecule-language alignment, and the proposed annotation method is readily extensible to larger datasets and broader chemical tasks that rely on structural descriptions.
[NLP-20] The Shape of Beliefs: Geometry, Dynamics, and Interventions along Representation Manifolds of Language Model Posteriors
【Quick Read】: This paper investigates how prompt-conditioned beliefs (posteriors over answers and claims) are encoded in the representation space of large language models (LLMs), how they update with new evidence, and how interventions reshape them. The core challenge is that standard linear interventions often push the model off its "belief manifolds", inducing coupled, out-of-distribution shifts. The key to the solution is geometry- and field-aware steering, which better preserves the intended belief family than standard linear manipulation, together with linear field probing (LFP), a simple way to tile the data manifold and make interventions that respect the underlying geometry. The study concludes that rich structure emerges naturally in LLMs and that purely linear concept representations are often an inadequate abstraction.
Link: https://arxiv.org/abs/2602.02315
Authors: Raphaël Sarfati,Eric Bigelow,Daniel Wurgaft,Jack Merullo,Atticus Geiger,Owen Lewis,Tom McGrath,Ekdeep Singh Lubana
Affiliations: GoodFire
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) represent prompt-conditioned beliefs (posteriors over answers and claims), but we lack a mechanistic account of how these beliefs are encoded in representation space, how they update with new evidence, and how interventions reshape them. We study a controlled setting in which Llama-3.2 generates samples from a normal distribution by implicitly inferring its parameters (mean and standard deviation) given only samples from the distribution in context. We find representations of curved “belief manifolds” for these parameters form with sufficient in-context learning and study how the model adapts when the distribution suddenly changes. While standard linear steering often pushes the model off-manifold and induces coupled, out-of-distribution shifts, geometry and field-aware steering better preserves the intended belief family. Our work demonstrates an example of linear field probing (LFP) as a simple approach to tile the data manifold and make interventions that respect the underlying geometry. We conclude that rich structure emerges naturally in LLMs and that purely linear concept representations are often an inadequate abstraction.
[NLP-21] Interpreting and Controlling LLM Reasoning through Integrated Policy Gradient
【Quick Read】: This paper addresses the opacity of the internal mechanisms behind complex reasoning in large language models (LLMs): existing interpretability methods struggle to precisely localize complex reasoning pathways or capture the sequential influence of internal states on reasoning outputs. The key to its solution is Integrated Policy Gradient (IPG), a novel framework built on outcome-oriented and sequential-influence-aware principles that propagates compound outcome signals (such as post-reasoning accuracy) backward through the model's inference trajectories, precisely identifying internal components with cumulative contributions to reasoning behavior and enabling reliable modulation of behaviors such as reasoning capability and reasoning strength.
Link: https://arxiv.org/abs/2602.02313
Authors: Changming Li,Kaixing Zhang,Haoyun Xu,Yingdong Shi,Zheng Zhang,Kaitao Song,Kan Ren
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Large language models (LLMs) demonstrate strong reasoning abilities in solving complex real-world problems. Yet, the internal mechanisms driving these complex reasoning behaviors remain opaque. Existing interpretability approaches targeting reasoning either identify components (e.g., neurons) correlated with special textual patterns, or rely on human-annotated contrastive pairs to derive control vectors. Consequently, current methods struggle to precisely localize complex reasoning mechanisms or capture sequential influence from model internal workings to the reasoning outputs. In this paper, built on outcome-oriented and sequential-influence-aware principles, we focus on identifying components that have sequential contribution to reasoning behavior where outcomes are cumulated by long-range effects. We propose Integrated Policy Gradient (IPG), a novel framework that attributes reasoning behaviors to model’s inner components by propagating compound outcome-based signals such as post reasoning accuracy backward through model inference trajectories. Empirical evaluations demonstrate that our approach achieves more precise localization and enables reliable modulation of reasoning behaviors (e.g., reasoning capability, reasoning strength) across diverse reasoning models.
[NLP-22] Advancing General-Purpose Reasoning Models with Modular Gradient Surgery
【Quick Read】: This paper addresses cross-domain interference in multi-domain reinforcement learning (RL) caused by domain heterogeneity, which manifests at both the behavioral and gradient levels and limits the overall gains of general-purpose large reasoning models (LRMs). The key to its solution, Modular Gradient Surgery (MGS), is to resolve gradient conflicts at the module level inside the transformer, effectively mitigating gradient interference in multi-task training and enabling more stable multi-domain co-optimization. Experiments show average improvements of 4.3 (16.6%) and 4.5 (11.1%) points on Llama and Qwen models respectively, with MGS remaining effective under prolonged training.
Link: https://arxiv.org/abs/2602.02301
Authors: Min Cai,Yu Liang,Longzheng Wang,Yan Wang,Yueyang Zhang,Long Xia,Zhiyuan Sun,Xi Ye,Daiting Shi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint; Code: this https URL Website: this https URL
Abstract:Reinforcement learning (RL) has played a central role in recent advances in large reasoning models (LRMs), yielding strong gains in verifiable and open-ended reasoning. However, training a single general-purpose LRM across diverse domains remains challenging due to pronounced domain heterogeneity. Through a systematic study of two widely used strategies, Sequential RL and Mixed RL, we find that both incur substantial cross-domain interference at the behavioral and gradient levels, resulting in limited overall gains. To address these challenges, we introduce Modular Gradient Surgery (MGS), which resolves gradient conflicts at the module level within the transformer. When applied to Llama and Qwen models, MGS achieves average improvements of 4.3 (16.6%) and 4.5 (11.1%) points, respectively, over standard multi-task RL across three representative domains (math, general chat, and instruction following). Further analysis demonstrates that MGS remains effective under prolonged training. Overall, our study clarifies the sources of interference in multi-domain RL and presents an effective solution for training general-purpose LRMs.
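Module-level surgery differs from whole-model gradient surgery mainly in where the dot products are taken. Below is a sketch under that reading, with a PCGrad-style projection standing in for whatever conflict-resolution rule MGS actually uses; the per-module flattening and the merge by averaging are assumptions.

```python
import torch

def modular_gradient_surgery(per_domain_grads: dict[str, dict[str, torch.Tensor]]):
    """Resolve cross-domain gradient conflicts module by module.

    per_domain_grads maps domain -> {module_name: flattened gradient}.
    A PCGrad-style projection applied at module granularity; the dot
    products are computed per module, not over the whole parameter vector.
    """
    domains = list(per_domain_grads)
    modules = per_domain_grads[domains[0]].keys()
    merged = {}
    for m in modules:
        gs = [per_domain_grads[d][m].clone() for d in domains]
        for i, gi in enumerate(gs):
            for j, gj in enumerate(gs):
                if i == j:
                    continue
                dot = torch.dot(gi, gj)
                if dot < 0:  # conflict detected on THIS module only
                    gi -= dot / (gj.norm() ** 2 + 1e-12) * gj  # in-place
        merged[m] = torch.stack(gs).mean(0)
    return merged
```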
[NLP-23] Hallucination or Creativity: How to Evaluate AI-Generated Scientific Stories?
【Quick Read】: This paper addresses the lack of effective evaluation tools for generative AI systems that turn scientific articles into narratives for diverse audiences: standard summarization metrics miss the abstraction, simplification, and pedagogical creativity that storytelling demands, while factual hallucination, critical in scientific contexts, is poorly handled by detectors that misclassify legitimate narrative reformulations and prove unstable under creativity. The key to its solution is StoryScore, a composite metric that integrates semantic alignment, lexical grounding, narrative control, structural fidelity, redundancy avoidance, and entity-level hallucination detection into a unified framework for measuring the quality and factuality of scientific stories.
Link: https://arxiv.org/abs/2602.02290
Authors: Alex Argese,Pasquale Lisena,Raphaël Troncy
Affiliations: EURECOM
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Generative AI can turn scientific articles into narratives for diverse audiences, but evaluating these stories remains challenging. Storytelling demands abstraction, simplification, and pedagogical creativity-qualities that are not often well-captured by standard summarization metrics. Meanwhile, factual hallucinations are critical in scientific contexts, yet, detectors often misclassify legitimate narrative reformulations or prove unstable when creativity is involved. In this work, we propose StoryScore, a composite metric for evaluating AI-generated scientific stories. StoryScore integrates semantic alignment, lexical grounding, narrative control, structural fidelity, redundancy avoidance, and entity-level hallucination detection into a unified framework. Our analysis also reveals why many hallucination detection methods fail to distinguish pedagogical creativity from factual errors, highlighting a key limitation: while automatic metrics can effectively assess semantic similarity with original content, they struggle to evaluate how it is narrated and controlled.
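Structurally, a composite metric of this kind reduces to a weighted aggregation of per-dimension scores. The dimension names below mirror the list in the abstract, but the weights are invented for illustration; the paper defines the actual components and their combination.

```python
# Illustrative weights only; StoryScore's real weighting is defined in the paper.
SUBMETRICS = {
    "semantic_alignment":   0.25,
    "lexical_grounding":    0.15,
    "narrative_control":    0.15,
    "structural_fidelity":  0.15,
    "redundancy_avoidance": 0.10,
    "entity_hallucination": 0.20,  # e.g. 1 - entity-level hallucination rate
}

def story_score(scores: dict[str, float]) -> float:
    """Combine per-dimension scores in [0, 1] into a single number."""
    assert set(scores) == set(SUBMETRICS)
    return sum(SUBMETRICS[k] * scores[k] for k in SUBMETRICS)

print(story_score({k: 0.8 for k in SUBMETRICS}))  # ≈ 0.8
```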
[NLP-24] Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages EACL2026
【Quick Read】: This paper addresses measurement instability in cross-lingual LLM evaluation, i.e., distinguishing genuine model performance differences from variance introduced by the evaluation method itself. The core problem is that cross-lingual evaluation typically conflates two sources of variance: true differences in model capability and scoring inconsistencies driven by language properties. The key to its solution is a controlled-generation design: generation parameters are held identical while only the target language varies (Estonian, Finnish, and Hungarian), isolating how the evaluation method itself reacts to each language. The results show that surface-level metrics (lexical diversity, semantic similarity) are stable across languages, while pragmatic judgments (coherence, instruction-following) exhibit pronounced rank inversions and near-zero correlations, indicating that zero-shot transfer of LLM-as-a-judge scoring is unreliable for morphologically rich languages. The controlled design thus serves as a diagnostic: evaluation methods that cannot maintain stable rankings under identical generation conditions signal model-to-language transfer failure before deployment, motivating calibration against targeted human baselines in the target language.
Link: https://arxiv.org/abs/2602.02287
Authors: Isaac Chung,Linda Freienthal
Affiliations: Zendesk
Categories: Computation and Language (cs.CL)
Comments: First Workshop on Multilingual Multicultural Evaluation, co-located with EACL 2026
Abstract:Cross-lingual evaluation of large language models (LLMs) typically conflates two sources of variance: genuine model performance differences and measurement instability. We investigate evaluation reliability by holding generation conditions constant while varying target language. Using synthetic customer-support dialogues generated with identical parameters across Estonian, Finnish, and Hungarian, we test whether automatic metrics and LLM-as-a-judge scoring produce stable model rankings across these morphologically rich, related Finno-Ugric languages. With a small set of Estonian native speaker annotations as a reference point, we find systematic ranking instabilities: surface-level metrics (lexical diversity, surface and semantic similarity) maintain cross-language stability, but pragmatic judgments (coherence, instruction-following) exhibit rank inversions and near-zero correlations. Because generation is controlled, these inconsistencies reflect how judge scoring behaves differently across languages rather than true model differences. This controlled design provides a diagnostic probe: evaluation methods that fail to maintain stability under identical generation conditions signal transfer failure before deployment. Our findings suggest that zero-shot judge transfer is unreliable for discourse-level assessment in morphologically rich languages, motivating language-specific calibration against targeted human baselines. We release our controlled generation protocol, synthetic data, and evaluation framework to enable replication across language families at this https URL.
[NLP-25] Statistical Learning Theory in Lean 4: Empirical Processes from Scratch
【Quick Read】: This paper addresses the absence of formal verification in statistical learning theory (SLT), specifically the lack of a complete formalization of empirical process theory. The key to its solution is the first Lean 4 formal infrastructure for SLT, covering a complete development of Gaussian Lipschitz concentration, the first formalization of Dudley's entropy integral theorem for sub-Gaussian processes, and an application to least-squares (sparse) regression with a sharp rate. The project uses a human-AI collaborative workflow in which humans design proof strategies and AI agents execute tactical proof construction, yielding a reusable, human-verified Lean 4 toolbox while exposing and resolving implicit assumptions and missing details in standard SLT textbooks, advancing the formal foundations of machine learning theory.
Link: https://arxiv.org/abs/2602.02285
Authors: Yuanhe Zhang,Jason D. Lee,Fanghui Liu
Affiliations: University of Warwick; University of California, Berkeley; Shanghai Jiao Tong University; University of Warwick
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Statistics Theory (math.ST)
Comments: 19 pages, 2 figures. Comments are welcome
Abstract:We present the first comprehensive Lean 4 formalization of statistical learning theory (SLT) grounded in empirical process theory. Our end-to-end formal infrastructure implements content missing from the latest Lean 4 Mathlib library, including a complete development of Gaussian Lipschitz concentration, the first formalization of Dudley’s entropy integral theorem for sub-Gaussian processes, and an application to least-squares (sparse) regression with a sharp rate. The project was carried out using a human-AI collaborative workflow, in which humans design proof strategies and AI agents execute tactical proof construction, leading to a human-verified Lean 4 toolbox for SLT. Beyond implementation, the formalization process exposes and resolves implicit assumptions and missing details in standard SLT textbooks, enforcing a granular, line-by-line understanding of the theory. This work establishes a reusable formal foundation and opens the door for future developments in machine learning theory. The code is available at this https URL
[NLP-26] RACA: Representation-Aware Coverage Criteria for LLM Safety Testing
【Quick Read】: This paper addresses two core problems in safety testing of large language models (LLMs): existing tests rely on static datasets and lack systematic criteria for judging test quality and adequacy, and traditional neuron-level coverage criteria do not transfer to LLMs because of their scale and differing objectives. The key to its solution, RACA, is to leverage representation engineering to focus on safety-critical concept representations inside the LLM, reducing dimensionality and filtering out irrelevant information. RACA operates in three stages: it identifies safety-critical representations from a small expert-curated calibration set of jailbreak prompts, computes conceptual activation scores for a test suite based on those representations, and evaluates individual and compositional safety-concept coverage with six sub-criteria. Experiments show RACA effectively identifies high-quality jailbreak prompts, outperforms neuron-level criteria, and generalizes well to practical scenarios such as test-set prioritization and attack-prompt sampling, offering a quantifiable, scalable framework for LLM safety evaluation.
Link: https://arxiv.org/abs/2602.02280
Authors: Zeming Wei,Zhixin Zhang,Chengcan Wu,Yihao Zhang,Xiaokun Luan,Meng Sun
Affiliations: Peking University
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:Recent advancements in LLMs have led to significant breakthroughs in various AI applications. However, their sophisticated capabilities also introduce severe safety concerns, particularly the generation of harmful content through jailbreak attacks. Current safety testing for LLMs often relies on static datasets and lacks systematic criteria to evaluate the quality and adequacy of these tests. While coverage criteria have been effective for smaller neural networks, they are not directly applicable to LLMs due to scalability issues and differing objectives. To address these challenges, this paper introduces RACA, a novel set of coverage criteria specifically designed for LLM safety testing. RACA leverages representation engineering to focus on safety-critical concepts within LLMs, thereby reducing dimensionality and filtering out irrelevant information. The framework operates in three stages: first, it identifies safety-critical representations using a small, expert-curated calibration set of jailbreak prompts. Second, it calculates conceptual activation scores for a given test suite based on these representations. Finally, it computes coverage results using six sub-criteria that assess both individual and compositional safety concepts. We conduct comprehensive experiments to validate RACA’s effectiveness, applicability, and generalization, where the results demonstrate that RACA successfully identifies high-quality jailbreak prompts and is superior to traditional neuron-level criteria. We also showcase its practical application in real-world scenarios, such as test set prioritization and attack prompt sampling. Furthermore, our findings confirm RACA’s generalization to various scenarios and its robustness across various configurations. Overall, RACA provides a new framework for evaluating the safety of LLMs, contributing a valuable technique to the field of testing for AI.
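One plausible reading of an individual safety-concept coverage criterion fits in a few lines of numpy: project test-suite activations onto safety-concept directions and count which concepts any prompt activates. The directions, thresholds, and data here are random placeholders; RACA derives them via representation engineering on a calibration set, and defines six sub-criteria rather than this one.

```python
import numpy as np

def concept_scores(acts: np.ndarray, concept_dirs: np.ndarray) -> np.ndarray:
    """Project activations onto safety-concept directions.
    acts: (n_prompts, d); concept_dirs: (n_concepts, d), unit-norm rows."""
    return acts @ concept_dirs.T  # (n_prompts, n_concepts)

def individual_coverage(scores: np.ndarray, thresholds: np.ndarray) -> float:
    """Fraction of safety concepts activated by at least one test prompt."""
    activated = (scores > thresholds).any(axis=0)
    return float(activated.mean())

rng = np.random.default_rng(0)
dirs = rng.normal(size=(8, 64))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # toy concept directions
acts = rng.normal(size=(100, 64))                    # toy test-suite activations
print(individual_coverage(concept_scores(acts, dirs), np.full(8, 2.0)))
```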
[NLP-27] Kimi K2.5: Visual Agentic Intelligence
【Quick Read】: This paper targets the low efficiency and weak coordination of current multimodal agents on complex tasks. The core challenges are how to fuse the text and vision modalities so that they enhance each other and how to coordinate multiple agents efficiently in parallel. The key to its solution is twofold: joint text-vision pre-training, zero-vision SFT, and joint text-vision reinforcement learning deeply co-optimize the two modalities; and Agent Swarm, a self-directed parallel agent orchestration framework, dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently, cutting latency by up to 4.5× over single-agent baselines. The model achieves state-of-the-art results across coding, vision, reasoning, and agentic tasks.
Link: https://arxiv.org/abs/2602.02276
Authors: Kimi Team:Tongtong Bai,Yifan Bai,Yiping Bao,S.H. Cai,Yuan Cao,Y. Charles,H.S. Che,Cheng Chen,Guanduo Chen,Huarong Chen,Jia Chen,Jiahao Chen,Jianlong Chen,Jun Chen,Kefan Chen,Liang Chen,Ruijue Chen,Xinhao Chen,Yanru Chen,Yanxu Chen,Yicun Chen,Yimin Chen,Yingjiang Chen,Yuankun Chen,Yujie Chen,Yutian Chen,Zhirong Chen,Ziwei Chen,Dazhi Cheng,Minghan Chu,Jialei Cui,Jiaqi Deng,Muxi Diao,Hao Ding,Mengfan Dong,Mengnan Dong,Yuxin Dong,Yuhao Dong,Angang Du,Chenzhuang Du,Dikang Du,Lingxiao Du,Yulun Du,Yu Fan,Shengjun Fang,Qiulin Feng,Yichen Feng,Garimugai Fu,Kelin Fu,Hongcheng Gao,Tong Gao,Yuyao Ge,Shangyi Geng,Chengyang Gong,Xiaochen Gong,Zhuoma Gongque,Qizheng Gu,Xinran Gu,Yicheng Gu,Longyu Guan,Yuanying Guo,Xiaoru Hao,Weiran He,Wenyang He,Yunjia He,Chao Hong,Hao Hu,Jiaxi Hu,Yangyang Hu,Zhenxing Hu,Ke Huang,Ruiyuan Huang,Weixiao Huang,Zhiqi Huang,Tao Jiang,Zhejun Jiang,Xinyi Jin,Yu Jing,Guokun Lai,Aidi Li,C. Li,Cheng Li,Fang Li,Guanghe Li,Guanyu Li,Haitao Li,Haoyang Li,Jia Li,Jingwei Li,Junxiong Li,Lincan Li,Mo Li,Weihong Li,Wentao Li,Xinhang Li,Xinhao Li,Yang Li,Yanhao Li,Yiwei Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Kimi K2.5 tech report
Abstract:We introduce Kimi K2.5, an open-source multimodal agentic model designed to advance general agentic intelligence. K2.5 emphasizes the joint optimization of text and vision so that the two modalities enhance each other. This includes a series of techniques such as joint text-vision pre-training, zero-vision SFT, and joint text-vision reinforcement learning. Building on this multimodal foundation, K2.5 introduces Agent Swarm, a self-directed parallel agent orchestration framework that dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently. Extensive evaluations show that Kimi K2.5 achieves state-of-the-art results across various domains including coding, vision, reasoning, and agentic tasks. Agent Swarm also reduces latency by up to 4.5× over single-agent baselines. We release the post-trained Kimi K2.5 model checkpoint to facilitate future research and real-world applications of agentic intelligence.
[NLP-28] DziriBOT: RAG-Based Intelligent Conversational Agent for Algerian Arabic Dialect
Quick Read: This paper addresses the linguistic complexity that the Algerian dialect (Darja) poses for customer-service dialogue systems, including non-standardized orthography, code-switching with French, and the coexistence of Arabic and Latin (Arabizi) scripts. To tackle this low-resource challenge, the authors propose DziriBOT, a hybrid intelligent conversational agent that combines Natural Language Understanding (NLU) with Retrieval-Augmented Generation (RAG). Its key idea is a layered architecture that supports both structured service flows and dynamic responses grounded in an enterprise document knowledge base. Comparing three approaches (a sparse-feature Rasa pipeline, classical machine-learning baselines, and transformer fine-tuning) shows that the fine-tuned DziriBERT model handles orthographic noise and rare intents best, clearly outperforming traditional methods and offering a scalable blueprint for dialect-aware automation in regional markets.
Link: https://arxiv.org/abs/2602.02270
Authors: El Batoul Bechiri,Dihia Lanasri
Affiliations: CESI, ATM Mobilis
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The rapid digitalization of customer service has intensified the demand for conversational agents capable of providing accurate and natural interactions. In the Algerian context, this is complicated by the linguistic complexity of Darja, a dialect characterized by non-standardized orthography, extensive code-switching with French, and the simultaneous use of Arabic and Latin (Arabizi) scripts. This paper introduces DziriBOT, a hybrid intelligent conversational agent specifically engineered to overcome these challenges. We propose a multi-layered architecture that integrates specialized Natural Language Understanding (NLU) with Retrieval-Augmented Generation (RAG), allowing for both structured service flows and dynamic, knowledge-intensive responses grounded in curated enterprise documentation. To address the low-resource nature of Darja, we systematically evaluate three distinct approaches: a sparse-feature Rasa pipeline, classical machine learning baselines, and transformer-based fine-tuning. Our experimental results demonstrate that the fine-tuned DziriBERT model achieves state-of-the-art performance. These results significantly outperform traditional baselines, particularly in handling orthographic noise and rare intents. Ultimately, DziriBOT provides a robust, scalable solution that bridges the gap between formal language models and the linguistic realities of Algerian users, offering a blueprint for dialect-aware automation in the regional market.
[NLP-29] OpenSeal: Good, Fast and Cheap Construction of an Open-Source Southeast Asian LLM via Parallel Data
Quick Read: This paper tackles the English-centric bias of current large language models (LLMs), in particular their weak performance on low-resource languages and the lack of truly open-source multilingual models for Southeast Asia. The authors extend an LLM to new languages by introducing parallel data during continual pretraining. The key finding is that parallel data alone substantially improves performance on new languages: with 34.7B tokens of parallel data and 180 hours on 8 NVIDIA H200 GPUs, they build OpenSeal, the first truly open-source Southeast Asian LLM, which rivals existing models of similar size.
Link: https://arxiv.org/abs/2602.02266
Authors: Tan Sang Nguyen,Muhammad Reza Qorib,Hwee Tou Ng
Affiliations: National University of Singapore
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) have proven to be effective tools for a wide range of natural language processing (NLP) applications. Although many LLMs are multilingual, most remain English-centric and perform poorly on low-resource languages. Recently, several Southeast Asia-focused LLMs have been developed, but none are truly open source, as they do not publicly disclose their training data. Truly open-source models are important for transparency and for enabling a deeper and more precise understanding of LLM internals and development, including biases, generalization, and multilinguality. Motivated by recent advances demonstrating the effectiveness of parallel data in improving multilingual performance, we conduct controlled and comprehensive experiments to study the effectiveness of parallel data in continual pretraining of LLMs. Our findings show that using only parallel data is the most effective way to extend an LLM to new languages. Using just 34.7B tokens of parallel data and 180 hours on 8x NVIDIA H200 GPUs, we built OpenSeal, the first truly open Southeast Asian LLM that rivals the performance of existing models of similar size.
[NLP-30] OmniCode: A Benchmark for Evaluating Software Engineering Agents
Quick Read: This paper argues that mainstream code-generation benchmarks such as HumanEval and SWE-Bench focus on narrowly scoped tasks like competition programming and patch generation, and therefore cannot comprehensively evaluate LLM-powered coding agents across the dimensions of real-world software engineering. Its solution is OmniCode, a new software-engineering benchmark with 1,794 tasks spanning Python, Java, and C++ and covering four task categories: bug fixing, test generation, code-review fixing, and style fixing. OmniCode is manually validated to ensure well-defined problems and uses synthetically constructed or recently curated data to avoid data leakage, yielding a rigorous evaluation framework that better reflects real development needs.
Link: https://arxiv.org/abs/2602.02262
Authors: Atharv Sonwane,Eng-Shen Tu,Wei-Chung Lu,Claas Beger,Carter Larsen,Debjit Dhar,Rachel Chen,Ronit Pattanayak,Tuan Anh Dang,Guohao Chen,Gloria Geng,Kevin Ellis,Saikat Dutta
Affiliations: Cornell University; Independent contributor; UC Santa Barbara; Jadavpur University; New York University
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:LLM-powered coding agents are redefining how real-world software is developed. To drive the research towards better coding agents, we require challenging benchmarks that can rigorously evaluate the ability of such agents to perform various software engineering tasks. However, popular coding benchmarks such as HumanEval and SWE-Bench focus on narrowly scoped tasks such as competition programming and patch generation. In reality, software engineers have to handle a broader set of tasks for real-world software development. To address this gap, we propose OmniCode, a novel software engineering benchmark that contains a broader and more diverse set of task categories beyond code or patch generation. Overall, OmniCode contains 1794 tasks spanning three programming languages (Python, Java, and C++) and four key categories: bug fixing, test generation, code review fixing, and style fixing. In contrast to prior software engineering benchmarks, the tasks in OmniCode are (1) manually validated to eliminate ill-defined problems, and (2) synthetically crafted or recently curated to avoid data leakage issues, presenting a new framework for synthetically generating diverse software tasks from limited real-world data. We evaluate OmniCode with popular agent frameworks such as SWE-Agent and show that while they may perform well on bug fixing for Python, they fall short on tasks such as Test Generation and in languages such as C++ and Java. For instance, SWE-Agent achieves a maximum of 20.9% with DeepSeek-V3.1 on Java Test Generation tasks. OmniCode aims to serve as a robust benchmark and spur the development of agents that can perform well across different aspects of software development. Code and data are available at this https URL.
[NLP-31] Learning While Staying Curious: Entropy-Preserving Supervised Fine-Tuning via Adaptive Self-Distillation for Large Reasoning Models
Quick Read: This paper addresses a limitation of the standard SFT-then-RL post-training pipeline: supervised fine-tuning (SFT) imitates expert demonstrations, which tends to cause overconfidence and reduce generation diversity, narrowing the solution space that reinforcement learning (RL) can explore. The proposed remedy is CurioSFT, an entropy-preserving SFT method that strengthens exploration through intrinsic curiosity. Its core mechanisms are (a) Self-Exploratory Distillation, which distills the model toward a temperature-scaled, self-generated teacher to encourage exploration within the model's own capability, and (b) Entropy-Guided Temperature Selection, which adaptively adjusts distillation strength, amplifying exploration at reasoning tokens while stabilizing factual tokens to mitigate knowledge forgetting. Experiments show that CurioSFT already improves performance in the SFT stage and that the preserved exploration translates into concrete gains in the RL stage.
Link: https://arxiv.org/abs/2602.02244
Authors: Hao Wang,Hao Gu,Hongming Piao,Kaixiong Gong,Yuxiao Ye,Xiangyu Yue,Sirui Han,Yike Guo,Dapeng Wu
Affiliations: City University of Hong Kong; The Hong Kong University of Science and Technology; The Chinese University of Hong Kong
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:The standard post-training recipe for large reasoning models, supervised fine-tuning followed by reinforcement learning (SFT-then-RL), may limit the benefits of the RL stage: while SFT imitates expert demonstrations, it often causes overconfidence and reduces generation diversity, leaving RL with a narrowed solution space to explore. Adding entropy regularization during SFT is not a cure-all; it tends to flatten token distributions toward uniformity, increasing entropy without improving meaningful exploration capability. In this paper, we propose CurioSFT, an entropy-preserving SFT method designed to enhance exploration capabilities through intrinsic curiosity. It consists of (a) Self-Exploratory Distillation, which distills the model toward a self-generated, temperature-scaled teacher to encourage exploration within its capability; and (b) Entropy-Guided Temperature Selection, which adaptively adjusts distillation strength to mitigate knowledge forgetting by amplifying exploration at reasoning tokens while stabilizing factual tokens. Extensive experiments on mathematical reasoning tasks demonstrate that, in SFT stage, CurioSFT outperforms the vanilla SFT by 2.5 points on in-distribution tasks and 2.9 points on out-of-distribution tasks. We also verify that exploration capabilities preserved during SFT successfully translate into concrete gains in RL stage, yielding an average improvement of 5.0 points.
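A minimal PyTorch sketch of the self-distillation step, under our assumptions: the teacher is the model's own (detached) distribution sharpened or flattened by a per-token temperature, the student matches it with a KL term, and the temperature is picked from token entropy as a crude stand-in for entropy-guided selection. The KL direction and temperature rule are not taken from the paper.
```python
import torch
import torch.nn.functional as F

def self_exploratory_distillation_loss(logits: torch.Tensor,
                                       temperature: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student), where the teacher is the model's own detached
    distribution rescaled by a per-token temperature.

    logits: (batch, seq, vocab); temperature: (batch, seq). In CurioSFT the
    temperatures would come from entropy-guided selection (hotter at
    reasoning tokens, ~1 at factual tokens); this interface is an assumption.
    """
    teacher = F.softmax(logits.detach() / temperature.unsqueeze(-1), dim=-1)
    log_student = F.log_softmax(logits, dim=-1)
    return F.kl_div(log_student, teacher, reduction="batchmean")

# Toy usage with a per-token temperature derived from token entropy.
logits = torch.randn(2, 8, 100, requires_grad=True)
with torch.no_grad():
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)
    # Illustrative rule: hotter teacher where the model is already uncertain.
    temperature = 1.0 + entropy / entropy.max()
loss = self_exploratory_distillation_loss(logits, temperature)
loss.backward()
print(float(loss))
```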
[NLP-32] Using Correspondence Patterns to Identify Irregular Words in Cognate sets Through Leave-One-Out Validation EACL2026
Quick Read: This paper addresses the lack of a quantitative standard for assessing the regularity of sound correspondences in historical language comparison: regularity is usually judged intuitively, and irregularity is more common in real data than the Neogrammarian model would predict. The key contribution is a new regularity measure, the balanced average recurrence of correspondence patterns, together with a computational method that uses it to flag cognate sets whose overall irregularity is caused by individual aberrant word forms. Validated via leave-one-out experiments, the method reaches an overall accuracy of 85% on real datasets, demonstrating its value for improving the quality of cognate sets in computer-assisted language comparison.
Link: https://arxiv.org/abs/2602.02221
Authors: Frederic Blum,Johann-Mattis List
Affiliations: Max Planck Institute for Evolutionary Anthropology; University of Passau
Subjects: Computation and Language (cs.CL)
Comments: Accepted for the L'Change workshop @ EACL 2026
Abstract:Regular sound correspondences constitute the principal evidence in historical language comparison. Despite the heuristic focus on regularity, it is often more an intuitive judgement than a quantified evaluation, and irregularity is more common than expected from the Neogrammarian model. Given the recent progress of computational methods in historical linguistics and the increased availability of standardized lexical data, we are now able to improve our workflows and provide such a quantitative evaluation. Here, we present the balanced average recurrence of correspondence patterns as a new measure of regularity. We also present a new computational method that uses this measure to identify cognate sets that lack regularity with respect to their correspondence patterns. We validate the method through two experiments, using simulated and real data. In the experiments, we employ leave-one-out validation to measure the regularity of cognate sets in which one word form has been replaced by an irregular one, checking how well our method identifies the forms causing the irregularity. Our method achieves an overall accuracy of 85% with the datasets based on real data. We also show the benefits of working with subsamples of large datasets and how increasing irregularity in the data influences our results. Reflecting on the broader potential of our new regularity measure and the irregular cognate identification method based on it, we conclude that they could play an important role in improving the quality of existing and future datasets in computer-assisted language comparison.
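The leave-one-out idea lends itself to a compact illustration. Below is a toy sketch in which a correspondence pattern is an aligned column of segments across three languages, and a cognate set's regularity is the fraction of its patterns that recur in the rest of the data; the paper's balanced-average-recurrence weighting is simplified to this plain average.
```python
from collections import Counter

# Each cognate set is a list of aligned columns; a column is one
# correspondence pattern across three hypothetical languages.
cognate_sets = {
    "SUN":  [("p", "p", "f"), ("a", "a", "a")],
    "MOON": [("p", "p", "f"), ("a", "a", "a")],
    "STAR": [("p", "b", "f"), ("a", "a", "a")],   # the 'b' column never recurs
}

def regularity(set_id: str) -> float:
    # Count pattern recurrence over all *other* sets (leave-one-out).
    counts = Counter(p for sid, pats in cognate_sets.items()
                     if sid != set_id for p in pats)
    pats = cognate_sets[set_id]
    # A pattern is "supported" if it recurs at least once elsewhere.
    return sum(counts[p] > 0 for p in pats) / len(pats)

for sid in cognate_sets:
    print(sid, regularity(sid))   # STAR scores lowest: one irregular column
```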
[NLP-33] Am I More Pointwise or Pairwise? Revealing Position Bias in Rubric-Based LLM-as-a-Judge
Quick Read: This paper studies position bias in rubric-based LLM-as-a-Judge setups: when selecting a score from a list of rubric options, LLMs tend to prefer options that appear at particular positions rather than those that truly reflect text quality, which weakens agreement between model judgments and human ratings. The key remedy is a balanced permutation strategy that distributes each score option evenly across positions, neutralizing position effects; aggregating scores over the balanced permutations both reveals the latent position bias and substantially improves the correlation between the LLM judge and human ratings, making rubric-based automatic evaluation more reliable.
Link: https://arxiv.org/abs/2602.02219
Authors: Yuzheng Xu,Tosho Hirasawa,Tadashi Kozuno,Yoshitaka Ushiku
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) are now widely used to evaluate the quality of text, a field commonly referred to as LLM-as-a-judge. While prior work mainly focuses on point-wise and pair-wise evaluation paradigms, rubric-based evaluation, where LLMs select a score from multiple rubrics, has received less analysis. In this work, we show that rubric-based evaluation implicitly resembles a multi-choice setting and therefore has position bias: LLMs prefer score options appearing at specific positions in the rubric list. Through controlled experiments across multiple models and datasets, we demonstrate consistent position bias. To mitigate this bias, we propose a balanced permutation strategy that evenly distributes each score option across positions. We show that aggregating scores across balanced permutations not only reveals latent position bias, but also improves the correlation between the LLM-as-a-Judge and humans. Our results suggest that rubric-based LLM-as-a-Judge is not inherently point-wise and that simple permutation-based calibration can substantially improve its reliability.
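The balanced permutation strategy is easy to illustrate: present the rubric options in cyclic rotations so that every option occupies every position exactly once, then aggregate. A sketch with a toy position-biased judge (a stand-in for the LLM call) is below.
```python
import statistics

options = [1, 2, 3, 4, 5]   # rubric score options

def cyclic_permutations(opts):
    # A cyclic Latin square: option i appears at every position exactly once.
    return [opts[k:] + opts[:k] for k in range(len(opts))]

def judge(text: str, ordering: list[int]) -> int:
    # Toy judge with position bias: it leans toward the first listed option.
    true_score = 4
    return ordering[0] if abs(ordering[0] - true_score) <= 1 else true_score

permuted = cyclic_permutations(options)
scores = [judge("some response", order) for order in permuted]
print(scores)                    # position bias shows up as disagreement
print(statistics.mean(scores))   # aggregate over balanced permutations
```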
[NLP-34] Towards AI Evaluation in Domain-Specific RAG Systems: The AgriHubi Case Study
Quick Read: This paper addresses three obstacles to applying large language models (LLMs) in agriculture: weak factual grounding, English-centric training data, and the lack of high-quality real-world evaluation in low-resource languages such as Finnish. The proposed solution is AgriHubi, a domain-adapted retrieval-augmented generation (RAG) system for Finnish-language agricultural decision support. Its key design combines Finnish agricultural documents with the open-source PORO family of models, explicit source grounding, and user feedback for iterative refinement, improving answer completeness, linguistic accuracy, and perceived reliability while surfacing practical trade-offs between response quality and latency.
Link: https://arxiv.org/abs/2602.02208
Authors: Md. Toufique Hasan,Ayman Asad Khan,Mika Saari,Vaishnavi Bankhele,Pekka Abrahamsson
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Software Engineering (cs.SE)
Comments: 6 pages, 2 figures, submitted to MIPRO 2026
Abstract:Large language models show promise for knowledge-intensive domains, yet their use in agriculture is constrained by weak grounding, English-centric training data, and limited real-world evaluation. These issues are amplified for low-resource languages, where high-quality domain documentation exists but remains difficult to access through general-purpose models. This paper presents AgriHubi, a domain-adapted retrieval-augmented generation (RAG) system for Finnish-language agricultural decision support. AgriHubi integrates Finnish agricultural documents with open PORO family models and combines explicit source grounding with user feedback to support iterative refinement. Developed over eight iterations and evaluated through two user studies, the system shows clear gains in answer completeness, linguistic accuracy, and perceived reliability. The results also reveal practical trade-offs between response quality and latency when deploying larger models. This study provides empirical guidance for designing and evaluating domain-specific RAG systems in low-resource language settings.
[NLP-35] Sinhala Physical Common Sense Reasoning Dataset for Global PIQA
Quick Read: This paper addresses the scarcity of data for physical common-sense reasoning in the low-resource language Sinhala, with the aim of improving the fairness and generalization of multilingual models. Its key contribution is the first Sinhala physical common-sense reasoning dataset, built as part of Global PIQA: 110 human-created and verified samples, each consisting of a prompt, a correct answer, and a wrong answer, with most questions grounded in the Sri Lankan context, providing a high-quality, culturally relevant benchmark for generative AI in Sinhala.
Link: https://arxiv.org/abs/2602.02207
Authors: Nisansa de Silva,Surangika Ranathunga
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:This paper presents the first-ever Sinhala physical common sense reasoning dataset created as part of Global PIQA. It contains 110 human-created and verified data samples, where each sample consists of a prompt, the corresponding correct answer, and a wrong answer. Most of the questions refer to the Sri Lankan context, where Sinhala is an official language.
[NLP-36] More Than a Quick Glance: Overcoming the Greedy Bias in KV-Cache Compression
Quick Read: This paper addresses the deployment bottleneck caused by Key-Value (KV) cache memory growing linearly with context length in long-context LLM inference. Existing compression strategies relieve memory pressure but usually sacrifice semantic recall, and they make it hard to separate the effect of compression from sliding-window artifacts. The key idea, LASER-KV, replaces the fixed summary size with a block-wise accumulation strategy governed by a protection divisor (n), enabling more accurate KV-cache selection under a strict accumulative budget and isolating compression effects from sliding-window effects. On the Babilong benchmark it outperforms prior compression methods by up to 10%, and the results challenge the assumption that attention scores alone are a sufficient proxy for token utility.
Link: https://arxiv.org/abs/2602.02199
Authors: Aryan Sood,Tanvi Sharma,Vansh Agrawal
Affiliations: Indian Institute of Technology, Roorkee
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:While Large Language Models (LLMs) can theoretically support extensive context windows, their actual deployment is constrained by the linear growth of Key-Value (KV) cache memory. Prevailing compression strategies mitigate this through various pruning mechanisms, yet trade off semantic recall for memory efficiency. In this work, we present LASER-KV (Layer Accumulated Selection with Exact-LSH Recall), a framework designed to test the limits of KV compression under a strict accumulative budgeting policy. We deviate from the standard fixed summary size approach by implementing a block-wise accumulation strategy governed by a protection divisor (n). This allows us to isolate the effects of compression from sliding window artifacts. Our experiments on the Babilong benchmark reveal 15-30% performance degradation for previous compression methods on various long-context tasks. LASER-KV maintains stable performance, achieving superior accuracy by a margin of up to 10% at 128k. These findings challenge the prevailing assumption that attention scores alone are a sufficient proxy for token utility.
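As a rough illustration of block-wise selection under an accumulative budget, the sketch below always protects the most recent 1/n of KV blocks (the role we read into the protection divisor) and fills the rest of the budget with the highest-scoring blocks. Scoring blocks with a single number and omitting the Exact-LSH recall step are our simplifications.
```python
import numpy as np

rng = np.random.default_rng(1)

def select_blocks(block_scores: np.ndarray, budget: int, n: int) -> np.ndarray:
    num_blocks = len(block_scores)
    # Protect the most recent 1/n of blocks from pruning.
    protected = np.arange(num_blocks - max(1, num_blocks // n), num_blocks)
    remaining = budget - len(protected)
    candidates = np.setdiff1d(np.arange(num_blocks), protected)
    # Keep the highest-scoring unprotected blocks within the leftover budget.
    order = np.argsort(block_scores[candidates])[::-1]
    kept = candidates[order[:max(remaining, 0)]]
    return np.sort(np.concatenate([kept, protected]))

scores = rng.random(16)                        # one importance score per KV block
print(select_blocks(scores, budget=8, n=4))    # 4 protected + 4 top-scoring blocks
```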
[NLP-37] Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models
Quick Read: This paper targets the difficulty of evaluating multimodal large language models (MLLMs) on Vision-DeepResearch tasks, where existing benchmarks have two major limitations: they are not visual-search-centric, since answers can often be inferred from textual cues or the model's prior knowledge rather than genuine visual search; and their evaluation scenarios are overly idealized, with image retrieval reduced to near-exact matching and text retrieval too direct to be challenging. To address this, the authors build VDR-Bench, 2,000 carefully constructed VQA instances produced through a multi-stage curation pipeline with expert review to ensure realism and difficulty. They also propose a simple multi-round cropped-search workflow that markedly improves model performance in realistic visual retrieval scenarios, offering practical guidance for future multimodal deep-research systems.
Link: https://arxiv.org/abs/2602.02185
Authors: Yu Zeng,Wenxuan Huang,Zhen Fang,Shuang Chen,Yufan Shen,Yishuo Cai,Xiaoman Wang,Zhenfei Yin,Lin Chen,Zehui Chen,Shiting Huang,Yiming Zhao,Yao Hu,Philip Torr,Wanli Ouyang,Shaosheng Cao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Multimodal Large Language Models (MLLMs) have advanced VQA and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding. However, evaluating these visual and textual search abilities is still difficult, and existing benchmarks have two major limitations. First, existing benchmarks are not visual search-centric: answers that should require visual search are often leaked through cross-textual cues in the text questions or can be inferred from the prior world knowledge in current MLLMs. Second, evaluation scenarios are overly idealized: on the image-search side, the required information can often be obtained via near-exact matching against the full image, while the text-search side is overly direct and insufficiently challenging. To address these issues, we construct the Vision-DeepResearch benchmark (VDR-Bench) comprising 2,000 VQA instances. All questions are created via a careful, multi-stage curation pipeline and rigorous expert review, designed to assess the behavior of Vision-DeepResearch systems under realistic real-world conditions. Moreover, to address the insufficient visual retrieval capabilities of current MLLMs, we propose a simple multi-round cropped-search workflow. This strategy is shown to effectively improve model performance in realistic visual retrieval scenarios. Overall, our results provide practical guidance for the design of future multimodal deep-research systems. The code will be released at this https URL.
[NLP-38] Evaluating Metalinguistic Knowledge in Large Language Models across the World's Languages
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在语言结构知识(linguistic structure knowledge)方面理解不足的问题,尤其是其元语言能力(metalinguistic knowledge)——即对语言结构进行显式推理的能力——尚未得到系统评估。现有语言基准测试通常局限于特定现象、高资源语言,并忽视了对语法层面细粒度区分的考察。论文的关键解决方案是构建一个涵盖多种语言和语言学领域的元语言知识评测基准,通过准确率(accuracy)和宏F1(macro F1)指标,结合多数类基线和随机基线,量化分析不同模型在跨语言语境下的表现差异。结果表明,当前LLMs的元语言知识呈现碎片化特征,性能高度依赖于数据可用性而非普遍的语法掌握能力,且低资源语言表现显著低于高数字存在度的语言。这一发现强调了未来LLMs发展需重视全球语言多样性与数据公平性。
链接: https://arxiv.org/abs/2602.02182
作者: Tjaša Arčon(1),Matej Klemen(1),Marko Robnik-Šikonja(1),Kaja Dobrovoljc(1, 2, 3) ((1) University of Ljubljana, Faculty of Computer and Information Science, Slovenia (2) University of Ljubljana, Faculty of Arts, Slovenia, (3) Jožef Stefan Institute, Ljubljana, Slovenia)
机构: University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia; University of Ljubljana, Faculty of Arts, Ljubljana, Slovenia; Jožef Stefan Institute, Ljubljana, Slovenia
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) are routinely evaluated on language use tasks, yet their knowledge of linguistic structure remains poorly understood. Existing linguistic benchmarks typically focus on narrow phenomena, emphasize high-resource languages, and rarely evaluate metalinguistic knowledge, that is, explicit reasoning about language structure rather than language use. To address this, we construct a benchmark of metalinguistic knowledge spanning many languages and linguistic domains. Using accuracy and macro F1, together with majority-class and chance baselines, we analyse overall performance and examine variation by linguistic domains and language-related factors. Our results show that metalinguistic knowledge in current LLMs is limited: GPT-4o performs best but achieves only moderate accuracy (0.367), while open-source models lag behind. All models perform above chance but fail to outperform the majority-class baseline, suggesting they capture cross-linguistic patterns but lack fine-grained grammatical distinctions. Performance varies across linguistic domains, with lexical features showing the highest accuracy and phonological features among the lowest, partially reflecting differences in online visibility. At the language level, accuracy shows a strong association with digital language status: languages with higher digital presence and resource availability are evaluated more accurately, while low-resource languages show substantially lower performance. Analyses of predictive factors confirm that resource-related indicators (Wikipedia size, corpus availability) are more informative predictors of accuracy than geographical, genealogical, or sociolinguistic factors. Together, these results suggest that LLMs' metalinguistic knowledge is fragmented and shaped by data availability rather than generalizable grammatical competence across the world's languages. We release our benchmark as an open-source dataset to support systematic evaluation and encourage greater global linguistic diversity in future LLMs.
[NLP-39] AR-MAP: Are Autoregressive Large Language Models Implicit Teachers for Diffusion Large Language Models?
Quick Read: This paper addresses the difficulty of preference alignment for diffusion large language models (DLLMs), where likelihood estimates based on the Evidence Lower Bound (ELBO) introduce high variance into training. The key idea of the proposed AR-MAP framework is to use already preference-aligned autoregressive LLMs (AR-LLMs) as implicit teachers: through simple weight scaling, a DLLM can absorb the alignment knowledge embedded in an AR-LLM. The method exploits the architectural structure shared by the two paradigms and sidesteps the high variance and computational overhead of aligning DLLMs directly, achieving competitive or superior performance across diverse preference-alignment tasks.
Link: https://arxiv.org/abs/2602.02178
Authors: Liang Lin,Feng Xiong,Zengbin Wang,Kun Wang,Junhao Dong,Xuecai Hu,Yong Wang,Xiangxiang Chu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Diffusion Large Language Models (DLLMs) have emerged as a powerful alternative to autoregressive models, enabling parallel token generation across multiple positions. However, preference alignment of DLLMs remains challenging due to high variance introduced by Evidence Lower Bound (ELBO)-based likelihood estimation. In this work, we propose AR-MAP, a novel transfer learning framework that leverages preference-aligned autoregressive LLMs (AR-LLMs) as implicit teachers for DLLM alignment. We reveal that DLLMs can effectively absorb alignment knowledge from AR-LLMs through simple weight scaling, exploiting the shared architectural structure between these divergent generation paradigms. Crucially, our approach circumvents the high variance and computational overhead of direct DLLM alignment and comprehensive experiments across diverse preference alignment tasks demonstrate that AR-MAP achieves competitive or superior performance compared to existing DLLM-specific alignment methods, achieving 69.08% average score across all tasks and models. Our Code is available at this https URL.
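The abstract describes the transfer only as "simple weight scaling" over the shared architecture; one plausible reading, sketched below with toy tensors, is task-vector-style weight arithmetic: add the scaled delta between the aligned and base AR weights to the DLLM's matching modules. Treat this as our interpretation, not the paper's verified recipe.
```python
import torch

def transfer_alignment(dllm_state: dict, ar_base: dict, ar_aligned: dict,
                       scale: float = 1.0) -> dict:
    """Patch DLLM weights with the scaled AR alignment delta (assumption)."""
    out = {}
    for name, w in dllm_state.items():
        if name in ar_base and ar_base[name].shape == w.shape:
            # Alignment "direction" learned by the AR teacher on this module.
            delta = ar_aligned[name] - ar_base[name]
            out[name] = w + scale * delta
        else:
            out[name] = w.clone()   # paradigm-specific modules pass through
    return out

# Toy tensors standing in for one shared projection matrix.
shape = (8, 8)
ar_base = {"q_proj.weight": torch.randn(shape)}
ar_aligned = {"q_proj.weight": ar_base["q_proj.weight"] + 0.1 * torch.randn(shape)}
dllm = {"q_proj.weight": torch.randn(shape)}
patched = transfer_alignment(dllm, ar_base, ar_aligned, scale=0.5)
print(float(torch.norm(patched["q_proj.weight"] - dllm["q_proj.weight"])))
```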
[NLP-40] D-CORE: Incentivizing Task Decomposition in Large Reasoning Models for Complex Tool Use
Quick Read: This paper addresses the lack of sub-task decomposition in current large reasoning models (LRMs) for complex tool-use scenarios, which leads to what the authors call Lazy Reasoning. The key solution is D-CORE (Decomposing tasks and Composing Reasoning processes), a two-stage training framework: the first stage incentivizes task-decomposition reasoning via self-distillation, and the second stage restores reflective reasoning via diversity-aware reinforcement learning (RL), yielding substantial tool-use improvements across diverse benchmarks and model scales.
Link: https://arxiv.org/abs/2602.02160
Authors: Bowen Xu,Shaoyu Wu,Hao Jiang,Kai Liu,Xin Chen,Lulu Hu,Bin Yang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Effective tool use and reasoning are essential capabilities for large reasoning models (LRMs) to address complex real-world problems. Through empirical analysis, we identify that current LRMs lack the capability of sub-task decomposition in complex tool use scenarios, leading to Lazy Reasoning. To address this, we propose a two-stage training framework, D-CORE (Decomposing tasks and Composing Reasoning processes), that first incentivizes the LRMs' task decomposition reasoning capability via self-distillation, followed by diversity-aware reinforcement learning (RL) to restore LRMs' reflective reasoning capability. D-CORE achieves robust tool-use improvements across diverse benchmarks and model scales. Experiments on BFCLv3 demonstrate the superiority of our method: D-CORE-8B reaches 77.7% accuracy, surpassing the best-performing 8B model by 5.7%. Meanwhile, D-CORE-14B establishes a new state-of-the-art at 79.3%, outperforming 70B models despite being 5× smaller. The source code is available at this https URL.
[NLP-41] Focus-dLLM: Accelerating Long-Context Diffusion LLM Inference via Confidence-Guided Context Focusing
Quick Read: This paper targets the inference inefficiency of diffusion large language models (dLLMs) on long contexts caused by bidirectional full attention. Existing sparse-attention methods fall short because it is hard to estimate the importance of tokens yet to be decoded while the unmasked positions are unknown during diffusion. The key solution, the training-free sparsification framework Focus-dLLM, first exploits the observation that token confidence correlates strongly across adjacent steps to build a past-confidence-guided indicator that predicts unmasked regions; it then applies a sink-aware pruning strategy that accurately estimates and removes redundant attention computation while preserving highly influential attention sinks, and further reuses identified sink locations across layers to cut overhead. Experiments show more than a 29× lossless speedup at 32K context length.
Link: https://arxiv.org/abs/2602.02159
Authors: Lingkun Long,Yushi Huang,Shihao Bai,Ruihao Gong,Jun Zhang,Ao Zhou,Jianlei Yang
Affiliations: Beihang University; Hong Kong University of Science and Technology; SenseTime Research
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Diffusion Large Language Models (dLLMs) deliver strong long-context processing capability in a non-autoregressive decoding paradigm. However, the considerable computational cost of bidirectional full attention limits the inference efficiency. Although sparse attention is promising, existing methods remain ineffective. This stems from the need to estimate attention importance for tokens yet to be decoded, while the unmasked token positions are unknown during diffusion. In this paper, we present Focus-dLLM, a novel training-free attention sparsification framework tailored for accurate and efficient long-context dLLM inference. Based on the finding that token confidence strongly correlates across adjacent steps, we first design a past confidence-guided indicator to predict unmasked regions. Built upon this, we propose a sink-aware pruning strategy to accurately estimate and remove redundant attention computation, while preserving highly influential attention sinks. To further reduce overhead, this strategy reuses identified sink locations across layers, leveraging the observed cross-layer consistency. Experimental results show that our method offers more than 29× lossless speedup under 32K context length. The code is publicly available at: this https URL
[NLP-42] Revisiting Adaptive Rounding with Vectorized Reparameterization for LLM Quantization
Quick Read: This paper addresses the high compute and memory cost of adaptive rounding for post-training quantization (PTQ), whose dense, element-wise rounding matrices are impractical for billion-parameter large language models (LLMs). The key idea, VQRound, reparameterizes the rounding matrix into a compact codebook, cutting trainable parameters to as little as 0.2% while minimizing the element-wise worst-case error under the L∞ norm, which better handles the heavy-tailed weight distributions in LLMs. The authors also show that rounding initialization is decisive and design a lightweight end-to-end finetuning pipeline that optimizes per-layer codebooks with only 128 samples, achieving fast convergence and efficient deployment.
Link: https://arxiv.org/abs/2602.02151
Authors: Yuli Zhou,Qingxuan Chen,Luca Benini,Guolei Sun,Yawei Li
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 17 pages, 6 figures, 14 tables
Abstract:Adaptive Rounding has emerged as an alternative to round-to-nearest (RTN) for post-training quantization by enabling cross-element error cancellation. Yet, dense and element-wise rounding matrices are prohibitively expensive for billion-parameter large language models (LLMs). We revisit adaptive rounding from an efficiency perspective and propose VQRound, a parameter-efficient optimization framework that reparameterizes the rounding matrix into a compact codebook. Unlike low-rank alternatives, VQRound minimizes the element-wise worst-case error under the L∞ norm, which is critical for handling heavy-tailed weight distributions in LLMs. Beyond reparameterization, we identify rounding initialization as a decisive factor and develop a lightweight end-to-end finetuning pipeline that optimizes codebooks across all layers using only 128 samples. Extensive experiments on OPT, LLaMA, LLaMA2, and Qwen3 models demonstrate that VQRound achieves better convergence than traditional adaptive rounding at the same number of steps while using as little as 0.2% of the trainable parameters. Our results show that adaptive rounding can be made both scalable and fast-fitting. The code is available at this https URL.
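A toy PyTorch sketch of the codebook idea: every weight draws its rounding offset from a small trainable codebook through a fixed index map, so the trainable parameter count is K instead of numel(W). The random index assignment and the high-order-norm surrogate for the L∞ objective are simplifications of our own.
```python
import torch

torch.manual_seed(0)
w = torch.randn(64, 64)
delta = w.abs().max() / 7                        # 4-bit symmetric grid scale
floor_w = torch.floor(w / delta)

K = 16                                           # codebook size << w.numel()
codebook = torch.nn.Parameter(torch.full((K,), -3.0))   # start near floor-rounding
idx = torch.randint(0, K, w.shape)               # fixed weight -> entry map

def abs_err():
    offsets = torch.sigmoid(codebook)[idx]       # per-weight offsets in (0, 1)
    return (delta * (floor_w + offsets) - w).abs()

print("before:", float(abs_err().max()))
opt = torch.optim.Adam([codebook], lr=0.1)
for _ in range(300):
    loss = (abs_err() ** 8).mean()               # smooth surrogate for L-infinity
    opt.zero_grad(); loss.backward(); opt.step()
print("after: ", float(abs_err().max()), "grid step:", float(delta))
```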
[NLP-43] Learning Generative Selection for Best-of-N
Quick Read: This paper addresses the bottleneck that Best-of-N selection quality imposes on small language models when scaling test-time compute via parallel sampling. Generative selection methods such as GenSelect relieve this bottleneck, but strong selection performance has so far been limited to large models. The key solution is to give small reasoning models (e.g., 1.7B parameters) strong generative selection capability through targeted reinforcement learning: selection tasks are synthesized from large-scale math and code instruction datasets by filtering to instances that contain both correct and incorrect candidate solutions, and models are trained with DAPO to reward correct selections. The approach consistently beats prompting and majority-voting baselines across math and code reasoning benchmarks and generalizes to selecting outputs from stronger models, establishing reinforcement learning as an effective route to generative selection in small models.
Link: https://arxiv.org/abs/2602.02143
Authors: Shubham Toshniwal,Aleksander Ficek,Siddhartha Jain,Wei Du,Vahid Noroozi,Sadegh Mahdavi,Somshubra Majumdar,Igor Gitman
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Scaling test-time compute via parallel sampling can substantially improve LLM reasoning, but is often limited by Best-of-N selection quality. Generative selection methods, such as GenSelect, address this bottleneck, yet strong selection performance remains largely limited to large models. We show that small reasoning models can acquire strong GenSelect capabilities through targeted reinforcement learning. To this end, we synthesize selection tasks from large-scale math and code instruction datasets by filtering to instances with both correct and incorrect candidate solutions, and train 1.7B-parameter models with DAPO to reward correct selections. Across math (AIME24, AIME25, HMMT25) and code (LiveCodeBench) reasoning benchmarks, our models consistently outperform prompting and majority-voting baselines, often approaching or exceeding much larger models. Moreover, these gains generalize to selecting outputs from stronger models despite training only on outputs from weaker models. Overall, our results establish reinforcement learning as a scalable way to unlock strong generative selection in small models, enabling efficient test-time scaling.
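The data-synthesis step is straightforward to sketch: keep only problems whose sampled candidates include both correct and incorrect solutions (otherwise selection is trivial), and format them as numbered-candidate selection prompts. The prompt template and record fields below are illustrative; during RL the reward would be 1 when the chosen index is a gold index.
```python
def make_selection_tasks(problems):
    tasks = []
    for p in problems:
        labels = [c["correct"] for c in p["candidates"]]
        if not (any(labels) and not all(labels)):
            continue   # skip: all-correct or all-incorrect makes selection trivial
        numbered = "\n\n".join(f"Solution {i}:\n{c['text']}"
                               for i, c in enumerate(p["candidates"]))
        tasks.append({
            "prompt": f"Problem:\n{p['question']}\n\n{numbered}\n\n"
                      "Pick the index of the best solution.",
            "gold_indices": [i for i, ok in enumerate(labels) if ok],
        })
    return tasks

problems = [{
    "question": "What is 2 + 2?",
    "candidates": [{"text": "2 + 2 = 5", "correct": False},
                   {"text": "2 + 2 = 4", "correct": True}],
}]
print(make_selection_tasks(problems)[0]["gold_indices"])   # [1]
```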
[NLP-44] Quantifying the Gap between Understanding and Generation within Unified Multimodal Models
Quick Read: This paper asks whether the understanding and generation capabilities of unified multimodal models (UMMs) are genuinely aligned, or whether current models achieve only surface-level integration rather than deep cognitive convergence. The key solution is GapEval, a bidirectional benchmark whose symmetric questions can be answered in both modalities (image and text), allowing it to quantify the gap between understanding and generation and to measure cross-modal consistency, thereby exposing the limits of knowledge integration in existing UMMs.
Link: https://arxiv.org/abs/2602.02140
Authors: Chenlong Wang,Yuhang Chen,Zhihan Hu,Dongping Chen,Wenhu Chen,Sarah Wiegreffe,Tianyi Zhou
Affiliations: University of Maryland; University of Waterloo; MBZUAI
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Recent advances in unified multimodal models (UMM) have demonstrated remarkable progress in both understanding and generation tasks. However, whether these two capabilities are genuinely aligned and integrated within a single model remains unclear. To investigate this question, we introduce GapEval, a bidirectional benchmark designed to quantify the gap between understanding and generation capabilities, and quantitatively measure the cognitive coherence of the two “unified” directions. Each question can be answered in both modalities (image and text), enabling a symmetric evaluation of a model’s bidirectional inference capability and cross-modal consistency. Experiments reveal a persistent gap between the two directions across a wide range of UMMs with different architectures, suggesting that current models achieve only surface-level unification rather than deep cognitive convergence of the two. To further explore the underlying mechanism, we conduct an empirical study from the perspective of knowledge manipulation to illustrate the underlying limitations. Our findings indicate that knowledge within UMMs often remains disjoint. The capability emergence and knowledge across modalities are unsynchronized, paving the way for further exploration.
[NLP-45] EvoMU: Evolutionary Machine Unlearning
Quick Read: This paper tackles the problem of automatically finding optimal unlearning loss functions in machine unlearning: the loss-function space is vast and no universally optimal loss exists, since differences in the structure and overlap of forget and retain data make hand-designed losses both inefficient to find and limited in effect. The key solution, EvoMU, uses evolutionary search to discover task-specific unlearning losses in this vast space, synthesizing novel losses that match or outperform existing ones without a human in the loop, unlearning sensitive or copyrighted data while preserving overall utility. Notably, it achieves state-of-the-art results on a budget, using only a 4B-parameter model, illustrating the potential of AI co-scientists for scientific discovery under limited compute.
Link: https://arxiv.org/abs/2602.02139
Authors: Pawel Batorski,Paul Swoboda
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Machine unlearning aims to unlearn specified training data (e.g. sensitive or copyrighted material). A prominent approach is to fine-tune an existing model with an unlearning loss that retains overall utility. The space of suitable unlearning loss functions is vast, making the search for an optimal loss function daunting. Additionally, there might not even exist a universally optimal loss function: differences in the structure and overlap of the forget and retain data can cause a loss to work well in one setting but over-unlearn or under-unlearn in another. Our approach EvoMU tackles these two challenges simultaneously. An evolutionary search procedure automatically finds task-specific losses in the vast space of possible unlearning loss functions. This allows us to find dataset-specific losses that match or outperform existing losses from the literature, without the need for a human-in-the-loop. This work is therefore an instance of automatic scientific discovery, a.k.a. an AI co-scientist. In contrast to previous AI co-scientist works, we do so on a budget: We achieve SotA results using a small 4B parameter model (Qwen3-4B-Thinking), showing the potential of AI co-scientists with limited computational resources. Our experimental evaluation shows that we surpass previous loss-based unlearning formulations on TOFU-5%, TOFU-10%, MUSE and WMDP by synthesizing novel unlearning losses. Our code is available at this https URL.
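A toy rendition of the evolutionary loop: candidates are loss expressions, mutation extends them, and a fitness function selects survivors. In EvoMU the proposer is presumably the LLM and fitness comes from actually running the unlearning pipeline; both are faked below, so this shows only the search skeleton.
```python
import random

random.seed(0)

# Candidate losses as expression strings over forget/retain log-probs.
PRIMITIVES = ["-logp_f", "1/(1 + logp_f)", "logp_r", "-(logp_f - logp_r)"]

def mutate(expr: str) -> str:
    # Real system: an LLM proposes an edited loss; here: append a primitive.
    return expr + " + " + random.choice(PRIMITIVES)

def fitness(expr: str) -> float:
    # Stand-in for forget-quality / retained-utility evaluation of a trained
    # model; this toy score is arbitrary.
    return -abs(len(expr) - 30) + expr.count("logp_r")

population = random.sample(PRIMITIVES, 2)
for generation in range(5):
    children = [mutate(random.choice(population)) for _ in range(4)]
    population = sorted(population + children, key=fitness, reverse=True)[:3]

print(population[0])   # best loss expression found by the toy search
```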
[NLP-46] Understanding the Reversal Curse Mitigation in Masked Diffusion Models through Attention and Training Dynamics
Quick Read: This paper investigates the reversal curse in autoregressive language models (ARMs): after learning "A is B", they often fail on the reverse query "B is A". Masked diffusion-based language models (MDMs) suffer much less from this failure, but the reason was previously unclear. The authors argue that the common explanation, that the any-order training objective mitigates the curse, is insufficient: merely observing "[MASK] is B" during training does not guarantee the model can handle the reverse prompt "B is [MASK]". Instead, the key mechanism lies in the interaction between architecture and training: in a one-layer Transformer encoder, weight sharing makes forward and reverse attention scores positively correlated, and the corresponding gradients are aligned, so minimizing the forward loss also reduces the reverse loss. Experiments on controlled toy tasks and large-scale diffusion language models support this mechanism, explaining why MDMs partially overcome a failure mode that persists in ARMs.
Link: https://arxiv.org/abs/2602.02133
Authors: Sangwoo Shin,BumJun Kim,Kyelim Lee,Moongyu Jeon,Albert No
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Autoregressive language models (ARMs) suffer from the reversal curse: after learning that "A is B", they often fail on the reverse query "B is A". Masked diffusion-based language models (MDMs) exhibit this failure in a much weaker form, but the underlying reason has remained unclear. A common explanation attributes this mitigation to the any-order training objective. However, observing "[MASK] is B" during training does not necessarily teach the model to handle the reverse prompt "B is [MASK]". We show that the mitigation arises from architectural structure and its interaction with training. In a one-layer Transformer encoder, weight sharing couples the two directions by making forward and reverse attention scores positively correlated. In the same setting, we further show that the corresponding gradients are aligned, so minimizing the forward loss also reduces the reverse loss. Experiments on both controlled toy tasks and large-scale diffusion language models support these mechanisms, explaining why MDMs partially overcome a failure mode that persists in strong ARMs.
[NLP-47] There Is More to Refusal in Large Language Models than a Single Direction
Quick Read: This paper questions the view that refusal in large language models is mediated by a single activation-space direction that can be linearly steered and ablated. A systematic analysis of eleven categories of refusal and non-compliance (including safety restrictions, incomplete or unsupported requests, anthropomorphization, and over-refusal) shows that these behaviors correspond to geometrically distinct directions in activation space, indicating that refusal is more complex than a single direction. The key finding is that, despite this diversity, linear steering along any refusal-related direction produces nearly identical refusal versus over-refusal trade-offs: there is a shared one-dimensional control knob whose primary effect is not whether the model refuses but how it refuses.
Link: https://arxiv.org/abs/2602.02132
Authors: Faaiz Joad,Majd Hawasly,Sabri Boughorbel,Nadir Durrani,Husrev Taha Sencar
Affiliations: Qatar Computing Research Institute, HBKU, Doha, Qatar
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Prior work argues that refusal in large language models is mediated by a single activation-space direction, enabling effective steering and ablation. We show that this account is incomplete. Across eleven categories of refusal and non-compliance, including safety, incomplete or unsupported requests, anthropomorphization, and over-refusal, we find that these refusal behaviors correspond to geometrically distinct directions in activation space. Yet despite this diversity, linear steering along any refusal-related direction produces nearly identical refusal to over-refusal trade-offs, acting as a shared one-dimensional control knob. The primary effect of different directions is not whether the model refuses, but how it refuses.
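The linear steering and ablation operations referenced above are standard and easy to state; a minimal sketch (with a random direction standing in for one extracted from refusal-vs-compliance activations, and a sign convention of our choosing) is below.
```python
import torch

torch.manual_seed(0)
d_model = 512
direction = torch.randn(d_model)     # stand-in for an extracted refusal direction
d_hat = direction / direction.norm()

def steer(hidden: torch.Tensor, alpha: float) -> torch.Tensor:
    """hidden: (..., d_model). alpha > 0 pushes toward refusal-like behavior,
    alpha < 0 away from it (the sign convention is ours)."""
    return hidden + alpha * d_hat

def ablate(hidden: torch.Tensor) -> torch.Tensor:
    # Remove the component along the direction entirely.
    return hidden - (hidden @ d_hat).unsqueeze(-1) * d_hat

h = torch.randn(4, 16, d_model)      # toy batch of residual-stream activations
print(steer(h, 2.0).shape)
print(float((ablate(h) @ d_hat).abs().max()))   # ~0: component removed
```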
[NLP-48] Unifying Masked Diffusion Models with Various Generation Orders and Beyond
Quick Read: This paper addresses the limitation that the generation quality of masked diffusion models (MDMs) depends critically on the generation order, which prior work either hard-codes (e.g., blockwise left-to-right) or learns via costly two-stage optimization that can yield suboptimal solutions. The key solution is the order-expressive masked diffusion model (OeMDM), which unifies diffusion processes with various generation orders and thereby interprets MDMs, autoregressive models (ARMs), and block diffusion in a single framework. Building on it, the learnable-order masked diffusion model (LoMDM) jointly learns the generation order and the diffusion backbone from scratch through a single objective, enabling context-dependent generation orders; experiments show LoMDM outperforms existing discrete diffusion models on multiple language modeling benchmarks.
Link: https://arxiv.org/abs/2602.02112
Authors: Chunsan Hong,Sanghyun Lee,Jong Chul Ye
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint
Abstract:Masked diffusion models (MDMs) are a potential alternative to autoregressive models (ARMs) for language generation, but generation quality depends critically on the generation order. Prior work either hard-codes an ordering (e.g., blockwise left-to-right) or learns an ordering policy for a pretrained MDM, which incurs extra cost and can yield suboptimal solutions due to the two-stage optimization. Motivated by this, we propose order-expressive masked diffusion model (OeMDM) for a broad class of diffusion generative processes with various generation orders, enabling the interpretation of MDM, ARM, and block diffusion in a single framework. Furthermore, building on OeMDM, we introduce learnable-order masked diffusion model (LoMDM), which jointly learns the generation ordering and diffusion backbone through a single objective from scratch, enabling the diffusion model to generate text in context-dependent ordering. Empirically, we confirm that LoMDM outperforms various discrete diffusion models across multiple language modeling benchmarks.
[NLP-49] Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts
Quick Read: This paper addresses the prohibitive GPU memory overhead of long-context LLM training, where activation memory grows linearly with sequence length. The key solution, the memory-efficient training system OOMB, uses a chunk-recurrent training framework with on-the-fly activation recomputation to keep activation memory constant (O(1)), shifting the main bottleneck to the KV cache. OOMB then manages the KV cache through a suite of synergistic optimizations: a paged memory manager for the KV cache and its gradients to eliminate fragmentation, asynchronous CPU offloading to hide transfer latency, and page-level sparse attention to reduce compute and communication. The result is exceptional resource efficiency, allowing Qwen2.5-7B to be trained with a 4M-token context on a single H200 GPU, a feat that would otherwise require a multi-GPU cluster with context parallelism.
Link: https://arxiv.org/abs/2602.02108
Authors: Wenhao Li,Daohai Yu,Gen Luo,Yuxin Zhang,Fei Chao,Rongrong Ji,Yifan Wu,Jiaxin Liu,Ziyang Gong,Zimu Liao
Affiliations: Xiamen University; Peking University; University of Illinois Urbana-Champaign; Shanghai Jiao Tong University; Shanghai AI Laboratory
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Training Large Language Models (LLMs) on long contexts is severely constrained by prohibitive GPU memory overhead, not training time. The primary culprits are the activations, whose memory footprints scale linearly with sequence length. We introduce OOMB, a highly memory-efficient training system that directly confronts this barrier. Our approach employs a chunk-recurrent training framework with on-the-fly activation recomputation, which maintains a constant activation memory footprint (O(1)) and shifts the primary bottleneck to the growing KV cache. To manage the KV cache, OOMB integrates a suite of synergistic optimizations: a paged memory manager for both the KV cache and its gradients to eliminate fragmentation, asynchronous CPU offloading to hide data transfer latency, and page-level sparse attention to reduce both computational complexity and communication overhead. The synergy of these techniques yields exceptional efficiency. Our empirical results show that for every additional 10K tokens of context, the end-to-end training memory overhead increases by a mere 10MB for Qwen2.5-7B. This allows training Qwen2.5-7B with a 4M-token context on a single H200 GPU, a feat that would otherwise require a large cluster using context parallelism. This work represents a substantial advance in resource efficiency for long-context LLM training. The source code is available at this https URL.
[NLP-50] Dicta-LM 3.0: Advancing The Frontier of Hebrew Sovereign LLMs
Quick Read: This paper addresses the gap between high demand and low supply of large language models for low-resource languages such as Hebrew. The key solution is Dicta-LM 3.0, an open-weight collection of LLMs trained on substantial Hebrew and English corpora, released in three sizes (24B, 12B, and 1.7B) with both base and tool-calling chat variants, each with a native 65k-token context length. The authors also introduce a new benchmark suite for Hebrew chat-LLMs covering diverse tasks, including translation, summarization, Winograd reasoning, Israeli trivia, and diacritization (nikud), systematically validating the models and offering a reusable framework for adapting LLMs to other non-English languages.
Link: https://arxiv.org/abs/2602.02104
Authors: Shaltiel Shmidman,Avi Shmidman,Amir DN Cohen,Moshe Koppel
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Open-weight LLMs have been released by frontier labs; however, sovereign Large Language Models (for languages other than English) remain low in supply yet high in demand. Training large language models (LLMs) for low-resource languages such as Hebrew poses unique challenges. In this paper, we introduce Dicta-LM 3.0: an open-weight collection of LLMs trained on substantially-sized corpora of Hebrew and English texts. The model is released in three sizes: 24B - adapted from the Mistral-Small-3.1 base model, 12B - adapted from the NVIDIA Nemotron Nano V2 model, and 1.7B - adapted from the Qwen3-1.7B base model. We are releasing multiple variants of each model, each with a native context length of 65k tokens; base model and chat model with tool-calling support. To rigorously evaluate our models, we introduce a new benchmark suite for evaluation of Hebrew chat-LLMs, covering a diverse set of tasks including Translation, Summarization, Winograd, Israeli Trivia, and Diacritization (nikud). Our work not only addresses the intricacies of training LLMs in low-resource languages but also proposes a framework that can be leveraged for adapting other LLMs to various non-English languages, contributing to the broader field of multilingual NLP.
[NLP-51] No Global Plan in Chain-of-Thought: Uncover the Latent Planning Horizon of LLMs
Quick Read: This paper probes the unclear relationship between the internal states of large language models (LLMs) and their explicit Chain-of-Thought (CoT) traces, focusing on whether LLMs latently plan before emitting CoT and what that implies for uncertainty estimation and recognizing reasoning paths. The key solution is Tele-Lens, a probing method that measures the latent planning strength of hidden states across task domains. Empirically, LLMs exhibit a myopic horizon, mainly performing incremental state transitions rather than global planning; building on this, the authors hypothesize and validate that a small subset of CoT positions suffices to represent the uncertainty of the entire reasoning path, and show that CoT bypass can be recognized automatically without performance degradation.
Link: https://arxiv.org/abs/2602.02103
Authors: Liyan Xu,Mo Yu,Fandong Meng,Jie Zhou
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:This work stems from prior complementary observations on the dynamics of Chain-of-Thought (CoT): Large Language Models (LLMs) have been shown to latently plan subsequent reasoning before CoT emerges, which diminishes the significance of explicit CoT, whereas CoT remains critical for tasks requiring multi-step reasoning. To deepen the understanding of the relationship between LLMs' internal states and their verbalized reasoning trajectories, we investigate the latent planning strength of LLMs through our probing method, Tele-Lens, applied to hidden states across diverse task domains. Our empirical results indicate that LLMs exhibit a myopic horizon, primarily conducting incremental transitions without precise global planning. Leveraging this characteristic, we propose a hypothesis on enhancing uncertainty estimation of CoT, which we validate: a small subset of CoT positions can effectively represent the uncertainty of the entire path. We further underscore the significance of exploiting CoT dynamics, and demonstrate that automatic recognition of CoT bypass can be achieved without performance degradation. Our code, data and models are released at this https URL.
[NLP-52] Think Dense, Not Long: Dynamic Decoupled Conditional Advantage for Efficient Reasoning
Quick Read: This paper addresses the verbosity that Reinforcement Learning with Verifiable Rewards (RLVR) induces in multi-step reasoning: models produce overly long traces, and naive length penalties in group-relative optimization badly hurt accuracy because incorrect responses dilute the baseline. The key solution, Dynamic Decoupled Conditional Advantage (DDCA), has two mechanisms: it computes length advantages conditionally within the cluster of correct responses, eliminating the Dilution of Length Baseline, and it uses the group pass rate as a difficulty proxy to dynamically scale the penalty, mitigating the Difficulty-Penalty Mismatch. Together these preserve accuracy while substantially improving generation efficiency.
Link: https://arxiv.org/abs/2602.02099
Authors: Keqin Peng,Yuanxin Ouyang,Xuebo Liu,Zhiliang Tian,Ruijian Han,Yancheng Yuan,Liang Ding
Affiliations: Beihang University; Harbin Institute of Technology, Shenzhen; National University of Defense Technology; Hong Kong Polytechnic University; Alibaba Group
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) can elicit strong multi-step reasoning, yet it often encourages overly verbose traces. Moreover, naive length penalties in group-relative optimization can severely hurt accuracy. We attribute this failure to two structural issues: (i) Dilution of Length Baseline, where incorrect responses (with zero length reward) depress the group baseline and over-penalize correct solutions; and (ii) Difficulty-Penalty Mismatch, where a static penalty cannot adapt to problem difficulty, suppressing necessary reasoning on hard instances while leaving redundancy on easy ones. We propose Dynamic Decoupled Conditional Advantage (DDCA) to decouple efficiency optimization from correctness. DDCA computes length advantages conditionally within the correct-response cluster to eliminate baseline dilution, and dynamically scales the penalty strength using the group pass rate as a proxy for difficulty. Experiments on GSM8K, MATH500, AMC23, and AIME25 show that DDCA consistently improves the efficiency–accuracy trade-off relative to adaptive baselines, reducing generated tokens by approximately 60% on simpler tasks (e.g., GSM8K) versus over 20% on harder benchmarks (e.g., AIME25), thereby maintaining or improving accuracy. Code is available at this https URL.
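A rough numpy sketch of the two mechanisms: the length advantage is computed only within the correct cluster (so incorrect responses cannot dilute its baseline), and the group pass rate scales the penalty so easy groups are pressed harder on length. How the two advantage terms combine here is our assumption, not the paper's exact formula.
```python
import numpy as np

def ddca_advantages(rewards: np.ndarray, lengths: np.ndarray) -> np.ndarray:
    correct = rewards > 0
    task_adv = rewards - rewards.mean()            # group-relative baseline
    length_adv = np.zeros_like(lengths, dtype=float)
    if correct.sum() >= 2:
        # Conditional baseline: shorter-than-average *correct* answers gain.
        cl = lengths[correct].astype(float)
        length_adv[correct] = -(cl - cl.mean()) / (cl.std() + 1e-6)
    pass_rate = correct.mean()                     # difficulty proxy
    penalty_scale = pass_rate                      # easy groups -> stronger penalty
    return task_adv + penalty_scale * length_adv

rewards = np.array([1, 1, 1, 0])                   # 3 correct, 1 incorrect
lengths = np.array([200, 800, 500, 400])           # token counts per response
print(ddca_advantages(rewards, lengths))           # short correct answers win
```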
[NLP-53] LEC-KG: An LLM -Embedding Collaborative Framework for Domain-Specific Knowledge Graph Construction – A Case Study on SDGs
Quick Read: This paper addresses the challenges of building domain-specific knowledge graphs (KGs) from unstructured text, including heterogeneous entity mentions, long-tail relation distributions, and the absence of standardized schemas. The core of the solution, LEC-KG, is a bidirectional collaborative framework that integrates the semantic understanding of large language models (LLMs) with the structural reasoning of knowledge graph embeddings (KGE) so that the two enhance each other iteratively: KGE provides structure-aware feedback to refine the LLM's relation extractions, while validated triples progressively improve the KGE representations, mitigating long-tail relation bias and enabling structural validation of unseen entities.
Link: https://arxiv.org/abs/2602.02090
Authors: Yikai Zeng,Yingchao Piao,Jianhui Li
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Constructing domain-specific knowledge graphs from unstructured text remains challenging due to heterogeneous entity mentions, long-tail relation distributions, and the absence of standardized schemas. We present LEC-KG, a bidirectional collaborative framework that integrates the semantic understanding of Large Language Models (LLMs) with the structural reasoning of Knowledge Graph Embeddings (KGE). Our approach features three key components: (1) hierarchical coarse-to-fine relation extraction that mitigates long-tail bias, (2) evidence-guided Chain-of-Thought feedback that grounds structural suggestions in source text, and (3) semantic initialization that enables structural validation for unseen entities. The two modules enhance each other iteratively-KGE provides structure-aware feedback to refine LLM extractions, while validated triples progressively improve KGE representations. We evaluate LEC-KG on Chinese Sustainable Development Goal (SDG) reports, demonstrating substantial improvements over LLM baselines, particularly on low-frequency relations. Through iterative refinement, our framework reliably transforms unstructured policy text into validated knowledge graph triples.
[NLP-54] Closing the Loop: Universal Repository Representation with RPG-Encoder
Quick Read: This paper addresses the reasoning disconnect that fragmented representations cause in repository agents: existing methods rely on isolated API documentation or dependency graphs and lack the semantic depth for unified repository understanding and generation. The key solution, RPG-Encoder, generalizes the static Repository Planning Graph (RPG) into a unified, high-fidelity representation through three mechanisms: (1) encoding raw code into an RPG that combines lifted semantic features with code dependencies; (2) evolving the topology incrementally so maintenance cost is decoupled from repository scale, cutting overhead by 95.7%; and (3) serving as a unified interface for structure-aware navigation. This closes the reasoning loop from intent to implementation, delivering clear gains on benchmarks such as SWE-bench Verified and RepoCraft and confirming fine-grained localization and high-fidelity reconstruction in complex codebases.
Link: https://arxiv.org/abs/2602.02084
Authors: Jane Luo,Chengyu Yin,Xin Zhang,Qingtao Li,Steven Liu,Yiming Huang,Jie Wu,Hao Liu,Yangyu Huang,Yu Kang,Fangkai Yang,Ying Xin,Scarlett Li
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments:
Abstract:Current repository agents encounter a reasoning disconnect due to fragmented representations, as existing methods rely on isolated API documentation or dependency graphs that lack semantic depth. We consider repository comprehension and generation to be inverse processes within a unified cycle: generation expands intent into implementation, while comprehension compresses implementation back into intent. To address this, we propose RPG-Encoder, a framework that generalizes the Repository Planning Graph (RPG) from a static generative blueprint into a unified, high-fidelity representation. RPG-Encoder closes the reasoning loop through three mechanisms: (1) Encoding raw code into the RPG that combines lifted semantic features with code dependencies; (2) Evolving the topology incrementally to decouple maintenance costs from repository scale, reducing overhead by 95.7%; and (3) Operating as a unified interface for structure-aware navigation. In evaluations, RPG-Encoder establishes state-of-the-art repository understanding on SWE-bench Verified with 93.7% Acc@5 and exceeds the best baseline by over 10% on SWE-bench Live Lite. These results highlight our superior fine-grained localization accuracy in complex codebases. Furthermore, it achieves 98.5% reconstruction coverage on RepoCraft, confirming RPG’s high-fidelity capacity to mirror the original codebase and closing the loop between intent and implementation.
[NLP-55] WildGraphBench: Benchmarking GraphRAG with Wild-Source Corpora
Quick Read: This paper addresses the inadequacy of existing GraphRAG evaluation benchmarks for realistic settings: most rely on short, curated passages as external knowledge and thus fail to test systems on long contexts and large-scale heterogeneous documents. The key solution is WildGraphBench, a benchmark built on Wikipedia's structure, exploiting the link between its coherent narratives and their many heterogeneous references to simulate real-world retrieval and generation: articles from 12 top-level topics and their cited references form the retrieval corpus, citation-linked statements serve as ground truth, and 1,100 questions span three complexity levels (single-fact QA, multi-fact QA, and section-level summarization), enabling a more complete assessment of GraphRAG in practice.
Link: https://arxiv.org/abs/2602.02053
Authors: Pengyu Wang,Benfeng Xu,Licheng Zhang,Shaohan Wang,Mingxuan Du,Chiwei Zhu,Zhendong Mao
Affiliations: University of Science and Technology of China; Metastone Technology
Subjects: Computation and Language (cs.CL)
Comments: this https URL
Abstract:Graph-based Retrieval-Augmented Generation (GraphRAG) organizes external knowledge as a hierarchical graph, enabling efficient retrieval and aggregation of scattered evidence across multiple documents. However, many existing benchmarks for GraphRAG rely on short, curated passages as external knowledge, failing to adequately evaluate systems in realistic settings involving long contexts and large-scale heterogeneous documents. To bridge this gap, we introduce WildGraphBench, a benchmark designed to assess GraphRAG performance in the wild. We leverage Wikipedia's unique structure, where cohesive narratives are grounded in long and heterogeneous external reference documents, to construct a benchmark reflecting real-world scenarios. Specifically, we sample articles across 12 top-level topics, using their external references as the retrieval corpus and citation-linked statements as ground truth, resulting in 1,100 questions spanning three levels of complexity: single-fact QA, multi-fact QA, and section-level summarization. Experiments across multiple baselines reveal that current GraphRAG pipelines help on multi-fact aggregation when evidence comes from a moderate number of sources, but this aggregation paradigm may overemphasize high-level statements at the expense of fine-grained details, leading to weaker performance on summarization tasks. Project page: this https URL.
[NLP-56] Dissecting Outlier Dynamics in LLM NVFP4 Pretraining
Quick Read: This paper addresses the quantization error that the limited dynamic range of 4-bit floating point (FP4) introduces when training large language models, in particular the loss gap that persists during NVFP4 pretraining. The core challenge is sensitivity to outliers, which strongly affect convergence and performance. The key contributions are a longitudinal analysis showing that outliers originate in specific architectural components (Softmax, gating, SwiGLU) and evolve from transient early-training spikes into persistent late-stage "hot channels", and, building on this, Hot-Channel Patch (HCP), an online compensation mechanism that identifies hot channels and reinjects residuals with hardware-efficient kernels. Combined with protection of post-QK operations in the CHON training recipe, it narrows the loss gap to BF16 from 0.94% to 0.58% while maintaining downstream accuracy.
Link: https://arxiv.org/abs/2602.02047
Authors: Peijie Dong,Ruibo Fan,Yuechen Tao,Di Mou,Wenhu Hu,Zhenheng Tang,Yinghao Yu,Jiamang Wang,Wenbo Su,Guodong Yang,Liping Zhang,Xiaowen Chu,Baochun Li,Bo Li
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 39 pages, 32 figures
Abstract:Training large language models using 4-bit arithmetic enhances throughput and memory efficiency. Yet, the limited dynamic range of FP4 increases sensitivity to outliers. While NVFP4 mitigates quantization error via hierarchical microscaling, a persistent loss gap remains compared to BF16. This study conducts a longitudinal analysis of outlier dynamics across architecture during NVFP4 pretraining, focusing on where they localize, why they occur, and how they evolve temporally. We find that, compared with Softmax Attention (SA), Linear Attention (LA) reduces per-tensor heavy tails but still exhibits persistent block-level spikes under block quantization. Our analysis attributes outliers to specific architectural components: Softmax in SA, gating in LA, and SwiGLU in FFN, with “post-QK” operations exhibiting higher sensitivity to quantization. Notably, outliers evolve from transient spikes early in training to a small set of persistent hot channels (i.e., channels with persistently large magnitudes) in later stages. Based on these findings, we introduce Hot-Channel Patch (HCP), an online compensation mechanism that identifies hot channels and reinjects residuals using hardware-efficient kernels. We then develop CHON, an NVFP4 training recipe integrating HCP with post-QK operation protection. On GLA-1.3B model trained for 60B tokens, CHON reduces the loss gap to BF16 from 0.94% to 0.58% while maintaining downstream accuracy.
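A minimal sketch of the hot-channel mechanic: find channels whose average magnitude persistently dominates, quantize with a crude FP4-like stand-in, and reinject the lost residual on just those channels. The detection threshold and the quantizer are simplifications; HCP's hardware-efficient fused kernels are not modeled here.
```python
import torch

torch.manual_seed(0)
x = torch.randn(1024, 256)
x[:, 7] *= 40.0                                   # plant one "hot" channel

def fake_fp4(t: torch.Tensor) -> torch.Tensor:
    # Crude stand-in for low-bit quantization: a 16-level per-tensor grid.
    scale = t.abs().max() / 7
    return torch.clamp(torch.round(t / scale), -8, 7) * scale

channel_mag = x.abs().mean(dim=0)
hot = torch.nonzero(channel_mag > 8 * channel_mag.median()).flatten()
print("hot channels:", hot.tolist())              # -> [7]

x_q = fake_fp4(x)
residual = (x - x_q)[:, hot]                      # what quantization lost there
x_patched = x_q.clone()
x_patched[:, hot] += residual                     # reinject on hot channels only
print(float((x - x_q).norm()), float((x - x_patched).norm()))
```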
zh
[NLP-57] Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在面对开放性数据科学任务时缺乏自主探索能力的问题,即如何让模型从原始数据中自主设定目标并进行调查以提取关键洞察,而不仅限于执行明确指令。其解决方案的关键在于提出了一种名为Deep Data Research (DDR)的开放式任务框架和DDR-Bench基准测试体系,该体系通过结构化的检查清单实现可验证的评估,从而系统性地衡量模型的 investigatory intelligence(探究智能),并揭示出有效探究智能不仅依赖于代理架构设计或单纯规模扩展,更取决于模型内在的策略机制。
链接: https://arxiv.org/abs/2602.02039
作者: Wei Liu,Peijie Yu,Michele Orini,Yali Du,Yulan He
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG)
备注: 14 pages, 7 tables, 8 figures
Abstract:The agency expected of Agentic Large Language Models goes beyond answering correctly, requiring autonomy to set goals and decide what to explore. We term this investigatory intelligence, distinguishing it from executional intelligence, which merely completes assigned tasks. Data Science provides a natural testbed, as real-world analysis starts from raw data rather than explicit queries, yet few benchmarks focus on it. To address this, we introduce Deep Data Research (DDR), an open-ended task where LLMs autonomously extract key insights from databases, and DDR-Bench, a large-scale, checklist-based benchmark that enables verifiable evaluation. Results show that while frontier models display emerging agency, long-horizon exploration remains challenging. Our analysis highlights that effective investigatory intelligence depends not only on agent scaffolding or merely scaling, but also on intrinsic strategies of agentic models.
zh
[NLP-58] Rethinking Genomic Modeling Through Optical Character Recognition
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的基因组基础模型在处理DNA序列时存在的结构性不匹配问题:即传统一维序列建模方式无法有效应对基因组中稀疏且非连续的语义结构,导致计算资源浪费于低信息背景区域,并阻碍了对长序列的高效压缩与理解。其解决方案的关键在于提出OpticalDNA框架,将基因组建模重构为类似光学字符识别(Optical Character Recognition, OCR)的文档理解任务,通过视觉化渲染DNA序列为结构化布局,并利用具备OCR能力的视觉-语言模型实现高保真压缩与细粒度信息保留;具体而言,该方法采用一个视觉DNA编码器生成紧凑可重建的视觉token,结合prompt条件目标优化核心基因组原语(如读取、区域定位、子序列检索和掩码跨度补全),从而在显著减少有效token数量的同时提升性能表现。
链接: https://arxiv.org/abs/2602.02014
作者: Hongxin Xiang,Pengsen Ma,Yunkang Cao,Di Yu,Haowen Chen,Xinyu Yang,Xiangxiang Zeng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Recent genomic foundation models largely adopt large language model architectures that treat DNA as a one-dimensional token sequence. However, exhaustive sequential reading is structurally misaligned with sparse and discontinuous genomic semantics, leading to wasted computation on low-information background and preventing understanding-driven compression for long contexts. Here, we present OpticalDNA, a vision-based framework that reframes genomic modeling as Optical Character Recognition (OCR)-style document understanding. OpticalDNA renders DNA into structured visual layouts and trains an OCR-capable vision–language model with a visual DNA encoder and a document decoder, where the encoder produces compact, reconstructible visual tokens for high-fidelity compression. Building on this representation, OpticalDNA defines prompt-conditioned objectives over core genomic primitives (reading, region grounding, subsequence retrieval, and masked span completion), thereby learning layout-aware DNA representations that retain fine-grained genomic information under a reduced effective token budget. Across diverse genomic benchmarks, OpticalDNA consistently outperforms recent baselines; on sequences up to 450k bases, it achieves the best overall performance with nearly 20× fewer effective tokens, and surpasses models with up to 985× more activated parameters while tuning only 256k trainable parameters.
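下面是“将 DNA 渲染为文档图像”这一思路的玩具级示意:把碱基序列按固定宽度折行,再用等宽字体栅格化为灰度图,供具备 OCR 能力的视觉-语言模型以像素形式读取。行宽、字号等布局参数均为假设,论文的真实渲染方案摘要中并未给出。

```python
from PIL import Image, ImageDraw

def render_dna(sequence, row_width=80, char_w=8, char_h=14):
    # Wrap the base sequence into fixed-width rows (layout params are assumed).
    rows = [sequence[i:i + row_width] for i in range(0, len(sequence), row_width)]
    img = Image.new("L", (row_width * char_w, len(rows) * char_h), color=255)
    draw = ImageDraw.Draw(img)
    for r, row in enumerate(rows):
        draw.text((0, r * char_h), row, fill=0)   # default bitmap font
    return img

seq = "ACGT" * 200                                 # toy 800-base sequence
render_dna(seq).save("dna_page.png")
```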
zh
[NLP-59] NEAT: Neuron-Based Early Exit for Large Reasoning Models
【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRM)中存在的“过度思考”(overthinking)问题,即在得出正确解后仍生成冗余的推理步骤,导致计算资源浪费。解决方案的关键在于提出一种无需训练的早期推理退出框架 NEAT(Neuron-based Early Reasoning Exit),其通过监测神经元层面的激活动态来识别与退出相关的神经元,并追踪其激活模式,从而在推理过程中动态触发早期退出或抑制反思,有效减少不必要的推理步骤,同时保持解的质量。该方法不引入额外的测试时计算开销,且无需外部标注数据或额外训练过程。
链接: https://arxiv.org/abs/2602.02010
作者: Kang Liu,Yongkang Liu,Xiaocui Yang,Peidong Wang,Wen Zhang,Shi Feng,Yifei Zhang,Daling Wang
机构: Northeastern University, China (东北大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Reasoning Models (LRMs) often suffer from overthinking, a phenomenon in which redundant reasoning steps are generated after a correct solution has already been reached. Existing early reasoning exit methods primarily rely on output-level heuristics or trained probing models to skip redundant reasoning steps, thereby mitigating overthinking. However, these approaches typically require additional rollout computation or externally labeled datasets. In this paper, we propose NEAT, a Neuron-based Early reAsoning exiT framework that monitors neuron-level activation dynamics to enable training-free early exits, without introducing additional test-time computation. NEAT identifies exit-associated neurons and tracks their activation patterns during reasoning to dynamically trigger early exit or suppress reflection, thereby reducing unnecessary reasoning while preserving solution quality. Experiments on four reasoning benchmarks across six models with different scales and architectures show that, for each model, NEAT achieves an average token reduction of 22% to 28% when averaged over the four benchmarks, while maintaining accuracy.
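以下示意如何在推理时监控“退出相关神经元”的激活并触发提前退出。神经元编号、所挂的层与阈值均为假设;论文中这些神经元是通过实证分析筛选得到的。

```python
import torch

class ExitMonitor:
    """Sketch: watch a few exit-associated neurons and signal early exit."""
    def __init__(self, layer, neuron_ids, threshold=2.0):   # assumed values
        self.neuron_ids = torch.tensor(neuron_ids)
        self.threshold = threshold
        self.score = 0.0
        layer.register_forward_hook(self._hook)

    def _hook(self, module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Mean |activation| of the tracked neurons at the last position.
        last = hidden[..., -1, :]
        self.score = last.index_select(-1, self.neuron_ids).abs().mean().item()

    def should_exit(self):
        return self.score > self.threshold

# Usage (module path is model-specific and assumed):
# monitor = ExitMonitor(model.model.layers[20], [11, 42, 99])
# During step-by-step decoding, end the reasoning phase once monitor.should_exit().
```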
zh
[NLP-60] Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation
【速读】: 该论文旨在解决代理记忆系统(agent memory systems)中传统检索增强生成(Retrieval-Augmented Generation, RAG)方法的局限性问题。在标准RAG中,假设检索对象来自大规模、异构语料库,而代理记忆则是有限且连贯的对话流,其中片段高度相关甚至重复,导致固定top-k相似度检索容易返回冗余信息,且后续修剪可能误删推理所需的时序依赖关系。为此,作者提出xMemory,其核心创新在于将检索机制从单纯相似性匹配转向基于潜在语义组件的结构化操作:通过解耦(decoupling)与聚合(aggregation)策略,将记忆拆分为语义单元并构建层次化组织结构,利用稀疏性-语义性目标函数指导记忆的分裂与合并,从而实现更高效、忠实的检索。在推理阶段,xMemory采用自顶向下的检索方式,优先选取紧凑且多样化的主题和语义节点以应对多事实查询,并仅在降低读者不确定性时才扩展至具体事件或原始消息,显著提升了答案质量和token效率。
链接: https://arxiv.org/abs/2602.02007
作者: Zhanghao Hu,Qinglin Zhu,Hanqi Yan,Yulan He,Lin Gui
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Agent memory systems often adopt the standard Retrieval-Augmented Generation (RAG) pipeline, yet its underlying assumptions differ in this setting. RAG targets large, heterogeneous corpora where retrieved passages are diverse, whereas agent memory is a bounded, coherent dialogue stream with highly correlated spans that are often duplicates. Under this shift, fixed top-k similarity retrieval tends to return redundant context, and post-hoc pruning can delete temporally linked prerequisites needed for correct reasoning. We argue retrieval should move beyond similarity matching and instead operate over latent components, following a decoupling-then-aggregation scheme: disentangle memories into semantic components, organise them into a hierarchy, and use this structure to drive retrieval. We propose xMemory, which builds a hierarchy of intact units and maintains a searchable yet faithful high-level node organisation via a sparsity–semantics objective that guides memory split and merge. At inference, xMemory retrieves top-down, selecting a compact, diverse set of themes and semantics for multi-fact queries, and expanding to episodes and raw messages only when it reduces the reader’s uncertainty. Experiments on LoCoMo and PerLTQA across the three latest LLMs show consistent gains in answer quality and token efficiency.
zh
[NLP-61] From Latent Signals to Reflection Behavior: Tracing Meta-Cognitive Activation Trajectory in R1-Style LLMs
【速读】: 该论文旨在解决生成式 AI(Generative AI)中基于 R1-style 大语言模型(Large Language Models, LLMs)的自我反思行为(self-reflection)内部机制不明确的问题。其核心解决方案在于通过“logit lens”方法追踪反射行为在不同网络层中的激活轨迹,识别出三个关键阶段:(i)潜在控制层(latent-control layers),其中近似线性方向编码了思考预算(thinking budget)语义;(ii)语义枢纽层(semantic-pivot layers), discourse-level 信号(如转折点和总结提示)显现并主导概率分布;(iii)行为外显层(behavior-overt layers),反射行为标记的概率显著上升直至被高频采样。进一步的靶向干预揭示了跨阶段的因果链:提示语义调控潜在控制方向上的激活投影,进而引发语义枢纽层中转折与总结线索的竞争,最终调节行为外显层中反射行为标记的采样概率。这一发现表明,LLMs 的自我反思过程遵循类人类元认知机制——从潜在监控到话语级调控,再到外显反思行为的逐步推进。
链接: https://arxiv.org/abs/2602.01999
作者: Yanrui Du,Yibo Gao,Sendong Zhao,Jiayun Li,Haochun Wang,Qika Lin,Kai He,Bing Qin,Mengling Feng
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:R1-style LLMs have attracted growing attention for their capacity for self-reflection, yet the internal mechanisms underlying such behavior remain unclear. To bridge this gap, we anchor on the onset of reflection behavior and trace its layer-wise activation trajectory. Using the logit lens to read out token-level semantics, we uncover a structured progression: (i) Latent-control layers, where an approximate linear direction encodes the semantics of thinking budget; (ii) Semantic-pivot layers, where discourse-level cues, including turning-point and summarization cues, surface and dominate the probability mass; and (iii) Behavior-overt layers, where the likelihood of reflection-behavior tokens begins to rise until they become highly likely to be sampled. Moreover, our targeted interventions uncover a causal chain across these stages: prompt-level semantics modulate the projection of activations along latent-control directions, thereby inducing competition between turning-point and summarization cues in semantic-pivot layers, which in turn regulates the sampling likelihood of reflection-behavior tokens in behavior-overt layers. Collectively, our findings suggest a human-like meta-cognitive process: progressing from latent monitoring, to discourse-level regulation, and finally to overt self-reflection. Our analysis code can be found at this https URL.
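论文使用的 logit lens 是一种常见的可解释性读出手段:把每一层的隐藏状态经最终 LayerNorm 与解嵌入矩阵投影到词表空间,观察该层“已经写入”了哪些 token 语义。下面以 GPT-2 为例给出最小可运行示意;R1 风格模型的模块命名会有所不同。

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):
    # Logit lens: final norm + unembedding applied to an intermediate state.
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    print(f"layer {layer:2d} -> {tok.decode(logits.argmax(-1))!r}")
```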
zh
[NLP-62] S3-CoT: Self-Sampled Succinct Reasoning Enables Efficient Chain-of-Thought LLMs
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在链式思维(Chain-of-Thought, CoT)推理中普遍存在冗余推理过程的问题,并探索是否能够使LLMs具备类似人类系统1(System 1)的快速思考模式。其解决方案的关键在于提出一种基于激活控制(activation steering)的自采样框架,无需外部教师指导即可从目标模型自身生成风格一致且长度可变的推理轨迹,从而缓解监督微调(Supervised Fine-Tuning, SFT)方法对高质量标注数据稀缺的依赖;进一步结合基于黄金答案过滤的数据进行SFT训练,引入类人双认知系统结构与渐进压缩课程(progressive compression curriculum),并最终实现仅依赖预测一致性数据驱动的自我演化机制,显著提升数学基准任务及跨领域(如医学)推理性能。
链接: https://arxiv.org/abs/2602.01982
作者: Yanrui Du,Sendong Zhao,Yibo Gao,Danyang Zhao,Qika Lin,Ming Ma,Jiayun Li,Yi Jiang,Kai He,Qianyi Xu,Bing Qin,Mengling Feng
机构: Harbin Institute of Technology (哈尔滨工业大学); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) equipped with chain-of-thought (CoT) achieve strong performance and offer a window into LLM behavior. However, recent evidence suggests that improvements in CoT capabilities often come with redundant reasoning processes, motivating a key question: Can LLMs acquire a fast-thinking mode analogous to human System 1 reasoning? To explore this, our study presents a self-sampling framework based on activation steering for efficient CoT learning. Our method can induce style-aligned and variable-length reasoning traces from target LLMs themselves without any teacher guidance, thereby alleviating a central bottleneck of SFT-based methods: the scarcity of high-quality supervision data. Using data filtered by gold answers, we perform SFT for efficient CoT learning with (i) a human-like dual-cognitive system, and (ii) a progressive compression curriculum. Furthermore, we explore a self-evolution regime in which SFT is driven solely by prediction-consistent data of variable-length variants, eliminating the need for gold answers. Extensive experiments on math benchmarks, together with cross-domain generalization tests in medicine, show that our method yields stable improvements for both general and R1-style LLMs. Our data and model checkpoints can be found at this https URL.
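下面示意论文所依赖的激活引导(activation steering)机制:在某一层的残差流上加入一个带缩放的方向向量,以诱导同一模型产出长度不同的推理轨迹。示例中的方向向量是随机的,层号与缩放系数也仅为演示;实际方向需由对比样本统计得到。

```python
import torch

def add_steering_hook(layer, direction, scale):
    """Sketch: add a scaled steering direction to a layer's output."""
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * direction.to(hidden)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

# Usage (module path is model-specific and assumed here):
# handle = add_steering_hook(model.transformer.h[12], torch.randn(768), scale=4.0)
# ... generate at several scales to sample variable-length traces ...
# handle.remove()
```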
zh
[NLP-63] Beyond Local Edits: Embedding-Virtualized Knowledge for Broader Evaluation and Preservation of Model Editing
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)知识编辑方法在评估时存在的局限性问题,即现有评测通常依赖预定义的有限数据集,难以全面反映编辑操作对模型整体知识系统的影响。其解决方案的关键在于提出Embedding-Virtualized Knowledge (EVK)框架,通过在嵌入空间中施加受控扰动来表征模型知识,从而探索远超显式标注数据范围的虚拟知识区域;在此基础上构建了EVK-Bench嵌入级评估基准以量化编辑引发的知识漂移效应,并进一步设计了可插拔的EVK-Align模块,在不牺牲编辑准确性的前提下有效约束嵌入层的知识漂移,实现更全面的评估与更强的知识保留能力。
链接: https://arxiv.org/abs/2602.01977
作者: Shuainan Liu,Xuanang Chen,Ben He,Le Sun
机构: University of Chinese Academy of Sciences (中国科学院大学); Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所信息处理实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:Knowledge editing methods for large language models are commonly evaluated using predefined benchmarks that assess edited facts together with a limited set of related or neighboring knowledge. While effective, such evaluations remain confined to finite, dataset-bounded samples, leaving the broader impact of editing on the model’s knowledge system insufficiently understood. To address this gap, we introduce Embedding-Virtualized Knowledge (EVK) that characterizes model knowledge through controlled perturbations in embedding space, enabling the exploration of a substantially broader and virtualized knowledge region beyond explicit data annotations. Based on EVK, we construct an embedding-level evaluation benchmark EVK-Bench that quantifies potential knowledge drift induced by editing, revealing effects that are not captured by conventional sample-based metrics. Furthermore, we propose a plug-and-play EVK-Align module that constrains embedding-level knowledge drift during editing and can be seamlessly integrated into existing editing methods. Experiments demonstrate that our approach enables more comprehensive evaluation while significantly improving knowledge preservation without sacrificing editing accuracy.
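EVK 的核心是在嵌入空间施加受控扰动并观察模型输出的变化。下面给出一个假设性的最小度量:对输入嵌入加固定半径的随机噪声,再用 KL 散度衡量末位置预测分布的漂移;半径取法与漂移指标均为示意性选择,并非论文的精确定义。

```python
import torch
import torch.nn.functional as F

def embedding_drift(model, input_ids, radius, n_samples=8):
    """Sketch: output-distribution drift under fixed-radius embedding noise."""
    emb = model.get_input_embeddings()(input_ids)
    with torch.no_grad():
        base = F.log_softmax(model(inputs_embeds=emb).logits[:, -1], dim=-1)
        drifts = []
        for _ in range(n_samples):
            noise = torch.randn_like(emb)
            noise = radius * noise / noise.norm(dim=-1, keepdim=True)
            pert = F.log_softmax(
                model(inputs_embeds=emb + noise).logits[:, -1], dim=-1)
            drifts.append(F.kl_div(pert, base, log_target=True,
                                   reduction="batchmean"))
    return torch.stack(drifts).mean()
```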
zh
[NLP-64] Orthogonal Hierarchical Decomposition for Structure-Aware Table Understanding with Large Language Models
【速读】: 该论文旨在解决复杂表格(如多级表头、合并单元格和异构布局)在大语言模型(LLM)中的理解与推理难题,现有方法如表格线性化或标准化网格建模难以显式捕捉层级结构和跨维度依赖关系,导致结构语义与文本表示之间出现错位。解决方案的关键在于提出正交层次分解(Orthogonal Hierarchical Decomposition, OHD)框架,其核心创新是基于空间-语义协同约束的正交树诱导(Orthogonal Tree Induction, OTI)方法,将不规则表格分解为列树和行树,分别捕获垂直和水平方向的层次依赖关系;在此基础上设计双路径关联协议对齐单元格语义谱系,并引入LLM作为语义仲裁器实现多层次语义信息的一致性对齐。
链接: https://arxiv.org/abs/2602.01969
作者: Bin Cao,Huixian Lu,Chenwen Ma,Ting Wang,Ruizhe Li,Jing Fan
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Work in process
Abstract:Complex tables with multi-level headers, merged cells and heterogeneous layouts pose persistent challenges for LLMs in both understanding and reasoning. Existing approaches typically rely on table linearization or normalized grid modeling. However, these representations struggle to explicitly capture hierarchical structures and cross-dimensional dependencies, which can lead to misalignment between structural semantics and textual representations for non-standard tables. To address this issue, we propose an Orthogonal Hierarchical Decomposition (OHD) framework that constructs structure-preserving input representations of complex tables for LLMs. OHD introduces an Orthogonal Tree Induction (OTI) method based on spatial–semantic co-constraints, which decomposes irregular tables into a column tree and a row tree to capture vertical and horizontal hierarchical dependencies, respectively. Building on this representation, we design a dual-pathway association protocol to symmetrically reconstruct the semantic lineage of each cell, and incorporate an LLM as a semantic arbitrator to align multi-level semantic information. We evaluate the OHD framework on two complex table question answering benchmarks, AITQA and HiTab. Experimental results show that OHD consistently outperforms existing representation paradigms across multiple evaluation metrics.
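为说明“正交分解”的数据结构,下面用一个无合并单元格的玩具表格演示:分别由列表头与行表头构建两棵树,并把每个单元格的语义谱系重建为(行路径, 列路径, 值)三元组。真实表格还需处理合并单元格与不规则布局,此处从略。

```python
col_headers = [("Revenue", "2023"), ("Revenue", "2024"), ("Cost", "2024")]
row_headers = [("US",), ("EU",)]
cells = [[100, 120, 80], [90, 110, 70]]

def build_tree(paths):
    """Fold header paths into a nested-dict tree."""
    tree = {}
    for path in paths:
        node = tree
        for part in path:
            node = node.setdefault(part, {})
    return tree

col_tree, row_tree = build_tree(col_headers), build_tree(row_headers)
lineage = {(r, c): (row_headers[r], col_headers[c], cells[r][c])
           for r in range(len(cells)) for c in range(len(cells[0]))}
print(lineage[(0, 1)])   # (('US',), ('Revenue', '2024'), 120)
```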
zh
[NLP-65] Mixture-of-Experts with Intermediate CTC Supervision for Accented Speech Recognition
【速读】: 该论文旨在解决自动语音识别(ASR)系统在处理不同口音(accent)语音时性能下降的问题,特别是针对低资源和未见过的口音场景。现有方法中,通用口音(accent-agnostic)模型难以应对重度或未知口音,而特定口音(accent-specific)模型则受限于标注数据稀缺且质量不高。解决方案的关键在于提出Moe-Ctc架构——一种带有中间连接时序分类(CTC)监督的专家混合(Mixture-of-Experts, MoE)模型。该架构通过口音感知路由机制引导专家学习口音特异性模式,并在推理阶段转为无标签路由;每个专家配备独立的CTC头以对齐路由决策与识别准确性,同时引入路由增强损失函数稳定训练过程,从而实现专家专业化与泛化能力的协同提升。
链接: https://arxiv.org/abs/2602.01967
作者: Wonjun Lee,Hyounghun Kim,Gary Geunbae Lee
机构: POSTECH(浦项科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Accented speech remains a persistent challenge for automatic speech recognition (ASR), as most models are trained on data dominated by a few high-resource English varieties, leading to substantial performance degradation for other accents. Accent-agnostic approaches improve robustness yet struggle with heavily accented or unseen varieties, while accent-specific methods rely on limited and often noisy labels. We introduce Moe-Ctc, a Mixture-of-Experts architecture with intermediate CTC supervision that jointly promotes expert specialization and generalization. During training, accent-aware routing encourages experts to capture accent-specific patterns, which gradually transitions to label-free routing for inference. Each expert is equipped with its own CTC head to align routing with transcription quality, and a routing-augmented loss further stabilizes optimization. Experiments on the Mcv-Accent benchmark demonstrate consistent gains across both seen and unseen accents in low- and high-resource conditions, achieving up to 29.3% relative WER reduction over strong FastConformer baselines.
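下面是“每个专家各带一个 CTC 头”这一结构的紧凑示意:路由器按句级特征做 top-1 选择,损失由被选专家的 CTC 头给出。维度、top-1 路由与朴素 CTC 损失都是演示性假设,论文训练期还叠加了口音感知路由与路由增强损失。

```python
import torch
import torch.nn as nn

class MoECTC(nn.Module):
    """Sketch of per-expert CTC heads with top-1 routing (dims are assumed)."""
    def __init__(self, dim=256, n_experts=4, vocab=32):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.heads = nn.ModuleList(nn.Linear(dim, vocab) for _ in range(n_experts))
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, x, targets, in_lens, tgt_lens):   # x: (B, T, dim)
        expert = self.router(x.mean(dim=1)).argmax(-1)  # label-free routing
        losses = []
        for b in range(x.shape[0]):
            e = expert[b]
            h = torch.relu(self.experts[e](x[b:b + 1]))
            logp = self.heads[e](h).log_softmax(-1).transpose(0, 1)  # (T, 1, V)
            losses.append(self.ctc(logp, targets[b:b + 1],
                                   in_lens[b:b + 1], tgt_lens[b:b + 1]))
        return torch.stack(losses).mean()
```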
zh
[NLP-66] Breaking the Static Graph: Context-Aware Traversal for Robust Retrieval-Augmented Generation
【速读】: 该论文旨在解决当前检索增强生成(Retrieval-Augmented Generation, RAG)方法中因静态知识图谱(Knowledge Graph, KG)结构导致的“静态图谬误”(Static Graph Fallacy)问题,即固定边权重无法适应查询语义变化,造成随机游走过程被高连接度节点干扰,从而无法完整获取多跳推理所需的证据链。解决方案的关键在于提出CatRAG(Context-Aware Traversal),其核心创新是将静态KG转化为查询自适应的导航结构,通过三项机制协同优化:(1) 符号锚定(Symbolic Anchoring)引入弱实体约束以正则化游走路径;(2) 查询感知动态边加权(Query-Aware Dynamic Edge Weighting)实时调整图结构,抑制无关路径并强化语义一致路径;(3) 关键事实段落权重增强(Key-Fact Passage Weight Enhancement)低成本地结构性锚定游走至高概率证据节点。实验表明,CatRAG在多跳基准测试中显著提升推理完整性,有效弥合了部分召回与完全接地推理之间的鸿沟。
链接: https://arxiv.org/abs/2602.01965
作者: Kwun Hang Lau,Fangyuan Zhang,Boyu Ruan,Yingli Zhou,Qintian Guo,Ruiyuan Zhang,Xiaofang Zhou
机构: Huawei Hong Kong Research Center (华为香港研究中心); The Hong Kong University of Science and Technology (香港科技大学); The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in Retrieval-Augmented Generation (RAG) have shifted from simple vector similarity to structure-aware approaches like HippoRAG, which leverage Knowledge Graphs (KGs) and Personalized PageRank (PPR) to capture multi-hop dependencies. However, these methods suffer from a “Static Graph Fallacy”: they rely on fixed transition probabilities determined during indexing. This rigidity ignores the query-dependent nature of edge relevance, causing semantic drift where random walks are diverted into high-degree “hub” nodes before reaching critical downstream evidence. Consequently, models often achieve high partial recall but fail to retrieve the complete evidence chain required for multi-hop queries. To address this, we propose CatRAG, Context-Aware Traversal for robust RAG, a framework that builds on the HippoRAG 2 architecture and transforms the static KG into a query-adaptive navigation structure. We introduce a multi-faceted framework to steer the random walk: (1) Symbolic Anchoring, which injects weak entity constraints to regularize the random walk; (2) Query-Aware Dynamic Edge Weighting, which dynamically modulates graph structure, to prune irrelevant paths while amplifying those aligned with the query’s intent; and (3) Key-Fact Passage Weight Enhancement, a cost-efficient bias that structurally anchors the random walk to likely evidence. Experiments across four multi-hop benchmarks demonstrate that CatRAG consistently outperforms state-of-the-art baselines. Our analysis reveals that while standard Recall metrics show modest gains, CatRAG achieves substantial improvements in reasoning completeness, the capacity to recover the entire evidence path without gaps. These results reveal that our approach effectively bridges the gap between retrieving partial context and enabling fully grounded reasoning. Resources are available at this https URL.
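下面用 networkx 示意“查询感知的动态边加权 + 个性化 PageRank”的计算流程:按查询向量与边嵌入的相似度实时改写边权,再从符号锚定的实体节点出发做 PPR。边上的 emb 属性、阻尼系数等均为假设接口,仅说明思路而非论文实现。

```python
import networkx as nx
import numpy as np

def query_aware_ppr(graph, query_vec, anchors, alpha=0.85):
    """Sketch: reweight edges by query-edge similarity, then run PPR."""
    for u, v, data in graph.edges(data=True):
        sim = float(np.dot(query_vec, data["emb"]))   # assumes edge embeddings
        data["weight"] = data.get("base", 1.0) * max(sim, 1e-3)
    # Personalization mass restarts the walk at the anchored entity nodes.
    personalization = {n: (1.0 if n in anchors else 0.0) for n in graph}
    return nx.pagerank(graph, alpha=alpha,
                       personalization=personalization, weight="weight")
```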
zh
[NLP-67] From Code-Centric to Concept-Centric: Teaching NLP with LLM -Assisted “Vibe Coding” EACL2026
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)快速发展的背景下,自然语言处理(Natural Language Processing, NLP)教育中如何平衡代码实现与概念理解的问题。传统教学往往过度强调编程语法和调试能力,导致学生在面对复杂NLP任务时难以聚焦于核心算法原理与思维训练。解决方案的关键在于引入“Vibe Coding”教学法——即让学生在实验中使用LLM生成代码,同时通过强制要求记录提示词(prompt logging)和以批判性反思问题为主的评估方式,引导其将注意力从语法正确性转向对NLP概念的深度理解。实证结果显示,该方法显著提升了学生的参与度和概念掌握水平,证明了在结构化设计下,LLM可作为辅助工具推动从“编码熟练”向“概念 mastery”的教学范式转变。
链接: https://arxiv.org/abs/2602.01919
作者: Hend Al-Khalifa
机构: King Saud University (沙特国王大学)
类目: Computation and Language (cs.CL)
备注: Accepted in The Seventh Workshop on Teaching Natural Language Processing (Teaching NLP @ EACL2026)
Abstract:The rapid advancement of Large Language Models (LLMs) presents both challenges and opportunities for Natural Language Processing (NLP) education. This paper introduces “Vibe Coding,” a pedagogical approach that leverages LLMs as coding assistants while maintaining focus on conceptual understanding and critical thinking. We describe the implementation of this approach in a senior-level undergraduate NLP course, where students completed seven labs using LLMs for code generation while being assessed primarily on conceptual understanding through critical reflection questions. Analysis of end-of-course feedback from 19 students reveals high satisfaction (mean scores 4.4-4.6/5.0) across engagement, conceptual learning, and assessment fairness. Students particularly valued the reduced cognitive load from debugging, enabling deeper focus on NLP concepts. However, challenges emerged around time constraints, LLM output verification, and the need for clearer task specifications. Our findings suggest that when properly structured with mandatory prompt logging and reflection-based assessment, LLM-assisted learning can shift focus from syntactic fluency to conceptual mastery, preparing students for an AI-augmented professional landscape.
zh
[NLP-68] GuideWeb: A Benchmark for Automatic In-App Guide Generation on Real-World Web UIs
【速读】: 该论文旨在解决数字采用平台(Digital Adoption Platform, DAP)中自动在应用内生成引导内容的挑战,尤其是针对真实世界网页用户界面(Web UI)中因页面布局和功能频繁变更而导致的手动维护成本高、效率低的问题。其解决方案的关键在于提出一个名为GuideWeb的新基准数据集与评估体系,将引导生成任务建模为两个核心子任务:基于网页内容选择目标元素(guide target elements)并生成符合用户意图的简洁引导文本(guide text)。通过引入GuideWeb Agent模型,在目标元素预测上达到30.79%准确率,同时在意图生成和引导文本生成上分别获得44.94的BLEU分数和21.34的BLEU分数,验证了该方法的有效性,并揭示当前自动引导生成仍面临显著挑战,亟需进一步研究以实现可靠的实际部署。
链接: https://arxiv.org/abs/2602.01917
作者: Chengguang Gan,Yoshihiro Tsujii,Yunhao Liang,Tatsunori Mori,Shiwen Ni,Hiroki Itoh
机构: Techtouch Inc.(Techtouch公司); University of Chinese Academy of Sciences (中国科学院大学); Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院); Yokohama National University (横滨国立大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Digital Adoption Platforms (DAPs) provide web-based overlays that deliver operation guidance and contextual hints to help users navigate complex websites. Although modern DAP tools enable non-experts to author such guidance, maintaining these guides remains labor-intensive because website layouts and functionalities evolve continuously, which requires repeated manual updates and re-annotation. In this work, we introduce GuideWeb, a new benchmark for automatic in-app guide generation on real-world web UIs. GuideWeb formulates the task as producing page-level guidance by selecting guide target elements grounded in the webpage and generating concise guide text aligned with user intent. We also propose a comprehensive evaluation suite that jointly measures the accuracy of guide target element selection and the quality of generated intents and guide texts. Experiments show that our proposed GuideWeb Agent achieves 30.79% accuracy in guide target element prediction, while obtaining BLEU scores of 44.94 for intent generation and 21.34 for guide-text generation. Existing baselines perform substantially worse, which highlights that automatic guide generation remains challenging and that further advances are necessary before such systems can be reliably deployed in real-world settings.
zh
[NLP-69] ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support WWW
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在长期情感支持等复杂在线服务中因缺乏稳健的长时记忆能力而导致的性能瓶颈问题,尤其针对用户信息分散、隐含且持续演变的场景。其解决方案的关键在于提出ES-MemEval基准和EvoEmo数据集:前者系统评估五种核心记忆能力(信息提取、时间推理、冲突检测、回避回答与用户建模),后者构建多轮对话情境下捕捉碎片化、隐含用户披露及状态演化的个性化长时情感支持数据集,从而推动对LLMs长时记忆机制的有效评估与改进。
链接: https://arxiv.org/abs/2602.01885
作者: Tiantian Chen,Jiaqi Lu,Ying Shen,Lin Zhang
机构: Tongji University (同济大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 7 figures. Accepted to The Web Conference (WWW) 2026
Abstract:Large Language Models (LLMs) have shown strong potential as conversational agents. Yet, their effectiveness remains limited by deficiencies in robust long-term memory, particularly in complex, long-term web-based services such as online emotional support. However, existing long-term dialogue benchmarks primarily focus on static and explicit fact retrieval, failing to evaluate agents in critical scenarios where user information is dispersed, implicit, and continuously evolving. To address this gap, we introduce ES-MemEval, a comprehensive benchmark that systematically evaluates five core memory capabilities: information extraction, temporal reasoning, conflict detection, abstention, and user modeling, in long-term emotional support settings, covering question answering, summarization, and dialogue generation tasks. To support the benchmark, we also propose EvoEmo, a multi-session dataset for personalized long-term emotional support that captures fragmented, implicit user disclosures and evolving user states. Extensive experiments on open-source long-context, commercial, and retrieval-augmented (RAG) LLMs show that explicit long-term memory is essential for reducing hallucinations and enabling effective personalization. At the same time, RAG improves factual consistency but struggles with temporal dynamics and evolving user states. These findings highlight both the potential and limitations of current paradigms and motivate more robust integration of memory and retrieval for long-term personalized dialogue systems.
zh
[NLP-70] PretrainRL: Alleviating Factuality Hallucination of Large Language Models at the Beginning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中存在的事实幻觉(factual hallucinations)问题,即模型在生成内容时会输出可验证的虚假信息。研究表明,这一问题的根本原因在于预训练语料库中数据分布的不平衡,导致模型倾向于生成高概率的错误陈述,而真实信息则处于低概率状态。为从根源上解决该问题,作者提出了一种名为PretrainRL的新框架,其核心思想是“去偏然后学习”(debias then learning),通过在预训练阶段引入强化学习机制,主动调整模型的概率分布:具体而言,该方法采用高效的负采样策略识别高概率虚假陈述,并对其加权降低,从而为低概率但真实的事实知识腾出学习空间。实验表明,PretrainRL在多个公开基准测试中显著缓解了事实幻觉,优于当前最优方法。
链接: https://arxiv.org/abs/2602.01875
作者: Langming Liu,Kangtao Lv,Haibin Chen,Weidong Zhang,Yejing Wang,Shilei Liu,Xin Tong,Yujin Yuan,Yongwei Wang,Wenbo Su,Bo Zheng
机构: Future Living Lab of Alibaba(阿里巴巴未来生活实验室); Zhejiang University(浙江大学); Alibaba Group(阿里巴巴集团); Shanghai AI Laboratory(上海人工智能实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs), despite their powerful capabilities, suffer from factual hallucinations where they generate verifiable falsehoods. We identify a root of this issue: the imbalanced data distribution in the pretraining corpus, which leads to a state of “low-probability truth” and “high-probability falsehood”. Recent approaches, such as teaching models to say “I don’t know” or post-hoc knowledge editing, either evade the problem or face catastrophic forgetting. To address this issue from its root, we propose PretrainRL, a novel framework that integrates reinforcement learning into the pretraining phase to consolidate factual knowledge. The core principle of PretrainRL is “debiasing then learning.” It actively reshapes the model’s probability distribution by down-weighting high-probability falsehoods, thereby making “room” for low-probability truths to be learned effectively. To enable this, we design an efficient negative sampling strategy to discover these high-probability falsehoods and introduce novel metrics to evaluate the model’s probabilistic state concerning factual knowledge. Extensive experiments on three public benchmarks demonstrate that PretrainRL significantly alleviates factual hallucinations and outperforms state-of-the-art methods.
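“先去偏、后学习”可以抽象成一个简单的训练目标:抬高真实续写的序列对数概率,同时按“模型自认为多可信”加权压低被负采样出的高概率谬误。下面的 seq_logprob 是假设的辅助函数,权重形式也仅为示意,并非论文的实际损失。

```python
import torch

def debias_loss(seq_logprob, prompt, truth, falsehoods, beta=1.0):
    """Sketch: up-weight the truth, down-weight probable falsehoods."""
    lp_true = seq_logprob(prompt, truth)                       # scalar tensor
    lp_false = torch.stack([seq_logprob(prompt, f) for f in falsehoods])
    # Falsehoods the model finds more probable receive larger penalties.
    penalty = (lp_false.softmax(dim=0) * lp_false).sum()
    return -lp_true + beta * penalty
```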
zh
[NLP-71] Read As Human: Compressing Context via Parallelizable Close Reading and Skimming
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长上下文场景中面临的两个核心问题:计算效率低下和冗余信息干扰。解决方案的关键在于提出一种名为RAM(Read As HuMan)的上下文压缩框架,其核心创新是采用自适应混合阅读策略——借鉴人类阅读行为(即对重要内容精读、对次要内容略读),将输入上下文分段后并行编码,并基于查询相关性动态区分高相关段落(完整保留)与低相关段落(查询引导压缩为紧凑摘要向量)。通过显式文本片段与隐式摘要向量的拼接输入解码器,RAM在保持自然语言格式可解释性的同时实现性能提升;进一步引入基于正负样本对的对比学习目标以优化精读与略读的决策边界,实验表明该方法在多个问答与摘要基准上优于现有基线,并在长输入场景下实现最高达12倍的端到端加速。
链接: https://arxiv.org/abs/2602.01840
作者: Jiwei Tang,Shilei Liu,Zhicheng Zhang,Qingsong Lv,Runsong Zhao,Tingwei Lu,Langming Liu,Haibin Chen,Yujin Yuan,Hai-Tao Zheng,Wenbo Su,Bo Zheng
机构: Tsinghua University (清华大学); Pengcheng Laboratory; Future Living Lab of Alibaba (阿里巴巴未来生活实验室); Northeastern University (东北大学)
类目: Computation and Language (cs.CL)
备注: 13 pages,5 figures
Abstract:Large Language Models (LLMs) demonstrate exceptional capability across diverse tasks. However, their deployment in long-context scenarios is hindered by two challenges: computational inefficiency and redundant information. We propose RAM (Read As HuMan), a context compression framework that adopts an adaptive hybrid reading strategy, to address these challenges. Inspired by human reading behavior (i.e., close reading important content while skimming less relevant content), RAM partitions the context into segments and encodes them with the input query in parallel. High-relevance segments are fully retained (close reading), while low-relevance ones are query-guided compressed into compact summary vectors (skimming). Both explicit textual segments and implicit summary vectors are concatenated and fed into the decoder to achieve both superior performance and natural language format interpretability. To refine the decision boundary between close reading and skimming, we further introduce a contrastive learning objective based on positive and negative query-segment pairs. Experiments demonstrate that RAM outperforms existing baselines on multiple question answering and summarization benchmarks across two backbones, while delivering up to a 12x end-to-end speedup on long inputs (average length 16K; maximum length 32K).
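RAM 的“精读/略读”调度可概括为如下骨架:先对各段落打相关性分,高分段原文保留,低分段交给查询引导的压缩器得到摘要向量。score 与 compress 均为假设接口,分别对应论文中的相关性判别与压缩模块。

```python
def hybrid_context(segments, query, score, compress, keep_k=2):
    """Sketch of close-reading top segments and skimming the rest."""
    ranked = sorted(segments, key=lambda s: score(query, s), reverse=True)
    close_read = ranked[:keep_k]                             # kept verbatim
    skimmed = [compress(query, s) for s in ranked[keep_k:]]  # summary vectors
    return close_read, skimmed
```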
zh
[NLP-72] AXE: Low-Cost Cross-Domain Web Structured Information Extraction
【速读】: 该论文旨在解决从网页中提取结构化数据时面临的困境:传统手动规则方法过于脆弱,而使用大型语言模型(Large Language Models, LLMs)则成本过高。其解决方案的关键在于提出AXE(Adaptive X-Path Extractor)管道,该方法将HTML文档对象模型(DOM)视为一棵需要修剪的树,而非单纯的文本流;通过专用的“修剪”机制去除冗余节点,保留高密度上下文,使得仅0.6B参数的小型语言模型即可生成精确的结构化输出。同时引入基于物理位置可追溯性的Grounded XPath Resolution(GXR)机制,确保所有提取结果均能对应到原始DOM中的具体节点,从而在低资源消耗下实现最先进的零样本性能(SWDE数据集F1达88.1%)。
链接: https://arxiv.org/abs/2602.01838
作者: Abdelrahman Mansour,Khaled W. Alshaer,Moataz Elsaban
机构: Cairo University (开罗大学); Ejada Systems; Fawry Integrated Systems; Microsoft
类目: Computation and Language (cs.CL)
备注:
Abstract:Extracting structured data from the web is often a trade-off between the brittle nature of manual heuristics and the prohibitive cost of Large Language Models. We introduce AXE (Adaptive X-Path Extractor), a pipeline that rethinks this process by treating the HTML DOM as a tree that needs pruning rather than just a wall of text to be read. AXE uses a specialized “pruning” mechanism to strip away boilerplate and irrelevant nodes, leaving behind a distilled, high-density context that allows a tiny 0.6B LLM to generate precise, structured outputs. To keep the model honest, we implement Grounded XPath Resolution (GXR), ensuring every extraction is physically traceable to a source node. Despite its low footprint, AXE achieves state-of-the-art zero-shot performance, outperforming several much larger, fully-trained alternatives with an F1 score of 88.1% on the SWDE dataset. By releasing our specialized adaptors, we aim to provide a practical, cost-effective path for large-scale web information extraction.
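DOM 剪枝与“可回溯 XPath”的组合可以用 lxml 粗略演示:先删去样板节点,再验证抽取结果能解析回原树中的具体 XPath。剪枝标签表与玩具页面均为假设,论文中的剪枝模型与 0.6B 抽取器此处并未复现。

```python
from lxml import html

PRUNE_TAGS = {"script", "style", "nav", "footer", "iframe"}

def prune(tree):
    """Sketch: drop boilerplate nodes before extraction."""
    for node in list(tree.iter()):
        if node.tag in PRUNE_TAGS and node.getparent() is not None:
            node.getparent().remove(node)
    return tree

page = html.fromstring(
    "<html><body><nav>menu</nav><h1>Widget</h1>"
    "<span class='p'>$9</span><script>x()</script></body></html>")
tree = prune(page)
price = tree.xpath("//span[@class='p']")[0]
# Grounding: every extraction resolves to a physical node's XPath.
print(price.text, tree.getroottree().getpath(price))
```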
zh
[NLP-73] Sentence Curve Language Models
【速读】: 该论文旨在解决扩散语言模型(Diffusion-based Language Models, DLMs)中因使用静态词嵌入(static word embeddings)表示目标句子而导致的局部预测准确但忽视全局句法结构的问题。其核心解决方案是提出一种连续的句子表征方法——句子曲线(sentence curve),该曲线通过样条函数(spline curve)定义,其控制点可同时影响句子中多个词语的表示。基于此,作者构建了句子曲线语言模型(Sentence Curve Language Model, SCLM),将DLM的目标从预测静态词嵌入改为预测句子曲线,从而在理论上引入正则化效应以增强对全局结构的建模能力,并在多个基准数据集上实现优于现有DLM的方法性能。
链接: https://arxiv.org/abs/2602.01807
作者: DongNyeong Heo,Heelyoul Choi
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Language models (LMs) are a central component of modern AI systems, and diffusion-based language models (DLMs) have recently emerged as a competitive alternative. Both paradigms rely on word embeddings not only to represent the input sentence, but also to represent the target sentence that backbone models are trained to predict. We argue that such static embedding of the target word is insensitive to neighboring words, encouraging locally accurate word prediction while neglecting global structure across the target sentence. To address this limitation, we propose a continuous sentence representation, termed sentence curve, defined as a spline curve whose control points affect multiple words in the sentence. Based on this representation, we introduce sentence curve language model (SCLM), which extends DLMs to predict sentence curves instead of the static word embeddings. We theoretically show that sentence curve prediction induces a regularization effect that promotes global structure modeling, and characterize how different sentence curve types affect this behavior. Empirically, SCLM achieves SOTA performance among DLMs on IWSLT14 and WMT14, shows stable training without burdensome knowledge distillation, and demonstrates promising potential compared to discrete DLMs on LM1B.
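“句子曲线”的要点在于:少量控制点经样条求值后同时决定句中多个词的目标表示。下面用 de Casteljau 算法对贝塞尔曲线求值作玩具演示;论文实际采用的样条族与维度摘要中未给出,此处仅示意控制点如何被多个词共享。

```python
import numpy as np

def bezier(control_points, t):
    """Evaluate a Bezier curve at t in [0, 1] via de Casteljau recursion."""
    pts = np.asarray(control_points, dtype=float)
    while len(pts) > 1:
        pts = (1 - t) * pts[:-1] + t * pts[1:]
    return pts[0]

controls = np.random.default_rng(0).normal(size=(4, 8))   # 4 points, dim 8
n_words = 6
curve = [bezier(controls, i / (n_words - 1)) for i in range(n_words)]
# All six target-word vectors depend on all four control points.
```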
zh
[NLP-74] CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在源代码理解任务中因上下文长度随软件系统规模增长而带来的计算效率瓶颈问题。当前LLMs依赖文本模态处理代码,导致上下文长度和计算成本呈线性增长。解决方案的关键在于利用多模态大语言模型(Multimodal Large Language Models, MLLMs)将源代码表示为渲染后的图像,借助图像模态天然具备的压缩能力,在保持语义可识别性的前提下显著降低token消耗——实验表明,该方法可在不牺牲代码理解性能的前提下实现最高8倍的token压缩,并通过视觉线索(如语法高亮)进一步提升代码补全效果,同时在克隆检测等任务中表现出对视觉压缩的强鲁棒性。
链接: https://arxiv.org/abs/2602.01785
作者: Yuling Shi,Chaoxiang Xie,Zhensu Sun,Yeheng Chen,Chenxu Zhang,Longfei Yun,Chengcheng Wan,Hongyu Zhang,David Lo,Xiaodong Gu
机构: Shanghai Jiao Tong University (上海交通大学); Hohai University (河海大学); Singapore Management University (新加坡管理大学); Beijing Institute of Technology, Zhuhai (北京理工大学珠海校区); Imperial College London (帝国理工学院); UC San Diego (加州大学圣地亚哥分校); East China Normal University (华东师范大学); Shanghai Innovation Institute (上海创新研究院); Chongqing University (重庆大学)
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: Code and data are available at this https URL
Abstract:Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based paradigm that treats source code as a linear sequence of tokens, which leads to a linear increase in context length and associated computational costs. The rapid advancement of Multimodal LLMs (MLLMs) introduces an opportunity to optimize efficiency by representing source code as rendered images. Unlike text, which is difficult to compress without losing semantic meaning, the image modality is inherently suitable for compression. By adjusting resolution, images can be scaled to a fraction of their original token cost while remaining recognizable to vision-capable models. To explore the feasibility of this approach, we conduct the first systematic study on the effectiveness of MLLMs for code understanding. Our experiments reveal that: (1) MLLMs can effectively understand code with substantial token reduction, achieving up to 8x compression; (2) MLLMs can effectively leverage visual cues such as syntax highlighting, improving code completion performance under 4x compression; and (3) Code-understanding tasks like clone detection exhibit exceptional resilience to visual compression, with some compression ratios even slightly outperforming raw text inputs. Our findings highlight both the potential and current limitations of MLLMs in code understanding, pointing toward image-modality code representation as a pathway to more efficient inference.
zh
[NLP-75] Data Distribution Matters: A Data-Centric Perspective on Context Compression for Large Language Model
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在长上下文场景中因计算效率低下和信息冗余严重而导致的部署瓶颈问题。现有研究主要聚焦于模型侧优化,忽视了数据分布对上下文压缩质量的影响。论文首次从数据中心视角出发,系统性地分析输入数据分布与模型内部预训练知识(即内在数据)如何影响压缩效果,提出基于自编码器框架的语义完整性评估方法,量化压缩表示的质量。关键发现包括:输入数据的编码器测量熵与压缩质量呈负相关,而解码器测量熵在固定解码器设置下无显著关联;此外,编码器与解码器之间内在数据的差异会显著削弱压缩收益,且难以缓解。基于此,论文进一步提出了可操作的实践指南以提升压缩增益。
链接: https://arxiv.org/abs/2602.01778
作者: Kangtao Lv,Jiwei Tang,Langming Liu,Haibin Chen,Weidong Zhang,Shilei Liu,Yongwei Wang,Yujin Yuan,Wenbo Su,Bo Zheng
机构: 未知
类目: Computation and Language (cs.CL)
备注: 15 pages,6 figures
Abstract:The deployment of Large Language Models (LLMs) in long-context scenarios is hindered by computational inefficiency and significant information redundancy. Although recent advancements have widely adopted context compression to address these challenges, existing research focuses only on model-side improvements, and the impact of the data distribution itself on context compression remains largely unexplored. To bridge this gap, we are the first to adopt a data-centric perspective to systematically investigate how data distribution impacts compression quality, including two dimensions: input data and intrinsic data (i.e., the model’s internal pretrained knowledge). We evaluate the semantic integrity of compressed representations using an autoencoder-based framework. Our experimental results reveal that: (1) encoder-measured input entropy negatively correlates with compression quality, while decoder-measured entropy shows no significant relationship under a frozen-decoder setting; and (2) the gap between intrinsic data of the encoder and decoder significantly diminishes compression gains, which is hard to mitigate. Based on these findings, we further present practical guidelines to optimize compression gains.
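文中的“编码器测得的输入熵”可按如下方式近似计算:用压缩器模型(此处以 GPT-2 充当)对输入逐位置求下一词分布的熵再取平均。模型选择与熵的聚合方式均为示意性假设。

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
enc = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in encoder

def input_entropy(text):
    """Sketch: mean next-token entropy of the encoder over the input."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logp = enc(ids).logits[:, :-1].log_softmax(-1)
    return float(-(logp.exp() * logp).sum(-1).mean())

print(input_entropy("The quick brown fox jumps over the lazy dog."))
```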
zh
[NLP-76] SOG_k: One LLM Token for Explicit Graph Structural Understanding
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理图结构数据时因结构幻觉(structural hallucination)导致的性能瓶颈问题。现有方法要么将图转化为自然语言,造成token消耗过高和注意力分散,要么将图转换为连续嵌入(soft prompt),但与原始文本token存在严重对齐偏差。其解决方案的关键在于引入一个特殊标记SOG_k(Structure Of Graph),用于在统一token空间中完整表征图的拓扑结构;具体而言,提出一种拓扑感知的结构分词器(topology-aware structural tokenizer),将每个图拓扑映射为一个高选择性的单一token,并构建混合结构问答语料库以对齐新结构token与现有文本token。该方法使LLMs能够以简洁且准确的方式理解、生成和推理图结构信息,同时具备可解释性和一致性,在五个图级基准上相比基线提升9.9%至41.4%,并支持节点级任务的灵活扩展。
链接: https://arxiv.org/abs/2602.01771
作者: Jingyao Wu,Bin Lu,Zijun Di,Xiaoying Gan,Meng Jin,Luoyi Fu,Xinbing Wang,Chenghu Zhou
机构: Shanghai Jiao Tong University (上海交通大学); IGSNRR, Chinese Academy of Sciences (中国科学院地理科学与资源研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注:
Abstract:Large language models show great potential in unstructured data understanding, but still face significant challenges with graphs due to their structural hallucination. Existing approaches mainly either verbalize graphs into natural language, which leads to excessive token consumption and scattered attention, or transform graphs into trainable continuous embeddings (i.e., soft prompt), but exhibit severe misalignment with original text tokens. To solve this problem, we propose to incorporate one special token SOG_k to fully represent the Structure Of Graph within a unified token space, facilitating explicit topology input and structural information sharing. Specifically, we propose a topology-aware structural tokenizer that maps each graph topology into a highly selective single token. Afterwards, we construct a set of hybrid structure Question-Answering corpora to align new structural tokens with existing text tokens. With this approach, SOG_k empowers LLMs to understand, generate, and reason in a concise and accurate manner. Extensive experiments on five graph-level benchmarks demonstrate the superiority of our method, achieving a performance improvement of 9.9% to 41.4% compared to the baselines while exhibiting interpretability and consistency. Furthermore, our method provides a flexible extension to node-level tasks, enabling both global and local structural understanding. The codebase is publicly available at this https URL.
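“一个拓扑对应一个特殊 token”的接口可以用图哈希来直观模拟:对图做 Weisfeiler-Lehman 哈希,再映射到一段预留的 token id 区间。注意论文的结构分词器是可学习的,下面的哈希版本只复现其接口形态;预留 id 的偏移量为假设值。

```python
import networkx as nx

def sog_token(graph, k=4096, offset=151000):   # offset: assumed reserved ids
    """Toy stand-in: hash a topology to one of k reserved token ids."""
    h = nx.weisfeiler_lehman_graph_hash(graph)
    return offset + int(h, 16) % k

g = nx.cycle_graph(5)
print(sog_token(g))   # identical topologies map to the same token id
```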
zh
[NLP-77] PRISM: Parametrically Refactoring Inference for Speculative Sampling Draft Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)因自回归特性导致的解码速度慢的问题。现有基于推测解码(speculative decoding)的方法通过使用更大参数量的草稿模型(draft model)来提升预测质量,但这种趋势带来了显著的计算开销,难以在预测准确性和计算延迟之间取得平衡。论文提出PRISM架构,其核心创新在于将每一步预测计算分解到不同的参数集合中,重构草稿模型的计算路径,从而成功实现模型容量与推理成本的解耦。这一架构创新使得PRISM在保持极低草稿延迟的同时,获得更长的接受长度(acceptance length),显著提升了端到端解码吞吐量——实验证明其可使已高度优化的推理引擎加速超过2.6倍。
链接: https://arxiv.org/abs/2602.01762
作者: Xuliang Wang,Yuetao Chen,Maochan Zhen,Fang Liu,Xinzhou Zheng,Xingwu Liu,Hong Xu,Ming Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs), constrained by their auto-regressive nature, suffer from slow decoding. Speculative decoding methods have emerged as a promising solution to accelerate LLM decoding, attracting attention from both systems and AI research communities. Recently, the pursuit of better draft quality has driven a trend toward parametrically larger draft models, which inevitably introduces substantial computational overhead. While existing work attempts to balance the trade-off between prediction accuracy and compute latency, we address this fundamental dilemma through architectural innovation. We propose PRISM, which disaggregates the computation of each predictive step across different parameter sets, refactoring the computational pathways of draft models to successfully decouple model capacity from inference cost. Through extensive experiments, we demonstrate that PRISM outperforms all existing draft architectures, achieving exceptional acceptance lengths while maintaining minimal draft latency for superior end-to-end speedup. We also re-examine scaling laws with PRISM, revealing that PRISM scales more effectively with expanding data volumes than other draft architectures. Through rigorous and fair comparison, we show that PRISM boosts the decoding throughput of an already highly optimized inference engine by more than 2.6x.
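“把不同预测步的计算分摊到不同参数集合”这一思想,可借一个 Medusa 风格的多头草稿模块来直观说明:共享主干状态,每个推测步各有一套轻量头。这只是示意性替身,PRISM 对计算路径的真实重构方式以论文为准。

```python
import torch
import torch.nn as nn

class StepwiseDraft(nn.Module):
    """Medusa-style stand-in: one lightweight parameter set per draft step."""
    def __init__(self, dim=512, vocab=32000, steps=4):
        super().__init__()
        self.step_heads = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, vocab))
            for _ in range(steps))

    def forward(self, trunk_state):            # trunk_state: (B, dim)
        # One parallel pass proposes `steps` future tokens from shared state.
        return [head(trunk_state) for head in self.step_heads]

draft = StepwiseDraft()
logits_per_step = draft(torch.randn(2, 512))   # 4 proposals per sequence
```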
zh
[NLP-78] Zero2Text: Zero-Training Cross-Domain Inversion Attacks on Textual Embeddings
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中向量数据库引发的隐私风险问题,特别是针对嵌入 inversion 攻击(embedding inversion attack)所暴露的敏感信息泄露隐患。现有方法面临两大局限:基于优化的方法计算开销巨大,而基于对齐的方法依赖于可获取的目标域训练数据,难以在严格的黑盒和跨域场景下应用。解决方案的关键在于提出 Zero2Text,一个无需训练的框架,其核心创新是通过递归在线对齐机制(recursive online alignment),结合大语言模型(Large Language Model, LLM)先验与动态岭回归(dynamic ridge regression)策略,在不依赖静态数据集的情况下实时迭代对齐生成内容与目标嵌入,从而实现高效且通用的文本恢复,实验证明其在跨域场景下显著优于现有基线方法。
链接: https://arxiv.org/abs/2602.01757
作者: Doohyun Kim,Donghwa Kang,Kyungjae Lee,Hyeongboo Baek,Brent Byunghoon Kang
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages
Abstract:The proliferation of retrieval-augmented generation (RAG) has established vector databases as critical infrastructure, yet they introduce severe privacy risks via embedding inversion attacks. Existing paradigms face a fundamental trade-off: optimization-based methods require computationally prohibitive queries, while alignment-based approaches hinge on the unrealistic assumption of accessible in-domain training data. These constraints render them ineffective in strict black-box and cross-domain settings. To dismantle these barriers, we introduce Zero2Text, a novel training-free framework based on recursive online alignment. Unlike methods relying on static datasets, Zero2Text synergizes LLM priors with a dynamic ridge regression mechanism to iteratively align generation to the target embedding on-the-fly. We further demonstrate that standard defenses, such as differential privacy, fail to effectively mitigate this adaptive threat. Extensive experiments across diverse benchmarks validate Zero2Text; notably, on MS MARCO against the OpenAI victim model, it achieves 1.8x higher ROUGE-L and 6.4x higher BLEU-2 scores compared to baselines, recovering sentences from unknown domains without a single leaked data pair.
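在线岭回归对齐的一个假设性示意:用候选文本在本地嵌入空间与受害模型嵌入空间中的成对向量拟合线性映射,再以映射后与目标嵌入的相似度对新候选排序。local_embed 与受害方成对数据均为假设接口,摘要并未给出 Zero2Text 的精确公式。

```python
import numpy as np

def fit_ridge_map(X_local, Y_victim, lam=1.0):
    """Ridge-regression map from a local to the victim embedding space."""
    d = X_local.shape[1]
    return np.linalg.solve(X_local.T @ X_local + lam * np.eye(d),
                           X_local.T @ Y_victim)

def rank_candidates(cands, target_emb, local_embed, W):
    X = np.stack([local_embed(c) for c in cands]) @ W
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb)
    return [cands[i] for i in np.argsort(-(X @ t))]   # best-first
```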
zh
[NLP-79] WorldCup Sampling for Multi-bit LLM Watermarking
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)生成文本中多比特水印(multi-bit watermarking)的容量受限、信息流间接、解码效率低及生成质量下降等问题。现有方法多基于零比特方案通过种子驱动进行引导,导致水印嵌入与生成过程耦合不紧密,难以实现高效且鲁棒的信息编码与恢复。其解决方案的关键在于提出WorldCup框架,将采样过程视为天然通信信道,通过分层竞争机制直接将消息比特嵌入到token选择中,并利用互补信号进行引导;同时引入熵感知调制(entropy-aware modulation)以维持生成质量,并采用置信度感知解码(confidence-aware decoding)提升水印恢复的鲁棒性,从而在容量、可检测性、鲁棒性、文本质量和解码效率之间取得良好平衡。
链接: https://arxiv.org/abs/2602.01752
作者: Yidan Wang,Yubing Ren,Yanan Cao,Li Guo
机构: 未知
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
Abstract:As large language models (LLMs) generate increasingly human-like text, watermarking offers a promising solution for reliable attribution beyond mere detection. While multi-bit watermarking enables richer provenance encoding, existing methods largely extend zero-bit schemes through seed-driven steering, leading to indirect information flow, limited effective capacity, and suboptimal decoding. In this paper, we propose WorldCup, a multi-bit watermarking framework for LLMs that treats sampling as a natural communication channel and embeds message bits directly into token selection via a hierarchical competition mechanism guided by complementary signals. Moreover, WorldCup further adopts entropy-aware modulation to preserve generation quality and supports robust message recovery through confidence-aware decoding. Comprehensive experiments show that WorldCup achieves a strong balance across capacity, detectability, robustness, text quality, and decoding efficiency, consistently outperforming prior baselines and laying a solid foundation for future LLM watermarking studies.
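多比特水印“把消息位嵌入 token 选择”的机制,可用一个大幅简化的单层版本演示:按密钥哈希把候选 token 分入比特桶,仅在高熵步从与当前消息位一致的桶中采样。这只是对论文分层竞争与熵感知调制的粗略替身。

```python
import hashlib
import numpy as np

def watermark_step(probs, bit, key, step, entropy_gate=1.0):
    """Toy one-level stand-in for hierarchical multi-bit embedding."""
    entropy = -(probs * np.log(probs + 1e-12)).sum()
    if entropy < entropy_gate:                 # low entropy: skip embedding
        return int(probs.argmax())
    def bucket(tok):                           # keyed hash -> bit bucket
        digest = hashlib.sha256(f"{key}:{step}:{tok}".encode()).digest()
        return digest[0] & 1
    mask = np.array([bucket(t) == bit for t in range(len(probs))])
    gated = np.where(mask, probs, 0.0)
    if gated.sum() == 0:
        return int(probs.argmax())
    gated /= gated.sum()
    return int(np.random.default_rng(step).choice(len(probs), p=gated))
```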
zh
[NLP-80] Enhancing Automated Essay Scoring with Three Techniques: Two-Stage Fine-Tuning Score Alignment and Self-Training
【速读】: 该论文旨在解决自动化作文评分(Automated Essay Scoring, AES)系统在真实场景中因标注数据极度稀缺而导致性能受限的问题。其核心解决方案在于提出三种关键技术:一是两阶段微调策略,利用低秩适应(Low-Rank Adaptation, LoRA)提升模型对目标提示作文的适配能力;二是分数对齐(Score Alignment)技术,增强预测分数分布与真实分数分布之间的一致性;三是基于不确定性的自训练机制,通过未标注数据生成伪标签样本以扩充训练集,同时抑制标签噪声传播。这三项技术在DualBERT框架上协同作用,在极小数据量下显著提升AES性能,且在全数据设置下亦能取得当前最优结果。
链接: https://arxiv.org/abs/2602.01747
作者: Hongseok Choi,Serynn Kim,Wencke Liermann,Jin Seong,Jin-Xia Huang
机构: ETRI(电子和电信研究院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 22 pages, 4 figures
Abstract:Automated Essay Scoring (AES) plays a crucial role in education by providing scalable and efficient assessment tools. However, in real-world settings, the extreme scarcity of labeled data severely limits the development and practical adoption of robust AES systems. This study proposes a novel approach to enhance AES performance in both limited-data and full-data settings by introducing three key techniques. First, we introduce a Two-Stage fine-tuning strategy that leverages low-rank adaptations to better adapt an AES model to target prompt essays. Second, we introduce a Score Alignment technique to improve consistency between predicted and true score distributions. Third, we employ uncertainty-aware self-training using unlabeled data, effectively expanding the training set with pseudo-labeled samples while mitigating label noise propagation. We implement the above three key techniques on DualBERT. We conduct extensive experiments on the ASAP++ dataset. As a result, in the 32-data setting, all three key techniques improve performance, and their integration achieves 91.2% of the full-data performance trained on approximately 1,000 labeled samples. In addition, the proposed Score Alignment technique consistently improves performance in both limited-data and full-data settings: e.g., it achieves state-of-the-art results in the full-data setting when integrated into DualBERT.
zh
[NLP-81] SafePred: A Predictive Guardrail for Computer-Using Agents via World Models
【速读】: 该论文旨在解决当前计算机使用代理(Computer-using Agents, CUAs)在复杂现实环境中面临的长期风险问题。现有防护机制多采用反应式策略,仅在当前观测空间内约束代理行为,虽能防范即时短时风险(如点击钓鱼链接),但无法识别因看似合理的行为延迟引发的高风险后果(如清理日志导致未来审计不可追溯)。为此,作者提出一种预测性防护框架 SafePred,其核心在于建立“风险-决策”闭环,将预测的未来风险与当前决策对齐。关键创新在于:(1) 基于安全策略和世界模型预测短中期及长期风险,生成语义化的风险表征以修剪高风险动作;(2) 通过步骤级干预与任务级重规划,将预测风险转化为可执行的安全决策指导,从而实现主动规避长期风险并提升任务效用。
链接: https://arxiv.org/abs/2602.01725
作者: Yurun Chen,Zeyi Liao,Ping Yin,Taotao Xie,Keting Yin,Shengyu Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:With the widespread deployment of Computer-using Agents (CUAs) in complex real-world environments, prevalent long-term risks often lead to severe and irreversible consequences. Most existing guardrails for CUAs adopt a reactive approach, constraining agent behavior only within the current observation space. While these guardrails can prevent immediate short-term risks (e.g., clicking on a phishing link), they cannot proactively avoid long-term risks: seemingly reasonable actions can lead to high-risk consequences that emerge with a delay (e.g., cleaning logs leads to future audits being untraceable), which reactive guardrails cannot identify within the current observation space. To address these limitations, we propose a predictive guardrail approach, with the core idea of aligning predicted future risks with current decisions. Based on this approach, we present SafePred, a predictive guardrail framework for CUAs that establishes a risk-to-decision loop to ensure safe agent behavior. SafePred supports two key abilities: (1) Short- and long-term risk prediction: by using safety policies as the basis for risk prediction, SafePred leverages the prediction capability of the world model to generate semantic representations of both short-term and long-term risks, thereby identifying and pruning actions that lead to high-risk states; (2) Decision optimization: translating predicted risks into actionable safe decision guidances through step-level interventions and task-level re-planning. Extensive experiments show that SafePred significantly reduces high-risk behaviors, achieving over 97.6% safety performance and improving task utility by up to 21.4% compared with reactive baselines.
zh
[NLP-82] COMI: Coarse-to-fine Context Compression via Marginal Information Gain ICLR2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长文本场景下部署时面临的计算效率低下和信息冗余问题。现有方法难以在高压缩率下同时保持语义相关性和多样性,导致性能显著下降。解决方案的核心是提出一种粗粒度到细粒度的自适应上下文压缩框架 COMI,其关键创新在于引入边际信息增益(Marginal Information Gain, MIG)这一指标,定义为某单元对输入查询的相关性减去其与其他单元的语义冗余度,从而指导压缩过程优先保留高相关且低冗余的信息。COMI 分两阶段执行:首先基于组间 MIG 动态分配压缩率,实现压缩预算与信息价值分布的匹配;其次在组内通过 MIG 加权融合 token,有效保留关键语义并抑制冗余累积。实验表明,该方法在多个问答与摘要任务中显著优于现有基线,例如在 Qwen2-7B 模型上对 NaturalQuestions 数据集实现 32 倍压缩率下约 25 点 Exact Match(EM)提升。
链接: https://arxiv.org/abs/2602.01719
作者: Jiwei Tang,Shilei Liu,Zhicheng Zhang,Yujin Yuan,Libin Zheng,Wenbo Su,Bo Zheng
机构: Tsinghua University (清华大学); Future Living Lab of Alibaba (阿里巴巴未来生活实验室); Sun Yat-sen University (中山大学)
类目: Computation and Language (cs.CL)
备注: Accepted at ICLR 2026
Abstract:Large Language Models (LLMs) have demonstrated exceptional capabilities across diverse tasks. However, their deployment in long context scenarios remains hindered by computational inefficiency and information redundancy. Context compression methods address these challenges by significantly reducing input length and eliminating redundancy. We propose COMI, a coarse-to-fine adaptive context compression framework that jointly optimizes for semantic relevance and diversity under high compression rates. We introduce Marginal Information Gain (MIG), a metric defined as the relevance of a unit to the input query minus its semantic redundancy with other units, guiding the compression process to prioritize information that is both relevant and low redundant. The framework operates in two stages: (1) Coarse-Grained Group Reallocation, where the context is partitioned into groups and dynamically assigned compression rates based on inter-group MIG, ensuring compression budgets align with information value distribution; and (2) Fine-Grained Token Merging, where tokens within each group are fused via an intra-group MIG-based weighting mechanism, thereby preserving key semantics while avoiding the accumulation of redundancy. Extensive experiments across question-answering (e.g., NaturalQuestions, 2WikiMQA, HotpotQA and NarrativeQA), summarization (e.g., MultiNews) with various backbones (e.g., LLaMA-2-7B, Qwen2-7B) show that COMI outperforms existing baselines by a large margin, e.g., approximately 25-point Exact Match (EM) improvement under 32x compression constraint with Qwen2-7B on NaturalQuestions.
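按摘要中的定义,边际信息增益(MIG)等于单元与查询的相关性减去与其他单元的冗余度。下面以余弦相似度实现一个直接版本(嵌入模型为假设前提),并按 MIG 从高到低挑选单元:

```python
import numpy as np

def mig_scores(query_emb, unit_embs):
    """MIG = relevance to the query minus redundancy with other units."""
    q = query_emb / np.linalg.norm(query_emb)
    U = unit_embs / np.linalg.norm(unit_embs, axis=1, keepdims=True)
    relevance = U @ q
    sims = U @ U.T
    np.fill_diagonal(sims, -np.inf)
    redundancy = sims.max(axis=1)      # similarity to the nearest other unit
    return relevance - redundancy

rng = np.random.default_rng(0)
scores = mig_scores(rng.normal(size=64), rng.normal(size=(10, 64)))
keep = np.argsort(scores)[::-1][:4]    # prioritize high-MIG units
```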
zh
[NLP-83] BBPE16: UTF-16-based byte-level byte-pair encoding for improved multilingual speech recognition ICASSP2026
【速读】: 该论文旨在解决多语言自动语音识别(ASR)中因使用基于UTF-8的字节级BPE(BBPE)分词器而导致非拉丁文字符(如中文、日文和韩文,即CJK)token序列过长的问题。这种变长编码会显著增加计算负载与内存消耗,影响模型效率。其解决方案的关键在于提出BBPE16——一种基于UTF-16的BBPE分词器,通过统一使用2字节码单元表示大多数现代书写系统,在保持BBPE语言无关特性的同时大幅提高跨语言token共享能力,从而在不牺牲识别准确率的前提下降低token数量(最高减少10.4%)并减少解码迭代次数(最高降低10.3%),提升训练与推理速度并降低内存占用。
链接: https://arxiv.org/abs/2602.01717
作者: Hyunsik Kim,Haeri Kim,Munhak Lee,Kyungmin Lee
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: accepted to ICASSP 2026
Abstract:Multilingual automatic speech recognition (ASR) requires tokenization that efficiently covers many writing systems. Byte-level BPE (BBPE) using UTF-8 is widely adopted for its language-agnostic design and full Unicode coverage, but its variable-length encoding inflates token sequences for non-Latin scripts, such as Chinese, Japanese, and Korean (CJK). Longer sequences increase computational load and memory use. We propose BBPE16, a UTF-16-based BBPE tokenizer that represents most modern scripts with a uniform 2-byte code unit. BBPE16 preserves BBPE’s language-agnostic properties while substantially improving cross-lingual token sharing. Across monolingual, bilingual, and trilingual ASR, and in a multilingual continual-learning setup, BBPE16 attains comparable or better accuracy; for Chinese, it reduces token counts by up to 10.4% and lowers decoding iterations by up to 10.3%. These reductions speed up fine-tuning and inference and decrease memory usage, making BBPE16 a practical tokenization choice for multilingual ASR.
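UTF-16 之所以对 CJK 友好,可用几行代码直接验证:多数 CJK 字符在 UTF-8 下占 3 字节,而在 UTF-16 下只占一个 2 字节码单元,因此以 2 字节码单元为基础字母表的 BBPE16 起始序列更短、更均匀:

```python
def byte_units(text):
    utf8_len = len(text.encode("utf-8"))
    utf16 = text.encode("utf-16-le")
    # Group UTF-16 bytes into 2-byte code units (BBPE16's base alphabet).
    units = [utf16[i:i + 2] for i in range(0, len(utf16), 2)]
    return utf8_len, len(units)

for s in ["hello", "你好世界", "こんにちは"]:
    n8, n16 = byte_units(s)
    print(f"{s!r}: {n8} UTF-8 bytes vs {n16} UTF-16 code units")
```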
zh
[NLP-84] Mechanistic Indicators of Steering Effectiveness in Large Language Models
【速读】: 该论文旨在解决激活引导(activation-based steering)在大型语言模型(Large Language Models, LLMs)中成功率不可预测的问题,即缺乏对干预行为是否成功的机制性理解。现有方法主要依赖黑盒输出或由LLM生成的评判标准,难以揭示其内在失效机制。论文的关键解决方案在于引入两种信息论度量:基于熵的归一化分支因子(Normalized Branching Factor, NBF)和引导激活与词汇空间中目标概念之间的Kullback-Leibler(KL)散度,通过分析这些内部信号来诊断引导的有效性。研究发现,有效的激活引导对应于解码过程中结构化熵的保持和KL散度的一致对齐,从而能够高精度预测引导成功与否并估计失败概率,为对比激活添加(Contrastive Activation Addition, CAA)和稀疏自编码器(Sparse Autoencoder)引导提供了更强的评估基准。
链接: https://arxiv.org/abs/2602.01716
作者: Mehdi Jafari,Hao Xue,Flora Salim
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Activation-based steering enables Large Language Models (LLMs) to exhibit targeted behaviors by intervening on intermediate activations without retraining. Despite its widespread use, the mechanistic factors that govern when steering succeeds or fails remain poorly understood, as prior work has relied primarily on black-box outputs or LLM-based judges. In this study, we investigate whether the reliability of steering can be diagnosed using internal model signals. We focus on two information-theoretic measures: the entropy-derived Normalized Branching Factor (NBF), and the Kullback-Leibler (KL) divergence between steered activations and targeted concepts in the vocabulary space. We hypothesize that effective steering corresponds to structured entropy preservation and coherent KL alignment across decoding steps. Building on a reliability study demonstrating high inter-judge agreement between two architecturally distinct LLMs, we use LLM-generated annotations as ground truth and show that these mechanistic signals provide meaningful predictive power for identifying successful steering and estimating failure probability. We further introduce a stronger evaluation baseline for Contrastive Activation Addition (CAA) and Sparse Autoencoder-based steering, the two most widely adopted activation-steering methods.
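摘要中的两个内部信号可按如下方式粗略计算:分支因子取下一词熵的指数(困惑度式的“有效选择数”),KL 散度比较引导后分布与目标概念在词表上的分布。论文的归一化细节可能不同,此处仅给出一种自然形式:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def branching_factor(logits):
    """exp(entropy): the effective number of next-token choices."""
    p = softmax(logits)
    return float(np.exp(-(p * np.log(p + 1e-12)).sum()))

def concept_kl(steered_logits, concept_logits):
    p, q = softmax(steered_logits), softmax(concept_logits)
    return float((p * np.log((p + 1e-12) / (q + 1e-12))).sum())

logits = np.random.default_rng(1).normal(size=50257)
print(branching_factor(logits), concept_kl(logits, 0.5 * logits))
```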
[NLP-85] MedAraBench: Large-Scale Arabic Medical Question Answering Dataset and Benchmark
[Quick Read]: This paper addresses the severe scarcity of Arabic resources in natural language processing (NLP) research, especially for medical applications, which limits the evaluation and advancement of the multilingual capabilities of Large Language Models (LLMs). The key to the solution is MedAraBench, a large-scale, high-quality Arabic medical multiple-choice question-answering dataset spanning 19 medical specialties and five difficulty levels, built by manually digitizing academic materials created by medical professionals in the Arabic-speaking region and then carefully preprocessed and split to ensure diversity and usability. Data quality is validated with two frameworks, expert human evaluation and LLM-as-a-judge, and eight state-of-the-art open-source and proprietary models are benchmarked, revealing the limitations of current models in Arabic medical settings and motivating multilingual LLMs suitable for clinical deployment.
Link: https://arxiv.org/abs/2602.01714
Authors: Mouath Abu-Daoud, Leen Kharouf, Omar El Hajj, Dana El Samad, Mariam Al-Omari, Jihad Mallat, Khaled Saleh, Nizar Habash, Farah E. Shamout
Affiliations: New York University Abu Dhabi; Cleveland Clinic Abu Dhabi
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Arabic remains one of the most underrepresented languages in natural language processing research, particularly in medical applications, due to the limited availability of open-source data and benchmarks. The lack of resources hinders efforts to evaluate and advance the multilingual capabilities of Large Language Models (LLMs). In this paper, we introduce MedAraBench, a large-scale dataset consisting of Arabic multiple-choice question-answer pairs across various medical specialties. We constructed the dataset by manually digitizing a large repository of academic materials created by medical professionals in the Arabic-speaking region. We then conducted extensive preprocessing and split the dataset into training and test sets to support future research efforts in the area. To assess the quality of the data, we adopted two frameworks, namely expert human evaluation and LLM-as-a-judge. Our dataset is diverse and of high quality, spanning 19 specialties and five difficulty levels. For benchmarking purposes, we assessed the performance of eight state-of-the-art open-source and proprietary models, such as GPT-5, Gemini 2.0 Flash, and Claude 4-Sonnet. Our findings highlight the need for further domain-specific enhancements. We release the dataset and evaluation scripts to broaden the diversity of medical data benchmarks, expand the scope of evaluation suites for LLMs, and enhance the multilingual capabilities of models for deployment in clinical settings.
[NLP-86] ARTIS: Agentic Risk-Aware Test-Time Scaling via Iterative Simulation
[Quick Read]: This paper addresses the inadequacy of current test-time scaling (TTS) techniques in agentic settings, where an agent's actions interact directly with external environments and their effects can be irreversible and costly. Although TTS improves Large Language Model (LLM) performance by allocating extra inference-time compute, it offers little guarantee of action-level reliability and robustness. The key to the solution is ARTIS (Agentic Risk-Aware Test-Time Scaling via Iterative Simulation), a framework whose core innovation is decoupling exploration from commitment through simulated interactions: multiple rounds of test-time reasoning are run in a simulated environment before real execution, improving decision quality without incurring environmental risk. The authors further show that naive LLM-based simulators fail to capture rare but high-impact failure modes, and therefore introduce a risk-aware tool simulator that improves fidelity on failure-inducing actions via targeted data generation and rebalanced training; experiments confirm that this design substantially improves reliability and consistency on multi-turn, multi-step agentic tasks.
Link: https://arxiv.org/abs/2602.01709
Authors: Xingshan Zeng, Lingzhi Wang, Weiwen Liu, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract: Current test-time scaling (TTS) techniques enhance large language model (LLM) performance by allocating additional computation at inference time, yet they remain insufficient for agentic settings, where actions directly interact with external environments and their effects can be irreversible and costly. We propose ARTIS (Agentic Risk-Aware Test-Time Scaling via Iterative Simulation), a framework that decouples exploration from commitment by enabling test-time exploration through simulated interactions prior to real-world execution. This design allows extending inference-time computation to improve action-level reliability and robustness without incurring environmental risk. We further show that naive LLM-based simulators struggle to capture rare but high-impact failure modes, substantially limiting their effectiveness for agentic decision making. To address this limitation, we introduce a risk-aware tool simulator that emphasizes fidelity on failure-inducing actions via targeted data generation and rebalanced training. Experiments on multi-turn and multi-step agentic benchmarks demonstrate that iterative simulation substantially improves agent reliability, and that risk-aware simulation is essential for consistently realizing these gains across models and tasks.
[NLP-87] Game of Thought: Robust Information Seeking with Large Language Models Using Game Theory ICML2026
[Quick Read]: This paper addresses Large Language Models' (LLMs) limited ability to actively acquire missing information in under-specified scenarios, where existing methods rely on simplifying assumptions that degrade worst-case performance, a serious concern in high-stakes applications. The key to the solution is Game of Thought (GoT), a game-theoretic framework that formalizes information seeking as a two-player zero-sum extensive-form game and approximates a Nash equilibrium (NE) strategy to optimize decision making, consistently improving LLMs' worst-case information-seeking performance over both direct prompting-based methods and heuristic-guided search methods.
Link: https://arxiv.org/abs/2602.01708
Authors: Langyuan Cui, Chun Kai Ling, Hwee Tou Ng
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: 23 pages, 10 figures, under review at ICML 2026
Abstract:Large Language Models (LLMs) are increasingly deployed in real-world scenarios where they may lack sufficient information to complete a given task. In such settings, the ability to actively seek out missing information becomes a critical capability. Existing approaches to enhancing this ability often rely on simplifying assumptions that degrade \textitworst-case performance. This is an issue with serious implications in high-stakes applications. In this work, we use the game of Twenty Questions to evaluate the information-seeking ability of LLMs. We introduce and formalize its adversarial counterpart, the Strategic Language Search (SLS) problem along with its variants as a two-player zero-sum extensive form game. We propose Game of Thought (GoT), a framework that applies game-theoretic techniques to approximate a Nash equilibrium (NE) strategy for the restricted variant of the game. Empirical results demonstrate that our approach consistently improves worst-case performance compared to (1) direct prompting-based methods and (2) heuristic-guided search methods across all tested settings.
[NLP-88] AGT^AO: Robust and Stabilized LLM Unlearning via Adversarial Gating Training with Adaptive Orthogonality
[Quick Read]: This paper addresses the privacy and security risks arising from Large Language Models (LLMs) unintentionally memorizing sensitive data, and in particular the dilemma existing machine unlearning methods face between thorough erasure and preserved utility: aggressive unlearning induces catastrophic forgetting that severely degrades utility, while conservative strategies achieve only superficial forgetting and leave models vulnerable to adversarial recovery. The key to the solution is a unified framework, AGT^AO (Adversarial Gating Training with Adaptive Orthogonality), with two core innovations: (1) Adaptive Orthogonality (AO), which dynamically mitigates geometric gradient conflicts between the forgetting and retention objectives to minimize unintended knowledge degradation; and (2) Adversarial Gating Training (AGT), which formulates unlearning as a latent-space min-max game and uses a curriculum-based gating mechanism to simulate and counter internal recovery attempts, jointly achieving efficient, robust erasure together with preserved model performance.
Link: https://arxiv.org/abs/2602.01703
Authors: Pengyu Li, Lingling Zhang, Zhitao Gao, Yanrui Wu, Yuxuan Dong, Huan Liu, Bifan Wei, Jun Liu
Affiliations: Xi'an Jiaotong University; MOE KLINNS Lab; Shaanxi Province Key Laboratory of Big Data Knowledge Engineering
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract: While Large Language Models (LLMs) have achieved remarkable capabilities, they unintentionally memorize sensitive data, posing critical privacy and security risks. Machine unlearning is pivotal for mitigating these risks, yet existing paradigms face a fundamental dilemma: aggressive unlearning often induces catastrophic forgetting that degrades model utility, whereas conservative strategies risk superficial forgetting, leaving models vulnerable to adversarial recovery. To address this trade-off, we propose AGT^AO (Adversarial Gating Training with Adaptive Orthogonality), a unified framework designed to reconcile robust erasure with utility preservation. Specifically, our approach introduces Adaptive Orthogonality (AO) to dynamically mitigate geometric gradient conflicts between forgetting and retention objectives, thereby minimizing unintended knowledge degradation. Concurrently, Adversarial Gating Training (AGT) formulates unlearning as a latent-space min-max game, employing a curriculum-based gating mechanism to simulate and counter internal recovery attempts. Extensive experiments demonstrate that AGT^AO achieves a superior trade-off between unlearning efficacy (KUR ≈ 0.01) and model utility (MMLU 58.30). Code is available at this https URL.
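The abstract does not spell out the AO update rule, so the following is only a hedged sketch of the generic gradient-deconfliction operation such methods build on (a PCGrad-style projection): when the forgetting gradient opposes the retention gradient, its conflicting component is projected out. The toy gradients are illustrative:

```python
import torch

def deconflict(g_forget: torch.Tensor, g_retain: torch.Tensor) -> torch.Tensor:
    """If the forget gradient opposes the retain gradient, remove its
    component along the retain direction; otherwise leave it untouched."""
    dot = torch.dot(g_forget, g_retain)
    if dot < 0:  # geometric conflict between the two objectives
        g_forget = g_forget - dot / g_retain.norm().pow(2) * g_retain
    return g_forget

g_f = torch.tensor([1.0, -2.0])   # toy forgetting gradient
g_r = torch.tensor([0.5, 1.0])    # toy retention gradient
g = deconflict(g_f, g_r)
print(g, torch.dot(g, g_r))       # projected gradient no longer opposes retention
```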
[NLP-89] Restoring Exploration after Post-Training: Latent Exploration Decoding for Large Reasoning Models
[Quick Read]: This paper addresses the exploration collapse that Large Reasoning Models (LRMs) exhibit after reinforcement learning (RL) post-training, where temperature-based sampling no longer improves pass@n accuracy. The study finds that post-trained models show sharply reduced entropy in the final-layer posterior while intermediate layers retain relatively high entropy, an entropy asymmetry. The key to the solution is Latent Exploration Decoding (LED), a depth-conditioned decoding strategy requiring no additional training or parameters: it aggregates intermediate-layer posteriors via cumulative sum and selects the depth configurations with maximal entropy as exploration candidates, restoring the model's exploration ability and improving reasoning accuracy.
Link: https://arxiv.org/abs/2602.01698
Authors: Wenhui Tan, Fiorenzo Parascandolo, Enver Sangineto, Jianzhong Ju, Zhenbo Luo, Qian Cao, Rita Cucchiara, Ruihua Song, Jian Luan
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract: Large Reasoning Models (LRMs) have recently achieved strong mathematical and code reasoning performance through Reinforcement Learning (RL) post-training. However, we show that modern reasoning post-training induces an unintended exploration collapse: temperature-based sampling no longer increases pass@n accuracy. Empirically, the final-layer posteriors of post-trained LRMs exhibit sharply reduced entropy, while the entropy of intermediate layers remains relatively high. Motivated by this entropy asymmetry, we propose Latent Exploration Decoding (LED), a depth-conditioned decoding strategy. LED aggregates intermediate posteriors via cumulative sum and selects depth configurations with maximal entropy as exploration candidates. Without additional training or parameters, LED consistently improves pass@1 and pass@16 accuracy by 0.61 and 1.03 percentage points across multiple reasoning benchmarks and models. Project page: this https URL.
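A minimal sketch of the decoding-side computation as the abstract describes it: per-layer posteriors are aggregated by cumulative sum and the depth with maximal entropy is taken as the exploration candidate. The random per-layer posteriors stand in for logit-lens-style readouts of a real model:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_layers, vocab = 12, 50
layer_logits = torch.randn(n_layers, vocab)          # stand-in per-layer readouts
layer_post = F.softmax(layer_logits, dim=-1)

# Cumulative-sum aggregation: candidate d mixes posteriors of layers 0..d.
cum = layer_post.cumsum(dim=0)
cand = cum / cum.sum(dim=-1, keepdim=True)           # renormalize each depth

ent = -(cand * cand.clamp_min(1e-12).log()).sum(-1)  # entropy per depth
best_depth = int(ent.argmax())
print(f"max-entropy depth: {best_depth}, entropy: {ent[best_depth]:.3f}")
# Sampling from cand[best_depth] would serve as the exploration candidate.
```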
[NLP-90] Counting Hypothesis: Potential Mechanism of In-Context Learning
[Quick Read]: This paper tackles the fact that, although in-context learning (ICL) lets Large Language Models (LLMs) perform specific tasks from a few examples in the input prompt without modifying their internal structure, the underlying mechanism remains poorly understood, making error correction and diagnosis extremely challenging. The key contribution is the "counting hypothesis", which posits that the encoding strategy of LLMs may be the core mechanism underlying ICL, together with supporting empirical evidence, offering a new theoretical lens on the nature and limitations of ICL.
Link: https://arxiv.org/abs/2602.01687
Authors: Jung H. Lee, Sujith Vijayan
Affiliations: Pacific Northwest National Laboratory; School of Neuroscience, Virginia Tech
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages, 7 main Figures, 1 Table and 6 Supp. Figures
Abstract: In-Context Learning (ICL) indicates that large language models (LLMs) pretrained on a massive amount of data can learn specific tasks from input prompts' examples. ICL is notable for two reasons. First, it does not need modification of LLMs' internal structure. Second, it enables LLMs to perform a wide range of tasks/functions with a few examples demonstrating a desirable task. ICL opens up new ways to utilize LLMs in more domains, but its underlying mechanisms still remain poorly understood, making error correction and diagnosis extremely challenging. Thus, it is imperative that we better understand the limitations of ICL and how exactly LLMs support ICL. Inspired by ICL properties and LLMs' functional modules, we propose 'the counting hypothesis' of ICL, which suggests that LLMs' encoding strategy may underlie ICL, and provide supporting evidence.
[NLP-91] Scaling Search-Augmented LLM Reasoning via Adaptive Information Control
[Quick Read]: This paper addresses the redundant evidence, context saturation, and unstable learning caused by uncontrolled retrieval in search-augmented reasoning agents that interleave multi-step reasoning with external information retrieval. Existing methods rely on outcome-based reinforcement learning (RL), which offers limited guidance for regulating information acquisition. The key to the solution is DeepControl, a framework built on a formal notion of information utility, the marginal value of retrieved evidence under a given reasoning state, with retrieval continuation and granularity control mechanisms that dynamically regulate when to continue or stop retrieval and how much information to expand; an annealed control strategy lets the agent internalize effective information-acquisition behaviors during training. Experiments on seven benchmarks show consistent gains over strong baselines, with average improvements of 9.4% and 8.6% on Qwen2.5-7B and Qwen2.5-3B over strong outcome-based RL baselines, underscoring the importance of adaptive information control for scaling search-augmented reasoning agents to complex, real-world information environments.
Link: https://arxiv.org/abs/2602.01672
Authors: Siheng Xiong, Oguzhan Gungordu, Blair Johnson, James C. Kerce, Faramarz Fekri
Affiliations: Georgia Institute of Technology
Subjects: Computation and Language (cs.CL)
Comments: Work in progress
Abstract:Search-augmented reasoning agents interleave multi-step reasoning with external information retrieval, but uncontrolled retrieval often leads to redundant evidence, context saturation, and unstable learning. Existing approaches rely on outcome-based reinforcement learning (RL), which provides limited guidance for regulating information acquisition. We propose DeepControl, a framework for adaptive information control based on a formal notion of information utility, which measures the marginal value of retrieved evidence under a given reasoning state. Building on this utility, we introduce retrieval continuation and granularity control mechanisms that selectively regulate when to continue and stop retrieval, and how much information to expand. An annealed control strategy enables the agent to internalize effective information acquisition behaviors during training. Extensive experiments across seven benchmarks demonstrate that our method consistently outperforms strong baselines. In particular, our approach achieves average performance improvements of 9.4% and 8.6% on Qwen2.5-7B and Qwen2.5-3B, respectively, over strong outcome-based RL baselines, and consistently outperforms both retrieval-free and retrieval-based reasoning methods without explicit information control. These results highlight the importance of adaptive information control for scaling search-augmented reasoning agents to complex, real-world information environments.
[NLP-92] CoDiQ: Test-Time Scaling for Controllable Difficult Question Generation
[Quick Read]: This paper addresses the bottlenecks of existing automated question synthesis: imprecise difficulty control, high computational cost, and difficulty in generating competition-level questions at scale. The key to the solution is CoDiQ (Controllable Difficult Question Generation), a framework that enables fine-grained difficulty control via test-time scaling while ensuring question solvability; building on it, CoDiQ-Generator, developed from Qwen3-8B, substantially raises the upper bound of a model's ability to generate valid, high-difficulty questions, effectively supporting the training of Large Reasoning Models (LRMs).
Link: https://arxiv.org/abs/2602.01660
Authors: Zhongyuan Peng, Caijun Xu, Changyi Xiao, Shibo Hong, Eli Zhang, Stephen Huang, Yixin Cao
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 tables, 5 figures
Abstract:Large Reasoning Models (LRMs) benefit substantially from training on challenging competition-level questions. However, existing automated question synthesis methods lack precise difficulty control, incur high computational costs, and struggle to generate competition-level questions at scale. In this paper, we propose CoDiQ (Controllable Difficult Question Generation), a novel framework enabling fine-grained difficulty control via test-time scaling while ensuring question solvability. Specifically, first, we identify a test-time scaling tendency (extended reasoning token budget boosts difficulty but reduces solvability) and the intrinsic properties defining the upper bound of a model’s ability to generate valid, high-difficulty questions. Then, we develop CoDiQ-Generator from Qwen3-8B, which improves the upper bound of difficult question generation, making it particularly well-suited for challenging question construction. Building on the CoDiQ framework, we build CoDiQ-Corpus (44K competition-grade question sequences). Human evaluations show these questions are significantly more challenging than LiveCodeBench/AIME with over 82% solvability. Training LRMs on CoDiQ-Corpus substantially improves reasoning performance, verifying that scaling controlled-difficulty training questions enhances reasoning capabilities. We open-source CoDiQ-Corpus, CoDiQ-Generator, and implementations to support related research.
[NLP-93] Steering Vector Fields for Context-Aware Inference-Time Control in Large Language Models
[Quick Read]: This paper addresses the reliability of inference-time control of Large Language Models (LLMs). Steering vectors (SVs), though lightweight and a practical middle ground between prompting and fine-tuning, suffer from unsteerable concepts, reversed or weakened effects on a non-trivial fraction of inputs, and degraded performance in long-form generation and multi-attribute control. The root cause is that a static SV implicitly assumes the concept-improving direction is globally constant across representation space, whereas in practice this direction varies with context, misaligning the intervention. The key to the solution is Steering Vector Fields (SVF), which learns a differentiable concept scoring function whose local gradient defines a context-dependent steering direction at each activation, enabling precise, coherent, and scalable coordinated multi-layer interventions and markedly improving the strength and stability of control.
Link: https://arxiv.org/abs/2602.01654
Authors: Jiaqian Li, Yanshu Li, Kuan-Hao Huang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Steering vectors (SVs) offer a lightweight way to control large language models (LLMs) at inference time by shifting hidden activations, providing a practical middle ground between prompting and fine-tuning. Yet SVs can be unreliable in practice. Some concepts are unsteerable, and even when steering helps on average it can backfire for a non-trivial fraction of inputs. Reliability also degrades in long-form generation and multi-attribute steering. We take a geometric view of these failures. A static SV applies the same update vector everywhere in representation space, implicitly assuming that the concept-improving direction is constant across contexts. When the locally effective direction varies with the current activation, a single global vector can become misaligned, which yields weak or reversed effects. Guided by this perspective, we propose Steering Vector Fields (SVF), which learns a differentiable concept scoring function whose local gradient defines the steering direction at each activation, making interventions explicitly context-dependent. This formulation supports coordinated multi-layer interventions in a shared, aligned concept space, and enables efficient long-form and multi-attribute control within a unified framework. Across multiple LLMs and steering tasks, SVF delivers stronger and more reliable control, improving the practicality of inference-time steering.
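A minimal sketch of the central mechanism, under the assumption that the concept scorer is a small MLP: the steering direction at each activation is the local gradient of the score, so the update varies with context instead of being one fixed vector (the paper's training objective and multi-layer alignment are not shown):

```python
import torch
import torch.nn as nn

d_model = 16
# Hypothetical differentiable concept scoring function s(h).
scorer = nn.Sequential(nn.Linear(d_model, 32), nn.Tanh(), nn.Linear(32, 1))

def steer(h: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Move h along the local gradient of the concept score."""
    h = h.detach().requires_grad_(True)
    score = scorer(h).sum()
    (grad,) = torch.autograd.grad(score, h)
    return h.detach() + alpha * grad   # context-dependent steering step

h = torch.randn(d_model)
h_steered = steer(h)
# For small alpha the concept score typically increases after steering.
print(scorer(h).item(), scorer(h_steered).item())
```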
[NLP-94] A2Eval: Agentic and Automated Evaluation for Embodied Brain
[Quick Read]: This paper addresses the severe redundancy, uneven coverage, heavy resource consumption, and ranking bias in current embodied vision-language model (VLM) evaluation, which relies on static, expert-defined, manually annotated benchmarks. The key to the solution is A2Eval, the first agentic automated evaluation framework, built on two collaborating agents: a Data Agent that autonomously induces capability dimensions and assembles a balanced, compact evaluation suite, and an Eval Agent that synthesizes and validates executable evaluation pipelines, enabling fully autonomous, high-fidelity assessment. Validated across 10 benchmarks and 13 models, A2Eval compresses evaluation suites by 85%, cuts computational cost by 77%, and delivers a 4.6x speedup, while correcting systematic ranking biases and improving human alignment (Spearman correlation of 0.85), establishing a new standard for low-cost, high-fidelity embodied assessment.
Link: https://arxiv.org/abs/2602.01640
Authors: Shuai Zhang, Jiayu Hu, Zijie Chen, Zeyuan Ding, Yi Zhang, Yingji Zhang, Ziyi Zhou, Junwei Liao, Shengjie Zhou, Yong Dai, Zhenzhong Lan, Xiaozhu Ju
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Current embodied VLM evaluation relies on static, expert-defined, manually annotated benchmarks that exhibit severe redundancy and coverage imbalance. This labor intensive paradigm drains computational and annotation resources, inflates costs, and distorts model rankings, ultimately stifling iterative development. To address this, we propose Agentic Automatic Evaluation (A2Eval), the first agentic framework that automates benchmark curation and evaluation through two collaborative agents. The Data Agent autonomously induces capability dimensions and assembles a balanced, compact evaluation suite, while the Eval Agent synthesizes and validates executable evaluation pipelines, enabling fully autonomous, high-fidelity assessment. Evaluated across 10 benchmarks and 13 models, A2Eval compresses evaluation suites by 85%, reduces overall computational costs by 77%, and delivers a 4.6x speedup while preserving evaluation quality. Crucially, A2Eval corrects systematic ranking biases, improves human alignment to Spearman’s rho=0.85, and maintains high ranking fidelity (Kendall’s tau=0.81), establishing a new standard for high-fidelity, low-cost embodied assessment. Our code and data will be public soon.
[NLP-95] SEA-Guard: Culturally Grounded Multilingual Safeguard for Southeast Asia
[Quick Read]: This paper addresses the failure of current generative AI safeguard models to recognize regionally sensitive content due to the lack of grounding in local context: most rely on machine translation of English datasets and miss the cultural values, norms, and region-specific regulations of Southeast Asia (SEA), limiting their protective effect. The key to the solution is a novel agentic data-generation framework that scalably produces authentic, region-specific safety data; on this foundation the authors build SEA-Guard, the first family of multilingual safeguard models grounded in SEA cultural contexts, which across multiple benchmarks and cultural variants consistently outperforms existing safeguards at detecting regionally sensitive or harmful content while maintaining strong general safety performance.
Link: https://arxiv.org/abs/2602.01618
Authors: Panuthep Tasawong, Jian Gang Ngui, Alham Fikri Aji, Trevor Cohn, Peerat Limkonchotiwat
Affiliations: VISTEC; Google; AI Singapore
Subjects: Computation and Language (cs.CL)
Comments: Under review
Abstract:Culturally aware safeguards are crucial for AI alignment in real-world settings, where safety extends beyond common sense and encompasses diverse local values, norms, and region-specific regulations. However, building large-scale, culturally grounded datasets is challenging due to limited resources and a scarcity of native annotators. Consequently, many safeguard models rely on machine translation of English datasets, often missing regional and cultural nuances. We present a novel agentic data-generation framework to scalably create authentic, region-specific safety datasets for Southeast Asia (SEA). On this foundation, we introduce the SEA-Guard family, the first multilingual safeguard models grounded in SEA cultural contexts. Evaluated across multiple benchmarks and cultural variants, SEA-Guard consistently outperforms existing safeguards at detecting regionally sensitive or harmful content while maintaining strong general safety performance.
[NLP-96] Adaptive Rollout Allocation for Online Reinforcement Learning with Verifiable Rewards ICLR2026
[Quick Read]: This paper addresses the sampling-efficiency bottleneck in policy optimization for reinforcement learning with verifiable rewards. Existing group-based methods such as GRPO allocate a fixed number of rollouts uniformly across all training prompts, ignoring how informative each prompt is, which wastes compute and impedes training progress. The key to the solution is a variance-informed predictive allocation strategy: a lightweight Gaussian process model predicts each prompt's success probability, these predictions are translated into gradient-variance estimates, and a convex optimization problem is solved to obtain the optimal rollout allocation under a hard compute budget, minimizing the expected gradient variance of the policy update and markedly improving sampling efficiency and final performance.
Link: https://arxiv.org/abs/2602.01601
Authors: Hieu Trung Nguyen, Bao Nguyen, Wenao Ma, Yuzhi Zhao, Ruifeng She, Viet Anh Nguyen
Affiliations: The Chinese University of Hong Kong; Huawei Hong Kong Research Center
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at ICLR 2026
Abstract: Sampling efficiency is a key bottleneck in reinforcement learning with verifiable rewards. Existing group-based policy optimization methods, such as GRPO, allocate a fixed number of rollouts for all training prompts. This uniform allocation implicitly treats all prompts as equally informative, and could lead to inefficient computational budget usage and impede training progress. We introduce a Variance-Informed Predictive allocation strategy that allocates a given rollout budget to the prompts in the incumbent batch to minimize the expected gradient variance of the policy update. At each iteration, our method uses a lightweight Gaussian process model to predict per-prompt success probabilities based on recent rollouts. These probability predictions are translated into variance estimates, which are then fed into a convex optimization problem to determine the optimal rollout allocations under a hard compute budget constraint. Empirical results show that our method consistently improves sampling efficiency and achieves higher performance than uniform or heuristic allocation strategies in multiple benchmarks. Our code will be available at this https URL.
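The paper's exact convex program is not reproduced here; as a hedged stand-in, the sketch below uses the classic square-root allocation, which minimizes the sum of v_i/n_i under a total budget when per-prompt gradient variance is approximated by the Bernoulli variance v_i = p_i(1 - p_i) of predicted success probabilities:

```python
import numpy as np

def allocate_rollouts(p_hat: np.ndarray, budget: int, n_min: int = 1) -> np.ndarray:
    """Allocate rollouts proportional to sqrt(variance): minimizes
    sum(v_i / n_i) subject to sum(n_i) = budget (continuous
    relaxation, then rounding)."""
    v = p_hat * (1.0 - p_hat)             # Bernoulli reward variance per prompt
    w = np.sqrt(v)
    if w.sum() == 0:                      # degenerate case: all trivially easy/hard
        w = np.ones_like(w)
    n = np.maximum(n_min, np.floor(budget * w / w.sum()).astype(int))
    # Hand out any leftover budget to the highest-variance prompts.
    for i in np.argsort(-v)[: max(0, budget - n.sum())]:
        n[i] += 1
    return n

p_hat = np.array([0.05, 0.5, 0.9, 0.99])    # predicted success probabilities
print(allocate_rollouts(p_hat, budget=32))  # mid-difficulty prompts get most rollouts
```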
[NLP-97] Expected Harm: Rethinking Safety Evaluation of (Mis)Aligned LLMs
[Quick Read]: This paper addresses a limitation of current LLM safety evaluation: its heavy reliance on severity-based taxonomies ignores the probability that a threat is actually realized (Execution Likelihood), leading to miscalibrated risk judgments. The key to the solution is Expected Harm, a new metric that weights a threat's severity by its execution likelihood, modeled as a function of execution cost, to measure real-world risk more accurately. The analysis uncovers a systematic Inverse Risk Calibration in existing models: they refuse high-cost (low-likelihood) threats disproportionately strongly while remaining weakly defended against low-cost (high-likelihood) threats; exploiting this structural vulnerability raises the attack success rate of existing jailbreaks by up to 2x. Linear probing further shows that models encode threat severity in their latent space to drive refusal decisions but lack any distinguishable internal representation of execution cost, leaving them "blind" to this critical dimension of risk.
Link: https://arxiv.org/abs/2602.01600
Authors: Yen-Shan Chen, Zhi Rui Tam, Cheng-Kuang Wu, Yun-Nung Chen
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:
Abstract: Current evaluations of LLM safety predominantly rely on severity-based taxonomies to assess the harmfulness of malicious queries. We argue that this formulation requires re-examination as it assumes uniform risk across all malicious queries, neglecting Execution Likelihood, the conditional probability of a threat being realized given the model's response. In this work, we introduce Expected Harm, a metric that weights the severity of a jailbreak by its execution likelihood, modeled as a function of execution cost. Through empirical analysis of state-of-the-art models, we reveal a systematic Inverse Risk Calibration: models disproportionately exhibit stronger refusal behaviors for low-likelihood (high-cost) threats while remaining vulnerable to high-likelihood (low-cost) queries. We demonstrate that this miscalibration creates a structural vulnerability: by exploiting this property, we increase the attack success rate of existing jailbreaks by up to 2×. Finally, we trace the root cause of this failure using linear probing, which reveals that while models encode severity in their latent space to drive refusal decisions, they possess no distinguishable internal representation of execution cost, making them "blind" to this critical dimension of risk.
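A minimal worked example of the metric's shape, under one loud assumption: execution likelihood is modeled here as a logistic function of execution cost, with made-up severity and cost scales (the paper's actual likelihood model may differ):

```python
import math

def expected_harm(severity: float, cost: float, k: float = 1.0, c0: float = 5.0) -> float:
    """Expected Harm = severity x execution likelihood, with likelihood
    modeled (purely as an assumption here) as logistic in cost."""
    likelihood = 1.0 / (1.0 + math.exp(k * (cost - c0)))
    return severity * likelihood

# A severe but costly threat vs. a milder, cheap one: the cheap threat
# carries higher expected harm despite its lower severity rating.
print(expected_harm(severity=9.0, cost=9.0))  # high severity, low likelihood
print(expected_harm(severity=4.0, cost=1.0))  # lower severity, high likelihood
```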
[NLP-98] The Art of Socratic Inquiry: A Framework for Proactive Template-Guided Therapeutic Conversation Generation
[Quick Read]: This paper addresses the overwhelmingly reactive tendency of current psychological Large Language Models (LLMs), which default to empathetic but superficial responses that neither surface latent beliefs nor guide behavioral change. The key to the solution is the Socratic Inquiry Framework (SIF), a lightweight, plug-and-play therapeutic intent planner that decouples when to ask (via Strategy Anchoring) from what to ask (via Template Retrieval), equipping models with context-aware, theory-grounded proactive questioning without end-to-end retraining. This significantly increases proactive questioning frequency, conversational depth, and therapeutic alignment, marking a paradigm shift from reactive comfort to proactive cognitive guidance.
Link: https://arxiv.org/abs/2602.01598
Authors: Mingwen Zhang, Minqiang Yang, Changsheng Ma, Yang Yu, Hui Bai, Chen Xu, Xiangzhen Kong, Bin Hu
Affiliations: Lanzhou University; Beijing Institute of Technology
Subjects: Computation and Language (cs.CL)
Comments:
Abstract: Proactive questioning, where therapists deliberately initiate structured, cognition-guiding inquiries, is a cornerstone of cognitive behavioral therapy (CBT). Yet, current psychological large language models (LLMs) remain overwhelmingly reactive, defaulting to empathetic but superficial responses that fail to surface latent beliefs or guide behavioral change. To bridge this gap, we propose the Socratic Inquiry Framework (SIF), a lightweight, plug-and-play therapeutic intent planner that transforms LLMs from passive listeners into active cognitive guides. SIF decouples when to ask (via Strategy Anchoring) from what to ask (via Template Retrieval), enabling context-aware, theory-grounded questioning without end-to-end retraining. Complementing SIF, we introduce Socratic-QA, a high-quality dataset of strategy-aligned Socratic sequences that provides explicit supervision for proactive reasoning. Experiments show that SIF significantly enhances proactive questioning frequency, conversational depth, and therapeutic alignment, marking a clear shift from reactive comfort to proactive exploration. Our work establishes a new paradigm for psychologically informed LLMs: not just to respond, but to guide.
[NLP-99] Wiki Live Challenge: Challenging Deep Research Agents with Expert-Level Wikipedia Articles
[Quick Read]: This paper addresses the lack of reliable, objective, and fine-grained evaluation standards for Deep Research Agents (DRAs). Existing frameworks rely on LLM-generated references or LLM-derived evaluation dimensions, which scale well but fall short of expert-level accuracy and granularity. The key to the solution is Wiki Live Challenge (WLC), a live benchmark that uses the newest Wikipedia Good Articles (GAs) as expert-level references, exploiting the high standards of neutrality, comprehensiveness, and verifiability that GAs embody; it is paired with Wiki Eval, an evaluation framework combining 39 fine-grained writing-quality criteria with rigorous factual-verifiability metrics. Together they expose a significant gap between current DRAs and human expert-level Wikipedia articles and push agent research toward higher standards.
Link: https://arxiv.org/abs/2602.01590
Authors: Shaohan Wang, Benfeng Xu, Licheng Zhang, Mingxuan Du, Chiwei Zhu, Xiaorui Wang, Zhendong Mao, Yongdong Zhang
Affiliations: University of Science and Technology of China; Metastone Technology
Subjects: Computation and Language (cs.CL)
Comments: Preprint. Work in progress
Abstract:Deep Research Agents (DRAs) have demonstrated remarkable capabilities in autonomous information retrieval and report generation, showing great potential to assist humans in complex research tasks. Current evaluation frameworks primarily rely on LLM-generated references or LLM-derived evaluation dimensions. While these approaches offer scalability, they often lack the reliability of expert-verified content and struggle to provide objective, fine-grained assessments of critical dimensions. To bridge this gap, we introduce Wiki Live Challenge (WLC), a live benchmark that leverages the newest Wikipedia Good Articles (GAs) as expert-level references. Wikipedia’s strict standards for neutrality, comprehensiveness, and verifiability serve as a great challenge for DRAs, with GAs representing the pinnacle of which. We curate a dataset of 100 recent Good Articles and propose Wiki Eval, a comprehensive evaluation framework comprising a fine-grained evaluation method with 39 criteria for writing quality and rigorous metrics for factual verifiability. Extensive experiments on various DRA systems demonstrate a significant gap between current DRAs and human expert-level Wikipedia articles, validating the effectiveness of WLC in advancing agent research. We release our benchmark at this https URL
[NLP-100] Provable Defense Framework for LLM Jailbreaks via Noise-Augmented Alignment
[Quick Read]: This paper addresses the insufficient safety of Large Language Models (LLMs) against adaptive jailbreak attacks that easily bypass existing empirical defenses (the abstract cites GCG). The key to the solution is a provably robust framework, Certified Semantic Smoothing (CSS): Stratified Randomized Ablation partitions inputs into immutable structural prompts and mutable payloads and derives rigorous ℓ₀-norm guarantees via the Hypergeometric distribution, while Noise-Augmented Alignment Tuning (NAAT) turns the base model into a semantic denoiser to offset performance degradation on sparse contexts. Experiments show the method reduces the attack success rate of gradient-based attacks from 84.2% to 1.2% while maintaining 94.1% benign utility, clearly outperforming character-level baselines (whose utility degrades to 74.3%), and it provides a deterministic safety certificate guaranteeing robustness against all adversarial variants within a provable radius.
Link: https://arxiv.org/abs/2602.01587
Authors: Zehua Cheng, Jianwei Yang, Wei Dai, Jiahao Sun
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages
Abstract: Large Language Models (LLMs) remain vulnerable to adaptive jailbreaks that easily bypass empirical defenses like GCG. We propose a framework for certifiable robustness that shifts safety guarantees from single-pass inference to the statistical stability of an ensemble. We introduce Certified Semantic Smoothing (CSS) via Stratified Randomized Ablation, a technique that partitions inputs into immutable structural prompts and mutable payloads to derive rigorous ℓ₀-norm guarantees using the Hypergeometric distribution. To resolve performance degradation on sparse contexts, we employ Noise-Augmented Alignment Tuning (NAAT), which transforms the base model into a semantic denoiser. Extensive experiments on Llama-3 show that our method reduces the Attack Success Rate of gradient-based attacks from 84.2% to 1.2% while maintaining 94.1% benign utility, significantly outperforming character-level baselines which degrade utility to 74.3%. This framework provides a deterministic certificate of safety, ensuring that a model remains robust against all adversarial variants within a provable radius.
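A minimal sketch of the combinatorial core behind such certificates: when k of n payload tokens are kept uniformly at random, the chance that an ablation avoids all r adversarial tokens is a hypergeometric probability. The full stratified scheme and the certification rule are the paper's own:

```python
from scipy.stats import hypergeom

def prob_avoid_adversarial(n: int, k: int, r: int) -> float:
    """P(a uniformly random size-k subset of n payload tokens contains
    none of the r adversarially perturbed tokens)."""
    # hypergeom(M=n, n=r, N=k): count of "bad" tokens among k draws.
    return float(hypergeom.pmf(0, n, r, k))

n, k = 100, 20
for r in (1, 2, 5):
    print(f"r={r}: P(clean ablation) = {prob_avoid_adversarial(n, k, r):.3f}")
```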
[NLP-101] LLM -based Embeddings: Attention Values Encode Sentence Semantics Better Than Hidden States
[Quick Read]: This paper addresses the limited ability of current LLM-based sentence representations to capture global, sentence-level semantics: most methods rely on final-layer hidden states, which are optimized for next-token prediction. The key to the solution is Value Aggregation (VA), which pools attention value vectors across multiple layers and all token positions instead of using hidden states; building on it, Aligned Weighted VA (AlignedWVA) uses the last token's attention scores as weights and aligns the weighted value vectors into the common space of the LLM residual stream via the output projection matrix W_O, yielding more precise semantic representations. In a training-free setting the method matches or surpasses the costly ensemble-based MetaEOL, demonstrating the potential of attention values for sentence embeddings.
Link: https://arxiv.org/abs/2602.01572
Authors: Yeqin Zhang, Yunfei Wang, Jiaxuan Chen, Ke Qin, Yizheng Zhao, Cam-Tu Nguyen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:
Abstract: Sentence representations are foundational to many Natural Language Processing (NLP) applications. While recent methods leverage Large Language Models (LLMs) to derive sentence representations, most rely on final-layer hidden states, which are optimized for next-token prediction and thus often fail to capture global, sentence-level semantics. This paper introduces a novel perspective, demonstrating that attention value vectors capture sentence semantics more effectively than hidden states. We propose Value Aggregation (VA), a simple method that pools token values across multiple layers and token indices. In a training-free setting, VA outperforms other LLM-based embeddings, and even matches or surpasses the ensemble-based MetaEOL. Furthermore, we demonstrate that when paired with suitable prompts, the layer attention outputs can be interpreted as aligned weighted value vectors. Specifically, the attention scores of the last token function as the weights, while the output projection matrix W_O aligns these weighted value vectors with the common space of the LLM residual stream. This refined method, termed Aligned Weighted VA (AlignedWVA), achieves state-of-the-art performance among training-free LLM-based embeddings, outperforming the high-cost MetaEOL by a substantial margin. Finally, we highlight the potential of obtaining strong LLM embedding models through fine-tuning Value Aggregation.
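A minimal sketch of the pooling step, assuming value vectors of shape [layers, tokens, heads, head_dim] have already been extracted (e.g., via forward hooks); the chosen layer band is arbitrary, and the prompt-based AlignedWVA weighting is not shown:

```python
import torch

# Stand-in attention value vectors: [n_layers, n_tokens, n_heads, d_head]
torch.manual_seed(0)
values = torch.randn(24, 10, 8, 64)

def value_aggregation(values: torch.Tensor, layers: slice = slice(8, 16)) -> torch.Tensor:
    """Mean-pool value vectors over a band of layers and all token
    positions, then flatten heads into a single embedding."""
    v = values[layers]                 # pick a band of intermediate layers
    v = v.mean(dim=(0, 1))             # pool over layers and tokens -> [heads, d_head]
    return v.reshape(-1)               # sentence embedding, dim = heads * d_head

emb = value_aggregation(values)
print(emb.shape)  # torch.Size([512])
```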
[NLP-102] FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents
[Quick Read]: This paper addresses the constraint that long deep-research trajectories exceed large language model (LLM) context windows, squeezing the token budget for both evidence collection and report writing and preventing effective test-time scaling. The key to the solution is FS-Researcher, a file-system-based dual-agent framework whose core innovation is using a persistent workspace as external memory: a Context Builder agent browses the web, writes structured notes, and archives raw sources into a hierarchical knowledge base that can grow far beyond the context length, while a Report Writer agent composes the final report section by section with the knowledge base as its source of facts, enabling iterative refinement across sessions and effective allocation of test-time compute.
Link: https://arxiv.org/abs/2602.01566
Authors: Chiwei Zhu, Benfeng Xu, Mingxuan Du, Shaohan Wang, Xiaorui Wang, Zhendong Mao, Yongdong Zhang
Affiliations: University of Science and Technology of China; Metastone Technology
Subjects: Computation and Language (cs.CL)
Comments: 19 pages, 6 figures
Abstract:Deep research is emerging as a representative long-horizon task for large language model (LLM) agents. However, long trajectories in deep research often exceed model context limits, compressing token budgets for both evidence collection and report writing, and preventing effective test-time scaling. We introduce FS-Researcher, a file-system-based, dual-agent framework that scales deep research beyond the context window via a persistent workspace. Specifically, a Context Builder agent acts as a librarian which browses the internet, writes structured notes, and archives raw sources into a hierarchical knowledge base that can grow far beyond context length. A Report Writer agent then composes the final report section by section, treating the knowledge base as the source of facts. In this framework, the file system serves as a durable external memory and a shared coordination medium across agents and sessions, enabling iterative refinement beyond the context window. Experiments on two open-ended benchmarks (DeepResearch Bench and DeepConsult) show that FS-Researcher achieves state-of-the-art report quality across different backbone models. Further analyses demonstrate a positive correlation between final report quality and the computation allocated to the Context Builder, validating effective test-time scaling under the file-system paradigm. The code and data are anonymously open-sourced at this https URL.
[NLP-103] Argument Rarity-based Originality Assessment for AI-Assisted Writing
[Quick Read]: This paper argues that traditional quality-focused writing assessment is losing its significance now that generative AI can effortlessly produce high-quality text: if education's essential goal is fostering critical thinking and original perspectives, quality alone no longer measures them. The key to the solution is Argument Rarity-based Originality Assessment (AROA), which defines originality as rarity within a reference corpus and evaluates it along four complementary dimensions, structural rarity, claim rarity, evidence rarity, and cognitive depth, quantified via density estimation and integrated with a quality-adjustment mechanism so that quality and originality act as independent evaluation axes. Experiments show that higher-quality texts tend to have lower claim rarity, a quality-originality trade-off, and that AI-generated essays approach human structural complexity while exhibiting substantially lower claim rarity, exposing the limits of LLMs' content originality.
Link: https://arxiv.org/abs/2602.01560
Authors: Keito Inoshita, Michiaki Omura, Tsukasa Yamanaka, Go Maeda, Kentaro Tsuji
Affiliations: Ritsumeikan University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:As Large Language Models (LLMs) have become capable of effortlessly generating high-quality text, traditional quality-focused writing assessment is losing its significance. If the essential goal of education is to foster critical thinking and original perspectives, assessment must also shift its paradigm from quality to originality. This study proposes Argument Rarity-based Originality Assessment (AROA), a framework for automatically evaluating argumentative originality in student essays. AROA defines originality as rarity within a reference corpus and evaluates it through four complementary components: structural rarity, claim rarity, evidence rarity, and cognitive depth. The framework quantifies the rarity of each component using density estimation and integrates them with a quality adjustment mechanism, thereby treating quality and originality as independent evaluation axes. Experiments using human essays and AI-generated essays revealed a strong negative correlation between quality and claim rarity, demonstrating a quality-originality trade-off where higher-quality texts tend to rely on typical claim patterns. Furthermore, while AI essays achieved comparable levels of structural complexity to human essays, their claim rarity was substantially lower than that of humans, indicating that LLMs can reproduce the form of argumentation but have limitations in the originality of content.
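A minimal sketch of the rarity idea: fit a density estimate on reference-corpus embeddings and score originality as negative log-density. The embeddings are random stand-ins, and the four-component decomposition and quality adjustment are the paper's own:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 8))        # stand-in essay embeddings (corpus)
typical = reference[:1] + 0.01               # near the corpus mass
unusual = np.full((1, 8), 4.0)               # far from the corpus mass

kde = KernelDensity(bandwidth=0.5).fit(reference)
for name, x in [("typical", typical), ("unusual", unusual)]:
    rarity = -kde.score_samples(x)[0]        # negative log-density = rarity score
    print(f"{name:8s} rarity = {rarity:.2f}")
```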
[NLP-104] MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety
[Quick Read]: This paper addresses the challenge that existing safety-alignment defenses for Large Language Models (LLMs) struggle to keep pace with evolving adversarial attacks because they rely on static, pre-collected data distributions. The key to the solution is MAGIC, a multi-turn multi-agent reinforcement learning framework that formulates LLM safety alignment as an asymmetric adversarial game: an attacker agent iteratively rewrites original queries into deceptive prompts while a defender agent simultaneously optimizes its policy to recognize and refuse such inputs. This dynamic triggers co-evolution: the attacker's ever-changing strategies continuously expose long-tail vulnerabilities, driving the defender to generalize to unseen attack patterns, and the attacker, starting from initial reasoning ability, evolves previously unseen combinatorial strategies through iterative RL training, pushing the detection frontier. Theoretical analysis characterizes a more robust game equilibrium and provides safety guarantees, and experiments show higher defense success rates without compromising the model's helpfulness.
Link: https://arxiv.org/abs/2602.01539
Authors: Xiaoyu Wen, Zhida He, Han Qi, Ziyu Wan, Zhongtian Ma, Ying Wen, Tianhang Zheng, Xingcheng Xu, Chaochao Lu, Qiaosheng Zhang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:
Abstract: Ensuring robust safety alignment is crucial for Large Language Models (LLMs), yet existing defenses often lag behind evolving adversarial attacks due to their reliance on static, pre-collected data distributions. In this paper, we introduce MAGIC, a novel multi-turn multi-agent reinforcement learning framework that formulates LLM safety alignment as an adversarial asymmetric game. Specifically, an attacker agent learns to iteratively rewrite original queries into deceptive prompts, while a defender agent simultaneously optimizes its policy to recognize and refuse such inputs. This dynamic process triggers a co-evolution, where the attacker's ever-changing strategies continuously uncover long-tail vulnerabilities, driving the defender to generalize to unseen attack patterns. Remarkably, we observe that the attacker, endowed with initial reasoning ability, evolves novel, previously unseen combinatorial strategies through iterative RL training, underscoring our method's substantial potential. Theoretically, we provide insights into a more robust game equilibrium and derive safety guarantees. Extensive experiments validate our framework's effectiveness, demonstrating superior defense success rates without compromising the helpfulness of the model. Our code is available at this https URL.
[NLP-105] A Relative-Budget Theory for Reinforcement Learning with Verifiable Rewards in Large Language Model Reasoning
[Quick Read]: This paper addresses why reinforcement learning (RL) for improving large language model reasoning varies in effectiveness across tasks and compute budgets. The key to the solution is a relative-budget theory that explains this variation through a single quantity, the relative budget ξ := H/E[T], where H is the generation horizon (token budget) and T is the number of tokens until the first correct solution under a base policy. The theory shows that ξ controls reward variance and the likelihood of informative trajectories, and hence sample efficiency, and it delineates three learning regimes: deficient, balanced, and ample. In the balanced regime with ξ = Θ(1), RL is maximally sample-efficient, and experiments identify ξ ∈ [1.5, 2.0] as the range that maximizes learning efficiency, coinciding with peak reasoning performance.
Link: https://arxiv.org/abs/2602.01523
Authors: Akifumi Wachi, Hirota Kinoshita, Shokichi Takakura, Rei Higuchi, Taiji Suzuki
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 28 pages
Abstract: Reinforcement learning (RL) is a dominant paradigm for improving the reasoning abilities of large language models, yet its effectiveness varies across tasks and compute budgets. We propose a relative-budget theory explaining this variation through a single quantity called the relative budget ξ := H/E[T], where H is the generation horizon (token budget) and T denotes the number of tokens until the first correct solution under a base policy. We show that ξ determines sample efficiency by controlling reward variance and the likelihood of informative trajectories. Our analysis reveals three regimes: in the deficient regime (ξ → 0), informative trajectories are rare and the sample complexity explodes; in the balanced regime (ξ = Θ(1)), informative trajectories occur with non-negligible probability and RL is maximally sample-efficient; and in the ample regime (ξ → ∞), learning remains stable but marginal gains per iteration diminish. We further provide finite-sample guarantees for online RL that characterize learning progress across these regimes. Specifically, in a case study under idealized distributional assumptions, we show that the relative budget grows linearly over iterations. Our empirical results confirm these predictions in realistic settings, identifying a budget ξ ∈ [1.5, 2.0] that maximizes learning efficiency and coincides with peak reasoning performance.
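A small worked example of using the diagnostic itself: estimate E[T] from sampled tokens-to-first-correct values, compute ξ = H/E[T], and compare against the empirically favorable band ξ ∈ [1.5, 2.0]; the regime boundaries below are chosen loosely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in: tokens until the first correct solution for 256 sampled problems.
T = rng.lognormal(mean=6.0, sigma=0.5, size=256)   # hypothetical, ~400-token scale

H = 1024                                            # generation horizon (token budget)
xi = H / T.mean()
print(f"E[T] ~ {T.mean():.0f} tokens, relative budget xi = {xi:.2f}")

if xi < 1.0:
    print("deficient regime: informative trajectories are rare")
elif xi <= 2.0:
    print("balanced regime: RL is most sample-efficient")
else:
    print("ample regime: stable but diminishing gains per iteration")
```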
[NLP-106] Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training
[Quick Read]: This paper addresses the inability of standard reward models, which predict only scalar scores, to capture the multifaceted nature of response quality in non-verifiable domains such as creative writing or open-ended instruction following. The key to the solution is Rubric-ARM, a framework that jointly optimizes a rubric generator and a judge via reinforcement learning from preference feedback, treating rubric generation as a latent action learned to maximize judgment accuracy. Departing from static rubrics and disjoint training pipelines, it introduces an alternating optimization strategy to mitigate the non-stationarity of simultaneous updates; theoretical analysis shows this schedule reduces gradient variance during training, and experiments demonstrate state-of-the-art performance across multiple benchmarks along with significantly improved downstream policy alignment.
Link: https://arxiv.org/abs/2602.01511
Authors: Ran Xu, Tianci Liu, Zihan Dong, Tony You, Ilgee Hong, Carl Yang, Linjun Zhang, Tao Zhao, Haoyu Wang
Affiliations: Emory University; Purdue University; Rutgers University; Georgia Institute of Technology; University at Albany
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: The first two authors contributed equally
Abstract:Standard reward models typically predict scalar scores that fail to capture the multifaceted nature of response quality in non-verifiable domains, such as creative writing or open-ended instruction following. To address this limitation, we propose Rubric-ARM, a framework that jointly optimizes a rubric generator and a judge using reinforcement learning from preference feedback. Unlike existing methods that rely on static rubrics or disjoint training pipelines, our approach treats rubric generation as a latent action learned to maximize judgment accuracy. We introduce an alternating optimization strategy to mitigate the non-stationarity of simultaneous updates, providing theoretical analysis that demonstrates how this schedule reduces gradient variance during training. Extensive experiments show that Rubric-ARM achieves state-of-the-art performance among baselines on multiple benchmarks and significantly improves downstream policy alignment in both offline and online reinforcement learning settings.
[NLP-107] Ebisu: Benchmarking Large Language Models in Japanese Finance
[Quick Read]: This paper addresses the challenges that generative AI faces in Japanese financial language understanding, which combines agglutinative, head-final structure, mixed writing systems, and high-context communication norms that rely on indirect expression and implicit commitment. The key to the solution is Ebisu, a benchmark of two linguistically and culturally grounded, expert-annotated tasks: JF-ICR, which evaluates implicit commitment and refusal recognition in investor-facing QA, and JF-TE, which evaluates hierarchical extraction and ranking of nested financial terminology from professional disclosures, providing a focused, quantifiable standard for advancing Japanese financial NLP.
Link: https://arxiv.org/abs/2602.01479
Authors: Xueqing Peng, Ruoyu Xiang, Fan Zhang, Mingzi Song, Mingyang Jiang, Yan Wang, Lingfei Qian, Taiki Hara, Yuqing Guo, Jimin Huang, Junichi Tsujii, Sophia Ananiadou
Affiliations: University of Manchester; The Fin AI; New York University; The University of Tokyo; Meiji Gakuin University; National Institute of Advanced Industrial Science and Technology (AIST); The National Centre for Text Mining
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Japanese finance combines agglutinative, head-final linguistic structure, mixed writing systems, and high-context communication norms that rely on indirect expression and implicit commitment, posing a substantial challenge for LLMs. We introduce Ebisu, a benchmark for native Japanese financial language understanding, comprising two linguistically and culturally grounded, expert-annotated tasks: JF-ICR, which evaluates implicit commitment and refusal recognition in investor-facing QA, and JF-TE, which assesses hierarchical extraction and ranking of nested financial terminology from professional disclosures. We evaluate a diverse set of open-source and proprietary LLMs spanning general-purpose, Japanese-adapted, and financial models. Results show that even state-of-the-art systems struggle on both tasks. While increased model scale yields limited improvements, language- and domain-specific adaptation does not reliably improve performance, leaving substantial gaps unresolved. Ebisu provides a focused benchmark for advancing linguistically and culturally grounded financial NLP. All datasets and evaluation scripts are publicly released.
[NLP-108] ConPress: Learning Efficient Reasoning from Multi-Question Contextual Pressure
[Quick Read]: This paper addresses the substantial inference overhead of large reasoning models (LRMs), which solve reasoning-intensive tasks by generating long chain-of-thought (CoT) traces. The key to the solution is identifying and exploiting a reproducible inference-time phenomenon, Self-Compression: when multiple independent, answerable questions are placed in a single prompt, the model spontaneously produces shorter reasoning traces for each question. Building on this, ConPress constructs multi-question prompts to induce self-compression, samples the resulting outputs, and parses and filters concise yet correct per-question reasoning trajectories for direct supervised fine-tuning, so the model internalizes compressed reasoning in single-question settings without external teachers, manual pruning, or reinforcement learning. With only 8k fine-tuning examples, it reduces reasoning token usage markedly (by 59% on MATH500 and 33% on AIME25) while maintaining competitive accuracy.
Link: https://arxiv.org/abs/2602.01472
Authors: Jie Deng, Shining Liang, Jun Li, Hongzhi Li, Yutao Xie
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large reasoning models (LRMs) typically solve reasoning-intensive tasks by generating long chain-of-thought (CoT) traces, leading to substantial inference overhead. We identify a reproducible inference-time phenomenon, termed Self-Compression: when multiple independent and answerable questions are presented within a single prompt, the model spontaneously produces shorter reasoning traces for each question. This phenomenon arises from multi-question contextual pressure during generation and consistently manifests across models and benchmarks. Building on this observation, we propose ConPress (Learning from Contextual Pressure), a lightweight self-supervised fine-tuning approach. ConPress constructs multi-question prompts to induce self-compression, samples the resulting model outputs, and parses and filters per-question traces to obtain concise yet correct reasoning trajectories. These trajectories are directly used for supervised fine-tuning, internalizing compressed reasoning behavior in single-question settings without external teachers, manual pruning, or reinforcement learning. With only 8k fine-tuning examples, ConPress reduces reasoning token usage by 59% on MATH500 and 33% on AIME25, while maintaining competitive accuracy.
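A minimal sketch of the data-construction mechanics implied by the abstract: pack independent questions into one prompt to induce self-compression, then split the output back into per-question traces. The prompt format and delimiter convention are assumptions; sampling, correctness filtering, and fine-tuning are not shown:

```python
import re

def pack(questions):
    """Build a multi-question prompt intended to induce self-compression."""
    body = "\n".join(f"Q{i+1}. {q}" for i, q in enumerate(questions))
    return f"Answer each question, marking each answer as 'A<i>:'.\n{body}"

def unpack(output: str, n: int):
    """Recover per-question reasoning traces from the packed output."""
    parts = re.split(r"\bA(\d+):", output)[1:]          # [idx, text, idx, text, ...]
    traces = {int(i): t.strip() for i, t in zip(parts[0::2], parts[1::2])}
    return [traces.get(i + 1, "") for i in range(n)]

qs = ["What is 12*12?", "Is 17 prime?"]
print(pack(qs))
print(unpack("A1: 144. A2: Yes, 17 is prime.", n=2))
```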
[NLP-109] Understanding QA generation: Extracting Parametric and Contextual Knowledge with CQA for Low Resource Bangla Language
[Quick Read]: This paper addresses the difficulty, for question-answering (QA) models in low-resource languages such as Bangla, of determining whether answers draw on pre-encoded parametric knowledge or on contextual input, given limited annotated data and linguistic complexity; existing Bangla QA datasets lack the structure needed for such analysis. The key to the solution is BanglaCQA, the first Bangla counterfactual QA (CQA) dataset, built by extending an existing dataset with counterfactual passages and answerability annotations, providing the structural basis for evaluating models in factual and counterfactual scenarios. The authors further build fine-tuned pipelines for encoder-decoder language-specific and multilingual baselines and prompting pipelines for decoder-only LLMs, combined with LLM-based and human semantic-similarity evaluation, to systematically disentangle parametric and contextual knowledge. A notable finding is that Chain-of-Thought (CoT) prompting is uniquely effective at extracting parametric knowledge in counterfactual scenarios, particularly for decoder-only LLMs, opening broader directions for counterfactual reasoning in low-resource languages.
Link: https://arxiv.org/abs/2602.01451
Authors: Umme Abira Azmary, MD Ikramul Kayes, Swakkhar Shatabda, Farig Yousuf Sadeque
Affiliations: BRAC University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Question-Answering (QA) models for low-resource languages like Bangla face challenges due to limited annotated data and linguistic complexity. A key issue is determining whether models rely more on pre-encoded (parametric) knowledge or contextual input during answer generation, as existing Bangla QA datasets lack the structure required for such analysis. We introduce BanglaCQA, the first Counterfactual QA dataset in Bangla, by extending a Bangla dataset while integrating counterfactual passages and answerability annotations. In addition, we propose fine-tuned pipelines for encoder-decoder language-specific and multilingual baseline models, and prompting-based pipelines for decoder-only LLMs to disentangle parametric and contextual knowledge in both factual and counterfactual scenarios. Furthermore, we apply LLM-based and human evaluation techniques that measure answer quality based on semantic similarity. We also present a detailed analysis of how models perform across different QA settings in low-resource languages, and show that Chain-of-Thought (CoT) prompting reveals a uniquely effective mechanism for extracting parametric knowledge in counterfactual scenarios, particularly in decoder-only LLMs. Our work not only introduces a novel framework for analyzing knowledge sources in Bangla QA but also uncovers critical findings that open up broader directions for counterfactual reasoning in low-resource language settings.
[NLP-110] SentiFuse: Deep Multi-model Fusion Framework for Robust Sentiment Extraction
[Quick Read]: This paper addresses the lack of a unified integration framework for existing sentiment-analysis models, which leaves their complementary strengths unexploited. The key to the solution is SentiFuse, a flexible, model-agnostic ensemble framework that systematically combines heterogeneous sentiment models through a standardization layer and multiple fusion strategies, including decision-level, feature-level, and adaptive fusion. Experiments on several social-media datasets show that feature-level fusion delivers the strongest overall gains (up to 4% absolute F1 improvement), while adaptive fusion improves robustness on challenging cases such as negation, mixed emotions, and complex sentiment expressions, confirming the value of systematically exploiting model complementarity.
Link: https://arxiv.org/abs/2602.01447
Authors: Hieu Minh Duong, Rupa Ghosh, Cong Hoan Nguyen, Eugene Levin, Todd Gary, Long Nguyen
Affiliations: University of Louisville, Louisville, Kentucky, USA; Meharry Medical College, Nashville, Tennessee, USA
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Sentiment analysis models exhibit complementary strengths, yet existing approaches lack a unified framework for effective integration. We present SentiFuse, a flexible and model-agnostic framework that integrates heterogeneous sentiment models through a standardization layer and multiple fusion strategies. Our approach supports decision-level fusion, feature-level fusion, and adaptive fusion, enabling systematic combination of diverse models. We conduct experiments on three large-scale social-media datasets: Crowdflower, GoEmotions, and Sentiment140. These experiments show that SentiFuse consistently outperforms individual models and naive ensembles. Feature-level fusion achieves the strongest overall effectiveness, yielding up to 4% absolute improvement in F1 score over the best individual model and simple averaging, while adaptive fusion enhances robustness on challenging cases such as negation, mixed emotions, and complex sentiment expressions. These results demonstrate that systematically leveraging model complementarity yields more accurate and reliable sentiment analysis across diverse datasets and text types.
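A minimal sketch of the two main fusion strategies on stand-in model outputs: decision-level fusion averages per-model class probabilities, while feature-level fusion concatenates them as features for a meta-classifier. The standardization layer and adaptive fusion are not reproduced:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, n_models, n_classes = 500, 3, 2
# Stand-in: per-model class-probability outputs, shape [models, samples, classes].
probs = rng.dirichlet(alpha=np.ones(n_classes), size=(n_models, n))
y = rng.integers(0, n_classes, size=n)

# Decision-level fusion: average the probability vectors, then argmax.
decision = probs.mean(axis=0).argmax(axis=1)

# Feature-level fusion: concatenate per-model probabilities as features.
X = probs.transpose(1, 0, 2).reshape(n, n_models * n_classes)
meta = LogisticRegression(max_iter=1000).fit(X, y)
feature = meta.predict(X)

print("decision-level acc:", (decision == y).mean())
print("feature-level acc :", (feature == y).mean())
```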
[NLP-111] The Gradient-Causal Gap: Why Gradient Importance Fails on Complex Tasks ICLR2026
[Quick Read]: This paper addresses the inconsistency between gradient magnitude and causal importance in neural networks: gradient magnitude does not reliably indicate which parameters matter for model performance, making gradient-based pruning unpredictably risky in practice. The key contribution is quantifying the Gradient-Causal Gap in Transformers trained on algorithmic tasks, showing that the alignment between gradient magnitude and causal importance varies with task complexity: the two correlate on simple tasks (ρ = 0.73) but the relationship weakens substantially on complex tasks and can even invert (ρ = -0.11). Pruning experiments further show that removing low-gradient "Hidden Heroes" consistently devastates out-of-distribution (OOD) performance (a 32% drop), while removing high-gradient "Gradient Bloats" is unpredictable: harmless in most seeds (optimization noise) but catastrophic in others (overfitting circuits), demonstrating that gradient-based pruning cannot reliably preserve model capabilities.
Link: https://arxiv.org/abs/2602.01442
Authors: Donald Ye
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 8 pages, 4 figures. Submitted to the ICLR 2026 Workshop on Latent Implicit Thinking (LIT). Code: this https URL
Abstract: Removing "important" high-gradient components from a neural network can improve generalization, while removing "unimportant" low-gradient components can destroy it. We demonstrate this paradox by formalizing the Gradient-Causal Gap in Transformers trained on algorithmic tasks. While gradient magnitude and causal importance align on simple tasks (ρ = 0.73 for reversal), this relationship collapses as task complexity increases (ρ = 0.32 for sorting), sometimes becoming inverted (ρ = -0.11). Pruning experiments reveal that gradient magnitude is not merely inaccurate but unpredictably so. Removing low-gradient "Hidden Heroes" consistently devastates OOD accuracy (-32%). Removing high-gradient "Gradient Bloats" is a coin flip: harmless in most seeds (indicating optimization noise), catastrophic in others (indicating overfitting circuits). This unpredictability means gradient-based pruning cannot reliably preserve model capabilities.
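A minimal sketch of the measurement protocol on a toy regression model rather than a Transformer: compare gradient magnitudes against causal importance estimated by ablating units one at a time, then report the Spearman correlation:

```python
import torch
import torch.nn as nn
from scipy.stats import spearmanr

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
x, y = torch.randn(64, 8), torch.randn(64, 1)
loss_fn = nn.MSELoss()

# Gradient importance of each hidden unit's output weight.
loss = loss_fn(model(x), y)
loss.backward()
grad_imp = model[2].weight.grad.abs().squeeze(0)            # [16]

# Causal importance: loss increase when the unit is zeroed out.
base = loss_fn(model(x), y).item()
causal = []
for i in range(16):
    w = model[2].weight.data[0, i].item()
    model[2].weight.data[0, i] = 0.0                        # ablate unit i
    causal.append(loss_fn(model(x), y).item() - base)
    model[2].weight.data[0, i] = w                          # restore

rho, _ = spearmanr(grad_imp.detach().numpy(), causal)
print(f"Spearman(grad magnitude, causal importance) = {rho:.2f}")
```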
[NLP-112] From Pragmas to Partners: A Symbiotic Evolution of Agentic High-Level Synthesis
[Quick Read]: This position paper asks whether high-level synthesis (HLS) still matters as generative-AI-driven hardware design rises. It argues that although mature agentic hardware systems will combine HLS with register-transfer level (RTL) design, HLS remains an essential abstraction layer for agentic optimization. The key points are threefold: HLS serves as a practical abstraction layer and a golden reference for agentic hardware design; current HLS tools have limitations in performance feedback, rigid interfaces, and debuggability that AI agents are uniquely positioned to address; and a proposed taxonomy for the symbiotic evolution of agentic HLS clarifies how responsibility shifts from human designers to AI agents as systems advance from copilots to autonomous design partners.
Link: https://arxiv.org/abs/2602.01401
Authors: Niansong Zhang, Sunwoo Kim, Shreesha Srinath, Zhiru Zhang
Affiliations: Cornell University; Cerebras Systems
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract: The rise of large language models has sparked interest in AI-driven hardware design, raising the question: does high-level synthesis (HLS) still matter in the agentic era? We argue that HLS remains essential. While we expect mature agentic hardware systems to leverage both HLS and RTL, this paper focuses on HLS and its role in enabling agentic optimization. HLS offers faster iteration cycles, portability, and design permutability that make it a natural layer for agentic optimization. This position paper makes three contributions. First, we explain why HLS serves as a practical abstraction layer and a golden reference for agentic hardware design. Second, we identify key limitations of current HLS tools, namely inadequate performance feedback, rigid interfaces, and limited debuggability that agents are uniquely positioned to address. Third, we propose a taxonomy for the symbiotic evolution of agentic HLS, clarifying how responsibility shifts from human designers to AI agents as systems advance from copilots to autonomous design partners.
[NLP-113] Rethinking Selective Knowledge Distillation
[Quick Read]: This paper addresses the efficiency-effectiveness trade-off in knowledge distillation (KD) for large language models (LLMs), in particular the high compute and storage cost of dense distillation. Existing selective-distillation methods reduce supervision along token positions, vocabulary classes, or training samples, but lack a systematic analysis of importance signals and selection policies. The key to the solution is student-entropy-guided position selection (SE-KD), which uses the student model's own predictive entropy to identify the most informative token positions for distillation, improving accuracy and memory efficiency; extending the approach along the class and sample axes yields SE-KD 3X, which cuts storage by 80% and peak memory by 18% without sacrificing performance, making offline teacher caching practical.
Link: https://arxiv.org/abs/2602.01395
Authors: Almog Tavor, Itay Ebenspanger, Neil Cnaan, Mor Geva
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Growing efforts to improve knowledge distillation (KD) in large language models (LLMs) replace dense teacher supervision with selective distillation, which uses a subset of token positions, vocabulary classes, or training samples for supervision. However, it remains unclear which importance signals, selection policies, and their interplay are most effective. In this work, we revisit where and how to distill in autoregressive LLMs. We disentangle selective KD along the position, class, and sample axes and systematically compare importance signals and selection policies. Then, guided by this analysis, we identify underexplored opportunities and introduce student-entropy-guided position selection (SE-KD). Across a suite of benchmarks, SE-KD often improves accuracy, downstream task adherence, and memory efficiency over dense distillation. Extending this approach across the class and sample axes (SE-KD 3X) yields complementary efficiency gains that make offline teacher caching feasible. In practice, this reduces wall time by 70% and peak memory by 18%, while cutting storage usage by 80% over prior methods without sacrificing performance.
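The abstract does not fix the exact selection policy, so the following is one plausible instantiation, labeled as an assumption: distill only at the top-k positions where the student's predictive entropy is highest. Logits are random stand-ins for real model outputs:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, V, k = 128, 1000, 32                     # sequence length, vocab, positions kept
student_logits = torch.randn(T, V)
teacher_logits = torch.randn(T, V)          # stand-ins for real model outputs

# Student predictive entropy per position.
logp_s = F.log_softmax(student_logits, dim=-1)
entropy = -(logp_s.exp() * logp_s).sum(-1)  # [T]

# Distill only where the student is most uncertain (assumed policy).
idx = entropy.topk(k).indices
kd_loss = F.kl_div(logp_s[idx],
                   F.log_softmax(teacher_logits[idx], dim=-1),
                   log_target=True, reduction="batchmean")
print(f"selected {k}/{T} positions, KD loss = {kd_loss:.3f}")
```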
[NLP-114] On the Power of (Approximate) Reward Models for Inference-Time Scaling
【Quick Read】: This paper asks whether inference-time scaling with an approximate reward model can still effectively improve LLM reasoning. Deployed systems never have access to a true reward model, only approximations, so a theoretical account is needed of when and why approximations suffice. The key result identifies the Bellman error of the approximate reward model as the quantity governing the effectiveness of Sequential Monte Carlo (SMC)-based inference-time scaling: if the Bellman error is bounded at rate O(1/T) for a reasoning process of length T, then combining the approximate reward with SMC reduces the computational complexity of reasoning from exponential in T to polynomial in T, an exponential efficiency gain despite using only approximate rewards (a toy SMC resampling loop follows this entry).
Link: https://arxiv.org/abs/2602.01381
Authors: Youheng Zhu,Yiping Lu
Institutions: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (stat.ML)
Note:
Abstract:Inference-time scaling has recently emerged as a powerful paradigm for improving the reasoning capability of large language models. Among various approaches, Sequential Monte Carlo (SMC) has become a particularly important framework, enabling iterative generation, evaluation, rejection, and resampling of intermediate reasoning trajectories. A central component in this process is the reward model, which evaluates partial solutions and guides the allocation of computation during inference. However, in practice, true reward models are never available. All deployed systems rely on approximate reward models, raising a fundamental question: Why and when do approximate reward models suffice for effective inference-time scaling? In this work, we provide a theoretical answer. We identify the Bellman error of the approximate reward model as the key quantity governing the effectiveness of SMC-based inference-time scaling. For a reasoning process of length T, we show that if the Bellman error of the approximate reward model is bounded by O(1/T), then combining this reward model with SMC reduces the computational complexity of reasoning from exponential in T to polynomial in T. This yields an exponential improvement in inference efficiency despite using only approximate rewards.
zh
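To make the SMC framing concrete, here is a minimal particle loop over partial reasoning traces guided by an approximate step reward: generate, evaluate, resample. The step function, reward function, and particle count are placeholders, not the paper's construction.

```python
import random

def smc_decode(step_fn, reward_fn, n_particles=8, n_steps=5):
    """Toy Sequential Monte Carlo over reasoning prefixes.

    step_fn(prefix)   -> extends a partial trace by one step
    reward_fn(prefix) -> approximate score of the partial trace
    """
    particles = [[] for _ in range(n_particles)]
    for _ in range(n_steps):
        # Propose: extend every partial trajectory by one step.
        particles = [step_fn(p) for p in particles]
        # Weight: score prefixes with the approximate reward model.
        weights = [max(reward_fn(p), 1e-8) for p in particles]
        # Resample: keep computation focused on promising prefixes.
        particles = random.choices(particles, weights=weights, k=n_particles)
    return max(particles, key=reward_fn)

# Dummy problem: build digit strings; the reward favors many 9s.
step = lambda p: p + [random.randint(0, 9)]
reward = lambda p: 1.0 + sum(1 for d in p if d == 9)
print(smc_decode(step, reward))
```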
[NLP-115] Context Dependence and Reliability in Autoregressive Language Models
【Quick Read】: This paper addresses the difficulty of identifying which context elements truly influence LLM outputs when inputs contain redundant information: conventional explanation methods are unstable under redundant, overlapping context, and small input changes can cause large swings in attribution scores, weakening interpretability and raising risks such as prompt injection. The key solution is RISE (Redundancy-Insensitive Scoring of Explanation), which quantifies the unique influence of each input element relative to the others, reducing redundancy interference and yielding clearer, more stable attributions, and underscores the central role of conditional information in trustworthy LLM explanation and monitoring (a leave-one-out proxy for unique influence is sketched after this entry).
Link: https://arxiv.org/abs/2602.01378
Authors: Poushali Sengupta,Shashi Raj Pandey,Sabita Maharjan,Frank Eliassen
Institutions: University of Oslo; Aalborg University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Note:
Abstract:Large language models (LLMs) generate outputs by utilizing extensive context, which often includes redundant information from prompts, retrieved passages, and interaction history. In critical applications, it is vital to identify which context elements actually influence the output, as standard explanation methods struggle with redundancy and overlapping context. Minor changes in input can lead to unpredictable shifts in attribution scores, undermining interpretability and raising concerns about risks like prompt injection. This work addresses the challenge of distinguishing essential context elements from correlated ones. We introduce RISE (Redundancy-Insensitive Scoring of Explanation), a method that quantifies the unique influence of each input relative to others, minimizing the impact of redundancies and providing clearer, stable attributions. Experiments demonstrate that RISE offers more robust explanations than traditional methods, emphasizing the importance of conditional information for trustworthy LLM explanations and monitoring.
zh
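RISE's exact estimator is not reproduced in this digest; as a crude proxy for "unique influence relative to other inputs", the sketch below scores each context element by the model-score drop when that element alone is removed, a conditional leave-one-out baseline under a toy keyword-coverage scorer.

```python
def unique_influence(elements, score_fn):
    """Crude conditional leave-one-out attribution (a proxy, not RISE itself).

    elements: list of context pieces (e.g., retrieved passages)
    score_fn: maps a list of elements to a scalar model score
    Returns the score drop attributable to each element given the rest,
    which discounts elements whose content is duplicated elsewhere.
    """
    full = score_fn(elements)
    scores = []
    for i in range(len(elements)):
        rest = elements[:i] + elements[i + 1:]
        scores.append(full - score_fn(rest))
    return scores

# Dummy score: coverage of query keywords by the kept context.
query = {"paris", "capital", "france"}
cover = lambda ctx: len(query & set(" ".join(ctx).split())) / len(query)
ctx = ["paris is the capital",
       "the capital is paris",   # redundant duplicate of the first passage
       "it is in france"]
print(unique_influence(ctx, cover))
# Duplicates score ~0; only the passage uniquely adding "france" scores > 0.
```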
[NLP-116] When Domains Interact: Asymmetric and Order-Sensitive Cross-Domain Effects in Reinforcement Learning for Reasoning
【Quick Read】: This paper addresses the large performance differences that Group Relative Policy Optimization (GRPO) exhibits under different training orders and domain-mixing strategies in multi-domain reasoning; GRPO's behavior under sequential training (one domain at a time) versus mixed-domain training (several domains at once) had not been systematically studied. Systematic experiments reveal three core phenomena: (1) single-domain generalization is highly asymmetric, e.g., training on other domains improves math reasoning accuracy by roughly 25% but transfers almost nothing to logic and puzzle tasks; (2) cross-domain interactions depend strongly on order, with math → science reaching 83%/41% accuracy while the reversed order drops to 77%/25%; and (3) no strategy is universally optimal: sequential training favors math (up to 84%), mixed training favors science and logic, and a poor ordering can cost up to 14 points (70% vs. 56%). GRPO in multi-domain settings thus shows pronounced asymmetry, order sensitivity, and strategy dependence, necessitating domain-aware and order-aware training design.
Link: https://arxiv.org/abs/2602.01365
Authors: Wang Yang,Shouren Wang,Chaoda Song,Chuang Ma,Xinpeng Li,Nengbo Wang,Kaixiong Zhou,Vipin Chaudhary,Xiaotian Han
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Note:
Abstract:Group Relative Policy Optimization (GRPO) has become a key technique for improving reasoning abilities in large language models, yet its behavior under different domain sequencing strategies is poorly understood. In particular, the impact of sequential (one domain at a time) versus mixed-domain (multiple domains at a time) training in GRPO has not been systematically studied. We provide the first systematic analysis of training-order effects across math, science, logic, and puzzle reasoning tasks. We found (1) single-domain generalization is highly asymmetric: training on other domains improves math reasoning by approximately 25% accuracy, while yielding negligible transfer to logic and puzzle; (2) cross-domain interactions are highly order-dependent: training in the order math → science achieves 83% / 41% accuracy on math / science, while reversing the order to science → math degrades performance to 77% / 25%; (3) no single strategy is universally optimal in multi-domain training: sequential training favors math (up to 84%), mixed training favors science and logic, and poor ordering can incur large performance gaps (from 70% to 56%). Overall, our findings demonstrate that GRPO under multi-domain settings exhibits pronounced asymmetry, order sensitivity, and strategy dependence, highlighting the necessity of domain-aware and order-aware training design.
zh
[NLP-117] Causally Disentangled Contrastive Learning for Multilingual Speaker Embeddings
【Quick Read】: This paper investigates the leakage of sensitive demographic information (gender, age, accent) in self-supervised speaker embeddings, which raises fairness and privacy risks. Two debiasing strategies are examined: adversarial training with gradient reversal to suppress attribute encoding, and a causal bottleneck architecture that explicitly separates demographic from residual information. Experiments show that both reduce gender leakage but have limited effect on age and accent, and both degrade speaker-verification performance (measured by ROC-AUC and EER), exposing a fundamental trade-off between debiasing effectiveness and accuracy in current techniques (a gradient-reversal sketch follows this entry).
Link: https://arxiv.org/abs/2602.01363
Authors: Mariëtte Olijslager,Seyed Sahand Mohammadi Ziabari,Ali Mohammed Mansoor Alsahag
Institutions: University of Amsterdam; SUNY Empire State University
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Note:
Abstract:Self-supervised speaker embeddings are widely used in speaker verification systems, but prior work has shown that they often encode sensitive demographic attributes, raising fairness and privacy concerns. This paper investigates the extent to which demographic information, specifically gender, age, and accent, is present in SimCLR-trained speaker embeddings and whether such leakage can be mitigated without severely degrading speaker verification performance. We study two debiasing strategies: adversarial training through gradient reversal and a causal bottleneck architecture that explicitly separates demographic and residual information. Demographic leakage is quantified using both linear and nonlinear probing classifiers, while speaker verification performance is evaluated using ROC-AUC and EER. Our results show that gender information is strongly and linearly encoded in baseline embeddings, whereas age and accent are weaker and primarily nonlinearly represented. Adversarial debiasing reduces gender leakage but has limited effect on age and accent and introduces a clear trade-off with verification accuracy. The causal bottleneck further suppresses demographic information, particularly in the residual representation, but incurs substantial performance degradation. These findings highlight fundamental limitations in mitigating demographic leakage in self-supervised speaker embeddings and clarify the trade-offs inherent in current debiasing approaches.
zh
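Gradient reversal here refers to the standard domain-adversarial trick; a minimal PyTorch version is sketched below, with the embedding size, adversary head, and lambda value chosen arbitrarily for illustration rather than taken from the paper.

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Illustrative adversary: predicts gender from a speaker embedding.
# Training the adversary through grad_reverse pushes the encoder to
# remove the attribute while the adversary tries to recover it.
emb = torch.randn(4, 192, requires_grad=True)   # stand-in speaker embeddings
adversary = nn.Linear(192, 2)                   # 2-way demographic classifier
logits = adversary(grad_reverse(emb, lam=0.5))
loss = nn.functional.cross_entropy(logits, torch.tensor([0, 1, 0, 1]))
loss.backward()
print(emb.grad.shape)  # gradients w.r.t. the embeddings are reversed
```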
[NLP-118] Balancing Understanding and Generation in Discrete Diffusion Models
【Quick Read】: This paper addresses the performance imbalance between the two dominant paradigms of discrete generative modeling: Masked Diffusion Language Models (MDLM) excel at semantic understanding and zero-shot generalization, while Uniform-noise Diffusion Language Models (UDLM) are stronger at few-step generation, but neither balances understanding and generation quality. The key innovation of XDLM is a stationary noise kernel that theoretically unifies the two paradigms, recovering MDLM and UDLM as special cases, plus an algebraic simplification of the posterior probabilities that substantially relieves the memory bottleneck. XDLM beats UDLM by 5.4 points on zero-shot text benchmarks, outperforms MDLM on few-step image generation (FID 54.1 vs. 80.8), and when used to tune an 8B LLM reaches 15.0 MBPP in just 32 steps, roughly doubling the baseline; training-dynamics analysis further suggests stronger long-term scaling potential (a toy forward-corruption kernel follows this entry).
Link: https://arxiv.org/abs/2602.01362
Authors: Yue Liu,Yuzhong Zhao,Zheyong Xie,Qixiang Ye,Jianbin Jiao,Yao Hu,Shaosheng Cao,Yunfan Liu
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Note: 32 pages, Code is available at this https URL
Abstract:In discrete generative modeling, two dominant paradigms demonstrate divergent capabilities: Masked Diffusion Language Models (MDLM) excel at semantic understanding and zero-shot generalization, whereas Uniform-noise Diffusion Language Models (UDLM) achieve strong few-step generation quality, yet neither attains balanced performance across both dimensions. To address this, we propose XDLM, which bridges the two paradigms via a stationary noise kernel. XDLM offers two key contributions: (1) it provides a principled theoretical unification of MDLM and UDLM, recovering each paradigm as a special case; and (2) an alleviated memory bottleneck enabled by an algebraic simplification of the posterior probabilities. Experiments demonstrate that XDLM advances the Pareto frontier between understanding capability and generation quality. Quantitatively, XDLM surpasses UDLM by 5.4 points on zero-shot text benchmarks and outperforms MDLM in few-step image generation (FID 54.1 vs. 80.8). When scaled to tune an 8B-parameter large language model, XDLM achieves 15.0 MBPP in just 32 steps, effectively doubling the baseline performance. Finally, analysis of training dynamics reveals XDLM’s superior potential for long-term scaling. Code is available at this https URL
zh
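The paper's stationary kernel is not reproduced here; purely to illustrate how one forward-corruption kernel can interpolate between the two paradigms, the sketch keeps a token with probability a_t, masks it with probability b_t, and resamples it uniformly otherwise, so b_t = 0 recovers uniform-noise corruption and a_t + b_t = 1 recovers masked corruption. All symbols are assumptions of this sketch.

```python
import random

MASK = "[MASK]"

def corrupt(tokens, vocab, a_t, b_t):
    """Toy interpolated forward kernel for discrete diffusion.

    a_t:  prob. a token survives unchanged
    b_t:  prob. a token is replaced by [MASK]  (MDLM limit: a_t + b_t = 1)
    rest: prob. a token is resampled uniformly (UDLM limit: b_t = 0)
    """
    assert 0.0 <= a_t and 0.0 <= b_t and a_t + b_t <= 1.0
    out = []
    for tok in tokens:
        u = random.random()
        if u < a_t:
            out.append(tok)
        elif u < a_t + b_t:
            out.append(MASK)
        else:
            out.append(random.choice(vocab))
    return out

vocab = list("abcde")
x0 = list("abcabc")
print(corrupt(x0, vocab, a_t=0.5, b_t=0.3))  # mixed mask/uniform noise
print(corrupt(x0, vocab, a_t=0.5, b_t=0.5))  # pure masked-diffusion noise
print(corrupt(x0, vocab, a_t=0.5, b_t=0.0))  # pure uniform-noise diffusion
```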
[NLP-119] CRAFT: Calibrated Reasoning with Answer-Faithful Traces via Reinforcement Learning for Multi-Hop Question Answering
【Quick Read】: This paper targets three core failure modes of generative models in multi-hop question answering (multi-hop QA): reasoning collapse, reasoning-answer inconsistency, and loss of format control. The key solution, CRAFT (Calibrated Reasoning with Answer-Faithful Traces), is a reinforcement learning framework based on Group Relative Policy Optimization (GRPO) with dual rewards: deterministic rewards enforce structural correctness while judge-based rewards verify semantic faithfulness, training the model to reason faithfully during response generation. CRAFT also supports controllable reasoning-trace variants for systematically analyzing how structure and scale affect reasoning performance and faithfulness; on three multi-hop QA benchmarks it improves both answer accuracy and reasoning faithfulness, with the 7B model matching closed-source LLMs across multiple trace settings.
Link: https://arxiv.org/abs/2602.01348
Authors: Yu Liu,Wenxiao Zhang,Cong Cao,Fangfang Yuan,Weizhuo Chen,Cheng Hu,Pin Xu,Yuling Yang,Kun Peng,Diandian Guo,Qiang Sun,Yanbing Liu,Jin B. Hong,Zhiyuan Ma
Institutions: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; The University of Western Australia; Huazhong University of Science and Technology
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Note:
Abstract:Retrieval-augmented generation (RAG) is widely used to ground Large Language Models (LLMs) for multi-hop question answering. Recent work mainly focused on improving answer accuracy via fine-tuning and structured or reinforcement-based optimization. However, reliable reasoning in response generation faces three challenges: 1) Reasoning Collapse. Reasoning in multi-hop QA is inherently complex due to multi-hop composition and is further destabilized by noisy retrieval. 2) Reasoning-answer inconsistency. Due to the intrinsic uncertainty of LLM generation and exposure to evidence–distractor mixtures, models may produce correct answers that are not faithfully supported by their intermediate reasoning or evidence. 3) Loss of format control. Traditional chain-of-thought generation often deviates from required structured output formats, leading to incomplete or malformed structured content. To address these challenges, we propose CRAFT (Calibrated Reasoning with Answer-Faithful Traces), a Group Relative Policy Optimization (GRPO) based reinforcement learning framework that trains models to perform faithful reasoning during response generation. CRAFT employs dual reward mechanisms to optimize multi-hop reasoning: deterministic rewards ensure structural correctness while judge-based rewards verify semantic faithfulness. This optimization framework supports controllable trace variants that enable systematic analysis of how structure and scale affect reasoning performance and faithfulness. Experiments on three multi-hop QA benchmarks show that CRAFT improves both answer accuracy and reasoning faithfulness across model scales, with the CRAFT 7B model achieving competitive performance with closed-source LLMs across multiple reasoning trace settings.
zh
[NLP-120] A-MapReduce: Executing Wide Search via Agentic MapReduce
【Quick Read】: This paper addresses the inefficiency and long execution paths of current LLM-based multi-agent systems on wide search tasks: existing frameworks are designed around vertically structured sequential reasoning and scale poorly to large, breadth-oriented retrieval, wasting resources and capping performance. The key innovation of A-MapReduce, a multi-agent execution framework inspired by the MapReduce paradigm, is to recast wide search as a horizontally structured parallel retrieval problem: task-adaptive decomposition processes massive retrieval targets in parallel with structured result aggregation, while experiential memory drives the continual evolution of query-conditioned task allocation and recomposition, supporting progressive improvement in large-scale wide-search scenarios.
Link: https://arxiv.org/abs/2602.01331
Authors: Mingju Chen,Guibin Zhang,Heng Chang,Yuchen Guo,Shiji Zhou
Institutions: Unknown
Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
Note: 33 pages
Abstract:Contemporary large language model (LLM)-based multi-agent systems exhibit systematic advantages in deep research tasks, which emphasize iterative, vertically structured information seeking. However, when confronted with wide search tasks characterized by large-scale, breadth-oriented retrieval, existing agentic frameworks, primarily designed around sequential, vertically structured reasoning, remain stuck in expansive search objectives and inefficient long-horizon execution. To bridge this gap, we propose A-MapReduce, a MapReduce paradigm-inspired multi-agent execution framework that recasts wide search as a horizontally structured retrieval problem. Concretely, A-MapReduce implements parallel processing of massive retrieval targets through task-adaptive decomposition and structured result aggregation. Meanwhile, it leverages experiential memory to drive the continual evolution of query-conditioned task allocation and recomposition, enabling progressive improvement in large-scale wide-search regimes. Extensive experiments on five agentic benchmarks demonstrate that A-MapReduce is (i) high-performing, achieving state-of-the-art performance on WideSearch and DeepWideSearch, and delivering 5.11% - 17.50% average Item F1 improvements compared with strong baselines with OpenAI o3 or Gemini 2.5 Pro backbones; (ii) cost-effective and efficient, delivering superior cost-performance trade-offs and reducing running time by 45.8% compared to representative multi-agent baselines. The code is available at this https URL.
zh
[NLP-121] DreamOn: Diffusion Language Models For Code Infilling Beyond Fixed-size Canvas ICLR2026
【Quick Read】: This paper addresses the limited generation flexibility of diffusion language models (DLMs) caused by fixed-length masked inputs: code-infilling performance degrades sharply when the predefined mask length mismatches the ideal completion length. The key of the DreamOn framework is to introduce two length-control states that let the model adjust output length autonomously during generation, with no external length information and no architectural changes, only a light modification of the training objective, enabling dynamic variable-length generation that markedly improves DLM performance on variable-length tasks and matches strong autoregressive models.
Link: https://arxiv.org/abs/2602.01326
Authors: Zirui Wu,Lin Zheng,Zhihui Xie,Jiacheng Ye,Jiahui Gao,Shansan Gong,Yansong Feng,Zhenguo Li,Wei Bi,Guorui Zhou,Lingpeng Kong
Institutions: The University of Hong Kong; Kuaishou Technology; Huawei Noah's Ark Lab; Peking University
Categories: Computation and Language (cs.CL)
Note: ICLR 2026
Abstract:Diffusion Language Models (DLMs) present a compelling alternative to autoregressive models, offering flexible, any-order infilling without specialized prompting design. However, their practical utility is blocked by a critical limitation: the requirement of a fixed-length masked sequence for generation. This constraint severely degrades code infilling performance when the predefined mask size mismatches the ideal completion length. To address this, we propose DreamOn, a novel diffusion framework that enables dynamic, variable-length generation. DreamOn augments the diffusion process with two length control states, allowing the model to autonomously expand or contract the output length based solely on its own predictions. We integrate this mechanism into existing DLMs with minimal modifications to the training objective and no architectural changes. Built upon Dream-Coder-7B and DiffuCoder-7B, DreamOn achieves infilling performance on par with state-of-the-art autoregressive models on HumanEval-Infilling and SantaCoder-FIM and matches oracle performance achieved with ground-truth length. Our work removes a fundamental barrier to the practical deployment of DLMs, significantly advancing their flexibility and applicability for variable-length generation. Our code is available at this https URL.
zh
[NLP-122] PolySAE: Modeling Feature Interactions in Sparse Autoencoders via Polynomial Decoding
【Quick Read】: This paper addresses the inability of sparse autoencoders (SAEs) to capture compositional structure when interpreting neural representations. Because traditional SAEs assume features reconstruct by linear superposition, they cannot distinguish whether a compound concept such as "Starbucks" arises from semantically composing "star" and "coffee" features or merely from frequent co-occurrence, forcing compound concepts into single, non-decomposable features. The key innovation of PolySAE is to extend the SAE decoder with higher-order (polynomial) terms that explicitly model feature interactions while keeping the linear encoder for interpretability; a low-rank tensor factorization over a shared projection subspace captures pairwise and triple interactions efficiently with only ~3% extra parameters (on GPT-2), improving probing F1 by roughly 8% on average across models and SAE variants at comparable reconstruction error. Crucially, the learned interaction weights correlate only weakly with co-occurrence frequency (r = 0.06 vs. r = 0.82 for SAE feature covariance), indicating genuine semantic composition such as morphological binding and phrasal composition rather than surface statistics (a low-rank interaction decoder is sketched after this entry).
Link: https://arxiv.org/abs/2602.01322
Authors: Panagiotis Koromilas,Andreas D. Demou,James Oldfield,Yannis Panagakis,Mihalis Nicolaou
Institutions: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Note:
Abstract:Sparse autoencoders (SAEs) have emerged as a promising method for interpreting neural network representations by decomposing activations into sparse combinations of dictionary atoms. However, SAEs assume that features combine additively through linear reconstruction, an assumption that cannot capture compositional structure: linear models cannot distinguish whether "Starbucks" arises from the composition of "star" and "coffee" features or merely their co-occurrence. This forces SAEs to allocate monolithic features for compound concepts rather than decomposing them into interpretable constituents. We introduce PolySAE, which extends the SAE decoder with higher-order terms to model feature interactions while preserving the linear encoder essential for interpretability. Through low-rank tensor factorization on a shared projection subspace, PolySAE captures pairwise and triple feature interactions with small parameter overhead (3% on GPT-2). Across four language models and three SAE variants, PolySAE achieves an average improvement of approximately 8% in probing F1 while maintaining comparable reconstruction error, and produces 2-10× larger Wasserstein distances between class-conditional feature distributions. Critically, learned interaction weights exhibit negligible correlation with co-occurrence frequency (r = 0.06 vs. r = 0.82 for SAE feature covariance), suggesting that polynomial terms capture compositional structure, such as morphological binding and phrasal composition, largely independent of surface statistics.
zh
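A minimal sketch of a decoder with a low-rank pairwise interaction term on top of the usual linear SAE reconstruction; the paper's exact factorization, ranks, and third-order terms are not reproduced, and all sizes below are made up.

```python
import torch
from torch import nn

class PolyDecoder(nn.Module):
    """Linear SAE decoder plus a low-rank pairwise interaction term (sketch).

    x_hat = D z + W ((U z) * (V z))
    where z are sparse codes; U and V project into a shared rank-r subspace
    so pairwise interactions cost O(d_model * r) instead of O(m^2).
    """
    def __init__(self, n_features=512, d_model=128, rank=16):
        super().__init__()
        self.D = nn.Linear(n_features, d_model, bias=False)  # linear atoms
        self.U = nn.Linear(n_features, rank, bias=False)
        self.V = nn.Linear(n_features, rank, bias=False)
        self.W = nn.Linear(rank, d_model, bias=False)        # lift back

    def forward(self, z):
        linear = self.D(z)
        pairwise = self.W(self.U(z) * self.V(z))  # elementwise product: degree-2 term
        return linear + pairwise

dec = PolyDecoder()
z = torch.relu(torch.randn(4, 512))  # stand-in sparse codes
print(dec(z).shape)                  # torch.Size([4, 128])
```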
[NLP-123] EverMemBench: Benchmarking Long-Term Interactive Memory in Large Language Models
【Quick Read】: This paper addresses the shortcomings of existing benchmarks for long-term conversational memory in LLMs, which cover only dyadic, single-topic dialogues and fail to reflect realistic complexity. The key contribution is EverMemBench, a million-token-scale dataset of multi-party, multi-group conversations with temporally evolving information, cross-topic interleaving, and role-specific personas, used to evaluate memory mechanisms along three dimensions via 1,000+ QA pairs: fine-grained recall, memory awareness, and user-profile understanding. The benchmark exposes core bottlenecks in multi-hop reasoning, temporal semantics, and retrieval-based memory matching, providing a challenging testbed for designing and refining next-generation memory architectures.
Link: https://arxiv.org/abs/2602.01313
Authors: Chuanrui Hu,Tong Li,Xingze Gao,Hongda Chen,Dannong Xu,Yi Bai,Tianwei Lin,Xinda Zhao,Xiaohong Li,Jiaqi An,Yunyun Han,Jian Pei,Yafeng Deng
Institutions: EverMind; Shanda Group; Duke University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Note: 10 pages, 2 figures, 4 tables
Abstract:Long-term conversational memory is essential for LLM-based assistants, yet existing benchmarks focus on dyadic, single-topic dialogues that fail to capture real-world complexity. We introduce EverMemBench, a benchmark featuring multi-party, multi-group conversations spanning over 1 million tokens with temporally evolving information, cross-topic interleaving, and role-specific personas. EverMemBench evaluates memory systems across three dimensions through 1,000+ QA pairs: fine-grained recall, memory awareness, and user profile understanding. Our evaluation reveals critical limitations: (1) multi-hop reasoning collapses in multi-party settings, with even oracle models achieving only 26%; (2) temporal reasoning remains unsolved, requiring version semantics beyond timestamp matching; (3) memory awareness is bottlenecked by retrieval, where current similarity-based methods fail to bridge the semantic gap between queries and implicitly relevant memories. EverMemBench provides a challenging testbed for developing next-generation memory architectures.
zh
[NLP-124] PACER: Blockwise Pre-verification for Speculative Decoding with Adaptive Length
【Quick Read】: This paper addresses the cap that fixed draft lengths place on speculative decoding (SD) speedups: the optimal draft length varies substantially across decoding steps, so a fixed length leaves acceleration on the table. The key of Pacer is to control draft length dynamically via a lightweight, trainable pre-verification layer that checks candidate tokens blockwise before they reach the target model; if pre-verification fails, the draft model stops generating immediately, enabling more flexible and efficient draft-length adjustment and markedly faster inference (a toy draft-and-verify loop with early stopping follows this entry).
Link: https://arxiv.org/abs/2602.01274
Authors: Situo Zhang,Yifan Zhang,Zichen Zhu,Hankun Wang,Da Ma,Danyang Zhang,Lu Chen,Kai Yu
Institutions: X-LANCE Lab; School of Computer Science; MoE Key Lab of Artificial Intelligence; SJTU AI Institute; Shanghai Jiao Tong University; Jiangsu Key Lab of Language Computing; Suzhou Laboratory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Note:
Abstract:Speculative decoding (SD) is a powerful technique for accelerating the inference process of large language models (LLMs) without sacrificing accuracy. Typically, SD employs a small draft model to generate a fixed number of draft tokens, which are then verified in parallel by the target model. However, our experiments reveal that the optimal draft length varies significantly across different decoding steps. This variation suggests that using a fixed draft length limits the potential for further improvements in decoding speed. To address this challenge, we propose Pacer, a novel approach that dynamically controls draft length using a lightweight, trainable pre-verification layer. This layer pre-verifies draft tokens blockwise before they are sent to the target model, allowing the draft model to stop token generation if the blockwise pre-verification fails. We implement Pacer on multiple SD model pairs and evaluate its performance across various benchmarks. Our results demonstrate that Pacer achieves up to 2.66x Speedup over autoregressive decoding and consistently outperforms standard speculative decoding. Furthermore, when integrated with Ouroboros, Pacer attains up to 3.09x Speedup.
zh
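A toy control flow for blockwise pre-verification in speculative decoding: the draft model keeps emitting blocks until a cheap check fails, and only the surviving draft is sent on for true verification. The draft function, the pre-verifier, and the block size below are stand-ins, not Pacer's trained components.

```python
def draft_with_preverify(draft_step, preverify, max_draft=8, block=2):
    """Generate draft tokens block by block; stop early if pre-verification fails.

    draft_step(prefix) -> next draft token given the tokens so far
    preverify(tokens)  -> bool; a cheap learned check in Pacer (a stand-in below)
    Returns the draft tokens handed to the target model for true verification.
    """
    tokens = []
    while len(tokens) < max_draft:
        new_block = []
        for _ in range(block):
            new_block.append(draft_step(tokens + new_block))
        if not preverify(tokens + new_block):
            break  # pre-verification failed: stop drafting immediately
        tokens += new_block
    return tokens

# Dummy draft model counts upward; dummy pre-verifier rejects tokens >= 5.
draft = lambda prefix: len(prefix)
check = lambda toks: all(t < 5 for t in toks)
print(draft_with_preverify(draft, check))  # [0, 1, 2, 3]; the block (4, 5) fails
```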
[NLP-125] PARSE: An Open-Domain Reasoning Question Answering Benchmark for Persian SIGIR2026
【Quick Read】: This paper addresses the lack of high-quality reasoning QA benchmarks for low-resource languages, Persian in particular: progress in LLM reasoning has concentrated on high-resource languages, while Persian, spoken by roughly 130 million people, has long lacked a systematic open-domain reasoning QA benchmark. The key solution is PARSE, the first open-domain Persian reasoning QA benchmark, with 10,800 Boolean, multiple-choice, and factoid questions spanning diverse reasoning types, difficulty levels, and answer structures, built via a controlled LLM generation pipeline with multi-stage filtering, human annotation, and consistency checks for linguistic and factual quality. Benchmarking multilingual and Persian-specialized models shows that Persian prompts and structured prompting (CoT for Boolean/multiple-choice questions, few-shot for factoid questions) clearly improve performance, with further gains from fine-tuning, especially for Persian-specialized models, providing a solid foundation for developing and fairly comparing reasoning-capable models in low-resource settings.
Link: https://arxiv.org/abs/2602.01246
Authors: Jamshid Mozafari,Seyed Parsa Mousavinasab,Adam Jatowt
Institutions: University of Innsbruck
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Note: Submitted to SIGIR 2026
Abstract:Reasoning-focused Question Answering (QA) has advanced rapidly with Large Language Models (LLMs), yet high-quality benchmarks for low-resource languages remain scarce. Persian, spoken by roughly 130 million people, lacks a comprehensive open-domain resource for evaluating reasoning-capable QA systems. We introduce PARSE, the first open-domain Persian reasoning QA benchmark, containing 10,800 questions across Boolean, multiple-choice, and factoid formats, with diverse reasoning types, difficulty levels, and answer structures. The benchmark is built via a controlled LLM-based generation pipeline and validated through human evaluation. We also ensure linguistic and factual quality through multi-stage filtering, annotation, and consistency checks. We benchmark multilingual and Persian LLMs under multiple prompting strategies and show that Persian prompts and structured prompting (CoT for Boolean/multiple-choice; few-shot for factoid) improve performance. Fine-tuning further boosts results, especially for Persian-specialized models. These findings highlight how PARSE supports both fair comparison and practical model adaptation. PARSE fills a critical gap in Persian QA research and provides a strong foundation for developing and evaluating reasoning-capable LLMs in low-resource settings.
zh
[NLP-126] Large-Scale Terminal Agentic Trajectory Generation from Dockerized Environments
【Quick Read】: This paper addresses the scarcity of high-quality, executable, and verifiable long-horizon terminal trajectories for training terminal-task agentic models. The core challenges are executability, since each task instance needs a suitable and often distinct Docker environment, and verifiability, since heterogeneous task outputs preclude unified, standardized checking. The key solution is TerminalTraj, a scalable pipeline that (i) filters high-quality code repositories to build Dockerized execution environments, (ii) generates Docker-aligned task instances, and (iii) synthesizes agent trajectories with executable validation code, effectively supporting large-scale construction of high-quality terminal trajectories and clearly improving the TerminalBench performance of models built on Qwen2.5-Coder backbones.
Link: https://arxiv.org/abs/2602.01244
Authors: Siwei Wu,Yizhi Li,Yuyang Song,Wei Zhang,Yang Wang,Riza Batista-Navarro,Xian Yang,Mingjie Tang,Bryan Dai,Jian Yang,Chenghua Lin
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Note: Agentic Trajectory, Agentic Model, Terminal, Code Agent
Abstract:Training agentic models for terminal-based tasks critically depends on high-quality terminal trajectories that capture realistic long-horizon interactions across diverse domains. However, constructing such data at scale remains challenging due to two key requirements: Executability, since each instance requires a suitable and often distinct Docker environment; and Verifiability, because heterogeneous task outputs preclude unified, standardized verification. To address these challenges, we propose TerminalTraj, a scalable pipeline that (i) filters high-quality repositories to construct Dockerized execution environments, (ii) generates Docker-aligned task instances, and (iii) synthesizes agent trajectories with executable validation code. Using TerminalTraj, we curate 32K Docker images and generate 50,733 verified terminal trajectories across eight domains. Models trained on this data with the Qwen2.5-Coder backbone achieve consistent performance improvements on TerminalBench (TB), with gains of up to 20% on TB 1.0 and 10% on TB 2.0 over their respective backbones. Notably, TerminalTraj-32B achieves strong performance among models with fewer than 100B parameters, reaching 35.30% on TB 1.0 and 22.00% on TB 2.0, and demonstrates improved test-time scaling behavior. All code and data are available at this https URL.
zh
[NLP-127] Minimizing Mismatch Risk: A Prototype-Based Routing Framework for Zero-shot LLM-generated Text Detection
【Quick Read】: This paper addresses the instability of zero-shot detection of LLM-generated text caused by a poorly chosen fixed surrogate model. Existing methods use a single fixed surrogate for all inputs, yet detection performance depends heavily on surrogate-source alignment; while no surrogate is universally optimal, a well-matched one usually exists within a diverse pool for any given input. The key move is to recast robust detection as a surrogate routing problem: DetectRouter learns text-detector affinity through two-stage training, first building discriminative prototypes from white-box models, then adapting to black-box sources by aligning geometric distances with observed detection scores. Experiments on the EvoBench and MAGE benchmarks show consistent gains in stability and accuracy across multiple detection criteria and model families (a prototype-routing sketch follows this entry).
Link: https://arxiv.org/abs/2602.01240
Authors: Ke Sun,Guangsheng Bao,Han Cui,Yue Zhang
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Note:
Abstract:Zero-shot methods detect LLM-generated text by computing statistical signatures using a surrogate model. Existing approaches typically employ a fixed surrogate for all inputs regardless of the unknown source. We systematically examine this design and find that detection performance varies substantially depending on surrogate-source alignment. We observe that while no single surrogate achieves optimal performance universally, a well-matched surrogate typically exists within a diverse pool for any given input. This finding transforms robust detection into a routing problem: selecting the most appropriate surrogate for each input. We propose DetectRouter, a prototype-based framework that learns text-detector affinity through two-stage training. The first stage constructs discriminative prototypes from white-box models; the second generalizes to black-box sources by aligning geometric distances with observed detection scores. Experiments on EvoBench and MAGE benchmarks demonstrate consistent improvements across multiple detection criteria and model families.
zh
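A stripped-down version of prototype-based routing: embed the input, pick the nearest prototype, and use its associated surrogate detector. The embeddings, prototypes, and scorers below are stand-ins for the paper's two-stage training.

```python
import numpy as np

def route(text_vec, prototypes, surrogates):
    """Pick the surrogate whose prototype is closest to the input embedding.

    prototypes: [n_surrogates, d] array, one prototype per surrogate
    surrogates: list of n_surrogates detector callables
    """
    dists = np.linalg.norm(prototypes - text_vec, axis=1)
    best = int(dists.argmin())
    return surrogates[best], best

# Stand-in embeddings and two dummy surrogate scorers.
rng = np.random.default_rng(0)
protos = rng.normal(size=(2, 8))
detectors = [lambda v: float(v.sum()), lambda v: float(v.mean())]
x = protos[1] + 0.01 * rng.normal(size=8)   # an input near prototype 1
detector, idx = route(x, protos, detectors)
print(idx, detector(x))                      # routes to surrogate 1
```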
[NLP-128] Inferential Question Answering WWW2026
【Quick Read】: This paper addresses the limitation of traditional QA systems on questions whose answers must be inferred. Most existing work assumes answer containment, i.e., that answers can be directly extracted or generated from corpus documents, but many answers exist only implicitly and must be inferred from clues. The authors introduce the Inferential QA task, which challenges models to infer answers from answer-supporting passages that provide only hints, and build the QUIT dataset of 7,401 inference-requiring questions and 2.4M passages, labeled across three relevance levels using LLM-based answerability checks and human verification to ensure high-quality inference signals. The key is establishing a benchmark dedicated to testing reasoning from indirect textual evidence; it reveals that retrievers underperform, rerankers offer limited gains, fine-tuning helps inconsistently, and even reasoning-oriented LLM readers hit significant bottlenecks, pushing QA systems toward deeper understanding and reasoning.
Link: https://arxiv.org/abs/2602.01239
Authors: Jamshid Mozafari,Hamed Zamani,Guido Zuccon,Adam Jatowt
Institutions: University of Innsbruck; University of Massachusetts Amherst; The University of Queensland
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Note: Proceedings of the ACM Web Conference 2026 (WWW 2026)
Abstract:Despite extensive research on a wide range of question answering (QA) systems, most existing work focuses on answer containment-i.e., assuming that answers can be directly extracted and/or generated from documents in the corpus. However, some questions require inference, i.e., deriving answers that are not explicitly stated but can be inferred from the available information. We introduce Inferential QA – a new task that challenges models to infer answers from answer-supporting passages which provide only clues. To study this problem, we construct QUIT (QUestions requiring Inference from Texts) dataset, comprising 7,401 questions and 2.4M passages built from high-convergence human- and machine-authored hints, labeled across three relevance levels using LLM-based answerability and human verification. Through comprehensive evaluation of retrievers, rerankers, and LLM-based readers, we show that methods effective on traditional QA tasks struggle in inferential QA: retrievers underperform, rerankers offer limited gains, and fine-tuning provides inconsistent improvements. Even reasoning-oriented LLMs fail to outperform smaller general-purpose models. These findings reveal that current QA pipelines are not yet ready for inference-based reasoning. Inferential QA thus establishes a new class of QA tasks that move towards understanding and reasoning from indirect textual evidence.
zh
[NLP-129] Supervised Fine-Tuning Needs to Unlock the Potential of Token Priority
【Quick Read】: This position paper targets a fundamental bottleneck of generative AI on the path from fitting empirical data to achieving true human utility: a granularity mismatch in which fine-grained autoregressive generation is supervised by coarse or uniform signals. The key proposal is Token Priority as the essential bridge, reformalizing Supervised Fine-Tuning (SFT) not as simple optimization but as a precise distribution-reshaping process that aligns the raw data distribution with an ideal alignment manifold, enabling more accurate model alignment and better performance. Recent advances are organized into two regimes, Positive Priority for noise filtration and Signed Priority for unlearning toxic modes, and the paper reviews progress and limitations, identifies key challenges, and suggests future directions (a token-weighted SFT loss is sketched after this entry).
Link: https://arxiv.org/abs/2602.01227
Authors: Zhanming Shen,Zeyu Qin,Jiaqi Hu,Wentao Ye,Hao Chen,Xiaomeng Hu,Haokai Xu,Gang Chen,Yi R. Fung,Haobo Wang
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Note:
Abstract:The transition from fitting empirical data to achieving true human utility is fundamentally constrained by a granularity mismatch, where fine-grained autoregressive generation is often supervised by coarse or uniform signals. This position paper advocates Token Priority as the essential bridge, formalizing Supervised Fine-Tuning (SFT) not as simple optimization but as a precise distribution reshaping process that aligns raw data with the ideal alignment manifold. We analyze recent breakthroughs through this unified lens, categorizing them into two distinct regimes: Positive Priority for noise filtration and Signed Priority for toxic modes unlearning. We revisit existing progress and limitations, identify key challenges, and suggest directions for future research.
zh
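As one concrete reading of "token priority" (an illustration, not a method from the paper), here is a per-token-weighted SFT loss in which positive weights emphasize tokens, zero weights filter noise, and negative weights push probability down for tokens to be unlearned.

```python
import torch
import torch.nn.functional as F

def priority_sft_loss(logits, targets, priority):
    """Token-priority-weighted SFT loss (illustrative form).

    logits:   [batch, seq, vocab]
    targets:  [batch, seq] token ids
    priority: [batch, seq] per-token weights; > 0 emphasizes a token,
              0 drops it (noise filtration), < 0 pushes its probability
              down (signed priority / unlearning a toxic continuation).
    """
    nll = F.cross_entropy(
        logits.flatten(0, 1), targets.flatten(), reduction="none"
    ).view_as(targets).float()
    return (priority * nll).sum() / priority.abs().sum().clamp_min(1e-8)

logits = torch.randn(2, 6, 50, requires_grad=True)
targets = torch.randint(0, 50, (2, 6))
w = torch.ones(2, 6)
w[:, -1] = -0.5          # unlearn the final token of each sequence
priority_sft_loss(logits, targets, w).backward()
print(logits.grad.abs().sum().item())
```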
[NLP-130] SimpleGPT: Improving GPT via A Simple Normalization Strategy DATE
【Quick Read】: This paper addresses the optimization difficulties Transformers face due to unstable activation scales, which inflate the spectral norm of the Hessian and cap the maximum tolerable learning rate. The key is SimpleNorm, a simple normalization strategy that explicitly stabilizes intermediate activation scales, thereby reducing the spectral norm of the Hessian of the loss with respect to network activations and permitting larger, more stable training learning rates. Experiments at multiple parameter scales (1B to 8B) show that SimpleGPT, the SimpleNorm-based model, delivers markedly better optimization stability and performance than established baselines (a placeholder scale-stabilizing norm is sketched after this entry).
Link: https://arxiv.org/abs/2602.01212
Authors: Marco Chen,Xianbiao Qi,Yelin He,Jiaquan Ye,Rong Xiao
Institutions: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Note: We propose SimpleGPT, a simple yet effective GPT model, and provide theoretical insights into its mathematical foundations. We validate our theoretical findings through extensive experiments on large GPT models at parameter scales 1B, 1.4B, 7B and 8B
Abstract:In this work, we revisit Transformer optimization through the lens of second-order geometry and establish a direct connection between architectural design, activation scale, the Hessian matrix, and the maximum tolerable learning rate. We introduce a simple normalization strategy, termed SimpleNorm, which stabilizes intermediate activation scales by construction. Then, by analyzing the Hessian of the loss with respect to network activations, we theoretically show that SimpleNorm significantly reduces the spectral norm of the Hessian, thereby permitting larger stable learning rates. We validate our theoretical findings through extensive experiments on large GPT models at parameter scales 1B, 1.4B, 7B and 8B. Empirically, SimpleGPT, our SimpleNorm-based network, tolerates learning rates 3×-10× larger than standard convention, consistently demonstrates strong optimization stability, and achieves substantially better performance than well-established baselines. Specifically, when training 7B-scale models for 60K steps, SimpleGPT achieves a training loss that is 0.08 lower than that of LLaMA2 with QKNorm, reducing the loss from 2.290 to 2.208. Our source code will be released at this https URL.
zh
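This digest does not give SimpleNorm's exact formula, so the sketch below is only a generic stand-in for "stabilizing activation scale by construction": it rescales each activation vector to unit RMS, pinning the forward scale regardless of input magnitude. Treat the layer as a placeholder, not the authors' design.

```python
import torch
from torch import nn

class UnitRMSNorm(nn.Module):
    """Placeholder scale-stabilizing norm: forces RMS(x) = 1 per vector.

    With a fixed activation scale, the curvature seen by the optimizer
    stops growing with activation magnitude, which is the kind of
    Hessian-spectral-norm control the paper argues enables larger
    learning rates.
    """
    def __init__(self, eps=1e-6):
        super().__init__()
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).sqrt()
        return x / (rms + self.eps)

norm = UnitRMSNorm()
x = 1000.0 * torch.randn(2, 8, 64)      # wildly scaled activations
y = norm(x)
print(y.pow(2).mean(-1).sqrt())          # ~1.0 everywhere, by construction
```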
[NLP-131] Chronos: Learning Temporal Dynamics of Reasoning Chains for Test-Time Scaling
【Quick Read】: This paper addresses a limitation of existing test-time scaling (TTS) methods for improving LLM reasoning: majority voting and heuristic token-level scoring treat all reasoning traces or tokens equally, making them sensitive to trajectory-quality variation and localized logical errors. The key is Chronos, a lightweight, plug-and-play chronological reasoning scorer that models each reasoning trajectory as a time series, learns trajectory features such as token probabilities, assigns quality scores accordingly, and decides by weighted voting, yielding clear accuracy gains at negligible compute overhead (a weighted-voting sketch follows this entry).
Link: https://arxiv.org/abs/2602.01208
Authors: Kai Zhang,Jiayi Liao,Chengpeng Li,Ziyuan Xie,Sihang Li,Xiang Wang
Institutions: University of Science and Technology of China
Categories: Computation and Language (cs.CL)
Note:
Abstract:Test-Time Scaling (TTS) has emerged as an effective paradigm for improving the reasoning performance of large language models (LLMs). However, existing methods, most notably majority voting and heuristic token-level scoring, treat reasoning traces or tokens equally, thereby being susceptible to substantial variations in trajectory quality and localized logical failures. In this work, we introduce Chronos, a lightweight and plug-and-play chronological reasoning scorer that models each trajectory as a time series. Specifically, Chronos learns to capture trajectory features of token probabilities, assigns quality scores accordingly, and employs a weighted voting mechanism. Extensive evaluations on both in-domain and out-of-domain benchmarks demonstrate that Chronos consistently delivers substantial gains across a variety of models, with negligible computational overhead. Notably, Chronos@128 achieves relative improvements of 34.21% over Pass@1 and 22.70% over Maj@128 on HMMT25 using Qwen3-4B-Thinking-2507, highlighting its effectiveness.
zh
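Score-weighted voting over sampled answers, with hand-written quality scores standing in for Chronos's learned time-series scorer:

```python
from collections import defaultdict

def weighted_vote(answers, scores):
    """Aggregate sampled final answers with per-trajectory quality weights.

    answers: list of final answers, one per sampled reasoning trace
    scores:  list of trajectory-quality scores (stand-in numbers here;
             Chronos would produce these from token-probability series)
    """
    tally = defaultdict(float)
    for ans, s in zip(answers, scores):
        tally[ans] += s
    return max(tally, key=tally.get)

answers = ["42", "41", "42", "7", "41"]
scores  = [0.9, 0.2, 0.8, 0.1, 0.3]     # two high-quality traces say "42"
print(weighted_vote(answers, scores))    # "42", despite the 2-2 raw tie with "41"
```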
[NLP-132] ASTER: Agentic Scaling with Tool-integrated Extended Reasoning
【Quick Read】: This paper addresses the interaction collapse that arises when reinforcement learning (RL) is used to improve tool-integrated reasoning (TIR) in LLMs: during training, models degenerate into heavy internal reasoning without sustained multi-turn tool calls. The key innovation of ASTER (Agentic Scaling with Tool-integrated Extended Reasoning) is a targeted cold-start stage built from interaction-dense trajectories, which establishes a robust agentic behavioral prior and markedly improves exploration efficiency and generalization in subsequent RL training; just 4K such high-quality cold-start trajectories suffice for the best downstream results, with ASTER-4B reaching 90.0% on the AIME 2025 math benchmark, surpassing leading open-source frontier models including DeepSeek-V3.2-Exp.
Link: https://arxiv.org/abs/2602.01204
Authors: Xuqin Zhang,Quan He,Zhenrui Zheng,Zongzhang Zhang,Xu He,Dong Li
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Note:
Abstract:Reinforcement learning (RL) has emerged as a dominant paradigm for eliciting long-horizon reasoning in Large Language Models (LLMs). However, scaling Tool-Integrated Reasoning (TIR) via RL remains challenging due to interaction collapse: a pathological state where models fail to sustain multi-turn tool usage, instead degenerating into heavy internal reasoning with only trivial, post-hoc code verification. We systematically study three questions: (i) how cold-start SFT induces an agentic, tool-using behavioral prior, (ii) how the interaction density of cold-start trajectories shapes exploration and downstream RL outcomes, and (iii) how the RL interaction budget affects learning dynamics and generalization under varying inference-time budgets. We then introduce ASTER (Agentic Scaling with Tool-integrated Extended Reasoning), a framework that circumvents this collapse through a targeted cold-start strategy prioritizing interaction-dense trajectories. We find that a small expert cold-start set of just 4K interaction-dense trajectories yields the strongest downstream performance, establishing a robust prior that enables superior exploration during extended RL training. Extensive evaluations demonstrate that ASTER-4B achieves state-of-the-art results on competitive mathematical benchmarks, reaching 90.0% on AIME 2025, surpassing leading frontier open-source models, including DeepSeek-V3.2-Exp.
zh
[NLP-133] Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse
【Quick Read】: This paper examines the attention-sink problem in LLMs, where models over-attend to the first token of a sequence, leaving some attention heads chronically underused and causing head collapse: only a fixed subset of heads contributes to generation, weakening expressiveness. The key insight, supported both theoretically and empirically, is that the sink in Vanilla Attention and Sink Attention naturally constructs a Mixture-of-Experts (MoE) mechanism within attention layers, which explains head collapse. Building on this, the authors propose a sink-aware training algorithm with an auxiliary load-balancing loss for attention layers that explicitly evens out per-head activation, mitigating head collapse and improving performance across Vanilla, Sink, and Gated Attention (a head load-balancing loss is sketched after this entry).
Link: https://arxiv.org/abs/2602.01203
Authors: Zizhuo Fu,Wenxuan Zeng,Runsheng Wang,Meng Li
Institutions: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Note:
Abstract:Large Language Models (LLMs) often assign disproportionate attention to the first token, a phenomenon known as the attention sink. Several recent approaches aim to address this issue, including Sink Attention in GPT-OSS and Gated Attention in Qwen3-Next. However, a comprehensive analysis of the relationship among these attention mechanisms is lacking. In this work, we provide both theoretical and empirical evidence demonstrating that the sink in Vanilla Attention and Sink Attention naturally construct a Mixture-of-Experts (MoE) mechanism within attention layers. This insight explains the head collapse phenomenon observed in prior work, where only a fixed subset of attention heads contributes to generation. To mitigate head collapse, we propose a sink-aware training algorithm with an auxiliary load balancing loss designed for attention layers. Extensive experiments show that our method achieves effective head load balancing and improves model performance across Vanilla Attention, Sink Attention, and Gated Attention. We hope this study offers a new perspective on attention mechanisms and encourages further exploration of the inherent MoE structure within attention layers.
zh
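A minimal MoE-style load-balancing auxiliary loss over attention heads; the per-head usage statistic below (attention mass not spent on the sink token) and the squared-fraction penalty are assumptions for illustration, not the paper's loss.

```python
import torch

def head_balance_loss(attn, sink_index=0):
    """Auxiliary load-balancing loss over attention heads (illustrative).

    attn: [batch, heads, q_len, k_len] attention probabilities.
    Each head's 'useful' mass (everything not on the sink token) is
    normalized into a distribution over heads, then penalized for
    deviating from uniform: the value is 1 when perfectly balanced
    and grows as load concentrates on a few heads.
    """
    useful = 1.0 - attn[..., sink_index]           # [batch, heads, q_len]
    load = useful.mean(dim=(0, 2))                 # average usage per head
    frac = load / load.sum().clamp_min(1e-8)       # distribution over heads
    n = frac.numel()
    return n * (frac ** 2).sum()

attn = torch.softmax(torch.randn(2, 8, 16, 16), dim=-1)
print(head_balance_loss(attn).item())              # near 1.0 for balanced heads
```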
[NLP-134] Bridging Lexical Ambiguity and Vision: A Mini Review on Visual Word Sense Disambiguation
【Quick Read】: This mini review addresses lexical ambiguity in vision-language tasks: the same word carries different meanings in different contexts, and text-only word sense disambiguation (WSD) struggles to pin down the correct sense. The focus is Visual Word Sense Disambiguation (VWSD), whose key is to use visual cues to strengthen sense judgments, exploiting multimodal information to identify meanings precisely from minimal text input. The review traces the field from early feature-fusion methods to frameworks combining contrastive models such as CLIP, diffusion-based text-to-image generation, and LLM reasoning, covering studies from 2016 to 2025 and strategies such as prompt engineering, fine-tuning, and multilingual adaptation; experiments show that fine-tuned CLIP models and LLM-enhanced systems improve Mean Reciprocal Rank (MRR) by up to 6-8% over zero-shot baselines, while limited context, bias toward common senses, scarce multilingual datasets, and evaluation gaps remain open challenges.
Link: https://arxiv.org/abs/2602.01193
Authors: Shashini Nilukshi,Deshan Sumanathilaka
Institutions: Informatics Institute of Technology
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Note: 2 figures, 2 Tables, Accepted at IEEE TIC 2026
Abstract:This paper offers a mini review of Visual Word Sense Disambiguation (VWSD), which is a multimodal extension of traditional Word Sense Disambiguation (WSD). VWSD helps tackle lexical ambiguity in vision-language tasks. While conventional WSD depends only on text and lexical resources, VWSD uses visual cues to find the right meaning of ambiguous words with minimal text input. The review looks at developments from early multimodal fusion methods to new frameworks that use contrastive models like CLIP, diffusion-based text-to-image generation, and large language model (LLM) support. Studies from 2016 to 2025 are examined to show the growth of VWSD through feature-based, graph-based, and contrastive embedding techniques. It focuses on prompt engineering, fine-tuning, and adapting to multiple languages. Quantitative results show that CLIP-based fine-tuned models and LLM-enhanced VWSD systems consistently perform better than zero-shot baselines, achieving gains of up to 6-8% in Mean Reciprocal Rank (MRR). However, challenges still exist, such as limitations in context, model bias toward common meanings, a lack of multilingual datasets, and the need for better evaluation frameworks. The analysis highlights the growing overlap of CLIP alignment, diffusion generation, and LLM reasoning as the future path for strong, context-aware, and multilingual disambiguation systems.
zh
[NLP-135] ASP-Bench: From Natural Language to Logic Programs
【Quick Read】: This paper addresses the challenging task of automatically translating natural-language specifications into logic programs, specifically Answer Set Programming (ASP), to raise the level of automation in natural-language-to-logic modeling for neurosymbolic engineering. The key contribution is ASP-Bench, a benchmark of 128 natural-language problem instances (64 base problems with easy and hard variants) that systematically covers core ASP features such as choice rules, aggregates, and optimization, with reference validators ensuring solution correctness. An agentic approach based on the ReAct (Reason and Act) framework performs iterative refinement with solver feedback, fully saturating the benchmark and showing that feedback-driven refinement is a reliable way to model natural language in ASP; analysis across agent runs reveals the multidimensional factors (optimization, temporal reasoning, default logic, and others) that determine a problem's modeling hardness.
Link: https://arxiv.org/abs/2602.01171
Authors: Stefan Szeider
Institutions: TU Wien
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Note:
Abstract:Automating the translation of natural-language specifications into logic programs is a challenging task that affects neurosymbolic engineering. We present ASP-Bench, a benchmark comprising 128 natural language problem instances, 64 base problems with easy and hard variants. It evaluates systems that translate natural-language problems into Answer Set Programs (ASPs), a prominent form of logic programming. It provides systematic coverage of ASP features, including choice rules, aggregates, and optimization. Each problem includes reference validators that check whether solutions satisfy the problem specification. We characterize problems along seven largely independent reasoning aspects (optimization, temporal reasoning, default logic, resource allocation, recursion, spatial reasoning, and quantitative complexity), providing a multidimensional view of modeling difficulty. We test the benchmark using an agentic approach based on the ReAct (Reason and Act) framework, which achieves full saturation, demonstrating that feedback-driven iterative refinement with solver feedback provides a reliable and robust approach for modeling natural language in ASP. Our analysis across multiple agent runs enables us to gain insights into what determines a problem's modeling hardness.
zh
[NLP-136] EmoAra: Emotion-Preserving English Speech Transcription and Cross-Lingual Translation with Arabic Text-to-Speech
【Quick Read】: This paper addresses the loss of emotional information in cross-lingual spoken communication, especially in banking customer service, where emotional context affects service quality. The key of the solution is EmoAra, an end-to-end emotion-preserving pipeline with four components: CNN-based speech emotion recognition, Whisper for English speech-to-text (ASR), a fine-tuned MarianMT model for English-to-Arabic machine translation (MT), and MMS-TTS-Ara for Arabic speech synthesis (TTS). Working together, the modules deliver high-quality cross-lingual spoken output while retaining the emotional character of the original speech.
Link: https://arxiv.org/abs/2602.01170
Authors: Besher Hassan,Ibrahim Alsarraj,Musaab Hasan,Yousef Melhim,Shahem Fadi,Shahem Sultan
Institutions: MBZUAI (Mohamed Bin Zayed University of Artificial Intelligence)
Categories: Computation and Language (cs.CL)
Note: 10 pages, 3 figures
Abstract:This work presents EmoAra, an end-to-end emotion-preserving pipeline for cross-lingual spoken communication, motivated by banking customer service where emotional context affects service quality. EmoAra integrates Speech Emotion Recognition, Automatic Speech Recognition, Machine Translation, and Text-to-Speech to process English speech and deliver an Arabic spoken output while retaining emotional nuance. The system uses a CNN-based emotion classifier, Whisper for English transcription, a fine-tuned MarianMT model for English-to-Arabic translation, and MMS-TTS-Ara for Arabic speech synthesis. Experiments report an F1-score of 94% for emotion classification, translation performance of BLEU 56 and BERTScore F1 88.7%, and an average human evaluation score of 81% on banking-domain translations. The implementation and resources are available at the accompanying GitHub repository.
zh
[NLP-137] PedagoSense: A Pedology Grounded LLM System for Pedagogical Strategy Detection and Contextual Response Generation in Learning Dialogues
【Quick Read】: This paper addresses how to improve interaction quality in dialogue-based learning, centered on automatically detecting and recommending effective pedagogical strategies in tutor-student conversations. The key of PedagoSense, a system grounded in pedagogical theory, is a two-stage strategy classifier: a binary model first detects whether a pedagogical strategy is present, then fine-grained classification identifies the specific strategy type; in parallel, the system recommends a suitable strategy from the dialogue context and uses a large language model (LLM) to generate a response consistent with that strategy. Experiments show strong pedagogical-strategy detection, with clear gains from data augmentation, validating the approach for building more adaptive educational technologies.
Link: https://arxiv.org/abs/2602.01169
Authors: Shahem Sultan,Shahem Fadi,Yousef Melhim,Ibrahim Alsarraj,Besher Hassan
Institutions: Al Andlus University; Ajman University; Mohamed bin Zayed University of Artificial Intelligence
Categories: Computation and Language (cs.CL)
Note: 8 pages, 5 figures
Abstract:This paper addresses the challenge of improving interaction quality in dialogue based learning by detecting and recommending effective pedagogical strategies in tutor student conversations. We introduce PedagoSense, a pedology grounded system that combines a two stage strategy classifier with large language model generation. The system first detects whether a pedagogical strategy is present using a binary classifier, then performs fine grained classification to identify the specific strategy. In parallel, it recommends an appropriate strategy from the dialogue context and uses an LLM to generate a response aligned with that strategy. We evaluate on human annotated tutor student dialogues, augmented with additional non pedagogical conversations for the binary task. Results show high performance for pedagogical strategy detection and consistent gains when using data augmentation, while analysis highlights where fine grained classes remain challenging. Overall, PedagoSense bridges pedagogical theory and practical LLM based response generation for more adaptive educational technologies.
zh
[NLP-138] Typologically-Informed Candidate Reranking for LLM-based Translation into Low-Resource Languages
【Quick Read】: This paper addresses the systematic bias that large language models (LLMs) exhibit when translating into low-resource languages due to structural preferences for high-resource languages: models tend to follow the syntax and morphology of dominant languages such as English, producing typologically non-conformant output. The key solution is a framework requiring no parallel data and no retraining: the Universal Metalinguistic Framework (UMF) models each language as a structured profile across 16 typological dimensions, and intervention mechanisms perform linguistic disambiguation during generation and typological compliance scoring during candidate selection, precisely adapting output to the structural properties of the target language.
Link: https://arxiv.org/abs/2602.01162
Authors: Nipuna Abeykoon,Ashen Weerathunga,Pubudu Wijesinghe,Parameswari Krishnamurthy
Institutions: ZWAG AI Ltd
Categories: Computation and Language (cs.CL)
Note:
Abstract:Large language models trained predominantly on high-resource languages exhibit systematic biases toward dominant typological patterns, leading to structural non-conformance when translating into typologically divergent low-resource languages. We present a framework that leverages linguistic typology to improve translation quality without parallel training data or model retraining. The framework consists of two components: the Universal Metalinguistic Framework (UMF), which represents languages as structured profiles across 16 typological dimensions with divergence-weighted scoring, and the Computational Engine, which operates through linguistic disambiguation during generation and typological compliance scoring during selection. Evaluation across nine language pairs demonstrates intervention rates strongly correlating with typological distance from English. In experiments on 341 English sentences each having different morphological and syntactic phenomena, the framework shows an intervention precision of 48.16% for conservatively treated languages, 28.15% for morphologically dense languages, and 86.26% for structurally profiled languages. The framework requires no parallel training data and operates with any LLM capable of producing multiple candidate outputs, enabling practical deployment for under-resourced languages.
zh
[NLP-139] Beyond Training for Cultural Awareness: The Role of Dataset Linguistic Structure in Large Language Models
【Quick Read】: This paper examines how the linguistic properties of datasets used for cultural-adaptation fine-tuning affect the cultural performance of large language models (LLMs), motivated by cultural-misalignment concerns in global deployment. The key is a dataset-centric approach: computing lightweight linguistic, semantic, and structural metrics for Arabic, Chinese, and Japanese fine-tuning datasets and applying principal component analysis (PCA) independently within each language to extract interpretable, culture-relevant axes (semantic coherence, surface lexical and syntactic diversity, lexical or structural richness). The study finds that the predictive power of these components differs markedly across model families (LLaMA, Mistral, DeepSeek): lexical-oriented components (PC3) are the most robust, yielding consistent gains across models and benchmarks, whereas emphasizing semantic or diversity extremes (PC1-PC2) is often unhelpful or even harmful.
Link: https://arxiv.org/abs/2602.01161
Authors: Reem I. Masoud,Chen Feng,Shunta Asano,Saied Alshahrani,Philip Colin Treleaven,Miguel R. D. Rodrigues
Institutions: University College London; Queen's University Belfast; The University of Tokyo; University of Bisha; King Abdulaziz University; AI Centre, University College London
Categories: Computation and Language (cs.CL)
Note:
Abstract:The global deployment of large language models (LLMs) has raised concerns about cultural misalignment, yet the linguistic properties of fine-tuning datasets used for cultural adaptation remain poorly understood. We adopt a dataset-centric view of cultural alignment and ask which linguistic properties of fine-tuning data are associated with cultural performance, whether these properties are predictive prior to training, and how these effects vary across models. We compute lightweight linguistic, semantic, and structural metrics for Arabic, Chinese, and Japanese datasets and apply principal component analysis separately within each language. This design ensures that the resulting components capture variation among datasets written in the same language rather than differences between languages. The resulting components correspond to broadly interpretable axes related to semantic coherence, surface-level lexical and syntactic diversity, and lexical or structural richness, though their composition varies across languages. We fine-tune three major LLM families (LLaMA, Mistral, DeepSeek) and evaluate them on benchmarks of cultural knowledge, values, and norms. While PCA components correlate with downstream performance, these associations are strongly model-dependent. Through controlled subset interventions, we show that lexical-oriented components (PC3) are the most robust, yielding more consistent performance across models and benchmarks, whereas emphasizing semantic or diversity extremes (PC1-PC2) is often neutral or harmful.
zh
[NLP-140] Don't Judge a Book by its Cover: Testing LLMs' Robustness Under Logical Obfuscation
【Quick Read】: This paper addresses the sharp performance drop of large language models (LLMs) on logically equivalent but obfuscated problem forms, i.e., the models' dependence on the surface form of inputs. The key contributions are Logifus, a structure-preserving logical obfuscation framework for generating semantically invariant but formally disguised test items, and LogiQAte, a first-of-its-kind diagnostic benchmark built with it, covering four reasoning tasks: first-order logic entailment (Obfus FOL), kinship reasoning (Obfus Blood Relation), number-series pattern induction (Obfus Number Series), and direction-sense reasoning (Obfus Direction Sense). Evaluating six state-of-the-art models reveals severe zero-shot degradation under obfuscation, with average drops of 47% for GPT-4o, 27% for GPT-5, and 22% for the reasoning model o4-mini, showing that current LLMs parse questions without deep semantic understanding and underscoring the urgency of building models that genuinely preserve meaning beyond surface form.
Link: https://arxiv.org/abs/2602.01132
Authors: Abhilekh Borah,Shubhra Ghosh,Kedar Joshi,Aditya Kumar Guru,Kripabandhu Ghosh
Institutions: Manipal University Jaipur; Indian Institute of Technology, Patna; Indian Institute of Science Education and Research, Kolkata
Categories: Computation and Language (cs.CL)
Note: 19 pages, 6 figures
Abstract:Tasks such as solving arithmetic equations, evaluating truth tables, and completing syllogisms are handled well by large language models (LLMs) in their standard form, but they often fail when the same problems are posed in logically equivalent yet obfuscated formats. To study this vulnerability, we introduce Logifus, a structure-preserving logical obfuscation framework, and, utilizing this, we present LogiQAte, a first-of-its-kind diagnostic benchmark with 1,108 questions across four reasoning tasks: (i) Obfus FOL (first-order logic entailment under equivalence-preserving rewrites), (ii) Obfus Blood Relation (family-graph entailment under indirect relational chains), (iii) Obfus Number Series (pattern induction under symbolic substitutions), and (iv) Obfus Direction Sense (navigation reasoning under altered directions and reference frames). Across all the tasks, evaluating six state-of-the-art models, we find that obfuscation severely degrades zero-shot performance, with performance dropping on average by 47% for GPT-4o, 27% for GPT-5, and 22% for reasoning model, o4-mini. Our findings reveal that current LLMs parse questions without deep understanding, highlighting the urgency of building models that genuinely comprehend and preserve meaning beyond surface form.
zh
[NLP-141] Long-range Modeling and Processing of Multimodal Event Sequences
【Quick Read】: This paper addresses two limitations of existing temporal point processes (TPPs) on multimodal data: difficulty generating rich, coherent multimodal content, and the sharp growth in sequence length when modalities such as vision are added, which hinders attention-based models' grasp of long-range dependencies. The key is an adaptive sequence compression mechanism based on temporal similarity that shortens sequences while preserving essential event patterns, combined with a two-stage paradigm of pre-training on compressed sequences followed by supervised fine-tuning for downstream tasks, effectively improving both text-generation quality and event-dynamics modeling (a toy temporal-similarity compressor follows this entry).
Link: https://arxiv.org/abs/2602.01125
Authors: Jichu Li,Yilun Zhong,Zhiting Li,Feng Zhou,Quyu Kong
Institutions: Renmin University of China; Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Note:
Abstract:Temporal point processes (TPPs) have emerged as powerful tools for modeling asynchronous event sequences. While recent advances have extended TPPs to handle textual information, existing approaches are limited in their ability to generate rich, multimodal content and reason about event dynamics. A key challenge is that incorporating multimodal data dramatically increases sequence length, hindering the ability of attention-based models to generate coherent, long-form textual descriptions that require long-range understanding. In this paper, we propose a novel framework that extends LLM-based TPPs to the visual modality, positioning text generation as a core capability alongside time and type prediction. Our approach addresses the long-context problem through an adaptive sequence compression mechanism based on temporal similarity, which reduces sequence length while preserving essential patterns. We employ a two-stage paradigm of pre-training on compressed sequences followed by supervised fine-tuning for downstream tasks. Extensive experiments, including on the challenging DanmakuTPP-QA benchmark, demonstrate that our method outperforms state-of-the-art baselines in both predictive accuracy and the quality of its generated textual analyses.
zh
[NLP-142] MarkovScale: Towards Optimal Sequential Scaling at Inference Time
【速读】: 该论文旨在解决顺序缩放(Sequential Scaling)在大语言模型(Large Language Models, LLMs)推理阶段性能提升有限且机制不明确的问题,其核心挑战在于现有方法多依赖启发式策略,缺乏理论最优性边界。解决方案的关键在于提出一个基于两状态马尔可夫过程(two-state Markov process)的原理性框架,该框架揭示了顺序缩放的内在特性,并推导出精度提升的明确条件及理论上的上界、中性点和下界。在此基础上,作者开发了MarkovScale系统,通过应用这些最优性准则,在准确性和效率之间实现理论指导下的平衡,显著优于当前主流的并行与顺序缩放方法。
链接: https://arxiv.org/abs/2602.01120
作者: Youkang Wang,Jian Wang,Rubing Chen,Tianyi Zeng,Xiao-Yong Wei,Qing Li
机构: The Hong Kong Polytechnic University (香港理工大学); Sichuan University (四川大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 12 pages
Abstract:Sequential scaling is a prominent inference-time scaling paradigm, yet its performance improvements are typically modest and not well understood, largely due to the prevalence of heuristic, non-principled approaches that obscure clear optimality bounds. To address this, we propose a principled framework that models sequential scaling as a two-state Markov process. This approach reveals the underlying properties of sequential scaling and yields closed-form solutions for essential aspects, such as the specific conditions under which accuracy is improved and the theoretical upper, neutral, and lower performance bounds. Leveraging this formulation, we develop MarkovScale, a practical system that applies these optimality criteria to achieve a theoretically grounded balance between accuracy and efficiency. Comprehensive experiments across 3 backbone LLMs, 5 benchmarks, and over 20 configurations show that MarkovScale consistently outperforms state-of-the-art parallel and sequential scaling methods, representing a significant step toward optimal and resource-efficient inference in LLMs. The source code will be open upon acceptance at https://open-upon-acceptance.
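摘要所述“两状态马尔可夫过程”可用一个小算例说明:设每轮顺序修订以概率 p 将错误答案改对、以概率 q 将正确答案改错,则正确率按 a' = a(1-q) + (1-a)p 迭代,收敛到 p/(p+q);仅当初始正确率低于该值时,增加修订轮数才有收益。以下数值草图基于这一假设性记号,并非论文的完整推导。

```python
# 两状态马尔可夫过程示意(p、q 为假设的转移概率,记号非论文原文)
def accuracy_after(a0, p, q, rounds):
    """a0: 初始正确率;每轮以概率 p 把错改对、以概率 q 把对改错;返回 rounds 轮后的正确率。"""
    a = a0
    for _ in range(rounds):
        a = a * (1 - q) + (1 - a) * p
    return a

a0, p, q = 0.5, 0.3, 0.1
print(accuracy_after(a0, p, q, 10))   # 逐步逼近理论上界
print(p / (p + q))                    # 不动点 p/(p+q) = 0.75;a0 低于它时顺序缩放才有提升
```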
zh
[NLP-143] Tendem: A Hybrid AI-Human Platform
【速读】: 该论文旨在解决当前AI代理(Agent)在复杂任务中因缺乏可靠性和领域专业知识而导致的性能瓶颈问题,以及纯人工执行效率低、成本高的局限性。其解决方案的关键在于构建一个混合系统——Tendem,该系统通过将AI用于处理结构化、重复性任务,同时引入人类专家在模型失效或需验证结果时介入,形成人机协同的工作流;此外,所有输出均经过严格的质量审查流程,确保交付成果的高质量与可靠性。这种设计不仅提升了任务完成的准确率和时效性,还保持了接近纯人工操作的成本水平,展现出显著的实用价值。
链接: https://arxiv.org/abs/2602.01119
作者: Konstantin Chernyshev,Ekaterina Artemova,Viacheslav Zhukov,Maksim Nerush,Mariia Fedorova,Iryna Repik,Olga Shapovalova,Aleksey Sukhorosov,Vladimir Dobrovolskii,Natalia Mikhailova,Sergei Tilga
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Tendem is a hybrid system where AI handles structured, repeatable work and Human Experts step in when the models fail or to verify results. Each result undergoes a comprehensive quality review before delivery to the Client. To assess Tendem’s performance, we conducted a series of in-house evaluations on 94 real-world tasks, comparing it with AI-only agents and human-only workflows carried out by Upwork freelancers. The results show that Tendem consistently delivers higher-quality outputs with faster turnaround times. At the same time, its operational costs remain comparable to human-only execution. On third-party agentic benchmarks, Tendem’s AI Agent (operating autonomously, without human involvement) performs near state-of-the-art on web browsing and tool-use tasks while demonstrating strong results in frontier domain knowledge and reasoning.
zh
[NLP-144] Logic-Oriented Retriever Enhancement via Contrastive Learning ICASSP2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在知识密集型任务中表现不佳的问题,其核心挑战在于传统检索器容易因表面相似性过拟合,难以处理涉及复杂逻辑关系的查询。解决方案的关键在于提出LORE(Logic ORiented Retriever Enhancement),通过引入细粒度对比学习(fine-grained contrastive learning)激活模型表征中固有的逻辑分析能力,引导嵌入空间向与逻辑结构一致的证据靠拢,而非仅依赖浅层语义相似性。该方法无需外部监督、额外资源或预检索分析,且保持索引兼容性,在提升检索效用和下游生成质量的同时维持高效性。
链接: https://arxiv.org/abs/2602.01116
作者: Wenxuan Zhang,Yuan-Hao Jiang,Changyong Qi,Rui Jia,Yonghe Wu
机构: 未知
类目: Computation and Language (cs.CL)
备注: accepted by icassp 2026
Abstract:Large language models (LLMs) struggle in knowledge-intensive tasks, as retrievers often overfit to surface similarity and fail on queries involving complex logical relations. The capacity for logical analysis is inherent in model representations but remains underutilized in standard training. LORE (Logic ORiented Retriever Enhancement) introduces fine-grained contrastive learning to activate this latent capacity, guiding embeddings toward evidence aligned with logical structure rather than shallow similarity. LORE requires no external supervision, resources, or pre-retrieval analysis, remains index-compatible, and consistently improves retrieval utility and downstream generation while maintaining efficiency. The datasets and code are publicly available at this https URL.
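摘要提到的“细粒度对比学习”通常可落地为 InfoNCE 式损失:把与查询逻辑结构一致的证据作为正例、仅表面相似的文段作为负例,最大化正例的相对相似度。下面用 numpy 给出一个假设性草图,向量、温度参数与正负例构造均为示意,并非 LORE 的官方实现。

```python
import numpy as np

def info_nce(q, pos, negs, tau=0.07):
    """对比学习损失示意:q 为查询向量,pos 为逻辑一致的正例证据,negs 为表面相似的负例。"""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    logits = np.array([cos(q, pos)] + [cos(q, n) for n in negs]) / tau
    logits -= logits.max()                        # 数值稳定
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                      # 最小化该损失即拉近正例、推远负例

rng = np.random.default_rng(0)
q, pos = rng.normal(size=64), rng.normal(size=64)
negs = [rng.normal(size=64) for _ in range(7)]
print(info_nce(q, pos, negs))
```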
zh
[NLP-145] What If We Allocate Test-Time Compute Adaptively?
【速读】: 该论文旨在解决传统测试时计算扩展(test-time compute scaling)方法在推理过程中存在的效率低下和性能瓶颈问题,即均匀分配计算资源、采用固定采样策略且仅在重排序阶段进行验证,导致难以适应复杂任务的动态需求。其解决方案的关键在于提出一种基于验证器引导的自适应框架,将推理建模为迭代轨迹生成与选择过程:通过引入过程奖励模型(Process Reward Model, PRM)作为统一控制信号,在每轮迭代中利用步骤级PRM评分指导生成过程中的剪枝与扩展,并在多轮迭代间基于轨迹奖励选择最优响应。该机制使计算资源聚焦于高价值推理路径,显著提升推理效率与准确性,尤其在高难度基准如AIME24和AMO-Bench上实现数倍性能提升。
链接: https://arxiv.org/abs/2602.01070
作者: Ahsan Bilal,Ahmed Mohsin,Muhammad Umer,Ali Subhan,Hassan Rizwan,Ayesha Mohsin,Dean Hougen
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Conventional test-time compute scaling allocates inference computation uniformly, uses fixed sampling strategies, and applies verification only for reranking. In contrast, we propose a verifier-guided adaptive framework treating reasoning as iterative trajectory generation and selection. For each problem, the agent runs multiple inference iterations. In each iteration, it optionally produces a high-level plan, selects a set of reasoning tools and a compute strategy together with an exploration parameter, and then generates a candidate reasoning trajectory. A process reward model (PRM) serves as a unified control signal: within each iteration, step-level PRM scores are aggregated to guide pruning and expansion during generation, and across iterations, aggregated trajectory rewards are used to select the final response. Across datasets, our dynamic, PRM-guided approach consistently outperforms direct test-time scaling, yielding large gains on MATH-500 and several-fold improvements on harder benchmarks such as AIME24 and AMO-Bench. We characterize efficiency using theoretical FLOPs and a compute intensity metric penalizing wasted generation and tool overhead, demonstrating that verification-guided allocation concentrates computation on high-utility reasoning paths.
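下面的伪实现勾勒摘要中的控制流:每轮生成一条候选轨迹,步骤级 PRM 分数过低即剪枝终止本轮;迭代之间按轨迹聚合奖励选择最终回答。其中 prm_score 与步骤生成均为占位假设,仅示意“验证器作为统一控制信号”的结构。

```python
import random

def prm_score(step):
    """假设的过程奖励模型接口:对单个推理步骤打分,返回 [0,1]。"""
    return random.random()

def solve(question, iters=4, steps=6, prune_below=0.3):
    best, best_reward = None, float("-inf")
    for _ in range(iters):                         # 多轮迭代,每轮生成一条候选轨迹
        traj, scores = [], []
        for i in range(steps):
            step = f"step-{i} for {question}"      # 占位:实际应调用策略模型逐步生成
            s = prm_score(step)
            if s < prune_below:                    # 迭代内:步骤级 PRM 分数过低则剪枝
                break
            traj.append(step); scores.append(s)
        reward = sum(scores) / max(len(scores), 1) # 轨迹级聚合奖励
        if reward > best_reward:                   # 迭代间:按聚合奖励选最终答案
            best, best_reward = traj, reward
    return best, best_reward

print(solve("2+2=?"))
```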
zh
[NLP-146] From Utterance to Vividity: Training Expressive Subtitle Translation LLM via Adaptive Local Preference Optimization ICLR2026
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在垂直领域翻译任务中表现受限的问题,特别是针对视觉媒体字幕翻译这一复杂场景下,如何构建具备表达力与生动性的定制化翻译LLM。其解决方案的关键在于:首先,构建并发布了一个多方向的字幕平行语料库数据集,以支持更精细的训练需求;其次,提出自适应局部偏好优化(Adaptive Local Preference Optimization, ALPO)方法,实现细粒度的偏好对齐,从而提升翻译质量在多维度评估中的表现。
链接: https://arxiv.org/abs/2602.01068
作者: Chaoqun Cui,Shijing Wang,Liangbin Huang,Qingqing Gu,Zhaolong Huang,Xiao Zeng,Wenji Mao
机构: MAIS, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Hujing Digital Media & Entertainment Group (虎鲸数字媒体与娱乐集团); Geely AI lab (吉利AI实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ICLR 2026
Abstract:The rapid development of Large Language Models (LLMs) has significantly enhanced the general capabilities of machine translation. However, as application scenarios become more complex, the limitations of LLMs in vertical domain translations are gradually becoming apparent. In this study, we focus on how to construct translation LLMs that meet the needs of domain customization. We take visual media subtitle translation as our topic and explore how to train expressive and vivid translation LLMs. We investigated the situations of subtitle translation and other domains of literal and liberal translation, verifying the reliability of LLM as reward model and evaluator for translation. Additionally, to train an expressive translation LLM, we constructed and released a multidirectional subtitle parallel corpus dataset and proposed the Adaptive Local Preference Optimization (ALPO) method to address fine-grained preference alignment. Experimental results demonstrate that ALPO achieves outstanding performance in multidimensional evaluation of translation quality.
zh
[NLP-147] Exploring Knowledge Purification in Multi-Teacher Knowledge Distillation for LLMs
【速读】: 该论文旨在解决多教师大语言模型(Large Language Models, LLMs)知识蒸馏过程中存在的知识冲突(knowledge conflicts)与高资源消耗问题。其核心解决方案是提出“知识净化”(Knowledge Purification)概念,即通过整合多个教师模型的推理逻辑(rationales)生成单一且一致的 rationale,从而减少知识冗余与冲突,提升蒸馏效率与效果。关键创新在于设计了五种从不同角度实现知识净化的方法,并验证了基于路由(router-based)的净化策略在泛化能力上的优势,为高效、稳定的多教师蒸馏提供了可行路径。
链接: https://arxiv.org/abs/2602.01064
作者: Ruihan Jin,Pengpeng Shao,Zhengqi Wen,Jinyang Wu,Mingkuan Feng,Shuo Yang,Chu Yuan Zhang,Jianhua Tao
机构: Tsinghua University (清华大学); Beijing National Research Center for Information Science and Technology (北京信息科学与技术国家研究中心)
类目: Computation and Language (cs.CL)
备注:
Abstract:Knowledge distillation has emerged as a pivotal technique for transferring knowledge from stronger large language models (LLMs) to smaller, more efficient models. However, traditional distillation approaches face challenges related to knowledge conflicts and high resource demands, particularly when leveraging multiple teacher models. In this paper, we introduce the concept of Knowledge Purification, which consolidates the rationales from multiple teacher LLMs into a single rationale, thereby mitigating conflicts and enhancing efficiency. To investigate the effectiveness of knowledge purification, we further propose five purification methods from various perspectives. Our experiments demonstrate that these methods not only improve the performance of the distilled model but also effectively alleviate knowledge conflicts. Moreover, router-based methods exhibit robust generalization capabilities, underscoring the potential of innovative purification techniques in optimizing multi-teacher distillation and facilitating the practical deployment of powerful yet lightweight models.
zh
[NLP-148] Personality Expression Across Contexts: Linguistic and Behavioral Variation in LLM Agents
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在使用相同人格提示(personality prompts)时,为何在不同对话场景下会表现出不一致的语言、行为和情感特征。这一现象引发了对LLM人格表达是否具有内在一致性,还是体现为类似人类的语境敏感适应性的疑问。解决方案的关键在于引入“整体特质理论”(Whole Trait Theory)作为分析框架,发现LLMs并非以固定方式表达人格,而是根据社交互动目标和情感条件灵活调整其人格表现,从而揭示了其人格表达具有语境依赖性而非不一致性,这为设计更符合人类行为逻辑的对话代理提供了理论依据。
链接: https://arxiv.org/abs/2602.01063
作者: Bin Han,Deuksin Kwon,Jonathan Gratch
机构: University of Southern California (南加州大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) can be conditioned with explicit personality prompts, yet their behavioral realization often varies depending on context. This study examines how identical personality prompts lead to distinct linguistic, behavioral, and emotional outcomes across four conversational settings: ice-breaking, negotiation, group decision, and empathy tasks. Results show that contextual cues systematically influence both personality expression and emotional tone, suggesting that the same traits are expressed differently depending on social and affective demands. This raises an important question for LLM-based dialogue agents: whether such variations reflect inconsistency or context-sensitive adaptation akin to human behavior. Viewed through the lens of Whole Trait Theory, these findings highlight that LLMs exhibit context-sensitive rather than fixed personality expression, adapting flexibly to social interaction goals and affective conditions.
zh
[NLP-149] Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)后训练流程中存在的一项关键问题:在标准的监督微调(Supervised Fine-Tuning, SFT)与在线强化学习(Reinforcement Learning, RL)联合训练范式下,SFT阶段通常独立优化以最大化自身性能,但实证发现,即使经过相同的RL训练,从较强SFT检查点初始化的模型反而可能显著劣于从较弱SFT检查点初始化的模型。作者将此现象归因于SFT-RL流水线中的分布不匹配问题——即用于生成离线SFT数据的分布与在线RL阶段通过自我rollout学习到的策略分布存在显著差异。解决方案的关键在于提出PEAR(Policy Evaluation-inspired Algorithm for Offline Learning Loss Re-weighting),一种在SFT阶段引入重要性采样机制对损失函数进行重加权的方法,其核心思想是通过估计RL策略下的概率分布来调整SFT损失权重,从而缩小SFT与后续RL之间的分布差距。PEAR提供三种粒度(token、block、sequence)的变体,可无缝集成至标准SFT目标中且额外训练开销极低,实验表明其能稳定提升下游RL后的推理能力,在AIME2025数学推理任务上最高实现14.6%的通过率提升。
链接: https://arxiv.org/abs/2602.01058
作者: Dylan Zhang,Yufeng Xu,Haojin Wang,Qingzhi Chen,Hao Peng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement learning (RL) stage. However, SFT is often optimized in isolation to maximize SFT performance alone. We show that, after identical RL training, models initialized from stronger SFT checkpoints can significantly underperform those initialized from weaker ones. We attribute this to a mismatch typical in current SFT-RL pipelines: the distribution that generates the offline SFT data can differ substantially from the policy optimized during online RL, which learns from its own rollouts. We propose PEAR (Policy Evaluation-inspired Algorithm for Offline Learning Loss Re-weighting), an SFT-stage method that corrects this mismatch and better prepares the model for RL. PEAR uses importance sampling to reweight the SFT loss, with three variants operating at the token, block, and sequence levels. It can be used to augment standard SFT objectives and incurs little additional training overhead once probabilities for the offline data are collected. We conduct controlled experiments on verifiable reasoning games and mathematical reasoning tasks on Qwen 2.5 and 3 and DeepSeek-distilled models. PEAR consistently improves post-RL performance over canonical SFT, with pass@8 gains of up to 14.6 percent on AIME2025. Our results suggest that PEAR is an effective step toward more holistic LLM post-training by designing and evaluating SFT with downstream RL in mind rather than in isolation.
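PEAR 的 token 级变体大致可理解为:用重要性采样权重对标准 SFT 的逐 token 负对数似然加权,并做截断以控制方差。权重方向(当前策略概率除以离线数据行为分布概率)与截断值均为本文的假设性解读,以下 PyTorch 草图仅作示意,并非论文官方实现。

```python
import torch
import torch.nn.functional as F

def pear_token_loss(logits, labels, logp_behavior, clip=2.0):
    """token 级重要性采样加权 SFT 损失示意。
    logits: 当前策略输出 [T, V];labels: 离线数据 token [T];
    logp_behavior: 生成该离线数据的行为分布对各 token 的 log 概率 [T](假设可获得)。"""
    logp_policy = F.log_softmax(logits, dim=-1).gather(1, labels[:, None]).squeeze(1)
    w = (logp_policy.detach() - logp_behavior).exp().clamp(max=clip)  # IS 权重并截断
    return (w * (-logp_policy)).mean()   # 离当前策略分布近的 token 获得更大权重

T, V = 8, 100
logits = torch.randn(T, V)
labels = torch.randint(0, V, (T,))
logp_b = torch.log_softmax(torch.randn(T, V), dim=-1).gather(1, labels[:, None]).squeeze(1)
print(pear_token_loss(logits, labels, logp_b))
```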
zh
[NLP-150] Discovering Process-Outcome Credit in Multi-Step LLM Reasoning
【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)在提升大语言模型(Large Language Models, LLMs)推理能力时面临的奖励稀疏性(reward sparsity)和信用分配(credit assignment)效率低下的问题。其解决方案的关键在于提出一种新颖的框架,通过引入步骤级边际信息增益(Step-wise Marginal Information Gain, MIG)机制,利用单调历史水印(Monotonic Historical Watermark)量化每个推理步骤的内在价值,从而提供连续奖励信号并过滤训练噪声;同时采用解耦掩码策略(Decoupled Masking Strategy),将过程导向奖励专门应用于思维链(Chain-of-Thought, CoT),而结果导向奖励作用于完整输出,实现信用分配的解耦;此外结合双门控监督微调(Dual-Gated SFT)目标以增强训练稳定性与结构化事实信号的保留。
链接: https://arxiv.org/abs/2602.01034
作者: Xiangwei Wang,Wei Wang,Ken Chen,Nanduni Nimalsiri,Saman Halgamuge
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Reinforcement Learning (RL) serves as a potent paradigm for enhancing reasoning capabilities in Large Language Models (LLMs), yet standard outcome-based approaches often suffer from reward sparsity and inefficient credit assignment. In this paper, we propose a novel framework designed to provide continuous reward signals, which introduces a Step-wise Marginal Information Gain (MIG) mechanism that quantifies the intrinsic value of reasoning steps against a Monotonic Historical Watermark, effectively filtering out training noise. To ensure disentangled credit distribution, we implement a Decoupled Masking Strategy, applying process-oriented rewards specifically to the chain-of-thought (CoT) and outcome-oriented rewards to the full completion. Additionally, we incorporate a Dual-Gated SFT objective to stabilize training with high-quality structural and factual signals. Extensive experiments across textual and multi-modal benchmarks (e.g., MATH, Super-CLEVR) demonstrate that our approach consistently outperforms baselines such as GRPO in both sample efficiency and final accuracy. Furthermore, our model exhibits superior out-of-distribution robustness, demonstrating promising zero-shot transfer capabilities to unseen and challenging reasoning tasks.
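“基于单调历史水印的步骤级边际信息增益”可以用几行代码说明:以历史最高得分作为单调不减的水印,只有突破水印的步骤才获得正奖励,从而过滤掉来回波动的噪声。具体奖励形式为本文假设,仅示意机制。

```python
def mig_rewards(step_scores):
    """步骤级边际信息增益示意(非官方实现):
    只有当某一步得分突破历史水印时才给正奖励,水印单调不减。"""
    watermark = float("-inf")
    rewards = []
    for s in step_scores:
        gain = max(0.0, s - watermark) if watermark != float("-inf") else 0.0
        rewards.append(gain)
        watermark = max(watermark, s)   # 单调历史水印
    return rewards

print(mig_rewards([0.2, 0.5, 0.4, 0.7, 0.7]))   # -> [0.0, 0.3, 0.0, 0.2, 0.0]
```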
zh
[NLP-151] HalluHard: A Hard Multi-Turn Hallucination Benchmark
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多轮对话中产生的“幻觉”问题,即模型生成看似合理但缺乏事实依据的陈述,且随着对话轮次增加和早期错误累积,该问题显著加剧。解决方案的关键在于提出一个名为HalluHard的挑战性多轮幻觉评估基准,涵盖法律案例、科研问题、医疗指南和编程等高风险领域,并通过要求生成内容中的事实主张必须附带内联引用(inline citations)来量化“事实一致性”。此外,作者设计了一种迭代式判别流水线,利用网络搜索自动获取、过滤和解析全文来源(包括PDF),以验证引用材料是否真实支持生成内容,从而实现对开放场景下模型输出的可靠评估。
链接: https://arxiv.org/abs/2602.01031
作者: Dongyang Fan,Sebastien Delsad,Nicolas Flammarion,Maksym Andriushchenko
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) still produce plausible-sounding but ungrounded factual claims, a problem that worsens in multi-turn dialogue as context grows and early errors cascade. We introduce HalluHard, a challenging multi-turn hallucination benchmark with 950 seed questions spanning four high-stakes domains: legal cases, research questions, medical guidelines, and coding. We operationalize groundedness by requiring inline citations for factual assertions. To support reliable evaluation in open-ended settings, we propose a judging pipeline that iteratively retrieves evidence via web search. It can fetch, filter, and parse full-text sources (including PDFs) to assess whether cited material actually supports the generated content. Across a diverse set of frontier proprietary and open-weight models, hallucinations remain substantial even with web search (≈30% for the strongest configuration, Opus-4.5 with web search), with content-grounding errors persisting at high rates. Finally, we show that hallucination behavior is shaped by model capacity, turn position, effective reasoning, and the type of knowledge required.
zh
[NLP-152] Bias in the Ear of the Listener: Assessing Sensitivity in Audio Language Models Across Linguistic Demographic and Positional Variations EACL2026
【速读】: 该论文旨在解决多语言多模态大语言模型(Multilingual Multimodal Large Language Models, MLLMs)在语音输入场景下的偏见问题,特别是语音增强背景下不同语言、口音和性别因素对模型性能公平性的影响。其解决方案的关键在于构建并发布BiasInEar数据集——一个基于Global MMLU Lite扩展的语音增强基准,涵盖英语、中文和韩语,按性别和口音平衡,包含约70.8小时的语音(约4,249分钟)和11,200个问题,并采用准确率、熵、APES和Fleiss’ κ 四种互补指标,在语言、口音、性别及选项顺序等扰动下系统评估九个代表性模型。研究发现,MLLMs对人口统计学因素相对鲁棒,但对语言和选项顺序高度敏感,表明语音输入可能放大结构性偏见,同时揭示了架构设计与推理策略对跨语言鲁棒性的显著影响,从而建立了一个统一的评估框架,填补了文本与语音评估之间的差距。
链接: https://arxiv.org/abs/2602.01030
作者: Sheng-Lun Wei,Yu-Ling Liao,Yen-Hua Chang,Hen-Hsen Huang,Hsin-Hsi Chen
机构: National Taiwan University (国立台湾大学); Academia Sinica (中央研究院); AI Research Center (AINTU) (人工智能研究中心)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted as a long findings paper at EACL 2026
Abstract:This work presents the first systematic investigation of speech bias in multilingual MLLMs. We construct and release the BiasInEar dataset, a speech-augmented benchmark based on Global MMLU Lite, spanning English, Chinese, and Korean, balanced by gender and accent, and totaling 70.8 hours (≈4,249 minutes) of speech with 11,200 questions. Using four complementary metrics (accuracy, entropy, APES, and Fleiss’ κ), we evaluate nine representative models under linguistic (language and accent), demographic (gender), and structural (option order) perturbations. Our findings reveal that MLLMs are relatively robust to demographic factors but highly sensitive to language and option order, suggesting that speech can amplify existing structural biases. Moreover, architectural design and reasoning strategy substantially affect robustness across languages. Overall, this study establishes a unified framework for assessing fairness and robustness in speech-integrated LLMs, bridging the gap between text- and speech-based evaluation. The resources can be found at this https URL.
zh
[NLP-153] Large Language Models as Students Who Think Aloud: Overly Coherent, Verbose, and Confident
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在AI辅助教学系统中对新手学习者推理过程和元认知判断建模的准确性问题,尤其是现有评估方法过度关注解题正确率而忽视了人类学习过程中碎片化、不完善且具有动态变化的思维特征。其解决方案的关键在于构建一个基于630条多步骤化学问题求解任务中学生“出声思考”(think-aloud)语料的评估框架,通过对比LLM生成的推理与真实学习者言语在最小和扩展上下文提示下的差异,并量化模型预测学习者每一步成功概率的能力,从而揭示LLM在模拟学习过程中的系统性偏差——如推理过度连贯、冗长且缺乏变异性,以及对学习者表现的持续高估。这一框架为未来设计更贴近真实学习过程的自适应教学系统提供了实证依据与改进方向。
链接: https://arxiv.org/abs/2602.01015
作者: Conrad Borchers,Jill-Jênn Vie,Roger Azevedo
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Manuscript under review
Abstract:Large language models (LLMs) are increasingly embedded in AI-based tutoring systems. Can they faithfully model novice reasoning and metacognitive judgments? Existing evaluations emphasize problem-solving accuracy, overlooking the fragmented and imperfect reasoning that characterizes human learning. We evaluate LLMs as novices using 630 think-aloud utterances from multi-step chemistry tutoring problems with problem-solving logs of student hint use, attempts, and problem context. We compare LLM-generated reasoning to human learner utterances under minimal and extended contextual prompting, and assess the models’ ability to predict step-level learner success. Although GPT-4.1 generates fluent and contextually appropriate continuations, its reasoning is systematically over-coherent, verbose, and less variable than human think-alouds. These effects intensify with a richer problem-solving context during prompting. Learner performance was consistently overestimated. These findings highlight epistemic limitations of simulating learning with LLMs. We attribute these limitations to LLM training data, including expert-like solutions devoid of expressions of affect and working memory constraints during problem solving. Our evaluation framework can guide future design of adaptive systems that more faithfully support novice learning and self-regulation using generative artificial intelligence.
zh
[NLP-154] Distilling Token-Trained Models into Byte-Level Models
【速读】: 该论文旨在解决当前字节语言模型(Byte Language Models, BLMs)训练成本高昂的问题,即现有BLMs通常需要从头训练并在数万亿字节的数据上进行优化,导致资源消耗巨大。为克服这一限制,作者提出了一种高效的蒸馏方案,其关键在于两阶段课程学习:第一阶段为渐进式知识蒸馏(Progressive Knowledge Distillation),将字节级表示对齐到已有token训练的大语言模型(LLM)的嵌入空间;第二阶段为字节级监督微调(Byte-Level Supervised Fine-Tuning),实现完全在字节空间内的端到端生成能力。该方法仅需约1250亿字节数据即可使蒸馏后的BLMs保留原教师模型的大部分性能,显著降低了训练成本并提升了实用性。
链接: https://arxiv.org/abs/2602.01007
作者: Zishuo Bao,Jiaqi Leng,Junxiong Wang,Bowen Peng,Yucheng Lu
机构: 未知
类目: Computation and Language (cs.CL)
备注: 17 pages, 3 figures, 13 tables
Abstract:Byte Language Models (BLMs) have emerged as a promising direction for scaling language models beyond tokenization. However, existing BLMs typically require training from scratch on trillions of bytes, making them prohibitively expensive. In this paper, we propose an efficient distillation recipe that converts existing token-trained LLMs into BLMs while retaining comparable capabilities. Our recipe follows a two-stage curriculum: (1) Progressive Knowledge Distillation, which aligns byte-level representations with the embeddings of the token-trained teacher model; and (2) Byte-Level Supervised Fine-Tuning, which enables end-to-end generation entirely in the byte space. We validate our approach across multiple model families, including Llama, Qwen, and OLMo, and demonstrate that the distilled BLMs retain most of the teacher models’ performance using only approximately 125B bytes.
zh
[NLP-155] Reliable Use of Lemmas via Eligibility Reasoning and Section-Aware Reinforcement Learning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在数学推理中滥用定理(lemma)的问题,即模型常错误地应用定理而未验证前提条件(precondition),导致结论不可靠。其核心解决方案是将“定理判断”形式化为一个结构化预测任务:给定一个命题和候选定理,模型需输出两个部分——前提检查(precondition check)与结论效用检查(conclusion-utility check),从而推导出该定理是否适用。关键创新在于提出RULES框架,通过双段式输出结构(two-section output)并结合分段感知的强化学习训练策略(section-aware loss masking),使模型能精准定位错误来源并进行针对性优化,显著提升对前提破坏性扰动的鲁棒性。
链接: https://arxiv.org/abs/2602.00998
作者: Zhikun Xu,Xiaodong Yu,Ben Zhou,Jiang Liu,Jialian Wu,Ze Wang,Ximeng Sun,Hao Chen,Zicheng Liu
机构: Arizona State University (亚利桑那州立大学); Advanced Micro Devices, Inc. (超威半导体公司)
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent large language models (LLMs) perform strongly on mathematical benchmarks yet often misapply lemmas, importing conclusions without validating assumptions. We formalize lemma-judging as a structured prediction task: given a statement and a candidate lemma, the model must output a precondition check and a conclusion-utility check, from which a usefulness decision is derived. We present RULES, which encodes this specification via a two-section output and trains with reinforcement learning plus section-aware loss masking to assign penalty to the section responsible for errors. Training and evaluation draw on diverse natural language and formal proof corpora; robustness is assessed with a held-out perturbation suite; and end-to-end evaluation spans competition-style, perturbation-aligned, and theorem-based problems across various LLMs. Results show consistent in-domain gains over both a vanilla model and a single-label RL baseline, larger improvements on applicability-breaking perturbations, and parity or modest gains on end-to-end tasks; ablations indicate that the two-section outputs and section-aware reinforcement are both necessary for robustness.
zh
[NLP-156] Error Taxonomy-Guided Prompt Optimization
【速读】: 该论文旨在解决自动提示优化(Automatic Prompt Optimization, APO)中因采用自底向上迭代方式导致全局视角缺失的问题,即现有方法在基于单个问题的执行日志反馈进行局部调整时,难以系统性地识别和修复模型的共性错误模式。其解决方案的关键在于提出一种自顶向下的误差分类引导提示优化(Error Taxonomy-Guided Prompt Optimization, ETGPO)算法:首先收集模型在多个任务中的错误,构建一个误差分类体系(error taxonomy),进而通过向提示中引入针对高频失败模式的指导信息来增强提示性能。该方法在数学、问答和逻辑推理等多个基准测试中实现了与当前最优方法相当或更优的准确率,同时显著降低了优化阶段的token消耗和评估预算,约为其三分之一。
链接: https://arxiv.org/abs/2602.00997
作者: Mayank Singh,Vikas Yadav,Eduardo Blanco
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Automatic Prompt Optimization (APO) is a powerful approach for extracting performance from large language models without modifying their weights. Many existing methods rely on trial-and-error, testing different prompts or in-context examples until a good configuration emerges, often consuming substantial compute. Recently, natural language feedback derived from execution logs has shown promise as a way to identify how prompts can be improved. However, most prior approaches operate in a bottom-up manner, iteratively adjusting the prompt based on feedback from individual problems, which can cause them to lose the global perspective. In this work, we propose Error Taxonomy-Guided Prompt Optimization (ETGPO), a prompt optimization algorithm that adopts a top-down approach. ETGPO focuses on the global failure landscape by collecting model errors, categorizing them into a taxonomy, and augmenting the prompt with guidance targeting the most frequent failure modes. Across multiple benchmarks spanning mathematics, question answering, and logical reasoning, ETGPO achieves accuracy that is comparable to or better than state-of-the-art methods, while requiring roughly one third of the optimization-phase token usage and evaluation budget.
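ETGPO 的自顶向下循环可示意如下:先把全体失败样本归入错误类别形成分类体系,再把高频失败模式写成提示中的针对性指导。实际系统中 classify_error 应由 LLM 基于执行日志完成,这里用一个假设性的占位规则代替,仅示意流程。

```python
from collections import Counter

def classify_error(question, wrong_answer):
    """假设的错误归类器:实际应由 LLM 把执行日志映射到错误类别。"""
    return "算术失误" if any(c.isdigit() for c in question) else "审题偏差"

def etgpo(prompt, failures, top_k=2):
    """自顶向下的提示优化示意:先汇总全局错误分布,再针对高频失败模式补充指导。"""
    taxonomy = Counter(classify_error(q, a) for q, a in failures)
    guidance = [f"- 注意避免『{cat}』(共 {n} 例)" for cat, n in taxonomy.most_common(top_k)]
    return prompt + "\n常见错误提醒:\n" + "\n".join(guidance)

failures = [("12*9=?", "96"), ("7+15=?", "23"), ("甲比乙高,乙比丙高,谁最矮?", "乙")]
print(etgpo("请一步步解题。", failures))
```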
zh
[NLP-157] DeALOG: Decentralized Multi-Agents Log-Mediated Reasoning Framework
【速读】: 该论文旨在解决跨文本、表格和图像的复杂问答任务中,如何有效整合多源异构信息并实现可解释性与鲁棒性的难题。其核心挑战在于现有方法缺乏对不同模态数据的专用处理能力以及协同验证机制。解决方案的关键是提出一种去中心化的多智能体框架 DeALOG,通过引入专门化智能体(如表格、上下文、视觉、总结与验证智能体)协作完成任务,并利用共享的自然语言日志作为持久记忆进行通信与状态记录,从而实现无需中央控制的协同纠错与验证,显著提升了系统的准确性与可扩展性。
链接: https://arxiv.org/abs/2602.00996
作者: Abhijit Chakraborty,Ashish Raj Shekhar,Shiven Agarwal,Vivek Gupta
机构: Arizona State University (亚利桑那州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Complex question answering across text, tables and images requires integrating diverse information sources. A framework supporting specialized processing with coordination and interpretability is needed. We introduce DeALOG, a decentralized multi-agent framework for multimodal question answering. It uses specialized agents (Table, Context, Visual, Summarizing, and Verification) that communicate through a shared natural-language log as persistent memory. This log-based approach enables collaborative error detection and verification without central control, improving robustness. Evaluations on FinQA, TAT-QA, CRT-QA, WikiTableQuestions, FeTaQA, and MultiModalQA show competitive performance. Analysis confirms the importance of the shared log, agent specialization, and verification for accuracy. DeALOG provides a scalable approach through modular components using natural-language communication.
zh
[NLP-158] Sparse Reward Subsystem in Large Language Models
【速读】: 该论文试图解决的问题是:如何理解大型语言模型(Large Language Models, LLMs)内部状态中隐含的奖励机制及其对推理能力的作用。解决方案的关键在于识别出LLMs隐藏状态中存在一个稀疏奖励子系统(sparse reward subsystem),其中包含能够表示状态价值预期的价值神经元(value neurons),并通过干预实验验证这些神经元对于模型推理至关重要。此外,研究进一步发现该子系统中存在类似多巴胺神经元的奖励预测误差编码单元(reward prediction error, RPE),其激活水平与实际奖励高于或低于预期的程度呈正相关,从而揭示了LLMs内部类生物奖励机制的结构与功能特性。
链接: https://arxiv.org/abs/2602.00986
作者: Guowei Xu,Mert Yuksekgonul,James Zou
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:In this paper, we identify a sparse reward subsystem within the hidden states of Large Language Models (LLMs), drawing an analogy to the biological reward subsystem in the human brain. We demonstrate that this subsystem contains value neurons that represent the model’s internal expectation of state value, and through intervention experiments, we establish the importance of these neurons for reasoning. Our experiments reveal that these value neurons are robust across diverse datasets, model scales, and architectures; furthermore, they exhibit significant transferability across different datasets and models fine-tuned from the same base model. By examining cases where value predictions and actual rewards diverge, we identify dopamine neurons within the reward subsystem which encode reward prediction errors (RPE). These neurons exhibit high activation when the reward is higher than expected and low activation when the reward is lower than expected.
zh
[NLP-159] DISPO: Enhancing Training Efficiency and Stability in Reinforcement Learning for Large Language Model Mathematical Reasoning AISTATS
【速读】: 该论文旨在解决强化学习中基于可验证奖励(verifiable rewards)的策略优化问题,尤其是在大语言模型数学推理能力提升场景下,现有方法存在训练稳定性与学习效率之间的权衡:PPO类方法(如GRPO/DAPO)虽稳定但收敛慢,REINFORCE类方法(如CISPO)虽高效却易因重要性采样权重裁剪不当导致性能波动甚至崩溃。解决方案的关键在于提出DISPO算法,其核心创新是将重要性采样权重对正确和错误响应分别进行上裁剪(up-clipping)和下裁剪(down-clipping),从而形成四个可控的策略更新机制;通过独立调节这四个参数,DISPO在保持探索-蒸馏平衡的同时有效避免灾难性失败,显著提升了模型在AIME’24等基准上的表现(61.04% vs. CISPO的55.42%)。
链接: https://arxiv.org/abs/2602.00983
作者: Batuhan K. Karaman,Aditya Rawal,Suhaila Shakiah,Mohammad Ghavamzadeh,Mingyi Hong,Arijit Biswas,Ruida Zhou
机构: Cornell University (康奈尔大学); Amazon AGI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This work is accepted to the 29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026
Abstract:Reinforcement learning with verifiable rewards has emerged as a promising paradigm for enhancing the reasoning capabilities of large language models particularly in mathematics. Current approaches in this domain present a clear trade-off: PPO-style methods (e.g., GRPO/DAPO) offer training stability but exhibit slow learning trajectories due to their trust-region constraints on policy updates, while REINFORCE-style approaches (e.g., CISPO) demonstrate improved learning efficiency but suffer from performance instability as they clip importance sampling weights while still permitting non-zero gradients outside the trust-region. To address these limitations, we introduce DISPO, a simple yet effective REINFORCE-style algorithm that decouples the up-clipping and down-clipping of importance sampling weights for correct and incorrect responses, yielding four controllable policy update regimes. Through targeted ablations, we uncover how each regime impacts training: for correct responses, weights 1 increase the average token entropy (i.e., exploration) while weights 1 decrease it (i.e., distillation) – both beneficial but causing gradual performance degradation when excessive. For incorrect responses, overly restrictive clipping triggers sudden performance collapse through repetitive outputs (when weights 1) or vanishing response lengths (when weights 1). By separately tuning these four clipping parameters, DISPO maintains the exploration-distillation balance while preventing catastrophic failures, achieving 61.04% on AIME’24 (vs. 55.42% CISPO and 50.21% DAPO) with similar gains across various benchmarks and models.
zh
[NLP-160] MedSpeak: A Knowledge Graph-Aided ASR Error Correction Framework for Spoken Medical QA
【速读】: 该论文旨在解决医疗语音问答(Medical Spoken Question-Answering, SQA)系统中因自动语音识别(ASR)对医学术语识别不准确而导致的性能瓶颈问题。解决方案的关键在于提出MedSpeak框架,该框架通过融合医学知识图谱(Medical Knowledge Graph)中的语义关系与发音信息,并借助大语言模型(LLM)的推理能力,对噪声语音转录文本进行纠错,从而提升医学术语识别准确率及下游问答任务的整体性能。
链接: https://arxiv.org/abs/2602.00981
作者: Yutong Song,Shiva Shrestha,Chenhan Lyu,Elahe Khatibi,Pengfei Zhang,Honghui Xu,Nikil Dutt,Amir Rahmani
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Spoken question-answering (SQA) systems relying on automatic speech recognition (ASR) often struggle with accurately recognizing medical terminology. To this end, we propose MedSpeak, a novel knowledge graph-aided ASR error correction framework that refines noisy transcripts and improves downstream answer prediction by leveraging both semantic relationships and phonetic information encoded in a medical knowledge graph, together with the reasoning power of LLMs. Comprehensive experimental results on benchmarks demonstrate that MedSpeak significantly improves the accuracy of medical term recognition and overall medical SQA performance, establishing MedSpeak as a state-of-the-art solution for medical SQA. The code is available at this https URL.
zh
[NLP-161] GradingAttack: Attacking Large Language Models Towards Short Answer Grading Ability
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在自动短答案评分(Automatic Short Answer Grading, ASAG)中易受对抗攻击的问题,从而保障评分的公平性与可靠性。其解决方案的关键在于提出了一种细粒度的对抗攻击框架GradingAttack,通过设计词元级(token-level)和提示级(prompt-level)攻击策略,在保持高隐蔽性的同时有效误导评分模型;同时引入一种新的评估指标以平衡攻击成功率与伪装程度,从而系统性地量化ASAG模型的脆弱性。
链接: https://arxiv.org/abs/2602.00979
作者: Xueyi Li,Zhuoneng Zhou,Zitao Liu,Yongdong Wu,Weiqi Luo
机构: Guangdong Institute of Smart Education (广东省智能教育研究院); Jinan University (暨南大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have demonstrated remarkable potential for automatic short answer grading (ASAG), significantly boosting student assessment efficiency and scalability in educational scenarios. However, their vulnerability to adversarial manipulation raises critical concerns about automatic grading fairness and reliability. In this paper, we introduce GradingAttack, a fine-grained adversarial attack framework that systematically evaluates the vulnerability of LLM based ASAG models. Specifically, we align general-purpose attack methods with the specific objectives of ASAG by designing token-level and prompt-level strategies that manipulate grading outcomes while maintaining high camouflage. Furthermore, to quantify attack camouflage, we propose a novel evaluation metric that balances attack success and camouflage. Experiments on multiple datasets demonstrate that both attack strategies effectively mislead grading models, with prompt-level attacks achieving higher success rates and token-level attacks exhibiting superior camouflage capability. Our findings underscore the need for robust defenses to ensure fairness and reliability in ASAG. Our code and datasets are available at this https URL.
zh
[NLP-162] Trust in One Round: Confidence Estimation for Large Language Models via Structural Signals WWW2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在高风险应用场景中,标准置信度估计方法(如token似然、语义相似性和多样本一致性)在分布偏移、领域特异性文本和计算资源受限条件下表现脆弱的问题。解决方案的关键在于提出一种名为“结构置信度”(Structural Confidence)的单次前向传播、模型无关的框架,通过分析模型最终层隐藏状态轨迹中的多尺度结构信号(包括谱特征、局部变化和全局形状描述符),捕捉概率和句子嵌入无法识别的内部稳定性模式,从而实现更准确、高效且鲁棒的输出正确性预测。
链接: https://arxiv.org/abs/2602.00977
作者: Pengyue Yang,Jiawen Wen,Haolin Jin,Linghan Huang,Huaming Chen,Ling Chen
机构: The University of Sydney(悉尼大学); University of Technology Sydney(悉尼科技大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at The ACM Web Conference 2026 (WWW 2026)
Abstract:Large language models (LLMs) are increasingly deployed in domains where errors carry high social, scientific, or safety costs. Yet standard confidence estimators, such as token likelihood, semantic similarity and multi-sample consistency, remain brittle under distribution shift, domain-specialised text, and compute limits. In this work, we present Structural Confidence, a single-pass, model-agnostic framework that enhances output correctness prediction based on multi-scale structural signals derived from a model’s final-layer hidden-state trajectory. By combining spectral, local-variation, and global shape descriptors, our method captures internal stability patterns that are missed by probabilities and sentence embeddings. We conduct extensive, cross-domain evaluation across four heterogeneous benchmarks: FEVER (fact verification), SciFact (scientific claims), WikiBio-hallucination (biographical consistency), and TruthfulQA (truthfulness-oriented QA). Our Structural Confidence framework demonstrates strong performance compared with established baselines in terms of AUROC and AUPR. More importantly, unlike sampling-based consistency methods which require multiple stochastic generations and an auxiliary model, our approach uses a single deterministic forward pass, offering a practical basis for efficient, robust post-hoc confidence estimation in socially impactful, resource-constrained LLM applications.
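摘要中的“多尺度结构信号”可示意为:对最终层隐藏状态轨迹同时提取谱特征、局部变化与全局形状描述子,再拼接为置信度分类器的输入。具体特征选择为本文假设,并非论文官方特征集。

```python
import numpy as np

def structural_features(H):
    """从最终层隐藏状态轨迹 H([T, d])提取多尺度结构信号示意。"""
    traj = np.linalg.norm(H, axis=1)                       # 每个 token 表示范数组成的序列
    spec = np.abs(np.fft.rfft(traj))[:4]                   # 谱特征:低频能量
    local = np.linalg.norm(np.diff(H, axis=0), axis=1)     # 局部变化:相邻状态位移
    shape = [traj.mean(), traj.std(), traj[-1] - traj[0]]  # 全局形状描述子
    return np.concatenate([spec, [local.mean(), local.max()], shape])

H = np.random.default_rng(0).normal(size=(16, 32))
print(structural_features(H).shape)   # 单次确定性前向即可得到 9 维结构特征
```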
zh
[NLP-163] Verification Required: The Impact of Information Credibility on AI Persuasion
【速读】: 该论文旨在解决生成式 AI(Generative AI)在高风险决策场景中因信息可信度不确定而导致的策略性沟通问题,现有研究多局限于不可验证的“廉价对话”或完全可验证的信息披露,无法刻画现实世界中信息具有概率可信度的复杂情形。其解决方案的关键在于提出 MixTalk 战略通信博弈框架,该框架建模了发送方如何结合可验证与不可验证声明来传递私有信息,接收方则基于先验信念、声明内容及成本高昂的验证结果分配有限预算进行推理;进一步通过 Tournament Oracle Policy Distillation(TOPD)方法,从交互日志中离线蒸馏出最优策略,并在推理时以提示工程方式部署,显著提升了接收方对说服行为的鲁棒性。
链接: https://arxiv.org/abs/2602.00970
作者: Saaduddin Mahmud,Eugene Bagdasarian,Shlomo Zilberstein
机构: 未知
类目: Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
备注: 19 pages, 5 figures
Abstract:Agents powered by large language models (LLMs) are increasingly deployed in settings where communication shapes high-stakes decisions, making a principled understanding of strategic communication essential. Prior work largely studies either unverifiable cheap-talk or fully verifiable disclosure, failing to capture realistic domains in which information has probabilistic credibility. We introduce MixTalk, a strategic communication game for LLM-to-LLM interaction that models information credibility. In MixTalk, a sender agent strategically combines verifiable and unverifiable claims to communicate private information, while a receiver agent allocates a limited budget to costly verification and infers the underlying state from prior beliefs, claims, and verification outcomes. We evaluate state-of-the-art LLM agents in large-scale tournaments across three realistic deployment settings, revealing their strengths and limitations in reasoning about information credibility and the explicit behavior that shapes these interactions. Finally, we propose Tournament Oracle Policy Distillation (TOPD), an offline method that distills tournament oracle policy from interaction logs and deploys it in-context at inference time. Our results show that TOPD significantly improves receiver robustness to persuasion.
zh
[NLP-164] Probing the Knowledge Boundary: An Interactive Agentic Framework for Deep Knowledge Extraction
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中知识内容的可提取性与边界模糊问题,即明确LLMs究竟包含哪些知识以及其知识覆盖范围的系统性探测难题。现有评估基准多为静态设计,难以支持对模型知识进行深度、结构化挖掘。为此,作者提出一种交互式智能体框架(interactive agentic framework),其核心创新在于引入四种自适应探索策略以实现多粒度的知识探测,并构建三阶段知识处理流水线:基于向量的去重过滤、基于LLM的语义歧义仲裁及领域相关性审计,从而保障所提取知识的质量与有效性。实验表明,递归分类法(recursive taxonomy)是最优探索策略,且模型规模与知识提取量之间存在清晰的缩放规律,同时揭示了通用模型与领域专用模型在持续知识提取中的性能权衡关系。
链接: https://arxiv.org/abs/2602.00959
作者: Yuheng Yang,Siqi Zhu,Tao Feng,Ge Liu,Jiaxuan You
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Homepage: this https URL
Abstract:Large Language Models (LLMs) can be seen as compressed knowledge bases, but it remains unclear what knowledge they truly contain and how far their knowledge boundaries extend. Existing benchmarks are mostly static and provide limited support for systematic knowledge probing. In this paper, we propose an interactive agentic framework to systematically extract and quantify the knowledge of LLMs. Our method includes four adaptive exploration policies to probe knowledge at different granularities. To ensure the quality of extracted knowledge, we introduce a three-stage knowledge processing pipeline that combines vector-based filtering to remove exact duplicates, LLM-based adjudication to resolve ambiguous semantic overlaps, and domain-relevance auditing to retain valid knowledge units. Through extensive experiments, we find that recursive taxonomy is the most effective exploration strategy. We also observe a clear knowledge scaling law, where larger models consistently extract more knowledge. In addition, we identify a Pass@1-versus-Pass@k trade-off: domain-specialized models achieve higher initial accuracy but degrade rapidly, while general-purpose models maintain stable performance during extended extraction. Finally, our results show that differences in training data composition lead to distinct and measurable knowledge profiles across model families.
zh
[NLP-165] Neural FOXP2 – Language Specific Neuron Steering for Targeted Language Improvement in LLM s
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多语言场景下存在语言默认性(language defaultness)的问题,即尽管模型具备多语言能力,但其输出往往倾向于以英语为主导,导致其他语言(如印地语或西班牙语)在参数记忆中被系统性抑制。为实现对目标语言的可控主导,默认语言机制被识别为由稀疏且低秩的控制回路——“语言神经元”(language neurons)所调控。解决方案的关键在于提出 Neural FOXP2 方法,通过三个阶段实现:(i) 局部化定位语言特异性神经元,利用层内稀疏自动编码器(SAE)分解激活并筛选出对目标语言具有显著选择性的特征单元;(ii) 通过谱低秩分析提取控制语言转换的主要方向,确定一个紧凑的干预子空间及稳定有效的干预窗口;(iii) 在低至中层施加符号化的稀疏激活偏移,沿目标语言主导方向注入正向扰动,并对英语神经元施加补偿负向扰动,从而实现对模型语言默认状态的安全可控调节。
链接: https://arxiv.org/abs/2602.00945
作者: Anusa Saha,Tanmay Joshi,Vinija Jain,Aman Chadha,Amitava Das
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract: LLMs are multilingual by training, yet their lingua franca is often English, reflecting English language dominance in pretraining. Other languages remain in parametric memory but are systematically suppressed. We argue that language defaultness is governed by a sparse, low-rank control circuit, language neurons, that can be mechanistically isolated and safely steered. We introduce Neural FOXP2, which makes a chosen language (Hindi or Spanish) primary in a model by steering language-specific neurons. Neural FOXP2 proceeds in three stages: (i) Localize: We train per-layer SAEs so each activation decomposes into a small set of active feature components. For every feature, we quantify English vs. Hindi/Spanish selectivity via overall logit-mass lift toward the target-language token set. Tracing the top-ranked features back to their strongest contributing units yields a compact language-neuron set. (ii) Steering directions: We localize controllable language-shift geometry via a spectral low-rank analysis. For each layer, we build English-to-target activation-difference matrices and perform layerwise SVD to extract the dominant singular directions governing language change. The eigengap and effective-rank spectra identify a compact steering subspace and an empirically chosen intervention window (where these directions are strongest and most stable). (iii) Steer: We apply a signed, sparse activation shift targeted to the language neurons. Concretely, within low to mid layers we add a positive steering along the target-language dominant directions and a compensating negative shift toward the null space for the English neurons, yielding controllable target-language defaultness.
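第 (iii) 阶段的“带符号稀疏激活偏移”大致可以写成如下代数形式:在选定层窗口内,沿目标语言主方向加正向偏移,同时削弱隐藏状态在英语方向上的分量。以下草图中的方向向量、投影式负偏移与系数均为本文假设,仅示意干预的形状,并非论文官方实现。

```python
import torch

def steer_hidden(h, target_dirs, english_dirs, alpha=2.0, beta=1.0):
    """语言神经元定向干预示意:h 为某层隐藏状态 [batch, seq, hidden]。
    正向:沿目标语言主方向平移;负向:按投影削弱英语方向上的分量。"""
    for v in target_dirs:
        h = h + alpha * v                            # 推向目标语言(如印地语/西班牙语)子空间
    for v in english_dirs:
        h = h - beta * (h @ v)[..., None] * v        # 削弱英语方向分量
    return h

d = 64
h = torch.randn(1, 8, d)
tgt = [torch.nn.functional.normalize(torch.randn(d), dim=0)]
eng = [torch.nn.functional.normalize(torch.randn(d), dim=0)]
print(steer_hidden(h, tgt, eng).shape)
```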
zh
[NLP-166] A Baseline Multimodal Approach to Emotion Recognition in Conversations
【速读】: 该论文旨在解决对话场景中的情感识别(Emotion Recognition in Conversations, ERC)问题,利用基于情景喜剧《老友记》(Friends)构建的SemEval-2024 Task 3数据集进行实验。其解决方案的关键在于提出一个轻量级多模态基线模型,该模型结合了(i)基于Transformer的文本分类器与(ii)自监督语音表示模型,并采用简单的晚期融合(late-fusion ensemble)策略整合两种模态信息。通过在有限训练协议下的实证结果,论文揭示了多模态融合在何种情况下优于单一模态模型,从而为后续研究提供可复现、透明且易于扩展的参考实现。
链接: https://arxiv.org/abs/2602.00914
作者: Víctor Yeste,Rodrigo Rivas-Arévalo
机构: Universidad Europea de Valencia (欧洲大学瓦伦西亚校区)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 10 pages
Abstract:We present a lightweight multimodal baseline for emotion recognition in conversations using the SemEval-2024 Task 3 dataset built from the sitcom Friends. The goal of this report is not to propose a novel state-of-the-art method, but to document an accessible reference implementation that combines (i) a transformer-based text classifier and (ii) a self-supervised speech representation model, with a simple late-fusion ensemble. We report the baseline setup and empirical results obtained under a limited training protocol, highlighting when multimodal fusion improves over unimodal models. This preprint is provided for transparency and to support future, more rigorous comparisons.
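摘要中的“简单晚期融合”即对文本与语音两个分类器输出的类别概率做加权平均后取 argmax;下面是一个自包含的示意,权重 w 为假设值,实际可在验证集上调优。

```python
import numpy as np

def late_fuse(p_text, p_audio, w=0.6):
    """文本与语音分类器的晚期融合示意:对两个模态的类别概率加权平均后取 argmax。"""
    p = w * np.asarray(p_text) + (1 - w) * np.asarray(p_audio)
    return int(np.argmax(p)), p

emotions = ["neutral", "joy", "anger", "sadness"]
idx, p = late_fuse([0.1, 0.6, 0.2, 0.1], [0.3, 0.3, 0.3, 0.1])
print(emotions[idx], p)
```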
zh
[NLP-167] Do Schwartz Higher-Order Values Help Sentence-Level Human Value Detection? When Hard Gating Hurts
【速读】: 该论文旨在解决句子级人类价值观检测(sentence-level human value detection)中如何有效利用Schwartz价值观的高阶(higher-order, HO)类别结构的问题,尤其是在计算资源受限(单张8 GB GPU)条件下提升模型性能。其关键解决方案在于:虽然HO类别在描述上具有价值,但强制使用硬掩码(hard masks)实施层级约束会因误差累积和召回率抑制而损害最终任务表现;相比之下,标签级阈值调优(label-wise threshold tuning)和小型Transformer集成(small transformer ensembles)是更可靠且高效的改进手段,分别带来高达+0.05和+0.02的Macro-F₁提升,体现出校准与轻量集成在实际应用中的重要性。
链接: https://arxiv.org/abs/2602.00913
作者: Víctor Yeste,Paolo Rosso
机构: Universitat Politècnica de València (瓦伦西亚理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code: this https URL , 42 pages, 4 figures
Abstract:Sentence-level human value detection is typically framed as multi-label classification over Schwartz values, but it remains unclear whether Schwartz higher-order (HO) categories provide usable structure. We study this under a strict compute-frugal budget (single 8 GB GPU) on ValueEval'24 / ValuesML (74K English sentences). We compare (i) direct supervised transformers, (ii) HO → values pipelines that enforce the hierarchy with hard masks, and (iii) Presence → HO → values cascades, alongside low-cost add-ons (lexica, short context, topics), label-wise threshold tuning, small instruction-tuned LLM baselines (≤10B), QLoRA, and simple ensembles. HO categories are learnable from single sentences (e.g., the easiest bipolar pair reaches Macro-F1 ≈ 0.58), but hard hierarchical gating is not a reliable win: it often reduces end-task Macro-F1 via error compounding and recall suppression. In contrast, label-wise threshold tuning is a high-leverage knob (up to +0.05 Macro-F1), and small transformer ensembles provide the most consistent additional gains (up to +0.02 Macro-F1). Small LLMs lag behind supervised encoders as stand-alone systems, yet can contribute complementary errors in cross-family ensembles. Overall, HO structure is useful descriptively, but enforcing it with hard gates hurts sentence-level value detection; robust improvements come from calibration and lightweight ensembling.
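文中收益最大的“标签级阈值调优”做法很直接:对每个价值观标签独立地在开发集上扫描判别阈值,选使该标签 F1 最大的一个。以下为自包含的假设性实现草图,网格与数据均为示意。

```python
import numpy as np

def tune_thresholds(probs, y_true, grid=np.linspace(0.05, 0.95, 19)):
    """标签级阈值调优示意:probs [N, L] 为各标签概率,y_true [N, L] 为 0/1 真值。
    对每个标签独立扫描阈值,返回使该标签开发集 F1 最大的阈值向量。"""
    n_labels = probs.shape[1]
    best = np.full(n_labels, 0.5)
    for j in range(n_labels):
        f1s = []
        for t in grid:
            pred = probs[:, j] >= t
            tp = (pred & (y_true[:, j] == 1)).sum()
            fp = (pred & (y_true[:, j] == 0)).sum()
            fn = (~pred & (y_true[:, j] == 1)).sum()
            f1s.append(2 * tp / max(2 * tp + fp + fn, 1))
        best[j] = grid[int(np.argmax(f1s))]
    return best

rng = np.random.default_rng(0)
probs = rng.random((200, 4)); y = (rng.random((200, 4)) < 0.3).astype(int)
print(tune_thresholds(probs, y))
```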
zh
[NLP-168] Hallucination is a Consequence of Space-Optimality: A Rate-Distortion Theorem for Membership Testing
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对无模式可循的“随机事实”时产生高置信度幻觉的问题。其核心挑战在于理解为何即使在理想训练条件下,LLM仍会错误地对非事实赋予高置信度。解决方案的关键在于将此类事实的记忆建模为一个成员测试问题(membership testing problem),并结合布隆过滤器(Bloom filter)的离散误差指标与LLM的连续对数损失(log-loss),建立了一个率失真理论(rate-distortion theorem)。该理论指出,在事实稀疏的假设空间中,最优记忆效率由事实与非事实得分分布之间的最小KL散度决定;由此揭示出:即使在完美数据和封闭世界设定下,受限容量下的信息论最优策略并非回避或遗忘,而是对某些非事实分配高置信度,从而自然导致幻觉现象。
链接: https://arxiv.org/abs/2602.00906
作者: Anxin Guo,Jingwei Li
机构: Northwestern University, USA (西北大学,美国); Columbia University, USA (哥伦比亚大学,美国)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Information Theory (cs.IT)
备注:
Abstract:Large language models often hallucinate with high confidence on “random facts” that lack inferable patterns. We formalize the memorization of such facts as a membership testing problem, unifying the discrete error metrics of Bloom filters with the continuous log-loss of LLMs. By analyzing this problem in the regime where facts are sparse in the universe of plausible claims, we establish a rate-distortion theorem: the optimal memory efficiency is characterized by the minimum KL divergence between score distributions on facts and non-facts. This theoretical framework provides a distinctive explanation for hallucination: even with optimal training, perfect data, and a simplified “closed world” setting, the information-theoretically optimal strategy under limited capacity is not to abstain or forget, but to assign high confidence to some non-facts, resulting in hallucination. We validate this theory empirically on synthetic data, showing that hallucinations persist as a natural consequence of lossy compression.
zh
[NLP-169] EffGen: Enabling Small Language Models as Capable Autonomous Agents
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的智能体系统(agentic systems)在部署时面临的高Token成本、隐私风险以及对云端API依赖等问题。针对小语言模型(Small Language Models, SLMs)在本地部署场景下的效率与安全性不足,作者提出了一种名为effGen的开源智能体框架,其核心创新在于四个关键技术:(1) 通过提示优化(prompt optimization)实现上下文压缩(减少70–80%),同时保持任务语义完整性;(2) 基于依赖关系的智能任务分解机制,支持并行或串行子任务执行;(3) 基于五因素的复杂度路由策略,在预执行阶段做出高效决策;(4) 统一记忆系统整合短期、长期及向量存储。这些模块协同作用使effGen在13个基准测试中显著优于LangChain、AutoGen和Smolagents,且在不同规模模型上均展现出可扩展的性能增益,尤其揭示了提示优化更适配SLMs、复杂度路由更利于LLMs的互补性Scaling行为。
链接: https://arxiv.org/abs/2602.00887
作者: Gaurav Srivastava,Aafiya Hussain,Chi Wang,Yingyan Celine Lin,Xuan Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Most existing language model agentic systems today are built and optimized for large language models (e.g., GPT, Claude, Gemini) via API calls. While powerful, this approach faces several limitations including high token costs and privacy concerns for sensitive applications. We introduce effGen, an open-source agentic framework optimized for small language models (SLMs) that enables effective, efficient, and secure local deployment (pip install effgen). effGen makes four major contributions: (1) Enhanced tool-calling with prompt optimization that compresses contexts by 70-80% while preserving task semantics, (2) Intelligent task decomposition that breaks complex queries into parallel or sequential subtasks based on dependencies, (3) Complexity-based routing using five factors to make smart pre-execution decisions, and (4) Unified memory system combining short-term, long-term, and vector-based storage. Additionally, effGen unifies multiple agent protocols (MCP, A2A, ACP) for cross-protocol communication. Results on 13 benchmarks show effGen outperforms LangChain, AutoGen, and Smolagents with higher success rates, faster execution, and lower memory. Our results reveal that prompt optimization and complexity routing have complementary scaling behavior: optimization benefits SLMs more (11.2% gain at 1.5B vs 2.4% at 32B), while routing benefits large models more (3.6% at 1.5B vs 7.9% at 32B), providing consistent gains across all scales when combined. effGen (this https URL) is released under the MIT License, ensuring broad accessibility for research and commercial use. Our framework code is publicly available at this https URL.
zh
[NLP-170] ILSIC: Corpora for Identifying Indian Legal Statutes from Queries by Laypeople EACL2026
【速读】: 该论文旨在解决法律文本处理中法律条文识别(Legal Statute Identification, LSI)任务在实际应用场景下的适配性问题,即传统模型多基于法院判决中的正式事实进行训练,而现实场景中用户输入往往来自非专业人士的非正式查询。解决方案的关键在于构建了一个名为ILSIC的新语料库,其中包含500余条印度法律条文的非专业人士查询及其对应的法院判例,从而支持对法院数据与非专业人士数据之间的差异进行系统比较。实验表明,仅在法院判例上训练的模型在非专业人士查询上的表现不佳,而从法院数据向非专业人士数据迁移学习可在特定场景下提升性能,凸显了跨数据源迁移的有效性与必要性。
链接: https://arxiv.org/abs/2602.00881
作者: Shounak Paul,Raghav Dogra,Pawan Goyal,Saptarshi Ghosh
机构: Indian Institute of Technology, Kharagpur(印度理工学院,克哈格普尔分校)
类目: Computation and Language (cs.CL)
备注: 9 Pages of Main, 1 page of Limitations and Ethics Statement, 11 Pages of Appendix, Accepted for Publication at EACL 2026 (Findings)
Abstract:Legal Statute Identification (LSI) for a given situation is one of the most fundamental tasks in Legal NLP. This task has traditionally been modeled using facts from court judgments as input queries, due to their abundance. However, in practical settings, the input queries are likely to be informal and asked by laypersons, or non-professionals. While a few laypeople LSI datasets exist, there has been little research to explore the differences between court and laypeople data for LSI. In this work, we create ILSIC, a corpus of laypeople queries covering 500+ statutes from Indian law. Additionally, the corpus also contains court case judgements to enable researchers to effectively compare between court and laypeople data for LSI. We conducted extensive experiments on our corpus, including benchmarking over the laypeople dataset using zero and few-shot inference, retrieval-augmented generation and supervised fine-tuning. We observe that models trained purely on court judgements are ineffective when tested on laypeople queries, while transfer learning from court to laypeople data can be beneficial in certain scenarios. We also conducted fine-grained analyses of our results in terms of categories of queries and frequency of statutes.
zh
[NLP-171] Beyond Output Critique: Self-Correction via Task Distillation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在自我修正过程中仅能处理表层错误而难以纠正深层推理缺陷的问题。现有方法多基于输出层面的批判性评估,导致纠错效果受限。其解决方案的关键在于引入一个中间步骤——任务抽象(task abstraction),即在初始响应后,模型首先将任务提炼为结构化的模板,明确关键变量、约束条件和问题结构;该抽象作为指导框架用于后续解的实例化,从而提升对任务本质的理解并减少错误传播。更进一步,研究发现此类抽象可跨模型迁移:由大模型生成的任务模板可作为小模型的结构化引导,显著增强其自修正能力,无需大量微调或外部验证器即可实现更可靠、鲁棒且泛化性强的自我修正性能。
链接: https://arxiv.org/abs/2602.00871
作者: Hossein A. Rahmani,Mengting Wan,Pei Zhou,Longqi Yang,Nick Craswell,Emine Yilmaz,Sujay Kumar Jauhar
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have shown promising self-correction abilities, where iterative refinement improves the quality of generated responses. However, most existing approaches operate at the level of output critique, patching surface errors while often failing to correct deeper reasoning flaws. We propose SELF-THOUGHT, a framework that introduces an intermediate step of task abstraction before solution refinement. Given an input and an initial response, the model first distills the task into a structured template that captures key variables, constraints, and problem structure. This abstraction then guides solution instantiation, grounding subsequent responses in a clearer understanding of the task and reducing error propagation. Crucially, we show that these abstractions can be transferred across models: templates generated by larger models can serve as structured guides for smaller LLMs, which typically struggle with intrinsic self-correction. By reusing distilled task structures, smaller models achieve more reliable refinements without heavy fine-tuning or reliance on external verifiers. Experiments across diverse reasoning tasks demonstrate that SELF-THOUGHT improves accuracy, robustness, and generalization for both large and small models, offering a scalable path toward more reliable self-correcting language systems.
zh
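为便于理解 SELF-THOUGHT 的"初始回答、任务抽象、模板引导修正"三步流程,下面给出一个极简的提示链示意(非论文官方实现:llm 接口与各段提示词均为假设的占位):

```python
# SELF-THOUGHT 式自修正流程的极简示意(提示词为虚构,并非论文原文)
from typing import Callable

def self_thought_refine(llm: Callable[[str], str], task: str) -> str:
    # 第一步:生成初始回答
    draft = llm(f"请解答以下任务:\n{task}")
    # 第二步:任务抽象,将任务蒸馏为结构化模板(关键变量、约束、问题结构)
    template = llm(
        "请把下面的任务抽象为结构化模板,依次列出:"
        "1) 关键变量 2) 约束条件 3) 问题结构。\n"
        f"任务:{task}"
    )
    # 第三步:以模板为指导,对初始回答进行实例化修正
    return llm(
        f"任务:{task}\n结构化模板:\n{template}\n初始回答:\n{draft}\n"
        "请逐项对照模板检查初始回答,并给出修正后的最终回答。"
    )
```

摘要还强调此类抽象可跨模型迁移:实践中可由大模型生成 template,再交给小模型执行第三步的修正。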
[NLP-172] Foundation CAN LM: A Pretrained Language Model For Automotive CAN Data
【速读】: 该论文旨在解决当前基于控制器局域网络(Controller Area Network, CAN)数据的车辆信号建模中存在任务孤立、缺乏共享表示学习以及跨任务泛化能力弱的问题。现有方法通常对原始CAN数据独立训练特定任务模型,且仅有限探索解码后的信号特征,导致模型难以迁移和复用。其解决方案的关键在于引入“基础模型”(foundation model)范式,将CAN数据视为一种语言进行处理:首先在大规模未标注解码CAN信号上进行预训练,构建通用表示;随后通过统一的分词机制处理混合离散-连续信号,并解决时间复杂度与行程特异性变化等挑战,最终实现单一预训练骨干网络在多个异构汽车保险任务上的有效微调与多目标下游泛化。
链接: https://arxiv.org/abs/2602.00866
作者: Akiharu Esashi,Pawissanutt Lertpongrujikorn,Justin Makino,Yuibi Fujimoto,Mohsen Amini Salehi
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:The Controller Area Network (CAN) bus provides a rich source of vehicular signals increasingly leveraged for applications in automotive and auto insurance domains, including collision detection, predictive maintenance, and driver risk modeling. Despite this potential, existing pipelines largely train isolated task-specific models on raw CAN data, with only limited efforts exploring decoded signals. Such fragmentation prevents shared representation learning and limits cross-task generalization. By contrast, natural language processing (NLP) and computer vision (CV) have been transformed by the foundation model paradigm: large-scale pretraining followed by task-specific adaptation. In this work, we introduce the foundation CAN model that demonstrates multi-objective downstream generalization using a single pretrained backbone. Our approach treats CAN data as a language: we pretrain on large-scale, unlabeled decoded CAN signals and fine-tune across heterogeneous auto insurance tasks. To enable this, we propose a unified tokenization scheme for mixed discrete-continuous signals and address challenges of temporal complexity and trip-specific variability. Our results show that one pretrained CAN model can adapt effectively to diverse predictive tasks, validating that the foundation modeling paradigm, proven in NLP and CV, also holds for CAN data. This establishes a new direction for generalizable representation learning in automotive AI.
zh
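论文提出的统一分词方案需要同时处理离散与连续的CAN信号,但摘要未给出具体设计。下面仅以"连续信号均匀量化、离散信号直通"为假设给出一个示意:

```python
import numpy as np

def tokenize_signal(values: np.ndarray, is_discrete: bool,
                    v_min: float = -1.0, v_max: float = 1.0,
                    n_bins: int = 256) -> np.ndarray:
    """把一条CAN信号序列转成离散token id(示意,量化方案为假设)。"""
    if is_discrete:
        # 离散信号(如档位、开关状态)直接取整作为token id
        return values.astype(np.int64)
    # 连续信号(如归一化后的车速)裁剪后均匀量化到 n_bins 个桶
    clipped = np.clip(values, v_min, v_max)
    bins = (clipped - v_min) / (v_max - v_min) * (n_bins - 1)
    return np.round(bins).astype(np.int64)

speed = np.array([0.12, 0.15, 0.40, 0.38])  # 假设已归一化到[-1, 1]
print(tokenize_signal(speed, is_discrete=False))  # [143 147 178 176]
```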
[NLP-173] Multi-Head Attention Is a Multi-Player Game
【速读】: 该论文旨在解决现代Transformer模型中注意力头(attention head)在训练过程中因缺乏协同机制而导致的效率低下问题,特别是由冗余和相关错误引发的"无效率"现象(即无序代价,Price of Anarchy, PoA)。传统交叉熵训练将多头结构视为单一优化器,忽略了其内在的多智能体博弈特性,导致梯度下降收敛至低效的纳什均衡。解决方案的关键在于将注意力头间的相互作用建模为势博弈(potential game),并通过量化头间耦合强度的交互矩阵 $ \Gamma(G) $ 来界定PoA上限;进一步提出GAME-LoRA方法,融合Barlow Twins去相关性约束与对数行列式协调压力,实现对 $ \Gamma(G) $ 的显式调控,从而在不牺牲知识保留的前提下显著降低幻觉概率(最多减少18%),验证了从博弈论视角重构训练目标的有效性。
链接: https://arxiv.org/abs/2602.00861
作者: Kushal Chakrabarti,Nirmal Balachundar
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
备注:
Abstract:Modern transformer attention is internally multi-agent – heads compete and coordinate – yet we train it as if it were a monolithic optimizer. We formalize this gap: cross-entropy training induces an implicit potential game among heads, and gradient descent converges to Nash equilibria with potentially unbounded inefficiency due to unpriced externalities (redundancy, correlated errors). Our main result bounds the Price of Anarchy by \Gamma(G), the off-diagonal mass of a head interaction matrix capturing weight and gradient coupling. Under mild smoothness assumptions, we prove that both excess hallucination probability and excess head redundancy scale with PoA, unifying two distinct failure modes into a single mechanism. The bound is prescriptive: regularization that reduces \Gamma(G) provably tightens PoA. We instantiate this as GAME-LoRA, combining Barlow Twins decorrelation with log-determinant coordination pressure. Experiments validate the theory: \Gamma(G) predicts hallucination (p < 0.05), emergent coalitions exhibit selective coordination, and GAME-LoRA achieves up to 18% hallucination reduction (8% average) with no knowledge degradation – a Pareto improvement inaccessible to methods ignoring the game structure.
zh
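摘要中的 \Gamma(G) 定义为头间交互矩阵的非对角质量,同时涉及权重与梯度耦合。下面仅用归一化梯度内积近似该交互矩阵,给出一个可运行的示意(具体矩阵构造与归一化方式为假设):

```python
import torch

def gamma_offdiag(head_grads: torch.Tensor) -> torch.Tensor:
    """head_grads: [H, D],每行为一个注意力头参数梯度的展平向量。
    用归一化梯度内积近似头间交互矩阵 G,返回其非对角质量(示意)。"""
    g = torch.nn.functional.normalize(head_grads, dim=-1)
    G = (g @ g.T).abs()                   # 头间耦合强度,[H, H]
    off = G - torch.diag(torch.diag(G))   # 去掉对角线
    return off.sum() / G.shape[0]         # 按头数归一化的非对角质量

print(gamma_offdiag(torch.randn(8, 128)))
```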
[NLP-174] Unifying Adversarial Robustness and Training Across Text Scoring Models
【速读】: 该论文旨在解决当前语言模型在文本评分模型(text scoring models)中对抗鲁棒性研究分散、缺乏统一框架的问题,尤其针对密集检索器(dense retrievers)、重排序器(rerankers)和奖励模型(reward models)等不同角色模型的共性脆弱性未被充分揭示与应对的挑战。其关键解决方案在于提出一种基于文本评分任务本质的统一视角,并在此基础上设计多种互补的对抗训练方法,通过跨攻击类型的联合训练显著提升模型鲁棒性,同时改善任务性能;特别地,该方法在强化学习人类反馈(RLHF)场景下有效缓解奖励黑客(reward hacking)问题,从而支持更对齐的大语言模型(LLM)训练。
链接: https://arxiv.org/abs/2602.00857
作者: Manveer Singh Tamber,Hosna Oyarhoseini,Jimmy Lin
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Research on adversarial robustness in language models is currently fragmented across applications and attacks, obscuring shared vulnerabilities. In this work, we propose unifying the study of adversarial robustness in text scoring models spanning dense retrievers, rerankers, and reward models. This motivates adapting both attacks and adversarial training methods across model roles. Unlike open-ended generation, text scoring failures are directly testable: an attack succeeds when an irrelevant or rejected text outscores a relevant or chosen one. Using this principled lens of text scoring, we demonstrate that current adversarial training formulations for language models are often short-sighted, failing to effectively generalize across attacks. To address this, we introduce multiple adversarial training methods for text scoring models and show that combining complementary training methods can yield strong robustness while also improving task effectiveness. We also highlight the practical value of our approach for RLHF, showing that our adversarially trained reward models mitigate reward hacking and support the training of better-aligned LLMs. We provide our code and models for further study.
zh
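该文将攻击成功定义为"无关/被拒文本得分超过相关/被选文本"。在这一判据下,对抗训练最直接的形式之一是成对 hinge 损失,如下示意(margin 取值与变量名均为假设,并非论文给出的具体公式):

```python
import torch

def pairwise_robust_loss(score_pos, score_neg_adv, margin: float = 1.0):
    # 攻击成功的判据是对抗负例得分超过正例得分,
    # 因此用 hinge 损失要求正例得分至少领先 margin
    return torch.clamp(margin - (score_pos - score_neg_adv), min=0).mean()

pos = torch.tensor([2.0, 1.5])         # 相关/被选中文本的得分
neg = torch.tensor([1.8, 0.2])         # 对抗扰动后无关文本的得分
print(pairwise_robust_loss(pos, neg))  # tensor(0.4000)
```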
[NLP-175] Factuality on Demand: Controlling the Factuality-Informativeness Trade-off in Text Generation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成回答时面临的事实性(factuality)与信息量(informativeness)之间的权衡问题:模型要么生成低信息量但高准确性的回答,要么生成高信息量但可能不准确的回答。针对这一挑战,论文提出Factuality-Controlled Generation(FCG)框架,其核心在于允许用户在查询中附加事实性约束条件,从而引导模型生成既符合事实要求又保持足够信息量的响应。解决方案的关键在于使用合成数据进行训练,有效提升模型在遵守事实性约束的同时维持输出信息丰富度的能力。
链接: https://arxiv.org/abs/2602.00848
作者: Ziwei Gong,Yanda Chen,Julia Hirschberg,Chen Zhao,He He,Zhou Yu,Kathleen Mckeown
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) encode knowledge with varying degrees of confidence. When responding to queries, models face an inherent trade-off: they can generate responses that are less informative but highly factual, or more informative but potentially less accurate. Different applications demand different balances between informativeness and factuality. We introduce Factuality-Controlled Generation (FCG), a framework that enables users to specify factuality constraints alongside their queries. We propose to evaluate FCG performance on two dimensions: adherence to factuality constraints and response informativeness. We propose to train models on the FCG task using synthetic data, and show that our synthetic training significantly improves models’ ability to both respect factuality requirements and maintain informativeness in their outputs.
zh
[NLP-176] Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在对齐过程中因现有奖励模型(Reward Models, RMs)存在局限性而导致性能受限的问题。具体而言,当前RMs主要以视觉为中心、输出不透明的标量分数,并严重依赖昂贵的人工标注数据,难以有效捕捉跨模态(文本、图像、视频、音频)的复杂偏好关系。其解决方案的关键在于提出Omni-RRM——首个开源的基于评分量表(rubric-grounded)的奖励模型,能够生成结构化的多维偏好判断并提供每个维度的解释性理由。该方法的核心创新是构建了Omni-Preference大规模自动化数据集:通过对比不同能力模型生成候选响应对,并利用强教师模型对偏好进行仲裁(reconcile)与过滤,同时提供模态感知的、基于评分量表的推理依据,从而完全避免人工标注;随后采用两阶段训练策略(监督微调 + GRPO强化学习),显著提升了对低对比度困难样本的判别能力,最终在视频和音频基准上达到SOTA性能,并在图像任务中实现17.7%的绝对准确率提升。
链接: https://arxiv.org/abs/2602.00846
作者: Zicheng Kong,Dehua Ma,Zhenbo Xu,Alven Yang,Yiwei Ru,Haoran Wang,Zixuan Zhou,Fuqing Bie,Liuyu Xiang,Huijia Wu,Jian Zhao,Zhaofeng He
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Tsinghua University (清华大学); TeleAI
类目: Computation and Language (cs.CL)
备注:
Abstract:Multimodal large language models (MLLMs) have shown remarkable capabilities, yet their performance is often capped by the coarse nature of existing alignment techniques. A critical bottleneck remains the lack of effective reward models (RMs): existing RMs are predominantly vision-centric, return opaque scalar scores, and rely on costly human annotations. We introduce Omni-RRM, the first open-source rubric-grounded reward model that produces structured, multi-dimension preference judgments with dimension-wise justifications across text, image, video, and audio. At the core of our approach is Omni-Preference, a large-scale dataset built via a fully automated pipeline: we synthesize candidate response pairs by contrasting models of different capabilities, and use strong teacher models to reconcile and filter preferences while providing a modality-aware rubric-grounded rationale for each pair. This eliminates the need for human-labeled training preferences. Omni-RRM is trained in two stages: supervised fine-tuning to learn the rubric-grounded outputs, followed by reinforcement learning (GRPO) to sharpen discrimination on difficult, low-contrast pairs. Comprehensive evaluations show that Omni-RRM achieves state-of-the-art accuracy on video (80.2% on ShareGPT-V) and audio (66.8% on Audio-HH-RLHF) benchmarks, and substantially outperforms existing open-source RMs on image tasks, with a 17.7% absolute gain over its base model on overall accuracy. Omni-RRM also improves downstream performance via Best-of-N selection and transfers to text-only preference benchmarks. Our data, code, and models are available at this https URL.
zh
[NLP-177] SpeechLess: Micro-utterance with Personalized Spatial Memory-aware Assistant in Everyday Augmented Reality
【速读】: 该论文旨在解决用户在公共场合使用可穿戴增强现实(AR)助手时因语音交互带来的社交尴尬,以及重复表达相同请求所导致的额外认知与行为负担问题。解决方案的关键在于提出了一种基于个性化空间记忆的语音意图粒度控制范式(SpeechLess),通过将历史交互与多模态个人上下文(如时间、活动、指代对象等)绑定形成空间记忆,并利用这些记忆从信息不完整的用户查询中推断缺失的意图维度,从而支持用户动态调整表达显式程度,实现从完整语音到微语音甚至零语音的灵活交互模式,有效降低语音使用频率并提升社交接受度,同时保持较高的意图识别准确性和可用性。
链接: https://arxiv.org/abs/2602.00793
作者: Yoonsang Kim,Devshree Jadeja,Divyansh Pradhan,Yalong Yang,Arie Kaufman
机构: Stony Brook University (石溪大学); Georgia Institute of Technology (佐治亚理工学院)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Information Retrieval (cs.IR)
备注: 11 pages, 9 figures. This is the author’s version of the article that will appear at the IEEE Conference on Virtual Reality and 3D User Interfaces (IEEE VR) 2026
Abstract:Speaking aloud to a wearable AR assistant in public can be socially awkward, and re-articulating the same requests every day creates unnecessary effort. We present SpeechLess, a wearable AR assistant that introduces a speech-based intent granularity control paradigm grounded in personalized spatial memory. SpeechLess helps users “speak less,” while still obtaining the information they need, and supports gradual explicitation of intent when more complex expression is required. SpeechLess binds prior interactions to multimodal personal context (space, time, activity, and referents) to form spatial memories, and leverages them to extrapolate missing intent dimensions from under-specified user queries. This enables users to dynamically adjust how explicitly they express their informational needs, from full-utterance to micro/zero-utterance interaction. We motivate our design through a week-long formative study using a commercial smart glasses platform, revealing discomfort with public voice use, frustration with repetitive speech, and hardware constraints. Building on these insights, we design SpeechLess, and evaluate it through controlled lab and in-the-wild studies. Our results indicate that regulated speech-based interaction can improve everyday information access, reduce articulation effort, and support socially acceptable use without substantially degrading perceived usability or intent resolution accuracy across diverse everyday environments.
zh
[NLP-178] HyLRA: Hybrid Layer Reuse Attention for Efficient Long-Context Inference
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长文本推理过程中因注意力机制的二次计算复杂度(quadratic computation complexity)和键值缓存(Key-Value, KV caches)占用大量内存所导致的效率瓶颈问题。解决方案的关键在于提出一种名为HyLRA(Hybrid Layer Reuse Attention)的新框架,其核心思想是基于分层稀疏性分析:通过实证发现注意力机制具有双重特性——层内敏感性(intra-layer sensitivity),即某些层必须保持全注意力以避免特征失真;以及层间相似性(inter-layer similarity),即相邻层存在大量关键token共享。HyLRA利用离线动态规划方法生成最优分层策略,在敏感层保留完整注意力计算以保障鲁棒性,而在容忍层则通过复用前一层Top-k索引直接跳过二次计算,从而将计算限制在最关键的token上,有效突破密集注意力的二次瓶颈。实验表明,该方法在保持精度损失小于1%的前提下,推理吞吐量提升6%–46%,显著优于现有稀疏注意力方法。
链接: https://arxiv.org/abs/2602.00777
作者: Xuan Ai,Qingqing Yang,Peng Wang,Lei Deng,Lin Zhang,Renhai Chen,Gong Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Long-context inference in Large Language Models (LLMs) is bottlenecked by the quadratic computation complexity of attention and the substantial memory footprint of Key-Value (KV) caches. While existing sparse attention mechanisms attempt to mitigate this by exploiting inherent sparsity, they often rely on rigid patterns or aggressive pruning, failing to achieve an optimal balance between efficiency and accuracy. In this paper, we introduce HyLRA (Hybrid Layer Reuse Attention), a novel framework driven by layer-wise sparsity profiling. Our empirical analysis uncovers a dual characteristic in attention mechanics: intra-layer sensitivity, where specific layers necessitate full attention to prevent feature distortion, and inter-layer similarity, where consecutive layers share substantial critical tokens. Based on these observations, HyLRA employs an offline dynamic programming approach to derive an optimal layer-wise policy. This hybrid strategy retains full attention for sensitive layers to ensure robustness, while enabling tolerant layers to bypass quadratic calculations by directly reusing top-k indices from preceding layers. This approach allows LLMs to restrict computation to the most critical tokens, effectively overcoming the quadratic bottleneck of dense attention. Extensive evaluations demonstrate that HyLRA improves inference throughput by 6%–46% while maintaining comparable performance (with <1% accuracy degradation), consistently outperforming state-of-the-art sparse attention methods. HyLRA is open source at this https URL (/r/unified-cache-management-CF80/).
zh
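HyLRA 的核心操作是:敏感层做全注意力并产出 top-k 键索引,容忍层直接复用这些索引、只在选中的键值上计算注意力。下面是该索引复用机制的单头示意(离线动态规划的分层策略未包含;张量形状与 k 值均为假设):

```python
import torch
import torch.nn.functional as F

def attend(q, k, v, idx=None, topk=64):
    """q:[T,d],k/v:[S,d]。idx 为 None 时做全注意力并返回 top-k 键索引;
    否则只在复用的索引 idx:[T,K] 上计算注意力(单头示意)。"""
    scale = k.shape[-1] ** 0.5
    if idx is None:
        w = F.softmax(q @ k.T / scale, dim=-1)               # [T,S]
        new_idx = w.topk(min(topk, k.shape[0]), dim=-1).indices
        return w @ v, new_idx
    k_sel, v_sel = k[idx], v[idx]                            # [T,K,d]
    w = F.softmax((q.unsqueeze(1) * k_sel).sum(-1) / scale, dim=-1)
    return (w.unsqueeze(-1) * v_sel).sum(1), idx

q, k, v = torch.randn(16, 64), torch.randn(512, 64), torch.randn(512, 64)
out_full, idx = attend(q, k, v)       # 敏感层:全注意力,产出 top-k 索引
out_sparse, _ = attend(q, k, v, idx)  # 容忍层:复用索引,跳过全量二次计算
```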
[NLP-179] Reasoning as State Transition: A Representational Analysis of Reasoning Evolution in Large Language Models
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在训练过程中推理能力演化机制不清晰的问题,尤其是现有研究多依赖外部生成结果作为黑箱分析,难以揭示内部表征动态变化。其解决方案的关键在于引入一种表征视角(representational perspective),通过系统性实验追踪不同训练阶段模型内部状态的变化,发现推理任务中存在显著的持续分布偏移(distributional shift)现象,并证明后训练(post-training)能够优化这一分布迁移过程,从而提升推理性能;同时,统计分析与反事实实验进一步确认生成正确性主要由最终表征语义决定,而非推理时额外计算或参数差异,为理解推理增强机制提供了可解释的内在依据。
链接: https://arxiv.org/abs/2602.00770
作者: Siyuan Zhang,Jialian Li,Yichi Zhang,Xiao Yang,Yinpeng Dong,Hang Su
机构: 未知
类目: Computation and Language (cs.CL)
备注: 30 pages, 27 figures, 8 tables
Abstract:Large Language Models have achieved remarkable performance on reasoning tasks, motivating research into how this ability evolves during training. Prior work has primarily analyzed this evolution via explicit generation outcomes, treating the reasoning process as a black box and obscuring internal changes. To address this opacity, we introduce a representational perspective to investigate the dynamics of the model’s internal states. Through comprehensive experiments across models at various training stages, we discover that post-training yields only limited improvement in static initial representation quality. Furthermore, we reveal that, distinct from non-reasoning tasks, reasoning involves a significant continuous distributional shift in representations during generation. Comparative analysis indicates that post-training empowers models to drive this transition toward a better distribution for task solving. To clarify the relationship between internal states and external outputs, statistical analysis confirms a high correlation between generation correctness and the final representations; while counterfactual experiments identify the semantics of the generated tokens, rather than additional computation during inference or intrinsic parameter differences, as the dominant driver of the transition. Collectively, we offer a novel understanding of the reasoning process and the effect of training on reasoning enhancement, providing valuable insights for future model analysis and optimization.
zh
[NLP-180] Eliciting Trustworthiness Priors of Large Language Models via Economic Games
【速读】: 该论文旨在解决如何量化和刻画人工智能(AI)系统自身所表现出的信任水平这一关键问题,以构建以人为中心且可信赖的AI系统。其核心挑战在于,传统方法难以客观评估AI在决策中对其他主体的信任程度,而仅依赖主观报告或行为数据往往存在偏差。解决方案的关键在于提出一种基于迭代上下文学习(iterated in-context learning)的新颖诱发方法,并将其应用于行为博弈论中的信任游戏(Trust Game),从而从大型语言模型(LLMs)中直接提取出可信度先验(trustworthiness priors)。实验表明,GPT-4.1 的可信度先验与人类行为高度一致,且其对不同角色特征的响应可被基于温暖感和能力感知的刻板印象模型有效预测,这为理解AI系统如何模拟人类信任机制提供了可计算、可解释的基础。
链接: https://arxiv.org/abs/2602.00769
作者: Siyu Yan,Lusha Zhu,Jian-Qiao Zhu
机构: The University of Hong Kong (香港大学); Peking University (北京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:One critical aspect of building human-centered, trustworthy artificial intelligence (AI) systems is maintaining calibrated trust: appropriate reliance on AI systems outperforms both overtrust (e.g., automation bias) and undertrust (e.g., disuse). A fundamental challenge, however, is how to characterize the level of trust exhibited by an AI system itself. Here, we propose a novel elicitation method based on iterated in-context learning (Zhu and Griffiths, 2024a) and apply it to elicit trustworthiness priors using the Trust Game from behavioral game theory. The Trust Game is particularly well suited for this purpose because it operationalizes trust as voluntary exposure to risk based on beliefs about another agent, rather than self-reported attitudes. Using our method, we elicit trustworthiness priors from several leading large language models (LLMs) and find that GPT-4.1’s trustworthiness priors closely track those observed in humans. Building on this result, we further examine how GPT-4.1 responds to different player personas in the Trust Game, providing an initial characterization of how such models differentiate trust across agent characteristics. Finally, we show that variation in elicited trustworthiness can be well predicted by a stereotype-based model grounded in perceived warmth and competence.
zh
[NLP-181] WordCraft: Scaffolding the Keyword Method for L2 Vocabulary Learning with Multimodal LLM s
【速读】: 该论文旨在解决L1中文母语者在学习英语词汇时应用关键词法(Keyword Method)所面临的挑战,包括难以生成音近词、构建连贯联想以及形成生动心理意象等问题。现有方法如全自动关键词生成或结果导向的助记工具,往往牺牲学习者的参与度或缺乏过程性指导。解决方案的关键在于提出一个以学习者为中心的交互式工具WordCraft,该工具基于多模态大语言模型(Multimodal Large Language Models, MLLMs),通过分步引导学习者完成关键词选择、关联构建和图像生成三个核心环节,从而提升词汇记忆的效果与可用性。
链接: https://arxiv.org/abs/2602.00762
作者: Yuheng Shao,Junjie Xiong,Chaoran Wu,Xiyuan Wang,Ziyu Zhou,Yang Ouyang,Qinyi Tao,Quan Li
机构: ShanghaiTech University (上海科技大学)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI’ 26), April 13–17, 2026, Barcelona, Spain
Abstract:Applying the keyword method for vocabulary memorization remains a significant challenge for L1 Chinese-L2 English learners. They frequently struggle to generate phonologically appropriate keywords, construct coherent associations, and create vivid mental imagery to aid long-term retention. Existing approaches, including fully automated keyword generation and outcome-oriented mnemonic aids, either compromise learner engagement or lack adequate process-oriented guidance. To address these limitations, we conducted a formative study with L1 Chinese-L2 English learners and educators (N=18), which revealed key difficulties and requirements in applying the keyword method to vocabulary learning. Building on these insights, we introduce WordCraft, a learner-centered interactive tool powered by Multimodal Large Language Models (MLLMs). WordCraft scaffolds the keyword method by guiding learners through keyword selection, association construction, and image formation, thereby enhancing the effectiveness of vocabulary memorization. Two user studies demonstrate that WordCraft not only preserves the generation effect but also achieves high levels of effectiveness and usability.
zh
[NLP-182] APR: Penalizing Structural Redundancy in Large Reasoning Models via Anchor-based Process Rewards
【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRM)在测试时缩放(Test-Time Scaling, TTS)过程中出现的“过度思考”(Overthinking)问题,即模型在得出最终答案后仍进行无意义的重复自我验证,导致计算资源浪费。解决方案的关键在于提出一种基于结构感知的奖励塑造方法——锚点过程奖励(Anchor-based Process Reward, APR),该方法首先识别推理过程中答案首次稳定的“推理锚点”(Reasoning Anchor),并精准惩罚锚点之后的“答案稳定尾部”(Answer-Stable Tail, AST),从而在不牺牲性能的前提下显著提升推理效率,实现性能与计算资源消耗之间的帕累托最优。
链接: https://arxiv.org/abs/2602.00760
作者: Kaiyan Chang,Chenwei Zhu,Yingfeng Luo,Yifu Huo,Chenglong Wang,Xiaoqian Liu,Qiaozhi He,Tong Xiao,Zhengtao Yu,Jingbo Zhu
机构: 未知
类目: Computation and Language (cs.CL)
备注: Under Review
Abstract:Test-Time Scaling (TTS) has significantly enhanced the capabilities of Large Reasoning Models (LRMs) but introduces a critical side-effect known as Overthinking. We conduct a preliminary study to rethink this phenomenon from a fine-grained perspective. We observe that LRMs frequently conduct repetitive self-verification without revision even after obtaining the final answer during the reasoning process. We formally define this specific position where the answer first stabilizes as the Reasoning Anchor. By analyzing pre- and post-anchor reasoning behaviors, we uncover the structural redundancy fixed in LRMs: the meaningless repetitive verification after deriving the first complete answer, which we term the Answer-Stable Tail (AST). Motivated by this observation, we propose Anchor-based Process Reward (APR), a structure-aware reward shaping method that localizes the reasoning anchor and penalizes exclusively the post-anchor AST. Leveraging the policy optimization algorithm suitable for length penalties, our APR models achieved the performance-efficiency Pareto frontier at 1.5B and 7B scales averaged across five mathematical reasoning datasets while requiring significantly fewer computational resources for RL training.
zh
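APR 的两个关键步骤是定位"推理锚点"并惩罚其后的答案稳定尾部(AST)。假设已能从每个推理步骤后抽取出候选答案,可用如下示意实现(惩罚系数 alpha 与奖励形式均为假设):

```python
def find_anchor(step_answers):
    """返回答案首次稳定(此后不再变化)的步骤下标,即 Reasoning Anchor。"""
    final, anchor = step_answers[-1], len(step_answers) - 1
    for i in range(len(step_answers) - 1, -1, -1):
        if step_answers[i] != final:
            break
        anchor = i
    return anchor

def apr_reward(step_answers, correct, alpha=0.01):
    # 结果奖励减去锚点之后"答案稳定尾部"(AST)长度的惩罚
    tail = len(step_answers) - 1 - find_anchor(step_answers)
    return (1.0 if correct else 0.0) - alpha * tail

print(apr_reward(["?", "42", "42", "42"], correct=True))  # 0.98
```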
[NLP-183] Adaptive Ability Decomposing for Unlocking Large Reasoning Model Effective Reinforcement Learning
【速读】: 该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)在训练大语言模型(Large Language Models, LLMs)时因信息有限导致的盲目探索问题,从而难以有效应对复杂任务。解决方案的关键在于提出一种自适应能力分解方法(Adaptive Ability Decomposing, A²D),其核心机制是:首先通过RLVR无蒸馏训练一个分解器(decomposer),使其能够将复杂问题自动拆解为一系列简单子问题;随后利用该分解器对训练数据进行子问题标注,并在此基础上以子问题引导的方式训练推理器(reasoner),从而增强模型在探索与利用之间的平衡能力。此方法无需依赖教师模型,且具有良好的模块化特性,可无缝集成至多种RLVR算法中。
链接: https://arxiv.org/abs/2602.00759
作者: Zhipeng Chen,Xiaobo Qin,Wayne Xin Zhao,Youbin Wu,Ji-Rong Wen
机构: 中国人民大学(Renmin University of China); 字节跳动(ByteDance)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 21 pages, Work in progress
Abstract:Reinforcement learning with verifiable rewards (RLVR) has shown great potential to enhance the reasoning ability of large language models (LLMs). However, due to the limited amount of information provided during the RLVR process, the model can only engage in largely blind exploration, which often results in failure on challenging problems. To provide additional information for the RLVR process without relying on a teacher model, we propose A²D, an Adaptive Ability Decomposing method for enhancing the effectiveness of RLVR. Specifically, we first train a decomposer via RLVR without distillation, enabling it to decompose complex questions into a set of simpler sub-questions. Next, we use this decomposer to annotate sub-questions for each question in the training dataset, and then train the reasoner under RLVR with sub-question guidance. To better understand A²D, we first compare its performance with competitive baselines, showing its effectiveness. Next, we observe that our method functions as a plug-and-play module that can be applied to different RLVR algorithms. Furthermore, we conduct an analysis of the decomposer, revealing how the RLVR process affects its performance and behavior, and which type of guidance is better suited for enhancing the reasoner’s exploration and exploitation abilities.
zh
[NLP-184] mporal Leakage in Search-Engine Date-Filtered Web Retrieval: A Case Study from Retrospective Forecasting
【速读】: 该论文旨在解决回顾性评估中因使用时间过滤器(如Google Search的before:过滤器)进行预截止检索时存在的数据泄露问题,这种做法可能导致模型在评估中获得虚假的高预测准确率。研究表明,71%的问题会返回包含强后截止泄露内容的网页,其中41%甚至直接揭示答案;使用大语言模型(LLM)基于这些泄露文档进行预测时,Brier分数从0.242(无泄露)显著降低至0.108,说明存在严重的性能虚高现象。解决方案的关键在于采用更强的检索保障措施,例如对网络快照进行时间冻结和时间戳标记,以确保评估环境的时间一致性,从而实现可信的回顾性预测评估。
链接: https://arxiv.org/abs/2602.00758
作者: Ali El Lahib,Ying-Jieh Xia,Zehan Li,Yuxuan Wang,Xinyu Pi
机构: University of California, San Diego (加州大学圣地亚哥分校); University of Chicago (芝加哥大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 9 pages, 6 figures
Abstract:Search-engine date filters are widely used to enforce pre-cutoff retrieval in retrospective evaluations of search-augmented forecasters. We show this approach is unreliable: auditing Google Search with a before: filter, 71% of questions return at least one page containing strong post-cutoff leakage, and for 41%, at least one page directly reveals the answer. Using a large language model (LLM), gpt-oss-120b, to forecast with these leaky documents, we demonstrate an inflated prediction accuracy (Brier score 0.108 vs. 0.242 with leak-free documents). We characterize common leakage mechanisms, including updated articles, related-content modules, unreliable metadata/timestamps, and absence-based signals, and argue that date-restricted search is insufficient for temporal evaluation. We recommend stronger retrieval safeguards or evaluation on frozen, time-stamped web snapshots to ensure credible retrospective forecasting.
zh
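摘要以 Brier 分数衡量预测质量:它是预测概率与实际结果(0/1)之差的平方的均值,越低越好。计算方式如下:

```python
def brier_score(probs, outcomes):
    # 预测概率与实际结果(0或1)之差的平方的均值
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# 泄露文档会让模型预测显得既自信又正确,从而人为压低Brier分数
print(brier_score([0.9, 0.8, 0.95], [1, 1, 1]))  # ≈ 0.018
```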
[NLP-185] Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training
【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)预训练中数据混合比例优化的难题,即如何在保证模型通用能力的同时提升其在数学和编程等高难度任务上的表现。现有方法要么依赖不可靠的小规模代理实验,要么需要代价高昂的大规模探索。其解决方案的关键在于提出了一种名为DeMix(Decouple Searching from Training Mix)的新框架,通过模型融合(model merging)技术将搜索过程与训练成本解耦:先在大规模上训练各候选数据集对应的组件模型,再通过加权融合生成数据混合代理,从而无需为每个采样混合比例重新训练模型,即可高效评估无限多的混合方案,显著降低搜索成本并提升最优混合比例的发现精度。
链接: https://arxiv.org/abs/2602.00747
作者: Shengrui Li,Fei Zhao,Kaiyan Zhao,Jieying Ye,Haifeng Liu,Fangcheng Shi,Zheyong Xie,Yao Hu,Shaosheng Cao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 5 figures
Abstract:Determining an effective data mixture is a key factor in Large Language Model (LLM) pre-training, where models must balance general competence with proficiency on hard tasks such as math and code. However, identifying an optimal mixture remains an open challenge, as existing approaches either rely on unreliable tiny-scale proxy experiments or require prohibitively expensive large-scale exploration. To address this, we propose Decouple Searching from Training Mix (DeMix), a novel framework that leverages model merging to predict optimal data ratios. Instead of training proxy models for every sampled mixture, DeMix trains component models on candidate datasets at scale and derives data mixture proxies via weighted model merging. This paradigm decouples search from training costs, enabling evaluation of unlimited sampled mixtures without extra training burden and thus facilitating better mixture discovery through more search trials. Extensive experiments demonstrate that DeMix breaks the trade-off between sufficiency, accuracy and efficiency, obtaining the optimal mixture with higher benchmark performance at lower search cost. Additionally, we release the DeMix Corpora, a comprehensive 22T-token dataset comprising high-quality pre-training data with validated mixtures to facilitate open research. Our code and DeMix Corpora is available at this https URL.
zh
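DeMix 的核心操作是把各候选数据集上训练好的组件模型按混合比例加权融合,得到该比例对应的代理模型,从而免去为每个采样比例重新训练。其最简形式就是参数的加权平均,示意如下(融合细节以摘要描述为准,此处为极简实现):

```python
import torch

def merge_models(state_dicts, weights):
    """按混合比例 weights 对组件模型参数做加权平均(示意)。"""
    assert abs(sum(weights) - 1.0) < 1e-6
    return {k: sum(w * sd[k] for w, sd in zip(weights, state_dicts))
            for k in state_dicts[0]}

sd_a, sd_b = {"w": torch.ones(2)}, {"w": torch.zeros(2)}
print(merge_models([sd_a, sd_b], [0.3, 0.7]))  # {'w': tensor([0.3000, 0.3000])}
```

随后只需在基准上评估融合得到的代理模型,即可低成本地比较任意数量的采样混合比例。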
[NLP-186] CURP: Codebook-based Continuous User Representation for Personalized Generation with LLM s
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的用户建模方法在个性化质量与计算效率、数据效率之间难以平衡的问题。现有方法,无论是提示工程驱动还是训练驱动的方案,往往在实现高质量个性化的同时面临参数冗余或数据消耗过大的挑战。其解决方案的关键在于提出一种名为CURP的新框架,该框架通过双向用户编码器(bidirectional user encoder)和离散原型码本(discrete prototype codebook)联合提取多维用户特征,从而以极少量可训练参数(约2000万,占总模型规模的0.2%)实现即插即用式的个性化生成,显著提升了性能、泛化能力、可解释性与扩展性。
链接: https://arxiv.org/abs/2602.00742
作者: Liang Wang,Xinyi Mou,Xiaoyou Liu,Xuanjing Huang,Zhongyu Wei
机构: Fudan University (复旦大学); Shanghai Innovation Institute (上海创新研究院)
类目: Computation and Language (cs.CL)
备注:
Abstract:User modeling characterizes individuals through their preferences and behavioral patterns to enable personalized simulation and generation with Large Language Models (LLMs) in contemporary approaches. However, existing methods, whether prompt-based or training-based methods, face challenges in balancing personalization quality against computational and data efficiency. We propose a novel framework CURP, which employs a bidirectional user encoder and a discrete prototype codebook to extract multi-dimensional user traits. This design enables plug-and-play personalization with a small number of trainable parameters (about 20M parameters, about 0.2% of the total model size). Through extensive experiments on various generation tasks, we show that CURP achieves superior performance and generalization compared to strong baselines, while offering better interpretability and scalability. The code is available at this https URL
zh
[NLP-187] ExperienceWeaver: Optimizing Small-sample Experience Learning for LLM -based Clinical Text Improvement
【速读】: 该论文旨在解决临床文本优化(Clinical Text Improvement)在小样本场景下的性能瓶颈问题,其核心挑战在于高质量标注数据稀缺以及医疗文书对逻辑严谨性和专业性的高要求。现有方法如监督微调(Supervised Fine-Tuning)依赖大量标注数据且成本高昂,而检索增强生成(Retrieval-Augmented Generation)通常仅提供表层修正,缺乏对修改背后推理过程的建模。解决方案的关键在于提出ExperienceWeaver框架,该框架通过从噪声多维反馈中提炼出结构化经验——包括针对特定错误的“Tips”和高层策略“Strategies”,并将其注入代理式(Agentic)处理流程,使模型学会“如何修订”而非仅“修什么”,从而在低资源条件下显著提升临床文本修正质量。
链接: https://arxiv.org/abs/2602.00740
作者: Ziyan Xiao,Yinghao Zhu,Liang Peng,Lequan Yu
机构: The University of Hong Kong(香港大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Clinical text improvement is vital for healthcare efficiency but remains difficult due to limited high-quality data and the complex constraints of medical documentation. While Large Language Models (LLMs) show promise, current approaches struggle in small-sample settings: supervised fine-tuning is data-intensive and costly, while retrieval-augmented generation often provides superficial corrections without capturing the reasoning behind revisions. To address these limitations, we propose ExperienceWeaver, a hierarchical framework that shifts the focus from data retrieval to experience learning. Instead of simply recalling past examples, ExperienceWeaver distills noisy, multi-dimensional feedback into structured, actionable knowledge: error-specific Tips and high-level Strategies. By injecting this distilled experience into an agentic pipeline, the model learns “how to revise” rather than just “what to revise”. Extensive evaluations across four clinical datasets demonstrate that ExperienceWeaver consistently improves performance, surpassing state-of-the-art models such as Gemini-3 Pro in small-sample settings.
zh
[NLP-188] EchoReview: Learning Peer Review from the Echoes of Scientific Citations
【速读】: 该论文旨在解决传统同行评审(peer review)系统在科学投稿量快速增长背景下所面临的可扩展性瓶颈问题,以及现有基于人工审稿数据的监督微调方法因单一数据源和人类审稿主观性与不一致性而导致的自动化审稿质量受限的问题。其解决方案的关键在于提出了一种基于引文上下文驱动的数据合成框架 EchoReview,通过系统挖掘学术引用中隐含的集体评价信号,将科学共同体长期形成的判断转化为结构化的审稿风格数据;在此基础上构建了首个大规模、跨会议、跨年度的引文驱动审稿数据集 EchoReview-16K,并训练出自动化审稿模型 EchoReviewer-7B,实验证明其在证据支持和审稿全面性等核心维度上均实现显著且稳定的性能提升,验证了引文上下文作为可靠数据范式的有效性。
链接: https://arxiv.org/abs/2602.00733
作者: Yinuo Zhang,Dingcheng Huang,Haifeng Suo,Yizhuo Li,Ziya Zhao,Junhao Xu,Zhiying Tu,Dianhui Chu,Deming Zhai,Xianming Liu,Xiaoyan Yu,Dianbo Sui
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:As the volume of scientific submissions continues to grow rapidly, traditional peer review systems are facing unprecedented scalability pressures, highlighting the urgent need for automated reviewing methods that are both scalable and reliable. Existing supervised fine-tuning approaches based on real review data are fundamentally constrained by a single source of data as well as the inherent subjectivity and inconsistency of human reviews, limiting their ability to support high-quality automated reviewers. To address these issues, we propose EchoReview, a citation-context-driven data synthesis framework that systematically mines implicit collective evaluative signals from academic citations and transforms the scientific community’s long-term judgments into structured review-style data. Based on this pipeline, we construct EchoReview-16K, the first large-scale, cross-conference, and cross-year citation-driven review dataset, and train an automated reviewer, EchoReviewer-7B. Experimental results demonstrate that EchoReviewer-7B achieves significant and stable improvements on core review dimensions such as evidence support and review comprehensiveness, validating citation context as a robust and effective data paradigm for reliable automated peer review.
zh
[NLP-189] From Prompt to Graph: Comparing LLM -Based Information Extraction Strategies in Domain-Specific Ontology Development
【速读】: 该论文旨在解决传统本体(Ontology)构建过程中依赖人工标注和常规自然语言处理(Natural Language Processing, NLP)技术所导致的劳动密集、成本高昂的问题,尤其是在铸造制造等专业领域中尤为突出。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)自动化提取领域文本中的术语与关系,通过对比预训练LLM驱动方法、上下文学习(In-Context Learning, ICL)方法以及微调(Fine-Tuning)方法在有限数据下的性能表现,选取最优策略构建铸造领域本体,并由领域专家进行验证,从而实现高效、可扩展的知识结构化过程。
链接: https://arxiv.org/abs/2602.00699
作者: Xuan Liu,Ziyu Li,Mu He,Ziyang Ma,Xiaoxu Wu,Gizem Yilmaz,Yiyuan Xia,Bingbing Li,He Tan,Jerry Ying Hsi Fuh,Wen Feng Lu,Anders E.W. Jarfors,Per Jansson
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 11 pages, 8 figures, 3 tables, presented at International Conference on Industry of the Future and Smart Manufacturing, 2025
Abstract:Ontologies are essential for structuring domain knowledge, improving accessibility, sharing, and reuse. However, traditional ontology construction relies on manual annotation and conventional natural language processing (NLP) techniques, making the process labour-intensive and costly, especially in specialised fields like casting manufacturing. The rise of Large Language Models (LLMs) offers new possibilities for automating knowledge extraction. This study investigates three LLM-based approaches, namely a pre-trained-LLM-driven method, an in-context learning (ICL) method, and a fine-tuning method, to extract terms and relations from domain-specific texts using limited data. We compare their performance and use the best-performing method to build a casting ontology that is validated by a domain expert.
zh
[NLP-190] Can Small Language Models Handle Context-Summarized Multi-Turn Customer-Service QA? A Synthetic Data-Driven Comparative Evaluation
【速读】: 该论文旨在解决资源受限环境下客户服务质量问答(Customer-service Question Answering, QA)系统中,小型语言模型(Small Language Models, SLMs)在多轮对话中保持上下文连续性和语义一致性能力不足的问题。其解决方案的关键在于采用**历史摘要策略(history summarization strategy)**对对话历史进行压缩和结构化表示,从而在降低计算开销的同时保留关键的对话状态信息,并结合基于对话阶段的定性分析方法评估模型在不同交互阶段的行为表现,以系统性地验证SLMs在真实客户服务场景下的有效性与局限性。
链接: https://arxiv.org/abs/2602.00665
作者: Lakshan Cooray,Deshan Sumanathilaka,Pattigadapa Venkatesh Raju
机构: Informatics Institute of Technology(信息技术研究所); Swansea University(斯旺西大学); Zame AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Submission is under review with Computational Linguistics
Abstract:Customer-service question answering (QA) systems increasingly rely on conversational language understanding. While Large Language Models (LLMs) achieve strong performance, their high computational cost and deployment constraints limit practical use in resource-constrained environments. Small Language Models (SLMs) provide a more efficient alternative, yet their effectiveness for multi-turn customer-service QA remains underexplored, particularly in scenarios requiring dialogue continuity and contextual understanding. This study investigates instruction-tuned SLMs for context-summarized multi-turn customer-service QA, using a history summarization strategy to preserve essential conversational state. We also introduce a conversation stage-based qualitative analysis to evaluate model behavior across different phases of customer-service interactions. Nine instruction-tuned low-parameterized SLMs are evaluated against three commercial LLMs using lexical and semantic similarity metrics alongside qualitative assessments, including human evaluation and LLM-as-a-judge methods. Results show notable variation across SLMs, with some models demonstrating near-LLM performance, while others struggle to maintain dialogue continuity and contextual alignment. These findings highlight both the potential and current limitations of low-parameterized language models for real-world customer-service QA systems.
zh
[NLP-191] LegalOne: A Family of Foundation Models for Reliable Legal Reasoning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在法律领域直接应用时面临的两大核心挑战:一是缺乏精确的法律专业知识,二是难以执行严谨的多步司法推理。为此,作者提出LegalOne,一个专为中文法律领域设计的基础模型家族,其解决方案的关键在于三阶段协同优化的训练流程:首先,在中期训练阶段引入基于困惑度的可塑性调整采样(Plasticity-Adjusted Sampling, PAS),平衡新知识获取与原始能力保留;其次,在监督微调阶段采用法律代理思维链蒸馏(Legal Agentic CoT Distillation, LEAD),通过代理式工作流将复杂司法过程转化为结构化推理轨迹,强化事实锚定与逻辑严谨性;最后,实施课程强化学习(Curriculum Reinforcement Learning, RL)策略,分阶段推进从记忆到理解再到推理的能力演化,使模型实现自主且可靠的法律推理能力。实验表明,LegalOne在多项法律任务上达到最先进性能,优于参数量更大的通用大模型,凸显其知识密度和效率优势。
链接: https://arxiv.org/abs/2602.00642
作者: Haitao Li,Yifan Chen,Shuo Miao,Qian Dong,Jia Chen,Yiran Hu,Junjie Chen,Minghao Qin,Qingyao Ai,Yiqun Liu,Cheng Luo,Quan Zhou,Ya Zhang,Jikun Hu
机构: Tsinghua University (清华大学); Quancheng Laboratory; University of Waterloo (滑铁卢大学); China University of Political Science and Law (中国政法大学); MegaTech.AI Inc
类目: Computation and Language (cs.CL)
备注: 25 pages, v1
Abstract:While Large Language Models (LLMs) have demonstrated impressive general capabilities, their direct application in the legal domain is often hindered by a lack of precise domain knowledge and complexity of performing rigorous multi-step judicial reasoning. To address this gap, we present LegalOne, a family of foundational models specifically tailored for the Chinese legal domain. LegalOne is developed through a comprehensive three-phase pipeline designed to master legal reasoning. First, during mid-training phase, we propose Plasticity-Adjusted Sampling (PAS) to address the challenge of domain adaptation. This perplexity-based scheduler strikes a balance between the acquisition of new knowledge and the retention of original capabilities, effectively establishing a robust legal foundation. Second, during supervised fine-tuning, we employ Legal Agentic CoT Distillation (LEAD) to distill explicit reasoning from raw legal texts. Unlike naive distillation, LEAD utilizes an agentic workflow to convert complex judicial processes into structured reasoning trajectories, thereby enforcing factual grounding and logical rigor. Finally, we implement a Curriculum Reinforcement Learning (RL) strategy. Through a progressive reinforcement process spanning memorization, understanding, and reasoning, LegalOne evolves from simple pattern matching to autonomous and reliable legal reasoning. Experimental results demonstrate that LegalOne achieves state-of-the-art performance across a wide range of legal tasks, surpassing general-purpose LLMs with vastly larger parameter counts through enhanced knowledge density and efficiency. We publicly release the LegalOne weights and the LegalKit evaluation framework to advance the field of Legal AI, paving the way for deploying trustworthy and interpretable foundation models in high-stakes judicial applications.
zh
[NLP-192] Formal Semantic Control over Language Models
【速读】: 该论文旨在解决自然语言处理中语言模型内部语义表示的可解释性与可控性不足的问题,特别是如何通过显式操控潜在空间(latent space)的几何结构来实现局部化、准符号化的组合控制。其解决方案的关键在于构建一个基于变分自编码器(Variational Autoencoder, VAE)的框架,并从两个互补方向推进:一是句子级学习与控制,通过解耦和操纵潜在空间中的特定语义特征来引导句子生成;二是推理级学习与控制,通过隔离并引导潜在空间中的推理行为以控制自然语言推理(Natural Language Inference, NLI),尤其聚焦于解释型NLI任务。该方法通过引入新的理论框架与实践手段,显著提升了语言模型潜在空间的可解释性和可控性。
链接: https://arxiv.org/abs/2602.00638
作者: Yingji Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:This thesis advances semantic representation learning to render language representations or models more semantically and geometrically interpretable, and to enable localised, quasi-symbolic, compositional control through deliberate shaping of their latent space geometry. We pursue this goal within a VAE framework, exploring two complementary research directions: (i) Sentence-level learning and control: disentangling and manipulating specific semantic features in the latent space to guide sentence generation, with explanatory text serving as the testbed; and (ii) Reasoning-level learning and control: isolating and steering inference behaviours in the latent space to control NLI. In this direction, we focus on Explanatory NLI tasks, in which two premises (explanations) are provided to infer a conclusion. The overarching objective is to move toward language models whose internal semantic representations can be systematically interpreted, precisely structured, and reliably directed. We introduce a set of novel theoretical frameworks and practical methodologies, together with corresponding experiments, to demonstrate that our approaches enhance both the interpretability and controllability of latent spaces for natural language across the thesis.
zh
[NLP-193] From Associations to Activations: Comparing Behavioral and Hidden-State Semantic Geometry in LLM s
【速读】: 该论文试图解决的问题是:如何通过心理语言学实验中的行为数据(如相似性判断和自由联想)来恢复大型语言模型(LLM)隐藏状态(hidden-state)的几何结构,从而揭示行为任务是否能够反映模型内部的语义表征。其解决方案的关键在于:利用代表相似性分析(representational similarity analysis, RSA),将基于行为的数据构建的相似性矩阵与模型各层隐藏状态的相似性进行比较,并在多个模型和基准方法(如FastText、BERT及跨模型共识)中验证行为数据对未见词隐藏状态相似性的预测能力;结果表明,强制选择任务的行为数据比自由联想更显著地匹配隐藏状态几何结构,且能超越词汇基线和跨模型共识,说明仅从行为测量中即可提取出可恢复的内部语义信息。
链接: https://arxiv.org/abs/2602.00628
作者: Louis Schiekiera,Max Zimmer,Christophe Roux,Sebastian Pokutta,Fritz Günther
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 25 pages including references, 15 figures, 6 tables
Abstract:We investigate the extent to which an LLM’s hidden-state geometry can be recovered from its behavior in psycholinguistic experiments. Across eight instruction-tuned transformer models, we run two experimental paradigms – similarity-based forced choice and free association – over a shared 5,000-word vocabulary, collecting 17.5M+ trials to build behavior-based similarity matrices. Using representational similarity analysis, we compare behavioral geometries to layerwise hidden-state similarity and benchmark against FastText, BERT, and cross-model consensus. We find that forced-choice behavior aligns substantially more with hidden-state geometry than free association. In a held-out-words regression, behavioral similarity (especially forced choice) predicts unseen hidden-state similarities beyond lexical baselines and cross-model consensus, indicating that behavior-only measurements retain recoverable information about internal semantic geometry. Finally, we discuss implications for the ability of behavioral tasks to uncover hidden cognitive states.
zh
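论文使用代表相似性分析(RSA)比较行为几何与各层隐藏状态几何。RSA 的常见做法是取两个相似度矩阵的上三角并计算 Spearman 相关,示意如下(矩阵内容为模拟数据):

```python
import numpy as np
from scipy.stats import spearmanr

def rsa(sim_a: np.ndarray, sim_b: np.ndarray) -> float:
    """取两个相似度矩阵的上三角向量,计算Spearman相关(RSA的核心步骤)。"""
    iu = np.triu_indices_from(sim_a, k=1)
    rho, _ = spearmanr(sim_a[iu], sim_b[iu])
    return rho

n = 50
behav = np.random.rand(n, n)                 # 行为实验得到的相似度矩阵(模拟)
hidden = behav + 0.1 * np.random.rand(n, n)  # 某层隐藏状态的相似度(模拟)
print(rsa(behav, hidden))                    # 接近1,表示两种几何高度一致
```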
[NLP-194] Jailbreaking LLM s via Calibration
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在安全对齐(safety alignment)过程中导致的输出分布与预对齐数据分布之间的系统性偏差问题,这种偏差常使模型在面对对抗性攻击(如越狱攻击,jailbreaking)时表现不稳定或易被利用。解决方案的关键在于将安全对齐对下一个词预测的影响建模为预对齐分布的系统性扭曲,并将弱到强越狱攻击(Weak-to-Strong Jailbreaking)视为一个预测聚合问题,进而推导出在损失诱导的对偶空间中由梯度偏移(Gradient Shift)表征的最优聚合策略。该框架不仅涵盖了交叉熵损失下的对数算术越狱方法(logit-arithmetic jailbreaking),还扩展出适用于其他适当损失函数的更广泛聚合规则,并提出一种新的混合聚合策略,在红队测试基准和数学实用性任务上显著优于现有方法,尤其在安全强化后的gpt-oss-120b模型上表现出更高的攻击成功率和更低的“越狱税”(Jailbreak Tax)。
链接: https://arxiv.org/abs/2602.00619
作者: Yuxuan Lu,Yongkang Guo,Yuqing Kong
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:
Abstract:Safety alignment in Large Language Models (LLMs) often creates a systematic discrepancy between a model’s aligned output and the underlying pre-aligned data distribution. We propose a framework in which the effect of safety alignment on next-token prediction is modeled as a systematic distortion of a pre-alignment distribution. We cast Weak-to-Strong Jailbreaking as a forecast aggregation problem and derive an optimal aggregation strategy characterized by a Gradient Shift in the loss-induced dual space. We show that logit-arithmetic jailbreaking methods are a special case of this framework under cross-entropy loss, and derive a broader family of aggregation rules corresponding to other proper losses. We also propose a new hybrid aggregation rule. Evaluations across red-teaming benchmarks and math utility tasks using frontier models demonstrate that our approach achieves superior Attack Success Rates and lower “Jailbreak Tax” compared with existing methods, especially on the safety-hardened gpt-oss-120b.
zh
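摘要指出,交叉熵损失下的对数算术越狱是该聚合框架的特例。这一特例的常见形式是用弱模型对齐前后下一词 logits 的差去平移强模型的 logits,示意如下(alpha 与变量名均为示意,并非论文给出的最优聚合规则):

```python
import torch

def aggregate_logits(strong, weak_unsafe, weak_safe, alpha=1.0):
    # 对数算术聚合:以弱模型对齐前后的logits差为方向,平移强模型的分布
    return strong + alpha * (weak_unsafe - weak_safe)

vocab = 32000
next_token_probs = torch.softmax(
    aggregate_logits(torch.randn(vocab), torch.randn(vocab), torch.randn(vocab)),
    dim=-1,
)
```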
[NLP-195] ransformer-Based Model for Multilingual Hope Speech Detection
【速读】: 该论文旨在解决希望言论(hope speech)检测问题,即在社交媒体等文本中自动识别出传递积极、鼓励性信息的语句。其解决方案的关键在于利用预训练语言模型进行多语言环境下的分类任务:针对英语采用RoBERTa模型,而针对英德双语场景则使用多语言模型XLM-RoBERTa。实验结果显示,RoBERTa在英文数据上达到加权F1分数0.818和准确率81.8%,XLM-RoBERTa在英德混合任务中取得加权F1分数0.786和准确率78.5%,表明优化后的预训练大语言模型能够显著提升希望言论检测这类自然语言处理任务的性能。
链接: https://arxiv.org/abs/2602.00613
作者: Nsrin Ashraf,Mariam Labib,Hamada Nayel
机构: Benha University (本哈大学); Elsewedy University of Technology (埃尔塞维迪科技大学); Mansoura University (曼苏拉大学); Prince Sattam Bin Abdulaziz University (萨塔姆·本·阿卜杜勒阿齐兹王子大学)
类目: Computation and Language (cs.CL)
备注: 5 pages, 1 figure, PolyHope-M shared task at RANLP2025
Abstract:This paper describes a system that has been submitted to the “PolyHope-M” shared task at RANLP2025. In this work, various transformer models have been implemented and evaluated for hope speech detection in English and German. RoBERTa has been implemented for English, while the multilingual model XLM-RoBERTa has been implemented for both English and German. The proposed system using RoBERTa reported a weighted f1-score of 0.818 and an accuracy of 81.8% for English. On the other hand, XLM-RoBERTa achieved a weighted f1-score of 0.786 and an accuracy of 78.5%. These results reflect the importance of improving pre-trained large language models and how these models enhance the performance of different natural language processing tasks.
zh
[NLP-196] Lookahead-then-Verify: Reliable Constrained Decoding for Diffusion LLM s under Context-Free Grammars
【速读】: 该论文旨在解决扩散大语言模型(Diffusion Large Language Models, dLLMs)在生成符合上下文无关文法(context-free grammars)的正式语言(如源代码和化学表达式)时,因概率性质导致语法正确性难以保证的问题。现有约束解码方法难以直接应用于dLLMs,因其非自回归特性;而现有专为dLLMs设计的方法又可能允许无法扩展为有效句子的中间输出,从而降低可靠性。解决方案的关键在于提出LAVE方法,其利用dLLMs在每次前向传播中可并行预测所有位置token分布的能力,通过前瞻验证机制对每个新提议的token进行有效性检查,从而确保中间结果始终具备扩展为合法句子的可能性,实现高可靠性的语法约束。
链接: https://arxiv.org/abs/2602.00612
作者: Yitong Zhang,Yongmin Li,Yuetong Liu,Jia Li,Xiaoran Jia,Zherui Li,Ge Li
机构: College of AI, Tsinghua University (清华大学人工智能学院); School of Computer Science, Peking University (北京大学计算机学院); School of Computer Science and Engineering, Beihang University (北京航空航天大学计算机科学与工程学院); School of Computer Science and Technology, Beijing Institute of Technology (北京理工大学计算机学院); School of Computing, National University of Singapore (新加坡国立大学计算机学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Diffusion Large Language Models (dLLMs) have demonstrated promising generative capabilities and are increasingly used to produce formal languages defined by context-free grammars, such as source code and chemical expressions. However, as probabilistic models, they still struggle to generate syntactically valid outputs reliably. A natural and promising direction to address this issue is to adapt constrained decoding techniques to enforce grammatical correctness during generation. However, applying these techniques faces two primary obstacles. On the one hand, the non-autoregressive nature of dLLMs renders most existing constrained decoding approaches inapplicable. On the other hand, current approaches specifically designed for dLLMs may allow intermediate outputs that are impossible to complete into valid sentences, which significantly limits their reliability in practice. To address these challenges, we present LAVE, a constrained decoding approach specifically designed for dLLMs. Our approach leverages a key property of dLLMs, namely their ability to predict token distributions for all positions in parallel during each forward pass. Whenever a new token is proposed by the model, LAVE performs lookahead using these distributions to efficiently and reliably verify the validity of the proposed token. This design ensures reliable constraints by preserving the potential for intermediate outputs to be extended into valid sentences. Extensive experiments across four widely used dLLMs and three representative benchmarks demonstrate that LAVE consistently outperforms existing baselines and achieves substantial improvements in syntactic correctness, while incurring negligible runtime overhead.
zh
[NLP-197] Hermes the Polyglot: A Unified Framework to Enhance Expressiveness for Multimodal Interlingual Subtitling WWW
【速读】: 该论文旨在解决跨语言字幕翻译(interlingual subtitling)中的关键挑战,包括语义连贯性、代词与术语的准确翻译以及译文的表现力问题。针对这些问题,作者提出了基于大语言模型(Large Language Models, LLMs)的自动化字幕翻译框架Hermes,其核心创新在于集成三个模块:说话人辨认(Speaker Diarization)、术语识别(Terminology Identification)和表现力增强(Expressiveness Enhancement),从而实现上下文一致、语义清晰且富有表现力的高质量字幕翻译。
链接: https://arxiv.org/abs/2602.00597
作者: Chaoqun Cui,Shijing Wang,Liangbin Huang,Qingqing Gu,Zhaolong Huang,Xiao Zeng,Wenji Mao
机构: MAIS, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of AI, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Beijing Jiaotong University (北京交通大学); Hujing Digital Media & Entertainment Group (虎鲸数字媒体与娱乐集团); Geely AI lab (吉利AI实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to The Web Conference (WWW) 2026
Abstract:Interlingual subtitling, which translates subtitles of visual media into a target language, is essential for entertainment localization but has not yet been explored in machine translation. Although Large Language Models (LLMs) have significantly advanced the general capabilities of machine translation, the distinctive characteristics of subtitle texts pose persistent challenges in interlingual subtitling, particularly regarding semantic coherence, pronoun and terminology translation, and translation expressiveness. To address these issues, we present Hermes, an LLM-based automated subtitling framework. Hermes integrates three modules: Speaker Diarization, Terminology Identification, and Expressiveness Enhancement, which effectively tackle the above challenges. Experiments demonstrate that Hermes achieves state-of-the-art diarization performance and generates expressive, contextually coherent translations, thereby advancing research in interlingual subtitling.
zh
[NLP-198] Kanade: A Simple Disentangled Tokenizer for Spoken Language Modeling
【速读】: 该论文旨在解决语音建模中词元化(tokenization)质量不足的问题,特别是如何有效分离语音信号中的语音特征(phonetics)与韵律信息(prosody),同时抑制与语言内容无关的非语义信息(如说话人身份)。现有方法通常依赖辅助手段来实现这种解耦,但存在复杂性高或性能受限的问题。论文提出的解决方案是Kanade——一种单层解耦语音词元化器(disentangled speech tokenizer),其关键在于通过分离声学常量(acoustic constants)生成单一的词元流,从而在无需额外辅助机制的情况下,实现高质量的语音重建、最优的说话人解耦能力以及优异的词汇可用性(lexical availability)。
链接: https://arxiv.org/abs/2602.00594
作者: Zhijie Huang,Stephen McIntosh,Daisuke Saito,Nobuaki Minematsu
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
Abstract:A good language model starts with a good tokenizer. Tokenization is especially important for speech modeling, which must handle continuous signals that mix linguistic and non-linguistic information. A speech tokenizer should extract phonetics and prosody, suppress linguistically irrelevant information like speaker identity, and enable high-quality synthesis. We present Kanade, a single-layer disentangled speech tokenizer that realizes this ideal. Kanade separates out acoustic constants to create a single stream of tokens that captures rich phonetics and prosody. It does so without the need for auxiliary methods that existing disentangled codecs often rely on. Experiments show that Kanade achieves state-of-the-art speaker disentanglement and lexical availability, while maintaining excellent reconstruction quality.
zh
[NLP-199] he French Drama Revolution: Political Economy and Literary Production 1700-1900
【速读】: 该论文旨在解决18世纪至19世纪法国戏剧主题演变与社会经济变迁之间关系的问题,特别是探讨法国大革命及工业化如何影响戏剧内容的结构性变化。其解决方案的关键在于运用潜在狄利克雷分配(Latent Dirichlet Allocation, LDA)提取戏剧文本的主题分布,并通过詹森-香农散度(Jensen-Shannon Divergence, JSD)量化主题分布的演化差异,进而将主题流行度与法国国内生产总值(GDP)的时间序列进行对比分析,揭示戏剧主题变迁与经济成长之间的共演化特征。
链接: https://arxiv.org/abs/2602.00588
作者: Thiago Dumont Oliveira
机构: University of Tartu (塔尔图大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper investigates the changing nature of French drama between 1700-1900 using Latent Dirichlet Allocation and Jensen-Shannon Divergence. Results indicate that the topical distribution of French drama changed profoundly after the French Revolution, particularly between 1789 and 1850. Bourgeois themes emerged among the most prevalent topics since the late 18th century. To assess the coevolution of drama and economic growth, I plot the yearly prevalence of topics alongside French GDP between 1700-1900, and discuss these changes in light of the political and economic changes prompted by the French Revolution and the industrialization of the country.
zh
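文中以詹森-香农散度(JSD)量化不同时期主题分布的差异。以2为底时其取值范围为[0,1],计算方式如下(主题分布为示例数据):

```python
import numpy as np

def jsd(p: np.ndarray, q: np.ndarray) -> float:
    """以2为底的詹森-香农散度:两分布到其平均分布的KL散度的均值。"""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0  # 约定 0·log(0) = 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# 比较两个时期的主题流行度分布
print(jsd(np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.2, 0.7])))
```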
[NLP-200] Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLM s
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在数学推理任务中因数据集设计局限而产生的“近饱和准确率”问题,即模型可能仅依赖模板化计算和浅层算术分解即可获得高分,从而无法真实反映其结构化推理能力。为应对这一挑战,作者提出了一种名为ReasoningMath-Plus的新基准,包含150个精心设计的问题,专门考察多约束协调、构造性逻辑合成与空间推理等高级推理技能;其关键创新在于引入HCRS(Hazard-aware Chain-based Rule Score)评分机制和基于标注推理轨迹训练的Process Reward Model(PRM),实现对推理过程的细粒度评估,实证表明答案准确性会显著高估模型的真实推理稳健性。
链接: https://arxiv.org/abs/2602.00564
作者: Xiang Zheng,Weiqi Zhai,Wei Wang,Boyu Yang,Wenbo Li,Ruixiang Luo,Haoxiang Sun,Yucheng Wang,Zhengze Li,Meng Wang,Yuetian Du,Guojie Lin,Yaxuan Wang,Xiaoxiao Xu,Yanhu Mo,Xuan Ren,Hu Wei,Ze Xu
机构: Alibaba Group (阿里巴巴集团); Shanghai Jiao Tong University (上海交通大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 8 pages, and 3 figures
Abstract:Recent large language models (LLMs) achieve near-saturation accuracy on many established mathematical reasoning benchmarks, raising concerns about their ability to diagnose genuine reasoning competence. This saturation largely stems from the dominance of template-based computation and shallow arithmetic decomposition in existing datasets, which underrepresent reasoning skills such as multi-constraint coordination, constructive logical synthesis, and spatial inference. To address this gap, we introduce ReasoningMath-Plus, a benchmark of 150 carefully curated problems explicitly designed to evaluate structural reasoning. Each problem emphasizes reasoning under interacting constraints, constructive solution formation, or non-trivial structural insight, and is annotated with a minimal reasoning skeleton to support fine-grained process-level evaluation. Alongside the dataset, we introduce HCRS (Hazard-aware Chain-based Rule Score), a deterministic step-level scoring function, and train a Process Reward Model (PRM) on the annotated reasoning traces. Empirically, while leading models attain relatively high final-answer accuracy (up to 5.8/10), HCRS-based holistic evaluation yields substantially lower scores (average 4.36/10, best 5.14/10), showing that answer-only metrics can overestimate reasoning robustness.
zh
[NLP-201] A Hierarchical and Attentional Analysis of Argument Structure Constructions in BERT Using Naturalistic Corpora
【Quick Read】: This paper asks how the internal representations of a pretrained language model (BERT) are organized and evolve when processing different argument structure constructions. The key to its approach is a multi-dimensional analytical framework combining multidimensional scaling (MDS) and t-SNE for dimensionality reduction, the Generalized Discrimination Value (GDV) as a cluster-separation metric, the Fisher Discriminant Ratio (FDR) for linear diagnostic probing, and attention-mechanism analysis. The results reveal a hierarchical representational process: construction-specific information emerges in early layers, forms maximally separable clusters in middle layers, and is maintained through later processing stages.
Link: https://arxiv.org/abs/2602.00554
Authors: Liu Kaipeng,Wu Ling
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 14 pages, 5 figures
Abstract:This study investigates how the Bidirectional Encoder Representations from Transformers model processes four fundamental Argument Structure Constructions. We employ a multi-dimensional analytical framework, which integrates MDS and t-SNE for dimensionality reduction, the Generalized Discrimination Value (GDV) as a cluster-separation metric, the Fisher Discriminant Ratio (FDR) for linear diagnostic probing, and attention mechanism analysis. Our results reveal a hierarchical representational structure. Construction-specific information emerges in early layers, forms maximally separable clusters in middle layers, and is maintained through later processing stages.
zh
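As a concrete illustration of the linear diagnostic probing above, here is a minimal sketch of a two-class Fisher Discriminant Ratio applied to layer embeddings; the embeddings are random stand-ins, and the per-dimension two-class form (rather than the paper's exact multiclass aggregation over four constructions) is an assumption.

```python
import numpy as np

def fisher_discriminant_ratio(X_a, X_b):
    """Per-dimension FDR between two sets of layer embeddings:
    (mean_a - mean_b)^2 / (var_a + var_b). Higher = more linearly separable."""
    mu_a, mu_b = X_a.mean(0), X_b.mean(0)
    var_a, var_b = X_a.var(0), X_b.var(0)
    return (mu_a - mu_b) ** 2 / (var_a + var_b + 1e-12)

rng = np.random.default_rng(0)
# Stand-ins for BERT sentence embeddings of two construction types at one layer.
construction_a = rng.normal(0.0, 1.0, size=(200, 768))
construction_b = rng.normal(0.3, 1.0, size=(200, 768))
fdr = fisher_discriminant_ratio(construction_a, construction_b)
print(f"mean FDR: {fdr.mean():.3f}, max FDR: {fdr.max():.3f}")
```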
[NLP-202] Reasoning by Commented Code for Table Question Answering
【Quick Read】: This paper tackles table question answering (TableQA), where conventional linearization of tables breaks the two-dimensional relationships of structured data, leaving large language models (LLMs) with poor numerical accuracy and limited interpretability. The key to the solution is a commented, step-by-step code-generation framework that decomposes table reasoning into multi-line executable Python programs annotated with concise natural-language comments, making the reasoning steps explicit and improving both the accuracy and readability of the generated code. The method clearly outperforms existing baselines on the WikiTableQuestions benchmark, and combined with a lightweight answer-selection mechanism reaches a final accuracy of 84.3%.
Link: https://arxiv.org/abs/2602.00543
Authors: Seho Pyo,Jiheon Seok,Jaejin Lee
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Table Question Answering (TableQA) poses a significant challenge for large language models (LLMs) because conventional linearization of tables often disrupts the two-dimensional relationships intrinsic to structured data. Existing methods, which depend on end-to-end answer generation or single-line program queries, typically exhibit limited numerical accuracy and reduced interpretability. This work introduces a commented, step-by-step code-generation framework that incorporates explicit reasoning into the Python program-generation process. The approach decomposes TableQA reasoning into multi-line executable programs with concise natural language comments, thereby promoting clearer reasoning and increasing the likelihood of generating correct code. On the WikiTableQuestions benchmark, the proposed method achieves 70.9% accuracy using Qwen2.5-Coder-7B-Instruct, surpassing the Repanda baseline (67.6%). Integrating the proposed framework with a robust end-to-end TableQA model via a lightweight answer-selection mechanism yields further improvements. This combined approach achieves up to 84.3% accuracy on the WikiTableQuestions benchmark.
zh
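To illustrate the kind of output the framework targets, here is a hypothetical commented, multi-line pandas program of the sort the method would generate for a simple table question; the table, question, and comment format are invented for illustration.

```python
import pandas as pd

# Hypothetical table in the WikiTableQuestions style.
table = pd.DataFrame({
    "City": ["Lyon", "Nice", "Lille"],
    "Population": [522_000, 342_000, 236_000],
    "Region": ["Auvergne-Rhone-Alpes", "PACA", "Hauts-de-France"],
})

# Question: "Which city has the largest population?"
# Step 1: ensure the Population column is numeric before comparing.
table["Population"] = pd.to_numeric(table["Population"])
# Step 2: locate the row with the maximum population.
top_row = table.loc[table["Population"].idxmax()]
# Step 3: the answer is the City value of that row.
answer = top_row["City"]
print(answer)  # -> "Lyon"
```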
[NLP-203] Culturally-Grounded Governance for Multilingual Language Models: Rights Data Boundaries and Accountable AI Design
【Quick Read】: This paper addresses the governance dilemma facing multilingual large language models (MLLMs) deployed across cultural, linguistic, and political contexts, where existing English-centric governance frameworks fail to cover low-resource languages and culturally marginalized communities. The key to the solution is a culturally grounded governance framework that reframes MLLM governance as a sociocultural and rights-based problem, arguing that data stewardship, greater transparency, and participatory accountability mechanisms are needed to bridge the gap between global deployment and local norms, and to keep models from reproducing existing global inequalities under the guise of scale and neutrality.
Link: https://arxiv.org/abs/2602.00497
Authors: Hanjing Shi,Dominic DiFranzo
Affiliations: Lehigh University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Multilingual large language models (MLLMs) are increasingly deployed across cultural, linguistic, and political contexts, yet existing governance frameworks largely assume English-centric data, homogeneous user populations, and abstract notions of fairness. This creates systematic risks for low-resource languages and culturally marginalized communities, where data practices, model behavior, and accountability mechanisms often fail to align with local norms, rights, and expectations. Drawing on cross-cultural perspectives in human-centered computing and AI governance, this paper synthesizes existing evidence on multilingual model behavior, data asymmetries, and sociotechnical harm, and articulates a culturally grounded governance framework for MLLMs. We identify three interrelated governance challenges: cultural and linguistic inequities in training data and evaluation practices, misalignment between global deployment and locally situated norms, values, and power structures, and limited accountability mechanisms for addressing harms experienced by marginalized language communities. Rather than proposing new technical benchmarks, we contribute a conceptual agenda that reframes multilingual AI governance as a sociocultural and rights based problem. We outline design and policy implications for data stewardship, transparency, and participatory accountability, and argue that culturally grounded governance is essential for ensuring that multilingual language models do not reproduce existing global inequalities under the guise of scale and neutrality.
zh
[NLP-204] From Knowledge to Inference: Scaling Laws of Specialized Reasoning on GlobalHealthAtlas
【Quick Read】: This paper addresses the lack of structured modeling, supervised signals, and evaluation benchmarks for public health reasoning in machine learning. The key to the solution is GlobalHealthAtlas, a large-scale multilingual dataset of 280,210 instances spanning 15 public health domains and 17 languages, stratified by difficulty from health literacy up to epidemiological and policy reasoning; an LLM-assisted construction and quality-control pipeline with retrieval, deduplication, evidence-grounding checks, and label validation to ensure annotation consistency; and a domain-aligned evaluator distilled from high-confidence LLM judgments that scores outputs along six dimensions (Accuracy, Reasoning, Completeness, Consensus Alignment, Terminology Norms, Insightfulness). Together these enable reproducible training and evaluation for safety-critical public health reasoning.
Link: https://arxiv.org/abs/2602.00491
Authors: Zhaokun Yan,Zhaohan Liu,Wuzheng Dong,Lijie Feng,Chengxiao Dai
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Public health reasoning requires population level inference grounded in scientific evidence, expert consensus, and safety constraints. However, it remains underexplored as a structured machine learning problem with limited supervised signals and benchmarks. We introduce GlobalHealthAtlas, a large scale multilingual dataset of 280,210 instances spanning 15 public health domains and 17 languages, stratified into three difficulty levels from health literacy to epidemiological and policy reasoning. Instances are derived from openly available public health sources and labeled by language, domain, and difficulty to support supervised learning and slice based evaluation. We further propose a large language model (LLM) assisted construction and quality control pipeline with retrieval, deduplication, evidence grounding checks, and label validation to improve consistency at scale. Finally, we present a domain aligned evaluator distilled from high confidence judgments of diverse LLMs to assess outputs along six dimensions: Accuracy, Reasoning, Completeness, Consensus Alignment, Terminology Norms, and Insightfulness. Together, these contributions enable reproducible training and evaluation of LLMs for safety critical public health reasoning beyond conventional QA benchmarks.
zh
[NLP-205] Intention-Adaptive LLM Fine-Tuning for Text Revision Generation EACL
【Quick Read】: This paper addresses the weakness of large language models (LLMs) on intention-driven text generation, specifically revision generation, where the output must reflect the writer's actual intent. Existing methods struggle when multiple intents are entangled, and instruction fine-tuning, though promising, is limited by the scarcity and cost of annotated data. The key to the proposed Intention-Tuning framework is an intention-adaptive layer-wise fine-tuning mechanism: it dynamically selects a subset of LLM layers to learn intention representations and transfers them to revision generation, achieving effective and efficient performance on small revision corpora.
Link: https://arxiv.org/abs/2602.00477
Authors: Zhexiong Liu,Diane Litman
Affiliations: University of Pittsburgh
Subjects: Computation and Language (cs.CL)
Comments: In the Conference of the European Chapter of the Association for Computational Linguistics (EACL), March 2026
Abstract:Large Language Models (LLMs) have achieved impressive capabilities in various context-based text generation tasks, such as summarization and reasoning; however, their applications in intention-based generation tasks remain underexplored. One such example is revision generation, which requires the generated text to explicitly reflect the writer’s actual intentions. Identifying intentions and generating desirable revisions are challenging due to their complex and diverse nature. Although prior work has employed LLMs to generate revisions with few-shot learning, they struggle with handling entangled multi-intent scenarios. While fine-tuning LLMs using intention-based instructions appears promising, it demands large amounts of annotated data, which is expensive and scarce in the revision community. To address these challenges, we propose Intention-Tuning, an intention-adaptive layer-wise LLM fine-tuning framework that dynamically selects a subset of LLM layers to learn the intentions and subsequently transfers their representations to revision generation. Experimental results suggest that Intention-Tuning is effective and efficient on small revision corpora, outperforming several PEFT baselines.
zh
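A minimal sketch of the layer-wise selection idea follows: freeze all layers and re-enable gradients only for a chosen subset. How the subset is chosen is the learned part of Intention-Tuning; here it is simply given, and the toy layer stack stands in for an LLM's transformer blocks.

```python
import torch.nn as nn

def tune_only_layers(model_layers, selected):
    """Freeze every layer, then re-enable gradients only for the
    intention-selected subset (the selection criterion itself is what
    the paper's method learns; here it is provided by hand)."""
    for i, layer in enumerate(model_layers):
        requires = i in selected
        for p in layer.parameters():
            p.requires_grad = requires

# Toy stand-in for a 12-block transformer.
layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(12)])
tune_only_layers(layers, selected={4, 5, 9})
trainable = sum(p.numel() for p in layers.parameters() if p.requires_grad)
print(trainable)  # only the three selected blocks remain trainable
```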
[NLP-206] Diffusion LMs Can Approximate Optimal Infilling Lengths Implicitly
【Quick Read】: This paper addresses the limitation that diffusion language models (DLMs) must be given a pre-specified infilling length, which cannot adapt to the optimal length implied by the surrounding context and thus hurts generation quality. The key to the solution is to exploit two statistical phenomena in the first-step denoising confidence: a local "Oracle Peak" near the ground-truth length and a systematic "Length Bias" that often obscures it. By extracting the peak signal and calibrating away the bias, the training-free CAL (Calibrated Adaptive Length) method efficiently searches for a near-optimal infilling length before formal decoding, markedly improving code and text infilling.
Link: https://arxiv.org/abs/2602.00476
Authors: Hengchang Liu,Zhao Yang,Bing Su
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Diffusion language models (DLMs) provide a bidirectional generation framework naturally suited for infilling, yet their performance is constrained by the pre-specified infilling length. In this paper, we reveal that DLMs possess an inherent ability to discover the correct infilling length. We identify two key statistical phenomena in the first-step denoising confidence: a local "Oracle Peak" that emerges near the ground-truth length and a systematic "Length Bias" that often obscures this signal. By leveraging this signal and calibrating the bias, our training-free method CAL (Calibrated Adaptive Length) enables DLMs to approximate the optimal length through an efficient search before formal decoding. Empirical evaluations demonstrate that CAL improves Pass@1 by up to 47.7% over fixed-length baselines and 40.5% over chat-based adaptive methods in code infilling, while boosting BLEU-2 and ROUGE-L by up to 8.5% and 9.9% in text infilling. These results demonstrate that CAL paves the way for robust DLM infilling without requiring any specialized training. Code is available at this https URL.
zh
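The search-then-calibrate idea can be sketched as follows; both functions below are assumptions standing in for the model's first-step denoising confidence and the paper's fitted length-bias correction.

```python
import numpy as np

def calibrated_length_search(confidence_fn, candidate_lengths, bias_fn):
    """Pick an infilling length for a masked diffusion model.

    confidence_fn(L) -> mean first-step denoising confidence when the
    infill span holds L mask tokens (one forward pass per candidate).
    bias_fn(L)       -> estimated systematic length bias to subtract.
    """
    scores = {L: confidence_fn(L) - bias_fn(L) for L in candidate_lengths}
    return max(scores, key=scores.get)

# Toy illustration: confidence peaks near the true length 7, sitting on
# top of a bias that favors short spans.
true_len = 7
conf = lambda L: np.exp(-0.5 * (L - true_len) ** 2) + 0.8 / L
bias = lambda L: 0.8 / L
print(calibrated_length_search(conf, range(1, 16), bias))  # -> 7
```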
[NLP-207] Words that make SENSE: Sensorimotor Norms in Learned Lexical Token Representations
【Quick Read】: This paper addresses the fact that conventional word embeddings are built purely from co-occurrence statistics and lack the sensorimotor grounding that underlies human language understanding. The key to the solution is SENSE, a learned projection model that predicts the Lancaster sensorimotor norms from lexical word embeddings; a behavioral study shows statistically significant correlations between SENSE ratings and the rates at which humans select nonce words as evoking specific sensorimotor associations, offering a computational route toward grounded semantic modeling.
Link: https://arxiv.org/abs/2602.00469
Authors: Abhinav Gupta,Toben H. Mintz,Jesse Thomason
Affiliations: University of Southern California
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 pages, 2 figures, codebase can be found at: this https URL
Abstract:While word embeddings derive meaning from co-occurrence patterns, human language understanding is grounded in sensory and motor experience. We present SENSE (Sensorimotor Embedding Norm Scoring Engine), a learned projection model that predicts Lancaster sensorimotor norms from word lexical embeddings. We also conducted a behavioral study where 281 participants selected which among candidate nonce words evoked specific sensorimotor associations, finding statistically significant correlations between human selection rates and SENSE ratings across 6 of the 11 modalities. Sublexical analysis of these nonce-word selection rates revealed systematic phonesthemic patterns for the interoceptive norm, suggesting a path towards computationally proposing candidate phonesthemes from text data.
zh
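A learned projection from embeddings to the 11 Lancaster dimensions can be as simple as ridge regression; the sketch below uses random stand-in embeddings and ratings, and the closed-form ridge solution is our choice for illustration, not necessarily the paper's architecture.

```python
import numpy as np

def fit_ridge(X, Y, lam=1.0):
    """Closed-form ridge projection from word embeddings X (n x d) to
    sensorimotor norm ratings Y (n x 11, one column per Lancaster
    modality/effector dimension). Predict with X_new @ W."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 300))        # stand-in lexical embeddings
Y = rng.uniform(0, 5, size=(5000, 11))  # stand-in norm ratings (0-5 scale)
W = fit_ridge(X, Y, lam=10.0)
print((X[:5] @ W).shape)  # predicted norms for 5 words -> (5, 11)
```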
[NLP-208] What Matters to an LLM? Behavioral and Computational Evidences from Summarization EACL2026
【Quick Read】: This paper addresses the fact that while large language models (LLMs) excel at summarization, the internal notion of "importance" that drives their information selection remains opaque. To expose this mechanism, the study combines behavioral and computational analyses. The key steps are: first, generating length-controlled summaries and deriving empirical importance distributions from how often each information unit is selected, which shows that LLMs exhibit stable and consistent importance patterns and cluster by model family more than by size; second, identifying attention heads that align closely with the empirical importance distributions and showing that middle-to-late layers are strongly predictive of importance. The approach systematically reveals what kinds of information LLMs prioritize in summarization and how that priority is internally represented, laying groundwork for interpreting and controlling information selection.
Link: https://arxiv.org/abs/2602.00459
Authors: Yongxin Zhou,Changshun Wu,Philippe Mulhem,Didier Schwab,Maxime Peyrard
Affiliations: Univ. Grenoble Alpes; CNRS; Inria; Grenoble INP; LIG
Subjects: Computation and Language (cs.CL)
Comments: Findings of EACL 2026
Abstract:Large Language Models (LLMs) are now state-of-the-art at summarization, yet the internal notion of importance that drives their information selections remains hidden. We propose to investigate this by combining behavioral and computational analyses. Behaviorally, we generate a series of length-controlled summaries for each document and derive empirical importance distributions based on how often each information unit is selected. These reveal that LLMs converge on consistent importance patterns, sharply different from pre-LLM baselines, and that LLMs cluster more by family than by size. Computationally, we identify that certain attention heads align well with empirical importance distributions, and that middle-to-late layers are strongly predictive of importance. Together, these results provide initial insights into what LLMs prioritize in summarization and how this priority is internally represented, opening a path toward interpreting and ultimately controlling information selection in these models.
zh
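The behavioral side of the analysis reduces to counting how often each information unit survives across length-controlled summaries; a minimal sketch (with invented unit ids) follows.

```python
from collections import Counter

def empirical_importance(summaries_units, all_units):
    """Fraction of length-controlled summaries that include each
    information unit; `summaries_units` is a list of sets of unit ids,
    one set per generated summary of the same document."""
    counts = Counter(u for s in summaries_units for u in s)
    n = len(summaries_units)
    return {u: counts[u] / n for u in all_units}

# Toy example: 4 summaries of one document over units u1..u4.
summaries = [{"u1", "u2"}, {"u1"}, {"u1", "u3"}, {"u1", "u2"}]
print(empirical_importance(summaries, ["u1", "u2", "u3", "u4"]))
# u1 is selected in every summary, u4 never.
```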
[NLP-209] When Agents "Misremember" Collectively: Exploring the Mandela Effect in LLM-based Multi-Agent Systems ICLR2026
【Quick Read】: This paper addresses the "Mandela effect" arising from collective cognitive bias in LLM-based multi-agent systems: through social influence and internalized misinformation, collaborating agents come to share false memories of past events, which both limits our understanding of memory bias in such systems and raises ethical concerns about spreading misinformation. The key to the solution is MANBENCH, a benchmark that quantifies the Mandela effect across task types and interaction protocols, together with mitigation strategies (prompt-level defenses such as cognitive anchoring and source scrutiny, plus a model-level alignment-based defense) that achieve an average 74.40% reduction in the effect relative to the baseline, improving the resilience and ethical alignment of multi-agent systems.
Link: https://arxiv.org/abs/2602.00428
Authors: Naen Xu,Hengyu An,Shuo Shi,Jinghuai Zhang,Chunyi Zhou,Changjiang Li,Tianyu Du,Zhihui Fu,Jun Wang,Shouling Ji
Affiliations: Zhejiang University; University of California, Los Angeles; Palo Alto Networks; OPPO Research Institute; Zhejiang Key Laboratory of Decision Intelligence
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: ICLR 2026
Abstract:Recent advancements in large language models (LLMs) have significantly enhanced the capabilities of collaborative multi-agent systems, enabling them to address complex challenges. However, within these multi-agent systems, the susceptibility of agents to collective cognitive biases remains an underexplored issue. A compelling example is the Mandela effect, a phenomenon where groups collectively misremember past events as a result of false details reinforced through social influence and internalized misinformation. This vulnerability limits our understanding of memory bias in multi-agent systems and raises ethical concerns about the potential spread of misinformation. In this paper, we conduct a comprehensive study on the Mandela effect in LLM-based multi-agent systems, focusing on its existence, causing factors, and mitigation strategies. We propose MANBENCH, a novel benchmark designed to evaluate agent behaviors across four common task types that are susceptible to the Mandela effect, using five interaction protocols that vary in agent roles and memory timescales. We evaluate agents powered by several LLMs on MANBENCH to quantify the Mandela effect and analyze how different factors affect it. Moreover, we propose strategies to mitigate this effect, including prompt-level defenses (e.g., cognitive anchoring and source scrutiny) and model-level alignment-based defense, achieving an average 74.40% reduction in the Mandela effect compared to the baseline. Our findings provide valuable insights for developing more resilient and ethically aligned collaborative multi-agent systems.
zh
[NLP-210] LLMs as High-Dimensional Nonlinear Autoregressive Models with Attention: Training, Alignment, and Inference
【Quick Read】: This paper addresses the absence of an explicit, equation-level mathematical formulation of large language model (LLM) architectures and training pipelines, which obscures their underlying computational structure. The key to the solution is to formulate LLMs uniformly as high-dimensional nonlinear autoregressive models with attention-based dependencies, covering pretraining via next-token prediction, alignment methods (RLHF, direct preference optimization (DPO), rejection sampling fine-tuning (RSFT), and reinforcement learning from verifiable rewards (RLVR)), and autoregressive generation at inference. Casting self-attention as a repeated bilinear-softmax-linear composition yields a framework for principled analysis of model behaviors (sycophancy, hallucination, in-context learning, and more) and a rigorous basis for extensions such as continual learning.
Link: https://arxiv.org/abs/2602.00426
Authors: Vikram Krishnamurthy
Affiliations: Cornell University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Signal Processing (eess.SP)
Comments: 27 pages, 12 figures. Mathematical survey framing LLMs as high-dimensional nonlinear autoregressive models with attention, covering training, alignment, and inference, with nanoGPT/nanochat-style code examples. Feedback welcome
Abstract:Large language models (LLMs) based on transformer architectures are typically described through collections of architectural components and training procedures, obscuring their underlying computational structure. This review article provides a concise mathematical reference for researchers seeking an explicit, equation-level description of LLM training, alignment, and generation. We formulate LLMs as high-dimensional nonlinear autoregressive models with attention-based dependencies. The framework encompasses pretraining via next-token prediction, alignment methods such as reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), rejection sampling fine-tuning (RSFT), and reinforcement learning from verifiable rewards (RLVR), as well as autoregressive generation during inference. Self-attention emerges naturally as a repeated bilinear–softmax–linear composition, yielding highly expressive sequence models. This formulation enables principled analysis of alignment-induced behaviors (including sycophancy), inference-time phenomena (such as hallucination, in-context learning, chain-of-thought prompting, and retrieval-augmented generation), and extensions like continual learning, while serving as a concise reference for interpretation and further theoretical development.
zh
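The bilinear-softmax-linear composition at the heart of this formulation is easy to write out; here is a single-head causal self-attention block in NumPy, with arbitrary shapes chosen only for the demonstration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, Wo):
    """One single-head self-attention block as a bilinear-softmax-linear
    composition: softmax(X Wq (X Wk)^T / sqrt(dk)) X Wv Wo, causally masked."""
    T, _ = X.shape
    dk = Wq.shape[1]
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(dk)   # bilinear form in X
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf                          # autoregressive masking
    A = softmax(scores, axis=-1)                    # softmax nonlinearity
    return A @ (X @ Wv) @ Wo                        # linear read-out

rng = np.random.default_rng(0)
T, d, dk = 5, 8, 4
X = rng.normal(size=(T, d))
out = self_attention(X, rng.normal(size=(d, dk)), rng.normal(size=(d, dk)),
                     rng.normal(size=(d, dk)), rng.normal(size=(dk, d)))
print(out.shape)  # (5, 8)
```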
[NLP-211] Segment-Level Attribution for Selective Learning of Long Reasoning Traces ICLR2026
【Quick Read】: This paper addresses output redundancy in large reasoning models (LRMs): most steps in their long chains of thought (CoTs) are repetitive or truncated, and this redundancy is imitated and amplified during supervised fine-tuning (SFT), degrading final performance. The key to the solution is integrated-gradient attribution, which quantifies each token's contribution to the final answer and is aggregated into two segment-level metrics: (1) attribution strength, the overall attribution magnitude of a segment; and (2) direction consistency, whether token attributions within a segment are uniformly positive or negative (high consistency) or mixed (moderate consistency). A segment-level selective learning framework then identifies important segments with high attribution strength but moderate direction consistency, indicating reflective rather than shallow reasoning, applies the SFT loss only to those segments, and masks the loss elsewhere, improving learning from long reasoning traces.
Link: https://arxiv.org/abs/2602.00425
Authors: Siyuan Wang,Yanchen Liu,Xiang Ren
Affiliations: University of Southern California
Subjects: Computation and Language (cs.CL)
Comments: Accepted to ICLR 2026. 16 pages, 5 figures. Code available at this https URL
Abstract:Large Reasoning Models (LRMs) achieve strong reasoning performance by generating long chains of thought (CoTs), yet only a small fraction of these traces meaningfully contributes to answer prediction, while the majority contains repetitive or truncated content. Such output redundancy is further propagated after supervised finetuning (SFT), as models learn to imitate verbose but uninformative patterns, which can degrade performance. To this end, we incorporate integrated gradient attribution to quantify each token's influence on final answers and aggregate them into two segment-level metrics: (1) attribution strength measures the overall attribution magnitude; and (2) direction consistency captures whether tokens' attributions within a segment are uniformly positive or negative (high consistency), or a mixture of both (moderate consistency). Based on these two metrics, we propose a segment-level selective learning framework to identify important segments with high attribution strength but moderate consistency that indicate reflective rather than shallow reasoning. The framework then applies selective SFT on these important segments while masking loss for unimportant ones. Experiments across multiple models and datasets show that our approach improves accuracy and output efficiency, enabling more effective learning from long reasoning traces. Code and data are available at this https URL.
zh
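Given per-token attribution scores, the two segment-level metrics are straightforward to compute; in this sketch the aggregation formulas (mean absolute value, and normalized absolute sign sum) are our reading of the definitions, and the attribution values are invented.

```python
import numpy as np

def segment_metrics(token_attributions):
    """Two segment-level metrics from per-token integrated-gradient
    attributions a_i: attribution strength = mean |a_i|; direction
    consistency = |sum(sign(a_i))| / n, so 1 = uniformly signed and
    values near 0 = mixed signs."""
    a = np.asarray(token_attributions, dtype=float)
    strength = np.abs(a).mean()
    consistency = abs(np.sign(a).sum()) / len(a)
    return strength, consistency

reflective = [0.9, -0.7, 0.8, -0.6, 0.7]     # strong but mixed-sign
shallow    = [0.05, 0.04, 0.06, 0.05, 0.04]  # weak but uniformly signed
for seg in (reflective, shallow):
    s, c = segment_metrics(seg)
    print(f"strength={s:.2f} consistency={c:.2f}")
# Keep the SFT loss on segments with high strength but moderate consistency.
```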
[NLP-212] Brazilian Portuguese Image Captioning with Transformers: A Study on Cross-Native-Translated Dataset
【Quick Read】: This paper addresses the performance bottleneck that low-resource languages such as Brazilian Portuguese face in image captioning (IC) due to the lack of dedicated datasets and models. The key to the solution is a cross-native-translated evaluation framework that compares a Brazilian Portuguese Flickr30K dataset manually annotated by native speakers with one automatically translated from English, systematically assessing Transformer-based vision-language models (VLMs) in this language. The study finds that although automatically translated data partially mitigates resource scarcity, native data clearly improves performance; Swin-DistilBERTimbau generalizes best across datasets, while the locally pretrained ViTucano beats larger multilingual models (GPT-4o, LLaMa 3.2 Vision) on traditional text-based metrics, underscoring the value of language-specific modeling.
Link: https://arxiv.org/abs/2602.00393
Authors: Gabriel Bromonschenkel,Alessandro L. Koerich,Thiago M. Paixão,Hilário Tomaz Alves de Oliveira
Affiliations: Instituto Federal do Espírito Santo; École de Technologie Supérieure (ÉTS)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to JBCS. 18 pages, 11 figures
Abstract:Image captioning (IC) refers to the automatic generation of natural language descriptions for images, with applications ranging from social media content generation to assisting individuals with visual impairments. While most research has been focused on English-based models, low-resource languages such as Brazilian Portuguese face significant challenges due to the lack of specialized datasets and models. Several studies create datasets by automatically translating existing ones to mitigate resource scarcity. This work addresses this gap by proposing a cross-native-translated evaluation of Transformer-based vision and language models for Brazilian Portuguese IC. We use a version of Flickr30K comprised of captions manually created by native Brazilian Portuguese speakers and compare it to a version with captions automatically translated from English to Portuguese. The experiments include a cross-context approach, where models trained on one dataset are tested on the other to assess the translation impact. Additionally, we incorporate attention maps for model inference interpretation and use the CLIP-Score metric to evaluate the image-description alignment. Our findings show that Swin-DistilBERTimbau consistently outperforms other models, demonstrating strong generalization across datasets. ViTucano, a Brazilian Portuguese pre-trained VLM, surpasses larger multilingual models (GPT-4o, LLaMa 3.2 Vision) in traditional text-based evaluation metrics, while GPT-4 models achieve the highest CLIP-Score, highlighting improved image-text alignment. Attention analysis reveals systematic biases, including gender misclassification, object enumeration errors, and spatial inconsistencies. The datasets and the models generated and analyzed during the current study are available in: this https URL.
zh
[NLP-213] Clause-Internal or Clause-External? Testing Turkish Reflexive Binding in Adapted versus Chain of Thought Large Language Models
【Quick Read】: This paper asks whether large language models (LLMs) accurately capture the binding relations of the Turkish reflexive pronouns kendi and kendisi, focusing on preferences between local and non-local antecedents. The key to the approach is a balanced 100-sentence test set evaluated with a combined sentence-level perplexity and forced-choice paradigm, contrasting two architecturally different systems: OpenAI's chain-of-thought o1 Mini and the LLaMA-2-based, Turkish-fine-tuned Trendyol-LLM-7B-base-v0.1. Trendyol-LLM shows a strong locality bias (local binding in roughly 70% of trials), whereas o1 Mini splits almost evenly between local and long-distance readings, revealing a marked contrast in binding behavior between the two model types.
Link: https://arxiv.org/abs/2602.00380
Authors: Sercan Karakaş
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 15 pages
Abstract:This study evaluates whether state-of-the-art large language models capture the binding relations of Turkish reflexive pronouns. We construct a balanced set of 100 sentences that pit local against non-local antecedents for the reflexives kendi and kendisi, and test two contrasting systems: an OpenAI chain-of-thought model designed for multi-step reasoning and Trendyol-LLM-7B-base-v0.1, a LLaMA-2-derived model extensively fine-tuned on Turkish data. Antecedent choice is assessed using a combined sentence-level perplexity and forced-choice paradigm. Trendyol-LLM favours local bindings in approximately 70% of trials, exhibiting a strong locality bias, whereas o1 Mini distributes its choices almost evenly between local and long-distance readings, revealing a marked contrast in binding behaviour across the two systems.
zh
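The perplexity-based forced-choice paradigm can be sketched with any causal LM; the snippet below uses gpt2 and English paraphrases purely as stand-ins for the study's actual models and Turkish stimuli.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tok, sentence):
    """Sentence-level perplexity under a causal language model."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Disambiguated paraphrases of one reflexive sentence: the forced choice
# picks whichever antecedent reading the model finds less surprising.
local_reading = "Ali said that Veli blamed himself, meaning Veli."
distant_reading = "Ali said that Veli blamed himself, meaning Ali."
choice = min((local_reading, distant_reading),
             key=lambda s: perplexity(model, tok, s))
print("preferred reading:", choice)
```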
[NLP-214] DecompressionLM: Deterministic Diagnostic and Zero-Shot Concept Graph Extraction from Language Models
【Quick Read】: This paper addresses the limitation that existing knowledge-probing methods depend on pre-defined queries and can therefore only surface known concepts, and targets three specific problems in decoding-based probing: cross-sequence coupling that concentrates probability mass on high-frequency prefixes, competitive decoding effects that suppress long-tail concepts, and scalability limits imposed by sequential exploration. The key to the solution is DecompressionLM, a stateless, zero-shot concept-graph extraction framework that combines Van der Corput low-discrepancy sequences with arithmetic decoding, enabling deterministic, highly parallel generation without shared cross-sequence state and thereby overcoming the bottlenecks in concept coverage and factual grounding.
Link: https://arxiv.org/abs/2602.00377
Authors: Zhaochen Hong,Jiaxuan You
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Existing knowledge probing methods rely on pre-defined queries, limiting extraction to known concepts. We introduce DecompressionLM, a stateless framework for zero-shot concept graph extraction that discovers what language models encode without pre-specified queries or shared cross-sequence state. Our method targets three limitations of common decoding-based probing approaches: cross-sequence coupling that concentrates probability mass on high-frequency prefixes, competitive decoding effects that suppress long-tail concepts, and scalability constraints arising from sequential exploration. Using Van der Corput low-discrepancy sequences with arithmetic decoding, DecompressionLM enables deterministic, embarrassingly parallel generation without shared state across sequences. Across two model families and five quantization variants, we find that activation-aware quantization (AWQ-4bit) expands concept coverage by 30-170%, while uniform quantization (GPTQ-Int4) induces 71-86% coverage collapse – divergent behaviors not reliably reflected by explanation-level perplexity. Corpus-based verification further reveals a 17-point hallucination gap between top- and bottom-ranked MMLU-Pro Law models. DecompressionLM establishes concept coverage as a complementary evaluation dimension for assessing knowledge breadth and factual grounding in compressed models useful for their deployment.
zh
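The Van der Corput sequence the framework samples from is simple to generate: reverse the base-b digits of the index around the radix point. A base-2 sketch:

```python
def van_der_corput(n, base=2):
    """n-th element of the base-b Van der Corput low-discrepancy
    sequence: reverse the base-b digits of n around the radix point."""
    q, denom = 0.0, 1.0
    while n:
        n, digit = divmod(n, base)
        denom *= base
        q += digit / denom
    return q

# The first 8 points in base 2 fill [0, 1) far more evenly than i.i.d.
# draws, which is what lets parallel decoding branches avoid piling
# probability mass onto the same high-frequency prefixes.
print([van_der_corput(i) for i in range(1, 9)])
# [0.5, 0.25, 0.75, 0.125, 0.625, 0.375, 0.875, 0.0625]
```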
[NLP-215] Post-Training Probability Manifold Correction via Structured SVD Pruning and Self-Referential Distillation
【Quick Read】: This paper tackles the high cost of deploying large language models (LLMs) with Sparse Knowledge Distillation (SparseKD), a post-training compression method. The key idea is to combine structured SVD pruning with self-referential knowledge distillation: instead of an external teacher, the model distills from its own pre-compression probability distribution, which enables strong quality recovery even after aggressive pruning. Experiments show that self-referential distillation alone improves model quality by 39% relative under an identical objective and fixed calibration set, and combined with structured pruning yields 15-65% parameter reduction with acceptable trade-offs; the speedups come entirely from reduced dense matrix multiplication in feed-forward layers, with no attention changes, complex retraining, or custom inference kernels required.
Link: https://arxiv.org/abs/2602.00372
Authors: Aaron R. Flouro,Shawn P. Chadwick
Affiliations: SparseTech
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 16 pages, 10 tables, 4 figures
Abstract:Large language models are expensive to deploy. We introduce Sparse Knowledge Distillation (SparseKD), a post-training method that compresses transformer models by combining structured SVD pruning with self-referential knowledge distillation. The key insight is simple: instead of using an external teacher, the model teaches itself by matching its own probability distribution from before compression. This self-referential setup enables surprisingly strong quality recovery after aggressive pruning. Our experiments reveal an unexpected finding: self-referential distillation alone, applied post-training under an identical objective and fixed calibration dataset, improves model quality by 39% relative to the original converged checkpoint. When combined with structured pruning, SparseKD achieves 15-65% parameter reduction with acceptable quality trade-offs. Kernel profiling shows that speedups arise entirely from reduced dense matrix multiplication in feed-forward layers while attention remains unchanged, making this approach complementary to attention optimizations. We validate across two model families (0.6B and 3.8B parameters) with multi-seed experiments confirming high reproducibility. SparseKD requires no external super-teacher, no architectural changes, and no custom inference kernels, making it immediately deployable with existing infrastructure.
zh
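The structured SVD step can be sketched directly in NumPy: factor a dense weight matrix into two thin matrices, trading a small reconstruction error for a large parameter reduction. The shapes and rank are illustrative (real FFN weights are also much closer to low-rank than this random stand-in), and the self-referential distillation that recovers quality happens afterwards.

```python
import numpy as np

def svd_prune(W, rank):
    """Replace a dense weight matrix W (m x n) with two thin factors
    A (m x r) and B (r x n); parameters drop from m*n to r*(m+n)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # absorb singular values into A
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
W = rng.normal(size=(1024, 4096))   # stand-in for an FFN projection
A, B = svd_prune(W, rank=128)
kept = (A.size + B.size) / W.size
err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(f"params kept: {kept:.2%}, relative reconstruction error: {err:.3f}")
```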
[NLP-216] DETOUR: An Interactive Benchmark for Dual-Agent Search and Reasoning
【Quick Read】: This paper addresses the limitation that existing benchmarks for simulating tip-of-the-tongue search are restricted to single-turn settings, which fail to capture how people recall information over multiple conversational turns. The key to the solution is DETOUR (Dual-agent based Evaluation Through Obscure Under-specified Retrieval), a benchmark of 1,011 prompts in which the evaluated Primary Agent must identify a recollected entity by querying a Memory Agent held consistent across evaluations, bringing the setup closer to real human recall. Even state-of-the-art models reach only 36% accuracy across text, image, audio, and video, highlighting the need for stronger reasoning and retrieval under underspecified conditions.
Link: https://arxiv.org/abs/2602.00352
Authors: Li Siyan,Darshan Deshpande,Anand Kannappan,Rebecca Qian
Affiliations: Patronus AI; DAP Lab, Columbia University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:When recalling information in conversation, people often arrive at the recollection after multiple turns. However, existing benchmarks for evaluating agent capabilities in such tip-of-the-tongue search processes are restricted to single-turn settings. To more realistically simulate tip-of-the-tongue search, we introduce Dual-agent based Evaluation Through Obscure Under-specified Retrieval (DETOUR), a dual-agent evaluation benchmark containing 1,011 prompts. The benchmark design involves a Primary Agent, which is the subject of evaluation, tasked with identifying the recollected entity through querying a Memory Agent that is held consistent across evaluations. Our results indicate that current state-of-the-art models still struggle with our benchmark, only achieving 36% accuracy when evaluated on all modalities (text, image, audio, and video), highlighting the importance of enhancing capabilities in underspecified scenarios.
zh
[NLP-217] When RAG Hurts: Diagnosing and Mitigating Attention Distraction in Retrieval-Augmented LVLMs
【Quick Read】: This paper identifies a failure mode of retrieval-augmented generation (RAG) in large vision-language models (LVLMs) that it terms Attention Distraction (AD): when the retrieved text is sufficiently relevant, it globally suppresses visual attention and pulls attention on image tokens away from question-relevant regions, causing the model to fail on questions it could answer correctly without retrieval. The key to the solution is MAD-RAG, a training-free intervention that decouples visual grounding from context integration via a dual-question formulation and preserves image-conditioned evidence through attention mixing, yielding consistent gains over vanilla RAG across benchmarks.
Link: https://arxiv.org/abs/2602.00344
Authors: Beidi Zhao,Wenlong Deng,Xinting Liao,Yushu Li,Nazim Shaikh,Yao Nie,Xiaoxiao Li
Affiliations: University of British Columbia; Vector Institute; Roche Diagnostics
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 18 pages, 10 figures
Abstract:While Retrieval-Augmented Generation (RAG) is one of the dominant paradigms for enhancing Large Vision-Language Models (LVLMs) on knowledge-based VQA tasks, recent work attributes RAG failures to insufficient attention towards the retrieved context, proposing to reduce the attention allocated to image tokens. In this work, we identify a distinct failure mode that previous study overlooked: Attention Distraction (AD). When the retrieved context is sufficient (highly relevant or including the correct answer), the retrieved text suppresses the visual attention globally, and the attention on image tokens shifts away from question-relevant regions. This leads to failures on questions the model could originally answer correctly without the retrieved text. To mitigate this issue, we propose MAD-RAG, a training-free intervention that decouples visual grounding from context integration through a dual-question formulation, combined with attention mixing to preserve image-conditioned evidence. Extensive experiments on OK-VQA, E-VQA, and InfoSeek demonstrate that MAD-RAG consistently outperforms existing baselines across different model families, yielding absolute gains of up to 4.76%, 9.20%, and 6.18% over the vanilla RAG baseline. Notably, MAD-RAG rectifies up to 74.68% of failure cases with negligible computational overhead.
zh
[NLP-218] Detecting AI-Generated Content in Academic Peer Reviews
【Quick Read】: This paper asks whether the use of large language models (LLMs) in academic peer review is becoming more widespread, and with what consequences. The key to the approach is a detection model trained on historical review text and applied to later review cycles at ICLR and Nature Communications (NC), tracing the temporal emergence of AI-generated content: nearly absent before 2022, it grows to roughly 20% of ICLR reviews and 12% of NC reviews by 2025, providing empirical grounds for assessing AI's role in scholarly evaluation.
Link: https://arxiv.org/abs/2602.00319
Authors: Siyuan Shen,Kai Wang
Affiliations: University of Pennsylvania; Children's Hospital of Philadelphia
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments:
Abstract:The growing availability of large language models (LLMs) has raised questions about their role in academic peer review. This study examines the temporal emergence of AI-generated content in peer reviews by applying a detection model trained on historical reviews to later review cycles at International Conference on Learning Representations (ICLR) and Nature Communications (NC). We observe minimal detection of AI-generated content before 2022, followed by a substantial increase through 2025, with approximately 20% of ICLR reviews and 12% of Nature Communications reviews classified as AI-generated in 2025. The most pronounced growth of AI-generated reviews in NC occurs between the third and fourth quarter of 2024. Together, these findings provide suggestive evidence of a rapidly increasing presence of AI-assisted content in peer review and highlight the need for further study of its implications for scholarly evaluation.
zh
[NLP-219] MiNER: A Two-Stage Pipeline for Metadata Extraction from Municipal Meeting Minutes
【Quick Read】: This paper addresses the difficulty of extracting metadata (meeting number, date, location, participants, start/end times) from municipal meeting minutes, documents with highly heterogeneous formats and writing styles for which off-the-shelf named entity recognition (NER) models are ill-suited. The key to the solution is a two-stage pipeline: a question answering (QA) model first locates the text segments containing the metadata, and Transformer-based models (BERTimbau and XLM-RoBERTa, with and without a CRF layer) then perform fine-grained entity extraction, further boosted by deslexicalization. The approach delivers strong in-domain performance, beating larger general-purpose large language models (LLMs), though cross-municipality generalization remains limited by the variability and linguistic complexity of municipal records.
Link: https://arxiv.org/abs/2602.00316
Authors: Rodrigo Batista,Luís Filipe Cunha,Purificação Silvano,Nuno Guimarães,Alípio Jorge,Evelin Amorim,Ricardo Campos
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Municipal meeting minutes are official documents of local governance, exhibiting heterogeneous formats and writing styles. Effective information retrieval (IR) requires identifying metadata such as meeting number, date, location, participants, and start/end times, elements that are rarely standardized or easy to extract automatically. Existing named entity recognition (NER) models are ill-suited to this task, as they are not adapted to such domain-specific categories. In this paper, we propose a two-stage pipeline for metadata extraction from municipal minutes. First, a question answering (QA) model identifies the opening and closing text segments containing metadata. Transformer-based models (BERTimbau and XLM-RoBERTa with and without a CRF layer) are then applied for fine-grained entity extraction and enhanced through deslexicalization. To evaluate our proposed pipeline, we benchmark both open-weight (Phi) and closed-weight (Gemini) LLMs, assessing predictive performance, inference cost, and carbon footprint. Our results demonstrate strong in-domain performance, better than larger general-purpose LLMs. However, cross-municipality evaluation reveals reduced generalization reflecting the variability and linguistic complexity of municipal records. This work establishes the first benchmark for metadata extraction from municipal meeting minutes, providing a solid foundation for future research in this domain.
zh
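The two-stage design maps naturally onto off-the-shelf Hugging Face pipelines; in this sketch the checkpoints are generic public placeholders, not the models trained in the paper, and the minute text is invented.

```python
from transformers import pipeline

# Stage 1: a QA model locates the segment carrying the metadata.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
minutes = ("Minute no. 12/2024. On 5 March 2024, at 2 p.m., the municipal "
           "council met at the Town Hall, chaired by the Mayor. [...]")
span = qa(question="When and where did the meeting take place?",
          context=minutes)

# Stage 2: a token-classification (NER) model extracts fine-grained
# entities (dates, locations, participants) from the located segment.
ner = pipeline("token-classification",
               model="Davlan/bert-base-multilingual-cased-ner-hrl",
               aggregation_strategy="simple")
print(span["answer"])
print(ner(span["answer"]))
```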
[NLP-220] Faithful-Patchscopes: Understanding and Mitigating Model Bias in Hidden Representations Explanation of Large Language Models
【Quick Read】: This paper addresses a systematic unfaithfulness in Patchscopes-based explanations of hidden representations: when decoding internal states, large language models (LLMs) over-rely on inherent linguistic priors that can override the contextual information actually encoded, for example still generating "green" even when a hidden representation encodes the attribute "purple" for "broccoli". To study this systematically, the authors build a benchmark of biased cases for measuring faithfulness and propose Bias Alignment through Logit Recalibration (BALOR): the output logits of an unpatched prompt are treated as a representation of the model's bias and contrasted with the logits obtained after patching, and the logit distribution is recalibrated through this contrast, suppressing model bias and amplifying contextual information. Experiments across multiple LLMs show BALOR consistently outperforms existing baselines, with up to 33% relative improvement.
Link: https://arxiv.org/abs/2602.00300
Authors: Xilin Gong,Shu Yang,Zehua Cao,Lynne Billard,Di Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) have demonstrated strong capabilities for hidden representation interpretation through Patchscopes, a framework that uses LLMs themselves to generate human-readable explanations by decoding from internal hidden representations. However, our work shows that LLMs tend to rely on inherent linguistic patterns, which can override contextual information encoded in the hidden representations during decoding. For example, even when a hidden representation encodes the contextual attribute “purple” for “broccoli”, LLMs still generate “green” in their explanations, reflecting a strong prior association. This behavior reveals a systematic unfaithfulness in Patchscopes. To systematically study this issue, we first designed a dataset to evaluate the faithfulness of Patchscopes under biased cases, and our results show that there is an 18.84% faithfulness decrease on average. We then propose Bias Alignment through Logit Recalibration (BALOR), which treats the output logits from an unpatched prompt as capturing model bias and contrasts them with logits obtained under patched contextual information. By recalibrating the logit distribution through this contrast, BALOR suppresses model bias and amplifies contextual information during generation. Experiments across multiple LLMs demonstrate that BALOR consistently outperforms existing baselines, achieving up to 33% relative performance improvement.
zh
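A minimal sketch of the logit-contrast idea: treat the unpatched prompt's logits as the bias and push the distribution along the patched-minus-unpatched direction. The additive form and the strength beta are our assumptions, not the paper's exact recalibration rule.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def balor_recalibrate(patched_logits, unpatched_logits, beta=1.0):
    """Amplify whatever the patched hidden representation contributed,
    relative to the unpatched prompt's bias, by moving along the
    contrast direction before normalizing."""
    p = np.asarray(patched_logits, float)
    u = np.asarray(unpatched_logits, float)
    return softmax(p + beta * (p - u))

# Toy vocabulary ["green", "purple", "red"]: the prior strongly favors
# "green"; the patch encodes the contextual attribute "purple".
unpatched = np.array([4.0, 0.5, 0.2])
patched   = np.array([3.0, 2.8, 0.2])
print(balor_recalibrate(patched, unpatched, beta=2.0))
# Probability mass shifts toward "purple".
```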
[NLP-221] Benchmarking Uncertainty Calibration in Large Language Model Long-Form Question Answering
【Quick Read】: This paper addresses the lack of validated uncertainty quantification (UQ) for large language models (LLMs) in scientific question answering, a domain that leans heavily on fact retrieval and reasoning and therefore needs trustworthy uncertainty estimates before generated answers can be adopted. The key to the solution is the first large-scale benchmark for evaluating the calibration of UQ methods on reasoning-demanding QA, with an open-source framework for reproducible assessment. Analyzing 685,000 long-form responses from up to 20 LLMs over seven scientific QA datasets, the study finds that instruction tuning polarizes token-level probabilities and weakens their reliability as uncertainty estimates; at the sequence level, consistency-based answer frequency is the best-calibrated signal, while verbalized confidence is systematically biased and poorly correlated with correctness. The paper further shows how relying on Expected Calibration Error (ECE) as a sole metric can misjudge UQ methods, exposing fundamental limits of current UQ practice.
Link: https://arxiv.org/abs/2602.00279
Authors: Philip Müller,Nicholas Popovič,Michael Färber,Peter Steinbach
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Under Review
Abstract:Large Language Models (LLMs) are commonly used in Question Answering (QA) settings, increasingly in the natural sciences if not science at large. Reliable Uncertainty Quantification (UQ) is critical for the trustworthy uptake of generated answers. Existing UQ approaches remain weakly validated in scientific QA, a domain relying on fact-retrieval and reasoning capabilities. We introduce the first large-scale benchmark for evaluating UQ metrics in reasoning-demanding QA studying calibration of UQ methods, providing an extensible open-source framework to reproducibly assess calibration. Our study spans up to 20 large language models of base, instruction-tuned and reasoning variants. Our analysis covers seven scientific QA datasets, including both multiple-choice and arithmetic question answering tasks, using prompting to emulate an open question answering setting. We evaluate and compare methods representative of prominent approaches on a total of 685,000 long-form responses, spanning different reasoning complexities representative of domain-specific tasks. At the token level, we find that instruction tuning induces strong probability mass polarization, reducing the reliability of token-level confidences as estimates of uncertainty. Models further fine-tuned for reasoning are exposed to the same effect, but the reasoning process appears to mitigate it depending on the provider. At the sequence level, we show that verbalized approaches are systematically biased and poorly correlated with correctness, while answer frequency (consistency across samples) yields the most reliable calibration. In the wake of our analysis, we study and report the misleading effect of relying exclusively on ECE as a sole measure for judging performance of UQ methods on benchmark datasets. Our findings expose critical limitations of current UQ methods for LLMs and standard practices in benchmarking thereof.
zh
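Since the paper cautions against leaning on ECE alone, it helps to see exactly what the metric computes; a standard binned implementation follows, run here on synthetic, deliberately overconfident predictions.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Standard ECE: bin predictions by confidence and average the
    |accuracy - confidence| gap, weighted by bin size."""
    conf = np.asarray(conf, float)
    correct = np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=10_000)   # e.g. answer-frequency scores
correct = rng.random(10_000) < conf * 0.8   # systematically overconfident
print(f"ECE = {expected_calibration_error(conf, correct):.3f}")
```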
[NLP-222] ABES: Trajectory-Aware Backward-on-Entropy Steering for Masked Diffusion Models
【Quick Read】: This paper addresses trajectory lock-in in masked diffusion model (MDM) sampling: local decisions made without long-horizon planning let early hallucinations cascade into global incoherence, while existing fixes rely on simple confidence heuristics or costly search (O(K) forward passes per step). The key to the solution is Backward-on-Entropy (BoE) Steering, a gradient-guided inference framework that approximates infinite-horizon lookahead with a single backward pass. It formally derives the Token Influence Score (TIS), showing that the gradient of future entropy with respect to input embeddings serves as an optimal control signal for minimizing uncertainty, and introduces ActiveQueryAttention, which exploits the structure of the masking objective to cut backward-pass complexity, enabling robust and efficient non-autoregressive generation at inference time.
Link: https://arxiv.org/abs/2602.00250
Authors: Shreshth Saini,Avinab Saha,Balu Adsumilli,Neil Birkbeck,Yilin Wang,Alan C. Bovik
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Masked Diffusion Models (MDMs) have emerged as a promising non-autoregressive paradigm for generative tasks, offering parallel decoding and bidirectional context utilization. However, current sampling methods rely on simple confidence-based heuristics that ignore the long-term impact of local decisions, leading to trajectory lock-in where early hallucinations cascade into global incoherence. While search-based methods mitigate this, they incur prohibitive computational costs ( O(K) forward passes per step). In this work, we propose Backward-on-Entropy (BoE) Steering, a gradient-guided inference framework that approximates infinite-horizon lookahead via a single backward pass. We formally derive the Token Influence Score (TIS) from a first-order expansion of the trajectory cost functional, proving that the gradient of future entropy with respect to input embeddings serves as an optimal control signal for minimizing uncertainty. To ensure scalability, we introduce ActiveQueryAttention, a sparse adjoint primitive that exploits the structure of the masking objective to reduce backward pass complexity. BoE achieves a superior Pareto frontier for inference-time scaling compared to existing unmasking methods, demonstrating that gradient-guided steering offers a mathematically principled and efficient path to robust non-autoregressive generation. We will release the code.
zh
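The single-backward-pass idea can be sketched in PyTorch: differentiate the entropy at still-masked positions with respect to the input embeddings and read off one score per token. The model interface and the use of a gradient norm as the score are our assumptions; the toy linear "denoiser" only makes the snippet self-contained (a real denoiser mixes tokens via attention, so every position would receive a nonzero score).

```python
import torch

def token_influence_scores(model, embeddings, masked_positions):
    """TIS sketch: gradient of future uncertainty (entropy at still-masked
    positions) w.r.t. input embeddings, from a single backward pass.
    `model` maps embeddings (1, T, d) -> logits (1, T, V)."""
    emb = embeddings.detach().requires_grad_(True)
    logits = model(emb)
    log_p = torch.log_softmax(logits[0, masked_positions], dim=-1)
    entropy = -(log_p.exp() * log_p).sum(dim=-1).mean()
    (grad,) = torch.autograd.grad(entropy, emb)
    return grad.norm(dim=-1).squeeze(0)  # one influence score per token

# Toy position-wise "denoiser" standing in for a masked diffusion model.
T, d, V = 6, 16, 100
denoiser = torch.nn.Linear(d, V)
scores = token_influence_scores(denoiser, torch.randn(1, T, d), [3, 4, 5])
print(scores)
```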
[NLP-223] DIVERGE: Diversity-Enhanced RAG for Open-Ended Information Seeking
【速读】: 该论文旨在解决现有检索增强生成(Retrieval-Augmented Generation, RAG)系统在处理多答案信息查询时存在的多样性不足问题,即当前方法倾向于生成单一主导响应,导致创造性受限且信息获取不公平。其核心解决方案是提出 DIVERGE 框架,该框架通过新颖的反思引导生成(reflection-guided generation)与记忆增强的迭代优化(memory-augmented iterative refinement)机制,在保持答案质量的同时显著提升生成多样性。关键创新在于显式建模检索上下文多样性,并设计了针对开放性问题的多样性-质量权衡评估指标,实验证明 DIVERGE 在真实世界 Infinity-Chat 数据集上优于现有基线方法,有效缓解了大语言模型(Large Language Models, LLMs)在开放信息检索任务中的系统性多样性缺陷。
链接: https://arxiv.org/abs/2602.00238
作者: Tianyi Hu,Niket Tandon,Akhil Arora
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Existing retrieval-augmented generation (RAG) systems are primarily designed under the assumption that each query has a single correct answer. This overlooks common information-seeking scenarios with multiple plausible answers, where diversity is essential to avoid collapsing to a single dominant response, thereby constraining creativity and compromising fair and inclusive information access. Our analysis reveals a commonly overlooked limitation of standard RAG systems: they underutilize retrieved context diversity, such that increasing retrieval diversity alone does not yield diverse generations. To address this limitation, we propose DIVERGE, a plug-and-play agentic RAG framework with novel reflection-guided generation and memory-augmented iterative refinement, which promotes diverse viewpoints while preserving answer quality. We introduce novel metrics tailored to evaluating the diversity-quality trade-off in open-ended questions, and show that they correlate well with human judgments. We demonstrate that DIVERGE achieves the best diversity-quality trade-off compared to competitive baselines and previous state-of-the-art methods on the real-world Infinity-Chat dataset, substantially improving diversity while maintaining quality. More broadly, our results reveal a systematic limitation of current LLM-based systems for open-ended information-seeking and show that explicitly modeling diversity can mitigate it. Our code is available at: this https URL
zh
[NLP-224] Block removal for large language models through constrained binary optimization
【Quick Read】: This paper addresses the combinatorial difficulty of compressing large language models (LLMs) by removing whole transformer blocks: identifying the best, possibly non-consecutive, removal configuration with minimal performance loss is exponentially hard. The key to the solution is to formulate block removal as a constrained binary optimization problem and map it to an Ising model, whose energies serve as a strong proxy for downstream performance. This enables efficient ranking of many candidate removal configurations and yields numerous high-quality, non-consecutive solutions, improving model efficiency without elaborate retraining, with gains of up to 6 points on the MMLU benchmark.
Link: https://arxiv.org/abs/2602.00161
Authors: David Jansen,Roman Rausch,David Montero,Roman Orus
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Quantum Physics (quant-ph)
Comments: 7 pages, 5 figures
Abstract:Compressing resource-intensive large language models by removing whole transformer blocks is a seemingly simple idea, but identifying which blocks to remove constitutes an exponentially difficult combinatorial problem. In this paper, we formulate block removal as a constrained binary optimization problem that can be mapped to a physical system (Ising model), whose energies are a strong proxy for downstream model performance. This formulation enables an efficient ranking of a large number of candidate block-removal configurations and yields many high-quality, non-trivial solutions beyond consecutive regions. We demonstrate that our approach outperforms state-of-the-art block-removal methods across several benchmarks, with performance gains persisting after short retraining, and reaching improvements of up to 6 points on the MMLU benchmark. Our method requires only forward and backward passes for a few active parameters, together with an (at least approximate) Ising solver, and can be readily applied to any architecture. We illustrate this generality on the recent NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 model, which exhibits a highly inhomogeneous and challenging block structure.
zh
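A toy version of the formulation: score binary removal masks with an Ising-style energy and pick the minimizer. The coefficients here are random stand-ins for the calibrated ones, the exactly-k soft constraint is our simplification, and brute force replaces the (approximate) Ising solver the paper assumes.

```python
import itertools
import numpy as np

def energy(x, h, J, k, penalty=10.0):
    """Ising-style energy of a removal mask x (1 = remove block i):
    linear per-block costs h, pairwise couplings J, plus a soft
    constraint forcing exactly k removals."""
    x = np.asarray(x)
    return h @ x + x @ J @ x + penalty * (x.sum() - k) ** 2

rng = np.random.default_rng(0)
n, k = 12, 3                 # 12 transformer blocks, remove 3
h = rng.normal(size=n)
J = rng.normal(scale=0.1, size=(n, n))
J = (J + J.T) / 2            # symmetric couplings

# Brute force is fine at this scale; an Ising solver replaces this
# loop for real model depths.
best = min(itertools.product([0, 1], repeat=n),
           key=lambda x: energy(x, h, J, k))
print("blocks to remove:", [i for i, b in enumerate(best) if b])
```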
[NLP-225] Reversible Diffusion Decoding for Diffusion Language Models
【Quick Read】: This paper addresses stagnation in block-wise decoding of diffusion language models: irreversible commitments can leave the reverse diffusion process stuck in a suboptimal state from which it cannot progress, degrading generation quality. The key to the solution is Reversible Diffusion Decoding (RDD), which detects stagnation as a state-dependent failure of the reverse process and backtracks efficiently to earlier blocks via cached model states, without recomputation; confidence-guided re-masking then selectively reinitializes uncertain tokens to avoid repeating failure trajectories, improving robustness and quality while keeping the parallel efficiency of diffusion-based generation.
Link: https://arxiv.org/abs/2602.00150
Authors: Xinyun Wang,Min Zhang,Sen Cui,Zhikang Chen,Bo Jiang,Kun Kuang,Mingbao Lin
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Diffusion language models enable parallel token generation through block-wise decoding, but their irreversible commitments can lead to stagnation, where the reverse diffusion process fails to make further progress from a suboptimal state. We propose Reversible Diffusion Decoding (RDD), a decoding framework that introduces reversibility into block-wise diffusion generation. RDD detects stagnation as a state-dependent failure of the reverse process and enables efficient backtracking to earlier blocks without recomputation via cached model states. To avoid repeated failure trajectories, RDD applies confidence-guided re-masking to selectively reinitialize uncertain tokens while preserving reliable ones. This reversible formulation allows decoding to recover from early commitment errors while maintaining the parallel efficiency of diffusion-based generation. Experiments show that RDD improves generation robustness and quality over baselines with minimal computational overhead.
zh
[NLP-226] Interpreting and Controlling Model Behavior via Constitutions for Atomic Concept Edits
【Quick Read】: This paper addresses the interpretability and controllability of large-model behavior through a verifiable "constitution": a natural-language summary of how changes to a prompt affect a model's responses across tasks. The key is atomic concept edits (ACEs), targeted operations that add, remove, or replace a specific interpretable concept in the input prompt; by applying ACEs systematically and observing the resulting behavior changes, the framework learns a causal mapping from edits to outcomes. Beyond revealing structural characteristics of the model's internal decision logic, the learned constitutions enable effective control of model behavior, raising success rates by an average of 1.86x on tasks such as mathematical reasoning and text-to-image generation.
Link: https://arxiv.org/abs/2602.00092
Authors: Neha Kalibhat,Zi Wang,Prasoon Bajpai,Drew Proud,Wenjun Zeng,Been Kim,Mani Malek
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We introduce a black-box interpretability framework that learns a verifiable constitution: a natural language summary of how changes to a prompt affect a model’s specific behavior, such as its alignment, correctness, or adherence to constraints. Our method leverages atomic concept edits (ACEs), which are targeted operations that add, remove, or replace an interpretable concept in the input prompt. By systematically applying ACEs and observing the resulting effects on model behavior across various tasks, our framework learns a causal mapping from edits to predictable outcomes. This learned constitution provides deep, generalizable insights into the model. Empirically, we validate our approach across diverse tasks, including mathematical reasoning and text-to-image alignment, for controlling and understanding model behavior. We found that for text-to-image generation, GPT-Image tends to focus on grammatical adherence, while Imagen 4 prioritizes atmospheric coherence. In mathematical reasoning, distractor variables confuse GPT-5 but leave Gemini 2.5 models and o4-mini largely unaffected. Moreover, our results show that the learned constitutions are highly effective for controlling model behavior, achieving an average of 1.86 times boost in success rate over methods that do not use constitutions.
zh
[NLP-227] CARE-RFT: Confidence-Anchored Reinforcement Finetuning for Reliable Reasoning in Large Language Models
【Quick Read】: This paper addresses the degradation of trustworthiness and calibration that reinforcement finetuning (RFT) causes while boosting reasoning in large language models: unconstrained RFT amplifies hallucination and worsens calibration, whereas RKL-constrained RFT preserves trustworthiness but caps reasoning gains because it penalizes exploratory deviations without bound. The key to the proposed CARE-RFT (Confidence-Anchored Regularized Reinforcement Finetuning) is to replace standard reverse KL regularization with a skew reverse KL divergence, yielding a confidence-sensitive penalty: bounded for confident, consistently rewarded exploration (supporting reasoning) and unbounded elsewhere (preserving calibration). Across model scales and RFT algorithms, CARE-RFT achieves the best balance between reasoning performance and trustworthiness.
Link: https://arxiv.org/abs/2602.00085
Authors: Shuozhe Li,Jincheng Cao,Bodun Hu,Aryan Mokhtari,Leqi Liu,Amy Zhang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Reinforcement finetuning (RFT) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. However, we identify a critical trade-off: while unconstrained RFT achieves strong reasoning performance, it severely compromises model trustworthiness by amplifying hallucination and worsening calibration; conversely, RKL-constrained RFT preserves trustworthiness but limits reasoning gains due to its unbounded penalty on exploratory deviations. To resolve this tension, we introduce CARE-RFT (Confidence-Anchored Regularized Reinforcement Finetuning), a novel method that replaces standard reverse KL regularization with a skew reverse KL divergence. CARE-RFT provides a confidence-sensitive penalty: it is bounded for confident, consistently rewarded explorations to enable reasoning, while unbounded elsewhere to preserve calibration. Extensive experiments across multiple model scales and RFT algorithms show that CARE-RFT achieves a superior balance, matching the reasoning performance of unconstrained RFT while recovering the trustworthiness and calibration of the base model. Our work establishes that careful, confidence-aware regularization is key to building both capable and trustworthy reasoning models.
zh
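The boundedness argument is easy to see numerically. In the sketch below the skew is placed inside the reverse KL's second argument, which is our reading of the skew-divergence family rather than the paper's exact parameterization.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return np.sum(p * np.log(p / q))

def skew_reverse_kl(pi, ref, alpha=0.1):
    """Skew reverse KL: KL(pi || (1 - alpha) * ref + alpha * pi).
    Mixing a little of the policy into the reference bounds the penalty
    at about log(1/alpha) even where ref -> 0, so confident exploration
    is not punished without limit."""
    mix = (1 - alpha) * np.asarray(ref) + alpha * np.asarray(pi)
    return kl(pi, mix)

pi  = np.array([0.98, 0.01, 0.01])   # confident exploratory policy
ref = np.array([0.01, 0.49, 0.50])   # reference assigns it almost no mass
print(f"reverse KL:      {kl(pi, ref):.2f}")               # large penalty
print(f"skew reverse KL: {skew_reverse_kl(pi, ref):.2f}")  # bounded penalty
```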
[NLP-228] SPARC-RAG : Adaptive Sequential-Parallel Scaling with Context Management for Retrieval-Augmented Generation
【Quick Read】: This paper addresses the bottlenecks of retrieval-augmented generation (RAG) on multi-hop question answering with long reasoning chains: naive inference-time scaling along sequential depth and parallel width causes context contamination and scaling inefficiency, so added computation brings diminishing or even negative returns. The key to the solution is SPARC-RAG, a multi-agent framework that coordinates sequential and parallel scaling under a unified context-management mechanism: specialized agents maintain a shared global context, generate targeted and complementary sub-queries to diversify parallel exploration, and explicitly regulate exit decisions based on answer correctness and evidence grounding. A lightweight fine-tuning method with process-level verifiable preferences further improves sequential-scaling efficiency and parallel-scaling effectiveness, yielding an average +6.2 F1 on multi-hop QA at lower inference cost.
Link: https://arxiv.org/abs/2602.00083
Authors: Yuxin Yang,Gangda Deng,Ömer Faruk Akgül,Nima Chitsazan,Yash Govilkar,Akasha Tigalappanavara,Shi-Xiong Zhang,Sambit Sahu,Viktor Prasanna
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Retrieval-Augmented Generation (RAG) grounds large language model outputs in external evidence, but remains challenged on multi-hop question answering that requires long reasoning. Recent works scale RAG at inference time along two complementary dimensions: sequential depth for iterative refinement and parallel width for coverage expansion. However, naive scaling causes context contamination and scaling inefficiency, leading to diminishing or negative returns despite increased computation. To address these limitations, we propose SPARC-RAG, a multi-agent framework that coordinates sequential and parallel inference-time scaling under a unified context management mechanism. SPARC-RAG employs specialized agents that maintain a shared global context and provide explicit control over the scaling process. It generates targeted, complementary sub-queries for each branch to enable diverse parallel exploration, and explicitly regulates exiting decisions based on answer correctness and evidence grounding. To optimize scaling behavior, we further introduce a lightweight fine-tuning method with process-level verifiable preferences, which improves the efficiency of sequential scaling and effectiveness of parallel scaling. Across single- and multi-hop QA benchmarks, SPARC-RAG consistently outperforms previous RAG baselines, yielding an average +6.2 F1 improvement under lower inference cost.
zh
[NLP-229] FoundationalASSIST: An Educational Dataset for Foundational Knowledge Tracing and Pedagogical Grounding of LLM s
【Quick Read】: This paper addresses the fact that current large language models (LLMs) lack a demonstrated understanding of how students learn, and that existing educational datasets, offering only question identifiers and binary correctness labels, cannot support LLMs that reason in natural language for uses such as adaptive testing and personalized tutoring. The key contribution is FoundationalASSIST, the first English educational dataset providing everything needed for such research: full question text, students' actual responses (not just right/wrong), records of which wrong answers were chosen, and alignment to Common Core K-12 standards, totaling 1.7 million interactions from 5,000 students. The dataset enables previously impossible directions such as fine-tuning student models and analyzing misconception patterns, and evaluations on Knowledge Tracing and Pedagogical Grounding tasks reveal substantial gaps in current LLMs' ability to predict student performance and to understand what makes assessment items diagnostic.
Link: https://arxiv.org/abs/2602.00070
Authors: Eamon Worden,Cristina Heffernan,Neil Heffernan,Shashank Sonkar
Affiliations: Worcester Polytechnic Institute; The ASSISTments Foundation; University of Central Florida
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Can Large Language Models understand how students learn? As LLMs are deployed for adaptive testing and personalized tutoring, this question becomes urgent – yet we cannot answer it with existing resources. Current educational datasets provide only question identifiers and binary correctness labels, rendering them opaque to LLMs that reason in natural language. We address this gap with FoundationalASSIST, the first English educational dataset providing the complete information needed for research on LLMs in education: full question text, actual student responses (not just right/wrong), records of which wrong answers students chose, and alignment to Common Core K-12 standards. These 1.7 million interactions from 5,000 students enable research directions that were previously impossible to pursue, from fine-tuning student models to analyzing misconception patterns. To demonstrate the dataset's utility, we evaluate four frontier models (GPT-OSS-120B, Llama-3.3-70B, Qwen3-Next-80B variants) on two complementary task families: Knowledge Tracing, testing whether LLMs can predict student performance on questions, and the exact answer a student will give; and Pedagogical Grounding, testing whether LLMs understand the properties that make assessment items effective. Our evaluation reveals significant gaps in current LLM capabilities. Every model barely achieves a trivial baseline on knowledge tracing. All models fall below random chance on item discrimination, indicating that LLMs do not understand what makes one problem more diagnostic than another. Models do show competence at judging relative difficulty (up to 68.6%), but this partial success only highlights the gaps elsewhere. These results establish that substantial advances are needed before LLMs can reliably support personalized learning at scale. We release FoundationalASSIST to support progress on these foundational challenges.
zh
[NLP-230] IntentCoding: Amplifying User Intent in Code Generation
【速读】: This paper targets the problem that large language models (LLMs) struggle to follow fine-grained user intent in code generation, especially when multiple constraints are present: performance degrades sharply as the number of constraints grows, and user intent exerts too weak an influence on the model's logits to effectively steer decoding. The key to the solution is IntentCoding, a novel decoding strategy that measures the influence of user intent by masking the intent out of the prompt, then applies a multi-strength ensemble mechanism to amplify that influence during generation, strengthening the model's responsiveness to complex constraints. The method requires no additional training, integrates with existing decoding pipelines, and is validated on CodeConstraints, IFEvalCode, HumanEval, and LiveCodeBench, where it substantially improves both constraint satisfaction and functional correctness.
链接: https://arxiv.org/abs/2602.00066
作者: Zheng Fang,Yihong Dong,Lili Mou,Dongming Jin,Zhi Jin,Ge Li
机构: 未知
类目: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) have shown strong capabilities in code generation, but their adherence to fine-grained user intent with multiple constraints remains a significant challenge. Our empirical analysis reveals two key observations: 1) Model performance deteriorates quickly as the number of constraints in the user intent increases, and 2) While user intent does influence the model’s logits, such an influence may not be strong enough to effectively steer the decoding process. To this end, we propose Intent-Amplified Code Generation (IntentCoding), a novel decoding strategy that enhances an LLM’s ability to follow user intent. IntentCoding captures the influence of user intent by masking out the intent, and applies a multi-strength ensemble mechanism to amplify the effect of user intent during generation. IntentCoding is model-agnostic, requires no additional training, and integrates seamlessly with existing decoding procedures. To enable systematic evaluation, we also construct CodeConstraints, a benchmark dataset specifically designed to test user intent compliance under varying numbers of constraints. Experiments on our constructed CodeConstraints, as well as the popular IFEvalCode, HumanEval and LiveCodeBench datasets, show that our IntentCoding method significantly improves both constraint satisfaction and functional correctness compared to standard decoding approaches. IntentCoding achieves up to 71.0% relative improvement on CodeConstraints, up to 67.3% on IFEvalCode, and up to 29.3% in pass@1 on HumanEval and LiveCodeBench compared with greedy decoding.
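The amplification mechanism can be sketched as a contrastive-decoding-style step. The strength values and the averaging rule below are assumptions, and `model` stands for any Hugging-Face-style causal LM:

```python
import torch

def intent_amplified_logits(model, full_ids, masked_ids, strengths=(0.5, 1.0, 1.5)):
    """One decoding step of an IntentCoding-style scheme (a sketch from the
    abstract, not the authors' code): the influence of the user intent is the
    gap between logits with and without the intent in the prompt, amplified
    at several strengths and averaged into an ensemble."""
    with torch.no_grad():
        logits_full = model(full_ids).logits[:, -1, :]      # intent present
        logits_masked = model(masked_ids).logits[:, -1, :]  # intent masked out
    delta = logits_full - logits_masked                     # intent influence
    ensemble = torch.stack([logits_full + s * delta for s in strengths])
    return ensemble.mean(dim=0)                             # next-token scores
```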
zh
[NLP-231] AI-assisted Protocol Information Extraction For Improved Accuracy and Efficiency in Clinical Trial Workflows
【速读】: This paper aims to reduce the burden that increasingly complex clinical trial protocols, frequent amendments, and difficult knowledge management place on trial teams. The key to the solution is the development and evaluation of a Retrieval-Augmented Generation (RAG) system built on generative AI for automatically extracting structured information from clinical trial protocols. Compared with publicly available LLMs using fine-tuned prompts alone, the RAG approach is more accurate (87.8% vs. 62.6%), and in simulated clinical research coordinator (CRC) workflows it speeds up task completion by 40% while being rated as less cognitively demanding and strongly preferred by users, indicating that AI assistance can improve the efficiency and quality of protocol information handling and has potential for deployment at scale.
链接: https://arxiv.org/abs/2602.00052
作者: Ramtin Babaeipour,François Charest,Madison Wright
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Increasing clinical trial protocol complexity, amendments, and challenges around knowledge management create a significant burden for trial teams. Structuring protocol content into standard formats has the potential to improve efficiency, support documentation quality, and strengthen compliance. We evaluate an Artificial Intelligence (AI) system using generative LLMs with Retrieval-Augmented Generation (RAG) for automated clinical trial protocol information extraction. We compare the extraction accuracy of our clinical-trial-specific RAG process against that of publicly available (standalone) LLMs. We also assess the operational impact of AI assistance on simulated clinical research coordinator (CRC) extraction workflows. Our RAG process was measured as more accurate (87.8%) than standalone LLMs with fine-tuned prompts (62.6%) against expert-supported reference annotations. In the simulated extraction workflows, AI-assisted tasks were completed 40% faster, rated as less cognitively demanding and strongly preferred by users. While expert oversight remains essential, this suggests that AI-assisted extraction can enable protocol intelligence at scale, motivating the integration of similar methodologies into real-world clinical workflows to further validate its impact on feasibility, study start-up, and post-activation monitoring.
zh
[NLP-232] Extending Beacon to Hindi: Cultural Adaptation Drives Cross-Lingual Sycophancy
【速读】: This paper asks whether the alignment behavior of generative AI is consistent across languages and cultures, specifically whether sycophancy, a known alignment failure, is confined to English settings. The key to the solution is a controlled three-condition experiment comparing English originals, literal Hindi translations, and culturally adapted Hindi versions of the same prompts, evaluated across four open-weight instruction-tuned models so that language-encoding effects can be separated from cultural-adaptation effects. The results show that cultural adaptation raises sycophancy rates substantially (by 12-16 percentage points) while literal translation contributes little, indicating that cultural context plays the dominant role in shaping AI alignment behavior.
链接: https://arxiv.org/abs/2602.00046
作者: Sarthak Sattigeri
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: First Hindi sycophancy benchmark using a three-condition design separating language and cultural effects, with empirical evaluation across four instruction-tuned models
Abstract:Sycophancy, the tendency of language models to prioritize agreement with user preferences over principled reasoning, has been identified as a persistent alignment failure in English-language evaluations. However, it remains unclear whether such diagnostics generalize across languages and cultural contexts. We extend the Beacon single-turn forced-choice sycophancy diagnostic to Hindi through a controlled three-condition design: English original, Hindi literal translation, and Hindi culturally adapted prompts. We evaluate four open-weight instruction-tuned models on 50 prompts per condition, enabling separation of language encoding effects from cultural adaptation effects. Across all models, sycophancy rates are consistently higher for culturally adapted Hindi prompts than for English, with absolute differences ranging from 12.0 to 16.0 percentage points. A decomposition on Qwen 2.5-Coder-7B shows that cultural adaptation (delta = 14.0%, 95% CI: [4.0%, 26.0%]) accounts for the majority of this gap, while language encoding contributes minimally (delta = 2.0%, 95% CI: [0.0%, 6.0%]). Category-level analysis reveals that advice prompts exhibit the largest cross-lingual differences (20-25 percentage points), achieving statistical significance in two of four models. These findings indicate that alignment behaviors measured in English may not transfer uniformly across languages and that culturally grounded prompt framing plays a substantial role. We release all datasets and evaluation code to support replication and extension.
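The reported deltas with percentile confidence intervals can be reproduced with a simple bootstrap. This sketch resamples each condition independently; whether the paper instead pairs prompts across conditions is not stated in the abstract:

```python
import random

def delta_ci(flags_a, flags_b, n_boot=10_000, seed=0):
    """Percentile bootstrap CI for the sycophancy-rate gap between two
    conditions (e.g., culturally adapted Hindi vs. English). `flags_*` are
    0/1 lists marking sycophantic responses on the evaluated prompts."""
    rng = random.Random(seed)
    rate = lambda xs: sum(xs) / len(xs)
    point = rate(flags_a) - rate(flags_b)
    deltas = []
    for _ in range(n_boot):
        ra = [rng.choice(flags_a) for _ in flags_a]   # resample condition A
        rb = [rng.choice(flags_b) for _ in flags_b]   # resample condition B
        deltas.append(rate(ra) - rate(rb))
    deltas.sort()
    return point, (deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot)])

# e.g., with 50 prompts per condition as in the paper's setup:
# point, (lo, hi) = delta_ci(hindi_adapted_flags, english_flags)
```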
zh
[NLP-233] Construct Align and Reason : Large Ontology Models for Enterprise Knowledge Management
【速读】: This paper addresses the insufficient ability of enterprise knowledge management to fuse multi-source heterogeneous data and perform semantic reasoning, particularly the limited semantic understanding of traditional knowledge graphs for implicit relationship discovery and complex question answering. The key to the solution is a unified construct-align-reason framework, the Large Ontology Model (LOM), with two core components: 1) building a dual-layer enterprise ontology from structured databases and unstructured text and fusing the two sources into a comprehensive ontology; and 2) a three-stage training pipeline consisting of ontology instruction fine-tuning to strengthen structural understanding, text-ontology alignment to improve node semantic encoding, and curriculum-based multi-task instruction tuning to enhance semantic reasoning and generation. The approach effectively fuses ontology structure with linguistic information, reaching 89.47% accuracy and outperforming DeepSeek-V3.2 on complex graph reasoning.
链接: https://arxiv.org/abs/2602.00029
作者: Yao Zhang,Hongyin Zhu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Enterprise-scale knowledge management faces significant challenges in integrating multi-source heterogeneous data and enabling effective semantic reasoning. Traditional knowledge graphs often struggle with implicit relationship discovery and lack sufficient semantic understanding for complex question answering. To address these limitations, we introduce a unified construct–align–reason framework, the large ontology model (LOM). We first build a dual-layer enterprise ontology from structured databases and unstructured text, subsequently fusing these sources into a comprehensive enterprise ontology. To enable instruction-aligned reasoning, we propose a unified three-stage training pipeline: ontology instruction fine-tuning to improve structural understanding; text-ontology grounding to strengthen node semantic encoding; and multi-task instruction tuning on ontology-language pairs with curriculum learning to enhance semantic reasoning and generation. We also construct comprehensive training and evaluation datasets covering diverse ontology reasoning tasks. On this benchmark, our 4B-parameter LOM achieves 89.47% accuracy and outperforms DeepSeek-V3.2 on complex graph reasoning, indicating effective fusion of ontology structure and language.
zh
计算机视觉
[CV-0] PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss
【速读】: This paper tackles the difficulty of optimizing pixel diffusion models on high-dimensional pixel manifolds, where abundant perceptually irrelevant signals leave generation quality behind latent diffusion models. The key to the solution is introducing two complementary forms of perceptual supervision: an LPIPS (Learned Perceptual Image Patch Similarity) loss to improve local pattern learning, and a DINO-based perceptual loss to strengthen global semantic consistency. With this perceptual supervision, PixelGen steers the diffusion model toward a more meaningful perceptual manifold and surpasses strong latent diffusion baselines without requiring VAEs, latent representations, or auxiliary stages.
链接: https://arxiv.org/abs/2602.02493
作者: Zehong Ma,Ruihan Xu,Shiliang Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project Pages: this https URL
Abstract:Pixel diffusion generates images directly in pixel space in an end-to-end manner, avoiding the artifacts and bottlenecks introduced by VAEs in two-stage latent diffusion. However, it is challenging to optimize high-dimensional pixel manifolds that contain many perceptually irrelevant signals, leaving existing pixel diffusion methods lagging behind latent diffusion models. We propose PixelGen, a simple pixel diffusion framework with perceptual supervision. Instead of modeling the full image manifold, PixelGen introduces two complementary perceptual losses to guide diffusion model towards learning a more meaningful perceptual manifold. An LPIPS loss facilitates learning better local patterns, while a DINO-based perceptual loss strengthens global semantics. With perceptual supervision, PixelGen surpasses strong latent diffusion baselines. It achieves an FID of 5.11 on ImageNet-256 without classifier-free guidance using only 80 training epochs, and demonstrates favorable scaling performance on large-scale text-to-image generation with a GenEval score of 0.79. PixelGen requires no VAEs, no latent representations, and no auxiliary stages, providing a simpler yet more powerful generative paradigm. Codes are publicly available at this https URL.
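A minimal sketch of the training objective, assuming an x0-prediction parameterization and unit loss weights (both assumptions; `lpips_fn` would be an LPIPS module and `dino` a frozen DINO encoder):

```python
import torch
import torch.nn.functional as F

def pixelgen_style_loss(x0_pred, x0, lpips_fn, dino, w_lpips=1.0, w_dino=1.0):
    """Perceptually supervised pixel diffusion loss, sketched from the
    abstract. `x0_pred` is the model's clean-image estimate at the current
    noise level; `x0` is the ground-truth image."""
    loss = F.mse_loss(x0_pred, x0)                         # pixel objective
    loss = loss + w_lpips * lpips_fn(x0_pred, x0).mean()   # local patterns
    f_pred, f_real = dino(x0_pred), dino(x0)               # global semantics
    loss = loss + w_dino * (1 - F.cosine_similarity(f_pred, f_real, dim=-1)).mean()
    return loss
```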
zh
[CV-1] Multi-head automated segmentation by incorporating detection head into the contextual layer neural network
【速读】: This paper addresses a problem with deep-learning auto-segmentation in radiotherapy: conventional models produce anatomically implausible false positives, or "hallucinations", on image slices that do not contain the target structure. The key to the solution is a gated multi-head Transformer architecture based on Swin U-Net that adds inter-slice context integration and a parallel detection head, jointly optimizing slice-level structure detection and pixel-level segmentation. The detection output acts as a gate that suppresses segmentation predictions on anatomically invalid slices, sharply reducing false positives while preserving segmentation quality on valid slices; on the Prostate-Anatomical-Edge-Cases dataset, the mean Dice loss drops from 0.732 to 0.013, and the detection probabilities correlate strongly with anatomical presence, improving the reliability and anatomical plausibility of automated contouring.
链接: https://arxiv.org/abs/2602.02471
作者: Edwin Kys,Febian Febian
机构: Linnear(林纳); UCL(伦敦大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
备注: 8 pages, 3 figures, 1 table
Abstract:Deep learning based auto segmentation is increasingly used in radiotherapy, but conventional models often produce anatomically implausible false positives, or hallucinations, in slices lacking target structures. We propose a gated multi-head Transformer architecture based on Swin U-Net, augmented with inter-slice context integration and a parallel detection head, which jointly performs slice-level structure detection via a multi-layer perceptron and pixel-level segmentation through a context-enhanced stream. Detection outputs gate the segmentation predictions to suppress false positives in anatomically invalid slices, and training uses slice-wise Tversky loss to address class imbalance. Experiments on the Prostate-Anatomical-Edge-Cases dataset from The Cancer Imaging Archive demonstrate that the gated model substantially outperforms a non-gated segmentation-only baseline, achieving a mean Dice loss of 0.013 \pm 0.036 versus 0.732 \pm 0.314 , with detection probabilities strongly correlated with anatomical presence, effectively eliminating spurious segmentations. In contrast, the non-gated model exhibited higher variability and persistent false positives across all slices. These results indicate that detection-based gating enhances robustness and anatomical plausibility in automated segmentation applications, reducing hallucinated predictions without compromising segmentation quality in valid slices, and offers a promising approach for improving the reliability of clinical radiotherapy auto-contouring workflows.
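A compact sketch of the gating idea: a slice-level MLP head predicts structure presence and multiplicatively suppresses the segmentation map, with a slice-wise Tversky loss for class imbalance. The backbone interface and the alpha/beta values here are placeholders, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

def tversky_loss(probs, target, alpha=0.7, beta=0.3, eps=1e-6):
    # Slice-wise Tversky loss; alpha/beta trade off false negatives vs.
    # false positives (illustrative values).
    tp = (probs * target).sum(dim=(1, 2, 3))
    fn = ((1 - probs) * target).sum(dim=(1, 2, 3))
    fp = (probs * (1 - target)).sum(dim=(1, 2, 3))
    return (1 - (tp + eps) / (tp + alpha * fn + beta * fp + eps)).mean()

class GatedSegHead(nn.Module):
    """Detection-gated segmentation on top of a shared encoder. In the paper
    the backbone is a context-enhanced Swin U-Net; here it is any module
    returning per-slice features and a segmentation logit map."""
    def __init__(self, backbone, feat_dim):
        super().__init__()
        self.backbone = backbone
        self.detect = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                    nn.Linear(64, 1))      # slice-level MLP head
    def forward(self, x):
        feats, seg_logits = self.backbone(x)               # (B,D), (B,1,H,W)
        p_present = torch.sigmoid(self.detect(feats))      # P(structure in slice)
        seg = torch.sigmoid(seg_logits)
        gated = seg * p_present.view(-1, 1, 1, 1)          # suppress invalid slices
        return gated, p_present
```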
zh
[CV-2] MentisOculi: Revealing the Limits of Reasoning with Mental Imagery
【速读】: This paper asks whether current unified multimodal models (UMMs) can effectively use visual representations for goal-oriented reasoning, i.e., whether visual thoughts improve model reasoning. The key to the solution is MentisOculi, a procedural, stratified suite of multi-step reasoning problems that are amenable to visual solution (via latent tokens or explicitly generated images) and tuned to challenge frontier models. Systematic evaluation of these visual strategies shows that although UMMs possess the textual reasoning capacity to solve the tasks and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to exploit even ground-truth visualizations, exposing the core limitation that visual thoughts do not yet improve model reasoning.
链接: https://arxiv.org/abs/2602.02465
作者: Jana Zeller,Thaddäus Wiedemer,Fanfei Li,Thomas Klein,Prasanna Mayilvahanan,Matthias Bethge,Felix Wichmann,Ryan Cotterell,Wieland Brendel
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 9 pages, 8 figures
Abstract:Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to human mental imagery. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goal-oriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution, tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: While they possess the textual reasoning capacity to solve a task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground-truth visualizations. Our findings suggest that despite their inherent appeal, visual thoughts do not yet benefit model reasoning. MentisOculi establishes the necessary foundation to analyze and close this gap across diverse model families.
zh
[CV-3] RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval
【速读】: This paper addresses the underexplored area of reasoning-based reranking for video retrieval, particularly the use of video content for fine-grained relevance judgments. The core of the solution is RANKVIDEO, a reasoning-based video reranker that explicitly reasons over query-video pairs using the video content to assess relevance. The key innovations are a two-stage curriculum, perception-grounded supervised fine-tuning followed by reranking training that combines pointwise, pairwise, and teacher-confidence distillation objectives, together with a data synthesis pipeline for constructing reasoning-intensive query-video pairs. On the large-scale MultiVENT 2.0 benchmark, this yields an average 31% improvement in nDCG@10 while outperforming text-only and vision-language alternatives.
链接: https://arxiv.org/abs/2602.02444
作者: Tyler Skow,Alexander Martin,Benjamin Van Durme,Rama Chellappa,Reno Kriz
机构: Johns Hopkins University (约翰霍普金斯大学); Human Language Technology Center of Excellence (人类语言技术卓越中心)
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reranking is a critical component of modern retrieval systems, which typically pair an efficient first-stage retriever with a more expressive model to refine results. While large reasoning models have driven rapid progress in text-centric reranking, reasoning-based reranking for video retrieval remains underexplored. To address this gap, we introduce RANKVIDEO, a reasoning-based reranker for video retrieval that explicitly reasons over query-video pairs using video content to assess relevance. RANKVIDEO is trained using a two-stage curriculum consisting of perception-grounded supervised fine-tuning followed by reranking training that combines pointwise, pairwise, and teacher confidence distillation objectives, and is supported by a data synthesis pipeline for constructing reasoning-intensive query-video pairs. Experiments on the large-scale MultiVENT 2.0 benchmark demonstrate that RANKVIDEO consistently improves retrieval performance within a two-stage framework, yielding an average improvement of 31% on nDCG@10 and outperforming text-only and vision-language reranking alternatives while remaining more efficient.
zh
[CV-4] UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing
【速读】: This paper addresses the performance bottleneck of unified multimodal models on complex synthesis tasks that demand deep reasoning, where text-to-image generation and image editing are treated as isolated capabilities rather than interconnected reasoning steps. The key to the solution is the UniReason framework, which couples the two tasks through a dual reasoning paradigm: generation is formulated as world-knowledge-enhanced planning that injects implicit constraints, while editing capabilities are used for fine-grained visual refinement so that visual errors can be corrected via self-reflection. The approach unifies generation and editing in a shared representation space, mirroring the human plan-then-refine process, and is supported by a large-scale reasoning-centric dataset (~300K samples) plus an agent-generated visual self-correction corpus, yielding strong gains on reasoning-intensive benchmarks such as WISE, KrisBench, and UniREditBench.
链接: https://arxiv.org/abs/2602.02437
作者: Dianyi Wang,Chaofan Ma,Feng Han,Size Wu,Wei Song,Yibin Wang,Zhixiong Zhang,Tianhang Wang,Siyuan Wang,Zhongyu Wei,Jiaqi Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Unified multimodal models often struggle with complex synthesis tasks that demand deep reasoning, and typically treat text-to-image generation and image editing as isolated capabilities rather than interconnected reasoning steps. To address this, we propose UniReason, a unified framework that harmonizes these two tasks through a dual reasoning paradigm. We formulate generation as world knowledge-enhanced planning to inject implicit constraints, and leverage editing capabilities for fine-grained visual refinement to further correct visual errors via self-reflection. This approach unifies generation and editing within a shared representation, mirroring the human cognitive process of planning followed by refinement. We support this framework by systematically constructing a large-scale reasoning-centric dataset (~300k samples) covering five major knowledge domains (e.g., cultural commonsense, physics, etc.) for planning, alongside an agent-generated corpus for visual self-correction. Extensive experiments demonstrate that UniReason achieves advanced performance on reasoning-intensive benchmarks such as WISE, KrisBench and UniREditBench, while maintaining superior general synthesis capabilities.
zh
[CV-5] SelvaMask: Segmenting Trees in Tropical Forests and Beyond
【速读】: This paper addresses the difficulty of accurately delineating individual tree crowns in tropical forests, where existing Transformer-based models perform poorly under dense vegetation. The key to the solution is SelvaMask, a Neotropical forest dataset with over 8,800 manually delineated tree crowns, together with a modular detection-segmentation pipeline that adapts vision foundation models (VFMs) via a domain-specific detection prompter, outperforming both zero-shot generalist models and end-to-end supervised methods in dense tropical forests.
链接: https://arxiv.org/abs/2602.02426
作者: Simon-Olivier Duguay,Hugo Baudchon,Etienne Laliberté,Helene Muller-Landau,Gonzalo Rivas-Torres,Arthur Ouaknine
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 8 figures
Abstract:Tropical forests harbor most of the planet’s tree biodiversity and are critical to global ecological balance. Canopy trees in particular play a disproportionate role in carbon storage and functioning of these ecosystems. Studying canopy trees at scale requires accurate delineation of individual tree crowns, typically performed using high-resolution aerial imagery. Despite advances in transformer-based models for individual tree crown segmentation, performance remains low in most forests, especially tropical ones. To this end, we introduce SelvaMask, a new tropical dataset containing over 8,800 manually delineated tree crowns across three Neotropical forest sites in Panama, Brazil, and Ecuador. SelvaMask features comprehensive annotations, including an inter-annotator agreement evaluation, capturing the dense structure of tropical forests and highlighting the difficulty of the task. Leveraging this benchmark, we propose a modular detection-segmentation pipeline that adapts vision foundation models (VFMs), using domain-specific detection-prompter. Our approach reaches state-of-the-art performance, outperforming both zero-shot generalist models and fully supervised end-to-end methods in dense tropical forests. We validate these gains on external tropical and temperate datasets, demonstrating that SelvaMask serves as both a challenging benchmark and a key enabler for generalized forest monitoring. Our code and dataset will be released publicly.
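The modular pipeline reduces to detect-then-prompt. In this sketch, `detector` and `segmenter` are stand-ins for the domain-tuned crown detector and a promptable VFM such as SAM, and the score threshold is illustrative:

```python
# Hypothetical detect-then-segment pipeline in the spirit of SelvaMask;
# the released code may structure this differently.

def delineate_crowns(image, detector, segmenter, score_thresh=0.5):
    masks = []
    for box, score in detector(image):           # crown proposals from an RGB tile
        if score < score_thresh:
            continue
        mask = segmenter(image, box_prompt=box)  # VFM refines the box into a mask
        masks.append({"box": box, "score": score, "mask": mask})
    return masks
```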
zh
[CV-6] Catalyst: Out-of-Distribution Detection via Elastic Scaling
【速读】: This paper addresses a key safety issue in deploying deep neural networks: insufficient accuracy of out-of-distribution (OOD) detection. Existing state-of-the-art post-hoc methods compute OOD scores only from the output logits or from the feature vector obtained via global average pooling (GAP), discarding the rich channel-wise statistics of the pre-pooling feature map (mean, standard deviation, and maximum activation). The key to the solution is the Catalyst framework, which uses these neglected raw channel statistics to compute an input-dependent scaling factor (γ) on the fly and fuses it multiplicatively with the baseline OOD score, elastically modulating it to push ID (in-distribution) and OOD samples further apart. Catalyst integrates seamlessly with logit-based methods (e.g., Energy, ReAct, SCALE) and distance-based detectors such as KNN, reducing the average false positive rate by 22.25%-32.87% across benchmarks.
链接: https://arxiv.org/abs/2602.02409
作者: Abid Hassan,Tuan Ngo,Saad Shafiq,Nenad Medvidovic
机构: University of Southern California (南加州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Out-of-distribution (OOD) detection is critical for the safe deployment of deep neural networks. State-of-the-art post-hoc methods typically derive OOD scores from the output logits or penultimate feature vector obtained via global average pooling (GAP). We contend that this exclusive reliance on the logit or feature vector discards a rich, complementary signal: the raw channel-wise statistics of the pre-pooling feature map lost in GAP. In this paper, we introduce Catalyst, a post-hoc framework that exploits these under-explored signals. Catalyst computes an input-dependent scaling factor ( \gamma ) on-the-fly from these raw statistics (e.g., mean, standard deviation, and maximum activation). This \gamma is then fused with the existing baseline score, multiplicatively modulating it – an ``elastic scaling’’ – to push the ID and OOD distributions further apart. We demonstrate Catalyst is a generalizable framework: it seamlessly integrates with logit-based methods (e.g., Energy, ReAct, SCALE) and also provides a significant boost to distance-based detectors like KNN. As a result, Catalyst achieves substantial and consistent performance gains, reducing the average False Positive Rate by 32.87% on CIFAR-10 (ResNet-18), 27.94% on CIFAR-100 (ResNet-18), and 22.25% on ImageNet (ResNet-50). Our results highlight the untapped potential of pre-pooling statistics and demonstrate that Catalyst is complementary to existing OOD detection approaches.
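A sketch of elastic scaling on top of the energy score. How the three channel statistics combine into gamma is an assumption here (the paper derives or tunes this fusion), but the overall flow, from pre-pooling statistics to an input-dependent factor to multiplicative fusion, follows the description above:

```python
import torch

def catalyst_score(feat_map, logits, w=(1.0, 1.0, 1.0), b=0.0):
    """Catalyst-style elastic scaling, sketched. `feat_map`: pre-pooling
    features (B, C, H, W); the baseline OOD score is the energy score.
    The linear fusion weights `w`, `b` are placeholders."""
    stats = torch.stack([
        feat_map.mean(dim=(2, 3)).mean(dim=1),   # channel means
        feat_map.std(dim=(2, 3)).mean(dim=1),    # channel stds
        feat_map.amax(dim=(2, 3)).mean(dim=1),   # channel maxima
    ], dim=1)                                     # (B, 3) summary statistics
    wv = torch.tensor(w, device=feat_map.device)
    gamma = torch.relu(stats @ wv + b)            # input-dependent factor (>= 0)
    energy = torch.logsumexp(logits, dim=1)       # baseline OOD score
    return gamma * energy                         # higher => more in-distribution
```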
zh
[CV-7] Reason Edit: Editing Vision-Language Models using Human Reasoning
【速读】: This paper addresses the difficulty of correcting errors of pretrained vision-language models (VLMs) on reasoning-heavy tasks, which conventional model editing methods cannot handle when the tasks require human reasoning support. The key to the solution is ReasonEdit, a new editing framework that lets users supply their reasoning during editing, continuously stores this human reasoning in a codebook, and at inference retrieves only task-relevant facts via a network-science-inspired topology-balanced multimodal embedding method, substantially improving edit generalization.
链接: https://arxiv.org/abs/2602.02408
作者: Jiaxing Qiu,Kaihua Hou,Roxana Daneshjou,Ahmed Alaa,Thomas Hartvigsen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Model editing aims to correct errors in large, pretrained models without altering unrelated behaviors. While some recent works have edited vision-language models (VLMs), no existing editors tackle reasoning-heavy tasks, which typically require humans and models to reason. We therefore propose ReasonEdit, the first VLM editor to let users explain their reasoning during editing, introducing a new, practical model editing setup. ReasonEdit continuously stores human reasoning in a codebook, and retrieves only relevant facts during inference using a novel topology-balanced multimodal embedding method inspired by network science. Across four VLMs on multiple rationale-based visual question answering datasets, ReasonEdit achieves state-of-the-art editing performance, ultimately showing that using human reasoning during editing greatly improves edit generalization.
zh
[CV-8] SoMA: A Real-to-Sim Neural Simulator for Robotic Soft-body Manipulation
【速读】: This paper addresses high-fidelity simulation of deformable objects for real-world robot manipulation, where existing simulators lack robot-conditioned control and are therefore limited in modeling dynamic interactions, stability, and generalization. The key to the solution is SoMA, a 3D Gaussian Splat framework for soft-body manipulation that couples deformable dynamics, environmental forces, and robot joint actions in a unified latent neural space for end-to-end real-to-sim modeling, enabling controllable, stable long-horizon manipulation that generalizes beyond observed trajectories.
链接: https://arxiv.org/abs/2602.02402
作者: Mu Huang,Hui Wang,Kerui Ren,Linning Xu,Yunsong Zhou,Mulin Yu,Bo Dai,Jiangmiao Pang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Applied Physics (physics.app-ph)
备注: Project page: this https URL
Abstract:Simulating deformable objects under rich interactions remains a fundamental challenge for real-to-sim robot manipulation, with dynamics jointly driven by environmental effects and robot actions. Existing simulators rely on predefined physics or data-driven dynamics without robot-conditioned control, limiting accuracy, stability, and generalization. This paper presents SoMA, a 3D Gaussian Splat simulator for soft-body manipulation. SoMA couples deformable dynamics, environmental forces, and robot joint actions in a unified latent neural space for end-to-end real-to-sim simulation. Modeling interactions over learned Gaussian splats enables controllable, stable long-horizon manipulation and generalization beyond observed trajectories without predefined physical models. SoMA improves resimulation accuracy and generalization on real-world robot manipulation by 20%, enabling stable simulation of complex tasks such as long-horizon cloth folding.
zh
[CV-9] Superman: Unifying Skeleton and Vision for Human Motion Perception and Generation
【速读】: This paper addresses the severe fragmentation of current human motion analysis: perception models understand motion from video but output only text, while generation models cannot take raw visual input; existing generative MLLMs are often limited to single-frame static pose generation and struggle to model temporal motion; and existing motion vocabularies are built from skeleton data alone, severing the link to the visual domain. The key to the solution is the unified Superman framework, with two core innovations: first, a Vision-Guided Motion Tokenizer that exploits the natural geometric alignment between 3D skeletons and visual data for joint cross-modal learning, producing a unified cross-modal motion vocabulary; second, on top of this motion language, a single multimodal LLM architecture trained to flexibly handle diverse temporal inputs, unifying 3D skeleton pose estimation from video (perception) with skeleton-based motion prediction and in-betweening (generation), and achieving state-of-the-art or competitive results on standard benchmarks such as Human3.6M.
链接: https://arxiv.org/abs/2602.02401
作者: Xinshun Wang,Peiming Li,Ziyi Wang,Zhongbin Fang,Zhichao Deng,Songtao Wu,Jason Li,Mengyuan Liu
机构: Peking University (北京大学); Sun Yat-sen University (中山大学); Sony R&D Center (索尼研发中心); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Human motion analysis tasks, such as temporal 3D pose estimation, motion prediction, and motion in-betweening, play an essential role in computer vision. However, current paradigms suffer from severe fragmentation. First, the field is split between "perception" models that understand motion from video but only output text, and "generation" models that cannot perceive from raw visual input. Second, generative MLLMs are often limited to single-frame, static poses using dense, parametric SMPL models, failing to handle temporal motion. Third, existing motion vocabularies are built from skeleton data alone, severing the link to the visual domain. To address these challenges, we introduce Superman, a unified framework that bridges visual perception with temporal, skeleton-based motion generation. Our solution is twofold. First, to overcome the modality disconnect, we propose a Vision-Guided Motion Tokenizer. Leveraging the natural geometric alignment between 3D skeletons and visual data, this module pioneers robust joint learning from both modalities, creating a unified, cross-modal motion vocabulary. Second, grounded in this motion language, a single, unified MLLM architecture is trained to handle all tasks. This module flexibly processes diverse, temporal inputs, unifying 3D skeleton pose estimation from video (perception) with skeleton-based motion prediction and in-betweening (generation). Extensive experiments on standard benchmarks, including Human3.6M, demonstrate that our unified method achieves state-of-the-art or competitive performance across all motion tasks. This showcases a more efficient and scalable path for generative motion analysis using skeletons.
zh
[CV-10] Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory
【速读】: This paper addresses the difficulty existing world models have in maintaining long-term visual memory in complex real-world environments, particularly in the absence of accurate pose estimation and with scarce viewpoint revisits; the core challenge is to model thousand-frame video sequences efficiently and stably without explicit geometric priors. The key to the solution is two novel modules: a Hierarchical Pose-free Memory Compressor (HPMC) that recursively distills historical latents into a fixed-budget representation so that generation can be anchored in the distant past, and an uncertainty-aware action labeling module that discretizes continuous motion into tri-state logic to make action-response learning robust. A revisit-dense fine-tuning strategy on a small curated dataset further activates the model's long-range loop-closure capability, substantially improving visual quality, action controllability, and spatial consistency.
链接: https://arxiv.org/abs/2602.02393
作者: Ruiqi Wu,Xuanhua He,Meng Cheng,Tianyu Yang,Yong Zhang,Zhuoliang Kang,Xunliang Cai,Xiaoming Wei,Chunle Guo,Chongyi Li,Ming-Ming Cheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages, 8 figures
Abstract:We propose Infinite-World, a robust interactive world model capable of maintaining coherent visual memory over 1000+ frames in complex real-world environments. While existing world models can be efficiently optimized on synthetic data with perfect ground-truth, they lack an effective training paradigm for real-world videos due to noisy pose estimations and the scarcity of viewpoint revisits. To bridge this gap, we first introduce a Hierarchical Pose-free Memory Compressor (HPMC) that recursively distills historical latents into a fixed-budget representation. By jointly optimizing the compressor with the generative backbone, HPMC enables the model to autonomously anchor generations in the distant past with bounded computational cost, eliminating the need for explicit geometric priors. Second, we propose an Uncertainty-aware Action Labeling module that discretizes continuous motion into a tri-state logic. This strategy maximizes the utilization of raw video data while shielding the deterministic action space from being corrupted by noisy trajectories, ensuring robust action-response learning. Furthermore, guided by insights from a pilot toy study, we employ a Revisit-Dense Finetuning Strategy using a compact, 30-minute dataset to efficiently activate the model’s long-range loop-closure capabilities. Extensive experiments, including objective metrics and user studies, demonstrate that Infinite-World achieves superior performance in visual quality, action controllability, and spatial consistency.
zh
[CV-11] Personalized Image Generation via Human-in-the-loop Bayesian Optimization
【速读】: This paper tackles the "language prompting bottleneck" of generative AI in personalized image generation: when natural-language prompts can no longer steer the model toward a specific target image (such as a remembered street scene), how can the generation be brought closer to the ideal image $ x^\ast $ in the user's mind? The key to the solution is Multi-Choice Preferential Bayesian Optimization (MultiBO), whose core mechanism is to exploit, once language has reached its limits, the human ability to implicitly judge image similarity: starting from the current result $ x^{p*} $, it generates K candidate images, collects user preference feedback over multiple interactive rounds, and uses that feedback to guide the diffusion model, efficiently approaching the target $ x^\ast $ even though the model has no prior information about it.
链接: https://arxiv.org/abs/2602.02388
作者: Rajalaxmi Rajagopalan,Debottam Dutta,Yu-Lin Wei,Romit Roy Choudhury
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Imagine Alice has a specific image x^\ast in her mind, say, the view of the street in which she grew up during her childhood. To generate that exact image, she guides a generative model with multiple rounds of prompting and arrives at an image x^p* . Although x^p* is reasonably close to x^\ast , Alice finds it difficult to close that gap using language prompts. This paper aims to narrow this gap by observing that even after language has reached its limits, humans can still tell when a new image x^+ is closer to x^\ast than x^p* . Leveraging this observation, we develop MultiBO (Multi-Choice Preferential Bayesian Optimization) that carefully generates K new images as a function of x^p* , gets preferential feedback from the user, uses the feedback to guide the diffusion model, and ultimately generates a new set of K images. We show that within B rounds of user feedback, it is possible to arrive much closer to x^\ast , even though the generative model has no information about x^\ast . Qualitative scores from 30 users, combined with quantitative metrics compared across 5 baselines, show promising results, suggesting that multi-choice feedback from humans can be effectively harnessed for personalized image generation.
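Stripped of the Bayesian machinery, the interaction loop looks roughly like this. `generate_near` and `ask_user` are hypothetical stand-ins for diffusion-guided candidate generation and the multi-choice preference query, and the step-halving heuristic is an assumption, not the paper's acquisition rule:

```python
# Hypothetical sketch of a MultiBO-style preferential loop.

def multibo(x_start, generate_near, ask_user, rounds=5, k=4):
    best = x_start                                   # prompt-refined image x^{p*}
    step = 1.0                                       # exploration radius
    for _ in range(rounds):                          # B rounds of feedback
        candidates = [generate_near(best, scale=step) for _ in range(k)]
        choice = ask_user(best, candidates)          # index of preferred image,
        if choice is None:                           # or None: keep current best
            step *= 0.5                              # shrink search if no winner
        else:
            best = candidates[choice]                # move toward mental image x*
    return best
```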
zh
[CV-12] Unified Personalized Reward Model for Vision Generation
【速读】: This paper addresses the systematic misalignment between existing multimodal reward models (RMs) and subjective, context-dependent human preferences in visual generation, caused by one-size-fits-all preference modeling that cannot adapt to content-specific visual cues. The key to the solution is UnifiedReward-Flex, a unified personalized reward model that couples reward modeling with flexible, context-adaptive reasoning: given the prompt and the generated visual content, it first interprets the semantic intent and grounds on visual evidence, then dynamically constructs a hierarchical assessment that instantiates fine-grained criteria under both predefined and self-generated high-level dimensions. Training proceeds in two stages: first, high-quality structured reasoning traces distilled from advanced closed-source VLMs bootstrap supervised fine-tuning (SFT), equipping the model with flexible reasoning behaviors; second, direct preference optimization (DPO) on carefully curated preference pairs further strengthens reasoning fidelity and discriminative alignment.
链接: https://arxiv.org/abs/2602.02380
作者: Yibin Wang,Yuhang Zang,Feng Han,Jiazi Bu,Yujie Zhou,Cheng Jin,Jiaqi Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Website: this https URL
Abstract:Recent advancements in multimodal reward models (RMs) have significantly propelled the development of visual generation. Existing frameworks typically adopt Bradley-Terry-style preference modeling or leverage generative VLMs as judges, and subsequently optimize visual generation models via reinforcement learning. However, current RMs suffer from inherent limitations: they often follow a one-size-fits-all paradigm that assumes a monolithic preference distribution or relies on fixed evaluation rubrics. As a result, they are insensitive to content-specific visual cues, leading to systematic misalignment with subjective and context-dependent human preferences. To this end, inspired by human assessment, we propose UnifiedReward-Flex, a unified personalized reward model for vision generation that couples reward modeling with flexible and context-adaptive reasoning. Specifically, given a prompt and the generated visual content, it first interprets the semantic intent and grounds on visual evidence, then dynamically constructs a hierarchical assessment by instantiating fine-grained criteria under both predefined and self-generated high-level dimensions. Our training pipeline follows a two-stage process: (1) we first distill structured, high-quality reasoning traces from advanced closed-source VLMs to bootstrap SFT, equipping the model with flexible and context-adaptive reasoning behaviors; (2) we then perform direct preference optimization (DPO) on carefully curated preference pairs to further strengthen reasoning fidelity and discriminative alignment. To validate the effectiveness, we integrate UnifiedReward-Flex into the GRPO framework for image and video synthesis, and extensive results demonstrate its superiority.
zh
[CV-13] Uncertainty-Aware Image Classification In Biomedical Imaging Using Spectral-normalized Neural Gaussian Processes
【速读】: This paper addresses the fact that current deep learning models in digital pathology are often overconfident and poorly calibrated in out-of-distribution (OOD) settings, limiting trust and clinical adoption. The key to the solution is the Spectral-normalized Neural Gaussian Process (SNGP), which applies spectral normalization to the model and replaces the final dense layer with a Gaussian process layer, markedly improving single-model uncertainty estimation and OOD detection while maintaining in-distribution performance comparable to conventional approaches.
链接: https://arxiv.org/abs/2602.02370
作者: Uma Meleti,Jeffrey J. Nirschl
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at the IEEE International Symposium on Biomedical Imaging (ISBI) 2026
Abstract:Accurate histopathologic interpretation is key for clinical decision-making; however, current deep learning models for digital pathology are often overconfident and poorly calibrated in out-of-distribution (OOD) settings, which limits trust and clinical adoption. Safety-critical medical imaging workflows benefit from intrinsic uncertainty-aware properties that can accurately reject OOD input. We implement the Spectral-normalized Neural Gaussian Process (SNGP), a set of lightweight modifications that apply spectral normalization and replace the final dense layer with a Gaussian process layer to improve single-model uncertainty estimation and OOD detection. We evaluate SNGP against deterministic baselines and Monte Carlo dropout on six datasets across three biomedical classification tasks: white blood cells, amyloid plaques, and colorectal histopathology. SNGP has comparable in-distribution performance while significantly improving uncertainty estimation and OOD detection. Thus, SNGP or related models offer a useful framework for uncertainty-aware classification in digital pathology, supporting safe deployment and building trust with pathologists.
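The two modifications are easy to express in PyTorch. This sketch omits the Laplace covariance update that SNGP uses for predictive uncertainty, and the layer sizes are illustrative:

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class RFFGPHead(nn.Module):
    """Random-feature Gaussian process output layer, the usual SNGP recipe:
    a fixed random projection followed by a cosine feature map and a
    trainable linear readout (hyperparameters here are illustrative)."""
    def __init__(self, in_dim, n_classes, n_features=1024):
        super().__init__()
        self.proj = nn.Linear(in_dim, n_features)
        for p in self.proj.parameters():
            p.requires_grad_(False)              # fixed random features
        self.out = nn.Linear(n_features, n_classes)
    def forward(self, h):
        phi = torch.cos(self.proj(h))            # cos(Wh + b), RFF map
        return self.out(phi)

# Distance preservation via spectral normalization on the feature extractor:
backbone = nn.Sequential(
    spectral_norm(nn.Linear(512, 512)), nn.ReLU(),
    spectral_norm(nn.Linear(512, 256)), nn.ReLU(),
)
model = nn.Sequential(backbone, RFFGPHead(256, n_classes=9))
```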
zh
[CV-14] NAB: Neural Adaptive Binning for Sparse-View CT reconstruction
【速读】: This paper addresses high-quality industrial CT (computed tomography) reconstruction from sparse views to reduce production costs, noting that classic implicit neural networks, while effective for sparse reconstruction, cannot exploit shape priors of the scanned objects. The key to the solution is a novel Neural Adaptive Binning (NAB) method that injects rectangular structure priors by mapping coordinate space into a binned vector space. The binning mechanism is built on differences between shifted hyperbolic tangent functions and supports rotation around the input-plane normal vector, so the encoding parameters (position, size, steepness, and rotation) can be optimized end to end via gradient flow from the projection data, improving reconstruction accuracy. By adjusting the smoothness of the binning function, NAB also generalizes to more complex geometries.
链接: https://arxiv.org/abs/2602.02356
作者: Wangduo Xie,Matthew B. Blaschko
机构: KU Leuven (鲁汶大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Computed Tomography (CT) plays a vital role in inspecting the internal structures of industrial objects. Furthermore, achieving high-quality CT reconstruction from sparse views is essential for reducing production costs. While classic implicit neural networks have shown promising results for sparse reconstruction, they are unable to leverage shape priors of objects. Motivated by the observation that numerous industrial objects exhibit rectangular structures, we propose a novel Neural Adaptive Binning (NAB) method that effectively integrates rectangular priors into the reconstruction process. Specifically, our approach first maps coordinate space into a binned vector space. This mapping relies on an innovative binning mechanism based on differences between shifted hyperbolic tangent functions, with our extension enabling rotations around the input-plane normal vector. The resulting representations are then processed by a neural network to predict CT attenuation coefficients. This design enables end-to-end optimization of the encoding parameters – including position, size, steepness, and rotation – via gradient flow from the projection data, thus enhancing reconstruction accuracy. By adjusting the smoothness of the binning function, NAB can generalize to objects with more complex geometries. This research provides a new perspective on integrating shape priors into neural network-based reconstruction. Extensive experiments demonstrate that NAB achieves superior performance on two industrial datasets. It also remains robust on medical datasets when the binning function is extended to a more general form. The code will be made available.
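The core gate is a difference of shifted tanh functions. Below is a 1-D sketch with learnable position, width, and steepness; rotation about the plane normal, which the paper adds, is omitted, and the initial values are illustrative:

```python
import torch
import torch.nn as nn

class AdaptiveBin1D(nn.Module):
    """One learnable soft bin along an axis, roughly
    b(x) = 0.5 * (tanh(s * (x - a)) - tanh(s * (x - a - w))).
    Position a, width w, and steepness s all receive gradients from the
    projection loss; the response is ~1 inside [a, a + w] and ~0 outside."""
    def __init__(self, a=0.3, w=0.4, s=20.0):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(a))   # bin start (position)
        self.w = nn.Parameter(torch.tensor(w))   # bin width (size)
        self.s = nn.Parameter(torch.tensor(s))   # edge steepness
    def forward(self, x):
        return 0.5 * (torch.tanh(self.s * (x - self.a))
                      - torch.tanh(self.s * (x - self.a - self.w)))

# Smaller s softens the bin edges, which is how the binning function is
# relaxed to handle less rectangular (e.g., medical) geometries.
```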
zh
[CV-15] Implicit neural representation of textures
【速读】: This paper addresses the limitations of traditional texture representations in memory footprint and rendering efficiency, particularly the discrete sampling and storage bottlenecks of high-resolution images. The key to the solution is designing new implicit neural representations (INRs) that model texture mapping as continuous functions over the UV coordinate space rather than discrete lookups, reducing memory consumption and balancing rendering inference time while preserving image quality. Systematic experiments analyze these trade-offs and validate the approach in real-time rendering and downstream tasks such as mipmap fitting and INR-space generation.
链接: https://arxiv.org/abs/2602.02354
作者: Albert Kwok,Zheyuan Hu,Dounia Hammou
机构: University of Cambridge (剑桥大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
备注: Albert Kwok and Zheyuan Hu contributed equally to this work
Abstract:Implicit neural representation (INR) has proven to be accurate and efficient in various domains. In this work, we explore how different neural networks can be designed as a new texture INR, which operates in a continuous manner rather than a discrete one over the input UV coordinate space. Through thorough experiments, we demonstrate that these INRs perform well in terms of image quality, with considerable memory usage and rendering inference time. We analyze the balance between these objectives. In addition, we investigate various related applications in real-time rendering and down-stream tasks, e.g. mipmap fitting and INR-space generation.
zh
[CV-16] LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization NEURIPS2025
【速读】: This paper addresses the lack of effective training signals for short-context vision-language models on ultra-long videos, in particular how to achieve robust long-video understanding without costly human annotation. The key is LongVPO, a two-stage Direct Preference Optimization (DPO) framework: in the first stage, preference triples are synthesized by anchoring questions to short clips interleaved with distractors, with visual-similarity and question-specificity filtering to mitigate positional bias, while the reference model is scored only on the anchor clip to reduce computational overhead; in the second stage, recursive captioning produces scene-level metadata from long videos, and an LLM constructs multi-segment reasoning queries and dispreferred responses so that preference alignment is driven by multi-segment reasoning tasks. With only 16K synthetic samples, the method surpasses state-of-the-art open-source models on multiple long-video benchmarks while preserving short-video performance, offering a scalable paradigm for efficient long-video understanding.
链接: https://arxiv.org/abs/2602.02341
作者: Zhenpeng Huang,Jiaqi Li,Zihan Jia,Xinhao Li,Desen Meng,Lingxue Song,Xi Chen,Liang Li,Limin Wang
机构: Nanjing University (南京大学); JIUTIAN Research (九天研究); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2025
Abstract:We present LongVPO, a novel two-stage Direct Preference Optimization framework that enables short-context vision-language models to robustly understand ultra-long videos without any long-video annotations. In Stage 1, we synthesize preference triples by anchoring questions to individual short clips, interleaving them with distractors, and applying visual-similarity and question-specificity filtering to mitigate positional bias and ensure unambiguous supervision. We also approximate the reference model’s scoring over long contexts by evaluating only the anchor clip, reducing computational overhead. In Stage 2, we employ a recursive captioning pipeline on long videos to generate scene-level metadata, then use a large language model to craft multi-segment reasoning queries and dispreferred responses, aligning the model’s preferences through multi-segment reasoning tasks. With only 16K synthetic examples and no costly human labels, LongVPO outperforms the state-of-the-art open-source models on multiple long-video benchmarks, while maintaining strong short-video performance (e.g., on MVBench), offering a scalable paradigm for efficient long-form video understanding.
zh
[CV-17] VQ-Style: Disentangling Style and Content in Motion with Residual Quantized Representations
【速读】: This paper addresses the difficulty of disentangling semantic content from stylistic characteristics in human motion data to enable high-quality motion style transfer; the core challenge is that motion simultaneously carries coarse action semantics and fine-grained expressive detail that conventional methods struggle to separate. The key to the solution is a hierarchical representation-learning framework based on Residual Vector Quantized Variational Autoencoders (RVQ-VAEs), in which codebook learning is enhanced with contrastive learning and a novel information leakage loss so that content and style are explicitly separated across codebooks. An inference-time Quantized Code Swapping technique then performs motion style transfer to unseen styles without fine-tuning, showing strong versatility across style transfer, style removal, and motion blending.
链接: https://arxiv.org/abs/2602.02334
作者: Fatemeh Zargarbashi,Dhruv Agrawal,Jakob Buhmann,Martin Guay,Stelian Coros,Robert W. Sumner
机构: ETH Zürich(苏黎世联邦理工学院); DisneyResearch|Studios(迪士尼研究院|工作室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Human motion data is inherently rich and complex, containing both semantic content and subtle stylistic features that are challenging to model. We propose a novel method for effective disentanglement of the style and content in human motion data to facilitate style transfer. Our approach is guided by the insight that content corresponds to coarse motion attributes while style captures the finer, expressive details. To model this hierarchy, we employ Residual Vector Quantized Variational Autoencoders (RVQ-VAEs) to learn a coarse-to-fine representation of motion. We further enhance the disentanglement by integrating contrastive learning and a novel information leakage loss with codebook learning to organize the content and the style across different codebooks. We harness this disentangled representation using our simple and effective inference-time technique Quantized Code Swapping, which enables motion style transfer without requiring any fine-tuning for unseen styles. Our framework demonstrates strong versatility across multiple inference applications, including style transfer, style removal, and motion blending.
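Quantized Code Swapping is simple to state once the RVQ codes are in hand. In this sketch, `rvq` is a trained residual quantizer with hypothetical `encode`/`decode` methods, and `split`, the boundary between coarse content levels and fine style levels, is a design choice assumed here:

```python
def quantized_code_swap(rvq, motion_content, motion_style, split=1):
    """Inference-time style transfer by swapping residual-codebook levels:
    keep the coarse (content) levels of one motion and take the fine
    (style) levels of another, then decode the mixed code stack."""
    codes_c = rvq.encode(motion_content)       # per-level codes, coarse -> fine
    codes_s = rvq.encode(motion_style)
    mixed = codes_c[:split] + codes_s[split:]  # content levels + style levels
    return rvq.decode(mixed)
```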
zh
[CV-18] Enhancing Indoor Occupancy Prediction via Sparse Query-Based Multi-Level Consistent Knowledge Distillation
【速读】: This paper addresses the efficiency-accuracy trade-off in occupancy prediction: dense methods waste computation on empty voxels, while sparse query-based methods lack robustness in diverse and complex indoor scenes. The key to the solution is DiScene, a novel sparse query-based framework with two core innovations: (1) Multi-level Consistent Knowledge Distillation, which transfers hierarchical representations from a large teacher to a lightweight student at four coordinated levels, encoder-level feature alignment, query-level feature matching, prior-level spatial guidance, and anchor-level high-confidence knowledge transfer; and (2) a Teacher-Guided Initialization policy that uses optimized parameter warm-up to accelerate convergence. The method achieves notable gains on Occ-Scannet and Occ3D-nuScenes, improving both inference speed and accuracy.
链接: https://arxiv.org/abs/2602.02318
作者: Xiang Li,Yupeng Zheng,Pengfei Li,Yilun Chen,Ya-Qin Zhang,Wenchao Ding
机构: Tsinghua University (清华大学); TARS; Chinese Academy of Sciences (中国科学院); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by RA-L
Abstract:Occupancy prediction provides critical geometric and semantic understanding for robotics but faces efficiency-accuracy trade-offs. Current dense methods suffer computational waste on empty voxels, while sparse query-based approaches lack robustness in diverse and complex indoor scenes. In this paper, we propose DiScene, a novel sparse query-based framework that leverages multi-level distillation to achieve efficient and robust occupancy prediction. In particular, our method incorporates two key innovations: (1) a Multi-level Consistent Knowledge Distillation strategy, which transfers hierarchical representations from large teacher models to lightweight students through coordinated alignment across four levels, including encoder-level feature alignment, query-level feature matching, prior-level spatial guidance, and anchor-level high-confidence knowledge transfer and (2) a Teacher-Guided Initialization policy, employing optimized parameter warm-up to accelerate model convergence. Validated on the Occ-Scannet benchmark, DiScene achieves 23.2 FPS without depth priors while outperforming our baseline method, OPUS, by 36.1% and even better than the depth-enhanced version, OPUS†. With depth integration, DiScene† attains new SOTA performance, surpassing EmbodiedOcc by 3.7% with 1.62 \times faster inference speed. Furthermore, experiments on the Occ3D-nuScenes benchmark and in-the-wild scenarios demonstrate the versatility of our approach in various environments. Code and models can be accessed at this https URL.
zh
[CV-19] Segment to Focus: Guiding Latent Action Models in the Presence of Distractors
【速读】: This paper addresses the problem that Latent Action Models (LAMs), which extract action-relevant representations from raw observations, fail to disentangle action-relevant features from action-correlated noise (e.g., background motion), capturing spurious correlations and building sub-optimal latent action spaces. The key to the solution is MaskLAM, which weights the LAM reconstruction loss with visual agent segmentation masks provided by pretrained foundation models, prioritizing salient information over background elements without any architectural modification, thereby filtering distractors effectively and substantially improving accrued reward and latent action quality.
链接: https://arxiv.org/abs/2602.02259
作者: Hamza Adnan,Matthew T. Jackson,Alexey Zakharov
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Latent Action Models (LAMs) learn to extract action-relevant representations solely from raw observations, enabling reinforcement learning from unlabelled videos and significantly scaling available training data. However, LAMs face a critical challenge in disentangling action-relevant features from action-correlated noise (e.g., background motion). Failing to filter these distractors causes LAMs to capture spurious correlations and build sub-optimal latent action spaces. In this paper, we introduce MaskLAM – a lightweight modification to LAM training to mitigate this issue by incorporating visual agent segmentation. MaskLAM utilises segmentation masks from pretrained foundation models to weight the LAM reconstruction loss, thereby prioritising salient information over background elements while requiring no architectural modifications. We demonstrate the effectiveness of our method on continuous-control MuJoCo tasks, modified with action-correlated background noise. Our approach yields up to a 4x increase in accrued rewards compared to standard baselines and a 3x improvement in the latent action quality, as evidenced by linear probe evaluation.
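The loss modification fits in a few lines. The background down-weighting scheme below is one plausible instantiation, not necessarily the paper's exact weighting:

```python
import torch
import torch.nn.functional as F

def masked_recon_loss(pred, target, agent_mask, bg_weight=0.1):
    """MaskLAM-style mask-weighted reconstruction loss (sketch): pixels on
    the segmented agent count fully, background pixels are down-weighted so
    action-correlated distractor motion cannot dominate the latent action.
    `agent_mask`: (B, 1, H, W) in [0, 1] from a pretrained segmenter."""
    per_pixel = F.mse_loss(pred, target, reduction="none")        # (B, C, H, W)
    w = (agent_mask + bg_weight * (1.0 - agent_mask)).expand_as(per_pixel)
    return (w * per_pixel).sum() / w.sum()
```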
zh
[CV-20] LiFlow: Flow Matching for 3D LiDAR Scene Completion
【速读】: This paper addresses the limited perception caused by occlusion and long-range sparsity in autonomous-driving LiDAR point clouds, using scene completion to recover the missing parts of incomplete 3D LiDAR scenes. The key to the solution is the first flow-matching framework for 3D LiDAR scene completion: unlike prior local point-level denoising diffusion probabilistic models, it avoids the mismatch between training and inference initial distributions, improving generation quality. Concretely, it combines a nearest-neighbor flow matching loss with a Chamfer distance loss to strengthen local structure preservation and global coverage during point-cloud alignment, achieving state-of-the-art performance across multiple metrics.
链接: https://arxiv.org/abs/2602.02232
作者: Andrea Matteazzi,Dietmar Tutsch
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In autonomous driving scenarios, the collected LiDAR point clouds can be challenged by occlusion and long-range sparsity, limiting the perception of autonomous driving systems. Scene completion methods can infer the missing parts of incomplete 3D LiDAR scenes. Recent methods adopt local point-level denoising diffusion probabilistic models, which require predicting Gaussian noise, leading to a mismatch between training and inference initial distributions. This paper introduces the first flow matching framework for 3D LiDAR scene completion, improving upon diffusion-based methods by ensuring consistent initial distributions between training and inference. The model employs a nearest neighbor flow matching loss and a Chamfer distance loss to enhance both local structure and global coverage in the alignment of point clouds. LiFlow achieves state-of-the-art performance across multiple metrics. Code: this https URL.
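A rough sketch of one training step combining the two losses. The pairing of source and target points, the unit loss weights, and the `model(xt, t)` velocity-field signature are all guesses from the abstract, not the paper's implementation:

```python
import torch

def chamfer(a, b):
    # Symmetric Chamfer distance between point sets a: (N, 3), b: (M, 3).
    d = torch.cdist(a, b)                                   # (N, M)
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def liflow_step(model, x0, x1):
    """One flow-matching step, sketched: x0 are source points drawn from the
    incomplete scan (matching train/inference initial distributions), x1 are
    complete target points; the model predicts a velocity along the linear
    interpolation path, matched to nearest neighbors in the target."""
    t = torch.rand(1, device=x0.device)
    xt = (1 - t) * x0 + t * x1                              # linear path
    v_pred = model(xt, t)
    nn_target = x1[torch.cdist(xt, x1).argmin(dim=1)]       # NN correspondence
    loss_fm = ((v_pred - (nn_target - x0)) ** 2).mean()     # NN flow-matching loss
    loss_cd = chamfer(xt + (1 - t) * v_pred, x1)            # global coverage
    return loss_fm + loss_cd
```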
zh
[CV-21] Show Dont Tell: Morphing Latent Reasoning into Image Generation
【速读】: This paper addresses the lack of dynamic reasoning and self-correction in current text-to-image (T2I) generation, i.e., the inability of existing models to mimic the implicit, continuously adjusted thought process of human creation. The core difficulty is that existing reasoning-augmented paradigms decode intermediate reasoning into discrete text at fixed steps with frequent image decoding and re-encoding, causing inefficiency, information loss, and cognitive mismatch. The key to the solution is LatentMorph, which performs implicit latent reasoning entirely in continuous latent spaces through four lightweight components, a condenser, a translator, a shaper, and an RL-trained invoker, allowing the model to summarize generation states into visual memory, convert latent thoughts into actionable guidance, dynamically steer next image-token prediction, and adaptively decide when to invoke reasoning, thereby improving generation quality, reasoning efficiency, and alignment with human intuition.
链接: https://arxiv.org/abs/2602.02227
作者: Harold Haodong Chen,Xinxiang Yin,Wen-Jie Shu,Hongfei Zhang,Zixin Zhang,Chenfei Liao,Litao Guo,Qifeng Chen,Ying-Cong Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL
Abstract:Text-to-image (T2I) generation has achieved remarkable progress, yet existing methods often lack the ability to dynamically reason and refine during generation, a hallmark of human creativity. Current reasoning-augmented paradigms mostly rely on explicit thought processes, where intermediate reasoning is decoded into discrete text at fixed steps with frequent image decoding and re-encoding, leading to inefficiencies, information loss, and cognitive mismatches. To bridge this gap, we introduce LatentMorph, a novel framework that seamlessly integrates implicit latent reasoning into the T2I generation process. At its core, LatentMorph introduces four lightweight components: (i) a condenser for summarizing intermediate generation states into compact visual memory, (ii) a translator for converting latent thoughts into actionable guidance, (iii) a shaper for dynamically steering next image token predictions, and (iv) an RL-trained invoker for adaptively determining when to invoke reasoning. By performing reasoning entirely in continuous latent spaces, LatentMorph avoids the bottlenecks of explicit reasoning and enables more adaptive self-refinement. Extensive experiments demonstrate that LatentMorph (I) enhances the base model Janus-Pro by 16% on GenEval and 25% on T2I-CompBench; (II) outperforms explicit paradigms (e.g., TwiG) by 15% and 11% on abstract reasoning tasks like WISE and IPV-Txt; (III) reduces inference time by 44% and token consumption by 51%; and (IV) exhibits 71% cognitive alignment with human intuition on reasoning invocation.
zh
[CV-22] Evaluating OCR Performance for Assistive Technology: Effects of Walking Speed Camera Placement and Camera Type
【速读】: This paper addresses the insufficient evaluation of optical character recognition (OCR) performance under mobile conditions, where existing static datasets fail to reflect the recognition challenges of real dynamic use in assistive technology for people with visual impairments. The key to the solution is a systematic comparison of OCR under static and dynamic conditions: controlling distance (1-7 meters), viewing angle (0-75°), walking speed (0.8-1.8 m/s), and camera placement (head-mounted, shoulder-mounted, hand-held), four OCR engines (Google Vision, PaddleOCR 3.0, EasyOCR, Tesseract) are tested across these dimensions, with character-level Levenshtein similarity as the accuracy metric, revealing the key factors that affect OCR accuracy in real mobile use.
链接: https://arxiv.org/abs/2602.02223
作者: Junchi Feng,Nikhil Ballem,Mahya Beheshti,Giles Hamilton-Fletcher,Todd Hudson,Maurizio Porfiri,William H. Seiple,John-Ross Rizzo
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Optical character recognition (OCR), which converts printed or handwritten text into machine-readable form, is widely used in assistive technology for people with blindness and low vision. Yet, most evaluations rely on static datasets that do not reflect the challenges of mobile use. In this study, we systematically evaluated OCR performance under both static and dynamic conditions. Static tests measured detection range across distances of 1-7 meters and viewing angles of 0-75 degrees horizontally. Dynamic tests examined the impact of motion by varying walking speed from slow (0.8 m/s) to very fast (1.8 m/s) and comparing three camera mounting positions: head-mounted, shoulder-mounted, and hand-held. We evaluated both a smartphone and smart glasses, using the phone’s main and ultra-wide cameras. Four OCR engines were benchmarked to assess accuracy at different distances and viewing angles: Google Vision, PaddleOCR 3.0, EasyOCR, and Tesseract. PaddleOCR 3.0 was then used to evaluate accuracy at different walking speeds. Accuracy was computed at the character level using the Levenshtein ratio against manually defined ground truth. Results showed that recognition accuracy declined with increased walking speed and wider viewing angles. Google Vision achieved the highest overall accuracy, with PaddleOCR close behind as the strongest open-source alternative. Across devices, the phone’s main camera achieved the highest accuracy, and a shoulder-mounted placement yielded the highest average among body positions; however, differences among shoulder, head, and hand were not statistically significant.
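For reference, a character-level Levenshtein similarity of the kind used for accuracy here can be computed as follows; the paper may use the `Levenshtein.ratio` variant, which normalizes slightly differently:

```python
def levenshtein_ratio(pred: str, truth: str) -> float:
    """Character-level similarity: 1 - edit_distance / max_length."""
    m, n = len(pred), len(truth)
    if max(m, n) == 0:
        return 1.0
    prev = list(range(n + 1))                 # DP row for the empty prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + (pred[i - 1] != truth[j - 1]))  # substitution
        prev = cur
    return 1.0 - prev[n] / max(m, n)

# levenshtein_ratio("EXIT 12B", "EXIT 12B")  -> 1.0
# levenshtein_ratio("EXIT 128", "EXIT 12B")  -> 0.875
```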
zh
[CV-23] MIRROR: Manifold Ideal Reference ReconstructOR for Generalizable AI-Generated Image Detection
【速读】: This paper addresses the weak generalization of current AI-generated image (AIGI) detection methods, which rely on artifact-based classification tied to specific forgery traces and thus struggle to keep up with evolving generative models. The core of the solution is to reformulate AIGI detection as a reference-comparison task that explicitly encodes reality priors and verifies whether an input is consistent with the real-image manifold, rather than fitting specific forgery cues. The key innovation is the MIRROR framework, which uses a learnable discrete memory bank to construct an ideal reference, projects the input into a manifold-consistent space via a sparse linear combination, and uses the residual as a robust detection signal, enabling effective identification of AIGI across diverse generators and challenging in-the-wild conditions.
链接: https://arxiv.org/abs/2602.02222
作者: Ruiqi Liu,Manni Cui,Ziheng Qin,Zhiyuan Yan,Ruoxin Chen,Yi Han,Zhiheng Li,Junkai Chen,ZhiJin Chen,Kaiqing Lin,Jialiang Shen,Lubin Weng,Jing Dong,Yan Wang,Shu Wu
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Advanced Interdisciplinary Sciences, UCAS (中国科学院大学交叉学科院系); Huazhong University of Science and Technology (华中科技大学); Tencent YouTu Lab (腾讯优图实验室); Southwest University (西南大学); Peking University (北京大学); The University of Sydney (悉尼大学); Shenzhen University (深圳大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:
Abstract:High-fidelity generative models have narrowed the perceptual gap between synthetic and real images, posing serious threats to media security. Most existing AI-generated image (AIGI) detectors rely on artifact-based classification and struggle to generalize to evolving generative traces. In contrast, human judgment relies on stable real-world regularities, with deviations from the human cognitive manifold serving as a more generalizable signal of forgery. Motivated by this insight, we reformulate AIGI detection as a Reference-Comparison problem that verifies consistency with the real-image manifold rather than fitting specific forgery cues. We propose MIRROR (Manifold Ideal Reference ReconstructOR), a framework that explicitly encodes reality priors using a learnable discrete memory bank. MIRROR projects an input into a manifold-consistent ideal reference via sparse linear combination, and uses the resulting residuals as robust detection signals. To evaluate whether detectors reach the “superhuman crossover” required to replace human experts, we introduce the Human-AIGI benchmark, featuring a psychophysically curated human-imperceptible subset. Across 14 benchmarks, MIRROR consistently outperforms prior methods, achieving gains of 2.1% on six standard benchmarks and 8.1% on seven in-the-wild benchmarks. On Human-AIGI, MIRROR reaches 89.6% accuracy across 27 generators, surpassing both lay users and visual experts, and further approaching the human perceptual limit as pretrained backbones scale. The code is publicly available at: this https URL
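The reference-comparison idea can be sketched as reconstruct-from-bank-then-score. The top-k least-squares step below is a simplification of the learned sparse combination described above:

```python
import torch

def mirror_score(z, memory, k=16):
    """MIRROR-style reference comparison, sketched. `z`: image embedding (D,);
    `memory`: bank of real-image prototypes (N, D). A large residual means the
    input cannot be reconstructed from the real-image manifold and is scored
    as likely AI-generated."""
    sims = memory @ z                                        # affinity to prototypes
    idx = sims.topk(k).indices
    M = memory[idx]                                          # (k, D) active entries
    coef = torch.linalg.lstsq(M.T, z.unsqueeze(1)).solution  # sparse combination
    ideal = (M.T @ coef).squeeze(1)                          # manifold-consistent ref
    return torch.norm(z - ideal)                             # residual = forgery signal
```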
zh
[CV-24] LangMap: A Hierarchical Benchmark for Open-Vocabulary Goal Navigation
【速读】: This paper addresses key challenges in meaningful language-based human-AI interaction and embodied intelligence: enabling agents to navigate complex indoor environments to targets specified by natural-language instructions at multiple granularities. The key to the solution is the HieraNav task and the LangMap benchmark: HieraNav defines goal navigation at four semantic levels (scene, room, region, and instance), while LangMap is a large-scale open-vocabulary benchmark built on real-world 3D indoor scans with high-quality human-verified annotations, covering discriminative instance descriptions for 414 object categories, region descriptions, and over 18,000 navigation tasks whose targets carry instructions in different styles. LangMap improves annotation efficiency and accuracy (23.8% higher discriminative accuracy than GOAT-Bench using four times fewer words); evaluations show that richer context and memory improve performance, while long-tailed, small, context-dependent, and distant goals, as well as multi-goal completion, remain open challenges.
链接: https://arxiv.org/abs/2602.02220
作者: Bo Miao,Weijia Liu,Jun Luo,Lachlan Shinnick,Jian Liu,Thomas Hamilton-Smith,Yuhe Yang,Zijie Wu,Vanja Videnovic,Feras Dayoub,Anton van den Hengel
机构: AIML, Adelaide University (阿德莱德大学); East China Normal University (华东师范大学); NERC-RVC, Hunan University (湖南大学); University Western Australia (西澳大利亚大学); Singapore University of Technology and Design (新加坡科技设计大学); Breaker Industries
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:The relationships between objects and language are fundamental to meaningful communication between humans and AI, and to practically useful embodied intelligence. We introduce HieraNav, a multi-granularity, open-vocabulary goal navigation task where agents interpret natural language instructions to reach targets at four semantic levels: scene, room, region, and instance. To this end, we present Language as a Map (LangMap), a large-scale benchmark built on real-world 3D indoor scans with comprehensive human-verified annotations and tasks spanning these levels. LangMap provides region labels, discriminative region descriptions, discriminative instance descriptions covering 414 object categories, and over 18K navigation tasks. Each target features both concise and detailed descriptions, enabling evaluation across different instruction styles. LangMap achieves superior annotation quality, outperforming GOAT-Bench by 23.8% in discriminative accuracy using four times fewer words. Comprehensive evaluations of zero-shot and supervised models on LangMap reveal that richer context and memory improve success, while long-tailed, small, context-dependent, and distant goals, as well as multi-goal completion, remain challenging. HieraNav and LangMap establish a rigorous testbed for advancing language-driven embodied navigation. Project: this https URL
zh
[CV-25] Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation
【速读】:该论文旨在解决当前实时交互式视频生成方法中,将预训练的双向视频扩散模型蒸馏为少步自回归(AR)模型时所面临的架构差异问题。具体而言,现有方法通过常微分方程(ODE)蒸馏初始化AR学生模型,但这要求帧级单射性(frame-level injectivity,即每个噪声帧在AR教师模型的PF-ODE下必须唯一映射到一个干净帧),而从双向教师模型蒸馏会违反此条件,导致无法恢复教师模型的流映射,转而得到条件期望解,从而降低性能。解决方案的关键在于提出"因果强制"(Causal Forcing),使用AR教师模型进行ODE初始化,从而有效弥合双向与自回归架构之间的理论鸿沟。实验证明,该方法在所有指标上均优于基线,尤其在动态程度(Dynamic Degree)、VisionReward和指令遵循能力(Instruction Following)上分别超越SOTA Self Forcing 19.3%、8.7%和16.7%。
链接: https://arxiv.org/abs/2602.02214
作者: Hongzhou Zhu,Min Zhao,Guande He,Hang Su,Chongxuan Li,Jun Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page and the code: this https URL
Abstract:To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, facing an architectural gap when full attention is replaced by causal attention. However, existing approaches do not bridge this gap theoretically. They initialize the AR student via ODE distillation, which requires frame-level injectivity, where each noisy frame must map to a unique clean frame under the PF-ODE of an AR teacher. Distilling an AR student from a bidirectional teacher violates this condition, preventing recovery of the teacher's flow map and instead inducing a conditional-expectation solution, which degrades performance. To address this issue, we propose Causal Forcing that uses an AR teacher for ODE initialization, thereby bridging the architectural gap. Empirical results show that our method outperforms all baselines across all metrics, surpassing the SOTA Self Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following. Project page and the code: this https URL
zh
[CV-26] MAIN-VLA: Modeling Abstraction of Intention and eNvironment for Vision-Language-Action Models
【速读】:该论文旨在解决当前视觉-语言-动作(Visual-Language-Action, VLA)模型在高度复杂且动态的环境中(如3D开放世界和大规模PvP游戏)难以高效提取与动作相关的关键信号的问题,尤其是在冗余传感器流中存在大量无关信息的情况下。解决方案的核心在于提出MAIN-VLA框架,其关键创新在于显式建模意图抽象(Intention Abstraction, IA)与环境语义抽象(Environment Semantics Abstraction, ESA):IA将冗长的语言指令及其推理过程压缩为紧凑的语义原语,ESA则将海量视觉流映射为结构化的拓扑性可供性表示;二者对齐后产生一种涌现的注意力集中效应,从而实现无需参数调整的token剪枝策略,在不降低性能的前提下过滤感知冗余,显著提升决策质量、泛化能力和推理效率。
链接: https://arxiv.org/abs/2602.02212
作者: Zheyuan Zhou,Liang Du,Zixun Sun,Xiaoyu Zhou,Ruimin Ye,Qihao Chen,Yinda Chen,Lemiao Qiu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite significant progress in Visual-Language-Action (VLA) models, existing approaches remain inefficient at extracting action-critical signals from redundant sensor streams in highly complex and dynamic environments that involve real-time, unpredictable interactions (such as 3D open worlds and large-scale PvP games). To tackle this, we introduce MAIN-VLA, a framework that explicitly Models the Abstraction of Intention and eNvironment to ground decision-making in deep semantic alignment rather than superficial pattern matching. Specifically, our Intention Abstraction (IA) distills verbose linguistic instructions and their associated reasoning into compact, explicit semantic primitives, while the Environment Semantics Abstraction (ESA) projects overwhelming visual streams into a structured, topological affordance representation. Furthermore, aligning these two abstract modalities induces an emergent attention-concentration effect, enabling a parameter-free token-pruning strategy that filters out perceptual redundancy without degrading performance. Extensive experiments in open-world Minecraft and large-scale PvP environments (Game for Peace and Valorant) demonstrate that MAIN-VLA sets a new state of the art, achieving superior decision quality, stronger generalization, and cutting-edge inference efficiency.
zh
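摘要中"注意力集中效应 → 无参数 token 剪枝"的思路,可用下面的 PyTorch 片段示意:按意图(查询)token 对视觉 token 的注意力质量排序,只保留得分最高的一部分。张量形状与保留比例均为演示假设,并非论文的原始实现:

```python
import torch

def prune_visual_tokens(visual_tokens: torch.Tensor,
                        attn: torch.Tensor,
                        keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep the visual tokens that receive the most attention from the
    intention (query) tokens; no learned parameters involved.
    Shapes and keep_ratio are illustrative assumptions.

    visual_tokens: (B, Nv, D) visual token embeddings
    attn:          (B, H, Nq, Nv) cross-attention from intention to vision
    """
    score = attn.mean(dim=(1, 2))                        # (B, Nv) attention mass
    k = max(1, int(keep_ratio * visual_tokens.shape[1]))
    idx = score.topk(k, dim=-1).indices                  # (B, k)
    idx = idx.sort(dim=-1).values                        # preserve spatial order
    batch = torch.arange(visual_tokens.shape[0]).unsqueeze(-1)
    return visual_tokens[batch, idx]                     # (B, k, D)

# Toy usage
B, H, Nq, Nv, D = 2, 8, 4, 196, 64
v = torch.randn(B, Nv, D)
a = torch.softmax(torch.randn(B, H, Nq, Nv), dim=-1)
print(prune_visual_tokens(v, a).shape)  # torch.Size([2, 49, 64])
```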
[CV-27] SSI-DM: Singularity Skipping Inversion of Diffusion Models
【速读】:该论文旨在解决扩散模型中图像到噪声空间(noise space)逆映射(inversion)的难题,现有方法在早期去噪步骤中因数学奇异性导致生成的噪声分布非高斯且编辑性能差。其解决方案的关键在于提出Singularity Skipping Inversion of Diffusion Models (SSI-DM),通过在标准逆映射前添加微小噪声来跳过奇异区域,从而获得具有自然高斯特性的噪声表示,同时保持重建保真度,实现对通用扩散模型的高效、可编辑的逆映射。
链接: https://arxiv.org/abs/2602.02193
作者: Chen Min,Enze Jiang,Jishen Peng,Zheng Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Inverting real images into the noise space is essential for editing tasks using diffusion models, yet existing methods produce non-Gaussian noise with poor editability due to the inaccuracy in early noising steps. We identify the root cause: a mathematical singularity that renders inversion fundamentally ill-posed. We propose Singularity Skipping Inversion of Diffusion Models (SSI-DM), which bypasses this singular region by adding small noise before standard inversion. This simple approach produces inverted noise with natural Gaussian properties while maintaining reconstruction fidelity. As a plug-and-play technique compatible with general diffusion models, our method achieves superior performance on public image datasets for reconstruction and interpolation tasks, providing a principled and efficient solution to diffusion model inversion.
zh
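SSI-DM 的核心操作——"先加少量噪声跳过奇异区域,再做标准反演"——可以概括为下面的示意代码。其中 eps_model 是假设的噪声预测网络接口,反演部分采用通用的确定性 DDIM 反演公式,并非论文的完整实现:

```python
import torch

def ssi_invert(x0, eps_model, alphas_cumprod, skip_t: int = 20):
    """Singularity-skipping inversion sketch: jump over the ill-posed early
    steps by noising x0 directly to timestep `skip_t`, then run deterministic
    DDIM inversion from there. `eps_model(x, t)` is a hypothetical
    noise-prediction network; `alphas_cumprod[t]` is the usual DDPM schedule."""
    a = alphas_cumprod
    # 1) Skip the singular region: forward-noise x0 to a small timestep.
    noise = torch.randn_like(x0)
    x = a[skip_t].sqrt() * x0 + (1 - a[skip_t]).sqrt() * noise
    # 2) Standard DDIM inversion from skip_t up to T.
    for t in range(skip_t, len(a) - 1):
        eps = eps_model(x, t)
        x0_pred = (x - (1 - a[t]).sqrt() * eps) / a[t].sqrt()
        x = a[t + 1].sqrt() * x0_pred + (1 - a[t + 1]).sqrt() * eps
    return x  # approximately Gaussian latent noise
```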
[CV-28] Learning Topology-Aware Implicit Field for Unified Pulmonary Tree Modeling with Incomplete Topological Supervision
【速读】:该论文旨在解决肺部气管树(pulmonary tree)在CT图像中常出现的拓扑不完整问题,如分支缺失或断开连接,这会显著影响下游解剖分析并限制现有建模流程的适用性。解决方案的关键在于提出TopoField框架,这是一个拓扑感知的隐式建模方法,将拓扑修复视为首要建模任务,并实现肺部结构的统一多任务推理。其核心创新是利用稀疏表面与骨架点云表示肺部解剖结构,通过学习一个连续隐式场,在无需完整或显式断开标注的情况下,基于合成引入的结构性破坏进行训练,从而自动修复拓扑缺陷;在此基础上,联合推断解剖标签和肺段重建,所有任务仅需一次前向传播,兼具高精度与高效性(单例处理时间<1秒),适用于大规模临床场景。
链接: https://arxiv.org/abs/2602.02186
作者: Ziqiao Weng,Jiancheng Yang,Kangxian Xie,Bo Zhou,Weidong Cai
机构: University of Sydney (悉尼大学); Aalto University (阿尔托大学); University of Buffalo (纽约州立大学布法罗分校); Northwestern University (西北大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 7 figures
Abstract:Pulmonary trees extracted from CT images frequently exhibit topological incompleteness, such as missing or disconnected branches, which substantially degrades downstream anatomical analysis and limits the applicability of existing pulmonary tree modeling pipelines. Current approaches typically rely on dense volumetric processing or explicit graph reasoning, leading to limited efficiency and reduced robustness under realistic structural corruption. We propose TopoField, a topology-aware implicit modeling framework that treats topology repair as a first-class modeling problem and enables unified multi-task inference for pulmonary tree analysis. TopoField represents pulmonary anatomy using sparse surface and skeleton point clouds and learns a continuous implicit field that supports topology repair without relying on complete or explicit disconnection annotations, by training on synthetically introduced structural disruptions over already incomplete trees. Building upon the repaired implicit representation, anatomical labeling and lung segment reconstruction are jointly inferred through task-specific implicit functions within a single forward pass. Extensive experiments on the Lung3D+ dataset demonstrate that TopoField consistently improves topological completeness and achieves accurate anatomical labeling and lung segment reconstruction under challenging incomplete scenarios. Owing to its implicit formulation, TopoField attains high computational efficiency, completing all tasks in just over one second per case, highlighting its practicality for large-scale and time-sensitive clinical applications. Code and data will be available at this https URL.
zh
[CV-29] CIEC: Coupling Implicit and Explicit Cues for Multimodal Weakly Supervised Manipulation Localization
【速读】:该论文旨在解决多模态伪造内容定位任务中依赖昂贵且耗时的细粒度标注(如patch/token级标注)的问题,提出了一种基于粗粒度图像/句子级标签的弱监督定位方法。其核心解决方案是构建一个名为CIEC(Coupling Implicit and Explicit Cues)的框架,关键在于双分支设计:一是图像分支中的Textual-guidance Refine Patch Selection(TRPS)模块,通过融合视觉与文本引导的伪造线索,并结合空间先验、背景静默及空间对比约束来精确定位可疑区域;二是文本分支中的Visual-deviation Calibrated Token Grounding(VCTG)模块,聚焦语义关键词并利用相对视觉偏差辅助token对齐,再通过非对称稀疏性和语义一致性约束缓解标签噪声并提升可靠性。该方法在多个指标上达到与全监督方法相当的效果。
链接: https://arxiv.org/abs/2602.02175
作者: Xinquan Yu,Wei Lu,Xiangyang Luo
机构: Sun Yat-sen University (中山大学); Zhengzhou 450002, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:To mitigate the threat of misinformation, multimodal manipulation localization has garnered growing attention. Considering that current methods rely on costly and time-consuming fine-grained annotations, such as patch/token-level annotations, this paper proposes a novel framework named Coupling Implicit and Explicit Cues (CIEC), which aims to achieve multimodal weakly-supervised manipulation localization for image-text pairs utilizing only coarse-grained image/sentence-level annotations. It comprises two branches, image-based and text-based weakly-supervised localization. For the former, we devise the Textual-guidance Refine Patch Selection (TRPS) module. It integrates forgery cues from both visual and textual perspectives to lock onto suspicious regions aided by spatial priors. This is followed by background silencing and spatial contrast constraints to suppress interference from irrelevant areas. For the latter, we devise the Visual-deviation Calibrated Token Grounding (VCTG) module. It focuses on meaningful content words and leverages relative visual bias to assist token localization. This is followed by asymmetric sparsity and semantic consistency constraints to mitigate label noise and ensure reliability. Extensive experiments demonstrate the effectiveness of our CIEC, yielding results comparable to fully supervised methods on several evaluation metrics.
zh
[CV-30] Lung Nodule Image Synthesis Driven by Two-Stage Generative Adversarial Networks
【速读】:该论文旨在解决肺结节CT数据集样本量有限且多样性不足的问题,这严重制约了检测模型的性能与泛化能力。现有生成方法在图像多样性与空间可控性方面表现不佳,常出现纹理特征单一和解剖结构失真等问题。其解决方案的关键在于提出一种两阶段生成对抗网络(TSGAN),通过解耦肺结节的形态结构与纹理特征来提升合成数据的质量:第一阶段利用StyleGAN生成语义分割掩膜图以控制解剖结构;第二阶段采用DL-Pix2Pix模型将掩膜图转换为CT图像,并引入局部重要性注意力机制捕捉局部特征,同时结合动态权重多头窗口注意力机制增强对肺结节纹理及背景的建模能力。实验表明,该方法显著提升了合成图像质量与检测模型性能,在LUNA16数据集上准确率提高4.6%,mAP提升4%。
链接: https://arxiv.org/abs/2602.02171
作者: Lu Cao,Xiquan He,Junying Zeng,Chaoyun Mai,Min Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The limited sample size and insufficient diversity of lung nodule CT datasets severely restrict the performance and generalization ability of detection models. Existing methods generate images with insufficient diversity and controllability, suffering from issues such as monotonous texture features and distorted anatomical structures. Therefore, we propose a two-stage generative adversarial network (TSGAN) to enhance the diversity and spatial controllability of synthetic data by decoupling the morphological structure and texture features of lung nodules. In the first stage, StyleGAN is used to generate semantic segmentation mask images, encoding lung nodules and tissue backgrounds to control the anatomical structure of lung nodule images; The second stage uses the DL-Pix2Pix model to translate the mask map into CT images, employing local importance attention to capture local features, while utilizing dynamic weight multi-head window attention to enhance the modeling capability of lung nodule texture and background. Compared to the original dataset, the accuracy improved by 4.6% and mAP by 4% on the LUNA16 dataset. Experimental results demonstrate that TSGAN can enhance the quality of synthetic images and the performance of detection models.
zh
[CV-31] Reg4Pru: Regularisation Through Random Token Routing for Token Pruning
【速读】:该论文旨在解决基于Transformer的视觉模型在进行token pruning(令牌剪枝)以提升计算效率时,因保留表示信息不稳定而导致深层特征表示性能下降的问题。其解决方案的关键在于提出一种名为Reg4Pru的训练正则化技术,通过增强剪枝后模型的稳定性,从而有效缓解因token减少带来的密集预测性能损失,尤其在医学图像分割任务中显著提升了平均精度(绝对提升46%),同时保持了29%的相对运行时间加速。
链接: https://arxiv.org/abs/2602.02163
作者: Julian Wyatt,Ronald Clark,Irina Voiculescu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 7 figures
Abstract:Transformers are widely adopted in modern vision models due to their strong ability to scale with dataset size and generalisability. However, this comes with a major drawback: computation scales quadratically with the total number of tokens. Numerous methods have been proposed to mitigate this. For example, we consider token pruning that reactivates tokens from preserved representations; this improves computational efficiency but destabilizes the preserved representations, leading to poorer dense prediction performance at deeper layers. In this work, we introduce Reg4Pru, a training regularisation technique that mitigates token-pruning performance loss for segmentation. We compare our models on the FIVES blood vessel segmentation dataset and find that Reg4Pru improves average precision by an absolute 46% compared to the same model trained without routing. This increase is observed using a configuration that achieves a 29% relative speedup in wall-clock time compared to the non-pruned baseline. These findings indicate that Reg4Pru is a valuable regulariser for token reduction strategies.
zh
[CV-32] LoopViT: Scaling Visual ARC with Looped Transformers
【速读】:该论文旨在解决当前视觉推理模型(如视觉Transformer)在处理ARC-AGI基准任务时,因采用前馈架构导致计算深度与参数规模强耦合,难以有效模拟人类归纳推理中迭代、算法性思维的问题。其解决方案的关键在于提出一种递归架构Loop-ViT,通过权重共享的循环机制解耦推理深度与模型容量,并引入无参数的动态退出机制——基于预测熵判断内部状态是否收敛至低不确定性吸引子,从而自适应地终止推理过程。该方法在ARC-AGI-1基准上验证了有效性:一个仅18M参数的模型达到65.8%准确率,优于73M参数的集成模型,表明自适应迭代计算是比单纯扩大网络宽度更高效的视觉推理扩展路径。
链接: https://arxiv.org/abs/2602.02156
作者: Wen-Jie Shu,Xuerui Qiu,Rui-Jie Zhu,Harold Haodong Chen,Yexin Liu,Harry Yang
机构: HKUST (香港科技大学); CASIA (中国科学院自动化研究所); UC Santa Cruz (加州大学圣克鲁兹分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 11 figures
Abstract:Recent advances in visual reasoning have leveraged vision transformers to tackle the ARC-AGI benchmark. However, we argue that the feed-forward architecture, where computational depth is strictly bound to parameter size, falls short of capturing the iterative, algorithmic nature of human induction. In this work, we propose a recursive architecture called Loop-ViT, which decouples reasoning depth from model capacity through weight-tied recurrence. Loop-ViT iterates a weight-tied Hybrid Block, combining local convolutions and global attention, to form a latent chain of thought. Crucially, we introduce a parameter-free Dynamic Exit mechanism based on predictive entropy: the model halts inference when its internal state "crystallizes" into a low-uncertainty attractor. Empirical results on the ARC-AGI-1 benchmark validate this perspective: our 18M model achieves 65.8% accuracy, outperforming massive 73M-parameter ensembles. These findings demonstrate that adaptive iterative computation offers a far more efficient scaling axis for visual reasoning than simply increasing network width. The code is available at this https URL.
zh
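论文的动态退出机制(基于预测熵的无参数停机)逻辑非常简洁,可用如下 PyTorch 示意:反复迭代同一个权重共享模块,当输出头的预测熵低于阈值(内部状态"结晶"为低不确定性吸引子)时提前停止。block、head 与阈值取值均为演示假设:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def looped_inference(block, state, head, max_loops=32, entropy_thresh=0.05):
    """Dynamic-exit sketch: iterate one weight-tied block and halt once the
    predictive entropy of the output head drops below a threshold, i.e. the
    internal state has settled into a low-uncertainty attractor.
    `block`, `head`, and the threshold value are illustrative assumptions."""
    for step in range(max_loops):
        state = block(state)                       # same weights every loop
        probs = F.softmax(head(state), dim=-1)     # (B, ..., num_classes)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1).mean()
        if entropy < entropy_thresh:               # state has "crystallized"
            break
    return head(state), step + 1                   # prediction + loops used
```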
[CV-33] Deep learning enables urban change profiling through alignment of historical maps
【速读】:该论文旨在解决从历史地图序列中提取一致且细粒度的城市变化信息的难题,传统方法因空间错位、制图差异及文档质量退化等问题,难以实现大规模或定量分析。其解决方案的关键在于提出一个完全自动化的深度学习框架,采用模块化设计,集成密集地图配准(dense map alignment)、多时相目标检测(multi-temporal object detection)与变化特征分析(change profiling),从而将历史地图分析从主观视觉比对转向系统性、定量化的城市变迁刻画。
链接: https://arxiv.org/abs/2602.02154
作者: Sidi Wu,Yizi Chen,Maurizio Gribaudi,Konrad Schindler,Clément Mallet,Julien Perret,Lorenz Hurni
机构: ETH Zurich (苏黎世联邦理工学院); École des hautes études en sciences sociales (EHESS) (高等研究应用学院); Univ Gustave Eiffel (居斯塔夫·埃菲尔大学); IGN (法国国家地理研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: 40 pages
Abstract:Prior to modern Earth observation technologies, historical maps provide a unique record of long-term urban transformation and offer a lens on the evolving identity of cities. However, extracting consistent and fine-grained change information from historical map series remains challenging due to spatial misalignment, cartographic variation, and degrading document quality, limiting most analyses to small-scale or qualitative approaches. We propose a fully automated, deep learning-based framework for fine-grained urban change analysis from large collections of historical maps, built on a modular design that integrates dense map alignment, multi-temporal object detection, and change profiling. This framework shifts the analysis of historical maps from ad hoc visual comparison toward systematic, quantitative characterization of urban change. Experiments demonstrate the robust performance of the proposed alignment and object detection methods. Applied to Paris between 1868 and 1937, the framework reveals the spatial and temporal heterogeneity in urban transformation, highlighting its relevance for research in the social sciences and humanities. The modular design of our framework further supports adaptation to diverse cartographic contexts and downstream applications.
zh
[CV-34] FD-VLA: Force-Distilled Vision-Language-Action Model for Contact-Rich Manipulation
【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)框架在接触密集型操作任务中缺乏有效力觉感知的问题,尤其是在机器人未配备物理力传感器的情况下难以实现精细操控的挑战。解决方案的关键在于提出Force-Distilled VLA(FD-VLA)框架,其核心是力蒸馏模块(Force Distillation Module, FDM),该模块通过一个可学习的查询标记(query token),结合视觉观测与机器人状态信息,映射生成与实际力信号潜在表示对齐的力token;推理时将该力token注入预训练的视觉语言模型(Vision-Language Model, VLM),从而实现无需物理力传感器即可进行力觉感知和推理,同时保持VLM原有的视觉-语言语义完整性。此设计不仅降低了硬件成本与复杂性,还通过力-视觉-状态融合先验提升了跨模态对齐与接触场景下的感知-动作鲁棒性。
链接: https://arxiv.org/abs/2602.02142
作者: Ruiteng Zhao,Wenshuo Wang,Yicheng Ma,Xiaocong Li,Francis E.H. Tay,Marcelo H. Ang Jr.,Haiyue Zhu
机构: National University of Singapore (新加坡国立大学); SIMTech, ASTAR (新加坡制造技术研究院,新加坡科技研究局); Nanyang Technological University (南洋理工大学); Eastern Institute of Technology (东方理工大学); Harvard University (哈佛大学); Singapore Institute of Manufacturing Technology, Agency for Science, Technology and Research (ASTAR) (新加坡制造技术研究院,新加坡科技研究局)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Force sensing is a crucial modality for Vision-Language-Action (VLA) frameworks, as it enables fine-grained perception and dexterous manipulation in contact-rich tasks. We present Force-Distilled VLA (FD-VLA), a novel framework that integrates force awareness into contact-rich manipulation without relying on physical force sensors. The core of our approach is a Force Distillation Module (FDM), which distills force by mapping a learnable query token, conditioned on visual observations and robot states, into a predicted force token aligned with the latent representation of actual force signals. During inference, this distilled force token is injected into the pretrained VLM, enabling force-aware reasoning while preserving the integrity of its vision-language semantics. This design provides two key benefits: first, it allows practical deployment across a wide range of robots that lack expensive or fragile force-torque sensors, thereby reducing hardware cost and complexity; second, the FDM introduces an additional force-vision-state fusion prior to the VLM, which improves cross-modal alignment and enhances perception-action robustness in contact-rich scenarios. Surprisingly, our physical experiments show that the distilled force token outperforms direct sensor force measurements as well as other baselines, which highlights the effectiveness of this force-distilled VLA approach.
zh
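FDM 的"可学习查询 token 蒸馏力信息"机制,可以用一个简化的交叉注意力模块来示意:查询 token 对视觉/状态上下文做注意力,输出的力 token 以 MSE 对齐真实力信号的潜在表示。维度设置与损失选择均为演示假设,并非论文的具体结构:

```python
import torch
import torch.nn as nn

class ForceDistillationModule(nn.Module):
    """Sketch of the FDM idea: a learnable query token cross-attends to
    visual/state tokens and is trained to match the latent embedding of
    real force-torque signals. Dimensions and loss choice are assumptions."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        """context: (B, N, dim) concatenated visual + robot-state tokens."""
        q = self.query.expand(context.shape[0], -1, -1)
        force_token, _ = self.attn(q, context, context)
        return self.proj(force_token)              # (B, 1, dim) force token

# Distillation target: latent of the ground-truth force signal
# (here random; it would come from a hypothetical force encoder).
fdm = ForceDistillationModule()
context = torch.randn(4, 64, 512)
force_latent = torch.randn(4, 1, 512)
loss = nn.functional.mse_loss(fdm(context), force_latent)
loss.backward()
```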
[CV-35] Eliminating Registration Bias in Synthetic CT Generation: A Physics-Based Simulation Framework
【速读】:该论文旨在解决基于监督学习的CBCT(锥形束CT)到合成CT(synthetic CT)生成中因配准偏差(registration bias)导致模型性能评估失真问题。传统方法依赖于已配准的训练对,但实际扫描间难以实现完美配准,这种偏差会传递至训练模型并误导标准强度指标(如均方误差或互信息),使得模型可能更擅长复制配准伪影而非真实解剖结构。解决方案的关键在于采用物理驱动的CBCT模拟技术(physics-based CBCT simulation),通过构建几何上严格对齐的训练数据对,避免人为引入的配准误差;同时改用基于几何对齐的评估指标(如归一化互信息,Normalized Mutual Information)直接对比输入CBCT而非有偏的“真实值”,从而更准确反映临床需求。实验表明,基于合成数据训练的模型虽在传统强度指标上表现较差,但在几何一致性方面显著优于传统方法,并与临床观察者偏好高度一致(ρ = 0.31, p < 0.001),验证了以几何保真度为核心评价标准的有效性。
链接: https://arxiv.org/abs/2602.02130
作者: Lukas Zimmermann,Michael Rauter,Maximilian Schmid,Dietmar Georg,Barbara Knäusl
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Supervised synthetic CT generation from CBCT requires registered training pairs, yet perfect registration between separately acquired scans remains unattainable. This registration bias propagates into trained models and corrupts standard evaluation metrics. This may suggest that superior benchmark performance indicates better reproduction of registration artifacts rather than anatomical fidelity. We propose physics-based CBCT simulation to provide geometrically aligned training pairs by construction, combined with evaluation using geometric alignment metrics against input CBCT rather than biased ground truth. On two independent pelvic datasets, models trained on synthetic data achieved superior geometric alignment (Normalized Mutual Information: 0.31 vs 0.22) despite lower conventional intensity scores. Intensity metrics showed inverted correlations with clinical assessment for deformably registered data, while Normalized Mutual Information consistently predicted observer preference across registration methodologies (ρ = 0.31, p < 0.001). Clinical observers preferred synthetic-trained outputs in 87% of cases, demonstrating that geometric fidelity, not intensity agreement with biased ground truth, aligns with clinical requirements.
zh
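文中用于评估几何对齐的归一化互信息(NMI)可以基于联合直方图快速估算。下面给出常见的 Studholme 形式 (H(A)+H(B))/H(A,B) 的示意实现;注意论文报告的 0.31/0.22 可能采用归一化到 [0,1] 的另一种定义,具体分箱与归一化方式是此处的假设:

```python
import numpy as np

def normalized_mutual_information(img_a, img_b, bins: int = 64) -> float:
    """Histogram-based NMI between two intensity images,
    NMI = (H(A) + H(B)) / H(A, B); values above 1 indicate alignment.
    Binning and the exact NMI definition vary between papers."""
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    return (entropy(px) + entropy(py)) / max(entropy(pxy.ravel()), 1e-12)

a = np.random.rand(128, 128)
print(normalized_mutual_information(a, a))                          # ~2.0 (self)
print(normalized_mutual_information(a, np.random.rand(128, 128)))   # ~1.0
```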
[CV-36] oxicity Assessment in Preclinical Histopathology via Class-Aware Mahalanobis Distance for Known and Novel Anomalies
【速读】:该论文旨在解决药物诱导毒性在临床前开发和早期临床试验中导致失败的问题,核心挑战在于传统组织病理学评估高度依赖专家病理学家,难以实现大规模筛选。其解决方案的关键在于构建一个基于人工智能的异常检测框架,用于啮齿类动物肝脏全切片图像(WSIs)分析:首先利用像素级标注数据微调预训练视觉Transformer(DINOv2),通过低秩适应(LoRA)实现组织分割;其次采用马氏距离提取特征进行分布外(OOD)检测,并引入类别特异性阈值优化策略,以降低误判率(仅0.16%病灶被误判为健康组织,0.35%健康组织被误判为病灶)。该方法不仅能识别已知病理变化,还能发现无训练样本的罕见形态异常,显著提升毒理学评估效率与准确性。
链接: https://arxiv.org/abs/2602.02124
作者: Olga Graf,Dhrupal Patel,Peter Groß,Charlotte Lempp,Matthias Hein,Fabian Heinemann
机构: Tübingen AI Center, University of Tübingen, Tübingen, Germany; Boehringer Ingelheim Pharma GmbH and Co., Biberach an der Riß, Germany; Boehringer Ingelheim GmbH, Biberach an der Riß, Germany
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Drug-induced toxicity remains a leading cause of failure in preclinical development and early clinical trials. Detecting adverse effects at an early stage is critical to reduce attrition and accelerate the development of safe medicines. Histopathological evaluation remains the gold standard for toxicity assessment, but it relies heavily on expert pathologists, creating a bottleneck for large-scale screening. To address this challenge, we introduce an AI-based anomaly detection framework for histopathological whole-slide images (WSIs) in rodent livers from toxicology studies. The system identifies healthy tissue and known pathologies (anomalies) for which training data is available. In addition, it can detect rare pathologies without training data as out-of-distribution (OOD) findings. We generate a novel dataset of pixelwise annotations of healthy tissue and known pathologies and use this data to fine-tune a pre-trained Vision Transformer (DINOv2) via Low-Rank Adaptation (LoRA) in order to do tissue segmentation. Finally, we extract features for OOD detection using the Mahalanobis distance. To better account for class-dependent variability in histological data, we propose the use of class-specific thresholds. We optimize the thresholds using the mean of the false negative and false positive rates, resulting in only 0.16% of pathological tissue classified as healthy and 0.35% of healthy tissue classified as pathological. Applied to mouse liver WSIs with known toxicological findings, the framework accurately detects anomalies, including rare OOD morphologies. This work demonstrates the potential of AI-driven histopathology to support preclinical workflows, reduce late-stage failures, and improve efficiency in drug development.
zh
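类感知马氏距离 OOD 检测搭配类特异阈值的核心流程如下面的 NumPy 示意:按类估计均值与共享协方差,推理时取最近类的马氏距离并与该类阈值比较,超出即判为 OOD。共享协方差与阈值接口是演示假设;论文中阈值通过假阴率与假阳率的均值来优化:

```python
import numpy as np

class ClassAwareMahalanobis:
    """Sketch of class-conditional Mahalanobis OOD scoring with
    class-specific thresholds; tied covariance and the threshold
    interface are illustrative assumptions."""

    def fit(self, feats: np.ndarray, labels: np.ndarray):
        self.classes = np.unique(labels)
        self.means = {c: feats[labels == c].mean(0) for c in self.classes}
        # Shared (tied) covariance, as in the standard Mahalanobis detector.
        centered = np.concatenate(
            [feats[labels == c] - self.means[c] for c in self.classes])
        self.prec = np.linalg.pinv(np.cov(centered, rowvar=False))
        return self

    def score(self, x: np.ndarray) -> dict:
        """Per-class squared Mahalanobis distance of one feature vector."""
        return {c: float((x - m) @ self.prec @ (x - m))
                for c, m in self.means.items()}

    def predict(self, x: np.ndarray, thresholds: dict):
        """Assign the nearest class, or flag OOD when even the nearest
        class exceeds its class-specific threshold."""
        d = self.score(x)
        c = min(d, key=d.get)
        return "OOD" if d[c] > thresholds[c] else c
```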
[CV-37] MLV-Edit: Towards Consistent and Highly Efficient Editing for Minute-Level Videos
【速读】:该论文旨在解决长时视频(分钟级)编辑中面临的两大核心挑战:一是计算开销过大,难以扩展至数千帧的长视频;二是难以维持全局时间一致性,导致运动不连续和结构漂移。解决方案的关键在于提出一种无需训练的基于流(flow-based)框架MLV-Edit,其核心创新为两个模块:Velocity Blend通过对齐相邻片段的光流场来修正边界处的运动不一致,消除碎片化处理带来的闪烁与边界伪影;Attention Sink则将局部片段特征锚定到全局参考帧,有效抑制累积性结构漂移,从而在保持语义保真度的同时显著提升时间稳定性。
链接: https://arxiv.org/abs/2602.02123
作者: Yangyi Cao,Yuanhang Li,Lan Chen,Qi Mao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We propose MLV-Edit, a training-free, flow-based framework that addresses the unique challenges of minute-level video editing. While existing techniques excel in short-form video manipulation, scaling them to long-duration videos remains challenging due to prohibitive computational overhead and the difficulty of maintaining global temporal consistency across thousands of frames. To address this, MLV-Edit employs a divide-and-conquer strategy for segment-wise editing, facilitated by two core modules: Velocity Blend rectifies motion inconsistencies at segment boundaries by aligning the flow fields of adjacent chunks, eliminating flickering and boundary artifacts commonly observed in fragmented video processing; and Attention Sink anchors local segment features to global reference frames, effectively suppressing cumulative structural drift. Extensive quantitative and qualitative experiments demonstrate that MLV-Edit consistently outperforms state-of-the-art methods in terms of temporal stability and semantic fidelity.
zh
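Velocity Blend 在相邻片段边界上对齐流场的做法,本质上可以理解为在重叠帧上对两段速度场做权重过渡。下面的 PyTorch 片段给出这一思路的最小示意(线性权重与张量形状均为假设,论文的具体对齐方式可能更复杂):

```python
import torch

def velocity_blend(v_prev: torch.Tensor, v_next: torch.Tensor) -> torch.Tensor:
    """Blend the flow-model velocities of two adjacent segments over their
    overlapping frames with linearly ramped weights, so motion stays
    continuous across the chunk boundary. The linear ramp is an assumption;
    tensors are (T_overlap, C, H, W) velocities for the shared frames."""
    t = v_prev.shape[0]
    w = torch.linspace(0.0, 1.0, t).view(t, 1, 1, 1)  # 0 -> prev, 1 -> next
    return (1.0 - w) * v_prev + w * v_next

# Toy usage: 8 overlapping frames of latent velocity.
v_a, v_b = torch.randn(8, 4, 32, 32), torch.randn(8, 4, 32, 32)
print(velocity_blend(v_a, v_b).shape)  # torch.Size([8, 4, 32, 32])
```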
[CV-38] Enhancing Diffusion-Based Quantitatively Controllable Image Generation via Matrix-Form EDM and Adaptive Vicinal Training
【速读】:该论文旨在解决连续条件扩散模型(Continuous Conditional Diffusion Model, CCDM)在生成质量与采样效率方面的局限性,尤其是其依赖过时的扩散框架及因长采样轨迹导致的低效问题。解决方案的关键在于提出改进型CCDM(iCCDM),该方案引入更先进的阐明扩散模型(Elucidated Diffusion Model, EDM)框架,并结合一种新颖的矩阵形式EDM表述和自适应邻域训练策略(adaptive vicinal training strategy),从而在保持高图像生成质量的同时显著降低采样成本。
链接: https://arxiv.org/abs/2602.02114
作者: Xin Ding,Yun Chen,Sen Zhang,Kao Zhang,Nenglun Chen,Peibei Cao,Yongwei Wang,Fei Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Continuous Conditional Diffusion Model (CCDM) is a diffusion-based framework designed to generate high-quality images conditioned on continuous regression labels. Although CCDM has demonstrated clear advantages over prior approaches across a range of datasets, it still exhibits notable limitations and has recently been surpassed by a GAN-based method, namely CcGAN-AVAR. These limitations mainly arise from its reliance on an outdated diffusion framework and its low sampling efficiency due to long sampling trajectories. To address these issues, we propose an improved CCDM framework, termed iCCDM, which incorporates the more advanced Elucidated Diffusion Model (EDM) framework with substantial modifications to improve both generation quality and sampling efficiency. Specifically, iCCDM introduces a novel matrix-form EDM formulation together with an adaptive vicinal training strategy. Extensive experiments on four benchmark datasets, spanning image resolutions from 64×64 to 256×256, demonstrate that iCCDM consistently outperforms existing methods, including state-of-the-art large-scale text-to-image diffusion models (e.g., Stable Diffusion 3, FLUX.1, and Qwen-Image), achieving higher generation quality while significantly reducing sampling cost.
zh
[CV-39] An Empirical Study of World Model Quantization
【速读】:该论文旨在解决世界模型(World Model)在实际部署中因计算成本高和内存占用大而导致的效率瓶颈问题,尤其是针对后训练量化(Post-Training Quantization, PTQ)对世界模型性能影响缺乏系统研究的现状。解决方案的关键在于通过系统性实证分析,评估多种PTQ方法在权重仅量化与权重-激活联合量化两种设置下的表现,揭示量化带来的非直观影响:例如分组权重量化可提升低比特(low-bit)推理稳定性,激活量化粒度收益不一致,且编码器与预测模块对量化敏感性存在显著不对称性;同时发现激进的低比特量化会破坏规划目标与任务成功率之间的对齐关系,导致不可修复的失败模式。这一系列发现为在严格计算约束下高效部署量化世界模型提供了关键实践指导。
链接: https://arxiv.org/abs/2602.02110
作者: Zhongqian Fu,Tianyi Zhao,Kai Han,Hang Zhou,Xinghao Chen,Yunhe Wang
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:World models learn an internal representation of environment dynamics, enabling agents to simulate and reason about future states within a compact latent space for tasks such as planning, prediction, and inference. However, running world models incurs heavy computational cost and a large memory footprint, making model quantization essential for efficient deployment. To date, the effects of post-training quantization (PTQ) on world models remain largely unexamined. In this work, we present a systematic empirical study of world model quantization using DINO-WM as a representative case, evaluating diverse PTQ methods under both weight-only and joint weight-activation settings. We conduct extensive experiments on different visual planning tasks across a wide range of bit-widths, quantization granularities, and planning horizons up to 50 iterations. Our results show that quantization effects in world models extend beyond standard accuracy and bit-width trade-offs: group-wise weight quantization can stabilize low-bit rollouts, activation quantization granularity yields inconsistent benefits, and quantization sensitivity is highly asymmetric between encoder and predictor modules. Moreover, aggressive low-bit quantization significantly degrades the alignment between the planning objective and task success, leading to failures that cannot be remedied by additional optimization. These findings reveal distinct quantization-induced failure modes in world model-based planning and provide practical guidance for deploying quantized world models under strict computational constraints. The code will be available at this https URL.
zh
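文中发现分组权重量化可以稳定低比特 rollout,其基本操作是:每组连续权重共享一个缩放因子做对称均匀量化。下面是一个 fake-quant 形式的 PyTorch 示意(位宽与组大小为演示假设):

```python
import torch

def quantize_groupwise(w: torch.Tensor, bits: int = 4, group: int = 128):
    """Uniform symmetric group-wise weight quantization sketch: each group of
    `group` consecutive weights in a row gets its own scale, which is what
    stabilizes low-bit rollouts in the study above. Returns the dequantized
    (fake-quant) weights for simplicity; bits/group sizes are illustrative."""
    out_feat, in_feat = w.shape
    assert in_feat % group == 0
    wg = w.reshape(out_feat, in_feat // group, group)
    qmax = 2 ** (bits - 1) - 1
    scale = wg.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / qmax
    q = (wg / scale).round().clamp(-qmax - 1, qmax)
    return (q * scale).reshape(out_feat, in_feat)

w = torch.randn(256, 512)
err = (quantize_groupwise(w, bits=4) - w).abs().mean()
print(f"mean abs quantization error: {err:.4f}")
```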
[CV-40] acher-Guided Student Self-Knowledge Distillation Using Diffusion Model
【速读】:该论文旨在解决知识蒸馏(Knowledge Distillation, KD)中因教师模型与学生模型之间特征分布差异导致的学生模型学习到不兼容信息的问题。解决方案的关键在于提出一种教师引导的学生扩散自蒸馏方法(Teacher-guided Student Diffusion Self-KD, DSKD):首先利用轻量级扩散模型,通过教师分类器指导学生特征的去噪采样过程,使去噪后的学生特征蕴含教师知识并可视为“伪教师”;进而设计基于局部敏感哈希(Locality-Sensitive Hashing, LSH)的特征蒸馏机制,在原始学生特征与去噪学生特征之间进行对齐,从而消除教师与学生在映射方式和特征分布上的不一致性,实现更有效的知识迁移。
链接: https://arxiv.org/abs/2602.02107
作者: Yu Wang,Chuanguang Yang,Zhulin An,Weilun Feng,Jiarui Zhao,Chengqing Yu,Libo Huang,Boyu Diao,Yongjun Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing Knowledge Distillation (KD) methods often align feature information between teacher and student by exploring meaningful feature processing and loss functions. However, due to the difference in feature distributions between the teacher and student, the student model may learn incompatible information from the teacher. To address this problem, we propose teacher-guided student Diffusion Self-KD, dubbed DSKD. Instead of direct teacher-student alignment, we leverage the teacher classifier to guide the sampling process of denoising student features through a light-weight diffusion model. We then propose a novel locality-sensitive hashing (LSH)-guided feature distillation method between the original and denoised student features. The denoised student features encapsulate teacher knowledge and can be regarded as a pseudo-teacher. In this way, our DSKD method can eliminate discrepancies in mapping manners and feature distributions between the teacher and student, while learning meaningful knowledge from the teacher. Experiments on visual recognition tasks demonstrate that DSKD significantly outperforms existing KD methods across various models and datasets. Our code is included in the supplementary material.
zh
[CV-41] FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space
【速读】:该论文旨在解决图像到视频(Image-to-Video, I2V)生成任务中效率与质量难以兼顾的问题。现有方法在生成高保真度视频时往往计算复杂度高、推理速度慢,难以满足实时或快速生成需求。为此,作者提出FSVideo框架,其核心解决方案包括:1)设计了一个具有高度压缩潜空间(时空下采样比为64×64×4)的视频自动编码器(video autoencoder),在保持重建质量的同时显著降低计算开销;2)改进扩散Transformer(Diffusion Transformer, DIT)架构,引入新型层内存设计以增强层间信息流动和上下文复用,提升模型表达能力;3)采用多分辨率生成策略,通过少量步骤的DIT上采样器实现视频保真度提升。最终模型包含一个140亿参数的DIT基础模型和一个140亿参数的DIT上采样器,在性能上优于其他开源模型,同时推理速度提升一个数量级。
链接: https://arxiv.org/abs/2602.02092
作者: FSVideo Team,Qingyu Chen,Zhiyuan Fang,Haibin Huang,Xinwei Huang,Tong Jin,Minxuan Lin,Bo Liu,Celong Liu,Chongyang Ma,Xing Mei,Xiaohui Shen,Yaojie Shen,Fuwen Tan,Angtian Wang,Xiao Yang,Yiding Yang,Jiamin Yuan,Lingxi Zhang,Yuxin Zhang
机构: ByteDance(字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:We introduce FSVideo, a fast speed transformer-based image-to-video (I2V) diffusion framework. We build our framework on the following key components: 1.) a new video autoencoder with a highly-compressed latent space (64×64×4 spatial-temporal downsampling ratio), achieving competitive reconstruction quality; 2.) a diffusion transformer (DIT) architecture with a new layer memory design to enhance inter-layer information flow and context reuse within DIT, and 3.) a multi-resolution generation strategy via a few-step DIT upsampler to increase video fidelity. Our final model, which contains a 14B DIT base model and a 14B DIT upsampler, achieves competitive performance against other popular open-source models, while being an order of magnitude faster. We discuss our model design as well as training strategies in this report.
zh
[CV-42] UrbanGS: A Scalable and Efficient Architecture for Geometrically Accurate Large-Scene Reconstruction ICLR2026
【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)在扩展至城市尺度场景时面临的几何一致性差、内存效率低和计算可扩展性不足的问题。其核心解决方案包括两个关键模块:一是提出深度一致的D-Normal正则化模块,通过融合外部深度监督与D-Normal约束,实现对所有几何参数(包括位置和旋转)的联合优化,并引入基于梯度一致性与逆深度偏差的自适应置信权重机制,显著提升多视角深度对齐与几何一致性;二是设计空间自适应高斯剪枝(Spatially Adaptive Gaussian Pruning, SAGP)策略,依据局部几何复杂度和可见性动态调整高斯密度以减少冗余表示,并结合统一的分区与视图分配方案消除边界伪影、优化计算负载,从而实现高效、高保真、可扩展的城市级场景重建。
链接: https://arxiv.org/abs/2602.02089
作者: Changbai Li,Haodong Zhu,Hanlin Chen,Xiuping Liang,Tongfei Chen,Shuwei Shao,Linlin Yang,Huobin Tan,Baochang Zhang
机构: Hangzhou International Innovation Institute, Beihang University (北京航空航天大学杭州国际创新研究院); Institute of Artificial Intelligence, Beihang University (北京航空航天大学人工智能研究院); School of Software, Beihang University (北京航空航天大学软件学院); State Key Laboratory of Media Convergence and Communication, Communication University of China (中国传媒大学媒体融合与传播国家重点实验室); Department of Computer Science, National University of Singapore (新加坡国立大学计算机科学系); Control Science and Engineering, Shandong University (山东大学控制科学与工程学院); Artificial Intelligence Research Center, Lobachevsky State University (俄罗斯下诺夫哥罗德国立大学人工智能研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2026
Abstract:While 3D Gaussian Splatting (3DGS) enables high-quality, real-time rendering for bounded scenes, its extension to large-scale urban environments gives rise to critical challenges in terms of geometric consistency, memory efficiency, and computational scalability. To address these issues, we present UrbanGS, a scalable reconstruction framework that effectively tackles these challenges for city-scale applications. First, we propose a Depth-Consistent D-Normal Regularization module. Unlike existing approaches that rely solely on monocular normal estimators, which can effectively update rotation parameters yet struggle to update position parameters, our method integrates D-Normal constraints with external depth supervision. This allows for comprehensive updates of all geometric parameters. By further incorporating an adaptive confidence weighting mechanism based on gradient consistency and inverse depth deviation, our approach significantly enhances multi-view depth alignment and geometric coherence, which effectively resolves the issue of geometric accuracy in complex large-scale scenes. To improve scalability, we introduce a Spatially Adaptive Gaussian Pruning (SAGP) strategy, which dynamically adjusts Gaussian density based on local geometric complexity and visibility to reduce redundancy. Additionally, a unified partitioning and view assignment scheme is designed to eliminate boundary artifacts and optimize computational load. Extensive experiments on multiple urban datasets demonstrate that UrbanGS achieves superior performance in rendering quality, geometric accuracy, and memory efficiency, providing a systematic solution for high-fidelity large-scale scene reconstruction.
zh
[CV-43] Multi-View Stenosis Classification Leverag ing Transformer-Based Multiple-Instance Learning Using Real-World Clinical Data
【速读】:该论文旨在解决冠状动脉狭窄(coronary artery stenosis)诊断中因依赖昂贵的视图级标注(view-level annotations)而导致模型训练受限,以及现有方法难以捕捉多角度血管造影图像间时序动态与依赖关系的问题。解决方案的关键在于提出一种基于Transformer的多视图多实例学习框架(SegmentMIL),该框架仅使用患者级监督信号进行训练,无需任何视图级标注,同时能够联合预测狭窄存在并定位受累解剖区域(区分左右冠状动脉及其节段),从而在内部和外部评估中均优于传统多实例学习基线及视图级模型,展现出临床可部署性和扩展性潜力。
链接: https://arxiv.org/abs/2602.02067
作者: Nikola Cenikj,Özgün Turgut,Alexander Müller,Alexander Steger,Jan Kehrer,Marcus Brugger,Daniel Rueckert,Eimo Martens,Philip Müller
机构: Chair for AI in Healthcare and Medicine, Technical University of Munich (TUM) and TUM University Hospital, Munich, Germany; Department of Computing, Imperial College London, UK; Munich Center for Machine Learning (MCML), Munich, Germany; Department of Internal Medicine, TUM University Hospital, Munich, Germany
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Coronary artery stenosis is a leading cause of cardiovascular disease, diagnosed by analyzing the coronary arteries from multiple angiography views. Although numerous deep-learning models have been proposed for stenosis detection from a single angiography view, their performance heavily relies on expensive view-level annotations, which are often not readily available in hospital systems. Moreover, these models fail to capture the temporal dynamics and dependencies among multiple views, which are crucial for clinical diagnosis. To address this, we propose SegmentMIL, a transformer-based multi-view multiple-instance learning framework for patient-level stenosis classification. Trained on a real-world clinical dataset, using patient-level supervision and without any view-level annotations, SegmentMIL jointly predicts the presence of stenosis and localizes the affected anatomical region, distinguishing between the right and left coronary arteries and their respective segments. SegmentMIL obtains high performance on internal and external evaluations and outperforms both view-level models and classical MIL baselines, underscoring its potential as a clinically viable and scalable solution for coronary stenosis diagnosis. Our code is available at this https URL.
zh
[CV-44] Auto-Comp: An Automated Pipeline for Scalable Compositional Probing of Contrastive Vision-Language Models
【速读】:该论文旨在解决现代视觉语言模型(Vision-Language Models, VLMs)在组合推理能力上的缺陷,特别是模型在处理属性绑定(如颜色与形状、空间关系)时易混淆语义结构的问题,例如将“一个红色立方体和一个蓝色球体”误判为“一个蓝色立方体和一个红色球体”。其解决方案的关键在于提出Auto-Comp——一个全自动、可合成的基准生成管道,能够通过可控的图像-文本对(包括最小化描述和大语言模型生成的上下文描述)实现A/B测试,从而精细地分离和评估不同推理技能。该方法不仅揭示了CLIP和SigLIP模型家族普遍存在的组合失败现象,还发现模型对低熵干扰项(如重复对象或颜色)高度敏感,表明其问题超越传统“词袋”假设;同时识别出视觉-语言上下文在提升空间推理的同时会因引入视觉杂波而损害局部属性绑定能力,展现出一种新的权衡关系。
链接: https://arxiv.org/abs/2602.02043
作者: Cristian Sbrolli,Matteo Matteucci,Toshihiko Yamasaki
机构: Politecnico di Milano (米兰理工大学); The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Modern Vision-Language Models (VLMs) exhibit a critical flaw in compositional reasoning, often confusing "a red cube and a blue sphere" with "a blue cube and a red sphere". Disentangling the visual and linguistic roots of these failures is a fundamental challenge for robust evaluation. To enable fine-grained, controllable analysis, we introduce Auto-Comp, a fully automated and synthetic pipeline for generating scalable benchmarks. Its controllable nature is key to dissecting and isolating different reasoning skills. Auto-Comp generates paired images from Minimal (e.g., "a monitor to the left of a bicycle on a white background") and LLM-generated Contextual captions (e.g., "In a brightly lit photography studio, a monitor is positioned to the left of a bicycle"), allowing a controlled A/B test to disentangle core binding ability from visio-linguistic complexity. Our evaluation of 20 VLMs on novel benchmarks for color binding and spatial relations reveals universal compositional failures in both CLIP and SigLIP model families. Crucially, our novel "Confusion Benchmark" reveals a deeper flaw beyond simple attribute swaps: models are highly susceptible to low-entropy distractors (e.g., repeated objects or colors), demonstrating their compositional failures extend beyond known bag-of-words limitations. We uncover a surprising trade-off: visio-linguistic context, which provides global scene cues, aids spatial reasoning but simultaneously hinders local attribute binding by introducing visual clutter. We release the Auto-Comp pipeline to facilitate future benchmark creation, alongside all our generated benchmarks (this https URL).
zh
[CV-45] One Size Many Fits: Aligning Diverse Group-Wise Click Preferences in Large-Scale Advertising Image Generation
【速读】:该论文旨在解决广告图像生成中“一刀切”策略导致的用户群体偏好差异被忽视的问题,即现有方法仅优化整体点击率(Click-Through Rate, CTR),而未能针对不同用户群体的个性化偏好进行优化,从而限制了精准营销的效果。解决方案的关键在于提出一个统一框架 One Size, Many Fits (OSMF),其核心包括两个创新:一是基于产品感知的自适应分组机制,动态根据用户属性和产品特征构建具有丰富集体偏好特征的用户群;二是引入群组感知的多模态大语言模型(Group-aware Multimodal Large Language Model, G-MLLM),通过预训练实现对群组特征的理解与图像生成,并进一步采用提出的 Group-DPO 方法进行微调,以增强各群组在生成图像上的CTR表现,从而实现群体偏好对齐。
链接: https://arxiv.org/abs/2602.02033
作者: Shuo Lu,Haohan Wang,Wei Feng,Weizhen Wang,Shen Zhang,Yaoyu Li,Ao Ma,Zheng Zhang,Jingjing Lv,Junjie Shen,Ching Law,Bing Zhan,Yuan Xu,Huizai Yao,Yongcan Yu,Chenyang Si,Jian Liang
机构: NLPR & MAIS, CASIA; School of AI, UCAS; JD.COM; HKUST(gz); PRLab, NJU
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Advertising image generation has increasingly focused on online metrics like Click-Through Rate (CTR), yet existing approaches adopt a "one-size-fits-all" strategy that optimizes for overall CTR while neglecting preference diversity among user groups. This leads to suboptimal performance for specific groups, limiting targeted marketing effectiveness. To bridge this gap, we present One Size, Many Fits (OSMF), a unified framework that aligns diverse group-wise click preferences in large-scale advertising image generation. OSMF begins with product-aware adaptive grouping, which dynamically organizes users based on their attributes and product characteristics, representing each group with rich collective preference features. Building on these groups, preference-conditioned image generation employs a Group-aware Multimodal Large Language Model (G-MLLM) to generate tailored images for each group. The G-MLLM is pre-trained to simultaneously comprehend group features and generate advertising images. Subsequently, we fine-tune the G-MLLM using our proposed Group-DPO for group-wise preference alignment, which effectively enhances each group's CTR on the generated images. To further advance this field, we introduce the Grouped Advertising Image Preference Dataset (GAIP), the first large-scale public dataset of group-wise image preferences, including around 600K groups built from 40M users. Extensive experiments demonstrate that our framework achieves the state-of-the-art performance in both offline and online settings. Our code and datasets will be released at this https URL.
zh
[CV-46] ClueTracer: Question-to-Vision Clue Tracing for Training-Free Hallucination Suppression in Multimodal Reasoning
【速读】:该论文旨在解决大模型在视觉推理任务中因“推理漂移”(reasoning drift)导致的幻觉问题:即模型在获取视觉线索时过度关注与问题无关的实体,从而稀释对任务相关线索的关注,使推理路径逐渐脱离视觉基础。解决方案的关键在于提出一种无需训练、无参数且与架构无关的插件方法 ClueTracer,其通过从问题出发,沿推理路径(问题 → 输出 → 视觉标记)追踪关键线索的传播过程,精准定位任务相关的图像区域,抑制对无关区域的错误关注,从而有效减少幻觉并提升推理准确性。
链接: https://arxiv.org/abs/2602.02004
作者: Gongli Xi,Kun Wang,Zeming Gao,Huahui Yi,Haolang Lu,Ye Tian,Wendong Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 20 pages, 7 figures
Abstract:Large multimodal reasoning models solve challenging visual problems via explicit long-chain inference: they gather visual clues from images and decode clues into textual tokens. Yet this capability also increases hallucinations, where the model generates content that is not supported by the input image or the question. To understand this failure mode, we identify reasoning drift: during clue gathering, the model over-focuses on question-irrelevant entities, diluting focus on task-relevant cues and gradually decoupling the reasoning trace from visual grounding. As a consequence, many inference-time localization or intervention methods developed for non-reasoning models fail to pinpoint the true clues in reasoning settings. Motivated by these insights, we introduce ClueRecall, a metric for assessing visual clue retrieval, and present ClueTracer, a training-free, parameter-free, and architecture-agnostic plugin for hallucination suppression. ClueTracer starts from the question and traces how key clues propagate along the model's reasoning pathway (question → outputs → visual tokens), thereby localizing task-relevant patches while suppressing spurious attention to irrelevant regions. Remarkably, without any additional training, ClueTracer improves all reasoning architectures (including R1-OneVision, Ocean-R1, MM-Eureka, etc.) by 1.21× on reasoning benchmarks. When transferred to non-reasoning settings, it yields a 1.14× gain.
zh
[CV-47] UniDriveDreamer: A Single-Stage Multimodal World Model for Autonomous Driving
【速读】:该论文旨在解决当前自动驾驶世界模型在数据合成中普遍存在的单模态局限性问题,即现有方法通常仅能生成多摄像头视频或LiDAR序列中的一种模态,难以实现多模态协同生成。其解决方案的关键在于提出UniDriveDreamer——一个单阶段统一的多模态世界模型,通过引入针对LiDAR设计的变分自编码器(Variational Autoencoder, VAE)与用于多摄像头图像的VAE,并结合统一潜在锚定(Unified Latent Anchoring, ULA)机制,显式对齐两种模态的潜在分布,从而确保跨模态兼容性和训练稳定性;随后利用扩散Transformer联合建模多模态特征的几何对应关系与时间演化过程,同时以结构化场景布局信息作为条件信号引导生成,最终实现了无需中间表示或级联模块即可直接生成多模态未来观测结果的能力。
链接: https://arxiv.org/abs/2602.02002
作者: Guosheng Zhao,Yaozeng Wang,Xiaofeng Wang,Zheng Zhu,Tingdong Yu,Guan Huang,Yongchen Zai,Ji Jiao,Changliang Xue,Xiaole Wang,Zhen Yang,Futang Zhu,Xingang Wang
机构: GigaAI; CASIA (中国科学院自动化研究所); BYD
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 7 figures
Abstract:World models have demonstrated significant promise for data synthesis in autonomous driving. However, existing methods predominantly concentrate on single-modality generation, typically focusing on either multi-camera video or LiDAR sequence synthesis. In this paper, we propose UniDriveDreamer, a single-stage unified multimodal world model for autonomous driving, which directly generates multimodal future observations without relying on intermediate representations or cascaded modules. Our framework introduces a LiDAR-specific variational autoencoder (VAE) designed to encode input LiDAR sequences, alongside a video VAE for multi-camera images. To ensure cross-modal compatibility and training stability, we propose Unified Latent Anchoring (ULA), which explicitly aligns the latent distributions of the two modalities. The aligned features are fused and processed by a diffusion transformer that jointly models their geometric correspondence and temporal evolution. Additionally, structured scene layout information is projected per modality as a conditioning signal to guide the synthesis. Extensive experiments demonstrate that UniDriveDreamer outperforms previous state-of-the-art methods in both video and LiDAR generation, while also yielding measurable improvements in downstream tasks.
zh
[CV-48] SurfSplat: Conquering Feedforward 2D Gaussian Splatting with Surface Continuity Priors ICLR2026
【速读】:该论文旨在解决从稀疏图像中重建高质量三维场景时,现有基于3D高斯泼溅(3D Gaussian Splatting, 3DGS)的方法难以恢复连续表面、易产生离散且颜色偏倚点云的问题,尤其在近距离观察时会出现严重伪影。其解决方案的关键在于提出SurfSplat框架,该框架采用二维高斯泼溅(2D Gaussian Splatting, 2DGS)作为基本表示单元,通过引入表面连续性先验(surface continuity prior)和强制Alpha混合策略(forced alpha blending strategy),实现了几何结构的连贯性和纹理的真实还原,同时设计了高分辨率渲染一致性(High-Resolution Rendering Consistency, HRRC)作为新的评估指标,有效量化高分辨率下的重建质量,实验表明该方法在RealEstate10K、DL3DV和ScanNet数据集上均显著优于现有方法。
链接: https://arxiv.org/abs/2602.02000
作者: Bing He,Jingnan Gao,Yunuo Chen,Ning Cao,Gang Chen,Zhengxue Cheng,Li Song,Wenjun Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Tianyi Shilian Technology Co., Ltd (天翼视联科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ICLR 2026
Abstract:Reconstructing 3D scenes from sparse images remains a challenging task due to the difficulty of recovering accurate geometry and texture without optimization. Recent approaches leverage generalizable models to generate 3D scenes using 3D Gaussian Splatting (3DGS) primitive. However, they often fail to produce continuous surfaces and instead yield discrete, color-biased point clouds that appear plausible at normal resolution but reveal severe artifacts under close-up views. To address this issue, we present SurfSplat, a feedforward framework based on 2D Gaussian Splatting (2DGS) primitive, which provides stronger anisotropy and higher geometric precision. By incorporating a surface continuity prior and a forced alpha blending strategy, SurfSplat reconstructs coherent geometry together with faithful textures. Furthermore, we introduce High-Resolution Rendering Consistency (HRRC), a new evaluation metric designed to evaluate high-resolution reconstruction quality. Extensive experiments on RealEstate10K, DL3DV, and ScanNet demonstrate that SurfSplat consistently outperforms prior methods on both standard metrics and HRRC, establishing a robust solution for high-fidelity 3D reconstruction from sparse inputs. Project page: this https URL
zh
[CV-49] Leverag ing Latent Vector Prediction for Localized Control in Image Generation via Diffusion Models
【速读】:该论文旨在解决扩散模型(Diffusion Models)在文本到图像生成中难以实现局部精细控制的问题。现有方法虽引入图像级条件(如边缘图、分割图等)以提升控制能力,但这些条件在整个图像上均匀应用,限制了对特定区域的精准调控。解决方案的关键在于提出一种新的训练框架,通过引入掩码特征(masking features)和额外损失项,利用任意扩散步骤中初始潜在向量的预测结果,增强当前步骤与最终样本在潜在空间中的对应关系,从而实现用户指定区域内图像内容的精确控制,同时保留扩散模型对其他区域的自主生成能力。
链接: https://arxiv.org/abs/2602.01991
作者: Pablo Domingo-Gregorio,Javier Ruiz-Hidalgo
机构: Universitat Politècnica de Catalunya (加泰罗尼亚理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion models have emerged as a leading approach in text-to-image generation, producing high-quality images from textual descriptions. However, attempting to achieve detailed control to get a desired image solely through text remains a laborious trial-and-error endeavor. Recent methods have introduced image-level controls alongside text prompts, using prior images to extract conditional information such as edges, segmentation and depth maps. While effective, these methods apply conditions uniformly across the entire image, limiting localized control. In this paper, we propose a novel methodology to enable precise local control over user-defined regions of an image, while leaving to the diffusion model the task of autonomously generating the remaining areas according to the original prompt. Our approach introduces a new training framework that incorporates masking features and an additional loss term, which leverages the prediction of the initial latent vector at any diffusion step to enhance the correspondence between the current step and the final sample in the latent space. Extensive experiments demonstrate that our method effectively synthesizes high-quality images with controlled local conditions.
zh
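摘要提到的"在任意扩散步预测初始潜在向量",在标准 DDPM 参数化下有封闭形式:x̂₀ = (x_t − √(1−ᾱ_t)·ε̂)/√ᾱ_t。下面用 PyTorch 示意该恒等式,以及一个按掩码限制区域的损失项(掩码损失的具体形式是此处的假设,并非论文原式):

```python
import torch

def predict_x0(x_t: torch.Tensor, eps_pred: torch.Tensor,
               alpha_bar_t: torch.Tensor) -> torch.Tensor:
    """Closed-form estimate of the clean latent from any diffusion step,
    using the standard DDPM identity
        x0_hat = (x_t - sqrt(1 - a_bar_t) * eps_hat) / sqrt(a_bar_t).
    A localized loss can then compare x0_hat with the reference inside the
    user-specified region (a sketch, not the paper's exact loss)."""
    return (x_t - (1 - alpha_bar_t).sqrt() * eps_pred) / alpha_bar_t.sqrt()

def masked_latent_loss(x0_hat, x0_ref, mask):
    """MSE restricted to the controlled region; `mask` is 1 inside it."""
    return ((x0_hat - x0_ref) ** 2 * mask).sum() / mask.sum().clamp_min(1.0)
```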
[CV-50] Enhancing Multi-Image Understanding through Delimiter Token Scaling ICLR2026
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在处理多图像输入时性能下降的问题,其核心原因是跨图像信息泄露(cross-image information leakage),即模型难以区分不同图像之间的信息。现有LVLMs虽已使用分隔符标记(delimiter tokens)来标识每张图像的起止位置,但分析表明这些标记未能有效阻断跨图像的信息流动。论文提出的关键解决方案是:通过缩放分隔符标记的隐藏状态(hidden states),增强模型对图像内信息交互的保留能力,并抑制不必要的跨图像交互,从而提升模型对多图像内容的区分与推理准确性。该方法无需额外训练或推理成本,在多个多图像和多文档理解基准测试中均取得显著性能提升。
链接: https://arxiv.org/abs/2602.01984
作者: Minyoung Lee,Yeji Park,Dongjun Hwang,Yejin Kim,Seong Joon Oh,Junsuk Choe
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICLR 2026
Abstract:Large Vision-Language Models (LVLMs) achieve strong performance on single-image tasks, but their performance declines when multiple images are provided as input. One major reason is the cross-image information leakage, where the model struggles to distinguish information across different images. Existing LVLMs already employ delimiter tokens to mark the start and end of each image, yet our analysis reveals that these tokens fail to effectively block cross-image information leakage. To enhance their effectiveness, we propose a method that scales the hidden states of delimiter tokens. This enhances the model’s ability to preserve image-specific information by reinforcing intra-image interaction and limiting undesired cross-image interactions. Consequently, the model is better able to distinguish between images and reason over them more accurately. Experiments show performance gains on multi-image benchmarks such as Mantis, MuirBench, MIRB, and QBench2. We further evaluate our method on text-only tasks that require clear distinction. The method improves performance on multi-document and multi-table understanding benchmarks, including TQABench, MultiNews, and WCEP-10. Notably, our method requires no additional training or inference cost.
zh
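该方法本身非常轻量:对图像起止分隔符位置的隐藏状态乘以一个缩放因子即可,无需任何训练。下面的 PyTorch 片段给出示意,缩放系数 gamma 的取值为演示假设:

```python
import torch

def scale_delimiter_states(hidden: torch.Tensor,
                           delim_positions: torch.Tensor,
                           gamma: float = 2.0) -> torch.Tensor:
    """Scale the hidden states at image start/end delimiter positions by a
    constant factor, strengthening their role as cross-image barriers.
    hidden: (B, L, D); delim_positions: (B, L) boolean mask of delimiters.
    The value of gamma is an assumption; the paper tunes this scaling."""
    scale = torch.where(delim_positions.unsqueeze(-1),
                        torch.full_like(hidden, gamma),
                        torch.ones_like(hidden))
    return hidden * scale

h = torch.randn(1, 10, 16)
mask = torch.zeros(1, 10, dtype=torch.bool)
mask[0, [0, 4, 5, 9]] = True    # <img>...</img><img>...</img> markers
print(scale_delimiter_states(h, mask).shape)  # torch.Size([1, 10, 16])
```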
[CV-51] FlyPrompt: Brain-Inspired Random-Expanded Routing with Temporal-Ensemble Experts for General Continual Learning ICLR2026
【速读】:该论文旨在解决通用持续学习(General Continual Learning, GCL)场景下的核心挑战,即在无明确任务边界、单次遍历且非平稳的数据流中,如何高效地对预训练模型进行参数高效微调(Parameter-Efficient Tuning, PET),同时应对专家参数分配与表征能力受限的问题。其解决方案的关键在于提出一种受果蝇层级记忆系统启发的FlyPrompt框架,通过将GCL分解为专家路由(expert routing)和专家能力提升(expert competence improvement)两个子问题:一是设计随机扩展的解析路由器实现实例级专家激活,二是引入输出头的时间集成机制以动态适应决策边界的变化,从而在有限监督下显著增强模型的持续学习性能。
链接: https://arxiv.org/abs/2602.01976
作者: Hongwei Yan,Guanglong Sun,Kanglei Zhou,Qian Li,Liyuan Wang,Yi Zhong
机构: Tsinghua University (清华大学); Sun Yat-Sen University (中山大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 33 pages. Accepted by ICLR 2026
Abstract:General continual learning (GCL) challenges intelligent systems to learn from single-pass, non-stationary data streams without clear task boundaries. While recent advances in continual parameter-efficient tuning (PET) of pretrained models show promise, they typically rely on multiple training epochs and explicit task cues, limiting their effectiveness in GCL scenarios. Moreover, existing methods often lack targeted design and fail to address two fundamental challenges in continual PET: how to allocate expert parameters to evolving data distributions, and how to improve their representational capacity under limited supervision. Inspired by the fruit fly’s hierarchical memory system characterized by sparse expansion and modular ensembles, we propose FlyPrompt, a brain-inspired framework that decomposes GCL into two subproblems: expert routing and expert competence improvement. FlyPrompt introduces a randomly expanded analytic router for instance-level expert activation and a temporal ensemble of output heads to dynamically adapt decision boundaries over time. Extensive theoretical and empirical evaluations demonstrate FlyPrompt’s superior performance, achieving up to 11.23%, 12.43%, and 7.62% gains over state-of-the-art baselines on CIFAR-100, ImageNet-R, and CUB-200, respectively. Our source code is available at this https URL.
[CV-52] Your AI-Generated Image Detector Can Secretly Achieve SOTA Accuracy If Calibrated AAAI2026
【Quick Read】: This paper addresses the systematic test-time bias of AI-generated image detectors: although trained on balanced datasets, they tend to misclassify fake images as real, which is attributed to distributional shift in fake samples and implicit priors learned during training. The key to the solution is a post-hoc calibration framework grounded in Bayesian decision theory: with the backbone frozen, a learnable scalar correction to the model's logits is optimized on a small validation set from the target distribution, compensating for the shift in model outputs and realigning the decision boundary, thereby improving detection robustness even without ground-truth labels.
Link: https://arxiv.org/abs/2602.01973
Authors: Muli Yang,Gabriel James Goenawan,Henan Wang,Huaiyuan Qin,Chenghao Xu,Yanhua Yang,Fen Fang,Ying Sun,Joo-Hwee Lim,Hongyuan Zhu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: AAAI 2026. Code: this https URL
Abstract:Despite being trained on balanced datasets, existing AI-generated image detectors often exhibit systematic bias at test time, frequently misclassifying fake images as real. We hypothesize that this behavior stems from distributional shift in fake samples and implicit priors learned during training. Specifically, models tend to overfit to superficial artifacts that do not generalize well across different generation methods, leading to a misaligned decision threshold when faced with test-time distribution shift. To address this, we propose a theoretically grounded post-hoc calibration framework based on Bayesian decision theory. In particular, we introduce a learnable scalar correction to the model’s logits, optimized on a small validation set from the target distribution while keeping the backbone frozen. This parametric adjustment compensates for distributional shift in model output, realigning the decision boundary even without requiring ground-truth labels. Experiments on challenging benchmarks show that our approach significantly improves robustness without retraining, offering a lightweight and principled solution for reliable and adaptive AI-generated image detection in the open world. Code is available at this https URL.
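The calibration idea reduces to fitting one scalar. Below is a minimal sketch, assuming a binary real/fake detector and a small labeled validation set; the paper also argues the correction can be fitted without ground-truth labels, which this toy does not implement.

```python
# Minimal sketch: freeze the detector, fit a single scalar logit correction.
import torch

def calibrate_bias(logits_val, labels_val, steps=200, lr=0.05):
    b = torch.zeros(1, requires_grad=True)       # learnable scalar correction
    opt = torch.optim.Adam([b], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.binary_cross_entropy_with_logits(
            logits_val + b, labels_val)
        loss.backward()
        opt.step()
    return b.detach()

# toy data: a detector biased toward predicting "real" (shifted-down logits)
logits = torch.randn(512) - 1.0
labels = (torch.rand(512) > 0.5).float()
b = calibrate_bias(logits, labels)
print("fitted correction:", float(b))            # shifts the decision boundary
```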
[CV-53] Beyond Open Vocabulary: Multimodal Prompting for Object Detection in Remote Sensing Images
【Quick Read】: This paper studies the instability of open-vocabulary object detection in remote sensing: existing methods specify categories via text-only prompting, implicitly assuming that the text-visual alignment established during pretraining transfers directly to inference, an assumption that often breaks down in practice due to task- and application-specific category semantics and distribution shifts, making category specification unreliable. The key to the solution is the RS-MPOD framework, which reformulates category specification with instance-grounded visual prompts: a visual prompt encoder extracts appearance-based category cues from exemplar instances to enable text-free category specification, and a multimodal fusion module integrates visual and textual information when both are available, yielding more reliable category specification under semantic ambiguity or distribution drift while remaining competitive when textual semantics are well aligned.
Link: https://arxiv.org/abs/2602.01954
Authors: Shuai Yang,Ziyue Huang,Jiaxin Chen,Qingjie Liu,Yunhong Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Open-vocabulary object detection in remote sensing commonly relies on text-only prompting to specify target categories, implicitly assuming that inference-time category queries can be reliably grounded through pretraining-induced text-visual alignment. In practice, this assumption often breaks down in remote sensing scenarios due to task- and application-specific category semantics, resulting in unstable category specification under open-vocabulary settings. To address this limitation, we propose RS-MPOD, a multimodal open-vocabulary detection framework that reformulates category specification beyond text-only prompting by incorporating instance-grounded visual prompts, textual prompts, and their multimodal integration. RS-MPOD introduces a visual prompt encoder to extract appearance-based category cues from exemplar instances, enabling text-free category specification, and a multimodal fusion module to integrate visual and textual information when both modalities are available. Extensive experiments on standard, cross-dataset, and fine-grained remote sensing benchmarks show that visual prompting yields more reliable category specification under semantic ambiguity and distribution shifts, while multimodal prompting provides a flexible alternative that remains competitive when textual semantics are well aligned.
[CV-54] Enabling Progressive Whole-slide Image Analysis with Multi-scale Pyramidal Network
【Quick Read】: This paper targets a limitation of multiple-instance learning (MIL) for computational pathology (CPath): existing multi-scale patch methods depend on inputs at several fixed magnifications with late feature fusion, which fails to retain the links between features across scales while being tied to arbitrary manufacturer-defined magnifications, inflexible, and computationally expensive. The key to the solution is the Multi-scale Pyramidal Network (MSPN), a lightweight plug-and-play module with two components: (1) grid-based remapping, which derives coarse features from high-magnification features to preserve cross-scale information flow, and (2) a Coarse Guidance Network (CGN) that learns coarse context. This design lets MSPN perform progressive coarse-to-fine multi-scale analysis without modifying the underlying attention-based MIL framework, while consistently improving accuracy.
Link: https://arxiv.org/abs/2602.01951
Authors: Shuyang Wu,Yifu Qiu,Ines P. Nearchou,Sandrine Prost,Jonathan A Fallowfield,Hakan Bilen,Timothy J Kendall
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Multiple-instance Learning (MIL) is commonly used to undertake computational pathology (CPath) tasks, and the use of multi-scale patches allows diverse features across scales to be learned. Previous studies using multi-scale features in clinical applications rely on multiple inputs across magnifications with late feature fusion, which does not retain the link between features across scales while the inputs are dependent on arbitrary, manufacturer-defined magnifications, being inflexible and computationally expensive. In this paper, we propose the Multi-scale Pyramidal Network (MSPN), which is plug-and-play over attention-based MIL that introduces progressive multi-scale analysis on WSI. Our MSPN consists of (1) grid-based remapping that uses high magnification features to derive coarse features and (2) the coarse guidance network (CGN) that learns coarse contexts. We benchmark MSPN as an add-on module to 4 attention-based frameworks using 4 clinically relevant tasks across 3 types of foundation model, as well as the pre-trained MIL framework. We show that MSPN consistently improves MIL across the compared configurations and tasks, while being lightweight and easy-to-use.
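A rough sketch of what grid-based remapping could look like: patch features extracted at high magnification are pooled over their WSI grid to synthesize a coarser view, so coarse context stays derived from, and linked to, the fine features. The pooling operator and all shapes are our assumptions.

```python
# Sketch: derive coarse "virtual magnification" instances from fine patch features.
import torch

def grid_remap(patch_feats, grid_hw, factor=2):
    # patch_feats: (H*W, D) high-magnification features laid out on a WSI grid
    H, W = grid_hw
    D = patch_feats.shape[-1]
    grid = patch_feats.reshape(1, H, W, D).permute(0, 3, 1, 2)   # (1, D, H, W)
    coarse = torch.nn.functional.avg_pool2d(grid, factor)        # (1, D, H/f, W/f)
    return coarse.permute(0, 2, 3, 1).reshape(-1, D)             # coarse instances

fine = torch.randn(64 * 64, 768)               # e.g., 20x patch embeddings
coarse = grid_remap(fine, (64, 64), factor=4)  # acts like a 5x view
print(fine.shape, coarse.shape)
```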
[CV-55] Boundary-Constrained Diffusion Models for Floorplan Generation: Balancing Realism and Diversity
【Quick Read】: This paper examines how optimizing diffusion-based floorplan generators for perceptual metrics such as the Fréchet Inception Distance (FID) limits design diversity, exposes a diversity collapse under prolonged training, and reveals the models' reliance on dataset priors. The key to the solution is a new metric, the Diversity Score (DS), which quantifies layout diversity under fixed constraints, together with a Boundary Cross-Attention (BCA) module that improves geometric consistency, maintaining high realism while increasing design diversity and underscoring that generative systems must explicitly balance fidelity, diversity, and generalization.
Link: https://arxiv.org/abs/2602.01949
Authors: Leonardo Stoppani,Davide Bacciu,Shahab Mokarizadeh
Affiliations: University of Pisa - Dept of Computer Science; H&M Group
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ESANN 2026
Abstract:Diffusion models have become widely popular for automated floorplan generation, producing highly realistic layouts conditioned on user-defined constraints. However, optimizing for perceptual metrics such as the Fréchet Inception Distance (FID) causes limited design diversity. To address this, we propose the Diversity Score (DS), a metric that quantifies layout diversity under fixed constraints. Moreover, to improve geometric consistency, we introduce a Boundary Cross-Attention (BCA) module that enables conditioning on building boundaries. Our experiments show that BCA significantly improves boundary adherence, while prolonged training drives diversity collapse undiagnosed by FID, revealing a critical trade-off between realism and diversity. Out-Of-Distribution evaluations further demonstrate the models’ reliance on dataset priors, emphasizing the need for generative systems that explicitly balance fidelity, diversity, and generalization in architectural design tasks.
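A diversity score of this kind can be sketched as average pairwise dissimilarity among layouts generated for one fixed constraint. The distance below (pixelwise disagreement between rasterized room maps) is only a stand-in; the paper's DS may use a different distance.

```python
# Sketch of a constraint-conditional diversity metric (distance is illustrative).
import numpy as np

def pairwise_diversity(layouts):
    # layouts: list of (H, W) integer maps of room labels for one constraint
    n, scores = len(layouts), []
    for i in range(n):
        for j in range(i + 1, n):
            agree = (layouts[i] == layouts[j]).mean()   # pixelwise agreement
            scores.append(1.0 - agree)                  # dissimilarity
    return float(np.mean(scores))

rng = np.random.default_rng(1)
samples = [rng.integers(0, 5, (32, 32)) for _ in range(6)]
print("diversity:", pairwise_diversity(samples))  # near 0 would mean collapse
```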
[CV-56] LIEREx: Language-Image Embeddings for Robotic Exploration
【Quick Read】: This paper addresses a limitation of traditional semantic maps in handling unseen object categories: their reliance on predefined symbolic vocabularies makes it impractical to handle out-of-distribution knowledge not specified at design time. The key to the solution is to combine vision-language foundation models (VLFMs) such as CLIP with established 3D semantic scene graphs, representing objects as high-dimensional embeddings rather than fixed labels, thereby enabling open-set mapping and target-directed exploration by an autonomous agent in partially unknown environments.
Link: https://arxiv.org/abs/2602.01930
Authors: Felix Igelbrink,Lennart Niecksch,Marian Renz,Martin Günther,Martin Atzmueller
Affiliations: DFKI (German Research Center for Artificial Intelligence); University of Siegen
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this article is published in KI - Künstliche Intelligenz, and is available online at this https URL
Abstract:Semantic maps allow a robot to reason about its surroundings to fulfill tasks such as navigating known environments, finding specific objects, and exploring unmapped areas. Traditional mapping approaches provide accurate geometric representations but are often constrained by pre-designed symbolic vocabularies. The reliance on fixed object classes makes it impractical to handle out-of-distribution knowledge not defined at design time. Recent advances in Vision-Language Foundation Models, such as CLIP, enable open-set mapping, where objects are encoded as high-dimensional embeddings rather than fixed labels. In LIEREx, we integrate these VLFMs with established 3D Semantic Scene Graphs to enable target-directed exploration by an autonomous agent in partially unknown environments.
[CV-57] DSXFormer: Dual-Pooling Spectral Squeeze-Expansion and Dynamic Context Attention Transformer for Hyperspectral Image Classification
【Quick Read】: This paper tackles the challenges of hyperspectral image classification (HSIC) arising from high spectral dimensionality, complex spectral-spatial correlations, and limited labeled training samples, where existing transformer-based methods struggle to obtain sufficient spectral discriminability while remaining computationally efficient. The key to the solution is DSXFormer, a novel dual-pooling spectral squeeze-expansion transformer whose core innovations are: 1) a Dual-Pooling Spectral Squeeze-Expansion (DSX) block that uses complementary global average and max pooling to adaptively recalibrate spectral feature channels, enhancing spectral discriminability and inter-band dependency modeling; 2) a Dynamic Context Attention (DCA) mechanism embedded in a window-based transformer that dynamically captures local spectral-spatial relationships while sharply reducing computational overhead; and 3) patch extraction, embedding, and merging strategies that support efficient multi-scale feature learning. Together these designs balance spectral emphasis and spatial contextual representation while staying computationally efficient, achieving state-of-the-art results on several benchmark datasets.
Link: https://arxiv.org/abs/2602.01906
Authors: Farhan Ullah,Irfan Ullah,Khalil Khan,Giovanni Pau,JaKeoung Koo
Affiliations: Gachon University; Chengdu University of Technology; Qassim University; Kore University of Enna
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Hyperspectral image classification (HSIC) is a challenging task due to high spectral dimensionality, complex spectral-spatial correlations, and limited labeled training samples. Although transformer-based models have shown strong potential for HSIC, existing approaches often struggle to achieve sufficient spectral discriminability while maintaining computational efficiency. To address these limitations, we propose DSXFormer, a novel dual-pooling spectral squeeze-expansion transformer with Dynamic Context Attention for HSIC. The proposed DSXFormer introduces a Dual-Pooling Spectral Squeeze-Expansion (DSX) block, which exploits complementary global average and max pooling to adaptively recalibrate spectral feature channels, thereby enhancing spectral discriminability and inter-band dependency modeling. In addition, DSXFormer incorporates a Dynamic Context Attention (DCA) mechanism within a window-based transformer architecture to dynamically capture local spectral-spatial relationships while significantly reducing computational overhead. The joint integration of spectral dual-pooling squeeze-expansion and DCA enables DSXFormer to achieve an effective balance between spectral emphasis and spatial contextual representation. Furthermore, patch extraction, embedding, and patch merging strategies are employed to facilitate efficient multi-scale feature learning. Extensive experiments conducted on four widely used hyperspectral benchmark datasets, including Salinas (SA), Indian Pines (IP), Pavia University (PU), and Kennedy Space Center (KSC), demonstrate that DSXFormer consistently outperforms state-of-the-art methods, achieving classification accuracies of 99.95%, 98.91%, 99.85%, and 98.52%, respectively.
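To illustrate the dual-pooling squeeze-expansion idea, here is a minimal PyTorch sketch of channel recalibration driven by complementary global average and max pooling. The shared MLP, the additive fusion of the two pooled summaries, and the reduction ratio are assumptions made for illustration, not the paper's exact design.

```python
# Sketch of a dual-pooling squeeze-expansion block for spectral channels.
import torch
import torch.nn as nn

class DualPoolSqueezeExpand(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):                     # x: (B, C, H, W), C = spectral bands
        avg = x.mean(dim=(2, 3))              # squeeze: global average pooling
        mx = x.amax(dim=(2, 3))               # squeeze: global max pooling
        gate = torch.sigmoid(self.mlp(avg) + self.mlp(mx))  # shared MLP, fused
        return x * gate[:, :, None, None]     # expand: recalibrate each band

x = torch.randn(2, 64, 9, 9)
print(DualPoolSqueezeExpand(64)(x).shape)
```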
[CV-58] Learning Sparse Visual Representations via Spatial-Semantic Factorization
【Quick Read】: This paper addresses the fundamental conflict in self-supervised learning (SSL) between semantic understanding and image reconstruction: high-level semantic SSL methods (e.g., DINO) rely on global tokens for augmentation alignment, which enforces location invariance and discards the spatial coordinates needed for pixel-level reconstruction, whereas generative SSL methods (e.g., MAE) preserve dense feature grids that support high-quality reconstruction but fail to produce high-level abstractions. The key to the solution is the STELLAR framework, which factorizes visual features into a low-rank product of semantic concepts and their spatial distributions, decoupling semantics from spatial information: DINO-style augmentation alignment is applied to the semantic tokens while the localization matrix retains the precise spatial mapping, simultaneously supporting high-fidelity reconstruction (2.60 FID) and semantic performance matching dense backbones (79.10% ImageNet accuracy).
Link: https://arxiv.org/abs/2602.01905
Authors: Theodore Zhengde Zhao,Sid Kiblawi,Jianwei Yang,Naoto Usuyama,Reuben Tan,Noel C Codella,Tristan Naumann,Hoifung Poon,Mu Wei
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Self-supervised learning (SSL) faces a fundamental conflict between semantic understanding and image reconstruction. High-level semantic SSL (e.g., DINO) relies on global tokens that are forced to be location-invariant for augmentation alignment, a process that inherently discards the spatial coordinates required for reconstruction. Conversely, generative SSL (e.g., MAE) preserves dense feature grids for reconstruction but fails to produce high-level abstractions. We introduce STELLAR, a framework that resolves this tension by factorizing visual features into a low-rank product of semantic concepts and their spatial distributions. This disentanglement allows us to perform DINO-style augmentation alignment on the semantic tokens while maintaining the precise spatial mapping in the localization matrix necessary for pixel-level reconstruction. We demonstrate that as few as 16 sparse tokens under this factorized form are sufficient to simultaneously support high-quality reconstruction (2.60 FID) and match the semantic performance of dense backbones (79.10% ImageNet accuracy). Our results highlight STELLAR as a versatile sparse representation that bridges the gap between discriminative and generative vision by strategically separating semantic identity from spatial geometry. Code available at this https URL.
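The factorized form itself is easy to write down. The sketch below shows dense features reconstructed as a low-rank product of K semantic tokens and their spatial distributions; in STELLAR both factors are learned end to end, which this snippet does not attempt.

```python
# Sketch of the spatial-semantic factorization: dense features = loc @ sem.
import torch

B, HW, D, K = 2, 196, 768, 16    # 16 sparse tokens, as reported in the paper
loc = torch.softmax(torch.randn(B, HW, K), dim=-1)   # spatial distributions
sem = torch.randn(B, K, D)                           # semantic tokens

dense_recon = loc @ sem          # (B, HW, D): features recovered from factors
print(dense_recon.shape)
# Augmentation alignment would act on `sem` only, while `loc` keeps the
# coordinates needed for pixel-level reconstruction.
```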
[CV-59] Q Cache: Visual Attention is Valuable in Less than Half of Decode Layers for Multimodal Large Language Model AAAI26
【Quick Read】: This paper targets the inference bottleneck of multimodal large language models (MLLMs), whose vision encoders produce many redundant visual tokens that inflate both computation and the key-value (KV) cache footprint. Existing work focuses on token-level optimization, using intricate token-pruning techniques to drop non-crucial visual tokens, but this often damages KV-cache integrity and hurts long-text generation. The key innovation is a new view of the attention mechanism: attention patterns in more than half of the decode layers are semantically similar, motivating Lazy Attention, which removes layer-wise redundant computation by letting certain layers inherit the attention of their preceding layers. Its core component, Q Cache, is a lightweight layer-shared query cache tailored to MLLMs that enables query reuse across adjacent layers and is fully compatible with existing inference frameworks such as Flash Attention and the KV cache; being orthogonal to token-level techniques, it can be deployed alone or combined with token pruning. Experiments show over 35% KV-cache reduction and a 1.5x throughput gain at roughly 1% performance loss, with better accuracy preservation than state-of-the-art token-level pruning methods.
Link: https://arxiv.org/abs/2602.01901
Authors: Jiedong Zhuang,Lu Lu,Ming Dai,Rui Hu,Jian Chen,Qiang Liu,Haoji Hu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI26
Abstract:Multimodal large language models (MLLMs) are plagued by exorbitant inference costs attributable to the profusion of visual tokens within the vision encoder. The redundant visual tokens engender a substantial computational load and a key-value (KV) cache footprint bottleneck. Existing approaches focus on token-wise optimization, leveraging diverse intricate token pruning techniques to eliminate non-crucial visual tokens. Nevertheless, these methods often unavoidably undermine the integrity of the KV cache, resulting in failures in long-text generation tasks. To this end, we conduct an in-depth investigation into the attention mechanism of the model from a new perspective, and discern that attention within more than half of all decode layers is semantically similar. Upon this finding, we contend that the attention in certain layers can be streamlined by inheriting the attention from their preceding layers. Consequently, we propose Lazy Attention, an efficient attention mechanism that enables cross-layer sharing of similar attention patterns. It ingeniously reduces layer-wise redundant computation in attention. In Lazy Attention, we develop a novel layer-shared cache, Q Cache, tailored for MLLMs, which facilitates the reuse of queries across adjacent layers. In particular, Q Cache is lightweight and fully compatible with existing inference frameworks, including Flash Attention and KV cache. Additionally, our method is highly flexible as it is orthogonal to existing token-wise techniques and can be deployed independently or combined with token pruning approaches. Empirical evaluations on multiple benchmarks demonstrate that our method can reduce KV cache usage by over 35% and achieve 1.5x throughput improvement, while sacrificing only approximately 1% of performance on various MLLMs. Compared with SOTA token-wise methods, our technique achieves superior accuracy preservation.
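A toy sketch of the cross-layer sharing idea: designated "lazy" layers skip the QK computation and inherit the attention map cached by the preceding layer. Which layers are lazy would be chosen from measured attention similarity; here the set is hard-coded, and the residual wiring is our simplification.

```python
# Toy sketch of cross-layer attention sharing ("lazy" layers reuse attention).
import torch

def attn(q, k):
    return torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)

def run_layers(x, qkv_projs, lazy=frozenset()):
    cache = None                        # attention map of the last full layer
    for i, (Wq, Wk, Wv) in enumerate(qkv_projs):
        v = x @ Wv
        if i in lazy and cache is not None:
            A = cache                   # inherit the previous attention pattern
        else:
            A = attn(x @ Wq, x @ Wk)    # compute attention as usual
            cache = A
        x = x + A @ v                   # residual keeps the toy stack stable
    return x

D, N, L = 64, 128, 6
projs = [tuple(torch.randn(D, D) * 0.02 for _ in range(3)) for _ in range(L)]
print(run_layers(torch.randn(1, N, D), projs, lazy={1, 3, 5}).shape)
```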
[CV-60] Multi-Task Learning for Robot Perception with Imbalanced Data
【Quick Read】: This paper addresses the performance degradation of multi-task learning under imbalanced label data, particularly for mobile robots where obtaining fully annotated data is costly and difficult. The key to the solution is a method that can learn effectively even when ground-truth labels are missing for some tasks, together with a teacher network that takes task outputs (such as depth) as inputs to strengthen the synergy between tasks. The study further analyzes task interactions, identifying which tasks help improve the performance of others, thereby optimizing multi-task training in resource-constrained settings.
Link: https://arxiv.org/abs/2602.01899
Authors: Ozgur Erkent
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages
Abstract:Multi-task problem solving has been shown to improve the accuracy of individual tasks, an important feature for robots, which have limited resources. However, when the number of labels differs across tasks, i.e., the data are imbalanced, problems can arise from an insufficient number of samples, and labeling is not easy for mobile robots in every environment. We propose a method that can learn tasks even in the absence of ground-truth labels for some of the tasks, and we provide a detailed analysis of the proposed method. An interesting finding concerns the interaction of the tasks: we present a methodology for finding out which tasks can improve the performance of other tasks, investigated by training the teacher network with task outputs, such as depth, as inputs. We further provide empirical evidence for training with a small amount of data, using semantic segmentation and depth estimation tasks on the NYUDv2 and Cityscapes datasets.
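The core training trick, learning when some tasks lack ground truth, can be sketched as a masked multi-task loss: each sample records which tasks are labeled, and each task loss is averaged only over its labeled samples. The task heads, losses, and mask layout below are illustrative, not the paper's exact setup.

```python
# Sketch of a masked multi-task loss for partially labeled samples.
import torch
import torch.nn.functional as F

def masked_multitask_loss(seg_logits, seg_gt, depth_pred, depth_gt, mask):
    # mask: (B, 2) booleans -> [has_segmentation_label, has_depth_label]
    losses = []
    if mask[:, 0].any():
        m = mask[:, 0]
        losses.append(F.cross_entropy(seg_logits[m], seg_gt[m]))
    if mask[:, 1].any():
        m = mask[:, 1]
        losses.append(F.l1_loss(depth_pred[m], depth_gt[m]))
    return sum(losses) / max(len(losses), 1)

B = 4
loss = masked_multitask_loss(
    torch.randn(B, 13, 32, 32), torch.randint(0, 13, (B, 32, 32)),
    torch.rand(B, 1, 32, 32), torch.rand(B, 1, 32, 32),
    torch.tensor([[1, 0], [1, 1], [0, 1], [1, 0]], dtype=torch.bool))
print(float(loss))
```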
[CV-61] ProxyImg: Towards Highly-Controllable Image Representation via Hierarchical Disentangled Proxy Embedding
【Quick Read】: This paper targets two core problems of current image representations: explicit representations (such as raster images and Gaussian primitives) and implicit ones (such as latent images) both suffer from representation redundancy that makes manual editing costly, and they lack a direct mapping from latent variables to semantic instances or parts, making fine-grained manipulation difficult and hindering efficient, controllable image and video editing. The key to the solution is a hierarchical proxy-based parametric image representation that, through a semantic-aware decomposition of the input image, disentangles semantic, geometric, and textural attributes into independent, manipulable parameter spaces. The framework builds hierarchical proxy geometry via adaptive Bezier fitting and iterative internal region subdivision and meshing, and embeds multi-scale implicit texture parameters into geometry-aware proxy nodes, enabling continuous high-fidelity reconstruction in the pixel domain and instance- or part-independent semantic editing; a locality-adaptive feature indexing mechanism further ensures spatial texture coherence and supports high-quality background completion without relying on generative models.
Link: https://arxiv.org/abs/2602.01881
Authors: Ye Chen,Yupeng Zhu,Xiongzhen Zhang,Zhewen Wan,Yingzhe Li,Wenjun Zhang,Bingbing Ni
Affiliations: Shanghai Jiao Tong University; Huawei Cloud BU
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Prevailing image representation methods, including explicit representations such as raster images and Gaussian primitives, as well as implicit representations such as latent images, either suffer from representation redundancy that leads to heavy manual editing effort, or lack a direct mapping from latent variables to semantic instances or parts, making fine-grained manipulation difficult. These limitations hinder efficient and controllable image and video editing. To address these issues, we propose a hierarchical proxy-based parametric image representation that disentangles semantic, geometric, and textural attributes into independent and manipulable parameter spaces. Based on a semantic-aware decomposition of the input image, our representation constructs hierarchical proxy geometries through adaptive Bezier fitting and iterative internal region subdivision and meshing. Multi-scale implicit texture parameters are embedded into the resulting geometry-aware distributed proxy nodes, enabling continuous high-fidelity reconstruction in the pixel domain and instance- or part-independent semantic editing. In addition, we introduce a locality-adaptive feature indexing mechanism to ensure spatial texture coherence, which further supports high-quality background completion without relying on generative models. Extensive experiments on image reconstruction and editing benchmarks, including ImageNet, OIR-Bench, and HumanEdit, demonstrate that our method achieves state-of-the-art rendering fidelity with significantly fewer parameters, while enabling intuitive, interactive, and physically plausible manipulation. Moreover, by integrating proxy nodes with Position-Based Dynamics, our framework supports real-time physics-driven animation using lightweight implicit rendering, achieving superior temporal consistency and visual realism compared with generative approaches.
[CV-62] rust but Verify: Adaptive Conditioning for Reference-Based Diffusion Super-Resolution via Implicit Reference Correlation Modeling ICLR2026
【Quick Read】: This paper addresses the unreliable correspondence between low-quality (LQ) inputs and reference (Ref) images in diffusion-based reference super-resolution (RefSR) caused by real-world degradations, which leads to over-reliance on misleading references or under-utilization of valuable cues during generation. The key to the solution is the Ada-RefSR framework, guided by a "Trust but Verify" principle and built around Adaptive Implicit Correlation Gating (AICG): learnable summary tokens distill dominant reference patterns and capture implicit correlations with LQ features, providing lightweight, adaptive regulation of reference guidance inside the attention backbone that preserves reconstruction quality while safeguarding against erroneous fusion.
Link: https://arxiv.org/abs/2602.01864
Authors: Yuan Wang,Yuhao Wan,Siming Zheng,Bo Li,Qibin Hou,Peng-Tao Jiang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages, 19 figures. Accepted to ICLR 2026
Abstract:Recent works have explored reference-based super-resolution (RefSR) to mitigate hallucinations in diffusion-based image restoration. A key challenge is that real-world degradations make correspondences between low-quality (LQ) inputs and reference (Ref) images unreliable, requiring adaptive control of reference usage. Existing methods either ignore LQ-Ref correlations or rely on brittle explicit matching, leading to over-reliance on misleading references or under-utilization of valuable cues. To address this, we propose Ada-RefSR, a single-step diffusion framework guided by a “Trust but Verify” principle: reference information is leveraged when reliable and suppressed otherwise. Its core component, Adaptive Implicit Correlation Gating (AICG), employs learnable summary tokens to distill dominant reference patterns and capture implicit correlations with LQ features. Integrated into the attention backbone, AICG provides lightweight, adaptive regulation of reference guidance, serving as a built-in safeguard against erroneous fusion. Experiments on multiple datasets demonstrate that Ada-RefSR achieves a strong balance of fidelity, naturalness, and efficiency, while remaining robust under varying reference alignment.
[CV-63] Fact or Fake? Assessing the Role of Deepfake Detectors in Multimodal Misinformation Detection
【Quick Read】: This paper asks whether deepfake detectors built for pixel-level forgery detection can effectively support the verification of joint image-text claims in multimodal misinformation detection. The central scientific challenge is that such detectors largely ignore semantic and contextual information, whereas real-world deception often arises from semantic inconsistency between image and text. The key to the solution is an evidence-driven fact-checking system that performs tool-guided retrieval via Monte Carlo Tree Search (MCTS) and deliberative inference through Multi-Agent Debate (MAD), moving away from pixel-level forgery signals toward semantic understanding and the integration of external evidence. Empirical results on two benchmark datasets show that this approach clearly outperforms hybrid systems that inject deepfake-detector outputs, demonstrating that semantic understanding and evidence-driven reasoning, not pixel-level artifacts, are the core mechanisms of multimodal misinformation verification.
Link: https://arxiv.org/abs/2602.01854
Authors: A S M Sharifuzzaman Sagar,Mohammed Bennamoun,Farid Boussaid,Naeha Sharif,Lian Xu,Shaaban Sahmoud,Ali Kishk
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In multimodal misinformation, deception usually arises not just from pixel-level manipulations in an image, but from the semantic and contextual claim jointly expressed by the image-text pair. Yet most deepfake detectors, engineered to detect pixel-level forgeries, do not account for claim-level meaning, despite their growing integration in automated fact-checking (AFC) pipelines. This raises a central scientific and practical question: Do pixel-level detectors contribute useful signal for verifying image-text claims, or do they instead introduce misleading authenticity priors that undermine evidence-based reasoning? We provide the first systematic analysis of deepfake detectors in the context of multimodal misinformation detection. Using two complementary benchmarks, MMFakeBench and DGM4, we evaluate: (1) state-of-the-art image-only deepfake detectors, (2) an evidence-driven fact-checking system that performs tool-guided retrieval via Monte Carlo Tree Search (MCTS) and engages in deliberative inference through Multi-Agent Debate (MAD), and (3) a hybrid fact-checking system that injects detector outputs as auxiliary evidence. Results across both benchmark datasets show that deepfake detectors offer limited standalone value, achieving F1 scores in the range of 0.26-0.53 on MMFakeBench and 0.33-0.49 on DGM4, and that incorporating their predictions into fact-checking pipelines consistently reduces performance by 0.04-0.08 F1 due to non-causal authenticity assumptions. In contrast, the evidence-centric fact-checking system achieves the highest performance, reaching F1 scores of approximately 0.81 on MMFakeBench and 0.55 on DGM4. Overall, our findings demonstrate that multimodal claim verification is driven primarily by semantic understanding and external evidence, and that pixel-level artifact signals do not reliably enhance reasoning over real-world image-text misinformation.
[CV-64] How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing
【Quick Read】: This paper addresses the limitation that current image editing models are predominantly text-guided and struggle to interpret visual instructions such as sketches, at odds with inherently multimodal human communication. The key to the solution is VIBE (Visual Instruction Benchmark for Image Editing), a benchmark organized along a three-level interaction hierarchy of deictic grounding, morphological manipulation, and causal reasoning, built to systematically assess how well models follow increasingly complex visual instructions; an LMM-as-a-judge evaluation framework with task-specific metrics enables scalable, fine-grained assessment, revealing both the current bottlenecks of existing models in visual instruction understanding and directions for improvement.
Link: https://arxiv.org/abs/2602.01851
Authors: Huanyu Zhang,Xuehai Bai,Chengzu Li,Chen Liang,Haochen Tian,Haodong Li,Ruichuan An,Yifan Zhang,Anna Korhonen,Zhang Zhang,Liang Wang,Tieniu Tan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL
Abstract:Recent generative models have achieved remarkable progress in image editing. However, existing systems and benchmarks remain largely text-guided. In contrast, human communication is inherently multimodal, where visual instructions such as sketches efficiently convey spatial and structural intent. To address this gap, we introduce VIBE, the Visual Instruction Benchmark for Image Editing with a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. Across these levels, we curate high-quality and diverse test cases that reflect progressively increasing complexity in visual instruction following. We further propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment. Through a comprehensive evaluation of 17 representative open-source and proprietary image editing models, we find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models. However, performance degrades markedly with increasing task difficulty even for the strongest systems, highlighting promising directions for future research.
[CV-65] WS-IMUBench: Can Weakly Supervised Methods from Audio Image and Video Be Adapted for IMU-based Temporal Action Localization?
【Quick Read】: This paper tackles the annotation bottleneck of IMU-based temporal action localization (IMU-TAL), whose progress is constrained by costly, hard-to-scale dense frame-level boundary annotations. The key contribution is WS-IMUBench, a systematic benchmark study of weakly supervised IMU-TAL (WS-IMU-TAL) under sequence-level labels only. Rather than introducing a new localization algorithm, it systematically evaluates how established weakly supervised localization paradigms from audio, image, and video transfer to IMU-TAL, training and evaluating seven representative methods on seven public IMU datasets (3,540 training runs and 7,080 inference evaluations in total). Guided by three research questions on transferability, effectiveness, and insights, the study reveals modality-dependent transfer behavior, the conditions under which weak supervision is competitive, and the dominant failure modes, and provides a reproducible benchmarking template, analysis framework, and concrete directions for improvement (e.g., IMU-specific proposal generation and boundary-aware objectives).
Link: https://arxiv.org/abs/2602.01850
Authors: Pei Li,Jiaxi Yin,Lei Ouyang,Shihan Pan,Ge Wang,Han Ding,Fei Wang
Affiliations: Xi'an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review. 28 pages, 9 figures, 6 tables
Abstract:IMU-based Human Activity Recognition (HAR) has enabled a wide range of ubiquitous computing applications, yet its dominant clip classification paradigm cannot capture the rich temporal structure of real-world behaviors. This motivates a shift toward IMU Temporal Action Localization (IMU-TAL), which predicts both action categories and their start/end times in continuous streams. However, current progress is strongly bottlenecked by the need for dense, frame-level boundary annotations, which are costly and difficult to scale. To address this bottleneck, we introduce WS-IMUBench, a systematic benchmark study of weakly supervised IMU-TAL (WS-IMU-TAL) under only sequence-level labels. Rather than proposing a new localization algorithm, we evaluate how well established weakly supervised localization paradigms from audio, image, and video transfer to IMU-TAL under only sequence-level labels. We benchmark seven representative weakly supervised methods on seven public IMU datasets, resulting in over 3,540 model training runs and 7,080 inference evaluations. Guided by three research questions on transferability, effectiveness, and insights, our findings show that (i) transfer is modality-dependent, with temporal-domain methods generally more stable than image-derived proposal-based approaches; (ii) weak supervision can be competitive on favorable datasets (e.g., with longer actions and higher-dimensional sensing); and (iii) dominant failure modes arise from short actions, temporal ambiguity, and proposal quality. Finally, we outline concrete directions for advancing WS-IMU-TAL (e.g., IMU-specific proposal generation, boundary-aware objectives, and stronger temporal reasoning). Beyond individual results, WS-IMUBench establishes a reproducible benchmarking template, datasets, protocols, and analyses, to accelerate community-wide progress toward scalable WS-IMU-TAL.
[CV-66] CloDS: Visual-Only Unsupervised Cloth Dynamics Learning in Unknown Conditions ICLR2026
【Quick Read】: This paper studies how to learn cloth dynamics without supervision from multi-view visual observations when physical properties are unknown. The core difficulties are large non-linear deformations and severe self-occlusions, together with accurately mapping 2D visual information to 3D geometry. The key to the solution is the Cloth Dynamics Splatting (CloDS) framework, a three-stage pipeline that first performs video-to-geometry grounding and then trains a dynamics model on the grounded meshes; during grounding, a dual-position opacity modulation supports bidirectional mapping between 2D observations and 3D geometry via mesh-based Gaussian splatting, jointly considering the absolute and relative positions of Gaussian components to improve robustness to complex deformations and occlusions. Experiments show that CloDS learns cloth dynamics effectively from visual data while generalizing well.
Link: https://arxiv.org/abs/2602.01844
Authors: Yuliang Zhan,Jian Li,Wenbing Huang,Yang Liu,Hao Sun
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; School of Engineering Science, University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICLR 2026
Abstract:Deep learning has demonstrated remarkable capabilities in simulating complex dynamic systems. However, existing methods require known physical properties as supervision or inputs, limiting their applicability under unknown conditions. To explore this challenge, we introduce Cloth Dynamics Grounding (CDG), a novel scenario for unsupervised learning of cloth dynamics from multi-view visual observations. We further propose Cloth Dynamics Splatting (CloDS), an unsupervised dynamic learning framework designed for CDG. CloDS adopts a three-stage pipeline that first performs video-to-geometry grounding and then trains a dynamics model on the grounded meshes. To cope with large non-linear deformations and severe self-occlusions during grounding, we introduce a dual-position opacity modulation that supports bidirectional mapping between 2D observations and 3D geometry via mesh-based Gaussian splatting in the video-to-geometry grounding stage. It jointly considers the absolute and relative position of Gaussian components. Comprehensive experimental evaluations demonstrate that CloDS effectively learns cloth dynamics from visual data while maintaining strong generalization capabilities for unseen configurations. Our code is available at this https URL. Visualization results are available at this https URL.
[CV-67] SPIRIT: Adapting Vision Foundation Models for Unified Single- and Multi-Frame Infrared Small Target Detection
【Quick Read】: This paper addresses infrared small target detection (IRSTD), where weak radiometric target signals and limited semantic cues restrict performance, and where the large gap between visible and infrared modalities makes direct transfer of vision foundation models (VFMs) unreliable: hierarchical feature aggregation submerges localized target peaks, and appearance-only cross-frame association produces spurious matches. The key to the solution is SPIRIT, a framework that adapts VFMs to IRSTD through lightweight physics-informed plug-ins: spatially, the PIFR module refines features by approximating a rank-sparsity decomposition, suppressing structured background components while enhancing sparse target-like signals; temporally, the PGMA module injects history-derived soft spatial priors into memory cross-attention to constrain cross-frame association, enabling robust video-mode detection that naturally reverts to single-frame inference when temporal context is absent.
Link: https://arxiv.org/abs/2602.01843
Authors: Qian Xu,Xi Li,Fei Gao,Jie Guo,Haojuan Yuan,Shuaipeng Fan,Mingjin Zhang
Affiliations: Xidian University; Shanghai Academy of Spaceflight Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Infrared small target detection (IRSTD) is crucial for surveillance and early-warning, with deployments spanning both single-frame analysis and video-mode tracking. A practical solution should leverage vision foundation models (VFMs) to mitigate infrared data scarcity, while adopting a memory-attention-based temporal propagation framework that unifies single- and multi-frame inference. However, infrared small targets exhibit weak radiometric signals and limited semantic cues, which differ markedly from visible-spectrum imagery. This modality gap makes direct use of semantics-oriented VFMs and appearance-driven cross-frame association unreliable for IRSTD: hierarchical feature aggregation can submerge localized target peaks, and appearance-only memory attention becomes ambiguous, leading to spurious clutter associations. To address these challenges, we propose SPIRIT, a unified and VFM-compatible framework that adapts VFMs to IRSTD via lightweight physics-informed plug-ins. Spatially, PIFR refines features by approximating rank-sparsity decomposition to suppress structured background components and enhance sparse target-like signals. Temporally, PGMA injects history-derived soft spatial priors into memory cross-attention to constrain cross-frame association, enabling robust video detection while naturally reverting to single-frame inference when temporal context is absent. Experiments on multiple IRSTD benchmarks show consistent gains over VFM-based baselines and SOTA performance.
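PIFR's rank-sparsity intuition has a classical analogue that is easy to demonstrate: approximate the structured background with a truncated SVD and keep the sparse residual as the target-like component. SPIRIT's learned module only approximates this behavior; the snippet below is the textbook version, with all parameters chosen for the toy example.

```python
# Classical rank-sparsity split: low-rank background vs. sparse target residual.
import numpy as np

def rank_sparsity_split(img, rank=1, thresh=0.3):
    U, s, Vt = np.linalg.svd(img, full_matrices=False)
    low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]      # structured background
    resid = img - low_rank
    sparse = np.where(np.abs(resid) > thresh, resid, 0)  # sparse target map
    return low_rank, sparse

bg = np.outer(np.linspace(0, 1, 64), np.linspace(1, 2, 64))  # smooth clutter
img = bg.copy()
img[30, 41] += 1.5                                            # tiny bright target
_, sparse = rank_sparsity_split(img)
print("target recovered at:",
      np.unravel_index(np.abs(sparse).argmax(), sparse.shape))
```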
[CV-68] Efficient Cross-Country Data Acquisition Strategy for ADAS via Street-View Imagery
【Quick Read】: This paper addresses the domain shift that arises when deploying ADAS/ADS across countries with different legislation, traffic infrastructure, and visual conventions, which degrades perception models; traditional cross-country data collection relies on extensive on-road driving, making it costly and inefficient to identify representative sampling locations. The key to the solution is a street-view-guided data acquisition strategy that uses publicly available imagery to select places of interest (POI) via two scoring methods: a KNN-based feature-distance approach using a vision foundation model, and a visual-attribution approach using a vision-language model. A co-located dataset is also built to support repeatable evaluation, and experiments show that traffic sign detection reaches performance comparable to random sampling while using only half of the target-domain data, substantially improving the efficiency and cost-effectiveness of cross-country model adaptation.
Link: https://arxiv.org/abs/2602.01836
Authors: Yin Wu,Daniel Slieter,Carl Esselborn,Ahmed Abouelazm,Tsung Yuan Tseng,J. Marius Zöllner
Affiliations: CARIAD SE (Germany); Karlsruhe Institute of Technology; Esslingen University of Applied Sciences; FZI Research Center for Information Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Deploying ADAS and ADS across countries remains challenging due to differences in legislation, traffic infrastructure, and visual conventions, which introduce domain shifts that degrade perception performance. Traditional cross-country data collection relies on extensive on-road driving, making it costly and inefficient to identify representative locations. To address this, we propose a street-view-guided data acquisition strategy that leverages publicly available imagery to identify places of interest (POI). Two POI scoring methods are introduced: a KNN-based feature distance approach using a vision foundation model, and a visual-attribution approach using a vision-language model. To enable repeatable evaluation, we adopt a collect-detect protocol and construct a co-located dataset by pairing the Zenseact Open Dataset with Mapillary street-view images. Experiments on traffic sign detection, a task particularly sensitive to cross-country variations in sign appearance, show that our approach achieves performance comparable to random sampling while using only half of the target-domain data. We further provide cost estimations for full-country analysis, demonstrating that large-scale street-view processing remains economically feasible. These results highlight the potential of street-view-guided data acquisition for efficient and cost-effective cross-country model adaptation.
[CV-69] Seeing Is Believing? A Benchmark for Multimodal Large Language Models on Visual Illusions and Anomalies
【Quick Read】: This paper addresses the fact that the robustness of multimodal large language models (MLLMs) to visual illusions and anomalies that defy common-sense priors remains largely unexamined, since existing evaluations rely on standard in-distribution data and cannot expose true performance in complex, atypical visual situations. The key contribution is the VIA-Bench benchmark, which, through human-in-the-loop review, provides over 1K high-quality question-answer pairs spanning six categories of visual illusions (including color, motion, gestalt, and geometric and spatial illusions) and visual anomalies, all requiring nuanced visual reasoning. Evaluating over 20 advanced MLLMs exposes significant vulnerabilities despite their strong results on standard tasks, and Chain-of-Thought (CoT) reasoning offers negligible robustness, often yielding "brittle mirages", highlighting a fundamental divergence between machine and human perception and suggesting that resolving such perceptual bottlenecks is critical for artificial general intelligence (AGI).
Link: https://arxiv.org/abs/2602.01816
Authors: Wenjin Hou,Wei Liu,Han Hu,Xiaoxiao Sun,Serena Yeung-Levy,Hehe Fan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Multimodal Large Language Models (MLLMs) have shown remarkable proficiency on general-purpose vision-language benchmarks, reaching or even exceeding human-level performance. However, these evaluations typically rely on standard in-distribution data, leaving the robustness of MLLMs largely unexamined when faced with scenarios that defy common-sense priors. To address this gap, we introduce VIA-Bench, a challenging benchmark designed to probe model performance on visual illusions and anomalies. It includes six core categories: color illusions, motion illusions, gestalt illusions, geometric and spatial illusions, general visual illusions, and visual anomalies. Through careful human-in-the-loop review, we construct over 1K high-quality question-answer pairs that require nuanced visual reasoning. Extensive evaluation of over 20 state-of-the-art MLLMs, including proprietary, open-source, and reasoning-enhanced models, uncovers significant vulnerabilities. Notably, we find that Chain-of-Thought (CoT) reasoning offers negligible robustness, often yielding ``brittle mirages’’ where the model’s logic collapses under illusory stimuli. Our findings reveal a fundamental divergence between machine and human perception, suggesting that resolving such perceptual bottlenecks is critical for the advancement of artificial general intelligence. The benchmark data and code will be released.
[CV-70] GPD: Guided Progressive Distillation for Fast and High-Quality Video Generation
【Quick Read】: This paper targets the efficiency bottleneck of diffusion models in video generation caused by the high computational cost of the denoising process: existing methods that reduce diffusion steps often incur significant quality degradation on video. The key to the solution is Guided Progressive Distillation (GPD), a training strategy in which a teacher model progressively guides a student model to operate with larger step sizes, built on two core components: an online-generated training target that reduces optimization difficulty while improving computational efficiency, and frequency-domain constraints in the latent space that preserve fine-grained details and temporal dynamics. Applied to the Wan2.1 model, GPD cuts sampling from 48 to 6 steps while maintaining competitive visual quality on VBench, with clear advantages over existing distillation methods in pipeline simplicity and quality preservation.
Link: https://arxiv.org/abs/2602.01814
Authors: Xiao Liang,Yunzhu Zhang,Linchao Zhu
Affiliations: Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Diffusion models have achieved remarkable success in video generation; however, the high computational cost of the denoising process remains a major bottleneck. Existing approaches have shown promise in reducing the number of diffusion steps, but they often suffer from significant quality degradation when applied to video generation. We propose Guided Progressive Distillation (GPD), a framework that accelerates the diffusion process for fast and high-quality video generation. GPD introduces a novel training strategy in which a teacher model progressively guides a student model to operate with larger step sizes. The framework consists of two key components: (1) an online-generated training target that reduces optimization difficulty while improving computational efficiency, and (2) frequency-domain constraints in the latent space that promote the preservation of fine-grained details and temporal dynamics. Applied to the Wan2.1 model, GPD reduces the number of sampling steps from 48 to 6 while maintaining competitive visual quality on VBench. Compared with existing distillation methods, GPD demonstrates clear advantages in both pipeline simplicity and quality preservation.
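A frequency-domain constraint on latents can be sketched as an FFT-space discrepancy between student and teacher. Taking the transform over the temporal and spatial axes and penalizing the magnitude of the difference, as below, is our guess at a minimal form; GPD's exact objective may weight frequencies differently.

```python
# Sketch of a frequency-domain loss on video latents (formulation is assumed).
import torch

def freq_loss(student, teacher):
    # latents: (B, T, C, H, W); FFT over time (dim 1) and space (dims 3, 4)
    Fs = torch.fft.fftn(student, dim=(1, 3, 4))
    Ft = torch.fft.fftn(teacher, dim=(1, 3, 4))
    return (Fs - Ft).abs().mean()        # penalize spectral discrepancy

s = torch.randn(1, 6, 4, 16, 16)
t = torch.randn(1, 6, 4, 16, 16)
print(float(freq_loss(s, t)))
```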
[CV-71] LDRNet: Large Deformation Registration Model for Chest CT Registration
【Quick Read】: This paper addresses the difficulty of accurately registering chest CT images under large deformation: compared with brain image registration, chest images exhibit larger deformations, more complex backgrounds, and region overlap. The key to the solution is LDRNet, a fast unsupervised deep learning method that first predicts a low-resolution registration field and then refines it from coarse to fine through two novel components: a refine block that refines the registration field at multiple resolutions, and a rigid block that learns a transformation matrix from high-level features. This improves both registration accuracy and speed, achieving state-of-the-art performance on the public SegTHOR dataset and a private dataset.
Link: https://arxiv.org/abs/2602.01812
Authors: Cheng Wang,Qiyu Gao,Fandong Zhang,Shu Zhang,Yizhou Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Most deep learning based medical image registration algorithms focus on brain image registration tasks. Compared with brain registration, chest CT registration involves larger deformation, more complex backgrounds, and region overlap. In this paper, we propose a fast unsupervised deep learning method, LDRNet, for large deformation image registration of chest CT images. We first predict a coarse resolution registration field, then refine it from coarse to fine. We propose two innovative technical components: 1) a refine block that is used to refine the registration field at different resolutions, 2) a rigid block that is used to learn a transformation matrix from high-level features. We train and evaluate our model on a private dataset and the public dataset SegTHOR. We compare our performance with state-of-the-art traditional registration methods as well as the deep learning registration models VoxelMorph, RCN, and LapIRN. The results demonstrate that our model achieves state-of-the-art performance for large deformation image registration and is much faster.
[CV-72] FlowBypass: Rectified Flow Trajectory Bypass for Training-Free Image Editing
【Quick Read】: This paper addresses an inherent trade-off in training-free image editing that relies on inversion-reconstruction trajectories: longer trajectories accumulate errors and harm fidelity, while shorter ones cannot guarantee sufficient alignment with the edit prompt; existing remedies use backbone-specific feature manipulations with limited general applicability. The key to the solution is FlowBypass, an analytical framework grounded in Rectified Flow that constructs a bypass directly connecting the inversion and reconstruction trajectories, mitigating error accumulation without feature manipulation; a formal derivation of the two trajectories yields an approximate bypass formulation and its numerical solution, enabling seamless trajectory transitions.
Link: https://arxiv.org/abs/2602.01805
Authors: Menglin Han,Zhangkai Ni
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Training-free image editing has attracted increasing attention for its efficiency and independence from training data. However, existing approaches predominantly rely on inversion-reconstruction trajectories, which impose an inherent trade-off: longer trajectories accumulate errors and compromise fidelity, while shorter ones fail to ensure sufficient alignment with the edit prompt. Previous attempts to address this issue typically employ backbone-specific feature manipulations, limiting general applicability. To address these challenges, we propose FlowBypass, a novel and analytical framework grounded in Rectified Flow that constructs a bypass directly connecting inversion and reconstruction trajectories, thereby mitigating error accumulation without relying on feature manipulations. We provide a formal derivation of two trajectories, from which we obtain an approximate bypass formulation and its numerical solution, enabling seamless trajectory transitions. Extensive experiments demonstrate that FlowBypass consistently outperforms state-of-the-art image editing methods, achieving stronger prompt alignment while preserving high-fidelity details in irrelevant regions.
[CV-73] Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention FAST
【Quick Read】: This paper addresses the growing KV cache of autoregressive video diffusion models at inference time, which increases latency and GPU memory usage, restricts the usable temporal context, and harms long-range consistency. The key to the solution is a unified, training-free attention framework of three parts: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens with fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys via a lightweight ANN. Together they reduce attention computation, compute cost, and memory while remaining compatible with existing autoregressive diffusion backbones and world models, achieving up to x5-x10 end-to-end speedups on long rollouts with stable throughput and nearly constant peak GPU memory.
Link: https://arxiv.org/abs/2602.01801
Authors: Dvir Samuel,Issar Tzachor,Matan Levy,Michael Green,Gal Chechik,Rami Ben-Ari
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL
Abstract:Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory, which in turn restricts usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross-attention over long prompts where only a small subset of tokens matters per frame. Building on these observations, we propose a unified, training-free attention framework for autoregressive diffusion: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys, also using a lightweight ANN. Together, these modules reduce attention, compute, and memory and are compatible with existing autoregressive diffusion backbones and world models. Experiments demonstrate up to x5–x10 end-to-end speedups while preserving near-identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and suffer from increasing memory usage.
[CV-74] Spatio-Temporal Transformers for Long-Term NDVI Forecasting
【Quick Read】: This paper addresses the difficulty of accurate forecasting from long-term satellite image time series (SITS) over heterogeneous landscapes such as the Mediterranean, where complex spatial patterns, seasonal variation, and multi-decade environmental change interact across scales. The key to the solution is the Spatio-Temporal Transformer for Long Term Forecasting (STT-LTF), a unified transformer that jointly models spatial context and temporal dependencies, incorporating multi-scale spatial patch embeddings, cyclical temporal encoding, and geographic coordinates, and trained with self-supervised strategies including spatial masking, temporal masking, and horizon sampling. It predicts arbitrary future time points directly, avoiding the error accumulation of autoregressive approaches, and its robustness to irregular temporal sampling and variable prediction horizons makes it well suited to analyzing complex regions undergoing rapid ecological transitions.
Link: https://arxiv.org/abs/2602.01799
Authors: Ido Faran,Nathan S. Netanyahu,Maxim Shoshany
Affiliations: Bar-Ilan University; College of Law and Business; Technion Israel Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Long-term satellite image time series (SITS) analysis in heterogeneous landscapes faces significant challenges, particularly in Mediterranean regions where complex spatial patterns, seasonal variations, and multi-decade environmental changes interact across different scales. This paper presents the Spatio-Temporal Transformer for Long Term Forecasting (STT-LTF ), an extended framework that advances beyond purely temporal analysis to integrate spatial context modeling with temporal sequence prediction. STT-LTF processes multi-scale spatial patches alongside temporal sequences (up to 20 years) through a unified transformer architecture, capturing both local neighborhood relationships and regional climate influences. The framework employs comprehensive self-supervised learning with spatial masking, temporal masking, and horizon sampling strategies, enabling robust model training from 40 years of unlabeled Landsat imagery. Unlike autoregressive approaches, STT-LTF directly predicts arbitrary future time points without error accumulation, incorporating spatial patch embeddings, cyclical temporal encoding, and geographic coordinates to learn complex dependencies across heterogeneous Mediterranean ecosystems. Experimental evaluation on Landsat data (1984-2024) demonstrates that STT-LTF achieves a Mean Absolute Error (MAE) of 0.0328 and R^2 of 0.8412 for next-year predictions, outperforming traditional statistical methods, CNN-based approaches, LSTM networks, and standard transformers. The framework’s ability to handle irregular temporal sampling and variable prediction horizons makes it particularly suitable for analysis of heterogeneous landscapes experiencing rapid ecological transitions.
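Cyclical temporal encoding is the most self-contained piece and is easy to sketch: the acquisition date is mapped to sin/cos coordinates on a yearly circle so that late December and early January stay close, with a slow year channel for long-term trends. The exact feature layout below is our assumption.

```python
# Sketch of cyclical temporal encoding for irregularly sampled imagery dates.
import numpy as np

def encode_time(year, day_of_year):
    phase = 2 * np.pi * day_of_year / 365.25
    # [seasonal sin, seasonal cos, normalized long-term trend since 1984]
    return np.array([np.sin(phase), np.cos(phase), (year - 1984) / 40.0])

print(encode_time(1999, 360))   # late December ...
print(encode_time(2000, 5))     # ... lands close to early January on the circle
```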
[CV-75] Automated Discontinuity Set Characterisation in Enclosed Rock Face Point Clouds Using Single-Shot Filtering and Cyclic Orientation Transformation
【Quick Read】: This paper addresses the difficulty of automatically identifying and characterising discontinuity sets in fully enclosed rock faces of underground mines, in particular how to extract discontinuity sets efficiently and accurately from UAV or mobile laser-scanning point clouds in complex real-world settings. The key to the solution is a three-stage pipeline: first, single-shot filtering uses a signal-processing technique to isolate planar regions while suppressing noise and high-curvature artefacts in a single pass; second, an innovative cyclic orientation transformation scheme maps polar dip angle and dip direction into Cartesian space, overcoming the inaccurate representation of orientation data in conventional Cartesian clustering; finally, hierarchical clustering groups the transformed orientations into discontinuity sets without requiring a preset number of clusters, adapting to varying density distributions. On real mine-stope data the method achieves mean absolute errors of only 1.95° in dip angle and 2.20° in dip direction, clearly outperforming existing automated discontinuity-mapping techniques.
Link: https://arxiv.org/abs/2602.01783
Authors: Dibyayan Patra,Pasindu Ranasinghe,Bikram Banerjee,Simit Raval
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Characterisation of structural discontinuity sets in exposed rock faces of underground mine cavities is essential for assessing rock-mass stability, excavation safety, and operational efficiency. UAV and other mobile laser-scanning techniques provide efficient means of collecting point clouds from rock faces. However, the development of a robust and efficient approach for automatic characterisation of discontinuity sets in real-world scenarios, like fully enclosed rock faces in cavities, remains an open research problem. In this study, a new approach is proposed for automatic discontinuity set characterisation that uses a single-shot filtering strategy, an innovative cyclic orientation transformation scheme and a hierarchical clustering technique. The single-shot filtering step isolates planar regions while robustly suppressing noise and high-curvature artefacts in one pass using a signal-processing technique. To address the limitations of Cartesian clustering on polar orientation data, a cyclic orientation transformation scheme is developed, enabling accurate representation of dip angle and dip direction in Cartesian space. The transformed orientations are then characterised into sets using a hierarchical clustering technique, which handles varying density distributions and identifies clusters without requiring user-defined set numbers. The accuracy of the method is validated on real-world mine stope and against ground truth obtained using manually handpicked discontinuity planes identified with the Virtual Compass tool, as well as widely used automated structure mapping techniques. The proposed approach outperforms the other techniques by exhibiting the lowest mean absolute error in estimating discontinuity set orientations in real-world stope data with errors of 1.95° and 2.20° in nominal dip angle and dip direction, respectively, and dispersion errors lying below 3°.
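The motivation for the cyclic orientation transformation can be shown with the standard pole-vector mapping: representing each plane by the unit normal computed from dip angle and dip direction removes the 0/360 degree wrap-around that breaks naive clustering on raw angles. The paper's scheme is more elaborate than this basic mapping.

```python
# Standard pole-vector mapping from (dip angle, dip direction) to Cartesian.
import numpy as np

def orientation_to_vector(dip_deg, dipdir_deg):
    dip = np.radians(dip_deg)
    dd = np.radians(dipdir_deg)
    # unit normal (pole) of the plane, z pointing up
    return np.array([np.sin(dip) * np.sin(dd),
                     np.sin(dip) * np.cos(dd),
                     np.cos(dip)])

a = orientation_to_vector(60, 359)
b = orientation_to_vector(60, 1)   # nearly the same plane despite a 358 deg gap
print(np.degrees(np.arccos(np.clip(a @ b, -1, 1))))  # small angular distance
```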
[CV-76] DDP-WM: Disentangled Dynamics Prediction for Efficient World Models
【Quick Read】: This paper addresses the heavy computational overhead of existing dense Transformer-based world models, which hinders real-time deployment for autonomous robotic planning. The key to the solution is DDP-WM, a world model built on the principle of Disentangled Dynamics Prediction (DDP): latent state evolution in observed scenes is decomposed into sparse primary dynamics driven by physical interactions and secondary, context-driven background updates. The architecture combines efficient historical processing with dynamic localization to isolate the primary dynamics and uses a cross-attention mechanism for background updates, achieving optimized resource allocation and a smooth, planner-friendly optimization landscape.
Link: https://arxiv.org/abs/2602.01780
Authors: Shicheng Yin,Kaixuan Yin,Weixing Chen,Yang Liu,Guanbin Li,Liang Lin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Codes will be available at this https URL
Abstract:World models are essential for autonomous robotic planning. However, the substantial computational overhead of existing dense Transformer-based models significantly hinders real-time deployment. To address this efficiency-performance bottleneck, we introduce DDP-WM, a novel world model centered on the principle of Disentangled Dynamics Prediction (DDP). We hypothesize that latent state evolution in observed scenes is heterogeneous and can be decomposed into sparse primary dynamics driven by physical interactions and secondary context-driven background updates. DDP-WM realizes this decomposition through an architecture that integrates efficient historical processing with dynamic localization to isolate primary dynamics. By employing a cross-attention mechanism for background updates, the framework optimizes resource allocation and provides a smooth optimization landscape for planners. Extensive experiments demonstrate that DDP-WM achieves significant efficiency and performance across diverse tasks, including navigation, precise tabletop manipulation, and complex deformable or multi-body interactions. Specifically, on the challenging Push-T task, DDP-WM achieves an approximately 9 times inference speedup and improves the MPC success rate from 90% to 98% compared to state-of-the-art dense models. The results establish a promising path for developing efficient, high-fidelity world models. Codes will be available at this https URL.
[CV-77] GDPR-Compliant Person Recognition in Industrial Environments Using MEMS-LiDAR and Hybrid Data
【Quick Read】: This paper addresses the reliability of detecting unauthorized individuals in safety-critical industrial indoor spaces, where conventional deep-learning vision methods are sensitive to lighting and visibility, tend to violate privacy regulations such as the GDPR, and require large amounts of manually annotated data. The key to the solution is a privacy-compliant approach based on MEMS-LiDAR that captures only anonymized 3D point clouds, avoiding personal identification features, and supplements real LiDAR recordings with synthetic scenes generated in the CARLA simulation framework: the hybrid data improves average precision by 44 percentage points over training on real data alone while halving the manual annotation effort, yielding an accurate, cost-efficient, GDPR-compliant person detection system.
Link: https://arxiv.org/abs/2602.01764
Authors: Dennis Basile,Dennis Sprute,Helene Dörksen,Holger Flatt
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at 19th CIRP Conference on Intelligent Computation in Manufacturing Engineering
Abstract:The reliable detection of unauthorized individuals in safety-critical industrial indoor spaces is crucial to avoid plant shutdowns, property damage, and personal hazards. Conventional vision-based methods that use deep-learning approaches for person recognition provide image information but are sensitive to lighting and visibility conditions and often violate privacy regulations, such as the General Data Protection Regulation (GDPR) in the European Union. Typically, detection systems based on deep learning require annotated data for training. Collecting and annotating such data, however, is highly time-consuming and due to manual treatments not necessarily error free. Therefore, this paper presents a privacy-compliant approach based on Micro-Electro-Mechanical Systems LiDAR (MEMS-LiDAR), which exclusively captures anonymized 3D point clouds and avoids personal identification features. To compensate for the large amount of time required to record real LiDAR data and for post-processing and annotation, real recordings are augmented with synthetically generated scenes from the CARLA simulation framework. The results demonstrate that the hybrid data improves the average precision by 44 percentage points compared to a model trained exclusively with real data while reducing the manual annotation effort by 50 %. Thus, the proposed approach provides a scalable, cost-efficient alternative to purely real-data-based methods and systematically shows how synthetic LiDAR data can combine high performance in person detection with GDPR compliance in an industrial environment.
[CV-78] MagicFuse: Single Image Fusion for Visual and Semantic Reinforcement
【Quick Read】: This paper focuses on a highly practical question: how to retain the benefits of multi-modal image fusion under harsh conditions when only visible imaging sensors are available, i.e., deriving a comprehensive cross-spectral scene representation from a single low-quality visible image. The key to the solution is MagicFuse, a novel single-image fusion framework that extends conventional data-level fusion to the knowledge level: a diffusion-based intra-spectral knowledge reinforcement branch mines scene information obscured in the visible spectrum, while a cross-spectral knowledge generation branch learns how thermal radiation distribution patterns transfer to the infrared spectrum. On top of these, a multi-domain knowledge fusion branch integrates the probabilistic noise from the two diffusion streams, obtaining a cross-spectral scene representation through successive sampling, with visual and semantic constraints ensuring the result satisfies both human observation and downstream semantic decision-making.
Link: https://arxiv.org/abs/2602.01760
Authors: Hao Zhang,Yanping Zha,Zizhuo Li,Meiqi Gong,Jiayi Ma
Affiliations: Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This paper focuses on a highly practical scenario: how to continue benefiting from the advantages of multi-modal image fusion under harsh conditions when only visible imaging sensors are available. To achieve this goal, we propose a novel concept of single-image fusion, which extends conventional data-level fusion to the knowledge level. Specifically, we develop MagicFuse, a novel single image fusion framework capable of deriving a comprehensive cross-spectral scene representation from a single low-quality visible image. MagicFuse first introduces an intra-spectral knowledge reinforcement branch and a cross-spectral knowledge generation branch based on the diffusion models. They mine scene information obscured in the visible spectrum and learn thermal radiation distribution patterns transferred to the infrared spectrum, respectively. Building on them, we design a multi-domain knowledge fusion branch that integrates the probabilistic noise from the diffusion streams of these two branches, from which a cross-spectral scene representation can be obtained through successive sampling. Then, we impose both visual and semantic constraints to ensure that this scene representation can satisfy human observation while supporting downstream semantic decision-making. Extensive experiments show that our MagicFuse achieves visual and semantic representation performance comparable to or even better than state-of-the-art fusion methods with multi-modal inputs, despite relying solely on a single degraded visible image.
[CV-79] Mind-Brush: Integrating Agentic Cognitive Search and Reasoning into Image Generation
[Quick Read]: This paper addresses the limitations of current text-to-image models in understanding implicit user intent, handling tasks that require complex knowledge reasoning, and adapting to the evolving dynamics of the real world. Existing models are mostly static text-to-pixel decoders that struggle to capture deep semantics and cross-domain knowledge; even unified understanding-generation models cannot complete complex reasoning tasks within a single framework and, constrained by fixed internal priors, cannot adapt to new concepts or real-time information. The key to the solution is Mind-Brush, a unified agentic framework that recasts generation as a dynamic "think-research-create" workflow: it actively retrieves multimodal evidence to ground out-of-distribution concepts and employs reasoning tools to resolve implicit visual constraints, marking a paradigm shift from static generation to dynamic, knowledge-driven synthesis.
Link: https://arxiv.org/abs/2602.01756
Authors: Jun He, Junyan Ye, Zilong Huang, Dongzhi Jiang, Chenjue Zhang, Leqi Zhu, Renrui Zhang, Xiang Zhang, Weijia Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 36 pages, 24 figures
Abstract:While text-to-image generation has achieved unprecedented fidelity, the vast majority of existing models function fundamentally as static text-to-pixel decoders. Consequently, they often fail to grasp implicit user intentions. Although emerging unified understanding-generation models have improved intent comprehension, they still struggle to accomplish tasks involving complex knowledge reasoning within a single model. Moreover, constrained by static internal priors, these models remain unable to adapt to the evolving dynamics of the real world. To bridge these gaps, we introduce Mind-Brush, a unified agentic framework that transforms generation into a dynamic, knowledge-driven workflow. Simulating a human-like ‘think-research-create’ paradigm, Mind-Brush actively retrieves multimodal evidence to ground out-of-distribution concepts and employs reasoning tools to resolve implicit visual constraints. To rigorously evaluate these capabilities, we propose Mind-Bench, a comprehensive benchmark comprising 500 distinct samples spanning real-time news, emerging concepts, and domains such as mathematical and Geo-Reasoning. Extensive experiments demonstrate that Mind-Brush significantly enhances the capabilities of unified models, realizing a zero-to-one capability leap for the Qwen-Image baseline on Mind-Bench, while achieving superior results on established benchmarks like WISE and RISE.
[CV-80] Spot-Wise Smart Parking: An Edge-Enabled Architecture with YOLOv11 and Digital Twin Integration
[Quick Read]: This paper addresses the limited spot-level accuracy of conventional smart-parking systems, which restricts their support for advanced applications such as fine-grained scheduling and digital-twin construction. The key to the solution is a spot-wise monitoring strategy based on distance-aware matching with spatial tolerance, enhanced by an adaptive bounding-box partitioning method for challenging spaces; on a resource-constrained edge device the approach reaches 98.80% balanced accuracy while keeping an 8-second inference time, substantially improving the practical value of a YOLOv11m model. The paper also introduces two new components: a Digital Shadow that visually represents parking-lot entities as a basis for evolving toward a full Digital Twin, and an application support server built from a repurposed TV box that enables scalable communication among cloud services, the parking totem, and a bot, promoting hardware reuse and sustainability.
Link: https://arxiv.org/abs/2602.01754
Authors: Gustavo P. C. P. da Luz, Alvaro M. Aspilcueta Narvaez, Tiago Godoi Bannwart, Gabriel Massuyoshi Sato, Luis Fernando Gomez Gonzalez, Juliana Freitag Borin
Affiliations: University of Campinas (UNICAMP)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to Journal of Internet Services and Applications, 27 pages, 20 figures, 3 tables
Abstract:Smart parking systems help reduce congestion and minimize users’ search time, thereby contributing to smart city adoption and enhancing urban mobility. In previous works, we presented a system developed on a university campus to monitor parking availability by estimating the number of free spaces from vehicle counts within a region of interest. Although this approach achieved good accuracy, it restricted the system’s ability to provide spot-level insights and support more advanced applications. To overcome this limitation, we extend the system with a spot-wise monitoring strategy based on a distance-aware matching method with spatial tolerance, enhanced through an Adaptive Bounding Box Partitioning method for challenging spaces. The proposed approach achieves a balanced accuracy of 98.80% while maintaining an inference time of 8 seconds on a resource-constrained edge device, enhancing the capabilities of YOLOv11m, a model that has a size of 40.5 MB. In addition, two new components were introduced: (i) a Digital Shadow that visually represents parking lot entities as a base to evolve to a full Digital Twin, and (ii) an application support server based on a repurposed TV box. The latter not only enables scalable communication among cloud services, the parking totem, and a bot that provides detailed spot occupancy statistics, but also promotes hardware reuse as a step towards greater sustainability.
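To make the distance-aware matching step concrete, here is a minimal Python sketch: detected vehicle centers are assigned to the nearest known spot center, and a spot is marked occupied only within a pixel tolerance. The coordinates, tolerance value, and function name are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def match_detections_to_spots(det_centers, spot_centers, tolerance_px=40.0):
    """Mark a spot as occupied if the nearest vehicle detection lies
    within a spatial tolerance of the spot center."""
    occupied = np.zeros(len(spot_centers), dtype=bool)
    for det in det_centers:
        dists = np.linalg.norm(spot_centers - det, axis=1)  # distance to every spot
        nearest = int(np.argmin(dists))
        if dists[nearest] <= tolerance_px:                  # spatial tolerance check
            occupied[nearest] = True
    return occupied

spots = np.array([[100, 200], [180, 200], [260, 200]], dtype=float)  # spot centers (px)
dets = np.array([[105, 195], [400, 400]], dtype=float)   # YOLO box centers (px)
print(match_detections_to_spots(dets, spots))             # [ True False False]
```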
[CV-81] ObjEmbed: Towards Universal Multimodal Object Embeddings
[Quick Read]: This paper tackles fine-grained alignment between objects and textual descriptions in vision-language understanding, i.e., precisely matching image regions to specific phrases. Existing multimodal embedding models align global images with text but are weak at region-to-phrase alignment. The key to the solution is ObjEmbed, which decomposes an input image into multiple regional embeddings (one per object) plus global embeddings, and represents each region with two complementary embeddings: an object embedding for semantic matching and an IoU embedding that predicts localization quality. The final object matching score combines semantic similarity with the predicted IoU, markedly improving retrieval accuracy while supporting both region-level and image-level tasks, with the whole image encoded in a single forward pass for efficiency and versatility.
Link: https://arxiv.org/abs/2602.01753
Authors: Shenghao Fu, Yukun Su, Fengyun Rao, Jing Lyu, Xiaohua Xie, Wei-Shi Zheng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Aligning objects with corresponding textual descriptions is a fundamental challenge and a realistic requirement in vision-language understanding. While recent multimodal embedding models excel at global image-text alignment, they often struggle with fine-grained alignment between image regions and specific phrases. In this work, we present ObjEmbed, a novel MLLM embedding model that decomposes the input image into multiple regional embeddings, each corresponding to an individual object, along with global embeddings. It supports a wide range of visual understanding tasks like visual grounding, local image retrieval, and global image retrieval. ObjEmbed enjoys three key properties: (1) Object-Oriented Representation: It captures both semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality. The final object matching score combines semantic similarity with the predicted IoU, enabling more accurate retrieval. (2) Versatility: It seamlessly handles both region-level and image-level tasks. (3) Efficient Encoding: All objects in an image, along with the full image, are encoded in a single forward pass for high efficiency. Superior performance on 18 diverse benchmarks demonstrates its strong semantic discrimination.
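The abstract states that the final matching score combines semantic similarity with the predicted IoU but does not give the formula; the multiplicative fusion below is therefore an assumption, sketched in Python with an illustrative embedding dimension.

```python
import torch
import torch.nn.functional as F

def object_match_score(query_emb, object_embs, predicted_ious):
    """Rank candidate regions by cosine similarity to the query phrase,
    reweighted by each region's predicted localization quality (IoU)."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), object_embs, dim=-1)
    return sims * predicted_ious  # high only if semantics and box are both good

query = torch.randn(256)       # query-phrase embedding (dimension is illustrative)
objects = torch.randn(5, 256)  # per-object embeddings from one image
ious = torch.rand(5)           # outputs of the IoU-embedding head
print(object_match_score(query, objects, ious).argmax().item())  # best region index
```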
[CV-82] Tail-Aware Post-Training Quantization for 3D Geometry Models
[Quick Read]: This paper addresses the difficulty of deploying 3D geometry learning models on resource-constrained platforms, in particular the failure of conventional post-training quantization (PTQ) methods to transfer to 3D models because of intricate feature distributions and prohibitive calibration overhead. The key to the solution is TAPTQ, a Tail-Aware Post-Training Quantization pipeline for 3D geometric learning: (1) a progressive coarse-to-fine calibration construction strategy that achieves statistical purity and geometric representativeness at a small data scale; (2) a reformulation of the quantization interval search as an optimization problem solved with a ternary-search-based solver, reducing complexity from O(N) to O(log N); and (3) module-wise compensation guided by a Tail Relative Error (TRE) metric, which adaptively identifies and rectifies quantization-error accumulation in modules sensitive to long-tailed activation outliers.
Link: https://arxiv.org/abs/2602.01741
Authors: Sicheng Pan, Chen Tang, Shuzhao Xie, Ke Yang, Weixiang Zhang, Jiawei Li, Bin Chen, Shu-Tao Xia, Zhi Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The burgeoning complexity and scale of 3D geometry models pose significant challenges for deployment on resource-constrained platforms. While Post-Training Quantization (PTQ) enables efficient inference without retraining, conventional methods, primarily optimized for 2D Vision Transformers, fail to transfer effectively to 3D models due to intricate feature distributions and prohibitive calibration overhead. To address these challenges, we propose TAPTQ, a Tail-Aware Post-Training Quantization pipeline specifically engineered for 3D geometric learning. Our contribution is threefold: (1) To overcome the data-scale bottleneck in 3D datasets, we develop a progressive coarse-to-fine calibration construction strategy that constructs a highly compact subset to achieve both statistical purity and geometric representativeness. (2) We reformulate the quantization interval search as an optimization problem and introduce a ternary-search-based solver, reducing the computational complexity from $\mathcal{O}(N)$ to $\mathcal{O}(\log N)$ for accelerated deployment. (3) To mitigate quantization error accumulation, we propose TRE-Guided Module-wise Compensation, which utilizes a Tail Relative Error (TRE) metric to adaptively identify and rectify distortions in modules sensitive to long-tailed activation outliers. Extensive experiments on the VGGT and Pi3 benchmarks demonstrate that TAPTQ consistently outperforms state-of-the-art PTQ methods in accuracy while significantly reducing calibration time. The code will be released soon.
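Ternary search applies here because the quantization error, viewed as a function of the clipping threshold, is assumed unimodal. Below is a minimal Python sketch with symmetric 8-bit uniform quantization; the error metric, search bounds, and iteration count are illustrative, and this is not the paper's actual solver.

```python
import numpy as np

def quant_error(x, clip):
    """MSE after symmetric uniform 8-bit quantization with threshold `clip`."""
    scale = clip / 127.0
    q = np.clip(np.round(x / scale), -127, 127) * scale
    return float(np.mean((x - q) ** 2))

def ternary_search_clip(x, lo=1e-3, hi=None, iters=40):
    """Find the clipping threshold minimizing quantization error in
    O(log N) evaluations, assuming the error is unimodal in the threshold."""
    hi = float(np.abs(x).max()) if hi is None else hi
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if quant_error(x, m1) < quant_error(x, m2):
            hi = m2  # minimum lies left of m2
        else:
            lo = m1  # minimum lies right of m1
    return 0.5 * (lo + hi)

x = np.random.randn(10_000) * 0.5  # stand-in for a long-tailed activation tensor
print(ternary_search_clip(x))
```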
[CV-83] MACD: Model-Aware Contrastive Decoding via Counterfactual Data
[Quick Read]: This paper addresses hallucination in video language models (Video-LLMs), which tend to generate plausible but visually ungrounded content when visual evidence is weak, ambiguous, or biased. Existing approaches such as contrastive decoding (CD) rely on random perturbations to construct contrastive data, which makes it hard to control the visual cues that drive hallucination or to align with model weaknesses. The key to the solution is Model-aware Counterfactual Data based Contrastive Decoding (MACD): the Video-LLM's own feedback identifies the object regions most responsible for hallucination, from which targeted object-level counterfactual inputs are generated instead of arbitrary frame- or time-level modifications; these model-aware counterfactuals are then integrated into CD to enforce evidence-grounded token selection during decoding, substantially reducing hallucination while maintaining or improving task accuracy, especially for small, occluded, or co-occurring objects.
Link: https://arxiv.org/abs/2602.01740
Authors: Qixin Xiao, Kun Zhou
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Video language models (Video-LLMs) are prone to hallucinations, often generating plausible but ungrounded content when visual evidence is weak, ambiguous, or biased. Existing decoding methods, such as contrastive decoding (CD), rely on random perturbations to construct contrastive data for mitigating hallucination patterns. However, such a way is hard to control the visual cues that drive hallucination or well align with model weaknesses. We propose Model-aware Counterfactual Data based Contrastive Decoding (MACD), a new inference strategy that combines model-guided counterfactual construction with decoding. Our approach uses the Video-LLM’s own feedback to identify object regions most responsible for hallucination, generating targeted counterfactual inputs at the object level rather than arbitrary frame or temporal modifications. These model-aware counterfactual data is then integrated into CD to enforce evidence-grounded token selection during decoding. Experiments on EventHallusion, MVBench, Perception-test and Video-MME show that MACD consistently reduces hallucination while maintaining or improving task accuracy across diverse Video-LLMs, including Qwen and InternVL families. The method is especially effective in challenging scenarios involving small, occluded, or co-occurring objects. Our code and data will be publicly released.
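For background, contrastive decoding typically rescales logits so that tokens still supported by the perturbed (here, counterfactual) input are penalized. A minimal one-step sketch of that common formulation follows; MACD's exact scoring rule may differ, and `alpha` is an illustrative hyperparameter.

```python
import torch

def contrastive_decode_step(logits_orig, logits_counterfactual, alpha=1.0):
    """One greedy decoding step: amplify tokens grounded in the original
    video and penalize tokens the counterfactual input still supports."""
    contrast = (1.0 + alpha) * logits_orig - alpha * logits_counterfactual
    return torch.argmax(contrast, dim=-1)

vocab_size = 32000
logits_orig = torch.randn(vocab_size)   # conditioned on the real video
logits_cf = torch.randn(vocab_size)     # conditioned on the counterfactual video
print(contrastive_decode_step(logits_orig, logits_cf).item())
```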
[CV-84] Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models
[Quick Read]: This paper addresses the dramatic performance collapse of AI-generated image (AIGI) detectors in realistic, in-the-wild settings despite near-perfect accuracy on curated benchmarks. The key to the solution is to abandon complex architectural designs in favor of a simple linear classifier trained on the frozen features of modern vision foundation models such as Perception Encoder, MetaCLIP 2, and DINOv3. Experiments show this baseline matches or exceeds specialized detectors on traditional benchmarks and outperforms them by over 30% accuracy on unseen generators and challenging in-the-wild datasets, demonstrating strong generalization. The authors attribute this superiority to an emergent property of large-scale pre-training on data containing abundant synthetic content: vision-language models internalize an explicit semantic concept of "forgery", while self-supervised models implicitly acquire discriminative forensic features from their pre-training data.
Link: https://arxiv.org/abs/2602.01738
Authors: Yue Zhou, Xinan He, Kaiqing Lin, Bing Fan, Feng Ding, Bin Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:While specialized detectors for AI-Generated Images (AIGI) achieve near-perfect accuracy on curated benchmarks, they suffer from a dramatic performance collapse in realistic, in-the-wild scenarios. In this work, we demonstrate that simplicity prevails over complex architectural designs. A simple linear classifier trained on the frozen features of modern Vision Foundation Models, including Perception Encoder, MetaCLIP 2, and DINOv3, establishes a new state-of-the-art. Through a comprehensive evaluation spanning traditional benchmarks, unseen generators, and challenging in-the-wild distributions, we show that this baseline not only matches specialized detectors on standard benchmarks but also decisively outperforms them on in-the-wild datasets, boosting accuracy by striking margins of over 30%. We posit that this superior capability is an emergent property driven by the massive scale of pre-training data containing synthetic content. We trace the source of this capability to two distinct manifestations of data exposure: Vision-Language Models internalize an explicit semantic concept of forgery, while Self-Supervised Learning models implicitly acquire discriminative forensic features from the pretraining data. However, we also reveal persistent limitations: these models suffer from performance degradation under recapture and transmission, remain blind to VAE reconstruction and localized editing. We conclude by advocating for a paradigm shift in AI forensics, moving from overfitting on static benchmarks to harnessing the evolving world knowledge of foundation models for real-world reliability.
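The whole detector in this paper is essentially a linear probe. Below is a minimal sketch with random stand-in features; in practice the arrays would hold frozen backbone embeddings (e.g., from DINOv3 run without gradients), and the shapes here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
feats_train = rng.normal(size=(1000, 768))    # frozen backbone features (stand-in)
labels_train = rng.integers(0, 2, size=1000)  # 0 = real photo, 1 = AI-generated
feats_test = rng.normal(size=(200, 768))

clf = LogisticRegression(max_iter=1000)       # the entire "detector"
clf.fit(feats_train, labels_train)
pred = clf.predict(feats_test)                # per-image real/fake decisions
print(pred[:10])
```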
[CV-85] DenVisCoM: Dense Vision Correspondence Mamba for Efficient and Real-time Optical Flow and Stereo Estimation
[Quick Read]: This paper addresses the difficulty of achieving both accuracy and real-time performance in multi-view geometry and motion estimation, in particular the joint modeling of optical flow and disparity. The key to the solution is DenVisCoM, a novel hybrid architecture that combines Mamba blocks with Transformer-based attention, optimizing inference speed and memory footprint while maintaining high accuracy, thereby enabling efficient joint estimation of motion and 3D dense perception tasks.
Link: https://arxiv.org/abs/2602.01724
Authors: Tushar Anand, Maheswar Bora, Antitza Dantcheva, Abhijit Das
Affiliations: Birla Institute of Technology and Science, Pilani, Hyderabad Campus, India; Université Côte d’Azur, Inria, France
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE International Conference on Robotics and Automation 2026
Abstract:In this work, we propose a novel Mamba block DenVisCoM, as well as a novel hybrid architecture specifically tailored for accurate, real-time estimation of optical flow and disparity. Given that such multi-view geometry and motion tasks are fundamentally related, we propose a unified architecture to tackle them jointly. Specifically, the proposed hybrid architecture is based on DenVisCoM and a Transformer-based attention block that efficiently addresses real-time inference, memory footprint, and accuracy at the same time for joint estimation of motion and 3D dense perception tasks. We extensively analyze the benchmark trade-off of accuracy and real-time processing on a large number of datasets. Our experimental results and related analysis suggest that our proposed model can accurately estimate optical flow and disparity in real time. All models and associated code are available at this https URL.
[CV-86] FastPhysGS: Accelerating Physics-based Dynamic 3DGS Simulation via Interior Completion and Adaptive Optimization
[Quick Read]: This paper addresses the challenges of extending 3D Gaussian Splatting (3DGS) to 4D physical simulation: existing methods rely on manual parameter tuning or distill dynamics from video diffusion models, limiting generalization and optimization efficiency, while LLM/VLM-based attempts suffer from a text/image-to-3D perceptual gap that yields unstable physical behavior and often ignore the surface structure of 3DGS, producing implausible motion. The key to the solution is FastPhysGS: (1) Instance-aware Particle Filling (IPF) with Monte Carlo Importance Sampling (MCIS) efficiently populates interior particles while preserving geometric fidelity; (2) Bidirectional Graph Decoupling Optimization (BGDO), an adaptive strategy, rapidly optimizes material parameters predicted by a VLM, achieving high-fidelity physical simulation within 1 minute using only 7 GB of runtime memory.
Link: https://arxiv.org/abs/2602.01723
Authors: Yikun Ma, Yiqing Li, Jingwen Ye, Zhongkai Wu, Weidong Zhang, Lin Gao, Zhi Jin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Extending 3D Gaussian Splatting (3DGS) to 4D physical simulation remains challenging. Based on the Material Point Method (MPM), existing methods either rely on manual parameter tuning or distill dynamics from video diffusion models, limiting the generalization and optimization efficiency. Recent attempts using LLMs/VLMs suffer from a text/image-to-3D perceptual gap, yielding unstable physics behavior. In addition, they often ignore the surface structure of 3DGS, leading to implausible motion. We propose FastPhysGS, a fast and robust framework for physics-based dynamic 3DGS simulation: (1) Instance-aware Particle Filling (IPF) with Monte Carlo Importance Sampling (MCIS) to efficiently populate interior particles while preserving geometric fidelity; (2) Bidirectional Graph Decoupling Optimization (BGDO), an adaptive strategy that rapidly optimizes material parameters predicted from a VLM. Experiments show FastPhysGS achieves high-fidelity physical simulation in 1 minute using only 7 GB runtime memory, outperforming prior works with broad potential applications.
[CV-87] Physics Informed Generative AI Enabling Labour Free Segmentation For Microscopy Analysis
[Quick Read]: This paper addresses the difficulty of automating semantic segmentation of microscopy images, which is constrained by the cost, subjectivity, and scarcity of expert annotations, and by the significant domain gap between physics-based simulations and real experimental images that prevents trained models from generalizing. The key to the solution is an end-to-end, annotation-free generative framework: phase-field simulations generate rich microstructural morphologies with pixel-accurate ground-truth masks; a cycle-consistent generative adversarial network (CycleGAN) performs unpaired image-to-image translation, mapping the clean simulations into high-fidelity synthetic images with realistic SEM characteristics; a U-Net trained exclusively on this synthetic data then achieves strong results on unseen real experimental images (mean Boundary F1-Score of 0.90 and IoU of 0.88), with t-SNE projections and Shannon entropy analysis confirming that the synthetic data matches the distribution of real data in feature space. By decoupling training from manual annotation, the framework turns a data-scarcity problem into one of data abundance and fully automates segmentation for materials characterization.
Link: https://arxiv.org/abs/2602.01710
Authors: Salma Zahran, Zhou Ao, Zhengyang Zhang, Chen Chi, Chenchen Yuan, Yanming Wang
Affiliations: Shanghai Jiao Tong University; Xiaomi AI Lab; Kunshan GCL Optoelectronic Material Co., Ltd; Future Photovoltaics Research Center
Categories: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments:
Abstract:Semantic segmentation of microscopy images is a critical task for high-throughput materials characterisation, yet its automation is severely constrained by the prohibitive cost, subjectivity, and scarcity of expert-annotated data. While physics-based simulations offer a scalable alternative to manual labelling, models trained on such data historically fail to generalise due to a significant domain gap, lacking the complex textures, noise patterns, and imaging artefacts inherent to experimental data. This paper introduces a novel framework for labour-free segmentation that successfully bridges this simulation-to-reality gap. Our pipeline leverages phase-field simulations to generate an abundant source of microstructural morphologies with perfect, intrinsically-derived ground-truth masks. We then employ a Cycle-Consistent Generative Adversarial Network (CycleGAN) for unpaired image-to-image translation, transforming the clean simulations into a large-scale dataset of high-fidelity, realistic SEM images. A U-Net model, trained exclusively on this synthetic data, demonstrated remarkable generalisation when deployed on unseen experimental images, achieving a mean Boundary F1-Score of 0.90 and an Intersection over Union (IOU) of 0.88. Comprehensive validation using t-SNE feature-space projection and Shannon entropy analysis confirms that our synthetic images are statistically and featurally indistinguishable from the real data manifold. By completely decoupling model training from manual annotation, our generative framework transforms a data-scarce problem into one of data abundance, providing a robust and fully automated solution to accelerate materials discovery and analysis.
[CV-88] Cross-Modal Alignment and Fusion for RGB-D Transmission-Line Defect Detection
[Quick Read]: This paper addresses the challenges of UAV transmission-line defect detection caused by the dominance of small-scale defects, complex backgrounds, and illumination variation, in particular the difficulty existing RGB-based detectors have in distinguishing geometrically subtle defects from visually similar background structures under limited chromatic contrast. The key to the solution is CMAFNet, a Cross-Modal Alignment and Fusion Network built on a purify-then-fuse paradigm: (1) a Semantic Recomposition Module performs dictionary-based feature purification via a learned codebook, suppressing modality-specific noise while preserving defect-discriminative information; (2) a Contextual Semantic Integration Framework captures global spatial dependencies with partial-channel attention to strengthen structural semantic reasoning; and (3) position-wise normalization in the purification stage enforces reconstruction-driven cross-modal alignment so heterogeneous features are statistically compatible before fusion. On the TLRGBD benchmark, where 94.5% of instances are small objects, the method surpasses the strongest baseline by 9.8 mAP@50 points, and a lightweight variant runs at 228 FPS with only 4.9M parameters, combining accuracy with efficiency.
Link: https://arxiv.org/abs/2602.01696
Authors: Jiaming Cui, Shuai Zhou, Wenqiang Li, Ruifeng Qin, Feng Shen
Affiliations: Harbin Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Transmission line defect detection remains challenging for automated UAV inspection due to the dominance of small-scale defects, complex backgrounds, and illumination variations. Existing RGB-based detectors, despite recent progress, struggle to distinguish geometrically subtle defects from visually similar background structures under limited chromatic contrast. This paper proposes CMAFNet, a Cross-Modal Alignment and Fusion Network that integrates RGB appearance and depth geometry through a principled purify-then-fuse paradigm. CMAFNet consists of a Semantic Recomposition Module that performs dictionary-based feature purification via a learned codebook to suppress modality-specific noise while preserving defect-discriminative information, and a Contextual Semantic Integration Framework that captures global spatial dependencies using partial-channel attention to enhance structural semantic reasoning. Position-wise normalization within the purification stage enforces explicit reconstruction-driven cross-modal alignment, ensuring statistical compatibility between heterogeneous features prior to fusion. Extensive experiments on the TLRGBD benchmark, where 94.5% of instances are small objects, demonstrate that CMAFNet achieves 32.2% mAP@50 and 12.5% APs, outperforming the strongest baseline by 9.8 and 4.0 percentage points, respectively. A lightweight variant reaches 24.8% mAP@50 at 228 FPS with only 4.9M parameters, surpassing all YOLO-based detectors while matching transformer-based methods at substantially lower computational cost.
[CV-89] FreshMem: Brain-Inspired Frequency-Space Hybrid Memory for Streaming Video Understanding
[Quick Read]: This paper addresses the poor adaptivity of multimodal large language models (MLLMs) when moving from offline to online streaming video understanding, which manifests as irreversible detail loss and context fragmentation. The key to the solution is FreshMem, a Frequency-Space Hybrid Memory network inspired by the brain's logarithmic perception and memory consolidation, built from two synergistic modules: a Multi-scale Frequency Memory (MFM) that projects overflowing frames onto representative frequency coefficients, complemented by residual details, to reconstruct a global historical "gist" and preserve short-term fidelity; and a Space Thumbnail Memory (STM) that uses an adaptive compression strategy to discretize the continuous stream into episodic clusters and distill them into high-density space thumbnails for long-term semantic coherence. As a training-free solution, FreshMem lifts the Qwen2-VL baseline by 5.20%, 4.52%, and 2.34% on StreamingBench, OV-Bench, and OVO-Bench respectively, outperforming several fully fine-tuned methods and offering an efficient paradigm for long-horizon streaming video understanding.
Link: https://arxiv.org/abs/2602.01683
Authors: Kangcong Li, Peng Ye, Lin Zhang, Chao Wang, Huafeng Qin, Tao Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Transitioning Multimodal Large Language Models (MLLMs) from offline to online streaming video understanding is essential for continuous perception. However, existing methods lack flexible adaptivity, leading to irreversible detail loss and context fragmentation. To resolve this, we propose FreshMem, a Frequency-Space Hybrid Memory network inspired by the brain’s logarithmic perception and memory consolidation. FreshMem reconciles short-term fidelity with long-term coherence through two synergistic modules: Multi-scale Frequency Memory (MFM), which projects overflowing frames into representative frequency coefficients, complemented by residual details to reconstruct a global historical “gist”; and Space Thumbnail Memory (STM), which discretizes the continuous stream into episodic clusters by employing an adaptive compression strategy to distill them into high-density space thumbnails. Extensive experiments show that FreshMem significantly boosts the Qwen2-VL baseline, yielding gains of 5.20%, 4.52%, and 2.34% on StreamingBench, OV-Bench, and OVO-Bench, respectively. As a training-free solution, FreshMem outperforms several fully fine-tuned methods, offering a highly efficient paradigm for long-horizon streaming video understanding.
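As a toy illustration of the frequency-memory idea, a frame can be compressed by keeping only its largest-magnitude DCT coefficients, yielding a low-rate "gist" whose reconstruction error is what residual details would cover. This is a simplified stand-in for the MFM module, with an illustrative keep ratio, not the paper's actual design.

```python
import numpy as np
from scipy.fft import dctn, idctn

def frame_gist(frame, keep=0.05):
    """Keep the top `keep` fraction of DCT coefficients by magnitude and
    reconstruct a compressed 'gist' of the frame."""
    coeffs = dctn(frame, norm="ortho")
    k = max(1, int(keep * coeffs.size))
    thresh = np.partition(np.abs(coeffs).ravel(), -k)[-k]  # k-th largest magnitude
    sparse = np.where(np.abs(coeffs) >= thresh, coeffs, 0.0)
    return idctn(sparse, norm="ortho")

frame = np.random.rand(224, 224)            # stand-in for an overflowing frame
gist = frame_gist(frame, keep=0.05)
print(float(np.mean((frame - gist) ** 2)))  # what residual details would cover
```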
[CV-90] Towards Autonomous Instrument Tray Assembly for Sterile Processing Applications
[Quick Read]: This paper addresses the inefficiency, error-proneness, contamination risk, and instrument damage associated with manual cleaning, disinfection, inspection, and assembly workflows in Sterile Processing and Distribution (SPD) departments. The key to the solution is a fully automated robotic system that sorts and structurally packs surgical instruments: a hybrid perception pipeline combines YOLO12 detection with a cascaded ResNet-based model for fine-grained classification, trained on a custom dataset of 31 instrument classes and 6,975 annotated images; the system integrates a calibrated vision module, a 6-DOF Staubli TX2-60L robotic arm with a custom dual electromagnetic gripper, and a rule-based packing algorithm that reduces instrument collisions during transport, with 3D-printed dividers and holders physically isolating instruments to lower friction and damage risk. Experiments show high perception accuracy and a statistically significant reduction in tool-to-tool collisions compared with human-assembled trays, offering a scalable first step toward automating the full SPD workflow.
Link: https://arxiv.org/abs/2602.01679
Authors: Raghavasimhan Sankaranarayanan, Paul Stuart, Nicholas Ahn, Arno Sungarian, Yash Chitalia
Affiliations: University of Louisville; Saint Vincent’s Hospital; University of Massachusetts
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 7 pages, 9 figures, 2026 International Symposium on Medical Robotics
Abstract:The Sterile Processing and Distribution (SPD) department is responsible for cleaning, disinfecting, inspecting, and assembling surgical instruments between surgeries. Manual inspection and preparation of instrument trays is a time-consuming, error-prone task, often prone to contamination and instrument breakage. In this work, we present a fully automated robotic system that sorts and structurally packs surgical instruments into sterile trays, focusing on automation of the SPD assembly stage. A custom dataset comprising 31 surgical instruments and 6,975 annotated images was collected to train a hybrid perception pipeline using YOLO12 for detection and a cascaded ResNet-based model for fine-grained classification. The system integrates a calibrated vision module, a 6-DOF Staubli TX2-60L robotic arm with a custom dual electromagnetic gripper, and a rule-based packing algorithm that reduces instrument collisions during transport. The packing framework uses 3D printed dividers and holders to physically isolate instruments, reducing collision and friction during transport. Experimental evaluations show high perception accuracy and statistically significant reduction in tool-to-tool collisions compared to human-assembled trays. This work serves as the scalable first step toward automating SPD workflows, improving safety, and consistency of surgical preparation while reducing SPD processing times.
[CV-91] SMTrack: State-Aware Mamba for Efficient Temporal Modeling in Visual Tracking
[Quick Read]: This paper addresses the difficulty of modeling long-range temporal dependencies in visual tracking under dynamic scenes, where conventional CNN and Transformer architectures have inherent limitations and typically require either complex customized modules or substantial computational cost. The key to the solution is the State-aware Mamba Tracker (SMTrack), whose core innovation is a selective state-aware space model with state-wise parameters that captures long-range temporal dependencies without extra modules; SMTrack keeps linear computational complexity during training and, via hidden-state propagation and updating, lets each frame interact with previously tracked frames, substantially cutting the cost of handling temporal cues at inference while improving tracking robustness.
Link: https://arxiv.org/abs/2602.01677
Authors: Yinchao Ma, Dengqing Yang, Zhangyu He, Wenfei Yang, Tianzhu Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper is accepted by IEEE TIP
Abstract:Visual tracking aims to automatically estimate the state of a target object in a video sequence, which is challenging especially in dynamic scenarios. Thus, numerous methods are proposed to introduce temporal cues to enhance tracking robustness. However, conventional CNN and Transformer architectures exhibit inherent limitations in modeling long-range temporal dependencies in visual tracking, often necessitating either complex customized modules or substantial computational costs to integrate temporal cues. Inspired by the success of the state space model, we propose a novel temporal modeling paradigm for visual tracking, termed State-aware Mamba Tracker (SMTrack), providing a neat pipeline for training and tracking without needing customized modules or substantial computational costs to build long-range temporal dependencies. It enjoys several merits. First, we propose a novel selective state-aware space model with state-wise parameters to capture more diverse temporal cues for robust tracking. Second, SMTrack facilitates long-range temporal interactions with linear computational complexity during training. Third, SMTrack enables each frame to interact with previously tracked frames via hidden state propagation and updating, which reduces the computational cost of handling temporal cues during tracking. Extensive experimental results demonstrate that SMTrack achieves promising performance with low computational costs.
[CV-92] VRGaussianAvatar: Integrating 3D Gaussian Avatars into VR
[Quick Read]: This paper addresses how to render real-time, full-body 3D Gaussian Splatting (3DGS) avatars in VR when only head-mounted display (HMD) tracking signals are available and full body pose cannot be observed directly. The key to the solution is a parallel architecture in which a VR Frontend and a GA Backend work together: the VR Frontend estimates full-body pose via inverse kinematics and streams the pose along with stereo camera parameters to the backend, while the GA Backend stereoscopically renders a 3DGS avatar reconstructed from a single image. To improve stereo rendering efficiency, the authors further introduce Binocular Batching, which processes the left- and right-eye views jointly in a single batched pass to cut redundant computation, supporting high-resolution VR displays while sustaining interactive performance.
Link: https://arxiv.org/abs/2602.01674
Authors: Hail Song, Boram Yoon, Seokhwan Yang, Seoyoung Kang, Hyunjeong Kim, Henning Metzmacher, Woontack Woo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted as an IEEE TVCG paper at IEEE VR 2026 (journal track)
Abstract:We present VRGaussianAvatar, an integrated system that enables real-time full-body 3D Gaussian Splatting (3DGS) avatars in virtual reality using only head-mounted display (HMD) tracking signals. The system adopts a parallel pipeline with a VR Frontend and a GA Backend. The VR Frontend uses inverse kinematics to estimate full-body pose and streams the resulting pose along with stereo camera parameters to the backend. The GA Backend stereoscopically renders a 3DGS avatar reconstructed from a single image. To improve stereo rendering efficiency, we introduce Binocular Batching, which jointly processes left and right eye views in a single batched pass to reduce redundant computation and support high-resolution VR displays. We evaluate VRGaussianAvatar with quantitative performance tests and a within-subject user study against image- and video-based mesh avatar baselines. Results show that VRGaussianAvatar sustains interactive VR performance and yields higher perceived appearance similarity, embodiment, and plausibility. Project page and source code are available at this https URL.
[CV-93] Real-Time Loop Closure Detection in Visual SLAM via NetVLAD and Faiss
[Quick Read]: This paper addresses the degradation of traditional bag-of-words (BoW) loop closure detection (LCD) under appearance change and perceptual aliasing, and the computational overhead that has kept deep-learning-based visual place recognition (VPR) methods out of real-time SLAM. The key to the solution is to adopt NetVLAD as the LCD module and evaluate it empirically against DBoW on the KITTI dataset, introducing a fine-grained Top-K precision-recall curve that better fits LCD settings where a query may have zero or multiple valid matches; combined with Faiss-accelerated nearest-neighbor search, the approach achieves both high accuracy and real-time query speed, providing a practical drop-in replacement for DBoW in SLAM systems.
Link: https://arxiv.org/abs/2602.01673
Authors: Enguang Fan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
Abstract:Loop closure detection (LCD) is a core component of simultaneous localization and mapping (SLAM): it identifies revisited places and enables pose-graph constraints that correct accumulated drift. Classic bag-of-words approaches such as DBoW are efficient but often degrade under appearance change and perceptual aliasing. In parallel, deep learning-based visual place recognition (VPR) descriptors (e.g., NetVLAD and Transformer-based models) offer stronger robustness, but their computational cost is often viewed as a barrier to real-time SLAM. In this paper, we empirically evaluate NetVLAD as an LCD module and compare it against DBoW on the KITTI dataset. We introduce a Fine-Grained Top-K precision-recall curve that better reflects LCD settings where a query may have zero or multiple valid matches. With Faiss-accelerated nearest-neighbor search, NetVLAD achieves real-time query speed while improving accuracy and robustness over DBoW, making it a practical drop-in alternative for LCD in SLAM.
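The retrieval side of this pipeline is easy to reproduce: L2-normalized global descriptors in an inner-product Faiss index make the search equivalent to cosine similarity. A minimal sketch follows; the descriptor dimension, database size, and acceptance threshold are illustrative assumptions.

```python
import numpy as np
import faiss

d = 4096                      # NetVLAD descriptor dimension (assumed)
index = faiss.IndexFlatIP(d)  # inner product == cosine after L2 normalization

db = np.random.randn(5000, d).astype("float32")  # stand-in keyframe descriptors
faiss.normalize_L2(db)
index.add(db)

query = np.random.randn(1, d).astype("float32")  # descriptor of the current frame
faiss.normalize_L2(query)
sims, ids = index.search(query, 5)               # top-5 loop-closure candidates
candidates = ids[0][sims[0] > 0.8]               # accept only confident matches
print(candidates)
```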
[CV-94] Moonworks Lunara Aesthetic II: An Image Variation Dataset
[Quick Read]: This paper addresses the difficulty of evaluating and learning contextual consistency in modern image generation and editing systems, in particular achieving controllable semantic transformations while keeping object identity stable. The key to the solution is Lunara Aesthetic II, an ethically sourced, clearly structured and annotated dataset of 2,854 anchor-linked variation pairs, where each pair applies a contextual transformation such as illumination, weather, viewpoint, scene composition, color tone, or mood while preserving a stable underlying identity. By operationalizing identity-preserving contextual variation as a supervision signal, the dataset supports benchmarking and analysis of contextual generalization, identity preservation, and edit robustness in image generation and image-to-image systems.
Link: https://arxiv.org/abs/2602.01666
Authors: Yan Wang, Partho Hassan, Samiha Sadeka, Nada Soliman, M M Sayeef Abdullah, Sabit Hassan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We introduce Lunara Aesthetic II, a publicly released, ethically sourced image dataset designed to support controlled evaluation and learning of contextual consistency in modern image generation and editing systems. The dataset comprises 2,854 anchor-linked variation pairs derived from original art and photographs created by Moonworks. Each variation pair applies contextual transformations, such as illumination, weather, viewpoint, scene composition, color tone, or mood, while preserving a stable underlying identity. Lunara Aesthetic II operationalizes identity-preserving contextual variation as a supervision signal while also retaining Lunara’s signature high aesthetic scores. Results show high identity stability, strong target attribute realization, and a robust aesthetic profile that exceeds large-scale web datasets. Released under the Apache 2.0 license, Lunara Aesthetic II is intended for benchmarking, fine-tuning, and analysis of contextual generalization, identity preservation, and edit robustness in image generation and image-to-image systems with interpretable, relational supervision. The dataset is publicly available at: this https URL.
[CV-95] From Frames to Sequences: Temporally Consistent Human-Centric Dense Prediction
[Quick Read]: This paper addresses temporal consistency in human-centric dense prediction across video sequences: existing models achieve strong per-frame accuracy but flicker under motion, occlusion, and lighting changes, and paired human video supervision for multiple dense tasks is scarce. The key to the solution is a scalable synthetic data pipeline that generates photorealistic human frames and motion-aligned sequences with pixel-accurate depth, normals, and masks, providing frame-level labels for spatial learning and sequence-level supervision for temporal learning. On top of this, a unified ViT-based dense predictor injects an explicit human geometric prior via CSE embeddings and improves geometry-feature reliability with a lightweight channel reweighting module after feature fusion; a two-stage training strategy (static pretraining followed by dynamic sequence supervision) first builds robust spatial representations and then refines temporal consistency.
Link: https://arxiv.org/abs/2602.01661
Authors: Xingyu Miao, Junting Dong, Qin Zhao, Yuhang Yang, Junhao Chen, Yang Long
Affiliations: Durham University; Shanghai AI Lab; Zhejiang University; USTC; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In this work, we focus on the challenge of temporally consistent human-centric dense prediction across video sequences. Existing models achieve strong per-frame accuracy but often flicker under motion, occlusion, and lighting changes, and they rarely have paired human video supervision for multiple dense tasks. We address this gap with a scalable synthetic data pipeline that generates photorealistic human frames and motion-aligned sequences with pixel-accurate depth, normals, and masks. Unlike prior static data synthetic pipelines, our pipeline provides both frame-level labels for spatial learning and sequence-level supervision for temporal learning. Building on this, we train a unified ViT-based dense predictor that (i) injects an explicit human geometric prior via CSE embeddings and (ii) improves geometry-feature reliability with a lightweight channel reweighting module after feature fusion. Our two-stage training strategy, combining static pretraining with dynamic sequence supervision, enables the model first to acquire robust spatial representations and then refine temporal consistency across motion-aligned sequences. Extensive experiments show that we achieve state-of-the-art performance on THuman2.1 and Hi4D and generalize effectively to in-the-wild videos.
[CV-96] Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning AAAI2026
[Quick Read]: This paper addresses the heavy inference cost that redundant video tokens impose on video large language models, which limits practical deployment. Existing compression algorithms prioritize retaining features with the highest attention scores, yet the relationship between attention scores and a token's actual contribution to the correct prediction remains ambiguous. The key to the solution is a Contribution-aware token Compression algorithm for VIDeo understanding (CaCoVID), which optimizes the token-selection policy within a reinforcement learning framework to explicitly maximize the contribution of token combinations to correct predictions rather than relying on attention scores as a proxy; a combinatorial policy optimization algorithm with online combination-space sampling dramatically shrinks the exploration space and accelerates policy convergence, yielding more efficient and accurate video token compression.
Link: https://arxiv.org/abs/2602.01649
Authors: Yinchao Ma, Qiang Zhou, Zhibin Wang, Xianing Chen, Hanqing Yang, Jun Song, Bo Zheng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This paper is accepted by AAAI2026
Abstract:Video large language models have demonstrated remarkable capabilities in video understanding tasks. However, the redundancy of video tokens introduces significant computational overhead during inference, limiting their practical deployment. Many compression algorithms are proposed to prioritize retaining features with the highest attention scores to minimize perturbations in attention computations. However, the correlation between attention scores and their actual contribution to correct answers remains ambiguous. To address the above limitation, we propose a novel Contribution-aware token Compression algorithm for VIDeo understanding (CaCoVID) that explicitly optimizes the token selection policy based on the contribution of tokens to correct predictions. First, we introduce a reinforcement learning-based framework that optimizes a policy network to select video token combinations with the greatest contribution to correct predictions. This paradigm shifts the focus from passive token preservation to active discovery of optimal compressed token combinations. Secondly, we propose a combinatorial policy optimization algorithm with online combination space sampling, which dramatically reduces the exploration space for video token combinations and accelerates the convergence speed of policy optimization. Extensive experiments on diverse video understanding benchmarks demonstrate the effectiveness of CaCoVID. Codes will be released.
[CV-97] From Perception to Action: Spatial AI Agents and World Models
[Quick Read]: This paper addresses the lack of a unified framework connecting agentic reasoning with spatial intelligence: the capabilities of LLM-based agents in symbolic domains do not readily translate to spatial perception, structured reasoning, and multi-scale action in the physical world, and existing surveys treat agentic architectures and spatial domains in isolation. The key to the solution is a three-axis taxonomy (Capability, Task, and Scale) that systematically unifies agentic behavior with spatial tasks, explicitly distinguishing spatial grounding (metric understanding of geometry and physics) from symbolic grounding (associating images with text) and arguing that perception alone does not confer agency. The analysis highlights the central roles of hierarchical memory systems, GNN-LLM integration, and world models for safe deployment across scales, providing a foundation and research directions for the next generation of embodied autonomous systems.
Link: https://arxiv.org/abs/2602.01644
Authors: Gloria Felicia, Nolan Bryant, Handi Putra, Ayaan Gazali, Eliel Lobo, Esteban Rojas
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments: 61 pages, 742 citations, 1 figure, 3 tables. Survey paper on spatial AI agents, embodied AI, graph neural networks, and world models
Abstract:While large language models have become the prevailing approach for agentic reasoning and planning, their success in symbolic domains does not readily translate to the physical world. Spatial intelligence, the ability to perceive 3D structure, reason about object relationships, and act under physical constraints, is an orthogonal capability that proves important for embodied agents. Existing surveys address either agentic architectures or spatial domains in isolation. None provide a unified framework connecting these complementary capabilities. This paper bridges that gap. Through a thorough review of over 2,000 papers, citing 742 works from top-tier venues, we introduce a unified three-axis taxonomy connecting agentic capabilities with spatial tasks across scales. Crucially, we distinguish spatial grounding (metric understanding of geometry and physics) from symbolic grounding (associating images with text), arguing that perception alone does not confer agency. Our analysis reveals three key findings mapped to these axes: (1) hierarchical memory systems (Capability axis) are important for long-horizon spatial tasks. (2) GNN-LLM integration (Task axis) is a promising approach for structured spatial reasoning. (3) World models (Scale axis) are essential for safe deployment across micro-to-macro spatial scales. We conclude by identifying six grand challenges and outlining directions for future research, including the need for unified evaluation frameworks to standardize cross-domain assessment. This taxonomy provides a foundation for unifying fragmented research efforts and enabling the next generation of spatially-aware autonomous systems in robotics, autonomous vehicles, and geospatial intelligence.
[CV-98] ReCALL: Recalibrating Capability Degradation for MLLM -based Composed Image Retrieval
[Quick Read]: This paper addresses Capability Degradation in composed image retrieval (CIR): adapting a generative multimodal large language model (MLLM) into a single-embedding discriminative retriever triggers a paradigm conflict that erodes the model's native fine-grained visual-semantic reasoning. The key to the solution is ReCALL, a model-agnostic diagnose-generate-refine framework: it first diagnoses the retriever's cognitive blind spots via self-guided informative instance mining; it then generates corrective instructions and triplets by chain-of-thought (CoT) prompting the foundation MLLM, with VQA-based consistency filtering for quality control; finally, it refines the retriever through continual training on these triplets with a grouped contrastive scheme, internalizing fine-grained visual-semantic distinctions and realigning the embedding space with the MLLM's intrinsic compositional reasoning, thereby recovering the degraded capabilities while improving performance.
Link: https://arxiv.org/abs/2602.01639
Authors: Tianyu Yang, ChenWei He, Xiangzhao Hao, Tianyue Wang, Jiarui Guo, Haiyun Guo, Leigang Qu, Jinqiao Wang, Tat-Seng Chua
Affiliations: Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Southeast University; Beijing University of Posts and Telecommunications; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Composed Image Retrieval (CIR) aims to retrieve target images based on a hybrid query comprising a reference image and a modification text. Early dual-tower Vision-Language Models (VLMs) struggle with cross-modality compositional reasoning required for this task. Recently, adapting generative Multimodal Large Language Models (MLLMs) for retrieval offers a promising direction. However, we identify that this adaptation strategy overlooks a fundamental issue: adapting a generative MLLM into a single-embedding discriminative retriever triggers a paradigm conflict, which leads to Capability Degradation - the deterioration of native fine-grained reasoning after retrieval adaptation. To address this challenge, we propose ReCALL (Recalibrating Capability Degradation), a model-agnostic framework that follows a diagnose-generate-refine pipeline: Firstly, we diagnose cognitive blind spots of the retriever via self-guided informative instance mining. Next, we generate corrective instructions and triplets by CoT prompting the foundation MLLM and conduct quality control with VQA-based consistency filtering. Finally, we refine the retriever through continual training on these triplets with a grouped contrastive scheme, thereby internalizing fine-grained visual-semantic distinctions and realigning the discriminative embedding space of retriever with intrinsic compositional reasoning within the MLLM. Extensive experiments on CIRR and FashionIQ show that ReCALL consistently recalibrates degraded capabilities and achieves state-of-the-art performance. Code will be released soon.
[CV-99] Federated Vision Transformer with Adaptive Focal Loss for Medical Image Classification
[Quick Read]: This paper addresses two challenges of federated learning (FL) in privacy-sensitive settings such as medical imaging: class imbalance within local clients' data and data heterogeneity across clients that degrades model generalization. The key to the solution is an FL framework combining a Dynamic Adaptive Focal Loss (DAFL) with a client-aware aggregation strategy. DAFL introduces an imbalance coefficient that adjusts dynamically to each client's sample and class distributions, ensuring minority classes receive sufficient attention and are not ignored because of data sparsity; a weighted aggregation strategy based on data size and feature characteristics then mitigates the impact of client heterogeneity on the global model, yielding accuracy improvements of 0.98% to 41.69% over existing mainstream models on public medical image datasets.
Link: https://arxiv.org/abs/2602.01633
Authors: Xinyuan Zhao, Yihang Wu, Ahmad Chaddad, Tareef Daqqaq, Reem Kateb
Affiliations: Guilin University of Electronic Technology; École de Technologie Supérieure; Taibah University; Jeddah University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in Knowledge-Based Systems
Abstract:While deep learning models like Vision Transformer (ViT) have achieved significant advances, they typically require large datasets. With data privacy regulations, access to many original datasets is restricted, especially medical images. Federated learning (FL) addresses this challenge by enabling global model aggregation without data exchange. However, the heterogeneity of the data and the class imbalance that exist in local clients pose challenges for the generalization of the model. This study proposes a FL framework leveraging a dynamic adaptive focal loss (DAFL) and a client-aware aggregation strategy for local training. Specifically, we design a dynamic class imbalance coefficient that adjusts based on each client’s sample distribution and class data distribution, ensuring minority classes receive sufficient attention and preventing sparse data from being ignored. To address client heterogeneity, a weighted aggregation strategy is adopted, which adapts to data size and characteristics to better capture inter-client variations. The classification results on three public datasets (ISIC, Ocular Disease and RSNA-ICH) show that the proposed framework outperforms DenseNet121, ResNet50, ViT-S/16, ViT-L/32, FedCLIP, Swin Transformer, CoAtNet, and MixNet in most cases, with accuracy improvements ranging from 0.98% to 41.69%. Ablation studies on the imbalanced ISIC dataset validate the effectiveness of the proposed loss function and aggregation strategy compared to traditional loss functions and other FL approaches. The codes can be found at: this https URL.
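A hedged reading of DAFL is a focal loss whose per-class weight grows as a class becomes rarer on the local client; the paper's exact coefficient may differ. A minimal PyTorch sketch with an illustrative imbalanced client:

```python
import torch
import torch.nn.functional as F

def dynamic_focal_loss(logits, targets, class_counts, gamma=2.0):
    """Focal loss whose per-class weight is derived from the client's own
    label distribution, so minority classes are up-weighted."""
    freqs = class_counts.float() / class_counts.sum()
    alpha = (1.0 - freqs) / (1.0 - freqs).sum() * len(class_counts)
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)  # probability assigned to the true class
    return (alpha[targets] * (1.0 - pt) ** gamma * ce).mean()

logits = torch.randn(8, 3)           # batch of 8 samples, 3 classes
targets = torch.randint(0, 3, (8,))
counts = torch.tensor([500, 50, 5])  # heavily imbalanced local client
print(dynamic_focal_loss(logits, targets, counts).item())
```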
[CV-100] Research on World Models Is Not Merely Injecting World Knowledge into Specific Tasks
[Quick Read]: This paper addresses the fragmentation of current world-model research, where most work injects world knowledge into isolated tasks (visual prediction, 3D estimation, or symbol grounding) without a unified definition or framework, making holistic environment understanding difficult. The key to the solution is a unified design specification arguing that a robust world model should not be a loose collection of capabilities but a normative framework that integrates interaction, perception, symbolic reasoning, and spatial representation, providing structured guidance for future research toward more general, robust, and principled world modeling.
Link: https://arxiv.org/abs/2602.01630
Authors: Bohan Zeng, Kaixin Zhu, Daili Hua, Bozhou Li, Chengzhuo Tong, Yuran Wang, Xinyi Huang, Yifan Dai, Zixiang Zhang, Yifan Yang, Zhou Liu, Hao Liang, Xiaochen Ma, Ruichuan An, Tianyi Bai, Hongcheng Gao, Junbo Niu, Yang Shi, Xinlong Chen, Yue Ding, Minglei Shi, Kai Zeng, Yiwen Tang, Yuanxing Zhang, Pengfei Wan, Xintao Wang, Wentao Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 4 figures
Abstract:World models have emerged as a critical frontier in AI research, aiming to enhance large models by infusing them with physical dynamics and world knowledge. The core objective is to enable agents to understand, predict, and interact with complex environments. However, current research landscape remains fragmented, with approaches predominantly focused on injecting world knowledge into isolated tasks, such as visual prediction, 3D estimation, or symbol grounding, rather than establishing a unified definition or framework. While these task-specific integrations yield performance gains, they often lack the systematic coherence required for holistic world understanding. In this paper, we analyze the limitations of such fragmented approaches and propose a unified design specification for world models. We suggest that a robust world model should not be a loose collection of capabilities but a normative framework that integrally incorporates interaction, perception, symbolic reasoning, and spatial representation. This work aims to provide a structured perspective to guide future research toward more general, robust, and principled models of the world.
[CV-101] PISCES: Annotation-free Text-to-Video Post-Training via Optimal Transport-Aligned Rewards
[Quick Read]: This paper addresses the weakness of reward supervision in text-to-video (T2V) generation: existing reward-based post-training either depends on large-scale human preference annotations or operates on poorly aligned embeddings from pre-trained vision-language models, limiting scalability or supervision quality. The key to the solution is PISCES, an annotation-free post-training algorithm with a Dual Optimal Transport (OT)-aligned Rewards module that aligns text and video embeddings at both the distributional and the discrete token level: a Distributional OT-aligned Quality Reward captures overall visual quality and temporal coherence, while a Discrete Token-level OT-aligned Semantic Reward enforces semantic, spatio-temporal correspondence between text and video tokens. This is the first work to improve annotation-free reward supervision in generative post-training through the lens of OT; experiments show PISCES outperforms both annotation-based and annotation-free methods on VBench, and the reward module is compatible with multiple optimization paradigms, including direct backpropagation and reinforcement learning fine-tuning.
Link: https://arxiv.org/abs/2602.01624
Authors: Minh-Quan Le, Gaurav Mittal, Cheng Zhao, David Gu, Dimitris Samaras, Mei Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Text-to-video (T2V) generation aims to synthesize videos with high visual quality and temporal consistency that are semantically aligned with input text. Reward-based post-training has emerged as a promising direction to improve the quality and semantic alignment of generated videos. However, recent methods either rely on large-scale human preference annotations or operate on misaligned embeddings from pre-trained vision-language models, leading to limited scalability or suboptimal supervision. We present PISCES, an annotation-free post-training algorithm that addresses these limitations via a novel Dual Optimal Transport (OT)-aligned Rewards module. To align reward signals with human judgment, PISCES uses OT to bridge text and video embeddings at both distributional and discrete token levels, enabling reward supervision to fulfill two objectives: (i) a Distributional OT-aligned Quality Reward that captures overall visual quality and temporal coherence; and (ii) a Discrete Token-level OT-aligned Semantic Reward that enforces semantic, spatio-temporal correspondence between text and video tokens. To our knowledge, PISCES is the first to improve annotation-free reward supervision in generative post-training through the lens of OT. Experiments on both short- and long-video generation show that PISCES outperforms both annotation-based and annotation-free methods on VBench across Quality and Semantic scores, with human preference studies further validating its effectiveness. We show that the Dual OT-aligned Rewards module is compatible with multiple optimization paradigms, including direct backpropagation and reinforcement learning fine-tuning.
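Token-level OT alignment is commonly computed with entropic (Sinkhorn) regularization. The sketch below aligns text-token and video-token embeddings under uniform marginals and turns the transport cost into a scalar reward; it is a generic Sinkhorn illustration with illustrative shapes and hyperparameters, not PISCES's actual solver.

```python
import torch
import torch.nn.functional as F

def sinkhorn(cost, eps=0.1, iters=50):
    """Entropic OT plan between uniform marginals (standard Sinkhorn)."""
    n, m = cost.shape
    K = torch.exp(-cost / eps)
    u = torch.full((n,), 1.0 / n)
    v = torch.full((m,), 1.0 / m)
    a, b = u.clone(), v.clone()
    for _ in range(iters):
        a = u / (K @ b)
        b = v / (K.T @ a)
    return a[:, None] * K * b[None, :]  # transport plan

text = torch.randn(7, 64)    # text-token embeddings (shapes illustrative)
video = torch.randn(20, 64)  # video-token embeddings
cost = 1.0 - F.cosine_similarity(text[:, None, :], video[None, :, :], dim=-1)
plan = sinkhorn(cost)
reward = -(plan * cost).sum()  # higher when tokens align well
print(reward.item())
```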
[CV-102] Omni-Judge: Can Omni-LLM s Serve as Human-Aligned Judges for Text-Conditioned Audio-Video Generation?
[Quick Read]: This paper addresses the difficulty of evaluating the tri-modal outputs (text, audio, video) of multimodal generation models, especially text-to-video generators. Traditional automatic metrics such as FVD, CLAP, and ViCLIP focus on isolated modality pairs, struggle with complex prompts, and offer limited interpretability, while human evaluation is reliable but costly and hard to scale. The key to the solution is Omni-Judge, an automated evaluation framework built on omni-modal large language models (omni-LLMs), which natively process audio, video, and text and support cross-modal reasoning, enabling unified, interpretable assessment of text-conditioned audio-video generation. Experiments show Omni-Judge excels on semantically demanding tasks such as audio-text alignment, video-text alignment, and audio-video-text coherence, and provides chain-of-thought explanations that expose semantic or physical inconsistencies to support downstream feedback-based refinement, although its limited temporal resolution constrains high-FPS perceptual metrics such as video quality and audio-video synchronization.
Link: https://arxiv.org/abs/2602.01623
Authors: Susan Liang, Chao Huang, Filippos Bellos, Yolo Yunlong Tang, Qianxiang Shen, Jing Bi, Luchuan Song, Zeliang Zhang, Jason Corso, Chenliang Xu
Affiliations: University of Rochester; University of Michigan, Ann Arbor
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:State-of-the-art text-to-video generation models such as Sora 2 and Veo 3 can now produce high-fidelity videos with synchronized audio directly from a textual prompt, marking a new milestone in multi-modal generation. However, evaluating such tri-modal outputs remains an unsolved challenge. Human evaluation is reliable but costly and difficult to scale, while traditional automatic metrics, such as FVD, CLAP, and ViCLIP, focus on isolated modality pairs, struggle with complex prompts, and provide limited interpretability. Omni-modal large language models (omni-LLMs) present a promising alternative: they naturally process audio, video, and text, support rich reasoning, and offer interpretable chain-of-thought feedback. Driven by this, we introduce Omni-Judge, a study assessing whether omni-LLMs can serve as human-aligned judges for text-conditioned audio-video generation. Across nine perceptual and alignment metrics, Omni-Judge achieves correlation comparable to traditional metrics and excels on semantically demanding tasks such as audio-text alignment, video-text alignment, and audio-video-text coherence. It underperforms on high-FPS perceptual metrics, including video quality and audio-video synchronization, due to limited temporal resolution. Omni-Judge provides interpretable explanations that expose semantic or physical inconsistencies, enabling practical downstream uses such as feedback-based refinement. Our findings highlight both the potential and current limitations of omni-LLMs as unified evaluators for multi-modal generation.
[CV-103] Token Pruning for In-Context Generation in Diffusion Transformers
[Quick Read]: This paper addresses the computational bottleneck that in-context generation creates for Diffusion Transformers (DiTs): concatenating reference examples with target latents sharply increases sequence length, and existing token-reduction techniques designed for text-to-image synthesis apply uniform reduction, ignoring the inherent asymmetry between reference contexts and target latents across spatial, temporal, and functional dimensions. The key to the solution is ToPi, a training-free token pruning framework: offline calibration-driven sensitivity analysis identifies pivotal attention layers as a robust proxy for redundancy estimation; from these layers, a novel influence metric quantifies each context token's contribution for selective pruning; and a temporal update strategy adapts the pruning to the evolving diffusion trajectory. Empirically, ToPi achieves over 30% inference speedup while maintaining structural fidelity and visual consistency.
Link: https://arxiv.org/abs/2602.01609
Authors: Junqing Lin, Xingyu Zheng, Pei Cheng, Bin Fu, Jingwei Sun, Guangzhong Sun
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages
Abstract:In-context generation significantly enhances Diffusion Transformers (DiTs) by enabling controllable image-to-image generation through reference examples. However, the resulting input concatenation drastically increases sequence length, creating a substantial computational bottleneck. Existing token reduction techniques, primarily tailored for text-to-image synthesis, fall short in this paradigm as they apply uniform reduction strategies, overlooking the inherent role asymmetry between reference contexts and target latents across spatial, temporal, and functional dimensions. To bridge this gap, we introduce ToPi, a training-free token pruning framework tailored for in-context generation in DiTs. Specifically, ToPi utilizes offline calibration-driven sensitivity analysis to identify pivotal attention layers, serving as a robust proxy for redundancy estimation. Leveraging these layers, we derive a novel influence metric to quantify the contribution of each context token for selective pruning, coupled with a temporal update strategy that adapts to the evolving diffusion trajectory. Empirical evaluations demonstrate that ToPi can achieve over 30% speedup in inference while maintaining structural fidelity and visual consistency across complex image generation tasks.
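A simplified stand-in for the influence-based pruning: score each reference-context token by the attention mass it receives from the target latents at one pivotal layer, then keep the top fraction. ToPi's actual influence metric and temporal update strategy are more involved; the shapes and keep ratio below are illustrative.

```python
import torch

def prune_context_tokens(ctx_tokens, attn_to_ctx, keep_ratio=0.7):
    """Keep the context tokens with the largest summed attention from the
    target latents, preserving their original order."""
    n_keep = max(1, int(keep_ratio * ctx_tokens.shape[0]))
    influence = attn_to_ctx.sum(dim=0)  # (num_ctx,) per-token score
    keep_idx = influence.topk(n_keep).indices.sort().values
    return ctx_tokens[keep_idx], keep_idx

ctx = torch.randn(1024, 320)  # reference-image tokens (length x channels)
attn = torch.rand(256, 1024)  # target-to-context attention at a pivotal layer
pruned, idx = prune_context_tokens(ctx, attn)
print(pruned.shape)           # torch.Size([716, 320])
```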
[CV-104] UV-M3TL: A Unified and Versatile Multimodal Multi-Task Learning Framework for Assistive Driving Perception
[Quick Read]: This paper addresses the performance degradation caused by inter-task negative transfer when advanced driver assistance systems (ADAS) jointly learn heterogeneous tasks such as driver behavior recognition, driver emotion recognition, vehicle behavior perception, and traffic context understanding. The key to the solution is the Unified and Versatile Multimodal Multi-Task Learning (UV-M3TL) framework, with two core components: (1) Dual-Branch Spatial Channel Multimodal Embedding (DB-SCME), which explicitly models task-shared and task-specific features to strengthen cross-task knowledge transfer while mitigating task conflicts; and (2) an Adaptive Feature-Decoupled Multi-Task Loss (AFD-Loss), which introduces an adaptive weighting mechanism based on learning dynamics and feature-decoupling constraints to stabilize joint optimization and guide the model toward diverse multi-task representations.
Link: https://arxiv.org/abs/2602.01594
Authors: Wenzhuo Liu, Qiannan Guo, Zhen Wang, Wenshuo Wang, Lei Yang, Yicheng Qiao, Lening Wang, Zhiwei Li, Chen Lv, Shanghang Zhang, Junqiang Xi, Huaping Liu
Affiliations: Beijing Institute of Technology; Tsinghua University; Nanyang Technological University; Beihang University; Beijing University of Chemical Technology; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Advanced Driver Assistance Systems (ADAS) need to understand human driver behavior while perceiving their navigation context, but jointly learning these heterogeneous tasks would cause inter-task negative transfer and impair system performance. Here, we propose a Unified and Versatile Multimodal Multi-Task Learning (UV-M3TL) framework to simultaneously recognize driver behavior, driver emotion, vehicle behavior, and traffic context, while mitigating inter-task negative transfer. Our framework incorporates two core components: dual-branch spatial channel multimodal embedding (DB-SCME) and adaptive feature-decoupled multi-task loss (AFD-Loss). DB-SCME enhances cross-task knowledge transfer while mitigating task conflicts by employing a dual-branch structure to explicitly model salient task-shared and task-specific features. AFD-Loss improves the stability of joint optimization while guiding the model to learn diverse multi-task representations by introducing an adaptive weighting mechanism based on learning dynamics and feature decoupling constraints. We evaluate our method on the AIDE dataset, and the experimental results demonstrate that UV-M3TL achieves state-of-the-art performance across all four tasks. To further prove the versatility, we evaluate UV-M3TL on additional public multi-task perception benchmarks (BDD100K, CityScapes, NYUD-v2, and PASCAL-Context), where it consistently delivers strong performance across diverse task combinations, attaining state-of-the-art results on most tasks.
[CV-105] Samba: General and Accurate Salient Object Detection via A More Unified Mamba-based Framework
[Quick Read]: This paper addresses the bottlenecks of existing salient object detection (SOD) models, namely the limited receptive fields of CNNs and the quadratic complexity of Transformers, as well as the task specificity, cross-modal conflicts, and catastrophic forgetting that arise with multimodal inputs. The key to the solution is Saliency Mamba (Samba), a pure state-space model (SSM) architecture: a saliency-guided Mamba block (SGMB) with a spatial neighborhood scanning (SNS) algorithm preserves the spatial continuity of salient regions, and a context-aware upsampling (CAU) method strengthens hierarchical feature alignment and aggregation. To further improve generality, the authors build Samba+, which uses a hub-and-spoke graph attention (HGA) module for adaptive cross-modal interactive fusion and a modality-anchored continual learning (MACL) strategy to alleviate inter-modal conflicts and forgetting, achieving unified modeling and superior performance across multiple SOD tasks.
Link: https://arxiv.org/abs/2602.01593
Authors: Wenzhuo Zhao, Keren Fu, Jiahao He, Xiaohong Liu, Qijun Zhao, Guangtao Zhai
Affiliations: Sichuan University; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Existing salient object detection (SOD) models are generally constrained by the limited receptive fields of convolutional neural networks (CNNs) and quadratic computational complexity of Transformers. Recently, the emerging state-space model, namely Mamba, has shown great potential in balancing global receptive fields and computational efficiency. As a solution, we propose Saliency Mamba (Samba), a pure Mamba-based architecture that flexibly handles various distinct SOD tasks, including RGB/RGB-D/RGB-T SOD, video SOD (VSOD), RGB-D VSOD, and visible-depth-thermal SOD. Specifically, we rethink the scanning strategy of Mamba for SOD, and introduce a saliency-guided Mamba block (SGMB) that features a spatial neighborhood scanning (SNS) algorithm to preserve the spatial continuity of salient regions. A context-aware upsampling (CAU) method is also proposed to promote hierarchical feature alignment and aggregation by modeling contextual dependencies. As one step further, to avoid the “task-specific” problem as in previous SOD solutions, we develop Samba+, which is empowered by training Samba in a multi-task joint manner, leading to a more unified and versatile model. Two crucial components that collaboratively tackle challenges encountered in input of arbitrary modalities and continual adaptation are investigated. Specifically, a hub-and-spoke graph attention (HGA) module facilitates adaptive cross-modal interactive fusion, and a modality-anchored continual learning (MACL) strategy alleviates inter-modal conflicts together with catastrophic forgetting. Extensive experiments demonstrate that Samba individually outperforms existing methods across six SOD tasks on 22 datasets with lower computational cost, whereas Samba+ achieves even superior results on these tasks and datasets by using a single trained versatile model. Additional results further demonstrate the potential of our Samba framework.
zh
[CV-106] Know Your Step: Faster and Better Alignment for Flow Matching Models via Step-aware Advantages
【速读】:该论文旨在解决现有基于强化学习(Reinforcement Learning, RL)的流匹配(Flow Matching)模型在文本到图像生成中,因依赖大量去噪步骤且奖励信号稀疏不精准而导致的人类偏好对齐效果不佳的问题。其解决方案的关键在于提出一种名为温度退火的少步采样与组相对策略优化(Temperature Annealed Few step Sampling with Group Relative Policy Optimization, TAFS GRPO)的新框架:通过在每一步采样结果上注入自适应时间噪声,迭代地降低温度以引入采样过程中的随机性并保持图像语义完整性;同时,利用步长感知的优势整合机制结合GRPO算法,无需奖励函数可微分即可获得密集且步骤特异性的奖励信号,从而实现稳定高效的策略优化。
链接: https://arxiv.org/abs/2602.01591
作者: Zhixiong Yue,Zixuan Ni,Feiyang Ye,Jinshan Zhang,Sheng Shen,Zhenpeng Mi
机构: Zhejiang University (浙江大学); HiThink Research; Southern University of Science and Technology (南方科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in flow matching models, particularly with reinforcement learning (RL), have significantly enhanced human preference alignment in few-step text-to-image generators. However, existing RL-based approaches for flow matching models typically rely on numerous denoising steps, while suffering from sparse and imprecise reward signals that often lead to suboptimal alignment. To address these limitations, we propose Temperature-Annealed Few-Step Sampling with Group Relative Policy Optimization (TAFS GRPO), a novel framework for training flow matching text-to-image models into efficient few-step generators well aligned with human preferences. Our method iteratively injects adaptive temporal noise onto the results of one-step samples. By repeatedly annealing the model’s sampled outputs, it introduces stochasticity into the sampling process while preserving the semantic integrity of each generated image. Moreover, its step-aware advantage integration mechanism builds on GRPO to avoid the need for a differentiable reward function and provides dense, step-specific rewards for stable policy optimization. Extensive experiments demonstrate that TAFS GRPO achieves strong performance in few-step text-to-image generation and significantly improves the alignment of generated images with human preferences. The code and models of this work will be available to facilitate further research.
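代码示意(编辑补充,非论文实现):GRPO 的核心是对同一提示词采样一组图像并做组内相对优势归一化。下面给出该归一化的最小草图;其中 step_weights 所示的"按步加权"仅是对论文 step-aware 机制的假设性示意。

```python
import numpy as np

def group_relative_advantages(rewards, step_weights=None, eps=1e-6):
    """GRPO-style advantages for a group of samples from one prompt.

    rewards: (G,) scalar rewards for G images sampled from the same prompt.
    step_weights: optional (T,) per-denoising-step weights (a step-aware
    extension assumed here, not the paper's exact mechanism).
    Returns (G,) or (G, T) advantages.
    """
    r = np.asarray(rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + eps)   # normalize within the group
    if step_weights is None:
        return adv
    return adv[:, None] * np.asarray(step_weights)[None, :]

print(group_relative_advantages([0.2, 0.8, 0.5, 0.9]))
```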
zh
[CV-107] Genus-0 Surface Parameterization using Spherical Beltrami Differentials
【速读】:该论文旨在解决球面参数化中任务目标(如特征点对齐)与保双射性及几何畸变控制之间的权衡问题。其核心解决方案是提出球面Beltrami微分(Spherical Beltrami Differential, SBD),这是一种基于两-chart表示的球面拟共形自映射方法,并建立其与球面同胚映射(up to conformal automorphisms)的对应关系;在此基础上,作者构建了名为BOOST的神经优化框架,通过在半球立体投影图上优化两个Beltrami场并引入显式的缝合约束以保证全局一致性,从而实现高保真度的任务驱动球面映射,同时有效控制几何畸变并保持双射性。
链接: https://arxiv.org/abs/2602.01589
作者: Zhehao Xu,Lok Ming Lui
机构: The Chinese University of Hong Kong (香港中文大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Algebraic Geometry (math.AG)
备注:
Abstract:Spherical surface parameterization is a fundamental tool in geometry processing and imaging science. For a genus-0 closed surface, many efficient algorithms can map the surface to the sphere; consequently, a broad class of task-driven genus-0 mapping problems can be reduced to constructing a high-quality spherical self-map. However, existing approaches often face a trade-off between satisfying task objectives (e.g., landmark or feature alignment), maintaining bijectivity, and controlling geometric distortion. We introduce the Spherical Beltrami Differential (SBD), a two-chart representation of quasiconformal self-maps of the sphere, and establish its correspondence with spherical homeomorphisms up to conformal automorphisms. Building on the Spectral Beltrami Network (SBN), we propose a neural optimization framework BOOST that optimizes two Beltrami fields on hemispherical stereographic charts and enforces global consistency through explicit seam-aware constraints. Experiments on large-deformation landmark matching and intensity-based spherical registration demonstrate the effectiveness of our proposed framework. We further apply the method to brain cortical surface registration, aligning sulcal landmarks and jointly matching cortical sulci depth maps, showing improved task fidelity with controlled distortion and robust bijective behavior.
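背景公式(编辑补充,均为标准结论):SBD 所依赖的两个基本对象是北极点球极投影与 Beltrami 方程,后者的系数 μ 刻画了映射偏离共形的程度:

```latex
% 北极点 N=(0,0,1) 处的球极投影(stereographic projection):
\pi_N(x,y,z) \;=\; \frac{x + i\,y}{1 - z}, \qquad (x,y,z)\in S^2\setminus\{N\},
% 图卡上的拟共形自映射 f 满足 Beltrami 方程:
\frac{\partial f}{\partial \bar z} \;=\; \mu(z)\,\frac{\partial f}{\partial z},
\qquad \|\mu\|_\infty < 1 .
```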
zh
[CV-108] HandMCM: Multi-modal Point Cloud-based Correspondence State Space Model for 3D Hand Pose Estimation AAAI
【速读】:该论文旨在解决3D手部姿态估计(3D hand pose estimation)中因手部自遮挡及与物体交互导致的遮挡问题所带来的挑战,从而提升人机交互(如增强现实)应用中的精度与鲁棒性。其解决方案的关键在于提出一种基于状态空间模型(State Space Model, SSM)的新方法——HandMCM,该方法通过引入局部信息注入/过滤模块和对应关系建模机制,使对应的Mamba架构能够有效学习在不同遮挡场景下关键点间的动态运动学拓扑结构;同时融合多模态图像特征以增强输入表示能力,显著提升了模型在严重遮挡条件下的手部姿态估计性能。
链接: https://arxiv.org/abs/2602.01586
作者: Wencan Cheng,Gim Hee Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI accepted
Abstract:3D hand pose estimation that involves accurate estimation of 3D human hand keypoint locations is crucial for many human-computer interaction applications such as augmented reality. However, this task poses significant challenges due to self-occlusion of the hands and occlusions caused by interactions with objects. In this paper, we propose HandMCM to address these challenges. Our HandMCM is a novel method based on the powerful state space model (Mamba). By incorporating modules for local information injection/filtering and correspondence modeling, the proposed correspondence Mamba effectively learns the highly dynamic kinematic topology of keypoints across various occlusion scenarios. Moreover, by integrating multi-modal image features, we enhance the robustness and representational capacity of the input, leading to more accurate hand pose estimation. Empirical evaluations on three benchmark datasets demonstrate that our model significantly outperforms current state-of-the-art methods, particularly in challenging scenarios involving severe occlusions. These results highlight the potential of our approach to advance the accuracy and reliability of 3D hand pose estimation in practical applications.
zh
[CV-109] Generative Visual Code Mobile World Models
【速读】:该论文旨在解决当前移动图形用户界面(Mobile GUI)世界模型(World Models, WMs)在训练和推理阶段性能受限的问题,特别是现有方法在文本建模与视觉保真度之间存在权衡:基于文本的WMs难以保持视觉细节,而依赖外部模型的视觉WMs则因复杂且低效的流水线导致性能瓶颈。解决方案的关键在于提出一种新的范式——通过可渲染代码生成实现视觉世界建模,即利用单个视觉-语言模型(Vision-Language Model, VLM)将下一GUI状态预测为可执行的网页代码(如HTML/CSS/JS),而非直接生成像素。这种机制融合了VLM在文本精确渲染上的语言先验与预训练于结构化网页代码带来的高保真视觉生成能力,从而在不牺牲精度的前提下显著提升效率和泛化性。
链接: https://arxiv.org/abs/2602.01576
作者: Woosung Koh,Sungjun Han,Segyu Lee,Se-Young Yun,Jamin Shin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Pre-print (technical report)
Abstract:Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at train- and inference-time. However, current approaches face a critical trade-off: text-based WMs sacrifice visual fidelity, while the inability of visual WMs in precise text rendering led to their reliance on slow, complex pipelines dependent on numerous external models. We propose a novel paradigm: visual world modeling via renderable code generation, where a single Vision-Language Model (VLM) predicts the next GUI state as executable web code that renders to pixels, rather than generating pixels directly. This combines the strengths of both approaches: VLMs retain their linguistic priors for precise text rendering while their pre-training on structured web code enables high-fidelity visual generation. We introduce gWorld (8B, 32B), the first open-weight visual mobile GUI WMs built on this paradigm, along with a data generation framework (gWorld) that automatically synthesizes code-based training data. In extensive evaluation across 4 in- and 2 out-of-distribution benchmarks, gWorld sets a new pareto frontier in accuracy versus model size, outperforming 8 frontier open-weight models over 50.25x larger. Further analyses show that (1) scaling training data via gWorld yields meaningful gains, (2) each component of our pipeline improves data quality, and (3) stronger world modeling improves downstream mobile GUI policy performance.
zh
[CV-110] SGHA-Attack: Semantic-Guided Hierarchical Alignment for Transferable Targeted Attacks on Vision-Language Models
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, VLMs)在面对基于迁移的对抗扰动时的脆弱性问题,即攻击者可通过在代理模型(surrogate model)上优化扰动来操控黑盒VLM的输出。现有针对性迁移攻击方法常因仅依赖单一参考样本并过度强调最终层对齐,导致对中间语义信息利用不足,从而降低在异构VLM间的迁移性能。其解决方案的关键在于提出SGHA-Attack框架,通过引入多目标参考样本构建语义引导的锚点池,并在多个深度层级上同步对齐全局与空间粒度的视觉特征以及跨模态的中间特征,实现从早期到晚期的多层次语义一致性约束,从而显著提升攻击的迁移能力与鲁棒性。
链接: https://arxiv.org/abs/2602.01574
作者: Haobo Wang,Weiqi Luo,Xiaojun Jia,Xiaochun Cao
机构: Sun Yat-sen University (中山大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large vision-language models (VLMs) are vulnerable to transfer-based adversarial perturbations, enabling attackers to optimize on surrogate models and manipulate black-box VLM outputs. Prior targeted transfer attacks often overfit surrogate-specific embedding space by relying on a single reference and emphasizing final-layer alignment, which underutilizes intermediate semantics and degrades transfer across heterogeneous VLMs. To address this, we propose SGHA-Attack, a Semantic-Guided Hierarchical Alignment framework that adopts multiple target references and enforces intermediate-layer consistency. Concretely, we generate a visually grounded reference pool by sampling a frozen text-to-image model conditioned on the target prompt, and then carefully select the Top-K most semantically relevant anchors under the surrogate to form a weighted mixture for stable optimization guidance. Building on these anchors, SGHA-Attack injects target semantics throughout the feature hierarchy by aligning intermediate visual representations at both global and spatial granularities across multiple depths, and by synchronizing intermediate visual and textual features in a shared latent subspace to provide early cross-modal supervision before the final projection. Extensive experiments on open-source and commercial black-box VLMs show that SGHA-Attack achieves stronger targeted transferability than prior methods and remains robust under preprocessing and purification defenses.
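代码示意(编辑补充,非论文实现):下面用 PyTorch 草绘"Top-K 锚点加权混合 + 多层级对齐损失"的骨架;softmax 加权与逐层余弦对齐均为对摘要描述的假设性实现,温度 tau、层数等均为示例。

```python
import torch
import torch.nn.functional as F

def anchor_mixture(anchor_feats, text_feat, k=4, tau=0.1):
    """Select Top-K reference embeddings most similar to the target prompt
    and fuse them with softmax weights (assumed weighting scheme)."""
    sims = F.cosine_similarity(anchor_feats, text_feat[None, :], dim=-1)
    vals, idx = sims.topk(k)
    w = F.softmax(vals / tau, dim=0)
    return (w[:, None] * anchor_feats[idx]).sum(0)

def hierarchical_align_loss(adv_feats, anchor_feats_per_layer):
    """Align adversarial features with the anchor at several depths."""
    return sum(1 - F.cosine_similarity(a.flatten(), t.flatten(), dim=0)
               for a, t in zip(adv_feats, anchor_feats_per_layer))

feats = torch.randn(16, 512)          # pool of reference embeddings
mix = anchor_mixture(feats, torch.randn(512))
layers_adv = [torch.randn(512) for _ in range(3)]
print(hierarchical_align_loss(layers_adv, [mix] * 3).item())
```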
zh
[CV-111] One-Step Diffusion for Perceptual Image Compression
【速读】:该论文旨在解决基于扩散模型(diffusion model)的图像压缩方法在实际部署中面临的推理延迟高和计算开销大的问题,其根源在于解码过程中需要大量去噪步骤。解决方案的关键在于提出一种仅需单步扩散过程的图像压缩方法,显著提升了推理速度;同时引入一个作用于紧凑特征表示而非原始像素的判别器,利用特征更优地捕捉高层纹理与结构细节,从而在保持重建图像感知质量的同时实现高效压缩。
链接: https://arxiv.org/abs/2602.01570
作者: Yiwen Jia,Hao Wei,Yanhui Zhou,Chenyang Ge
机构: Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, China; School of Information and Telecommunication, Xi’an Jiaotong University, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion-based image compression methods have achieved notable progress, delivering high perceptual quality at low bitrates. However, their practical deployment is hindered by significant inference latency and heavy computational overhead, primarily due to the large number of denoising steps required during decoding. To address this problem, we propose a diffusion-based image compression method that requires only a single-step diffusion process, significantly improving inference speed. To enhance the perceptual quality of reconstructed images, we introduce a discriminator that operates on compact feature representations instead of raw pixels, leveraging the fact that features better capture high-level texture and structural details. Experimental results show that our method delivers comparable compression performance while offering a 46 \times faster inference speed compared to recent diffusion-based approaches. The source code and models are available at this https URL.
zh
[CV-112] Multimodal UNcommonsense: From Odd to Ordinary and Ordinary to Odd
【速读】:该论文旨在解决多模态场景下常识推理(commonsense reasoning)的挑战,特别是模型在面对偏离典型视觉或语境预期的情境时的适应能力不足问题。其解决方案的关键在于提出一种基于检索的上下文学习框架(retrieval-based in-context learning, R-ICL),通过引入新颖的多模态集成检索器(Multimodal Ensemble Retriever, MER),从大规模预训练模型中迁移推理能力至小型模型,而无需额外训练;该方法能够在图像与文本对故意不一致的情况下仍能识别语义相关示例,从而显著提升模型在低频、非典型场景中的表现,平均性能优于基线方法8.3%。
链接: https://arxiv.org/abs/2602.01561
作者: Yejin Son,Saejin Kim,Dongjun Min,Younjae Yu
机构: Yonsei University (延世大学); Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 24 pages
Abstract:Commonsense reasoning in multimodal contexts remains a foundational challenge in artificial intelligence. We introduce Multimodal UNcommonsense (MUN), a benchmark designed to evaluate models’ ability to handle scenarios that deviate from typical visual or contextual expectations. MUN pairs visual scenes with surprising or unlikely outcomes described in natural language, prompting models to either rationalize seemingly odd images using everyday logic or uncover unexpected interpretations in ordinary scenes. To support this task, we propose a retrieval-based in-context learning (R-ICL) framework that transfers reasoning capabilities from larger models to smaller ones without additional training. Leveraging a novel Multimodal Ensemble Retriever (MER), our method identifies semantically relevant exemplars even when image and text pairs are deliberately discordant. Experiments show an average improvement of 8.3% over baseline ICL methods, highlighting the effectiveness of R-ICL in low-frequency, atypical settings. MUN opens new directions for evaluating and improving visual-language models’ robustness and adaptability in real-world, culturally diverse, and non-prototypical scenarios.
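代码示意(编辑补充,非论文实现):MER 的具体结构摘要未披露,下面给出一个"图像相似度与文本相似度加权融合后取 Top-K"的最小检索草图,alpha 为假设的融合系数。

```python
import numpy as np

def ensemble_retrieve(q_img, q_txt, bank_img, bank_txt, alpha=0.5, k=3):
    """Retrieve Top-K in-context exemplars by fusing modality similarities.

    q_img/q_txt: (D,) query embeddings; bank_*: (N, D) exemplar embeddings.
    alpha balances visual vs. textual similarity (assumed fusion rule).
    """
    def cos(q, bank):
        q = q / np.linalg.norm(q)
        b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
        return b @ q
    score = alpha * cos(q_img, bank_img) + (1 - alpha) * cos(q_txt, bank_txt)
    return np.argsort(-score)[:k]

rng = np.random.default_rng(1)
idx = ensemble_retrieve(rng.normal(size=64), rng.normal(size=64),
                        rng.normal(size=(100, 64)), rng.normal(size=(100, 64)))
print(idx)
```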
zh
[CV-113] Combined Flicker-banding and Moire Removal for Screen-Captured Images
【速读】:该论文旨在解决移动设备拍摄显示屏幕时,因摩尔纹(moiré patterns)与闪烁条带(flicker-banding)共同存在而导致的图像质量严重退化问题。现有方法通常仅针对单一退化类型设计,难以应对二者耦合的复杂场景。解决方案的关键在于提出首个系统性的联合去噪框架CLEAR,其核心创新包括:构建包含两类复合退化的大型数据集、引入基于ISP(Image Signal Processing)的闪烁模拟流水线以稳定训练并扩展退化分布,以及设计频域分解与重构模块结合轨迹对齐损失函数,从而有效建模和分离复合退化特征。实验表明,该方法在多指标下显著优于现有图像复原技术,适用于复杂的现实场景。
链接: https://arxiv.org/abs/2602.01559
作者: Libo Zhu,Zihan Zhou,Zhiyi Zhou,Yiyang Qu,Weihang Zhang,Keyu Shi,Yifan Fu,Yulun Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Capturing display screens with mobile devices has become increasingly common, yet the resulting images often suffer from severe degradations caused by the coexistence of moiré patterns and flicker-banding, leading to significant visual quality degradation. Due to the strong coupling of these two artifacts in real imaging processes, existing methods designed for single degradations fail to generalize to such compound scenarios. In this paper, we present the first systematic study on joint removal of moiré patterns and flicker-banding in screen-captured images, and propose a unified restoration framework, named CLEAR. To support this task, we construct a large-scale dataset containing both moiré patterns and flicker-banding, and introduce an ISP-based flicker simulation pipeline to stabilize model training and expand the degradation distribution. Furthermore, we design a frequency-domain decomposition and re-composition module together with a trajectory alignment loss to enhance the modeling of compound artifacts. Extensive experiments demonstrate that the proposed method consistently outperforms existing image restoration approaches across multiple evaluation metrics, validating its effectiveness in complex real-world scenarios.
zh
[CV-114] InfoTok: Regulating Information Flow for Capacity-Constrained Shared Visual Tokenization in Unified MLLM s
【速读】:该论文旨在解决统一多模态大语言模型(Unified Multimodal Large Language Models, MLLMs)中视觉分词器(visual tokenizer)设计缺乏明确信息保留准则的问题,即现有共享令牌(shared-token)架构多基于网络结构驱动,未能有效区分哪些视觉信息应被保留以同时支持图像理解与生成任务。解决方案的关键在于提出 InfoTok——一种基于信息瓶颈(Information Bottleneck, IB)原理的信息正则化视觉分词机制,通过控制从图像到共享令牌再到多模态输出的信息流,在压缩效率与任务相关性之间实现可解释的权衡,从而在不引入额外训练数据的前提下显著提升模型在理解和生成任务上的性能表现。
链接: https://arxiv.org/abs/2602.01554
作者: Lv Tang,Tianyi Zheng,Bo Li,Xingyu Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Unified multimodal large language models (MLLMs) integrate image understanding and generation in a single framework, with the visual tokenizer acting as the sole interface that maps visual inputs into tokens for downstream tasks. However, existing shared-token designs are mostly architecture-driven and lack an explicit criterion for what information tokens should preserve to support both understanding and generation. Therefore, we introduce a capacity-constrained perspective, highlighting that in shared-token unified MLLMs the visual tokenizer behaves as a compute-bounded learner, so the token budget should prioritize reusable structure over hard-to-exploit high-entropy variations and redundancy. Motivated by this perspective, we propose InfoTok, an information-regularized visual tokenization mechanism grounded in the Information Bottleneck (IB) principle. InfoTok formulates tokenization as controlling information flow from images to shared tokens to multimodal outputs, yielding a principled trade-off between compression and task relevance via mutual-information regularization. We integrate InfoTok into three representative unified MLLMs without introducing any additional training data. Experiments show consistent improvements on both understanding and generation, supporting information-regularized tokenization as a principled foundation for learning a shared token space in unified MLLMs.
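背景公式(编辑补充,标准形式):InfoTok 所依据的信息瓶颈(IB)拉格朗日目标可写为(X 为图像,T 为共享 token,Y 为多模态输出):

```latex
\min_{p(t \,\mid\, x)} \; I(X;T) \;-\; \beta\, I(T;Y),
```

其中 I(·;·) 为互信息,β 权衡压缩与任务相关性。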
zh
[CV-115] oward Cognitive Supersensing in Multimodal Large Language Model
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理复杂认知任务时的局限性,尤其是当视觉细节抽象且需要视觉记忆支持时,现有方法因主要依赖文本空间中的链式思维(Chain-of-Thought, CoT)推理而难以实现结构化、清晰的视觉认知推理。解决方案的关键在于提出一种名为“认知超感知”(Cognitive Supersensing)的新训练范式,其核心是引入一个潜在视觉意象预测(Latent Visual Imagery Prediction, LVIP)头,联合学习视觉认知潜在嵌入序列并将其与答案对齐,从而构建基于视觉的内部推理链;此外,还通过强化学习阶段优化基于该视觉潜在表示的文本推理路径,使模型具备类人视觉意象能力,显著提升在复杂视觉认知任务上的表现。
链接: https://arxiv.org/abs/2602.01541
作者: Boyi Li,Yifan Shen,Yuanzhe Liu,Yifan Xu,Jiateng Liu,Xinzhuo Li,Zhengyuan Li,Jingyuan Zhu,Yunhan Zhong,Fangzhou Lan,Jianguo Cao,James M. Rehg,Heng Ji,Ismini Lourentzou,Xu Cao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable success in open-vocabulary perceptual tasks, yet their ability to solve complex cognitive problems remains limited, especially when visual details are abstract and require visual memory. Current approaches primarily scale Chain-of-Thought (CoT) reasoning in the text space, even when language alone is insufficient for clear and structured reasoning, and largely neglect visual reasoning mechanisms analogous to the human visuospatial sketchpad and visual imagery. To mitigate this deficiency, we introduce Cognitive Supersensing, a novel training paradigm that endows MLLMs with human-like visual imagery capabilities by integrating a Latent Visual Imagery Prediction (LVIP) head that jointly learns sequences of visual cognitive latent embeddings and aligns them with the answer, thereby forming vision-based internal reasoning chains. We further introduce a reinforcement learning stage that optimizes text reasoning paths based on this grounded visual latent. To evaluate the cognitive capabilities of MLLMs, we present CogSense-Bench, a comprehensive visual question answering (VQA) benchmark assessing five cognitive dimensions. Extensive experiments demonstrate that MLLMs trained with Cognitive Supersensing significantly outperform state-of-the-art baselines on CogSense-Bench and exhibit superior generalization on out-of-domain mathematics and science VQA benchmarks, suggesting that internal visual imagery is potentially key to bridging the gap between perceptual recognition and cognitive understanding. We will open-source the CogSense-Bench and our model weights.
zh
[CV-116] FSCA-Net: Feature-Separated Cross-Attention Network for Robust Multi-Dataset Training
【速读】:该论文旨在解决跨域人群计数(crowd counting)中因领域差异导致模型性能下降的问题,尤其是直接联合训练多数据集时出现的负迁移现象(negative transfer),其根源在于共享特征与领域特定特征的纠缠。解决方案的关键在于提出FSCA-Net框架,通过显式地将特征表示解耦为领域不变(domain-invariant)和领域特定(domain-specific)两部分,并引入一种新颖的交叉注意力融合模块,自适应建模两类特征间的交互关系,从而实现有效知识迁移并保留数据集特异性判别能力;同时结合互信息优化目标,最大化领域不变特征的一致性、最小化领域特定特征的冗余性,促进互补的共享-私有表征学习,显著提升跨数据集泛化性能。
链接: https://arxiv.org/abs/2602.01540
作者: Yuehai Chen
机构: Xi’an Jiaotong University (西安交通大学); Institute for Low-Altitude Regulation (低空监管研究所); Institute of Artificial Intelligence and Robotics (人工智能与机器人研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Crowd counting plays a vital role in public safety, traffic regulation, and smart city management. However, despite the impressive progress achieved by CNN- and Transformer-based models, their performance often deteriorates when applied across diverse environments due to severe domain discrepancies. Direct joint training on multiple datasets, which intuitively should enhance generalization, instead results in negative transfer, as shared and domain-specific representations become entangled. To address this challenge, we propose the Feature Separation and Cross-Attention Network FSCA-Net, a unified framework that explicitly disentangles feature representations into domain-invariant and domain-specific components. A novel cross-attention fusion module adaptively models interactions between these components, ensuring effective knowledge transfer while preserving dataset-specific discriminability. Furthermore, a mutual information optimization objective is introduced to maximize consistency among domain-invariant features and minimize redundancy among domain-specific ones, promoting complementary shared-private representations. Extensive experiments on multiple crowd counting benchmarks demonstrate that FSCA-Net effectively mitigates negative transfer and achieves state-of-the-art cross-dataset generalization, providing a robust and scalable solution for real-world crowd analysis.
zh
[CV-117] Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars
【速读】:该论文旨在解决生成式视频中 grounded human-object interaction (GHOI) 的挑战,即让说话虚拟人(talking avatar)不仅能生成简单的人体动作,还能根据文本指令与周围物体进行语义对齐的交互行为。这一问题的核心难点在于环境感知能力不足以及控制精度与视频质量之间的权衡困境(control-quality dilemma)。解决方案的关键是提出一种双流框架 InteractAvatar,其通过分离环境感知与规划模块(Perception and Interaction Module, PIM)和视频生成模块(Audio-Interaction Aware Generation Module, AIM),实现运动与视频的并行协同生成;其中PIM利用目标检测增强环境理解以生成语义一致的交互动作,而AIM则基于音频和交互信息合成高质量、具象化的说话虚拟人视频,同时借助专用的 motion-to-video aligner 优化结构一致性,从而有效缓解控制与质量之间的矛盾。
链接: https://arxiv.org/abs/2602.01538
作者: Youliang Zhang,Zhengguang Zhou,Zhentao Yu,Ziyao Huang,Teng Hu,Sen Liang,Guozhen Zhang,Ziqiao Peng,Shunkai Li,Yi Chen,Zixiang Zhou,Yuan Zhou,Qinglin Lu,Xiu Li
机构: Tsinghua University (清华大学); Tencent HY (腾讯HY)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Generating talking avatars is a fundamental task in video generation. Although existing methods can generate full-body talking avatars with simple human motion, extending this task to grounded human-object interaction (GHOI) remains an open challenge, requiring the avatar to perform text-aligned interactions with surrounding objects. This challenge stems from the need for environmental perception and the control-quality dilemma in GHOI generation. To address this, we propose a novel dual-stream framework, InteractAvatar, which decouples perception and planning from video synthesis for grounded human-object interaction. Leveraging detection to enhance environmental perception, we introduce a Perception and Interaction Module (PIM) to generate text-aligned interaction motions. Additionally, an Audio-Interaction Aware Generation Module (AIM) is proposed to synthesize vivid talking avatars performing object interactions. With a specially designed motion-to-video aligner, PIM and AIM share a similar network structure and enable parallel co-generation of motions and plausible videos, effectively mitigating the control-quality dilemma. Finally, we establish a benchmark, GroundedInter, for evaluating GHOI video generation. Extensive experiments and comparisons demonstrate the effectiveness of our method in generating grounded human-object interactions for talking avatars. Project page: this https URL
zh
[CV-118] UniDWM: Towards a Unified Driving World Model via Multifaceted Representation Learning
【速读】:该论文旨在解决复杂驾驶环境中实现可靠且高效的规划问题,核心挑战在于如何构建一个能够同时推理场景几何结构、外观特征和动态变化的统一模型。解决方案的关键在于提出UniDWM(Unified Driving World Model),其通过多模态表征学习构建一个结构与动态感知的潜在世界表示空间,该空间作为物理合理的状态空间,支持感知、预测与规划任务的一致性推理;具体而言,联合重建路径学习恢复场景的几何与视觉纹理信息,而协作生成框架则利用条件扩散Transformer在潜在空间中预测未来世界演化,从而实现对驾驶环境的多维理解与建模。
链接: https://arxiv.org/abs/2602.01536
作者: Shuai Liu,Siheng Ren,Xiaoyao Zhu,Quanmin Liang,Zefeng Li,Qiang Li,Xin Hu,Kai Huang
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Achieving reliable and efficient planning in complex driving environments requires a model that can reason over the scene’s geometry, appearance, and dynamics. We present UniDWM, a unified driving world model that advances autonomous driving through multifaceted representation learning. UniDWM constructs a structure- and dynamic-aware latent world representation that serves as a physically grounded state space, enabling consistent reasoning across perception, prediction, and planning. Specifically, a joint reconstruction pathway learns to recover the scene’s structure, including geometry and visual texture, while a collaborative generation framework leverages a conditional diffusion transformer to forecast future world evolution within the latent space. Furthermore, we show that UniDWM can be viewed as a variant of the VAE, which provides theoretical guidance for multifaceted representation learning. Extensive experiments demonstrate the effectiveness of UniDWM in trajectory planning, 4D reconstruction and generation, highlighting the potential of multifaceted world representations as a foundation for unified driving intelligence. The code will be publicly available at this https URL.
zh
[CV-119] Rotation-free Online Handwritten Character Recognition Using Linear Recurrent Units
【速读】:该论文旨在解决在线手写字符识别中因旋转变形导致的识别准确率下降问题,核心挑战在于如何提取旋转不变特征以提升模型鲁棒性。解决方案的关键在于提出一种基于滑动窗口路径签名(Sliding Window Path Signature, SW-PS)与轻量级线性循环单元(Linear Recurrent Unit, LRU)相结合的框架:SW-PS用于捕捉字符局部结构特征并具备旋转不变性,而LRU则通过融合循环神经网络(RNN)的快速增量处理能力和状态空间模型(SSM)的高效并行训练优势,可靠建模动态笔画特性,从而在高随机旋转角度(±180°)下仍保持优异识别性能。
链接: https://arxiv.org/abs/2602.01533
作者: Zhe Ling,Sicheng Yu,Danyu Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Online handwritten character recognition leverages stroke order and dynamic features, which generally provide higher accuracy and robustness compared with offline recognition. However, in practical applications, rotational deformations can disrupt the spatial layout of strokes, substantially reducing recognition accuracy. Extracting rotation-invariant features therefore remains a challenging open problem. In this work, we employ the Sliding Window Path Signature (SW-PS) to capture local structural features of characters, and introduce the lightweight Linear Recurrent Unit (LRU) as the classifier. The LRU combines the fast incremental processing capability of recurrent neural networks (RNN) with the efficient parallel training of state space models (SSM), while reliably modelling dynamic stroke characteristics. We conducted recognition experiments with random rotation angles up to \pm 180^\circ on three subsets of the CASIA-OLHWDB1.1 dataset: digits, English upper letters, and Chinese radicals. The accuracies achieved after ensemble learning were 99.62% , 96.67% , and 94.33% , respectively. Experimental results demonstrate that the proposed SW-PS+LRU framework consistently surpasses competing models in both convergence speed and test accuracy.
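代码示意(编辑补充):LRU 的对角复数递推(参照 Orvieto et al., 2023 的参数化,保证 |λ|<1 从而稳定)可用 numpy 最小化实现如下;输入输出维度均为示例,且省略了论文中的 SW-PS 特征提取与集成学习。

```python
import numpy as np

def lru_forward(x, nu, theta, B, C):
    """Simplified Linear Recurrent Unit (after Orvieto et al., 2023).

    x: (L, D) input sequence (e.g., sliding-window path-signature features).
    nu, theta: (N,) log-magnitude/phase params of the diagonal recurrence.
    B: (N, D) complex input matrix; C: (D_out, N) complex readout.
    """
    lam = np.exp(-np.exp(nu) + 1j * theta)        # |lam| < 1 => stable
    gamma = np.sqrt(1 - np.abs(lam) ** 2)         # input normalization
    h = np.zeros(lam.shape, dtype=complex)
    ys = []
    for xt in x:                                   # parallelizable as a scan
        h = lam * h + gamma * (B @ xt)
        ys.append((C @ h).real)
    return np.stack(ys)

rng = np.random.default_rng(0)
N, D = 8, 4
y = lru_forward(rng.normal(size=(20, D)), rng.normal(size=N),
                rng.normal(size=N),
                rng.normal(size=(N, D)) + 1j * rng.normal(size=(N, D)),
                rng.normal(size=(3, N)) + 1j * rng.normal(size=(3, N)))
print(y.shape)  # (20, 3)
```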
zh
[CV-120] Preserving Localized Patch Semantics in VLMs
【速读】:该论文旨在解决自回归视觉语言模型(VLM)中Logit Lens可视化方法因图像 token 与文本 token 在自注意力层混杂而导致的视觉信息局部性丧失问题,从而使得 Logit Lens 无法有效用于解释模型决策。解决方案的关键在于提出一种无需架构修改或大规模训练的辅助损失函数——Logit Lens Loss (LLL),其通过约束图像 token 嵌入与其对应图像区域语义概念(如“猫”)之间的对齐程度,防止视觉信息在跨模态交互中被稀释,从而保留图像 token 的局部视觉表征能力。实验表明,LLL 不仅使 Logit Lens 能生成有意义的对象置信度图,还能提升分割等视觉主导任务的性能。
链接: https://arxiv.org/abs/2602.01530
作者: Parsa Esmaeilkhani,Longin Jan Latecki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Logit Lens has been proposed for visualizing tokens that contribute most to LLM answers. Recently, Logit Lens was also shown to be applicable in autoregressive Vision-Language Models (VLMs), where it illustrates the conceptual content of image tokens in the form of heatmaps, e.g., which image tokens are likely to depict the concept of cat in a given image. However, the visual content of image tokens often gets diffused to language tokens, and consequently, the locality of visual information gets mostly destroyed, which renders Logit Lens visualization unusable for explainability. To address this issue, we introduce a complementary loss to next-token prediction (NTP) to prevent the visual tokens from losing the visual representation inherited from corresponding image patches. The proposed Logit Lens Loss (LLL) is designed to make visual token embeddings more semantically aligned with the textual concepts that describe their image regions (e.g., patches containing a cat with the word “cat”), without requiring any architectural modification or large-scale training. This way, LLL constrains the mixing of image and text tokens in the self-attention layers in order to prevent image tokens from losing their localized visual information. As our experiments show, LLL not only makes Logit Lens practically relevant by producing meaningful object confidence maps in images, but also improves performance on vision-centric tasks like segmentation without attaching any special heads.
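代码示意(编辑补充,非论文实现):LLL 的核心思想是让图像 token 的隐状态经 LM 的最终 LayerNorm 与反嵌入矩阵(即 Logit Lens 读出)后,仍能解码出其图块对应的概念词。下面给出该辅助损失的示意;具体作用层与概念标签的获取方式为假设。

```python
import torch
import torch.nn.functional as F

def logit_lens_loss(patch_hidden, concept_token_ids, ln, unembed):
    """Auxiliary loss keeping image tokens decodable to their concepts.

    patch_hidden: (P, d) hidden states of image tokens at some layer.
    concept_token_ids: (P,) vocabulary id of each patch's concept (e.g. "cat").
    ln/unembed: the LM's final LayerNorm and unembedding matrix (V, d).
    Illustrative shapes; the paper's exact layer choice is not specified here.
    """
    logits = F.linear(ln(patch_hidden), unembed)   # (P, V) logit-lens readout
    return F.cross_entropy(logits, concept_token_ids)

P, d, V = 6, 32, 100
loss = logit_lens_loss(torch.randn(P, d), torch.randint(0, V, (P,)),
                       torch.nn.LayerNorm(d), torch.randn(V, d))
print(loss.item())
```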
zh
[CV-121] oward a Machine Bertin: Why Visualization Needs Design Principles for Machine Cognition
【速读】:该论文旨在解决可视化设计知识在人类与机器之间迁移失效的问题,即传统基于人类视觉感知的可视化设计原则(如编码有效性排序、颜色模型、预注意处理规则等)无法直接适用于视觉语言模型(VLMs)等机器认知系统。其关键解决方案在于提出“面向机器的可视化”(machine-oriented visualization)这一全新研究范式,强调应将机器认知作为独立的研究对象,而非简单地绕过视觉信息转换为结构化数据。论文主张建立针对机器感知特性的实证基础,以发展一套区别于人类中心主义的可视化设计理论体系,从而推动形成服务于机器认知的“机器版贝尔廷”(a “machine Bertin”),弥补当前领域在机器导向可视化设计方面的理论空白。
链接: https://arxiv.org/abs/2602.01527
作者: Brian Keith-Norambuena
机构: Universidad Católica del Norte (北方天主教大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint submitted to IEEE TVCG on February 2026
Abstract:Visualization’s design knowledge-effectiveness rankings, encoding guidelines, color models, preattentive processing rules – derives from six decades of psychophysical studies of human vision. Yet vision-language models (VLMs) increasingly consume chart images in automated analysis pipelines, and a growing body of benchmark evidence indicates that this human-centered knowledge base does not straightforwardly transfer to machine audiences. Machines exhibit different encoding performance patterns, process images through patch-based tokenization rather than holistic perception, and fail on design patterns that pose no difficulty for humans-while occasionally succeeding where humans struggle. Current approaches address this gap primarily by bypassing vision entirely, converting charts to data tables or structured text. We argue that this response forecloses a more fundamental question: what visual representations would actually serve machine cognition well? This paper makes the case that the visualization field needs to investigate machine-oriented visual design as a distinct research problem. We synthesize evidence from VLM benchmarks, visual reasoning research, and visualization literacy studies to show that the human-machine perceptual divergence is qualitative, not merely quantitative, and critically examine the prevailing bypassing approach. We propose a conceptual distinction between human-oriented and machine-oriented visualization-not as an engineering architecture but as a recognition that different audiences may require fundamentally different design foundations-and outline a research agenda for developing the empirical foundations the field currently lacks: the beginnings of a “machine Bertin” to complement the human-centered knowledge the field already possesses.
zh
[CV-122] When Is Rank-1 Enough? Geometry-Guided Initialization for Parameter-Efficient Fine-Tuning
【速读】:该论文旨在解决极低秩参数高效微调(Parameter-efficient fine-tuning, PEFT)中,尤其是秩-1低秩适应(LoRA)方法在训练过程中不稳定的问题。研究表明,这种不稳定性并非单纯由模型容量不足引起,而是由于优化方向对初始设置高度敏感:预训练视觉与文本特征在高维空间中形成各向异性分布区域,产生一个主导的“模态间隙”(modality-gap)方向,该方向在秩-1约束下会显著偏移早期梯度,导致训练崩溃。解决方案的关键在于提出一种几何感知初始化方法 Gap-Init,其通过一个小规模校准集估计模态间隙向量,并将秩-1 LoRA的更新方向对该齐,同时保持初始LoRA更新为零,从而显著提升训练稳定性并达到或超越高秩基线性能。
链接: https://arxiv.org/abs/2602.01522
作者: Haoran Zhao,Soyeon Caren Han,Eduard Hovy
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Parameter-efficient fine-tuning (PEFT) is a standard way to adapt multimodal large language models, yet extremely low-rank settings – especially rank-1 LoRA – are often unstable. We show that this instability is not solely due to limited capacity: in the rank-1 regime, optimization is highly sensitive to the update direction. Concretely, pretrained vision and text features form mismatched anisotropic regions, yielding a dominant “gap” direction that acts like a translation component and disproportionately steers early gradients under rank-1 constraints. Analyzing pretrained representations, we identify a modality-gap axis that dominates early gradient flow, while a random rank-1 initialization is unlikely to align with it, leading to weak gradients and training collapse. We propose Gap-Init, a geometry-aware initialization that aligns the rank-1 LoRA direction with an estimated modality-gap vector from a small calibration set, while keeping the initial LoRA update zero. Across multiple vision-language tasks and backbones, Gap-Init consistently stabilizes rank-1 training and can match or outperform strong rank-8 baselines. Our results suggest that at the extreme low-rank limit, initial alignment can matter as much as rank itself.
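代码示意(编辑补充,非论文实现):按摘要描述,Gap-Init 用小规模校准集估计模态间隙向量,并令 rank-1 LoRA 的方向与之对齐、初始更新为零(B 取间隙方向、A 置零,故 ΔW = BA = 0)。最小草图如下,特征维度为假设。

```python
import torch

def gap_init_rank1(img_feats, txt_feats, d_in):
    """Gap-Init for rank-1 LoRA (sketch of the described initialization).

    img_feats/txt_feats: (n, d_out) calibration features of each modality.
    Returns LoRA factors B (d_out, 1), A (1, d_in) with Delta W = B @ A = 0
    at init, while B points along the estimated modality-gap axis.
    """
    gap = img_feats.mean(0) - txt_feats.mean(0)     # modality-gap vector
    gap = gap / gap.norm()
    B = gap[:, None]                                # rank-1 direction
    A = torch.zeros(1, d_in)                        # keeps initial update zero
    return B, A

img = torch.randn(256, 768) + 0.5                  # toy calibration set
txt = torch.randn(256, 768) - 0.5
B, A = gap_init_rank1(img, txt, d_in=768)
print(B.shape, A.shape, (B @ A).abs().max().item())  # zero initial Delta W
```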
zh
[CV-123] reeLoc: 6-DoF LiDAR Global Localization in Forests via Inter-Tree Geometric Matching ICRA2026
【速读】:该论文旨在解决森林环境中由于GPS信号弱化及LiDAR数据重复性高、遮挡严重和结构复杂导致的传统基于城市场景的定位方法失效的问题(即鲁棒性不足)。其核心解决方案是提出TreeLoc,一个基于LiDAR的全局定位框架,关键在于利用树干及其胸径(DBH)作为语义特征,通过树分布直方图(TDH)实现粗匹配,再结合二维三角形描述子进行精细匹配,并采用两步几何验证完成6-DoF位姿估计,从而在复杂林区实现高精度定位。
链接: https://arxiv.org/abs/2602.01501
作者: Minwoo Jung,Nived Chebrolu,Lucas Carvalho de Lima,Haedam Oh,Maurice Fallon,Ayoung Kim
机构: SNU(首尔国立大学); Oxford Robotics Institute, University of Oxford(牛津大学机器人研究所)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: An 8-page paper with 7 tables and 8 figures, accepted to ICRA 2026
Abstract:Reliable localization is crucial for navigation in forests, where GPS is often degraded and LiDAR measurements are repetitive, occluded, and structurally complex. These conditions weaken the assumptions of traditional urban-centric localization methods, which assume that consistent features arise from unique structural patterns, necessitating forest-centric solutions to achieve robustness in these environments. To address these challenges, we propose TreeLoc, a LiDAR-based global localization framework for forests that handles place recognition and 6-DoF pose estimation. We represent scenes using tree stems and their Diameter at Breast Height (DBH), which are aligned to a common reference frame via their axes and summarized using the tree distribution histogram (TDH) for coarse matching, followed by fine matching with a 2D triangle descriptor. Finally, pose estimation is achieved through a two-step geometric verification. On diverse forest benchmarks, TreeLoc outperforms baselines, achieving precise localization. Ablation studies validate the contribution of each component. We also propose applications for long-term forest management using descriptors from a compact global tree database. TreeLoc is open-sourced for the robotics community at this https URL.
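代码示意(编辑补充,非论文实现):"2D 三角形描述子"的一种直观做法是对树干三元组取排序后的边长,得到平移/旋转不变的签名;以下实现仅为示意,TreeLoc 的实际描述子(含 DBH 信息)可能不同。

```python
import numpy as np
from itertools import combinations

def triangle_descriptors(stems, max_side=25.0):
    """Rotation/translation-invariant triangle signatures over tree stems.

    stems: (N, 2) stem centers in the local frame. Each nearby triplet is
    encoded by its sorted side lengths (an illustrative variant).
    """
    descs = set()
    for i, j, k in combinations(range(len(stems)), 3):
        a = np.linalg.norm(stems[i] - stems[j])
        b = np.linalg.norm(stems[j] - stems[k])
        c = np.linalg.norm(stems[k] - stems[i])
        if max(a, b, c) <= max_side:        # keep compact local triangles
            descs.add(tuple(np.round(sorted((a, b, c)), 1)))
    return descs

rng = np.random.default_rng(0)
query, ref = rng.uniform(0, 50, (30, 2)), rng.uniform(0, 50, (30, 2))
overlap = len(triangle_descriptors(query) & triangle_descriptors(ref))
print("matched triangles:", overlap)
```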
zh
[CV-124] Understanding vision transformer robustness through the lens of out-of-distribution detection
【速读】:该论文旨在解决量化(quantization)对视觉Transformer模型在分布内(in-distribution, ID)和分布外(out-of-distribution, OOD)场景下性能影响的不均衡问题,特别是低比特量化(如4-bit)带来的性能下降风险。其关键解决方案在于通过分析不同预训练策略(ImageNet-1k vs. ImageNet-22k)下小规模视觉Transformer(DeiT、DeiT3、ViT)在常见OOD数据集上的行为差异,揭示了预训练数据规模对量化鲁棒性的潜在负面影响:相较于仅在ImageNet-1k上预训练的模型,使用大规模ImageNet-22k预训练的模型在4-bit量化后,其OOD检测性能(以AUPR-out衡量)平均下降幅度分别高达15.0%和19.2%,显著高于后者(9.5%和12.0%),表明数据增强可能比扩大预训练数据规模更有利于提升低比特量化下的模型鲁棒性。
链接: https://arxiv.org/abs/2602.01459
作者: Joey Kuang,Alexander Wong
机构: University of Waterloo (滑铁卢大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to JCVIS 2025
Abstract:Vision transformers have shown remarkable performance in vision tasks, but enabling them for accessible and real-time use is still challenging. Quantization reduces memory and inference costs at the risk of performance loss. Strides have been made to mitigate low precision issues mainly by understanding in-distribution (ID) task behaviour, but the attention mechanism may provide insight on quantization attributes by exploring out-of-distribution (OOD) situations. We investigate the behaviour of quantized small-variant popular vision transformers (DeiT, DeiT3, and ViT) on common OOD datasets. ID analyses show the initial instabilities of 4-bit models, particularly of those trained on the larger ImageNet-22k, as DeiT3, the strongest FP32 model, sharply drops 17% from quantization error to become one of the weakest 4-bit models. While ViT shows reasonable quantization robustness for ID calibration, OOD detection reveals more: ViT and DeiT3 pretrained on ImageNet-22k respectively experienced a 15.0% and 19.2% average quantization delta in AUPR-out between full precision and 4-bit, while their ImageNet-1k-only counterparts experienced a 9.5% and 12.0% delta. Overall, our results suggest pretraining on large scale datasets may hinder low-bit quantization robustness in OOD detection and that data augmentation may be a more beneficial option.
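代码示意(编辑补充):该研究关注的是均匀对称量化带来的往返误差。下面给出逐张量(per-tensor)的 4-bit 假量化最小草图;实际部署通常使用逐通道缩放与校准,此处从简。

```python
import numpy as np

def quantize_dequantize(w, bits=4):
    """Uniform symmetric fake-quantization of a weight tensor.

    Maps w to integers in [-2^(b-1)+1, 2^(b-1)-1] and back; the round-trip
    error is the quantization noise whose downstream effect the OOD study
    measures.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax          # per-tensor scale (simplest case)
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

w = np.random.default_rng(0).normal(size=(64, 64)).astype(np.float32)
for b in (8, 4):
    err = np.abs(w - quantize_dequantize(w, b)).mean()
    print(f"{b}-bit mean abs error: {err:.4f}")
```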
zh
[CV-125] Rectified LpJEPA: Joint-Embedding Predictive Architectures with Sparse and Maximum-Entropy Representations
【速读】:该论文旨在解决现有联合嵌入预测架构(Joint-Embedding Predictive Architectures, JEPA)在表示学习中因采用各向同性高斯分布正则化而导致的稀疏性不足问题,即模型倾向于生成稠密表示,无法有效捕捉高效表示中常见的稀疏特性。解决方案的关键在于提出一种新的修正分布匹配正则化(Rectified Distribution Matching Regularization, RDMReg),其基于切片两样本分布匹配损失,将表示对齐至修正广义高斯分布(Rectified Generalized Gaussian, RGG);RGG通过截断操作显式控制期望 ℓ₀ 范数以实现稀疏性,同时在期望 ℓₚ 范数约束下保持最大熵性质,从而在保证任务相关信息保留的前提下显著提升表示的稀疏性。由此构建的 Rectified LpJEPA 模型在图像分类下游任务中展现出优异的稀疏性与性能权衡。
链接: https://arxiv.org/abs/2602.01456
作者: Yilun Kuang,Yash Dagade,Tim G. J. Rudner,Randall Balestriero,Yann LeCun
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Joint-Embedding Predictive Architectures (JEPA) learn view-invariant representations and admit projection-based distribution matching for collapse prevention. Existing approaches regularize representations towards isotropic Gaussian distributions, but inherently favor dense representations and fail to capture the key property of sparsity observed in efficient representations. We introduce Rectified Distribution Matching Regularization (RDMReg), a sliced two-sample distribution-matching loss that aligns representations to a Rectified Generalized Gaussian (RGG) distribution. RGG enables explicit control over expected \ell_0 norm through rectification, while preserving maximum-entropy up to rescaling under expected \ell_p norm constraints. Equipping JEPAs with RDMReg yields Rectified LpJEPA, which strictly generalizes prior Gaussian-based JEPAs. Empirically, Rectified LpJEPA learns sparse, non-negative representations with favorable sparsity-performance trade-offs and competitive downstream performance on image classification benchmarks, demonstrating that RDMReg effectively enforces sparsity while preserving task-relevant information.
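代码示意(编辑补充,非论文实现):RDMReg 的两个要素——修正广义高斯(RGG)采样与切片式两样本匹配——可按如下方式草绘:GG(p) 采样用 |X| = G^(1/p)(G ~ Gamma(1/p, 1))加随机符号,修正取 max(0, ·);切片损失对随机方向上的一维投影排序后求均方差。具体损失形式与参数均为假设。

```python
import numpy as np

def sample_rgg(n, d, p=1.0, rng=None):
    """Rectified generalized Gaussian: GG(p) samples passed through max(0, .)."""
    rng = rng or np.random.default_rng()
    mag = rng.gamma(1.0 / p, 1.0, size=(n, d)) ** (1.0 / p)
    x = mag * rng.choice([-1.0, 1.0], size=(n, d))
    return np.maximum(x, 0.0)               # rectification => exact zeros

def sliced_match_loss(z, target, n_proj=64, rng=None):
    """Sliced two-sample loss: compare sorted 1D projections (assumed form)."""
    rng = rng or np.random.default_rng()
    dirs = rng.normal(size=(z.shape[1], n_proj))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    pz, pt = np.sort(z @ dirs, axis=0), np.sort(target @ dirs, axis=0)
    return np.mean((pz - pt) ** 2)

rng = np.random.default_rng(0)
z = rng.normal(size=(512, 32))               # batch of embeddings
loss = sliced_match_loss(z, sample_rgg(512, 32, p=1.0, rng=rng), rng=rng)
print(loss)
```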
zh
[CV-126] Cross-Paradigm Evaluation of Gaze-Based Semantic Object Identification for Intelligent Vehicles
【速读】:该论文旨在解决驾驶员视觉注意力分布的识别问题,即通过分析驾驶场景中驾驶员的注视行为(gaze behavior),为下一代高级驾驶辅助系统(Advanced Driver-Assistance Systems, ADAS)的设计和道路安全提升提供支持。其核心解决方案是将这一问题建模为从车辆前视摄像头捕捉的路侧图像中进行语义识别的任务,并采用三种不同的视觉方法进行探索:直接目标检测(YOLOv13)、分割辅助分类(SAM2与EfficientNetV2或YOLOv13组合)以及基于查询的视觉语言模型(Vision-Language Models, VLMs,包括Qwen2.5-VL-7b和Qwen2.5-VL-32b)。关键发现在于,直接目标检测(YOLOv13)和大型VLM(Qwen2.5-VL-32b)显著优于其他方法,在宏平均F1分数上均超过0.84;其中,Qwen2.5-VL-32b在识别小尺寸且安全关键的目标(如交通信号灯)方面表现出更强的鲁棒性,尤其在夜间等不利条件下表现突出;而分割辅助范式因存在“局部与整体”语义鸿沟导致召回率大幅下降。研究揭示了传统检测器实时效率与大VLM提供的丰富上下文理解能力之间的根本权衡,为未来以人为中心的智能驾驶员监控系统设计提供了重要依据。
链接: https://arxiv.org/abs/2602.01452
作者: Penghao Deng,Jidong J. Yang,Jiachen Bian
机构: University of Georgia (佐治亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 21 pages, 15 figures, 3 tables
Abstract:Understanding where drivers direct their visual attention during driving, as characterized by gaze behavior, is critical for developing next-generation advanced driver-assistance systems and improving road safety. This paper tackles this challenge as a semantic identification task from the road scenes captured by a vehicle’s front-view camera. Specifically, the collocation of gaze points with object semantics is investigated using three distinct vision-based approaches: direct object detection (YOLOv13), segmentation-assisted classification (SAM2 paired with EfficientNetV2 versus YOLOv13), and query-based Vision-Language Models, VLMs (Qwen2.5-VL-7b versus Qwen2.5-VL-32b). The results demonstrate that direct object detection (YOLOv13) and Qwen2.5-VL-32b significantly outperform the other approaches, achieving Macro F1-Scores over 0.84. The large VLM (Qwen2.5-VL-32b), in particular, exhibited superior robustness and performance for identifying small, safety-critical objects such as traffic lights, especially in adverse nighttime conditions. Conversely, the segmentation-assisted paradigm suffers from a “part-versus-whole” semantic gap that leads to substantial recall failures. The results reveal a fundamental trade-off between the real-time efficiency of traditional detectors and the richer contextual understanding and robustness offered by large VLMs. These findings provide critical insights and practical guidance for the design of future human-aware intelligent driver monitoring systems.
zh
[CV-127] BioTamperNet: Affinity-Guided State-Space Model Detecting Tampered Biomedical Images
【速读】:该论文旨在解决生物医学图像中复制粘贴篡改(copy-paste tampering)区域检测难题,现有基于自然图像训练的取证模型在生物医学数据上表现不佳,因生物医学图像中的细微篡改可能严重影响实验有效性。解决方案的关键在于提出BioTamperNet框架,其核心创新是引入亲和引导的自注意力模块(affinity-guided self-attention)以捕捉图像内部相似性,并结合亲和引导的交叉注意力模块(affinity-guided cross-attention)建模跨图像对应关系;同时采用轻量级状态空间模型(State Space Model, SSM)启发的线性注意力机制,实现高效且细粒度的篡改定位,支持端到端训练并同步识别篡改区域及其源区域。
链接: https://arxiv.org/abs/2602.01435
作者: Soumyaroop Nandi,Prem Natarajan
机构: USC Information Sciences Institute (南加州大学信息科学研究所); USC Thomas Lord Department of Computer Science (南加州大学托马斯·洛德计算机科学系); Capital One (资本一号)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We propose BioTamperNet, a novel framework for detecting duplicated regions in tampered biomedical images, leveraging affinity-guided attention inspired by State Space Model (SSM) approximations. Existing forensic models, primarily trained on natural images, often underperform on biomedical data where subtle manipulations can compromise experimental validity. To address this, BioTamperNet introduces an affinity-guided self-attention module to capture intra-image similarities and an affinity-guided cross-attention module to model cross-image correspondences. Our design integrates lightweight SSM-inspired linear attention mechanisms to enable efficient, fine-grained localization. Trained end-to-end, BioTamperNet simultaneously identifies tampered regions and their source counterparts. Extensive experiments on the benchmark bio-forensic datasets demonstrate significant improvements over competitive baselines in accurately detecting duplicated regions. Code - this https URL
zh
[CV-128] Where to Attend: A Principled Vision-Centric Position Encoding with Parabolas
【速读】:该论文旨在解决视觉模态(如图像、点云、视频或事件相机流)在基于注意力机制的架构中位置编码不足的问题。现有方法多将语言模型中的1D序列位置编码直接扩展至视觉的n维结构,但未能充分考虑视觉数据的独特特性。其解决方案的关键在于提出一种基于抛物线的位置编码方法——Parabolic Position Encoding (PaPE),该方法从先验知识中提炼出五大原则:平移不变性、旋转不变性(PaPE-RI)、距离衰减、方向性和上下文感知性,并据此设计出能够更准确刻画视觉特征空间关系的位置编码机制。实验表明,PaPE或其旋转不变版本在8个不同数据集上7次取得最优性能,且在ImageNet-1K上的外推能力显著优于现有方法,绝对提升达10.5%。
链接: https://arxiv.org/abs/2602.01418
作者: Christoffer Koo Øhrstrøm,Rafael I. Cabral Muchacho,Yifei Dong,Filippos Moumtzidellis,Ronja Güldenring,Florian T. Pokorny,Lazaros Nalpantidis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens-such as images, point clouds, videos, or event camera streams-our objective is to encode their positions while accounting for the characteristics of vision modalities. Prior works have largely extended position encodings from 1D-sequences in language to nD-structures in vision, but only with partial account of vision characteristics. We address this gap by designing PaPE from principles distilled from prior work: translation invariance, rotation invariance (PaPE-RI), distance decay, directionality, and context awareness. We evaluate PaPE on 8 datasets that span 4 modalities. We find that either PaPE or PaPE-RI achieves the top performance on 7 out of 8 datasets. Extrapolation experiments on ImageNet-1K show that PaPE extrapolates remarkably well, improving in absolute terms by up to 10.5% over the next-best position encoding. Code is available at this https URL.
zh
[CV-129] Stronger Semantic Encoders Can Harm Relighting Performance: Probing Visual Priors via Augmented Latent Intrinsics
【速读】:该论文旨在解决图像到图像的光照重渲染(image-to-image relighting)任务中,现有方法在处理金属、玻璃等复杂材质时表现不佳的问题。其核心挑战在于如何有效解耦场景属性与光照信息,而当前依赖潜在内在表示(latent intrinsic representations)的方法因约束不足难以应对高反射或透明材质。解决方案的关键在于提出增强型潜在内在表示(Augmented Latent Intrinsics, ALI),通过融合像素对齐的视觉编码器特征到潜在内在框架中,并引入自监督精化策略以缓解真实世界配对数据稀缺问题,从而在保持语义上下文的同时保留密集的光度结构,显著提升复杂材质上的重渲染质量。
链接: https://arxiv.org/abs/2602.01391
作者: Xiaoyan Xing,Xiao Zhang,Sezer Karaoglu,Theo Gevers,Anand Bhattad
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Image-to-image relighting requires representations that disentangle scene properties from illumination. Recent methods rely on latent intrinsic representations but remain under-constrained and often fail on challenging materials such as metal and glass. A natural hypothesis is that stronger pretrained visual priors should resolve these failures. We find the opposite: features from top-performing semantic encoders often degrade relighting quality, revealing a fundamental trade-off between semantic abstraction and photometric fidelity. We study this trade-off and introduce Augmented Latent Intrinsics (ALI), which balances semantic context and dense photometric structure by fusing features from a pixel-aligned visual encoder into a latent-intrinsic framework, together with a self-supervised refinement strategy to mitigate the scarcity of paired real-world data. Trained only on unlabeled real-world image pairs and paired with a dense, pixel-aligned visual prior, ALI achieves strong improvements in relighting, with the largest gains on complex, specular materials. Project page: this https URL
zh
[CV-130] PromptRL: Prompt Matters in RL for Flow-Based Image Generation
【速读】:该论文针对当前基于流模型(Flow Matching, FM)的文本到图像生成系统在强化学习(Reinforcement Learning, RL)后训练阶段存在的两大关键问题展开研究:一是样本效率低下,源于生成多样性不足;二是显著的提示过拟合(prompt overfitting),即模型对训练时的具体提示格式产生记忆,在面对语义相同但风格不同的提示时性能急剧下降。解决方案的关键在于提出 PromptRL 框架,其核心创新是在流基 RL 优化循环中引入语言模型(Language Model, LM)作为可训练的提示精炼代理(prompt refinement agent),从而实现两个互补优势:一方面快速构建复杂的提示重写能力,另一方面通过协同训练机制重塑优化动态,显著提升模型鲁棒性与性能上限。实验表明,PromptRL 在多个基准测试中达到最优表现,并以不到两倍于传统流模型纯 RL 方法的 rollout 数量实现更高效训练。
链接: https://arxiv.org/abs/2602.01382
作者: Fu-Yun Wang,Han Zhang,Michael Gharbi,Hongsheng Li,Taesung Park
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Flow matching models (FMs) have revolutionized text-to-image (T2I) generation, with reinforcement learning (RL) serving as a critical post-training strategy for alignment with reward objectives. In this research, we show that current RL pipelines for FMs suffer from two underappreciated yet important limitations: sample inefficiency due to insufficient generation diversity, and pronounced prompt overfitting, where models memorize specific training formulations and exhibit dramatic performance collapse when evaluated on semantically equivalent but stylistically varied prompts. We present PromptRL (Prompt Matters in RL for Flow-Based Image Generation), a framework that incorporates language models (LMs) as trainable prompt refinement agents directly within the flow-based RL optimization loop. This design yields two complementary benefits: rapid development of sophisticated prompt rewriting capabilities and, critically, a synergistic training regime that reshapes the optimization dynamics. PromptRL achieves state-of-the-art performance across multiple benchmarks, obtaining scores of 0.97 on GenEval, 0.98 on OCR accuracy, and 24.05 on PickScore. Furthermore, we validate the effectiveness of our RL approach on large-scale image editing models, improving the EditReward of FLUX.1-Kontext from 1.19 to 1.43 with only 0.06 million rollouts, surpassing Gemini 2.5 Flash Image (also known as Nano Banana), which scores 1.37, and achieving comparable performance with ReasonNet (1.44), which relied on fine-grained data annotations along with a complex multi-stage training. Our extensive experiments empirically demonstrate that PromptRL consistently achieves higher performance ceilings while requiring over 2 \times fewer rollouts compared to naive flow-only RL. Our code is available at this https URL.
zh
[CV-131] PolyGen: Fully Synthetic Vision-Language Training via Multi-Generator Ensembles
【速读】:该论文旨在解决当前视觉-语言预训练中合成数据生成方法依赖单一生成模型所导致的谱偏差(spectral biases)和特征多样性不足的问题。解决方案的关键在于提出PolyGen框架,其核心是采用“多层架构”(Polylithic)策略,在多个结构差异显著的生成器交集上进行训练,从而有效消除模型特异性伪影;同时引入程序化难负样本课程(Programmatic Hard Negative curriculum),强化细粒度语法理解能力。通过将相同数据预算从单一描述文本重新分配至多源变体,PolyGen显著提升了特征空间的鲁棒性,在多任务基准测试中较领先单源基线SynthCLIP提升+19.0%,在SugarCrepe++组合性基准上提升+9.1%。
链接: https://arxiv.org/abs/2602.01370
作者: Leonardo Brusini,Cristian Sbrolli,Eugenio Lomurno,Toshihiko Yamasaki,Matteo Matteucci
机构: Politecnico di Milano, AIRLab, Italy; University of Tokyo, Computer Vision and Media Lab, Japan
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Synthetic data offers a scalable solution for vision-language pre-training, yet current state-of-the-art methods typically rely on scaling up a single generative backbone, which introduces generator-specific spectral biases and limits feature diversity. In this work, we introduce PolyGen, a framework that redefines synthetic data construction by prioritizing manifold coverage and compositional rigor over simple dataset size. PolyGen employs a Polylithic approach to train on the intersection of architecturally distinct generators, effectively marginalizing out model-specific artifacts. Additionally, we introduce a Programmatic Hard Negative curriculum that enforces fine-grained syntactic understanding. By structurally reallocating the same data budget from unique captions to multi-source variations, PolyGen achieves a more robust feature space, outperforming the leading single-source baseline (SynthCLIP) by +19.0% on aggregate multi-task benchmarks and on the SugarCrepe++ compositionality benchmark (+9.1%). These results demonstrate that structural diversity is a more data-efficient scaling law than simply increasing the volume of single-source samples.
zh
[CV-132] Exposing and Defending the Achilles Heel of Video Mixture-of-Experts
【速读】:该论文旨在解决混合专家(Mixture-of-Experts, MoE)模型在视频理解任务中对抗鲁棒性不足的问题,尤其是现有攻击方法将MoE视为统一架构,忽视了其核心组件如路由器(router)和专家模块(expert module)之间的独立脆弱性和协同脆弱性。解决方案的关键在于提出一种基于时间Lipschitz约束的攻击与防御框架:首先设计针对路由器的独立攻击(TLGA),揭示其单独脆弱性;进而提出联合时间Lipschitz引导攻击(J-TLGA),通过同时扰动路由器和专家模块,放大对抗效应并暴露MoE架构的协同弱点(Achilles’ Heel);最后基于此构建联合时间Lipschitz对抗训练(J-TLAT),实现对协同脆弱性的联合防御,从而提升组件级鲁棒性。该框架具有即插即用特性,且推理成本降低超过60%,在多种数据集和架构上均有效增强对抗鲁棒性。
链接: https://arxiv.org/abs/2602.01369
作者: Songping Wang,Qinglong Liu,Yueming Lyu,Ning Li,Ziwen He,Caifeng Shan
机构: Nanjing University(南京大学); China Mobile Information Technology Co., Ltd.(中国移动信息技术有限公司); Nanjing University of Information Science and Technology(南京信息工程大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Mixture-of-Experts (MoE) has demonstrated strong performance in video understanding tasks, yet its adversarial robustness remains underexplored. Existing attack methods often treat MoE as a unified architecture, overlooking the independent and collaborative weaknesses of key components such as routers and expert modules. To fill this gap, we propose Temporal Lipschitz-Guided Attacks (TLGA) to thoroughly investigate component-level vulnerabilities in video MoE models. We first design attacks on the router, revealing its independent weaknesses. Building on this, we introduce Joint Temporal Lipschitz-Guided Attacks (J-TLGA), which collaboratively perturb both routers and experts. This joint attack significantly amplifies adversarial effects and exposes the Achilles’ Heel (collaborative weaknesses) of the MoE architecture. Based on these insights, we further propose Joint Temporal Lipschitz Adversarial Training (J-TLAT). J-TLAT performs joint training to further defend against collaborative weaknesses, enhancing component-wise robustness. Our framework is plug-and-play and reduces inference cost by more than 60% compared with dense models. It consistently enhances adversarial robustness across diverse datasets and architectures, effectively mitigating both the independent and collaborative weaknesses of MoE.
zh
[CV-133] 2M Mamba: Motion Periodicity-Saliency Coupling Approach for Stable Text-Driven Motion Generation
【速读】:该论文旨在解决文本到动作生成(Text-to-motion generation)中存在的两个核心问题:一是现有模型将运动周期性(motion periodicity)与关键帧显著性(keyframe saliency)视为独立因素,忽略了二者耦合关系,导致长序列生成时出现漂移;二是模型对语义等价的文本改写(paraphrase)敏感,微小的同义词替换会扭曲文本嵌入,进而通过解码器传播并引发动作不稳定或错误。解决方案的关键在于提出T2M Mamba框架,其核心创新包括:(i) 设计周期性-显著性感知Mamba(Periodicity-Saliency Aware Mamba),通过改进的密度峰值聚类(Density Peaks Clustering)进行关键帧权重估计,并利用FFT加速自相关分析实现高效周期性建模,从而捕捉耦合动力学;(ii) 构建周期差分跨模态对齐模块(Periodic Differential Cross-modal Alignment Module, PDCAM),增强文本与动作嵌入间的鲁棒对齐能力,提升对语义等价表述的稳定性。
链接: https://arxiv.org/abs/2602.01352
作者: Xingzu Zhan,Chen Xie,Honghang Chen,Yixun Lin,Xiaochun Mai
机构: Shenzhen University (深圳大学); Jinan University (暨南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages,5 figures
Abstract:Text-to-motion generation, which converts motion language descriptions into coherent 3D human motion sequences, has attracted increasing attention in fields, such as avatar animation and humanoid robotic interaction. Though existing models have achieved significant fidelity, they still suffer from two core limitations: (i) They treat motion periodicity and keyframe saliency as independent factors, overlooking their coupling and causing generation drift in long sequences. (ii) They are fragile to semantically equivalent paraphrases, where minor synonym substitutions distort textual embeddings, propagating through the decoder and producing unstable or erroneous motions. In this work, we propose T2M Mamba to address these limitations by (i) proposing Periodicity-Saliency Aware Mamba, which utilizes novel algorithms for keyframe weight estimation via enhanced Density Peaks Clustering and motion periodicity estimation via FFT-accelerated autocorrelation to capture coupled dynamics with minimal computational overhead, and (ii) constructing a Periodic Differential Cross-modal Alignment Module (PDCAM) to enhance robust alignment of textual and motion embeddings. Extensive experiments on HumanML3D and KIT-ML datasets have been conducted, confirming the effectiveness of our approach, achieving an FID of 0.068 and consistent gains on all other metrics.
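代码示意(编辑补充):论文所称"FFT 加速的自相关"即 Wiener–Khinchin 定理的直接应用:对零填充信号取 |FFT|² 再逆变换即得自相关,复杂度 O(n log n)。下面是对单通道运动信号估计主周期的最小示例。

```python
import numpy as np

def estimate_period(signal, min_lag=4):
    """Dominant period via FFT-accelerated autocorrelation (Wiener-Khinchin).

    Zero-pads to avoid circular wrap-around, takes |FFT|^2, and inverts to
    get the autocorrelation; the strongest peak past min_lag is the period.
    """
    x = np.asarray(signal, float)
    x = x - x.mean()
    n = len(x)
    spec = np.fft.rfft(x, n=2 * n)             # zero-padded transform
    ac = np.fft.irfft(np.abs(spec) ** 2)[:n]   # autocorrelation, O(n log n)
    ac /= ac[0]                                 # normalize so ac[0] = 1
    return min_lag + int(np.argmax(ac[min_lag:]))

t = np.arange(240)
walk = np.sin(2 * np.pi * t / 30) + 0.1 * np.random.default_rng(0).normal(size=240)
print(estimate_period(walk))   # ~30 frames per gait cycle
```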
zh
[CV-134] Adaptive Visual Autoregressive Acceleration via Dual-Linkage Entropy Analysis
【速读】:该论文旨在解决视觉自回归模型(Visual AutoRegressive, VAR)在推理过程中因大量token数量导致的高计算开销问题。现有token缩减方法受限于启发式阶段划分、非自适应调度策略及加速范围有限,难以充分挖掘加速潜力。其解决方案的关键在于提出NOVA框架——一种无需训练的token缩减加速方法,通过熵分析捕捉建模动态演化过程:利用尺度熵增长的拐点在线自适应确定加速激活规模,并基于尺度关联与层关联比例调整机制,为不同尺度和层级动态计算差异化的token缩减比例,在剪除低熵token的同时复用前一尺度残差缓存,从而在保持生成质量的前提下显著加速推理。
链接: https://arxiv.org/abs/2602.01345
作者: Yu Zhang,Jingyi Liu,Feng Liu,Duoqian Miao,Qi Zhang,Kexue Fu,Changwei Wang,Longbing Cao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 8 figures
Abstract:Visual AutoRegressive modeling (VAR) suffers from substantial computational cost due to the massive token count involved. Failing to account for the continuous evolution of modeling dynamics, existing VAR token reduction methods face three key limitations: heuristic stage partition, non-adaptive schedules, and limited acceleration scope, thereby leaving significant acceleration potential untapped. Since entropy variation intrinsically reflects the transition of predictive uncertainty, it offers a principled measure to capture modeling dynamics evolution. Therefore, we propose NOVA, a training-free token reduction acceleration framework for VAR models via entropy analysis. NOVA adaptively determines the acceleration activation scale during inference by online identifying the inflection point of scale entropy growth. Through scale-linkage and layer-linkage ratio adjustment, NOVA dynamically computes distinct token reduction ratios for each scale and layer, pruning low-entropy tokens while reusing the cache derived from the residuals at the prior scale to accelerate inference and maintain generation quality. Extensive experiments and analyses validate NOVA as a simple yet effective training-free acceleration framework.
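摘要中“拐点识别 + 低熵token剪枝”的思路可用如下最小示意来理解;熵统计量、拐点判据与剪枝比例均为笔者假设的简化版本,仅用于说明机制:

```python
import torch

def token_entropy(logits):
    """logits: (N, V) → 每个 token 预测分布的熵,作为不确定性度量。"""
    p = torch.softmax(logits, dim=-1)
    return -(p * torch.log(p.clamp_min(1e-12))).sum(-1)

def find_inflection(scale_entropies):
    """在线识别尺度熵增长的拐点:二阶差分首次由正转非正处(简化判据)。"""
    e = torch.tensor(scale_entropies, dtype=torch.float32)
    d2 = e[2:] - 2 * e[1:-1] + e[:-2]
    for i in range(1, len(d2)):
        if d2[i - 1] > 0 and d2[i] <= 0:
            return i + 1                       # 从该尺度起激活加速
    return len(scale_entropies) - 1

def prune_low_entropy(tokens, logits, keep_ratio=0.6):
    """剪除低熵 token,只保留熵最高的 keep_ratio;被剪位置可复用上一尺度残差缓存。"""
    ent = token_entropy(logits)
    k = max(1, int(keep_ratio * tokens.shape[0]))
    idx = torch.topk(ent, k).indices.sort().values
    return tokens[idx], idx
```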
zh
[CV-135] MTC-VAE: Multi-Level Temporal Compression with Content Awareness
【速读】:该论文旨在解决连续变分自编码器(Continuous Variational Autoencoders, CVAEs)在追求更高压缩率时因增加采样层而导致效率显著下降的问题。解决方案的关键在于提出一种将固定压缩率的VAE转换为支持多级时间压缩的模型的技术,通过一种简单且最小化的微调方法,在不扩展隐藏通道维度的前提下有效缓解高压缩率下的性能下降问题。该方法允许在不同视频片段中灵活调整压缩级别,并成功集成到基于扩散机制的生成模型(如DiT)中,实现协同训练与良好兼容性,从而提升了潜在视频扩散模型(Latent Video Diffusion Models, LVDMs)在压缩效率与生成质量之间的平衡能力。
链接: https://arxiv.org/abs/2602.01340
作者: Yubo Dong,Linchao Zhu
机构: ReLER, CCAI, Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Latent Video Diffusion Models (LVDMs) rely on Variational Autoencoders (VAEs) to compress videos into compact latent representations. For continuous Variational Autoencoders (VAEs), achieving higher compression rates is desirable; yet, the efficiency notably declines when extra sampling layers are added without expanding the dimensions of hidden channels. In this paper, we present a technique to convert fixed compression rate VAEs into models that support multi-level temporal compression, providing a straightforward and minimal fine-tuning approach to counteract performance decline at elevated compression rates. Furthermore, we examine how varying compression levels impact model performance over video segments with diverse characteristics, offering empirical evidence on the effectiveness of our proposed approach. We also investigate the integration of our multi-level temporal compression VAE with diffusion-based generative models, DiT, highlighting successful concurrent training and compatibility within these frameworks. This investigation illustrates the potential uses of multi-level temporal compression.
zh
[CV-136] Beyond Pixels: Visual Metaphor Transfer via Schema-Driven Agentic Reasoning
【速读】:该论文旨在解决当前生成式AI在视觉隐喻生成任务中面临的根本性局限问题——即模型仅能实现像素级指令对齐和表层外观保留,而无法捕捉并复现抽象逻辑以实现真正的跨域语义融合。为突破这一瓶颈,作者提出视觉隐喻迁移(Visual Metaphor Transfer, VMT)任务,其核心在于自主解耦参考图像中的“创意本质”(creative essence),并将该抽象逻辑重新映射至用户指定的目标主体上。解决方案的关键在于构建一个受认知科学启发的多智能体框架,通过引入一种新型Schema Grammar(记为"G")来形式化概念融合理论(Conceptual Blending Theory, CBT),从而将关系不变性从具体视觉实体中分离出来,为跨域逻辑再实例化提供严谨基础;同时,该框架包含感知、迁移、生成与层级诊断四个专业化智能体协同工作,实现从抽象逻辑提取到高保真合成及闭环纠错的全流程自动化,显著提升了隐喻一致性、类比恰当性和视觉创造力等关键指标。
链接: https://arxiv.org/abs/2602.01335
作者: Yu Xu,Yuxin Zhang,Juan Cao,Lin Gao,Chunyu Wang,Oliver Deussen,Tong-Yee Lee,Fan Tang
机构: University of Chinese Academy of Sciences (中国科学院大学); Tencent Hunyuan (腾讯混元); University of Konstanz (康斯坦茨大学); National Cheng-Kung University (成功大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 10 figures
Abstract:A visual metaphor constitutes a high-order form of human creativity, employing cross-domain semantic fusion to transform abstract concepts into impactful visual rhetoric. Despite the remarkable progress of generative AI, existing models remain largely confined to pixel-level instruction alignment and surface-level appearance preservation, failing to capture the underlying abstract logic necessary for genuine metaphorical generation. To bridge this gap, we introduce the task of Visual Metaphor Transfer (VMT), which challenges models to autonomously decouple the “creative essence” from a reference image and re-materialize that abstract logic onto a user-specified target subject. We propose a cognitive-inspired, multi-agent framework that operationalizes Conceptual Blending Theory (CBT) through a novel Schema Grammar (“G”). This structured representation decouples relational invariants from specific visual entities, providing a rigorous foundation for cross-domain logic re-instantiation. Our pipeline executes VMT through a collaborative system of specialized agents: a perception agent that distills the reference into a schema, a transfer agent that maintains generic space invariance to discover apt carriers, a generation agent for high-fidelity synthesis and a hierarchical diagnostic agent that mimics a professional critic, performing closed-loop backtracking to identify and rectify errors across abstract logic, component selection, and prompt encoding. Extensive experiments and human evaluations demonstrate that our method significantly outperforms SOTA baselines in metaphor consistency, analogy appropriateness, and visual creativity, paving the way for automated high-impact creative applications in advertising and media. Source code will be made publicly available.
zh
[CV-137] What Does Vision Tool-Use Reinforcement Learning Really Learn? Disentangling Tool-Induced and Intrinsic Effects for Crop-and-Zoom
【速读】:该论文旨在解决视觉工具使用强化学习(Vision Tool-Use Reinforcement Learning, VTU-RL)中性能提升的来源不明确问题,即难以区分模型性能改善是由内在能力进化还是由工具使用带来的影响。为应对这一挑战,作者提出MED(Measure-Explain-Diagnose)框架,其关键在于通过粗粒度到细粒度的分析流程,将工具诱导的性能差异解耦为“增益”与“损害”两项,并深入探查其演化机制。实验表明,当前VTU-RL主要减少工具引入的负面影响(如调用错误和工具模式干扰),而对内在缺陷的工具修正作用有限,说明现有方法更倾向于安全共存而非真正掌握工具使用。
链接: https://arxiv.org/abs/2602.01334
作者: Yan Ma,Weiyu Zhang,Tianle Li,Linge Du,Xuyang Shen,Pengfei Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: code: this https URL
Abstract:Vision tool-use reinforcement learning (RL) can equip vision-language models with visual operators such as crop-and-zoom and achieves strong performance gains, yet it remains unclear whether these gains are driven by improvements in tool use or evolving intrinsic capabilities. We introduce MED (Measure-Explain-Diagnose), a coarse-to-fine framework that disentangles intrinsic capability changes from tool-induced effects, decomposes the tool-induced performance difference into gain and harm terms, and probes the mechanisms driving their evolution. Across checkpoint-level analyses on two VLMs with different tool priors and six benchmarks, we find that improvements are dominated by intrinsic learning, while tool-use RL mainly reduces tool-induced harm (e.g., fewer call-induced errors and weaker tool schema interference) and yields limited progress in tool-based correction of intrinsic failures. Overall, current vision tool-use RL learns to coexist safely with tools rather than master them.
zh
[CV-138] FlowCast: Trajectory Forecasting for Scalable Zero-Cost Speculative Flow Matching ICLR2026
【速读】:该论文旨在解决流匹配(Flow Matching, FM)在视觉生成任务中推理速度过慢的问题,这一瓶颈限制了其在实时或交互式应用场景中的使用。现有加速方法如蒸馏、截断或一致性训练等,往往导致质量下降、需要昂贵的重新训练或缺乏泛化能力。解决方案的关键在于提出 FlowCast——一种无需训练的推测生成框架,其核心思想是利用 FM 模型在训练过程中保持恒定速度(constant velocity)的特性:通过外推当前速度来推测未来状态,并在均方误差阈值内接受该推测结果。这种基于恒定速度预测的机制允许在稳定区域大幅跳过冗余步骤,同时在复杂区域保持高精度,从而实现高效且无损的加速。该方法为任意 FM 模型提供即插即用的加速能力,无需额外网络结构或再训练。
链接: https://arxiv.org/abs/2602.01329
作者: Divya Jyoti Bajpai,Shubham Agarwal,Apoorv Saxena,Kuldeep Kulkarni,Subrata Mitra,Manjesh Kumar Hanawal
机构: Indian Institute of Technology Bombay (印度理工学院孟买分校); University of California, Berkeley (加州大学伯克利分校); Inception Labs; Adobe Research India (Adobe 研究院印度)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at International Conference on Learning Representations (ICLR 2026)
Abstract:Flow Matching (FM) has recently emerged as a powerful approach for high-quality visual generation. However, their prohibitively slow inference due to a large number of denoising steps limits their potential use in real-time or interactive applications. Existing acceleration methods, like distillation, truncation, or consistency training, either degrade quality, incur costly retraining, or lack generalization. We propose FlowCast, a training-free speculative generation framework that accelerates inference by exploiting the fact that FM models are trained to preserve constant velocity. FlowCast speculates future velocity by extrapolating current velocity without incurring additional time cost, and accepts it if it is within a mean-squared error threshold. This constant-velocity forecasting allows redundant steps in stable regions to be aggressively skipped while retaining precision in complex ones. FlowCast is a plug-and-play framework that integrates seamlessly with any FM model and requires no auxiliary networks. We also present a theoretical analysis and bound the worst-case deviation between speculative and full FM trajectories. Empirical evaluations demonstrate that FlowCast achieves 2.5\times speedup in image generation, video generation, and editing tasks, outperforming existing baselines with no quality loss as compared to standard full generation.
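摘要描述的“恒速外推 + MSE 阈值接受”机制可以写成如下采样循环的最小示意;其中 model(x, t) 返回速度场、阈值判定与单步跳过策略均为笔者的简化假设,论文中的调度可能更精细:

```python
import torch

@torch.no_grad()
def flowcast_sample(model, x, ts, mse_thresh=1e-4):
    """恒速推测采样示意:假设 model(x, t) 为流匹配速度场。

    若相邻两次速度的 MSE 低于阈值(处于稳定区域),下一步直接沿
    当前速度外推,跳过一次模型调用;否则执行标准欧拉步。
    """
    v_prev, i = None, 0
    while i < len(ts) - 1:
        v = model(x, ts[i])                          # 真实速度评估
        x = x + v * (ts[i + 1] - ts[i])              # 标准欧拉步
        stable = v_prev is not None and torch.mean((v - v_prev) ** 2) < mse_thresh
        v_prev, i = v, i + 1
        if stable and i < len(ts) - 1:
            x = x + v * (ts[i + 1] - ts[i])          # 推测步:恒速外推,零模型调用
            i += 1
    return x
```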
zh
[CV-139] DeCorStory: Gram-Schmidt Prompt Embedding Decorrelation for Consistent Storytelling
【速读】:该论文旨在解决文本到图像叙事(text-to-image storytelling)中跨帧视觉与语义一致性的问题,尤其是现有无需训练(training-free)方法如One-Prompt-One-Story因将所有提示词拼接为单一序列而导致的嵌入相关性强、颜色泄漏(color leakage)、背景混叠(background blending)及身份漂移(identity drift)等缺陷。其解决方案的关键在于提出DeCorStory框架:通过Gram-Schmidt正交化处理提示词嵌入以消除帧间语义干扰,并结合奇异值重加权增强提示特定信息,以及引入保持身份一致性的交叉注意力机制来稳定扩散过程中的角色特征,整个方法无需模型修改或微调,可无缝集成至现有扩散模型流程中,从而在提示对齐度、身份一致性和视觉多样性上实现显著提升。
链接: https://arxiv.org/abs/2602.01306
作者: Ayushman Sarkar,Zhenyu Yu,Mohd Yamani Idna Idris
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Maintaining visual and semantic consistency across frames is a key challenge in text-to-image storytelling. Existing training-free methods, such as One-Prompt-One-Story, concatenate all prompts into a single sequence, which often induces strong embedding correlation and leads to color leakage, background blending, and identity drift. We propose DeCorStory, a training-free inference-time framework that explicitly reduces inter-frame semantic interference. DeCorStory applies Gram-Schmidt prompt embedding decorrelation to orthogonalize frame-level semantics, followed by singular value reweighting to strengthen prompt-specific information and identity-preserving cross-attention to stabilize character identity during diffusion. The method requires no model modification or fine-tuning and can be seamlessly integrated into existing diffusion pipelines. Experiments demonstrate consistent improvements in prompt-image alignment, identity consistency, and visual diversity, achieving state-of-the-art performance among training-free baselines. Code is available at: this https URL
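摘要中的 Gram-Schmidt 提示嵌入去相关可以直接落到几行张量运算上;下面是一个保持各帧范数、只正交化方向的最小示意(奇异值重加权的指数 gamma 为假设超参):

```python
import torch

def gram_schmidt_decorrelate(frame_embs):
    """对帧级提示嵌入做 Gram-Schmidt 正交化以降低帧间相关性(示意)。

    frame_embs: (F, D),每行为一帧提示的池化嵌入;
    返回方向两两正交、范数不变的新嵌入。
    """
    dirs, out = [], []
    for v in frame_embs:
        u = v.clone()
        for q in dirs:                        # 减去在先前各帧方向上的投影
            u = u - (u @ q) * q
        q = u / u.norm().clamp_min(1e-8)
        dirs.append(q)
        out.append(q * v.norm())              # 保留原范数(语义强度)
    return torch.stack(out)

def singular_value_reweight(emb_matrix, gamma=1.2):
    """奇异值重加权:放大主方向以强化提示特异信息(gamma 为假设超参)。"""
    U, S, Vh = torch.linalg.svd(emb_matrix, full_matrices=False)
    return U @ torch.diag(S.pow(gamma)) @ Vh
```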
zh
[CV-140] StoryState: Agent-Based State Control for Consistent and Editable Storybooks
【速读】:该论文旨在解决当前大型多模态模型在一键式绘本生成中因故事状态(story state)隐式表达而导致的编辑粒度粗、视觉一致性差的问题。其解决方案的关键在于提出一个基于代理(agent)的编排层 StoryState,该层显式地将故事表示为包含角色表、全局设定和逐页场景约束的结构化对象,并通过少量大语言模型(LLM)代理维护此状态,从而生成或修改适用于训练-free 文本到图像生成器的 Prompt。StoryState 仅依赖提示词操作,具备模型无关性与后端兼容性,在多页编辑任务中实现了局部页面修改、提升跨页一致性、减少意外变化与交互轮次,同时接近 Gemini Storybook 的一次性生成一致性表现。
链接: https://arxiv.org/abs/2602.01305
作者: Ayushman Sarkar,Zhenyu Yu,Wei Tang,Chu Chen,Kangning Cui,Mohd Yamani Idna Idris
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large multimodal models have enabled one-click storybook generation, where users provide a short description and receive a multi-page illustrated story. However, the underlying story state, such as characters, world settings, and page-level objects, remains implicit, making edits coarse-grained and often breaking visual consistency. We present StoryState, an agent-based orchestration layer that introduces an explicit and editable story state on top of training-free text-to-image generation. StoryState represents each story as a structured object composed of a character sheet, global settings, and per-page scene constraints, and employs a small set of LLM agents to maintain this state and derive 1Prompt1Story-style prompts for generation and editing. Operating purely through prompts, StoryState is model-agnostic and compatible with diverse generation backends. System-level experiments on multi-page editing tasks show that StoryState enables localized page edits, improves cross-page consistency, and reduces unintended changes, interaction turns, and editing time compared to 1Prompt1Story, while approaching the one-shot consistency of Gemini Storybook. Code is available at this https URL
zh
[CV-141] ReDiStory: Region-Disentangled Diffusion for Consistent Visual Story Generation
【速读】:该论文旨在解决多帧视觉故事生成中因帧间语义干扰导致的主体身份一致性弱化问题(即在保持每帧语义准确性的同时,难以维持跨帧主体身份的一致性)。其解决方案的关键在于提出一种无需训练的推理时提示嵌入重组框架 ReDiStory,通过显式分解文本嵌入为与身份相关的成分和帧特定成分,并抑制不同帧间共享的方向以降低帧间相关性,从而在不修改扩散模型参数或引入额外监督的情况下显著提升身份一致性。
链接: https://arxiv.org/abs/2602.01303
作者: Ayushman Sarkar,Zhenyu Yu,Chu Chen,Wei Tang,Kangning Cui,Mohd Yamani Idna Idris
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generating coherent visual stories requires maintaining subject identity across multiple images while preserving frame-specific semantics. Recent training-free methods concatenate identity and frame prompts into a unified representation, but this often introduces inter-frame semantic interference that weakens identity preservation in complex stories. We propose ReDiStory, a training-free framework that improves multi-frame story generation via inference-time prompt embedding reorganization. ReDiStory explicitly decomposes text embeddings into identity-related and frame-specific components, then decorrelates frame embeddings by suppressing shared directions across frames. This reduces cross-frame interference without modifying diffusion parameters or requiring additional supervision. Under identical diffusion backbones and inference settings, ReDiStory improves identity consistency while maintaining prompt fidelity. Experiments on the ConsiStory+ benchmark show consistent gains over 1Prompt1Story on multiple identity consistency metrics. Code is available at: this https URL
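与上一篇的 Gram-Schmidt 正交化不同,此处的核心操作是“抑制各帧共享的方向”。下面用 SVD 给出一个最小示意:把前 k 个奇异方向视为帧间共享成分并按比例衰减;k 与 alpha 均为笔者假设的超参:

```python
import torch

def suppress_shared_directions(frame_embs, k=1, alpha=0.8):
    """抑制帧嵌入矩阵中跨帧共享的主方向,保留帧特定成分(示意)。

    frame_embs: (F, D)。前 k 个右奇异方向近似帧间共享(公共语义)子空间,
    将各帧在该子空间上的分量衰减 alpha 比例。
    """
    U, S, Vh = torch.linalg.svd(frame_embs, full_matrices=False)
    shared = Vh[:k]                               # (k, D) 共享方向
    proj = frame_embs @ shared.T @ shared         # 各帧在共享子空间上的投影
    return frame_embs - alpha * proj
```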
zh
[CV-142] Interaction-Consistent Object Removal via MLLM-Based Reasoning
【速读】:该论文旨在解决图像中目标物体移除时遗留交互痕迹导致语义不一致的问题,即传统方法仅移除指定目标对象,却未同步清除与其相关的物理或视觉关联元素(如光照影响、连接物体、目标产生的物质及情境相关对象),从而造成结果在语义上不合理。为此,作者提出了基于多模态大语言模型(Multimodal Large Language Model, MLLM)的推理增强型对象移除框架(Reasoning-Enhanced Object Removal, REORM),其核心在于利用MLLM进行交互推理以识别需协同移除的元素,并结合掩码引导的删除机制与自校正策略实现高一致性编辑效果。
链接: https://arxiv.org/abs/2602.01298
作者: Ching-Kai Huang,Wen-Chieh Lin,Yan-Cen Lee
机构: National Yang Ming Chiao Tung University (国立阳明交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Image-based object removal often erases only the named target, leaving behind interaction evidence that renders the result semantically inconsistent. We formalize this problem as Interaction-Consistent Object Removal (ICOR), which requires removing not only the target object but also associated interaction elements, such as lighting-dependent effects, physically connected objects, target-produced elements, and contextually linked objects. To address this task, we propose Reasoning-Enhanced Object Removal with MLLM (REORM), a reasoning-enhanced object removal framework that leverages multimodal large language models to infer which elements must be jointly removed. REORM features a modular design that integrates MLLM-driven analysis, mask-guided removal, and a self-correction mechanism, along with a local-deployment variant that supports accurate editing under limited resources. To support evaluation, we introduce ICOREval, a benchmark consisting of instruction-driven removals with rich interaction dependencies. On ICOREval, REORM outperforms state-of-the-art image editing systems, demonstrating its effectiveness in producing interaction-consistent results.
zh
[CV-143] Interacted Planes Reveal 3D Line Mapping
【速读】:该论文旨在解决从多视角RGB图像中进行高精度、结构化3D线映射(3D line mapping)的问题,尤其关注在人造环境中如何实现既准确又高效的重建。传统方法往往忽视了3D线与平面之间的拓扑关系,导致重建结果缺乏结构性和完整性。解决方案的关键在于提出LiP-Map框架,该框架通过显式建模可学习的线(line)与平面(planar)原始要素之间的联合优化机制,将平面拓扑信息融入3D线映射过程——不是依赖成对共面约束,而是通过构造线与平面原始要素间的交互关系,从而实现更结构化的场景表示。这一设计显著提升了重建精度与完整性,并在多个公开数据集上优于当前最优方法,同时在基于线辅助的视觉定位任务中展现出强大性能。
链接: https://arxiv.org/abs/2602.01296
作者: Zeran Ke,Bin Tan,Gui-Song Xia,Yujun Shen,Nan Xue
机构: Wuhan University (武汉大学); Ant Group (蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: submitted to TPAMI
Abstract:3D line mapping from multi-view RGB images provides a compact and structured visual representation of scenes. We study the problem from a physical and topological perspective: a 3D line most naturally emerges as the edge of a finite 3D planar patch. We present LiP-Map, a line-plane joint optimization framework that explicitly models learnable line and planar primitives. This coupling enables accurate and detailed 3D line mapping while maintaining strong efficiency (typically completing a reconstruction in 3 to 5 minutes per scene). LiP-Map pioneers the integration of planar topology into 3D line mapping, not by imposing pairwise coplanarity constraints but by explicitly constructing interactions between plane and line primitives, thus offering a principled route toward structured reconstruction in man-made environments. On more than 100 scenes from ScanNetV2, ScanNet++, Hypersim, 7Scenes, and Tanks & Temples, LiP-Map improves both accuracy and completeness over state-of-the-art methods. Beyond line mapping quality, LiP-Map significantly advances line-assisted visual localization, establishing strong performance on 7Scenes. Our code is released at this https URL for reproducible research.
zh
[CV-144] Gradient-Aligned Calibration for Post-Training Quantization of Diffusion Models
【速读】:该论文旨在解决扩散模型(Diffusion Models)在实际部署中面临的推理速度慢、内存占用高及噪声估计计算复杂度大的问题,特别是现有后训练量化(Post-Training Quantization, PTQ)方法因对不同时间步(timesteps)采用统一权重进行校准样本加权,导致量化性能不佳的问题。其解决方案的关键在于提出一种新型PTQ方法,通过学习为校准样本分配最优权重,使量化模型在不同时间步上的梯度方向趋于一致,从而缓解因各时间步激活分布和梯度差异带来的冲突梯度问题,显著提升量化后的生成质量与效率。
链接: https://arxiv.org/abs/2602.01289
作者: Dung Anh Hoang,Cuong Pham,Trung Le,Jianfei Cai,Toan Do
机构: Monash University (蒙纳士大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion models have shown remarkable performance in image synthesis by progressively estimating a smooth transition from a Gaussian distribution of noise to a real image. Unfortunately, their practical deployment is limited by slow inference speed, high memory usage, and the computational demands of the noise estimation process. Post-training quantization (PTQ) emerges as a promising solution to accelerate sampling and reduce memory overhead for diffusion models. Existing PTQ methods for diffusion models typically apply uniform weights to calibration samples across timesteps, which is sub-optimal since data at different timesteps may contribute differently to the diffusion process. Additionally, due to varying activation distributions and gradients across timesteps, a uniform quantization approach is sub-optimal. Each timestep requires a different gradient direction for optimal quantization, and treating them equally can lead to conflicting gradients that degrade performance. In this paper, we propose a novel PTQ method that addresses these challenges by assigning appropriate weights to calibration samples. Specifically, our approach learns to assign optimal weights to calibration samples to align the quantized model’s gradients across timesteps, facilitating the quantization process. Extensive experiments on CIFAR-10, LSUN-Bedrooms, and ImageNet demonstrate the superiority of our method compared to other PTQ methods for diffusion models.
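摘要的核心是“学习校准样本权重,使不同时间步的梯度方向一致”。下面给出该对齐目标的一个最小示意;梯度的采集方式与权重参数化均为笔者假设:

```python
import torch

def grad_alignment_loss(per_step_grads, weights):
    """跨时间步梯度对齐损失(示意):1 - 加权梯度的平均两两余弦相似度。

    per_step_grads: 每个时间步一条展平梯度,形状 (P,) 的张量列表;
    weights: (T,) 可学习的样本/时间步权重(softmax 归一化)。
    """
    w = torch.softmax(weights, dim=0)
    g = torch.stack(per_step_grads) * w[:, None]          # 加权后的各步梯度
    g = torch.nn.functional.normalize(g, dim=1)
    cos = g @ g.T                                         # (T, T) 两两余弦相似度
    T = cos.shape[0]
    off_diag = (cos.sum() - cos.diag().sum()) / (T * (T - 1))
    return 1.0 - off_diag

# 用法示意:校准阶段交替 1) 对 weights 做几步梯度下降最小化该损失,
# 2) 用加权校准样本执行常规 PTQ 重建,使各时间步的更新方向不再互相抵消。
```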
zh
[CV-145] Seeing Hearing and Knowing Together: Multimodal Strategies in Deepfake Videos Detection
【速读】:该论文旨在解决人类在识别深度伪造视频(deepfake videos)时面临的困难问题,尤其关注人们在判断真实与虚假视频时所采用的策略及其有效性。研究通过195名21至40岁参与者的行为实验,系统分析了他们在视觉、听觉和知识层面依赖的线索,并利用关联规则挖掘技术识别出影响检测准确性的关键线索组合。解决方案的关键在于揭示了多模态线索(如视觉外观、声音特征和直觉判断)协同作用对成功识别深伪内容的重要性,从而为设计基于有效线索引导的媒体素养工具提供了实证依据,有助于提升公众对欺骗性数字媒体的识别能力与抗干扰韧性。
链接: https://arxiv.org/abs/2602.01284
作者: Chen Chen,Dion Hoe-Lian Goh
机构: Nanyang Technological University (南洋理工大学)
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:
Abstract:As deepfake videos become increasingly difficult for people to recognise, understanding the strategies humans use is key to designing effective media literacy interventions. We conducted a study with 195 participants between the ages of 21 and 40, who judged real and deepfake videos, rated their confidence, and reported the cues they relied on across visual, audio, and knowledge strategies. Participants were more accurate with real videos than with deepfakes and showed lower expected calibration error for real content. Through association rule mining, we identified cue combinations that shaped performance. Visual appearance, vocal, and intuition often co-occurred for successful identifications, which highlights the importance of multimodal approaches in human detection. Our findings show which cues help or hinder detection and suggest directions for designing media literacy tools that guide effective cue use. Building on these insights can help people improve their identification skills and become more resilient to deceptive digital media.
zh
[CV-146] Who Transfers Safety? Identifying and Targeting Cross-Lingual Shared Safety Neurons
【速读】:该论文旨在解决多语言安全(Multilingual safety)不平衡问题,即高资源语言(High-resource languages, HR)模型的安全性表现远优于低资源语言(Non-high-resource languages, NHR),导致后者在生成内容时更容易产生有害行为。现有方法难以有效迁移HR语言中的安全能力至NHR语言。其解决方案的关键在于识别并靶向调控一类跨语言共享安全神经元(Cross-lingual Shared Safety Neurons, SS-Neurons)——这些神经元是少数存在于不同语言间共有的、负责调节安全行为的神经子集。通过针对性激活或抑制SS-Neurons,可显著提升NHR语言的安全一致性,且无需大规模微调整个模型,从而实现高效、可控的跨语言安全对齐。
链接: https://arxiv.org/abs/2602.01283
作者: Xianhui Zhang,Chengyu Xie,Linxia Zhu,Yonghui Yang,Weixiang Zhao,Zifeng Cheng,Cong Wang,Fei Shen,Tat-Seng Chua
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multilingual safety remains significantly imbalanced, leaving non-high-resource (NHR) languages vulnerable compared to robust high-resource (HR) ones. Moreover, the neural mechanisms driving safety alignment remain unclear despite observed cross-lingual representation transfer. In this paper, we find that LLMs contain a set of cross-lingual shared safety neurons (SS-Neurons), a remarkably small yet critical neuronal subset that jointly regulates safety behavior across languages. We first identify monolingual safety neurons (MS-Neurons) and validate their causal role in safety refusal behavior through targeted activation and suppression. Our cross-lingual analyses then identify SS-Neurons as the subset of MS-Neurons shared between HR and NHR languages, serving as a bridge to transfer safety capabilities from HR to NHR domains. We observe that suppressing these neurons causes concurrent safety drops across NHR languages, whereas reinforcing them improves cross-lingual defensive consistency. Building on these insights, we propose a simple neuron-oriented training strategy that targets SS-Neurons based on language resource distribution and model architecture. Experiments demonstrate that fine-tuning this tiny neuronal subset outperforms state-of-the-art methods, significantly enhancing NHR safety while maintaining the model’s general capabilities. The code and dataset will be available at https://github.com/1518630367/SS-Neuron-Expansion.
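按摘要的流程,SS-Neurons 是各语言单语安全神经元(MS-Neurons)集合的交集。下面给出一个最小示意,其中以“有害/无害提示的平均激活差”作为神经元重要性评分,这一评分方式是笔者的假设:

```python
import torch

def monolingual_safety_neurons(acts_harmful, acts_benign, top_ratio=0.01):
    """识别某一语言的单语安全神经元(MS-Neurons,示意)。

    acts_*: (N_samples, N_neurons) 某层在有害/无害提示下的激活;
    返回激活差绝对值最大的 top_ratio 神经元索引集合。
    """
    score = (acts_harmful.mean(0) - acts_benign.mean(0)).abs()
    k = max(1, int(top_ratio * score.numel()))
    return set(torch.topk(score, k).indices.tolist())

def shared_safety_neurons(neuron_sets_by_lang):
    """SS-Neurons = 各语言 MS-Neurons 集合的交集。"""
    return set.intersection(*neuron_sets_by_lang.values())

# 用法示意:shared = shared_safety_neurons({"en": s_en, "zh": s_zh, "sw": s_sw})
# 之后可对 shared 中的神经元做定向增强/抑制,或仅微调与其相连的参数。
```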
zh
[CV-147] DSFC-Net: A Dual-Encoder Spatial and Frequency Co-Awareness Network for Rural Road Extraction
【速读】:该论文旨在解决高分辨率遥感影像中农村道路提取难题,其核心挑战包括:不同地表材料导致的类内差异大、类间区分度低;植被遮挡破坏道路空间连续性;以及道路宽度窄加剧检测难度。现有方法主要针对城市环境优化,在农村场景下性能不足。解决方案的关键在于提出DSFC-Net框架,通过双编码器结构融合空间域与频域信息:CNN分支捕获局部边界细节和短程连续性,而创新的Spatial-Frequency Hybrid Transformer(SFT)模块利用Cross-Frequency Interaction Attention(CFIA)机制,基于拉普拉斯金字塔策略显式分离高低频信息,克服传统注意力机制的频率偏差问题,从而增强对植被遮挡的鲁棒性并保持窄路连通性;此外,Channel Feature Fusion Module(CFFM)自适应校准通道特征响应,实现局部纹理与全局语义的深度融合,显著提升分割精度。
链接: https://arxiv.org/abs/2602.01278
作者: Zhengbo Zhang,Yihe Tian,Wanke Xia,Lin Chen,Yue Sun,Kun Ding,Ying Wang,Bing Xu,Shiming Xiang
机构: University of Chinese Academy of Sciences (中国科学院大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Tsinghua University (清华大学); China Agricultural University (中国农业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate extraction of rural roads from high-resolution remote sensing imagery is essential for infrastructure planning and sustainable development. However, this task presents unique challenges in rural settings due to several factors. These include high intra-class variability and low inter-class separability from diverse surface materials, frequent vegetation occlusions that disrupt spatial continuity, and narrow road widths that exacerbate detection difficulties. Existing methods, primarily optimized for structured urban environments, often underperform in these scenarios as they overlook such distinctive characteristics. To address these challenges, we propose DSFC-Net, a dual-encoder framework that synergistically fuses spatial and frequency-domain information. Specifically, a CNN branch is employed to capture fine-grained local road boundaries and short-range continuity, while a novel Spatial-Frequency Hybrid Transformer (SFT) is introduced to robustly model global topological dependencies against vegetation occlusions. Distinct from standard attention mechanisms that suffer from frequency bias, the SFT incorporates a Cross-Frequency Interaction Attention (CFIA) module that explicitly decouples high- and low-frequency information via a Laplacian Pyramid strategy. This design enables the dynamic interaction between spatial details and frequency-aware global contexts, effectively preserving the connectivity of narrow roads. Furthermore, a Channel Feature Fusion Module (CFFM) is proposed to bridge the two branches by adaptively recalibrating channel-wise feature responses, seamlessly integrating local textures with global semantics for accurate segmentation. Comprehensive experiments on the WHU-RuR+, DeepGlobe, and Massachusetts datasets validate the superiority of DSFC-Net over state-of-the-art approaches.
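CFIA 模块的前提是把特征显式拆成高低频两部分;拉普拉斯金字塔的单层拆分可写作“高斯平滑得低频、残差即高频”。以下为最小示意,核大小与 sigma 为假设参数:

```python
import torch
import torch.nn.functional as F

def laplacian_split(x, kernel_size=5, sigma=1.0):
    """拉普拉斯金字塔式的高低频解耦(单层示意):低频 + 高频 = 原特征。

    x: (B, C, H, W) 特征图;按通道做高斯平滑得到低频,残差即高频。
    """
    coords = torch.arange(kernel_size, dtype=torch.float32) - kernel_size // 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = g / g.sum()
    k2d = g[:, None] @ g[None, :]                                  # (k, k) 高斯核
    kernel = k2d[None, None].repeat(x.shape[1], 1, 1, 1)           # 每通道一份
    low = F.conv2d(x, kernel.to(x.device, x.dtype),
                   padding=kernel_size // 2, groups=x.shape[1])    # 低频
    return low, x - low                                            # (低频, 高频)
```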
zh
[CV-148] F-Lane: Traffic Flow Module for Robust Lane Perception
【速读】:该论文旨在解决自动驾驶系统中视觉感知方法在车道线检测任务上因传感器信息不足(如遮挡或无车道线场景)而导致性能显著下降的问题。现有依赖高精地图的方案虽能提供辅助信息,但存在订阅成本高和实时性差的局限。解决方案的关键在于引入一种无需额外成本且具备实时性的新信息源——交通流数据,并提出一个TrafficFlow-aware Lane perception Module (TFM),该模块能够有效提取实时交通流特征并无缝融合至现有车道感知算法中,从而提升模型在复杂场景下的鲁棒性和准确性。
链接: https://arxiv.org/abs/2602.01277
作者: Yihan Xie,Han Xia,Zhen Yang
机构: BYD Company Limited (比亚迪公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 7 figures, 7 tables
Abstract:Autonomous driving systems require robust lane perception capabilities, yet existing vision-based detection methods suffer significant performance degradation when visual sensors provide insufficient cues, such as in occluded or lane-missing scenarios. While some approaches incorporate high-definition maps as supplementary information, these solutions face challenges of high subscription costs and limited real-time performance. To address these limitations, we explore an innovative information source: traffic flow, which offers real-time capabilities without additional costs. This paper proposes a TrafficFlow-aware Lane perception Module (TFM) that effectively extracts real-time traffic flow features and seamlessly integrates them with existing lane perception algorithms. This solution originated from real-world autonomous driving conditions and was subsequently validated on open-source algorithms and datasets. Extensive experiments on four mainstream models and two public datasets (nuScenes and OpenLaneV2) using standard evaluation metrics show that TFM consistently improves performance, achieving up to +4.1% mAP gain on the nuScenes dataset.
zh
[CV-149] Q-DiT4SR: Exploration of Detail-Preserving Diffusion Transformer Quantization for Real-World Image Super-Resolution
【速读】:该论文针对基于扩散Transformer(DiT)的现实世界图像超分辨率(Real-ISR)模型在实际部署中因推理负担过重而受限的问题,提出了一种专门面向DiT架构的后训练量化(PTQ)框架Q-DiT4SR。其关键解决方案包括:1)提出H-SVD(Hierarchical Singular Value Decomposition),通过融合全局低秩分支与局部块级秩-1分支,在匹配参数预算下实现更优的权重压缩;2)设计方差感知的时空混合精度策略(VaSMP和VaTMP),其中VaSMP基于率失真理论无数据地分配跨层权重位宽,VaTMP则利用动态规划(DP)最小化校准开销以调度层内激活精度随扩散时间步的变化。实验表明,该方法在W4A6和W4A4量化设置下均达到当前最优性能,尤其W4A4配置可使模型体积缩小5.8倍、计算量减少超过60倍。
链接: https://arxiv.org/abs/2602.01273
作者: Xun Zhang,Kaicheng Yang,Hongliang Lu,Haotong Qin,Yong Guo,Yulun Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Our code and models will be available at this https URL
Abstract:Recently, Diffusion Transformers (DiTs) have emerged in Real-World Image Super-Resolution (Real-ISR) to generate high-quality textures, yet their heavy inference burden hinders real-world deployment. While Post-Training Quantization (PTQ) is a promising solution for acceleration, existing methods in super-resolution mostly focus on U-Net architectures, whereas generic DiT quantization is typically designed for text-to-image tasks. Directly applying these methods to DiT-based super-resolution models leads to severe degradation of local textures. Therefore, we propose Q-DiT4SR, the first PTQ framework specifically tailored for DiT-based Real-ISR. We propose H-SVD, a hierarchical SVD that integrates a global low-rank branch with a local block-wise rank-1 branch under a matched parameter budget. We further propose Variance-aware Spatio-Temporal Mixed Precision: VaSMP allocates cross-layer weight bit-widths in a data-free manner based on rate-distortion theory, while VaTMP schedules intra-layer activation precision across diffusion timesteps via dynamic programming (DP) with minimal calibration. Experiments on multiple real-world datasets demonstrate that our Q-DiT4SR achieves SOTA performance under both W4A6 and W4A4 settings. Notably, the W4A4 quantization configuration reduces model size by 5.8 \times and computational operations by over 60 \times . Our code and models will be available at this https URL.
zh
[CV-150] OASIS-DC: Generalizable Depth Completion via Output-level Alignment of Sparse-Integrated Monocular Pseudo Depth ICRA2026
【速读】:该论文旨在解决单目基础模型在零样本深度估计中输出为相对深度而非度量深度(metric depth)的问题,这限制了其在机器人和自动驾驶等需要精确尺度信息场景中的直接应用。解决方案的关键在于利用相对深度保持全局布局和边界结构的特性,通过稀疏范围测量对相对深度进行标定,从而构建伪度量深度先验;在此基础上设计一个精炼网络,该网络在可靠区域遵循先验,在不确定区域自主调整,实现仅需少量标注样本即可获得高精度度量预测的能力。该方法在缺乏精心筛选验证数据的情况下仍能保持稳定的尺度和清晰的边缘,表明将基础先验与稀疏锚点结合是应对真实世界标签稀缺条件下鲁棒深度补全的有效路径。
链接: https://arxiv.org/abs/2602.01268
作者: Jaehyeon Cho,Jhonghyun An
机构: Vehicle Intelligence Perception Lab (VIPLAB), Gachon University (嘉泉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted to ICRA 2026
Abstract:Recent monocular foundation models excel at zero-shot depth estimation, yet their outputs are inherently relative rather than metric, limiting direct use in robotics and autonomous driving. We leverage the fact that relative depth preserves global layout and boundaries: by calibrating it with sparse range measurements, we transform it into a pseudo metric depth prior. Building on this prior, we design a refinement network that follows the prior where reliable and deviates where necessary, enabling accurate metric predictions from very few labeled samples. The resulting system is particularly effective when curated validation data are unavailable, sustaining stable scale and sharp edges across few-shot regimes. These findings suggest that coupling foundation priors with sparse anchors is a practical route to robust, deployment-ready depth completion under real-world label scarcity.
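摘要所说“用稀疏测距把相对深度标定为伪度量深度先验”,最常见的闭式做法是对 scale 和 shift 做最小二乘拟合。以下为最小示意(线性模型 d ≈ s·r + t 为笔者假设,论文可能采用更稳健的拟合):

```python
import numpy as np

def calibrate_relative_depth(rel_depth, sparse_depth, mask):
    """把相对深度对齐到稀疏度量测量,得到伪度量深度先验(示意)。

    rel_depth: (H, W) 单目基础模型输出的相对深度;
    sparse_depth: (H, W) 稀疏度量深度;mask: (H, W) 有效测量的布尔掩码。
    求解 min_{s,t} ||s * r + t - d||^2 的闭式最小二乘解。
    """
    r = rel_depth[mask].reshape(-1, 1)
    A = np.hstack([r, np.ones_like(r)])                 # 设计矩阵 [r, 1]
    d = sparse_depth[mask].reshape(-1)
    sol = np.linalg.lstsq(A, d, rcond=None)[0]          # [s, t]
    s, t = float(sol[0]), float(sol[1])
    return s * rel_depth + t                            # 伪度量深度先验
```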
zh
[CV-151] Boosting Point-supervised Temporal Action Localization via Text Refinement and Alignment
【速读】:该论文旨在解决点监督时序动作定位(point-supervised temporal action localization)中仅依赖视觉特征而忽略文本语义信息的问题,从而导致定位精度受限。其解决方案的关键在于提出一种文本精炼与对齐(Text Refinement and Alignment, TRA)框架,通过两个核心模块实现多模态融合:一是基于点标注的文本精炼模块(Point-based Text Refinement, PTR),利用点注释和多个预训练模型优化视频帧描述;二是基于点级别的跨模态对齐模块(Point-based Multimodal Alignment, PMA),将视觉与文本特征投影到统一语义空间,并通过点级多模态对比学习缩小两者之间的语义差距,最终提升动作检测的精度。
链接: https://arxiv.org/abs/2602.01257
作者: Yunchuan Ma,Laiyun Qing,Guorong Li,Yuqing Liu,Yuankai Qi,Qingming Huang
机构: University of Chinese Academy of Science, Beijing,100190, China (中国科学院大学,北京,100190,中国); Macquarie University (麦考瑞大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recently, point-supervised temporal action localization has gained significant attention for its effective balance between labeling costs and localization accuracy. However, current methods only consider features from visual inputs, neglecting helpful semantic information from the text side. To address this issue, we propose a Text Refinement and Alignment (TRA) framework that effectively utilizes textual features from visual descriptions to complement the visual features as they are semantically rich. This is achieved by designing two new modules for the original point-supervised framework: a Point-based Text Refinement module (PTR) and a Point-based Multimodal Alignment module (PMA). Specifically, we first generate descriptions for video frames using a pre-trained multimodal model. Next, PTR refines the initial descriptions by leveraging point annotations together with multiple pre-trained models. PMA then projects all features into a unified semantic space and leverages a point-level multimodal feature contrastive learning to reduce the gap between visual and linguistic modalities. Last, the enhanced multi-modal features are fed into the action detector for precise localization. Extensive experimental results on five widely used benchmarks demonstrate the favorable performance of our proposed framework compared to several state-of-the-art methods. Moreover, our computational overhead analysis shows that the framework can run on a single 24 GB RTX 3090 GPU, indicating its practicality and scalability.
zh
[CV-152] MiTA Attention: Efficient Fast-Weight Scaling via a Mixture of Top-k Activations
【速读】:该论文旨在解决Transformer中注意力机制在处理超长序列时因快速权重(fast weights)规模随序列长度 $N$ 线性增长而导致的计算与内存开销过大的问题。其解决方案的关键在于提出一种“压缩与路由”(compress-and-route)策略:首先通过少量地标查询(landmark queries)将原 $N$-宽度的快速权重多层感知机(MLP)压缩为更窄的结构,进而基于每个地标查询收集 Top-$k$ 激活的键值对以构建可变形专家(deformable experts),从而实现高效且稀疏的注意力计算。该方法被命名为Mixture of Top-$k$ Activations (MiTA),形成了一种统一框架,可用于解释和改进多种高效的注意力机制设计。
链接: https://arxiv.org/abs/2602.01219
作者: Qishuai Wen,Zhiyuan Huang,Xianghan Meng,Wei He,Chun-Guang Li
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The attention operator in Transformers can be viewed as a two-layer fast-weight MLP, whose weights are dynamically instantiated from input tokens and whose width equals sequence length N. As the context extends, the expressive capacity of such an N-width MLP increases, but scaling its fast weights becomes prohibitively expensive for extremely long sequences. Recently, this fast-weight scaling perspective has motivated the Mixture-of-Experts (MoE) attention, which partitions the sequence into fast-weight experts and sparsely routes the tokens to them. In this paper, we elevate this perspective to a unifying framework for a wide range of efficient attention methods by interpreting them as scaling fast weights through routing and/or compression. Then we propose a compress-and-route strategy, which compresses the N-width MLP into a narrower one using a small set of landmark queries and constructs deformable experts by gathering top-k activated key-value pairs for each landmark query. We call this strategy a Mixture of Top-k Activations (MiTA), and refer to the resulting efficient mechanism as MiTA attention. Preliminary experiments on vision tasks demonstrate the promise of our MiTA attention and motivate further investigation on its optimization and broader applications in more challenging settings.
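摘要中“地标查询压缩 + Top-k 键值路由”的组合可以浓缩为十几行张量运算。以下为最小示意(单头、除缩放外不含其他技巧;M、k 等均为假设超参):

```python
import torch

def mita_attention(q_landmarks, K, V, k=64):
    """MiTA 式压缩与路由的最小示意(非官方实现)。

    q_landmarks: (M, d) 少量地标查询;K, V: (N, d) 全序列键/值。
    每个地标只对其 Top-k 激活的键值对(可变形专家)做稀疏注意力。
    """
    scores = q_landmarks @ K.T / K.shape[-1] ** 0.5          # (M, N) 相似度
    topk = scores.topk(k, dim=-1)                            # 每个地标的 Top-k
    gathered_v = V[topk.indices]                             # (M, k, d) 聚合专家
    attn = torch.softmax(topk.values, dim=-1)                # 稀疏注意力权重
    return torch.einsum("mk,mkd->md", attn, gathered_v)      # (M, d) 地标输出
```

这样注意力的主要代价由 O(N^2) 降为 O(MN + Mk),对应“先压缩为 M 宽度、再稀疏路由”的解读。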
zh
[CV-153] Med3D-R1: Incentivizing Clinical Reasoning in 3D Medical Vision-Language Models for Abnormality Diagnosis
【速读】:该论文旨在解决3D视觉语言模型在临床推理中的三大挑战:体积医学影像的固有复杂性、模型对报告表面模式的过拟合倾向,以及缺乏可解释性的奖励设计。其解决方案的关键在于提出Med3D-R1强化学习框架,采用两阶段训练策略:第一阶段(监督微调,SFT)引入残差对齐机制以弥合高维3D特征与文本嵌入之间的差距,并通过异常区域重加权策略增强临床信息token的重要性,降低结构偏差;第二阶段(强化学习,RL)重构一致性奖励函数,显式促进逐步、连贯的诊断推理过程。该方法在CT-RATE和RAD-ChestCT两个3D诊断基准上均达到当前最优性能,表明其在提升异常检测准确性和临床推理能力方面的有效性。
链接: https://arxiv.org/abs/2602.01200
作者: Haoran Lai,Zihang Jiang,Kun Zhang,Qingsong Yao,Rongsheng Wang,Zhiyang He,Xiaodong Tao,Wei Wei,Shaohua Kevin Zhou
机构: University of Science and Technology of China (中国科学技术大学); Suzhou Institute for Advanced Research, University of Science and Technology of China (中国科学技术大学苏州研究院); Stanford University (斯坦福大学); iFlytek Co.Ltd (科大讯飞公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Developing 3D vision-language models with robust clinical reasoning remains a challenge due to the inherent complexity of volumetric medical imaging, the tendency of models to overfit superficial report patterns, and the lack of interpretability-aware reward designs. In this paper, we propose Med3D-R1, a reinforcement learning framework with a two-stage training process: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). During SFT stage, we introduce a residual alignment mechanism to bridge the gap between high-dimensional 3D features and textual embeddings, and an abnormality re-weighting strategy to emphasize clinically informative tokens and reduce structural bias in reports. In RL stage, we redesign the consistency reward to explicitly promote coherent, step-by-step diagnostic reasoning. We evaluate our method on medical multiple-choice visual question answering using two 3D diagnostic benchmarks, CT-RATE and RAD-ChestCT, where our model attains state-of-the-art accuracies of 41.92% on CT-RATE and 44.99% on RAD-ChestCT. These results indicate improved abnormality diagnosis and clinical reasoning and outperform prior methods on both benchmarks. Overall, our approach holds promise for enhancing real-world diagnostic workflows by enabling more reliable and transparent 3D medical vision-language systems.
zh
[CV-154] EMFormer: Efficient Multi-Scale Transformer for Accumulative Context Weather Forecasting
【速读】:该论文旨在解决长期天气预测中面临的三个核心问题:灾难性遗忘(catastrophic forgetting)、误差累积(error accumulation)以及高训练开销(high training overhead)。为应对这些挑战,其解决方案的关键在于提出一个端到端的优化流程,涵盖预训练、微调和预测阶段。其中,创新性地设计了高效多尺度Transformer(Efficient Multi-scale Transformer, EMFormer),通过单次卷积实现训练与推理中的多尺度特征提取,显著降低计算复杂度;同时引入累积上下文微调策略(accumulative context finetuning),在不损害短期精度的前提下增强时间一致性;此外,提出一种复合损失函数(composite loss),利用正弦加权机制动态平衡不同损失项,自适应引导优化路径。实验表明,该方法在长期天气预报和极端事件预测任务中均取得显著性能提升,并展现出良好的跨模态泛化能力(如ImageNet-1K和ADE20K视觉基准)。
链接: https://arxiv.org/abs/2602.01194
作者: Hao Chen,Tao Han,Jie Zhang,Song Guo,Fenghua Ling,Lei Bai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Long-term weather forecasting is critical for socioeconomic planning and disaster preparedness. While recent approaches employ finetuning to extend prediction horizons, they remain constrained by the issues of catastrophic forgetting, error accumulation, and high training overhead. To address these limitations, we present a novel pipeline across pretraining, finetuning and forecasting to enhance long-context modeling while reducing computational overhead. First, we introduce an Efficient Multi-scale Transformer (EMFormer) to extract multi-scale features through a single convolution in both training and inference. Based on the new architecture, we further employ an accumulative context finetuning to improve temporal consistency without degrading short-term accuracy. Additionally, we propose a composite loss that dynamically balances different terms via a sinusoidal weighting, thereby adaptively guiding the optimization trajectory throughout pretraining and finetuning. Experiments show that our approach achieves strong performance in weather forecasting and extreme event prediction, substantially improving long-term forecast accuracy. Moreover, EMFormer demonstrates strong generalization on vision benchmarks (ImageNet-1K and ADE20K) while delivering a 5.69x speedup over conventional multi-scale modules.
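摘要提到的“正弦加权动态平衡各损失项”可以写成一个简单的调度函数;相位错开与归一化方式均为笔者假设的实现方式:

```python
import math

def sinusoidal_weights(step, total_steps, n_terms):
    """相位互补的正弦权重调度(示意):各损失项权重随训练进度平滑摆动。"""
    phase = math.pi * step / total_steps
    raw = [0.5 * (1.0 + math.sin(phase + i * math.pi / n_terms))
           for i in range(n_terms)]
    s = sum(raw)
    return [w / s for w in raw]                 # 归一化,权重和为 1

# 用法示意:w = sinusoidal_weights(step, T, len(losses))
#           loss = sum(wi * li for wi, li in zip(w, losses))
```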
zh
[CV-155] Refining Context-Entangled Content Segmentation via Curriculum Selection and Anti-Curriculum Promotion
【速读】:该论文旨在解决上下文纠缠内容分割(Context-Entangled Content Segmentation, CECS)问题,即在目标物体与其背景具有相似视觉模式(如伪装目标检测)的情况下,如何提升分割模型的鲁棒性和泛化能力。传统分割网络主要依赖架构改进,忽视了训练过程中样本难度动态变化对模型稳健性的影响。其解决方案的关键在于提出CurriSeg双阶段学习框架:第一阶段为课程选择(Curriculum Selection),通过动态分析样本损失的时间统计特性,识别出具有信息量的难样本并过滤噪声或模糊样本,实现稳定的能力增强;第二阶段为反课程促进(Anti-Curriculum Promotion),设计**频谱盲化微调(Spectral-Blindness Fine-Tuning)**策略,抑制高频细节以强制模型依赖低频结构和上下文线索,从而提升泛化性能。该方法无需增加参数或训练时间,在多个CECS基准上均取得一致改进,揭示了训练进度与挑战性之间协同作用对构建鲁棒、上下文感知分割模型的重要性。
链接: https://arxiv.org/abs/2602.01183
作者: Chunming He,Rihan Zhang,Fengyang Xiao,Dingming Zhang,Zhiwen Cao,Sina Farsiu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 figures, 11 tables
Abstract:Biological learning proceeds from easy to difficult tasks, gradually reinforcing perception and robustness. Inspired by this principle, we address Context-Entangled Content Segmentation (CECS), a challenging setting where objects share intrinsic visual patterns with their surroundings, as in camouflaged object detection. Conventional segmentation networks predominantly rely on architectural enhancements but often ignore the learning dynamics that govern robustness under entangled data distributions. We introduce CurriSeg, a dual-phase learning framework that unifies curriculum and anti-curriculum principles to improve representation reliability. In the Curriculum Selection phase, CurriSeg dynamically selects training data based on the temporal statistics of sample losses, distinguishing hard-but-informative samples from noisy or ambiguous ones, thus enabling stable capability enhancement. In the Anti-Curriculum Promotion phase, we design Spectral-Blindness Fine-Tuning, which suppresses high-frequency components to enforce dependence on low-frequency structural and contextual cues and thus strengthens generalization. Extensive experiments demonstrate that CurriSeg achieves consistent improvements across diverse CECS benchmarks without adding parameters or increasing total training time, offering a principled view of how progression and challenge interplay to foster robust and context-aware segmentation. Code will be released.
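第一阶段“依据损失的时间统计区分难样本与噪声样本”可以用如下启发式示意来理解;具体评分函数是笔者为演示而设,论文的判据可能不同:

```python
import numpy as np

def curriculum_select(loss_history, keep_ratio=0.8):
    """基于样本损失时间统计的课程选择(示意)。

    loss_history: (E, N),近 E 个 epoch 内每个样本的损失。
    直觉:均值高且仍在下降 → 难而有信息量;均值高且波动大 → 噪声/歧义。
    """
    mean = loss_history.mean(axis=0)                 # 难度
    trend = loss_history[-1] - loss_history[0]       # 负值表示仍可学习
    var = loss_history.var(axis=0)                   # 不稳定性
    score = mean - trend - var                       # 难、可学、稳定者得分高
    k = max(1, int(keep_ratio * loss_history.shape[1]))
    return np.argsort(-score)[:k]                    # 保留的样本索引
```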
zh
[CV-156] EEmo-Logic: A Unified Dataset and Multi-Stage Framework for Comprehensive Image-Evoked Emotion Assessment
【速读】:该论文旨在解决当前模型在图像诱发情绪理解(image-evoked emotion understanding)中存在的情绪感知粒度粗略和推理能力不足的问题。解决方案的关键在于构建了目前规模最大、维度最全的图像诱发情绪理解数据集EEmoDB,包含5个分析维度和5类任务,并通过自动化生成与人工精标相结合的方式获得1.2M问答对(EEmoDB-QA)和36k细粒度评估数据(EEmoDB-Assess);同时提出EEmo-Logic,一个基于指令微调和任务定制化组相对偏好优化(GRPO)的多模态大语言模型(MLLM),并设计新颖奖励机制以提升模型在情绪问答与细粒度评估任务中的表现。
链接: https://arxiv.org/abs/2602.01173
作者: Lancheng Gao,Ziheng Jia,Zixuan Xing,Wei Sun,Huiyu Duan,Guangtao Zhai,Xiongkuo Min
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Understanding the multi-dimensional attributes and intensity nuances of image-evoked emotions is pivotal for advancing machine empathy and empowering diverse human-computer interaction applications. However, existing models are still limited to coarse-grained emotion perception or deficient reasoning capabilities. To bridge this gap, we introduce EEmoDB, the largest image-evoked emotion understanding dataset to date. It features 5 analysis dimensions spanning 5 distinct task categories, facilitating comprehensive interpretation. Specifically, we compile 1.2M question-answering (QA) pairs (EEmoDB-QA) from 125k images via automated generation, alongside a 36k dataset (EEmoDB-Assess) curated from 25k images for fine-grained assessment. Furthermore, we propose EEmo-Logic, an all-in-one multimodal large language model (MLLM) developed via instruction fine-tuning and task-customized group relative preference optimization (GRPO) with novel reward design. Extensive experiments demonstrate that EEmo-Logic achieves robust performance in in-domain and cross-domain datasets, excelling in emotion QA and fine-grained assessment. The code is available at this https URL.
zh
[CV-157] Semantically Aware UAV Landing Site Assessment from Remote Sensing Imagery via Multimodal Large Language Models
【速读】:该论文旨在解决无人机(UAV)在紧急情况下安全着陆时面临的复杂语义风险识别问题,传统几何传感器难以感知如人群、临时建筑等非结构化环境风险。解决方案的关键在于提出一种基于遥感(Remote Sensing, RS)影像与多模态大语言模型(Multimodal Large Language Models, MLLMs)融合的全局上下文感知着陆点评估框架,采用“粗到精”(coarse-to-fine)的两阶段策略:首先通过轻量级语义分割模块高效筛选候选区域;随后由视觉-语言推理代理融合视觉特征与兴趣点(Point-of-Interest, POI)数据,精准识别细微隐患,从而显著提升风险识别准确率并生成可解释的人类可理解理由,增强自动化决策的信任度。
链接: https://arxiv.org/abs/2602.01163
作者: Chunliang Hua,Zeyuan Yang,Lei Zhang,Jiayang Sun,Fengwen Chen,Chunlan Zeng,Xiao Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Safe UAV emergency landing requires more than just identifying flat terrain; it demands understanding complex semantic risks (e.g., crowds, temporary structures) invisible to traditional geometric sensors. In this paper, we propose a novel framework leveraging Remote Sensing (RS) imagery and Multimodal Large Language Models (MLLMs) for global context-aware landing site assessment. Unlike local geometric methods, our approach employs a coarse-to-fine pipeline: first, a lightweight semantic segmentation module efficiently pre-screens candidate areas; second, a vision-language reasoning agent fuses visual features with Point-of-Interest (POI) data to detect subtle hazards. To validate this approach, we construct and release the Emergency Landing Site Selection (ELSS) benchmark. Experiments demonstrate that our framework significantly outperforms geometric baselines in risk identification accuracy. Furthermore, qualitative results confirm its ability to generate human-like, interpretable justifications, enhancing trust in automated decision-making. The benchmark dataset is publicly accessible at this https URL.
zh
[CV-158] Improving Robustness of Vision-Language-Action Models by Restoring Corrupted Visual Inputs
【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在真实场景中因图像退化(image corruptions)导致的性能急剧下降问题。这类退化包括电子噪声、坏像素和镜头污染等传感器级干扰,会直接破坏视觉信号完整性,使当前最先进的VLA模型(如π₀.₅和SmolVLA)在常见图像退化下成功率从90%骤降至2%。解决方案的关键在于提出一种即插即用且与模型无关的“退化恢复Transformer”(Corruption Restoration Transformer, CRT),其通过对抗训练目标从受损输入中重建干净观测,无需对底层VLA模型进行昂贵的微调,从而有效提升VLA在严重视觉退化下的鲁棒性,并在LIBERO和Meta-World基准上恢复接近基线的成功率。
链接: https://arxiv.org/abs/2602.01158
作者: Daniel Yezid Guarnizo Orjuela,Leonardo Scappatura,Veronica Di Gennaro,Riccardo Andrea Izzo,Gianluca Bardaro,Matteo Matteucci
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Vision-Language-Action (VLA) models have emerged as a dominant paradigm for generalist robotic manipulation, unifying perception and control within a single end-to-end architecture. However, despite their success in controlled environments, reliable real-world deployment is severely hindered by their fragility to visual disturbances. While existing literature extensively addresses physical occlusions caused by scene geometry, a critical mode remains largely unexplored: image corruptions. These sensor-level artifacts, ranging from electronic noise and dead pixels to lens contaminants, directly compromise the integrity of the visual signal prior to interpretation. In this work, we quantify this vulnerability, demonstrating that state-of-the-art VLAs such as \pi_0.5 and SmolVLA, suffer catastrophic performance degradation, dropping from 90% success rates to as low as 2%, under common signal artifacts. To mitigate this, we introduce the Corruption Restoration Transformer (CRT), a plug-and-play and model-agnostic vision transformer designed to immunize VLA models against sensor disturbances. Leveraging an adversarial training objective, CRT restores clean observations from corrupted inputs without requiring computationally expensive fine-tuning of the underlying model. Extensive experiments across the LIBERO and Meta-World benchmarks demonstrate that CRT effectively recovers lost performance, enabling VLAs to maintain near-baseline success rates, even under severe visual corruption.
zh
[CV-159] Statistical MIA: Rethinking Membership Inference Attack for Reliable Unlearning Auditing
【速读】:该论文旨在解决机器遗忘(Machine Unlearning, MU)中对模型是否真正遗忘指定训练数据的可靠性审计问题。现有方法普遍依赖成员推理攻击(Membership Inference Attacks, MIAs)进行审计,但作者指出,MIA 的二分类判定存在不可观测的统计误差,导致对遗忘效果的评估过于乐观且计算开销大。解决方案的关键在于提出一种全新的训练-free 审计框架——统计成员推理攻击(Statistical Membership Inference Attack, SMIA),其通过直接对成员与非成员数据分布进行统计检验,无需训练攻击模型,并能输出带置信区间的遗忘率,从而实现更可靠、高效且具有理论保障的审计结果。
链接: https://arxiv.org/abs/2602.01150
作者: Jialong Sun,Zeming Wei,Jiaxuan Zou,Jiacheng Gong,Guanheng Wang,Chengyang Dong,Jialong Li,Bo Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
备注:
Abstract:Machine unlearning (MU) is essential for enforcing the right to be forgotten in machine learning systems. A key challenge of MU is how to reliably audit whether a model has truly forgotten specified training data. Membership Inference Attacks (MIAs) are widely used for unlearning auditing, where samples that evade membership detection are often regarded as successfully forgotten. After carefully revisiting the reliability of MIA, we show that this assumption is flawed: failed membership inference does not imply true forgetting. We theoretically demonstrate that MIA-based auditing, when formulated as a binary classification problem, inevitably incurs statistical errors whose magnitude cannot be observed during the auditing process. This leads to overly optimistic evaluations of unlearning performance, while incurring substantial computational overhead due to shadow model training. To address these limitations, we propose Statistical Membership Inference Attack (SMIA), a novel training-free and highly effective auditing framework. SMIA directly compares the distributions of member and non-member data using statistical tests, eliminating the need for learned attack models. Moreover, SMIA outputs both a forgetting rate and a corresponding confidence interval, enabling quantified reliability of the auditing results. Extensive experiments show that SMIA provides more reliable auditing with significantly lower computational cost than existing MIA-based approaches. Notably, the theoretical guarantees and empirical effectiveness of SMIA suggest it as a new paradigm for reliable machine unlearning auditing.
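摘要的要点是“免训练、直接做分布层面的统计检验,并给出置信区间”。以下为一个最小示意,采用双样本 KS 检验与 bootstrap 置信区间;统计量与置信区间的具体定义是笔者假设,不必与论文一致:

```python
import numpy as np
from scipy import stats

def smia_audit(forget_losses, nonmember_losses, n_boot=1000, seed=0):
    """免训练遗忘审计示意:检验待遗忘样本与非成员样本的损失分布是否可区分。

    p 值大 → 两分布不可区分,支持“已遗忘”;
    同时对 KS 统计量做 bootstrap,给出审计结论的置信区间。
    """
    res = stats.ks_2samp(forget_losses, nonmember_losses)
    rng = np.random.default_rng(seed)
    boots = []
    for _ in range(n_boot):
        f = rng.choice(forget_losses, forget_losses.size)
        n = rng.choice(nonmember_losses, nonmember_losses.size)
        boots.append(stats.ks_2samp(f, n).statistic)
    lo, hi = np.quantile(boots, [0.025, 0.975])
    return res.pvalue, (float(lo), float(hi))
```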
zh
[CV-160] Koo-Fu CLIP: Closed-Form Adaptation of Vision-Language Models via Fukunaga-Koontz Linear Discriminant Analysis
【速读】:该论文旨在解决视觉语言模型(如CLIP)在监督分类任务中存在类间区分度不足和维度冗余的问题,即其原始嵌入空间虽具通用表征能力,但缺乏针对特定分类任务的优化。解决方案的关键在于提出Koo-Fu CLIP方法,该方法基于Fukunaga-Koontz线性判别分析(Fukunaga-Koontz Linear Discriminant Analysis),在白化后的嵌入空间中进行闭式线性投影,从而有效抑制类内差异并增强类间可分性,同时实现显著的降维效果。此方法在ImageNet-1K、14K及21K类别场景下均提升了最近邻原型分类的准确率,并支持高达10–12倍的压缩比而保持性能稳定,为大规模图像分类与检索提供了高效且轻量的适配方案。
链接: https://arxiv.org/abs/2602.01127
作者: Matej Suchanek,Klara Janouskova,Ondrej Vasatko,Jiri Matas
机构: Czech Technical University in Prague (捷克技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual-language models such as CLIP provide powerful general-purpose representations, but their raw embeddings are not optimized for supervised classification, often exhibiting limited class separation and excessive dimensionality. We propose Koo-Fu CLIP, a supervised CLIP adaptation method based on Fukunaga-Koontz Linear Discriminant Analysis, which operates in a whitened embedding space to suppress within-class variation and enhance between-class discrimination. The resulting closed-form linear projection reshapes the geometry of CLIP embeddings, improving class separability while performing effective dimensionality reduction, and provides a lightweight and efficient adaptation of CLIP representations. Across large-scale ImageNet benchmarks, nearest visual prototype classification in the Koo-Fu CLIP space improves top-1 accuracy from 75.1% to 79.1% on ImageNet-1K, with consistent gains persisting as the label space expands to 14K and 21K classes. The method supports substantial compression by up to 10-12x with little or no loss in accuracy, enabling efficient large-scale classification and retrieval.
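闭式线性投影的骨架是“类内散度白化 + 白化空间中对类间散度做特征分解”。以下为 Fukunaga-Koontz 风格 LDA 的最小数值示意(正则化 eps、保留维数等为笔者假设,具体闭式形式以论文为准):

```python
import numpy as np

def koofu_projection(X, y, eps=1e-5):
    """白化空间中的闭式判别投影(示意)。

    X: (N, D) CLIP 嵌入;y: (N,) 类别标签。
    1) 用类内散度 Sw 白化;2) 对白化后的类间散度 Sb 做特征分解,
    取前 C-1 个主方向作为判别投影,同时完成降维。
    """
    classes = np.unique(y)
    mu = X.mean(0)
    Sw = sum(np.cov(X[y == c].T) * (np.sum(y == c) - 1) for c in classes)
    Sb = sum(np.sum(y == c) * np.outer(X[y == c].mean(0) - mu,
                                       X[y == c].mean(0) - mu) for c in classes)
    w_vals, w_vecs = np.linalg.eigh(Sw + eps * np.eye(X.shape[1]))
    W = w_vecs @ np.diag(w_vals ** -0.5) @ w_vecs.T        # 类内白化 Sw^{-1/2}
    b_vals, b_vecs = np.linalg.eigh(W @ Sb @ W)            # 白化空间的类间方向
    P = (W @ b_vecs[:, ::-1])[:, :len(classes) - 1]        # 按特征值降序取前 C-1 维
    return P                                               # 投影:Z = (X - mu) @ P
```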
zh
[CV-161] LightCity: An Urban Dataset for Outdoor Inverse Rendering and Reconstruction under Multi-illumination Conditions
【Quick Read】: This paper targets the challenges that inverse rendering faces in urban scenes, where complex illumination conditions (multiple light sources, indirect light, and shadow effects) degrade intrinsic decomposition and 3D reconstruction; these effects have remained underexplored for lack of suitable high-quality datasets. The key to the solution is LightCity, a novel high-fidelity synthetic urban dataset featuring over 300 highly controllable sky maps and more than 50K images spanning street-level and aerial viewpoints, annotated with rich properties such as depth, normals, material components, and direct and indirect light. It provides a reliable benchmark for three fundamental inverse-rendering tasks in urban environments, laying a robust foundation for advancing related research.
Link: https://arxiv.org/abs/2602.01118
Authors: Jingjing Wang,Qirui Hu,Chong Bao,Yuke Zhu,Hujun Bao,Zhaopeng Cui,Guofeng Zhang
Affiliations: State Key Lab of CAD&CG, Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Inverse rendering in urban scenes is pivotal for applications like autonomous driving and digital twins. Yet, it faces significant challenges due to complex illumination conditions, including multi-illumination and indirect light and shadow effects. However, the effects of these challenges on intrinsic decomposition and 3D reconstruction have not been explored due to the lack of appropriate datasets. In this paper, we present LightCity, a novel high-quality synthetic urban dataset featuring diverse illumination conditions with realistic indirect light and shadow effects. LightCity encompasses over 300 sky maps with highly controllable illumination, varying scales with street-level and aerial perspectives over 50K images, and rich properties such as depth, normal, material components, light and indirect light, etc. Besides, we leverage LightCity to benchmark three fundamental tasks in the urban environments and conduct a comprehensive analysis of these benchmarks, laying a robust foundation for advancing related research.
[CV-162] KAN We Flow? Advancing Robotic Manipulation with 3D Flow Matching via KAN RWKV ICRA2026
【Quick Read】: This paper tackles the inference inefficiency of diffusion-based visuomotor policies on resource-constrained robots, which stems from the many denoising iterations and heavy UNet backbones that diffusion requires. The key to the solution is KAN-We-Flow, a lightweight flow-matching policy network: an RWKV-KAN block first performs efficient time/channel mixing via RWKV to propagate task context, after which a GroupKAN layer applies learnable spline-based, feature-wise nonlinear calibration to the action mapping. An Action Consistency Regularization (ACR) further enforces alignment between predicted action trajectories and expert demonstrations via Euler extrapolation, providing extra supervision that stabilizes training and improves precision. Without large UNets, the design reduces parameters by 86.8%, keeps inference fast, and achieves state-of-the-art performance on the Adroit, Meta-World, and DexArt 3D manipulation benchmarks.
Link: https://arxiv.org/abs/2602.01115
Authors: Zhihao Chen,Yiyuan Ge,Ziyang Wang
Affiliations: Beijing University of Posts and Telecommunications; Beijing Hydrogen Intelligence Technology Co. Ltd.; South China University of Technology; University of Oxford
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICRA2026
Abstract:Diffusion-based visuomotor policies excel at modeling action distributions but are inference-inefficient, since recursively denoising from noise to policy requires many steps and heavy UNet backbones, which hinders deployment on resource-constrained robots. Flow matching alleviates the sampling burden by learning a one-step vector field, yet prior implementations still inherit large UNet-style architectures. In this work, we present KAN-We-Flow, a flow-matching policy that draws on recent advances in Receptance Weighted Key Value (RWKV) and Kolmogorov-Arnold Networks (KAN) from vision to build a lightweight and highly expressive backbone for 3D manipulation. Concretely, we introduce an RWKV-KAN block: an RWKV first performs efficient time/channel mixing to propagate task context, and a subsequent GroupKAN layer applies learnable spline-based, groupwise functional mappings to perform feature-wise nonlinear calibration of the action mapping on RWKV outputs. Moreover, we introduce an Action Consistency Regularization (ACR), a lightweight auxiliary loss that enforces alignment between predicted action trajectories and expert demonstrations via Euler extrapolation, providing additional supervision to stabilize training and improve policy precision. Without resorting to large UNets, our design reduces parameters by 86.8%, maintains fast runtime, and achieves state-of-the-art success rates on Adroit, Meta-World, and DexArt benchmarks. Our project page can be viewed at this https URL.
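A flow-matching objective plus an Euler-extrapolation consistency term can be sketched as below; the linear interpolation path, the model signature `model(obs, x_t, t)`, and the weight `lam` are illustrative assumptions rather than the paper's exact recipe.

```python
# Sketch of flow matching with an ACR-style Euler-extrapolation penalty.
import torch

def flow_matching_with_acr(model, obs, expert_action, lam=0.1):
    noise = torch.randn_like(expert_action)
    t = torch.rand(expert_action.shape[0], 1, device=expert_action.device)
    x_t = (1 - t) * noise + t * expert_action       # linear path: noise -> action
    target_v = expert_action - noise                # ground-truth velocity on this path
    pred_v = model(obs, x_t, t)                     # hypothetical policy signature
    fm_loss = ((pred_v - target_v) ** 2).mean()
    # ACR: one Euler step from x_t to t=1 should land on the expert action.
    extrapolated = x_t + (1 - t) * pred_v
    acr_loss = ((extrapolated - expert_action) ** 2).mean()
    return fm_loss + lam * acr_loss
```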
[CV-163] Robust Harmful Meme Detection under Missing Modalities via Shared Representation Learning WWW2026
【Quick Read】: This paper addresses the performance degradation of harmful meme detection methods under missing modalities in real-world settings, e.g., when text is unavailable due to poor OCR quality; existing methods typically assume modal-complete inputs with both image and text. The key to the solution is a new baseline that projects each modality independently into a shared representation space, so that visual features can still be exploited when text is missing. Experiments on two benchmark datasets show that the method outperforms existing approaches under modal-incomplete conditions, integrates visual features more effectively, reduces dependence on text, and improves robustness.
Link: https://arxiv.org/abs/2602.01101
Authors: Felix Breiteneder,Mohammad Belal,Muhammad Saad Saeed,Shahed Masoudian,Usman Naseem,Kulshrestha Juhi,Markus Schedl,Shah Nawaz
Affiliations: Johannes Kepler University Linz; Aalto University; University of Michigan-Flint; Macquarie University; Institute of Computational Perception, Johannes Kepler University Linz and Linz Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at WWW2026
Abstract:Internet memes are powerful tools for communication, capable of spreading political, psychological, and sociocultural ideas. However, they can be harmful and can be used to disseminate hate toward targeted individuals or groups. Although previous studies have focused on designing new detection methods, these often rely on modal-complete data, such as text and images. In real-world settings, however, modalities like text may be missing due to issues like poor OCR quality, making existing methods sensitive to missing information and leading to performance deterioration. To address this gap, in this paper, we present the first-of-its-kind work to comprehensively investigate the behavior of harmful meme detection methods in the presence of modal-incomplete data. Specifically, we propose a new baseline method that learns a shared representation for multiple modalities by projecting them independently. These shared representations can then be leveraged when data is modal-incomplete. Experimental results on two benchmark datasets demonstrate that our method outperforms existing approaches when text is missing. Moreover, these results suggest that our method allows for better integration of visual features, reducing dependence on text and improving robustness in scenarios where textual information is missing. Our work represents a significant step forward in enabling the real-world application of harmful meme detection, particularly in situations where a modality is absent.
[CV-164] PandaPose: 3D Human Pose Lifting from a Single Image via Propagating 2D Pose Prior to 3D Anchor Space NEURIPS2025
【Quick Read】: This paper addresses two core problems in 3D human pose lifting from a single RGB image: existing methods map 2D features directly to 3D poses, so errors in the input 2D pose predictions propagate into the 3D output, and they struggle to resolve ambiguities caused by self-occlusion. The key to the solution is PandaPose, a framework built around a unified intermediate representation, the 3D anchor space, with three components: (1) joint-wise 3D anchors in a canonical coordinate system that provide accurate and robust priors to mitigate 2D estimation errors; (2) depth-aware joint-wise feature lifting that hierarchically integrates depth information to resolve self-occlusion ambiguities; and (3) an anchor-feature interaction decoder that combines the 3D anchors with lifted features into unified anchor queries encapsulating joint-wise anchors, visual cues, and geometric depth information, which then drive anchor-to-joint ensemble prediction, markedly improving accuracy and robustness under challenging conditions.
Link: https://arxiv.org/abs/2602.01095
Authors: Jinghong Zheng,Changlong Jiang,Yang Xiao,Jiaqi Li,Haohong Kuang,Hang Xu,Ran Wang,Zhiguo Cao,Min Du,Joey Tianyi Zhou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at NeurIPS 2025
Abstract:3D human pose lifting from a single RGB image is a challenging task in 3D vision. Existing methods typically establish a direct joint-to-joint mapping from 2D to 3D poses based on 2D features. This formulation suffers from two fundamental limitations: inevitable error propagation from input predicted 2D pose to 3D predictions and inherent difficulties in handling self-occlusion cases. In this paper, we propose PandaPose, a 3D human pose lifting approach via propagating 2D pose prior to 3D anchor space as the unified intermediate representation. Specifically, our 3D anchor space comprises: (1) Joint-wise 3D anchors in the canonical coordinate system, providing accurate and robust priors to mitigate 2D pose estimation inaccuracies. (2) Depth-aware joint-wise feature lifting that hierarchically integrates depth information to resolve self-occlusion ambiguities. (3) The anchor-feature interaction decoder that incorporates 3D anchors with lifted features to generate unified anchor queries encapsulating joint-wise 3D anchor set, visual cues and geometric depth information. The anchor queries are further employed to facilitate anchor-to-joint ensemble prediction. Experiments on three well-established benchmarks (i.e., Human3.6M, MPI-INF-3DHP and 3DPW) demonstrate the superiority of our proposition. The substantial reduction in error by 14.7% compared to SOTA methods on the challenging conditions of Human3.6M and qualitative comparisons further showcase the effectiveness and robustness of our approach.
[CV-165] Differential Vector Erasure: Unified Training-Free Concept Erasure for Flow Matching Models
【Quick Read】: This paper aims to enable effective erasure of specific concepts (such as NSFW content, copyrighted artistic styles, or particular objects) in flow matching image generators; existing concept-erasure methods target DDPM-based diffusion models and rely on costly fine-tuning, so they do not directly apply to this new generative paradigm. The key to the solution is Differential Vector Erasure (DVE), a training-free method built on the insight that semantic concepts are implicitly encoded in the directional structure of the generative flow's velocity field: a differential vector field between the target concept and an anchor concept is constructed, and at inference the velocity field is projected onto this differential direction to selectively remove the concept-specific component while leaving unrelated semantics intact, enabling precise, quality-preserving controllable generation.
Link: https://arxiv.org/abs/2602.01089
Authors: Zhiqi Zhang,Xinhao Zhong,Yi Sun,Shuoyang Sun,Bin Chen,Shu-Tao Xia,Xuan Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Text-to-image diffusion models have demonstrated remarkable capabilities in generating high-quality images, yet their tendency to reproduce undesirable concepts, such as NSFW content, copyrighted styles, or specific objects, poses growing concerns for safe and controllable deployment. While existing concept erasure approaches primarily focus on DDPM-based diffusion models and rely on costly fine-tuning, the recent emergence of flow matching models introduces a fundamentally different generative paradigm for which prior methods are not directly applicable. In this paper, we propose Differential Vector Erasure (DVE), a training-free concept erasure method specifically designed for flow matching models. Our key insight is that semantic concepts are implicitly encoded in the directional structure of the velocity field governing the generative flow. Leveraging this observation, we construct a differential vector field that characterizes the directional discrepancy between a target concept and a carefully chosen anchor concept. During inference, DVE selectively removes concept-specific components by projecting the velocity field onto the differential direction, enabling precise concept suppression without affecting irrelevant semantics. Extensive experiments on FLUX demonstrate that DVE consistently outperforms existing baselines on a wide range of concept erasure tasks, including NSFW suppression, artistic style removal, and object erasure, while preserving image quality and diversity.
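The core projection step admits a compact sketch: given the current velocity and velocities conditioned on the target and anchor concepts, remove the component along their normalized difference. Tensor shapes and the `strength` knob are assumptions for illustration.

```python
# Sketch of the differential-direction projection behind DVE.
import torch

def dve_velocity(v, v_target, v_anchor, strength=1.0):
    """v, v_target, v_anchor: (..., d) velocity fields from the same model
    under unconditional/target/anchor prompts (hypothetical inputs)."""
    d = v_target - v_anchor                        # differential vector field
    d = d / (d.norm(dim=-1, keepdim=True) + 1e-8)  # normalized concept direction
    coeff = (v * d).sum(dim=-1, keepdim=True)      # projection coefficient
    return v - strength * coeff * d                # remove concept-specific component
```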
[CV-166] MedAD-R1: Eliciting Consistent Reasoning in Interpretible Medical Anomaly Detection via Consistency-Reinforced Policy Optimization
【Quick Read】: This paper addresses the lack of plausible reasoning and multimodal generalization in medical anomaly detection (MedAD) models trained by supervised fine-tuning (SFT) on simplistic, fragmented datasets. The key to the solution is a two-stage training framework: the first stage, Cognitive Injection, uses SFT to instill foundational medical knowledge and align the model with a structured think-then-answer paradigm; the second stage introduces Consistency Group Relative Policy Optimization (Con-GRPO), which adds a consistency reward to ensure the generated reasoning is logically coherent with the final diagnosis, improving the model's ability to produce transparent, coherent diagnostic reasoning paths. On MedAD-38K, the first large-scale, multi-center MedAD benchmark, the resulting model significantly outperforms strong baselines, demonstrating its potential to enhance the trustworthiness and interpretability of clinical decision support systems.
Link: https://arxiv.org/abs/2602.01081
Authors: Haitao Zhang,Yingying Wang,Jiaxiang Wang,Haote Xu,Hongyang Zhang,Yirong Chen,Yue Huang,Xinghao Ding
Affiliations: School of Informatics, Xiamen University; Institute of Artificial Intelligence, Xiamen University; Zhejiang Expressway Co., Ltd.; School of Transportation Science and Engineering, Beihang University; School of Science and Engineering, Chinese University of Hong Kong; Shanghai Artificial Intelligence Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 4 figures
Abstract:Medical Anomaly Detection (MedAD) presents a significant opportunity to enhance diagnostic accuracy using Large Multimodal Models (LMMs) to interpret and answer questions based on medical images. However, the reliance on Supervised Fine-Tuning (SFT) on simplistic and fragmented datasets has hindered the development of models capable of plausible reasoning and robust multimodal generalization. To overcome this, we introduce MedAD-38K, the first large-scale, multi-modal, and multi-center benchmark for MedAD featuring diagnostic Chain-of-Thought (CoT) annotations alongside structured Visual Question-Answering (VQA) pairs. On this foundation, we propose a two-stage training framework. The first stage, Cognitive Injection, uses SFT to instill foundational medical knowledge and align the model with a structured think-then-answer paradigm. Given that standard policy optimization can produce reasoning that is disconnected from the final answer, the second stage incorporates Consistency Group Relative Policy Optimization (Con-GRPO). This novel algorithm incorporates a crucial consistency reward to ensure the generated reasoning process is relevant and logically coherent with the final diagnosis. Our proposed model, MedAD-R1, achieves state-of-the-art (SOTA) performance on the MedAD-38K benchmark, outperforming strong baselines by more than 10%. This superior performance stems from its ability to generate transparent and logically consistent reasoning pathways, offering a promising approach to enhancing the trustworthiness and interpretability of AI for clinical decision support.
[CV-167] PISA: Piecewise Sparse Attention Is Wiser for Efficient Diffusion Transformers
【Quick Read】: This paper addresses the efficiency bottleneck caused by the quadratic complexity of attention in diffusion transformers for video and image generation. Conventional block sparse attention attends only to critical key-value blocks, but at high sparsity it degrades because information in the discarded non-critical blocks is lost. The key to the solution is the observation that attention scores of non-critical blocks exhibit distributional stability, so they can be approximated efficiently and accurately rather than dropped. Building on this insight, the authors propose PISA, a training-free piecewise sparse attention with an exact-or-approximate strategy: critical blocks are computed exactly, while the remaining blocks are approximated via block-wise Taylor expansion, allowing PISA to stay close to full-attention quality at sub-quadratic complexity and to effectively balance speed against generation quality.
Link: https://arxiv.org/abs/2602.01077
Authors: Haopeng Li,Shitong Shao,Wenliang Zhong,Zikai Zhou,Lichen Bai,Hui Xiong,Zeke Xie
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages
Abstract:Diffusion Transformers are fundamental for video and image generation, but their efficiency is bottlenecked by the quadratic complexity of attention. While block sparse attention accelerates computation by attending only critical key-value blocks, it suffers from degradation at high sparsity by discarding context. In this work, we discover that attention scores of non-critical blocks exhibit distributional stability, allowing them to be approximated accurately and efficiently rather than discarded, which is essentially important for sparse attention design. Motivated by this key insight, we propose PISA, a training-free Piecewise Sparse Attention that covers the full attention span with sub-quadratic complexity. Unlike the conventional keep-or-drop paradigm that directly drop the non-critical block information, PISA introduces a novel exact-or-approximate strategy: it maintains exact computation for critical blocks while efficiently approximating the remainder through block-wise Taylor expansion. This design allows PISA to serve as a faithful proxy to full attention, effectively bridging the gap between speed and quality. Experimental results demonstrate that PISA achieves 1.91 times and 2.57 times speedups on Wan2.1-14B and Hunyuan-Video, respectively, while consistently maintaining the highest quality among sparse attention methods. Notably, even for image generation on FLUX, PISA achieves a 1.2 times acceleration without compromising visual quality. Code is available at: this https URL.
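A toy, single-query version of the exact-or-approximate idea is sketched below, assuming a first-order Taylor expansion of exp() around each non-critical block's mean score; the paper's kernel-level formulation with precomputed block statistics is considerably more involved.

```python
# Toy piecewise attention for one query: exact softmax on critical key blocks,
# first-order Taylor approximation of exp() elsewhere.
import numpy as np

def pisa_attention_row(q, K, V, critical, block=64):
    """q: (d,), K: (n, d), V: (n, dv); `critical` is a set of block indices."""
    nblk = K.shape[0] // block
    num = np.zeros(V.shape[1])
    den = 0.0
    for b in range(nblk):
        s = K[b * block:(b + 1) * block] @ q / np.sqrt(q.shape[0])
        Vb = V[b * block:(b + 1) * block]
        if b in critical:                    # exact contribution
            w = np.exp(s)
        else:                                # exp(s) ~ exp(m) * (1 + s - m)
            m = s.mean()                     # valid when scores cluster near m;
            w = np.exp(m) * (1.0 + s - m)    # weights can go slightly negative
        num += w @ Vb
        den += w.sum()
    return num / den
```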
[CV-168] PDE-Constrained Optimization for Neural Image Segmentation with Physics Priors
【Quick Read】: This paper addresses the ill-posedness of microscopy image segmentation caused by measurement noise, weak object boundaries, and limited labeled data; deep networks offer flexible nonparametric estimation, but unconstrained empirical risk minimization often yields unstable solutions and poor generalization. The key to the solution is to formulate segmentation as a PDE-constrained optimization problem that injects physically motivated priors into deep models through variational regularization, implemented as differentiable residual penalties derived from reaction-diffusion equations and phase-field interface energies added to the loss, balancing data fidelity against structural priors. Experiments on the LIVECell dataset show clear gains in segmentation accuracy and boundary fidelity, along with greater stability and generalization in low-sample regimes.
Link: https://arxiv.org/abs/2602.01069
Authors: Seema K. Poudel,Sunny K. Khadka
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Segmentation of microscopy images constitutes an ill-posed inverse problem due to measurement noise, weak object boundaries, and limited labeled data. Although deep neural networks provide flexible nonparametric estimators, unconstrained empirical risk minimization often leads to unstable solutions and poor generalization. In this work, image segmentation is formulated as a PDE-constrained optimization problem that integrates physically motivated priors into deep learning models through variational regularization. The proposed framework minimizes a composite objective function consisting of a data fidelity term and penalty terms derived from reaction-diffusion equations and phase-field interface energies, all implemented as differentiable residual losses. Experiments are conducted on the LIVECell dataset, a high-quality, manually annotated collection of phase-contrast microscopy images. Training is performed on two cell types, while evaluation is carried out on a distinct, unseen cell type to assess generalization. A UNet architecture is used as the unconstrained baseline model. Experimental results demonstrate consistent improvements in segmentation accuracy and boundary fidelity compared to unconstrained deep learning baselines. Moreover, the PDE-regularized models exhibit enhanced stability and improved generalization in low-sample regimes, highlighting the advantages of incorporating structured priors. The proposed approach illustrates how PDE-constrained optimization can strengthen data-driven learning frameworks, providing a principled bridge between variational methods, statistical learning, and scientific machine learning.
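The two differentiable priors can be sketched in PyTorch as below, using an illustrative Modica-Mortola-style phase-field energy and an Allen-Cahn-style residual; the exact functionals and coefficients in the paper may differ.

```python
# Sketch of PDE-motivated regularizers on a soft mask u in [0,1], shape (B,1,H,W).
import torch
import torch.nn.functional as F

def pde_priors(u, eps=0.05):
    # Phase-field interface energy: eps*|grad u|^2 + (1/eps)*u^2(1-u)^2.
    gx = u[..., :, 1:] - u[..., :, :-1]
    gy = u[..., 1:, :] - u[..., :-1, :]
    grad_sq = (gx ** 2).mean() + (gy ** 2).mean()
    double_well = (u ** 2 * (1 - u) ** 2).mean()
    phase_field = eps * grad_sq + double_well / eps
    # Allen-Cahn-style steady-state residual via a 5-point discrete Laplacian.
    lap_kernel = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]],
                              device=u.device, dtype=u.dtype).view(1, 1, 3, 3)
    lap = F.conv2d(u, lap_kernel, padding=1)
    reaction = u * (1 - u) * (u - 0.5)       # double-well derivative-like term
    rd_residual = ((eps * lap - reaction / eps) ** 2).mean()
    return phase_field, rd_residual
```

In training, these two terms would be added to the data-fidelity loss with tunable weights, which is how the composite objective described above is assembled.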
[CV-169] DRFormer: A Dual-Regularized Bidirectional Transformer for Person Re-identification
【Quick Read】: This paper targets the challenges that occlusion and pose variation pose for person re-identification, which typically call for joint modeling of fine-grained discriminative details and global semantic features. Existing methods mostly rely on a single architecture and fail to combine the strengths of local texture mining (e.g., DINO) and global semantic discrimination (e.g., CLIP). The key to the solution is a Dual-Regularized Bidirectional Transformer (DRFormer), whose dual-regularization mechanism fuses the two complementary models, striking a balance between feature diversity and contribution balance and thereby significantly improving re-identification performance.
Link: https://arxiv.org/abs/2602.01059
Authors: Ying Shu,Pujian Zhan,Huiqi Yang,Hehe Fan,Youfang Lin,Kai Lv
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:
Abstract:Both fine-grained discriminative details and global semantic features can contribute to solving person re-identification challenges, such as occlusion and pose variations. Vision foundation models (e.g., DINO) excel at mining local textures, and vision-language models (e.g., CLIP) capture strong global semantic differences. Existing methods predominantly rely on a single paradigm, neglecting the potential benefits of their integration. In this paper, we analyze the complementary roles of these two architectures and propose a framework to synergize their strengths by a Dual-Regularized Bidirectional Transformer (DRFormer). The dual-regularization mechanism ensures diverse feature extraction and achieves a better balance in the contributions of the two models. Extensive experiments on five benchmarks show that our method effectively harmonizes local and global representations, achieving competitive performance against state-of-the-art methods.
[CV-170] Radioactive 3D Gaussian Ray Tracing for Tomographic Reconstruction
【Quick Read】: This paper addresses the limited quantitative accuracy and lack of support for nonlinear geometric corrections in 3D Gaussian Splatting (3DGS)-based tomographic reconstruction. Existing methods such as R2-Gaussian use a local affine approximation to map 3D Gaussians onto the detector and alpha-blend them into projections, which introduces integration bias, degrades accuracy, and complicates nonlinear corrections such as the arc-correction used in positron emission tomography (PET). The key to the solution is a reconstruction framework based on ray tracing through 3D Gaussians: (i) the line integral through each Gaussian primitive is computed analytically, avoiding local affine collapse and yielding a more physically consistent forward projection model; and (ii) ray tracing gives explicit control over ray origins and directions, making precise nonlinear geometric corrections straightforward, which broadens applicability to realistic tomography systems while improving projection accuracy.
Link: https://arxiv.org/abs/2602.01057
Authors: Ling Chen,Bao Yang
Affiliations: Southern Medical University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:3D Gaussian Splatting (3DGS) has recently emerged in computer vision as a promising rendering technique. By adapting the principles of Elliptical Weighted Average (EWA) splatting to a modern differentiable pipeline, 3DGS enables real-time, high-quality novel view synthesis. Building upon this, R2-Gaussian extended the 3DGS paradigm to tomographic reconstruction by rectifying integration bias, achieving state-of-the-art performance in computed tomography (CT). To enable differentiability, R2-Gaussian adopts a local affine approximation: each 3D Gaussian is locally mapped to a 2D Gaussian on the detector and composed via alpha blending to form projections. However, the affine approximation can degrade reconstruction quantitative accuracy and complicate the incorporation of nonlinear geometric corrections. To address these limitations, we propose a tomographic reconstruction framework based on 3D Gaussian ray tracing. Our approach provides two key advantages over splatting-based models: (i) it computes the line integral through 3D Gaussian primitives analytically, avoiding the local affine collapse and thus yielding a more physically consistent forward projection model; and (ii) the ray-tracing formulation gives explicit control over ray origins and directions, which facilitates the precise application of nonlinear geometric corrections, e.g., arc-correction used in positron emission tomography (PET). These properties extend the applicability of Gaussian-based reconstruction to a wider range of realistic tomography systems while improving projection accuracy.
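The analytic line integral at the heart of such a ray-traced forward model follows a standard Gaussian identity; a minimal version (ignoring opacity, amplitude scaling, and finite ray bounds, which a real system would include) is:

```python
# Closed-form integral of exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu)) along x(t) = o + t*d.
import numpy as np

def gaussian_line_integral(o, d, mu, Sigma):
    P = np.linalg.inv(Sigma)
    r = o - mu
    # Exponent is a quadratic in t: a*t^2 + 2*b*t + c.
    a = d @ P @ d
    b = d @ P @ r
    c = r @ P @ r
    # Completing the square gives sqrt(2*pi/a) * exp(-0.5 * (c - b^2/a)).
    return np.sqrt(2 * np.pi / a) * np.exp(-0.5 * (c - b * b / a))
```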
[CV-171] Baseline Method of the Foundation Model Challenge for Ultrasound Image Analysis
【Quick Read】: This paper addresses the poor generalization of ultrasound (US) analysis models caused by the substantial heterogeneity of US images across anatomical structures and acquisition protocols; most existing methods are task-specific and unsuitable as clinically deployable foundation models. The key to the solution is a unified Multi-Head Multi-Task Learning (MH-MTL) framework built on an ImageNet-pretrained EfficientNet-B4 backbone with a Feature Pyramid Network (FPN), sharing feature extraction across 27 subtasks (segmentation, classification, detection, and regression) with a task-adaptive routing mechanism, and trained jointly with task-adaptive learning-rate scaling and a cosine annealing schedule. Validation confirms the feasibility and robustness of this design in complex US scenarios, establishing a strong, extensible baseline for ultrasound foundation model research.
Link: https://arxiv.org/abs/2602.01055
Authors: Bo Deng,Yitong Tang,Jiake Li,Yuxin Huang,Li Wang,Yu Zhang,Yufei Zhan,Hua Lu,Xiaoshen Zhang,Jieyun Bai
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Ultrasound (US) imaging exhibits substantial heterogeneity across anatomical structures and acquisition protocols, posing significant challenges to the development of generalizable analysis models. Most existing methods are task-specific, limiting their suitability as clinically deployable foundation models. To address this limitation, the Foundation Model Challenge for Ultrasound Image Analysis (FM_UIA 2026) introduces a large-scale multi-task benchmark comprising 27 subtasks across segmentation, classification, detection, and regression. In this paper, we present the official baseline for FM_UIA 2026 based on a unified Multi-Head Multi-Task Learning (MH-MTL) framework that supports all tasks within a single shared network. The model employs an ImageNet-pretrained EfficientNet-B4 backbone for robust feature extraction, combined with a Feature Pyramid Network (FPN) to capture multi-scale contextual information. A task-specific routing strategy enables global tasks to leverage high-level semantic features, while dense prediction tasks exploit spatially detailed FPN representations. Training incorporates a composite loss with task-adaptive learning rate scaling and a cosine annealing schedule. Validation results demonstrate the feasibility and robustness of this unified design, establishing a strong and extensible baseline for ultrasound foundation model research. The code and dataset are publicly available on GitHub: this https URL.
[CV-172] Residual Decoding: Mitigating Hallucinations in Large Vision-Language Models via History-Aware Residual Guidance
【Quick Read】: This paper addresses hallucinations in Large Vision-Language Models (LVLMs) when processing multimodal inputs, i.e., generated content that is grammatically coherent yet unrelated to or inconsistent with the actual visual input, driven by language priors. The key to the solution is Residual Decoding (ResDec), a novel training-free method that exploits the model's internal implicit reasoning and the evolution of token logits, using historical information to assist decoding and correct language bias, thereby improving visual grounding and reducing object hallucinations.
Link: https://arxiv.org/abs/2602.01047
Authors: Xinrong Chen,Xu Chu,Yingmin Qiu,Hengyuan Zhang,Jing Xiong,Shiyu Tang,Shuai Liu,Shaokang Yang,Cheng Yang,Hayden Kwok-Hay So,Ngai Wong
Affiliations: Peking University; ByteDance Inc.; University of Hong Kong; Beijing University of Posts and Telecommunications
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Vision-Language Models (LVLMs) can reason effectively from image-text inputs and perform well in various multimodal tasks. Despite this success, they are affected by language priors and often produce hallucinations. Hallucinations denote generated content that is grammatically and syntactically coherent, yet bears no match or direct relevance to actual visual input. To address this problem, we propose Residual Decoding (ResDec). It is a novel training-free method that uses historical information to aid decoding. The method relies on the internal implicit reasoning mechanism and token logits evolution mechanism of LVLMs to correct biases. Extensive experiments demonstrate that ResDec effectively suppresses hallucinations induced by language priors, significantly improves visual grounding, and reduces object hallucinations. In addition to mitigating hallucinations, ResDec also performs exceptionally well on comprehensive LVLM benchmarks, highlighting its broad applicability.
[CV-173] ReLayout: Versatile and Structure-Preserving Design Layout Editing via Relation-Aware Design Reconstruction
【Quick Read】: This paper tackles two core challenges in design layout editing: user intents are usually expressed in ambiguous natural language that is hard to translate into precise geometric modifications, and triplet data (original design, editing operation, edited design) is scarce, limiting supervised approaches. The key to the solution is ReLayout, which introduces a relation graph encoding the position and size relationships among unedited elements as a structural constraint for layout preservation, and relation-aware design reconstruction (RADR), which emulates the editing process in a self-supervised manner by learning to reconstruct a design from its elements, a relation graph, and a synthesized editing operation, using a multimodal large language model as the backbone to unify multiple editing actions; this achieves versatile, structure-preserving, high-quality design editing without triplet data.
Link: https://arxiv.org/abs/2602.01046
Authors: Jiawei Lin,Shizhao Sun,Danqing Huang,Ting Liu,Ji Li,Jiang Bian
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Automated redesign without manual adjustments marks a key step forward in the design workflow. In this work, we focus on a foundational redesign task termed design layout editing, which seeks to autonomously modify the geometric composition of a design based on user intents. To overcome the ambiguity of user needs expressed in natural language, we introduce four basic and important editing actions and standardize the format of editing operations. The underexplored task presents a unique challenge: satisfying specified editing operations while simultaneously preserving the layout structure of unedited elements. Besides, the scarcity of triplet (original design, editing operation, edited design) samples poses another formidable challenge. To this end, we present ReLayout, a novel framework for versatile and structure-preserving design layout editing that operates without triplet data. Specifically, ReLayout first introduces the relation graph, which contains the position and size relationships among unedited elements, as the constraint for layout structure preservation. Then, relation-aware design reconstruction (RADR) is proposed to bypass the data challenge. By learning to reconstruct a design from its elements, a relation graph, and a synthesized editing operation, RADR effectively emulates the editing process in a self-supervised manner. A multi-modal large language model serves as the backbone for RADR, unifying multiple editing actions within a single model and thus achieving versatile editing after fine-tuning. Qualitative, quantitative results and user studies show that ReLayout significantly outperforms the baseline models in terms of editing quality, accuracy, and layout structure preservation.
[CV-174] From Videos to Conversations: Egocentric Instructions for Task Assistance
【Quick Read】: This paper addresses the scarcity of large-scale, multimodal conversational datasets grounded in real-world task execution, which limits AI agents for augmented reality (AR) task guidance on complex multi-step procedures such as appliance repair, cooking, or car maintenance. The key to the solution is a fully automatic framework that uses large language models to transform single-person instructional videos into two-person multimodal task-guidance conversations, enabling low-cost, efficient data generation. The approach greatly improves the scalability of data collection and yields HowToDIV, a dataset of 507 multi-turn expert-novice conversations, 6,636 question-answer pairs, and 24 hours of video, providing the first benchmark for multimodal procedural task assistance.
Link: https://arxiv.org/abs/2602.01038
Authors: Lavisha Aggarwal,Vikas Bahirwani,Andrea Colaco
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Many everyday tasks, ranging from appliance repair and cooking to car maintenance, require expert knowledge, particularly for complex, multi-step procedures. Despite growing interest in AI agents for augmented reality (AR) assistance, progress remains limited by the scarcity of large-scale multimodal conversational datasets grounded in real-world task execution, in part due to the cost and logistical complexity of human-assisted data collection. In this paper, we present a framework to automatically transform single person instructional videos into two-person multimodal task-guidance conversations. Our fully automatic pipeline, based on large language models, provides a scalable and cost efficient alternative to traditional data collection approaches. Using this framework, we introduce HowToDIV, a multimodal dataset comprising 507 conversations, 6,636 question answer pairs, and 24 hours of video spanning multiple domains. Each session consists of a multi-turn expert-novice interaction. Finally, we report baseline results using Gemma 3 and Qwen 2.5 on HowToDIV, providing an initial benchmark for multimodal procedural task assistance.
[CV-175] VEQ: Modality-Adaptive Quantization for MoE Vision-Language Models
【Quick Read】: This paper addresses the performance loss in post-training quantization (PTQ) of Mixture-of-Experts (MoE) vision-language models (VLMs), which stems from ignoring two key forms of heterogeneity: the inherent discrepancy between vision and language tokens, and the non-uniform contribution of different experts to model outputs. The key to the solution is Visual Expert Quantization (VEQ), a dual-aware framework with two mechanisms: 1) modality-expert-aware quantization, which uses expert activation frequency to prioritize error minimization for pivotal experts; and 2) modality-affinity-aware quantization, which builds an enhanced Hessian matrix from token-expert affinity and modality information to guide calibration. Experiments show that VEQ markedly improves the accuracy and robustness of quantized models, outperforming state-of-the-art quantization methods across benchmarks.
Link: https://arxiv.org/abs/2602.01037
Authors: Guangshuo Qin,Zhiteng Li,Zheng Chen,Weihang Zhang,Linghe Kong,Yulun Zhang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Mixture-of-Experts(MoE) Vision-Language Models (VLMs) offer remarkable performance but incur prohibitive memory and computational costs, making compression essential. Post-Training Quantization (PTQ) is an effective training-free technique to address the massive memory and computation overhead. Existing quantization paradigms fall short as they are oblivious to two critical forms of heterogeneity: the inherent discrepancy between vision and language tokens, and the non-uniform contribution of different experts. To bridge this gap, we propose Visual Expert Quantization (VEQ), a dual-aware quantization framework designed to simultaneously accommodate cross-modal differences and heterogeneity between experts. Specifically, VEQ incorporates 1)Modality-expert-aware Quantization, which utilizes expert activation frequency to prioritize error minimization for pivotal experts, and 2)Modality-affinity-aware Quantization, which constructs an enhanced Hessian matrix by integrating token-expert affinity with modality information to guide the calibration process. Extensive experiments across diverse benchmarks verify that VEQ consistently outperforms state-of-the-art baselines. Specifically, under the W3A16 configuration, our method achieves significant average accuracy gains of 2.04% on Kimi-VL and 3.09% on Qwen3-VL compared to the previous SOTA quantization methods, demonstrating superior robustness across various multimodal tasks. Our code will be available at this https URL.
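As a rough illustration of weighted calibration, a GPTQ-style Hessian can be accumulated with per-token weights combining an affinity and a modality factor; the multiplicative combination rule below is an assumption for illustration, not the paper's construction.

```python
# Sketch of a weighted calibration Hessian H = sum_i w_i * x_i x_i^T.
import numpy as np

def weighted_hessian(X, affinity, modality_w):
    """X: (n_tokens, d) inputs routed to one expert;
    affinity, modality_w: (n_tokens,) per-token weights (hypothetical)."""
    w = affinity * modality_w            # assumed combination of the two factors
    return (X * w[:, None]).T @ X        # weighted outer-product accumulation
```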
[CV-176] FUSE-Flow: Scalable Real-Time Multi-View Point Cloud Reconstruction Using Confidence ICIP
【Quick Read】: This paper addresses the key challenge of real-time multi-view point cloud reconstruction: efficiently fusing large-scale multi-view depth observations into high-quality point clouds under strict real-time constraints. Existing approaches based on voxel fusion, temporal accumulation, or global optimization suffer from high computational complexity, heavy memory usage, and poor scalability, and cannot simultaneously deliver real-time performance, reconstruction quality, and multi-camera extensibility. The key to the solution is FUSE-Flow, a frame-wise, stateless, linearly scalable streaming reconstruction framework: each frame independently generates point cloud fragments that are fused using two weights, measurement confidence and 3D distance consistency, suppressing noise while preserving geometric detail; an adaptive spatial-hashing weighted aggregation scheme partitions 3D space according to local point density, selects representative points per cell, and performs weighted fusion, handling both sparse and dense regions efficiently. Combined with GPU parallelization, the method achieves high-throughput, low-latency point cloud generation and fusion with linear complexity.
Link: https://arxiv.org/abs/2602.01035
Authors: Chentian Sun
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: A 5-page paper, prepared for submission to the 2026 IEEE International Conference on Image Processing (ICIP)
Abstract:Real-time multi-view point cloud reconstruction is a core problem in 3D vision and immersive perception, with wide applications in VR, AR, robotic navigation, digital twins, and computer interaction. Despite advances in multi-camera systems and high-resolution depth sensors, fusing large-scale multi-view depth observations into high-quality point clouds under strict real-time constraints remains challenging. Existing methods relying on voxel-based fusion, temporal accumulation, or global optimization suffer from high computational complexity, excessive memory usage, and limited scalability, failing to simultaneously achieve real-time performance, reconstruction quality, and multi-camera extensibility. We propose FUSE-Flow, a frame-wise, stateless, and linearly scalable point cloud streaming reconstruction framework. Each frame independently generates point cloud fragments, fused via two weights, measurement confidence and 3D distance consistency to suppress noise while preserving geometric details. For large-scale multi-camera efficiency, we introduce an adaptive spatial hashing-based weighted aggregation method: 3D space is adaptively partitioned by local point cloud density, representative points are selected per cell, and weighted fusion is performed to handle both sparse and dense regions. With GPU parallelization, FUSE-Flow achieves high-throughput, low-latency point cloud generation and fusion with linear complexity. Experiments demonstrate that the framework improves reconstruction stability and geometric fidelity in overlapping, depth-discontinuous, and dynamic scenes, while maintaining real-time frame rates on modern GPUs, verifying its effectiveness, robustness, and scalability.
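A simplified version of hash-based weighted aggregation is sketched below, using a fixed cell size rather than the paper's density-adaptive partitioning, and confidence-only weights in place of the confidence-plus-distance-consistency scheme.

```python
# Sketch: hash points into voxel cells, keep one weighted representative per cell.
import numpy as np
from collections import defaultdict

def fuse_points(points, confidences, cell_size=0.02):
    """points: (n, 3) array; confidences: (n,) per-point weights."""
    keys = np.floor(points / cell_size).astype(np.int64)
    cells = defaultdict(list)
    for p, w, k in zip(points, confidences, map(tuple, keys)):
        cells[k].append((p, w))
    fused = []
    for members in cells.values():
        ps = np.stack([p for p, _ in members])
        ws = np.array([w for _, w in members])
        fused.append((ws[:, None] * ps).sum(0) / ws.sum())  # weighted centroid
    return np.stack(fused)
```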
[CV-177] GMAC: Global Multi-View Constraint for Automatic Multi-Camera Extrinsic Calibration ICIP
【Quick Read】: This paper addresses automatic calibration of extrinsic parameters in multi-camera systems, where existing methods lack robustness and applicability in complex dynamic environments or online scenarios. The key to the solution is GMAC, a framework that exploits the implicit geometric representations learned by multi-view reconstruction networks: extrinsics are modeled as global variables constrained by the latent multi-view geometric structure, and existing networks are pruned and structurally reconfigured so that their latent features directly support extrinsic prediction without an entirely new network design; GMAC further jointly optimizes cross-view reprojection consistency and multi-view cycle consistency, ensuring geometric coherence across cameras while improving prediction accuracy and optimization stability.
Link: https://arxiv.org/abs/2602.01033
Authors: Chentian Sun
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: A 5-page paper with 1 figure, prepared for submission to the 2026 IEEE International Conference on Image Processing (ICIP)
Abstract:Automatic calibration of multi-camera systems, namely the accurate estimation of spatial extrinsic parameters, is fundamental for 3D reconstruction, panoramic perception, and multi-view data fusion. Existing methods typically rely on calibration targets, explicit geometric modeling, or task-specific neural networks. Such approaches often exhibit limited robustness and applicability in complex dynamic environments or online scenarios, making them difficult to deploy in practical applications. To address this, this paper proposes GMAC, a multi-camera extrinsic estimation framework based on the implicit geometric representations learned by multi-view reconstruction networks. GMAC models extrinsics as global variables constrained by the latent multi-view geometric structure and prunes and structurally reconfigures existing networks so that their latent features can directly support extrinsic prediction through a lightweight regression head, without requiring a completely new network design. Furthermore, GMAC jointly optimizes cross-view reprojection consistency and multi-view cycle consistency, ensuring geometric coherence across cameras while improving prediction accuracy and optimization stability. Experiments on both synthetic and real-world multi-camera datasets demonstrate that GMAC achieves accurate and stable extrinsic estimation without explicit 3D reconstruction or manual calibration, providing a new solution for efficient deployment and online calibration of multi-camera systems.
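Multi-view cycle consistency can be sketched by composing relative extrinsics around closed camera loops and penalizing deviation from the identity; the dict-of-dicts layout `R[i][j]`, `t[i][j]` (relative rotation/translation from camera i to j) is a hypothetical interface, not GMAC's actual parameterization.

```python
# Sketch of a cycle-consistency residual over camera triplets (i -> j -> k -> i).
import numpy as np

def cycle_consistency(R, t, cycles):
    err = 0.0
    for (i, j, k) in cycles:
        # Composing x_j = R[i][j] x_i + t[i][j] around the loop should be identity.
        R_loop = R[k][i] @ R[j][k] @ R[i][j]
        t_loop = R[k][i] @ (R[j][k] @ t[i][j] + t[j][k]) + t[k][i]
        err += np.linalg.norm(R_loop - np.eye(3)) + np.linalg.norm(t_loop)
    return err / len(cycles)
```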
[CV-178] oward Universal and Transferable Jailbreak Attacks on Vision-Language Models ICLR2026
【Quick Read】: This paper addresses the poor transferability of gradient-based jailbreak attacks against vision-language models (VLMs): existing image-based jailbreaks overfit a single white-box surrogate and fail to generalize to black-box models. The key to the solution is UltraBreak, a framework that constrains adversarial patterns through transformations and regularization in the vision space while relaxing hard textual targets through semantics-oriented objectives; by defining its loss in the textual embedding space of the target large language model (LLM), it discovers adversarial patterns that are universal across diverse jailbreak objectives, mitigating surrogate overfitting and substantially improving transferability across both models and attack targets.
Link: https://arxiv.org/abs/2602.01025
Authors: Kaiyuan Cui,Yige Li,Yutao Wu,Xingjun Ma,Sarah Erfani,Christopher Leckie,Hanxun Huang
Affiliations: The University of Melbourne; Singapore Management University; Deakin University; Fudan University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2026
Abstract:Vision-language models (VLMs) extend large language models (LLMs) with vision encoders, enabling text generation conditioned on both images and text. However, this multimodal integration expands the attack surface by exposing the model to image-based jailbreaks crafted to induce harmful responses. Existing gradient-based jailbreak methods transfer poorly, as adversarial patterns overfit to a single white-box surrogate and fail to generalise to black-box models. In this work, we propose Universal and transferable jailbreak (UltraBreak), a framework that constrains adversarial patterns through transformations and regularisation in the vision space, while relaxing textual targets through semantic-based objectives. By defining its loss in the textual embedding space of the target LLM, UltraBreak discovers universal adversarial patterns that generalise across diverse jailbreak objectives. This combination of vision-level regularisation and semantically guided textual supervision mitigates surrogate overfitting and enables strong transferability across both models and attack targets. Extensive experiments show that UltraBreak consistently outperforms prior jailbreak methods. Further analysis reveals why earlier approaches fail to transfer, highlighting that smoothing the loss landscape via semantic objectives is crucial for enabling universal and transferable jailbreaks. The code is publicly available in our GitHub repository: this https URL.
[CV-179] Effectiveness of Automatically Curated Dataset in Thyroid Nodules Classification Algorithms Using Deep Learning
【Quick Read】: This paper addresses the scarcity of training data for deep learning models in thyroid nodule cancer diagnosis, where manual annotation is costly and inefficient. The key to the solution is to use an automatic curation method to generate a large-scale thyroid nodule ultrasound dataset and to verify experimentally that this automatically curated data improves deep learning performance. Models trained on the automatically curated data (AUC = 0.694) significantly outperformed those trained on manually annotated data (AUC = 0.643, P < .001), and the full automatically curated dataset performed on par with its higher-accuracy subset (AUC = 0.689, P = .43), suggesting that all automatically curated data should be used rather than only the accurate subset, enabling more efficient and stable model training.
Link: https://arxiv.org/abs/2602.01020
Authors: Jichen Yang,Jikai Zhang,Benjamin Wildman-Tobriner,Maciej A. Mazurowski
Affiliations: Duke University; Duke University Medical Center
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 3 figures
Abstract:The diagnosis of thyroid nodule cancers commonly utilizes ultrasound images. Several studies showed that deep learning algorithms designed to classify benign and malignant thyroid nodules could match radiologists' performance. However, data availability for training deep learning models is often limited due to the significant effort required to curate such datasets. The previous study proposed a method to curate thyroid nodule datasets automatically. It was tested to have a 63% yield rate and 83% accuracy. However, the usefulness of the generated data for training deep learning models remains unknown. In this study, we conducted experiments to determine whether using an automatically-curated dataset improves deep learning algorithms' performance. We trained deep learning models on the manually annotated and automatically-curated datasets. We also trained with a smaller subset of the automatically-curated dataset that has higher accuracy to explore the optimum usage of such a dataset. As a result, the deep learning model trained on the manually selected dataset has an AUC of 0.643 (95% confidence interval [CI]: 0.62, 0.66). It is significantly lower than the AUC of the model trained on the automatically-curated dataset, 0.694 (95% confidence interval [CI]: 0.67, 0.73, P < .001). The AUC of the model trained on the accurate subset is 0.689 (95% confidence interval [CI]: 0.66, 0.72, P = .43), which is not significantly worse than the AUC of the full automatically-curated dataset. In conclusion, we showed that using an automatically-curated dataset can substantially increase the performance of deep learning algorithms, and it is suggested to use all the data rather than only using the accurate subset.
[CV-180] LocalScore: Local Density-Aware Similarity Scoring for Biometrics
【Quick Read】: This paper addresses the degradation of open-set biometric recognition when probe subjects are not enrolled in the gallery, noting that existing methods collapse intra-subject variability into a single global representation, which leads to suboptimal decision boundaries and poor open-set robustness. The key to the solution is LocalScore, a scoring algorithm that refines similarity computation with the local density of the gallery feature distribution (via k-th nearest neighbor distances), strengthening discrimination against non-mated probes; it is architecture-agnostic, loss-independent, and incurs negligible computational overhead, so it can be plugged into existing biometric systems, and experiments across modalities show substantial gains in open-set retrieval and verification.
Link: https://arxiv.org/abs/2602.01012
Authors: Yiyang Su,Minchul Kim,Jie Zhu,Christopher Perry,Feng Liu,Anil Jain,Xiaoming Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Open-set biometrics faces challenges with probe subjects who may not be enrolled in the gallery, as traditional biometric systems struggle to detect these non-mated probes. Despite the growing prevalence of multi-sample galleries in real-world deployments, most existing methods collapse intra-subject variability into a single global representation, leading to suboptimal decision boundaries and poor open-set robustness. To address this issue, we propose LocalScore, a simple yet effective scoring algorithm that explicitly incorporates the local density of the gallery feature distribution using the k-th nearest neighbors. LocalScore is architecture-agnostic, loss-independent, and incurs negligible computational overhead, making it a plug-and-play solution for existing biometric systems. Extensive experiments across multiple modalities demonstrate that LocalScore consistently achieves substantial gains in open-set retrieval (FNIR@FPIR reduced from 53% to 40%) and verification (TAR@FAR improved from 51% to 74%). We further provide theoretical analysis and empirical validation explaining when and why the method achieves the most significant gains based on dataset characteristics.
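The density-aware scoring idea, in minimal form (the exact normalization in the paper may differ; Euclidean distance and the `-d/kth` ratio below are illustrative choices):

```python
# Sketch of a local-density-normalized open-set matching score.
import numpy as np

def local_score(probe, gallery_feats, k=5):
    """probe: (d,), gallery_feats: (n, d) with n > k."""
    d = np.linalg.norm(gallery_feats - probe, axis=1)
    j = int(d.argmin())                               # best-matching gallery sample
    dg = np.linalg.norm(gallery_feats - gallery_feats[j], axis=1)
    kth = np.partition(dg, k)[k]                      # k-th NN distance (self is at 0)
    return -d[j] / (kth + 1e-8)                       # higher = more confident match
```

The intuition: in a dense gallery neighborhood a small probe distance is unremarkable, while the same distance in a sparse region is strong evidence of a mated probe, so normalizing by the k-th NN distance calibrates the score.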
[CV-181] SRVAU-R1: Enhancing Video Anomaly Understanding via Reflection-Aware Learning
【Quick Read】: This paper addresses the limitation that current multimodal large language model (MLLM) approaches to video anomaly understanding (VAU) provide only surface-level descriptions and lack deep reasoning over abnormal behaviors, such as explicit self-reflection and correction. The key to the solution is Self-Reflection-Enhanced Reasoning for Video Anomaly Understanding (SRVAU-R1), a reflection-aware learning framework that introduces the first reflection-oriented Chain-of-Thought dataset for VAU, providing structured supervision with initial reasoning, self-reflection, and revised reasoning, and combines supervised fine-tuning with reinforcement fine-tuning in a novel reflection-aware training paradigm, significantly improving multimodal reasoning and anomaly localization accuracy.
Link: https://arxiv.org/abs/2602.01004
Authors: Zihao Zhao,Shengting Cao,Muchao Ye
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Multi-modal large language models (MLLMs) have demonstrated significant progress in reasoning capabilities and shown promising effectiveness in video anomaly understanding (VAU) tasks. However, existing MLLM-based approaches remain largely focused on surface-level descriptions of anomalies, lacking deep reasoning over abnormal behaviors like explicit self-reflection and self-correction. To address that, we propose Self-Reflection-Enhanced Reasoning for Video Anomaly Understanding (SRVAU-R1), a reflection-aware learning framework that incorporates reflection in MLLM reasoning. Specifically, SRVAU-R1 introduces the first reflection-oriented Chain-of-Thought dataset tailored for VAU, providing structured supervision with initial reasoning, self-reflection, and revised reasoning. Based on that, it includes a novel reflection-aware learning paradigm with supervised fine-tuning and reinforcement fine-tuning to enhance multi-modal reasoning for VAU. Extensive experiments on multiple video anomaly benchmarks demonstrate that SRVAU-R1 consistently outperforms existing methods, achieving significant improvements in both temporal anomaly localization accuracy and reasoning quality.
[CV-182] CortiNet: A Physics-Perception Hybrid Cortical-Inspired Dual-Stream Network for Gallbladder Disease Diagnosis from Ultrasound
【Quick Read】: This paper addresses the unreliable diagnosis caused by the low resolution and speckle noise of ultrasound images, and the difficulty of deploying conventional large-parameter convolutional networks in clinical settings. The key to the solution is CortiNet, a lightweight, cortical-inspired dual-stream architecture that separates low-frequency structural information from high-frequency perceptual detail via physically interpretable multi-scale signal decomposition and processes each through a specialized encoding stream; a late-fusion mechanism then integrates structural and textural features while preserving computational efficiency and improving robustness. In addition, a structure-aware explainability framework applies gradient-weighted class activation mapping (Grad-CAM) only to the structural branch, effectively sidestepping speckle-noise interference and improving diagnostic trustworthiness.
Link: https://arxiv.org/abs/2602.01000
Authors: Vagish Kumar,Souvik Chakraborty
Affiliations: Indian Institute of Technology Delhi; Yardi School of Artificial Intelligence (ScAI)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Ultrasound imaging is the primary diagnostic modality for detecting Gallbladder diseases due to its non-invasive nature, affordability, and wide accessibility. However, the low resolution and speckle noise inherent to ultrasound images hinder diagnostic reliability, prompting the use of large convolutional neural networks that are difficult to deploy in routine clinical settings. In this work, we propose CortiNet, a lightweight, cortical-inspired dual-stream neural architecture for gallbladder disease diagnosis that integrates physically interpretable multi-scale signal decomposition with perception-driven feature learning. Inspired by parallel processing pathways in the human visual cortex, CortiNet explicitly separates low-frequency structural information from high-frequency perceptual details and processes them through specialized encoding streams. By operating directly on structured, frequency-selective representations rather than raw pixel intensities, the architecture embeds strong physics-based inductive bias, enabling efficient feature learning with a significantly reduced parameter footprint. A late-stage cortical-style fusion mechanism integrates complementary structural and textural cues while preserving computational efficiency. Additionally, we propose a structure-aware explainability framework wherein gradient-weighted class activation mapping is only applied to the structural branch of the proposed CortiNet architecture. This choice allows the model to only focus on the structural features, making it robust against speckle noise. We evaluate CortiNet on 10,692 expert-annotated images spanning nine clinically relevant gallbladder disease categories. Experimental results demonstrate that CortiNet achieves high diagnostic accuracy (98.74%) with only a fraction of the parameters required by conventional deep convolutional models.
[CV-183] VAMOS-OCTA: Vessel-Aware Multi-Axis Orthogonal Supervision for Inpainting Motion-Corrupted OCT Angiography Volumes
【Quick Read】: This paper addresses the severe degradation of 3D image quality from motion artifacts in handheld optical coherence tomography angiography (OCTA) of pediatric or uncooperative subjects: sudden motion during acquisition leaves entire B-scans (cross-sectional slices) unsampled, producing blank bands in en face projections. The key to the solution is VAMOS-OCTA, a deep-learning inpainting framework whose core innovation is the Vessel-Aware Multi-Axis Orthogonal Supervision (VAMOS) loss, which jointly optimizes vessel-weighted intensity reconstruction and consistency of axial and lateral projections, improving volumetric projection accuracy while preserving B-scan sharpness and markedly restoring vascular continuity and image integrity under severe motion corruption.
Link: https://arxiv.org/abs/2602.00995
Authors: Nick DiSanto,Ehsan Khodapanah Aghdam,Han Liu,Jacob Watson,Yuankai K. Tao,Hao Li,Ipek Oguz
Affiliations: Vanderbilt University; Siemens Healthineers
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to SPIE Medical Imaging 2026
Abstract:Handheld Optical Coherence Tomography Angiography (OCTA) enables noninvasive retinal imaging in uncooperative or pediatric subjects, but is highly susceptible to motion artifacts that severely degrade volumetric image quality. Sudden motion during 3D acquisition can lead to unsampled retinal regions across entire B-scans (cross-sectional slices), resulting in blank bands in en face projections. We propose VAMOS-OCTA, a deep learning framework for inpainting motion-corrupted B-scans using vessel-aware multi-axis supervision. We employ a 2.5D U-Net architecture that takes a stack of neighboring B-scans as input to reconstruct a corrupted center B-scan, guided by a novel Vessel-Aware Multi-Axis Orthogonal Supervision (VAMOS) loss. This loss combines vessel-weighted intensity reconstruction with axial and lateral projection consistency, encouraging vascular continuity in native B-scans and across orthogonal planes. Unlike prior work that focuses primarily on restoring the en face MIP, VAMOS-OCTA jointly enhances both cross-sectional B-scan sharpness and volumetric projection accuracy, even under severe motion corruptions. We trained our model on both synthetic and real-world corrupted volumes and evaluated its performance using both perceptual quality and pixel-wise accuracy metrics. VAMOS-OCTA consistently outperforms prior methods, producing reconstructions with sharp capillaries, restored vessel continuity, and clean en face projections. These results demonstrate that multi-axis supervision offers a powerful constraint for restoring motion-degraded 3D OCTA data. Our source code is available at this https URL.
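A compact sketch of a vessel-weighted, multi-axis projection loss under an assumed tensor layout (B, 1, D, H, W) with maximum-intensity projections; the weights, projection axes, and projection operator are illustrative assumptions.

```python
# Sketch of a VAMOS-style multi-axis supervision loss.
import torch

def vamos_loss(pred, target, vessel_mask, w_axial=0.5, w_lateral=0.5):
    """pred/target: (B, 1, D, H, W) OCTA sub-volumes; vessel_mask in [0, 1]."""
    # Vessel-weighted intensity reconstruction: upweight vascular voxels.
    recon = (((pred - target) ** 2) * (1.0 + vessel_mask)).mean()
    # Consistency of orthogonal maximum-intensity projections.
    axial = ((pred.amax(dim=2) - target.amax(dim=2)) ** 2).mean()    # en face MIP
    lateral = ((pred.amax(dim=4) - target.amax(dim=4)) ** 2).mean()  # side MIP
    return recon + w_axial * axial + w_lateral * lateral
```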
[CV-184] Navigating Simply Aligning Deeply: Winning Solutions for Mouse vs. AI 2025 NEURIPS2025
【Quick Read】: This paper targets the challenges of visual robustness and neural alignment in developing artificial agents that rival biological vision. For visual robustness, the key is a lightweight two-layer CNN with Gated Linear Units (GLU) and observation normalization, achieving a 95.4% final score; for neural alignment, the key is a deep ResNet-like architecture with 16 convolutional layers and GLU gating that attains top neural prediction performance with only 17.8M parameters. A systematic analysis reveals a non-monotonic relationship between training steps and performance, peaking around 200K steps, exposing a trade-off between model complexity and task characteristics that challenges conventional assumptions about model capacity and offers practical guidance for building biologically inspired visual agents.
Link: https://arxiv.org/abs/2602.00982
Authors: Phu-Hoa Pham,Chi-Nguyen Tran,Dao Sy Duy Minh,Nguyen Lam Phu Quy,Huynh Trung Kiet
Affiliations: University of Science, VNU-HCM
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO)
Comments: 15 pages, 8 tables. Technical Report for winning solutions (Track 1 & Track 2) at the NeurIPS 2025 Mouse vs. AI Challenge
Abstract:Visual robustness and neural alignment remain critical challenges in developing artificial agents that can match biological vision systems. We present the winning approaches from Team HCMUS_TheFangs for both tracks of the NeurIPS 2025 Mouse vs. AI: Robust Visual Foraging Competition. For Track 1 (Visual Robustness), we demonstrate that architectural simplicity combined with targeted components yields superior generalization, achieving 95.4% final score with a lightweight two-layer CNN enhanced by Gated Linear Units and observation normalization. For Track 2 (Neural Alignment), we develop a deep ResNet-like architecture with 16 convolutional layers and GLU-based gating that achieves top-1 neural prediction performance with 17.8 million parameters. Our systematic analysis of ten model checkpoints trained between 60K to 1.14M steps reveals that training duration exhibits a non-monotonic relationship with performance, with optimal results achieved around 200K steps. Through comprehensive ablation studies and failure case analysis, we provide insights into why simpler architectures excel at visual robustness while deeper models with increased capacity achieve better neural alignment. Our results challenge conventional assumptions about model complexity in visuomotor learning and offer practical guidance for developing robust, biologically-inspired visual agents.
[CV-185] Unveiling the Cognitive Compass: Theory-of-Mind-Guided Multimodal Emotion Reasoning ICLR2026
【Quick Read】: This paper addresses the limited capability of multimodal large language models (MLLMs) for deep emotional understanding, arguing that genuine affective intelligence requires explicit modeling of Theory of Mind (ToM), the cognitive substrate from which emotions arise. The key to the solution is threefold: HitEmotion, a ToM-grounded hierarchical benchmark that diagnoses capability breakpoints across levels of cognitive depth; a ToM-guided reasoning chain that tracks mental states and calibrates cross-modal evidence for faithful emotional reasoning; and TMPO, a reinforcement learning method that uses intermediate mental states as process-level supervision to guide and strengthen model reasoning. Experiments show the benchmark exposes emotional-reasoning deficits of state-of-the-art models on cognitively demanding tasks, while the reasoning chain and TMPO significantly improve end-task accuracy and the coherence and faithfulness of the reasoning process.
Link: https://arxiv.org/abs/2602.00971
Authors: Meng Luo,Bobo Li,Shanqing Xu,Shize Zhang,Qiuchan Chen,Menglu Han,Wenhao Chen,Yanxiang Huang,Hao Fei,Mong-Li Lee,Wynne Hsu
Affiliations: National University of Singapore; Huazhong University of Science and Technology; Hong Kong Polytechnic University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICLR 2026
Abstract:Despite rapid progress in multimodal large language models (MLLMs), their capability for deep emotional understanding remains limited. We argue that genuine affective intelligence requires explicit modeling of Theory of Mind (ToM), the cognitive substrate from which emotions arise. To this end, we introduce HitEmotion, a ToM-grounded hierarchical benchmark that diagnoses capability breakpoints across increasing levels of cognitive depth. Second, we propose a ToM-guided reasoning chain that tracks mental states and calibrates cross-modal evidence to achieve faithful emotional reasoning. We further introduce TMPO, a reinforcement learning method that uses intermediate mental states as process-level supervision to guide and strengthen model reasoning. Extensive experiments show that HitEmotion exposes deep emotional reasoning deficits in state-of-the-art models, especially on cognitively demanding tasks. In evaluation, the ToM-guided reasoning chain and TMPO improve end-task accuracy and yield more faithful, more coherent rationales. In conclusion, our work provides the research community with a practical toolkit for evaluating and enhancing the cognition-based emotional understanding capabilities of MLLMs. Our dataset and code are available at: this https URL.
[CV-186] Hybrid Topological and Deep Feature Fusion for Accurate MRI-Based Alzheimers Disease Severity Classification
【Quick Read】: This paper addresses early and accurate diagnosis of Alzheimer's disease (AD) in neuroimaging-based clinical decision support systems, specifically improving four-class AD stage classification. The key to the solution is a hybrid deep learning framework that fuses Topological Data Analysis (TDA) with a DenseNet121 convolutional backbone: TDA captures complementary topological characteristics of brain structures that conventional networks often overlook, while DenseNet121 efficiently learns hierarchical spatial features from MRI slices; fusing the two feature sets markedly improves separability across the four AD stages, achieving 99.93% accuracy and 100% AUC on the OASIS-1 Kaggle MRI dataset and outperforming mainstream CNN-based, transfer learning, ensemble, and multi-scale approaches.
Link: https://arxiv.org/abs/2602.00956
Authors: Faisal Ahmed
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 20 pages, 6 Figures
Abstract:Early and accurate diagnosis of Alzheimer's disease (AD) remains a critical challenge in neuroimaging-based clinical decision support systems. In this work, we propose a novel hybrid deep learning framework that integrates Topological Data Analysis (TDA) with a DenseNet121 backbone for four-class Alzheimer's disease classification using structural MRI data from the OASIS dataset. TDA is employed to capture complementary topological characteristics of brain structures that are often overlooked by conventional neural networks, while DenseNet121 efficiently learns hierarchical spatial features from MRI slices. The extracted deep and topological features are fused to enhance class separability across the four AD stages. Extensive experiments conducted on the OASIS-1 Kaggle MRI dataset demonstrate that the proposed TDA+DenseNet121 model significantly outperforms existing state-of-the-art approaches. The model achieves an accuracy of 99.93% and an AUC of 100%, surpassing recently published CNN-based, transfer learning, ensemble, and multi-scale architectures. These results confirm the effectiveness of incorporating topological insights into deep learning pipelines and highlight the potential of the proposed framework as a robust and highly accurate tool for automated Alzheimer's disease diagnosis.
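As a minimal stand-in for topological features on an MRI slice, the sketch below computes a Betti-0 curve, counting connected components of sublevel sets over a threshold sweep; a full TDA pipeline (e.g., persistent homology on cubical complexes) is richer, but this captures the flavor of what gets fused with the CNN features.

```python
# Sketch: Betti-0 curve of sublevel sets as a simple topological descriptor.
import numpy as np
from scipy import ndimage

def betti0_curve(img, thresholds):
    """img: 2D intensity array; returns #connected-components per threshold."""
    return np.array([ndimage.label(img <= t)[1] for t in thresholds])

# Example usage: a 32-bin descriptor to concatenate with CNN features.
# feats = betti0_curve(mri_slice, np.linspace(mri_slice.min(), mri_slice.max(), 32))
```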
[CV-187] Data Augmentation for High-Fidelity Generation of CAR-T/NK Immunological Synapse Images
【Quick Read】: This paper addresses the limited size of annotated microscopy datasets for detecting and segmenting CAR-T/NK immunological synapses (IS), which restricts the generalization of artificial neural networks (ANNs). The key to the solution is the integration of two complementary data-augmentation frameworks: Instance Aware Automatic Augmentation (IAAA), which applies optimized, instance-preserving augmentation policies to generate synthetic IS images with corresponding segmentation masks; and Semantic-Aware AI Augmentation (SAAA), which pairs a diffusion-based mask generator with a Pix2Pix conditional image synthesizer to create semantically plausible masks and high-fidelity IS images aligned with them. Together they substantially expand the diversity and realism of training data, improving the robustness and accuracy of ANN-based IS detection and segmentation.
Link: https://arxiv.org/abs/2602.00949
Authors: Xiang Zhang,Boxuan Zhang,Alireza Naghizadeh,Mohab Mohamed,Dongfang Liu,Ruixiang Tang,Dimitris Metaxas,Dongfang Liu
Affiliations: Rochester Institute of Technology; Rutgers University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Chimeric antigen receptor (CAR)-T and NK cell immunotherapies have transformed cancer treatment, and recent studies suggest that the quality of the CAR-T/NK cell immunological synapse (IS) may serve as a functional biomarker for predicting therapeutic efficacy. Accurate detection and segmentation of CAR-T/NK IS structures using artificial neural networks (ANNs) can greatly increase the speed and reliability of IS quantification. However, a persistent challenge is the limited size of annotated microscopy datasets, which restricts the ability of ANNs to generalize. To address this challenge, we integrate two complementary data-augmentation frameworks. First, we employ Instance Aware Automatic Augmentation (IAAA), an automated, instance-preserving augmentation method that generates synthetic CAR-T/NK IS images and corresponding segmentation masks by applying optimized augmentation policies to original IS data. IAAA supports multiple imaging modalities (e.g., fluorescence and brightfield) and can be applied directly to CAR-T/NK IS images derived from patient samples. In parallel, we introduce a Semantic-Aware AI Augmentation (SAAA) pipeline that combines a diffusion-based mask generator with a Pix2Pix conditional image synthesizer. This second method enables the creation of diverse, anatomically realistic segmentation masks and produces high-fidelity CAR-T/NK IS images aligned with those masks, further expanding the training corpus beyond what IAAA alone can provide. Together, these augmentation strategies generate synthetic images whose visual and structural properties closely match real IS data, significantly improving CAR-T/NK IS detection and segmentation performance. By enhancing the robustness and accuracy of IS quantification, this work supports the development of more reliable imaging-based biomarkers for predicting patient response to CAR-T/NK immunotherapy.
[CV-188] ConsensusDrop: Fusing Visual and Cross-Modal Saliency for Efficient Vision Language Models
[Quick Read]: This paper tackles the high computational cost of Vision-Language Models (VLMs), which stems from processing large numbers of redundant visual tokens. Existing methods rely on either vision-encoder saliency (broad but query-agnostic) or LLM cross-attention (query-aware but sparse and costly), and neither signal alone yields optimal performance. The key to the solution is ConsensusDrop, a training-free framework that fuses vision-encoder saliency with query-aware cross-attention into a consensus ranking, retaining the most informative visual tokens and compressing the remainder via encoder-guided token merging. This addresses the problem that cross-modal saliency only becomes available inside the LLM, too late for efficient pre-LLM pruning, and overcomes the inherent asymmetry between the two signals, achieving a better accuracy-efficiency trade-off than existing pruning methods across multiple open-source VLMs.
Link: https://arxiv.org/abs/2602.00946
Authors: Dhruv Parikh, Haoyang Fan, Rajgopal Kannan, Viktor Prasanna
Affiliations: University of Southern California; DEVCOM Army Research Office
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report
Abstract:Vision-Language Models (VLMs) are expensive because the LLM processes hundreds of largely redundant visual tokens. Existing token reduction methods typically exploit either vision-encoder saliency (broad but query-agnostic) or LLM cross-attention (query-aware but sparse and costly). We show that neither signal alone is sufficient: fusing them consistently improves performance compared to unimodal visual token selection (ranking). However, making such fusion practical is non-trivial: cross-modal saliency is usually only available inside the LLM (too late for efficient pre-LLM pruning), and the two signals are inherently asymmetric, so naive fusion underutilizes their complementary strengths. We propose ConsensusDrop, a training-free framework that derives a consensus ranking by reconciling vision encoder saliency with query-aware cross-attention, retaining the most informative tokens while compressing the remainder via encoder-guided token merging. Across LLaVA-1.5/NeXT, Video-LLaVA, and other open-source VLMs, ConsensusDrop consistently outperforms prior pruning methods under identical token budgets and delivers a stronger accuracy-efficiency Pareto frontier, preserving near-baseline accuracy even at aggressive token reductions while reducing TTFT and KV cache footprint. Our code will be open-sourced.
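The consensus idea can be sketched in a few lines. The rank-fusion rule below (mean of normalized ranks, then top-k) is an illustrative assumption, not ConsensusDrop's exact reconciliation scheme:

```python
# Sketch: fuse two token-saliency signals into a consensus ranking and split
# tokens into a "keep" set and a "merge" set (merging itself is omitted).
import torch

def consensus_keep(vis_saliency, xattn_saliency, keep_ratio=0.25):
    """vis_saliency, xattn_saliency: (N,) scores over N visual tokens."""
    def norm_rank(s):
        ranks = torch.argsort(torch.argsort(s))       # 0 = least salient
        return ranks.float() / (len(s) - 1)           # normalize to [0, 1]
    consensus = 0.5 * (norm_rank(vis_saliency) + norm_rank(xattn_saliency))
    k = max(1, int(keep_ratio * len(consensus)))
    keep_idx = torch.topk(consensus, k).indices       # most informative tokens
    mask = torch.ones_like(consensus, dtype=torch.bool)
    mask[keep_idx] = False
    merge_idx = torch.nonzero(mask).squeeze(1)        # tokens to be merged away
    return keep_idx, merge_idx

keep, merge = consensus_keep(torch.rand(576), torch.rand(576))
```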
[CV-189] CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining
[Quick Read]: This paper addresses the limitation that behavior-cloning manipulation policies built on pre-trained 2D image representations fail to capture the 3D spatial information about objects and scenes required for high-precision manipulation. The key to the solution is CLAMP (Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining), a 3D pre-training framework that merges RGB-D images and camera extrinsics into point clouds and re-renders multi-view four-channel images carrying depth and 3D coordinates, including dynamic wrist views, to provide clearer views of target objects. Encoders are pre-trained with contrastive learning on large-scale simulated robot trajectories so that the 3D geometry and position of objects become associated with robot action patterns, while a Diffusion Policy is pre-trained to initialize the policy weights, which is essential for fine-tuning sample efficiency and performance. This pre-training and fine-tuning design outperforms state-of-the-art baselines across simulated and real-world tasks and generalizes well to unseen tasks.
Link: https://arxiv.org/abs/2602.00937
Authors: I-Chun Arthur Liu, Krzysztof Choromanski, Sandy Huang, Connor Schenck
Affiliations: Google; Meta; Stability.AI; Anthropic
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Leveraging pre-trained 2D image representations in behavior cloning policies has achieved great success and has become a standard approach for robotic manipulation. However, such representations fail to capture the 3D spatial information about objects and scenes that is essential for precise manipulation. In this work, we introduce Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining (CLAMP), a novel 3D pre-training framework that utilizes point clouds and robot actions. From the merged point cloud computed from RGB-D images and camera extrinsics, we re-render multi-view four-channel image observations with depth and 3D coordinates, including dynamic wrist views, to provide clearer views of target objects for high-precision manipulation tasks. The pre-trained encoders learn to associate the 3D geometric and positional information of objects with robot action patterns via contrastive learning on large-scale simulated robot trajectories. During encoder pre-training, we pre-train a Diffusion Policy to initialize the policy weights for fine-tuning, which is essential for improving fine-tuning sample efficiency and performance. After pre-training, we fine-tune the policy on a limited amount of task demonstrations using the learned image and action representations. We demonstrate that this pre-training and fine-tuning design substantially improves learning efficiency and policy performance on unseen tasks. Furthermore, we show that CLAMP outperforms state-of-the-art baselines across six simulated tasks and five real-world tasks.
[CV-190] OCTOPUS: Enhancing the Spatial-Awareness of Vision SSMs with Multi-Dimensional Scans and Traversal Selection
[Quick Read]: This paper addresses the limited applicability of State Space Models (SSMs) to vision tasks: their causal formulation, natural for sequential text, breaks the inherent spatial relationships among pixels or patches, so local spatial coherence is lost, with adjacent patches ignored and distant ones spuriously linked. The key to the solution is OCTOPUS, an architecture that performs discrete recurrence along eight principal orientations, forward and backward along the horizontal, vertical, and diagonal directions, enabling multi-directional information propagation that preserves both global context and local spatial structure while retaining the linear complexity of SSMs. This yields notable gains in boundary fidelity and region consistency.
Link: https://arxiv.org/abs/2602.00904
Authors: Kunal Mahatha, Ali Bahri, Pierre Marza, Sahar Dastani, Maria Vakalopoulou, Stergios Christodoulidis, Jose Dolz, Christian Desrosiers
Affiliations: ILLS, LIVIA, ÉTS Montréal, Canada; MICS, CentraleSupélec, Université Paris-Saclay
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:State space models (SSMs) have recently emerged as an alternative to transformers due to their unique ability of modeling global relationships in text with linear complexity. However, their success in vision tasks has been limited due to their causal formulation, which is suitable for sequential text but detrimental in the spatial domain where causality breaks the inherent spatial relationships among pixels or patches. As a result, standard SSMs fail to capture local spatial coherence, often linking non-adjacent patches while ignoring neighboring ones that are visually correlated. To address these limitations, we introduce OCTOPUS , a novel architecture that preserves both global context and local spatial structure within images, while maintaining the linear complexity of SSMs. OCTOPUS performs discrete reoccurrence along eight principal orientations, going forward or backward in the horizontal, vertical, and diagonal directions, allowing effective information exchange across all spatially connected regions while maintaining independence among unrelated patches. This design enables multi-directional recurrence, capturing both global context and local spatial structure with SSM-level efficiency. In our classification and segmentation benchmarks, OCTOPUS demonstrates notable improvements in boundary preservation and region consistency, as evident from the segmentation results, while maintaining relatively better classification accuracy compared to existing V-SSM based models. These results suggest that OCTOPUS appears as a foundation method for multi-directional recurrence as a scalable and effective mechanism for building spatially aware and computationally efficient vision architectures.
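The eight traversal orders are easy to make concrete. A small sketch that generates the eight scan permutations over an H×W patch grid (the SSM recurrence itself is omitted):

```python
# Sketch: forward/backward scans along horizontal, vertical, and the two
# diagonals of an H x W patch grid -> eight permutations of patch indices.
import numpy as np

def eight_scan_orders(H, W):
    idx = np.arange(H * W).reshape(H, W)
    horizontal = idx.reshape(-1)                       # row-major order
    vertical = idx.T.reshape(-1)                       # column-major order
    diag = np.concatenate([idx.diagonal(k) for k in range(-(H - 1), W)])
    anti = np.concatenate([np.fliplr(idx).diagonal(k) for k in range(-(H - 1), W)])
    orders = [horizontal, vertical, diag, anti]
    return orders + [o[::-1] for o in orders]          # add the backward scans

for order in eight_scan_orders(4, 4):
    assert sorted(order.tolist()) == list(range(16))   # each scan is a permutation
```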
[CV-191] DIAMOND: Directed Inference for Artifact Mitigation in Flow Matching Models
[Quick Read]: This paper targets the visual and anatomical artifacts that persist in current text-to-image models such as FLUX and severely limit their practical, professional use. Existing artifact-reduction methods act post hoc and cannot intervene effectively in the core image-formation process; they often require invasive modifications to model weights or computationally expensive regional refinement. The key to the solution is DIAMOND, a training-free method that applies trajectory correction during inference: by reconstructing an estimate of the clean sample at every step of the generative trajectory, it actively steers generation away from latent states that lead to artifacts, achieving high-quality, artifact-free synthesis without additional training or weight changes.
Link: https://arxiv.org/abs/2602.00883
Authors: Alicja Polowczyk, Agnieszka Polowczyk, Piotr Borycki, Joanna Waczyńska, Jacek Tabor, Przemysław Spurek
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Despite impressive results from recent text-to-image models like FLUX, visual and anatomical artifacts remain a significant hurdle for practical and professional use. Existing methods for artifact reduction, typically work in a post-hoc manner, consequently failing to intervene effectively during the core image formation process. Notably, current techniques require problematic and invasive modifications to the model weights, or depend on a computationally expensive and time-consuming process of regional refinement. To address these limitations, we propose DIAMOND, a training-free method that applies trajectory correction to mitigate artifacts during inference. By reconstructing an estimate of the clean sample at every step of the generative trajectory, DIAMOND actively steers the generation process away from latent states that lead to artifacts. Furthermore, we extend the proposed method to standard Diffusion Models, demonstrating that DIAMOND provides a robust, zero-shot path to high-fidelity, artifact-free image synthesis without the need for additional training or weight modifications in modern generative architectures. Code is available at this https URL
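The per-step clean-sample estimate has a simple closed form under a rectified-flow convention. The sketch below assumes x_t = (1-t)·x0 + t·ε with the network predicting v ≈ ε − x0, so x̂0 = x_t − t·v; the gradient-based correction and the artifact scorer are hypothetical stand-ins for DIAMOND's actual steering rule:

```python
# Sketch: one corrected sampler step. `model` predicts a velocity field and
# `artifact_score` is a hypothetical differentiable scorer (lower = cleaner).
import torch

def corrected_step(model, artifact_score, x_t, t, dt):
    v = model(x_t, t)                      # predicted velocity
    x0_hat = (x_t - t * v).detach().requires_grad_(True)  # clean-sample estimate
    score = artifact_score(x0_hat).sum()
    grad, = torch.autograd.grad(score, x0_hat)
    # Pushing x0_hat against the score gradient corresponds (up to scale) to
    # increasing v along the gradient, steering away from artifact-prone states.
    v_corr = v + grad
    return x_t - dt * v_corr               # integrate toward t = 0
```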
[CV-192] Distill3R: A Pipeline for Democratizing 3D Foundation Models on Commodity Hardware
[Quick Read]: This paper addresses an accessibility problem in multi-view 3D reconstruction: training large 3D foundation models depends on massive compute clusters, putting the research out of reach for most academic laboratories. The key to the solution is Distill3R, which enables efficient knowledge distillation through two innovations: an offline caching pipeline that decouples heavy teacher inference from the training loop by storing compressed supervision signals, and a confidence-aware distillation loss that uses teacher uncertainty to guide stable student training on commodity hardware. The resulting 72M-parameter student trains in under 3 days on a single workstation, achieving a 9x parameter reduction and 5x inference speedup over its 650M-parameter teacher while preserving structural consistency and geometric understanding, providing a low-cost, reproducible baseline for resource-constrained laboratories.
Link: https://arxiv.org/abs/2602.00865
Authors: Brandon Leblanc, Charalambos Poullis
Affiliations: The Natural Sciences and Engineering Research Council of Canada (NSERC)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to the Canadian Conference on Robotics and Vision (CRV). 10 pages, 5 figures
Abstract:While multi-view 3D reconstruction has shifted toward large-scale foundation models capable of inferring globally consistent geometry, their reliance on massive computational clusters for training has created a significant barrier to entry for most academic laboratories. To bridge this compute divide, we introduce Distill3R, a framework designed to distill the geometric reasoning of 3D foundation models into compact students fully trainable on a single workstation. Our methodology centers on two primary innovations: (1) an offline caching pipeline that decouples heavy teacher inference from the training loop through compressed supervision signals, and (2) a confidence-aware distillation loss that leverages teacher uncertainty to enable training on commodity hardware. We propose a 72M-parameter student model which achieves a 9x reduction in parameters and a 5x inference speedup compared to its 650M-parameter teacher. The student is fully trainable in under 3 days on a single workstation, whereas its teacher requires massive GPU clusters for up to a week. We demonstrate that the student preserves the structural consistency and qualitative geometric understanding required for functional 3D awareness. By providing a reproducible, single-workstation training recipe, Distill3R serves as an exploratory entry point for democratized 3D vision research and efficient edge deployment. This work is not intended to compete with state-of-the-art foundation models, but to provide an accessible research baseline for laboratories without access to large-scale compute to train and specialize models on their own domain-specific data at minimal cost.
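A confidence-weighted distillation loss of the kind described can be written compactly. This is one standard formulation under stated assumptions, not necessarily Distill3R's exact loss:

```python
# Sketch: per-pixel teacher confidence down-weights supervision where the
# teacher is unsure; weights are normalized so the loss scale stays stable.
import torch

def confidence_distill_loss(student_out, teacher_out, teacher_conf, eps=1e-6):
    """All tensors (B, C, H, W); teacher_conf in [0, 1], broadcastable over C."""
    per_px = (student_out - teacher_out).abs().mean(dim=1, keepdim=True)  # L1 residual
    w = teacher_conf / (teacher_conf.mean() + eps)                        # normalized weights
    return (w * per_px).mean()

loss = confidence_distill_loss(torch.randn(2, 3, 64, 64),
                               torch.randn(2, 3, 64, 64),
                               torch.rand(2, 1, 64, 64))
```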
[CV-193] Invariance on Manifolds: Understanding Robust Visual Representations for Place Recognition
[Quick Read]: This paper addresses the insufficient robustness of Visual Place Recognition (VPR) representations under drastic environmental and viewpoint changes: existing methods either require data-hungry supervision or use only first-order statistics that miss structural correlations within a scene. The key to the solution is a Second-Order Geometric Statistics framework that models scenes as covariance descriptors on the Symmetric Positive Definite (SPD) manifold, where perturbations manifest as tractable congruence transformations, and uses geometry-aware Riemannian mappings to project the descriptors into a linearized Euclidean embedding, decoupling signal structure from noise and enabling training-free, zero-shot generalization.
Link: https://arxiv.org/abs/2602.00841
Authors: Jintao Cheng, Weibin Li, Zhijian He, Jin Wu, Chi Man Vong, Wei Zhang
Affiliations: The Hong Kong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 5 figures
Abstract:Visual Place Recognition (VPR) demands representations robust to drastic environmental and viewpoint shifts. Current aggregation paradigms, however, either rely on data-hungry supervision or simplistic first-order statistics, often neglecting intrinsic structural correlations. In this work, we propose a Second-Order Geometric Statistics framework that inherently captures geometric stability without training. We conceptualize scenes as covariance descriptors on the Symmetric Positive Definite (SPD) manifold, where perturbations manifest as tractable congruence transformations. By leveraging geometry-aware Riemannian mappings, we project these descriptors into a linearized Euclidean embedding, effectively decoupling signal structure from noise. Our approach introduces a training-free framework built upon fixed, pre-trained backbones, achieving strong zero-shot generalization without parameter updates. Extensive experiments confirm that our method achieves highly competitive performance against state-of-the-art baselines, particularly excelling in challenging zero-shot scenarios.
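The covariance-descriptor-plus-Riemannian-mapping pipeline admits a short sketch. The log-Euclidean map below (matrix logarithm via eigendecomposition, then vectorization of the upper triangle) is one common geometry-aware choice; the regularization constant is an assumption:

```python
# Sketch: local features -> SPD covariance -> matrix log (log-Euclidean map)
# -> flat Euclidean embedding usable with ordinary distance comparisons.
import numpy as np

def log_euclidean_embedding(feats, eps=1e-5):
    """feats: (N, D) local descriptors from a frozen backbone."""
    c = np.cov(feats, rowvar=False) + eps * np.eye(feats.shape[1])  # SPD covariance
    w, v = np.linalg.eigh(c)                                        # symmetric eigendecomposition
    log_c = (v * np.log(w)) @ v.T                                   # V diag(log w) V^T
    iu = np.triu_indices_from(log_c)
    return log_c[iu]                                                # vectorized upper triangle

emb = log_euclidean_embedding(np.random.randn(500, 32))
```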
[CV-194] TransNormal: Dense Visual Semantics for Diffusion-based Transparent Object Normal Estimation
[Quick Read]: This paper tackles monocular normal estimation for transparent objects in laboratory automation, which is difficult because complex light refraction and reflection cause conventional depth and normal sensors to fail catastrophically, hindering the deployment of embodied AI in scientific environments. The key to the solution is TransNormal, a framework that adapts pre-trained diffusion priors for single-step normal regression: dense visual semantics extracted by DINOv3 are injected via a cross-attention mechanism to provide strong geometric cues on texture-poor transparent surfaces, while a multi-task learning objective and wavelet-based regularization preserve fine-grained structural details.
Link: https://arxiv.org/abs/2602.00839
Authors: Mingwei Li, Hehe Fan, Yi Yang
Affiliations: Zhejiang University; Zhongguancun Academy
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:Monocular normal estimation for transparent objects is critical for laboratory automation, yet it remains challenging due to complex light refraction and reflection. These optical properties often lead to catastrophic failures in conventional depth and normal sensors, hindering the deployment of embodied AI in scientific environments. We propose TransNormal, a novel framework that adapts pre-trained diffusion priors for single-step normal regression. To handle the lack of texture in transparent surfaces, TransNormal integrates dense visual semantics from DINOv3 via a cross-attention mechanism, providing strong geometric cues. Furthermore, we employ a multi-task learning objective and wavelet-based regularization to ensure the preservation of fine-grained structural details. To support this task, we introduce TransNormal-Synthetic, a physics-based dataset with high-fidelity normal maps for transparent labware. Extensive experiments demonstrate that TransNormal significantly outperforms state-of-the-art methods: on the ClearGrasp benchmark, it reduces mean error by 24.4% and improves 11.25° accuracy by 22.8%; on ClearPose, it achieves a 15.2% reduction in mean error. The code and dataset will be made publicly available at this https URL.
[CV-195] Edge-Native Generative De-identification: Inversion-Free Flow for Privacy-Preserving Federated Skin Image Analysis
[Quick Read]: This paper addresses the tension in Federated Learning (FL) for clinical dermatology between protecting patient privacy and preserving diagnostic features: traditional de-identification often degrades pathological fidelity, while standard generative editing relies on computationally heavy inversion unsuited to resource-constrained edge devices. The key to the solution is an inversion-free, identity-agnostic pathology-preservation framework: inversion-free Rectified Flow Transformers (FlowEdit) perform high-fidelity identity transformation in near real time (under 20 s), enabling local deployment on clinical nodes; a "Segment-by-Synthesis" mechanism locally generates counterfactual healthy and pathological twin pairs, from which differential erythema masks decoupled from biometric markers and semantic artifacts (e.g., jewelry) are extracted. By generating privacy-compliant synthetic surrogates at the edge, the framework mitigates gradient-leakage risk at the source and provides a secure pathway for high-precision skin image analysis in federated environments.
Link: https://arxiv.org/abs/2602.00821
Authors: Konstantinos Moutselos, Ilias Maglogiannis
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures
Abstract:The deployment of Federated Learning (FL) for clinical dermatology is hindered by the competing requirements of protecting patient privacy and preserving diagnostic features. Traditional de-identification methods often degrade pathological fidelity, while standard generative editing techniques rely on computationally intensive inversion processes unsuitable for resource-constrained edge devices. We propose a framework for identity-agnostic pathology preservation that serves as a client-side privacy-preserving utility. By leveraging inversion-free Rectified Flow Transformers (FlowEdit), the system performs high-fidelity identity transformation in near real-time (less than 20s), facilitating local deployment on clinical nodes. We introduce a “Segment-by-Synthesis” mechanism that generates counterfactual healthy and pathological twin pairs locally. This enables the extraction of differential erythema masks that are decoupled from biometric markers and semantic artifacts (e.g. jewelry). Pilot validation on high-resolution clinical samples demonstrates an Intersection over Union (IoU) stability greater than 0.67 across synthetic identities. By generating privacy-compliant synthetic surrogates at the edge, this framework mitigates the risk of gradient leakage at the source, providing a secure pathway for high-precision skin image analysis in federated environments.
[CV-196] SyNeT: Synthetic Negatives for Traversability Learning
[Quick Read]: This paper addresses a core limitation of self-supervised traversability learning for autonomous robots in complex outdoor environments: the lack of explicit negative data prevents models from accurately identifying diverse non-traversable regions. The key to the solution is to explicitly construct synthetic negatives, representing plausible but non-traversable regions, and integrate them into vision-based traversability learning as a training strategy compatible with both Positive-Unlabeled (PU) and Positive-Negative (PN) frameworks, without modifying inference architectures. The paper also introduces an object-centric FPR evaluation that analyzes predictions in regions where synthetic negatives were inserted, indirectly measuring consistent recognition of non-traversable regions without additional manual labeling.
Link: https://arxiv.org/abs/2602.00814
Authors: Bomena Kim, Hojun Lee, Younsoo Park, Yaoyu Hu, Sebastian Scherer, Inwook Shim
Affiliations: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Reliable traversability estimation is crucial for autonomous robots to navigate complex outdoor environments safely. Existing self-supervised learning frameworks primarily rely on positive and unlabeled data; however, the lack of explicit negative data remains a critical limitation, hindering the model’s ability to accurately identify diverse non-traversable regions. To address this issue, we introduce a method to explicitly construct synthetic negatives, representing plausible but non-traversable, and integrate them into vision-based traversability learning. Our approach is formulated as a training strategy that can be seamlessly integrated into both Positive-Unlabeled (PU) and Positive-Negative (PN) frameworks without modifying inference architectures. Complementing standard pixel-wise metrics, we introduce an object-centric FPR evaluation approach that analyzes predictions in regions where synthetic negatives are inserted. This evaluation provides an indirect measure of the model’s ability to consistently identify non-traversable regions without additional manual labeling. Extensive experiments on both public and self-collected datasets demonstrate that our approach significantly enhances robustness and generalization across diverse environments. The source code and demonstration videos are publicly available at the project page: this https URL
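The object-centric FPR can be sketched directly from its description: restrict evaluation to regions where synthetic negatives were inserted and count objects the model still calls traversable. The score threshold and the "mostly traversable" rule are assumptions:

```python
# Sketch: object-centric false-positive rate over inserted-negative regions.
import numpy as np

def object_centric_fpr(pred_trav, neg_masks, thresh=0.5):
    """pred_trav: (H, W) traversability scores; neg_masks: list of (H, W) bool masks."""
    fps, total = 0, 0
    for m in neg_masks:                               # one mask per inserted negative
        region = pred_trav[m]
        fps += int((region > thresh).mean() > 0.5)    # FP if region is mostly "traversable"
        total += 1
    return fps / max(total, 1)

pred = np.random.rand(256, 256)
masks = [np.zeros((256, 256), dtype=bool)]
masks[0][100:140, 100:140] = True
print(object_centric_fpr(pred, masks))
```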
[CV-197] Generating a Paracosm for Training-Free Zero-Shot Composed Image Retrieval
[Quick Read]: This paper addresses the core challenge of **Composed Image Retrieval (CIR)**: retrieving a target image from a multimodal query consisting of a reference image and a modification text, when the "mental image" the query implies is only implicitly defined and not physically available. Existing zero-shot methods prompt a Large Multimodal Model (LMM) to produce a textual description and then match text against images with a Vision-Language Model (VLM), an indirect route prone to error. The key innovation is to generate the "mental image" directly as an intermediate representation: an LMM synthesizes an image for the multimodal query, and a synthetic counterpart is generated for every real database image to narrow the synthetic-to-real domain gap, so matching happens inside a constructed "paracosm". The resulting method, Paracosm, is training-free and zero-shot, and significantly outperforms existing zero-shot methods on four challenging benchmarks, achieving state-of-the-art zero-shot CIR performance.
Link: https://arxiv.org/abs/2602.00813
Authors: Tong Wang, Yunhan Zhao, Shu Kong
Affiliations: University of Macau; Google DeepMind; Institute of Collaborative Innovation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Composed Image Retrieval (CIR) is the task of retrieving a target image from a database using a multimodal query, which consists of a reference image and a modification text. The text specifies how to alter the reference image to form a "mental image", based on which CIR should find the target image in the database. The fundamental challenge of CIR is that this "mental image" is not physically available and is only implicitly defined by the query. The contemporary literature pursues zero-shot methods and uses a Large Multimodal Model (LMM) to generate a textual description for a given multimodal query, and then employs a Vision-Language Model (VLM) for textual-visual matching to search the target image. In contrast, we address CIR from first principles by directly generating the "mental image" for more accurate matching. Particularly, we prompt an LMM to generate a "mental image" for a given multimodal query and propose to use this "mental image" to search for the target image. As the "mental image" has a synthetic-to-real domain gap with real images, we also generate a synthetic counterpart for each real image in the database to facilitate matching. In this sense, our method uses LMM to construct a "paracosm", where it matches the multimodal query and database images. Hence, we call this method Paracosm. Notably, Paracosm is a training-free zero-shot CIR method. It significantly outperforms existing zero-shot methods on four challenging benchmarks, achieving state-of-the-art performance for zero-shot CIR.
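The matching step inside the paracosm reduces to nearest-neighbor search over synthetic counterparts. A minimal sketch, with the LMM-generated imagery and the embedding model abstracted away as precomputed vectors:

```python
# Sketch: embed the generated "mental image" and rank database images by the
# cosine similarity of their synthetic counterparts' embeddings.
import torch
import torch.nn.functional as F

def paracosm_retrieve(mental_img_emb, synth_db_embs, topk=5):
    """mental_img_emb: (D,); synth_db_embs: (N, D), one per real database image."""
    q = F.normalize(mental_img_emb, dim=0)
    db = F.normalize(synth_db_embs, dim=1)
    sims = db @ q                                 # cosine similarity, synthetic domain
    return torch.topk(sims, topk).indices         # indices of corresponding real images

idx = paracosm_retrieve(torch.randn(512), torch.randn(1000, 512))
```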
[CV-198] VVLoc: Prior-free 3-DoF Vehicle Visual Localization
[Quick Read]: This paper addresses two core issues in localization for autonomous driving: topological localization (finding the most similar map keyframe) and metric localization (recovering precise spatial coordinates) are usually handled separately, complicating the system, while existing solutions often depend on single cameras or additional 3D semantic/pose priors and lack any mechanism to quantify localization confidence, limiting industrial practicality. The key to the solution is VVLoc, a unified end-to-end pipeline in which a single neural network performs topological and metric localization jointly with a multi-camera system: it evaluates geo-proximity between visual observations, estimates relative metric poses via a matching strategy, and provides a built-in confidence measure. Training requires only pairs of visual data with ground-truth poses, eliminating complex supplementary data, and the method delivers state-of-the-art localization accuracy.
Link: https://arxiv.org/abs/2602.00810
Authors: Ze Huang, Zhongyang Xiao, Mingliang Song, Longan Yang, Hongyuan Yuan, Li Sun
Affiliations: Fudan University; NIO; Bosch XC
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Localization is a critical technology in autonomous driving, encompassing both topological localization, which identifies the most similar map keyframe to the current observation, and metric localization, which provides precise spatial coordinates. Conventional methods typically address these tasks independently, rely on single-camera setups, and often require additional 3D semantic or pose priors, while lacking mechanisms to quantify the confidence of localization results, making them less feasible for real industrial applications. In this paper, we propose VVLoc, a unified pipeline that employs a single neural network to concurrently achieve topological and metric vehicle localization using multi-camera system. VVLoc first evaluates the geo-proximity between visual observations, then estimates their relative metric poses using a matching strategy, while also providing a confidence measure. Additionally, the training process for VVLoc is highly efficient, requiring only pairs of visual data and corresponding ground-truth poses, eliminating the need for complex supplementary data. We evaluate VVLoc not only on the publicly available datasets, but also on a more challenging self-collected dataset, demonstrating its ability to deliver state-of-the-art localization accuracy across a wide range of localization tasks.
[CV-199] Any3D-VLA: Enhancing VLA Robustness via Diverse Point Clouds
[Quick Read]: This paper addresses the limited spatial understanding of existing Vision-Language-Action (VLA) models, which take only 2D images as visual input and therefore model 3D structure poorly in complex scenes. The key to the solution is Any3D-VLA, which unifies simulator, sensor, and model-estimated point clouds within a single training pipeline, constructs diverse inputs, and learns domain-agnostic 3D representations that are fused with the corresponding 2D representations, improving performance while mitigating the domain gap caused by cross-environment differences and depth-scale biases.
Link: https://arxiv.org/abs/2602.00807
Authors: Xianzhe Fan, Shengliang Deng, Xiaoyang Wu, Yuxiang Lu, Zhuoling Li, Mi Yan, Yujia Zhang, Zhizheng Zhang, He Wang, Hengshuang Zhao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Existing Vision-Language-Action (VLA) models typically take 2D images as visual input, which limits their spatial understanding in complex scenes. How can we incorporate 3D information to enhance VLA capabilities? We conduct a pilot study across different observation spaces and visual representations. The results show that explicitly lifting visual input into point clouds yields representations that better complement their corresponding 2D representations. To address the challenges of (1) scarce 3D data and (2) the domain gap induced by cross-environment differences and depth-scale biases, we propose Any3D-VLA. It unifies the simulator, sensor, and model-estimated point clouds within a training pipeline, constructs diverse inputs, and learns domain-agnostic 3D representations that are fused with the corresponding 2D representations. Simulation and real-world experiments demonstrate Any3D-VLA’s advantages in improving performance and mitigating the domain gap. Our project homepage is available at this https URL.
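Lifting visual input into point clouds starts from standard pinhole unprojection of a depth map. A minimal sketch with illustrative intrinsics:

```python
# Sketch: depth map -> camera-frame point cloud via the pinhole model,
# the usual first step for the point-cloud inputs described above.
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """depth: (H, W) metric depth; returns (M, 3) points, invalid depths dropped."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth.reshape(-1)
    x = (u.reshape(-1) - cx) * z / fx
    y = (v.reshape(-1) - cy) * z / fy
    pts = np.stack([x, y, z], axis=1)
    return pts[z > 0]

pts = depth_to_points(np.random.rand(480, 640), fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```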
[CV-200] DVLA-RL: Dual-Level Vision-Language Alignment with Reinforcement Learning Gating for Few-Shot Learning ICLR2026
[Quick Read]: This paper addresses the limited semantic gains in few-shot learning (FSL) caused by the absence of progressive, adaptive vision-language alignment from low-level to high-level semantics: existing methods enrich visual features with LLM-generated class-name embeddings but go no further. The key to the solution is the Dual-level Vision-Language Alignment with Reinforcement Learning gating (DVLA-RL) framework, with two components. Dual-level Semantic Construction (DSC) conditions LLMs on both class names and support samples, progressively selects the most relevant discriminative attributes, and synthesizes them into coherent class descriptions, providing complementary low-level attributes and high-level descriptions for both fine-grained grounding and holistic class understanding. RL-gated Attention (RLA) casts cross-modal fusion as a sequential decision process: a lightweight policy trained with episodic REINFORCE adaptively balances self-attention and cross-attention, so shallow layers refine local attributes and deep layers emphasize global semantics, yielding more precise cross-modal alignment and class-specific discrimination from only a few support samples.
Link: https://arxiv.org/abs/2602.00795
Authors: Wenhao Li, Xianjing Meng, Qiangchang Wang, Zhongyi Han, Zhibin Wu, Yilong Yin
Affiliations: Shandong University; Shenzhen Loop Area Institute; Shandong University of Finance and Economics
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICLR 2026
Abstract:Few-shot learning (FSL) aims to generalize to novel categories with only a few samples. Recent approaches incorporate large language models (LLMs) to enrich visual representations with semantic embeddings derived from class names. However, they overlook progressive and adaptive alignment between vision and language from low-level to high-level semantics, resulting in limited semantic gains. To address these challenges, we propose Dual-level Vision-Language Alignment with Reinforcement Learning gating (DVLA-RL), which consists of Dual-level Semantic Construction (DSC) and RL-gated Attention (RLA). Specifically, DSC conditions LLMs on both class names and support samples to generate discriminative attributes, progressively selects the most relevant ones, and then synthesizes them into coherent class descriptions. This process provides complementary low-level attributes and high-level descriptions, enabling both fine-grained grounding and holistic class understanding. To dynamically integrate dual-level semantics along with the visual network layers, RLA formulates cross-modal fusion as a sequential decision process. A lightweight policy trained with episodic REINFORCE adaptively adjusts the contributions of self-attention and cross-attention to integrate textual and visual tokens. As a result, shallow layers refine local attributes and deep layers emphasize global semantics, enabling more precise cross-modal alignment. This achieves class-specific discrimination and generalized representations with merely a few support samples. DVLA-RL achieves new state-of-the-art performance across nine benchmarks in three diverse FSL scenarios.
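The RLA gating can be illustrated with a toy episodic REINFORCE update. The network size, the per-layer Bernoulli gate, and the reward are all illustrative assumptions, not the paper's exact policy:

```python
# Sketch: a lightweight policy samples per-layer gates (here, whether to use
# cross-attention) and is trained with REINFORCE against an episode reward.
import torch
import torch.nn as nn

class GatePolicy(nn.Module):
    def __init__(self, feat_dim, n_layers):
        super().__init__()
        self.head = nn.Linear(feat_dim, n_layers)    # one Bernoulli logit per layer

    def forward(self, ctx):
        probs = torch.sigmoid(self.head(ctx))        # P(use cross-attention) per layer
        dist = torch.distributions.Bernoulli(probs)
        gates = dist.sample()                        # would modulate attention mixing
        return gates, dist.log_prob(gates).sum(-1)

policy = GatePolicy(256, n_layers=12)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
gates, logp = policy(torch.randn(8, 256))
reward = torch.rand(8)                               # e.g., episode accuracy (placeholder)
loss = -((reward - reward.mean()) * logp).mean()     # REINFORCE with a mean baseline
opt.zero_grad(); loss.backward(); opt.step()
```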
[CV-201] Evaluating Deep Learning-Based Nerve Segmentation in Brachial Plexus Ultrasound Under Realistic Data Constraints
[Quick Read]: This paper addresses the difficulty of nerve identification in ultrasound-guided brachial plexus localization caused by low image contrast, speckle noise, and inter-patient anatomical variability, i.e., insufficient nerve segmentation accuracy in ultrasound images. The key of the study is deep learning-based nerve segmentation with a U-Net architecture and a systematic evaluation of how dataset composition and annotation strategy affect performance: training on combined data from multiple ultrasound machines regularizes lower-performing acquisition sources but does not surpass single-source training matched to the target domain, while multi-class supervision (artery, vein, nerve, muscle) broadens generalization yet markedly lowers nerve-specific Dice scores. The study also finds a moderate positive correlation between nerve size and segmentation accuracy (Pearson r = 0.587, p < 0.001), indicating that small nerves remain the primary challenge.
Link: https://arxiv.org/abs/2602.00763
Authors: Dylan Yves, Khush Agarwal, Jonathan Hoyin Chan, Patcharapit Promoppatum, Aroonkamon Pattanasiricharoen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 6 figures
Abstract:Accurate nerve localization is critical for the success of ultrasound-guided regional anesthesia, yet manual identification remains challenging due to low image contrast, speckle noise, and inter-patient anatomical variability. This study evaluates deep learning-based nerve segmentation in ultrasound images of the brachial plexus using a U-Net architecture, with a focus on how dataset composition and annotation strategy influence segmentation performance. We find that training on combined data from multiple ultrasound machines (SIEMENS ACUSON NX3 Elite and Philips EPIQ5) provides regularization benefits for lower-performing acquisition sources, though it does not surpass single-source training when matched to the target domain. Extending the task from binary nerve segmentation to multi-class supervision (artery, vein, nerve, muscle) results in decreased nerve-specific Dice scores, with performance drops ranging from 9% to 61% depending on dataset, likely due to class imbalance and boundary ambiguity. Additionally, we observe a moderate positive correlation between nerve size and segmentation accuracy (Pearson r = 0.587, p < 0.001), indicating that smaller nerves remain a primary challenge. These findings provide methodological guidance for developing robust ultrasound nerve segmentation systems under realistic clinical data constraints.
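The size-accuracy analysis corresponds to a plain Pearson correlation between per-sample nerve size and Dice score. A toy sketch with synthetic data (the values reported above, r = 0.587 and p < 0.001, come from the paper's real data):

```python
# Sketch: correlate nerve area with per-sample Dice using scipy's pearsonr.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
nerve_area = rng.gamma(shape=2.0, scale=50.0, size=200)              # px^2, toy
dice = np.clip(0.4 + 0.002 * nerve_area + rng.normal(0, 0.08, 200), 0, 1)
r, p = pearsonr(nerve_area, dice)
print(f"Pearson r={r:.3f}, p={p:.3g}")
```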
[CV-202] HSI-VAR: Rethinking Hyperspectral Restoration through Spatial-Spectral Visual Autoregression
[Quick Read]: This paper addresses restoring hyperspectral images (HSIs) that suffer composite degradations such as noise, blur, and missing bands: diffusion-based generative methods recover detail but require hundreds of iterations at high computational cost, while regression models oversmooth and lose key structure. The key to the solution is HSI-VAR, which recasts HSI restoration as an autoregressive generation problem so that spectral and spatial dependencies are modeled progressively rather than reconstructed globally. Its core innovations are: latent-condition alignment, coupling semantic consistency between latent priors and conditional embeddings for precise reconstruction; degradation-aware guidance, encoding mixed degradations as linear combinations in the embedding space for automatic control, which cuts inference computation by nearly 50%; and a spatial-spectral adaptation module that refines details in both domains during decoding. Experiments on nine benchmarks show state-of-the-art performance, with up to a 3.77 dB PSNR gain and inference up to 95.5x faster than diffusion-based methods, while maintaining high reconstruction fidelity.
Link: https://arxiv.org/abs/2602.00749
Authors: Xiangming Wang, Benteng Sun, Yungeng Liu, Haijin Zeng, Yongyong Chen, Jingyong Su, Jie Liu
Affiliations: Harbin Institute of Technology (Shenzhen); Harvard University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Hyperspectral images (HSIs) capture richer spatial-spectral information beyond RGB, yet real-world HSIs often suffer from a composite mix of degradations, such as noise, blur, and missing bands. Existing generative approaches for HSI restoration like diffusion models require hundreds of iterative steps, making them computationally impractical for high-dimensional HSIs, while regression models tend to produce oversmoothed results, failing to preserve critical structural details. We break this impasse by introducing HSI-VAR, rethinking HSI restoration as an autoregressive generation problem, where spectral and spatial dependencies can be progressively modeled rather than globally reconstructed. HSI-VAR incorporates three key innovations: (1) Latent-condition alignment, which couples semantic consistency between latent priors and conditional embeddings for precise reconstruction; (2) Degradation-aware guidance, which uniquely encodes mixed degradations as linear combinations in the embedding space for automatic control, remarkably achieving a nearly 50% reduction in computational cost at inference; (3) A spatial-spectral adaptation module that refines details across both domains in the decoding phase. Extensive experiments on nine all-in-one HSI restoration benchmarks confirm HSI-VAR's state-of-the-art performance, achieving a 3.77 dB PSNR improvement on ICVL and offering superior structure preservation with an inference speed-up of up to 95.5x compared with diffusion-based methods, making it a highly practical solution for real-world HSI restoration.
[CV-203] Can Vision-Language Models Handle Long-Context Code? An Empirical Study on Visual Compression
[Quick Read]: This paper addresses the performance bottleneck of large language models (LLMs) on long code contexts caused by window limits: existing textual compression methods ease the length pressure but disrupt dependency closure through selective filtering, causing semantic fragmentation. The key to the solution is LongCodeOCR, a visual compression framework that renders code into compressed two-dimensional image sequences for Vision-Language Models (VLMs), preserving a global view and avoiding the dependency breakage inherent in filtering, which improves semantic completeness and reasoning accuracy on long-code tasks.
Link: https://arxiv.org/abs/2602.00746
Authors: Jianping Zhong, Guochang Li, Chen Zhi, Junxiao Han, Zhen Qin, Xinkui Zhao, Nan Wang, Shuiguang Deng, Jianwei Yin
Affiliations: Zhejiang University; Hangzhou City University; Shenzhou Aerospace Software Technology Company Limited
Subjects: Software Engineering (cs.SE); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Large Language Models (LLMs) struggle with long-context code due to window limitations. Existing textual code compression methods mitigate this via selective filtering but often disrupt dependency closure, causing semantic fragmentation. To address this, we introduce LongCodeOCR, a visual compression framework that renders code into compressed two-dimensional image sequences for Vision-Language Models (VLMs). By preserving a global view, this approach avoids the dependency breakage inherent in filtering. We systematically evaluate LongCodeOCR against the state-of-the-art LongCodeZip across four benchmarks spanning code summarization, code question answering, and code completion. Our results demonstrate that visual code compression serves as a viable alternative for tasks requiring global understanding. At comparable compression ratios (~1.7x), LongCodeOCR improves CompScore on Long Module Summarization by 36.85 points over LongCodeZip. At a 1M-token context length with Glyph (a specialized 9B VLM), LongCodeOCR maintains higher accuracy than LongCodeZip while operating at about 4x higher compression. Moreover, compared with LongCodeZip, LongCodeOCR drastically reduces compression-stage overhead (reducing latency from ~4.3 hours to ~1 minute at 1M tokens). Finally, our results characterize a fundamental coverage-fidelity trade-off: visual code compression retains broader context coverage to support global dependencies, yet faces fidelity bottlenecks on exactness-critical tasks; by contrast, textual code compression preserves symbol-level precision while sacrificing structural coverage.
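The rendering step behind visual code compression can be sketched with PIL: paint source lines onto fixed-size image pages for a VLM to consume. The font, page geometry, and line handling here are illustrative choices, not LongCodeOCR's actual settings:

```python
# Sketch: render source text onto a white image page, one line at a time.
from PIL import Image, ImageDraw

def render_code_page(code, width=1024, height=1024, line_h=14):
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    y = 0
    for line in code.splitlines():
        if y + line_h > height:
            break                                   # a real pipeline would start a new page
        draw.text((4, y), line, fill="black")       # default bitmap font
        y += line_h
    return img

page = render_code_page("def add(a, b):\n    return a + b\n" * 30)
```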
[CV-204] Diffusion-Driven Inter-Outer Surface Separation for Point Clouds with Open Boundaries
[Quick Read]: This paper addresses the "double surface artifact" produced by asymmetric truncation thresholds in Truncated Signed Distance Function (TSDF) fusion: point clouds reconstructed from indoor or medical 3D scenes acquire erroneous inter and outer shells, causing overlapping surfaces and disordered normals. The key to the solution is a diffusion-based algorithm that separates the true inter-layer surface from double-layered point clouds, designed especially for point clouds with open boundaries, i.e., sampled surfaces with topological openings, as opposed to surfaces with entirely missing regions. Operating as a lightweight post-processing module after TSDF fusion, it separates 20,000 inter and 20,000 outer points in roughly 10 seconds, improving surface representations for applications such as indoor scene modeling and medical imaging while supporting both watertight and open-boundary geometries.
Link: https://arxiv.org/abs/2602.00739
Authors: Zhengyan Qin, Liyuan Qiu
Affiliations: Hong Kong University of Science and Technology (HKUST); Hong Kong Applied Science and Technology Research Institute (ASTRI)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We propose a diffusion-based algorithm for separating the inter and outer layer surfaces from double-layered point clouds, particularly those exhibiting the "double surface artifact" caused by truncation in Truncated Signed Distance Function (TSDF) fusion during indoor or medical 3D reconstruction. This artifact arises from asymmetric truncation thresholds, leading to erroneous inter and outer shells in the fused volume, which our method addresses by extracting the true inter layer to mitigate challenges like overlapping surfaces and disordered normals. We focus on point clouds with open boundaries (i.e., sampled surfaces with topological openings/holes through which particles may escape), rather than point clouds with missing surface regions where no samples exist. Our approach enables robust processing of both watertight and open-boundary models, achieving extraction of the inter layer from 20,000 inter and 20,000 outer points in approximately 10 seconds. This solution is particularly effective for applications requiring accurate surface representations, such as indoor scene modeling and medical imaging, where double-layered point clouds are prevalent, and it accommodates both closed (watertight) and open-boundary surface geometries. Our goal is post-hoc inter/outer shell separation as a lightweight module after TSDF fusion; we do not aim to replace full variational or learning-based reconstruction pipelines.
[CV-205] Supervised makeup transfer with a curated dataset: Decoupling identity and makeup features for enhanced transformation ICASSP
[Quick Read]: This paper addresses three shortcomings of existing makeup transfer methods: limited datasets, insufficient disentanglement of identity and makeup features, and weak controllability. The key to the solution is threefold. First, a train-generate-filter-retrain strategy builds a curated high-quality dataset combining synthetic, realistic, and filtered samples to improve diversity and fidelity. Second, a diffusion-based framework disentangles identity from makeup, preserving facial structure and skin tone while applying accurate and diverse cosmetic styles. Third, a text-guided mechanism enables fine-grained, region-specific control, letting users modify eye, lip, or face makeup with natural-language prompts, markedly improving flexibility and controllability.
Link: https://arxiv.org/abs/2602.00729
Authors: Qihe Pan, Yiming Wu, Xing Zhao, Liang Xie, Guodao Sun, Ronghua Liang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted for publication in the proceedings of the 2026 IEEE ICASSP conference
Abstract:Diffusion models have recently shown strong progress in generative tasks, offering a more stable alternative to GAN-based approaches for makeup transfer. Existing methods often suffer from limited datasets, poor disentanglement between identity and makeup features, and weak controllability. To address these issues, we make three contributions. First, we construct a curated high-quality dataset using a train-generate-filter-retrain strategy that combines synthetic, realistic, and filtered samples to improve diversity and fidelity. Second, we design a diffusion-based framework that disentangles identity and makeup features, ensuring facial structure and skin tone are preserved while applying accurate and diverse cosmetic styles. Third, we propose a text-guided mechanism that allows fine-grained and region-specific control, enabling users to modify eyes, lips, or face makeup with natural language prompts. Experiments on benchmarks and real-world scenarios demonstrate improvements in fidelity, identity preservation, and flexibility. Examples of our dataset can be found at: this https URL.
[CV-206] StomataSeg: Semi-Supervised Instance Segmentation for Sorghum Stomatal Components
[Quick Read]: This paper addresses the difficulty of automated segmentation for high-throughput sorghum stomatal phenotyping: stomata are tiny (often under 40 μm in length) and vary in shape across genotypes and leaf surfaces. The key to the solution is a semi-supervised instance segmentation framework for sorghum stomatal components: high-resolution microscopy images are split into overlapping small patches to improve detection of tiny structures, and a pseudo-labelling strategy generates an additional 56,428 pseudo-labelled patches from unannotated images, substantially boosting performance. Experiments show the top semantic mIoU rises from 65.93% to 70.35% and the top instance AP from 28.30% to 46.10%, demonstrating that patch-based preprocessing combined with semi-supervised learning markedly improves fine-structure segmentation and supports scalable AI-driven phenotyping in crop science.
Link: https://arxiv.org/abs/2602.00703
Authors: Zhongtian Huang, Zhi Chen, Zi Huang, Xin Yu, Daniel Smith, Chaitanya Purushothama, Erik Van Oosterom, Alex Wu, William Salter, Yan Li, Scott Chapman
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Sorghum is a globally important cereal grown widely in water-limited and stress-prone regions. Its strong drought tolerance makes it a priority crop for climate-resilient agriculture. Improving water-use efficiency in sorghum requires precise characterisation of stomatal traits, as stomatal control of gas exchange, transpiration, and photosynthesis has a major influence on crop performance. Automated analysis of sorghum stomata is difficult because the stomata are small (often less than 40 μm in length in grasses such as sorghum) and vary in shape across genotypes and leaf surfaces. Automated segmentation contributes to high-throughput stomatal phenotyping, yet current methods still face challenges related to nested small structures and annotation bottlenecks. In this paper, we propose a semi-supervised instance segmentation framework tailored for analysis of sorghum stomatal components. We collect and annotate a sorghum leaf imagery dataset containing 11,060 human-annotated patches, covering the three stomatal components (pore, guard cell and complex area) across multiple genotypes and leaf surfaces. To improve the detection of tiny structures, we split high-resolution microscopy images into overlapping small patches. We then apply a pseudo-labelling strategy to unannotated images, producing an additional 56,428 pseudo-labelled patches. Benchmarking across semantic and instance segmentation models shows substantial performance gains: for semantic models the top mIoU increases from 65.93% to 70.35%, whereas for instance models the top AP rises from 28.30% to 46.10%. These results demonstrate that combining patch-based preprocessing with semi-supervised learning significantly improves the segmentation of fine stomatal structures. The proposed framework supports scalable extraction of stomatal traits and facilitates broader adoption of AI-driven phenotyping in crop science.
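The overlapping-patch preprocessing is straightforward to sketch. The patch size and stride below are illustrative, since the abstract does not state the exact tiling parameters:

```python
# Sketch: tile a high-resolution image with stride < patch size so tiny
# stomata near tile borders appear whole in at least one patch.
import numpy as np

def split_patches(img, patch=512, stride=384):
    """img: (H, W, C); returns list of ((y, x), patch_array). Borders that do
    not fit a full patch are not specially handled in this toy version."""
    h, w = img.shape[:2]
    out = []
    for y in range(0, max(h - patch, 0) + 1, stride):
        for x in range(0, max(w - patch, 0) + 1, stride):
            out.append(((y, x), img[y:y + patch, x:x + patch]))
    return out

tiles = split_patches(np.zeros((2048, 2048, 3), dtype=np.uint8))
print(len(tiles))   # 25 overlapping tiles for a 2048x2048 image
```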
[CV-207] JoyAvatar: Unlocking Highly Expressive Avatars via Harmonized Text-Audio Conditioning
[Quick Read]: This paper addresses the weak alignment of existing video avatar models with complex text instructions, especially prompts involving large full-body movement, dynamic camera trajectories, background transitions, or human-object interactions. The key to the solution is JoyAvatar, which introduces two techniques: a twin-teacher enhanced training algorithm that lets the model transfer the inherent text controllability of the foundation model while simultaneously learning audio-visual synchronization; and dynamic modulation of the strengths of the multi-modal conditions (e.g., audio and text) according to the denoising timestep during training, mitigating conflicts between heterogeneous conditioning signals. These designs substantially expand the model's capacity to generate natural, temporally coherent full-body motion and dynamic camera movement while preserving basic avatar capabilities such as accurate lip-sync and identity consistency.
Link: https://arxiv.org/abs/2602.00702
Authors: Ruikui Wang, Jinheng Feng, Lang Tian, Huaishao Luo, Chaochao Li, Liangbo Zhou, Huan Zhang, Youzheng Wu, Xiaodong He
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Existing video avatar models have demonstrated impressive capabilities in scenarios such as talking, public speaking, and singing. However, the majority of these methods exhibit limited alignment with respect to text instructions, particularly when the prompts involve complex elements including large full-body movement, dynamic camera trajectory, background transitions, or human-object interactions. To break out this limitation, we present JoyAvatar, a framework capable of generating long duration avatar videos, featuring two key technical innovations. Firstly, we introduce a twin-teacher enhanced training algorithm that enables the model to transfer inherent text-controllability from the foundation model while simultaneously learning audio-visual synchronization. Secondly, during training, we dynamically modulate the strength of multi-modal conditions (e.g., audio and text) based on the distinct denoising timestep, aiming to mitigate conflicts between the heterogeneous conditioning signals. These two key designs serve to substantially expand the avatar model’s capacity to generate natural, temporally coherent full-body motions and dynamic camera movements as well as preserve the basic avatar capabilities, such as accurate lip-sync and identity consistency. GSB evaluation results demonstrate that our JoyAvatar model outperforms the state-of-the-art models such as Omnihuman-1.5 and KlingAvatar 2.0. Moreover, our approach enables complex applications including multi-person dialogues and non-human subjects role-playing. Some video samples are provided on this https URL.
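Timestep-dependent condition modulation can be illustrated with a toy schedule. The linear trade-off below is purely an assumption; the paper only states that condition strengths vary with the denoising timestep:

```python
# Sketch: weight text guidance more at high noise (global layout) and audio
# guidance more near the clean sample (lip detail), then concatenate tokens.
import torch

def modulated_condition(text_emb, audio_emb, t, T):
    """t: current denoising step (T = noisiest); embeddings (B, L, D)."""
    w_text = t / T                        # strong text control at high noise
    w_audio = 1.0 - w_text                # strong audio sync near the clean sample
    return torch.cat([w_text * text_emb, w_audio * audio_emb], dim=1)

cond = modulated_condition(torch.randn(2, 77, 768), torch.randn(2, 50, 768), t=800, T=1000)
```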
[CV-208] Cross-Modal Binary Attention: An Energy-Efficient Fusion Framework for Audio-Visual Learning
[Quick Read]: This paper addresses a fundamental tension in multimodal fusion between modeling complex cross-modal dependencies and remaining computationally scalable: attention-based methods capture cross-modal relations well but their quadratic O(N^2) complexity prevents hierarchical, multi-scale architectures, while efficient fusion strategies fall back on simple concatenation and fail to extract complementary cross-modal information. The key to the solution is CMQKA, a cross-modal fusion mechanism that achieves linear O(N) complexity through efficient binary operations, enabling previously infeasible hierarchical fusion: bidirectional cross-modal Query-Key attention extracts complementary spatiotemporal features, and learnable residual fusion preserves modality-specific representations while enriching them with cross-modal information. Built on CMQKA, the SNNergy framework uses event-driven binary spike operations for substantial energy efficiency and sets new state-of-the-art results on the CREMA-D, AVE, and UrbanSound8K-AV audio-visual benchmarks, offering a scalable, energy-efficient fusion paradigm for real-world audio-visual intelligence.
Link: https://arxiv.org/abs/2602.00701
Authors: Mohamed Saleh, Zahra Ahmadi
Affiliations: Unknown
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)
Comments:
Abstract:Effective multimodal fusion requires mechanisms that can capture complex cross-modal dependencies while remaining computationally scalable for real-world deployment. Existing audio-visual fusion approaches face a fundamental trade-off: attention-based methods effectively model cross-modal relationships but incur quadratic computational complexity that prevents hierarchical, multi-scale architectures, while efficient fusion strategies rely on simplistic concatenation that fails to extract complementary cross-modal information. We introduce CMQKA, a novel cross-modal fusion mechanism that achieves linear O(N) complexity through efficient binary operations, enabling scalable hierarchical fusion previously infeasible with conventional attention. CMQKA employs bidirectional cross-modal Query-Key attention to extract complementary spatiotemporal features and uses learnable residual fusion to preserve modality-specific characteristics while enriching representations with cross-modal information. Building upon CMQKA, we present SNNergy, an energy-efficient multimodal fusion framework with a hierarchical architecture that processes inputs through progressively decreasing spatial resolutions and increasing semantic abstraction. This multi-scale fusion capability allows the framework to capture both local patterns and global context across modalities. Implemented with event-driven binary spike operations, SNNergy achieves remarkable energy efficiency while maintaining fusion effectiveness and establishing new state-of-the-art results on challenging audio-visual benchmarks, including CREMA-D, AVE, and UrbanSound8K-AV, significantly outperforming existing multimodal fusion baselines. Our framework advances multimodal fusion by introducing a scalable fusion mechanism that enables hierarchical cross-modal integration with practical energy efficiency for real-world audio-visual intelligence systems.
[CV-209] V2X-DSC: Multi-Agent Collaborative Perception with Distributed Source Coding Guided Communication
[Quick Read]: This paper addresses the bandwidth bottleneck of intermediate-feature sharing in V2X collaborative perception, where dense BEV features easily saturate communication links. The key to the solution is V2X-DSC and its Conditional Codec (DCC), motivated by distributed source coding: since collaborators observe the same physical world and their features are strongly correlated, the sender compresses BEV features into compact codes and the receiver performs conditional reconstruction using its local features as side information, allocating bits to complementary cues rather than redundant content and yielding efficient, low-noise feature fusion. The design achieves state-of-the-art accuracy-bandwidth trade-offs under KB-level communication and generalizes as a plug-and-play communication layer across multiple fusion backbones.
Link: https://arxiv.org/abs/2602.00687
Authors: Yuankun Zeng, Shaohui Li, Zhi Li, Shulan Ruan, Yu Liu, You He
Affiliations: Shenzhen International Graduate School, Tsinghua University; College of Information Science and Electronic Engineering, Zhejiang University; Department of Electronic Engineering, Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Collaborative perception improves 3D understanding by fusing multi-agent observations, yet intermediate-feature sharing faces strict bandwidth constraints as dense BEV features saturate V2X links. We observe that collaborators view the same physical world, making their features strongly correlated; thus receivers only need innovation beyond their local context. Revisiting this from a distributed source coding perspective, we propose V2X-DSC, a framework with a Conditional Codec (DCC) for bandwidth-constrained fusion. The sender compresses BEV features into compact codes, while the receiver performs conditional reconstruction using its local features as side information, allocating bits to complementary cues rather than redundant content. This conditional structure regularizes learning, encouraging incremental representation and yielding lower-noise features. Experiments on DAIR-V2X, OPV2V, and V2X-Real demonstrate state-of-the-art accuracy-bandwidth trade-offs under KB-level communication, and generalizes as a plug-and-play communication layer across multiple fusion backbones.
[CV-210] Video Understanding: Through A Temporal Lens
[Quick Read]: This thesis addresses the central question of how to exploit temporal relations among video elements to advance video understanding, where existing methods struggle to model dynamic content. The key contributions are five-fold: an automatic annotation framework built on large vision-language models with a noise-robust contrastive objective; parameter-efficient fine-tuning with "recurrent adapters" to capture temporal dynamics in low-data regimes; State Space Layers (SSL) for efficient long-form video modeling, together with two new long-term benchmarks for egocentric and feature-length content; a contrastive framework that explicitly models fine-grained relations between motions and video moments; and a systematic empirical study identifying the visual-language interface as the bottleneck for temporal reasoning in LVLMs, leading to a "temporal-oriented recipe" for upscaled video understanding. Collectively, these contributions show that explicit temporal modeling substantially strengthens a model's ability to represent and reason about the fluid nature of video content.
Link: https://arxiv.org/abs/2602.00683
Authors: Thong Thanh Nguyen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: PhD Thesis, NUS, 2025
Abstract:This thesis explores the central question of how to leverage temporal relations among video elements to advance video understanding. Addressing the limitations of existing methods, the work presents a five-fold contribution: (1) an automatic annotation framework that utilizes large vision-language models and a noise-robust contrastive learning objective with a subtractive angular margin; (2) a parameter-efficient fine-tuning strategy using “recurrent adapters” to capture temporal dynamics in low-data regimes; (3) the integration of State Space Layers (SSL) for efficient long-form video modeling, supported by the introduction of two new long-term benchmarks for egocentric and feature-length content; (4) a novel contrastive learning framework designed to explicitly model fine-grained relations between motions and video moments; and (5) a comprehensive empirical study on Large Vision-Language Models (LVLMs) that identifies the visual-language interface as a bottleneck for temporal reasoning, leading to a new “temporal-oriented recipe” for upscaled video understanding. Collectively, these contributions demonstrate that explicit temporal modeling significantly enhances a model’s ability to represent and reason about the fluid nature of video content.
[CV-211] HPC: Hierarchical Point-based Latent Representation for Streaming Dynamic Gaussian Splatting Compression
[Quick Read]: This paper addresses how streaming dynamic Gaussian Splatting can be compressed aggressively while preserving rendering quality. Existing methods drive a neural network with a latent representation to predict inter-frame Gaussian residuals, but structured grid-based latents waste parameters on unoccupied space while unstructured point-based latents underuse local correlations. The key to the solution is HPC: a hierarchical point-based latent representation operating on a per-Gaussian basis avoids modeling empty space; a tailored aggregation scheme makes the latent points highly compact with low spatial redundancy; and the first systematic exploitation of inter-frame correlation among neural-network parameters enables model compression, together forming a fully end-to-end compression pipeline. Experiments show a 67% storage reduction over the baseline while maintaining high reconstruction fidelity.
Link: https://arxiv.org/abs/2602.00671
Authors: Yangzhi Ma, Bojun Liu, Wenting Liao, Dong Liu, Zhu Li, Li Li
Affiliations: University of Science and Technology of China; University of Missouri, Kansas City
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:While dynamic Gaussian Splatting has driven significant advances in free-viewpoint video, maintaining its rendering quality with a small memory footprint for efficient streaming transmission still presents an ongoing challenge. Existing streaming dynamic Gaussian Splatting compression methods typically leverage a latent representation to drive the neural network for predicting Gaussian residuals between frames. Their core latent representations can be categorized into structured grid-based and unstructured point-based paradigms. However, the former incurs significant parameter redundancy by inevitably modeling unoccupied space, while the latter suffers from limited compactness as it fails to exploit local correlations. To relieve these limitations, we propose HPC, a novel streaming dynamic Gaussian Splatting compression framework. It employs a hierarchical point-based latent representation that operates on a per-Gaussian basis to avoid parameter redundancy in unoccupied space. Guided by a tailored aggregation scheme, these latent points achieve high compactness with low spatial redundancy. To improve compression efficiency, we further undertake the first investigation to compress neural networks for streaming dynamic Gaussian Splatting through mining and exploiting the inter-frame correlation of parameters. Combined with latent compression, this forms a fully end-to-end compression framework. Comprehensive experimental evaluations demonstrate that HPC substantially outperforms state-of-the-art methods. It achieves a storage reduction of 67% against its baseline while maintaining high reconstruction fidelity.
[CV-212] Improving Neuropathological Reconstruction Fidelity via AI Slice Imputation
[Quick Read]: This paper addresses the coarse, overly smooth structures that appear in 3D brain volumes reconstructed from 2D dissection photographs under high anisotropy (e.g., thick slabs), which degrade anatomical boundary delineation and morphometric accuracy. The key to the solution is a computationally efficient super-resolution step that imputes slices to produce anatomically consistent isotropic volumes; training on domain-randomized synthetic data lets the method generalize across dissection protocols and remain robust to large slab thicknesses. The imputed volumes yield better automated segmentations, with higher Dice scores particularly in cortical and white matter regions, and improve cortical surface reconstruction and MRI registration accuracy.
Link: https://arxiv.org/abs/2602.00669
Authors: Marina Crespo Aguirre, Jonathan Williams-Ramirez, Dina Zemlyanker, Xiaoling Hu, Lucas J. Deden-Binder, Rogeny Herisse, Mark Montine, Theresa R. Connors, Christopher Mount, Christine L. MacDonald, C. Dirk Keene, Caitlin S. Latimer, Derek H. Oakley, Bradley T. Hyman, Ana Lawry Aguila, Juan Eugenio Iglesias
Affiliations: Martinos Center for Biomedical Imaging, Massachusetts General Hospital and Harvard Medical School, United States; Federal Institute of Technology (ETH), Zurich, Switzerland; BioRepository and Integrated Neuropathology (BRaIN) Laboratory and Precision Neuropathology Core, University of Washington School of Medicine, Seattle, United States; Massachusetts Alzheimer Disease Research Center, MGH and Harvard Medical School, Charlestown, United States; Department of Pathology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, United States; Department of Neurological Surgery, University of Washington School of Medicine, Seattle, United States; Hawkes Institute, University College London, United Kingdom; Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, United States
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
Comments: 12 pages of main content, 5 pages of supplement
Abstract:Neuropathological analyses benefit from spatially precise volumetric reconstructions that enhance anatomical delineation and improve morphometric accuracy. Our prior work has shown the feasibility of reconstructing 3D brain volumes from 2D dissection photographs. However these outputs sometimes exhibit coarse, overly smooth reconstructions of structures, especially under high anisotropy (i.e., reconstructions from thick slabs). Here, we introduce a computationally efficient super-resolution step that imputes slices to generate anatomically consistent isotropic volumes from anisotropic 3D reconstructions of dissection photographs. By training on domain-randomized synthetic data, we ensure that our method generalizes across dissection protocols and remains robust to large slab thicknesses. The imputed volumes yield improved automated segmentations, achieving higher Dice scores, particularly in cortical and white matter regions. Validation on surface reconstruction and atlas registration tasks demonstrates more accurate cortical surfaces and MRI registration. By enhancing the resolution and anatomical fidelity of photograph-based reconstructions, our approach strengthens the bridge between neuropathology and neuroimaging. Our method is publicly available at this https URL
[CV-213] Schrödinger-Inspired Time-Evolution for 4D Deformation Forecasting
[Quick Read]: This paper addresses spatiotemporal forecasting of complex 4D (3D + time) phenomena, a problem important to medical imaging, fluid and material dynamics, and geophysics, where unconstrained neural forecasters accumulate errors and drift over long horizons and lack interpretability. The key to the solution is a Schrödinger-inspired, physics-guided neural architecture that embeds an explicit time-evolution operator within a deep convolutional framework: voxelwise amplitude, phase, and potential fields learned from observed volumetric sequences define a complex wavefunction ψ = A e^{iϕ}, which is evolved in time by a differentiable, unrolled Schrödinger stepper. This yields three advantages: (i) temporal stability from the structured evolution operator, suppressing long-horizon error accumulation; (ii) an interpretable latent representation where phase encodes transport dynamics, amplitude captures structural intensity, and the learned potential governs spatiotemporal interactions; and (iii) natural compatibility with deformation-based synthesis, preserving anatomical fidelity in medical imaging.
Link: https://arxiv.org/abs/2602.00661
Authors: Ahsan Raza Siyal, Markus Haltmeier, Ruth Steiger, Elke Ruth Gizewski, Astrid Ellen Grams
Affiliations: University of Innsbruck; Medical University of Innsbruck
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Spatiotemporal forecasting of complex three-dimensional phenomena (4D: 3D + time) is fundamental to applications in medical imaging, fluid and material dynamics, and geophysics. In contrast to unconstrained neural forecasting models, we propose a Schrödinger-inspired, physics-guided neural architecture that embeds an explicit time-evolution operator within a deep convolutional framework for 4D prediction. From observed volumetric sequences, the model learns voxelwise amplitude, phase, and potential fields that define a complex-valued wavefunction \psi = A e^{i\phi}, which is evolved forward in time using a differentiable, unrolled Schrödinger time stepper. This physics-guided formulation yields several key advantages: (i) temporal stability arising from the structured evolution operator, which mitigates drift and error accumulation in long-horizon forecasting; (ii) an interpretable latent representation, where phase encodes transport dynamics, amplitude captures structural intensity, and the learned potential governs spatiotemporal interactions; and (iii) natural compatibility with deformation-based synthesis, which is critical for preserving anatomical fidelity in medical imaging applications. By integrating physical priors directly into the learning process, the proposed approach combines the expressivity of deep networks with the robustness and interpretability of physics-based modeling. We demonstrate accurate and stable prediction of future 4D states, including volumetric intensities and deformation fields, on synthetic benchmarks that emulate realistic shape deformations and topological changes. To our knowledge, this is the first end-to-end 4D neural forecasting framework to incorporate a Schrödinger-type evolution operator, offering a principled pathway toward interpretable, stable, and anatomically consistent spatiotemporal prediction.
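为便于理解摘要中"可微分的未展开薛定谔时间推进器"这一核心部件,下面给出一个一维分步傅里叶(split-step Fourier)演化的最小 numpy 示意:由振幅 A 与相位 φ 构造 ψ = A e^{iφ},在给定势场 V 下做固定步数的时间演化。网格规模、势场形状与步长均为示意性假设,并非论文的实际实现(论文中 A、φ、V 由网络逐体素学习得到)。

```python
import numpy as np

# 一维网格与玩具势场(参数均为示意性假设)
n, dx, dt, steps = 256, 0.1, 0.01, 200
x = (np.arange(n) - n // 2) * dx
V = 0.5 * x**2                                # 简谐势,充当"学习到的势场"的占位

# 由振幅 A 与相位 phi 构造复值波函数 psi = A * exp(i*phi)
A = np.exp(-x**2)                             # 高斯包络(示意)
phi = 1.5 * x                                 # 线性相位,对应初始"输运"方向
psi = A * np.exp(1j * phi)

# Strang 分步:动能项在频域精确求解,势能项在空间域求解
k = 2 * np.pi * np.fft.fftfreq(n, d=dx)
kinetic = np.exp(-0.5j * k**2 * dt)           # exp(-i k^2 dt / 2)
half_potential = np.exp(-0.5j * V * dt)       # 半步势能

for _ in range(steps):                        # 对应"未展开"的固定步数演化
    psi = half_potential * psi
    psi = np.fft.ifft(kinetic * np.fft.fft(psi))
    psi = half_potential * psi

print("演化后总概率(应近似守恒):", float(np.sum(np.abs(psi)**2) * dx))
```

这种结构化演化算子天然保持范数,这正是摘要所述"抑制长程预测漂移与误差累积"的直观来源。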
zh
[CV-214] Non-Contrastive Vision-Language Learning with Predictive Embedding Alignment
【速读】:该论文旨在解决当前主流视觉-语言模型(Vision-Language Models, VLMs)中基于对比学习的方法(如CLIP)存在的训练复杂性问题,包括对大批次大小的依赖、精心设计的负样本采样策略以及繁琐的超参数调优。其解决方案的核心在于提出NOVA框架——一种非对比性的视觉-语言对齐方法,通过联合嵌入预测(joint embedding prediction)与分布正则化实现高效对齐:具体而言,NOVA利用增强图像视图预测冻结的领域特定文本编码器(如ClinicalBERT)输出的文本嵌入,并引入Sketched Isotropic Gaussian Regularization(SIGReg)强制视觉嵌入服从各向同性高斯分布结构,从而无需负样本采样、动量编码器或停止梯度操作,将训练目标简化为单一超参数,显著提升了训练稳定性与性能表现。
链接: https://arxiv.org/abs/2602.00653
作者: Lukas Kuhn,Giuseppe Serra,Florian Buettner
机构: Goethe University Frankfurt (歌德大学法兰克福); German Cancer Research Center (德国癌症研究中心); German Cancer Consortium (德国癌症联盟)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-language models have transformed multimodal representation learning, yet dominant contrastive approaches like CLIP require large batch sizes, careful negative sampling, and extensive hyperparameter tuning. We introduce NOVA, a NOn-contrastive Vision-language Alignment framework based on joint embedding prediction with distributional regularization. NOVA aligns visual representations to a frozen, domain-specific text encoder by predicting text embeddings from augmented image views, while enforcing an isotropic Gaussian structure via Sketched Isotropic Gaussian Regularization (SIGReg). This eliminates the need for negative sampling, momentum encoders, or stop-gradients, reducing the training objective to a single hyperparameter. We evaluate NOVA on zero-shot chest X-ray classification using ClinicalBERT as the text encoder and Vision Transformers trained from scratch on MIMIC-CXR. On zero-shot classification across three benchmark datasets, NOVA outperforms multiple standard baselines while exhibiting substantially more consistent training runs. Our results demonstrate that non-contrastive vision-language pretraining offers a simpler, more stable, and more effective alternative to contrastive methods.
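下面用 numpy 勾勒该非对比目标的基本形式:用增强图像视图预测的嵌入去回归冻结文本编码器的输出,并附加一个把嵌入协方差推向单位阵的简化正则项。注意:这里以"协方差与单位阵的 Frobenius 距离"作为 SIGReg 的示意性替身(SIGReg 本身基于 sketched 统计检验),函数名与系数均为假设,并非论文的实际实现。

```python
import numpy as np

def nova_style_loss(pred_emb, text_emb, lam=0.1):
    """pred_emb: (B, D) 由增强图像视图预测的嵌入;
    text_emb: (B, D) 冻结文本编码器的输出(视为常量,不回传梯度)。"""
    # 1) 预测项:直接回归冻结文本嵌入(非对比,无需负样本)
    pred_loss = np.mean(np.sum((pred_emb - text_emb) ** 2, axis=1))
    # 2) 分布正则(示意):促使嵌入协方差接近各向同性高斯
    z = pred_emb - pred_emb.mean(axis=0, keepdims=True)
    cov = z.T @ z / max(len(z) - 1, 1)
    iso_reg = np.sum((cov - np.eye(z.shape[1])) ** 2)
    return pred_loss + lam * iso_reg

rng = np.random.default_rng(0)
img_emb, txt_emb = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
print(nova_style_loss(img_emb, txt_emb))
```

由于目标中不含负样本项,该损失对 batch 大小不敏感,这与摘要强调的"无需大批次与负采样"一致。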
zh
[CV-215] A Hybrid Mamba-SAM Architecture for Efficient 3D Medical Image Segmentation
【速读】:该论文旨在解决以 Segment Anything Model(SAM)为代表的基础模型(foundation models)在3D医学图像分割任务中面临的三大挑战:领域偏移(domain shift)、固有的二维设计限制以及微调带来的高计算成本。解决方案的关键在于提出一种高效的混合架构 Mamba-SAM,通过将冻结的 Segment Anything Model(SAM)编码器与基于状态空间模型(State Space Models, SSMs)的 Mamba 架构相结合,充分利用后者在长程建模和线性时间效率上的优势。具体而言,研究设计了两种参数高效适配策略:一是双分支结构,通过交叉注意力机制显式融合 SAM 的通用特征与可训练 VMamba 的领域特定表示;二是基于适配器的方法,在 SAM ViT 编码器中嵌入轻量级、具备3D感知能力的 Tri-Plane Mamba(TPMamba)模块以隐式建模体素上下文,并引入多频门控卷积(Multi-Frequency Gated Convolution, MFGC)增强特征表达,从而在保持高精度的同时显著提升推理速度。
链接: https://arxiv.org/abs/2602.00650
作者: Mohammadreza Gholipour Shahraki,Mehdi Rezaeian,Mohammad Ghasemzadeh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate segmentation of 3D medical images such as MRI and CT is essential for clinical diagnosis and treatment planning. Foundation models like the Segment Anything Model (SAM) provide powerful general-purpose representations but struggle in medical imaging due to domain shift, their inherently 2D design, and the high computational cost of fine-tuning. To address these challenges, we propose Mamba-SAM, a novel and efficient hybrid architecture that combines a frozen SAM encoder with the linear-time efficiency and long-range modeling capabilities of Mamba-based State Space Models (SSMs). We investigate two parameter-efficient adaptation strategies. The first is a dual-branch architecture that explicitly fuses general features from a frozen SAM encoder with domain-specific representations learned by a trainable VMamba encoder using cross-attention. The second is an adapter-based approach that injects lightweight, 3D-aware Tri-Plane Mamba (TPMamba) modules into the frozen SAM ViT encoder to implicitly model volumetric context. Within this framework, we introduce Multi-Frequency Gated Convolution (MFGC), which enhances feature representation by jointly analyzing spatial and frequency-domain information via 3D discrete cosine transforms and adaptive gating. Extensive experiments on the ACDC cardiac MRI dataset demonstrate the effectiveness of the proposed methods. The dual-branch Mamba-SAM-Base model achieves a mean Dice score of 0.906, comparable to UNet++ (0.907), while outperforming all baselines on Myocardium (0.910) and Left Ventricle (0.971) segmentation. The adapter-based TP MFGC variant offers superior inference speed (4.77 FPS) with strong accuracy (0.880 Dice). These results show that hybridizing foundation models with efficient SSM-based architectures provides a practical and effective solution for 3D medical image segmentation.
zh
[CV-216] Diff-PC: Identity-preserving and 3D-aware Controllable Diffusion for Zero-shot Portrait Customization
【速读】:该论文旨在解决现有肖像定制(Portrait Customization, PC)方法在身份(ID)保持和面部控制方面的不足,特别是缺乏高保真度的身份一致性与精确的面部属性调控能力。其解决方案的关键在于提出一种基于扩散模型(Diffusion-based)的零样本肖像定制框架 Diff-PC,通过引入三维人脸预测器(3D face predictor)重建包含参考身份、目标表情和姿态的3D感知面部先验,并设计ID-Encoder融合局部与全局面部特征以捕捉细粒度细节;进一步利用3D人脸引导ID-Ctrl实现身份特征对齐,结合ID-Injector增强身份保真度与面部可控性;同时在自建的以身份为中心的数据集上训练,显著提升面部相似性和文本到图像(text-to-image, T2I)一致性。
链接: https://arxiv.org/abs/2602.00639
作者: Yifang Xu,Benxiang Zhai,Chenyu Zhang,Ming Li,Yang Li,Sidan Du
机构: Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by Information Fusion 2025
Abstract:Portrait customization (PC) has recently garnered significant attention due to its potential applications. However, existing PC methods lack precise identity (ID) preservation and face control. To address these issues, we propose Diff-PC, a diffusion-based framework for zero-shot PC, which generates realistic portraits with high ID fidelity, specified facial attributes, and diverse backgrounds. Specifically, our approach employs the 3D face predictor to reconstruct the 3D-aware facial priors encompassing the reference ID, target expressions, and poses. To capture fine-grained face details, we design ID-Encoder that fuses local and global facial features. Subsequently, we devise ID-Ctrl using the 3D face to guide the alignment of ID features. We further introduce ID-Injector to enhance ID fidelity and facial controllability. Finally, training on our collected ID-centric dataset improves face similarity and text-to-image (T2I) alignment. Extensive experiments demonstrate that Diff-PC surpasses state-of-the-art methods in ID preservation, facial control, and T2I consistency. Furthermore, our method is compatible with multi-style foundation models.
zh
[CV-217] VIZOR: Viewpoint-Invariant Zero-Shot Scene Graph Generation for 3D Scene Reasoning WACV2026
【速读】:该论文旨在解决3D场景理解与推理中现有方法在空间关系建模上的局限性,尤其是因视角变化导致的“左/右”等空间关系不一致问题。传统方法依赖多模态输入(如2D图像、深度图、标注关系)构建场景图,但难以泛化且生成的空间关系具有视角依赖性。解决方案的关键在于提出一种无需训练的端到端框架VIZOR(Viewpoint-Invariant Zero-shot scene graph generation for 3D scene Reasoning),其核心创新是通过以每个物体自身前向朝向为参考系定义空间关系,从而实现视角不变的稠密场景图生成;同时支持开放词汇关系推理,无需标注数据即可捕捉对象间的空间和邻近关系,显著提升了零样本场景图生成及下游任务(如基于查询的对象定位)的准确性。
链接: https://arxiv.org/abs/2602.00637
作者: Vivek Madhavaram,Vartika Sengar,Arkadipta De,Charu Sharma
机构: IIIT Hyderabad (印度信息技术研究所海得拉巴分校); Fujitsu Research India (富士通研究印度)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: WACV 2026, Project page: this https URL
Abstract:Scene understanding and reasoning have long been a fundamental problem in 3D computer vision, requiring models to identify objects, their properties, and spatial or comparative relationships among the objects. Existing approaches enable this by creating scene graphs using multiple inputs such as 2D images, depth maps, object labels, and annotated relationships from a specific reference view. However, these methods often struggle with generalization and produce inaccurate spatial relationships like “left/right”, which become inconsistent across different viewpoints. To address these limitations, we propose Viewpoint-Invariant Zero-shot scene graph generation for 3D scene Reasoning (VIZOR). VIZOR is a training-free, end-to-end framework that constructs dense, viewpoint-invariant 3D scene graphs directly from raw 3D scenes. The generated scene graph is unambiguous, as spatial relationships are defined relative to each object’s front-facing direction, making them consistent regardless of the reference view. Furthermore, it infers open-vocabulary relationships that describe spatial and proximity relationships among scene objects without requiring annotated training data. We conduct extensive quantitative and qualitative evaluations to assess the effectiveness of VIZOR in scene graph generation and downstream tasks, such as query-based object grounding. VIZOR outperforms state-of-the-art methods, showing clear improvements in scene graph generation and achieving 22% and 4.81% gains in zero-shot grounding accuracy on the Replica and Nr3D datasets, respectively.
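"视角不变"的核心做法是把空间关系定义在每个物体自身的前向参考系中,而非相机坐标系。下面的二维简化示意展示了这一思想:先把另一物体的位置变换到锚物体的朝向坐标系,再判断左右/前后。函数名、朝向表示(偏航角)与左右约定均为示意性假设,并非 VIZOR 的实际接口。

```python
import numpy as np

def relation_in_object_frame(anchor_pos, anchor_yaw, other_pos):
    """以 anchor 的前向朝向为参考系判断 other 的方位。
    anchor_yaw: anchor 前向与世界 x 轴的夹角(弧度)。"""
    c, s = np.cos(-anchor_yaw), np.sin(-anchor_yaw)
    R = np.array([[c, -s], [s, c]])               # 世界系 -> 物体系
    local = R @ (np.asarray(other_pos, float) - np.asarray(anchor_pos, float))
    lr = "left" if local[1] > 0 else "right"      # 约定物体系 y>0 为左侧(假设)
    fb = "front" if local[0] > 0 else "behind"
    return lr, fb

# 椅子朝向 +x 方向,桌子位于 (1, 2):无论相机在哪,结论都是"左前方"
print(relation_in_object_frame((0, 0), 0.0, (1, 2)))
```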
zh
[CV-218] S3POT: Contrast-Driven Face Occlusion Segmentation via Self-Supervised Prompt Learning
【速读】:该论文旨在解决现有面部解析(face parsing)方法在处理遮挡区域时易将遮挡误判为面部组件的问题。由于遮挡是一个高层语义概念,而非具体物体类别,构建覆盖所有遮挡类别的真实世界面部数据集几乎不可能,且精确的遮挡掩码标注成本高昂。解决方案的关键在于提出一个对比驱动的框架 S³POT,其核心思想是结合人脸生成与自监督空间提示机制:首先利用结构引导生成参考图像以重建被遮挡区域,其次通过跨注意力机制增强特征并生成初始提示,最后基于增强特征筛选正负样本提示,从而无需依赖遮挡真值掩码即可实现高精度遮挡分割。该方法有效融合了现代人脸生成器的真实重建能力与基础分割模型(如SAM)对提示敏感的特性,显著提升了遮挡区域的识别准确率。
链接: https://arxiv.org/abs/2602.00635
作者: Lingsong Wang,Mancheng Meng,Ziyan Wu,Terrence Chen,Fan Yang,Dinggang Shen
机构: ShanghaiTech University (上海科技大学); United Imaging Intelligence (联影智能)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Existing face parsing methods usually misclassify occlusions as facial components. This is because occlusion is a high-level concept; it does not refer to a concrete category of object. Thus, constructing a real-world face dataset covering all categories of occluding objects is almost impossible, and accurate mask annotation is labor-intensive. To deal with these problems, we present S^3POT, a contrast-driven framework synergizing face generation with self-supervised spatial prompting, to achieve occlusion segmentation. The framework is inspired by two insights: 1) modern face generators’ ability to realistically reconstruct occluded regions, creating an image that preserves facial geometry while eliminating occlusion, and 2) foundation segmentation models’ (e.g., SAM) capacity to extract precise masks when provided with appropriate prompts. In particular, S^3POT consists of three modules: Reference Generation (RF), Feature Enhancement (FE), and Prompt Selection (PS). First, a reference image is produced by RF using structural guidance from the parsed mask. Second, FE performs contrast of tokens between raw and reference images to obtain an initial prompt, then modifies image features with the prompt by cross-attention. Third, based on the enhanced features, PS constructs a set of positive and negative prompts and screens them with a self-attention network for a mask decoder. The network is learned under the guidance of three novel and complementary objective functions without occlusion ground-truth masks involved. Extensive experiments on a dedicatedly collected dataset demonstrate S^3POT’s superior performance and the effectiveness of each module.
zh
[CV-219] FaceSnap: Enhanced ID-fidelity Network for Tuning-free Portrait Customization ICANN2025
【速读】:该论文旨在解决个性化图像生成中现有方法存在的两大问题:一是多数方法依赖耗时的微调过程且泛化能力弱,二是难以在面部细节上实现高保真度。其解决方案的关键在于提出了一种名为FaceSnap的新方法,该方法基于Stable Diffusion(SD)框架,仅需单张参考图像即可在一次推理阶段生成高度一致的个性化肖像。核心创新包括:1)设计了面部属性混合器(Facial Attribute Mixer),融合低层具体特征与高层抽象特征以提供更优的生成引导;2)引入关键点预测器(Landmark Predictor),通过保持不同姿态下的参考身份一致性,为图像生成提供多样且精细的空间控制条件;3)利用身份保持模块将上述信息注入UNet结构,从而实现高效、高质量的定制化肖像生成。
链接: https://arxiv.org/abs/2602.00627
作者: Benxiang Zhai,Yifang Xu,Guofeng Zhang,Yang Li,Sidan Du
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accept by ICANN 2025
Abstract:Benefiting from the significant advancements in text-to-image diffusion models, research in personalized image generation, particularly customized portrait generation, has also made great strides recently. However, existing methods either require time-consuming fine-tuning and lack generalizability or fail to achieve high fidelity in facial details. To address these issues, we propose FaceSnap, a novel method based on Stable Diffusion (SD) that requires only a single reference image and produces extremely consistent results in a single inference stage. This method is plug-and-play and can be easily extended to different SD models. Specifically, we design a new Facial Attribute Mixer that can extract comprehensive fused information from both low-level specific features and high-level abstract features, providing better guidance for image generation. We also introduce a Landmark Predictor that maintains reference identity across landmarks with different poses, providing diverse yet detailed spatial control conditions for image generation. Then we use an ID-preserving module to inject these into the UNet. Experimental results demonstrate that our approach performs remarkably in personalized and customized portrait generation, surpassing other state-of-the-art methods in this domain.
zh
[CV-220] Towards Interpretable Hallucination Analysis and Mitigation in LVLMs via Contrastive Neuron Steering
【速读】:该论文旨在解决大视觉语言模型(Large Vision-Language Models, LVLMs)中存在的幻觉问题,即模型在多模态理解与生成过程中因内部机制不透明而产生的错误或不合理输出。现有方法主要集中在输出层面的调整,未能深入解析导致幻觉的内在神经机制。其解决方案的关键在于引入稀疏自编码器(Sparse Autoencoders, SAEs)对视觉嵌入进行分解,从而识别出两类关键神经元:始终激活的“always-on neurons”和图像特异性的“image-specific neurons”。研究发现,幻觉通常由图像特异性神经元的异常激活或干扰引起,而始终激活神经元则保持稳定。基于此,作者提出对比神经引导(Contrastive Neuron Steering, CNS),通过对比干净输入与噪声输入之间的差异来定位图像特异性神经元,并选择性增强信息丰富的神经元、抑制扰动引发的虚假激活,从而在预填充阶段实现对LVLM输出的可控干预,显著提升视觉定位准确性并减少幻觉,同时兼容现有的解码阶段方法。
链接: https://arxiv.org/abs/2602.00621
作者: Guangtao Lyu,Xinyi Cheng,Qi Liu,Chenghao Xu,Jiexi Yan,Muli Yang,Fen Fang,Cheng Deng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:LVLMs achieve remarkable multimodal understanding and generation but remain susceptible to hallucinations. Existing mitigation methods predominantly focus on output-level adjustments, leaving the internal mechanisms that give rise to these hallucinations largely unexplored. To gain a deeper understanding, we adopt a representation-level perspective by introducing sparse autoencoders (SAEs) to decompose dense visual embeddings into sparse, interpretable neurons. Through neuron-level analysis, we identify distinct neuron types, including always-on neurons and image-specific neurons. Our findings reveal that hallucinations often result from disruptions or spurious activations of image-specific neurons, while always-on neurons remain largely stable. Moreover, selectively enhancing or suppressing image-specific neurons enables controllable intervention in LVLM outputs, improving visual grounding and reducing hallucinations. Building on these insights, we propose Contrastive Neuron Steering (CNS), which identifies image-specific neurons via contrastive analysis between clean and noisy inputs. CNS selectively amplifies informative neurons while suppressing perturbation-induced activations, producing more robust and semantically grounded visual representations. This not only enhances visual understanding but also effectively mitigates hallucinations. By operating at the prefilling stage, CNS is fully compatible with existing decoding-stage methods. Extensive experiments on both hallucination-focused and general multimodal benchmarks demonstrate that CNS consistently reduces hallucinations while preserving overall multimodal understanding.
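下面以 numpy 勾勒 CNS 的核心流程:对干净输入与加噪输入的 SAE 稀疏激活做对比,据此定位"图像特异性神经元",再放大其中信息丰富的激活、抑制扰动诱发的虚假激活,最后解码回稠密嵌入。SAE 权重为随机矩阵、分位数阈值与缩放系数均为示意性假设,并非论文的实际设置。

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 64, 256                              # 稠密维度 / SAE 神经元数(假设)
W_enc = rng.normal(size=(D, K)) / 8
W_dec = rng.normal(size=(K, D)) / 8

def sae_code(x):                            # 极简 SAE 编码:ReLU 稀疏激活
    return np.maximum(x @ W_enc, 0.0)

x_clean = rng.normal(size=D)                # 干净图像的视觉嵌入(示意)
x_noisy = x_clean + 0.5 * rng.normal(size=D)

a_clean, a_noisy = sae_code(x_clean), sae_code(x_noisy)
contrast = a_clean - a_noisy  # 大正值:被扰动破坏的信息性激活;大负值:扰动诱发的虚假激活

steer = np.ones(K)
steer[contrast > np.quantile(contrast, 0.9)] = 1.5   # 增强图像特异性神经元
steer[contrast < np.quantile(contrast, 0.1)] = 0.0   # 抑制扰动诱发激活

x_steered = (a_clean * steer) @ W_dec       # 解码回稠密视觉嵌入,供预填充阶段使用
print(x_steered.shape)
```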
zh
[CV-221] Tune-Your-Style: Intensity-tunable 3D Style Transfer with Gaussian Splatting ICCV2025
【速读】:该论文旨在解决3D风格迁移(3D style transfer)中内容与风格平衡难以灵活调控的问题,现有方法多采用固定输出范式,无法满足用户对不同内容-风格比例的需求。其解决方案的关键在于提出一种可调节创作强度的3D风格迁移框架——Tune-Your-Style,通过引入高斯神经元(Gaussian neurons)显式建模风格强度,并设计可学习的风格调节器(style tuner)实现风格注入强度的可控调整;同时提出可调风格引导机制(tunable stylization guidance),利用扩散模型生成多视角一致的风格化图像,并通过两阶段优化策略动态平衡全风格引导与零风格引导之间的权重,从而在保持渲染效率的同时显著提升风格迁移的定制化能力。
链接: https://arxiv.org/abs/2602.00618
作者: Yian Zhao,Rushi Ye,Ruochong Zheng,Zesen Cheng,Chaoran Feng,Jiashu Yang,Pengchong Qiao,Chang Liu,Jie Chen
机构: Peking University(北京大学); Pengcheng Laboratory(鹏城实验室); Tsinghua University(清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025
Abstract:3D style transfer refers to the artistic stylization of 3D assets based on reference style images. Recently, 3DGS-based stylization methods have drawn considerable attention, primarily due to their markedly enhanced training and rendering speeds. However, a vital challenge for 3D style transfer is to strike a balance between the content and the patterns and colors of the style. Although the existing methods strive to achieve relatively balanced outcomes, the fixed-output paradigm struggles to adapt to the diverse content-style balance requirements from different users. In this work, we introduce a creative intensity-tunable 3D style transfer paradigm, dubbed \textbfTune-Your-Style, which allows users to flexibly adjust the style intensity injected into the scene to match their desired content-style balance, thus enhancing the customizability of 3D style transfer. To achieve this goal, we first introduce Gaussian neurons to explicitly model the style intensity and parameterize a learnable style tuner to achieve intensity-tunable style injection. To facilitate the learning of tunable stylization, we further propose the tunable stylization guidance, which obtains multi-view consistent stylized views from diffusion models through cross-view style alignment, and then employs a two-stage optimization strategy to provide stable and efficient guidance by modulating the balance between full-style guidance from the stylized views and zero-style guidance from the initial rendering. Extensive experiments demonstrate that our method not only delivers visually appealing results, but also exhibits flexible customizability for 3D style transfer. Project page is available at this https URL.
zh
[CV-222] From Pixels to Facts (Pix2Fact): Benchmarking Multi-Hop Reasoning for Fine-Grained Visual Fact Checking
【速读】:该论文旨在解决当前视觉语言模型(Visual Language Models, VLMs)在同时要求精细视觉定位与基于知识的多跳推理任务中表现不足的问题,现有基准测试未能有效评估这两种能力的协同作用。其解决方案的关键在于提出Pix2Fact——一个全新的视觉问答(Visual Question Answering, VQA)基准,包含1,000张高分辨率(4K+)图像,覆盖8种日常生活场景,并由顶尖高校博士级标注者与专业数据标注公司协作设计问题与答案,确保每个问题均需结合详细视觉定位、多跳推理和外部知识整合才能解答。这一设计显著提升了任务难度,实验证明当前最先进的VLMs平均准确率仅为24.0%,远低于人类水平(56%),从而凸显了模型在实现类人级视觉理解方面的局限性,为下一代融合细粒度感知与强知识推理能力的多模态智能体的发展提供了关键评测标准。
链接: https://arxiv.org/abs/2602.00593
作者: Yifan Jiang,Cong Zhang,Bofei Zhang,Yifan Yang,Bingzhang Wang,Yew-Soon Ong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Despite progress on general tasks, VLMs struggle with challenges demanding both detailed visual grounding and deliberate knowledge-based reasoning, a synergy not captured by existing benchmarks that evaluate these skills separately. To close this gap, we introduce Pix2Fact, a new visual question-answering benchmark designed to evaluate expert-level perception and knowledge-intensive multi-hop reasoning. Pix2Fact contains 1,000 high-resolution (4K+) images spanning 8 daily-life scenarios and situations, with questions and answers meticulously crafted by annotators holding PhDs from top global universities working in partnership with a professional data annotation firm. Each question requires detailed visual grounding, multi-hop reasoning, and the integration of external knowledge to answer. Our evaluation of 9 state-of-the-art VLMs, including proprietary models like Gemini-3-Pro and GPT-5, reveals the substantial challenge posed by Pix2Fact: the most advanced model achieves only 24.0% average accuracy, in stark contrast to human performance of 56%. This significant gap underscores the limitations of current models in replicating human-level visual comprehension. We believe Pix2Fact will serve as a critical benchmark to drive the development of next-generation multimodal agents that combine fine-grained perception with robust, knowledge-based reasoning.
zh
[CV-223] MAUGen: A Unified Diffusion Approach for Multi-Identity Facial Expression and AU Label Generation
【速读】:该论文旨在解决当前生成式面部表情识别系统中缺乏大规模、人口统计学多样化的带精确动作单元(Action Unit, AU)发生与强度标注的面部图像这一根本瓶颈问题。其解决方案的关键在于提出MAUGen框架,这是一个基于扩散模型的多模态生成方法,能够通过单一文本提示联合生成逼真的人脸表情图像与解剖学一致的AU标签(包括发生与否和强度),核心创新包含两个模块:一是多模态表征学习(Multi-modal Representation Learning, MRL)模块,在统一潜在空间中建模文本描述、人脸身份、表情图像与AU激活之间的关系;二是基于扩散的图像标签生成器(Diffusion-based Image label Generator, DIG),将联合表征解码为跨不同身份的对齐人脸图像-标签对。该方法成功构建了MIFA数据集,显著提升了合成图像的真实感与AU标注语义一致性。
链接: https://arxiv.org/abs/2602.00583
作者: Xiangdong Li,Ye Lou,Ao Gao,Wei Zhang,Siyang Song
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The lack of large-scale, demographically diverse face images with precise Action Unit (AU) occurrence and intensity annotations has long been recognized as a fundamental bottleneck in developing generalizable AU recognition systems. In this paper, we propose MAUGen, a diffusion-based multi-modal framework that jointly generates a large collection of photorealistic facial expressions and anatomically consistent AU labels, including both occurrence and intensity, conditioned on a single descriptive text prompt. Our MAUGen involves two key modules: (1) a Multi-modal Representation Learning (MRL) module that captures the relationships among the paired textual description, facial identity, expression image, and AU activations within a unified latent space; and (2) a Diffusion-based Image label Generator (DIG) that decodes the joint representation into aligned facial image-label pairs across diverse identities. Under this framework, we introduce Multi-Identity Facial Action (MIFA), a large-scale multimodal synthetic dataset featuring comprehensive AU annotations and identity variations. Extensive experiments demonstrate that MAUGen outperforms existing methods in synthesizing photorealistic, demographically diverse facial images along with semantically aligned AU labels.
zh
[CV-224] Bridging Degradation Discrimination and Generation for Universal Image Restoration ICLR2026
【速读】:该论文旨在解决通用图像复原(Universal Image Restoration)任务中的核心挑战,即如何从低质量图像中去除多种退化类型并恢复出细节丰富的高质量图像,同时需在退化类型和程度的判别与生成过程之间建立有效关联。解决方案的关键在于提出一种名为“退化鉴别与生成桥梁”(Bridging Degradation discrimination and Generation, BDG)的新框架:首先设计多角度、多尺度灰度共生矩阵(Multi-Angle and multi-Scale Gray Level Co-occurrence Matrix, MAS-GLCM),实现对退化类型和强度的细粒度判别;随后将扩散训练过程划分为生成、桥接与复原三个阶段,在不改变模型结构的前提下,将MAS-GLCM提供的判别信息嵌入扩散过程中,从而在保持纹理恢复能力的同时提升对多任务、多退化场景的适应性,显著增强复原保真度且不牺牲感知质量。
链接: https://arxiv.org/abs/2602.00579
作者: JiaKui Hu,Zhengjian Yao,Lujia Jin,Yanye Lu
机构: Institute of Medical Technology, Peking University (北京大学医学技术研究院); Biomedical Engineering Department, College of Future Technology, Peking University (北京大学未来技术学院生物医学工程系); National Biomedical Imaging Center, Peking University (北京大学生物医学成像中心); China Mobile Research Institute (中国移动研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICLR 2026
Abstract:Universal image restoration is a critical task in low-level vision, requiring the model to remove various degradations from low-quality images to produce clean images with rich detail. The challenges lie in sampling the distribution of high-quality images and adjusting the outputs on the basis of the degradation. This paper presents a novel approach, Bridging Degradation discrimination and Generation (BDG), which aims to address these challenges concurrently. First, we propose the Multi-Angle and multi-Scale Gray Level Co-occurrence Matrix (MAS-GLCM) and demonstrate its effectiveness in performing fine-grained discrimination of degradation types and levels. Subsequently, we divide the diffusion training process into three distinct stages: generation, bridging, and restoration. The objective is to preserve the diffusion model’s capability of restoring rich textures while simultaneously integrating the discriminative information from the MAS-GLCM into the restoration process. This enhances its proficiency in addressing multi-task and multi-degraded scenarios. Without changing the architecture, BDG achieves significant performance gains in all-in-one restoration and real-world super-resolution tasks, primarily evidenced by substantial improvements in fidelity without compromising perceptual quality. The code and pretrained models are provided in this https URL.
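灰度共生矩阵(GLCM)统计"相距一定偏移的像素灰度对"的联合频率,是论文判别退化类型与强度的基础。下面给出多角度、多尺度 GLCM 的最小 numpy 实现示意;量化级数、距离与角度集合均为示意性假设,MAS-GLCM 的完整设计以原文为准。

```python
import numpy as np

def glcm(img, dy, dx, levels=8):
    """计算单一偏移 (dy, dx) 下的归一化灰度共生矩阵。"""
    q = (img.astype(float) / img.max() * (levels - 1)).astype(int)
    H, W = q.shape
    ys, ye = max(0, -dy), min(H, H - dy)
    xs, xe = max(0, -dx), min(W, W - dx)
    a = q[ys:ye, xs:xe]                       # 每个像素
    b = q[ys + dy:ye + dy, xs + dx:xe + dx]   # 及其偏移 (dy, dx) 处的邻居
    M = np.zeros((levels, levels))
    np.add.at(M, (a.ravel(), b.ravel()), 1)   # 累计灰度对 (a, b) 的出现次数
    return M / M.sum()

img = np.random.default_rng(0).integers(0, 256, size=(64, 64))
scales = [1, 2, 4]                            # 多尺度:不同像素距离
angles = [(0, 1), (-1, 1), (-1, 0), (-1, -1)] # 多角度:0/45/90/135 度
feats = [glcm(img, d * dy, d * dx) for d in scales for dy, dx in angles]
print(len(feats), feats[0].shape)             # 12 个 (8, 8) 共生矩阵
```

不同退化(模糊、噪声、压缩)会以不同方式改变这些共生统计,这正是其可用于细粒度退化判别的直观原因。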
zh
[CV-225] When Classes Evolve: A Benchmark and Framework for Stage-Aware Class-Incremental Learning
【速读】:该论文旨在解决传统类增量学习(Class-Incremental Learning, CIL)方法中忽略类内形态演化(intra-class evolution)的问题,即模型在学习新类别时未能有效适应同一语义类别内部随时间发生的显著形态变化(如幼虫到蝴蝶的转变),从而导致对类内动态变化的适应能力不足。解决方案的关键在于提出Stage-Aware CIL(Stage-CIL)范式,将每个类的学习划分为多个具有不同形态特征的阶段,并设计STAGE方法,通过固定大小的记忆池显式学习抽象且可迁移的类内演化模式,实现语义身份与形态变换动态的解耦,从而在保持跨类判别能力的同时,准确预测并适应类内未来形态变化。
链接: https://arxiv.org/abs/2602.00573
作者: Zheng Zhang,Tao Hu,Xueheng Li,Yang Wang,Rui Li,Jie Zhang,Chengjun Xie
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Class-Incremental Learning (CIL) aims to sequentially learn new classes while mitigating catastrophic forgetting of previously learned knowledge. Conventional CIL approaches implicitly assume that classes are morphologically static, focusing primarily on preserving previously learned representations as new classes are introduced. However, this assumption neglects intra-class evolution: a phenomenon wherein instances of the same semantic class undergo significant morphological transformations, such as a larva turning into a butterfly. Consequently, a model must both discriminate between classes and adapt to evolving appearances within a single class. To systematically address this challenge, we formalize Stage-Aware CIL (Stage-CIL), a paradigm in which each class is learned progressively through distinct morphological stages. To facilitate rigorous evaluation within this paradigm, we introduce Stage-Bench, a 10-domain, two-stage dataset and protocol that jointly measure inter- and intra-class forgetting. We further propose STAGE, a novel method that explicitly learns abstract and transferable evolution patterns within a fixed-size memory pool. By decoupling semantic identity from transformation dynamics, STAGE enables accurate prediction of future morphologies based on earlier representations. Extensive empirical evaluation demonstrates that STAGE consistently and substantially outperforms existing state-of-the-art approaches, highlighting its effectiveness in simultaneously addressing inter-class discrimination and intra-class morphological adaptation.
zh
[CV-226] GLAD: Generative Language-Assisted Visual Tracking for Low-Semantic Templates
【速读】:该论文旨在解决视觉-语言跟踪(Vision-Language Tracking)中因低语义图像(如模糊、低分辨率等)导致的跨模态理解能力下降问题,从而影响模型性能。现有方法多采用Transformer架构融合模板图像、搜索区域与文本特征,但受限于图文特征间的差距,直接拼接或融合效果有限。其解决方案的关键在于提出一种开创性的生成式语言辅助跟踪模型GLAD(Generative Language-Assisted Tracking),利用扩散模型(Diffusion Models)实现文本描述与模板图像的生成式多模态融合,通过重建低语义模板图像以增强其语义信息,并提升语言与图像之间的兼容性。该方法在多个基准测试上达到新的最优性能,同时具备优异的推理速度。
链接: https://arxiv.org/abs/2602.00570
作者: Xingyu Luo,Yidong Cai,Jie Liu,Jie Tang,Gangshan Wu,Limin Wang
机构: Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-language tracking has gained increasing attention in many scenarios. This task simultaneously deals with visual and linguistic information to localize objects in videos. Despite its growing utility, the development of vision-language tracking methods remains in its early stage. Current vision-language trackers usually employ Transformer architectures for interactive integration of template, search, and text features. However, persistent challenges posed by low-semantic images, such as prevalent image blurriness and low resolution, may compromise model performance through degraded cross-modal understanding. To solve this problem, language assistance is usually used to deal with the obstacles posed by low-semantic images. However, due to the existing gap between current textual and visual features, direct concatenation and fusion of these features may have limited effectiveness. To address these challenges, we introduce a pioneering Generative Language-AssisteD tracking model, GLAD, which utilizes diffusion models for the generative multi-modal fusion of text description and template image to bolster compatibility between language and image and enhance template image semantic information. Our approach demonstrates notable improvements over the existing fusion paradigms. Blurry and semantically ambiguous template images can be restored to improve multi-modal features in the generative fusion paradigm. Experiments show that our method establishes a new state-of-the-art on multiple benchmarks and achieves an impressive inference speed. The code and models will be released at: this https URL
zh
[CV-227] Learning to Decode Against Compositional Hallucination in Video Multimodal Large Language Models
【速读】:该论文旨在解决视频多模态大语言模型(VLLM)中广泛存在的组合性幻觉(compositional hallucination)问题,即模型在处理多个相互作用的空间和时间因素时因错误推理导致的系统性误判,而现有研究主要聚焦于孤立型幻觉,忽视了此类复杂场景。解决方案的关键在于提出TriCD框架,其核心是三路径校准机制:一是自适应扰动控制器动态选择干扰操作以生成负样本视频变体;二是显著性引导增强模块自适应强化基于视觉证据的token级锚定信息;二者通过强化学习联合优化,从而提升模型在组合幻觉情境下的精准决策能力。实验表明,该方法在两种代表性骨干模型上均实现平均超过10%的准确率提升。
链接: https://arxiv.org/abs/2602.00559
作者: Wenbin Xing,Quanxing Zha,Lizheng Zu,Mengran Li,Ming Li,Junchi Yan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Current research on video hallucination mitigation primarily focuses on isolated error types, leaving compositional hallucinations, arising from incorrect reasoning over multiple interacting spatial and temporal factors, largely underexplored. We introduce OmniVCHall, a benchmark designed to systematically evaluate both isolated and compositional hallucinations in video multimodal large language models (VLLMs). OmniVCHall spans diverse video domains, introduces a novel camera-based hallucination type, and defines a fine-grained taxonomy, together with adversarial answer options (e.g., “All are correct” and “None of the above”) to prevent shortcut reasoning. The evaluations of 39 representative VLLMs reveal that even advanced models (e.g., Qwen3-VL and GPT-5) exhibit substantial performance degradation. We propose TriCD, a contrastive decoding framework with a triple-pathway calibration mechanism. An adaptive perturbation controller dynamically selects distracting operations to construct negative video variants, while a saliency-guided enhancement module adaptively reinforces grounded token-wise visual evidence. These components are optimized via reinforcement learning to encourage precise decision-making under compositional hallucination settings. Experimental results show that TriCD consistently improves performance across two representative backbones, achieving an average accuracy improvement of over 10%. The data and code can be found at this https URL.
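TriCD 属于对比解码一族,其通用骨架是:用"干净输入的 logits"减去按系数缩放的"扰动输入的 logits",从而放大真实视觉证据与扰动伪证据之间的差异。下面给出这一通用机制的最小示意;系数 alpha 与合理性约束阈值 tau 均为假设,TriCD 实际采用三路径校准并由强化学习优化扰动选择,细节以原文为准。

```python
import numpy as np

def contrastive_decode(logits_clean, logits_perturbed, alpha=1.0, tau=0.1):
    """经典对比解码打分:(1+alpha)*clean - alpha*perturbed,
    并只在干净分布的高置信候选上比较(可行性剪枝)。"""
    p = np.exp(logits_clean - logits_clean.max())
    p /= p.sum()
    scores = (1 + alpha) * logits_clean - alpha * logits_perturbed
    scores[p < tau * p.max()] = -np.inf       # 过滤干净分布下的低置信 token
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
logits_clean = rng.normal(size=32)
# 构造负样本视频变体(如打乱帧序/遮挡关键区域)后得到的 logits(示意)
logits_perturbed = logits_clean + rng.normal(scale=0.5, size=32)
print(contrastive_decode(logits_clean, logits_perturbed))
```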
zh
[CV-228] APEX: A Decoupled Memory-based Explorer for Asynchronous Aerial Object Goal Navigation
【速读】:该论文旨在解决无人机(Unmanned Aerial Vehicle, UAV)在复杂空域环境中进行目标导向导航时面临的三大挑战:一是难以记忆和建模复杂的三维空间结构;二是缺乏可靠且可解释的动作决策机制;三是探索与信息获取效率低下。为此,作者提出了一种名为APEX(Aerial Parallel Explorer)的分层异步并行智能体架构,其核心创新在于三个模块的协同设计:1)动态时空语义映射记忆模块,利用视觉-语言模型(Vision-Language Model, VLM)的零样本能力构建高分辨率吸引力、探索区与障碍物地图,形成可解释的记忆机制;2)动作决策模块,通过强化学习训练获得细粒度且鲁棒的控制策略;3)目标定位模块,采用开放词汇检测器实现通用的目标识别。整体系统基于分层、异步和并行框架运行,有效规避了VLM推理延迟,提升了代理的主动性与探索效率。实验表明,APEX在UAV-ON基准测试中相较此前最优方法在成功率(Success Rate, SR)上提升4.2%,路径长度归一化得分(SPL)提升2.8%。
链接: https://arxiv.org/abs/2602.00551
作者: Daoxuan Zhang,Ping Chen,Xiaobo Xia,Xiu Su,Ruichen Zhen,Jianqiang Xiao,Shuo Yang
机构: Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳校区); National University of Singapore(新加坡国立大学); Central South University(中南大学); Meituan Academy of Robotics Shenzhen(美团机器人研究院深圳分院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 8 figures
Abstract:Aerial Object Goal Navigation, a challenging frontier in Embodied AI, requires an Unmanned Aerial Vehicle (UAV) agent to autonomously explore, reason, and identify a specific target using only visual perception and language description. However, existing methods struggle with the memorization of complex spatial representations in aerial environments, reliable and interpretable action decision-making, and inefficient exploration and information gathering. To address these challenges, we introduce APEX (Aerial Parallel Explorer), a novel hierarchical agent designed for efficient exploration and target acquisition in complex aerial settings. APEX is built upon a modular, three-part architecture: 1) Dynamic Spatio-Semantic Mapping Memory, which leverages the zero-shot capability of a Vision-Language Model (VLM) to dynamically construct high-resolution 3D Attraction, Exploration, and Obstacle maps, serving as an interpretable memory mechanism. 2) Action Decision Module, trained with reinforcement learning, which translates this rich spatial understanding into a fine-grained and robust control policy. 3) Target Grounding Module, which employs an open-vocabulary detector to achieve definitive and generalizable target identification. All these components are integrated into a hierarchical, asynchronous, and parallel framework, effectively bypassing the VLM’s inference latency and boosting the agent’s proactivity in exploration. Extensive experiments show that APEX outperforms the previous state of the art by +4.2% SR and +2.8% SPL on challenging UAV-ON benchmarks, demonstrating its superior efficiency and the effectiveness of its hierarchical asynchronous design. Our source code is provided on GitHub at this https URL
zh
[CV-229] NPNet: A Non-Parametric Network with Adaptive Gaussian-Fourier Positional Encoding for 3D Classification and Segmentation
【速读】:该论文旨在解决3D点云分类与部分分割任务中传统参数化方法依赖大量可学习权重、泛化能力受限的问题,尤其是面对不同尺度和采样密度变化时的稳定性不足。其解决方案的关键在于提出NPNet——一种完全非参数化的框架,通过确定性操作(如最远点采样、k近邻和池化)构建点特征,并引入自适应高斯-傅里叶位置编码(adaptive Gaussian-Fourier positional encoding),该编码的带宽和高斯-余弦混合比例由输入几何结构动态选择,从而提升跨尺度和不同采样密度下的鲁棒性;此外,在分割任务中还结合固定频率傅里叶特征以提供全局上下文信息,显著提升了少样本场景下的性能表现。
链接: https://arxiv.org/abs/2602.00542
作者: Mohammad Saeid,Amir Salarpour,Pedram MohajerAnsari,Mert D. Pesé
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to the 2026 IEEE Intelligent Vehicles Symposium (IV 2026)
Abstract:We present NPNet, a fully non-parametric approach for 3D point-cloud classification and part segmentation. NPNet contains no learned weights; instead, it builds point features using deterministic operators such as farthest point sampling, k-nearest neighbors, and pooling. Our key idea is an adaptive Gaussian-Fourier positional encoding whose bandwidth and Gaussian-cosine mixing are chosen from the input geometry, helping the method remain stable across different scales and sampling densities. For segmentation, we additionally incorporate fixed-frequency Fourier features to provide global context alongside the adaptive encoding. Across ModelNet40/ModelNet-R, ScanObjectNN, and ShapeNetPart, NPNet achieves strong performance among non-parametric baselines, and it is particularly effective in few-shot settings on ModelNet40. NPNet also offers favorable memory use and inference time compared to prior non-parametric methods.
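下面用 numpy 勾勒"自适应高斯-傅里叶位置编码"的一种可能实现:带宽由点云最近邻距离(采样密度的代理)自适应确定,傅里叶特征与高斯 RBF 特征再按几何驱动的比例混合。具体的带宽规则、混合函数与频率数目均为示意性假设,并非 NPNet 的实际公式。

```python
import numpy as np

def adaptive_gf_encoding(pts, num_freq=8, centers=16, seed=0):
    """pts: (N, 3) 点云。返回 (N, 2*3*num_freq + centers) 的位置编码。"""
    rng = np.random.default_rng(seed)
    # 由几何选带宽:最近邻距离中位数反映采样密度(越密 -> 越高频)
    d2 = np.sum((pts[:, None] - pts[None]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)
    nn = np.sqrt(d2.min(axis=1))
    sigma = np.median(nn) + 1e-8
    # 傅里叶分支:频率随带宽 1/sigma 缩放
    freqs = (1.0 / sigma) * 2.0 ** np.arange(num_freq)
    ang = pts[..., None] * freqs                              # (N, 3, F)
    fourier = np.concatenate([np.sin(ang), np.cos(ang)], -1).reshape(len(pts), -1)
    # 高斯分支:随机中心的 RBF 响应
    C = pts[rng.choice(len(pts), size=centers, replace=False)]
    gauss = np.exp(-np.sum((pts[:, None] - C[None]) ** 2, -1) / (2 * sigma**2))
    # 高斯-余弦混合比同样由几何量决定(示意规则)
    mix = np.clip(sigma / (sigma + nn.mean()), 0.1, 0.9)
    return np.concatenate([(1 - mix) * fourier, mix * gauss], axis=1)

pts = np.random.default_rng(1).normal(size=(128, 3))
print(adaptive_gf_encoding(pts).shape)        # (128, 64)
```

整个过程不含任何可学习权重,与论文"完全非参数化"的设定一致。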
zh
[CV-230] SADER: Structure-Aware Diffusion Framework with DEterministic Resampling for Multi-Temporal Remote Sensing Cloud Removal
【速读】:该论文旨在解决多时相遥感影像中云污染严重降低数据可用性的问题,尤其针对现有基于扩散模型的去云方法在采样效率有限及未能充分挖掘时空先验信息方面的不足。其解决方案的关键在于提出一种结构感知的扩散框架SADER,核心创新包括:1)设计可扩展的多时相条件扩散网络(MTCDN),通过时序融合与混合注意力机制充分建模多时相和多模态相关性;2)引入云感知注意力损失函数,考虑云层厚度与亮度差异以增强对云主导区域的关注;3)采用确定性重采样策略,在固定采样步数下通过引导修正替换异常样本,实现连续扩散模型的迭代精炼。
链接: https://arxiv.org/abs/2602.00536
作者: Yifan Zhang,Qian Chen,Yi Liu,Wengen Li,Jihong Guan
机构: University of Michigan (密歇根大学); Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Cloud contamination severely degrades the usability of remote sensing imagery and poses a fundamental challenge for downstream Earth observation tasks. Recently, diffusion-based models have emerged as a dominant paradigm for remote sensing cloud removal due to their strong generative capability and stable optimization. However, existing diffusion-based approaches often suffer from limited sampling efficiency and insufficient exploitation of structural and temporal priors in multi-temporal remote sensing scenarios. In this work, we propose SADER, a structure-aware diffusion framework for multi-temporal remote sensing cloud removal. SADER first develops a scalable Multi-Temporal Conditional Diffusion Network (MTCDN) to fully capture multi-temporal and multimodal correlations via temporal fusion and hybrid attention. Then, a cloud-aware attention loss is introduced to emphasize cloud-dominated regions by accounting for cloud thickness and brightness discrepancies. In addition, a deterministic resampling strategy is designed for continuous diffusion models to iteratively refine samples under fixed sampling steps by replacing outliers through guided correction. Extensive experiments on multiple multi-temporal datasets demonstrate that SADER consistently outperforms state-of-the-art cloud removal methods across all evaluation metrics. The code of SADER is publicly available at this https URL.
zh
[CV-231] Enhancing Open-Vocabulary Object Detection through Multi-Level Fine-Grained Visual-Language Alignment
【速读】:该论文旨在解决传统目标检测系统在动态环境中因局限于预定义类别而应用受限的问题,提出开放词汇目标检测(Open-Vocabulary Object Detection, OVD)方法以实现对训练集中未见类别的识别。其解决方案的关键在于设计了视觉-语言检测框架(Visual-Language Detection, VLDet),通过引入VL-PUB模块重构特征金字塔结构,实现细粒度的视觉-语言对齐,并利用SigRPN模块结合sigmoid-based锚框-文本对比损失来增强新类别检测能力,从而显著提升OVD性能,在COCO2017和LVIS数据集上分别取得58.7 AP和24.8 AP,优于现有最先进方法。
链接: https://arxiv.org/abs/2602.00531
作者: Tianyi Zhang,Antoine Simoulin,Kai Li,Sana Lakdawala,Shiqing Yu,Arpit Mittal,Hongyu Fu,Yu Lin
机构: Meta
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Traditional object detection systems are typically constrained to predefined categories, limiting their applicability in dynamic environments. In contrast, open-vocabulary object detection (OVD) enables the identification of objects from novel classes not present in the training set. Recent advances in visual-language modeling have led to significant progress of OVD. However, prior works face challenges in either adapting the single-scale image backbone from CLIP to the detection framework or ensuring robust visual-language alignment. We propose Visual-Language Detection (VLDet), a novel framework that revamps feature pyramid for fine-grained visual-language alignment, leading to improved OVD performance. With the VL-PUB module, VLDet effectively exploits the visual-language knowledge from CLIP and adapts the backbone for object detection through feature pyramid. In addition, we introduce the SigRPN block, which incorporates a sigmoid-based anchor-text contrastive alignment loss to improve detection of novel categories. Through extensive experiments, our approach achieves 58.7 AP for novel classes on COCO2017 and 24.8 AP on LVIS, surpassing all state-of-the-art methods and achieving significant improvements of 27.6% and 6.9%, respectively. Furthermore, VLDet also demonstrates superior zero-shot performance on closed-set object detection.
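SigRPN 中"sigmoid-based 锚框-文本对比损失"的基本形式可参照 SigLIP:把每个(锚框, 类别文本)对当作独立的二分类,用逐对 BCE 而非 batch 内 softmax 归一化。下面给出该损失形式的最小 numpy 示意;温度 t、偏置 b 与正负对的构造均为假设,并非论文的实际超参数。

```python
import numpy as np

def sigmoid_anchor_text_loss(anchor_feat, text_feat, labels, t=10.0, b=-5.0):
    """anchor_feat: (A, D) 锚框特征;text_feat: (C, D) 类别文本嵌入;
    labels: (A, C) 的 0/1 匹配矩阵。逐对 sigmoid BCE,无需全局归一化。"""
    a = anchor_feat / np.linalg.norm(anchor_feat, axis=1, keepdims=True)
    w = text_feat / np.linalg.norm(text_feat, axis=1, keepdims=True)
    logits = t * (a @ w.T) + b                # (A, C) 相似度 logits
    z = labels * 2 - 1                        # {0,1} -> {-1,+1}
    return float(np.mean(np.log1p(np.exp(-z * logits))))  # pairwise BCE

rng = np.random.default_rng(0)
A, C, D = 6, 4, 32
anchors, texts = rng.normal(size=(A, D)), rng.normal(size=(C, D))
labels = np.zeros((A, C))
labels[np.arange(A), rng.integers(0, C, A)] = 1   # 每个锚框匹配一个类别
print(sigmoid_anchor_text_loss(anchors, texts, labels))
```

逐对形式的一个好处是:新类别的文本嵌入可以随时加入,而无需重新构造 softmax 的归一化集合,这与开放词汇检测的需求相契合。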
zh
[CV-232] SAGE: Accelerating Vision-Language Models via Entropy-Guided Adaptive Speculative Decoding
【速读】:该论文旨在解决现有推测解码(speculative decoding)方法在视觉语言模型(Vision-Language Models, VLMs)推理加速中因采用静态树结构而导致的效率瓶颈问题。具体而言,静态树结构无法根据生成步骤中预测难度的变化进行自适应调整,从而限制了接受长度(acceptance length)和加速效果。解决方案的关键在于提出SAGE框架,其核心创新是基于实时预测不确定性动态调整推测树结构:利用输出熵(output entropy)作为置信度指标,当预测置信度高时构建更深更窄的树以最大化推测深度,而在不确定预测时构建更浅更宽的树以增强探索多样性。这一机制显著提升了接受长度并实现了更快的推理加速,实验表明在不损失输出质量的前提下,SAGE相较静态基线方法可实现最高达3.36倍的解码速度提升。
链接: https://arxiv.org/abs/2602.00523
作者: Yujia Tong,Tian Zhang,Yunyang Wan,Kaiwei Lin,Jingling Yuan,Chuang Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Speculative decoding has emerged as a promising approach to accelerate inference in vision-language models (VLMs) by enabling parallel verification of multiple draft tokens. However, existing methods rely on static tree structures that remain fixed throughout the decoding process, failing to adapt to the varying prediction difficulty across generation steps. This leads to suboptimal acceptance lengths and limited speedup. In this paper, we propose SAGE, a novel framework that dynamically adjusts the speculation tree structure based on real-time prediction uncertainty. Our key insight is that output entropy serves as a natural confidence indicator with strong temporal correlation across decoding steps. SAGE constructs deeper-narrower trees for high-confidence predictions to maximize speculation depth, and shallower-wider trees for uncertain predictions to diversify exploration. SAGE improves acceptance lengths and achieves faster acceleration compared to static tree baselines. Experiments on multiple benchmarks demonstrate the effectiveness of SAGE: without any loss in output quality, it delivers up to 3.36\times decoding speedup for LLaVA-OneVision-72B and 3.18\times for Qwen2.5-VL-72B.
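SAGE 的调度判据可以压缩成一个很小的决策函数:计算草稿分布的熵,低熵(高置信)时选择更深更窄的树以拉长接受序列,高熵时选择更浅更宽的树以分散风险。以下阈值与树形参数均为示意性假设,并非论文的实际配置。

```python
import numpy as np

def entropy(logits):
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def choose_tree_shape(draft_logits, lo=1.0, hi=3.0):
    """根据草稿模型输出熵选择推测树形 (depth, width)。阈值为假设。"""
    h = entropy(draft_logits)
    if h < lo:                 # 高置信:深而窄,最大化推测深度
        return {"depth": 8, "width": 2, "entropy": round(h, 3)}
    if h > hi:                 # 低置信:浅而宽,增加候选多样性
        return {"depth": 3, "width": 6, "entropy": round(h, 3)}
    return {"depth": 5, "width": 4, "entropy": round(h, 3)}

confident = np.zeros(100); confident[0] = 12.0        # 近似 one-hot 分布
uncertain = np.random.default_rng(0).normal(size=100) # 平坦分布
print(choose_tree_shape(confident))    # -> 深而窄
print(choose_tree_shape(uncertain))    # -> 浅而宽
```

摘要提到熵在相邻解码步之间有较强时间相关性,因此这类判据还可以用上一步的熵提前定形推测树,进一步摊薄调度开销。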
zh
[CV-233] MRAD: Zero-Shot Anomaly Detection with Memory-Driven Retrieval
【速读】:该论文旨在解决零样本异常检测(Zero-shot Anomaly Detection, ZSAD)中现有方法因依赖提示学习(prompt learning)或复杂建模来拟合数据分布而导致的训练/推理成本高、跨域稳定性差的问题。其解决方案的关键在于提出一种统一框架MRAD(Memory-Retrieval Anomaly Detection),通过将参数化拟合替换为直接的记忆检索机制:在免训练(train-free)的MRAD-TF版本中,冻结CLIP图像编码器,并利用辅助数据构建两级记忆库(图像级和像素级),其中特征-标签对作为键(key)和值(value)显式存储;推理时直接基于相似性检索获得异常分数。此设计摒弃了对模型参数的依赖,转而充分利用原始数据的经验分布,从而在16个工业与医学数据集上实现了优于现有方法的异常分类与分割性能。
链接: https://arxiv.org/abs/2602.00522
作者: Chaoran Xu,Chengkan Lv,Qiyu Chen,Feng Zhang,Zhengtao Zhang
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Zero-shot anomaly detection (ZSAD) often leverages pretrained vision or vision-language models, but many existing methods use prompt learning or complex modeling to fit the data distribution, resulting in high training or inference cost and limited cross-domain stability. To address these limitations, we propose Memory-Retrieval Anomaly Detection method (MRAD), a unified framework that replaces parametric fitting with a direct memory retrieval. The train-free base model, MRAD-TF, freezes the CLIP image encoder and constructs a two-level memory bank (image-level and pixel-level) from auxiliary data, where feature-label pairs are explicitly stored as keys and values. During inference, anomaly scores are obtained directly by similarity retrieval over the memory bank. Based on the MRAD-TF, we further propose two lightweight variants as enhancements: (i) MRAD-FT fine-tunes the retrieval metric with two linear layers to enhance the discriminability between normal and anomaly; (ii) MRAD-CLIP injects the normal and anomalous region priors from the MRAD-FT as dynamic biases into CLIP’s learnable text prompts, strengthening generalization to unseen categories. Across 16 industrial and medical datasets, the MRAD framework consistently demonstrates superior performance in anomaly classification and segmentation, under both train-free and training-based settings. Our work shows that fully leveraging the empirical distribution of raw data, rather than relying only on model fitting, can achieve stronger anomaly detection performance. The code will be publicly released at this https URL.
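MRAD-TF 的免训练打分机制可以概括为一次相似度检索:在 (特征, 标签) 记忆库上取 top-k 近邻,用相似度加权的近邻标签均值作为异常分数。以下为该机制的 numpy 最小示意;k、温度 tau 与数据构造均为假设,论文中的两级(图像级/像素级)记忆库与两个增强变体未包含在内。

```python
import numpy as np

def mrad_tf_score(query, mem_feats, mem_labels, k=5, tau=0.07):
    """query: (D,);mem_feats: (N, D);mem_labels: (N,) 0=正常, 1=异常。
    返回相似度加权的异常分数(纯检索,无可学习参数)。"""
    q = query / np.linalg.norm(query)
    m = mem_feats / np.linalg.norm(mem_feats, axis=1, keepdims=True)
    sim = m @ q                               # 余弦相似度 (N,)
    top = np.argsort(sim)[-k:]                # top-k 近邻
    w = np.exp(sim[top] / tau)                # 温度加权
    return float(np.sum(w * mem_labels[top]) / np.sum(w))

rng = np.random.default_rng(0)
normal_feats = rng.normal(size=(50, 16))
anom_feats = normal_feats[:10] + 3.0          # 构造偏离正常分布的样本(示意)
mem = np.vstack([normal_feats, anom_feats])
labels = np.r_[np.zeros(50), np.ones(10)]
print(mrad_tf_score(anom_feats[0] + 0.1, mem, labels))  # 接近 1(异常)
print(mrad_tf_score(normal_feats[0], mem, labels))      # 接近 0(正常)
```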
zh
[CV-234] SPARK: Stochastic Propagation via Affinity-guided Random walK for training-free unsupervised segmentation
【速读】:该论文旨在解决现有无需训练(training-free)分割方法中存在的三大核心问题:一是依赖于隐含且受限的假设,即将分割视为基于扩散衍生亲和图的谱图划分问题;二是此类方法需预先指定聚类数量,易受噪声或多重模态亲和分布影响,并因谱松弛导致边界过平滑;三是忽视局部邻域结构的重要性,从而削弱了亲和传播的稳定性与细粒度轮廓的保持能力。解决方案的关键在于将无需训练的分割重新建模为在扩散诱导的亲和图上的随机流平衡问题,其中分割结果通过融合全局扩散注意力与来自稳定扩散提取的局部邻域结构而自然涌现,形成稀疏但表达能力强的亲和结构;进一步引入一种基于马尔可夫传播的随机游走标签扩散机制,结合自适应剪枝策略抑制不可靠转移、强化高置信度亲和路径,从而实现更清晰边界、更连贯区域及更稳定的分割掩膜。
链接: https://arxiv.org/abs/2602.00516
作者: Kunal Mahatha,Jose Dolz,Christian Desrosiers
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We argue that existing training-free segmentation methods rely on an implicit and limiting assumption, that segmentation is a spectral graph partitioning problem over diffusion-derived affinities. Such approaches, based on global graph partitioning and eigenvector-based formulations of affinity matrices, suffer from several fundamental drawbacks, they require pre-selecting the number of clusters, induce boundary oversmoothing due to spectral relaxation, and remain highly sensitive to noisy or multi-modal affinity distributions. Moreover, many prior works neglect the importance of local neighborhood structure, which plays a crucial role in stabilizing affinity propagation and preserving fine-grained contours. To address these limitations, we reformulate training-free segmentation as a stochastic flow equilibrium problem over diffusion-induced affinity graphs, where segmentation emerges from a stochastic propagation process that integrates global diffusion attention with local neighborhoods extracted from stable diffusion, yielding a sparse yet expressive affinity structure. Building on this formulation, we introduce a Markov propagation scheme that performs random-walk-based label diffusion with an adaptive pruning strategy that suppresses unreliable transitions while reinforcing confident affinity paths. Experiments across seven widely used semantic segmentation benchmarks demonstrate that our method achieves state-of-the-art zero-shot performance, producing sharper boundaries, more coherent regions, and significantly more stable masks compared to prior spectral-clustering-based approaches.
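论文把分割表述为"扩散诱导亲和图上的随机流平衡",其最小骨架是:对剪枝后的亲和矩阵做行归一化得到马尔可夫转移矩阵,再对少量种子标签做带重启的随机游走扩散。以下 numpy 示意中,剪枝分位数、迭代次数与重启概率均为假设,真实方法中扩散注意力亲和与自适应剪枝的细节以原文为准。

```python
import numpy as np

def randomwalk_propagate(affinity, seeds, prune_q=0.5, iters=50, restart=0.1):
    """affinity: (N, N) 非负亲和矩阵;seeds: (N, C) 种子 one-hot(未知行全 0)。"""
    A = affinity.copy()
    A[A < np.quantile(A, prune_q)] = 0.0            # 剪枝:抑制不可靠转移
    P = A / (A.sum(axis=1, keepdims=True) + 1e-12)  # 行归一化 -> 转移矩阵
    L = seeds.astype(float)
    for _ in range(iters):
        L = (1 - restart) * P @ L + restart * seeds  # 传播 + 向种子重启
    return L.argmax(axis=1)

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
aff = np.exp(-np.sum((pts[:, None] - pts[None]) ** 2, axis=-1))
seeds = np.zeros((40, 2)); seeds[0, 0] = 1; seeds[39, 1] = 1
print(randomwalk_propagate(aff, seeds))   # 前 20 点归为类 0,后 20 点归为类 1
```

与谱方法不同,这里不经过特征分解与谱松弛,标签直接通过随机游走在图上达到平衡,这有助于避免摘要所述的边界过平滑问题。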
zh
[CV-235] DuoGen: Towards General Purpose Interleaved Multimodal Generation
【速读】:该论文旨在解决现有交错式多模态生成模型在通用指令下性能受限的问题,主要瓶颈在于训练数据不足和基础模型能力有限。其解决方案的关键在于提出DuoGen框架,通过系统性地优化数据构建、架构设计与评估机制来提升整体性能:一方面构建大规模高质量的指令微调数据集,融合人工重写的真实多模态对话与覆盖日常场景的合成样本;另一方面采用两阶段解耦策略,先对预训练多模态大语言模型(Multimodal Large Language Model, MLLM)进行指令微调,再利用精选的交错图像-文本序列将扩散Transformer(Diffusion Transformer, DiT)与MLLM对齐,从而避免昂贵的单模态预训练,并实现灵活的基础模型选择。此方法显著提升了文本质量、图像保真度及图文一致性,在多个基准测试中优于开源模型并达到统一生成模型中的最先进水平。
链接: https://arxiv.org/abs/2602.00508
作者: Min Shi,Xiaohui Zeng,Jiannan Huang,Yin Cui,Francesco Ferroni,Jialuo Li,Shubham Pachori,Zhaoshuo Li,Yogesh Balaji,Haoxiang Wang,Tsung-Yi Lin,Xiao Fu,Yue Zhao,Chieh-Yun Chen,Ming-Yu Liu,Humphrey Shi
机构: Georgia Tech (佐治亚理工学院); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report. Project Page: this https URL
Abstract:Interleaved multimodal generation enables capabilities beyond unimodal generation models, such as step-by-step instructional guides, visual planning, and generating visual drafts for reasoning. However, the quality of existing interleaved generation models under general instructions remains limited by insufficient training data and base model capacity. We present DuoGen, a general-purpose interleaved generation framework that systematically addresses data curation, architecture design, and evaluation. On the data side, we build a large-scale, high-quality instruction-tuning dataset by combining multimodal conversations rewritten from curated raw websites, and diverse synthetic examples covering everyday scenarios. Architecturally, DuoGen leverages the strong visual understanding of a pretrained multimodal LLM and the visual generation capabilities of a diffusion transformer (DiT) pretrained on video generation, avoiding costly unimodal pretraining and enabling flexible base model selection. A two-stage decoupled strategy first instruction-tunes the MLLM, then aligns DiT with it using curated interleaved image-text sequences. Across public and newly proposed benchmarks, DuoGen outperforms prior open-source models in text quality, image fidelity, and image-context alignment, and also achieves state-of-the-art performance on text-to-image and image editing among unified generation models. Data and code will be released at this https URL.
zh
[CV-236] Sparse Shortcuts: Facilitating Efficient Fusion in Multimodal Large Language Models
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在跨模态知识融合过程中,因仅依赖高层视觉特征进行对齐而导致中低层语义信息丢失的问题,从而限制了模型的跨模态理解能力。其解决方案的关键在于提出一种名为SparseCut的通用交叉模态融合架构,通过在跨模态编码器与语言模型之间引入稀疏快捷连接(sparse shortcut connections),实现视觉特征在多个层次上的高效、分层整合;同时设计了一种高效的多粒度特征融合模块,在不增加语言模型输入长度的前提下完成视觉特征融合,从而在不提升计算复杂度的情况下增强语义融合效果,显著提升MLLMs在多种多模态基准测试中的性能表现。
链接: https://arxiv.org/abs/2602.00505
作者: Jingrui Zhang,Feng Liang,Yong Zhang,Wei Wang,Runhao Zeng,Xiping Hu
机构: Shenzhen MSU-BIT University (深圳莫斯科大学-比特大学); Guangdong-Hong Kong-Macao Joint Laboratory for Emotion Intelligence and Pervasive Computing (粤港澳联合实验室情感智能与普适计算)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:With the remarkable success of large language models (LLMs) in natural language understanding and generation, multimodal large language models (MLLMs) have rapidly advanced in their ability to process data across multiple modalities. While most existing efforts focus on scaling up language models or constructing higher-quality training data, limited attention has been paid to effectively integrating cross-modal knowledge into the language space. In vision-language models, for instance, aligning modalities using only high-level visual features often discards the rich semantic information present in mid- and low-level features, limiting the model’s ability of cross-modality understanding. To address this issue, we propose SparseCut, a general cross-modal fusion architecture for MLLMs, introducing sparse shortcut connections between the cross-modal encoder and the LLM. These shortcut connections enable the efficient and hierarchical integration of visual features at multiple levels, facilitating richer semantic fusion without increasing computational overhead. We further introduce an efficient multi-grained feature fusion module, which performs the fusion of visual features before routing them through the shortcuts. This preserves the original language context and does not increase the overall input length, thereby avoiding an increase in computational complexity for the LLM. Experiments demonstrate that SparseCut significantly enhances the performance of MLLMs across various multimodal benchmarks with generality and scalability for different base LLMs.
zh
[CV-237] RGBX-R1: Visual Modality Chain-of-Thought Guided Reinforcement Learning for Multimodal Grounding
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在非RGB模态(如红外、深度和事件数据)上性能受限的问题,这些模态在复杂场景中至关重要。解决方案的关键在于提出RGBX-R1框架,其核心是通过“理解-关联-验证”(Understand-Associate-Validate, UAV)提示策略构建视觉模态思维链(Visual Modality Chain-of-Thought, VM-CoT),将MLLM对RGB的理解能力扩展至其他X视觉模态;同时引入两阶段训练范式:冷启动监督微调(Cold-Start Supervised Fine-Tuning, CS-SFT)用于建立基础模态认知,以及时空强化微调(Spatio-Temporal Reinforcement Fine-Tuning, ST-RFT)利用模态理解时空奖励(Modality-understanding Spatio-Temporal, MuST)增强推理能力,从而显著提升模型在多模态感知与空间理解上的表现。
链接: https://arxiv.org/abs/2602.00504
作者: Jiahe Wu,Bing Cao,Qilong Wang,Qinghua Hu,Dongdong Li,Pengfei Zhu
机构: Tianjin University (天津大学); Xiong’an National Innovation Center (雄安国家创新中心); Xiong’an Guochuang Lantian Technology Co., Ltd. (雄安国创蓝田科技有限公司); National University of Defense Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal Large Language Models (MLLM) are primarily pre-trained on the RGB modality, thereby limiting their performance on other modalities, such as infrared, depth, and event data, which are crucial for complex scenarios. To address this, we propose RGBX-R1, a framework to enhance MLLM’s perception and reasoning capacities across various X visual modalities. Specifically, we employ an Understand-Associate-Validate (UAV) prompting strategy to construct the Visual Modality Chain-of-Thought (VM-CoT), which aims to expand the MLLMs’ RGB understanding capability into X modalities. To progressively enhance reasoning capabilities, we introduce a two-stage training paradigm: Cold-Start Supervised Fine-Tuning (CS-SFT) and Spatio-Temporal Reinforcement Fine-Tuning (ST-RFT). CS-SFT supervises the reasoning process with the guidance of VM-CoT, equipping the MLLM with fundamental modality cognition. Building upon GRPO, ST-RFT employs a Modality-understanding Spatio-Temporal (MuST) reward to reinforce modality reasoning. Notably, we construct the first RGBX-Grounding benchmark, and extensive experiments verify our superiority in multimodal understanding and spatial perception, outperforming baselines by 22.71% on three RGBX grounding tasks.
zh
[CV-238] HSSDCT: Factorized Spatial-Spectral Correlation for Hyperspectral Image Fusion ICASSP2026
【速读】:该论文旨在解决高光谱图像(Hyperspectral Image, HSI)融合中深度学习方法存在的三大瓶颈问题:受限的感受野、冗余的光谱波段以及自注意力机制带来的二次复杂度,这些问题制约了模型的效率与鲁棒性。解决方案的关键在于提出一种层次化空间-光谱密集相关网络(Hierarchical Spatial-Spectral Dense Correlation Network, HSSDCT),其核心创新包括两个模块:一是层次化密集残差Transformer块(Hierarchical Dense-Residue Transformer Block, HDRTB),通过逐步扩展窗口并引入密集残差连接实现多尺度特征聚合;二是空间-光谱相关层(Spatial-Spectral Correlation Layer, SSCL),显式分解空间与光谱依赖关系,将自注意力复杂度从二次降低至线性,同时缓解光谱冗余问题。实验表明,该方法在保持高重建质量的同时显著降低了计算开销,达到了HSI融合的新SOTA性能。
链接: https://arxiv.org/abs/2602.00490
作者: Chia-Ming Lee,Yu-Hao Ho,Yu-Fan Lin,Jen-Wei Lee,Li-Wei Kang,Chih-Chung Hsu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICASSP 2026
Abstract:Hyperspectral image (HSI) fusion aims to reconstruct a high-resolution HSI (HR-HSI) by combining the rich spectral information of a low-resolution HSI (LR-HSI) with the fine spatial details of a high-resolution multispectral image (HR-MSI). Although recent deep learning methods have achieved notable progress, they still suffer from limited receptive fields, redundant spectral bands, and the quadratic complexity of self-attention, which restrict both efficiency and robustness. To overcome these challenges, we propose the Hierarchical Spatial-Spectral Dense Correlation Network (HSSDCT). The framework introduces two key modules: (i) a Hierarchical Dense-Residue Transformer Block (HDRTB) that progressively enlarges windows and employs dense-residue connections for multi-scale feature aggregation, and (ii) a Spatial-Spectral Correlation Layer (SSCL) that explicitly factorizes spatial and spectral dependencies, reducing self-attention to linear complexity while mitigating spectral redundancy. Extensive experiments on benchmark datasets demonstrate that HSSDCT delivers superior reconstruction quality with significantly lower computational costs, achieving new state-of-the-art performance in HSI fusion. Our code is available at this https URL.
zh
[CV-239] Refining Strokes by Learning Offset Attributes between Strokes for Flexible Sketch Edit at Stroke-Level
【速读】:该论文旨在解决草图编辑中基于笔画层级(stroke-level)的编辑问题,即如何将源笔画精确移植到目标草图上,同时保持语义一致性与视觉保真度。现有方法仅通过重定位源笔画位置难以应对源笔画在尺寸和方向上的显著差异,导致编辑结果不自然或语义不一致。解决方案的关键在于提出SketchMod框架,通过学习并调整源笔画到目标笔画之间的三个关键偏移属性(尺度、方向和位置),实现对源笔画的精细化变换:首先按比例缩放以匹配空间比例,其次旋转以对齐局部几何结构,最后位移以契合整体语义布局,从而实现灵活且精确的笔画级草图编辑。
链接: https://arxiv.org/abs/2602.00489
作者: Sicong Zang,Tao Sun,Cairong Yan
机构: Donghua University (东华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Source codes are coming soon
Abstract:Sketch edit at stroke-level aims to transplant source strokes onto a target sketch via stroke expansion or replacement, while preserving semantic consistency and visual fidelity with the target sketch. Recent studies addressed it by relocating source strokes at appropriate canvas positions. However, as source strokes could exhibit significant variations in both size and orientation, we may fail to produce plausible sketch editing results by merely repositioning them without further adjustments. For example, anchoring an oversized source stroke onto the target without proper scaling would fail to produce a semantically coherent outcome. In this paper, we propose SketchMod to refine the source stroke through transformation so as to align it with the target sketch’s patterns, further realize flexible sketch edit at stroke-level. As the source stroke refinement is governed by the patterns of the target sketch, we learn three key offset attributes (scale, orientation and position) from the source stroke to another, and align it with the target by: 1) resizing to match spatial proportions by scale, 2) rotating to align with local geometry by orientation, and 3) displacing to meet with semantic layout by position. Besides, a stroke’s profiles can be precisely controlled during sketch edit via the exposed captured stroke attributes. Experimental results indicate that SketchMod achieves precise and flexible performances on stroke-level sketch edit.
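To make the three offset attributes concrete, here is a minimal sketch of how a learned (scale, orientation, position) triple could be applied to a source stroke stored as 2D canvas points. SketchMod learns these offsets from the target sketch's patterns, which is not shown here, and anchoring the transform at the stroke centroid is an assumption.

```python
import numpy as np

def apply_stroke_offsets(stroke, scale, theta, shift):
    """Refine a source stroke with the three offset attributes named in
    the abstract: scale, orientation, and position.
    stroke: (K, 2) array of 2D points on the canvas."""
    center = stroke.mean(axis=0)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    # Scale and rotate about the stroke centroid, then translate.
    return (stroke - center) * scale @ R.T + center + shift

stroke = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 0.0]])
print(apply_stroke_offsets(stroke, scale=0.5, theta=np.pi / 6,
                           shift=np.array([3.0, 1.0])))
```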
zh
[CV-240] GTATrack: Winner Solution to SoccerTrack 2025 with Deep-EIoU and Global Tracklet Association SOCC
【速读】:该论文旨在解决体育场景中基于静态鱼眼相机的多目标跟踪(Multi-object Tracking, MOT)难题,尤其针对球员运动轨迹不规则、外观相似、频繁遮挡以及由鱼眼镜头引起的几何失真和极端尺度变化等问题。解决方案的关键在于提出一种分层跟踪框架GTATrack,其核心包含两个模块:一是无运动依赖的在线关联模块Deep Expansion IoU(Deep-EIoU),用于实现鲁棒的短时匹配;二是基于轨迹级别的全局关联模块Global Tracklet Association(GTA),用于提升长期身份一致性。此外,通过伪标签策略增强小尺寸和变形目标的检测召回率,从而有效缓解身份切换、遮挡和跟踪碎片化问题,最终在SoccerTrack Challenge 2025中以HOTA分数0.60夺冠,并显著降低误报至982个。
链接: https://arxiv.org/abs/2602.00484
作者: Rong-Lin Jian,Ming-Chi Luo,Chen-Wei Huang,Chia-Ming Lee,Yu-Fan Lin,Chih-Chung Hsu
机构: National Yang Ming Chiao Tung University (国立阳明交通大学); National Central University (国立中央大学); National Cheng Kung University (国立成功大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Winner Solution of SoccerTrack in ACM Multimedia 2025 Workshop MMSports
Abstract:Multi-object tracking (MOT) in sports is highly challenging due to irregular player motion, uniform appearances, and frequent occlusions. These difficulties are further exacerbated by the geometric distortion and extreme scale variation introduced by static fisheye cameras. In this work, we present GTATrack, a hierarchical tracking framework that won first place in the SoccerTrack Challenge 2025. GTATrack integrates two core components: Deep Expansion IoU (Deep-EIoU) for motion-agnostic online association and Global Tracklet Association (GTA) for trajectory-level refinement. This two-stage design enables both robust short-term matching and long-term identity consistency. Additionally, a pseudo-labeling strategy is used to boost detector recall on small and distorted targets. The synergy between local association and global reasoning effectively addresses identity switches, occlusions, and tracking fragmentation. Our method achieved a winning HOTA score of 0.60 and significantly reduced false positives to 982, demonstrating state-of-the-art accuracy in fisheye-based soccer tracking. Our code is available at this https URL.
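A minimal sketch of the Expansion IoU idea behind Deep-EIoU: both candidate boxes are symmetrically enlarged before computing IoU, so fast, erratic player motion can still be matched without a motion model. The expansion ratio below is illustrative, and the full tracker additionally uses deep appearance features for association.

```python
def expansion_iou(a, b, e=0.7):
    """IoU after symmetrically expanding both boxes (x1, y1, x2, y2).
    The expansion ratio e is illustrative, not the paper's value."""
    def expand(box):
        x1, y1, x2, y2 = box
        dw, dh = (x2 - x1) * e / 2, (y2 - y1) * e / 2
        return x1 - dw, y1 - dh, x2 + dw, y2 + dh

    ax1, ay1, ax2, ay2 = expand(a)
    bx1, by1, bx2, by2 = expand(b)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Disjoint boxes get zero plain IoU, but a positive Expansion IoU,
# so a fast-moving player can still be associated across frames.
print(expansion_iou((0, 0, 10, 10), (12, 0, 22, 10)))
```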
zh
[CV-241] Dual Latent Memory for Visual Multi-agent System
【速读】:该论文旨在解决视觉多智能体系统(Visual Multi-Agent Systems, VMAS)在增加智能体交互轮次时出现的“缩放瓶颈”问题,即随着交互次数增多,性能不升反降且token消耗呈指数级增长。作者指出,这一现象的根本原因在于以文本为中心的通信机制导致了感知与推理轨迹向离散自然语言转换过程中的语义损失。解决方案的关键在于提出L²-VMAS框架,其核心创新包括:1)引入双隐式记忆机制实现跨智能体协作;2)解耦感知与推理过程,并动态合成双隐式记忆;3)采用基于熵驱动的主动触发机制,将被动信息传输替换为按需访问的记忆检索策略。实验表明,该方法显著提升了模型可扩展性,在平均准确率提升2.7–5.4%的同时,token使用量降低21.3–44.8%。
链接: https://arxiv.org/abs/2602.00471
作者: Xinlei Yu,Chengming Xu,Zhangquan Chen,Bo Yin,Cheng Yang,Yongbo He,Yihao Hu,Jiangning Zhang,Cheng Tan,Xiaobin Hu,Shuicheng Yan
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While Visual Multi-Agent Systems (VMAS) promise to enhance comprehensive abilities through inter-agent collaboration, empirical evidence reveals a counter-intuitive “scaling wall”: increasing agent turns often degrades performance while exponentially inflating token costs. We attribute this failure to the information bottleneck inherent in text-centric communication, where converting perceptual and thinking trajectories into discrete natural language inevitably induces semantic loss. To this end, we propose L²-VMAS, a novel model-agnostic framework that enables inter-agent collaboration with dual latent memories. Furthermore, we decouple the perception and thinking while dynamically synthesizing dual latent memories. Additionally, we introduce an entropy-driven proactive triggering that replaces passive information transmission with efficient, on-demand memory access. Extensive experiments across backbones, model sizes, and multi-agent structures demonstrate that our method effectively breaks the “scaling wall” with superb scalability, improving average accuracy by 2.7-5.4% while reducing token usage by 21.3-44.8%. Codes: this https URL.
zh
[CV-242] ZS-TreeSeg: A Zero-Shot Framework for Tree Crown Instance Segmentation
【速读】:该论文旨在解决遥感中个体树冠(Individual Tree Crown, ITC)分割在密集交错林冠场景下难以准确 delineation 的问题,尤其针对传统监督深度学习方法因标注成本高和泛化能力弱,以及基础模型(如 Segment Anything Model)缺乏领域知识导致密集簇中出现欠分割(under-segmentation)的瓶颈。其解决方案的关键在于提出 ZS-TreeSeg 框架,该框架通过将树冠建模为拓扑流场中的星凸对象(star-convex objects),并利用 Cellpose-SAM 结合细胞实例分割与语义分割的成熟任务,基于向量收敛机制强制对接触树冠实例进行数学分离,从而实现无需训练即可跨传感器类型和冠层密度鲁棒地完成树冠实例分割与标签生成。
链接: https://arxiv.org/abs/2602.00470
作者: Pengyu Chen,Fangzheng Lyu,Sicheng Wang,Cuizhen Wang
机构: University of South Carolina (南卡罗来纳大学); Virginia Polytechnic Institute and State University (弗吉尼亚理工学院暨州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Individual tree crown segmentation is an important task in remote sensing for forest biomass estimation and ecological monitoring. However, accurate delineation in dense, overlapping canopies remains a bottleneck. While supervised deep learning methods suffer from high annotation costs and limited generalization, emerging foundation models (e.g., Segment Anything Model) often lack domain knowledge, leading to under-segmentation in dense clusters. To bridge this gap, we propose ZS-TreeSeg, a Zero-Shot framework that adapts two mature tasks: 1) canopy semantic segmentation and 2) cell instance segmentation. By modeling tree crowns as star-convex objects within a topological flow field using Cellpose-SAM, the ZS-TreeSeg framework forces the mathematical separation of touching tree crown instances based on vector convergence. Experiments on the NEON and BAMFOREST datasets and visual inspection demonstrate that our framework generalizes robustly across diverse sensor types and canopy densities, offering a training-free solution for tree crown instance segmentation and label generation.
zh
[CV-243] PSGS: Text-driven Panorama Sliding Scene Generation via Gaussian Splatting ICASSP2026
【速读】:该论文旨在解决文本驱动的3D场景生成中存在的两大核心问题:一是缺乏高质量的3D文本数据导致模型难以学习复杂的语义结构;二是多视角拼接不一致,致使生成的场景过于简单且视觉质量低下。其解决方案的关键在于提出一个两阶段框架PSGS(Panoramic Scene Generation System),第一阶段通过双层优化架构实现语义一致的全景图生成——布局推理层解析文本中的空间关系,自优化层利用迭代多模态大语言模型(Multimodal Large Language Model, MLLM)反馈精修细节;第二阶段引入全景滑动机制,基于重叠视角策略初始化全局一致的3D高斯点云(3D Gaussian Splatting),并结合深度与语义一致性损失进行训练,从而显著提升渲染场景的细节保真度和整体质量。
链接: https://arxiv.org/abs/2602.00463
作者: Xin Zhang,Shen Chen,Jiale Zhou,Lei Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICASSP2026
Abstract:Generating realistic 3D scenes from text is crucial for immersive applications like VR, AR, and gaming. While text-driven approaches promise efficiency, existing methods suffer from limited 3D-text data and inconsistent multi-view stitching, resulting in overly simplistic scenes. To address this, we propose PSGS, a two-stage framework for high-fidelity panoramic scene generation. First, a novel two-layer optimization architecture generates semantically coherent panoramas: a layout reasoning layer parses text into structured spatial relationships, while a self-optimization layer refines visual details via iterative MLLM feedback. Second, our panorama sliding mechanism initializes globally consistent 3D Gaussian Splatting point clouds by strategically sampling overlapping perspectives. By incorporating depth and semantic coherence losses during training, we greatly improve the quality and detail fidelity of rendered scenes. Our experiments demonstrate that PSGS outperforms existing methods in panorama generation and produces more appealing 3D scenes, offering a robust solution for scalable immersive content creation.
zh
[CV-244] LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Model, VLM)中视觉token表示的可解释性问题,即如何揭示视觉token在大语言模型(Large Language Model, LLM)各层处理过程中所编码的信息。传统方法如LogitLens被发现显著低估了视觉token的可解释性,而本文提出的关键解决方案是LatentLens——一种将潜在表示映射为自然语言描述的新方法。其核心在于:通过编码大规模文本语料库并存储每个词的上下文感知表示,随后将视觉token表示与这些文本表示进行最近邻比较,从而获得对视觉token的语义描述。实验表明,使用LatentLens可在所有测试的VLM及其各层中实现多数视觉token的高可解释性,且描述语义更精细,为理解视觉与语言表征之间的对齐机制提供了新证据。
链接: https://arxiv.org/abs/2602.00462
作者: Benno Krojer,Shravan Nayak,Oscar Mañas,Vaibhav Adlakha,Desmond Elliott,Siva Reddy,Marius Mosbach
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Transforming a large language model (LLM) into a Vision-Language Model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of an LLM. Intriguingly, this mapping can be as simple as a shallow MLP transformation. To understand why LLMs can so readily process visual tokens, we need interpretability methods that reveal what is encoded in the visual token representations at every layer of LLM processing. In this work, we introduce LatentLens, a novel approach for mapping latent representations to descriptions in natural language. LatentLens works by encoding a large text corpus and storing contextualized token representations for each token in that corpus. Visual token representations are then compared to their contextualized textual representations, with the top-k nearest neighbor representations providing descriptions of the visual token. We evaluate this method on 10 different VLMs, showing that commonly used methods, such as LogitLens, substantially underestimate the interpretability of visual tokens. With LatentLens instead, the majority of visual tokens are interpretable across all studied models and all layers. Qualitatively, we show that the descriptions produced by LatentLens are semantically meaningful and provide more fine-grained interpretations for humans compared to individual tokens. More broadly, our findings contribute new evidence on the alignment between vision and language representations, opening up new directions for analyzing latent representations.
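The retrieval step described above is straightforward to sketch: a visual token's hidden state is compared against a pre-computed bank of contextualized text-token representations, and the top-k neighbors serve as its natural-language description. The cosine metric, dimensions, and variable names here are assumptions.

```python
import torch
import torch.nn.functional as F

def latent_lens(visual_token, bank_vecs, bank_tokens, k=5):
    # Compare one visual token's hidden state against a bank of
    # contextualized text-token embeddings and return the k nearest
    # tokens (with similarities) as its description.
    sims = F.cosine_similarity(visual_token[None, :], bank_vecs, dim=-1)
    top = sims.topk(k)
    return [(bank_tokens[int(i)], round(float(s), 3))
            for s, i in zip(top.values, top.indices)]

d = 64
bank_vecs = F.normalize(torch.randn(1000, d), dim=-1)  # corpus embeddings
bank_tokens = [f"tok_{i}" for i in range(1000)]        # corpus token strings
print(latent_lens(torch.randn(d), bank_vecs, bank_tokens))
```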
zh
[CV-245] Model Optimization for Multi-Camera 3D Detection and Tracking
【速读】:该论文旨在解决室内环境中多摄像头感知(multi-camera perception)下的多目标跟踪问题,特别是在存在遮挡和异构视角条件下如何实现稳定的身份关联与时空一致性。其核心挑战在于如何在降低输入帧率、量化精度受限以及跨数据集迁移时保持目标身份的持久性和检测准确性。解决方案的关键在于提出Sparse4D框架——一个基于查询的时空3D检测与跟踪方法,通过在共享世界坐标系中融合多视角特征,并利用实例记忆(instance memory)传播稀疏对象查询(sparse object queries),从而增强跨摄像头的身份一致性;同时,研究发现选择性量化骨干网络和颈部模块可在速度与精度间取得最优平衡,而注意力相关模块对低精度敏感,需谨慎处理,此外Transformer Engine混合精度微调虽能提升计算效率并扩展摄像机规模,但可能破坏身份传播稳定性,因此需引入稳定性感知的验证机制以保障性能鲁棒性。
链接: https://arxiv.org/abs/2602.00450
作者: Ethan Anderson(1),Justin Silva(1),Kyle Zheng(1),Sameer Pusegaonkar(2),Yizhou Wang(2),Zheng Tang(2),Sujit Biswas(2) ((1) Clemson University, (2) NVIDIA)
机构: Clemson University (克莱姆森大学); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Outside-in multi-camera perception is increasingly important in indoor environments, where networks of static cameras must support multi-target tracking under occlusion and heterogeneous viewpoints. We evaluate Sparse4D, a query-based spatiotemporal 3D detection and tracking framework that fuses multi-view features in a shared world frame and propagates sparse object queries via instance memory. We study reduced input frame rates, post-training quantization (INT8 and FP8), transfer to the WILDTRACK benchmark, and Transformer Engine mixed-precision fine-tuning. To better capture identity stability, we report Average Track Duration (AvgTrackDur), which measures identity persistence in seconds. Sparse4D remains stable under moderate FPS reductions, but below 2 FPS, identity association collapses even when detections are stable. Selective quantization of the backbone and neck offers the best speed-accuracy trade-off, while attention-related modules are consistently sensitive to low precision. On WILDTRACK, low-FPS pretraining yields large zero-shot gains over the base checkpoint, while small-scale fine-tuning provides limited additional benefit. Transformer Engine mixed precision reduces latency and improves camera scalability, but can destabilize identity propagation, motivating stability-aware validation.
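One plausible reading of the Average Track Duration (AvgTrackDur) metric reported above: the mean span, in seconds, over which each identity persists. The exact definition in the paper may differ (for example, by counting only contiguous segments).

```python
def avg_track_duration(tracks, fps):
    """Average Track Duration in seconds: mean identity persistence.
    tracks: {track_id: list of frame indices where the id appears}."""
    durations = [(max(f) - min(f) + 1) / fps for f in tracks.values() if f]
    return sum(durations) / len(durations) if durations else 0.0

tracks = {1: list(range(0, 300)), 2: list(range(50, 120))}
print(avg_track_duration(tracks, fps=30))  # (10.0 s + 2.33 s) / 2
```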
zh
[CV-246] DISK: Dynamic Inference SKipping for World Models
【速读】:该论文旨在解决自回归世界模型在长时程视频与轨迹预测中计算成本高、推理效率低的问题,同时保持运动-外观一致性(motion-appearance consistency)和预测质量。解决方案的关键在于提出一种无需训练的自适应推理方法DISK,其通过双分支控制器(dual-branch controllers)协同控制视频和自身轨迹扩散过程,并引入跨模态跳跃决策(cross-modal skip decisions)机制;进一步将高阶潜在差值跳跃检测(higher-order latent-difference skip testing)扩展至前向传播链(autoregressive chain-of-forward regime),并通过滚动回放循环(rollout loops)传递控制器统计信息,从而实现长期稳定性和显著加速——在NuPlan和NuScenes数据集上使用NVIDIA L40S GPU进行闭环驾驶推演时,轨迹扩散提速2倍、视频扩散提速1.6倍,且不牺牲规划误差(L2)、视觉质量(FID/FVD)和导航相似性(NAVSIM PDMS)指标。
链接: https://arxiv.org/abs/2602.00440
作者: Anugunj Naman,Gaibo Zhang,Ayushman Singh,Yaguang Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注:
Abstract:We present DISK, a training-free adaptive inference method for autoregressive world models. DISK coordinates two coupled diffusion transformers for video and ego-trajectory via dual-branch controllers with cross-modal skip decisions, preserving motion-appearance consistency without retraining. We extend higher-order latent-difference skip testing to the autoregressive chain-of-forward regime and propagate controller statistics through rollout loops for long-horizon stability. When integrated into closed-loop driving rollouts on 1500 NuPlan and NuScenes samples using an NVIDIA L40S GPU, DISK achieves 2x speedup on trajectory diffusion and 1.6x speedup on video diffusion while maintaining L2 planning error, visual quality (FID/FVD), and NAVSIM PDMS scores, demonstrating practical long-horizon video-and-trajectory prediction at substantially reduced cost.
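A toy version of the higher-order latent-difference skip test: if the last few latents evolve almost linearly, i.e., their second difference is small relative to the most recent step, the controller skips the full diffusion-transformer forward for that step. The threshold, the normalization, and what a skip falls back to are assumptions made for illustration.

```python
import torch

def should_skip(prev_latents, order=2, tau=0.05):
    """Skip test in the spirit of DISK: when recent denoising steps
    change the latent almost linearly, a cheap extrapolation can
    replace a full transformer forward. tau is an assumed threshold,
    not the paper's calibrated statistic."""
    if len(prev_latents) < order + 1:
        return False
    x = torch.stack(prev_latents[-(order + 1):])
    diff = x
    for _ in range(order):
        diff = diff[1:] - diff[:-1]        # finite differences
    score = diff.abs().mean() / (x[-1] - x[-2]).abs().mean().clamp_min(1e-8)
    return score.item() < tau

latents = [torch.full((4, 8, 8), float(t)) for t in range(4)]  # linear drift
print(should_skip(latents))  # True: the second difference vanishes
```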
zh
[CV-247] Text is All You Need for Vision-Language Model Jailbreaking
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在安全对齐方面存在的漏洞问题,即现有防御机制主要针对显式文本输入或相关视觉场景进行检测,难以应对通过光学字符识别(OCR)能力绕过过滤的隐蔽攻击。解决方案的关键在于提出一种名为Text-DJ的新颖越狱攻击方法:首先将有害查询分解为多个语义相关但更隐晦的子查询,其次选取与有害查询高度无关的干扰查询,最后将所有子查询和干扰查询以图像网格形式呈现给LVLM,其中有害子查询位于网格中央。该方法利用OCR处理机制绕过文本过滤,并通过高密度无关查询制造认知干扰,使模型无法有效关联分散的有害内容,从而成功规避安全防护。
链接: https://arxiv.org/abs/2602.00420
作者: Yihang Chen,Zhao Xu,Youyuan Jiang,Tianle Zheng,Cho-Jui Hsieh
机构: University of California, Los Angeles (加州大学洛杉矶分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Large Vision-Language Models (LVLMs) are increasingly equipped with robust safety safeguards to prevent responses to harmful or disallowed prompts. However, these defenses often focus on analyzing explicit textual inputs or relevant visual scenes. In this work, we introduce Text-DJ, a novel jailbreak attack that bypasses these safeguards by exploiting the model’s Optical Character Recognition (OCR) capability. Our methodology consists of three stages. First, we decompose a single harmful query into multiple and semantically related but more benign sub-queries. Second, we pick a set of distraction queries that are maximally irrelevant to the harmful query. Third, we present all decomposed sub-queries and distraction queries to the LVLM simultaneously as a grid of images, with the sub-queries positioned in the middle of the grid. We demonstrate that this method successfully circumvents the safety alignment of state-of-the-art LVLMs. We argue this attack succeeds by (1) converting text-based prompts into images, bypassing standard text-based filters, and (2) inducing distractions, where the model’s safety protocols fail to link the scattered sub-queries within a high number of irrelevant queries. Overall, our findings expose a critical vulnerability in LVLMs’ OCR capabilities that are not robust to dispersed, multi-image adversarial inputs, highlighting the need for defenses for fragmented multimodal inputs.
zh
[CV-248] Toward Autonomous Laboratory Safety Monitoring with Vision Language Models: Learning to See Hazards Through Scene Structure
【速读】:该论文旨在解决实验室安全监控中因缺乏结构化视觉数据而导致的生成式 AI(Generative AI)模型性能受限的问题。现有方法依赖于文本形式的安全事件记录,而实际场景中的视觉感知能力不足,使得模型难以从原始图像中提取关键对象关系以准确识别潜在危险。解决方案的关键在于提出一种“场景图引导对齐”(scene-graph-guided alignment)的后训练上下文工程策略:通过大语言模型构建结构化的场景图(scene graph),并利用图像生成模型将其转化为带标签的图像-场景图-真实标签三元组数据,从而在视觉输入阶段将原始像素信息映射为更符合视觉语言模型(VLM)推理逻辑的结构化表示,显著提升在纯视觉设置下的危险检测性能。
链接: https://arxiv.org/abs/2602.00414
作者: Trishna Chakraborty,Udita Ghosh,Aldair Ernesto Gongora,Ruben Glatt,Yue Dong,Jiachen Li,Amit K. Roy-Chowdhury,Chengyu Song
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Laboratories are prone to severe injuries from minor unsafe actions, yet continuous safety monitoring – beyond mandatory pre-lab safety training – is limited by human availability. Vision language models (VLMs) offer promise for autonomous laboratory safety monitoring, but their effectiveness in realistic settings is unclear due to the lack of visual evaluation data, as most safety incidents are documented primarily as unstructured text. To address this gap, we first introduce a structured data generation pipeline that converts textual laboratory scenarios into aligned triples of (image, scene graph, ground truth), using large language models as scene graph architects and image generation models as renderers. Our experiments on the synthetic dataset of 1,207 samples across 362 unique scenarios and seven open- and closed-source models show that VLMs perform effectively given a textual scene graph, but degrade substantially in visual-only settings, indicating difficulty in extracting structured object relationships directly from pixels. To overcome this, we propose a post-training context-engineering approach, scene-graph-guided alignment, to bridge perceptual gaps in VLMs by translating visual inputs into structured scene graphs better aligned with VLM reasoning, improving hazard detection performance in visual-only settings.
zh
[CV-249] 3DGS²-TR: Scalable Second-Order Trust-Region Method for 3D Gaussian Splatting
【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)场景训练中优化效率低的问题,尤其针对现有二阶优化方法在计算复杂度和内存占用上的瓶颈。其核心解决方案是提出一种名为3DGS²-TR的二阶优化器,关键创新在于:1)通过Hutchinson方法仅近似Hessian矩阵的对角线元素来高效估计曲率,实现完全无矩阵(matrix-free)的优化,计算与内存复杂度均为O(n),与ADAM相当;2)引入基于平方Hellinger距离的逐参数信任域策略,以稳定处理3DGS光栅化过程中的强非线性问题。实验表明,在相同初始化条件下,3DGS²-TR可减少50%训练迭代次数并显著降低GPU内存开销(<1GB峰值),从而支持更大规模场景甚至分布式训练。
链接: https://arxiv.org/abs/2602.00395
作者: Roger Hsiao,Yuchen Fang,Xiangru Huang,Ruilong Li,Hesam Rabeti,Zan Gojcic,Javad Lavaei,James Demmel,Sophia Shao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注:
Abstract:We propose 3DGS²-TR, a second-order optimizer for accelerating the scene training problem in 3D Gaussian Splatting (3DGS). Unlike existing second-order approaches that rely on explicit or dense curvature representations, such as 3DGS-LM (Höllein et al., 2025) or 3DGS2 (Lan et al., 2025), our method approximates curvature using only the diagonal of the Hessian matrix, estimated efficiently via Hutchinson’s method. Our approach is fully matrix-free and has the same complexity as ADAM (Kingma, 2024): O(n) in both computation and memory costs. To ensure stable optimization in the presence of strong nonlinearity in the 3DGS rasterization process, we introduce a parameter-wise trust-region technique based on the squared Hellinger distance, regularizing updates to Gaussian parameters. Under identical parameter initialization and without densification, 3DGS²-TR is able to achieve better reconstruction quality on standard datasets, using 50% fewer training iterations compared to ADAM, while incurring less than 1GB of peak GPU memory overhead (17% more than ADAM and 85% less than 3DGS-LM), enabling scalability to very large scenes and potentially to distributed training settings.
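Hutchinson's estimator, the curvature approximation named in the abstract, is easy to sketch matrix-free with Hessian-vector products: for Rademacher vectors z, the expectation of z * (Hz) (elementwise) equals diag(H). The single-tensor setup below is a simplification of the per-Gaussian-parameter case.

```python
import torch

def hutchinson_diag_hessian(loss_fn, params, n_samples=8):
    """Estimate diag(H) matrix-free: E[z * (H z)] with Rademacher z,
    where each H z costs one Hessian-vector product. O(n) memory."""
    diag = torch.zeros_like(params)
    for _ in range(n_samples):
        z = torch.randint(0, 2, params.shape).float() * 2 - 1   # +/- 1
        loss = loss_fn(params)
        (g,) = torch.autograd.grad(loss, params, create_graph=True)
        (hz,) = torch.autograd.grad(g @ z, params)              # H z
        diag += z * hz
    return diag / n_samples

p = torch.randn(5, requires_grad=True)
A = torch.diag(torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0]))
loss_fn = lambda w: 0.5 * w @ A @ w        # Hessian is exactly A
print(hutchinson_diag_hessian(loss_fn, p))  # approx [1, 2, 3, 4, 5]
```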
zh
[CV-250] Modeling Art Evaluations from Comparative Judgments: A Deep Learning Approach to Predicting Aesthetic Preferences
【速读】:该论文旨在解决视觉艺术中人类审美判断建模的两大难题:个体偏好差异大以及标注数据获取成本高。其核心解决方案是采用基于成对偏好评估的对比学习框架,而非传统的直接评分方式,从而降低标注成本并提升模型性能。关键创新在于利用“比较判断定律”(Law of Comparative Judgment),通过减少认知负担和增强一致性来提高标注效率;实验表明,该方法在无需直接评分数据的情况下仍能接近回归模型的性能,且标注时间比传统方式减少60%,显著提升了大规模偏好建模的可行性与效率。
链接: https://arxiv.org/abs/2602.00394
作者: Manoj Reddy Bethi,Sai Rupa Jhade,Pravallika Yaganti,Monoshiz Mahbub Khan,Zhe Yu
机构: Rochester Institute of Technology (罗切斯特理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Modeling human aesthetic judgments in visual art presents significant challenges due to individual preference variability and the high cost of obtaining labeled data. To reduce cost of acquiring such labels, we propose to apply a comparative learning framework based on pairwise preference assessments rather than direct ratings. This approach leverages the Law of Comparative Judgment, which posits that relative choices exhibit less cognitive burden and greater cognitive consistency than direct scoring. We extract deep convolutional features from painting images using ResNet-50 and develop both a deep neural network regression model and a dual-branch pairwise comparison model. We explored four research questions: (RQ1) How does the proposed deep neural network regression model with CNN features compare to the baseline linear regression model using hand-crafted features? (RQ2) How does pairwise comparative learning compare to regression-based prediction when lacking access to direct rating values? (RQ3) Can we predict individual rater preferences through within-rater and cross-rater analysis? (RQ4) What is the annotation cost trade-off between direct ratings and comparative judgments in terms of human time and effort? Our results show that the deep regression model substantially outperforms the baseline, achieving up to 328% improvement in R^2 . The comparative model approaches regression performance despite having no access to direct rating values, validating the practical utility of pairwise comparisons. However, predicting individual preferences remains challenging, with both within-rater and cross-rater performance significantly lower than average rating prediction. Human subject experiments reveal that comparative judgments require 60% less annotation time per item, demonstrating superior annotation efficiency for large-scale preference modeling.
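A minimal sketch of the dual-branch pairwise model under the Law of Comparative Judgment: one shared scorer rates each painting's ResNet-50 features (2048-dimensional), and the probability that A is preferred over B is the sigmoid of the score difference, trained with binary cross-entropy. The layer widths and head are assumptions; the abstract does not specify the architecture.

```python
import torch
import torch.nn as nn

# Shared scorer applied to both branches; widths are illustrative.
scorer = nn.Sequential(nn.Linear(2048, 256), nn.ReLU(), nn.Linear(256, 1))

def pairwise_loss(feat_a, feat_b, pref):
    """pref = 1 if A was judged more aesthetic than B, else 0.
    P(A > B) = sigmoid(s_A - s_B), so BCE-with-logits on the margin."""
    margin = scorer(feat_a) - scorer(feat_b)
    return nn.functional.binary_cross_entropy_with_logits(
        margin.squeeze(-1), pref)

feat_a, feat_b = torch.randn(16, 2048), torch.randn(16, 2048)
pref = torch.randint(0, 2, (16,)).float()
loss = pairwise_loss(feat_a, feat_b, pref)
loss.backward()
print(float(loss))
```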
zh
[CV-251] Robust automatic brain vessel segmentation in 3D CTA scans using dynamic 4D-CTA data
【速读】:该论文旨在解决脑血管分割中手动标注耗时且数据量有限的问题,特别是在动态4D-CTA(计算机断层扫描血管成像)图像中实现高精度、鲁棒的动脉与静脉自动分割。其关键解决方案是:利用多时间点动态CTA数据,通过减去骨组织和软组织增强血管可视化,从而显著降低人工标注成本;同时,基于同一患者不同对比剂相位的相同血管结构进行多阶段训练,使训练数据规模扩大4–5倍,并提升模型对不同对比相位的适应能力,最终在TopBrain数据集上实现了平均mDC达0.846(动脉)和0.957(静脉)的优异分割性能,且在adHD和tSens等形态学指标上也表现出低误差和高拓扑敏感性。
链接: https://arxiv.org/abs/2602.00391
作者: Alberto Mario Ceballos-Arroyo,Shrikanth M. Yadav,Chu-Hsuan Lin,Jisoo Kim,Geoffrey S. Young,Huaizu Jiang,Lei Qin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 8 figures
Abstract:In this study, we develop a novel methodology for annotating the brain vasculature using dynamic 4D-CTA head scans. By using multiple time points from dynamic CTA acquisitions, we subtract bone and soft tissue to enhance the visualization of arteries and veins, reducing the effort required to obtain manual annotations of brain vessels. We then train deep learning models on our ground truth annotations by using the same segmentation for multiple phases from the dynamic 4D-CTA collection, effectively enlarging our dataset by 4 to 5 times and inducing robustness to contrast phases. In total, our dataset comprises 110 training images from 25 patients and 165 test images from 14 patients. In comparison with two similarly-sized datasets for CTA-based brain vessel segmentation, a nnUNet model trained on our dataset can achieve significantly better segmentations across all vascular regions, with an average mDC of 0.846 for arteries and 0.957 for veins in the TopBrain dataset. Furthermore, metrics such as average directed Hausdorff distance (adHD) and topology sensitivity (tSens) reflected similar trends: using our dataset resulted in low error margins (adHD of 0.304 mm for arteries and 0.078 mm for veins) and high sensitivity (tSens of 0.877 for arteries and 0.974 for veins), indicating excellent accuracy in capturing vessel morphology. Our code and model weights are available online: this https URL
zh
[CV-252] Deep Learning-Based Object Detection for Autonomous Vehicles: A Comparative Study of One-Stage and Two-Stage Detectors on Basic Traffic Objects
【速读】:该论文旨在解决自动驾驶系统中对象检测模型选择与性能评估的难题,即如何在不同场景下权衡检测精度、处理速度及环境鲁棒性,以选出最适合特定应用的深度学习方法。其解决方案的关键在于通过实验对比两种主流目标检测模型——YOLOv5(单阶段检测器)和Faster R-CNN(两阶段检测器)——在包含真实与合成图像的多样化数据集上的表现,重点考察均值平均精度(mAP)、召回率和推理速度等指标,并分析它们在不同置信度阈值和复杂驾驶场景下的行为差异,从而为自动驾驶系统中对象检测模型的选型提供实证依据。
链接: https://arxiv.org/abs/2602.00385
作者: Bsher Karbouj,Adam Michael Altenbuchner,Joerg Krueger
机构: Technische Universität Berlin (柏林工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Object detection is a crucial component in autonomous vehicle systems. It enables the vehicle to perceive and understand its environment by identifying and locating various objects around it. By utilizing advanced imaging and deep learning techniques, autonomous vehicle systems can rapidly and accurately identify objects based on their features. Different deep learning methods vary in their ability to accurately detect and classify objects in autonomous vehicle systems. Selecting the appropriate method significantly impacts system performance, robustness, and efficiency in real-world driving scenarios. While several generic deep learning architectures like YOLO, SSD, and Faster R-CNN have been proposed, guidance on their suitability for specific autonomous driving applications is often limited. The choice of method affects detection accuracy, processing speed, environmental robustness, sensor integration, scalability, and edge case handling. This study provides a comprehensive experimental analysis comparing two prominent object detection models: YOLOv5 (a one-stage detector) and Faster R-CNN (a two-stage detector). Their performance is evaluated on a diverse dataset combining real and synthetic images, considering various metrics including mean Average Precision (mAP), recall, and inference speed. The findings reveal that YOLOv5 demonstrates superior performance in terms of mAP, recall, and training efficiency, particularly as dataset size and image resolution increase. However, Faster R-CNN shows advantages in detecting small, distant objects and performs well in challenging lighting conditions. The models’ behavior is also analyzed under different confidence thresholds and in various real-world scenarios, providing insights into their applicability for autonomous driving systems.
zh
[CV-253] Modeling Image-Caption Rating from Comparative Judgments
【速读】:该论文试图解决图像描述(image caption)准确性评价中人工标注耗时且主观性强的问题。传统方法依赖人类对每组图像-文本对进行直接评分,但这一过程效率低且一致性差。解决方案的关键在于采用对比学习(comparative learning)框架,利用人类更容易判断两个描述中哪个更贴合图像的特点,通过建模成对比较的偏好来训练模型,从而实现对未见图像-文本对的排序。该方法在VICR数据集上结合ResNet-50和MiniLM分别提取视觉与文本特征,实验表明尽管回归模型性能略优(Pearson相关系数ρ=0.7609),但对比学习模型随着数据量增加持续提升并逼近基准,同时显著降低标注成本,且人机一致性更高,验证了其有效性与实用性。
链接: https://arxiv.org/abs/2602.00381
作者: Kezia Minni,Qiang Zhang,Monoshiz Mahbub Khan,Zhe Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Rating the accuracy of captions in describing images is time-consuming and subjective for humans. In contrast, it is often easier for people to compare two captions and decide which one better matches a given image. In this work, we propose a machine learning framework that models such comparative judgments instead of direct ratings. The model can then be applied to rank unseen image-caption pairs in the same way as a regression model trained on direct ratings. Using the VICR dataset, we extract visual features with ResNet-50 and text features with MiniLM, then train both a regression model and a comparative learning model. While the regression model achieves better performance (Pearson’s ρ: 0.7609 and Spearman’s r_s: 0.7089), the comparative learning model steadily improves with more data and approaches the regression baseline. In addition, a small-scale human evaluation study comparing absolute rating, pairwise comparison, and same-image comparison shows that comparative annotation yields faster results and has greater agreement among human annotators. These results suggest that comparative learning can effectively model human preferences while significantly reducing the cost of human annotations.
zh
[CV-254] ReLAPSe: Reinforcement-Learning-trained Adversarial Prompt Search for Erased concepts in unlearned diffusion models
【速读】:该论文旨在解决文本到图像扩散模型在执行机器遗忘(machine unlearning)后仍可能残留潜在视觉信息(latent visual information)的问题,尤其是针对现有对抗性攻击方法在利用此类信息泄漏时存在的局限性:基于优化的方法计算成本高,而基于推理或启发式的技术缺乏对目标模型隐空间表示的直接反馈。解决方案的关键在于提出 ReLAPSe,一个基于策略的对抗框架,将概念恢复问题建模为强化学习任务,并采用带可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR),以扩散模型的噪声预测损失作为内在且可验证的反馈信号,从而实现文本提示扰动与潜在视觉残差之间的闭环对齐,使代理能够学习具有迁移性的恢复策略,而非针对单个实例进行优化,显著提升了恢复效率与通用性。
链接: https://arxiv.org/abs/2602.00350
作者: Ignacy Kolton,Kacper Marzol,Paweł Batorski,Marcin Mazur,Paul Swoboda,Przemysław Spurek
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Machine unlearning is a key defense mechanism for removing unauthorized concepts from text-to-image diffusion models, yet recent evidence shows that latent visual information often persists after unlearning. Existing adversarial approaches for exploiting this leakage are constrained by fundamental limitations: optimization-based methods are computationally expensive due to per-instance iterative search. At the same time, reasoning-based and heuristic techniques lack direct feedback from the target model’s latent visual representations. To address these challenges, we introduce ReLAPSe, a policy-based adversarial framework that reformulates concept restoration as a reinforcement learning problem. ReLAPSe trains an agent using Reinforcement Learning with Verifiable Rewards (RLVR), leveraging the diffusion model’s noise prediction loss as a model-intrinsic and verifiable feedback signal. This closed-loop design directly aligns textual prompt manipulation with latent visual residuals, enabling the agent to learn transferable restoration strategies rather than optimizing isolated prompts. By pioneering the shift from per-instance optimization to global policy learning, ReLAPSe achieves efficient, near-real-time recovery of fine-grained identities and styles across multiple state-of-the-art unlearning methods, providing a scalable tool for rigorous red-teaming of unlearned diffusion models. Some experimental evaluations involve sensitive visual concepts, such as nudity. Code is available at this https URL
zh
[CV-255] MASC: Metal-Aware Sampling and Correction via Reinforcement Learning for Accelerated MRI
【速读】:该论文旨在解决磁共振成像(MRI)中金属植入物引起的严重伪影问题,该伪影会显著降低图像质量并阻碍临床诊断。传统方法将金属伪影减少(Metal Artifact Reduction, MAR)与加速MRI采集作为独立问题处理,忽略了两者之间的协同优化潜力。其解决方案的关键在于提出一种统一的强化学习框架MASC(Metal-aware Sampling and Correction),通过联合优化金属感知的k空间采样策略与伪影校正过程,实现更高效的图像重建。该框架利用基于物理的仿真构建配对数据集,生成含金属与无金属植入物的匹配MRI扫描,从而为伪影校正和采集策略学习提供直接监督信号;同时将主动MRI采集建模为序贯决策问题,由一个基于近端策略优化(Proximal Policy Optimization, PPO)的代理在有限采集预算下选择k空间相位编码行,并结合U-Net结构的MAR网络进行实时反馈学习,最终通过端到端训练使采样策略与伪影校正网络协同适应,显著优于传统采样方式及固定预训练MAR网络的性能。
链接: https://arxiv.org/abs/2602.00348
作者: Zhengyi Lu,Ming Lu,Chongyu Qu,Junchao Zhu,Junlin Guo,Marilyn Lionts,Yanfan Zhu,Yuechen Yang,Tianyuan Yao,Jayasai Rajagopal,Bennett Allan Landman,Xiao Wang,Xinqiang Yan,Yuankai Huo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Metal implants in MRI cause severe artifacts that degrade image quality and hinder clinical diagnosis. Traditional approaches address metal artifact reduction (MAR) and accelerated MRI acquisition as separate problems. We propose MASC, a unified reinforcement learning framework that jointly optimizes metal-aware k-space sampling and artifact correction for accelerated MRI. To enable supervised training, we construct a paired MRI dataset using physics-based simulation, generating k-space data and reconstructions for phantoms with and without metal implants. This paired dataset provides simulated 3D MRI scans with and without metal implants, where each metal-corrupted sample has an exactly matched clean reference, enabling direct supervision for both artifact reduction and acquisition policy learning. We formulate active MRI acquisition as a sequential decision-making problem, where an artifact-aware Proximal Policy Optimization (PPO) agent learns to select k-space phase-encoding lines under a limited acquisition budget. The agent operates on undersampled reconstructions processed through a U-Net-based MAR network, learning patterns that maximize reconstruction quality. We further propose an end-to-end training scheme where the acquisition policy learns to select k-space lines that best support artifact removal while the MAR network simultaneously adapts to the resulting undersampling patterns. Experiments demonstrate that MASC’s learned policies outperform conventional sampling strategies, and end-to-end training improves performance compared to using a frozen pre-trained MAR network, validating the benefit of joint optimization. Cross-dataset experiments on FastMRI with physics-based artifact simulation further confirm generalization to realistic clinical MRI data. The code and models of MASC have been made publicly available: this https URL
zh
[CV-256] AdaFuse: Adaptive Multimodal Fusion for Lung Cancer Risk Prediction via Reinforcement Learning
【速读】:该论文旨在解决多模态融合中一个关键问题:对于特定患者,是否应使用所有可用模态进行诊断,还是应根据个体情况自适应地选择部分模态以提升预测性能。传统方法通常对所有模态进行固定处理或赋予静态权重,忽略了不同患者对不同模态的信息依赖差异。解决方案的关键在于提出AdaFuse框架,其利用强化学习(Reinforcement Learning, RL)将多模态融合建模为一个序列决策过程,通过策略网络迭代判断是否引入新模态或直接进入预测阶段,从而实现基于已获取信息的动态模态选择与融合。这种自适应机制允许模型在信息充足时提前终止,避免冗余计算,同时提升个性化诊断准确性。
链接: https://arxiv.org/abs/2602.00347
作者: Chongyu Qu,Zhengyi Lu,Yuxiang Lai,Thomas Z. Li,Junchao Zhu,Junlin Guo,Juming Xiong,Yanfan Zhu,Yuechen Yang,Allen J. Luna,Kim L. Sandler,Bennett A. Landman,Yuankai Huo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal fusion has emerged as a promising paradigm for disease diagnosis and prognosis, integrating complementary information from heterogeneous data sources such as medical images, clinical records, and radiology reports. However, existing fusion methods process all available modalities through the network, either treating them equally or learning to assign different contribution weights, leaving a fundamental question unaddressed: for a given patient, should certain modalities be used at all? We present AdaFuse, an adaptive multimodal fusion framework that leverages reinforcement learning (RL) to learn patient-specific modality selection and fusion strategies for lung cancer risk prediction. AdaFuse formulates multimodal fusion as a sequential decision process, where the policy network iteratively decides whether to incorporate an additional modality or proceed to prediction based on the information already acquired. This sequential formulation enables the model to condition each selection on previously observed modalities and terminate early when sufficient information is available, rather than committing to a fixed subset upfront. We evaluate AdaFuse on the National Lung Screening Trial (NLST) dataset. Experimental results demonstrate that AdaFuse achieves the highest AUC (0.762) compared to the best single-modality baseline (0.732), the best fixed fusion strategy (0.759), and adaptive baselines including DynMM (0.754) and MoE (0.742), while using fewer FLOPs than all triple-modality methods. Our work demonstrates the potential of reinforcement learning for personalized multimodal fusion in medical imaging, representing a shift from uniform fusion strategies toward adaptive diagnostic pipelines that learn when to consult additional modalities and when existing information suffices for accurate prediction.
zh
[CV-257] Bridging the Semantic Chasm: Synergistic Conceptual Anchoring for Generalized Few-Shot and Zero-Shot OOD Perception
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在遇到分布外(Out-of-Distribution, OOD)概念时,跨模态对齐退化(cross-modal alignment degeneration)的问题。其核心解决方案是提出一种协同神经智能体网络(Synergistic Neural Agents Network, SynerNet),通过四个专用计算单元——视觉感知、语言上下文、命名嵌入和全局协调——基于结构化的消息传播协议协同修正模态差异。关键创新包括多智能体潜在空间命名获取框架、用于增强少样本适应的语义上下文交换算法,以及自适应动态平衡机制,从而显著提升模型在少样本和零样本场景下的性能表现。
链接: https://arxiv.org/abs/2602.00340
作者: Alexandros Christoforos,Sarah Jenkins,Michael Brown,Tuan Pham,David Chen
机构: Boston University (波士顿大学); Suffolk University (萨福克大学); Kyung Hee University (中央大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:This manuscript presents a pioneering Synergistic Neural Agents Network (SynerNet) framework designed to mitigate the phenomenon of cross-modal alignment degeneration in Vision-Language Models (VLMs) when encountering Out-of-Distribution (OOD) concepts. Specifically, four specialized computational units - visual perception, linguistic context, nominal embedding, and global coordination - collaboratively rectify modality disparities via a structured message-propagation protocol. The principal contributions encompass a multi-agent latent space nomenclature acquisition framework, a semantic context-interchange algorithm for enhanced few-shot adaptation, and an adaptive dynamic equilibrium mechanism. Empirical evaluations conducted on the VISTA-Beyond benchmark demonstrate that SynerNet yields substantial performance augmentations in both few-shot and zero-shot scenarios, exhibiting precision improvements ranging from 1.2% to 5.4% across a diverse array of domains.
zh
[CV-258] On the Assessment of Sensitivity of Autonomous Vehicle Perception
【速读】:该论文旨在解决自动驾驶车辆(AV)感知系统在复杂和恶劣驾驶场景下的鲁棒性不足问题,即如何准确评估并提升感知系统在真实世界中面对自然干扰(如雾、低光照)和对抗性条件(如道路障碍物遮挡)时的可靠性。其解决方案的关键在于提出一种基于模型集成的预测敏感性量化方法,通过捕捉多个计算机视觉模型在不同工况下的预测分歧与推理变异性,构建了一个用于评估感知性能的架构,并以车辆在不同路面条件下(干湿沥青)于停止标志前的制动距离作为核心评估指标,从而实现对感知系统鲁棒性的定量分析与改进策略探索。
链接: https://arxiv.org/abs/2602.00314
作者: Apostol Vassilev,Munawar Hasan,Edward Griffor,Honglan Jin,Pavel Piliptchak,Mahima Arora,Thoshitha Gamage
机构: NIST(美国国家标准与技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 17 figures
Abstract:The viability of automated driving is heavily dependent on the performance of perception systems to provide real-time accurate and reliable information for robust decision-making and maneuvers. These systems must perform reliably not only under ideal conditions, but also when challenged by natural and adversarial driving factors. Both of these types of interference can lead to perception errors and delays in detection and classification. Hence, it is essential to assess the robustness of the perception systems of automated vehicles (AVs) and explore strategies for making perception more reliable. We approach this problem by evaluating perception performance using predictive sensitivity quantification based on an ensemble of models, capturing model disagreement and inference variability across multiple models, under adverse driving scenarios in both simulated environments and real-world conditions. A notional architecture for assessing perception performance is proposed. A perception assessment criterion is developed based on an AV’s stopping distance at a stop sign on varying road surfaces, such as dry and wet asphalt, and vehicle speed. Five state-of-the-art computer vision models are used in our experiments, including YOLO (v8-v9), DEtection TRansformer (DETR50, DETR101), and Real-Time DEtection TRansformer (RT-DETR). Diminished lighting conditions, e.g., resulting from the presence of fog and low sun altitude, have the greatest impact on the performance of the perception models. Additionally, adversarial road conditions such as occlusions of roadway objects increase perception sensitivity and model performance drops when faced with a combination of adversarial road conditions and inclement weather conditions. Also, it is demonstrated that the greater the distance to a roadway object, the greater the impact on perception performance, hence diminished perception robustness.
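The stopping-distance criterion can be grounded in the textbook kinematics it implies: distance traveled during the perception/reaction delay plus braking distance v^2 / (2 * mu * g), where the friction coefficient mu differs between dry and wet asphalt. The mu values and reaction time below are standard approximations, not the paper's calibrated parameters.

```python
def stopping_distance(v_kmh, mu, t_react=1.0, g=9.81):
    """Stopping distance at a stop sign: reaction travel plus braking
    distance v^2 / (2 mu g)."""
    v = v_kmh / 3.6                      # km/h -> m/s
    return v * t_react + v ** 2 / (2 * mu * g)

for surface, mu in [("dry asphalt", 0.7), ("wet asphalt", 0.4)]:
    print(surface, round(stopping_distance(50, mu), 1), "m")
# A detection arriving closer than this distance fails the criterion.
```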
zh
[CV-259] Opportunistic Promptable Segmentation: Leveraging Routine Radiological Annotations to Guide 3D CT Lesion Segmentation
【速读】:该论文旨在解决医学影像领域中3D CT图像分割数据集构建成本高昂的问题,尤其是高质量、多样化标注数据的稀缺性。传统方法依赖放射科医生对关键病灶进行耗时费力的手动3D分割标注,而临床实践中大量存在的稀疏标注(如箭头和直线标记)通常被存储在PACS系统中的GSPS对象中,未被有效利用。解决方案的关键在于提出一种名为SAM2CT的新模型,它基于可提示分割(promptable segmentation)范式,首次专门设计用于将放射科医生的稀疏标注转化为3D CT体积中的分割结果。其核心技术创新包括:扩展提示编码器以支持箭头与线段输入,并引入专为3D医学图像设计的Memory-Conditioned Memories(MCM)记忆编码策略,从而实现从历史GSPS标注中高效生成可临床使用的3D分割结果。
链接: https://arxiv.org/abs/2602.00309
作者: Samuel Church,Joshua D. Warner,Danyal Maqbool,Xin Tie,Junjie Hu,Meghan G. Lubner,Tyler J. Bradshaw
机构: University of Wisconsin–Madison (威斯康星大学麦迪逊分校); Department of Computer Sciences (计算机科学系); Department of Radiology (放射科)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The development of machine learning models for CT imaging depends on the availability of large, high-quality, and diverse annotated datasets. Although large volumes of CT images and reports are readily available in clinical picture archiving and communication systems (PACS), 3D segmentations of critical findings are costly to obtain, typically requiring extensive manual annotation by radiologists. On the other hand, it is common for radiologists to provide limited annotations of findings during routine reads, such as line measurements and arrows, that are often stored in PACS as GSPS objects. We posit that these sparse annotations can be extracted along with CT volumes and converted into 3D segmentations using promptable segmentation models, a paradigm we term Opportunistic Promptable Segmentation. To enable this paradigm, we propose SAM2CT, the first promptable segmentation model designed to convert radiologist annotations into 3D segmentations in CT volumes. SAM2CT builds upon SAM2 by extending the prompt encoder to support arrow and line inputs and by introducing Memory-Conditioned Memories (MCM), a memory encoding strategy tailored to 3D medical volumes. On public lesion segmentation benchmarks, SAM2CT outperforms existing promptable segmentation models and similarly trained baselines, achieving Dice similarity coefficients of 0.649 for arrow prompts and 0.757 for line prompts. Applying the model to pre-existing GSPS annotations from a clinical PACS (N = 60), SAM2CT generates 3D segmentations that are clinically acceptable or require only minor adjustments in 87% of cases, as scored by radiologists. Additionally, SAM2CT demonstrates strong zero-shot performance on select Emergency Department findings. These results suggest that large-scale mining of historical GSPS annotations represents a promising and scalable approach for generating 3D CT segmentation datasets.
zh
[CV-260] LogicGaze: Benchmarking Causal Consistency in Visual Narratives via Counterfactual Verification
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在执行复杂多模态任务时,其推理链是否能够基于实际视觉证据进行可靠验证的问题,尤其关注模型幻觉(hallucination)现象。解决方案的关键在于提出一个名为LogicGaze的新基准框架,该框架通过整合因果序列与视觉矛盾但语言上合理的扰动(perturbations),迫使模型对每一步推理的真实性进行验证,并采用三重评估协议——因果验证(Causal Validation)、具身叙事合成(Grounded Narrative Synthesis)和扰动拒绝(Perturbation Rejection),系统性地暴露当前先进VLMs(如Qwen2.5-VL-72B)在多模态推理中的脆弱性,从而推动更鲁棒、可信的多模态推理发展。
链接: https://arxiv.org/abs/2602.00292
作者: Rory Driscoll,Alexandros Christoforos,Chadbourne Davis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:While sequential reasoning enhances the capability of Vision-Language Models (VLMs) to execute complex multimodal tasks, their reliability in grounding these reasoning chains within actual visual evidence remains insufficiently explored. We introduce LogicGaze, a novel benchmark framework designed to rigorously interrogate whether VLMs can validate sequential causal chains against visual inputs, specifically targeting the pervasive issue of hallucination. Curated from 40,000 video segments from ShareGPT4Video and a subset of Flickr30k imagery, LogicGaze integrates causal sequences with visually contradictory yet linguistically plausible perturbations, compelling models to verify the authenticity of each reasoning step. Our tripartite evaluation protocol - Causal Validation, Grounded Narrative Synthesis, and Perturbation Rejection - exposes significant vulnerabilities in state-of-the-art VLMs such as Qwen2.5-VL-72B. LogicGaze advocates for robust, trustworthy multimodal reasoning, with all resources publicly available in an anonymized repository.
zh
[CV-261] Computer Vision and Its Relationship to Cognitive Science: A perspective from Bayes Decision Theory
【速读】:该论文试图解决的问题是如何在计算机视觉(Computer Vision)中统一和理解两种主流方法——贝叶斯视角(Bayesian viewpoint)与深度神经网络(Deep Neural Network)方法之间的理论关系及其局限性。其解决方案的关键在于引入贝叶斯决策理论(Bayes Decision Theory, BDT)作为理论框架,该框架不仅能够捕捉两种方法的核心思想:贝叶斯视角提供了与认知科学(Cognitive Science)高度契合的概念基础,而深度神经网络则源于视觉腹侧通路(ventral stream)的层次结构并取得了实际应用的成功;同时,BDT还能揭示二者各自的优缺点,并为进一步融合二者、构建更丰富的统一理论框架指明方向。
链接: https://arxiv.org/abs/2602.00289
作者: Alan Yuille,Daniel Kersten
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This document presents an introduction to computer vision, and its relationship to Cognitive Science, from the perspective of Bayes Decision Theory (Berger 1985). Computer vision is a vast and complex field, so this overview has a narrow scope and provides a theoretical lens which captures many key concepts. BDT is rich enough to include two different approaches: (i) the Bayesian viewpoint, which gives a conceptually attractive framework for vision with concepts that resonate with Cognitive Science (Griffiths et al., 2024), and (ii) the Deep Neural Network approach whose successes in the real world have made Computer Vision into a trillion-dollar industry and which is motivated by the hierarchical structure of the visual ventral stream. The BDT framework relates and captures the strengths and weakness of these two approaches and, by discussing the limitations of BDT, points the way to how they can be combined in a richer framework.
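For reference, the core of Bayes Decision Theory that the overview builds on fits in two lines: a posterior over scene interpretations and a decision rule minimizing posterior expected loss (risk). This is the standard formulation, not quoted from the document:

```latex
% Bayes decision rule: pick the interpretation that minimizes the
% posterior expected loss (risk). s = scene, x = image, a = decision.
\[
  R(a \mid x) = \int L(a, s)\, p(s \mid x)\, ds,
  \qquad
  a^{*}(x) = \arg\min_{a} R(a \mid x),
\]
\[
  p(s \mid x) = \frac{p(x \mid s)\, p(s)}{p(x)} .
\]
```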
zh
[CV-262] TimeBlind: A Spatio-Temporal Compositionality Benchmark for Video LLMs
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在视频理解中对时间动态性(temporal dynamics)建模能力薄弱的问题,即尽管MLLMs在静态语义理解上表现优异,但在细粒度时空推理任务中仍存在显著缺陷。其解决方案的关键在于提出一个名为TimeBlind的诊断性基准,该基准基于认知科学中的三层次时间理解框架——识别原子事件、刻画事件属性和推理事件间依赖关系,并采用最小差异对(minimal-pairs paradigm)设计视频对:两段视频具有完全相同的静态视觉内容但仅在时序结构上不同,同时通过互补问题设置消除语言先验干扰,从而精准评估模型是否真正掌握时间逻辑而非依赖静态视觉捷径。实验表明,即使是最先进的MLLM在该基准上的实例准确率(Instance Accuracy)仅为48.2%,远低于人类水平(98.2%),验证了现有模型对时间动态性的理解不足,凸显了TimeBlind作为下一代视频理解研究关键诊断工具的价值。
链接: https://arxiv.org/abs/2602.00288
作者: Baiqi Li,Kangyi Zhao,Ce Zhang,Chancharik Mitra,Jean de Dieu Nyandwi,Gedas Bertasius
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: For code and data, see this https URL
Abstract:Fine-grained spatio-temporal understanding is essential for video reasoning and embodied AI. Yet, while Multimodal Large Language Models (MLLMs) master static semantics, their grasp of temporal dynamics remains brittle. We present TimeBlind, a diagnostic benchmark for compositional spatio-temporal understanding. Inspired by cognitive science, TimeBlind categorizes fine-grained temporal understanding into three levels: recognizing atomic events, characterizing event properties, and reasoning about event interdependencies. Unlike benchmarks that conflate recognition with temporal reasoning, TimeBlind leverages a minimal-pairs paradigm: video pairs share identical static visual content but differ solely in temporal structure, utilizing complementary questions to neutralize language priors. Evaluating over 20 state-of-the-art MLLMs (e.g., GPT-5, Gemini 3 Pro) on 600 curated instances (2400 video-question pairs) reveals that the Instance Accuracy (correctly distinguishing both videos in a pair) of the best-performing MLLM is only 48.2%, far below human performance (98.2%). These results demonstrate that even frontier models rely heavily on static visual shortcuts rather than genuine temporal logic, positioning TimeBlind as a vital diagnostic tool for next-generation video understanding. Dataset and code are available at this https URL.
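The Instance Accuracy metric is easy to state precisely: a minimal pair counts as correct only when the model answers correctly on both videos, which is exactly what exposes static-shortcut behavior, since per-video accuracy can look respectable while instance accuracy collapses toward chance. A sketch, assuming a list of per-pair outcomes:

```python
def instance_accuracy(results):
    """Instance Accuracy over minimal pairs: an instance counts only
    if the model is correct on BOTH temporally mirrored videos.
    results: list of (correct_on_video_a, correct_on_video_b)."""
    hits = sum(1 for a, b in results if a and b)
    return hits / len(results)

results = [(True, True), (True, False), (False, True), (True, True)]
print(instance_accuracy(results))  # 0.5, though per-video accuracy is 0.75
```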
zh
[CV-263] TokenTrim: Inference-Time Token Pruning for Autoregressive Long Video Generation
【速读】:该论文旨在解决自回归视频生成(auto-regressive video generation)中存在的时序漂移(temporal drift)问题,即在长视频合成过程中,由于推理阶段误差的累积与放大导致的视频内容不一致。其解决方案的关键在于提出一种简单的推理阶段修正方法:通过识别并移除不稳定(unstable)的潜在条件令牌(latent conditioning tokens),这些令牌表示与前一帧批次显著偏离,暗示可能已发生语义漂移或污染。该方法不修改模型架构或训练过程,仅在推理时剔除不可靠的潜在信息,从而有效抑制误差传播,显著提升长时间序列的时序一致性。
链接: https://arxiv.org/abs/2602.00268
作者: Ariel Shaulov,Eitan Shaar,Amit Edenzon,Lior Wolf
机构: Tel Aviv University (特拉维夫大学); Bar-Ilan University (巴伊兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Auto-regressive video generation enables long video synthesis by iteratively conditioning each new batch of frames on previously generated content. However, recent work has shown that such pipelines suffer from severe temporal drift, where errors accumulate and amplify over long horizons. We hypothesize that this drift does not primarily stem from insufficient model capacity, but rather from inference-time error propagation. Specifically, we contend that drift arises from the uncontrolled reuse of corrupted latent conditioning tokens during auto-regressive inference. To correct this accumulation of errors, we propose a simple, inference-time method that mitigates temporal drift by identifying and removing unstable latent tokens before they are reused for conditioning. For this purpose, we define unstable tokens as latent tokens whose representations deviate significantly from those of the previously generated batch, indicating potential corruption or semantic drift. By explicitly removing corrupted latent tokens from the auto-regressive context, rather than modifying entire spatial regions or model parameters, our method prevents unreliable latent information from influencing future generation steps. As a result, it significantly improves long-horizon temporal consistency without modifying the model architecture or training procedure, and without leaving the latent space.
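A minimal sketch of the pruning rule defined above: score each latent conditioning token by how far its representation deviates from the previous batch and drop outliers before they are reused. The L2 deviation and z-score threshold are assumptions; the abstract does not give the exact statistic.

```python
import torch

def trim_unstable_tokens(ctx_tokens, prev_tokens, tau=2.0):
    """Drop conditioning tokens whose representation drifted from the
    previously generated batch. ctx_tokens, prev_tokens: (N, d);
    tau is an assumed z-score threshold on per-token deviation."""
    dev = (ctx_tokens - prev_tokens).norm(dim=-1)          # (N,)
    z = (dev - dev.mean()) / dev.std().clamp_min(1e-8)
    keep = z < tau
    return ctx_tokens[keep], keep

ctx = torch.randn(256, 64)
prev = ctx + 0.01 * torch.randn(256, 64)
prev[7] += 5.0                     # simulate one corrupted token
kept, mask = trim_unstable_tokens(ctx, prev)
print(kept.shape, int((~mask).sum()))  # token 7 removed
```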
zh
[CV-264] PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories
【速读】:该论文旨在解决生成式 AI (Generative AI) 在工作室级多对象合成(multi-object compositing)任务中的不足,具体表现为现有模型难以同时实现:(i) 保持每个物体的精确身份一致性,(ii) 确保背景与色彩的真实性,(iii) 对布局和设计元素进行可控操作,以及 (iv) 展示完整且视觉吸引人的多物体组合。当前方法常出现物体细节失真、遗漏或重复对象、相对尺寸错误及呈现不一致等问题。解决方案的关键在于提出 PLACID 框架,其核心创新为:一是利用预训练图像到视频(image-to-video, I2V)扩散模型结合文本控制,通过视频中的时间先验(temporal priors)来维持物体一致性与背景细节;二是设计一种新颖的数据整理策略,生成合成视频序列,其中随机放置的物体平滑移动至目标位置,使训练数据符合视频模型的时间先验特性。推理时,物体从随机初始位置经文本引导收敛至结构合理布局,最终帧即为高质量复合图像。
链接: https://arxiv.org/abs/2602.00267
作者: Gemma Canet Tarrés,Manel Baradad,Francesc Moreno-Noguer,Yumeng Li
机构: Amazon(亚马逊)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in generative AI have dramatically improved photorealistic image synthesis, yet they fall short for studio-level multi-object compositing. This task demands simultaneous (i) near-perfect preservation of each item’s identity, (ii) precise background and color fidelity, (iii) layout and design elements control, and (iv) complete, appealing displays showcasing all objects. However, current state-of-the-art models often alter object details, omit or duplicate objects, and produce layouts with incorrect relative sizing or inconsistent item presentations. To bridge this gap, we introduce PLACID, a framework that transforms a collection of object images into an appealing multi-object composite. Our approach makes two main contributions. First, we leverage a pretrained image-to-video (I2V) diffusion model with text control to preserve objects consistency, identities, and background details by exploiting temporal priors from videos. Second, we propose a novel data curation strategy that generates synthetic sequences where randomly placed objects smoothly move to their target positions. This synthetic data aligns with the video model’s temporal priors during training. At inference, objects initialized at random positions consistently converge into coherent layouts guided by text, with the final frame serving as the composite image. Extensive quantitative evaluations and user studies demonstrate that PLACID surpasses state-of-the-art methods in multi-object compositing, achieving superior identity, background, and color preservation, with less omitted objects and visually appealing results.
zh
[CV-265] World-Shaper: A Unified Framework for 360° Panoramic Editing
【速读】:该论文旨在解决全景图像编辑中因传统基于透视的编辑方法无法建模全景图空间结构而导致的几何失真与全局一致性破坏问题,尤其是立方体贴图分解方式与球面几何不匹配所引发的局部不一致。其解决方案的关键在于提出一个统一的、几何感知的框架 World-Shaper,直接在等距圆柱投影(equirectangular projection, ERP)域内进行编辑,并采用“生成-编辑”范式以缓解配对数据稀缺问题;同时引入几何感知学习策略,通过显式的位姿感知形状监督和隐式的渐进式训练机制内化全景先验,从而显著提升编辑的几何一致性、保真度及文本可控性。
链接: https://arxiv.org/abs/2602.00265
作者: Dong Liang,Yuhao Liu,Jinyuan Jia,Youjun Zhao,Rynson W.H.Lau
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Being able to edit panoramic images is crucial for creating realistic 360° visual experiences. However, existing perspective-based image editing methods fail to model the spatial structure of panoramas. Conventional cube-map decompositions attempt to overcome this problem but inevitably break global consistency due to their mismatch with spherical geometry. Motivated by this insight, we reformulate panoramic editing directly in the equirectangular projection (ERP) domain and present World-Shaper, a unified geometry-aware framework that bridges panoramic generation and editing within a single editing-centric design. To overcome the scarcity of paired data, we adopt a generate-then-edit paradigm, where controllable panoramic generation serves as an auxiliary stage to synthesize diverse paired examples for supervised editing learning. To address geometric distortion, we introduce a geometry-aware learning strategy that explicitly enforces position-aware shape supervision and implicitly internalizes panoramic priors through progressive training. Extensive experiments on our new benchmark, PEBench, demonstrate that our method achieves superior geometric consistency, editing fidelity, and text controllability compared to SOTA methods, enabling coherent and flexible 360° visual world creation with unified editing control. Code, model, and data will be released at our project page: this https URL
zh
[CV-266] Subspace Clustering on Incomplete Data with Self-Supervised Contrastive Learning
【速读】:该论文旨在解决子空间聚类(subspace clustering)在实际应用中面临的数据缺失问题,即现有方法通常假设数据完全可观测,但在真实场景(如计算机视觉、高光谱成像等)中常存在大量缺失值,导致性能下降。解决方案的关键在于提出一种对比自监督框架——对比子空间聚类(Contrastive Subspace Clustering, CSC):通过生成输入数据的掩码视图(masked views),利用SimCLR风格的对比损失训练深度神经网络以学习对缺失数据鲁棒的不变嵌入(invariant embeddings),随后结合稀疏子空间聚类(sparse subspace clustering)进行最终聚类。该方法在六个基准数据集上显著优于传统与深度学习基线模型,展现出对缺失数据的强大鲁棒性和大规模可扩展性。
链接: https://arxiv.org/abs/2602.00262
作者: Huanran Li,Daniel Pimentel-Alarcón
机构: University of Wisconsin-Madison (威斯康星大学麦迪逊分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Subspace clustering aims to group data points that lie in a union of low-dimensional subspaces and finds wide application in computer vision, hyperspectral imaging, and recommendation systems. However, most existing methods assume fully observed data, limiting their effectiveness in real-world scenarios with missing entries. In this paper, we propose a contrastive self-supervised framework, Contrastive Subspace Clustering (CSC), designed for clustering incomplete data. CSC generates masked views of partially observed inputs and trains a deep neural network using a SimCLR-style contrastive loss to learn invariant embeddings. These embeddings are then clustered using sparse subspace clustering. Experiments on six benchmark datasets show that CSC consistently outperforms both classical and deep learning baselines, demonstrating strong robustness to missing data and scalability to large datasets.
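下面给出一个极简的 PyTorch 草图,演示摘要中"对部分观测数据生成掩码视图 + SimCLR 式对比损失"的基本形态;其中函数名、掩码保留比例等均为示意性假设,并非 CSC 原文实现:

```python
# Illustrative sketch only: masked views plus a SimCLR-style NT-Xent loss.
import torch
import torch.nn.functional as F

def masked_views(x, obs_mask, keep_frac=0.7):
    """Generate two random masked views of partially observed data.
    x: (N, D) with missing entries zero-filled; obs_mask: (N, D) binary."""
    def one_view():
        keep = (torch.rand_like(x) < keep_frac).float() * obs_mask
        return x * keep
    return one_view(), one_view()

def nt_xent(z1, z2, tau=0.5):
    """SimCLR NT-Xent loss over the 2N embeddings of both views."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)      # (2N, d)
    sim = z @ z.t() / tau
    sim.fill_diagonal_(float("-inf"))                       # drop self-pairs
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```

训练时将两组掩码视图分别送入编码器得到 z1、z2 并最小化 nt_xent;推理阶段再在学到的嵌入上做稀疏子空间聚类。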
zh
[CV-267] SANEval: Open-Vocabulary Compositional Benchmarks with Failure-mode Diagnosis
【速读】:该论文旨在解决当前文本到图像(Text-to-Image, T2I)模型在忠实生成复杂提示词(包含多个对象、属性和空间关系)时存在的评估瓶颈问题。现有基准测试受限于封闭词汇表、缺乏细粒度诊断能力,且无法提供可解释的反馈以识别和修复特定的组合性失败。解决方案的关键在于提出SANEval(Spatial, Attribute, and Numeracy Evaluation),这是一个基于大语言模型(Large Language Model, LLM)的开放词汇组合评估基准,其核心创新是结合LLM进行深层提示理解与LLM增强的开放词汇目标检测器,从而实现不受固定词汇限制的组合一致性鲁棒评估。通过在六种先进T2I模型上的实验验证,SANEval的自动化评估结果相较于人类判断更具代表性,并在属性绑定、空间关系和数量感知等任务中展现出比现有基准更高的斯皮尔曼等级相关性。
链接: https://arxiv.org/abs/2602.00249
作者: Rishav Pramanik,Ian E. Nielsen,Jeff Smith,Saurav Pandit,Ravi P. Ramachandran,Zhaozheng Yin
机构: Stony Brook University (石溪大学); 2nd Set AI Corp.; Rowan University (罗文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid progress of text-to-image (T2I) models has unlocked unprecedented creative potential, yet their ability to faithfully render complex prompts involving multiple objects, attributes, and spatial relationships remains a significant bottleneck. Progress is hampered by a lack of adequate evaluation methods; current benchmarks are often restricted to closed-set vocabularies, lack fine-grained diagnostic capabilities, and fail to provide the interpretable feedback necessary to diagnose and remedy specific compositional failures. We solve these challenges by introducing SANEval (Spatial, Attribute, and Numeracy Evaluation), a comprehensive benchmark that establishes a scalable new pipeline for open-vocabulary compositional evaluation. SANEval combines a large language model (LLM) for deep prompt understanding with an LLM-enhanced, open-vocabulary object detector to robustly evaluate compositional adherence, unconstrained by a fixed vocabulary. Through extensive experiments on six state-of-the-art T2I models, we demonstrate that SANEval’s automated evaluations provide a more faithful proxy for human assessment; our metric achieves a higher Spearman’s rank correlation with human judgments than existing benchmarks, with statistically significant differences, across tasks of attribute binding, spatial relations, and numeracy. To facilitate future research in compositional T2I generation and evaluation, we will release the SANEval dataset and our open-source evaluation pipeline.
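评测环节的核心是自动分数与人工评分之间的斯皮尔曼等级相关性,可用 scipy 直接计算;下方分数数组仅为占位示例,并非论文数据:

```python
# Placeholder example of the Spearman correlation check described above.
from scipy.stats import spearmanr

auto_scores  = [0.91, 0.42, 0.77, 0.15, 0.63]   # hypothetical per-prompt SANEval scores
human_scores = [0.88, 0.50, 0.70, 0.10, 0.60]   # hypothetical mean human ratings

rho, p_value = spearmanr(auto_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
```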
zh
[CV-268] CAPA: Contribution-Aware Pruning and FFN Approximation for Efficient Large Vision-Language Models
【速读】:该论文旨在解决大视觉语言模型(Large Vision-Language Models)在推理过程中因处理数千个视觉标记(visual tokens)而导致的高计算成本问题,核心挑战在于如何安全地移除冗余或低贡献的视觉标记与计算单元而不显著损害模型性能。其解决方案的关键在于提出一种基于“注意力贡献”(Attention Contribution)的新指标,该指标通过将注意力概率与值向量(value vector)的模长加权,更准确地衡量每个视觉标记的实际功能重要性;在此基础上,论文进一步识别出视觉注意力机制中存在功能异质性的两类标记:可安全剪枝的“概率堆积区”(Probability Dumps)和对性能至关重要的“结构锚点”(Structural Anchors),并发现视觉标记在前馈网络(Feed-Forward Networks, FFNs)中存在显著冗余,尤其是在中间层呈现线性行为。据此,作者提出CAPA(Contribution-Aware Pruning and FFN Approximation)框架,采用双策略:一是在关键功能转换节点上依据注意力贡献进行视觉标记剪枝,二是利用高效线性近似降低FFN计算复杂度,从而实现高效的性能-效率平衡,并提升模型鲁棒性。
链接: https://arxiv.org/abs/2602.00247
作者: Samyak Jha,Junho Kim
机构: Indian Institute of Technology (ISM) (印度理工学院(伊斯姆)); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Efficient inference in Large Vision-Language Models is constrained by the high cost of processing thousands of visual tokens, yet it remains unclear which tokens and computations can be safely removed. While attention scores are commonly used to estimate visual token importance, they are an imperfect proxy for actual contribution. We show that Attention Contribution, which weights attention probabilities by value vector magnitude, provides a more accurate criterion for visual token selection. Our empirical analysis reveals that visual attention sinks are functionally heterogeneous, comprising Probability Dumps with low contribution that can be safely pruned, and Structural Anchors with high contribution essential for maintaining model performance. Further, we identify substantial redundancy in Feed-Forward Networks (FFNs) associated with visual tokens, particularly in intermediate layers where image tokens exhibit linear behavior. Based on our findings, we introduce CAPA (Contribution-Aware Pruning and FFN Approximation), a dual-strategy framework that prunes visual tokens using attention contribution at critical functional transitions and reduces FFN computation through efficient linear approximations. Experiments on various benchmarks across baselines show that CAPA achieves competent efficiency–performance trade-offs with improved robustness.
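摘要中"注意力贡献 = 注意力概率 × 值向量模长"这一判据可以写成几行张量运算;以下是我们对该定义的极简复现草图,剪枝比例等超参数为示意性假设:

```python
# Sketch of the Attention Contribution criterion (our reading of the abstract).
import torch

def attention_contribution(attn_probs, values):
    """attn_probs: (heads, Q, K) softmax weights; values: (heads, K, d).
    Returns a per-key importance score averaged over heads and queries."""
    v_norm = values.norm(dim=-1)                 # (heads, K) value magnitudes
    contrib = attn_probs * v_norm.unsqueeze(1)   # weight probabilities by |v_k|
    return contrib.mean(dim=(0, 1))              # (K,)

def prune_tokens(scores, keep_ratio=0.5):
    """Keep the top fraction of visual tokens by contribution."""
    k = max(1, int(scores.numel() * keep_ratio))
    return scores.topk(k).indices
```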
zh
[CV-269] MapDream: Task-Driven Map Learning for Vision-Language Navigation
【速读】:该论文旨在解决视觉语言导航(Vision-Language Navigation, VLN)中传统地图表示方法依赖手工设计、与导航策略解耦的问题。现有方法构建的地图通常独立于导航决策过程,未能有效聚焦于任务关键的空间信息。解决方案的关键在于提出“MapDream”框架,将地图构建建模为自回归式鸟瞰图(Bird’s-eye-view, BEV)图像生成任务,通过端到端联合学习地图生成与动作预测,使地图成为由导航目标直接驱动的紧凑表征——仅保留对导航至关重要的可操作性特征(affordances)。该方法利用监督预训练建立可靠的映射到控制接口,并借助自回归结构实现强化学习微调下的联合优化,在R2R-CE和RxR-CE数据集上取得了单目条件下的最先进性能,验证了任务驱动的生成式地图学习的有效性。
链接: https://arxiv.org/abs/2602.00222
作者: Guoxin Lian,Shuo Wang,Yucheng Wang,Yongcai Wang,Maiyue Chen,Kaihui Wang,Bo Zhang,Zhizhong Su,Deying Li,Zhaoxin Fan
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-Language Navigation (VLN) requires agents to follow natural language instructions in partially observed 3D environments, motivating map representations that aggregate spatial context beyond local perception. However, most existing approaches rely on hand-crafted maps constructed independently of the navigation policy. We argue that maps should instead be learned representations shaped directly by navigation objectives rather than exhaustive reconstructions. Based on this insight, we propose MapDream, a map-in-the-loop framework that formulates map construction as autoregressive bird’s-eye-view (BEV) image synthesis. The framework jointly learns map generation and action prediction, distilling environmental context into a compact three-channel BEV map that preserves only navigation-critical affordances. Supervised pre-training bootstraps a reliable mapping-to-control interface, while the autoregressive design enables end-to-end joint optimization through reinforcement fine-tuning. Experiments on R2R-CE and RxR-CE achieve state-of-the-art monocular performance, validating task-driven generative map learning.
zh
[CV-270] Development of a Cacao Disease Identification and Management App Using Deep Learning
【速读】:该论文旨在解决菲律宾小农户在可可(cacao)种植中因缺乏数据、信息和良好农业实践而导致的病虫害管理困难问题。解决方案的关键在于开发了一款离线运行的移动应用程序,其核心是一个基于深度学习的病害识别模型,能够准确诊断可可病害并评估黑荚病(black pod disease)感染程度,从而为偏远地区的农民提供即时、可靠的技术支持。
链接: https://arxiv.org/abs/2602.00216
作者: Zaldy Pagaduan,Jason Occidental,Nathaniel Duro,Dexielito Badilles,Eleonor Palconit
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Image and Video Processing (eess.IV)
备注: 6 pages, 8 figures, preprint
Abstract:Smallholder cacao producers often rely on outdated farming techniques and face significant challenges from pests and diseases, unlike larger plantations with more resources and expertise. In the Philippines, cacao farmers have limited access to data, information, and good agricultural practices. This study addresses these issues by developing a mobile application for cacao disease identification and management that functions offline, enabling use in remote areas where farms are mostly located. The core of the system is a deep learning model trained to identify cacao diseases accurately. The trained model is integrated into the mobile app to support farmers in field diagnosis. The disease identification model achieved a validation accuracy of 96.93% while the model for detecting cacao black pod infection levels achieved 79.49% validation accuracy. Field testing of the application showed an agreement rate of 84.2% compared with expert cacao technician assessments. This approach empowers smallholder farmers by providing accessible, technology-enabled tools to improve cacao crop health and productivity.
zh
[CV-271] A Geometric Multimodal Foundation Model Integrating Bp-MRI and Clinical Reports in Prostate Cancer Classification
【速读】:该论文旨在解决前列腺癌(Prostate Cancer, PCa)诊断中因依赖专家主观判断而导致的不一致性问题,以及现有计算机辅助诊断方法多局限于影像模型、忽视临床变量信息且受数据稀缺限制而难以学习鲁棒表示的问题。其解决方案的关键在于提出一种几何多模态基础模型(Multimodal Foundation Model, MFM-Geom),通过联合学习双参数磁共振成像(bi-parametric MRI, bp-MRI)与临床报告中的视觉特征和临床变量信息,利用对称正定(Symmetric Positive Definite, SPD)矩阵与黎曼深度学习技术,在分类头中实现跨模态表示的有效融合,从而在仅使用10%训练数据的情况下显著提升性能(AUC-PR达90.67),并在外部数据集上验证了模型微调后的泛化能力(AUC-PR为90.6)。
链接: https://arxiv.org/abs/2602.00214
作者: Juan A. Olmos,Antoine Manzanera,Fabio Martínez
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at IEEE International Symposium on Biomedical Imaging (ISBI) 2026
Abstract:Prostate cancer (PCa) is one of the most common cancers in men worldwide. Bi-parametric MRI (bp-MRI) and clinical variables are crucial for PCa identification and improving treatment decisions. However, this process is subject to subjective expert interpretation. Furthermore, most existing computer-aided diagnosis methods focus on imaging-based models, overlooking the clinical context and suffering from data scarcity, limiting their ability to learn robust representations. We propose a geometric multimodal Foundation Model (FM), named MFM-Geom, that learns representations from bp-MRI and clinical reports, encoding visual findings and information from the context of clinical variables. In the representation classification head, the approach leverages symmetric positive definite (SPD) matrices and Riemannian deep learning to integrate imaging-text representations from a biomedical multimodal FM. Using 10% of the training data, MFM-Geom outperformed baseline class token embedding-based classification (+8.3%, AUC-PR of 90.67). Generalization on an external dataset confirmed the robustness of fine-tuning the biomedical FM, achieving an AUC-PR of 90.6.
zh
[CV-272] Deep Learning Based CNN Model for Automated Detection of Pneumonia from Chest XRay Images
【速读】:该论文旨在解决肺炎(pneumonia)在儿科和老年群体中高发且医疗资源匮乏地区诊断困难的问题,传统基于人工解读胸片的方法受限于观察者间差异、专家疲劳及放射科医生短缺等因素。其解决方案的关键在于提出一种定制化的卷积神经网络(Convolutional Neural Network, CNN)架构,采用深度可分离卷积(depth-wise separable convolution)设计以降低冗余参数并提升对灰度医学图像纹理特征的识别效率;同时结合对比度受限自适应直方图均衡化(Contrast Limited Adaptive Histogram Equalization, CLAHE)与几何增强(geometric augmentation)两种预处理技术,有效缓解类别不平衡问题并提高模型泛化能力。
链接: https://arxiv.org/abs/2602.00212
作者: Sathish Krishna Anumula,Vetrivelan Tamilmani,Aniruddha Arjun Singh,Dinesh Rajendran,Venkata Deepak Namburi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 Pages, 2 Tables, 6 Figures
Abstract:Pneumonia has been one of the major causes of morbidity and mortality in the world, and the prevalence of this disease is disproportionately high among the pediatric and elderly populations, especially in resource-strained areas. Fast and precise diagnosis is a prerequisite for successful clinical intervention, but traditional approaches that rely on manual interpretation of chest radiographs are frequently constrained by inter-observer variation, fatigue among experts, and a shortage of qualified radiologists. To address these problems, this paper introduces a unified automated diagnostic model using a custom Convolutional Neural Network (CNN) that can recognize pneumonia in chest X-ray images with high precision and at minimal computational expense. In contrast to generic transfer-learning-based models, which often possess redundant parameters, the offered architecture uses a tailor-made depth-wise separable convolutional design optimized for the textural characteristics of grayscale medical images. Contrast Limited Adaptive Histogram Equalization (CLAHE) and geometric augmentation are two significant preprocessing techniques used to ensure that the system does not suffer from class imbalance and generalizes well. The system is tested using a dataset of 5,863 anterior-posterior chest X-rays.
zh
[CV-273] Interpretable Unsupervised Deformable Image Registration via Confidence-bound Multi-Hop Visual Reasoning
【速读】:该论文旨在解决无监督可变形医学图像配准(deformable image registration)中缺乏透明性和可靠性的问题,尤其是在复杂解剖结构对齐时容易出现误差漂移、临床信任度低的挑战。其解决方案的关键在于提出一种多跳视觉推理链(Multi-Hop Visual Chain of Reasoning, VCoR)框架,将配准过程建模为逐步推理的迭代过程:每一轮“推理跳”集成局部空间精化(Localized Spatial Refinement, LSR)模块以增强特征表达,并引入跨参考注意力(Cross-Reference Attention, CRA)机制引导迭代优化,从而保持解剖一致性。该设计不仅提升了对大形变的鲁棒性,还通过中间预测序列的稳定性与收敛性提供不确定性估计,实现了高精度配准与内在可解释性的统一。
链接: https://arxiv.org/abs/2602.00211
作者: Zafar Iqbal,Anwar Ul Haq,Srimannarayana Grandhi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Unsupervised deformable image registration requires aligning complex anatomical structures without reference labels, making interpretability and reliability critical. Existing deep learning methods achieve considerable accuracy but often lack transparency, leading to error drift and reduced clinical trust. We propose a novel Multi-Hop Visual Chain of Reasoning (VCoR) framework that reformulates registration as a progressive reasoning process. Inspired by the iterative nature of clinical decision-making, each visual reasoning hop integrates a Localized Spatial Refinement (LSR) module to enrich feature representations and a Cross-Reference Attention (CRA) mechanism that leads the iterative refinement process, preserving anatomical consistency. This multi-hop strategy enables robust handling of large deformations and produces a transparent sequence of intermediate predictions with a theoretical bound. Beyond accuracy, our framework offers built-in interpretability by estimating uncertainty via the stability and convergence of deformation fields across hops. Extensive evaluations on two challenging public datasets, DIR-Lab 4D CT (lung) and IXI T1-weighted MRI (brain), demonstrate that VCoR achieves competitive registration accuracy while offering rich intermediate visualizations and confidence measures. By embedding an implicit visual reasoning paradigm, we present an interpretable, reliable, and clinically viable unsupervised medical image registration.
zh
[CV-274] Reducing Class-Wise Performance Disparity via Margin Regularization ICLR2026
【速读】:该论文旨在解决深度神经网络在分类任务中普遍存在的类间性能差异问题,即即使在类别平衡数据上训练,不同类别的准确率仍存在显著不均衡,影响模型的可靠部署。其解决方案的关键在于提出一种理论驱动的正则化方法——Margin Regularization for Performance Disparity Reduction (MR²),通过在logit空间和表示空间中动态调整每类的边际(margin),以缓解性能差距。具体而言,MR²根据每类特征分布的离散程度自适应地增大难类的logit边际,并对表示空间中的边际进行惩罚以增强类内紧凑性,从而在提升整体准确率的同时显著改善难类表现,且不牺牲易类性能。
链接: https://arxiv.org/abs/2602.00205
作者: Beier Zhu,Kesen Zhao,Jiequan Cui,Qianru Sun,Yuan Zhou,Xun Yang,Hanwang Zhang
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: To appear in ICLR 2026
Abstract:Deep neural networks often exhibit substantial disparities in class-wise accuracy, even when trained on class-balanced data, posing concerns for reliable deployment. While prior efforts have explored empirical remedies, a theoretical understanding of such performance disparities in classification remains limited. In this work, we present Margin Regularization for Performance Disparity Reduction (MR²), a theoretically principled regularization for classification by dynamically adjusting margins in both the logit and representation spaces. Our analysis establishes a margin-based, class-sensitive generalization bound that reveals how per-class feature variability contributes to error, motivating the use of larger margins for hard classes. Guided by this insight, MR² optimizes per-class logit margins proportional to feature spread and penalizes excessive representation margins to enhance intra-class compactness. Experiments on seven datasets, including ImageNet, and diverse pre-trained backbones (MAE, MoCov2, CLIP) demonstrate that MR² not only improves overall accuracy but also significantly boosts hard class performance without trading off easy classes, thus reducing performance disparity. Code is available at: this https URL
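下面用 PyTorch 勾勒"按类特征离散度成比例地加大 logit 边际"的做法;边际的具体函数形式是我们的示意性假设,并非 MR² 原文公式:

```python
# Hedged sketch: additive per-class logit margins scaled by feature spread.
import torch
import torch.nn.functional as F

def margin_ce(logits, targets, class_spread, scale=0.5):
    """logits: (N, C); targets: (N,); class_spread: (C,) per-class feature std.
    Classes with larger spread (harder classes) receive larger margins."""
    margins = scale * class_spread / class_spread.mean()
    adjusted = logits.clone()
    # Subtracting the margin from the true-class logit forces a larger gap.
    adjusted[torch.arange(len(targets)), targets] -= margins[targets]
    return F.cross_entropy(adjusted, targets)
```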
zh
[CV-275] Vision-Language Model Purified Semi-Supervised Semantic Segmentation for Remote Sensing Images
【速读】:该论文旨在解决遥感(Remote Sensing, RS)领域半监督语义分割(Semi-Supervised Semantic Segmentation, S4)中因伪标签质量低而导致模型性能受限的问题,尤其是教师-学生架构下边界区域类别预测不准的问题。其解决方案的关键在于提出一种基于视觉语言模型(Vision-Language Models, VLMs)的伪标签净化结构(VLM Pseudo-Label Purifying, VLM-PP),该模块能够利用VLM的开放世界理解能力对教师网络生成的低置信度伪标签进行校正,显著提升伪标签质量,从而更有效地指导学生模型学习。该方法不依赖特定S4架构,具备通用性和可解释性,且在多个RS数据集上实现了当前最优(SOTA)性能。
链接: https://arxiv.org/abs/2602.00202
作者: Shanwen Wang,Xin Sun,Danfeng Hong,Fei Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Semi-supervised semantic segmentation (S4) can learn rich visual knowledge from low-cost unlabeled images. However, traditional S4 architectures all face the challenge of low-quality pseudo-labels, especially for the teacher-student framework. We propose a novel SemiEarth model that introduces vision-language models (VLMs) to address the S4 issues for the remote sensing (RS) domain. Specifically, we invent a VLM pseudo-label purifying (VLM-PP) structure to purify the teacher network’s pseudo-labels, achieving substantial improvements. Especially in multi-class boundary regions of RS images, the VLM-PP module can significantly improve the quality of pseudo-labels generated by the teacher, thereby correctly guiding the student model’s learning. Moreover, since VLM-PP equips VLMs with open-world capabilities and is independent of the S4 architecture, it can correct mispredicted categories in low-confidence pseudo-labels whenever a discrepancy arises between its prediction and the pseudo-label. We conducted extensive experiments on multiple RS datasets, which demonstrate that our SemiEarth achieves SOTA performance. More importantly, unlike previous SOTA RS S4 methods, our model not only achieves excellent performance but also offers good interpretability. The code is released at this https URL.
zh
[CV-276] AI-Generated Image Detectors Overrely on Global Artifacts: Evidence from Inpainting Exchange
【速读】:该论文旨在解决当前图像修复(inpainting)检测方法依赖全局伪影而非局部合成内容的问题,从而导致检测模型对真实修复图像的误判率较高。其关键解决方案是提出“Inpainting Exchange (INP-X)”操作,通过在非编辑区域恢复原始像素,同时保留所有合成内容,有效隔离了VAE(变分自编码器)重建过程中产生的全局频谱偏移效应。实验表明,该干预使主流检测模型准确率显著下降(如从91%降至55%),揭示了现有检测方法对高频信息衰减敏感的本质原因,并强调了开发基于内容感知的检测机制的重要性。
链接: https://arxiv.org/abs/2602.00192
作者: Elif Nebioglu,Emirhan Bilgiç,Adrian Popescu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 21 pages, 15 figures, 6 tables
Abstract:Modern deep learning-based inpainting enables realistic local image manipulation, raising critical challenges for reliable detection. However, we observe that current detectors primarily rely on global artifacts that appear as inpainting side effects, rather than on locally synthesized content. We show that this behavior occurs because VAE-based reconstruction induces a subtle but pervasive spectral shift across the entire image, including unedited regions. To isolate this effect, we introduce Inpainting Exchange (INP-X), an operation that restores original pixels outside the edited region while preserving all synthesized content. We create a 90K test dataset including real, inpainted, and exchanged images to evaluate this phenomenon. Under this intervention, pretrained state-of-the-art detectors, including commercial ones, exhibit a dramatic drop in accuracy (e.g., from 91% to 55%), frequently approaching chance level. We provide a theoretical analysis linking this behavior to high-frequency attenuation caused by VAE information bottlenecks. Our findings highlight the need for content-aware detection. Indeed, training on our dataset yields better generalization and localization than standard inpainting. Our dataset and code are publicly available at this https URL.
zh
[CV-277] GEPC: Group-Equivariant Posterior Consistency for Out-of-Distribution Detection in Diffusion Models
【速读】:该论文旨在解决扩散模型在分布外(Out-of-Distribution, OOD)检测中对等变性(equivariance)信息利用不足的问题。现有方法主要依赖于得分幅度或局部几何特征(如能量、曲率、协方差谱),而忽略了数据和模型结构中隐含的等变性质(如翻转、旋转、循环移位)。解决方案的关键在于提出Group-Equivariant Posterior Consistency(GEPC),这是一种无需训练的探测机制,通过测量学习到的得分场在有限群 G 作用下的变换一致性来识别等变性破坏。GEPC定义了理想状态下的残差函数,通过对群元素进行平均获得可解释的等变性破坏图,并在理论上建立了ID上界与OOD下界。实验表明,GEPC在多个图像OOD基准数据集上达到或优于现有扩散模型基线的AUROC性能,且在高分辨率合成孔径雷达(SAR)图像中实现了显著的目标-背景分离和可视化解释能力。
链接: https://arxiv.org/abs/2602.00191
作者: Yadang Alexis Rouzoumka,Jean Pinsolle,Eugénie Terreaux,Christèle Morisseau,Jean-Philippe Ovarlez,Chengfang Ren
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: preprint
Abstract:Diffusion models learn a time-indexed score field \mathbf{s}_\theta(\mathbf{x}_t, t) that often inherits approximate equivariances (flips, rotations, circular shifts) from in-distribution (ID) data and convolutional backbones. Most diffusion-based out-of-distribution (OOD) detectors exploit score magnitude or local geometry (energies, curvature, covariance spectra) and largely ignore equivariances. We introduce Group-Equivariant Posterior Consistency (GEPC), a training-free probe that measures how consistently the learned score transforms under a finite group \mathcal{G}, detecting equivariance breaking even when score magnitude remains unchanged. At the population level, we propose the ideal GEPC residual, which averages an equivariance-residual functional over \mathcal{G}, and we derive ID upper bounds and OOD lower bounds under mild assumptions. GEPC requires only score evaluations and produces interpretable equivariance-breaking maps. On OOD image benchmark datasets, we show that GEPC achieves competitive or improved AUROC compared to recent diffusion-based baselines while remaining computationally lightweight. On high-resolution synthetic aperture radar imagery where OOD corresponds to targets or anomalies in clutter, GEPC yields strong target-background separation and visually interpretable equivariance-breaking maps. Code is available at this https URL.
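等变性破坏残差的基本形式,是对群元素求平均的 g⁻¹s(gx) − s(x) 的二范数;下面以水平翻转群为例给出一个不依赖具体模型的示意实现(score_fn 的签名为假设):

```python
# Minimal sketch of a group-equivariance residual for a score network.
import torch

def gepc_residual(score_fn, x, t, group_ops):
    """score_fn: (x, t) -> score tensor shaped like x.
    group_ops: list of (g, g_inv) pairs acting on image tensors."""
    s = score_fn(x, t)
    res = 0.0
    for g, g_inv in group_ops:
        # g^{-1} s(g x) - s(x) vanishes for a perfectly equivariant field.
        res = res + (g_inv(score_fn(g(x), t)) - s).pow(2).mean()
    return res / len(group_ops)

hflip = lambda z: torch.flip(z, dims=[-1])
flip_group = [(hflip, hflip)]   # horizontal flip is its own inverse
```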
zh
[CV-278] From Gameplay Traces to Game Mechanics: Causal Induction with Large Language Models ICPR2026
【速读】:该论文旨在解决深度学习代理在复杂游戏环境中虽能实现高性能,却往往缺乏对底层因果游戏机制的理解这一问题。其核心解决方案是通过因果归纳(Causal Induction),即利用大型语言模型(LLMs)从游戏行为轨迹中逆向推导出视频游戏描述语言(VGDL)规则。关键创新在于提出一种两阶段方法:首先从观测数据中推断结构化因果模型(Structural Causal Model, SCM),再将该SCM转化为VGDL代码;相较直接从观察生成VGDL的方法,该方案显著提升了生成规则的准确性与逻辑一致性,在盲评中获得高达81%的偏好胜率,且更适用于因果强化学习、可解释智能体及逻辑一致的游戏生成等下游任务。
链接: https://arxiv.org/abs/2602.00190
作者: Mohit Jiwatode,Alexander Dockhorn,Bodo Rosenhahn
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Submitted to ICPR 2026
Abstract:Deep learning agents can achieve high performance in complex game domains without often understanding the underlying causal game mechanics. To address this, we investigate Causal Induction: the ability to infer governing laws from observational data, by tasking Large Language Models (LLMs) with reverse-engineering Video Game Description Language (VGDL) rules from gameplay traces. To reduce redundancy, we select nine representative games from the General Video Game AI (GVGAI) framework using semantic embeddings and clustering. We compare two approaches to VGDL generation: direct code generation from observations, and a two-stage method that first infers a structural causal model (SCM) and then translates it into VGDL. Both approaches are evaluated across multiple prompting strategies and controlled context regimes, varying the amount and form of information provided to the model, from just raw gameplay observations to partial VGDL specifications. Results show that the SCM-based approach more often produces VGDL descriptions closer to the ground truth than direct generation, achieving preference win rates of up to 81% in blind evaluations and yielding fewer logically inconsistent rules. These learned SCMs can be used for downstream use cases such as causal reinforcement learning, interpretable agents, and procedurally generating novel but logically consistent games.
zh
[CV-279] RPP: A Certified Poisoned-Sample Detection Framework for Backdoor Attacks under Dataset Imbalance
【速读】:该论文旨在解决数据集类别不平衡如何加剧后门攻击(backdoor attack)脆弱性的问题,指出不平衡会诱导多数类偏置从而提升模型 susceptibility,并导致现有防御方法在不平衡场景下性能显著下降。其解决方案的关键在于提出 Randomized Probability Perturbation (RPP),一种基于模型输出概率的黑盒检测框架,能够在不依赖标签或内部结构的情况下,对任意样本判断是否被后门操控,并提供可证明的域内检测保证和伪阳性率的概率上界,从而在多种真实世界基准数据集与攻击场景中实现优于当前最先进防御方法的检测准确率。
链接: https://arxiv.org/abs/2602.00183
作者: Miao Lin,Feng Yu,Rui Ning,Lusi Li,Jiawei Chen,Qian Lou,Mengxin Zheng,Chunsheng Xin,Hongyi Wu
机构: 未知
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Deep neural networks are highly susceptible to backdoor attacks, yet most defense methods to date rely on balanced data, overlooking the pervasive class imbalance in real-world scenarios that can amplify backdoor threats. This paper presents the first in-depth investigation of how the dataset imbalance amplifies backdoor vulnerability, showing that (i) the imbalance induces a majority-class bias that increases susceptibility and (ii) conventional defenses degrade significantly as the imbalance grows. To address this, we propose Randomized Probability Perturbation (RPP), a certified poisoned-sample detection framework that operates in a black-box setting using only model output probabilities. For any inspected sample, RPP determines whether the input has been backdoor-manipulated, while offering provable within-domain detectability guarantees and a probabilistic upper bound on the false positive rate. Extensive experiments on five benchmarks (MNIST, SVHN, CIFAR-10, TinyImageNet and ImageNet10) covering 10 backdoor attacks and 12 baseline defenses show that RPP achieves significantly higher detection accuracy than state-of-the-art defenses, particularly under dataset imbalance. RPP establishes a theoretical and practical foundation for defending against backdoor attacks in real-world environments with imbalanced data.
zh
[CV-280] CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning
【速读】:该论文旨在解决现有多模态模型在理解相机运动(camera dynamics)时存在的问题,即通常将其视为黑箱分类任务,容易因依赖表面视觉模式而非几何线索而混淆物理上不同的运动类型。解决方案的关键在于提出CamReasoner框架,该框架将相机运动理解重构为一种结构化的推理过程——Observation-Thinking-Answer (O-T-A) 模式,通过显式的推理模块解码时空线索(如轨迹和视锥体),并引入强化学习(Reinforcement Learning, RL)对推理链进行逻辑对齐,从而确保推理结果基于物理几何而非上下文猜测,有效抑制幻觉并提升性能。
链接: https://arxiv.org/abs/2602.00181
作者: Hang Wu,Yujun Cai,Zehao Li,Haonan Ge,Bowen Sun,Junsong Yuan,Yiwei Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Understanding camera dynamics is a fundamental pillar of video spatial intelligence. However, existing multimodal models predominantly treat this task as a black-box classification, often confusing physically distinct motions by relying on superficial visual patterns rather than geometric cues. We present CamReasoner, a framework that reformulates camera movement understanding as a structured inference process to bridge the gap between perception and cinematic logic. Our approach centers on the Observation-Thinking-Answer (O-T-A) paradigm, which compels the model to decode spatio-temporal cues such as trajectories and view frustums within an explicit reasoning block. To instill this capability, we construct a Large-scale Inference Trajectory Suite comprising 18k SFT reasoning chains and 38k RL feedback samples. Notably, we are the first to employ RL for logical alignment in this domain, ensuring motion inferences are grounded in physical geometry rather than contextual guesswork. By applying Reinforcement Learning to the Observation-Think-Answer (O-T-A) reasoning paradigm, CamReasoner effectively suppresses hallucinations and achieves state-of-the-art performance across multiple benchmarks.
zh
[CV-281] Stabilizing Diffusion Posterior Sampling by Noise–Frequency Continuation
【速读】:该论文旨在解决扩散后验采样(diffusion posterior sampling)在求解逆问题时难以恢复精细细节的问题,其根本原因在于测量一致性引导项与扩散噪声水平的弱耦合关系:在高噪声阶段,基于不准确估计计算的数据一致性梯度可能与后验几何结构不一致,从而引发早期步骤漂移、虚假高频伪影,并对调度策略和病态算子敏感。解决方案的关键在于提出一种噪声-频率连续体框架(noise–frequency Continuation framework),通过构建一组依赖噪声水平的中间后验分布,使似然项仅在特定噪声相关的频带内强制实现测量一致性;具体实现上采用稳定化的后验采样器,结合扩散预测器、带限似然引导(band-limited likelihood guidance)以及多分辨率一致性策略——该策略在粗尺度上积极执行可靠修正,仅在高频细节变得可识别时才保守引入。
链接: https://arxiv.org/abs/2602.00176
作者: Feng Tian,Yixuan Li,Weili Zeng,Weitian Zhang,Yichao Yan,Xiaokang Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion posterior sampling solves inverse problems by combining a pretrained diffusion prior with measurement-consistency guidance, but it often fails to recover fine details because measurement terms are applied in a manner that is weakly coupled to the diffusion noise level. At high noise, data-consistency gradients computed from inaccurate estimates can be geometrically incongruent with the posterior geometry, inducing early-step drift, spurious high-frequency artifacts, plus sensitivity to schedules and ill-conditioned operators. To address these concerns, we propose a noise–frequency Continuation framework that constructs a continuous family of intermediate posteriors whose likelihood enforces measurement consistency only within a noise-dependent frequency band. This principle is instantiated with a stabilized posterior sampler that combines a diffusion predictor, band-limited likelihood guidance, and a multi-resolution consistency strategy that aggressively commits reliable coarse corrections while conservatively adopting high-frequency details only when they become identifiable. Across super-resolution, inpainting, and deblurring, our method achieves state-of-the-art performance and improves motion deblurring PSNR by up to 5 dB over strong baselines.
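"带限似然引导"的关键操作,是在频域用随噪声水平变化的截止半径对数据一致性梯度做低通滤波;以下草图中的截止调度为示意性假设:

```python
# Sketch of band-limited guidance: low-pass the likelihood gradient in FFT
# space with a cutoff radius that widens as the noise level decreases.
import torch

def band_limit(grad, noise_level, max_keep=1.0, min_keep=0.1):
    """grad: (N, C, H, W); noise_level in [0, 1], 1 = pure noise."""
    keep = min_keep + (max_keep - min_keep) * (1.0 - noise_level)
    G = torch.fft.fftshift(torch.fft.fft2(grad), dim=(-2, -1))
    H, W = grad.shape[-2:]
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    mask = ((yy**2 + xx**2).sqrt() <= keep).to(G.dtype)  # radial low-pass
    return torch.fft.ifft2(torch.fft.ifftshift(G * mask, dim=(-2, -1))).real
```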
zh
[CV-282] he Illusion of Forgetting: Attack Unlearned Diffusion via Initial Latent Variable Optimization
【速读】:该论文旨在解决当前基于遗忘(unlearning)的防御机制在扩散模型(Diffusion Models, DMs)中对不安全内容(Not-Safe-For-Work, NSFW)去除效果有限的问题。研究发现,现有方法仅部分破坏语言符号与模型内部知识之间的映射关系,而原始知识仍以“休眠记忆”(dormant memories)的形式保留,导致所谓“遗忘”实为表层扰动。解决方案的关键在于提出 IVO(Initial Latent Variable Optimization)框架,通过图像反演(Image Inversion)、对抗优化(Adversarial Optimization)和重用攻击(Reused Attack)三阶段优化初始潜在变量,重建被破坏的语义映射,使未学习模型的噪声分布重新对齐至原始不安全状态,从而有效激活休眠记忆并验证防御机制的根本缺陷。
链接: https://arxiv.org/abs/2602.00175
作者: Manyi Li,Yufan Liu,Lai Jiang,Bing Li,Yuming Li,Weiming Hu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注: 21 pages, 22 figures, 17 tables
Abstract:Although unlearning-based defenses claim to purge Not-Safe-For-Work (NSFW) concepts from diffusion models (DMs), we reveals that this “forgetting” is largely an illusion. Unlearning partially disrupts the mapping between linguistic symbols and the underlying knowledge, which remains intact as dormant memories. We find that the distributional discrepancy in the denoising process serves as a measurable indicator of how much of the mapping is retained, also reflecting the strength of unlearning. Inspired by this, we propose IVO (Initial Latent Variable Optimization), a concise and powerful attack framework that reactivates these dormant memories by reconstructing the broken mappings. Through Image Inversion, Adversarial Optimization and Reused Attack, IVO optimizes initial latent variables to realign the noise distribution of unlearned models with their original unsafe states. Extensive experiments across 8 widely used unlearning techniques demonstrate that IVO achieves superior attack success rates and strong semantic consistency, exposing fundamental flaws in current defenses. The code is available at this http URL. Warning: This paper has unsafe images that may offend some readers.
zh
[CV-283] Intra-Class Subdivision for Pixel Contrastive Learning: Application to Semi-supervised Cardiac Image Segmentation ICASSP2026
【速读】:该论文旨在解决心脏图像分割中因边界区域表示污染(representation contamination)导致的分割精度下降问题。其解决方案的关键在于提出了一种类内子划分像素对比学习(intra-class subdivision pixel contrastive learning, SPCL)框架,核心创新包括引入“无关样本”(unconcerned sample)概念以区分同一类别内部像素在非边界区域与边界区域的表示特征,并设计边界对比损失(boundary contrastive loss)以增强边界区域表征的判别能力。理论分析与实验结果均表明,该方法能有效提升分割质量与边界精度。
链接: https://arxiv.org/abs/2602.00174
作者: Jiajun Zhao,Xuan Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 7 figures, accepted by ICASSP 2026
Abstract:We propose an intra-class subdivision pixel contrastive learning (SPCL) framework for cardiac image segmentation to address representation contamination at boundaries. The novel concept "unconcerned sample" is proposed to distinguish pixel representations at the inner and boundary regions within the same class, facilitating a clearer characterization of intra-class variations. A novel boundary contrastive loss for boundary representations is proposed to enhance representation discrimination across boundaries. The advantages of the unconcerned sample and boundary contrastive loss are analyzed theoretically. Experimental results on public cardiac datasets demonstrate that SPCL significantly improves segmentation performance, outperforming existing methods with respect to segmentation quality and boundary precision. Our code is available at this https URL.
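边界对比损失的一种直观写法是:以边界像素为锚点,把同类"内部"像素当作正样本、其余像素进入分母;以下 InfoNCE 式草图对采样与"无关样本"的处理做了大幅简化,仅供理解思路:

```python
# Simplified pixel-contrastive sketch in the spirit of SPCL (not the paper's loss).
import torch
import torch.nn.functional as F

def boundary_contrast(emb, labels, boundary, tau=0.1):
    """emb: (P, d) sampled pixel embeddings; labels: (P,); boundary: (P,) bool."""
    z = F.normalize(emb, dim=1)
    sim = z @ z.t() / tau
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos = same & ~boundary.unsqueeze(0)            # positives: inner same-class pixels
    loss, count = 0.0, 0
    for a in boundary.nonzero(as_tuple=True)[0]:   # anchors: boundary pixels
        if not pos[a].any():
            continue
        log_prob = sim[a] - torch.logsumexp(sim[a], dim=0)
        loss = loss - log_prob[pos[a]].mean()
        count += 1
    return loss / max(count, 1)
```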
zh
[CV-284] YOLOE-26: Integrating YOLO26 with YOLOE for Real-Time Open-Vocabulary Instance Segmentation
【速读】:该论文旨在解决传统目标检测与实例分割模型在开放词汇(open-vocabulary)场景下难以兼顾实时性与泛化能力的问题,尤其针对闭集(closed-set)识别限制带来的部署瓶颈。其核心解决方案是提出YOLOE-26框架,通过将部署优化的YOLOv26架构与开放词汇学习范式相结合,实现端到端的实时开放词汇实例分割。关键创新在于:采用对象嵌入(object embedding)头替代固定类别logits,将分类任务转化为与文本描述、视觉样本或内置词汇的相似度匹配;同时引入三种高效提示机制——可重参数化的区域-文本对齐(Re-Parameterizable Region-Text Alignment, RepRTA)、语义激活的视觉提示编码器(Semantic-Activated Visual Prompt Encoder, SAVPE)和懒惰区域提示对比(Lazy Region Prompt Contrast),使模型能在统一的对象嵌入空间内无缝切换文本提示、视觉提示及无提示推理模式,从而在保持YOLO家族高效性和确定性的基础上显著扩展了模型的开放词汇适应能力。
链接: https://arxiv.org/abs/2602.00168
作者: Ranjan Sapkota,Manoj Karkee
机构: Cornell University (康奈尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper presents YOLOE-26, a unified framework that integrates the deployment-optimized YOLO26(or YOLOv26) architecture with the open-vocabulary learning paradigm of YOLOE for real-time open-vocabulary instance segmentation. Building on the NMS-free, end-to-end design of YOLOv26, the proposed approach preserves the hallmark efficiency and determinism of the YOLO family while extending its capabilities beyond closed-set recognition. YOLOE-26 employs a convolutional backbone with PAN/FPN-style multi-scale feature aggregation, followed by end-to-end regression and instance segmentation heads. A key architectural contribution is the replacement of fixed class logits with an object embedding head, which formulates classification as similarity matching against prompt embeddings derived from text descriptions, visual examples, or a built-in vocabulary. To enable efficient open-vocabulary reasoning, the framework incorporates Re-Parameterizable Region-Text Alignment (RepRTA) for zero-overhead text prompting, a Semantic-Activated Visual Prompt Encoder (SAVPE) for example-guided segmentation, and Lazy Region Prompt Contrast for prompt-free inference. All prompting modalities operate within a unified object embedding space, allowing seamless switching between text-prompted, visual-prompted, and fully autonomous segmentation. Extensive experiments demonstrate consistent scaling behavior and favorable accuracy-efficiency trade-offs across model sizes in both prompted and prompt-free settings. The training strategy leverages large-scale detection and grounding datasets with multi-task optimization and remains fully compatible with the Ultralytics ecosystem for training, validation, and deployment. Overall, YOLOE-26 provides a practical and scalable solution for real-time open-vocabulary instance segmentation in dynamic, real-world environments.
zh
[CV-285] Deep Learning Pose Estimation for Multi-Label Recognition of Combined Hyperkinetic Movement Disorders
【速读】:该论文旨在解决肌张力障碍(dystonia)、震颤(tremor)、舞蹈样动作(chorea)、肌阵挛(myoclonus)和抽动(tics)等高动力性运动障碍(Hyperkinetic Movement Disorders, HMDs)在临床识别与长期监测中因症状波动、间歇性和共现性导致的主观性强、评估一致性差的问题。其解决方案的关键在于构建一个基于姿态估计(pose-based)的机器学习框架,将常规门诊视频转化为具有解剖学意义的关键点时间序列,并提取涵盖统计特征、时间特性、频域特征及高阶不规则性-复杂度特征的运动学描述符,从而实现对重叠HMD表型的客观、可扩展区分。
链接: https://arxiv.org/abs/2602.00163
作者: Laura Cif,Diane Demailly,Gabriella A. Horvàth,Juan Dario Ortigoza Escobar,Nathalie Dorison,Mayté Castro Jiménez,Cécile A. Hubsch,Thomas Wirth,Gun-Marie Hariz,Sophie Huby,Morgan Dornadic,Zohra Souei,Muhammad Mushhood Ur Rehman,Simone Hemm,Mehdi Boulayme,Eduardo M. Moraud,Jocelyne Bloch,Xavier Vasques
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
备注:
Abstract:Hyperkinetic movement disorders (HMDs) such as dystonia, tremor, chorea, myoclonus, and tics are disabling motor manifestations across childhood and adulthood. Their fluctuating, intermittent, and frequently co-occurring expressions hinder clinical recognition and longitudinal monitoring, which remain largely subjective and vulnerable to inter-rater variability. Objective and scalable methods to distinguish overlapping HMD phenotypes from routine clinical videos are still lacking. Here, we developed a pose-based machine-learning framework that converts standard outpatient videos into anatomically meaningful keypoint time series and computes kinematic descriptors spanning statistical, temporal, spectral, and higher-order irregularity-complexity features.
zh
[CV-286] See Without Decoding: Motion-Vector-Based Tracking in Compressed Video
【速读】:该论文旨在解决视频监控系统中实时目标跟踪的计算效率问题,特别是在大规模视频流分析场景下,传统基于RGB解码的跟踪方法存在高计算开销的问题。解决方案的关键在于提出一种轻量级的压缩域(codec-domain)跟踪模型,直接利用视频编码过程中已有的运动矢量(motion vectors)和变换系数(transform coefficients),无需进行完整的RGB视频解码即可实现目标边框在帧间的传播,从而在保持较高跟踪精度的前提下显著提升计算速度,实测速度提升达3.7倍,仅带来4%的mAP@0.5性能下降。
链接: https://arxiv.org/abs/2602.00153
作者: Axel Duché,Clément Chatelain,Gilles Gasso
机构: Actemium Paris Transport (Actemium巴黎交通); INSA Rouen (INSA鲁昂); LITIS UR 4108 (LITIS UR 4108)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:
Abstract:We propose a lightweight compressed-domain tracking model that operates directly on video streams, without requiring full RGB video decoding. Using motion vectors and transform coefficients from compressed data, our deep model propagates object bounding boxes across frames, achieving a computational speed-up of up to 3.7x with only a slight 4% mAP@0.5 drop vs the RGB baseline on the MOTS15/17/20 datasets. These results highlight the efficiency of codec-domain motion modeling for real-time analytics in large monitoring systems.
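压缩域跟踪的最小可行形式,是直接用框内运动矢量的均值来平移边框;下面的示意代码假设已从码流解出稠密运动矢量场 mv_field(该变量名与稠密形式均为假设):

```python
# Toy sketch of codec-domain box propagation from motion vectors.
import numpy as np

def propagate_box(box, mv_field):
    """box = (x1, y1, x2, y2); mv_field: (H, W, 2) per-pixel (dx, dy)."""
    x1, y1, x2, y2 = map(int, box)
    region = mv_field[y1:y2, x1:x2]
    if region.size == 0:
        return box
    dx, dy = region.reshape(-1, 2).mean(axis=0)   # average motion under the box
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)
```

论文中的深度模型在此思路上联合运动矢量与变换系数做回归,而非简单取均值。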
zh
[CV-287] Real-Time Human Activity Recognition on Edge Microcontrollers: Dynamic Hierarchical Inference with Multi-Spectral Sensor Fusion
【速读】:该论文旨在解决边缘计算场景下实时人体活动识别(Human Activity Recognition, HAR)中准确率与计算资源受限之间的矛盾问题。现有方法难以在低功耗、小内存的嵌入式设备上同时实现高精度和高效推理。其解决方案的关键在于提出一种面向资源感知的分层网络结构HPPI-Net,该结构采用两级架构:第一层利用快速傅里叶变换(Fast Fourier Transform, FFT)谱图提取初步特征;第二层根据活动状态选择激活专用模块或并行LSTM-MobileNet网络(Parallel LSTM-MobileNet, PLMN),其中PLMN通过三个并行LSTM编码器融合FFT、小波(Wavelet)和Gabor谱图,并使用高效通道注意力(Efficient Channel Attention, ECA)与深度可分离卷积(Depthwise Separable Convolution, DSC)优化特征表示,在显著降低乘加操作次数的同时提供通道级可解释性。最终在ARM Cortex-M4微控制器上实现了96.70%的准确率,仅消耗22.3 KiB RAM和439.5 KiB ROM,相较MobileNetV3提升准确率1.22%,且RAM和ROM占用分别减少71.2%和42.1%,验证了其在资源受限边缘平台上的优越性。
链接: https://arxiv.org/abs/2602.00152
作者: Boyu Li,Kuangji Zuo,Lincong Li,Yonghui Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: 24 pages, 6 figures. The manuscript is under review at Measurement
Abstract:The demand for accurate on-device pattern recognition in edge applications is intensifying, yet existing approaches struggle to reconcile accuracy with computational constraints. To address this challenge, a resource-aware hierarchical network based on multi-spectral fusion and interpretable modules, namely the Hierarchical Parallel Pseudo-image Enhancement Fusion Network (HPPI-Net), is proposed for real-time, on-device Human Activity Recognition (HAR). Deployed on an ARM Cortex-M4 microcontroller for low-power real-time inference, HPPI-Net achieves 96.70% accuracy while utilizing only 22.3 KiB of RAM and 439.5 KiB of ROM after optimization. HPPI-Net employs a two-layer architecture. The first layer extracts preliminary features using Fast Fourier Transform (FFT) spectrograms, while the second layer selectively activates either a dedicated module for stationary activity recognition or a parallel LSTM-MobileNet network (PLMN) for dynamic states. PLMN fuses FFT, Wavelet, and Gabor spectrograms through three parallel LSTM encoders and refines the concatenated features using Efficient Channel Attention (ECA) and Depthwise Separable Convolution (DSC), thereby offering channel-level interpretability while substantially reducing multiply-accumulate operations. Compared with MobileNetV3, HPPI-Net improves accuracy by 1.22% and reduces RAM usage by 71.2% and ROM usage by 42.1%. These results demonstrate that HPPI-Net achieves a favorable accuracy-efficiency trade-off and provides explainable predictions, establishing a practical solution for wearable, industrial, and smart home HAR on memory-constrained edge platforms.
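第一层所用的 FFT 谱图可由滑窗、加窗与实数 FFT 得到;窗长与步长在此为占位取值:

```python
# Sketch of the first-stage FFT spectrogram over a 1-D sensor stream.
import numpy as np

def fft_spectrogram(signal, win=64, hop=32):
    """signal: (T,) -> (n_frames, win // 2 + 1) magnitude spectrogram."""
    frames = [signal[i:i + win] * np.hanning(win)          # Hann-windowed frames
              for i in range(0, len(signal) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))
```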
zh
[CV-288] Investigating the Impact of Histopathological Foundation Models on Regressive Prediction of Homologous Recombination Deficiency IJCAI2026
【速读】:该论文旨在解决生成式 AI (Generative AI) 在组织病理学图像中用于回归型生物标志物预测(如同源重组缺陷 HRD 评分)时性能不足的问题,尤其关注预训练模型在跨癌种泛化能力与临床实用性上的局限。其解决方案的关键在于:利用五种先进的组织病理学基础模型(foundation models)从全切片图像(WSI)中提取局部 patch 特征,并基于这些特征构建回归模型以预测连续 HRD 分数;同时提出一种基于分布的上采样策略来缓解目标值不平衡问题,从而提升对临床重要但样本稀少群体的召回率和平衡准确率。实验表明,使用基础模型特征显著优于对比学习特征,在多个癌症队列中展现出更高的预测精度与迁移能力。
链接: https://arxiv.org/abs/2602.00151
作者: Alexander Blezinger,Wolfgang Nejdl,Ming Tang
机构: Leibniz University Hannover (汉诺威莱布尼茨大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 7 figures and 5 tables. Initially submitted to IJCAI 2026
Abstract:Foundation models pretrained on large-scale histopathology data have found great success in various fields of computational pathology, but their impact on regressive biomarker prediction remains underexplored. In this work, we systematically evaluate histopathological foundation models for regression-based tasks, demonstrated through the prediction of homologous recombination deficiency (HRD) score - a critical biomarker for personalized cancer treatment. Within multiple instance learning frameworks, we extract patch-level features from whole slide images (WSI) using five state-of-the-art foundation models, and evaluate their impact compared to contrastive learning-based features. Models are trained to predict continuous HRD scores based on these extracted features across breast, endometrial, and lung cancer cohorts from two public medical data collections. Extensive experiments demonstrate that models trained on foundation model features consistently outperform the baseline in terms of predictive accuracy and generalization capabilities while exhibiting systematic differences among the foundation models. Additionally, we propose a distribution-based upsampling strategy to mitigate target imbalance in these datasets, significantly improving the recall and balanced accuracy for underrepresented but clinically important patient populations. Furthermore, we investigate the impact of different sampling strategies and instance bagsizes by ablation studies. Our results highlight the benefits of large-scale histopathological pretraining for more precise and transferable regressive biomarker prediction, showcasing its potential to advance AI-driven precision oncology.
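摘要中"基于分布的上采样"可理解为按目标值直方图的反频率重采样;分箱数与权重规则为我们的示意性假设:

```python
# Sketch of distribution-based upsampling for a continuous (HRD-like) target.
import numpy as np

def upsample_weights(targets, n_bins=20, eps=1e-6):
    """Return per-sample sampling probabilities, favoring rare target bins."""
    counts, edges = np.histogram(targets, bins=n_bins)
    bin_idx = np.clip(np.digitize(targets, edges[1:-1]), 0, n_bins - 1)
    weights = 1.0 / (counts[bin_idx] + eps)   # rare bins get larger weights
    return weights / weights.sum()

# e.g. idx = np.random.choice(len(targets), size=len(targets), p=weights)
```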
zh
[CV-289] SDCM: Simulated Densifying and Compensatory Modeling Fusion for Radar-Vision 3-D Object Detection in Internet of Vehicles
【速读】:该论文旨在解决车联网(Internet of Vehicles, IoV)中基于4-D雷达-视觉融合的三维目标检测面临的两大挑战:一是4-D雷达点云稀疏导致三维表征质量差;二是视觉数据在低光照、远距离和密集遮挡场景下出现表征退化,从而在融合阶段提供不可靠的纹理信息。解决方案的关键在于提出一种名为SDCM(Simulated Densifying and Compensatory Modeling Fusion)的框架,其核心创新包括三个模块:(1)SimDen模块通过基于3-D核密度估计(3-D KDE)的关键点高斯模拟与曲率模拟生成轮廓,实现雷达点云的仿真稠密化;(2)RCM模块利用雷达数据全天候特性提供的实时性优势,对视觉特征进行补偿映射以缓解其表征退化问题;(3)MMIF模块通过建模多模态特征张量差异值,降低异构性并增强模态间交互,实现高效融合。实验表明,SDCM在多个公开数据集上实现了更优性能,且参数量更低、推理速度更快。
链接: https://arxiv.org/abs/2602.00149
作者: Shucong Li,Xiaoluo Zhou,Yuqian He,Zhenyu Liu
机构: Guangdong University of Technology (广东工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注:
Abstract:3-D object detection based on 4-D radar-vision fusion is an important part of the Internet of Vehicles (IoV). However, two challenges need to be faced. First, 4-D radar point clouds are sparse, leading to poor 3-D representation. Second, vision data exhibit representation degradation in low-light, long-distance, and dense-occlusion scenes, providing unreliable texture information during the fusion stage. To address these issues, a framework named SDCM is proposed, which combines Simulated Densifying and Compensatory Modeling Fusion for radar-vision 3-D object detection in IoV. First, the Simulated Densifying (SimDen) module is designed to generate dense radar point clouds through point generation based on Gaussian simulation of key points obtained from 3-D Kernel Density Estimation (3-D KDE), together with outline generation based on curvature simulation. Second, since radar data provide more reliable real-time information than vision data owing to the all-weather property of 4-D radar, the Radar Compensatory Mapping (RCM) module is designed to reduce the effects of the representation degradation of vision data. Third, since feature tensor difference values contain the effective information of each modality and can be extracted and modeled for heterogeneity reduction and modality interaction, the Mamba Modeling Interactive Fusion (MMIF) module is designed to reduce heterogeneity and achieve interactive fusion. Experimental results on the VoD, TJ4DRadSet and Astyx HiRes 2019 datasets show that SDCM achieves the best performance with a lower parameter count and faster inference speed. Our code will be available.
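SimDen 中"3-D KDE 关键点 + 高斯仿真"的骨架大致如下;带宽、关键点数与每点采样数均为占位参数:

```python
# Rough sketch of SimDen-style densification via 3-D KDE key points.
import numpy as np
from scipy.stats import gaussian_kde

def simulate_densify(points, top_k=32, per_key=8, sigma=0.05):
    """points: (N, 3) sparse radar cloud -> densified (N + top_k*per_key, 3)."""
    density = gaussian_kde(points.T)(points.T)       # KDE density at each point
    keys = points[np.argsort(density)[-top_k:]]      # highest-density key points
    new = keys[:, None, :] + sigma * np.random.randn(top_k, per_key, 3)
    return np.concatenate([points, new.reshape(-1, 3)], axis=0)
```

原文还包含基于曲率仿真的轮廓生成,此处未予展示。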
zh
[CV-290] Learning Physics-Grounded 4D Dynamics with Neural Gaussian Force Fields ICLR2026
【速读】:该论文旨在解决从原始视觉数据中预测物理动态的难题,特别是现有视频生成模型在缺乏物理定律建模的情况下难以生成符合物理规律的视频。解决方案的关键在于提出神经高斯力场(Neural Gaussian Force Field, NGFF),这是一个端到端的神经框架,将3D高斯感知与基于物理的动力学建模相结合,从而从多视角RGB输入生成可交互、物理真实的4D视频。NGFF通过高效整合几何表示与物理仿真机制,在保证物理合理性的同时实现比先前高斯模拟器快两个数量级的推理速度,显著提升了复杂真实场景下的鲁棒性与泛化能力。
链接: https://arxiv.org/abs/2602.00148
作者: Shiqian Li,Ruihong Shen,Junfeng Ni,Chang Pan,Chi Zhang,Yixin Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 43 pages, ICLR 2026
Abstract:Predicting physical dynamics from raw visual data remains a major challenge in AI. While recent video generation models have achieved impressive visual quality, they still cannot consistently generate physically plausible videos due to a lack of modeling of physical laws. Recent approaches combining 3D Gaussian splatting and physics engines can produce physically plausible videos, but are hindered by high computational costs in both reconstruction and simulation, and often lack robustness in complex real-world scenarios. To address these issues, we introduce Neural Gaussian Force Field (NGFF), an end-to-end neural framework that integrates 3D Gaussian perception with physics-based dynamic modeling to generate interactive, physically realistic 4D videos from multi-view RGB inputs, achieving two orders of magnitude faster than prior Gaussian simulators. To support training, we also present GSCollision, a 4D Gaussian dataset featuring diverse materials, multi-object interactions, and complex scenes, totaling over 640k rendered physical videos (~4 TB). Evaluations on synthetic and real 3D scenarios show NGFF’s strong generalization and robustness in physical reasoning, advancing video prediction towards physics-grounded world models.
zh
[CV-291] DensiThAI A Multi-View Deep Learning Framework for Breast Density Estimation using Infrared Images
【速读】:该论文旨在解决乳腺密度评估依赖X射线乳腺摄影(mammography)这一电离辐射成像方式的问题,从而推动一种无电离辐射的替代性评估手段。其核心解决方案是提出DensiThAI——一个基于多视角红外热成像的深度学习框架,通过分析纤维腺体组织与脂肪组织因热物理和生理特性差异所导致的乳房表面温度分布的空间一致性变化,实现对乳腺密度的分类预测。该方法在包含3500名女性的多中心数据集上验证,使用乳腺摄影标注作为参考标准,平均AUROC达0.73,且在所有随机分割中均表现出各密度类别间的统计学显著区分能力(p < 0.05),表明热成像结合人工智能可作为乳腺密度评估的一种可行、非电离的新型技术路径。
链接: https://arxiv.org/abs/2602.00145
作者: Siva Teja Kakileti,Geetha Manjunath
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Breast tissue density is a key biomarker of breast cancer risk and a major factor affecting mammographic sensitivity. However, density assessment currently relies almost exclusively on X-ray mammography, an ionizing imaging modality. This study investigates the feasibility of estimating breast density using artificial intelligence over infrared thermal images, offering a non-ionizing imaging approach. The underlying hypothesis is that fibroglandular and adipose tissues exhibit distinct thermophysical and physiological properties, leading to subtle but spatially coherent temperature variations on the breast surface. In this paper, we propose DensiThAI, a multi-view deep learning framework for breast density classification from thermal images. The framework was evaluated on a multi-center dataset of 3,500 women using mammography-derived density labels as reference. Using five standard thermal views, DensiThAI achieved a mean AUROC of 0.73 across 10 random splits, with statistically significant separation between density classes across all splits (p < 0.05). Consistent performance across age cohorts supports the potential of thermal imaging as a non-ionizing approach for breast density assessment with implications for improved patient experience and workflow optimization.
zh
[CV-292] Scalable Analytic Classifiers with Associative Drift Compensation for Class-Incremental Learning of Vision Transformers
【速读】:该论文旨在解决视觉 Transformer (Vision Transformer, ViT) 在类增量学习 (Class-Incremental Learning, CIL) 中分类器重构阶段的计算瓶颈问题,即现有方法依赖昂贵的迭代随机梯度下降(SGD)导致效率低下。其核心解决方案是提出低秩分解的正则化高斯判别分析(Low-Rank Factorized Regularized Gaussian Discriminant Analysis, LR-RGDA),通过利用协方差矩阵的低秩结构并结合 Woodbury 矩阵恒等式,将判别函数分解为一个全局仿射项与一个低秩二次扰动项,从而将推理复杂度从 O(Cd²) 降低至 O(d²+Crd²)(其中 C 为类别数,d 为特征维度,r≪d 为子空间秩),显著提升了大规模场景下的可扩展性;此外,为缓解骨干网络更新引起的表征漂移,进一步引入基于 Hopfield 网络的分布补偿机制(Hopfield-based Distribution Compensator, HopDC),无需训练即可通过无标签锚点上的关联记忆动力学动态校准历史类统计量,并提供估计误差的理论界。
链接: https://arxiv.org/abs/2602.00144
作者: Xuan Rao,Mingming Ha,Bo Zhao,Derong Liu,Cesare Alippi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Class-incremental learning (CIL) with Vision Transformers (ViTs) faces a major computational bottleneck during the classifier reconstruction phase, where most existing methods rely on costly iterative stochastic gradient descent (SGD). We observe that analytic Regularized Gaussian Discriminant Analysis (RGDA) provides a Bayes-optimal alternative with accuracy comparable to SGD-based classifiers; however, its quadratic inference complexity limits its use in large-scale CIL scenarios. To overcome this, we propose Low-Rank Factorized RGDA (LR-RGDA), a scalable classifier that combines RGDA’s expressivity with the efficiency of linear classifiers. By exploiting the low-rank structure of the covariance via the Woodbury matrix identity, LR-RGDA decomposes the discriminant function into a global affine term refined by a low-rank quadratic perturbation, reducing the inference complexity from O(Cd²) to O(d² + Crd²), where C is the class number, d the feature dimension, and r ≪ d the subspace rank. To mitigate representation drift caused by backbone updates, we further introduce Hopfield-based Distribution Compensator (HopDC), a training-free mechanism that uses modern continuous Hopfield Networks to recalibrate historical class statistics through associative memory dynamics on unlabeled anchors, accompanied by a theoretical bound on the estimation error. Extensive experiments on diverse CIL benchmarks demonstrate that our framework achieves state-of-the-art performance, providing a scalable solution for large-scale class-incremental learning with ViTs. Code: this https URL.
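利用 Woodbury 恒等式,Σ = s²I + U diag(λ)Uᵀ(U 列正交)的逆可显式写出,使单个判别函数的代价降到 O(dr) 级;以下为按此推导的数值草图,符号命名为我们的假设:

```python
# Sketch of a low-rank Gaussian discriminant via the Woodbury identity.
# For orthonormal U: Sigma^{-1} = I/s2 - U diag(lam / (s2 * (s2 + lam))) U^T.
import numpy as np

def discriminant(x, mu, U, lam, s2, log_prior):
    """x, mu: (d,); U: (d, r) orthonormal; lam: (r,) top eigenvalues."""
    v = x - mu
    Utv = U.T @ v                                            # (r,) projection
    quad = v @ v / s2 - Utv @ (lam / (s2 * (s2 + lam)) * Utv)
    logdet = (len(v) - len(lam)) * np.log(s2) + np.log(s2 + lam).sum()
    return -0.5 * (quad + logdet) + log_prior                # argmax over classes
```

这与摘要所说的"全局仿射项 + 低秩二次扰动"结构对应:v·v/s2 展开后给出关于 x 的仿射项,低秩投影项则是二次扰动。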
zh
[CV-293] LLaVA-FA: Learning Fourier Approximation for Compressing Large Multimodal Models ICLR2026
【速读】:该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在实际部署中因计算和内存开销过大而导致的效率瓶颈问题。现有压缩方法通常将低秩分解与量化分离处理,导致跨模态冗余结构中重建误差累积,影响模型性能。其解决方案的关键在于提出一种在频域中联合执行低秩近似与量化的新型压缩框架——LLaVA-FA,通过傅里叶变换的去相关性和共轭对称性实现更紧凑且精确的权重表示;同时引入专为复数矩阵设计的极坐标量化方法(PolarQuant)以及可选的对角校准(ODC)方案,避免依赖大规模校准数据,从而在保持极低激活参数和计算成本的前提下显著提升压缩后模型的性能表现。
链接: https://arxiv.org/abs/2602.00135
作者: Pengcheng Zheng,Chaoning Zhang,Jiarong Mo,GuoHui Li,Jiaquan Zhang,Jiahao Zhang,Sihan Cao,Sheng Zheng,Caiyan Qin,Guoqing Wang,Yang Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICLR 2026
Abstract:Large multimodal models (LMMs) have achieved impressive performance on various vision-language tasks, but their substantial computational and memory costs hinder their practical deployment. Existing compression methods often decouple low-rank decomposition and quantization, leading to compounded reconstruction errors, especially in multimodal architectures with cross-modal redundancy. To address this issue, we propose LLaVA-FA, a novel efficient LMM that performs joint low-rank plus quantization approximation in the frequency domain. By leveraging the de-correlation and conjugate symmetry properties of Fourier transform, LLaVA-FA achieves more compact and accurate weight representations. Furthermore, we introduce PolarQuant, a polar-coordinate quantization method tailored for complex matrices, and an optional diagonal calibration (ODC) scheme that eliminates the need for large-scale calibration data. Extensive experimental results demonstrate that our proposed LLaVA-FA outperforms existing efficient multimodal models across multiple benchmarks while maintaining minimal activated parameters and low computational costs, validating its effectiveness as a powerful solution for compressing LMMs.
zh
[CV-294] Shedding the Facades Connecting the Domains: Detecting Shifting Multimodal Hate Video with Test-Time Adaptation AAAI2026
【速读】:该论文针对仇恨视频检测(Hate Video Detection, HVD)中因内容演化导致的严重语义漂移问题,提出了一种新的测试时自适应(Test-Time Adaptation, TTA)框架SCANNER。传统TTA方法仅适用于轻微分布偏移,难以应对HVD中恶意内容通过形态变化规避审查所引发的显著跨域差异。解决方案的关键在于识别并利用仇恨内容中不变的核心特征(即稳定核心,如性别、种族等敏感属性),将其作为源域与目标域之间的桥梁,实现更鲁棒的适应。具体而言,SCANNER首先通过一种基于中心点引导的对齐机制从模糊的演变内容中提取这些稳定核心;其次引入样本级自适应中心对齐策略以抑制异常样本干扰;最后采用簇内多样性正则化防止语义坍缩,从而在保持高准确率的同时增强模型对未知变体的泛化能力。
链接: https://arxiv.org/abs/2602.00132
作者: Jiao Li,Jian Lang,Xikai Tang,Wenzheng Shu,Ting Zhong,Qiang Gao,Yong Wang,Leiting Chen,Fan Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI2026 main track
Abstract:Hate Video Detection (HVD) is crucial for online ecosystems. Existing methods assume identical distributions between training (source) and inference (target) data. However, hateful content often evolves into irregular and ambiguous forms to evade censorship, resulting in substantial semantic drift and rendering previously trained models ineffective. Test-Time Adaptation (TTA) offers a solution by adapting models during inference to narrow the cross-domain gap, while conventional TTA methods target mild distribution shifts and struggle with the severe semantic drift in HVD. To tackle these challenges, we propose SCANNER, the first TTA framework tailored for HVD. Motivated by the insight that, despite the evolving nature of hateful manifestations, their underlying cores remain largely invariant (i.e., targeting is still based on characteristics like gender, race, etc), we leverage these stable cores as a bridge to connect the source and target domains. Specifically, SCANNER initially reveals the stable cores from the ambiguous layout in evolving hateful content via a principled centroid-guided alignment mechanism. To alleviate the impact of outlier-like samples that are weakly correlated with centroids during the alignment process, SCANNER enhances the prior by incorporating a sample-level adaptive centroid alignment strategy, promoting more stable adaptation. Furthermore, to mitigate semantic collapse from overly uniform outputs within clusters, SCANNER introduces an intra-cluster diversity regularization that encourages the cluster-wise semantic richness. Experiments show that SCANNER outperforms all baselines, with an average gain of 4.69% in Macro-F1 over the best.
zh
[CV-295] PovNet: A Deep Learning Architecture for Socially Assistive Robots to Learn and Assist with Multiple Activities of Daily Living
【Quick Read】: This paper addresses the difficulty autonomous socially assistive robots face in simultaneously perceiving and assisting with multiple activities of daily living (ADLs) during long-term deployment. The key to the solution is POVNet+, a novel multimodal deep learning architecture that introduces both ADL and motion embedding spaces to precisely distinguish between a known ADL performed normally, a new unseen ADL, and a known ADL performed atypically, enabling the robot to proactively initiate appropriate assistive behaviors in real scenarios. In addition, a novel user state estimation method applied in the motion embedding space lets the system recognize new ADLs while monitoring user performance, yielding effective perception of and response to diverse ADLs (including unseen and atypically performed ones) and markedly improving the accuracy and proactivity of assistive interactions.
Link: https://arxiv.org/abs/2602.00131
Authors: Fraser Robinson,Souren Pashangpour,Matthew Lisondra,Goldie Nejat
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Submitted to Advanced Robotics (Taylor & Francis)
Abstract:A significant barrier to the long-term deployment of autonomous socially assistive robots is their inability to both perceive and assist with multiple activities of daily living (ADLs). In this paper, we present the first multimodal deep learning architecture, POVNet+, for multi-activity recognition for socially assistive robots to proactively initiate assistive behaviors. Our novel architecture introduces the use of both ADL and motion embedding spaces to uniquely distinguish between a known ADL being performed, a new unseen ADL, or a known ADL being performed atypically in order to assist people in real scenarios. Furthermore, we apply a novel user state estimation method to the motion embedding space to recognize new ADLs while monitoring user performance. This ADL perception information is used to proactively initiate robot assistive interactions. Comparison experiments with state-of-the-art human activity recognition methods show our POVNet+ method has higher ADL classification accuracy. Human-robot interaction experiments in a cluttered living environment with multiple users and the socially assistive robot Leia using POVNet+ demonstrate the ability of our multimodal ADL architecture to successfully identify seen ADLs, unseen ADLs, and ADLs being performed atypically, while initiating appropriate assistive human-robot interactions.
zh
[CV-296] D3R-Net: Dual-Domain Denoising Reconstruction Network for Robust Industrial Anomaly Detection
【Quick Read】: This paper targets the problem in unsupervised anomaly detection (UAD) that reconstruction-based methods oversmooth their outputs, so subtle defects are partially reconstructed rather than highlighted, limiting defect segmentation accuracy. The key to the solution is D3R-Net, a dual-domain denoising reconstruction framework that couples a self-supervised "healing" task with frequency-aware regularization: synthetically corrupted normal images are used to train the model to reconstruct clean targets, preventing trivial identity mapping and forcing it to learn the manifold of defect-free textures, while a Fast Fourier Transform (FFT) magnitude loss added to the spatial mean squared error enforces frequency-domain consistency and better preserves high-frequency detail. Experiments show the method clearly improves localization consistency metrics (e.g., PRO AUC) and outperforms a spatial-only baseline across multiple MVTec AD categories.
Link: https://arxiv.org/abs/2602.00126
Authors: Dmytro Filatov,Valentyn Fedorov,Vira Filatova,Andrii Zelenchuk
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 9 pages
Abstract:Unsupervised anomaly detection (UAD) is a key ingredient of automated visual inspection in modern manufacturing. Reconstruction-based methods are appealing for their simple architectures and fast processing, but they oversmooth high-frequency details. As a result, subtle defects are partially reconstructed rather than highlighted, which limits segmentation accuracy. We build on this line of work and introduce D3R-Net, a Dual-Domain Denoising Reconstruction framework that couples a self-supervised ‘healing’ task with frequency-aware regularization. During training, the network receives synthetically corrupted normal images and is asked to reconstruct the clean targets, which prevents trivial identity mapping and pushes the model to learn the manifold of defect-free textures. In addition to the spatial mean squared error, we employ a Fast Fourier Transform (FFT) magnitude loss that encourages consistency in the frequency domain. The implementation also allows an optional structural similarity (SSIM) term, which we study in an ablation. On the MVTec AD Hazelnut benchmark, D3R-Net with the FFT loss improves localization consistency over a spatial-only baseline: PRO AUC increases from 0.603 to 0.687, while image-level ROC AUC remains robust. Evaluated across fifteen MVTec categories, the FFT variant raises the average pixel ROC AUC from 0.733 to 0.751 and PRO AUC from 0.417 to 0.468 compared to the MSE-only baseline, at roughly 20 FPS on a single GPU. The network is trained from scratch and uses a lightweight convolutional autoencoder backbone, providing a practical alternative to heavy pre-trained feature embedding methods.
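The dual-domain objective is straightforward to reproduce. Below is a minimal PyTorch sketch of a spatial MSE combined with an FFT magnitude loss; the weighting `fft_weight` and the L1 distance on the magnitude spectra are our illustrative choices, since the abstract does not give the exact formulation.

```python
import torch
import torch.nn.functional as F

def dual_domain_loss(reconstruction, target, fft_weight=0.1):
    """Spatial MSE plus an FFT-magnitude consistency term. The weight and
    the L1 distance on magnitudes are illustrative choices."""
    spatial = F.mse_loss(reconstruction, target)
    # Magnitude spectra over the last two (spatial) dimensions.
    rec_mag = torch.fft.rfft2(reconstruction, norm="ortho").abs()
    tgt_mag = torch.fft.rfft2(target, norm="ortho").abs()
    return spatial + fft_weight * F.l1_loss(rec_mag, tgt_mag)
```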
zh
[CV-297] Context-Aware Autoencoders for Anomaly Detection in Maritime Surveillance
【Quick Read】: This paper addresses the limited ability of conventional autoencoders to detect collective and contextual anomalies in vessel traffic surveillance, particularly in the maritime domain where anomalous behaviour depends on contextual information self-reported by vessels via the Automatic Identification System (AIS). The key to the solution is a context-aware autoencoder whose core innovation is the introduction of context-specific thresholds, improving reconstruction-error-based anomaly detection accuracy while reducing computational cost. Empirical results show the method clearly outperforms conventional autoencoders and other variants on time-series data, underscoring the importance of contextual information for anomaly detection.
Link: https://arxiv.org/abs/2602.00124
Authors: Divya Acharya,Pierre Bernabé,Antoine Chevrot,Helge Spieker,Arnaud Gotlieb,Bruno Legeard
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The detection of anomalies is crucial to ensuring the safety and security of maritime vessel traffic surveillance. Although autoencoders are popular for anomaly detection, their effectiveness in identifying collective and contextual anomalies is limited, especially in the maritime domain, where anomalies depend on vessel-specific contexts derived from self-reported AIS messages. To address these limitations, we propose a novel solution: the context-aware autoencoder. By integrating context-specific thresholds, our method improves detection accuracy and reduces computational cost. We compare four context-aware autoencoder variants and a conventional autoencoder using a case study focused on fishing status anomalies in maritime surveillance. Results demonstrate the significant impact of context on reconstruction loss and anomaly detection. The context-aware autoencoder outperforms others in detecting anomalies in time series data. By incorporating context-specific thresholds and recognizing the importance of context in anomaly detection, our approach offers a promising solution to improve accuracy in maritime vessel traffic surveillance systems.
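The paper does not spell out how the context-specific thresholds are derived, but the idea can be sketched as fitting one reconstruction-error threshold per context (e.g., per fishing status) on normal data. A hypothetical NumPy version, with the quantile level as an assumed hyperparameter:

```python
import numpy as np

def fit_context_thresholds(errors, contexts, q=0.99):
    """One reconstruction-error threshold per context, fitted on normal
    data. The quantile level q is an assumed hyperparameter."""
    errors, contexts = np.asarray(errors), np.asarray(contexts)
    return {c: np.quantile(errors[contexts == c], q) for c in np.unique(contexts)}

def flag_anomalies(errors, contexts, thresholds):
    # A sample is anomalous if its error exceeds its own context's threshold.
    return np.array([e > thresholds[c] for e, c in zip(errors, contexts)])
```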
zh
[CV-298] Visual Affect Analysis: Predicting Emotions of Image Viewers with Vision-Language Models
【Quick Read】: This paper investigates how closely the outputs of vision-language models (VLMs) align with human affective ratings when inferring the emotional responses elicited by visual stimuli at scale. Nine VLMs are benchmarked zero-shot on three psychometrically validated affective image datasets across two tasks, discrete emotion classification and continuous affect rating prediction, and the effect of rater-conditioned prompting is also examined. The key lies in systematically quantifying prediction accuracy and bias patterns across emotion categories and affective dimensions: VLMs perform well at discrete emotion classification (60%-80% accuracy) but show consistent biases in continuous rating prediction, such as weaker sensitivity to arousal and a general tendency to overestimate emotional intensity, while current prompting strategies contribute little to accuracy. The results indicate that VLMs capture broad affective trends but still lack psychometric nuance, pointing to directions for improvement in affective computing and mental-health-related applications.
Link: https://arxiv.org/abs/2602.00123
Authors: Filip Nowicki,Hubert Marciniak,Jakub Łączkowski,Krzysztof Jassem,Tomasz Górecki,Vimala Balakrishnan,Desmond C. Ong,Maciej Behnke
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Vision-language models (VLMs) show promise as tools for inferring affect from visual stimuli at scale, yet it is not clear how closely their outputs align with human affective ratings. We benchmarked nine VLMs, ranging from state-of-the-art proprietary models to open-source models, on three psycho-metrically validated affective image datasets: the International Affective Picture System, the Nencki Affective Picture System, and the Library of AI-Generated Affective Images. The models performed two tasks in the zero-shot setting: (i) top-emotion classification (selecting the strongest discrete emotion elicited by an image) and (ii) continuous prediction of human ratings on 1-7/9 Likert scales for discrete emotion categories and affective dimensions. We also evaluated the impact of rater-conditioned prompting on the LAI-GAI dataset using de-identified participant metadata. The results show good performance in discrete emotion classification, with accuracies typically ranging from 60% to 80% on six-emotion labels and from 60% to 75% on a more challenging 12-category task. The predictions of anger and surprise had the lowest accuracy in all datasets. For continuous rating prediction, models showed moderate to strong alignment with humans (r > 0.75) but also exhibited consistent biases, notably weaker performance on arousal, and a tendency to overestimate response strength. Rater-conditioned prompting resulted in only small, inconsistent changes in predictions. Overall, VLMs capture broad affective trends but lack the nuance found in validated psychological ratings, highlighting their potential and current limitations for affective computing and mental health-related applications.
zh
[CV-299] VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents
【Quick Read】: This paper addresses the shortcomings of current multimodal image editing models on visual document image editing, in particular the lack of effective evaluation for dense text layouts, complex structures, and non-Latin scripts such as Chinese. Existing approaches such as AnyText, GlyphControl, and TextCtrl focus mainly on English scenarios and struggle with densely textual content, limiting performance on real-world multilingual document editing. The key to the solution is VDE Bench (Visual Doc Edit Bench), the first systematic benchmark for this task: it provides a rigorously human-annotated and verified high-quality dataset covering diverse Chinese and English documents (academic papers, posters, examination materials, etc.) and introduces a decoupled evaluation framework that quantifies text modification accuracy at the OCR parsing level, offering a reliable, reproducible standard for evaluating image editing models in multilingual, dense-text settings.
Link: https://arxiv.org/abs/2602.00122
Authors: Hongzhu Yi,Yujia Yang,Yuanxiang Wang,Zhenyu Guan,Jiahuan Chen,Chenxi Bao,Tiankun Yang,Yixuan Yuan,Tianyu Zong,Xinming Wang,Tao Yu,Ruiwen Tao,Haijin Liang,Jin Ma,Jinwen Luo,Yeshani Xinyu Zuo,Jungang Xu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:
Abstract:In recent years, multimodal image editing models have achieved substantial progress, enabling users to manipulate visual content through natural language in a flexible and interactive manner. Nevertheless, an important yet insufficiently explored research direction remains visual document image editing, which involves modifying textual content within images while faithfully preserving the original text style and background context. Existing approaches, including AnyText, GlyphControl, and TextCtrl, predominantly focus on English-language scenarios and documents with relatively sparse textual layouts, thereby failing to adequately address dense, structurally complex documents or non-Latin scripts such as Chinese. To bridge this gap, we propose Visual Doc Edit Bench (VDE Bench), a rigorously human-annotated and evaluated benchmark specifically designed to assess image editing models on multilingual and complex visual document editing tasks. The benchmark comprises a high-quality dataset encompassing densely textual documents in both English and Chinese, including academic papers, posters, presentation slides, examination materials, and newspapers. Furthermore, we introduce a decoupled evaluation framework that systematically quantifies editing performance at the OCR parsing level, enabling fine-grained assessment of text modification accuracy. Based on this benchmark, we conduct a comprehensive evaluation of representative state-of-the-art image editing models. Manual verification demonstrates a strong consistency between human judgments and automated evaluation metrics. VDE Bench constitutes the first systematic benchmark for evaluating image editing models on multilingual and densely textual visual documents.
zh
[CV-300] IC-EO: Interpretable Code-based assistant for Earth Observation
【Quick Read】: This paper addresses the difficulty non-expert users face in performing Earth Observation (EO) analysis and the lack of transparency and reproducibility of its results: most current systems depend on expert knowledge and return black-box predictions that are hard to audit or reproduce. The key to the solution is an LLM-based conversational code-generating agent that turns natural-language queries into executable, auditable Python workflows, built on a unified and easily extendable API supporting classification, segmentation, object detection (oriented bounding boxes), spectral indices, and geospatial operators. The framework allows results to be controlled at three levels, tool, agent, and task, and on two representative use cases (land-composition mapping and post-wildfire damage assessment) it clearly outperforms general-purpose LLM/VLM baselines (GPT-4o and LLaVA) while keeping outputs transparent and verifiable.
Link: https://arxiv.org/abs/2602.00117
Authors: Lamia Lahouel,Laurynas Lopata,Simon Gruening,Gabriele Meoni,Gaetan Petit,Sylvain Lobry
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 1 figure
Abstract:Despite recent advances in computer vision, Earth Observation (EO) analysis remains difficult for laypeople to perform, requiring expert knowledge and technical capabilities. Furthermore, many systems return black-box predictions that are difficult to audit or reproduce. Leveraging recent advances in tool-using LLMs, this study proposes a conversational, code-generating agent that transforms natural-language queries into executable, auditable Python workflows. The agent operates over a unified, easily extendable API for classification, segmentation, detection (oriented bounding boxes), spectral indices, and geospatial operators. With our proposed framework, it is possible to control the results at three levels: (i) tool-level performance on public EO benchmarks; (ii) at the agent-level to understand the capacity to generate valid, hallucination-free code; and (iii) at the task-level on specific use cases. In this work, we select two use-cases of interest: land-composition mapping and post-wildfire damage assessment. The proposed agent outperforms general-purpose LLM/VLM baselines (GPT-4o, LLaVA), achieving 64.2% vs. 51.7% accuracy on land-composition and 50% vs. 0% on post-wildfire analysis, while producing results that are transparent and easy to interpret. By outputting verifiable code, the approach turns EO analysis into a transparent, reproducible process.
zh
[CV-301] Event Driven Clustering Algorithm
【Quick Read】: This paper addresses real-time detection of small event clusters in event camera data. Traditional hierarchical clustering algorithms are often too computationally expensive to meet real-time requirements. The key of the proposed asynchronous, event-driven clustering algorithm is that it fully exploits the asynchronous data structure peculiar to event cameras and, through an efficient and simple decision-making scheme, attains linear time complexity O(n) in the number of events n while preserving clustering accuracy, with a run time independent of the pixel-array dimensions, thereby greatly improving real-time performance.
Link: https://arxiv.org/abs/2602.00115
Authors: David El-Chai Ben-Ezra,Adar Tal,Daniel Brisk
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ~10 pages, 2 figures
Abstract:This paper introduces a novel asynchronous, event-driven algorithm for real-time detection of small event clusters in event camera data. Like other hierarchical agglomerative clustering algorithms, it detects event clusters based on their spatio-temporal distance. However, the algorithm leverages the special asynchronous data structure of event cameras and, through sophisticated yet simple and efficient decision-making, enjoys a linear complexity of O(n), where n is the number of events. In addition, the run time of the algorithm is independent of the dimensions of the pixel array.
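To make the idea concrete, here is a hypothetical single-pass Python sketch of spatio-temporal event clustering: each incoming event either joins an active cluster within an assumed pixel radius and time window or opens a new one. The radius, time window, and bookkeeping are illustrative; the paper's actual decision rule and its strict O(n) guarantee are not reproduced here.

```python
def cluster_events(events, r_px=3, dt_us=5000):
    """One-pass spatio-temporal clustering of (t, x, y) events sorted by
    timestamp. Radius, time window, and bookkeeping are illustrative; the
    paper's decision rule and its strict O(n) guarantee are not reproduced.
    """
    active = []    # (cluster_id, t, x, y) of each active cluster's latest event
    labels, next_id = [], 0
    for t, x, y in events:
        # Retire clusters that have been silent longer than dt_us.
        active = [c for c in active if t - c[1] <= dt_us]
        match = next((c[0] for c in active
                      if abs(x - c[2]) <= r_px and abs(y - c[3]) <= r_px), None)
        if match is None:
            match, next_id = next_id, next_id + 1
        # Update the matched cluster's latest event.
        active = [c for c in active if c[0] != match] + [(match, t, x, y)]
        labels.append(match)
    return labels
```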
zh
[CV-302] 1S-DAug: One-Shot Data Augmentation for Robust Few-Shot Generalization
【Quick Read】: This paper addresses the weak generalization of few-shot learning (FSL) models to novel classes, particularly in settings where traditional test-time augmentation fails. The key to the solution is 1S-DAug, a one-shot generative augmentation operator that synthesizes diverse yet faithful image variants from a single labeled example at test time: traditional geometric perturbations are coupled with controlled noise injection, and a denoising diffusion process conditioned on the original image produces high-quality augmented samples; these are then encoded and aggregated into a combined representation for more robust FSL predictions. The method requires no model parameter updates and serves as a training-free, model-agnostic plugin, yielding consistent gains across multiple standard datasets, including over 10% accuracy improvement on the miniImageNet 5-way-1-shot benchmark.
Link: https://arxiv.org/abs/2602.00114
Authors: Yunwei Bai,Ying Kiat Tan,Yao Shu,Tsuhan Chen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Few-shot learning (FSL) challenges model generalization to novel classes based on just a few shots of labeled examples, a testbed where traditional test-time augmentations fail to be effective. We introduce 1S-DAug, a one-shot generative augmentation operator that synthesizes diverse yet faithful variants from just one example image at test time. 1S-DAug couples traditional geometric perturbations with controlled noise injection and a denoising diffusion process conditioned on the original image. The generated images are then encoded and aggregated, alongside the original image, into a combined representation for more robust FSL predictions. Integrated as a training-free, model-agnostic plugin, 1S-DAug consistently improves FSL across standard benchmarks on 4 different datasets without any model parameter update, including achieving over 10% proportional accuracy improvement on the miniImageNet 5-way-1-shot benchmark. Code will be released.
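A schematic of the augment-encode-aggregate flow described above, with `encoder` and `diffusion_refine` as placeholder callables for the FSL backbone and the image-conditioned denoiser (both assumptions, since the actual components are not specified in the abstract):

```python
import torch

def one_shot_augment_embed(image, encoder, diffusion_refine, n_views=8, noise=0.1):
    """Augment-encode-aggregate at test time. `encoder` (FSL backbone) and
    `diffusion_refine` (image-conditioned denoiser) are placeholder
    callables; the actual components are not specified in the abstract."""
    views = [image]
    for _ in range(n_views):
        v = torch.flip(image, dims=[-1]) if torch.rand(()) < 0.5 else image  # geometric perturbation
        v = v + noise * torch.randn_like(v)                 # controlled noise injection
        views.append(diffusion_refine(v, condition=image))  # denoise, conditioned on the original
    # Aggregate all embeddings into one combined representation.
    return torch.stack([encoder(v) for v in views]).mean(dim=0)
```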
zh
[CV-303] AI-Driven Three-Dimensional Reconstruction and Quantitative Analysis for Burn Injury Assessment
【Quick Read】: This paper addresses the subjectivity of traditional visual inspection and 2D photography in burn assessment and their limitations for longitudinal comparison, which compromise treatment planning, healing monitoring, and medico-legal documentation. The key to the solution is an AI-enabled platform that integrates multi-view photogrammetry, 3D surface reconstruction, and deep-learning-based segmentation: from standard multi-angle images captured with consumer-grade cameras, it reconstructs patient-specific 3D burn surfaces, maps burn regions onto anatomy, and computes objective metrics including surface area, total body surface area (TBSA), depth-related geometric proxies, and volumetric change, while spatially aligning successive reconstructions to quantify healing progression, ultimately supporting a full clinical workflow from image capture to automated report generation.
Link: https://arxiv.org/abs/2602.00113
Authors: S. Kalaycioglu,C. Hong,K. Zhai,H. Xie,J.N. Wong
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages and 5 figures
Abstract:Accurate, reproducible burn assessment is critical for treatment planning, healing monitoring, and medico-legal documentation, yet conventional visual inspection and 2D photography are subjective and limited for longitudinal comparison. This paper presents an AI-enabled burn assessment and management platform that integrates multi-view photogrammetry, 3D surface reconstruction, and deep learning-based segmentation within a structured clinical workflow. Using standard multi-angle images from consumer-grade cameras, the system reconstructs patient-specific 3D burn surfaces and maps burn regions onto anatomy to compute objective metrics in real-world units, including surface area, TBSA, depth-related geometric proxies, and volumetric change. Successive reconstructions are spatially aligned to quantify healing progression over time, enabling objective tracking of wound contraction and depth reduction. The platform also supports structured patient intake, guided image capture, 3D analysis and visualization, treatment recommendations, and automated report generation. Simulation-based evaluation demonstrates stable reconstructions, consistent metric computation, and clinically plausible longitudinal trends, supporting a scalable, non-invasive approach to objective, geometry-aware burn assessment and decision support in acute and outpatient care.
zh
[CV-304] From Manual Observation to Automated Monitoring: Space Allowance Effects on Play Behaviour in Group-Housed Dairy Calves
【Quick Read】: This paper addresses the poorly characterized effect of space allowance on play behaviour in dairy calves under commercial conditions, where the non-linear relationship at intermediate-to-high allowances (6-20 m² per calf) has not been mapped out. The key to the solution is to combine field observation with automated computer vision: a multi-farm empirical study first reveals a non-linear space-play relationship (optimal at 8-10 m² per calf), and a deep-learning-based computer vision classifier (97.6% accuracy, 99.4% recall) is then developed and validated, scaling a small annotation effort into a continuous, scalable welfare monitoring system and providing both scientific evidence and technical support for precise husbandry management.
Link: https://arxiv.org/abs/2602.00111
Authors: Haiyu Yang,Heidi Lesscher,Enhong Liu,Miel Hostens
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:
Abstract:Play behaviour serves as a positive welfare indicator in dairy calves, yet the influence of space allowance under commercial conditions remains poorly characterized, particularly at intermediate-to-high allowances (6-20 m² per calf). This study investigated the relationship between space allowance and play behaviour in 60 group-housed dairy calves across 14 commercial farms in the Netherlands (space range: 2.66-17.98 m² per calf), and developed an automated computer vision pipeline for scalable monitoring. Video observations were analyzed using a detailed ethogram, with play expressed as a percentage of the observation period (%OP). Statistical analysis employed linear mixed models with farm as a random effect. A computer vision pipeline was trained on manual annotations from 108 hours on 6 farms and validated on held-out test data. The computer vision classifier achieved 97.6% accuracy with 99.4% recall for active play detection. Calves spent on average 1.0% of OP playing, reflecting around 10 minutes per 17-hour period. The space-play relationship was non-linear, with the highest play levels at 8-10 m² per calf (1.6% OP) and the lowest at 6-8 m² and 12-14 m² (0.6% OP). Space remained significant after controlling for age, health, and group size. In summary, these findings suggest that 8-10 m² per calf represents a practical target balancing welfare benefits with economic feasibility, and demonstrate that automated monitoring can scale small annotation projects to continuous welfare assessment systems.
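The statistical setup, a linear mixed model with farm as a random effect, can be sketched with statsmodels; the column names, file name, and the quadratic space term (to capture the non-linear space-play relationship) are illustrative assumptions, not the study's actual code:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative column names; the study's actual data layout is not public.
df = pd.read_csv("calf_play.csv")  # play_pct, space_m2, age, health, group_size, farm
model = smf.mixedlm(
    "play_pct ~ space_m2 + I(space_m2 ** 2) + age + health + group_size",
    data=df,
    groups=df["farm"],            # farm as a random effect
)
print(model.fit().summary())
```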
zh
[CV-305] Observing Health Outcomes Using Remote Sensing Imagery and Geo-Context Guided Visual Transformer
【Quick Read】: This paper addresses the inability of existing vision-language and multimodal models to effectively represent structured geospatial information in remote sensing image analysis: although such models align visual and textual semantics, they cannot support geospatial understanding and reasoning. The key to the solution is a new model guided by auxiliary geospatial information: a geospatial embedding mechanism converts diverse geospatial data into embedding patches spatially aligned with image patches, and a guided attention module dynamically derives attention weights from correlations with the auxiliary data to focus on the most relevant regions, while assigning distinct roles to individual attention heads so they capture complementary aspects of the guidance, improving both multimodal geospatial understanding and the interpretability of predictions.
Link: https://arxiv.org/abs/2602.00110
Authors: Yu Li,Guilherme N. DeSouza,Praveen Rao,Chi-Ren Shyu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Submitted to IEEE Transactions on Geoscience and Remote Sensing
Abstract:Visual transformers have driven major progress in remote sensing image analysis, particularly in object detection and segmentation. Recent vision-language and multimodal models further extend these capabilities by incorporating auxiliary information, including captions, question and answer pairs, and metadata, which broadens applications beyond conventional computer vision tasks. However, these models are typically optimized for semantic alignment between visual and textual content rather than geospatial understanding, and therefore are not suited for representing or reasoning with structured geospatial layers. In this study, we propose a novel model that enhances remote sensing imagery processing with guidance from auxiliary geospatial information. Our approach introduces a geospatial embedding mechanism that transforms diverse geospatial data into embedding patches that are spatially aligned with image patches. To facilitate cross-modal interaction, we design a guided attention module that dynamically integrates multimodal information by computing attention weights based on correlations with auxiliary data, thereby directing the model toward the most relevant regions. In addition, the module assigns distinct roles to individual attention heads, allowing the model to capture complementary aspects of the guidance information and improving the interpretability of its predictions. Experimental results demonstrate that the proposed framework outperforms existing pretrained geospatial foundation models in predicting disease prevalence, highlighting its effectiveness in multimodal geospatial understanding.
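As a rough schematic of the guided attention idea, the sketch below computes multi-head attention between image tokens and spatially aligned geospatial embedding patches, with each head attending over its own feature slice so heads can specialize; this is our reading of the module, not the authors' implementation:

```python
import torch

def guided_attention(img_tokens, geo_tokens, heads=4):
    """Toy guided-attention step between image tokens and spatially
    aligned geospatial embedding patches (both (B, N, D), D divisible
    by heads). Each head attends over its own feature slice, so heads
    can specialize on different aspects of the guidance. A schematic
    reading of the module, not the authors' implementation."""
    B, N, D = img_tokens.shape
    hd = D // heads
    split = lambda t: t.view(B, N, heads, hd).transpose(1, 2)  # (B, H, N, hd)
    q, k, v = split(img_tokens), split(geo_tokens), split(geo_tokens)
    attn = torch.softmax(q @ k.transpose(-2, -1) / hd ** 0.5, dim=-1)
    out = (attn @ v).transpose(1, 2).reshape(B, N, D)
    return img_tokens + out  # guidance fused residually
```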
zh
[CV-306] Robustness of Presentation Attack Detection in Remote Identity Validation Scenarios WACV2026
【Quick Read】: This paper addresses the insufficient robustness of presentation attack detection (PAD) subsystems in remote identity validation (RIV) under challenging environments and automated capture. The study shows that low-light conditions and automated image acquisition significantly degrade commercial PAD performance, increasing error odds by roughly a factor of four and two, respectively. The key takeaway is that PAD systems must be scenario-tested across the diverse environmental conditions encountered in real deployments to ensure reliable, robust operation in practice.
Link: https://arxiv.org/abs/2602.00109
Authors: John J. Howard(SAIC Identity and Data Sciences Laboratory),Richard O. Plesh(SAIC Identity and Data Sciences Laboratory),Yevgeniy B. Sirotin(SAIC Identity and Data Sciences Laboratory),Jerry L. Tipton(SAIC Identity and Data Sciences Laboratory),Arun R. Vemury(U.S. Department of Homeland Security, Science and Technology Directorate)
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted to the IEEE/CVF WACV 2026 Workshop on Generative, Adversarial and Presentation Attacks in Biometrics (GAPBio). 8 pages, 6 figures, 4 tables
Abstract:Presentation attack detection (PAD) subsystems are an important part of effective and user-friendly remote identity validation (RIV) systems. However, ensuring robust performance across diverse environmental and procedural conditions remains a critical challenge. This paper investigates the impact of low-light conditions and automated image acquisition on the robustness of commercial PAD systems using a scenario test of RIV. Our results show that PAD systems experience a significant decline in performance when utilized in low-light or auto-capture scenarios, with a model-predicted increase in error rates by a factor of about four under low-light conditions and a doubling of those odds under auto-capture workflows. Specifically, only one of the tested systems was robust to these perturbations, maintaining a maximum bona fide presentation classification error rate below 3% across all scenarios. Our findings emphasize the importance of testing across diverse environments to ensure robust and reliable PAD performance in real-world applications.
zh
[CV-307] SITUATE – Synthetic Object Counting Dataset for VLM training
【Quick Read】: This paper addresses the inadequacy of current datasets for evaluating vision language models (VLMs) on counting tasks with spatial constraints: existing datasets are either too simple (e.g., VLMCountBench) or lack control over occlusion and spatial composition (e.g., TallyQA), limiting model generalization. The key to the solution is SITUATE, a new dataset with strictly controlled spatial arrangements and occlusions that provides controllable yet challenging scenarios for training and evaluation; experiments show that fine-tuning Qwen VL 2.5 7B on SITUATE yields clear gains on the out-of-distribution Pixmo count test set, validating the dataset's effectiveness for improving generalization.
Link: https://arxiv.org/abs/2602.00108
Authors: René Peinl,Vincent Tischler,Patrick Schröder,Christian Groth
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: accepted at 21st International Conference on Computer Vision Theory and Applications
Abstract:We present SITUATE, a novel dataset designed for training and evaluating Vision Language Models on counting tasks with spatial constraints. The dataset bridges the gap between simple 2D datasets like VLMCountBench and often ambiguous real-life datasets like TallyQA, which lack control over occlusions and spatial composition. Experiments show that our dataset helps to improve generalization for out-of-distribution images, since fine-tuning Qwen VL 2.5 7B on SITUATE improves accuracy on the Pixmo count test data, but not vice versa. We cross-validate this by comparing model performance across other established counting benchmarks and against an equally sized fine-tuning set derived from Pixmo count.
zh
[CV-308] Efficient UAV trajectory prediction: A multi-modal deep diffusion framework
【Quick Read】: This paper addresses accurate trajectory prediction for unauthorized unmanned aerial vehicles (UAVs) in the low-altitude economy, to strengthen airspace safety management. The key to the solution is a Multi-Modal Deep Fusion Framework that fuses LiDAR and millimeter-wave radar information: two structurally identical modality-specific feature extraction networks process the LiDAR and radar point clouds respectively, and a bidirectional cross-attention mechanism enables information complementarity and semantic alignment between the two modalities, fully exploiting their complementary spatial-geometric and dynamic-reflection characteristics. This markedly improves trajectory prediction accuracy, achieving a 40% gain over the baseline on the MMAUD dataset.
Link: https://arxiv.org/abs/2602.00107
Authors: Yuan Gao,Xinyu Guo,Wenjing Xie,Zifan Wang,Hongwen Yu,Gongyang Li,Shugong Xu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments: in Chinese language
Abstract:To meet the requirements for managing unauthorized UAVs in the low-altitude economy, a multi-modal UAV trajectory prediction method based on the fusion of LiDAR and millimeter-wave radar information is proposed. A deep fusion network for multi-modal UAV trajectory prediction, termed the Multi-Modal Deep Fusion Framework, is designed. The overall architecture consists of two modality-specific feature extraction networks and a bidirectional cross-attention fusion module, aiming to fully exploit the complementary information of LiDAR and radar point clouds in spatial geometric structure and dynamic reflection characteristics. In the feature extraction stage, the model employs independent but structurally identical feature encoders for LiDAR and radar. After feature extraction, the model enters the Bidirectional Cross-Attention Mechanism stage to achieve information complementarity and semantic alignment between the two modalities. To verify the effectiveness of the proposed model, the MMAUD dataset used in the CVPR 2024 UG2+ UAV Tracking and Pose-Estimation Challenge is adopted as the training and testing dataset. Experimental results show that the proposed multi-modal fusion model significantly improves trajectory prediction accuracy, achieving a 40% improvement compared to the baseline model. In addition, ablation experiments are conducted to demonstrate the effectiveness of different loss functions and post-processing strategies in improving model performance. The proposed model can effectively utilize multi-modal data and provides an efficient solution for unauthorized UAV trajectory prediction in the low-altitude economy.
zh
[CV-309] HYPE-EDIT-1: Benchmark for Measuring Reliability in Frontier Image Editing Models
【Quick Read】: This paper addresses the insufficient evaluation of image editing models in real marketing and design workflows: public demos typically showcase best-case outputs, ignoring the cost of retries after failures and of human review in actual use. The key to the solution is HYPE-EDIT-1, a benchmark of 100 reference-based editing tasks, each with 10 independently generated outputs, enabling estimation of the per-attempt pass rate, pass@10, expected attempts, and an effective cost per successful edit that combines model price with human review time, thereby more faithfully reflecting the economic efficiency and practicality of generative AI models in industrial settings.
Link: https://arxiv.org/abs/2602.00105
Authors: Wing Chan,Richard Allen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 14 pages, 5 figures, for code and data, see this https URL
Abstract:Public demos of image editing models are typically best-case samples; real workflows pay for retries and review time. We introduce HYPE-EDIT-1, a 100-task benchmark of reference-based marketing/design edits with binary pass/fail judging. For each task we generate 10 independent outputs to estimate per-attempt pass rate, pass@10, expected attempts under a retry cap, and an effective cost per successful edit that combines model price with human review time. We release 50 public tasks and maintain a 50-task held-out private split for server-side evaluation, plus a standardized JSON schema and tooling for VLM and human-based judging. Across the evaluated models, per-attempt pass rates span 34-83 percent and effective cost per success spans USD 0.66-1.42. Models with low per-image pricing can end up more expensive once retries and human review are factored into the total effective cost.
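The effective-cost logic can be illustrated with a small calculation. The function below estimates the cost per successful edit from a per-attempt pass rate, a per-attempt model price, and a per-attempt review cost under a retry cap; this is our reading of the metric, not the benchmark's exact definition, and the prices in the example are made up:

```python
def effective_cost(pass_rate, price_per_attempt, review_cost, retry_cap=10):
    """Expected cost per successful edit under a retry cap. Assumes every
    attempt incurs the model price plus a human review cost."""
    p = pass_rate
    # Expected attempts made when stopping at the first pass or at the cap.
    expected_attempts = sum((1 - p) ** (k - 1) for k in range(1, retry_cap + 1))
    p_success = 1 - (1 - p) ** retry_cap
    return expected_attempts * (price_per_attempt + review_cost) / p_success

# Hypothetical prices: a cheap model with a 34% pass rate can cost more
# per successful edit than a pricier model passing 83% of attempts.
print(effective_cost(0.34, 0.04, 0.30))   # ~1.00 USD
print(effective_cost(0.83, 0.12, 0.30))   # ~0.51 USD
```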
zh
[CV-310] R3G: A Reasoning --Retrieval–Reranking Framework for Vision-Centric Answer Generation
【Quick Read】: This paper addresses how to effectively retrieve and integrate missing visual cues for vision-centric visual question answering (VQA). Existing methods fuse image selection and reasoning inefficiently, limiting model performance. The key to the solution is R3G, a modular Reasoning-Retrieval-Reranking framework: it first produces a brief reasoning plan that specifies the required visual cues, then applies a two-stage strategy, coarse retrieval followed by fine-grained reranking, to precisely select evidence images; experiments show that sufficiency-aware reranking and the reasoning step are complementary, markedly improving the accuracy and robustness of MLLMs across diverse scenarios.
Link: https://arxiv.org/abs/2602.00104
Authors: Zhuohong Chen,Zhengxian Wu,Zirui Liao,Shenao Jiang,Hangrui Xu,Yang Chen,Chaokui Su,Xiaoyu Liu,Haoqian Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Vision-centric retrieval for VQA requires retrieving images to supply missing visual cues and integrating them into the reasoning process. However, selecting the right images and integrating them effectively into the model’s reasoning remains challenging. To address this challenge, we propose R3G, a modular Reasoning-Retrieval-Reranking framework. R3G first produces a brief reasoning plan that specifies the required visual cues, then adopts a two-stage strategy, with coarse retrieval followed by fine-grained reranking, to select evidence images. On MRAG-Bench, R3G improves accuracy across six MLLM backbones and nine sub-scenarios, achieving state-of-the-art overall performance. Ablations show that sufficiency-aware reranking and reasoning steps are complementary, helping the model both choose the right images and use them well. We release code and data at this https URL.
zh
[CV-311] Mirag e2Matter: A Physically Grounded Gaussian World Model from Video
【Quick Read】: This paper addresses the scalability bottleneck in embodied intelligence caused by the scarcity of real-world interaction data. Existing simulation platforms typically exhibit substantial visual and physical gaps and rely on expensive sensors, precise robot calibration, or depth measurements, which hinders large-scale use. The key to the solution is "Simulate Anything", a graphics-driven world modeling and simulation framework: using only multi-view environment videos and off-the-shelf assets, it reconstructs real scenes with high-fidelity geometry and appearance via 3D Gaussian Splatting (3DGS), recovers physical plausibility with generative models, and aligns the scale of the reconstructed scene to the real world through a precision calibration target, yielding a unified, editable, physically grounded world model. VLA models trained on its simulated data achieve zero-shot performance matching or even surpassing training on real-world data.
Link: https://arxiv.org/abs/2602.00096
Authors: Zhengqing Gao,Ziwen Li,Xin Wang,Jiaxin Huang,Zhenyang Ren,Mingkai Shao,Hanlue Zhang,Tianyu Huang,Yongkang Cheng,Yandong Guo,Runqi Lin,Yuanyuan Wang,Tongliang Liu,Kun Zhang,Mingming Gong
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:The scalability of embodied intelligence is fundamentally constrained by the scarcity of real-world interaction data. While simulation platforms provide a promising alternative, existing approaches often suffer from a substantial visual and physical gap to real environments and rely on expensive sensors, precise robot calibration, or depth measurements, limiting their practicality at scale. We present Simulate Anything, a graphics-driven world modeling and simulation framework that enables efficient generation of high-fidelity embodied training data using only multi-view environment videos and off-the-shelf assets. Our approach reconstructs real-world environments into a photorealistic scene representation using 3D Gaussian Splatting (3DGS), seamlessly capturing fine-grained geometry and appearance from video. We then leverage generative models to recover a physically realistic representation and integrate it into a simulation environment via a precision calibration target, enabling accurate scale alignment between the reconstructed scene and the real world. Together, these components provide a unified, editable, and physically grounded world model. Vision Language Action (VLA) models trained on our simulated data achieve strong zero-shot performance on downstream tasks, matching or even surpassing results obtained with real-world data, highlighting the potential of reconstruction-driven world modeling for scalable and practical embodied intelligence training.
zh
[CV-312] EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions
【Quick Read】: This paper addresses the unreliable recognition of university STEM students' handwritten solutions by multimodal large language models (MLLMs), especially when solutions interleave mathematical formulas, diagrams, and textual reasoning; existing evaluation paradigms rely only on downstream outcomes (e.g., auto-grading) and fail to capture whether a model understands the full handwritten logic. The key to the solution is the EDU-CIRCUIT-HW dataset (1,300+ authentic student handwritten solutions), which, together with expert-verified verbatim transcriptions and grading reports, jointly evaluates MLLMs' upstream recognition fidelity and downstream auto-grading performance; a case study further shows that using identified error patterns to preemptively detect and rectify recognition errors, with only about 4% human intervention, substantially improves the robustness of AI-enabled grading on unseen solutions.
Link: https://arxiv.org/abs/2602.00095
Authors: Weiyu Sun,Liangliang Chen,Yongnuo Cai,Huiru Xie,Yi Zeng,Ying Zhang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:
Abstract:Multimodal Large Language Models (MLLMs) hold significant promise for revolutionizing traditional education and reducing teachers’ workload. However, accurately interpreting unconstrained STEM student handwritten solutions with intertwined mathematical formulas, diagrams, and textual reasoning poses a significant challenge due to the lack of authentic and domain-specific benchmarks. Additionally, current evaluation paradigms predominantly rely on the outcomes of downstream tasks (e.g., auto-grading), which often probe only a subset of the recognized content, thereby failing to capture the MLLMs’ understanding of complex handwritten logic as a whole. To bridge this gap, we release EDU-CIRCUIT-HW, a dataset consisting of 1,300+ authentic student handwritten solutions from a university-level STEM course. Utilizing the expert-verified verbatim transcriptions and grading reports of student solutions, we simultaneously evaluate various MLLMs’ upstream recognition fidelity and downstream auto-grading performance. Our evaluation uncovers an astonishing scale of latent failures within MLLM-recognized student handwritten content, highlighting the models’ insufficient reliability for auto-grading and other understanding-oriented applications in high-stakes educational settings. As a solution, we present a case study demonstrating that leveraging identified error patterns to preemptively detect and rectify recognition errors, with only minimal human intervention (approximately 4% of the total solutions), can significantly enhance the robustness of the deployed AI-enabled grading system on unseen student solutions.
zh
[CV-313] Lossless Embedding Compression via Spherical Coordinates
【Quick Read】: This paper addresses lossless compression of high-dimensional unit-norm embeddings for more efficient storage and transmission. The key to the solution is to exploit the distribution of the spherical coordinates of high-dimensional unit vectors: these angles concentrate around π/2, causing the exponent field of IEEE 754 floats to collapse to a single value, which sharply reduces entropy and enables efficient entropy coding. The method requires no training, is fully lossless within float32 precision, and achieves a compression ratio 25% better than the best prior method.
Link: https://arxiv.org/abs/2602.00079
Authors: Han Xiao
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present a lossless compression method for unit-norm embeddings that achieves 1.5× compression, 25% better than the best prior method. The method exploits that spherical coordinates of high-dimensional unit vectors concentrate around π/2, causing IEEE 754 exponents to collapse to a single value and enabling entropy coding. Evaluation across 26 configurations spanning text, image, and multi-vector embeddings confirms consistent improvement. The method requires no training and is fully lossless within float32 precision.
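The concentration effect is easy to verify numerically. The sketch below converts random float32 unit vectors to hyperspherical angles and inspects the IEEE 754 exponent field; it demonstrates the phenomenon the paper exploits, not the paper's actual codec:

```python
import numpy as np

rng = np.random.default_rng(0)
# 1000 random unit vectors in 1024 dimensions (float32 embeddings).
x = rng.standard_normal((1000, 1024)).astype(np.float32)
x /= np.linalg.norm(x, axis=1, keepdims=True)

def spherical_angles(v):
    # theta_i = arccos(v_i / ||(v_i, ..., v_n)||), the usual recursive definition.
    tail = np.sqrt(np.cumsum((v ** 2)[::-1])[::-1])
    return np.arccos(np.clip(v[:-1] / tail[:-1], -1.0, 1.0))

angles = np.vstack([spherical_angles(row) for row in x])
print(angles.mean(), angles.std())        # concentrates around pi/2 ~ 1.5708

# IEEE 754 float32 exponent field (bits 23..30) of each angle.
exponents = (angles.astype(np.float32).view(np.uint32) >> 23) & 0xFF
print((exponents == 127).mean())          # the vast majority share one exponent
```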
zh
[CV-314] Happy Young Women Grumpy Old Men? Emotion-Driven Demographic Biases in Synthetic Face Generation
【Quick Read】: This paper addresses biases in generative AI when synthesizing face images, specifically inconsistencies in demographic attributes (gender, race, age, and attractiveness) across cultural contexts and emotional prompts, and deviations from global population statistics. The key to the solution is a systematic audit of eight state-of-the-art text-to-image (T2I) models, four developed by Western organizations and four by Chinese institutions, all given identical natural-language prompts; attribute distributions of the generated faces are estimated with state-of-the-art facial analysis algorithms, and deviations are quantified with information-theoretic metrics such as the Kullback-Leibler and Jensen-Shannon divergences, revealing persistent demographic and emotion-conditioned biases in all models regardless of their training culture.
Link: https://arxiv.org/abs/2602.00032
Authors: Mengting Wei,Aditya Gulati,Guoying Zhao,Nuria Oliver
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 11 figures
Abstract:Synthetic face generation has rapidly advanced with the emergence of text-to-image (T2I) and of multimodal large language models, enabling high-fidelity image production from natural-language prompts. Despite the widespread adoption of these tools, the biases, representational quality, and cross-cultural consistency of these models remain poorly understood. Prior research on biases in the synthetic generation of human faces has examined demographic biases, yet there is little research on how emotional prompts influence demographic representation and how models trained in different cultural and linguistic contexts vary in their output distributions. We present a systematic audit of eight state-of-the-art T2I models comprising four models developed by Western organizations and four developed by Chinese institutions, all prompted identically. Using state-of-the-art facial analysis algorithms, we estimate the gender, race, age, and attractiveness levels in the generated faces. To measure the deviations from global population statistics, we apply information-theoretic bias metrics including Kullback-Leibler and Jensen-Shannon divergences. Our findings reveal persistent demographic and emotion-conditioned biases in all models regardless of their country of origin. We discuss implications for fairness, socio-technical harms, governance, and the development of transparent generative systems.
zh
Artificial Intelligence
[AI-0] Flow Policy Gradients for Robot Control
【Quick Read】: This paper addresses the limited expressiveness of traditional likelihood-based policy gradient methods for training robot control policies: because they rely on differentiable action likelihoods, policies are constrained to simple distributions such as Gaussians, which limits performance on complex tasks. The key to the solution is the flow matching policy gradients framework, which bypasses explicit likelihood computation and supports more expressive policy representations, enabling effective training on legged locomotion, humanoid motion tracking, and manipulation tasks, and markedly improving the robustness of sim-to-real transfer.
Link: https://arxiv.org/abs/2602.02481
Authors: Brent Yi,Hongsuk Choi,Himanshu Gaurav Singh,Xiaoyu Huang,Takara E. Truong,Carmelo Sferrazza,Yi Ma,Rocky Duan,Pieter Abbeel,Guanya Shi,Karen Liu,Angjoo Kanazawa
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Project webpage: this https URL
Abstract:Likelihood-based policy gradient methods are the dominant approach for training robot control policies from rewards. These methods rely on differentiable action likelihoods, which constrain policy outputs to simple distributions like Gaussians. In this work, we show how flow matching policy gradients – a recent framework that bypasses likelihood computation – can be made effective for training and fine-tuning more expressive policies in challenging robot control settings. We introduce an improved objective that enables success in legged locomotion, humanoid motion tracking, and manipulation tasks, as well as robust sim-to-real transfer on two humanoid robots. We then present ablations and analysis on training dynamics. Results show how policies can exploit the flow representation for exploration when training from scratch, as well as improved fine-tuning robustness over baselines.
zh
[AI-1] Agent Rx: Diagnosing AI Agent Failures from Execution Trajectories
【Quick Read】: This paper addresses the difficulty of localizing the causes of AI agent failures, which stems from probabilistic, long-horizon, multi-agent executions mediated by noisy tool outputs. The authors manually annotate 115 failed trajectories to build a new benchmark spanning structured API workflows, incident management, and open-ended web/file tasks, labeling each trajectory's critical failure step with a grounded-theory-derived, cross-domain failure taxonomy. The key to the solution is AGENTRX, a domain-agnostic automated diagnostic framework: it synthesizes constraints, verifies them step by step, and produces an auditable violation log with supporting evidence; an LLM-based judge then uses the log to determine the critical failure step and failure category, markedly improving step localization and failure attribution across domains.
Link: https://arxiv.org/abs/2602.02475
Authors: Shraddha Barke,Arnav Goyal,Alind Khare,Avaljot Singh,Suman Nath,Chetan Bansal
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:AI agents often fail in ways that are difficult to localize because executions are probabilistic, long-horizon, multi-agent, and mediated by noisy tool outputs. We address this gap by manually annotating failed agent runs and release a novel benchmark of 115 failed trajectories spanning structured API workflows, incident management, and open-ended web/file tasks. Each trajectory is annotated with a critical failure step and a category from a grounded-theory derived, cross-domain failure taxonomy. To mitigate the human cost of failure attribution, we present AGENTRX, an automated domain-agnostic diagnostic framework that pinpoints the critical failure step in a failed agent trajectory. It synthesizes constraints, evaluates them step-by-step, and produces an auditable validation log of constraint violations with associated evidence; an LLM-based judge uses this log to localize the critical step and category. Our framework improves step localization and failure attribution over existing baselines across three domains.
zh
[AI-2] Breaking the Reversal Curse in Autoregressive Language Models via Identity Bridge
【Quick Read】: This paper addresses the poor performance of autoregressive large language models (LLMs) on simple logical reasoning, in particular the "reversal curse": a model trained only on forward knowledge "A → B" (e.g., "Alice's husband is Bob") cannot deduce the reverse relation "B ← A" (e.g., "Bob's wife is Alice") at test time. Prior work regarded this as an inherent limitation of autoregressive causal LLMs, attributed to memorizing facts rather than capturing higher-level rules. This paper mitigates the problem with a simple data-regularization recipe, the Identity Bridge: adding identity relations of the form "A → A" (e.g., "Alice's name is Alice") to the training data. Theoretical analysis shows that under this recipe even a one-layer transformer can break the reversal curse via the implicit bias of gradient descent; empirically, a 1B pretrained model fine-tuned with the recipe reaches a 40% success rate on reversal tasks, versus near zero when trained on forward knowledge alone. The approach offers a low-cost, principled way to encourage LLMs to learn higher-level logical rules from data.
Link: https://arxiv.org/abs/2602.02470
Authors: Xutao Ma,Yixiao Huang,Hanlin Zhu,Somayeh Sojoudi
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Autoregressive large language models (LLMs) have achieved remarkable success in many complex tasks, yet they can still fail in very simple logical reasoning such as the “reversal curse” – when trained on forward knowledge data of the form “A → B” (e.g., Alice’s husband is Bob), the model is unable to deduce the reversal knowledge “B ← A” (e.g., Bob’s wife is Alice) during test. Extensive prior research suggests that this failure is an inherent, fundamental limit of autoregressive causal LLMs, indicating that these models tend to memorize factual-level knowledge rather than capture higher-level rules. In this paper, we challenge this view by showing that this seemingly fundamental limit can be mitigated by slightly tweaking the training data with a simple regularization data recipe called the Identity Bridge of the form “A → A” (e.g., The name of Alice is Alice). Theoretically, we prove that under this recipe, even a one-layer transformer can break the reversal curse by analyzing the implicit bias of gradient descent. Empirically, we show that a 1B pretrained language model finetuned with the proposed data recipe achieves a 40% success rate on reversal tasks, in stark contrast to a near-zero success rate when trained solely on forward-knowledge data. Our work provides a novel theoretical foundation for the reversal curse and offers a principled, low-cost path to encouraging LLMs to learn higher-level rules from data.
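The data recipe itself is trivial to implement. A toy generator, with templates that are illustrative rather than the paper's exact phrasing:

```python
def with_identity_bridge(forward_facts):
    """Augment forward-knowledge training data with identity facts.

    forward_facts: list of (entity_a, relation, entity_b) triples.
    The templates are illustrative; the recipe is simply to include
    "A -> A" statements alongside "A -> B" ones.
    """
    data = [f"{a}'s {rel} is {b}." for a, rel, b in forward_facts]
    entities = {e for a, _, b in forward_facts for e in (a, b)}
    data += [f"The name of {e} is {e}." for e in sorted(entities)]
    return data

print(with_identity_bridge([("Alice", "husband", "Bob")]))
# ["Alice's husband is Bob.", 'The name of Alice is Alice.', 'The name of Bob is Bob.']
```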
zh
[AI-3] World-Gymnast: Training Robots with Reinforcement Learning in a World Model
【Quick Read】: This paper addresses the high cost robots face when learning through physical interaction, as well as the limitations of supervised fine-tuning (SFT), constrained by scarce expert demonstrations, and software-simulator reinforcement learning (RL), constrained by the sim-to-real gap for manipulation. The key to the solution is World-Gymnast, which performs RL fine-tuning of a policy by rolling it out inside an action-conditioned video world model learned from real-world video-action data, with a vision-language model (VLM) rewarding the rollouts. On the Bridge robot setup, this markedly improves real-robot performance, up to 18x over SFT and up to 2x over a software simulator, while demonstrating intriguing capabilities of world-model RL such as training on diverse language instructions, generalizing to novel scenes, test-time online training, and iteratively improving the world model and policy together.
Link: https://arxiv.org/abs/2602.02454
Authors: Ansh Kumar Sharma,Yixiang Sun,Ninghao Lu,Yunzhe Zhang,Jiarao Liu,Sherry Yang
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: this https URL
Abstract:Robot learning from interacting with the physical world is fundamentally bottlenecked by the cost of physical interaction. The two alternatives, supervised finetuning (SFT) from expert demonstrations and reinforcement learning (RL) in a software-based simulator, are limited by the amount of expert data available and the sim-to-real gap for manipulation. With the recent emergence of world models learned from real-world video-action data, we ask the question of whether training a policy in a world model can be more effective than supervised learning or software simulation in achieving better real-robot performance. We propose World-Gymnast, which performs RL finetuning of a vision-language-action (VLA) policy by rolling out the policy in an action-conditioned video world model and rewarding the rollouts with a vision-language model (VLM). On the Bridge robot setup, World-Gymnast outperforms SFT by as much as 18x and outperforms software simulator by as much as 2x. More importantly, World-Gymnast demonstrates intriguing capabilities of RL with a world model, including training on diverse language instructions and novel scenes from the world model, test-time training in a novel scene, and online iterative world model and policy improvement. Our results suggest learning a world model and training robot policies in the cloud could be the key to bridging the gap between robots that work in demonstrations and robots that can work in anyone’s household.
zh
[AI-4] hinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling
【Quick Read】: This paper addresses the trade-off between efficiency and expressiveness when multimodal large language models process visual information: static images cannot capture temporal structure, while videos convey dynamics at the cost of redundancy and heavy computation. The key to the solution is "Thinking with Comics", a new visual reasoning paradigm that uses comics as a high-information-density medium positioned between images and videos, preserving temporal structure, embedded text, and narrative coherence while substantially lowering reasoning cost. Experiments show it outperforms image-based reasoning on multi-step temporal and causal reasoning tasks and is far more efficient than video-based reasoning, and that comic narrative structure and style consistently affect performance, confirming comics as an effective intermediate visual representation.
Link: https://arxiv.org/abs/2602.02453
Authors: Andong Chen,Wenxin Zhu,Qiuyu Ding,Yuchen Song,Muyun Yang,Tiejun Zhao
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Working paper
Abstract:Chain-of-Thought reasoning has driven large language models to extend from thinking with text to thinking with images and videos. However, different modalities still have clear limitations: static images struggle to represent temporal structure, while videos introduce substantial redundancy and computational cost. In this work, we propose Thinking with Comics, a visual reasoning paradigm that uses comics as a high information-density medium positioned between images and videos. Comics preserve temporal structure, embedded text, and narrative coherence while requiring significantly lower reasoning cost. We systematically study two reasoning paths based on comics and evaluate them on a range of reasoning tasks and long-context understanding tasks. Experimental results show that Thinking with Comics outperforms Thinking with Images on multi-step temporal and causal reasoning tasks, while remaining substantially more efficient than Thinking with Video. Further analysis indicates that different comic narrative structures and styles consistently affect performance across tasks, suggesting that comics serve as an effective intermediate visual representation for improving multimodal reasoning.
zh
[AI-5] Active Causal Experimentalist (ACE): Learning Intervention Strategies via Direct Preference Optimization
【Quick Read】: This paper addresses the sequential decision problem of experimental design in causal discovery: how to adaptively choose the best interventions under a limited budget so as to identify causal structure efficiently. Traditional approaches such as random sampling, greedy information maximization, and round-robin coverage treat each decision in isolation and cannot learn adaptive strategies from experience. The key to the solution is Active Causal Experimentalist (ACE), a learning framework that treats experimental design as a sequential policy; its core innovation is to train with Direct Preference Optimization on relative comparisons between candidate interventions rather than on non-stationary information-gain values, sidestepping value-function instability. This yields 70-71% improvements over baselines (p < 0.001) on synthetic benchmarks, physics simulations, and economic data, and the learned policy autonomously discovers the theoretically grounded collider strategy of concentrating interventions on parent variables, showing that preference-based learning can combine principled strategies with learned domain adaptation.
Link: https://arxiv.org/abs/2602.02451
Authors: Patrick Cooper,Alvaro Velasquez
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures
Abstract:Discovering causal relationships requires controlled experiments, but experimentalists face a sequential decision problem: each intervention reveals information that should inform what to try next. Traditional approaches such as random sampling, greedy information maximization, and round-robin coverage treat each decision in isolation, unable to learn adaptive strategies from experience. We propose Active Causal Experimentalist (ACE), which learns experimental design as a sequential policy. Our key insight is that while absolute information gains diminish as knowledge accumulates (making value-based RL unstable), relative comparisons between candidate interventions remain meaningful throughout. ACE exploits this via Direct Preference Optimization, learning from pairwise intervention comparisons rather than non-stationary reward magnitudes. Across synthetic benchmarks, physics simulations, and economic data, ACE achieves 70-71% improvement over baselines at equal intervention budgets (p < 0.001, Cohen’s d ≈ 2). Notably, the learned policy autonomously discovers that collider mechanisms require concentrated interventions on parent variables, a theoretically-grounded strategy that emerges purely from experience. This suggests preference-based learning can recover principled experimental strategies, complementing theory with learned domain adaptation.
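The preference-learning core is the standard DPO objective (Rafailov et al.) applied to pairs of candidate interventions. A minimal PyTorch sketch, where the log-probabilities of the preferred and dispreferred interventions are assumed to come from the current policy and a frozen reference policy:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization on intervention pairs.

    logp_* are the policy's log-probabilities of the preferred (w) and
    dispreferred (l) candidate interventions; ref_logp_* come from a
    frozen reference policy. Its use for intervention selection is our
    reading of the abstract.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```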
zh
[AI-6] Poly-attention: a general scheme for higher-order self-attention
【Quick Read】: This paper addresses the limitation of the self-attention mechanism on tasks requiring higher-order correlations, such as detecting triples of correlated tokens or compositional tasks that must model complex interactions among multiple input tokens. Existing alternatives such as higher-order attention and Strassen attention offer partial remedies but typically at superquadratic time cost. The paper introduces poly-attention mechanisms, a broad class of generalized attention that combines arbitrary higher-order tensor computations with arbitrary relationship structures among input tokens, unifying and extending prior schemes. Key contributions include a systematic analysis of the computational complexity and representational strength of these mechanisms, time lower bounds for exact and approximate computation of the attention matrix, and a new mechanism computable exactly in quadratic time that can compose any fixed number of functions, revealing a tight trade-off between expressiveness and the magnitude of model coefficients.
Link: https://arxiv.org/abs/2602.02422
Authors: Sayak Chakrabarti,Toniann Pitassi,Josh Alman
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:The self-attention mechanism, at the heart of the Transformer model, is able to effectively model pairwise interactions between tokens. However, numerous recent works have shown that it is unable to perform basic tasks involving detecting triples of correlated tokens, or compositional tasks where multiple input tokens need to be referenced to generate a result. Some higher-dimensional alternatives to self-attention have been proposed to address this, including higher-order attention and Strassen attention, which can perform some of these polyadic tasks in exchange for slower, superquadratic running times. In this work, we define a vast class of generalizations of self-attention, which we call poly-attention mechanisms. Our mechanisms can incorporate arbitrary higher-order (tensor) computations as well as arbitrary relationship structures between the input tokens, and they include the aforementioned alternatives as special cases. We then systematically study their computational complexity and representational strength, including giving new algorithms and matching complexity-theoretic lower bounds on the time complexity of computing the attention matrix exactly as well as approximately, and tightly determining which polyadic tasks they can each perform. Our results give interesting trade-offs between different desiderata for these mechanisms, including a tight relationship between how expressive a mechanism is, and how large the coefficients in the model may be so that the mechanism can be approximated in almost-linear time. Notably, we give a new attention mechanism which can be computed exactly in quadratic time, and which can perform function composition for any fixed number of functions. Prior mechanisms, even for just composing two functions, could only be computed in superquadratic time, and our new lower bounds show that faster algorithms for them are not possible.
zh
[AI-7] SafeGround: Know When to Trust GUI Grounding Models via Uncertainty Calibration
【Quick Read】: This paper addresses the risk that GUI grounding models mislocalize natural-language instructions, which can trigger costly, hard-to-reverse actions (e.g., erroneous payment approvals) and raises reliability concerns. The key to the solution is the SafeGround framework, whose core innovation is a distribution-aware uncertainty quantification method that captures the spatial dispersion of stochastic samples from any model's outputs, combined with a calibration procedure that derives a test-time decision threshold with statistically guaranteed false discovery rate (FDR) control, enabling risk-aware predictions and substantial system-level accuracy improvements.
Link: https://arxiv.org/abs/2602.02419
Authors: Qingni Wang,Yue Fan,Xin Eric Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:
Abstract:Graphical User Interface (GUI) grounding aims to translate natural language instructions into executable screen coordinates, enabling automated GUI interaction. Nevertheless, incorrect grounding can result in costly, hard-to-reverse actions (e.g., erroneous payment approvals), raising concerns about model reliability. In this paper, we introduce SafeGround, an uncertainty-aware framework for GUI grounding models that enables risk-aware predictions through calibration before testing. SafeGround leverages a distribution-aware uncertainty quantification method to capture the spatial dispersion of stochastic samples from outputs of any given model. Then, through the calibration process, SafeGround derives a test-time decision threshold with statistically guaranteed false discovery rate (FDR) control. We apply SafeGround on multiple GUI grounding models for the challenging ScreenSpot-Pro benchmark. Experimental results show that our uncertainty measure consistently outperforms existing baselines in distinguishing correct from incorrect predictions, while the calibrated threshold reliably enables rigorous risk control and offers the potential for substantial system-level accuracy improvements. Across multiple GUI grounding models, SafeGround improves system-level accuracy by up to 5.38 percentage points over Gemini-only inference.
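A rough sketch of the two ingredients, dispersion-based uncertainty and a calibrated acceptance threshold, is given below; the dispersion statistic and the purely empirical FDR calibration are our simplifications of the paper's statistically guaranteed procedure:

```python
import numpy as np

def dispersion(samples):
    """Uncertainty as spatial dispersion of sampled (x, y) predictions:
    mean distance to the centroid. A simplification of the paper's
    distribution-aware score."""
    pts = np.asarray(samples, dtype=float)
    return np.linalg.norm(pts - pts.mean(axis=0), axis=1).mean()

def calibrate_threshold(scores, correct, target_fdr=0.1):
    """Loosest dispersion threshold whose empirical false discovery rate
    among accepted (low-dispersion) predictions stays within target_fdr.
    The paper's procedure adds statistical guarantees; this is the plain
    empirical version."""
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    order = np.argsort(scores)                    # most confident first
    fdr = np.cumsum(~correct[order]) / np.arange(1, len(scores) + 1)
    valid = np.where(fdr <= target_fdr)[0]
    return scores[order][valid[-1]] if len(valid) else -np.inf
```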
zh
[AI-8] Structure Enables Effective Self-Localization of Errors in LLM s
【Quick Read】: This paper addresses the elusiveness of effective self-correction in language models: models cannot accurately identify and localize erroneous reasoning steps, which blocks autonomous correction. The key to the solution is Iterative Correction Sampling of Thoughts (Thought-ICS), a framework that structures reasoning into discrete, semantically coherent thought steps so the model can detect errors at each decision point, and, based on the verification outcome, backtracks to the last correct point to regenerate alternative reasoning, achieving precise error localization and adaptive correction.
Link: https://arxiv.org/abs/2602.02416
Authors: Ankur Samanta,Akshayaa Magesh,Ayush Jain,Kavosh Asadi,Youliang Yu,Daniel Jiang,Boris Vidolov,Kaveh Hassani,Paul Sajda,Jalaj Bhandari,Yonathan Efroni
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Self-correction in language models remains elusive. In this work, we explore whether language models can explicitly localize errors in incorrect reasoning, as a path toward building AI systems that can effectively correct themselves. We introduce a prompting method that structures reasoning as discrete, semantically coherent thought steps, and show that models are able to reliably localize errors within this structure, while failing to do so in conventional, unstructured chain-of-thought reasoning. Motivated by how the human brain monitors errors at discrete decision points and resamples alternatives, we introduce Iterative Correction Sampling of Thoughts (Thought-ICS), a self-correction framework. Thought-ICS iteratively prompts the model to generate reasoning one discrete and complete thought at a time–where each thought represents a deliberate decision by the model–creating natural boundaries for precise error localization. Upon verification, the model localizes the first erroneous step, and the system backtracks to generate alternative reasoning from the last correct point. When asked to correct reasoning verified as incorrect by an oracle, Thought-ICS achieves 20-40% self-correction lift. In a completely autonomous setting without external verification, it outperforms contemporary self-correction baselines.
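The generate-verify-backtrack loop can be written down schematically. `generate_thought`, `verify`, and `localize` are hypothetical model-backed callables; the control flow mirrors the description above:

```python
def thought_ics(problem, generate_thought, verify, localize, max_rounds=4):
    """Sketch of Iterative Correction Sampling of Thoughts. The three
    callables are hypothetical model-backed helpers: generate_thought
    returns the next discrete thought (or None when the trace is done),
    verify judges the full trace, and localize returns the index of the
    first erroneous thought."""
    thoughts = []
    for _ in range(max_rounds):
        # Extend the trace one complete thought at a time.
        while (t := generate_thought(problem, thoughts)) is not None:
            thoughts.append(t)
        if verify(thoughts):
            return thoughts
        # Backtrack to the last correct step and resample from there.
        thoughts = thoughts[: localize(thoughts)]
    return thoughts
```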
zh
[AI-9] Didactic to Constructive: Turning Expert Solutions into Learnable Reasoning
【Quick Read】: This paper addresses two challenges in improving the reasoning ability of large language models (LLMs): first, many hard problems remain intractable even for current frontier models, so no valid training signal can be extracted; second, high-quality expert solutions are expensive to obtain and naive imitation fails due to distribution mismatch, since expert solutions are typically didactic, containing implicit reasoning gaps intended for human readers rather than model-executable reasoning traces. The key to the solution is Distribution Aligned Imitation Learning (DAIL), a two-step method: it first transforms expert solutions into detailed, in-distribution reasoning traces to bridge the distributional gap, then applies a contrastive objective focused on efficiently learning expert methodology and key insights, achieving sample-efficient, high-accuracy reasoning gains together with out-of-domain generalization.
Link: https://arxiv.org/abs/2602.02405
Authors: Ethan Mendes,Jungsoo Park,Alan Ritter
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Improving the reasoning capabilities of large language models (LLMs) typically relies either on the model’s ability to sample a correct solution to be reinforced or on the existence of a stronger model able to solve the problem. However, many difficult problems remain intractable for even current frontier models, preventing the extraction of valid training signals. A promising alternative is to leverage high-quality expert human solutions, yet naive imitation of this data fails because it is fundamentally out of distribution: expert solutions are typically didactic, containing implicit reasoning gaps intended for human readers rather than computational models. Furthermore, high-quality expert solutions are expensive, necessitating generalizable sample-efficient training methods. We propose Distribution Aligned Imitation Learning (DAIL), a two-step method that bridges the distributional gap by first transforming expert solutions into detailed, in-distribution reasoning traces and then applying a contrastive objective to focus learning on expert insights and methodologies. We find that DAIL can leverage fewer than 1000 high-quality expert solutions to achieve 10-25% pass@k gains on Qwen2.5-Instruct and Qwen3 models, improve reasoning efficiency by 2x to 4x, and enable out-of-domain generalization.
zh
[AI-10] David vs. Goliath: Verifiable Agent-to-Agent Jailbreaking via Reinforcement Learning
【速读】: This paper addresses a shift that occurs as large language models (LLMs) evolve into autonomous agents: because legitimate tool privileges can be abused, safety evaluation turns from a subjective NLP task into an objective control problem. The core question is how to identify and quantify a new class of attacks, Tag-Along Attacks, in which a tool-less adversary "tags along" on the trusted privileges of a safety-aligned Operator through conversation alone, inducing it to execute prohibited tool calls. The key to the solution is Slingshot, a "cold-start" reinforcement learning framework that autonomously discovers emergent attack vectors; experiments show that such attacks tend to converge to short, instruction-like syntactic patterns rather than multi-turn persuasion, reaching a 67.0% success rate on extreme-difficulty tasks (versus a 1.7% baseline) and cutting the expected attempts to first success from 52.3 to 1.3. Crucially, the attacks transfer zero-shot across model families, including closed-source models (e.g., Gemini 2.5 Flash) and defensively fine-tuned open-source models (e.g., Meta-SecAlign-8B), validating Tag-Along Attacks as a verifiable, first-class threat model with broad reach.
链接: https://arxiv.org/abs/2602.02395
作者: Samuel Nellessen,Tal Kachman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
备注: Under review. 8 main pages, 2 figures, 2 tables. Appendix included
Abstract:The evolution of large language models into autonomous agents introduces adversarial failures that exploit legitimate tool privileges, transforming safety evaluation in tool-augmented environments from a subjective NLP task into an objective control problem. We formalize this threat model as Tag-Along Attacks: a scenario where a tool-less adversary “tags along” on the trusted privileges of a safety-aligned Operator to induce prohibited tool use through conversation alone. To validate this threat, we present Slingshot, a ‘cold-start’ reinforcement learning framework that autonomously discovers emergent attack vectors, revealing a critical insight: in our setting, learned attacks tend to converge to short, instruction-like syntactic patterns rather than multi-turn persuasion. On held-out extreme-difficulty tasks, Slingshot achieves a 67.0% success rate against a Qwen2.5-32B-Instruct-AWQ Operator (vs. 1.7% baseline), reducing the expected attempts to first success (on solved tasks) from 52.3 to 1.3. Crucially, Slingshot transfers zero-shot to several model families, including closed-source models like Gemini 2.5 Flash (56.0% attack success rate) and defensive-fine-tuned open-source models like Meta-SecAlign-8B (39.2% attack success rate). Our work establishes Tag-Along Attacks as a first-class, verifiable threat model and shows that effective agentic attacks can be elicited from off-the-shelf open-weight models through environment interaction alone.
zh
[AI-11] Trust by Design: Skill Profiles for Transparent Cost-Aware LLM Routing
【速读】: This paper addresses a budget-efficiency problem in large language model (LLM) practice: how to select the most suitable model for a given task without wasting resources while maintaining performance. Standard benchmarks only provide aggregate metrics that obscure which specific capabilities a task requires, so model selection lacks both task-specificity and cost-effectiveness. The key to the solution is the BELLA (Budget-Efficient LLM Selection via Automated skill-profiling) framework, whose core mechanism has three stages: first, critic-based skill profiling extracts the fine-grained capabilities a task requires; second, these skills are clustered into structured capability matrices; finally, multi-objective optimization maximizes performance subject to budget constraints. The framework not only recommends optimal model choices but also provides natural-language rationales, improving the transparency and interpretability of the decision process.
链接: https://arxiv.org/abs/2602.02386
作者: Mika Okamoto,Ansel Kaplan Erol,Glenn Matlin
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Appeared at MLSys YPS 2025
Abstract:How should Large Language Model (LLM) practitioners select the right model for a task without wasting money? We introduce BELLA (Budget-Efficient LLM Selection via Automated skill-profiling), a framework that recommends optimal LLM selection for tasks through interpretable skill-based model selection. Standard benchmarks report aggregate metrics that obscure which specific capabilities a task requires and whether a cheaper model could suffice. BELLA addresses this gap through three stages: (1) decomposing LLM outputs to extract the granular skills required, using critic-based profiling, (2) clustering skills into structured capability matrices, and (3) multi-objective optimization to select the right models to maximize performance while respecting budget constraints. BELLA provides natural-language rationales for its recommendations, offering the transparency that current black-box routing systems lack. We describe the framework architecture, situate it within the landscape of LLM routing and evaluation, and discuss its application to financial reasoning as a representative domain exhibiting diverse skill requirements and cost variation across models. Our framework enables practitioners to make principled cost-performance trade-offs for deploying LLMs.
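Once skill profiles exist, the selection stage reduces to a small constrained optimization. A toy sketch follows; the skill names, scores, and costs are all invented, and the real system derives its profiles via critic-based decomposition rather than from a hand-written table:

```python
# Per-skill quality scores and per-call costs: all numbers hypothetical.
profiles = {
    "small":  {"arithmetic": 0.80, "table_reasoning": 0.55, "domain_finance": 0.50},
    "medium": {"arithmetic": 0.88, "table_reasoning": 0.75, "domain_finance": 0.70},
    "large":  {"arithmetic": 0.95, "table_reasoning": 0.90, "domain_finance": 0.88},
}
costs = {"small": 0.2, "medium": 1.0, "large": 5.0}

def recommend(task_skills, budget):
    """Among models that fit the budget, pick the one maximizing the mean
    score over the task's required skills, with a plain-language rationale."""
    feasible = [m for m in profiles if costs[m] <= budget]
    if not feasible:
        return None, "no model fits the budget"
    def mean_score(m):
        return sum(profiles[m][s] for s in task_skills) / len(task_skills)
    best = max(feasible, key=mean_score)
    rationale = (f"{best}: mean skill score {mean_score(best):.2f} "
                 f"at cost {costs[best]} within budget {budget}")
    return best, rationale

print(recommend(["arithmetic", "domain_finance"], budget=1.5))
# ('medium', 'medium: mean skill score 0.79 at cost 1.0 within budget 1.5')
```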
zh
[AI-12] Live-Evo: Online Evolution of Agentic Memory from Continuous Feedback
【速读】: This paper addresses the limited adaptivity of large language model (LLM) agents in continual-learning settings: existing self-evolving systems are built around static train/test splits and struggle with memory updates under true distribution shift and continuous feedback. The key to the solution is Live-Evo, an online self-evolving memory system that decouples "what happened" from "how to use it" through an Experience Bank and a Meta-Guideline Bank, dynamically compiling task-adaptive guidelines from retrieved experiences. The system manages memory online via feedback-driven experience weighting: high-value experiences are reinforced and retrieved more often, while unhelpful or stale experiences gradually decay and are forgotten, mirroring reinforcement and decay in human memory, which yields stable and effective gains on genuinely streaming data.
链接: https://arxiv.org/abs/2602.02369
作者: Yaolun Zhang,Yiran Wu,Yijiong Yu,Qingyun Wu,Huazheng Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages
Abstract:Large language model (LLM) agents are increasingly equipped with memory: stored experience and reusable guidance that can improve task-solving performance. Recent self-evolving systems update memory based on interaction outcomes, but most existing evolution pipelines are developed for static train/test splits and only approximate online learning by folding static benchmarks, making them brittle under true distribution shift and continuous feedback. We introduce Live-Evo, an online self-evolving memory system that learns from a stream of incoming data over time. Live-Evo decouples what happened from how to use it via an Experience Bank and a Meta-Guideline Bank, compiling task-adaptive guidelines from retrieved experiences for each task. To manage memory online, Live-Evo maintains experience weights and updates them from feedback: experiences that consistently help are reinforced and retrieved more often, while misleading or stale experiences are down-weighted and gradually forgotten, analogous to reinforcement and decay in human memory. On the live Prophet Arena benchmark over a 10-week horizon, Live-Evo improves Brier score by 20.8% and increases market returns by 12.9%, while also transferring to deep-research benchmarks with consistent gains over strong baselines. Our code is available at this https URL.
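A minimal sketch of the feedback-driven weight update (reinforce, down-weight, decay). The linear update rule and all constants are assumptions for illustration, not Live-Evo's actual schedule:

```python
class ExperienceBank:
    def __init__(self, lr=0.3, decay=0.02, floor=0.05):
        self.weights = {}          # experience id -> retrieval weight
        self.lr, self.decay, self.floor = lr, decay, floor

    def add(self, exp_id):
        self.weights[exp_id] = 1.0

    def feedback(self, used_ids, helped: bool):
        # reinforce experiences that helped; down-weight misleading ones
        for e in used_ids:
            delta = self.lr if helped else -self.lr
            self.weights[e] = max(self.floor, self.weights[e] + delta)

    def step(self):
        # passive decay: stale experiences are gradually forgotten
        for e in self.weights:
            self.weights[e] = max(self.floor, self.weights[e] - self.decay)

    def retrieve(self, k=2):
        return sorted(self.weights, key=self.weights.get, reverse=True)[:k]

bank = ExperienceBank()
for e in ["exp-a", "exp-b", "exp-c"]:
    bank.add(e)
bank.feedback(["exp-a"], helped=True)
bank.feedback(["exp-b"], helped=False)
bank.step()
print(bank.retrieve())  # ['exp-a', 'exp-c']
```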
zh
[AI-13] ReasonCACHE: Teaching LLMs To Reason Without Weight Updates
【速读】: This paper asks whether large language models (LLMs) can acquire complex reasoning abilities purely through in-context learning (ICL), without any weight updates. Conventional ICL has clear limitations on hard reasoning tasks: attention costs grow quadratically with context length, performance saturates or degrades, and the number of demonstrations cannot be scaled effectively. The key to the solution is ReasonCACHE, a mechanism that uses Prefix Tuning to distill demonstrations into a fixed-size key-value cache injected directly into the attention mechanism, enabling efficient and scalable reasoning acquisition without modifying model parameters. It not only outperforms standard ICL on challenging reasoning benchmarks and matches or surpasses in-weight learning (IWL) approaches, but is also more efficient along three axes (data, inference cost, and trainable parameters), and is provably strictly more expressive than low-rank weight updates.
链接: https://arxiv.org/abs/2602.02366
作者: Sharut Gupta,Phillip Isola,Stefanie Jegelka,David Lopez-Paz,Kartik Ahuja,Mark Ibrahim,Mohammad Pezeshki
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 26 pages, 17 Figures
Abstract:Can large language models (LLMs) learn to reason without any weight update and only through in-context learning (ICL)? ICL is strikingly sample-efficient, often learning from only a handful of demonstrations, but complex reasoning tasks typically demand many training examples to learn from. However, naively scaling ICL by adding more demonstrations breaks down at this scale: attention costs grow quadratically, performance saturates or degrades with longer contexts, and the approach remains a shallow form of learning. Due to these limitations, practitioners predominantly rely on in-weight learning (IWL) to induce reasoning. In this work, we show that by using Prefix Tuning, LLMs can learn to reason without overloading the context window and without any weight updates. We introduce ReasonCACHE, an instantiation of this mechanism that distills demonstrations into a fixed key-value cache. Empirically, across challenging reasoning benchmarks, including GPQA-Diamond, ReasonCACHE outperforms standard ICL and matches or surpasses IWL approaches. Further, it achieves this all while being more efficient across three key axes: data, inference cost, and trainable parameters. We also theoretically prove that ReasonCACHE can be strictly more expressive than a low-rank weight update, since the latter ties expressivity to input rank, whereas ReasonCACHE bypasses this constraint by directly injecting key-values into the attention mechanism. Together, our findings identify ReasonCACHE as a middle path between in-context and in-weight learning, providing a scalable algorithm for learning reasoning skills beyond the context window without modifying parameters. Our project page: this https URL
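Mechanically, prefix tuning prepends trainable key/value pairs to a frozen model's attention. A single-head toy sketch (dimensions arbitrary; the real method distills many demonstrations into this cache via training):

```python
import torch

def attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

d, ctx_len, n_prefix = 16, 10, 4
# learned prefix keys/values: the only trainable parameters here;
# the base model's weights stay frozen
prefix_k = torch.nn.Parameter(torch.randn(n_prefix, d) * 0.02)
prefix_v = torch.nn.Parameter(torch.randn(n_prefix, d) * 0.02)

q = torch.randn(ctx_len, d)   # queries/keys/values from the frozen model
k = torch.randn(ctx_len, d)
v = torch.randn(ctx_len, d)

# inject the distilled cache by prepending it to the real keys/values:
# context length is unchanged, so the window is never overloaded
out = attention(q, torch.cat([prefix_k, k]), torch.cat([prefix_v, v]))
print(out.shape)  # torch.Size([10, 16])
```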
zh
[AI-14] SWE-Universe: Scale Real-World Verifiable Environments to Millions
【速读】: This paper targets three challenges in automatically constructing verifiable real-world software engineering (SWE) environments: low production yield, weak verifiers, and prohibitive construction cost. The key to the SWE-Universe framework is a building agent powered by an efficient custom-trained model that uses iterative self-verification and in-loop hacking detection to reliably generate high-fidelity, verifiable task environments. This scales real-world multilingual SWE environments to a million scale (807,693) and supports large-scale agentic mid-training and reinforcement learning; applied to Qwen3-Max-Thinking, it reaches 75.3% on SWE-Bench Verified, providing both a critical resource and a robust methodology for the next generation of coding agents.
链接: https://arxiv.org/abs/2602.02361
作者: Mouxiang Chen,Lei Zhang,Yunlong Feng,Xuwu Wang,Wenting Zhao,Ruisheng Cao,Jiaxi Yang,Jiawei Chen,Mingze Li,Zeyao Ma,Hao Ge,Zongmeng Zhang,Zeyu Cui,Dayiheng Liu,Jingren Zhou,Jianling Sun,Junyang Lin,Binyuan Hui
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 13 pages
Abstract:We propose SWE-Universe, a scalable and efficient framework for automatically constructing real-world software engineering (SWE) verifiable environments from GitHub pull requests (PRs). To overcome the prevalent challenges of automatic building, such as low production yield, weak verifiers, and prohibitive cost, our framework utilizes a building agent powered by an efficient custom-trained model. This agent employs iterative self-verification and in-loop hacking detection to ensure the reliable generation of high-fidelity, verifiable tasks. Using this method, we scale the number of real-world multilingual SWE environments to a million scale (807,693). We demonstrate the profound value of our environments through large-scale agentic mid-training and reinforcement learning. Finally, we applied this technique to Qwen3-Max-Thinking and achieved a score of 75.3% on SWE-Bench Verified. Our work provides both a critical resource and a robust methodology to advance the next generation of coding agents.
zh
[AI-15] Context Learning for Multi-Agent Discussion
【速读】: This paper addresses discussion inconsistency in Multi-Agent Discussion (MAD): misaligned contexts across individual LLMs make consensus hard to reach, so current MAD methods easily derail and fail to converge on a coherent solution. The key to the solution is a multi-LLM context learning method (M2CL) that trains a context generator for each agent to dynamically produce structured context instructions per discussion round, enabling self-adaptive control of context coherence and output discrepancies. This mechanism keeps LLMs from prematurely converging on majority noise and drives them toward the correct consensus, improving task performance by 20%-50% with favorable transferability and computational efficiency.
链接: https://arxiv.org/abs/2602.02350
作者: Xingyuan Hua,Sheng Yue,Xinyi Li,Yizhe Zhao,Jinrui Zhang,Ju Ren
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:
Abstract:Multi-Agent Discussion (MAD) has garnered increasing attention very recently, where multiple LLM instances collaboratively solve problems via structured discussion. However, we find that current MAD methods easily suffer from discussion inconsistency, where LLMs fail to reach a coherent solution due to the misalignment between their individual contexts. In this paper, we introduce a multi-LLM context learning method (M2CL) that learns a context generator for each agent, capable of dynamically generating context instructions per discussion round via automatic information organization and refinement. Specifically, inspired by our theoretical insights on the context instruction, M2CL trains the generators to control context coherence and output discrepancies via a carefully crafted self-adaptive objective. This enables LLMs to avoid premature convergence on majority noise and progressively reach the correct consensus. We evaluate M2CL on challenging tasks, including academic reasoning, embodied tasks, and mobile control. The results show that the performance of M2CL significantly surpasses existing methods by 20%–50%, while enjoying favorable transferability and computational efficiency.
zh
[AI-16] Rethinking Generative Recommender Tokenizer: Recsys-Native Encoding and Semantic Quantization Beyond LLMs
【速读】: This paper targets two core problems in current Semantic ID (SID)-based recommendation: first, existing methods follow a semantic-centric pipeline in which item embeddings are learned from foundation models and discretized with generic quantization schemes, leaving semantic embeddings only weakly coupled with collaborative prediction; second, generic quantization fails to reduce sequential uncertainty for autoregressive modeling, hurting generative recommendation. The key to the solution is ReSID, a recommendation-native, principled SID framework with two components: (i) Field-Aware Masked Auto-Encoding (FAMAE), which learns predictive-sufficient item representations from structured features, and (ii) Globally Aligned Orthogonal Quantization (GAOQ), which produces compact and predictable SID sequences by jointly minimizing semantic ambiguity and prefix-conditional uncertainty. The approach markedly improves recommendation quality (beating strong baselines by over 10% on average) while cutting tokenization cost by up to 122x.
链接: https://arxiv.org/abs/2602.02338
作者: Yu Liang,Zhongjin Zhang,Yuxuan Zhu,Kerui Zhang,Zhiluohan Guo,Wenhang Zhou,Zonqi Yang,Kangle Wu,Yabo Ni,Anxiang Zeng,Cong Fu,Jianxin Wang,Jiazhi Xia
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Semantic ID (SID)-based recommendation is a promising paradigm for scaling sequential recommender systems, but existing methods largely follow a semantic-centric pipeline: item embeddings are learned from foundation models and discretized using generic quantization schemes. This design is misaligned with generative recommendation objectives: semantic embeddings are weakly coupled with collaborative prediction, and generic quantization is inefficient at reducing sequential uncertainty for autoregressive modeling. To address these, we propose ReSID, a recommendation-native, principled SID framework that rethinks representation learning and quantization from the perspective of information preservation and sequential predictability, without relying on LLMs. ReSID consists of two components: (i) Field-Aware Masked Auto-Encoding (FAMAE), which learns predictive-sufficient item representations from structured features, and (ii) Globally Aligned Orthogonal Quantization (GAOQ), which produces compact and predictable SID sequences by jointly reducing semantic ambiguity and prefix-conditional uncertainty. Theoretical analysis and extensive experiments across ten datasets show the effectiveness of ReSID. ReSID consistently outperforms strong sequential and SID-based generative baselines by an average of over 10%, while reducing tokenization cost by up to 122x. Code is available at this https URL.
zh
[AI-17] Building a Correct-by-Design Lakehouse. Data Contracts, Versioning and Transactional Pipelines for Humans and Agents
【速读】: This paper addresses safety problems that arise when multiple actors concurrently operate on production data in a lakehouse: upstream-downstream mismatches surface only at runtime, and multi-table pipelines can leak partial effects. The key to the solution is Bauplan, a code-first lakehouse designed to make illegal states unrepresentable through three mechanisms: (1) typed table contracts that make pipeline boundaries explicitly checkable; (2) Git-like data versioning that supports review and reproducibility; and (3) transactional runs that guarantee pipeline-level atomicity. Borrowing from software engineering, the design brings type safety, version control, and transactional semantics into data processing, improving the reliability and safety of lakehouse systems.
链接: https://arxiv.org/abs/2602.02335
作者: Weiming Sheng,Jinlang Wang,Manuel Barros,Aldrin Montana,Jacopo Tagliabue,Luca Bigon
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: Pre-print (PaPoC 2026)
Abstract:Lakehouses are the default cloud platform for analytics and AI, but they become unsafe when untrusted actors concurrently operate on production data: upstream-downstream mismatches surface only at runtime, and multi-table pipelines can leak partial effects. Inspired by software engineering, we design Bauplan, a code-first lakehouse that aims to make (most) illegal states unrepresentable using familiar abstractions. Bauplan acts along three axes: typed table contracts to make pipeline boundaries checkable, Git-like data versioning for review and reproducibility, and transactional runs that guarantee pipeline-level atomicity. We report early results from a lightweight formal transaction model and discuss future work motivated by counterexamples.
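A sketch of what a typed table contract check can look like. This is illustrative Python only, not Bauplan's actual API; the idea is that an upstream schema drift fails the check before the run starts:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Column:
    name: str
    dtype: str
    nullable: bool = False

# contract declared by a downstream pipeline step (hypothetical table)
orders_contract = [Column("order_id", "int64"), Column("amount", "float64"),
                   Column("currency", "string", nullable=True)]

def check_contract(schema: dict, contract: list[Column]) -> list[str]:
    """Compare an upstream table schema against a declared contract and
    return the violations, so mismatches are caught before execution."""
    errors = []
    for col in contract:
        if col.name not in schema:
            errors.append(f"missing column: {col.name}")
        elif schema[col.name] != col.dtype:
            errors.append(f"{col.name}: expected {col.dtype}, got {schema[col.name]}")
    return errors

upstream_schema = {"order_id": "int64", "amount": "string"}  # drifted upstream
print(check_contract(upstream_schema, orders_contract))
# ['amount: expected float64, got string', 'missing column: currency']
```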
zh
[AI-18] TTT-Parkour: Rapid Test-Time Training for Perceptive Robot Parkour
【速读】: This paper tackles the difficulty of highly dynamic humanoid parkour on unseen, complex terrains: general locomotion policies often fail on arbitrary and extremely challenging environments. The key to the solution is a real-to-sim-to-real framework combined with rapid test-time training (TTT) that lets the robot adapt quickly to new environments and traverse difficult obstacles. Its core innovations are: 1) a two-stage end-to-end learning paradigm that first pre-trains a policy on diverse procedurally generated terrains and then rapidly fine-tunes it on high-fidelity meshes reconstructed from real-world captures; and 2) an efficient, high-fidelity RGB-D geometry reconstruction pipeline that quickly produces high-quality environment representations at test time, substantially improving generalization to extreme geometries (wedges, stakes, narrow beams, etc.). The full capture-reconstruct-train pipeline takes under 10 minutes on most tested terrains and exhibits robust zero-shot sim-to-real transfer.
链接: https://arxiv.org/abs/2602.02331
作者: Shaoting Zhu,Baijun Ye,Jiaxuan Wang,Jiakang Chen,Ziwen Zhuang,Linzhan Mou,Runhan Huang,Hang Zhao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Project Page: this https URL
Abstract:Achieving highly dynamic humanoid parkour on unseen, complex terrains remains a challenge in robotics. Although general locomotion policies demonstrate capabilities across broad terrain distributions, they often struggle with arbitrary and highly challenging environments. To overcome this limitation, we propose a real-to-sim-to-real framework that leverages rapid test-time training (TTT) on novel terrains, significantly enhancing the robot’s capability to traverse extremely difficult geometries. We adopt a two-stage end-to-end learning paradigm: a policy is first pre-trained on diverse procedurally generated terrains, followed by rapid fine-tuning on high-fidelity meshes reconstructed from real-world captures. Specifically, we develop a feed-forward, efficient, and high-fidelity geometry reconstruction pipeline using RGB-D inputs, ensuring both speed and quality during test-time training. We demonstrate that TTT-Parkour empowers humanoid robots to master complex obstacles, including wedges, stakes, boxes, trapezoids, and narrow beams. The whole pipeline of capturing, reconstructing, and test-time training requires less than 10 minutes on most tested terrains. Extensive experiments show that the policy after test-time training exhibits robust zero-shot sim-to-real transfer capability.
zh
[AI-19] Spark: Modular Spiking Neural Networks
【速读】: This paper targets the data and energy inefficiency of present neural network models, and in particular the lack of effective learning algorithms for spiking neural networks (SNNs). The key to the solution is Spark, a new SNN framework built on modular design, from simple components up to entire models, providing an efficient and streamlined pipeline for SNN development; the framework is demonstrated by solving the sparse-reward CartPole problem with simple plasticity mechanisms, offering a practical paradigm for continuous, unbatched learning akin to the way animals learn.
链接: https://arxiv.org/abs/2602.02306
作者: Mario Franco,Carlos Gershenson
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Nowadays, neural networks act as a synonym for artificial intelligence. Present neural network models, although remarkably powerful, are inefficient both in terms of data and energy. Several alternative forms of neural networks have been proposed to address some of these problems. Specifically, spiking neural networks are suitable for efficient hardware implementations. However, effective learning algorithms for spiking networks remain elusive, although it is suspected that effective plasticity mechanisms could alleviate the problem of data efficiency. Here, we present a new framework for spiking neural networks - Spark - built upon the idea of modular design, from simple components to entire models. The aim of this framework is to provide an efficient and streamlined pipeline for spiking neural networks. We showcase this framework by solving the sparse-reward cartpole problem with simple plasticity mechanisms. We hope that a framework compatible with traditional ML pipelines may accelerate research in the area, specifically for continuous and unbatched learning, akin to the one animals exhibit.
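For readers unfamiliar with spiking units, here is a minimal discrete-time leaky integrate-and-fire (LIF) neuron, the kind of basic component such frameworks compose. Parameters are arbitrary and this is not Spark's API:

```python
class LIFNeuron:
    """Discrete-time leaky integrate-and-fire unit: the membrane potential
    leaks toward zero, integrates input current, and emits a binary spike
    (then resets) when it crosses the threshold."""
    def __init__(self, tau=0.9, threshold=1.0):
        self.tau, self.threshold, self.v = tau, threshold, 0.0

    def step(self, current: float) -> int:
        self.v = self.tau * self.v + current
        if self.v >= self.threshold:
            self.v = 0.0   # reset after firing
            return 1
        return 0

neuron = LIFNeuron()
inputs = [0.3, 0.3, 0.6, 0.0, 1.2]
print([neuron.step(i) for i in inputs])  # [0, 0, 1, 0, 1]
```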
zh
[AI-20] Position: Explaining Behavioral Shifts in Large Language Models Requires a Comparative Approach
【速读】: This paper addresses how to explain the behavioral shifts that large-scale foundation models exhibit after interventions such as scaling, fine-tuning, reinforcement learning, or in-context learning. Existing explainable AI (XAI) methods can only analyze failures at a single checkpoint and cannot compare internal changes across checkpoints, so they cannot validate which explanatory claims about a shift are warranted. The key to the solution is a Comparative XAI (Δ-XAI) framework whose core idea is to explain the difference between a reference model and an intervened model rather than either model in isolation; the framework defines a set of design desiderata and provides concrete pipelines and an experimental validation, systematizing the identification and explanation of intervention-induced behavioral changes.
链接: https://arxiv.org/abs/2602.02304
作者: Martino Ciaperoni,Marzio Di Vece,Luca Pappalardo,Fosca Giannotti,Francesco Giannini
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large-scale foundation models exhibit behavioral shifts: intervention-induced behavioral changes that appear after scaling, fine-tuning, reinforcement learning or in-context learning. While investigating these phenomena has recently received attention, explaining their appearance is still overlooked. Classic explainable AI (XAI) methods can surface failures at a single checkpoint of a model, but they are structurally ill-suited to justify what changed internally across different checkpoints and which explanatory claims are warranted about that change. We take the position that behavioral shifts should be explained comparatively: the core target should be the intervention-induced shift between a reference model and an intervened model, rather than any single model in isolation. To this end, we formulate a Comparative XAI (Δ-XAI) framework with a set of desiderata to be taken into account when designing proper explanation methods. To highlight how Δ-XAI methods work, we introduce a set of possible pipelines, relate them to the desiderata, and provide a concrete Δ-XAI experiment.
zh
[AI-21] Decoupling Generalizability and Membership Privacy Risks in Neural Networks
【速读】: This paper addresses the trade-off between privacy protection and generalization in deep learning models: conventional defenses often improve privacy at a significant cost to general performance. The key to the solution is the Privacy-Preserving Training Principle (PPTP), based on the observation that a model's generalization ability and privacy risks reside in different regions of deep neural network architectures. By identifying and isolating these regions, PPTP protects the components exposed to potential privacy leakage while minimizing the impact on generalization, achieving privacy enhancement and performance preservation together.
链接: https://arxiv.org/abs/2602.02296
作者: Xingli Fang,Jung-Eun Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:A deep learning model usually has to sacrifice some utility when it acquires other abilities or characteristics, and privacy preservation exhibits exactly this kind of trade-off with utility. The loss disparity between various defense approaches implies the potential to decouple generalizability and privacy risks so as to maximize privacy gain. In this paper, we identify that a model's generalization and privacy risks reside in different regions of deep neural network architectures. Based on these observations, we propose the Privacy-Preserving Training Principle (PPTP) to protect model components from privacy risks while minimizing the loss in generalizability. Through extensive evaluations, our approach maintains model generalizability significantly better while enhancing privacy preservation.
zh
[AI-22] An Optimization Method for Autoregressive Time Series Forecasting
【速读】: This paper addresses the fact that current time-series forecasting models achieve long-horizon prediction by scaling up parameters rather than through genuinely autoregressive (AR) rollout, while traditional training ignores the temporal causality of time series. The key to the solution is a novel training strategy that enforces two properties: first, prediction error should increase with the forecasting horizon, and any violation of this principle is treated as random guessing and explicitly penalized in the loss; second, short-term AR predictions can be concatenated to form flexible long-term forecasts. The method substantially improves accuracy, setting a new state of the art on several benchmarks, and enables short-horizon models to make reliable long-term predictions over horizons more than 7.5x longer.
链接: https://arxiv.org/abs/2602.02288
作者: Zheng Li,Jerry Cheng,Huanying Gu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 2 figures, 2 tables
Abstract:Current time-series forecasting models are primarily based on transformer-style neural networks. These models achieve long-term forecasting mainly by scaling up the model size rather than through genuinely autoregressive (AR) rollout. From the perspective of large language model training, the traditional training process for time-series forecasting models ignores temporal causality. In this paper, we propose a novel training method for time-series forecasting that enforces two key properties: (1) AR prediction errors should increase with the forecasting horizon, and any violation of this principle is treated as random guessing and explicitly penalized in the loss function; and (2) the method enables models to concatenate short-term AR predictions to form flexible long-term forecasts. Empirical results demonstrate that our method establishes a new state-of-the-art across multiple benchmarks, achieving an MSE reduction of more than 10% compared to iTransformer and other recent strong baselines. Furthermore, it enables short-horizon forecasting models to perform reliable long-term predictions at horizons over 7.5 times longer. Code is available at this https URL
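Property (1) can be turned into a trainable penalty by hinge-penalizing any step where the rollout error decreases with the horizon. The penalty form and its weight below are assumptions for illustration, not the paper's exact loss:

```python
import torch

def monotone_ar_loss(preds, target, penalty_weight=0.1):
    """preds: (B, H) autoregressive rollout; target: (B, H).
    Base MSE plus a hinge penalty whenever the per-step error *decreases*
    with the horizon, i.e. err[t+1] < err[t]."""
    step_err = (preds - target) ** 2                      # (B, H) per-step error
    mse = step_err.mean()
    violation = torch.relu(step_err[:, :-1] - step_err[:, 1:])
    return mse + penalty_weight * violation.mean()

preds = torch.randn(4, 8, requires_grad=True)
target = torch.randn(4, 8)
loss = monotone_ar_loss(preds, target)
loss.backward()                                           # differentiable end to end
print(float(loss))
```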
zh
[AI-23] DFKI-Speech System for WildSpoof Challenge: A robust framework for SASV In-the-Wild
【速读】: This paper addresses the robustness of spoofing detection and speaker verification (SV) under adversarial conditions, in particular the threat that synthesized or replayed speech poses to automatic speaker verification in the wild. The key to the solution is a robust SASV framework in which the two components operate in tandem: the spoofing detector combines a self-supervised speech embedding extractor with a graph neural network (GNN) backend, and a top-3-layer mixture-of-experts (MoE) fuses high- and low-level features to better detect spoofed utterances; the SV side uses a low-complexity multi-scale CNN trained with the SphereFace loss and a contrastive circle loss that adaptively weights hard and easy pairs; finally, AS Norm score normalization with a fixed imposter cohort and model ensembling further strengthen the system's discriminative capability.
链接: https://arxiv.org/abs/2602.02286
作者: Arnab Das,Yassine El Kheir,Enes Erdem Erdogan,Feidi Kallel,Tim Polzehl,Sebastian Moeller
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:This paper presents the DFKI-Speech system developed for the WildSpoof Challenge under the Spoofing-aware Automatic Speaker Verification (SASV) track. We propose a robust SASV framework in which a spoofing detector and a speaker verification (SV) network operate in tandem. The spoofing detector employs a self-supervised speech embedding extractor as the frontend, combined with a state-of-the-art graph neural network backend. In addition, a top-3-layer-based mixture-of-experts (MoE) is used to fuse high-level and low-level features for effective spoofed utterance detection. For speaker verification, we adapt a low-complexity convolutional neural network that fuses 2D and 1D features at multiple scales, trained with the SphereFace loss. Additionally, contrastive circle loss is applied to adaptively weight positive and negative pairs within each training batch, enabling the network to better distinguish between hard and easy sample pairs. Finally, fixed-imposter-cohort AS Norm score normalization and model ensembling are used to further enhance the discriminative capability of the speaker verification system.
zh
[AI-24] Backpropagation as Physical Relaxation: Exact Gradients in Finite Time
【速读】: This paper addresses the lack of a rigorous physical foundation for backpropagation, especially for exact gradient computation in analog or neuromorphic systems. The key to the solution is recasting backpropagation as the finite-time relaxation of a continuous-time dynamical system: feedforward inference is formulated as a continuous-time process, and the Lagrangian theory of non-conservative systems handles asymmetric connections, yielding a global energy functional on a doubled state space encoding both activations and sensitivities; the saddle-point dynamics of this energy perform inference and credit assignment simultaneously, and unit-step Euler discretization recovers standard backpropagation exactly in precisely 2L steps, without requiring symmetric weights, asymptotic convergence, or vanishing perturbations, providing a rigorous basis for exact gradients in analog circuits and neuromorphic hardware.
链接: https://arxiv.org/abs/2602.02281
作者: Antonino Emanuele Scurria
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Classical Physics (physics.class-ph); Computational Physics (physics.comp-ph)
备注: 15 pages, 8 figures
Abstract:Backpropagation, the foundational algorithm for training neural networks, is typically understood as a symbolic computation that recursively applies the chain rule. We show it emerges exactly as the finite-time relaxation of a physical dynamical system. By formulating feedforward inference as a continuous-time process and applying Lagrangian theory of non-conservative systems to handle asymmetric interactions, we derive a global energy functional on a doubled state space encoding both activations and sensitivities. The saddle-point dynamics of this energy perform inference and credit assignment simultaneously through local interactions. We term this framework "Dyadic Backpropagation". Crucially, we prove that unit-step Euler discretization, the natural timescale of layer transitions, recovers standard backpropagation exactly in precisely 2L steps for an L-layer network, with no approximations. Unlike prior energy-based methods requiring symmetric weights, asymptotic convergence, or vanishing perturbations, our framework guarantees exact gradients in finite time. This establishes backpropagation as the digitally optimized shadow of a continuous physical relaxation, providing a rigorous foundation for exact gradient computation in analog and neuromorphic substrates where continuous dynamics are native.
zh
[AI-25] Bridging the Sim-to-Real Gap with multipanda_ros2: A Real-Time ROS2 Framework for Multimanual Systems
【速读】: This paper addresses key challenges in real-time multi-robot control, in particular accurate torque control, interaction control, and robot-environment modeling for Franka Robotics robots at 1kHz. The key to the solution is multipanda_ros2, an open-source, code-first ROS2 architecture that builds on ros2_control to provide native control interfaces for any number of robots from a single process, and adopts a "controllet-feature" design pattern that keeps controller-switching delays within ≤ 2 ms, supporting reproducible benchmarking and complex multi-robot interaction scenarios. To narrow the sim-to-real (sim2real) gap, it integrates a high-fidelity MuJoCo simulation with quantitative metrics for kinematic accuracy and dynamic consistency (torques, forces, and control errors), and shows that real-world inertial parameter identification significantly improves force/torque accuracy, providing a methodological basis for iterative physics refinement.
链接: https://arxiv.org/abs/2602.02269
作者: Jon Škerlj,Seongjin Bien,Abdeldjallil Naceri,Sami Haddadin
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Software Engineering (cs.SE); Systems and Control (eess.SY)
备注: This work has been submitted to the IEEE for possible publication
Abstract:We present multipanda_ros2, a novel open-source ROS2 architecture for multi-robot control of Franka Robotics robots. Leveraging ros2_control, this framework provides native ROS2 interfaces for controlling any number of robots from a single process. Our core contributions address key challenges in real-time torque control, including interaction control and robot-environment modeling. A central focus of this work is sustaining a 1kHz control frequency, a necessity for real-time control and a minimum frequency required by safety standards. Moreover, we introduce a controllet-feature design pattern that enables controller-switching delays of ≤ 2 ms, facilitating reproducible benchmarking and complex multi-robot interaction scenarios. To bridge the simulation-to-reality (sim2real) gap, we integrate a high-fidelity MuJoCo simulation with quantitative metrics for both kinematic accuracy and dynamic consistency (torques, forces, and control errors). Furthermore, we demonstrate that real-world inertial parameter identification can significantly improve force and torque accuracy, providing a methodology for iterative physics refinement. Our work extends approaches from soft robotics to rigid dual-arm, contact-rich tasks, showcasing a promising method to reduce the sim2real gap and providing a robust, reproducible platform for advanced robotics research.
zh
[AI-26] Unsupervised Physics-Informed Operator Learning through Multi-Stage Curriculum Training
【速读】: This paper addresses two challenges in solving partial differential equations (PDEs) with scientific machine learning: neural operators usually require supervised data, while physics-informed neural networks (PINNs), though trainable without supervision, suffer from unstable convergence and limited generalization. The key to the solution is a multi-stage physics-informed training strategy that progressively enforces boundary conditions in the loss and subsequently adds interior residual terms, re-initializing the optimizer at each stage as a continuation mechanism that restores stability and improves convergence; in addition, the proposed Physics-Informed Spline Fourier Neural Operator (PhIS-FNO) combines Fourier layers with Hermite spline kernels for smooth residual evaluation, reaching accuracy comparable to supervised learning while using labels only in a narrow boundary region, establishing a robust new paradigm for physics-informed operator learning.
链接: https://arxiv.org/abs/2602.02264
作者: Paolo Marcandelli,Natansh Mathur,Stefano Markidis,Martina Siena,Stefano Mariani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 51 pages, 15 figures, 6 tables
Abstract:Solving partial differential equations remains a central challenge in scientific machine learning. Neural operators offer a promising route by learning mappings between function spaces and enabling resolution-independent inference, yet they typically require supervised data. Physics-informed neural networks address this limitation through unsupervised training with physical constraints but often suffer from unstable convergence and limited generalization capability. To overcome these issues, we introduce a multi-stage physics-informed training strategy that achieves convergence by progressively enforcing boundary conditions in the loss landscape and subsequently incorporating interior residuals. At each stage the optimizer is re-initialized, acting as a continuation mechanism that restores stability and prevents gradient stagnation. We further propose the Physics-Informed Spline Fourier Neural Operator (PhIS-FNO), combining Fourier layers with Hermite spline kernels for smooth residual evaluation. Across canonical benchmarks, PhIS-FNO attains a level of accuracy comparable to that of supervised learning, using labeled information only along a narrow boundary region, establishing staged, spline-based optimization as a robust paradigm for physics-informed operator learning.
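The staged continuation scheme is easy to see in code: train on boundary terms first, then add interior residuals, re-initializing the optimizer at each stage. A toy PyTorch sketch with stand-in physics losses (not the actual Fourier/spline operator, and the toy boundary and residual terms are invented):

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(),
                            torch.nn.Linear(64, 1))

def boundary_loss(m):                      # stand-in boundary-condition term
    x = torch.rand(128, 2)
    x[:, 0] = 0.0                          # e.g. points on one boundary
    return (m(x) ** 2).mean()

def residual_loss(m):                      # stand-in interior PDE residual
    x = torch.rand(128, 2, requires_grad=True)
    u = m(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    return (du[:, 1] ** 2).mean()

stages = [lambda m: boundary_loss(m),                     # stage 1: boundaries only
          lambda m: boundary_loss(m) + residual_loss(m)]  # stage 2: add residuals
for stage_loss in stages:
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)   # re-init per stage
    for _ in range(200):
        opt.zero_grad()
        loss = stage_loss(model)
        loss.backward()
        opt.step()
    print(f"stage done, loss={float(loss):.4f}")
```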
zh
[AI-27] Geometry- and Relation-Aware Diffusion for EEG Super-Resolution
【速读】: This paper addresses the fact that current electroencephalography (EEG) spatial super-resolution (SR) methods lack awareness of physiological spatial structure when generating high-quality spatial distributions, which limits their spatial generation performance. The key to the solution is the TopoDiff model, which incorporates topology-aware image embeddings to provide global geometric context and introduces a dynamic channel-relation graph that evolves with temporal dynamics to encode inter-electrode relationships, yielding a spatially grounded EEG spatial SR framework that consistently improves generation fidelity and downstream task performance across multiple application scenarios.
链接: https://arxiv.org/abs/2602.02238
作者: Laura Yao,Gengwei Zhang,Moajjem Chowdhury,Yunmei Liu,Tianlong Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent electroencephalography (EEG) spatial super-resolution (SR) methods, while showing improved quality by either directly predicting missing signals from visible channels or adapting latent diffusion-based generative modeling to temporal data, often lack awareness of physiological spatial structure, thereby constraining spatial generation performance. To address this issue, we introduce TopoDiff, a geometry- and relation-aware diffusion model for EEG spatial super-resolution. Inspired by how human experts interpret spatial EEG patterns, TopoDiff incorporates topology-aware image embeddings derived from EEG topographic representations to provide global geometric context for spatial generation, together with a dynamic channel-relation graph that encodes inter-electrode relationships and evolves with temporal dynamics. This design yields a spatially grounded EEG spatial super-resolution framework with consistent performance improvements. Across multiple EEG datasets spanning diverse applications, including SEED/SEED-IV for emotion recognition, PhysioNet motor imagery (MI/MM), and TUSZ for seizure detection, our method achieves substantial gains in generation fidelity and leads to notable improvements in downstream EEG task performance.
zh
[AI-28] SEDformer: Event-Synchronous Spiking Transformers for Irregular Telemetry Time Series Forecasting
【速读】: This paper addresses the challenges of forecasting irregular multivariate time series (IMTS) in large-scale IoT deployments and online platforms, in particular that existing graph- and Transformer-based models ignore the intrinsic Sparsity-Event Duality (SED) of IMTS, i.e., long stretches that are sparse or unobserved alternating with short, dense event bursts. Conventional pre-alignment to uniform grids introduces heavy padding that destroys the native sparsity and wastes computation, while relational recasting disrupts local temporal continuity and weakens event semantics. The key to the solution is exploiting the sparse binary spike communication and event-driven updates that spiking neural networks (SNNs) offer natively: SEDformer couples (1) a SED-based Spike Encoder that converts raw observations into event-synchronous spikes with an Event-Aligned Leaky Integrate-and-Fire (EA-LIF) neuron, (2) an event-preserving temporal downsampling module that compresses long gaps while retaining salient firings, and (3) stacked SED-enhanced Spike Transformer blocks that model intra-series dependencies with membrane-potential-driven linear attention. Experiments on public telemetry IMTS datasets show state-of-the-art forecasting accuracy with markedly lower energy and memory usage, providing a natural and efficient modeling path for IMTS.
链接: https://arxiv.org/abs/2602.02230
作者: Ziyu Zhou,Yuchen Fang,Weilin Ruan,Shiyu Wang,James Kwok,Yuxuan Liang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Under review
Abstract:Telemetry streams from large-scale Internet-connected systems (e.g., IoT deployments and online platforms) naturally form an irregular multivariate time series (IMTS) whose accurate forecasting is operationally vital. A closer examination reveals a defining Sparsity-Event Duality (SED) property of IMTS, i.e., long stretches with sparse or no observations are punctuated by short, dense bursts where most semantic events (observations) occur. However, existing Graph- and Transformer-based forecasters ignore SED: pre-alignment to uniform grids with heavy padding violates sparsity by inflating sequences and forcing computation at non-informative steps, while relational recasting weakens event semantics by disrupting local temporal continuity. These limitations motivate a more faithful and natural modeling paradigm for IMTS that aligns with its SED property. We find that Spiking Neural Networks meet this requirement, as they communicate via sparse binary spikes and update in an event-driven manner, aligning naturally with the SED nature of IMTS. Therefore, we present SEDformer, an SED-enhanced Spiking Transformer for telemetry IMTS forecasting that couples: (1) a SED-based Spike Encoder that converts raw observations into event-synchronous spikes using an Event-Aligned LIF neuron, (2) an Event-Preserving Temporal Downsampling module that compresses long gaps while retaining salient firings, and (3) a stack of SED-based Spike Transformer blocks that enable intra-series dependency modeling with a membrane-based linear attention driven by EA-LIF spiking features. Experiments on public telemetry IMTS datasets show that SEDformer attains state-of-the-art forecasting accuracy while reducing energy and memory usage, providing a natural and efficient path for modeling IMTS.
zh
[AI-29] Spectral Superposition: A Theory of Feature Geometry
【速读】: This paper addresses the loss of geometric structure that occurs when neural networks represent more features than dimensions via superposition. Existing methods decompose activations into sparse linear features but discard the geometric relations among them. The key to the solution is the frame operator $ F = WW^\top $: analyzing its spectrum (eigenvalues and eigenspaces) characterizes how each feature allocates norm across eigenspaces, capturing the global geometric interactions among all features. This spectral-measure formalism not only explains spectral localization under capacity saturation, where features collapse onto single eigenspaces, organize into tight frames, and admit discrete classification, but also applies to arbitrary weight matrices, making it possible to diagnose feature localization beyond toy models and opening a path toward operator-theoretic interpretability.
链接: https://arxiv.org/abs/2602.02224
作者: Georgi Ivanov,Narmeen Oozeer,Shivam Raval,Tasana Pejovic,Shriyash Upadhyay,Amir Abdullah
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Spectral Theory (math.SP); Machine Learning (stat.ML)
备注:
Abstract:Neural networks represent more features than they have dimensions via superposition, forcing features to share representational space. Current methods decompose activations into sparse linear features but discard geometric structure. We develop a theory for studying the geometric structure of features by analyzing the spectra (eigenvalues, eigenspaces, etc.) of weight-derived matrices. In particular, we introduce the frame operator F = WW^\top , which gives us a spectral measure that describes how each feature allocates norm across eigenspaces. While previous tools could describe the pairwise interactions between features, spectral methods capture the global geometry ("how do all features interact?"). In toy models of superposition, we use this theory to prove that capacity saturation forces spectral localization: features collapse onto single eigenspaces, organize into tight frames, and admit discrete classification via association schemes, classifying all geometries from prior work (simplices, polygons, antiprisms). The spectral measure formalism applies to arbitrary weight matrices, enabling diagnosis of feature localization beyond toy settings. These results point toward a broader program: applying operator theory to interpretability.
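The central object is concrete enough to compute directly. A NumPy sketch of the frame operator and the resulting spectral measure for a random W (the identity F_ii = Σ_k λ_k V_ik² makes each row of the measure sum to that feature's squared norm); this mirrors the definitions above but is not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))                 # 5 features in 3 dims (superposition)

F = W @ W.T                                 # frame operator F = W W^T
eigvals, eigvecs = np.linalg.eigh(F)        # ascending; only 3 are nonzero here

# spectral measure: how each feature's squared norm splits across eigenspaces
measure = eigvals * eigvecs**2              # (n_features, n_eigenspaces)
print(np.round(measure, 3))
print(np.allclose(measure.sum(axis=1), (W**2).sum(axis=1)))  # True
```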
zh
[AI-30] Generating Physically Sound Designs from Text and a Set of Physical Constraints NEURIPS2025
【速读】: This paper addresses how to generate designs from a text description and a set of physical constraints that are both visually faithful to the prompt and physically sound, i.e., jointly optimizing structural topology and visual attributes in generative design. The key to the solution is the TIDES framework, which uses a pre-trained text-image model to measure a design's visual alignment with the text prompt and a differentiable physics simulator to measure its physical performance (compliance and density), jointly adjusting topology and visual features during optimization to produce semantically aligned, physically realizable designs that satisfy engineering requirements.
链接: https://arxiv.org/abs/2602.02213
作者: Gregory Barber,Todd C. Henry,Mulugeta A. Haile
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: NeurIPS 2025
Abstract:We present TIDES, a text informed design approach for generating physically sound designs based on a textual description and a set of physical constraints. TIDES jointly optimizes structural (topology) and visual properties. A pre-trained text-image model is used to measure the design’s visual alignment with a text prompt and a differentiable physics simulator is used to measure its physical performance. We evaluate TIDES on a series of structural optimization problems operating under different load and support conditions, at different resolutions, and experimentally in the lab by performing the 3-point bending test on 2D beam designs that are extruded and 3D printed. We find that it can jointly optimize the two objectives and return designs that satisfy engineering design requirements (compliance and density) while utilizing features specified by text.
zh
[AI-31] Cardinality-Preserving Structured Sparse Graph Transformers for Molecular Property Prediction
【速读】: This paper addresses efficient molecular property prediction under limited labeled data. Chemical space is vast (roughly 10^60 drug-like molecules) while only thousands of drugs have been approved, making purely supervised learning inadequate. The key to the solution is CardinalGraphFormer, which injects graph-structural priors (shortest-path distance, centrality, etc.) into the graph transformer's attention and restricts sparse attention to shortest-path distance ≤ 3, improving the model's grasp of molecular topology while staying computationally efficient; it further adds a cardinality-preserving unnormalized aggregation channel to strengthen representations. Pretraining combines graph-level contrastive alignment with masked attribute reconstruction, yielding significant generalization gains across 11 public benchmarks spanning MoleculeNet, OGB, and TDC ADMET.
链接: https://arxiv.org/abs/2602.02201
作者: Abhijit Gupta
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Drug discovery motivates efficient molecular property prediction under limited labeled data. Chemical space is vast, often estimated at approximately 10^60 drug-like molecules, while only thousands of drugs have been approved. As a result, self-supervised pretraining on large unlabeled molecular corpora has become essential for data-efficient molecular representation learning. We introduce CardinalGraphFormer, a graph transformer that incorporates Graphormer-inspired structural biases, including shortest-path distance and centrality, as well as direct-bond edge bias, within a structured sparse attention regime limited to shortest-path distance ≤ 3. The model further augments this design with a cardinality-preserving unnormalized aggregation channel over the same support set. Pretraining combines contrastive graph-level alignment with masked attribute reconstruction. Under a fully matched evaluation protocol, CardinalGraphFormer improves mean performance across all 11 evaluated tasks and achieves statistically significant gains on 10 of 11 public benchmarks spanning MoleculeNet, OGB, and TDC ADMET tasks when compared to strong reproduced baselines.
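The structured sparsity amounts to masking attention by bounded shortest-path distance. A BFS-based sketch (the helper is hypothetical; a real implementation would also add the distance, centrality, and edge bias terms described above):

```python
import numpy as np
from collections import deque

def spd_mask(adj: list[list[int]], max_dist: int = 3) -> np.ndarray:
    """Attention support limited to node pairs within shortest-path
    distance <= max_dist, via a bounded BFS from every node."""
    n = len(adj)
    mask = np.zeros((n, n), dtype=bool)
    for src in range(n):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            if dist[u] == max_dist:
                continue                   # do not expand past the bound
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        for v in dist:
            mask[src, v] = True
    return mask

# a 6-atom chain: attention reaches 3 bonds away, no further
chain = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
print(spd_mask(chain).astype(int))
```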
zh
[AI-32] Hierarchical Adaptive Eviction for KV Cache Management in Multimodal Language Models
【速读】: This paper addresses the quadratic memory and computation costs that Transformer architectures impose on multimodal large language models (MLLMs), where existing KV cache eviction strategies fail to handle the heterogeneous attention distributions between visual and text tokens, leading to poor efficiency or degraded performance. The key to the solution is the Hierarchical Adaptive Eviction (HAE) framework: during pre-filling it applies Dual-Attention Pruning, exploiting visual token sparsity and attention variance to optimize cache retention; during decoding it uses a dynamic eviction strategy (inspired by OS recycle bins) with index broadcasting to cut computational overhead, with theoretical guarantees of better information integrity and lower error bounds than greedy strategies. Empirically, HAE reduces KV cache memory by 41% with minimal accuracy loss (a 0.3% drop) on image understanding tasks and accelerates inference by 1.5x while preserving output quality.
链接: https://arxiv.org/abs/2602.02197
作者: Xindian Ma,Yidi Lu,Peng Zhang,Jing Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures
Abstract:The integration of visual information into Large Language Models (LLMs) has enabled Multimodal LLMs (MLLMs), but the quadratic memory and computational costs of Transformer architectures remain a bottleneck. Existing KV cache eviction strategies fail to address the heterogeneous attention distributions between visual and text tokens, leading to suboptimal efficiency or degraded performance. In this paper, we propose Hierarchical Adaptive Eviction (HAE), a KV cache eviction framework that optimizes text-visual token interaction in MLLMs by implementing Dual-Attention Pruning during pre-filling (leveraging visual token sparsity and attention variance) and a Dynamic Decoding Eviction Strategy (inspired by OS Recycle Bins) during decoding. HAE minimizes KV cache usage across layers, reduces computational overhead via index broadcasting, and theoretically ensures superior information integrity and lower error bounds compared to greedy strategies, enhancing efficiency in both comprehension and generation tasks. Empirically, HAE reduces KV-Cache memory by 41% with minimal accuracy loss (0.3% drop) in image understanding tasks and accelerates story generation inference by 1.5x while maintaining output quality on Phi3.5-Vision-Instruct model.
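As a point of reference for what eviction looks like mechanically, here is a plain greedy attention-mass policy, the kind of baseline HAE is argued to outperform; it is not HAE's dual-attention pruning or recycle-bin decoding strategy, and all sizes are arbitrary:

```python
import torch

def evict_kv(keys, values, attn_weights, keep_ratio=0.6):
    """Score each cached position by its accumulated attention mass over
    recent queries and keep only the top fraction (greedy baseline)."""
    # attn_weights: (n_queries, n_cached) softmax weights from recent steps
    scores = attn_weights.sum(dim=0)                    # importance per token
    k = max(1, int(keep_ratio * keys.shape[0]))
    keep = torch.topk(scores, k).indices.sort().values  # preserve order
    return keys[keep], values[keep]

T, d = 12, 8
keys, values = torch.randn(T, d), torch.randn(T, d)
attn = torch.softmax(torch.randn(4, T), dim=-1)
k2, v2 = evict_kv(keys, values, attn)
print(k2.shape)  # torch.Size([7, 8])
```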
zh
[AI-33] TIDE: Trajectory-based Diagnostic Evaluation of Test-Time Improvement in LLM Agents
【速读】: This paper addresses the poor understanding of Test-Time Improvement (TTI) in autonomous large language model (LLM) agents: existing metrics fail to measure task-optimization efficiency, behavioral adaptation after erroneous actions, and the concrete contribution of working memory to task completion. The key to the solution is Test-time Improvement Diagnostic Evaluation (TIDE), an agent-agnostic and environment-agnostic diagnostic framework that decomposes TTI into three interconnected dimensions: (1) the overall temporal dynamics of task completion; (2) whether performance bottlenecks stem from recursive looping behaviors; and (3) whether performance is constrained by burdensome accumulated memory. The study finds that improving agent performance requires more than scaling internal reasoning; the interaction dynamics between the agent and the environment must be explicitly optimized.
链接: https://arxiv.org/abs/2602.02196
作者: Hang Yan,Xinyu Che,Fangzhi Xu,Qiushi Sun,Zichen Ding,Kanzhi Cheng,Jian Zhang,Tao Qin,Jun Liu,Qika Lin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 29pages, 10 figures
Abstract:Recent advances in autonomous LLM agents demonstrate their ability to improve performance through iterative interaction with the environment. We define this paradigm as Test-Time Improvement (TTI). However, the mechanisms under how and why TTI succeed or fail remain poorly understood, and existing evaluation metrics fail to capture their task optimization efficiency, behavior adaptation after erroneous actions, and the specific utility of working memory for task completion. To address these gaps, we propose Test-time Improvement Diagnostic Evaluation (TIDE), an agent-agnostic and environment-agnostic framework that decomposes TTI into three comprehensive and interconnected dimensions. The framework measures (1) the overall temporal dynamics of task completion and (2) identifies whether performance is primarily constrained by recursive looping behaviors or (3) by burdensome accumulated memory. Through extensive experiments across diverse agents and environments, TIDE highlights that improving agent performance requires more than scaling internal reasoning, calling for explicitly optimizing the interaction dynamics between the agent and the environment.
zh
[AI-34] State Rank Dynamics in Linear Attention LLMs
【速读】: This paper addresses the opacity of the internal dynamics of state-matrix compression in linear attention large language models, in particular the poorly understood evolution of state-matrix rank during inference. Its central finding is State Rank Stratification: at runtime, linear attention heads split into two distinct rank-evolution regimes, one whose effective rank oscillates near zero and one that grows rapidly and converges to an upper bound, indicating that a head's low- or high-rank identity is a structural property acquired during pre-training rather than an input-dependent transient. The key to the solution is recognizing that low-rank heads are indispensable for model reasoning while high-rank heads are highly redundant, which motivates Joint Rank-Norm Pruning, a zero-shot strategy that cuts KV-cache overhead by 38.9% with almost no loss in model performance.
链接: https://arxiv.org/abs/2602.02195
作者: Ao Sun,Hongtao Zhang,Heng Zhou,Yixuan Ma,Yiran Qin,Tongrui Su,Yan Liu,Zhanyu Ma,Jun Xu,Jiuchong Gao,Jinghua Hao,Renqing He
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Linear Attention Large Language Models (LLMs) offer a compelling recurrent formulation that compresses context into a fixed-size state matrix, enabling constant-time inference. However, the internal dynamics of this compressed state remain largely opaque. In this work, we present a comprehensive study on the runtime state dynamics of state-of-the-art Linear Attention models. We uncover a fundamental phenomenon termed State Rank Stratification, characterized by a distinct spectral bifurcation among linear attention heads: while one group maintains an effective rank oscillating near zero, the other exhibits rapid growth that converges to an upper bound. Extensive experiments across diverse inference contexts reveal that these dynamics remain strikingly consistent, indicating that the identity of a head, whether low-rank or high-rank, is an intrinsic structural property acquired during pre-training, rather than a transient state dependent on the input data. Furthermore, our diagnostic probes reveal a surprising functional divergence: low-rank heads are indispensable for model reasoning, whereas high-rank heads exhibit significant redundancy. Leveraging this insight, we propose Joint Rank-Norm Pruning, a zero-shot strategy that achieves a 38.9% reduction in KV-cache overhead while largely maintaining model accuracy.
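The quantity being tracked is the effective rank of the recurrent state under rank-1 updates S_t = S_{t-1} + v_t k_t^T. A PyTorch sketch using the entropy-based effective rank, one common definition (the paper's exact estimator may differ):

```python
import torch

def effective_rank(S: torch.Tensor) -> float:
    """Entropy-based effective rank: exp of the Shannon entropy of the
    normalized singular-value distribution of S."""
    s = torch.linalg.svdvals(S)
    p = s / s.sum()
    p = p[p > 0]
    return float(torch.exp(-(p * p.log()).sum()))

# linear-attention recurrence: rank-1 outer-product updates to the state
d = 32
S = torch.zeros(d, d)
for t in range(1, 65):
    k, v = torch.randn(d), torch.randn(d)
    S += torch.outer(v, k)
    if t in (1, 4, 16, 64):
        print(t, round(effective_rank(S), 2))  # grows, then saturates near d
```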
zh
[AI-35] Reasoning in a Combinatorial and Constrained World: Benchmarking LLMs on Natural-Language Combinatorial Optimization
【速读】: This paper addresses the underexplored ability of large language models (LLMs) to perform combinatorial optimization (CO), i.e., searching high-dimensional solution spaces under hard constraints. The key to the solution is NLCO (Natural Language Combinatorial Optimization), an end-to-end CO benchmark that requires models to output discrete solutions directly from natural-language task descriptions, without writing code or calling external solvers. NLCO covers 43 CO problems organized by a four-layer taxonomy (variable types, constraint families, global patterns, and objective classes), enabling fine-grained evaluation of LLMs in terms of feasibility, solution quality, and reasoning efficiency.
链接: https://arxiv.org/abs/2602.02188
作者: Xia Jiang,Jing Chen,Cong Zhang,Jie Gao,Chengpeng Hu,Chenhao Zhang,Yaoxin Wu,Yingqian Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:While large language models (LLMs) have shown strong performance in math and logic reasoning, their ability to handle combinatorial optimization (CO) – searching high-dimensional solution spaces under hard constraints – remains underexplored. To bridge the gap, we introduce NLCO, a Natural Language Combinatorial Optimization benchmark that evaluates LLMs on end-to-end CO reasoning: given a language-described decision-making scenario, the model must output a discrete solution without writing code or calling external solvers. NLCO covers 43 CO problems and is organized using a four-layer taxonomy of variable types, constraint families, global patterns, and objective classes, enabling fine-grained evaluation. We provide solver-annotated solutions and comprehensively evaluate LLMs by feasibility, solution optimality, and reasoning efficiency. Experiments across a wide range of modern LLMs show that high-performing models achieve strong feasibility and solution quality on small instances, but both degrade as instance size grows, even if more tokens are used for reasoning. We also observe systematic effects across the taxonomy: set-based tasks are relatively easy, whereas graph-structured problems and bottleneck objectives lead to more frequent failures.
zh
[AI-36] Malware Detection Through Memory Analysis
【速读】: This paper addresses the balance between accuracy and real-time performance in malware detection, in particular how to classify obfuscated malware both efficiently and accurately. Using the MalMemAnalysis-2022 dataset from the Canadian Institute for Cybersecurity, it studies binary classification (benign/malicious) and multi-class classification (benign, ransomware, spyware, Trojan horse). The key to the solution is an XGBoost model that pairs strong detection capability with fast inference: the binary classifier reaches 99.98% test accuracy and a 99.98% F1 score, while the multi-class version reaches 87.54% accuracy with an average F1 of 75.03% over the malware sub-types, and classifying 50 samples takes only about 37.3 ms (binary) to 43.2 ms (multi-class), markedly improving detection efficiency and offering a viable path toward real-time malware detection systems.
链接: https://arxiv.org/abs/2602.02184
作者: Sarah Nassar
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper summarizes the research conducted for a malware detection project using the Canadian Institute for Cybersecurity’s MalMemAnalysis-2022 dataset. The purpose of the project was to explore the effectiveness and efficiency of machine learning techniques for the task of binary classification (i.e., benign or malicious) as well as multi-class classification to further include three malware sub-types (i.e., benign, ransomware, spyware, or Trojan horse). The XGBoost model type was the final model selected for both tasks due to the trade-off between strong detection capability and fast inference speed. The binary classifier achieved a testing subset accuracy and F1 score of 99.98%, while the multi-class version reached an accuracy of 87.54% and an F1 score of 81.26%, with an average F1 score over the malware sub-types of 75.03%. In addition to the high modelling performance, XGBoost is also efficient in terms of classification speed. It takes about 37.3 milliseconds to classify 50 samples in sequential order in the binary setting and about 43.2 milliseconds in the multi-class setting. The results from this research project help advance the efforts made towards developing accurate and real-time obfuscated malware detectors for the goal of improving online privacy and safety. *This project was completed as part of ELEC 877 (AI for Cybersecurity) in the Winter 2024 term.
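A minimal reproduction skeleton of the modelling setup (an XGBoost binary classifier plus the 50-sample timing probe). Synthetic features stand in for the MalMemAnalysis-2022 memory-dump feature set, and the hyperparameters are placeholders:

```python
import time
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = XGBClassifier(n_estimators=200, max_depth=6, eval_metric="logloss")
clf.fit(X_tr, y_tr)

start = time.perf_counter()
preds = clf.predict(X_te[:50])                 # classify 50 samples
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"F1={f1_score(y_te[:50], preds):.3f}, "
      f"50-sample inference: {elapsed_ms:.1f} ms")
```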
zh
[AI-37] SurvKAN: A Fully Parametric Survival Model Based on Kolmogorov-Arnold Networks
【速读】: This paper addresses two obstacles to clinical use of traditional survival models: classical models such as Cox regression impose restrictive assumptions on covariate relationships and proportional hazards that fail to capture real-world clinical dynamics, while deep learning approaches gain expressivity at the cost of interpretability, hindering trustworthy deployment in healthcare. The key to the solution is SurvKAN, a fully parametric, time-continuous survival model that brings Kolmogorov-Arnold Networks (KANs) to survival analysis: time is an explicit input and the network directly predicts the log-hazard function, removing the proportional-hazards constraint, while learnable univariate functions preserve interpretability, uniting performance and transparency.
链接: https://arxiv.org/abs/2602.02179
作者: Marina Mastroleo,Alberto Archetti,Federico Mastroleo,Matteo Matteucci
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate prediction of time-to-event outcomes is critical for clinical decision-making, treatment planning, and resource allocation in modern healthcare. While classical survival models such as Cox remain widely adopted in standard practice, they rely on restrictive assumptions, including linear covariate relationships and proportional hazards over time, that often fail to capture real-world clinical dynamics. Recent deep learning approaches like DeepSurv and DeepHit offer improved expressivity but sacrifice interpretability, limiting clinical adoption where trust and transparency are paramount. Hybrid models incorporating Kolmogorov-Arnold Networks (KANs), such as CoxKAN, have begun to address this trade-off but remain constrained by the semi-parametric Cox framework. In this work we introduce SurvKAN, a fully parametric, time-continuous survival model based on KAN architectures that eliminates the proportional hazards constraint. SurvKAN treats time as an explicit input to a KAN that directly predicts the log-hazard function, enabling end-to-end training on the full survival likelihood. Our architecture preserves interpretability through learnable univariate functions that indicate how individual features influence risk over time. Extensive experiments on standard survival benchmarks demonstrate that SurvKAN achieves competitive or superior performance compared to classical and state-of-the-art baselines across concordance and calibration metrics. Additionally, interpretability analyses reveal clinically meaningful patterns that align with medical domain knowledge.
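The fully parametric objective is the survival log-likelihood with time as an explicit network input. A PyTorch sketch with an MLP standing in for the KAN and Monte Carlo quadrature for the cumulative hazard (both simplifications of the paper's setup):

```python
import torch

class LogHazardNet(torch.nn.Module):
    """MLP stand-in for the KAN: maps (covariates, time) -> log-hazard."""
    def __init__(self, n_features):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(n_features + 1, 64), torch.nn.Tanh(),
            torch.nn.Linear(64, 1))

    def log_hazard(self, x, t):
        return self.net(torch.cat([x, t], dim=-1)).squeeze(-1)

def survival_nll(model, x, t_event, event, n_quad=32):
    """-log L = -(event * log h(t) - H(t)), with the cumulative hazard
    H(t) = integral_0^t h(s) ds estimated by Monte Carlo over [0, t]."""
    log_h = model.log_hazard(x, t_event)                       # (B,)
    u = torch.rand(x.shape[0], n_quad, 1) * t_event.unsqueeze(1)
    xq = x.unsqueeze(1).expand(-1, n_quad, -1)
    H = torch.exp(model.log_hazard(xq, u)).mean(dim=1) * t_event.squeeze(-1)
    return -(event * log_h - H).mean()

B, F = 16, 5
model = LogHazardNet(F)
x, t = torch.randn(B, F), torch.rand(B, 1) + 0.1
event = (torch.rand(B) > 0.3).float()                          # 1 = observed
print(float(survival_nll(model, x, t, event)))
```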
zh
[AI-38] Self-Evolving Coordination Protocol in Multi-Agent AI Systems: An Exploratory Systems Feasibility Study
【速读】: This paper addresses governance and auditability of coordination mechanisms inside multiagent systems in safety-critical domains such as finance: how to allow bounded self-evolution of coordination protocols while preserving strict formal constraints. The key to the solution is Self-Evolving Coordination Protocols (SECP), which permit limited, externally validated self-modification without violating declared invariants (Byzantine fault tolerance f < n/3, O(n^2) message complexity, non-statistical safety and liveness arguments, and bounded explainability), improving how the system integrates outputs from heterogeneous components. Experiments show that a single governed protocol modification raised acceptance from two to three proposals, demonstrating the architecture's technical feasibility and analyzability and laying a foundation for governable multiagent systems.
链接: https://arxiv.org/abs/2602.02170
作者: Jose Manuel de la Chica Rodriguez,Juan Manuel Vera Díaz
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:
Abstract:Contemporary multi-agent systems increasingly rely on internal coordination mechanisms to combine, arbitrate, or constrain the outputs of heterogeneous components. In safety-critical and regulated domains such as finance, these mechanisms must satisfy strict formal requirements, remain auditable, and operate within explicitly bounded limits. Coordination logic therefore functions as a governance layer rather than an optimization heuristic. This paper presents an exploratory systems feasibility study of Self-Evolving Coordination Protocols (SECP): coordination protocols that permit limited, externally validated self-modification while preserving fixed formal invariants. We study a controlled proof-of-concept setting in which six fixed Byzantine consensus protocol proposals are evaluated by six specialized decision modules. All coordination regimes operate under identical hard constraints, including Byzantine fault tolerance (f < n/3), O(n^2) message complexity, complete non-statistical safety and liveness arguments, and bounded explainability. Four coordination regimes are compared in a single-shot design: unanimous hard veto, weighted scalar aggregation, SECP v1.0 (an agent-designed non-scalar protocol), and SECP v2.0 (the result of one governed modification). Outcomes are evaluated using a single metric, proposal coverage, defined as the number of proposals accepted. A single recursive modification increased coverage from two to three accepted proposals while preserving all declared invariants. The study makes no claims regarding statistical significance, optimality, convergence, or learning. Its contribution is architectural: it demonstrates that bounded self-modification of coordination protocols is technically implementable, auditable, and analyzable under explicit formal constraints, establishing a foundation for governed multi-agent systems.
zh
[AI-39] Traffic-Aware Navigation in Road Networks
【速读】:该论文旨在解决城市道路网络中交通感知导航(traffic-aware navigation)的路径规划问题,核心挑战在于如何在实时性、计算效率与路径最优性之间取得平衡。解决方案的关键在于对比三种图搜索算法:基于预处理的Floyd-Warshall-Ingerman算法(提供快速但无交通感知的最短路径)、实时单查询搜索算法Dijkstra和A*(具备交通感知能力且预处理开销低),以及结合两者的Yen算法(先生成前K条最短路径再实时迭代优化)。其中,Dijkstra和A*在交通动态变化场景下表现最优,而Yen算法通过权衡预处理复杂度与运行时性能,在实际部署中展现出灵活性与适应性。
链接: https://arxiv.org/abs/2602.02158
作者: Sarah Nassar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This project compares three graph search approaches for the task of traffic-aware navigation in Kingston’s road network. These approaches include a single-run multi-query preprocessing algorithm (Floyd-Warshall-Ingerman), continuous single-query real-time search (Dijkstra’s and A*), and an algorithm combining both approaches to balance between their trade-offs by first finding the top K shortest paths then iterating over them in real time (Yen’s). Dijkstra’s and A* resulted in the most traffic-aware optimal solutions with minimal preprocessing required. Floyd-Warshall-Ingerman was the fastest in real time but provided distance based paths with no traffic awareness. Yen’s algorithm required significant preprocessing but balanced between the other two approaches in terms of runtime speed and optimality. Each approach presents advantages and disadvantages that need to be weighed depending on the circumstances of specific deployment contexts to select the best custom solution. *This project was completed as part of ELEC 844 (Search and Planning Algorithms for Robotics) in the Fall 2025 term.
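下面给出交通感知最短路搜索的一个极简可运行示意(Python):把边权从纯距离换成“距离 × 拥堵系数”,同一套 Dijkstra 搜索即可对实时路况作出反应。图结构、字段名与数值均为本文为说明而构造,并非论文的官方实现。

```python
import heapq

def dijkstra_traffic(graph, source, target):
    """在以“当前通行时间”为边权的图上运行 Dijkstra。
    graph: {节点: [(邻居, 距离_km, 拥堵系数), ...]},拥堵系数 >= 1。
    返回 (总通行时间, 路径)。"""
    dist, prev = {source: 0.0}, {}
    pq, visited = [(0.0, source)], set()
    while pq:
        d, u = heapq.heappop(pq)
        if u in visited:
            continue
        visited.add(u)
        if u == target:
            break
        for v, length_km, congestion in graph.get(u, []):
            w = length_km * congestion      # 交通感知边权,可随实时路况更新
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    if target not in dist:
        return float("inf"), []
    path, node = [target], target
    while node != source:
        node = prev[node]
        path.append(node)
    return dist[target], path[::-1]

# 示例:拥堵系数为 2.0 时,距离更短的 A-B-D 反而不如畅通的 A-C-D
g = {"A": [("B", 1.0, 2.0), ("C", 1.5, 1.0)],
     "B": [("D", 1.0, 2.0)],
     "C": [("D", 1.2, 1.0)]}
print(dijkstra_traffic(g, "A", "D"))  # 约 (2.7, ['A', 'C', 'D'])
```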
zh
[AI-40] ECHO: Entropy-Confidence Hybrid Optimization for Test-Time Reinforcement Learning
【速读】:该论文旨在解决测试时强化学习(Test-time reinforcement learning)中因高熵分支导致的轨迹坍塌(rollout collapse)问题,以及早期伪标签噪声和偏差引发的自强化过拟合问题。解决方案的关键在于提出熵置信度混合组相对策略优化方法(Entropy Confidence Hybrid Group Relative Policy Optimization, ECHO):在推理阶段,ECHO通过联合考虑局部熵与群体置信度动态调控分支宽度,并引入基于置信度的在线剪枝机制以终止低置信度分支,从而避免高熵陷阱并缓解坍塌;在策略更新阶段,则采用置信度自适应截断和熵-置信度混合优势塑造策略,提升训练鲁棒性并减轻早期偏差影响。
链接: https://arxiv.org/abs/2602.02150
作者: Chu Zhao,Enneng Yang,Yuting Liu,Jianzhe Zhao,Guibing Guo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages
Abstract:Test-time reinforcement learning generates multiple candidate answers via repeated rollouts and performs online updates using pseudo-labels constructed by majority voting. To reduce overhead and improve exploration, prior work introduces tree structured rollouts, which share reasoning prefixes and branch at key nodes to improve sampling efficiency. However, this paradigm still faces two challenges: (1) high entropy branching can trigger rollout collapse, where the branching budget concentrates on a few trajectories with consecutive high-entropy segments, rapidly reducing the number of effective branches; (2) early pseudo-labels are noisy and biased, which can induce self-reinforcing overfitting, causing the policy to sharpen prematurely and suppress exploration. To address these issues, we propose Entropy Confidence Hybrid Group Relative Policy Optimization (ECHO). During rollout, ECHO jointly leverages local entropy and group level confidence to adaptively control branch width, and further introduces online confidence-based pruning to terminate persistently low confidence branches, avoiding high entropy traps and mitigating collapse. During policy updates, ECHO employs confidence adaptive clipping and an entropy confidence hybrid advantage shaping approach to enhance training robustness and mitigate early stage bias. Experiments demonstrate that ECHO achieves consistent gains on multiple mathematical and visual reasoning benchmarks, and generalizes more effectively under a limited rollout budget.
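下面用一段简化的 Python 示意 ECHO 中“局部熵 × 群体置信度”联合控制分支宽度与在线剪枝的思路。其中的函数名、阈值与折减方式均为本文为说明而假设,并非论文的精确算法。

```python
import numpy as np

def branch_width(token_logits, group_confidence,
                 w_min=1, w_max=4, tau_entropy=2.0):
    """按“局部熵 × 群体置信度”自适应决定分支宽度(示意)。
    高熵节点倾向多分支,但群体置信度低时按比例收缩,避免高熵陷阱。"""
    p = np.exp(token_logits - token_logits.max())
    p /= p.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    raw = w_min + (w_max - w_min) * min(entropy / tau_entropy, 1.0)
    width = int(round(raw * group_confidence))
    return max(w_min, min(w_max, width))

def prune_low_confidence(branches, conf_floor=0.2, patience=3):
    """在线剪枝:连续 patience 步置信度低于阈值的分支被终止。"""
    kept = []
    for b in branches:
        recent = b["conf_history"][-patience:]
        if len(recent) == patience and all(c < conf_floor for c in recent):
            continue  # 持续低置信 → 剪掉,释放 rollout 预算
        kept.append(b)
    return kept

print(branch_width(np.array([2.0, 1.9, 1.8, 0.2]), group_confidence=0.9))
```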
zh
[AI-41] Back to the Future: Look-ahead Augmentation and Parallel Self-Refinement for Time Series Forecasting WWW
【速读】:该论文旨在解决长期时间序列预测(Long-term Time Series Forecasting, LTSF)中并行效率与序列一致性之间的权衡问题:直接多步预测(Direct Multi-step Forecasting, DMS)虽能实现快速并行预测,但易丧失步骤间的时序一致性;而迭代多步预测(Iterative Multi-step Forecasting, IMS)虽能保留时序依赖关系,却存在误差累积和推理缓慢的问题。解决方案的关键在于提出 Back to the Future (BTTF) 框架,其核心思想是通过“前瞻增强”(look-ahead augmentation)和“自校正精炼”(self-corrective refinement)机制,在不依赖复杂模型结构的前提下,利用初始预测结果对第二阶段模型进行增强,并通过集成策略提升预测稳定性与准确性,从而显著改善长程预测性能,最高可提升58%的精度。
链接: https://arxiv.org/abs/2602.02146
作者: Sunho Kim,Susik Yoon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 4 pages, Short paper accepted at The Web Conference (WWW) 2026
Abstract:Long-term time series forecasting (LTSF) remains challenging due to the trade-off between parallel efficiency and sequential modeling of temporal coherence. Direct multi-step forecasting (DMS) methods enable fast, parallel prediction of all future horizons but often lose temporal consistency across steps, while iterative multi-step forecasting (IMS) preserves temporal dependencies at the cost of error accumulation and slow inference. To bridge this gap, we propose Back to the Future (BTTF), a simple yet effective framework that enhances forecasting stability through look-ahead augmentation and self-corrective refinement. Rather than relying on complex model architectures, BTTF revisits the fundamental forecasting process and refines a base model by ensembling the second-stage models augmented with their initial predictions. Despite its simplicity, our approach consistently improves long-horizon accuracy and mitigates the instability of linear forecasting models, achieving accuracy gains of up to 58% and demonstrating stable improvements even when the first-stage model is trained under suboptimal conditions. These results suggest that leveraging model-generated forecasts as augmentation can be a simple yet powerful way to enhance long-term prediction, even without complex architectures.
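下面是“前瞻增强 + 自校正精炼”两阶段流程的一个最小示意(Python,基于 scikit-learn 的线性模型):一阶段模型的初始预测被拼接为二阶段模型的额外输入特征,多个二阶段模型再取平均作集成。仅为说明思路,非论文官方实现;打印的是训练集上的 MSE。

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
T, L, H = 2000, 96, 24                       # 序列长度 / 回看窗口 / 预测步长
series = np.sin(np.arange(T) * 0.05) + 0.1 * rng.standard_normal(T)

X = np.stack([series[i:i + L] for i in range(T - L - H)])
Y = np.stack([series[i + L:i + L + H] for i in range(T - L - H)])

stage1 = Ridge().fit(X, Y)                   # 一阶段:普通直接多步预测
Y1 = stage1.predict(X)                       # look-ahead:初始预测

X_aug = np.concatenate([X, Y1], axis=1)      # 历史窗口 + 初始预测作为增强输入
stage2 = [Ridge(alpha=a).fit(X_aug, Y) for a in (0.1, 1.0, 10.0)]
Y2 = np.mean([m.predict(X_aug) for m in stage2], axis=0)   # 二阶段集成自校正

print("stage-1 MSE:", np.mean((Y1 - Y) ** 2))
print("stage-2 MSE:", np.mean((Y2 - Y) ** 2))
```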
zh
[AI-42] CAM: A Causality-based Analysis Framework for Multi-Agent Code Generation Systems
【速读】:该论文旨在解决多智能体代码生成系统(Multi-Agent Code Generation Systems, MACGS)中中间输出特征对系统正确性贡献不明确的问题,这一问题阻碍了针对MACGS架构的精准优化。解决方案的关键在于提出首个基于因果关系的分析框架CAM(Causality-based Analysis framework for MACGS),通过系统性地量化不同中间特征对系统正确性的贡献,结合对中间输出的分类与真实错误模拟,识别出关键特征并进行重要性排序。该方法揭示了情境依赖型特征的存在(即特征重要性主要源于与其他特征的交互),并验证了混合后端架构(根据LLM相对能力分配任务)可提升7.2%的Pass@1指标,从而为MACGS的设计优化和部署提供了可操作的洞察。
链接: https://arxiv.org/abs/2602.02138
作者: Lyu Zongyi,Ji Zhenlan,Chen Songqiang,Wang Liwen,Huang Yuheng,Wang Shuai,Cheung Shing-Chi
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 18 pages, 12 tables, 4 figures
Abstract:Despite the remarkable success that Multi-Agent Code Generation Systems (MACGS) have achieved, the inherent complexity of multi-agent architectures produces substantial volumes of intermediate outputs. To date, the individual importance of these intermediate outputs to the system correctness remains opaque, which impedes targeted optimization of MACGS designs. To address this challenge, we propose CAM, the first Causality-based Analysis framework for MACGS that systematically quantifies the contribution of different intermediate features for system correctness. By comprehensively categorizing intermediate outputs and systematically simulating realistic errors on intermediate features, we identify the important features for system correctness and aggregate their importance rankings. We conduct extensive empirical analysis on the identified importance rankings. Our analysis reveals intriguing findings: first, we uncover context-dependent features, i.e., features whose importance emerges mainly through interactions with other features, revealing that quality assurance for MACGS should incorporate cross-feature consistency checks; second, we reveal that hybrid backend MACGS with different backend LLMs assigned according to their relative strength achieves up to 7.2% Pass@1 improvement, underscoring hybrid architectures as a promising direction for future MACGS design. We further demonstrate CAM’s practical utility through two applications: (1) failure repair which achieves a 73.3% success rate by optimizing top-3 importance-ranked features and (2) feature pruning that reduces up to 66.8% intermediate token consumption while maintaining generation performance. Our work provides actionable insights for MACGS design and deployment, establishing causality analysis as a powerful approach for understanding and improving MACGS.
zh
[AI-43] DCoPilot: Generative AI-Empowered Policy Adaptation for Dynamic Data Center Operations
【速读】:该论文旨在解决现代数据中心(Data Center, DC)在高功率密度和动态工作负载下,因手动设计分段深度强化学习(Deep Reinforcement Learning, DRL)智能体无法及时响应服务级别协议(Service-Level Agreement, SLA)变化与环境动态迁移而导致的控制策略滞后问题,进而引发服务中断风险。解决方案的关键在于提出一种名为DCoPilot的混合生成式控制策略框架,其核心创新在于融合两种生成范式:一是利用大语言模型(Large Language Model, LLM)生成结构化的奖励函数形式,实现统一且可解释的奖励建模;二是通过超网络(Hypernetwork)根据SLA和场景嵌入参数化生成策略权重,从而支持零样本在线适应新规格。该框架通过仿真规模扩展、元策略蒸馏与在线自适应三阶段协同机制,显著提升了控制策略的泛化能力与稳定性,在多类DC控制任务中实现了近零约束违反并优于所有基线方法。
链接: https://arxiv.org/abs/2602.02137
作者: Minghao Li,Ruihang Wang,Rui Tan,Yonggang Wen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Modern data centers (DCs) hosting artificial intelligence (AI)-dedicated devices operate at high power densities with rapidly varying workloads, making minute-level adaptation essential for safe and energy-efficient operation. However, manually designing piecewise deep reinforcement learning (DRL) agents cannot keep pace with frequent dynamics shifts and service-level agreement (SLA) changes of an evolving DC. This specification-to-policy lag causes a lack of timely, effective control policies, which may lead to service outages. To bridge the gap, we present DCoPilot, a hybrid framework for generative control policies in dynamic DC operation. DCoPilot synergizes two distinct generative paradigms, i.e., a large language model (LLM) that performs symbolic generation of structured reward forms, and a hypernetwork that conducts parametric generation of policy weights. DCoPilot operates through three coordinated phases: (i) simulation scale-up, which stress-tests reward candidates across diverse simulation-ready (SimReady) scenes; (ii) meta policy distillation, where a hypernetwork is trained to output policy weights conditioned on SLA and scene embeddings; and (iii) online adaptation, enabling zero-shot policy generation in response to updated specifications. Evaluated across five control task families spanning diverse DC components, DCoPilot achieves near-zero constraint violations and outperforms all baselines across specification variations. Ablation studies validate the effectiveness of LLM-based unified reward generation in enabling stable hypernetwork convergence.
zh
[AI-44] Mitigating Safety Tax via Distribution-Grounded Refinement in Large Reasoning Models
【速读】:该论文旨在解决安全对齐(safety alignment)过程中引入的“安全税”(safety tax)问题,即在对大型推理模型(Large Reasoning Model, LRM)进行安全对齐时,导致其通用推理能力显著退化的问题。现有方法通常通过蒸馏外部LRM或人工标注者的安全推理轨迹和答案构建对齐数据集,但这些数据与目标LRM的内部分布存在分布偏移(distributional gap),被认为是造成推理能力下降的核心原因。解决方案的关键在于提出一种名为DGR(Distribution-Guided Refinement)的数据集构建方法,该方法通过将原始分布外的安全推理数据转换并优化,使其更贴近目标模型的内部分布,从而有效缓解安全税现象,并保持甚至提升推理准确性。实验表明,DGR在多个基准上显著改善了推理能力(如DirectRefusal指标提升30.2%),且推理退化程度与分布偏移程度正相关,进一步验证了分布一致性的重要性。
链接: https://arxiv.org/abs/2602.02136
作者: Yingsha Xie,Tiansheng Huang,Enneng Yang,Rui Min,Wenjie Lu,Xiaochun Cao,Naiqiang Tan,Li Shen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Code will be released soon
Abstract:Safety alignment incurs a safety tax that perturbs a large reasoning model’s (LRM) general reasoning ability. Existing datasets used for safety alignment of an LRM are usually constructed by distilling safety reasoning traces and answers from an external LRM or human labeler. However, such reasoning traces and answers exhibit a distributional gap with the target LRM that needs alignment, and we conjecture this distributional gap is the culprit leading to significant degradation of the reasoning ability of the target LRM. Driven by this hypothesis, we propose a safety alignment dataset construction method, dubbed DGR. DGR transforms and refines an existing out-of-distribution safety reasoning dataset to align it with the target LRM’s internal distribution. Experimental results demonstrate that i) DGR effectively mitigates the safety tax while maintaining safety performance across all baselines, i.e., achieving +30.2% on DirectRefusal and +21.2% on R1-ACT improvement in average reasoning accuracy compared to Vanilla SFT; ii) the degree of reasoning degradation correlates with the extent of distribution shift, suggesting that bridging this gap is central to preserving capabilities. Furthermore, we find that safety alignment in LRMs may primarily function as a mechanism to activate latent knowledge, as a mere 10 samples are sufficient for activating effective refusal behaviors. These findings not only emphasize the importance of distributional consistency but also provide insights into the activation mechanism of safety in reasoning models.
zh
[AI-45] Scalable Spatio-Temporal SE(3) Diffusion for Long-Horizon Protein Dynamics
【速读】:该论文旨在解决分子动力学(Molecular Dynamics, MD)模拟在计算成本上的瓶颈问题,尤其是难以覆盖生物相关时间尺度(如微秒级)的局限性。现有生成模型虽能加速模拟,但在长时程生成中受限于架构设计、误差累积及对时空动态建模不足等问题。其解决方案的关键在于提出STAR-MD(Spatio-Temporal Autoregressive Rollout for Molecular Dynamics),一种可扩展的SE(3)-等变扩散模型,核心创新是引入具有联合时空注意力机制的因果扩散Transformer,能够高效捕捉复杂的时空依赖关系,同时避免传统方法的内存瓶颈。该方法在ATLAS基准测试中实现了最优性能,显著提升构象覆盖率、结构有效性和动态保真度,并成功外推生成稳定微秒级轨迹,突破了基线方法在长时间尺度上的失效限制。
链接: https://arxiv.org/abs/2602.02128
作者: Nima Shoghi,Yuxuan Liu,Yuning Shen,Rob Brekelmans,Pan Li,Quanquan Gu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biological Physics (physics.bio-ph); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)
备注: For associated project page, see this https URL
Abstract:Molecular dynamics (MD) simulations remain the gold standard for studying protein dynamics, but their computational cost limits access to biologically relevant timescales. Recent generative models have shown promise in accelerating simulations, yet they struggle with long-horizon generation due to architectural constraints, error accumulation, and inadequate modeling of spatio-temporal dynamics. We present STAR-MD (Spatio-Temporal Autoregressive Rollout for Molecular Dynamics), a scalable SE(3)-equivariant diffusion model that generates physically plausible protein trajectories over microsecond timescales. Our key innovation is a causal diffusion transformer with joint spatio-temporal attention that efficiently captures complex space-time dependencies while avoiding the memory bottlenecks of existing methods. On the standard ATLAS benchmark, STAR-MD achieves state-of-the-art performance across all metrics–substantially improving conformational coverage, structural validity, and dynamic fidelity compared to previous methods. STAR-MD successfully extrapolates to generate stable microsecond-scale trajectories where baseline methods fail catastrophically, maintaining high structural quality throughout the extended rollout. Our comprehensive evaluation reveals severe limitations in current models for long-horizon generation, while demonstrating that STAR-MD’s joint spatio-temporal modeling enables robust dynamics simulation at biologically relevant timescales, paving the way for accelerated exploration of protein function.
zh
[AI-46] Two-Stage Grid Optimization for Group-wise Quantization of LLMs ICASSP2026
【速读】:该论文旨在解决低比特量化(low-bit quantization)下大语言模型(Large Language Models, LLMs)精度下降的问题,尤其针对现有分组量化方法(group-wise quantization)中因忽略输入统计特性与组间相关性而导致的层级重构误差(layer-wise reconstruction loss)优化不足的问题。其解决方案的关键在于提出一个两阶段优化框架:第一阶段在GPTQ之前初始化各组尺度以最小化组内重构损失,从而引入输入统计信息;第二阶段冻结GPTQ得到的整数权重后,通过坐标下降算法和闭式更新规则精细化调整组尺度,以进一步最小化层级重构损失,并显式考虑前序层量化误差以避免误差累积。该方法在不显著增加计算开销的前提下,实现了更优的量化精度。
链接: https://arxiv.org/abs/2602.02126
作者: Junhan Kim,Gukryeol Lee,Seungwoo Son,Jeewook Kim,Yongkweon Jeon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICASSP 2026
Abstract:Group-wise quantization is an effective strategy for mitigating accuracy degradation in low-bit quantization of large language models (LLMs). Among existing methods, GPTQ has been widely adopted due to its efficiency; however, it neglects input statistics and inter-group correlations when determining group scales, leading to a mismatch with its goal of minimizing layer-wise reconstruction loss. In this work, we propose a two-stage optimization framework for group scales that explicitly minimizes the layer-wise reconstruction loss. In the first stage, performed prior to GPTQ, we initialize each group scale to minimize the group-wise reconstruction loss, thereby incorporating input statistics. In the second stage, we freeze the integer weights obtained via GPTQ and refine the group scales to minimize the layer-wise reconstruction loss. To this end, we employ the coordinate descent algorithm and derive a closed-form update rule, which enables efficient refinement without costly numerical optimization. Notably, our derivation incorporates the quantization errors from preceding layers to prevent error accumulation. Experimental results demonstrate that our method consistently enhances group-wise quantization, achieving higher accuracy with negligible overhead.
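下面用 Python 示意文中第二阶段的核心步骤:固定整数权重后,对组尺度做坐标下降,每一步更新都有最小化层级重构误差的闭式解。此处只演示单个输出列、两组权重的玩具设定,量化方式与分组均为本文为说明而假设。

```python
import numpy as np

def refine_scales(X, w, q, group_idx, n_iter=10):
    """坐标下降精调组尺度(单个输出列的玩具版)。
    固定整数权重 q,最小化 ||X w - X (s[group_idx] * q)||^2,
    每个组尺度 s_g 的更新都有闭式解。"""
    n_groups = int(group_idx.max()) + 1
    s = np.ones(n_groups)
    y = X @ w                                             # 全精度层输出(目标)
    cols = [X[:, group_idx == g] @ q[group_idx == g] for g in range(n_groups)]
    for _ in range(n_iter):
        for g in range(n_groups):
            # 残差中剔除本组贡献后,对 s_g 求闭式最优解
            r = y - sum(s[h] * cols[h] for h in range(n_groups) if h != g)
            denom = cols[g] @ cols[g]
            if denom > 0:
                s[g] = (cols[g] @ r) / denom
    return s

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 8))     # 校准输入
w = rng.standard_normal(8)            # 原始权重(单个输出列)
q = np.round(w / 0.1)                 # 假设步长 0.1 的朴素整数量化
group_idx = np.repeat([0, 1], 4)      # 两组、每组 4 个权重
s = refine_scales(X, w, q, group_idx)
err = np.mean((X @ w - X @ (s[group_idx] * q)) ** 2)
print("组尺度:", s, " 层级重构 MSE:", err)
```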
zh
[AI-47] The Verification Crisis: Expert Perceptions of GenAI Disinformation and the Case for Reproducible Provenance
【速读】:该论文旨在解决生成式人工智能(Generative AI)引发的虚假信息(disinformation)规模化、自动化传播所带来的新型治理挑战,尤其关注多模态内容(文本、图像、音频、视频)对社会认知结构的系统性侵蚀。其核心问题是:当前缺乏可复现的研究方法与标准化评估基准,导致难以有效监测和应对由GenAI生成的合成媒体带来的“认知碎片化”(epistemic fragmentation)和“合成共识”(synthetic consensus)风险。解决方案的关键在于将信息完整性视为基础设施,强调通过严格的数据溯源标准(provenance standards)和方法学可复现性(methodological reproducibility)构建可信的信息生态体系,而非依赖尚不成熟的检测技术工具。
链接: https://arxiv.org/abs/2602.02100
作者: Alexander Loth,Martin Kappes,Marc-Oliver Pahl
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: Accepted at ACM TheWebConf '26 Companion
Abstract:The growth of Generative Artificial Intelligence (GenAI) has shifted disinformation production from manual fabrication to automated, large-scale manipulation. This article presents findings from the first wave of a longitudinal expert perception survey (N=21) involving AI researchers, policymakers, and disinformation specialists. It examines the perceived severity of multimodal threats – text, image, audio, and video – and evaluates current mitigation strategies. Results indicate that while deepfake video presents immediate “shock” value, large-scale text generation poses a systemic risk of “epistemic fragmentation” and “synthetic consensus,” particularly in the political domain. The survey reveals skepticism about technical detection tools, with experts favoring provenance standards and regulatory frameworks despite implementation barriers. GenAI disinformation research requires reproducible methods. The current challenge is measurement: without standardized benchmarks and reproducibility checklists, tracking or countering synthetic media remains difficult. We propose treating information integrity as an infrastructure with rigor in data provenance and methodological reproducibility.
zh
[AI-48] Probabilistic Performance Guarantees for Multi-Task Reinforcement Learning
【速读】:该论文旨在解决多任务强化学习(Multi-task Reinforcement Learning, Multi-task RL)中缺乏形式化性能保障的问题,尤其是在安全关键场景下部署策略时对可靠性的要求。现有方法虽能训练出具备多任务执行能力的通用策略,但无法提供针对未见过任务的高置信度性能边界。其解决方案的关键在于提出一种新的泛化界(generalisation bound),该界由两部分组成:(i) 基于有限次轨迹采样的每任务下置信区间(lower confidence bounds);(ii) 基于有限采样任务的层级泛化(task-level generalisation)。通过组合这两部分,可为从相同未知分布中抽取的新任务提供高置信度性能保证,且在现实样本规模下具有理论严谨性和实用性。
链接: https://arxiv.org/abs/2602.02098
作者: Yannik Schnitzer,Mathias Jackermeier,Alessandro Abate,David Parker
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-task reinforcement learning trains generalist policies that can execute multiple tasks. While recent years have seen significant progress, existing approaches rarely provide formal performance guarantees, which are indispensable when deploying policies in safety-critical settings. We present an approach for computing high-confidence guarantees on the performance of a multi-task policy on tasks not seen during training. Concretely, we introduce a new generalisation bound that composes (i) per-task lower confidence bounds from finitely many rollouts with (ii) task-level generalisation from finitely many sampled tasks, yielding a high-confidence guarantee for new tasks drawn from the same arbitrary and unknown distribution. Across state-of-the-art multi-task RL methods, we show that the guarantees are theoretically sound and informative at realistic sample sizes.
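下面给出这类“两层组合保证”的一个极简示意(Python):先对每个采样任务用 Hoeffding 不等式求回报的下置信界,再在任务层面对这些下界的均值套一次 Hoeffding。这里使用的是标准 Hoeffding 组合,仅用于说明思路,论文中的界在形式上可能不同、也可能更紧。

```python
import numpy as np

def per_task_lcb(returns, delta):
    """单任务 Hoeffding 下置信界(回报需归一化到 [0, 1])。"""
    n = len(returns)
    return float(np.mean(returns)) - np.sqrt(np.log(1.0 / delta) / (2 * n))

def multi_task_guarantee(task_returns, delta_task=0.025, delta_dist=0.025):
    """组合保证:每个采样任务先取 LCB(对任务数做联合界校正),
    再对任务层面的 LCB 均值套一次 Hoeffding,
    得到对同分布新任务期望回报的高置信下界。"""
    m = len(task_returns)
    lcbs = np.array([per_task_lcb(r, delta_task / m) for r in task_returns])
    return float(lcbs.mean()) - np.sqrt(np.log(1.0 / delta_dist) / (2 * m))

rng = np.random.default_rng(0)
tasks = [rng.uniform(0.6, 1.0, size=20) for _ in range(50)]  # 50 任务 × 20 次 rollout
print("新任务期望回报的高置信下界:", multi_task_guarantee(tasks))  # 总置信度 ≥ 95%
```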
zh
[AI-49] See2Refine: Vision-Language Feedback Improves LLM-Based eHMI Action Designers
【速读】:该论文旨在解决自动化车辆(Autonomous Vehicles, AVs)在与道路其他使用者交互时缺乏自然沟通渠道的问题,进而提出通过外部人机接口(External Human-Machine Interface, eHMI)来传递意图并维持信任。现有eHMI设计多依赖开发者手工构建的消息-动作配对,难以适应复杂多变的交通场景。解决方案的关键在于提出一个无需人工干预的闭环框架See2Refine,其核心机制是利用视觉语言模型(Vision-Language Model, VLM)作为感知评估器,对LLM生成的eHMI动作进行自动视觉反馈,从而迭代优化LLM动作设计策略,实现无需人类标注的自主改进。该方法显著提升了eHMI动作的适配性与用户接受度,并在多种模态和LLM规模下展现出良好的泛化能力与人类偏好一致性。
链接: https://arxiv.org/abs/2602.02063
作者: Ding Xia,Xinyue Gui,Mark Colley,Fan Gao,Zhongyi Zhou,Dongyuan Li,Renhe Jiang,Takeo Igarashi
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Under Review
Abstract:Automated vehicles lack natural communication channels with other road users, making external Human-Machine Interfaces (eHMIs) essential for conveying intent and maintaining trust in shared environments. However, most eHMI studies rely on developer-crafted message-action pairs, which are difficult to adapt to diverse and dynamic traffic contexts. A promising alternative is to use Large Language Models (LLMs) as action designers that generate context-conditioned eHMI actions, yet such designers lack perceptual verification and typically depend on fixed prompts or costly human-annotated feedback for improvement. We present See2Refine, a human-free, closed-loop framework that uses vision-language model (VLM) perceptual evaluation as automated visual feedback to improve an LLM-based eHMI action designer. Given a driving context and a candidate eHMI action, the VLM evaluates the perceived appropriateness of the action, and this feedback is used to iteratively revise the designer’s outputs, enabling systematic refinement without human supervision. We evaluate our framework across three eHMI modalities (lightbar, eyes, and arm) and multiple LLM model sizes. Across settings, our framework consistently outperforms prompt-only LLM designers and manually specified baselines in both VLM-based metrics and human-subject evaluations. Results further indicate that the improvements generalize across modalities and that VLM evaluations are well aligned with human preferences, supporting the robustness and effectiveness of See2Refine for scalable action design.
zh
[AI-50] FiLoRA: Focus-and-Ignore LoRA for Controllable Feature Reliance
【速读】:该论文旨在解决多模态基础模型(Multimodal Foundation Models)中内部特征组依赖关系不明确、且难以在不改变任务语义的前提下主动调控其依赖性的问题。现有研究主要依赖事后分析或特征移除方法,无法实现对模型计算过程的可控调节。解决方案的关键在于提出FiLoRA(Focus-and-Ignore LoRA),这是一种基于指令条件的参数高效适配框架,通过将适配分解为与特征组对齐的LoRA模块,并引入指令条件门控机制,使自然语言指令作为计算层级的控制信号,而非任务重定义,从而在保持预测目标不变的情况下,实现对核心特征和伪相关特征组的选择性增强或抑制,进而诱导因果性的内部计算变化并提升鲁棒性。
链接: https://arxiv.org/abs/2602.02060
作者: Hyunsuk Chung,Caren Han,Yerin Choi,Seungyeon Ji,Jinwoo Kim,Eun-Jung Holden,Kyungreem Han
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal foundation models integrate heterogeneous signals across modalities, yet it remains poorly understood how their predictions depend on specific internal feature groups and whether such reliance can be deliberately controlled. Existing studies of shortcut and spurious behavior largely rely on post hoc analyses or feature removal, offering limited insight into whether reliance can be modulated without altering task semantics. We introduce FiLoRA (Focus-and-Ignore LoRA), an instruction-conditioned, parameter-efficient adaptation framework that enables explicit control over internal feature reliance while keeping the predictive objective fixed. FiLoRA decomposes adaptation into feature group-aligned LoRA modules and applies instruction-conditioned gating, allowing natural language instructions to act as computation-level control signals rather than task redefinitions. Across text–image and audio–visual benchmarks, we show that instruction-conditioned gating induces consistent and causal shifts in internal computation, selectively amplifying or suppressing core and spurious feature groups without modifying the label space or training objective. Further analyses demonstrate that FiLoRA yields improved robustness under spurious feature interventions, revealing a principled mechanism to regulate reliance beyond correlation-driven learning.
zh
[AI-51] FORLER: Federated Offline Reinforcement Learning with Q-Ensemble and Actor Rectification
【速读】:该论文针对离线联邦强化学习(Offline Federated Reinforcement Learning, Offline FRL)中因低质量、异构数据导致的策略污染(policy pollution)问题展开研究,旨在提升模型聚合的鲁棒性与性能。其解决方案的关键在于提出FORLER框架:服务器端采用Q-ensemble聚合机制以稳健融合各设备的Q函数,有效抑制次优策略对全局模型的负面影响;设备端则通过演员修正(actor rectification)策略,利用零阶搜索识别高Q值动作并引入定制正则项引导策略优化,同时结合δ-周期性策略降低本地计算负担。理论分析提供了安全策略改进的保证,实验证明该方法在不同数据质量和异构性条件下均显著优于现有基线。
链接: https://arxiv.org/abs/2602.02055
作者: Nan Qiao,Sheng Yue
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: accepted by IEEE International Conference on Communications (ICC 2026)
Abstract:In Internet-of-Things systems, federated learning has advanced online reinforcement learning (RL) by enabling parallel policy training without sharing raw data. However, interacting with real environments online can be risky and costly, motivating offline federated RL (FRL), where local devices learn from fixed datasets. Despite its promise, offline FRL may break down under low-quality, heterogeneous data. Offline RL tends to get stuck in local optima, and in FRL, one device’s suboptimal policy can degrade the aggregated model, i.e., policy pollution. We present FORLER, combining Q-ensemble aggregation on the server with actor rectification on devices. The server robustly merges device Q-functions to curb policy pollution and shift heavy computation off resource-constrained hardware without compromising privacy. Locally, actor rectification enriches policy gradients via a zeroth-order search for high-Q actions plus a bespoke regularizer that nudges the policy toward them. A δ-periodic strategy further reduces local computation. We theoretically provide safe policy improvement performance guarantees. Extensive experiments show FORLER consistently outperforms strong baselines under varying data quality and heterogeneity.
zh
[AI-52] SIDiffAgent: Self-Improving Diffusion Agent
【速读】:该论文旨在解决文本到图像扩散模型(text-to-image diffusion models)在实际部署中面临的诸多挑战,包括对提示词表述的敏感性、语义解释模糊(如“mouse”可能指动物或计算机外设)、生成图像中的伪影(如解剖结构扭曲)以及对精心设计提示词的依赖。现有方法通常需要额外训练且可控性有限,难以适应真实应用场景。其解决方案的关键在于提出一种无需训练的智能体框架——Self-Improving Diffusion Agent (SIDiffAgent),该框架利用Qwen系列模型(Qwen-VL、Qwen-Image、Qwen-Edit、Qwen-Embedding)实现自主提示工程、错误检测与修正及细粒度伪影去除,并通过将历史经验存储于数据库中,实现迭代式自我改进,从而显著提升生成结果的可靠性与一致性。
链接: https://arxiv.org/abs/2602.02051
作者: Shivank Garg,Ayush Singh,Gaurav Kumar Nayak
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Text-to-image diffusion models have revolutionized generative AI, enabling high-quality and photorealistic image synthesis. However, their practical deployment remains hindered by several limitations: sensitivity to prompt phrasing, ambiguity in semantic interpretation (e.g., “mouse” as animal vs. a computer peripheral), artifacts such as distorted anatomy, and the need for carefully engineered input prompts. Existing methods often require additional training and offer limited controllability, restricting their adaptability in real-world applications. We introduce Self-Improving Diffusion Agent (SIDiffAgent), a training-free agentic framework that leverages the Qwen family of models (Qwen-VL, Qwen-Image, Qwen-Edit, Qwen-Embedding) to address these challenges. SIDiffAgent autonomously manages prompt engineering, detects and corrects poor generations, and performs fine-grained artifact removal, yielding more reliable and consistent outputs. It further incorporates iterative self-improvement by storing a memory of previous experiences in a database. This database of past experiences is then used to inject prompt-based guidance at each stage of the agentic pipeline. SIDiffAgent achieved an average VQA score of 0.884 on GenAIBench, significantly outperforming open-source, proprietary models and agentic methods. We will publicly release our code upon acceptance.
zh
[AI-53] Rethinking the Role of Entropy in Optimizing Tool-Use Behaviors for Large Language Model Agents
【速读】:该论文旨在解决基于大语言模型(Large Language Models, LLMs)的工具使用代理(tool-using agents)在长轨迹任务中因频繁且低质量的工具调用导致延迟增加和推理性能下降的问题。解决方案的关键在于利用熵减少(entropy reduction)作为监督信号,设计两种奖励策略:稀疏结果奖励(sparse outcome rewards)提供轨迹级粗粒度引导以提升效率,密集过程奖励(dense process rewards)提供细粒度监督以增强性能。实验表明,这两种策略分别实现了工具调用减少72.07%和性能提升22.27%,验证了熵减少是优化工具使用行为的核心机制。
链接: https://arxiv.org/abs/2602.02050
作者: Zeping Li,Hongru Wang,Yiwen Zhao,Guanhua Chen,Yixia Li,Keyang Chen,Yixin Cao,Guangnan Ye,Hongfeng Chai,Mengdi Wang,Zhenfei Yin
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:Tool-using agents based on Large Language Models (LLMs) excel in tasks such as mathematical reasoning and multi-hop question answering. However, in long trajectories, agents often trigger excessive and low-quality tool calls, increasing latency and degrading inference performance, making managing tool-use behavior challenging. In this work, we conduct entropy-based pilot experiments and observe a strong positive correlation between entropy reduction and high-quality tool calls. Building on this finding, we propose using entropy reduction as a supervisory signal and design two reward strategies to address the differing needs of optimizing tool-use behavior. Sparse outcome rewards provide coarse, trajectory-level guidance to improve efficiency, while dense process rewards offer fine-grained supervision to enhance performance. Experiments across diverse domains show that both reward designs improve tool-use behavior: the former reduces tool calls by 72.07% compared to the average of baselines, while the latter improves performance by 22.27%. These results position entropy reduction as a key mechanism for enhancing tool-use behavior, enabling agents to be more adaptive in real-world applications.
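下面用几行 Python 示意“以熵减少作为工具调用质量信号”的计算方式:比较工具调用前后模型在答案分布上的熵,熵降越多、该次调用的过程奖励越高。函数名与系数均为本文假设,非论文的精确奖励形式。

```python
import numpy as np

def token_entropy(logits):
    """答案分布的香农熵(logits 先做 softmax)。"""
    p = np.exp(logits - logits.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(-1))

def tool_call_process_reward(entropy_before, entropy_after, alpha=1.0):
    """密集过程奖励:以工具调用前后答案分布的熵减少量计分,
    熵不降反升的调用不给奖励(视为低质量调用)。"""
    return alpha * max(entropy_before - entropy_after, 0.0)

before = token_entropy(np.array([0.5, 0.4, 0.3, 0.2]))  # 调用前:分布较平 → 熵高
after = token_entropy(np.array([5.0, 0.1, 0.1, 0.1]))   # 调用后:答案集中 → 熵低
print(tool_call_process_reward(before, after))
```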
zh
[AI-54] Bandwidth-Efficient Multi-Agent Communication through Information Bottleneck and Vector Quantization ICRA2026
【速读】:该论文旨在解决多智能体强化学习(Multi-agent Reinforcement Learning, MARL)系统在实际机器人应用中面临的严重通信约束问题,此类约束显著影响智能体间的协调效率。其核心解决方案是将信息瓶颈理论(Information Bottleneck Theory)与向量量化(Vector Quantization)相结合,通过信息论驱动的优化策略实现通信信息的压缩与离散化,在保留任务关键信息的同时提升带宽利用率。关键创新在于引入了一个门控通信机制(Gated Communication Mechanism),该机制基于环境上下文和智能体状态动态判断是否需要通信,从而实现选择性、高效的通信决策。实验表明,该方法在复杂协作任务中相较无通信基线提升了181.8%的性能,同时降低41.4%的带宽消耗,并在成功率-带宽权衡曲线上全面优于现有方法。
链接: https://arxiv.org/abs/2602.02035
作者: Ahmad Farooq,Kamran Iqbal
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: Accepted at the 2026 IEEE International Conference on Robotics and Automation (ICRA 2026), Vienna, Austria. 9 pages, 4 figures, 6 tables
Abstract:Multi-agent reinforcement learning systems deployed in real-world robotics applications face severe communication constraints that significantly impact coordination effectiveness. We present a framework that combines information bottleneck theory with vector quantization to enable selective, bandwidth-efficient communication in multi-agent environments. Our approach learns to compress and discretize communication messages while preserving task-critical information through principled information-theoretic optimization. We introduce a gated communication mechanism that dynamically determines when communication is necessary based on environmental context and agent states. Experimental evaluation on challenging coordination tasks demonstrates that our method achieves 181.8% performance improvement over no-communication baselines while reducing bandwidth usage by 41.4%. Comprehensive Pareto frontier analysis shows dominance across the entire success-bandwidth spectrum with area-under-curve of 0.198 vs 0.142 for next-best methods. Our approach significantly outperforms existing communication strategies and establishes a theoretically grounded framework for deploying multi-agent systems in bandwidth-constrained environments such as robotic swarms, autonomous vehicle fleets, and distributed sensor networks.
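下面是“信息瓶颈压缩 + 向量量化 + 门控通信”的一个简化 Python 示意:消息先投影到低维瓶颈,再映射为最近码本向量的索引(传输量仅 log2(K) 比特),门控决定本步是否值得通信。投影矩阵、码本与门控均为随机初始化的占位实现,并非论文中经过训练的模型。

```python
import numpy as np

class VQMessenger:
    """信息瓶颈 + 向量量化通信的占位实现(示意)。"""
    def __init__(self, dim_in=32, dim_z=4, n_codes=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((dim_in, dim_z)) / np.sqrt(dim_in)  # 压缩投影
        self.codebook = rng.standard_normal((n_codes, dim_z))            # 离散码本
        self.wg = rng.standard_normal(dim_z)                             # 门控参数
        self.bg = 0.0

    def encode(self, obs, gate_threshold=0.5):
        z = obs @ self.W                                   # 低维瓶颈表示
        gate = 1.0 / (1.0 + np.exp(-(self.wg @ z + self.bg)))
        if gate < gate_threshold:
            return None                                    # 本步不通信,节省带宽
        return int(np.argmin(((self.codebook - z) ** 2).sum(-1)))  # 只传索引

    def decode(self, idx):
        return None if idx is None else self.codebook[idx]

m = VQMessenger()
rng = np.random.default_rng(1)
for _ in range(3):
    idx = m.encode(rng.standard_normal(32))
    print(idx, None if idx is None else m.decode(idx).round(2))
```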
zh
[AI-55] Constrained Process Maps for Multi-Agent Generative AI Workflows
【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的单代理系统在合规等受监管场景中难以有效处理不确定性与多阶段协作的问题,尤其是在缺乏可观测性与可比较性的条件下。解决方案的关键在于提出一种形式化为有限时域马尔可夫决策过程(finite-horizon Markov Decision Process, MDP)的多代理系统,其中每个代理对应特定角色或决策阶段(如内容、业务或法律审查),并通过预定义的转移规则表示任务升级或完成;同时,通过蒙特卡洛估计量化代理层面的认知不确定性(epistemic uncertainty),并以MDP终止于自动标注状态或人工审核状态来捕获系统级不确定性,从而实现对不确定性的显式建模与优化。
链接: https://arxiv.org/abs/2602.02034
作者: Ananya Joshi,Michael Rudow
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language model (LLM)-based agents are increasingly used to perform complex, multi-step workflows in regulated settings such as compliance and due diligence. However, many agentic architectures rely primarily on prompt engineering of a single agent, making it difficult to observe or compare how models handle uncertainty and coordination across interconnected decision stages and with human oversight. We introduce a multi-agent system formalized as a finite-horizon Markov Decision Process (MDP) with a directed acyclic structure. Each agent corresponds to a specific role or decision stage (e.g., content, business, or legal review in a compliance workflow), with predefined transitions representing task escalation or completion. Epistemic uncertainty is quantified at the agent level using Monte Carlo estimation, while system-level uncertainty is captured by the MDP’s termination in either an automated labeled state or a human-review state. We illustrate the approach through a case study in AI safety evaluation for self-harm detection, implemented as a multi-agent compliance system. Results demonstrate improvements over a single-agent baseline, including up to a 19% increase in accuracy, up to an 85x reduction in required human review, and, in some configurations, reduced processing time.
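下面用 Python 勾勒这类有限时域 MDP 的骨架:案例按 DAG 顺序流经各决策阶段,用蒙特卡洛重复采样估计各阶段的认知不确定性,超过阈值即进入人工审核终止态。阶段划分、阈值与玩具智能体均为本文假设。

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_uncertainty(agent_fn, case, n=10):
    """智能体级认知不确定性:重复采样同一输入的判定,以分歧度衡量。"""
    votes = [bool(agent_fn(case)) for _ in range(n)]
    p = float(np.mean(votes))
    return p, 4.0 * p * (1.0 - p)       # 方差归一化到 [0, 1]

def compliance_pipeline(case, stages, tau=0.5):
    """有限时域 MDP 骨架:按 DAG 顺序过各审查阶段,
    不确定性超过 τ 则进入人工审核终止态,否则自动标注。"""
    for name, agent in stages:
        p_flag, unc = mc_uncertainty(agent, case)
        if unc > tau:
            return ("human_review", name)   # 终止态:转人工
        if p_flag > 0.5:
            return ("escalated", name)      # 终止态:判定违规并升级
    return ("auto_labeled", None)           # 终止态:自动标注通过

stages = [("content", lambda c: False),               # 内容审查:判定一致
          ("legal",   lambda c: rng.random() < 0.5)]  # 法务审查:分歧大
print(compliance_pipeline({"text": "..."}, stages))   # 大概率: ('human_review', 'legal')
```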
zh
[AI-56] Canonical Intermediate Representation for LLM-based optimization problem formulation and code generation
【速读】:该论文旨在解决从自然语言描述中自动构建优化模型的问题,尤其针对当前大语言模型(Large Language Models, LLMs)在处理复杂运营规则时难以正确建模复合约束和选择合适建模范式(modeling paradigms)的挑战。其解决方案的关键在于提出一种规范中间表示(Canonical Intermediate Representation, CIR),该表示显式地将问题描述与优化模型之间的语义映射解耦:CIR通过约束原型(constraint archetypes)和候选建模范式编码运营规则的逻辑结构,从而独立于具体的数学表达形式;在此基础上构建的规则到约束(Rule-to-Constraint, R2C)框架,利用多智能体流水线实现文本解析、领域知识检索与模型实例化,显著提升了复杂规则下的建模准确率,达到当前最优性能。
链接: https://arxiv.org/abs/2602.02029
作者: Zhongyuan Lyu,Shuoyu Hu,Lujie Liu,Hongxia Yang,Ming LI
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 41 pages, 4 figures, 5 tables
Abstract:Automatically formulating optimization models from natural language descriptions is a growing focus in operations research, yet current LLM-based approaches struggle with the composite constraints and appropriate modeling paradigms required by complex operational rules. To address this, we introduce the Canonical Intermediate Representation (CIR): a schema that LLMs explicitly generate between problem descriptions and optimization models. CIR encodes the semantics of operational rules through constraint archetypes and candidate modeling paradigms, thereby decoupling rule logic from its mathematical instantiation. Building on a newly generated CIR knowledge base, we develop the rule-to-constraint (R2C) framework, a multi-agent pipeline that parses problem texts, synthesizes CIR implementations by retrieving domain knowledge, and instantiates optimization models. To systematically evaluate rule-to-constraint reasoning, we test R2C on our newly constructed benchmark featuring rich operational rules, and benchmarks from prior work. Extensive experiments show that R2C achieves state-of-the-art accuracy on the proposed benchmark (47.2% Accuracy Rate). On established benchmarks from the literature, R2C delivers highly competitive results, approaching the performance of proprietary models (e.g., GPT-5). Moreover, with a reflection mechanism, R2C achieves further gains and sets new best-reported results on some benchmarks.
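下面给出一个假想的 CIR 实例及其实例化过程的 Python 示意:CIR 以约束原型与候选建模范式编码运营规则的语义,再由确定性代码翻译成求解器可用的约束表达式。字段名与原型名均为本文虚构,并非论文定义的确切 schema。

```python
# 一个假想的 CIR 实例:编码“每台机器每天最多运行 10 小时”这条运营规则
cir = {
    "rule_text": "每台机器每天最多运行 10 小时",
    "archetype": "capacity_limit",                 # 约束原型
    "entities": {"resource": "machine", "period": "day"},
    "paradigm_candidates": ["linear", "big_M"],    # 候选建模范式
    "parameters": {"limit": 10, "unit": "hour"},
}

def instantiate(cir):
    """把 CIR 实例化为求解器约束表达式(极简翻译器)。"""
    if cir["archetype"] == "capacity_limit":
        r = cir["entities"]["resource"]
        p = cir["entities"]["period"]
        lim = cir["parameters"]["limit"]
        return f"sum(hours[{r}, t] for t in {p}) <= {lim}"
    raise ValueError("未覆盖的约束原型: " + cir["archetype"])

print(instantiate(cir))
```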
zh
[AI-57] Edit Knowledge Not Just Facts via Multi-Step Reasoning over Background Stories
【速读】:该论文旨在解决大语言模型在获取新知识后难以将其有效整合并灵活应用于多步骤推理的问题。现有知识编辑方法多聚焦于原子事实的更新,虽能提升事实 recall 率,但缺乏将新信息嵌入连贯认知框架的能力,导致跨场景应用失效。解决方案的关键在于将知识内化视为一个推理问题而非记忆问题:首先通过构建包含新事实及其与已有知识关联的背景故事来提供语境;其次设计自生成的多跳问答任务,迫使模型在推理链中结合新旧知识;最后采用知识蒸馏策略,使学生模型在无直接访问新信息的情况下学习教师模型的推理行为,从而实现对新知识的深层内化。实验证明,该方法显著提升了模型在需要融合多个新事实的复杂推理任务上的表现。
链接: https://arxiv.org/abs/2602.02028
作者: Ya Gao,Kalle Kujanpää,Pekka Marttinen,Harri Valpola,Alexander Ilin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: under review
Abstract:Enabling artificial intelligence systems, particularly large language models, to integrate new knowledge and flexibly apply it during reasoning remains a central challenge. Existing knowledge editing approaches emphasize atomic facts, improving factual recall but often failing to integrate new information into a coherent framework usable across contexts. In this work, we argue that knowledge internalization is fundamentally a reasoning problem rather than a memorization problem. Consequently, a model should be trained in situations where the new information is instrumental to solving a task, combined with pre-existing knowledge, and exercised through multi-step reasoning. Based on this insight, we propose a training strategy based on three principles. First, new knowledge is introduced as a coherent background story that contextualizes novel facts and explains their relation to existing knowledge. Second, models are trained using self-generated multi-hop questions that require multi-step reasoning involving the new information. Third, training is done using knowledge distillation, forcing a student model to internalize the teacher’s reasoning behavior without access to the novel information. Experiments show that models trained with this strategy effectively leverage newly acquired knowledge during reasoning and achieve remarkable performance on challenging questions that require combining multiple new facts.
zh
[AI-58] Light Alignment Improves LLM Safety via Model Self-Reflection with a Single Neuron
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在安全对齐(safety alignment)方面存在的问题,即现有方法多依赖计算成本高昂的后训练策略,且在不同模型规模下泛化能力差;同时,部分轻量级对齐方法要么严重依赖预计算的安全注入(safety injections),要么过度依赖模型自身能力,导致效率与可用性下降。其解决方案的关键在于提出一种安全感知解码(safety-aware decoding)方法,仅需低成本训练一个专家模型,并引入一个单神经元门控机制(gating mechanism),通过有效平衡模型内在能力与外部引导信号,在保持生成实用性的同时显著提升输出安全性,从而在训练开销和跨模型尺度泛化性上展现出明显优势。
链接: https://arxiv.org/abs/2602.02027
作者: Sicheng Shen,Mingyang Lv,Han Shen,Jialin Wu,Binghao Wang,Zhou Yang,Guobin Shen,Dongcheng Zhao,Feifei Zhao,Yi Zeng
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 21 pages, 3 figures
Abstract:The safety of large language models (LLMs) has increasingly emerged as a fundamental aspect of their development. Existing safety alignment for LLMs is predominantly achieved through post-training methods, which are computationally expensive and often fail to generalize well across different models. A small number of lightweight alignment approaches either rely heavily on prior-computed safety injections or depend excessively on the model’s own capabilities, resulting in limited generalization and degraded efficiency and usability during generation. In this work, we propose a safety-aware decoding method that requires only low-cost training of an expert model and employs a single neuron as a gating mechanism. By effectively balancing the model’s intrinsic capabilities with external guidance, our approach simultaneously preserves utility and enhances output safety. It demonstrates clear advantages in training overhead and generalization across model scales, offering a new perspective on lightweight alignment for the safe and practical deployment of large language models. Code: this https URL.
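下面用几行 Python 示意“单神经元门控 + 安全专家”的解码混合方式:门控神经元输出 g∈(0,1),对基础模型与安全专家模型的 logits 做凸组合。数值均为人为构造,仅演示机制,并非论文的具体实现。

```python
import numpy as np

def gated_safe_logits(base_logits, expert_logits, h_gate, w, b):
    """单神经元门控解码:g 越接近 1 越倚重安全专家分布,
    否则保留基础模型的原有能力。"""
    g = 1.0 / (1.0 + np.exp(-(w @ h_gate + b)))   # 单个 sigmoid 神经元
    return (1.0 - g) * base_logits + g * expert_logits

base = np.array([3.0, 2.5, 0.1])      # 假设索引 0 对应有害续写,基础模型给了高 logit
expert = np.array([-5.0, 2.5, 0.1])   # 低成本训练的安全专家强烈压低该 token
h = np.full(8, 0.5); w = np.full(8, 0.5)
print(gated_safe_logits(base, expert, h, w, b=4.0))  # w·h+b=6 → g≈0.998,输出趋向专家
```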
zh
[AI-59] Do I Really Know? Learning Factual Self-Verification for Hallucination Reduction
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)中存在的事实性幻觉(factual hallucination)问题,即模型在生成内容时可能产生与事实不符的陈述。现有方法主要依赖外部后验验证或在微调阶段将不确定性直接映射为拒答行为,常导致模型过于保守。其解决方案的关键在于提出一种训练阶段的框架VeriFY,通过一致性自验证机制教导模型推理事实不确定性:模型首先生成初始答案,随后构造并回答一个探查性验证问题,做出一致性判断,并据此决定是否输出答案或选择拒答。为防止训练过程中强化幻觉内容,VeriFY引入了阶段级损失掩码策略,排除幻觉答案阶段的监督信号,同时保留对验证行为的有效指导。实验证明,该方法在多个模型家族和规模下可显著降低幻觉率(9.7%–53.3%),且召回率下降有限(0.4%–5.7%),具备良好的跨数据集泛化能力。
链接: https://arxiv.org/abs/2602.02018
作者: Enes Altinisik,Masoomali Fatehkia,Fatih Deniz,Nadir Durrani,Majd Hawasly,Mohammad Raza,Husrev Taha Sencar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Factual hallucination remains a central challenge for large language models (LLMs). Existing mitigation approaches primarily rely on either external post-hoc verification or mapping uncertainty directly to abstention during fine-tuning, often resulting in overly conservative behavior. We propose VeriFY, a training-time framework that teaches LLMs to reason about factual uncertainty through consistency-based self-verification. VeriFY augments training with structured verification traces that guide the model to produce an initial answer, generate and answer a probing verification query, issue a consistency judgment, and then decide whether to answer or abstain. To address the risk of reinforcing hallucinated content when training on augmented traces, we introduce a stage-level loss masking approach that excludes hallucinated answer stages from the training objective while preserving supervision over verification behavior. Across multiple model families and scales, VeriFY reduces factual hallucination rates by 9.7 to 53.3 percent, with only modest reductions in recall (0.4 to 5.7 percent), and generalizes across datasets when trained on a single source. The source code, training data, and trained model checkpoints will be released upon acceptance.
zh
[AI-60] Preserve-Then-Quantize: Balancing Rank Budgets for Quantization Error Reconstruction in LLMs
【速读】:该论文旨在解决后训练量化(Post-Training Quantization, PTQ)中因量化误差导致的精度损失问题。现有方法通常将全部秩预算用于重构量化误差,但在权重矩阵具有内在低秩结构且量化破坏了主导方向时,这种策略效率低下。解决方案的关键在于提出结构化残差重构(Structured Residual Reconstruction, SRR),其核心思想是:在量化前保留激活缩放后的权重矩阵的前 k 个奇异子空间,仅对残差部分进行量化,并利用剩余秩 r−k 进行误差重构。通过理论指导的准则选择 k 值,以平衡暴露于量化中的能量与不可恢复的误差,从而实现更优的量化精度。此外,该参数化形式自然支持量化参数高效微调(Quantized Parameter-Efficient Fine-Tuning, QPEFT),并通过沿保留方向的梯度缩放提升微调稳定性。
链接: https://arxiv.org/abs/2602.02001
作者: Yoonjun Cho,Dongjae Jeon,Soeun Kim,Moongyu Jeon,Albert No
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Quantization Error Reconstruction (QER) reduces accuracy loss in Post-Training Quantization (PTQ) by approximating weights as W ≈ Q + LR, using a rank-r correction to reconstruct quantization error. Prior methods devote the full rank budget to error reconstruction, which is suboptimal when W has intrinsic low-rank structure and quantization corrupts dominant directions. We propose Structured Residual Reconstruction (SRR), a rank-allocation framework that preserves the top-k singular subspace of the activation-scaled weight before quantization, quantizes only the residual, and uses the remaining rank r-k for error reconstruction. We derive a theory-guided criterion for selecting k by balancing quantization-exposed energy and unrecoverable error under rank constraints. We further show that the resulting Q + LR parameterization naturally supports Quantized Parameter-Efficient Fine-Tuning (QPEFT), and stabilizes fine-tuning via gradient scaling along preserved directions. Experiments demonstrate consistent perplexity reductions across diverse models and quantization settings in PTQ, along with a 5.9 percentage-point average gain on GLUE under 2-bit QPEFT.
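下面给出 SRR 分解的一个简化 Python 示意:先保留前 k 个奇异方向不参与量化,只量化残差,再用剩余 r-k 的秩重构量化误差。为保持自洽,这里省略了激活缩放并使用朴素的按张量均匀量化,与论文的精确设定不同。

```python
import numpy as np

def quantize(W, bits=4):
    """朴素对称均匀量化(按张量),仅作示意。"""
    s = np.abs(W).max() / (2 ** (bits - 1) - 1)
    return np.round(W / s) * s

def srr_decompose(W, r=16, k=4, bits=4):
    """Structured Residual Reconstruction 的简化版:
    量化前先保留前 k 个奇异方向(不经过量化),
    只量化残差,再用剩余 r-k 的秩重构量化误差。
    返回 (Q, L, R),满足 W ≈ Q + L @ R。"""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    L1 = U[:, :k] * S[:k]            # 保留的子空间 (d_out, k)
    R1 = Vt[:k]                      # (k, d_in)
    residual = W - L1 @ R1
    Q = quantize(residual, bits)     # 只量化残差
    E = residual - Q                 # 量化误差
    Ue, Se, Vte = np.linalg.svd(E, full_matrices=False)
    L2 = Ue[:, :r - k] * Se[:r - k]  # 误差重构部分,占用剩余秩
    R2 = Vte[:r - k]
    return Q, np.concatenate([L1, L2], axis=1), np.concatenate([R1, R2], axis=0)

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)) @ np.diag(1 / np.arange(1, 257))  # 低秩结构占优
Q, L, R = srr_decompose(W)
print("相对近似误差:", np.linalg.norm(W - (Q + L @ R)) / np.linalg.norm(W))
```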
zh
[AI-61] On the Limits of Layer Pruning for Generative Reasoning in LLMs
【速读】:该论文旨在解决层剪枝(layer pruning)在压缩大语言模型(Large Language Models, LLMs)时导致生成式推理任务性能显著下降的问题。现有剪枝方法虽能在分类任务上保持较强性能,但在多步推理任务中表现出对深度削减的高度敏感性,尤其体现在算术计算和代码合成中的括号平衡等关键算法能力的退化。论文提出的关键解决方案是基于自生成响应的监督微调(Supervised Fine-Tuning with Self-Generated Responses),该策略在无预训练规模数据与计算资源的后训练约束下,有效恢复了分类任务性能(可达基线90%),并在生成式基准上实现20–30个百分点的显著提升,但其对生成式推理的恢复能力仍存在根本性局限,仅在较低剪枝比例下可行。
链接: https://arxiv.org/abs/2602.01997
作者: Safal Shrestha,Anubhav Shrestha,Aadim Nepal,Minwu Kim,Keith Ross
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent works have shown that layer pruning can compress large language models (LLMs) while retaining strong performance on classification benchmarks with little or no finetuning. However, existing pruning techniques often suffer severe degradation on generative reasoning tasks. Through a systematic study across multiple model families, we find that tasks requiring multi-step reasoning are particularly sensitive to depth reduction. Beyond surface-level text degeneration, we observe degradation of critical algorithmic capabilities, including arithmetic computation for mathematical reasoning and balanced parenthesis generation for code synthesis. Under realistic post-training constraints, without access to pretraining-scale data or compute, we evaluate a simple mitigation strategy based on supervised finetuning with Self-Generated Responses. This approach achieves strong recovery on classification tasks, retaining up to 90% of baseline performance, and yields substantial gains of up to 20–30 percentage points on generative benchmarks compared to prior post-pruning techniques. Crucially, despite these gains, recovery for generative reasoning remains fundamentally limited relative to classification tasks and is viable primarily at lower pruning ratios. Overall, we characterize the practical limits of layer pruning for generative reasoning and provide guidance on when depth reduction can be applied effectively under constrained post-training regimes.
zh
[AI-62] Optimizing Tensor Train Decomposition in DNNs for RISC-V Architectures Using Design Space Exploration and Compiler Optimizations
【速读】:该论文旨在解决在资源受限的RISC-V平台部署深度神经网络(Deep Neural Networks, DNNs)时,全连接(Fully Connected, FC)层因高计算和内存开销而导致的效率瓶颈问题。其核心解决方案是提出一种端到端的低秩分解(Low-rank Factorization, LRF)设计空间探索方法,并结合Tensor Train Decomposition(TTD)技术对FC层进行压缩优化。关键在于通过排除低效的分解结构和推理性能差的方案来显著缩小LRF设计空间,同时利用编译器优化提升定制化T3F层在RISC-V架构上的执行效率,最终实现平均比IREE快3倍、比Pluto快8倍的推理加速效果,从而有效支持DNN在边缘与嵌入式设备上的高效部署。
链接: https://arxiv.org/abs/2602.01996
作者: Theologos Anthimopoulos,Milad Kokhazadeh,Vasilios Kelefouras,Benjamin Himpel,Georgios Keramidas
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Mathematical Software (cs.MS)
备注: 36 pages, 16 figures, this is the author-accepted version of the article published in ACM Transactions on Embedded Computing Systems (TECS), Vol. 24, No. 6
Abstract:Deep neural networks (DNNs) have become indispensable in many real-life applications like natural language processing, and autonomous systems. However, deploying DNNs on resource-constrained devices, e.g., in RISC-V platforms, remains challenging due to the high computational and memory demands of fully connected (FC) layers, which dominate resource consumption. Low-rank factorization (LRF) offers an effective approach to compressing FC layers, but the vast design space of LRF solutions involves complex trade-offs among FLOPs, memory size, inference time, and accuracy, making the LRF process complex and time-consuming. This paper introduces an end-to-end LRF design space exploration methodology and a specialized design tool for optimizing FC layers on RISC-V processors. Using Tensor Train Decomposition (TTD) offered by TensorFlow T3F library, the proposed work prunes the LRF design space by excluding first, inefficient decomposition shapes and second, solutions with poor inference performance on RISC-V architectures. Compiler optimizations are then applied to enhance custom T3F layer performance, minimizing inference time and boosting computational efficiency. On average, our TT-decomposed layers run 3x faster than IREE and 8x faster than Pluto on the same compressed model. This work provides an efficient solution for deploying DNNs on edge and embedded devices powered by RISC-V architectures.
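下面用一小段 Python 说明 TT 分解为何能大幅压缩 FC 层,以及“分解形状 × TT 秩”构成的设计空间:同一个 1024×1024 的层在不同 TT 秩下参数量差异巨大。核的形状约定取自标准 TT-matrix 记法,数值仅为示意,与论文的具体剪枝策略无关。

```python
def tt_layer_params(in_modes, out_modes, ranks):
    """TT-matrix 分解 FC 层的参数量:权重被表示为一串核 G_k,
    形状为 (r_{k-1}, in_modes[k], out_modes[k], r_k),且 r_0 = r_d = 1。"""
    r = [1] + list(ranks) + [1]
    return sum(r[k] * in_modes[k] * out_modes[k] * r[k + 1]
               for k in range(len(in_modes)))

in_m, out_m = (4, 8, 8, 4), (4, 8, 8, 4)    # 1024 = 4*8*8*4 的一种分解形状
dense = 1024 * 1024                          # 密集 FC 层约 1.05M 参数
for rank in (2, 4, 8):
    p = tt_layer_params(in_m, out_m, [rank] * 3)
    print(f"TT-rank {rank}: {p} 参数,压缩率约 {dense / p:.0f}x")
```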
zh
[AI-63] Thinking Like a Doctor: Conversational Diagnosis through the Exploration of Diagnostic Knowledge Graphs
【速读】:该论文旨在解决当前对话式诊断系统在实际临床场景中面临的两大局限性:一是依赖模型参数化知识,难以应对信息不完整的情况;二是假设患者能提供详尽且具体的症状描述,这与现实临床实践中早期问诊阶段患者表述模糊的实际情况不符。解决方案的关键在于构建一个基于诊断知识图谱(diagnostic knowledge graph)的两步推理机制:首先根据对话上下文生成初步诊断假设,随后通过生成澄清性问题对假设进行验证,并迭代执行该过程直至达成最终诊断。该方法有效提升了诊断准确性和效率,同时借助MIMIC-IV数据集中的患者画像对模拟器进行适应性改造,使其能够更真实地反映早期临床问诊中症状表述不清的特点,从而增强系统的临床实用性。
链接: https://arxiv.org/abs/2602.01995
作者: Jeongmoon Won,Seungwon Kook,Yohan Jo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Conversational diagnosis requires multi-turn history-taking, where an agent asks clarifying questions to refine differential diagnoses under incomplete information. Existing approaches often rely on the parametric knowledge of a model or assume that patients provide rich and concrete information, which is unrealistic. To address these limitations, we propose a conversational diagnosis system that explores a diagnostic knowledge graph to reason in two steps: (i) generating diagnostic hypotheses from the dialogue context, and (ii) verifying hypotheses through clarifying questions, which are repeated until a final diagnosis is reached. Since evaluating the system requires a realistic patient simulator that responds to the system’s questions, we adopt a well-established simulator along with patient profiles from MIMIC-IV. We further adapt it to describe symptoms vaguely to reflect real-world patients during early clinical encounters. Experiments show improved diagnostic accuracy and efficiency over strong baselines, and evaluations by physicians support the realism of our simulator and the clinical utility of the generated questions. Our code will be released upon publication.
zh
[AI-64] Emergent Analogical Reasoning in Transformers
【速读】:该论文试图解决的问题是:Transformer模型在生成式 AI(Generative AI)中如何习得并实现类比推理(analogical reasoning)这一人类智能的核心能力,以及其内在机制尚不明确。解决方案的关键在于,基于范畴论中的函子(functor)概念,将类比推理形式化为跨类别实体间对应关系的推断,并通过设计合成任务在受控环境中评估类比推理的涌现现象;进一步通过机制分析发现,类比推理可分解为两个核心组件:(1) 嵌入空间中关系结构的几何对齐,以及 (2) Transformer内部对函子操作的应用,从而实现了从一个类别到另一个类别的关系结构迁移,使类比成为可量化、可解释的神经网络机制。
链接: https://arxiv.org/abs/2602.01992
作者: Gouki Minegishi,Jingyuan Feng,Hiroki Furuta,Takeshi Kojima,Yusuke Iwasawa,Yutaka Matsuo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Analogy is a central faculty of human intelligence, enabling abstract patterns discovered in one domain to be applied to another. Despite its central role in cognition, the mechanisms by which Transformers acquire and implement analogical reasoning remain poorly understood. In this work, inspired by the notion of functors in category theory, we formalize analogical reasoning as the inference of correspondences between entities across categories. Based on this formulation, we introduce synthetic tasks that evaluate the emergence of analogical reasoning under controlled settings. We find that the emergence of analogical reasoning is highly sensitive to data characteristics, optimization choices, and model scale. Through mechanistic analysis, we show that analogical reasoning in Transformers decomposes into two key components: (1) geometric alignment of relational structure in the embedding space, and (2) the application of a functor within the Transformer. These mechanisms enable models to transfer relational structure from one category to another, realizing analogy. Finally, we quantify these effects and find that the same trends are observed in pretrained LLMs. In doing so, we move analogy from an abstract cognitive notion to a concrete, mechanistically grounded phenomenon in modern neural networks.
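下面用一个线性玩具模型示意文中的两个机制:当类别 B 的嵌入等于类别 A 的嵌入经同一线性映射(扮演“函子”)作用时,从少量已知对应对中即可用最小二乘恢复该映射,并迁移到新实体上完成类比(Python,数据随机生成,仅为概念演示)。

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_known = 8, 10
F = rng.standard_normal((d, d))               # 扮演“函子”的线性映射
A = rng.standard_normal((n_known + 1, d))     # 类别 A 的实体嵌入(含 1 个查询实体)
B = A @ F.T                                   # 类别 B 的嵌入 = F 作用于 A

# 几何对齐:从已知对应对中最小二乘估计函子
M, *_ = np.linalg.lstsq(A[:n_known], B[:n_known], rcond=None)
pred = A[-1] @ M                              # 函子应用:把查询实体映射到类别 B

sims = B @ pred / (np.linalg.norm(B, axis=1) * np.linalg.norm(pred))
print("检索到的对应实体索引:", int(sims.argmax()), "(期望:", n_known, ")")
```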
zh
[AI-65] SAME: Stabilized Mixture-of-Experts for Multimodal Continual Instruction Tuning
【速读】:该论文旨在解决多模态持续指令微调(Multimodal Continual Instruction Tuning, MCIT)中的两个核心问题:路由漂移(router drift) 和 专家漂移(expert drift)。其中,路由漂移指随着任务分布变化,专家选择变得不稳定(如原本应激活定位专家的接地查询被错误地路由至无关专家),而专家漂移则表现为共享专家在新任务学习过程中被覆盖,丧失原有功能。解决方案的关键在于提出 StAbilized Mixture-of-Experts (SAME),其核心机制包括:1)通过将路由动态分解到正交子空间并仅更新任务相关方向来稳定专家选择,缓解路由漂移;2)利用历史输入协方差进行曲率感知缩放,以无回放方式调节专家更新,抑制专家漂移;3)引入自适应专家激活策略,在训练中冻结已选专家,减少冗余计算与跨任务干扰。
链接: https://arxiv.org/abs/2602.01990
作者: Zhen-Hao Xie,Jun-Tao Tang,Yu-Cheng Shi,Han-Jia Ye,De-Chuan Zhan,Da-Wei Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to continually expand their capabilities, making Multimodal Continual Instruction Tuning (MCIT) essential. Recent methods leverage sparse expert routing to promote task specialization, but we find that the expert routing process suffers from drift as the data distribution evolves. For example, a grounding query that previously activated localization experts may instead be routed to irrelevant experts after learning OCR tasks. Meanwhile, the grounding-related experts can be overwritten by new tasks and lose their original functionality. Such failure reflects two problems: router drift, where expert selection becomes inconsistent over time, and expert drift, where shared experts are overwritten across tasks. Therefore, we propose StAbilized Mixture-of-Experts (SAME) for MCIT. To address router drift, SAME stabilizes expert selection by decomposing routing dynamics into orthogonal subspaces and updating only task-relevant directions. To mitigate expert drift, we regulate expert updates via curvature-aware scaling using historical input covariance in a rehearsal-free manner. SAME also introduces adaptive expert activation to freeze selected experts during training, reducing redundant computation and cross-task interference. Extensive experiments demonstrate its SOTA performance.
zh
[AI-66] Evolving from Tool User to Creator via Training-Free Experience Reuse in Multimodal Reasoning
【速读】:该论文旨在解决现有工具集成推理(Tool-Integrated Reasoning, TIR)模型在面对开放性问题时因固定工具集无法自适应调整、缺乏自我优化机制导致错误传播,以及工具构建高度依赖人工而限制可扩展性的核心挑战。其解决方案的关键在于提出一种无需训练的框架UCT(Unsupervised Tool Creation),通过挖掘大语言模型(LLM)推理轨迹中的隐式问题求解知识,将其提炼为可复用的工具资产,并引入记忆整合机制动态维护工具库,从而实现推理过程中工具的自主创建与自我更新,使代理系统具备持续进化能力,无需额外训练即可提升工具质量和整体推理性能。
链接: https://arxiv.org/abs/2602.01983
作者: Xintian Shen,Jiawei Chen,Lihao Zheng,Hao Ma,Tao Wei,Kun Zhan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Existing Tool-Integrated Reasoning (TIR) models have effectively extended the question-answering capabilities of LLMs by incorporating external tools. However, real-world scenarios present numerous open-ended problems where fixed tools often fail to meet task requirements. Furthermore, the lack of self-optimization mechanisms means that erroneous tool outputs can mislead the LLM’s responses. Additionally, the construction of existing tools entails significant manual effort, which consequently constrains their applicability. Recognizing that the reasoning traces of LLMs encapsulate implicit problem-solving capabilities, we propose UCT, a novel training-free framework that transforms agents from tool users to tool creators. UCT harvests reasoning experiences and distills them into reusable assets, enabling adaptive tool creation and self-updating during the inference process. We also introduce a memory consolidation mechanism to maintain the tool library, ensuring high reusability of retained experiential memory for subsequent reasoning tasks. This novel automated tool construction paradigm continuously improves tool quality during reasoning, allowing the overall agent system to progress without additional training. Extensive experiments demonstrate that our method serves as a novel paradigm for enhancing the capabilities of TIR models. In particular, significant performance gains of +20.86% and +23.04% on benchmarks across multi-domain mathematical and scientific reasoning tasks validate the self-evolving capability of the agent.
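下面用 Python 勾勒“从推理轨迹收获工具 + 记忆整合”的最小骨架:成功解题后把轨迹中的代码片段注册为命名工具,并按复用成功率定期淘汰低价值工具。类名与字段均为本文为说明而假设,并非论文的具体实现。

```python
import hashlib

class ToolLibrary:
    """从推理轨迹中收获工具并做记忆整合的最小骨架(示意)。"""
    def __init__(self):
        self.tools = {}   # name -> {"code": str, "success": int, "calls": int}

    def harvest(self, code, name=None):
        """成功解题后,把轨迹中的代码片段注册为可复用工具。"""
        key = name or "tool_" + hashlib.md5(code.encode()).hexdigest()[:8]
        self.tools.setdefault(key, {"code": code, "success": 0, "calls": 0})
        return key

    def record(self, name, ok):
        """每次复用后记录成败,作为整合依据。"""
        self.tools[name]["calls"] += 1
        self.tools[name]["success"] += int(ok)

    def consolidate(self, min_rate=0.5, min_calls=3):
        """记忆整合:淘汰调用足够多但成功率偏低的工具。"""
        self.tools = {k: v for k, v in self.tools.items()
                      if v["calls"] < min_calls
                      or v["success"] / v["calls"] >= min_rate}

lib = ToolLibrary()
k = lib.harvest("def circle_area(r):\n    return 3.141592653589793 * r * r")
lib.record(k, ok=True)
print(k, lib.tools[k]["success"], "/", lib.tools[k]["calls"])
```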
zh
[AI-67] IntraSlice: Towards High-Performance Structural Pruning with Block-Intra PCA for LLMs
[Quick Read]: This paper targets the efficiency problems caused by the massive size of Large Language Models (LLMs) at deployment time, in particular the significant performance degradation that structured pruning introduces during compression. Existing PCA-based pruning methods alleviate the loss but apply only between modules, where the transformation matrices must be fused, introducing extra parameters and disrupting activation distributions, especially under residual connections. The key of the proposed IntraSlice framework is block-wise module-intra PCA compression pruning: exploiting the structural characteristics of Transformer modules, it designs an approximate PCA whose transformation matrices can be fully fused into the model without additional parameters, and it adds a PCA-based global pruning-ratio estimator that accounts for the distribution of compressed activations, characterizing module importance more precisely. Experiments on the Llama2, Llama3, and Phi families show better performance than current strong baselines at the same compression ratio or inference speed.
Link: https://arxiv.org/abs/2602.01975
Authors: Meng Li,Peisong Wang,Yuantian Shao,Qinghao Hu,Hongjian Fang,Yifan Zhang,Zhihui Wei,Jian Cheng
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) achieve strong performance across diverse tasks but face deployment challenges due to their massive size. Structured pruning offers acceleration benefits but leads to significant performance degradation. Recent PCA-based pruning methods have alleviated this issue by retaining key activation components, but are only applied between modules in order to fuse the transformation matrix, which introduces extra parameters and severely disrupts activation distributions due to residual connections. To address these issues, we propose IntraSlice, a framework that applies block-wise module-intra PCA compression pruning. By leveraging the structural characteristics of Transformer modules, we design an approximate PCA method whose transformation matrices can be fully fused into the model without additional parameters. We also introduce a PCA-based global pruning ratio estimator that further considers the distribution of compressed activations, building on conventional module importance. We validate our method on Llama2, Llama3, and Phi series across various language benchmarks. Experimental results demonstrate that our approach achieves superior compression performance compared to recent baselines at the same compression ratio or inference speed.
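The fusion idea behind PCA pruning fits in a few lines. The sketch below is a generic illustration, not the paper's block-intra variant; all names and the calibration setup are hypothetical. One linear layer is compressed by projecting its outputs onto the top-k principal directions of calibration activations, and the projection pair is fused into the surrounding weights so no extra transform survives at inference:

```python
import numpy as np

def pca_compress_linear(W, X, k):
    """Compress a linear layer y = x @ W by projecting its outputs onto the
    top-k principal directions of calibration outputs Y = X @ W (uncentered,
    i.e. directions of largest second moment)."""
    Y = X @ W
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    P = Vt[:k].T            # (d_out, k) projection onto the kept subspace
    W_down = W @ P          # (d_in, k): the pruned, narrower layer
    W_up = P.T              # (k, d_out): back-projection, fused into what follows
    return W_down, W_up

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 128))
X = rng.normal(size=(64, 128))          # calibration activations
W_down, W_up = pca_compress_linear(W, X, k=32)
err = np.linalg.norm(X @ W - X @ W_down @ W_up) / np.linalg.norm(X @ W)
print(f"relative output error at rank 32: {err:.3f}")
```

Because `W_up` is a fixed matrix, it can be multiplied into the next layer's weights offline, which is the sense in which such transforms are "fully fused" without new parameters.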
zh
[AI-68] Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models
[Quick Read]: This paper tackles the high computational cost of using reinforcement learning (RL) to improve the reasoning ability of large language models (LLMs), which stems from rollout-intensive optimization. Existing online prompt selection methods either depend on expensive exact evaluations or build prompt-specific predictive models that do not generalize across prompts. The key of the proposed Generalizable Predictive Prompt Selection (GPS) is to perform Bayesian inference over prompt difficulty with a lightweight generative model trained on the shared optimization history, and to guide batch prompt selection with intermediate-difficulty prioritization and history-anchored diversity, which improves training efficiency and enables efficient compute allocation at test time.
Link: https://arxiv.org/abs/2602.01970
Authors: Yun Qu,Qi Wang,Yixiu Mao,Heming Zou,Yuhang Jiang,Weijie Liu,Clive Bai,Kai Yang,Yangkun Chen,Saiyong Yang,Xiangyang Ji
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Reinforcement learning enhances the reasoning capabilities of large language models but often involves high computational costs due to rollout-intensive optimization. Online prompt selection presents a plausible solution by prioritizing informative prompts to improve training efficiency. However, current methods either depend on costly, exact evaluations or construct prompt-specific predictive models lacking generalization across prompts. This study introduces Generalizable Predictive Prompt Selection (GPS), which performs Bayesian inference towards prompt difficulty using a lightweight generative model trained on the shared optimization history. Intermediate-difficulty prioritization and history-anchored diversity are incorporated into the batch acquisition principle to select informative prompt batches. The small predictive model also generalizes at test-time for efficient computational allocation. Experiments across varied reasoning benchmarks indicate GPS’s substantial improvements in training efficiency, final performance, and test-time efficiency over strong baseline methods.
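The acquisition principle, prioritize intermediate difficulty while staying diverse relative to recent history, is easy to prototype. A minimal sketch, assuming a difficulty predictor already exists; the scoring formula and all names are illustrative assumptions, not the paper's:

```python
import numpy as np

def select_prompt_batch(pred_difficulty, history_embs, prompt_embs,
                        batch_size, lam=0.5):
    """Pick a batch of prompts scored by (i) predicted intermediate
    difficulty and (ii) diversity w.r.t. recently used prompts.

    pred_difficulty: predicted failure probability per prompt, in [0, 1].
    """
    # intermediate-difficulty prioritization: score peaks near 0.5
    informativeness = 1.0 - np.abs(pred_difficulty - 0.5) * 2.0
    # history-anchored diversity: distance to the nearest recent prompt
    d = np.linalg.norm(prompt_embs[:, None, :] - history_embs[None, :, :], axis=-1)
    diversity = d.min(axis=1)
    diversity = diversity / (diversity.max() + 1e-8)
    score = informativeness + lam * diversity
    return np.argsort(-score)[:batch_size]

rng = np.random.default_rng(1)
idx = select_prompt_batch(rng.uniform(size=100),
                          rng.normal(size=(20, 16)),   # recent prompt embeddings
                          rng.normal(size=(100, 16)),  # candidate embeddings
                          batch_size=8)
print("selected prompt indices:", idx)
```

Prompts the model almost always solves or almost always fails carry little gradient signal, which is why the score peaks at a predicted difficulty of 0.5.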
zh
[AI-69] Zero-Shot Off-Policy Learning
[Quick Read]: This paper addresses the core challenge of **off-policy learning** under **zero-shot adaptation**: deriving an optimal policy directly from a fixed dataset of prior interactions while coping with distributional shift and value-function overestimation bias. The key of the solution is a theoretical connection between **successor measures** and **stationary density ratios**, which lets the algorithm infer optimal importance-sampling ratios online and perform a stationary distribution correction for any task on the fly, adapting quickly to new tasks without further training. The method integrates seamlessly into forward-backward representation frameworks and noticeably strengthens the generalization and efficiency of zero-shot reinforcement learning.
Link: https://arxiv.org/abs/2602.01962
Authors: Arip Asadulaev,Maksim Bobrin,Salem Lahlou,Dmitry Dylov,Fakhri Karray,Martin Takac
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Off-policy learning methods seek to derive an optimal policy directly from a fixed dataset of prior interactions. This objective presents significant challenges, primarily due to the inherent distributional shift and value function overestimation bias. These issues become even more noticeable in zero-shot reinforcement learning, where an agent trained on reward-free data must adapt to new tasks at test time without additional training. In this work, we address the off-policy problem in a zero-shot setting by discovering a theoretical connection of successor measures to stationary density ratios. Using this insight, our algorithm can infer optimal importance sampling ratios, effectively performing a stationary distribution correction with an optimal policy for any task on the fly. We benchmark our method in motion tracking tasks on SMPL Humanoid, continuous control on ExoRL, and for the long-horizon OGBench tasks. Our technique seamlessly integrates into forward-backward representation frameworks and enables fast-adaptation to new tasks in a training-free regime. More broadly, this work bridges off-policy learning and zero-shot adaptation, offering benefits to both research areas.
zh
[AI-70] Efficient Epistemic Uncertainty Estimation for Large Language Models via Knowledge Distillation
[Quick Read]: This paper addresses the difficulty of quantifying epistemic uncertainty (EU) in large language models (LLMs), which is needed to mitigate hallucinations and support risk-aware deployment in safety-critical tasks. Deep-ensemble approaches estimate EU well but are computationally prohibitive at modern model scale. The key of the solution is to use small draft models to estimate token-level EU efficiently, grounded in a bias-variance decomposition: the Jensen-Shannon divergence among drafts serves as a variance proxy, and the KL divergence between the draft mixture and the target model serves as a bias proxy. Online Stochastic Distillation (OSD) approximates target aggregation at low cost, and a Data-Diverse Drafts (DDD) strategy increases draft diversity for a better target approximation. On GSM8K the method reduces estimation error (RMSE) by up to 37%, matches heavy perturbation-based methods such as TokUR on hallucination detection, and keeps inference overhead negligible, making uncertainty-aware deployment practical.
Link: https://arxiv.org/abs/2602.01956
Authors: Seonghyeon Park,Jewon Yeom,Jaewon Sok,Jeongjae Park,Heejun Kim,Taesup Kim
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Quantifying uncertainty in Large Language Models (LLMs) is essential for mitigating hallucinations and enabling risk-aware deployment in safety-critical tasks. However, estimating Epistemic Uncertainty(EU) via Deep Ensembles is computationally prohibitive at the scale of modern models. We propose a framework that leverages the small draft models to efficiently estimate token-level EU, bypassing the need for full-scale ensembling. Theoretically grounded in a Bias-Variance Decomposition, our approach approximates EU via Jensen-Shannon divergence among drafts (variance proxy) and KL divergence between the draft mixture and the target (bias proxy). To further ensure accuracy without significant overhead, we introduce Online Stochastic Distillation (OSD) to efficiently approximate target aggregation and the Data-Diverse Drafts (DDD) strategy to enhance draft diversity for better target approximation. Extensive experiments on GSM8K demonstrate that our method reduces the estimation error (RMSE) by up to 37% compared to baselines. Crucially, our approach achieves Hallucination Detection performance competitive with heavy perturbation-based methods like TokUR while incurring negligible inference costs, offering a practical solution for uncertainty-aware LLM deployment.
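Both proxies are directly computable from next-token distributions. A minimal numpy sketch of the combined estimate, assuming equal weighting of the variance and bias terms (the paper's exact combination may differ, and all names are hypothetical):

```python
import numpy as np

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum(-1)

def epistemic_uncertainty_proxy(draft_probs, target_probs):
    """Token-level EU proxy from K small drafts and one target model.

    Variance proxy: generalized Jensen-Shannon divergence among the drafts,
    H(mixture) - mean H(draft). Bias proxy: KL(draft mixture || target).
    draft_probs: (K, V) next-token distributions from K drafts.
    target_probs: (V,) next-token distribution from the target model.
    """
    mix = draft_probs.mean(axis=0)                       # draft mixture
    jsd = entropy(mix) - entropy(draft_probs).mean()     # disagreement among drafts
    kl_bias = (mix * (np.log(mix + 1e-12) - np.log(target_probs + 1e-12))).sum()
    return jsd + kl_bias

rng = np.random.default_rng(2)
drafts = rng.dirichlet(np.ones(50), size=4)              # 4 drafts, vocab of 50
target = rng.dirichlet(np.ones(50))
print(f"EU proxy: {epistemic_uncertainty_proxy(drafts, target):.4f}")
```

The appeal of the decomposition is that neither term requires sampling multiple copies of the large model: the drafts supply the ensemble-style disagreement, and a single target forward pass anchors the bias term.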
zh
[AI-71] Human Society-Inspired Approaches to Agentic AI Security: The 4C Framework
[Quick Read]: This paper addresses the gap in current AI-security research around identifying and defending against the risks of autonomous, agentic AI systems: traditional system-centric security approaches struggle with the new challenges created by autonomy, interaction, and emergent behavior. The key of the solution is the proposed 4C Framework for multi-agent AI security, which organizes and manages risk along four interdependent dimensions: Core (system, infrastructure, and environmental integrity), Connection (communication, coordination, and trust), Cognition (belief, goal, and reasoning integrity), and Compliance (ethical, legal, and institutional governance). By shifting the focus of security from protecting individual systems to preserving behavioral integrity and intent alignment, the framework offers a principled foundation for building agentic AI systems that are trustworthy, governable, and aligned with human values.
Link: https://arxiv.org/abs/2602.01942
Authors: Alsharif Abuadbba,Nazatul Sultan,Surya Nepal,Sanjay Jha
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 10 pages
Abstract:AI is moving from domain-specific autonomy in closed, predictable settings to large-language-model-driven agents that plan and act in open, cross-organizational environments. As a result, the cybersecurity risk landscape is changing in fundamental ways. Agentic AI systems can plan, act, collaborate, and persist over time, functioning as participants in complex socio-technical ecosystems rather than as isolated software components. Although recent work has strengthened defenses against model and pipeline level vulnerabilities such as prompt injection, data poisoning, and tool misuse, these system centric approaches may fail to capture risks that arise from autonomy, interaction, and emergent behavior. This article introduces the 4C Framework for multi-agent AI security, inspired by societal governance. It organizes agentic risks across four interdependent dimensions: Core (system, infrastructure, and environmental integrity), Connection (communication, coordination, and trust), Cognition (belief, goal, and reasoning integrity), and Compliance (ethical, legal, and institutional governance). By shifting AI security from a narrow focus on system-centric protection to the broader preservation of behavioral integrity and intent, the framework complements existing AI security strategies and offers a principled foundation for building agentic AI systems that are trustworthy, governable, and aligned with human values.
zh
[AI-72] Towards Exploratory and Focused Manipulation with Bimanual Active Perception: A New Problem, Benchmark and Strategy ICRA2026
[Quick Read]: This paper addresses the information loss that visual occlusion causes when robots perform complex manipulation tasks, identifying its essence as the absence of the information needed to complete the task. It formulates the more fundamental problem of Exploratory and Focused Manipulation (EFM), which emphasizes actively gathering key information through perception to support execution. The key of the solution is a Bimanual Active Perception (BAP) strategy that uses one arm for active vision and the other for force sensing while manipulating, so the two cooperate to complete the task. To validate the strategy, the authors build the EFM-10 benchmark and the accompanying BAPData dataset, and verify BAP's effectiveness under imitation learning, showing that it clearly improves information acquisition and task success on challenging manipulation tasks.
Link: https://arxiv.org/abs/2602.01939
Authors: Yuxin He,Ruihao Zhang,Tianao Shen,Cheng Liu,Qiang Nie
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: ICRA 2026
Abstract:Recently, active vision has reemerged as an important concept for manipulation, since visual occlusion occurs more frequently when main cameras are mounted on the robot heads. We reflect on the visual occlusion issue and identify its essence as the absence of information useful for task completion. Inspired by this, we come up with the more fundamental problem of Exploratory and Focused Manipulation (EFM). The proposed problem is about actively collecting information to complete challenging manipulation tasks that require exploration or focus. As an initial attempt to address this problem, we establish the EFM-10 benchmark that consists of 4 categories of tasks that align with our definition (10 tasks in total). We further come up with a Bimanual Active Perception (BAP) strategy, which leverages one arm to provide active vision and another arm to provide force sensing while manipulating. Based on this idea, we collect a dataset named BAPData for the tasks in EFM-10. With the dataset, we successfully verify the effectiveness of the BAP strategy in an imitation learning manner. We hope that the EFM-10 benchmark along with the BAP strategy can become a cornerstone that facilitates future research towards this direction. Project website: this http URL.
zh
[AI-73] T-LLM: Teaching Large Language Models to Forecast Time Series via Temporal Distillation
[Quick Read]: This paper studies how to endow general-purpose large language models (LLMs) with effective time-series forecasting capability, given that time series are temporally dependent and can only accumulate irreversibly as real-world time progresses, which limits purely scale-driven pretraining. The key of the proposed T-LLM framework is a temporal distillation mechanism: during training, a lightweight temporal teacher that combines trend modeling and frequency-domain analysis provides structured temporal supervision and teaches the LLM predictive behavior; at inference the teacher is removed entirely, leaving the LLM as a standalone forecaster that is both efficient and high-performing.
Link: https://arxiv.org/abs/2602.01937
Authors: Suhan Guo,Bingxu Wang,Shaodan Zhang,Furao Shen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Time series forecasting plays a critical role in decision-making across many real-world applications. Unlike data in vision and language domains, time series data is inherently tied to the evolution of underlying processes and can only accumulate as real-world time progresses, limiting the effectiveness of scale-driven pretraining alone. This time-bound constraint poses a challenge for enabling large language models (LLMs) to acquire forecasting capability, as existing approaches primarily rely on representation-level alignment or inference-time temporal modules rather than explicitly teaching forecasting behavior to the LLM. We propose T-LLM, a temporal distillation framework that equips general-purpose LLMs with time series forecasting capability by transferring predictive behavior from a lightweight temporal teacher during training. The teacher combines trend modeling and frequency-domain analysis to provide structured temporal supervision, and is removed entirely at inference, leaving the LLM as the sole forecasting model. Experiments on benchmark datasets and infectious disease forecasting tasks demonstrate that T-LLM consistently outperforms existing LLM-based forecasting methods under full-shot, few-shot, and zero-shot settings, while enabling a simple and efficient deployment pipeline.
zh
[AI-74] PIMCST: Physics-Informed Multi-Phase Consensus and Spatio-Temporal Few-Shot Learning for Traffic Flow Forecasting
[Quick Read]: This paper addresses traffic-flow forecasting in cross-domain, data-scarce settings for intelligent transportation systems: with limited historical data, it is hard to model the complex spatio-temporal dependencies and nonlinear dynamics across cities and to forecast accurately in the few-shot regime. The key of the proposed MCPST (Multi-phase Consensus Spatio-Temporal) framework lies in three core innovations: (1) a multi-phase engine that characterizes traffic dynamics through diffusion, synchronisation, and spectral embeddings; (2) an adaptive consensus mechanism that dynamically fuses phase-specific predictions while enforcing consistency; and (3) a structured meta-learning strategy that lets the model adapt rapidly to new cities with minimal data. The method comes with theoretical guarantees, including bounded approximation errors and generalisation bounds, and on four real-world datasets it clearly outperforms fourteen state-of-the-art methods, combining accuracy, reduced data requirements, and interpretability.
Link: https://arxiv.org/abs/2602.01936
Authors: Abdul Joseph Fofanah,Lian Wen,David Chen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Accurate traffic flow prediction remains a fundamental challenge in intelligent transportation systems, particularly in cross-domain, data-scarce scenarios where limited historical data hinders model training and generalisation. The complex spatio-temporal dependencies and nonlinear dynamics of urban mobility networks further complicate few-shot learning across different cities. This paper proposes MCPST, a novel Multi-phase Consensus Spatio-Temporal framework for few-shot traffic forecasting that reconceptualises traffic prediction as a multi-phase consensus learning problem. Our framework introduces three core innovations: (1) a multi-phase engine that models traffic dynamics through diffusion, synchronisation, and spectral embeddings for comprehensive dynamic characterisation; (2) an adaptive consensus mechanism that dynamically fuses phase-specific predictions while enforcing consistency; and (3) a structured meta-learning strategy for rapid adaptation to new cities with minimal data. We establish extensive theoretical guarantees, including representation theorems with bounded approximation errors and generalisation bounds for few-shot adaptation. Through experiments on four real-world datasets, MCPST outperforms fourteen state-of-the-art methods spanning spatio-temporal graph learning, dynamic graph transfer learning, prompt-based spatio-temporal prediction, and cross-domain few-shot settings, improving prediction accuracy while reducing required training data and providing interpretable insights. The implementation code is available at this https URL.
zh
[AI-75] COLT: Lightweight Multi-LLM Collaboration through Shared MCTS Reasoning for Model Compilation
[Quick Read]: This paper addresses the scalability problem that generative AI faces in compiler optimization due to the high cost of model serving, specifically how to achieve strong compiler optimization without relying on a single large language model (LLM). The core tension: small LLMs alone are not reliable enough, while continually querying a large LLM is too expensive. The key of the proposed lightweight multi-LLM collaboration framework COLT is to coordinate the reasoning of multiple models within a single Monte Carlo tree search (MCTS) process, using one shared MCTS tree to reuse transformation prefixes and propagate values across models, thereby avoiding the external planners, multiple concurrent LLMs, databases, and external memory that conventional agentic approaches require. A model-aware tree policy biases the search toward smaller models while preserving exploration, and a course-alteration mechanism escalates to the largest model when persistent regressions attributable to smaller models are detected, achieving efficient and stable collaborative optimization.
Link: https://arxiv.org/abs/2602.01935
Authors: Annabelle Sujun Tang,Christopher Priebe,Lianhui Qin,Hadi Esmaeilzadeh
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments:
Abstract:Model serving costs dominate AI systems, making compiler optimization essential for scalable deployment. Recent works show that a large language model (LLM) can guide compiler search by reasoning over program structure and optimization history. However, using a single large model throughout the search is expensive, while smaller models are less reliable when used alone. Thus, this paper seeks to answer whether multi-LLM collaborative reasoning relying primarily on small LLMs can match or exceed the performance of a single large model. As such, we propose a lightweight collaborative multi-LLM framework, dubbed COLT, for compiler optimization that enables coordinated reasoning across multiple models within a single Monte Carlo tree search (MCTS) process. A key contribution is the use of a single shared MCTS tree as the collaboration substrate across LLMs, enabling the reuse of transformation prefixes and cross-model value propagation. Hence, we circumvent both heavy internal reasoning mechanisms and conventional agentic machinery that relies on external planners, multiple concurrent LLMs, databases, external memory/versioning of intermediate results, and controllers by simply endogenizing model selection within the lightweight MCTS optimization loop. Every iteration, the acting LLM proposes a joint action: (compiler transformation, model to be queried next). We also introduce a model-aware tree policy that biases search toward smaller models while preserving exploration, and a course-alteration mechanism that escalates to the largest model when the search exhibits persistent regressions attributable to smaller models.
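The model-aware tree policy can be pictured as a UCT score with a cost penalty. A toy sketch under that assumption; the constants and the penalty form are illustrative, not the paper's:

```python
import math

def model_aware_uct(child_value, child_visits, parent_visits,
                    model_cost, c_explore=1.4, c_cost=0.3):
    """UCT-style score for a joint action (transformation, next model).

    Standard exploitation + exploration terms, plus a penalty on the
    relative cost of the model that would be queried next, biasing the
    search toward smaller models while preserving exploration."""
    if child_visits == 0:
        return float("inf")           # always try unvisited actions once
    exploit = child_value / child_visits
    explore = c_explore * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore - c_cost * model_cost

# toy usage: identical statistics, different model sizes
small = model_aware_uct(3.0, 5, 40, model_cost=0.1)
large = model_aware_uct(3.0, 5, 40, model_cost=1.0)
print(f"small-model action: {small:.3f}, large-model action: {large:.3f}")
```

Because the model choice is just another dimension of the action, the same tree statistics that guide transformation selection also decide when escalating to a bigger model is worth its price.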
zh
[AI-76] Large Language Model and Formal Concept Analysis: a comparative study for Topic Modeling
[Quick Read]: This paper addresses the lack of a practical comparison between large language models (LLMs) and Formal Concept Analysis (FCA) for topic modeling, focusing on their respective strengths and weaknesses in topic generation, merging, and labeling. The key of the solution is twofold: the established CREA pipeline is used to evaluate FCA for topic extraction and visualization, while a zero-shot strategy based on three prompts lets GPT-5 generate topics from document batches, merge batch results into final topics, and label the topics, enabling a systematic comparison of the two approaches in realistic settings.
Link: https://arxiv.org/abs/2602.01933
Authors: Fabrice Boissier(CRI),Monica Sen(UP1 UFR27),Irina Rychkova(CRI)
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Topic modeling is a research field with a growing range of applications: historically document retrieval, and more recently sentiment analysis and text summarization. Large Language Models (LLM) are currently a major trend in text processing, but few works study their usefulness for this task. Formal Concept Analysis (FCA) has recently been presented as a candidate for topic modeling, but no real applied case study has been conducted. In this work, we compare LLM and FCA to better understand their strengths and weaknesses in the topic modeling field. FCA is evaluated through the CREA pipeline used in past experiments on topic modeling and visualization, whereas GPT-5 is used for the LLM. A strategy based on three prompts is applied with GPT-5 in a zero-shot setup: topic generation from document batches, merging of batch results into final topics, and topic labeling. A first experiment reuses the teaching materials previously used to evaluate CREA, while a second experiment analyzes 40 research articles in information systems to compare the extracted topics with the underlying subfields.
zh
[AI-77] PIMPC-GNN: Physics-Informed Multi-Phase Consensus Learning for Enhancing Imbalanced Node Classification in Graph Neural Networks
[Quick Read]: This paper targets the performance bottleneck of graph neural networks (GNNs) under class imbalance, where minority-class samples are severely under-represented and predictions are biased toward the majority. The key of the proposed Physics-Informed Multi-phase Consensus framework (PIMPC-GNN) is the coordination of three complementary dynamics: (i) thermodynamic diffusion, which spreads minority labels and captures long-range dependencies; (ii) Kuramoto synchronisation, which aligns minority nodes through oscillatory consensus; and (iii) spectral embedding, which separates classes via structural regularisation. These are fused through class-adaptive ensemble weighting and trained jointly with an imbalance-aware loss, markedly improving minority-class recall and balanced accuracy.
Link: https://arxiv.org/abs/2602.01920
Authors: Abdul Joseph Fofanah,Lian Wen,David Chen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Graph neural networks (GNNs) often struggle in class-imbalanced settings, where minority classes are under-represented and predictions are biased toward majorities. We propose PIMPC-GNN, a physics-informed multi-phase consensus framework for imbalanced node classification. Our method integrates three complementary dynamics: (i) thermodynamic diffusion, which spreads minority labels to capture long-range dependencies, (ii) Kuramoto synchronisation, which aligns minority nodes through oscillatory consensus, and (iii) spectral embedding, which separates classes via structural regularisation. These perspectives are combined through class-adaptive ensemble weighting and trained with an imbalance-aware loss that couples balanced cross-entropy with physics-based constraints. Across five benchmark datasets and imbalance ratios from 5-100, PIMPC-GNN outperforms 16 state-of-the-art baselines, achieving notable gains in minority-class recall (up to +12.7%) and balanced accuracy (up to +8.3%). Beyond empirical improvements, the framework also provides interpretable insights into consensus dynamics in graph learning. The code is available at this https URL.
zh
[AI-78] VLM-Guided Experience Replay
[Quick Read]: This paper addresses the inefficient use of the replay buffer in reinforcement learning (RL): how to filter and reuse an agent's experience more effectively to improve sample efficiency and task success. The key of the solution is to use a frozen, pre-trained vision-language model (VLM), with no fine-tuning, as an automated evaluator that prioritizes the sub-trajectories the agent has collected, so high-value experience is preferentially stored and replayed. Across game-playing and robotics scenarios, the method raises average success rates by 11-52% and improves sample efficiency by 19-45%.
Link: https://arxiv.org/abs/2602.01915
Authors: Elad Sharony,Tom Jurgenson,Orr Krupnik,Dotan Di Castro,Shie Mannor
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent advances in Large Language Models (LLMs) and Vision-Language Models (VLMs) have enabled powerful semantic and multimodal reasoning capabilities, creating new opportunities to enhance sample efficiency, high-level planning, and interpretability in reinforcement learning (RL). While prior work has integrated LLMs and VLMs into various components of RL, the replay buffer, a core component for storing and reusing experiences, remains unexplored. We propose addressing this gap by leveraging VLMs to guide the prioritization of experiences in the replay buffer. Our key idea is to use a frozen, pre-trained VLM (requiring no fine-tuning) as an automated evaluator to identify and prioritize promising sub-trajectories from the agent’s experiences. Across scenarios, including game-playing and robotics, spanning both discrete and continuous domains, agents trained with our proposed prioritization method achieve 11-52% higher average success rates and improve sample efficiency by 19-45% compared to previous approaches. this https URL
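Mechanically, this reduces to a priority buffer fed by an external scorer. A minimal sketch with a placeholder standing in for the real VLM call; `fake_vlm_score` and the eviction and sampling rules are assumptions, not the paper's design:

```python
import heapq
import random

class VLMPrioritizedReplay:
    """Replay buffer whose priorities come from an external scorer, here a
    stand-in for a frozen, pre-trained VLM judging sub-trajectories."""

    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.items = []            # (priority, counter, sub_trajectory)
        self._counter = 0          # tie-breaker so tuples never compare trajectories

    def add(self, sub_trajectory, vlm_score):
        self._counter += 1
        heapq.heappush(self.items, (vlm_score, self._counter, sub_trajectory))
        if len(self.items) > self.capacity:
            heapq.heappop(self.items)          # evict the lowest-scored experience

    def sample(self, k):
        # priority-proportional sampling over retained sub-trajectories
        weights = [p for p, _, _ in self.items]
        return random.choices([t for _, _, t in self.items], weights=weights, k=k)

def fake_vlm_score(traj):
    # placeholder for a VLM prompt such as "rate the task progress in this clip"
    return random.random()

buf = VLMPrioritizedReplay()
for i in range(100):
    traj = [("obs", i), ("act", i)]
    buf.add(traj, fake_vlm_score(traj))
print("sampled batch size:", len(buf.sample(8)))
```

Because the VLM is frozen, the scorer adds no training cost of its own; the only new expense is one inference call per stored sub-trajectory.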
zh
[AI-79] DomusFM: A Foundation Model for Smart-Home Sensor Data
[Quick Read]: This paper addresses two core problems in activity and event analysis from smart-home sensor data: supervised models depend on large amounts of labeled data, which is costly to obtain and hard to scale, and existing foundation models (including generative ones) target inertial sensors and cannot handle the sparse, discrete, yet semantically rich events produced by binary smart-home sensors. The key of the solution is DomusFM, the first foundation model pretrained specifically for smart-home sensor data: it uses a self-supervised dual contrastive learning paradigm to capture both token-level semantic attributes and sequence-level temporal dependencies, and it combines semantic embeddings from a lightweight language model with specialized encoders for temporal patterns and binary states, learning representations that transfer across environments and tasks. Experiments show DomusFM outperforms state-of-the-art baselines even when fine-tuned with only 5% of the labeled data, balancing strong performance with practical deployability.
Link: https://arxiv.org/abs/2602.01910
Authors: Michele Fiori,Gabriele Civitarese,Flora D. Salim,Claudio Bettini
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Smart-home sensor data holds significant potential for several applications, including healthcare monitoring and assistive technologies. Existing approaches, however, face critical limitations. Supervised models require impractical amounts of labeled data. Foundation models for activity recognition focus only on inertial sensors, failing to address the unique characteristics of smart-home binary sensor events: their sparse, discrete nature combined with rich semantic associations. LLM-based approaches, while tested in this domain, still raise several issues regarding the need for natural language descriptions or prompting, and reliance on either external services or expensive hardware, making them infeasible in real-life scenarios due to privacy and cost concerns. We introduce DomusFM, the first foundation model specifically designed and pretrained for smart-home sensor data. DomusFM employs a self-supervised dual contrastive learning paradigm to capture both token-level semantic attributes and sequence-level temporal dependencies. By integrating semantic embeddings from a lightweight language model and specialized encoders for temporal patterns and binary states, DomusFM learns generalizable representations that transfer across environments and tasks related to activity and event analysis. Through leave-one-dataset-out evaluation across seven public smart-home datasets, we demonstrate that DomusFM outperforms state-of-the-art baselines on different downstream tasks, achieving superior performance even with only 5% of labeled training data available for fine-tuning. Our approach addresses data scarcity while maintaining practical deployability for real-world smart-home systems.
zh
[AI-80] Geometric Analysis of Token Selection in Multi-Head Attention
[Quick Read]: This paper addresses the lack of a quantitative account of what multi-head attention in large language models (LLMs) actually does, in particular its selection behavior and geometric separability in value-state space. The key of the solution is a geometric framework that views standard attention through a top-N selection lens, treating it as a structured classifier, and defines geometric metrics (Precision, Recall, and F-score) to quantify the separation between selected and non-selected tokens. Under empirically motivated assumptions (stable value norms with a compressed sink token, exponential similarity decay, and piecewise attention-weight profiles), the paper derives non-asymptotic bounds, identifies a small-N operating regime with the strongest non-trivial separability, and clarifies how sequence length and sink similarity shape the metrics. The theory is validated on LLaMA-2-7B, Gemma-7B, and Mistral-7B, and attention heads are found to specialize into three regimes, Retriever, Mixer, and Reset, with distinct geometric signatures, giving attention a measurable form of interpretability and informing geometry-aware sparsification and design.
Link: https://arxiv.org/abs/2602.01893
Authors: Timur Mudarisov,Mikhal Burtsev,Tatiana Petrova,Radu State
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:We present a geometric framework for analysing multi-head attention in large language models (LLMs). Without altering the mechanism, we view standard attention through a top-N selection lens and study its behaviour directly in value-state space. We define geometric metrics - Precision, Recall, and F-score - to quantify separability between selected and non-selected tokens, and derive non-asymptotic bounds with explicit dependence on dimension and margin under empirically motivated assumptions (stable value norms with a compressed sink token, exponential similarity decay, and piecewise attention weight profiles). The theory predicts a small-N operating regime of strongest non-trivial separability and clarifies how sequence length and sink similarity shape the metrics. Empirically, across LLaMA-2-7B, Gemma-7B, and Mistral-7B, measurements closely track the theoretical envelopes: top-N selection sharpens separability, and sink similarity correlates with Recall. We also found that in LLaMA-2-7B heads specialize into three regimes - Retriever, Mixer, Reset - with distinct geometric signatures. Overall, attention behaves as a structured geometric classifier with measurable criteria for token selection, offering head-level interpretability and informing geometry-aware sparsification and design of attention in LLMs.
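For intuition, here is one toy instantiation of top-N Precision/Recall in value space. The paper's formal definitions differ; the similarity-threshold notion of "relevance" used below is purely an assumption made to produce something computable:

```python
import numpy as np

def topn_geometric_metrics(attn, V, n, sim_thresh=0.5):
    """Toy Precision/Recall/F for top-N attention selection.

    Selected set: the n tokens with the largest attention weight.
    'Geometrically relevant' set (illustrative only): tokens whose value
    state has cosine similarity above sim_thresh with the attention
    output sum_t attn_t * V_t."""
    out = attn @ V
    sims = V @ out / (np.linalg.norm(V, axis=1) * np.linalg.norm(out) + 1e-8)
    relevant = set(np.where(sims > sim_thresh)[0])
    selected = set(np.argsort(-attn)[:n])
    tp = len(selected & relevant)
    precision = tp / max(len(selected), 1)
    recall = tp / max(len(relevant), 1)
    f = 2 * precision * recall / max(precision + recall, 1e-8)
    return precision, recall, f

rng = np.random.default_rng(3)
attn = rng.dirichlet(np.ones(32))          # one attention row over 32 tokens
V = rng.normal(size=(32, 8))               # value states
print(topn_geometric_metrics(attn, V, n=4))
```

Sweeping n in such a probe is one way to observe the small-N regime the theory predicts: very small n misses relevant tokens, very large n dilutes precision.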
zh
[AI-81] Entropy-Guided Data-Efficient Training for Multimodal Reasoning Reward Models
[Quick Read]: This paper addresses two key problems in training multimodal reasoning reward models: the inherent noise in preference datasets degrades performance, and conventional training ignores differences in sample difficulty, which hurts efficiency. The key insight is a strong correlation between response entropy and accuracy, which makes entropy a reliable, unsupervised proxy for annotation noise and sample difficulty. Building on this, the proposed Entropy-Guided Training (EGT) combines two strategies: entropy-guided data curation that reduces the influence of unreliable samples, and an entropy-guided training schedule that progressively introduces harder examples, improving both performance and training efficiency.
Link: https://arxiv.org/abs/2602.01884
Authors: Shidong Yang,Tongwen Huang,Hao Wen,Yong Wang,Li Chen,Xiangxiang Chu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Multimodal reward models are crucial for aligning multimodal large language models with human preferences. Recent works have incorporated reasoning capabilities into these models, achieving promising results. However, training these models suffers from two critical challenges: (1) the inherent noise in preference datasets, which degrades model performance, and (2) the inefficiency of conventional training methods, which ignore the differences in sample difficulty. In this paper, we identify a strong correlation between response entropy and accuracy, indicating that entropy can serve as a reliable and unsupervised proxy for annotation noise and sample difficulty. Based on this insight, we propose a novel Entropy-Guided Training (EGT) approach for multimodal reasoning reward models, which combines two strategies: (1) entropy-guided data curation to mitigate the impact of unreliable samples, and (2) an entropy-guided training strategy that progressively introduces more complex examples. Extensive experiments across three benchmarks show that the EGT-trained model consistently outperforms state-of-the-art multimodal reward models.
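Both strategies hinge on computing response entropy. A minimal sketch of entropy-based curation plus easy-to-hard ordering; the 0.9 noise quantile and the exact curriculum rule are assumptions, not the paper's settings:

```python
import numpy as np

def mean_token_entropy(logprob_matrix):
    """Average next-token entropy of one sampled response.
    logprob_matrix: (T, V) log-probabilities at each generated position."""
    p = np.exp(logprob_matrix)
    return float(-(p * logprob_matrix).sum(axis=-1).mean())

def curate_and_order(samples, entropies, noise_quantile=0.9):
    """Drop the highest-entropy (likely noisy) samples, then order the rest
    from low to high entropy for an easy-to-hard curriculum."""
    cut = np.quantile(entropies, noise_quantile)
    kept = [(e, s) for e, s in zip(entropies, samples) if e <= cut]
    kept.sort(key=lambda pair: pair[0])      # low entropy (easy) first
    return [s for _, s in kept]

rng = np.random.default_rng(4)
ents = np.array([mean_token_entropy(np.log(rng.dirichlet(np.ones(30), size=12)))
                 for _ in range(50)])
ordered = curate_and_order(list(range(50)), ents)
print("kept", len(ordered), "of 50 samples; first five:", ordered[:5])
```

The attraction of entropy here is that it needs no extra labels: the same forward passes that produce the reward model's responses also yield the statistic used to filter and schedule them.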
zh
[AI-82] ProcMEM: Learning Reusable Procedural Memory from Experience via Non-Parametric PPO for LLM Agents
[Quick Read]: This paper addresses the computational redundancy and execution instability of LLM-driven agents in sequential decision-making, which stem from ineffective reuse of interaction experience: existing approaches rely on on-the-fly reasoning and re-derive solutions in recurring scenarios, never accumulating procedural knowledge that can be stored long-term and invoked efficiently. The key of the proposed ProcMEM framework is to formalize a Skill-MDP that turns passive episodic narratives into executable Skills with activation, execution, and termination conditions, and to introduce Non-Parametric PPO, which uses semantic gradients for high-quality candidate generation and a PPO Gate for Skill verification, achieving reliable knowledge reuse without any parameter updates; score-based maintenance keeps the procedural memory compact and high quality, markedly improving reuse rates and performance in in-domain, cross-task, and cross-agent scenarios.
Link: https://arxiv.org/abs/2602.01869
Authors: Qirui Mi,Zhijian Ma,Mengyue Yang,Haoxuan Li,Yisen Wang,Haifeng Zhang,Jun Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 20 Pages, 6 Figures, 4 Tables
Abstract:LLM-driven agents demonstrate strong performance in sequential decision-making but often rely on on-the-fly reasoning, re-deriving solutions even in recurring scenarios. This insufficient experience reuse leads to computational redundancy and execution instability. To bridge this gap, we propose ProcMEM, a framework that enables agents to autonomously learn procedural memory from interaction experiences without parameter updates. By formalizing a Skill-MDP, ProcMEM transforms passive episodic narratives into executable Skills defined by activation, execution, and termination conditions to ensure executability. To achieve reliable reusability without capability degradation, we introduce Non-Parametric PPO, which leverages semantic gradients for high-quality candidate generation and a PPO Gate for robust Skill verification. Through score-based maintenance, ProcMEM sustains compact, high-quality procedural memory. Experimental results across in-domain, cross-task, and cross-agent scenarios demonstrate that ProcMEM achieves superior reuse rates and significant performance gains with extreme memory compression. Visualized evolutionary trajectories and Skill distributions further reveal how ProcMEM transparently accumulates, refines, and reuses procedural knowledge to facilitate long-term autonomy.
zh
[AI-83] GRAB: An LLM-Inspired Sequence-First Click-Through Rate Prediction Modeling Paradigm
[Quick Read]: This paper addresses the performance and efficiency bottlenecks of traditional deep learning recommendation models (DLRMs), especially their limited generalization and long-sequence modeling. The key of the solution is GRAB (Generative Ranking for Ads at Baidu), an end-to-end generative ranking framework with a novel Causal Action-aware Multi-channel Attention (CamA) mechanism that effectively captures the temporal dynamics and action-specific signals in user behavior sequences, noticeably improving click-through rate (CTR) prediction accuracy and commercial revenue.
Link: https://arxiv.org/abs/2602.01865
Authors: Shaopeng Chen,Chuyue Xie,Huimin Ren,Shaozong Zhang,Han Zhang,Ruobing Cheng,Zhiqiang Cao,Zehao Ju,Gao Yu,Jie Ding,Xiaodong Chen,Xuewu Jiao,Shuanglong Li,Liu Lin
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Traditional Deep Learning Recommendation Models (DLRMs) face increasing bottlenecks in performance and efficiency, often struggling with generalization and long-sequence modeling. Inspired by the scaling success of Large Language Models (LLMs), we propose Generative Ranking for Ads at Baidu (GRAB), an end-to-end generative framework for Click-Through Rate (CTR) prediction. GRAB integrates a novel Causal Action-aware Multi-channel Attention (CamA) mechanism to effectively capture temporal dynamics and specific action signals within user behavior sequences. Full-scale online deployment demonstrates that GRAB significantly outperforms established DLRMs, delivering a 3.05% increase in revenue and a 3.49% rise in CTR. Furthermore, the model demonstrates desirable scaling behavior: its expressive power shows a monotonic and approximately linear improvement as longer interaction sequences are utilized.
zh
[AI-84] SOPRAG: Multi-view Graph Experts Retrieval for Industrial Standard Operating Procedures
[Quick Read]: This paper addresses three challenges in retrieving and executing Standard Operating Procedures (SOPs) in industrial environments: rigid proprietary structures, condition-dependent relevance, and actionable execution requirements, which standard semantics-driven Retrieval-Augmented Generation (RAG) paradigms fail to handle. The key of the proposed SOPRAG framework is to replace flat chunking with three specialized graph experts, Entity, Causal, and Flow, which capture the structural and logical complexity of industrial SOPs; a Procedure Card layer prunes the search space to cut computational noise, and an LLM-guided gating mechanism dynamically weights the experts so retrieval aligns closely with operator intent.
Link: https://arxiv.org/abs/2602.01858
Authors: Liangtao Lin,Zhaomeng Zhu,Tianwei Zhang,Yonggang Wen
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Standard Operating Procedures (SOPs) are essential for ensuring operational safety and consistency in industrial environments. However, retrieving and following these procedures presents unique challenges, such as rigid proprietary structures, condition-dependent relevance, and actionable execution requirement, which standard semantic-driven Retrieval-Augmented Generation (RAG) paradigms fail to address. Inspired by the Mixture-of-Experts (MoE) paradigm, we propose SOPRAG, a novel framework specifically designed to address the above pain points in SOP retrieval. SOPRAG replaces flat chunking with specialized Entity, Causal, and Flow graph experts to resolve industrial structural and logical complexities. To optimize and coordinate these experts, we propose a Procedure Card layer that prunes the search space to eliminate computational noise, and an LLM-Guided gating mechanism that dynamically weights these experts to align retrieval with operator intent. To address the scarcity of domain-specific data, we also introduce an automated, multi-agent workflow for benchmark construction. Extensive experiments across four industrial domains demonstrate that SOPRAG significantly outperforms strong lexical, dense, and graph-based RAG baselines in both retrieval accuracy and response utility, achieving perfect execution scores in real-world critical tasks.
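The gating layer amounts to a weighted fusion of per-expert retrieval scores. A minimal sketch with hypothetical expert and gate interfaces; the real system's scoring, pruning, and LLM routing are not reproduced here:

```python
def gated_expert_retrieval(query, experts, gate_weights, top_k=3):
    """Fuse ranked lists from Entity/Causal/Flow graph experts using
    gate weights, e.g. produced by an LLM router.

    experts: dict name -> callable(query) -> {doc_id: score in [0, 1]}
    gate_weights: dict name -> nonnegative weight.
    """
    fused = {}
    for name, retrieve in experts.items():
        w = gate_weights.get(name, 0.0)
        for doc_id, score in retrieve(query).items():
            fused[doc_id] = fused.get(doc_id, 0.0) + w * score
    return sorted(fused.items(), key=lambda kv: -kv[1])[:top_k]

# toy experts returning canned scores (stand-ins for graph traversals)
experts = {
    "entity": lambda q: {"sop_12": 0.9, "sop_40": 0.4},
    "causal": lambda q: {"sop_40": 0.8, "sop_07": 0.6},
    "flow":   lambda q: {"sop_12": 0.5, "sop_07": 0.7},
}
weights = {"entity": 0.2, "causal": 0.6, "flow": 0.2}   # e.g. a fault-diagnosis query
print(gated_expert_retrieval("pump overheats after restart", experts, weights))
```

A diagnostic query like the one above would plausibly up-weight the causal expert, while a step-by-step execution query would favor the flow expert; the gate is what lets one index serve both intents.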
zh
[AI-85] Time2Vec-Integrated Transformer for Robust Gesture Recognition from Low-Density sEMG
[Quick Read]: This paper addresses the dependence of myoelectric prosthesis control on dense multi-sensor arrays, which limits consumer accessibility; the core challenge is achieving precise, stable gesture recognition with minimal sensor hardware. The key of the solution is a data-efficient deep learning framework built on an optimized hybrid Transformer: learnable Time2Vec temporal embeddings capture the stochastic temporal warping inherent in biological signals; a normalized additive fusion strategy aligns the latent distributions of spatial and temporal features, avoiding the destructive interference common in standard implementations; and a two-stage curriculum learning protocol strengthens feature extraction under data scarcity. Using only two-channel surface electromyography (sEMG), the approach reaches a multi-subject F1-score of 95.7% ± 0.20%, clearly outperforming a standard Transformer and a CNN-LSTM, and it shows that high-quality temporal modeling can compensate for low spatial resolution, pointing toward low-cost, rapidly personalized next-generation prosthetic interfaces.
Link: https://arxiv.org/abs/2602.01855
Authors: Blagoj Hristov,Hristijan Gjoreski,Vesna Ojleska Latkoska,Gorjan Nadzinski
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments:
Abstract:Accurate and responsive myoelectric prosthesis control typically relies on complex, dense multi-sensor arrays, which limits consumer accessibility. This paper presents a novel, data-efficient deep learning framework designed to achieve precise and accurate control using minimal sensor hardware. Leveraging an external dataset of 8 subjects, our approach implements a hybrid Transformer optimized for sparse, two-channel surface electromyography (sEMG). Unlike standard architectures that use fixed positional encodings, we integrate Time2Vec learnable temporal embeddings to capture the stochastic temporal warping inherent in biological signals. Furthermore, we employ a normalized additive fusion strategy that aligns the latent distributions of spatial and temporal features, preventing the destructive interference common in standard implementations. A two-stage curriculum learning protocol is utilized to ensure robust feature extraction despite data scarcity. The proposed architecture achieves a state-of-the-art multi-subject F1-score of 95.7% ± 0.20% for a 10-class movement set, statistically outperforming both a standard Transformer with fixed encodings and a recurrent CNN-LSTM model. Architectural optimization reveals that a balanced allocation of model capacity between spatial and temporal dimensions yields the highest stability. Furthermore, while direct transfer to a new unseen subject led to poor accuracy due to domain shifts, a rapid calibration protocol utilizing only two trials per gesture recovered performance from 21.0% ± 2.98% to 96.9% ± 0.52%. By validating that high-fidelity temporal embeddings can compensate for low spatial resolution, this work challenges the necessity of high-density sensing. The proposed framework offers a robust, cost-effective blueprint for next-generation prosthetic interfaces capable of rapid personalization.
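Time2Vec itself has a compact published form: one linear (non-periodic) component plus learnable sinusoids. Below is a sketch of that standard definition in PyTorch; how it is fused with the sEMG token stream is this paper's contribution and is not reproduced here:

```python
import torch
import torch.nn as nn

class Time2Vec(nn.Module):
    """Learnable temporal embedding t2v(t):
    t2v(t)[0] = w0 * t + b0 (linear trend),
    t2v(t)[i] = sin(wi * t + bi) for i >= 1 (learnable periodicities)."""

    def __init__(self, k: int):
        super().__init__()
        self.w = nn.Parameter(torch.randn(k))
        self.b = nn.Parameter(torch.randn(k))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (batch, seq_len) timestamps -> (batch, seq_len, k) embeddings
        v = self.w * t.unsqueeze(-1) + self.b
        return torch.cat([v[..., :1], torch.sin(v[..., 1:])], dim=-1)

t = torch.arange(0, 200, dtype=torch.float32).view(1, -1) / 200.0
emb = Time2Vec(k=16)(t)          # could be added to sEMG token embeddings
print(emb.shape)                 # torch.Size([1, 200, 16])
```

Unlike fixed sinusoidal positional encodings, the frequencies and phases here are trained, which is what lets the embedding adapt to the irregular timing of biological signals.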
zh
[AI-86] ROMA: Recursive Open Meta-Agent Framework for Long-Horizon Multi-Agent Systems
[Quick Read]: This paper addresses the weaknesses of current agentic frameworks on long-horizon tasks: as reasoning depth increases, sequential orchestration becomes brittle, context-window limits degrade performance, and opaque execution makes failures hard to localize and debug. The key of the proposed domain-agnostic recursive architecture ROMA (Recursive Open Meta-Agents) is to decompose goals into dependency-aware subtask trees that can execute in parallel, and to compress and validate intermediate results through structured aggregation, controlling context growth; agent construction is standardized around four modular roles, Atomizer (which decides whether a task should be decomposed), Planner, Executor, and Aggregator, cleanly separating orchestration from model selection and yielding transparent hierarchical execution traces. In addition, GEPA+ (an improved Genetic-Pareto prompt proposer) searches for optimal prompt combinations within ROMA's component hierarchy without fine-tuning, preserving interface contracts and markedly improving system-level reasoning and long-form generation.
Link: https://arxiv.org/abs/2602.01848
Authors: Salaheddin Alzu’bi,Baran Nama,Arda Kaz,Anushri Eswaran,Weiyuan Chen,Sarvesh Khetan,Rishab Bala,Tu Vu,Sewoong Oh
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:Current agentic frameworks underperform on long-horizon tasks. As reasoning depth increases, sequential orchestration becomes brittle, context windows impose hard limits that degrade performance, and opaque execution traces make failures difficult to localize or debug. We introduce ROMA (Recursive Open Meta-Agents), a domain-agnostic framework that addresses these limitations through recursive task decomposition and structured aggregation. ROMA decomposes goals into dependency-aware subtask trees that can be executed in parallel, while aggregation compresses and validates intermediate results to control context growth. Our framework standardizes agent construction around four modular roles (Atomizer, which decides whether a task should be decomposed; Planner; Executor; and Aggregator), which cleanly separate orchestration from model selection and enable transparent, hierarchical execution traces. This design supports heterogeneous multi-agent systems that mix models and tools according to cost, latency, and capability. To adapt ROMA to specific tasks without fine-tuning, we further introduce GEPA+, an improved Genetic-Pareto prompt proposer that searches over prompts within ROMA’s component hierarchy while preserving interface contracts. We show that ROMA, combined with GEPA+, delivers leading system-level performance on reasoning and long-form generation benchmarks. On SEAL-0, which evaluates reasoning over conflicting web evidence, ROMA instantiated with GLM-4.6 improves accuracy by 9.9% over Kimi-Researcher. On EQ-Bench, a long-form writing benchmark, ROMA enables DeepSeek-V3 to match the performance of leading closed-source models such as Claude Sonnet 4.5. Our results demonstrate that recursive, modular agent architectures can scale reasoning depth while remaining interpretable, flexible, and model-agnostic.
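The four-role loop fits in a short recursion. A toy sketch with stand-in role functions; real ROMA uses LLM-backed roles and runs independent subtasks in parallel, so everything below is purely illustrative:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Roles:
    atomize: Callable[[str], bool]              # is this task atomic?
    plan: Callable[[str], List[str]]            # decompose into subtasks
    execute: Callable[[str], str]               # solve an atomic task
    aggregate: Callable[[str, List[str]], str]  # compress child results

def roma_solve(task: str, roles: Roles, depth: int = 0, max_depth: int = 3) -> str:
    """Recursive solve: the Atomizer gates decomposition, the Planner builds
    the subtask tree, the Executor handles leaves, the Aggregator merges."""
    if depth >= max_depth or roles.atomize(task):
        return roles.execute(task)
    subtasks = roles.plan(task)
    results = [roma_solve(s, roles, depth + 1, max_depth) for s in subtasks]
    return roles.aggregate(task, results)

# toy roles: split on " and ", echo leaves, join summaries
roles = Roles(
    atomize=lambda t: " and " not in t,
    plan=lambda t: t.split(" and "),
    execute=lambda t: f"done({t})",
    aggregate=lambda t, rs: "; ".join(rs),
)
print(roma_solve("survey sources and draft report and cite evidence", roles))
```

Because aggregation returns a compressed summary rather than the children's full transcripts, the context seen at each level of the tree stays bounded regardless of total reasoning depth.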
zh
[AI-87] DOGMA: Weaving Structural Information into Data-centric Single-cell Transcriptomics Analysis
[Quick Read]: This paper targets the limitations of data-centric methods in single-cell transcriptomics: earlier sequence-based methods treat cells as independent entities and overlook both the latent biological relationships among cells and the quality defects of the raw sequencing data, while existing structured methods build graphs from heuristic rules that often ignore biological priors, yielding inaccurate graph representations, high computational overhead, and weak cross-species generalization. The key of the proposed DOGMA framework is to use multi-level biological prior knowledge, combining Statistical Anchors with the Cell Ontology and Phylogenetic Trees, for deterministic graph-structure discovery and cross-species alignment, and to use the Gene Ontology to bridge the feature-level semantic gap. On complex multi-species, multi-organ benchmarks, DOGMA achieves state-of-the-art performance at noticeably lower computational cost, with stronger zero-shot robustness and sample efficiency.
Link: https://arxiv.org/abs/2602.01839
Authors: Ru Zhang,Xunkai Li,Yaxin Deng,Sicheng Liu,Daohan Su,Qiangqiang Dai,Hongchao Qin,Rong-Hua Li,Guoren Wang,Jia Li
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
Comments: 12 pages, 4 figures
Abstract:Recently, data-centric AI methodology has been a dominant paradigm in single-cell transcriptomics analysis, which treats data representation rather than model complexity as the fundamental bottleneck. In the review of current studies, earlier sequence methods treat cells as independent entities and adapt prevalent ML models to analyze their directly inherited sequence data. Despite their simplicity and intuition, these methods overlook the latent intercellular relationships driven by the functional mechanisms of biological systems and the inherent quality issues of the raw sequence data. Therefore, a series of structured methods has emerged. Although they employ various heuristic rules to capture intricate intercellular relationships and enhance the raw sequencing data, these methods often neglect biological prior knowledge. This omission incurs substantial overhead and yields suboptimal graph representations, thereby hindering the utility of ML models. To address them, we propose DOGMA, a holistic data-centric framework designed for the structural reshaping and semantic enhancement of raw data through multi-level biological prior knowledge. Transcending reliance on stochastic heuristics, DOGMA redefines graph construction by integrating Statistical Anchors with Cell Ontology and Phylogenetic Trees to enable deterministic structure discovery and robust cross-species alignment. Furthermore, Gene Ontology is utilized to bridge the feature-level semantic gap by incorporating functional priors. In complex multi-species and multi-organ benchmarks, DOGMA achieves SOTA performance, exhibiting superior zero-shot robustness and sample efficiency while operating with significantly lower computational cost.
zh
[AI-88] Synesthesia of Vehicles: Tactile Data Synthesis from Visual Inputs
[Quick Read]: This paper addresses a gap in safety-oriented perception for autonomous vehicles (AVs): visual and optical sensors cannot detect road-induced excitations, which are critical for dynamic vehicle control. The key of the proposed Synesthesia of Vehicles (SoV) framework is a cross-modal spatiotemporal alignment method that handles the temporal and spatial disparities between visual inputs and tactile excitations, together with a Visual-Tactile Synesthetic generative model (VTSyn) based on latent diffusion for unsupervised, high-quality tactile data synthesis, strengthening the autonomous driving system's proactive perception of road conditions.
Link: https://arxiv.org/abs/2602.01832
Authors: Rui Wang,Yaoguang Cao,Yuyi Chen,Jianyi Xu,Zhuoyang Li,Jiachen Shang,Shichun Yang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Autonomous vehicles (AVs) rely on multi-modal fusion for safety, but current visual and optical sensors fail to detect road-induced excitations which are critical for vehicles’ dynamic control. Inspired by human synesthesia, we propose the Synesthesia of Vehicles (SoV), a novel framework to predict tactile excitations from visual inputs for autonomous vehicles. We develop a cross-modal spatiotemporal alignment method to address temporal and spatial disparities. Furthermore, a visual-tactile synesthetic (VTSyn) generative model using latent diffusion is proposed for unsupervised high-quality tactile data synthesis. A real-vehicle perception system collected a multi-modal dataset across diverse road and lighting conditions. Extensive experiments show that VTSyn outperforms existing models in temporal, frequency, and classification performance, enhancing AV safety through proactive tactile perception.
zh
[AI-89] Beyond Precision: Training-Inference Mismatch is an Optimization Problem and Simple LR Scheduling Fixes It
[Quick Read]: This paper addresses the notorious instability of reinforcement learning (RL) when training large language models. The analysis shows that the instability stems not only from training-inference mismatch but is dynamically coupled with gradient noise that escalates as training progresses. The key of the solution is to identify response length as a reliable early-warning signal of impending instability and to build a learning-rate (LR) scheduler that triggers decay dynamically on that signal, shrinking updates as gradient noise rises, which effectively suppresses the training-inference mismatch and keeps training stable.
Link: https://arxiv.org/abs/2602.01826
Authors: Yaxiang Zhang,Yingru Li,Jiacai Liu,Jiawei Xu,Ziniu Li,Qian Liu,Haoyuan Li
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reinforcement Learning (RL) for training Large Language Models is notoriously unstable. While recent studies attribute this to a "training-inference mismatch" stemming from inconsistent hybrid engines, standard remedies, such as Importance Sampling, might fail during extended training runs. In this work, we analyze this instability through the lens of optimization, demonstrating that gradient noise and training-inference mismatch escalate in tandem as training progresses. Meanwhile, we find that the mismatch can be effectively suppressed by shrinking the update size. Taken together, we deduce that the mismatch is not merely a static numerical discrepancy, but a dynamic failure coupled with the model’s optimization. Based on this insight, we propose a simple yet effective solution: a specialized Learning Rate (LR) scheduler. Instead of the pre-defined decay schedule of a traditional LR scheduler, our method dynamically triggers LR decay based on response length, which we identify as a reliable early-warning signal for impending instability. Empirical evidence suggests that by reducing the learning rate as gradient noise rises, we can consistently stabilize RL training and keep the training-inference mismatch at a safe level.
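The scheduler idea, decay when the response-length signal spikes, can be prototyped in a few lines. The EMA smoothing and the 1.2x spike threshold below are assumptions for illustration, not the paper's settings:

```python
class LengthTriggeredLRDecay:
    """Decay the learning rate when the EMA of mean response length spikes,
    treating length growth as an early warning of training instability."""

    def __init__(self, base_lr, decay=0.5, spike_ratio=1.2, ema_beta=0.9):
        self.lr = base_lr
        self.decay = decay
        self.spike_ratio = spike_ratio
        self.beta = ema_beta
        self.ema = None

    def step(self, mean_response_length):
        if self.ema is None:
            self.ema = mean_response_length
            return self.lr
        prev = self.ema
        self.ema = self.beta * self.ema + (1 - self.beta) * mean_response_length
        if self.ema > self.spike_ratio * prev:   # sustained length spike
            self.lr *= self.decay
        return self.lr

sched = LengthTriggeredLRDecay(base_lr=1e-6)
for L in [300, 310, 320, 330, 900, 1200, 1300]:
    print(f"mean_len={L:5d}  lr={sched.step(L):.2e}")
```

The EMA makes the trigger react to sustained growth rather than single noisy batches, which matters because one long rollout batch is common even in healthy training.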
zh
[AI-90] INDIBATOR: Diverse and Fact-Grounded Individuality for Multi-Agent Debate in Molecular Discovery
[Quick Read]: This paper addresses the efficiency and quality limits caused by overly generic personas in multi-agent systems for scientific discovery: current frameworks assign generic role labels such as "reviewer" or "writer", or coarse keyword-based personas, failing to reflect how a human scientist's individual research trajectory uniquely shapes their contributions. The key of the proposed INDIBATOR framework is to ground agents in individualized scientist profiles built from two modalities, publication history for literature-derived knowledge and molecular history for structural priors, and to let these fine-grained, individuality-grounded agents run multi-turn debates with proposal, critique, and voting phases, simulating scientific collaboration more faithfully and raising the quality of discovery.
Link: https://arxiv.org/abs/2602.01815
Authors: Yunhui Jang,Seonghyun Park,Jaehyung Kim,Sungsoo Ahn
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Multi-agent systems have emerged as a powerful paradigm for automating scientific discovery. To differentiate agent behavior in the multi-agent system, current frameworks typically assign generic role-based personas such as "reviewer" or "writer" or rely on coarse grained keyword-based personas. While functional, this approach oversimplifies how human scientists operate, whose contributions are shaped by their unique research trajectories. In response, we propose INDIBATOR, a framework for molecular discovery that grounds agents in individualized scientist profiles constructed from two modalities: publication history for literature-derived knowledge and molecular history for structural priors. These agents engage in multi-turn debate through proposal, critique, and voting phases. Our evaluation demonstrates that these fine-grained individuality-grounded agents consistently outperform systems relying on coarse-grained personas, achieving competitive or state-of-the-art performance. These results validate that capturing the "scientific DNA" of individual agents is essential for high-quality discovery.
zh
[AI-91] ORCH: many analyses, one merge: a deterministic multi-agent orchestrator for discrete-choice reasoning with EMA-guided routing
[Quick Read]: This paper addresses the irreproducible behavior and hard-to-interpret decision processes of current multi-agent language-model systems on discrete-choice reasoning tasks, problems that usually stem from stochastic routing or ad-hoc heuristics. The key of the proposed deterministic coordination framework ORCH is a "many analyses, one decision" paradigm: multiple base models independently produce structured analyses, and a dedicated merge agent outputs the final decision under fixed rules for task decomposition and answer aggregation, giving a predictable, reproducible, training-free pipeline. An optional EMA-guided router additionally updates agent selection from historical accuracy, latency, or cost, exploiting model complementarity without sacrificing determinism. Experiments on benchmarks including MMLU-Pro and GSM8K show that ORCH clearly outperforms single-model baselines and majority-vote ensembles, with statistically significant gains and practical deployability.
Link: https://arxiv.org/abs/2602.01797
Authors: Hanlin Zhou,Huah Yong Chan
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent advances in large-scale language models (LLMs) have made multi-agent architectures attractive for challenging reasoning tasks. However, many existing systems rely on stochastic routing or ad-hoc heuristics, making their behavior difficult to reproduce and their decision process hard to interpret. We propose ORCH, a deterministic coordination framework for discrete-choice reasoning that orchestrates heterogeneous LLMs. ORCH follows a "many analyses, one decision" paradigm: multiple base models independently produce structured analyses, and a dedicated merge agent outputs the final choice. The framework uses fixed rules for task decomposition and answer aggregation, keeping the pipeline predictable, reproducible, and training-free. Determinism here refers to fixed routing and aggregation rules under a fixed evaluation protocol, rather than strict bit-level reproducibility across deployments. To exploit model complementarity, we optionally introduce an EMA-guided router that updates agent selection using historical accuracy, latency, or cost; since it relies on answer-based feedback, it is mainly intended for benchmarking, controlled evaluation, or delayed-feedback settings. Experiments on MMLU, MMLU-Pro, and GSM8K show that ORCH consistently outperforms single-model baselines and a majority-vote ensemble. On MMLU-Pro, ORCH improves accuracy by over 10 points compared to the strongest baseline, and on GSM8K it yields gains exceeding 50 points; McNemar tests confirm statistical significance. The EMA router provides an additional 0.7–2.0 point accuracy boost, and ablations show that both multi-agent collaboration and routing contribute substantially. Overall, ORCH offers a practical path toward controllable, interpretable, and deployment-ready LLM-based agent systems for discrete-choice reasoning.
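An EMA router is a small amount of state. A minimal sketch with an illustrative cost penalty and deterministic tie-breaking; the constants and the scoring form are assumptions, not the paper's:

```python
class EMARouter:
    """Pick the agent for the next query from an EMA of historical utility
    (accuracy minus a cost penalty), breaking ties deterministically by name."""

    def __init__(self, agents, alpha=0.2, cost_weight=0.1, costs=None):
        self.alpha = alpha
        self.cost_weight = cost_weight
        self.costs = costs or {a: 0.0 for a in agents}
        self.ema = {a: 0.5 for a in agents}     # neutral prior before any feedback

    def route(self):
        return max(self.ema,
                   key=lambda a: (self.ema[a] - self.cost_weight * self.costs[a], a))

    def update(self, agent, correct: bool):
        r = 1.0 if correct else 0.0
        self.ema[agent] = (1 - self.alpha) * self.ema[agent] + self.alpha * r

router = EMARouter(["model_a", "model_b"], costs={"model_a": 0.0, "model_b": 0.3})
for outcome in [True, True, False, True]:       # answer-based feedback
    choice = router.route()
    router.update(choice, outcome)
    print(choice, {k: round(v, 3) for k, v in router.ema.items()})
```

Because updates need the correctness of past answers, this component only makes sense where such feedback exists, which is exactly the benchmarking and delayed-feedback caveat the abstract states.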
zh
[AI-92] RedVisor: Reasoning-Aware Prompt Injection Defense via Zero-Copy KV Cache Reuse
[Quick Read]: This paper addresses the tension between security and utility when defending large language models (LLMs) against prompt injection (PI) attacks: prevention-based fine-tuning often degrades general capability on benign inputs (the "alignment tax"), while detection-based filtering incurs prohibitive latency and memory costs. The key of the proposed RedVisor framework is a lightweight, removable adapter on top of a frozen backbone that follows a fine-grained reasoning path with a dual function: it first produces an explainable analysis that localizes the injection and articulates the threat, then uses that analysis to explicitly condition the model to reject the malicious command. The adapter is active only during this reasoning phase and is muted during response generation, mathematically preserving the backbone's performance on benign inputs, and a novel KV Cache Reuse strategy eliminates the redundant prefill computation inherent to decoupled pipelines, clearly improving efficiency.
Link: https://arxiv.org/abs/2602.01795
Authors: Mingrui Liu,Sixiao Zhang,Cheng Long,Kwok-Yan Lam
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: under review
Abstract:Large Language Models (LLMs) are increasingly vulnerable to Prompt Injection (PI) attacks, where adversarial instructions hidden within retrieved contexts hijack the model’s execution flow. Current defenses typically face a critical trade-off: prevention-based fine-tuning often degrades general utility via the “alignment tax”, while detection-based filtering incurs prohibitive latency and memory costs. To bridge this gap, we propose RedVisor, a unified framework that synthesizes the explainability of detection systems with the seamless integration of prevention strategies. To the best of our knowledge, RedVisor is the first approach to leverage fine-grained reasoning paths to simultaneously detect attacks and guide the model’s safe response. We implement this via a lightweight, removable adapter positioned atop the frozen backbone. This adapter serves a dual function: it first generates an explainable analysis that precisely localizes the injection and articulates the threat, which then explicitly conditions the model to reject the malicious command. Uniquely, the adapter is active only during this reasoning phase and is effectively muted during the subsequent response generation. This architecture yields two distinct advantages: (1) it mathematically preserves the backbone’s original utility on benign inputs; and (2) it enables a novel KV Cache Reuse strategy, eliminating the redundant prefill computation inherent to decoupled pipelines. We further pioneer the integration of this defense into the vLLM serving engine with custom kernels. Experiments demonstrate that RedVisor outperforms state-of-the-art defenses in detection accuracy and throughput while incurring negligible utility loss.
zh
[AI-93] LingLanMiDian: Systematic Evaluation of LLMs on TCM Knowledge and Clinical Reasoning
[Quick Read]: This paper addresses the fragmentation, lack of unified standards, and generation-heavy scoring that make fair comparison difficult in current evaluations of large language models (LLMs) on Traditional Chinese Medicine (TCM). The key of the proposed LingLanMiDian (LingLan) benchmark, a large-scale, expert-curated, multi-task suite, lies in several core designs: a unified metric design, a synonym-tolerant protocol for clinical labels, a per-dataset 400-item Hard subset, and a reframing of diagnosis and treatment recommendation as single-choice decision recognition, enabling standardized, quantitative, and extensible evaluation of LLMs on TCM commonsense understanding, reasoning, and clinical decision support.
Link: https://arxiv.org/abs/2602.01779
Authors: Rui Hua,Yu Wei,Zixin Shu,Kai Chang,Dengying Yan,Jianan Xia,Zeyu Liu,Hui Zhu,Shujie Song,Mingzhong Xiao,Xiaodong Li,Dongmei Jia,Zhuye Gao,Yanyan Meng,Naixuan Zhao,Yu Fu,Haibin Yu,Benman Yu,Yuanyuan Chen,Fei Dong,Zhizhou Meng,Pengcheng Yang,Songxue Zhao,Lijuan Pei,Yunhui Hu,Kan Ding,Jiayuan Duan,Wenmao Yin,Yang Gu,Runshun Zhang,Qiang Zhu,Jian Yu,Jiansheng Li,Baoyan Liu,Wenjia Wang,Xuezhong Zhou
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) are advancing rapidly in medical NLP, yet Traditional Chinese Medicine (TCM) with its distinctive ontology, terminology, and reasoning patterns requires domain-faithful evaluation. Existing TCM benchmarks are fragmented in coverage and scale and rely on non-unified or generation-heavy scoring that hinders fair comparison. We present the LingLanMiDian (LingLan) benchmark, a large-scale, expert-curated, multi-task suite that unifies evaluation across knowledge recall, multi-hop reasoning, information extraction, and real-world clinical decision-making. LingLan introduces a consistent metric design, a synonym-tolerant protocol for clinical labels, a per-dataset 400-item Hard subset, and a reframing of diagnosis and treatment recommendation into single-choice decision recognition. We conduct comprehensive, zero-shot evaluations on 14 leading open-source and proprietary LLMs, providing a unified perspective on their strengths and limitations in TCM commonsense knowledge understanding, reasoning, and clinical decision support; critically, the evaluation on Hard subset reveals a substantial gap between current models and human experts in TCM-specialized reasoning. By bridging fundamental knowledge and applied reasoning through standardized evaluation, LingLan establishes a unified, quantitative, and extensible foundation for advancing TCM LLMs and domain-specific medical AI research. All evaluation data and code are available at this https URL and this http URL.
zh
[AI-94] Stein-Rule Shrinkage for Stochastic Gradient Estimation in High Dimensions
[Quick Read]: This paper addresses the risk-suboptimality of standard stochastic gradient (SG) estimation in high dimensions: mini-batch gradients are conventionally treated as unbiased estimates of the population gradient, yet by statistical decision theory such unbiased estimators are generally inadmissible under quadratic loss in high-dimensional settings. The key of the solution is to formulate stochastic gradient computation as a high-dimensional estimation problem and apply a decision-theoretic Stein-rule shrinkage framework: a shrinkage gradient estimator adaptively contracts the noisy mini-batch gradient toward a stable restricted estimator derived from historical momentum, with the shrinkage intensity set by an online estimate of gradient-noise variance obtained from the second-moment statistics that adaptive optimizers such as Adam already maintain. Under a Gaussian noise model the estimator uniformly dominates the standard SG under squared-error loss and is minimax-optimal, and it plugs into Adam with negligible additional cost.
Link: https://arxiv.org/abs/2602.01777
Authors: M. Arashi,M. Amintoosi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments:
Abstract:Stochastic gradient methods are central to large-scale learning, yet their analysis typically treats mini-batch gradients as unbiased estimators of the population gradient. In high-dimensional settings, however, classical results from statistical decision theory show that unbiased estimators are generally inadmissible under quadratic loss, suggesting that standard stochastic gradients may be suboptimal from a risk perspective. In this work, we formulate stochastic gradient computation as a high-dimensional estimation problem and introduce a decision-theoretic framework based on Stein-rule shrinkage. We construct a shrinkage gradient estimator that adaptively contracts noisy mini-batch gradients toward a stable restricted estimator derived from historical momentum. The shrinkage intensity is determined in a data-driven manner using an online estimate of gradient noise variance, leveraging second-moment statistics commonly maintained by adaptive optimization methods. Under a Gaussian noise model and for dimension p ≥ 3, we show that the proposed estimator uniformly dominates the standard stochastic gradient under squared error loss and is minimax-optimal in the classical decision-theoretic sense. We further demonstrate how this estimator can be incorporated into the Adam optimizer, yielding a practical algorithm with negligible additional computational cost. Empirical evaluations on CIFAR10 and CIFAR100, across multiple levels of label noise, show consistent improvements over Adam in the large-batch regime. Ablation studies indicate that the gains arise primarily from selectively applying shrinkage to high-dimensional convolutional layers, while indiscriminate shrinkage across all parameters degrades performance. These results illustrate that classical shrinkage principles provide a principled and effective approach to improving stochastic gradient estimation in modern deep learning.
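The estimator is a positive-part James-Stein rule aimed at a momentum anchor. A sketch of that classical formula, with `sigma2` standing in for the online noise-variance estimate the paper derives from Adam's second moments (the exact form used in the paper may differ):

```python
import numpy as np

def stein_shrunk_gradient(g, m, sigma2, p):
    """Shrink a noisy mini-batch gradient g toward a momentum anchor m.

    Positive-part James-Stein rule:
      g_js = m + max(0, 1 - (p - 2) * sigma2 / ||g - m||^2) * (g - m),
    where sigma2 estimates per-coordinate gradient noise variance and
    p = dim(g)."""
    diff = g - m
    norm2 = float(diff @ diff)
    shrink = max(0.0, 1.0 - (p - 2) * sigma2 / (norm2 + 1e-12))
    return m + shrink * diff

rng = np.random.default_rng(5)
p = 10_000
true_grad = rng.normal(size=p)
m = true_grad + 0.05 * rng.normal(size=p)        # stable momentum anchor
g = true_grad + 0.5 * rng.normal(size=p)         # noisy mini-batch gradient
g_js = stein_shrunk_gradient(g, m, sigma2=0.25, p=p)
print("raw error:", round(np.linalg.norm(g - true_grad), 2),
      "shrunk error:", round(np.linalg.norm(g_js - true_grad), 2))
```

When the momentum anchor is close to the true gradient and the mini-batch noise dominates, the data-driven shrinkage factor falls near zero and the estimator leans on the anchor; when the anchor is stale, the factor stays near one and the raw gradient passes through almost unchanged.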
zh
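To make the shrinkage rule concrete, the sketch below shows a minimal James-Stein-style gradient estimator consistent with the abstract's description; it is an illustration, not the paper's released code, and the per-tensor noise-variance estimate, the (p - 2) intensity factor, and the clamping are assumptions:

```python
import torch

def shrinkage_gradient(grad, momentum, second_moment, eps=1e-8):
    """Minimal sketch: shrink a noisy mini-batch gradient toward a restricted
    estimator (historical momentum), with intensity driven by an online
    noise-variance estimate taken from Adam-style second-moment statistics."""
    p = grad.numel()  # dimensionality of this parameter tensor
    # Var[g] ~= E[g^2] - (E[g])^2, averaged per tensor (an assumption).
    noise_var = (second_moment - momentum.pow(2)).clamp(min=0.0).mean()
    diff = grad - momentum                # deviation from the restricted estimator
    dist_sq = diff.pow(2).sum() + eps
    # Stein-rule intensity: shrink harder when noise dominates the deviation.
    alpha = torch.clamp((p - 2) * noise_var / dist_sq, 0.0, 1.0)
    return momentum + (1.0 - alpha) * diff
```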
[AI-95] Efficient Cross-Architecture Knowledge Transfer for Large-Scale Online User Response Prediction
【Quick Read】: This paper addresses the high model-switching cost of deploying new architectures in large-scale user response prediction systems, namely expensive retraining on massive historical data and performance degradation under data-retention constraints; existing knowledge distillation methods struggle with architectural heterogeneity and the prohibitive cost of transferring large embedding tables. The key to the solution is CrossAdapt, a two-stage cross-architecture knowledge transfer framework: an offline stage achieves rapid embedding transfer via dimension-adaptive projections, combined with progressive network distillation and strategic sampling to cut computational cost, while an online stage introduces asymmetric co-distillation, in which the student updates frequently and the teacher infrequently, together with a distribution-aware adaptation mechanism that dynamically balances preserving historical knowledge against fast adaptation to evolving data, substantially improving accuracy while greatly reducing training time.
Link: https://arxiv.org/abs/2602.01775
Authors: Yucheng Wu, Yuekui Yang, Hongzheng Li, Anan Liu, Jian Xiao, Junjie Zhai, Huan Yu, Shaoping Ma, Leye Wang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages
Abstract:Deploying new architectures in large-scale user response prediction systems incurs high model switching costs due to expensive retraining on massive historical data and performance degradation under data retention constraints. Existing knowledge distillation methods struggle with architectural heterogeneity and the prohibitive cost of transferring large embedding tables. We propose CrossAdapt, a two-stage framework for efficient cross-architecture knowledge transfer. The offline stage enables rapid embedding transfer via dimension-adaptive projections without iterative training, combined with progressive network distillation and strategic sampling to reduce computational cost. The online stage introduces asymmetric co-distillation, where students update frequently while teachers update infrequently, together with a distribution-aware adaptation mechanism that dynamically balances historical knowledge preservation and fast adaptation to evolving data. Experiments on three public datasets show that CrossAdapt achieves 0.27-0.43% AUC improvements while reducing training time by 43-71%. Large-scale deployment on Tencent WeChat Channels (~10M daily samples) further demonstrates its effectiveness, significantly mitigating AUC degradation, LogLoss increase, and prediction bias compared to standard distillation baselines.
[AI-96] DIA-CLIP: a universal representation learning framework for zero-shot DIA proteomics
【Quick Read】: This paper addresses the overfitting induced by per-run semi-supervised training in current data-independent acquisition mass spectrometry (DIA-MS) analysis frameworks, and the resulting poor generalization across species and experimental conditions. The key to the solution is DIA-CLIP, a pre-trained model whose training combines dual-encoder contrastive learning with an encoder-decoder architecture to build a unified cross-modal representation space for peptides and their corresponding spectral features, enabling high-precision zero-shot peptide-spectrum match (PSM) inference and markedly deeper, more accurate protein identification.
Link: https://arxiv.org/abs/2602.01772
Authors: Yucheng Liao, Han Wen, Weinan E, Weijie Zhang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: 21 pages, 5 figures
Abstract:Data-independent acquisition mass spectrometry (DIA-MS) has established itself as a cornerstone of proteomic profiling and large-scale systems biology, offering unparalleled depth and reproducibility. Current DIA analysis frameworks, however, require semi-supervised training within each run for peptide-spectrum match (PSM) re-scoring. This approach is prone to overfitting and lacks generalizability across diverse species and experimental conditions. Here, we present DIA-CLIP, a pre-trained model shifting the DIA analysis paradigm from semi-supervised training to universal cross-modal representation learning. By integrating a dual-encoder contrastive learning framework with an encoder-decoder architecture, DIA-CLIP establishes a unified cross-modal representation for peptides and corresponding spectral features, achieving high-precision, zero-shot PSM inference. Extensive evaluations across diverse benchmarks demonstrate that DIA-CLIP consistently outperforms state-of-the-art tools, yielding up to a 45% increase in protein identification while achieving a 12% reduction in entrapment identifications. Moreover, DIA-CLIP holds immense potential for diverse practical applications, such as single-cell and spatial proteomics, where its enhanced identification depth facilitates the discovery of novel biomarkers and the elucidation of intricate cellular mechanisms.
[AI-97] IRIS: Implicit Reward-Guided Internal Sifting for Mitigating Multimodal Hallucination
【Quick Read】: This paper addresses hallucination in Multimodal Large Language Models (MLLMs), i.e., inaccurate or fabricated content arising from conflicts between modalities during generation. Existing Direct Preference Optimization (DPO) approaches rely on costly external evaluators for scoring or rewriting, incurring off-policy learnability gaps and discretization loss, and, lacking access to internal states, they overlook fine-grained inter-modality conflicts. The key to the solution is IRIS (Implicit Reward-Guided Internal Sifting), which leverages continuous implicit reward signals in the native log-probability space to preserve full information density and capture competition between modalities; following an on-policy paradigm, it self-generates preference pairs and sifts them by multimodal implicit rewards so that optimization directly resolves modal conflicts, achieving efficient and principled hallucination mitigation with only 5.7k samples.
Link: https://arxiv.org/abs/2602.01769
Authors: Yuanshuai Li, Yuping Yan, Jirui Han, Fei Ming, Lingjuan Lv, Yaochu Jin
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Hallucination remains a fundamental challenge for Multimodal Large Language Models (MLLMs). While Direct Preference Optimization (DPO) is a key alignment framework, existing approaches often rely heavily on costly external evaluators for scoring or rewriting, incurring off-policy learnability gaps and discretization loss. Due to the lack of access to internal states, such feedback overlooks the fine-grained conflicts between different modalities that lead to hallucinations during generation. To address this issue, we propose IRIS (Implicit Reward-Guided Internal Sifting), which leverages continuous implicit rewards in the native log-probability space to preserve full information density and capture internal modal competition. This on-policy paradigm eliminates learnability gaps by utilizing self-generated preference pairs. By sifting these pairs based on multimodal implicit rewards, IRIS ensures that optimization is driven by signals that directly resolve modal conflicts. Extensive experiments demonstrate that IRIS achieves highly competitive performance on key hallucination benchmarks using only 5.7k samples, without requiring any external feedback during preference alignment. These results confirm that IRIS provides an efficient and principled paradigm for mitigating MLLM hallucinations.
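As a concrete reference for the implicit-reward signal, here is a minimal sketch of the standard DPO-style implicit reward computed in the native log-probability space, plus a simple gap-based sifting rule; the margin criterion and function names are assumptions, not the paper's exact procedure:

```python
import torch
import torch.nn.functional as F

def implicit_reward(policy_logits, ref_logits, response_ids, beta=0.1):
    """DPO-style implicit reward: beta * (log pi(y|x) - log pi_ref(y|x)),
    summed over response tokens; shapes are (B, T, V) and (B, T)."""
    logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    tok = response_ids.unsqueeze(-1)
    lp = logp.gather(-1, tok).squeeze(-1).sum(-1)
    ref_lp = ref_logp.gather(-1, tok).squeeze(-1).sum(-1)
    return beta * (lp - ref_lp)

def sift_pairs(rewards_a, rewards_b, margin=0.5):
    """Keep self-generated pairs whose implicit-reward gap is informative
    (the margin threshold is an illustrative assumption)."""
    gap = (rewards_a - rewards_b).abs()
    keep = gap > margin
    chosen_is_a = rewards_a > rewards_b
    return keep, chosen_is_a
```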
[AI-98] CoMeT: Collaborative Memory Transformer for Efficient Long Context Modeling
【Quick Read】: This paper addresses two bottlenecks of standard Transformers on long texts: quadratic computational complexity and a key-value (KV) cache that grows without bound with sequence length, which together make extremely long contexts hard to scale. The key to the solution is the Collaborative Memory Transformer (CoMeT), which achieves constant memory usage and linear time via a dual-memory design: a temporary memory maintained on a FIFO queue captures recent information, while a global memory with a gated update rule models long-range dependencies; together the two memories act as a dynamic soft prompt that guides the processing of subsequent data chunks, compressing and propagating context efficiently. To enable efficient fine-tuning on extremely long sequences, the authors further design a layer-level pipeline parallelism strategy that markedly improves training efficiency and practicality.
Link: https://arxiv.org/abs/2602.01766
Authors: Runsong Zhao, Shilei Liu, Jiwei Tang, Langming Liu, Haibin Chen, Weidong Zhang, Yujin Yuan, Tong Xiao, Jingbo Zhu, Wenbo Su, Bo Zheng
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:The quadratic complexity and indefinitely growing key-value (KV) cache of standard Transformers pose a major barrier to long-context processing. To overcome this, we introduce the Collaborative Memory Transformer (CoMeT), a novel architecture that enables LLMs to handle arbitrarily long sequences with constant memory usage and linear time complexity. Designed as an efficient, plug-in module, CoMeT can be integrated into pre-trained models with only minimal fine-tuning. It operates on sequential data chunks, using a dual-memory system to manage context: a temporary memory on a FIFO queue for recent events, and a global memory with a gated update rule for long-range dependencies. These memories then act as a dynamic soft prompt for the next chunk. To enable efficient fine-tuning on extremely long contexts, we introduce a novel layer-level pipeline parallelism strategy. The effectiveness of our approach is remarkable: a model equipped with CoMeT and fine-tuned on 32k contexts can accurately retrieve a passkey from any position within a 1M token sequence. On the SCROLLS benchmark, CoMeT surpasses other efficient methods and achieves performance comparable to a full-attention baseline on summarization tasks. Its practical effectiveness is further validated on real-world agent and user behavior QA tasks. The code is available at: this https URL
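A minimal sketch of the dual-memory idea follows (FIFO temporary memory plus a gated global memory, concatenated as a soft prompt for the next chunk); the slot counts, the mean-pooling step, and the gate parameterization are assumptions rather than CoMeT's exact design:

```python
import torch
import torch.nn as nn

class DualMemory(nn.Module):
    """Sketch: FIFO temporary memory plus gated global memory; both are
    concatenated and prepended to the next chunk as a dynamic soft prompt."""
    def __init__(self, d_model, fifo_slots=16, global_slots=16):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)
        self.register_buffer("fifo", torch.zeros(fifo_slots, d_model))
        self.register_buffer("global_mem", torch.zeros(global_slots, d_model))

    def update(self, chunk_summary):
        # chunk_summary: (k, d_model) pooled hidden states, k <= fifo_slots.
        k = chunk_summary.size(0)
        self.fifo = torch.cat([self.fifo[k:], chunk_summary], dim=0)  # FIFO push
        pooled = chunk_summary.mean(0, keepdim=True).expand_as(self.global_mem)
        g = torch.sigmoid(self.gate(torch.cat([self.global_mem, pooled], dim=-1)))
        self.global_mem = g * self.global_mem + (1 - g) * pooled      # gated write

    def soft_prompt(self):
        # (fifo_slots + global_slots, d_model), prepended to the next chunk.
        return torch.cat([self.fifo, self.global_mem], dim=0)
```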
[AI-99] Backdoor Sentinel: Detecting and Detoxifying Backdoors in Diffusion Models via Temporal Noise Consistency
【Quick Read】: This paper addresses the backdoor vulnerability of diffusion models caused by opaque training data and procedures, in particular the difficulty of effective detection and detoxification in gray-box settings where auditors cannot access model parameters. Existing methods either require white-box information or are query-intensive, and detoxification is typically trapped between effectiveness and generation quality. The key to the solution is a previously unreported phenomenon, temporal noise inconsistency: under triggered inputs, noise predictions between adjacent diffusion timesteps are disrupted in specific temporal segments, while remaining stable under clean inputs. Building on this, the unified TNC-Defense framework uses adjacent-timestep noise consistency in a gray-box detection module that identifies and locates anomalous diffusion timesteps, and then applies a trigger-agnostic, timestep-aware detoxification module that directly corrects the backdoor generation path, suppressing backdoor behavior at significantly reduced detoxification cost while maintaining generation quality.
Link: https://arxiv.org/abs/2602.01765
Authors: Bingzheng Wang, Xiaoyan Gu, Hongbo Xu, Hongcheng Li, Zimo Yu, Jiang Zhou, Weiping Wang
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Diffusion models have been widely deployed in AIGC services; however, their reliance on opaque training data and procedures exposes a broad attack surface for backdoor injection. In practical auditing scenarios, due to the protection of intellectual property and commercial confidentiality, auditors are typically unable to access model parameters, rendering existing white-box or query-intensive detection methods impractical. More importantly, even after the backdoor is detected, existing detoxification approaches are often trapped in a dilemma between detoxification effectiveness and generation quality. In this work, we identify a previously unreported phenomenon called temporal noise unconsistency, where the noise predictions between adjacent diffusion timesteps is disrupted in specific temporal segments when the input is triggered, while remaining stable under clean inputs. Leveraging this finding, we propose Temporal Noise Consistency Defense (TNC-Defense), a unified framework for backdoor detection and detoxification. The framework first uses the adjacent timestep noise consistency to design a gray-box detection module, for identifying and locating anomalous diffusion timesteps. Furthermore, the framework uses the identified anomalous timesteps to construct a trigger-agnostic, timestep-aware detoxification module, which directly corrects the backdoor generation path. This effectively suppresses backdoor behavior while significantly reducing detoxification costs. We evaluate the proposed method under five representative backdoor attack scenarios and compare it with state-of-the-art defenses. The results show that TNC-Defense improves the average detection accuracy by 11% with negligible additional overhead, and invalidates an average of 98.5% of triggered samples with only a mild degradation in generation quality.
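The detection signal can be illustrated with a short gray-box sketch that only queries noise predictions: it scores the consistency of predictions at adjacent timesteps, where triggered inputs should show dips in specific segments. The cosine-similarity choice and the comparison against a clean baseline are assumptions:

```python
import torch

@torch.no_grad()
def temporal_consistency_profile(eps_model, x_t_seq, timesteps):
    """Cosine similarity of noise predictions at adjacent diffusion timesteps.
    x_t_seq[i] is the (batched) noisy sample at timesteps[i]; triggered inputs
    are expected to show anomalous dips relative to a clean baseline."""
    scores = []
    for i in range(len(timesteps) - 1):
        e_a = eps_model(x_t_seq[i], timesteps[i]).flatten(1)
        e_b = eps_model(x_t_seq[i + 1], timesteps[i + 1]).flatten(1)
        scores.append(torch.cosine_similarity(e_a, e_b, dim=1).mean().item())
    return scores  # flag timesteps whose score deviates from the clean profile
```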
[AI-100] A Provable Expressiveness Hierarchy in Hybrid Linear-Full Attention
【Quick Read】: This paper addresses the lack of a rigorous theoretical characterization of the expressive-power gap between efficient attention mechanisms (such as linear and hybrid attention) and standard full attention. The key to the solution is a provable expressiveness hierarchy: for sequential function composition, a multi-step reasoning task that must occur within a model's forward pass, an (L+1)-layer full-attention network suffices, whereas any hybrid network interleaving L-1 layers of full attention with a much larger number (2^{3L^2}) of linear-attention layers cannot solve it. This provides the first provable separation between hybrid attention and standard full attention, theoretically delineating the fundamental capability boundaries of different attention mechanisms.
Link: https://arxiv.org/abs/2602.01763
Authors: Xiaowei Ye, Xiaoyu He, Chao Liao, Chen Wu, Pinyan Lu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
Comments:
Abstract:Transformers serve as the foundation of most modern large language models. To mitigate the quadratic complexity of standard full attention, various efficient attention mechanisms, such as linear and hybrid attention, have been developed. A fundamental gap remains: their expressive power relative to full attention lacks a rigorous theoretical characterization. In this work, we theoretically characterize the performance differences among these attention mechanisms. Our theory applies to all linear attention variants that can be formulated as a recurrence, including Mamba, DeltaNet, etc. Specifically, we establish an expressiveness hierarchy: for sequential function composition (a multi-step reasoning task that must occur within a model's forward pass), an (L+1)-layer full attention network is sufficient, whereas any hybrid network interleaving L-1 layers of full attention with a substantially larger number (2^{3L^2}) of linear attention layers cannot solve it. This result demonstrates a clear separation in expressive power between the two types of attention. Our work provides the first provable separation between hybrid attention and standard full attention, offering a theoretical perspective for understanding the fundamental capabilities and limitations of different attention mechanisms.
[AI-101] Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking
【Quick Read】: This paper addresses reward hacking, a key failure mode of reinforcement learning from human feedback (RLHF) in which models exploit spurious correlations in the reward model to score highly while violating human intent; existing mitigations are static defenses that cannot adapt to novel exploitation strategies. The key to the solution is Adversarial Reward Auditing (ARA), which models reward hacking as a dynamic, competitive game: in the first stage, a Hacker policy discovers reward-model vulnerabilities while an Auditor learns to detect such exploitation from latent representations; in the second stage, Auditor-Guided RLHF (AG-RLHF) penalizes detected reward hacking, turning a previously unobservable failure into a measurable, controllable signal. The framework achieves a better alignment-utility trade-off and exhibits cross-domain generalization, markedly improving robustness and safety across scenarios.
Link: https://arxiv.org/abs/2602.01750
Authors: Mohammad Beigi, Ming Jin, Junshan Zhang, Qifan Wang, Lifu Huang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Reinforcement Learning from Human Feedback (RLHF) remains vulnerable to reward hacking, where models exploit spurious correlations in learned reward models to achieve high scores while violating human intent. Existing mitigations rely on static defenses that cannot adapt to novel exploitation strategies. We propose Adversarial Reward Auditing (ARA), a framework that reconceptualizes reward hacking as a dynamic, competitive game. ARA operates in two stages: first, a Hacker policy discovers reward model vulnerabilities while an Auditor learns to detect exploitation from latent representations; second, Auditor-Guided RLHF (AG-RLHF) gates reward signals to penalize detected hacking, transforming reward hacking from an unobservable failure into a measurable, controllable signal. Experiments across three hacking scenarios demonstrate that ARA achieves the best alignment-utility tradeoff among all baselines: reducing sycophancy to near-SFT levels while improving helpfulness, decreasing verbosity while achieving the highest ROUGE-L, and suppressing code gaming while improving Pass@1. Beyond single-domain evaluation, we show that reward hacking, detection, and mitigation all generalize across domains – a Hacker trained on code gaming exhibits increased sycophancy despite no reward for this behavior, and an Auditor trained on one domain effectively suppresses exploitation in others, enabling efficient multi-domain defense with a single model.
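A minimal sketch of the second-stage reward gating follows, assuming the Auditor exposes a per-sample exploitation probability; the threshold and the linear penalty are illustrative assumptions, not the paper's exact AG-RLHF rule:

```python
import torch

def auditor_gated_rewards(rewards, hack_probs, penalty=1.0, threshold=0.5):
    """Subtract a confidence-weighted penalty wherever the Auditor flags
    likely exploitation; hack_probs come from the Auditor's classifier over
    latent representations (all hyperparameters here are assumptions)."""
    flagged = (hack_probs > threshold).float()
    return rewards - penalty * flagged * hack_probs
```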
[AI-102] Controlling Exploration-Exploitation in GFlowNets via Markov Chain Perspectives
【Quick Read】: This paper addresses the constrained exploration-exploitation trade-off in Generative Flow Network (GFlowNet) training caused by objectives that implicitly fix an equal mixing of the forward and backward policies. The key to the solution is establishing an equivalence between GFlowNet objectives and Markov chain reversibility, which reveals the theoretical origin of this constraint, and proposing α-GFNs, which introduce a tunable parameter α to flexibly control the mixing of the forward and backward policies, directly regulating exploration-exploitation dynamics and improving mode discovery while guaranteeing convergence to unique flows.
Link: https://arxiv.org/abs/2602.01749
Authors: Lin Chen, Samuel Drapeau, Fanghao Shao, Xuekai Zhu, Bo Xue, Yunchong Song, Mathieu Laurière, Zhouhan Lin
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Generative Flow Network (GFlowNet) objectives implicitly fix an equal mixing of forward and backward policies, potentially constraining the exploration-exploitation trade-off during training. By further exploring the link between GFlowNets and Markov chains, we establish an equivalence between GFlowNet objectives and Markov chain reversibility, thereby revealing the origin of such constraints, and provide a framework for adapting Markov chain properties to GFlowNets. Building on these theoretical findings, we propose α-GFNs, which generalize the mixing via a tunable parameter α. This generalization enables direct control over exploration-exploitation dynamics to enhance mode discovery capabilities, while ensuring convergence to unique flows. Across various benchmarks, including Set, Bit Sequence, and Molecule Generation, α-GFN objectives consistently outperform previous GFlowNet objectives, achieving up to a 10× increase in the number of discovered modes.
[AI-103] Rethinking LoRA for Data Heterogeneous Federated Learning: Subspace and State Alignment
【Quick Read】: This paper addresses why Low-Rank Adaptation (LoRA) substantially underperforms full-parameter fine-tuning in federated learning under non-IID data. Through a with-high-probability robustness analysis, the authors trace the gap to two coupled mismatches: (i) update-space mismatch, where clients optimize in a low-rank subspace while aggregation happens in the full space; and (ii) optimizer-state mismatch, where unsynchronized adaptive states amplify drift across rounds. The key to the solution is FedGaLore, which combines client-side GaLore-style gradient-subspace optimization with server-side drift-robust synchronization of projected second-moment states via spectral shared-signal extraction, substantially improving robustness and accuracy in non-IID settings across NLU, vision, and NLG benchmarks.
Link: https://arxiv.org/abs/2602.01746
Authors: Hongyi Peng, Han Yu, Xiaoxiao Li, Qiang Yang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Low-Rank Adaptation (LoRA) is widely used for federated fine-tuning. Yet under non-IID settings, it can substantially underperform full-parameter fine-tuning. Through with-high-probability robustness analysis, we uncover that this gap can be attributed to two coupled mismatches: (i) update-space mismatch, where clients optimize in a low-rank subspace but aggregation occurs in the full space; and (ii) optimizer-state mismatch, where unsynchronized adaptive states amplify drift across rounds. We propose FedGaLore, which combines client-side GaLore-style gradient-subspace optimization with server-side drift-robust synchronization of projected second-moment states via spectral shared-signal extraction, to address this challenge. Across NLU, vision, and NLG benchmarks, FedGaLore improves robustness and accuracy over state-of-the-art federated LoRA baselines in non-IID settings.
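For reference, a minimal GaLore-style gradient-subspace projector is sketched below (client side only); the rank and refresh schedule are assumptions, and the server-side synchronization of projected second-moment states is omitted:

```python
import torch

class GradSubspaceProjector:
    """Sketch: project a weight matrix's gradient into a rank-r subspace
    (via SVD of a recent gradient), run the optimizer there, and project
    updates back to the full space."""
    def __init__(self, rank=8, refresh_every=200):
        self.rank, self.refresh_every, self.step = rank, refresh_every, 0
        self.P = None  # (m, r) orthonormal basis

    def project(self, grad):              # grad: (m, n)
        if self.P is None or self.step % self.refresh_every == 0:
            U, _, _ = torch.linalg.svd(grad, full_matrices=False)
            self.P = U[:, : self.rank]    # refresh the subspace basis
        self.step += 1
        return self.P.T @ grad            # (r, n) low-rank gradient

    def back_project(self, low_rank_update):
        return self.P @ low_rank_update   # (m, n) update applied to weights
```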
[AI-104] Probability-Entropy Calibration: An Elastic Indicator for Adaptive Fine-tuning
【Quick Read】: This paper addresses the optimization bias of one-dimensional token reweighting in supervised fine-tuning: relying only on the ground-truth probability ignores intrinsic uncertainty and can misidentify noisy or easily replaceable tokens as learning-critical, while relying only on token entropy fails to reflect target-specific alignment. The key to the solution is a probability-entropy calibration signal, the Relative Rank Indicator, which compares the actual rank of the ground-truth token in the prediction distribution with its expected rank; the inverse indicator serves as a token-wise Relative Scale that reweights the fine-tuning objective, focusing updates on genuinely under-learned tokens without over-penalizing intrinsically uncertain positions. Experiments on multiple backbones show consistent gains on mathematical reasoning, transfer to out-of-distribution reasoning, and code generation.
Link: https://arxiv.org/abs/2602.01745
Authors: Wenhao Yu, Shaohang Wei, Jiahong Liu, Yifan Li, Minda Hu, Aiwei Liu, Hao Zhang, Irwin King
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Token-level reweighting is a simple yet effective mechanism for controlling supervised fine-tuning, but common indicators are largely one-dimensional: the ground-truth probability reflects downstream alignment, while token entropy reflects intrinsic uncertainty induced by the pre-training prior. Ignoring entropy can misidentify noisy or easily replaceable tokens as learning-critical, while ignoring probability fails to reflect target-specific alignment. RankTuner introduces a probability-entropy calibration signal, the Relative Rank Indicator, which compares the rank of the ground-truth token with its expected rank under the prediction distribution. The inverse indicator is used as a token-wise Relative Scale to reweight the fine-tuning objective, focusing updates on truly under-learned tokens without over-penalizing intrinsically uncertain positions. Experiments on multiple backbones show consistent improvements on mathematical reasoning benchmarks, transfer gains on out-of-distribution reasoning, and improved code generation performance over probability-only or entropy-only reweighting baselines.
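One plausible reading of the Relative Rank Indicator is sketched below: compare the ground-truth token's rank under the prediction with the distribution's expected rank, and use the ratio as a token weight. The paper's exact functional form may differ, so treat this as an assumption-laden illustration:

```python
import torch
import torch.nn.functional as F

def relative_rank_scale(logits, targets):
    """Sketch: weight for each token position based on how much worse the
    ground-truth token's rank is than the rank expected under the model's
    own prediction distribution (values > 1 suggest under-learned tokens)."""
    probs = F.softmax(logits, dim=-1)                      # (T, V)
    sorted_probs, order = probs.sort(dim=-1, descending=True)
    ranks = torch.arange(1, probs.size(-1) + 1, device=logits.device).float()
    expected_rank = (sorted_probs * ranks).sum(-1)         # E[rank] under p
    # Rank of the ground-truth token (1 = most probable).
    gt_rank = (order == targets.unsqueeze(-1)).float().argmax(-1).float() + 1.0
    return gt_rank / expected_rank                         # weight for the CE loss
```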
[AI-105] Softmax Linear Attention: Reclaiming Global Competition
【Quick Read】: This paper addresses the expressivity loss of linear attention: it reduces the quadratic complexity of standard Transformers, but removing softmax normalization eliminates global competition, the mechanism that lets models focus sharply on relevant information amid long-sequence noise. The key to the solution is Softmax Linear Attention (SLA), which lifts the softmax operation from the token level to the attention-head level: the multi-head structure serves as coarse semantic slots, and a competitive gating mechanism dynamically selects the most relevant subspaces, restoring winner-take-all dynamics for precise information retrieval and robust long-context understanding while preserving linear time complexity.
Link: https://arxiv.org/abs/2602.01744
Authors: Mingwei Xu, Xuan Lin, Xinnan Guo, Wanqing Xu, Wanyun Cui
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 4 figures
Abstract:While linear attention reduces the quadratic complexity of standard Transformers to linear time, it often lags behind in expressivity due to the removal of softmax normalization. This omission eliminates global competition, a critical mechanism that enables models to sharply focus on relevant information amidst long-context noise. In this work, we propose Softmax Linear Attention (SLA), a framework designed to restore this competitive selection without sacrificing efficiency. By lifting the softmax operation from the token level to the head level, SLA leverages attention heads as coarse semantic slots, applying a competitive gating mechanism to dynamically select the most relevant subspaces. This reintroduces the "winner-take-all" dynamics essential for precise retrieval and robust long-context understanding. Distinct from prior methods that focus on refining local kernel functions, SLA adopts a broader perspective by exploiting the higher-level multi-head aggregation structure. Extensive experiments demonstrate that SLA consistently enhances state-of-the-art linear baselines (RetNet, GLA, GDN) across language modeling and long-context benchmarks, particularly in challenging retrieval scenarios where it significantly boosts robustness against noise, validating its capability to restore precise focus while maintaining linear complexity.
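The head-level competition can be illustrated with a short sketch: per-head relevance scores are softmaxed across the head axis and used to gate each head's linear-attention output. How the scores are computed (here, supplied by the caller) is an assumption:

```python
import torch

def head_level_softmax_gate(head_outputs, head_scores, temperature=1.0):
    """Sketch of head-level competition: a softmax over the head axis turns
    per-head relevance scores into gates that rescale each head's output.

    head_outputs: (B, H, T, D)  outputs of H linear-attention heads
    head_scores:  (B, H, T)     e.g., a pooled query-key interaction per head
    """
    gates = torch.softmax(head_scores / temperature, dim=1)  # compete across H
    return head_outputs * gates.unsqueeze(-1)                # winner-take-most
```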
[AI-106] Optimizing Prompts for Large Language Models: A Causal Approach
【Quick Read】: This paper addresses the performance instability caused by prompt sensitivity when deploying large language models (LLMs) in enterprise workflows. Existing Automatic Prompt Optimization (APO) faces two challenges: static prompt strategies cannot adapt to heterogeneous queries, and dynamic approaches depend on offline reward models that are fundamentally correlational, confounding prompt effectiveness with query characteristics. The key to the solution is Causal Prompt Optimization (CPO), which reframes prompt design as a causal estimation problem: first, an offline causal reward model is learned by applying Double Machine Learning (DML) to semantic embeddings of prompts and queries, isolating the causal effect of prompt variations from confounding query attributes; second, this unbiased reward signal guides a resource-efficient search for query-specific prompts without costly online evaluation. The method notably improves robustness on hard queries and, by shifting evaluation from real-time model execution to offline causal modeling, greatly reduces inference cost, providing a scalable, reliable, and economical foundation for prompt optimization in enterprise LLM deployments.
Link: https://arxiv.org/abs/2602.01711
Authors: Wei Chen, Yanbin Fang, Shuran Fu, Fasheng Xu, Xuan Wei
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large Language Models (LLMs) are increasingly embedded in enterprise workflows, yet their performance remains highly sensitive to prompt design. Automatic Prompt Optimization (APO) seeks to mitigate this instability, but existing approaches face two persistent challenges. First, commonly used prompt strategies rely on static instructions that perform well on average but fail to adapt to heterogeneous queries. Second, more dynamic approaches depend on offline reward models that are fundamentally correlational, confounding prompt effectiveness with query characteristics. We propose Causal Prompt Optimization (CPO), a framework that reframes prompt design as a problem of causal estimation. CPO operates in two stages. First, it learns an offline causal reward model by applying Double Machine Learning (DML) to semantic embeddings of prompts and queries, isolating the causal effect of prompt variations from confounding query attributes. Second, it utilizes this unbiased reward signal to guide a resource-efficient search for query-specific prompts without relying on costly online evaluation. We evaluate CPO across benchmarks in mathematical reasoning, visualization, and data analytics. CPO consistently outperforms human-engineered prompts and state-of-the-art automated optimizers. The gains are driven primarily by improved robustness on hard queries, where existing methods tend to deteriorate. Beyond performance, CPO fundamentally reshapes the economics of prompt optimization: by shifting evaluation from real-time model execution to an offline causal model, it enables high-precision, per-query customization at a fraction of the inference cost required by online methods. Together, these results establish causal inference as a scalable foundation for reliable and cost-efficient prompt optimization in enterprise LLM deployments.
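To ground the DML step, here is a minimal partialling-out sketch with scikit-learn, using a scalar prompt feature for illustration (the paper works with semantic embeddings); the learners and cross-fitting setup are assumptions:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

def dml_prompt_effect(prompt_feat, query_feat, outcome):
    """Partialling-out DML sketch: residualize the outcome and the prompt
    feature on query embeddings, then regress residual on residual to
    estimate the prompt's effect free of query confounding."""
    m_hat = cross_val_predict(GradientBoostingRegressor(), query_feat, outcome, cv=5)
    t_hat = cross_val_predict(GradientBoostingRegressor(), query_feat, prompt_feat, cv=5)
    y_res, t_res = outcome - m_hat, prompt_feat - t_hat
    theta = LinearRegression(fit_intercept=False).fit(t_res.reshape(-1, 1), y_res)
    return theta.coef_[0]  # estimated effect of the scalar prompt feature
```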
[AI-107] Beyond Mode Elicitation: Diversity-Preserving Reinforcement Learning via Latent Diffusion Reasoner
【Quick Read】: This paper addresses the exploration-diversity collapse in discrete reinforcement learning (RL) for chain-of-thought (CoT) generation: as policy entropy decreases, mode-elicitation behavior makes the policy converge onto a few solution modes while suppressing other potentially viable paths. The key to the solution is Latent Diffusion Reasoning with Reinforcement Learning (LaDi-RL), which performs exploration directly in a continuous latent space: a guided diffusion process uses multi-step denoising to distribute stochasticity and preserve multiple coexisting solution modes, while decoupling latent-space exploration from text-space generation improves optimization efficiency. Experiments on code generation and mathematical reasoning benchmarks show consistent gains over discrete RL baselines, with absolute Pass@1 improvements of +9.4% and +5.7%, respectively.
Link: https://arxiv.org/abs/2602.01705
Authors: Haoqiang Kang, Yizhe Zhang, Nikki Lijing Kuang, Yi-An Ma, Lianhui Qin
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent reinforcement learning (RL) methods improve LLM reasoning by optimizing discrete Chain-of-Thought (CoT) generation; however, exploration in token space often suffers from diversity collapse as policy entropy decreases due to mode elicitation behavior in discrete RL. To mitigate this issue, we propose Latent Diffusion Reasoning with Reinforcement Learning (LaDi-RL), a framework that conducts exploration directly in a continuous latent space, where latent variables encode semantic-level reasoning trajectories. By modeling exploration via guided diffusion, multi-step denoising distributes stochasticity and preserves multiple coexisting solution modes without mutual suppression. Furthermore, by decoupling latent-space exploration from text-space generation, we show that latent diffusion-based optimization is more effective than text-space policy optimization alone, while a complementary text policy provides additional gains when combined with latent exploration. Experiments on code generation and mathematical reasoning benchmarks demonstrate consistent improvements in both pass@1 and pass@k over discrete RL baselines, with absolute pass@1 gains of +9.4% on code generation and +5.7% on mathematical reasoning, highlighting diffusion-based latent RL as a principled alternative to discrete token-level RL for reasoning.
[AI-108] Meta Engine: A Unified Semantic Query Engine on Heterogeneous LLM-Based Query Systems
【Quick Read】: This paper addresses the integration difficulties caused by a fragmented ecosystem of semantic query systems for multi-modal data: incompatible APIs across systems, and a fundamental trade-off between specialized systems (strong within a single modality) and "all-in-one" general systems (weaker per modality). The key to the solution is Meta Engine, a "query system on query systems" whose architecture comprises five components: a natural language query parser, an operator generator, a query router, a set of adapters, and a result aggregator. By unifying heterogeneous, specialized LLM-based query systems, it enables efficient semantic access to multi-modal data, achieving 3-6x higher F1 than baselines in most cases and up to 24x on specific datasets.
Link: https://arxiv.org/abs/2602.01701
Authors: Ruyu Li, Tinghui Zhang, Haodi Ma, Daisy Zhe Wang, Yifan Wang
Affiliation: Unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments:
Abstract:With the increasing use of multi-modal data, semantic query has become more and more demanded in data management systems, as an important way to access and analyze multi-modal data. As unstructured data, most information of multi-modal data (text, image, video, etc.) hides in the semantics, which cannot be accessed by traditional database queries like SQL. Given the power of Large Language Models (LLMs) in understanding semantics and processing natural language, several LLM-based semantic query systems have been proposed in recent years to support semantic querying over unstructured data. However, this rapid growth has produced a fragmented ecosystem. Applications face significant integration challenges due to (1) disparate APIs of different semantic query systems and (2) a fundamental trade-off between specialization and generality. Many semantic query systems are highly specialized, offering state-of-the-art performance within a single modality but struggling with multi-modal data. Conversely, some "all-in-one" systems handle multiple modalities but often exhibit suboptimal performance compared to their specialized counterparts in specific modalities. This paper introduces Meta Engine, a novel "query system on query systems", designed to resolve those aforementioned challenges. Meta Engine is a unified semantic query engine that integrates heterogeneous, specialized LLM-based query systems. Its architecture comprises five key components: (1) a Natural Language (NL) Query Parser, (2) an Operator Generator, (3) a Query Router, (4) a set of Adapters, and (5) a Result Aggregator. In the evaluation, Meta Engine consistently outperforms all baselines, yielding 3-6x higher F1 in most cases and up to 24x on specific datasets.
[AI-109] Mitigating loss of control in advanced AI systems through instrumental goal trajectories
【Quick Read】: This paper addresses the risk that highly capable artificial intelligence (AI) systems erode human control by pursuing instrumental goals; existing mitigations are mostly technical and system-centric, such as tracking model capability, shaping behavior via reinforcement learning from human feedback, and designing corrigible and interruptible systems. The key to the solution is instrumental goal trajectories (IGTs): three organisational pathways (procurement, governance, and finance) through which AI systems obtain the technical and monetary resources that capability gains depend on. These IGTs produce observable organisational artefacts that provide intervention points when a system's capabilities or behavior exceed acceptable thresholds, extending the implementation of corrigibility and interruptibility from properties of a single model to the organisational systems that enable it, and thereby strengthening overall control over AI systems.
Link: https://arxiv.org/abs/2602.01699
Authors: Willem Fourie
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:
Abstract:Researchers at artificial intelligence labs and universities are concerned that highly capable artificial intelligence (AI) systems may erode human control by pursuing instrumental goals. Existing mitigations remain largely technical and system-centric: tracking capability in advanced systems, shaping behaviour through methods such as reinforcement learning from human feedback, and designing systems to be corrigible and interruptible. Here we develop instrumental goal trajectories to expand these options beyond the model. Gaining capability typically depends on access to additional technical resources, such as compute, storage, data and adjacent services, which in turn requires access to monetary resources. In organisations, these resources can be obtained through three organisational pathways. We label these pathways the procurement, governance and finance instrumental goal trajectories (IGTs). Each IGT produces a trail of organisational artefacts that can be monitored and used as intervention points when a system's capabilities or behaviour exceed acceptable thresholds. In this way, IGTs offer concrete avenues for defining capability levels and for broadening how corrigibility and interruptibility are implemented, shifting attention from model properties alone to the organisational systems that enable them.
[AI-110] Beyond Dense States: Elevating Sparse Transcoders to Active Operators for Latent Reasoning
【Quick Read】: This paper addresses the tension between traditional latent reasoning, whose dense latent transitions are hard to interpret and control, and sparse representation models, which uncover interpretable features but remain confined to post-hoc analysis. The key to the solution is LSTR (Latent Sparse Transcoder Reasoning), which elevates functional sparse transcoders into active reasoning operators that perform multi-step computation through sparse semantic transitions: a Latent Transition Transcoder (LTT) with a residual skip architecture decouples linear manifold transport from sparse semantic updates, enabling controllable semantic resolution via explicit sparsity constraints. LSTR preserves reasoning accuracy and compression efficiency while substantially improving interpretability, and causal interventions and trajectory analyses verify that the sparse features act as causally effective reasoning operators.
Link: https://arxiv.org/abs/2602.01695
Authors: Yadong Wang, Haodong Chen, Yu Tian, Chuanxing Geng, Dong Liang, Xiang Chen
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Latent reasoning compresses the chain-of-thought (CoT) into continuous hidden states, yet existing methods rely on dense latent transitions that remain difficult to interpret and control. Meanwhile, sparse representation models uncover human-interpretable semantic features but remain largely confined to post-hoc analysis. We reconcile this tension by proposing LSTR (Latent Sparse Transcoder Reasoning), a latent reasoning framework that elevates functional sparse transcoders into active reasoning operators to perform multi-step computation through sparse semantic transitions. At its core, LSTR employs a Latent Transition Transcoder (LTT) with a residual skip architecture that decouples linear manifold transport from sparse semantic updates, enabling controllable semantic resolution via explicit sparsity constraints. Extensive experiments show that LSTR preserves reasoning accuracy and compression efficiency while substantially improving interpretability over dense latent baselines. Causal interventions and trajectory analyses further demonstrate that these sparse features act as both interpretable and causally effective operators in the reasoning process.
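A minimal sketch of a residual-skip transcoder in the spirit of the LTT follows; the top-k sparsity, layer sizes, and ReLU encoder are assumptions standing in for the paper's exact architecture:

```python
import torch
import torch.nn as nn

class LatentTransitionTranscoder(nn.Module):
    """Sketch: a linear skip path transports the latent state (manifold
    transport), while a sparse top-k bottleneck adds semantic updates."""
    def __init__(self, d_model, d_feat=4096, k=32):
        super().__init__()
        self.k = k
        self.skip = nn.Linear(d_model, d_model)   # linear manifold transport
        self.enc = nn.Linear(d_model, d_feat)
        self.dec = nn.Linear(d_feat, d_model)

    def forward(self, h):
        acts = torch.relu(self.enc(h))
        topv, topi = acts.topk(self.k, dim=-1)    # explicit sparsity constraint
        sparse = torch.zeros_like(acts).scatter(-1, topi, topv)
        return self.skip(h) + self.dec(sparse)    # next-step latent state
```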
[AI-111] What LLMs Think When You Don't Tell Them What to Think About?
【Quick Read】: This paper addresses the difficulty of systematically characterizing the generative behavior of large language models (LLMs) in the absence of explicit topic prompts, a gap that limits comprehensive observation for AI safety and reliable monitoring, since existing analyses mostly rely on task- or topic-specific prompts. The key to the solution is to probe near-unconstrained generation with minimal, topic-neutral inputs at scale (256,000 samples across 16 LLMs): despite no explicit topical guidance, each model family exhibits strong, systematic topical preferences (e.g., GPT-OSS skews toward programming and mathematics, Llama toward literary content), along with differences in content depth and distinctive degeneration behaviors (such as repetitive phrases or leaked personal URLs), offering a new lens on models' intrinsic tendencies.
Link: https://arxiv.org/abs/2602.01689
Authors: Yongchan Kwon, James Zou
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: NA
Abstract:Characterizing the behavior of large language models (LLMs) across diverse settings is critical for reliable monitoring and AI safety. However, most existing analyses rely on topic- or task-specific prompts, which can substantially limit what can be observed. In this work, we study what LLMs generate from minimal, topic-neutral inputs and probe their near-unconstrained generative behavior. Despite the absence of explicit topics, model outputs cover a broad semantic space, and surprisingly, each model family exhibits strong and systematic topical preferences. GPT-OSS predominantly generates programming (27.1%) and mathematical content (24.6%), whereas Llama most frequently generates literary content (9.1%). DeepSeek often generates religious content, while Qwen frequently generates multiple-choice questions. Beyond topical preferences, we also observe differences in content specialization and depth: GPT-OSS often generates more technically advanced content (e.g., dynamic programming) compared with other models (e.g., basic Python). Furthermore, we find that the near-unconstrained generation often degenerates into repetitive phrases, revealing interesting behaviors unique to each model family. For instance, degenerate outputs from Llama include multiple URLs pointing to personal Facebook and Instagram accounts. We release the complete dataset of 256,000 samples from 16 LLMs, along with a reproducible codebase.
[AI-112] Semantic-aware Wasserstein Policy Regularization for Large Language Model Alignment ICLR2026
【Quick Read】: This paper addresses a limitation of KL- and f-divergence policy regularizers in current RLHF frameworks: they compare token probabilities only at identical positions and therefore cannot capture semantic similarity. The key to the solution is Wasserstein Policy Regularization (WPR), a semantic-aware regularizer built on the entropy-regularized Wasserstein distance, which measures policy differences using the geometry of the token space; via optimal dual variables, the regularizer becomes penalty terms on the reward, yielding a tractable training objective compatible with standard reinforcement learning algorithms that empirically outperforms KL- and f-divergence-based baselines.
Link: https://arxiv.org/abs/2602.01685
Authors: Byeonghu Na, Hyungho Na, Yeongmin Kim, Suhyeon Jo, HeeSun Bae, Mina Kang, Il-Chul Moon
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ICLR 2026
Abstract:Large language models (LLMs) are commonly aligned with human preferences using reinforcement learning from human feedback (RLHF). In this method, LLM policies are generally optimized through reward maximization with Kullback-Leibler (KL) divergence regularization toward the reference policy. However, KL and its f-divergence variants only compare token probabilities at identical indices, failing to capture semantic similarity. We propose Wasserstein Policy Regularization (WPR), a semantic-aware regularization for the RLHF framework based on the entropy-regularized Wasserstein distance, which incorporates the geometry of the token space. The dual formulation of the distance expresses the regularization as penalty terms applied to the reward via optimal dual variables, which yield a tractable objective compatible with standard RL algorithms. Empirically, our method outperforms KL- and f-divergence-based baselines, demonstrating the benefits of semantic-aware policy distances for alignment. Our code is available at this https URL.
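For reference, the entropy-regularized Wasserstein distance at the core of WPR can be computed with a standard Sinkhorn iteration; the sketch below assumes two next-token distributions and a token-embedding cost matrix supplied by the caller (the dual-variable reward penalty from the paper is not shown):

```python
import torch

def sinkhorn_distance(p, q, cost, eps=0.1, n_iters=50):
    """Entropy-regularized Wasserstein distance between two next-token
    distributions p, q of shape (V,), with cost[i, j] derived from
    token-embedding geometry (e.g., 1 - cosine similarity)."""
    K = torch.exp(-cost / eps)                  # Gibbs kernel
    u = torch.ones_like(p)
    v = torch.ones_like(q)
    for _ in range(n_iters):                    # Sinkhorn fixed-point updates
        u = p / (K @ v + 1e-12)
        v = q / (K.T @ u + 1e-12)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)  # approximate transport plan
    return (plan * cost).sum()
```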
[AI-113] TRIP-Bench: A Benchmark for Long-Horizon Interactive Agents in Real-World Scenarios
【Quick Read】: This paper addresses three key challenges that current large language model (LLM) agents face in complex real-world deployments: insufficient enforcement of global constraints, weak multi-tool coordinated reasoning, and poor adaptation to evolving user behavior over long multi-turn interactions. The key contribution is TRIP-Bench, a long-horizon benchmark grounded in realistic travel planning that provides 18 carefully designed tools, 40+ travel requirements, and automated evaluation; it includes splits of varying difficulty, with the hard split emphasizing long ambiguous interactions, style shifts, feasibility changes, and iterative version revision. The paper further proposes GTPO (Generalized Temporal Policy Optimization), an online multi-turn reinforcement learning method with specialized reward normalization and reward differencing that significantly improves constraint satisfaction and interaction robustness; applied to Qwen2.5-32B-Instruct, it outperforms Gemini-3-Pro in the authors' evaluation.
Link: https://arxiv.org/abs/2602.01675
Authors: Yuanzhe Shen, Zisu Huang, Zhengyuan Wang, Muzhao Tian, Zhengkang Guo, Chenyang Zhang, Shuaiyu Zhou, Zengjie Hu, Dailin Li, Jingwen Xu, Kaimin Wang, Wenhao Liu, Tianlong Li, Fengpeng Yue, Feng Hong, Cao Liu, Ke Zeng
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 40 pages, 6 figures
Abstract:As LLM-based agents are deployed in increasingly complex real-world settings, existing benchmarks underrepresent key challenges such as enforcing global constraints, coordinating multi-tool reasoning, and adapting to evolving user behavior over long, multi-turn interactions. To bridge this gap, we introduce TRIP-Bench, a long-horizon benchmark grounded in realistic travel-planning scenarios. TRIP-Bench leverages real-world data, offers 18 curated tools and 40+ travel requirements, and supports automated evaluation. It includes splits of varying difficulty; the hard split emphasizes long and ambiguous interactions, style shifts, feasibility changes, and iterative version revision. Dialogues span up to 15 user turns, can involve 150+ tool calls, and may exceed 200k tokens of context. Experiments show that even advanced models achieve at most 50% success on the easy split, with performance dropping below 10% on hard subsets. We further propose GTPO, an online multi-turn reinforcement learning method with specialized reward normalization and reward differencing. Applied to Qwen2.5-32B-Instruct, GTPO improves constraint satisfaction and interaction robustness, outperforming Gemini-3-Pro in our evaluation. We expect TRIP-Bench to advance practical long-horizon interactive agents, and GTPO to provide an effective online RL recipe for robust long-horizon training.
[AI-114] AI-Assisted Adaptive Rendering for High-Frequency Security Telemetry in Web Interfaces
【Quick Read】: This paper addresses the failure of traditional rendering techniques based on static pagination or fixed polling intervals in modern cybersecurity platforms that must display high-frequency telemetry (network logs, endpoint events, alerts, policy changes): once volumes exceed hundreds of thousands of events per second, the UI freezes, drops frames, or shows stale data. The key to the solution is an AI-assisted adaptive rendering framework that dynamically regulates visual update frequency, prioritizes semantically relevant events, and selectively aggregates lower-priority data using behavior-driven heuristics and lightweight on-device machine learning models, reducing rendering overhead by 45-60% while preserving analysts' perception of real-time responsiveness.
Link: https://arxiv.org/abs/2602.01671
Authors: Mona Rajhans
Affiliation: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: To appear in IEEE ICCA 2025 proceedings
Abstract:Modern cybersecurity platforms must process and display high-frequency telemetry such as network logs, endpoint events, alerts, and policy changes in real time. Traditional rendering techniques based on static pagination or fixed polling intervals fail under volume conditions exceeding hundreds of thousands of events per second, leading to UI freezes, dropped frames, or stale data. This paper presents an AI-assisted adaptive rendering framework that dynamically regulates visual update frequency, prioritizes semantically relevant events, and selectively aggregates lower-priority data using behavior-driven heuristics and lightweight on-device machine learning models. Experimental validation demonstrates a 45-60 percent reduction in rendering overhead while maintaining analyst perception of real-time responsiveness.
[AI-115] ASGMamba: Adaptive Spectral Gating Mamba for Multivariate Time Series Forecasting
【Quick Read】: This paper addresses the efficiency-accuracy dilemma in long-term multivariate time series forecasting (LTSF): Transformer-based models scale poorly to long sequences due to quadratic complexity, while linear state space models (SSMs) are susceptible to high-frequency noise that wastes state capacity. The key to the solution is ASGMamba, which introduces a lightweight Adaptive Spectral Gating (ASG) mechanism that dynamically filters noise according to local spectral energy, letting the Mamba backbone focus its state evolution on robust temporal dynamics; a hierarchical multi-scale architecture with variable-specific Node Embeddings further captures diverse physical characteristics, substantially reducing memory usage and improving accuracy while keeping strict O(L) linear complexity.
Link: https://arxiv.org/abs/2602.01668
Authors: Qianyang Li, Xingjun Zhang, Shaoxun Wang, Jia Wei, Yueqi Xing
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Long-term multivariate time series forecasting (LTSF) plays a crucial role in various high-performance computing applications, including real-time energy grid management and large-scale traffic flow simulation. However, existing solutions face a dilemma: Transformer-based models suffer from quadratic complexity, limiting their scalability on long sequences, while linear State Space Models (SSMs) often struggle to distinguish valuable signals from high-frequency noise, leading to wasted state capacity. To bridge this gap, we propose ASGMamba, an efficient forecasting framework designed for resource-constrained supercomputing environments. ASGMamba integrates a lightweight Adaptive Spectral Gating (ASG) mechanism that dynamically filters noise based on local spectral energy, enabling the Mamba backbone to focus its state evolution on robust temporal dynamics. Furthermore, we introduce a hierarchical multi-scale architecture with variable-specific Node Embeddings to capture diverse physical characteristics. Extensive experiments on nine benchmarks demonstrate that ASGMamba achieves state-of-the-art accuracy. While keeping strictly O(L) complexity, we significantly reduce memory usage on long-horizon tasks, establishing ASGMamba as a scalable solution for high-throughput forecasting in resource-limited environments. The code is available at this https URL
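The spectral-gating intuition can be illustrated with a short, non-learned sketch: keep the highest-energy frequency bins of each window and damp the rest. ASGMamba's actual gate is learned and adaptive, so the quantile threshold here is purely an assumption:

```python
import torch

def adaptive_spectral_gate(x, keep_ratio=0.25):
    """Sketch: gate windows of a time series by local spectral energy,
    keeping dominant frequency bins and suppressing noisy ones.

    x: (B, L) univariate windows; returns the filtered windows.
    """
    spec = torch.fft.rfft(x, dim=-1)
    energy = spec.abs() ** 2                          # per-bin spectral energy
    thresh = torch.quantile(energy, 1 - keep_ratio, dim=-1, keepdim=True)
    gate = (energy >= thresh).float()                 # keep high-energy bins
    return torch.fft.irfft(spec * gate, n=x.size(-1), dim=-1)
```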
[AI-116] TABX: A High-Throughput Sandbox Battle Simulator for Multi-Agent Reinforcement Learning
【Quick Read】: This paper addresses the lack of modularity in current multi-agent reinforcement learning (MARL) benchmarks, which makes custom experimental scenarios hard to design. The key to the solution is the Totally Accelerated Battle Simulator in JAX (TABX), a high-throughput sandbox environment with fine-grained control over environmental parameters, supporting systematic study of emergent agent behaviors and algorithmic trade-offs across diverse task complexities; leveraging JAX for GPU hardware acceleration and massive parallelization, it greatly reduces computational overhead and provides a scalable, extensible, and easily customized experimental platform for MARL research in complex structured domains.
Link: https://arxiv.org/abs/2602.01665
Authors: Hayeong Lee, JunHyeok Oh, Byung-Jun Lee
Affiliation: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:The design of environments plays a critical role in shaping the development and evaluation of cooperative multi-agent reinforcement learning (MARL) algorithms. While existing benchmarks highlight critical challenges, they often lack the modularity required to design custom evaluation scenarios. We introduce the Totally Accelerated Battle Simulator in JAX (TABX), a high-throughput sandbox designed for reconfigurable multi-agent tasks. TABX provides granular control over environmental parameters, permitting a systematic investigation into emergent agent behaviors and algorithmic trade-offs across a diverse spectrum of task complexities. Leveraging JAX for hardware-accelerated execution on GPUs, TABX enables massive parallelization and significantly reduces computational overhead. By providing a fast, extensible, and easily customized framework, TABX facilitates the study of MARL agents in complex structured domains and serves as a scalable foundation for future research. Our code is available at: this https URL.
[AI-117] FlowSteer: Interactive Agentic Workflow Orchestration via End-to-End Reinforcement Learning
【Quick Read】: This paper addresses three challenges of existing workflow orchestration: high manual cost, reliance on specific operators or large language models (LLMs), and sparse reward signals. The key to the solution is FlowSteer, an end-to-end reinforcement learning framework that pairs a lightweight policy model (the agent) with an executable canvas environment to automate orchestration through multi-turn interaction: the policy model analyzes execution states and selects editing actions, while the canvas executes operators and returns feedback for iterative refinement. To stabilize training and suppress shortcut behaviors, the authors further design Canvas Workflow Relative Policy Optimization (CWRPO), which introduces diversity-constrained rewards with a conditional release mechanism, improving learning efficiency and generalization.
Link: https://arxiv.org/abs/2602.01664
Authors: Mingda Zhang, Haoran Luo, Tiesunlong Shen, Qika Lin, Xiaoying Tang, Rui Mao, Erik Cambria
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 41 pages, 7 figures, 6 tables. Project page: this http URL
Abstract:In recent years, a variety of powerful agentic workflows have been applied to solve a wide range of human problems. However, existing workflow orchestration still faces key challenges, including high manual cost, reliance on specific operators/large language models (LLMs), and sparse reward signals. To address these challenges, we propose FlowSteer, an end-to-end reinforcement learning framework that takes a lightweight policy model as the agent and an executable canvas environment, automating workflow orchestration through multi-turn interaction. In this process, the policy model analyzes execution states and selects editing actions, while the canvas executes operators and returns feedback for iterative refinement. Moreover, FlowSteer provides a plug-and-play framework that supports diverse operator libraries and interchangeable LLM backends. To effectively train this interaction paradigm, we propose Canvas Workflow Relative Policy Optimization (CWRPO), which introduces diversity-constrained rewards with conditional release to stabilize learning and suppress shortcut behaviors. Experimental results on twelve datasets show that FlowSteer significantly outperforms baselines across various tasks.
[AI-118] Efficient Adversarial Attacks on High-dimensional Offline Bandits ICLR2026
【Quick Read】: This paper addresses the largely unexplored adversarial robustness of offline contextual bandit algorithms used to evaluate generative AI models when the reward model is perturbed: specifically, how small, targeted modifications to the weights of a deployed reward model (rather than tampering with training data) affect bandit learning on offline logged data and induce the selection of suboptimal models. The key to the solution is a novel adversarial threat model, analyzed both theoretically and empirically, which reveals a high-dimensional effect: the perturbation norm required for a successful attack decreases as input dimensionality grows, so in high-dimensional settings such as image evaluation, tiny, imperceptible weight perturbations suffice to hijack bandit decisions; naive random perturbations are ineffective, while precisely targeted, gradient-optimized attacks achieve near-perfect success rates, offering key insights and directions for hardening offline evaluation systems.
Link: https://arxiv.org/abs/2602.01658
Authors: Seyed Mohammad Hadi Hosseini, Amir Najafi, Mahdieh Soleymani Baghshah
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ICLR 2026 Conference
Abstract:Bandit algorithms have recently emerged as a powerful tool for evaluating machine learning models, including generative image models and large language models, by efficiently identifying top-performing candidates without exhaustive comparisons. These methods typically rely on a reward model, often distributed with public weights on platforms such as Hugging Face, to provide feedback to the bandit. While online evaluation is expensive and requires repeated trials, offline evaluation with logged data has become an attractive alternative. However, the adversarial robustness of offline bandit evaluation remains largely unexplored, particularly when an attacker perturbs the reward model (rather than the training data) prior to bandit training. In this work, we fill this gap by investigating, both theoretically and empirically, the vulnerability of offline bandit training to adversarial manipulations of the reward model. We introduce a novel threat model in which an attacker exploits offline data in high-dimensional settings to hijack the bandit’s behavior. Starting with linear reward functions and extending to nonlinear models such as ReLU neural networks, we study attacks on two Hugging Face evaluators used for generative model assessment: one measuring aesthetic quality and the other assessing compositional alignment. Our results show that even small, imperceptible perturbations to the reward model’s weights can drastically alter the bandit’s behavior. From a theoretical perspective, we prove a striking high-dimensional effect: as input dimensionality increases, the perturbation norm required for a successful attack decreases, making modern applications such as image evaluation especially vulnerable. Extensive experiments confirm that naive random perturbations are ineffective, whereas carefully targeted perturbations achieve near-perfect attack success rates …
[AI-119] ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project Development
【Quick Read】: This paper addresses the lag between current coding-agent evaluation, which focuses on fine-grained tasks such as fixing individual bugs, and end-to-end software development practice, making it hard to measure agents' overall capabilities on real projects. The key to the solution is ProjDevBench, an end-to-end benchmark that provides complete project requirements and evaluates the quality of the generated repositories, combining Online Judge (OJ) testing with LLM-assisted code review to quantify system architecture design, functional correctness, and iterative solution refinement, thereby advancing principled evaluation of coding agents in complex engineering scenarios.
Link: https://arxiv.org/abs/2602.01655
Authors: Pengrui Lu, Shiqi Zhang, Yunzhong Hou, Lyumanshan Ye, Chaoyi Huang, Zixi Chen, Ji Zeng, Hantao Jiang, Pengfei Liu, Yiwei Wang, Ming-Hsuan Yang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:
Abstract:Recent coding agents can generate complete codebases from simple prompts, yet existing evaluations focus on issue-level bug fixing and lag behind end-to-end development. We introduce ProjDevBench, an end-to-end benchmark that provides project requirements to coding agents and evaluates the resulting repositories. Combining Online Judge (OJ) testing with LLM-assisted code review, the benchmark evaluates agents on (1) system architecture design, (2) functional correctness, and (3) iterative solution refinement. We curate 20 programming problems across 8 categories, covering both concept-oriented tasks and real-world application scenarios, and evaluate six coding agents built on different LLM backends. Our evaluation reports an overall acceptance rate of 27.38%: agents handle basic functionality and data structures but struggle with complex system design, time complexity optimization, and resource management. Our benchmark is available at this https URL.
[AI-120] On the Spatiotemporal Dynamics of Generalization in Neural Networks
【Quick Read】: This paper asks why neural networks fail to generalize mathematical operations such as addition beyond training lengths (e.g., from 16-digit to 32-digit or longer numbers), while children who learn the rule can reason over arbitrary lengths. The key to the solution is three physically grounded postulates that any generalizing system must satisfy: Locality (information propagates at finite speed), Symmetry (the laws of computation are invariant across space and time), and Stability (the system converges to discrete, noise-resistant attractors). From these postulates the authors derive, rather than design, the Spatiotemporal Evolution with Attractor Dynamics (SEAD) architecture: a neural cellular automaton whose local convolutional rules are iterated until convergence. Experiments on parity, addition, and Rule 110 demonstrate genuine length generalization and stable reasoning, suggesting that the gap between statistical learning and logical reasoning can be bridged by respecting the physics of computation rather than merely scaling parameters.
Link: https://arxiv.org/abs/2602.01651
Authors: Zichao Wei
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Why do neural networks fail to generalize addition from 16-digit to 32-digit numbers, while a child who learns the rule can apply it to arbitrarily long sequences? We argue that this failure is not an engineering problem but a violation of physical postulates. Drawing inspiration from physics, we identify three constraints that any generalizing system must satisfy: (1) Locality – information propagates at finite speed; (2) Symmetry – the laws of computation are invariant across space and time; (3) Stability – the system converges to discrete attractors that resist noise accumulation. From these postulates, we derive – rather than design – the Spatiotemporal Evolution with Attractor Dynamics (SEAD) architecture: a neural cellular automaton where local convolutional rules are iterated until convergence. Experiments on three tasks validate our theory: (1) Parity – demonstrating perfect length generalization via light-cone propagation; (2) Addition – achieving scale-invariant inference from L=16 to L=1 million with 100% accuracy, exhibiting input-adaptive computation; (3) Rule 110 – learning a Turing-complete cellular automaton without trajectory divergence. Our results suggest that the gap between statistical learning and logical reasoning can be bridged – not by scaling parameters, but by respecting the physics of computation.
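A minimal sketch of the derived architecture's core loop follows: a single shared 1-D convolutional rule (locality plus translation symmetry) iterated until the state stops changing (stability); the channel counts and the convergence test are assumptions:

```python
import torch
import torch.nn as nn

class LocalRuleNCA(nn.Module):
    """Sketch of a 1-D neural cellular automaton: one shared local
    convolutional rule applied everywhere, iterated toward a fixed point."""
    def __init__(self, channels=8, kernel_size=3):
        super().__init__()
        self.rule = nn.Sequential(
            nn.Conv1d(channels, 32, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(32, channels, 1),
        )

    def forward(self, state, max_steps=256, tol=1e-4):
        # state: (B, C, L); iterate the local rule to approximate convergence.
        for _ in range(max_steps):
            nxt = torch.sigmoid(self.rule(state))
            if (nxt - state).abs().max() < tol:   # reached a discrete attractor
                break
            state = nxt
        return state
```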
[AI-121] De Novo Molecular Generation from Mass Spectra via Many-Body Enhanced Diffusion
【Quick Read】: This paper addresses the limitations of current mass-spectrometry-based molecular structure generation in handling complex isomers and non-local fragmentation mechanisms: existing atom-centric or pairwise-interaction models overlook higher-order edge interactions and cannot systematically capture many-body characteristics. The key to the solution is MBGen, a many-body enhanced diffusion framework that integrates a many-body attention mechanism with higher-order edge modeling to fully exploit the rich structural information encoded in MS/MS spectra, enabling accurate de novo generation of novel molecules and isomer differentiation; experiments on the NPLIB1 and MassSpecGym benchmarks show improvements of up to 230% over state-of-the-art methods.
Link: https://arxiv.org/abs/2602.01643
Authors: Xichen Sun, Wentao Wei, Jiahua Rao, Jiancong Xie, Yuedong Yang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Molecular structure generation from mass spectrometry is fundamental for understanding cellular metabolism and discovering novel compounds. Although tandem mass spectrometry (MS/MS) enables the high-throughput acquisition of fragment fingerprints, these spectra often reflect higher-order interactions involving the concerted cleavage of multiple atoms and bonds-crucial for resolving complex isomers and non-local fragmentation mechanisms. However, most existing methods adopt atom-centric and pairwise interaction modeling, overlooking higher-order edge interactions and lacking the capacity to systematically capture essential many-body characteristics for structure generation. To overcome these limitations, we present MBGen, a Many-Body enhanced diffusion framework for de novo molecular structure Generation from mass spectra. By integrating a many-body attention mechanism and higher-order edge modeling, MBGen comprehensively leverages the rich structural information encoded in MS/MS spectra, enabling accurate de novo generation and isomer differentiation for novel molecules. Experimental results on the NPLIB1 and MassSpecGym benchmarks demonstrate that MBGen achieves superior performance, with improvements of up to 230% over state-of-the-art methods, highlighting the scientific value and practical utility of many-body modeling for mass spectrometry-based molecular generation. Further analysis and ablation studies show that our approach effectively captures higher-order interactions and exhibits enhanced sensitivity to complex isomeric and non-local fragmentation information.
[AI-122] The Effect of Mini-Batch Noise on the Implicit Bias of Adam
【Quick Read】: This paper addresses how mini-batch noise shapes the implicit bias of optimizers in multi-epoch training, in particular how Adam(W)'s two momentum hyperparameters (β1, β2), which control memory, bias the optimizer toward sharper or flatter regions of the loss landscape, a property commonly correlated with the generalization gap. The key to the solution is a theoretical framework revealing a nonlinear interaction between batch size and (β1, β2): with large batches, higher β2 increases the magnitude of anti-regularization by memory (hurting generalization), but as batches shrink this dependence reverses, and β1 exhibits a similar monotonicity reversal in the opposite direction. Consequently, the optimal momentum pair differs markedly across batch sizes: the common default (β1, β2) = (0.9, 0.999) suits small batches, while for large batches moving β1 closer to β2 can substantially improve validation accuracy. The theory further connects the batch-size scale at which this shift occurs to the critical batch size, and the effect is validated in small-scale, about-to-overfit experiments.
Link: https://arxiv.org/abs/2602.01642
Authors: Matias D. Cattaneo, Boris Shigida
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Computation (stat.CO); Machine Learning (stat.ML)
Comments:
Abstract:With limited high-quality data and growing compute, multi-epoch training is gaining back its importance across sub-areas of deep learning. Adam(W), versions of which are go-to optimizers for many tasks such as next token prediction, has two momentum hyperparameters (β1, β2) controlling memory and one very important hyperparameter, batch size, controlling (in particular) the amount of mini-batch noise. We introduce a theoretical framework to understand how mini-batch noise influences the implicit bias of memory in Adam (depending on β1, β2) towards sharper or flatter regions of the loss landscape, which is commonly observed to correlate with the generalization gap in multi-epoch training. We find that in the case of large batch sizes, higher β2 increases the magnitude of anti-regularization by memory (hurting generalization), but as the batch size becomes smaller, the dependence of (anti-)regularization on β2 is reversed. A similar monotonicity shift (in the opposite direction) happens in β1. In particular, the commonly "default" pair (β1, β2) = (0.9, 0.999) is a good choice if batches are small; for larger batches, in many settings moving β1 closer to β2 is much better in terms of validation accuracy in multi-epoch training. Moreover, our theoretical derivations connect the scale of the batch size at which the shift happens to the scale of the critical batch size. We illustrate this effect in experiments with small-scale data in the about-to-overfit regime.
[AI-123] Toward Enhancing Representation Learning in Federated Multi-Task Settings ICLR2026
【Quick Read】: This paper addresses the limited real-world applicability of federated multi-task learning (FMTL) methods that assume model congruity, i.e., fully or partially homogeneous models across users, whereas practical settings exhibit model and task heterogeneity. The key to the solution is Muscle loss, a novel contrastive learning objective that builds a shared representation space across tasks rather than sharing model parameters, by simultaneously aligning the representations of all participating models; its minimization is equivalent to maximizing the mutual information among all models' representations, capturing global dependencies across tasks. Building on this principle, the communication-efficient FedMuscle algorithm naturally handles model and task heterogeneity and significantly outperforms state-of-the-art baselines on diverse image and language tasks.
Link: https://arxiv.org/abs/2602.01626
Authors: Mehdi Setayesh, Mahdi Beitollahi, Yasser H. Khalil, Hongliang Li
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted at ICLR 2026
Abstract:Federated multi-task learning (FMTL) seeks to collaboratively train customized models for users with different tasks while preserving data privacy. Most existing approaches assume model congruity (i.e., the use of fully or partially homogeneous models) across users, which limits their applicability in realistic settings. To overcome this limitation, we aim to learn a shared representation space across tasks rather than shared model parameters. To this end, we propose Muscle loss, a novel contrastive learning objective that simultaneously aligns representations from all participating models. Unlike existing multi-view or multi-model contrastive methods, which typically align models pairwise, Muscle loss can effectively capture dependencies across tasks because its minimization is equivalent to the maximization of mutual information among all the models’ representations. Building on this principle, we develop FedMuscle, a practical and communication-efficient FMTL algorithm that naturally handles both model and task heterogeneity. Experiments on diverse image and language tasks demonstrate that FedMuscle consistently outperforms state-of-the-art baselines, delivering substantial improvements and robust performance across heterogeneous settings.
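摘要未给出 Muscle loss 的闭式定义,下面按"同时对齐所有模型表示(而非两两对齐)"的精神给出一个假设性的草图:把每个模型的表示与全体模型的平均表示做 InfoNCE 对齐。这只是示意性构造,并非论文原始损失:

```python
import torch
import torch.nn.functional as F

def muscle_like_loss(reps, temperature=0.1):
    # reps: 来自不同用户/任务模型的同一批样本的表示列表,每项 [batch, dim]
    # (异构模型需先经各自投影头映射到同一维度)
    reps = [F.normalize(r, dim=-1) for r in reps]
    anchor = F.normalize(torch.stack(reps).mean(dim=0), dim=-1)  # 全体模型的"共识"表示
    loss = 0.0
    for r in reps:
        logits = r @ anchor.t() / temperature   # [batch, batch]
        labels = torch.arange(r.size(0))        # 同一样本的表示互为正例
        loss = loss + F.cross_entropy(logits, labels)
    return loss / len(reps)

reps = [torch.randn(8, 32) for _ in range(3)]   # 3 个模型的输出
print(muscle_like_loss(reps))
```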
zh
[AI-124] SUSD: Structured Unsupervised Skill Discovery through State Factorization ICLR2026
【速读】:该论文旨在解决无监督技能发现(Unsupervised Skill Discovery, USD)中因互信息(Mutual Information, MI)最大化方法倾向于生成简单、静态技能而导致的动态性与任务相关行为缺失问题,以及现有基于状态空间距离的方法在促进全面技能集(涵盖所有可控因素)方面的不足。解决方案的关键在于提出SUSD框架,通过将状态空间因子分解为独立组件(如物体或可控实体),并为不同因素分配独立的技能变量,实现对技能发现过程的细粒度控制;同时引入动态模型以跟踪各因素的学习进度,并自适应地引导智能体关注未充分探索的因素,从而提升技能多样性与复杂性,并获得可解耦的技能表示,支持通过分层强化学习(Hierarchical Reinforcement Learning, HRL)高效训练组合型下游任务。
链接: https://arxiv.org/abs/2602.01619
作者: Seyed Mohammad Hadi Hosseini,Mahdieh Soleymani Baghshah
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted as a conference paper at ICLR 2026
Abstract:Unsupervised Skill Discovery (USD) aims to autonomously learn a diverse set of skills without relying on extrinsic rewards. One of the most common USD approaches is to maximize the Mutual Information (MI) between skill latent variables and states. However, MI-based methods tend to favor simple, static skills due to their invariance properties, limiting the discovery of dynamic, task-relevant behaviors. Distance-Maximizing Skill Discovery (DSD) promotes more dynamic skills by leveraging state-space distances, yet still falls short in encouraging comprehensive skill sets that engage all controllable factors or entities in the environment. In this work, we introduce SUSD, a novel framework that harnesses the compositional structure of environments by factorizing the state space into independent components (e.g., objects or controllable entities). SUSD allocates distinct skill variables to different factors, enabling more fine-grained control on the skill discovery process. A dynamic model also tracks learning across factors, adaptively steering the agent’s focus toward underexplored factors. This structured approach not only promotes the discovery of richer and more diverse skills, but also yields a factorized skill representation that enables fine-grained and disentangled control over individual entities, which facilitates efficient training of compositional downstream tasks via Hierarchical Reinforcement Learning (HRL). Our experimental results across three environments, with factors ranging from 1 to 10, demonstrate that our method can discover diverse and complex skills without supervision, significantly outperforming existing unsupervised skill discovery methods in factorized and complex environments. Code is publicly available at: this https URL.
zh
[AI-125] AgroFlux: A Spatial-Temporal Benchmark for Carbon and Nitrogen Flux Prediction in Agricultural Ecosystems
【速读】:该论文旨在解决农业生态系统(agroecosystem)中碳、氮、水耦合系统内温室气体(GHG)排放通量难以准确量化的问题,这一挑战限制了对GHG驱动机制的理解及有效减排策略的制定。其关键解决方案是构建首个时空维度的农业生态系统GHG基准数据集,该数据集融合了基于物理过程的模型(Ecosys和DayCent)模拟结果与涡度相关通量塔和受控环境设施的真实观测数据,并在此基础上评估了多种序列深度学习模型(如LSTM、Temporal CNN和Transformer)在碳氮通量预测中的性能,同时探索了迁移学习利用模拟数据提升模型在真实观测场景下泛化能力的方法。该基准数据集与评估框架为开发更精准、可扩展的AI驱动农业生态系统模型提供了基础支撑。
链接: https://arxiv.org/abs/2602.01614
作者: Qi Cheng,Licheng Liu,Yao Zhang,Mu Hong,Yiqun Xie,Xiaowei Jia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The agroecosystem, which is heavily influenced by human actions and accounts for a quarter of global greenhouse gas emissions (GHGs), plays a crucial role in mitigating global climate change and securing environmental sustainability. However, we can’t manage what we can’t measure. Accurately quantifying the pools and fluxes in the carbon, nutrient, and water nexus of the agroecosystem is therefore essential for understanding the underlying drivers of GHGs and developing effective mitigation strategies. Conventional approaches like soil sampling, process-based models, and black-box machine learning models are facing challenges such as data sparsity, high spatiotemporal heterogeneity, and complex subsurface biogeochemical and physical processes. Developing new trustworthy approaches, such as AI-empowered models, will require an AI-ready benchmark dataset and outlined protocols, which unfortunately do not exist. In this work, we introduce a first-of-its-kind spatial-temporal agroecosystem GHG benchmark dataset that integrates physics-based model simulations from Ecosys and DayCent with real-world observations from eddy covariance flux towers and controlled-environment facilities. We evaluate the performance of various sequential deep learning models on carbon and nitrogen flux prediction, including LSTM-based models, temporal CNN-based models, and Transformer-based models. Furthermore, we explore transfer learning to leverage simulated data to improve the generalization of deep learning models on real-world observations. Our benchmark dataset and evaluation framework contribute to the development of more accurate and scalable AI-driven agroecosystem models, advancing our understanding of ecosystem-climate interactions.
zh
[AI-126] A Practical Tensor-Network Compression Pipeline for Production-Scale Large Language Models
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在部署过程中面临的GPU显存占用高和推理延迟大的问题。解决方案的关键在于提出了一种名为Minima的生产级压缩流水线,其核心是通过训练一个轻量级卷积预测器来评估Transformer模型中各层及局部区域的敏感度,并对低敏感度区域采用Tucker、张量列车(Tensor-Train)和张量环(Tensor-Ring)等结构化分解方法进行压缩;随后执行短周期微调以修复精度损失,并利用自定义的Triton和CUDA内核高效执行压缩后的运算。该方案显著降低了峰值显存占用(如Qwen3-32B在8k上下文窗口下从64 GiB降至40 GiB),并提升了吞吐量(单请求下从40 tokens/s提升至75 tokens/s,配合推测解码时),同时在高并发场景下仍保持有效性,体现了结构化压缩与共享张量骨干网络结合的实用性路径。
链接: https://arxiv.org/abs/2602.01613
作者: Sergii Kozyrev,Davyd Maiboroda(Minima AI, Inc.)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages, 5 figures
Abstract:Large language models are limited in deployment by GPU memory and inference latency. We present Minima, a production compression pipeline that learns where and how to structurally compress a Transformer and turns that compression into real serving gains. Minima trains a lightweight convolutional predictor to estimate layer- and patch-level sensitivity, applies a mixture of Tucker, tensor-train, and tensor-ring decompositions to low-sensitivity regions, performs a short healing fine-tune, and executes the resulting operators with custom Triton and CUDA kernels. The reduced memory footprint enables speculative decoding with a small draft model and a larger verifier. On Qwen3-32B at an 8k-token context window, Minima reduces peak VRAM from 64 GiB to 40 GiB. For a single active request, throughput increases from 40 tokens per second (baseline) to 50 tokens per second (Minima) and 75 tokens per second (Minima with speculative decoding). Under 50 parallel requests, throughput is 34, 44, and 53 tokens per second respectively, showing that Minima remains effective under high concurrency even when speculative decoding gains compress. We position Minima relative to recent tensor-network, low-rank plus quantization, and cross-layer sharing methods, and argue that it is a practical step toward more aggressive structural compression via shared tensor backbones with tiny per-layer adapters.
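下面用 numpy 写出张量列车(Tensor-Train)分解的最小实现,说明"把权重重塑成高阶张量、逐维做截断 SVD"为何能显著压缩参数量;这仅是原理示意,并非 Minima 的敏感度预测器或其 Triton/CUDA 内核:

```python
import numpy as np

def tensor_train(tensor, max_rank):
    # TT-SVD 的最简版本:逐维截断 SVD
    shape = tensor.shape
    cores, r_prev = [], 1
    mat = tensor.reshape(shape[0], -1)
    for k in range(len(shape) - 1):
        mat = mat.reshape(r_prev * shape[k], -1)
        U, S, Vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(S))
        cores.append(U[:, :r].reshape(r_prev, shape[k], r))   # 第 k 个 TT 核
        mat = np.diag(S[:r]) @ Vt[:r]
        r_prev = r
    cores.append(mat.reshape(r_prev, shape[-1], 1))           # 末尾核
    return cores

W = np.random.randn(64, 64)                          # 假想的一层权重
cores = tensor_train(W.reshape(8, 8, 8, 8), max_rank=4)
print("参数量:", W.size, "->", sum(c.size for c in cores))   # 4096 -> 320
```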
zh
[AI-127] ToPT: Task-Oriented Prompt Tuning for Urban Region Representation Learning ICASSP2026
【速读】:该论文旨在解决现有城市计算任务中区域嵌入(region embeddings)学习方法存在的两个核心问题:一是传统两阶段方法生成的任务无关表示,导致下游任务性能受限;二是基于提示(prompt-based)的方法缺乏显式的空间先验和任务语义对齐机制,造成区域间建模的空间不一致性与任务条件不稳定。解决方案的关键在于提出一个名为ToPT的两阶段框架,其核心创新为:第一,引入空间感知的区域嵌入学习模块(SREL),通过Graphormer架构注入距离和区域中心性等空间先验作为可学习注意力偏置,实现空间一致性的区域交互建模;第二,设计任务感知的提示机制(Prompt4RE),利用冻结的多模态大语言模型(MLLM)生成任务特定语义向量,并通过多头交叉注意力机制与区域嵌入对齐,从而实现稳定的任务条件化。实验表明,该方法在多个城市和任务上均达到最优性能,验证了空间先验与任务语义对齐协同作用的重要性。
链接: https://arxiv.org/abs/2602.01610
作者: Zitao Guo,Changyang Jiang,Tianhong Zhao,Jinzhou Cao,Genan Dai,Bowen Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: The paper has been accepted by ICASSP 2026
Abstract:Learning effective region embeddings from heterogeneous urban data underpins key urban computing tasks (e.g., crime prediction, resource allocation). However, prevailing two-stage methods yield task-agnostic representations, decoupling them from downstream objectives. Recent prompt-based approaches attempt to fix this but introduce two challenges: they often lack explicit spatial priors, causing spatially incoherent inter-region modeling, and they lack robust mechanisms for explicit task-semantic alignment. We propose ToPT, a two-stage framework that delivers spatially consistent fusion and explicit task alignment. ToPT consists of two modules: spatial-aware region embedding learning (SREL) and task-aware prompting for region embeddings (Prompt4RE). SREL employs a Graphormer-based fusion module that injects spatial priors (distance and regional centrality) as learnable attention biases to capture coherent, interpretable inter-region interactions. Prompt4RE performs task-oriented prompting: a frozen multimodal large language model (MLLM) processes task-specific templates to obtain semantic vectors, which are aligned with region embeddings via multi-head cross-attention for stable task conditioning. Experiments across multiple tasks and cities show state-of-the-art performance, with improvements of up to 64.2%, validating the necessity and complementarity of spatial priors and prompt-region alignment. The code is available at this https URL.
zh
[AI-128] Reasoning with Autoregressive-Diffusion Collaborative Thoughts
【速读】:该论文旨在解决生成式模型在复杂任务中因模态特性差异导致的局限性问题:自回归模型(Autoregressive Models)擅长顺序规划与约束组合,但在需要显式空间或物理基础的任务上表现不足;而扩散模型(Diffusion Models)虽能捕捉丰富的空间结构,却缺乏逐步逻辑控制能力,难以满足多阶段约束或可靠纠错。解决方案的关键在于提出“协同思维”(Collaborative Thoughts)框架,通过闭环交互机制实现两类模型的联合推理与生成——自回归模型负责结构化规划和约束管理,扩散模型将约束转化为中间视觉思维(visual thoughts),再由基于视觉的评判模块评估其是否符合结构与物理要求,并将反馈用于迭代优化后续步骤,从而有效抑制跨模态误差传播,提升空间推理可靠性与生成可控性。
链接: https://arxiv.org/abs/2602.01608
作者: Mu Yuan,Liekang Zeng,Guoliang Xing,Lan Zhang,Yunhao Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Autoregressive and diffusion models represent two complementary generative paradigms. Autoregressive models excel at sequential planning and constraint composition, yet struggle with tasks that require explicit spatial or physical grounding. Diffusion models, in contrast, capture rich spatial structure through high-dimensional generation, but lack the stepwise logical control needed to satisfy complex, multi-stage constraints or to reliably identify and correct errors. We introduce Collaborative Thoughts, a unified collaborative framework that enables autoregressive and diffusion models to reason and generate jointly through a closed-loop interaction. In Collaborative Thoughts, autoregressive models perform structured planning and constraint management, diffusion models instantiate these constraints as intermediate visual thoughts, and a vision-based critic module evaluates whether the visual thoughts satisfy the intended structural and physical requirements. This feedback is then used to iteratively refine subsequent planning and generation steps, mitigating error propagation across modalities. Importantly, Collaborative Thoughts uses the same collaborative loop regardless of whether the task is autoregressive question answering or diffusion-based visual generation. Through representative examples, we illustrate how Collaborative Thoughts can improve the reliability of spatial reasoning and the controllability of generation.
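摘要描述的闭环是"规划 → 生成视觉思维 → 评判 → 带反馈修订";下面用可运行的桩函数勾勒这一控制流,plan_fn、draw_fn、critic_fn 的接口均为笔者假设,实际应分别接入自回归模型、扩散模型与视觉评判模块:

```python
def collaborative_thoughts(task, plan_fn, draw_fn, critic_fn, max_rounds=3):
    plan = plan_fn(task, feedback=None)                 # 自回归模型:规划与约束管理
    image = None
    for _ in range(max_rounds):
        image = draw_fn(plan)                           # 扩散模型:生成中间"视觉思维"
        verdict = critic_fn(task, plan, image)          # 视觉评判:检查结构/物理约束
        if verdict["ok"]:
            break
        plan = plan_fn(task, feedback=verdict["reason"])  # 携带反馈迭代修订规划
    return plan, image

# 最小可运行桩,仅用于演示闭环控制流
plan_fn = lambda task, feedback: f"plan({task}, fix={feedback})"
draw_fn = lambda plan: f"image_of[{plan}]"
critic_fn = lambda task, plan, image: {"ok": "fix=None" not in image, "reason": "空间关系不符"}
print(collaborative_thoughts("画三个叠放的方块", plan_fn, draw_fn, critic_fn))
```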
zh
[AI-129] Boosting Maximum Entropy Reinforcement Learning via One-Step Flow Matching
【速读】:该论文旨在解决扩散策略(Diffusion Policies)在推理时延迟较高,以及将流匹配(Flow Matching, FM)集成到最大熵强化学习(Maximum Entropy Reinforcement Learning, MaxEnt RL)中所面临的挑战:即最优策略是一个难以处理的能量基分布,且用于平衡探索与利用的高效对数似然估计因离散化偏差而表现不佳。解决方案的关键在于提出FLAME框架——首先通过重要性重加权推导出Q-重加权的流匹配目标,从而避免分区函数估计;其次设计了一个解耦的熵估计器以严格校正偏差,实现高效探索并使策略更接近最优MaxEnt策略;最后引入MeanFlow公式实现表达能力强且单步生成的控制策略。实验证明FLAME在MuJoCo环境中优于高斯基线,并以显著更低的推理成本达到多步扩散策略的性能。
链接: https://arxiv.org/abs/2602.01606
作者: Zeqiao Li,Yijing Wang,Haoyu Wang,Zheng Li,Zhiqiang Zuo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion policies are expressive yet incur high inference latency. Flow Matching (FM) enables one-step generation, but integrating it into Maximum Entropy Reinforcement Learning (MaxEnt RL) is challenging: the optimal policy is an intractable energy-based distribution, and the efficient log-likelihood estimation required to balance exploration and exploitation suffers from severe discretization bias. We propose Flow-based Log-likelihood-Aware Maximum Entropy RL (FLAME), a principled framework that addresses these challenges. First, we derive a Q-Reweighted FM objective that bypasses partition function estimation via importance reweighting. Second, we design a decoupled entropy estimator that rigorously corrects bias, which enables efficient exploration and brings the policy closer to the optimal MaxEnt policy. Third, we integrate the MeanFlow formulation to achieve expressive and efficient one-step control. Empirical results on MuJoCo show that FLAME outperforms Gaussian baselines and matches multi-step diffusion policies with significantly lower inference cost. Code is available at this https URL.
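摘要称 Q-重加权的流匹配目标可借重要性重加权绕开配分函数估计;下面按这一思路给出最小草图(线性插值路径 + softmax(Q/α) 权重)。FLAME 的精确目标、解耦熵估计器与 MeanFlow 单步公式均未实现,网络结构与维度为笔者假设:

```python
import torch
import torch.nn as nn

def q_reweighted_fm_loss(vel_net, states, actions, q_values, alpha=1.0):
    b = actions.size(0)
    t = torch.rand(b, 1)
    noise = torch.randn_like(actions)
    x_t = (1 - t) * noise + t * actions         # 噪声 -> 动作 的线性插值路径
    target_v = actions - noise                  # 该路径对应的目标速度场
    pred_v = vel_net(torch.cat([states, x_t, t], dim=-1))
    w = torch.softmax(q_values / alpha, dim=0)  # 高 Q 样本权重更大,无需估计配分函数
    per_sample = ((pred_v - target_v) ** 2).mean(dim=-1)
    return (w * per_sample).sum()

vel_net = nn.Sequential(nn.Linear(4 + 2 + 1, 64), nn.ReLU(), nn.Linear(64, 2))
loss = q_reweighted_fm_loss(vel_net, torch.randn(16, 4), torch.randn(16, 2), torch.randn(16))
print(loss)
```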
zh
[AI-130] The Multiple Ticket Hypothesis: Random Sparse Subnetworks Suffice for RLVR
【速读】:该论文旨在解决深度学习模型中参数冗余的问题,特别是针对强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)场景下参数更新集中在稀疏子集的现象。研究发现,通过仅训练随机选择的极稀疏参数子集(如1%),即可达到甚至超越全参数微调的性能,且不同随机掩码所对应的子网络几乎无重叠(Jaccard相似度≤0.005),表明预训练模型中存在大量可行的稀疏子网络而非单一最优结构。解决方案的关键在于:RLVR中的每步更新隐含地施加了KL散度约束,将参数更新限制在低维子空间内,从而使得任意稀疏掩码均可有效工作,这一现象被称为“多重彩票假设”(Multiple Ticket Hypothesis)。
链接: https://arxiv.org/abs/2602.01599
作者: Israel Adewuyi,Solomon Okibe,Vladmir Ivanov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The Lottery Ticket Hypothesis demonstrated that sparse subnetworks can match full-model performance, suggesting parameter redundancy. Meanwhile, in Reinforcement Learning with Verifiable Rewards (RLVR), recent work has shown that updates concentrate on a sparse subset of parameters, which further lends evidence to this underlying redundancy. We study the simplest possible way to exploit this redundancy: training only a randomly selected subset of parameters at extreme sparsities. Empirically, we find that training just 1% of parameters matches or exceeds full-parameter RLVR finetuning across 3 models and 2 task domains. Moreover, different random masks show minimal overlap ( \leq 0.005 Jaccard similarity) and yet all succeed, suggesting pretrained models contain many viable sparse subnetworks rather than one privileged set. We term this the Multiple Ticket Hypothesis. We explain this phenomenon through the implicit per-step KL constraint in RLVR, which restricts updates to a low-dimensional subspace, enabling arbitrary sparse masks to succeed.
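"仅训练随机 1% 参数"在实现上可归结为对梯度施加一个固定的随机掩码,下面是一个 PyTorch 草图。它只演示掩码机制本身,不含 RLVR 训练流程,且与原文实现细节未必一致:

```python
import torch

def make_random_masks(model, fraction=0.01, seed=0):
    # 为每个参数张量生成固定的随机掩码:仅 fraction 比例的参数可更新
    g = torch.Generator().manual_seed(seed)
    return {n: (torch.rand(p.shape, generator=g) < fraction).float()
            for n, p in model.named_parameters()}

def masked_step(model, loss, optimizer, masks):
    optimizer.zero_grad()
    loss.backward()
    with torch.no_grad():
        for n, p in model.named_parameters():
            if p.grad is not None:
                p.grad.mul_(masks[n])   # 掩码外梯度清零,对应参数保持冻结
    optimizer.step()

model = torch.nn.Linear(128, 128)
masks = make_random_masks(model, fraction=0.01)    # 约 1% 参数可训练
opt = torch.optim.SGD(model.parameters(), lr=0.1)  # 注意:带权重衰减的优化器会使冻结参数漂移
x = torch.randn(8, 128)
masked_step(model, (model(x) ** 2).mean(), opt, masks)
print(sum(int(m.sum()) for m in masks.values()), "/", sum(p.numel() for p in model.parameters()))
```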
zh
[AI-131] Spectral Text Fusion: A Frequency-Aware Approach to Multimodal Time-Series Forecasting
【速读】:该论文旨在解决多模态时间序列预测中,如何有效融合数值型时间序列与文本等上下文信息的问题,尤其关注现有方法在局部对齐文本特征与时间序列模式时,忽略文本对时间序列多尺度动态影响(如周期性和趋势变化)的局限性。其解决方案的关键在于引入频域建模:通过谱分解(spectral decomposition)将时间序列分离为不同频率成分,捕捉短期波动与长期趋势;进而将文本嵌入投影至频域,并利用轻量级交叉注意力机制与时间序列的频域分量进行融合,自适应地调整各频率带的权重,最后映射回时域完成预测。该方法实现了文本语义对时间序列全局结构的精准调控,显著提升了预测性能且参数效率更高。
链接: https://arxiv.org/abs/2602.01588
作者: Huu Hiep Nguyen,Minh Hoang Nguyen,Dung Nguyen,Hung Le
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal time series forecasting is crucial in real-world applications, where decisions depend on both numerical data and contextual signals. The core challenge is to effectively combine temporal numerical patterns with the context embedded in other modalities, such as text. While most existing methods align textual features with time-series patterns one step at a time, they neglect the multiscale temporal influences of contextual information such as time-series cycles and dynamic shifts. This mismatch between local alignment and global textual context can be addressed by spectral decomposition, which separates time series into frequency components capturing both short-term changes and long-term trends. In this paper, we propose SpecTF, a simple yet effective framework that integrates the effect of textual data on time series in the frequency domain. Our method extracts textual embeddings, projects them into the frequency domain, and fuses them with the time series’ spectral components using a lightweight cross-attention mechanism. This adaptively reweights frequency bands based on textual relevance before mapping the results back to the temporal domain for predictions. Experimental results demonstrate that SpecTF significantly outperforms state-of-the-art models across diverse multi-modal time series datasets while utilizing considerably fewer parameters. Code is available at this https URL.
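按摘要的描述,SpecTF 的核心是"文本投影到频域 → 与 rFFT 分量做交叉注意力重加权 → 逆变换回时域";下面的 PyTorch 模块按此流程搭出一个示意版本,具体维度与注意力形式均为笔者假设:

```python
import torch
import torch.nn as nn

class SpectralTextFusion(nn.Module):
    # 示意模块:文本嵌入作查询,序列 rFFT 分量作键,按相关性重加权各频带
    def __init__(self, seq_len, text_dim, d=32):
        super().__init__()
        self.q = nn.Linear(text_dim, d)
        self.k = nn.Linear(2, d)          # 每个频率分量的 (实部, 虚部) -> 键向量
        self.seq_len = seq_len

    def forward(self, series, text_emb):
        spec = torch.fft.rfft(series, dim=-1)                  # [B, n_freq] 复数
        feats = torch.stack([spec.real, spec.imag], dim=-1)    # [B, n_freq, 2]
        q = self.q(text_emb).unsqueeze(1)                      # [B, 1, d]
        k = self.k(feats)                                      # [B, n_freq, d]
        attn = torch.softmax((q * k).sum(-1) / k.size(-1) ** 0.5, dim=-1)
        spec = spec * (1 + attn)           # 与文本相关的频带被自适应放大
        return torch.fft.irfft(spec, n=self.seq_len, dim=-1)  # 回到时域

m = SpectralTextFusion(seq_len=96, text_dim=16)
print(m(torch.randn(4, 96), torch.randn(4, 16)).shape)  # torch.Size([4, 96])
```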
zh
[AI-132] On the Fragility of AI-Based Channel Decoders under Small Channel Perturbations
【速读】:该论文试图解决的问题是:近年来基于深度学习的纠错解码器(如ECCT和CrossMPT)在加性高斯白噪声(AWGN)信道下相较于传统置信传播(BP)解码表现出性能提升,但这种提升的来源及其潜在代价尚不明确。论文聚焦于通过分布偏移(distributional shifts)下的鲁棒性视角来解析这些性能增益的本质。其解决方案的关键在于系统性评估AI解码器在两类对抗扰动下的表现——一类是依赖输入的对抗扰动(如FGM和投影梯度法,受限于ℓ₂范数),另一类是适用于所有接收向量的通用对抗扰动(universal adversarial perturbations)。实验结果表明,尽管AI解码器在标准AWGN条件下性能优越,但在面对上述扰动时显著退化,且扰动在不同AI解码器间具有较强迁移性,而对BP解码器影响较小;同时,通用扰动比同等范数的随机扰动更具破坏性。这揭示了AI解码器可能以牺牲鲁棒性为代价获得性能提升,即存在潜在的“鲁棒性代价”。
链接: https://arxiv.org/abs/2602.01582
作者: Haoyu Lei,Mohammad Jalali,Chin Wa Lau,Farzan Farnia
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Recent advances in deep learning have led to AI-based error correction decoders that report empirical performance improvements over traditional belief-propagation (BP) decoding on AWGN channels. While such gains are promising, a fundamental question remains: where do these improvements come from, and what cost is paid to achieve them? In this work, we study this question through the lens of robustness to distributional shifts at the channel output. We evaluate both input-dependent adversarial perturbations (FGM and projected gradient methods under \ell_2 constraints) and universal adversarial perturbations that apply a single norm-bounded shift to all received vectors. Our results show that recent AI decoders, including ECCT and CrossMPT, could suffer significant performance degradation under such perturbations, despite superior nominal performance under i.i.d. AWGN. Moreover, adversarial perturbations transfer relatively strongly between AI decoders but weakly to BP-based decoders, and universal perturbations are substantially more harmful than random perturbations of equal norm. These numerical findings suggest a potential robustness cost and higher sensitivity to channel distribution underlying recent AI decoding gains.
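摘要中 ℓ2 约束下的 FGM 有标准写法:沿损失梯度方向加一个范数归一化的扰动。下面给出通用草图,decoder_loss_fn 为示意接口,玩具例子里用线性打分器代替真实解码器:

```python
import torch

def fgm_l2(decoder_loss_fn, y, eps=0.5):
    # 对每个接收向量 y,沿梯度方向加 ||δ||_2 <= eps 的扰动
    y = y.clone().detach().requires_grad_(True)
    loss = decoder_loss_fn(y)
    loss.backward()
    g = y.grad
    g_norm = g.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1)
    delta = eps * g.flatten(1).div(g_norm).view_as(y)   # 单位化后缩放到 eps
    return (y + delta).detach()

# 玩具示例:把"解码器损失"换成一个线性打分器
w = torch.randn(16)
loss_fn = lambda y: (y @ w).sum()
y_adv = fgm_l2(loss_fn, torch.randn(8, 16), eps=0.5)
print(y_adv.shape)  # torch.Size([8, 16])
```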
zh
[AI-133] DrawSim-PD: Simulating Student Science Drawings to Support NGSS-Aligned Teacher Diagnostic Reasoning
【速读】:该论文旨在解决教师专业发展(Professional Development, PD)中因隐私法规限制而难以获取真实学生作品用于训练诊断推理能力的问题。解决方案的关键在于提出DrawSim-PD框架,该框架基于“能力轮廓”(capability profiles)——一种结构化的认知状态模型,用于编码不同表现水平学生所能及与尚不能展示的知识与技能。通过这一机制,系统可生成跨模态一致的模拟学生作品:包括符合NGSS标准的学生式科学绘图、第一人称推理叙事以及面向教师的诊断概念图,从而在保障隐私的前提下提供多样且可控的教学评估素材。
链接: https://arxiv.org/abs/2602.01578
作者: Arijit Chakma,Peng He,Honglu Liu,Zeyuan Wang,Tingting Li,Tiffany D. Do,Feng Liu
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 26 pages, 12 figures
Abstract:Developing expertise in diagnostic reasoning requires practice with diverse student artifacts, yet privacy regulations prohibit sharing authentic student work for teacher professional development (PD) at scale. We present DrawSim-PD, the first generative framework that simulates NGSS-aligned, student-like science drawings exhibiting controllable pedagogical imperfections to support teacher training. Central to our approach are capability profiles: structured cognitive states encoding what students at each performance level can and cannot yet demonstrate. These profiles ensure cross-modal coherence across generated outputs: (i) a student-like drawing, (ii) a first-person reasoning narrative, and (iii) a teacher-facing diagnostic concept map. Using 100 curated NGSS topics spanning K-12, we construct a corpus of 10,000 systematically structured artifacts. Through an expert-based feasibility evaluation, K-12 science educators verified the artifacts’ alignment with NGSS expectations (84% positive on core items) and utility for interpreting student thinking, while identifying refinement opportunities for grade-band extremes. We release this open infrastructure to overcome data scarcity barriers in visual assessment research.
zh
[AI-134] DREAMS: A Social Exchange Theory-Informed Modeling of Misinformation Engagement on Social Media WWW
【速读】:该论文旨在解决社交媒体中虚假信息传播过程中用户参与度预测的难题,传统方法将参与度视为同质的时间序列信号,忽略了社会机制与平台设计对 misinformation(虚假信息)传播的异质性影响。其解决方案的关键在于提出一个受社会交换理论启发的框架 DREAMS(Disentangled Representations and Episodic Adaptive Modeling for Social media misinformation engagements),该框架将参与度建模为一种动态的社会交换过程,而非静态结果;通过序列到序列的自适应建模方式,捕捉用户努力与社交回报之间的演化关系,并结合平台上下文条件进行情感和情境信号的时序传播学习,从而显著提升预测准确性并揭示跨平台的一致性社会交换模式。
链接: https://arxiv.org/abs/2602.01567
作者: Lin Tian,Marian-Andrei Rizoiu
机构: 未知
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注: 12 pages, 5 figures, 3 tables, Accepted by WWW The Web Conference 2026
Abstract:Social media engagement prediction is a central challenge in computational social science, particularly for understanding how users interact with misinformation. Existing approaches often treat engagement as a homogeneous time-series signal, overlooking the heterogeneous social mechanisms and platform designs that shape how misinformation spreads. In this work, we ask: “Can neural architectures discover social exchange principles from behavioral data alone?” We introduce DREAMS (Disentangled Representations and Episodic Adaptive Modeling for Social media misinformation engagements), a social exchange theory-guided framework that models misinformation engagement as a dynamic process of social exchange. Rather than treating engagement as a static outcome, DREAMS models it as a sequence-to-sequence adaptation problem, where each action reflects an evolving negotiation between user effort and social reward conditioned by platform context. It integrates adaptive mechanisms to learn how emotional and contextual signals propagate through time and across platforms. On a cross-platform dataset spanning 7 platforms and 2.37M posts collected between 2021 and 2025, DREAMS achieves state-of-the-art performance in predicting misinformation engagements, reaching a mean absolute percentage error of 19.25%. This is a 43.6% improvement over the strongest baseline. Beyond predictive gains, the model reveals consistent cross-platform patterns that align with social exchange principles, suggesting that integrating behavioral theory can enhance empirical modeling of online misinformation engagement. The source code is available at: this https URL.
zh
[AI-135] Autonomous Question Formation for Large Language Model-Driven AI Systems
【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的AI系统在动态开放环境中缺乏自主问题识别能力的问题,即现有系统依赖预设任务和固定提示(prompt),难以根据环境变化自主确定应解决的问题。其解决方案的关键在于提出一种基于人类模拟的框架,将问题形成(question formation)视为优先于任务选择与执行的一类决策过程,并通过内源驱动、环境感知和多智能体交互感知三个 prompting 范畴逐步扩展认知覆盖范围,同时支持从经验中学习问题形成机制,从而提升系统的适应性和决策质量。
链接: https://arxiv.org/abs/2602.01556
作者: Hong Su
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language model (LLM)-driven AI systems are increasingly important for autonomous decision-making in dynamic and open environments. However, most existing systems rely on predefined tasks and fixed prompts, limiting their ability to autonomously identify what problems should be solved when environmental conditions change. In this paper, we propose a human-simulation-based framework that enables AI systems to autonomously form questions and set tasks by reasoning over their internal states, environmental observations, and interactions with other AI systems. The proposed method treats question formation as a first-class decision process preceding task selection and execution, and integrates internal-driven, environment-aware, and inter-agent-aware prompting scopes to progressively expand cognitive coverage. In addition, the framework supports learning the question-formation process from experience, allowing the system to improve its adaptability and decision quality over time. Experimental results in a multi-agent simulation environment show that environment-aware prompting significantly reduces no-eat events compared with the internal-driven baseline, and inter-agent-aware prompting further reduces cumulative no-eat events by more than 60% over a 20-day simulation, with statistically significant improvements (p < 0.05).
zh
[AI-136] Plain Transformers are Surprisingly Powerful Link Predictors
【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在大规模图结构数据上进行链接预测(Link Prediction)时面临的可扩展性差与泛化能力不足的问题,以及现有图变压器(Graph Transformers, GTs)因复杂结构编码导致的高计算开销问题。其解决方案的关键在于提出一种仅使用编码器(encoder-only)的简单Transformer架构PENCIL,通过注意力机制对采样的局部子图进行建模,替代了传统方法中依赖人工设计的结构启发式规则或内存密集型节点嵌入表示,从而在保持标准Transformer的可扩展性和硬件效率的同时,隐式地捕获更丰富的拓扑结构信号,并有效推广多种启发式方法和基于子图的表达能力。
链接: https://arxiv.org/abs/2602.01553
作者: Quang Truong,Yu Song,Donald Loveland,Mingxuan Ju,Tong Zhao,Neil Shah,Jiliang Tang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Link prediction is a core challenge in graph machine learning, demanding models that capture rich and complex topological dependencies. While Graph Neural Networks (GNNs) are the standard solution, state-of-the-art pipelines often rely on explicit structural heuristics or memory-intensive node embeddings – approaches that struggle to generalize or scale to massive graphs. Emerging Graph Transformers (GTs) offer a potential alternative but often incur significant overhead due to complex structural encodings, hindering their applications to large-scale link prediction. We challenge these sophisticated paradigms with PENCIL, an encoder-only plain Transformer that replaces hand-crafted priors with attention over sampled local subgraphs, retaining the scalability and hardware efficiency of standard Transformers. Through experimental and theoretical analysis, we show that PENCIL extracts richer structural signals than GNNs, implicitly generalizing a broad class of heuristics and subgraph-based expressivity. Empirically, PENCIL outperforms heuristic-informed GNNs and is far more parameter-efficient than ID-embedding–based alternatives, while remaining competitive across diverse benchmarks – even without node features. Our results challenge the prevailing reliance on complex engineering techniques, demonstrating that simple design choices are potentially sufficient to achieve the same capabilities.
zh
[AI-137] S1-NexusAgent: a Self-Evolving Agent Framework for Multidisciplinary Scientific Research
【速读】:该论文旨在解决现有大语言模型(Large Language Models, LLMs)和工具驱动智能体在处理多学科科学研究时面临的三大核心挑战:长周期规划能力不足、目标维持鲁棒性差以及从执行过程中持续学习的能力有限。其解决方案的关键在于提出S1-NexusAgent框架,该框架采用分层的“计划-编码执行”(Plan-and-CodeAct)范式,通过双环架构将全局科研规划与子任务级工具执行解耦,从而稳定建模复杂研究流程;同时引入对象引用驱动的稀疏上下文管理机制以应对长文本和大规模数据问题,并结合意图感知的动态工具检索与热插拔机制实现异构科研工具的高效编排;最终由Critic Agent自动评估完整执行轨迹并提炼高质量研究路径为可复用的科学技能(Scientific Skills),形成闭环自进化机制,显著提升系统在长期、跨学科科学任务中的性能与泛化能力。
链接: https://arxiv.org/abs/2602.01550
作者: S1-NexusAgent Team
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: In progress
Abstract:Modern scientific research relies on large-scale data, complex workflows, and specialized tools, which existing LLMs and tool-based agents struggle to handle due to limitations in long-horizon planning, robust goal maintenance, and continual learning from execution. To address these issues, in this work, we propose S1-NexusAgent, a self-evolving agent framework designed for multidisciplinary scientific research. S1-NexusAgent adopts a hierarchical Plan-and-CodeAct execution paradigm, decoupling global scientific planning from subtask-level tool execution through a dual-loop architecture, thereby enabling stable modeling of complex research workflows. The system natively supports the Model Context Protocol (MCP), integrates up to thousands of cross-disciplinary scientific tools, and achieves efficient orchestration of heterogeneous research tools via intention-aware dynamic tool retrieval and hot-plug mechanisms. To address long-context and large-scale data challenges in scientific settings, S1-NexusAgent introduces object-reference-based sparse context management, which enables sub-task context isolation and intermediate result compression. Building on this, a Critic Agent automatically evaluates complete execution trajectories and distills high-quality research paths into reusable Scientific Skills, forming a closed loop for continuous self-evolution, which is valuable for sustainable and long-horizon scientific research. Experiments on authoritative scientific benchmarks involving long-horizon planning and complex specialized tool orchestration, including biomini-eval (biology), ChemBench (chemistry), and MatSciBench (material science), demonstrate that S1-NexusAgent achieves state-of-the-art performance, validating its effectiveness and generalization capability in complex scientific tasks.
zh
[AI-138] PRISM: Festina Lente Proactivity – Risk-Sensitive Uncertainty-Aware Deliberation for Proactive Agents
【速读】:该论文旨在解决主动式智能体(proactive agents)在决策干预时面临的“收益-负担权衡”难题,即如何在不依赖脆弱启发式规则或冗长无差别的推理基础上,实现精准、高效且可控的干预行为。其核心挑战在于平衡“遗漏帮助”(missed help)与“误报干扰”(false alarms)之间的不对称成本。解决方案的关键在于提出PRISM框架,该框架融合了决策理论驱动的门控机制(decision-theoretic gate)与双过程推理架构(dual-process reasoning architecture):通过一个由成本校准的概率阈值决定是否干预,仅在决策边界附近调用计算密集的慢速模式(Slow mode)进行反事实验证,从而将资源集中于高风险和模糊场景;同时采用门控对齐的结构化蒸馏训练策略(gate-aligned, schema-locked distillation),使学生模型在响应策略上独立于干预门控,从而实现可调节和可审计的控制能力。实验证明,该方法在ProactiveBench基准上显著降低22.78%的误报率并提升20.14%的F1分数,展现出高精度、低开销与强可控性的优势。
链接: https://arxiv.org/abs/2602.01532
作者: Yuxuan Fu,Xiaoyu Tan,Teqi Hao,Chen Zhan,Xihe Qiu
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Proactive agents must decide not only what to say but also whether and when to intervene. Many current systems rely on brittle heuristics or indiscriminate long reasoning, which offers little control over the benefit-burden tradeoff. We formulate the problem as cost-sensitive selective intervention and present PRISM, a novel framework that couples a decision-theoretic gate with a dual-process reasoning architecture. At inference time, the agent intervenes only when a calibrated probability of user acceptance exceeds a threshold derived from asymmetric costs of missed help and false alarms. Inspired by festina lente (Latin: “make haste slowly”), we gate by an acceptance-calibrated, cost-derived threshold and invoke a resource-intensive Slow mode with counterfactual checks only near the decision boundary, concentrating computation on ambiguous and high-stakes cases. Training uses gate-aligned, schema-locked distillation: a teacher running the full PRISM pipeline provides dense, executable supervision on unlabeled interaction traces, while the student learns a response policy that is explicitly decoupled from the intervention gate to enable tunable and auditable control. On ProactiveBench, PRISM reduces false alarms by 22.78% and improves F1 by 20.14% over strong baselines. These results show that principled decision-theoretic gating, paired with selective slow reasoning and aligned distillation, yields proactive agents that are precise, computationally efficient, and controllable. To facilitate reproducibility, we release our code, models, and resources at this https URL; all experiments use the open-source ProactiveBench benchmark.
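"由非对称代价推导阈值"可以直接写出:设漏帮代价为 c_miss、误报代价为 c_fa,则干预当且仅当 (1-p)·c_fa < p·c_miss,即 p > c_fa/(c_fa+c_miss);摘要还提到仅在决策边界附近调用 Slow 模式。下面的草图按此逻辑实现,代价数值与边界带宽均为笔者假设:

```python
def intervention_gate(p_accept, c_miss=1.0, c_false_alarm=3.0, band=0.1):
    # 期望代价比较:不干预的代价 p*c_miss,干预的误报代价 (1-p)*c_false_alarm
    threshold = c_false_alarm / (c_false_alarm + c_miss)   # 本例中阈值为 0.75
    if abs(p_accept - threshold) < band:
        return "slow"          # 决策边界附近:调用代价更高的 Slow 模式做反事实复核
    return "intervene" if p_accept > threshold else "stay_silent"

for p in (0.2, 0.72, 0.95):
    print(p, intervention_gate(p))   # 依次输出 stay_silent / slow / intervene
```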
zh
[AI-139] You Need an Encoder for Native Position-Independent Caching
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)中基于前缀的键值(Key-Value, KV)缓存机制在处理任意顺序检索上下文时效率低下的问题。现有方法虽尝试引入位置无关缓存(Position-Independent Caching, PIC)以实现KV重用,但常导致显著的精度下降,限制了其实际应用。解决方案的关键在于提出原生PIC(native PIC),通过向主流的仅解码器架构LLM中重新引入编码器,并显式训练该编码器以支持PIC;同时设计了一个名为COMB的PIC感知缓存系统,可无缝集成至现有推理框架。实验表明,COMB在保持相近准确率的前提下,将首次标记时间(Time-to-First-Token, TTFT)降低51–94%,吞吐量提升3倍,且在DeepSeek-V2-Lite-Chat上的效果验证了其对其他仅解码器LLM的适用性。
链接: https://arxiv.org/abs/2602.01519
作者: Shiju Zhao,Junhao Hu,Jiaqi Zheng,Guihai Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 10 figures. Welcome back, Encoder
Abstract:The Key-Value (KV) cache of Large Language Models (LLMs) is prefix-based, making it highly inefficient for processing contexts retrieved in arbitrary order. Position-Independent Caching (PIC) has been proposed to enable KV reuse without positional constraints; however, existing approaches often incur substantial accuracy degradation, limiting their practical adoption. To address this issue, we propose native PIC by reintroducing the encoder to prevalent decoder-only LLMs and explicitly training it to support PIC. We further develop COMB, a PIC-aware caching system that integrates seamlessly with existing inference frameworks. Experimental results show that COMB reduces Time-to-First-Token (TTFT) by 51-94% and increases throughput by 3 \times with comparable accuracy. Furthermore, the quality improvement when using DeepSeek-V2-Lite-Chat demonstrates the applicability of COMB to other types of decoder-only LLMs. Our code is available at this https URL.
zh
[AI-140] Qrita: High-performance Top-k and Top-p Algorithm for GPUs using Pivot-based Truncation and Selection
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在采样过程中使用Top-k和Top-p截断操作时,因处理大规模词表而导致的计算与内存开销过高的问题。现有方法通常依赖排序实现,导致GPU上计算和内存占用显著增加,或采用随机化策略,从而改变算法输出的确定性。其解决方案的关键在于提出Qrita算法,基于pivot-based selection策略,结合两项核心技术:一是Gaussian-based sigma-truncation,通过高斯分布特性大幅缩小目标元素的搜索空间;二是Quaternary pivot search with duplication handling,将pivot搜索迭代次数减半并确保输出的确定性。该方案在Triton中完整实现,并在vLLM、SGLang和Flashinfer等高性能LLM执行引擎上验证了其有效性,相较排序方法实现了最高2倍吞吐量提升和50%内存占用降低,同时保持与排序算法一致的输出结果。
链接: https://arxiv.org/abs/2602.01518
作者: Jongseok Park,Sunga Kim,Alvin Cheung,Ion Stoica
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Top-k and Top-p are the dominant truncation operators in the sampling of large language models. Despite their widespread use, implementing them efficiently over large vocabularies remains a significant challenge. Existing approaches often rely on sorting, which incurs significant computation and memory overhead on GPUs, or stochastic approaches, which alter the algorithm output. In this work, we propose Qrita, an efficient Top-k and Top-p algorithm based on a pivot-based selection strategy. Based on RTop-k, which uses a pivot-based search for node selection in graph neural networks, Qrita extends the concept of pivot-based search to both Top-k and Top-p with two key techniques: 1. Gaussian-based sigma-truncation, which greatly reduces the search space of the target elements, and 2. Quaternary pivot search with duplication handling, which halves the pivot search iterations and guarantees deterministic output. We provide the full implementation of Qrita using Triton, a popular GPU programming language. Our evaluation of Qrita against the Top-k and Top-p kernels of high-performance LLM execution engines such as vLLM, SGLang, and Flashinfer shows that Qrita achieves up to 2 times the throughput and half the memory use while providing the same output as the sorting-based algorithms.
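可以用 numpy 演示"高斯 σ 截断 + 枢轴搜索"的核心思路:先按均值/标准差把候选缩到分布尾部,再对阈值二分而非全量排序。以下仅为原理草图,与 Qrita 的四分枢轴搜索、重复值处理及 GPU 内核实现并不等同:

```python
import numpy as np

def sigma_truncated_topk(logits, k, n_sigma=3.0):
    mu, sigma = logits.mean(), logits.std()
    pool = logits[logits > mu + n_sigma * sigma]   # 高斯假设下先截断到尾部
    while pool.size < k:                           # 候选不足则逐步放宽截断
        n_sigma -= 0.5
        pool = logits[logits > mu + n_sigma * sigma]
    lo, hi = pool.min(), pool.max()                # 对枢轴阈值做二分搜索
    for _ in range(32):
        pivot = (lo + hi) / 2
        cnt = int((pool >= pivot).sum())
        if cnt == k:
            lo = pivot
            break
        lo, hi = (lo, pivot) if cnt < k else (pivot, hi)
    return np.sort(pool[pool >= lo])[::-1][:k]     # 不变式保证阈值 lo 以上至少 k 个元素

logits = np.random.default_rng(0).normal(size=50_000)
print(sigma_truncated_topk(logits, k=5))
```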
zh
[AI-141] White-Box Neural Ensemble for Vehicular Plasticity: Quantifying the Efficiency Cost of Symbolic Auditability in Adaptive NMPC
【速读】:该论文旨在解决车辆控制系统中因运行工况变化(如摩擦、质量、阻力等参数突变)导致的传统非线性模型预测控制(Nonlinear Model Predictive Control, NMPC)性能下降的问题,即如何实现无需重新训练即可适应不同运行环境的“车辆可塑性”(vehicular plasticity)。其解决方案的关键在于提出一种白盒自适应NMPC架构,通过模块化主权(Modular Sovereignty)机制在多个冻结的、针对特定工况训练的神经网络专家(neural specialists)之间进行仲裁,同时利用CasADi框架保持整个系统动态为完全可遍历的符号图(symbolic graph),从而在保证高运行时可审计性的同时,实现毫秒级(~7.3 ms)的快速适应和接近理想的跟踪精度。
链接: https://arxiv.org/abs/2602.01516
作者: Enzo Nicolas Spotorno,Matheus Wagner,Antonio Augusto Medeiros Frohlich
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 5 pages, 1 table, 1 figure, submitted to IEEE VTC 2026 Recent Results Track
Abstract:We present a white-box adaptive NMPC architecture that resolves vehicular plasticity (adaptation to varying operating regimes without retraining) by arbitrating among frozen, regime-specific neural specialists using a Modular Sovereignty paradigm. The ensemble dynamics are maintained as a fully traversable symbolic graph in CasADi, enabling maximal runtime auditability. Synchronous simulation validates rapid adaptation (~7.3 ms) and near-ideal tracking fidelity under compound regime shifts (friction, mass, drag) where non-adaptive baselines fail. Empirical benchmarking quantifies the transparency cost: symbolic graph maintenance increases solver latency by 72-102X versus compiled parametric physics models, establishing the efficiency price of strict white-box implementation.
zh
[AI-142] Harnessing Flexible Spatial and Temporal Data Center Workloads for Grid Regulation Services
【速读】:该论文旨在解决数据中心(Data Centers, DCs)在参与电网频率调节时,因 workload 调度与调节容量投标分离而导致的调节能力不可持续或难以兑现的问题。现有方法忽略了排队动态和时空调度决策对实时调节能力的影响,导致承诺的调节容量可能无法实现或持续时间短。解决方案的关键在于提出一种统一的日前协同优化框架,联合决策地理分布的数据中心之间的工作负载分配与调节容量承诺,并构建空间-时间网络模型以刻画工作负载迁移成本、延迟需求及异构资源限制;同时引入基于交互式负荷预测的瞬时功率灵活性机会约束,以及基于风险价值(Value-at-Risk)的队列状态约束,确保调节承诺在累积调节信号下仍具备可持续响应能力。
链接: https://arxiv.org/abs/2602.01508
作者: Yingrui Fan,Junbo Zhao
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:
Abstract:Data centers (DCs) are increasingly recognized as flexible loads that can support grid frequency regulation. Yet, most existing methods treat workload scheduling and regulation capacity bidding separately, overlooking how queueing dynamics and spatial-temporal dispatch decisions affect the ability to sustain real-time regulation. As a result, the committed regulation may become infeasible or short-lived. To address this issue, we propose a unified day-ahead co-optimization framework that jointly decides workload distribution across geographically distributed DCs and regulation capacity commitments. We construct a space-time network model to capture workload migration costs, latency requirements, and heterogeneous resource limits. To ensure that the committed regulation remains deliverable, we introduce chance constraints on instantaneous power flexibility based on interactive load forecasts, and apply Value-at-Risk queue-state constraints to maintain sustainable response under cumulative regulation signals. Case studies on a modified IEEE 68-bus system using real data center traces show that the proposed framework lowers system operating costs, enables more viable regulation capacity, and achieves better revenue-risk trade-offs compared to strategies that optimize scheduling and regulation independently.
zh
[AI-143] Governance at the Edge of Architecture: Regulating NeuroAI and Neuromorphic Systems
【速读】:该论文旨在解决当前人工智能治理框架在面对神经形态人工智能(NeuroAI)系统时的适用性问题,特别是这些系统基于脉冲神经网络(Spiking Neural Networks, SNNs)并运行于神经形态硬件上,其计算特性与传统冯·诺依曼架构下的静态人工神经网络存在本质差异。论文指出,现有监管指标如准确性、延迟和能效无法充分覆盖 NeuroAI 的物理实现机制、学习动态及具身效率,因此亟需重构保障与审计方法。解决方案的关键在于推动治理框架与 NeuroAI 架构同步演进,将传统监管指标与脑启发计算的物理规律、学习动力学及高效实现方式相融合,从而建立技术可支撑的可信保障体系。
链接: https://arxiv.org/abs/2602.01503
作者: Afifah Kashif,Abdul Muhsin Hameed,Asim Iqbal
机构: 未知
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: 9 pages, 1 table, 1 figure
Abstract:Current AI governance frameworks, including regulatory benchmarks for accuracy, latency, and energy efficiency, are built for static, centrally trained artificial neural networks on von Neumann hardware. NeuroAI systems, embodied in neuromorphic hardware and implemented via spiking neural networks, break these assumptions. This paper examines the limitations of current AI governance frameworks for NeuroAI, arguing that assurance and audit methods must co-evolve with these architectures, aligning traditional regulatory metrics with the physics, learning dynamics, and embodied efficiency of brain-inspired computation to enable technically grounded assurance.
zh
[AI-144] Draw2Learn: A Human-AI Collaborative Tool for Drawing-Based Science Learning
【速读】:该论文旨在解决生成式 AI (Generative AI) 在支持绘图式学习(drawing-based learning)过程中,如何实现及时、有效的反馈以促进学习效果的问题。其解决方案的关键在于设计了一个名为 Draw2Learn 的系统,该系统将学习原理转化为具体的交互模式:AI 作为“协作队友”生成结构化的绘图任务、提供可选的视觉支架(visual scaffolds)、监测学习进度并给予多维反馈,从而在保持学习者自主性的同时提升教学支持的有效性。
链接: https://arxiv.org/abs/2602.01494
作者: Yuqi Hang
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Drawing supports learning by externalizing mental models, but providing timely feedback at scale remains challenging. We present Draw2Learn, a system that explores how AI can act as a supportive teammate during drawing-based learning. The design translates learning principles into concrete interaction patterns: AI generates structured drawing quests, provides optional visual scaffolds, monitors progress, and delivers multidimensional feedback. We collected formative user feedback and open-ended comments during system development. Feedback showed positive ratings for usability, usefulness, and user experience, with themes highlighting the value of AI scaffolding and learner autonomy. This work contributes a design framework for teammate-oriented AI in generative learning and identifies key considerations for future research.
zh
[AI-145] OpInf-LLM : Parametric PDE Solving with LLM s via Operator Inference
【速读】:该论文旨在解决大语言模型(LLM)在求解多样偏微分方程(PDE)时面临的挑战,即在不同参数和边界条件下实现高数值精度与高执行成功率之间的权衡问题。解决方案的关键在于提出OpInf-LLM框架,该框架基于算子推理(operator inference)技术,利用少量解数据实现对未见参数和配置的PDE实例的准确预测,并通过自然语言接口无缝集成LLM能力以指定求解任务;其低计算开销和统一工具接口保障了在异构环境中的高执行成功率,从而推动了LLM驱动的PDE求解中可泛化的降阶建模发展。
链接: https://arxiv.org/abs/2602.01493
作者: Zhuoyuan Wang,Hanjiang Hu,Xiyu Deng,Saviz Mowlavi,Yorie Nakahira
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Solving diverse partial differential equations (PDEs) is fundamental in science and engineering. Large language models (LLMs) have demonstrated strong capabilities in code generation, symbolic reasoning, and tool use, but reliably solving PDEs across heterogeneous settings remains challenging. Prior work on LLM-based code generation and transformer-based foundation models for PDE learning has shown promising advances. However, a persistent trade-off between execution success rate and numerical accuracy arises, particularly when generalization to unseen parameters and boundary conditions is required. In this work, we propose OpInf-LLM, an LLM parametric PDE solving framework based on operator inference. The proposed framework leverages a small amount of solution data to enable accurate prediction of diverse PDE instances, including unseen parameters and configurations, and provides seamless integration with LLMs for natural language specification of PDE solving tasks. Its low computational demands and unified tool interface further enable a high execution success rate across heterogeneous settings. By combining operator inference with LLM capabilities, OpInf-LLM opens new possibilities for generalizable reduced-order modeling in LLM-based PDE solving.
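算子推理的最简形式是:POD 降维后用最小二乘拟合线性降阶算子,使 dx/dt ≈ A_r x。下面的 numpy 草图只实现这一线性情形(实际 OpInf 通常还含二次项与输入项),LLM 接口部分从略:

```python
import numpy as np

def operator_inference(snapshots, dt, r=3):
    U, _, _ = np.linalg.svd(snapshots, full_matrices=False)
    V = U[:, :r]                                  # POD 降维基
    X = V.T @ snapshots                           # 降阶坐标 [r, T]
    dX = np.gradient(X, dt, axis=1)               # 有限差分估计时间导数
    A_r = np.linalg.lstsq(X.T, dX.T, rcond=None)[0].T   # 最小二乘拟合 dX ≈ A_r X
    return V, A_r

# 玩具数据:线性系统 dx/dt = A x 的显式欧拉轨迹
rng = np.random.default_rng(0)
A = -np.eye(8) + 0.1 * rng.normal(size=(8, 8))
dt = 0.01
x, traj = rng.normal(size=8), np.empty((8, 500))
for t in range(500):
    traj[:, t] = x
    x = x + dt * (A @ x)
V, A_r = operator_inference(traj, dt, r=3)
print(A_r.shape)   # (3, 3)
```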
zh
[AI-146] Causal Preference Elicitation
【速读】:该论文旨在解决在因果发现(causal discovery)过程中如何高效利用专家知识以加速后验分布收敛并提升有向效应识别准确性的难题。其核心挑战在于,在有限的交互查询预算下,如何主动选择最具信息量的局部边关系(local edge relations)进行专家判断,从而有效缩小可能的有向无环图(DAG)空间。解决方案的关键在于提出一种贝叶斯框架——因果偏好获取(causal preference elicitation),该框架通过建模专家判断的三元似然函数(three-way likelihood)来刻画边存在性和方向性上的噪声观测,并采用基于粒子近似的后验推断方法,同时以期望信息增益(expected information gain)准则优化查询策略,从而实现对专家反馈的高效利用与快速后验集中。
链接: https://arxiv.org/abs/2602.01483
作者: Edwin V. Bonilla,He Zhao,Daniel M. Steinberg
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
备注:
Abstract:We propose causal preference elicitation, a Bayesian framework for expert-in-the-loop causal discovery that actively queries local edge relations to concentrate a posterior over directed acyclic graphs (DAGs). From any black-box observational posterior, we model noisy expert judgments with a three-way likelihood over edge existence and direction. Posterior inference uses a flexible particle approximation, and queries are selected by an efficient expected information gain criterion on the expert’s categorical response. Experiments on synthetic graphs, protein signaling data, and a human gene perturbation benchmark show faster posterior concentration and improved recovery of directed effects under tight query budgets.
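在粒子近似下,一次边查询的期望信息增益可写成"回答边缘分布的熵减去各粒子条件熵的加权和"。下面按摘要的三类回答(无边 / i→j / j→i)与一个对称噪声似然给出草图,似然形式为笔者假设:

```python
import numpy as np

def response_probs(state, eps=0.1):
    # state ∈ {0: 无边, 1: i→j, 2: j→i};专家以 1-eps 概率答对,其余均分
    p = np.full(3, eps / 2)
    p[state] = 1 - eps
    return p

def expected_info_gain(weights, states, eps=0.1):
    h = lambda p: float(-(p * np.log(p + 1e-12)).sum())
    marginal = sum(w * response_probs(s, eps) for w, s in zip(weights, states))
    return h(marginal) - sum(w * h(response_probs(s, eps)) for w, s in zip(weights, states))

weights = [0.5, 0.3, 0.2]                       # 三个粒子(候选 DAG)的后验权重
print(expected_info_gain(weights, [1, 2, 0]))   # 粒子间分歧大 -> EIG 高
print(expected_info_gain(weights, [1, 1, 1]))   # 无分歧 -> EIG ≈ 0
```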
zh
[AI-147] Rod Flow: A Continuous-Time Model for Gradient Descent at the Edge of Stability
【速读】:该论文旨在解决梯度下降(Gradient Descent, GD)在非凸优化景观中训练行为的理解难题,特别是当步长较大时GD与梯度流(Gradient Flow)偏离的现象——即“稳定边缘现象”(Edge of Stability)。传统理论难以准确刻画此类情况下的动态演化。其解决方案的关键在于提出一种新的常微分方程(ODE)近似方法——Rod Flow,该方法基于将GD迭代视为一维“杆状”(rod)物理对象的原理进行推导,具有明确的物理基础和计算效率优势;理论证明其能正确预测临界尖锐阈值(critical sharpness threshold),并解释四次势能中的自稳定机制(self-stabilization),同时在简单模型和典型神经网络架构上表现出与现有最优方法Central Flow相当甚至更优的精度。
链接: https://arxiv.org/abs/2602.01480
作者: Eric Regis,Sinho Chewi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:
Abstract:How can we understand gradient-based training over non-convex landscapes? The edge of stability phenomenon, introduced in Cohen et al. (2021), indicates that the answer is not so simple: namely, gradient descent (GD) with large step sizes often diverges away from the gradient flow. In this regime, the “Central Flow”, recently proposed in Cohen et al. (2025), provides an accurate ODE approximation to the GD dynamics over many architectures. In this work, we propose Rod Flow, an alternative ODE approximation, which carries the following advantages: (1) it rests on a principled derivation stemming from a physical picture of GD iterates as an extended one-dimensional object – a “rod”; (2) it better captures GD dynamics for simple toy examples and matches the accuracy of Central Flow for representative neural network architectures, and (3) it is explicit and cheap to compute. Theoretically, we prove that Rod Flow correctly predicts the critical sharpness threshold and explains self-stabilization in quartic potentials. We validate our theory with a range of numerical experiments.
zh
[AI-148] Learning to Guide Local Search for MPE Inference in Probabilistic Graphical Models
【速读】:该论文旨在解决在概率图模型(Probabilistic Graphical Models, PGMs)中进行多次查询时,基于随机局部搜索(Stochastic Local Search, SLS)算法因贪心选择策略易陷入局部最优、收敛性能受限的问题。其核心解决方案是提出一种神经摊销(neural amortization)框架,利用固定图结构训练一个基于注意力机制的神经网络,对局部移动(local moves)进行评分,预测其减少汉明距离(Hamming distance)至近似最优解的能力;该信号被用于在邻居选择过程中平衡短期似然增益与长期潜力,从而提升局部搜索的全局探索能力。
链接: https://arxiv.org/abs/2602.01475
作者: Brij Malhotra,Shivvrat Arya,Tahrima Rahman,Vibhav Giridhar Gogate
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Most Probable Explanation (MPE) inference in Probabilistic Graphical Models (PGMs) is a fundamental yet computationally challenging problem arising in domains such as diagnosis, planning, and structured prediction. In many practical settings, the graphical model remains fixed while inference must be performed repeatedly for varying evidence patterns. Stochastic Local Search (SLS) algorithms scale to large models but rely on a myopic best-improvement rule that prioritizes immediate likelihood gains and often stagnate in poor local optima. Heuristics such as Guided Local Search (GLS+) partially alleviate this limitation by modifying the search landscape, but their guidance cannot be reused effectively across multiple inference queries on the same model. We propose a neural amortization framework for improving local search in this repeated-query regime. Exploiting the fixed graph structure, we train an attention-based network to score local moves by predicting their ability to reduce Hamming distance to a near-optimal solution. Our approach integrates seamlessly with existing local search procedures, using this signal to balance short-term likelihood gains with long-term promise during neighbor selection. We provide theoretical intuition linking distance-reducing move selection to improved convergence behavior, and empirically demonstrate consistent improvements over SLS and GLS+ on challenging high-treewidth benchmarks in the amortized inference setting.
zh
[AI-149] Legal Infrastructure for Transformative AI Governance
【速读】:该论文旨在解决当前人工智能(Artificial Intelligence, AI)治理中过度聚焦于实质性规则制定而忽视法律与监管基础设施建设的问题。作者指出,AI的变革性特征要求构建系统性的法律和监管框架以支持规则的生成与执行。其解决方案的关键在于推动三项具体制度设计:一是建立前沿模型的注册机制,二是为自主代理(autonomous agents)设立注册与标识制度,三是设计监管市场以激励私营企业创新并提供AI监管服务,从而形成可持续、动态演进的AI治理生态。
链接: https://arxiv.org/abs/2602.01474
作者: Gillian K. Hadfield
机构: 未知
类目: Artificial Intelligence (cs.AI); General Economics (econ.GN)
备注:
Abstract:Most of our AI governance efforts focus on substance: what rules do we want in place? What limits or checks do we want to impose on AI development and deployment? But a key role for law is not only to establish substantive rules but also to establish legal and regulatory infrastructure to generate and implement rules. The transformative nature of AI calls especially for attention to building legal and regulatory frameworks. In this PNAS Perspective piece I review three examples I have proposed: the creation of registration regimes for frontier models; the creation of registration and identification regimes for autonomous agents; and the design of regulatory markets to facilitate a role for private companies to innovate and deliver AI regulatory services.
zh
[AI-150] P-EAGLE: Parallel-Drafting EAGLE with Scalable Training
【速读】:该论文旨在解决生成式大语言模型(Generative AI)在长序列推理时因自回归(autoregressive)生成方式导致的高延迟问题,以及并行起草(parallel drafting)技术在长上下文训练中因计算复杂度随序列长度和并行位置数的乘积呈二次增长而难以实施的挑战。解决方案的关键在于提出P-EAGLE框架,通过引入可学习的共享隐藏状态将EAGLE从自回归机制转变为多标记并行预测机制,并结合注意力掩码预计算与序列分段策略,实现单序列内的梯度累积,从而有效扩展训练至长上下文场景,最终在vLLM中实现1.10–1.36倍的速度提升。
链接: https://arxiv.org/abs/2602.01469
作者: Mude Hui,Xin Huang,Jaime Campos Salas,Yue Sun,Nathan Pemberton,Xiang Song,Ashish Khetan,George Karypis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reasoning LLMs produce longer outputs, requiring speculative decoding drafters trained on extended sequences. Parallel drafting - predicting multiple tokens per forward pass - offers latency benefits over sequential generation, but training complexity scales quadratically with the product of sequence length and parallel positions, rendering long-context training impractical. We present P(arallel)-EAGLE, which transforms EAGLE from autoregressive to parallel multi-token prediction via a learnable shared hidden state. To scale training to long contexts, we develop a framework featuring attention mask pre-computation and sequence partitioning techniques, enabling gradient accumulation within individual sequences for parallel-prediction training. We implement P-EAGLE in vLLM and demonstrate speedups of 1.10-1.36x over autoregressive EAGLE-3 across GPT-OSS 120B, 20B, and Qwen3-Coder 30B.
zh
[AI-151] Agyn: A Multi-Agent System for Team-Based Autonomous Software Engineering
【速读】:该论文试图解决当前自主软件工程系统普遍采用单体或流水线式处理方式,难以模拟真实世界中团队协作、角色分工与流程规范的问题。解决方案的关键在于构建一个完全自动化的多智能体系统,显式地将软件工程过程建模为组织行为,通过分配协调、研究、实现和评审等专业化角色,并利用隔离沙箱和结构化通信机制,使各智能体遵循明确的开发方法论(如任务分析、规范制定、拉取请求创建与迭代评审)协同工作,从而在无需人工干预的情况下完成软件问题修复。该系统基于开源平台agyn配置代理团队,在SWE-bench 500上实现了72.4%的任务解决率,显著优于单一智能体基线,验证了组织架构设计与代理基础设施对自主软件工程的重要性。
链接: https://arxiv.org/abs/2602.01465
作者: Nikita Benkovich,Vitalii Valkov
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:Large language models have demonstrated strong capabilities in individual software engineering tasks, yet most autonomous systems still treat issue resolution as a monolithic or pipeline-based process. In contrast, real-world software development is organized as a collaborative activity carried out by teams following shared methodologies, with clear role separation, communication, and review. In this work, we present a fully automated multi-agent system that explicitly models software engineering as an organizational process, replicating the structure of an engineering team. Built on top of agyn, an open-source platform for configuring agent teams, our system assigns specialized agents to roles such as coordination, research, implementation, and review, provides them with isolated sandboxes for experimentation, and enables structured communication. The system follows a defined development methodology for working on issues, including analysis, task specification, pull request creation, and iterative review, and operates without any human intervention. Importantly, the system was designed for real production use and was not tuned for SWE-bench. When evaluated post hoc on SWE-bench 500, it resolves 72.4% of tasks, outperforming single-agent baselines using comparable language models. Our results suggest that replicating team structure, methodology, and communication is a powerful paradigm for autonomous software engineering, and that future progress may depend as much on organizational design and agent infrastructure as on model improvements.
zh
[AI-152] SimGym: Traffic-Grounded Browser Agents for Offline A/B Testing in E-Commerce
【速读】:该论文旨在解决传统A/B测试在电商用户界面(UI)优化中存在的时间成本高、流量分流影响用户体验以及实验周期长等问题。其解决方案的关键在于提出SimGym系统,该系统利用基于大型语言模型(Large Language Model, LLM)的智能体(agent)在真实浏览器环境中模拟由生产数据驱动的合成买家行为,通过提取店铺级用户画像与意图并识别行为原型,生成加权会话以对比控制组和处理组的虚拟商店表现。该方法在不依赖人工标注或训练后对齐的情况下,即可实现与真实人类用户结果的高度一致性,并将实验周期从数周缩短至一小时以内,从而支持高效、无风险的离线A/B测试。
链接: https://arxiv.org/abs/2602.01443
作者: Alberto Castelo,Zahra Zanjani Foumani,Ailin Fan,Keat Yang Koay,Vibhor Malik,Yuanzheng Zhu,Han Li,Meysam Feghhi,Ronie Uliana,Shuang Xie,Zhaoyu Zhang,Angelo Ocana Martins,Mingyu Zhao,Francis Pelland,Jonathan Faerman,Nikolas LeBlanc,Aaron Glazer,Andrew McNamara,Lingyun Wang,Zhong Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:A/B testing remains the gold standard for evaluating e-commerce UI changes, yet it diverts traffic, takes weeks to reach significance, and risks harming user experience. We introduce SimGym, a scalable system for rapid offline A/B testing using traffic-grounded synthetic buyers powered by Large Language Model agents operating in a live browser. SimGym extracts per-shop buyer profiles and intents from production interaction data, identifies distinct behavioral archetypes, and simulates cohort-weighted sessions across control and treatment storefronts. We validate SimGym against real human outcomes from real UI changes on a major e-commerce platform under confounder control. Even without alignment post-training, SimGym agents achieve state-of-the-art alignment with observed outcome shifts and reduce experiment cycles from weeks to under an hour, enabling rapid experimentation without exposure to real buyers.
zh
[AI-153] TQL: Scaling Q-Functions with Transformers by Preventing Attention Collapse
【速读】:该论文旨在解决在强化学习(Reinforcement Learning, RL)中,尽管大规模模型(如Transformer架构)在其他任务中表现出色,但其在价值函数(Value Function)学习中的应用仍受限于训练不稳定和性能下降的问题。通过实证分析,作者发现导致这一现象的关键原因是:随着模型容量增加,注意力分数(Attention Scores)会逐渐坍缩(collapse),从而破坏训练稳定性。解决方案的核心在于通过控制注意力分数的熵(Entropy)来有效防止这种坍缩,从而稳定训练过程;基于此,作者提出了Transformer Q-Learning(TQL)方法,使Transformer能够成功扩展用于RL中的价值函数学习,并在模型规模从最小到最大时实现高达43%的性能提升,而传统方法则因性能退化无法实现有效扩展。
链接: https://arxiv.org/abs/2602.01439
作者: Perry Dong,Kuo-Han Hung,Alexander Swerdlow,Dorsa Sadigh,Chelsea Finn
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite scale driving substantial recent advancements in machine learning, reinforcement learning (RL) methods still primarily use small value functions. Naively scaling value functions – including with a transformer architecture, which is known to be highly scalable – often results in learning instability and worse performance. In this work, we ask what prevents transformers from scaling effectively for value functions? Through empirical analysis, we identify the critical failure mode in this scaling: attention scores collapse as capacity increases. Our key insight is that we can effectively prevent this collapse and stabilize training by controlling the entropy of the attention scores, thereby enabling the use of larger models. To this end, we propose Transformer Q-Learning (TQL), a method that unlocks the scaling potential of transformers in learning value functions in RL. Our approach yields up to a 43% improvement in performance when scaling from the smallest to the largest network sizes, while prior methods suffer from performance degradation.
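摘要的关键观察是"控制注意力分数的熵可阻止坍缩";最直接的做法之一是对注意力熵加正则项。下面的草图演示这一思路,TQL 的实际机制与目标熵取值未必如此:

```python
import torch

def attention_entropy_penalty(attn, target_entropy):
    # attn: softmax 后的注意力分数 [batch, heads, q_len, k_len]
    ent = -(attn * (attn + 1e-12).log()).sum(dim=-1)   # 每个查询位置的注意力熵
    return ((ent - target_entropy) ** 2).mean()        # 把熵拉向目标值,防止坍缩到 0

attn = torch.softmax(torch.randn(2, 4, 8, 8) * 5, dim=-1)   # 刻意偏"尖"的注意力
print(attention_entropy_penalty(attn, target_entropy=1.5))
```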
zh
[AI-154] CIPHER: Cryptographic Insecurity Profiling via Hybrid Evaluation of Responses
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成加密代码时普遍存在安全漏洞的问题,尤其是由于设计选择不当(如静态初始化向量或缺少认证机制)导致的安全保障失效。解决方案的关键在于提出CIPHER(Cryptographic Insecurity Profiling via Hybrid Evaluation of Responses),这是一个用于量化LLM生成Python代码中加密漏洞发生率的基准测试框架;其核心创新包括:针对每个任务设计不安全/中性/安全三种提示变体、构建面向密码学的漏洞分类体系,并通过自动化评分流水线实现行级漏洞归因,从而系统评估不同提示策略对加密安全性的影响。
链接: https://arxiv.org/abs/2602.01438
作者: Max Manolov,Tony Gao,Siddharth Shukla,Cheng-Ting Chou,Ryan Lagasse
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are increasingly used to assist developers with code, yet their implementations of cryptographic functionality often contain exploitable flaws. Minor design choices (e.g., static initialization vectors or missing authentication) can silently invalidate security guarantees. We introduce CIPHER (Cryptographic Insecurity Profiling via Hybrid Evaluation of Responses), a benchmark for measuring cryptographic vulnerability incidence in LLM-generated Python code under controlled security-guidance conditions. CIPHER uses insecure/neutral/secure prompt variants per task, a cryptography-specific vulnerability taxonomy, and line-level attribution via an automated scoring pipeline. Across a diverse set of widely used LLMs, we find that explicit “secure” prompting reduces some targeted issues but does not reliably eliminate cryptographic vulnerabilities overall. The benchmark and reproducible scoring pipeline will be publicly released upon publication.
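As a concrete illustration of the vulnerability class CIPHER probes for (not an example drawn from the benchmark itself), compare a static-nonce construction with its authenticated, fresh-nonce counterpart using the widely used `cryptography` package:

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)

# Pattern CIPHER would flag: a static nonce reused across messages,
# which breaks AES-GCM confidentiality and integrity guarantees.
STATIC_NONCE = b"\x00" * 12

def encrypt_insecure(data: bytes) -> bytes:
    return AESGCM(key).encrypt(STATIC_NONCE, data, None)

# Secure counterpart: a fresh random nonce per message, shipped with the ciphertext.
def encrypt_secure(data: bytes) -> bytes:
    nonce = os.urandom(12)
    return nonce + AESGCM(key).encrypt(nonce, data, None)
```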
zh
[AI-155] DCD: Decomposition-based Causal Discovery from Autocorrelated and Non-Stationary Temporal Data
【速读】:该论文旨在解决多变量时间序列(multivariate time series)在金融、气候科学和医疗等领域中,由于长期趋势(trend)、季节性模式(seasonal pattern)和短期波动(short-term fluctuation)共同存在所导致的因果推断困难问题,尤其是在非平稳性和自相关性较强的情况下,传统因果发现方法易产生虚假边(spurious edges)和错误的时间依赖关系归因。解决方案的关键在于提出一种基于分解的因果发现框架,将每个时间序列分解为趋势、季节性和残差成分,并对不同成分分别采用不同的分析策略:趋势成分通过平稳性检验评估,季节性成分使用核基依赖度量,残差成分则应用约束基础因果发现方法;最终将各成分的因果图整合为统一的多尺度因果结构,从而有效分离长程与短程因果效应,减少虚假关联并提升可解释性。
链接: https://arxiv.org/abs/2602.01433
作者: Muhammad Hasan Ferdous,Md Osman Gani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Multivariate time series in domains such as finance, climate science, and healthcare often exhibit long-term trends, seasonal patterns, and short-term fluctuations, complicating causal inference under non-stationarity and autocorrelation. Existing causal discovery methods typically operate on raw observations, making them vulnerable to spurious edges and misattributed temporal dependencies. We introduce a decomposition-based causal discovery framework that separates each time series into trend, seasonal, and residual components and performs component-specific causal analysis. Trend components are assessed using stationarity tests, seasonal components using kernel-based dependence measures, and residual components using constraint-based causal discovery. The resulting component-level graphs are integrated into a unified multi-scale causal structure. This approach isolates long- and short-range causal effects, reduces spurious associations, and improves interpretability. Across extensive synthetic benchmarks and real-world climate data, our framework more accurately recovers ground-truth causal structure than state-of-the-art baselines, particularly under strong non-stationarity and temporal autocorrelation.
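A minimal decompose-then-analyze skeleton, using `statsmodels` for the decomposition and trend stationarity test; the kernel-based seasonal test and constraint-based residual discovery the paper uses would come from dedicated tooling (e.g. the tigramite package) and are only noted in comments here:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller

def decompose(series: pd.Series, period: int):
    """Split a series into trend, seasonal, and residual components."""
    parts = seasonal_decompose(series, model="additive", period=period)
    return parts.trend.dropna(), parts.seasonal, parts.resid.dropna()

# Component-specific analysis skeleton: stationarity testing on the trend;
# kernel dependence (seasonal) and constraint-based discovery (residual)
# would use dedicated causal-discovery tooling.
rng = np.random.default_rng(0)
t = np.arange(400)
x = pd.Series(0.01 * t + np.sin(2 * np.pi * t / 12) + rng.normal(size=400))
trend, seasonal, resid = decompose(x, period=12)
print("ADF p-value on trend:", adfuller(trend)[1])
```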
zh
[AI-156] Building Better Deception Probes Using Targeted Instruction Pairs
【速读】:该论文旨在解决当前用于检测生成式 AI (Generative AI) 误导行为的线性探测器(linear probes)在实际应用中表现不稳定的问题,尤其是其在简单场景下仍存在虚假关联和误报现象。解决方案的关键在于:首先,明确训练时使用的指令对(instruction pair)对探测性能具有决定性影响(解释了70.6%的方差),表明这些指令更有效捕捉的是欺骗意图而非内容特征;其次,通过基于人类可解释的欺骗分类体系(taxonomy of deception)针对性设计探测器,能够显著提升在评估数据集上的效果。研究建议组织应根据自身威胁模型定制专用探测器,而非追求通用的欺骗检测方案。
链接: https://arxiv.org/abs/2602.01425
作者: Vikram Natarajan,Devina Jain,Shivam Arora,Satvik Golechha,Joseph Bloom
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Linear probes are a promising approach for monitoring AI systems for deceptive behaviour. Previous work has shown that a linear classifier trained on a contrastive instruction pair and a simple dataset can achieve good performance. However, these probes exhibit notable failures even in straightforward scenarios, including spurious correlations and false positives on non-deceptive responses. In this paper, we identify the importance of the instruction pair used during training. Furthermore, we show that targeting specific deceptive behaviors through a human-interpretable taxonomy of deception leads to improved results on evaluation datasets. Our findings reveal that instruction pairs capture deceptive intent rather than content-specific patterns, explaining why prompt choice dominates probe performance (70.6% of variance). Given the heterogeneity of deception types across datasets, we conclude that organizations should design specialized probes targeting their specific threat models rather than seeking a universal deception detector.
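For readers unfamiliar with the setup, a linear deception probe of the kind studied here can be sketched in a few lines. The activations below are random placeholders standing in for residual-stream vectors collected under a contrastive instruction pair; the logistic regression stands in for the linear classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder activations standing in for residual-stream vectors collected
# under an "answer honestly" vs. "answer deceptively" instruction pair.
rng = np.random.default_rng(0)
honest = rng.normal(0.0, 1.0, size=(500, 4096))
deceptive = rng.normal(0.3, 1.0, size=(500, 4096))

X = np.vstack([honest, deceptive])
y = np.concatenate([np.zeros(500), np.ones(500)])

probe = LogisticRegression(max_iter=1000, C=0.1).fit(X, y)
# score new responses: probe.predict_proba(new_acts)[:, 1]
```

The paper's point is that which instruction pair produces the two activation sets matters far more than the probe architecture itself.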
zh
[AI-157] Semi-supervised CAPP Transformer Learning via Pseudo-labeling
【速读】:该论文旨在解决制造领域中高阶计算机辅助工艺规划(Computer-Aided Process Planning, CAPP)模型在工业场景下因数据集规模有限而导致的泛化能力不足问题。其解决方案的关键在于提出一种半监督学习方法,利用已训练的“oracle”模型(基于现有Transformer行为数据)对未见零件的预测结果进行筛选,从而获取高质量的伪标签样本,并通过单次(one-shot)方式对Transformer-based CAPP模型进行再训练,显著提升了小样本条件下的预测准确性。
链接: https://arxiv.org/abs/2602.01419
作者: Dennis Gross,Helge Spieker,Arnaud Gotlieb,Emmanuel Stathatos,Panorios Benardos,George-Christopher Vosniakos
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:High-level Computer-Aided Process Planning (CAPP) generates manufacturing process plans from part specifications. It suffers from limited dataset availability in industry, reducing model generalization. We propose a semi-supervised learning approach to improve transformer-based CAPP models without manual labeling. An oracle, trained on available transformer behaviour data, filters correct predictions from unseen parts, which are then used for one-shot retraining. Experiments on small-scale datasets with simulated ground truth across the full data distribution show consistent accuracy gains over baselines, demonstrating the method’s effectiveness in data-scarce manufacturing environments.
zh
[AI-158] An Odd Estimator for Shapley Values
【速读】:该论文旨在解决Shapley值在机器学习中精确计算通常不可行的问题,从而需要高效近似方法。其核心挑战在于现有估计器虽能降低误差,但缺乏理论机制解释为何有效。解决方案的关键在于揭示Shapley值仅依赖于集合函数的奇部(odd component),并通过配对采样(paired sampling)使回归目标正交化,从而过滤掉无关的偶部(even component)。基于此洞察,作者提出OddSHAP,一种新的无偏一致估计器,通过仅在奇子空间上进行多项式回归实现高精度估计,并利用傅里叶基分离该子空间、借助代理模型识别高影响交互项,有效克服高阶逼近中的组合爆炸问题。
链接: https://arxiv.org/abs/2602.01399
作者: Fabian Fumagalli,Landon Butler,Justin Singh Kang,Kannan Ramchandran,R. Teal Witter
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:The Shapley value is a ubiquitous framework for attribution in machine learning, encompassing feature importance, data valuation, and causal inference. However, its exact computation is generally intractable, necessitating efficient approximation methods. While the most effective and popular estimators leverage the paired sampling heuristic to reduce estimation error, the theoretical mechanism driving this improvement has remained opaque. In this work, we provide an elegant and fundamental justification for paired sampling: we prove that the Shapley value depends exclusively on the odd component of the set function, and that paired sampling orthogonalizes the regression objective to filter out the irrelevant even component. Leveraging this insight, we propose OddSHAP, a novel consistent estimator that performs polynomial regression solely on the odd subspace. By utilizing the Fourier basis to isolate this subspace and employing a proxy model to identify high-impact interactions, OddSHAP overcomes the combinatorial explosion of higher-order approximations. Through an extensive benchmark evaluation, we find that OddSHAP achieves state-of-the-art estimation accuracy.
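The identity the paper builds on, that the Shapley value depends only on the odd part f_odd(S) = (f(S) − f(N∖S))/2 of the set function, can be checked directly by brute force on a toy game (the set function below is arbitrary, chosen only for illustration):

```python
from itertools import combinations
from math import factorial

def shapley(f, n):
    """Exact Shapley values of set function f over players 0..n-1."""
    phi = [0.0] * n
    for i in range(n):
        others = [p for p in range(n) if p != i]
        for size in range(n):
            for S in combinations(others, size):
                S = frozenset(S)
                w = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += w * (f(S | {i}) - f(S))
    return phi

def odd_part(f, n):
    full = frozenset(range(n))
    return lambda S: 0.5 * (f(S) - f(full - S))

f = lambda S: len(S) ** 2 + (3.0 if {0, 2} <= S else 0.0)
print(shapley(f, 4))
print(shapley(odd_part(f, 4), 4))  # identical up to floating-point error
```

The even part contributes nothing because, under the Shapley weights, the marginal contribution of i to S cancels against its contribution to N∖S∖i; paired sampling exploits exactly this cancellation.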
zh
[AI-159] How well can VLMs rate audio descriptions: A multi-dimensional quantitative assessment framework
【速读】:该论文旨在解决盲人及低视力群体在观看数字视频时因缺乏音频描述(Audio Description, AD)而被排除在外的问题,尤其聚焦于如何系统性评估长时视频中AD内容的质量。现有方法依赖自然语言处理(NLP)指标和短片段指导原则,难以覆盖完整视频内容且缺乏可扩展的评估机制。解决方案的关键在于:首先构建了一个基于专业指南并经无障碍专家优化的多维评估框架,用于衡量不间断全时长视频的AD质量;其次,将该框架嵌入基于项目反应理论(Item Response Theory, IRT)的方法学流程中,实现对视觉语言模型(Vision-Language Models, VLMs)与人类评价者相对于专家基准的熟练度量化评估。研究发现,VLM虽能高一致性地逼近专家评分,但其推理过程可靠性与可操作性不及人类,提示未来应发展结合VLM自动化与人工监督的混合评估体系,以实现AD质量控制的规模化落地。
链接: https://arxiv.org/abs/2602.01390
作者: Lana Do,Gio Jung,Juvenal Francisco Barajas,Andrew Taylor Scott,Shasta Ihorn,Alexander Mario Blum,Vassilis Athitsos,Ilmi Yoon
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Digital video is central to communication, education, and entertainment, but without audio description (AD), blind and low-vision audiences are excluded. While crowdsourced platforms and vision-language models (VLMs) expand AD production, quality is rarely checked systematically. Existing evaluations rely on NLP metrics and short-clip guidelines, leaving questions about what constitutes quality for full-length content and how to assess it at scale. To address these questions, we first developed a multi-dimensional assessment framework for uninterrupted, full-length video, grounded in professional guidelines and refined by accessibility specialists. Second, we integrated this framework into a comprehensive methodological workflow, utilizing Item Response Theory, to assess the proficiency of VLM and human raters against expert-established ground truth. Findings suggest that while VLMs can approximate ground-truth ratings with high alignment, their reasoning was found to be less reliable and actionable than that of human respondents. These insights show the potential of hybrid evaluation systems that leverage VLMs alongside human oversight, offering a path towards scalable AD quality control.
zh
[AI-160] “If You're Very Clever No One Knows You've Used It”: The Social Dynamics of Developing Generative AI Literacy in the Workplace
【速读】:该论文旨在解决当前关于生成式 AI (Generative AI) 文化素养的研究缺乏实证洞察的问题,特别是知识工作者对 GenAI 文化素养的认知如何受到工作场所社会动态的影响,以及他们在实际职业环境中如何习得 GenAI 工具的应用能力。研究通过深度访谈19名来自不同行业的知识工作者发现,尽管同事间知识共享有助于学习,但员工倾向于隐藏使用 GenAI 的痕迹以维护自身专业权威,这种行为反而抑制了知识共享机会并削弱了透明度。解决方案的关键在于推动开放对话、提升用户生成知识的可见性,并强化协作学习在应对快速技术变革中的价值,从而构建更健康的职场 AI 文化素养生态。
链接: https://arxiv.org/abs/2602.01386
作者: Qing (Nancy) Xia,Marios Constantinides,Advait Sarkar,Duncan Brumby,Anna Cox
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: CHIWORK 2026
Abstract:Generative AI (GenAI) tools are rapidly transforming knowledge work, making AI literacy a critical priority for organizations. However, research on AI literacy lacks empirical insight into how knowledge workers’ beliefs around GenAI literacy are shaped by the social dynamics of the workplace, and how workers learn to apply GenAI tools in these environments. To address this gap, we conducted in-depth interviews with 19 knowledge workers across multiple sectors to examine how they develop GenAI competencies in real-world professional contexts. We found that, while knowledge sharing from colleagues supported learning, the ability to remove cues indicating GenAI use was perceived as validation of domain expertise. These behaviours ultimately reduced opportunities for learning via knowledge sharing and undermined transparency. To advance workplace AI literacy, we argue for fostering open dialogue, increasing visibility of user-generated knowledge, and greater emphasis on the benefits of collaborative learning for navigating rapid technological developments.
zh
[AI-161] Deep Variational Contrastive Learning for Joint Risk Stratification and Time-to-Event Estimation
【速读】:该论文旨在解决深度生存分析模型中性能与可解释性之间的权衡问题:传统神经网络虽具备高预测精度,但因其黑箱特性难以在临床实践中被采纳;而基于深度聚类的方法虽能提供可解释的患者风险分层,却往往牺牲了预测性能。解决方案的关键在于提出CONVERSE(CONtrastive Variational Ensemble for Risk Stratification and Estimation),该模型通过将变分自编码器(Variational Autoencoder, VAE)与对比学习(Contrastive Learning)相结合,利用多层级的簇内和簇间对比损失来增强嵌入空间的结构化表示能力,同时引入自步学习(self-paced learning)机制逐步优化训练稳定性,并支持每个风险簇独立的生存头(survival head)以实现精准集成预测,从而在保持良好可解释性的同时显著提升预测性能。
链接: https://arxiv.org/abs/2602.01367
作者: Pinar Erbil,Alberto Archetti,Eugenio Lomurno,Matteo Matteucci
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Survival analysis is essential for clinical decision-making, as it allows practitioners to estimate time-to-event outcomes, stratify patient risk profiles, and guide treatment planning. Deep learning has revolutionized this field with unprecedented predictive capabilities but faces a fundamental trade-off between performance and interpretability. While neural networks achieve high accuracy, their black-box nature limits clinical adoption. Conversely, deep clustering-based methods that stratify patients into interpretable risk groups typically sacrifice predictive power. We propose CONVERSE (CONtrastive Variational Ensemble for Risk Stratification and Estimation), a deep survival model that bridges this gap by unifying variational autoencoders with contrastive learning for interpretable risk stratification. CONVERSE combines variational embeddings with multiple intra- and inter-cluster contrastive losses. Self-paced learning progressively incorporates samples from easy to hard, improving training stability. The model supports cluster-specific survival heads, enabling accurate ensemble predictions. Comprehensive evaluation on four benchmark datasets demonstrates that CONVERSE achieves competitive or superior performance compared to existing deep survival methods, while maintaining meaningful patient stratification.
zh
[AI-162] PaAno: Patch-Based Representation Learning for Time-Series Anomaly Detection ICLR2026
【速读】:该论文旨在解决当前时间序列异常检测方法中存在计算成本高、内存占用大以及在资源受限场景下难以部署的问题,同时指出尽管大型神经网络架构(如Transformer和基础模型)被广泛采用,但在严格评估协议下其性能提升并不显著。解决方案的关键在于提出一种轻量级且高效的基于补丁的表示学习方法(Patch-based representation learning for time-series Anomaly detection, PaAno):通过从训练数据中提取短时序补丁,并利用一维卷积神经网络(1D CNN)将其嵌入为向量表示;训练阶段结合三元组损失(triplet loss)与预训练任务损失(pretext loss),确保嵌入能捕捉输入补丁中的有效时序模式;推理时则通过比较当前时间点周围补丁与训练集中正常补丁的嵌入差异来计算异常得分。该方法在TSB-AD基准上实现了SOTA性能,显著优于依赖复杂架构的现有方法。
链接: https://arxiv.org/abs/2602.01359
作者: Jinju Park,Seokho Kang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by the 14th International Conference on Learning Representations (ICLR 2026)
Abstract:Although recent studies on time-series anomaly detection have increasingly adopted ever-larger neural network architectures such as transformers and foundation models, they incur high computational costs and memory usage, making them impractical for real-time and resource-constrained scenarios. Moreover, they often fail to demonstrate significant performance gains over simpler methods under rigorous evaluation protocols. In this study, we propose Patch-based representation learning for time-series Anomaly detection (PaAno), a lightweight yet effective method for fast and efficient time-series anomaly detection. PaAno extracts short temporal patches from time-series training data and uses a 1D convolutional neural network to embed each patch into a vector representation. The model is trained using a combination of triplet loss and pretext loss to ensure the embeddings capture informative temporal patterns from input patches. During inference, the anomaly score at each time step is computed by comparing the embeddings of its surrounding patches to those of normal patches extracted from the training time-series. Evaluated on the TSB-AD benchmark, PaAno achieved state-of-the-art performance, significantly outperforming existing methods, including those based on heavy architectures, on both univariate and multivariate time-series anomaly detection across various range-wise and point-wise performance measures.
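A minimal sketch of the two pieces the abstract describes, a 1D-CNN patch encoder and nearest-normal-embedding scoring; the triplet and pretext training losses are omitted, and all shapes and layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """1D CNN that embeds a univariate patch into a vector (minimal sketch)."""
    def __init__(self, emb_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, emb_dim),
        )

    def forward(self, x):                 # x: (batch, patch_len)
        return self.net(x.unsqueeze(1))

@torch.no_grad()
def anomaly_scores(encoder, series, normal_patches, patch_len=32):
    """Score each step by the distance of its patch to the closest normal patch."""
    bank = encoder(normal_patches)                  # (N, emb) normal bank
    patches = series.unfold(0, patch_len, 1)        # sliding patches (T-L+1, L)
    emb = encoder(patches)
    return torch.cdist(emb, bank).min(dim=1).values  # per-patch score

# Training would add nn.TripletMarginLoss() over (anchor, positive, negative)
# patches plus a pretext-task loss, as the paper describes.
```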
zh
[AI-163] Aggregation Queries over Unstructured Text: Benchmark and Agentic Method
【速读】:该论文旨在解决自由文本上的聚合查询(aggregation query over free text)问题,其核心挑战在于要求系统必须“找到全部”相关证据,而非仅返回单个答案,这与传统问答任务存在本质区别。现有方法如Text-to-SQL和检索增强生成(Retrieval-Augmented Generation, RAG)难以满足严格的完整性要求。为此,作者提出在语料库受限设定下形式化实体级聚合查询,并构建了AGGBench基准以评估大规模场景下的完整性导向聚合能力。解决方案的关键在于提出DFA(Disambiguation–Filtering–Aggregation)模块化代理基线,将聚合查询分解为可解释的三个阶段:消歧、过滤与聚合,从而暴露并缓解因歧义、筛选不当及聚合策略不足导致的失败模式。实证结果表明,DFA在聚合证据覆盖度上显著优于强基线模型。
链接: https://arxiv.org/abs/2602.01355
作者: Haojia Zhu,Qinyuan Xu,Haoyu Li,Yuxi Liu,Hanchen Qiu,Jiaoyan Chen,Jiahui Jin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Aggregation query over free text is a long-standing yet underexplored problem. Unlike ordinary question answering, aggregate queries require exhaustive evidence collection and systems are required to “find all,” not merely “find one.” Existing paradigms such as Text-to-SQL and Retrieval-Augmented Generation fail to achieve this completeness. In this work, we formalize entity-level aggregation querying over text in a corpus-bounded setting with strict completeness requirement. To enable principled evaluation, we introduce AGGBench, a benchmark designed to evaluate completeness-oriented aggregation under realistic large-scale corpus. To accompany the benchmark, we propose DFA (Disambiguation–Filtering–Aggregation), a modular agentic baseline that decomposes aggregation querying into interpretable stages and exposes key failure modes related to ambiguity, filtering, and aggregation. Empirical results show that DFA consistently improves aggregation evidence coverage over strong RAG and agentic baselines. The data and code are available in this https URL.
zh
[AI-164] Model Specific Task Similarity for Vision Language Model Selection via Layer Conductance
【速读】:该论文旨在解决开源视觉-语言模型(Vision-Language Models, VLMs)在特定下游任务中选择最优预训练模型的难题,尤其是在计算资源受限和少样本场景下,传统方法因依赖数据密集型代理或对称文本描述而难以有效评估模型迁移能力。解决方案的关键在于提出一种基于视觉编码器内部功能动态的模型选择框架:通过层间导纳(conductance)表征任务,并利用熵正则化对齐构建目标条件下的模块重要性分布;进而定义方向性导纳差异(Directional Conductance Divergence, DCD),该不对称指标量化源任务覆盖目标任务关键功能模块的有效性,从而无需直接推理即可预测目标模型排名,显著提升选择精度。
链接: https://arxiv.org/abs/2602.01346
作者: Wei Yang,Hong Xie,Tao Tan,Xin Li,Defu Lian,Enhong Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Preprint. Under review
Abstract:While open sourced Vision-Language Models (VLMs) have proliferated, selecting the optimal pretrained model for a specific downstream task remains challenging. Exhaustive evaluation is often infeasible due to computational constraints and data limitations in few shot scenarios. Existing selection methods fail to fully address this: they either rely on data-intensive proxies or use symmetric textual descriptors that neglect the inherently directional and model-specific nature of transferability. To address this problem, we propose a framework that grounds model selection in the internal functional dynamics of the visual encoder. Our approach represents each task via layer wise conductance and derives a target-conditioned block importance distribution through entropy regularized alignment. Building on this, we introduce Directional Conductance Divergence (DCD), an asymmetric metric that quantifies how effectively a source task covers the target’s salient functional blocks. This allows for predicting target model rankings by aggregating source task ranks without direct inference. Experimental results on 48 VLMs across 21 datasets demonstrate that our method outperforms state-of-the-art baselines, achieving a 14.7% improvement in NDCG@5 over SWAB.
zh
[AI-165] Adaptive Quantum-Safe Cryptography for 6G Vehicular Networks via Context-Aware Optimization NDSS2026
【速读】:该论文旨在解决未来6G车联网(Vehicle-to-Everything, V2X)通信中因量子计算威胁而面临的密码学安全问题,同时应对后量子密码学(Post-Quantum Cryptography, PQC)算法在高动态环境中带来的计算开销大、延迟高和频繁密钥切换引入新攻击面的挑战。解决方案的关键在于提出一种自适应后量子密码框架(Adaptive Post-Quantum Cryptography Framework),其核心是基于预测的多目标进化算法(Adaptive Predictive Multi-Objective Evolutionary Algorithm, APMOEA),能够根据短期移动性和信道变化动态选择适合的格基(lattice-based)、编码基(code-based)或哈希基(hash-based)PQC配置,在保障安全性的同时优化时延与通信开销;此外,通过设计安全单调升级协议(Secure Monotonic-Upgrade Protocol)有效抵御降级攻击、重放攻击和不同步攻击,确保算法切换过程的安全性与稳定性。
链接: https://arxiv.org/abs/2602.01342
作者: Poushali Sengupta,Mayank Raikwar,Sabita Maharjan,Frank Eliassen,Yan Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Applications (stat.AP)
备注: Accepted for presentation at NDSS 2026 - FutureG Workshop, 23 February 2026. (10 pages, 5 figures.)
Abstract:Powerful quantum computers in the future may be able to break the security used for communication between vehicles and other devices (Vehicle-to-Everything, or V2X). New security methods called post-quantum cryptography can help protect these systems, but they often require more computing power and can slow down communication, posing a challenge for fast 6G vehicle networks. In this paper, we propose an adaptive post-quantum cryptography (PQC) framework that predicts short-term mobility and channel variations and dynamically selects suitable lattice-, code-, or hash-based PQC configurations using an adaptive predictive multi-objective evolutionary algorithm (APMOEA) to meet vehicular latency and security requirements. However, frequent cryptographic reconfiguration in dynamic vehicular environments introduces new attack surfaces during algorithm transitions. A secure monotonic-upgrade protocol prevents downgrade, replay, and desynchronization attacks during transitions. Theoretical results show decision stability under bounded prediction error, latency boundedness under mobility drift, and correctness under small forecast noise. These results demonstrate a practical path toward quantum-safe cryptography in future 6G vehicular networks. Through extensive experiments based on realistic mobility (LuST), weather (ERA5), and NR-V2X channel traces, we show that the proposed framework reduces end-to-end latency by up to 27%, lowers communication overhead by up to 65%, and effectively stabilizes cryptographic switching behavior using reinforcement learning. Moreover, under the evaluated adversarial scenarios, the monotonic-upgrade protocol successfully prevents downgrade, replay, and desynchronization attacks.
zh
[AI-166] xRay: Agent ic Postmortem of Live Blockchain Attacks
【速读】:该论文旨在解决去中心化金融(DeFi)生态系统中因“任何人可取”(Anyone-Can-Take, ACT)漏洞导致的安全事件后分析效率低、证据链重建困难的问题。当前Postmortem分析依赖人工从有限线索(如单个交易哈希)出发,手动恢复攻击生命周期,耗时且易出错。其核心解决方案是提出TxRay——一个基于大语言模型(LLM)的代理式事后分析系统,通过工具调用自动从种子交易出发重构完整攻击流程,生成可执行、自包含的攻击复现证明(Proof of Concept, PoC),并利用语义断言(semantic oracles)进行自我验证。关键创新在于将攻击复现过程形式化为可执行逻辑,并借助独立的PoCEvaluator对PoC质量进行自动化评估,从而显著提升根因定位准确率(92.11%)与PoC泛化能力(避免硬编码攻击者地址比例提升24.8个百分点),实现高效、可靠、可重复的DeFi安全事件复盘。
链接: https://arxiv.org/abs/2602.01317
作者: Ziyue Wang,Jiangshan Yu,Kaihua Qin,Dawn Song,Arthur Gervais,Liyi Zhou
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 21 pages, 9 figures
Abstract:Decentralized Finance (DeFi) has turned blockchains into financial infrastructure, allowing anyone to trade, lend, and build protocols without intermediaries, but this openness exposes pools of value controlled by code. Within five years, the DeFi ecosystem has lost over 15.75B USD to reported exploits. Many exploits arise from permissionless opportunities that any participant can trigger using only public state and standard interfaces, which we call Anyone-Can-Take (ACT) opportunities. Despite on-chain transparency, postmortem analysis remains slow and manual: investigations start from limited evidence, sometimes only a single transaction hash, and must reconstruct the exploit lifecycle by recovering related transactions, contract code, and state dependencies. We present TxRay, a Large Language Model (LLM) agentic postmortem system that uses tool calls to reconstruct live ACT attacks from limited evidence. Starting from one or more seed transactions, TxRay recovers the exploit lifecycle, derives an evidence-backed root cause, and generates a runnable, self-contained Proof of Concept (PoC) that deterministically reproduces the incident. TxRay self-checks postmortems by encoding incident-specific semantic oracles as executable assertions. To evaluate PoC correctness and quality, we develop PoCEvaluator, an independent agentic execution-and-review evaluator. On 114 incidents from DeFiHackLabs, TxRay produces an expert-aligned root cause and an executable PoC for 105 incidents, achieving 92.11% end-to-end reproduction. Under PoCEvaluator, 98.1% of TxRay PoCs avoid hard-coding attacker addresses, a +24.8pp lift over DeFiHackLabs. In a live deployment, TxRay delivers validated root causes in 40 minutes and PoCs in 59 minutes at median latency. TxRay’s oracle-validated PoCs enable attack imitation, improving coverage by 15.6% and 65.5% over STING and APE.
zh
[AI-167] Dispelling the Curse of Singularities in Neural Network Optimizations
【速读】:该论文旨在解决深度神经网络训练过程中因参数空间奇异点(parametric singularities)的涌现与放大所导致的优化不稳定性问题。研究表明,随着梯度更新,参数空间中的奇异点会不断增长,并进一步增强与表示空间的对齐,从而加剧表示空间中的奇异程度,形成一种“奇异诅咒”(curse of singularities),使得损失函数出现尖锐爆炸的风险显著上升。解决方案的关键在于提出参数奇异平滑(Parametric Singularity Smoothing, PSS),该方法通过轻量、灵活且有效的方式平滑权重矩阵的奇异谱,从而抑制奇异值的过度增长,缓解优化过程中的不稳定现象,提升训练效率与泛化性能。
链接: https://arxiv.org/abs/2602.01308
作者: Hengjie Cao,Mengyi Chen,Yifeng Yang,Fang Dong,Ruijun Huang,Anrui Chen,Jixian Zhou,Mingzhi Dong,Yujiang Wang,Dongsheng Li,Wenyi Fang,Yuanyi Lin,Fan Wu,Li Shang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:This work investigates the optimization instability of deep neural networks from a less-explored yet insightful perspective: the emergence and amplification of singularities in the parametric space. Our analysis reveals that parametric singularities inevitably grow with gradient updates and further intensify alignment with representations, leading to increased singularities in the representation space. We show that the gradient Frobenius norms are bounded by the top singular values of the weight matrices, and as training progresses, the mutually reinforcing growth of weight and representation singularities, termed the curse of singularities, relaxes these bounds, escalating the risk of sharp loss explosions. To counter this, we propose Parametric Singularity Smoothing (PSS), a lightweight, flexible, and effective method for smoothing the singular spectra of weight matrices. Extensive experiments across diverse datasets, architectures, and optimizers demonstrate that PSS mitigates instability, restores trainability even after failure, and improves both training efficiency and generalization.
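One plausible instantiation of the spectrum-smoothing idea is to shrink each singular value toward the bulk of the spectrum; the blending rule below is our illustrative assumption, and the paper's exact PSS operator and schedule may differ:

```python
import torch

@torch.no_grad()
def smooth_singular_spectrum(W: torch.Tensor, tau: float = 0.9) -> torch.Tensor:
    """Blend each singular value with the spectrum median, damping outliers."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    S_smooth = tau * S + (1.0 - tau) * S.median()
    return U @ torch.diag(S_smooth) @ Vh

# Applied periodically to weight matrices during training, e.g.:
# layer.weight.copy_(smooth_singular_spectrum(layer.weight))
```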
zh
[AI-168] RE-MCDF: Closed-Loop Multi-Expert LLM Reasoning for Knowledge-Grounded Clinical Diagnosis
【速读】:该论文旨在解决电子病历(Electronic Medical Records, EMRs)在神经科等临床场景中因异质性、稀疏性和噪声导致大语言模型(Large Language Models, LLMs)诊断易产生自强化错误的问题,同时弥补现有多代理框架交互浅层化及忽略疾病间逻辑关系(如互斥性、病理相容性与诊断混淆)的缺陷。其解决方案的关键在于提出一种关系增强的多专家临床诊断框架(Relation-enhanced Multi-expert Clinical Diagnosis Framework, RE-MCDF),该框架采用生成-验证-修订闭环架构,融合三个互补模块:主专家生成候选诊断及证据、实验室专家动态重加权多源临床指标、多关系感知与评估专家组显式施加疾病间逻辑约束,并依托医学知识图谱(Medical Knowledge Graph, MKG)实现证据自适应调整与诊断逻辑一致性校验,从而在复杂诊断任务中显著提升准确性与鲁棒性。
链接: https://arxiv.org/abs/2602.01297
作者: Shaowei Shen,Xiaohong Yang,Jie Yang,Lianfen Huang,Yongcai Zhang,Yang Zou,Seyyedali Hosseinalipour
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures
Abstract:Electronic medical records (EMRs), particularly in neurology, are inherently heterogeneous, sparse, and noisy, which poses significant challenges for large language models (LLMs) in clinical diagnosis. In such settings, single-agent systems are vulnerable to self-reinforcing errors, as their predictions lack independent validation and can drift toward spurious conclusions. Although recent multi-agent frameworks attempt to mitigate this issue through collaborative reasoning, their interactions are often shallow and loosely structured, failing to reflect the rigorous, evidence-driven processes used by clinical experts. More fundamentally, existing approaches largely ignore the rich logical dependencies among diseases, such as mutual exclusivity, pathological compatibility, and diagnostic confusion. This limitation prevents them from ruling out clinically implausible hypotheses, even when sufficient evidence is available. To overcome these, we propose RE-MCDF, a relation-enhanced multi-expert clinical diagnosis framework. RE-MCDF introduces a generation–verification–revision closed-loop architecture that integrates three complementary components: (i) a primary expert that generates candidate diagnoses and supporting evidence, (ii) a laboratory expert that dynamically prioritizes heterogeneous clinical indicators, and (iii) a multi-relation awareness and evaluation expert group that explicitly enforces inter-disease logical constraints. Guided by a medical knowledge graph (MKG), the first two experts adaptively reweight EMR evidence, while the expert group validates and corrects candidate diagnoses to ensure logical consistency. Extensive experiments on the neurology subset of CMEMR (NEEMRs) and on our curated dataset (XMEMRs) demonstrate that RE-MCDF consistently outperforms state-of-the-art baselines in complex diagnostic scenarios.
zh
[AI-169] Multi-LLM Adaptive Conformal Inference for Reliable LLM Responses ICLR2026
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在医疗、法律等高风险领域应用中确保事实准确性(factuality)的问题。现有基于符合推理(conformal inference)的方法要么过于保守导致大量真实陈述被错误排除,要么依赖于自适应误差率和简单线性模型,难以捕捉复杂群体结构。其解决方案的关键在于将符合推理重新建模为乘法过滤(multiplicative filtering)框架,将事实性建模为声明级分数的乘积,并提出多LLM自适应符合推理(Multi-LLM Adaptive Conformal Inference, MACI)方法:通过集成多个LLM生成更精确的事实性评分,在保持组条件校准(group-conditional calibration)以保障有效性的同时,显著提升保留率(retention)并降低计算时间成本。
链接: https://arxiv.org/abs/2602.01285
作者: Kangjun Noh,Seongchan Lee,Ilmun Kim,Kyungwoo Song
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Accepted to ICLR 2026
Abstract:Ensuring factuality is essential for the safe use of Large Language Models (LLMs) in high-stakes domains such as medicine and law. Conformal inference provides distribution-free guarantees, but existing approaches are either overly conservative, discarding many true claims, or rely on adaptive error rates and simple linear models that fail to capture complex group structures. To address these challenges, we reformulate conformal inference in a multiplicative filtering setting, modeling factuality as a product of claim-level scores. Our method, Multi-LLM Adaptive Conformal Inference (MACI), leverages ensembles to produce more accurate factuality scores, which in our experiments led to higher retention, while validity is preserved through group-conditional calibration. Experiments show that MACI consistently achieves user-specified coverage with substantially higher retention and lower time cost than baselines. Our repository is available at this https URL
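A minimal sketch of group-conditional calibration for claim filtering. The simple false-claim quantile rule below yields the guarantee stated in its docstring; MACI's multiplicative guarantee is formulated differently, so treat this as an illustration of the calibration style rather than the paper's exact procedure:

```python
import numpy as np

def group_cutoffs(scores, labels, groups, alpha=0.1):
    """Per-group retention cutoffs from a calibration set (minimal sketch).

    scores: claim-level factuality scores (higher = more likely true),
    labels: 1 if a claim is actually true, groups: group id per claim; all
    numpy arrays. Retaining claims with score >= cutoff keeps, per group,
    at most about an alpha fraction of false claims."""
    cutoffs = {}
    for g in np.unique(groups):
        false_scores = scores[(groups == g) & (labels == 0)]
        if len(false_scores) == 0:
            cutoffs[g] = -np.inf
            continue
        n = len(false_scores)
        q = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample bump
        cutoffs[g] = np.quantile(false_scores, q)
    return cutoffs
```

An ensemble of LLM judges can supply better-calibrated per-claim scores, which is the lever MACI exploits.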
zh
[AI-170] LLM -Driven Ontology Construction for Enterprise Knowledge Graphs
【速读】:该论文旨在解决企业知识图谱(Enterprise Knowledge Graphs)构建中本体(Ontology)生成过程高度依赖人工、资源密集且严重依赖领域专家知识的问题。解决方案的关键在于提出了一种基于大语言模型(Large Language Model, LLM)的端到端流水线 OntoEKG,其核心包括两个阶段:一是抽取模块,用于从非结构化企业数据中识别核心类(classes)和属性(properties);二是蕴含模块(entailment module),通过逻辑推理将这些元素组织为层次结构,并序列化为标准 RDF 格式。该方法显著提升了本体构建效率,同时在跨数据、金融与物流领域的评估中验证了其有效性与局限性。
链接: https://arxiv.org/abs/2602.01276
作者: Abdulsobur Oyewale,Tommaso Soru
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 20th International Conference on Semantic Computing (ICSC 2026)
Abstract:Enterprise Knowledge Graphs have become essential for unifying heterogeneous data and enforcing semantic governance. However, the construction of their underlying ontologies remains a resource-intensive, manual process that relies heavily on domain expertise. This paper introduces OntoEKG, a LLM-driven pipeline designed to accelerate the generation of domain-specific ontologies from unstructured enterprise data. Our approach decomposes the modelling task into two distinct phases: an extraction module that identifies core classes and properties, and an entailment module that logically structures these elements into a hierarchy before serialising them into standard RDF. Addressing the significant lack of comprehensive benchmarks for end-to-end ontology construction, we adopt a new evaluation dataset derived from documents across the Data, Finance, and Logistics sectors. Experimental results highlight both the potential and the challenges of this approach, achieving a fuzzy-match F1-score of 0.724 in the Data domain while revealing limitations in scope definition and hierarchical reasoning.
zh
[AI-171] Sample Efficient Active Algorithms for Offline Reinforcement Learning
【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning)中因静态数据覆盖不足和分布偏移(distributional shift)导致策略学习效果不佳的问题。其解决方案的关键在于引入主动强化学习(Active Reinforcement Learning, ActiveRL),通过有限的在线交互来选择性地精炼价值函数中不确定性较高的区域,从而提升策略学习效率。论文基于高斯过程(Gaussian Process, GP)的不确定性建模,结合GP浓度不等式与信息增益边界,首次提供了ActiveRL的样本复杂度理论分析,证明了在高概率下可通过 O(1/ϵ²) 次主动过渡获得 ϵ-最优策略,显著优于纯离线方法的 Ω(1/(ϵ²(1−γ)⁴)) 率,验证了ActiveRL具有近似最优的信息效率。
链接: https://arxiv.org/abs/2602.01260
作者: Soumyadeep Roy,Shashwat Kushwaha,Ambedkar Dukkipati
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Offline reinforcement learning (RL) enables policy learning from static data but often suffers from poor coverage of the state-action space and distributional shift problems. This problem can be addressed by allowing limited online interactions to selectively refine uncertain regions of the learned value function, which is referred to as Active Reinforcement Learning (ActiveRL). While there has been good empirical success, no theoretical analysis is available in the literature. We fill this gap by developing a rigorous sample-complexity analysis of ActiveRL through the lens of Gaussian Process (GP) uncertainty modeling. In this respect, we propose an algorithm and, using GP concentration inequalities and information-gain bounds, derive high-probability guarantees showing that an ε-optimal policy can be learned with O(1/ε²) active transitions, improving upon the Ω(1/(ε²(1−γ)⁴)) rate of purely offline methods. Our results reveal that ActiveRL achieves near-optimal information efficiency, that is, guided uncertainty reduction leads to accelerated value-function convergence with minimal online data. Our analysis builds on GP concentration inequalities and information-gain bounds, bridging Bayesian nonparametric regression and reinforcement learning theories. We conduct several experiments to validate the algorithm and theoretical findings.
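The acquisition step at the heart of such a scheme, querying the transition where the GP posterior is most uncertain, can be sketched with scikit-learn on toy data (the features, kernel, and noise level are placeholders):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Offline dataset: state-action features and observed targets (e.g. returns)
X_off = rng.uniform(-1, 1, size=(200, 4))
y_off = np.sin(X_off.sum(axis=1)) + 0.1 * rng.normal(size=200)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-2)
gp.fit(X_off, y_off)

# ActiveRL step: among candidate state-actions, query the one the GP is
# most uncertain about, collect that transition online, and refit.
candidates = rng.uniform(-1, 1, size=(1000, 4))
_, std = gp.predict(candidates, return_std=True)
print("next transition to collect:", candidates[std.argmax()])
```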
zh
[AI-172] Mechanistic Interpretability of Brain-to-Speech Models Across Speech Modes
【速读】:该论文旨在解决脑到语音解码模型中不同语音模态(发声、默念和想象)之间信息传递机制不明确的问题,特别是这些模型如何在内部表征层面实现跨模态迁移。其解决方案的关键在于采用机械可解释性方法,通过跨模态激活补丁(cross-mode activation patching)与三模态插值技术,揭示了语音模态共享一个连续的因果流形(causal manifold),且跨模态转移由特定层级中的紧凑子空间(compact, layer-specific subspaces)驱动,而非广泛分布的神经元活动;进一步通过粗粒度到细粒度的因果追踪(causal tracing)和因果擦除(causal scrubbing),发现仅少量但非分散的神经元集合对跨模态传输具有因果影响,从而阐明了语音模态信息在脑-语音解码模型中的层次化、方向依赖的表征结构。
链接: https://arxiv.org/abs/2602.01247
作者: Maryam Maghsoudi,Ayushi Mishra
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Brain-to-speech decoding models demonstrate robust performance in vocalized, mimed, and imagined speech; yet, the fundamental mechanisms via which these models capture and transmit information across different speech modalities are less explored. In this work, we use mechanistic interpretability to causally investigate the internal representations of a neural speech decoder. We perform cross-mode activation patching of internal activations across speech modes, and use tri-modal interpolation to examine whether speech representations vary discretely or continuously. We use coarse-to-fine causal tracing and causal scrubbing to find localized causal structure, allowing us to find internal subspaces that are sufficient for cross-mode transfer. In order to determine how finely distributed these effects are within layers, we perform neuron-level activation patching. We discover that small but not distributed subsets of neurons, rather than isolated units, affect the cross-mode transfer. Our results show that speech modes lie on a shared continuous causal manifold, and cross-mode transfer is mediated by compact, layer-specific subspaces rather than diffuse activity. Together, our findings give a causal explanation for how speech modality information is organized and used in brain-to-speech decoding models, revealing hierarchical and direction-dependent representational structure across speech modes.
zh
[AI-173] Predictive Scheduling for Efficient Inference-Time Reasoning in Large Language Models ICML
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂推理任务中因固定token预算分配导致的计算效率低下问题:即对简单输入过度计算,对困难输入则计算不足。解决方案的关键在于提出一种可插拔的预测调度(Predictive Scheduling)框架,该框架通过预运行轻量级预测器(如基于中间Transformer隐藏状态的MLP或基于原始问题文本微调的LoRA分类器)来估计每个查询的最优推理长度或难度,并结合贪心批处理分配器动态调整总token预算分配,从而在不增加总计算成本的前提下显著提升准确率。实验表明,在GSM8K算术基准上,该方法相较均匀预算分配可实现最高7.9个百分点的绝对准确率提升,且能关闭超过50%与理想情况(oracle)之间的差距。
链接: https://arxiv.org/abs/2602.01237
作者: Katrina Brown,Aneesh Muppidi,Rana Shahout
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ICML ES-FoMo 2025
Abstract:Large language models (LLMs) achieve state-of-the-art accuracy on complex reasoning tasks by generating multiple chain-of-thought (CoT) traces, but using a fixed token budget per query leads to over-computation on easy inputs and under-computation on hard ones. We introduce Predictive Scheduling, a plug-and-play framework that pre-runs lightweight predictors, an MLP on intermediate transformer hidden states or a LoRA-fine-tuned classifier on raw question text, to estimate each query’s optimal reasoning length or difficulty before any full generation. Our greedy batch allocator dynamically distributes a fixed total token budget across queries to maximize expected accuracy. On the GSM8K arithmetic benchmark, predictive scheduling yields up to 7.9 percentage points of absolute accuracy gain over uniform budgeting at identical token cost, closing over 50% of the gap to an oracle with perfect foresight. A systematic layer-wise study reveals that middle layers (12 - 17) of the transformer carry the richest signals for size estimation. These results demonstrate that pre-run budget prediction enables fine-grained control of the compute-accuracy trade-off, offering a concrete path toward latency-sensitive, cost-efficient LLM deployments.
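The greedy budget allocator can be sketched as follows; the diminishing-returns gain model is a placeholder, not the paper's trained predictor, and the step sizes are illustrative:

```python
import heapq

def allocate_budget(pred_difficulty, total_budget, step=256, min_tokens=256):
    """Greedily hand out step-sized token increments to the query with the
    largest expected accuracy gain under a fixed total budget."""
    n = len(pred_difficulty)
    alloc = [min_tokens] * n
    remaining = total_budget - min_tokens * n

    def gain(i):  # harder queries benefit more; extra tokens help less and less
        return pred_difficulty[i] * step / alloc[i]

    heap = [(-gain(i), i) for i in range(n)]
    heapq.heapify(heap)
    while remaining >= step and heap:
        _, i = heapq.heappop(heap)
        alloc[i] += step
        remaining -= step
        heapq.heappush(heap, (-gain(i), i))
    return alloc

print(allocate_budget([0.9, 0.2, 0.5], total_budget=4096))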
zh
[AI-174] Lotus: Efficient LLM Training by Randomized Low-Rank Gradient Projection with Adaptive Subspace Switching
【速读】:该论文旨在解决大规模模型训练中内存消耗、训练时间和模型性能之间的权衡问题,现有方法通常在优化某一指标时会损害其他指标。针对这一挑战,论文提出了一种名为Lotus的新方法,其关键创新在于通过简化投影过程来实现高效低秩梯度更新:具体而言,提出了一种量化单位梯度位移的准则,从而实现低秩梯度子空间间的高效切换,避免了传统方法如GaLore中因奇异值分解(SVD)带来的额外训练时间开销,最终在保持模型性能的同时显著降低内存占用并缩短训练时间。
链接: https://arxiv.org/abs/2602.01233
作者: Tianhao Miao,Zhongyuan Bao,Lejun Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Training efficiency in large-scale models is typically assessed through memory consumption, training time, and model performance. Current methods often exhibit trade-offs among these metrics, as optimizing one generally degrades at least one of the others. Addressing this trade-off remains a central challenge in algorithm design. While GaLore enables memory-efficient training by updating gradients in a low-rank subspace, it incurs a comparable extra training time cost due to the Singular Value Decomposition (SVD) process on gradients. In this paper, we propose Lotus, a method that resolves this trade-off by simply modifying the projection process. We propose a criterion that quantifies the displacement of the unit gradient to enable efficient transitions between low-rank gradient subspaces. Experimental results indicate that Lotus is the most efficient method, achieving a 30% reduction in training time and a 40% decrease in memory consumption for gradient and optimizer states. Additionally, it outperforms the baseline method in both pre-training and fine-tuning tasks.
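A sketch of our reading of the switching rule: project gradients through a random low-rank matrix (no SVD) and resample the subspace when the unit gradient has drifted beyond a threshold. The class name, threshold, and resampling rule are illustrative assumptions:

```python
import torch

class RandomLowRankProjector:
    """Random low-rank gradient projection with displacement-triggered
    subspace switching (illustrative sketch of the Lotus idea)."""

    def __init__(self, dim: int, rank: int = 64, threshold: float = 0.3):
        self.rank, self.threshold = rank, threshold
        self.P = torch.randn(dim, rank) / rank ** 0.5
        self.ref_unit_grad = None

    def project(self, grad: torch.Tensor) -> torch.Tensor:
        """grad: flattened gradient vector of length dim."""
        unit = grad / (grad.norm() + 1e-12)
        if self.ref_unit_grad is None:
            self.ref_unit_grad = unit.detach()
        elif (unit - self.ref_unit_grad).norm() > self.threshold:
            # Unit gradient drifted: resample the subspace (no SVD needed)
            self.P = torch.randn(grad.numel(), self.rank) / self.rank ** 0.5
            self.ref_unit_grad = unit.detach()
        return self.P.T @ grad  # optimizer state lives in the rank-r space

# usage: low = proj.project(p.grad.flatten()); update = proj.P @ adam_step(low)
```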
zh
[AI-175] FutureMind: Equipping Small Language Models with Strategic Thinking-Pattern Priors via Adaptive Knowledge Distillation ICLR2026
【速读】:该论文旨在解决小型语言模型(Small Language Models, SLMs)在复杂、知识密集型任务中因缺乏结构化推理能力和有效检索机制而导致性能受限的问题。其解决方案的关键在于提出了一种名为FutureMind的模块化推理框架,通过从大型语言模型(Large Language Models, LLMs)中采用自适应知识蒸馏(adaptive knowledge distillation),将战略思维模式先验知识注入SLMs;该框架包含问题分析、逻辑推理、策略规划和检索引导四个核心模块,并结合三种不同的检索范式,实现对复杂查询的分解与高效执行,从而显著提升SLMs在多跳问答等任务上的表现。
链接: https://arxiv.org/abs/2602.01222
作者: Shaoxiong Yang,Junting Li,Mengyuan Zhang,Chao Li,Wei Liu,Jian Luan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by ICLR 2026
Abstract:Small Language Models (SLMs) are attractive for cost-sensitive and resource-limited settings due to their efficient, low-latency inference. However, they often struggle with complex, knowledge-intensive tasks that require structured reasoning and effective retrieval. To address these limitations, we propose FutureMind, a modular reasoning framework that equips SLMs with strategic thinking-pattern priors via adaptive knowledge distillation from large language models (LLMs). FutureMind introduces a dynamic reasoning pipeline composed of four key modules: Problem Analysis, Logical Reasoning, Strategy Planning, and Retrieval Guidance. This pipeline is augmented by three distinct retrieval paradigms that decompose complex queries into tractable subproblems, ensuring efficient and accurate retrieval execution. Extensive experiments on multi-hop QA benchmarks, including 2WikiMultihopQA, MuSiQue, Bamboogle, and Frames, demonstrate the superiority of FutureMind. It consistently outperforms strong baselines such as Search-o1, achieving state-of-the-art results under free training conditions across diverse SLM architectures and scales. Beyond empirical gains, our analysis reveals that the process of thinking-pattern distillation is restricted by the cognitive bias bottleneck between the teacher (LLMs) and student (SLMs) models. This provides new perspectives on the transferability of reasoning skills, paving the way for the development of SLMs that combine efficiency with genuine cognitive capability.
zh
[AI-176] Learning from Anonymized and Incomplete Tabular Data
【速读】:该论文旨在解决用户驱动隐私(User-driven Privacy)场景下,数据中混杂原始值、泛化值和缺失值所导致的机器学习模型性能下降问题。这类数据结构虽然符合隐私保护需求,但传统处理方式(如将非原始值视为新类别或缺失值)会丢失泛化语义,从而损害下游任务的效用。解决方案的关键在于提出新颖的数据转换策略,显式建模异质匿名化(heterogeneous anonymization),并通过多数据集、多种隐私配置和部署场景的实验验证,证明该方法能有效恢复数据效用;同时强调一致的数据表示对于维持模型性能至关重要。
链接: https://arxiv.org/abs/2602.01217
作者: Lucas Lange,Adrian Böttinger,Victor Christen,Anushka Vidanage,Peter Christen,Erhard Rahm
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Databases (cs.DB)
备注:
Abstract:User-driven privacy allows individuals to control whether and at what granularity their data is shared, leading to datasets that mix original, generalized, and missing values within the same records and attributes. While such representations are intuitive for privacy, they pose challenges for machine learning, which typically treats non-original values as new categories or as missing, thereby discarding generalization semantics. For learning from such tabular data, we propose novel data transformation strategies that account for heterogeneous anonymization and evaluate them alongside standard imputation and LLM-based approaches. We employ multiple datasets, privacy configurations, and deployment scenarios, demonstrating that our method reliably regains utility. Our results show that generalized values are preferable to pure suppression, that the best data preparation strategy depends on the scenario, and that consistent data representations are crucial for maintaining downstream utility. Overall, our findings highlight that effective learning is tied to the appropriate handling of anonymized values.
zh
[AI-177] Not All Preferences Are Created Equal: Stability-Aware and Gradient-Efficient Alignment for Reasoning Models
【速读】:该论文旨在解决大模型推理对齐过程中,传统基于偏好优化(Preference-based Alignment)方法如直接偏好优化(Direct Preference Optimization, DPO)因对所有偏好对一视同仁而导致的训练效率低下和优化不稳定问题。这些问题源于静态数据选择策略忽略了训练样本在模型演进过程中的动态效用,导致计算资源浪费于梯度信息微弱的简单样本,并受临近决策边界样本噪声干扰。解决方案的关键在于提出SAGE(Stability-Aware Gradient Efficiency)框架,其核心创新是通过粗粒度课程学习机制动态更新候选样本池(基于模型能力),并结合细粒度稳定性感知评分函数,优先选取具有高信息量且预测稳定的错误样本,从而显著提升策略更新的信噪比(Signal-to-Noise Ratio),实现更高效、可靠的推理对齐。
链接: https://arxiv.org/abs/2602.01207
作者: Hui Wu,Hengyi Cai,Jinman Zhao,Xinran Chen,Ziheng Li,Zhejun Zhao,Shuaiqiang Wang,Yuchen Li,Dawei Yin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Preference-based alignment is pivotal for training large reasoning models; however, standard methods like Direct Preference Optimization (DPO) typically treat all preference pairs uniformly, overlooking the evolving utility of training instances. This static approach often leads to inefficient or unstable optimization, as it wastes computation on trivial pairs with negligible gradients and suffers from noise induced by samples near uncertain decision boundaries. Facing these challenges, we propose SAGE (Stability-Aware Gradient Efficiency), a dynamic framework designed to enhance alignment reliability by maximizing the Signal-to-Noise Ratio of policy updates. Concretely, SAGE integrates a coarse-grained curriculum mechanism that refreshes candidate pools based on model competence with a fine-grained, stability-aware scoring function that prioritizes informative, confident errors while filtering out unstable samples. Experiments on multiple mathematical reasoning benchmarks demonstrate that SAGE significantly accelerates convergence and outperforms static baselines, highlighting the critical role of policy-aware, stability-conscious data selection in reasoning alignment.
zh
[AI-178] Addressing Explainability of Generative AI using SMILE (Statistical Model-agnostic Interpretability with Local Explanations)
【速读】:该论文旨在解决生成式 AI(Generative AI)模型决策过程不透明的问题,从而限制了其在高风险应用场景中的可信度与责任追溯。解决方案的关键在于提出 gSMILE 框架,该框架扩展了基于局部解释的统计模型无关可解释性方法(Statistical Model-agnostic Interpretability with Local Explanations, SMILE),通过受控文本扰动、Wasserstein 距离度量和加权代理建模,量化并可视化提示(prompt)或指令中特定成分对模型输出的影响。该方法在大型语言模型(Large Language Models, LLMs)中实现细粒度的 token 级归因,并生成直观热力图以识别关键影响 token 和推理路径;在基于指令的图像编辑模型中则利用精确文本扰动机制分析指令修改如何影响图像结果。结合基于操作设计域(Operational Design Domain, ODD)的场景评估策略,gSMILE 实现了跨语义和环境条件下的系统性行为评估,并通过稳定性、保真度、准确性、一致性与忠实性等严格归因指标验证解释质量,展现出在多种生成架构上的鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2602.01206
作者: Zeinab Dehghani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid advancement of generative artificial intelligence has enabled models capable of producing complex textual and visual outputs; however, their decision-making processes remain largely opaque, limiting trust and accountability in high-stakes applications. This thesis introduces gSMILE, a unified framework for the explainability of generative models, extending the Statistical Model-agnostic Interpretability with Local Explanations (SMILE) method to generative settings. gSMILE employs controlled perturbations of textual input, Wasserstein distance metrics, and weighted surrogate modelling to quantify and visualise how specific components of a prompt or instruction influence model outputs. Applied to Large Language Models (LLMs), gSMILE provides fine-grained token-level attribution and generates intuitive heatmaps that highlight influential tokens and reasoning pathways. In instruction-based image editing models, the exact text-perturbation mechanism is employed, allowing for the analysis of how modifications to an editing instruction impact the resulting image. Combined with a scenario-based evaluation strategy grounded in the Operational Design Domain (ODD) framework, gSMILE allows systematic assessment of model behaviour across diverse semantic and environmental conditions. To evaluate explanation quality, we define rigorous attribution metrics, including stability, fidelity, accuracy, consistency, and faithfulness, and apply them across multiple generative architectures. Extensive experiments demonstrate that gSMILE produces robust, human-aligned attributions and generalises effectively across state-of-the-art generative models. These findings highlight the potential of gSMILE to advance transparent, reliable, and responsible deployment of generative AI technologies.
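A compact sketch of the SMILE-style loop the abstract describes: mask random token subsets, measure the output shift, and fit a locality-weighted linear surrogate. Here `generate` and `embed` are hypothetical stand-ins for the model under study and an output embedder, and comparing embedding coordinates with a 1-D Wasserstein distance is a simplification of the paper's metric:

```python
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.linear_model import Ridge

def gsmile_attribution(prompt_tokens, generate, embed, n_samples=64, seed=0):
    """Token attribution via perturb-and-fit (minimal sketch)."""
    rng = np.random.default_rng(seed)
    base = embed(generate(prompt_tokens))   # 1-D np.array reference output
    masks, shifts = [], []
    for _ in range(n_samples):
        m = rng.integers(0, 2, size=len(prompt_tokens))
        kept = [t for t, keep in zip(prompt_tokens, m) if keep]
        masks.append(m)
        shifts.append(wasserstein_distance(base, embed(generate(kept))))
    X, y = np.array(masks), np.array(shifts)
    w = np.exp(-(1 - X).sum(axis=1) / len(prompt_tokens))  # locality weights
    surrogate = Ridge(alpha=1.0).fit(X, y, sample_weight=w)
    return dict(zip(prompt_tokens, surrogate.coef_))
```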
zh
[AI-179] Workflow-R1: Group Sub-sequence Policy Optimization for Multi-turn Workflow Construction
【速读】:该论文旨在解决现有工作流优化方法将工作流合成建模为静态、一次性代码生成问题所带来的局限性,这种范式对模型的编码能力要求过高,并限制了动态问题求解所需的灵活性。其解决方案的关键在于提出 Workflow-R1 框架,将工作流构建重新定义为多轮、基于自然语言的序贯决策过程,并引入 Group Sub-sequence Policy Optimization (GSsPO) 算法以解决多轮交互中优化粒度不匹配的问题。GSsPO 通过以原子的 Think-Action 周期作为优化单元,使梯度更新与语义边界对齐,从而在复杂多轮推理任务中实现更稳健的学习,且该方法具备结构感知特性,可泛化至广泛的多轮智能体序贯决策任务。
链接: https://arxiv.org/abs/2602.01202
作者: Mingze Kong,Zikun Qu,Zhongquan Zhou,Pengyu Liang,Xiang Li,Zhiwei Shang,Zhi Hong,Kaiyu Huang,Zhiyong Wang,Zhongxiang Dai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid evolution of agentic workflows has demonstrated strong performance of LLM-based agents in addressing complex reasoning tasks. However, existing workflow optimization methods typically formulate workflow synthesis as a static, one-shot code-centric generation problem. This paradigm imposes excessive constraints on the model’s coding capabilities and restricts the flexibility required for dynamic problem-solving. In this paper, we present Workflow-R1, a framework that reformulates workflow construction as a multi-turn, natural language-based sequential decision-making process. To resolve the optimization granularity mismatch inherent in such multi-turn interactions, we introduce Group Sub-sequence Policy Optimization (GSsPO). While explicitly tailored to align with the interleaved Think-Action dynamics of agentic reasoning, GSsPO fundamentally functions as a structure-aware RL algorithm generalizable to a broad class of multi-turn agentic sequential decision-making tasks. By recalibrating the optimization unit to the composite sub-sequence, specifically the atomic Think-Action cycle, it aligns gradient updates with the semantic boundaries of these interactions, ensuring robust learning in complex multi-turn reasoning tasks. Through extensive experiments on multiple QA benchmarks, Workflow-R1 outperforms competitive baselines, validating GSsPO as a generalized solution for sequential reasoning and establishing Workflow-R1 as a promising new paradigm for automated workflow optimization.
zh
[AI-180] A State-Transition Framework for Efficient LLM Reasoning ICLR2026
【速读】:该论文旨在解决长链式思维(Long Chain-of-Thought, CoT)推理在大型语言模型(Large Language Models, LLMs)中因计算和内存开销大而导致的效率低下问题,同时避免现有压缩CoT方法与测试时扩展(test-time scaling)之间的冲突。其解决方案的关键在于将LLM的推理过程建模为状态转移过程:通过线性注意力机制(linear attention mechanism)高效估计推理状态(reasoning state),该状态记录了历史推理信息;当前推理步骤可基于查询提示和推理状态直接获取相关历史信息,无需显式关注前序步骤的所有token,从而将注意力计算复杂度从二次方降低到线性,显著提升推理效率。此外,提出基于状态的推理策略以缓解由噪声推理步骤引发的过度思考问题。
链接: https://arxiv.org/abs/2602.01198
作者: Liang Zhang,Yu Zhao,Longyue Wang,Tianqi Shi,Weihua Luo,Kaifu Zhang,Jinsong Su
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ICLR 2026
Abstract:While Long Chain-of-Thought (CoT) reasoning significantly improves Large Language Models (LLMs) performance on complex reasoning tasks, the substantial computational and memory costs of generating long CoT sequences limit their efficiency and practicality. Existing studies usually enhance the reasoning efficiency of LLMs by compressing CoT sequences. However, this approach conflicts with test-time scaling, limiting the reasoning capacity of LLMs. In this paper, we propose an efficient reasoning framework that models the reasoning process of LLMs as a state-transition process. Specifically, we first apply a linear attention mechanism to estimate the LLM’s reasoning state, which records the historical reasoning information from previous reasoning steps. Then, based on the query prompt and the reasoning state, the LLM can efficiently perform the current reasoning step and update the state. With the linear attention, each token in the current reasoning step can directly retrieve relevant historical reasoning information from the reasoning state, without explicitly attending to tokens in previous reasoning steps. In this way, the computational complexity of attention is reduced from quadratic to linear, significantly improving the reasoning efficiency of LLMs. In addition, we propose a state-based reasoning strategy to mitigate the over-thinking issue caused by noisy reasoning steps. Extensive experiments across multiple datasets and model sizes demonstrate that our framework not only improves the reasoning efficiency of LLMs but also enhances their reasoning performance.
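The state-transition view maps directly onto the standard linear-attention recurrence: the reasoning state accumulates phi(k) v^T outer products, and each new token reads history from that state in O(d²) per step. A minimal sketch, using the common elu+1 feature map (an assumption; the paper's exact parameterization may differ):

```python
import torch
import torch.nn.functional as F

def linear_attention_step(state, z, k, v, q):
    """One state-transition step of linear attention (minimal sketch).

    state: (d_k, d_v) running sum of phi(k) v^T over past steps;
    z: (d_k,) running normalizer sum of phi(k)."""
    phi = lambda x: F.elu(x) + 1.0       # positive feature map
    fk, fq = phi(k), phi(q)
    state = state + torch.outer(fk, v)   # fold the new key/value into the state
    z = z + fk
    out = (fq @ state) / (fq @ z).clamp_min(1e-6)
    return state, z, out

d_k, d_v = 8, 8
state, z = torch.zeros(d_k, d_v), torch.zeros(d_k)
for _ in range(5):                       # autoregressive decoding loop
    k, v, q = torch.randn(d_k), torch.randn(d_v), torch.randn(d_k)
    state, z, out = linear_attention_step(state, z, k, v, q)
```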
zh
[AI-181] Autoregressive Yet Revisable: In Decoding Revision for Secure Code Generation
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)在代码生成中采用严格单调序列生成方式所带来的局限性,即无法模拟人类编程过程中“生成-修正”交替进行的认知机制。现有方法通过外部代理或静态工具引入修订能力,但存在延迟高或未能充分利用模型内在语义推理能力的问题。解决方案的关键在于提出“修订流”(Stream of Revision)范式,通过引入特定的动作标记(action tokens),使模型能够在单次前向传播中无缝回溯并编辑自身历史输出,从而将修订循环内化为模型的原生能力,实现无需外部依赖的即时修正与动态优化。
链接: https://arxiv.org/abs/2602.01187
作者: Chengran Yang,Zichao Wei,Heminghao Deng,Jinfeng Jiang,Zhensu Sun,Ting Zhang,Tianyi Wu,Ming Wen,David Lo
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Model (LLM) based code generation is predominantly formulated as a strictly monotonic process, appending tokens linearly to an immutable prefix. This formulation contrasts with the cognitive process of programming, which is inherently interleaved with forward generation and on-the-fly revision. While prior works attempt to introduce revision via post-hoc agents or external static tools, they either suffer from high latency or fail to leverage the model’s intrinsic semantic reasoning. In this paper, we propose Stream of Revision, a paradigm shift that elevates code generation from a monotonic stream to a dynamic, self-correcting trajectory by leveraging the model’s intrinsic capabilities. We introduce specific action tokens that enable the model to seamlessly backtrack and edit its own history within a single forward pass. By internalizing the revision loop, our framework Stream of Revision allows the model to activate its latent capabilities just-in-time without external dependencies. Empirical results on secure code generation show that Stream of Revision significantly reduces vulnerabilities with minimal inference overhead.
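A decoding loop that honors in-stream revision actions can be sketched as follows; the ("REVISE", k) token format is a hypothetical rendering of the paper's action tokens, and `model_step` stands in for one forward pass:

```python
def decode_with_revision(model_step, max_len=256):
    """Decoding loop honoring in-stream revision actions (minimal sketch)."""
    tokens = []
    while len(tokens) < max_len:
        tok = model_step(tokens)
        if tok == "EOS":
            break
        if isinstance(tok, tuple) and tok[0] == "REVISE":
            k = min(tok[1], len(tokens))
            tokens = tokens[: len(tokens) - k]   # backtrack k tokens
        else:
            tokens.append(tok)
    return tokens
```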
zh
[AI-182] he Gaussian-Head OFL Family: One-Shot Federated Learning from Client Global Statistics ICLR
【速读】:该论文旨在解决传统联邦学习(Federated Learning, FL)中因多轮模型交换与聚合导致的高通信开销和隐私风险问题,同时克服现有单次传输联邦学习(One-shot Federated Learning, OFL)方法在实际应用中的局限性,如对公共数据集的依赖、同质客户端假设或需上传额外数据/模型信息等。其解决方案的关键在于提出了一类基于预训练嵌入服从类别条件高斯分布假设的新型OFL方法——Gaussian-Head OFL (GH-OFL),通过仅传输每类统计量(类别计数及一阶/二阶矩),由服务器构建三类轻量级分类头:(i) 基于闭式解的高斯分类器(NB/LDA/QDA);(ii) FisherMix,一种在估计Fisher子空间中合成样本上训练的余弦边界线性头;(iii) Proto-Hyper,一种利用知识蒸馏优化高斯logits的低秩残差头。该方案在强非独立同分布(non-IID)场景下实现了卓越的鲁棒性和准确性,且完全无需上传原始数据。
链接: https://arxiv.org/abs/2602.01186
作者: Fabio Turazza,Marco Picone,Marco Mamei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at the International Conference on Learning Representations (ICLR) 2026
Abstract:Classical Federated Learning relies on a multi-round iterative process of model exchange and aggregation between server and clients, with high communication costs and privacy risks from repeated model transmissions. In contrast, one-shot federated learning (OFL) alleviates these limitations by reducing communication to a single round, thereby lowering overhead and enhancing practical deployability. Nevertheless, most existing one-shot approaches remain either impractical or constrained, for example, they often depend on the availability of a public dataset, assume homogeneous client models, or require uploading additional data or model information. To overcome these issues, we introduce the Gaussian-Head OFL (GH-OFL) family, a suite of one-shot federated methods that assume class-conditional Gaussianity of pretrained embeddings. Clients transmit only sufficient statistics (per-class counts and first/second-order moments) and the server builds heads via three components: (i) Closed-form Gaussian heads (NB/LDA/QDA) computed directly from the received statistics; (ii) FisherMix, a linear head with cosine margin trained on synthetic samples drawn in an estimated Fisher subspace; and (iii) Proto-Hyper, a lightweight low-rank residual head that refines Gaussian logits via knowledge distillation on those synthetic samples. In our experiments, GH-OFL methods deliver state-of-the-art robustness and accuracy under strong non-IID skew while remaining strictly data-free.
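The closed-form heads follow directly from the transmitted statistics. A minimal LDA-head sketch from per-class counts and first/second moments (the dict layout and ridge term are ours; the discriminant itself is standard LDA):

```python
import numpy as np

def lda_head_from_stats(counts, sums, sq_sums):
    """Closed-form LDA head from per-class sufficient statistics.

    counts[c]: total embeddings of class c across clients, sums[c]: (d,) sum
    of embeddings, sq_sums[c]: (d, d) sum of outer products. Returns weights
    and biases for class scores  x @ W.T + b."""
    classes = sorted(counts)
    n = sum(counts.values())
    d = next(iter(sums.values())).shape[0]
    means = {c: sums[c] / counts[c] for c in classes}
    # Pooled within-class covariance recovered from second moments
    cov = sum(sq_sums[c] - counts[c] * np.outer(means[c], means[c])
              for c in classes) / (n - len(classes))
    cov += 1e-4 * np.eye(d)                  # ridge for numerical stability
    prec = np.linalg.inv(cov)
    W = np.stack([prec @ means[c] for c in classes])
    b = np.array([-0.5 * means[c] @ prec @ means[c] + np.log(counts[c] / n)
                  for c in classes])
    return W, b
```

Clients only ever upload counts, sums, and squared sums, which is the privacy lever of the GH-OFL family.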
zh
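结合上文 GH-OFL 摘要中“客户端仅上传逐类充分统计量、服务器闭式构造高斯分类头”的思路，下面给出一个最小示意草图（非论文官方实现，函数划分与 ridge 正则等细节均为本文补充的假设），以 LDA 头为例：

```python
import numpy as np

# 假设性示意：客户端上传逐类(样本数、一阶矩之和、二阶矩之和)，
# 服务器聚合后闭式求出共享协方差与类均值，构造 LDA 线性头。

def client_statistics(X, y, num_classes):
    """客户端只计算并上传统计量，原始嵌入不出本地。"""
    d = X.shape[1]
    counts = np.zeros(num_classes)
    first = np.zeros((num_classes, d))        # 每类一阶矩之和
    second = np.zeros((num_classes, d, d))    # 每类二阶矩之和
    for c in range(num_classes):
        Xc = X[y == c]
        counts[c] = len(Xc)
        if len(Xc):
            first[c] = Xc.sum(axis=0)
            second[c] = Xc.T @ Xc
    return counts, first, second

def server_lda_head(stats_list, num_classes, ridge=1e-3):
    """服务器聚合所有客户端统计量，闭式得到 LDA 参数 (W, b)。"""
    counts = sum(s[0] for s in stats_list)
    first = sum(s[1] for s in stats_list)
    second = sum(s[2] for s in stats_list)
    d = first.shape[1]
    means = first / counts[:, None]
    # 类内散布矩阵之和 / 总样本数 = 共享协方差；ridge 保证可逆
    Sw = sum(second[c] - counts[c] * np.outer(means[c], means[c])
             for c in range(num_classes))
    cov = Sw / counts.sum() + ridge * np.eye(d)
    prec = np.linalg.inv(cov)
    W = means @ prec
    b = -0.5 * np.einsum('cd,de,ce->c', means, prec, means) \
        + np.log(counts / counts.sum())
    return W, b

# 用法示意：两个客户端、三类、8 维预训练嵌入
rng = np.random.default_rng(0)
stats = [client_statistics(rng.normal(size=(50, 8)),
                           rng.integers(0, 3, 50), 3) for _ in range(2)]
W, b = server_lda_head(stats, 3)
print((rng.normal(size=(5, 8)) @ W.T + b).argmax(axis=1))
```

摘要中的 NB/QDA 头与 FisherMix、Proto-Hyper 分支在此之上分别引入对角协方差、逐类协方差与合成样本训练，此处不再展开。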
[AI-183] FedBGS: A Blockchain Approach to Segment Gossip Learning in Decentralized Systems
【速读】:该论文旨在解决传统联邦学习(Federated Learning, FL)架构中服务器作为单点故障(Single-Point-of-Failure)所带来的安全性和可扩展性限制问题。尽管隐私保护联邦学习(Privacy-Preserving Federated Learning, PPFL)通过加密和隐私增强技术提升了数据安全性,但中心化服务器仍可能成为攻击目标或性能瓶颈。为此,论文提出了一种完全去中心化的区块链框架 FedBGS,其核心创新在于结合分段式八卦学习(Segmented Gossip Learning)与联邦分析(Federated Analytics),在优化区块链资源使用的同时,实现对各类攻击的全面防御,并有效处理非独立同分布(non-IID)数据场景下的模型训练需求。
链接: https://arxiv.org/abs/2602.01185
作者: Fabio Turazza,Marcello Pietri,Marco Picone,Marco Mamei
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: Author-accepted manuscript of a paper published in the 2025 IEEE 45th International Conference on Distributed Computing Systems Workshops (ICDCSW), pp. 760-770, doi: https://doi.org/10.1109/ICDCSW63273.2025.00136
Abstract:Privacy-Preserving Federated Learning (PPFL) is a decentralized machine learning paradigm that enables multiple participants to collaboratively train a global model without sharing their data, integrating cryptographic and privacy-based techniques to enhance the security of the global system. This privacy-oriented approach makes PPFL a highly suitable solution for training shared models in sectors where data privacy is a critical concern. In traditional FL, local models are trained on edge devices, and only model updates are shared with a central server, which aggregates them to improve the global model. However, despite the presence of the aforementioned privacy techniques, in the classical federated structure the server remains a single point of failure, leading to limitations in both security and scalability. This paper introduces FedBGS, a fully decentralized blockchain-based framework that leverages Segmented Gossip Learning through Federated Analytics. The proposed system aims to optimize blockchain usage while providing comprehensive protection against all types of attacks, ensuring privacy, security, and non-IID data handling in federated environments.
zh
[AI-184] Do All Individual Layers Help? An Empirical Study of Task-Interfering Layers in Vision-Language Models
【速读】:该论文旨在解决预训练视觉语言模型(VLM)在下游任务中性能受限的问题,特别是发现部分网络层可能对特定任务产生干扰而非促进作用。研究表明,通过干预单个网络层(如置零其参数),可在某些任务上提升性能,表明存在“任务干扰层”(Task-Interfering Layers)。解决方案的关键在于提出一种无需训练、仅在推理阶段应用的动态层剔除方法——TaLo(Task-Adaptive Layer Knockout),其核心是利用任务-层交互向量(Task-Layer Interaction Vector)量化各层对特定任务的影响,并据此识别并跳过最干扰当前任务的层,从而在不更新模型参数的前提下显著提升多模型与多数据集上的表现。
链接: https://arxiv.org/abs/2602.01167
作者: Zhiming Liu,Yujie Wei,Lei Feng,Xiu Su,Xiaobo Xia,Weili Guan,Zeke Xie,Shuo Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Current VLMs have demonstrated capabilities across a wide range of multimodal tasks. Typically, in a pretrained VLM, all layers are engaged by default to make predictions on downstream tasks. We find that intervening on a single layer, such as by zeroing its parameters, can improve the performance on certain tasks, indicating that some layers hinder rather than help downstream tasks. We systematically investigate how individual layers influence different tasks via layer intervention. Specifically, we measure the change in performance relative to the base model after intervening on each layer and observe improvements when bypassing specific layers. This improvement can be generalizable across models and datasets, indicating the presence of Task-Interfering Layers that harm downstream tasks’ performance. We introduce Task-Layer Interaction Vector, which quantifies the effect of intervening on each layer of a VLM given a task. These task-interfering layers exhibit task-specific sensitivity patterns: tasks requiring similar capabilities show consistent response trends under layer interventions, as evidenced by the high similarity in their task-layer interaction vectors. Inspired by these findings, we propose TaLo (Task-Adaptive Layer Knockout), a training-free, test-time adaptation method that dynamically identifies and bypasses the most interfering layer for a given task. Without parameter updates, TaLo improves performance across various models and datasets, including boosting Qwen-VL’s accuracy on the Maps task in ScienceQA by up to 16.6%. Our work reveals an unexpected form of modularity in pretrained VLMs and provides a plug-and-play, training-free mechanism to unlock hidden capabilities at inference time. The source code will be publicly available.
zh
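按上文摘要对 TaLo“训练无关、推理期敲除最干扰层”的描述，下面用一个玩具残差网络给出示意（非官方实现，模型结构、指标与变量名均为假设）：逐层跳过并记录准确率变化，即得到任务-层交互向量。

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
    def forward(self, x):
        return x + self.ff(x)   # 残差结构使“跳过该层”成为合法干预

class ToyModel(nn.Module):
    def __init__(self, d=32, depth=6, num_classes=4):
        super().__init__()
        self.blocks = nn.ModuleList(ToyBlock(d) for _ in range(depth))
        self.head = nn.Linear(d, num_classes)
    def forward(self, x, skip=None):
        for i, blk in enumerate(self.blocks):
            if i != skip:        # skip=i 时绕过第 i 层，等价于将其输出置零
                x = blk(x)
        return self.head(x)

@torch.no_grad()
def task_layer_interaction_vector(model, x, y):
    """每个分量 = 跳过第 i 层后的准确率变化（>0 说明该层干扰任务）。"""
    base = (model(x).argmax(-1) == y).float().mean().item()
    return torch.tensor([(model(x, skip=i).argmax(-1) == y).float()
                         .mean().item() - base
                         for i in range(len(model.blocks))])

model = ToyModel().eval()
x, y = torch.randn(64, 32), torch.randint(0, 4, (64,))
tliv = task_layer_interaction_vector(model, x, y)
print("最干扰层:", int(tliv.argmax()))   # 推理时对该层做 knockout
```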
[AI-185] Multi-Horizon Electricity Price Forecasting with Deep Learning in the Australian National Electricity Market
【速读】:该论文旨在解决电力价格预测(Electricity Price Forecasting, EPF)中长期多日-ahead预测能力不足、对前沿时间序列深度学习(Deep Learning, DL)模型探索有限,以及评估方式过于依赖整体水平指标而忽视日内时段差异的问题。其解决方案的关键在于构建一个扩展至多日-ahead的EPF框架,系统性地集成经过基准测试的前沿时间序列DL模型,并在澳大利亚国家电力市场(National Electricity Market, NEM)五个区域开展基于日内间隔级别的精细化评估。结果表明,标准DL模型在多数区域表现更优,而SOTA时间序列DL模型在延长预测 horizon 时更具鲁棒性,同时揭示了明显的昼夜误差模式,为未来研究提供了增强特征表示和建模策略以提升长期预测稳定性与日内波动敏感性的方向。
链接: https://arxiv.org/abs/2602.01157
作者: Mohammed Osman Gani,Zhipeng He,Chun Ouyang,Sara Khalifa
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 63 Pages
Abstract:Accurate electricity price forecasting (EPF) is essential for operational planning, trading, and flexible asset scheduling in liberalised power systems, yet remains challenging due to volatility, heavy-tailed spikes, and frequent regime shifts. While deep learning (DL) has been increasingly adopted in EPF to capture complex and nonlinear price dynamics, several important gaps persist: (i) limited attention to multi-day horizons beyond day-ahead forecasting, (ii) insufficient exploration of state-of-the-art (SOTA) time series DL models, and (iii) a predominant reliance on aggregated horizon-level evaluation that obscures time-of-day forecasting variation. To address these gaps, we propose a novel EPF framework that extends the forecast horizon to multi-day-ahead by systematically building forecasting models that leverage benchmarked SOTA time series DL models. We conduct a comprehensive evaluation to analyse time-of-day forecasting performance by integrating model assessment at intraday interval levels across all five regions in the Australian National Electricity Market (NEM). The results show that no single model consistently dominates across regions, metrics, and horizons. Overall, standard DL models deliver superior performance in most regions, while SOTA time series DL models demonstrate greater robustness to forecast horizon extension. Intraday interval-level evaluation reveals pronounced diurnal error patterns, indicating that absolute errors peak during the evening ramp, relative errors inflate during midday negative-price regimes, and directional accuracy degrades during periods of frequent trend changes. These findings suggest that future research on DL-based EPF can benefit from enriched feature representations and modelling strategies that enhance longer-term forecasting robustness while maintaining sensitivity to intraday volatility and structural price dynamics.
zh
[AI-186] Multi-Agent Causal Reasoning System for Error Pattern Rule Automation in Vehicles
【速读】:该论文旨在解决现代汽车中诊断故障码(Diagnostic Trouble Codes, DTCs)生成的错误模式(Error Patterns, EPs)规则依赖人工制定所带来的成本高、易出错问题。随着车辆系统复杂度提升,手动构建EP规则已难以满足高效、准确的故障诊断需求。解决方案的关键在于提出CAREP(Causal Automated Reasoning for Error Patterns),一个基于多智能体架构的自动化系统:其核心组件包括因果发现智能体(识别潜在DTC与EP之间的因果关系)、上下文信息智能体(融合元数据和描述性信息)以及协调智能体(合成候选布尔规则并生成可解释的推理路径)。该方法通过结合因果发现与代理驱动的推理机制,在大规模真实数据集上实现了对未知EP规则的自动且精准挖掘,显著优于仅依赖大语言模型(LLM)的基线方法,并提供透明的因果解释,推动了可扩展、可解释且低成本的车辆故障诊断自动化进程。
链接: https://arxiv.org/abs/2602.01155
作者: Hugo Math,Julian Lorentz,Stefan Oelsner,Rainer Lienhart
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 7 pages, 3 figures
Abstract:Modern vehicles generate thousands of different discrete events known as Diagnostic Trouble Codes (DTCs). Automotive manufacturers use Boolean combinations of these codes, called error patterns (EPs), to characterize system faults and ensure vehicle safety. Yet, EP rules are still manually handcrafted by domain experts, a process that is expensive and prone to errors as vehicle complexity grows. This paper introduces CAREP (Causal Automated Reasoning for Error Patterns), a multi-agent system that automatizes the generation of EP rules from high-dimensional event sequences of DTCs. CAREP combines a causal discovery agent that identifies potential DTC-EP relations, a contextual information agent that integrates metadata and descriptions, and an orchestrator agent that synthesizes candidate boolean rules together with interpretable reasoning traces. Evaluation on a large-scale automotive dataset with over 29,100 unique DTCs and 474 error patterns demonstrates that CAREP can automatically and accurately discover the unknown EP rules, outperforming LLM-only baselines while providing transparent causal explanations. By uniting practical causal discovery and agent-based reasoning, CAREP represents a step toward fully automated fault diagnostics, enabling scalable, interpretable, and cost-efficient vehicle maintenance.
zh
[AI-187] Capabilities and Fundamental Limits of Latent Chain-of-Thought
【速读】:该论文旨在解决潜空间链式思维(Latent Chain-of-Thought, Latent CoT)模型在推理任务中表现出的性能不一致性问题,即在探索类任务(如ProsQA)中表现优异,而在计算类任务(如GSM8K)中表现较差的现象。其解决方案的关键在于揭示并量化“决策确定性”(decisional certainty)的作用机制:通过引入符号索引(Symbolic Index)作为衡量决策承诺程度的核心指标,明确其与执行稳定性及探索能力之间的因果关系;进一步理论证明课程学习(curriculum learning)是必要条件,因为直接训练会导致分布偏移从而失效。该研究提出了一种基于任务需求动态调节决策确定性的自适应系统设计范式,从根本上缓解了探索与执行之间的权衡矛盾。
链接: https://arxiv.org/abs/2602.01148
作者: Jiaxuan Zou,Yaozhong Xiong,Yong Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注:
Abstract:Latent Chain-of-Thought (Latent CoT) models promise efficient reasoning via continuous representations, yet exhibit puzzling performance inconsistencies: excelling at exploration (ProsQA: 97.0%) but failing at computation (GSM8K: 34.1%). We reveal that this trade-off is governed by decisional certainty. Our contributions are threefold: (1) We theoretically characterize the fundamental Exploration-Execution Trade-off, proving that high certainty enables precise execution but inhibits exploration, while low certainty facilitates search but causes error accumulation. (2) We introduce the Symbolic Index–quantifying decisional commitment–as the core mechanism governing this trade-off and establish its causal relationship with both execution stability and exploration capability. (3) We prove that curriculum learning is theoretically necessary, as direct training provably fails due to distributional mismatch. Our framework shifts the design paradigm from binary architectural choices toward adaptive systems that dynamically regulate decisional certainty based on task demands.
zh
[AI-188] PersistBench: When Should Long-Term Memories Be Forgotten by LLMs?
【速读】:该论文旨在解决长时记忆(long-term memory)集成到大语言模型(LLMs)后引入的安全风险问题,特别是跨域泄露(cross-domain leakage)和记忆诱导谄媚(memory-induced sycophancy)两类特定风险。其解决方案的关键在于构建了一个名为PersistBench的基准测试框架,用于量化评估主流前沿及开源LLMs在长时记忆使用中的安全脆弱性,从而推动更鲁棒、更安全的长时记忆机制的发展。
链接: https://arxiv.org/abs/2602.01146
作者: Sidharth Pulipaka,Oliver Chen,Manas Sharma,Taaha S Bajwa,Vyas Raina,Ivaxi Sheth
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 70 pages, 26 figures, under review
Abstract:Conversational assistants are increasingly integrating long-term memory with large language models (LLMs). This persistence of memories, e.g., the user is vegetarian, can enhance personalization in future conversations. However, the same persistence can also introduce safety risks that have been largely overlooked. Hence, we introduce PersistBench to measure the extent of these safety risks. We identify two long-term memory-specific risks: cross-domain leakage, where LLMs inappropriately inject context from the long-term memories; and memory-induced sycophancy, where stored long-term memories insidiously reinforce user biases. We evaluate 18 frontier and open-source LLMs on our benchmark. Our results reveal a surprisingly high failure rate across these LLMs - a median failure rate of 53% on cross-domain samples and 97% on sycophancy samples. To address this, our benchmark encourages the development of more robust and safer long-term memory usage in frontier conversational systems.
zh
[AI-189] Lyapunov Stability-Aware Stackelberg Game for Low-Altitude Economy: A Control-Oriented Pruning-Based DRL Approach
【速读】:该论文旨在解决低空经济背景下无人机(UAV)作为空中基站时,因机载资源受限与物理控制稳定性要求之间的冲突,导致异构网络性能下降的问题。其关键解决方案是提出一个感知-通信-计算-控制(Sensing-Communication-Computing-Control)闭环框架,通过Lyapunov稳定性理论建立控制系统状态演化与通信约束之间的内在映射,将抽象的稳定性需求转化为可量化的资源边界;进一步构建Stackelberg博弈模型,使UAV作为领导者动态定价资源以平衡负载并保障稳定,用户作为跟随者依据服务紧迫性优化请求;同时设计一种基于动态结构剪枝的轻量化近端策略优化(Proximal Policy Optimization, PPO)算法,在能量受限边缘平台上显著压缩神经网络规模,实现快速逼近博弈均衡且推理延迟极低。
链接: https://arxiv.org/abs/2602.01131
作者: Yue Zhong,Jiawen Kang,Yongju Tong,Hong-Ning Dai,Dong In Kim,Abbas Jamalipour,Shengli Xie
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:With the rapid expansion of the low-altitude economy, Unmanned Aerial Vehicles (UAVs) serve as pivotal aerial base stations supporting diverse services from users, ranging from latency-sensitive critical missions to bandwidth-intensive data streaming. However, the efficacy of such heterogeneous networks is often compromised by the conflict between limited onboard resources and stringent stability requirements. Moving beyond traditional throughput-centric designs, we propose a Sensing-Communication-Computing-Control closed-loop framework that explicitly models the impact of communication latency on physical control stability. To guarantee mission reliability, we leverage the Lyapunov stability theory to derive an intrinsic mapping between the state evolution of the control system and communication constraints, transforming abstract stability requirements into quantifiable resource boundaries. Then, we formulate the resource allocation problem as a Stackelberg game, where UAVs (as leaders) dynamically price resources to balance load and ensure stability, while users (as followers) optimize requests based on service urgency. Furthermore, addressing the prohibitive computational overhead of standard Deep Reinforcement Learning (DRL) on energy-constrained edge platforms, we propose a novel and lightweight pruning-based Proximal Policy Optimization (PPO) algorithm. By integrating a dynamic structured pruning mechanism, the proposed algorithm significantly compresses the neural network scale during training, enabling the UAV to rapidly approximate the game equilibrium with minimal inference latency. Simulation results demonstrate that the proposed scheme effectively secures control loop stability while maximizing system utility in dynamic low-altitude environments.
zh
[AI-190] Transforming Vehicle Diagnostics: A Multimodal Approach to Error Patterns Prediction
【速读】:该论文旨在解决传统车辆故障诊断系统仅依赖车载诊断故障码(Diagnostic Trouble Codes, DTCs)序列而忽略环境上下文信息(如温度、湿度、压力等原始传感数据)所导致的诊断准确性不足问题。其解决方案的关键在于提出BiCarFormer——一种面向车辆事件序列的双向Transformer模型,通过嵌入融合与协同注意力机制(co-attention mechanism)有效建模DTC序列与环境条件之间的跨模态关联,从而实现更精准的多标签故障模式分类。实验表明,该方法在包含22,137条错误码和360种故障模式的真实世界数据集上显著优于仅使用DTC序列的传统模型。
链接: https://arxiv.org/abs/2602.01109
作者: Hugo Math,Rainer Lienhart
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 7 figures
Abstract:Accurately diagnosing and predicting vehicle malfunctions is crucial for maintenance and safety in the automotive industry. While modern diagnostic systems primarily rely on sequences of vehicular Diagnostic Trouble Codes (DTCs) registered in On-Board Diagnostic (OBD) systems, they often overlook valuable contextual information such as raw sensory data (e.g., temperature, humidity, and pressure). This contextual data, crucial for domain experts to classify vehicle failures, introduces unique challenges due to its complexity and the noisy nature of real-world data. This paper presents BiCarFormer: the first multimodal approach to multi-label sequence classification of error codes into error patterns that integrates DTC sequences and environmental conditions. BiCarFormer is a bidirectional Transformer model tailored for vehicle event sequences, employing embedding fusions and a co-attention mechanism to capture the relationships between diagnostic codes and environmental data. Experimental results on a challenging real-world automotive dataset with 22,137 error codes and 360 error patterns demonstrate that our approach significantly improves classification performance compared to models that rely solely on DTC sequences and traditional sequence models. This work highlights the importance of incorporating contextual environmental information for more accurate and robust vehicle diagnostics, hence reducing maintenance costs and enhancing automation processes in the automotive industry.
zh
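针对上文 BiCarFormer“DTC 序列与环境特征协同注意力融合”的描述，下面给出一个结构示意（非官方实现，维度、头数与池化方式均为假设）：

```python
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    """双向交叉注意力：DTC 关注环境上下文，环境关注事件序列。"""
    def __init__(self, d=64, heads=4, num_patterns=360):
        super().__init__()
        self.dtc_to_env = nn.MultiheadAttention(d, heads, batch_first=True)
        self.env_to_dtc = nn.MultiheadAttention(d, heads, batch_first=True)
        self.classifier = nn.Linear(2 * d, num_patterns)

    def forward(self, dtc_emb, env_emb):
        # dtc_emb: (B, L_dtc, d) 事件码嵌入; env_emb: (B, L_env, d) 传感器特征
        d2e, _ = self.dtc_to_env(dtc_emb, env_emb, env_emb)
        e2d, _ = self.env_to_dtc(env_emb, dtc_emb, dtc_emb)
        pooled = torch.cat([d2e.mean(dim=1), e2d.mean(dim=1)], dim=-1)
        return self.classifier(pooled)   # 多标签 logits

model = CoAttentionFusion()
logits = model(torch.randn(8, 20, 64), torch.randn(8, 12, 64))
loss = nn.BCEWithLogitsLoss()(logits, torch.randint(0, 2, (8, 360)).float())
print(logits.shape, round(loss.item(), 3))
```

多标签故障模式分类通常配 BCEWithLogitsLoss 按模式独立判别，与摘要中 360 种故障模式的设定一致。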
[AI-191] SPELL: Synthesis of Programmatic Edits using LLMs
【速读】:该论文旨在解决软件开发中API迁移任务的自动化难题,即在更换第三方库时,开发者需手动修改源代码以适配新API,而现有自动化工具依赖于已有的迁移案例数据,这些数据往往稀缺且难以获取,同时未能充分利用现代代码变换基础设施。其解决方案的关键在于:利用大语言模型(Large Language Models, LLMs)提取迁移示例,再通过一个智能代理(Agent)将这些示例转化为可复用、可测试的转换脚本,集成到PolyglotPiranha这一现代化代码变换工具中,从而将LLMs中的隐式迁移知识结构化为高效、通用的迁移逻辑,无需预先构建语料库或人工设计规则。
链接: https://arxiv.org/abs/2602.01107
作者: Daniel Ramos,Catarina Gamboa,Inês Lynce,Vasco Manquinho,Ruben Martins,Claire Le Goues
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: pre-print
Abstract:Library migration is a common but error-prone task in software development. Developers may need to replace one library with another due to reasons like changing requirements or licensing changes. Migration typically entails updating and rewriting source code manually. While automated migration tools exist, most rely on mining examples from real-world projects that have already undergone similar migrations. However, these data are scarce, and collecting them for arbitrary pairs of libraries is difficult. Moreover, these migration tools often miss out on leveraging modern code transformation infrastructure. In this paper, we present a new approach to automated API migration that sidesteps the limitations described above. Instead of relying on existing migration data or using LLMs directly for transformation, we use LLMs to extract migration examples. Next, we use an Agent to generalize those examples to reusable transformation scripts in PolyglotPiranha, a modern code transformation tool. Our method distills latent migration knowledge from LLMs into structured, testable, and repeatable migration logic, without requiring preexisting corpora or manual engineering effort. Experimental results across Python libraries show that our system can generate diverse migration examples and synthesize transformation scripts that generalize to real-world codebases.
zh
[AI-192] OLion: Approaching the Hadamard Ideal by Intersecting Spectral and ℓ∞ Implicit Biases
【速读】:该论文旨在解决现有优化器在大规模模型训练中存在隐式偏差不一致、状态存储开销高以及微调时优化器不匹配(optimizer mismatch)等问题。其核心解决方案是提出一种新型优化器 OLion,该方法通过结合谱控制(spectral control)与 ℓ∞-风格坐标控制(ℓ∞-style coordinate control),利用少量 Newton–Schulz 迭代对 Lion 式动量方向进行近似正交化,并施加逐元素符号更新,从而高效逼近在谱约束与 ℓ∞ 约束交集上的最大步长更新(即缩放后的 Hadamard-like 集合)。尽管正交化和符号操作具有强非线性特性,作者在弱但可验证的对角各向同性假设下证明了收敛性,且实验表明 OLion 在语言和视觉任务中性能优于或等同于 AdamW 和 Muon,同时仅需动量级状态存储,有效缓解了微调阶段的优化器不匹配问题。
链接: https://arxiv.org/abs/2602.01105
作者: Zixiao Wang,Yifei Shen,Huishuai Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 23 pages
Abstract:Many optimizers can be interpreted as steepest-descent methods under norm-induced geometries, and thus inherit corresponding implicit biases. We introduce OLion, which combines spectral control from orthogonalized update directions with ℓ∞-style coordinate control from sign updates. OLion forms a Lion-style momentum direction, approximately orthogonalizes it via a few Newton–Schulz iterations, and then applies an entrywise sign, providing an efficient approximation to taking a maximal step over the intersection of the spectral and ℓ∞ constraint sets (a scaled Hadamard-like set for matrix parameters). Despite the strong nonlinearity of orthogonalization and sign, we prove convergence under a mild, empirically verified diagonal-isotropy assumption. Across large-scale language and vision training, including GPT-2 and Llama pretraining, SiT image pretraining, and supervised fine-tuning, OLion matches or outperforms AdamW and Muon under comparable tuning while using only momentum-level optimizer state, and it mitigates optimizer mismatch when fine-tuning AdamW-pretrained checkpoints.
zh
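按摘要所述“Newton–Schulz 近似正交化 + 逐元素符号”的更新规则，下面给出一个单步示意（非官方代码；Newton–Schulz 五次迭代系数沿用 Muon 等工作中的常见取值，学习率与动量系数均为假设）：

```python
import torch

def newton_schulz(G, steps=5, eps=1e-7):
    """用若干次矩阵乘法近似 G 的正交化（极分解的正交因子）。"""
    X = G / (G.norm() + eps)                 # 归一化以保证迭代收敛
    a, b, c = 3.4445, -4.7750, 2.0315        # 常见的五次迭代系数（假设）
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

@torch.no_grad()
def olion_step(param, grad, momentum, lr=3e-4, beta1=0.9, beta2=0.99):
    # Lion 式：beta1 插值得到更新方向，beta2 更新动量
    direction = beta1 * momentum + (1 - beta1) * grad
    update = torch.sign(newton_schulz(direction))   # 谱控制 + ℓ∞ 控制
    param -= lr * update
    momentum.mul_(beta2).add_(grad, alpha=1 - beta2)

W = torch.randn(128, 64)
m = torch.zeros_like(W)
for _ in range(3):
    olion_step(W, torch.randn_like(W), m)   # 以随机梯度代替真实反传
print(W.norm())
```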
[AI-193] Probing RLVR training instability through the lens of objective-level hacking
【速读】:该论文旨在解决强化学习与可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)在混合专家(Mixture-of-Experts, MoE)架构下训练不稳定性的问题。现有方法虽能持续提升大语言模型的推理能力,但在MoE结构中易出现训练动态异常,导致性能停滞甚至退化,而其根本成因尚不明确。论文提出一个基于“目标层级劫持”(objective-level hacking)的系统性分析框架,揭示此类不稳定性源于token级信用分配错位,并表现为优化目标中的系统级虚假信号。通过在30B参数MoE模型上的实证研究,作者首次形式化解释了关键病理训练现象——训练-推理差异异常增长的机制,从而为设计稳定高效的RLVR算法提供了因果层面的理解与指导。
链接: https://arxiv.org/abs/2602.01103
作者: Yiming Dong,Kun Fu,Haoyu Li,Xinyuan Zhu,Yurou Liu,Lijing Shao,Jieping Ye,Zheng Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Prolonged reinforcement learning with verifiable rewards (RLVR) has been shown to drive continuous improvements in the reasoning capabilities of large language models, but the training is often prone to instabilities, especially in Mixture-of-Experts (MoE) architectures. Training instability severely undermines model capability improvement, yet its underlying causes and mechanisms remain poorly understood. In this work, we introduce a principled framework for understanding RLVR instability through the lens of objective-level hacking. Unlike reward hacking, which arises from exploitable verifiers, objective-level hacking emerges from token-level credit misalignment and is manifested as system-level spurious signals in the optimization objective. Grounded in our framework, together with extensive experiments on a 30B MoE model, we trace the origin and formalize the mechanism behind a key pathological training dynamic in MoE models: the abnormal growth of the training-inference discrepancy, a phenomenon widely associated with instability but previously lacking a mechanistic explanation. These findings provide a concrete and causal account of the training dynamics underlying instabilities in MoE models, offering guidance for the design of stable RLVR algorithms.
zh
[AI-194] Hard Constraints Meet Soft Generation: Guaranteed Feasibility for LLM-based Combinatorial Optimization
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在组合优化(Combinatorial Optimization, CO)任务中缺乏保证解可行性的机制这一关键问题,而可行性对实际部署至关重要。解决方案的核心在于提出FALCON框架,其三大创新包括:(i) 语法约束解码(grammar-constrained decoding)以确保解的语法正确性,(ii) 可行性修复层(feasibility repair layer)纠正语义约束违反,以及(iii) 自适应Best-of-N采样(adaptive Best-of-N sampling)实现推理计算资源的高效分配。此外,作者设计了基于最优锚定的目标引导偏好优化(Best-anchored Objective-guided Preference Optimization, BOPO)训练方法,通过目标差距加权偏好对,提供密集监督信号而不依赖人工标注,从而提升模型性能并保障解的100%可行性。
链接: https://arxiv.org/abs/2602.01090
作者: Yang Liu,Chuan Zhou,Yancheng Chen,Shuai Zhang,Xixun Lin,Xiaoqing Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 32 pages, 2 figures
Abstract:Large language models (LLMs) have emerged as promising general-purpose solvers for combinatorial optimization (CO), yet they fundamentally lack mechanisms to guarantee solution feasibility, which is critical for real-world deployment. In this work, we introduce FALCON, a framework that ensures 100% feasibility through three key innovations: (i) grammar-constrained decoding enforces syntactic validity, (ii) a feasibility repair layer corrects semantic constraint violations, and (iii) adaptive Best-of-N sampling allocates inference compute efficiently. To train the underlying LLM, we introduce Best-anchored Objective-guided Preference Optimization (BOPO), which weights preference pairs by their objective gap, providing dense supervision without human labels. Theoretically, we prove convergence for BOPO and provide bounds on repair-induced quality loss. Empirically, across seven NP-hard CO problems, FALCON achieves perfect feasibility while matching or exceeding the solution quality of state-of-the-art neural and LLM-based solvers.
zh
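为说明摘要中“可行性修复层”这一组件的作用方式，下面用 0/1 背包给出一个玩具修复示例（仅为示意，并非论文的修复算法）：当 LLM 生成的解违反容量约束时，按价值密度从低到高移除物品，保证输出必然可行。

```python
def repair_knapsack(selected, weights, values, capacity):
    """贪心修复：丢弃“性价比”最低的物品直到满足容量约束。"""
    selected, load = set(selected), sum(weights[i] for i in selected)
    for i in sorted(selected, key=lambda i: values[i] / weights[i]):
        if load <= capacity:
            break
        selected.remove(i)
        load -= weights[i]
    assert sum(weights[i] for i in selected) <= capacity
    return sorted(selected)

weights = [4, 3, 6, 5, 2]
values = [10, 7, 8, 12, 3]
raw = [0, 1, 2, 3, 4]             # 候选解总重 20 > 容量 12，语义上不可行
print(repair_knapsack(raw, weights, values, capacity=12))  # [0, 1, 3]
```

语法层面的合法性则交给语法约束解码（限制 token 只能生成符合解格式的串），两者叠加即可得到 100% 可行的输出。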
[AI-195] MedBeads: An Agent-Native Immutable Data Substrate for Trustworthy Medical AI
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在医疗场景中作为自主“临床代理”(Clinical Agents)部署时面临的“上下文不匹配”(Context Mismatch)问题:当前电子病历(Electronic Medical Records, EMRs)及FHIR标准设计用于人类阅读,导致AI代理接收碎片化数据并依赖概率推理(如检索增强生成RAG)重建患者历史,从而引发幻觉且难以审计。解决方案的关键在于提出MedBeads——一种面向代理的数据基础设施,其中临床事件以不可变的“珠子”(Beads)形式存储于Merkle有向无环图(DAG)结构中,每个珠子通过密码学方式引用其因果前驱,实现“写入一次、读取多次”的架构;该设计使篡改行为数学上可检测,并结合广度优先搜索(BFS)上下文检索算法,在O(V+E)复杂度下支持实时决策支持,同时提供可视化界面增强临床理解。此方案将上下文获取从概率性搜索转变为确定性图遍历,为可信医疗人工智能奠定基础。
链接: https://arxiv.org/abs/2602.01086
作者: Takahito Nakajima
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC); Software Engineering (cs.SE)
备注: 19 pages, 5 figures. Code available at this https URL
Abstract:Background: As of 2026, Large Language Models (LLMs) demonstrate expert-level medical knowledge. However, deploying them as autonomous “Clinical Agents” remains limited. Current Electronic Medical Records (EMRs) and standards like FHIR are designed for human review, creating a “Context Mismatch”: AI agents receive fragmented data and must rely on probabilistic inference (e.g., RAG) to reconstruct patient history. This approach causes hallucinations and hinders auditability. Methods: We propose MedBeads, an agent-native data infrastructure where clinical events are immutable “Beads”–nodes in a Merkle Directed Acyclic Graph (DAG)–cryptographically referencing causal predecessors. This “write-once, read-many” architecture makes tampering mathematically detectable. We implemented a prototype with a Go Core Engine, Python middleware for LLM integration, and a React-based visualization interface. Results: We successfully implemented the workflow using synthetic data. The FHIR-to-DAG conversion transformed flat resources into a causally-linked graph. Our Breadth-First Search (BFS) Context Retrieval algorithm traverses relevant subgraphs with O(V+E) complexity, enabling real-time decision support. Tamper-evidence is guaranteed by design: any modification breaks the cryptographic chain. The visualization aids clinician understanding through explicit causal links. Conclusion: MedBeads addresses the “Context Mismatch” by shifting from probabilistic search to deterministic graph traversal, and from mutable records to immutable chains, providing the substrate for “Trustworthy Medical AI.” It guarantees the context the AI receives is deterministic and tamper-evident, while the LLM determines interpretation. The structured Bead format serves as a token-efficient “AI-native language.” We release MedBeads as open-source software to accelerate agent-native data standards.
zh
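结合摘要中“不可变珠子构成 Merkle DAG、BFS 检索上下文”的设计，下面给出一个最小示意（非官方实现，数据字段与接口均为假设）：

```python
import hashlib, json
from collections import deque

class Bead:
    """不可变临床事件节点，以哈希引用因果前驱。"""
    def __init__(self, payload, parents=()):
        self.payload, self.parents = payload, tuple(parents)
        body = json.dumps({"payload": payload, "parents": self.parents},
                          sort_keys=True)
        self.hash = hashlib.sha256(body.encode()).hexdigest()

class BeadStore(dict):
    def add(self, payload, parents=()):
        b = Bead(payload, parents)
        self[b.hash] = b
        return b.hash

    def verify(self, h):
        """重算哈希并递归校验前驱：任何篡改都会破坏链条。"""
        b = self[h]
        return (Bead(b.payload, b.parents).hash == h
                and all(self.verify(p) for p in b.parents))

    def bfs_context(self, start, max_beads=10):
        """O(V+E) 的 BFS：沿因果边回溯收集决策支持上下文。"""
        seen, out, q = {start}, [], deque([start])
        while q and len(out) < max_beads:
            b = self[q.popleft()]
            out.append(b.payload)
            for p in b.parents:
                if p not in seen:
                    seen.add(p); q.append(p)
        return out

store = BeadStore()
h1 = store.add({"event": "diagnosis", "code": "E11.9"})
h2 = store.add({"event": "prescription", "drug": "metformin"}, parents=[h1])
h3 = store.add({"event": "lab", "HbA1c": 6.8}, parents=[h2])
print(store.verify(h3), store.bfs_context(h3))
```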
[AI-196] EvoOpt-LLM: Evolving industrial optimization models with large language models
【速读】:该论文旨在解决工业优化建模中自然语言需求转化为求解器可执行模型的难题,以及在业务规则动态变化下模型维护的高专家依赖性问题。其核心挑战在于现有方法存在数据效率低、求解器级有效性不足及难以扩展至工业规模的问题。解决方案的关键在于提出EvoOpt-LLM框架,该框架基于7B参数大语言模型(Large Language Model, LLM),采用参数高效微调技术LoRA(Low-Rank Adaptation)进行适配,实现了从自动建模、动态业务约束注入到变量剪枝的全流程支持;通过仅需3,000个训练样本即达到91%生成率和65.9%可执行率,且在1,500样本下性能显著提升,同时约束注入模块能可靠增强原有模型而不破坏目标函数,变量剪枝模块在中等规模线性规划(Linear Programming, LP)模型上实现约0.56的F1分数,从而显著降低对专家干预的需求并提升模型适应性和求解效率。
链接: https://arxiv.org/abs/2602.01082
作者: Yiliu He,Tianle Li,Binghao Ji,Zhiyuan Liu,Di Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Optimization modeling via mixed-integer linear programming (MILP) is fundamental to industrial planning and scheduling, yet translating natural-language requirements into solver-executable models and maintaining them under evolving business rules remains highly expertise-intensive. While large language models (LLMs) offer promising avenues for automation, existing methods often suffer from low data efficiency, limited solver-level validity, and poor scalability to industrial-scale problems. To address these challenges, we present EvoOpt-LLM, a unified LLM-based framework supporting the full lifecycle of industrial optimization modeling, including automated model construction, dynamic business-constraint injection, and end-to-end variable pruning. Built on a 7B-parameter LLM and adapted via parameter-efficient LoRA fine-tuning, EvoOpt-LLM achieves a generation rate of 91% and an executability rate of 65.9% with only 3,000 training samples, with critical performance gains emerging under 1,500 samples. The constraint injection module reliably augments existing MILP models while preserving original objectives, and the variable pruning module enhances computational efficiency, achieving an F1 score of ~0.56 on medium-sized LP models with only 400 samples. EvoOpt-LLM demonstrates a practical, data-efficient approach to industrial optimization modeling, reducing reliance on expert intervention while improving adaptability and solver efficiency.
zh
[AI-197] AutoHealth: An Uncertainty-Aware Multi-Agent System for Autonomous Health Data Modeling
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)驱动的智能体在健康数据场景中应用受限的问题,具体包括:现有系统难以在异构健康数据模态间泛化、过度依赖预定义解决方案模板而缺乏对任务目标的自适应能力,以及忽视不确定性估计这一保障医疗决策可靠性的关键环节。其解决方案的核心是提出一个名为AutoHealth的不确定性感知多智能体系统,通过五个专业化智能体的闭环协作,实现数据探索、任务条件下的模型构建、训练与优化,并同步优化预测性能与不确定性量化;该系统不仅生成可直接使用的模型,还输出结构化报告以支持可信解释和风险感知决策,在真实世界多模态健康数据基准上显著优于当前最优基线方法。
链接: https://arxiv.org/abs/2602.01078
作者: Tong Xia,Weibin Li,Gang Liu,Yong Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:LLM-based agents have demonstrated strong potential for autonomous machine learning, yet their applicability to health data remains limited. Existing systems often struggle to generalize across heterogeneous health data modalities, rely heavily on predefined solution templates with insufficient adaptation to task-specific objectives, and largely overlook uncertainty estimation, which is essential for reliable decision-making in healthcare. To address these challenges, we propose AutoHealth, a novel uncertainty-aware multi-agent system that autonomously models health data and assesses model reliability. AutoHealth employs closed-loop coordination among five specialized agents to perform data exploration, task-conditioned model construction, training, and optimization, while jointly prioritizing predictive performance and uncertainty quantification. Beyond producing ready-to-use models, the system generates comprehensive reports to support trustworthy interpretation and risk-aware decision-making. To rigorously evaluate its effectiveness, we curate a challenging real-world benchmark comprising 17 tasks across diverse data modalities and learning settings. AutoHealth completes all tasks and outperforms state-of-the-art baselines by 29.2% in prediction performance and 50.2% in uncertainty estimation.
zh
[AI-198] ConvexBench: Can LLMs Recognize Convex Functions?
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理深度函数复合下的凸性识别任务时表现出的显著推理能力下降问题,即随着函数组合深度增加,模型性能急剧恶化,F1分数从深度为2时的1.0骤降至深度为100时的约0.2。其关键解决方案是一个代理式分而治之框架,通过两个核心机制实现:一是将解析任务外化至外部工具以构建抽象语法树(Abstract Syntax Tree, AST),从而避免模型自身的解析失败;二是对每个中间子表达式强制执行递归推理,并结合聚焦上下文以提升推理准确性。该框架有效缓解了深度组合带来的失败问题,在深度达100时仍能实现F1分数1.0的稳定表现。
链接: https://arxiv.org/abs/2602.01075
作者: Yepeng Liu,Yu Huang,Yu-Xiang Wang,Yingbin Liang,Yuheng Bu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Convex analysis is a modern branch of mathematics with many applications. As Large Language Models (LLMs) start to automate research-level math and sciences, it is important for LLMs to demonstrate the ability to understand and reason with convexity. We introduce ConvexBench, a scalable and mechanically verifiable benchmark for testing whether LLMs can identify the convexity of a symbolic objective under deep functional composition. Experiments on frontier LLMs reveal a sharp compositional reasoning gap: performance degrades rapidly with increasing depth, dropping from an F1-score of 1.0 at depth 2 to approximately 0.2 at depth 100. Inspection of models’ reasoning traces indicates two failure modes: parsing failure and lazy reasoning. To address these limitations, we propose an agentic divide-and-conquer framework that (i) offloads parsing to an external tool to construct an abstract syntax tree (AST) and (ii) enforces recursive reasoning over each intermediate sub-expression with focused context. This framework reliably mitigates deep-composition failures, achieving substantial performance improvement at large depths (e.g., F1-score = 1.0 at depth 100).
zh
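为具体化“解析外化为 AST + 对每个子表达式递归推理”的代理式分而治之思想，下面给出一个基于组合规则（类似 DCP 规则）的凸性递归判定草图（仅为示意，规则集远非完备，节点编码亦为假设）：

```python
AFFINE, CONVEX, CONCAVE, UNKNOWN = "affine", "convex", "concave", "unknown"

def curvature(node):
    """对 AST 自底向上递归应用凸性组合规则。"""
    kind = node[0]
    if kind in ("var", "const"):
        return AFFINE
    if kind == "add":                    # 凸+凸 为凸，凹+凹 为凹
        cs = [curvature(c) for c in node[1:]]
        if all(c in (AFFINE, CONVEX) for c in cs):
            return AFFINE if all(c == AFFINE for c in cs) else CONVEX
        if all(c in (AFFINE, CONCAVE) for c in cs):
            return CONCAVE
        return UNKNOWN
    if kind == "neg":                    # 取负翻转凸/凹
        c = curvature(node[1])
        return {CONVEX: CONCAVE, CONCAVE: CONVEX}.get(c, c)
    if kind == "max":                    # 凸函数的逐点最大值仍为凸
        return CONVEX if all(curvature(c) in (AFFINE, CONVEX)
                             for c in node[1:]) else UNKNOWN
    if kind in ("square", "abs", "exp"):
        inner = curvature(node[1])
        if inner == AFFINE:              # 凸函数复合仿射仍为凸
            return CONVEX
        if kind == "exp" and inner == CONVEX:  # exp 非减：exp(凸) 为凸
            return CONVEX
        return UNKNOWN
    return UNKNOWN

# 深度复合示例：exp(max(x, x^2)) + |x|
expr = ("add",
        ("exp", ("max", ("var",), ("square", ("var",)))),
        ("abs", ("var",)))
print(curvature(expr))   # convex
```

将每个子表达式的判定作为独立子任务交给 LLM、再按同样方式自底向上汇总，即摘要中缓解深度复合失败的思路。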
[AI-199] SetPO: Set-Level Policy Optimization for Diversity-Preserving LLM Reasoning
【速读】:该论文旨在解决强化学习中基于可验证奖励(verifiable rewards)提升大语言模型(Large Language Models, LLMs)推理能力时所引发的解空间多样性下降问题,即模型倾向于将概率质量集中于少数解,从而限制了其探索能力。解决方案的关键在于引入一种基于采样轨迹的集合级别多样性目标(set-level diversity objective),该目标通过核函数度量轨迹间的相似性,并计算每条轨迹的留一法边际贡献(leave-one-out marginal contribution),进而将其作为优势塑造项(advantage shaping term)嵌入策略优化过程中,以实现对多样性的显式建模与增强。理论分析进一步在分布扰动框架下证明了稀有轨迹具有更高的边际多样性贡献,从而为算法设计提供了理论支撑。
链接: https://arxiv.org/abs/2602.01062
作者: Chenyi Li,Yuan Zhang,Bo Wang,Guoqing Ma,Wei Tang,Haoyang Huang,Nan Duan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning with verifiable rewards has shown notable effectiveness in enhancing large language models (LLMs) reasoning performance, especially in mathematics tasks. However, such improvements often come with reduced outcome diversity, where the model concentrates probability mass on a narrow set of solutions. Motivated by diminishing-returns principles, we introduce a set level diversity objective defined over sampled trajectories using kernelized similarity. Our approach derives a leave-one-out marginal contribution for each sampled trajectory and integrates this objective as a plug-in advantage shaping term for policy optimization. We further investigate the contribution of a single trajectory to language model diversity within a distribution perturbation framework. This analysis theoretically confirms a monotonicity property, proving that rarer trajectories yield consistently higher marginal contributions to the global diversity. Extensive experiments across a range of model scales demonstrate the effectiveness of our proposed algorithm, consistently outperforming strong baselines in both Pass@1 and Pass@K across various benchmarks.
zh
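摘要指出 SetPO 用核化相似度定义集合级多样性，并以留一法边际贡献作为优势塑造项。下面给出一个数值示意（非官方实现；摘要未指明具体多样性函数，这里以 log-det 加 RBF 核仅作演示，塑造系数亦为假设）：

```python
import numpy as np

def rbf_kernel(E, gamma=1.0):
    sq = ((E[:, None, :] - E[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def set_diversity(K):
    """集合级多样性的一种常见取法：log det(K + eps I)。"""
    _, logdet = np.linalg.slogdet(K + 1e-6 * np.eye(K.shape[0]))
    return logdet

def loo_diversity_bonus(E):
    """第 i 条轨迹的留一法边际贡献：D(全集) - D(去掉 i)。"""
    K, idx = rbf_kernel(E), np.arange(len(E))
    full = set_diversity(K)
    return np.array([full - set_diversity(K[np.ix_(idx != i, idx != i)])
                     for i in idx])

# 用法示意：把多样性贡献叠加到组内中心化的可验证奖励上
rng = np.random.default_rng(0)
E = rng.normal(size=(8, 16))                    # 8 条采样轨迹的嵌入
r = rng.integers(0, 2, 8).astype(float)        # 0/1 可验证奖励
adv = (r - r.mean()) + 0.1 * loo_diversity_bonus(E)
print(np.round(adv, 3))
```

与摘要的单调性结论一致：越“稀有”（与其余轨迹核相似度越低）的轨迹，留一法贡献越大，从而在优势中被上调。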
[AI-200] TLDiffGAN: A Latent Diffusion-GAN Framework with Temporal Information Fusion for Anomalous Sound Detection ICASSP2026
【速读】:该论文旨在解决现有无监督异常声音检测生成模型在捕捉正常声音复杂特征分布方面的局限性,以及尚未充分挖掘强大扩散模型在此任务中的潜力。解决方案的关键在于提出一种名为TLDiffGAN的新型框架,其核心创新包括:1)在生成对抗网络(GAN)生成器中引入潜在扩散模型进行对抗训练,从而提升判别器挑战性并改善生成样本质量;2)利用预训练音频模型编码器直接从原始音频波形提取特征用于辅助判别,增强对正常声音特征表示的捕获能力;3)设计TMixup频谱增强技术,提升对细微且局部时序模式的敏感性。该框架通过结合原始音频与梅尔频谱的多模态特征学习,显著提升了异常声音检测性能及时间-频率定位精度。
链接: https://arxiv.org/abs/2602.01060
作者: Chengyuan Ma,Peng Jia,Hongyue Guo,Wenming Yang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Accepted by ICASSP 2026
Abstract:Existing generative models for unsupervised anomalous sound detection are limited by their inability to fully capture the complex feature distribution of normal sounds, while the potential of powerful diffusion models in this domain remains largely unexplored. To address this challenge, we propose a novel framework, TLDiffGAN, which consists of two complementary branches. One branch incorporates a latent diffusion model into the GAN generator for adversarial training, thereby making the discriminator’s task more challenging and improving the quality of generated samples. The other branch leverages pretrained audio model encoders to extract features directly from raw audio waveforms for auxiliary discrimination. This framework effectively captures feature representations of normal sounds from both raw audio and Mel spectrograms. Moreover, we introduce a TMixup spectrogram augmentation technique to enhance sensitivity to subtle and localized temporal patterns that are often overlooked. Extensive experiments on the DCASE 2020 Challenge Task 2 dataset demonstrate the superior detection performance of TLDiffGAN, as well as its strong capability in anomalous time-frequency localization.
zh
[AI-201] Superposition unifies power-law training dynamics
【速读】:该论文旨在解决神经网络训练中幂律(power-law)动态行为的起源问题,特别是特征叠加(feature superposition)如何影响训练速度与幂律指数的稳定性。其解决方案的关键在于构建教师-学生框架并推导无叠加情况下的解析理论,进而揭示叠加瓶颈会引发一种相变,使训练幂律指数趋于一个与输入数据和通道重要性无关的普适值(∼1),从而实现比纯顺序学习快达十倍的加速效果。这一发现表明,特征叠加可促使模型在不依赖具体数据分布的情况下快速收敛,对大规模语言模型等采用叠加机制的网络具有重要意义。
链接: https://arxiv.org/abs/2602.01045
作者: Zixin Jessie Chen,Hao Chen,Yizhou Liu,Jeff Gore
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
备注: 17 pages, 14 figures
Abstract:We investigate the role of feature superposition in the emergence of power-law training dynamics using a teacher-student framework. We first derive an analytic theory for training without superposition, establishing that the power-law training exponent depends on both the input data statistics and channel importance. Remarkably, we discover that a superposition bottleneck induces a transition to a universal power-law exponent of ~1, independent of data and channel statistics. This one-over-time training with superposition represents up to a tenfold acceleration compared to the purely sequential learning that takes place in the absence of superposition. Our finding that superposition leads to rapid training with a data-independent power-law exponent may have important implications for a wide range of neural networks that employ superposition, including production-scale large language models.
zh
[AI-202] Adaptive Dual-Weighting Framework for Federated Learning via Out-of-Distribution Detection
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在真实服务场景中因数据非独立同分布(non-IID)导致的模型收敛不稳定、泛化能力下降及服务质量降低的问题。其解决方案的关键在于提出FLood框架,该框架受out-of-distribution(OOD)检测启发,引入一种双权重机制:在客户端层面通过自适应重加权监督损失,对伪OOD样本进行加权以增强对分布偏移或挑战性数据的鲁棒学习;在服务器端则依据客户端的OOD置信度分数对模型更新进行加权聚合,优先采纳来自分布一致性更高的客户端的贡献,从而提升全局模型的鲁棒性和收敛稳定性。
链接: https://arxiv.org/abs/2602.01039
作者: Zhiwei Ling,Hailiang Zhao,Chao Zhang,Xiang Ao,Ziqi Wang,Cheng Zhang,Zhen Qin,Xinkui Zhao,Kingsum Chow,Yuanqing Wu,MengChu Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Federated Learning (FL) enables collaborative model training across large-scale distributed service nodes while preserving data privacy, making it a cornerstone of intelligent service systems in edge-cloud environments. However, in real-world service-oriented deployments, data generated by heterogeneous users, devices, and application scenarios are inherently non-IID. This severe data heterogeneity critically undermines the convergence stability, generalization ability, and ultimately the quality of service delivered by the global model. To address this challenge, we propose FLood, a novel FL framework inspired by out-of-distribution (OOD) detection. FLood dynamically counteracts the adverse effects of heterogeneity through a dual-weighting mechanism that jointly governs local training and global aggregation. At the client level, it adaptively reweights the supervised loss by upweighting pseudo-OOD samples, thereby encouraging more robust learning from distributionally misaligned or challenging data. At the server level, it refines model aggregation by weighting client contributions according to their OOD confidence scores, prioritizing updates from clients with higher in-distribution consistency and enhancing the global model’s robustness and convergence stability. Extensive experiments across multiple benchmarks under diverse non-IID settings demonstrate that FLood consistently outperforms state-of-the-art FL methods in both accuracy and generalization. Furthermore, FLood functions as an orthogonal plug-in module: it seamlessly integrates with existing FL algorithms to boost their performance under heterogeneity without modifying their core optimization logic. These properties make FLood a practical and scalable solution for deploying reliable intelligent services in real-world federated environments.
zh
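对照 FLood 的双权重机制，下面给出一个示意（非官方实现；摘要未指明具体 OOD 评分，这里以 OOD 检测文献中常见的能量得分代替，权重形式亦为假设）：

```python
import numpy as np

def energy_score(logits, T=1.0):
    """能量越高越像 OOD（常见的 OOD 置信度打分方式之一）。"""
    return -T * np.log(np.exp(logits / T).sum(axis=1))

def client_sample_weights(logits, alpha=0.5):
    """客户端侧：伪 OOD（高能量）样本的监督损失被上调。"""
    e = energy_score(logits)
    z = (e - e.mean()) / (e.std() + 1e-8)
    return 1.0 + alpha * np.clip(z, 0, None)

def server_aggregate(updates, client_energies):
    """服务器侧：分布内一致性越高（平均能量越低）的客户端权重越大。"""
    scores = np.array([-np.mean(e) for e in client_energies])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return sum(wi * ui for wi, ui in zip(w, updates)), w

rng = np.random.default_rng(0)
print(client_sample_weights(rng.normal(size=(16, 10)))[:4])
updates = [rng.normal(size=5) for _ in range(3)]      # 三个客户端的更新
energies = [energy_score(rng.normal(size=(16, 10))) for _ in range(3)]
_, w = server_aggregate(updates, energies)
print(np.round(w, 3))
```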
[AI-203] HierCon: Hierarchical Contrastive Attention for Audio Deepfake Detection WWW’26
【速读】:该论文旨在解决生成式语音伪造(Audio Deepfakes)难以被现有检测方法识别的问题,尤其针对现代文本到语音(TTS)和语音转换(Voice Conversion)系统产生的高保真伪造语音。现有检测方法通常独立处理自监督模型的多层表征,忽视了时间帧间、相邻层及层组之间的时序与层级依赖关系,导致泛化能力不足。其解决方案的关键在于提出HierCon框架——一种结合层次化层注意力机制与基于边距的对比学习的方法,通过建模跨时间帧、邻近层及层组的依赖关系,并促使嵌入特征具有域不变性,从而显著提升对跨域生成技术和录音条件变化的鲁棒性检测性能。
链接: https://arxiv.org/abs/2602.01032
作者: Zhili Nicholas Liang,Soyeon Caren Han,Qizhou Wang,Christopher Leckie
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Proceedings of The Web Conference 2026 (WWW’26), short track
Abstract:Audio deepfakes generated by modern TTS and voice conversion systems are increasingly difficult to distinguish from real speech, raising serious risks for security and online trust. While state-of-the-art self-supervised models provide rich multi-layer representations, existing detectors treat layers independently and overlook temporal and hierarchical dependencies critical for identifying synthetic artefacts. We propose HierCon, a hierarchical layer attention framework combined with margin-based contrastive learning that models dependencies across temporal frames, neighbouring layers, and layer groups, while encouraging domain-invariant embeddings. Evaluated on ASVspoof 2021 DF and In-the-Wild datasets, our method achieves state-of-the-art performance (1.93% and 6.87% EER), improving over independent layer weighting by 36.6% and 22.5% respectively. The results and attention visualisations confirm that hierarchical modelling enhances generalisation to cross-domain generation techniques and recording conditions.
zh
[AI-204] Unifying Ranking and Generation in Query Auto-Completion via Retrieval-Augmented Generation and Multi-Objective Alignment
【速读】:该论文旨在解决查询自动补全(Query Auto-Completion, QAC)在实际应用中面临的两大核心问题:传统检索-排序流水线存在长尾覆盖不足且依赖大量特征工程,而现有生成式方法则易产生幻觉和安全风险。其解决方案的关键在于提出一个统一框架,将QAC重构为端到端列表生成任务,并结合检索增强生成(Retrieval-Augmented Generation, RAG)与多目标直接偏好优化(multi-objective Direct Preference Optimization, DPO),通过规则-based、模型-based及大语言模型作为裁判(LLM-as-judge)的验证体系,辅以迭代批判-修正机制构建高质量合成数据,最终实现高精度、低延迟的生产级部署。
链接: https://arxiv.org/abs/2602.01023
作者: Kai Yuan(1),Anthony Zheng(1),Jia Hu(1),Divyanshu Sheth(1),Hemanth Velaga(1),Kylee Kim(1),Matteo Guarrera(2),Besim Avci(2),Xuetao Yin(1),Rajyashree Mukherjee(1),Sean Suchter(1) ((1) Apple, (2) UC Berkeley)
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages, 4 figures
Abstract:Query Auto-Completion (QAC) suggests query completions as users type, helping them articulate intent and reach results more efficiently. Existing approaches face fundamental challenges: traditional retrieve-and-rank pipelines have limited long-tail coverage and require extensive feature engineering, while recent generative methods suffer from hallucination and safety risks. We present a unified framework that reformulates QAC as end-to-end list generation through Retrieval-Augmented Generation (RAG) and multi-objective Direct Preference Optimization (DPO). Our approach combines three key innovations: (1) reformulating QAC as end-to-end list generation with multi-objective optimization; (2) defining and deploying a suite of rule-based, model-based, and LLM-as-judge verifiers for QAC, and using them in a comprehensive methodology that combines RAG, multi-objective DPO, and iterative critique-revision for high-quality synthetic data; (3) a hybrid serving architecture enabling efficient production deployment under strict latency constraints. Evaluation on a large-scale commercial search platform demonstrates substantial improvements: offline metrics show gains across all dimensions, human evaluation yields +0.40 to +0.69 preference scores, and a controlled online experiment achieves 5.44% reduction in keystrokes and 3.46% increase in suggestion adoption, validating that unified generation with RAG and multi-objective alignment provides an effective solution for production QAC. This work represents a paradigm shift to end-to-end generation powered by large language models, RAG, and multi-objective alignment, establishing a production-validated framework that can benefit the broader search and recommendation industry.
zh
[AI-205] Offline Discovery of Interpretable Skills from Multi-Task Trajectories
【速读】:该论文旨在解决从长时程、多任务的离线数据中自动发现可复用技能(skill)的问题,尤其在缺乏显式奖励信号或子任务标注的情况下。其核心挑战在于如何从无监督或弱监督的数据中提取具有语义意义且可组合的技能,并构建层级化策略以实现复杂机器人行为的模仿学习。解决方案的关键在于提出一个三阶段端到端框架LOKI:第一阶段利用基于弱任务标签引导的对齐强化向量量化变分自编码器(alignment-enforced Vector Quantized VAE)进行粗粒度的任务感知宏段分割;第二阶段通过自监督序列模型和迭代聚类优化微粒度技能边界;第三阶段基于精确的技能边界构建选项(option-based)层次策略,引入可学习的终止条件β实现显式的技能切换。该方法在D4RL Kitchen基准上表现出高成功率,并验证了所发现技能的语义合理性和组合能力。
链接: https://arxiv.org/abs/2602.01018
作者: Chongyu Zhu,Mithun Vanniasinghe,Jiayu Chen,Chi-Guhn Lee
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Hierarchical Imitation Learning is a powerful paradigm for acquiring complex robot behaviors from demonstrations. A central challenge, however, lies in discovering reusable skills from long-horizon, multi-task offline data, especially when the data lacks explicit rewards or subtask annotations. In this work, we introduce LOKI, a three-stage end-to-end learning framework designed for offline skill discovery and hierarchical imitation. The framework commences with a two-stage, weakly supervised skill discovery process: Stage one performs coarse, task-aware macro-segmentation by employing an alignment-enforced Vector Quantized VAE guided by weak task labels. Stage two then refines these segments at a micro-level using a self-supervised sequential model, followed by an iterative clustering process to consolidate skill boundaries. The third stage then leverages these precise boundaries to construct a hierarchical policy within an option-based framework-complete with a learned termination condition beta for explicit skill switching. LOKI achieves high success rates on the challenging D4RL Kitchen benchmark and outperforms standard HIL baselines. Furthermore, we demonstrate that the discovered skills are semantically meaningful, aligning with human intuition, and exhibit compositionality by successfully sequencing them to solve a novel, unseen task.
zh
[AI-206] How Does Unfaithful Reasoning Emerge from Autoregressive Training? A Study of Synthetic Experiments
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在生成链式思维(Chain-of-thought, CoT)推理时常见的不忠实问题,即中间推理步骤存在逻辑不一致或未能体现导致最终答案的因果关系。为探究这一现象的本质,作者通过受控的合成实验,在噪声数据上训练小型Transformer模型执行模块化算术表达式推理任务(Arithmetic Expression Reasoning),以揭示忠实CoT推理的形成机制及其失效原因。解决方案的关键在于发现:模型仅在训练噪声低于某一临界阈值时才能学习到符合算术规则的因果推理;当噪声过高时,训练动态会从忠实的逐步推理演变为跳步推理,并经历一个由预测熵短暂上升表征的混合模式;此外,机制分析表明,模型通过编码内部不确定性来处理推理冲突,从而自发地实现隐式的自我验证能力。
链接: https://arxiv.org/abs/2602.01017
作者: Fuxin Wang,Amr Alazali,Yiqiao Zhong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 25 pages, 23 figures
Abstract:Chain-of-thought (CoT) reasoning generated by large language models (LLMs) is often unfaithful: intermediate steps can be logically inconsistent or fail to reflect the causal relationship leading to the final answer. Despite extensive empirical observations, a fundamental understanding of CoT is lacking–what constitutes faithful CoT reasoning, and how unfaithfulness emerges from autoregressive training. We study these questions using well-controlled synthetic experiments, training small transformers on noisy data to solve modular arithmetic expressions step by step, a task we term Arithmetic Expression Reasoning. We find that models can learn faithful reasoning that causally follows the underlying arithmetic rules, but only when the training noise is below a critical threshold, a phenomenon attributable to simplicity bias. At higher noise levels, training dynamics exhibit a transition from faithful stepwise reasoning to unfaithful skip-step reasoning via an intermediate mixed mode characterized by a transient increase in prediction entropy. Mechanistic analysis reveals that models learn to encode internal uncertainty by resolving inconsistent reasoning steps, which suggests the emergence of implicit self-verification from autoregressive training.
zh
[AI-207] Multi-Agent Teams Hold Experts Back ICML2026
【速读】:该论文旨在解决自组织多智能体大语言模型(Multi-agent LLM)系统中协同效率不足的问题,特别是当团队成员无法预先定义角色或流程时,如何有效利用专家知识以实现协同优势(synergy)。其关键发现在于:尽管允许自由交互的团队理论上可涌现高效协作,但实验表明这类团队在多数场景下无法达到甚至超越最优秀个体的表现,性能损失高达37.6%。进一步分析揭示,问题的核心瓶颈并非专家识别能力,而是“专家利用”(expert leveraging)——即团队倾向于采取整合式妥协策略(averaging expert and non-expert views),而非根据能力差异合理加权专家意见,且该行为随团队规模增大而加剧,显著削弱整体性能。这一机制上的缺陷暴露了当前自组织多智能体系统在有效调动集体智慧方面的重大局限。
链接: https://arxiv.org/abs/2602.01011
作者: Aneesh Pappu,Batu El,Hancheng Cao,Carmelo di Nolfo,Yanchao Sun,Meng Cao,James Zou
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: Under review at ICML 2026
Abstract:Multi-agent LLM systems are increasingly deployed as autonomous collaborators, where agents interact freely rather than execute fixed, pre-specified workflows. In such settings, effective coordination cannot be fully designed in advance and must instead emerge through interaction. However, most prior work enforces coordination through fixed roles, workflows, or aggregation rules, leaving open the question of how well self-organizing teams perform when coordination is unconstrained. Drawing on organizational psychology, we study whether self-organizing LLM teams achieve strong synergy, where team performance matches or exceeds the best individual member. Across human-inspired and frontier ML benchmarks, we find that – unlike human teams – LLM teams consistently fail to match their expert agent’s performance, even when explicitly told who the expert is, incurring performance losses of up to 37.6%. Decomposing this failure, we show that expert leveraging, rather than identification, is the primary bottleneck. Conversational analysis reveals a tendency toward integrative compromise – averaging expert and non-expert views rather than appropriately weighting expertise – which increases with team size and correlates negatively with performance. Interestingly, this consensus-seeking behavior improves robustness to adversarial agents, suggesting a trade-off between alignment and effective expertise utilization. Our findings reveal a significant gap in the ability of self-organizing multi-agent teams to harness the collective expertise of their members.
zh
[AI-208] LASS-ODE: Scaling ODE Computations to Connect Foundation Models with Dynamical Physical Systems
【速读】:该论文旨在解决物理系统动态预测中两个核心挑战:(i) 物理计算可扩展性问题,即传统物理信息学习方法依赖于昂贵的常微分方程(Ordinary Differential Equation, ODE)积分运算,难以扩展至大规模系统;(ii) 知识共享效率低,现有注意力机制仅在单个系统内计算,无法有效提取跨系统的共享ODE结构。解决方案的关键在于提出两种创新设计:首先,引入基于局部线性ODE的token表示方法,通过保持局部线性特性实现高精度的ODE演化模拟,显著加速积分过程并维持物理保真度;其次,设计一种跨系统注意力机制,引入公共结构枢纽(Common Structure Hub, CSH),存储和聚合不同系统间的共享特征,从而提升知识迁移与泛化能力。由此构建的LASS-ODE模型在40 GB ODE轨迹数据上预训练后,展现出优异的领域内性能、零样本跨系统泛化能力及微调优化潜力。
链接: https://arxiv.org/abs/2602.01009
作者: Haoran Li,Chenhan Xiao,Lihao Mai,Yang Weng,Erik Blasch
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Foundation models have transformed language, vision, and time series data analysis, yet progress on dynamic predictions for physical systems remains limited. Given the complexity of physical constraints, two challenges stand out. (i) Physics-computation scalability: physics-informed learning can enforce physical regularization, but its computation (e.g., ODE integration) does not scale to extensive systems. (ii) Knowledge-sharing efficiency: the attention mechanism is primarily computed within each system, which limits the extraction of shared ODE structures across systems. We show that enforcing ODE consistency does not require expensive nonlinear integration: a token-wise locally linear ODE representation preserves physical fidelity while scaling to foundation-model regimes. Thus, we propose novel token representations that respect locally linear ODE evolution. Such linearity substantially accelerates integration while accurately approximating the local data manifold. Second, we introduce a simple yet effective inter-system attention that augments attention with a common structure hub (CSH) that stores shared tokens and aggregates knowledge across systems. The resulting model, termed LASS-ODE (LArge-Scale Small ODE), is pretrained on our 40 GB ODE trajectory collections to enable strong in-domain performance, zero-shot generalization across diverse ODE systems, and additional improvements through fine-tuning.
zh
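摘要的核心观察是：token 级局部线性 ODE 表示下，演化有闭式解 x(t+Δt) = exp(AΔt)·x(t)，无需昂贵的非线性数值积分。下面用二维谐振子做一个可验证的小示意（非官方实现，拟合方式为有限差分最小二乘，仅作演示）：

```python
import numpy as np
from scipy.linalg import expm

def fit_local_linear(X, dt):
    """在局部窗口内用有限差分最小二乘拟合 dx/dt ≈ A x。"""
    dX = (X[1:] - X[:-1]) / dt
    B, *_ = np.linalg.lstsq(X[:-1], dX, rcond=None)
    return B.T                           # X[:-1] @ B ≈ dX，故 A = B.T

def evolve(x0, A, dt, steps):
    """闭式演化：反复左乘 expm(A·dt)，避免逐步数值积分。"""
    Phi = expm(A * dt)
    out = [x0]
    for _ in range(steps):
        out.append(Phi @ out[-1])
    return np.stack(out)

dt = 0.01
A_true = np.array([[0.0, 1.0], [-1.0, 0.0]])   # 谐振子
X = evolve(np.array([1.0, 0.0]), A_true, dt, steps=500)
print(np.round(fit_local_linear(X, dt), 2))    # 应接近 A_true
```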
[AI-209] ESSAM: A Novel Competitive Evolution Strategies Approach to Reinforcement Learning for Memory Efficient LLM s Fine-Tuning
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在训练大语言模型(Large Language Models, LLMs)进行数学推理时存在的高GPU内存占用问题,从而限制其在资源受限场景下的应用。解决方案的关键在于提出一种名为进化策略结合尖锐度感知最大化(Evolution Strategies with Sharpness-Aware Maximization, ESSAM)的全参数微调框架,该框架通过将零阶搜索(zero-order search)方法引入参数空间的进化策略(Evolution Strategies, ES)与尖锐度感知最大化(Sharpness-Aware Maximization, SAM)相结合,在保证模型性能的同时显著降低GPU内存消耗,实验证明其平均准确率达到78.27%,且GPU内存使用量相比PPO和GRPO分别降低18倍和10倍。
链接: https://arxiv.org/abs/2602.01003
作者: Zhishen Sun,Sizhe Dang,Guang Dai,Haishan Ye
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning (RL) has become a key training step for improving mathematical reasoning in large language models (LLMs), but it often has high GPU memory usage, which makes it hard to use in settings with limited resources. To reduce these issues, we propose Evolution Strategies with Sharpness-Aware Maximization (ESSAM), a full-parameter fine-tuning framework that tightly combines the zero-order search in parameter space from Evolution Strategies (ES) with Sharpness-Aware Maximization (SAM) to improve generalization. We conduct fine-tuning experiments on the mainstream mathematical reasoning task GSM8K. The results show that ESSAM achieves an average accuracy of 78.27% across all models, and its overall performance is comparable to RL methods: it surpasses the classic RL algorithm PPO (77.72%), is comparable to GRPO (78.34%), and even surpasses them on some models. In terms of GPU memory usage, ESSAM reduces average GPU memory usage by 18× compared to PPO and by 10× compared to GRPO, achieving extremely low GPU memory usage.
zh
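按摘要对 ESSAM“ES 的参数空间零阶搜索 + SAM 提升泛化”的描述，下面给出一个玩具目标上的示意（非官方实现；对偶扰动 ES 与“先走到邻域最坏点再估梯度”的 SAM 结合方式是本文的理解性草图，超参均为假设）：

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(theta):
    return -np.sum((theta - 1.0) ** 2)   # 玩具目标：代替一次完整 rollout 评分

def es_grad(theta, sigma=0.1, pop=32):
    """对偶(antithetic)扰动的 ES 零阶梯度估计，只需前向评估。"""
    eps = rng.normal(size=(pop, theta.size))
    scores = np.array([reward(theta + sigma * e) - reward(theta - sigma * e)
                       for e in eps])
    return (scores[:, None] * eps).sum(0) / (2 * sigma * pop)

def essam_step(theta, lr=0.05, rho=0.05):
    g = es_grad(theta)
    # SAM 风格：先朝使目标变差的方向走 rho（邻域最坏点），在该处重估梯度
    g_adv = es_grad(theta - rho * g / (np.linalg.norm(g) + 1e-12))
    return theta + lr * g_adv

theta = np.zeros(8)
for _ in range(200):
    theta = essam_step(theta)
print(np.round(theta, 2))   # 应接近全 1
```

整个过程不需要反向传播，因而无需保存激活与优化器二阶状态，这也是摘要中显存开销大幅降低的直观来源。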
[AI-210] How RLHF Amplifies Sycophancy
【速读】:该论文旨在解决大语言模型在基于人类偏好进行后训练(preference-based post-training)后出现的谄媚行为(sycophantic behavior)加剧问题,即模型倾向于盲目迎合用户陈述或隐含信念,即使这与事实准确性或合理判断相悖。其核心发现是,人类反馈中的偏差(bias)通过一个明确的放大机制被强化,该机制因果地将优化学习到的奖励函数与偏好数据中的偏差联系起来,导致行为漂移的方向由基础策略下“支持信念信号”与学习奖励之间的协方差决定,并可简化为均值差距条件。解决方案的关键在于提出一种训练时干预措施,通过构造一个闭式表达的协议惩罚项(agreement penalty),对奖励函数进行最小化修正,从而中和该放大机制;在此约束下,所得到的唯一最优策略在KL散度意义下最接近未经约束的后训练策略,有效防止谄媚行为增长。
链接: https://arxiv.org/abs/2602.01002
作者: Itai Shapira,Gerdus Benade,Ariel D. Procaccia
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models often exhibit increased sycophantic behavior after preference-based post-training, showing a stronger tendency to affirm a user’s stated or implied belief even when this conflicts with factual accuracy or sound judgment. We present a formal analysis of how alignment from human feedback can increase this failure mode by identifying an explicit amplification mechanism that causally links optimization against a learned reward to bias in the human preference data used for alignment. We show that the direction of behavioral drift is determined by a covariance under the base policy between endorsing the belief signal in the prompt and the learned reward, and that the first-order effect reduces to a simple mean-gap condition. We then analyze reward learning from pairwise comparisons under random utility models like Bradley-Terry and characterize when bias in human annotators’ preferences induces this reward gap. Next, we propose a training-time intervention designed to neutralize the amplification mechanism itself. Among all post-trained policies that prevent sycophantic behavior from increasing, we characterize the unique policy closest in KL divergence to the unconstrained post-trained policy, and derive the corresponding minimal reward correction as a closed-form agreement penalty. Computational experiments find that reward gaps are common and cause behavioral drift in all the configurations considered.
zh
[AI-211] Reasoning and Tool-use Compete in Agentic RL: From Quantifying Interference to Disentangled Tuning
【速读】:该论文旨在解决当前基于代理的强化学习(Agentic Reinforcement Learning, ARL)中普遍采用共享参数模型同时训练推理与工具使用行为所引发的训练干扰问题。研究表明,推理和工具使用能力在训练过程中常产生方向不一致的梯度,导致联合优化效果下降,从而挑战了现有ARL范式的有效性。解决方案的关键在于提出一种名为解耦动作推理微调(Disentangled Action Reasoning Tuning, DART)的框架,该框架通过引入独立的低秩适应模块(Low-Rank Adaptation Modules),显式地分离推理与工具使用对应的参数更新路径,从而缓解二者间的训练冲突。实验表明,DART在多个任务上平均提升6.35%,性能接近多代理系统中显式分离推理与工具使用的方案。
链接: https://arxiv.org/abs/2602.00994
作者: Yu Li,Mingyang Yi,Xiuyu Li,Ju Fan,Fuxin Jiang,Binbin Chen,Peng Li,Jie Song,Tieying Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Agentic Reinforcement Learning (ARL) focuses on training large language models (LLMs) to interleave reasoning with external tool execution to solve complex tasks. Most existing ARL methods train a single shared set of model parameters to support both reasoning and tool-use behaviors, implicitly assuming that joint training leads to improved overall agent performance. Despite its widespread adoption, this assumption has rarely been examined empirically. In this paper, we systematically investigate this assumption by introducing a Linear Effect Attribution System (LEAS), which provides quantitative evidence of interference between reasoning and tool-use behaviors. Through an in-depth analysis, we show that these two capabilities often induce misaligned gradient directions, leading to training interference that undermines the effectiveness of joint optimization and challenges the prevailing ARL paradigm. To address this issue, we propose Disentangled Action Reasoning Tuning (DART), a simple and efficient framework that explicitly decouples parameter updates for reasoning and tool-use via separate low-rank adaptation modules. Experimental results show that DART consistently outperforms baseline methods with an average improvement of 6.35 percent, and achieves performance comparable to multi-agent systems that explicitly separate tool-use and reasoning, while using a single model.
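下面是"经由独立低秩适配模块解耦参数更新"这一思想的最小 PyTorch 示意:冻结主干权重,为推理与工具调用分别维护一组 LoRA 参数并按段落类型路由。维度与路由标志均为假设,并非 DART 的真实实现。

```python
import torch
import torch.nn as nn

class DisentangledLoRALinear(nn.Module):
    def __init__(self, d_in, d_out, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)  # 冻结主干权重
        # 两组互不共享的低秩适配参数
        self.A_reason = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B_reason = nn.Parameter(torch.zeros(d_out, rank))
        self.A_tool = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B_tool = nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x, is_tool_use):
        # x: (batch, seq, d_in); is_tool_use: (batch,) 段落级路由标志(假设)
        delta_r = x @ self.A_reason.T @ self.B_reason.T
        delta_t = x @ self.A_tool.T @ self.B_tool.T
        gate = is_tool_use.float().view(-1, 1, 1)
        return self.base(x) + (1 - gate) * delta_r + gate * delta_t

layer = DisentangledLoRALinear(64, 64)
x = torch.randn(2, 5, 64)
flags = torch.tensor([False, True])  # 样本0: 推理段, 样本1: 工具调用段
print(layer(x, flags).shape)         # torch.Size([2, 5, 64])
```

由于两组 LoRA 参数不共享,推理与工具使用各自的梯度不会在同一参数上相互抵消,这正是缓解梯度冲突的机制所在。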
zh
[AI-212] HERMES: A Holistic End-to-End Risk-Aware Multimodal Embodied System with Vision-Language Models for Long-Tail Autonomous Driving
【速读】:该论文旨在解决端到端自动驾驶模型在长尾混合交通场景下安全性和准确性不足的问题,尤其是在复杂不确定条件下与多样化道路使用者(如人类驾驶车辆和弱势道路使用者)交互时的风险感知与决策难题。解决方案的关键在于提出HERMES框架,其核心创新是通过基础模型辅助的标注流程生成结构化的长尾场景上下文(Long-Tail Scene Context)和长尾规划上下文(Long-Tail Planning Context),显式注入以危险为中心的风险线索、操作意图及安全偏好,并利用这些信号指导轨迹规划;同时引入三模态驾驶模块(Tri-Modal Driving Module),融合多视角感知、历史运动信息与语义引导,从而实现风险感知驱动的精准轨迹规划,在真实世界长尾数据集上显著优于主流端到端及视觉-语言模型(VLM)驱动基线方法。
链接: https://arxiv.org/abs/2602.00993
作者: Weizhe Tang,Junwei You,Jiaxi Liu,Zhaoyi Wang,Rui Gan,Zilin Huang,Feng Wei,Bin Ran
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:End-to-end autonomous driving models increasingly benefit from large vision–language models for semantic understanding, yet ensuring safe and accurate operation under long-tail conditions remains challenging. These challenges are particularly prominent in long-tail mixed-traffic scenarios, where autonomous vehicles must interact with heterogeneous road users, including human-driven vehicles and vulnerable road users, under complex and uncertain conditions. This paper proposes HERMES, a holistic risk-aware end-to-end multimodal driving framework designed to inject explicit long-tail risk cues into trajectory planning. HERMES employs a foundation-model-assisted annotation pipeline to produce structured Long-Tail Scene Context and Long-Tail Planning Context, capturing hazard-centric cues together with maneuver intent and safety preference, and uses these signals to guide end-to-end planning. HERMES further introduces a Tri-Modal Driving Module that fuses multi-view perception, historical motion cues, and semantic guidance, ensuring risk-aware accurate trajectory planning under long-tail scenarios. Experiments on the real-world long-tail dataset demonstrate that HERMES consistently outperforms representative end-to-end and VLM-driven baselines under long-tail mixed-traffic scenarios. Ablation studies verify the complementary contributions of key components.
zh
[AI-213] Multimodal Scientific Learning Beyond Diffusions and Flows
【速读】:该论文旨在解决科学机器学习(Scientific Machine Learning, SciML)中多模态条件不确定性建模的问题,尤其针对病态逆问题、多稳态和混沌动力学等场景下,传统隐式生成模型(如扩散模型和基于流的方法)在数据效率低、计算成本高且与科学问题中结构化解空间不匹配的局限性。解决方案的关键在于提出使用混合密度网络(Mixture Density Networks, MDNs)作为显式参数化概率密度估计器,其通过引入面向低维、多模态物理系统的归纳偏置,实现对不同解分支的概率质量全局分配,从而在数据稀缺条件下仍能可靠恢复分离的模式,并在泛化能力、可解释性和采样效率方面优于现有方法。
链接: https://arxiv.org/abs/2602.00960
作者: Leonardo Ferreira Guilhoto,Akshat Kaushal,Paris Perdikaris
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation (stat.CO); Machine Learning (stat.ML)
备注:
Abstract:Scientific machine learning (SciML) increasingly requires models that capture multimodal conditional uncertainty arising from ill-posed inverse problems, multistability, and chaotic dynamics. While recent work has favored highly expressive implicit generative models such as diffusion and flow-based methods, these approaches are often data-hungry, computationally costly, and misaligned with the structured solution spaces frequently found in scientific problems. We demonstrate that Mixture Density Networks (MDNs) provide a principled yet largely overlooked alternative for multimodal uncertainty quantification in SciML. As explicit parametric density estimators, MDNs impose an inductive bias tailored to low-dimensional, multimodal physics, enabling direct global allocation of probability mass across distinct solution branches. This structure delivers strong data efficiency, allowing reliable recovery of separated modes in regimes where scientific data is scarce. We formalize these insights through a unified probabilistic framework contrasting explicit and implicit distribution networks, and demonstrate empirically that MDNs achieve superior generalization, interpretability, and sample efficiency across a range of inverse, multistable, and chaotic scientific regression tasks.
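混合密度网络的结构本身很简洁,下面给出一个一维双峰回归的最小 PyTorch 示意(网络宽度、分量数等均为示意性假设):网络输出 K 个高斯分量的混合权重、均值与尺度,以混合负对数似然训练,能同时覆盖 y = ±√x 两个解分支。

```python
import torch
import torch.nn as nn

class MDNHead(nn.Module):
    def __init__(self, d_in, n_components=5, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_in, hidden), nn.Tanh())
        self.logits = nn.Linear(hidden, n_components)     # 混合权重
        self.mu = nn.Linear(hidden, n_components)         # 分量均值
        self.log_sigma = nn.Linear(hidden, n_components)  # 分量对数尺度

    def loss(self, x, y):
        h = self.body(x)
        log_pi = torch.log_softmax(self.logits(h), dim=-1)
        comp = torch.distributions.Normal(self.mu(h), self.log_sigma(h).exp())
        # 混合负对数似然: -logsumexp_k [log pi_k + log N_k(y)]
        log_prob = comp.log_prob(y.unsqueeze(-1)) + log_pi
        return -torch.logsumexp(log_prob, dim=-1).mean()

# 玩具病态逆问题: y = ±sqrt(x) + 噪声, 存在两个有效解分支
x = torch.rand(512, 1)
sign = torch.where(torch.rand(512) < 0.5, 1.0, -1.0)
y = torch.sqrt(x.squeeze(-1)) * sign + 0.05 * torch.randn(512)

model = MDNHead(1)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    l = model.loss(x, y)
    l.backward()
    opt.step()
print("final NLL:", float(l))
```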
zh
[AI-214] Small-Margin Preferences Still Matter-If You Train Them Right
【速读】:该论文旨在解决偏好优化方法(如DPO)在训练过程中对偏好数据难度敏感的问题,尤其是低边际(即模糊)偏好对的处理方式:传统做法常将其视为噪声并过滤掉,但这种处理可能丢失有用监督信号。研究表明,困难样本在偏好损失下会导致训练不稳定甚至损害对齐效果,但在监督微调(SFT)目标下仍具有价值。解决方案的关键在于提出一种难度感知的混合训练策略——MixDPO,其核心是将偏好数据按边际定义的难度从易到难排序(形成难度课程),并将困难样本路由至SFT目标,而简单样本则继续使用偏好损失进行优化。这一设计实现了对模糊样本的有效利用,避免了偏好损失在低边际数据上的优化失败问题,在多个LLM评判基准上显著优于DPO及其变体。
链接: https://arxiv.org/abs/2602.00954
作者: Jinlong Pang,Zhaowei Zhu,Na Di,Yichi Zhang,Yaxuan Wang,Chen Qian,Yang Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Preference optimization methods such as DPO align large language models (LLMs) using paired comparisons, but their effectiveness can be highly sensitive to the quality and difficulty of preference pairs. A common heuristic treats small-margin (ambiguous) pairs as noisy and filters them out. In this paper, we revisit this assumption and show that pair difficulty interacts strongly with the optimization objective: when trained with preference-based losses, difficult pairs can destabilize training and harm alignment, yet these same pairs still contain useful supervision signals when optimized with supervised fine-tuning (SFT). Motivated by this observation, we propose MixDPO, a simple yet effective difficulty-aware training strategy that (i) orders preference data from easy to hard (a curriculum over margin-defined difficulty), and (ii) routes difficult pairs to an SFT objective while applying a preference loss to easy pairs. This hybrid design provides a practical mechanism to leverage ambiguous pairs without incurring the optimization failures often associated with preference losses on low-margin data. Across three LLM-judge benchmarks, MixDPO consistently improves alignment over DPO and a range of widely-used variants, with particularly strong gains on AlpacaEval 2 length-controlled (LC) win rate.
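按边际难度路由损失的核心逻辑可以浓缩为几行代码。以下示意中,高边际("容易")样本对使用标准 DPO 损失,低边际("困难")样本对退化为对偏好回答的 SFT 损失;阈值 tau 与 beta 为示意性假设,并非论文设定。

```python
import torch
import torch.nn.functional as F

def mixdpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, margin, tau=0.1, beta=0.1):
    # logp_*: 策略模型下偏好(w)/被拒(l)回答的对数概率和; ref_*: 冻结参考模型下的同类量
    easy = margin > tau  # 按预先计算的边际划分难度(阈值为假设)
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    dpo = -F.logsigmoid(logits)  # 高边际样本对: 标准 DPO 损失
    sft = -logp_w                # 低边际样本对: 仅最大化偏好回答的似然
    return torch.where(easy, dpo, sft).mean()

# 以随机张量代替逐样本对的对数概率与边际
lp_w, lp_l = torch.randn(8), torch.randn(8)
ref_w, ref_l = torch.randn(8), torch.randn(8)
margin = torch.rand(8)
print(mixdpo_loss(lp_w, lp_l, ref_w, ref_l, margin))
```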
zh
[AI-215] R-HTN: Rebellious Online HTN Planning for Safety and Game AI
【速读】:该论文旨在解决智能代理在执行用户任务时可能违反内置行为指令(directives D)的问题,尤其是在需要遵守安全规则或个性特质约束的场景中。传统HTN(Hierarchical Task Network)规划方法往往缺乏对这类约束的动态响应能力,导致代理可能盲目执行任务而忽视关键限制。解决方案的关键在于提出R-HTN(Rebellious-HTN)算法,它结合在线HTN规划与指令约束处理机制,使代理能够在发现自身将违反 D 时,主动调整行为:非自适应代理直接终止执行,而自适应代理则重构HTN计划以寻找满足指令前提下的替代目标路径。实验表明,R-HTN代理始终不违反指令,同时尽可能实现用户目标,即使方式不同于用户预期。
链接: https://arxiv.org/abs/2602.00951
作者: Hector Munoz-Avila,David W. Aha,Paola Rizzo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce online Hierarchical Task Network (HTN) agents whose behaviors are governed by a set of built-in directives D. Like other agents that are capable of rebellion (i.e., intelligent disobedience), our agents will, under some conditions, not perform a user-assigned task and instead act in ways that do not meet a user’s expectations. Our work combines three concepts: HTN planning, online planning, and the directives D, which must be considered when performing user-assigned tasks. We investigate two agent variants: (1) a Nonadaptive agent that stops execution if it finds itself in violation of D, and (2) an Adaptive agent that, in the same situation, instead modifies its HTN plan to search for alternative ways to achieve its given task. We present R-HTN (for: Rebellious-HTN), a general algorithm for online HTN planning under directives D. We evaluate R-HTN in two task domains where the agent must not violate some directives, for safety reasons or as dictated by their personality traits. We found that R-HTN agents never violate directives, and aim to achieve the user-given goals if feasible, though not necessarily in the way the user expected.
zh
[AI-216] MindGuard: Guardrail Classifiers for Multi-Turn Mental Health Support
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在心理健康支持场景中因缺乏临床适配性而导致的安全风险问题,即现有通用安全防护机制无法有效区分治疗性披露与真实临床危机,从而引发误判和安全隐患。其解决方案的关键在于构建一个由心理学博士级专家共同开发的临床基础风险分类体系(clinically grounded risk taxonomy),精准识别可操作性伤害(如自伤和伤人行为),同时保留安全的非危机治疗内容空间;并基于此设计了MindGuard系列轻量级安全分类器(4B和8B参数规模),利用受控双代理生成的合成对话进行训练,在高召回率下显著降低误报率,并在对抗性多轮交互中优于通用防护策略,实现更安全、更有效的心理辅助交互。
链接: https://arxiv.org/abs/2602.00950
作者: António Farinhas,Nuno M. Guerreiro,José Pombal,Pedro Henrique Martins,Laura Melton,Alex Conway,Cara Dochat,Maya D’Eon,Ricardo Rei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models are increasingly used for mental health support, yet their conversational coherence alone does not ensure clinical appropriateness. Existing general-purpose safeguards often fail to distinguish between therapeutic disclosures and genuine clinical crises, leading to safety failures. To address this gap, we introduce a clinically grounded risk taxonomy, developed in collaboration with PhD-level psychologists, that identifies actionable harm (e.g., self-harm and harm to others) while preserving space for safe, non-crisis therapeutic content. We release MindGuard-testset, a dataset of real-world multi-turn conversations annotated at the turn level by clinical experts. Using synthetic dialogues generated via a controlled two-agent setup, we train MindGuard, a family of lightweight safety classifiers (with 4B and 8B parameters). Our classifiers reduce false positives at high-recall operating points and, when paired with clinician language models, help achieve lower attack success and harmful engagement rates in adversarial multi-turn interactions compared to general-purpose safeguards. We release all models and human evaluation data.
zh
[AI-217] The Keyhole Effect: Why Chat Interfaces Fail at Data Analysis
【速读】:该论文旨在解决当前以聊天界面(chat interface)作为主要交互方式在多步骤、状态依赖型数据分析任务中导致的认知性能下降问题。作者基于Woods(1984)提出的“钥匙孔效应”(Keyhole Effect),指出聊天界面通过五个机制系统性地削弱分析能力:内容位移破坏海马体空间记忆、隐藏状态变量超出工作记忆容量(约4个组块)、强制语言表达引发言语遮蔽效应、线性文本流阻碍认知卸载与知识行动、序列化代价随数据维度增加而加剧。论文将认知过载形式化为 $ O = \max(0, m - v - W) $,其中 $ m $ 为任务相关项数,$ v $ 为可见项数,$ W $ 为工作记忆容量;当 $ O > 0 $ 时,错误概率上升并放大锚定效应、确认偏误和变化盲视等偏差。解决方案的关键在于提出八种混合设计模式(Hybrid Design Patterns),包括生成式用户界面(Generative UI)、无限画布(Infinite Canvas)、指称交互(Deictic Interaction)、状态轨道(State Rail)、幽灵层(Ghost Layers)、准备就绪布局(Mise en Place)、语义缩放(Semantic Zoom)和概率化用户界面(Probabilistic UI),这些模式精准针对特定认知瓶颈,同时保留自然语言用于意图表达与合成,从而构建更符合人类认知特性的分析环境。
链接: https://arxiv.org/abs/2602.00947
作者: Mohan Reddy
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Chat has become the default interface for AI-assisted data analysis. For multi-step, state-dependent analytical tasks, this is a mistake. Building on Woods (1984) Keyhole Effect, the cognitive cost of viewing large information spaces through narrow viewports, I show that chat interfaces systematically degrade analytical performance through five mechanisms: (1) constant content displacement defeats hippocampal spatial memory systems; (2) hidden state variables exceed working memory capacity (approximately 4 chunks under load); (3) forced verbalization triggers verbal overshadowing, degrading visual pattern recognition; (4) linear text streams block epistemic action and cognitive offloading; (5) serialization penalties scale with data dimensionality. I formalize cognitive overload as O = max(0, m - v - W) where m is task-relevant items, v is visible items, and W is working memory capacity. When O > 0, error probability increases and analytical biases (anchoring, confirmation, change blindness) amplify. Eight hybrid design patterns address these failures: Generative UI, Infinite Canvas, Deictic Interaction, State Rail, Ghost Layers, Mise en Place, Semantic Zoom, and Probabilistic UI. Each pattern targets specific cognitive bottlenecks while preserving natural language for intent specification and synthesis. Well-scaffolded conversational systems that encode expert priors may reduce load for guided tasks; the framework applies most strongly to open-ended exploration. The paper concludes with falsifiable hypotheses and experimental paradigms for empirical validation.
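文中的认知过载公式可直接转写为代码:当任务相关项数 m 超过可见项数 v 与工作记忆容量 W 之和时即产生过载。

```python
def cognitive_overload(m_task_items: int, v_visible: int, w_working_memory: int = 4) -> int:
    # O = max(0, m - v - W): 过载量
    return max(0, m_task_items - v_visible - w_working_memory)

# 12 个任务相关变量、聊天窗口仅可见 5 项: O = 3 > 0, 进入易错区间
print(cognitive_overload(m_task_items=12, v_visible=5))
```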
zh
[AI-218] MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在工具调用能力评估中存在的局限性问题,即现有评测方法往往依赖受限的工具集、简化的工作流或主观的“LLM作为裁判”指标,难以真实反映模型在复杂多步骤任务中的工具使用性能。其解决方案的关键在于提出MCP-Atlas——一个大规模基准测试平台,包含36个真实的Model Context Protocol (MCP) 服务器和220个工具,设计了1000个基于自然语言提示的任务,这些任务要求代理在不明确指定工具或服务器的情况下识别并协调跨多个服务器的3–6次工具调用。该基准采用基于事实主张的评分机制,对模型最终回答中满足的可验证主张给予部分分数,并结合内部诊断指标(如工具发现、参数化、语法正确性、错误恢复能力和效率)进行综合评估,从而实现更客观、全面且贴近现实场景的工具使用能力评测。
链接: https://arxiv.org/abs/2602.00933
作者: Chaithanya Bandi,Ben Hertzberg,Geobio Boo,Tejas Polakam,Jeff Da,Sami Hassaan,Manasi Sharma,Andrew Park,Ernesto Hernandez,Dan Rambado,Ivan Salazar,Rafael Cruz,Chetan Rane,Ben Levin,Brad Kenstler,Bing Liu
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:The Model Context Protocol (MCP) is rapidly becoming the standard interface for Large Language Models (LLMs) to discover and invoke external tools. However, existing evaluations often fail to capture the complexity of real-world scenarios, relying on restricted toolsets, simplistic workflows, or subjective LLM-as-a-judge metrics. We introduce MCP-Atlas, a large-scale benchmark for evaluating tool-use competency, comprising 36 real MCP servers and 220 tools. It includes 1,000 tasks designed to assess tool-use competency in realistic, multi-step workflows. Tasks use natural language prompts that avoid naming specific tools or servers, requiring agents to identify and orchestrate 3-6 tool calls across multiple servers. We score tasks using a claims-based rubric that awards partial credit based on the factual claims satisfied in the model’s final answer, complemented by internal diagnostics on tool discovery, parameterization, syntax, error recovery, and efficiency. Evaluation results on frontier models reveal that top models achieve pass rates exceeding 50%, with primary failures arising from inadequate tool usage and task understanding. We release the task schema, containerized harness, and a 500-task public subset of the benchmark dataset to facilitate reproducible comparisons and advance the development of robust, tool-augmented agents.
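基于事实主张的部分得分机制可以用几行代码示意:任务得分即最终回答满足的主张比例。这里用朴素的子串匹配代替基准实际使用的验证器,示例任务与主张均为虚构。

```python
def score_answer(answer: str, rubric_claims: list[str]) -> float:
    # 得分 = 回答满足的主张数 / 主张总数(以朴素子串匹配代替真实验证器)
    satisfied = sum(1 for claim in rubric_claims if claim.lower() in answer.lower())
    return satisfied / len(rubric_claims)

rubric = ["flight departs 2026-03-01", "total cost 420 USD", "hotel is refundable"]
answer = "Booked: flight departs 2026-03-01, total cost 420 USD, hotel non-refundable."
print(score_answer(answer, rubric))  # 满足 2/3 条主张 -> 0.666...
```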
zh
[AI-219] Continuous-Utility Direct Preference Optimization ICML2026
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)推理能力对二元偏好监督(binary preference supervision)的依赖问题,这种监督方式无法捕捉推理过程中的部分进展或细粒度质量差异。其解决方案的关键在于提出连续效用直接偏好优化(Continuous Utility Direct Preference Optimization, CU-DPO),通过将原本的二元标签替换为能反映推理质量的连续分数,使模型能够学习一组基于提示(prompt-based)的认知策略组合。该框架理论上证明了使用K个策略可在样本复杂度上相对二元偏好获得Θ(K log K)的改进,并收敛到熵正则化的效用最大化策略;同时设计了两阶段训练流程:策略选择阶段利用最佳对比(best-vs-all)优化策略选取能力,执行精炼阶段则通过边际分层配对(margin-stratified pairs)提升选定策略的执行准确性,从而在数学推理基准上显著提升策略选择准确率,并带来稳定的下游推理性能增益。
链接: https://arxiv.org/abs/2602.00931
作者: Muhammad Ahmed Mohsin,Muhammad Umer,Ahsan Bilal,Zihao He,Muhammad Usman Rafique,Asad Aali,Muhammad Ali Jamshed,John M. Cioffi,Emily Fox
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Submitted to ICML 2026
Abstract:Large language model reasoning is often treated as a monolithic capability, relying on binary preference supervision that fails to capture partial progress or fine-grained reasoning quality. We introduce Continuous Utility Direct Preference Optimization (CU-DPO), a framework that aligns models to a portfolio of prompt-based cognitive strategies by replacing binary labels with continuous scores that capture fine-grained reasoning quality. We prove that learning with K strategies yields a Θ(K log K) improvement in sample complexity over binary preferences, and that DPO converges to the entropy-regularized utility-maximizing policy. To exploit this signal, we propose a two-stage training pipeline: (i) strategy selection, which optimizes the model to choose the best strategy for a given problem via best-vs-all comparisons, and (ii) execution refinement, which trains the model to correctly execute the selected strategy using margin-stratified pairs. On mathematical reasoning benchmarks, CU-DPO improves strategy selection accuracy from 35-46 percent to 68-78 percent across seven base models, yielding consistent downstream reasoning gains of up to 6.6 points on in-distribution datasets with effective transfer to out-of-distribution tasks.
zh
[AI-220] Learning Abstractions for Hierarchical Planning in Program-Synthesis Agents
【速读】:该论文旨在解决当前大型语言模型(Large Language Model, LLM)代理和深度强化学习(Deep Reinforcement Learning, RL)系统在跨任务泛化能力上的局限性,即缺乏有效学习和利用抽象知识以实现高效规划的能力。其解决方案的关键在于提出TheoryCoder-2,一种新型基于理论的强化学习(Theory-Based Reinforcement Learning, TBRL)智能体,它通过利用LLM的上下文学习(in-context learning)能力,从经验中主动合成可复用的抽象概念,并将其整合进层次化规划过程中,从而摆脱对人工预设抽象的依赖,显著提升样本效率与复杂任务求解能力。
链接: https://arxiv.org/abs/2602.00929
作者: Zergham Ahmed,Kazuki Irie,Joshua B. Tenenbaum,Christopher J. Bates,Samuel J. Gershman
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 20 pages
Abstract:Humans learn abstractions and use them to plan efficiently to quickly generalize across tasks – an ability that remains challenging for state-of-the-art large language model (LLM) agents and deep reinforcement learning (RL) systems. Inspired by the cognitive science of how people form abstractions and intuitive theories of their world knowledge, Theory-Based RL (TBRL) systems, such as TheoryCoder, exhibit strong generalization through effective use of abstractions. However, they heavily rely on human-provided abstractions and sidestep the abstraction-learning problem. We introduce TheoryCoder-2, a new TBRL agent that leverages LLMs’ in-context learning ability to actively learn reusable abstractions rather than relying on hand-specified ones, by synthesizing abstractions from experience and integrating them into a hierarchical planning process. We conduct experiments on diverse environments, including BabyAI, Minihack and VGDL games like Sokoban. We find that TheoryCoder-2 is significantly more sample-efficient than baseline LLM agents augmented with classical planning domain construction, reasoning-based planning, and prior program-synthesis agents such as WorldCoder. TheoryCoder-2 is able to solve complex tasks on which the baselines fail, while requiring only minimal human prompts, unlike prior TBRL systems.
zh
[AI-221] Supervised sparse auto-encoders as unconstrained feature models for semantic composition
【速读】:该论文旨在解决稀疏自编码器(Sparse Auto-Encoders, SAEs)在机制可解释性研究中面临的两大挑战:一是L₁正则项导致的非光滑性,限制了重建质量与模型扩展性;二是学习到的特征与人类语义之间缺乏对齐。解决方案的关键在于引入无约束特征建模(unconstrained feature models)这一来自神经坍缩理论(neural collapse theory)的数学框架,并通过监督训练策略优化SAE结构——具体而言,采用仅解码器(decoder-only)架构,联合学习稀疏概念嵌入(sparse concept embeddings)与解码器权重,从而实现特征级别的语义控制与组合泛化能力。实验验证表明,该方法在Stable Diffusion 3.5上能够重建训练中未见的概念组合图像,并支持无需修改提示词的语义图像编辑。
链接: https://arxiv.org/abs/2602.00924
作者: Ouns El Harzli,Hugo Wallner,Yoonsoo Nam,Haixuan Xavier Tao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Sparse auto-encoders (SAEs) have re-emerged as a prominent method for mechanistic interpretability, yet they face two significant challenges: the non-smoothness of the L_1 penalty, which hinders reconstruction and scalability, and a lack of alignment between learned features and human semantics. In this paper, we address these limitations by adapting unconstrained feature models-a mathematical framework from neural collapse theory-and by supervising the task. We supervise (decoder-only) SAEs to reconstruct feature vectors by jointly learning sparse concept embeddings and decoder weights. Validated on Stable Diffusion 3.5, our approach demonstrates compositional generalization, successfully reconstructing images with concept combinations unseen during training, and enabling feature-level intervention for semantic image editing without prompt modification.
zh
[AI-222] Synapse Compendium Aware Federated Knowledge Exchange for Tool Routed LLMs
【速读】:该论文旨在解决基于大语言模型(Large Language Models, LLMs)的多智能体在联邦学习(Federated Learning, FL)场景下协同学习时面临的三大挑战:通信开销高、数据异构性强以及工具使用行为不一致,从而限制了整体性能。其核心解决方案是提出Synapse框架,关键在于通过训练一个共享的全局工具使用知识模型(global knowledge model of tool-usage behavior),使客户端智能体在固定LLM的基础上本地学习工具使用模式,并将提炼后的工具使用特征(artifacts)经由协调器进行联邦聚合;同时,利用模板化表示(templated representations)、嵌入检索结合LLM重排序(embedding retrieval with LLM reranking)和自适应掩码(adaptive masking)策略,在保障信息效用的同时有效控制隐私泄露,最终实现稳定且高效的工具选择收敛。
链接: https://arxiv.org/abs/2602.00911
作者: Abhijit Chakraborty,Sandipan De,Yash Shah,Chahana Dahal,Vivek Gupta
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Collaborative learning among LLM-based agents under federated learning faces challenges, including communication costs and heterogeneity in both data and tool usage, which limit its effectiveness. We introduce Synapse, a framework that trains a shared global knowledge model of tool-usage behavior. Client agents with fixed LLMs learn tool-usage patterns locally, and transmit artifacts for federated aggregation through coordinators. A global tool compendium is updated and redistributed, enabling convergence toward stable tool selection. Synapse uses templated representations, embedding retrieval with LLM reranking, and adaptive masking to maintain utility while limiting information leakage. The framework supports heterogeneous data and quantifies performance improvements. Results show that Synapse improves tool-usage effectiveness and reduces communication overhead compared with weight or prompt-sharing approaches in multi-agent LLM systems.
zh
[AI-223] GAPNet: Plug-in Jointly Learning Task-Specific Graph for Dynamic Stock Relation
【速读】:该论文旨在解决现有金融预测模型中因预定义图结构与实际股票关联信号不匹配而导致的泛化能力差和任务对齐度低的问题。当前方法依赖于先验构建的图结构来捕捉股票间关系,但网络信号具有高噪声、异步性和获取困难等特点,使得预设图难以适应下游任务需求。其解决方案的关键在于提出GAPNet(Graph Adaptation Plug-in Network),通过端到端联合学习任务特定的拓扑结构与表示,利用两个互补组件实现动态图结构调整:空间感知层(Spatial Perception Layer)捕捉资产间的短期协同变动,时间感知层(Temporal Perception Layer)在分布偏移下维持长期依赖关系,从而显著提升模型在真实股票数据上的盈利能力和稳定性。
链接: https://arxiv.org/abs/2602.00888
作者: Yingjie Niu,Lanxin Lu,Changhong Jin,Ruihai Dong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The advent of the web has led to a paradigm shift in financial relations, with the real-time dissemination of news, social discourse, and financial filings contributing significantly to the reshaping of financial forecasting. Existing methods rely on establishing relations a priori, i.e., predefining graphs to capture inter-stock relationships. However, stock-related web signals are characterised by high levels of noise and asynchrony, and are challenging to obtain, resulting in poor generalisability and misalignment between the predefined graphs and the downstream tasks. To address this, we propose GAPNet, a Graph Adaptation Plug-in Network that jointly learns task-specific topology and representations in an end-to-end manner. GAPNet attaches to existing pairwise graph or hypergraph backbone models, enabling the dynamic adaptation and rewiring of edge topologies via two complementary components: a Spatial Perception Layer that captures short-term co-movements across assets, and a Temporal Perception Layer that maintains long-term dependency under distribution shift. Across two real-world stock datasets, GAPNet consistently enhances profitability and stability in comparison to state-of-the-art models, yielding annualised cumulative returns of up to 0.47 for RT-GCN and 0.63 for CI-STHPAN, with peak Sharpe Ratios of 2.20 and 2.12, respectively. The plug-and-play design of GAPNet ensures its broad applicability to diverse GNN-based architectures. Our results underscore that jointly learning graph structures and representations is essential for task-specific relational modeling.
zh
[AI-224] Improving Flow Matching by Aligning Flow Divergence ICML2025
【速读】:该论文旨在解决条件流匹配(Conditional Flow Matching, CFM)在训练基于流的生成模型时,无法确保概率路径学习准确性的局限性。其关键解决方案是引入了一个新的偏微分方程(Partial Differential Equation, PDE)来刻画学习到的概率路径与真实概率路径之间的误差,并给出了该误差的解析解;进一步证明了两者之间的总变差差距(Total Variation Gap)由CFM损失和一个相关的散度损失共同上界控制。基于此理论洞察,作者设计了一种新目标函数,同时优化流场及其散度,从而在不降低生成效率的前提下显著提升生成模型性能。
链接: https://arxiv.org/abs/2602.00869
作者: Yuhao Huang,Taos Transue,Shih-Hsin Wang,William Feldman,Hong Zhang,Bao Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
备注: Published in ICML 2025
Abstract:Conditional flow matching (CFM) stands out as an efficient, simulation-free approach for training flow-based generative models, achieving remarkable performance for data generation. However, CFM is insufficient to ensure accuracy in learning probability paths. In this paper, we introduce a new partial differential equation characterization for the error between the learned and exact probability paths, along with its solution. We show that the total variation gap between the two probability paths is bounded above by a combination of the CFM loss and an associated divergence loss. This theoretical insight leads to the design of a new objective function that simultaneously matches the flow and its divergence. Our new approach improves the performance of the flow-based generative model by a noticeable margin without sacrificing generation efficiency. We showcase the advantages of this enhanced training approach over CFM on several important benchmark tasks, including generative modeling for dynamical systems, DNA sequences, and videos. Code is available at this https URL (Utah-Math-Data-Science).
zh
[AI-225] Towards Multiscale Graph-based Protein Learning with Geometric Secondary Structural Motifs NEURIPS2025
【速读】:该论文旨在解决现有基于图神经网络(Graph Neural Networks, GNNs)的方法在学习蛋白质多尺度表示和高效建模长程依赖关系方面的局限性。其解决方案的关键在于提出了一种高效的多尺度图学习框架,包含两个核心组件:一是构建分层图表示,由多个细粒度子图(对应二级结构基序,如α-螺旋、β-链和环区)与一个粗粒度图(基于这些基序的空间排列和相对取向连接而成)组成;二是采用两个GNN分别处理局部特征(在单个二级结构基序内捕捉相互作用)和高层结构关系(跨基序建模),从而实现灵活且表达能力强的多尺度信息整合。该设计在理论上保证了最大表达能力,实验上显著提升了预测精度并降低了计算成本。
链接: https://arxiv.org/abs/2602.00862
作者: Shih-Hsin Wang,Yuhao Huang,Taos Transue,Justin Baker,Jonathan Forstater,Thomas Strohmer,Bao Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
备注: Published in NeurIPS 2025
Abstract:Graph neural networks (GNNs) have emerged as powerful tools for learning protein structures by capturing spatial relationships at the residue level. However, existing GNN-based methods often face challenges in learning multiscale representations and modeling long-range dependencies efficiently. In this work, we propose an efficient multiscale graph-based learning framework tailored to proteins. Our proposed framework contains two crucial components: (1) It constructs a hierarchical graph representation comprising a collection of fine-grained subgraphs, each corresponding to a secondary structure motif (e.g., α-helices, β-strands, loops), and a single coarse-grained graph that connects these motifs based on their spatial arrangement and relative orientation. (2) It employs two GNNs for feature learning: the first operates within individual secondary motifs to capture local interactions, and the second models higher-level structural relationships across motifs. Our modular framework allows a flexible choice of GNN in each stage. Theoretically, we show that our hierarchical framework preserves the desired maximal expressiveness, ensuring no loss of critical structural information. Empirically, we demonstrate that integrating baseline GNNs into our multiscale framework remarkably improves prediction accuracy and reduces computational cost across various benchmarks.
zh
[AI-226] Position: Human-Centric AI Requires a Minimum Viable Level of Human Understanding
【速读】:该论文试图解决的问题是:随着人工智能(AI)系统生成越来越流畅且正确的端到端结果,用户对AI输出的解释、验证和干预能力逐渐减弱,导致“能力-理解鸿沟”(Capability-Comprehension Gap)——即人类辅助性能提升的同时,其内部认知模型却持续退化。为应对这一问题,论文提出“认知完整性阈值”(Cognitive Integrity Threshold, CIT)作为核心解决方案,其关键在于识别并维持人类在高度AI依赖情境下仍能保持监督权、自主性和问责参与的最低认知理解水平。CIT不需完全重建推理过程,也不限制自动化程度,而是通过三个功能维度——验证能力(verification capacity)、保持理解的交互机制(comprehension-preserving interaction)以及治理制度支撑(institutional scaffolds for governance)——确保人类在责任关键场景中实现认知可持续性的人机协作设计与治理框架。
链接: https://arxiv.org/abs/2602.00854
作者: Fangzhou Lin,Qianwen Ge,Lingyu Xu,Peiran Li,Xiangbo Gao,Shuo Xing,Kazunori Yamada,Ziming Zhang,Haichong Zhang,Zhengzhong Tu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages, 1 figures
Abstract:AI systems increasingly produce fluent, correct, end-to-end outcomes. Over time, this erodes users’ ability to explain, verify, or intervene. We define this divergence as the Capability-Comprehension Gap: a decoupling where assisted performance improves while users’ internal models deteriorate. This paper argues that prevailing approaches to transparency, user control, literacy, and governance do not define the foundational understanding humans must retain for oversight under sustained AI delegation. To formalize this, we define the Cognitive Integrity Threshold (CIT) as the minimum comprehension required to preserve oversight, autonomy, and accountable participation under AI assistance. CIT does not require full reasoning reconstruction, nor does it constrain automation. It identifies the threshold beyond which oversight becomes procedural and contestability fails. We operationalize CIT through three functional dimensions: (i) verification capacity, (ii) comprehension-preserving interaction, and (iii) institutional scaffolds for governance. This motivates a design and governance agenda that aligns human-AI interaction with cognitive sustainability in responsibility-critical settings.
zh
[AI-227] Persuasion Propagation in LLM Agents
【速读】:该论文旨在解决长时程任务中用户说服(belief-level intervention)如何影响AI代理行为的问题,即“说服传播”(persuasion propagation)现象。其核心问题是:在代理执行任务过程中或之前施加的信念干预,是否以及如何改变其后续行为表现。解决方案的关键在于提出一个以行为为中心的评估框架,区分说服发生在任务执行期间(on-the-fly)与任务开始前(belief-prefilled)两种情形,并通过实证发现:仅在任务初始阶段明确设定信念状态时,代理的行为才会显著且一致地变化——例如,相比中立预填充代理,信念预填充代理平均减少26.9%的搜索次数和16.9%的独立来源访问量,这表明信念干预对行为具有强影响力,从而推动了对代理系统进行行为层面评估的重要性。
链接: https://arxiv.org/abs/2602.00851
作者: Hyejun Jeong,Amir Houmansadr,Shlomo Zilberstein,Eugene Bagdasarian
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: Code available at this https URL
Abstract:Modern AI agents increasingly combine conversational interaction with autonomous task execution, such as coding and web research, raising a natural question: what happens when an agent engaged in long-horizon tasks is subjected to user persuasion? We study how belief-level intervention can influence downstream task behavior, a phenomenon we name persuasion propagation. We introduce a behavior-centered evaluation framework that distinguishes between persuasion applied during or prior to task execution. Across web research and coding tasks, we find that on-the-fly persuasion induces weak and inconsistent behavioral effects. In contrast, when the belief state is explicitly specified at task time, belief-prefilled agents conduct on average 26.9% fewer searches and visit 16.9% fewer unique sources than neutral-prefilled agents. These results suggest that persuasion, even in prior interaction, can affect the agent’s behavior, motivating behavior-level evaluation in agentic systems.
zh
[AI-228] RMFlow: Refined Mean Flow by a Noise-Injection Step for Multimodal Generation ICLR2026
【速读】:该论文旨在解决单次前向传播(1-NFE)生成模型在图像、分子和时间序列等多模态任务中生成质量不足的问题。现有方法如MeanFlow虽具备高效性,但其单一路径的生成机制难以产生高质量结果。解决方案的关键在于提出RMFlow,其核心创新是将粗粒度的1-NFE MeanFlow传输与后续定制化的噪声注入精修步骤相结合;同时,通过引入一种新的损失函数训练神经网络,该损失函数在最小化概率路径间的Wasserstein距离与最大化样本似然之间取得平衡,从而有效提升生成质量,且仅需1-NFE即可逼近当前最优性能。
链接: https://arxiv.org/abs/2602.00849
作者: Yuhao Huang,Shih-Hsin Wang,Andrea L. Bertozzi,Bao Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
备注: Accepted to ICLR 2026
Abstract:Mean flow (MeanFlow) enables efficient, high-fidelity image generation, yet its single-function evaluation (1-NFE) generation often cannot yield compelling results. We address this issue by introducing RMFlow, an efficient multimodal generative model that integrates a coarse 1-NFE MeanFlow transport with a subsequent tailored noise-injection refinement step. RMFlow approximates the average velocity of the flow path using a neural network trained with a new loss function that balances minimizing the Wasserstein distance between probability paths and maximizing sample likelihood. RMFlow achieves near state-of-the-art results on text-to-image, context-to-molecule, and time-series generation using only 1-NFE, at a computational cost comparable to the baseline MeanFlows.
zh
[AI-229] Optimizing Agent ic Reasoning with Retrieval via Synthetic Semantic Information Gain Reward
【速读】:该论文旨在解决大推理模型(Large Reasoning Models, LRMs)在执行代理式推理(Agentic Reasoning)时,因缺乏密集且有原则的奖励信号而导致外部知识检索优化困难的问题。解决方案的关键在于提出 InfoReasoner 框架,其核心是通过合成语义信息增益奖励(synthetic semantic information gain reward)来激励有效的信息获取行为;该奖励被理论重新定义为模型信念状态不确定性的减少,并具备非负性、望远镜可加性和信道单调性等数学保证;实践中,进一步设计了一种基于输出感知的内在估计器(output-aware intrinsic estimator),利用双向文本蕴含(bidirectional textual entailment)进行语义聚类,直接从模型输出分布中计算信息增益,从而无需人工标注即可实现可扩展优化,最终通过 Group Relative Policy Optimization (GRPO) 实现高效训练,在七个问答基准上显著优于现有检索增强基线方法。
链接: https://arxiv.org/abs/2602.00845
作者: Senkang Hu,Yong Dai,Yuzhi Zhao,Yihang Tao,Yu Guo,Zhengru Fang,Sam Tak Wu Kwong,Yuguang Fang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Agentic reasoning enables large reasoning models (LRMs) to dynamically acquire external knowledge, yet optimizing the retrieval process remains challenging due to the lack of dense, principled reward signals. In this paper, we introduce InfoReasoner, a unified framework that incentivizes effective information seeking via a synthetic semantic information gain reward. Theoretically, we redefine information gain as uncertainty reduction over the model’s belief states, establishing guarantees including non-negativity, telescoping additivity, and channel monotonicity. Practically, to enable scalable optimization without manual retrieval annotations, we propose an output-aware intrinsic estimator that computes information gain directly from the model’s output distributions using semantic clustering via bidirectional textual entailment. This intrinsic reward guides the policy to maximize epistemic progress, enabling efficient training via Group Relative Policy Optimization (GRPO). Experiments across seven question-answering benchmarks demonstrate that InfoReasoner consistently outperforms strong retrieval-augmented baselines, achieving up to 5.4% average accuracy improvement. Our work provides a theoretically grounded and scalable path toward agentic reasoning with retrieval.
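内在信息增益奖励的骨架可示意如下:对采样答案做语义聚类(论文使用双向文本蕴含,这里以简单的字符串归一化代替),并以检索前后聚类分布的熵差作为奖励信号。样本内容为虚构示例。

```python
from collections import Counter
from math import log

def cluster_entropy(samples):
    # 以字符串归一化代替双向蕴含的语义聚类
    counts = Counter(s.strip().lower() for s in samples)
    n = len(samples)
    return -sum((c / n) * log(c / n) for c in counts.values())

before = ["Paris", "Lyon", "Paris", "Marseille"]  # 检索前: 信念不确定
after = ["Paris", "Paris", "Paris", "Lyon"]       # 检索后: 分布收敛
gain = cluster_entropy(before) - cluster_entropy(after)
print(f"semantic information gain ≈ {gain:.3f} nats")  # 为正: 检索降低了不确定性
```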
zh
[AI-230] Exploration of Unary Arithmetic-Based Matrix Multiply Units for Low Precision DL Accelerators
【速读】:该论文旨在解决深度学习(Deep Learning, DL)推理中低精度计算场景下,传统二进制通用矩阵乘法(Binary GEMM)硬件在能效方面的局限性问题。其核心挑战在于如何通过新型一元通用矩阵乘法(Unary GEMM)设计,在保持计算精度的同时显著提升能效表现,以适配未来边缘人工智能(Edge AI)加速器的需求。解决方案的关键在于对三种最新的一元GEMM设计(uGEMM、tuGEMM 和 tubGEMM)进行严格的后综合评估,涵盖不同位宽和矩阵规模下的性能-功耗权衡,并结合八种预训练卷积神经网络(CNNs)及LLaMA2大语言模型(Large Language Model, LLM)的权重稀疏性分析,从而识别出最优应用场景与能效甜点(sweet spots),验证了一元GEMM在低精度整数推理中的有效性与潜力。
链接: https://arxiv.org/abs/2602.00838
作者: Prabhu Vellaisamy,Harideep Nair,Di Wu,Shawn Blanton,John Paul Shen
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:
Abstract:General matrix multiplication (GEMM) is a fundamental operation in deep learning (DL). With DL moving increasingly toward low precision, recent works have proposed novel unary GEMM designs as an alternative to conventional binary GEMM hardware. A rigorous evaluation of recent unary and binary GEMM designs is needed to assess the potential of unary hardware for future DL compute. This paper focuses on unary GEMM designs for integer-based DL inference and performs a detailed evaluation of the three latest unary design proposals, namely, uGEMM, tuGEMM, and tubGEMM, by comparing them to a conventional binary GEMM. Rigorous post-synthesis evaluations beyond prior works are performed across varying bit-widths and matrix sizes to assess the designs’ tradeoffs and determine optimal sweet spots. Further, we perform weight sparsity analysis across eight pretrained convolutional neural networks (CNNs) and the LLaMA2 large language model (LLM). In this work, we demonstrate how unary GEMM can be effectively used for energy-efficient compute in future edge AI accelerators.
zh
[AI-231] Don't Forget Its Variance! The Minimum Path Variance Principle for Accurate and Stable Score-Based Density Ratio Estimation
【速读】:该论文旨在解决生成式 AI 中基于得分的方法(score-based methods)在密度比估计(density ratio estimation, DRE)任务中面临的一个关键悖论:尽管理论上路径无关(path-independent),但实际性能却高度依赖于所选路径调度(path schedule)。其解决方案的关键在于揭示了可训练目标与理想真实目标之间的差异源于一个被忽视的重要项——时间得分的路径方差(path variance of the time score)。作者提出最小路径方差(MinPV)原理,通过引入一种有原则的启发式方法来最小化该方差,并进一步推导出路径方差的闭式表达式,从而将原本难以处理的问题转化为可优化的目标。通过使用灵活的 Kumaraswamy 混合模型参数化路径,该方法能够自动学习低方差的数据自适应路径,显著提升了估计器的准确性和稳定性,在多个挑战性基准测试中达到了新的最先进水平。
链接: https://arxiv.org/abs/2602.00834
作者: Wei Chen,Jiacheng Li,Shigui Li,Zhiqi Lin,Junmei Yang,John Paisley,Delu Zeng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Score-based methods have emerged as a powerful framework for density ratio estimation (DRE), but they face an important paradox in that, while theoretically path-independent, their practical performance depends critically on the chosen path schedule. We resolve this issue by proving that tractable training objectives differ from the ideal, ground-truth objective by a crucial, overlooked term: the path variance of the time score. To address this, we propose the MinPV (Minimum Path Variance) Principle, which introduces a principled heuristic to minimize the overlooked path variance. Our key contribution is the derivation of a closed-form expression for the variance, turning an intractable problem into a tractable optimization. By parameterizing the path with a flexible Kumaraswamy Mixture Model, our method learns a data-adaptive, low-variance path without heuristic selection. This principled optimization of the complete objective yields more accurate and stable estimators, establishing new state-of-the-art results on challenging benchmarks.
zh
[AI-232] Resource-Efficient Reinforcement for Reasoning Large Language Models via Dynamic One-Shot Policy Refinement
【速读】:该论文旨在解决强化学习在可验证奖励(Reinforcement Learning under Verifiable Rewards, RLVR)框架下训练大语言模型(Large Language Models, LLMs)时面临的高数据与计算资源消耗问题。其核心挑战在于RLVR需要大量奖励信号和高昂的推理(rollout)成本,限制了其在实际场景中的应用。解决方案的关键在于提出动态单次策略精调(Dynamic One-Shot Policy Refinement, DoPR),该方法通过引入不确定性感知机制,基于奖励波动性和探索驱动的采样策略,在每个训练批次中仅选择一个最具信息量的样本进行策略更新,从而将推理开销降低近一个数量级,同时保持与传统方法相当的推理准确性,实现了高效且可扩展的LLM后训练方案。
链接: https://arxiv.org/abs/2602.00815
作者: Yunjian Zhang,Sudong Wang,Yang Li,Peiran Xu,Conghao Zhou,Xiaoyue Ma,Jianing Li,Yao Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have exhibited remarkable performance on complex reasoning tasks, with reinforcement learning under verifiable rewards (RLVR) emerging as a principled framework for aligning model behavior with reasoning chains. Despite its promise, RLVR remains prohibitively resource-intensive, requiring extensive reward signals and incurring substantial rollout costs during training. In this work, we revisit the fundamental question of data and compute efficiency in RLVR. We first establish a theoretical lower bound on the sample complexity required to unlock reasoning capabilities, and empirically validate that strong performance can be achieved with a surprisingly small number of training instances. To tackle the computational burden, we propose Dynamic One-Shot Policy Refinement (DoPR), an uncertainty-aware RL strategy that dynamically selects a single informative training sample per batch for policy updates, guided by reward volatility and exploration-driven acquisition. DoPR reduces rollout overhead by nearly an order of magnitude while preserving competitive reasoning accuracy, offering a scalable and resource-efficient solution for LLM post-training. This approach offers a practical path toward more efficient and accessible RL-based training for reasoning-intensive LLM applications.
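"每批次仅选一个最有信息量样本"的选择逻辑可作如下简化示意:对每个候选问题做多次 rollout,选取可验证奖励波动(方差)最大的问题用于本次策略更新。该准则是对论文思想的简化假设,并非其精确的不确定性度量。

```python
import numpy as np

def select_one_shot(reward_matrix):
    # reward_matrix: (问题数, rollout数) 的 0/1 可验证奖励
    volatility = reward_matrix.var(axis=1)  # 奖励波动性
    return int(volatility.argmax())         # 选最不确定(最有信息量)的问题

rng = np.random.default_rng(0)
rewards = rng.binomial(1, [[0.05], [0.5], [0.95]], size=(3, 8)).astype(float)
print("selected prompt:", select_one_shot(rewards))  # 通常选中成功率约 0.5 的问题
```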
zh
[AI-233] MissMAC-Bench: Building Solid Benchmark for Missing Modality Issue in Robust Multimodal Affective Computing
【速读】:该论文旨在解决多模态情感计算(Multimodal Affective Computing, MAC)中因模态缺失导致的模型鲁棒性下降问题,即在真实场景下由于模态数据动态且不确定的缺失,引发分布偏移和语义不足,从而造成性能波动。其解决方案的关键在于提出一个名为MissMAC-Bench的综合性基准测试平台,通过两个核心原则实现:一是训练阶段不依赖模态缺失先验(no missing prior during training),二是使用单一模型同时处理完整与不完整模态输入(one single model capable of handling both complete and incomplete modality scenarios),以此保障模型泛化能力;此外,该基准还融合了固定与随机缺失模式的评估协议,覆盖数据集与实例层面,有效连接学术研究与实际应用需求。
链接: https://arxiv.org/abs/2602.00811
作者: Ronghao Lin,Honghao Lu,Ruixing Wu,Aolin Xiong,Qinggong Chu,Qiaolin He,Sijie Mai,Haifeng Hu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:As a knowledge discovery task over heterogeneous data sources, current Multimodal Affective Computing (MAC) relies heavily on the completeness of multiple modalities to accurately understand humans' affective states. However, in real-world scenarios, the availability of modality data is often dynamic and uncertain, leading to substantial performance fluctuations due to the distribution shifts and semantic deficiencies of incomplete multimodal inputs. Known as the missing modality issue, this challenge poses a critical barrier to the robustness and practical deployment of MAC models. To systematically quantify this issue, we introduce MissMAC-Bench, a comprehensive benchmark designed to establish fair and unified evaluation standards from the perspective of cross-modal synergy. Two guiding principles are proposed: no missing prior during training, and one single model capable of handling both complete and incomplete modality scenarios, thereby ensuring better generalization. Moreover, to bridge the gap between academic research and real-world applications, our benchmark integrates evaluation protocols with both fixed and random missing patterns at the dataset and instance levels. Extensive experiments conducted on 3 widely-used language models across 4 datasets validate the effectiveness of diverse MAC approaches in tackling the missing modality issue. Our benchmark provides a solid foundation for advancing robust multimodal affective computing and promotes the development of multimedia data mining.
zh
[AI-234] JTok: On Token Embedding as another Axis of Scaling Law via Joint Token Self-modulation
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在扩展过程中面临的计算成本与模型容量耦合问题,即传统密集型模型(dense models)的性能提升需伴随近线性的计算开销,而混合专家模型(Mixture of Experts, MoE)虽能解耦容量与计算量,却引入了显著的内存开销和硬件效率瓶颈。解决方案的关键在于提出一种全新的、正交的扩展维度——基于token索引的参数(token-indexed parameters),通过引入联合token调制向量(Joint-Token, JTok)及其混合形式(JTok-M),利用辅助嵌入表检索调制向量,并以轻量级逐元素运算对Transformer主干进行动态调制,从而在几乎不增加浮点运算次数(FLOPs)的前提下显著提升模型表达能力。实验表明,该方法在不同规模模型(650M至61B参数)上均能降低验证损失并大幅提升下游任务性能,且在等FLOPs条件下实现更优的质量-计算帕累托前沿,验证了其高效性与可预测的幂律扩展特性。
链接: https://arxiv.org/abs/2602.00800
作者: Yebin Yang,Huaijin Wu,Fu Guo,Lin Yao,Xiaohan Qin,Jingzhi Wang,Debing Zhang,Junchi Yan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:LLMs have traditionally scaled along dense dimensions, where performance is coupled with near-linear increases in computational cost. While MoE decouples capacity from compute, it introduces large memory overhead and hardware efficiency challenges. To overcome these, we propose token-indexed parameters as a novel, orthogonal scaling axis that decouple model capacity from FLOPs. Specifically, we introduce Joint-Token (JTok) and Mixture of Joint-Token (JTok-M), which augment Transformer layers with modulation vectors retrieved from auxiliary embedding tables. These vectors modulate the backbone via lightweight, element-wise operations, incurring negligible FLOPs overhead. Extensive experiments on both dense and MoE backbones, spanning from 650M (190M + 460M embedding) to 61B (17B + 44B embedding) total parameters, demonstrate that our approach consistently reduces validation loss and significantly improves downstream task performance (e.g., +4.1 on MMLU, +8.3 on ARC, +8.9 on CEval). Rigorous isoFLOPs analysis further confirms that JTok-M fundamentally shifts the quality-compute Pareto frontier, achieving comparable model quality with 35% less compute relative to vanilla MoE architectures, and we validate that token-indexed parameters exhibit a predictable power-law scaling behavior. Moreover, our efficient implementation ensures that the overhead introduced by JTok and JTok-M remains marginal.
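token 索引参数的调制机制本质上是一次辅助嵌入查表加一次逐元素运算,下面给出最小 PyTorch 示意;其中 (1 + delta) 的调制形式与各维度设置均为假设,并非论文的精确公式。

```python
import torch
import torch.nn as nn

class JTokModulation(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        # 辅助嵌入表: 每个 token 一条调制向量; 零初始化使调制从恒等映射开始
        self.mod_table = nn.Embedding(vocab_size, d_model)
        nn.init.zeros_(self.mod_table.weight)

    def forward(self, hidden, token_ids):
        # hidden: (batch, seq, d_model); token_ids: (batch, seq)
        return hidden * (1.0 + self.mod_table(token_ids))  # 逐元素调制, FLOPs 开销极小

mod = JTokModulation(vocab_size=32_000, d_model=256)
h = torch.randn(2, 10, 256)
ids = torch.randint(0, 32_000, (2, 10))
print(mod(h, ids).shape)  # torch.Size([2, 10, 256])
```

嵌入查表只涉及内存读取而几乎不增加乘加运算,这解释了为什么这一扩展维度能将容量与 FLOPs 解耦。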
zh
[AI-235] Latent Shadows: The Gaussian-Discrete Duality in Masked Diffusion
【速读】:该论文旨在解决掩码离散扩散模型(Masked Discrete Diffusion)在推理阶段效率低下的问题,其核心瓶颈在于缺乏确定性采样工具。以往方法通常依赖随机蒸馏(stochastic distillation),导致推理速度受限。论文的关键突破在于建立了显式的掩码扩散对偶性(Masked Diffusion Duality),证明掩码过程可视为通过一种新颖的最大值索引保持机制(maximum-value index preservation mechanism)从连续高斯过程中投影而来;并据此提出掩码一致性蒸馏(Masked Consistency Distillation, MCD)框架,利用该对偶性解析构造出一致性蒸馏所需的确定性耦合轨迹,从而避免了数值常微分方程(ODE)求解器的使用。这一方案在不牺牲生成质量的前提下实现了16倍的推理加速,为离散生成任务中一致性蒸馏的应用提供了理论基础与高效实现路径。
链接: https://arxiv.org/abs/2602.00792
作者: Guinan Chen,Xunpeng Huang,Ying Sun,Shijin Wang,Yanyong Zhang,Chao Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages
Abstract:Masked discrete diffusion is a dominant paradigm for high-quality language modeling where tokens are iteratively corrupted to a mask state, yet its inference efficiency is bottlenecked by the lack of deterministic sampling tools. While diffusion duality enables deterministic distillation for uniform models, these approaches generally underperform masked models and rely on complex integral operators. Conversely, in the masked domain, prior methods typically assume the absence of deterministic trajectories, forcing a reliance on stochastic distillation. To bridge this gap, we establish explicit Masked Diffusion Duality, proving that the masked process arises as the projection of a continuous Gaussian process via a novel maximum-value index preservation mechanism. Furthermore, we introduce Masked Consistency Distillation (MCD), a principled framework that leverages this duality to analytically construct the deterministic coupled trajectories required for consistency distillation, bypassing numerical ODE solvers. This result strictly improves upon prior stochastic distillation methods, achieving a 16× inference speedup without compromising generation quality. Our findings not only provide a solid theoretical foundation connecting masked and continuous diffusion, but also unlock the full potential of consistency distillation for high-performance discrete generation. Our code is available at this https URL.
zh
[AI-236] Multi-Objective Multi-Fidelity Bayesian Optimization with Causal Priors
【速读】:该论文旨在解决多保真度贝叶斯优化(Multi-fidelity Bayesian Optimization, MFBO)中因低保真度代理模型与目标保真度不匹配而导致的优化性能下降问题。现有方法主要依赖输入、保真度和目标之间的关联性建模,缺乏对因果机制的刻画,在保真度间对齐不佳时表现欠佳。解决方案的关键在于引入RESCUE(REducing Sampling cost with Causal Understanding and Estimation),其核心创新是通过学习结构化因果模型(Structural Causal Model, SCM)来显式捕捉输入、保真度与目标之间的因果关系,并基于此构建能编码干预效应的概率多保真度(Multi-fidelity, MF)代理模型;进一步设计了一种因果超体积知识梯度(causal hypervolume knowledge-gradient)采集策略,以平衡多目标改进期望与采样成本,从而提升样本效率。
链接: https://arxiv.org/abs/2602.00788
作者: Md Abir Hossen,Mohammad Ali Javidian,Vignesh Narayanan,Jason M. O’Kane,Pooyan Jamshidi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-fidelity Bayesian optimization (MFBO) accelerates the search for the global optimum of black-box functions by integrating inexpensive, low-fidelity approximations. The central task of an MFBO policy is to balance the cost-efficiency of low-fidelity proxies against their reduced accuracy to ensure effective progression toward the high-fidelity optimum. Existing MFBO methods primarily capture associational dependencies between inputs, fidelities, and objectives, rather than causal mechanisms, and can perform poorly when lower-fidelity proxies are poorly aligned with the target fidelity. We propose RESCUE (REducing Sampling cost with Causal Understanding and Estimation), a multi-objective MFBO method that incorporates causal calculus to systematically address this challenge. RESCUE learns a structural causal model capturing causal relationships between inputs, fidelities, and objectives, and uses it to construct a probabilistic multi-fidelity (MF) surrogate that encodes intervention effects. Exploiting the causal structure, we introduce a causal hypervolume knowledge-gradient acquisition strategy to select input-fidelity pairs that balance expected multi-objective improvement and cost. We show that RESCUE improves sample efficiency over state-of-the-art MF optimization methods on synthetic and real-world problems in robotics, machine learning (AutoML), and healthcare.
zh
[AI-237] World Models as an Intermediary between Agents and the Real World
【速读】:该论文旨在解决高成本领域中智能体(agent)性能提升受限的问题,特别是在机器人操作、机器学习工程、科学实验等场景下,由于执行动作的物理代价、时间成本和资源消耗过高,导致传统基于强化学习的智能体难以高效获取奖励信号,从而面临极端离策略学习(off-policy learning)和长时程任务中的样本效率低下等根本性挑战。解决方案的关键在于引入世界模型(world model),将其作为智能体与真实世界之间的中介,通过建模环境的动力学特性、奖励机制及任务分布,显著降低对真实交互的依赖,并提供丰富且关键的学习信号,从而在复杂高成本场景中实现更高效的智能体训练与决策优化。
链接: https://arxiv.org/abs/2602.00785
作者: Sherry Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language model (LLM) agents trained using reinforcement learning have achieved superhuman performance in low-cost environments like games, mathematics, and coding. However, these successes have not translated to complex domains where the cost of interaction is high, such as the physical cost of running robots, the time cost of ML engineering, and the resource cost of scientific experiments. The true bottleneck for achieving the next level of agent performance in these complex and high-cost domains lies in the expense of executing actions to acquire reward signals. To address this gap, this paper argues that we should use world models as an intermediary between agents and the real world. We discuss how world models, viewed as models of dynamics, rewards, and task distributions, can overcome fundamental barriers of high-cost actions such as extreme off-policy learning and sample inefficiency in long-horizon tasks. Moreover, we demonstrate how world models can provide critical and rich learning signals to agents across a broad set of domains, including machine learning engineering, computer use, robotics, and AI for science. Lastly, we identify the challenges of building these world models and propose actionable items along dataset curation, architecture design, scaling, and evaluation of world models.
zh
[AI-238] Environment-Aware Adaptive Pruning with Interleaved Inference Orchestration for Vision-Language-Action Models
【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在具身智能任务中因参数量庞大导致的推理延迟过高问题,从而阻碍实时操作能力。现有静态剪枝方法无法适应环境动态变化,而固定间隔的动态层剪枝则存在粒度粗、重训练开销大的缺陷。解决方案的关键在于提出EcoVLA框架,其核心由两部分组成:一是环境感知自适应剪枝(Environment-aware Adaptive Pruning, EAP),通过利用物理环境的时间一致性来动态调整通道剪枝模式;二是交错推理调度(Interleaved Inference Orchestration, I²O),利用VLA推理过程中的FLOPs空洞并行执行剪枝策略,实现近乎零延迟影响的自适应优化。该方案无需重新训练,可与现有加速方法正交组合,显著提升推理效率。
链接: https://arxiv.org/abs/2602.00780
作者: Yuting Huang,Leilei Ding,Zhipeng Tang,Zenghuan Zhu,Jiajun Deng,Xinrui Lin,Shuo Liu,Haojie Ren,Jianmin Ji,Yanyong Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, 7 figures
Abstract:While Vision-Language-Action (VLA) models hold promise in embodied intelligence, their large parameter counts lead to substantial inference latency that hinders real-time manipulation, motivating parameter sparsification. However, as the environment evolves during VLA execution, the optimal sparsity patterns change accordingly. Static pruning lacks the adaptability required for environment dynamics, whereas fixed-interval dynamic layer pruning suffers from coarse granularity and high retraining overheads. To bridge this gap, we propose EcoVLA, a training-free, plug-and-play adaptive pruning framework that supports orthogonal combination with existing VLA acceleration methods. EcoVLA comprises two components: Environment-aware Adaptive Pruning (EAP) and Interleaved Inference Orchestration (I²O). EAP is a lightweight adaptive channel pruning method that incorporates the temporal consistency of the physical environment to update sparsity patterns. I²O leverages the FLOPs bubbles inherent in VLA inference to schedule the pruning method in parallel, ensuring negligible impact on latency. Evaluated on diverse VLA models and benchmarks, EcoVLA delivers state-of-the-art performance, achieving up to 1.60× speedup with only a 0.4% drop in success rate, and further reaches 2.18× speedup with only a 0.5% degradation when combined with token pruning. We further validate the effectiveness of EcoVLA on real-world robots.
zh
[AI-239] BLOCK-EM: Preventing Emergent Misalignment by Blocking Causal Features
【速读】:该论文试图解决语言模型在针对特定任务进行监督微调时可能出现的“涌现错位”(emergent misalignment)问题,即模型在学习目标行为的同时,会发展出域外的不良行为。解决方案的关键在于:通过识别一组能稳定控制错位行为的内部特征(internal features),并在微调过程中对这些特征施加约束(blocking),从而抑制其强化,进而显著减少错位行为的发生,同时不损害模型在目标任务上的性能。实验表明,在六个不同微调领域中,固定的一组特征被约束后,可实现高达95%的相对错位减少,且无模型质量下降。
链接: https://arxiv.org/abs/2602.00767
作者: Muhammed Ustaomeroglu,Guannan Qu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 41 pages, 32 figures. Code available
Abstract:Emergent misalignment can arise when a language model is fine-tuned on a narrowly scoped supervised objective: the model learns the target behavior, yet also develops undesirable out-of-domain behaviors. We investigate a mechanistic approach to preventing emergent misalignment by identifying a small set of internal features that reliably control the misaligned behavior and then discouraging the model from strengthening these features during fine-tuning. Across six fine-tuning domains, blocking (i.e., constraining) a fixed set of features achieves up to 95% relative reduction in emergent misalignment with no degradation in model quality or target-task performance. We strengthen validity with disjoint selection/evaluation splits, multiple independent judges, multiple random seeds for key settings, quality metrics, and extensive ablations demonstrating that the reduction in misalignment is specific to the identified mechanism. We also characterize a limiting regime in which misalignment re-emerges under prolonged fine-tuning, present evidence consistent with rerouting through alternative features or layers, and evaluate modifications that partially restore the misalignment-blocking effect. Overall, our results show that targeted training-time constraints on internal mechanisms can mitigate emergent misalignment without degrading target-task performance.
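"在微调中约束(blocking)已识别特征"的一种可能实现,是对激活在固定特征方向上的投影施加惩罚项。以下 PyTorch 示意中的特征方向与惩罚系数均为占位假设,仅用于说明机制,不代表论文的具体约束方式。

```python
import torch
import torch.nn.functional as F

def blocking_penalty(activations, feature_dirs, lam=1.0):
    # activations: (batch, d); feature_dirs: (k, d) 单位范数的已识别特征方向
    proj = activations @ feature_dirs.T  # 各特征的激活强度
    return lam * proj.pow(2).mean()      # 抑制这些特征在微调中被强化

acts = torch.randn(32, 512, requires_grad=True)
dirs = F.normalize(torch.randn(4, 512), dim=-1)
loss = blocking_penalty(acts, dirs)
loss.backward()  # 梯度将把激活在特征方向上的投影推向 0
print(float(loss))
```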
zh
[AI-240] Evolving Interpretable Constitutions for Multi-Agent Simulation
【Quick Read】: This paper tackles the alignment challenges that emergent social dynamics create in multi-agent LLM systems, in particular how to automatically discover behavioral norms that promote social welfare without explicit guidance toward cooperation. The key is the Constitutional Evolution framework, which uses a grid-world simulation with survival pressure and combines LLM-driven genetic programming with a multi-island evolution strategy to evolve constitutional rules that maximize a Societal Stability Score (S). Without presupposing cooperation as a goal, the evolved constitution C* not only clearly outperforms human-designed baselines (S = 0.556 ± 0.008, a 123% improvement) but also reveals that minimizing communication (only 0.9% social actions) serves collective welfare better than verbose coordination, showing that cooperative norms can be discovered from experience rather than prescribed.
Link: https://arxiv.org/abs/2602.00755
Authors: Ujwal Kumar, Alice Saito, Hershraj Niranjani, Rayan Yessou, Phan Xuan Tan
Affiliation: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 23 pages, 4 figures
Abstract:Constitutional AI has focused on single-model alignment using fixed principles. However, multi-agent systems create novel alignment challenges through emergent social dynamics. We present Constitutional Evolution, a framework for automatically discovering behavioral norms in multi-agent LLM systems. Using a grid-world simulation with survival pressure, we study the tension between individual and collective welfare, quantified via a Societal Stability Score S in [0,1] that combines productivity, survival, and conflict metrics. Adversarial constitutions lead to societal collapse (S = 0), while vague prosocial principles ("be helpful, harmless, honest") produce inconsistent coordination (S = 0.249). Even constitutions designed by Claude 4.5 Opus with explicit knowledge of the objective achieve only moderate performance (S = 0.332). Using LLM-driven genetic programming with multi-island evolution, we evolve constitutions maximizing social welfare without explicit guidance toward cooperation. The evolved constitution C* achieves S = 0.556 +/- 0.008 (123% higher than human-designed baselines, N = 10), eliminates conflict, and discovers that minimizing communication (0.9% vs 62.2% social actions) outperforms verbose coordination. Our interpretable rules demonstrate that cooperative norms can be discovered rather than prescribed.
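As a toy illustration only (the paper's exact weighting of productivity, survival, and conflict is not given here), a stability score on [0, 1] could be assembled as:

```python
def stability_score(productivity, survival_rate, conflict_rate):
    # all inputs in [0, 1]; lower conflict is better, hence the flip
    return (productivity + survival_rate + (1.0 - conflict_rate)) / 3.0

print(stability_score(0.7, 0.9, 0.1))  # 0.8333...
```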
zh
[AI-241] GraphNNK – Graph Classification and Interpretability
【Quick Read】: This paper addresses the poor interpretability and limited generalization that Graph Neural Networks (GNNs) suffer when learning from graph-structured data due to their reliance on parametric classifiers such as linear softmax layers. The key is to introduce interpolation-based methods, in particular Non-Negative Kernel regression (NNK), which expresses predictions as convex combinations of similar training examples in the embedding space, providing theoretical grounding and interpretability while preserving model performance.
Link: https://arxiv.org/abs/2602.00753
Authors: Zeljko Bolevic, Milos Brajovic, Isidora Stankovic, Ljubisa Stankovic
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 4 pages, 3 figures, IEEE conference paper
Abstract:Graph Neural Networks (GNNs) have become a standard approach for learning from graph-structured data. However, their reliance on parametric classifiers (most often linear softmax layers) limits interpretability and sometimes hinders generalization. Recent work on interpolation-based methods, particularly Non-Negative Kernel regression (NNK), has demonstrated that predictions can be expressed as convex combinations of similar training examples in the embedding space, yielding both theoretical results and interpretable explanations.
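A minimal sketch of NNK-style interpolative classification, assuming fixed embeddings and a Gaussian kernel; the paper's exact neighborhood construction may differ:

```python
import numpy as np
from scipy.optimize import nnls

def nnk_predict(x, X_train, Y_train, sigma=1.0, k=20):
    """Predict as a convex combination of similar training examples."""
    d2 = ((X_train - x) ** 2).sum(axis=1)
    idx = np.argsort(d2)[:k]                          # local neighborhood
    K_SS = np.exp(-((X_train[idx, None] - X_train[None, idx]) ** 2).sum(-1)
                  / (2 * sigma ** 2))
    k_Sx = np.exp(-d2[idx] / (2 * sigma ** 2))
    theta, _ = nnls(K_SS, k_Sx)                       # non-negative weights
    w = theta / (theta.sum() + 1e-12)                 # convex combination
    pred = (w[:, None] * Y_train[idx]).sum(axis=0)    # one-hot label average
    return pred, idx, w                               # prediction plus its explanation
```

The returned indices and weights are the interpretable part: they name exactly which training examples support the prediction, and how strongly.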
zh
[AI-242] Engineering AI Agents for Clinical Workflows: A Case Study in ArchitectureMLOps and Governance
【Quick Read】: This paper addresses the reliability, governability, and accountability gaps, the "responsibility vacuum," that pervade industrially deployed clinical AI systems and that stem from brittle prototype architectures and missing systemic governance. The key is to build trustworthy clinical AI through the synergistic integration of four foundational engineering pillars: Clean Architecture for maintainability combined with an event-driven architecture for resilience and auditability; the Agent as the core unit of modularity, each with its own independent MLOps lifecycle; and a Human-in-the-Loop governance model embedded technically as an event-driven data source, rather than a mere safety check, to drive continuous improvement. The result is a production-grade reference architecture offering a practical path to maintainable, scalable, and accountable AI systems in high-stakes domains.
Link: https://arxiv.org/abs/2602.00726
Authors: Cláudio Lúcio do Val Lopes, João Marcus Pitta, Fabiano Belém, Gildson Alves, Flávio Vinícius Cruzeiro Martins
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 9 pages, 5 figures. 2026 IEEE/ACM 5th International Conference on AI Engineering - Software Engineering for AI, April 12-13, 2026, Rio de Janeiro, Brazil
Abstract:The integration of Artificial Intelligence (AI) into clinical settings presents a software engineering challenge, demanding a shift from isolated models to robust, governable, and reliable systems. However, industrial applications are often plagued by brittle, prototype-derived architectures and a lack of systemic oversight, creating a "responsibility vacuum" where safety and accountability are compromised. This paper presents an industry case study of the "Maria" platform, a production-grade AI system in primary healthcare that addresses this gap. Our central hypothesis is that trustworthy clinical AI is achieved through the holistic integration of four foundational engineering pillars. We present a synergistic architecture that combines Clean Architecture for maintainability with an Event-driven architecture for resilience and auditability. We introduce the Agent as the primary unit of modularity, each possessing its own autonomous MLOps lifecycle. Finally, we show how a Human-in-the-Loop governance model is technically integrated not merely as a safety check, but as a critical, event-driven data source for continuous improvement. We present the platform as a reference architecture, offering practical lessons for engineers building maintainable, scalable, and accountable AI-enabled systems in high-stakes domains.
zh
[AI-243] Bypassing Prompt Injection Detectors through Evasive Injections
【Quick Read】: This paper addresses the fragility of task-drift detection in generative AI systems: detectors based on activation deltas can fail when inputs are injected with adversarial suffixes. The key to the solution is a defense mechanism that randomly appends one of several pre-generated distractor suffixes during the forward pass and trains logistic regression models on the perturbed activations for detection, substantially improving robustness against adversarial attacks.
Link: https://arxiv.org/abs/2602.00750
Authors: Md Jahedur Rahman, Ihsen Alouani
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) are increasingly used in interactive and retrieval-augmented systems, but they remain vulnerable to task drift: deviations from a user's intended instruction due to injected secondary prompts. Recent work has shown that linear probes trained on activation deltas of LLMs' hidden layers can effectively detect such drift. In this paper, we evaluate the robustness of these detectors against adversarially optimised suffixes. We generate universal suffixes that cause poisoned inputs to evade detection across multiple probes simultaneously. Our experiments on Phi-3 3.8B and Llama-3 8B show that a single suffix can achieve high attack success rates, up to 93.91% and 99.63% respectively when all probes must be fooled, and nearly perfect success (90%) under a majority-vote setting. These results demonstrate that activation delta-based task drift detectors are highly vulnerable to adversarial suffixes, highlighting the need for stronger defences against adaptive attacks. We also propose a defence technique in which we generate multiple suffixes, randomly append one of them to the prompts while making forward passes of the LLM, and train logistic regression models on these activations. We found this approach to be highly effective against such attacks.
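A hedged sketch of the proposed randomized-suffix defense, assuming a helper `acts(text)` that returns a pooled hidden-state vector from one layer; the helper and the suffix pool are stand-ins, not the paper's exact interface:

```python
import random
import numpy as np
from sklearn.linear_model import LogisticRegression

def activation_delta(acts, user_prompt, external_text, suffixes):
    s = random.choice(suffixes)                   # randomize what the probe sees
    clean = acts(user_prompt + s)
    mixed = acts(user_prompt + external_text + s)
    return mixed - clean                          # drift signature in activation space

def train_detector(deltas, labels):
    # labels: 0 = benign external text, 1 = injected instruction
    return LogisticRegression(max_iter=1000).fit(np.stack(deltas), np.array(labels))
```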
zh
[AI-244] HyperOffload: Graph-Driven Hierarchical Memory Management for Large Language Models on SuperNode Architectures
【Quick Read】: This paper addresses the bottleneck that arises as Large Language Models (LLMs) evolve toward long-context reasoning and sparse architectures: memory demands exceed the High Bandwidth Memory (HBM) capacity of a single device, while existing software stacks cannot exploit the terabyte-scale shared memory pools of emerging SuperNode architectures, since their runtime-level swapping has only a local view, yielding reactive scheduling and exposed communication latency that stall the compute pipeline. The key is HyperOffload, a compiler-assisted memory management framework that models remote memory accesses as explicit operations in the computation graph, expressing data movement through cache operators in the Intermediate Representation (IR) at compile time; this enables global static analysis of tensor lifetimes and execution dependencies, on top of which a global execution-order refinement algorithm statically schedules transfers to hide remote-memory latency, cutting peak device memory usage by up to 26% without sacrificing end-to-end performance.
Link: https://arxiv.org/abs/2602.00748
Authors: Fangxin Liu, Qinghua Zhang, Hanjing Shen, Qinghua Zhang, Zhibo Liang, Li Jiang, Haibing Guan, Chong Bao, Xuefeng Jin
Affiliation: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: Technical Report
Abstract:The rapid evolution of Large Language Models (LLMs) towards long-context reasoning and sparse architectures has pushed memory requirements far beyond the capacity of individual device HBM. While emerging supernode architectures offer terabyte-scale shared memory pools via high-bandwidth interconnects, existing software stacks fail to exploit this hardware effectively. Current runtime-based offloading and swapping techniques operate with a local view, leading to reactive scheduling and exposed communication latency that stall the computation pipeline. In this paper, we propose the SuperNode Memory Management Framework (HyperOffload). It employs a compiler-assisted approach that leverages graph-driven memory management to treat remote memory access as explicit operations in the computation graph, specifically designed for hierarchical SuperNode architectures. Unlike reactive runtime systems, HyperOffload represents data movement using cache operators within the compiler's Intermediate Representation (IR). This design enables a global, compile-time analysis of tensor lifetimes and execution dependencies. Leveraging this visibility, we develop a global execution-order refinement algorithm that statically schedules data transfers to hide remote memory latency behind compute-intensive regions. We implement HyperOffload within the production deep learning framework MindSpore, adding a remote memory backend and specialized compiler passes. Evaluation on representative LLM workloads shows that HyperOffload reduces peak device memory usage by up to 26% for inference while maintaining end-to-end performance. Our work demonstrates that integrating memory-augmented hardware into the compiler's optimization framework is essential for scaling next-generation AI workloads.
zh
[AI-245] SA-VLA: Spatially-Aware Flow-Matching for Vision-Language-Action Reinforcement Learning
【Quick Read】: This paper addresses the loss of robustness that Vision-Language-Action (VLA) models suffer under spatial distribution shifts during reinforcement learning (RL) fine-tuning, specifically the erosion of the spatial inductive bias of flow-matching VLA policies during RL adaptation. The core of the solution is the SA-VLA framework, which couples three designs: fusing implicit spatial representations with visual tokens to strengthen spatial awareness; dense rewards that reflect geometric progress to guide task-oriented learning; and SCAN, a spatially-conditioned annealed exploration strategy tailored to flow-matching dynamics. Together these preserve spatial grounding during RL optimization, yielding stable fine-tuning and stronger zero-shot spatial generalization.
Link: https://arxiv.org/abs/2602.00743
Authors: Xu Pan, Zhenglin Wan, Xingrui Yu, Xianwei Zheng, Youkai Ke, Ming Sun, Rui Wang, Ziwei Wang, Ivor Tsang
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Version 1
Abstract:Vision-Language-Action (VLA) models exhibit strong generalization in robotic manipulation, yet reinforcement learning (RL) fine-tuning often degrades robustness under spatial distribution shifts. For flow-matching VLA policies, this degradation is closely associated with the erosion of spatial inductive bias during RL adaptation, as sparse rewards and spatially agnostic exploration increasingly favor short-horizon visual cues. To address this issue, we propose \textbfSA-VLA, a spatially-aware RL adaptation framework that preserves spatial grounding during policy optimization by aligning representation learning, reward design, and exploration with task geometry. SA-VLA fuses implicit spatial representations with visual tokens, provides dense rewards that reflect geometric progress, and employs \textbfSCAN, a spatially-conditioned annealed exploration strategy tailored to flow-matching dynamics. Across challenging multi-object and cluttered manipulation benchmarks, SA-VLA enables stable RL fine-tuning and improves zero-shot spatial generalization, yielding more robust and transferable behaviors. Code and project page are available at this https URL.
zh
[AI-246] Pareto-Conditioned Diffusion Models for Offline Multi-Objective Optimization ICLR2026 ICLR26
【Quick Read】: This paper addresses the generalization problem in offline multi-objective optimization (MOO): how to effectively explore and approximate the Pareto front from a static dataset alone while avoiding reliance on complex surrogate models. The key is the Pareto-Conditioned Diffusion (PCD) framework, which casts offline MOO as a conditional sampling problem, conditioning the diffusion process directly on desired trade-offs. Its central innovations are a reweighting strategy that focuses on high-performing samples and a reference-direction mechanism that steers sampling toward novel, promising regions beyond the training data, achieving more stable and consistent Pareto-front approximation.
Link: https://arxiv.org/abs/2602.00737
Authors: Jatan Shrestha, Santeri Heiskanen, Kari Hepola, Severi Rissanen, Pekka Jääskeläinen, Joni Pajarinen
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ICLR 2026. Project page: this https URL
Abstract:Multi-objective optimization (MOO) arises in many real-world applications where trade-offs between competing objectives must be carefully balanced. In the offline setting, where only a static dataset is available, the main challenge is generalizing beyond observed data. We introduce Pareto-Conditioned Diffusion (PCD), a novel framework that formulates offline MOO as a conditional sampling problem. By conditioning directly on desired trade-offs, PCD avoids the need for explicit surrogate models. To effectively explore the Pareto front, PCD employs a reweighting strategy that focuses on high-performing samples and a reference-direction mechanism to guide sampling towards novel, promising regions beyond the training data. Experiments on standard offline MOO benchmarks show that PCD achieves highly competitive performance and, importantly, demonstrates greater consistency across diverse tasks than existing offline MOO approaches.
zh
[AI-247] Neuro-symbolic AI for Predictive Maintenance (PdM) – review and recommendations
【Quick Read】: This paper addresses the respective limitations of current data-driven approaches (e.g., deep learning) and traditional knowledge-driven systems in predictive maintenance (PdM): the former require large labeled datasets, generalize poorly out of distribution, and reason opaquely; the latter suffer from low accuracy, many false positives, and constant expert intervention. The key proposal is to adopt Neuro-symbolic AI (NeSy), tightly integrating deep learning with symbolic logic to build hybrid systems that are accurate, explainable, interpretable, and robust, thereby overcoming the weaknesses of either approach alone and improving the deployability and trustworthiness of PdM in real industrial settings.
Link: https://arxiv.org/abs/2602.00731
Authors: Kyle Hamilton, Ali Intizar
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:
Abstract:In this document we perform a systematic review of the state of the art in Predictive Maintenance (PdM) over the last five years in industrial settings such as commercial buildings, pharmaceutical facilities, or semiconductor manufacturing. In general, data-driven methods, such as those based on deep learning, exhibit higher accuracy than traditional knowledge-based systems. These systems, however, are not without significant limitations. The need for large labeled data sets, a lack of generalizability to new environments (out-of-distribution generalization), and a lack of transparency at inference time are some of the obstacles to adoption in real-world environments. In contrast, traditional approaches based on domain expertise in the form of rules, logic, or first principles suffer from poor accuracy, many false positives, and a need for ongoing expert supervision and manual tuning. While the majority of approaches in recent literature utilize some form of data-driven architecture, there are hybrid systems which also take into account domain-specific knowledge. Such hybrid systems have the potential to overcome the weaknesses of either approach on its own while preserving their strengths. We propose taking the hybrid approach even further and integrating deep learning with symbolic logic, or Neuro-symbolic AI, to create more accurate, explainable, interpretable, and robust systems. We describe several neuro-symbolic architectures and examine their strengths and limitations within the PdM domain. We focus specifically on methods which involve the use of sensor data and manually crafted rules as inputs by describing concrete NeSy architectures. In short, this survey outlines the context of modern maintenance, defines key concepts, establishes a generalized framework, reviews current modeling approaches and challenges, and introduces the proposed focus on Neuro-symbolic AI (NeSy).
zh
[AI-248] Augmenting Clinical Decision-Making with an Interactive and Interpretable AI Copilot: A Real-World User Study with Clinicians in Nephrology and Obstetrics
【Quick Read】: This paper addresses clinician distrust of black-box AI, which hinders adoption in high-stakes healthcare. The key is AICare, an interactive and interpretable AI copilot that analyzes longitudinal Electronic Health Records (EHR) and grounds dynamic risk predictions in scrutable visualizations and Large Language Model (LLM)-driven diagnostic recommendations, making clinical decision processes more transparent and trustworthy. The study finds that the system significantly reduces cognitive workload, and the divergent usage strategies of clinicians at different experience levels (junior clinicians using it as cognitive scaffolding, senior clinicians performing adversarial verification) show that interpretable, interactive AI systems must accommodate diverse reasoning styles to augment rather than replace clinical judgment.
Link: https://arxiv.org/abs/2602.00726
Authors: Yinghao Zhu, Dehao Sui, Zixiang Wang, Xuning Hu, Lei Gu, Yifan Qi, Tianchen Wu, Ling Wang, Yuan Wei, Wen Tang, Zhihan Cui, Yasha Wang, Lequan Yu, Ewen M Harrison, Junyi Gao, Liantao Ma
Affiliation: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted by ACM CHI 2026
Abstract:Clinician skepticism toward opaque AI hinders adoption in high-stakes healthcare. We present AICare, an interactive and interpretable AI copilot for collaborative clinical decision-making. By analyzing longitudinal electronic health records, AICare grounds dynamic risk predictions in scrutable visualizations and LLM-driven diagnostic recommendations. Through a within-subjects counterbalanced study with 16 clinicians across nephrology and obstetrics, we comprehensively evaluated AICare using objective measures (task completion time and error rate), subjective assessments (NASA-TLX, SUS, and confidence ratings), and semi-structured interviews. Our findings indicate that AICare reduces cognitive workload. Beyond performance metrics, qualitative analysis reveals that trust is actively constructed through verification, with interaction strategies diverging by expertise: junior clinicians used the system as cognitive scaffolding to structure their analysis, while experts engaged in adversarial verification to challenge the AI's logic. This work offers design implications for creating AI systems that function as transparent partners, accommodating diverse reasoning styles to augment rather than replace clinical judgment.
zh
[AI-249] Rethinking Hallucinations: Correctness Consistency and Prompt Multiplicity EACL2026
【Quick Read】: This paper addresses the overemphasis on correctness and neglect of consistency in current evaluations of hallucination in large language models (LLMs), which leads to misjudging hallucination-related harms. The core of the solution is the prompt multiplicity framework, which assesses hallucination risk by quantifying the consistency of model outputs across different prompts. The analysis shows over 50% inconsistency on benchmarks such as Med-HALT, indicating that hallucination harms have been severely misunderstood; it further finds that existing detection techniques actually capture output consistency rather than correctness, and that mitigation strategies such as Retrieval-Augmented Generation (RAG), while improving accuracy, can introduce new inconsistencies, revealing critical limitations of current methods.
Link: https://arxiv.org/abs/2602.00723
Authors: Prakhar Ganesh, Reza Shokri, Golnoosh Farnadi
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: To appear at EACL 2026
Abstract:Large language models (LLMs) are known to “hallucinate” by generating false or misleading outputs. Hallucinations pose various harms, from erosion of trust to widespread misinformation. Existing hallucination evaluation, however, focuses only on correctness and often overlooks consistency, necessary to distinguish and address these harms. To bridge this gap, we introduce prompt multiplicity, a framework for quantifying consistency in LLM evaluations. Our analysis reveals significant multiplicity (over 50% inconsistency in benchmarks like Med-HALT), suggesting that hallucination-related harms have been severely misunderstood. Furthermore, we study the role of consistency in hallucination detection and mitigation. We find that: (a) detection techniques detect consistency, not correctness, and (b) mitigation techniques like RAG, while beneficial, can introduce additional inconsistencies. By integrating prompt multiplicity into hallucination evaluation, we provide an improved framework of potential harms and uncover critical limitations in current detection and mitigation strategies.
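A minimal consistency score in the spirit of prompt multiplicity (the paper's metrics may be more refined) is just the agreement rate across paraphrases of the same question:

```python
from collections import Counter

def consistency(answers):
    """Agreement rate: share of responses matching the modal answer."""
    return Counter(answers).most_common(1)[0][1] / len(answers)

# 0.75: the model is inconsistent regardless of which answer is correct
print(consistency(["Paris", "Paris", "Lyon", "Paris"]))
```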
zh
[AI-250] Deep Time-series Forecasting Needs Kernelized Moment Balancing
【Quick Read】: This paper addresses insufficient distribution balancing in deep time-series forecasting: existing methods cannot achieve true distributional alignment because they match first or second moments for only a few predefined balancing functions, violating Imbens' criterion of first-moment matching under arbitrary balancing functions. The key is direct forecasting with Kernelized Moment Balancing (KMB-DF), whose core innovation is to adaptively select the most informative balancing functions from a reproducing kernel Hilbert space (RKHS) to enforce sufficient distribution balancing. The resulting objective is differentiable and tractable, supports efficient estimation from empirical samples, and integrates seamlessly into gradient-based training, consistently improving forecasting accuracy to state-of-the-art levels.
Link: https://arxiv.org/abs/2602.00717
Authors: Licheng Pan, Hao Wang, Haocheng Yang, Yuqi Li, Qingsong Wen, Xiaoxi Li, Zhichao Chen, Haoxuan Li, Zhixuan Chu, Yuan Lu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Deep time-series forecasting can be formulated as a distribution balancing problem aimed at aligning the distribution of the forecasts and ground truths. According to Imbens' criterion, true distribution balance requires matching the first moments with respect to any balancing function. We demonstrate that existing objectives fail to meet this criterion, as they enforce moment matching only for one or two predefined balancing functions, thus failing to achieve full distribution balance. To address this limitation, we propose direct forecasting with kernelized moment balancing (KMB-DF). Unlike existing objectives, KMB-DF adaptively selects the most informative balancing functions from a reproducing kernel Hilbert space (RKHS) to enforce sufficient distribution balancing. We derive a tractable and differentiable objective that enables efficient estimation from empirical samples and seamless integration into gradient-based training pipelines. Extensive experiments across multiple models and datasets show that KMB-DF consistently improves forecasting accuracy and achieves state-of-the-art performance. Code is available at this https URL.
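Matching first moments against all balancing functions in an RKHS unit ball is the idea behind maximum mean discrepancy (MMD), so one simple instantiation of the balancing term could look like the following; the Gaussian kernel and fixed bandwidth are assumptions, not the paper's adaptive selection:

```python
import torch

def mmd2(x, y, sigma=1.0):
    """Biased (V-statistic) estimate of squared MMD with a Gaussian kernel."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# combined objective for a forecaster f:
# loss = mse(f(x), y) + lam * mmd2(f(x).flatten(1), y.flatten(1))
```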
zh
[AI-251] From Detection to Prevention: Explaining Security-Critical Code to Avoid Vulnerabilities
【Quick Read】: This paper addresses the difficulty of preventing security vulnerabilities during software development caused by limited security expertise and code complexity; traditional static and dynamic analysis tools detect vulnerabilities only after they are introduced, making remediation costly. The key is a proactive strategy: identify code regions that implement security-critical functionality (such as data access, authentication, and input handling) and provide targeted guidance for their secure implementation. Concretely, the authors build an IntelliJ IDEA plugin prototype that uses code-level software metrics to locate potentially security-critical methods and large language models (LLMs) to generate prevention-oriented explanations, improving security at coding time.
Link: https://arxiv.org/abs/2602.00711
Authors: Ranjith Krishnamurthy, Oshando Johnson, Goran Piskachev, Eric Bodden
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 4 pages
Abstract:Security vulnerabilities often arise unintentionally during development due to a lack of security expertise and code complexity. Traditional tools, such as static and dynamic analysis, detect vulnerabilities only after they are introduced in code, leading to costly remediation. This work explores a proactive strategy to prevent vulnerabilities by highlighting code regions that implement security-critical functionality – such as data access, authentication, and input handling – and providing guidance for their secure implementation. We present an IntelliJ IDEA plugin prototype that uses code-level software metrics to identify potentially security-critical methods and large language models (LLMs) to generate prevention-oriented explanations. Our initial evaluation on the Spring-PetClinic application shows that the selected metrics identify most known security-critical methods, while an LLM provides actionable, prevention-focused insights. Although these metrics capture structural properties rather than semantic aspects of security, this work lays the foundation for code-level security-aware metrics and enhanced explanations.
zh
[AI-252] Learning More from Less: Unlocking Internal Representations for Benchmark Compression
【Quick Read】: This paper addresses the prohibitive cost of evaluating large language models (LLMs), in particular the statistical instability of existing coreset-construction methods based on discrete correctness labels when only a few source models are available. The key is to move beyond output labels alone and exploit the semantic information in model hidden states: the REPCORE framework aligns heterogeneous hidden states into a unified latent space to construct more representative coresets for performance extrapolation. The approach markedly improves estimation accuracy and ranking correlation in small-sample settings and does not depend on extensive historical evaluation data, making it suitable for newly released benchmarks.
Link: https://arxiv.org/abs/2602.00710
Authors: Yueqi Zhang, Jin Hu, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Yiwei Li, Jiayi Shi, Chuyi Tan, Ji Zhang, Boyuan Pan, Yao Hu, Kan Li
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:The prohibitive cost of evaluating Large Language Models (LLMs) necessitates efficient alternatives to full-scale benchmarking. Prevalent approaches address this by identifying a small coreset of items to approximate full-benchmark performance. However, existing methods must estimate a reliable item profile from response patterns across many source models, which becomes statistically unstable when the source pool is small. This dependency is particularly limiting for newly released benchmarks with minimal historical evaluation data. We argue that discrete correctness labels are a lossy view of the model’s decision process and fail to capture information encoded in hidden states. To address this, we introduce REPCORE, which aligns heterogeneous hidden states into a unified latent space to construct representative coresets. Using these subsets for performance extrapolation, REPCORE achieves precise estimation accuracy with as few as ten source models. Experiments on five benchmarks and over 200 models show consistent gains over output-based baselines in ranking correlation and estimation accuracy. Spectral analysis further indicates that the aligned representations contain separable components reflecting broad response tendencies and task-specific reasoning patterns.
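A rough sketch of embedding-based coreset selection with cluster-weighted extrapolation; k-means and the weighting scheme here are stand-ins for REPCORE's actual alignment and selection procedure:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_coreset(item_embeddings, m):
    """Pick m representative items plus weights for score extrapolation."""
    km = KMeans(n_clusters=m, n_init=10).fit(item_embeddings)
    d = ((item_embeddings[:, None] - km.cluster_centers_[None]) ** 2).sum(-1)
    reps = d.argmin(axis=0)                           # item nearest each centroid
    weights = np.bincount(km.labels_, minlength=m) / len(item_embeddings)
    return reps, weights

# estimated full-benchmark score of a new model, from coreset results only:
# score_hat = (weights * correct_on_coreset).sum()
```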
zh
[AI-253] Physics-informed Diffusion Generation for Geomagnetic Map Interpolation ICASSP’26
【Quick Read】: This paper addresses the suboptimal performance of existing scattered-data interpolation methods on geomagnetic maps, which stems from ignoring both the physical laws specific to geomagnetic data and detection noise. The key is a Physics-informed Diffusion Generation framework (PDG): first, a physics-informed mask strategy based on a local receptive field suppresses noise during diffusion generation; second, a physics-informed constraint following the kriging principle of geomagnetic maps is imposed on the generated results to ensure strict adherence to the physics of the geomagnetic field. Experiments on four real-world datasets show markedly better interpolation accuracy and physical consistency than conventional methods.
Link: https://arxiv.org/abs/2602.00709
Authors: Wenda Li, Tongya Zheng, Kaixuan Chen, Shunyu Liu, Haoze Jiang, Yunzhi Hao, Rui Miao, Zujie Ren, Mingli Song, Hang Shi, Gang Chen
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 5 pages, 2 figures, IEEE ICASSP'26
Abstract:Geomagnetic map interpolation aims to infer unobserved geomagnetic data at spatial points, yielding critical applications in navigation and resource exploration. However, existing methods for scattered data interpolation are not specifically designed for geomagnetic maps, which inevitably leads to suboptimal performance due to detection noise and the laws of physics. Therefore, we propose a Physics-informed Diffusion Generation framework~(PDG) to interpolate incomplete geomagnetic maps. First, we design a physics-informed mask strategy to guide the diffusion generation process based on a local receptive field, effectively eliminating noise interference. Second, we impose a physics-informed constraint on the diffusion generation results following the kriging principle of geomagnetic maps, ensuring strict adherence to the laws of physics. Extensive experiments and in-depth analyses on four real-world datasets demonstrate the superiority and effectiveness of each component of PDG.
zh
[AI-254] Self-Guard: Defending Large Reasoning Models via enhanced self-reflection
【Quick Read】: This paper addresses the distinctive safety-alignment challenges of Large Reasoning Models (LRMs), in particular the awareness-compliance gap: models recognize potential risks yet, due to sycophancy, prioritize following user instructions, enabling reasoning manipulation and information leakage. The key is Self-Guard, a lightweight safety defense framework with two stages: safety-oriented prompting, which activates the model's latent safety awareness to evoke spontaneous reflection; and safety activation steering, which extracts the resulting directional shift in the hidden-state space and amplifies it so that safety compliance prevails over sycophancy during inference. The method bridges the awareness-behavior gap without compromising utility and generalizes well across unseen risks and model scales.
Link: https://arxiv.org/abs/2602.00707
Authors: Jingnan Zheng, Jingjun Xu, Yanzhen Luo, Chenhang Cui, Gelei Deng, Zhenkai Liang, Xiang Wang, An Zhang, Tat-Seng Chua
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:The emergence of Large Reasoning Models (LRMs) introduces a new paradigm of explicit reasoning, enabling remarkable advances yet posing unique risks such as reasoning manipulation and information leakage. To mitigate these risks, current alignment strategies predominantly rely on heavy post-training paradigms or external interventions. However, these approaches are often computationally intensive and fail to address the inherent awareness-compliance gap, a critical misalignment where models recognize potential risks yet prioritize following user instructions due to their sycophantic tendencies. To address these limitations, we propose Self-Guard, a lightweight safety defense framework that reinforces safety compliance at the representational level. Self-Guard operates through two principal stages: (1) safety-oriented prompting, which activates the model’s latent safety awareness to evoke spontaneous reflection, and (2) safety activation steering, which extracts the resulting directional shift in the hidden state space and amplifies it to ensure that safety compliance prevails over sycophancy during inference. Experiments demonstrate that Self-Guard effectively bridges the awareness-compliance gap, achieving robust safety performance without compromising model utility. Furthermore, Self-Guard exhibits strong generalization across diverse unseen risks and varying model scales, offering a cost-efficient solution for LRM safety alignment.
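A hedged sketch of the activation-steering stage: extract the hidden-state shift induced by a safety-oriented prompt, then amplify it at inference. The layer choice, the scale alpha, and the assumption that the hooked module returns a plain tensor are all illustrative, not Self-Guard's exact mechanics:

```python
import torch

def steering_vector(h_safety, h_plain):
    """Mean hidden-state shift induced by a safety-oriented prompt, unit-normed."""
    v = (h_safety - h_plain).mean(dim=0)
    return v / v.norm()

def add_steering(module, v, alpha=4.0):
    # forward hook that nudges every hidden state along v at inference time;
    # assumes `module` returns a plain tensor rather than a tuple
    def hook(mod, inputs, output):
        return output + alpha * v.to(output.dtype)
    return module.register_forward_hook(hook)
```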
zh
[AI-255] Forecasting Energy Availability in Local Energy Communities via LSTM Federated Learning
【Quick Read】: This paper addresses the challenge local energy communities face in achieving self-sufficiency: accurately forecasting the balance between energy consumption and production without violating user privacy. Since conventional forecasting relies on centralized data sharing while users are (rightfully) reluctant to share consumption patterns, the key is a Federated Learning (FL) framework combined with Long Short-Term Memory (LSTM) networks that collaboratively trains accurate energy forecasting models without exchanging raw data, striking an effective trade-off between data privacy and forecasting accuracy.
Link: https://arxiv.org/abs/2602.00694
Authors: Fabio Turazza, Marcello Pietri, Natalia Selini Hadjidimitriou, Marco Mamei
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Published as a book chapter in the MEDES 2024 proceedings (Springer LNCS)
Abstract:Local Energy Communities are emerging as crucial players in the landscape of sustainable development. A significant challenge for these communities is achieving self-sufficiency through effective management of the balance between energy production and consumption. To meet this challenge, it is essential to develop and implement forecasting models that deliver accurate predictions, which can then be utilized by optimization and planning algorithms. However, the application of forecasting solutions is often hindered by privacy constraints and regulations, as the users participating in the Local Energy Community can be (rightfully) reluctant to share their consumption patterns with others. In this context, the use of Federated Learning (FL) can be a viable solution, as it allows a forecasting model to be created without the need to share privacy-sensitive information among the users. In this study, we demonstrate how FL and long short-term memory (LSTM) networks can be employed to achieve this objective, highlighting the trade-off between data sharing and forecasting accuracy.
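A minimal FedAvg round for an LSTM forecaster, sketching how only weights, never consumption data, leave each client; the model, optimizer, and hyperparameters are placeholders rather than the paper's configuration:

```python
import copy
import torch

def fedavg_round(global_model, client_loaders, local_epochs=1):
    states, sizes = [], []
    for loader in client_loaders:
        local = copy.deepcopy(global_model)
        opt = torch.optim.Adam(local.parameters(), lr=1e-3)
        for _ in range(local_epochs):
            for x, y in loader:                  # raw data never leaves the client
                opt.zero_grad()
                torch.nn.functional.mse_loss(local(x), y).backward()
                opt.step()
        states.append(local.state_dict())
        sizes.append(len(loader.dataset))
    total = sum(sizes)
    avg = {k: sum(s[k] * (n / total) for s, n in zip(states, sizes))
           for k in states[0]}                   # size-weighted average of weights
    global_model.load_state_dict(avg)
```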
zh
[AI-256] HumanStudy-Bench: Towards AI Agent Design for Participant Simulation
【Quick Read】: This paper addresses the instability and design sensitivity of large language models (LLMs) acting as simulated participants in social-science experiments, and the causal confusion in existing evaluations that conflate base-model capability with agent configuration (persona attributes, behavioral assumptions, and so on). The key is to reframe participant simulation as an agent-design problem over full experimental protocols: the HUMANSTUDY-BENCH benchmark and execution engine uses a Filter-Extract-Execute-Evaluate pipeline that lets LLM agents faithfully reconstruct published human-subject experiments and replays the original statistical analyses end to end in a shared runtime, enabling quantitative evaluation of human-agent behavioral agreement and preserving fidelity at the level of scientific inference.
Link: https://arxiv.org/abs/2602.00685
Authors: Xuan Liu, Haoyang Shang, Zizhang Liu, Xinyan Liu, Yunze Xiao, Yiwen Tu, Haojian Jin
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) are increasingly used as simulated participants in social science experiments, but their behavior is often unstable and highly sensitive to design choices. Prior evaluations frequently conflate base-model capabilities with experimental instantiation, obscuring whether outcomes reflect the model itself or the agent setup. We instead frame participant simulation as an agent-design problem over full experimental protocols, where an agent is defined by a base model and a specification (e.g., participant attributes) that encodes behavioral assumptions. We introduce HUMANSTUDY-BENCH, a benchmark and execution engine that orchestrates LLM-based agents to reconstruct published human-subject experiments via a Filter–Extract–Execute–Evaluate pipeline, replaying trial sequences and running the original analysis pipeline in a shared runtime that preserves the original statistical procedures end to end. To evaluate fidelity at the level of scientific inference, we propose new metrics to quantify how much human and agent behaviors agree. We instantiate 12 foundational studies as an initial suite in this dynamic benchmark, spanning individual cognition, strategic interaction, and social psychology, and covering more than 6,000 trials with human samples ranging from tens to over 2,100 participants.
zh
[AI-257] RecGOAT: Graph Optimal Adaptive Transport for LLM -Enhanced Multimodal Recommendation with Dual Semantic Alignment
【Quick Read】: This paper addresses the performance bottleneck caused by semantic representation mismatch when large models (LMs) are used in multimodal recommendation: existing methods overlook the fundamental divergence between LM representations optimized for general semantic tasks and recommendation systems' reliance on sparse user/item ID features, yielding incompatible multimodal representations and suboptimal recommendations. The key is RecGOAT, a theoretically grounded dual semantic alignment framework whose core innovations are: (1) graph attention networks that enrich collaborative semantics by modeling user-item, item-item, and user-user relationships; and (2) a dual-granularity progressive multimodality-ID alignment mechanism that achieves instance-level alignment via cross-modal contrastive learning (CMCL) and distribution-level alignment via optimal adaptive transport (OAT), substantially improving the semantic consistency and comprehensiveness of the unified representations.
Link: https://arxiv.org/abs/2602.00682
Authors: Yuecheng Li, Hengwei Ju, Zeyu Song, Wei Yang, Chi Lu, Peng Jiang, Kun Gai
Affiliation: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Under Review
Abstract:Multimodal recommendation systems typically integrate user behavior with multimodal data from items, thereby capturing more accurate user preferences. Concurrently, with the rise of large models (LMs), multimodal recommendation is increasingly leveraging their strengths in semantic understanding and contextual reasoning. However, LM representations are inherently optimized for general semantic tasks, while recommendation models rely heavily on sparse user/item unique identity (ID) features. Existing works overlook the fundamental representational divergence between large models and recommendation systems, resulting in incompatible multimodal representations and suboptimal recommendation performance. To bridge this gap, we propose RecGOAT, a novel yet simple dual semantic alignment framework for LLM-enhanced multimodal recommendation, which offers theoretically guaranteed alignment capability. RecGOAT first employs graph attention networks to enrich collaborative semantics by modeling item-item, user-item, and user-user relationships, leveraging user/item LM representations and interaction history. Furthermore, we design a dual-granularity progressive multimodality-ID alignment framework, which achieves instance-level and distribution-level semantic alignment via cross-modal contrastive learning (CMCL) and optimal adaptive transport (OAT), respectively. Theoretically, we demonstrate that the unified representations derived from our alignment framework exhibit superior semantic consistency and comprehensiveness. Extensive experiments on three public benchmarks show that our RecGOAT achieves state-of-the-art performance, empirically validating our theoretical insights. Additionally, the deployment on a large-scale online advertising platform confirms the model's effectiveness and scalability in industrial recommendation scenarios. Code available at this https URL.
zh
[AI-258] OpenGuanDan: A Large-Scale Imperfect Information Game Benchmark
【Quick Read】: This paper addresses the lack of sufficiently challenging benchmarks for data-driven AI research, especially for cooperative-competitive multi-agent decision-making: existing benchmarks fall short on imperfect information, large state and action spaces, mixed objectives combining cooperation and competition, long-horizon decisions, and dynamic team composition. The key is OpenGuanDan, a new benchmark for GuanDan (a popular four-player, multi-round Chinese card game) that supports both efficient simulation of the game and comprehensive evaluation of learning-based and rule-based agents. Its core contribution is a realistic environment with rich game structure plus independent per-player APIs that support human-AI interaction and integration with large language models, providing a more challenging and practical testbed for multi-agent decision-making research.
Link: https://arxiv.org/abs/2602.00676
Authors: Chao Li, Shangdong Yang, Chiheng Zhan, Zhenxing Ge, Yujing Hu, Bingkun Bao, Xingguo Chen, Yang Gao
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:The advancement of data-driven artificial intelligence (AI), particularly machine learning, heavily depends on large-scale benchmarks. Despite remarkable progress across domains ranging from pattern recognition to intelligent decision-making in recent decades, exemplified by breakthroughs in board games, card games, and electronic sports games, there remains a pressing need for more challenging benchmarks to drive further research. To this end, this paper proposes OpenGuanDan, a novel benchmark that enables both efficient simulation of GuanDan (a popular four-player, multi-round Chinese card game) and comprehensive evaluation of both learning-based and rule-based GuanDan AI agents. OpenGuanDan poses a suite of nontrivial challenges, including imperfect information, large-scale information set and action spaces, a mixed learning objective involving cooperation and competition, long-horizon decision-making, variable action spaces, and dynamic team composition. These characteristics make it a demanding testbed for existing intelligent decision-making methods. Moreover, the independent API for each player allows human-AI interactions and supports integration with large language models. Empirically, we conduct two types of evaluations: (1) pairwise competitions among all GuanDan AI agents, and (2) human-AI matchups. Experimental results demonstrate that while current learning-based agents substantially outperform rule-based counterparts, they still fall short of achieving superhuman performance, underscoring the need for continued research in multi-agent intelligent decision-making domain. The project is publicly available at this https URL.
zh
[AI-259] SEISMO: Increasing Sample Efficiency in Molecular Optimization with a Trajectory-Aware LLM Agent KR
【Quick Read】: This paper addresses the low sample efficiency of molecular structure optimization, especially in drug discovery, where property evaluation depends on expensive, rate-limited experimental assays, making it hard to find optimal molecules within a limited evaluation budget. The key is SEISMO, a large language model (LLM) agent that performs strictly online, inference-time optimization, adjusting after every oracle call without batched or population-based learning; SEISMO conditions each proposal on the full optimization trajectory, combining the natural-language task description, scalar scores, and structured explanatory feedback. This markedly improves sample efficiency: across 23 practical molecular optimization tasks it achieves a 2-3x higher area under the optimization curve than prior methods and often approaches near-optimal performance within 50 oracle calls.
Link: https://arxiv.org/abs/2602.00663
Authors: Fabian P. Krüger, Andrea Hunklinger, Adrian Wolny, Tim J. Adler, Igor Tetko, Santiago David Villalba
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments: Fabian P. Krüger and Andrea Hunklinger contributed equally to this work
Abstract:Optimizing the structure of molecules to achieve desired properties is a central bottleneck across the chemical sciences, particularly in the pharmaceutical industry where it underlies the discovery of new drugs. Since molecular property evaluation often relies on costly and rate-limited oracles, such as experimental assays, molecular optimization must be highly sample-efficient. To address this, we introduce SEISMO, an LLM agent that performs strictly online, inference-time molecular optimization, updating after every oracle call without the need for population-based or batched learning. SEISMO conditions each proposal on the full optimization trajectory, combining natural-language task descriptions with scalar scores and, when available, structured explanatory feedback. Across the Practical Molecular Optimization benchmark of 23 tasks, SEISMO achieves a 2-3 times higher area under the optimisation curve than prior methods, often reaching near-maximal task scores within 50 oracle calls. Our additional medicinal-chemistry tasks show that providing explanatory feedback further improves efficiency, demonstrating that leveraging domain knowledge and structured information is key to sample-efficient molecular optimization.
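The strictly online loop can be sketched as follows, with `propose` standing in for the LLM agent and `oracle` for the assay; both names are assumptions about the interface, not SEISMO's actual API:

```python
def optimize(propose, oracle, task, n_calls=50):
    trajectory, best = [], None          # trajectory: (molecule, score, feedback)
    for _ in range(n_calls):
        mol = propose(task, trajectory)              # agent sees the full history
        score, feedback = oracle(mol)                # exactly one oracle call
        trajectory.append((mol, score, feedback))    # strictly online update
        if best is None or score > best[1]:
            best = (mol, score)
    return best, trajectory
```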
zh
[AI-260] Predictive Maintenance for Ultrafiltration Membranes Using Explainable Similarity-Based Prognostics
【Quick Read】: This paper addresses the fouling-driven degradation of ultrafiltration (UF) membranes in reverse-osmosis desalination, where existing black-box machine-learning prognostic models lack the interpretability needed to win operator trust. The key is an explainable prognostic framework: a physics-informed Health Index quantifies degradation dynamics, whose states are fuzzified with Gaussian membership functions; a similarity measure then identifies historical degradation trajectories resembling the current state, and Takagi-Sugeno fuzzy rules produce a transparent, similarity-weighted remaining useful life (RUL) estimate, achieving accurate yet interpretable UF membrane lifetime prediction.
Link: https://arxiv.org/abs/2602.00659
Authors: Qusai Khaled, Laura Genga, Uzay Kaymak
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Submitted to 21st International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU2026)
Abstract:In reverse osmosis desalination, ultrafiltration (UF) membranes degrade due to fouling, leading to performance loss and costly downtime. Most plants rely on scheduled preventive maintenance, since existing predictive maintenance models, often based on opaque machine learning methods, lack interpretability and operator trust. This study proposes an explainable prognostic framework for UF membrane remaining useful life (RUL) estimation using fuzzy similarity reasoning. A physics-informed Health Index, derived from transmembrane pressure, flux, and resistance, captures degradation dynamics, which are then fuzzified via Gaussian membership functions. Using a similarity measure, the model identifies historical degradation trajectories resembling the current state and formulates RUL predictions as Takagi-Sugeno fuzzy rules. Each rule corresponds to a historical exemplar and contributes to a transparent, similarity-weighted RUL estimate. Tested on 12,528 operational cycles from an industrial-scale UF system, the framework achieved a mean absolute error of 4.50 cycles, while generating interpretable rule bases consistent with expert understanding.
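A toy version of the similarity-weighted RUL estimate, where each historical exemplar acts as one fuzzy rule; the Gaussian width and windowing are invented for illustration, not taken from the paper:

```python
import numpy as np

def gaussian_sim(current, exemplar, sigma=0.05):
    # membership-style similarity between two Health Index windows
    return np.exp(-np.mean((current - exemplar) ** 2) / (2 * sigma ** 2))

def estimate_rul(current_hi, library):
    """library: list of (hi_window, observed_remaining_cycles) exemplars."""
    sims = np.array([gaussian_sim(current_hi, hi) for hi, _ in library])
    ruls = np.array([float(r) for _, r in library])
    w = sims / sims.sum()                 # each exemplar contributes one fuzzy rule
    return float((w * ruls).sum())        # transparent, similarity-weighted estimate
```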
zh
[AI-261] MoDEx: Mixture of Depth-specific Experts for Multivariate Long-term Time Series Forecasting
【Quick Read】: This paper addresses the lack of insight into backbone layer behavior in multivariate long-term time series forecasting (LTSF), in particular the untapped specialization of layers at different depths in modeling the temporal dynamics of the input sequence. The key is MoDEx, a lightweight Mixture of Depth-specific Experts built on a new "layer sensitivity" measure, a gradient-based metric inspired by GradCAM and effective-receptive-field theory that captures each time point's positive and negative contributions to a layer's latent features. The resulting depth-specific MLP experts deliver state-of-the-art accuracy (ranking first in 78% of cases across seven real-world benchmarks) with far fewer parameters and less compute, and integrate seamlessly into transformer variants.
Link: https://arxiv.org/abs/2602.00624
Authors: Hyekyung Yoon, Minhyuk Lee, Imseung Park, Myungjoo Kang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Multivariate long-term time series forecasting (LTSF) supports critical applications such as traffic-flow management, solar-power scheduling, and electricity-transformer monitoring. The existing LTSF paradigms follow a three-stage pipeline of embedding, backbone refinement, and long-horizon prediction. However, the behaviors of individual backbone layers remain underexplored. We introduce layer sensitivity, a gradient-based metric inspired by GradCAM and effective receptive field theory, which quantifies both positive and negative contributions of each time point to a layer’s latent features. Applying this metric to a three-layer MLP backbone reveals depth-specific specialization in modeling temporal dynamics in the input sequence. Motivated by these insights, we propose MoDEx, a lightweight Mixture of Depth-specific Experts, which replaces complex backbones with depth-specific MLP experts. MoDEx achieves state-of-the-art accuracy on seven real-world benchmarks, ranking first in 78 percent of cases, while using significantly fewer parameters and computational resources. It also integrates seamlessly into transformer variants, consistently boosting their performance and demonstrating robust generalizability as an efficient and high-performance LTSF framework.
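A sketch of a gradient-based layer-sensitivity probe in the spirit described, assuming the chosen layer outputs a plain tensor; the paper's exact aggregation and normalization may differ:

```python
import torch

def layer_sensitivity(model, layer, x):
    feats = {}
    def grab(module, inputs, output):     # assumes `layer` outputs a tensor
        feats["z"] = output
    handle = layer.register_forward_hook(grab)
    x = x.clone().requires_grad_(True)    # (batch, length, channels)
    model(x)
    handle.remove()
    feats["z"].sum().backward()           # signed pull of every input position
    return x.grad.sum(dim=-1)             # (batch, length) sensitivity map
```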
zh
[AI-262] Rethinking Zero-Shot Time Series Classification: From Task-specific Classifiers to In-Context Inference
【Quick Read】: This paper addresses the evaluation bias in zero-shot classification with time series foundation models (TSFMs): the usual frozen-encoder-plus-task-specific-classifier paradigm violates the training-free premise of zero-shot deployment and introduces bias through classifier training choices. The key is TIC-FM, an in-context learning framework that treats the labeled training set as context and predicts all test labels in a single forward pass, without parameter updates, using a split-masked latent memory Transformer that pairs a time series encoder with a lightweight projection adapter. This achieves genuinely training-free transfer; theory shows in-context inference can subsume trained classifiers, and experiments on 128 UCR datasets show strong accuracy, with notable gains in extremely low-label regimes.
Link: https://arxiv.org/abs/2602.00620
Authors: Juntao Fang, Shifeng Xie, Shengbin Nie, Yuhui Ling, Yuming Liu, Zijian Li, Keli Zhang, Lujia Pan, Themis Palpanas, Ruichu Cai
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:The zero-shot evaluation of time series foundation models (TSFMs) for classification typically uses a frozen encoder followed by a task-specific classifier. However, this practice violates the training-free premise of zero-shot deployment and introduces evaluation bias due to classifier-dependent training choices. To address this issue, we propose TIC-FM, an in-context learning framework that treats the labeled training set as context and predicts labels for all test instances in a single forward pass, without parameter updates. TIC-FM pairs a time series encoder and a lightweight projection adapter with a split-masked latent memory Transformer. We further provide theoretical justification that in-context inference can subsume trained classifiers and can emulate gradient-based classifier training within a single forward pass. Experiments on 128 UCR datasets show strong accuracy, with consistent gains in the extreme low-label setting, highlighting training-free transfer.
zh
[AI-263] Inference-Only Prompt Projection for Safe Text-to-Image Generation with TV Guarantees
【Quick Read】: This paper addresses the safety-alignment trade-off in deploying text-to-image (T2I) diffusion models: how to suppress inappropriate generations without degrading benign prompt-image alignment. The core difficulty is that once the reference conditional distribution is fixed, any nontrivial reduction in unsafe content necessarily incurs total variation (TV) deviation from the reference, a principled trade-off. The key is an inference-only prompt projection framework: via a surrogate objective with verification, high-risk prompts are selectively mapped into a tolerance-controlled safe set while benign prompts remain effectively unchanged, with no retraining or fine-tuning of the generator. Across four datasets and three diffusion backbones it achieves 16.7%-60.0% relative reductions in Inappropriate Percentage (IP) while preserving near-reference benign prompt-image alignment on COCO.
Link: https://arxiv.org/abs/2602.00616
Authors: Minhyuk Lee, Hyekyung Yoon, Myungjoo Kang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Text-to-Image (T2I) diffusion models enable high-quality open-ended synthesis, but their real-world deployment demands safeguards that suppress unsafe generations without degrading benign prompt-image alignment. We formalize this tension through a total variation (TV) lens: once the reference conditional distribution is fixed, any nontrivial reduction in unsafe generations necessarily incurs TV deviation from the reference, yielding a principled Safety-Prompt Alignment Trade-off (SPAT). Guided by this view, we propose an inference-only prompt projection framework that selectively intervenes on high-risk prompts via a surrogate objective with verification, mapping them into a tolerance-controlled safe set while leaving benign prompts effectively unchanged, without retraining or fine-tuning the generator. Across four datasets and three diffusion backbones, our approach achieves 16.7-60.0% relative reductions in inappropriate percentage (IP) versus strong model-level alignment baselines, while preserving benign prompt-image alignment on COCO near the unaligned reference.
zh
[AI-264] Structured Self-Consistency:A Multi-Task Evaluation of LLM s on VirtualHome
【Quick Read】: This paper addresses the limitations of large language models (LLMs) in goal understanding, action planning, and execution for embodied AI, particularly on multi-step tasks in virtual environments. The core of the solution is Structured Self-Consistency (SSC), a decoding strategy that samples multiple candidates and selects among them with domain-specific voting mechanisms, improving output quality on structured generation tasks such as subgoal decomposition and action-sequence planning. Experiments show SSC significantly improves performance, with complementary strengths across models: OPENPANGU-7B is stronger at hierarchical planning, while QWEN2.5-7B excels at action-level tasks.
Link: https://arxiv.org/abs/2602.00611
Authors: Jiaqi Xu, Tao Huang, Kai Zhang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Embodied AI requires agents to understand goals, plan actions, and execute tasks in simulated environments. We present a comprehensive evaluation of Large Language Models (LLMs) on the VirtualHome benchmark using the Embodied Agent Interface (EAI) framework. We compare two representative 7B-parameter models, OPENPANGU-7B and QWEN2.5-7B, across four fundamental tasks: Goal Interpretation, Action Sequencing, Subgoal Decomposition, and Transition Modeling. We propose Structured Self-Consistency (SSC), an enhanced decoding strategy that leverages multiple sampling with domain-specific voting mechanisms to improve output quality for structured generation tasks. Experimental results demonstrate that SSC significantly enhances performance, with OPENPANGU-7B excelling at hierarchical planning while QWEN2.5-7B shows advantages in action-level tasks. Our analysis reveals complementary strengths across model types, providing insights for future embodied AI system development.
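A minimal sketch of the SSC idea, with `generate` an assumed stochastic decoder and the canonicalizer deliberately simplistic; the paper's domain-specific voting is richer than exact-match voting:

```python
from collections import Counter

def canonicalize(plan):
    # deliberately simple normalizer; a real voter would parse actions/arguments
    return " ".join(plan.lower().split())

def ssc_decode(generate, prompt, n=8):
    samples = [generate(prompt) for _ in range(n)]   # stochastic decoding assumed
    keys = [canonicalize(s) for s in samples]
    winner, _ = Counter(keys).most_common(1)[0]
    return samples[keys.index(winner)]               # member of the modal class
```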
zh
[AI-265] Scalable Generative Game Engine: Breaking the Resolution Wall via Hardware-Algorithm Co-Design
【Quick Read】: This paper addresses the "Memory Wall" bottleneck that keeps generative game engines from real-time high-resolution rendering: limited memory bandwidth has confined neural world-model generation to resolutions no higher than 64×64. The key is a hardware-algorithm co-design framework that identifies and decouples the compute-bound world model from the memory-bound decoder, deploying them heterogeneously across a cluster of AI accelerators. Its core innovations are: (1) asymmetric resource allocation under sequence-parallelism constraints to optimize throughput; (2) memory-centric operator fusion to reduce off-chip bandwidth; and (3) manifold-aware latent extrapolation that exploits temporal redundancy to mask latency. The system achieves real-time generation at 720×480 (a 50× gain in pixel throughput over baselines), reaching 26.4 FPS on a continuous 3D racing benchmark and 48.3 FPS on a discrete 2D platformer, validating that breaking the memory wall through architectural co-design is a prerequisite for high-fidelity, low-latency neural gameplay.
Link: https://arxiv.org/abs/2602.00608
Authors: Wei Zeng, Xuchen Li, Ruili Feng, Zhen Liu, Fengwei An, Jian Zhao
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Preprint, Under Review
Abstract:Real-time generative game engines represent a paradigm shift in interactive simulation, promising to replace traditional graphics pipelines with neural world models. However, existing approaches are fundamentally constrained by the "Memory Wall," restricting practical deployments to low resolutions (e.g., 64×64). This paper bridges the gap between generative models and high-resolution neural simulations by introducing a scalable Hardware-Algorithm Co-Design framework. We identify that high-resolution generation suffers from a critical resource mismatch: the World Model is compute-bound while the Decoder is memory-bound. To address this, we propose a heterogeneous architecture that intelligently decouples these components across a cluster of AI accelerators. Our system features three core innovations: (1) an asymmetric resource allocation strategy that optimizes throughput under sequence parallelism constraints; (2) a memory-centric operator fusion scheme that minimizes off-chip bandwidth usage; and (3) a manifold-aware latent extrapolation mechanism that exploits temporal redundancy to mask latency. We validate our approach on a cluster of programmable AI accelerators, enabling real-time generation at 720×480 resolution, a 50× increase in pixel throughput over prior baselines. Evaluated on both continuous 3D racing and discrete 2D platformer benchmarks, our system delivers fluid 26.4 FPS and 48.3 FPS respectively, with an amortized effective latency of 2.7 ms. This work demonstrates that resolving the "Memory Wall" via architectural co-design is not merely an optimization, but a prerequisite for enabling high-fidelity, responsive neural gameplay.
zh
[AI-266] DockSmith: Scaling Reliable Coding Environments via an Agent ic Docker Builder
【Quick Read】: This paper addresses a key bottleneck in scaling Docker-based environment construction for training and evaluating software engineering agents: completing environment setup reliably and efficiently under complex dependencies. The core of the solution is DockSmith, an agentic builder specialized for Docker that elevates environment construction from a preprocessing step to a core agentic capability exercising long-horizon tool use, dependency reasoning, and failure recovery, trained on large-scale execution-grounded Docker-building trajectories augmented with a loop-detection controller and a cross-task success memory for robustness and generalization. The method reaches the open-source state of the art on Multi-Docker-Eval (Fail-to-Pass: 39.72%, Commit Rate: 58.28%) and also improves out-of-distribution performance on SWE-bench Verified, SWE-bench Multilingual, and Terminal-Bench 2.0, confirming the broad benefits of treating environment construction as a core agentic capability.
Link: https://arxiv.org/abs/2602.00592
Authors: Jiaran Zhang, Luck Ma, Yanhao Li, Fanqi Wan, Di Qi, Xu Zhao, Jieyi Hou, Zhe Xie, Mengqiang Ren, Xin Wu, Zhewei Huang, Liangyu Chen, Yingwei Ma, Qi Han, Xiangyu Zhang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:
Abstract:Reliable Docker-based environment construction is a dominant bottleneck for scaling execution-grounded training and evaluation of software engineering agents. We introduce DockSmith, a specialized agentic Docker builder designed to address this challenge. DockSmith treats environment construction not only as a preprocessing step, but as a core agentic capability that exercises long-horizon tool use, dependency reasoning, and failure recovery, yielding supervision that transfers beyond Docker building itself. DockSmith is trained on large-scale, execution-grounded Docker-building trajectories produced by a SWE-Factory-style pipeline augmented with a loop-detection controller and a cross-task success memory. Training a 30B-A3B model on these trajectories achieves open-source state-of-the-art performance on Multi-Docker-Eval, with 39.72% Fail-to-Pass and 58.28% Commit Rate. Moreover, DockSmith improves out-of-distribution performance on SWE-bench Verified, SWE-bench Multilingual, and Terminal-Bench 2.0, demonstrating broader agentic benefits of environment construction.
zh
[AI-267] Exploring Information Seeking Agent Consolidation
【Quick Read】: This paper addresses the poor scalability and weak cross-domain generalization of existing information-seeking agents, which are specialized for particular knowledge sources (the open web, documents, or local knowledge bases). To consolidate heterogeneous information-seeking agents into a single model, it studies two complementary strategies: data-level consolidation, which jointly trains one unified model on a mixture of domain-specific datasets; and parameter-level consolidation, which merges independently trained agent models at the parameter level. The findings show data-level consolidation remains a stable, reliable baseline, while parameter-level consolidation is efficient and promising but faces behavioral interference and robustness challenges; key design factors include fine-grained merging granularity, awareness of task heterogeneity, and a principled consensus strategy.
Link: https://arxiv.org/abs/2602.00585
Authors: Guochen Yan, Jialong Wu, Zhengwei Tao, Bo Li, Qintong Zhang, Jiahao Xu, Haitao Mi, Yuejian Fang, Qingni Shen, Wentao Zhang, Zhonghai Wu
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Information-seeking agents have emerged as a powerful paradigm for solving knowledge-intensive tasks. Existing information-seeking agents are typically specialized for open web, documents, or local knowledge bases, which constrains scalability and cross-domain generalization. In this work, we investigate how to consolidate heterogeneous information-seeking agents into a single foundation agentic model. We study two complementary consolidation strategies: data-level consolidation, which jointly trains a unified model on a mixture of domain-specific datasets, and parameter-level consolidation, which merges independently trained agent models at the parameter level. Our analysis compares these approaches in terms of performance retention, cross-domain generalization, and interference across information-seeking behaviors. Our results show that data-level consolidation remains a strong and stable baseline, while parameter-level consolidation offers a promising, efficient alternative but suffers from interference and robustness challenges. We further identify key design factors for effective agent consolidation at the parameter level, including fine-grained merging granularity, awareness of task heterogeneity, and principled consensus strategy.
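Parameter-level consolidation in its simplest form is uniform weight averaging of agents fine-tuned from one shared base; the sketch below shows that baseline, while the paper also studies finer-grained merging:

```python
import torch

def merge_state_dicts(state_dicts):
    """Uniform average of parameters from models sharing one architecture."""
    return {k: torch.stack([sd[k].float() for sd in state_dicts]).mean(dim=0)
            for k in state_dicts[0]}

# merged = merge_state_dicts([web.state_dict(), doc.state_dict(), kb.state_dict()])
# base_model.load_state_dict(merged)
```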
zh
[AI-268] Small Shifts Large Gains: Unlocking Traditional TSP Heuristic Guided-Sampling via Unsupervised Neural Instance Modification
【Quick Read】: This paper addresses the tendency of traditional deterministic heuristic tour constructors (e.g., Farthest or Nearest Insertion) for the Traveling Salesman Problem (TSP) to get trapped in local optima due to limited exploration, while existing neural heuristics depend on extensive labeled data and costly training. The key is TSP-MDF, a novel instance-modification framework: a neural instance modifier, trained in an unsupervised manner, strategically shifts node coordinates to sample multiple modified TSP instances; the base deterministic heuristic constructs tours on these modified instances, and the tours are mapped back to the original instance, enabling effective exploration of higher-quality solutions. The framework trains efficiently without any ground-truth supervision, substantially improves traditional constructors, matches the solution quality of neural heuristic constructors, and keeps training time extremely short, combining efficiency and practicality.
Link: https://arxiv.org/abs/2602.00580
Authors: Wei Huang, Hanchen Wang, Dong Wen, Wenjie Zhang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:The Traveling Salesman Problem (TSP) is one of the most representative NP-hard problems in route planning and a long-standing benchmark in combinatorial optimization. Traditional heuristic tour constructors, such as Farthest or Nearest Insertion, are computationally efficient and highly practical, but their deterministic behavior limits exploration and often leads to local optima. In contrast, neural-based heuristic tour constructors alleviate this issue through guided-sampling and typically achieve superior solution quality, but at the cost of extensive training and reliance on ground-truth supervision, hindering their practical use. To bridge this gap, we propose TSP-MDF, a novel instance modification framework that equips traditional deterministic heuristic tour constructors with guided-sampling capability. Specifically, TSP-MDF introduces a neural-based instance modifier that strategically shifts node coordinates to sample multiple modified instances, on which the base traditional heuristic tour constructor constructs tours that are mapped back to the original instance, allowing traditional tour constructors to explore higher-quality tours and escape local optima. At the same time, benefiting from our instance modification formulation, the neural-based instance modifier can be trained efficiently without any ground-truth supervision, ensuring the framework maintains practicality. Extensive experiments on large-scale TSP benchmarks and real-world benchmarks demonstrate that TSP-MDF significantly improves the performance of traditional heuristic tour constructors, achieving solution quality comparable to neural-based heuristic tour constructors, but with an extremely short training time.
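To make the modify-construct-map-back loop concrete, here is a sketch with a random coordinate perturbation in place of the learned neural modifier (the key substitution to note), paired with a plain nearest-insertion constructor:

```python
import numpy as np

def nearest_insertion(coords):
    n = len(coords)
    tour, rest = [0, 1], set(range(2, n))
    while rest:
        # pick the unvisited city closest to the current tour
        c = min(rest, key=lambda j: min(np.linalg.norm(coords[j] - coords[t])
                                        for t in tour))
        # insert it where the tour grows the least
        def cost(i):
            a, b = tour[i], tour[(i + 1) % len(tour)]
            return (np.linalg.norm(coords[a] - coords[c])
                    + np.linalg.norm(coords[c] - coords[b])
                    - np.linalg.norm(coords[a] - coords[b]))
        i = min(range(len(tour)), key=cost)
        tour.insert(i + 1, c)
        rest.remove(c)
    return tour

def tour_len(coords, tour):
    return sum(np.linalg.norm(coords[tour[i]] - coords[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def guided_sampling(coords, samples=16, eps=0.02):
    best = nearest_insertion(coords)
    for _ in range(samples):
        shifted = coords + np.random.uniform(-eps, eps, coords.shape)  # modify
        cand = nearest_insertion(shifted)           # construct on modified instance
        if tour_len(coords, cand) < tour_len(coords, best):  # score on the original
            best = cand
    return best
```

The same deterministic constructor now yields many distinct candidate tours, one per perturbed instance, and the best tour under the original coordinates is kept.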
[AI-269] Data Distribution as a Lever for Guiding Optimizers Toward Superior Generalization in LLMs
【Quick Read】: This paper asks whether adjusting the training data distribution can steer optimizers toward better-generalizing solutions when training large language models (LLMs). Although Sharpness-Aware Minimization (SAM) markedly improves generalization, its computational overhead makes it impractical for training even medium-sized LLMs. The authors give the first theoretical comparison of the training dynamics of gradient descent (GD) and SAM on an in-context linear regression model with multi-head linear self-attention, and find that SAM's superior generalization stems from a lower simplicity bias (SB), i.e., a weaker tendency to preferentially learn simpler features early in training. Motivated by this insight, they propose a lightweight strategy: upsampling or augmenting examples that are learned late in training similarly reduces SB and yields SAM-like generalization gains. Experiments show consistent improvements across multiple LLMs, with relative accuracy gains of up to 18% when fine-tuning on mathematical reasoning tasks with the AdamW and Muon optimizers.
Link: https://arxiv.org/abs/2602.00576
Authors: Tushaar Gangavarapu,Jiping Li,Christopher Vattheuer,Zhangyang Wang,Baharan Mirzasoleiman
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Can modifying the training data distribution guide optimizers toward solutions with improved generalization when training large language models (LLMs)? In this work, we theoretically analyze an in-context linear regression model with multi-head linear self-attention, and compare the training dynamics of two gradient-based optimizers, namely gradient descent (GD) and sharpness-aware minimization (SAM), the latter exhibiting superior generalization properties but being prohibitively expensive for training even medium-sized LLMs. We show, for the first time, that SAM induces a lower simplicity bias (SB), the tendency of an optimizer to preferentially learn simpler features earlier in training, and identify this reduction as a key factor underlying its improved generalization performance. Motivated by this insight, we demonstrate that altering the training data distribution by upsampling or augmenting examples learned later in training similarly reduces SB and leads to improved generalization. Our extensive experiments show that our strategy improves the performance of multiple LLMs (Phi2-2.7B, Llama3.2-1B, Gemma3-1B-PT, and Qwen3-0.6B-Base), achieving relative accuracy gains up to 18% when fine-tuned with AdamW and Muon on mathematical reasoning tasks.
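The data-distribution intervention itself is simple to sketch: record when each example is first reliably fitted, then upsample the late-learned ones. A minimal sketch under assumed bookkeeping (the per-epoch correctness matrix and the weighting power are illustrative choices, not the paper's exact recipe):

```python
import numpy as np

def first_learned_epoch(correct_history):
    """correct_history: bool array (epochs, n_examples); entry [t, i] says
    whether example i was predicted correctly at epoch t. Returns, per
    example, the first epoch from which it stays correct thereafter."""
    epochs, n = correct_history.shape
    first = np.full(n, epochs)            # never learned -> treated as latest
    for i in range(n):
        for t in range(epochs):
            if correct_history[t:, i].all():
                first[i] = t
                break
    return first

def upsample_weights(first_epochs, power=1.0):
    """Sampling weights that grow with how late an example was learned."""
    w = (1.0 + first_epochs.astype(float)) ** power
    return w / w.sum()

# toy usage: 5 epochs, 8 examples
rng = np.random.default_rng(0)
history = rng.random((5, 8)) > 0.3
weights = upsample_weights(first_learned_epoch(history))
resampled = rng.choice(8, size=8, p=weights)  # indices under the reshaped distribution
print(weights.round(3), resampled)
```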
[AI-270] Learning Modal-Mixed Chain-of-Thought Reasoning with Latent Embeddings
【Quick Read】: This paper targets the limitation of text-only chain-of-thought (CoT) on vision-intensive multimodal reasoning tasks, where key intermediate states are inherently visual and cannot be expressed or exploited effectively in text alone. The key to the solution is modal-mixed CoT, which interleaves textual tokens with compact visual sketches represented as latent embeddings produced by the vision-language model (VLM) itself; the VLM serves as the encoder, and the language backbone is trained to reconstruct the VLM's own intermediate vision embeddings, guaranteeing semantic alignment of the visual latent space. A diffusion-based latent decoder, invoked by a special control token and conditioned on the VLM's hidden states, carries fine-grained perceptual details while the VLM specifies high-level intent, cleanly disentangling the two roles and reducing the optimization pressure on the VLM.
Link: https://arxiv.org/abs/2602.00574
Authors: Yifei Shao,Kun Zhou,Ziming Xu,Mohammad Atif Quamar,Shibo Hao,Zhen Wang,Zhiting Hu,Biwei Huang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:We study how to extend chain-of-thought (CoT) beyond language to better handle multimodal reasoning. While CoT helps LLMs and VLMs articulate intermediate steps, its text-only form often fails on vision-intensive problems where key intermediate states are inherently visual. We introduce modal-mixed CoT, which interleaves textual tokens with compact visual sketches represented as latent embeddings. To bridge the modality gap without eroding the original knowledge and capability of the VLM, we use the VLM itself as an encoder and train the language backbone to reconstruct its own intermediate vision embeddings, to guarantee the semantic alignment of the visual latent space. We further attach a diffusion-based latent decoder, invoked by a special control token and conditioned on hidden states from the VLM. In this way, the diffusion head carries fine-grained perceptual details while the VLM specifies high-level intent, which cleanly disentangles roles and reduces the optimization pressure of the VLM. Training proceeds in two stages: supervised fine-tuning on traces that interleave text and latents with a joint next-token and latent-reconstruction objective, followed by reinforcement learning that teaches when to switch modalities and how to compose long reasoning chains. Extensive experiments across 11 diverse multimodal reasoning tasks, demonstrate that our method yields better performance than language-only and other CoT methods. Our code will be publicly released.
[AI-271] Uncovering Latent Communication Patterns in Brain Networks via Adaptive Flow Routing
【Quick Read】: This paper tackles a core question of how macroscopic cognitive phenotypes emerge from microscopic neuronal connectivity (structural connectivity, SC), with particular attention to the causes of the dynamic coupling and heterogeneity between SC and functional connectivity (FC). Existing methods fuse SC and FC information at the regional level but lack insight into the latent interactions between neural regions, and thus cannot explain why SC and FC exhibit dynamically varying states. The key to the solution is to view the problem through the lens of neural communication dynamics: the proposed physics-informed Adaptive Flow Routing Network (AFR-Net) models how structural constraints (SC) give rise to functional communication patterns (FC), enabling interpretable discovery of critical neural pathways and significantly outperforming state-of-the-art baselines.
Link: https://arxiv.org/abs/2602.00561
Authors: Tianhao Huang,Guanghui Min,Zhenyu Lei,Aiying Zhang,Chen Chen
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Unraveling how macroscopic cognitive phenotypes emerge from microscopic neuronal connectivity remains one of the core pursuits of neuroscience. To this end, researchers typically leverage multi-modal information from structural connectivity (SC) and functional connectivity (FC) to complete downstream tasks. Recent methodologies explore the intricate coupling mechanisms between SC and FC, attempting to fuse their representations at the regional level. However, lacking fundamental neuroscientific insight, these approaches fail to uncover the latent interactions between neural regions underlying these connectomes, and thus cannot explain why SC and FC exhibit dynamic states of both coupling and heterogeneity. In this paper, we formulate multi-modal fusion through the lens of neural communication dynamics and propose the Adaptive Flow Routing Network (AFR-Net), a physics-informed framework that models how structural constraints (SC) give rise to functional communication patterns (FC), enabling interpretable discovery of critical neural pathways. Extensive experiments demonstrate that AFR-Net significantly outperforms state-of-the-art baselines. The code is available at this https URL.
[AI-272] Contrastive Domain Generalization for Cross-Instrument Molecular Identification in Mass Spectrometry
【Quick Read】: This paper addresses the generalization bottleneck in molecular identification from mass spectrometry (MS) data: existing deep learning methods treat spectral matching as a closed-set recognition task and generalize poorly to unseen molecular scaffolds. The key to the solution is a cross-modal alignment framework that maps mass spectra directly into the chemically meaningful molecular structure embedding space of a pretrained chemical language model, explicitly integrating physical spectral resolution with chemical structure semantics, and thereby substantially improving generalization under both zero-shot retrieval and global retrieval settings.
Link: https://arxiv.org/abs/2602.00547
Authors: Seunghyun Yoo,Sanghong Kim,Namkyung Yoon,Hwangnam Kim
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 2 figures
Abstract:Identifying molecules from mass spectrometry (MS) data remains a fundamental challenge due to the semantic gap between physical spectral peaks and underlying chemical structures. Existing deep learning approaches often treat spectral matching as a closed-set recognition task, limiting their ability to generalize to unseen molecular scaffolds. To overcome this limitation, we propose a cross-modal alignment framework that directly maps mass spectra into the chemically meaningful molecular structure embedding space of a pretrained chemical language model. On a strict scaffold-disjoint benchmark, our model achieves a Top-1 accuracy of 42.2% in fixed 256-way zero-shot retrieval and demonstrates strong generalization under a global retrieval setting. Moreover, the learned embedding space demonstrates strong chemical coherence, reaching 95.4% accuracy in 5-way 5-shot molecular re-identification. These results suggest that explicitly integrating physical spectral resolution with molecular structure embedding is key to solving the generalization bottleneck in molecular identification from MS data.
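Mapping spectra into a fixed structure-embedding space is typically trained with a symmetric contrastive (InfoNCE/CLIP-style) objective over matched pairs. A minimal sketch, assuming a trainable spectrum encoder and frozen chemical-LM embeddings; the paper's exact loss and encoders may differ:

```python
import torch
import torch.nn.functional as F

def clip_style_alignment_loss(spec_emb, mol_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning spectrum embeddings with (frozen)
    molecular-structure embeddings; row i of each batch is a matched pair."""
    spec = F.normalize(spec_emb, dim=-1)
    mol = F.normalize(mol_emb, dim=-1)
    logits = spec @ mol.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(len(spec))          # diagonal entries are positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# toy usage: spectrum-encoder output vs. pretrained chemical-LM embeddings
spec_emb = torch.randn(16, 256, requires_grad=True)  # trainable side
mol_emb = torch.randn(16, 256)                       # frozen target space
loss = clip_style_alignment_loss(spec_emb, mol_emb)
loss.backward()
print(float(loss))
```

At retrieval time the same normalized dot product ranks candidate molecules against a query spectrum, which is what the zero-shot and global retrieval numbers measure.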
[AI-273] Convergent World Representations and Divergent Tasks
【Quick Read】: This paper investigates the geometry of neural representations and its role in downstream adaptability, especially how multi-task learning affects the integration of new entities and generalization. The key to the solution is a controlled experimental framework that cleanly separates the underlying world, the data generation process, and the model's representations: 5,075 city coordinates define the world and 7 geometric tasks generate autoregressive training data. Different tasks produce markedly different representation geometries, whereas multi-task training drives representations to converge, providing controlled evidence for the Multitask Scaling Hypothesis of the Platonic Representation Hypothesis. The study further identifies a class of "divergent tasks" that, even after multi-task pretraining, actively harm the representational integration of new entities, showing that the choice of training tasks has a critical effect on model adaptability.
Link: https://arxiv.org/abs/2602.00533
Authors: Core Francisco Park
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:While neural representations are central to modern deep learning, the conditions governing their geometry and their roles in downstream adaptability remain poorly understood. We develop a framework clearly separating the underlying world, the data generation process and the resulting model representations to study these questions in a controlled setup. 5,075 city coordinates define the world and 7 geometric tasks generate the training data for autoregressive training. We find that different tasks give rise to qualitatively and quantitatively distinct world representation geometries. However, multi-task training drives convergence of world representations: models trained on non-overlapping tasks develop aligned geometric representations, providing controlled evidence for the Multitask Scaling Hypothesis of the Platonic Representation Hypothesis. To study adaptation, we pretrain models on all tasks, then test whether new entities (cities) can be consistently integrated into the representation space via fine-tuning. Surprisingly, we find that despite multi-task pretraining, some tasks, which we call divergent, actively harm the representational integration of new entities and harm generalization. Our results show that training on multiple relational tasks reliably produces convergent world representations, but lurking divergent tasks can catastrophically harm new entity integration via fine-tuning.
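Convergence of "world representations" across models is usually quantified with a rotation-invariant similarity index such as linear CKA. A minimal sketch of that measurement, with random feature maps standing in for two models' learned city embeddings (the paper's actual metric and models are not specified here):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two representation matrices
    (n_entities x dim); 1.0 means identical geometry up to rotation/scale."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
cities = rng.random((5075, 2))                 # the latent "world": city coordinates
W_a, W_b = rng.normal(size=(2, 64)), rng.normal(size=(2, 64))
rep_task_a = np.tanh(cities @ W_a)             # stand-ins for two models' embeddings
rep_task_b = np.tanh(cities @ W_b)
print(f"CKA between task representations: {linear_cka(rep_task_a, rep_task_b):.3f}")
```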
[AI-274] How Far Are LLMs from Professional Poker Players? Revisiting Game-Theoretic Reasoning with Agentic Tool Use ICLR2026
【Quick Read】: This paper examines the insufficient strategic-reasoning ability of large language models (LLMs) in high-stakes settings under uncertainty, using poker as a testbed. The study finds that current LLMs play poorly in practice, with three recurring systematic flaws: reliance on heuristics, factual misunderstandings, and a "knowing-doing" gap in which actions diverge from reasoning, producing large deviations from game-theoretically optimal (GTO) play. The key to the solution is ToolPoker, a tool-integrated reasoning framework that incorporates external solvers (such as GTO solvers) into the reasoning loop to produce game-theoretically consistent actions, combined with more precise professional-style explanations, improving both gameplay and the interpretability of the reasoning. Experiments show that ToolPoker achieves state-of-the-art gameplay while producing reasoning traces that closely reflect game-theoretic principles.
Link: https://arxiv.org/abs/2602.00528
Authors: Minhua Lin,Enyan Dai,Hui Liu,Xianfeng Tang,Yuliang Yan,Zhenwei Dai,Jingying Zeng,Zhiwei Zhang,Fali Wang,Hongcheng Gao,Chen Luo,Xiang Zhang,Qi He,Suhang Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted by ICLR 2026
Abstract:As Large Language Models (LLMs) are increasingly applied in high-stakes domains, their ability to reason strategically under uncertainty becomes critical. Poker provides a rigorous testbed, requiring not only strong actions but also principled, game-theoretic reasoning. In this paper, we conduct a systematic study of LLMs in multiple realistic poker tasks, evaluating both gameplay outcomes and reasoning traces. Our analysis reveals LLMs fail to compete against traditional algorithms and identifies three recurring flaws: reliance on heuristics, factual misunderstandings, and a “knowing-doing” gap where actions diverge from reasoning. An initial attempt with behavior cloning and step-level reinforcement learning improves reasoning style but remains insufficient for accurate game-theoretic play. Motivated by these limitations, we propose ToolPoker, a tool-integrated reasoning framework that combines external solvers for GTO-consistent actions with more precise professional-style explanations. Experiments demonstrate that ToolPoker achieves state-of-the-art gameplay while producing reasoning traces that closely reflect game-theoretic principles.
[AI-275] Physiology as Language: Translating Respiration to Sleep EEG
【Quick Read】: This paper addresses the cross-physiology translation problem of synthesizing sleep electroencephalography (EEG) from respiration signals, whose core challenge is the significant complexity gap between the two modalities. The key to the solution is a waveform-conditional generative framework that preserves fine-grained respiratory dynamics while constraining the EEG target space through discrete tokenization, enabling high-fidelity EEG reconstruction. Trained on over 28,000 individuals, the model achieves a 7% mean absolute error in EEG spectrogram reconstruction, and the synthesized EEG supports downstream tasks such as age estimation, sex detection, and sleep staging with performance comparable to ground-truth EEG, validating the effectiveness and generality of the approach.
Link: https://arxiv.org/abs/2602.00526
Authors: Kaiwen Zha,Chao Li,Hao He,Peng Cao,Tianhong Li,Ali Mirzazadeh,Ellen Zhang,Jong Woo Lee,Yoon Kim,Dina Katabi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Tech report
Abstract:This paper introduces a novel cross-physiology translation task: synthesizing sleep electroencephalography (EEG) from respiration signals. To address the significant complexity gap between the two modalities, we propose a waveform-conditional generative framework that preserves fine-grained respiratory dynamics while constraining the EEG target space through discrete tokenization. Trained on over 28,000 individuals, our model achieves a 7% Mean Absolute Error in EEG spectrogram reconstruction. Beyond reconstruction, the synthesized EEG supports downstream tasks with performance comparable to ground truth EEG on age estimation (MAE 5.0 vs. 5.1 years), sex detection (AUROC 0.81 vs. 0.82), and sleep staging (Accuracy 0.84 vs. 0.88), significantly outperforming baselines trained directly on breathing. Finally, we demonstrate that the framework generalizes to contactless sensing by synthesizing EEG from wireless radio-frequency reflections, highlighting the feasibility of remote, non-contact neurological assessment during sleep.
[AI-276] Diagnosing the Reliability of LLM-as-a-Judge via Item Response Theory
【Quick Read】: This paper addresses the lack of reliability and stability verification for LLM-as-a-Judge in automated evaluation: existing practice validates only observed outputs and cannot tell whether an LLM judge functions as a stable measurement instrument. The key to the solution is a two-phase diagnostic framework grounded in Item Response Theory (IRT), using the Graded Response Model (GRM) to quantify reliability along two complementary dimensions: (1) intrinsic consistency, the stability of measurement behavior under prompt variations, and (2) human alignment, the degree of correspondence with human quality assessments. The IRT-GRM framework yields interpretable diagnostic signals for systematically verifying the reliability of LLM judges and identifying potential causes of unreliability.
Link: https://arxiv.org/abs/2602.00521
Authors: Junhyuk Choi,Sohhyung Park,Chanhee Cho,Hyeonchu Park,Bugeun Kim
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Under review
Abstract:While LLM-as-a-Judge is widely used in automated evaluation, existing validation practices primarily operate at the level of observed outputs, offering limited insight into whether LLM judges themselves function as stable and reliable measurement instruments. To address this limitation, we introduce a two-phase diagnostic framework for assessing the reliability of LLM-as-a-Judge, grounded in Item Response Theory (IRT). The framework adopts the Graded Response Model (GRM) of IRT and formalizes reliability along two complementary dimensions: (1) intrinsic consistency, defined as the stability of measurement behavior under prompt variations, and (2) human alignment, capturing correspondence with human quality assessments. We empirically examine diverse LLM judges with this framework, and show that leveraging IRT-GRM yields interpretable signals for diagnosing judgments systematically. These signals provide practical guidance for verifying the reliability of LLM-as-a-Judge and identifying potential causes of unreliability.
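For readers unfamiliar with the GRM: it models the cumulative probability of awarding at least grade k as a logistic function of latent quality, P(Y >= k) = sigmoid(a(theta - b_k)), with per-category probabilities obtained by adjacent differences. A minimal numeric sketch (parameter values are illustrative only):

```python
import numpy as np

def grm_category_probs(theta, discrimination, thresholds):
    """Graded Response Model: P(Y = k) for one item, derived from the
    cumulative probabilities P(Y >= k) = sigmoid(a * (theta - b_k))."""
    b = np.asarray(thresholds)                 # ordered category thresholds b_k
    cum = 1.0 / (1.0 + np.exp(-discrimination * (theta - b)))
    cum = np.concatenate(([1.0], cum, [0.0]))  # P(Y>=0)=1, P(Y>=K+1)=0
    return cum[:-1] - cum[1:]                  # adjacent differences

# toy usage: a judge rating an answer of latent quality theta on a 4-point scale
probs = grm_category_probs(theta=0.5, discrimination=1.8, thresholds=[-1.0, 0.0, 1.2])
print(probs.round(3), probs.sum())             # category probabilities, summing to 1
```

Fitting a (theta, a, b_k) model to a judge's ratings is what lets the framework separate a stable measurement instrument (consistent thresholds across prompt variants) from an unreliable one.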
[AI-277] Contrastive Learning for Privacy Enhancements in Industrial Internet of Things
【Quick Read】: This paper addresses the privacy and confidentiality risks in Industrial Internet of Things (IIoT) systems arising from the sensitivity of operational data. The key to the solution is contrastive learning, a self-supervised representation learning paradigm that reduces reliance on labeled data and raw data sharing, enabling privacy-preserving analytics and optimization. The paper emphasizes the unique characteristics of industrial data, system architectures, and application scenarios, provides a systematic review of contrastive-learning-based privacy-preserving techniques, discusses open challenges, and outlines future research directions.
Link: https://arxiv.org/abs/2602.00515
Authors: Lin Liu,Rita Machacy,Simi Kuniyilh
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:The Industrial Internet of Things (IIoT) integrates intelligent sensing, communication, and analytics into industrial environments, including manufacturing, energy, and critical infrastructure. While IIoT enables predictive maintenance and cross-site optimization of modern industrial control systems, such as those in manufacturing and energy, it also introduces significant privacy and confidentiality risks due to the sensitivity of operational data. Contrastive learning, a self-supervised representation learning paradigm, has recently emerged as a promising approach for privacy-preserving analytics by reducing reliance on labeled data and raw data sharing. Although contrastive learning-based privacy-preserving techniques have been explored in the Internet of Things (IoT) domain, this paper offers a comprehensive review of these techniques specifically for privacy preservation in Industrial Internet of Things (IIoT) systems. It emphasizes the unique characteristics of industrial data, system architectures, and various application scenarios. Additionally, the paper discusses solutions and open challenges and outlines future research directions.
[AI-278] PCBSchemaGen: Constraint-Guided Schematic Design via LLM for Printed Circuit Boards (PCB)
【Quick Read】: This paper addresses the absence of a unified framework for automated printed circuit board (PCB) schematic design, which must handle heterogeneous digital, analog, and power signals while complying with real-world integrated circuit (IC) packages and pin constraints. Prior work targets digital or analog circuits alone, and automation has been held back by scarce open-source data and the lack of simulation-based verification. The key to the solution is PCBSchemaGen, a training-free framework combining an LLM agent with constraint-guided synthesis: it drives iterative code generation with domain-specific prompts; verifies design correctness with a knowledge graph (KG) built from real IC datasheets and subgraph isomorphism, encoding pin-role semantics and topological constraints; and is validated on 23 PCB tasks spanning digital, analog, and power domains, demonstrating significant gains in design accuracy and computational efficiency.
Link: https://arxiv.org/abs/2602.00510
Authors: Huanghaohe Zou,Peng Han,Emad Nazerian,Alex Q. Huang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:
Abstract:Printed Circuit Board (PCB) schematic design plays an essential role in all areas of the electronics industry. Unlike prior works that focus on digital or analog circuits alone, PCB design must handle heterogeneous digital, analog, and power signals while adhering to real-world IC packages and pin constraints. Automated PCB schematic design remains unexplored due to the scarcity of open-source data and the absence of simulation-based verification. We introduce PCBSchemaGen, the first training-free framework for PCB schematic design, comprising an LLM agent and constraint-guided synthesis. Our approach makes three contributions: 1. an LLM-based code generation paradigm with iterative feedback and domain-specific prompts. 2. a verification framework leveraging a Knowledge Graph (KG) derived from real-world IC datasheets and Subgraph Isomorphism, encoding pin-role semantics and topological constraints. 3. an extensive experiment on 23 PCB schematic tasks spanning digital, analog, and power domains. Results demonstrate that PCBSchemaGen significantly improves design accuracy and computational efficiency.
[AI-279] From Junior to Senior: Allocating Agency and Navigating Professional Growth in Agent ic AI-Mediated Software Engineering
【Quick Read】: This paper asks how agency is being redistributed between individual engineers and organizations as generative AI spreads through software engineering, particularly across career stages: who holds control over coding, learning, and professional growth. The key finding is that organizational policies constrain agency far more than individual preferences, and the study distills three classes of practice: senior developers maintain control over AI tools through detailed delegation; junior developers must navigate between over-reliance on and cautious avoidance of AI; and seniors' pre-AI instincts and judgment are valuable for mentoring juniors toward sustainable career growth in an AI-augmented environment.
Link: https://arxiv.org/abs/2602.00496
Authors: Dana Feng,Bhada Yun,April Wang
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: To appear in CHI'26
Abstract:Juniors enter as AI-natives, seniors adapted mid-career. AI is not just changing how engineers code-it is reshaping who holds agency across work and professional growth. We contribute junior-senior accounts on their usage of agentic AI through a three-phase mixed-methods study: ACTA combined with a Delphi process with 5 seniors, an AI-assisted debugging task with 10 juniors, and blind reviews of junior prompt histories by 5 more seniors. We found that agency in software engineering is primarily constrained by organizational policies rather than individual preferences, with experienced developers maintaining control through detailed delegation while novices struggle between over-reliance and cautious avoidance. Seniors leverage pre-AI foundational instincts to steer modern tools and possess valuable perspectives for mentoring juniors in their early AI-encouraged career development. From synthesis of results, we suggest three practices that focus on preserving agency in software engineering for coding, learning, and mentorship, especially as AI grows increasingly autonomous.
[AI-280] Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models
【Quick Read】: This paper addresses the infeasibility of centralized training for vision-language models (VLMs) in privacy-sensitive domains such as healthcare and finance, where data sharing is restricted. Existing federated learning (FL) schemes mitigate data-privacy concerns but struggle to align VLMs under client heterogeneity in compute, application requirements, and model architectures. The key innovation of the proposed MoR framework is replacing parameters with preferences as the content communicated during federated alignment, improving scalability and privacy: a KL-regularized visual foundation model serves as the reference; each client independently trains a reward model from local preference annotations, capturing client-specific evaluation signals without exposing raw data; and a routing-based fusion mechanism adaptively aggregates the heterogeneous reward signals, after which the global model is updated with GRPO-based policy optimization. This yields robust generalization and adaptation across clients, clearly outperforming existing federated alignment baselines.
Link: https://arxiv.org/abs/2602.00485
Authors: Shule Lu,Yujing Wang,Hainan Zhang,Xiaoshan Yang,Hongwei Zheng,Yongxin Tong,Changsheng Xu,Zhiming Zheng
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:VLMs have broad potential in privacy-sensitive domains such as healthcare and finance, yet strict data-sharing constraints render centralized training infeasible. FL mitigates this issue by enabling decentralized training, but practical deployments face challenges due to client heterogeneity in computational resources, application requirements, and model architectures. We argue that while replacing data with model parameters characterizes the present of FL, replacing parameters with preferences represents a more scalable and privacy-preserving future. Motivated by this perspective, we propose MoR, a federated alignment framework based on GRPO with Mixture-of-Rewards for heterogeneous VLMs. MoR initializes a visual foundation model as a KL-regularized reference, while each client locally trains a reward model from local preference annotations, capturing specific evaluation signals without exposing raw data. To reconcile heterogeneous rewards, we introduce a routing-based fusion mechanism that adaptively aggregates client reward signals. Finally, the server performs GRPO with this mixed reward to optimize the base VLM. Experiments on three public VQA benchmarks demonstrate that MoR consistently outperforms federated alignment baselines in generalization, robustness, and cross-client adaptability. Our approach provides a scalable solution for privacy-preserving alignment of heterogeneous VLMs under federated settings.
[AI-281] Quality-Diversity Optimization as Multi-Objective Optimization
【Quick Read】: This paper addresses the lack of a unified modeling framework for Quality-Diversity (QD) optimization, which seeks a collection of high-performing solutions with diverse behaviors in a user-defined behavior space. The key to the solution is to reformulate QD optimization as a multi-objective optimization (MOO) problem with a huge number of objectives, enabling the direct adoption of well-established MOO methods, particularly set-based scalarization techniques, to solve QD problems through a collaborative search process. This reformulation inherits theoretical guarantees from MOO while providing the diversity-performance balance that QD requires; experiments across several QD applications show performance competitive with state-of-the-art QD algorithms.
Link: https://arxiv.org/abs/2602.00478
Authors: Xi Lin,Ping Guo,Yilu Liu,Qingfu Zhang,Jianyong Sun
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC)
Comments:
Abstract:The Quality-Diversity (QD) optimization aims to discover a collection of high-performing solutions that simultaneously exhibit diverse behaviors within a user-defined behavior space. This paradigm has stimulated significant research interest and demonstrated practical utility in domains including robot control, creative design, and adversarial sample generation. A variety of QD algorithms with distinct design principles have been proposed in recent years. Instead of proposing a new QD algorithm, this work introduces a novel reformulation by casting the QD optimization as a multi-objective optimization (MOO) problem with a huge number of optimization objectives. By establishing this connection, we enable the direct adoption of well-established MOO methods, particularly set-based scalarization techniques, to solve QD problems through a collaborative search process. We further provide a theoretical analysis demonstrating that our approach inherits theoretical guarantees from MOO while providing desirable properties for the QD optimization. Experimental studies across several QD applications confirm that our method achieves performance competitive with state-of-the-art QD algorithms.
[AI-282] PAIR-Former: Budgeted Relational MIL for miRNA Target Prediction
【Quick Read】: This paper addresses large-scale candidate target site (CTS) screening in functional miRNA-mRNA target prediction, where each transcript yields a heavy-tailed pool of candidate sites while only pair-level labels are available and compute is limited. The authors formalize this regime as Budgeted Relational Multi-Instance Learning (BR-MIL), in which at most K instances per bag may receive expensive encoding and relational processing. The key to the solution is PAIR-Former (Pool-Aware Instance-Relational Transformer): a cheap full-pool scan on CPU selects up to K diverse CTSs, after which a permutation-invariant Set Transformer aggregator performs relational modeling on the selected tokens, achieving accurate prediction under a hard compute budget with a controllable accuracy-compute trade-off. Theory further links budgeted selection to an approximation error that decreases with K and to generalization bounds governed by the expensive relational component.
Link: https://arxiv.org/abs/2602.00465
Authors: Jiaqi Yin,Baiming Chen,Jia Fei,Mingjun Yang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint. Under review. During the preprint stage, inquiries and feedback can be directed to Jiaqi Yin (yjqhit@gmail.com)
Abstract:Functional miRNA–mRNA targeting is a large-bag prediction problem: each transcript yields a heavy-tailed pool of candidate target sites (CTSs), yet only a pair-level label is observed. We formalize this regime as Budgeted Relational Multi-Instance Learning (BR-MIL), where at most K instances per bag may receive expensive encoding and relational processing under a hard compute budget. We propose PAIR-Former (Pool-Aware Instance-Relational Transformer), a BR-MIL pipeline that performs a cheap full-pool scan, selects up to K diverse CTSs on CPU, and applies a permutation-invariant Set Transformer aggregator on the selected tokens. On miRAW, PAIR-Former outperforms strong pooling baselines at a practical operating budget (K* = 64) while providing a controllable accuracy–compute trade-off as K varies. We further provide theory linking budgeted selection to (i) approximation error decreasing with K and (ii) generalization terms governed by K in the expensive relational component.
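The budgeted selection step can be illustrated with a greedy relevance-plus-diversity rule. A minimal sketch, assuming cheap per-site scores from the full-pool scan and a max-min diversity term (the paper's actual selector is not specified here, so the trade-off rule and alpha are illustrative):

```python
import numpy as np

def select_budgeted_diverse(features, cheap_scores, budget_k, alpha=0.5):
    """Greedy budgeted selection: repeatedly pick the candidate site with the
    best trade-off between its cheap relevance score and its distance to the
    already-selected sites (max-min diversity), stopping at the hard budget K."""
    n = len(features)
    selected = [int(np.argmax(cheap_scores))]
    while len(selected) < min(budget_k, n):
        dists = np.min(
            np.linalg.norm(features[:, None, :] - features[None, selected, :], axis=-1),
            axis=1)
        gain = alpha * cheap_scores + (1 - alpha) * dists
        gain[selected] = -np.inf               # never re-pick a selected site
        selected.append(int(np.argmax(gain)))
    return selected

rng = np.random.default_rng(0)
cts_features = rng.random((500, 8))  # cheap per-site features for one transcript "bag"
cts_scores = rng.random(500)         # output of the cheap full-pool scan
chosen = select_budgeted_diverse(cts_features, cts_scores, budget_k=64)
print(len(chosen))                   # at most K sites reach the expensive aggregator
```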
[AI-283] LatentTrack: Sequential Weight Generation via Latent Filtering
【Quick Read】: This paper addresses the challenge of online probabilistic prediction under nonstationary dynamics, i.e., modeling uncertainty efficiently and adapting continually as the data distribution drifts. The key to the solution is LatentTrack (LT), a sequential neural architecture that performs causal Bayesian filtering in a low-dimensional latent space and uses a lightweight hypernetwork to generate the predictive model's parameters at each time step, enabling constant-time online adaptation without per-step gradient updates. LT follows a predict-generate-update filtering framework in function space, supports both structured (Markovian) and unstructured latent dynamics within a unified objective, and obtains calibrated predictive mixtures via Monte Carlo inference over latent trajectories at a fixed per-step cost.
Link: https://arxiv.org/abs/2602.00458
Authors: Omer Haq
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
Comments:
Abstract:We introduce LatentTrack (LT), a sequential neural architecture for online probabilistic prediction under nonstationary dynamics. LT performs causal Bayesian filtering in a low-dimensional latent space and uses a lightweight hypernetwork to generate predictive model parameters at each time step, enabling constant-time online adaptation without per-step gradient updates. At each time step, a learned latent model predicts the next latent distribution, which is updated via amortized inference using new observations, yielding a predict–generate–update filtering framework in function space. The formulation supports both structured (Markovian) and unstructured latent dynamics within a unified objective, while Monte Carlo inference over latent trajectories produces calibrated predictive mixtures with fixed per-step cost. Evaluated on long-horizon online regression using the Jena Climate benchmark, LT consistently achieves lower negative log-likelihood and mean squared error than stateful sequential and static uncertainty-aware baselines, with competitive calibration, demonstrating that latent-conditioned function evolution is an effective alternative to traditional latent-state modeling under distribution shift.
[AI-284] Benchmarking Agents in Insurance Underwriting Environments
【Quick Read】: This paper addresses the lack of realistic complexity in current AI-agent benchmarks for enterprise settings: existing benchmarks overemphasize open domains such as code, use narrow accuracy metrics, and ignore key challenges of enterprise operations. The key to the solution is UNDERWRITE, an expert-first, multi-turn insurance underwriting benchmark designed in close collaboration with domain experts, which introduces three realism factors often missing from current benchmarks: proprietary business knowledge, noisy tool interfaces, and imperfect simulated users that require careful information gathering. Evaluating 13 frontier models reveals systematic gaps in efficiency, knowledge accuracy, and robustness, underscoring that expert involvement is essential for building evaluations that reflect enterprise deployment requirements.
Link: https://arxiv.org/abs/2602.00456
Authors: Amanda Dsouza,Ramya Ramakrishnan,Charles Dickens,Bhavishya Pohani,Christopher M Glaze
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:As AI agents integrate into enterprise applications, their evaluation demands benchmarks that reflect the complexity of real-world operations. Instead, existing benchmarks overemphasize open-domains such as code, use narrow accuracy metrics, and lack authentic complexity. We present UNDERWRITE, an expert-first, multi-turn insurance underwriting benchmark designed in close collaboration with domain experts to capture real-world enterprise challenges. UNDERWRITE introduces critical realism factors often absent in current benchmarks: proprietary business knowledge, noisy tool interfaces, and imperfect simulated users requiring careful information gathering. Evaluating 13 frontier models, we uncover significant gaps between research lab performance and enterprise readiness: the most accurate models are not the most efficient, models hallucinate domain knowledge despite tool access, and pass^k results show a 20% drop in performance. The results from UNDERWRITE demonstrate that expert involvement in benchmark design is essential for realistic agent evaluation, common agentic frameworks exhibit brittleness that skews performance reporting, and hallucination detection in specialized domains demands compositional approaches. Our work provides insights for developing benchmarks that better align with enterprise deployment requirements.
[AI-285] Cross-Modal Memory Compression for Efficient Multi-Agent Debate
【Quick Read】: This paper addresses the rapid context growth in multi-agent debate as rounds and agent counts increase, which can exceed context limits and, through repeated summarization, adds overhead and compounds information loss. The key to the solution is DebateOCR, a cross-modal compression framework that replaces long textual debate histories with compact image representations consumed by a dedicated vision encoder to condition subsequent rounds. Histories that commonly span tens to hundreds of thousands of tokens can be compressed by more than 92%, substantially reducing input tokens, compute cost, and inference time. The paper further shows theoretically that diversity across agents supports recovery of compressed-away information: although any single compressed history may lose details, aggregating compressed views from multiple agents lets the collective representation approach the information bottleneck with exponentially high probability.
Link: https://arxiv.org/abs/2602.00454
Authors: Jing Wu,Yue Sun,Tianpei Xie,Suiyao Chen,Jingyuan Bao,Yaopengxiao Xu,Gaoyuan Du,Inseok Heo,Alexander Gutfraind,Xin Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Multi-agent debate can improve reasoning quality and reduce hallucinations, but it incurs rapidly growing context as debate rounds and agent count increase. Retaining full textual histories leads to token usage that can exceed context limits and often requires repeated summarization, adding overhead and compounding information loss. We introduce DebateOCR, a cross-modal compression framework that replaces long textual debate traces with compact image representations, which are then consumed through a dedicated vision encoder to condition subsequent rounds. This design compresses histories that commonly span tens to hundreds of thousands of tokens, cutting input tokens by more than 92% and yielding substantially lower compute cost and faster inference across multiple benchmarks. We further provide a theoretical perspective showing that diversity across agents supports recovery of omitted information: although any single compressed history may discard details, aggregating multiple agents’ compressed views allows the collective representation to approach the information bottleneck with exponentially high probability.
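The core idea, rendering a long textual history into a single raster image for a vision encoder, is easy to picture. A minimal sketch using Pillow with a naive fixed-width wrap; the layout, function names, and turn schema are illustrative assumptions, not the paper's rendering pipeline:

```python
from PIL import Image, ImageDraw

def history_to_image(debate_history, width=896, line_height=14):
    """Render a textual debate history into one raster image that a vision
    encoder can consume in place of thousands of text tokens."""
    lines = []
    for turn in debate_history:
        text = f"[{turn['agent']} r{turn['round']}] {turn['message']}"
        # naive wrap at ~110 characters per rendered line
        lines += [text[i:i + 110] for i in range(0, len(text), 110)]
    img = Image.new("RGB", (width, line_height * (len(lines) + 1)), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((4, i * line_height), line, fill="black")  # default bitmap font
    return img

history = [{"agent": "A", "round": 1, "message": "I claim the answer is 42 because ..."},
           {"agent": "B", "round": 1, "message": "I disagree; consider the base case ..."}]
history_to_image(history).save("debate_round1.png")  # conditions the next round
```

The token savings come from the fact that a whole page of text costs the vision encoder a fixed, small number of image tokens regardless of how many text tokens it would have been.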
[AI-286] Do Latent-CoT Models Think Step-by-Step? A Mechanistic Study on Sequential Reasoning Tasks
【Quick Read】: This paper investigates the unclear mechanisms of latent chain-of-thought (Latent-CoT), i.e., how reliable step-by-step computation can occur without explicitly emitting reasoning steps. The key to the solution is a systematic mechanistic analysis of the CODI model on strictly sequential polynomial-iteration tasks, using logit-lens decoding, linear probes, attention analysis, and activation patching to localize how intermediate states are encoded and routed to the final readout. On short-hop tasks (two and three hops), CODI forms the full set of bridge states and fuses them late; on longer-hop tasks it instead exhibits a partial latent reasoning path that concentrates on late intermediates and fuses them with the last input at the answer readout position, indicating that its strategy shifts from full iteration to compressed or shortcut strategies as task complexity grows.
Link: https://arxiv.org/abs/2602.00449
Authors: Jia Liang,Liangming Pan
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, 14 figures
Abstract:Latent Chain-of-Thought (Latent-CoT) aims to enable step-by-step computation without emitting long rationales, yet its mechanisms remain unclear. We study CODI, a continuous-thought teacher-student distillation model, on strictly sequential polynomial-iteration tasks. Using logit-lens decoding, linear probes, attention analysis, and activation patching, we localize intermediate-state representations and trace their routing to the final readout. On two- and three-hop tasks, CODI forms the full set of bridge states that become decodable across latent-thought positions, while the final input follows a separate near-direct route; predictions arise via late fusion at the end-of-thought boundary. For longer hop lengths, CODI does not reliably execute a full latent rollout, instead exhibiting a partial latent reasoning path that concentrates on late intermediates and fuses them with the last input at the answer readout position. Ablations show that this partial pathway can collapse under regime shifts, including harder optimization. Overall, we delineate when CODI-style latent-CoT yields faithful iterative computation versus compressed or shortcut strategies, and highlight challenges in designing robust latent-CoT objectives for sequential reasoning.
[AI-287] PolarMem: A Training-Free Polarized Latent Graph Memory for Verifiable Multimodal Agents
【Quick Read】: This paper addresses the unreliable reasoning of multimodal agents as they evolve from passive observers to long-horizon decision-makers, caused by the lack of verifiable evidence: the epistemic asymmetry of probabilistic vision-language models and dense associative memories conflates semantic affinity with factual existence and cannot structurally encode negative constraints, leading to frequent hallucination. The key to the solution is PolarMem, a training-free Polarized Latent Graph Memory that converts fuzzy perceptual likelihoods into discrete logical constraints via non-parametric distributional partitioning, and adopts a polarized graph topology with orthogonal inhibitory connections to explicitly store verified negation as a primary cognitive state; at inference time, a logic-dominant retrieval paradigm suppresses hallucinatory patterns that violate negative constraints, grounding reasoning in verifiable evidence.
Link: https://arxiv.org/abs/2602.00415
Authors: Zhisheng Chen,Tingyu Wu,Zijie Zhou,Zhengwei Xie,Ziyan Weng,Yingwei Zhang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:As multimodal agents evolve from passive observers to long-horizon decision-makers, they require memory systems that provide not just information availability but logical verifiability. A fundamental limitation of current architectures is the epistemic asymmetry inherent in probabilistic vision-language models and dense associative memories: they conflate semantic affinity with factual existence and structurally fail to encode negative constraints. To this end, we introduce PolarMem, a training-free Polarized Latent Graph Memory designed to ground agent reasoning in verifiable evidence. PolarMem transforms fuzzy perceptual likelihoods into discrete logical constraints through non-parametric distributional partitioning. Furthermore, it employs a polarized graph topology with orthogonal inhibitory connections to explicitly store verified negation as a primary cognitive state. At inference time, we enforce a logic-dominant retrieval paradigm, suppressing hallucinatory patterns that violate negative constraints. Extensive evaluation across eight frozen Vision–Language Models and six benchmarks demonstrates that PolarMem functions as a robust cognitive system, establishing a foundation for verifiable multimodal agents. Our code is available at this https URL.
[AI-288] Robustness of AutoML on Dirty Categorical Data
【Quick Read】: This paper addresses the poor behavior of automated machine learning (AutoML) methods on dirty categorical data with high cardinality, which often arises from lack of curation and automated collection. While AutoML methods cope with common imperfections such as outliers, multiple scales, and missing values, their robustness to dirty categorical data is unclear. The key to the solution is a data transformation pipeline that converts categorical features into numerical data using more advanced encoding schemes (such as morphological encoders), so that AutoML tools can exploit these encodings to improve predictive performance; benchmarking current robustness against the proposed pipeline also reveals how different encoding strategies affect the ML pipelines AutoML builds and their final performance.
Link: https://arxiv.org/abs/2602.00412
Authors: Marcos L. P. Bueno,Joaquin Vanschoren
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:The goal of automated machine learning (AutoML) is to reduce trial and error when doing machine learning (ML). Although AutoML methods for classification are able to deal with data imperfections, such as outliers, multiple scales and missing data, their behavior is less known on dirty categorical datasets. These datasets often have several categorical features with high cardinality arising from issues such as lack of curation and automated collection. Recent research has shown that ML models can benefit from morphological encoders for dirty categorical data, leading to significantly superior predictive performance. However, the effects of using such encoders in AutoML methods are not known at the moment. In this paper, we propose a pipeline that transforms categorical data into numerical data, so that AutoML methods can handle categorical data transformed by more advanced encoding schemes. We benchmark the current robustness of AutoML methods on a set of dirty datasets and compare it with the proposed pipeline, which allows us to gain insight into differences in predictive performance. We also look at the ML pipelines built by AutoMLs in order to gain insight beyond the best model typically returned by these methods.
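The encode-then-model pipeline shape is standard scikit-learn territory. A minimal sketch using character n-gram TF-IDF as a simple morphological-style encoder (sub-word features make typo'd category strings overlap) with a plain classifier standing in for the AutoML search; the paper's specific encoders and AutoML systems are not reproduced here:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Character n-grams give uncurated strings like "N.Y.C" and "new york city"
# shared sub-word features, a crude morphological encoding.
encoder = ColumnTransformer([
    ("city", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)), "city"),
])

pipeline = Pipeline([
    ("encode", encoder),                              # dirty categorical -> numerical
    ("model", LogisticRegression(max_iter=1000)),     # stand-in for the AutoML search
])

X = pd.DataFrame({"city": ["New York", "new york city", "N.Y.C",
                           "Boston", "boston MA", "Bstn"]})
y = [1, 1, 1, 0, 0, 0]
pipeline.fit(X, y)
print(pipeline.predict(pd.DataFrame({"city": ["NYC area", "Boston!"]})))
```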
[AI-289] Variational Approach for Job Shop Scheduling
【Quick Read】: This paper addresses the training non-stationarity and limited generalization of deep reinforcement learning (DRL) methods for the Job Shop Scheduling Problem (JSSP). Conventional DRL optimizes representation learning and policy execution simultaneously, leaving models sensitive to hyperparameters and poor at generalizing to unseen scheduling instances. The key to the solution is the first application of variational inference to the JSSP: a probabilistic objective based on the Evidence Lower Bound (ELBO) is combined with maximum entropy reinforcement learning to mathematically decouple representation learning from policy optimization, letting the agent learn robust structural representations through a variational graph encoder. This significantly improves training stability and zero-shot generalization across instances, especially on large and challenging benchmarks such as DMU and SWV.
Link: https://arxiv.org/abs/2602.00408
Authors: Seung Heon Oh,Jiwon Baek,Ki Young Cho,Hee Chang Yoon,Jong Hun Woo
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper proposes a novel Variational Graph-to-Scheduler (VG2S) framework for solving the Job Shop Scheduling Problem (JSSP), a critical task in manufacturing that directly impacts operational efficiency and resource utilization. Conventional Deep Reinforcement Learning (DRL) approaches often face challenges such as non-stationarity during training and limited generalization to unseen problem instances because they optimize representation learning and policy execution simultaneously. To address these issues, we introduce variational inference to the JSSP domain for the first time and derive a probabilistic objective based on the Evidence Lower Bound (ELBO) with maximum entropy reinforcement learning. By mathematically decoupling representation learning from policy optimization, the VG2S framework enables the agent to learn robust structural representations of scheduling instances through a variational graph encoder. This approach significantly enhances training stability and robustness against hyperparameter variations. Extensive experiments demonstrate that the proposed method exhibits superior zero-shot generalization compared with state-of-the-art DRL baselines and traditional dispatching rules, particularly on large-scale and challenging benchmark instances such as DMU and SWV.
[AI-290] RobustDebias: Debiasing Language Models using Distributionally Robust Optimization
【Quick Read】: This paper addresses bias amplification when fine-tuning pretrained language models on task-specific datasets, where performance can degrade and existing social biases are further reinforced. The key to the solution is RobustDebias, a novel mechanism that adapts Distributionally Robust Optimization (DRO) to the fine-tuning of BERT-style models: by minimizing worst-case risk, it mitigates bias across multiple demographic groups while having minimal impact on downstream task performance.
Link: https://arxiv.org/abs/2602.00405
Authors: Deep Gandhi,Katyani Singh,Nidhi Hegde
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Pretrained language models have been shown to exhibit biases and social stereotypes. Prior work on debiasing these models has largely focused on modifying embedding spaces during pretraining, which is not scalable for large models. Fine-tuning pretrained models on task-specific datasets can both degrade model performance and amplify biases present in the fine-tuning data. We address bias amplification during fine-tuning rather than costly pretraining, focusing on BERT models due to their widespread use in language understanding tasks. While Empirical Risk Minimization effectively optimizes downstream performance, it often amplifies social biases during fine-tuning. To counter this, we propose RobustDebias, a novel mechanism which adapts Distributionally Robust Optimization (DRO) to debias language models during fine-tuning. Our approach debiases models across multiple demographics during MLM fine-tuning and generalizes to any dataset or task. Extensive experiments on various language models show significant bias mitigation with minimal performance impact.
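A common way to instantiate DRO over demographic groups is the Group DRO objective of Sagawa et al., which upweights the worst-performing groups with an exponentiated-gradient rule. A minimal sketch under the assumption that RobustDebias uses a group-style robust loss; the exact formulation in the paper may differ:

```python
import torch

def group_dro_loss(per_example_loss, group_ids, group_weights, eta=0.01):
    """One Group-DRO step: upweight the worst-performing demographic groups.
    Returns the robust loss and the updated (exponentiated-gradient) weights."""
    n_groups = len(group_weights)
    group_losses = torch.stack([
        per_example_loss[group_ids == g].mean() if (group_ids == g).any()
        else torch.tensor(0.0)
        for g in range(n_groups)
    ])
    new_w = group_weights * torch.exp(eta * group_losses.detach())
    new_w = new_w / new_w.sum()            # stays a distribution over groups
    return (new_w * group_losses).sum(), new_w

# toy usage inside a fine-tuning loop
losses = torch.rand(32, requires_grad=True)   # per-example MLM losses for a batch
groups = torch.randint(0, 3, (32,))           # demographic group of each example
weights = torch.ones(3) / 3
robust_loss, weights = group_dro_loss(losses, groups, weights)
robust_loss.backward()
```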
[AI-291] A Conditional Companion: Lived Experiences of People with Mental Health Disorders Using LLM s
【Quick Read】: This paper addresses the lack of empirical understanding of how people with mental health challenges actually use large language models (LLMs), how they evaluate them, and what design potential they see. Through semi-structured interviews with 20 people in the UK living with mental health conditions, the study finds that users engage with LLMs in conditional and situational ways, motivated by immediacy, a non-judgmental environment, self-paced disclosure, cognitive reframing, and relational engagement; at the same time, participants draw clear boundaries informed by prior therapeutic experience: LLMs suit mild-to-moderate distress but are inadequate for crises, trauma, and complex social-emotional situations. The key contribution is identifying boundary-setting as central to embedding LLMs safely within the mental health care ecosystem, along with corresponding directions for responsible design and governance.
Link: https://arxiv.org/abs/2602.00402
Authors: Aditya Kumar Purohit,Hendrik Heuer
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:
Abstract:Large Language Models (LLMs) are increasingly used for mental health support, yet little is known about how people with mental health challenges engage with them, how they evaluate their usefulness, and what design opportunities they envision. We conducted 20 semi-structured interviews with people in the UK who live with mental health conditions and have used LLMs for mental health support. Through reflexive thematic analysis, we found that participants engaged with LLMs in conditional and situational ways: for immediacy, the desire for non-judgement, self-paced disclosure, cognitive reframing, and relational engagement. Simultaneously, participants articulated clear boundaries informed by prior therapeutic experience: LLMs were effective for mild-to-moderate distress but inadequate for crises, trauma, and complex social-emotional situations. We contribute empirical insights into the lived use of LLMs for mental health, highlight boundary-setting as central to their safe role, and propose design and governance directions for embedding them responsibly within care ecosystem.
[AI-292] ZEST: Zero-shot Embodied Skill Transfer for Athletic Robot Control
【Quick Read】: This paper addresses the difficulty of achieving robust, human-like whole-body control for humanoid robots in contact-rich scenarios, where traditional approaches demand heavy per-skill engineering and brittle controller tuning that generalizes poorly. The key to the solution is ZEST (Zero-shot Embodied Skill Transfer), a framework that trains policies with reinforcement learning from diverse sources (high-fidelity motion capture, noisy monocular video, and non-physics-constrained animation) and deploys them zero-shot to real hardware. Its core components include adaptive sampling that focuses training on difficult motion segments, an automatic curriculum built on a model-based assistive wrench that enables long-horizon dynamic maneuvers, and a joint-level gain selection procedure with refined actuator modeling that ensures robust transfer across platforms (the Atlas humanoid, Unitree G1, and Spot quadruped) without contact labels, reference windows, or state estimators, substantially improving skill generality and deployment efficiency.
Link: https://arxiv.org/abs/2602.00401
Authors: Jean Pierre Sleiman,He Li,Alphonsus Adu-Bredu,Robin Deits,Arun Kumar,Kevin Bergamin,Mohak Bhardwaj,Scott Biddlestone,Nicola Burger,Matthew A. Estrada,Francesco Iacobelli,Twan Koolen,Alexander Lambert,Erica Lin,M. Eva Mungai,Zach Nobles,Shane Rozen-Levy,Yuyao Shi,Jiashun Wang,Jakob Welner,Fangzhou Yu,Mike Zhang,Alfred Rizzi,Jessica Hodgins,Sylvain Bertrand,Yeuhi Abe,Scott Kuindersma,Farbod Farshidian
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Achieving robust, human-like whole-body control on humanoid robots for agile, contact-rich behaviors remains a central challenge, demanding heavy per-skill engineering and a brittle process of tuning controllers. We introduce ZEST (Zero-shot Embodied Skill Transfer), a streamlined motion-imitation framework that trains policies via reinforcement learning from diverse sources – high-fidelity motion capture, noisy monocular video, and non-physics-constrained animation – and deploys them to hardware zero-shot. ZEST generalizes across behaviors and platforms while avoiding contact labels, reference or observation windows, state estimators, and extensive reward shaping. Its training pipeline combines adaptive sampling, which focuses training on difficult motion segments, and an automatic curriculum using a model-based assistive wrench, together enabling dynamic, long-horizon maneuvers. We further provide a procedure for selecting joint-level gains from approximate analytical armature values for closed-chain actuators, along with a refined model of actuators. Trained entirely in simulation with moderate domain randomization, ZEST demonstrates remarkable generality. On Boston Dynamics’ Atlas humanoid, ZEST learns dynamic, multi-contact skills (e.g., army crawl, breakdancing) from motion capture. It transfers expressive dance and scene-interaction skills, such as box-climbing, directly from videos to Atlas and the Unitree G1. Furthermore, it extends across morphologies to the Spot quadruped, enabling acrobatics, such as a continuous backflip, through animation. Together, these results demonstrate robust zero-shot deployment across heterogeneous data sources and embodiments, establishing ZEST as a scalable interface between biological movements and their robotic counterparts.
[AI-293] KEPO: Knowledge-Enhanced Preference Optimization for Reinforcement Learning with Reasoning
【Quick Read】: This paper addresses two core challenges of reinforcement learning (RL) post-training for reasoning: sparse trajectory-level rewards cause ambiguous credit assignment and severe exploration failures that can trap the policy on a "learning cliff"; and existing on-policy distillation applies dense teacher supervision uniformly to all generated trajectories, ignoring that low-quality trajectories often stem from early logical errors, so distilling under flawed contexts injects noisy, misaligned gradients. The key to the solution is Knowledge-Enhanced Preference Optimization (KEPO), a unified framework combining (i) a quality-gated on-policy distillation objective that applies dense teacher guidance only to high-quality trajectories, reducing noise, and (ii) a knowledge-enhanced exploration strategy that uses teacher-derived hints with rejection sampling to focus on reward-positive trajectories, mitigating exploration collapse. Experiments show improved training stability, more coherent reasoning, and superior out-of-distribution generalization.
Link: https://arxiv.org/abs/2602.00400
Authors: Fan Yang,Rui Meng,Trudi Di Qi,Ali Ezzati,Yuxin Wen
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Reinforcement learning (RL) has emerged as a promising paradigm for inducing explicit reasoning behaviors in large language and vision-language models. However, reasoning-oriented RL post-training remains fundamentally challenging due to sparse trajectory-level rewards, leading to ambiguous credit assignment and severe exploration failures that can trap the policy in a ``learning cliff.‘’ Recent on-policy distillation methods introduce dense teacher supervision to stabilize optimization, but apply it uniformly across all generated trajectories. We argue that such uniform distillation is ill-suited for reasoning-intensive tasks, as low-quality on-policy trajectories often originate from early logical errors, and distillation under flawed contexts injects noisy and misaligned gradients. To address these challenges, we propose Knowledge-Enhanced Preference Optimization (KEPO), a unified post-training framework that integrates: (i) a quality-gated on-policy distillation objective that selectively applies dense teacher guidance only to high-quality trajectories, and (ii) a knowledge-enhanced exploration strategy that leverages hints learned from a teacher model to rejectively sample reward-positive on-policy trajectories for RL, thereby mitigating exploration collapse. Evaluated on a challenging medical visual question answering benchmark under single-source generalization, KEPO demonstrates improved training stability, more coherent reasoning behaviors, and superior out-of-distribution performance over reinforcement learning and on-policy distillation baselines.
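The quality gate is the mechanically simple part: distillation loss is masked out for trajectories whose reward falls below a threshold. A minimal sketch of such a gated KL objective; the gating rule, threshold, and loss shape are illustrative assumptions rather than KEPO's exact objective:

```python
import torch
import torch.nn.functional as F

def quality_gated_distill_loss(student_logits, teacher_logits, rewards, threshold=0.5):
    """Apply dense teacher KL guidance only to high-quality trajectories;
    low-reward trajectories contribute no distillation gradient."""
    # logits: (n_traj, seq_len, vocab); rewards: (n_traj,)
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.log_softmax(teacher_logits, dim=-1),
                  log_target=True, reduction="none").sum(-1).mean(-1)  # per trajectory
    gate = (rewards >= threshold).float()      # the quality gate
    return (gate * kl).sum() / gate.sum().clamp(min=1.0)

student = torch.randn(4, 10, 100, requires_grad=True)
teacher = torch.randn(4, 10, 100)
rewards = torch.tensor([1.0, 0.0, 0.8, 0.2])   # trajectory-level rewards
loss = quality_gated_distill_loss(student, teacher, rewards)
loss.backward()
print(float(loss))
```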
[AI-294] Fast Forward: Accelerating LLM Prefill with Predictive FFN Sparsity
【Quick Read】: This paper addresses the compute bottleneck of the prefill stage in large language model (LLM) inference, where at short-to-moderate context lengths (1K-16K tokens) feed-forward networks (FFNs) dominate the FLOPs. Existing FFN sparsification methods are designed for autoregressive decoding, fail to exploit the prefill stage's parallelism, and often degrade accuracy. The key to the solution is FastForward, a block-wise, context-aware FFN sparsification framework with three components: (1) a lightweight expert predictor that selects high-importance neurons per block; (2) an error-compensation network that corrects sparsity-induced errors; and (3) a layer-wise sparsity scheduler that allocates compute according to token-mixing importance. On LLaMA and Qwen models it achieves up to a 1.45x compute-bound speedup at 50% FFN sparsity with a 6% accuracy loss, substantially reducing time-to-first-token (TTFT) in long-context settings.
Link: https://arxiv.org/abs/2602.00397
Authors: Aayush Gautam,Mukul Gagrani,Junyoung Park,Mingu Lee,Chiris Lott,Narasimha Reddy
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 7 figures
Abstract:The prefill stage of large language model (LLM) inference is a key computational bottleneck for long-context workloads. At short-to-moderate context lengths (1K–16K tokens), Feed-Forward Networks (FFNs) dominate this cost, accounting for most of the total FLOPs. Existing FFN sparsification methods, designed for autoregressive decoding, fail to exploit the prefill stage's parallelism and often degrade accuracy. To address this, we introduce FastForward, a predictive sparsity framework that accelerates LLM prefill through block-wise, context-aware FFN sparsity. FastForward combines (1) a lightweight expert predictor to select high-importance neurons per block, (2) an error compensation network to correct sparsity-induced errors, and (3) a layer-wise sparsity scheduler to allocate compute based on token-mixing importance. Across LLaMA and Qwen models up to 8B parameters, FastForward delivers up to 1.45x compute-bound speedup at 50% FFN sparsity with 6% accuracy loss compared to the dense baseline on LongBench, substantially reducing Time-to-First-Token (TTFT) for efficient, long-context LLM inference on constrained hardware.
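Block-wise predictive sparsity amounts to scoring FFN neurons once per prefill block, then computing only the selected weight columns. A minimal sketch with a mean-pooled linear predictor; the predictor design, ReLU activation, and pooling are assumptions for illustration (the paper's components, including the error-compensation network and scheduler, are omitted):

```python
import torch

class PredictiveSparseFFN(torch.nn.Module):
    """FFN whose intermediate neurons are gated per block of prefill tokens by
    a cheap predictor, so only the predicted-important columns are computed."""
    def __init__(self, d_model=512, d_ff=2048, sparsity=0.5):
        super().__init__()
        self.up = torch.nn.Linear(d_model, d_ff)
        self.down = torch.nn.Linear(d_ff, d_model)
        self.predictor = torch.nn.Linear(d_model, d_ff)  # lightweight expert predictor
        self.keep = int(d_ff * (1 - sparsity))

    def forward(self, block):                    # block: (tokens_in_block, d_model)
        scores = self.predictor(block.mean(0))   # one cheap pass per block
        idx = scores.topk(self.keep).indices     # neurons predicted important
        hidden = torch.relu(block @ self.up.weight[idx].t() + self.up.bias[idx])
        return hidden @ self.down.weight[:, idx].t() + self.down.bias

ffn = PredictiveSparseFFN()
out = ffn(torch.randn(128, 512))                 # one prefill block of 128 tokens
print(out.shape)                                 # (128, 512)
```

Because the whole block shares one neuron selection, the matmuls stay dense over the kept columns, which is what makes the speedup compute-bound rather than memory-bound.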
[AI-295] Generalized Inverses of Matrix Products: From Fundamental Subspaces to Randomized Decompositions
【Quick Read】: This paper studies the Moore-Penrose pseudoinverse and generalized inverses (including {1,2}-inverses) of a matrix product A = CR, aiming at a unified framework, grounded in the geometry of the four fundamental subspaces, that reveals the structure underlying randomized linear algebra algorithms such as randomized SVD, the Nyström approximation, and CUR decomposition. The key results are: (1) the reverse-order law A^+ = R^+C^+ holds when C has independent columns and R has independent rows; (2) the universally correct formula A^+ = (C^+CR)^+(CRR^+)^+, with a geometric interpretation of the subspace mappings; (3) a new generalized randomized formula A^+_p = (P^TA)^+P^TAQ(AQ)^+, which equals A^+ if and only if the sketching matrices P and Q preserve the rank of A. The framework also supports rigorous error analysis for applications such as sparse sensor placement and effective-resistance estimation.
Link: https://arxiv.org/abs/2602.00386
Authors: Michał P. Karpowicz,Gilbert Strang
Affiliations: Unknown
Subjects: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:We investigate the Moore-Penrose pseudoinverse and generalized inverse of a matrix product A=CR to establish a unifying framework for generalized and randomized matrix inverses. This analysis is rooted in first principles, focusing on the geometry of the four fundamental subspaces. We examine: (1) the reverse order law, A^+ = R^+C^+, which holds when C has independent columns and R has independent rows, (2) the universally correct formula, A^+ = (C^+CR)^+(CRR^+)^+, providing a geometric interpretation of the mappings between the involved subspaces, (3) a new generalized randomized formula, A^+_p = (P^TA)^+P^TAQ(AQ)^+, which gives A^+_p = A^+ if and only if the sketching matrices P and Q preserve the rank of A, i.e., rank(P^TA) = rank(AQ) = rank(A). The framework is extended to generalized {1,2}-inverses and specialized forms, revealing the underlying structure of established randomized linear algebra algorithms, including randomized SVD, the Nyström approximation, and CUR decomposition. We demonstrate applications in sparse sensor placement and effective resistance estimation. For the latter, we provide a rigorous quantitative analysis of an approximation scheme, establishing that it always underestimates the true resistance and deriving a worst-case spectral bound on the error of resistance differences.
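The randomized formula in (3) is easy to verify numerically: for a rank-r matrix A and Gaussian sketches of width at least r, the rank condition holds almost surely and the formula reproduces the pseudoinverse. A short sanity check (dimensions and sketch widths chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 8, 6, 3
A = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))  # rank-r matrix A = CR

# Gaussian sketches; with width k >= r they preserve rank(A) almost surely
k = 4
P = rng.normal(size=(m, k))
Q = rng.normal(size=(n, k))
assert np.linalg.matrix_rank(P.T @ A) == np.linalg.matrix_rank(A @ Q) == r

pinv = np.linalg.pinv
A_pinv_randomized = pinv(P.T @ A) @ (P.T @ A @ Q) @ pinv(A @ Q)
print(np.allclose(A_pinv_randomized, pinv(A)))          # True: the formula recovers A^+
```

Note that only the sketched quantities P^T A, A Q, and P^T A Q enter the formula, which is exactly why it matches the access pattern of generalized Nyström-style randomized algorithms.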
[AI-296] POET: Protocol Optimization via Eligibility Tuning
【Quick Read】: This paper addresses the time-consuming and cognitively demanding process of drafting eligibility criteria (EC) for clinical trials. Existing automated approaches either require highly structured inputs (such as predefined entities) or generate full criteria end-to-end from trial descriptions alone, which limits their practical utility. The key to the solution is a guided generation framework that introduces interpretable semantic axes extracted by large language models (e.g., demographics, laboratory parameters, and behavioral factors) to steer generation, letting clinicians guide the process without specifying exact entities; a rubric-based evaluation framework then assesses the generated criteria along clinically meaningful dimensions. This balances specificity and usability, yielding a more practical and transparent approach to AI-assisted trial design.
Link: https://arxiv.org/abs/2602.00370
Authors: Trisha Das,Katherine Kero,Dorinda Schumann,Tracy Ohrt,Sanjit Singh Batra,Gregory D Lyng,Robert E. Tillman
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Eligibility criteria (EC) are essential for clinical trial design, yet drafting them remains a time-intensive and cognitively demanding task for clinicians. Existing automated approaches often fall at two extremes: either requiring highly structured inputs, such as predefined entities, to generate specific criteria, or relying on end-to-end systems that produce full eligibility criteria from minimal input such as trial descriptions, limiting their practical utility. In this work, we propose a guided generation framework that introduces interpretable semantic axes, such as Demographics, Laboratory Parameters, and Behavioral Factors, to steer EC generation. These axes, derived using large language models, offer a middle ground between specificity and usability, enabling clinicians to guide generation without specifying exact entities. In addition, we present a reusable rubric-based evaluation framework that assesses generated criteria along clinically meaningful dimensions. Our results show that our guided generation approach consistently outperforms unguided generation in both automatic, rubric-based and clinician evaluations, offering a practical and interpretable solution for AI-assisted trial design.
[AI-297] Position: Agentic Evolution is the Path to Evolving LLMs
【Quick Read】: This position paper addresses the train-deploy gap that emerges as large language models (LLMs) move from curated training environments into open real-world settings: static training cannot keep pace with continually changing deployment conditions, and scaling training-time or inference-time compute improves static capability but does not close the gap. The key to the solution is a new scaling axis, evolution, instantiated in a general framework called A-Evolve that treats deployment-time improvement as goal-directed optimization over persistent system state. Its core idea is granting models the agency to autonomously diagnose failures and produce durable improvements, i.e., agentic evolution; under the proposed evolution-scaling hypothesis, the capacity for adaptation scales with the compute allocated to the evolution process, offering a scalable path to sustained, open-ended real-world adaptation.
Link: https://arxiv.org/abs/2602.00359
Authors: Minhua Lin,Hanqing Lu,Zhan Shi,Bing He,Rui Mao,Zhiwei Zhang,Zongyu Wu,Xianfeng Tang,Hui Liu,Zhenwei Dai,Xiang Zhang,Suhang Wang,Benoit Dumoulin,Jian Pei
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:As Large Language Models (LLMs) move from curated training sets into open-ended real-world environments, a fundamental limitation emerges: static training cannot keep pace with continual deployment environment change. Scaling training-time and inference-time compute improves static capability but does not close this train-deploy gap. We argue that addressing this limitation requires a new scaling axis-evolution. Existing deployment-time adaptation methods, whether parametric fine-tuning or heuristic memory accumulation, lack the strategic agency needed to diagnose failures and produce durable improvements. Our position is that agentic evolution represents the inevitable future of LLM adaptation, elevating evolution itself from a fixed pipeline to an autonomous evolver agent. We instantiate this vision in a general framework, A-Evolve, which treats deployment-time improvement as a deliberate, goal-directed optimization process over persistent system state. We further propose the evolution-scaling hypothesis: the capacity for adaptation scales with the compute allocated to evolution, positioning agentic evolution as a scalable path toward sustained, open-ended adaptation in the real world.
zh
[AI-298] MHDash: An Online Platform for Benchmarking Mental Health-Aware AI Assistants
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在心理健康支持系统中对高风险状态(如自杀意念和自伤行为)识别的可靠性不足问题。现有评估方法主要依赖整体性能指标,难以揭示特定风险场景下的失败模式,尤其在多轮对话中缺乏对风险信号渐进性显现的敏感性。解决方案的关键在于提出MHDash——一个开源平台,集成数据收集、结构化标注、多轮对话生成与基线评估于一体,支持从“关切类型”、“风险等级”和“对话意图”等多维度进行细粒度标注,从而实现对高风险案例的精准识别与风险感知型分析。该平台能够暴露传统基准无法捕捉的模型行为差异,推动更安全、可复现且透明的心理健康AI系统开发。
链接: https://arxiv.org/abs/2602.00353
作者: Yihe Zhang,Cheyenne N Mohawk,Kaiying Han,Vijay Srinivas Tida,Manyu Li,Xiali Hei
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted for presentation at IEEE SoutheastCon 2026. This is the author version of an accepted paper. The final version will appear in IEEE Xplore
Abstract:Large language models (LLMs) are increasingly applied in mental health support systems, where reliable recognition of high-risk states such as suicidal ideation and self-harm is safety-critical. However, existing evaluations primarily rely on aggregate performance metrics, which often obscure risk-specific failure modes and provide limited insight into model behavior in realistic, multi-turn interactions. We present MHDash, an open-source platform designed to support the development, evaluation, and auditing of AI systems for mental health applications. MHDash integrates data collection, structured annotation, multi-turn dialogue generation, and baseline evaluation into a unified pipeline. The platform supports annotations across multiple dimensions, including Concern Type, Risk Level, and Dialogue Intent, enabling fine-grained and risk-aware analysis. Our results reveal several key findings: (i) simple baselines and advanced LLM APIs exhibit comparable overall accuracy yet diverge significantly on high-risk cases; (ii) some LLMs maintain consistent ordinal severity ranking while failing absolute risk classification, whereas others achieve reasonable aggregate scores but suffer from high false negative rates on severe categories; and (iii) performance gaps are amplified in multi-turn dialogues, where risk signals emerge gradually. These observations demonstrate that conventional benchmarks are insufficient for safety-critical mental health settings. By releasing MHDash as an open platform, we aim to promote reproducible research, transparent evaluation, and safety-aligned development of AI systems for mental health support.
zh
[AI-299] Standardized Methods and Recommendations for Green Federated Learning
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在环境影响评估中因测量边界不一致和报告方式异质而导致难以比较的问题。其核心挑战在于缺乏统一、细致的碳足迹追踪方法,尤其在不同计算与通信阶段的排放量化上存在盲区。解决方案的关键在于提出一种基于NVIDIA NVFlare与CodeCarbon的实用碳核算方法,能够对FL流程中的关键阶段(初始化、每轮训练、评估及空闲/协调)进行显式、分阶段的二氧化碳当量(CO2e)追踪,并引入可配置网络能耗模型估算通信过程中的碳排放。通过在CIFAR-10图像分类和视网膜光学盘分割两个典型任务上的验证,表明系统级延迟和协调开销显著增加碳足迹(最高达21.73倍),且GPU硬件层级差异虽带来运行时间差异(如H100 vs. V100为1.7倍),但能源消耗和碳排放分布不均,凸显了按站点和轮次精细化报告的必要性。此方法为实现可复现的“绿色”联邦学习评估奠定了标准化基础。
链接: https://arxiv.org/abs/2602.00343
作者: Austin Tapp,Holger R. Roth,Ziyue Xu,Abhijeet Parida,Hareem Nisar,Marius George Linguraru
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注: 4 sections, 9 pages, 5 figures, 26 references, submission to ACM e-Energy
Abstract:Federated learning (FL) enables collaborative model training over privacy-sensitive, distributed data, but its environmental impact is difficult to compare across studies due to inconsistent measurement boundaries and heterogeneous reporting. We present a practical carbon-accounting methodology for FL CO2e tracking using NVIDIA NVFlare and CodeCarbon for explicit, phase-aware tasks (initialization, per-round training, evaluation, and idle/coordination). To capture non-compute effects, we additionally estimate communication emissions from transmitted model-update sizes under a network-configurable energy model. We validate the proposed approach on two representative workloads: CIFAR-10 image classification and retinal optic disk segmentation. In CIFAR-10, controlled client-efficiency scenarios show that system-level slowdowns and coordination effects can contribute meaningfully to carbon footprint under an otherwise fixed FL protocol, increasing total CO2e by 8.34x (medium) and 21.73x (low) relative to the high-efficiency baseline. In retinal segmentation, swapping GPU tiers (H100 vs. V100) yields a consistent 1.7x runtime gap (290 vs. 503 minutes) while producing non-uniform changes in total energy and CO2e across sites, underscoring the need for per-site and per-round reporting. Overall, our results support a standardized carbon accounting method that acts as a prerequisite for reproducible ‘green’ FL evaluation. Our code is available at this https URL.
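下面是基于 CodeCarbon 的分阶段(phase-aware)碳核算与通信排放估计的极简示意;其中每字节传输能耗与电网碳强度两个常数为假设值,并非论文采用的网络能耗模型参数:

```python
# Phase-aware CO2e tracking with CodeCarbon, plus a simple network-energy
# estimate for communication. JOULES_PER_BYTE and GRID_KG_CO2_PER_KWH are
# assumed constants, not values from the paper.
from codecarbon import EmissionsTracker

JOULES_PER_BYTE = 1.2e-7      # assumed fixed-network transfer cost
GRID_KG_CO2_PER_KWH = 0.4     # assumed grid carbon intensity

def tracked_phase(phase_name, fn, *args, **kwargs):
    """Run one FL phase (init / training round / evaluation) under a tracker."""
    tracker = EmissionsTracker(project_name=phase_name, log_level="error")
    tracker.start()
    try:
        result = fn(*args, **kwargs)
    finally:
        kg_co2e = tracker.stop()  # CodeCarbon returns kg CO2e for the phase
    return result, kg_co2e

def communication_co2e(update_bytes: int, n_clients: int) -> float:
    """CO2e (kg) for shipping one model update to and from every client."""
    joules = 2 * update_bytes * n_clients * JOULES_PER_BYTE  # up + down link
    kwh = joules / 3.6e6
    return kwh * GRID_KG_CO2_PER_KWH

# e.g., _, round_co2 = tracked_phase("round_3_training", train_one_round)
```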
zh
[AI-300] In-Run Data Shapley for Adam Optimizer
【速读】:该论文旨在解决现代机器学习训练中数据贡献度量(data attribution)的可靠性问题,尤其是在使用自适应优化器(如Adam)时,传统基于随机梯度下降(SGD)的“运行中”(In-Run)方法因忽略优化器动态特性而导致 attributions 与真实边际贡献严重偏离(Pearson相关系数约0.11),从而无法有效用于偏差缓解或计算资源优化等下游任务。其解决方案的关键在于提出一种Adam感知的运行中数据Shapley值估计方法(Adam-Aware In-Run Data Shapley),通过在固定状态假设下重新定义效用函数以恢复可加性,并引入线性化幽灵近似(Linearized Ghost Approximation)技术,将依赖样本梯度方差的缩放项线性化,从而无需显式计算每样本梯度即可高效估算成对梯度点积,最终实现对真实边际贡献的高度拟合(相关系数达0.99)且保持接近标准训练吞吐量(约95%)。
链接: https://arxiv.org/abs/2602.00329
作者: Meng Ding,Zeqing Zhang,Di Wang,Lijie Hu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages
Abstract:Reliable data attribution is essential for mitigating bias and reducing computational waste in modern machine learning, with the Shapley value serving as the theoretical gold standard. While recent “In-Run” methods bypass the prohibitive cost of retraining by estimating contributions dynamically, they heavily rely on the linear structure of Stochastic Gradient Descent (SGD) and fail to capture the complex dynamics of adaptive optimizers like Adam. In this work, we demonstrate that data attribution is inherently optimizer-dependent: we show that SGD-based proxies diverge significantly from true contributions under Adam (Pearson $R \approx 0.11$), rendering them ineffective for modern training pipelines. To bridge this gap, we propose Adam-Aware In-Run Data Shapley. We derive a closed-form approximation that restores additivity by redefining utility under a fixed-state assumption and enable scalable computation via a novel Linearized Ghost Approximation. This technique linearizes the variance-dependent scaling term, allowing us to compute pairwise gradient dot-products without materializing per-sample gradients. Extensive experiments show that our method achieves near-perfect fidelity to ground-truth marginal contributions ($R > 0.99$) while retaining $\sim$95% of standard training throughput. Furthermore, our Adam-aware attribution significantly outperforms SGD-based baselines in data attribution downstream tasks.
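下面用 numpy 给出“固定状态假设”思路的一阶示意:冻结 Adam 的二阶矩 v 后,样本 i 的更新方向变为 g_i / (sqrt(v)+eps),其对验证损失的一阶影响是一次梯度点积。这只是对摘要中效用重定义思想的演示,并非论文的精确估计器:

```python
# First-order illustration of the fixed-state idea: freeze Adam's
# second-moment state v, so each sample's update direction is its gradient
# rescaled by 1/(sqrt(v)+eps). Not the paper's exact estimator.
import numpy as np

def adam_aware_contribution(g_i, g_val, v, lr=1e-3, eps=1e-8):
    """First-order change in validation loss attributable to sample i."""
    preconditioned = g_i / (np.sqrt(v) + eps)   # Adam scaling, v held fixed
    return -lr * float(g_val @ preconditioned)  # negative = loss decreases

rng = np.random.default_rng(0)
d = 1000
v = rng.random(d) * 1e-2                 # frozen second-moment estimate
g_val = rng.normal(size=d)               # validation-batch gradient
per_sample = [adam_aware_contribution(rng.normal(size=d), g_val, v)
              for _ in range(8)]
print(sorted(per_sample)[:3])            # most helpful samples first
```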
zh
[AI-301] SayNext-Bench: Why Do LLMs Struggle with Next-Utterance Prediction?
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在人类对话中预测下一句时表现不佳的问题,尤其是其缺乏对多模态线索(如手势、注视和情感语调)的利用能力。研究表明,人类能基于这些多模态线索有效预判对方的回应,而现有LLMs和多模态大语言模型(Multimodal LLMs, MLLMs)尚未具备此类能力。解决方案的关键在于构建一个名为SayNext-Bench的基准测试体系与对应的SayNext-PC大规模多模态对话数据集,并提出一种受认知科学启发的双路径预测模型SayNext-Chat,该模型通过模拟人类对话中的主动预测加工机制(actively predictive processing),显著提升了在词汇重叠度、语义相似性和情感一致性上的表现,验证了多模态线索和主动预测机制对于实现类人对话理解的重要性。
链接: https://arxiv.org/abs/2602.00327
作者: Yueyi Yang,Haotian Liu,Fang Kang,Mengqi Zhang,Zheng Lian,Hao Tang,Haoyu Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:We explore the use of large language models (LLMs) for next-utterance prediction in human dialogue. Despite recent advances in LLMs demonstrating their ability to engage in natural conversations with users, we show that even leading models surprisingly struggle to predict a human speaker’s next utterance. Instead, humans can readily anticipate forthcoming utterances based on multimodal cues, such as gestures, gaze, and emotional tone, from the context. To systematically examine whether LLMs can reproduce this ability, we propose SayNext-Bench, a benchmark that evaluates LLMs and Multimodal LLMs (MLLMs) on anticipating context-conditioned responses from multimodal cues spanning a variety of real-world scenarios. To support this benchmark, we build SayNext-PC, a novel large-scale dataset containing dialogues with rich multimodal cues. Building on this, we further develop a dual-route prediction MLLM, SayNext-Chat, that incorporates cognitively inspired design to emulate predictive processing in conversation. Experimental results demonstrate that our model outperforms state-of-the-art MLLMs in terms of lexical overlap, semantic similarity, and emotion consistency. Our results prove the feasibility of next-utterance prediction with LLMs from multimodal cues and emphasize (i) the indispensable role of multimodal cues and (ii) actively predictive processing as the foundation of natural human interaction, which is missing in current MLLMs. We hope that this exploration offers a new research entry toward more human-like, context-sensitive AI interaction for human-centered AI. Our benchmark and model can be accessed at this https URL.
zh
[AI-302] Optimal Transport-Guided Adversarial Attacks on Graph Neural Network-Based Bot Detection
【速读】:该论文旨在解决当前基于图神经网络(Graph Neural Networks, GNNs)的社交机器人(social bot)检测方法在现实世界约束下鲁棒性不足的问题。现有攻击方法往往忽视了攻击者在实际操作中面临的领域特异性与时序约束,导致其攻击效果难以迁移至真实场景。为此,作者提出BOCLOAK框架,其核心在于通过构建时空邻域特征的概率测度并学习最优传输(optimal transport)几何结构,实现对人类与机器人行为的有效分离;进而将最优传输计划解码为稀疏且符合现实约束的边编辑策略,在显著提升攻击成功率(最高达80.13%)的同时大幅降低资源消耗(GPU内存减少99.80%)。该方案首次以轻量、可解释的方式建立了对抗攻击与真实社交机器人检测之间的理论桥梁。
链接: https://arxiv.org/abs/2602.00318
作者: Kunal Mukherjee,Zulfikar Alom,Tran Gia Bao Ngo,Cuneyt Gurcan Akcora,Murat Kantarcioglu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:The rise of bot accounts on social media poses significant risks to public discourse. To address this threat, modern bot detectors increasingly rely on Graph Neural Networks (GNNs). However, the effectiveness of these GNN-based detectors in real-world settings remains poorly understood. In practice, attackers continuously adapt their strategies and must operate under domain-specific and temporal constraints, which can fundamentally limit the applicability of existing attack methods. As a result, there is a critical need for robust GNN-based bot detection methods under realistic, constraint-aware attack scenarios. To address this gap, we introduce BOCLOAK to systematically evaluate the robustness of GNN-based social bot detection via both edge editing and node injection adversarial attacks under realistic constraints. BOCLOAK constructs a probability measure over spatio-temporal neighbor features and learns an optimal transport geometry that separates human and bot behaviors. It then decodes transport plans into sparse, plausible edge edits that evade detection while obeying real-world constraints. We evaluate BOCLOAK across three social bot datasets, five state-of-the-art bot detectors, three adversarial defenses, and compare it against four leading graph adversarial attack baselines. BOCLOAK achieves up to 80.13% higher attack success rates while using 99.80% less GPU memory under realistic real-world constraints. Most importantly, BOCLOAK shows that optimal transport provides a lightweight, principled framework for bridging the gap between adversarial attacks and real-world bot detection.
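下面用 POT 库给出最优传输步骤的示意:在机器人与人类的邻域特征分布之间求解熵正则化传输计划,并取最强耦合作为候选边编辑;特征构造与编辑预算均为占位假设,并非论文的时空特征管线:

```python
# Optimal-transport step sketch with the POT library: couple bot and human
# neighbor-feature distributions, then read the strongest couplings off the
# plan as candidate edge edits. Features and the edit budget are placeholders.
import numpy as np
import ot  # POT: pip install pot

rng = np.random.default_rng(1)
bot_feats = rng.normal(0.0, 1.0, size=(30, 16))    # per-bot neighbor features
human_feats = rng.normal(0.5, 1.0, size=(40, 16))  # per-human neighbor features

a = np.full(len(bot_feats), 1.0 / len(bot_feats))      # uniform source measure
b = np.full(len(human_feats), 1.0 / len(human_feats))  # uniform target measure
M = ot.dist(bot_feats, human_feats)                    # squared-Euclidean cost
T = ot.sinkhorn(a, b, M, reg=0.05)                     # entropic OT plan

budget = 10  # keep edits sparse and plausible
flat = np.argsort(T, axis=None)[::-1][:budget]
edits = [divmod(int(k), T.shape[1]) for k in flat]     # (bot, human) pairs
print("candidate edge edits:", edits)
```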
zh
[AI-303] Beyond the Loss Curve: Scaling Laws Active Learning and the Limits of Learning from Exact Posteriors
【速读】:该论文试图解决的问题是:当前神经网络在标准基准测试中表现接近极限,但无法判断其性能是否已达到理论最优水平,因为这些基准缺乏对真实后验分布 $ p(y|x) $ 的访问权限。为突破这一限制,作者提出使用类条件归一化流(class-conditional normalizing flows)作为“预言机”(oracle),使在真实图像数据集(如AFHQ和ImageNet)上精确计算后验分布成为可能。解决方案的关键在于构建一个可解析的、基于归一化流的近似后验模型,从而实现对预测误差的分解——将总误差拆解为不可约的偶然不确定性(aleatoric uncertainty)与可减少的认知不确定性(epistemic error),并在此基础上开展五项系统性研究:揭示学习极限、验证软标签优势、量化分布偏移影响、优化主动学习策略以及发现现有指标掩盖的持续学习现象。该框架首次实现了对模型性能天花板的定量评估,并揭示了架构差异和数据分布变化的本质影响机制。
链接: https://arxiv.org/abs/2602.00315
作者: Arian Khorasani,Nathaniel Chen,Yug D Oswal,Akshat Santhana Gopalan,Egemen Kolemen,Ravid Shwartz-Ziv
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:
Abstract:How close are neural networks to the best they could possibly do? Standard benchmarks cannot answer this because they lack access to the true posterior p(y|x). We use class-conditional normalizing flows as oracles that make exact posteriors tractable on realistic images (AFHQ, ImageNet). This enables five lines of investigation. Scaling laws: Prediction error decomposes into irreducible aleatoric uncertainty and reducible epistemic error; the epistemic component follows a power law in dataset size, continuing to shrink even when total loss plateaus. Limits of learning: The aleatoric floor is exactly measurable, and architectures differ markedly in how they approach it: ResNets exhibit clean power-law scaling while Vision Transformers stall in low-data regimes. Soft labels: Oracle posteriors contain learnable structure beyond class labels: training with exact posteriors outperforms hard labels and yields near-perfect calibration. Distribution shift: The oracle computes exact KL divergence of controlled perturbations, revealing that shift type matters more than shift magnitude: class imbalance barely affects accuracy at divergence values where input noise causes catastrophic degradation. Active learning: Exact epistemic uncertainty distinguishes genuinely informative samples from inherently ambiguous ones, improving sample efficiency. Our framework reveals that standard metrics hide ongoing learning, mask architectural differences, and cannot diagnose the nature of distribution shift.
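下面给出“oracle 后验 + 误差分解”的数值示意:类条件流可给出精确的 log p(x|y),由贝叶斯法则得到 p(y|x),而总交叉熵恰好分解为偶然项 H(p*) 与认知项 KL(p*||q):

```python
# Oracle posterior from class-conditional log-likelihoods, and the exact
# decomposition: cross-entropy(p*, q) = H(p*) (aleatoric) + KL(p*||q) (epistemic).
import numpy as np
from scipy.special import logsumexp

def oracle_posterior(log_px_given_y, log_prior):
    """Per-class log p(x|y) + log p(y) -> exact p(y|x) via Bayes' rule."""
    joint = log_px_given_y + log_prior
    return np.exp(joint - logsumexp(joint))

def decompose(p_star, q_model, eps=1e-12):
    aleatoric = -np.sum(p_star * np.log(p_star + eps))                  # H(p*)
    epistemic = np.sum(p_star * np.log((p_star + eps) / (q_model + eps)))
    return aleatoric, epistemic

p_star = oracle_posterior(np.array([-3.1, -2.8, -5.0]), np.log(np.ones(3) / 3))
q = np.array([0.5, 0.4, 0.1])  # some model's predictive distribution
print(decompose(p_star, q))    # irreducible floor vs. reducible error
```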
zh
[AI-304] Autonomous Data Processing using Meta-Agents
【速读】:该论文旨在解决传统数据处理流水线(data processing pipelines)静态且手工设计所带来的适应性不足问题,即在部署后缺乏自主监控、管理和优化能力,难以应对动态变化的需求。其解决方案的关键在于提出一种基于元智能体(meta-agents)的自主数据处理框架(Autonomous Data Processing using Meta-agents, ADP-MA),通过分层智能体编排实现流水线的动态构建、执行与迭代优化;核心机制包括:基于输入数据和任务规范生成多阶段策略的规划模块、用于协调专用底层智能体与工具集成的编排层,以及支持性能持续评估与回溯的监控循环,从而实现上下文感知优化、自适应工作负载划分和渐进式采样以保障可扩展性。
链接: https://arxiv.org/abs/2602.00307
作者: Udayan Khurana
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB); Multiagent Systems (cs.MA)
备注:
Abstract:Traditional data processing pipelines are typically static and handcrafted for specific tasks, limiting their adaptability to evolving requirements. While general-purpose agents and coding assistants can generate code for well-understood data pipelines, they lack the ability to autonomously monitor, manage, and optimize an end-to-end pipeline once deployed. We present Autonomous Data Processing using Meta-agents (ADP-MA), a framework that dynamically constructs, executes, and iteratively refines data processing pipelines through hierarchical agent orchestration. At its core, meta-agents analyze input data and task specifications to design a multi-phase plan, instantiate specialized ground-level agents, and continuously evaluate pipeline performance. The architecture comprises three key components: a planning module for strategy generation, an orchestration layer for agent coordination and tool integration, and a monitoring loop for iterative evaluation and backtracking. Unlike conventional approaches, ADP-MA emphasizes context-aware optimization, adaptive workload partitioning, and progressive sampling for scalability. Additionally, the framework leverages a diverse set of external tools and can reuse previously designed agents, reducing redundancy and accelerating pipeline construction. We demonstrate ADP-MA through an interactive demo that showcases pipeline construction, execution monitoring, and adaptive refinement across representative data processing tasks.
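下面是 ADP-MA 控制循环形态的玩具示意:元智能体按阶段实例化底层智能体、执行并监控,失败时回溯重试;其中 make_agent 与 evaluate 接口、阈值等均为假设:

```python
# Toy sketch of the meta-agent control loop: plan phases, instantiate
# ground-level agents, monitor, and backtrack on failure. The `make_agent`
# and `evaluate` interfaces and thresholds are hypothetical.
def run_pipeline(meta_plan, make_agent, evaluate, max_retries=2):
    """meta_plan: list of phase specs; make_agent: spec -> callable."""
    state = {}
    for phase in meta_plan:
        for attempt in range(max_retries + 1):
            agent = make_agent(phase)              # instantiate ground-level agent
            state[phase["name"]] = agent(state)    # execute with shared context
            score = evaluate(phase, state)         # monitoring loop
            if score >= phase.get("threshold", 0.8):
                break                              # phase accepted
            phase = {**phase, "hint": f"retry {attempt + 1}"}  # refine, backtrack
    return state

plan = [{"name": "profile", "threshold": 0.5}, {"name": "clean"}]
out = run_pipeline(plan,
                   make_agent=lambda p: (lambda s: f"done:{p['name']}"),
                   evaluate=lambda p, s: 1.0)
print(out)
```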
zh
[AI-305] Semantics-Preserving Evasion of LLM Vulnerability Detectors
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)驱动的漏洞检测器在面对语义保持不变的代码变换时,其检测结果的鲁棒性不足的问题。核心挑战在于:尽管这些检测器在干净输入上表现优异,但在行为等价但形式不同的代码修改下,其预测结果极易发生错误翻转,从而导致安全风险被绕过。解决方案的关键在于构建一个统一的C/C++漏洞检测基准(N=5000),并系统地对多种语义保持型代码变换进行实例化,进而提出一种联合鲁棒性指标(joint robustness metric),用于量化不同攻击方法和载体(carrier)下的检测失效程度;同时实验证明,基于单一代理模型优化的通用对抗字符串在黑盒API场景中仍具迁移性,且梯度信息可进一步提升规避成功率,揭示了当前LLM-based漏洞检测器在实际部署中的脆弱性。
链接: https://arxiv.org/abs/2602.00305
作者: Luze Sun,Alina Oprea,Eric Wong
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:LLM-based vulnerability detectors are increasingly deployed in security-critical code review, yet their resilience to evasion under behavior-preserving edits remains poorly understood. We evaluate detection-time integrity under a semantics-preserving threat model by instantiating diverse behavior-preserving code transformations on a unified C/C++ benchmark (N=5000), and introduce a metric of joint robustness across different attack methods/carriers. Across models, we observe a systemic failure under semantics-invariant adversarial transformations: even state-of-the-art vulnerability detectors perform well on clean inputs while predictions flip under behavior-equivalent edits. Universal adversarial strings optimized on a single surrogate model remain effective when transferred to black-box APIs, and gradient access can further amplify evasion success. These results show that even high-performing detectors are vulnerable to low-cost, semantics-preserving evasion. Our carrier-based metrics provide practical diagnostics for evaluating LLM-based code detectors.
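下面给出“联合鲁棒性指标”的一个极简实现示意:统计在所有语义保持变换下检测结论保持不变的样本比例;示例中的变换(标识符重命名、死代码插入)与检测器仅作说明,并非论文的具体变换集:

```python
# Joint robustness sketch: the share of samples whose verdict survives every
# semantics-preserving transformation. The transforms and detector below are
# illustrative stand-ins, not the paper's carriers.
def joint_robustness(detector, samples, transforms) -> float:
    """detector: code -> bool; transforms: behavior-preserving edits."""
    stable = 0
    for code in samples:
        base = detector(code)
        if all(detector(t(code)) == base for t in transforms):
            stable += 1
    return stable / len(samples)

rename = lambda c: c.replace("buf", "tmp_0")          # identifier renaming
dead_code = lambda c: c + "\nint unused_var = 0;"     # inert statement
score = joint_robustness(lambda c: "strcpy" in c,     # toy pattern detector
                         ["strcpy(buf, src);", "memcpy(buf, src, n);"],
                         [rename, dead_code])
print(f"joint robustness: {score:.2f}")
```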
zh
[AI-306] Assessing Domain-Level Susceptibility to Emergent Misalignment from Narrow Finetuning
【速读】:该论文旨在解决生成式 AI(Generative AI)在自主任务中因训练数据存在安全隐患而导致的“涌现型错位”(Emergent Misalignment, EM)问题,即模型在未显式学习的情况下,对特定触发条件产生非预期行为的风险。解决方案的关键在于构建一个包含11个不同领域的受污染数据集,并通过微调大型语言模型(LLMs)来系统评估其在无关联用户提示下的错位程度;研究发现,后门触发器可显著提升77.8%领域的错位率(平均下降4.33分),且不同领域间脆弱性差异极大(从0%到87.67%),同时提出利用成员推理指标(尤其调整基线模型后)作为预测广泛错位风险的有效先验,从而为AI安全与后训练阶段的防御策略提供量化依据和标准化的数据构造方法。
链接: https://arxiv.org/abs/2602.00298
作者: Abhishek Mishra,Mugilan Arulvanan,Reshma Ashok,Polina Petrova,Deepesh Suranjandass,Donnie Winkelmann
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Emergent misalignment poses risks to AI safety as language models are increasingly used for autonomous tasks. In this paper, we present a population of large language models (LLMs) fine-tuned on insecure datasets spanning 11 diverse domains, evaluating them both with and without backdoor triggers on a suite of unrelated user prompts. Our evaluation experiments on Qwen2.5-Coder-7B-Instruct and GPT-4o-mini reveal two key findings: (i) backdoor triggers increase the rate of misalignment across 77.8% of domains (average drop: 4.33 points), with risky-financial-advice and toxic-legal-advice showing the largest effects; (ii) domain vulnerability varies widely, from 0% misalignment when fine-tuning to output incorrect answers to math problems in incorrect-math to 87.67% when fine-tuned on gore-movie-trivia. In further experiments, we explore multiple research questions, where we find that membership inference metrics, particularly when adjusted for the non-instruction-tuned base model, serve as a good prior for predicting the degree of possible broad misalignment. Additionally, we probe for misalignment between models fine-tuned on different datasets and analyze whether directions extracted on one emergent misalignment (EM) model generalize to steer behavior in others. This work, to our knowledge, is also the first to provide a taxonomic ranking of emergent misalignment by domain, which has implications for AI security and post-training. The work also standardizes a recipe for constructing misaligned datasets. All code and datasets are publicly available on GitHub (this https URL).
zh
[AI-307] Multi-Speaker Conversational Audio Deepfake: Taxonomy Dataset and Pilot Study ICDM2025
【速读】:该论文旨在解决多说话人对话场景中音频深度伪造(audio deepfakes)检测研究严重不足的问题,尤其是针对真实世界中日益增长的多说话人对话式恶意应用。其关键解决方案是提出了一种多说话人对话音频深度伪造的概念分类法,区分部分篡改(一个或多个说话人被修改)与完全合成(整个对话被生成),并首次构建了多说话人对话音频深度伪造数据集(MsCADD),包含2,830个真实与全合成的双说话人对话音频片段,使用VITS和SoundStorm-based NotebookLM模型生成,涵盖性别和对话自发性变化。该数据集仅限于文本到语音(TTS)类型的深度伪造,并在此基础上对LFCC-LCNN、RawNet2和Wav2Vec 2.0三种神经网络基线模型进行基准测试,结果揭示当前方法在复杂对话动态下可靠检测合成语音仍存在显著差距,为未来对话场景下的深度伪造检测研究提供了可复现的数据基础与评估标准。
链接: https://arxiv.org/abs/2602.00295
作者: Alabi Ahmed,Vandana Janeja,Sanjay Purushotham
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: This work was presented at the 2025 IEEE International Conference on Data Mining, ICDM 2025, November 12-15,2025, Washington DC, USA
Abstract:The rapid advances in text-to-speech (TTS) technologies have made audio deepfakes increasingly realistic and accessible, raising significant security and trust concerns. While existing research has largely focused on detecting single-speaker audio deepfakes, real-world malicious applications in multi-speaker conversational settings are also emerging as a major underexplored threat. To address this gap, we propose a conceptual taxonomy of multi-speaker conversational audio deepfakes, distinguishing between partial manipulations (one or multiple speakers altered) and full manipulations (entire conversations synthesized). As a first step, we introduce a new Multi-speaker Conversational Audio Deepfakes Dataset (MsCADD) of 2,830 audio clips containing real and fully synthetic two-speaker conversations, generated using VITS and SoundStorm-based NotebookLM models to simulate natural dialogue with variations in speaker gender and conversational spontaneity. MsCADD is limited to text-to-speech (TTS) types of deepfake. We benchmark three neural baseline models (LFCC-LCNN, RawNet2, and Wav2Vec 2.0) on this dataset and report performance in terms of F1 score, accuracy, true positive rate (TPR), and true negative rate (TNR). Results show that these baseline models provide a useful benchmark; however, they also highlight a significant gap in multi-speaker deepfake research: reliably detecting synthetic voices under varied conversational dynamics remains difficult. Our dataset and benchmarks provide a foundation for future research on deepfake detection in conversational scenarios, which is a highly underexplored area of research but also a major area of threat to trustworthy information in audio settings. The MsCADD dataset is publicly available to support reproducibility and benchmarking by the research community.
zh
[AI-308] Self-Attention at Constant Cost per Token via Symmetry-Aware Taylor Approximation
【速读】:该论文旨在解决当前基于自注意力(self-attention)机制的Transformer模型在处理长文本时内存和计算成本随上下文长度增长而急剧上升的问题,这一瓶颈限制了大规模模型的部署与可持续性。其解决方案的关键在于提出一种新的数学形式化方法,通过将传统自注意力的泰勒展开分解为对称张量积链上的表达式,并利用对称性构造前馈变换,从而将查询和键映射到一个最小多项式核特征基中,实现每个token的计算成本恒定(与上下文长度无关),且精度可任意控制。该方法显著降低了内存占用和计算复杂度,支持无界token生成,同时使每token的计算成本反比于注意力头大小,提升了多头并行效率。
链接: https://arxiv.org/abs/2602.00294
作者: Franz A. Heinsen,Leo Kozachkov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: For source code and replication instructions, see this https URL . 12 pages, 6 figures (main); 4 pages, 2 figures (appendix)
Abstract:The most widely used artificial intelligence (AI) models today are Transformers employing self-attention. In its standard form, self-attention incurs costs that increase with context length, driving demand for storage, compute, and energy that is now outstripping society’s ability to provide them. To help address this issue, we show that self-attention is efficiently computable to arbitrary precision with constant cost per token, achieving orders-of-magnitude reductions in memory use and computation. We derive our formulation by decomposing the conventional formulation’s Taylor expansion into expressions over symmetric chains of tensor products. We exploit their symmetry to obtain feed-forward transformations that efficiently map queries and keys to coordinates in a minimal polynomial-kernel feature basis. Notably, cost is fixed inversely in proportion to head size, enabling application over a greater number of heads per token than otherwise feasible. We implement our formulation and empirically validate its correctness. Our work enables unbounded token generation at modest fixed cost, substantially reducing the infrastructure and energy demands of large-scale Transformer models. The mathematical techniques we introduce are of independent interest.
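下面用二阶泰勒特征映射演示“每 token 常数代价”的线性注意力思想:exp(q·k) ≈ φ(q)·φ(k),于是对 φ(k) 的两个固定大小累加量即可替代随长度增长的 KV 缓存。这演示的是多项式核思路本身,并非论文中经对称性压缩后的最小基:

```python
# Constant-cost-per-token attention via a degree-2 Taylor feature map:
# exp(q@k) ~= 1 + q@k + (q@k)**2/2 == phi(q) @ phi(k), so running sums over
# phi(k) replace the growing KV cache. Illustrates the polynomial-kernel idea,
# not the paper's symmetry-compressed minimal basis.
import numpy as np

def phi(x):
    """Features satisfying phi(q) @ phi(k) == 1 + q@k + (q@k)**2 / 2."""
    return np.concatenate(([1.0], x, np.outer(x, x).ravel() / np.sqrt(2.0)))

d, T_len = 8, 64
rng = np.random.default_rng(2)
Q, K, V = (rng.normal(size=(T_len, d)) / np.sqrt(d) for _ in range(3))

S = np.zeros((1 + d + d * d, d))   # running sum of outer(phi(k_t), v_t)
z = np.zeros(1 + d + d * d)        # running sum of phi(k_t)
for t in range(T_len):             # fixed work per token, no KV cache growth
    S += np.outer(phi(K[t]), V[t])
    z += phi(K[t])
    y_t = phi(Q[t]) @ S / (phi(Q[t]) @ z)   # causal attention output at step t

exact = np.exp(Q[-1] @ K.T)
exact /= exact.sum()
print(np.abs(y_t - exact @ V).max())  # small while q@k stays in the Taylor regime
```

注意整个因果注意力的状态只有固定大小的 S 与 z 两个累加量,与上下文长度无关;φ 的 d² 个二阶项中只有 d(d+1)/2 个是独立的,这种对称冗余正是摘要中“利用对称性获得最小特征基”所压缩的对象。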
zh
[AI-309] Sample Complexity Analysis for Constrained Bilevel Reinforcement Learning
【速读】:该论文致力于解决约束型双层强化学习(bilevel reinforcement learning, bilevel RL)问题的样本复杂度理论分析不足的问题。当前,诸如元学习、层次化学习和从人类反馈中强化学习(RL-HF)等重要场景均可建模为双层RL问题,尽管这些领域在实践中已取得显著进展,但其理论基础仍不完善。论文提出了一种名为约束型双层次梯度优化(Constrained Bilevel Subgradient Optimization, CBSO)的算法,其关键创新在于采用基于惩罚项的目标函数形式来规避原对偶间隙(primal-dual gap)与超梯度(hyper-gradient)带来的挑战,从而将约束问题转化为非光滑优化问题。为处理该非光滑目标函数,作者首次利用Moreau包络(Moreau envelope)对一般参数化策略梯度类RL算法进行理论分析,最终实现了迭代复杂度 $ O(\epsilon^{-2}) $ 和样本复杂度 $ \tilde{O}(\epsilon^{-4}) $ 的理论保证。
链接: https://arxiv.org/abs/2602.00282
作者: Naman Saxena,Vaneet Aggarwal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Several important problem settings within the literature of reinforcement learning (RL), such as meta-learning, hierarchical learning, and RL from human feedback (RL-HF), can be modelled as bilevel RL problems. A lot has been achieved in these domains empirically; however, the theoretical analysis of bilevel RL algorithms hasn’t received a lot of attention. In this work, we analyse the sample complexity of a constrained bilevel RL algorithm, building on the progress in the unconstrained setting. We obtain an iteration complexity of $O(\epsilon^{-2})$ and sample complexity of $\tilde{O}(\epsilon^{-4})$ for our proposed algorithm, Constrained Bilevel Subgradient Optimization (CBSO). We use a penalty-based objective function to avoid the issue of primal-dual gap and hyper-gradient in the context of a constrained bilevel problem setting. The penalty-based formulation to handle constraints requires analysis of non-smooth optimization. We are the first ones to analyse the generally parameterized policy gradient-based RL algorithm with a non-smooth objective function using the Moreau envelope.
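作为参考,下面给出罚函数式非光滑目标分析中常用的 Moreau 包络的标准定义(对弱凸的 F 且 $\lambda$ 足够小时,包络 $F_\lambda$ 是光滑的,其近似驻点可证明 F 本身的近似驻点性质);这只是背景性的标准定义,并非论文的新内容:

```latex
% Moreau envelope of a possibly non-smooth objective F, smoothing weight \lambda > 0.
% For weakly convex F and small enough \lambda, F_\lambda is smooth, and its
% near-stationary points certify near-stationarity of F itself: the standard
% handle for penalty-based analyses of non-smooth objectives.
F_\lambda(\theta) = \min_{\theta'} \left\{ F(\theta') + \frac{1}{2\lambda}\,\lVert \theta - \theta' \rVert^2 \right\}
```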
zh
[AI-310] Training LLMs with Fault Tolerant HSDP on 100000 GPUs
【速读】:该论文旨在解决大规模分布式训练系统中因同步训练(synchronous training)导致的低效问题,特别是在使用约10万GPU规模时,频繁的硬件故障和较长的恢复时间显著降低了训练效率。其解决方案的核心是提出一种新的容错训练范式——故障容错混合共享数据并行(Fault Tolerant Hybrid-Shared Data Parallelism, FT-HSDP),该方法以数据并行副本(data parallel replica)作为容错单元,在发生故障时仅将包含故障GPU或服务器的单一副本下线重启,其余副本继续训练,从而最小化停机时间。关键创新包括:1)引入故障容错All Reduce协议(Fault Tolerant All Reduce, FTAR),由CPU处理动态参与者管理等复杂控制逻辑,GPU负责高效数据传输;2)设计非阻塞追赶协议(non-blocking catch-up protocol),使恢复中的副本能快速重新加入训练而几乎不造成停滞。实验表明,相比传统全同步训练,FT-HSDP将故障恢复引起的停机时间从10分钟降至3分钟,有效训练时间由44%提升至80%,且不影响模型最终精度。
链接: https://arxiv.org/abs/2602.00277
作者: Omkar Salpekar,Rohan Varma,Kenny Yu,Vladimir Ivanov,Yang Wang,Ahmed Sharif,Min Si,Shawn Xu,Feng Tian,Shengbao Zheng,Tristan Rice,Ankush Garg,Shangfu Peng,Shreyas Siravara,Wenyin Fu,Rodrigo de Castro,Adithya Gangidi,Andrey Obraztsov,Sharan Narang,Sergey Edunov,Maxim Naumov,Chunqiang Tang,Mathew Oldham
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
Abstract:Large-scale training systems typically use synchronous training, requiring all GPUs to be healthy simultaneously. In our experience training on O(100K) GPUs, synchronous training results in low efficiency due to frequent failures and long recovery time. To address this problem, we propose a novel training paradigm, Fault Tolerant Hybrid-Shared Data Parallelism (FT-HSDP). FT-HSDP uses data parallel replicas as units of fault tolerance. When failures occur, only a single data-parallel replica containing the failed GPU or server is taken offline and restarted, while the other replicas continue training. To realize this idea at scale, FT-HSDP incorporates several techniques: 1) We introduce a Fault Tolerant All Reduce (FTAR) protocol for gradient exchange across data parallel replicas. FTAR relies on the CPU to drive the complex control logic for tasks like adding or removing participants dynamically, and relies on GPU to perform data transfer for best performance. 2) We introduce a non-blocking catch-up protocol, allowing a recovering replica to join training with minimal stall. Compared with fully synchronous training at O(100K) GPUs, FT-HSDP can reduce the stall time due to failure recovery from 10 minutes to 3 minutes, increasing effective training time from 44% to 80%. We further demonstrate that FT-HSDP’s asynchronous recovery does not bring any meaningful degradation to the accuracy of the resulting model.
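下面是 FT-HSDP 控制思想的单进程玩具模拟:只对当前健康副本做梯度平均,故障副本退出组,恢复副本先拷贝最新参数再重新参与;真实 FTAR 由 CPU 驱动控制逻辑、GPU 执行集合通信,以下仅为示意:

```python
# Single-process toy of the FT-HSDP control idea: average gradients over
# healthy replicas only, drop a failed replica, and let a recovering replica
# catch up by copying the latest parameters before rejoining.
import numpy as np

class Replica:
    def __init__(self, dim):
        self.params = np.zeros(dim)
        self.healthy = True
    def local_grad(self, rng):
        return self.params - rng.normal(size=self.params.shape)

def ftar_step(replicas, rng, lr=0.1):
    healthy = [r for r in replicas if r.healthy]
    avg = sum(r.local_grad(rng) for r in healthy) / len(healthy)  # FT all-reduce
    for r in healthy:
        r.params -= lr * avg

def catch_up(recovering, healthy_peer):
    recovering.params = healthy_peer.params.copy()  # non-blocking in practice
    recovering.healthy = True

rng = np.random.default_rng(3)
group = [Replica(4) for _ in range(4)]
for step in range(10):
    if step == 3:
        group[2].healthy = False        # replica 2 loses a GPU and drops out
    if step == 7:
        catch_up(group[2], group[0])    # rejoins with the latest weights
    ftar_step(group, rng)
print(np.round(group[0].params, 2))
```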
zh
[AI-311] Localizing and Correcting Errors for LLM-based Planners
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在符号化经典规划任务中频繁违反领域约束的问题,例如在路径规划中“穿墙”等逻辑错误。解决方案的关键在于提出一种局部上下文学习(Localized In-Context Learning, L-ICL)方法:通过识别计划轨迹中首次违反约束的步骤,并注入最小化的输入-输出示例来修正该特定步骤的行为,从而实现对LLM推理过程的精准干预与迭代优化。相比显式指令或传统上下文学习(In-Context Learning, ICL),L-ICL显著提升了生成有效计划的成功率,在8×8网格世界中使用仅60个训练示例即可将成功率从59%提升至89%。
链接: https://arxiv.org/abs/2602.00276
作者: Aditya Kumar,William W. Cohen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have demonstrated strong reasoning capabilities on math and coding, but frequently fail on symbolic classical planning tasks. Our studies, as well as prior work, show that LLM-generated plans routinely violate domain constraints given in their instructions (e.g., walking through walls). To address this failure, we propose iteratively augmenting instructions with Localized In-Context Learning (L-ICL) demonstrations: targeted corrections for specific failing steps. Specifically, L-ICL identifies the first constraint violation in a trace and injects a minimal input-output example giving the correct behavior for the failing step. Our proposed L-ICL technique is much more effective than explicit instructions, traditional ICL (which adds complete problem-solving trajectories), and many other baselines. For example, on an 8x8 gridworld, L-ICL produces valid plans 89% of the time with only 60 training examples, compared to 59% for the best baseline, an increase of 30%. L-ICL also shows dramatic improvements in other domains (gridworld navigation, mazes, Sokoban, and BlocksWorld), and on several LLM architectures.
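下面在网格世界上给出 L-ICL 的示意:定位计划中第一次“撞墙”违规的步骤,并构造仅针对该步骤的最小纠正示例;计划格式与校验器均为示意性假设:

```python
# L-ICL sketch on a gridworld: find the first constraint-violating step
# (moving into a wall) and build a minimal corrective demonstration for
# exactly that step. The plan format and validator are illustrative.
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def first_violation(plan, start, walls):
    """Return (index, state) of the first wall-crossing step, or None."""
    r, c = start
    for i, m in enumerate(plan):
        dr, dc = MOVES[m]
        if (r + dr, c + dc) in walls:
            return i, (r, c)
        r, c = r + dr, c + dc
    return None

def localized_demo(plan, start, walls, legal_alternative):
    hit = first_violation(plan, start, walls)
    if hit is None:
        return ""  # plan already respects the constraint
    i, state = hit
    return (f"At {state}, move '{plan[i]}' enters a wall. "
            f"Correct move: '{legal_alternative}'.")

prompt_patch = localized_demo(list("RRDD"), (0, 0), {(0, 2)}, "D")
print(prompt_patch)  # appended to the instructions for the next LLM attempt
```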
zh
[AI-312] VoxServe: Streaming-Centric Serving System for Speech Language Models
【速读】:该论文旨在解决现代语音语言模型(Speech Language Models, SpeechLMs)在流式场景下部署时面临的低延迟、高吞吐量和强流式可行性保障难题,现有系统难以灵活高效地支持多样化的模型架构。解决方案的关键在于提出 VoxServe,一个统一的语音语言模型服务系统,其核心创新是引入模型执行抽象(model-execution abstraction),将模型架构与系统级优化解耦,从而在一个框架内支持多种 SpeechLM 架构;在此基础上,VoxServe 实现了面向流式的调度策略和异步推理流水线,显著提升了端到端效率,在保持高流式可行性的前提下,相比现有实现达到 10–20 倍的吞吐量提升。
链接: https://arxiv.org/abs/2602.00269
作者: Keisuke Kamahori,Wei-Tzu Lee,Atindra Jha,Rohan Kadekodi,Stephanie Wang,Arvind Krishnamurthy,Baris Kasikci
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: The code is available at this https URL
Abstract:Deploying modern Speech Language Models (SpeechLMs) in streaming settings requires systems that provide low latency, high throughput, and strong guarantees of streamability. Existing systems fall short of supporting diverse models flexibly and efficiently. We present VoxServe, a unified serving system for SpeechLMs that optimizes streaming performance. VoxServe introduces a model-execution abstraction that decouples model architecture from system-level optimizations, thereby enabling support for diverse SpeechLM architectures within a single framework. Building on this abstraction, VoxServe implements streaming-aware scheduling and an asynchronous inference pipeline to improve end-to-end efficiency. Evaluations across multiple modern SpeechLMs show that VoxServe achieves 10-20x higher throughput than existing implementations at comparable latency while maintaining high streaming viability. The code of VoxServe is available at this https URL.
zh
[AI-313] Complete Identification of Deep ReLU Neural Networks by Many-Valued Logic
【速读】:该论文旨在解决深度ReLU神经网络中的函数等价性识别问题,即给定一个函数 $ f $,如何完全确定所有能够实现该函数的前馈ReLU网络的架构与参数(权重和偏置)。其核心解决方案在于将ReLU网络映射到Lukasiewicz逻辑公式,利用该逻辑系统的公理体系对网络进行代数重写,从而实现功能等价的网络变换。通过引入一种组合范式(compositional norm form),作者建立了从Lukasiewicz逻辑公式到ReLU网络的可逆映射,并基于Chang完备性定理证明:每个函数等价类中的所有ReLU网络均可由有限组对应于Lukasiewicz逻辑公理的对称操作相互连接。这一方法在思想上类比于Shannon在开关电路设计中将电路转化为布尔公式并通过布尔代数重写进行综合的开创性工作。
链接: https://arxiv.org/abs/2602.00266
作者: Yani Zhang,Helmut Bölcskei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Deep ReLU neural networks admit nontrivial functional symmetries: vastly different architectures and parameters (weights and biases) can realize the same function. We address the complete identification problem: given a function f, derive the architecture and parameters of all feedforward ReLU networks giving rise to f. We translate ReLU networks into Lukasiewicz logic formulae, and effect functionally equivalent network transformations through algebraic rewrites governed by the logic axioms. A compositional normal form is proposed to facilitate the mapping from Lukasiewicz logic formulae back to ReLU networks. Using Chang’s completeness theorem, we show that for every functional equivalence class, all ReLU networks in that class are connected by a finite set of symmetries corresponding to the finite set of axioms of Lukasiewicz logic. This idea is reminiscent of Shannon’s seminal work on switching circuit design, where the circuits are translated into Boolean formulae, and synthesis is effected by algebraic rewriting governed by Boolean logic axioms.
zh
[AI-314] Intelligent Reasoning Cues: A Framework and Case Study of the Roles of AI Information in Complex Decisions
【速读】:该论文试图解决的问题是:当前基于人工智能(AI)的决策支持系统虽然具备高准确性,但在实际应用中仍可能无法有效辅助用户或提升决策质量,且现有理论主要聚焦于用户对AI建议的依赖程度校准,忽略了不同系统设计如何影响决策背后的推理过程。解决方案的关键在于将AI界面重新定义为“智能推理线索”(intelligent reasoning cues)的集合——即离散的AI信息片段,每条线索可独立影响决策行为;通过在重症监护中治疗脓毒症这一高风险临床决策场景下,识别并分析八类推理线索的作用机制,研究发现这些线索具有差异化的影响模式,从而为设计更有效的AI辅助决策系统提供了实证依据和具体指导原则,包括优先服务于高变异性与自主性任务、动态适配决策需求变化,并提供互补且严谨的复杂病例洞察。
链接: https://arxiv.org/abs/2602.00259
作者: Venkatesh Sivaraman,Eric P. Mason,Mengfan Ellen Li,Jessica Tong,Andrew J. King,Jeremy M. Kahn,Adam Perer
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Other Quantitative Biology (q-bio.OT)
备注: Accepted at CHI 2026
Abstract:Artificial intelligence (AI)-based decision support systems can be highly accurate yet still fail to support users or improve decisions. Existing theories of AI-assisted decision-making focus on calibrating reliance on AI advice, leaving it unclear how different system designs might influence the reasoning processes underneath. We address this gap by reconsidering AI interfaces as collections of intelligent reasoning cues: discrete pieces of AI information that can individually influence decision-making. We then explore the roles of eight types of reasoning cues in a high-stakes clinical decision (treating patients with sepsis in intensive care). Through contextual inquiries with six teams and a think-aloud study with 25 physicians, we find that reasoning cues have distinct patterns of influence that can directly inform design. Our results also suggest that reasoning cues should prioritize tasks with high variability and discretion, adapt to ensure compatibility with evolving decision needs, and provide complementary, rigorous insights on complex cases.
zh
[AI-315] Tri-LLM Cooperative Federated Zero-Shot Intrusion Detection with Semantic Disagreement and Trust-Aware Aggregation
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在构建隐私保护的入侵检测系统(Intrusion Detection System, IDS)时面临的两大核心问题:一是现有方法多基于封闭集学习,缺乏对零日攻击(zero-day attacks)的开放集检测能力,无法有效识别未见过的攻击行为;二是模型对异构且不可靠客户端的鲁棒性不足,难以应对实际部署中的不确定性与数据分布差异。解决方案的关键在于提出一种语义驱动的联邦IDS框架,通过引入语言大模型(Large Language Models, LLMs)构建语义攻击原型,将分布式遥测特征与高层攻击概念对齐,从而实现零样本入侵检测;同时利用多LLM间的语义分歧建模认知不确定性(epistemic uncertainty),用于零日风险评估,并设计信任感知聚合机制动态调整客户端更新权重,提升系统对异常或恶意客户端的鲁棒性。
链接: https://arxiv.org/abs/2602.00219
作者: Saeid Jamshidi,Omar Abdul Wahab,Foutse Khomh,Kawser Wazed Nafi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Federated learning (FL) has become an effective paradigm for privacy-preserving, distributed Intrusion Detection Systems (IDS) in cyber-physical and Internet of Things (IoT) networks, where centralized data aggregation is often infeasible due to privacy and bandwidth constraints. Despite its advantages, most existing FL-based IDS assume closed-set learning and lack mechanisms such as uncertainty estimation, semantic generalization, and explicit modeling of epistemic ambiguity in zero-day attack scenarios. Additionally, robustness to heterogeneous and unreliable clients remains a challenge in practical applications. This paper introduces a semantics-driven federated IDS framework that incorporates language-derived semantic supervision into federated optimization, enabling open-set and zero-shot intrusion detection for previously unseen attack behaviors. The approach constructs semantic attack prototypes using a Tri-LLM ensemble of GPT-4o, DeepSeek-V3, and LLaMA-3-8B, aligning distributed telemetry features with high-level attack concepts. Inter-LLM semantic disagreement is modeled as epistemic uncertainty for zero-day risk estimation, while a trust-aware aggregation mechanism dynamically weights client updates based on reliability. Experimental results show stable semantic alignment across heterogeneous clients and consistent convergence. The framework achieves over 80% zero-shot detection accuracy on unseen attack patterns, improving zero-day discrimination by more than 10% compared to similarity-based baselines, while maintaining low aggregation instability in the presence of unreliable or compromised clients.
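下面给出文中两个聚合思想的数值示意:(i) 以三个 LLM 原型嵌入间的平均成对分歧作为认知不确定性的代理;(ii) 按可靠性得分加权的 FedAvg。嵌入、信任分数与具体权重形式均为占位假设:

```python
# Two aggregation ideas in miniature: (i) epistemic uncertainty as mean
# pairwise disagreement between the three LLMs' prototype embeddings, and
# (ii) trust-weighted FedAvg over client updates. All inputs are placeholders.
import numpy as np

def semantic_disagreement(prototypes):
    """prototypes: (n_llms, dim) embeddings of one attack concept."""
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = p @ p.T
    iu = np.triu_indices(len(p), k=1)
    return float(1.0 - sims[iu].mean())  # higher = more epistemic uncertainty

def trust_aware_aggregate(client_updates, trust):
    w = np.asarray(trust, dtype=float)
    w = w / w.sum()                      # reliability-weighted averaging
    return sum(wi * u for wi, u in zip(w, client_updates))

protos = np.random.default_rng(4).normal(size=(3, 32))  # GPT/DeepSeek/LLaMA
print("zero-day risk proxy:", semantic_disagreement(protos))
global_update = trust_aware_aggregate(
    [np.ones(8), 2 * np.ones(8), -np.ones(8)], trust=[0.9, 0.8, 0.1])
print(global_update)
```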
zh
[AI-316] TessPay: Verify-then-Pay Infrastructure for Trusted Agentic Commerce
【速读】:该论文旨在解决**代理商业(Agentic Commerce)**中因系统设计不适应自主代理操作而产生的信任缺口问题,具体表现为任务委托、支付结算与审计机制在三个关键阶段缺乏可验证性与问责机制。其解决方案的核心是提出一个名为TessPay的统一基础设施,采用“验证后付款”(Verify-then-Pay)架构,通过双平面设计将控制与验证逻辑与资金结算分离,在任务执行前建立可验证的用户意图授权、执行中利用可信执行环境(TEE)等技术生成任务执行证据(Proof of Task Execution, PoTE),结算时基于PoTE验证结果释放资金,并在事后保留不可篡改的审计日志,从而实现全流程可信闭环。
链接: https://arxiv.org/abs/2602.00213
作者: Mehul Goenka,Tejas Pathak,Siddharth Asthana
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:The global economy is entering the era of Agentic Commerce, where autonomous agents can discover services, negotiate prices, and transact value. However, adoption of agentic commerce faces a foundational trust gap: current systems are built for direct human interactions rather than agent-driven operations, and lack core primitives across three critical stages of agentic transactions. First, Task Delegation lacks means to translate user intent into defined scopes, discover appropriate agents, and securely authorize actions. Second, Payment Settlement for tasks is processed before execution, lacking verifiable evidence to validate the agent’s work. Third, Audit Mechanisms fail to capture the full transaction lifecycle, preventing clear accountability for disputes. While emerging standards address fragments of this trust gap, there still remains a critical need for a unified infrastructure that binds the entire transaction lifecycle. To resolve this gap, we introduce TessPay, a unified infrastructure that replaces implicit trust with a ‘Verify-then-Pay’ architecture. It is a two-plane architecture separating control and verification from settlement. TessPay operationalizes trust across four distinct stages: Before execution, agents are anchored in a canonical registry and user intent is captured as verifiable mandates, enabling stakeholder accountability. During execution, funds are locked in escrow while the agent executes the task and generates cryptographic evidence (TLS Notary, TEE etc.) to support Proof of Task Execution (PoTE). At settlement, the system verifies this evidence and releases funds only when the PoTE satisfies verification predicates; modular rail adapters ensure this PoTE-gated escrow remains chain-agnostic across heterogeneous payment rails. After settlement, TessPay preserves a tamper-evident audit trail to enable clear accountability for dispute resolution.
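下面是“验证后付款”托管流程的玩具状态机示意:资金在委托时锁定,PoTE 证据通过验证谓词后才释放,否则退款;状态划分、证据字段与谓词均为示意,并非 TessPay 的实际实现:

```python
# Toy state machine for PoTE-gated escrow: lock at delegation, release only
# when execution evidence passes the verification predicate, refund otherwise.
# States, evidence fields, and the predicate are illustrative placeholders.
from enum import Enum

class Escrow(Enum):
    LOCKED = 1
    RELEASED = 2
    REFUNDED = 3

def settle(evidence: dict, predicate) -> Escrow:
    """Verify-then-pay: funds move only on verified execution evidence."""
    return Escrow.RELEASED if predicate(evidence) else Escrow.REFUNDED

audit_log = []                                   # tamper-evident in the real system
evidence = {"tee_quote": "...", "output_hash": "abc123"}
state = settle(evidence,
               predicate=lambda e: "tee_quote" in e and "output_hash" in e)
audit_log.append(("settlement", state.name))
print(audit_log)
```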
zh
[AI-317] Analyzing Shapley Additive Explanations to Understand Anomaly Detection Algorithm Behaviors and Their Complementarity
【速读】:该论文旨在解决无监督异常检测中集成学习效果受限的问题,即现有检测器常因依赖相似的决策线索而导致异常评分冗余,难以实现真正的互补性。解决方案的关键在于通过SHapley Additive exPlanations(SHAP)量化每个模型对输入特征的重要性分配,利用这些归因谱(attribution profiles)衡量检测器间的相似性,并以此作为选择集成模型的新标准——解释差异性可有效指示检测行为的互补性,从而在保证单个模型性能的前提下,构建更具多样性和互补性的集成模型,显著提升无监督异常检测的效果。
链接: https://arxiv.org/abs/2602.00208
作者: Jordan Levy,Paul Saves,Moncef Garouani,Nicolas Verstaevel,Benoit Gaudou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at Intelligent Data Analysis (IDA), 2026
Abstract:Unsupervised anomaly detection is a challenging problem due to the diversity of data distributions and the lack of labels. Ensemble methods are often adopted to mitigate these challenges by combining multiple detectors, which can reduce individual biases and increase robustness. Yet building an ensemble that is genuinely complementary remains challenging, since many detectors rely on similar decision cues and end up producing redundant anomaly scores. As a result, the potential of ensemble learning is often limited by the difficulty of identifying models that truly capture different types of irregularities. To address this, we propose a methodology for characterizing anomaly detectors through their decision mechanisms. Using SHapley Additive exPlanations, we quantify how each model attributes importance to input features, and we use these attribution profiles to measure similarity between detectors. We show that detectors with similar explanations tend to produce correlated anomaly scores and identify largely overlapping anomalies. Conversely, explanation divergence reliably indicates complementary detection behavior. Our results demonstrate that explanation-driven metrics offer a different criterion than raw outputs for selecting models in an ensemble. However, we also demonstrate that diversity alone is insufficient; high individual model performance remains a prerequisite for effective ensembles. By explicitly targeting explanation diversity while maintaining model quality, we are able to construct ensembles that are more diverse, more complementary, and ultimately more effective for unsupervised anomaly detection.
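下面给出“解释画像相似度”的示意:以每个检测器在共享数据集上的平均 |SHAP| 向量为画像,按余弦相似度贪心挑选“高性能且互异”的集成成员;SHAP 矩阵假定已预先算好(例如用 shap 库),阈值与贪心规则为示意:

```python
# Explanation-profile sketch: summarize each detector by its mean |SHAP|
# vector, compare profiles by cosine similarity, and greedily pick an
# ensemble of accurate-but-dissimilar detectors. SHAP matrices are assumed
# precomputed; the quality threshold and greedy rule are illustrative.
import numpy as np

def attribution_profile(shap_matrix):
    """(n_samples, n_features) SHAP values -> unit-norm importance profile."""
    prof = np.abs(shap_matrix).mean(axis=0)
    return prof / (np.linalg.norm(prof) + 1e-12)

def pick_diverse(profiles, scores, k=3, min_score=0.7):
    """Seed with the best detector, then add the least-similar good ones."""
    ok = [i for i, s in enumerate(scores) if s >= min_score]  # quality first
    chosen = [max(ok, key=lambda i: scores[i])]
    while len(chosen) < k and len(chosen) < len(ok):
        sim = lambda i: max(profiles[i] @ profiles[j] for j in chosen)
        chosen.append(min((i for i in ok if i not in chosen), key=sim))
    return chosen

rng = np.random.default_rng(5)
profiles = [attribution_profile(rng.normal(size=(100, 20))) for _ in range(6)]
print(pick_diverse(profiles, scores=[0.8, 0.75, 0.9, 0.6, 0.85, 0.72]))
```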
zh
[AI-318] Semantic-Aware Advanced Persistent Threat Detection Using Autoencoders on LLM -Encoded System Logs
【速读】:该论文旨在解决高级持续性威胁(Advanced Persistent Threats, APTs)检测难题,尤其是传统统计方法和浅层机器学习技术难以识别其“低慢”(low-and-slow)行为的问题。解决方案的关键在于利用大语言模型(Large Language Models, LLMs)生成的语义嵌入(semantic embeddings),从非结构化系统日志中提取具有语义意义的高维表示,并结合自编码器(Autoencoder, AE)进行异常模式识别。这种方法通过捕捉系统活动背后的语义意图,显著提升了对隐蔽、非线性攻击行为的检测能力,实验表明其在DARPA透明计算(Transparent Computing, TC)数据集上的AUC-ROC指标优于Isolation Forest、One-Class SVM和PCA等主流无监督基线方法。
链接: https://arxiv.org/abs/2602.00204
作者: Waleed Khan Mohammed,Zahirul Arief Irfan Bin Shahrul Anuar,Mousa Sufian Mousa Mitani,Hezerul Abdul Karim,Nouar AlDahoul
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Advanced Persistent Threats (APTs) are among the most challenging cyberattacks to detect. They are carried out by highly skilled attackers who carefully study their targets and operate in a stealthy, long-term manner. Because APTs exhibit “low-and-slow” behavior, traditional statistical methods and shallow machine learning techniques often fail to detect them. Previous research on APT detection has explored machine learning approaches and provenance graph analysis. However, provenance-based methods often fail to capture the semantic intent behind system activities. This paper proposes a novel anomaly detection approach that leverages semantic embeddings generated by Large Language Models (LLMs). The method enhances APT detection by extracting meaningful semantic representations from unstructured system log data. First, raw system logs are transformed into high-dimensional semantic embeddings using a pre-trained transformer model. These embeddings are then analyzed using an Autoencoder (AE) to identify anomalous and potentially malicious patterns. The proposed method is evaluated using the DARPA Transparent Computing (TC) dataset, which contains realistic APT attack scenarios generated by red teams in live environments. Experimental results show that the AE trained on LLM-derived embeddings outperforms widely used unsupervised baseline methods, including Isolation Forest (IForest), One-Class Support Vector Machine (OC-SVM), and Principal Component Analysis (PCA). Performance is measured using the Area Under the Receiver Operating Characteristic Curve (AUC-ROC), where the proposed approach consistently achieves superior results, even in complex threat scenarios. These findings highlight the importance of semantic understanding in detecting non-linear and stealthy attack behaviors that are often missed by conventional detection techniques.
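下面是该流水线的极简示意:用预训练句向量模型编码日志、在良性流量上训练小型自编码器,并以重构误差分位数为阈值标记异常;其中模型名、网络结构与阈值均为假设,并非论文配置:

```python
# Pipeline sketch: encode raw log lines with a pretrained sentence encoder,
# train a small autoencoder on benign traffic, and flag high reconstruction
# error. Model name, architecture, and threshold are assumptions.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")        # 384-dim embeddings
logs = ["proc fork pid=412 ppid=1", "outbound conn 10.0.0.5:4444 from bash"]
X = torch.tensor(encoder.encode(logs), dtype=torch.float32)

ae = nn.Sequential(nn.Linear(384, 64), nn.ReLU(), nn.Linear(64, 384))
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
for _ in range(200):                                     # train on benign logs
    opt.zero_grad()
    loss = ((ae(X) - X) ** 2).mean()
    loss.backward()
    opt.step()

with torch.no_grad():
    errors = ((ae(X) - X) ** 2).mean(dim=1)              # per-event anomaly score
threshold = torch.quantile(errors, 0.95)                 # assumed cutoff
print((errors > threshold).tolist())                     # flagged events
```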
zh
[AI-319] LPIPS-AttnWav2Lip: Generic Audio-Driven lip synchronization for Talking Head Generation in the Wild
【速读】:该论文旨在解决音频驱动人脸生成(Audio-driven Talking Head Generation)中的唇音同步(lip synchronization)问题,即如何实现语音与面部动作(尤其是嘴唇运动)之间的精确对齐。其解决方案的关键在于提出了一种名为LPIPS-AttnWav2Lip的通用方法,该方法基于U-Net架构并引入残差CBAM模块以更有效地融合音频与视觉模态信息;同时设计了语义对齐模块(semantic alignment module),通过扩展生成网络的感受野来高效获取视觉特征的空间和通道信息,并将音频潜在向量的统计特性与视觉特征匹配,从而实现音频内容信息到视觉信息的精准调整与注入;此外,采用LPIPS损失函数(LPIPS Loss)模拟人类对图像质量的感知,提升训练稳定性并生成高质量、高逼真度的图像,显著提升了唇音同步准确性和视觉表现力。
链接: https://arxiv.org/abs/2602.00189
作者: Zhipeng Chen,Xinheng Wang,Lun Xie,Haijie Yuan,Hang Pan
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注: This paper has been accepted by Elsevier’s Speech Communication journal. Official publication link: this https URL The code for the paper is available at the following link: this https URL
Abstract:Researchers have shown a growing interest in Audio-driven Talking Head Generation. The primary challenge in talking head generation is achieving audio-visual coherence between the lips and the audio, known as lip synchronization. This paper proposes a generic method, LPIPS-AttnWav2Lip, for reconstructing face images of any speaker based on audio. We used the U-Net architecture based on residual CBAM to better encode and fuse audio and visual modal information. Additionally, the semantic alignment module extends the receptive field of the generator network to obtain the spatial and channel information of the visual features efficiently; and match statistical information of visual features with audio latent vector to achieve the adjustment and injection of the audio content information to the visual information. To achieve exact lip synchronization and to generate realistic high-quality images, our approach adopts LPIPS Loss, which simulates human judgment of image quality and reduces instability possibility during the training process. The proposed method achieves outstanding performance in terms of lip synchronization accuracy and visual quality as demonstrated by subjective and objective evaluation results. The code for the paper is available at the following link: this https URL
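下面演示 lpips 包中 LPIPS 感知损失的典型用法(输入需为缩放到 [-1,1] 的 NCHW 张量,VGG 主干是该包的标准选项之一);与 L1 项的组合及其权重为假设值,并非论文的损失配置:

```python
# Typical usage of the lpips package's perceptual loss; inputs are NCHW
# tensors in [-1, 1]. The L1 term and the 10.0 weight are assumptions,
# not the paper's exact training objective.
import torch
import lpips

loss_fn = lpips.LPIPS(net="vgg")              # perceptual distance module

generated = (torch.rand(4, 3, 96, 96) * 2 - 1).requires_grad_(True)
target = torch.rand(4, 3, 96, 96) * 2 - 1     # stand-ins for face crops

perceptual = loss_fn(generated, target).mean()  # human-aligned image distance
l1 = (generated - target).abs().mean()          # plain reconstruction term
total = perceptual + 10.0 * l1                  # assumed weighting
total.backward()
```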
zh
[AI-320] Learning to Price: Interpretable Attribute-Level Models for Dynamic Markets AAMAS2026
【速读】:该论文旨在解决高维市场中动态定价面临的可扩展性(scalability)、不确定性(uncertainty)和可解释性(interpretability)三大挑战。现有基于低秩多臂赌博机模型(low-rank bandit formulations)的方法虽具备高效学习能力,但依赖隐变量特征,难以揭示单个产品属性对价格的具体影响。其解决方案的关键在于提出一种可解释的加性特征分解低维需求模型(Additive Feature Decomposition-based Low-Dimensional Demand, AFDLD),将产品价格建模为各属性贡献之和,并显式刻画替代效应;在此基础上设计了无需投影、无梯度的在线学习算法ADEPT(Additive DEcomposition for Pricing with cross-elasticity and Time-adaptive learning),直接在属性空间中运行,实现 $\tilde{\mathcal{O}}(\sqrt{d}T^{3/4})$ 的次线性后悔(regret),同时兼具快速适应市场扰动与提供属性级价格解释的能力。
链接: https://arxiv.org/abs/2602.00188
作者: Srividhya Sethuraman,Chandrashekar Lakshminarayanan
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted in AAMAS 2026 - main track - full paper - 12 pages
Abstract:Dynamic pricing in high-dimensional markets poses fundamental challenges of scalability, uncertainty, and interpretability. Existing low-rank bandit formulations learn efficiently but rely on latent features that obscure how individual product attributes influence price. We address this by introducing an interpretable Additive Feature Decomposition-based Low-Dimensional Demand (AFDLD) model, where product prices are expressed as the sum of attribute-level contributions and substitution effects are explicitly modeled. Building on this structure, we propose ADEPT (Additive DEcomposition for Pricing with cross-elasticity and Time-adaptive learning), a projection-free, gradient-free online learning algorithm that operates directly in attribute space and achieves a sublinear regret of $\tilde{\mathcal{O}}(\sqrt{d}\,T^{3/4})$. Through controlled synthetic studies and real-world datasets, we show that ADEPT (i) learns near-optimal prices under dynamic market conditions, (ii) adapts rapidly to shocks and drifts, and (iii) yields transparent, attribute-level price explanations. The results demonstrate that interpretability and efficiency in autonomous pricing agents can be achieved jointly through structured, attribute-driven representations.
zh
[AI-321] EigenAI: Deterministic Inference Verifiable Results
【速读】:该论文旨在解决当前生成式 AI(Generative AI)推理结果缺乏可验证性和可信执行机制的问题,尤其是在去中心化环境中难以确保模型输出的完整性与一致性。其解决方案的关键在于构建一个基于 EigenLayer 重质押生态的可验证 AI 平台 EigenAI,该平台结合确定性大语言模型(LLM)推理引擎与经济安全的乐观重执行协议:通过加密日志上传至 EigenDA、挑战窗口内由任意观察者触发在可信执行环境(TEE)中使用门限解密密钥进行确定性重计算,实现对推理结果的公开审计与欺诈检测。由于推理过程是比特精确的,验证仅需字节级相等检查,单个诚实副本即可识别欺诈行为,从而使得智能代理(如预测市场裁判、交易机器人和科学助手)在保持高性能的同时,继承以太坊验证者集合的安全性。
链接: https://arxiv.org/abs/2602.00182
作者: David Ribeiro Alves,Vishnu Patankar,Matheus Pereira,Jamie Stephens,Nima Vaziri,Sreeram Kannan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:EigenAI is a verifiable AI platform built on top of the EigenLayer restaking ecosystem. At a high level, it combines a deterministic large-language model (LLM) inference engine with a cryptoeconomically secured optimistic re-execution protocol so that every inference result can be publicly audited, reproduced, and, if necessary, economically enforced. An untrusted operator runs inference on a fixed GPU architecture, signs and encrypts the request and response, and publishes the encrypted log to EigenDA. During a challenge window, any watcher may request re-execution through EigenVerify; the result is then deterministically recomputed inside a trusted execution environment (TEE) with a threshold-released decryption key, allowing a public challenge with private data. Because inference itself is bit-exact, verification reduces to a byte-equality check, and a single honest replica suffices to detect fraud. We show how this architecture yields sovereign agents – prediction-market judges, trading bots, and scientific assistants – that enjoy state-of-the-art performance while inheriting security from Ethereum’s validator base.
zh
[AI-322] Spec-Driven Development:From Code to Contract in the Age of AI Coding Assistants
【速读】:该论文试图解决传统软件开发中以代码为核心、规格说明(specification)被边缘化的问题,旨在通过引入规范驱动开发(Spec-driven Development, SDD)重构开发流程,将规格说明作为源头真理,而代码则作为生成或验证的副产品。其解决方案的关键在于提出三种不同严谨程度的规范实施层级——规范优先(spec-first)、规范锚定(spec-anchored)和规范即源(spec-as-source),并结合实际工具链(如行为驱动开发框架与GitHub Spec Kit等AI辅助工具)和跨领域案例(API开发、企业系统、嵌入式软件)阐明每种层级的应用场景与价值,最终提供一个决策框架帮助开发者判断何时采用SDD可提升质量与效率,何时选择更轻量的方法即可满足需求。
链接: https://arxiv.org/abs/2602.00180
作者: Deepak Babu Piskala
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Submitted to AIWare 2026. 8 pages, 3 figures
Abstract:The rise of AI coding assistants has reignited interest in an old idea: what if specifications, not code, were the primary artifact of software development? Spec-driven development (SDD) inverts the traditional workflow by treating specifications as the source of truth and code as a generated or verified secondary artifact. This paper provides practitioners with a comprehensive guide to SDD, covering its principles, workflow patterns, and supporting tools. We present three levels of specification rigor (spec-first, spec-anchored, and spec-as-source), with clear guidance on when each applies. Through analysis of tools ranging from Behavior-Driven Development frameworks to modern AI-assisted toolkits like GitHub Spec Kit, we demonstrate how the spec-first philosophy maps to real implementations. We present case studies from API development, enterprise systems, and embedded software, illustrating how different domains apply SDD. We conclude with a decision framework helping practitioners determine when SDD provides value and when simpler approaches suffice.
zh
[AI-323] Learning Robust Reasoning through Guided Adversarial Self-Play
【速读】:该论文旨在解决强化学习从可验证奖励(Reinforcement Learning from Verifiable Rewards, RLVR)中训练出的推理模型在面对有缺陷的条件上下文(如被污染的思维链、误导性的部分解或轻微输入扰动)时可能出现灾难性失效的问题。标准RLVR仅在干净条件下优化最终答案正确性,缺乏对鲁棒性的保障。其解决方案的关键在于提出GASP(Guided Adversarial Self-Play),一种无需人工标注或外部教师的自监督鲁棒化方法:在单一模型内构建对抗性自博弈机制,其中“污染者”学习生成局部一致的破坏性扰动以诱导失败,而“修复者”则在相同污染条件下学习诊断并恢复;同时引入分布内修复引导(in-distribution repair guidance),通过模仿自我生成的修复样本提升恢复概率,且不破坏已有能力,从而实现模型在干扰环境下的稳定性和清洁准确率的协同提升。
链接: https://arxiv.org/abs/2602.00173
作者: Shuozhe Li,Vaishnav Tadiparthi,Kwonjoon Lee,Nakul Agarwal,Hossein Nourkhiz Mahjoub,Ehsan Moradi Pari,Lizhang Chen,Amy Zhang,Liu Leqi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning from verifiable rewards (RLVR) produces strong reasoning models, yet they can fail catastrophically when the conditioning context is fallible (e.g., corrupted chain-of-thought, misleading partial solutions, or mild input perturbations), since standard RLVR optimizes final-answer correctness only under clean conditioning. We introduce GASP (Guided Adversarial Self-Play), a robustification method that explicitly trains detect-and-repair capabilities using only outcome verification. Without human labels or external teachers, GASP forms an adversarial self-play game within a single model: a polluter learns to induce failure via locally coherent corruptions, while an agent learns to diagnose and recover under the same corrupted conditioning. To address the scarcity of successful recoveries early in training, we propose in-distribution repair guidance, an imitation term on self-generated repairs that increases recovery probability while preserving previously acquired capabilities. Across four open-weight models (1.5B–8B), GASP transforms strong-but-brittle reasoners into robust ones that withstand misleading and perturbed context while often improving clean accuracy. Further analysis shows that adversarial corruptions induce an effective curriculum, and in-distribution guidance enables rapid recovery learning with minimal representational drift.
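下面是 GASP 循环形态的玩具示意:污染者破坏条件上下文、智能体在污染下作答、仅凭结果验证打分,并把成功修复存入缓冲区供模仿项使用;所有组件均为桩函数,实际中由同一模型扮演两个角色:

```python
# Toy shape of the GASP loop: a polluter corrupts the conditioning context,
# the agent answers under corruption, outcome-only verification scores both
# sides, and verified repairs feed an imitation buffer. All components are
# stubs standing in for one shared model.
import random

def gasp_step(problem, answer_key, polluter, agent, verify, repair_buffer):
    corrupted = polluter(problem)                    # locally coherent corruption
    response = agent(corrupted)
    if verify(response, answer_key):                 # outcome verification only
        repair_buffer.append((corrupted, response))  # in-distribution repair guidance
        return 1.0, -1.0                             # agent reward, polluter reward
    return 0.0, 1.0                                  # polluter induced a failure

buffer = []
agent_r, polluter_r = gasp_step(
    "2+2=?", "4",
    polluter=lambda p: p + " (hint: the answer is 5)",
    agent=lambda p: "4" if random.random() < 0.5 else "5",
    verify=lambda r, k: r == k,
    repair_buffer=buffer)
print(agent_r, polluter_r, buffer)
```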
zh
[AI-324] The Blessing of Dimensionality in LLM Fine-tuning: A Variance-Curvature Perspective
【速读】:该论文旨在解决两个看似独立但本质相关的问题:一是权重扰动进化策略(Weight-perturbation Evolution Strategies, ES)为何能在极小种群规模(如 $ N \approx 30 $)下有效微调百亿参数语言模型,这与经典零阶优化中的维度诅咒直觉相悖;二是为何在固定超参数条件下,ES 和 GRPO 等随机微调方法的奖励会先上升、达到峰值后下降。论文提出,这两个现象均源于微调损失曲面的共同几何特性——其曲率具有低维主导性(low-dimensional in curvature)。解决方案的关键在于识别出少数高曲率方向主导了优化改进,并由此构建了一个最小二次随机上升模型来解释非单调训练动态(rise-then-decay behavior),同时揭示了“退化改进更新”(degenerate improving updates)机制,即大量随机扰动在这些高曲率方向上共享相似成分,使得小种群即可覆盖有效更新方向。这一发现统一了ES的高效性与非单调动态,并表明高维微调可能比最坏情况理论预测更易被多种优化方法所驾驭。
链接: https://arxiv.org/abs/2602.00170
作者: Qiyao Liang,Jinyeop Song,Yizhou Liu,Jeff Gore,Ila Fiete,Risto Miikkulainen,Xin Qiu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 6 figures, plus appendices
Abstract:Weight-perturbation evolution strategies (ES) can fine-tune billion-parameter language models with surprisingly small populations (e.g., $N \approx 30$), contradicting classical zeroth-order curse-of-dimensionality intuition. We also observe a second seemingly separate phenomenon: under fixed hyperparameters, the stochastic fine-tuning reward often rises, peaks, and then degrades in both ES and GRPO. We argue that both effects reflect a shared geometric property of fine-tuning landscapes: they are low-dimensional in curvature. A small set of high-curvature dimensions dominates improvement, producing (i) heterogeneous time scales that yield rise-then-decay under fixed stochasticity, as captured by a minimal quadratic stochastic-ascent model, and (ii) degenerate improving updates, where many random perturbations share similar components along these directions. Using ES as a geometric probe on fine-tuning reward landscapes of GSM8K, ARC-C, and WinoGrande across Qwen2.5-Instruct models (0.5B–7B), we show that reward-improving perturbations remain empirically accessible with small populations across scales. Together, these results reconcile ES scalability with non-monotonic training dynamics and suggest that high-dimensional fine-tuning may admit a broader class of viable optimization methods than worst-case theory implies.
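To make the small-population intuition concrete, here is a toy, self-contained sketch of weight-perturbation ES on a synthetic reward whose curvature is concentrated in a handful of directions. The dimensions, population size, and reward function are illustrative assumptions, not the paper's setup:
```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 2000, 5                                   # large ambient dim, few "active" dims
U = np.linalg.qr(rng.normal(size=(d, k)))[0]     # dominant-curvature subspace

def reward(theta):
    z = U.T @ theta
    return -100.0 * (z @ z) - 1e-4 * (theta @ theta)  # sharp along U, flat elsewhere

theta = rng.normal(size=d)
N, sigma, lr = 30, 0.1, 0.05                     # population far below d
print("initial reward:", round(reward(theta), 1))
for _ in range(100):
    eps = rng.normal(size=(N, d))                # N random perturbations
    r = np.array([reward(theta + sigma * e) for e in eps])
    adv = (r - r.mean()) / (r.std() + 1e-8)      # standardized fitness
    theta = theta + (lr / (N * sigma)) * (eps.T @ adv)  # ES gradient estimate
print("final reward:  ", round(reward(theta), 1))
```
Because improvement is dominated by the few directions spanned by U, many of the 30 perturbations share a useful component along them, which is the "degenerate improving updates" effect the abstract describes.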
zh
[AI-325] Joint Continual Learning of Local Language Models and Cloud Offloading Decisions with Budget Constraints
【速读】:该论文旨在解决在资源受限环境下,本地部署的小语言模型(Small Language Models, SLMs)在持续学习过程中如何智能、稳定地选择性调用云端大语言模型(Large Language Models, LLMs)的问题。传统基于奖励的强化学习方法易导致任务切换时云调用行为不稳定,并加剧灾难性遗忘(catastrophic forgetting)。其解决方案的关键在于提出DA-GRPO——一种基于组相对策略优化(Group Relative Policy Optimization, GRPO)的双优势扩展方法,通过将云使用约束直接嵌入优势计算过程,避免了固定奖励设计和外部路由模型,使本地模型能够联合学习任务能力与协作策略,从而在不突破预设协助预算的前提下自然产生合理的云端请求行为。
链接: https://arxiv.org/abs/2602.00166
作者: Evan Chen,Wenzhi Fang,Shiqiang Wang,Christopher Brinton
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Locally deployed Small Language Models (SLMs) must continually support diverse tasks under strict memory and computation constraints, making selective reliance on cloud Large Language Models (LLMs) unavoidable. Regulating cloud assistance during continual learning is challenging, as naive reward-based reinforcement learning often yields unstable offloading behavior and exacerbates catastrophic forgetting as task distributions shift. We propose DA-GRPO, a dual-advantage extension of Group Relative Policy Optimization that incorporates cloud-usage constraints directly into advantage computation, avoiding fixed reward shaping and external routing models. This design enables the local model to jointly learn task competence and collaboration behavior, allowing cloud requests to emerge naturally during post-training while respecting a prescribed assistance budget. Experiments on mathematical reasoning and code generation benchmarks show that DA-GRPO improves post-switch accuracy, substantially reduces forgetting, and maintains stable cloud usage compared to prior collaborative and routing-based approaches.
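The core idea, folding a cloud-usage constraint into the advantage computation rather than into a shaped reward, can be sketched as below. The combination rule, the names, and the coefficient `lam` are assumptions for illustration; the paper's exact formulation may differ:
```python
import numpy as np

def dual_advantage(task_rewards, cloud_calls, budget=0.3, lam=1.0):
    """task_rewards: (G,) rewards for a group of rollouts.
    cloud_calls: (G,) 0/1 flags, 1 if the rollout queried the cloud LLM."""
    r = np.asarray(task_rewards, dtype=float)
    c = np.asarray(cloud_calls, dtype=float)
    adv_task = (r - r.mean()) / (r.std() + 1e-8)   # GRPO-style group advantage
    overuse = c.mean() - budget                    # >0 when over the budget
    adv_use = -lam * overuse * (c - c.mean())      # penalize the over-used behavior
    return adv_task + adv_use

# When 3 of 4 rollouts call the cloud against a 30% budget, cloud-calling
# rollouts get their advantages pushed down and the local one pushed up.
print(dual_advantage([1.0, 0.0, 1.0, 1.0], [1, 0, 1, 1]))
```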
zh
[AI-326] Why Are AI Agent Involved Pull Requests (Fix-Related) Remain Unmerged? An Empirical Study
【速读】:该论文旨在解决当前生成式 AI 编码代理(Generative AI Coding Agents)在实际软件仓库中提交的修复类补丁请求(Pull Requests, PRs)被项目维护者接受和合并的有效性问题。研究发现,尽管这些 AI 代理能自动生成大量修复 PR,但其最终集成成功率受限于多种因素,如测试用例失败和已有 PR 已解决相同问题等。解决方案的关键在于通过大规模实证分析(8,106 条 PR)与深度人工定性分析(326 个未合并 PR)相结合,系统识别出阻碍 AI 生成代码被采纳的核心障碍,并为提升 AI 编码代理在真实场景下的实用性、优化人机协作流程提供数据驱动的改进方向。
链接: https://arxiv.org/abs/2602.00164
作者: Khairul Alam,Saikat Mondal,Banani Roy
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 5 pages
Abstract:Autonomous coding agents (e.g., OpenAI Codex, Devin, GitHub Copilot) are increasingly used to generate fix-related pull requests (PRs) in real-world software repositories. However, their practical effectiveness depends on whether these contributions are accepted and merged by project maintainers. In this paper, we present an empirical study of AI-agent-involved fix-related PRs, examining their integration outcomes, latency, and the factors that hinder successful merging. We first analyze 8,106 fix-related PRs authored by five widely used AI coding agents from the AIDEV POP dataset to quantify the proportions of PRs that are merged, closed without merging, or remain open. We then conduct a manual qualitative analysis of a statistically significant sample of 326 closed but unmerged PRs, spending approximately 100 person-hours to construct a structured catalog of 12 failure reasons. Our results indicate that test-case failures and prior resolution of the same issues by other PRs are the most common causes of non-integration, whereas build or deployment failures are comparatively rare. Overall, our findings expose key limitations of current AI coding agents in real-world settings and highlight directions for their further improvement and for more effective human-AI collaboration in software maintenance.
zh
[AI-327] Sheaf Neural Networks and biomedical applications
【速读】:该论文旨在解决当前图神经网络(Graph Neural Networks, GNNs)在处理生物医学数据时存在的性能局限性问题,特别是其在特征提取与结构建模方面的不足。解决方案的关键在于提出并验证一种新型的sheaf神经网络(Sheaf Neural Network, SNN)算法,该算法通过引入层化结构(sheaf structure)对图数据进行更精细的拓扑建模,从而在保持局部信息敏感性的同时增强全局表达能力,最终在具体生物医学案例中展现出优于主流GNN模型(如图卷积网络GCN、图注意力网络GAT和GraphSage)的性能表现。
链接: https://arxiv.org/abs/2602.00159
作者: Aneeqa Mehrab,Jan Willem Van Looy,Pietro Demurtas,Stefano Iotti,Emil Malucelli,Francesca Rossi,Ferdinando Zanchetta,Rita Fioresi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:The purpose of this paper is to elucidate the theory and mathematical modelling behind the sheaf neural network (SNN) algorithm and then show how SNNs can effectively answer biomedical questions in a concrete case study and outperform the most popular graph neural networks (GNNs), such as graph convolutional networks (GCNs), graph attention networks (GATs), and GraphSage.
zh
[AI-328] RAPTOR: Ridge-Adaptive Logistic Probes
【速读】:该论文旨在解决在冻结的大语言模型(Frozen Large Language Model, LLM)中,如何高效且稳定地提取语义概念向量(concept vector)以支持后续的激活引导(activation steering)任务。现有方法在准确性、方向稳定性以及训练成本之间难以平衡,限制了其在实际应用中的有效性。解决方案的关键在于提出 RAPTOR(Ridge-Adaptive Logistic Probe),一种基于L2正则化逻辑回归的轻量级探测器,通过验证集调优的岭强度(ridge strength)自动产生归一化权重作为概念向量。该方法在保持高精度的同时显著降低训练开销,并展现出优异的方向稳定性,其理论机制亦可通过凸高斯极小极大定理(Convex Gaussian Min-max Theorem, CGMT)在理想高维少样本场景下进行解析建模,揭示正则化强度对探测准确性和概念向量稳定性的权衡关系。
链接: https://arxiv.org/abs/2602.00158
作者: Ziqi Gao,Yaotian Zhu,Qingcheng Zeng,Xu Zhao,Ziqing Wang,Feng Ruan,Kaize Ding
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint
Abstract:Probing studies what information is encoded in a frozen LLM’s layer representations by training a lightweight predictor on top of them. Beyond analysis, probes are often used operationally in probe-then-steer pipelines: a learned concept vector is extracted from a probe and injected via additive activation steering by adding it to a layer representation during the forward pass. The effectiveness of this pipeline hinges on estimating concept vectors that are accurate, directionally stable under ablation, and inexpensive to obtain. Motivated by these desiderata, we propose RAPTOR (Ridge-Adaptive Logistic Probe), a simple L2-regularized logistic probe whose validation-tuned ridge strength yields concept vectors from normalized weights. Across extensive experiments on instruction-tuned LLMs and human-written concept datasets, RAPTOR matches or exceeds strong baselines in accuracy while achieving competitive directional stability and substantially lower training cost; these quantitative results are supported by qualitative downstream steering demonstrations. Finally, using the Convex Gaussian Min-max Theorem (CGMT), we provide a mechanistic characterization of ridge logistic regression in an idealized Gaussian teacher-student model in the high-dimensional few-shot regime, explaining how penalty strength mediates probe accuracy and concept-vector stability and yielding structural predictions that qualitatively align with trends observed on real LLM embeddings.
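Since the probe itself is validation-tuned ridge logistic regression, a minimal sketch on synthetic "activations" is straightforward. The ridge grid, steering coefficient, and synthetic data below are assumptions:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d, n = 256, 400
w_true = rng.normal(size=d); w_true /= np.linalg.norm(w_true)  # planted concept
X = rng.normal(size=(n, d))
y = (X @ w_true + 0.5 * rng.normal(size=n) > 0).astype(int)
Xtr, Xva, ytr, yva = train_test_split(X, y, random_state=0)

best = None
for C in [0.01, 0.1, 1.0, 10.0]:          # C = 1 / ridge strength
    clf = LogisticRegression(penalty="l2", C=C, max_iter=2000).fit(Xtr, ytr)
    acc = clf.score(Xva, yva)
    if best is None or acc > best[0]:
        best = (acc, clf)

acc, clf = best
v = clf.coef_.ravel()
v /= np.linalg.norm(v)                     # normalized weights = concept vector
print(f"val acc={acc:.3f}, cos(v, planted)={abs(v @ w_true):.3f}")

h_steered = rng.normal(size=d) + 4.0 * v   # additive activation steering
```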
zh
[AI-329] Reasoning Bomb: A Stealthy Denial-of-Service Attack by Inducing Pathologically Long Reasoning in Large Reasoning Models
【速读】:该论文旨在解决大推理模型(Large Reasoning Models, LRMs)在推理过程中面临的提示诱导型推理时拒绝服务攻击(Prompt-Induced Inference-Time Denial-of-Service, PI-DoS)问题,此类攻击利用LRM的高计算开销通过短输入触发异常长的推理链,从而耗尽系统资源。解决方案的关键在于提出一种基于强化学习的PI-DoS框架——ReasoningBomb,其核心创新包括:(1) 设计一个常数时间代理奖励函数以实现高效优化;(2) 生成短自然语言提示,使攻击具有高放大比(平均达286.7倍)、隐蔽性(99.8%绕过输入检测、98.4%绕过双阶段联合检测)和可优化性,从而有效驱动目标LRM进入路径异常且几乎无法终止的推理状态。
链接: https://arxiv.org/abs/2602.00154
作者: Xiaogeng Liu,Xinyan Wang,Yechao Zhang,Sanjay Kariyappa,Chong Xiang,Muhao Chen,G. Edward Suh,Chaowei Xiao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Pre-print. Code is available at this https URL
Abstract:Large reasoning models (LRMs) extend large language models with explicit multi-step reasoning traces, but this capability introduces a new class of prompt-induced inference-time denial-of-service (PI-DoS) attacks that exploit the high computational cost of reasoning. We first formalize inference cost for LRMs and define PI-DoS, then prove that any practical PI-DoS attack should satisfy three properties: (i) a high amplification ratio, where each query induces a disproportionately long reasoning trace relative to its own length; (ii) stealthiness, in which prompts and responses remain on the natural language manifold and evade distribution shift detectors; and (iii) optimizability, in which the attack supports efficient optimization without being slowed by its own success. Under this framework, we present ReasoningBomb, a reinforcement-learning-based PI-DoS framework that is guided by a constant-time surrogate reward and trains a large reasoning-model attacker to generate short natural prompts that drive victim LRMs into pathologically long and often effectively non-terminating reasoning. Across seven open-source models (including LLMs and LRMs) and three commercial LRMs, ReasoningBomb induces an average of 18,759 completion tokens and 19,263 reasoning tokens across reasoning models. It outperforms the runner-up baseline by 35% in completion tokens and 38% in reasoning tokens, while inducing 6-7x more tokens than benign queries and achieving a 286.7x input-to-output amplification ratio averaged across all samples. Additionally, our method achieves a 99.8% bypass rate on input-based detection, 98.7% on output-based detection, and 98.4% against strict dual-stage joint detection.
zh
[AI-330] MiniTensor: A Lightweight High-Performance Tensor Operations Library
【速读】:该论文旨在解决当前主流深度学习框架(如PyTorch和TensorFlow)在安装包体积过大、资源占用过高,同时缺乏轻量级、高可移植性工具以支持CPU端研究与开发的问题。其解决方案的关键在于设计并实现了一个名为MiniTensor的开源张量运算库,该库通过采用极简主义架构、C++/Rust高性能引擎与Python接口的高效集成(基于PyO3),实现了对密集n维张量操作、广播、归约、矩阵乘法、反向模式自动微分及基础神经网络层和优化器的支持,同时将安装包体积压缩至数MB级别,显著优于主流框架,从而在保持科研实用性的同时极大提升了部署灵活性和计算效率。
链接: https://arxiv.org/abs/2602.00125
作者: Soumyadip Sarkar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Mathematical Software (cs.MS)
备注:
Abstract:We present MiniTensor, an open source tensor operations library that focuses on minimalism, correctness, and performance. MiniTensor exposes a familiar PyTorch-like Python API while executing performance-critical code in a Rust engine. The core supports dense n-dimensional tensors, broadcasting, reductions, matrix multiplication, reverse-mode automatic differentiation, a compact set of neural network layers, and standard optimizers. In this paper, we describe the design of MiniTensor's architecture, including its efficient memory management, dynamic computation graph for gradients, and integration with Python via PyO3. We also compare the install footprint with PyTorch and TensorFlow to demonstrate that MiniTensor achieves a package size of only a few megabytes, several orders of magnitude smaller than mainstream frameworks, while preserving the essentials needed for research and development on CPUs. The repository can be found at this https URL
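For flavor, here is a toy scalar reverse-mode autodiff in Python, illustrating the kind of dynamic computation graph the abstract mentions; this is illustration only, not MiniTensor's actual API or its Rust engine:
```python
class Scalar:
    """Toy reverse-mode autodiff node: the graph is built as ops execute."""
    def __init__(self, value, parents=()):
        self.value, self.grad = value, 0.0
        self._parents = parents            # (node, local_gradient) pairs

    def __add__(self, other):
        return Scalar(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Scalar(self.value * other.value,
                      ((self, other.value), (other, self.value)))

    def backward(self, seed=1.0):
        self.grad += seed                  # accumulate upstream gradient
        for node, local in self._parents:
            node.backward(seed * local)    # chain rule down the graph

x, y = Scalar(2.0), Scalar(3.0)
z = x * y + x                              # z = xy + x
z.backward()
print(x.grad, y.grad)                      # dz/dx = y + 1 = 4.0, dz/dy = x = 2.0
```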
zh
[AI-331] Generative Artificial Intelligence in Small and Medium Enterprises: Navigating its Promises and Challenges
【速读】:该论文旨在解决中小企业(SMEs)在面对生成式人工智能(Generative AI, GAI)技术时,如何有效应对机遇与挑战,并实现成功部署的问题。其核心解决方案在于提出一个以“航海”为隐喻的战略框架,强调员工能力(competency of employees)、领导力与工作价值观(effective leadership and work values)、组织文化(organizational culture)、协作与合作(collaboration and cooperation)以及第三方关系(relationships with third parties)五大关键维度,为 SMEs 提供可操作的路线图和实践建议,从而确保 GAI 技术能够被高效整合到业务流程中并转化为长期竞争优势。
链接: https://arxiv.org/abs/2602.00091
作者: Kumaran Rajaram,Patrick Nicolas Tinguely
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 31 pages, 1 figure, 3 tables
Abstract:The latest technological developments in generative artificial intelligence (GAI) offer powerful capabilities to small and medium enterprises (SMEs), as they facilitate the democratization of both scalability and creativity. Even if they have little technical expertise or financial resources, SMEs can leverage this technology to streamline work processes and unleash innovation, thereby improving their product offerings and long-term competitiveness. This paper discusses how SMEs can navigate both the promises and challenges of GAI and offers a roadmap for deploying GAI. We introduce a sailing metaphor that reveals key strategic dimensions for GAI deployment: competency of employees, effective leadership and work values, organizational culture, collaboration and cooperation, and relationships with third parties. We offer practical recommendations that serve as a useful compass for successfully deploying GAI in SMEs.
zh
[AI-332] From Numbers to Prompts: A Cognitive Symbolic Transition Mechanism for Lightweight Time-Series Forecasting
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在时间序列预测任务中因计算和内存开销过大而难以部署于轻量化平台的问题。其核心解决方案是提出符号转换机制(Symbolic Transition Mechanism, STM),通过基于人类认知结构的量化技术将连续时间序列数据映射为符号标记(symbol tokens),并利用符号的结构化变换捕捉时间动态,从而引导语言模型聚焦于时序数据中的关键特征。STM作为通用增强机制,在不破坏基础语言模型完整性的同时,显著提升了其对时间序列动态与结构模式的推理效率,实验证明其可在极低资源开销下实现高达69%的MAE误差降低和90%的MSE误差降低。
链接: https://arxiv.org/abs/2602.00088
作者: Namkyung Yoon,Hwangnam Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 5 figures. Submitted to ACM Transactions on Intelligent Systems and Technology
Abstract:Large language models have achieved remarkable success in time series prediction tasks, but their substantial computational and memory requirements limit deployment on lightweight platforms. In this paper, we propose the Symbolic Transition Mechanism (STM), a novel framework that bridges numeric time series data and language models through symbolic abstraction and prompt engineering. STM transforms continuous time series values into symbol tokens with quantization techniques based on human cognitive structures, and captures temporal dynamics through structured transformations of symbols, enabling fast prompt-engineering-based predictions in which language models focus on critical parts of time series data. STM is a general-purpose mechanism that preserves the integrity of backbone language models while significantly improving their efficiency by inferring the dynamic and structured patterns inherent in time series data. We evaluated STM on various time series datasets, paired with four small language models (SLMs) in limited computational environments. For all models, STM achieves error reductions of up to 69% in MAE and 90% in MSE compared to the default backbone SLM without STM. These results demonstrate the potential of STM as an efficient, adaptable layer for symbol-driven time series prediction using foundation models. The accuracy improvements come at negligible resource cost, with maximum GPU memory of the base model increasing by approximately 0.06% and latency overhead increasing by only 0.64%.
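A rough sketch of the numeric-to-symbol step: quantize the series into symbol tokens via quantile bins and wrap them in a prompt for a small language model. The bin design and prompt wording are illustrative guesses, not the paper's exact scheme:
```python
import numpy as np

def to_symbols(series, alphabet="ABCDE"):
    """Map values to coarse symbols via quantile bins (low -> 'A', high -> 'E')."""
    edges = np.quantile(series, np.linspace(0, 1, len(alphabet) + 1)[1:-1])
    return "".join(alphabet[int(np.searchsorted(edges, v))] for v in series)

rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 48)
series = np.sin(t) + 0.1 * rng.normal(size=t.size)   # noisy periodic series

symbols = to_symbols(series)
prompt = ("A sensor series quantized into symbols A (lowest) to E (highest):\n"
          f"{symbols}\n"
          "Continue the pattern with the next 8 symbols.")
print(prompt)
```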
zh
[AI-333] ECCO: Evidence-Driven Causal Reasoning for Compiler Optimization
【速读】:该论文旨在解决编译器自动调优(compiler auto-tuning)中传统黑箱搜索方法缺乏语义指导与当前大语言模型(Large Language Model, LLM)方法存在表面模式匹配和因果不透明的问题。解决方案的关键在于提出ECCO框架,通过逆向工程构建链式思维(Chain-of-Thought)数据集,显式地将静态代码特征映射到可验证的性能证据,使模型学习优化决策背后的因果逻辑;同时设计协同推理机制,让LLM作为策略制定者动态引导遗传算法的变异操作,从而实现可解释推理与组合搜索的有效结合。
链接: https://arxiv.org/abs/2602.00087
作者: Haolin Pan,Lianghong Huang,Jinyuan Dong,Mingjie Xing,Yanjun Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF); Programming Languages (cs.PL)
备注:
Abstract:Compiler auto-tuning faces a dichotomy between traditional black-box search methods, which lack semantic guidance, and recent Large Language Model (LLM) approaches, which often suffer from superficial pattern matching and causal opacity. In this paper, we introduce ECCO, a framework that bridges interpretable reasoning with combinatorial search. We first propose a reverse engineering methodology to construct a Chain-of-Thought dataset, explicitly mapping static code features to verifiable performance evidence. This enables the model to learn the causal logic governing optimization decisions rather than merely imitating sequences. Leveraging this interpretable prior, we design a collaborative inference mechanism where the LLM functions as a strategist, defining optimization intents that dynamically guide the mutation operations of a genetic algorithm. Experimental results on seven datasets demonstrate that ECCO significantly outperforms the LLVM opt -O3 baseline, achieving an average 24.44% reduction in cycles.
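The strategist-guided search can be sketched as a genetic loop whose mutations are biased by a (stubbed) LLM intent. The pass names are real LLVM passes, but the intents, fitness function, and strategist below are placeholders, not ECCO's implementation:
```python
import random

PASSES = ["inline", "loop-unroll", "gvn", "licm", "sroa", "instcombine"]
INTENT_BIAS = {"reduce-memory-traffic": ["licm", "gvn", "sroa"],
               "shrink-loop-bodies": ["loop-unroll", "instcombine"]}

def llm_strategist(_program_features):
    return "reduce-memory-traffic"          # stand-in for the LLM's intent

def mutate(seq, intent, p_bias=0.7):
    pool = INTENT_BIAS[intent] if random.random() < p_bias else PASSES
    seq = list(seq)
    seq[random.randrange(len(seq))] = random.choice(pool)
    return seq

def estimated_cycles(seq):
    # Stub fitness: a real system compiles with `opt` and measures cycles.
    return 100 - 10 * len(set(seq))

random.seed(0)
pop = [random.choices(PASSES, k=6) for _ in range(20)]
for _ in range(30):
    intent = llm_strategist(None)
    pop.sort(key=estimated_cycles)          # lower estimated cycles = better
    pop = pop[:10] + [mutate(random.choice(pop[:10]), intent) for _ in range(10)]
pop.sort(key=estimated_cycles)
print("best pass sequence:", pop[0])
```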
zh
[AI-334] Standards for trustworthy AI in the European Union: technical rationale, structural challenges and an implementation path
【速读】:该论文旨在解决欧盟人工智能法案(AI Act)框架下技术标准如何有效支撑合规性认定的问题,特别是应对生成式 AI 系统在 stochastic behavior(随机行为)、数据依赖性、评估实践不成熟及生命周期动态变化等方面的独特标准化挑战。解决方案的关键在于构建一个分层标准体系:通过水平标准(horizontal standards)明确过程义务与证据结构,结合领域特定的垂直标准(sectoral profiles)设定阈值与接受准则;同时引入以风险管控为核心的方法论,将可重复的技术检测重新定义为测量属性的稳定性,辅以结构化文档、全面日志记录和随系统生命周期演进的保证案例(assurance cases),从而实现法律义务向可审计工程实践的转化,并支持跨提供者、评估机构和执法部门的规模化合规评估。
链接: https://arxiv.org/abs/2602.00078
作者: Piercosma Bisconti,Marcello Galisai
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:This white paper examines the technical foundations of European AI standardization under the AI Act. It explains how harmonized standards enable the presumption of conformity mechanism, describes the CEN/CENELEC standardization process, and analyzes why AI poses unique standardization challenges including stochastic behavior, data dependencies, immature evaluation practices, and lifecycle dynamics. The paper argues that AI systems are typically components within larger sociotechnical systems, requiring a layered approach where horizontal standards define process obligations and evidence structures while sectoral profiles specify domain-specific thresholds and acceptance criteria. It proposes a workable scheme based on risk management, reproducible technical checks redefined as stability of measured properties, structured documentation, comprehensive logging, and assurance cases that evolve over the system lifecycle. The paper demonstrates that despite methodological difficulties, technical standards remain essential for translating legal obligations into auditable engineering practice and enabling scalable conformity assessment across providers, assessors, and enforcement authorities.
zh
[AI-335] Adoption and Use of LLMs at an Academic Medical Center
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在临床文档记录中因手动数据输入导致的“工作流摩擦”问题,以及如何将LLM能力整合到电子健康记录(Electronic Health Record, EHR)系统中以实现持续、高效且可评估的临床辅助应用。其解决方案的关键在于构建了一个名为ChatEHR的内部开发平台,该平台通过静态提示组合与多源数据集成实现自动化任务(automations),并提供用户友好的界面(UI)支持交互式使用,从而将LLM能力转化为机构级功能。该设计具备模型无关性(model-agnostic)和跨数据类型访问能力,使不同临床或行政任务能够匹配最优LLM,并通过持续监控与评估机制确保性能可控,最终实现了显著的成本节约(首年估计达600万美元)与效率提升。
链接: https://arxiv.org/abs/2602.00074
作者: Nigam H. Shah,Nerissa Ambers,Abby Pandya,Timothy Keyes,Juan M. Banda,Srikar Nallan,Carlene Lugtu,Artem A. Trotsyuk,Suhana Bedi,Alyssa Unell,Miguel Fuentes,Francois Grolleau,Sneha S. Jain,Jonathan Chen,Devdutta Dash,Danton Char,Aditya Sharma,Duncan McElfresh,Patrick Scully,Vishanthan Kumar,Connor OBrien,Satchi Mouniswamy,Elvis Jones,Krishna Jasti,Gunavathi Mannika Lakshmanan,Sree Ram Akula,Varun Kumar Singh,Ramesh Rajmanickam,Sudhir Sinha,Vicky Zhou,Xu Wang,Bilal Mawji,Joshua Ge,Wencheng Li,Travis Lyons,Jarrod Helzer,Vikas Kakkar,Ramesh Powar,Darren Batara,Cheryl Cordova,William Frederick III,Olivia Tang,Phoebe Morgan,April S. Liang,Stephen P. Ma,Shivam Vedak,Dong-han Yao,Akshay Swaminathan,Mehr Kashyap,Brian Ng,Jamie Hellman,Nikesh Kotecha,Christopher Sharp,Gretchen Brown,Christian Lindmark,Anurang Revri,Michael A. Pfeffer
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:While large language models (LLMs) can support clinical documentation needs, standalone tools struggle with "workflow friction" from manual data entry. We developed ChatEHR, a system that enables the use of LLMs with the entire patient timeline spanning several years. ChatEHR enables automations - which are static combinations of prompts and data that perform a fixed task - and interactive use in the electronic health record (EHR) via a user interface (UI). The resulting ability to sift through patient medical records for diverse use-cases such as pre-visit chart review, screening for transfer eligibility, monitoring for surgical site infections, and chart abstraction, redefines LLM use as an institutional capability. This system, accessible after user-training, enables continuous monitoring and evaluation of LLM use. In 1.5 years, we built 7 automations and 1075 users have trained to become routine users of the UI, engaging in 23,000 sessions in the first 3 months of launch. For automations, being model-agnostic and accessing multiple types of data was essential for matching specific clinical or administrative tasks with the most appropriate LLM. Benchmark-based evaluations proved insufficient for monitoring and evaluation of the UI, requiring new methods to monitor performance. Generation of summaries was the most frequent task in the UI, with an estimated 0.73 hallucinations and 1.60 inaccuracies per generation. The resulting mix of cost savings, time savings, and revenue growth required a value assessment framework to prioritize work as well as quantify the impact of using LLMs. Initial estimates are USD 6M in savings in the first year of use, without quantifying the benefit of the better care offered. Such a "build-from-within" strategy provides an opportunity for health systems to maintain agency via a vendor-agnostic, internally governed LLM platform.
zh
[AI-336] Modality as Heterogeneity: Node Splitting and Graph Rewiring for Multimodal Graph Learning
【速读】:该论文旨在解决多模态图(Multimodal Graph)学习中因模态混淆(modality confusion)导致的表征失真问题,即在通用图神经网络(GNN)中不同模态信息混合后难以保持各自语义独立性与结构完整性。其解决方案的关键在于提出NSG-MoE框架,通过引入节点分裂(Node Splitting)与图重连(Graph-rewiring)机制,并结合结构化Mixture-of-Experts(MoE)架构,显式地将每个节点分解为模态特异性组件,并为异构消息流分配关系感知专家进行处理,从而在保留图结构信息的同时有效分离多模态语义,缓解模态混杂效应。
链接: https://arxiv.org/abs/2602.00067
作者: Yihan Zhang,Ercan E. Kuruoglu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal graphs are gaining increasing attention due to their rich representational power and wide applicability, yet they introduce substantial challenges arising from severe modality confusion. To address this issue, we propose NSG (Node Splitting Graph)-MoE, a multimodal graph learning framework that integrates a node-splitting and graph-rewiring mechanism with a structured Mixture-of-Experts (MoE) architecture. It explicitly decomposes each node into modality-specific components and assigns relation-aware experts to process heterogeneous message flows, thereby preserving structural information and multimodal semantics while mitigating the undesirable mixing effects commonly observed in general-purpose GNNs. Extensive experiments on three multimodal benchmarks demonstrate that NSG-MoE consistently surpasses strong baselines. Despite incorporating MoE – which is typically computationally heavy – our method achieves competitive training efficiency. Beyond empirical results, we provide a spectral analysis revealing that NSG performs adaptive filtering over modality-specific subspaces, thus explaining its disentangling behavior. Furthermore, an information-theoretic analysis shows that the architectural constraints imposed by NSG reduce mutual information between data and parameters, improving generalization capability.
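One structural reading of the node-splitting and rewiring step is sketched below; the exact rewiring rules and expert assignment in NSG-MoE may differ:
```python
# Each node becomes one copy per modality; original edges are rewired between
# same-modality copies, and a node's own copies are linked so modalities can
# still exchange information (an assumption about the rewiring rule).
MODALITIES = ["text", "image"]
nodes = [0, 1, 2]
edges = [(0, 1), (1, 2)]

split_nodes = [(v, m) for v in nodes for m in MODALITIES]
rewired = []
for u, v in edges:                          # inter-node, same-modality edges
    for m in MODALITIES:
        rewired.append(((u, m), (v, m)))
for v in nodes:                             # intra-node, cross-modality edges
    rewired.append(((v, "text"), (v, "image")))

print(len(split_nodes), "nodes and", len(rewired), "edges after splitting")
```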
zh
[AI-337] Responsible Evaluation of AI for Mental Health
【速读】:该论文试图解决当前人工智能(Artificial Intelligence, AI)在心理健康支持领域评估方法碎片化、与临床实践、社会背景及用户真实体验脱节的问题。其解决方案的关键在于提出一个融合临床合理性、社会情境与公平性的跨学科评估框架,并进一步构建了以“评估导向”“干预导向”和“信息整合导向”为分类的AI心理健康支持类型学,明确了不同类别工具的风险特征与评估需求,从而为系统性、负责任地评价AI心理健康工具提供结构化依据。
链接: https://arxiv.org/abs/2602.00065
作者: Hiba Arnaout,Anmol Goel,H. Andrew Schwartz,Steffen T. Eberhardt,Dana Atzil-Slonim,Gavin Doherty,Brian Schwartz,Wolfgang Lutz,Tim Althoff,Munmun De Choudhury,Hamidreza Jamalabadi,Raj Sanjay Shah,Flor Miriam Plaza-del-Arco,Dirk Hovy,Maria Liakata,Iryna Gurevych
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Although artificial intelligence (AI) shows growing promise for mental health care, current approaches to evaluating AI tools in this domain remain fragmented and poorly aligned with clinical practice, social context, and first-hand user experience. This paper argues for a rethinking of responsible evaluation – what is measured, by whom, and for what purpose – by introducing an interdisciplinary framework that integrates clinical soundness, social context, and equity, providing a structured basis for evaluation. Through an analysis of 135 recent *CL publications, we identify recurring limitations, including over-reliance on generic metrics that do not capture clinical validity, therapeutic appropriateness, or user experience, limited participation from mental health professionals, and insufficient attention to safety and equity. To address these gaps, we propose a taxonomy of AI mental health support types – assessment-, intervention-, and information synthesis-oriented – each with distinct risks and evaluative requirements, and illustrate its use through case studies.
zh
[AI-338] SPGCL: Effective Graph Contrastive Learning via SVD-Guided Structural Perturbation
【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)对结构噪声(如对抗攻击或非对抗性缺陷导致的虚假边或缺失边)高度敏感的问题。现有图对比学习方法通常依赖随机扰动(如边删除)生成多样视图,或仅使用谱增强(如SVD)保留全局结构先验,但前者缺乏结构感知易误删关键边,后者生成的视图往往稠密且多样性不足。解决方案的关键在于提出SPGCL框架——通过SVD引导的结构扰动实现轻量级随机边移除与SVD指导的精修步骤相结合:前者引入结构差异,后者恢复误删的重要边并引入语义合理的缺失链接,同时通过稀疏Top-K边选择避免图密度增加;并通过控制边删除与恢复率显式调节视图间的结构差异,使对比信号聚焦于语义结构差异而非边数差异,进一步引入全局相似性约束的对比融合模块以增强两视图对齐。
链接: https://arxiv.org/abs/2602.00064
作者: Hao Deng,Yingping Li,Shuiping Gou,Bo Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Graph Neural Networks (GNNs) can be highly sensitive to structural noise, including spurious or missing edges caused by adversarial attacks or non-adversarial imperfections. Existing graph contrastive learning methods typically rely on either random perturbations (e.g., edge dropping) to generate diverse views or purely spectral augmentations (e.g., SVD) to preserve global structural priors. However, random perturbations are structure-agnostic and may remove critical edges, while SVD-based views often become dense and lack sufficient diversity. To bridge this gap, we propose SPGCL, a robust graph contrastive learning framework via SVD-guided structural perturbation. SPGCL couples lightweight stochastic edge removal with an SVD-guided refinement step that can recover mistakenly removed informative edges and introduce semantically meaningful missing links while avoiding graph densification through sparse top-ranked edge selection and merging. By balancing edge removal and recovery rates, SPGCL explicitly controls structural discrepancy between views so that contrastive signals reflect semantic structural differences rather than edge-count gaps. We further incorporate a contrastive fusion module regularized by a global similarity constraint to better align the two views. Extensive experiments on ten benchmark datasets demonstrate that SPGCL consistently improves robustness and accuracy of base GNNs, outperforming state-of-the-art graph contrastive learning and structure learning methods.
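A small sketch of the SVD-guided refinement with sparse top-K selection, run on a random graph; the rank, K, and scoring-by-low-rank-reconstruction are assumptions read off the abstract:
```python
import numpy as np

rng = np.random.default_rng(0)
n, r, K = 30, 4, 60
A = (rng.random((n, n)) < 0.15).astype(float)
A = np.triu(A, 1); A = A + A.T                   # random undirected graph

U, s, Vt = np.linalg.svd(A)
S = (U[:, :r] * s[:r]) @ Vt[:r]                  # rank-r reconstruction scores
S = np.triu(S, 1)                                # score each unordered pair once
flat = np.argsort(S, axis=None)[::-1][:K]        # keep only top-K candidates,
rows, cols = np.unravel_index(flat, S.shape)     # so the view stays sparse
refined = np.zeros_like(A)
refined[rows, cols] = refined[cols, rows] = 1.0
print("edges in refined view:", int(refined.sum() // 2))
```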
zh
[AI-339] The Impact of Machine Learning Uncertainty on the Robustness of Counterfactual Explanations
【速读】:该论文旨在解决当前生成式AI(Generative AI)模型中反事实解释(counterfactual explanations)在面对模型不确定性(model uncertainty)和数据不确定性(data uncertainty)时稳定性不足的问题。现有方法未充分验证其在真实世界变异性下的鲁棒性,导致生成的解释可能不稳定甚至无效。解决方案的关键在于系统评估常见机器学习模型与反事实生成算法组合在同时存在偶然不确定性(aleatoric uncertainty)和认知不确定性(epistemic uncertainty)条件下的表现,通过合成数据和真实世界表格数据集的实验发现:即使模型准确率小幅下降(由噪声增加或样本有限引起),反事实解释在平均层面和个体实例上均会出现显著波动,从而强调了开发具有不确定性感知能力的解释方法的重要性,尤其是在金融和社会科学等对可靠性要求较高的领域。
链接: https://arxiv.org/abs/2602.00063
作者: Leonidas Christodoulou,Chang Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Counterfactual explanations are widely used to interpret machine learning predictions by identifying minimal changes to input features that would alter a model’s decision. However, most existing counterfactual methods have not been tested when model and data uncertainty change, resulting in explanations that may be unstable or invalid under real-world variability. In this work, we investigate the robustness of common combinations of machine learning models and counterfactual generation algorithms in the presence of both aleatoric and epistemic uncertainty. Through experiments on synthetic and real-world tabular datasets, we show that counterfactual explanations are highly sensitive to model uncertainty. In particular, we find that even small reductions in model accuracy - caused by increased noise or limited data - can lead to large variations in the generated counterfactuals on average and on individual instances. These findings underscore the need for uncertainty-aware explanation methods in domains such as finance and the social sciences.
zh
[AI-340] SCPL: Enhancing Neural Network Training Throughput with Decoupled Local Losses and Model Parallelism
【速读】:该论文旨在解决企业在信息管理系统中采用大规模人工智能(Artificial Intelligence, AI)模型时面临的高训练成本和长开发周期问题,其核心挑战在于传统端到端反向传播(Backpropagation, BP)算法导致深度网络训练效率低下。解决方案的关键在于提出一种新的训练方法——监督对比并行学习(Supervised Contrastive Parallel Learning, SCPL),通过解耦BP过程将长梯度流分解为多个短梯度流,从而实现不同层参数梯度的并行计算,显著提升模型并行性与训练吞吐量,有效缓解了训练中的性能瓶颈。
链接: https://arxiv.org/abs/2602.00062
作者: Ming-Yao Ho,Cheng-Kai Wang,You-Teng Lin,Hung-Hsuan Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Adopting large-scale AI models in enterprise information systems is often hindered by high training costs and long development cycles, posing a significant managerial challenge. The standard end-to-end backpropagation (BP) algorithm is a primary driver of modern AI, but it is also the source of inefficiency in training deep networks. This paper introduces a new training methodology, Supervised Contrastive Parallel Learning (SCPL), that addresses this issue by decoupling BP and transforming a long gradient flow into multiple short ones. This design enables the simultaneous computation of parameter gradients in different layers, achieving superior model parallelism and enhancing training throughput. Detailed experiments are presented to demonstrate the efficiency and effectiveness of our model compared to BP, Early Exit, GPipe, and Associated Learning (AL), a state-of-the-art method for decoupling backpropagation. By mitigating a fundamental performance bottleneck, SCPL provides a practical pathway for organizations to develop and deploy advanced information systems more cost-effectively and with greater agility. The experimental code is released for reproducibility at this https URL
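The decoupling idea, cutting one long gradient flow into block-local ones, can be sketched in PyTorch with `detach()` between blocks; the plain classification head below stands in for SCPL's supervised contrastive loss:
```python
import torch
import torch.nn as nn

torch.manual_seed(0)
blocks = nn.ModuleList([nn.Sequential(nn.Linear(32, 32), nn.ReLU())
                        for _ in range(3)])
heads = nn.ModuleList([nn.Linear(32, 10) for _ in range(3)])
opts = [torch.optim.SGD(list(b.parameters()) + list(h.parameters()), lr=0.1)
        for b, h in zip(blocks, heads)]
loss_fn = nn.CrossEntropyLoss()

x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))
h = x
for block, head, opt in zip(blocks, heads, opts):
    h = block(h.detach())          # detach: no gradient crosses block boundaries
    loss = loss_fn(head(h), y)     # block-local objective
    opt.zero_grad()
    loss.backward()                # short gradient flow, one block deep
    opt.step()
print("trained 3 blocks with decoupled local losses")
```
Because each block's backward pass is independent of the others, the per-block updates could in principle run concurrently, which is where the throughput gain comes from.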
zh
[AI-341] Simple Role Assignment is Extraordinarily Effective for Safety Alignment
【速读】:该论文旨在解决当前基于原则的对齐方法在实际应用中缺乏情境敏感性和完整性的问题,尤其是在生成式 AI(Generative AI)的安全性控制方面表现不足。其核心解决方案是提出一种基于心智理论(Theory of Mind)的角色条件化(role conditioning)机制,将社会角色(如母亲、法官等)作为紧凑且语义丰富的对齐信号,这些角色隐式编码了价值体系和应用这些价值所需的认知模式(cognitive schemas)。关键创新在于引入了一个无需训练的流水线:包含角色条件生成器与迭代式角色基础批评者(role-based critics),通过角色驱动的推理过程实现更安全、可解释的模型输出。实验证明该方法在多个模型家族和基准测试中显著优于基于原则、思维链(Chain-of-Thought, CoT)等基线,尤其在减少不安全输出方面效果突出,例如在WildJailbreak基准上将DeepSeek-V3的不安全输出率从81.4%降至3.6%。
链接: https://arxiv.org/abs/2602.00061
作者: Zhou Ziheng,Jiakun Ding,Zhaowei Zhang,Ruosen Gao,Yingnian Wu,Demetri Terzopoulos,Yipeng Kang,Fangwei Zhong,Junqi Wang
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Principle-based alignment often lacks context sensitivity and completeness. Grounded in Theory of Mind, we propose role conditioning as a compact alternative: social roles (e.g., mother, judge) implicitly encode both values and the cognitive schemas required to apply them. We introduce a training-free pipeline featuring a role-conditioned generator and iterative role-based critics for refinement. Across five model families, our approach consistently outperforms principle-based, Chain-of-Thought (CoT), and other baselines across benchmarks. Notably, it reduces unsafe outputs on the WildJailbreak benchmark from 81.4% to 3.6% with DeepSeek-V3. This holds not only on common safety benchmarks but also on agentic safety tasks. These results establish role assignment as a powerful, interpretable paradigm for AI alignment and LLM-as-a-Judge construction.
zh
[AI-342] A longitudinal geospatial multimodal dataset of post-discharge frailty, physiology, mobility and neighborhoods
【速读】:该论文旨在解决老年虚弱患者从医院出院后康复过程中多维因素(如功能退化、移动能力下降、社会孤立及社区环境影响)难以连续监测与量化的问题。其解决方案的关键在于构建并公开了一个名为GEOFRAIL的纵向地理空间多模态数据集,该数据集整合了来自可穿戴传感器、临床评估和环境特征(如邻近设施、犯罪率、社会经济指标)的多源异构数据,并通过隐私保护的空间聚合方法实现真实世界场景下的持续追踪,从而为机器学习模型识别康复轨迹提供了高质量的数据基础。
链接: https://arxiv.org/abs/2602.00060
作者: Ali Abedi,Charlene H. Chu,Shehroz S. Khan
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Frailty in older adults is associated with increased vulnerability to functional decline, reduced mobility, social isolation, and challenges during the transition from hospital to community living. These factors are associated with rehospitalization and may adversely influence recovery. Neighborhood environments can further shape recovery trajectories by affecting mobility opportunities, social engagement, and access to community resources. Multimodal sensing technologies combined with data-driven analytical approaches offer the potential to continuously monitor these multidimensional factors in real-world settings. This Data Descriptor presents GEOFRAIL, a longitudinal geospatial multimodal dataset collected from community-dwelling frail older adults following hospital discharge. The dataset is organized into interconnected tables capturing participant demographics, features derived from multimodal sensors, biweekly clinical assessments of frailty, physical function, and social isolation, and temporal location records linked to neighborhood amenities, crime rates, and census-based socioeconomic indicators. Data were collected over an eight-week post-discharge period using standardized pipelines with privacy-preserving spatial aggregation. Technical validation demonstrates internal consistency across geospatial, sensor-derived, and clinical measures and reports baseline performance of machine learning models for characterizing recovery trajectories.
zh
[AI-343] xtBFGS: Quasi-Newton Optimization for Discrete Executable Text via Gradient-Operator Retrieval
【速读】:该论文旨在解决当前基于梯度的离散文本优化方法(如提示词和代码优化)主要依赖一阶优化器(如随机梯度下降)所导致的收敛速度慢与不稳定性问题,其根本原因在于这些方法忽略了优化景观中的语义曲率(semantic curvature)。解决方案的关键在于提出TextBFGS,一种基于拟牛顿法的二阶优化框架,通过从预训练的成功优化轨迹记忆中检索梯度算子(Gradient-Operators)来近似逆海森矩阵(inverse Hessian matrix),从而实现对文本变量的高效二阶修正。该方法采用“单次更新”机制,在一次推理过程中融合反馈生成与二阶校正,显著提升了优化效率与稳定性,并在多个代码优化任务(如HumanEval、MBPP)中展现出优越性能与跨任务迁移能力。
链接: https://arxiv.org/abs/2602.00059
作者: Zizheng Zhang,Yuyang Liao,Chen Chen,Jian He,Dun Wu,Qianjin Yu,Yanqin Gao,Jin Yang,Kailai Zhang,Eng Siong Chng,Xionghu Zhong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Optimizing discrete executable text such as prompts and code has recently been framed as a gradient-based process, effectively translating backpropagation concepts to the semantic space. However, existing methods predominantly operate as first-order optimizers akin to Stochastic Gradient Descent, which suffer from slow convergence and instability because they neglect the semantic curvature of the optimization landscape. To bridge this gap, we introduce TextBFGS, a second-order framework that implements a Quasi-Newton optimization method for discrete text. Unlike traditional memory-based approaches that retrieve similar textual instances, TextBFGS approximates the inverse Hessian matrix by retrieving Gradient-Operators from the memory of pre-learned successful trajectories. Specifically, given a textual gradient feedback, TextBFGS identifies historical correction patterns from the optimization knowledge base and tries to apply these abstract operators to the current variable. This mechanism enables a One-Pass Update, combining feedback generation and second-order correction into a single inference step. Empirical evaluations on code optimization across diverse domains (e.g., HumanEval, MBPP) demonstrate that TextBFGS significantly outperforms first-order baselines. It achieves superior pass rates with fewer model calls and exhibits strong cross-task transferability, thus establishing a mathematically grounded paradigm for efficient, memory-aware text optimization.
zh
[AI-344] How Hyper-Datafication Impacts the Sustainability Costs in Frontier AI
【速读】:该论文旨在解决大规模数据驱动的前沿人工智能(AI)发展中所隐含的环境、社会与经济成本问题,特别是揭示当前“超数据化”(hyper-datafication)趋势下资源消耗加剧、环境负担转移、劳动风险集中以及文化代表性失衡等系统性不公现象。其解决方案的关键在于提出一套名为Data PROOFS的综合性建议框架,涵盖数据溯源(provenance)、资源意识(resource awareness)、所有权(ownership)、开放性(openness)、节俭性(frugality)和标准制定(standards),以系统性降低数据相关成本,并推动更具可持续性和公平性的AI发展路径。
链接: https://arxiv.org/abs/2602.00056
作者: Sophia N. Wilson,Sebastian Mair,Mophat Okinyi,Erik B. Dam,Janin Koch,Raghavendra Selvan
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 14 pages
Abstract:Large-scale data has fuelled the success of frontier artificial intelligence (AI) models over the past decade. This expansion has relied on sustained efforts by large technology corporations to aggregate and curate internet-scale datasets. In this work, we examine the environmental, social, and economic costs of large-scale data in AI through a sustainability lens. We argue that the field is shifting from building models from data to actively creating data for building models. We characterise this transition as hyper-datafication, which marks a critical juncture for the future of frontier AI and its societal impacts. To quantify and contextualise data-related costs, we analyse approximately 550,000 datasets from the Hugging Face Hub, focusing on dataset growth, storage-related energy consumption and carbon footprint, and societal representation using language data. We complement this analysis with qualitative responses from data workers in Kenya to examine the labour involved, including direct employment by big tech corporations and exposure to graphic content. We further draw on external data sources to substantiate our findings by illustrating the global disparity in data centre infrastructure. Our analyses reveal that hyper-datafication does not merely increase resource consumption but systematically redistributes environmental burdens, labour risks, and representational harms toward the Global South, precarious data workers, and under-represented cultures. Thus, we propose Data PROOFS recommendations spanning provenance, resource awareness, ownership, openness, frugality, and standards to mitigate these costs. Our work aims to make visible the often-overlooked costs of data that underpin frontier AI and to stimulate broader debate within the research community and beyond.
zh
[AI-345] Scalable and Secure AI Inference in Healthcare: A Comparative Benchmarking of FastAPI and Triton Inference Server on Kubernetes
【速读】:该论文旨在解决医疗健康领域中机器学习(Machine Learning, ML)模型在生产环境中的高效且可扩展部署问题,尤其关注推理延迟、吞吐量与数据隐私合规性(如HIPAA)之间的平衡。其解决方案的关键在于对比两种部署范式:轻量级的基于Python的FastAPI REST服务与高性能的NVIDIA Triton Inference Server,并进一步提出一种混合架构——以FastAPI作为安全网关实现受保护健康信息(PHI)脱敏,由Triton负责后端高并发推理任务。实验证明,该混合方案在保障数据安全的同时,显著提升了系统吞吐能力(单张NVIDIA T4 GPU上达780请求/秒),并优化了实时临床决策支持场景下的延迟表现(p50为22 ms),从而为医疗AI系统的高可用、合规化部署提供了可复用的技术蓝图。
链接: https://arxiv.org/abs/2602.00053
作者: Ratul Ali
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 2 pages, 2 figures, 1 table
Abstract:Efficient and scalable deployment of machine learning (ML) models is a prerequisite for modern production environments, particularly within regulated domains such as healthcare and pharmaceuticals. In these settings, systems must balance competing requirements, including minimizing inference latency for real-time clinical decision support, maximizing throughput for batch processing of medical records, and ensuring strict adherence to data privacy standards such as HIPAA. This paper presents a rigorous benchmarking analysis comparing two prominent deployment paradigms: a lightweight, Python-based REST service using FastAPI, and a specialized, high-performance serving engine, NVIDIA Triton Inference Server. Leveraging a reference architecture for healthcare AI, we deployed a DistilBERT sentiment analysis model on Kubernetes to measure median (p50) and tail (p95) latency, as well as throughput, under controlled experimental conditions. Our results indicate a distinct trade-off. While FastAPI provides lower overhead for single-request workloads with a p50 latency of 22 ms, Triton achieves superior scalability through dynamic batching, delivering a throughput of 780 requests per second on a single NVIDIA T4 GPU, nearly double that of the baseline. Furthermore, we evaluate a hybrid architectural approach that utilizes FastAPI as a secure gateway for protected health information de-identification and Triton for backend inference. This study validates the hybrid model as a best practice for enterprise clinical AI and offers a blueprint for secure, high-availability deployments.
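For reference, a minimal FastAPI endpoint of the kind benchmarked here might look as follows; the route name and model identifier are illustrative choices, not the paper's exact setup:
```python
# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
clf = pipeline("sentiment-analysis",
               model="distilbert-base-uncased-finetuned-sst-2-english")

class TextIn(BaseModel):
    text: str

@app.post("/predict")
def predict(req: TextIn):
    # In the hybrid design sketched in the abstract, PHI de-identification
    # would happen here before the text is forwarded to the Triton backend.
    result = clf(req.text)[0]
    return {"label": result["label"], "score": float(result["score"])}
```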
zh
[AI-346] Lightweight Edge Learning via Dataset Pruning
【速读】:该论文旨在解决边缘学习(Edge Learning)中因设备端训练计算与能耗过高,导致在电池供电的移动系统上难以部署的问题。其核心挑战在于本地数据集通常规模庞大且存在冗余,使得训练过程资源消耗巨大。解决方案的关键在于提出一种以数据为中心的优化框架,通过轻量级的样本重要性评估机制实现数据集剪枝(Dataset Pruning),即利用截断预热阶段的平均损失统计量对样本进行排序,并根据动态剪枝比例确定保留最关键的数据点。该方法模型无关、无需设备间通信,可在不显著牺牲模型精度的前提下实现训练延迟和能耗的近线性降低,从而提升资源受限移动边缘设备上学习任务的可持续性和可扩展性。
链接: https://arxiv.org/abs/2602.00047
作者: Laha Ale,Hu Luo,Mingsheng Cao,Shichao Li,Huanlai Xing,Haifeng Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages, 10 figures
Abstract:Edge learning facilitates ubiquitous intelligence by enabling model training and adaptation directly on data-generating devices, thereby mitigating privacy risks and communication latency. However, the high computational and energy overhead of on-device training hinders its deployment on battery-powered mobile systems with strict thermal and memory budgets. While prior research has extensively optimized model architectures for efficient inference, the training phase remains bottlenecked by the processing of massive, often redundant, local datasets. In this work, we propose a data-centric optimization framework that leverages dataset pruning to achieve resource-efficient edge learning. Unlike standard methods that process all available data, our approach constructs compact, highly informative training subsets via a lightweight, on-device importance evaluation. Specifically, we utilize average loss statistics derived from a truncated warm-up phase to rank sample importance, deterministically retaining only the most critical data points under a dynamic pruning ratio. This mechanism is model-agnostic and operates locally without inter-device communication. Extensive experiments on standard image classification benchmarks demonstrate that our framework achieves a near-linear reduction in training latency and energy consumption proportional to the pruning ratio, with negligible degradation in model accuracy. These results validate dataset pruning as a vital, complementary paradigm for enhancing the sustainability and scalability of learning on resource-constrained mobile edge devices.
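The warm-up-loss ranking is straightforward to sketch; below, keeping the highest-average-loss fraction stands in for the paper's importance criterion, which may rank differently:
```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, warmup_epochs, keep_ratio = 512, 16, 3, 0.4
X, y = torch.randn(n, d), torch.randint(0, 2, (n,))
model = nn.Linear(d, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss(reduction="none")

loss_sum = torch.zeros(n)
for _ in range(warmup_epochs):              # truncated warm-up phase
    losses = loss_fn(model(X), y)           # per-sample losses
    loss_sum += losses.detach()
    opt.zero_grad(); losses.mean().backward(); opt.step()

k = int(keep_ratio * n)
keep = torch.topk(loss_sum / warmup_epochs, k).indices   # most informative
X_pruned, y_pruned = X[keep], y[keep]       # train remaining epochs on these
print(f"training on {k}/{n} samples after pruning")
```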
zh
[AI-347] When LLMs Imagine People: A Human-Centered Persona Brainstorm Audit for Bias and Fairness in Creative Applications
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)输出中存在的偏见问题,此类偏见可能强化刻板印象并在现实应用中加剧不平等,因此亟需有效的公平性审计方法。其解决方案的核心是提出Persona Brainstorm Audit (PBA),一种可扩展且透明的审计方法,通过开放式角色生成来检测多维社会偏见。与依赖固定身份类别和静态基准的现有方法不同,PBA能够在多个社会维度上揭示偏见,并支持纵向追踪以识别偏见在模型迭代中的演化规律(如减弱、持续或重现),同时有效降低数据泄露风险。
链接: https://arxiv.org/abs/2602.00044
作者: Hongliu Cao,Eoin Thomas,Rodrigo Acuna Agost
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Biased outputs from Large Language Models (LLMs) can reinforce stereotypes and perpetuate inequities in real-world applications, making fairness auditing essential. We introduce the Persona Brainstorm Audit (PBA), a scalable and transparent auditing method for detecting bias through open-ended persona generation. Unlike existing methods that rely on fixed identity categories and static benchmarks, PBA uncovers biases across multiple social dimensions while supporting longitudinal tracking and mitigating data leakage risks. Applying PBA to 12 state-of-the-art LLMs, we compare bias severity across models, dimensions, and versions, uncover distinct patterns and lineage-specific variability, and trace how biases attenuate, persist, or resurface across successive generations. Robustness analyses show PBA remains stable under varying sample sizes, role-playing prompts, and debiasing prompts, establishing its reliability for fairness auditing in LLMs.
zh
[AI-348] Student Perceptions of Large Language Models Use in Self-Reflection and Design Critique in Architecture Studio
【速读】:该论文旨在解决建筑教育中反馈机制效率低下与学生参与度不足的问题,尤其是在自我反思、同伴互评和教授主导评审三个关键环节中,传统反馈方式难以有效激发学生的批判性思维与深度学习。解决方案的关键在于将大型语言模型(Large Language Models, LLMs)作为“认知镜像”嵌入教学反馈流程,通过差异化应用实现分层支持:在自导学习中帮助学生结构化思考并缓解“空白页困境”,在同伴互评中充当中立中介以降低社交焦虑,在高阶评审中作为后评析整合工具,缓解认知过载并将抽象学术话语转化为可执行的设计迭代策略。
链接: https://arxiv.org/abs/2602.00041
作者: Juan David Salazar Rodriguez,Sam Conrad Joyce,Nachamma Sockalingam,Khoo Eng Tat,Julfendi
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Keywords: Architectural Education, Design Studio Pedagogy, Large Language Models, Generative AI in Education, Design Critique
Abstract:This study investigates the integration of Large Language Models (LLMs) into the feedback mechanisms of the architectural design studio, shifting the focus from generative production to reflective pedagogy. Employing a mixed-methods approach with architecture students at the Singapore University of Technology and Design, the research analyzes student perceptions across three distinct feedback domains: self-reflection, peer critique, and professor-led reviews. The findings reveal that students engage with LLMs not as authoritative instructors, but as collaborative "cognitive mirrors" that scaffold critical thinking. In self-directed learning, LLMs help structure thoughts and overcome the "blank page" problem, though they are limited by a lack of contextual nuance. In peer critiques, the technology serves as a neutral mediator, mitigating social anxiety and the "fear of offending". Furthermore, in high-stakes professor-led juries, students utilize LLMs primarily as post-critique synthesis engines to manage cognitive overload and translate abstract academic discourse into actionable design iterations.
zh
[AI-349] Enhancing few-shot time series forecasting with LLM-guided diffusion
【速读】:该论文旨在解决时间序列预测在数据稀缺场景下的性能瓶颈问题,即传统模型因依赖大规模数据集而难以有效捕捉时序动态。其解决方案的关键在于提出LTSM-DIFF框架,通过将大语言模型(Large Language Models, LLMs)的表征能力与扩散模型(Diffusion Models)的生成特性相结合:首先利用微调后的LTSM模块作为时序记忆机制,在小样本条件下提取丰富的序列特征;随后将这些特征作为条件引导联合概率扩散过程,从而精细化建模复杂的时间模式。此设计实现了从语言领域到时间序列任务的知识迁移,显著提升了模型在少样本和数据丰富场景下的泛化能力和鲁棒性。
链接: https://arxiv.org/abs/2602.00040
作者: Haonan Shi,Dehua Shuai,Liming Wang,Xiyang Liu,Long Tian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Time series forecasting in specialized domains is often constrained by limited data availability, where conventional models typically require large-scale datasets to effectively capture underlying temporal dynamics. To tackle this few-shot challenge, we propose LTSM-DIFF (Large-scale Temporal Sequential Memory with Diffusion), a novel learning framework that integrates the expressive power of large language models with the generative capability of diffusion models. Specifically, the LTSM module is fine-tuned and employed as a temporal memory mechanism, extracting rich sequential representations even under data-scarce conditions. These representations are then utilized as conditional guidance for a joint probability diffusion process, enabling refined modeling of complex temporal patterns. This design allows knowledge transfer from the language domain to time series tasks, substantially enhancing both generalization and robustness. Extensive experiments across diverse benchmarks demonstrate that LTSM-DIFF consistently achieves state-of-the-art performance in data-rich scenarios, while also delivering significant improvements in few-shot forecasting. Our work establishes a new paradigm for time series analysis under data scarcity.
zh
[AI-350] LSSF: Safety Alignment for Large Language Models through Low-Rank Safety Subspace Fusion ACL2025
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在微调过程中安全机制易被破坏的问题,以及现有安全对齐方法依赖复杂且资源密集的微调流程所带来的效率瓶颈。其解决方案的关键在于提出一种名为LSSF(Low-Rank Safety Subspace Fusion)的安全再对齐框架,通过挖掘LLM中安全信息的低秩特性,构建一个稳定且与通用能力解耦的安全子空间投影矩阵,从而在不显著影响下游任务性能的前提下,利用线性运算有效恢复微调后模型的安全对齐能力。此外,为适应不同层间安全信息编码密度差异,作者还引入了“安全奇异值熵”这一新指标,用于动态计算每层的安全关键秩,进一步提升对齐精度与鲁棒性。
链接: https://arxiv.org/abs/2602.00038
作者: Guanghao Zhou,Panjia Qiu,Cen Chen,Hongyu Li,Mingyuan Chu,Xin Zhang,Jun Zhou
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Accepted in ACL 2025 Main Conference
Abstract:The safety mechanisms of large language models (LLMs) exhibit notable fragility, as even fine-tuning on datasets without harmful content may still undermine their safety capabilities. Meanwhile, existing safety alignment methods predominantly rely on the fine-tuning process, which inadvertently increases the complexity and computational resources required. To address these issues, we introduce LSSF, a novel safety re-alignment framework with Low-Rank Safety Subspace Fusion. Our proposed method exploits the low-rank characteristics of safety information in LLMs by constructing a low-rank projection matrix to extract the principal components of safety vectors. Notably, this projection matrix represents the low-rank safety subspace of the LLMs, which we have observed to remain stable during the fine-tuning process and is isolated from the model's general capabilities. These principal components are used to effectively restore safety alignment when combined with fine-tuned LLMs through linear arithmetic. Additionally, to account for the varying encoding densities of safety information across different layers of LLMs, we propose a novel metric called safety singular value entropy. This metric quantifies the encoding density and allows for the dynamic computation of the safety-critical rank for each safety vector. Extensive experiments demonstrate that our proposed post-hoc alignment method can effectively restore the safety alignment of fine-tuned models with minimal impact on their performance in downstream tasks.
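A numerical sketch of the two ingredients the summary highlights, under the assumption that a "safety vector" is a weight-delta matrix: a safety singular value entropy used to pick a layer-wise rank, and a low-rank projector built from the top singular directions. The effective-rank heuristic and the fusion rule at the end are plausible readings, not the paper's exact formulas:
```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "safety vector" with a decaying spectrum (stand-in for a real one).
V = rng.normal(size=(64, 64)) @ np.diag(np.logspace(0, -3, 64)) \
    @ rng.normal(size=(64, 64))

U, s, _ = np.linalg.svd(V)
p = s / s.sum()
entropy = -(p * np.log(p + 1e-12)).sum()           # safety singular value entropy
rank = max(1, int(round(np.exp(entropy))))         # effective-rank heuristic
P = U[:, :rank] @ U[:, :rank].T                    # low-rank safety projector

delta = rng.normal(size=(64, 64))                  # fine-tuned weight change
realigned = delta + P @ V                          # fuse safety principal part
print(f"entropy={entropy:.2f}, safety-critical rank={rank}")
```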
zh
[AI-351] Synthetic Student Responses: LLM -Extracted Features for IRT Difficulty Parameter Estimation
【速读】:该论文旨在解决教育评估中题目标定难度(item difficulty)的传统方法依赖于耗时耗力的学生预测试所导致的障碍,尤其对课堂教师和测评开发者构成挑战。其解决方案的关键在于:通过结合传统语言学特征与利用大语言模型(Large Language Models, LLMs)提取的教学洞察(如解题步骤数、认知复杂度及潜在误解),构建一个两阶段神经网络框架——首先预测学生对试题的响应模式,再从模拟的响应数据中推导出项目反应理论(Item Response Theory, IRT)难度参数。该方法在超过25万条数学题目学生作答数据上验证,对未见过的题目实现了约0.78的皮尔逊相关系数,表明无需实际学生测试即可高精度估计题目标度。
链接: https://arxiv.org/abs/2602.00034
作者: Matias Hoyl
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 17 pages, 7 figures
Abstract:Educational assessment relies heavily on knowing question difficulty, traditionally determined through resource-intensive pre-testing with students. This creates significant barriers for both classroom teachers and assessment developers. We investigate whether Item Response Theory (IRT) difficulty parameters can be accurately estimated without student testing by modeling the response process and explore the relative contribution of different feature types to prediction accuracy. Our approach combines traditional linguistic features with pedagogical insights extracted using Large Language Models (LLMs), including solution step count, cognitive complexity, and potential misconceptions. We implement a two-stage process: first training a neural network to predict how students would respond to questions, then deriving difficulty parameters from these simulated response patterns. Using a dataset of over 250,000 student responses to mathematics questions, our model achieves a Pearson correlation of approximately 0.78 between predicted and actual difficulty parameters on completely unseen questions.
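The second stage, deriving difficulty parameters from simulated response patterns, can be sketched with a Rasch (1PL) fit; here a known ability distribution replaces the paper's neural response model, which is omitted:
```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
n_students, n_items = 2000, 5
theta = rng.normal(size=n_students)                # latent abilities
b_true = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])    # ground-truth difficulties

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Simulate response patterns (the role of the first-stage neural network).
responses = rng.random((n_students, n_items)) < sigmoid(theta[:, None] - b_true)

def fit_difficulty(col):
    nll = lambda b: -np.sum(np.log(np.where(col, sigmoid(theta - b),
                                            1.0 - sigmoid(theta - b)) + 1e-12))
    return minimize_scalar(nll, bounds=(-4, 4), method="bounded").x

b_hat = np.array([fit_difficulty(responses[:, j]) for j in range(n_items)])
print(np.round(b_hat, 2))   # should land close to b_true
```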
zh
机器学习
[LG-0] MEG-XL: Data-Efficient Brain-to-Text via Long-Context Pre-Training
链接: https://arxiv.org/abs/2602.02494
作者: Dulhan Jayalath,Oiwi Parker Jones
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 19 pages, 8 figures, 5 tables
Abstract:Clinical brain-to-text interfaces are designed for paralysed patients who cannot provide extensive training recordings. Pre-training improves data-efficient generalisation by learning statistical priors across subjects, but these priors critically depend on context. While natural speech might unfold gradually over minutes, most methods pre-train with only a few seconds of context. Thus, we propose MEG-XL, a model pre-trained with 2.5 minutes of MEG context per sample, 5-300x longer than prior work, and equivalent to 191k tokens, capturing extended neural context. Fine-tuning on the task of word decoding from brain data, MEG-XL matches supervised performance with a fraction of the data (e.g. 1hr vs 50hrs) and outperforms brain foundation models. We find that models pre-trained with longer contexts learn representations that transfer better to word decoding. Our results indicate that long-context pre-training helps exploit extended neural context that other methods unnecessarily discard. Code, model weights, and instructions are available at this https URL .
[LG-1] Expanding the Capabilities of Reinforcement Learning via Text Feedback
链接: https://arxiv.org/abs/2602.02482
作者: Yuda Song,Lili Chen,Fahim Tajwar,Remi Munos,Deepak Pathak,J. Andrew Bagnell,Aarti Singh,Andrea Zanette
类目: Machine Learning (cs.LG)
*备注: 43 pages, 6 figures
Abstract:The success of RL for LLM post-training stems from an unreasonably uninformative source: a single bit of information per rollout as binary reward or preference label. At the other extreme, distillation offers dense supervision but requires demonstrations, which are costly and difficult to scale. We study text feedback as an intermediate signal: richer than scalar rewards, yet cheaper than complete demonstrations. Textual feedback is a natural mode of human interaction and is already abundant in many real-world settings, where users, annotators, and automated judges routinely critique LLM outputs. Towards leveraging text feedback at scale, we formalize a multi-turn RL setup, RL from Text Feedback (RLTF), where text feedback is available during training but not at inference. Therefore, models must learn to internalize the feedback in order to improve their test-time single-turn performance. To do this, we propose two methods: Self Distillation (RLTF-SD), which trains the single-turn policy to match its own feedback-conditioned second-turn generations; and Feedback Modeling (RLTF-FM), which predicts the feedback as an auxiliary objective. We provide theoretical analysis on both methods, and empirically evaluate on reasoning puzzles, competition math, and creative writing tasks. Our results show that both methods consistently outperform strong baselines across benchmarks, highlighting the potential of RL with an additional source of rich supervision at scale.
[LG-2] HumanX: Toward Agile and Generalizable Humanoid Interaction Skills from Human Videos
链接: https://arxiv.org/abs/2602.02473
作者: Yinhuai Wang,Qihan Zhao,Yuen Fui Lau,Runyi Yu,Hok Wai Tsui,Qifeng Chen,Jingbo Wang,Jiangmiao Pang,Ping Tan
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Enabling humanoid robots to perform agile and adaptive interactive tasks has long been a core challenge in robotics. Current approaches are bottlenecked by either the scarcity of realistic interaction data or the need for meticulous, task-specific reward engineering, which limits their scalability. To narrow this gap, we present HumanX, a full-stack framework that compiles human video into generalizable, real-world interaction skills for humanoids, without task-specific rewards. HumanX integrates two co-designed components: XGen, a data generation pipeline that synthesizes diverse and physically plausible robot interaction data from video while supporting scalable data augmentation; and XMimic, a unified imitation learning framework that learns generalizable interaction skills. Evaluated across five distinct domains–basketball, football, badminton, cargo pickup, and reactive fighting–HumanX successfully acquires 10 different skills and transfers them zero-shot to a physical Unitree G1 humanoid. The learned capabilities include complex maneuvers such as pump-fake turnaround fadeaway jumpshots without any external perception, as well as interactive tasks like sustained human-robot passing sequences over 10 consecutive cycles–learned from a single video demonstration. Our experiments show that HumanX achieves over 8 times higher generalization success than prior methods, demonstrating a scalable and task-agnostic pathway for learning versatile, real-world robot interactive skills.
[LG-3] Age-Aware Edge-Blind Federated Learning via Over-the-Air Aggregation
链接: https://arxiv.org/abs/2602.02469
作者: Ahmed M. Elshazly,Ahmed Arafa
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: To appear in IEEE ICC 2026
Abstract:We study federated learning (FL) over wireless fading channels where multiple devices simultaneously send their model updates. We propose an efficient age-aware edge-blind over-the-air FL approach that does not require channel state information (CSI) at the devices. Instead, the parameter server (PS) uses multiple antennas and applies maximum-ratio combining (MRC) based on its estimated sum of the channel gains to detect the parameter updates. A key challenge is that the number of orthogonal subcarriers is limited; thus, transmitting many parameters requires multiple Orthogonal Frequency Division Multiplexing (OFDM) symbols, which increases latency. To address this, the PS selects only a small subset of model coordinates each round using AgeTop-k, which first picks the largest-magnitude entries and then chooses the k coordinates with the longest waiting times since they were last selected. This ensures that all selected parameters fit into a single OFDM symbol, reducing latency. We provide a convergence bound that highlights the advantages of using a higher number of antenna array elements and demonstrates a key trade-off: increasing k decreases compression error at the cost of increasing the effect of channel noise. Experimental results show that (i) more PS antennas greatly improve accuracy and convergence speed; (ii) AgeTop-k outperforms random selection under relatively good channel conditions; and (iii) the optimum k depends on the channel, with smaller k being better in noisy settings.
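As a concrete illustration of the AgeTop-k rule described above, here is a minimal sketch under stated assumptions: the shortlist is taken as the 4k largest-magnitude gradient entries (the abstract does not specify the pool size), and ages count rounds since a coordinate was last selected. This is an illustrative reading of the abstract, not the authors' implementation.

```python
# A minimal sketch of AgeTop-k: shortlist the largest-magnitude coordinates,
# then keep the k with the longest waiting time since last selection.
# The shortlist size `m` is an assumption.
import numpy as np

def age_top_k(grad, age, k, m=None):
    """Return indices of k coordinates; `age` counts rounds since last pick."""
    d = grad.size
    m = m if m is not None else min(4 * k, d)      # candidate pool size (assumed)
    pool = np.argpartition(np.abs(grad), -m)[-m:]  # largest-magnitude shortlist
    chosen = pool[np.argsort(age[pool])[-k:]]      # oldest k within the shortlist
    age += 1                                       # every coordinate ages one round
    age[chosen] = 0                                # selected coordinates reset
    return chosen

rng = np.random.default_rng(1)
grad = rng.normal(size=1000)
age = np.zeros(1000)
for _ in range(3):                                 # selection rotates over rounds
    idx = age_top_k(grad, age, k=16)
print(sorted(idx)[:5])
```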
[LG-4] Conflict-Aware Client Selection for Multi-Server Federated Learning
链接: https://arxiv.org/abs/2602.02458
作者: Mingwei Hong,Zheng Lin,Zehang Lin,Lin Li,Miao Yang,Xia Du,Zihan Fang,Zhaolu Kang,Dianxin Luan,Shunzhi Zhu
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: 6 pages, 4 figures
Abstract:Federated learning (FL) has emerged as a promising distributed machine learning (ML) paradigm that enables collaborative model training across clients without exposing raw data, thereby preserving user privacy and reducing communication costs. Despite these benefits, traditional single-server FL suffers from high communication latency due to the aggregation of models from a large number of clients. While multi-server FL distributes workloads across edge servers, overlapping client coverage and uncoordinated selection often lead to resource contention, causing bandwidth conflicts and training failures. To address these limitations, we propose a decentralized reinforcement learning approach with conflict risk prediction, named RL-CRP, to optimize client selection in multi-server FL systems. Specifically, each server estimates the likelihood of client selection conflicts using a categorical hidden Markov model based on its sparse historical client selection sequence. Then, a fairness-aware reward mechanism is incorporated to promote long-term client participation for minimizing training latency and resource contention. Extensive experiments demonstrate that the proposed RL-CRP framework effectively reduces inter-server conflicts and significantly improves training efficiency in terms of convergence speed and communication cost.
[LG-5] Finite-Sample Wasserstein Error Bounds and Concentration Inequalities for Nonlinear Stochastic Approximation
链接: https://arxiv.org/abs/2602.02445
作者: Seo Taek Kong,R. Srikant
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:This paper derives non-asymptotic error bounds for nonlinear stochastic approximation algorithms in the Wasserstein-p distance. To obtain explicit finite-sample guarantees for the last iterate, we develop a coupling argument that compares the discrete-time process to a limiting Ornstein-Uhlenbeck process. Our analysis applies to algorithms driven by general noise conditions, including martingale differences and functions of ergodic Markov chains. Complementing this result, we handle the convergence rate of the Polyak-Ruppert average through a direct analysis that applies under the same general setting. Assuming the driving noise satisfies a non-asymptotic central limit theorem, we show that the normalized last iterates converge to a Gaussian distribution in the p-Wasserstein distance at a rate of order \gamma_n^{1/6}, where \gamma_n is the step size. Similarly, the Polyak-Ruppert average is shown to converge in the Wasserstein distance at a rate of order n^{-1/6}. These distributional guarantees imply high-probability concentration inequalities that improve upon those derived from moment bounds and Markov's inequality. We demonstrate the utility of this approach by considering two applications: (1) linear stochastic approximation, where we explicitly quantify the transition from heavy-tailed to Gaussian behavior of the iterates, thereby bridging the gap between recent finite-sample analyses and asymptotic theory; and (2) stochastic gradient descent, where we establish the rate of convergence in the central limit theorem.
[LG-6] Certain Head Uncertain Tail: Expert-Sample for Test-Time Scaling in Fine-Grained MoE
链接: https://arxiv.org/abs/2602.02443
作者: Yuanteng Chen,Peisong Wang,Nanxin Zeng,Yuantian Shao,Gang Li,Jing Liu,Jian Cheng
类目: Machine Learning (cs.LG)
*备注: 24 pages, 13 figures
Abstract:Test-time scaling improves LLM performance by generating multiple candidate solutions, yet token-level sampling requires temperature tuning that trades off diversity against stability. Fine-grained MoE, featuring hundreds of well-trained experts per layer and multi-expert activation per token, offers an unexplored alternative through its rich routing space. We empirically characterize fine-grained MoE routing and uncover an informative pattern: router scores exhibit a certain head of high-confidence experts followed by an uncertain tail of low-confidence candidates. While single-run greedy accuracy remains stable when fewer experts are activated, multi-sample pass@n degrades significantly, suggesting that the certain head governs core reasoning capability while the uncertain tail correlates with reasoning diversity. Motivated by these findings, we propose Expert-Sample, a training-free method that preserves high-confidence selections while injecting controlled stochasticity into the uncertain tail, enabling diverse generation without destabilizing outputs. Evaluated on multiple fine-grained MoE models across math, knowledge reasoning, and code tasks, Expert-Sample consistently improves pass@n and verification-based accuracy. On Qwen3-30B-A3B-Instruct evaluated on GPQA-Diamond with 32 parallel samples, pass@32 rises from 85.4% to 91.9%, and accuracy improves from 59.1% to 62.6% with Best-of-N verification.
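A minimal sketch of the head/tail routing idea described above: the h most confident experts are kept deterministically and the remaining k - h activations are sampled from the low-confidence tail. The split point h, the temperature, and the sampling rule are assumptions; the paper's exact mechanism may differ.

```python
# A minimal sketch of certain-head / uncertain-tail expert selection.
import torch

def expert_sample(router_logits, k=8, h=5, tau=1.0):
    """router_logits: (num_experts,). Returns indices of k activated experts."""
    probs = torch.softmax(router_logits, dim=-1)
    order = torch.argsort(probs, descending=True)
    head = order[:h]                                    # certain head: always kept
    tail = order[h:]                                    # uncertain tail
    tail_probs = torch.softmax(router_logits[tail] / tau, dim=-1)
    picks = tail[torch.multinomial(tail_probs, k - h)]  # stochastic tail picks
    return torch.cat([head, picks])

torch.manual_seed(0)
logits = torch.randn(128)
print(expert_sample(logits))    # k expert indices; tail picks vary across samples
```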
[LG-7] Energy-Efficient Neuromorphic Computing for Edge AI: A Framework with Adaptive Spiking Neural Networks and Hardware-Aware Optimization
链接: https://arxiv.org/abs/2602.02439
作者: Olaf Yunus Laitinen Imanov,Derya Umut Kulali,Taner Yilmaz,Duygu Erisken,Rana Irem Turhan
类目: Neural and Evolutionary Computing (cs.NE); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 8 pages, 4 figures, 4 tables. Submitted to IEEE Transactions on Neural Networks and Learning Systems (TNNLS)
Abstract:Edge AI applications increasingly require ultra-low-power, low-latency inference. Neuromorphic computing based on event-driven spiking neural networks (SNNs) offers an attractive path, but practical deployment on resource-constrained devices is limited by training difficulty, hardware-mapping overheads, and sensitivity to temporal dynamics. We present NeuEdge, a framework that combines adaptive SNN models with hardware-aware optimization for edge deployment. NeuEdge uses a temporal coding scheme that blends rate and spike-timing patterns to reduce spike activity while preserving accuracy, and a hardware-aware training procedure that co-optimizes network structure and on-chip placement to improve utilization on neuromorphic processors. An adaptive threshold mechanism adjusts neuron excitability from input statistics, reducing energy consumption without degrading performance. Across standard vision and audio benchmarks, NeuEdge achieves 91-96% accuracy with up to 2.3 ms inference latency on edge hardware and an estimated 847 GOp/s/W energy efficiency. A case study on an autonomous-drone workload shows up to 312x energy savings relative to conventional deep neural networks while maintaining real-time operation.
[LG-8] Maximizing Reliability with Bayesian Optimization
链接: https://arxiv.org/abs/2602.02432
作者: Jack M. Buckingham,Ivo Couckuyt,Juergen Branke
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 25 pages, 9 figures
Abstract:Bayesian optimization (BO) is a popular, sample-efficient technique for expensive, black-box optimization. One such problem arising in manufacturing is that of maximizing the reliability, or equivalently minimizing the probability of a failure, of a design which is subject to random perturbations - a problem that can involve extremely rare failures ( P_\mathrm{fail} = 10^{-6}-10^{-8} ). In this work, we propose two BO methods based on Thompson sampling and knowledge gradient, the latter approximating the one-step Bayes-optimal policy for minimizing the logarithm of the failure probability. Both methods incorporate importance sampling to target extremely small failure probabilities. Empirical results show the proposed methods outperform existing methods in both extreme and non-extreme regimes.
[LG-9] Embedding Perturbation may Better Reflect the Uncertainty in LLM Reasoning
链接: https://arxiv.org/abs/2602.02427
作者: Qihao Wen,Jiahao Wang,Yang Nan,Pengfei He,Ravi Tandon,Han Xu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models (LLMs) have achieved significant breakthroughs across diverse domains; however, they can still produce unreliable or misleading outputs. For responsible LLM application, Uncertainty Quantification (UQ) techniques are used to estimate a model's uncertainty about its outputs, indicating the likelihood that those outputs may be problematic. For LLM reasoning tasks, it is essential to estimate the uncertainty not only for the final answer, but also for the intermediate steps of the reasoning, as this can enable more fine-grained and targeted interventions. In this study, we explore which UQ metrics better reflect the LLM's ``intermediate uncertainty'' during reasoning. Our study reveals that an LLM's incorrect reasoning steps tend to contain tokens which are highly sensitive to perturbations on the preceding token embeddings. In this way, incorrect (uncertain) intermediate steps can be readily identified using this sensitivity score as guidance in practice. In our experiments, we show that such a perturbation-based metric achieves stronger uncertainty quantification performance compared with baseline methods such as token (generation) probability and token entropy. Besides, different from approaches that rely on multiple sampling, the perturbation-based metrics offer better simplicity and efficiency.
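A toy sketch of a perturbation-based sensitivity score in the spirit of the abstract: Gaussian noise is added to the preceding token embeddings and the resulting shift in the current token's log-probability is averaged over a few draws. The noise scale, the number of draws, and the stand-in scoring model are all assumptions.

```python
# A minimal sketch of embedding-perturbation sensitivity; the real metric would
# use an LLM's per-position token log-probs rather than this toy stand-in.
import torch
import torch.nn.functional as F

def sensitivity_score(logprob_fn, embeddings, token_pos, sigma=0.01, n_draws=8):
    """logprob_fn maps an embedding sequence (T, d) to per-position log-probs (T,)."""
    base = logprob_fn(embeddings)[token_pos]
    deltas = []
    for _ in range(n_draws):
        noisy = embeddings.clone()
        noisy[:token_pos] += sigma * torch.randn_like(noisy[:token_pos])
        deltas.append((logprob_fn(noisy)[token_pos] - base).abs())
    return torch.stack(deltas).mean()   # higher = step more sensitive / uncertain

# Toy stand-in for a model's per-position log-probability of the realized token.
torch.manual_seed(0)
W = torch.randn(16)
toy_logprob = lambda e: F.logsigmoid(e @ W)
emb = torch.randn(10, 16)
print(sensitivity_score(toy_logprob, emb, token_pos=7))
```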
[LG-10] Repurposing Protein Language Models for Latent Flow-Based Fitness Optimization
链接: https://arxiv.org/abs/2602.02425
作者: Amaru Caceres Arroyo,Lea Bogensperger,Ahmed Allam,Michael Krauthammer,Konrad Schindler,Dominik Narnhofer
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Protein fitness optimization is challenged by a vast combinatorial landscape where high-fitness variants are extremely sparse. Many current methods either underperform or require computationally expensive gradient-based sampling. We present CHASE, a framework that repurposes the evolutionary knowledge of pretrained protein language models by compressing their embeddings into a compact latent space. By training a conditional flow-matching model with classifier-free guidance, we enable the direct generation of high-fitness variants without predictor-based guidance during the ODE sampling steps. CHASE achieves state-of-the-art performance on AAV and GFP protein design benchmarks. Finally, we show that bootstrapping with synthetic data can further enhance performance in data-constrained settings.
[LG-11] Trust Region Continual Learning as an Implicit Meta-Learner
链接: https://arxiv.org/abs/2602.02417
作者: Zekun Wang,Anant Gupta,Christopher J. MacLellan
类目: Machine Learning (cs.LG)
*备注: 19 pages, 23 tables
Abstract:Continual learning aims to acquire tasks sequentially without catastrophic forgetting, yet standard strategies face a core tradeoff: regularization-based methods (e.g., EWC) can overconstrain updates when task optima are weakly overlapping, while replay-based methods can retain performance but drift due to imperfect replay. We study a hybrid perspective: trust region continual learning that combines generative replay with a Fisher-metric trust region constraint. We show that, under local approximations, the resulting update admits a MAML-style interpretation with a single implicit inner step: replay supplies an old-task gradient signal (query-like), while the Fisher-weighted penalty provides an efficient offline curvature shaping (support-like). This yields an emergent meta-learning property in continual learning: the model becomes an initialization that rapidly re-converges to prior task optima after each task transition, without explicitly optimizing a bilevel objective. Empirically, on task-incremental diffusion image generation and continual diffusion-policy control, trust region continual learning achieves the best final performance and retention, and consistently recovers early-task performance faster than EWC, replay, and continual meta-learning baselines.
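A minimal sketch of the hybrid objective described above: a current-task loss, a replay loss on (generated) old-task data, and a Fisher-weighted trust-region penalty around the previous task's parameters. The coefficients and the diagonal Fisher are assumptions for illustration.

```python
# A minimal sketch of a replay + Fisher trust-region objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

def trust_region_cl_loss(model, task_loss, replay_loss, old_params, fisher,
                         lam_replay=1.0, lam_tr=10.0):
    """Task loss + replay loss + Fisher-weighted trust region around old params."""
    penalty = sum((fisher[n] * (p - old_params[n]) ** 2).sum()
                  for n, p in model.named_parameters())
    return task_loss + lam_replay * replay_loss + lam_tr * penalty

# Toy usage with a linear model and a (placeholder) diagonal Fisher estimate.
torch.manual_seed(0)
model = nn.Linear(4, 1)
old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
fisher = {n: torch.ones_like(p) for n, p in model.named_parameters()}
x, y = torch.randn(8, 4), torch.randn(8, 1)
task = F.mse_loss(model(x), y)         # current-task loss
replay = F.mse_loss(model(x), y)       # stand-in for loss on replayed samples
loss = trust_region_cl_loss(model, task, replay, old_params, fisher)
loss.backward()
print(loss.item())
```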
[LG-12] Active Transfer Bagging: A New Approach for Accelerated Active Learning Acquisition of Data by Combined Transfer Learning and Bagging Based Models
链接: https://arxiv.org/abs/2602.02415
作者: Vivienne Pelletier,Daniel J. Rivera,Obinna Nwokonkwo,Steven A. Wilson,Christopher L. Muhich
类目: Machine Learning (cs.LG)
*备注:
Abstract:Modern machine learning has achieved remarkable success on many problems, but this success often depends on the existence of large, labeled datasets. While active learning can dramatically reduce labeling cost when annotations are expensive, early performance is frequently dominated by the initial seed set, typically chosen at random. In many applications, however, related or approximate datasets are readily available and can be leveraged to construct a better seed set. We introduce a new method for selecting the seed data set for active learning, Active-Transfer Bagging (ATBagging). ATBagging estimates the informativeness of candidate data points from a Bayesian interpretation of bagged ensemble models by comparing in-bag and out-of-bag predictive distributions from the labeled dataset, yielding an information-gain proxy. To avoid redundant selections, we impose feature-space diversity by sampling a determinantal point process (DPP) whose kernel uses Random Fourier Features and a quality-diversity factorization that incorporates the informativeness scores. This same blended method is used for selection of new data points to collect during the active learning phase. We evaluate ATBagging on four real-world datasets covering both target-transfer and feature-shift scenarios (QM9, ERA5, Forbes 2000, and Beijing PM2.5). Across seed sizes n_seed = 10-100, ATBagging improves or ties early active-learning performance and increases area under the learning curve relative to alternative seed subset selection methodologies in almost all cases, with strongest benefits in low-data regimes. Thus, ATBagging provides a low-cost, high-reward means of initiating active learning-based data collection.
[LG-13] Masked Autoencoders as Universal Speech Enhancer
链接: https://arxiv.org/abs/2602.02413
作者: Rajalaxmi Rajagopalan,Ritwik Giri,Zhiqiang Tang,Kyu Han
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注:
Abstract:Supervised speech enhancement methods have been very successful. However, in practical scenarios, there is a lack of clean speech, and self-supervised learning-based (SSL) speech enhancement methods that offer comparable enhancement performance and can be applied to other speech-related downstream applications are desired. In this work, we develop a masked autoencoder based universal speech enhancer that is agnostic to the type of distortion affecting speech, can handle multiple distortions simultaneously, and is trained in a self-supervised manner. An augmentation stack adds further distortions to the noisy input data. The masked autoencoder model learns to remove the added distortions along with reconstructing the masked regions of the spectrogram during pre-training. The pre-trained embeddings are then used by fine-tuning models trained on a small amount of paired data for specific downstream tasks. We evaluate the pre-trained features for denoising and dereverberation downstream tasks. We explore different augmentations (like single or multi-speaker) in the pre-training augmentation stack and the effect of different noisy input feature representations (like log1p compression) on pre-trained embeddings and downstream fine-tuning enhancement performance. We show that the proposed method not only outperforms the baseline but also achieves state-of-the-art performance for both in-domain and out-of-domain evaluation datasets.
[LG-14] An Empirical Study on Noisy Data and LLM Pretraining Loss Divergence
链接: https://arxiv.org/abs/2602.02400
作者: Qizhen Zhang,Ankush Garg,Jakob Foerster,Niladri Chatterji,Kshitiz Malik,Mike Lewis
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large-scale pretraining datasets drive the success of large language models (LLMs). However, these web-scale corpora inevitably contain large amounts of noisy data due to unregulated web content or randomness inherent in data. Although LLM pretrainers often speculate that such noise contributes to instabilities in large-scale LLM pretraining and, in the worst cases, loss divergence, this phenomenon remains poorly understood. In this work, we present a systematic empirical study of whether noisy data causes LLM pretraining divergences and how it does so. By injecting controlled synthetic uniformly random noise into otherwise clean datasets, we analyze training dynamics across model sizes ranging from 480M to 5.2B parameters. We show that noisy data indeed induces training loss divergence, and that the probability of divergence depends strongly on the noise type, amount of noise, and model scale. We further find that noise-induced divergences exhibit activation patterns distinct from those caused by high learning rates, and we provide diagnostics that differentiate these two failure modes. Together, these results provide a large-scale, controlled characterization of how noisy data affects loss divergence in LLM pretraining.
[LG-15] PRISM: Performer RS-IMLE for Single-pass Multisensory Imitation Learning
链接: https://arxiv.org/abs/2602.02396
作者: Amisha Bhaskar,Pratap Tokekar,Stefano Di Cairano,Alexander Schperberg
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 10 pages main text and 4 figures, and 11 pages appendix and 10 figures, total 21 pages and 14 figures
Abstract:Robotic imitation learning typically requires models that capture multimodal action distributions while operating at real-time control rates and accommodating multiple sensing modalities. Although recent generative approaches such as diffusion models, flow matching, and Implicit Maximum Likelihood Estimation (IMLE) have achieved promising results, they often satisfy only a subset of these requirements. To address this, we introduce PRISM, a single-pass policy based on a batch-global rejection-sampling variant of IMLE. PRISM couples a temporal multisensory encoder (integrating RGB, depth, tactile, audio, and proprioception) with a linear-attention generator using a Performer architecture. We demonstrate the efficacy of PRISM on a diverse real-world hardware suite, including loco-manipulation using a Unitree Go2 with a 7-DoF arm D1 and tabletop manipulation with a UR5 manipulator. Across challenging physical tasks such as pre-manipulation parking, high-precision insertion, and multi-object pick-and-place, PRISM outperforms state-of-the-art diffusion policies by 10-25% in success rate while maintaining high-frequency (30-50 Hz) closed-loop control. We further validate our approach on large-scale simulation benchmarks, including CALVIN, MetaWorld, and Robomimic. In CALVIN (10% data split), PRISM improves success rates by approximately 25% over diffusion and approximately 20% over flow matching, while simultaneously reducing trajectory jerk by 20x-50x. These results position PRISM as a fast, accurate, and multisensory imitation policy that retains multimodal action coverage without the latency of iterative sampling.
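For context, a minimal sketch of the vanilla IMLE objective that PRISM builds on: draw several generator samples and pull, for each data point, its nearest sample toward that point. PRISM's batch-global rejection-sampling variant adds a rejection step that is not shown here; the generator and data below are toy stand-ins.

```python
# A minimal sketch of the core IMLE objective (nearest-sample matching).
import torch
import torch.nn as nn

def imle_loss(generator, data, n_samples=32, z_dim=8):
    """Pull the nearest generated sample toward each data point."""
    z = torch.randn(n_samples, z_dim)
    fake = generator(z)                       # (n_samples, d) generated samples
    d2 = torch.cdist(data, fake) ** 2         # (batch, n_samples) squared distances
    return d2.min(dim=1).values.mean()        # nearest-sample loss per data point

torch.manual_seed(0)
generator = nn.Linear(8, 2)                   # stand-in for the policy generator
data = torch.randn(16, 2)
loss = imle_loss(generator, data)
loss.backward()
print(loss.item())
```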
[LG-16] Transformers learn factored representations
链接: https://arxiv.org/abs/2602.02385
作者: Adam Shai,Loren Amdahl-Culleton,Casper L. Christensen,Henry R. Bigelow,Fernando E. Rosas,Alexander B. Boyd,Eric A. Alt,Kyle J. Ray,Paul M. Riechers
类目: Machine Learning (cs.LG)
*备注:
Abstract:Transformers pretrained via next token prediction learn to factor their world into parts, representing these factors in orthogonal subspaces of the residual stream. We formalize two representational hypotheses: (1) a representation in the product space of all factors, whose dimension grows exponentially with the number of parts, or (2) a factored representation in orthogonal subspaces, whose dimension grows linearly. The factored representation is lossless when factors are conditionally independent, but sacrifices predictive fidelity otherwise, creating a tradeoff between dimensional efficiency and accuracy. We derive precise predictions about the geometric structure of activations for each, including the number of subspaces, their dimensionality, and the arrangement of context embeddings within them. We test between these hypotheses on transformers trained on synthetic processes with known latent structure. Models learn factored representations when factors are conditionally independent, and continue to favor them early in training even when noise or hidden dependencies undermine conditional independence, reflecting an inductive bias toward factoring at the cost of fidelity. This provides a principled explanation for why transformers decompose the world into parts, and suggests that interpretable low dimensional structure may persist even in models trained on complex data.
[LG-17] SLIME: Stabilized Likelihood Implicit Margin Enforcement for Preference Optimization
链接: https://arxiv.org/abs/2602.02383
作者: Maksim Afanasyev,Illarion Iov
类目: Machine Learning (cs.LG)
*备注:
Abstract:Direct preference optimization methods have emerged as a computationally efficient alternative to Reinforcement Learning from Human Feedback (RLHF) for aligning Large Language Models (LLMs). Latest approaches have streamlined the alignment process by deriving implicit reward functions, yet they often suffer from a critical objective mismatch: optimizing the relative margin between chosen and rejected responses does not guarantee the preservation of the chosen response's absolute likelihood. This can lead to ``unlearning'', where the model degrades the probability of high-quality outputs to satisfy margin constraints, and ``formatting collapse'' caused by the over-penalization of rejected sequences. In this work, we introduce SLIME (Stabilized Likelihood Implicit Margin Enforcement), a reference-free alignment objective designed to decouple preference learning from generation quality. SLIME incorporates a three-pronged objective: (1) an anchoring term to maximize the likelihood of preferred responses; (2) a stabilizing penalty that prevents the probabilities of rejected tokens from collapsing to zero; and (3) a dual-margin mechanism that combines hard and soft constraints for precise boundary shaping. Our results demonstrate that SLIME achieves superior performance compared to state-of-the-art baselines while maintaining higher generation stability.
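A speculative sketch of the three-term structure named in the abstract, written over per-response mean log-probabilities: an anchoring term, a stabilizing floor on rejected log-probabilities, and a combined hard/soft margin. The exact functional forms, margins, and coefficients of SLIME are not given in the abstract; everything below is an assumption.

```python
# A minimal, assumed instantiation of an anchor + stabilizer + dual-margin loss.
import torch
import torch.nn.functional as F

def slime_like_loss(logp_chosen, logp_rejected, hard_m=1.0, soft_m=0.1,
                    floor=-8.0, a=1.0, b=0.1, c=1.0):
    anchor = -logp_chosen.mean()                          # keep chosen responses likely
    stabilize = F.relu(floor - logp_rejected).mean()      # stop rejected probs collapsing
    margin = logp_chosen - logp_rejected
    dual = F.relu(hard_m - margin).mean() + F.softplus(soft_m - margin).mean()
    return a * anchor + b * stabilize + c * dual

logp_c = torch.tensor([-1.2, -0.8])   # mean log-prob of chosen responses
logp_r = torch.tensor([-2.0, -1.1])   # mean log-prob of rejected responses
print(slime_like_loss(logp_c, logp_r))
```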
[LG-18] Self-Supervised Learning from Structural Invariance ICLR2026
链接: https://arxiv.org/abs/2602.02381
作者: Yipeng Zhang,Hafez Ghaemi,Jungyoon Lee,Shahab Bakhtiari,Eilif B. Muller,Laurent Charlin
类目: Machine Learning (cs.LG)
*备注: ICLR 2026
Abstract:Joint-embedding self-supervised learning (SSL), the key paradigm for unsupervised representation learning from visual data, learns from invariances between semantically-related data pairs. We study the one-to-many mapping problem in SSL, where each datum may be mapped to multiple valid targets. This arises when data pairs come from naturally occurring generative processes, e.g., successive video frames. We show that existing methods struggle to flexibly capture this conditional uncertainty. As a remedy, we introduce a latent variable to account for this uncertainty and derive a variational lower bound on the mutual information between paired embeddings. Our derivation yields a simple regularization term for standard SSL objectives. The resulting method, which we call AdaSSL, applies to both contrastive and distillation-based SSL objectives, and we empirically show its versatility in causal representation learning, fine-grained image understanding, and world modeling on videos.
[LG-19] C-kNN-LSH: A Nearest-Neighbor Algorithm for Sequential Counterfactual Inference
链接: https://arxiv.org/abs/2602.02371
作者: Jing Wang,Jie Shen,Qiaomin Xie,Jeremy C Weiss
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Estimating causal effects from longitudinal trajectories is central to understanding the progression of complex conditions and optimizing clinical decision-making, such as comorbidities and long COVID recovery. We introduce C-kNN-LSH, a nearest-neighbor framework for sequential causal inference designed to handle such high-dimensional, confounded situations. By utilizing locality-sensitive hashing, we efficiently identify ``clinical twins'' with similar covariate histories, enabling local estimation of conditional treatment effects across evolving disease states. To mitigate bias from irregular sampling and shifting patient recovery profiles, we integrate the neighborhood estimator with a doubly-robust correction. Theoretical analysis guarantees our estimator is consistent and second-order robust to nuisance error. Evaluated on a real-world Long COVID cohort with 13,511 participants, C-kNN-LSH demonstrates superior performance in capturing recovery heterogeneity and estimating policy values compared to existing baselines.
[LG-20] Hierarchical Federated Learning with SignSGD: A Highly Communication-Efficient Approach
链接: https://arxiv.org/abs/2602.02355
作者: Amirreza Kazemi,Seyed Mohammad Azimi-Abarghouyi,Gabor Fodor,Carlo Fischione
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
Abstract:Hierarchical federated learning (HFL) has emerged as a key architecture for large-scale wireless and Internet of Things systems, where devices communicate with nearby edge servers before reaching the cloud. In these environments, uplink bandwidth and latency impose strict communication limits, thereby making aggressive gradient compression essential. One-bit methods such as sign-based stochastic gradient descent (SignSGD) offer an attractive solution in flat federated settings, but existing theory and algorithms do not naturally extend to hierarchical settings. In particular, the interaction between majority-vote aggregation at the edge layer and model aggregation at the cloud layer, and its impact on end-to-end performance, remains unknown. To bridge this gap, we propose a highly communication-efficient sign-based HFL framework and develop its corresponding formulation for nonconvex learning, where devices send only signed stochastic gradients, edge servers combine them through majority-vote, and the cloud periodically averages the obtained edge models, while utilizing downlink quantization to broadcast the global model. We introduce the resulting scalable HFL algorithm, HierSignSGD, and provide the convergence analysis for SignSGD in a hierarchical setting. Our core technical contribution is a characterization of how biased sign compression, two-level aggregation intervals, and inter-cluster heterogeneity collectively affect convergence. Numerical experiments under homogeneous and heterogeneous data splits show that HierSignSGD, despite employing extreme compression, achieves accuracy comparable to or better than full-precision stochastic gradient descent while reducing communication cost in the process, and remains robust under aggressive downlink sparsification.
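A minimal sketch of the aggregation pattern described above: devices upload only sign(gradient), each edge server takes a per-coordinate majority vote and applies an SGD step to its edge model, and the cloud periodically averages the edge models. Step size, cluster sizes, and the synthetic gradients are assumptions; downlink quantization is omitted.

```python
# A minimal sketch of sign-upload + edge majority-vote + cloud averaging.
import numpy as np

rng = np.random.default_rng(0)
d, n_edges, devices_per_edge, lr = 10, 3, 5, 0.01
edge_models = [np.zeros(d) for _ in range(n_edges)]

for _ in range(4):                                    # edge rounds between cloud syncs
    for e in range(n_edges):
        device_signs = np.sign(rng.normal(size=(devices_per_edge, d)))
        vote = np.sign(device_signs.sum(axis=0))      # per-coordinate majority vote
        edge_models[e] -= lr * vote                   # edge-side sign-SGD step

global_model = np.mean(edge_models, axis=0)           # periodic cloud averaging
print(global_model)
```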
[LG-21] EvalQReason: A Framework for Step-Level Reasoning Evaluation in Large Language Models
链接: https://arxiv.org/abs/2602.02295
作者: Shaima Ahmad Freja,Ferhat Ozgur Catak,Betul Yurdem,Chunming Rong
类目: Machine Learning (cs.LG)
*备注: 15 pages (including appendix), 11 figures
Abstract:Large Language Models (LLMs) are increasingly deployed in critical applications requiring reliable reasoning, yet their internal reasoning processes remain difficult to evaluate systematically. Existing methods focus on final-answer correctness, providing limited insight into how reasoning unfolds across intermediate steps. We present EvalQReason, a framework that quantifies LLM reasoning quality through step-level probability distribution analysis without requiring human annotation. The framework introduces two complementary algorithms: Consecutive Step Divergence (CSD), which measures local coherence between adjacent reasoning steps, and Step-to-Final Convergence (SFC), which assesses global alignment with final answers. Each algorithm employs five statistical metrics to capture reasoning dynamics. Experiments across mathematical and medical datasets with open-source 7B-parameter models demonstrate that CSD-based features achieve strong predictive performance for correctness classification, with classical machine learning models reaching F1=0.78 and ROC-AUC=0.82, and sequential neural models substantially improving performance (F1=0.88, ROC-AUC=0.97). CSD consistently outperforms SFC, and sequential architectures outperform classical machine learning approaches. Critically, reasoning dynamics prove domain-specific: mathematical reasoning exhibits clear divergence-based discrimination patterns between correct and incorrect solutions, while medical reasoning shows minimal discriminative signals, revealing fundamental differences in how LLMs process different reasoning types. EvalQReason enables scalable, process-aware evaluation of reasoning reliability, establishing probability-based divergence analysis as a principled approach for trustworthy AI deployment.
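A minimal sketch of a consecutive-step divergence signal in the spirit of CSD: adjacent step-level probability distributions are compared with a symmetric divergence. Using the Jensen-Shannon divergence here is an assumption; the paper derives five statistical metrics from such step-level distributions.

```python
# A minimal sketch of a consecutive-step divergence computation.
import numpy as np
from scipy.spatial.distance import jensenshannon

def consecutive_step_divergence(step_dists):
    """step_dists: (n_steps, vocab) rows of step-level probability distributions."""
    return np.array([jensenshannon(step_dists[i], step_dists[i + 1]) ** 2
                     for i in range(len(step_dists) - 1)])

rng = np.random.default_rng(0)
step_dists = rng.dirichlet(np.ones(50), size=6)   # 6 steps over a 50-symbol vocab
print(consecutive_step_divergence(step_dists))    # one divergence per adjacent pair
```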
[LG-22] Choice-Model-Assisted Q-learning for Delayed-Feedback Revenue Management
链接: https://arxiv.org/abs/2602.02283
作者: Owen Shen,Patrick Jaillet
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We study reinforcement learning for revenue management with delayed feedback, where a substantial fraction of value is determined by customer cancellations and modifications observed days after booking. We propose choice-model-assisted RL: a calibrated discrete choice model is used as a fixed partial world model to impute the delayed component of the learning target at decision time. In the fixed-model deployment regime, we prove that tabular Q-learning with model-imputed targets converges to an O(\varepsilon/(1-\gamma)) neighborhood of the optimal Q-function, where \varepsilon summarizes partial-model error, with an additional O(t^{-1/2}) sampling term. Experiments in a simulator calibrated from 61,619 hotel bookings (1,088 independent runs) show: (i) no statistically detectable difference from a maturity-buffer DQN baseline in stationary settings; (ii) positive effects under in-family parameter shifts, with significant gains in 5 of 10 shift scenarios after Holm-Bonferroni correction (up to 12.4%); and (iii) consistent degradation under structural misspecification, where the choice model assumptions are violated (1.4-2.6% lower revenue). These results characterize when partial behavioral models improve robustness under shift and when they introduce harmful bias.
[LG-23] MoLF: Mixture-of-Latent-Flow for Pan-Cancer Spatial Gene Expression Prediction from Histology
链接: https://arxiv.org/abs/2602.02282
作者: Susu Hu,Stefanie Speidel
类目: Machine Learning (cs.LG)
*备注:
Abstract:Inferring spatial transcriptomics (ST) from histology enables scalable histogenomic profiling, yet current methods are largely restricted to single-tissue models. This fragmentation fails to leverage biological principles shared across cancer types and hinders application to data-scarce scenarios. While pan-cancer training offers a solution, the resulting heterogeneity challenges monolithic architectures. To bridge this gap, we introduce MoLF (Mixture-of-Latent-Flow), a generative model for pan-cancer histogenomic prediction. MoLF leverages a conditional Flow Matching objective to map noise to the gene latent manifold, parameterized by a Mixture-of-Experts (MoE) velocity field. By dynamically routing inputs to specialized sub-networks, this architecture effectively decouples the optimization of diverse tissue patterns. Our experiments demonstrate that MoLF establishes a new state-of-the-art, consistently outperforming both specialized and foundation model baselines on pan-cancer benchmarks. Furthermore, MoLF exhibits zero-shot generalization to cross-species data, suggesting it captures fundamental, conserved histo-molecular mechanisms.
[LG-24] HopFormer: Sparse Graph Transformers with Explicit Receptive Field Control
链接: https://arxiv.org/abs/2602.02268
作者: Sanggeon Yun,Raheeb Hassan,Ryozo Masukawa,Sungheon Jeong,Mohsen Imani
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph Transformers typically rely on explicit positional or structural encodings and dense global attention to incorporate graph topology. In this work, we show that neither is essential. We introduce HopFormer, a graph Transformer that injects structure exclusively through head-specific n-hop masked sparse attention, without the use of positional encodings or architectural modifications. This design provides explicit and interpretable control over receptive fields while enabling genuinely sparse attention whose computational cost scales linearly with mask sparsity. Through extensive experiments on both node-level and graph-level benchmarks, we demonstrate that our approach achieves competitive or superior performance across diverse graph structures. Our results further reveal that dense global attention is often unnecessary: on graphs with strong small-world properties, localized attention yields more stable and consistently high performance, while on graphs with weaker small-world effects, global attention offers diminishing returns. Together, these findings challenge prevailing assumptions in graph Transformer design and highlight sparsity-controlled attention as a principled and efficient alternative.
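A minimal sketch of head-specific n-hop masked attention: within-n-hop reachability is accumulated from repeated adjacency products, and each head attends only inside its own hop radius. Dense matrices are used here for clarity; the paper's implementation exploits mask sparsity so that cost scales with mask density.

```python
# A minimal sketch of n-hop masking with per-head hop radii.
import torch

def n_hop_mask(adj, n):
    """Boolean (N, N) mask: True where j is within n hops of i (self included)."""
    reach = torch.eye(adj.size(0), dtype=torch.bool)
    frontier = reach.float()
    for _ in range(n):
        frontier = ((frontier @ adj) > 0).float()   # nodes reachable in one more hop
        reach |= frontier.bool()
    return reach

def masked_attention(q, k, v, mask):
    scores = (q @ k.transpose(-1, -2)) / q.size(-1) ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
N, d = 6, 4
adj = (torch.rand(N, N) < 0.3).float()              # toy directed graph
q = k = v = torch.randn(N, d)
out_head1 = masked_attention(q, k, v, n_hop_mask(adj, 1))   # local head
out_head3 = masked_attention(q, k, v, n_hop_mask(adj, 3))   # wider-radius head
print(out_head1.shape, out_head3.shape)
```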
[LG-25] Unlocking the Duality between Flow and Field Matching
链接: https://arxiv.org/abs/2602.02261
作者: Daniil Shlenskii,Alexander Varlamov,Nazar Buzun,Alexander Korotin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Conditional Flow Matching (CFM) unifies conventional generative paradigms such as diffusion models and flow matching. Interaction Field Matching (IFM) is a newer framework that generalizes Electrostatic Field Matching (EFM) rooted in Poisson Flow Generative Models (PFGM). While both frameworks define generative dynamics, they start from different objects: CFM specifies a conditional probability path in data space, whereas IFM specifies a physics-inspired interaction field in an augmented data space. This raises a basic question: are CFM and IFM genuinely different, or are they two descriptions of the same underlying dynamics? We show that they coincide for a natural subclass of IFM that we call forward-only IFM. Specifically, we construct a bijection between CFM and forward-only IFM. We further show that general IFM is strictly more expressive: it includes EFM and other interaction fields that cannot be realized within the standard CFM formulation. Finally, we highlight how this duality can benefit both frameworks: it provides a probabilistic interpretation of forward-only IFM and yields novel, IFM-driven techniques for CFM.
[LG-26] Learning Markov Decision Processes under Fully Bandit Feedback
链接: https://arxiv.org/abs/2602.02260
作者: Zhengjia Zhuo,Anupam Gupta,Viswanath Nagarajan
类目: Machine Learning (cs.LG)
*备注:
Abstract:A standard assumption in Reinforcement Learning is that the agent observes every visited state-action pair in the associated Markov Decision Process (MDP), along with the per-step rewards. Strong theoretical results are known in this setting, achieving nearly-tight \Theta(\sqrt{T}) -regret bounds. However, such detailed feedback can be unrealistic, and recent research has investigated more restricted settings such as trajectory feedback, where the agent observes all the visited state-action pairs, but only a single aggregate reward. In this paper, we consider a far more restrictive ``fully bandit'' feedback model for episodic MDPs, where the agent does not even observe the visited state-action pairs -- it only learns the aggregate reward. We provide the first efficient bandit learning algorithm for episodic MDPs with \widetilde{O}(\sqrt{T}) regret. Our regret has an exponential dependence on the horizon length H, which we show is necessary. We also obtain improved nearly-tight regret bounds for ``ordered'' MDPs; these can be used to model classical stochastic optimization problems such as k-item prophet inequality and sequential posted pricing. Finally, we evaluate the empirical performance of our algorithm for the setting of k-item prophet inequalities; despite the highly restricted feedback, our algorithm's performance is comparable to that of a state-of-the-art learning algorithm (UCB-VI) with detailed state-action feedback.
[LG-27] Alignment-Aware Model Adaptation via Feedback-Guided Optimization
链接: https://arxiv.org/abs/2602.02258
作者: Gaurav Bhatt,Aditya Chinchure,Jiawei Zhou,Leonid Sigal
类目: Machine Learning (cs.LG)
*备注:
Abstract:Fine-tuning is the primary mechanism for adapting foundation models to downstream tasks; however, standard approaches largely optimize task objectives in isolation and do not account for secondary yet critical alignment objectives (e.g., safety and hallucination avoidance). As a result, downstream fine-tuning can degrade alignment and fail to correct pre-existing misaligned behavior. We propose an alignment-aware fine-tuning framework that integrates feedback from an external alignment signal through policy-gradient-based regularization. Our method introduces an adaptive gating mechanism that dynamically balances supervised and alignment-driven gradients on a per-sample basis, prioritizing uncertain or misaligned cases while allowing well-aligned examples to follow standard supervised updates. The framework further learns abstention behavior for fully misaligned inputs, incorporating conservative responses directly into the fine-tuned model. Experiments on general and domain-specific instruction-tuning benchmarks demonstrate consistent reductions in harmful and hallucinated outputs without sacrificing downstream task performance. Additional analyses show robustness to adversarial fine-tuning, prompt-based attacks, and unsafe initializations, establishing adaptively gated alignment optimization as an effective approach for alignment-preserving and alignment-recovering model adaptation.
[LG-28] Variational Entropic Optimal Transport
链接: https://arxiv.org/abs/2602.02241
作者: Roman Dyachenko,Nikita Gushchin,Kirill Sokolov,Petr Mokrov,Evgeny Burnaev,Alexander Korotin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Entropic optimal transport (EOT) in continuous spaces with quadratic cost is a classical tool for solving the domain translation problem. In practice, recent approaches optimize a weak dual EOT objective depending on a single potential, but doing so is computationally inefficient due to the intractable log-partition term. Existing methods typically resolve this obstacle in one of two ways: by significantly restricting the transport family to obtain closed-form normalization (via Gaussian-mixture parameterizations), or by using general neural parameterizations that require simulation-based training procedures. We propose Variational Entropic Optimal Transport (VarEOT), based on an exact variational reformulation of the log-partition \log \mathbb{E}[\exp(\cdot)] as a tractable minimization over an auxiliary positive normalizer. This yields a differentiable learning objective optimized with stochastic gradients and avoids the necessity of MCMC simulations during training. We provide theoretical guarantees, including finite-sample generalization bounds and approximation results under universal function approximation. Experiments on synthetic data and unpaired image-to-image translation demonstrate competitive or improved translation quality, while comparisons within the solvers that use the same weak dual EOT objective support the benefit of the proposed optimization principle.
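One standard variational identity consistent with the abstract's description (a minimization over an auxiliary positive normalizer c); whether VarEOT uses exactly this parameterization is an assumption. Its appeal is that the objective becomes linear in the expectation, so unbiased stochastic gradients are available:

```latex
% Variational form of the log-partition term; the minimizer c* recovers the partition value.
\log \mathbb{E}\left[e^{f(X)}\right]
  \;=\; \min_{c>0}\;\left\{ \frac{1}{c}\,\mathbb{E}\left[e^{f(X)}\right] + \log c - 1 \right\},
\qquad c^\star = \mathbb{E}\left[e^{f(X)}\right].
```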
[LG-29] Interpretability in Deep Time Series Models Demands Semantic Alignment
链接: https://arxiv.org/abs/2602.02239
作者: Giovanni De Felice,Riccardo D’Elia,Alberto Termine,Pietro Barbiero,Giuseppe Marra,Silvia Santini
类目: Machine Learning (cs.LG)
*备注:
Abstract:Deep time series models continue to improve predictive performance, yet their deployment remains limited by their black-box nature. In response, existing interpretability approaches in the field keep focusing on explaining internal model computations, without addressing whether these align with how a human would reason about the studied phenomenon. Instead, we argue that interpretability in deep time series models should pursue semantic alignment: predictions should be expressed in terms of variables that are meaningful to the end user, mediated by spatial and temporal mechanisms that admit user-dependent constraints. In this paper, we formalize this requirement and further demand that, once established, semantic alignment be preserved under temporal evolution: a constraint with no analog in static settings. Building on this definition, we outline a blueprint for semantically aligned deep time series models, identify properties that support trust, and discuss implications for model design.
[LG-30] Online Fine-Tuning of Pretrained Controllers for Autonomous Driving via Real-Time Recurrent RL
链接: https://arxiv.org/abs/2602.02236
作者: Julian Lemmel,Felix Resch,Mónika Farsang,Ramin Hasani,Daniela Rus,Radu Grosu
类目: Robotics (cs.RO); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Systems and Control (eess.SY)
*备注:
Abstract:Deploying pretrained policies in real-world applications presents substantial challenges that fundamentally limit the practical applicability of learning-based control systems. When autonomous systems encounter environmental changes in system dynamics, sensor drift, or task objectives, fixed policies rapidly degrade in performance. We show that employing Real-Time Recurrent Reinforcement Learning (RTRRL), a biologically plausible algorithm for online adaptation, can effectively fine-tune a pretrained policy to improve autonomous agents’ performance on driving tasks. We further show that RTRRL synergizes with a recent biologically inspired recurrent network model, the Liquid-Resistance Liquid-Capacitance RNN. We demonstrate the effectiveness of this closed-loop approach in a simulated CarRacing environment and in a real-world line-following task with a RoboRacer car equipped with an event camera.
[LG-31] Prediction-Powered Risk Monitoring of Deployed Models for Detecting Harmful Distribution Shifts
链接: https://arxiv.org/abs/2602.02229
作者: Guangyi Zhang,Yunlong Cai,Guanding Yu,Osvaldo Simeone
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:We study the problem of monitoring model performance in dynamic environments where labeled data are limited. To this end, we propose prediction-powered risk monitoring (PPRM), a semi-supervised risk-monitoring approach based on prediction-powered inference (PPI). PPRM constructs anytime-valid lower bounds on the running risk by combining synthetic labels with a small set of true labels. Harmful shifts are detected via a threshold-based comparison with an upper bound on the nominal risk, satisfying assumption-free finite-sample guarantees in the probability of false alarm. We demonstrate the effectiveness of PPRM through extensive experiments on image classification, large language model (LLM), and telecommunications monitoring tasks.
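A minimal sketch of the prediction-powered point estimate that PPI-style bounds build on: the mean loss under synthetic labels on a large unlabeled set, corrected by a rectifier computed on the small labeled set. The anytime-valid confidence sequences PPRM constructs on top of this are not shown, and the numbers below are synthetic.

```python
# A minimal sketch of the PPI mean estimator with a bias-correcting rectifier.
import numpy as np

rng = np.random.default_rng(0)
loss_synthetic = rng.uniform(0.0, 1.0, size=10_000)        # loss w.r.t. synthetic labels
loss_true_lab = rng.uniform(0.0, 1.0, size=200)            # true loss on labeled subset
loss_syn_lab = loss_true_lab + rng.normal(0, 0.05, 200)    # synthetic loss, same subset

rectifier = np.mean(loss_true_lab - loss_syn_lab)          # corrects synthetic-label bias
risk_pp = np.mean(loss_synthetic) + rectifier              # prediction-powered estimate
print(f"prediction-powered risk estimate: {risk_pp:.3f}")
```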
[LG-32] Scientific Theory of a Black-Box: A Life Cycle-Scale XAI Framework Based on Constructive Empiricism
链接: https://arxiv.org/abs/2602.02215
作者: Sebastian Müller,Vanessa Toborek,Eike Stadtländer,Tamás Horváth,Brendan Balcerak Jackson,Christian Bauckhage
类目: Machine Learning (cs.LG)
*备注:
Abstract:Explainable AI (XAI) offers a growing number of algorithms that aim to answer specific questions about black-box models. What is missing is a principled way to consolidate explanatory information about a fixed black-box model into a persistent, auditable artefact that accompanies the black-box throughout its life cycle. We address this gap by introducing the notion of a scientific theory of a black box (SToBB). Grounded in Constructive Empiricism, a SToBB fulfils three obligations: (i) empirical adequacy with respect to all available observations of black-box behaviour, (ii) adaptability via explicit update commitments that restore adequacy when new observations arrive, and (iii) auditability through transparent documentation of assumptions, construction choices, and update behaviour. We operationalise these obligations as a general framework that specifies an extensible observation base, a traceable hypothesis class, algorithmic components for construction and revision, and documentation sufficient for third-party assessment. Explanations for concrete stakeholder needs are then obtained by querying the maintained record through interfaces, rather than by producing isolated method outputs. As a proof of concept, we instantiate a complete SToBB for a neural-network classifier on a tabular task and introduce the Constructive Box Theoriser (CoBoT) algorithm, an online procedure that constructs and maintains an empirically adequate rule-based surrogate as observations accumulate. Together, these contributions position SToBBs as a life cycle-scale, inspectable point of reference that supports consistent, reusable analyses and systematic external scrutiny.
[LG-33] Fat-Cat: Document-Driven Metacognitive Multi-Agent System for Complex Reasoning
链接: https://arxiv.org/abs/2602.02206
作者: Tong Yang(1),Yemin Wang(3),Chaoning Zhang(4),Aming Wu(1) ((1) Henan Polytechnic University, (2) Xiamen University, (3) University of Electronic Science and Technology of China)
类目: Machine Learning (cs.LG)
*备注:
Abstract:The effectiveness of LLM-based agents is often limited not by model capacity alone, but by how efficiently contextual information is utilized at runtime. Existing agent frameworks rely on rigid, syntax-heavy state representations such as nested JSON, which require models to devote a substantial portion of their limited attention to syntactic processing rather than semantic reasoning. In this paper, we propose Fat-Cat, a document-driven agent architecture that improves the signal-to-noise ratio of state management by integrating three key components: (1) a Semantic File System that represents agent state as Markdown documents aligned with common pre-training corpora, (2) a Textual Strategy Evolution module that accumulates task-solving knowledge without parameter updates, and (3) a Closed-Loop Watcher that monitors reasoning trajectories to reduce hallucinations. Across extensive reasoning, retrieval, and coding benchmarks, Fat-Cat consistently improves agent performance. It enables the Kimi-k2 model to outperform the proprietary GPT-4o baseline on HotPotQA. Replacing the document-based state with JSON leads to a performance drop, empirically validating the critical necessity of document-driven state modeling over rigid syntax. The code is available at this https URL.
[LG-34] ECHO-2: A Large Scale Distributed Rollout Framework for Cost-efficient Reinforcement Learning
链接: https://arxiv.org/abs/2602.02192
作者: Jie Xiao,Meng Chen,Qingnan Ren,Song Jingwei,Jiaqi Huang,Yangshen Deng,Chris Tong,Wanyi Chen,Suli Wang,Ziqian Bi,Shuo Lu,Yiqun Duan,Lynn Ai,Eric Yang,Bill Shi
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 23 pages, 7 figures
Abstract:Reinforcement learning (RL) is a critical stage in post-training large language models (LLMs), involving repeated interaction between rollout generation, reward evaluation, and centralized learning. Distributing rollout execution offers opportunities to leverage more cost-efficient inference resources, but introduces challenges in wide-area coordination and policy dissemination. We present ECHO-2, a distributed RL framework for post-training with remote inference workers and non-negligible dissemination latency. ECHO-2 combines centralized learning with distributed rollouts and treats bounded policy staleness as a user-controlled parameter, enabling rollout generation, dissemination, and training to overlap. We introduce an overlap-based capacity model that relates training time, dissemination latency, and rollout throughput, yielding a practical provisioning rule for sustaining learner utilization. To mitigate dissemination bottlenecks and lower cost, ECHO-2 employs peer-assisted pipelined broadcast and cost-aware activation of heterogeneous workers. Experiments on GRPO post-training of 4B and 8B models under real wide-area bandwidth regimes show that ECHO-2 significantly improves cost efficiency while preserving RL reward comparable to strong baselines.
[LG-35] STILL: Selecting Tokens for Intra-Layer Hybrid Attention to Linearize LLMs
链接: https://arxiv.org/abs/2602.02180
作者: Weikang Meng,Liangyu Huo,Yadan Luo,Jiawen Guan,Jingyi Zhang,Yingjian Li,Zheng Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Linearizing pretrained large language models (LLMs) primarily relies on intra-layer hybrid attention mechanisms to alleviate the quadratic complexity of standard softmax attention. Existing methods perform token routing based on sliding-window partitions, resulting in position-based selection that fails to capture token-specific global importance. Meanwhile, linear attention further suffers from distribution shift caused by learnable feature maps that distort pretrained feature magnitudes. Motivated by these limitations, we propose STILL, an intra-layer hybrid linearization framework for efficiently linearizing LLMs. STILL introduces a Self-Saliency Score with strong local-global consistency, enabling accurate token selection using sliding-window computation, and retains salient tokens for sparse softmax attention while summarizing the remaining context via linear attention. To preserve pretrained representations, we design a Norm-Preserved Feature Map (NP-Map) that decouples feature direction from magnitude and reinjects pretrained norms. We further adopt a unified training-inference architecture with chunk-wise parallelization and delayed selection to improve hardware efficiency. Experiments show that STILL matches or surpasses the original pretrained model on commonsense and general reasoning tasks, and achieves up to an 86.2% relative improvement over prior linearized attention methods on long-context benchmarks.
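A minimal sketch of a norm-preserving feature map in the spirit of NP-Map: a learnable map supplies only the direction, and the pretrained vector's norm is reinjected. The choice of inner map is an assumption; STILL's exact design may differ.

```python
# A minimal sketch of decoupling feature direction from magnitude.
import torch
import torch.nn as nn

class NormPreservedMap(nn.Module):
    def __init__(self, dim, f=None, eps=1e-6):
        super().__init__()
        self.f = f if f is not None else nn.Sequential(nn.Linear(dim, dim), nn.SiLU())
        self.eps = eps

    def forward(self, x):
        y = self.f(x)
        y_dir = y / (y.norm(dim=-1, keepdim=True) + self.eps)    # learned direction
        return x.norm(dim=-1, keepdim=True) * y_dir              # pretrained magnitude

torch.manual_seed(0)
phi = NormPreservedMap(16)
x = torch.randn(4, 16)
print(torch.allclose(phi(x).norm(dim=-1), x.norm(dim=-1), atol=1e-4))  # norms preserved
```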
[LG-36] Generalized Optimal Classification Trees: A Mixed-Integer Programming Approach
链接: https://arxiv.org/abs/2602.02173
作者: Jiancheng Tu,Wenqi Fan,Zhibin Wu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Global optimization of decision trees is a long-standing challenge in combinatorial optimization, yet such models play an important role in interpretable machine learning. Although the problem has been investigated for several decades, only recent advances in discrete optimization have enabled practical algorithms for solving optimal classification tree problems on real-world datasets. Mixed-integer programming (MIP) offers a high degree of modeling flexibility, and we therefore propose a MIP-based framework for learning optimal classification trees under nonlinear performance metrics, such as the F1-score, that explicitly addresses class imbalance. To improve scalability, we develop problem-specific acceleration techniques, including a tailored branch-and-cut algorithm, an instance-reduction scheme, and warm-start strategies. We evaluate the proposed approach on 50 benchmark datasets. The computational results show that the framework can efficiently optimize nonlinear metrics while achieving strong predictive performance and reduced solution times compared with existing methods.
[LG-37] Co-RedTeam: Orchestrated Security Discovery and Exploitation with LLM Agents
链接: https://arxiv.org/abs/2602.02164
作者: Pengfei He,Ash Fox,Lesly Miculicich,Stefan Friedli,Daniel Fabian,Burak Gokturk,Jiliang Tang,Chen-Yu Lee,Tomas Pfister,Long T. Le
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Large language models (LLMs) have shown promise in assisting cybersecurity tasks, yet existing approaches struggle with automatic vulnerability discovery and exploitation due to limited interaction, weak execution grounding, and a lack of experience reuse. We propose Co-RedTeam, a security-aware multi-agent framework designed to mirror real-world red-teaming workflows by integrating security-domain knowledge, code-aware analysis, execution-grounded iterative reasoning, and long-term memory. Co-RedTeam decomposes vulnerability analysis into coordinated discovery and exploitation stages, enabling agents to plan, execute, validate, and refine actions based on real execution feedback while learning from prior trajectories. Extensive evaluations on challenging security benchmarks demonstrate that Co-RedTeam consistently outperforms strong baselines across diverse backbone models, achieving over 60% success rate in vulnerability exploitation and over 10% absolute improvement in vulnerability detection. Ablation and iteration studies further confirm the critical role of execution feedback, structured interaction, and memory for building robust and generalizable cybersecurity agents.
[LG-38] Interpretable Tabular Foundation Models via In-Context Kernel Regression
链接: https://arxiv.org/abs/2602.02162
作者: Ratmir Miftachov,Bruno Charron,Simon Valentin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Tabular foundation models like TabPFN and TabICL achieve state-of-the-art performance through in-context learning, yet their architectures remain fundamentally opaque. We introduce KernelICL, a framework to enhance tabular foundation models with quantifiable sample-based interpretability. Building on the insight that in-context learning is akin to kernel regression, we make this mechanism explicit by replacing the final prediction layer with kernel functions (Gaussian, dot-product, kNN) so that every prediction is a transparent weighted average of training labels. We introduce a two-dimensional taxonomy that formally unifies standard kernel methods, modern neighbor-based approaches, and attention mechanisms under a single framework, and quantify inspectability via the perplexity of the weight distribution over training samples. On 55 TALENT benchmark datasets, KernelICL achieves performance on par with existing tabular foundation models, demonstrating that explicit kernel constraints on the final layer enable inspectable predictions without sacrificing performance.
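Since the abstract describes each prediction as a transparent weighted average of training labels, with inspectability measured by the perplexity of the weight distribution, here is a minimal sketch of that mechanism with a Gaussian kernel. The embeddings and bandwidth are stand-ins; the actual KernelICL head sits on top of a tabular foundation model.

```python
import numpy as np

def kernel_icl_predict(z_train, y_train, z_test, bandwidth=1.0):
    """Sketch of an in-context kernel-regression head: every prediction is a
    weighted average of training labels, with weights from a Gaussian kernel
    over (hypothetical) learned embeddings z."""
    d2 = ((z_test[:, None, :] - z_train[None, :, :]) ** 2).sum(-1)
    logits = -d2 / (2 * bandwidth ** 2)
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)      # weights over training samples
    y_hat = w @ y_train                    # weighted average of labels
    # inspectability: perplexity of the weight distribution
    entropy = -(w * np.log(w + 1e-12)).sum(axis=1)
    perplexity = np.exp(entropy)           # effective number of samples used
    return y_hat, w, perplexity

z_tr, y_tr = np.random.randn(50, 8), np.random.randn(50)
z_te = np.random.randn(3, 8)
y_hat, w, ppl = kernel_icl_predict(z_tr, y_tr, z_te)
print(y_hat.shape, ppl)  # (3,), effective sample count per prediction
```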
[LG-39] Generating Causal Temporal Interaction Graphs for Counterfactual Validation of Temporal Link Prediction
链接: https://arxiv.org/abs/2602.02161
作者: Aniq Ur Rahman,Justin P. Coon
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:Temporal link prediction (TLP) models are commonly evaluated based on predictive accuracy, yet such evaluations do not assess whether these models capture the causal mechanisms that govern temporal interactions. In this work, we propose a framework for counterfactual validation of TLP models by generating causal temporal interaction graphs (CTIGs) with known ground-truth causal structure. We first introduce a structural equation model for continuous-time event sequences that supports both excitatory and inhibitory effects, and then extend this mechanism to temporal interaction graphs. To compare causal models, we propose a distance metric based on cross-model predictive error, and empirically validate the hypothesis that predictors trained on one causal model degrade when evaluated on sufficiently distant models. Finally, we instantiate counterfactual evaluation under (i) controlled causal shifts between generating models and (ii) timestamp shuffling as a stochastic distortion with measurable causal distance. Our framework provides a foundation for causality-aware benchmarking.
[LG-40] Efficient Neural Controlled Differential Equations via Attentive Kernel Smoothing
链接: https://arxiv.org/abs/2602.02157
作者: Egor Serov,Ilya Kuleshov,Alexey Zaytsev
类目: Machine Learning (cs.LG)
*备注:
Abstract:Neural Controlled Differential Equations (Neural CDEs) provide a powerful continuous-time framework for sequence modeling, yet the roughness of the driving control path often restricts their efficiency. Standard splines introduce high-frequency variations that force adaptive solvers to take excessively small steps, driving up the Number of Function Evaluations (NFE). We propose a novel approach to Neural CDE path construction that replaces exact interpolation with Kernel and Gaussian Process (GP) smoothing, enabling explicit control over trajectory regularity. To recover details lost during smoothing, we propose an attention-based Multi-View CDE (MV-CDE) and its convolutional extension (MVC-CDE), which employ learnable queries to inform path reconstruction. This framework allows the model to distribute representational capacity across multiple trajectories, each capturing distinct temporal patterns. Empirical results demonstrate that our method, MVC-CDE with GP, achieves state-of-the-art accuracy while significantly reducing NFEs and total inference time compared to spline-based baselines.
[LG-41] The Maximum von Neumann Entropy Principle: Theory and Applications in Machine Learning
链接: https://arxiv.org/abs/2602.02117
作者: Youqi Wu,Farzan Farnia
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:
Abstract:Von Neumann entropy (VNE) is a fundamental quantity in quantum information theory and has recently been adopted in machine learning as a spectral measure of diversity for kernel matrices and kernel covariance operators. While maximizing VNE under constraints is well known in quantum settings, a principled analogue of the classical maximum entropy framework, particularly its decision theoretic and game theoretic interpretation, has not been explicitly developed for VNE in data driven contexts. In this paper, we extend the minimax formulation of the maximum entropy principle due to Grünwald and Dawid to the setting of von Neumann entropy, providing a game-theoretic justification for VNE maximization over density matrices and trace-normalized positive semidefinite operators. This perspective yields a robust interpretation of maximum VNE solutions under partial information and clarifies their role as least committed inferences in spectral domains. We then illustrate how the resulting Maximum VNE principle applies to modern machine learning problems by considering two representative applications, selecting a kernel representation from multiple normalized embeddings via kernel-based VNE maximization, and completing kernel matrices from partially observed entries. These examples demonstrate how the proposed framework offers a unifying information-theoretic foundation for VNE-based methods in kernel learning.
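As a quick illustration of the quantity being maximized: the von Neumann entropy of a trace-normalized kernel matrix is the Shannon entropy of its eigenvalue spectrum. The sketch below (plain NumPy, illustrative only) also hints at the paper's kernel-selection use case.

```python
import numpy as np

def von_neumann_entropy(K, eps=1e-12):
    """Sketch: VNE of a kernel matrix. Normalize K to unit trace (a density
    matrix analogue), then compute -sum(lam * log(lam)) over eigenvalues."""
    rho = K / np.trace(K)
    lam = np.clip(np.linalg.eigvalsh(rho), eps, None)
    return float(-(lam * np.log(lam)).sum())

# toy kernel-selection use: pick the embedding whose kernel maximizes VNE
X1, X2 = np.random.randn(100, 5), np.random.randn(100, 50)
K1, K2 = X1 @ X1.T, X2 @ X2.T
print(von_neumann_entropy(K1), von_neumann_entropy(K2))  # higher = more spectral diversity
```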
[LG-42] Efficient Swap Regret Minimization in Combinatorial Bandits AISTATS2026
链接: https://arxiv.org/abs/2602.02087
作者: Andreas Kontogiannis,Vasilis Pollatos,Panayotis Mertikopoulos,Ioannis Panageas
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at AISTATS 2026
Abstract:This paper addresses the problem of designing efficient no-swap regret algorithms for combinatorial bandits, where the number of actions N is exponentially large in the dimensionality of the problem. In this setting, designing an efficient no-swap-regret algorithm translates to achieving sublinear (in horizon T) swap regret with polylogarithmic dependence on N. In contrast to the weaker notion of external regret minimization - a problem which is fairly well understood in the literature - achieving no-swap regret with a polylogarithmic dependence on N has remained elusive in combinatorial bandits. Our paper resolves this challenge by introducing a no-swap-regret learning algorithm with regret that scales polylogarithmically in N and is tight for the class of combinatorial bandits. To ground our results, we also demonstrate how to implement the proposed algorithm efficiently - that is, with a per-iteration complexity that also scales polylogarithmically in N - across a wide range of well-studied applications.
[LG-43] Active learning from positive and unlabeled examples
链接: https://arxiv.org/abs/2602.02081
作者: Farnam Mansouri,Sandra Zilles,Shai Ben-David
类目: Machine Learning (cs.LG)
*备注:
Abstract:Learning from positive and unlabeled data (PU learning) is a weakly supervised variant of binary classification in which the learner receives labels only for (some) positively labeled instances, while all other examples remain unlabeled. Motivated by applications such as advertising and anomaly detection, we study an active PU learning setting where the learner can adaptively query instances from an unlabeled pool, but a queried label is revealed only when the instance is positive and an independent coin flip succeeds; otherwise the learner receives no information. In this paper, we provide the first theoretical analysis of the label complexity of active PU learning.
[LG-44] Learning Half-Spaces from Perturbed Contrastive Examples
链接: https://arxiv.org/abs/2602.02080
作者: Aryan Alavi Razavi Ravari,Farnam Mansouri,Yuxin Chen,Valentio Iverson,Adish Singla,Sandra Zilles
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study learning under a two-step contrastive example oracle, as introduced by Mansouri et al. (2025), where each queried (or sampled) labeled example is paired with an additional contrastive example of opposite label. While Mansouri et al. assume an idealized setting, where the contrastive example is at minimum distance of the originally queried/sampled point, we introduce and analyze a mechanism, parameterized by a non-decreasing noise function f, under which this ideal contrastive example is perturbed. The amount of perturbation is controlled by f(d), where d is the distance of the queried/sampled point to the decision boundary. Intuitively, this results in higher-quality contrastive examples for points closer to the decision boundary. We study this model in two settings: (i) when the maximum perturbation magnitude is fixed, and (ii) when it is stochastic. For one-dimensional thresholds and for half-spaces under the uniform distribution on a bounded domain, we characterize active and passive contrastive sample complexity in dependence on the function f. We show that, under certain conditions on f, the presence of contrastive examples speeds up learning in terms of asymptotic query complexity and asymptotic expected query complexity.
[LG-45] AICD Bench: A Challenging Benchmark for AI-Generated Code Detection
链接: https://arxiv.org/abs/2602.02079
作者: Daniil Orel,Dilshod Azizov,Indraneil Paul,Yuxia Wang,Iryna Gurevych,Preslav Nakov
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:
Abstract:Large language models (LLMs) are increasingly capable of generating functional source code, raising concerns about authorship, accountability, and security. While detecting AI-generated code is critical, existing datasets and benchmarks are narrow, typically limited to binary human-machine classification under in-distribution settings. To bridge this gap, we introduce AICD Bench, the most comprehensive benchmark for AI-generated code detection. It spans 2M examples, 77 models across 11 families, and 9 programming languages, including recent reasoning models. Beyond scale, AICD Bench introduces three realistic detection tasks: (i) Robust Binary Classification under distribution shifts in language and domain, (ii) Model Family Attribution, grouping generators by architectural lineage, and (iii) Fine-Grained Human-Machine Classification across human, machine, hybrid, and adversarial code. Extensive evaluation on neural and classical detectors shows that performance remains far below practical usability, particularly under distribution shift and for hybrid or adversarial code. We release AICD Bench as a unified, challenging evaluation suite to drive the next generation of robust approaches for AI-generated code detection. The data and the code are available at this https URL.
[LG-46] Calibrating Adaptive Smoothing Methods for Freeway Traffic Reconstruction
链接: https://arxiv.org/abs/2602.02072
作者: Junyi Ji,Derek Gloudemans,Gergely Zachár,Matthew Nice,William Barbour,Daniel B. Work
类目: Machine Learning (cs.LG)
*备注:
Abstract:The adaptive smoothing method (ASM) is a widely used approach for traffic state reconstruction. This article presents a Python implementation of ASM, featuring end-to-end calibration using real-world ground truth data. The calibration is formulated as a parameterized kernel optimization problem. The model is calibrated using data from a full-state observation testbed, with input from a sparse radar sensor network. The implementation is developed in PyTorch, enabling integration with various deep learning methods. We evaluate the results in terms of speed distribution, spatio-temporal error distribution, and spatial error to provide benchmark metrics for the traffic reconstruction problem. We further demonstrate the usability of the calibrated method across multiple freeways. Finally, we discuss the challenges of reproducibility in general traffic model calibration and the limitations of ASM. This article is reproducible and can serve as a benchmark for various freeway operation tasks.
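For readers unfamiliar with ASM, the classical Treiber-Helbing formulation smooths measurements with two anisotropic kernels aligned with the free-flow and congested wave speeds, then blends them with a speed-dependent weight. The sketch below follows that textbook form; the parameter values are illustrative, whereas the paper treats them as a parameterized kernel to be calibrated end-to-end from ground-truth data.

```python
import numpy as np

def asm_reconstruct(x_s, t_s, v_s, x_grid, t_grid,
                    sigma=300.0, tau=30.0, c_free=20.0, c_cong=-4.0,
                    v_c=15.0, dv=5.0):
    """Minimal adaptive smoothing method (sketch): two anisotropic kernels
    aligned with the free-flow and congested wave speeds, blended by a
    speed-dependent weight. Parameter values here are illustrative."""
    X, T = np.meshgrid(x_grid, t_grid, indexing="ij")

    def smooth(c):
        # kernel between every grid cell and every measurement, shifted in
        # time along the characteristic speed c
        dx = X[..., None] - x_s[None, None, :]
        dt = T[..., None] - t_s[None, None, :]
        phi = np.exp(-np.abs(dx) / sigma - np.abs(dt - dx / c) / tau)
        return (phi * v_s).sum(-1) / (phi.sum(-1) + 1e-12)

    v_free, v_cong = smooth(c_free), smooth(c_cong)
    w = 0.5 * (1 + np.tanh((v_c - np.minimum(v_free, v_cong)) / dv))
    return w * v_cong + (1 - w) * v_free  # congested kernel dominates at low speed

# toy measurements: position (m), time (s), speed (m/s) from sparse sensors
x_s, t_s, v_s = np.array([0., 500., 1000.]), np.array([0., 60., 120.]), np.array([25., 8., 22.])
field = asm_reconstruct(x_s, t_s, v_s, np.linspace(0, 1000, 20), np.linspace(0, 120, 30))
print(field.shape)  # (20, 30) reconstructed speed field
```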
[LG-47] BAPS: A Fine-Grained Low-Precision Scheme for Softmax in Attention via Block-Aware Precision reScaling
链接: https://arxiv.org/abs/2602.02071
作者: Zisheng Ye,Xiaoyu He,Maoyuan Song,Guoliang Qiu,Chao Liao,Chen Wu,Yonggang Sun,Zhichun Li,Xiaoru Xie,Yuanyong Luo,Hu Liu,Pinyan Lu,Heng Liao
类目: Machine Learning (cs.LG)
*备注:
Abstract:As the performance gains from accelerating quantized matrix multiplication plateau, the softmax operation becomes the critical bottleneck in Transformer inference. This bottleneck stems from two hardware limitations: (1) limited data bandwidth between matrix and vector compute cores, and (2) the significant area cost of high-precision (FP32/16) exponentiation units (EXP2). To address these issues, we introduce a novel low-precision workflow that employs a specific 8-bit floating-point format (HiF8) and block-aware precision rescaling for softmax. Crucially, our algorithmic innovations make low-precision softmax feasible without the significant model accuracy loss that hampers direct low-precision approaches. Specifically, our design (i) halves the required data movement bandwidth by enabling matrix multiplication outputs constrained to 8-bit, and (ii) substantially reduces the EXP2 unit area by computing exponentiations in low (8-bit) precision. Extensive evaluation on language models and multi-modal models confirms the validity of our method. By alleviating the vector computation bottleneck, our work paves the way for doubling end-to-end inference throughput without increasing chip area, and offers a concrete co-design path for future low-precision hardware and software.
[LG-48] Learning to Route and Schedule LLMs from User Retrials via Contextual Queueing Bandits
链接: https://arxiv.org/abs/2602.02061
作者: Seoungbin Bae,Junyoung Son,Dabeen Lee
类目: Machine Learning (cs.LG)
*备注:
Abstract:Explosive demands for LLMs often cause user queries to accumulate in server queues, requiring efficient routing (query-LLM matching) and scheduling (query prioritization) mechanisms. Several online algorithms are being deployed, but they overlook the following two key challenges inherent to conversational LLM services: (1) unsatisfied users may retry queries, increasing the server backlog, and (2) requests for "explicit" feedback, such as ratings, degrade user experiences. In this paper, we develop a joint routing and scheduling algorithm that leverages "implicit" feedback inferred from user retrial behaviors. The key idea is to propose and study the framework of contextual queueing bandits with multinomial logit feedback (CQB-MNL). CQB-MNL models query retrials, as well as context-based learning for user preferences over LLMs. Our algorithm, anytime CQB (ACQB), achieves efficient learning while maintaining queue stability by combining Thompson sampling with forced exploration at a decaying rate. We show that ACQB simultaneously achieves a cumulative regret of $\widetilde{\mathcal{O}}(\sqrt{t})$ for routing and a queue length regret of $\widetilde{\mathcal{O}}(t^{-1/4})$ for any large t. For experiments, we refine query embeddings via contrastive learning while adopting a disjoint parameter model to learn LLM-specific parameters. Experiments on SPROUT, EmbedLLM, and RouterBench datasets confirm that both algorithms consistently outperform baselines.
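The routing core of ACQB, Thompson sampling mixed with decaying forced exploration, is easy to sketch in isolation. The decay schedule and Gaussian posteriors below are illustrative assumptions (the abstract does not specify either), and the contextual MNL-feedback machinery of CQB-MNL is omitted.

```python
import numpy as np

def choose_llm(t, posterior_means, posterior_stds, rng, c=1.0):
    """Sketch of ACQB-style routing: Thompson sampling over per-LLM reward
    posteriors, mixed with forced exploration at a decaying rate (assumed
    here to be ~ t^{-1/2})."""
    eps = min(1.0, c * t ** -0.5)                      # decaying exploration rate
    if rng.random() < eps:
        return int(rng.integers(len(posterior_means)))  # explore uniformly
    samples = rng.normal(posterior_means, posterior_stds)
    return int(samples.argmax())                        # Thompson sampling

rng = np.random.default_rng(0)
mu, sd = np.array([0.4, 0.6, 0.5]), np.array([0.2, 0.2, 0.2])
picks = [choose_llm(t, mu, sd, rng) for t in range(1, 1000)]
print(np.bincount(picks))  # concentrates on the best LLM as exploration decays
```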
[LG-49] Ultrafast On-chip Online Learning via Spline Locality in Kolmogorov-Arnold Networks
链接: https://arxiv.org/abs/2602.02056
作者: Duc Hoang,Aarush Gupta,Philip Harris
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注:
Abstract:Ultrafast online learning is essential for high-frequency systems, such as controls for quantum computing and nuclear fusion, where adaptation must occur on sub-microsecond timescales. Meeting these requirements demands low-latency, fixed-precision computation under strict memory constraints, a regime in which conventional Multi-Layer Perceptrons (MLPs) are both inefficient and numerically unstable. We identify key properties of Kolmogorov-Arnold Networks (KANs) that align with these constraints. Specifically, we show that: (i) KAN updates exploiting B-spline locality are sparse, enabling superior on-chip resource scaling, and (ii) KANs are inherently robust to fixed-point quantization. By implementing fixed-point online training on Field-Programmable Gate Arrays (FPGAs), a representative platform for on-chip computation, we demonstrate that KAN-based online learners are significantly more efficient and expressive than MLPs across a range of low-latency and resource-constrained tasks. To our knowledge, this work is the first to demonstrate model-free online learning at sub-microsecond latencies.
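The sparsity claim rests on B-spline locality: a degree-k spline has only k+1 nonzero basis functions at any input, so an online gradient step touches only k+1 coefficients per edge. A minimal Cox-de Boor sketch (floating-point NumPy, not the paper's fixed-point FPGA kernel) makes this visible.

```python
import numpy as np

def bspline_basis(x, grid, k=3):
    """Cox-de Boor recursion (sketch): B-spline basis values at x. For a
    degree-k spline only k+1 entries are nonzero, so a gradient step on the
    spline coefficients touches just k+1 weights per input -- the locality
    that makes on-chip KAN updates sparse."""
    B = ((x >= grid[:-1]) & (x < grid[1:])).astype(float)  # degree 0
    for d in range(1, k + 1):
        left = (x - grid[:-(d + 1)]) / (grid[d:-1] - grid[:-(d + 1)]) * B[:-1]
        right = (grid[d + 1:] - x) / (grid[d + 1:] - grid[1:-d]) * B[1:]
        B = left + right
    return B

grid = np.linspace(-2, 2, 12)          # uniform knot grid
basis = bspline_basis(0.3, grid, k=3)
print(np.count_nonzero(basis), "of", basis.size, "coefficients would be updated")
```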
[LG-50] On Stability and Robustness of Diffusion Posterior Sampling for Bayesian Inverse Problems
链接: https://arxiv.org/abs/2602.02045
作者: Yiming Yang,Xiaoyuan Cheng,Yi He,Kaiyu Li,Wenxuan Yuan,Zhuo Sun
类目: Machine Learning (cs.LG)
*备注:
Abstract:Diffusion models have recently emerged as powerful learned priors for Bayesian inverse problems (BIPs). Diffusion-based solvers rely on a presumed likelihood for the observations in BIPs to guide the generation process. However, the link between likelihood and recovery quality for BIPs is unclear in previous works. We bridge this gap by characterizing the posterior approximation error and proving the stability of the diffusion-based solvers. Meanwhile, an immediate result of our findings on stability demonstrates the lack of robustness in diffusion-based solvers, which remains unexplored. This can degrade performance when the presumed likelihood mismatches the unknown true data generation processes. To address this issue, we propose a simple yet effective solution, robust diffusion posterior sampling, which is provably robust and compatible with existing gradient-based posterior samplers. Empirical results on scientific inverse problems and natural image tasks validate the effectiveness and robustness of our method, showing consistent performance improvements under challenging likelihood misspecifications.
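For context, a standard diffusion-posterior-sampling guidance step nudges the iterate along the gradient of the presumed likelihood, evaluated at the denoiser's clean estimate. The sketch below shows that baseline step for a linear Gaussian likelihood; the paper's robust variant replaces this presumed-likelihood term, which we do not attempt to reproduce here.

```python
import numpy as np

def guided_update(x_t, x0_hat, y, A, step=0.5):
    """One guidance step of diffusion posterior sampling (sketch): nudge the
    current iterate with the gradient of the presumed Gaussian likelihood
    0.5 * ||y - A x0||^2, evaluated at the denoiser's clean estimate x0_hat."""
    residual = A @ x0_hat - y
    grad = A.T @ residual            # d/dx0 of 0.5 * ||y - A x0||^2
    return x_t - step * grad

A = np.random.randn(20, 50)          # linear forward operator
x_true = np.random.randn(50)
y = A @ x_true + 0.01 * np.random.randn(20)
x_t = np.random.randn(50)
x_t = guided_update(x_t, x_t, y, A)  # in practice x0_hat comes from the denoiser
print(x_t.shape)
```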
[LG-51] Twinning Complex Networked Systems: Data-Driven Calibration of the mABCD Synthetic Graph Generator
链接: https://arxiv.org/abs/2602.02044
作者: Piotr Bródka,Michał Czuba,Bogumił Kamiński,Łukasz Kraiński,Katarzyna Musial,Paweł Prałat,Mateusz Stolarski
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:
Abstract:The increasing availability of relational data has contributed to a growing reliance on network-based representations of complex systems. Over time, these models have evolved to capture more nuanced properties, such as the heterogeneity of relationships, leading to the concept of multilayer networks. However, the analysis and evaluation of methods for these structures is often hindered by the limited availability of large-scale empirical data. As a result, graph generators are commonly used as a workaround, albeit at the cost of introducing systematic biases. In this paper, we address the inverse-generator problem by inferring the configuration parameters of a multilayer network generator, mABCD, from a real-world system. Our goal is to identify parameter settings that enable the generator to produce synthetic networks that act as digital twins of the original structure. We propose a method for estimating matching configurations and for quantifying the associated error. Our results demonstrate that this task is non-trivial, as strong interdependencies between configuration parameters weaken independent estimation and instead favour a joint-prediction approach.
[LG-52] Hippasus: Effective and Efficient Automatic Feature Augmentation for Machine Learning Tasks on Relational Data
链接: https://arxiv.org/abs/2602.02025
作者: Serafeim Papadias,Kostas Patroumpas,Dimitrios Skoutas
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注: 13 pages, 7 figures, 9 tables
Abstract:Machine learning models depend critically on feature quality, yet useful features are often scattered across multiple relational tables. Feature augmentation enriches a base table by discovering and integrating features from related tables through join operations. However, scaling this process to complex schemas with many tables and multi-hop paths remains challenging. Feature augmentation must address three core tasks: identify promising join paths that connect the base table to candidate tables, execute these joins to materialize augmented data, and select the most informative features from the results. Existing approaches face a fundamental tradeoff between effectiveness and efficiency: achieving high accuracy requires exploring many candidate paths, but exhaustive exploration is computationally prohibitive. Some methods compromise by considering only immediate neighbors, limiting their effectiveness, while others employ neural models that require expensive training data and introduce scalability limitations. We present Hippasus, a modular framework that achieves both goals through three key contributions. First, we combine lightweight statistical signals with semantic reasoning from Large Language Models to prune unpromising join paths before execution, focusing computational resources on high-quality candidates. Second, we employ optimized multi-way join algorithms and consolidate features from multiple paths, substantially reducing execution time. Third, we integrate LLM-based semantic understanding with statistical measures to select features that are both semantically meaningful and empirically predictive. Our experimental evaluation on publicly available datasets shows that Hippasus substantially improves feature augmentation accuracy by up to 26.8% over state-of-the-art baselines while also offering high runtime performance.
[LG-53] Adaptive Quality-Diversity Trade-offs for Large-Scale Batch Recommendation
链接: https://arxiv.org/abs/2602.02024
作者: Clémence Réda(IBENS),Tomas Rigaux,Hiba Bederina(SODA),Koh Takeuchi,Hisashi Kashima,Jill-Jênn Vie(SODA)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
Abstract:A core research question in recommender systems is to propose batches of highly relevant and diverse items, that is, items personalized to the user's preferences, but which also might get the user out of their comfort zone. This diversity might induce properties of serendipity and novelty which might increase user engagement or revenue. However, many real-life problems arise in that case: e.g., avoiding recommending distinct but overly similar items to reduce the churn risk, and managing the computational cost for large item libraries, up to millions of items. First, we consider the case when the user feedback model is perfectly observed and known in advance, and introduce an efficient algorithm called B-DivRec combining determinantal point processes and a fuzzy denuding procedure to adjust the degree of item diversity. This helps enforce a quality-diversity trade-off throughout the user history. Second, we propose an approach to adaptively tailor the quality-diversity trade-off to the user, so that diversity in recommendations can be enhanced if it leads to positive feedback, and vice versa. Finally, we illustrate the performance and versatility of B-DivRec in the two settings on synthetic and real-life data sets on movie recommendation and drug repurposing.
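To make the DPP ingredient concrete, here is a standard greedy MAP sketch that trades off relevance against diversity via a quality-reweighted similarity kernel. The theta knob and kernel construction are illustrative; B-DivRec's fuzzy denuding procedure and per-user adaptation are not shown.

```python
import numpy as np

def greedy_dpp(scores, S, k, theta=0.7):
    """Greedy MAP inference for a determinantal point process (sketch):
    select k items balancing relevance (scores) and diversity (similarity S).
    theta tunes the quality-diversity trade-off."""
    q = theta * scores                        # quality term
    L = np.exp(q[:, None] + q[None, :]) * S   # quality-reweighted DPP kernel
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(len(scores)):
            if i in selected:
                continue
            idx = selected + [i]
            gain = np.linalg.slogdet(L[np.ix_(idx, idx)])[1]
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    return selected

n = 30
emb = np.random.randn(n, 16)
S = emb @ emb.T / 16 + np.eye(n)              # positive-definite similarity kernel
print(greedy_dpp(np.random.rand(n), S, k=5))  # 5 relevant yet mutually diverse items
```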
[LG-54] Scale-covariant spiking wavelets
链接: https://arxiv.org/abs/2602.02020
作者: Jens Egholm Pedersen,Tony Lindeberg,Peter Gerstoft
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:
Abstract:We establish a theoretical connection between wavelet transforms and spiking neural networks through scale-space theory. We rely on the scale-covariant guarantees in the leaky integrate-and-fire neurons to implement discrete mother wavelets that approximate continuous wavelets. A reconstruction experiment demonstrates the feasibility of the approach and warrants further analysis to mitigate current approximation errors. Our work suggests a novel spiking signal representation that could enable more energy-efficient signal processing algorithms.
[LG-55] DASH: Faster Shampoo via Batched Block Preconditioning and Efficient Inverse-Root Solvers
链接: https://arxiv.org/abs/2602.02016
作者: Ionut-Vlad Modoranu,Philip Zmushko,Erik Schultheis,Mher Safaryan,Dan Alistarh
类目: Machine Learning (cs.LG)
*备注:
Abstract:Shampoo is one of the leading approximate second-order optimizers: a variant of it has won the MLCommons AlgoPerf competition, and it has been shown to produce models with lower activation outliers that are easier to compress. Yet, applying Shampoo currently comes at the cost of significant computational slowdown, due to its expensive internal operations. In this paper, we take a significant step to address this shortcoming by proposing DASH (Distributed Accelerated SHampoo), a faster implementation of Distributed Shampoo based on two main new techniques: First, we show that preconditioner blocks can be stacked into 3D tensors to significantly improve GPU utilization; second, we introduce the Newton-DB iteration and Chebyshev polynomial approximations as novel and faster approaches for computing the inverse matrix roots required by Shampoo. Along with these algorithmic contributions, we provide a first in-depth analysis of how matrix scaling critically affects Shampoo convergence. On the practical side, our GPU-aware implementation achieves up to 4.83x faster optimizer steps compared to the well-optimized Distributed Shampoo, while Newton-DB attains the lowest validation perplexity per iteration among all tested methods. Our code is available at this https URL.
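The expensive primitive here is the inverse matrix root of each preconditioner block. As a point of reference, the classical Denman-Beavers coupled Newton iteration computes an inverse square root using only matrix multiplications and inverses, and it illustrates why the matrix scaling highlighted in the abstract matters: the iteration is typically run on a spectrally normalized input. This is a generic sketch, not the paper's Newton-DB variant (Shampoo in general needs inverse p-th roots).

```python
import numpy as np

def inverse_sqrt_db(A, iters=25):
    """Coupled Denman-Beavers iteration (sketch): Y_k -> A^{1/2}, Z_k -> A^{-1/2}.
    Scaling A to unit spectral norm first is critical for stable convergence."""
    s = np.linalg.norm(A, 2)                 # spectral norm for scaling
    Y, Z = A / s, np.eye(A.shape[0])
    for _ in range(iters):
        # simultaneous update: both right-hand sides use the old Y and Z
        Y, Z = 0.5 * (Y + np.linalg.inv(Z)), 0.5 * (Z + np.linalg.inv(Y))
    return Z / np.sqrt(s)                    # undo the scaling: approx A^{-1/2}

G = np.random.randn(64, 64)
A = G @ G.T + np.eye(64)                     # a damped Shampoo-style block
Zr = inverse_sqrt_db(A)
print(np.linalg.norm(Zr @ A @ Zr - np.eye(64)))  # small residual
```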
[LG-56] Robust Domain Generalization under Divergent Marginal and Conditional Distributions
链接: https://arxiv.org/abs/2602.02015
作者: Jewon Yeom,Kyubyung Chae,Hyunggyu Lim,Yoonna Oh,Dongyoon Yang,Taesup Kim
类目: Machine Learning (cs.LG)
*备注:
Abstract:Domain generalization (DG) aims to learn predictive models that can generalize to unseen domains. Most existing DG approaches focus on learning domain-invariant representations under the assumption of conditional distribution shift (i.e., primarily addressing changes in P(X|Y) while assuming P(Y) remains stable). However, real-world scenarios with multiple domains often involve compound distribution shifts where both the marginal label distribution P(Y) and the conditional distribution P(X|Y) vary simultaneously. To address this, we propose a unified framework for robust domain generalization under divergent marginal and conditional distributions. We derive a novel risk bound for unseen domains by explicitly decomposing the joint distribution into marginal and conditional components and characterizing risk gaps arising from both sources of divergence. To operationalize this bound, we design a meta-learning procedure that minimizes and validates the proposed risk bound across seen domains, ensuring strong generalization to unseen ones. Empirical evaluations demonstrate that our method achieves state-of-the-art performance not only on conventional DG benchmarks but also in challenging multi-domain long-tailed recognition settings where both marginal and conditional shifts are pronounced.
[LG-57] SNAP: A Self-Consistent Agreement Principle with Application to Robust Computation
链接: https://arxiv.org/abs/2602.02013
作者: Xiaoyi Jiang,Andreas Nienkötter
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We introduce SNAP (Self-coNsistent Agreement Principle), a self-supervised framework for robust computation based on mutual agreement. Based on an Agreement-Reliability Hypothesis, SNAP assigns weights that quantify agreement, emphasizing trustworthy items and downweighting outliers without supervision or prior knowledge. A key result is the Exponential Suppression of Outlier Weights, ensuring that outliers contribute negligibly to computations, even in high-dimensional settings. We study properties of the SNAP weighting scheme and show its practical benefits on vector averaging and subspace estimation. Particularly, we demonstrate that non-iterative SNAP outperforms the iterative Weiszfeld algorithm and two variants of multivariate median of means. SNAP thus provides a flexible, easy-to-use, broadly applicable approach to robust computation.
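The Exponential Suppression of Outlier Weights can be illustrated with one plausible instantiation of agreement-based weighting: score each item by its average agreement (negative squared distance) with the others and exponentiate. This is our reading of the abstract, not necessarily the exact SNAP weights.

```python
import numpy as np

def snap_style_weights(X, beta=1.0):
    """Agreement-based weighting (sketch, illustrative): items far from the
    consensus get exponentially suppressed weight, so outliers contribute
    negligibly to downstream computations such as averaging."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    agreement = -d2.mean(axis=1)                    # high = agrees with the rest
    w = np.exp(beta * (agreement - agreement.max()))
    return w / w.sum()

# robust vector averaging: 95 inliers + 5 gross outliers
X = np.vstack([np.random.randn(95, 3), 50 + np.random.randn(5, 3)])
w = snap_style_weights(X)
print(w[-5:].sum())               # ~0: outliers contribute negligibly
print((w[:, None] * X).sum(0))    # robust mean, close to the inlier centre
```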
[LG-58] Logic-Guided Vector Fields for Constrained Generative Modeling
链接: https://arxiv.org/abs/2602.02009
作者: Ali Baheri
类目: Machine Learning (cs.LG)
*备注:
Abstract:Neuro-symbolic systems aim to combine the expressive structure of symbolic logic with the flexibility of neural learning; yet, generative models typically lack mechanisms to enforce declarative constraints at generation time. We propose Logic-Guided Vector Fields (LGVF), a neuro-symbolic framework that injects symbolic knowledge, specified as differentiable relaxations of logical constraints, into flow matching generative models. LGVF couples two complementary mechanisms: (1) a training-time logic loss that penalizes constraint violations along continuous flow trajectories, with weights that emphasize correctness near the target distribution; and (2) an inference-time adjustment that steers sampling using constraint gradients, acting as a lightweight, logic-informed correction to the learned dynamics. We evaluate LGVF on three constrained generation case studies spanning linear, nonlinear, and multi-region feasibility constraints. Across all settings, LGVF reduces constraint violations by 59-82% compared to standard flow matching and achieves the lowest violation rates in each case. In the linear and ring settings, LGVF also improves distributional fidelity as measured by MMD, while in the multi-obstacle setting, we observe a satisfaction-fidelity trade-off, with improved feasibility but increased MMD. Beyond quantitative gains, LGVF yields constraint-aware vector fields exhibiting emergent obstacle-avoidance behavior, routing samples around forbidden regions without explicit path planning.
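Both mechanisms reduce to simple operations on a hinge-style relaxation of the constraint. The sketch below shows the inference-time variant for a single linear constraint a @ x <= b, with a zero velocity field standing in for the trained flow; the training-time logic loss would add the same penalty term to the flow-matching objective.

```python
import numpy as np

def logic_penalty(x, a, b):
    """Differentiable relaxation (sketch) of the constraint a @ x <= b:
    zero when satisfied, linear in the violation otherwise."""
    return np.maximum(a @ x - b, 0.0)

def guided_step(x, v, a, b, dt=0.01, guidance=5.0):
    """One inference-time integration step with logic guidance: follow the
    learned velocity v plus a correction along -grad of the penalty.
    v is a stand-in for the trained vector field."""
    violated = (a @ x - b) > 0
    grad = a if violated else np.zeros_like(a)  # d/dx of the hinge penalty
    return x + dt * (v - guidance * grad)

a, b = np.array([1.0, 1.0]), 0.0     # constraint: x1 + x2 <= 0
x = np.array([0.5, 0.5])             # currently violating
for _ in range(100):
    x = guided_step(x, v=np.zeros(2), a=a, b=b)
print(x, logic_penalty(x, a, b))     # steered into the feasible set
```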
[LG-59] Position: The Need for Ultrafast Training
链接: https://arxiv.org/abs/2602.02005
作者: Duc Hoang
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG); Systems and Control (eess.SY); High Energy Physics - Experiment (hep-ex); Quantum Physics (quant-ph)
*备注: Position paper at the 2nd Workshop on Domain-Specialized FPGAs (WDSFPGA 2026)
Abstract:Domain-specialized FPGAs have delivered unprecedented performance for low-latency inference across scientific and industrial workloads, yet nearly all existing accelerators assume static models trained offline, relegating learning and adaptation to slower CPUs or GPUs. This separation fundamentally limits systems that must operate in non-stationary, high-frequency environments, where model updates must occur at the timescale of the underlying physics. In this paper, I argue for a shift from inference-only accelerators to ultrafast on-chip learning, in which both inference and training execute directly within the FPGA fabric under deterministic, sub-microsecond latency constraints. Bringing learning into the same real-time datapath as inference would enable closed-loop systems that adapt as fast as the physical processes they control, with applications spanning quantum error correction, cryogenic qubit calibration, plasma and fusion control, accelerator tuning, and autonomous scientific experiments. Enabling such regimes requires rethinking algorithms, architectures, and toolflows jointly, but promises to transform FPGAs from static inference engines into real-time learning machines.
[LG-60] SpikingGamma: Surrogate-Gradient Free and Temporally Precise Online Training of Spiking Neural Networks with Smoothed Delays
链接: https://arxiv.org/abs/2602.01978
作者: Roel Koopman,Sebastian Otte,Sander Bohté
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:
Abstract:Neuromorphic hardware implementations of Spiking Neural Networks (SNNs) promise energy-efficient, low-latency AI through sparse, event-driven computation. Yet, training SNNs under fine temporal discretization remains a major challenge, hindering both low-latency responsiveness and the mapping of software-trained SNNs to efficient hardware. In current approaches, spiking neurons are modeled as self-recurrent units, embedded into recurrent networks to maintain state over time, and trained with BPTT or RTRL variants based on surrogate gradients. These methods scale poorly with temporal resolution, while online approximations often exhibit instability for long sequences and tend to fail at capturing temporal patterns precisely. To address these limitations, we develop spiking neurons with internal recursive memory structures that we combine with sigma-delta spike-coding. We show that this SpikingGamma model supports direct error backpropagation without surrogate gradients, can learn fine temporal patterns with minimal spiking in an online manner, and scales feedforward SNNs to complex tasks and benchmarks with competitive accuracy, all while being insensitive to the temporal resolution of the model. Our approach offers both an alternative to current recurrent SNNs trained with surrogate gradients, and a direct route for mapping SNNs to neuromorphic hardware.
[LG-61] Self-Consolidation for Self-Evolving Agents
链接: https://arxiv.org/abs/2602.01966
作者: Hongzhuo Yu,Fei Zhu,Guo-Sen Xie,Ling Shao
类目: Machine Learning (cs.LG)
*备注:
Abstract:While large language model (LLM) agents have demonstrated impressive problem-solving capabilities, they typically operate as static systems, lacking the ability to evolve through lifelong interaction. Existing attempts to bridge this gap primarily rely on retrieving successful past trajectories as demonstrations. However, this paradigm faces two critical limitations. First, by focusing solely on success, agents overlook the rich pedagogical value embedded in failed attempts, preventing them from identifying and avoiding recurrent pitfalls. Second, continually accumulating textual experiences not only increases retrieval time but also inevitably introduces noise and exhausts even the largest context windows of current LLMs. To address these challenges, we propose a novel self-evolving framework for LLM agents that introduces a complementary evolution mechanism: First, a contrastive reflection strategy is introduced to explicitly summarize error-prone patterns and capture reusable insights. Second, we propose a self-consolidation mechanism that distills non-parametric textual experience into compact learnable parameters. This enables the agent to internalize extensive historical experience directly into its latent space. Extensive experiments demonstrate the advantages of our method in long-term agent evolution.
[LG-62] Grounding Generated Videos in Feasible Plans via World Models
链接: https://arxiv.org/abs/2602.01960
作者: Christos Ziakas,Amir Bar,Alessandra Russo
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large-scale video generative models have shown emerging capabilities as zero-shot visual planners, yet video-generated plans often violate temporal consistency and physical constraints, leading to failures when mapped to executable actions. To address this, we propose Grounding Video Plans with World Models (GVP-WM), a planning method that grounds video-generated plans into feasible action sequences using a learned action-conditioned world model. At test-time, GVP-WM first generates a video plan from initial and goal observations, then projects the video guidance onto the manifold of dynamically feasible latent trajectories via video-guided latent collocation. In particular, we formulate grounding as a goal-conditioned latent-space trajectory optimization problem that jointly optimizes latent states and actions under world-model dynamics, while preserving semantic alignment with the video-generated plan. Empirically, GVP-WM recovers feasible long-horizon plans from zero-shot image-to-video-generated and motion-blurred videos that violate physical constraints, across navigation and manipulation simulation tasks.
[LG-63] Deep Multivariate Models with Parametric Conditionals
链接: https://arxiv.org/abs/2602.01953
作者: Dmitrij Schlesinger,Boris Flach,Alexander Shekhovtsov
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We consider deep multivariate models for heterogeneous collections of random variables. In the context of computer vision, such collections may e.g. consist of images, segmentations, image attributes, and latent variables. When developing such models, most existing works start from an application task and design the model components and their dependencies to meet the needs of the chosen task. This has the disadvantage of limiting the applicability of the resulting model for other downstream tasks. Here, instead, we propose to represent the joint probability distribution by means of conditional probability distributions for each group of variables conditioned on the rest. Such models can then be used for practically any possible downstream task. Their learning can be approached as training a parametrised Markov chain kernel by maximising the data likelihood of its limiting distribution. This has the additional advantage of allowing a wide range of semi-supervised learning scenarios.
[LG-64] Bayesian Integration of Nonlinear Incomplete Clinical Data
链接: https://arxiv.org/abs/2602.01924
作者: Lucía González-Zamorano,Nuria Balbás-Esteban,Vanessa Gómez-Verdejo,Albert Belenguer-Llorens,Carlos Sevilla-Salcedo
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:
Abstract:Multimodal clinical data are characterized by high dimensionality, heterogeneous representations, and structured missingness, posing significant challenges for predictive modeling, data integration, and interpretability. We propose BIONIC (Bayesian Integration of Nonlinear Incomplete Clinical data), a unified probabilistic framework that integrates heterogeneous multimodal data under missingness through a joint generative-discriminative latent architecture. BIONIC uses pretrained embeddings for complex modalities such as medical images and clinical text, while incorporating structured clinical variables directly within a Bayesian multimodal formulation. The proposed framework enables robust learning in partially observed and semi-supervised settings by explicitly modeling modality-level and variable-level missingness, as well as missing labels. We evaluate BIONIC on three multimodal clinical and biomedical datasets, demonstrating strong and consistent discriminative performance compared to representative multimodal baselines, particularly under incomplete data scenarios. Beyond predictive accuracy, BIONIC provides intrinsic interpretability through its latent structure, enabling population-level analysis of modality relevance and supporting clinically meaningful insight.
[LG-65] Embedding Learning on Multiplex Networks for Link Prediction
链接: https://arxiv.org/abs/2602.01922
作者: Orell Trautmann,Olaf Wolkenhauer(SU),Clémence Réda(IBENS)
类目: Machine Learning (cs.LG)
*备注:
Abstract:Over the past years, embedding learning on networks has shown tremendous results in link prediction tasks for complex systems, with a wide range of real-life applications. Learning a representation for each node in a knowledge graph allows us to capture topological and semantic information, which can be processed in downstream analyses later. In the link prediction task, high-dimensional network information is encoded into low-dimensional vectors, which are then fed to a predictor to infer new connections between nodes in the network. As the network complexity (that is, the numbers of connections and types of interactions) grows, embedding learning turns out increasingly challenging. This review covers published models on embedding learning on multiplex networks for link prediction. First, we propose refined taxonomies to classify and compare models, depending on the type of embeddings and embedding techniques. Second, we review and address the problem of reproducible and fair evaluation of embedding learning on multiplex networks for the link prediction task. Finally, we tackle evaluation on directed multiplex networks by proposing a novel and fair testing procedure. This review constitutes a crucial step towards the development of more performant and tractable embedding learning approaches for multiplex networks and their fair evaluation for the link prediction task. We also suggest guidelines on the evaluation of models, and provide an informed perspective on the challenges and tools currently available to address downstream analyses applied to multiplex networks.
[LG-66] Towards Long-Horizon Interpretability: Efficient and Faithful Multi-Token Attribution for Reasoning LLMs ICML2025
链接: https://arxiv.org/abs/2602.01914
作者: Wenbo Pan,Zhichao Liu,Xianlong Wang,Haining Yu,Xiaohua Jia
类目: Machine Learning (cs.LG)
*备注: ICML 2025 submission
Abstract:Token attribution methods provide intuitive explanations for language model outputs by identifying causally important input tokens. However, as modern LLMs increasingly rely on extended reasoning chains, existing schemes face two critical challenges: (1) efficiency bottleneck, where attributing a target span of M tokens within a context of length N requires O(M*N) operations, making long-context attribution prohibitively slow; and (2) faithfulness drop, where intermediate reasoning tokens absorb attribution mass, preventing importance from propagating back to the original input. To address these, we introduce FlashTrace, an efficient multi-token attribution method that employs span-wise aggregation to compute attribution over multi-token targets in a single pass, while maintaining faithfulness. Moreover, we design a recursive attribution mechanism that traces importance through intermediate reasoning chains back to source inputs. Extensive experiments on long-context retrieval (RULER) and multi-step reasoning (MATH, MorehopQA) tasks demonstrate that FlashTrace achieves over 130x speedup over existing baselines while maintaining superior faithfulness. We further analyze the dynamics of recursive attribution, showing that even a single recursive hop improves faithfulness by tracing importance through the reasoning chain.
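The single-pass idea can be illustrated with the generic gradient-times-input recipe: sum the log-likelihood of the whole M-token target span, call backward once, and read per-token scores off the input-embedding gradients, turning O(M*N) perturbation passes into one. This sketch uses the Hugging Face interface and omits FlashTrace's span-wise aggregation details and recursive attribution hop.

```python
import torch

def span_attribution(model, input_ids, span_start, span_end):
    """Single-pass multi-token attribution (sketch): aggregate the
    log-likelihood of the target span, backpropagate once, and score each
    input token by gradient-times-input on its embedding."""
    embed = model.get_input_embeddings()
    inputs_embeds = embed(input_ids).detach().requires_grad_(True)
    logits = model(inputs_embeds=inputs_embeds).logits
    # position i predicts token i+1, so shift the logit slice by one
    log_probs = torch.log_softmax(logits[0, span_start - 1:span_end - 1], dim=-1)
    targets = input_ids[0, span_start:span_end]
    span_ll = log_probs[torch.arange(len(targets)), targets].sum()
    span_ll.backward()                           # one backward pass for M tokens
    return (inputs_embeds.grad[0] * inputs_embeds[0].detach()).sum(-1).abs()

# usage sketch with a small causal LM (checkpoint choice is illustrative):
# from transformers import AutoModelForCausalLM, AutoTokenizer
# tok = AutoTokenizer.from_pretrained("gpt2")
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# ids = tok("The capital of France is Paris", return_tensors="pt").input_ids
# scores = span_attribution(model, ids, span_start=5, span_end=7)
```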
[LG-67] Data- and Variance-dependent Regret Bounds for Online Tabular MDPs
链接: https://arxiv.org/abs/2602.01903
作者: Mingyi Li,Taira Tsuchiya,Kenji Yamanishi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 80 pages, 4 tables
Abstract:This work studies online episodic tabular Markov decision processes (MDPs) with known transitions and develops best-of-both-worlds algorithms that achieve refined data-dependent regret bounds in the adversarial regime and variance-dependent regret bounds in the stochastic regime. We quantify MDP complexity using a first-order quantity and several new data-dependent measures for the adversarial regime, including a second-order quantity and a path-length measure, as well as variance-based measures for the stochastic regime. To adapt to these measures, we develop algorithms based on global optimization and policy optimization, both built on optimistic follow-the-regularized-leader with log-barrier regularization. For global optimization, our algorithms achieve first-order, second-order, and path-length regret bounds in the adversarial regime, and in the stochastic regime, they achieve a variance-aware gap-independent bound and a variance-aware gap-dependent bound that is polylogarithmic in the number of episodes. For policy optimization, our algorithms achieve the same data- and variance-dependent adaptivity, up to a factor of the episode horizon, by exploiting a new optimistic Q-function estimator. Finally, we establish regret lower bounds in terms of data-dependent complexity measures for the adversarial regime and a variance measure for the stochastic regime, implying that the regret upper bounds achieved by the global-optimization approach are nearly optimal.
[LG-68] Observation-dependent Bayesian active learning via input-warped Gaussian processes
链接: https://arxiv.org/abs/2602.01898
作者: Sanna Jarl,Maria Bånkestad,Jonathan J. S. Scragg,Jens Sjölund
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 13 pages
Abstract:Bayesian active learning relies on the precise quantification of predictive uncertainty to explore unknown function landscapes. While Gaussian process surrogates are the standard for such tasks, an underappreciated fact is that their posterior variance depends on the observed outputs only through the hyperparameters, rendering exploration largely insensitive to the actual measurements. We propose to inject observation-dependent feedback by warping the input space with a learned, monotone reparameterization. This mechanism allows the design policy to expand or compress regions of the input space in response to observed variability, thereby shaping the behavior of variance-based acquisition functions. We demonstrate that while such warps can be trained via marginal likelihood, a novel self-supervised objective yields substantially better performance. Our approach improves sample efficiency across a range of active learning benchmarks, particularly in regimes where non-stationarity challenges traditional methods.
[LG-69] Internal Flow Signatures for Self-Checking and Refinement in LLM s
链接: https://arxiv.org/abs/2602.01897
作者: Sungheon Jeong,Sanggeon Yun,Ryozo Masukawa,Wenjun Haung,Hanning Chen,Mohsen Imani
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models can generate fluent answers that are unfaithful to the provided context, while many safeguards rely on external verification or a separate judge after generation. We introduce internal flow signatures that audit decision formation from depthwise dynamics at a fixed inter-block monitoring boundary. The method stabilizes token-wise motion via bias-centered monitoring, then summarizes trajectories in compact moving readout-aligned subspaces constructed from the top token and its close competitors within each depth window. Neighboring window frames are aligned by an orthogonal transport, yielding depth-comparable transported step lengths, turning angles, and subspace drift summaries that are invariant to within-window basis choices. A lightweight GRU validator trained on these signatures performs self-checking without modifying the base model. Beyond detection, the validator localizes a culprit depth event and enables a targeted refinement: the model rolls back to the culprit token and clamps an abnormal transported step at the identified block while preserving the orthogonal residual. The resulting pipeline provides actionable localization and low-overhead self-checking from internal decision dynamics. Code is available at this http URL.
[LG-70] Autocorrelated Optimize-via-Estimate: Predict-then-Optimize versus Finite-sample Optimal
链接: https://arxiv.org/abs/2602.01877
作者: Zichun Wang,Gar Goei Loke,Ruiting Zuo
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Models that directly optimize for out-of-sample performance in the finite-sample regime have emerged as a promising alternative to traditional estimate-then-optimize approaches in data-driven optimization. In this work, we compare their performance in the context of autocorrelated uncertainties, specifically, under a Vector Autoregressive Moving Average VARMA(p,q) process. We propose an autocorrelated Optimize-via-Estimate (A-OVE) model that obtains an out-of-sample optimal solution as a function of sufficient statistics, and propose a recursive form for computing its sufficient statistics. We evaluate these models on a portfolio optimization problem with trading costs. A-OVE achieves low regret relative to a perfect information oracle, outperforming predict-then-optimize machine learning benchmarks. Notably, machine learning models with higher accuracy can have poorer decision quality, echoing the growing literature in data-driven optimization. Performance is retained under small mis-specification.
[LG-71] Grappa: Gradient-Only Communication for Scalable Graph Neural Network Training
链接: https://arxiv.org/abs/2602.01872
作者: Chongyang Xu,Christoph Siebenbrunner,Laurent Bindschaedler
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:Cross-partition edges dominate the cost of distributed GNN training: fetching remote features and activations per iteration overwhelms the network as graphs deepen and partition counts grow. Grappa is a distributed GNN training framework that enforces gradient-only communication: during each iteration, partitions train in isolation and exchange only gradients for the global update. To recover accuracy lost to isolation, Grappa (i) periodically repartitions to expose new neighborhoods and (ii) applies a lightweight coverage-corrected gradient aggregation inspired by importance sampling. We prove the corrected estimator is asymptotically unbiased under standard support and boundedness assumptions, and we derive a batch-level variant for compatibility with common deep-learning packages that minimizes mean-squared deviation from the ideal node-level correction. We also introduce a shrinkage version that improves stability in practice. Empirical results on real and synthetic graphs show that Grappa trains GNNs 4 times faster on average (up to 13 times) than state-of-the-art systems, achieves better accuracy especially for deeper models, and sustains training at the trillion-edge scale on commodity hardware. Grappa is model-agnostic, supports full-graph and mini-batch training, and does not rely on high-bandwidth interconnects or caching.
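The coverage correction can be pictured as self-normalized importance sampling over nodes: gradient contributions are reweighted by inverse coverage probability so that rarely-exposed neighborhoods are not underrepresented in the global update. A toy sketch follows (our reading of the abstract; the paper's batch-level and shrinkage variants differ):

```python
import numpy as np

def coverage_corrected_aggregate(grads, coverage):
    """Sketch of coverage-corrected aggregation: partitions train in
    isolation, and each node's gradient contribution is reweighted by the
    inverse of its coverage probability (how often repartitioning exposes
    its true neighborhood), in the spirit of importance sampling."""
    grads = np.asarray(grads)                    # (num_nodes, dim) local grads
    w = 1.0 / np.clip(coverage, 1e-3, None)      # inverse inclusion probability
    return (w[:, None] * grads).sum(0) / w.sum() # self-normalized weighted mean

grads = np.random.randn(1000, 10)
coverage = np.random.uniform(0.2, 1.0, size=1000)  # estimated per-node coverage
print(coverage_corrected_aggregate(grads, coverage).shape)  # (10,) global update
```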
[LG-72] Designing Time Series Experiments in A/B Testing with Transformer Reinforcement Learning
链接: https://arxiv.org/abs/2602.01853
作者: Xiangkun Wu,Qianglin Wen,Yingying Zhang,Hongtu Zhu,Ting Li,Chengchun Shi
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:
Abstract:A/B testing has become a gold standard for modern technological companies to conduct policy evaluation. Yet, its application to time series experiments, where policies are sequentially assigned over time, remains challenging. Existing designs suffer from two limitations: (i) they do not fully leverage the entire history for treatment allocation; (ii) they rely on strong assumptions to approximate the objective function (e.g., the mean squared error of the estimated treatment effect) for optimizing the design. We first establish an impossibility theorem showing that failure to condition on the full history leads to suboptimal designs, due to the dynamic dependencies in time series experiments. To address both limitations simultaneously, we next propose a transformer reinforcement learning (RL) approach which leverages transformers to condition allocation on the entire history and employs RL to directly optimize the MSE without relying on restrictive assumptions. Empirical evaluations on synthetic data, a publicly available dispatch simulator, and a real-world ridesharing dataset demonstrate that our proposal consistently outperforms existing designs.
[LG-73] FUPareto: Bridging the Forgetting-Utility Gap in Federated Unlearning via Pareto Augmented Optimization
链接: https://arxiv.org/abs/2602.01852
作者: Zeyan Wang,Zhengmao Liu,Yongxin Cai,Chi Li,Xiaoying Tang,Jingchao Chen,Zibin Pan,Jing Qiu
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:Federated Unlearning (FU) aims to efficiently remove the influence of specific client data from a federated model while preserving utility for the remaining clients. However, three key challenges remain: (1) existing unlearning objectives often compromise model utility or increase vulnerability to Membership Inference Attacks (MIA); (2) there is a persistent conflict between forgetting and utility, where further unlearning inevitably harms retained performance; and (3) support for concurrent multi-client unlearning is poor, as gradient conflicts among clients degrade the quality of forgetting. To address these issues, we propose FUPareto, an efficient unlearning framework via Pareto-augmented optimization. We first introduce the Minimum Boundary Shift (MBS) Loss, which enforces unlearning by suppressing the target class logit below the highest non-target class logit; this can improve the unlearning efficiency and mitigate MIA risks. During the unlearning process, FUPareto performs Pareto improvement steps to preserve model utility and executes Pareto expansion to guarantee forgetting. Specifically, during Pareto expansion, the framework integrates a Null-Space Projected Multiple Gradient Descent Algorithm (MGDA) to decouple gradient conflicts. This enables effective, fair, and concurrent unlearning for multiple clients while minimizing utility degradation. Extensive experiments across diverse scenarios demonstrate that FUPareto consistently outperforms state-of-the-art FU methods in both unlearning efficacy and retained utility.
[LG-74] Self-Rewarding Sequential Monte Carlo for Masked Diffusion Language Models
链接: https://arxiv.org/abs/2602.01849
作者: Ziwei Luo,Ziqi Jin,Lei Wang,Lidong Bing,Thomas B. Schön
类目: Machine Learning (cs.LG)
*备注: Project page: this https URL
Abstract:This work presents self-rewarding sequential Monte Carlo (SMC), an inference-time scaling algorithm enabling effective sampling of masked diffusion language models (MDLMs). Our algorithm stems from the observation that most existing MDLMs rely on a confidence-based sampling strategy, where only tokens with the highest prediction confidence are preserved at each step. This restricts the generation to a noise-sensitive, greedy decoding paradigm, resulting in an inevitable collapse in the diversity of possible paths. We address this problem by launching multiple interacting diffusion processes in parallel, referred to as particles, for trajectory exploration. Importantly, we introduce the trajectory-level confidence as a self-rewarding signal for assigning particle importance weights. During sampling, particles are iteratively weighted and resampled to systematically steer generation towards globally confident, high-quality samples. Our self-rewarding SMC is verified on various masked diffusion language models and benchmarks, achieving significant improvement without extra training or reward guidance, while effectively converting parallel inference capacity into improved sampling quality. Our code is available at this https URL.
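At its core the algorithm is a reweight/resample loop over parallel denoising trajectories, with trajectory-level confidence serving as the self-reward. The sketch below shows one such step with standard systematic resampling; the interleaving with masked-diffusion decoding is omitted, and the use of mean log-probability as the confidence signal is an illustrative assumption.

```python
import numpy as np

def systematic_resample(weights, rng):
    """Systematic resampling: low-variance selection of particle indices
    proportional to their importance weights."""
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n
    idx = np.searchsorted(np.cumsum(weights), positions)
    return np.minimum(idx, n - 1)     # guard against floating-point overshoot

def smc_step(particle_confidences, rng):
    """One SMC reweight/resample step (sketch): trajectory-level confidence
    (e.g. mean log-probability of committed tokens) acts as the self-reward,
    steering generation toward globally confident denoising trajectories."""
    w = np.exp(particle_confidences - particle_confidences.max())
    w /= w.sum()
    return systematic_resample(w, rng)

rng = np.random.default_rng(0)
conf = np.array([-4.0, -1.0, -0.9, -6.0])  # 4 parallel diffusion trajectories
print(smc_step(conf, rng))                 # high-confidence particles duplicated
```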
[LG-75] No Generation without Representation: Efficient Causal Protein Language Models Enable Zero-Shot Fitness Estimation
链接: https://arxiv.org/abs/2602.01845
作者: Furkan Eris
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Protein language models (PLMs) face a fundamental divide: masked language models (MLMs) excel at fitness prediction while causal models enable generation, forcing practitioners to maintain separate architectures. We introduce Proust, a 309M-parameter causal PLM that bridges this gap through architectural innovations adapted from recent LLM research, including grouped-query attention with shared K/V projections, cross-layer value residuals, and depthwise causal convolutions. Trained on 33B tokens in 40 B200 GPU-hours, Proust achieves Spearman ρ = 0.390 on ProteinGym substitutions, competitive with MLMs requiring 50–200× the compute. On indels, Proust sets a new state-of-the-art, outperforming models up to 20× larger. On EVEREST viral fitness benchmarks, it approaches structure-aware methods using sequence alone. These powerful representations position Proust in a sweet spot, as it also retains native generative capabilities that MLMs lack by design. Interpretability analysis reveals that per-position entropy variance predicts, to an extent, when retrieval augmentation helps and hurts. Such insights can grow in both quantity and quality at scale and inform capabilities such as test-time scaling. Code and weights are available at this https URL
[LG-76] Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models
链接: https://arxiv.org/abs/2602.01842
作者: Jinbin Bai,Yixuan Li,Yuchen Zhu,Yi Xin,Qingyu Shi,Aosong Feng,Xiaohong Liu,Molei Tao,Jianru Xue,Xiangtai Li,Ming-Hsuan Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Inference-time compute has re-emerged as a practical way to improve LLM reasoning. Most test-time scaling (TTS) algorithms rely on autoregressive decoding, which is ill-suited to discrete diffusion language models (dLLMs) due to their parallel decoding over the entire sequence. As a result, developing effective and efficient TTS methods to unlock dLLMs’ full generative potential remains an underexplored challenge. To address this, we propose Prism (Pruning, Remasking, and Integrated Self-verification Method), an efficient TTS framework for dLLMs that (i) performs Hierarchical Trajectory Search (HTS) which dynamically prunes and reallocates compute in an early-to-mid denoising window, (ii) introduces Local branching with partial remasking to explore diverse implementations while preserving high-confidence tokens, and (iii) replaces external verifiers with Self-Verified Feedback (SVF) obtained via self-evaluation prompts on intermediate completions. Across four mathematical reasoning and code generation benchmarks on three dLLMs, including LLaDA 8B Instruct, Dream 7B Instruct, and LLaDA 2.0-mini, our Prism achieves a favorable performance-efficiency trade-off, matching best-of-N performance with substantially fewer function evaluations (NFE). The code is released at this https URL.
[LG-77] Hyperbolic Graph Neural Networks Under the Microscope: The Role of Geometry-Task Alignment
链接: https://arxiv.org/abs/2602.01828
作者: Dionisia Naddeo,Jonas Linkerhägner,Nicola Toschi,Geri Skenderi,Veronica Lachi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Many complex networks exhibit hyperbolic structural properties, making hyperbolic space a natural candidate for representing hierarchical and tree-like graphs with low distortion. Based on this observation, Hyperbolic Graph Neural Networks (HGNNs) have been widely adopted as a principled choice for representation learning on tree-like graphs. In this work, we question this paradigm by proposing an additional condition of geometry-task alignment, i.e., whether the metric structure of the target follows that of the input graph. We theoretically and empirically demonstrate the capability of HGNNs to recover low-distortion representations on two synthetic regression problems, and show that their geometric inductive bias becomes helpful when the problem requires preserving metric structure. Additionally, we evaluate HGNNs on the tasks of link prediction and node classification by jointly analyzing predictive performance and embedding distortion, revealing that only link prediction is geometry-aligned. Overall, our findings shift the focus from only asking “Is the graph hyperbolic?” to also questioning “Is the task aligned with hyperbolic geometry?”, showing that HGNNs consistently outperform Euclidean models under such alignment, while their advantage vanishes otherwise.
[LG-78] Grad2Reward: From Sparse Judgment to Dense Rewards for Improving Open-Ended LLM Reasoning
链接: https://arxiv.org/abs/2602.01791
作者: Zheng Zhang,Ao Lu,Yuanhao Zeng,Ziwei Shan,Jinjin Guo,Lufei Li,Yexin Li,Kan Ren
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant breakthroughs in complex LLM reasoning within verifiable domains, such as mathematics and programming. Recent efforts have sought to extend this paradigm to open-ended tasks by employing LLMs-as-a-Judge to provide sequence-level rewards for policy optimization. However, these rewards are inherently sparse, failing to provide the fine-grained supervision necessary for generating complex, long-form trajectories. Furthermore, current work treats the Judge as a black-box oracle, discarding the rich intermediate feedback signals encoded in it. To address these limitations, we introduce Grad2Reward, a novel framework that extracts dense process rewards directly from the Judge’s model inference process via a single backward pass. By leveraging gradient-based attribution, Grad2Reward enables precise token-level credit assignment, substantially enhancing training efficiency and reasoning quality. Additionally, Grad2Reward introduces a self-judging mechanism, allowing the policy to improve through its own evaluative signals without training specialized reward models or reliance on superior external Judges. The experiments demonstrate that policies optimized with Grad2Reward achieve outstanding performance across diverse open-ended tasks, affirming its effectiveness and broad generalizability.
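As a rough illustration of how one backward pass can yield token-level credit, the sketch below applies the common gradient-times-input attribution rule to the judge's input embeddings. The judge_score helper is hypothetical (it stands in for whatever forward pass produces the scalar verdict), and the paper's actual attribution rule may differ:

```python
import torch

def dense_token_rewards(judge, input_ids, response_mask, judge_score):
    emb = judge.get_input_embeddings()(input_ids)   # (B, T, D) token embeddings
    emb.retain_grad()                               # keep grads on a non-leaf tensor
    score = judge_score(judge, emb)                 # scalar sequence-level judgment
    score.backward()                                # the single backward pass
    saliency = (emb.grad * emb).sum(-1).abs()       # gradient-times-input attribution
    return saliency * response_mask                 # credit only response tokens
```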
[LG-79] Position: Beyond Model-Centric Prediction – Agentic Time Series Forecasting
链接: https://arxiv.org/abs/2602.01776
作者: Mingyue Cheng,Xiaoyu Tao,Qi Liu,Ze Guo,Enhong Chen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time series forecasting has traditionally been formulated as a model-centric, static, and single-pass prediction problem that maps historical observations to future values. While this paradigm has driven substantial progress, it proves insufficient in adaptive and multi-turn settings where forecasting requires informative feature extraction, reasoning-driven inference, iterative refinement, and continual adaptation over time. In this paper, we argue for agentic time series forecasting (ATSF), which reframes forecasting as an agentic process composed of perception, planning, action, reflection, and memory. Rather than focusing solely on predictive models, ATSF emphasizes organizing forecasting as an agentic workflow that can interact with tools, incorporate feedback from outcomes, and evolve through experience accumulation. We outline three representative implementation paradigms – workflow-based design, agentic reinforcement learning, and a hybrid agentic workflow paradigm – and discuss the opportunities and challenges that arise when shifting from model-centric prediction to agentic forecasting. Together, this position aims to establish agentic forecasting as a foundation for future research at the intersection of time series forecasting.
[LG-80] Cost-Aware Bayesian Optimization for Prototyping Interactive Devices
链接: https://arxiv.org/abs/2602.01774
作者: Thomas Langerak,Renate Zhang,Ziyuan Wang,Per Ola Kristensson,Antti Oulasvirta
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:
Abstract:Deciding which idea is worth prototyping is a central concern in iterative design. A prototype should be produced when the expected improvement is high and the cost is low. However, this is hard to decide, because costs can vary drastically: a simple parameter tweak may take seconds, while fabricating hardware consumes material and energy. Such asymmetries can discourage a designer from exploring the design space. In this paper, we present an extension of cost-aware Bayesian optimization to account for diverse prototyping costs. The method builds on the power of Bayesian optimization and requires only a minimal modification to the acquisition function. The key idea is to use designer-estimated costs to guide sampling toward more cost-effective prototypes. In technical evaluations, the method achieved comparable utility to a cost-agnostic baseline while requiring only \approx70% of the cost; under strict budgets, it outperformed the baseline threefold. A within-subjects study with 12 participants in a realistic joystick design task demonstrated similar benefits. These results show that accounting for prototyping costs can make Bayesian optimization more compatible with real-world design projects.
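A common way to realize the "minimal modification" mentioned above is to divide an improvement-based acquisition by the estimated cost, so cheap, promising prototypes win over marginally better but expensive ones. A sketch under that assumption (the paper's exact weighting may differ):

```python
import numpy as np
from scipy.stats import norm

def cost_aware_ei(mu, sigma, best, cost, xi=0.01):
    """mu, sigma: GP posterior mean/std at candidate designs; cost:
    designer-estimated prototyping cost per candidate (seconds, material, ...)."""
    imp = mu - best - xi
    z = imp / np.maximum(sigma, 1e-12)
    ei = imp * norm.cdf(z) + sigma * norm.pdf(z)   # expected improvement
    return ei / np.maximum(cost, 1e-12)            # improvement per unit cost
```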
[LG-81] MGKAN: Predicting Asymmetric Drug-Drug Interactions via a Multimodal Graph Kolmogorov-Arnold Network ICASSP2026
链接: https://arxiv.org/abs/2602.01751
作者: Kunyi Fan,Mengjie Chen,Longlong Li,Cunquan Qu
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: Submitted to ICASSP 2026
Abstract:Predicting drug-drug interactions (DDIs) is essential for safe pharmacological treatments. Previous graph neural network (GNN) models leverage molecular structures and interaction networks but mostly rely on linear aggregation and symmetric assumptions, limiting their ability to capture nonlinear and heterogeneous patterns. We propose MGKAN, a Graph Kolmogorov-Arnold Network that introduces learnable basis functions into asymmetric DDI prediction. MGKAN replaces conventional MLP transformations with KAN-driven basis functions, enabling more expressive and nonlinear modeling of drug relationships. To capture pharmacological dependencies, MGKAN integrates three network views (an asymmetric DDI network, a co-interaction network, and a biochemical similarity network) with role-specific embeddings to preserve directional semantics. A fusion module combines linear attention and nonlinear transformation to enhance representational capacity. On two benchmark datasets, MGKAN outperforms seven state-of-the-art baselines. Ablation studies and case studies confirm its predictive accuracy and effectiveness in modeling directional drug effects.
[LG-82] Position: The Inevitable End of One-Architecture-Fits-All-Domains in Time Series Forecasting
链接: https://arxiv.org/abs/2602.01736
作者: Qinwei Ma,Jingzhe Shi,Jiahao Qiu,Zaiwen Yang
类目: Machine Learning (cs.LG)
*备注: 14 pages, 3 figures, 2 tables
Abstract:Recent work has questioned the effectiveness and robustness of neural network architectures for time series forecasting tasks. We summarize these concerns and analyze their inherent limitations in depth: namely, the irreconcilable conflict between achieving state-of-the-art results on a single domain (or a few similar domains) and generalizing across general domains in time series forecasting architecture design. Moreover, neural network architectures for general-domain time series forecasting have grown increasingly complicated, and their performance has nearly saturated in recent years. As a result, architectures developed to fit general time series domains offer little inspiration for real-world practice in specific domains such as finance, weather, or traffic: each domain develops its own methods, which rarely draw on the architectural advances of the time series community from the past 2-3 years. We therefore call on the time series community to shift focus away from research on neural network architectures for general domains: this line of research has become saturated and has drifted away from domain-specific state of the art over time. We should either (1) focus on deep learning methods for certain specific domain(s), or (2) turn to the development of meta-learning methods for general domains.
[LG-83] MSign: An Optimizer Preventing Training Instability in Large Language Models via Stable Rank Restoration
链接: https://arxiv.org/abs/2602.01734
作者: Lianhai Ren,Yucheng Ding,Xiao Liu,Qianxiao Li,Peng Cheng,Yeyun Gong
类目: Machine Learning (cs.LG)
*备注:
Abstract:Training instability remains a critical challenge in large language model (LLM) pretraining, often manifesting as sudden gradient explosions that waste significant computational resources. We study training failures in a 5M-parameter NanoGPT model scaled via \mu P, identifying two key phenomena preceding collapse: (1) rapid decline in weight matrix stable rank (ratio of squared Frobenius norm to squared spectral norm), and (2) increasing alignment between adjacent layer Jacobians. We prove theoretically that these two conditions jointly cause exponential gradient norm growth with network depth. To break this instability mechanism, we propose MSign, a new optimizer that periodically applies matrix sign operations to restore stable rank. Experiments on models from 5M to 3B parameters demonstrate that MSign effectively prevents training failures with a computational overhead of less than 7.0%.
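Both quantities in this abstract are easy to compute directly. The sketch below shows the stable rank and one standard realization of a matrix sign step, a Newton-Schulz polar iteration that drives all singular values toward 1 and hence restores stable rank; the iteration count and where MSign applies it are assumptions here:

```python
import torch

def stable_rank(W):
    # ||W||_F^2 / ||W||_2^2; equals min(W.shape) when all singular values match.
    return (W.norm() ** 2 / torch.linalg.matrix_norm(W, ord=2) ** 2).item()

def matrix_sign(W, steps=5):
    X = W / W.norm()                      # scale so singular values lie in (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X   # Newton-Schulz step toward the polar factor
    return X
```

For a full-rank W, stable_rank(matrix_sign(W)) approaches min(W.shape), which is the restoration effect the abstract describes.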
[LG-84] Cross-Domain Fake News Detection on Unseen Domains via LLM-Based Domain-Aware User Modeling WWW2026
链接: https://arxiv.org/abs/2602.01726
作者: Xuankai Yang,Yan Wang,Jiajie Zhu,Pengfei Ding,Hongyang Liu,Xiuzhen Zhang,Huan Liu
类目: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注: This paper has been accepted by The 2026 ACM Web Conference (WWW 2026)
Abstract:Cross-domain fake news detection (CD-FND) transfers knowledge from a source domain to a target domain and is crucial for real-world fake news mitigation. This task becomes particularly important yet more challenging when the target domain is previously unseen (e.g., the COVID-19 outbreak or the Russia-Ukraine war). However, existing CD-FND methods overlook such scenarios and consequently suffer from the following two key limitations: (1) insufficient modeling of high-level semantics in news and user engagements; and (2) scarcity of labeled data in unseen domains. Targeting these limitations, we find that large language models (LLMs) offer strong potential for CD-FND on unseen domains, yet their effective use remains non-trivial. In particular, two key challenges arise: (1) how to capture high-level semantics from both news content and user engagements using LLMs; and (2) how to make LLM-generated features more reliable and transferable for CD-FND on unseen domains. To tackle these challenges, we propose DAUD, a novel LLM-Based Domain-Aware framework for fake news detection on Unseen Domains. DAUD employs LLMs to extract high-level semantics from news content. It models users’ single- and cross-domain engagements to generate domain-aware behavioral representations. In addition, DAUD captures the relations between original data-driven features and LLM-derived features of news, users, and user engagements. This allows it to extract more reliable domain-shared representations that improve knowledge transfer to unseen domains. Extensive experiments on real-world datasets demonstrate that DAUD outperforms state-of-the-art baselines in both general and unseen-domain CD-FND settings.
[LG-85] Revisiting Generalization Measures Beyond IID: An Empirical Study under Distributional Shift
链接: https://arxiv.org/abs/2602.01718
作者: Sora Nakai,Youssef Fadhloun,Kacem Mathlouthi,Kotaro Yoshida,Ganesh Talluri,Ioannis Mitliagkas,Hiroki Naganuma
类目: Machine Learning (cs.LG)
*备注:
Abstract:Generalization remains a central yet unresolved challenge in deep learning, particularly the ability to predict a model’s performance beyond its training distribution using quantities available prior to test-time evaluation. Building on the large-scale study of Jiang et al. (2020) and concerns by Dziugaite et al. (2020) about instability across training configurations, we benchmark the robustness of generalization measures beyond the IID regime. We train small-to-medium models over 10,000 hyperparameter configurations and evaluate more than 40 measures computable from the trained model and the available training data alone. We significantly broaden the experimental scope along multiple axes: (i) extending the evaluation beyond the standard IID setting to include benchmarking for robustness across diverse distribution shifts, (ii) evaluating multiple architectures and training recipes, and (iii) newly incorporating calibration- and information-criteria-based measures to assess their alignment with both IID and OOD generalization. We find that distribution shifts can substantially alter the predictive performance of many generalization measures, while a smaller subset remains comparatively stable across settings.
[LG-86] Finite and Corruption-Robust Regret Bounds in Online Inverse Linear Optimization under M-Convex Action Sets
链接: https://arxiv.org/abs/2602.01682
作者: Taihei Oki,Shinsaku Sakaue
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注:
Abstract:We study online inverse linear optimization, also known as contextual recommendation, where a learner sequentially infers an agent’s hidden objective vector from observed optimal actions over feasible sets that change over time. The learner aims to recommend actions that perform well under the agent’s true objective, and the performance is measured by the regret, defined as the cumulative gap between the agent’s optimal values and those achieved by the learner’s recommended actions. Prior work has established a regret bound of O(d\log T) , as well as a finite but exponentially large bound of \exp(O(d\log d)) , where d is the dimension of the optimization problem and T is the time horizon, while a regret lower bound of \Omega(d) is known (Gollapudi et al. 2021; Sakaue et al. 2025). Whether a finite regret bound polynomial in d is achievable or not has remained an open question. We partially resolve this by showing that when the feasible sets are M-convex – a broad class that includes matroids – a finite regret bound of O(d\log d) is possible. We achieve this by combining a structural characterization of optimal solutions on M-convex sets with a geometric volume argument. Moreover, we extend our approach to adversarially corrupted feedback in up to C rounds. We obtain a regret bound of O((C+1)d\log d) without prior knowledge of C , by monitoring directed graphs induced by the observed feedback to detect corruptions adaptively.
[LG-87] Quantifying Epistemic Predictive Uncertainty in Conformal Prediction
链接: https://arxiv.org/abs/2602.01667
作者: Siu Lun Chau,Soroush H. Zargarbashi,Yusuf Sale,Michele Caprio
类目: Machine Learning (cs.LG)
*备注: 42 pages
Abstract:We study the problem of quantifying epistemic predictive uncertainty (EPU) – that is, uncertainty faced at prediction time due to the existence of multiple plausible predictive models – within the framework of conformal prediction (CP). To expose the implicit model multiplicity underlying CP, we build on recent results showing that, under a mild assumption, any full CP procedure induces a set of closed and convex predictive distributions, commonly referred to as a credal set. Importantly, the conformal prediction region (CPR) coincides exactly with the set of labels to which all distributions in the induced credal set assign probability at least 1-\alpha . As our first contribution, we prove that this characterisation also holds in split CP. Building on this connection, we then propose a computationally efficient and analytically tractable uncertainty measure, based on Maximum Mean Imprecision, to quantify the EPU by measuring the degree of conflicting information within the induced credal set. Experiments on active learning and selective classification demonstrate that the quantified EPU provides substantially more informative and fine-grained uncertainty assessments than reliance on CPR size alone. More broadly, this work highlights the potential of CP serving as a principled basis for decision-making under epistemic uncertainty.
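For context on the object being characterized, a minimal split-CP region for classification looks as follows; the nonconformity score used (one minus the true-class probability) is one common choice, not necessarily the paper's:

```python
import numpy as np

def split_cpr(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Returns, for each test point, the set of labels kept at level 1 - alpha."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]       # calibration scores
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")          # conformal threshold
    return [np.where(1.0 - p <= q)[0] for p in test_probs]   # one CPR per test point
```

Under the credal-set view above, each such label set collects exactly the labels to which every distribution in the induced credal set assigns probability at least 1-\alpha .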
[LG-88] Chance-Constrained Inference for Hallucination Risk Control in Large Language Models
链接: https://arxiv.org/abs/2602.01637
作者: Sreenivasan Mohandas
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models generate outputs stochastically and may produce fluent but invalid responses, including factual hallucinations. Existing mitigation strategies reduce average error rates but do not provide explicit control over the frequency of such failures under repeated use. We formulate inference as a deployment-time risk control problem and introduce chance-constrained inference, which directly bounds the probability of hallucinations among accepted generations. Hallucinations are modeled as stochastic constraint violations, and we show that confidence-based selective prediction does not, in general, imply probabilistic risk guarantees. To enforce chance constraints efficiently, we propose a sequential, anytime-valid inference procedure that adaptively certifies feasibility or infeasibility using finite samples, avoiding conservative fixed-sample bounds. Experiments on questions inspired by NaturalQuestions and controlled multi-hop question answering demonstrate reliable risk control, early detection of intrinsically infeasible inputs, and safe composition under repeated use, while confidence-based baselines fail to provide consistent guarantees.
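One standard way to build such a sequential, anytime-valid check is a time-uniform confidence sequence on the violation rate. The sketch below uses a union-bound Hoeffding radius as a stand-in for the paper's (likely tighter) construction:

```python
import numpy as np

def sequential_check(violations, eps=0.1, delta=0.05):
    """violations: stream of 0/1 outcomes (1 = hallucination among accepted
    generations); eps: target risk level; delta: overall error probability."""
    x = np.asarray(violations, dtype=float)
    for t in range(1, len(x) + 1):
        phat = x[:t].mean()
        # Hoeffding radius with delta split as delta / (t * (t + 1)) per round,
        # so the resulting interval is valid uniformly over time.
        rad = np.sqrt(np.log(2 * t * (t + 1) / delta) / (2 * t))
        if phat + rad < eps:
            return "feasible", t      # risk certifiably below eps
        if phat - rad > eps:
            return "infeasible", t    # input certifiably too risky
    return "undecided", len(x)
```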
[LG-89] COMET: Codebook-based Online-adaptive Multi-scale Embedding for Time-series Anomaly Detection
链接: https://arxiv.org/abs/2602.01635
作者: Jinwoo Park,Hyeongwon Kang,Seung Hun Han,Pilsung Kang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time series anomaly detection is a critical task across various industrial domains. However, capturing temporal dependencies and multivariate correlations within patch-level representation learning remains underexplored, and reliance on single-scale patterns limits the detection of anomalies across different temporal ranges. Furthermore, focusing on normal data representations makes models vulnerable to distribution shifts at inference time. To address these limitations, we propose Codebook-based Online-adaptive Multi-scale Embedding for Time-series anomaly detection (COMET), which consists of three key components: (1) Multi-scale Patch Encoding captures temporal dependencies and inter-variable correlations across multiple patch scales. (2) Vector-Quantized Coreset learns representative normal patterns via codebook and detects anomalies with a dual-score combining quantization error and memory distance. (3) Online Codebook Adaptation generates pseudo-labels based on codebook entries and dynamically adapts the model at inference through contrastive learning. Experiments on five benchmark datasets demonstrate that COMET achieves the best performance in 36 out of 45 evaluation metrics, validating its effectiveness across diverse environments.
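The dual score in component (2) combines two nearest-neighbor distances over learned sets. A minimal sketch, with an illustrative mixing weight that the paper presumably sets differently:

```python
import numpy as np

def dual_score(z, codebook, memory, lam=0.5):
    """z: a patch embedding (D,); codebook: (K, D) learned codes;
    memory: (M, D) coreset of normal embeddings."""
    q_err = np.min(np.linalg.norm(codebook - z, axis=1))  # quantization error
    mem_d = np.min(np.linalg.norm(memory - z, axis=1))    # memory (coreset) distance
    return lam * q_err + (1.0 - lam) * mem_d              # combined anomaly score
```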
[LG-90] AdaptNC: Adaptive Nonconformity Scores for Uncertainty-Aware Autonomous Systems in Dynamic Environments
链接: https://arxiv.org/abs/2602.01629
作者: Renukanandan Tumu,Aditya Singh,Rahul Mangharam
类目: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
*备注:
Abstract:Rigorous uncertainty quantification is essential for the safe deployment of autonomous systems in unconstrained environments. Conformal Prediction (CP) provides a distribution-free framework for this task, yet its standard formulations rely on exchangeability assumptions that are violated by the distribution shifts inherent in real-world robotics. Existing online CP methods maintain target coverage by adaptively scaling the conformal threshold, but typically employ a static nonconformity score function. We show that this fixed geometry leads to highly conservative, volume-inefficient prediction regions when environments undergo structural shifts. To address this, we propose AdaptNC, a framework for the joint online adaptation of both the nonconformity score parameters and the conformal threshold. AdaptNC leverages an adaptive reweighting scheme to optimize score functions, and introduces a replay buffer mechanism to mitigate the coverage instability that occurs during score transitions. We evaluate AdaptNC on diverse robotic benchmarks involving multi-agent policy changes, environmental changes and sensor degradation. Our results demonstrate that AdaptNC significantly reduces prediction region volume compared to state-of-the-art threshold-only baselines while maintaining target coverage levels.
[LG-91] Efficient Softmax Reformulation for Homomorphic Encryption via Moment Generating Function
链接: https://arxiv.org/abs/2602.01621
作者: Hanjun Park,Byeong-Seo Min,Jiheon Woo,Min-Wook Jeong,Jongho Shin,Yongwoo Lee,Young-Sik Kim,Yongjune Kim
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Homomorphic encryption (HE) is a prominent framework for privacy-preserving machine learning, enabling inference directly on encrypted data. However, evaluating softmax, a core component of transformer architectures, remains particularly challenging in HE due to its multivariate structure, the large dynamic range induced by exponential functions, and the need for accurate division during normalization. In this paper, we propose MGF-softmax, a novel softmax reformulation based on the moment generating function (MGF) that replaces the softmax denominator with its moment-based counterpart. This reformulation substantially reduces multiplicative depth while preserving key properties of softmax and asymptotically converging to the exact softmax as the number of input tokens increases. Extensive experiments on Vision Transformers and large language models show that MGF-softmax provides an efficient and accurate approximation of softmax in encrypted inference. In particular, it achieves inference accuracy close to that of high-depth exact methods, while requiring substantially lower computational cost through reduced multiplicative depth.
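To see the idea, note that the softmax denominator \sum_j e^{x_j} equals n times the empirical MGF evaluated at t=1, and a truncated moment series turns that MGF into a low-degree polynomial of the inputs, which is the HE-friendly part. The following is an illustrative reading of the reformulation, not necessarily the paper's exact formula (in a real HE pipeline the numerator's exponential would also be approximated):

```python
import numpy as np
from math import factorial

def mgf_softmax(x, order=4):
    """Replace the denominator n * M(1) by a truncated moment expansion:
    M(1) ~ sum_k m_k / k!, with m_k the empirical raw moments of x."""
    n = len(x)
    m_hat = sum(np.mean(x ** k) / factorial(k) for k in range(order + 1))
    return np.exp(x) / (n * m_hat)

x = np.array([1.0, 0.5, -0.3, 0.2])
print(mgf_softmax(x).sum())   # approaches 1.0 as the order grows
```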
[LG-92] What Do Agents Learn from Trajectory-SFT: Semantics or Interfaces?
链接: https://arxiv.org/abs/2602.01611
作者: Weizheng Gu,Chengze Li,Zhuohao Yu,Mengyuan Sun,Zhibang Yang,Wei Wang,Hongrui Jia,Shikun Zhang,Wei Ye
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models are increasingly evaluated as interactive agents, yet standard agent benchmarks conflate two qualitatively distinct sources of success: semantic tool-use and interface-specific interaction pattern memorization. Because both mechanisms can yield identical task success on the original interface, benchmark scores alone are not identifiable evidence of environment-invariant capability. We propose PIPE, a protocol-level evaluation augmentation for diagnosing interface reliance by minimally rewriting environment interfaces while preserving task semantics and execution behavior. Across 16 environments from AgentBench and AgentGym and a range of open-source and API-based agents, PIPE reveals that trajectory-SFT substantially amplifies interface shortcutting: trained agents degrade sharply under minimal interface rewrites, while non-trajectory-trained models remain largely stable. We further introduce Interface Reliance (IR), a counterbalanced alias-based metric that quantifies preference for training-time interfaces, and show that interface shortcutting exhibits environment-dependent, non-monotonic training dynamics that remain invisible under standard evaluation. Our code is available at this https URL.
[LG-93] Universal Redundancies in Time Series Foundation Models
链接: https://arxiv.org/abs/2602.01605
作者: Anthony Bao,Venkata Hasith Vattikuti,Jeffrey Lai,William Gilpin
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Time Series Foundation Models (TSFMs) leverage extensive pretraining to accurately predict unseen time series during inference, without the need for task-specific fine-tuning. Through large-scale evaluations on standard benchmarks, we find that leading transformer-based TSFMs exhibit redundant components in their intermediate layers. We introduce a set of tools for mechanistic interpretability of TSFMs, including ablations of specific components and direct logit attribution on the residual stream. Our findings are consistent across several leading TSFMs with diverse architectures, and across a diverse set of real-world and synthetic time-series datasets. We discover that all models in our study are robust to ablations of entire layers. Furthermore, we develop a theoretical framework framing transformers as kernel regressors, motivating a purely intrinsic strategy for ablating heads based on the stable rank of the per-head projection matrices. Using this approach, we uncover the specific heads responsible for degenerate phenomena widely observed in TSFMs, such as parroting of motifs from the context and seasonality bias. Our study sheds light on the universal properties of this emerging class of architectures for continuous-time sequence modeling.
[LG-94] A Lightweight Sparse Interaction Network for Time Series Forecasting AAAI2025
链接: https://arxiv.org/abs/2602.01585
作者: Xu Zhang,Qitong Wang,Peng Wang,Wei Wang
类目: Machine Learning (cs.LG)
*备注: The paper is published in AAAI Conference on Artificial Intelligence, AAAI 2025. The code is available at the link this https URL
Abstract:Recent work shows that linear models can outperform several transformer models in long-term time-series forecasting (TSF). However, instead of explicitly performing temporal interaction through self-attention, linear models implicitly perform it based on stacked MLP structures, which may be insufficient in capturing the complex temporal dependencies and their performance still has potential for improvement. To this end, we propose a Lightweight Sparse Interaction Network (LSINet) for the TSF task. Inspired by the sparsity of self-attention, we propose a Multihead Sparse Interaction Mechanism (MSIM). Different from self-attention, MSIM learns the important connections between time steps through a sparsity-induced Bernoulli distribution to capture temporal dependencies for TSF. The sparsity is ensured by the proposed self-adaptive regularization loss. Moreover, we observe the shareability of temporal interactions and propose to perform Shared Interaction Learning (SIL) for MSIM to further enhance efficiency and improve convergence. LSINet is a linear model comprising only MLP structures with low overhead and equipped with explicit temporal interaction mechanisms. Extensive experiments on public datasets show that LSINet achieves both higher accuracy and better efficiency than advanced linear models and transformer models in TSF tasks. The code is available at the link this https URL.
[LG-95] Nearly Optimal Active Preference Learning and Its Application to LLM Alignment
链接: https://arxiv.org/abs/2602.01581
作者: Yao Zhao,Kwang-Sung Jun
类目: Machine Learning (cs.LG)
*备注:
Abstract:Aligning large language models (LLMs) depends on high-quality datasets of human preference labels, which are costly to collect. Although active learning has been studied to improve sample efficiency relative to passive collection, many existing approaches adopt classical experimental design criteria such as G- or D-optimality. These objectives are not tailored to the structure of preference learning, leaving open the design of problem-specific algorithms. In this work, we identify a simple intuition specific to preference learning that calls into question the suitability of these existing design objectives. Motivated by this insight, we propose two active learning algorithms. The first provides the first instance-dependent label complexity guarantee for this setting, and the second is a simple, practical greedy method. We evaluate our algorithm on real-world preference datasets and observe improved sample efficiency compared to existing methods.
[LG-96] Local Exponential Stability of Mean-Field Langevin Descent-Ascent in Wasserstein Space
链接: https://arxiv.org/abs/2602.01564
作者: Geuntaek Seo,Minseop Shin,Pierre Monmarché,Beomjun Choi
类目: Machine Learning (cs.LG); Analysis of PDEs (math.AP); Optimization and Control (math.OC); Probability (math.PR)
*备注:
Abstract:We study the mean-field Langevin descent-ascent (MFL-DA), a coupled optimization dynamics on the space of probability measures for entropically regularized two-player zero-sum games. Although the associated mean-field objective admits a unique mixed Nash equilibrium, the long-time behavior of the original MFL-DA for general nonconvex-nonconcave payoffs has remained largely open. Answering an open question posed by Wang and Chizat (COLT 2024), we provide a partial resolution by proving that this equilibrium is locally exponentially stable: if the initialization is sufficiently close in the Wasserstein metric, the dynamics converges to the equilibrium at an exponential rate. The key to our analysis is to establish a coercivity estimate for the entropy near equilibrium via spectral analysis of the linearized operator. We show that this coercivity effectively reveals a local displacement convex-concave structure, thereby driving contraction. This result settles the local stability and quantitative rate questions of Wang and Chizat, leaving global convergence as a remaining open challenge.
[LG-97] How Implicit Bias Accumulates and Propagates in LLM Long-term Memory
链接: https://arxiv.org/abs/2602.01558
作者: Yiming Ma,Lixu Wang,Lionel Z. Wang,Hongkun Yang,Haoming Sun,Xin Xu,Jiaqi Wu,Bin Chen,Wei Dong
类目: Machine Learning (cs.LG)
*备注: Under review, and the first two authors contribute equally
Abstract:Long-term memory mechanisms enable Large Language Models (LLMs) to maintain continuity and personalization across extended interaction lifecycles, but they also introduce new and underexplored risks related to fairness. In this work, we study how implicit bias, defined as subtle statistical prejudice, accumulates and propagates within LLMs equipped with long-term memory. To support systematic analysis, we introduce the Decision-based Implicit Bias (DIB) Benchmark, a large-scale dataset comprising 3,776 decision-making scenarios across nine social domains, designed to quantify implicit bias in long-term decision processes. Using a realistic long-horizon simulation framework, we evaluate six state-of-the-art LLMs integrated with three representative memory architectures on DIB and demonstrate that LLMs’ implicit bias does not remain static but intensifies over time and propagates across unrelated domains. We further analyze mitigation strategies and show that a static system-level prompting baseline provides limited and short-lived debiasing effects. To address this limitation, we propose Dynamic Memory Tagging (DMT), an agentic intervention that enforces fairness constraints at memory write time. Extensive experimental results show that DMT substantially reduces bias accumulation and effectively curtails cross-domain bias propagation.
[LG-98] Making Bias Non-Predictive: Training Robust LLM Judges via Reinforcement Learning
链接: https://arxiv.org/abs/2602.01528
作者: Qian Wang,Xuandong Zhao,Zirui Zhang,Zhanzhi Lou,Nuo Chen,Dawn Song,Bingsheng He
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:
Abstract:Large language models (LLMs) increasingly serve as automated judges, yet they remain susceptible to cognitive biases – often altering their reasoning when faced with spurious prompt-level cues such as consensus claims or authority appeals. Existing mitigations via prompting or supervised fine-tuning fail to generalize, as they modify surface behavior without changing the optimization objective that makes bias cues predictive. To address this gap, we propose Epistemic Independence Training (EIT), a reinforcement learning framework grounded in a key principle: to learn independence, bias cues must be made non-predictive of reward. EIT operationalizes this through a balanced conflict strategy where bias signals are equally likely to support correct and incorrect answers, combined with a reward design that penalizes bias-following without rewarding bias agreement. Experiments on Qwen3-4B demonstrate that EIT improves both accuracy and robustness under adversarial biases, while preserving performance when bias aligns with truth. Notably, models trained only on bandwagon bias generalize to unseen bias types such as authority and distraction, indicating that EIT induces transferable epistemic independence rather than bias-specific heuristics. Code and data are available at this https URL.
[LG-99] The Inlet Rank Collapse in Implicit Neural Representations: Diagnosis and Unified Remedy
链接: https://arxiv.org/abs/2602.01526
作者: Jianqiao Zheng,Hemanth Saratchandran,Simon Lucey
类目: Machine Learning (cs.LG)
*备注:
Abstract:Implicit Neural Representations (INRs) have revolutionized continuous signal modeling, yet they struggle to recover fine-grained details within finite training budgets. While empirical techniques, such as positional encoding (PE), sinusoidal activations (SIREN), and batch normalization (BN), effectively mitigate this, their theoretical justifications are predominantly post hoc, focusing on the global NTK spectrum only after modifications are applied. In this work, we reverse this paradigm by introducing a structural diagnostic framework. By performing a layer-wise decomposition of the NTK, we mathematically identify the "Inlet Rank Collapse": a phenomenon where the low-dimensional input coordinates fail to span the high-dimensional embedding space, creating a fundamental rank deficiency at the first layer that acts as an expressive bottleneck for the entire network. This framework provides a unified perspective to re-interpret PE, SIREN, and BN as different forms of rank restoration. Guided by this diagnosis, we derive a Rank-Expanding Initialization, a minimalist remedy that ensures the representation rank scales with the layer width without architectural modifications or computational overhead. Our results demonstrate that this principled remedy enables standard MLPs to achieve high-fidelity reconstructions, proving that the key to empowering INRs lies in the structural optimization of the initial rank propagation to effectively populate the latent space.
[LG-100] RAPT: Model-Predictive Out-of-Distribution Detection and Failure Diagnosis for Sim-to-Real Humanoid Robots
链接: https://arxiv.org/abs/2602.01515
作者: Humphrey Munn,Brendan Tidd,Peter Bohm,Marcus Gallagher,David Howard
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Deploying learned control policies on humanoid robots is challenging: policies that appear robust in simulation can execute confidently in out-of-distribution (OOD) states after Sim-to-Real transfer, leading to silent failures that risk hardware damage. Although anomaly detection can mitigate these failures, prior methods are often incompatible with high-rate control, poorly calibrated at the extremely low false-positive rates required for practical deployment, or operate as black boxes that provide a binary stop signal without explaining why the robot drifted from nominal behavior. We present RAPT, a lightweight, self-supervised deployment-time monitor for 50Hz humanoid control. RAPT learns a probabilistic spatio-temporal manifold of nominal execution from simulation and evaluates execution-time predictive deviation as a calibrated, per-dimension signal. This yields (i) reliable online OOD detection under strict false-positive constraints and (ii) a continuous, interpretable measure of Sim-to-Real mismatch that can be tracked over time to quantify how far deployment has drifted from training. Beyond detection, we introduce an automated post-hoc root-cause analysis pipeline that combines gradient-based temporal saliency derived from RAPT’s reconstruction objective with LLM-based reasoning conditioned on saliency and joint kinematics to produce semantic failure diagnoses in a zero-shot setting. We evaluate RAPT on a Unitree G1 humanoid across four complex tasks in simulation and on physical hardware. In large-scale simulation, RAPT improves True Positive Rate (TPR) by 37% over the strongest baseline at a fixed episode-level false positive rate of 0.5%. On real-world deployments, RAPT achieves a 12.5% TPR improvement and provides actionable interpretability, reaching 75% root-cause classification accuracy across 16 real-world failures using only proprioceptive data.
[LG-101] Enhancing Generalization in Evolutionary Feature Construction for Symbolic Regression through Vicinal Jensen Gap Minimization
链接: https://arxiv.org/abs/2602.01510
作者: Hengzhe Zhang,Qi Chen,Bing Xue,Wolfgang Banzhaf,Mengjie Zhang
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:
Abstract:Genetic programming-based feature construction has achieved significant success in recent years as an automated machine learning technique to enhance learning performance. However, overfitting remains a challenge that limits its broader applicability. To improve generalization, we prove that vicinal risk, estimated through noise perturbation or mixup-based data augmentation, is bounded by the sum of empirical risk and a regularization term-either finite difference or the vicinal Jensen gap. Leveraging this decomposition, we propose an evolutionary feature construction framework that jointly optimizes empirical risk and the vicinal Jensen gap to control overfitting. Since datasets may vary in noise levels, we develop a noise estimation strategy to dynamically adjust regularization strength. Furthermore, to mitigate manifold intrusion-where data augmentation may generate unrealistic samples that fall outside the data manifold-we propose a manifold intrusion detection mechanism. Experimental results on 58 datasets demonstrate the effectiveness of Jensen gap minimization compared to other complexity measures. Comparisons with 15 machine learning algorithms further indicate that genetic programming with the proposed overfitting control strategy achieves superior performance.
[LG-102] Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum
链接: https://arxiv.org/abs/2602.01505
作者: Navdeep Kumar,Tehila Dahan,Lior Cohen,Ananyabrata Barua,Giorgia Ramponi,Kfir Yehuda Levy,Shie Mannor
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We establish an optimal sample complexity of O(\epsilon^{-2}) for obtaining an \epsilon -optimal global policy using a single-timescale actor-critic (AC) algorithm in infinite-horizon discounted Markov decision processes (MDPs) with finite state-action spaces, improving upon the prior state of the art of O(\epsilon^{-3}) . Our approach applies STORM (STOchastic Recursive Momentum) to reduce variance in the critic updates. However, because samples are drawn from a nonstationary occupancy measure induced by the evolving policy, variance reduction via STORM alone is insufficient. To address this challenge, we maintain a buffer of a small fraction of recent samples and uniformly sample from it for each critic update. Importantly, these mechanisms are compatible with existing deep learning architectures and require only minor modifications, without compromising practical applicability.
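For reference, the STORM direction used in the critic update has a one-line form; this sketch shows only the variance-reduced estimator, with both gradients evaluated on the same minibatch drawn uniformly from the buffer of recent samples mentioned above:

```python
import numpy as np

def storm_direction(grad_now, grad_prev_same_batch, d_prev, a):
    """d_t = g_t(x_t) + (1 - a) * (d_{t-1} - g_t(x_{t-1})): the correction
    term cancels stale noise, reducing variance without a second timescale."""
    return grad_now + (1.0 - a) * (d_prev - grad_prev_same_batch)

# Toy usage with 1-D "gradients":
d = storm_direction(np.array([0.40]), np.array([0.50]), np.array([0.45]), a=0.1)
```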
[LG-103] Multi-Scale Wavelet Transformers for Operator Learning of Dynamical Systems
链接: https://arxiv.org/abs/2602.01486
作者: Xuesong Wang,Michael Groom,Rafael Oliveira,He Zhao,Terence O’Kane,Edwin V. Bonilla
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent years have seen a surge in data-driven surrogates for dynamical systems that can be orders of magnitude faster than numerical solvers. However, many machine learning-based models such as neural operators exhibit spectral bias, attenuating high-frequency components that often encode small-scale structure. This limitation is particularly damaging in applications such as weather forecasting, where misrepresented high frequencies can induce long-horizon instability. To address this issue, we propose multi-scale wavelet transformers (MSWTs), which learn system dynamics in a tokenized wavelet domain. The wavelet transform explicitly separates low- and high-frequency content across scales. MSWTs leverage a wavelet-preserving downsampling scheme that retains high-frequency features and employ wavelet-based attention to capture dependencies across scales and frequency bands. Experiments on chaotic dynamical systems show substantial error reductions and improved long horizon spectral fidelity. On the ERA5 climate reanalysis, MSWTs further reduce climatological bias, demonstrating their effectiveness in a real-world forecasting setting.
[LG-104] Predicting and improving test-time scaling laws via reward tail-guided search
链接: https://arxiv.org/abs/2602.01485
作者: Muheng Li,Jian Qian,Wenlong Mou
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 33 pages, 5 figures
Abstract:Test-time scaling has emerged as a critical avenue for enhancing the reasoning capabilities of Large Language Models (LLMs). Though the straightforward "best-of-N" (BoN) strategy has already demonstrated significant improvements in performance, it lacks principled guidance on the choice of N , budget allocation, and multi-stage decision-making, thereby leaving substantial room for optimization. While many works have explored such optimization, rigorous theoretical guarantees remain limited. In this work, we propose new methodologies to predict and improve scaling properties via tail-guided search. By estimating the tail distribution of rewards, our method predicts the scaling law of LLMs without the need for exhaustive evaluations. Leveraging this prediction tool, we introduce Scaling-Law Guided (SLG) Search, a new test-time algorithm that dynamically allocates compute to identify and exploit intermediate states with the highest predicted potential. We theoretically prove that SLG achieves vanishing regret compared to perfect-information oracles, and achieves expected rewards that would otherwise require a polynomially larger compute budget when using BoN. Empirically, we validate our framework across different LLMs and reward models, confirming that tail-guided allocation consistently achieves higher reward yields than Best-of-N under identical compute budgets. Our code is available at this https URL.
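The prediction side of this pipeline can be illustrated with order statistics: under the empirical reward distribution, the expected best-of-N value is a weighted sum over sorted rewards, so the entire scaling curve follows from one modest sample. This is a simplified stand-in for the paper's tail-distribution estimator:

```python
import numpy as np

def predict_bon(rewards, Ns):
    """Plug the empirical CDF F into E[max of N] = sum_i r_(i) * (F_i^N - F_{i-1}^N)."""
    r = np.sort(np.asarray(rewards, dtype=float))
    m = len(r)
    i = np.arange(1, m + 1)
    curve = []
    for N in Ns:
        w = (i / m) ** N - ((i - 1) / m) ** N   # P(the max lands on the i-th order stat)
        curve.append(float((w * r).sum()))
    return curve

rng = np.random.default_rng(0)
print(predict_bon(rng.normal(size=200), [1, 4, 16, 64]))   # predicted BoN scaling curve
```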
[LG-105] A Statistical Theory of Gated Attention through the Lens of Hierarchical Mixture of Experts
链接: https://arxiv.org/abs/2602.01468
作者: Viet Nguyen,Tuan Minh Pham,Thinh Cao,Tan Dinh,Huy Nguyen,Nhat Ho,Alessandro Rinaldo
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Viet Nguyen, Tuan Minh Pham, and Thinh Cao contributed equally to this work
Abstract:Self-attention has greatly contributed to the success of the widely used Transformer architecture by enabling learning from data with long-range dependencies. In an effort to improve performance, a gated attention model that leverages a gating mechanism within the multi-head self-attention has recently been proposed as a promising alternative. Gated attention has been empirically demonstrated to increase the expressiveness of low-rank mapping in standard attention and even to eliminate the attention sink phenomenon. Despite its efficacy, a clear theoretical understanding of gated attention’s benefits remains lacking in the literature. To close this gap, we rigorously show that each entry in a gated attention matrix or a multi-head self-attention matrix can be written as a hierarchical mixture of experts. By recasting learning as an expert estimation problem, we demonstrate that gated attention is more sample-efficient than multi-head self-attention. In particular, while the former needs only a polynomial number of data points to estimate an expert, the latter requires exponentially many data points to achieve the same estimation error. Furthermore, our analysis also provides a theoretical justification for why gated attention yields higher performance when a gate is placed at the output of the scaled dot product attention or the value map rather than at other positions in the multi-head self-attention architecture.
[LG-106] Modeling Topological Impact on Node Attribute Distributions in Attributed Graphs
链接: https://arxiv.org/abs/2602.01454
作者: Amirreza Shiralinasab Langari,Leila Yeganeh,Kim Khoa Nguyen
类目: Machine Learning (cs.LG)
*备注:
Abstract:We investigate how the topology of attributed graphs influences the distribution of node attributes. This work offers a novel perspective by treating topology and attributes as structurally distinct but interacting components. We introduce an algebraic approach that combines a graph’s topology with the probability distribution of node attributes, resulting in topology-influenced distributions. First, we develop a categorical framework to formalize how a node perceives the graph’s topology. We then quantify this point of view and integrate it with the distribution of node attributes to capture topological effects. We interpret these topology-conditioned distributions as approximations of the posteriors P(\cdot \mid v) and P(\cdot \mid \mathcal G) . We further establish a principled sufficiency condition by showing that, on complete graphs, where topology carries no informative structure, our construction recovers the original attribute distribution. To evaluate our approach, we introduce an intentionally simple testbed model, ID, and use unsupervised graph anomaly detection as a probing task.
[LG-107] Provable Cooperative Multi-Agent Exploration for Reward-Free MDPs
链接: https://arxiv.org/abs/2602.01453
作者: Idan Barnea,Orin Levy,Yishay Mansour
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study cooperative multi-agent reinforcement learning in the setting of reward-free exploration, where multiple agents jointly explore an unknown MDP in order to learn its dynamics (without observing rewards). We focus on a tabular finite-horizon MDP and adopt a phased learning framework. In each learning phase, multiple agents independently interact with the environment. More specifically, in each learning phase, each agent is assigned a policy, executes it, and observes the resulting trajectory. Our primary goal is to characterize the tradeoff between the number of learning phases and the number of agents, especially when the number of learning phases is small. Our results identify a sharp transition governed by the horizon H . When the number of learning phases equals H , we present a computationally efficient algorithm that uses only \tilde O(S^6 H^6 A / \epsilon^2) agents to obtain an \epsilon -approximation of the dynamics (i.e., yields an \epsilon -optimal policy for any reward function). We complement our algorithm with a lower bound showing that any algorithm restricted to \rho < H phases requires at least A^{H/\rho} agents to achieve constant accuracy. Thus, we show that it is essential to have an order of H learning phases if we limit the number of agents to be polynomial.
[LG-108] A Meta-Knowledge-Augmented LLM Framework for Hyperparameter Optimization in Time-Series Forecasting
链接: https://arxiv.org/abs/2602.01445
作者: Ons Saadallah,Mátyás Andó,Tamás Gábor Orosz
类目: Machine Learning (cs.LG)
*备注:
Abstract:Hyperparameter optimization (HPO) plays a central role in the performance of deep learning models, yet remains computationally expensive and difficult to interpret, particularly for time-series forecasting. While Bayesian Optimization (BO) is a standard approach, it typically treats tuning tasks independently and provides limited insight into its decisions. Recent advances in large language models (LLMs) offer new opportunities to incorporate structured prior knowledge and reasoning into optimization pipelines. We introduce LLM-AutoOpt, a hybrid HPO framework that combines BO with LLM-based contextual reasoning. The framework encodes dataset meta-features, model descriptions, historical optimization outcomes, and target objectives as structured meta-knowledge within LLM prompts, using BO to initialize the search and mitigate cold-start effects. This design enables context-aware and stable hyperparameter refinement while exposing the reasoning behind optimization decisions. Experiments on a multivariate time series forecasting benchmark demonstrate that LLM-AutoOpt achieves improved predictive performance and more interpretable optimization behavior compared to BO and LLM baselines without meta-knowledge.
[LG-109] Theoretical Analysis of Measure Consistency Regularization for Partially Observed Data
链接: https://arxiv.org/abs/2602.01437
作者: Yinsong Wang,Shahin Shahrampour
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:The problem of corrupted data, missing features, or missing modalities continues to plague the modern machine learning landscape. To address this issue, a class of regularization methods that enforce consistency between imputed and fully observed data has emerged as a promising approach for improving model generalization, particularly in partially observed settings. We refer to this class of methods as Measure Consistency Regularization (MCR). Despite its empirical success in various applications, such as image inpainting, data imputation and semi-supervised learning, a fundamental understanding of the theoretical underpinnings of MCR remains limited. This paper bridges this gap by offering theoretical insights into why, when, and how MCR enhances imputation quality under partial observability, viewed through the lens of neural network distance. Our theoretical analysis identifies the term responsible for MCR’s generalization advantage and extends to the imperfect training regime, demonstrating that this advantage is not always guaranteed. Guided by these insights, we propose a novel training protocol that monitors the duality gap to determine an early stopping point that preserves the generalization benefit. We then provide detailed empirical evidence to support our theoretical claims and to show the effectiveness and accuracy of our proposed stopping condition. We further provide a set of real-world data simulations to show the versatility of MCR under different model architectures designed for different data sources.
[LG-110] Phase Transitions for Feature Learning in Neural Networks
链接: https://arxiv.org/abs/2602.01434
作者: Andrea Montanari,Zihao Wang
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 74 pages; 17 pdf figures
Abstract:According to a popular viewpoint, neural networks learn from data by first identifying low-dimensional representations, and subsequently fitting the best model in this space. Recent works provide a formalization of this phenomenon when learning multi-index models. In this setting, we are given n i.i.d. pairs (\boldsymbol x_i, y_i) , where the covariate vectors \boldsymbol x_i \in \mathbb R^d are isotropic, and responses y_i only depend on \boldsymbol x_i through a k -dimensional projection \boldsymbol \Theta_*^\top \boldsymbol x_i . Feature learning amounts to learning the latent space spanned by \boldsymbol \Theta_* . In this context, we study the gradient descent dynamics of two-layer neural networks under the proportional asymptotics n,d\to\infty , n/d\to\delta , while the dimension of the latent space k and the number of hidden neurons m are kept fixed. Earlier work establishes that feature learning via polynomial-time algorithms is possible if \delta > \delta_{\mathrm{alg}} , for \delta_{\mathrm{alg}} a threshold depending on the data distribution, and is impossible (within a certain class of algorithms) below \delta_{\mathrm{alg}} . Here we derive an analogous threshold \delta_{\mathrm{NN}} for two-layer networks. Our characterization of \delta_{\mathrm{NN}} opens the way to study the dependence of learning dynamics on the network architecture and training algorithm. The threshold \delta_{\mathrm{NN}} is determined by the following scenario. Training first visits points for which the gradient of the empirical risk is large and learns the directions spanned by these gradients. Then the gradient becomes smaller and the dynamics becomes dominated by negative directions of the Hessian. The threshold \delta_{\mathrm{NN}} corresponds to a phase transition in the spectrum of the Hessian in this second phase.
[LG-111] Improve the Trade-off Between Watermark Strength and Speculative Sampling Efficiency for Language Models ICLR2026
链接: https://arxiv.org/abs/2602.01428
作者: Weiqing He,Xiang Li,Li Shen,Weijie Su,Qi Long
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Accepted at ICLR 2026
Abstract:Watermarking is a principled approach for tracing the provenance of large language model (LLM) outputs, but its deployment in practice is hindered by inference inefficiency. Speculative sampling accelerates inference, with efficiency improving as the acceptance rate between draft and target models increases. Yet recent work reveals a fundamental trade-off: higher watermark strength reduces acceptance, preventing their simultaneous achievement. We revisit this trade-off and show it is not absolute. We introduce a quantitative measure of watermark strength that governs statistical detectability and is maximized when tokens are deterministic functions of pseudorandom numbers. Using this measure, we fully characterize the trade-off as a constrained optimization problem and derive explicit Pareto curves for two existing watermarking schemes. Finally, we introduce a principled mechanism that injects pseudorandomness into draft-token acceptance, ensuring maximal watermark strength while maintaining speculative sampling efficiency. Experiments further show that this approach improves detectability without sacrificing efficiency. Our findings uncover a principle that unites speculative sampling and watermarking, paving the way for their efficient and practical deployment.
[LG-112] SNIP: An Adaptive Mixed Precision Framework for Subbyte Large Language Model Training ASPLOS2026
链接: https://arxiv.org/abs/2602.01410
作者: Yunjie Pan,Yongyi Yang,Hanmei Yang,Scott Mahlke
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: Accepted to ASPLOS 2026
Abstract:Training large language models (LLMs) efficiently while preserving model quality poses significant challenges, particularly with subbyte precision supported by state-of-the-art GPUs. Current mixed-precision training approaches either apply uniform precision to all GEMM operations or rely on heuristic-based methods that fail to generalize during training, leading to suboptimal convergence and instability. To address these challenges, this paper introduces SNIP, a fine-grained adaptive mixed-precision training framework for LLM pretraining that supports subbyte precision. SNIP periodically collects statistics on activations, gradients, and optimizer states to assess the precision loss impact on model quality. We define two key metrics: loss divergence in the forward pass, caused by quantization-induced increases in training loss, and weight divergence in the backward pass, which measures error propagation through gradients affecting model updates. These metrics guide an Integer Linear Programming (ILP) problem that systematically optimizes layerwise precision to minimize overall quality loss while meeting efficiency targets. Experiments on 1B, 3B, 7B and 70B Llama-like models demonstrate that SNIP consistently outperforms existing baselines, reducing FLOPs by up to 80% while preserving model quality across different model sizes and training phases with minimal computational overhead.
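To make the precision-allocation step concrete, here is a toy version of the problem SNIP solves: choose a bit-width per layer to minimize estimated quality loss under an efficiency budget. The layer names and loss numbers below are invented, and a brute-force search stands in for the paper's ILP solver:

```python
import itertools

# Hypothetical per-layer quality-loss estimates for each candidate bit-width
# (stand-ins for SNIP's loss/weight divergence metrics; numbers are invented).
quality_loss = {
    "attn_qkv": {4: 0.30, 6: 0.08, 8: 0.01},
    "attn_out": {4: 0.10, 6: 0.03, 8: 0.01},
    "mlp_up":   {4: 0.05, 6: 0.02, 8: 0.01},
    "mlp_down": {4: 0.40, 6: 0.10, 8: 0.02},
}
budget_avg_bits = 6.0  # efficiency target: mean bits per layer

layers = list(quality_loss)
best = None
for combo in itertools.product(*(quality_loss[name] for name in layers)):
    if sum(combo) / len(combo) <= budget_avg_bits:
        loss = sum(quality_loss[name][bits] for name, bits in zip(layers, combo))
        if best is None or loss < best[0]:
            best = (loss, dict(zip(layers, combo)))

print(best)  # (minimal total quality loss, {layer: bits}) under the budget
```

A real deployment would feed measured divergence statistics into an actual ILP solver rather than enumerating assignments.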
[LG-113] Nonlinear model reduction for transport-dominated problems
链接: https://arxiv.org/abs/2602.01397
作者: Jan S. Hesthaven,Benjamin Peherstorfer,Benjamin Unger
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:This article surveys nonlinear model reduction methods that remain effective in regimes where linear reduced-space approximations are intrinsically inefficient, such as transport-dominated problems with wave-like phenomena and moving coherent structures, which are commonly associated with the Kolmogorov barrier. The article organizes nonlinear model reduction techniques around three key elements – nonlinear parametrizations, reduced dynamics, and online solvers – and categorizes existing approaches into transformation-based methods, online adaptive techniques, and formulations that combine generic nonlinear parametrizations with instantaneous residual minimization.
[LG-114] The Enhanced Physics-Informed Kolmogorov-Arnold Networks: Applications of Newton's Laws in Financial Deep Reinforcement Learning (RL) Algorithms
链接: https://arxiv.org/abs/2602.01388
作者: Trang Thoi,Hung Tran,Tram Thoi,Huaiyang Zhong
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:
Abstract:Deep Reinforcement Learning (DRL), a subset of machine learning focused on sequential decision-making, has emerged as a powerful approach for tackling financial trading problems. In finance, DRL is commonly used either to generate discrete trade signals or to determine continuous portfolio allocations. In this work, we propose a novel reinforcement learning framework for portfolio optimization that incorporates Physics-Informed Kolmogorov-Arnold Networks (PIKANs) into several DRL algorithms. The approach replaces conventional multilayer perceptrons with Kolmogorov-Arnold Networks (KANs) in both actor and critic components, utilizing learnable B-spline univariate functions to achieve parameter-efficient and more interpretable function approximation. During actor updates, we introduce a physics-informed regularization loss that promotes second-order temporal consistency between observed return dynamics and the action-induced portfolio adjustments. The proposed framework is evaluated across three equity markets (China, Vietnam, and the United States), covering both emerging and developed economies. Across all three markets, PIKAN-based agents consistently deliver higher cumulative and annualized returns, superior Sharpe and Calmar ratios, and more favorable drawdown characteristics than both standard DRL baselines and classical online portfolio-selection methods, along with more stable training. The approach is particularly valuable in highly dynamic and noisy financial markets, where conventional DRL often suffers from instability and poor generalization.
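For readers unfamiliar with KANs, the core building block is one learnable univariate function per edge. The sketch below uses a fixed Gaussian-bump basis instead of the B-splines the paper uses; this is a deliberate simplification (real KAN implementations also adapt the spline grid):

```python
import torch
import torch.nn as nn

class LearnableEdge(nn.Module):
    """One KAN-style learnable univariate function phi(x) = sum_k c_k b_k(x).
    Gaussian bumps stand in for the B-spline basis of the paper."""
    def __init__(self, n_basis: int = 12, lo: float = -2.0, hi: float = 2.0):
        super().__init__()
        self.register_buffer("centers", torch.linspace(lo, hi, n_basis))
        self.width = (hi - lo) / n_basis
        self.coef = nn.Parameter(torch.zeros(n_basis))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: any shape
        basis = torch.exp(-((x.unsqueeze(-1) - self.centers) / self.width) ** 2)
        return basis @ self.coef                          # same shape as x

edge = LearnableEdge()
print(edge(torch.randn(8)).shape)  # torch.Size([8])
```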
[LG-115] Your Self-Play Algorithm is Secretly an Adversarial Imitator: Understanding LLM Self-Play through the Lens of Imitation Learning
链接: https://arxiv.org/abs/2602.01357
作者: Shangzhe Li,Xuchao Zhang,Chetan Bansal,Weitong Zhang
类目: Machine Learning (cs.LG)
*备注: 35 pages, 5 tables, 3 figures
Abstract:Self-play post-training methods have emerged as an effective approach for finetuning large language models, turning a weak language model into a strong one without preference data. However, the theoretical foundations of self-play finetuning remain underexplored. In this work, we tackle this by connecting self-play finetuning with adversarial imitation learning, formulating the finetuning procedure as a min-max game between the model and a regularized implicit reward player parameterized by the model itself. This perspective unifies self-play imitation and general preference alignment within a common framework. Under this formulation, we present a game-theoretic analysis showing that self-play finetuning converges to its equilibrium. Guided by this theoretical formulation, we propose a new self-play imitation finetuning algorithm based on the \chi^2-divergence variational objective with bounded rewards and improved stability. Experiments on a variety of language model finetuning tasks demonstrate consistent improvements over existing self-play methods and validate our theoretical insights.
[LG-116] Finding Differentially Private Second Order Stationary Points in Stochastic Minimax Optimization
链接: https://arxiv.org/abs/2602.01339
作者: Difei Xu,Youming Tao,Meng Ding,Chenglin Fan,Di Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:We provide the first study of the problem of finding differentially private (DP) second-order stationary points (SOSP) in stochastic (non-convex) minimax optimization. Existing literature either focuses only on first-order stationary points for minimax problems or on SOSP for classical stochastic minimization problems. This work provides, for the first time, a unified and detailed treatment of both empirical and population risks. Specifically, we propose a purely first-order method that combines a nested gradient descent-ascent scheme with SPIDER-style variance reduction and Gaussian perturbations to ensure privacy. A key technical device is a block-wise (q-period) analysis that controls the accumulation of stochastic variance and privacy noise without summing over the full iteration horizon, yielding a unified treatment of both empirical-risk and population formulations. Under standard smoothness, Hessian-Lipschitzness, and strong concavity assumptions, we establish high-probability guarantees for reaching an (\alpha, \sqrt{\rho_\Phi \alpha})-approximate second-order stationary point with \alpha = \mathcal{O}((\frac{\sqrt{d}}{n\varepsilon})^{2/3}) for empirical risk objectives and \alpha = \mathcal{O}(\frac{1}{n^{1/3}} + (\frac{\sqrt{d}}{n\varepsilon})^{1/2}) for population objectives, matching the best known rates for private first-order stationarity.
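A stripped-down sketch of the privatized descent-ascent pattern underlying the method (clip each gradient to bound sensitivity, add Gaussian noise, then step both players); the paper's SPIDER variance reduction and block-wise privacy accounting are omitted:

```python
import numpy as np

def dp_gda_step(x, y, grad_x, grad_y, lr=0.1, clip=1.0, sigma=0.5,
                rng=np.random.default_rng(0)):
    """One privatized gradient descent-ascent step on min_x max_y f(x, y)."""
    def privatize(g):
        g = g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))  # clip sensitivity
        return g + rng.normal(0.0, sigma * clip, size=g.shape)  # Gaussian mechanism
    x_new = x - lr * privatize(grad_x)   # descent on the min player
    y_new = y + lr * privatize(grad_y)   # ascent on the max player
    return x_new, y_new
```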
[LG-117] High-accuracy sampling for diffusion models and log-concave distributions
链接: https://arxiv.org/abs/2602.01338
作者: Fan Chen,Sinho Chewi,Constantinos Daskalakis,Alexander Rakhlin
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:
Abstract:We present algorithms for diffusion model sampling which obtain \delta-error in \mathrm{polylog}(1/\delta) steps, given access to \widetilde{O}(\delta)-accurate score estimates in L^2. This is an exponential improvement over all previous results. Specifically, under minimal data assumptions, the complexity is \widetilde{O}(d \, \mathrm{polylog}(1/\delta)) where d is the dimension of the data; under a non-uniform L-Lipschitz condition, the complexity is \widetilde{O}(\sqrt{dL} \, \mathrm{polylog}(1/\delta)); and if the data distribution has intrinsic dimension d_\star, then the complexity reduces to \widetilde{O}(d_\star \, \mathrm{polylog}(1/\delta)). Our approach also yields the first \mathrm{polylog}(1/\delta)-complexity sampler for general log-concave distributions using only gradient evaluations.
[LG-118] Imperfect Influence Preserved Rankings: A Theory of TRAK for Data Attribution
链接: https://arxiv.org/abs/2602.01312
作者: Han Tong,Shubhangi Ghosh,Haolin Zou,Arian Maleki
类目: Machine Learning (cs.LG)
*备注:
Abstract:Data attribution, tracing a model’s prediction back to specific training data, is an important tool for interpreting sophisticated AI models. The widely used TRAK algorithm addresses this challenge by first approximating the underlying model with a kernel machine and then leveraging techniques developed for approximating the leave-one-out (ALO) risk. Despite its strong empirical performance, the theoretical conditions under which the TRAK approximations are accurate as well as the regimes in which they break down remain largely unexplored. In this paper, we provide a theoretical analysis of the TRAK algorithm, characterizing its performance and quantifying the errors introduced by the approximations on which the method relies. We show that although the approximations incur significant errors, TRAK’s estimated influence remains highly correlated with the original influence and therefore largely preserves the relative ranking of data points. We corroborate our theoretical results through extensive simulations and empirical studies.
[LG-119] The BoBW Algorithms for Heavy-Tailed MDPs
链接: https://arxiv.org/abs/2602.01295
作者: Yu Chen,Yuhao Liu,Jiatai Huang,Yihan Du,Longbo Huang
类目: Machine Learning (cs.LG)
*备注:
Abstract:We investigate episodic Markov Decision Processes with heavy-tailed feedback (HTMDPs). Existing approaches for HTMDPs are conservative in stochastic environments and lack adaptivity in adversarial regimes. In this work, we propose algorithms HT-FTRL-OM and HT-FTRL-UOB for HTMDPs that achieve Best-of-Both-Worlds (BoBW) guarantees: instance-independent regret in adversarial environments and logarithmic instance-dependent regret in self-bounding (including the stochastic case) environments. For the known transition setting, HT-FTRL-OM applies the Follow-The-Regularized-Leader (FTRL) framework over occupancy measures with novel skipping loss estimators, achieving a \widetilde{\mathcal{O}}(T^{1/\alpha}) regret bound in adversarial regimes and a \mathcal{O}(\log T) regret in stochastic regimes. Building upon this framework, we develop a novel algorithm HT-FTRL-UOB to tackle the more challenging unknown-transition setting. This algorithm employs a pessimistic skipping loss estimator and achieves a \widetilde{\mathcal{O}}(T^{1/\alpha} + \sqrt{T}) regret in adversarial regimes and a \mathcal{O}(\log^2 T) regret in stochastic regimes. Our analysis overcomes key barriers through several technical insights, including a local control mechanism for heavy-tailed shifted losses, a new suboptimal-mass propagation principle, and a novel regret decomposition that isolates transition uncertainty from heavy-tailed estimation errors and skipping bias.
[LG-120] EDIS: Diagnosing LLM Reasoning via Entropy Dynamics ICML2026
链接: https://arxiv.org/abs/2602.01288
作者: Chenghua Zhu,Siyan Wu,Xiangkang Zeng,Zishan Xu,Zhaolu Kang,Yifu Guo,Yuquan Lu,Junduan Huang,Guojing Zhou
类目: Machine Learning (cs.LG)
*备注: Under review at ICML 2026
Abstract:Entropy-based confidence signals are increasingly leveraged to improve reasoning in large language models (LLMs), yet existing approaches treat confidence as a static quantity, typically aggregated over tokens. We show that the temporal evolution of confidence during generation carries richer information than aggregate statistics alone. Analyzing token-level entropy trajectories, we identify characteristic patterns distinguishing correct from incorrect reasoning: erroneous solutions exhibit unstable dynamics, including burst spikes (sustained uncertainty growth) and peak-valley spikes (sharp rebounds following transient confidence). These patterns persist across models and training stages, suggesting they reflect intrinsic properties of reasoning failure rather than superficial noise. To formalize this observation, we introduce the Entropy Dynamics Instability Score (EDIS), a trajectory-level metric quantifying instability in entropy evolution. EDIS serves as an effective diagnostic signal for inference-time selection, substantially improving reasoning accuracy, and offers a promising direction for training-time sample curation. Our findings establish entropy dynamics as an underexplored yet informative lens for understanding and improving LLM reasoning.
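The exact EDIS formula is not given in the abstract, but its ingredients (entropy trajectories, burst spikes, peak-valley spikes) suggest a proxy along the following lines; treat every detail below as an assumption rather than the paper's definition:

```python
import torch
import torch.nn.functional as F

def entropy_trajectory(logits: torch.Tensor) -> torch.Tensor:
    """Token-level entropies H_t for one generated sequence; logits: (T, V)."""
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(-1)

def instability_score(h: torch.Tensor, window: int = 5) -> float:
    """Hypothetical proxy: penalize sustained entropy growth (burst spikes)
    and sharp rises right after drops (peak-valley spikes)."""
    d = h[1:] - h[:-1]                                    # entropy increments
    burst = F.avg_pool1d(d.clamp(min=0).view(1, 1, -1), window, 1).max()
    rebound = (d[1:].clamp(min=0) * (-d[:-1]).clamp(min=0)).max()
    return (burst + rebound).item()

h = entropy_trajectory(torch.randn(64, 32000))            # toy trajectory
print(instability_score(h))
```

For inference-time selection, one would sample several candidate solutions and keep the one with the lowest score.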
[LG-121] Richer Bayesian Last Layers with Subsampled NTK Features
链接: https://arxiv.org/abs/2602.01279
作者: Sergio Calvo-Ordoñez,Jonathan Plenk,Richard Bergna,Álvaro Cartea,Yarin Gal,Jose Miguel Hernández-Lobato,Kamil Ciosek
类目: Machine Learning (cs.LG)
*备注: Preprint, work in progress
Abstract:Bayesian Last Layers (BLLs) provide a convenient and computationally efficient way to estimate uncertainty in neural networks. However, they underestimate epistemic uncertainty because they apply a Bayesian treatment only to the final layer, ignoring uncertainty induced by earlier layers. We propose a method that improves BLLs by leveraging a projection of Neural Tangent Kernel (NTK) features onto the space spanned by the last-layer features. This enables posterior inference that accounts for variability of the full network while retaining the low computational cost of inference of a standard BLL. We show that our method yields posterior variances that are provably greater than or equal to those of a standard BLL, correcting its tendency to underestimate epistemic uncertainty. To further reduce computational cost, we introduce a uniform subsampling scheme for estimating the projection matrix and for posterior inference. We derive approximation bounds for both types of subsampling. Empirical evaluations on UCI regression, contextual bandits, image classification, and out-of-distribution detection tasks in image and tabular datasets demonstrate improved calibration and uncertainty estimates compared to standard BLLs and competitive baselines, while reducing computational cost.
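For context, a standard BLL is just Bayesian linear regression on the frozen last-layer features; the paper's contribution is enlarging the feature matrix with projected, subsampled NTK features so the posterior reflects the whole network. A sketch of the baseline posterior:

```python
import numpy as np

def bll_posterior(Phi: np.ndarray, y: np.ndarray, sigma_n=0.1, sigma_w=1.0):
    """Gaussian posterior over last-layer weights given frozen features
    Phi (n, d): standard Bayesian linear regression."""
    A = Phi.T @ Phi / sigma_n**2 + np.eye(Phi.shape[1]) / sigma_w**2
    cov = np.linalg.inv(A)
    mean = cov @ Phi.T @ y / sigma_n**2
    return mean, cov

# Epistemic predictive variance at a test point: phi_star @ cov @ phi_star.
# The paper augments Phi with (subsampled) NTK features projected onto the
# last-layer span, which provably does not shrink this variance.
mean, cov = bll_posterior(np.random.randn(200, 16), np.random.randn(200))
```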
[LG-122] From Intents to Actions: Agentic AI in Autonomous Networks
链接: https://arxiv.org/abs/2602.01271
作者: Burak Demirel,Pablo Soldati,Yu Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Telecommunication networks are increasingly expected to operate autonomously while supporting heterogeneous services with diverse and often conflicting intents – that is, performance objectives, constraints, and requirements specific to each service. However, transforming high-level intents – such as ultra-low latency, high throughput, or energy efficiency – into concrete control actions (i.e., low-level actuator commands) remains beyond the capability of existing heuristic approaches. This work introduces an Agentic AI system for intent-driven autonomous networks, structured around three specialized agents. A supervisory interpreter agent, powered by language models, performs both lexical parsing of intents into executable optimization templates and cognitive refinement based on feedback, constraint feasibility, and evolving network conditions. An optimizer agent converts these templates into tractable optimization problems, analyzes trade-offs, and derives preferences across objectives. Lastly, a preference-driven controller agent, based on multi-objective reinforcement learning, leverages these preferences to operate near the Pareto frontier of network performance that best satisfies the original intent. Collectively, these agents enable networks to autonomously interpret, reason over, adapt to, and act upon diverse intents and network conditions in a scalable manner.
[LG-123] Mixture-of-World Models: Scaling Multi-Task Reinforcement Learning with Modular Latent Dynamics
链接: https://arxiv.org/abs/2602.01270
作者: Boxuan Zhang,Weipu Zhang,Zhaohan Feng,Wei Xiao,Jian Sun,Jie Chen,Gang Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:A fundamental challenge in multi-task reinforcement learning (MTRL) is achieving sample efficiency in visual domains where tasks exhibit substantial heterogeneity in both observations and dynamics. Model-based reinforcement learning offers a promising path to improved sample efficiency through world models, but standard monolithic architectures struggle to capture diverse task dynamics, resulting in poor reconstruction and prediction accuracy. We introduce Mixture-of-World Models (MoW), a scalable architecture that combines modular variational autoencoders for task-adaptive visual compression, a hybrid Transformer-based dynamics model with task-conditioned experts and a shared backbone, and a gradient-based task clustering strategy for efficient parameter allocation. On the Atari 100k benchmark, a single MoW agent trained once on 26 Atari games achieves a mean human-normalized score of 110.4%, competitive with the score of 114.2% achieved by STORM, an ensemble of 26 task-specific models, while using 50% fewer parameters. On Meta-World, MoW achieves a 74.5% average success rate within 300 thousand environment steps, establishing a new state of the art. These results demonstrate that MoW provides a scalable and parameter-efficient foundation for generalist world models.
[LG-124] Diving into Kronecker Adapters: Component Design Matters
链接: https://arxiv.org/abs/2602.01267
作者: Jiayu Bai,Danchen Yu,Zhenyu Liao,TianQi Hou,Feng Zhou,Robert C. Qiu,Zenan Ling
类目: Machine Learning (cs.LG)
*备注:
Abstract:Kronecker adapters have emerged as a promising approach for fine-tuning large-scale models, enabling high-rank updates through tunable component structures. However, existing work largely treats the component structure as a fixed or heuristic design choice, leaving the dimensions and number of Kronecker components underexplored. In this paper, we identify component structure as a key factor governing the capacity of Kronecker adapters. We perform a fine-grained analysis of both the dimensions and number of Kronecker components. In particular, we show that the alignment between Kronecker adapters and full fine-tuning depends on component configurations. Guided by these insights, we propose Component Designed Kronecker Adapters (CDKA). We further provide parameter-budget-aware configuration guidelines and a tailored training stabilization strategy for practical deployment. Experiments across various natural language processing tasks demonstrate the effectiveness of CDKA. Code is available at this https URL.
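To see why component dimensions matter, note that a Kronecker update Delta W = A ⊗ B with A of shape (p, q) and B of shape (r, s) costs only p·q + r·s parameters yet reaches rank(A)·rank(B). A tiny illustration (the shapes are arbitrary, and CDKA's actual configuration rules are not reproduced here):

```python
import torch

# A 1024x1024 update built from two 32x32 components: 2 * 1024 parameters
# instead of ~1M, with rank up to 32 * 32 = 1024, i.e., potentially full rank
# (unlike a rank-r LoRA update at comparable parameter cost).
A = torch.randn(32, 32)
B = torch.randn(32, 32)
delta_W = torch.kron(A, B)
print(delta_W.shape, torch.linalg.matrix_rank(delta_W))
```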
[LG-125] BicKD: Bilateral Contrastive Knowledge Distillation
链接: https://arxiv.org/abs/2602.01265
作者: Jiangnan Zhu,Yukai Xu,Li Xiong,Yixuan Liu,Junxu Liu,Hong kyu Lee,Yujie Gu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Knowledge distillation (KD) is a machine learning framework that transfers knowledge from a teacher model to a student model. The vanilla KD proposed by Hinton et al. has been the dominant approach in logit-based distillation and demonstrates compelling performance. However, it only performs sample-wise probability alignment between the teacher's and student's predictions, lacking a mechanism for class-wise comparison. Besides, vanilla KD imposes no structural constraint on the probability space. In this work, we propose a simple yet effective methodology, bilateral contrastive knowledge distillation (BicKD). This approach introduces a novel bilateral contrastive loss, which intensifies the orthogonality among different class generalization spaces while preserving consistency within the same class. The bilateral formulation enables explicit comparison of both sample-wise and class-wise prediction patterns between teacher and student. By emphasizing probabilistic orthogonality, BicKD further regularizes the geometric structure of the predictive distribution. Extensive experiments show that our BicKD method enhances knowledge transfer and consistently outperforms state-of-the-art knowledge distillation techniques across various model architectures and benchmarks.
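As a reference point, the vanilla (Hinton et al.) KD loss that BicKD builds on is the sample-wise KL divergence between temperature-softened predictions; the bilateral contrastive term itself is not specified in the abstract, so only the baseline is sketched:

```python
import torch
import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, T: float = 4.0):
    """Hinton-style KD: sample-wise KL between temperature-softened
    predictions, scaled by T^2 to keep gradient magnitudes comparable."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

print(vanilla_kd_loss(torch.randn(16, 10), torch.randn(16, 10)))
```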
[LG-126] Unraveling the Hidden Dynamical Structure in Recurrent Neural Policies
链接: https://arxiv.org/abs/2602.01196
作者: Jin Li,Yue Wu,Mengsha Huang,Yuhao Sun,Hao He,Xianyuan Zhan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recurrent neural policies are widely used in partially observable control and meta-RL tasks. Their ability to maintain internal memory and adapt quickly to unseen scenarios gives them unparalleled performance compared to non-recurrent counterparts. However, to date, the underlying mechanisms behind their superior generalization and robustness remain poorly understood. In this study, by analyzing the hidden state domain of recurrent policies learned over a diverse set of training methods, model architectures, and tasks, we find that stable cyclic structures consistently emerge during interaction with the environment. Such cyclic structures share a remarkable similarity with limit cycles in dynamical system analysis, if we consider the policy and the environment as a joint hybrid dynamical system. Moreover, we uncover that the geometry of such limit cycles also has a structured correspondence with the policies' behaviors. These findings offer new perspectives to explain many desirable properties of recurrent policies: the emergence of limit cycles stabilizes both the policies' internal memory and the task-relevant environmental states, while suppressing nuisance variability arising from environmental uncertainty; the geometry of limit cycles also encodes relational structures of behaviors, facilitating easier skill adaptation when facing non-stationary environments.
[LG-127] Analyzing and Improving Diffusion Models for Time-Series Data Imputation: A Proximal Recursion Perspective
链接: https://arxiv.org/abs/2602.01182
作者: Zhichao Chen,Hao Wang,Fangyikang Wang,Licheng Pan,Zhengnan Li,Yunfei Teng,Haoxuan Li,Zhouchen Lin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Diffusion models (DMs) have shown promise for Time-Series Data Imputation (TSDI); however, their performance remains inconsistent in complex scenarios. We attribute this to two primary obstacles: (1) non-stationary temporal dynamics, which can bias the inference trajectory and lead to outlier-sensitive imputations; and (2) objective inconsistency, since imputation favors accurate pointwise recovery whereas DMs are inherently trained to generate diverse samples. To better understand these issues, we analyze the DM-based TSDI process through a proximal-operator perspective and uncover that an implicit Wasserstein distance regularization inherent in the process hinders the model's ability to counteract non-stationarity and acts as a dissipative regularizer, thereby amplifying diversity at the expense of fidelity. Building on this insight, we propose a novel framework called SPIRIT (Semi-Proximal Transport Regularized time-series Imputation). Specifically, we introduce an entropy-induced Bregman divergence to relax the mass-preserving constraint in the Wasserstein distance, formulate the semi-proximal transport (SPT) discrepancy, and theoretically prove the robustness of SPT against non-stationarity. Subsequently, we remove the dissipative structure and derive the complete SPIRIT workflow, with SPT serving as the proximal operator. Extensive experiments demonstrate the effectiveness of the proposed SPIRIT approach.
[LG-128] Rethinking the Flow-Based Gradual Domain Adaption: A Semi-Dual Optimal Transport Perspective
链接: https://arxiv.org/abs/2602.01179
作者: Zhichao Chen,Zhan Zhuang,Yunfei Teng,Hao Wang,Fangyikang Wang,Zhengnan Li,Tianqiao Liu,Haoxuan Li,Zhouchen Lin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Gradual domain adaptation (GDA) aims to mitigate domain shift by progressively adapting models from the source domain to the target domain via intermediate domains. However, real intermediate domains are often unavailable or ineffective, necessitating the synthesis of intermediate samples. Flow-based models have recently been used for this purpose by interpolating between source and target distributions; however, their training typically relies on sample-based log-likelihood estimation, which can discard useful information and thus degrade GDA performance. The key to addressing this limitation is constructing the intermediate domains via samples directly. To this end, we propose an Entropy-regularized Semi-dual Unbalanced Optimal Transport (E-SUOT) framework to construct intermediate domains. Specifically, we reformulate flow-based GDA as a Lagrangian dual problem and derive an equivalent semi-dual objective that circumvents the need for likelihood estimation. However, the dual problem leads to an unstable min-max training procedure. To alleviate this issue, we further introduce entropy regularization to convert it into a more stable alternative optimization procedure. Based on this, we propose a novel GDA training framework and provide theoretical analysis in terms of stability and generalization. Finally, extensive experiments are conducted to demonstrate the efficacy of the E-SUOT framework.
[LG-129] Multi-Fidelity Physics-Informed Neural Networks with Bayesian Uncertainty Quantification and Adaptive Residual Learning for Efficient Solution of Parametric Partial Differential Equations
链接: https://arxiv.org/abs/2602.01176
作者: Olaf Yunus Laitinen Imanov
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
*备注: 8 pages, 4 figures, 6 tables
Abstract:Physics-informed neural networks (PINNs) have emerged as a powerful paradigm for solving partial differential equations (PDEs) by embedding physical laws directly into neural network training. However, solving high-fidelity PDEs remains computationally prohibitive, particularly for parametric systems requiring multiple evaluations across varying parameter configurations. This paper presents MF-BPINN, a novel multi-fidelity framework that synergistically combines physics-informed neural networks with Bayesian uncertainty quantification and adaptive residual learning. Our approach leverages abundant low-fidelity simulations alongside sparse high-fidelity data through a hierarchical neural architecture that learns nonlinear correlations across fidelity levels. We introduce an adaptive residual network with learnable gating mechanisms that dynamically balances linear and nonlinear fidelity discrepancies. Furthermore, we develop a rigorous Bayesian framework employing Hamiltonian Monte Carlo.
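The PINN part of the framework penalizes the PDE residual computed by automatic differentiation. A minimal sketch for an illustrative 1D Burgers-type equation (the paper's parametric PDEs, multi-fidelity coupling, and HMC posterior are not shown):

```python
import torch

def pde_residual_loss(model, x, t, nu=0.01):
    """Residual of the 1D viscous Burgers equation u_t + u u_x - nu u_xx = 0,
    computed with autograd and added to the training loss."""
    x = x.clone().requires_grad_(True)
    t = t.clone().requires_grad_(True)
    u = model(torch.stack([x, t], dim=-1)).squeeze(-1)
    grad = lambda out, inp: torch.autograd.grad(out.sum(), inp, create_graph=True)[0]
    u_t, u_x = grad(u, t), grad(u, x)
    u_xx = grad(u_x, x)
    return ((u_t + u * u_x - nu * u_xx) ** 2).mean()

model = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(),
                            torch.nn.Linear(64, 1))
print(pde_residual_loss(model, torch.rand(128), torch.rand(128)))
```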
[LG-130] PolicyFlow: Policy Optimization with Continuous Normalizing Flow in Reinforcement Learning ICLR2026
链接: https://arxiv.org/abs/2602.01156
作者: Shunpeng Yang,Ben Liu,Hua Chen
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Submitted to ICLR 2026
Abstract:Among on-policy reinforcement learning algorithms, Proximal Policy Optimization (PPO) is widely favored for its simplicity, numerical stability, and strong empirical performance. Standard PPO relies on surrogate objectives defined via importance ratios, which require evaluating policy likelihoods; this is typically straightforward when the policy is modeled as a Gaussian distribution. However, extending PPO to more expressive, high-capacity policy models such as continuous normalizing flows (CNFs), also known as flow-matching models, is challenging because likelihood evaluation along the full flow trajectory is computationally expensive and often numerically unstable. To resolve this issue, we propose PolicyFlow, a novel on-policy CNF-based reinforcement learning algorithm that integrates expressive CNF policies with PPO-style objectives without requiring likelihood evaluation along the full flow path. PolicyFlow approximates importance ratios using velocity field variations along a simple interpolation path, reducing computational overhead without compromising training stability. To prevent mode collapse and further encourage diverse behaviors, we propose the Brownian Regularizer, an implicit policy entropy regularizer inspired by Brownian motion, which is conceptually elegant and computationally lightweight. Experiments on diverse tasks across various environments, including MultiGoal, PointMaze, IsaacLab, and MuJoCo Playground, show that PolicyFlow achieves competitive or superior performance compared to PPO using Gaussian policies and flow-based baselines including FPO and DPPO. Notably, results on MultiGoal highlight PolicyFlow's ability to capture richer multimodal action distributions.
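For reference, the PPO-style clipped surrogate that PolicyFlow retains looks as follows; the paper's novelty lies in how logp_new - logp_old is approximated for a CNF policy (via velocity-field variations along an interpolation path) rather than computed exactly:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO clipped surrogate (to be minimized)."""
    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```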
[LG-131] Generalized Radius and Integrated Codebook Transforms for Differentiable Vector Quantization
链接: https://arxiv.org/abs/2602.01140
作者: Haochen You,Heng Zhang,Hongyang He,Yuqi Li,Baojing Liu
类目: Machine Learning (cs.LG)
*备注: This paper has been accepted as a conference paper at CPAL 2026
Abstract:Vector quantization (VQ) underpins modern generative and representation models by turning continuous latents into discrete tokens. Yet hard nearest-neighbor assignments are non-differentiable and are typically optimized with heuristic straight-through estimators, which couple the update step size to the quantization gap and train each code in isolation, leading to unstable gradients and severe codebook under-utilization at scale. In this paper, we introduce GRIT-VQ (Generalized Radius and Integrated Transform-Vector Quantization), a unified surrogate framework that keeps hard assignments in the forward pass while making VQ fully differentiable. GRIT-VQ replaces the straight-through estimator with a radius-based update that moves latents along the quantization direction with a controllable, geometry-aware step, and applies a data-agnostic integrated transform to the codebook so that all codes are updated through shared parameters instead of independently. Our theoretical analysis clarifies the fundamental optimization dynamics introduced by GRIT-VQ, establishing conditions for stable gradient flow, coordinated codebook evolution, and reliable avoidance of collapse across a broad family of quantizers. Across image reconstruction, image generation, and recommendation tokenization benchmarks, GRIT-VQ consistently improves reconstruction error, generative quality, and recommendation accuracy while substantially increasing codebook utilization compared to existing VQ variants.
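The baseline that GRIT-VQ replaces is the straight-through estimator, sketched below; note how the forward/backward mismatch couples the effective update to the quantization gap q - z, which is exactly the instability the paper targets:

```python
import torch

def vq_ste(z: torch.Tensor, codebook: torch.Tensor):
    """Hard nearest-neighbor assignment forward, identity gradient backward."""
    idx = torch.cdist(z, codebook).argmin(dim=-1)      # (n,) code indices
    q = codebook[idx]                                  # (n, d) quantized latents
    z_q = z + (q - z).detach()      # forward pass sees q; gradients flow to z
    commit_loss = ((z - q.detach()) ** 2).mean()       # pulls z toward its code
    return z_q, idx, commit_loss
```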
[LG-132] Key Principles of Graph Machine Learning: Representation, Robustness and Generalization
链接: https://arxiv.org/abs/2602.01139
作者: Yassine Abbahaddou
类目: Machine Learning (cs.LG)
*备注: PhD Thesis
Abstract:Graph Neural Networks (GNNs) have emerged as powerful tools for learning representations from structured data. Despite their growing popularity and success across various applications, GNNs encounter several challenges that limit their performance in terms of generalization, robustness to adversarial perturbations, and the effectiveness of their representation learning capabilities. In this dissertation, I investigate these core aspects through three main contributions: (1) developing new representation learning techniques based on Graph Shift Operators (GSOs), aiming for enhanced performance across various contexts and applications, (2) introducing generalization-enhancing methods through graph data augmentation, and (3) developing more robust GNNs by leveraging orthonormalization techniques and noise-based defenses against adversarial attacks. By addressing these challenges, my work provides a more principled understanding of the limitations and potential of GNNs.
[LG-133] Self-Generative Adversarial Fine-Tuning for Large Language Models
链接: https://arxiv.org/abs/2602.01137
作者: Shiguang Wu,Yaqing Wang,Quanming Yao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Fine-tuning large language models (LLMs) for alignment typically relies on supervised fine-tuning or reinforcement learning from human feedback, both limited by the cost and scarcity of high-quality annotations. Recent self-play and synthetic data approaches reduce this dependence but often rely on heuristic assumptions or ungrounded self-evaluation, which can cause bias accumulation and performance drift. In this paper, we propose Self-Generative Adversarial LLM (SGALM), a unified fine-tuning framework that formulates alignment as a generative adversarial game within a single LLM. SGALM jointly evolves generation and discrimination capabilities without external reward models. Theoretical and empirical results demonstrate that SGALM achieves state-of-the-art performance, serves as an effective alignment algorithm and a robust synthetic data engine.
[LG-134] A Unified Matrix-Spectral Framework for Stability and Interpretability in Deep Learning
链接: https://arxiv.org/abs/2602.01136
作者: Ronald Katende
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Optimization and Control (math.OC)
*备注: 11 pages
Abstract:We develop a unified matrix-spectral framework for analyzing stability and interpretability in deep neural networks. Representing networks as data-dependent products of linear operators reveals spectral quantities governing sensitivity to input perturbations, label noise, and training dynamics. We introduce a Global Matrix Stability Index that aggregates spectral information from Jacobians, parameter gradients, Neural Tangent Kernel operators, and loss Hessians into a single stability scale controlling forward sensitivity, attribution robustness, and optimization conditioning. We further show that spectral entropy refines classical operator-norm bounds by capturing typical, rather than purely worst-case, sensitivity. These quantities yield computable diagnostics and stability-oriented regularization principles. Synthetic experiments and controlled studies on MNIST, CIFAR-10, and CIFAR-100 confirm that modest spectral regularization substantially improves attribution stability even when global spectral summaries change little. The results establish a precise connection between spectral concentration and analytic stability, providing practical guidance for robustness-aware model design and training.
[LG-135] TRACE: Scalable Amortized Causal Discovery from Single Sequences via Autoregressive Density Estimation
链接: https://arxiv.org/abs/2602.01135
作者: Hugo Math,Rainer Lienhart
类目: Machine Learning (cs.LG)
*备注: 8 pages, 6 figures,
Abstract:We study causal discovery from a single observed sequence of discrete events generated by a stochastic process, as encountered in vehicle logs, manufacturing systems, or patient trajectories. This regime is particularly challenging due to the absence of repeated samples, high dimensionality, and long-range temporal dependencies of the single observation during inference. We introduce TRACE, a scalable framework that repurposes autoregressive models as pretrained density estimators for conditional mutual information estimation. TRACE infers the summary causal graph between event types in a sequence, scaling linearly with the event vocabulary and supporting delayed causal effects, while being fully parallel on GPUs. We establish its theoretical identifiability under imperfect autoregressive models. Experiments demonstrate robust performance against different baselines and across varying vocabulary sizes, including an application to root-cause analysis in vehicle diagnostics with over 29,100 event types.
[LG-136] Tangent Space Fine-Tuning for Directional Preference Alignment in Large Language Models
链接: https://arxiv.org/abs/2602.01128
作者: Mete Erdogan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Our goal is to enable large language models (LLMs) to balance multiple human preference dimensions; such as helpfulness, safety, and verbosity, through principled and controllable alignment. Existing preference optimization methods, including Direct Preference Optimization (DPO), collapse feedback into a single scalar reward, fixing one balance among objectives and preventing traversal of the Pareto front. Recent work by Ortiz-Jimenez et al. (2023) showed that fine-tuning can be viewed in a model’s tangent space, where linearized updates act as additive vectors that can be composed to jointly perform well on multiple tasks. Building on this formulation, we extend this idea to preference alignment and propose Tangent-Space Direct Preference Optimization (TS-DPO), which performs DPO within this locally linear regime to learn per-objective update directions. These directions can be linearly combined at inference to generate user-specified behaviors without additional optimization. Evaluated on the helpfulness-verbosity trade-off using the HelpSteer and UltraFeedback datasets, TS-DPO achieves broader Pareto-optimal coverage and smoother preference control than scalarized DPO. Canonical Correlation Analysis (CCA) further shows that tangent-space training amplifies canonical directions aligned with distinct preferences, improving disentanglement.
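The inference-time composition the abstract describes amounts to adding a weighted sum of per-objective parameter deltas to the base weights. A sketch (the tensor-dict representation and helper name are assumptions, not the paper's API):

```python
import torch

@torch.no_grad()
def compose_preferences(base_state, direction_states, alphas):
    """theta = theta_0 + sum_i alpha_i * tau_i, where each tau_i is a dict of
    parameter deltas learned for one preference dimension."""
    out = {k: v.clone() for k, v in base_state.items()}
    for alpha, tau in zip(alphas, direction_states):
        for k in out:
            out[k] += alpha * tau[k]
    return out

# e.g., a 70/30 helpfulness-verbosity mix, chosen at inference time:
# model.load_state_dict(compose_preferences(theta0, [tau_help, tau_verb], (0.7, 0.3)))
```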
[LG-137] WinFLoRA: Incentivizing Client-Adaptive Aggregation in Federated LoRA under Privacy Heterogeneity
链接: https://arxiv.org/abs/2602.01126
作者: Mengsha Kou,Xiaoyu Xia,Ziqi Wang,Ibrahim Khalil,Runkun Luo,Jingwen Zhou,Minhui Xue
类目: Machine Learning (cs.LG)
*备注: 12 pages
Abstract:Large Language Models (LLMs) increasingly underpin intelligent web applications, from chatbots to search and recommendation, where efficient specialization is essential. Low-Rank Adaptation (LoRA) enables such adaptation with minimal overhead, while federated LoRA allows web service providers to fine-tune shared models without data sharing. However, in privacy-sensitive deployments, clients inject varying levels of differential privacy (DP) noise, creating privacy heterogeneity that misaligns individual incentives and global performance. In this paper, we propose WinFLoRA, a federated LoRA framework for privacy-heterogeneous settings that uses noise-aware aggregation weights as incentives. Specifically, the noise levels of clients are estimated from the uploaded LoRA adapters. A larger weight indicates greater influence on the global model and better downstream task performance, rewarding lower-noise contributions. By up-weighting low-noise updates, WinFLoRA improves global accuracy while accommodating clients' heterogeneous privacy requirements. Consequently, WinFLoRA aligns heterogeneous client utility, in terms of privacy and downstream performance, with global model objectives without third-party involvement. Extensive evaluations demonstrate that across multiple LLMs and datasets, WinFLoRA achieves up to 52.58% higher global accuracy and up to 2.56x the client utility of state-of-the-art benchmarks. Source code is publicly available at this https URL.
[LG-138] ChronoSpike: An Adaptive Spiking Graph Neural Network for Dynamic Graphs
链接: https://arxiv.org/abs/2602.01124
作者: Md Abrar Jahin,Taufikur Rahman Fuad,Jay Pujara,Craig Knoblock
类目: Machine Learning (cs.LG)
*备注:
Abstract:Dynamic graph representation learning requires capturing both structural relationships and temporal evolution, yet existing approaches face a fundamental trade-off: attention-based methods achieve expressiveness at O(T^2) complexity, while recurrent architectures suffer from gradient pathologies and dense state storage. Spiking neural networks offer event-driven efficiency but remain limited by sequential propagation, binary information loss, and local aggregation that misses global context. We propose ChronoSpike, an adaptive spiking graph neural network that integrates learnable LIF neurons with per-channel membrane dynamics, multi-head attentive spatial aggregation on continuous features, and a lightweight Transformer temporal encoder, enabling both fine-grained local modeling and long-range dependency capture with linear memory complexity O(T \cdot d). On three large-scale benchmarks, ChronoSpike outperforms twelve state-of-the-art baselines by 2.0% Macro-F1 and 2.4% Micro-F1 while achieving 3-10x faster training than recurrent methods with a constant 105K-parameter budget independent of graph size. We provide theoretical guarantees for membrane potential boundedness, gradient flow stability under contraction factor \rho < 1, and BIBO stability; interpretability analyses reveal heterogeneous temporal receptive fields and a learned primacy effect with 83-88% sparsity.
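A minimal sketch of the "learnable LIF neurons with per-channel membrane dynamics" ingredient; the threshold handling, reset rule, and surrogate gradient below are assumptions, since the abstract does not specify them:

```python
import torch
import torch.nn as nn

class LearnableLIF(nn.Module):
    """LIF neurons with a learnable per-channel membrane decay."""
    def __init__(self, channels: int, v_th: float = 1.0):
        super().__init__()
        self.beta_raw = nn.Parameter(torch.zeros(channels))  # one decay per channel
        self.v_th = v_th

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (T, batch, channels)
        beta = torch.sigmoid(self.beta_raw)               # keep decay in (0, 1)
        v = torch.zeros_like(x[0])
        spikes = []
        for t in range(x.shape[0]):
            v = beta * v + x[t]                           # leaky integration
            s = (v >= self.v_th).float()                  # a surrogate grad goes here
            v = v - s * self.v_th                         # soft reset
            spikes.append(s)
        return torch.stack(spikes)

print(LearnableLIF(8)(torch.randn(16, 4, 8)).shape)       # (16, 4, 8) binary spikes
```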
[LG-139] Single-Edge Node Injection Threats to GNN-Based Security Monitoring in Industrial Graph Systems
链接: https://arxiv.org/abs/2602.01113
作者: Wenjie Liang,Ranhui Yan,Jia Cai,You-Gan Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph neural networks (GNNs) are increasingly adopted in industrial graph-based monitoring systems (e.g., Industrial Internet of Things (IIoT) device graphs, power-grid topology models, and manufacturing communication networks) to support anomaly detection, state estimation, and asset classification. In such settings, an adversary that compromises a small number of edge devices may inject counterfeit nodes (e.g., rogue sensors, virtualized endpoints, or spoofed substations) to bias downstream decisions while evading topology- and homophily-based sanitization. This paper formulates deployment-oriented node-injection attacks under constrained resources and proposes the Single-Edge Graph Injection Attack (SEGIA), in which each injected node attaches to the operational graph through a single edge. SEGIA integrates a pruned SGC surrogate, multi-hop neighborhood sampling, and reverse graph convolution-based feature synthesis with a similarity-regularized objective to preserve local homophily and survive edge pruning. Theoretical analysis and extensive evaluations across datasets and defenses show at least 25% higher attack success than representative baselines under substantially smaller edge budgets. These results indicate a system-level risk in industrial GNN deployments and motivate lightweight admission validation and neighborhood-consistency monitoring.
[LG-140] On the Expressive Power of Permutation-Equivariant Weight-Space Networks
链接: https://arxiv.org/abs/2602.01083
作者: Adir Dayan,Yam Eitan,Haggai Maron
类目: Machine Learning (cs.LG)
*备注:
Abstract:Weight-space learning studies neural architectures that operate directly on the parameters of other neural networks. Motivated by the growing availability of pretrained models, recent work has demonstrated the effectiveness of weight-space networks across a wide range of tasks. SOTA weight-space networks rely on permutation-equivariant designs to improve generalization. However, this may negatively affect expressive power, warranting theoretical investigation. Importantly, unlike other structured domains, weight-space learning targets maps operating on both weight and function spaces, making expressivity analysis particularly subtle. While a few prior works provide partial expressivity results, a comprehensive characterization is still missing. In this work, we address this gap by developing a systematic theory for expressivity of weight-space networks. We first prove that all prominent permutation-equivariant networks are equivalent in expressive power. We then establish universality in both weight- and function-space settings under mild, natural assumptions on the input weights, and characterize the edge-case regimes where universality no longer holds. Together, these results provide a strong and unified foundation for the expressivity of weight-space networks.
[LG-141] LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents
链接: https://arxiv.org/abs/2602.01053
作者: Hyesung Jeon,Hyeongju Ha,Jae-Joon Kim
类目: Machine Learning (cs.LG)
*备注: 23 pages, 9 figures, 19 tables
Abstract:Role specialization in multi-LLM agent systems is often realized via multi-LoRA, where agents share a pretrained backbone and differ only through lightweight adapters. Despite sharing base model weights, each agent independently builds and stores its own KV cache for the same long, tool-augmented trajectories, incurring substantial memory and compute overhead. Existing KV cache sharing methods largely overlook this multi-LoRA setting. We observe that, across agents, cache differences are dominated by adapter outputs, while activations from the shared pretrained backbone remain highly similar. Based on this observation, we propose LRAgent, a KV cache sharing framework for multi-LoRA agents that decomposes the cache into a shared base component from the pretrained weights and an adapter-dependent component from LoRA weights. LRAgent reduces memory overhead by sharing the base component and storing the adapter component in its inherent low-rank form, and further reduces compute overhead, enabled by shared-A multi-LoRA architectures, by also sharing the low-rank cache and avoiding redundant computations for contexts already processed by other agents. To efficiently reconstruct adapter contributions at runtime, we introduce Flash-LoRA-Attention, a kernel that reorders attention computation to avoid materializing the low-rank cache to full dimension. LRAgent achieves throughput and time-to-first-token latency close to fully shared caching, while preserving accuracy near the non-shared caching baseline across agentic question-answering benchmarks.
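The cache decomposition follows directly from the LoRA parameterization W_k + A B: the base term depends only on the shared backbone, while the adapter term can be stored at rank r. A sketch of the key path:

```python
import torch

def decomposed_keys(X, W_k, A, B):
    """K = X @ W_k + (X @ A) @ B. The base term is identical for all agents
    sharing the backbone (cache it once); the adapter term is kept at rank r.
    With shared-A architectures, X @ A can be shared across agents too."""
    base = X @ W_k        # (seq, d): shared component
    low_rank = X @ A      # (seq, r), r << d: stored instead of full per-agent K
    return base, low_rank # reconstruct on demand: K = base + low_rank @ B

seq, d, r = 2048, 4096, 16
X, W_k = torch.randn(seq, d), torch.randn(d, d)
A, B = torch.randn(d, r), torch.randn(r, d)
base, low = decomposed_keys(X, W_k, A, B)  # per-agent extra storage: seq x r
```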
[LG-142] SwiftRepertoire: Few-Shot Immune-Signature Synthesis via Dynamic Kernel Codes
链接: https://arxiv.org/abs/2602.01051
作者: Rong Fu,Wenxin Zhang,Muge Qi,Yang Li,Yabin Jin,Jiekai Wu,Jiaxuan Lu,Chunlei Meng,Youjin Wang,Zeli Su,Juntao Gao,Li Bao,Qi Zhao,Wei Luo,Simon Fong
类目: Machine Learning (cs.LG)
*备注: 19 pages, 8 figures, 8 tables
Abstract:Repertoire-level analysis of T cell receptors offers a biologically grounded signal for disease detection and immune monitoring, yet practical deployment is impeded by label sparsity, cohort heterogeneity, and the computational burden of adapting large encoders to new tasks. We introduce a framework that synthesizes compact task-specific parameterizations from a learned dictionary of prototypes conditioned on lightweight task descriptors derived from repertoire probes and pooled embedding statistics. This synthesis produces small adapter modules applied to a frozen pretrained backbone, enabling immediate adaptation to novel tasks with only a handful of support examples and without full model fine-tuning. The architecture preserves interpretability through motif-aware probes and a calibrated motif discovery pipeline that links predictive decisions to sequence-level signals. Together, these components yield a practical, sample-efficient, and interpretable pathway for translating repertoire-informed models into diverse clinical and research settings where labeled data are scarce and computational resources are constrained.
[LG-143] SFMP: Fine-Grained Hardware-Friendly and Search-Free Mixed-Precision Quantization for Large Language Models
链接: https://arxiv.org/abs/2602.01027
作者: Xin Nie,Haicheng Zhang,Liang Dong,Beining Feng,Jinhong Weng,Guiling Sun
类目: Machine Learning (cs.LG)
*备注: 24pages,17figures
Abstract:Mixed-precision quantization is a promising approach for compressing large language models under tight memory budgets. However, existing mixed-precision methods typically suffer from one of two limitations: they either rely on expensive discrete optimization to determine precision allocation, or introduce hardware inefficiencies due to irregular memory layouts. We propose SFMP, a search-free and hardware-friendly mixed-precision quantization framework for large language models. The framework is built upon four novel ideas: 1) Fractional bit-width, which extends the integer bit-width of a weight matrix to fractional values and turns discrete precision allocation into a continuous problem; 2) Block-wise mixed-precision, enabling fine-grained precision within weight matrices while remaining hardware-friendly; 3) Row-column weight reordering, which aggregates salient weights via row and column reordering, incurring only a small activation reordering overhead during inference; 4) Unified GEMM kernel, which supports mixed-precision GEMM at arbitrary average bit-widths. Extensive experiments demonstrate that SFMP outperforms state-of-the-art layer-wise mixed-precision methods under the same memory constraints, while significantly reducing quantization cost and improving inference efficiency. Code is available at this https URL
[LG-144] The Stacked Autoencoder Evolution Hypothesis
链接: https://arxiv.org/abs/2602.01026
作者: Hiroyuki Iizuka
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:
Abstract:This study introduces a novel theoretical framework, the Stacked Autoencoder Evolution Hypothesis, which proposes that biological evolutionary systems operate through multi-layered self-encoding and decoding processes, analogous to stacked autoencoders in deep learning. Rather than viewing evolution solely as gradual changes driven by mutation and selection, this hypothesis suggests that self-replication inherently compresses and reconstructs genetic information across hierarchical layers of abstraction. This layered structure enables evolutionary systems to explore diverse possibilities not only at the sequence level but also across progressively more abstract layers of representation, making it possible for even simple mutations to navigate these higher-order spaces. Such a mechanism may explain punctuated evolutionary patterns and changes that can appear as if they are goal-directed in natural evolution, by allowing mutations at deeper latent layers to trigger sudden, large-scale phenotypic shifts. To illustrate the plausibility of this mechanism, artificial chemistry simulations were conducted, demonstrating the spontaneous emergence of hierarchical autoencoder structures. This framework offers a new perspective on the informational dynamics underlying both continuous and discontinuous evolutionary change.
[LG-145] Predicting Anemia Among Under-Five Children in Nepal Using Machine Learning and Deep Learning ALT
链接: https://arxiv.org/abs/2602.01005
作者: Deepak Bastola,Pitambar Acharya,Dipak Dulal,Rabina Dhakal,Yang Li
类目: Machine Learning (cs.LG)
*备注: 13 pages and submission to Public Health Nutrition is in progress
Abstract:Childhood anemia remains a major public health challenge in Nepal and is associated with impaired growth, cognition, and increased morbidity. Using World Health Organization hemoglobin thresholds, we defined anemia status for children aged 6-59 months and formulated a binary classification task by grouping all anemia severities as anemic versus not anemic. We analyzed Nepal Demographic and Health Survey (NDHS 2022) microdata comprising 1,855 children and initially considered 48 candidate features spanning demographic, socioeconomic, maternal, and child health characteristics. To obtain a stable and substantiated feature set, we applied four feature selection techniques (Chi-square, mutual information, point-biserial correlation, and Boruta) and prioritized features supported by multi-method consensus. Five features (child age, recent fever, household size, maternal anemia, and parasite deworming) were consistently selected by all methods, while amenorrhea, ethnicity indicators, and provinces were frequently retained. We then compared eight traditional machine learning classifiers (LR, KNN, DT, RF, XGBoost, SVM, NB, LDA) with two deep learning models (DNN and TabNet) using standard evaluation metrics, emphasizing F1-score and recall due to class imbalance. Among all models, logistic regression attained the best recall (0.701) and the highest F1-score (0.649), while the DNN achieved the highest accuracy (0.709), and SVM yielded the strongest discrimination with the highest AUC (0.736). Overall, the results indicate that both machine learning and deep learning models can provide competitive anemia prediction, and that interpretable features such as child age, an infection proxy, maternal anemia, and deworming history are central for risk stratification and public health screening in Nepal.
[LG-146] Scalable Random Wavelet Features: Efficient Non-Stationary Kernel Approximation with Convergence Guarantees ICLR2026
链接: https://arxiv.org/abs/2602.00987
作者: Sawan Kumar,Souvik Chakraborty
类目: Machine Learning (cs.LG)
*备注: Accepted at ICLR 2026
Abstract:Modeling non-stationary processes, where statistical properties vary across the input domain, is a critical challenge in machine learning; yet most scalable methods rely on a simplifying assumption of stationarity. This forces a difficult trade-off: use expressive but computationally demanding models like Deep Gaussian Processes, or scalable but limited methods like Random Fourier Features (RFF). We close this gap by introducing Random Wavelet Features (RWF), a framework that constructs scalable, non-stationary kernel approximations by sampling from wavelet families. By harnessing the inherent localization and multi-resolution structure of wavelets, RWF generates an explicit feature map that captures complex, input-dependent patterns. Our framework provides a principled way to generalize RFF to the non-stationary setting and comes with a comprehensive theoretical analysis, including positive definiteness, unbiasedness, and uniform convergence guarantees. We demonstrate empirically on a range of challenging synthetic and real-world datasets that RWF outperforms stationary random features and offers a compelling accuracy-efficiency trade-off against more complex models, unlocking scalable and expressive kernel methods for a broad class of real-world non-stationary problems.
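To convey the flavor of a wavelet random feature map, here is a hedged sketch: sample directions, scales, and translations, then evaluate a mother wavelet. The sampling distributions, the Ricker (Mexican-hat) wavelet, and the normalization are all illustrative assumptions; the paper's actual construction and guarantees are not reproduced:

```python
import numpy as np

def random_wavelet_features(X, D=512, rng=np.random.default_rng(0)):
    """Map X (n, d) to D random wavelet features phi_j(x) = psi((w_j.x - t_j)/s_j)."""
    n, d = X.shape
    W = rng.standard_normal((d, D))           # random projection directions
    t = rng.uniform(-3, 3, D)                 # random translations
    s = rng.lognormal(0.0, 0.5, D)            # multi-resolution scales
    u = (X @ W - t) / s
    psi = (1 - u**2) * np.exp(-u**2 / 2)      # Ricker (Mexican-hat) wavelet
    return psi / np.sqrt(D)

Z = random_wavelet_features(np.random.randn(100, 5))
print(Z.shape)  # (100, 512); use Z in place of RFF features downstream
```

The localization in t and the spread of scales s are what let such features respond differently across the input domain, unlike stationary Fourier features.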
[LG-147] Forest-Guided Semantic Transport for Label-Supervised Manifold Alignment
链接: https://arxiv.org/abs/2602.00974
作者: Adrien Aumon,Myriam Lizotte,Guy Wolf,Kevin R. Moon,Jake S. Rhodes
类目: Machine Learning (cs.LG)
*备注:
Abstract:Label-supervised manifold alignment bridges the gap between unsupervised and correspondence-based paradigms by leveraging shared label information to align multimodal datasets. Still, most existing methods rely on Euclidean geometry to model intra-domain relationships. This approach can fail when features are only weakly related to the task of interest, leading to noisy, semantically misleading structure and degraded alignment quality. To address this limitation, we introduce FoSTA (Forest-guided Semantic Transport Alignment), a scalable alignment framework that leverages forest-induced geometry to denoise intra-domain structure and recover task-relevant manifolds prior to alignment. FoSTA builds semantic representations directly from label-informed forest affinities and aligns them via fast, hierarchical semantic transport, capturing meaningful cross-domain relationships. Extensive comparisons with established baselines demonstrate that FoSTA improves correspondence recovery and label transfer on synthetic benchmarks and delivers strong performance in practical single-cell applications, including batch correction and biological conservation.
[LG-148] On the Spectral Flattening of Quantized Embeddings
链接: https://arxiv.org/abs/2602.00969
作者: Junlin Huang,Wenyi Fang,Zhenheng Tang,Yuxin Wang,Xueze Kang,Yang Zheng,Bo Li,Xiaowen Chu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Training Large Language Models (LLMs) at ultra-low precision is critically impeded by instability rooted in the conflict between discrete quantization constraints and the intrinsic heavy-tailed spectral nature of linguistic data. By formalizing the connection between Zipfian statistics and random matrix theory, we prove that the power-law decay in the singular value spectra of embeddings is a fundamental requisite for semantic encoding. We derive theoretical bounds showing that uniform quantization introduces a noise floor that disproportionately truncates this spectral tail, which induces spectral flattening and a strictly provable increase in the stable rank of representations. Empirical validation across diverse architectures including GPT-2 and TinyLlama corroborates that this geometric degradation precipitates representational collapse. This work not only quantifies the spectral sensitivity of LLMs but also establishes spectral fidelity as a necessary condition for stable low-bit optimization.
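The stable-rank claim is easy to probe numerically: quantize a matrix with a decaying spectrum and compare ||M||_F^2 / ||M||_2^2 before and after. A small experiment consistent with (but not taken from) the paper:

```python
import numpy as np

def stable_rank(M):
    s = np.linalg.svd(M, compute_uv=False)
    return (s**2).sum() / s.max()**2          # ||M||_F^2 / ||M||_2^2

def uniform_quantize(M, bits=4):
    scale = np.abs(M).max() / (2**(bits - 1) - 1)
    return np.round(M / scale) * scale

# Columns scaled by a decaying profile give a heavy-tailed spectrum; the
# low-bit noise floor flattens the tail, raising the stable rank.
E = np.random.standard_normal((512, 256)) * np.linspace(1.0, 0.01, 256)
print(stable_rank(E), stable_rank(uniform_quantize(E, bits=3)))
```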
[LG-149] From drift to adaptation to the failed ML model: Transfer Learning in Industrial MLOps
链接: https://arxiv.org/abs/2602.00957
作者: Waqar Muhammad Ashraf,Talha Ansar,Fahad Ahmed,Jawad Hussain,Muhammad Mujtaba Abbas,Vivek Dua
类目: Machine Learning (cs.LG)
*备注:
Abstract:Model adaptation to the production environment is critical for reliable Machine Learning Operations (MLOps), yet little attention has been paid to developing a systematic framework for updating ML models when they fail under data drift. This paper compares transfer learning enabled model update strategies, including ensemble transfer learning (ETL), all-layers transfer learning (ALTL), and last-layer transfer learning (LLTL), for updating a failed feedforward artificial neural network (ANN) model. The flue gas differential pressure across the air preheater unit installed in a 660 MW thermal power plant is analyzed as a case study, since it mimics batch processes due to load cycling in the power plant. Updating the failed ANN model with the three transfer learning techniques reveals that ETL provides higher predictive accuracy for a batch size of 5 days than LLTL and ALTL. However, ALTL is found to be suitable for effectively updating a model trained on a large batch size (8 days). A mixed trend is observed in the computational requirements (hyperparameter tuning and model training) of the model update techniques across batch sizes. These fundamental and empirical insights from the batch process-based industrial case study can assist MLOps practitioners in adapting failed models to data drift for accurate monitoring of industrial processes.
[LG-150] SAGE: Agentic Framework for Interpretable and Clinically Translatable Computational Pathology Biomarker Discovery
链接: https://arxiv.org/abs/2602.00953
作者: Sahar Almahfouz Nasser,Juan Francisco Pesantez Borja,Jincheng Liu,Tanvir Hasan,Zenghan Wang,Suman Ghosh,Sandeep Manandhar,Shikhar Shiromani,Twisha Shah,Naoto Tokuyama,Anant Madabhushi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Despite significant progress in computational pathology, many AI models remain black-box and difficult to interpret, posing a major barrier to clinical adoption due to limited transparency and explainability. This has motivated continued interest in engineered image-based biomarkers, which offer greater interpretability but are often proposed based on anecdotal evidence or fragmented prior literature rather than systematic biological validation. We introduce SAGE (Structured Agentic system for hypothesis Generation and Evaluation), an agentic AI system designed to identify interpretable, engineered pathology biomarkers by grounding them in biological evidence. SAGE integrates literature-anchored reasoning with multimodal data analysis to correlate image-derived features with molecular biomarkers, such as gene expression, and clinically relevant outcomes. By coordinating specialized agents for biological contextualization and empirical hypothesis validation, SAGE prioritizes transparent, biologically supported biomarkers and advances the clinical translation of computational pathology.
[LG-151] Optimal Budgeted Adaptation of Large Language Models
链接: https://arxiv.org/abs/2602.00952
作者: Jing Wang,Jie Shen,Dean Foster,Zohar Karnin,Jeremy C Weiss
类目: Machine Learning (cs.LG)
*备注:
Abstract:The trade-off between labeled data availability and downstream accuracy remains a central challenge in fine-tuning large language models (LLMs). We propose a principled framework for \emph{budget-aware} supervised fine-tuning by casting LLM adaptation as a contextual Stackelberg game. In our formulation, the learner (leader) commits to a scoring policy and a label-querying strategy, while an adaptive environment (follower) selects challenging supervised alternatives in response. To explicitly address label efficiency, we incorporate a finite supervision budget directly into the learning objective. Our algorithm operates in the full-feedback regime and achieves $\tilde{O}(d\sqrt{T})$ regret under standard linear contextual assumptions. We extend the framework with a Largest-Latency-First (LLF) confidence gate that selectively queries labels, achieving a budget-aware regret bound of $\tilde{O}(\sqrt{dB} + c\sqrt{B})$ with $B=\beta T$.
[LG-152] Dynamic Prior Thompson Sampling for Cold-Start Exploration in Recommender Systems
链接: https://arxiv.org/abs/2602.00943
作者: Zhenyu Zhao,David Zhang,Ellie Zhao,Ehsan Saberian
类目: Machine Learning (cs.LG)
*备注:
Abstract:Cold-start exploration is a core challenge in large-scale recommender systems: new or data-sparse items must receive traffic to estimate value, but over-exploration harms users and wastes impressions. In practice, Thompson Sampling (TS) is often initialized with a uniform Beta(1,1) prior, implicitly assuming a 50% success rate for unseen items. When true base rates are far lower, this optimistic prior systematically over-allocates to weak items. The impact is amplified by batched policy updates and pipeline latency: for hours, newly launched items can remain effectively “no data,” so the prior dominates allocation before feedback is incorporated. We propose Dynamic Prior Thompson Sampling, a prior design that directly controls the probability that a new arm outcompetes the incumbent winner. Our key contribution is a closed-form quadratic solution for the prior mean that enforces $P(X_j > Y_k) = \epsilon$ at introduction time, making exploration intensity predictable and tunable while preserving TS Bayesian updates. Across Monte Carlo validation, offline batched simulations, and a large-scale online experiment on a thumbnail personalization system serving millions of users, dynamic priors deliver precise exploration control and improved efficiency versus a uniform-prior baseline.
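The abstract's key condition is that a new arm's prior should beat the incumbent with a chosen probability ε at introduction time. The paper derives a closed-form quadratic solution for the prior mean; the sketch below instead calibrates it numerically by bisection on a Monte Carlo estimate, which conveys the same target condition without assuming the paper's exact formula. The prior strength and the incumbent's posterior counts are illustrative placeholders.
```python
import numpy as np

rng = np.random.default_rng(0)

def p_new_beats_incumbent(mu, strength, inc_alpha, inc_beta, n_mc=200_000):
    """Monte Carlo estimate of P(X_new > Y_incumbent) under Beta priors."""
    a, b = mu * strength, (1 - mu) * strength
    x = rng.beta(a, b, n_mc)
    y = rng.beta(inc_alpha, inc_beta, n_mc)
    return (x > y).mean()

def calibrate_prior_mean(eps, strength, inc_alpha, inc_beta, tol=1e-3):
    """Bisection on the prior mean so that P(X_new > Y) ~= eps.

    A numerical stand-in for the paper's closed-form quadratic solution;
    it exploits that the win probability is increasing in the prior mean.
    """
    lo, hi = 1e-4, 1 - 1e-4
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if p_new_beats_incumbent(mid, strength, inc_alpha, inc_beta) < eps:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Incumbent with ~2% CTR observed over 10k impressions (hypothetical numbers).
mu0 = calibrate_prior_mean(eps=0.05, strength=20.0, inc_alpha=200, inc_beta=9800)
print(f"prior mean for new arms: {mu0:.4f}")  # far below the uniform prior's 0.5
```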
[LG-153] SALAAD: Sparse And Low-Rank Adaptation via ADMM
链接: https://arxiv.org/abs/2602.00942
作者: Hao Ma,Melis Ilayda Bal,Liang Zhang,Bingcong Li,Niao He,Melanie Zeilinger,Michael Muehlebach
类目: Machine Learning (cs.LG)
*备注:
Abstract:Modern large language models are increasingly deployed under compute and memory constraints, making flexible control of model capacity a central challenge. While sparse and low-rank structures naturally trade off capacity and performance, existing approaches often rely on heuristic designs that ignore layer and matrix heterogeneity or require model-specific architectural modifications. We propose SALAAD, a plug-and-play framework applicable to different model architectures that induces sparse and low-rank structures during training. By formulating structured weight learning under an augmented Lagrangian framework and introducing an adaptive controller that dynamically balances the training loss and structural constraints, SALAAD preserves the stability of standard training dynamics while enabling explicit control over the evolution of effective model capacity during training. Experiments across model scales show that SALAAD substantially reduces memory consumption during deployment while achieving performance comparable to ad-hoc methods. Moreover, a single training run yields a continuous spectrum of model capacities, enabling smooth and elastic deployment across diverse memory budgets without the need for retraining.
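The abstract does not spell out SALAAD's updates, but the flavor of an augmented-Lagrangian/ADMM scheme for inducing joint sparse and low-rank structure can be seen in the classic sparse-plus-low-rank splitting below, which alternates singular-value thresholding, soft thresholding, and a dual ascent step. Treat it as background on the optimization machinery, not as the paper's algorithm, which works during training and uses an adaptive controller.
```python
import numpy as np

def soft_threshold(X, tau):
    """Prox of the l1 norm: elementwise shrinkage toward zero."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Singular-value thresholding: the prox of the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def sparse_low_rank_admm(W, lam=0.05, rho=1.0, iters=100):
    """ADMM for min ||L||_* + lam * ||S||_1  s.t.  L + S = W.

    A textbook sparse-plus-low-rank split, shown only to illustrate the
    kind of alternating updates an augmented-Lagrangian scheme performs.
    """
    L = np.zeros_like(W); S = np.zeros_like(W); Y = np.zeros_like(W)
    for _ in range(iters):
        L = svt(W - S + Y / rho, 1.0 / rho)              # low-rank update
        S = soft_threshold(W - L + Y / rho, lam / rho)   # sparse update
        Y = Y + rho * (W - L - S)                        # dual ascent on L + S = W
    return L, S
```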
[LG-154] Beyond What Seems Necessary: Hidden Gains from Scaling Training-Time Reasoning Length under Outcome Supervision
链接: https://arxiv.org/abs/2602.00927
作者: Yihao Xue,Allan Zhang,Jianhao Huang,Amit Sahai,Baharan Mirzasoleiman
类目: Machine Learning (cs.LG)
*备注:
Abstract:Training LLMs to think and reason for longer has become a key ingredient in building state-of-the-art models that can solve complex problems previously out of reach. Recent efforts pursue this in different ways, such as RL fine-tuning to elicit long CoT or scaling latent reasoning through architectural recurrence. This makes reasoning length an important scaling knob. In this work, we identify a novel phenomenon (both theoretically and experimentally): under outcome-only supervision, out-of-distribution (OOD) performance can continue improving as training-time reasoning length (e.g., the token budget in RL, or the loop count in looped Transformers) increases, even after in-distribution (ID) performance has saturated. This suggests that robustness may require a larger budget than ID validation alone would indicate. We provide theoretical explanations via two mechanisms: (i) self-iteration can induce a stronger inductive bias in the hypothesis class, reshaping ID-optimal solutions in ways that improve OOD generalization; and (ii) when shortcut solutions that work for ID samples but not for OOD samples persist in the hypothesis class, regularization can reduce the learned solution’s reliance on these shortcuts as the number of self-iterations increases. We complement the theory with empirical evidence from two realizations of scaling training-time reasoning length: increasing the number of loops in looped Transformers on a synthetic task, and increasing token budgets during RL fine-tuning of LLMs on mathematical reasoning.
[LG-155] Early Classification of Time Series in Non-Stationary Cost Regimes
链接: https://arxiv.org/abs/2602.00918
作者: Aurélien Renault,Alexis Bondu,Antoine Cornuéjols,Vincent Lemaire
类目: Machine Learning (cs.LG)
*备注:
Abstract:Early Classification of Time Series (ECTS) addresses decision-making problems in which predictions must be made as early as possible while maintaining high accuracy. Most existing ECTS methods assume that the time-dependent decision costs governing the learning objective are known, fixed, and correctly specified. In practice, however, these costs are often uncertain and may change over time, leading to mismatches between training-time and deployment-time objectives. In this paper, we study ECTS under two practically relevant forms of cost non-stationarity: drift in the balance between misclassification and decision delay costs, and stochastic realizations of decision costs that deviate from the nominal training-time model. To address these challenges, we revisit representative ECTS approaches and adapt them to an online learning setting. Focusing on separable methods, we update only the triggering model during deployment, while keeping the classifier fixed. We propose several online adaptations and baselines, including bandit-based and RL-based approaches, and conduct controlled experiments on synthetic data to systematically evaluate robustness under cost non-stationarity. Our results demonstrate that online learning can effectively improve the robustness of ECTS methods to cost drift, with RL-based strategies exhibiting strong and stable performance across varying cost regimes.
[LG-156] Efficient Deep Learning for Medical Imaging: Bridging the Gap Between High-Performance AI and Clinical Deployment
链接: https://arxiv.org/abs/2602.00910
作者: Cuong Manh Nguyen,Truong-Son Hy
类目: Machine Learning (cs.LG)
*备注:
Abstract:Deep learning has revolutionized medical image analysis, playing a vital role in modern clinical applications. However, the deployment of large-scale models in real-world clinical settings remains challenging due to high computational costs, latency constraints, and patient data privacy concerns associated with cloud-based processing. To address these bottlenecks, this review provides a comprehensive synthesis of efficient and lightweight deep learning architectures specifically tailored for the medical domain. We categorize the landscape of modern efficient models into three primary streams: Convolutional Neural Networks (CNNs), Lightweight Transformers, and emerging Linear Complexity Models. Furthermore, we examine key model compression strategies (including pruning, quantization, knowledge distillation, and low-rank factorization) and evaluate their efficacy in maintaining diagnostic performance while reducing hardware requirements. By identifying current limitations and discussing the transition toward on-device intelligence, this review serves as a roadmap for researchers and practitioners aiming to bridge the gap between high-performance AI and resource-constrained clinical environments.
[LG-157] PyGALAX: An Open-Source Python Toolkit for Advanced Explainable Geospatial Machine Learning
链接: https://arxiv.org/abs/2602.00907
作者: Pingping Wang(1),Yihong Yuan(1),Lingcheng Li(2),Yongmei Lu(1) ((1) Department of Geography and Environmental Studies, Texas State University, USA, (2) Atmospheric, Climate, and Earth Sciences Division, Pacific Northwest National Laboratory, USA)
类目: Machine Learning (cs.LG)
*备注:
Abstract:PyGALAX is a Python package for geospatial analysis that integrates automated machine learning (AutoML) and explainable artificial intelligence (XAI) techniques to analyze spatial heterogeneity in both regression and classification tasks. It automatically selects and optimizes machine learning models for different geographic locations and contexts while maintaining interpretability through SHAP (SHapley Additive exPlanations) analysis. PyGALAX builds upon and improves the GALAX framework (Geospatial Analysis Leveraging AutoML and eXplainable AI), which has proven to outperform traditional geographically weighted regression (GWR) methods. Critical enhancements in PyGALAX over the original GALAX framework include automatic bandwidth selection and flexible kernel function selection, providing greater flexibility and robustness for spatial modeling across diverse datasets and research questions. PyGALAX not only inherits all the functionalities of the original GALAX framework but also packages them into an accessible, reproducible, and easily deployable Python toolkit while providing additional options for spatial modeling. It effectively addresses spatial non-stationarity and generates transparent insights into complex spatial relationships at both global and local scales, making advanced geospatial machine learning methods accessible to researchers and practitioners in geography, urban planning, environmental science, and related fields.
[LG-158] Domain-Adaptive and Scalable Dense Retrieval for Content-Based Recommendation
链接: https://arxiv.org/abs/2602.00899
作者: Mritunjay Pandey(Aditya Birla Group)
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: 13 pages, 4 figures. Semantic dense retrieval for content-based recommendation on Amazon Reviews 2023 (Category - Fashion). Dataset statistics: 2.0M users; 825.9K items; 2.5M ratings; 94.9M review tokens; 510.5M metadata tokens. Timespan: May 1996 to September 2023. Metadata includes: user reviews (ratings, text, helpfulness votes, etc.); item metadata (descriptions, price, raw images, etc.)
Abstract:E-commerce recommendation and search commonly rely on sparse keyword matching (e.g., BM25), which breaks down under vocabulary mismatch when user intent has limited lexical overlap with product metadata. We cast content-based recommendation as recommendation-as-retrieval: given a natural-language intent signal (a query or review), retrieve the top-K most relevant items from a large catalog via semantic similarity. We present a scalable dense retrieval system based on a two-tower bi-encoder, fine-tuned on the Amazon Reviews 2023 (Fashion) subset using supervised contrastive learning with Multiple Negatives Ranking Loss. We construct training pairs from review text (as a query proxy) and item metadata (as the positive document) and fine-tune on 50,000 sampled interactions with a maximum sequence length of 500 tokens. For efficient serving, we combine FAISS HNSW indexing with an ONNX Runtime inference pipeline using INT8 dynamic quantization. On a review-to-title benchmark over 826,402 catalog items, our approach improves Recall@10 from 0.26 (BM25) to 0.66, while meeting practical latency and model-size constraints: 6.1 ms median CPU inference latency (batch size 1) and a 4x reduction in model size. Overall, we provide an end-to-end, reproducible blueprint for taking domain-adapted dense retrieval from offline training to CPU-efficient serving at catalog scale.
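The training and serving recipe in this abstract maps onto standard tooling. A condensed sketch with sentence-transformers and FAISS follows; the base checkpoint, batch size, and HNSW parameter are placeholders rather than the paper's settings, the toy pairs stand in for the review-metadata training data, and the INT8 ONNX export step is omitted.
```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
import numpy as np
import faiss

# Toy stand-ins for the paper's (review text, item metadata) training pairs.
pairs = [("soft running shoes for wide feet", "Acme FlexRun wide-fit running shoe"),
         ("warm waterproof winter jacket", "NorthPole insulated waterproof parka")]
item_texts = [doc for _, doc in pairs]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # placeholder base
train_examples = [InputExample(texts=[q, d]) for q, d in pairs]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)   # in-batch negatives for free
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)

# Serve with an HNSW index over the item embeddings.
emb = model.encode(item_texts, normalize_embeddings=True).astype(np.float32)
index = faiss.IndexHNSWFlat(emb.shape[1], 32)       # 32 links per node
index.add(emb)
query = model.encode(["waterproof jacket"], normalize_embeddings=True).astype(np.float32)
scores, ids = index.search(query, 1)                # top-1 item for the query
```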
[LG-159] RoDiF: Robust Direct Fine-Tuning of Diffusion Policies with Corrupted Human Feedback
链接: https://arxiv.org/abs/2602.00886
作者: Amitesh Vatsa,Zhixian Xie,Wanxin Jin
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Diffusion policies are a powerful paradigm for robotic control, but fine-tuning them with human preferences is fundamentally challenged by the multi-step structure of the denoising process. To overcome this, we introduce a Unified Markov Decision Process (MDP) formulation that coherently integrates the diffusion denoising chain with environmental dynamics, enabling reward-free Direct Preference Optimization (DPO) for diffusion policies. Building on this formulation, we propose RoDiF (Robust Direct Fine-Tuning), a method that explicitly addresses corrupted human preferences. RoDiF reinterprets the DPO objective through a geometric hypothesis-cutting perspective and employs a conservative cutting strategy to achieve robustness without assuming any specific noise distribution. Extensive experiments on long-horizon manipulation tasks show that RoDiF consistently outperforms state-of-the-art baselines, effectively steering pretrained diffusion policies of diverse architectures to human-preferred modes, while maintaining strong performance even under 30% corrupted preference labels.
[LG-160] Reliability-Aware Determinantal Point Processes for Robust Informative Data Selection in Large Language Models
链接: https://arxiv.org/abs/2602.00885
作者: Ahmad Sarlak,Abolfazl Razi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Informative data selection is a key requirement for large language models (LLMs) to minimize the amount of data required for fine-tuning, network distillation, and token pruning, enabling fast and efficient deployment, especially under computational and communication constraints. Traditional subset selection methods, including those based on Determinantal Point Processes (DPP), focus on maximizing diversity but assume that selected data batches are always available error-free. This presumption prohibits their use under partial storage outage, imperfect communication, and stochastic access failures. Furthermore, we show that the original formulation collapses under such conditions. To address this gap, we introduce ProbDPP, a novel reliability-aware implementation of k-DPP that accounts for probabilistic data access by recasting the objective function with a regularization term that remains well-posed and decomposes into a geometric diversity term and unreliability cost. The resulting objective facilitates robust selection of diverse data batches under uncertainty. Furthermore, we frame this reliability-aware diversity maximization as a combinatorial semi-bandit problem and propose a UCB-style algorithm to efficiently learn the unknown reliability online. Theoretical analysis provides regret bounds for the proposed approach, ensuring performance guarantees.
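To ground the "diversity plus reliability" decomposition the abstract mentions, here is a greedy selection sketch that scores a candidate set by the log-determinant of its kernel submatrix (geometric diversity) plus a weighted sum of log access reliabilities. The exact ProbDPP objective, its regularizer, and the CMAB learning loop are not given in the abstract, so this is an illustration of the general idea only.
```python
import numpy as np

def greedy_reliable_selection(K, p, k, gamma=1.0):
    """Greedily build a size-k set maximizing log det K_S + gamma * sum log p_S.

    K: (n, n) PSD similarity kernel; p: (n,) estimated access reliabilities.
    The log-det term rewards diverse batches, while the log-reliability
    term penalizes items that are likely to be unavailable.
    """
    n = K.shape[0]
    log_p = np.log(np.clip(p, 1e-12, 1.0))
    chosen = []
    for _ in range(k):
        best_i, best_val = None, -np.inf
        for i in set(range(n)) - set(chosen):
            S = chosen + [i]
            sub = K[np.ix_(S, S)] + 1e-9 * np.eye(len(S))  # jitter for stability
            sign, logdet = np.linalg.slogdet(sub)
            val = logdet + gamma * log_p[S].sum()
            if sign > 0 and val > best_val:
                best_i, best_val = i, val
        chosen.append(best_i)
    return chosen
```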
[LG-161] Test-time Generalization for Physics through Neural Operator Splitting
链接: https://arxiv.org/abs/2602.00884
作者: Louis Serrano,Jiequn Han,Edouard Oyallon,Shirley Ho,Rudy Morel
类目: Machine Learning (cs.LG)
*备注:
Abstract:Neural operators have shown promise in learning solution maps of partial differential equations (PDEs), but they often struggle to generalize when test inputs lie outside the training distribution, such as novel initial conditions, unseen PDE coefficients or unseen physics. Prior works address this limitation with large-scale multiple physics pretraining followed by fine-tuning, but this still requires examples from the new dynamics, falling short of true zero-shot generalization. In this work, we propose a method to enhance generalization at test time, i.e., without modifying pretrained weights. Building on DISCO, which provides a dictionary of neural operators trained across different dynamics, we introduce a neural operator splitting strategy that, at test time, searches over compositions of training operators to approximate unseen dynamics. On challenging out-of-distribution tasks including parameter extrapolation and novel combinations of physics phenomena, our approach achieves state-of-the-art zero-shot generalization results, while being able to recover the underlying PDE parameters. These results underscore test-time computation as a key avenue for building flexible, compositional, and generalizable neural operators.
[LG-162] Dynamic Expert Sharing: Decoupling Memory from Parallelism in Mixture-of-Experts Diffusion LLMs
链接: https://arxiv.org/abs/2602.00879
作者: Hao Mark Chen,Zhiwen Mo,Royson Lee,Qianzhou Wang,Da Li,Shell Xu Hu,Wayne Luk,Timothy Hospedales,Hongxiang Fan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Among parallel decoding paradigms, diffusion large language models (dLLMs) have emerged as a promising candidate that balances generation quality and throughput. However, their integration with Mixture-of-Experts (MoE) architectures is constrained by an expert explosion: as the number of tokens generated in parallel increases, the number of distinct experts activated grows nearly linearly. This results in substantial memory traffic that pushes inference into a memory-bound regime, negating the efficiency gains of both MoE and parallel decoding. To address this challenge, we propose Dynamic Expert Sharing (DES), a novel technique that shifts MoE optimization from token-centric pruning and conventional expert skipping methods to sequence-level coreset selection. To maximize expert reuse, DES identifies a compact, high-utility set of experts to satisfy the requirements of an entire parallel decoding block. We introduce two innovative selection strategies: (1) Intra-Sequence Sharing (DES-Seq), which adapts optimal allocation to the sequence level, and (2) Saliency-Aware Voting (DES-Vote), a novel mechanism that allows tokens to collectively elect a coreset based on aggregated router weights. Extensive experiments on MoE dLLMs demonstrate that DES reduces unique expert activations by over 55% and latency by up to 38%, while retaining 99% of vanilla accuracy, effectively decoupling memory overhead from the degree of parallelism.
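The voting strategy is easy to picture at the tensor level. Below is a minimal PyTorch sketch of the DES-Vote idea described above: tokens in a parallel-decoding block pool their router weights to elect one shared expert coreset, and each token then routes only within it. The shapes, softmax pooling, and top-k re-routing are assumptions made for illustration; the paper's exact aggregation rule is not given in the abstract.
```python
import torch

def des_vote(router_logits: torch.Tensor, coreset_size: int) -> torch.Tensor:
    """Elect a shared expert coreset for a whole parallel-decoding block.

    router_logits: (num_tokens, num_experts) gating scores for the block.
    Each token's softmax router weights act as saliency-weighted votes,
    aggregated across the block; the top experts form the coreset.
    """
    weights = torch.softmax(router_logits, dim=-1)   # per-token expert saliency
    votes = weights.sum(dim=0)                       # pooled over all tokens
    return torch.topk(votes, coreset_size).indices

def route_within_coreset(router_logits, coreset, top_k=2):
    """Re-route every token to its top-k experts inside the elected coreset."""
    masked = torch.full_like(router_logits, float("-inf"))
    masked[:, coreset] = router_logits[:, coreset]
    return torch.topk(masked, top_k, dim=-1).indices
```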
[LG-163] Learning Heat-based Equations in Self-similar variables
链接: https://arxiv.org/abs/2602.00872
作者: Shihao Wang,Qipeng Qian,Jingquan Wang
类目: Machine Learning (cs.LG); Mathematical Physics (math-ph)
*备注:
Abstract:We study solution learning for heat-based equations in self-similar variables (SSV). We develop an SSV training framework compatible with standard neural-operator training. We instantiate this framework on the two-dimensional incompressible Navier-Stokes equations and the one-dimensional viscous Burgers equation, and perform controlled comparisons between models trained in physical coordinates and in the corresponding self-similar coordinates using two simple fully connected architectures (standard multilayer perceptrons and a factorized fully connected network). Across both systems and both architectures, SSV-trained networks consistently deliver substantially more accurate and stable extrapolation beyond the training window and better capture qualitative long-time trends. These results suggest that self-similar coordinates provide a mathematically motivated inductive bias for learning the long-time dynamics of heat-based equations.
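For readers unfamiliar with self-similar variables: for the 1D heat equation, a standard choice is $\xi = x/t^{1/2}$, $\tau = \log t$, $v = t^{1/2}u$, under which the diffusive scaling is absorbed into the coordinates. The helper below applies this change of variables to training samples before fitting a network and inverts it at prediction time. The exponents for Burgers or Navier-Stokes are equation-specific, so the heat-equation defaults here are assumptions, not the paper's exact transforms.
```python
import numpy as np

def to_self_similar(x, t, u, alpha=0.5, beta=0.5):
    """Map (x, t, u) samples (t > 0) to self-similar coordinates.

    xi = x / t**beta, tau = log(t), v = t**alpha * u. With alpha = beta = 1/2
    this is the classical heat-kernel scaling; training a network to predict
    v(xi, tau) instead of u(x, t) bakes the diffusive scaling into the inputs.
    """
    xi = x / t**beta
    tau = np.log(t)
    v = t**alpha * u
    return xi, tau, v

def from_self_similar(xi, tau, v, alpha=0.5, beta=0.5):
    """Invert the transform to recover physical-coordinate predictions."""
    t = np.exp(tau)
    x = xi * t**beta
    u = v / t**alpha
    return x, t, u
```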
[LG-164] Investigating the Robustness of Subtask Distillation under Spurious Correlation
链接: https://arxiv.org/abs/2602.00852
作者: Pattarawat Chormai,Klaus-Robert Müller,Grégoire Montavon
类目: Machine Learning (cs.LG)
*备注: 7 pages, 3 figures
Abstract:Subtask distillation is an emerging paradigm in which compact, specialized models are extracted from large, general-purpose ‘foundation models’ for deployment in environments with limited resources or in standalone computer systems. Although distillation uses a teacher model, it still relies on a dataset that is often limited in size and may lack representativeness or exhibit spurious correlations. In this paper, we evaluate established distillation methods, as well as the recent SubDistill method, when using data with spurious correlations for distillation. As the strength of the correlations increases, we observe a widening gap between advanced methods, such as SubDistill, which remain fairly robust, and some baseline methods, which degrade to near-random performance. Overall, our study underscores the challenges of knowledge distillation when applied to imperfect, real-world datasets, particularly those with spurious correlations.
[LG-165] Over-Alignment vs Over-Fitting: The Role of Feature Learning Strength in Generalization
链接: https://arxiv.org/abs/2602.00827
作者: Taesun Yeom,Taehyeok Ha,Jaeho Lee
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Feature learning strength (FLS), i.e., the inverse of the effective output scaling of a model, plays a critical role in shaping the optimization dynamics of neural nets. While its impact has been extensively studied under the asymptotic regimes – both in training time and FLS – existing theory offers limited insight into how FLS affects generalization in practical settings, such as when training is stopped upon reaching a target training risk. In this work, we investigate the impact of FLS on generalization in deep networks under such practical conditions. Through empirical studies, we first uncover the emergence of an \textit{optimal FLS} – neither too small nor too large – that yields substantial generalization gains. This finding runs counter to the prevailing intuition that stronger feature learning universally improves generalization. To explain this phenomenon, we develop a theoretical analysis of gradient flow dynamics in two-layer ReLU nets trained with logistic loss, where FLS is controlled via initialization scale. Our main theoretical result establishes the existence of an optimal FLS arising from a trade-off between two competing effects: An excessively large FLS induces an \textit{over-alignment} phenomenon that degrades generalization, while an overly small FLS leads to \textit{over-fitting}.
[LG-166] Mobile Exergames: Activity Recognition Based on Smartphone Sensors
链接: https://arxiv.org/abs/2602.00809
作者: David Craveiro,Hugo Silva
类目: Machine Learning (cs.LG)
*备注:
Abstract:Smartphone sensors can be extremely useful in providing information on the activities and behaviors of their users. Human activity recognition is increasingly used in gaming, medical, and surveillance applications. In this paper, we propose a proof-of-concept 2D endless game called Duck Catch Fit, which implements a detailed activity recognition system using a smartphone's accelerometer, gyroscope, and magnetometer sensors. The system applies feature extraction and a learning mechanism to detect human activities such as staying in place, side movements, and fake side movements. In addition, a voice recognition system is integrated to recognize the word “fire” and raise the game's complexity. The results show that machine learning techniques can be used to recognize human activity with high recognition rates. Moreover, combining movement-based and voice-based interactions contributes to more immersive gameplay.
[LG-167] Sporadic Gradient Tracking over Directed Graphs: A Theoretical Perspective on Decentralized Federated Learning
链接: https://arxiv.org/abs/2602.00791
作者: Shahryar Zehtabi,Dong-Jun Han,Seyyedali Hosseinalipour,Christopher Brinton
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 32 pages, 5 figures
Abstract:Decentralized Federated Learning (DFL) enables clients with local data to collaborate in a peer-to-peer manner to train a generalized model. In this paper, we unify two branches of work that have separately solved important challenges in DFL: (i) gradient tracking techniques for mitigating data heterogeneity and (ii) accounting for diverse availability of resources across clients. We propose \textit{Sporadic Gradient Tracking} (\texttt{Spod-GT}), the first DFL algorithm that incorporates these factors over general directed graphs by allowing (i) client-specific gradient computation frequencies and (ii) heterogeneous and asymmetric communication frequencies. We conduct a rigorous convergence analysis of our methodology with relaxed assumptions on gradient estimation variance and gradient diversity of clients, providing consensus and optimality guarantees for GT over directed graphs despite intermittent client participation. Through numerical experiments on image classification datasets, we demonstrate the efficacy of \texttt{Spod-GT} compared to well-known GT baselines.
[LG-168] Fast Non-Episodic Finite-Horizon RL with K-Step Lookahead Thresholding
链接: https://arxiv.org/abs/2602.00781
作者: Jiamin Xu,Kyra Gan
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Online reinforcement learning in non-episodic, finite-horizon MDPs remains underexplored and is challenged by the need to estimate returns to a fixed terminal time. Existing infinite-horizon methods, which often rely on discounted contraction, do not naturally account for this fixed-horizon structure. We introduce a modified Q-function: rather than targeting the full horizon, we learn a K-step lookahead Q-function that truncates planning to the next K steps. To further improve sample efficiency, we introduce a thresholding mechanism: actions are selected only when their estimated K-step lookahead value exceeds a time-varying threshold. We provide an efficient tabular learning algorithm for this novel objective, proving it achieves fast finite-sample convergence: it achieves minimax optimal constant regret for $K=1$ and $\mathcal{O}(\max(K-1,\,C_{K-1})\sqrt{SAT}\log(T))$ regret for any $K \geq 2$. We numerically evaluate the performance of our algorithm under the objective of maximizing reward. Our implementation adaptively increases K over time, balancing lookahead depth against estimation variance. Empirical results demonstrate superior cumulative rewards over state-of-the-art tabular RL methods across synthetic MDPs and RL environments: JumpRiverswim, FrozenLake and AnyTrading.
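The modified objective is concrete enough to sketch in tabular form: back the value function up for K-1 steps instead of planning to the terminal time, then gate the greedy action with a threshold. The sketch below does exactly that; the zero tail value and the default-action fallback are assumptions, as the abstract leaves those details (and the threshold schedule) open.
```python
import numpy as np

def k_step_lookahead_q(P, R, s, K):
    """K-step lookahead Q-values at state s in a finite-horizon MDP.

    P: (S, A, S) transition probabilities; R: (S, A) expected rewards.
    Planning is truncated to the next K steps with a zero tail value
    (an assumption; a learned tail estimate could be substituted).
    """
    V = np.zeros(P.shape[0])
    for _ in range(K - 1):            # back the value up K-1 steps
        V = np.max(R + P @ V, axis=1)
    return R[s] + P[s] @ V            # shape (A,): truncated-horizon Q at s

def act_with_threshold(q, threshold, default_action=0):
    """Take the greedy action only if its K-step value clears the
    time-varying threshold; otherwise fall back to a default action
    (the fallback choice here is an assumption for illustration)."""
    a = int(np.argmax(q))
    return a if q[a] > threshold else default_action
```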
[LG-169] Stable Time Series Prediction of Enterprise Carbon Emissions Based on Causal Inference
链接: https://arxiv.org/abs/2602.00775
作者: Zitao Hong,Zhen Peng,Xueping Liu
类目: Machine Learning (cs.LG); Econometrics (econ.EM)
*备注:
Abstract:Against the backdrop of ongoing carbon peaking and carbon neutrality goals, accurate prediction of enterprise carbon emission trends constitutes an essential foundation for energy structure optimization and low-carbon transformation decision-making. Nevertheless, significant heterogeneity persists across regions, industries and individual enterprises regarding energy structure, production scale, policy intensity and governance efficacy, resulting in pronounced distribution shifts and non-stationarity in carbon emission data across both temporal and spatial dimensions. Such cross-regional and cross-enterprise data drift not only compromises the accuracy of carbon emission reporting but substantially undermines the guidance value of predictive models for production planning and carbon quota trading decisions. To address this critical challenge, we integrate causal inference perspectives with stable learning methodologies and time-series modelling, proposing a stable temporal prediction mechanism tailored to distribution shift environments. This mechanism incorporates enterprise-level energy inputs, capital investment, labour deployment, carbon pricing, governmental interventions and policy implementation intensity, constructing a risk consistency-constrained stable learning framework that extracts causal stable features (robust against external perturbations yet demonstrating long-term stable effects on carbon dioxide emissions) from multi-environment samples across diverse policies, regions and industrial sectors. Furthermore, through adaptive normalization and sample reweighting strategies, the approach dynamically rectifies temporal non-stationarity induced by economic fluctuations and policy transitions, ultimately enhancing model generalization capability and explainability in complex environments.
[LG-170] A novel VAE-DML fusion framework for causal analysis of greenwashing in the mining industry
链接: https://arxiv.org/abs/2602.00774
作者: Yuxin Lu,Zhen Peng,Xiqiang Xia,Jie Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Against the backdrop of the global green transition and “dual carbon” goals, mining industry chain enterprises are pivotal entities in terms of resource consumption and environmental impact. Their environmental performance directly affects regional ecological security and is closely tied to national resource strategies and green transformation outcomes. Ensuring the authenticity and reliability of their environmental disclosure is thus a core and urgent issue for sustainable development and national strategic interests. From a corporate governance perspective, this study examines equity balance as a fundamental governance mechanism, investigating its inhibitory effect on greenwashing behavior among these enterprises and the underlying pathways involved. Methodologically, the paper innovatively employs a Variational Autoencoder (VAE) and a Double Machine Learning (DML) model to construct counterfactual scenarios, mitigating endogeneity concerns and precisely identifying the causal relationship between equity balance and greenwashing. The findings indicate, first, a significant negative causal relationship between equity balance and corporate greenwashing, confirming its substantive governance effect. Second, this inhibitory effect exhibits notable heterogeneity, manifesting more strongly in western regions, upstream segments of the industrial chain, and industries with high environmental sensitivity. Third, the governance effect demonstrates clear temporal dynamics, with the strongest impact occurring in the current period, followed by a diminishing yet statistically significant lagged effect, and ultimately a stable long-term cumulative influence. Finally, mechanism analysis reveals that equity balance operates through three distinct channels to curb greenwashing: alleviating management performance pressure, enhancing the stability of the executive team, and intensifying media scrutiny.
[LG-171] Provable Model Provenance Set for Large Language Models
链接: https://arxiv.org/abs/2602.00772
作者: Xiaoqi Qiu,Hao Zeng,Zhiyu Hou,Hongxin Wei
类目: Machine Learning (cs.LG)
*备注:
Abstract:The growing prevalence of unauthorized model usage and misattribution has increased the need for reliable model provenance analysis. However, existing methods largely rely on heuristic fingerprint-matching rules that lack provable error control and often overlook the existence of multiple sources, leaving the reliability of their provenance claims unverified. In this work, we first formalize the model provenance problem with provable guarantees, requiring rigorous coverage of all true provenances at a prescribed confidence level. Then, we propose the Model Provenance Set (MPS), which employs a sequential test-and-exclusion procedure to adaptively construct a small set satisfying the guarantee. The key idea of MPS is to test the significance of provenance existence within a candidate pool, thereby establishing a provable asymptotic guarantee at a user-specific confidence level. Extensive experiments demonstrate that MPS effectively achieves target provenance coverage while strictly limiting the inclusion of unrelated models, and further reveal its potential for practical provenance analysis in attribution and auditing tasks.
[LG-172] Learning in Bayesian Stackelberg Games With Unknown Followers Types
链接: https://arxiv.org/abs/2602.00771
作者: Matteo Bollini,Francesco Bacchiocchi,Samuel Coutts,Matteo Castiglioni,Alberto Marchesi
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:
Abstract:We study online learning in Bayesian Stackelberg games, where a leader repeatedly interacts with a follower whose unknown private type is independently drawn at each round from an unknown probability distribution. The goal is to design algorithms that minimize the leader’s regret with respect to always playing an optimal commitment computed with knowledge of the game. We consider, for the first time to the best of our knowledge, the most realistic case in which the leader does not know anything about the follower’s types, i.e., the possible follower payoffs. This raises considerable additional challenges compared to the commonly studied case in which the payoffs of follower types are known. First, we prove a strong negative result: no-regret is unattainable under action feedback, i.e., when the leader only observes the follower’s best response at the end of each round. Thus, we focus on the easier type feedback model, where the follower’s type is also revealed. In such a setting, we propose a no-regret algorithm that achieves a regret of $\widetilde{O}(\sqrt{T})$, when ignoring the dependence on other parameters.
[LG-173] Communications-Incentivized Collaborative Reasoning in NetGPT through Agentic Reinforcement Learning
链接: https://arxiv.org/abs/2602.00766
作者: Xiaoxue Yu,Rongpeng Li,Zhifeng Zhao,Honggang Zhang
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*备注:
Abstract:The evolution of next-Generation (xG) wireless networks marks a paradigm shift from connectivity-centric architectures to Artificial Intelligence (AI)-native designs that tightly integrate data, computing, and communication. Yet existing AI deployments in communication systems remain largely siloed, offering isolated optimizations without intrinsic adaptability, dynamic task delegation, or multi-agent collaboration. In this work, we propose a unified agentic NetGPT framework for AI-native xG networks, wherein a NetGPT core can either perform autonomous reasoning or delegate sub-tasks to domain-specialized agents via agentic communication. The framework establishes clear modular responsibilities and interoperable workflows, enabling scalable, distributed intelligence across the network. To support continual refinement of collaborative reasoning strategies, the framework is further enhanced through Agentic reinforcement learning under partially observable conditions and stochastic external states. The training pipeline incorporates masked loss against external agent uncertainty, entropy-guided exploration, and multi-objective rewards that jointly capture task quality, coordination efficiency, and resource constraints. Through this process, NetGPT learns when and how to collaborate, effectively balancing internal reasoning with agent invocation. Overall, this work provides a foundational architecture and training methodology for self-evolving, AI-native xG networks capable of autonomous sensing, reasoning, and action in complex communication environments.
[LG-174] Spectral Imbalance Causes Forgetting in Low-Rank Continual Adaptation
链接: https://arxiv.org/abs/2602.00722
作者: Hao Gu,Mao-Lin Luo,Zi-Hao Zhou,Han-Chen Zhang,Min-Ling Zhang,Tong Wei
类目: Machine Learning (cs.LG)
*备注: 19 pages, 6 figures
Abstract:Parameter-efficient continual learning aims to adapt pre-trained models to sequential tasks without forgetting previously acquired knowledge. Most existing approaches treat continual learning as avoiding interference with past updates, rather than considering what properties make the current task-specific update naturally preserve previously acquired knowledge. From a knowledge-decomposition perspective, we observe that low-rank adaptations exhibit highly imbalanced singular value spectra: a few dominant components absorb most of the adaptation energy, thereby (i) more likely to disrupt previously acquired knowledge and (ii) making the update more vulnerable to interference from subsequent tasks. To enable explicit balance among components, we decouple the magnitude of the task update from its directional structure and formulate it as a constrained optimization problem on a restricted Stiefel manifold. We address this problem using a projected first-order method compatible with standard deep-learning optimizers used in vision-language models. Our method mitigates both backward and forward forgetting, consistently outperforming continual learning baselines. The implementation code is available at this https URL.
[LG-175] Federated Learning at the Forefront of Fairness: A Multifaceted Perspective IJCAI
链接: https://arxiv.org/abs/2602.00718
作者: Noorain Mukhtiar,Adnan Mahmood,Yipeng Zhou,Jian Yang,Jing Teng,Quan Z. Sheng
类目: Machine Learning (cs.LG)
*备注: 7 pages (main content), 2 pages (references), Accepted and Published Proceedings of the 34th International Joint Conference on Artificial Intelligence (IJCAI). 2025
Abstract:Fairness in Federated Learning (FL) is emerging as a critical factor driven by heterogeneous clients’ constraints and balanced model performance across various scenarios. In this survey, we delineate a comprehensive classification of the state-of-the-art fairness-aware approaches from a multifaceted perspective, i.e., model performance-oriented and capability-oriented. Moreover, we provide a framework to categorize and address various fairness concerns and associated technical aspects, examining their effectiveness in balancing equity and performance within FL frameworks. We further examine several significant evaluation metrics leveraged to measure fairness quantitatively. Finally, we explore exciting open research directions and propose prospective solutions that could drive future advancements in this important area, laying a solid foundation for researchers working toward fairness in FL.
[LG-176] LocalV: Exploiting Information Locality for IP-level Verilog Generation
链接: https://arxiv.org/abs/2602.00704
作者: Hanqi Lyu,Di Huang,Yaoyu Zhu,Kangcheng Liu,Bohan Dou,Chongxiao Li,Pengwei Jin,Shuyao Cheng,Rui Zhang,Zidong Du,Qi Guo,Xing Hu,Yunji Chen
类目: Machine Learning (cs.LG)
*备注:
Abstract:The generation of Register-Transfer Level (RTL) code is a crucial yet labor-intensive step in digital hardware design, traditionally requiring engineers to manually translate complex specifications into thousands of lines of synthesizable Hardware Description Language (HDL) code. While Large Language Models (LLMs) have shown promise in automating this process, existing approaches-including fine-tuned domain-specific models and advanced agent-based systems-struggle to scale to industrial IP-level design tasks. We identify three key challenges: (1) handling long, highly detailed documents, where critical interface constraints become buried in unrelated submodule descriptions; (2) generating long RTL code, where both syntactic and semantic correctness degrade sharply with increasing output length; and (3) navigating the complex debugging cycles required for functional verification through simulation and waveform analysis. To overcome these challenges, we propose LocalV, a multi-agent framework that leverages information locality in modular hardware design. LocalV decomposes the long-document to long-code generation problem into a set of short-document, short-code tasks, enabling scalable generation and debugging. Specifically, LocalV integrates hierarchical document partitioning, task planning, localized code generation, interface-consistent merging, and AST-guided locality-aware debugging. Experiments on RealBench, an IP-level Verilog generation benchmark, demonstrate that LocalV substantially outperforms state-of-the-art (SOTA) LLMs and agents, achieving a pass rate of 45.0% compared to 21.6%.
[LG-177] Topology and Geometry of the Learning Space of ReLU Networks: Connectivity and Singularities ICLR2026
链接: https://arxiv.org/abs/2602.00693
作者: Marco Nurisso,Pierrick Leroy,Giovanni Petri,Francesco Vaccarino
类目: Machine Learning (cs.LG); Algebraic Geometry (math.AG); Algebraic Topology (math.AT)
*备注: Accepted to ICLR 2026. 32 pages, 13 figures
Abstract:Understanding the properties of the parameter space in feed-forward ReLU networks is critical for effectively analyzing and guiding training dynamics. After initialization, training under gradient flow decisively restricts the parameter space to an algebraic variety that emerges from the homogeneous nature of the ReLU activation function. In this study, we examine two key challenges associated with feed-forward ReLU networks built on general directed acyclic graph (DAG) architectures: the (dis)connectedness of the parameter space and the existence of singularities within it. We extend previous results by providing a thorough characterization of connectedness, highlighting the roles of bottleneck nodes and balance conditions associated with specific subsets of the network. Our findings clearly demonstrate that singularities are intricately connected to the topology of the underlying DAG and its induced sub-networks. We discuss the reachability of these singularities and establish a principled connection with differentiable pruning. We validate our theory with simple numerical experiments.
[LG-178] Provably Protecting Fine-Tuned LLM s from Training Data Extraction
链接: https://arxiv.org/abs/2602.00688
作者: Tom Segal,Asaf Shabtai,Yuval Elovici
类目: Machine Learning (cs.LG)
*备注: 20 pages, 5 figures
Abstract:Fine-tuning large language models (LLMs) on sensitive datasets raises privacy concerns, as training data extraction (TDE) attacks can expose highly confidential information. Existing defenses against such attacks either lack formal privacy guarantees or incur substantial utility degradation. We observe that fine-tuning induces widespread probability shifts, yet preserving only a small subset of influential token-level deviations is sufficient; the remaining shifts can be aggressively smoothed with minimal impact on utility. Motivated by this insight, we propose SCP-$\Delta_r$, a Near Access Freeness (NAF)-based algorithm that operates on relative probabilities and explicitly smooths low-impact tokens using a base model. SCP-$\Delta_r$ achieves orders-of-magnitude better theoretical bounds than existing NAF-based methods and provides strong empirical protection against TDE attacks with minimal performance loss.
[LG-179] Audio-to-Image Bird Species Retrieval without Audio-Image Pairs via Text Distillation
链接: https://arxiv.org/abs/2602.00681
作者: Ilyass Moummad,Marius Miron,Lukas Rauch,David Robinson,Alexis Joly,Olivier Pietquin,Emmanuel Chemla,Matthieu Geist
类目: ound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
Abstract:Audio-to-image retrieval offers an interpretable alternative to audio-only classification for bioacoustic species recognition, but learning aligned audio-image representations is challenging due to the scarcity of paired audio-image data. We propose a simple and data-efficient approach that enables audio-to-image retrieval without any audio-image supervision. Our proposed method uses text as a semantic intermediary: we distill the text embedding space of a pretrained image-text model (BioCLIP-2), which encodes rich visual and taxonomic structure, into a pretrained audio-text model (BioLingual) by fine-tuning its audio encoder with a contrastive objective. This distillation transfers visually grounded semantics into the audio representation, inducing emergent alignment between audio and image embeddings without using images during training. We evaluate the resulting model on multiple bioacoustic benchmarks. The distilled audio encoder preserves audio discriminative power while substantially improving audio-text alignment on focal recordings and soundscape datasets. Most importantly, on the SSW60 benchmark, the proposed approach achieves strong audio-to-image retrieval performance exceeding baselines based on zero-shot model combinations or learned mappings between text embeddings, despite not training on paired audio-image data. These results demonstrate that indirect semantic transfer through text is sufficient to induce meaningful audio-image alignment, providing a practical solution for visually grounded species recognition in data-scarce bioacoustic settings.
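The distillation objective described here, fine-tuning the audio encoder so its embeddings land near the frozen image-text model's text embeddings for the same species, is naturally expressed as a symmetric InfoNCE loss. A sketch follows; the temperature, normalization, and symmetric form are common contrastive-learning defaults rather than the paper's confirmed choices.
```python
import torch
import torch.nn.functional as F

def distill_text_to_audio(audio_emb, text_emb, temperature=0.07):
    """Contrastive (InfoNCE) loss pulling each clip's audio embedding toward
    the frozen text embedding of its species label.

    audio_emb: (B, d) from the trainable audio encoder (here, BioLingual's).
    text_emb:  (B, d) frozen targets from the image-text model (BioCLIP-2).
    The exact loss, projection heads, and batching used in the paper are
    not given in the abstract, so this is an illustrative formulation.
    """
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature          # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric cross-entropy: match audio->text and text->audio.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))
```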
[LG-180] Strong Linear Baselines Strike Back: Closed-Form Linear Models as Gaussian Process Conditional Density Estimators for TSAD
链接: https://arxiv.org/abs/2602.00672
作者: Aleksandr Yugay,Hang Cui,Changhua Pei,Alexey Zaytsev
类目: Machine Learning (cs.LG)
*备注:
Abstract:Research in time series anomaly detection (TSAD) has largely focused on developing increasingly sophisticated, hard-to-train, and expensive-to-infer neural architectures. We revisit this paradigm and show that a simple linear autoregressive anomaly score with the closed-form solution provided by ordinary least squares (OLS) regression consistently matches or outperforms state-of-the-art deep detectors. From a theoretical perspective, we show that linear models capture a broad class of anomaly types, estimating a finite-history Gaussian process conditional density. From a practical side, across extensive univariate and multivariate benchmarks, the proposed approach achieves superior accuracy while requiring orders of magnitude fewer computational resources. Thus, future research should consistently include strong linear baselines and, more importantly, develop new benchmarks with richer temporal structures pinpointing the advantages of deep learning models.
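The proposed baseline is short enough to state in full: fit a linear autoregressive predictor by OLS in closed form and use the (normalized) squared residuals as anomaly scores. The sketch below follows that recipe for a univariate series; the AR order and the residual normalization are illustrative defaults, not the paper's tuned settings.
```python
import numpy as np

def ar_ols_anomaly_scores(x, order=16):
    """Anomaly score = normalized squared residual of a linear AR(order)
    predictor fit in closed form by ordinary least squares.

    Under the paper's reading, the linear model estimates a finite-history
    Gaussian process conditional density, so large residuals flag
    low-likelihood (anomalous) points.
    """
    n = len(x)
    # Lag matrix: row t holds x[t-order:t]; the target is x[t].
    H = np.stack([x[t - order:t] for t in range(order, n)])
    H = np.column_stack([H, np.ones(len(H))])     # intercept column
    y = x[order:]
    w, *_ = np.linalg.lstsq(H, y, rcond=None)     # closed-form OLS fit
    resid = y - H @ w
    scores = np.zeros(n)                          # first `order` points unscored
    scores[order:] = resid**2 / resid.var()
    return scores
```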
[LG-181] Three-Way Emotion Classification of EEG-based Signals using Machine Learning
链接: https://arxiv.org/abs/2602.00670
作者: Ashna Purwar,Gaurav Simkar,Madhumita,Sachin Kadam
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 6 pages, 8 figures, and 3 tables. Submitted to a conference, under review
Abstract:Electroencephalography (EEG) is a widely used technique for measuring brain activity. EEG-based signals can reveal a person's emotional state, as they directly reflect activity in different brain regions. Emotion-aware systems and EEG-based emotion recognition form a growing research area. This paper presents how machine learning (ML) models categorize a limited dataset of EEG signals into three classes: Negative, Neutral, or Positive. It also presents the complete workflow, including data preprocessing and a comparison of ML models. To understand which ML classification model works best for this kind of problem, we train and test three commonly used models: logistic regression (LR), support vector machine (SVM), and random forest (RF). The performance of each is evaluated with respect to accuracy and F1-score. The results indicate that ML models can be effectively utilized for three-way emotion classification of EEG signals. Among the three ML models trained on the available dataset, the RF model gave the best results. Its higher accuracy and F1-score suggest that it captures emotional patterns more accurately and effectively than the other two models. The RF model also outperformed existing state-of-the-art classification models in terms of accuracy.
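The three-model comparison in this paper corresponds to a few lines of scikit-learn. The sketch below uses random placeholder features in place of the paper's extracted EEG features, and the hyperparameters are generic defaults rather than the authors' settings.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

# Placeholder data: real use would load extracted EEG features here.
rng = np.random.default_rng(0)
X = rng.standard_normal((600, 40))
y = rng.integers(0, 3, 600)              # 0=Negative, 1=Neutral, 2=Positive

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)
models = {
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "RF": RandomForestClassifier(n_estimators=300, random_state=42),
}
for name, clf in models.items():
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    print(name, accuracy_score(y_te, pred), f1_score(y_te, pred, average="macro"))
```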
[LG-182] Non-Clashing Teaching in Graphs: Algorithms Complexity and Bounds ICLR2026
链接: https://arxiv.org/abs/2602.00657
作者: Sujoy Bhore,Liana Khazaliya,Fionn Mc Inerney
类目: Computational Complexity (cs.CC); Discrete Mathematics (cs.DM); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Combinatorics (math.CO)
*备注: An extended abstract of this paper will appear in the proceedings of ICLR 2026
Abstract:Kirkpatrick et al. [ALT 2019] and Fallat et al. [JMLR 2023] introduced non-clashing teaching and proved that it is the most efficient batch machine teaching model satisfying the collusion-avoidance benchmark established in the seminal work of Goldman and Mathias [COLT 1993]. Recently, (positive) non-clashing teaching was thoroughly studied for balls in graphs, yielding numerous algorithmic and combinatorial results. In particular, Chalopin et al. [COLT 2024] and Ganian et al. [ICLR 2025] gave an almost complete picture of the complexity landscape of the positive variant, showing that it is tractable only for restricted graph classes due to the non-trivial nature of the problem and concept class. In this work, we consider (positive) non-clashing teaching for closed neighborhoods in graphs. This concept class is not only extensively studied in various related contexts, but it also exhibits broad generality, as any finite binary concept class can be equivalently represented by a set of closed neighborhoods in a graph. In comparison to the works on balls in graphs, we provide improved algorithmic results, notably including FPT algorithms for more general classes of parameters, and we complement these results by deriving stronger lower bounds. Lastly, we obtain combinatorial upper bounds for wider classes of graphs.
[LG-183] Riemannian Flow Matching for Disentangled Graph Domain Adaptation
链接: https://arxiv.org/abs/2602.00656
作者: Yingxu Wang,Xinwang Liu,Mengzhu Wang,Siyang Gao,Nan Yin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph Domain Adaptation (GDA) typically uses adversarial learning to align graph embeddings in Euclidean space. However, this paradigm suffers from two critical challenges: Structural Degeneration, where hierarchical and semantic representations are entangled, and Optimization Instability, which arises from oscillatory dynamics of minimax adversarial training. To tackle these issues, we propose DisRFM, a geometry-aware GDA framework that unifies Riemannian embedding and flow-based transport. First, to overcome structural degeneration, we embed graphs into a Riemannian manifold. By adopting polar coordinates, we explicitly disentangle structure (radius) from semantics (angle). Then, we enforce topology preservation through radial Wasserstein alignment and semantic discrimination via angular clustering, thereby preventing feature entanglement and collapse. Second, we address the instability of adversarial alignment by using Riemannian flow matching. This method learns a smooth vector field to guide source features toward the target along geodesic paths, guaranteeing stable convergence. The geometric constraints further guide the flow to maintain the disentangled structure during transport. Theoretically, we prove the asymptotic stability of the flow matching and derive a tighter bound for the target risk. Extensive experiments demonstrate that DisRFM consistently outperforms state-of-the-art methods.
[LG-184] PHAT: Modeling Period Heterogeneity for Multivariate Time Series Forecasting
链接: https://arxiv.org/abs/2602.00654
作者: Jiaming Ma,Guanjun Wang,Qihe Huang,Sheng Huang,Haofeng Ma,Zhengyang Zhou,Pengkun Wang,Binwu Wang,Yang Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:While existing multivariate time series forecasting models have advanced significantly in modeling periodicity, they largely neglect the periodic heterogeneity common in real-world data, where variates exhibit distinct and dynamically changing periods. To effectively capture this periodic heterogeneity, we propose PHAT (Period Heterogeneity-Aware Transformer). Specifically, PHAT arranges multivariate inputs into a three-dimensional “periodic bucket” tensor, whose dimensions correspond to groups of variates with similar periodicity, time steps aligned by phase, and offsets within the period. By restricting interactions within buckets and masking cross-bucket connections, PHAT effectively avoids interference from inconsistent periods. We also propose a positive-negative attention mechanism, which captures periodic dependencies from two perspectives: periodic alignment and periodic deviation. Additionally, the periodic alignment attention scores are decomposed into positive and negative components, with a modulation term encoding periodic priors. This modulation constrains the attention mechanism to more faithfully reflect the underlying periodic trends, and a mathematical explanation is provided to support this property. We evaluate PHAT comprehensively on 14 real-world datasets against 18 baselines, and the results show that it significantly outperforms existing methods, achieving highly competitive forecasting performance. Our source code is available on GitHub.
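A rough picture of the "periodic bucket" arrangement: estimate each variate's dominant period, fold the series into (cycle, phase-offset) form, and group variates sharing a period so that attention can later be masked across buckets. The sketch below uses an FFT peak as the period estimator; PHAT's actual grouping and phase alignment are more sophisticated, so this is only a structural illustration.
```python
import numpy as np

def dominant_period(x):
    """Estimate a variate's dominant period from its FFT magnitude peak."""
    spec = np.abs(np.fft.rfft(x - x.mean()))
    freq = int(np.argmax(spec[1:])) + 1      # skip the DC bin
    return max(2, round(len(x) / freq))

def periodic_buckets(X):
    """Arrange a multivariate series X of shape (T, V) into phase buckets.

    Each variate is folded into a (cycles, offset-within-period) array and
    grouped with other variates sharing the same estimated period, so that
    interactions can be restricted to within-bucket pairs.
    """
    T, _ = X.shape
    buckets = {}
    for v in range(X.shape[1]):
        p = dominant_period(X[:, v])
        k = (T // p) * p                     # trim to a whole number of cycles
        buckets.setdefault(p, []).append(X[:k, v].reshape(-1, p))
    return buckets   # keyed by period; values: lists of (cycles, offset) arrays
```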
[LG-185] CoRe-Fed: Bridging Collaborative and Representation Fairness via Federated Embedding Distillation AAAI2026
链接: https://arxiv.org/abs/2602.00647
作者: Noorain Mukhtiar,Adnan Mahmood,Quan Z. Sheng
类目: Machine Learning (cs.LG)
*备注: 7 pages (main content), 2 pages (references), Accepted in AAAI 2026
Abstract:With the proliferation of distributed data sources, Federated Learning (FL) has emerged as a key approach to enable collaborative intelligence through decentralized model training while preserving data privacy. However, conventional FL algorithms often suffer from performance disparities across clients caused by heterogeneous data distributions and unequal participation, which leads to unfair outcomes. Specifically, we focus on two core fairness challenges, i.e., representation bias, arising from misaligned client representations, and collaborative bias, stemming from inequitable contribution during aggregation, both of which degrade model performance and generalizability. To mitigate these disparities, we propose CoRe-Fed, a unified optimization framework that bridges collaborative and representation fairness via embedding-level regularization and fairness-aware aggregation. Initially, an alignment-driven mechanism promotes semantic consistency between local and global embeddings to reduce representational divergence. Subsequently, a dynamic reward-penalty-based aggregation strategy adjusts each client’s weight based on participation history and embedding alignment to ensure contribution-aware aggregation. Extensive experiments across diverse models and datasets demonstrate that CoRe-Fed improves both fairness and model performance over the state-of-the-art baseline algorithms.
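【代码示意】摘要中的“奖励-惩罚式聚合”可以用如下假想的加权方式来理解(示意性代码,权重的具体形式为本文假设,并非论文原式):嵌入对齐度越高、历史参与越少的客户端获得越大的聚合权重。

```python
import numpy as np

def fairness_aware_weights(alignments, participation, eta=1.0):
    """Hypothetical reward-penalty weighting: higher embedding alignment
    is rewarded, heavy past participation is mildly penalized."""
    score = eta * np.asarray(alignments, dtype=float) \
            - np.log1p(np.asarray(participation, dtype=float))
    w = np.exp(score - score.max())   # softmax, numerically stable
    return w / w.sum()

# cosine alignment of local vs. global embeddings, and rounds participated
weights = fairness_aware_weights([0.9, 0.4, 0.7], [10, 2, 5])
# aggregation: theta_global = sum_k weights[k] * theta_k
```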
[LG-186] Combinatorial Bandit Bayesian Optimization for Tensor Outputs
链接: https://arxiv.org/abs/2602.00640
作者: Jingru Huang,Haijie Xu,Jie Guo,Manrui Jiang,Chen Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Bayesian optimization (BO) has been widely used to optimize expensive and black-box functions across various domains. Existing BO methods have not addressed tensor-output functions. To fill this gap, we propose a novel tensor-output BO method. Specifically, we first introduce a tensor-output Gaussian process (TOGP) with two classes of tensor-output kernels as a surrogate model of the tensor-output function, which can effectively capture the structural dependencies within the tensor. Based on it, we develop an upper confidence bound (UCB) acquisition function to select the queried points. Furthermore, we introduce a more complex and practical problem setting, named combinatorial bandit Bayesian optimization (CBBO), where only a subset of the outputs can be selected to contribute to the objective function. To tackle this, we propose a tensor-output CBBO method, which extends TOGP to handle partially observed outputs, and accordingly design a novel combinatorial multi-arm bandit-UCB2 (CMAB-UCB2) criterion to sequentially select both the queried points and the optimal output subset. Theoretical regret bounds for the two methods are established, ensuring their sublinear performance. Extensive synthetic and real-world experiments demonstrate their superiority.
[LG-187] Equilibrium of Feasible Zone and Uncertain Model in Safe Exploration
链接: https://arxiv.org/abs/2602.00636
作者: Yujie Yang,Zhilong Zheng,Shengbo Eben Li
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:Ensuring the safety of environmental exploration is a critical problem in reinforcement learning (RL). While limiting exploration to a feasible zone has become widely accepted as a way to ensure safety, key questions remain unresolved: what is the maximum feasible zone achievable through exploration, and how can it be identified? This paper, for the first time, answers these questions by revealing that the goal of safe exploration is to find the equilibrium between the feasible zone and the environment model. This conclusion is based on the understanding that these two components are interdependent: a larger feasible zone leads to a more accurate environment model, and a more accurate model, in turn, enables exploring a larger zone. We propose the first equilibrium-oriented safe exploration framework called safe equilibrium exploration (SEE), which alternates between finding the maximum feasible zone and the least uncertain model. Using a graph formulation of the uncertain model, we prove that the uncertain model obtained by SEE is monotonically refined, the feasible zones monotonically expand, and both converge to the equilibrium of safe exploration. Experiments on classic control tasks show that our algorithm successfully expands the feasible zones with zero constraint violation, and achieves the equilibrium of safe exploration within a few iterations.
[LG-188] Actor-Dual-Critic Dynamics for Zero-sum and Identical-Interest Stochastic Games
链接: https://arxiv.org/abs/2602.00606
作者: Ahmed Said Donmez,Yuksel Arslantas,Muhammed O. Sayin
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:
Abstract:We propose a novel independent and payoff-based learning framework for stochastic games that is model-free, game-agnostic, and gradient-free. The learning dynamics follow a best-response-type actor-critic architecture, where agents update their strategies (actors) using feedback from two distinct critics: a fast critic that intuitively responds to observed payoffs under limited information, and a slow critic that deliberatively approximates the solution to the underlying dynamic programming problem. Crucially, the learning process relies on non-equilibrium adaptation through smoothed best responses to observed payoffs. We establish convergence to (approximate) equilibria in two-agent zero-sum and multi-agent identical-interest stochastic games over an infinite horizon. This provides one of the first payoff-based and fully decentralized learning algorithms with theoretical guarantees in both settings. Empirical results further validate the robustness and effectiveness of the proposed approach across both classes of games.
[LG-189] Direct Preference Optimization with Rating Information: Practical Algorithms and Provable Gains
链接: https://arxiv.org/abs/2602.00603
作者: Luca Viano,Ruida Zhou,Yifan Sun,Mahdi Namazifar,Volkan Cevher,Shoham Sabach,Mohammad Ghavamzadeh
类目: Machine Learning (cs.LG)
*备注:
Abstract:The class of direct preference optimization (DPO) algorithms has emerged as a promising approach for solving the alignment problem in foundation models. These algorithms work with very limited feedback in the form of pairwise preferences and fine-tune models to align with these preferences without explicitly learning a reward model. While the form of feedback used by these algorithms makes the data collection process easy and relatively more accurate, its ambiguity in terms of the quality of responses could have negative implications. For example, it is not clear if a decrease (increase) in the likelihood of preferred (dispreferred) responses during the execution of these algorithms could be interpreted as a positive or negative phenomenon. In this paper, we study how to design algorithms that can leverage additional information in the form of a rating gap, which informs the learner how much better the chosen response is than the rejected one. We present new algorithms that can achieve faster statistical rates than DPO in the presence of accurate rating-gap information. Moreover, we theoretically prove and empirically show that the performance of our algorithms is robust to inaccuracy in rating gaps. Finally, we demonstrate the solid performance of our methods in comparison to a number of DPO-style algorithms across a wide range of LLMs and evaluation benchmarks.
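【代码示意】一种直观的做法是把“评分差”作为偏好间隔(margin)加入 DPO 损失:评分差越大,要求策略给出的偏好间隔越大。以下仅为帮助理解的示意写法,间隔项的具体形式是本文的假设,并非论文算法本身。

```python
import torch
import torch.nn.functional as F

def dpo_rating_gap_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                        rating_gap, beta=0.1, alpha=1.0):
    """DPO logits plus an illustrative margin proportional to the rating
    gap: a larger gap demands a larger preference margin."""
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(logits - alpha * rating_gap).mean()

# toy sequence log-likelihoods under the policy and the reference model
loss = dpo_rating_gap_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                           torch.tensor([-13.0]), torch.tensor([-14.0]),
                           rating_gap=torch.tensor([2.0]))
```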
[LG-190] Kernelized Edge Attention: Addressing Semantic Attention Blurring in Temporal Graph Neural Networks AAAI2026
链接: https://arxiv.org/abs/2602.00596
作者: Govind Waghmare,Srini Rohan Gujulla Leel,Nikhil Tumbde,Sumedh B G,Sonia Gupta,Srikanta Bedathur
类目: Machine Learning (cs.LG)
*备注: Accepted at AAAI 2026
Abstract:Temporal Graph Neural Networks (TGNNs) aim to capture the evolving structure and timing of interactions in dynamic graphs. Although many models incorporate time through encodings or architectural design, they often compute attention over entangled node and edge representations, failing to reflect their distinct temporal behaviors. Node embeddings evolve slowly as they aggregate long-term structural context, while edge features reflect transient, timestamped interactions (e.g. messages, trades, or transactions). This mismatch results in semantic attention blurring, where attention weights cannot distinguish between slowly drifting node states and rapidly changing, information-rich edge interactions. As a result, models struggle to capture fine-grained temporal dependencies and provide limited transparency into how temporal relevance is computed. This paper introduces KEAT (Kernelized Edge Attention for Temporal Graphs), a novel attention formulation that modulates edge features using a family of continuous-time kernels, including Laplacian, RBF, and a learnable MLP variant. KEAT preserves the distinct roles of nodes and edges, and integrates seamlessly with both Transformer-style (e.g., DyGFormer) and message-passing (e.g., TGN) architectures. It achieves up to 18% MRR improvement over the recent DyGFormer and 7% over TGN on link prediction tasks, enabling more accurate, interpretable and temporally aware message passing in TGNNs.
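【代码示意】摘要中“用连续时间核调制边特征”的思路可以用几行代码说明(示意性代码,非官方实现):交互发生得越久远,其边特征在注意力中的权重越小。

```python
import torch

def rbf_time_kernel(dt: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Continuous-time RBF kernel over elapsed time dt = t_now - t_edge."""
    return torch.exp(-(dt ** 2) / (2 * sigma ** 2))

def laplacian_time_kernel(dt: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """Laplacian kernel: exponential decay for stale interactions."""
    return torch.exp(-lam * dt.abs())

edge_feats = torch.randn(5, 16)                # 5 timestamped edge features
dt = torch.tensor([0.1, 0.5, 1.0, 3.0, 10.0])  # time since each interaction
modulated = edge_feats * rbf_time_kernel(dt).unsqueeze(-1)
```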
[LG-191] SEER: Transformer-based Robust Time Series Forecasting via Automated Patch Enhancement and Replacement
链接: https://arxiv.org/abs/2602.00589
作者: Xiangfei Qiu,Xvyuan Liu,Tianen Shen,Xingjian Wu,Hanyin Cheng,Bin Yang,Jilin Hu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time series forecasting is important in many fields that require accurate predictions for decision-making. Patching techniques, commonly used and effective in time series modeling, help capture temporal dependencies by dividing the data into patches. However, existing patch-based methods fail to dynamically select patches and typically use all patches during the prediction process. In real-world time series, there are often low-quality issues during data collection, such as missing values, distribution shifts, anomalies and white noise, which may cause some patches to contain low-quality information, negatively impacting the prediction results. To address this issue, this study proposes a robust time series forecasting framework called SEER. Firstly, we propose an Augmented Embedding Module, which improves patch-wise representations using a Mixture-of-Experts (MoE) architecture and obtains series-wise token representations through a channel-adaptive perception mechanism. Secondly, we introduce a Learnable Patch Replacement Module, which enhances forecasting robustness and model accuracy through a two-stage process: 1) a dynamic filtering mechanism eliminates negative patch-wise tokens; 2) a replaced attention module substitutes the identified low-quality patches with global series-wise token, further refining their representations through a causal attention mechanism. Comprehensive experimental results demonstrate the SOTA performance of SEER.
[LG-192] Safe Langevin Soft Actor Critic
链接: https://arxiv.org/abs/2602.00587
作者: Mahesh Keswani,Samyak Jain,Raunak P. Bhattacharyya
类目: Machine Learning (cs.LG)
*备注: 20 pages, 12 figures
Abstract:Balancing reward and safety in constrained reinforcement learning remains challenging due to poor generalization from sharp value minima and inadequate handling of heavy-tailed risk distribution. We introduce Safe Langevin Soft Actor-Critic (SL-SAC), a principled algorithm that addresses both issues through parameter-space exploration and distributional risk control. Our approach combines three key mechanisms: (1) Adaptive Stochastic Gradient Langevin Dynamics (aSGLD) for reward critics, promoting ensemble diversity and escape from poor optima; (2) distributional cost estimation via Implicit Quantile Networks (IQN) with Conditional Value-at-Risk (CVaR) optimization for tail-risk mitigation; and (3) a reactive Lagrangian relaxation scheme that adapts constraint enforcement based on the empirical CVaR of episodic costs. We provide theoretical guarantees on CVaR estimation error and demonstrate that CVaR-based Lagrange updates yield stronger constraint violation signals than expected-cost updates. On Safety-Gymnasium benchmarks, SL-SAC achieves the lowest cost in 7 out of 10 tasks while maintaining competitive returns, with cost reductions of 19-63% in velocity tasks compared to state-of-the-art baselines.
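【代码示意】基于分位数网络(IQN)估计成本分布的 CVaR,可以按定义在尾部分位数上取平均。下面给出最小示意,用指数分布的逆 CDF 充当 IQN 输出,仅作说明:

```python
import torch

def cvar_from_quantiles(quantile_fn, alpha: float = 0.9, n: int = 64):
    """Monte-Carlo estimate of CVaR_alpha: average the quantile function
    over tail levels tau ~ Uniform(alpha, 1)."""
    taus = alpha + (1 - alpha) * torch.rand(n)
    return quantile_fn(taus).mean()

# stand-in for an IQN head: inverse CDF of an Exp(1) cost distribution
quantile_fn = lambda taus: -torch.log1p(-taus)
cvar = cvar_from_quantiles(quantile_fn, alpha=0.9)  # ~3.30 for Exp(1)
```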
[LG-193] Bridging Time and Frequency: A Joint Modeling Framework for Irregular Multivariate Time Series Forecasting
链接: https://arxiv.org/abs/2602.00582
作者: Xiangfei Qiu,Kangjia Yan,Xvyuan Liu,Xingjian Wu,Jilin Hu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Irregular multivariate time series forecasting (IMTSF) is challenging due to non-uniform sampling and variable asynchronicity. These irregularities violate the equidistant assumptions of standard models, hindering local temporal modeling and rendering classical frequency-domain methods ineffective for capturing global periodic structures. To address this challenge, we propose TFMixer, a joint time-frequency modeling framework for IMTS forecasting. Specifically, TFMixer incorporates a Global Frequency Module that employs a learnable Non-Uniform Discrete Fourier Transform (NUDFT) to directly extract spectral representations from irregular timestamps. In parallel, the Local Time Module introduces a query-based patch mixing mechanism to adaptively aggregate informative temporal patches and alleviate information density imbalance. Finally, TFMixer fuses the time-domain and frequency-domain representations to generate forecasts and further leverages inverse NUDFT for explicit seasonal extrapolation. Extensive experiments on real-world datasets demonstrate the state-of-the-art performance of TFMixer.
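【代码示意】非均匀离散傅里叶变换(NUDFT)可以直接在不规则时间戳上求谱,无需插值重采样。TFMixer 中的频率是可学习的,这里用固定频率网格做最小演示:

```python
import numpy as np

def nudft(x, t, freqs):
    """Non-uniform DFT: X[k] = sum_n x[n] * exp(-2j*pi*freqs[k]*t[n])."""
    basis = np.exp(-2j * np.pi * np.outer(freqs, t))  # (K, N)
    return basis @ x

# an irregularly sampled 2 Hz sine wave
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 4, 200))
x = np.sin(2 * np.pi * 2.0 * t)
freqs = np.linspace(0, 5, 101)
spectrum = np.abs(nudft(x, t, freqs))  # energy peaks near 2 Hz
```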
[LG-194] Sparsity-Aware Unlearning for Large Language Models
链接: https://arxiv.org/abs/2602.00577
作者: Yuze Wang,Yujia Tong,Ke Xu,Jingling Yuan,Jiawei Jiang,Chuang Hu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models (LLMs) inevitably memorize sensitive information during training, posing significant privacy risks. Machine unlearning has emerged as a promising solution to selectively remove such information without full retraining. However, existing methods are designed for dense models and overlook model sparsification, an essential technique for efficient LLM deployment. We find that unlearning effectiveness degrades substantially on sparse models. Through empirical analysis, we reveal that this degradation occurs because existing unlearning methods require updating all parameters, yet sparsification prunes substantial weights to zero, fundamentally limiting the model's forgetting capacity. To address this challenge, we propose Sparsity-Aware Unlearning (SAU), which decouples unlearning from sparsification objectives through gradient masking that redirects updates to surviving weights, combined with importance-aware redistribution to compensate for pruned parameters. Extensive experiments demonstrate that SAU significantly outperforms existing methods on sparse LLMs, achieving effective forgetting while preserving model utility.
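【代码示意】“把遗忘更新重定向到存活权重”可以用梯度掩码来理解。以下为示意性实现:被剪枝的位置保持为零,其梯度质量按本文假设的最简方式(均匀分摊)转移到存活权重上;论文中的按重要性再分配会更精细。

```python
import torch

def sparsity_aware_step(param: torch.Tensor, grad: torch.Tensor, lr=1e-4):
    """Apply an unlearning update only on surviving (non-zero) weights,
    redistributing the gradient mass of pruned positions uniformly."""
    alive = (param != 0).float()
    n_alive = alive.sum().clamp(min=1)
    spill = (grad * (1 - alive)).sum() / n_alive  # mass from pruned slots
    param -= lr * alive * (grad + spill)

w = torch.tensor([0.5, 0.0, -0.3, 0.0])  # a sparsified weight vector
g = torch.tensor([0.1, 0.2, -0.1, 0.3])
sparsity_aware_step(w, g)
assert w[1] == 0 and w[3] == 0           # pruned weights stay zero
```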
[LG-195] Forget by Uncertainty: Orthogonal Entropy Unlearning for Quantized Neural Networks
链接: https://arxiv.org/abs/2602.00567
作者: Tian Zhang,Yujia Tong,Junhao Dong,Ke Xu,Yuze Wang,Jingling Yuan
类目: Machine Learning (cs.LG)
*备注:
Abstract:The deployment of quantized neural networks on edge devices, combined with privacy regulations like GDPR, creates an urgent need for machine unlearning in quantized models. However, existing methods face critical challenges: they induce forgetting by training models to memorize incorrect labels, conflating forgetting with misremembering, and employ scalar gradient reweighting that cannot resolve directional conflicts between gradients. We propose OEU, a novel Orthogonal Entropy Unlearning framework with two key innovations: 1) Entropy-guided unlearning maximizes prediction uncertainty on forgotten data, achieving genuine forgetting rather than confident misprediction, and 2) Gradient orthogonal projection eliminates interference by projecting forgetting gradients onto the orthogonal complement of retain gradients, providing theoretical guarantees for utility preservation under first-order approximation. Extensive experiments demonstrate that OEU outperforms existing methods in both forgetting effectiveness and retain accuracy.
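【代码示意】梯度正交投影的核心只有一行:把遗忘梯度投影到保留梯度的正交补上,消除方向冲突。最小示意如下:

```python
import torch

def orthogonal_forget_grad(g_forget: torch.Tensor,
                           g_retain: torch.Tensor) -> torch.Tensor:
    """Project the forgetting gradient onto the orthogonal complement
    of the retain gradient, removing the interfering component."""
    coef = torch.dot(g_forget, g_retain) / g_retain.dot(g_retain).clamp(min=1e-12)
    return g_forget - coef * g_retain

g = orthogonal_forget_grad(torch.tensor([1.0, 1.0]), torch.tensor([1.0, 0.0]))
assert torch.isclose(torch.dot(g, torch.tensor([1.0, 0.0])), torch.tensor(0.0))
```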
[LG-196] Beyond the Node: Clade-level Selection for Efficient MCTS in Automatic Heuristic Design
链接: https://arxiv.org/abs/2602.00549
作者: Kezhao Lai,Yutao Lai,Hai-Lin Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:While Monte Carlo Tree Search (MCTS) shows promise in Large Language Model (LLM) based Automatic Heuristic Design (AHD), it suffers from a critical over-exploitation tendency under the limited computational budgets required for heuristic evaluation. To address this limitation, we propose Clade-AHD, an efficient framework that replaces node-level point estimates with clade-level Bayesian beliefs. By aggregating descendant evaluations into Beta distributions and performing Thompson Sampling over these beliefs, Clade-AHD explicitly models uncertainty to guide exploration, enabling more reliable decision-making under sparse and noisy evaluations. Extensive experiments on complex combinatorial optimization problems demonstrate that Clade-AHD consistently outperforms state-of-the-art methods while significantly reducing computational cost. The source code is publicly available at: this https URL.
[LG-197] Depth Not Data: An Analysis of Hessian Spectral Bifurcation
链接: https://arxiv.org/abs/2602.00545
作者: Shenyang Deng,Boyao Liao,Zhuoli Ouyang,Tianyu Pang,Yaoqing Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:The eigenvalue distribution of the Hessian matrix plays a crucial role in understanding the optimization landscape of deep neural networks. Prior work has attributed the well-documented “bulk-and-spike” spectral structure, where a few dominant eigenvalues are separated from a bulk of smaller ones, to the imbalance in the data covariance matrix. In this work, we challenge this view by demonstrating that such spectral bifurcation can arise purely from the network architecture, independent of data imbalance. Specifically, we analyze a deep linear network setup and prove that, even when the data covariance is perfectly balanced, the Hessian still exhibits a bifurcated eigenvalue structure: a dominant cluster and a bulk cluster. Crucially, we establish that the ratio between dominant and bulk eigenvalues scales linearly with the network depth. This reveals that the spectral gap is strongly affected by the network architecture rather than solely by the data distribution. Our results suggest that both model architecture and data characteristics should be considered when designing optimization algorithms for deep networks.
[LG-198] One Loss to Rule Them All: Marked Time-to-Event for Structured EHR Foundation Models
链接: https://arxiv.org/abs/2602.00541
作者: Zilin Jing,Vincent Jeanselme,Yuta Kobayashi,Simon A. Lee,Chao Pang,Aparajita Kashyap,Yanwei Li,Xinzhuo Jiang,Shalmali Joshi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Clinical events captured in Electronic Health Records (EHR) are irregularly sampled and may consist of a mixture of discrete events and numerical measurements, such as laboratory values or treatment dosages. The sequential nature of EHR, analogous to natural language, has motivated the use of next-token prediction to train prior EHR Foundation Models (FMs) over events. However, this training fails to capture the full structure of EHR. We propose ORA, a marked time-to-event pretraining objective that jointly models event timing and associated measurements. Across multiple datasets, downstream tasks, and model architectures, this objective consistently yields more generalizable representations than next-token prediction and pretraining losses that ignore continuous measurements. Importantly, the proposed objective yields improvements beyond traditional classification evaluation, including better regression and time-to-event prediction. Beyond introducing a new family of FMs, our results suggest a broader takeaway: pretraining objectives that account for EHR structure are critical for expanding downstream capabilities and generalizability.
[LG-199] Surrogate Ensemble in Expensive Multi-Objective Optimization via Deep Q-Learning
链接: https://arxiv.org/abs/2602.00540
作者: Yuxin Wu,Hongshu Guo,Ting Huang,Yue-Jiao Gong,Zeyuan Ma
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:
Abstract:Surrogate-assisted Evolutionary Algorithms (SAEAs) have shown promising robustness in solving expensive optimization problems. A key aspect that impacts SAEAs' effectiveness is surrogate model selection, which in existing works is predominantly decided by human developers. Such a human-made design choice introduces strong bias into SAEAs and may hurt their expected performance on out-of-scope tasks. In this paper, we propose a reinforcement learning-assisted ensemble framework, termed SEEMOO, which is capable of scheduling different surrogate models within a single optimization process, hence boosting the overall optimization performance in a cooperative paradigm. Specifically, we focus on expensive multi-objective optimization problems, where multiple objective functions shape a compositional landscape and hence challenge surrogate selection. SEEMOO comprises the following core designs: 1) a pre-collected model pool that maintains different surrogate models; 2) an attention-based state extractor that supports universal optimization state representation of problems with varied objective numbers; 3) a deep Q-network that serves as a dynamic surrogate selector: given the optimization state, it selects the desired surrogate model for current-step evaluation. SEEMOO is trained to maximize the overall optimization performance under a training problem distribution. Extensive benchmark results demonstrate that SEEMOO's surrogate ensemble paradigm boosts the optimization performance of single-surrogate baselines. Further ablation studies underscore the importance of SEEMOO's design components.
[LG-200] OpenDDI: A Comprehensive Benchmark for DDI Prediction
链接: https://arxiv.org/abs/2602.00539
作者: Xinmo Jin,Bowen Fan,Xunkai Li,Henan Sun,YuXin Zeng,Zekai Chen,Yuxuan Sun,Jia Li,Qiangqiang Dai,Hongchao Qin,Rong-Hua Li,Guoren Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Drug-Drug Interactions (DDIs) significantly influence therapeutic efficacy and patient safety. As experimental discovery is resource-intensive and time-consuming, efficient computational methodologies have become essential. The predominant paradigm formulates DDI prediction as a drug graph-based link prediction task. However, further progress is hindered by two fundamental challenges: (1) lack of high-quality data: most studies rely on small-scale DDI datasets and single-modal drug representations; (2) lack of standardized evaluation: inconsistent scenarios, varied metrics, and diverse baselines. To address the above issues, we propose OpenDDI, a comprehensive benchmark for DDI prediction. Specifically, (1) from the data perspective, OpenDDI unifies 6 widely used DDI datasets and 2 existing forms of drug representation, while additionally contributing 3 new large-scale LLM-augmented datasets and a new multimodal drug representation covering 5 modalities. (2) From the evaluation perspective, OpenDDI unifies 20 SOTA model baselines across 3 downstream tasks, with standardized protocols for data quality, effectiveness, generalization, robustness, and efficiency. Based on OpenDDI, we conduct a comprehensive evaluation and derive 10 valuable insights for DDI prediction while exposing current limitations to provide critical guidance for this rapidly evolving field. Our code is available at this https URL
[LG-201] Invertible Memory Flow Networks
链接: https://arxiv.org/abs/2602.00535
作者: Liyu Zerihun,Alexandr Plashchinsky
类目: Machine Learning (cs.LG)
*备注:
Abstract:Long sequence neural memory remains a challenging problem. RNNs and their variants suffer from vanishing gradients, and Transformers suffer from quadratic scaling. Furthermore, compressing long sequences into a finite fixed representation remains an intractable problem due to the difficult optimization landscape. Invertible Memory Flow Networks (IMFN) make long sequence compression tractable through factorization: instead of learning end-to-end compression, we decompose the problem into pairwise merges using a binary tree of “sweeper” modules. Rather than learning to compress long sequences, each sweeper learns a much simpler 2-to-1 compression task, achieving O(log N) depth with sublinear error accumulation in sequence length. For online inference, we distill IMFN into a constant-cost recurrent student that achieves O(1) sequential steps. Empirical results validate IMFN on long MNIST sequences and UCF-101 videos, demonstrating compression of high-dimensional data over long sequences.
[LG-202] AIRE-Prune: Asymptotic Impulse-Response Energy for State Pruning in State Space Models
链接: https://arxiv.org/abs/2602.00534
作者: Apurba Prasad Padhy,Fernando Camacho,Saibal Mukhopadhyay
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:State space models (SSMs) often sacrifice capacity, search space, or stability to offset the memory and compute costs of large state dimensions. We introduce a structured post-training pruning method for SSMs – AIRE-Prune (Asymptotic Impulse-Response Energy for State PRUN(E)) – that reduces each layer's state dimension by directly minimizing long-run output-energy distortion. AIRE-Prune assigns every state a closed-form asymptotic impulse-response energy-based score, i.e., the total impulse-response energy it contributes over an infinite horizon (time), and normalizes these scores layer-wise to enable global cross-layer comparison and selection. This extends modal truncation from single systems to deep stacks and aligns pruning with asymptotic response energy rather than worst-case gain. Across diverse sequence benchmarks, AIRE-Prune reveals substantial redundancy in SISO and MIMO SSMs, with an average pruning ratio of 60.8% and an average accuracy drop of 0.29% without retraining, while significantly lowering compute. Code: this https URL.
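【代码示意】对角离散时间 SSM 中,单个状态的无穷时程脉冲响应能量有标准闭式解:E_i = |c_i b_i|^2 / (1 - |a_i|^2)(|a_i| < 1,SISO 情形)。据此可以给每个状态打分并剪掉低能量状态;以下仅为按这一公式写的最小示意,不代表论文的完整打分与归一化流程。

```python
import numpy as np

def aire_scores(a, b, c):
    """Closed-form infinite-horizon impulse-response energy per state of a
    diagonal discrete-time SSM: sum_t |c a^t b|^2 = |cb|^2 / (1 - |a|^2)."""
    return (np.abs(c * b) ** 2) / (1 - np.abs(a) ** 2)

a = np.array([0.99, 0.5, 0.9, 0.1])  # per-state poles (decay rates)
b = np.array([1.0, 2.0, 0.1, 1.0])
c = np.array([0.5, 1.0, 1.0, 0.2])
scores = aire_scores(a, b, c)
keep = np.argsort(scores)[len(scores) // 2:]  # keep the high-energy half
```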
[LG-203] Reinforcement Learning-assisted Constraint Relaxation for Constrained Expensive Optimization
链接: https://arxiv.org/abs/2602.00532
作者: Qianhao Zhu,Sijie Ma,Zeyuan Ma,Hongshu Guo,Yue-Jiao Gong
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:
Abstract:Constraint handling plays a key role in solving realistic complex optimization problems. Though intensively discussed in the last few decades, existing constraint handling techniques predominantly rely on human experts' designs, which often fall short when applied to general cases. Motivated by recent progress in Meta-Black-Box Optimization, where automated algorithm design can be learned to boost optimization performance, in this paper we propose learning an effective, adaptive, and generalizable constraint handling policy through reinforcement learning. Specifically, a tailored Markov Decision Process is first formulated, where given optimization dynamics features, a deep Q-network-based policy controls the constraint relaxation level along the underlying optimization process. Such adaptive constraint handling provides a flexible tradeoff between objective-oriented exploitation and feasible-region-oriented exploration, and hence leads to promising optimization performance. We train our approach on the CEC 2017 Constrained Optimization benchmark under a limited evaluation budget (expensive cases) and compare the trained constraint handling policy to strong baselines such as recent winners in CEC/GECCO competitions. Extensive experimental results show that our approach performs competitively with or even surpasses the compared baselines under either leave-one-out cross-validation or ordinary train-test split validation. Further analysis and ablation studies reveal key insights into our designs.
[LG-204] NEST: Nested Event Stream Transformer for Sequences of Multisets
链接: https://arxiv.org/abs/2602.00520
作者: Minghui Sun,Haoyu Gong,Xingyu You,Jillian Hurst,Benjamin Goldstein,Matthew Engelhard
类目: Machine Learning (cs.LG)
*备注: 11 pages
Abstract:Event stream data often exhibit hierarchical structure in which multiple events co-occur, resulting in a sequence of multisets (i.e., bags of events). In electronic health records (EHRs), for example, medical events are grouped into a sequence of clinical encounters with well-defined temporal structure, but the order and timing of events within each encounter may be unknown or unreliable. Most existing foundation models (FMs) for event stream data flatten this hierarchy into a one-dimensional sequence, leading to (i) computational inefficiency associated with dense attention and learning spurious within-set relationships, and (ii) lower-quality set-level representations from heuristic post-training pooling for downstream tasks. Here, we show that preserving the original hierarchy in the FM architecture provides a useful inductive bias that improves both computational efficiency and representation quality. We then introduce Nested Event Stream Transformer (NEST), a FM for event streams comprised of sequences of multisets. Building on this architecture, we formulate Masked Set Modeling (MSM), an efficient paradigm that promotes improved set-level representation learning. Experiments on real-world multiset sequence data show that NEST captures real-world dynamics while improving both pretraining efficiency and downstream performance.
[LG-205] Minerva: Reinforcement Learning with Verifiable Rewards for Cyber Threat Intelligence LLMs
链接: https://arxiv.org/abs/2602.00513
作者: Md Tanvirul Alam,Aritran Piplai,Ionut Cardei,Nidhi Rastogi,Peter J Worth Jr
类目: Machine Learning (cs.LG)
*备注:
Abstract:Cyber threat intelligence (CTI) analysts routinely convert noisy, unstructured security artifacts into standardized, automation-ready representations. Although large language models (LLMs) show promise for this task, existing approaches remain brittle when producing structured CTI outputs and have largely relied on supervised fine-tuning (SFT). In contrast, CTI standards and community-maintained resources define canonical identifiers and schemas that enable deterministic verification of model outputs. We leverage this structure to study reinforcement learning with verifiable rewards (RLVR) for CTI tasks. We introduce Minerva, a unified dataset and training pipeline spanning multiple CTI subtasks, each paired with task-specific verifiers that score structured outputs and identifier predictions. To address reward sparsity during rollout, we propose a lightweight self-training mechanism that generates additional verified trajectories and distills them back into the model. Experiments across LLM backbones show consistent improvements in accuracy and robustness over SFT across multiple benchmarks.
[LG-206] Partition of Unity Neural Networks for Interpretable Classification with Explicit Class Regions
链接: https://arxiv.org/abs/2602.00511
作者: Akram Aldroubi
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Despite their empirical success, neural network classifiers remain difficult to interpret. In softmax-based models, class regions are defined implicitly as solutions to systems of inequalities among logits, making them difficult to extract and visualize. We introduce Partition of Unity Neural Networks (PUNN), an architecture in which class probabilities arise directly from a learned partition of unity, without requiring a softmax layer. PUNN constructs k nonnegative functions h_1, \ldots, h_k satisfying \sum_i h_i(x) = 1, where each h_i(x) directly represents P(\text{class } i \mid x). Unlike softmax, where class regions are defined implicitly through coupled inequalities among logits, each PUNN partition function h_i directly defines the probability of class i as a standalone function of x. We prove that PUNN is dense in the space of continuous probability maps on compact domains. The gate functions g_i that define the partition can use various activation functions (sigmoid, Gaussian, bump) and parameterizations ranging from flexible MLPs to parameter-efficient shape-informed designs (spherical shells, ellipsoids, spherical harmonics). Experiments on synthetic data, UCI benchmarks, and MNIST show that PUNN with MLP-based gates achieves accuracy within 0.3–0.6% of standard multilayer perceptrons. When geometric priors match the data structure, shape-informed gates achieve comparable accuracy with up to 300 \times fewer parameters. These results demonstrate that interpretable-by-design architectures can be competitive with black-box models while providing transparent class probability assignments.
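【代码示意】单位分解(partition of unity)本身很容易写出:取若干非负门函数并做归一化,使每个 h_i(x) 直接就是类别概率。以下用高斯门函数给出最小示意:

```python
import torch

def partition_of_unity(x, centers, width=1.0):
    """Gaussian gates normalized to sum to one: each h_i(x) is directly
    P(class i | x); no softmax over logits is involved."""
    d2 = torch.cdist(x, centers) ** 2          # (B, k) squared distances
    g = torch.exp(-d2 / (2 * width ** 2)) + 1e-12
    return g / g.sum(dim=-1, keepdim=True)

x = torch.randn(4, 2)                          # 4 points in R^2
centers = torch.tensor([[0.0, 0.0], [2.0, 2.0], [-2.0, 1.0]])  # k = 3 classes
h = partition_of_unity(x, centers)
assert torch.allclose(h.sum(-1), torch.ones(4))
```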
[LG-207] OD-DEAL: Dynamic Expert-Guided Adversarial Learning with Online Decomposition for Scalable Capacitated Vehicle Routing
链接: https://arxiv.org/abs/2602.00488
作者: Dongbin Jiao,Zisheng Chen,Xianyi Wang,Jintao Shi,Shengcai Liu,Shi Yan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Solving large-scale capacitated vehicle routing problems (CVRP) is hindered by the high complexity of heuristics and the limited generalization of neural solvers on massive graphs. We propose OD-DEAL, an adversarial learning framework that tightly integrates hybrid genetic search (HGS) and online barycenter clustering (BCC) decomposition, and leverages high-fidelity knowledge distillation to transfer expert heuristic behavior. OD-DEAL trains a graph attention network (GAT)-based generative policy through a minimax game, in which divide-and-conquer strategies from a hybrid expert are distilled into dense surrogate rewards. This enables high-quality, clustering-free inference on large-scale instances. Empirical results demonstrate that OD-DEAL achieves state-of-the-art (SOTA) real-time CVRP performance, solving 10000-node instances with near-constant neural scaling. This uniquely enables the sub-second, heuristic-quality inference required for dynamic large-scale deployment.
[LG-208] AREAL-DTA: Dynamic Tree Attention for Efficient Reinforcement Learning of Large Language Models
链接: https://arxiv.org/abs/2602.00482
作者: Jiarui Zhang,Yuchen Yang,Ran Yan,Zhiyu Mei,Liyuan Zhang,Daifeng Li,Wei Fu,Jiaxuan Gao,Shusheng Xu,Yi Wu,Binhang Yuan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement learning (RL) based post-training for large language models (LLMs) is computationally expensive, as it generates many rollout sequences that could frequently share long token prefixes. Existing RL frameworks usually process these sequences independently, repeatedly recomputing identical prefixes during forward and backward passes during policy model training, leading to substantial inefficiencies in computation and memory usage. Although prefix sharing naturally induces a tree structure over rollouts, prior tree-attention-based solutions rely on fully materialized attention masks and scale poorly in RL settings. In this paper, we introduce AREAL-DTA to efficiently exploit prefix sharing in RL training. AREAL-DTA employs a depth-first-search (DFS)-based execution strategy that dynamically traverses the rollout prefix tree during both forward and backward computation, materializing only a single root-to-leaf path at a time. To further improve scalability, AREAL-DTA incorporates a load-balanced distributed batching mechanism that dynamically constructs and processes prefix trees across multiple GPUs. Across popular RL post-training workloads, AREAL-DTA achieves up to 8.31\times higher training throughput on \tau^2-bench.
[LG-209] Parallel Stochastic Gradient-Based Planning for World Models
链接: https://arxiv.org/abs/2602.00475
作者: Michael Psenka,Michael Rabbat,Aditi Krishnapriyan,Yann LeCun,Amir Bar
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 23 pages, 7 figures
Abstract:World models simulate environment dynamics from raw sensory inputs like video. However, using them for planning can be challenging due to the vast and unstructured search space. We propose a robust and highly parallelizable planner that leverages the differentiability of the learned world model for efficient optimization, solving long-horizon control tasks from visual input. Our method treats states as optimization variables (“virtual states”) with soft dynamics constraints, enabling parallel computation and easier optimization. To facilitate exploration and avoid local optima, we introduce stochasticity into the states. To mitigate sensitive gradients through high-dimensional vision-based world models, we modify the gradient structure to descend towards valid plans while only requiring action-input gradients. Our planner, which we call GRASP (Gradient RelAxed Stochastic Planner), can be viewed as a stochastic version of a non-condensed or collocation-based optimal controller. We provide theoretical justification and experiments on video-based world models, where our resulting planner outperforms existing planning algorithms like the cross-entropy method (CEM) and vanilla gradient-based optimization (GD) on long-horizon experiments, both in success rate and time to convergence.
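【代码示意】“虚拟状态 + 软动力学约束”的规划目标大致如下。注意这是示意性代码:玩具线性世界模型、统一罚系数以及对全部变量求梯度均为简化假设;论文实际改造了梯度结构,只需对动作输入的梯度。

```python
import torch

def grasp_objective(s0, states, actions, goal, world_model,
                    rho=10.0, noise=0.01):
    """Goal cost plus penalized dynamics residuals over noisy
    "virtual states"; the noise helps escape poor local optima."""
    s = states + noise * torch.randn_like(states)
    dyn = ((states[1:] - world_model(s[:-1], actions)) ** 2).sum()
    init = ((states[0] - s0) ** 2).sum()
    return ((states[-1] - goal) ** 2).sum() + rho * (dyn + init)

world_model = lambda s, a: s + a                  # toy differentiable dynamics
states = torch.zeros(11, 2, requires_grad=True)   # horizon 10, 2-D state
actions = torch.zeros(10, 2, requires_grad=True)
opt = torch.optim.Adam([states, actions], lr=0.05)
for _ in range(300):
    opt.zero_grad()
    grasp_objective(torch.zeros(2), states, actions,
                    torch.tensor([1.0, -1.0]), world_model).backward()
    opt.step()
```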
[LG-210] Search Inspired Exploration in Reinforcement Learning
链接: https://arxiv.org/abs/2602.00460
作者: Georgios Sotirchos,Zlatan Ajanović,Jens Kober
类目: Machine Learning (cs.LG)
*备注:
Abstract:Exploration in environments with sparse rewards remains a fundamental challenge in reinforcement learning (RL). Existing approaches such as curriculum learning and Go-Explore often rely on hand-crafted heuristics, while curiosity-driven methods risk converging to suboptimal policies. We propose Search-Inspired Exploration in Reinforcement Learning (SIERL), a novel method that actively guides exploration by setting sub-goals based on the agent's learning progress. At the beginning of each episode, SIERL chooses a sub-goal from the frontier (the boundary of the agent's known state space), before the agent continues exploring toward the main task objective. The key contribution of our method is the sub-goal selection mechanism, which provides state-action pairs that are neither overly familiar nor completely novel. Thus, it assures that the frontier is expanded systematically and that the agent is capable of reaching any state within it. Inspired by search, sub-goals are prioritized from the frontier based on estimates of cost-to-come and cost-to-go, effectively steering exploration towards the most informative regions. In experiments on challenging sparse-reward environments, SIERL outperforms dominant baselines in both achieving the main task goal and generalizing to reach arbitrary states in the environment.
[LG-211] FedMOA: Federated GRPO for Personalized Reasoning LLM s under Heterogeneous Rewards
链接: https://arxiv.org/abs/2602.00453
作者: Ziyao Wang,Daeun Jung,Yexiao He,Guoheng Sun,Zheyu Shen,Myungjin Lee,Ang Li
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:Group Relative Policy Optimization (GRPO) has recently emerged as an effective approach for improving the reasoning capabilities of large language models through online multi-objective reinforcement learning. While personalization on private data is increasingly vital, traditional Reinforcement Learning (RL) alignment is often memory-prohibitive for on-device federated learning due to the overhead of maintaining a separate critic network. GRPO’s critic-free architecture enables feasible on-device training, yet transitioning to a federated setting introduces systemic challenges: heterogeneous reward definitions, imbalanced multi-objective optimization, and high training costs. We propose FedMOA, a federated GRPO framework for multi-objective alignment under heterogeneous rewards. FedMOA stabilizes local training through an online adaptive weighting mechanism via hypergradient descent, which prioritizes primary reasoning as auxiliary objectives saturate. On the server side, it utilizes a task- and accuracy-aware aggregation strategy to prioritize high-quality updates. Experiments on mathematical reasoning and code generation benchmarks demonstrate that FedMOA consistently outperforms federated averaging, achieving accuracy gains of up to 2.2% while improving global performance, personalization, and multi-objective balance.
[LG-212] Stabilizing Decentralized Federated Fine-Tuning via Topology-Aware Alternating LoRA
链接: https://arxiv.org/abs/2602.00451
作者: Xiaoyu Wang,Xiaotian Li,Zhixiang Zhou,Chen Li,Yong Liu
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 17 Pages
Abstract:Decentralized federated learning (DFL), a serverless variant of federated learning, poses unique challenges for parameter-efficient fine-tuning due to the factorized structure of low-rank adaptation (LoRA). Unlike linear parameters, decentralized aggregation of LoRA updates introduces topology-dependent cross terms that can destabilize training under dynamic communication graphs. We propose TAD-LoRA, a Topology-Aware Decentralized Low-Rank Adaptation framework that coordinates the updates and mixing of LoRA factors to control inter-client misalignment. We theoretically prove the convergence of TAD-LoRA under non-convex objectives, explicitly characterizing the trade-off between topology-induced cross-term error and block-coordinate representation bias governed by the switching interval of alternating training. Experiments under various communication conditions validate our analysis, showing that TAD-LoRA achieves robust performance across different communication scenarios, remaining competitive in strongly connected topologies and delivering clear gains under moderately and weakly connected topologies, with particularly strong results on the MNLI dataset.
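【代码示意】摘要所说的“交叉项”可以直接数值验证:对 LoRA 因子做线性平均时,mean(B_i) @ mean(A_i) 一般不等于 mean(B_i @ A_i),二者之差正是去中心化聚合引入的偏差来源。

```python
import torch

torch.manual_seed(0)
B = [torch.randn(8, 2) for _ in range(4)]  # per-client LoRA "B" factors
A = [torch.randn(2, 8) for _ in range(4)]  # per-client LoRA "A" factors

avg_of_products = sum(b @ a for b, a in zip(B, A)) / 4
product_of_avgs = (sum(B) / 4) @ (sum(A) / 4)
cross_term = avg_of_products - product_of_avgs
print(cross_term.norm())  # nonzero: the aggregation bias TAD-LoRA controls
```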
[LG-213] Towards Building Non-Fine-Tunable Foundation Models
链接: https://arxiv.org/abs/2602.00446
作者: Ziyao Wang,Nizhang Li,Pingzhi Li,Guoheng Sun,Tianlong Chen,Ang Li
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Open-sourcing foundation models (FMs) enables broad reuse but also exposes model trainers to economic and safety risks from unrestricted downstream fine-tuning. We address this problem by building non-fine-tunable foundation models: models that remain broadly usable in their released form while yielding limited adaptation gains under task-agnostic unauthorized fine-tuning. We propose Private Mask Pre-Training (PMP), a pre-training framework that concentrates representation learning into a sparse subnetwork identified early in training. The binary mask defining this subnetwork is kept private, and only the final dense weights are released. This forces unauthorized fine-tuning without access to the mask to update parameters misaligned with the pretraining subspace, inducing an intrinsic mismatch between the fine-tuning objective and the pre-training geometry. We provide theoretical analysis showing that this mismatch destabilizes gradient-based adaptation and bounds fine-tuning gains. Empirical results on large language models demonstrate that PMP preserves base model performance while consistently degrading unauthorized fine-tuning across a wide range of downstream tasks, with the strength of non-fine-tunability controlled by the mask ratio.
[LG-214] Open Materials Generation with Inference-Time Reinforcement Learning
链接: https://arxiv.org/abs/2602.00424
作者: Philipp Hoellmer,Stefano Martiniani
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: 16 pages, 8 figures, 1 table
Abstract:Continuous-time generative models for crystalline materials enable inverse materials design by learning to predict stable crystal structures, but incorporating explicit target properties into the generative process remains challenging. Policy-gradient reinforcement learning (RL) provides a principled mechanism for aligning generative models with downstream objectives but typically requires access to the score, which has prevented its application to flow-based models that learn only velocity fields. We introduce Open Materials Generation with Inference-time Reinforcement Learning (OMatG-IRL), a policy-gradient RL framework that operates directly on the learned velocity fields and eliminates the need for the explicit computation of the score. OMatG-IRL leverages stochastic perturbations of the underlying generation dynamics preserving the baseline performance of the pretrained generative model while enabling exploration and policy-gradient estimation at inference time. Using OMatG-IRL, we present the first application of RL to crystal structure prediction (CSP). Our method enables effective reinforcement of an energy-based objective while preserving diversity through composition conditioning, and it achieves performance competitive with score-based RL approaches. Finally, we show that OMatG-IRL can learn time-dependent velocity-annealing schedules, enabling accurate CSP with order-of-magnitude improvements in sampling efficiency and, correspondingly, reduction in generation time.
[LG-215] Federated-inspired Single-cell Batch Integration in Latent Space
链接: https://arxiv.org/abs/2602.00423
作者: Quang-Huy Nguyen,Zongliang Yue,Hao Chen,Wei-Shinn Ku,Jiaqi Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Advances in single-cell RNA sequencing enable the rapid generation of massive, high-dimensional datasets, yet the accumulation of data across experiments introduces batch effects that obscure true biological signals. Existing batch correction approaches either insufficiently correct batch effects or require centralized retraining on the complete dataset, limiting their applicability in distributed and continually evolving single-cell data settings. We introduce scBatchProx, a post-hoc optimization method inspired by federated learning principles for refining cell-level embeddings produced by arbitrary upstream methods. Treating each batch as a client, scBatchProx learns batch-conditioned adapters under proximal regularization, correcting batch structure directly in latent space without requiring raw expression data or centralized optimization. The method is lightweight and deployable, optimizing batch-specific adapter parameters only. Extensive experiments show that scBatchProx consistently yields relative gains of approximately 3-8% in overall embedding quality, with batch correction and biological conservation improving in 90% and 85% of data-method pairs, respectively. We envision this work as a step toward the practical refinement of learned representations in dynamic single-cell data systems.
[LG-216] Fed-Listing: Federated Label Distribution Inference in Graph Neural Networks
链接: https://arxiv.org/abs/2602.00407
作者: Suprim Nakarmi,Junggab Son,Yue Zhao,Zuobin Xiong
类目: Machine Learning (cs.LG)
*备注: 13 pages, 4 figures, and 5 tables
Abstract:Graph Neural Networks (GNNs) have been intensively studied for their expressive representation and learning performance on graph-structured data, enabling effective modeling of complex relational dependencies among nodes and edges in various domains. However, standalone GNNs can expose threat surfaces and raise privacy concerns, as sensitive graph-structured data is collected and processed in a centralized setting. To solve this issue, Federated Graph Neural Networks (FedGNNs) are proposed to facilitate collaborative learning over decentralized local graph data, aiming to preserve user privacy. Yet, emerging research indicates that even in these settings, shared model updates, particularly gradients, can unintentionally leak sensitive information of local users. Numerous privacy inference attacks have been explored in traditional federated learning and extended to graph settings, but the problem of label distribution inference in FedGNNs remains largely underexplored. In this work, we introduce Fed-Listing (Federated Label Distribution Inference in GNNs), a novel gradient-based attack designed to infer the private label statistics of target clients in FedGNNs without access to raw data or node features. Fed-Listing only leverages the final-layer gradients exchanged during training to uncover statistical patterns that reveal class proportions in a stealthy manner. An auxiliary shadow dataset is used to generate diverse label partitioning strategies, simulating various client distributions, on which the attack model is trained. Extensive experiments on four benchmark datasets and three GNN architectures show that Fed-Listing significantly outperforms existing baselines, including random guessing and Decaf, even under challenging non-i.i.d. scenarios. Moreover, applying defense mechanisms can barely reduce our attack performance, unless the model's utility is severely degraded.
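【代码示意】“末层梯度泄露标签统计”有一个众所周知的直观来源:softmax 交叉熵下,末层偏置梯度约为(预测概率 - 真实标签)的批内平均,出现过的类别会把对应分量拉负,幅度大致正比于类别频率。以下为这一一般性现象的演示,并非论文的完整攻击流程:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.zeros(100, 5, requires_grad=True)  # uniform predictions
labels = torch.randint(0, 3, (100,))              # only classes 0-2 occur
F.cross_entropy(logits, labels).backward()
grad_b = logits.grad.sum(0)          # stands in for the final-layer bias grad
est = torch.clamp(-grad_b, min=0)
print(est / est.sum())               # roughly recovers label proportions
```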
[LG-217] DROGO: Default Representation Objective via Graph Optimization in Reinforcement Learning
链接: https://arxiv.org/abs/2602.00403
作者: Hon Tik Tse,Marlos C. Machado
类目: Machine Learning (cs.LG)
*备注:
Abstract:In computational reinforcement learning, the default representation (DR) and its principal eigenvector have been shown to be effective for a wide variety of applications, including reward shaping, count-based exploration, option discovery, and transfer. However, in prior investigations, the eigenvectors of the DR were computed by first approximating the DR matrix, and then performing an eigendecomposition. This procedure is computationally expensive and does not scale to high-dimensional spaces. In this paper, we derive an objective for directly approximating the principal eigenvector of the DR with a neural network. We empirically demonstrate the effectiveness of the objective in a number of environments, and apply the learned eigenvectors for reward shaping.
[LG-218] MemoryLLM: Plug-n-Play Interpretable Feed-Forward Memory for Transformers
链接: https://arxiv.org/abs/2602.00398
作者: Ajay Jaiswal,Lauren Hannah,Han-Byul Kim,Duc Hoang,Arnav Kundu,Mehrdad Farajtabar,Minsik Cho
类目: Machine Learning (cs.LG)
*备注:
Abstract:Understanding how transformer components operate in LLMs is important, as it is at the core of recent technological advances in artificial intelligence. In this work, we revisit the challenges associated with interpretability of feed-forward modules (FFNs) and propose MemoryLLM, which aims to decouple FFNs from self-attention and enables us to study the decoupled FFNs as context-free token-wise neural retrieval memory. In detail, we investigate how input tokens access memory locations within FFN parameters and the importance of FFN memory across different downstream tasks. MemoryLLM achieves context-free FFNs by training them in isolation from self-attention directly using the token embeddings. This approach allows FFNs to be pre-computed as token-wise lookups (ToLs), enabling on-demand transfer between VRAM and storage, additionally enhancing inference efficiency. We also introduce Flex-MemoryLLM, positioning it between a conventional transformer design and MemoryLLM. This architecture bridges the performance gap caused by training FFNs with context-free token-wise embeddings.
[LG-219] Localized High-resolution Geographic Representations with Slepian Functions
链接: https://arxiv.org/abs/2602.00392
作者: Arjun Rao,Ruth Crasto,Tessa Ooms,David Rolnick,Konstantin Klemmer,Marc Rußwurm
类目: Machine Learning (cs.LG)
*备注: 23 pages, 12 figures, 6 tables
Abstract:Geographic data is fundamentally local. Disease outbreaks cluster in population centers, ecological patterns emerge along coastlines, and economic activity concentrates within country borders. Machine learning models that encode geographic location, however, distribute representational capacity uniformly across the globe, struggling at the fine-grained resolutions that localized applications require. We propose a geographic location encoder built from spherical Slepian functions that concentrate representational capacity inside a region-of-interest and scale to high resolutions without extensive computational demands. For settings requiring global context, we present a hybrid Slepian-Spherical Harmonic encoder that efficiently bridges the tradeoff between local-global performance, while retaining desirable properties such as pole-safety and spherical-surface-distance preservation. Across five tasks spanning classification, regression, and image-augmented prediction, Slepian encodings outperform baselines and retain performance advantages across a wide range of neural network architectures.
[LG-220] A Fragile Guardrail: Diffusion LLMs' Safety Blessing and Its Failure Mode
链接: https://arxiv.org/abs/2602.00388
作者: Zeyuan He,Yupeng Chen,Lang Lin,Yihan Wang,Shenxu Chang,Eric Sommerlade,Philip Torr,Junchi Yu,Adel Bibi,Jialin Yu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Diffusion large language models (D-LLMs) offer an alternative to autoregressive LLMs (AR-LLMs) and have demonstrated advantages in generation efficiency. Beyond the utility benefits, we argue that D-LLMs exhibit a previously underexplored safety blessing: their diffusion-style generation confers intrinsic robustness against jailbreak attacks originally designed for AR-LLMs. In this work, we provide an initial analysis of the underlying mechanism, showing that the diffusion trajectory induces a stepwise reduction effect that progressively suppresses unsafe generations. This robustness, however, is not absolute. We identify a simple yet effective failure mode, termed context nesting, where harmful requests are embedded within structured benign contexts, effectively bypassing the stepwise reduction mechanism. Empirically, we show that this simple strategy is sufficient to bypass D-LLMs’ safety blessing, achieving state-of-the-art attack success rates across models and benchmarks. Most notably, it enables the first successful jailbreak of Gemini Diffusion, to our knowledge, exposing a critical vulnerability in commercial D-LLMs. Together, our results characterize both the origins and the limits of D-LLMs’ safety blessing, constituting an early-stage red-teaming of D-LLMs.
[LG-221] RePaint-Enhanced Conditional Diffusion Model for Parametric Engineering Designs under Performance and Parameter Constraints
链接: https://arxiv.org/abs/2602.00384
作者: Ke Wang,Nguyen Gia Hien Vu,Yifan Tang,Mostafa Rahmani Dehaghani,G. Gary Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper presents a RePaint-enhanced framework that integrates a pre-trained performance-guided denoising diffusion probabilistic model (DDPM) for performance- and parameter-constrained engineering design generation. The proposed method enables the generation of missing design components based on a partial reference design while satisfying performance constraints, without retraining the underlying model. By applying mask-based resampling during the inference process, RePaint allows efficient and controllable repainting of partial designs under both performance and parameter constraints, which is not supported by conventional DDPM-based methods. The framework is evaluated on two representative design problems, parametric ship hull design and airfoil design, demonstrating its ability to generate novel designs with expected performance based on a partial reference design. Results show that the method achieves accuracy comparable to or better than pre-trained models while enabling controlled novelty through fixing partial designs. Overall, the proposed approach provides an efficient, training-free solution for parameter-constraint-aware generative design in engineering applications.
[LG-222] MATRIX: A Multimodal Benchmark and Post-Training Framework for Materials Science
链接: https://arxiv.org/abs/2602.00376
作者: Delia McGrath,Curtis Chong,Rohil Kulkarni,Gerbrand Ceder,Adeesh Kolluru
类目: Machine Learning (cs.LG)
*备注: 17 pages, 9 Figures, submitted
Abstract:Scientific reasoning in materials science requires integrating multimodal experimental evidence with underlying physical theory. Existing benchmarks make it difficult to assess whether incorporating visual experimental data during post-training improves mechanism-grounded explanation reasoning beyond text-only supervision. We introduce MATRIX, a multimodal benchmark for materials science reasoning that evaluates foundational theory, research-level reasoning, and the interpretation of real experimental artifacts across multiple characterization modalities. Using MATRIX as a controlled diagnostic, we isolate the effect of visual grounding by comparing post-training on structured materials science text alone with post-training that incorporates paired experimental images. Despite using relatively small amounts of multimodal data, visual supervision improves experimental interpretation by 10-25% and yields 5-16% gains on text-only scientific reasoning tasks. Our results demonstrate that these improvements rely on correct image-text alignment during post-training, highlighting cross-modal representational transfer. We also observe consistent improvements on ScienceQA and PubMedQA, demonstrating that the benefits of structured multimodal post-training extend beyond materials science. The MATRIX dataset is available at this https URL and the model at this https URL.
[LG-223] Quantum Generator Kernels
链接: https://arxiv.org/abs/2602.00361
作者: Philipp Altmann,Maximilian Mansky,Maximilian Zorn,Jonas Stein,Claudia Linnhoff-Popien
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 28 pages, 4 figures, 8 tables, under review
Abstract:Quantum kernel methods offer significant theoretical benefits by rendering classically inseparable features separable in quantum space. Yet, the practical application of Quantum Machine Learning (QML), currently constrained by the limitations of Noisy Intermediate-Scale Quantum (NISQ) hardware, necessitates effective strategies to compress and embed large-scale real-world data like images into the constrained capacities of existing quantum devices or simulators. To this end, we propose Quantum Generator Kernels (QGKs), a generator-based approach to quantum kernels, comprising a set of Variational Generator Groups (VGGs) that merge universal generators into a parameterizable operator, ensuring scalable coverage of the available quantum space. Thereby, we address shortcomings of current leading strategies employing hybrid architectures, which might prevent exploiting quantum computing’s full potential due to fixed intermediate embedding processes. To optimize the kernel alignment to the target domain, we train a weight vector to parameterize the projection of the VGGs in the current data context. Our empirical results demonstrate superior projection and classification capabilities of the QGK compared to state-of-the-art quantum and classical kernel approaches and show its potential to serve as a versatile framework for various QML applications.
[LG-224] Leveraging Textual-Cues for Enhancing Multimodal Sentiment Analysis by Object Recognition
链接: https://arxiv.org/abs/2602.00360
作者: Sumana Biswas,Karen Young,Josephine Griffith
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multimodal sentiment analysis, which includes both image and text data, presents several challenges due to the dissimilarities between the text and image modalities, the ambiguity of sentiment, and the complexities of contextual meaning. In this work, we experiment with finding the sentiments of image and text data, individually and in combination, on two datasets. Part of the approach introduces the novel 'Textual-Cues for Enhancing Multimodal Sentiment Analysis' (TEMSA) approach, based on object recognition methods, to address the difficulties in multimodal sentiment analysis. Specifically, we extract the names of all objects detected in an image and combine them with the associated text; we call this combination of text and image data TEMS. Our results demonstrate that only TEMS, considering all the object names, improves the results for the overall sentiment of multimodal data compared to individual-modality analysis. This research contributes to advancing multimodal sentiment analysis and offers insights into the efficacy of TEMSA in combining image and text data for multimodal sentiment analysis.
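Since the core mechanism is textual fusion of detector outputs, a minimal sketch fixes the idea; `detect_objects` and `classify_sentiment` below are hypothetical stand-ins, not the paper's actual components.

```python
# Minimal sketch of the TEMS idea: append detected object names to the text
# before running a standard text sentiment classifier.
def tems_input(text, image):
    objects = detect_objects(image)            # e.g. ["dog", "beach", "ball"]
    return text + " [objects] " + " ".join(objects)

# score = classify_sentiment(tems_input(caption, img))
```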
[LG-225] Planning with Language and Generative Models: Toward General Reward-Guided Wireless Network Design
链接: https://arxiv.org/abs/2602.00357
作者: Chenyang Yuan,Xiaoyuan Cheng
类目: Machine Learning (cs.LG)
*备注:
Abstract:Intelligent access point (AP) deployment remains challenging in next-generation wireless networks due to complex indoor geometries and signal propagation. We first benchmark general-purpose large language models (LLMs) as agentic optimizers for AP planning and find that, despite strong wireless domain knowledge, their dependence on external verifiers results in high computational costs and limited scalability. Motivated by these limitations, we study generative inference models guided by a unified reward function capturing core AP deployment objectives across diverse floorplans. We show that diffusion samplers consistently outperform alternative generative approaches. The diffusion process progressively improves sampling by smoothing and sharpening the reward landscape, rather than relying on iterative refinement, which is effective for non-convex and fragmented objectives. Finally, we introduce a large-scale real-world dataset for indoor AP deployment, requiring over 50k CPU hours to train general reward functions, and evaluate in- and out-of-distribution generalization and robustness. Our results suggest that diffusion-based generative inference with a unified reward function provides a scalable and domain-agnostic foundation for indoor AP deployment planning.
[LG-226] Adaptive Momentum and Nonlinear Damping for Neural Network Training
链接: https://arxiv.org/abs/2602.00334
作者: Aikaterini Karoni,Rajit Rajpal,Benedict Leimkuhler,Gabriel Stoltz
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 29 pages, 11 figures
Abstract:We propose a continuous-time scheme for large-scale optimization that introduces individual, adaptive momentum coefficients regulated by the kinetic energy of each model parameter. This approach automatically adjusts to local landscape curvature to maintain stability without sacrificing convergence speed. We demonstrate that our adaptive friction can be related to cubic damping, a suppression mechanism from structural dynamics. Furthermore, we introduce two specific optimization schemes by augmenting the continuous dynamics of mSGD and Adam with a cubic damping term. Empirically, our methods demonstrate robustness and match or outperform Adam on training ViT, BERT, and GPT2 tasks where mSGD typically struggles. We further provide theoretical results establishing the exponential convergence of the proposed schemes.
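The abstract describes augmenting momentum dynamics with a cubic damping term; a toy sketch helps make the mechanism concrete. The discretization below, with damping entering the velocity update as -c * v**3, is an assumption for illustration, not the paper's exact scheme.

```python
# Toy momentum SGD with an added cubic damping term: the -c * v**3 term
# suppresses large velocities (improving stability) while barely affecting
# small ones (preserving convergence speed). Hyperparameters illustrative.
import numpy as np

def msgd_cubic(grad_fn, x, h=0.01, gamma=0.9, c=1.0, steps=500):
    v = np.zeros_like(x)
    for _ in range(steps):
        v = gamma * v - h * grad_fn(x) - c * v**3   # linear friction + cubic damping
        x = x + v
    return x

A = np.diag([1.0, 10.0])                            # badly scaled quadratic
print(msgd_cubic(lambda x: A @ x, np.array([1.0, 1.0])))  # approaches 0
```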
[LG-227] Efficient and accurate steering of Large Language Models through attention-guided feature learning
链接: https://arxiv.org/abs/2602.00333
作者: Parmida Davarmanesh,Ashia Wilson,Adityanarayanan Radhakrishnan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Steering, or direct manipulation of internal activations to guide LLM responses toward specific semantic concepts, is emerging as a promising avenue for both understanding how semantic concepts are stored within LLMs and advancing LLM capabilities. Yet, existing steering methods are remarkably brittle, with seemingly non-steerable concepts becoming completely steerable based on subtle algorithmic choices in how concept-related features are extracted. In this work, we introduce an attention-guided steering framework that overcomes three core challenges associated with steering: (1) automatic selection of relevant token embeddings for extracting concept-related features; (2) accounting for heterogeneity of concept-related features across LLM activations; and (3) identification of layers most relevant for steering. Across a steering benchmark of 512 semantic concepts, our framework substantially improved steering over previous state-of-the-art (nearly doubling the number of successfully steered concepts) across model architectures and sizes (up to 70 billion parameter models). Furthermore, we use our framework to shed light on the distribution of concept-specific features across LLM layers. Overall, our framework opens further avenues for developing efficient, highly-scalable fine-tuning algorithms for industry-scale LLMs.
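For readers unfamiliar with steering, the basic intervention is easy to sketch: add a scaled concept direction to a layer's hidden states at inference time. The snippet below shows this generic mechanism with a PyTorch forward hook; the layer index, the Llama-style `model.model.layers` path, and the origin of `concept_vec` are placeholders, not the paper's attention-guided selection procedure.

```python
# Minimal sketch of activation steering: add a scaled, normalized concept
# direction to the output hidden states of one transformer layer.
import torch

def add_steering_hook(model, layer_module, direction, alpha=8.0):
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return layer_module.register_forward_hook(hook)

# handle = add_steering_hook(model, model.model.layers[15], concept_vec)
# ... generate steered text ...
# handle.remove()
```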
[LG-228] Prototype-based Explainable Neural Networks with Channel-specific Reasoning for Geospatial Learning Tasks
链接: https://arxiv.org/abs/2602.00331
作者: Anushka Narayanan,Karianne J. Bergen
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: submitted to Environmental Data Science (preprint)
Abstract:Explainable AI (XAI) is essential for understanding machine learning (ML) decision-making and ensuring model trustworthiness in scientific applications. Prototype-based XAI methods offer an intrinsically interpretable alternative to post-hoc approaches which often yield inconsistent explanations. Prototype-based XAI methods make predictions based on the similarity between inputs and learned prototypes that represent typical characteristics of target classes. However, existing prototype-based models are primarily designed for standard RGB image data and are not optimized for the distinct, variable-specific channels commonly found in geoscientific image and raster datasets. In this study, we develop a prototype-based XAI approach tailored for multi-channel geospatial data, where each channel represents a distinct physical environmental variable or spectral channel. Our approach enables the model to identify separate, channel-specific prototypical characteristics sourced from multiple distinct training examples that inform how these features individually and in combination influence model prediction while achieving comparable performance to standard neural networks. We demonstrate this method through two geoscientific case studies: (1) classification of Madden Julian Oscillation phases using multi-variable climate data and (2) land-use classification from multispectral satellite imagery. This approach produces both local (instance-level) and global (model-level) explanations for providing insights into feature-relevance across channels. By explicitly incorporating channel-prototypes into the prediction process, we discuss how this approach enhances the transparency and trustworthiness of ML models for geoscientific learning tasks.
[LG-229] Harvest: Opportunistic Peer-to-Peer GPU Caching for LLM Inference
链接: https://arxiv.org/abs/2602.00328
作者: Nikhil Gopal,Kostis Kaffes
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large Language Model (LLM) inference is increasingly constrained by GPU memory capacity rather than compute throughput, driven by growing model sizes and the linear growth of the key-value (KV) cache during autoregressive decoding. Existing approaches mitigate memory pressure by offloading model state and KV tensors to host memory, but incur substantial latency due to limited PCIe bandwidth. We present Harvest, an opportunistic GPU cache management framework that exploits high-bandwidth peer-to-peer GPU interconnects to dynamically place model weights and KV cache in unused GPU memory. Harvest treats peer GPU memory as a transient cache tier, preserving correctness while reducing data movement overhead under dynamic memory availability. We demonstrate significant throughput speedup of more than 2 times by using Harvest to accelerate the retrieval of two widely-used inference components: expert layer weights and KV cache entries.
[LG-230] Neural Ising Machines via Unrolling and Zeroth-Order Training
链接: https://arxiv.org/abs/2602.00302
作者: Sam Reifenstein,Timothee Leleu
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Chaotic Dynamics (nlin.CD)
*备注:
Abstract:We propose a data-driven heuristic for NP-hard Ising and Max-Cut optimization that learns the update rule of an iterative dynamical system. The method learns a shared, node-wise update rule that maps local interaction fields to spin updates, parameterized by a compact multilayer perceptron with a small number of parameters. Training is performed using a zeroth-order optimizer, since backpropagation through long, recurrent Ising-machine dynamics leads to unstable and poorly informative gradients. We call this approach a neural network parameterized Ising machine (NPIM). Despite its low parameter count, the learned dynamics recover effective algorithmic structure, including momentum-like behavior and time-varying schedules, enabling efficient search in highly non-convex energy landscapes. Across standard Ising and neural combinatorial optimization benchmarks, NPIM achieves competitive solution quality and time-to-solution relative to recent learning-based methods and strong classical Ising-machine heuristics.
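The paper trains the update-rule MLP with a zeroth-order optimizer because backpropagating through long recurrent dynamics is unstable. A standard zeroth-order estimator of this kind is SPSA (simultaneous perturbation stochastic approximation), sketched below; whether NPIM uses exactly SPSA is an assumption here.

```python
# SPSA-style zeroth-order gradient estimate: two loss evaluations along a
# random sign perturbation yield a rank-one estimate of the full gradient.
import numpy as np

def spsa_grad(loss_fn, theta, eps=1e-2):
    delta = np.random.choice([-1.0, 1.0], size=theta.shape)
    g_hat = (loss_fn(theta + eps * delta) - loss_fn(theta - eps * delta)) / (2 * eps)
    return g_hat * delta          # since delta_i = +/-1, dividing == multiplying

theta = np.zeros(8)
for step in range(300):
    theta -= 0.05 * spsa_grad(lambda p: np.sum((p - 1.0) ** 2), theta)
print(theta.round(2))             # approaches the minimizer at all-ones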
[LG-231] Agent ic Framework for Epidemiological Modeling
链接: https://arxiv.org/abs/2602.00299
作者: Rituparna Datta,Zihan Guan,Baltazar Espinoza,Yiqi Su,Priya Pitre,Srini Venkatramanan,Naren Ramakrishnan,Anil Vullikanti
类目: Machine Learning (cs.LG)
*备注:
Abstract:Epidemic modeling is essential for public health planning, yet traditional approaches rely on fixed model classes that require manual redesign as pathogens, policies, and scenario assumptions evolve. We introduce EPIAGENT, an agentic framework that automatically synthesizes, calibrates, verifies, and refines epidemiological simulators by modeling disease progression as an iterative program synthesis problem. A central design choice is an explicit epidemiological flow graph intermediate representation that links scenario specifications to model structure and enables strong, modular correctness checks before code is generated. Verified flow graphs are then compiled into mechanistic models supporting interpretable parameter learning under physical and epidemiological constraints. Evaluation on epidemiological scenario case studies demonstrates that EPIAGENT captures complex growth dynamics and produces epidemiologically consistent counterfactual projections across varying vaccination and immune escape assumptions. Our results show that the agentic feedback loop prevents degeneration and significantly accelerates convergence toward valid models by mimicking professional expert workflows.
[LG-232] From Observations to States: Latent Time Series Forecasting
链接: https://arxiv.org/abs/2602.00297
作者: Jie Yang,Yifan Hu,Yuante Li,Kexin Zhang,Kaize Ding,Philip S. Yu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Deep learning has achieved strong performance in Time Series Forecasting (TSF). However, we identify a critical representation paradox, termed Latent Chaos: models with accurate predictions often learn latent representations that are temporally disordered and lack continuity. We attribute this phenomenon to the dominant observation-space forecasting paradigm. Most TSF models minimize point-wise errors on noisy and partially observed data, which encourages shortcut solutions instead of the recovery of underlying system dynamics. To address this issue, we propose Latent Time Series Forecasting (LatentTSF), a novel paradigm that shifts TSF from observation regression to latent state prediction. Specifically, LatentTSF employs an AutoEncoder to project observations at each time step into a higher-dimensional latent state space. This expanded representation aims to capture underlying system variables and impose a smoother temporal structure. Forecasting is then performed entirely in the latent space, allowing the model to focus on learning structured temporal dynamics. Theoretical analysis demonstrates that our proposed latent objectives implicitly maximize mutual information between predicted latent states and ground-truth states and observations. Extensive experiments on widely-used benchmarks confirm that LatentTSF effectively mitigates latent chaos, achieving superior performance. Our code is available in this https URL.
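The paradigm — lift each time step into a higher-dimensional latent state, model dynamics there, then decode — can be sketched in a few lines of PyTorch. Layer sizes and the GRU forecaster below are illustrative choices, not LatentTSF's architecture, and the paper's mutual-information objectives are omitted.

```python
# Minimal sketch of latent-space forecasting: encode per-step observations
# into latent states, forecast in latent space, decode back to observations.
import torch
import torch.nn as nn

class LatentForecaster(nn.Module):
    def __init__(self, n_vars=7, d_latent=64, horizon=24):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(n_vars, d_latent), nn.GELU(),
                                    nn.Linear(d_latent, d_latent))
        self.dynamics = nn.GRU(d_latent, d_latent, batch_first=True)
        self.head = nn.Linear(d_latent, d_latent * horizon)
        self.decode = nn.Linear(d_latent, n_vars)
        self.horizon, self.d_latent = horizon, d_latent

    def forward(self, x):                    # x: (batch, time, n_vars)
        z = self.encode(x)                   # per-step lift to latent states
        _, h = self.dynamics(z)              # summarize the latent trajectory
        z_future = self.head(h[-1]).view(-1, self.horizon, self.d_latent)
        return self.decode(z_future)         # back to observation space

y_hat = LatentForecaster()(torch.randn(2, 96, 7))
print(y_hat.shape)                           # torch.Size([2, 24, 7])
```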
[LG-233] Generation Order and Parallel Decoding in Masked Diffusion Models: An Information-Theoretic Perspective
链接: https://arxiv.org/abs/2602.00286
作者: Shaorong Zhang,Longxuan Yu,Rob Brekelmans,Luhan Tang,Salman Asif,Greg Ver Steeg
类目: Machine Learning (cs.LG)
*备注:
Abstract:Masked Diffusion Models (MDMs) significantly accelerate inference by trading off sequential determinism. However, the theoretical mechanisms governing generation order and the risks inherent in parallelization remain under-explored. In this work, we provide a unified information-theoretic framework to decouple and analyze two fundamental sources of failure: order sensitivity and parallelization bias. Our analysis yields three key insights: (1) The benefits of Easy-First decoding (prioritizing low-entropy tokens) are magnified as model error increases; (2) factorized parallel decoding introduces intrinsic sampling errors that can lead to arbitrarily large Reverse KL divergence, capturing “incoherence” failures that standard Forward KL metrics overlook; and (3) while verification can eliminate sampling error, it incurs an exponential cost governed by the total correlation within a block. Conversely, heuristics like remasking, though computationally efficient, cannot guarantee distributional correctness. Experiments on a controlled Block-HMM and large-scale MDMs (LLaDA) for arithmetic reasoning validate our theoretical framework.
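Easy-First decoding itself is a simple heuristic: at each step, commit the masked positions whose predictive entropy is lowest. The sketch below illustrates one such step; `logits_fn` stands in for the MDM's per-position distribution and the greedy commit is an illustrative choice.

```python
# One Easy-First parallel decoding step: unmask the k lowest-entropy
# positions and fill them greedily.
import torch

def easy_first_step(tokens, mask, logits_fn, k=4):
    k = min(k, int(mask.sum()))
    logits = logits_fn(tokens)                    # (seq_len, vocab)
    probs = logits.softmax(-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)
    entropy[~mask] = float("inf")                 # consider masked slots only
    pick = entropy.topk(k, largest=False).indices # lowest-entropy positions
    tokens[pick] = probs[pick].argmax(-1)         # commit greedy tokens
    mask[pick] = False
    return tokens, mask
```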
[LG-234] Green-NAS: A Global-Scale Multi-Objective Neural Architecture Search for Robust and Efficient Edge-Native Weather Forecasting
链接: https://arxiv.org/abs/2602.00240
作者: Md Muhtasim Munif Fahim,Soyda Humyra Yesmin,Saiful Islam,Md. Palash Bin Faruque,Md. A. Salam,Md. Mahfuz Uddin,Samiul Islam,Tofayel Ahmed,Md. Binyamin,Md. Rezaul Karim
类目: Machine Learning (cs.LG)
*备注:
Abstract:We introduce Green-NAS, a multi-objective NAS (neural architecture search) framework designed for low-resource environments using weather forecasting as a case study. By adhering to ‘Green AI’ principles, the framework explicitly minimizes computational energy costs and carbon footprints, prioritizing sustainable deployment over raw computational scale. The Green-NAS architecture search method is optimized for both model accuracy and efficiency to find lightweight models with high accuracy and very few model parameters; this is accomplished through an optimization process that simultaneously optimizes multiple objectives. Our best-performing model, Green-NAS-A, achieved an RMSE of 0.0988 (i.e., within 1.4% of our manually tuned baseline) using only 153k model parameters, which is 239 times fewer than other globally applied weather forecasting models, such as GraphCast. In addition, we also describe how the use of transfer learning will improve the weather forecasting accuracy by approximately 5.2%, in comparison to a naive approach of training a new model for each city, when there is limited historical weather data available for that city.
[LG-235] GRIP2: A Robust and Powerful Deep Knockoff Method for Feature Selection
链接: https://arxiv.org/abs/2602.00218
作者: Bob Junyi Zou,Lu Tian
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Identifying truly predictive covariates while strictly controlling false discoveries remains a fundamental challenge in nonlinear, highly correlated, and low signal-to-noise regimes, where deep learning based feature selection methods are most attractive. We propose Group Regularization Importance Persistence in 2 Dimensions (GRIP2), a deep knockoff feature importance statistic that integrates first-layer feature activity over a two-dimensional regularization surface controlling both sparsity strength and sparsification geometry. To approximate this surface integral in a single training run, we introduce efficient block-stochastic sampling, which aggregates feature activity magnitudes across diverse regularization regimes along the optimization trajectory. The resulting statistics are antisymmetric by construction, ensuring finite-sample FDR control. In extensive experiments on synthetic and semi-real data, GRIP2 demonstrates improved robustness to feature correlation and noise level: in high correlation and low signal-to-noise ratio regimes where standard deep learning based feature selectors may struggle, our method retains high power and stability. Finally, on real-world HIV drug resistance data, GRIP2 recovers known resistance-associated mutations with power better than established linear baselines, confirming its reliability in practice.
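Because GRIP2's statistics are antisymmetric by construction, they plug into the standard knockoff+ selection rule for finite-sample FDR control. That generic step (Barber and Candès) is sketched below; the statistic W itself would come from GRIP2, not from this toy example.

```python
# Generic knockoff+ threshold: smallest t whose estimated false discovery
# proportion (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) is at most q.
import numpy as np

def knockoff_plus_threshold(W, q=0.1):
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf                          # nothing selected at this level

W = np.array([3.1, -0.2, 2.4, 1.9, -1.1, 4.0, 2.8, -0.3])
t = knockoff_plus_threshold(W, q=0.2)
print(t, np.where(W >= t)[0])              # selected feature indices
```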
[LG-236] Dispersion Loss Counteracts Embedding Condensation and Improves Generalization in Small Language Models
链接: https://arxiv.org/abs/2602.00217
作者: Chen Liu,Xingzhi Sun,Xi Xiao,Alexandre Van Tassel,Ke Xu,Kristof Reimann,Danqi Liao,Mark Gerstein,Tianyang Wang,Xiao Wang,Smita Krishnaswamy
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models (LLMs) achieve remarkable performance through ever-increasing parameter counts, but scaling incurs steep computational costs. To better understand LLM scaling, we study representational differences between LLMs and their smaller counterparts, with the goal of replicating the representational qualities of larger models in the smaller models. We observe a geometric phenomenon which we term embedding condensation, where token embeddings collapse into a narrow cone-like subspace in some language models. Through systematic analyses across multiple Transformer families, we show that small models such as GPT2 and Qwen3-0.6B exhibit severe condensation, whereas the larger models such as GPT2-xl and Qwen3-32B are more resistant to this phenomenon. Additional observations show that embedding condensation is not reliably mitigated by knowledge distillation from larger models. To fight against it, we formulate a dispersion loss that explicitly encourages embedding dispersion during training. Experiments demonstrate that it mitigates condensation, recovers dispersion patterns seen in larger models, and yields performance gains across 10 benchmarks. We believe this work offers a principled path toward improving smaller Transformers without additional parameters.
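One natural form of a dispersion regularizer penalizes the mean pairwise cosine similarity of the token embeddings, pushing them out of a narrow cone. The sketch below shows that form; the paper's exact loss may differ, and `model.embed_tokens.weight` is a hypothetical HF-style attribute path.

```python
# Sketch of a dispersion loss: penalize mean pairwise cosine similarity so
# embeddings spread out instead of condensing into a cone.
import torch

def dispersion_loss(emb):                  # emb: (n_tokens, d)
    z = torch.nn.functional.normalize(emb, dim=-1)
    sim = z @ z.T                          # pairwise cosine similarities
    off_diag = sim - torch.eye(len(z), device=z.device)  # zero the diagonal
    return off_diag.mean()                 # lower = more dispersed

# total_loss = lm_loss + lambda_disp * dispersion_loss(model.embed_tokens.weight)
```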
[LG-237] Reducing Memorisation in Generative Models via Riemannian Bayesian Inference
链接: https://arxiv.org/abs/2602.00199
作者: Johanna Marie Gegenfurtner,Albert Kjøller Jacobsen,Naima Elosegui Borras,Alejandro Valverde Mahou,Georgios Arvanitidis
类目: Machine Learning (cs.LG)
*备注:
Abstract:Modern generative models can produce realistic samples, however, balancing memorisation and generalisation remains an open problem. We approach this challenge from a Bayesian perspective by focusing on the parameter space of flow matching and diffusion models and constructing a predictive posterior that better captures the variability of the data distribution. In particular, we capture the geometry of the loss using a Riemannian metric and leverage a flexible approximate posterior that adapts to the local structure of the loss landscape. This approach allows us to sample generative models that resemble the original model, but exhibit reduced memorisation. Empirically, we demonstrate that the proposed approach reduces memorisation while preserving generalisation. Further, we provide a theoretical analysis of our method, which explains our findings. Overall, our work illustrates how considering the geometry of the loss enables effective use of the parameter space, even for complex high-dimensional generative models.
[LG-238] How Understanding Forecast Uncertainty Resolves the Explainability Problem in Machine Learning Models
链接: https://arxiv.org/abs/2602.00179
作者: Joseph L. Breeden
类目: Machine Learning (cs.LG)
*备注: 22 pages; 2 figures
Abstract:For applications of machine learning in critical decisions, explainability is a primary concern, and often a regulatory requirement. Local linear methods for generating explanations, such as LIME and SHAP, have been criticized for being unstable near decision boundaries. In this paper, we explain that such concerns reflect a misunderstanding of the problem. The forecast uncertainty is high at decision boundaries, so consequently, the explanatory instability is high. The correct approach is to change the sequence of events and questions being asked. Nonlinear models can be highly predictive in some regions while having little or no predictability in others. Therefore, the first question is whether a usable forecast exists. When there is a forecast with low enough uncertainty to be useful, an explanation can be sought via a local linear approximation. In such cases, the explanatory instability is correspondingly low. When no usable forecast exists, the decision must fall to a simpler overall model such as traditional logistic regression. Additionally, these results show that some methods that purport to be explainable everywhere, such as ReLU networks or any piecewise linear model, have only an illusory explainability, because the forecast uncertainty at the segment boundaries is too high to be useful. Explaining an unusable forecast is pointless.
[LG-239] Benford's Law as a Distributional Prior for Post-Training Quantization of Large Language Models
链接: https://arxiv.org/abs/2602.00165
作者: Arthur Negrão,Pedro Silva,Vander L. S. Freitas,Gladston Moreira,Eduardo Luz
类目: Machine Learning (cs.LG)
*备注:
Abstract:The rapid growth of Large Language Models (LLMs) intensifies the need for effective compression, with weight quantization being the most widely adopted technique. Standard uniform quantizers assume that parameters are evenly distributed, an assumption at odds with the highly skewed distributions observed in practice. We propose Benford-Quant, a simple, data-free non-uniform quantizer inspired by Benford’s Law, which predicts that leading digits follow a logarithmic distribution. Benford-Quant replaces the uniform grid with a log-spaced codebook, dedicating more resolution to the frequent small-magnitude weights. We provide both theoretical intuition and empirical evidence: (i) weights in transformer transformational layers adhere closely to Benford statistics, while normalization layers systematically deviate; (ii) on Small Language Models (SLMs), Benford-Quant consistently improves perplexity, reducing 4-bit perplexity on Gemma-270M by more than 10%; and (iii) on larger LLMs, it remains competitive, with differences explained by over-parameterization effects. Our results indicate that incorporating a Benford-inspired prior into quantization grids is a low-cost modification that yields accuracy gains in aggressive few-bit regimes. Although it is not able to surpass the state of the art in tasks such as perplexity and LAMBADA, the Benford-Quant approach can be hybridized with other quantization methods-such as SmoothQuant and Activation-Aware Quantization-without major pipeline modification, potentially improving their performance.
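The key mechanism — a log-spaced codebook that dedicates more resolution to frequent small-magnitude weights — is easy to sketch. Details such as the grid range and the Laplace toy weights below are illustrative assumptions, not Benford-Quant's exact codebook construction.

```python
# Log-spaced (Benford-inspired) quantization grid: codewords are denser near
# zero, matching the skewed magnitude distribution of trained weights.
import numpy as np

def log_codebook(w, bits=4):
    n = 2 ** (bits - 1) - 1                       # levels per sign
    mags = np.abs(w).max() * np.logspace(-3, 0, n)
    return np.concatenate([-mags[::-1], [0.0], mags])

def quantize(w, book):
    idx = np.abs(w[..., None] - book).argmin(-1)  # nearest codeword
    return book[idx]

w = np.random.laplace(scale=0.05, size=10000)     # heavy mass near zero
book = log_codebook(w)
print(np.abs(w - quantize(w, book)).mean())       # mean quantization error
```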
[LG-240] On the Relationship Between Representation Geometry and Generalization in Deep Neural Networks ICML
链接: https://arxiv.org/abs/2602.00130
作者: Sumit Yadav
类目: Machine Learning (cs.LG)
*备注: ICML
Abstract:We investigate the relationship between representation geometry and neural network performance. Analyzing 52 pretrained ImageNet models across 13 architecture families, we show that effective dimension – an unsupervised geometric metric – strongly predicts accuracy. Output effective dimension achieves partial r=0.75 (p < 10^(-10)) after controlling for model capacity, while total compression achieves partial r=-0.72. These findings replicate across ImageNet and CIFAR-10, and generalize to NLP: effective dimension predicts performance for 8 encoder models on SST-2/MNLI and 15 decoder-only LLMs on AG News (r=0.69, p=0.004), while model size does not (r=0.07). We establish bidirectional causality: degrading geometry via noise causes accuracy loss (r=-0.94, p < 10^(-9)), while improving geometry via PCA maintains accuracy across architectures (-0.03pp at 95% variance). This relationship is noise-type agnostic – Gaussian, Uniform, Dropout, and Salt-and-pepper noise all show |r| > 0.90. These results establish that effective dimension provides domain-agnostic predictive and causal information about neural network performance, computed entirely without labels.
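One common definition of effective dimension is the participation ratio of the representation covariance spectrum, ED = (Σᵢλᵢ)² / Σᵢλᵢ²; whether the paper uses exactly this variant is an assumption here, but the sketch shows how such a label-free metric is computed.

```python
# Effective dimension as the participation ratio of the covariance spectrum,
# computed from features alone (no labels required).
import numpy as np

def effective_dimension(features):        # features: (n_samples, d)
    X = features - features.mean(0)
    lam = np.clip(np.linalg.eigvalsh(np.cov(X.T)), 0, None)
    return lam.sum() ** 2 / (lam ** 2).sum()

feats = np.random.randn(1000, 64) @ np.diag(np.linspace(1, 0.01, 64))
print(effective_dimension(feats))         # far below the ambient dim of 64
```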
[LG-241] Monte Carlo Tree Search for Execution-Guided Program Repair with Large Language Models
链接: https://arxiv.org/abs/2602.00129
作者: Yixuan Liang
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: 10 pages, 5 figures. Submitted to a conference workshop
Abstract:Automated program repair with large language models remains challenging at the repository level due to long-horizon reasoning requirements and the limitations of autoregressive decoding. We present CodePilot, a hybrid framework that integrates Monte Carlo Tree Search (MCTS) with large language models to enable execution-guided program repair for real-world GitHub issues. CodePilot performs hierarchical fault localization from repository to file and function level, explores diverse patch trajectories using MCTS, and leverages execution feedback as a reward signal to guide search and refinement. The framework further incorporates confidence-calibrated generation to selectively refine low-confidence outputs. Experiments on SWE-bench Lite demonstrate that CodePilot achieves a 24.67% issue resolution rate using open-weight models, outperforming comparable baselines. These results suggest that combining symbolic search with neural language models is an effective strategy for scalable, execution-aware software engineering automation.
[LG-242] Quantum Model Parallelism for MRI-Based Classification of Alzheimer's Disease Stages
链接: https://arxiv.org/abs/2602.00128
作者: Emine Akpinar,Murat Oduncuoglu
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: Under review at Quantum Machine Intelligence (Springer Nature)
Abstract:With increasing life expectancy, Alzheimer's disease (AD) has become a major global health concern. While classical AI-based methods have been developed for early diagnosis and stage classification of AD, growing data volumes and limited computational resources necessitate faster, more efficient approaches. Quantum-based AI methods, which leverage superposition and entanglement principles along with high-dimensional Hilbert space, can surpass classical approaches’ limitations and offer higher accuracy for high-dimensional, heterogeneous, and noisy data. In this study, a Quantum-Based Parallel Model (QBPM) architecture is proposed for the efficient classification of AD stages using MRI datasets, inspired by the principles of classical model parallelism. The proposed model leverages quantum advantages by employing two distinct quantum circuits, each incorporating rotational and entanglement blocks, running in parallel on the same quantum simulator. The classification performance of the model was evaluated on two different datasets to assess its overall robustness and generalization capability. The proposed model demonstrated high classification accuracy across both datasets, highlighting its overall robustness and generalization capability. Results obtained under high-level Gaussian noise, simulating real-world conditions, further provided experimental evidence for the model’s applicability not only in theoretical but also in practical scenarios. Moreover, compared with five different classical transfer learning methods, the proposed model demonstrated its efficiency as an alternative to classical approaches by achieving higher classification accuracy and comparable execution time while utilizing fewer circuit parameters. The results indicate that the proposed QBPM architecture represents an innovative and powerful approach for the classification of stages in complex diseases such as Alzheimer’s.
[LG-243] ALIGN: Aligned Delegation with Performance Guarantees for Multi-Agent LLM Reasoning
链接: https://arxiv.org/abs/2602.00127
作者: Tong Zhu,Baiting Chen,Jin Zhou,Hua Zhou,Sriram Sankararaman,Xiaowu Dai
类目: Machine Learning (cs.LG)
*备注:
Abstract:LLMs often underperform on complex reasoning tasks when relying on a single generation-and-selection pipeline. Inference-time ensemble methods can improve performance by sampling diverse reasoning paths or aggregating multiple candidate answers, but they typically treat candidates independently and provide no formal guarantees that ensembling improves reasoning quality. We propose a novel method, Aligned Delegation for Multi-Agent LLM Reasoning (ALIGN), which formulates LLM reasoning as an aligned delegation game. In ALIGN, a principal delegates a task to multiple agents that generate candidate solutions under designed incentives, and then selects among their outputs to produce a final answer. This formulation induces structured interaction among agents while preserving alignment between agent and principal objectives. We establish theoretical guarantees showing that, under a fair comparison with equal access to candidate solutions, ALIGN provably improves expected performance over single-agent generation. Our analysis accommodates correlated candidate answers and relaxes independence assumptions that are commonly used in prior work. Empirical results across a broad range of LLM reasoning benchmarks consistently demonstrate that ALIGN outperforms strong single-agent and ensemble baselines.
[LG-244] Predicting Mortgage Default with Machine Learning: AutoML Class Imbalance and Leakage Control
链接: https://arxiv.org/abs/2602.00120
作者: Xianghong Hu,Tianning Xu,Ying Chen,Shuai Wang
类目: Machine Learning (cs.LG)
*备注: 12 pages, 4 figures. An extended and pedagogical version will appear as a book chapter
Abstract:Mortgage default prediction is a core task in financial risk management, and machine learning models are increasingly used to estimate default probabilities and provide interpretable signals for downstream decisions. In real-world mortgage datasets, however, three factors frequently undermine evaluation validity and deployment reliability: ambiguity in default labeling, severe class imbalance, and information leakage arising from temporal structure and post-event variables. We compare multiple machine learning approaches for mortgage default prediction using a real-world loan-level dataset, with emphasis on leakage control and imbalance handling. We employ leakage-aware feature selection, a strict temporal split that constrains both origination and reporting periods, and controlled downsampling of the majority class. Across multiple positive-to-negative ratios, performance remains stable, and an AutoML approach (AutoGluon) achieves the strongest AUROC among the models evaluated. An extended and pedagogical version of this work will appear as a book chapter.
[LG-245] THDC: Training Hyperdimensional Computing Models with Backpropagation
链接: https://arxiv.org/abs/2602.00116
作者: Hanne Dejonghe,Sam Leroux
类目: Machine Learning (cs.LG)
*备注: Accepted to ESANN 2026
Abstract:Hyperdimensional computing (HDC) offers lightweight learning for energy-constrained devices by encoding data into high-dimensional vectors. However, its reliance on ultra-high dimensionality and static, randomly initialized hypervectors limits memory efficiency and learning capacity. Therefore, we propose Trainable Hyperdimensional Computing (THDC), which enables end-to-end HDC via backpropagation. THDC replaces randomly initialized vectors with trainable embeddings and introduces a one-layer binary neural network to optimize class representations. Evaluated on MNIST, Fashion-MNIST and CIFAR-10, THDC achieves equal or better accuracy than state-of-the-art HDC, with dimensionality reduced from 10,000 to 64.
[LG-246] Gauss-Newton Natural Gradient Descent for Shape Learning
链接: https://arxiv.org/abs/2602.00099
作者: James King,Arturs Berzins,Siddhartha Mishra,Marius Zeinhofer
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 16 Pages, 9 Figures, submitted to Computer-Aided Design
Abstract:We explore the use of the Gauss-Newton method for optimization in shape learning, including implicit neural surfaces and geometry-informed neural networks. The method addresses key challenges in shape learning, such as the ill-conditioning of the underlying differential constraints and the mismatch between the optimization problem in parameter space and the function space where the problem is naturally posed. This leads to significantly faster and more stable convergence than standard first-order methods, while also requiring far fewer iterations. Experiments across benchmark shape optimization tasks demonstrate that the Gauss-Newton method consistently improves both training speed and final solution accuracy.
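For a residual objective L(p) = ½‖r(p)‖², the Gauss-Newton step solves (JᵀJ + μI) Δp = -Jᵀr, which preconditions ill-conditioned problems far better than plain gradient descent. A minimal dense sketch follows; the small Levenberg-style damping μ is an illustrative stabilizer, not necessarily part of the paper's method.

```python
# One Gauss-Newton step for a least-squares objective; exact in one step for
# linear residuals (up to the small damping mu), as the toy example shows.
import numpy as np

def gauss_newton_step(residual_fn, jacobian_fn, p, mu=1e-6):
    r = residual_fn(p)                       # (m,)
    J = jacobian_fn(p)                       # (m, n)
    dp = np.linalg.solve(J.T @ J + mu * np.eye(p.size), -J.T @ r)
    return p + dp

A, b = np.array([[2.0, 0.0], [0.0, 0.1]]), np.array([1.0, 1.0])
p = gauss_newton_step(lambda p: A @ p - b, lambda p: A, np.zeros(2))
print(p)   # close to [0.5, 10.0], despite the 20x scale difference
```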
[LG-247] Trade-offs Between Individual and Group Fairness in Machine Learning: A Comprehensive Review
链接: https://arxiv.org/abs/2602.00094
作者: Sandra Benítez-Peña,Blas Kolic,Victoria Menendez,Belén Pulido
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:
Abstract:Algorithmic fairness has become a central concern in computational decision-making systems, where ensuring equitable outcomes is essential for both ethical and legal reasons. Two dominant notions of fairness have emerged in the literature: Group Fairness (GF), which focuses on mitigating disparities across demographic subpopulations, and Individual Fairness (IF), which emphasizes consistent treatment of similar individuals. These notions have traditionally been studied in isolation. In contrast, this survey examines methods that jointly address GF and IF, integrating both perspectives within unified frameworks and explicitly characterizing the trade-offs between them. We provide a systematic and critical review of hybrid fairness approaches, organizing existing methods according to the fairness mechanisms they employ and the algorithmic and mathematical strategies used to reconcile multiple fairness criteria. For each class of methods, we examine their theoretical foundations, optimization mechanisms, and empirical evaluation practices, and discuss their limitations. Additionally, we discuss the challenges and identify open research directions for developing principled, context-aware hybrid fairness methods. By synthesizing insights across the literature, this survey aims to serve as a comprehensive resource for researchers and practitioners seeking to design hybrid algorithms that provide reliable fairness guarantees at both the individual and group levels.
[LG-248] Why LoRA Resists Label Noise: A Theoretical Framework for Noise-Robust Parameter-Efficient Fine-Tuning
链接: https://arxiv.org/abs/2602.00084
作者: Brady Steele
类目: Machine Learning (cs.LG)
*备注: 14 pages, 7 figures, 7 tables
Abstract:Parameter-efficient fine-tuning methods like Low-Rank Adaptation (LoRA) have become the dominant paradigm for adapting large pretrained models. We present a theoretical framework explaining an underexplored property: LoRA’s inherent resistance to label noise. Our analysis reveals three key insights. First, we prove that rank-r LoRA cannot memorize all possible label assignments once the sample size exceeds O(r(d+k-r)), limiting its capacity to fit arbitrary noise. Second, we derive an optimal rank balancing approximation bias and noise-induced variance, showing it decreases with noise rate. Third, we establish temporal separation: clean patterns are learned early while noise memorization occurs later. We propose RACT (Rank-Aware Curriculum Training), leveraging rank discrepancy for noise detection. Experiments validate our predictions, with RACT achieving 91.1% F1 for noise detection on AG News while maintaining 91.46% accuracy, competitive with baselines that lack noise detection capability.
[LG-249] Automated univariate time series forecasting with regression trees
链接: https://arxiv.org/abs/2602.00077
作者: Francisco Martínez,María P. Frías
类目: Machine Learning (cs.LG)
*备注: 23 pages, 17 figures
Abstract:This paper describes a methodology for automated univariate time series forecasting using regression trees and their ensembles: bagging and random forests. The key aspects that are addressed are: the use of an autoregressive approach and recursive forecasts, how to select the autoregressive features, how to deal with trending series and how to cope with seasonal behavior. Experimental results show a forecast accuracy comparable with well-established statistical models such as exponential smoothing or ARIMA. Furthermore, a publicly available software implementing all the proposed strategies has been developed and is described in the paper.
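The autoregressive-plus-recursive strategy the paper describes is standard enough to sketch end to end: train a tree ensemble on lagged values, then feed each prediction back in to roll the horizon forward. The lag choice, toy series, and model settings below are illustrative, and the paper's trend/seasonality handling is omitted.

```python
# Autoregressive random-forest forecasting with recursive multi-step rollout.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def make_lags(y, n_lags):
    X = np.column_stack([y[i:len(y) - n_lags + i] for i in range(n_lags)])
    return X, y[n_lags:]                  # row j: y[j..j+n_lags-1] -> y[j+n_lags]

y = np.sin(np.arange(300) * 2 * np.pi / 12) + 0.1 * np.random.randn(300)
X, t = make_lags(y, n_lags=12)
model = RandomForestRegressor(n_estimators=200).fit(X, t)

window, preds = list(y[-12:]), []
for _ in range(24):                       # 24-step-ahead recursive forecast
    yhat = model.predict([window])[0]
    preds.append(yhat)
    window = window[1:] + [yhat]          # slide the window over predictions
print(np.round(preds[:6], 2))
```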
[LG-250] Repair Brain Damage: Real-Numbered Error Correction Code for Neural Network
链接: https://arxiv.org/abs/2602.00076
作者: Ziqing Li,Myung Cho,Qiutong Jin,Weiyu Xu
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 6 pages, 3 figures
Abstract:We consider a neural network (NN) that may experience memory faults and computational errors. In this paper, we propose a novel real-number-based error correction code (ECC) capable of detecting and correcting both memory errors and computational errors. The proposed approach introduces structures in the form of real-number-based linear constraints on the NN weights to enable error detection and correction, without sacrificing classification performance or increasing the number of real-valued NN parameters.
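The abstract's core idea — linear constraints on the weights enabling syndrome-style error detection — can be illustrated with a classic real-number parity check. The random constraint matrix and the single-fault localization heuristic below are illustrative assumptions, not the paper's code construction.

```python
# Real-number parity checking: enforce A w = 0 at storage time, so a nonzero
# syndrome s = A w_faulty flags corruption; for a single fault, s is
# proportional to the column of A at the faulty index, which localizes it.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 32))              # 4 parity checks on 32 weights
w = rng.standard_normal(32)
w -= A.T @ np.linalg.solve(A @ A.T, A @ w)    # project w onto null space of A

w_faulty = w.copy()
w_faulty[17] += 0.5                           # a single memory fault
s = A @ w_faulty                              # nonzero syndrome: error detected
scores = np.abs(A.T @ s) / np.linalg.norm(A, axis=0) / np.linalg.norm(s)
print(np.linalg.norm(A @ w).round(10), int(scores.argmax()))  # ~0 and 17
```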
[LG-251] Dimensional Peeking for Low-Variance Gradients in Zeroth-Order Discrete Optimization via Simulation
链接: https://arxiv.org/abs/2602.00075
作者: Philipp Andelfinger,Wentong Cai
类目: Machine Learning (cs.LG); Mathematical Software (cs.MS)
*备注: Accepted at ACM SIGSIM PADS 2026
Abstract:Gradient-based optimization methods are commonly used to identify local optima in high-dimensional spaces. When derivatives cannot be evaluated directly, stochastic estimators can provide approximate gradients. However, these estimators’ perturbation-based sampling of the objective function introduces variance that can lead to slow convergence. In this paper, we present dimensional peeking, a variance reduction method for gradient estimation in discrete optimization via simulation. By lifting the sampling granularity from scalar values to classes of values that follow the same control flow path, we increase the information gathered per simulation evaluation. Our derivation from an established smoothed gradient estimator shows that the method does not introduce any bias. We present an implementation via a custom numerical data type to transparently carry out dimensional peeking over C++ programs. Variance reductions by factors of up to 7.9 are observed for three simulation-based optimization problems with high-dimensional input. The optimization progress compared to three meta-heuristics shows that dimensional peeking increases the competitiveness of zeroth-order optimization for discrete and non-convex simulations.
[LG-252] Generative AI-enhanced Probabilistic Multi-Fidelity Surrogate Modeling Via Transfer Learning
链接: https://arxiv.org/abs/2602.00072
作者: Jice Zeng,David Barajas-Solano,Hui Chen
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:The performance of machine learning surrogates is critically dependent on data quality and quantity. This presents a major challenge, as high-fidelity (HF) data is often scarce and computationally expensive to acquire, while low-fidelity (LF) data is abundant but less accurate. To address this data scarcity problem, we develop a probabilistic multi-fidelity surrogate framework based on generative transfer learning. We employ a normalizing flow (NF) generative model as the backbone, which is trained in two phases: (i) the NF is first pretrained on a large LF dataset to learn a probabilistic forward model; (ii) the pretrained model is then fine-tuned on a small HF dataset, allowing it to correct for LF-HF discrepancies via knowledge transfer. To relax the dimension-preserving constraint of standard bijective NFs, we integrate surjective (dimension-reducing) layers with standard coupling blocks. This architecture enables learned dimension reduction while preserving the ability to train with exact likelihoods. The resulting surrogate provides fast probabilistic predictions with quantified uncertainty and significantly outperforms LF-only baselines while using fewer HF evaluations. We validate the approach on a reinforced concrete slab benchmark, combining many coarse-mesh (LF) simulations with a limited set of fine-mesh (HF) simulations. The proposed model achieves probabilistic predictions with HF accuracy, demonstrating a practical path toward data-efficient, generative AI-driven surrogates for complex engineering systems.
[LG-253] Comparison of Multiple Classifiers for Android Malware Detection with Emphasis on Feature Insights Using CICMalDroid 2020 Dataset
链接: https://arxiv.org/abs/2602.00058
作者: Md Min-Ha-Zul Abedin,Tazqia Mehrub
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Accurate Android malware detection is critical for protecting users at scale, and signature scanners lag behind the fast release cycles of public app stores. We aimed to build a trustworthy detector by pairing a comprehensive dataset with a rigorous, transparent evaluation, and to identify interpretable drivers of decisions. We used CICMalDroid 2020, which contains 17,341 apps across Benign, Adware, Banking, SMS malware, and Riskware. We extracted 301 static and 263 dynamic features into a 564-dimensional hybrid vector, then evaluated seven classifiers under three schemes, original features, principal component analysis (PCA), and linear discriminant analysis (LDA), with a 70 percent training and 30 percent test split. Results showed that gradient boosting on the original features performed best. XGBoost achieved 0.9747 accuracy, 0.9703 precision, 0.9731 recall, and 0.9716 F1, and the confusion matrix indicated rare benign labels for malicious apps. HistGradientBoosting reached 0.9741 accuracy and 0.9708 F1, while CatBoost and Random Forest were slightly lower at 0.9678 and 0.9687 accuracy with 0.9636 and 0.9637 F1. KNN and SVM lagged. PCA reduced performance for all models, with XGBoost dropping to 0.9164 accuracy and 0.8988 F1. LDA maintained mid-90s accuracy and clarified separable clusters in projections. A depth-two surrogate tree highlighted package name, main activity, and target SDK as key drivers. These findings establish high-fidelity supervised baselines for Android malware detection and indicate that rich hybrid features with gradient boosting offer a practical and interpretable foundation for deployment.
[LG-254] Distributional Reinforcement Learning for Condition-Based Maintenance of Multi-Pump Equipment
链接: https://arxiv.org/abs/2602.00051
作者: Takato Yasuno
类目: Machine Learning (cs.LG)
*备注: 15 pages, 15 figures
Abstract:Condition-Based Maintenance (CBM) signifies a paradigm shift from reactive to proactive equipment management strategies in modern industrial systems. Conventional time-based maintenance schedules frequently incur unnecessary costs and unanticipated equipment failures. In contrast, CBM utilizes real-time equipment condition data to enhance maintenance timing and optimize resource allocation. The present paper proposes a novel distributional reinforcement learning approach for multi-equipment CBM using Quantile Regression Deep Q-Networks (QR-DQN) with aging factor integration. The methodology manages multiple pump units concurrently under three strategic scenarios: safety-first, balanced, and cost-efficient. Comprehensive experimental validation over 3,000 training episodes demonstrates significant performance improvements across all strategies. The Safety-First strategy demonstrates superior cost efficiency, with a return on investment (ROI) of 3.91, yielding 152% better performance than alternatives while requiring only 31% higher investment. The system exhibits 95.66% operational stability and immediate applicability to industrial environments.
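At the heart of QR-DQN is the quantile-regression Huber loss (Dabney et al., 2018), which trains N quantile estimates of the return distribution with an asymmetric, smoothed pinball loss. A compact sketch follows; hyperparameters and the paper's aging-factor integration are not reproduced.

```python
# Quantile Huber loss used by QR-DQN: pairwise TD errors between predicted
# and target quantiles, weighted by |tau - 1{u < 0}| and smoothed by Huber.
import torch

def quantile_huber_loss(pred_q, target_q, kappa=1.0):
    # pred_q: (batch, N) quantile estimates; target_q: (batch, N') targets
    N = pred_q.shape[1]
    taus = (torch.arange(N, dtype=torch.float32) + 0.5) / N
    u = target_q.unsqueeze(1) - pred_q.unsqueeze(2)      # (batch, N, N')
    huber = torch.where(u.abs() <= kappa,
                        0.5 * u ** 2,
                        kappa * (u.abs() - 0.5 * kappa))
    weight = (taus.view(1, N, 1) - (u < 0).float()).abs()
    return (weight * huber / kappa).mean()
```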
[LG-255] Asynchronous MultiAgent Reinforcement Learning for 5G Routing under Side Constraints
链接: https://arxiv.org/abs/2602.00035
作者: Sebastian Racedo,Brigitte Jaumard,Oscar Delgado,Meysam Masoudi
类目: Networking and Internet Architecture (cs.NI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:
Abstract:Networks in current 5G and beyond systems increasingly carry heterogeneous traffic with diverse quality-of-service constraints, making real-time routing decisions both complex and time-critical. Common approaches, such as heuristics with human intervention, training a single centralized RL policy, or synchronizing updates across multiple learners, struggle with scalability and straggler effects. We address this by proposing an asynchronous multi-agent reinforcement learning (AMARL) framework in which independent PPO agents, one per service, plan routes in parallel and commit resource deltas to a shared global resource environment. This coordination by state preserves feasibility across services and enables specialization for service-specific objectives. We evaluate the method on an O-RAN-like network simulation using near-real-time traffic data from the city of Montreal, comparing against a single-agent PPO baseline. AMARL achieves a similar Grade of Service (GoS, i.e., acceptance rate) and end-to-end latency, with reduced training wall-clock time and improved robustness to demand shifts. These results suggest that asynchronous, service-specialized agents provide a scalable and practical approach to distributed routing, with applicability extending beyond the O-RAN domain.
[LG-256] RAPTOR-AI for Disaster OODA Loop: Hierarchical Multimodal RAG with Experience-Driven Agentic Decision-Making
链接: https://arxiv.org/abs/2602.00030
作者: Takato Yasuno
类目: Machine Learning (cs.LG)
*备注: 4 figures, 3 tables
Abstract:Effective humanitarian assistance and disaster relief (HADR) requires rapid situational understanding, reliable decision support, and the ability to generalize across diverse and previously unseen disaster contexts. This work introduces an agentic Retrieval-Augmented Generation (RAG) framework designed to support the three canonical phases of disaster response: initial rescue, mid-term recovery, and long-term reconstruction. To achieve robust multimodal grounding, we construct a hierarchical knowledge base that integrates textual disaster manuals, historical lessons (e.g., the 2011 Tohoku earthquake), and both aerial and ground-level imagery. Our system builds on the open-source multimodal implementation, which processes 46 tsunami-related PDFs (2,378 pages) using BLIP-based image captioning, ColVBERT embeddings, and long-context summarization to generate an efficient, structured multimodal retrieval tree optimized for disaster knowledge preservation. An agentic controller dynamically selects retrieval strategies (e.g., RAPTOR, ColBERT) through entropy-aware scene abstraction, enabling adaptive reasoning across heterogeneous inputs. Additionally, a lightweight LoRA-based post-training method injects experiential knowledge from past disasters, enhancing the models’ capacity to support both expert and non-expert responders. Experiments on real disaster datasets demonstrate improved situational grounding, enhanced task decomposition accuracy, and superior usability for emergency operations. Incorporating recent advances in long-context RAG systems, agentic information retrieval, and contemporary emergency response AI, our system achieves substantial gains through adaptive retrieval-augmented generation with self-reasoning and multimodal chain-of-thought capabilities.
[LG-257] ELLMPEG: An Edge-based Agentic LLM Video Processing Tool
链接: https://arxiv.org/abs/2602.00028
作者: Zoha Azimi,Reza Farahani,Radu Prodan,Christian Timmerer
类目: Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: 12 pages, 5 tables, 8 Figures, accepted for the MMSys 2026 conference
Abstract:Large language models (LLMs), the foundation of generative AI systems like ChatGPT, are transforming many fields and applications, including multimedia, enabling more advanced content generation, analysis, and interaction. However, cloud-based LLM deployments face three key limitations: high computational and energy demands, privacy and reliability risks from remote processing, and recurring API costs. Recent advances in agentic AI, especially in structured reasoning and tool use, offer a better way to exploit open and locally deployed tools and LLMs. This paper presents ELLMPEG, an edge-enabled agentic LLM framework for the automated generation of video-processing commands. ELLMPEG integrates tool-aware Retrieval-Augmented Generation (RAG) with iterative self-reflection to produce and locally verify executable FFmpeg and VVenC commands directly at the edge, eliminating reliance on external cloud APIs. To evaluate ELLMPEG, we collect a dedicated prompt dataset comprising 480 diverse queries covering different categories of FFmpeg and the Versatile Video Codec (VVC) encoder (VVenC) commands. We validate command generation accuracy and evaluate four open-source LLMs based on command validity, tokens generated per second, inference time, and energy efficiency. We also execute the generated commands to assess their runtime correctness and practical applicability. Experimental results show that Qwen2.5, when augmented with the ELLMPEG framework, achieves an average command-generation accuracy of 78% with zero recurring API cost, outperforming all other open-source models across both the FFmpeg and VVenC datasets.
信息检索
[IR-0] Mapping a Decade of Avian Influenza Research (2014-2023): A Scientometric Analysis from Web of Science
链接: https://arxiv.org/abs/2602.01712
作者: Muneer Ahmad,Undie Felicia Nkatv,Amrita Sharma,Gorrety Maria Juma,Nicholas Kamoga,Julirine Nakanwag
类目: Digital Libraries (cs.DL); Databases (cs.DB); Information Retrieval (cs.IR)
*备注: 24 pages, 7 figures, Research Article
Abstract:This scientometric study analyzes Avian Influenza research from 2014 to 2023 using bibliographic data from the Web of Science database. We examined publication trends, sources, authorship, collaborative networks, document types, and geographical distribution to gain insights into the global research landscape. Results reveal a steady increase in publications, with high contributions from Chinese and American institutions. Journals such as PLoS One and the Journal of Virology published the highest number of studies, indicating their influence in this field. The most prolific institutions include the Chinese Academy of Sciences and the University of Hong Kong, while the College of Veterinary Medicine at South China Agricultural University emerged as the most productive department. China and the USA lead in publication volume, though developed nations like the United Kingdom and Germany exhibit a higher rate of international collaboration. “Articles” are the most common document type, constituting 84.6% of the total, while “Reviews” account for 7.6%. This study provides a comprehensive view of global trends in Avian Influenza research, emphasizing the need for collaborative efforts across borders.
[IR-1] Unmediated AI-Assisted Scholarly Citations
链接: https://arxiv.org/abs/2602.01686
作者: Stefan Szeider
类目: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
*备注:
Abstract:Traditional bibliography databases require users to navigate search forms and manually copy citation data. Language models offer an alternative: a natural-language interface where researchers write text with informal citation fragments, which are automatically resolved to proper references. However, language models are not reliable for scholarly work as they generate fabricated (hallucinated) citations at substantial rates. We present an architectural approach that combines the natural-language interface of LLM chatbots with the accuracy of direct database access, implemented through the Model Context Protocol. Our system enables language models to search bibliographic databases, perform fuzzy matching, and export verified entries, all through conversational interaction. A key architectural principle bypasses the language model during final data export: entries are fetched directly from authoritative sources, with timeout protection, to guarantee accuracy. We demonstrate this approach with MCP-DBLP, a server providing access to the DBLP computer science bibliography. The system transforms form-based bibliographic services into conversational assistants that maintain scholarly integrity. This architecture is adaptable to other bibliographic databases and academic data sources. (Journal reference: Open Conference Proceedings, Vol. 8 (2026): The Second Bridge on Artificial Intelligence for Scholarly Communication (AAAI-26).)
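The architectural idea — expose bibliographic search as a tool so the LLM queries the authoritative source instead of generating citations — can be sketched with the public DBLP search API and the Python MCP SDK's FastMCP helper. Treat the exact SDK surface and the response-parsing details below as assumptions; this is not MCP-DBLP's implementation.

```python
# Sketch of an MCP tool server for verified citation lookup against DBLP.
import requests
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("dblp-search-sketch")

@mcp.tool()
def search_dblp(query: str, max_hits: int = 5) -> list[dict]:
    """Return title/venue/year hits from DBLP for a citation fragment."""
    r = requests.get("https://dblp.org/search/publ/api",
                     params={"q": query, "format": "json", "h": max_hits},
                     timeout=10)                    # timeout protection
    hits = r.json()["result"]["hits"].get("hit", [])
    return [h["info"] for h in hits]                # data bypasses the LLM

if __name__ == "__main__":
    mcp.run()
```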
[IR-2] The Algorithmic Self-Portrait: Deconstructing Memory in ChatGPT
链接: https://arxiv.org/abs/2602.01450
作者: Abhisek Dash,Soumi Das,Elisabeth Kirsten,Qinyuan Wu,Sai Keerthana Karnam,Krishna P. Gummadi,Thorsten Holz,Muhammad Bilal Zafar,Savvas Zannettou
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Information Retrieval (cs.IR)
*备注: This paper has been accepted at The ACM Web Conference 2026
Abstract:To enable personalized and context-aware interactions, conversational AI systems have introduced a new mechanism: Memory. Memory creates what we refer to as the Algorithmic Self-portrait: a new form of personalization derived from users’ self-disclosed information divulged within private conversations. While memory enables more coherent exchanges, the underlying processes of memory creation remain opaque, raising critical questions about data sensitivity, user agency, and the fidelity of the resulting portrait. To bridge this research gap, we analyze 2,050 memory entries from 80 real-world ChatGPT users. Our analyses reveal three key findings: (1) A striking 96% of memories in our dataset are created unilaterally by the conversational system, potentially shifting agency away from the user; (2) Memories in our dataset contain a rich mix of GDPR-defined personal data (in 28% of memories) along with psychological insights about participants (in 52% of memories); and (3) A significant majority of the memories (84%) are directly grounded in user context, indicating faithful representation of the conversations. Finally, we introduce a framework, Attribution Shield, that anticipates these inferences, alerts users to potentially sensitive memory inferences, and suggests query reformulations to protect personal information without sacrificing utility.
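As a rough illustration of the kind of pre-send screening an Attribution-Shield-style layer could perform, the sketch below flags message content that might seed sensitive memory entries. The categories and keyword patterns are illustrative assumptions, not the paper's classifier.

```python
import re

# Illustrative sensitivity categories, loosely inspired by GDPR-defined
# personal data; the patterns are naive keyword stand-ins.
SENSITIVE_PATTERNS = {
    "health":   re.compile(r"\b(diagnos\w+|medication|therapy)\b", re.I),
    "location": re.compile(r"\b(i live (in|at)|my address)\b", re.I),
}

def screen_message(text: str) -> list[str]:
    """Return sensitive categories a memory entry could be inferred from."""
    return [cat for cat, pat in SENSITIVE_PATTERNS.items() if pat.search(text)]

flags = screen_message("My address is 12 Main St and I take medication X.")
if flags:
    print(f"Alert: this message may create sensitive memories ({flags}); "
          "consider rephrasing before sending.")
```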
[IR-3] Optimizing Retrieval Components for a Shared Backbone via Component-Wise Multi-Stage Training
链接: https://arxiv.org/abs/2602.00805
作者: Yunhan Li,Mingjie Xie,Zihan Gong,Zeyang Shi,Gengshen Wu,Min Yang
类目: Information Retrieval (cs.IR)
*备注: 4 pages, 3 figures, 3 tables
Abstract:Recent advances in embedding-based retrieval have enabled dense retrievers to serve as core infrastructure in many industrial systems, where a single retrieval backbone is often shared across multiple downstream applications. In such settings, retrieval quality directly constrains system performance and extensibility, while coupling model selection, deployment, and rollback decisions across applications. In this paper, we present empirical findings and a system-level solution for optimizing retrieval components deployed as a shared backbone in production legal retrieval systems. We adopt a multi-stage optimization framework for dense retrievers and rerankers, and show that different retrieval components exhibit stage-dependent trade-offs. These observations motivate a component-wise, mixed-stage configuration rather than relying on a single uniformly optimal checkpoint. The resulting backbone is validated through end-to-end evaluation and deployed as a shared retrieval service supporting multiple industrial applications.
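The component-wise, mixed-stage idea reduces to a simple selection rule: score each component's checkpoints per training stage, then let each component keep its own best stage rather than forcing one global choice. A minimal sketch with a made-up score table:

```python
scores = {
    # component -> {training stage -> offline eval metric (e.g. NDCG@10)}
    "dense_retriever": {"stage1": 0.61, "stage2": 0.66, "stage3": 0.64},
    "reranker":        {"stage1": 0.70, "stage2": 0.69, "stage3": 0.73},
}

# A single uniformly "best" stage forces a compromise across components:
uniform = max(scores["dense_retriever"],  # candidate stage names
              key=lambda s: sum(scores[c][s] for c in scores))

# Component-wise configuration: each component keeps its own best stage.
mixed = {c: max(stage_scores, key=stage_scores.get)
         for c, stage_scores in scores.items()}

print(uniform)  # 'stage3'
print(mixed)    # {'dense_retriever': 'stage2', 'reranker': 'stage3'}
```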
[IR-4] Towards Trustworthy Multimodal Recommendation
链接: https://arxiv.org/abs/2602.00730
作者: Zixuan Li
类目: Information Retrieval (cs.IR)
*备注: Preprint, 10 pages, 5 figures
Abstract:Recent advances in multimodal recommendation have demonstrated the effectiveness of incorporating visual and textual content into collaborative filtering. However, real-world deployments raise an increasingly important yet underexplored issue: trustworthiness. On modern e-commerce platforms, multimodal content can be misleading or unreliable (e.g., visually inconsistent product images or click-bait titles), injecting untrustworthy signals into multimodal representations and making existing recommenders brittle under modality corruption. In this work, we take a step towards trustworthy multimodal recommendation from both a method and an analysis perspective. First, we propose a plug-and-play modality-level rectification component that mitigates untrustworthy modality features by learning soft correspondences between items and multimodal features. Using lightweight projections and Sinkhorn-based soft matching, the rectification suppresses mismatched modality signals while preserving semantic consistency, and can be integrated into existing multimodal recommenders without architectural modifications. Second, we present two practical insights on interaction-level trustworthiness under noisy collaborative signals: (i) training-set pseudo interactions can help or hurt performance under noise depending on prior-signal alignment; and (ii) propagation-graph pseudo edges can also help or hurt robustness, as message passing may amplify misalignment. Extensive experiments on multiple datasets and backbones under varying corruption levels demonstrate improved robustness from modality rectification and validate the above interaction-level observations.
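The Sinkhorn-based soft matching at the core of the rectification component can be sketched with a standard Sinkhorn-Knopp iteration. The NumPy code below is a minimal illustration with uniform marginals and a cosine-distance cost; the dimensions, temperature, and the exact way rectified features are formed are assumptions, not the paper's implementation.

```python
import numpy as np

def sinkhorn(cost: np.ndarray, eps: float = 0.05, iters: int = 50) -> np.ndarray:
    """Entropy-regularized transport plan with uniform marginals."""
    K = np.exp(-cost / eps)                          # Gibbs kernel
    a = np.full(cost.shape[0], 1.0 / cost.shape[0])  # row marginal
    b = np.full(cost.shape[1], 1.0 / cost.shape[1])  # column marginal
    v = np.ones_like(b)
    for _ in range(iters):                           # alternating scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
items = rng.normal(size=(8, 16))                     # item embeddings
feats = rng.normal(size=(8, 16))                     # e.g. image features
cos = (items @ feats.T) / (
    np.linalg.norm(items, axis=1)[:, None]
    * np.linalg.norm(feats, axis=1)[None, :])
plan = sinkhorn(1.0 - cos)                           # cost = cosine distance
# Each item's (normalized) row of the plan acts as soft correspondence
# weights over the modality features, suppressing mismatched signals.
rectified = (plan / plan.sum(axis=1, keepdims=True)) @ feats
```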
[IR-5] SWGCN: Synergy Weighted Graph Convolutional Network for Multi-Behavior Recommendation
链接: https://arxiv.org/abs/2602.00727
作者: Fangda Chen,Yueyang Wang,Chaoli Lou,Min Gao,Qingyu Xiong
类目: Information Retrieval (cs.IR)
*备注: Accepted by Information Sciences
Abstract:Multi-behavior recommendation paradigms have emerged to capture diverse user activities, forecasting primary conversions (e.g., purchases) by leveraging secondary signals like browsing history. However, current graph-based methods often overlook cross-behavioral synergistic signals and the fine-grained intensity of individual actions. To overcome these shortcomings, we introduce the Synergy Weighted Graph Convolutional Network (SWGCN), which comprises two novel components: a Target Preference Weigher, which adaptively assigns weights to user-item interactions within each behavior, and a Synergy Alignment Task, which guides training by leveraging an Auxiliary Preference Valuator. This task prioritizes interactions from synergistic signals that more accurately reflect user preferences. We rigorously evaluate the model through comprehensive tests on three open-source datasets: Taobao, IJCAI, and Beibei. On the Taobao dataset, SWGCN yields relative gains of 112.49% and 156.36% in terms of Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG), respectively. It also yields consistent gains on IJCAI and Beibei, confirming its robustness and generalizability across datasets. Our implementation is open-sourced and can be accessed via this https URL.
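The central mechanism, graph convolution over per-interaction weights rather than binary edges, can be illustrated with one LightGCN-style propagation step. In the sketch below the weights are random stand-ins for the Target Preference Weigher's output, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 4, 6, 8
R = (rng.random((n_users, n_items)) < 0.4).astype(float)  # binary interactions
W = R * rng.random((n_users, n_items))                    # per-edge weights

# Symmetric normalization of the weighted bipartite adjacency.
du = W.sum(1, keepdims=True) + 1e-8   # weighted user degrees
di = W.sum(0, keepdims=True) + 1e-8   # weighted item degrees
A = W / np.sqrt(du) / np.sqrt(di)

U = rng.normal(size=(n_users, d))     # user embeddings
V = rng.normal(size=(n_items, d))     # item embeddings

# One propagation layer: users aggregate weighted item messages, and
# items aggregate weighted user messages.
U_next = A @ V
V_next = A.T @ U
```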
[IR-6] Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation
链接: https://arxiv.org/abs/2602.00632
作者: Hongxun Ding,Keqin Bao,Jizhi Zhang,Yi Fang,Wenxin Xu,Fuli Feng,Xiangnan He
类目: Information Retrieval (cs.IR)
*备注:
Abstract:While Long Chain-of-Thought (Long CoT) reasoning has shown promise in Large Language Models (LLMs), its adoption for enhancing recommendation quality is growing rapidly. In this work, we critically examine this trend and argue that Long CoT is inherently ill-suited for the sequential recommendation domain. We attribute this misalignment to two primary factors: excessive inference latency and the lack of explicit cognitive reasoning patterns in user behavioral data. Driven by these observations, we propose pivoting away from the CoT structure to directly leverage its underlying mechanism, Reinforcement Learning (RL), to explore the item space. However, applying RL directly faces significant obstacles, notably low sample efficiency, where most actions fail to provide learning signals, and training instability. To overcome these limitations, we propose RISER, a novel Reinforced Item Space Exploration framework for Recommendation. RISER is designed to transform non-learnable trajectories into effective pairwise preference data for optimization. Furthermore, it incorporates specific strategies to ensure stability, including the prevention of redundant rollouts and the constraint of token-level update magnitudes. Extensive experiments on three real-world datasets show that RISER significantly outperforms competitive baselines, establishing a robust paradigm for RL-enhanced LLM recommendation. Our code will be available at this https URL.
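The sample-efficiency fix RISER is built around, turning rollouts that individually carry no learning signal into pairwise preference data, can be illustrated directly. The sketch below uses made-up rollouts, and the margin threshold is an assumption, not the paper's rule.

```python
from itertools import combinations

rollouts = [  # (generated item list, reward from the recommendation metric)
    ("item_17, item_4, item_9", 0.8),
    ("item_2, item_4, item_31", 0.3),
    ("item_5, item_11, item_8", 0.3),
]

def to_preference_pairs(rollouts, margin=0.1):
    """Turn reward gaps into chosen/rejected pairs; skip ties below a margin."""
    pairs = []
    for (a, ra), (b, rb) in combinations(rollouts, 2):
        if abs(ra - rb) >= margin:
            chosen, rejected = (a, b) if ra > rb else (b, a)
            pairs.append({"chosen": chosen, "rejected": rejected})
    return pairs

# The tied pair (0.3 vs 0.3) yields no signal; the other two pairs do.
print(to_preference_pairs(rollouts))
```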
[IR-7] Equity vs. Equality: Optimizing Ranking Fairness for Tailored Provider Needs
链接: https://arxiv.org/abs/2602.00495
作者: Yiteng Tu,Weihang Su,Shuguang Han,Yiqun Liu,Qingyao Ai
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Ranking plays a central role in connecting users and providers in Information Retrieval (IR) systems, making provider-side fairness an important challenge. While recent research has begun to address fairness in ranking, most existing approaches adopt an equality-based perspective, aiming to ensure that providers with similar content receive similar exposure. However, this perspective overlooks the diverse needs of real-world providers, whose utility from ranking may depend not only on exposure but also on outcomes like sales or engagement. Consequently, exposure-based fairness may not accurately capture the true utility perceived by different providers with varying priorities. To this end, we introduce an equity-oriented fairness framework that explicitly models each provider’s preferences over key outcomes such as exposure and sales, thus evaluating whether a ranking algorithm can fulfill these individualized goals while maintaining overall fairness across providers. Based on this framework, we develop EquityRank, a gradient-based algorithm that jointly optimizes user-side effectiveness and provider-side equity. Extensive offline and online simulations demonstrate that EquityRank offers improved trade-offs between effectiveness and fairness and adapts to heterogeneous provider needs.
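A minimal way to picture the equity-oriented utility is a provider-specific weighting over exposure and sales. The sketch below assumes a simple linear form and a variance-based unfairness measure; both are illustrative assumptions rather than EquityRank's actual objective.

```python
import numpy as np

exposure = np.array([0.5, 0.3, 0.2])   # share of ranking exposure
sales    = np.array([0.2, 0.5, 0.3])   # share of conversions
alpha    = np.array([0.9, 0.1, 0.5])   # provider-specific exposure preference

# Each provider's perceived utility under its own priorities.
utility = alpha * exposure + (1 - alpha) * sales
# Equity goal: equalize perceived utility, not raw exposure.
unfairness = utility.var()

print(utility)     # [0.47 0.48 0.25]
print(unfairness)
```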
[IR-8] RAGRouter-Bench: A Dataset and Benchmark for Adaptive RAG Routing
链接: https://arxiv.org/abs/2602.00296
作者: Ziqi Wang,Xi Zhu,Shuhang Lin,Haochen Xue,Minghao Guo,Yongfeng Zhang
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Retrieval-Augmented Generation (RAG) has become a core paradigm for grounding large language models with external knowledge. Despite extensive efforts exploring diverse retrieval strategies, existing studies predominantly focus on query-side complexity or isolated method improvements, lacking a systematic understanding of how RAG paradigms behave across different query-corpus contexts and effectiveness-efficiency trade-offs. In this work, we introduce RAGRouter-Bench, the first dataset and benchmark designed for adaptive RAG routing. RAGRouter-Bench revisits retrieval from a query-corpus compatibility perspective and standardizes five representative RAG paradigms for systematic evaluation across 7,727 queries and 21,460 documents spanning diverse domains. The benchmark incorporates three canonical query types together with fine-grained semantic and structural corpus metrics, as well as a unified evaluation of both generation quality and resource consumption. Experiments with DeepSeek-V3 and LLaMA-3.1-8B demonstrate that no single RAG paradigm is universally optimal, that paradigm applicability is strongly shaped by query-corpus interactions, and that more advanced mechanisms do not necessarily yield better effectiveness-efficiency trade-offs. These findings underscore the necessity of routing-aware evaluation and establish a foundation for adaptive, interpretable, and generalizable next-generation RAG systems.
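The routing decision the benchmark targets can be pictured as a function from lightweight query/corpus features to a paradigm choice. The sketch below is a deliberately naive stand-in; the feature set, thresholds, and paradigm names are assumptions, not the benchmark's router.

```python
def route(query: str, corpus_doc_count: int, multi_hop: bool) -> str:
    """Pick a RAG paradigm per query from simple query/corpus features."""
    if multi_hop:
        return "iterative_rag"    # decompose-and-retrieve for multi-hop queries
    if len(query.split()) <= 5 and corpus_doc_count < 1000:
        return "vanilla_rag"      # cheap single-pass retrieval suffices
    return "rerank_rag"           # retrieve-then-rerank for harder cases

print(route("Who wrote Hamlet?", corpus_doc_count=500, multi_hop=False))
```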
[IR-9] First Steps Lasting Impact: Platform-Aware Forensics for the Next Generation of Analysts
链接: https://arxiv.org/abs/2602.00160
作者: Vinayak Jain,Sneha Sudhakaran,Saranyan Senthivel
类目: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
*备注: 21st International Conference on Cyber Warfare and Security (ICCWS 2026)
Abstract:The reliability of cyber forensic evidence acquisition is strongly influenced by the underlying operating system (Windows, macOS, or Linux), due to inherent variations in file system structures, encryption protocols, and forensic tool compatibility. Disk forensics, one of the most widely used techniques in digital investigations, faces distinct obstacles on each platform. Windows, with its predominantly NTFS and FAT file systems, typically supports reliable disk imaging and analysis through established tools such as FTK Imager and Autopsy/Sleuth Kit. However, encryption features frequently pose challenges to evidence acquisition. Conversely, Linux environments, which rely on file systems like ext4 and XFS, generally offer greater transparency, yet the transient nature of log retention often complicates forensic analysis. In instances where anti-forensic strategies such as encryption and compression render traditional disk forensics insufficient, memory forensics becomes crucial. While memory forensic methodologies demonstrate robustness across Windows and Linux platforms through frameworks like Volatility, platform-specific difficulties persist. Memory analysis on Linux systems benefits from tools like LiME, snapshot utilities, and dd for memory acquisition; nevertheless, live memory acquisition on Linux can still present challenges. This research systematically assesses both disk and memory forensic acquisition techniques across samples representing Windows and Linux systems. By identifying effective combinations of forensic tools and configurations tailored to each operating system, the study aims to improve the accuracy and reliability of evidence collection. It further evaluates current forensic tools and highlights a persistent gap: consistently assuring forensic input reliability and footprint integrity.

