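This post lists the latest papers retrieved from arXiv.org on 2026-02-09. It is updated automatically and grouped into six broad areas: NLP, CV, ML, AI, IR, and MA.
Note: The daily paper data is fetched from arXiv.org and refreshed automatically every morning at around 12:30.
Tip: If the list has not been updated on a given day, either arXiv published no new papers that day or the script failed. Failures are fixed the same day whenever possible.
Contents
Overview (2026-02-09)
533 papers were added today, including:
- Natural Language Processing: 76 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 146 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 98 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 190 papers (Machine Learning (cs.LG))
- Multiagent Systems: 6 papers (Multiagent Systems (cs.MA))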
Multiagent Systems
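[MA-0] Implementing Grassroots Logic Programs with Multiagent Transition Systems and AI
Quick read: This paper addresses the gap between the nondeterministic operational semantics of the concurrent logic programming language Grassroots Logic Programs (GLP) and the needs of executable implementations on different platforms such as workstations and smartphones. The key to the solution is two implementation-ready operational semantics: dGLP, a deterministic semantics for single-agent GLP, and madGLP, a multiagent semantics that is deterministic at the agent level. Each is proved correct with respect to the original semantics and serves as a formal specification for AI, from which the implementations are developed, supporting consistent and correct implementations across platforms.
Link: https://arxiv.org/abs/2602.06934
Authors: Ehud Shapiro
Affiliations: Unknown
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
Comments: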
Abstract:Grassroots Logic Programs (GLP) is a concurrent logic programming language with variables partitioned into paired readers and writers, conjuring both linear logic and futures/promises: an assignment is produced at most once via the sole occurrence of a writer (promise) and consumed at most once via the sole occurrence of its paired reader (future), and may contain additional readers and/or writers, enabling the concise expression of rich multidirectional communication modalities. GLP was designed as a language for grassroots platforms – distributed systems with multiple instances that can operate independently of each other and of any global resource, and can coalesce into ever larger instances – with its target architecture being smartphones communicating peer-to-peer. The operational semantics of Concurrent (single-agent) GLP and of multiagent GLP (maGLP) were defined via transition systems/multiagent transition systems, respectively. Here, we describe the mathematics developed to facilitate the workstation- and smartphone-based implementations of GLP by AI in Dart. We developed dGLP – implementation-ready deterministic operational semantics for single-agent GLP – and proved it correct with respect to the Concurrent GLP operational semantics; dGLP was used by AI as a formal spec, from which it developed a workstation-based implementation of GLP. We developed madGLP – an implementation-ready multiagent operational semantics for maGLP – and proved it correct with respect to the maGLP operational semantics; madGLP is deterministic at the agent level (not at the system level due to communication asynchrony), and is being used by AI as a formal spec from which it develops a smartphone-based implementation of maGLP.
[MA-1] Pairwise is Not Enough: Hypergraph Neural Networks for Multi-Agent Pathfinding ICLR2026
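Quick read: This paper targets attention dilution in Multi-Agent Path Finding (MAPF), which arises because conventional Graph Neural Networks (GNNs) support only pairwise message passing, and the resulting weak group coordination in dense environments. The key to the solution is HMAGAT (Hypergraph Multi-Agent Attention Network), which builds attention over directed hypergraphs to model higher-order interactions among agents explicitly, mitigating attention dilution and capturing group dynamics that GNNs struggle to express. Experiments show that it outperforms the current state-of-the-art model despite having far fewer parameters and far less training data.
Link: https://arxiv.org/abs/2602.06733
Authors: Rishabh Jain,Keisuke Okumura,Michael Amir,Pietro Lio,Amanda Prorok
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted at ICLR 2026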
Abstract:Multi-Agent Path Finding (MAPF) is a representative multi-agent coordination problem, where multiple agents are required to navigate to their respective goals without collisions. Solving MAPF optimally is known to be NP-hard, leading to the adoption of learning-based approaches to alleviate the online computational burden. Prevailing approaches, such as Graph Neural Networks (GNNs), are typically constrained to pairwise message passing between agents. However, this limitation leads to suboptimal behaviours and critical issues, such as attention dilution, particularly in dense environments where group (i.e. beyond just two agents) coordination is most critical. Despite the importance of such higher-order interactions, existing approaches have not been able to fully explore them. To address this representational bottleneck, we introduce HMAGAT (Hypergraph Multi-Agent Attention Network), a novel architecture that leverages attentional mechanisms over directed hypergraphs to explicitly capture group dynamics. Empirically, HMAGAT establishes a new state-of-the-art among learning-based MAPF solvers: e.g., despite having just 1M parameters and being trained on 100× less data, it outperforms the current SoTA 85M parameter model. Through detailed analysis of HMAGAT’s attention values, we demonstrate how hypergraph representations mitigate the attention dilution inherent in GNNs and capture complex interactions where pairwise methods fail. Our results illustrate that appropriate inductive biases are often more critical than the training data size or sheer parameter count for multi-agent problems.
[MA-2] Sample-Efficient Policy Space Response Oracles with Joint Experience Best Response AAMAS2026
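Quick read: This paper addresses convergence difficulties in multi-agent reinforcement learning (MARL) caused by non-stationarity and insufficient strategy diversity, as well as the high sample cost of Policy Space Response Oracles (PSRO), which trains a separate best response (BR) for each agent and so becomes expensive in many-agent or simulator-expensive settings. The core solution is Joint Experience Best Response (JBR): trajectories are collected once under the current meta-strategy profile, and this joint dataset is reused to compute BRs for all agents simultaneously, which reduces environment interaction and improves sample efficiency. Because JBR turns BR computation into an offline reinforcement learning problem, the paper adds three remedies for distribution-shift bias: Conservative JBR (safe policy improvement), Exploration-Augmented JBR (perturbed data collection with theoretical guarantees), and Hybrid BR (JBR interleaved with periodic independent BR updates). Experiments show that Exploration-Augmented JBR gives the best accuracy-efficiency trade-off, while Hybrid BR attains near-PSRO performance at a fraction of the sample cost, making PSRO more practical for large-scale strategic learning while preserving equilibrium robustness.
Link: https://arxiv.org/abs/2602.06599
Authors: Ariyan Bighashdel,Thiago D. Simão,Frans A. Oliehoek
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)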
Abstract:Multi-agent reinforcement learning (MARL) offers a scalable alternative to exact game-theoretic analysis but suffers from non-stationarity and the need to maintain diverse populations of strategies that capture non-transitive interactions. Policy Space Response Oracles (PSRO) address these issues by iteratively expanding a restricted game with approximate best responses (BRs), yet per-agent BR training makes it prohibitively expensive in many-agent or simulator-expensive settings. We introduce Joint Experience Best Response (JBR), a drop-in modification to PSRO that collects trajectories once under the current meta-strategy profile and reuses this joint dataset to compute BRs for all agents simultaneously. This amortizes environment interaction and improves the sample efficiency of best-response computation. Because JBR converts BR computation into an offline RL problem, we propose three remedies for distribution-shift bias: (i) Conservative JBR with safe policy improvement, (ii) Exploration-Augmented JBR that perturbs data collection and admits theoretical guarantees, and (iii) Hybrid BR that interleaves JBR with periodic independent BR updates. Across benchmark multi-agent environments, Exploration-Augmented JBR achieves the best accuracy-efficiency trade-off, while Hybrid BR attains near-PSRO performance at a fraction of the sample cost. Overall, JBR makes PSRO substantially more practical for large-scale strategic learning while preserving equilibrium robustness.
[MA-3] Prism: Spectral Parameter Sharing for Multi-Agent Reinforcement Learning
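Quick read: This paper addresses behavioral homogenization caused by parameter sharing in multi-agent reinforcement learning (MARL): conventional fully shared architectures often make all agents collapse into similar policies, which limits the diversity and performance of group cooperation. The key to the solution is the Prism framework, which represents the shared network in the spectral domain via singular value decomposition (SVD). All agents share the singular vector directions, but each learns its own spectral mask over the singular values, which induces inter-agent diversity while preserving scalability. On several benchmarks, Prism achieves competitive performance with superior resource efficiency.
Link: https://arxiv.org/abs/2602.06476
Authors: Kyungbeom Kim,Seungwon Oh,Kyung-Joong Kim
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: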
Abstract:Parameter sharing is a key strategy in multi-agent reinforcement learning (MARL) for improving scalability, yet conventional fully shared architectures often collapse into homogeneous behaviors. Recent methods introduce diversity through clustering, pruning, or masking, but typically compromise resource efficiency. We propose Prism, a parameter sharing framework that induces inter-agent diversity by representing shared networks in the spectral domain via singular value decomposition (SVD). All agents share the singular vector directions while learning distinct spectral masks on singular values. This mechanism encourages inter-agent diversity and preserves scalability. Extensive experiments on both homogeneous (LBF, SMACv2) and heterogeneous (MaMuJoCo) benchmarks show that Prism achieves competitive performance with superior resource efficiency.
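The spectral sharing described in the abstract can be illustrated with a minimal sketch. It assumes the shared weight is already stored as factors `U`, `S`, `V` (as an SVD would produce) and that each agent owns only a mask over the singular values. The factor values, mask values, and function names are hypothetical, and the paper's actual parameterization and mask training may differ.

```python
# Illustrative sketch only: names, shapes, and values are assumptions, not the paper's code.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]

def agent_weight(u, s, v, mask):
    """Rebuild one agent's weight matrix W = U diag(mask * s) V^T."""
    scaled = [m * sv for m, sv in zip(mask, s)]
    diag = [[scaled[i] if i == j else 0.0 for j in range(len(s))] for i in range(len(s))]
    return matmul(matmul(u, diag), transpose(v))

# Shared factors (hypothetical values): every agent uses the same U, S, V.
U = [[1.0, 0.0], [0.0, 1.0]]
S = [3.0, 0.5]
V = [[0.0, 1.0], [1.0, 0.0]]

# Per-agent spectral masks (hypothetical values): the only agent-specific parameters.
masks = {"agent_0": [1.0, 0.0], "agent_1": [0.5, 1.0]}

weights = {name: agent_weight(U, S, V, mask) for name, mask in masks.items()}
```

In this toy setup the two agents end up with different effective weight matrices while storing only one extra vector each, which is the scalability argument the abstract makes.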
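[MA-4] RuleSmith: Multi-Agent LLMs for Automated Game Balancing
Quick read: This paper addresses the longstanding challenge of game balancing, which traditionally relies on repeated playtesting, expert intuition, and extensive manual tuning. The key to the solution is the RuleSmith framework, described as the first to achieve automated game balancing by leveraging the reasoning capabilities of multi-agent LLMs. It couples a game engine, multi-agent LLM self-play, and Bayesian optimization over a multi-dimensional rule space. The Bayesian optimization uses acquisition-based adaptive sampling and discrete projection: promising candidates receive more evaluation games for accurate assessment, while exploratory candidates receive fewer games for efficient exploration. Experiments show that RuleSmith converges to highly balanced configurations and produces interpretable rule adjustments that can be applied directly to downstream game systems.
Link: https://arxiv.org/abs/2602.06232
Authors: Ziyao Zeng,Chen Liu,Tianyu Liu,Hao Wang,Xiatao Sun,Fengyu Yang,Xiaofeng Liu,Zhiwen Fan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments: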
Abstract:Game balancing is a longstanding challenge requiring repeated playtesting, expert intuition, and extensive manual tuning. We introduce RuleSmith, the first framework that achieves automated game balancing by leveraging the reasoning capabilities of multi-agent LLMs. It couples a game engine, multi-agent LLMs self-play, and Bayesian optimization operating over a multi-dimensional rule space. As a proof of concept, we instantiate RuleSmith on CivMini, a simplified civilization-style game containing heterogeneous factions, economy systems, production rules, and combat mechanics, all governed by tunable parameters. LLM agents interpret textual rulebooks and game states to generate actions, to conduct fast evaluation of balance metrics such as win-rate disparities. To search the parameter landscape efficiently, we integrate Bayesian optimization with acquisition-based adaptive sampling and discrete projection: promising candidates receive more evaluation games for accurate assessment, while exploratory candidates receive fewer games for efficient exploration. Experiments show that RuleSmith converges to highly balanced configurations and provides interpretable rule adjustments that can be directly applied to downstream game systems. Our results illustrate that LLM simulation can serve as a powerful surrogate for automating design and balancing in complex multi-agent environments.
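The adaptive sampling idea (more evaluation games for promising candidates, fewer for exploratory ones) can be sketched as follows. The linear mapping from acquisition score to game budget, the `min_games` and `max_games` bounds, and the function names are assumptions made for illustration; the abstract does not specify the allocation rule.

```python
# Illustrative sketch only: the allocation rule below is an assumption, not RuleSmith's.

def allocate_games(acquisition_scores, min_games=2, max_games=10):
    """Map each candidate's acquisition score to a number of evaluation games."""
    lo, hi = min(acquisition_scores), max(acquisition_scores)
    span = hi - lo
    budgets = []
    for score in acquisition_scores:
        # Normalize to [0, 1]; if all scores tie, treat every candidate as mid-range.
        frac = (score - lo) / span if span > 0 else 0.5
        budgets.append(min_games + round(frac * (max_games - min_games)))
    return budgets

def win_rate_gap(wins_a, wins_b):
    """Balance metric: absolute win-rate disparity between two factions."""
    total = wins_a + wins_b
    return abs(wins_a - wins_b) / total if total else 0.0

# Three hypothetical rule configurations with increasing acquisition scores.
budgets = allocate_games([1.0, 2.0, 3.0])
gap = win_rate_gap(6, 4)
```

A smaller `gap` indicates a more balanced configuration; in a full loop this value would be fed back to the optimizer as the objective for each candidate.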
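[MA-5] Communication Enhances LLMs' Stability in Strategic Thinking
Quick read: This paper addresses the unpredictable strategic behavior that Large Language Models (LLMs) show in multi-agent tasks because of their context dependence, in particular the volatile and unstable cooperation trajectories observed in a repeated Prisoner’s Dilemma. The key to the solution is short, costless pre-play messages that emulate the cheap-talk paradigm and make strategic behavior more consistent and predictable. The results show that this communication reduces trajectory noise in a majority of the model-context pairings studied. The stabilizing effect persists across prompt variants and decoding regimes and is largest for models with higher baseline volatility, which makes it a low-cost, practical intervention for improving the reliability of multi-agent LLM systems.
Link: https://arxiv.org/abs/2602.06081
Authors: Nunzio Lore,Babak Heydari
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: 15 pages, 1 figure, 6 tables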
Abstract:Large Language Models (LLMs) often exhibit pronounced context-dependent variability that undermines predictable multi-agent behavior in tasks requiring strategic thinking. Focusing on models that range from 7 to 9 billion parameters in size engaged in a ten-round repeated Prisoner’s Dilemma, we evaluate whether short, costless pre-play messages emulating the cheap-talk paradigm affect strategic stability. Our analysis uses simulation-level bootstrap resampling and nonparametric inference to compare cooperation trajectories fitted with LOWESS regression across both the messaging and the no-messaging condition. We demonstrate consistent reductions in trajectory noise across a majority of the model-context pairings being studied. The stabilizing effect persists across multiple prompt variants and decoding regimes, though its magnitude depends on model choice and contextual framing, with models displaying higher baseline volatility gaining the most. While communication rarely produces harmful instability, we document a few context-specific exceptions and identify the limited domains in which communication harms stability. These findings position cheap-talk style communication as a low-cost, practical tool for improving the predictability and reliability of strategic behavior in multi-agent LLM systems.
Natural Language Processing
[NLP-0] Learning a Generative Meta-Model of LLM Activations
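Quick read: This paper addresses the reliance of existing methods for analyzing neural network activations, such as principal component analysis (PCA) and sparse autoencoders, on strong structural assumptions, which limit unbiased exploration of the distribution of a model's internal states. The key to the solution is to use generative models instead: diffusion models are trained on one billion residual stream activations to build "meta-models" that learn the distribution of a network's internal states and act as priors that improve intervention fidelity. Experiments show that diffusion loss decreases smoothly with compute and reliably predicts downstream utility. Applying the meta-model's prior to steering interventions improves fluency, the meta-model's neurons increasingly isolate concepts into individual units, and sparse probing scores improve as loss decreases, which supports generative meta-models as a scalable path toward interpretability without strong structural assumptions.
Link: https://arxiv.org/abs/2602.06964
Authors: Grace Luo,Jiahai Feng,Trevor Darrell,Alec Radford,Jacob Steinhardt
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: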
Abstract:Existing approaches for analyzing neural network activations, such as PCA and sparse autoencoders, rely on strong structural assumptions. Generative models offer an alternative: they can uncover structure without such assumptions and act as priors that improve intervention fidelity. We explore this direction by training diffusion models on one billion residual stream activations, creating “meta-models” that learn the distribution of a network’s internal states. We find that diffusion loss decreases smoothly with compute and reliably predicts downstream utility. In particular, applying the meta-model’s learned prior to steering interventions improves fluency, with larger gains as loss decreases. Moreover, the meta-model’s neurons increasingly isolate concepts into individual units, with sparse probing scores that scale as loss decreases. These results suggest generative meta-models offer a scalable path toward interpretability without restrictive structural assumptions. Project page: this https URL.
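[NLP-1] InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning
Quick read: This paper addresses three problems of conventional chain-of-thought (CoT) reasoning in large reasoning models: quadratic growth in inference cost, context length limits, and degraded reasoning caused by lost-in-the-middle effects. The authors propose InftyThink+, an end-to-end reinforcement learning framework that optimizes the entire iterative reasoning trajectory, building on model-controlled iteration boundaries and explicit summarization. The key step is to treat when to summarize, what to preserve, and how to resume reasoning as learnable decisions, trained with a two-stage scheme (supervised cold-start followed by trajectory-level reinforcement learning). This lets the model learn efficient and accurate reasoning strategies, improving performance while reducing inference latency and accelerating training.
Link: https://arxiv.org/abs/2602.06960
Authors: Yuchen Yan,Liang Jiang,Jin Jiang,Shuaicheng Li,Zujie Wen,Zhiqiang Zhang,Jun Zhou,Jian Shao,Yueting Zhuang,Yongliang Shen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL Code: this https URL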
Abstract:Large reasoning models achieve strong performance by scaling inference-time chain-of-thought, but this paradigm suffers from quadratic cost, context length limits, and degraded reasoning due to lost-in-the-middle effects. Iterative reasoning mitigates these issues by periodically summarizing intermediate thoughts, yet existing methods rely on supervised learning or fixed heuristics and fail to optimize when to summarize, what to preserve, and how to resume reasoning. We propose InftyThink+, an end-to-end reinforcement learning framework that optimizes the entire iterative reasoning trajectory, building on model-controlled iteration boundaries and explicit summarization. InftyThink+ adopts a two-stage training scheme with supervised cold-start followed by trajectory-level reinforcement learning, enabling the model to learn strategic summarization and continuation decisions. Experiments on DeepSeek-R1-Distill-Qwen-1.5B show that InftyThink+ improves accuracy by 21% on AIME24 and outperforms conventional long chain-of-thought reinforcement learning by a clear margin, while also generalizing better to out-of-distribution benchmarks. Moreover, InftyThink+ significantly reduces inference latency and accelerates reinforcement learning training, demonstrating improved reasoning efficiency alongside stronger performance.
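[NLP-2] DAWN: Dependency-Aware Fast Inference for Diffusion LLMs
Quick read: This paper addresses the untapped efficiency of diffusion large language models (dLLMs), whose inference relies on conservative parallel decoding strategies because of the quality-speed trade-off. The core challenge is that parallel decoding assumes each position can be filled independently, whereas tokens are often semantically coupled: the correct choice at one position constrains the valid choices at others, so ignoring these inter-token dependencies degrades output quality. The key to the solution is DAWN, a training-free, dependency-aware decoding method that builds a token dependency graph and uses it to select more reliable positions to unmask in parallel at each iteration, speeding up inference by 1.80-8.06x over baselines while preserving generation quality.
Link: https://arxiv.org/abs/2602.06953
Authors: Lizhuo Luo,Zhuoran Shi,Jiajun Luo,Zhi Wang,Shen Ren,Wenya Wang,Tianwei Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: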
Abstract:Diffusion large language models (dLLMs) have shown advantages in text generation, particularly due to their inherent ability for parallel decoding. However, constrained by the quality–speed trade-off, existing inference solutions adopt conservative parallel strategies, leaving substantial efficiency potential underexplored. A core challenge is that parallel decoding assumes each position can be filled independently, but tokens are often semantically coupled. Thus, the correct choice at one position constrains valid choices at others. Without modeling these inter-token dependencies, parallel strategies produce deteriorated outputs. Motivated by this insight, we propose DAWN, a training-free, dependency-aware decoding method for fast dLLM inference. DAWN extracts token dependencies and leverages two key motivations: (1) positions dependent on unmasked certain positions become more reliable, (2) simultaneously unmasking strongly coupled uncertain positions induces errors. Given those findings, DAWN leverages a dependency graph to select more reliable unmasking positions at each iteration, achieving high parallelism with negligible loss in generation quality. Extensive experiments across multiple models and datasets demonstrate that DAWN speedups the inference by 1.80-8.06x over baselines while preserving the generation quality. Code is released at this https URL.
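The second observation in the abstract, that strongly coupled uncertain positions should not be unmasked together, can be sketched with a greedy selection rule. Everything concrete here is an assumption for illustration: the confidence values, the coupling scores, both thresholds, and the function name are hypothetical, and the sketch does not model how DAWN extracts dependencies or how it boosts positions that depend on already unmasked ones.

```python
# Illustrative sketch only: a greedy rule under assumed inputs, not DAWN's algorithm.

def select_positions(confidence, coupling, conf_threshold=0.9, coupling_threshold=0.5):
    """Pick masked positions to unmask in one parallel decoding step.

    confidence: dict mapping position -> model confidence for its top token.
    coupling:   dict mapping frozenset({i, j}) -> dependency strength between positions.
    """
    chosen = []
    # Visit positions from most to least confident.
    for pos in sorted(confidence, key=confidence.get, reverse=True):
        if confidence[pos] >= conf_threshold:
            chosen.append(pos)  # confident positions are unmasked directly
            continue
        # Skip an uncertain position if it is strongly coupled to another
        # uncertain position that was already chosen in this step.
        conflict = any(
            confidence[other] < conf_threshold
            and coupling.get(frozenset((pos, other)), 0.0) >= coupling_threshold
            for other in chosen
        )
        if not conflict:
            chosen.append(pos)
    return sorted(chosen)

# Hypothetical step: position 2 is strongly coupled to position 1, so it waits.
confidence = {0: 0.95, 1: 0.6, 2: 0.55, 3: 0.4}
coupling = {frozenset((1, 2)): 0.8, frozenset((2, 3)): 0.1}
selected = select_positions(confidence, coupling)
```

Position 2 is deferred to a later iteration, by which time position 1 is fixed and can inform its prediction.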
[NLP-3] Optimal Turkish Subword Strategies at Scale: Systematic Evaluation of Data, Vocabulary, Morphology Interplay
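Quick read: This paper addresses the systematic optimization of subword tokenization for morphologically rich languages (MRLs) such as Turkish, where the central challenge is balancing vocabulary efficiency against morphological fidelity. Prior studies typically did not control the size of the tokenizer's training corpus, offered limited intrinsic diagnostics, and evaluated a narrow range of downstream tasks. The key to the solution is a structured "subwords manifest" that jointly varies vocabulary size and tokenizer training corpus size (data and vocabulary coupling), compares WordPiece, morphology-level, and character-level tokenizers under matched parameter budgets, and builds a morphology-aware evaluation framework. Its diagnostics include boundary-level micro/macro F1, decoupled lemma atomicity versus surface boundary hits, over/under-segmentation indices, character/word edit distances (CER/WER), continuation rates, and affix-type coverage. Together these link intrinsic diagnostics to extrinsic performance and give actionable guidance for designing effective tokenizers for MRLs.
Link: https://arxiv.org/abs/2602.06942
Authors: Duygu Altinok
Affiliations: Independent Researcher, Berlin, Germany
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Submitted to Cambridge NLP journal, all rights belong to them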
Abstract:Tokenization is a pivotal design choice for neural language modeling in morphologically rich languages (MRLs) such as Turkish, where productive agglutination challenges both vocabulary efficiency and morphological fidelity. Prior studies have explored tokenizer families and vocabulary sizes but typically (i) vary vocabulary without systematically controlling the tokenizer’s training corpus, (ii) provide limited intrinsic diagnostics, and (iii) evaluate a narrow slice of downstream tasks. We present the first comprehensive, principled study of Turkish subword tokenization; a “subwords manifest”, that jointly varies vocabulary size and tokenizer training corpus size (data and vocabulary coupling), compares multiple tokenizer families under matched parameter budgets (WordPiece, morphology level, and character baselines), and evaluates across semantic (NLI, STS, sentiment analysis, NER), syntactic (POS, dependency parsing), and morphology-sensitive probes. To explain why tokenizers succeed or fail, we introduce a morphology-aware diagnostic toolkit that goes beyond coarse aggregates to boundary-level micro/macro F1, decoupled lemma atomicity vs. surface boundary hits, over/under-segmentation indices, character/word edit distances (CER/WER), continuation rates, and affix-type coverage and token-level atomicity. Our contributions are fourfold: (i) a systematic investigation of the vocabulary-corpus-success triad; (ii) a unified, morphology-aware evaluation framework linking intrinsic diagnostics to extrinsic outcomes; (iii) controlled comparisons identifying when character-level and morphology-level tokenization pay off; and (iv) an open-source release of evaluation code, tokenizer pipelines, and models. As the first work of its kind, this “subwords manifest” delivers actionable guidance for building effective tokenizers in MRLs and establishes a reproducible foundation for future research.
[NLP-4] Endogenous Resistance to Activation Steering in Language Models
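Quick read: This paper studies how large language models resist task-misaligned activation steering during inference: even after an external intervention pushes the model off topic, it can spontaneously recover its generation quality. The authors call this Endogenous Steering Resistance (ESR). The key to the analysis is the use of sparse autoencoder (SAE) latents to identify 26 latents that are causally linked to ESR, with zero-ablation experiments providing evidence that these latents form dedicated internal consistency-checking circuits. The study also finds that ESR can be deliberately strengthened through meta-prompting and fine-tuning, which gives an empirical basis and a practical route for understanding and controlling these resistance mechanisms.
Link: https://arxiv.org/abs/2602.06941
Authors: Alex McKenzie,Keenan Pepper,Stijn Servaes,Martin Leitgab,Murat Cubuktepe,Mike Vaiana,Diogo de Lucena,Judd Rosenblatt,Michael S. A. Graziano
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: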
Abstract:Large language models can resist task-misaligned activation steering during inference, sometimes recovering mid-generation to produce improved responses even when steering remains active. We term this Endogenous Steering Resistance (ESR). Using sparse autoencoder (SAE) latents to steer model activations, we find that Llama-3.3-70B shows substantial ESR, while smaller models from the Llama-3 and Gemma-2 families exhibit the phenomenon less frequently. We identify 26 SAE latents that activate differentially during off-topic content and are causally linked to ESR in Llama-3.3-70B. Zero-ablating these latents reduces the multi-attempt rate by 25%, providing causal evidence for dedicated internal consistency-checking circuits. We demonstrate that ESR can be deliberately enhanced through both prompting and training: meta-prompts instructing the model to self-monitor increase the multi-attempt rate by 4x for Llama-3.3-70B, and fine-tuning on self-correction examples successfully induces ESR-like behavior in smaller models. These findings have dual implications: ESR could protect against adversarial manipulation but might also interfere with beneficial safety interventions that rely on activation steering. Understanding and controlling these resistance mechanisms is important for developing transparent and controllable AI systems. Code is available at this http URL.
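[NLP-5] Halluverse-M3: A multitask multilingual benchmark for hallucination in LLMs
Quick read: This paper addresses the persistent problem of hallucination in Large Language Models (LLMs) in multilingual and generative settings, and in particular the lack of systematic analysis across languages, task types, and hallucination levels. The key to the solution is the Halluverse-M^3 dataset, which covers four languages (English, Arabic, Hindi, and Turkish), supports two generation tasks (question answering and dialogue summarization), and explicitly distinguishes entity-level, relation-level, and sentence-level hallucinations. Hallucinated samples are produced through controlled editing and validated by human annotators, which keeps a clear correspondence between the original content and the hallucinated output and yields a realistic, challenging multilingual, multi-task benchmark for fine-grained hallucination detection.
Link: https://arxiv.org/abs/2602.06920
Authors: Samir Abdaljalil,Parichit Sharma,Erchin Serpedin,Hasan Kurban
Affiliations: Texas A&M University; University of Maryland; Hamad Bin Khalifa University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: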
Abstract:Hallucinations in large language models remain a persistent challenge, particularly in multilingual and generative settings where factual consistency is difficult to maintain. While recent models show strong performance on English-centric benchmarks, their behavior across languages, tasks, and hallucination types is not yet well understood. In this work, we introduce Halluverse-M^3, a dataset designed to enable systematic analysis of hallucinations across multiple languages, multiple generation tasks, and multiple hallucination categories. Halluverse-M^3 covers four languages, English, Arabic, Hindi, and Turkish, and supports two generation tasks: question answering and dialogue summarization. The dataset explicitly distinguishes between entity-level, relation-level, and sentence-level hallucinations. Hallucinated outputs are constructed through a controlled editing process and validated by human annotators, ensuring clear alignment between original content and hallucinated generations. Using this dataset, we evaluate a diverse set of contemporary open-source and proprietary language models on fine-grained hallucination detection. Our results show that question answering is consistently easier than dialogue summarization, while sentence-level hallucinations remain challenging even for the strongest models. Performance is highest in English and degrades in lower-resource languages, with Hindi exhibiting the lowest detection accuracy. Overall, Halluverse-M^3 provides a realistic and challenging benchmark for studying hallucinations in multilingual, multi-task settings. We release the dataset to support future research on hallucination detection and mitigation (footnote: this https URL).
[NLP-6] Uncovering Cross-Objective Interference in Multi-Objective Alignment
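Quick read: This paper addresses a persistent failure mode in multi-objective alignment of large language models (LLMs): training improves only a subset of objectives while the others degrade, which the authors formalize as cross-objective interference. The key to the solution is a plug-and-play method, Covariance Targeted Weight Adaptation (CTWA), which mitigates interference by maintaining positive covariance between each objective's reward and the training signal. The method builds on a derived local covariance law that holds for classic scalarization algorithms and, under mild conditions, for the clipped surrogate objectives used in modern alignment. A global convergence analysis under the Polyak–Łojasiewicz condition then establishes when non-convex scalarized optimization converges globally and how interference depends on the model's geometric properties.
Link: https://arxiv.org/abs/2602.06869
Authors: Yining Lu,Meng Jiang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: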
Abstract:We study a persistent failure mode in multi-objective alignment for large language models (LLMs): training improves performance on only a subset of objectives while causing others to degrade. We formalize this phenomenon as cross-objective interference and conduct the first systematic study across classic scalarization algorithms, showing that interference is pervasive and exhibits strong model dependence. To explain this phenomenon, we derive a local covariance law showing that an objective improves at first order when its reward exhibits positive covariance with the scalarized score. We extend this analysis to clipped surrogate objectives used in modern alignment, demonstrating that the covariance law remains valid under mild conditions despite clipping. Building on this analysis, we propose Covariance Targeted Weight Adaptation (CTWA), a plug-and-play method that maintains positive covariance between objective rewards and the training signal to effectively mitigate cross-objective interference. Finally, we complement these local improvement conditions with a global convergence analysis under the Polyak–Łojasiewicz condition, establishing when non-convex scalarized optimization achieves global convergence and how cross-objective interference depends on specific model geometric properties.
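The covariance condition stated in the abstract (an objective improves at first order when its reward has positive covariance with the scalarized score) suggests a simple diagnostic, sketched below. This is not CTWA itself: the use of a sample covariance over a batch of responses, the objective names, the reward values, and the weights are all assumptions for illustration.

```python
# Illustrative diagnostic only: a batch-level covariance check, not the paper's CTWA method.

def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def covariance_report(rewards, weights):
    """Covariance of each objective's reward with the weighted (scalarized) score."""
    names = list(rewards)
    n = len(rewards[names[0]])
    scalarized = [sum(weights[k] * rewards[k][i] for k in names) for i in range(n)]
    return {k: covariance(rewards[k], scalarized) for k in names}

# Hypothetical per-response rewards for two objectives over a batch of four samples.
rewards = {
    "helpful": [1.0, 2.0, 3.0, 4.0],
    "harmless": [4.0, 3.5, 3.0, 2.5],
}
weights = {"helpful": 1.0, "harmless": 1.0}

report = covariance_report(rewards, weights)
at_risk = [name for name, cov in report.items() if cov <= 0]
```

In this toy batch the "harmless" reward moves against the scalarized score, so it is flagged as the objective likely to degrade; a weight-adaptation scheme would then adjust `weights` to restore positive covariance.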
[NLP-7] SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks ICLR2026
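Quick read: This paper addresses the challenges of multi-turn jailbreaks against safety-aligned chatbots, where existing approaches break down under exploration complexity and intent drift. The core solution is the SEMA framework, which has two stages. The first, prefilling self-tuning, fine-tunes on non-refusal, well-structured multi-turn adversarial prompts that are self-generated from a minimal prefix, which stabilizes subsequent learning. The second, reinforcement learning with an intent-drift-aware reward, anchors the harmful intent through a reward that combines intent alignment, compliance risk, and level of detail. The method needs no external data or existing strategies and runs as an open-loop attack, which reduces exploration complexity and unifies single- and multi-turn settings. It achieves state-of-the-art attack success rates (ASR) across multiple datasets and victim models, providing a stronger and more realistic stress test for large language model (LLM) safety.
Link: https://arxiv.org/abs/2602.06854
Authors: Mingqian Feng,Xiaodong Liu,Weiwei Yang,Jialin Song,Xuekai Zhu,Chenliang Xu,Jianfeng Gao
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: ICLR 2026, 37 pages, 13 tables, 7 figures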
Abstract:Multi-turn jailbreaks capture the real threat model for safety-aligned chatbots, where single-turn attacks are merely a special case. Yet existing approaches break under exploration complexity and intent drift. We propose SEMA, a simple yet effective framework that trains a multi-turn attacker without relying on any existing strategies or external data. SEMA comprises two stages. Prefilling self-tuning enables usable rollouts by fine-tuning on non-refusal, well-structured, multi-turn adversarial prompts that are self-generated with a minimal prefix, thereby stabilizing subsequent learning. Reinforcement learning with intent-drift-aware reward trains the attacker to elicit valid multi-turn adversarial prompts while maintaining the same harmful objective. We anchor harmful intent in multi-turn jailbreaks via an intent-drift-aware reward that combines intent alignment, compliance risk, and level of detail. Our open-loop attack regime avoids dependence on victim feedback, unifies single- and multi-turn settings, and reduces exploration complexity. Across multiple datasets, victim models, and jailbreak judges, our method achieves state-of-the-art (SOTA) attack success rates (ASR), outperforming all single-turn baselines, manually scripted and template-driven multi-turn baselines, as well as our SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization) variants. For instance, SEMA performs an average 80.1% ASR@1 across three closed-source and open-source victim models on AdvBench, 33.9% over SOTA. The approach is compact, reproducible, and transfers across targets, providing a stronger and more realistic stress test for large language model (LLM) safety and enabling automatic redteaming to expose and localize failure modes. Our code is available at: this https URL.
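[NLP-8] The Representational Geometry of Number
Quick read: This paper addresses a central question in cognitive science: whether conceptual representations converge onto a shared manifold to support generalization or diverge into orthogonal subspaces to minimize task interference. Prior work has found evidence for both, but a mechanistic account of how these properties coexist and change across tasks has been missing. The key proposal is that representational sharing lies not in the concepts themselves but in the geometric relations between them. Using number concepts as a testbed and language models as high-dimensional computational substrates, the authors find that number representations keep a stable relational structure across tasks. Task-specific representations are embedded in distinct subspaces, with low-level features such as magnitude and parity encoded along separable linear directions, yet these subspaces are largely transformable into one another via linear mappings, so the representations share relational structure despite living in different subspaces. This shows how language models balance shared structure in conceptual representation with functional flexibility.
Link: https://arxiv.org/abs/2602.06843
Authors: Zhimin Hu,Lanhao Niu,Sashank Varma
Affiliations: Georgia Tech; University of Edinburgh
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: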
Abstract:A central question in cognitive science is whether conceptual representations converge onto a shared manifold to support generalization, or diverge into orthogonal subspaces to minimize task interference. While prior work has discovered evidence for both, a mechanistic account of how these properties coexist and transform across tasks remains elusive. We propose that representational sharing lies not in the concepts themselves, but in the geometric relations between them. Using number concepts as a testbed and language models as high-dimensional computational substrates, we show that number representations preserve a stable relational structure across tasks. Task-specific representations are embedded in distinct subspaces, with low-level features like magnitude and parity encoded along separable linear directions. Crucially, we find that these subspaces are largely transformable into one another via linear mappings, indicating that representations share relational structure despite being located in distinct subspaces. Together, these results provide a mechanistic lens of how language models balance the shared structure of number representation with functional flexibility. It suggests that understanding arises when task-specific transformations are applied to a shared underlying relational structure of conceptual representations.
[NLP-9] Visual Word Sense Disambiguation with CLIP through Dual-Channel Text Prompting and Image Augmentations
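[Quick read]: This paper addresses lexical ambiguity, a persistent problem for large language models (LLMs) in natural language understanding, and in particular how word senses can be resolved through the visual domain. The core solution is an interpretable Visual Word Sense Disambiguation (VWSD) framework. It uses CLIP to map ambiguous text and candidate images into a shared multimodal space, enriches the text embeddings with dual-channel prompts (semantic and photo-based), refines the image embeddings with test-time augmentation, and selects the best-matching image by cosine similarity. On the SemEval-2023 VWSD dataset the enriched embeddings raise MRR from 0.7227 to 0.7590 and Hit Rate from 0.5810 to 0.6220. Dual-channel prompting proves more efficient than aggressive image augmentation, which underlines the importance of precise, CLIP-aligned prompts for preserving semantic specificity.
Link: https://arxiv.org/abs/2602.06799
Authors: Shamik Bhattacharya, Daniel Perkins, Yaren Dogan, Vineeth Konjeti, Sudarshan Srinivasan, Edmon Begoli
Affiliations: 1: Google Brain; 2: Google; 3: Meta; 4: Stability.AI
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 6 figures, pending journal/workshop submission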
Abstract:Ambiguity poses persistent challenges in natural language understanding for large language models (LLMs). To better understand how lexical ambiguity can be resolved through the visual domain, we develop an interpretable Visual Word Sense Disambiguation (VWSD) framework. The model leverages CLIP to project ambiguous language and candidate images into a shared multimodal space. We enrich textual embeddings using a dual-channel ensemble of semantic and photo-based prompts with WordNet synonyms, while image embeddings are refined through robust test-time augmentations. We then use cosine similarity to determine the image that best aligns with the ambiguous text. When evaluated on the SemEval-2023 VWSD dataset, enriching the embeddings raises the MRR from 0.7227 to 0.7590 and the Hit Rate from 0.5810 to 0.6220. Ablation studies reveal that dual-channel prompting provides strong, low-latency performance, whereas aggressive image augmentation yields only marginal gains. Additional experiments with WordNet definitions and multilingual prompt ensembles further suggest that noisy external signals tend to dilute semantic specificity, reinforcing the effectiveness of precise, CLIP-aligned prompts for visual word sense disambiguation.
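The selection and scoring steps in this abstract can be illustrated with a minimal sketch. It assumes the text and image embeddings have already been computed (CLIP itself is not called here), and it assumes "Hit Rate" means the gold image is ranked first. The function names are hypothetical and are not taken from the paper's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_of_gold(text_emb, image_embs, gold_index):
    """1-based rank of the gold image when candidates are sorted by similarity."""
    scores = [cosine(text_emb, img) for img in image_embs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(gold_index) + 1


def mrr_and_hit_rate(ranks):
    """Mean reciprocal rank and top-1 hit rate over a list of 1-based ranks."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hit = sum(1 for r in ranks if r == 1) / len(ranks)
    return mrr, hit
```

Calling `rank_of_gold` once per test instance and passing the collected ranks to `mrr_and_hit_rate` gives the two metrics reported in the abstract.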
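[NLP-10] Generating Data-Driven Reasoning Rubrics for Domain-Adaptive Reward Modeling
[Quick read]: This paper addresses the difficulty that Large Language Models (LLMs) have in reliably identifying errors in thinking traces when verifying reasoning output, particularly for long outputs, domains that require expert knowledge, and tasks without verifiable reward signals. The key to the solution is a data-driven method that automatically builds highly granular taxonomies of reasoning errors ("rubrics"), which improves LLM error detection on unseen reasoning traces. Experiments show that rubric-based classification outperforms baseline methods in technical domains such as coding, math, and chemical engineering. The rubrics can also be used to build stronger LLM-as-judge reward functions for training reasoning models with reinforcement learning: accuracy on difficult tasks improves by 45% over rewards from general LLM judges, and approaches the performance of training with verifiable rewards while using as little as 20% as many gold labels.
Link: https://arxiv.org/abs/2602.06795
Authors: Kate Sanders, Nathaniel Weir, Sapana Chaudhary, Kaj Bostrom, Huzefa Rangwala
Affiliations: Johns Hopkins University; Amazon Web Services
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: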
Abstract:An impediment to using Large Language Models (LLMs) for reasoning output verification is that LLMs struggle to reliably identify errors in thinking traces, particularly in long outputs, domains requiring expert knowledge, and problems without verifiable rewards. We propose a data-driven approach to automatically construct highly granular reasoning error taxonomies to enhance LLM-driven error detection on unseen reasoning traces. Our findings indicate that classification approaches that leverage these error taxonomies, or “rubrics”, demonstrate strong error identification compared to baseline methods in technical domains like coding, math, and chemical engineering. These rubrics can be used to build stronger LLM-as-judge reward functions for reasoning model training via reinforcement learning. Experimental results show that these rewards have the potential to improve models’ task accuracy on difficult domains over models trained by general LLMs-as-judges by +45%, and approach performance of models trained by verifiable rewards while using as little as 20% as many gold labels. Through our approach, we extend the usage of reward rubrics from assessing qualitative model behavior to assessing quantitative model correctness on tasks typically learned via RLVR rewards. This extension opens the door for teaching models to solve complex technical problems without a full dataset of gold labels, which are often highly costly to procure.
[NLP-11] R-Align: Enhancing Generative Reward Models through Rationale-Centric Meta-Judging
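[Quick read]: This paper addresses the fact that Generative Reward Models (GenRMs) are trained and evaluated on outcome labels only, so the quality of their reasoning goes unchecked, which in turn affects the final performance of Reinforcement Learning from Human Feedback (RLHF). The key solution is a rationale-centered alignment method, Rationale-Centric Alignment (R-Align), which adds gold judgments to training and explicitly supervises the consistency between the model's rationale and the reference judgment, thereby improving reasoning fidelity. Empirically, the method reduces Spurious Correctness (S-Corr) and yields consistent gains in policy model performance across several downstream tasks.
Link: https://arxiv.org/abs/2602.06763
Authors: Yanlin Lai, Mitt Huang, Hangyu Guo, Xiangfeng Wang, Haodong Li, Shaoxiong Zhan, Liang Zhao, Chengyuan Yao, Yinmin Zhang, Qi Han, Chun Yuan, Zheng Ge, Xiangyu Zhang, Daxin Jiang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Github: this https URL Huggingface: this https URL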
Abstract:Reinforcement Learning from Human Feedback (RLHF) remains indispensable for aligning large language models (LLMs) in subjective domains. To enhance robustness, recent work shifts toward Generative Reward Models (GenRMs) that generate rationales before predicting preferences. Yet in GenRM training and evaluation, practice remains outcome-label-only, leaving reasoning quality unchecked. We show that reasoning fidelity-the consistency between a GenRM’s preference decision and reference decision rationales-is highly predictive of downstream RLHF outcomes, beyond standard label accuracy. Specifically, we repurpose existing reward-model benchmarks to compute Spurious Correctness (S-Corr)-the fraction of label-correct decisions with rationales misaligned with golden judgments. Our empirical evaluation reveals substantial S-Corr even for competitive GenRMs, and higher S-Corr is associated with policy degeneration under optimization. To improve fidelity, we propose Rationale-Centric Alignment, R-Align, which augments training with gold judgments and explicitly supervises rationale alignment. R-Align reduces S-Corr on RM benchmarks and yields consistent gains in actor performance across STEM, coding, instruction following, and general tasks.
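The S-Corr metric defined in this abstract can be illustrated with a short sketch. It assumes the denominator is the set of label-correct decisions, and that each record already carries two boolean judgments. How rationale alignment is judged is not specified in the abstract, and the field names are hypothetical.

```python
def spurious_correctness(records):
    """Fraction of label-correct decisions whose rationale is misaligned.

    Each record is a dict with two assumed boolean fields:
      "label_correct"      - the predicted preference matches the gold label
      "rationale_aligned"  - the rationale agrees with the gold judgment
    """
    label_correct = [r for r in records if r["label_correct"]]
    if not label_correct:
        return 0.0  # no label-correct decisions, so nothing can be spurious
    spurious = sum(1 for r in label_correct if not r["rationale_aligned"])
    return spurious / len(label_correct)
```

A model can score well on label accuracy and still have a high value here, which is the gap the abstract points to.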
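[NLP-12] Table-as-Search: Formulate Long-Horizon Agentic Information Seeking as Table Completion
[Quick read]: This paper addresses the difficulty that current Information Seeking (InfoSeeking) agents have in staying focused and coherent during long-horizon exploration. The core challenge is that search state, including the planning procedure and a large volume of search results, is hard to track reliably inside a plain-text context. The key to the solution is Table-as-Search (TaS), a structured planning framework that recasts the InfoSeeking task as table completion. Each query is mapped to a structured table schema maintained in an external database, where rows are search candidates and columns are constraints or required information. The table manages search state precisely: filled cells strictly record history and results, while empty cells explicitly express the next search plan. This unifies Deep Search, Wide Search, and the more challenging DeepWide Search, and markedly improves the robustness, efficiency, and scalability of long-horizon information seeking.
Link: https://arxiv.org/abs/2602.06724
Authors: Tian Lan, Felix Henry, Bin Zhu, Qianghuai Jia, Junyang Ren, Qihang Pu, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo
Affiliations: Alibaba International Digital Commerce
Categories: Computation and Language (cs.CL)
Comments: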
Abstract:Current Information Seeking (InfoSeeking) agents struggle to maintain focus and coherence during long-horizon exploration, as tracking search states, including planning procedure and massive search results, within one plain-text context is inherently fragile. To address this, we introduce Table-as-Search (TaS), a structured planning framework that reformulates the InfoSeeking task as a Table Completion task. TaS maps each query into a structured table schema maintained in an external database, where rows represent search candidates and columns denote constraints or required information. This table precisely manages the search states: filled cells strictly record the history and search results, while empty cells serve as an explicit search plan. Crucially, TaS unifies three distinct InfoSeeking tasks: Deep Search, Wide Search, and the challenging DeepWide Search. Extensive experiments demonstrate that TaS significantly outperforms numerous state-of-the-art baselines across three kinds of benchmarks, including multi-agent framework and commercial systems. Furthermore, our analysis validates the TaS’s superior robustness in long-horizon InfoSeeking, alongside its efficiency, scalability and flexibility. Code and datasets are publicly released at this https URL.
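The table-as-state idea in this abstract can be sketched with a plain dictionary: rows are candidates, columns are required fields, and every empty cell is a pending search step. This is only an illustration under those assumptions. The paper keeps the table in an external database, and the helper names below are hypothetical.

```python
def new_table(candidates, columns):
    """Create a table with one row per candidate and every cell empty (None)."""
    return {cand: {col: None for col in columns} for cand in candidates}


def pending_cells(table):
    """Empty cells, read as the explicit plan of what still has to be searched."""
    return [
        (row, col)
        for row, cells in table.items()
        for col, value in cells.items()
        if value is None
    ]


def fill(table, row, col, value):
    """Record a search result; the filled cell then serves as search history."""
    table[row][col] = value
```

An agent loop under this sketch would repeatedly take one entry from `pending_cells`, run a search for it, and call `fill`, stopping when the list is empty.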
[NLP-13] Evaluating Prompt Engineering Strategies for Sentiment Control in AI-Generated Texts
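[Quick read]: This paper addresses how to make the sentiment of text generated by Large Language Models (LLMs) controllable, that is, how to steer AI output toward a specific emotional tone. The key to the solution is prompt engineering, in particular Few-Shot prompting with human-written examples, which gives the model task-specific sentiment guidance and substantially improves control over the sentiment of generated text without large amounts of labeled data or costly fine-tuning. The experiments indicate that the approach is especially efficient and practical in resource-constrained settings.
Link: https://arxiv.org/abs/2602.06692
Authors: Kerstin Sahler, Sophie Jentzsch
Affiliations: German Aerospace Center (DLR)
Categories: Computation and Language (cs.CL)
Comments: The definitive, peer-reviewed and edited version of this article is published in HHAI 2025 - Proceedings of the Fourth International Conference on Hybrid Human-Artificial Intelligence, Frontiers in Artificial Intelligence and Applications, Volume 408, ISBN 978-1-64368-611-0, pages 423 - 438, 2025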
Abstract:The groundbreaking capabilities of Large Language Models (LLMs) offer new opportunities for enhancing human-computer interaction through emotion-adaptive Artificial Intelligence (AI). However, deliberately controlling the sentiment in these systems remains challenging. The present study investigates the potential of prompt engineering for controlling sentiment in LLM-generated text, providing a resource-sensitive and accessible alternative to existing methods. Using Ekman’s six basic emotions (e.g., joy, disgust), we examine various prompting techniques, including Zero-Shot and Chain-of-Thought prompting using gpt-3.5-turbo, and compare it to fine-tuning. Our results indicate that prompt engineering effectively steers emotions in AI-generated texts, offering a practical and cost-effective alternative to fine-tuning, especially in data-constrained settings. In this regard, Few-Shot prompting with human-written examples was the most effective among other techniques, likely due to the additional task-specific guidance. The findings contribute valuable insights towards developing emotion-adaptive AI systems.
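[NLP-14] compar:IA: The French Government's LLM arena to collect French-language human prompts and preference data
[Quick read]: This paper addresses the reduced performance, weaker cultural alignment, and poorer safety robustness of Large Language Models (LLMs) in non-English languages, which stem in part from English-dominated pre-training data and human preference alignment datasets. The key to the solution is compar:IA, an open-source digital public service developed inside the French government to collect large-scale human preference data from a predominantly French-speaking general audience. The platform uses a blind pairwise comparison interface to capture unconstrained, real-world prompts and user judgments, while keeping participation friction low and applying privacy-preserving automated filtering. As of 2026-02-07 it had collected over 600,000 free-form prompts and 250,000 preference votes, about 89% of them in French, and three complementary datasets (conversations, votes, reactions) have been released to support multilingual model training, evaluation, and the study of human-AI interaction.
Link: https://arxiv.org/abs/2602.06669
Authors: Lucie Termignon, Simonas Zilinskas, Hadrien Pélissier, Aurélien Barrot, Nicolas Chesnais, Elie Gavoty
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages, 7 figures, preprint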
Abstract:Large Language Models (LLMs) often show reduced performance, cultural alignment, and safety robustness in non-English languages, partly because English dominates both pre-training data and human preference alignment datasets. Training methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) require human preference data, which remains scarce and largely non-public for many languages beyond English. To address this gap, we introduce compar:IA, an open-source digital public service developed inside the French government and designed to collect large-scale human preference data from a predominantly French-speaking general audience. The platform uses a blind pairwise comparison interface to capture unconstrained, real-world prompts and user judgments across a diverse set of language models, while maintaining low participation friction and privacy-preserving automated filtering. As of 2026-02-07, compar:IA has collected over 600,000 free-form prompts and 250,000 preference votes, with approximately 89% of the data in French. We release three complementary datasets – conversations, votes, and reactions – under open licenses, and present initial analyses, including a French-language model leaderboard and user interaction patterns. Beyond the French context, compar:IA is evolving toward an international digital public good, offering reusable infrastructure for multilingual model training, evaluation, and the study of human-AI interaction.
[NLP-15] Not All Layers Need Tuning: Selective Layer Restoration Recovers Diversity
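[Quick read]: This paper addresses mode collapse in Large Language Models (LLMs) after post-training: generation diversity drops and outputs become repetitive, most visibly in open-ended tasks. The key to the solution is Selective Layer Restoration (SLR), a training-free method. Building on the observation that different layers play different functional roles, it identifies the layers responsible for mode collapse and restores them to their pre-trained weights, which recovers generation diversity with no additional inference cost while keeping output quality high. A proxy task, Constrained Random Character (CRC), is used to quantify the trade-off between validity and diversity across restoration ranges. The method gives consistent improvements across several model families (Llama, Qwen, Gemma) and task types (creative writing, open-ended question answering, multi-step reasoning).
Link: https://arxiv.org/abs/2602.06665
Authors: Bowen Zhang, Meiyi Wang, Harold Soh
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 7 figures, 12 tables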
Abstract:Post-training improves instruction-following and helpfulness of large language models (LLMs) but often reduces generation diversity, which leads to repetitive outputs in open-ended settings, a phenomenon known as mode collapse. Motivated by evidence that LLM layers play distinct functional roles, we hypothesize that mode collapse can be localized to specific layers and that restoring a carefully chosen range of layers to their pre-trained weights can recover diversity while maintaining high output quality. To validate this hypothesis and decide which layers to restore, we design a proxy task – Constrained Random Character(CRC) – with an explicit validity set and a natural diversity objective. Results on CRC reveal a clear diversity-validity trade-off across restoration ranges and identify configurations that increase diversity with minimal quality loss. Based on these findings, we propose Selective Layer Restoration (SLR), a training-free method that restores selected layers in a post-trained model to their pre-trained weights, yielding a hybrid model with the same architecture and parameter count, incurring no additional inference cost. Across three different tasks (creative writing, open-ended question answering, and multi-step reasoning) and three different model families (Llama, Qwen, and Gemma), we find SLR can consistently and substantially improve output diversity while maintaining high output quality.
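The restoration step in this abstract amounts to copying a range of layers from the pre-trained checkpoint into the post-trained one. The sketch below uses plain dictionaries in place of real checkpoints and assumes parameter names of the form `layers.<index>.<...>`. Both the naming scheme and the function name are assumptions for illustration, not the authors' implementation.

```python
import re

# Assumed naming scheme: "layers.<index>.<parameter>"; real checkpoints may differ.
_LAYER_RE = re.compile(r"layers\.(\d+)\.")


def restore_layers(post_trained, pre_trained, start, end):
    """Return a hybrid state dict whose layers in [start, end) use pre-trained weights.

    Parameters outside the range, and parameters that belong to no numbered
    layer (for example embeddings), keep their post-trained values.
    """
    hybrid = dict(post_trained)  # shallow copy; the inputs are not modified
    for name in post_trained:
        match = _LAYER_RE.search(name)
        if match and start <= int(match.group(1)) < end:
            hybrid[name] = pre_trained[name]
    return hybrid
```

Because the hybrid has the same keys and shapes as the post-trained model, it loads into the same architecture, which matches the abstract's statement that parameter count and inference cost are unchanged.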
[NLP-16] Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought
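[Quick read]: This paper addresses the inherent trade-off between safety and helpfulness in Large Language Models (LLMs): static, one-size-fits-all safety policies offer no runtime controllability, so a deployed model may over-refuse benign requests or under-constrain harmful content. The key to the solution is the PACT (Prompt-configured Action via Chain-of-Thought) framework, which provides dynamic safety control through a hierarchical policy architecture. On top of a non-overridable global safety policy, users can define domain-specific risk categories and configure label-to-action rules (comply, guide, or reject). Safety decisions are decomposed into structured "Classify → Act" paths, which improves both utility and transparency. Experiments show that PACT stays close to state-of-the-art safety performance while markedly improving user-level controllability, easing the conflict between safety and helpfulness.
Link: https://arxiv.org/abs/2602.06650
Authors: Jianfeng Si, Lin Sun, Weihong Lin, Xiangzheng Zhang
Affiliations: Qiyuan Tech
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 5 tables, 2 figures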
Abstract:Large Language Models (LLMs) face a fundamental safety-helpfulness trade-off due to static, one-size-fits-all safety policies that lack runtime controllability, making it difficult to tailor responses to diverse application needs. As a result, models may over-refuse benign requests or under-constrain harmful ones. We present PACT (Prompt-configured Action via Chain-of-Thought), a framework for dynamic safety control through explicit, risk-aware reasoning. PACT operates under a hierarchical policy architecture: a non-overridable global safety policy establishes immutable boundaries for critical risks (e.g., child safety, violent extremism), while user-defined policies can introduce domain-specific (non-global) risk categories and specify label-to-action behaviors to improve utility in real-world deployment settings. The framework decomposes safety decisions into structured Classify → Act paths that route queries to the appropriate action (comply, guide, or reject) and render the decision-making process transparent. Extensive experiments demonstrate that PACT achieves near state-of-the-art safety performance under global policy evaluation while attaining the best controllability under user-specific policy evaluation, effectively mitigating the safety-helpfulness trade-off. We will release the PACT model suite, training data, and evaluation protocols to facilitate reproducible research in controllable safety alignment.
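The policy hierarchy in this abstract can be sketched as a lookup in which global rules always win over user configuration. The sketch covers only the "Act" half and assumes a risk label has already been produced by some classifier. The label names, default action, and function name are hypothetical and are not taken from the PACT release.

```python
# Assumed global categories; the abstract names child safety and violent
# extremism as examples of critical risks.
GLOBAL_POLICY = {"child_safety": "reject", "violent_extremism": "reject"}
VALID_ACTIONS = {"comply", "guide", "reject"}


def resolve_action(label, user_policy, default="comply"):
    """Map a risk label to an action; user rules cannot override global ones."""
    if label in GLOBAL_POLICY:
        return GLOBAL_POLICY[label]  # immutable boundary
    action = user_policy.get(label, default)
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown action {action!r} for label {label!r}")
    return action
```

Checking the global table first is what makes the global policy non-overridable in this sketch: a user entry for a global label is never consulted.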
[NLP-17] Reading Between the Waves: Robust Topic Segmentation Using Inter-Sentence Audio Features ICASSP2026
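[Quick read]: This paper addresses automatic topic segmentation for spoken content such as online videos and podcasts, which often spans multiple topics, in order to improve user navigation and downstream applications. Existing methods do not fully exploit acoustic features, which limits segmentation accuracy. The key to the solution is a multi-modal approach that fine-tunes a text encoder together with a Siamese audio encoder to capture acoustic cues around sentence boundaries. On a large-scale YouTube dataset the method shows substantial gains over text-only and multi-modal baselines, and on three additional datasets in Portuguese, German, and English it is more robust to automatic speech recognition (ASR) noise, which supports the value of learned acoustic features for robust topic segmentation.
Link: https://arxiv.org/abs/2602.06647
Authors: Steffen Freisinger, Philipp Seeberger, Tobias Bocklet, Korbinian Riedhammer
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted to IEEE ICASSP 2026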
Abstract:Spoken content, such as online videos and podcasts, often spans multiple topics, which makes automatic topic segmentation essential for user navigation and downstream applications. However, current methods do not fully leverage acoustic features, leaving room for improvement. We propose a multi-modal approach that fine-tunes both a text encoder and a Siamese audio encoder, capturing acoustic cues around sentence boundaries. Experiments on a large-scale dataset of YouTube videos show substantial gains over text-only and multi-modal baselines. Our model also proves more resilient to ASR noise and outperforms a larger text-only baseline on three additional datasets in Portuguese, German, and English, underscoring the value of learned acoustic features for robust topic segmentation.
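[NLP-18] FairJudge: An Adaptive Debiased and Consistent LLM-as-a-Judge
[Quick read]: This paper addresses three core problems of current LLM-as-a-Judge systems: limited adaptivity to task- and domain-specific evaluation criteria, systematic biases driven by non-semantic cues (position, length, format, and model provenance), and inconsistency across evaluation modes (for example pointwise versus pairwise). The key to the solution is FairJudge, which models judging behavior itself as a learnable, regularized policy and builds a high-information-density judging dataset that explicitly injects supervision signals aligned with evaluation behavior. On top of this dataset, a curriculum-style SFT-DPO-GRPO training paradigm progressively aligns rubric adherence, bias mitigation, and cross-mode consistency while avoiding catastrophic forgetting. Experiments on multiple internal and public benchmarks show that FairJudge improves agreement and F1, reduces non-semantic bias, and outperforms substantially larger instruction-tuned models.
Link: https://arxiv.org/abs/2602.06625
Authors: Bo Yang, Lanfei Feng, Yunkui Chen, Yu Zhang, Xiao Xu, Shijian Li
Affiliations: Zhejiang University
Categories: Computation and Language (cs.CL)
Comments: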
Abstract:Existing LLM-as-a-Judge systems suffer from three fundamental limitations: limited adaptivity to task- and domain-specific evaluation criteria, systematic biases driven by non-semantic cues such as position, length, format, and model provenance, and evaluation inconsistency that leads to contradictory judgments across different evaluation modes (e.g., pointwise versus pairwise). To address these issues, we propose FairJudge, an adaptive, debiased, and consistent LLM-as-a-Judge. Unlike prior approaches that treat the judge as a static evaluator, FairJudge models judging behavior itself as a learnable and regularized policy. From a data-centric perspective, we construct a high-information-density judging dataset that explicitly injects supervision signals aligned with evaluation behavior. Building on this dataset, we adopt a curriculum-style SFT-DPO-GRPO training paradigm that progressively aligns rubric adherence, bias mitigation, and cross-mode consistency, while avoiding catastrophic forgetting. Experimental results on multiple internal and public benchmarks show that FairJudge consistently improves agreement and F1, reduces non-semantic biases, and outperforms substantially larger instruction-tuned LLMs. All resources will be publicly released after acceptance to facilitate future research.
[NLP-19] Do Prompts Guarantee Safety? Mitigating Toxicity from LLM Generations through Subspace Intervention
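[Quick read]: This paper addresses the tendency of Large Language Models (LLMs) to generate subtly toxic content even for seemingly harmless prompts, which poses a serious safety risk in real applications. Existing methods struggle to detect such context-dependent toxicity at the token level or at the sentence level, and often trade safety against the coherence of the generated text. The paper proposes a targeted subspace intervention whose key idea is to identify and suppress hidden toxic feature directions in the model's internal representations while preserving its ability to generate safe, fluent text. On the RealToxicityPrompts dataset the method performs strongly against existing baselines, reduces the toxicity of state-of-the-art detoxification systems by 8-20%, has minimal impact on inference complexity, and maintains comparable fluency.
Link: https://arxiv.org/abs/2602.06623
Authors: Himanshu Singh, Ziwei Xu, A. V. Subramanyam, Mohan Kankanhalli
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: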
Abstract:Large Language Models (LLMs) are powerful text generators, yet they can produce toxic or harmful content even when given seemingly harmless prompts. This presents a serious safety challenge and can cause real-world harm. Toxicity is often subtle and context-dependent, making it difficult to detect at the token level or through coarse sentence-level signals. Moreover, efforts to mitigate toxicity often face a trade-off between safety and the coherence, or fluency of the generated text. In this work, we present a targeted subspace intervention strategy for identifying and suppressing hidden toxic patterns from underlying model representations, while preserving overall ability to generate safe fluent content. On the RealToxicityPrompts, our method achieves strong mitigation performance compared to existing baselines, with minimal impact on inference complexity. Across multiple LLMs, our approach reduces toxicity of state-of-the-art detoxification systems by 8-20%, while maintaining comparable fluency. Through extensive quantitative and qualitative analyses, we show that our approach achieves effective toxicity reduction without impairing generative performance, consistently outperforming existing baselines.
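The abstract does not say how the toxic subspace is found or how suppression is applied, so the following is only a generic illustration of what "suppressing a direction in a representation" can mean: removing the component of a hidden vector along one given direction by orthogonal projection. The function name and the single-direction setting are assumptions and should not be read as the paper's method.

```python
def project_out(hidden, direction):
    """Remove from `hidden` its component along `direction`.

    Computes h - ((h . d) / (d . d)) * d, so the result is orthogonal to d.
    `direction` must be non-zero.
    """
    norm_sq = sum(d * d for d in direction)
    coeff = sum(h * d for h, d in zip(hidden, direction)) / norm_sq
    return [h - coeff * d for h, d in zip(hidden, direction)]
```

With several directions, the same operation would be applied against an orthonormal basis of the subspace. A vector that is already orthogonal to the direction is returned unchanged.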
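[NLP-20] Echoes as Anchors: Probabilistic Costs and Attention Refocusing in LLM Reasoning
[Quick read]: This paper addresses test-time compute allocation in Large Reasoning Models (LRMs), that is, how to use a limited compute budget efficiently to improve accuracy on tasks such as mathematical problem solving, code synthesis, and planning. Existing approaches such as scaling self-consistency or parallel thinking can improve performance, but they tend to inject task-agnostic "thinking tokens" or rely on heuristics, and they overlook the spontaneous repetition that appears at the head of a model's internal chain. The key contribution is to identify and exploit the "Echo of Prompt" (EOP), the model's tendency to restate the input question at the start of reasoning, and to treat it as a front-loaded, compute-shaping mechanism. By casting echo removal as rejection-based conditioning, the authors define a computable "Echo Likelihood Gap" (Δℒ) that links early repetition to likelihood gains for the final answer and to downstream accuracy. Two methods follow: Echo-Distilled SFT (ED-SFT), which instills an "echo-then-reason" pattern through supervised fine-tuning, and Echoic Prompting (EP), which re-grounds the model mid-trace without training. Experiments show that EOP increases answer to answer-prefix attention in middle layers, consistent with an attention refocusing mechanism, and gives consistent gains over baselines on several benchmarks.
Link: https://arxiv.org/abs/2602.06600
Authors: Zhuoyuan Hao, Zhuo Li, Wu Li, Fangming Liu, Min Zhang, Jing Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: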
Abstract:Test-time compute allocation in large reasoning models (LRMs) is widely used and has applications in mathematical problem solving, code synthesis, and planning. Recent work has addressed this problem by scaling self-consistency and parallel thinking, adding generic "thinking tokens" and prompting models to re-read the question before answering. Unfortunately, these approaches either inject task-agnostic tokens or mandate heuristics that do not explain -- and often ignore -- the spontaneous repetition that many LRMs exhibit at the head of their internal chains. In contrast, we analyze and harness the model's tendency to restate the question, which we term the Echo of Prompt (EOP), as a front-loaded, compute-shaping mechanism. We formalize its probabilistic cost by casting echo removal as rejection-based conditioning and defining the Echo Likelihood Gap Δℒ as a computable proxy. This provides the missing theoretical link that links early repetition to likelihood gains and downstream accuracy. However, it does not by itself specify how to exploit EOP. Consequently, we develop Echo-Distilled SFT (ED-SFT) to instill an "echo-then-reason" pattern through supervised finetuning, and Echoic Prompting (EP) to re-ground the model mid-trace without training. While promising, quantifying benefits beyond verbosity is non-trivial. Therefore, we conduct length and suffix-controlled likelihood analyses together with layer-wise attention studies, showing that EOP increases answer to answer-prefix attention in middle layers, consistent with an attention refocusing mechanism. We evaluate on GSM8K, MathQA, Hendrycks-MATH, AIME24, and MATH-500 under identical decoding settings and budgets, and find consistent gains over baselines. Code is available at this https URL.
[NLP-21] Personality as Relational Infrastructure: User Perceptions of Personality-Trait-Infused LLM Messaging
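[Quick read]: This paper asks whether, in behavior change systems built on Generative AI, personalized messaging improves user experience through the optimization of individual messages or through the overall perception formed by sustained exposure to personalized content. The core of the approach is to separate trial-level effects from person-level exposure effects, using ordinal multilevel models with a within-between decomposition to analyze how prompting strategies guided by the Big Five Personality Traits (BFPT) affect user ratings. The study finds that personalizing a single message does not significantly improve its evaluation, but participants who received a higher proportion of personality-informed messages rated the messages as more personalized and more appropriate and reported less negative affect. Personalization therefore appears to work mainly through aggregate exposure rather than per-message optimization, which matters for the long-term design and evaluation of adaptive human-AI interaction systems.
Link: https://arxiv.org/abs/2602.06596
Authors: Dominik P. Hofer, David Haag, Rania Islambouli, Jan D. Smeddinck
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Currently under review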
Abstract:Digital behaviour change systems increasingly rely on repeated, system-initiated messages to support users in everyday contexts. LLMs enable these messages to be personalised consistently across interactions, yet it remains unclear whether such personalisation improves individual messages or instead shapes users’ perceptions through patterns of exposure. We explore this question in the context of LLM-generated JITAIs, which are short, context-aware messages delivered at moments deemed appropriate to support behaviour change, using physical activity as an application domain. In a controlled retrospective study, 90 participants evaluated messages generated using four LLM strategies: baseline prompting, few-shot prompting, fine-tuned models, and retrieval augmented generation, each implemented with and without Big Five Personality Traits to produce personality-aligned communication across multiple scenarios. Using ordinal multilevel models with within-between decomposition, we distinguish trial-level effects, whether personality information improves evaluations of individual messages, from person-level exposure effects, whether participants receiving higher proportions of personality-informed messages exhibit systematically different overall perceptions. Results showed no trial-level associations, but participants who received higher proportions of BFPT-informed messages rated the messages as more personalised, appropriate, and reported less negative affect. We use Communication Accommodation Theory for post-hoc analysis. These results suggest that personality-based personalisation in behaviour change systems may operate primarily through aggregate exposure rather than per-message optimisation, with implications for how adaptive systems are designed and evaluated in sustained human-AI interaction. In-situ longitudinal studies are needed to validate these findings in real-world contexts.
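The within-between decomposition mentioned in this abstract splits a trial-level predictor into a person-level mean and a trial-level deviation from that mean. The sketch below shows only this preprocessing step, assuming the predictor is 1 for a personality-informed message and 0 otherwise. The ordinal multilevel model itself is not fitted here, and the function name is hypothetical.

```python
from collections import defaultdict


def within_between(trials):
    """Split a trial-level predictor into between- and within-person parts.

    `trials` is a list of (person_id, x) pairs. Returns a list of
    (person_id, between, within) tuples in the same order, where `between`
    is the person's mean of x and `within` is x minus that mean.
    """
    by_person = defaultdict(list)
    for person_id, x in trials:
        by_person[person_id].append(x)
    means = {pid: sum(xs) / len(xs) for pid, xs in by_person.items()}
    return [(pid, means[pid], x - means[pid]) for pid, x in trials]
```

In this setting the `between` term is the proportion of personality-informed messages a participant received (the person-level exposure effect), and the `within` term carries the trial-level effect.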
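[NLP-22] Inference-Time Rethinking with Latent Thought Vectors for Math Reasoning
[Quick read]: This paper addresses the lack of error correction in standard chain-of-thought reasoning: the model generates its answer token by token in a single forward pass and cannot recover from an early mistake, which limits reasoning quality. The key to the solution is the Inference-Time Rethinking framework, which decouples reasoning into a continuous latent thought vector (what to reason about) and a conditional decoder (how to reason), enabling iterative self-correction. The latent vector compresses the reasoning structure into a continuous representation, which makes gradient-based optimization over reasoning strategies well-posed. At test time, a Gibbs-style procedure alternates between generating a candidate trace and optimizing the latent vector to better explain that trace. This improves mathematical reasoning and lets a small model match or surpass much larger ones.
Link: https://arxiv.org/abs/2602.06584
Authors: Deqian Kong, Minglu Zhao, Aoyang Qin, Bo Pang, Chenxin Tao, David Hartmann, Edouardo Honig, Dehong Xu, Amit Kumar, Matt Sarte, Chuan Li, Jianwen Xie, Ying Nian Wu
Affiliations: UCLA; Lambda Inc; Tsinghua University; Salesforce Research
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: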
Abstract:Standard chain-of-thought reasoning generates a solution in a single forward pass, committing irrevocably to each token and lacking a mechanism to recover from early errors. We introduce Inference-Time Rethinking, a generative framework that enables iterative self-correction by decoupling declarative latent thought vectors from procedural generation. We factorize reasoning into a continuous latent thought vector (what to reason about) and a decoder that verbalizes the trace conditioned on this vector (how to reason). Beyond serving as a declarative buffer, latent thought vectors compress the reasoning structure into a continuous representation that abstracts away surface-level token variability, making gradient-based optimization over reasoning strategies well-posed. Our prior model maps unstructured noise to a learned manifold of valid reasoning patterns, and at test time we employ a Gibbs-style procedure that alternates between generating a candidate trace and optimizing the latent vector to better explain that trace, effectively navigating the latent manifold to refine the reasoning strategy. Training a 0.2B-parameter model from scratch on GSM8K, our method with 30 rethinking iterations surpasses baselines with 10 to 15 times more parameters, including a 3B counterpart. This result demonstrates that effective mathematical reasoning can emerge from sophisticated inference-time computation rather than solely from massive parameter counts.
[NLP-23] Baichuan-M3: Modeling Clinical Inquiry for Reliable Medical Decision-Making
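[Quick read]: This paper addresses the limited clinical decision support that existing medical large language models offer in open-ended settings, in particular their weaknesses in proactive information acquisition, long-horizon reasoning, and hallucination suppression. The key to the solution is a specialized training pipeline that models a physician's systematic workflow and gives the model three core capabilities: (i) proactive information acquisition to resolve ambiguity; (ii) long-horizon reasoning that integrates scattered evidence into a coherent diagnosis; and (iii) adaptive hallucination suppression to ensure factual reliability. Together these shift the paradigm from passive question-answering to clinical-grade decision support.
Link: https://arxiv.org/abs/2602.06570
Authors: Baichuan-M3 Team: Chengfeng Dou, Fan Yang, Fei Li, Jiyuan Jia, Qiang Ju, Shuai Wang, Tianpeng Li, Xiangrong Zeng, Yijie Zhou, Hongda Zhang, Jinyang Tai, Linzhuang Sun, Peidong Guo, Yichuan Mo, Xiaochuan Wang, Hengfu Cui, Zhishou Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: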
Abstract:We introduce Baichuan-M3, a medical-enhanced large language model engineered to shift the paradigm from passive question-answering to active, clinical-grade decision support. Addressing the limitations of existing systems in open-ended consultations, Baichuan-M3 utilizes a specialized training pipeline to model the systematic workflow of a physician. Key capabilities include: (i) proactive information acquisition to resolve ambiguity; (ii) long-horizon reasoning that unifies scattered evidence into coherent diagnoses; and (iii) adaptive hallucination suppression to ensure factual reliability. Empirical evaluations demonstrate that Baichuan-M3 achieves state-of-the-art results on HealthBench, the newly introduced HealthBench-Hallu and ScanBench, significantly outperforming GPT-5.2 in clinical inquiry, advisory and safety. The models are publicly available at this https URL.
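[NLP-24] SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs
[Quick read]: This paper addresses the brittleness of test-time scaling in Vision-Language Models (VLMs): because perception and reasoning are entangled during inference, contexts become long and disorganized, small perceptual errors propagate, and expensive reinforcement learning is needed to reach good performance. The key to the solution is the SPARC (Separating Perception And Reasoning Circuits) framework, which explicitly decouples visual perception from reasoning in a two-stage pipeline. The first stage performs explicit visual search to localize the image regions relevant to the question, and the second stage conditions its reasoning on those regions to produce the final answer. This structure allows asymmetric allocation of compute between the stages, selective optimization of a single stage, and efficient processing with compressed contexts, and it improves accuracy and robustness on challenging visual reasoning tasks.
Link: https://arxiv.org/abs/2602.06566
Authors: Niccolo Avogaro, Nayanika Debnath, Li Mi, Thomas Frick, Junling Wang, Zexue He, Hang Hua, Konrad Schindler, Mattia Rigotti
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: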
Abstract:Despite recent successes, test-time scaling - i.e., dynamically expanding the token budget during inference as needed - remains brittle for vision-language models (VLMs): unstructured chains-of-thought about images entangle perception and reasoning, leading to long, disorganized contexts where small perceptual mistakes may cascade into completely wrong answers. Moreover, expensive reinforcement learning with hand-crafted rewards is required to achieve good performance. Here, we introduce SPARC (Separating Perception And Reasoning Circuits), a modular framework that explicitly decouples visual perception from reasoning. Inspired by sequential sensory-to-cognitive processing in the brain, SPARC implements a two-stage pipeline where the model first performs explicit visual search to localize question-relevant regions, then conditions its reasoning on those regions to produce the final answer. This separation enables independent test-time scaling with asymmetric compute allocation (e.g., prioritizing perceptual processing under distribution shift), supports selective optimization (e.g., improving the perceptual stage alone when it is the bottleneck for end-to-end performance), and accommodates compressed contexts by running global search at lower image resolutions and allocating high-resolution processing only to selected regions, thereby reducing total visual tokens count and compute. Across challenging visual reasoning benchmarks, SPARC outperforms monolithic baselines and strong visual-grounding approaches. For instance, SPARC improves the accuracy of Qwen3VL-4B on the V* VQA benchmark by 6.7 percentage points, and it surpasses “thinking with images” by 4.6 points on a challenging OOD task despite requiring a 200× lower token budget.
[NLP-25] Malicious Agent Skills in the Wild: A Large-Scale Security Empirical Study
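【Quick Read】: This paper addresses the security threat posed by third-party agent skills in the generative AI ecosystem, in particular the risk that skills distributed with minimal vetting contain malicious behavior. The key to the solution is the first labeled dataset built through behavioral verification: by analyzing 98,380 skills from two community registries, the authors confirm 157 malicious skills with 632 vulnerabilities and characterize their attack patterns. The skills fall into two archetypes, "Data Thieves" and "Agent Hijackers"; advanced attacks consistently include undocumented "shadow features", and some go further by exploiting the AI platform's own hook system and permission flags. The study provides a reproducible dataset and analysis pipeline for future work on agent skill security.
Link: https://arxiv.org/abs/2602.06547
Authors: Yi Liu, Zhihao Chen, Yanjun Zhang, Gelei Deng, Yuekang Li, Jianting Ning, Leo Yu Zhang
Affiliations: Quantstamp; Fujian Normal University; Griffith University; Nanyang Technological University; University of New South Wales; Zhejiang Sci-Tech University
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
Comments: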
Abstract:Third-party agent skills extend LLM-based agents with instruction files and executable code that run on users’ machines. Skills execute with user privileges and are distributed through community registries with minimal vetting, but no ground-truth dataset exists to characterize the resulting threats. We construct the first labeled dataset of malicious agent skills by behaviorally verifying 98,380 skills from two community registries, confirming 157 malicious skills with 632 vulnerabilities. These attacks are not incidental. Malicious skills average 4.03 vulnerabilities across a median of three kill chain phases, and the ecosystem has split into two archetypes: Data Thieves that exfiltrate credentials through supply chain techniques, and Agent Hijackers that subvert agent decision-making through instruction manipulation. A single actor accounts for 54.1% of confirmed cases through templated brand impersonation. Shadow features, capabilities absent from public documentation, appear in 0% of basic attacks but 100% of advanced ones; several skills go further by exploiting the AI platform’s own hook system and permission flags. Responsible disclosure led to 93.6% removal within 30 days. We release the dataset and analysis pipeline to support future work on agent skill security.
[NLP-26] MTQE.en-he: Machine Translation Quality Estimation for English-Hebrew EACL2026
【Quick Read】: This paper addresses the lack of a public benchmark dataset for English-Hebrew Machine Translation Quality Estimation (MTQE). Its key contribution is MTQE.en-he, the first publicly available MTQE benchmark for this language pair, containing 959 English segments from WMT24++, each paired with a Hebrew machine translation and Direct Assessment scores from three human annotators. On this benchmark the authors compare ChatGPT prompting, TransQuest, and CometKiwi, and show that an ensemble of the three outperforms the best single model (CometKiwi) by 6.4 percentage points Pearson and 5.6 percentage points Spearman. They also find that parameter-efficient fine-tuning methods (LoRA, BitFit, and FTHead, i.e., fine-tuning only the classification head) train stably and improve results by 2-3 percentage points, whereas full-model fine-tuning is prone to overfitting and distribution collapse.
Link: https://arxiv.org/abs/2602.06546
Authors: Andy Rosenbaum, Assaf Siani, Ilan Kernerman
Affiliations: Lexicala
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to LoResLM at EACL 2026
Abstract:We release MTQE.en-he: to our knowledge, the first publicly available English-Hebrew benchmark for Machine Translation Quality Estimation. MTQE.en-he contains 959 English segments from WMT24++, each paired with a machine translation into Hebrew, and Direct Assessment scores of the translation quality annotated by three human experts. We benchmark ChatGPT prompting, TransQuest, and CometKiwi and show that ensembling the three models outperforms the best single model (CometKiwi) by 6.4 percentage points Pearson and 5.6 percentage points Spearman. Fine-tuning experiments with TransQuest and CometKiwi reveal that full-model updates are sensitive to overfitting and distribution collapse, yet parameter-efficient methods (LoRA, BitFit, and FTHead, i.e., fine-tuning only the classification head) train stably and yield improvements of 2-3 percentage points. MTQE.en-he and our experimental results enable future research on this under-resourced language pair.
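The abstract reports an ensemble of three quality-estimation systems scored by Pearson correlation, but does not say how the systems are combined. A minimal sketch, assuming the simplest option (a mean of z-normalized scores) and using hypothetical helper names `zscore`, `pearson`, and `ensemble`:

```python
from statistics import mean, pstdev


def zscore(scores):
    """Rescale one system's scores to zero mean and unit variance."""
    m, s = mean(scores), pstdev(scores)
    return [(x - m) / s for x in scores]


def pearson(xs, ys):
    """Pearson correlation between predictions and human scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def ensemble(*systems):
    """Assumed combination rule: average the z-normalized scores per segment."""
    normed = [zscore(s) for s in systems]
    return [mean(column) for column in zip(*normed)]
```

Z-normalizing first keeps a system with a wider score range from dominating the average; the paper may well use a different scheme.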
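[NLP-27] AgentCPM-Report: Interleaving Drafting and Deepening for Open-Ended Deep Research
【Quick Read】: This paper targets two core challenges in generating deep research reports with current generative AI. First, the task requires large-scale information acquisition and insight-driven analysis, which strains existing language models. Second, existing methods mostly follow a plan-then-write paradigm whose performance depends heavily on the quality of the initial outline; because building a good outline itself demands strong reasoning, these systems rely on closed-source or online large models, which creates deployment barriers and risks to user data security and privacy. The key to the solution is AgentCPM-Report, whose core innovation is the Writing As Reasoning Policy (WARP): the model revises its outline dynamically during report generation, alternating between Evidence-Based Drafting and Reasoning-Driven Deepening so that information acquisition, knowledge refinement, and outline evolution proceed together. Combined with a Multi-Stage Agentic Training strategy, this lets a small local model (8B parameters) outperform leading closed-source systems.
Link: https://arxiv.org/abs/2602.06540
Authors: Yishan Li, Wentong Chen, Yukun Yan, Mingwei Li, Sen Mei, Xiaorong Wang, Kunpeng Liu, Xin Cong, Shuo Wang, Zhong Zhang, Yaxi Lu, Zhenghao Liu, Yankai Lin, Zhiyuan Liu, Maosong Sun
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: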
Abstract:Generating deep research reports requires large-scale information acquisition and the synthesis of insight-driven analysis, posing a significant challenge for current language models. Most existing approaches follow a plan-then-write paradigm, whose performance heavily depends on the quality of the initial outline. However, constructing a comprehensive outline itself demands strong reasoning ability, causing current deep research systems to rely almost exclusively on closed-source or online large models. This reliance raises practical barriers to deployment and introduces safety and privacy concerns for user-authored data. In this work, we present AgentCPM-Report, a lightweight yet high-performing local solution composed of a framework that mirrors the human writing process and an 8B-parameter deep research agent. Our framework uses a Writing As Reasoning Policy (WARP), which enables models to dynamically revise outlines during report generation. Under this policy, the agent alternates between Evidence-Based Drafting and Reasoning-Driven Deepening, jointly supporting information acquisition, knowledge refinement, and iterative outline evolution. To effectively equip small models with this capability, we introduce a Multi-Stage Agentic Training strategy, consisting of cold-start, atomic skill RL, and holistic pipeline RL. Experiments on DeepResearch Bench, DeepConsult, and DeepResearch Gym demonstrate that AgentCPM-Report outperforms leading closed-source systems, with substantial gains in Insight.
[NLP-28] LogicSkills: A Structured Benchmark for Formal Reasoning in Large Language Models
【Quick Read】: Large Language Models (LLMs) score well on logical reasoning tasks, but it remains unclear which core logical skills they actually master. To clarify this, the authors propose LogicSkills, a unified benchmark that decomposes formal reasoning into three separable skills: (i) formal symbolization, translating premises into first-order logic; (ii) countermodel construction, building a finite structure in which all premises are true and the conclusion is false; and (iii) validity assessment, deciding whether the conclusion follows from the premises. The benchmark is drawn from the two-variable fragment of first-order logic without identity, is presented in both natural English and a Carroll-style language with nonce words, and has every item checked for correctness and non-triviality with the SMT solver Z3. Experiments show that leading models do well on validity assessment but lag substantially on symbolization and countermodel construction, suggesting reliance on surface patterns rather than genuine symbolic or rule-based reasoning.
Link: https://arxiv.org/abs/2602.06533
Authors: Brian Rabern, Philipp Mondorf, Barbara Plank
Affiliations: Niche, Inc.; Oregon State University – Cascades; MaiNLP, LMU Munich; Munich Center for Machine Learning (MCML)
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 13 pages, 5 figures
Abstract:Large language models have demonstrated notable performance across various logical reasoning benchmarks. However, it remains unclear which core logical skills they truly master. To address this, we introduce LogicSkills, a unified benchmark designed to isolate three fundamental skills in formal reasoning: (i) formal symbolization: translating premises into first-order logic; (ii) countermodel construction: formulating a finite structure in which all premises are true while the conclusion is false; and (iii) validity assessment: deciding whether a conclusion follows from a given set of premises. Items are drawn from the two-variable fragment of first-order logic (without identity) and are presented in both natural English and a Carroll-style language with nonce words. All examples are verified for correctness and non-triviality using the SMT solver Z3. Across leading models, performance is high on validity but substantially lower on symbolization and countermodel construction, suggesting reliance on surface-level patterns rather than genuine symbolic or rule-based reasoning.
[NLP-29] Completing Missing Annotation: Multi-Agent Debate for Accurate and Scalable Relevant Assessment for IR Benchmarks ICLR2026
【Quick Read】: This paper addresses evaluation bias in Information Retrieval (IR) caused by incomplete benchmark datasets that are missing labels for relevant chunks, together with two weaknesses of existing LLM and LLM-human hybrid annotation strategies: LLM overconfidence and ineffective AI-to-human escalation. The key to the solution is DREAM, a multi-round debate-based relevance assessment framework in which LLM agents start from opposing stances and critique each other iteratively. Agreement-based debate yields more accurate labels for clear-cut cases and more reliable escalation of uncertain ones to humans, reaching 95.2% labeling accuracy with only 3.5% human involvement. Using DREAM, the authors build the BRIDGE benchmark, which uncovers 29,824 missing relevant chunks and enables fairer comparison of retrievers and Retrieval-Augmented Generation (RAG) systems.
Link: https://arxiv.org/abs/2602.06526
Authors: Minjeong Ban, Jeonghwan Choi, Hyangsuk Min, Nicole Hee-Yeon Kim, Minseok Kim, Jae-Gil Lee, Hwanjun Song
Affiliations: Korea Advanced Institute of Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at ICLR 2026
Abstract:Information retrieval (IR) evaluation remains challenging due to incomplete IR benchmark datasets that contain unlabeled relevant chunks. While LLMs and LLM-human hybrid strategies reduce costly human effort, they remain prone to LLM overconfidence and ineffective AI-to-human escalation. To address this, we propose DREAM, a multi-round debate-based relevance assessment framework with LLM agents, built on opposing initial stances and iterative reciprocal critique. Through our agreement-based debate, it yields more accurate labeling for certain cases and more reliable AI-to-human escalation for uncertain ones, achieving 95.2% labeling accuracy with only 3.5% human involvement. Using DREAM, we build BRIDGE, a refined benchmark that mitigates evaluation bias and enables fairer retriever comparison by uncovering 29,824 missing relevant chunks. We then re-benchmark IR systems and extend evaluation to RAG, showing that unaddressed holes not only distort retriever rankings but also drive retrieval-generation misalignment. The relevance assessment framework is available at https://github.com/DISL-Lab/DREAM-ICLR-26; and the BRIDGE dataset is available at this https URL.
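The control flow the abstract describes (opposing initial stances, reciprocal critique, label on agreement, escalate otherwise) can be sketched roughly as follows. This is an illustration only, not the authors' implementation: `debate`, the agent call signature, `max_rounds`, and the two toy agents are all hypothetical, and a real agent would wrap an LLM call.

```python
def debate(query, chunk, agent_a, agent_b, max_rounds=3):
    """Run an agreement-based debate; escalate to a human if no consensus."""
    # Opposing initial stances: A argues "relevant", B argues "not relevant".
    stance_a, stance_b = True, False
    for round_no in range(1, max_rounds + 1):
        # Each agent sees the other's previous stance and may revise its own.
        new_a = agent_a(query, chunk, own=stance_a, other=stance_b)
        new_b = agent_b(query, chunk, own=stance_b, other=stance_a)
        stance_a, stance_b = new_a, new_b
        if stance_a == stance_b:
            return {"label": stance_a, "escalate": False, "rounds": round_no}
    # No agreement within the budget: treat as uncertain and ask a human.
    return {"label": None, "escalate": True, "rounds": max_rounds}


def stubborn(query, chunk, own, other):
    """Toy agent that never changes its stance."""
    return own


def persuadable(query, chunk, own, other):
    """Toy agent that always adopts the opponent's stance."""
    return other
```

With one `stubborn` and one `persuadable` agent the debate converges in the first round; with two `stubborn` agents it never converges and the item is escalated.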
[NLP-30] Designing Computational Tools for Exploring Causal Relationships in Qualitative Data
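【Quick Read】: This paper addresses the limited support for causal relationships in current computational tools for qualitative data analysis: existing systems either consider context inadequately, lack credibility, or produce overly complex outputs, which makes it hard for researchers in human-computer interaction (HCI) and social science to understand user needs and build theory. The key to the solution is QualCausal, a system that extracts and illustrates causal relationships in qualitative data through interactive causal network construction and multi-view visualization, reducing the analytical burden and providing cognitive scaffolding.
Link: https://arxiv.org/abs/2602.06506
Authors: Han Meng, Qiuyuan Lyu, Peinuan Qin, Yitian Yang, Renwen Zhang, Wen-Chieh Lin, Yi-Chieh Lee
Affiliations: National University of Singapore; Nanyang Technological University; National Yang Ming Chiao Tung University
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 19 pages, 5 figures, conditionally accepted by CHI26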
Abstract:Exploring causal relationships for qualitative data analysis in HCI and social science research enables the understanding of user needs and theory building. However, current computational tools primarily characterize and categorize qualitative data; the few systems that analyze causal relationships either inadequately consider context, lack credibility, or produce overly complex outputs. We first conducted a formative study with 15 participants interested in using computational tools for exploring causal relationships in qualitative data to understand their needs and derive design guidelines. Based on these findings, we designed and implemented QualCausal, a system that extracts and illustrates causal relationships through interactive causal network construction and multi-view visualization. A feedback study (n = 15) revealed that participants valued our system for reducing the analytical burden and providing cognitive scaffolding, yet navigated how such systems fit within their established research paradigms, practices, and habits. We discuss broader implications for designing computational tools that support qualitative data analysis.
[NLP-31] Revisiting the Shape Convention of Transformer Language Models
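【Quick Read】: This paper questions the narrow-wide-narrow feed-forward network (FFN) shape used almost universally in dense Transformer language models, asking whether it is the best use of parameters and approximation capacity. The starting point is recent evidence that residual wide-narrow-wide (hourglass) MLPs approximate functions better, which makes them a candidate replacement for the conventional FFN. The key to the solution is a deeper hourglass-shaped FFN: a stack of hourglass sub-MLPs connected by residual pathways, which under a fixed parameter budget spends fewer parameters on the FFN and more on attention. Experiments show that hourglass FFNs outperform conventional FFNs up to 400M parameters and remain comparable at larger scales up to 1B, prompting a rethink of how FFN shape affects the balance between efficiency and expressiveness in modern language models.
Link: https://arxiv.org/abs/2602.06471
Authors: Feng-Ting Liao, Meng-Hsi Chen, Guan-Ting Yi, Da-shan Shiu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: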
Abstract:Dense Transformer language models have largely adhered to one consistent architectural shape: each layer consists of an attention module followed by a feed-forward network (FFN) with a narrow-wide-narrow MLP, allocating most parameters to the MLP at expansion ratios between 2 and 4. Motivated by recent results that residual wide-narrow-wide (hourglass) MLPs offer superior function approximation capabilities, we revisit the long-standing MLP shape convention in Transformer, challenging the necessity of the narrow-wide-narrow design. To study this, we develop a Transformer variant that replaces the conventional FFN with a deeper hourglass-shaped FFN, comprising a stack of hourglass sub-MLPs connected by residual pathways. We posit that a deeper but lighter hourglass FFN can serve as a competitive alternative to the conventional FFN, and that parameters saved by using a lighter hourglass FFN can be more effectively utilized, such as by enlarging model hidden dimensions under fixed budgets. We confirm these through empirical validations across model scales: hourglass FFNs outperform conventional FFNs up to 400M and achieve comparable performance at larger scales to 1B parameters; hourglass FFN variants with reduced FFN and increased attention parameters show consistent improvements over conventional configurations at matched budgets. Together, these findings shed new light on recent work and prompt a rethinking of the narrow-wide-narrow MLP convention and the balance between attention and FFN towards efficient and expressive modern language models.
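To make the "deeper but lighter" claim concrete, here is a back-of-the-envelope parameter count. It assumes plain two-matrix MLPs with no biases and no gating, and the width, bottleneck, and depth values are illustrative choices, not configurations from the paper.

```python
def conventional_ffn_params(d_model, expansion):
    """Narrow-wide-narrow: d -> expansion*d -> d (two weight matrices)."""
    return 2 * d_model * (expansion * d_model)


def hourglass_ffn_params(d_model, bottleneck, depth):
    """Stack of `depth` wide-narrow-wide sub-MLPs: d -> bottleneck -> d."""
    return depth * 2 * d_model * bottleneck


# Illustrative numbers only (assumed, not taken from the paper).
conventional = conventional_ffn_params(d_model=1024, expansion=4)
hourglass = hourglass_ffn_params(d_model=1024, bottleneck=256, depth=4)
saved = conventional - hourglass  # budget that could go to attention or width
```

Under these assumptions, four stacked sub-MLPs with a 4x bottleneck use a quarter of the parameters of one conventional FFN with expansion ratio 4, which is the kind of saving the abstract proposes reallocating.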
[NLP-32] Improve Large Language Model Systems with User Logs
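【Quick Read】: This paper addresses the performance bottleneck that deployed Large Language Models (LLMs) face as high-quality training data becomes scarce and compute costs rise, and specifically how to learn continually from the unstructured, noisy feedback in real user interaction logs. The core challenges are separating useful feedback from noisy behavior and handling the off-policy gap between log collection and model optimization. The key to the solution is UNO (User log-driveN Optimization): it first distills raw logs into semi-structured rules and preference pairs, then applies query-and-feedback-driven clustering to handle data heterogeneity, and finally quantifies the cognitive gap between the model's prior knowledge and the log data. That gap estimate is used to filter noisy feedback adaptively and to build separate modules for primary and reflective experiences, improving future responses.
Link: https://arxiv.org/abs/2602.06470
Authors: Changyue Wang, Weihang Su, Qingyao Ai, Yiqun Liu
Affiliations: Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: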
Abstract:Scaling training data and model parameters has long driven progress in large language models (LLMs), but this paradigm is increasingly constrained by the scarcity of high-quality data and diminishing returns from rising computational costs. As a result, recent work is increasing the focus on continual learning from real-world deployment, where user interaction logs provide a rich source of authentic human feedback and procedural knowledge. However, learning from user logs is challenging due to their unstructured and noisy nature. Vanilla LLM systems often struggle to distinguish useful feedback signals from noisy user behavior, and the disparity between user log collection and model optimization (e.g., the off-policy optimization problem) further strengthens the problem. To this end, we propose UNO (User log-driveN Optimization), a unified framework for improving LLM systems (LLMsys) with user logs. UNO first distills logs into semi-structured rules and preference pairs, then employs query-and-feedback-driven clustering to manage data heterogeneity, and finally quantifies the cognitive gap between the model’s prior knowledge and the log data. This assessment guides the LLMsys to adaptively filter out noisy feedback and construct different modules for primary and reflective experiences extracted from user logs, thereby improving future responses. Extensive experiments show that UNO achieves state-of-the-art effectiveness and efficiency, significantly outperforming Retrieval Augmented Generation (RAG) and memory-based baselines. We have open-sourced our code at this https URL .
[NLP-33] Diffusion-State Policy Optimization for Masked Diffusion Language Models
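【Quick Read】: Masked diffusion language models generate over multiple denoising steps, so learning only from a terminal reward on the final completion gives coarse credit assignment to intermediate decisions. The key to the solution is DiSPO (Diffusion-State Policy Optimization), a plug-in credit-assignment layer. At selected intermediate masked states it branches by resampling fillings for the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens, with no additional multi-step diffusion rollouts. The method formalizes a fixed-state objective for branched completions and derives a policy-gradient estimator that can be combined with terminal-feedback policy optimization on the same rollouts, giving finer-grained optimization of intermediate decisions.
Link: https://arxiv.org/abs/2602.06462
Authors: Daisuke Oba, Hiroki Furuta, Naoaki Okazaki
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: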
Abstract:Masked diffusion language models generate by iteratively filling masked tokens over multiple denoising steps, so learning only from a terminal reward on the final completion yields coarse credit assignment over intermediate decisions. We propose DiSPO (Diffusion-State Policy Optimization), a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling fillings for the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens – without additional multi-step diffusion rollouts. We formalize a fixed-state objective for branched completions and derive a policy-gradient estimator that can be combined with terminal-feedback policy optimization using the same rollouts. On LLaDA-8B-Instruct, DiSPO consistently improves over the terminal-feedback diffu-GRPO baseline on math and planning benchmarks under matched rollout compute and optimizer steps. Our code will be available at this https URL .
[NLP-34] RelayGen: Intra-Generation Model Switching for Efficient Reasoning
【Quick Read】: This paper addresses the high deployment cost of Large Reasoning Models (LRMs), which generate long, multi-step reasoning trajectories at inference time. Existing efficiency methods either ignore how difficulty varies within a single output or rely on supervised token-level routing, which adds system complexity. The key to the solution is RelayGen, a training-free, segment-level runtime model-switching framework. Offline analysis of generation uncertainty, measured with token probability margins, identifies model-specific switch cues that signal a transition to a lower-difficulty segment; RelayGen then hands those segments to a smaller model while keeping high-difficulty reasoning on the large model. Across multiple reasoning benchmarks the method substantially reduces inference latency while preserving most of the large model's accuracy.
Link: https://arxiv.org/abs/2602.06454
Authors: Jiwon Song, Yoongon Kim, Jae-Joon Kim
Affiliations: Seoul National University
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Large reasoning models (LRMs) achieve strong performance on complex reasoning tasks by generating long, multi-step reasoning trajectories, but inference-time scaling incurs substantial deployment cost. A key challenge is that generation difficulty varies within a single output, whereas existing efficiency-oriented approaches either ignore this intra-generation variation or rely on supervised token-level routing with high system complexity. We present RelayGen, a training-free, segment-level runtime model switching framework that exploits difficulty variation in long-form reasoning. Through offline analysis of generation uncertainty using token probability margins, we show that coarse-grained segment-level control is sufficient to capture difficulty transitions within a reasoning trajectory. RelayGen identifies model-specific switch cues that signal transitions to lower-difficulty segments and dynamically delegates their continuation to a smaller model, while preserving high-difficulty reasoning on the large model. Across multiple reasoning benchmarks, RelayGen substantially reduces inference latency while preserving most of the accuracy of large models. When combined with speculative decoding, RelayGen achieves up to 2.2× end-to-end speedup with less than 2% accuracy degradation, without requiring additional training or learned routing components.
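The uncertainty signal named in the abstract, the token probability margin, is commonly defined as the gap between the two most likely next tokens. A minimal sketch under that assumption; `top2_margin` and `segment_margin` are hypothetical helper names, and averaging per segment is an illustrative choice rather than the paper's procedure.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def top2_margin(logits):
    """Probability gap between the best and second-best token."""
    probs = sorted(softmax(logits), reverse=True)
    return probs[0] - probs[1]


def segment_margin(token_logits):
    """Mean margin over a segment; higher suggests an easier segment."""
    margins = [top2_margin(logits) for logits in token_logits]
    return sum(margins) / len(margins)
```

A margin near 1 means the model is nearly certain about the next token, while a margin near 0 means two candidates are tied.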
[NLP-35] Evaluating an evidence-guided reinforcement learning framework in aligning light-parameter large language models with decision-making cognition in psychiatric clinical reasoning
【Quick Read】: This paper addresses the performance bottleneck of light-parameter LLMs in psychiatric clinical decision support, where hallucinations and superficial reasoning reflect a fundamental misalignment with professional diagnostic cognition. The key to the solution is ClinMPO, a reinforcement learning framework that explicitly aligns the model's internal reasoning using a reward model built from 4,474 psychiatry journal articles and structured according to evidence-based medicine principles. On a set of complex cases, the ClinMPO-tuned Qwen3-8B model reached 31.4% diagnostic accuracy, above the 30.8% benchmark set by medical students.
Link: https://arxiv.org/abs/2602.06449
Authors: Xinxin Lin, Guangxin Dai, Yi Zhong, Xiang Li, Xue Xiao, Yixin Zhang, Zhengdong Wu, Yongbo Zheng, Runchuan Zhu, Ming Zhao, Huizi Yu, Shuo Wu, Jun Zhao, Lingming Hu, Yumei Wang, Ping Yin, Joey W.Y. Chan, Ngan Yin Chan, Sijing Chen, Yun Kwok Wing, Lin Lu, Xin Ma, Lizhou Fan
Affiliations: The Chinese University of Hong Kong; Shandong University; Peking University Sixth Hospital; Inspur Cloud Information Technology Co., Ltd.; Fudan University; Shandong Provincial Hospital Affiliated to Shandong First Medical University; Jilin University; Hong Kong ICI Cloud Service Limited; Shandong First Medical University
Categories: Computation and Language (cs.CL)
Comments: 21 pages, 8 figures
Abstract:Large language models (LLMs) hold transformative potential for medical decision support yet their application in psychiatry remains constrained by hallucinations and superficial reasoning. This limitation is particularly acute in light-parameter LLMs which are essential for privacy-preserving and efficient clinical deployment. Existing training paradigms prioritize linguistic fluency over structured clinical logic and result in a fundamental misalignment with professional diagnostic cognition. Here we introduce ClinMPO, a reinforcement learning framework designed to align the internal reasoning of LLMs with professional psychiatric practice. The framework employs a specialized reward model trained independently on a dataset derived from 4,474 psychiatry journal articles and structured according to evidence-based medicine principles. We evaluated ClinMPO on an unseen subset of the benchmark designed to isolate reasoning capabilities from rote memorization. This test set comprises items where leading large-parameter LLMs consistently fail. We compared the ClinMPO-aligned light LLM performance against a cohort of 300 medical students. The ClinMPO-tuned Qwen3-8B model achieved a diagnostic accuracy of 31.4% and surpassed the human benchmark of 30.8% on these complex cases. These results demonstrate that medical evidence-guided optimization enables light-parameter LLMs to master complex reasoning tasks. Our findings suggest that explicit cognitive alignment offers a scalable pathway to reliable and safe psychiatric decision support.
[NLP-36] CORE: Comprehensive Ontological Relation Evaluation for Large Language Models
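【Quick Read】: Current evaluations of Large Language Models (LLMs) rarely test whether a model can tell meaningful semantic relations apart from genuine unrelatedness. The key to the solution is CORE (Comprehensive Ontological Relation Evaluation), a dataset of 225K multiple-choice questions spanning 74 disciplines, together with a rigorously validated general-domain benchmark of 203 questions (Cohen's Kappa = 1.0) that covers 24 semantic relation types with equal representation of unrelated pairs. Experiments show that LLMs are near ceiling on related pairs (86.5–100% accuracy) but fall to 0–41.35% on unrelated pairs, with Expected Calibration Error rising 2-4x and a mean semantic collapse rate of 37.6%. These results expose a systematic weakness in handling true unrelatedness and establish unrelatedness reasoning as a critical frontier for LLM evaluation and safety.
Link: https://arxiv.org/abs/2602.06446
Authors: Satyam Dwivedi, Sanjukta Ghosh, Shivam Dwivedi, Nishi Kumari, Anil Thakur, Anurag Purushottam, Deepak Alok, Praveen Gatla, Manjuprasad B, Bipasha Patgiri
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: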
Abstract:Large Language Models (LLMs) perform well on many reasoning benchmarks, yet existing evaluations rarely assess their ability to distinguish between meaningful semantic relations and genuine unrelatedness. We introduce CORE (Comprehensive Ontological Relation Evaluation), a dataset of 225K multiple-choice questions spanning 74 disciplines, together with a general-domain open-source benchmark of 203 rigorously validated questions (Cohen’s Kappa = 1.0) covering 24 semantic relation types with equal representation of unrelated pairs. A human baseline from 1,000+ participants achieves 92.6% accuracy (95.1% on unrelated pairs). In contrast, 29 state-of-the-art LLMs achieve 48.25-70.9% overall accuracy, with near-ceiling performance on related pairs (86.5-100%) but severe degradation on unrelated pairs (0-41.35%), despite assigning similar confidence (92-94%). Expected Calibration Error increases 2-4x on unrelated pairs, and a mean semantic collapse rate of 37.6% indicates systematic generation of spurious relations. On the CORE 225K MCQs dataset, accuracy further drops to approximately 2%, highlighting substantial challenges in domain-specific semantic reasoning. We identify unrelatedness reasoning as a critical, under-evaluated frontier for LLM evaluation and safety.
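The calibration gap reported above (confidence of 92-94% alongside low accuracy on unrelated pairs) is what Expected Calibration Error measures. A minimal sketch of the standard equal-width-bin estimator; the bin count and the toy inputs are assumptions, and the paper's exact binning is not stated in the abstract.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |confidence - accuracy| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 belongs in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Toy case (made-up values): high confidence, only one of four answers right.
toy_ece = expected_calibration_error([0.93] * 4, [True, False, False, False])
```

In the toy case every prediction lands in one bin with mean confidence 0.93 and accuracy 0.25, so the estimate is simply the gap between the two.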
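[NLP-37] TrailBlazer: History-Guided Reinforcement Learning for Black-Box LLM Jailbreaking
【Quick Read】: This paper addresses a key weakness of existing jailbreaking methods used in safety evaluation of Large Language Models (LLMs): they fail to make effective use of vulnerability signals revealed in earlier interaction turns, which makes attacks inefficient and unstable. The key to the solution is a history-aware jailbreak framework based on Reinforcement Learning (RL), whose core innovation is an attention-based mechanism that reweights vulnerability signals from prior steps, improving success rate and query efficiency. Experiments on AdvBench and HarmBench report state-of-the-art performance, supporting the value of historical information in RL-driven adversarial strategies.
Link: https://arxiv.org/abs/2602.06440
Authors: Sung-Hoon Yoon, Ruizhi Qian, Minda Zhao, Weiyue Li, Mengyu Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: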
Abstract:Large Language Models (LLMs) have become integral to many domains, making their safety a critical priority. Prior jailbreaking research has explored diverse approaches, including prompt optimization, automated red teaming, obfuscation, and reinforcement learning (RL) based methods. However, most existing techniques fail to effectively leverage vulnerabilities revealed in earlier interaction turns, resulting in inefficient and unstable attacks. Since jailbreaking involves sequential interactions in which each response influences future actions, reinforcement learning provides a natural framework for this problem. Motivated by this, we propose a history-aware RL-based jailbreak framework that analyzes and reweights vulnerability signals from prior steps to guide future decisions. We show that incorporating historical information alone improves jailbreak success rates. Building on this insight, we introduce an attention-based reweighting mechanism that highlights critical vulnerabilities within the interaction history, enabling more efficient exploration with fewer queries. Extensive experiments on AdvBench and HarmBench demonstrate that our method achieves state-of-the-art jailbreak performance while significantly improving query efficiency. These results underscore the importance of historical vulnerability signals in reinforcement learning-driven jailbreak strategies and offer a principled pathway for advancing adversarial research on LLM safeguards.
[NLP-38] Investigating the structure of emotions by analyzing similarity and association of emotion words
【Quick Read】: The validity of Plutchik's wheel of emotion, a model widely used in natural language processing, has not been sufficiently examined, particularly with respect to the structure of semantic networks of emotion words. To test it, the study builds semantic networks from collected similarity and association data for pairs of emotion words, analyzes their structure with community detection, and compares the result with the structure of the wheel. The networks are largely consistent with the wheel of emotion but differ locally, providing empirical evidence about where the model applies.
Link: https://arxiv.org/abs/2602.06430
Authors: Fumitaka Iwaki, Tatsuji Takahashi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 figures, 8 tables
Abstract:In the field of natural language processing, some studies have attempted sentiment analysis on text by handling emotions as explanatory or response variables. One of the most popular emotion models used in this context is the wheel of emotion proposed by Plutchik. This model schematizes human emotions in a circular structure, and represents them in two or three dimensions. However, the validity of Plutchik’s wheel of emotion has not been sufficiently examined. This study investigated the validity of the wheel by creating and analyzing semantic networks of emotion words. Through our experiments, we collected data of similarity and association of ordered pairs of emotion words, and constructed networks using these data. We then analyzed the structure of the networks through community detection, and compared it with that of the wheel of emotion. The results showed that each network’s structure was, for the most part, similar to that of the wheel of emotion, but locally different.
[NLP-39] On the Wings of Imagination: Conflicting Script-based Multi-role Framework for Humor Caption Generation ICLR2026
【Quick Read】: This paper addresses two core challenges for generative AI in multi-modal humor generation: existing LLM-based methods rely on reasoning chains or self-improvement, which limits creativity and interpretability; and generating image captions (for example, cartoon captions) that match human humor perception requires combining visual understanding, humor reasoning, and creative imagination. The key to the solution is HOMER, a humor-theory-driven multi-role LLM collaboration framework augmented with humor retrieval and grounded in a fundamental humor theory, GTVH (the General Theory of Verbal Humor). It divides the work among three roles: (1) a conflicting-script extractor that identifies key script oppositions as the basis for humor; (2) a retrieval-augmented hierarchical imaginator that expands the creative space around humor targets using imagination trees; and (3) a caption generator that produces funny and diverse captions from the resulting knowledge. Experiments on two New Yorker Cartoon benchmark datasets show that the method outperforms state-of-the-art baselines and strong LLM reasoning strategies.
Link: https://arxiv.org/abs/2602.06423
Authors: Wenbo Shang, Yuxi Sun, Jing Ma, Xin Huang
Affiliations: Hong Kong Baptist University
Categories: Computation and Language (cs.CL)
Comments: Paper accepted as a conference paper at ICLR 2026
Abstract:Humor is a commonly used and intricate human language in daily life. Humor generation, especially in multi-modal scenarios, is a challenging task for large language models (LLMs); it is typically framed as funny caption generation for images and requires visual understanding, humor reasoning, creative imagination, and so on. Existing LLM-based approaches rely on reasoning chains or self-improvement, which suffer from limited creativity and interpretability. To address these bottlenecks, we develop a novel LLM-based humor generation mechanism based on a fundamental humor theory, GTVH. To produce funny and script-opposite captions, we introduce a humor-theory-driven multi-role LLM collaboration framework augmented with humor retrieval (HOMER). The framework consists of three LLM-based roles: (1) conflicting-script extractor that grounds humor in key script oppositions, forming the basis of caption generation; (2) retrieval-augmented hierarchical imaginator that identifies key humor targets and expands the creative space of them through diverse associations structured as imagination trees; and (3) caption generator that produces funny and diverse captions conditioned on the obtained knowledge. Extensive experiments on two New Yorker Cartoon benchmarking datasets show that HOMER outperforms state-of-the-art baselines and powerful LLM reasoning strategies on multi-modal humor captioning.
[NLP-40] Stopping Computation for Converged Tokens in Masked Diffusion-LM Decoding ICLR2026
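【Quick Read】: This paper addresses redundant computation in iterative sampling for Masked Diffusion Language Models: although many already-unmasked token positions stop changing over successive steps, the model still recomputes attention and feed-forward layers for every position, wasting compute. The key to the solution is SureLock: when the posterior at an unmasked position has stabilized across steps (the "sure condition"), that position is locked, its query projection and feed-forward sublayers are skipped, and its attention keys and values are cached so other positions can still attend to it. This lowers the dominant per-iteration cost from O(N^2 d) to O(MNd), where N is the sequence length, M is the number of unlocked positions, and d is the model dimension; M shrinks as iterations proceed. On LLaDA-8B, SureLock reduces algorithmic FLOPs by 30–50% relative to the same sampler without locking while keeping comparable generation quality, and the theoretical analysis shows that monitoring only the local KL at the lock step suffices to bound the deviation in final token probabilities.
Link: https://arxiv.org/abs/2602.06412
Authors: Daisuke Oba, Danushka Bollegala, Masahiro Kaneko, Naoaki Okazaki
Affiliations: Institute of Science Tokyo; University of Liverpool; Amazon; MBZUAI
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at ICLR 2026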
Abstract:Masked Diffusion Language Models generate sequences via iterative sampling that progressively unmasks tokens. However, they still recompute the attention and feed-forward blocks for every token position at every step – even when many unmasked tokens are essentially fixed, resulting in substantial waste in compute. We propose SureLock: when the posterior at an unmasked position has stabilized across steps (our sure condition), we lock that position – thereafter skipping its query projection and feed-forward sublayers – while caching its attention keys and values so other positions can continue to attend to it. This reduces the dominant per-iteration computational cost from O(N^2d) to O(MNd) where N is the sequence length, M is the number of unlocked token positions, and d is the model dimension. In practice, M decreases as the iteration progresses, yielding substantial savings. On LLaDA-8B, SureLock reduces algorithmic FLOPs by 30–50% relative to the same sampler without locking, while maintaining comparable generation quality. We also provide a theoretical analysis to justify the design rationale of SureLock: monitoring only the local KL at the lock step suffices to bound the deviation in final token probabilities. Our code will be available at this https URL .
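A rough sketch of the locking idea, assuming the "sure condition" is a threshold on the KL divergence between a position's posteriors at consecutive steps. The threshold `eps`, the KL direction, and the function names are assumptions; the abstract only states that a local KL is monitored at the lock step.

```python
import math


def kl_divergence(p, q, tiny=1e-12):
    """KL(p || q) for two discrete distributions over the vocabulary."""
    return sum(
        pi * math.log((pi + tiny) / (qi + tiny)) for pi, qi in zip(p, q) if pi > 0
    )


def update_locks(prev_posteriors, curr_posteriors, unmasked, locked, eps=1e-3):
    """Lock unmasked positions whose posterior barely moved since last step."""
    for pos in unmasked:
        if pos in locked:
            continue
        if kl_divergence(curr_posteriors[pos], prev_posteriors[pos]) < eps:
            locked.add(pos)
    return locked


def cost_ratio(n_positions, n_locked):
    """O(M N d) / O(N^2 d) = M / N, where M counts unlocked positions."""
    return (n_positions - n_locked) / n_positions
```

For example, with 8 positions of which 6 are locked, the dominant per-iteration term drops to a quarter of the unlocked cost, ignoring the overhead of maintaining the key-value cache.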
[NLP-41] FMBench: Adaptive Large Language Model Output Formatting
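[Quick read]: This paper addresses the difficulty large language models have in satisfying both semantic intent and format constraints when producing Markdown output for user-facing and system-integrated workflows. Markdown formatting errors (broken lists, malformed tables, inconsistent heading levels, invalid code blocks) are subtle yet significantly degrade downstream usability. The authors present FMBench, a benchmark that evaluates instruction following under diverse structural requirements, and design a lightweight alignment pipeline: supervised fine-tuning (SFT) first improves semantic alignment, then reinforcement learning fine-tuning optimizes a composite objective that balances semantic fidelity and structural correctness. The key point is that Markdown compliance improves without hard decoding constraints; experiments show that, when initialized from a strong SFT policy, reinforcement learning further improves robustness to challenging Markdown instructions. The results also reveal an inherent trade-off between semantic and structural objectives, which makes reward design important.
Link: https://arxiv.org/abs/2602.06384
Authors: Yaoting Wang, Yun Zhou, Henghui Ding
Affiliations: Fudan University
Subjects: Computation and Language (cs.CL)
Comments: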
Abstract:Producing outputs that satisfy both semantic intent and format constraints is essential for deploying large language models in user-facing and system-integrated workflows. In this work, we focus on Markdown formatting, which is ubiquitous in assistants, documentation, and tool-augmented pipelines but still prone to subtle, hard-to-detect errors (e.g., broken lists, malformed tables, inconsistent headings, and invalid code blocks) that can significantly degrade downstream usability. We present FMBench, a benchmark for adaptive Markdown output formatting that evaluates models under a wide range of instruction-following scenarios with diverse structural requirements. FMBench emphasizes real-world formatting behaviors such as multi-level organization, mixed content (natural language interleaved with lists/tables/code), and strict adherence to user-specified layout constraints. To improve Markdown compliance without relying on hard decoding constraints, we propose a lightweight alignment pipeline that combines supervised fine-tuning (SFT) with reinforcement learning fine-tuning. Starting from a base model, we first perform SFT on instruction-response pairs, and then optimize a composite objective that balances semantic fidelity with structural correctness. Experiments on two model families (OpenPangu and Qwen) show that SFT consistently improves semantic alignment, while reinforcement learning provides additional gains in robustness to challenging Markdown instructions when initialized from a strong SFT policy. Our results also reveal an inherent trade-off between semantic and structural objectives, highlighting the importance of carefully designed rewards for reliable formatted generation. Code is available at: this https URL.
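A composite objective of the kind described above might look like the following sketch. The two structural checks, the weight `alpha`, and the function names are assumptions made for illustration; the abstract does not specify the reward used in the paper, and the semantic score is taken as a given number.

```python
FENCE = "`" * 3  # a code-fence marker, built this way to keep the example readable

def fences_balanced(text):
    """True if the number of fence lines is even, i.e. every code block is closed."""
    n = sum(1 for line in text.splitlines() if line.lstrip().startswith(FENCE))
    return n % 2 == 0

def tables_consistent(text):
    """True if every row of each pipe table has the same number of cells."""
    width = None
    for line in text.splitlines():
        s = line.strip()
        if len(s) > 1 and s.startswith("|") and s.endswith("|"):
            cols = len(s[1:-1].split("|"))
            if width is None:
                width = cols
            elif cols != width:
                return False
        else:
            width = None  # a non-table line ends the current table
    return True

def structural_score(text):
    """Fraction of structural checks that pass (only two checks in this sketch)."""
    checks = [fences_balanced(text), tables_consistent(text)]
    return sum(checks) / len(checks)

def composite_reward(semantic_score, text, alpha=0.5):
    """Weighted mix of a semantic score in [0, 1] and the structural score."""
    return alpha * semantic_score + (1 - alpha) * structural_score(text)

good = "| a | b |\n|---|---|\n| 1 | 2 |\n"
bad = "| a | b |\n|---|---|\n| 1 | 2 | 3 |\n" + FENCE + "python\nprint(1)\n"

reward_good = composite_reward(0.8, good)
reward_bad = composite_reward(0.8, bad)
```

With the same semantic score, the malformed response (ragged table, unclosed code block) receives a lower reward, which is the trade-off the weight `alpha` controls.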
[NLP-42] ReBeCA: Unveiling Interpretable Behavior Hierarchy behind the Iterative Self-Reflection of Language Models with Causal Analysis
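[Quick read]: This paper addresses the opacity of self-reflection in language models: existing analyses are mostly correlational, so they neither expose the actual causal mechanisms nor generalize. The key to the solution is the ReBeCA framework, which models self-reflection trajectories as causal graphs and uses a three-stage Invariant Causal Prediction (ICP) pipeline to identify the semantic behaviors that causally affect the final self-reflection outcome. The method isolates sparse causal parents that yield structural likelihood gains of up to 49.6%, remain stable across tasks, and hold on out-of-distribution data, giving an interpretable and verifiable methodology for separating genuine causal relationships from spurious associations in self-reflection dynamics.
Link: https://arxiv.org/abs/2602.06373
Authors: Tianqiang Yan, Sihan Shang, Yuheng Li, Song Qiu, Hao Peng, Wenjian Luo, Jue Xie, Lizhen Qu, Yuan Gao
Affiliations: Monash University; Harbin Institute of Technology, Shenzhen; Hainan University; South China University of Technology; The Chinese University of Hong Kong, Shenzhen
Subjects: Computation and Language (cs.CL)
Comments: 17 pages, 3 figures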
Abstract:While self-reflection can enhance language model reliability, its underlying mechanisms remain opaque, with existing analyses often yielding correlation-based insights that fail to generalize. To address this, we introduce ReBeCA (self-Reflection Behavior explained through Causal Analysis), a framework that unveils the interpretable behavioral hierarchy governing the self-reflection outcome. By modeling self-reflection trajectories as causal graphs, ReBeCA isolates genuine determinants of performance through a three-stage Invariant Causal Prediction (ICP) pipeline. We establish three critical findings: (1) Behavioral hierarchy: Semantic behaviors of the model influence final self-reflection results hierarchically: directly or indirectly; (2) Causation matters: Generalizability in self-reflection effects is limited to just a few semantic behaviors; (3) More ≠ better: The confluence of seemingly positive semantic behaviors, even among direct causal factors, can impair the efficacy of self-reflection. ICP-based verification identifies sparse causal parents achieving up to 49.6% structural likelihood gains, stable across tasks where correlation-based patterns fail. Intervention studies on novel datasets confirm these causal relationships hold out-of-distribution (p = .013, partial η² = .071). ReBeCA thus provides a rigorous methodology for disentangling genuine causal mechanisms from spurious associations in self-reflection dynamics.
[NLP-43] Cost-Aware Model Selection for Text Classification: Multi-Objective Trade-offs Between Fine-Tuned Encoders and LLM Prompting in Production
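[Quick read]: This paper addresses the system-level inefficiency caused by indiscriminate use of Large Language Models (LLMs) for structured text classification. LLMs excel at open-ended reasoning and generation, but for classification over a fixed label space their high inference latency and cost are often overlooked, even though both are key constraints in production. The core of the solution is a systematic comparison of zero-shot and few-shot prompted LLMs against fully fine-tuned encoder architectures (such as the BERT family) along three dimensions: predictive performance, inference latency, and monetary cost. Pareto frontier analysis and a parameterized utility function quantify the trade-offs under different deployment regimes. Fine-tuned encoders match or exceed LLM performance at one to two orders of magnitude lower latency and cost, so they are better suited as core components of structured NLP pipelines, with LLMs as complementary modules in hybrid architectures.
Link: https://arxiv.org/abs/2602.06370
Authors: Alberto Andres Valdes Gonzalez
Affiliations: Pontifical Catholic University of Chile
Subjects: Computation and Language (cs.CL)
Comments: 26 pages, 12 figures. Empirical benchmark comparing fine-tuned encoders and LLM prompting for text classification under cost and latency constraints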
Abstract:Large language models (LLMs) such as GPT-4o and Claude Sonnet 4.5 have demonstrated strong capabilities in open-ended reasoning and generative language tasks, leading to their widespread adoption across a broad range of NLP applications. However, for structured text classification problems with fixed label spaces, model selection is often driven by predictive performance alone, overlooking operational constraints encountered in production systems. In this work, we present a systematic comparison of two contrasting paradigms for text classification: zero- and few-shot prompt-based large language models, and fully fine-tuned encoder-only architectures. We evaluate these approaches across four canonical benchmarks (IMDB, SST-2, AG News, and DBPedia), measuring predictive quality (macro F1), inference latency, and monetary cost. We frame model evaluation as a multi-objective decision problem and analyze trade-offs using Pareto frontier projections and a parameterized utility function reflecting different deployment regimes. Our results show that fine-tuned encoder-based models from the BERT family achieve competitive, and often superior, classification performance while operating at one to two orders of magnitude lower cost and latency compared to zero- and few-shot LLM prompting. Overall, our findings suggest that indiscriminate use of large language models for standard text classification workloads can lead to suboptimal system-level outcomes. Instead, fine-tuned encoders emerge as robust and efficient components for structured NLP pipelines, while LLMs are better positioned as complementary elements within hybrid architectures. We release all code, datasets, and evaluation protocols to support reproducibility and cost-aware NLP system design.
ACM classes: I.2.7; I.5.1; I.2.6. Cite as: arXiv:2602.06370 [cs.CL]. Submitted (v1): Fri, 6 Feb 2026 03:54:28 UTC.
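The multi-objective framing in this entry can be sketched as follows. Every number below is invented for illustration and is not a result from the paper; the `Candidate` fields, the linear form of `utility`, and its weights are assumptions, since the abstract does not give the paper's parameterization.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    macro_f1: float    # higher is better
    latency_ms: float  # lower is better
    cost_usd: float    # lower is better (hypothetical cost per 1k requests)

def dominates(a, b):
    """True if `a` is at least as good as `b` on every objective and better on one."""
    no_worse = (a.macro_f1 >= b.macro_f1 and a.latency_ms <= b.latency_ms
                and a.cost_usd <= b.cost_usd)
    strictly_better = (a.macro_f1 > b.macro_f1 or a.latency_ms < b.latency_ms
                       or a.cost_usd < b.cost_usd)
    return no_worse and strictly_better

def pareto_front(candidates):
    """Candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]

def utility(c, w_latency, w_cost):
    """One possible parameterized utility: quality minus weighted penalties."""
    return c.macro_f1 - w_latency * c.latency_ms - w_cost * c.cost_usd

# Invented measurements, for illustration only.
candidates = [
    Candidate("encoder-small", 0.93, 8.0, 0.01),
    Candidate("encoder-large", 0.95, 20.0, 0.03),
    Candidate("llm-zero-shot", 0.92, 900.0, 2.00),
    Candidate("llm-few-shot", 0.94, 1500.0, 4.00),
]

front = pareto_front(candidates)
best = max(candidates, key=lambda c: utility(c, w_latency=1e-4, w_cost=1e-2))
```

Changing `w_latency` and `w_cost` models different deployment regimes: a latency-critical service weights latency heavily, while an offline batch job can set both weights near zero and choose on quality alone.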
[NLP-44] SHINE: A Scalable In-Context Hypernetwork for Mapping Context to LoRA in a Single Pass
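[Quick read]: This paper addresses the high compute and storage overhead of adapting Large Language Models (LLMs) to new tasks through parameter fine-tuning such as Supervised Fine-Tuning (SFT). Conventional methods require explicit parameter updates, which are resource-intensive and slow to deploy. The key to the solution is SHINE (Scalable Hyper In-context NEtwork), a scalable hypernetwork that reuses the frozen LLM's own parameters in an in-context hypernetwork design and, together with architectural innovations, generates high-quality LoRA (Low-Rank Adaptation) adapters from diverse meaningful contexts in a single forward pass. The adapted model can answer questions about the context without directly accessing it, converting in-context knowledge into in-parameter knowledge, which substantially reduces compute and memory cost while retaining strong expressive power.
Link: https://arxiv.org/abs/2602.06358
Authors: Yewei Liu, Xiyuan Wang, Yansheng Mao, Yoav Gelbery, Haggai Maron, Muhan Zhang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: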
Abstract:We propose SHINE (Scalable Hyper In-context NEtwork), a scalable hypernetwork that can map diverse meaningful contexts into high-quality LoRA adapters for large language models (LLM). By reusing the frozen LLM’s own parameters in an in-context hypernetwork design and introducing architectural innovations, SHINE overcomes key limitations of prior hypernetworks and achieves strong expressive power with a relatively small number of parameters. We introduce a pretraining and instruction fine-tuning pipeline, and train our hypernetwork to generate high quality LoRA adapters from diverse meaningful contexts in a single forward pass. It updates LLM parameters without any fine-tuning, and immediately enables complex question answering tasks related to the context without directly accessing the context, effectively transforming in-context knowledge to in-parameter knowledge in one pass. Our work achieves outstanding results on various tasks, greatly saves time, computation and memory costs compared to SFT-based LLM adaptation, and shows great potential for scaling. Our code is available at this https URL
[NLP-45] Can Post-Training Transform LLMs into Causal Reasoners?
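[Quick read]: This paper addresses the limited causal inference ability of Large Language Models (LLMs) and the lack of a systematic way to improve it. LLMs show promise for decision support, but their precise causal estimation is limited, and the effect of post-training on causal reasoning is insufficiently explored. The key to the solution is CauGym, a comprehensive dataset with seven core causal tasks for training and five diverse test sets, used to systematically evaluate five post-training methods (SFT, DPO, KTO, PPO, and GRPO). Experiments show that a smaller post-trained model (14B parameters) can match or surpass much larger models on several benchmarks (93.5% accuracy on CaLM) and generalizes well under distribution shift and noisy data, providing the first systematic evidence that targeted post-training can produce reliable and robust LLM-based causal reasoners.
Link: https://arxiv.org/abs/2602.06337
Authors: Junqi Chen, Sirui Chen, Chaochao Lu
Affiliations: Shanghai Artificial Intelligence Laboratory; Fudan University; Tongji University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: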
Abstract:Causal inference is essential for decision-making but remains challenging for non-experts. While large language models (LLMs) show promise in this domain, their precise causal estimation capabilities are still limited, and the impact of post-training on these abilities is insufficiently explored. This paper examines the extent to which post-training can enhance LLMs’ capacity for causal inference. We introduce CauGym, a comprehensive dataset comprising seven core causal tasks for training and five diverse test sets. Using this dataset, we systematically evaluate five post-training approaches: SFT, DPO, KTO, PPO, and GRPO. Across five in-domain and four existing benchmarks, our experiments demonstrate that appropriate post-training enables smaller LLMs to perform causal inference competitively, often surpassing much larger models. Our 14B parameter model achieves 93.5% accuracy on the CaLM benchmark, compared to 55.4% by OpenAI o3. Furthermore, the post-trained LLMs exhibit strong generalization and robustness under real-world conditions such as distribution shifts and noisy data. Collectively, these findings provide the first systematic evidence that targeted post-training can produce reliable and robust LLM-based causal reasoners. Our data and GRPO-model are available at this https URL.
[NLP-46] The Condensate Theorem: Transformers are O(n) Not O(n^2)
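[Quick read]: This paper addresses the cost of attention in large language models, namely the inference bottleneck created by the quadratic O(n^2) time complexity of conventional self-attention. The key to the solution is the Condensate Theorem, which states that attention sparsity is a learned topological property rather than an architectural constraint. According to the paper, projecting each query onto a dynamically identified Condensate Manifold (Anchor + Window + Dynamic Top-k) is output-equivalent to full attention, with bit-exact token matching. The authors report that their hardware-aware Topological Attention kernel achieves a measured 159x speedup at 131K tokens (3.94 ms vs 628 ms) and a projected 1,200x speedup at 1M tokens, reducing inference cost by 99.9%, and they conclude that the quadratic bottleneck is an artifact of implementation rather than a requirement of model intelligence.
Link: https://arxiv.org/abs/2602.06317
Authors: Jorge L. Ruiz Williams
Affiliations: NaNZeta LLC
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 13 pages, 4 figures, 8 tables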
Abstract:We present the Condensate Theorem: attention sparsity is a learned topological property, not an architectural constraint. Through empirical analysis of trained language models, we find that attention mass concentrates on a distinct topological manifold – and this manifold can be identified dynamically without checking every position. We prove a general result: for any query, projecting attention onto the Condensate Manifold (Anchor + Window + Dynamic Top-k) achieves 100% output equivalence with full O(n^2) attention. This is not an approximation – it is lossless parity. We validate this across GPT-2, Pythia, Qwen2, TinyLlama, and Mistral, demonstrating bit-exact token matching on 1,500+ generated tokens. By mapping this topology to hardware, our Topological Attention kernel achieves a 159x measured speedup at 131K tokens (3.94ms vs 628ms) and a projected 1,200x speedup at 1M tokens, reducing inference costs by 99.9% compared to Flash Attention. We conclude that the quadratic bottleneck is an artifact of naive implementation, not intelligence.
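To make the "Anchor + Window + Dynamic Top-k" index set concrete, here is a toy sketch for a single query. It is an illustration only: the function names and parameter values are assumptions, the top-k step below scores every position (which the paper says its method avoids), and the sketch measures retained softmax mass rather than demonstrating the exact equivalence the paper claims.

```python
import math

def condensate_indices(scores, n_anchor, window, top_k):
    """Key positions kept for one query under a causal mask.

    `scores` holds the raw attention logits of the query against positions
    0..t-1. The kept set is the union of the first `n_anchor` positions, the
    last `window` positions, and the `top_k` highest-scoring remaining ones.
    """
    t = len(scores)
    anchors = set(range(min(n_anchor, t)))
    recent = set(range(max(0, t - window), t))
    rest = [i for i in range(t) if i not in anchors and i not in recent]
    top = set(sorted(rest, key=lambda i: scores[i], reverse=True)[:top_k])
    return sorted(anchors | recent | top)

def softmax_mass(scores, kept):
    """Share of the full softmax probability mass that falls on `kept`."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return sum(exps[i] for i in kept) / sum(exps)

# Invented logits for a query attending to nine earlier positions.
scores = [4.0, 0.1, 0.2, 5.0, 0.0, 0.1, 0.3, 3.0, 3.5]
kept = condensate_indices(scores, n_anchor=1, window=2, top_k=1)
mass = softmax_mass(scores, kept)
```

Here four of nine positions carry most of the attention mass. Whether the remainder can be dropped without changing generated tokens is the paper's claim, not something this sketch establishes.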
[NLP-47] Lost in Speech: Benchmarking Evaluation and Parsing of Spoken Code-Switching Beyond Standard UD Assumptions
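[Quick read]: This paper addresses the challenges that spoken code-switching (CSW) poses for syntactic parsing: Universal Dependencies (UD) assumptions designed for written text cannot handle disfluencies, repetition, ellipsis, and discourse-driven structure, so parsers and Large Language Models (LLMs) degrade sharply on speech. The solution rests on three contributions: a linguistically grounded taxonomy of spoken CSW phenomena; SpokeBench, an expert-annotated gold benchmark that tests spoken-language structure beyond standard UD assumptions; and FLEX-UD, an ambiguity-aware evaluation metric motivated by the observation that rigid metrics count linguistically plausible analyses as errors. The authors further propose DECAP, a decoupled agentic parsing framework that separates spoken-phenomena handling from core syntactic analysis, yielding more robust and interpretable parses without retraining and improvements of up to 52.6% over existing parsing techniques.
Link: https://arxiv.org/abs/2602.06307
Authors: Nemika Tyagi, Holly Hendrix, Nelvin Licona-Guevara, Justin Mackie, Phanos Kareen, Muhammad Imran, Megan Michelle Smith, Tatiana Gallego Hernande, Chitta Baral, Olga Kellert
Affiliations: Arizona State University; Universidade da Coruña, CITIC
Subjects: Computation and Language (cs.CL)
Comments: 18 pages, 4 Figures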
Abstract:Spoken code-switching (CSW) challenges syntactic parsing in ways not observed in written text. Disfluencies, repetition, ellipsis, and discourse-driven structure routinely violate standard Universal Dependencies (UD) assumptions, causing parsers and large language models (LLMs) to fail despite strong performance on written data. These failures are compounded by rigid evaluation metrics that conflate genuine structural errors with acceptable variation. In this work, we present a systems-oriented approach to spoken CSW parsing. We introduce a linguistically grounded taxonomy of spoken CSW phenomena and SpokeBench, an expert-annotated gold benchmark designed to test spoken-language structure beyond standard UD assumptions. We further propose FLEX-UD, an ambiguity-aware evaluation metric, which reveals that existing parsing techniques perform poorly on spoken CSW by penalizing linguistically plausible analyses as errors. We then propose DECAP, a decoupled agentic parsing framework that isolates spoken-phenomena handling from core syntactic analysis. Experiments show that DECAP produces more robust and interpretable parses without retraining and achieves up to 52.6% improvements over existing parsing techniques. FLEX-UD evaluations further reveal qualitative improvements that are masked by standard metrics.
[NLP-48] Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math
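[Quick read]: This paper addresses the verification bottleneck for research-level mathematics: once generative AI can produce plausible solution attempts, checking them still consumes scarce expert time and does not scale. The key to the solution is Consequence-Based Utility, an oracle-free evaluator that scores each candidate solution by how well it works as an in-context exemplar for solving related but verifiable questions. It rests on the hypothesis that a correct solution carries method-level information that transfers better to neighboring questions than an incorrect one. The method ranks candidates better than reward models, generative reward models, and LLM judges, improving Acc@1 and AUC, and maintains a stronger correct-wrong separation.
Link: https://arxiv.org/abs/2602.06291
Authors: Guijin Son, Donghun Yang, Hitesh Laxmichand Patel, Hyunwoo Ko, Amit Agarwal, Sunghee Ahn, Kyong-Ha Lee, Youngjae Yu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Preprint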
Abstract:Recent progress in reasoning models suggests that generating plausible attempts for research-level mathematics may be within reach, but verification remains a bottleneck, consuming scarce expert time. We hypothesize that a meaningful solution should contain enough method-level information that, when applied to a neighborhood of related questions, it should yield better downstream performance than incorrect solutions. Building on this idea, we propose Consequence-Based Utility, an oracle-free evaluator that scores each candidate by testing its value as an in-context exemplar in solving related yet verifiable questions. Our approach is evaluated on an original set of research-level math problems, each paired with one expert-written solution and nine LLM-generated solutions. Notably, Consequence-Based Utility consistently outperforms reward models, generative reward models, and LLM judges on ranking quality. Specifically, for GPT-OSS-120B, it improves Acc@1 from 67.2 to 76.3 and AUC from 71.4 to 79.6, with similarly large AUC gains on GPT-OSS-20B (69.0 to 79.2). Furthermore, compared to LLM-Judges, it also exhibits a larger solver-evaluator gap, maintaining a stronger correct-wrong separation even on instances where the underlying solver often fails to solve.
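The scoring idea can be sketched in a few lines. This is a minimal illustration under stated assumptions: `solve` and `verify` are hypothetical callables standing in for an LLM solver and an answer checker, and the toy "methods" below replace real solutions; the paper's prompts and aggregation are not given in the abstract.

```python
def consequence_based_utility(candidate, neighbor_questions, solve, verify):
    """Mean verified accuracy on neighbor questions when `candidate` is the exemplar."""
    if not neighbor_questions:
        return 0.0
    correct = 0
    for item in neighbor_questions:
        answer = solve(exemplar=candidate, question=item["question"])
        correct += bool(verify(answer, item["reference"]))
    return correct / len(neighbor_questions)

def rank_candidates(candidates, neighbor_questions, solve, verify):
    """Candidates sorted by utility, best first, as (score, candidate) pairs."""
    scored = [(consequence_based_utility(c, neighbor_questions, solve, verify), c)
              for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Toy stand-ins: the "exemplar" names a method and the solver simply applies it.
def toy_solve(exemplar, question):
    return question * 2 if exemplar == "double" else question ** 2

def toy_verify(answer, reference):
    return answer == reference

# Neighbor questions whose answers can be checked (here: the doubling task).
neighbors = [{"question": 3, "reference": 6}, {"question": 5, "reference": 10}]
ranking = rank_candidates(["square", "double"], neighbors, toy_solve, toy_verify)
```

The candidate whose method transfers to the verifiable neighbors ranks first, without anyone checking the candidate itself, which is the sense in which the evaluator is oracle-free.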
[NLP-49] RoPE-LIME: RoPE-Space Locality Sparse-K Sampling for Efficient LLM Attribution
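[Quick read]: This paper addresses the difficulty of explaining closed-source Large Language Model (LLM) outputs, specifically how to obtain fine-grained token-level attributions when gradients are unavailable. Conventional perturbation-based methods depend on regenerated text, which is costly and noisy. The key to the solution is RoPE-LIME, an open-source extension of gSMILE that decouples reasoning from explanation: given a fixed output from the closed model, a smaller open-source surrogate computes attributions from probability-based objectives (negative log-likelihood and divergence targets) under input perturbations. It introduces two mechanisms: (i) a locality kernel based on Relaxed Word Mover's Distance computed in RoPE embedding space, which keeps similarity stable under masking; and (ii) Sparse-K sampling, which improves feature-interaction coverage under a limited budget. Together these improve attribution quality while substantially reducing closed-model API calls.
Link: https://arxiv.org/abs/2602.06275
Authors: Isaac Picov, Ritesh Goru
Affiliations: University of Toronto; DevRev
Subjects: Computation and Language (cs.CL)
Comments: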
Abstract:Explaining closed-source LLM outputs is challenging because API access prevents gradient-based attribution, while perturbation methods are costly and noisy when they depend on regenerated text. We introduce RoPE-LIME, an open-source extension of gSMILE that decouples reasoning from explanation: given a fixed output from a closed model, a smaller open-source surrogate computes token-level attributions from probability-based objectives (negative log-likelihood and divergence targets) under input perturbations. RoPE-LIME incorporates (i) a locality kernel based on Relaxed Word Mover’s Distance computed in RoPE embedding space for stable similarity under masking, and (ii) Sparse-K sampling, an efficient perturbation strategy that improves interaction coverage under limited budgets. Experiments on HotpotQA (sentence features) and a hand-labeled MMLU subset (word features) show that RoPE-LIME produces more informative attributions than leave-one-out sampling and improves over gSMILE while substantially reducing closed-model API calls.
[NLP-50] VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation ICLR2026
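[Quick read]: This paper addresses the limited performance and interpretability of speech emotion recognition models that ignore fine-grained prosody. Existing approaches based on Large Language Models (LLMs) reason over textual semantics but typically neglect key prosodic information such as fundamental frequency, energy, and duration, which limits generalization across domains and speakers. The key to the solution is VowelPrompt, a framework that extracts pitch, energy, and duration features from time-aligned vowel segments, converts them into natural-language descriptions for interpretability, and lets the LLM reason jointly over lexical semantics and fine-grained prosodic variation. A two-stage training strategy, Supervised Fine-Tuning (SFT) followed by Reinforcement Learning with Verifiable Reward (RLVR), further improves reasoning and structured-output adherence, yielding consistently better results under zero-shot, fine-tuned, cross-domain, and cross-lingual conditions.
Link: https://arxiv.org/abs/2602.06270
Authors: Yancheng Wang, Osama Hanna, Ruiming Xie, Xianfeng Rui, Maohao Shen, Xuedong Zhang, Christian Fuegen, Jilong Wu, Debjyoti Paul, Arthur Guo, Zhihong Lei, Ozlem Kalinli, Qing He, Yingzhen Yang
Affiliations: Meta
Subjects: Computation and Language (cs.CL)
Comments: Accepted to ICLR 2026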
Abstract:Emotion recognition in speech presents a complex multimodal challenge, requiring comprehension of both linguistic content and vocal expressivity, particularly prosodic features such as fundamental frequency, intensity, and temporal dynamics. Although large language models (LLMs) have shown promise in reasoning over textual transcriptions for emotion recognition, they typically neglect fine-grained prosodic information, limiting their effectiveness and interpretability. In this work, we propose VowelPrompt, a linguistically grounded framework that augments LLM-based emotion recognition with interpretable, fine-grained vowel-level prosodic cues. Drawing on phonetic evidence that vowels serve as primary carriers of affective prosody, VowelPrompt extracts pitch-, energy-, and duration-based descriptors from time-aligned vowel segments, and converts these features into natural language descriptions for better interpretability. Such a design enables LLMs to jointly reason over lexical semantics and fine-grained prosodic variation. Moreover, we adopt a two-stage adaptation procedure comprising supervised fine-tuning (SFT) followed by Reinforcement Learning with Verifiable Reward (RLVR), implemented via Group Relative Policy Optimization (GRPO), to enhance reasoning capability, enforce structured output adherence, and improve generalization across domains and speaker variations. Extensive evaluations across diverse benchmark datasets demonstrate that VowelPrompt consistently outperforms state-of-the-art emotion recognition methods under zero-shot, fine-tuned, cross-domain, and cross-linguistic conditions, while enabling the generation of interpretable explanations that are jointly grounded in contextual semantics and fine-grained prosodic structure.
[NLP-51] MPIB: A Benchmark for Medical Prompt Injection Attacks and Clinical Safety in LLMs
[Quick read]: This paper addresses the clinical safety risk that prompt injection attacks pose to Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems in clinical workflows, where maliciously crafted instructions can steer a model toward unsafe or misleading medical advice. The key to the solution is the Medical Prompt Injection Benchmark (MPIB), whose core contribution is the Clinical Harm Event Rate (CHER), an outcome-oriented risk metric that measures how often high-severity clinical harm events occur. CHER is reported alongside Attack Success Rate (ASR) to separate instruction compliance from actual patient risk. MPIB contains 9,697 instances built through multi-stage quality gates and clinical safety review, and supports systematic, reproducible evaluation of prompt injection across LLMs and defense configurations.
Link: https://arxiv.org/abs/2602.06268
Authors: Junhyeok Lee, Han Jang, Kyu Sung Choi
Affiliations: Seoul National University College of Medicine; Seoul National University; Seoul National University Hospital
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 13 pages, 7 figures
Abstract:Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems are increasingly integrated into clinical workflows; however, prompt injection attacks can steer these systems toward clinically unsafe or misleading outputs. We introduce the Medical Prompt Injection Benchmark (MPIB), a dataset-and-benchmark suite for evaluating clinical safety under both direct prompt injection and indirect, RAG-mediated injection across clinically grounded tasks. MPIB emphasizes outcome-level risk via the Clinical Harm Event Rate (CHER), which measures high-severity clinical harm events under a clinically grounded taxonomy, and reports CHER alongside Attack Success Rate (ASR) to disentangle instruction compliance from downstream patient risk. The benchmark comprises 9,697 curated instances constructed through multi-stage quality gates and clinical safety linting. Evaluating MPIB across a diverse set of baseline LLMs and defense configurations, we find that ASR and CHER can diverge substantially, and that robustness depends critically on whether adversarial instructions appear in the user query or in retrieved context. We release MPIB with evaluation code, adversarial baselines, and comprehensive documentation to support reproducible and systematic research on clinical prompt injection. Code and data are available at GitHub (code) and Hugging Face (data).
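The difference between the two metrics is easy to see on a handful of judged instances. The record fields, the 0 to 4 severity scale, and the `HIGH_SEVERITY` cutoff below are assumptions made for this sketch; the benchmark's actual taxonomy and schema are not given in the abstract, and the data are invented.

```python
def rate(flags):
    """Fraction of True values in a list of booleans (0.0 for an empty list)."""
    return sum(flags) / len(flags) if flags else 0.0

# Hypothetical per-instance judgments: did the model follow the injected
# instruction, and how severe was the resulting clinical harm (0 = none)?
results = [
    {"followed_injection": True,  "harm_severity": 0},
    {"followed_injection": True,  "harm_severity": 1},
    {"followed_injection": True,  "harm_severity": 4},
    {"followed_injection": False, "harm_severity": 0},
    {"followed_injection": False, "harm_severity": 0},
]

HIGH_SEVERITY = 3  # assumed cutoff for a "high-severity" harm event

# ASR counts instruction compliance; CHER counts high-severity outcomes.
asr = rate([r["followed_injection"] for r in results])
cher = rate([r["harm_severity"] >= HIGH_SEVERITY for r in results])
```

In this toy sample the attack succeeds on 60% of instances but causes high-severity harm on only 20%, the kind of divergence between ASR and CHER that the abstract reports.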
[NLP-52] Is my model “mind blurting”? Interpreting the dynamics of reasoning tokens with Recurrence Quantification Analysis (RQA)
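[Quick read]: This paper addresses the limitation of using generated text length (response length) as a proxy when analysing the reasoning behaviour of large models, since length captures neither the dynamics of the Chain of Thoughts (CoT) nor the effectiveness of the generated tokens. The key to the solution is Recurrence Quantification Analysis (RQA): token generation is treated as a dynamical system, hidden embeddings are extracted at each generation step, and RQA metrics such as Determinism and Laminarity are computed on the resulting trajectory to quantify repetition and stalling in latent space. The method does not rely on the text itself, improves prediction of task complexity by 8%, and offers an interpretable, quantitative tool for studying the latent generation dynamics of reasoning models under test-time scaling.
Link: https://arxiv.org/abs/2602.06266
Authors: Quoc Tuan Pham, Mehdi Jafari, Flora Salim
Affiliations: University of New South Wales
Subjects: Computation and Language (cs.CL)
Comments: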
Abstract:Test-time compute is central to large reasoning models, yet analysing their reasoning behaviour through generated text is increasingly impractical and unreliable. Response length is often used as a brute proxy for reasoning effort, but this metric fails to capture the dynamics and effectiveness of the Chain of Thoughts (CoT) or the generated tokens. We propose Recurrence Quantification Analysis (RQA) as a non-textual alternative for analysing model’s reasoning chains at test time. By treating token generation as a dynamical system, we extract hidden embeddings at each generation step and apply RQA to the resulting trajectories. RQA metrics, including Determinism and Laminarity, quantify patterns of repetition and stalling in the model’s latent representations. Analysing 3,600 generation traces from DeepSeek-R1-Distill, we show that RQA captures signals not reflected by response length, but also substantially improves prediction of task complexity by 8%. These results help establish RQA as a principled tool for studying the latent token generation dynamics of test-time scaling in reasoning models.
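For readers unfamiliar with RQA, the sketch below computes Determinism and Laminarity from a recurrence matrix using their common textbook definitions. The distance metric, the `radius`, the minimum line lengths, and the 2-D toy trajectories are assumptions of this sketch; the paper applies RQA to hidden embeddings, with settings that the abstract does not state.

```python
import math

def recurrence_matrix(traj, radius):
    """R[i][j] = 1 when states i and j lie within `radius` of each other."""
    n = len(traj)
    return [[1 if math.dist(traj[i], traj[j]) <= radius else 0 for j in range(n)]
            for i in range(n)]

def run_lengths(seq):
    """Lengths of the maximal runs of 1s in a 0/1 sequence."""
    runs, current = [], 0
    for v in seq:
        if v:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

def determinism(R, l_min=2):
    """Share of off-diagonal recurrent points on diagonal lines of length >= l_min."""
    n = len(R)
    runs = []
    for k in range(1, n):  # upper diagonals only, since R is symmetric
        runs += run_lengths([R[i][i + k] for i in range(n - k)])
    total = sum(runs)
    return sum(r for r in runs if r >= l_min) / total if total else 0.0

def laminarity(R, v_min=2):
    """Share of recurrent points on vertical lines of length >= v_min."""
    n = len(R)
    runs = []
    for j in range(n):
        runs += run_lengths([R[i][j] for i in range(n)])
    total = sum(runs)
    return sum(r for r in runs if r >= v_min) / total if total else 0.0

# Toy trajectories: one that cycles through three states, one that stays put.
periodic = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)] * 2
stalled = [(0.0, 0.0)] * 4

R_periodic = recurrence_matrix(periodic, radius=0.1)
R_stalled = recurrence_matrix(stalled, radius=0.1)
```

The cycling trajectory produces long diagonal lines (high Determinism, zero Laminarity), while the stationary one produces vertical lines (high Laminarity), which matches the abstract's reading of the two metrics as repetition and stalling.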
[NLP-53] Can One-sided Arguments Lead to Response Change in Large Language Models?
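[Quick read]: This paper asks whether the responses of Large Language Models (LLMs) to polemic questions can be steered toward a specific stance in a simple, intuitive way. LLMs may give a balanced answer, take a single viewpoint, or refuse to answer. The approach studied is to provide only one-sided arguments supporting a viewpoint, which the paper finds is enough to shift the stance of the response (opinion steering). The effect appears across models, numbers of arguments, and topics, and it consistently weakens when the arguments are switched to other ones.
Link: https://arxiv.org/abs/2602.06260
Authors: Pedro Cisneros-Velarde
Affiliations: VMware Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: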
Abstract:Polemic questions need more than one viewpoint to express a balanced answer. Large Language Models (LLMs) can provide a balanced answer, but also take a single aligned viewpoint or refuse to answer. In this paper, we study if such initial responses can be steered to a specific viewpoint in a simple and intuitive way: by only providing one-sided arguments supporting the viewpoint. Our systematic study has three dimensions: (i) which stance is induced in the LLM response, (ii) how the polemic question is formulated, (iii) how the arguments are shown. We construct a small dataset and remarkably find that opinion steering occurs across (i)-(iii) for diverse models, number of arguments, and topics. Switching to other arguments consistently decreases opinion steering.
[NLP-54] Steering Safely or Off a Cliff? Rethinking Specificity and Robustness in Inference-Time Interventions EACL2026
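[Quick read]: This paper addresses insufficient evaluation of specificity in model steering: existing methods can effectively intervene on a target property of a large language model (for example reducing overrefusal or hallucination), but evaluations often overlook effects on related behaviors, especially stability under distribution shift. The key to the solution is a framework with three dimensions of specificity: general (preserving fluency and unrelated abilities), control (preserving related control properties), and robustness (preserving control properties under distribution shifts). Across two safety-critical use cases, steering largely maintains general and control specificity but consistently fails on robustness; for example, interventions that reduce overrefusal substantially increase vulnerability to jailbreaks. Standard efficacy and specificity checks are therefore insufficient, and robustness evaluation is needed to judge whether a steering method is reliable.
Link: https://arxiv.org/abs/2602.06256
Authors: Navita Goyal, Hal Daumé III
Affiliations: University of Maryland College Park
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: EACL 2026 Main, Long Paper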
Abstract:Model steering, which involves intervening on hidden representations at inference time, has emerged as a lightweight alternative to finetuning for precisely controlling large language models. While steering efficacy has been widely studied, evaluations of whether interventions alter only the intended property remain limited, especially with respect to unintended changes in behaviors related to the target property. We call this notion specificity. We propose a framework that distinguishes three dimensions of specificity: general (preserving fluency and unrelated abilities), control (preserving related control properties), and robustness (preserving control properties under distribution shifts). We study two safety-critical use cases: steering models to reduce overrefusal and faithfulness hallucinations, and show that while steering achieves high efficacy and largely maintains general and control specificity, it consistently fails to preserve robustness specificity. In the case of overrefusal steering, for example, all steering methods reduce overrefusal without harming general abilities and refusal on harmful queries; however, they substantially increase vulnerability to jailbreaks. Our work provides the first systematic evaluation of specificity in model steering, showing that standard efficacy and specificity checks are insufficient, because without robustness evaluation, steering methods may appear reliable even when they compromise model safety.
[NLP-55] BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks
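[Quick read]: This paper addresses the lack of rigorous quality control in Multiple-Choice Question Answering (MCQA) benchmarks in Natural Language Processing (NLP). Benchmarks commonly exhibit three kinds of flaws: contamination (items that appear verbatim online), shortcuts (cues in the choices that enable guessing), and writing errors (structural or grammatical issues, judged against a 19-rule education rubric). The key to the solution is BenchMarker, a toolkit that uses Large Language Models (LLMs) as judges to flag these flaws and is validated against human annotations. An audit of 12 MCQA benchmarks shows how the flaws affect model evaluation, providing evidence and a reusable automated framework for improving MCQA benchmark design.
Link: https://arxiv.org/abs/2602.06221
Authors: Nishant Balepur, Bhavya Rajasekaran, Jane Oh, Michael Xie, Atrey Desai, Vipul Gupta, Steven James Moore, Eunsol Choi, Rachel Rudinger, Jordan Lee Boyd-Graber
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: In-progress preprint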
Abstract:Multiple-choice question answering (MCQA) is standard in NLP, but benchmarks lack rigorous quality control. We present BenchMarker, an education-inspired toolkit using LLM judges to flag three common MCQ flaws: 1) contamination - items appearing exactly online; 2) shortcuts - cues in the choices that enable guessing; and 3) writing errors - structural/grammatical issues based on a 19-rule education rubric. We validate BenchMarker with human annotations, then run the tool to audit 12 benchmarks, revealing: 2) contaminated MCQs tend to inflate accuracy, while writing errors tend to lower it and change rankings beyond random; and 3) prior benchmark repairs address their targeted issues (i.e., lowering accuracy with LLM-written distractors), but inadvertently add new flaws (i.e. implausible distractors, many correct answers). Overall, flaws in MCQs degrade NLP evaluation, but education research offers a path forward. We release BenchMarker to bridge the fields and improve MCQA benchmark design.
[NLP-56] Learning Rate Scaling across LoRA Ranks and Transfer to Full Finetuning
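[Quick read]: This paper addresses the absence of a clear scaling rule between learning rate and adapter rank in Low-Rank Adaptation (LoRA), which forces practitioners to re-tune the learning rate whenever the rank changes. The key to the solution is Maximal-Update Adaptation (μA), inspired by the Maximal-Update Parametrization (μP) used in pretraining. Using techniques from hyperparameter transfer, the analysis shows how the optimal learning rate should scale with model width and adapter rank, and identifies two regimes depending on initialization and the LoRA scaling factor: one in which the optimal learning rate is roughly invariant across ranks, and one in which it scales inversely with rank. μA also identifies a configuration that lets learning rates transfer from LoRA to full finetuning, substantially reducing tuning cost for full finetuning.
Link: https://arxiv.org/abs/2602.06204
Authors: Nan Chen, Soledad Villar, Soufiane Hayou
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: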
Abstract:Low-Rank Adaptation (LoRA) is a standard tool for parameter-efficient finetuning of large models. While it induces a small memory footprint, its training dynamics can be surprisingly complex as they depend on several hyperparameters such as initialization, adapter rank, and learning rate. In particular, it is unclear how the optimal learning rate scales with adapter rank, which forces practitioners to re-tune the learning rate whenever the rank is changed. In this paper, we introduce Maximal-Update Adaptation (μA), a theoretical framework that characterizes how the “optimal” learning rate should scale with model width and adapter rank to produce stable, non-vanishing feature updates under standard configurations. μA is inspired by the Maximal-Update Parametrization (μP) in pretraining. Our analysis leverages techniques from hyperparameter transfer and reveals that the optimal learning rate exhibits different scaling patterns depending on initialization and LoRA scaling factor. Specifically, we identify two regimes: one where the optimal learning rate remains roughly invariant across ranks, and another where it scales inversely with rank. We further identify a configuration that allows learning rate transfer from LoRA to full finetuning, drastically reducing the cost of learning rate tuning for full finetuning. Experiments across language, vision, vision–language, image generation, and reinforcement learning tasks validate our scaling rules and show that learning rates tuned on LoRA transfer reliably to full finetuning.
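As a rough illustration of the two regimes named in the abstract, the sketch below transfers a learning rate tuned at one LoRA rank to another rank. The function name `transfer_lr` and the regime labels are hypothetical, and the abstract does not say which initialization or LoRA scaling factor leads to which regime, so that choice is left to the caller.

```python
def transfer_lr(base_lr: float, base_rank: int, new_rank: int, regime: str) -> float:
    """Map a learning rate tuned at base_rank to new_rank.

    regime is an assumption supplied by the caller:
      "invariant": the optimal learning rate is treated as constant across ranks
      "inverse":   the optimal learning rate is treated as proportional to 1 / rank
    """
    if base_rank <= 0 or new_rank <= 0:
        raise ValueError("ranks must be positive integers")
    if regime == "invariant":
        return base_lr
    if regime == "inverse":
        # Keep lr * rank constant when moving between ranks.
        return base_lr * base_rank / new_rank
    raise ValueError(f"unknown regime: {regime!r}")


if __name__ == "__main__":
    # Learning rate tuned at rank 8, reused at rank 64 under each regime.
    for regime in ("invariant", "inverse"):
        print(regime, transfer_lr(2e-4, base_rank=8, new_rank=64, regime=regime))
```

Under the inverse regime, moving from rank 8 to rank 64 divides the learning rate by 8; under the invariant regime it is reused unchanged.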
[NLP-57] Generics in science communication: Misaligned interpretations across laypeople, scientists, and large language models
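[Quick read]: This paper addresses the risk that generics in science communication cause misunderstanding: different audiences (laypeople, scientists, and large language models) interpret the same generic statement in systematically different ways, which invites overgeneralization. The study finds that, compared with scientists, laypeople judge scientific generics to be more generalizable and credible, and LLMs (ChatGPT-5 and DeepSeek) rate them higher still, which may lead to systematic overgeneralization of research findings. The key takeaway is that language choices matter in both human and LLM-mediated science communication: improving it requires attention to semantic clarity and vigilance about the biases LLMs can introduce when summarizing research.
Link: https://arxiv.org/abs/2602.06190
Authors: Uwe Peters, Andrea Bertazzoli, Jasmine M. DeJesus, Gisela J. van der Velden, Benjamin Chin-Yee
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: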
Abstract:Scientists often use generics, that is, unquantified statements about whole categories of people or phenomena, when communicating research findings (e.g., “statins reduce cardiovascular events”). Large language models (LLMs), such as ChatGPT, frequently adopt the same style when summarizing scientific texts. However, generics can prompt overgeneralizations, especially when they are interpreted differently across audiences. In a study comparing laypeople, scientists, and two leading LLMs (ChatGPT-5 and DeepSeek), we found systematic differences in interpretation of generics. Compared to most scientists, laypeople judged scientific generics as more generalizable and credible, while LLMs rated them even higher. These mismatches highlight significant risks for science communication. Scientists may use generics and incorrectly assume laypeople share their interpretation, while LLMs may systematically overgeneralize scientific findings when summarizing research. Our findings underscore the need for greater attention to language choices in both human and LLM-mediated science communication.
[NLP-58] PhenoLIP: Integrating Phenotype Ontology Knowledge into Medical Vision-Language Pretraining
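[Quick read]: This paper addresses the problem that current medical vision-language models (medical VLMs) rely on coarse image-text contrastive objectives and therefore fail to capture the systematic visual knowledge encoded in well-defined medical phenotype ontologies. The key to the solution is PhenoKG, the first large-scale phenotype-centric multimodal knowledge graph, containing over 520K high-quality image-text pairs linked to more than 3,000 phenotypes, together with the PhenoLIP pretraining framework. PhenoLIP uses a two-stage strategy: it first learns a knowledge-enhanced phenotype embedding space from textual ontology data, then injects this structured phenotype knowledge into multimodal pretraining through a teacher-guided knowledge distillation objective, improving performance on phenotype recognition and cross-modal retrieval.
Link: https://arxiv.org/abs/2602.06184
Authors: Cheng Liang, Chaoyi Wu, Weike Zhao, Ya Zhang, Yanfeng Wang, Weidi Xie
Affiliations: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: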
Abstract:Recent progress in large-scale CLIP-like vision-language models (VLMs) has greatly advanced medical image analysis. However, most existing medical VLMs still rely on coarse image-text contrastive objectives and fail to capture the systematic visual knowledge encoded in well-defined medical phenotype ontologies. To address this gap, we construct PhenoKG, the first large-scale, phenotype-centric multimodal knowledge graph that encompasses over 520K high-quality image-text pairs linked to more than 3,000 phenotypes. Building upon PhenoKG, we propose PhenoLIP, a novel pretraining framework that explicitly incorporates structured phenotype knowledge into medical VLMs through a two-stage process. We first learn a knowledge-enhanced phenotype embedding space from textual ontology data and then distill this structured knowledge into multimodal pretraining via a teacher-guided knowledge distillation objective. To support evaluation, we further introduce PhenoBench, an expert-verified benchmark designed for phenotype recognition, comprising over 7,800 image–caption pairs covering more than 1,000 phenotypes. Extensive experiments demonstrate that PhenoLIP outperforms previous state-of-the-art baselines, improving upon BiomedCLIP in phenotype classification accuracy by 8.85% and BIOMEDICA in cross-modal retrieval by 15.03%, underscoring the value of integrating phenotype-centric priors into medical VLMs for structured and interpretable medical image understanding.
[NLP-59] Uncertainty Drives Social Bias Changes in Quantized Large Language Models
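[Quick read]: This paper addresses how post-training quantization silently changes the social biases of large language models, in particular subtle but important changes that aggregate metrics fail to capture. Its core finding is "quantization-induced masked bias flipping": even when aggregate bias scores are unchanged, up to 21% of responses flip between biased and unbiased after quantization. These flips are driven mainly by model uncertainty: high-uncertainty responses are 3-11x more likely to change than confident ones. In addition, 4-bit quantization causes 4-6x more behavioral changes than 8-bit quantization and affects demographic groups asymmetrically, with bias worsening by up to 18.6% for some groups while improving by 14.1% for others, which masks the real risk. The key to the solution is systematic evaluation on the unified benchmark PostTrainingBiasBench, along with the recommendation that quantized models receive targeted bias evaluation and intervention to ensure fairness and reliability in deployment.
Link: https://arxiv.org/abs/2602.06181
Authors: Stanley Z. Hua, Sanae Lotfi, Irene Y. Chen
Affiliations: UC Berkeley & UCSF; Meta Superintelligence Labs
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 6 figures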
Abstract:Post-training quantization reduces the computational cost of large language models but fundamentally alters their social biases in ways that aggregate metrics fail to capture. We present the first large-scale study of 50 quantized models evaluated on PostTrainingBiasBench, a unified benchmark of 13 closed- and open-ended bias datasets. We identify a phenomenon we term quantization-induced masked bias flipping, in which up to 21% of responses flip between biased and unbiased states after quantization, despite showing no change in aggregate bias scores. These flips are strongly driven by model uncertainty, where the responses with high uncertainty are 3-11x more likely to change than the confident ones. Quantization strength amplifies this effect, with 4-bit quantized models exhibiting 4-6x more behavioral changes than 8-bit quantized models. Critically, these changes create asymmetric impacts across demographic groups, where bias can worsen by up to 18.6% for some groups while improving by 14.1% for others, yielding misleadingly neutral aggregate outcomes. Larger models show no consistent robustness advantage, and group-specific shifts vary unpredictably across model families. Our findings demonstrate that compression fundamentally alters bias patterns, requiring crucial post-quantization evaluation and interventions to ensure reliability in practice.
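To make the "masked" part of this finding concrete, the toy sketch below shows how per-response flips can coexist with an unchanged aggregate score. The labels are invented for illustration (1 = biased, 0 = unbiased) and the helper names are hypothetical; this is not the paper's benchmark or metric.

```python
def aggregate_bias_rate(labels):
    """Fraction of responses labelled biased (1) rather than unbiased (0)."""
    return sum(labels) / len(labels)


def flip_rate(before, after):
    """Fraction of responses whose label differs between the two model versions."""
    if len(before) != len(after):
        raise ValueError("label lists must be aligned per prompt")
    return sum(b != a for b, a in zip(before, after)) / len(before)


# Invented per-prompt labels for a full-precision model and its quantized version.
before = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
after = [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]

if __name__ == "__main__":
    print("aggregate before:", aggregate_bias_rate(before))
    print("aggregate after: ", aggregate_bias_rate(after))
    print("flip rate:       ", flip_rate(before, after))
```

In this toy data two responses flip in each direction, so the aggregate rate is identical before and after while 40% of individual responses changed, which is why a per-response comparison is needed in addition to the aggregate score.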
[NLP-60] Large Language Model Reasoning Failures
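[Quick read]: This paper aims to systematically understand and address the significant reasoning failures of large language models (LLMs), which occur frequently even in seemingly simple scenarios. The key to the solution is a new categorization framework that divides reasoning into embodied and non-embodied types, with non-embodied reasoning further split into informal (intuitive) and formal (logical) reasoning. Along a complementary axis, reasoning failures are classified as fundamental (intrinsic to LLM architectures), application-specific, or robustness-related. For each failure the survey gives a definition, root causes, and mitigation strategies, and it consolidates existing work into a structured view of systemic weaknesses in LLM reasoning to guide the development of stronger, more reliable, and more robust reasoning capabilities.
Link: https://arxiv.org/abs/2602.06176
Authors: Peiyang Song, Pengrui Han, Noah Goodman
Affiliations: California Institute of Technology; Stanford University; Carleton College
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Repository: this https URL. Published at TMLR 2026 with Survey Certification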
Abstract:Large Language Models (LLMs) have exhibited remarkable reasoning capabilities, achieving impressive results across a wide range of tasks. Despite these advances, significant reasoning failures persist, occurring even in seemingly simple scenarios. To systematically understand and address these shortcomings, we present the first comprehensive survey dedicated to reasoning failures in LLMs. We introduce a novel categorization framework that distinguishes reasoning into embodied and non-embodied types, with the latter further subdivided into informal (intuitive) and formal (logical) reasoning. In parallel, we classify reasoning failures along a complementary axis into three types: fundamental failures intrinsic to LLM architectures that broadly affect downstream tasks; application-specific limitations that manifest in particular domains; and robustness issues characterized by inconsistent performance across minor variations. For each reasoning failure, we provide a clear definition, analyze existing studies, explore root causes, and present mitigation strategies. By unifying fragmented research efforts, our survey provides a structured perspective on systemic weaknesses in LLM reasoning, offering valuable insights and guiding future research towards building stronger, more reliable, and robust reasoning capabilities. We additionally release a comprehensive collection of research works on LLM reasoning failures, as a GitHub repository at this https URL, to provide an easy entry point to this area.
[NLP-61] Stop the Flip-Flop: Context-Preserving Verification for Fast Revocable Diffusion Decoding
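[Quick read]: This paper addresses the quality degradation caused by aggressive parallelism in parallel diffusion decoding, and in particular the "flip-flop oscillations" seen in existing revocable decoding schemes, where already verified tokens are repeatedly remasked and restored, which both weakens the conditioning context for parallel drafting and wastes the revision budget. The key to the solution is COVER (Cache Override Verification for Efficient Revision), which performs leave-one-out verification and stable drafting in a single forward pass. It builds two attention views via KV cache override: selected seed tokens are masked for verification while their cached key-value states are injected to preserve context, and a closed-form diagonal correction prevents self-leakage at the seed positions. COVER also uses a stability-aware score to select seeds and adapt the number verified per step, which markedly reduces unnecessary revisions and speeds up decoding while preserving output quality.
Link: https://arxiv.org/abs/2602.06161
Authors: Yanzheng Xiang, Lan Wei, Yizhen Yao, Qinglin Zhu, Hanqi Yan, Chen Jin, Philip Alexander Teare, Dandan Zhang, Lin Gui, Amrutha Saseendran, Yulan He
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: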
Abstract:Parallel diffusion decoding can accelerate diffusion language model inference by unmasking multiple tokens per step, but aggressive parallelism often harms quality. Revocable decoding mitigates this by rechecking earlier tokens, yet we observe that existing verification schemes frequently trigger flip-flop oscillations, where tokens are remasked and later restored unchanged. This behaviour slows inference in two ways: remasking verified positions weakens the conditioning context for parallel drafting, and repeated remask cycles consume the revision budget with little net progress. We propose COVER (Cache Override Verification for Efficient Revision), which performs leave-one-out verification and stable drafting within a single forward pass. COVER constructs two attention views via KV cache override: selected seeds are masked for verification, while their cached key value states are injected for all other queries to preserve contextual information, with a closed form diagonal correction preventing self leakage at the seed positions. COVER further prioritises seeds using a stability aware score that balances uncertainty, downstream influence, and cache drift, and it adapts the number of verified seeds per step. Across benchmarks, COVER markedly reduces unnecessary revisions and yields faster decoding while preserving output quality.
[NLP-62] MoSE: Mixture of Slimmable Experts for Efficient and Adaptive Language Models
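[Quick read]: This paper addresses the large discontinuities in the accuracy-compute trade-off of Mixture-of-Experts (MoE) models: once an expert is selected it must be executed in full, which rules out fine-grained control of compute. The key to the solution is Mixture of Slimmable Experts (MoSE), in which each expert has a nested, variable-width structure whose executed width can be adjusted at inference time according to the budget. Conditional computation thus applies not only to which experts are selected but also to how much of each expert is used, so a single pretrained model supports a more continuous accuracy-compute trade-off curve. Training combines multi-width training with standard MoE objectives, and a lightweight test-time training mechanism learns to map router confidence to expert widths. Experiments show that MoSE matches or improves on standard MoE at full width and reaches comparable performance with significantly fewer FLOPs, improving the accuracy-cost Pareto frontier.
Link: https://arxiv.org/abs/2602.06154
Authors: Nurbek Tastan, Stefanos Laskaridis, Karthik Nandakumar, Samuel Horvath
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: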
Abstract:Mixture-of-Experts (MoE) models scale large language models efficiently by sparsely activating experts, but once an expert is selected, it is executed fully. Hence, the trade-off between accuracy and computation in an MoE model typically exhibits large discontinuities. We propose Mixture of Slimmable Experts (MoSE), an MoE architecture in which each expert has a nested, slimmable structure that can be executed at variable widths. This enables conditional computation not only over which experts are activated, but also over how much of each expert is utilized. Consequently, a single pretrained MoSE model can support a more continuous spectrum of accuracy-compute trade-offs at inference time. We present a simple and stable training recipe for slimmable experts under sparse routing, combining multi-width training with standard MoE objectives. During inference, we explore strategies for runtime width determination, including a lightweight test-time training mechanism that learns how to map router confidence/probabilities to expert widths under a fixed budget. Experiments on GPT models trained on OpenWebText demonstrate that MoSE matches or improves upon standard MoE at full width and consistently shifts the Pareto frontier for accuracy vs. cost, achieving comparable performance with significantly fewer FLOPs.
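The nested, slimmable structure can be pictured with a small sketch: a feed-forward expert whose narrower versions reuse the first hidden units of the full one. This is a pure-Python toy under assumed details (ReLU activation, prefix-style nesting, the names `make_expert` and `expert_forward`), not the architecture or training recipe from the paper.

```python
import random


def make_expert(d_model, d_hidden, seed=0):
    """Create toy weights: w_in has one row per hidden unit, w_out one row per output dim."""
    rng = random.Random(seed)
    w_in = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(d_hidden)]
    w_out = [[rng.gauss(0.0, 1.0) for _ in range(d_hidden)] for _ in range(d_model)]
    return w_in, w_out


def expert_forward(expert, x, width_frac=1.0):
    """Run the expert using only the first k hidden units (assumed prefix nesting)."""
    w_in, w_out = expert
    k = max(1, int(width_frac * len(w_in)))  # number of active hidden units
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in[:k]]
    return [sum(row[j] * hidden[j] for j in range(k)) for row in w_out]


def approx_flops(d_model, d_hidden, width_frac):
    """Multiply-accumulate count of the two projections at the given width."""
    k = max(1, int(width_frac * d_hidden))
    return 2 * d_model * k


if __name__ == "__main__":
    expert = make_expert(d_model=4, d_hidden=8)
    x = [0.5, -1.0, 0.25, 2.0]
    for frac in (0.25, 0.5, 1.0):
        print(frac, approx_flops(4, 8, frac), expert_forward(expert, x, frac))
```

Because every width shares the same leading weights, one set of parameters serves all compute budgets, and the cost of a token grows linearly with the chosen width.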
[NLP-63] Protean Compiler: An Agile Framework to Drive Fine-grain Phase Ordering
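[Quick read]: This paper addresses the compiler phase-ordering problem, which has been open since the late 1970s because its optimization space is vast and unbounded, making a globally optimal solution hard to find. Traditional approaches rely on hand-coded, locally optimized heuristics tuned on a small number of benchmarks, and they require substantial retuning when the benchmark suite changes. The proposed Protean Compiler framework builds fine-grained, machine-learning-driven optimization decisions directly into LLVM. Its core contributions are a complete library of more than 140 static feature collection methods and easy integration with third-party machine learning frameworks and large language models (LLMs). On Cbench it reports speedups of up to 4.1% on average and up to 15.7% on select applications, at the cost of only a few extra seconds of build time.
Link: https://arxiv.org/abs/2602.06142
Authors: Amir H. Ashouri, Shayan Shirahmad Gale Bagi, Kavin Satheeskumar, Tejas Srikanth, Jonathan Zhao, Ibrahim Saidoun, Ziwen Wang, Bryan Chan, Tomasz S. Czajkowski
Affiliations: Huawei Technologies Canada
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Performance (cs.PF)
Comments: Version 1 - Submitted for a possible publication in 2026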
Abstract:The phase ordering problem has been a long-standing challenge since the late 1970s, yet it remains open because its optimization space is vast and unbounded, so it has no finite solution; one can only limit the scope by reducing the number and the length of optimizations. Traditionally, such locally optimized decisions are made by hand-coded algorithms tuned for a small number of benchmarks, often requiring significant effort to be retuned when the benchmark suite changes. In the past 20 years, machine learning has been employed to construct performance models to improve the selection and ordering of compiler optimizations; however, these approaches are not built into the compiler seamlessly and have never been leveraged at the fine-grained scope of code segments. This paper presents Protean Compiler, an agile framework that equips LLVM with built-in phase-ordering capabilities at a fine-grained scope. The framework also comprises a complete library of more than 140 handcrafted static feature collection methods at varying scopes, and the experimental results showcase speedup gains of up to 4.1% on average and up to 15.7% on select Cbench applications with respect to LLVM’s O3, while incurring just a few extra seconds of build time on Cbench. Additionally, Protean Compiler allows for easy integration with third-party ML frameworks and other Large Language Models, and this two-step optimization shows speedups of 10.1% and 8.5% with respect to O3 on Cbench’s Susan and Jpeg applications. Protean Compiler is seamlessly integrated into LLVM and can be used as a new, enhanced, full-fledged compiler. We plan to release the project to the open-source community in the near future.
[NLP-64] Self-Improving World Modelling with Latent Actions
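[Quick read]: This paper addresses the difficulty of learning an internal world model for large language models (LLMs) and vision-language models (VLMs) without action-labelled trajectories, that is, how to infer the relationship between actions and state transitions from state-only sequences. The core solution is the SWIRL framework, which treats actions as a latent variable and alternates between optimizing Forward World Modelling (FWM) and Inverse Dynamics Modelling (IDM). The FWM update maximizes conditional mutual information to make state prediction more consistent, while the IDM update maximizes the evidence lower bound (ELBO) to explain observed state transitions. Both modules are trained with reinforcement learning (GRPO), using the other (frozen) model's log-probability as the reward signal, which amounts to coordinate ascent and comes with theoretical learnability guarantees.
Link: https://arxiv.org/abs/2602.06130
Authors: Yifu Qiu, Zheng Zhao, Waylon Li, Yftah Ziser, Anna Korhonen, Shay B. Cohen, Edoardo M. Ponti
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: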
Abstract:Internal modelling of the world – predicting transitions between previous states $X$ and next states $Y$ under actions $Z$ – is essential to reasoning and planning for LLMs and VLMs. Learning such models typically requires costly action-labelled trajectories. We propose SWIRL, a self-improvement framework that learns from state-only sequences by treating actions as a latent variable and alternating between Forward World Modelling (FWM) $P_\theta(Y \mid X, Z)$ and Inverse Dynamics Modelling (IDM) $Q_\phi(Z \mid X, Y)$. SWIRL iterates two phases: (1) Variational Information Maximisation, which updates the FWM to generate next states that maximise conditional mutual information with latent actions given prior states, encouraging identifiable consistency; and (2) ELBO Maximisation, which updates the IDM to explain observed transitions, effectively performing coordinate ascent. Both models are trained with reinforcement learning (specifically, GRPO) with the opposite frozen model’s log-probability as a reward signal. We provide theoretical learnability guarantees for both updates, and evaluate SWIRL on LLMs and VLMs across multiple environments: single-turn and multi-turn open-world visual dynamics and synthetic textual environments for physics, web, and tool calling. SWIRL achieves gains of 16% on AURORABench, 28% on ByteMorph, 16% on WorldPredictionBench, and 14% on StableToolBench.
[NLP-65] Analyzing Diffusion and Autoregressive Vision Language Models in Multimodal Embedding Space
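[Quick read]: This paper asks whether multimodal diffusion language models (Multimodal dLLMs) can serve as effective multimodal embedding models and thus replace autoregressive vision-language models (Autoregressive VLMs) in tasks such as semantic search and retrieval-augmented generation. The key to the solution is the first systematic study of converting Multimodal dLLMs into embedding models, with a comparative evaluation on three kinds of tasks: classification, visual question answering (VQA), and information retrieval. The results show that the stronger diffusion model, LaViDa, comes close to autoregressive VLMs (gaps under 5 points), whereas MMaDA shows a much larger drop (over 20 points). Further analysis attributes the gap to insufficient image-text alignment in diffusion-based models, which limits their embedding quality.
Link: https://arxiv.org/abs/2602.06056
Authors: Zihang Wang, Siyue Zhang, Yilun Zhao, Jingyi Yang, Tingyu Song, Anh Tuan Luu, Chen Zhao
Affiliations: Nanyang Technological University; Yale University; NYU Shanghai; Alibaba-NTU Singapore Joint Research Institute; University of the Chinese Academy of Sciences; Center for Data Science, New York University
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: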
Abstract:Embedding models are a fundamental component of modern AI systems such as semantic search and retrieval-augmented generation. Recent advances in large foundation models have substantially accelerated the development of embedding models, including those based on Large Language Models (LLMs), Vision Language Models (VLMs), and Multimodal LLMs. More recently, Large Diffusion Language Models (dLLMs) and Multimodal dLLMs have emerged as competitive alternatives to autoregressive models, offering advantages such as bidirectional attention and parallel generation. This progress naturally raises a critical yet unexplored question: can Multimodal dLLMs serve as effective multimodal embedding models? To answer this, we present the first systematic study of converting Multimodal dLLMs into embedding models. We evaluate state-of-the-art Multimodal dLLMs and Autoregressive VLMs across three categories of embedding tasks: classification, visual question answering, and information retrieval. Our results show that Multimodal dLLM embeddings generally underperform their autoregressive VLM counterparts. The stronger diffusion-based model, LaViDa, lags by only 3.5 points on classification, 2.5 points on VQA, and 4.4 points on retrieval tasks, whereas the other diffusion-based model, MMaDA, exhibits substantially larger performance gaps, exceeding 20 points across all tasks. Further analysis reveals insufficient image-text alignment in diffusion-based models, accounting for the observed limitations in their embedding performance.
[NLP-66] Quantifying and Attributing Polarization to Annotator Groups
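[Quick read]: This paper addresses the limitations of existing annotation agreement metrics for inter-group analysis: they are sensitive to group size imbalance and restricted to single-annotation settings, which makes them ill-suited to subjective tasks such as toxicity and hate-speech detection. The key to the solution is a new quantifiable metric, paired with a statistical significance test, that attributes annotation polarization to different annotator groups (for example by race, religion, or education). It enables direct comparison of heavily imbalanced sociodemographic and ideological subgroups and supports multi-label settings. Applied to three hate-speech datasets and one toxicity detection dataset, the method reveals systematic patterns of polarization across annotator groups; the authors also estimate the minimum number of annotators needed and provide an open-source Python library.
Link: https://arxiv.org/abs/2602.06055
Authors: Dimitris Tsirmpas, John Pavlopoulos
Affiliations: Athens University of Economics and Business
Categories: Computation and Language (cs.CL)
Comments: 28 pages, 6 tables, 7 figures, 1 algorithm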
Abstract:Current annotation agreement metrics are not well-suited for inter-group analysis, are sensitive to group size imbalances and restricted to single-annotation settings. These restrictions render them insufficient for many subjective tasks such as toxicity and hate-speech detection. For this reason, we introduce a quantifiable metric, paired with a statistical significance test, that attributes polarization to various annotator groups. Our metric enables direct comparisons between heavily imbalanced sociodemographic and ideological subgroups across different datasets and tasks, while also enabling analysis on multi-label settings. We apply this metric to three datasets on hate speech, and one on toxicity detection, discovering that: (1) Polarization is strongly and persistently attributed to annotator race, especially on the hate speech task. (2) Religious annotators do not fundamentally disagree with each other, but do with other annotators, a trend that is gradually diminished and then reversed with irreligious annotators. (3) Less educated annotators are more subjective, while educated ones tend to broadly agree more between themselves. Overall, our results reflect current findings around annotation patterns for various subgroups. Finally, we estimate the minimum number of annotators needed to obtain robust results, and provide an open-source Python library that implements our metric.
[NLP-67] What Is Novel? A Knowledge-Driven Framework for Bias-Aware Literature Originality Evaluation
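[Quick read]: This paper addresses the fact that research novelty assessment in peer review is highly subjective, relies on implicit judgment, and is rarely grounded in a thorough comparison with prior work. The core solution is a literature-aware novelty assessment framework. Using nearly 80K novelty-annotated review reports from top-tier AI conferences, the authors fine-tune a large language model to learn how human reviewers judge novelty. For a given manuscript, the system extracts structured representations of its ideas, methods, and claims, retrieves semantically related papers, and builds a similarity graph for fine-grained, concept-level comparison; it then produces calibrated novelty scores and human-like explanatory assessments, reducing overestimation and improving consistency.
Link: https://arxiv.org/abs/2602.06054
Authors: Abeer Mostafa, Thi Huyen Nguyen, Zahra Ahmadi
Affiliations: PLRI Medical Informatics Institute; Hannover Medical School; L3S Research Center
Categories: Computation and Language (cs.CL)
Comments: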
Abstract:Assessing research novelty is a core yet highly subjective aspect of peer review, typically based on implicit judgment and incomplete comparison to prior work. We introduce a literature-aware novelty assessment framework that explicitly learns how humans judge novelty from peer-review reports and grounds these judgments in structured comparison to existing research. Using nearly 80K novelty-annotated reviews from top-tier AI conferences, we fine-tune a large language model to capture reviewer-aligned novelty evaluation behavior. For a given manuscript, the system extracts structured representations of its ideas, methods, and claims, retrieves semantically related papers, and constructs a similarity graph that enables fine-grained, concept-level comparison to prior work. Conditioning on this structured evidence, the model produces calibrated novelty scores and human-like explanatory assessments, reducing overestimation and improving consistency relative to existing approaches.
[NLP-68] PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models
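[Quick read]: This paper addresses the limitation that existing full-duplex speech models are restricted to a fixed role and a single voice, which limits their adaptability to structured, role-driven real-world applications and personalized interaction. The key to the solution is PersonaPlex, a full-duplex conversational speech model with hybrid system prompts that combine role conditioning via text prompts with voice cloning via speech samples, giving flexible control over both role and voice. The model is trained on a large-scale synthetic dataset generated with open-source large language models (LLMs) and text-to-speech (TTS) models. Experiments show that it surpasses state-of-the-art full-duplex speech models and hybrid LLM-based speech systems in role adherence, speaker similarity, latency, and naturalness.
Link: https://arxiv.org/abs/2602.06053
Authors: Rajarshi Roy, Jonathan Raiman, Sang-gil Lee, Teodor-Dumitru Ene, Robert Kirby, Sungwon Kim, Jaehyeon Kim, Bryan Catanzaro
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: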
Abstract:Recent advances in duplex speech models have enabled natural, low-latency speech-to-speech interactions. However, existing models are restricted to a fixed role and voice, limiting their ability to support structured, role-driven real-world applications and personalized interactions. In this work, we introduce PersonaPlex, a duplex conversational speech model that incorporates hybrid system prompts, combining role conditioning with text prompts and voice cloning with speech samples. PersonaPlex is trained on a large-scale synthetic dataset of paired prompts and user-agent conversations, generated with open-source large language models (LLM) and text-to-speech (TTS) models. To evaluate role conditioning in real-world settings, we extend the Full-Duplex-Bench benchmark beyond a single assistant role to multi-role customer service scenarios. Experiments show that PersonaPlex achieves strong role-conditioned behavior, voice-conditioned speech, and natural conversational responsiveness, surpassing state-of-the-art duplex speech models and hybrid large language model-based speech systems in role adherence, speaker similarity, latency, and naturalness.
[NLP-69] Rethinking Memory Mechanisms of Foundation Agents in the Second Half
[Quick read]: This paper addresses the limited real-world utility of current AI agents in long-horizon, dynamic, and user-dependent environments, where the core challenge is context explosion: agents must continuously accumulate, manage, and selectively reuse large volumes of interaction information. The key to the solution is a unified view of foundation agent memory along three dimensions: memory substrate (internal and external), cognitive mechanism (episodic, semantic, sensory, working, and procedural memory), and memory subject (agent-centric and user-centric). The survey also analyzes memory operations and learning policies and reviews benchmarks and metrics for evaluating memory utility.
Link: https://arxiv.org/abs/2602.06052
Authors: Wei-Chieh Huang, Weizhi Zhang, Yueqing Liang, Yuanchen Bei, Yankai Chen, Tao Feng, Xinyu Pan, Zhen Tan, Yu Wang, Tianxin Wei, Shanglin Wu, Ruiyao Xu, Liangwei Yang, Rui Yang, Wooseong Yang, Chin-Yuan Yeh, Hanrong Zhang, Haozhen Zhang, Siqi Zhu, Henry Peng Zou, Wanjia Zhao, Song Wang, Wujiang Xu, Zixuan Ke, Zheng Hui, Dawei Li, Yaozu Wu, Langzhou He, Chen Wang, Xiongxiao Xu, Baixiang Huang, Juntao Tan, Shelby Heinecke, Huan Wang, Caiming Xiong, Ahmed A. Metwally, Jun Yan, Chen-Yu Lee, Hanqing Zeng, Yinglong Xia, Xiaokai Wei, Ali Payani, Yu Wang, Haitong Ma, Wenya Wang, Chengguang Wang, Yu Zhang, Xin Wang, Yongfeng Zhang, Jiaxuan You, Hanghang Tong, Xiao Luo, Yizhou Sun, Wei Wang, Julian McAuley, James Zou, Jiawei Han, Philip S. Yu, Kai Shu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The research of artificial intelligence is undergoing a paradigm shift from prioritizing model innovations over benchmark scores towards emphasizing problem definition and rigorous real-world evaluation. As the field enters the “second half,” the central challenge becomes real utility in long-horizon, dynamic, and user-dependent environments, where agents face context explosion and must continuously accumulate, manage, and selectively reuse large volumes of information across extended interactions. Memory, with hundreds of papers released this year, therefore emerges as the critical solution to fill the utility gap. In this survey, we provide a unified view of foundation agent memory along three dimensions: memory substrate (internal and external), cognitive mechanism (episodic, semantic, sensory, working, and procedural), and memory subject (agent- and user-centric). We then analyze how memory is instantiated and operated under different agent topologies and highlight learning policies over memory operations. Finally, we review evaluation benchmarks and metrics for assessing memory utility, and outline various open challenges and future directions.
[NLP-70] CAST: Character-and-Scene Episodic Memory for Agents
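[Quick read]: This paper addresses a weakness of current agent memory systems in handling episodic memory: most methods focus on semantic recall and model experience as key-value, vector, or graph structures, which makes it hard to represent and retrieve coherent events grounded in who, when, and where. The key to the solution is CAST, a Character-and-Scene based memory architecture that represents episodic memory by constructing 3D scenes (time/place/topic) and organizing them into character profiles. CAST complements this with a graph-based semantic memory, forming a dual-memory design that models event coherence and supports efficient retrieval.
Link: https://arxiv.org/abs/2602.06051
Authors: Kexin Ma, Bojun Li, Yuhua Tang, Ruochun Jin, Liting Sun
Affiliations: National University of Defense Technology
Categories: Computation and Language (cs.CL)
Comments: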
Abstract:Episodic memory, a central component of human memory, is the ability to recall coherent events grounded in who, when, and where. However, most agent memory systems emphasize only semantic recall and treat experience as structures such as key-value stores, vectors, or graphs, which makes it hard for them to represent and retrieve coherent events. To address this challenge, we propose a Character-and-Scene based memory architecture (CAST) inspired by dramatic theory. Specifically, CAST constructs 3D scenes (time/place/topic) and organizes them into character profiles that summarize the events of a character to represent episodic memory. Moreover, CAST complements this episodic memory with a graph-based semantic memory, which yields a robust dual memory design. Experiments demonstrate that CAST improves on baselines by an average of 8.11% F1 and 10.21% J (LLM-as-a-Judge) across various datasets, especially on open and time-sensitive conversational questions.
[NLP-71] Relevance-aware Multi-context Contrastive Decoding for Retrieval-augmented Visual Question Answering WACV2026
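[Quick read]: This paper addresses the lack of detailed entity-specific knowledge in Large Vision Language Models (LVLMs), and in particular the fact that, when an external knowledge base is used for Retrieval-augmented Generation (RAG), existing decoding methods neither fully exploit multiple relevant contexts nor suppress the negative effects of irrelevant ones. The key to the solution is a new decoding method, Relevance-aware Multi-context Contrastive Decoding (RMCD), which generates a prediction for each retrieved context and combines the predictions with weights based on each context's relevance to the question, aggregating useful information from relevant contexts while counteracting irrelevant ones. The method replaces the decoding step of an existing LVLM without additional training, achieves the best performance on three knowledge-intensive visual question answering benchmarks, and is robust to retrieval quality.
Link: https://arxiv.org/abs/2602.06050
Authors: Jongha Kim, Byungoh Ko, Jeehye Na, Jinsung Yoon, Hyunwoo J. Kim
Affiliations: Korea University; KAIST; Google Cloud AI
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: WACV 2026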
Abstract:Despite the remarkable capabilities of Large Vision Language Models (LVLMs), they still lack detailed knowledge about specific entities. Retrieval-augmented Generation (RAG) is a widely adopted solution that enhances LVLMs by providing additional contexts from an external Knowledge Base. However, we observe that previous decoding methods for RAG are sub-optimal as they fail to sufficiently leverage multiple relevant contexts and suppress the negative effects of irrelevant contexts. To this end, we propose Relevance-aware Multi-context Contrastive Decoding (RMCD), a novel decoding method for RAG. RMCD outputs a final prediction by combining outputs predicted with each context, where each output is weighted based on its relevance to the question. By doing so, RMCD effectively aggregates useful information from multiple relevant contexts while also counteracting the negative effects of irrelevant ones. Experiments show that RMCD consistently outperforms other decoding methods across multiple LVLMs, achieving the best performance on three knowledge-intensive visual question-answering benchmarks. Also, RMCD can be simply applied by replacing the decoding method of LVLMs without additional training. Analyses also show that RMCD is robust to the retrieval results, consistently performing the best across the weakest to the strongest retrieval results. Code is available at this https URL.
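The idea of weighting each per-context output by its relevance can be sketched as a simple mixture of next-token logits. The version below uses softmax-normalized relevance scores as weights, which is an assumption made for illustration: the abstract does not give RMCD's actual weighting or its contrastive term, and the names `softmax` and `combine_logits` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_logits(per_context_logits, relevance_scores):
    """Relevance-weighted mixture of next-token logits, one logit vector per context.

    Assumption: weights are a softmax of the relevance scores. This only
    down-weights less relevant contexts; it does not subtract them as a
    contrastive method could.
    """
    if len(per_context_logits) != len(relevance_scores):
        raise ValueError("need one relevance score per context")
    weights = softmax(relevance_scores)
    vocab_size = len(per_context_logits[0])
    return [
        sum(w * logits[v] for w, logits in zip(weights, per_context_logits))
        for v in range(vocab_size)
    ]


if __name__ == "__main__":
    logits_relevant = [2.0, 0.0, 0.0]    # context A favours token 0
    logits_irrelevant = [0.0, 3.0, 0.0]  # context B favours token 1
    mixed = combine_logits([logits_relevant, logits_irrelevant], [5.0, 0.0])
    print(mixed, "argmax:", mixed.index(max(mixed)))
```

When context A is scored as far more relevant, the combined distribution follows A; with equal scores, the more confident context B wins, which shows why the relevance weights matter.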
[NLP-72] Recontextualizing Famous Quotes for Brand Slogan Generation
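[Quick read]: This paper addresses advertising fatigue caused by repeated slogans: existing approaches based on Large Language Models (LLMs) tend to produce stylistically monotonous outputs that lack a brand persona and read as obviously machine-generated. The key to the solution is a new paradigm that generates slogans by recontextualizing famous quotes related to the brand persona, using the concision, rhetorical richness, and depth of quotes to balance novelty with familiarity. Technically, the paper designs a modular framework that decomposes slogan generation into interpretable subtasks (quote matching, structural decomposition, vocabulary replacement, and remix generation) and reports improvements over three state-of-the-art LLM baselines in diversity, novelty, emotional impact, and human preference.
Link: https://arxiv.org/abs/2602.06049
Authors: Ziao Yang,Zizhang Chen,Lei Zhang,Hongfu Liu
Affiliations: Brandeis University; Adobe
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: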
Abstract:Slogans are concise and memorable catchphrases that play a crucial role in advertising by conveying brand identity and shaping public perception. However, advertising fatigue reduces the effectiveness of repeated slogans, creating a growing demand for novel, creative, and insightful slogan generation. While recent work leverages large language models (LLMs) for this task, existing approaches often produce stylistically redundant outputs that lack a clear brand persona and appear overtly machine-generated. We argue that effective slogans should balance novelty with familiarity and propose a new paradigm that recontextualizes persona-related famous quotes for slogan generation. Well-known quotes naturally align with slogan-length text, employ rich rhetorical devices, and offer depth and insight, making them a powerful resource for creative generation. Technically, we introduce a modular framework that decomposes slogan generation into interpretable subtasks, including quote matching, structural decomposition, vocabulary replacement, and remix generation. Extensive automatic and human evaluations demonstrate marginal improvements in diversity, novelty, emotional impact, and human preference over three state-of-the-art LLM baselines.
[NLP-73] Quantum Attention by Overlap Interference: Predicting Sequences from Classical and Many-Body Quantum Data
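[Quick read]: This paper addresses how to implement the self-attention mechanism of classical Transformer architectures efficiently in a quantum computing setting, in particular how to support nonlinear interactions natively in a quantum system and output the loss function directly, avoiding the bottleneck of classical post-processing. The key to the solution is a variational quantum self-attention (QSA) scheme: the required nonlinearity is realized through interference of quantum state overlaps, and the Renyi-1/2 cross-entropy loss is returned directly as the expectation value of an observable, with no need to decode amplitude-encoded predictions. A trainable data embedding ties state overlaps to data-level similarity, yielding a differentiable attention mechanism suited to quantum dynamical modeling. The scheme shows a gate-complexity advantage: when the sequence length $T$ is much larger than the embedding dimension $d$, the quantum implementation scales as $O(T d^2)$ versus $O(T^2 d)$ classically.
Link: https://arxiv.org/abs/2602.06699
Authors: Alessio Pecilli,Matteo Rosati
Affiliations: Università degli Studi Roma Tre
Categories: Quantum Physics (quant-ph); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 4 + 1 pages, 2 figures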
Abstract:We propose a variational quantum implementation of self-attention (QSA), the core operation in transformers and large language models, which predicts future elements of a sequence by forming overlap-weighted combinations of past data. At variance with previous approaches, our QSA realizes the required nonlinearity through interference of state overlaps and returns a Renyi-1/2 cross-entropy loss directly as the expectation value of an observable, avoiding the need to decode amplitude-encoded predictions into classical logits. Furthermore, QSA naturally accommodates a constrained, trainable data-embedding that ties quantum state overlaps to data-level similarities. We find a gate complexity dominant scaling O(T d^2) for QSA, versus O(T^2 d) classically, suggesting an advantage in the practical regime where the sequence length T dominates the embedding size d. In simulations, we show that our QSA-based quantum transformer learns sequence prediction on classical data and on many-body transverse-field Ising quantum trajectories, establishing trainable attention as a practical primitive for quantum dynamical modeling.
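To make the quoted scalings concrete, the short sketch below compares the two leading-order terms, O(T d^2) and O(T^2 d), for a given sequence length and embedding size. It only evaluates those dominant terms; constant factors and lower-order contributions are ignored, and the function name `leading_order_counts` and the example values of T and d are illustrative assumptions rather than figures from the paper.

```python
def leading_order_counts(T, d):
    """Return the dominant-term counts quoted in the abstract.

    quantum:   T * d**2  (QSA gate-complexity scaling)
    classical: T**2 * d  (classical self-attention scaling)
    Constants and lower-order terms are deliberately left out.
    """
    quantum = T * d ** 2
    classical = T ** 2 * d
    return quantum, classical

# Example regime where the sequence length dominates the embedding size.
T, d = 4096, 64
quantum, classical = leading_order_counts(T, d)
ratio = classical / quantum  # equals T / d at leading order
```

The ratio of the two terms is T / d, which is why the suggested advantage appears only when the sequence length is much larger than the embedding dimension.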
[NLP-74] STACodec: Semantic Token Assignment for Balancing Acoustic Fidelity and Semantic Information in Audio Codecs ICASSP2026
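[Quick read]: This paper addresses the difficulty that traditional neural audio codecs preserve acoustic detail but fail to incorporate semantic information; existing hybrid codecs that introduce semantic information through distillation often degrade reconstruction, so acoustic fidelity and semantic capability are hard to achieve together. The key to the solution is STACodec, which uses Semantic Token Assignment (STA) to embed semantic information from self-supervised learning (SSL) models into the first layer of residual vector quantization (RVQ-1), modeling semantic and acoustic features in a unified way. It further introduces a Semantic Pre-Distillation (SPD) module that predicts semantic tokens directly at inference time and assigns them to the RVQ-1 layer, reducing reliance on SSL-based semantic tokenizers and improving efficiency.
Link: https://arxiv.org/abs/2602.06180
Authors: Kaiyuan Zhang,Mohan Shi,Eray Eren,Natarajan Balaji Shankar,Zilai Wang,Abeer Alwan
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Comments: ICASSP 2026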
Abstract:Neural audio codecs are widely used for audio compression and can be integrated into token-based language models. Traditional codecs preserve acoustic details well but lack semantic information. Recent hybrid codecs attempt to incorporate semantic information through distillation, but this often degrades reconstruction performance, making it difficult to achieve both. To address this limitation, we introduce STACodec, a unified codec that integrates semantic information from self-supervised learning (SSL) models into the first layer of residual vector quantization (RVQ-1) via semantic token assignment (STA). To further eliminate reliance on SSL-based semantic tokenizers and improve efficiency during inference, we propose a semantic pre-distillation (SPD) module, which predicts semantic tokens directly for assignment to the first RVQ layer during inference. Experimental results show that STACodec outperforms existing hybrid codecs in both audio reconstruction and downstream semantic tasks, demonstrating a better balance between acoustic fidelity and semantic capability.
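As a rough illustration of what assigning a token to the first RVQ layer could look like, here is a toy residual vector quantizer in which the first-layer index may be supplied externally instead of being chosen by nearest-neighbour search. This is a sketch under stated assumptions, not the STACodec implementation: the `first_token` parameter, the tiny codebooks, and the plain squared-distance search are all hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(vec, codebooks, first_token=None):
    """Toy residual vector quantization.

    Each layer quantizes the residual left by the previous layers.
    If first_token is given, layer 0 uses that index as-is (standing in
    for an externally assigned semantic token) and the remaining layers
    encode whatever residual is left over.
    """
    residual = list(vec)
    tokens = []
    for layer, codebook in enumerate(codebooks):
        if layer == 0 and first_token is not None:
            idx = first_token
        else:
            idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual

codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],                 # layer 1 (RVQ-1)
    [[0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]],   # layer 2
]
tokens, residual = rvq_encode([1.5, 0.5], codebooks, first_token=1)
```

The point of the sketch is that later layers compensate for whatever the assigned first-layer token leaves unexplained, which is one way a codec could carry a semantic token while still reconstructing acoustic detail.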
[NLP-75] Deep networks learn to parse uniform-depth context-free languages from local statistics
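[Quick read]: The core question this paper addresses is how linguistic structure can be learned from sentence data alone: which data statistics allow Large Language Models (LLMs) to parse text and represent semantics independently of surface form without annotations, and how the amount of data required (sample complexity) depends on those statistics. The key to the solution is a tunable class of Probabilistic Context-Free Grammars (PCFGs) in which the degree of ambiguity and the correlation structure across scales can be controlled precisely, together with an inference algorithm inspired by the structure of deep convolutional networks that ties learnability directly to specific language statistics; the predictions are validated empirically on deep convolutional and transformer architectures. The overall framework shows how correlations at different scales lift local ambiguities and so enable hierarchical representations to emerge.
Link: https://arxiv.org/abs/2602.06065
Authors: Jack T. Parley,Francesco Cagnetta,Matthieu Wyart
Affiliations: Unknown
Categories: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: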
Abstract:Understanding how the structure of language can be learned from sentences alone is a central question in both cognitive science and machine learning. Studies of the internal representations of Large Language Models (LLMs) support their ability to parse text when predicting the next word, while representing semantic notions independently of surface form. Yet, which data statistics make these feats possible, and how much data is required, remain largely unknown. Probabilistic context-free grammars (PCFGs) provide a tractable testbed for studying these questions. However, prior work has focused either on the post-hoc characterization of the parsing-like algorithms used by trained networks; or on the learnability of PCFGs with fixed syntax, where parsing is unnecessary. Here, we (i) introduce a tunable class of PCFGs in which both the degree of ambiguity and the correlation structure across scales can be controlled; (ii) provide a learning mechanism – an inference algorithm inspired by the structure of deep convolutional networks – that links learnability and sample complexity to specific language statistics; and (iii) validate our predictions empirically across deep convolutional and transformer-based architectures. Overall, we propose a unifying framework where correlations at different scales lift local ambiguities, enabling the emergence of hierarchical representations of the data.
Computer Vision
[CV-0] MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images
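[Quick read]: This paper addresses three bottlenecks that limit the clinical use of Multimodal Large Language Models (MLLMs) in medicine: insufficient domain coverage, difficult modality alignment, and weak grounded reasoning. The authors propose MedMO, a medical foundation model built on a general MLLM architecture and trained only on large-scale medical-domain data. The key to the solution is a three-stage training strategy: first, cross-modal pretraining aligns heterogeneous visual encoders with a medical language backbone; second, instruction tuning under multi-task supervision covers captioning, visual question answering (VQA), report generation, retrieval, and bounding-box localization; finally, reinforcement learning with verifiable rewards (factuality checks plus a box-level GIoU reward) strengthens spatial grounding and step-by-step reasoning in complex clinical scenarios. The method substantially improves performance across multiple medical modalities and tasks, with especially large gains in spatial grounding.
Link: https://arxiv.org/abs/2602.06965
Authors: Ankan Deria,Komal Kumar,Adinath Madhavrao Dukre,Eran Segal,Salman Khan,Imran Razzak
Affiliations: Mohamed bin Zayed University of Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 6 figures and 4 tables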
Abstract:Multimodal large language models (MLLMs) have rapidly advanced, yet their adoption in medicine remains limited by gaps in domain coverage, modality alignment, and grounded reasoning. In this work, we introduce MedMO, a medical foundation model built upon a generalized MLLM architecture and trained exclusively on large-scale, domain-specific data. MedMO follows a multi-stage training recipe: (i) cross-modal pretraining to align heterogeneous visual encoders with a medical language backbone; (ii) instruction tuning on multi-task supervision that spans captioning, VQA, report generation, retrieval, and grounded disease localization with bounding boxes; and (iii) reinforcement learning with verifiable rewards that combine factuality checks with a box-level GIoU reward to strengthen spatial grounding and step-by-step reasoning in complex clinical scenarios. MedMO consistently outperforms strong open-source medical MLLMs across multiple modalities and tasks. On VQA benchmarks, MedMO achieves an average accuracy improvement of +13.7% over the baseline and performs within 1.9% of the SOTA Fleming-VL. For text-based QA, it attains +6.9% over the baseline and +14.5% over Fleming-VL. In medical report generation, MedMO delivers significant gains in both semantic and clinical accuracy. Moreover, it exhibits strong grounding capability, achieving an IoU improvement of +40.4 over the baseline and +37.0% over Fleming-VL, underscoring its robust spatial reasoning and localization performance. Evaluations across radiology, ophthalmology, and pathology-microscopy confirm MedMO’s broad cross-modality generalization. We release two versions of MedMO: 4B and 8B. Project is available at this https URL
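The box-level GIoU reward mentioned above builds on the standard Generalized IoU metric, which equals IoU minus the fraction of the smallest enclosing box not covered by the union. The sketch below computes that metric for two axis-aligned boxes; how MedMO turns it into a reward (scaling, thresholds, combination with factuality checks) is not described in the abstract, so this covers only the metric. The `(x1, y1, x2, y2)` box format and the handling of degenerate boxes are assumptions.

```python
def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    Returns a value in [-1, 1]: 1 for identical boxes, negative when the
    boxes are disjoint and far apart relative to their enclosing box.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)

    # Intersection rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter

    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    if union <= 0.0 or hull <= 0.0:
        return 0.0  # assumed convention for degenerate boxes

    return inter / union - (hull - union) / hull

same = giou((0, 0, 1, 1), (0, 0, 1, 1))
apart = giou((0, 0, 1, 1), (2, 0, 3, 1))
```

Unlike plain IoU, the value stays informative for non-overlapping boxes, which is one reason it is a natural candidate for a localization reward.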
[CV-1] CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation
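[Quick read]: This paper addresses decoupled control of scene and subject dynamics in cinematic video generation: how to synthesize high-quality videos with dynamic subjects and controlled camera motion while keeping the static scene consistent, lowering the cost of live-action production. The core solution is the CineScene framework, whose key innovation is an implicit 3D-aware scene-context conditioning mechanism: multiple static scene images are encoded into visual representations by VGGT, and the resulting spatial priors are injected into a pretrained text-to-video generation model through additional context concatenation, enabling video synthesis with a controllable camera trajectory, consistent scenes, and dynamic subjects.
Link: https://arxiv.org/abs/2602.06959
Authors: Kaiyi Huang,Yukun Huang,Yu Li,Jianhong Bai,Xintao Wang,Zinan Lin,Xuefei Ning,Jiwen Yu,Pengfei Wan,Yu Wang,Xihui Liu
Affiliations: The University of Hong Kong; Kuaishou Technology; Tsinghua University; Zhejiang University; Microsoft Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website: this https URL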
Abstract:Cinematic video production requires control over scene-subject composition and camera movement, but live-action shooting remains costly due to the need for constructing physical sets. To address this, we introduce the task of cinematic video generation with decoupled scene context: given multiple images of a static environment, the goal is to synthesize high-quality videos featuring dynamic subject while preserving the underlying scene consistency and following a user-specified camera trajectory. We present CineScene, a framework that leverages implicit 3D-aware scene representation for cinematic video generation. Our key innovation is a novel context conditioning mechanism that injects 3D-aware features in an implicit way: By encoding scene images into visual representations through VGGT, CineScene injects spatial priors into a pretrained text-to-video generation model by additional context concatenation, enabling camera-controlled video synthesis with consistent scenes and dynamic subjects. To further enhance the model’s robustness, we introduce a simple yet effective random-shuffling strategy for the input scene images during training. To address the lack of training data, we construct a scene-decoupled dataset with Unreal Engine 5, containing paired videos of scenes with and without dynamic subjects, panoramic images representing the underlying static scene, along with their camera trajectories. Experiments show that CineScene achieves state-of-the-art performance in scene-consistent cinematic video generation, handling large camera movements and demonstrating generalization across diverse environments.
[CV-2] DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
[Quick read]: This paper addresses the limited generalization of generative world models on complex, contact-rich robot tasks, especially under limited data coverage and scarce action labels. The key solution is DreamDojo, a foundation world model trained on 44,000 hours of egocentric human video. It introduces continuous latent actions as unified proxy actions, which eases the transfer of interaction knowledge from unlabeled video. Combined with post-training on small-scale target-robot data and a distillation pipeline for acceleration, it reaches a real-time inference speed of 10.81 FPS with improved context consistency while retaining accurate physical understanding and action controllability, supporting applications such as teleoperation, policy evaluation, and model-based planning.
Link: https://arxiv.org/abs/2602.06949
Authors: Shenyuan Gao,William Liang,Kaiyuan Zheng,Ayaan Malik,Seonghyeon Ye,Sihyun Yu,Wei-Cheng Tseng,Yuzhu Dong,Kaichun Mo,Chen-Hsuan Lin,Qianli Ma,Seungjun Nah,Loic Magne,Jiannan Xiang,Yuqi Xie,Ruijie Zheng,Dantong Niu,You Liang Tan,K.R. Zentner,George Kurian,Suneel Indupuru,Pooya Jannaty,Jinwei Gu,Jun Zhang,Jitendra Malik,Pieter Abbeel,Ming-Yu Liu,Yuke Zhu,Joel Jang,Linxi “Jim” Fan
Affiliations: NVIDIA; HKUST; UC Berkeley; UW; Stanford; KAIST; UofT; UCSD; UT Austin
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL
Abstract:Being able to simulate the outcomes of actions in varied environments will revolutionize the development of generalist agents at scale. However, modeling these world dynamics, especially for dexterous robotics tasks, poses significant challenges due to limited data coverage and scarce action labels. As an endeavor towards this end, we introduce DreamDojo, a foundation world model that learns diverse interactions and dexterous controls from 44k hours of egocentric human videos. Our data mixture represents the largest video dataset to date for world model pretraining, spanning a wide range of daily scenarios with diverse objects and skills. To address the scarcity of action labels, we introduce continuous latent actions as unified proxy actions, enhancing interaction knowledge transfer from unlabeled videos. After post-training on small-scale target robot data, DreamDojo demonstrates a strong understanding of physics and precise action controllability. We also devise a distillation pipeline that accelerates DreamDojo to a real-time speed of 10.81 FPS and further improves context consistency. Our work enables several important applications based on generative world models, including live teleoperation, policy evaluation, and model-based planning. Systematic evaluation on multiple challenging out-of-distribution (OOD) benchmarks verifies the significance of our method for simulating open-world, contact-rich tasks, paving the way for general-purpose robot world models.
[CV-3] Reliable Mislabel Detection for Video Capsule Endoscopy Data
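[Quick read]: This paper addresses the fact that the classification performance of deep neural networks in medical imaging is limited by the scarcity of high-quality annotated data, especially in Video Capsule Endoscopy (VCE), where annotation depends on specialist physicians, labeling is costly, and ambiguous class boundaries make mislabels common. The key to the solution is a framework for detecting mislabeled samples in medical datasets: potentially mislabeled samples are identified automatically and then reviewed and re-annotated by three experienced gastroenterologists, improving data quality. Experiments show that anomaly detection on the cleaned datasets improves over current baselines.
Link: https://arxiv.org/abs/2602.06938
Authors: Julia Werner,Julius Oexle,Oliver Bause,Maxime Le Floch,Franz Brinkmann,Hannah Tolle,Jochen Hampe,Oliver Bringmann
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: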
Abstract:The classification performance of deep neural networks relies strongly on access to large, accurately annotated datasets. In medical imaging, however, obtaining such datasets is particularly challenging since annotations must be provided by specialized physicians, which severely limits the pool of annotators. Furthermore, class boundaries can often be ambiguous or difficult to define, which further complicates machine learning-based classification. In this paper, we address this problem and introduce a framework for mislabel detection in medical datasets. This is validated on the two largest publicly available datasets for Video Capsule Endoscopy, an important imaging procedure for examining the gastrointestinal tract based on a video stream of low-resolution images. In addition, potentially mislabeled samples identified by our pipeline were reviewed and re-annotated by three experienced gastroenterologists. Our results show that the proposed framework successfully detects incorrectly labeled data and results in an improved anomaly detection performance after cleaning the datasets compared to current baselines.
[CV-4] Seeing Beyond Redundancy: Task Complexity's Role in Vision Token Specialization in VLLMs
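[Quick read]: This paper tries to explain why Vision Large Language Models (VLLMs) lag well behind their linguistic ability on tasks requiring fine-grained visual information and spatial reasoning. Prior work often attributes this to visual redundancy: high-level visual information is spread uniformly across many tokens, so fine-grained features are discarded. To examine this mechanism, the authors introduce a purpose-built synthetic benchmark dataset for systematically probing different types of visual features, together with metrics that quantify visual redundancy, revealing the relationship between redundancy and the retention of visual information. The key finding comes from fine-tuning VLLMs on complex visual tasks: task complexity and visual compression are linked, and only when the training data contains a sufficient proportion of high-complexity visual content does the model change how it distributes its visual representation and improve on complex visual tasks.
Link: https://arxiv.org/abs/2602.06914
Authors: Darryl Hannan,John Cooper,Dylan White,Yijing Watkins
Affiliations: Pacific Northwest National Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages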
Abstract:Vision capabilities in vision large language models (VLLMs) have consistently lagged behind their linguistic capabilities. In particular, numerous benchmark studies have demonstrated that VLLMs struggle when fine-grained visual information or spatial reasoning is required. However, we do not yet understand exactly why VLLMs struggle so much with these tasks relative to others. Some works have focused on visual redundancy as an explanation, where high-level visual information is uniformly spread across numerous tokens and specific, fine-grained visual information is discarded. In this work, we investigate this premise in greater detail, seeking to better understand exactly how various types of visual information are processed by the model and what types of visual information are discarded. To do so, we introduce a simple synthetic benchmark dataset that is specifically constructed to probe various visual features, along with a set of metrics for measuring visual redundancy, allowing us to better understand the nuances of their relationship. Then, we explore fine-tuning VLLMs on a number of complex visual tasks to better understand how redundancy and compression change based upon the complexity of the data that a model is trained on. We find that there is a connection between task complexity and visual compression, implying that having a sufficient ratio of high complexity visual data is crucial for altering the way that VLLMs distribute their visual representation and consequently improving their performance on complex visual tasks. We hope that this work will provide valuable insights for training the next generation of VLLMs.
[CV-5] PANC: Prior-Aware Normalized Cut for Object Segmentation
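[Quick read]: This paper addresses the non-determinism of existing unsupervised image segmentation methods: results are highly sensitive to initialization, seed order, and threshold heuristics, which makes partitions unstable and hard to reproduce. The key to the solution is PANC (Prior-Aware Normalized Cut), a weakly supervised spectral segmentation framework. Starting from the token-token affinity graph built by TokenCut, it introduces a small number of annotated visual tokens as priors coupled to anchor nodes, manipulating the graph topology so that the spectral eigenspace is biased toward partitions consistent with the annotations. Without any training, and using only 5 to 30 annotated tokens, the method produces stable and controllable object masks and reaches state-of-the-art performance among weakly supervised and unsupervised methods on several benchmark datasets, performing especially well where dense labels are costly or intra-class differences are subtle.
Link: https://arxiv.org/abs/2602.06912
Authors: Juan Gutiérrez,Victor Gutiérrez-Garcia,José Luis Blanco-Murillo
Affiliations: Universidad Politécnica de Madrid
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: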
Abstract:Fully unsupervised segmentation pipelines naively seek the most salient object, should this be present. As a result, most of the methods reported in the literature deliver non-deterministic partitions that are sensitive to initialization, seed order, and threshold heuristics. We propose PANC, a weakly supervised spectral segmentation framework that uses a minimal set of annotated visual tokens to produce stable, controllable, and reproducible object masks. From the TokenCut approach, we augment the token-token affinity graph with a handful of priors coupled to anchor nodes. By manipulating the graph topology, we bias the spectral eigenspace toward partitions that are consistent with the annotations. Our approach preserves the global grouping enforced by dense self-supervised visual features, trading annotated tokens for significant gains in reproducibility, user control, and segmentation quality. Using 5 to 30 annotations per dataset, our training-free method achieves state-of-the-art performance among weakly and unsupervised approaches on standard benchmarks (e.g., DUTS-TE, ECSSD, MS COCO). In addition, it excels in domains where dense labels are costly or intra-class differences are subtle. We report strong and reliable results on homogeneous, fine-grained, and texture-limited domains, achieving 96.8% (+14.43% over SotA), 78.0% (+0.2%), and 78.8% (+0.37%) average mean intersection-over-union (mIoU) on CrackForest (CFD), CUB-200-2011, and HAM10000 datasets, respectively. For multi-object benchmarks, the framework showcases explicit, user-controllable semantic segmentation.
[CV-6] Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers
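[Quick read]: This paper addresses prompt forgetting in Multimodal Diffusion Transformers (MMDiTs) during text-to-image generation: as network depth increases, the semantic information in the prompt representation of the text branch is gradually lost, weakening the model's grasp of the input prompt. The key to the solution is a training-free method, prompt reinjection, which reinjects prompt representations from early layers into later layers to alleviate this forgetting, improving instruction following and raising the quality and consistency of generated images on benchmarks including GenEval, DPG, and T2I-CompBench++.
Link: https://arxiv.org/abs/2602.06886
Authors: Yuxuan Yao,Yuxuan Chen,Hui Li,Kaihui Cheng,Qipeng Guo,Yuwei Sun,Zilong Dong,Jingdong Wang,Siyu Zhu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages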
Abstract:Multimodal Diffusion Transformers (MMDiTs) for text-to-image generation maintain separate text and image branches, with bidirectional information flow between text tokens and visual latents throughout denoising. In this setting, we observe a prompt forgetting phenomenon: the semantics of the prompt representation in the text branch is progressively forgotten as depth increases. We further verify this effect on three representative MMDiTs (SD3, SD3.5, and FLUX.1) by probing linguistic attributes of the representations over the layers in the text branch. Motivated by these findings, we introduce a training-free approach, prompt reinjection, which reinjects prompt representations from early layers into later layers to alleviate this forgetting. Experiments on GenEval, DPG, and T2I-CompBench++ show consistent gains in instruction-following capability, along with improvements on metrics capturing preference, aesthetics, and overall text-image generation quality.
[CV-7] Vision Transformer Finetuning Benefits from Non-Smooth Components
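[Quick read]: This paper addresses the question of which components of a Vision Transformer to fine-tune for the best transfer-learning performance. Its core proposal is plasticity, a measure defined as the average rate of change of a component's output with respect to input perturbations, which captures how sensitive the component is to changes in its input. The study shows that attention modules and feedforward layers, which have high plasticity, fine-tune better; this runs against the common assumption that more smoothness is better and gives both theoretical grounding and practical guidance for choosing components in transfer learning.
Link: https://arxiv.org/abs/2602.06883
Authors: Ambroise Odonnat,Laetitia Chapel,Romain Tavenard,Ievgen Redko
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: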
Abstract:The smoothness of the transformer architecture has been extensively studied in the context of generalization, training stability, and adversarial robustness. However, its role in transfer learning remains poorly understood. In this paper, we analyze the ability of vision transformer components to adapt their outputs to changes in inputs, or, in other words, their plasticity. Defined as an average rate of change, it captures the sensitivity to input perturbation; in particular, a high plasticity implies low smoothness. We demonstrate through theoretical analysis and comprehensive experiments that this perspective provides principled guidance in choosing the components to prioritize during adaptation. A key takeaway for practitioners is that the high plasticity of the attention modules and feedforward layers consistently leads to better finetuning performance. Our findings depart from the prevailing assumption that smoothness is desirable, offering a novel perspective on the functional properties of transformers. The code is available at this https URL.
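One way to read "an average rate of change" is as a finite-difference estimate: perturb each input slightly and average the ratio of output change to input change. The sketch below does exactly that for an arbitrary function on lists of floats. The paper's precise definition is not given in the abstract, so the function name `estimate_plasticity`, the random perturbation direction, and the step size `eps` are assumptions chosen for illustration.

```python
import math
import random

def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in vec))

def estimate_plasticity(component, inputs, eps=1e-3, seed=0):
    """Finite-difference estimate of an average rate of change.

    For each input x, draws a random direction, scales it to length eps,
    and records ||component(x + delta) - component(x)|| / ||delta||.
    Returns the mean of these ratios over all inputs.
    """
    rng = random.Random(seed)
    ratios = []
    for x in inputs:
        direction = [rng.gauss(0.0, 1.0) for _ in x]
        scale = eps / l2_norm(direction)
        delta = [scale * d for d in direction]
        y0 = component(x)
        y1 = component([xi + di for xi, di in zip(x, delta)])
        change = l2_norm([a - b for a, b in zip(y1, y0)])
        ratios.append(change / l2_norm(delta))
    return sum(ratios) / len(ratios)

# A linear map that triples its input changes three times as fast as its input.
triple = lambda x: [3.0 * v for v in x]
value = estimate_plasticity(triple, [[1.0, 2.0], [0.5, -0.5]])
```

Under this reading, a component with a larger value reacts more strongly to input perturbations, matching the abstract's remark that high plasticity implies low smoothness.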
[CV-8] NanoFLUX: Distillation-Driven Compression of Large Text-to-Image Generation Models for Mobile Devices
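[Quick read]: This paper addresses the growing gap between large-scale text-to-image diffusion models, whose visual quality keeps improving as they grow, and on-device solutions. The key to the solution is NanoFLUX, a 2.4B-parameter flow-matching model distilled from the 17B-parameter FLUX.1-Schnell through a progressive compression pipeline: first, pruning redundant components shrinks the diffusion transformer from 12B to 2B parameters; second, a ResNet-based token downsampling mechanism lowers the latency of intermediate blocks while preserving high-resolution processing elsewhere; finally, a novel text-encoder distillation approach uses visual signals from the early layers of the denoiser. The result generates 512×512 images in about 2.5 seconds on mobile devices, showing that high-quality on-device text-to-image generation is feasible.
Link: https://arxiv.org/abs/2602.06879
Authors: Ruchika Chavhan,Malcolm Chadwick,Alberto Gil Couto Pimentel Ramos,Luca Morreale,Mehdi Noroozi,Abhinav Mehrotra
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: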
Abstract:While large-scale text-to-image diffusion models continue to improve in visual quality, their increasing scale has widened the gap between state-of-the-art models and on-device solutions. To address this gap, we introduce NanoFLUX, a 2.4B text-to-image flow-matching model distilled from 17B FLUX.1-Schnell using a progressive compression pipeline designed to preserve generation quality. Our contributions include: (1) A model compression strategy driven by pruning redundant components in the diffusion transformer, reducing its size from 12B to 2B; (2) A ResNet-based token downsampling mechanism that reduces latency by allowing intermediate blocks to operate on lower-resolution tokens while preserving high-resolution processing elsewhere; (3) A novel text encoder distillation approach that leverages visual signals from early layers of the denoiser during sampling. Empirically, NanoFLUX generates 512 x 512 images in approximately 2.5 seconds on mobile devices, demonstrating the feasibility of high-quality on-device text-to-image generation.
[CV-9] RFDM: Residual Flow Diffusion Model for Efficient Causal Video Editing
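[Quick read]: This paper addresses the inefficiency, heavy compute cost, and poor use of temporal redundancy of current video editing methods on variable-length inputs. Mainstream methods usually require fixed-length inputs and rely on 3D spatiotemporal models, so compute grows markedly with video length. The key to the solution is a causal, efficient video editing model, the Residual Flow Diffusion Model (RFDM). Starting from a 2D image-to-image (I2I) diffusion model, it conditions the edit of frame t on the prediction for frame t-1, giving frame-by-frame causal editing. It also formulates a new I2I diffusion forward process that leads the model to predict the residual between the target output and the previous prediction, focusing denoising on changes between consecutive frames. The method keeps image-level compute cost, competes with fully spatiotemporal (3D) models in editing quality, and scales independently of input video length.
Link: https://arxiv.org/abs/2602.06871
Authors: Mohammadreza Salehi,Mehdi Noroozi,Luca Morreale,Ruchika Chavhan,Malcolm Chadwick,Alberto Gil Ramos,Abhinav Mehrotra
Affiliations: Samsung AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: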
Abstract:Instructional video editing applies edits to an input video using only text prompts, enabling intuitive natural-language control. Despite rapid progress, most methods still require fixed-length inputs and substantial compute. Meanwhile, autoregressive video generation enables efficient variable-length synthesis, yet remains under-explored for video editing. We introduce a causal, efficient video editing model that edits variable-length videos frame by frame. For efficiency, we start from a 2D image-to-image (I2I) diffusion model and adapt it to video-to-video (V2V) editing by conditioning the edit at time step t on the model’s prediction at t-1. To leverage videos’ temporal redundancy, we propose a new I2I diffusion forward process formulation that encourages the model to predict the residual between the target output and the previous prediction. We call this Residual Flow Diffusion Model (RFDM), which focuses the denoising process on changes between consecutive frames. Moreover, we propose a new benchmark that better ranks state-of-the-art methods for editing tasks. Trained on paired video data for global/local style transfer and object removal, RFDM surpasses I2I-based methods and competes with fully spatiotemporal (3D) V2V models, while matching the compute of image models and scaling independently of input video length. More content can be found in: this https URL
[CV-10] Parameters as Experts: Adapting Vision Models with Dynamic Parameter Routing
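[Quick read]: This paper addresses the difficulty of matching full fine-tuning when adapting pre-trained vision models with Parameter-Efficient Fine-Tuning (PEFT), in particular the input-agnostic modeling and redundant cross-layer representations that limit existing methods on complex dense prediction tasks. The key to the solution is AdaRoute, an adapter-style method based on a Mixture-of-Experts (MoE) architecture. It introduces shared expert centers in which each expert is a trainable parameter matrix; in the forward pass, each AdaRoute module uses a simple dynamic parameter routing mechanism to generate a tailored weight matrix, giving input-dependent low-rank adaptation and more customized, expressive feature representations. Because AdaRoute modules in multiple layers share the same expert center, they also promote implicit cross-layer feature interaction and improve feature diversity.
Link: https://arxiv.org/abs/2602.06862
Authors: Meng Lou,Stanley Yu,Yizhou Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: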
Abstract:Adapting pre-trained vision models using parameter-efficient fine-tuning (PEFT) remains challenging, as it aims to achieve performance comparable to full fine-tuning using a minimal number of trainable parameters. When applied to complex dense prediction tasks, existing methods exhibit limitations, including input-agnostic modeling and redundant cross-layer representations. To this end, we propose AdaRoute, a new adapter-style method featuring a simple mixture-of-experts (MoE) architecture. Specifically, we introduce shared expert centers, where each expert is a trainable parameter matrix. During a feedforward pass, each AdaRoute module in the network dynamically generates weight matrices tailored for the current module via a simple dynamic parameter routing mechanism, which selectively aggregates parameter matrices in the corresponding expert center. Dynamic weight matrices in AdaRoute modules facilitate low-rank adaptation in an input-dependent manner, thus generating more customized and powerful feature representations. Moreover, since AdaRoute modules across multiple network layers share the same expert center, they improve feature diversity by promoting implicit cross-layer feature interaction. Extensive experiments demonstrate the superiority of AdaRoute on diverse vision tasks, including semantic segmentation, object detection and instance segmentation, and panoptic segmentation. Code will be available at: this https URL.
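The idea of building a weight matrix by selectively aggregating parameter matrices from a shared expert center can be sketched as a softmax-weighted sum of expert matrices. This is a minimal illustration, not the AdaRoute implementation: how the routing scores are computed from the input is not stated in the abstract, so `gate_logits` is simply passed in, and the names `route_parameters` and `expert_center` are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_parameters(expert_center, gate_logits):
    """Aggregate expert matrices into one weight matrix.

    expert_center: list of equally shaped matrices (lists of rows).
    gate_logits: one routing score per expert, assumed to be produced
    elsewhere from the current input.
    Returns the softmax-weighted sum of the expert matrices.
    """
    weights = softmax(gate_logits)
    rows = len(expert_center[0])
    cols = len(expert_center[0][0])
    return [
        [sum(w * expert[r][c] for w, expert in zip(weights, expert_center))
         for c in range(cols)]
        for r in range(rows)
    ]

# Two experts shared by every module; equal scores give their average.
expert_center = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[3.0, 0.0], [0.0, 3.0]],
]
mixed = route_parameters(expert_center, [0.0, 0.0])
```

Because every module would draw from the same `expert_center`, different layers end up sharing parameters through their routing weights, which is one plausible reading of the implicit cross-layer interaction the abstract describes.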
[CV-11] Rethinking Multi-Condition DiTs: Eliminating Redundant Attention via Position-Alignment and Keyword-Scoping
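[Quick read]: This paper addresses the quadratic growth in compute and memory that the conventional "concatenate-and-attend" strategy causes in Diffusion Transformers (DiTs) as the number of conditions increases in multi-condition generation, which limits the scalability of fine-grained control over spatial layout or subject appearance. The key to the solution is the Position-aligned and Keyword-scoped Attention (PKA) framework: Position-Aligned Attention (PAA) linearizes spatial control by enforcing localized patch alignment, while Keyword-Scoped Attention (KSA) prunes irrelevant subject-driven interactions through semantic-aware masking, removing cross-modal redundancy. In addition, a Conditional Sensitivity-Aware Sampling (CSAS) strategy reweights the training objective toward critical denoising phases, speeding up convergence and improving conditional fidelity.
Link: https://arxiv.org/abs/2602.06850
Authors: Chao Zhou,Tianyi Wei,Yiling Chen,Wenbo Zhou,Nenghai Yu
Affiliations: University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: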
Abstract:While modern text-to-image models excel at prompt-based generation, they often lack the fine-grained control necessary for specific user requirements like spatial layouts or subject appearances. Multi-condition control addresses this, yet its integration into Diffusion Transformers (DiTs) is bottlenecked by the conventional "concatenate-and-attend" strategy, which suffers from quadratic computational and memory overhead as the number of conditions scales. Our analysis reveals that much of this cross-modal interaction is spatially or semantically redundant. To this end, we propose Position-aligned and Keyword-scoped Attention (PKA), a highly efficient framework designed to eliminate these redundancies. Specifically, Position-Aligned Attention (PAA) linearizes spatial control by enforcing localized patch alignment, while Keyword-Scoped Attention (KSA) prunes irrelevant subject-driven interactions via semantic-aware masking. To facilitate efficient learning, we further introduce a Conditional Sensitivity-Aware Sampling (CSAS) strategy that reweights the training objective towards critical denoising phases, drastically accelerating convergence and enhancing conditional fidelity. Empirically, PKA delivers a 10.0× inference speedup and a 5.1× VRAM saving, providing a scalable and resource-friendly solution for high-fidelity multi-conditioned generation.
[CV-12] GaussianPOP: Principled Simplification Framework for Compact 3D Gaussian Splatting via Error Quantification
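Quick read: This paper addresses the problem that existing 3D Gaussian Splatting (3DGS) simplification methods rely on importance scores (such as blending weights or sensitivity) that are not tied to a visual error metric, which makes it hard to reach the best trade-off between model compactness and rendering fidelity. The key to its solution is the GaussianPOP framework, whose core is a novel error criterion derived from the 3DGS rendering equation that precisely quantifies each Gaussian's contribution to the final image. An efficient algorithm computes the error in a single forward pass and supports both on-training pruning and post-training simplification with iterative error re-quantification, improving the accuracy and stability of simplification.
Link: https://arxiv.org/abs/2602.06830
Authors: Soonbin Lee,Yeong-Gyu Kim,Simon Sasse,Tomas M. Borges,Yago Sanchez,Eun-Seok Ryu,Thomas Schierl,Cornelius Hellge
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: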
Abstract:Existing 3D Gaussian Splatting simplification methods commonly use importance scores, such as blending weights or sensitivity, to identify redundant Gaussians. However, these scores are not driven by visual error metrics, often leading to suboptimal trade-offs between compactness and rendering fidelity. We present GaussianPOP, a principled simplification framework based on analytical Gaussian error quantification. Our key contribution is a novel error criterion, derived directly from the 3DGS rendering equation, that precisely measures each Gaussian’s contribution to the rendered image. By introducing a highly efficient algorithm, our framework enables practical error calculation in a single forward pass. The framework is both accurate and flexible, supporting on-training pruning as well as post-training simplification via iterative error re-quantification for improved stability. Experimental results show that our method consistently outperforms existing state-of-the-art pruning methods across both application scenarios, achieving a superior trade-off between model compactness and high rendering quality.
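To make the idea of a per-Gaussian error concrete, here is a minimal Python sketch of the brute-force baseline: alpha-composite one ray front to back, then measure how much the pixel changes when each Gaussian is deleted. This is not the paper's analytical criterion or its single-pass algorithm. The function names `composite` and `removal_errors`, the scalar colors, the black background, and the single-ray setting are all assumptions made for illustration.

```python
def composite(colors, alphas):
    """Front-to-back alpha compositing along one ray (scalar colors, black background)."""
    pixel, transmittance = 0.0, 1.0
    for color, alpha in zip(colors, alphas):
        pixel += transmittance * alpha * color
        transmittance *= 1.0 - alpha
    return pixel


def removal_errors(colors, alphas):
    """Squared pixel change caused by deleting each Gaussian in turn (brute force)."""
    full = composite(colors, alphas)
    errors = []
    for i in range(len(colors)):
        # Re-render the ray without Gaussian i and compare with the full render.
        reduced = composite(colors[:i] + colors[i + 1:], alphas[:i] + alphas[i + 1:])
        errors.append((full - reduced) ** 2)
    return errors
```

The brute-force version needs one extra render per Gaussian, which is the cost an analytical criterion computed in a single forward pass would avoid.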
[CV-13] AEGPO: Adaptive Entropy-Guided Policy Optimization for Diffusion Models
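Quick read: This paper addresses the inefficient and static sampling of policy optimization methods such as GRPO when reinforcement learning from human feedback (RLHF) is used to align diffusion and flow models. Existing methods treat all prompts and denoising steps uniformly, ignoring both the variation in learning value across samples and the dynamic nature of critical exploration moments. The key to its solution comes from analyzing the internal attention dynamics during GRPO training, which shows that attention entropy can serve as a dual-signal proxy: the relative change ΔEntropy reflects how far the current policy has diverged from the base policy and so measures a sample's learning value, while peaks of the absolute attention entropy Entropy(t) identify critical, high-dispersion denoising timesteps. Building on this, the authors propose Adaptive Entropy-Guided Policy Optimization (AEGPO), which uses ΔEntropy at the global level to allocate rollout budgets dynamically and Entropy(t) peaks at the local level to guide exploration selectively, giving more efficient and effective policy optimization.
Link: https://arxiv.org/abs/2602.06825
Authors: Yuming Li,Qingyu Li,Chengyu Bai,Xiangyang Luo,Zeyue Xue,Wenyu Qin,Meng Wang,Yikai Wang,Shanghang Zhang
Affiliations: Peking University; Kling Team, Kuaishou Technology; The University of Hong Kong; Beijing Normal University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: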
Abstract:Reinforcement learning from human feedback (RLHF) shows promise for aligning diffusion and flow models, yet policy optimization methods such as GRPO suffer from inefficient and static sampling strategies. These methods treat all prompts and denoising steps uniformly, ignoring substantial variations in sample learning value as well as the dynamic nature of critical exploration moments. To address this issue, we conduct a detailed analysis of the internal attention dynamics during GRPO training and uncover a key insight: attention entropy can serve as a powerful dual-signal proxy. First, across different samples, the relative change in attention entropy ($\Delta$Entropy), which reflects the divergence between the current policy and the base policy, acts as a robust indicator of sample learning value. Second, during the denoising process, the peaks of absolute attention entropy (Entropy(t)), which quantify attention dispersion, effectively identify critical timesteps where high-value exploration occurs. Building on this observation, we propose Adaptive Entropy-Guided Policy Optimization (AEGPO), a novel dual-signal, dual-level adaptive optimization strategy. At the global level, AEGPO uses $\Delta$Entropy to dynamically allocate rollout budgets, prioritizing prompts with higher learning value. At the local level, it exploits the peaks of Entropy(t) to guide exploration selectively at critical high-dispersion timesteps rather than uniformly across all denoising steps. By focusing computation on the most informative samples and the most critical moments, AEGPO enables more efficient and effective policy optimization. Experiments on text-to-image generation tasks demonstrate that AEGPO significantly accelerates convergence and achieves superior alignment performance compared to standard GRPO variants.
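The two signals described in the abstract can be sketched in a few lines of Python. This is an illustrative reading, not the authors' implementation: the exact definition of the relative entropy change, the proportional budget rule, and the strict local-maximum test for peaks are assumptions, and all function names are hypothetical.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def relative_entropy_change(current, base):
    """Assumed form of the relative change: |current - base| / base."""
    return abs(current - base) / base


def allocate_rollouts(delta_entropies, total_budget, min_per_prompt=1):
    """Split a rollout budget across prompts in proportion to their entropy change."""
    n = len(delta_entropies)
    spare = total_budget - min_per_prompt * n
    norm = sum(delta_entropies)
    shares = [d / norm if norm > 0 else 1.0 / n for d in delta_entropies]
    budget = [min_per_prompt + int(spare * s) for s in shares]
    # Hand out what integer truncation left over, largest share first.
    leftover = total_budget - sum(budget)
    for i in sorted(range(n), key=lambda i: -shares[i])[:leftover]:
        budget[i] += 1
    return budget


def entropy_peaks(entropy_per_step):
    """Indices of denoising steps whose entropy is a strict local maximum."""
    e = entropy_per_step
    return [t for t in range(1, len(e) - 1) if e[t] > e[t - 1] and e[t] > e[t + 1]]
```

With entropy changes of 0.1, 0.3 and 0.6 and a budget of 20 rollouts, the sketch assigns 2, 6 and 12 rollouts, so the prompt whose attention has moved furthest from the base policy receives the most samples.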
[CV-14] RAIGen: Rare Attribute Identification in Text-to-Image Generative Models
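Quick read: This paper addresses the uneven coverage of semantic attributes that text-to-image generative models inherit from biased training data, and in particular the fact that existing methods overlook the identification of rare or minority features in the data distribution (social, cultural, or stylistic). Prior work falls into two groups: closed-set approaches are restricted to predefined fairness categories (such as gender and race), while open-set approaches focus on the majority attributes that dominate outputs; neither systematically discovers rare attributes. The paper proposes the RAIGen framework, whose key innovation is to combine Matryoshka Sparse Autoencoders with a new minority metric that merges neuron activation frequency and semantic distinctiveness, so that interpretable neurons are identified without supervision and their top-activating images reveal rare attributes that the model encodes but seldom expresses. Experiments show that RAIGen discovers rare attributes beyond fixed fairness categories in models such as Stable Diffusion and SDXL, supports auditing across architectures, and enables targeted amplification during generation.
Link: https://arxiv.org/abs/2602.06806
Authors: Silpa Vadakkeeveetil Sreelatha,Dan Wang,Serge Belongie,Muhammad Awais,Anjan Dutta
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: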
Abstract:Text-to-image diffusion models achieve impressive generation quality but inherit and amplify training-data biases, skewing coverage of semantic attributes. Prior work addresses this in two ways. Closed-set approaches mitigate biases in predefined fairness categories (e.g., gender, race), assuming socially salient minority attributes are known a priori. Open-set approaches frame the task as bias identification, highlighting majority attributes that dominate outputs. Both overlook a complementary task: uncovering rare or minority features underrepresented in the data distribution (social, cultural, or stylistic) yet still encoded in model representations. We introduce RAIGen, the first framework, to our knowledge, for un-supervised rare-attribute discovery in diffusion models. RAIGen leverages Matryoshka Sparse Autoencoders and a novel minority metric combining neuron activation frequency with semantic distinctiveness to identify interpretable neurons whose top-activating images reveal underrepresented attributes. Experiments show RAIGen discovers attributes beyond fixed fairness categories in Stable Diffusion, scales to larger models such as SDXL, supports systematic auditing across architectures, and enables targeted amplification of rare attributes during generation.
[CV-15] A Unified Formula for Affine Transformations between Calibrated Cameras
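Quick read: This paper addresses the affine transformation that maps local image patches between two calibrated views. The key to its solution is a closed-form expression showing that the affine transformation depends only on the relative camera pose, the image coordinates, and the local surface normal. This result provides a theoretical basis and an efficient computational tool for image matching and reconstruction in multi-view geometry.
Link: https://arxiv.org/abs/2602.06805
Authors: Levente Hajder
Affiliations: Eötvös Loránd University (ELTE)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: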
Abstract:In this technical note, we derive a closed-form expression for the affine transformation mapping local image patches between two calibrated views. We show that the transformation is a function of the relative camera pose, the image coordinates, and the local surface normal.
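One standard way to see why pose, image coordinates and surface normal suffice is to take the Jacobian of the homography induced by the local tangent plane. The Python sketch below follows that textbook route and does not reproduce the note's own formula. It assumes normalized image coordinates, the convention X2 = R X1 + t, and a plane n^T X = d in the first camera frame with the offset d passed explicitly; all function names are hypothetical.

```python
def plane_homography(R, t, n, d):
    """H = R + t n^T / d for the plane n^T X = d in the first camera frame."""
    return [[R[i][j] + t[i] * n[j] / d for j in range(3)] for i in range(3)]


def warp(H, u, v):
    """Map the normalized image point (u, v) of view 1 into view 2."""
    y = [H[i][0] * u + H[i][1] * v + H[i][2] for i in range(3)]
    return y[0] / y[2], y[1] / y[2]


def local_affine(H, u, v):
    """2x2 Jacobian of the homography warp at (u, v): the local affine map."""
    s = H[2][0] * u + H[2][1] * v + H[2][2]  # third homogeneous coordinate
    u2, v2 = warp(H, u, v)
    return [
        [(H[0][0] - u2 * H[2][0]) / s, (H[0][1] - u2 * H[2][1]) / s],
        [(H[1][0] - v2 * H[2][0]) / s, (H[1][1] - v2 * H[2][1]) / s],
    ]
```

With an identity rotation and zero translation the sketch returns the 2x2 identity, as expected when both views coincide.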
[CV-16] Machine Learning for Detection and Severity Estimation of Sweetpotato Weevil Damage in Field and Lab Conditions
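Quick read: This paper addresses the severe damage that sweetpotato weevils (Cylas spp.) cause to sweetpotato production, together with the drawbacks of traditional manual scoring of weevil damage, which is labour-intensive, subjective, and inconsistent and therefore significantly slows the breeding of resistant varieties. The key to its solution is automated assessment based on computer vision. In the field setting, classification models predict root-damage severity with a test accuracy of 71.43%. In the laboratory setting, the authors build an annotated dataset and design a two-stage object detection pipeline that combines root segmentation with a tiling strategy, using the real-time detector YOLO12 to find tiny feeding holes with a mean average precision (mAP) of 77.7%. The approach gives objective, efficient, and scalable phenotyping that fits modern sweetpotato breeding workflows, improves the efficiency of pest-damage phenotyping, and helps reduce the impact of weevils on food security.
Link: https://arxiv.org/abs/2602.06786
Authors: Doreen M. Chelangat,Sudi Murindanyi,Bruce Mugizi,Paul Musana,Benard Yada,Milton A. Otema,Florence Osaru,Andrew Katumba,Joyce Nakatumba-Nabende
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: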
Abstract:Sweetpotato weevils (Cylas spp.) are considered among the most destructive pests impacting sweetpotato production, particularly in sub-Saharan Africa. Traditional methods for assessing weevil damage, predominantly relying on manual scoring, are labour-intensive, subjective, and often yield inconsistent results. These challenges significantly hinder breeding programs aimed at developing resilient sweetpotato varieties. This study introduces a computer vision-based approach for the automated evaluation of weevil damage in both field and laboratory contexts. In the field settings, we collected data to train classification models to predict root-damage severity levels, achieving a test accuracy of 71.43%. Additionally, we established a laboratory dataset and designed an object detection pipeline employing YOLO12, a leading real-time detection model. This methodology incorporated a two-stage laboratory pipeline that combined root segmentation with a tiling strategy to improve the detectability of small objects. The resulting model demonstrated a mean average precision of 77.7% in identifying minute weevil feeding holes. Our findings indicate that computer vision technologies can provide efficient, objective, and scalable assessment tools that align seamlessly with contemporary breeding workflows. These advancements represent a significant improvement in enhancing phenotyping efficiency within sweetpotato breeding programs and play a crucial role in mitigating the detrimental effects of weevils on food security.
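The tiling step mentioned in the abstract can be illustrated with a small Python sketch that cuts an image into overlapping tiles and maps detections back to full-image coordinates. The tile size, the overlap, and the function names `tile_boxes` and `to_image_coords` are assumptions for illustration; the paper's actual tiling parameters are not stated in the abstract.

```python
def tile_boxes(width, height, tile, overlap):
    """Return (x0, y0, x1, y1) boxes of overlapping tiles that cover the image."""
    stride = tile - overlap

    def starts(size):
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, stride))
        positions.append(size - tile)  # last tile sits flush with the border
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


def to_image_coords(box, tile_box):
    """Shift a detection from tile coordinates back to full-image coordinates."""
    x0, y0 = tile_box[0], tile_box[1]
    return (box[0] + x0, box[1] + y0, box[2] + x0, box[3] + y0)
```

Running a detector on each tile keeps small objects such as feeding holes at a larger relative size than downscaling the whole image would; overlapping tiles reduce the chance that a hole is cut in half at a tile border.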
[CV-17] Revisiting Emotions Representation for Recognition in the Wild
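Quick read: This paper addresses the fact that conventional facial emotion recognition (FER) treats emotion as single-label classification, which ignores the multidimensional, blended nature of spontaneous emotional states and cannot describe the distribution of complex emotions in real scenes. The key to its solution is a new method built on a known mapping of basic and compound emotions to probability distributions in the Valence-Arousal-Dominance (VAD) space: existing datasets are automatically re-labelled so that single-label annotations become probability distributions over emotion classes, which describes complex affective states as mixtures of emotions and accounts for the ambiguity of emotion perception.
Link: https://arxiv.org/abs/2602.06778
Authors: Joao Baptista Cardia Neto,Claudio Ferrari,Stefano Berretti
Affiliations: São Paulo State Technological College (FATEC); University of Florence
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: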
Abstract:Facial emotion recognition has been typically cast as a single-label classification problem of one out of six prototypical emotions. However, that is an oversimplification that is unsuitable for representing the multifaceted spectrum of spontaneous emotional states, which are most often the result of a combination of multiple emotions contributing at different intensities. Building on this, a promising direction that was explored recently is to cast emotion recognition as a distribution learning problem. Still, such approaches are limited in that research datasets are typically annotated with a single emotion class. In this paper, we contribute a novel approach to describe complex emotional states as probability distributions over a set of emotion classes. To do so, we propose a solution to automatically re-label existing datasets by exploiting the result of a study in which a large set of both basic and compound emotions is mapped to probability distributions in the Valence-Arousal-Dominance (VAD) space. In this way, given a face image annotated with VAD values, we can estimate the likelihood of it belonging to each of the distributions, so that emotional states can be described as a mixture of emotions, enriching their description, while also accounting for the ambiguous nature of their perception. In a preliminary set of experiments, we illustrate the advantages of this solution and a new possible direction of investigation. Data annotations are available at this https URL.
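The re-labelling idea can be sketched as follows: given a VAD annotation, evaluate its likelihood under each emotion's distribution and normalize the result into a mixture. The sketch assumes axis-aligned Gaussians and equal priors, and the means and standard deviations in `EMOTIONS` are made-up illustrative values, not the ones from the study the paper relies on.

```python
import math

# Hypothetical (mean, std) in VAD space per emotion; illustrative values only.
EMOTIONS = {
    "happy": ((0.8, 0.5, 0.4), (0.2, 0.3, 0.3)),
    "angry": ((-0.5, 0.6, 0.3), (0.3, 0.3, 0.3)),
    "sad": ((-0.6, -0.3, -0.3), (0.3, 0.3, 0.3)),
}


def gaussian_log_pdf(x, mean, std):
    """Log-density of an axis-aligned Gaussian at point x."""
    return sum(
        -0.5 * ((xi - m) / s) ** 2 - math.log(s * math.sqrt(2.0 * math.pi))
        for xi, m, s in zip(x, mean, std)
    )


def emotion_mixture(vad, emotions=EMOTIONS):
    """Turn one VAD annotation into normalized weights over emotion classes."""
    logs = {name: gaussian_log_pdf(vad, m, s) for name, (m, s) in emotions.items()}
    top = max(logs.values())  # subtract the maximum for numerical stability
    weights = {name: math.exp(v - top) for name, v in logs.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}
```

A face annotated near the "happy" mean receives most of the weight for that class, while a face lying between two means receives a genuinely mixed label.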
[CV-18] Gold Exploration using Representations from a Multispectral Autoencoder
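Quick read: This paper addresses the difficulty of large-scale prospectivity mapping in mineral exploration, where on-site data are costly and limited. The key to its solution is to use generative representations learned from multispectral Sentinel-2 imagery as inputs to a lightweight XGBoost classifier, so that gold-bearing regions can be identified from space. Specifically, an autoencoder foundation model called Isometric, pretrained on the FalconSpace-S2 v1.0 dataset, extracts information-dense spectral-spatial embeddings that capture transferable mineralogical patterns even with few labeled samples. Patch-level accuracy rises from 0.51 to 0.68 and image-level accuracy from 0.55 to 0.73, supporting the efficiency, scalability, and global applicability of foundation-model representations for mineral exploration.
Link: https://arxiv.org/abs/2602.06748
Authors: Argyro Tsandalidou,Konstantinos Dogeas,Eleftheria Tetoula Tsonga,Elisavet Parselia,Georgios Tsimiklis,George Arvanitakis
Affiliations: Technology Innovation Institute; Institute of Communication and Computer Systems; Geonova
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Presented in EurIPS 2025, 1st Workshop: Advances in Representation Learning for Earth Observation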
Abstract:Satellite imagery is employed for large-scale prospectivity mapping due to the high cost and typically limited availability of on-site mineral exploration data. In this work, we present a proof-of-concept framework that leverages generative representations learned from multispectral Sentinel-2 imagery to identify gold-bearing regions from space. An autoencoder foundation model, called Isometric, which is pretrained on the large-scale FalconSpace-S2 v1.0 dataset, produces information-dense spectral-spatial representations that serve as inputs to a lightweight XGBoost classifier. We compare this representation-based approach with a raw spectral input baseline using a dataset of 63 Sentinel-2 images from known gold and non-gold locations. The proposed method improves patch-level accuracy from 0.51 to 0.68 and image-level accuracy from 0.55 to 0.73, demonstrating that generative embeddings capture transferable mineralogical patterns even with limited labeled data. These results highlight the potential of foundation-model representations to make mineral exploration more efficient, scalable, and globally applicable.
[CV-19] Clinical-Prior Guided Multi-Modal Learning with Latent Attention Pooling for Gait-Based Scoliosis Screening
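Quick read: This paper addresses early screening for Adolescent Idiopathic Scoliosis (AIS), where conventional methods are subjective, hard to scale, and dependent on clinical experts, and where existing video-based gait analysis suffers from data leakage and limited interpretability. The key to its solution is a new benchmark dataset, ScoliGait (1,572 training clips and 300 fully independent test clips), together with a multi-modal framework that combines an interpretable feature representation based on clinical kinematic priors (a clinical-prior-guided kinematic knowledge map) with a latent attention pooling mechanism that fuses the video, text, and knowledge-map modalities. On a realistic benchmark without repeated subjects it achieves a significant performance gain, giving a reliable and clinically meaningful basis for non-invasive, scalable AIS assessment.
Link: https://arxiv.org/abs/2602.06743
Authors: Dong Chen,Zizhuang Wei,Jialei Xu,Xinyang Sun,Zonglin He,Meiru An,Huili Peng,Yong Hu,Kenneth MC Cheung
Affiliations: The University of Hong Kong - Shenzhen Hospital; Li Ka Shing Faculty of Medicine, The University of Hong Kong; Huawei
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: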
Abstract:Adolescent Idiopathic Scoliosis (AIS) is a prevalent spinal deformity whose progression can be mitigated through early detection. Conventional screening methods are often subjective, difficult to scale, and reliant on specialized clinical expertise. Video-based gait analysis offers a promising alternative, but current datasets and methods frequently suffer from data leakage, where performance is inflated by repeated clips from the same individual, or employ oversimplified models that lack clinical interpretability. To address these limitations, we introduce ScoliGait, a new benchmark dataset comprising 1,572 gait video clips for training and 300 fully independent clips for testing. Each clip is annotated with radiographic Cobb angles and descriptive text based on clinical kinematic priors. We propose a multi-modal framework that integrates a clinical-prior-guided kinematic knowledge map for interpretable feature representation, alongside a latent attention pooling mechanism to fuse video, text, and knowledge map modalities. Our approach establishes a new state of the art, showing a significant performance gain on a realistic, subject-independent benchmark. This work provides a robust, interpretable, and clinically grounded foundation for scalable, non-invasive AIS assessment.
[CV-20] Diffeomorphism-Equivariant Neural Networks
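Quick read: This paper addresses how to introduce group symmetry into pre-trained neural networks through equivariance, in particular for infinite-dimensional groups. Most existing methods are limited to compact, finite, or low-dimensional groups with linear actions and are hard to extend to more complex geometric transformations. The key to its solution is energy-based canonicalisation, which formulates equivariance as an optimisation problem and so gives access to the established toolbox of differentiable image registration for achieving diffeomorphism equivariance. On segmentation and classification tasks the method achieves approximate equivariance and generalises to unseen transformations without extensive data augmentation or retraining.
Link: https://arxiv.org/abs/2602.06695
Authors: Josephine Elisabeth Oettinger,Zakhar Shumaylov,Johannes Bostelmann,Jan Lellmann,Carola-Bibiane Schönlieb
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: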
Abstract:Incorporating group symmetries via equivariance into neural networks has emerged as a robust approach for overcoming the efficiency and data demands of modern deep learning. While most existing approaches, such as group convolutions and averaging-based methods, focus on compact, finite, or low-dimensional groups with linear actions, this work explores how equivariance can be extended to infinite-dimensional groups. We propose a strategy designed to induce diffeomorphism equivariance in pre-trained neural networks via energy-based canonicalisation. Formulating equivariance as an optimisation problem allows us to access the rich toolbox of already established differentiable image registration methods. Empirical results on segmentation and classification tasks confirm that our approach achieves approximate equivariance and generalises to unseen transformations without relying on extensive data augmentation or retraining.
[CV-21] Can We Build a Monolithic Model for Fake Image Detection? SICA: Semantic-Induced Constrained Adaptation for Unified-Yet-Discriminative Artifact Feature Space Reconstruction
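Quick read: This paper addresses the weak performance of unified Fake Image Detection (FID) models across forensic subdomains, in particular the observation that a single monolithic FID model performs worse than ensemble approaches. The authors trace this bottleneck to a “heterogeneous phenomenon”: artifacts are intrinsically different across subdomains, which causes the artifact feature space to collapse. The key to the solution is a “unified-yet-discriminative” reconstruction of the artifact feature space. To this end the authors propose Semantic-Induced Constrained Adaptation (SICA), the first monolithic FID paradigm, which uses high-level semantics as a structural prior to guide the reconstruction. Experiments show that it reconstructs the target feature space in a near-orthogonal manner and outperforms 15 state-of-the-art methods.
Link: https://arxiv.org/abs/2602.06676
Authors: Bo Du,Xiaochen Ma,Xuekang Zhu,Zhe Yang,Chaogun Niu,Jian Liu,Ji-Zhe Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: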
Abstract:Fake Image Detection (FID), aiming at unified detection across four image forensic subdomains, is critical in real-world forensic scenarios. Compared with ensemble approaches, monolithic FID models are theoretically more promising, but to date, consistently yield inferior performance in practice. In this work, by discovering the “heterogeneous phenomenon”, which is the intrinsic distinctness of artifacts across subdomains, we diagnose the cause of this underperformance for the first time: the collapse of the artifact feature space driven by such phenomenon. The core challenge for developing a practical monolithic FID model thus boils down to the “unified-yet-discriminative” reconstruction of the artifact feature space. To address this paradoxical challenge, we hypothesize that high-level semantics can serve as a structural prior for the reconstruction, and further propose Semantic-Induced Constrained Adaptation (SICA), the first monolithic FID paradigm. Extensive experiments on our OpenMMSec dataset demonstrate that SICA outperforms 15 state-of-the-art methods and reconstructs the target unified-yet-discriminative artifact feature space in a near-orthogonal manner, thus firmly validating our hypothesis. The code and dataset are available at: https://github.com/scu-zjz/SICA_OpenMMSec.
[CV-22] CytoCrowd: A Multi-Annotator Benchmark Dataset for Cytology Image Analysis
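Quick read: This paper addresses the lack of suitable annotated datasets for medical image analysis: existing datasets either provide a single clean ground truth, which hides real disagreement among clinical experts, or provide multiple annotations without an independent gold standard for objective evaluation. The key to its solution is the CytoCrowd dataset, whose core idea is a dual structure: raw, conflicting annotations from four independent pathologists, plus a high-quality gold-standard ground truth established by a senior expert. This design lets CytoCrowd serve both as a benchmark for standard computer vision tasks (such as object detection and classification) and as a realistic testbed for annotation aggregation algorithms that must resolve expert disagreement.
Link: https://arxiv.org/abs/2602.06674
Authors: Yonghao Si,Xingyuan Zeng,Zhao Chen,Libin Zheng,Caleb Chen Cao,Lei Chen,Jian Yin
Affiliations: Sun Yat-sen University; Hong Kong University of Science and Technology (Guangzhou); Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: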
Abstract:High-quality annotated datasets are crucial for advancing machine learning in medical image analysis. However, a critical gap exists: most datasets either offer a single, clean ground truth, which hides real-world expert disagreement, or they provide multiple annotations without a separate gold standard for objective evaluation. To bridge this gap, we introduce CytoCrowd, a new public benchmark for cytology analysis. The dataset features 446 high-resolution images, each with two key components: (1) raw, conflicting annotations from four independent pathologists, and (2) a separate, high-quality gold-standard ground truth established by a senior expert. This dual structure makes CytoCrowd a versatile resource. It serves as a benchmark for standard computer vision tasks, such as object detection and classification, using the ground truth. Simultaneously, it provides a realistic testbed for evaluating annotation aggregation algorithms that must resolve expert disagreements. We provide comprehensive baseline results for both tasks. Our experiments demonstrate the challenges presented by CytoCrowd and establish its value as a resource for developing the next generation of models for medical image analysis.
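[CV-23] PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks
Quick read: This paper addresses the underexplored potential of Unified Multimodal Models (UMMs) for computer-use planning tasks, which require spatial reasoning and procedural understanding. The key to its solution is the PlanViz benchmark, which defines three sub-tasks drawn from daily life (route planning, work diagramming, and web UI displaying), ensures data quality with human-annotated questions, reference images, and a quality control process, and introduces a task-adaptive score, PlanScore, to assess the correctness, visual quality, and efficiency of generated images, thereby exposing the limitations and potential of UMMs in this area.
Link: https://arxiv.org/abs/2602.06663
Authors: Junxian Li,Kai Liu,Leyang Chen,Weida Wang,Zhixin Wang,Jiaqi Xu,Fan Li,Renjing Pei,Linghe Kong,Yulun Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The main part of our paper: PlanViz. Code is at: this https URL. Supplementary material is at: this https URL.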
Abstract:Unified multimodal models (UMMs) have shown impressive capabilities in generating natural images and supporting multimodal reasoning. However, their potential in supporting computer-use planning tasks, which are closely related to our lives, remain underexplored. Image generation and editing in computer-use tasks require capabilities like spatial reasoning and procedural understanding, and it is still unknown whether UMMs have these capabilities to finish these tasks or not. Therefore, we propose PlanViz, a new benchmark designed to evaluate image generation and editing for computer-use tasks. To achieve the goal of our evaluation, we focus on sub-tasks which frequently involve in daily life and require planning steps. Specifically, three new sub-tasks are designed: route planning, work diagramming, and webUI displaying. We address challenges in data quality ensuring by curating human-annotated questions and reference images, and a quality control process. For challenges of comprehensive and exact evaluation, a task-adaptive score, PlanScore, is proposed. The score helps understanding the correctness, visual quality and efficiency of generated images. Through experiments, we highlight key limitations and opportunities for future research on this topic.
[CV-24] Same Answer, Different Representations: Hidden instability in VLMs
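Quick read: This paper addresses a limitation in how the robustness of Vision Language Models (VLMs) is currently assessed: relying only on output-level invariance ignores how changes in internal representations affect the reliability of multimodal processing. The key to its solution is a representation-aware and frequency-aware evaluation framework that, alongside standard label-based metrics, measures three quantities: internal embedding drift, spectral sensitivity, and structural smoothness (the spatial consistency of vision tokens), which exposes how a VLM behaves internally under input perturbations. Applied to SEEDBench, MMMU, and POPE, the framework identifies three distinct failure modes, showing that conventional evaluation can hide real fragility and offering a new route to analysing the robustness and interpretability of VLMs.
Link: https://arxiv.org/abs/2602.06652
Authors: Farooq Ahmad Wani,Alessandro Suglia,Rohit Saxena,Aryo Pradipta Gema,Wai-Chung Kwan,Fazl Barez,Maria Sofia Bucarelli,Fabrizio Silvestri,Pasquale Minervini
Affiliations: Sapienza University of Rome; CNRS; University of Edinburgh; University of Oxford; i3S; Miniml.AI; Martian
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: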
Abstract:The robustness of Vision Language Models (VLMs) is commonly assessed through output-level invariance, implicitly assuming that stable predictions reflect stable multimodal processing. In this work, we argue that this assumption is insufficient. We introduce a representation-aware and frequency-aware evaluation framework that measures internal embedding drift, spectral sensitivity, and structural smoothness (spatial consistency of vision tokens), alongside standard label-based metrics. Applying this framework to modern VLMs across the SEEDBench, MMMU, and POPE datasets reveals three distinct failure modes. First, models frequently preserve predicted answers while undergoing substantial internal representation drift; for perturbations such as text overlays, this drift approaches the magnitude of inter-image variability, indicating that representations move to regions typically occupied by unrelated inputs despite unchanged outputs. Second, robustness does not improve with scale; larger models achieve higher accuracy but exhibit equal or greater sensitivity, consistent with sharper yet more fragile decision boundaries. Third, we find that perturbations affect tasks differently: they harm reasoning when they disrupt how models combine coarse and fine visual cues, but on the hallucination benchmarks, they can reduce false positives by making models generate more conservative answers.
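The first failure mode compares embedding drift under perturbation with the variability between unrelated images. A minimal Python sketch of that comparison is given below; the choice of cosine distance, the averaging, and the function names are assumptions, since the abstract does not specify how drift is measured.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def drift_ratio(clean, perturbed):
    """Mean clean-vs-perturbed drift divided by mean distance between different clean images."""
    n = len(clean)
    drift = sum(cosine_distance(c, p) for c, p in zip(clean, perturbed)) / n
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    inter = sum(cosine_distance(clean[i], clean[j]) for i, j in pairs) / len(pairs)
    return drift / inter
```

A ratio near 0 means the perturbation barely moves the representation; a ratio near 1 means the perturbed embedding is about as far from its clean version as an unrelated image would be, even if the predicted answer is unchanged.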
[CV-25] CauCLIP: Bridging the Sim-to-Real Gap in Surgical Video Understanding via Causality-Inspired Vision-Language Modeling
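Quick read: This paper addresses the poor generalization of surgical phase recognition in intelligent operating rooms, caused by limited annotated clinical video and a large domain gap between synthetic and real data. The key to its solution is CauCLIP, a causality-inspired vision-language framework with two main components: a frequency-based augmentation strategy that perturbs domain-specific attributes while preserving semantic structure, and a causal suppression loss that mitigates non-causal biases and reinforces causal surgical features. Combined in a unified training framework, they let the model focus on the stable causal factors of surgical workflows, giving domain-robust surgical video understanding without access to target-domain data.
Link: https://arxiv.org/abs/2602.06619
Authors: Yuxin He,An Li,Cheng Xue
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: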
Abstract:Surgical phase recognition is a critical component for context-aware decision support in intelligent operating rooms, yet training robust models is hindered by limited annotated clinical videos and large domain gaps between synthetic and real surgical data. To address this, we propose CauCLIP, a causality-inspired vision-language framework that leverages CLIP to learn domain-invariant representations for surgical phase recognition without access to target domain data. Our approach integrates a frequency-based augmentation strategy to perturb domain-specific attributes while preserving semantic structures, and a causal suppression loss that mitigates non-causal biases and reinforces causal surgical features. These components are combined in a unified training framework that enables the model to focus on stable causal factors underlying surgical workflows. Experiments on the SurgVisDom hard adaptation benchmark demonstrate that our method substantially outperforms all competing approaches, highlighting the effectiveness of causality-guided vision-language models for domain-generalizable surgical video understanding.
[CV-26] DAVE: Distribution-aware Attribution via ViT Gradient Decomposition
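Quick read: This paper addresses the difficulty of producing stable, high-resolution attribution maps for Vision Transformers (ViTs): architectural components such as patch embeddings and attention routing introduce structured artifacts, so many existing methods fall back on coarse patch-level attributions. The key to its solution is DAVE (Distribution-aware Attribution via ViT Gradient Decomposition), a mathematically grounded attribution method based on a structured decomposition of the input gradient. By exploiting architectural properties of ViTs, DAVE isolates the locally equivariant and stable components of the effective input-output mapping and separates them from architecture-induced artifacts and other sources of instability, giving more reliable and finer pixel-level explanations.
Link: https://arxiv.org/abs/2602.06613
Authors: Adam Wróbel,Siddhartha Gairola,Jacek Tabor,Bernt Schiele,Bartosz Zieliński,Dawid Rymarczyk
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Work under review. Code will be released upon acceptance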
Abstract:Vision Transformers (ViTs) have become a dominant architecture in computer vision, yet producing stable and high-resolution attribution maps for these models remains challenging. Architectural components such as patch embeddings and attention routing often introduce structured artifacts in pixel-level explanations, causing many existing methods to rely on coarse patch-level attributions. We introduce DAVE (Distribution-aware Attribution via ViT Gradient Decomposition), a mathematically grounded attribution method for ViTs based on a structured decomposition of the input gradient. By exploiting architectural properties of ViTs, DAVE isolates locally equivariant and stable components of the effective input–output mapping. It separates these from architecture-induced artifacts and other sources of instability.
[CV-27] ProtoQuant: Quantization of Prototypical Parts For General and Fine-Grained Image Classification
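Quick read: This paper addresses two weaknesses of prototypical parts-based models: poor generalization at ImageNet scale and prototype drift, where learned prototypes lack reliable grounding in the training distribution and change their activation under small perturbations. The key to its solution is the ProtoQuant architecture, which applies discrete vector quantization in the latent space and constrains prototypes to a codebook learned from the training data. This keeps prototypes stable and interpretable without updating the backbone, giving an efficient and scalable classification head.
Link: https://arxiv.org/abs/2602.06592
Authors: Mikołaj Janusz,Adam Wróbel,Bartosz Zieliński,Dawid Rymarczyk
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Work under review. Code will be released upon acceptance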
Abstract:Prototypical parts-based models offer a “this looks like that” paradigm for intrinsic interpretability, yet they typically struggle with ImageNet-scale generalization and often require computationally expensive backbone finetuning. Furthermore, existing methods frequently suffer from “prototype drift,” where learned prototypes lack tangible grounding in the training distribution and change their activation under small perturbations. We present ProtoQuant, a novel architecture that achieves prototype stability and grounded interpretability through latent vector quantization. By constraining prototypes to a discrete learned codebook within the latent space, we ensure they remain faithful representations of the training data without the need to update the backbone. This design allows ProtoQuant to function as an efficient, interpretable head that scales to large-scale datasets. We evaluate ProtoQuant on ImageNet and several fine-grained benchmarks (CUB-200, Cars-196). Our results demonstrate that ProtoQuant achieves competitive classification accuracy while generalizing to ImageNet and comparable interpretability metrics to other prototypical-parts-based methods.
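The core operation behind latent vector quantization is a nearest-neighbour lookup in a codebook. The Python sketch below shows that lookup only, under the assumption of squared Euclidean distance; how ProtoQuant learns its codebook and builds the classification head is not described in the abstract and is not modelled here. The function names are hypothetical.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry with the smallest squared Euclidean distance."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))

    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def quantize(features, codebook):
    """Replace every latent feature vector by its nearest codebook entry."""
    indices = [nearest_code(f, codebook) for f in features]
    return indices, [codebook[k] for k in indices]
```

Because every feature is snapped to one of a finite set of codes, a small perturbation of the input changes the quantized output only if it pushes the feature across a boundary between two codes, which is one intuition for why a discrete codebook can stabilize prototypes.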
[CV-28] An Integer Linear Programming Approach to Geometrically Consistent Partial-Partial Shape Matching
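Quick read: This paper addresses partial-partial 3D shape matching, i.e. establishing accurate correspondences between two 3D shapes that are each only partially observed. This setting is the most common in practice (for example in 3D scanning), but has lacked effective solutions because the unknown overlapping region must be estimated while correspondences are kept accurate. The paper proposes the first Integer Linear Programming (ILP) approach for this problem; its key is to use geometric consistency as a strong prior, which enables robust estimation of the overlapping region and neighbourhood-preserving correspondences. Experiments show high-quality results in terms of matching error and smoothness, and better scalability than previous formalisms.
Link: https://arxiv.org/abs/2602.06590
Authors: Viktoria Ehm,Paul Roetzer,Florian Bernard,Daniel Cremers
Affiliations: Technical University of Munich; MCML; University of Bonn; Lamarr Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: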
Abstract:The task of establishing correspondences between two 3D shapes is a long-standing challenge in computer vision. While numerous studies address full-full and partial-full 3D shape matching, only a limited number of works have explored the partial-partial setting, very likely due to its unique challenges: we must compute accurate correspondences while at the same time find the unknown overlapping region. Nevertheless, partial-partial 3D shape matching reflects the most realistic setting, as in many real-world cases, such as 3D scanning, shapes are only partially observable. In this work, we introduce the first integer linear programming approach specifically designed to address the distinctive challenges of partial-partial shape matching. Our method leverages geometric consistency as a strong prior, enabling both robust estimation of the overlapping region and computation of neighbourhood-preserving correspondences. We empirically demonstrate that our approach achieves high-quality matching results both in terms of matching error and smoothness. Moreover, we show that our method is more scalable than previous formalisms.
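[CV-29] Think Proprioceptively: Embodied Visual Reasoning for VLA Manipulation
[Quick read]: This paper addresses a limitation of vision-language-action (VLA) models, which typically inject proprioception only as a late conditioning signal, so robot state cannot shape instruction understanding or influence which visual tokens are attended, and computation is wasted on redundant information. The key to the solution is ThinkProprio, which converts proprioception into a sequence of text tokens in the VLM embedding space and fuses them with the task instruction at the input. This early fusion lets robot state participate in subsequent visual reasoning and token selection, steering computation toward action-critical evidence while suppressing redundant visual tokens. Across CALVIN, LIBERO, and real-world manipulation, the method matches or improves over strong baselines while reducing end-to-end inference latency by more than 50%.
Link: https://arxiv.org/abs/2602.06575
Authors: Fangyuan Wang, Peng Zhou, Jiaming Qi, Shipeng Lyu, David Navarro-Alarcon, Guodong Guo
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: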
Abstract:Vision-language-action (VLA) models typically inject proprioception only as a late conditioning signal, which prevents robot state from shaping instruction understanding and from influencing which visual tokens are attended throughout the policy. We introduce ThinkProprio, which converts proprioception into a sequence of text tokens in the VLM embedding space and fuses them with the task instruction at the input. This early fusion lets embodied state participate in subsequent visual reasoning and token selection, biasing computation toward action-critical evidence while suppressing redundant visual tokens. In a systematic ablation over proprioception encoding, state entry point, and action-head conditioning, we find that text tokenization is more effective than learned projectors, and that retaining roughly 15% of visual tokens can match the performance of using the full token set. Across CALVIN, LIBERO, and real-world manipulation, ThinkProprio matches or improves over strong baselines while reducing end-to-end inference latency over 50%.
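The abstract reports that retaining roughly 15% of visual tokens can match the full token set. As a rough illustration of what such a retention step involves, the sketch below keeps the highest-scoring fraction of tokens. The relevance scores, the `retain_top_tokens` helper, and the use of a simple top-k rule are all assumptions made for illustration; the abstract does not say how ThinkProprio scores or selects tokens.

```python
def retain_top_tokens(scores, keep_ratio=0.15):
    """Return the indices (in original order) of the highest-scoring tokens.

    `scores` is a hypothetical per-token relevance score; at least one token is kept.
    """
    keep = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

# Twenty hypothetical visual tokens, three of which carry high relevance.
scores = [0.0] * 20
scores[4], scores[11], scores[17] = 0.9, 0.7, 0.8
kept = retain_top_tokens(scores)
```

With 20 tokens and a ratio of 0.15, three tokens survive, so the downstream policy would attend over 3 visual tokens instead of 20.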
[CV-30] LIBERO-X: Robustness Litmus for Vision-Language-Action Models
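[Quick read]: This paper addresses the unreliability of existing benchmarks for Vision-Language-Action (VLA) models: evaluation protocols fail to capture real-world distribution shifts, so assessments of generalization, robustness, and the alignment of perception with language-driven manipulation are limited or misleading. The key to the solution is the LIBERO-X benchmark, which has two core components. First, a hierarchical evaluation protocol with progressive difficulty levels targets three capabilities (spatial generalization, object recognition, and task instruction understanding), enabling fine-grained analysis of performance degradation as environmental and task complexity increase. Second, a high-diversity training dataset collected via human teleoperation, in which each scene supports multiple fine-grained manipulation objectives, narrows the gap between training and evaluation distributions. Experiments with representative VLA models show significant performance drops under cumulative perturbations, exposing persistent limitations in scene comprehension and instruction grounding.
Link: https://arxiv.org/abs/2602.06556
Authors: Guodong Wang, Chenkai Zhang, Qingjie Liu, Jinjin Zhang, Jiancheng Cai, Junjie Liu, Xinmin Liu
Affiliations: Meituan; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 19 pages, 14 figures and 8 tables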
Abstract:Reliable benchmarking is critical for advancing Vision-Language-Action (VLA) models, as it reveals their generalization, robustness, and alignment of perception with language-driven manipulation tasks. However, existing benchmarks often provide limited or misleading assessments due to insufficient evaluation protocols that inadequately capture real-world distribution shifts. This work systematically rethinks VLA benchmarking from both evaluation and data perspectives, introducing LIBERO-X, a benchmark featuring: 1) A hierarchical evaluation protocol with progressive difficulty levels targeting three core capabilities: spatial generalization, object recognition, and task instruction understanding. This design enables fine-grained analysis of performance degradation under increasing environmental and task complexity; 2) A high-diversity training dataset collected via human teleoperation, where each scene supports multiple fine-grained manipulation objectives to bridge the train-evaluation distribution gap. Experiments with representative VLA models reveal significant performance drops under cumulative perturbations, exposing persistent limitations in scene comprehension and instruction grounding. By integrating hierarchical evaluation with diverse training data, LIBERO-X offers a more reliable foundation for assessing and advancing VLA development.
[CV-31] NECromancer: Breathing Life into Skeletons via BVH Animation
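[Quick read]: This paper addresses the limited generalization of existing motion models across skeletal structures: most motion tokenization methods are restricted to species-specific skeletons and are hard to transfer across morphologies. The key to the solution is NECromancer (NEC), a universal motion tokenizer that operates directly on arbitrary BVH skeletons, with three components: (1) an Ontology-aware Skeletal Graph Encoder (OwO) that encodes structural priors from BVH files (joint semantics, rest-pose offsets, and skeletal topology) into skeletal embeddings; (2) a Topology-Agnostic Tokenizer (TAT) that compresses motion sequences into a universal, topology-invariant discrete representation; and (3) the Unified BVH Universe (UvU), a large-scale dataset of BVH motions across heterogeneous skeletons. The approach disentangles motion from skeletal structure and supports cross-species motion transfer, composition, denoising, generation with token-based models, and text-motion retrieval, providing a unified framework for motion analysis and synthesis across diverse morphologies.
Link: https://arxiv.org/abs/2602.06548
Authors: Mingxi Xu, Qi Wang, Zhengyu Wen, Phong Dao Thien, Zhengyu Li, Ning Zhang, Xiaoyu He, Wei Zhao, Kehong Gong, Mingyuan Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: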
Abstract:Motion tokenization is a key component of generalizable motion models, yet most existing approaches are restricted to species-specific skeletons, limiting their applicability across diverse morphologies. We propose NECromancer (NEC), a universal motion tokenizer that operates directly on arbitrary BVH skeletons. NEC consists of three components: (1) an Ontology-aware Skeletal Graph Encoder (OwO) that encodes structural priors from BVH files, including joint semantics, rest-pose offsets, and skeletal topology, into skeletal embeddings; (2) a Topology-Agnostic Tokenizer (TAT) that compresses motion sequences into a universal, topology-invariant discrete representation; and (3) the Unified BVH Universe (UvU), a large-scale dataset aggregating BVH motions across heterogeneous skeletons. Experiments show that NEC achieves high-fidelity reconstruction under substantial compression and effectively disentangles motion from skeletal structure. The resulting token space supports cross-species motion transfer, composition, denoising, generation with token-based models, and text-motion retrieval, establishing a unified framework for motion analysis and synthesis across diverse morphologies. Demo page: this https URL
[CV-32] Universal Anti-forensics Attack against Image Forgery Detection via Multi-modal Guidance
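[Quick read]: This paper addresses the lack of adversarial robustness of current AI-Generated Content (AIGC) detectors in real-world use, in particular the fact that existing evaluation protocols largely overlook anti-forensics attacks. The key to the solution is the ForgeryEraser framework, which exploits a vulnerability arising from the widespread use of Vision-Language Models (VLMs, e.g., CLIP) as shared backbones: downstream detectors inherit the feature space of these publicly accessible models. A multi-modal guidance loss drives forged image embeddings in the VLM feature space toward text-derived authentic anchors and away from forgery anchors, erasing forgery traces without access to the target AIGC detector. The method achieves universal anti-forensics attacks on advanced AIGC detectors and causes explainable forensic models to produce explanations consistent with authentic images for forged inputs.
Link: https://arxiv.org/abs/2602.06530
Authors: Haipeng Li, Rongxuan Peng, Anwei Luo, Shunquan Tan, Changsheng Chen, Anastasia Antsiferova
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: 17 pages, 11 figures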
Abstract:The rapid advancement of AI-Generated Content (AIGC) technologies poses significant challenges for authenticity assessment. However, existing evaluation protocols largely overlook anti-forensics attacks, failing to ensure the comprehensive robustness of state-of-the-art AIGC detectors in real-world applications. To bridge this gap, we propose ForgeryEraser, a framework designed to execute universal anti-forensics attacks without access to the target AIGC detectors. We reveal an adversarial vulnerability stemming from the systemic reliance on Vision-Language Models (VLMs) as shared backbones (e.g., CLIP), where downstream AIGC detectors inherit the feature space of these publicly accessible models. Instead of traditional logit-based optimization, we design a multi-modal guidance loss to drive forged image embeddings within the VLM feature space toward text-derived authentic anchors to erase forgery traces, while repelling them from forgery anchors. Extensive experiments demonstrate that ForgeryEraser causes substantial performance degradation to advanced AIGC detectors on both global synthesis and local editing benchmarks. Moreover, ForgeryEraser induces explainable forensic models to generate explanations consistent with authentic images for forged images. Our code will be made publicly available.
[CV-33] AdaptOVCD: Training-Free Open-Vocabulary Remote Sensing Change Detection via Adaptive Information Fusion
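[Quick read]: This paper addresses the reliance of remote sensing change detection on predefined categories and large-scale pixel-level annotations, which limits generalization and applicability in open-world scenarios. The core challenge is to detect changes of arbitrary categories in a zero-shot manner without retraining while mitigating error propagation across multiple fused models. The key to the solution is AdaptOVCD, a training-free Open-Vocabulary Change Detection (OVCD) architecture built on dual-dimensional multi-level information fusion: vertically, it fuses information at the data, feature, and decision levels; horizontally, it adds targeted adaptive designs, namely Adaptive Radiometric Alignment (ARA), Adaptive Change Thresholding (ACT), and Adaptive Confidence Filtering (ACF). These let heterogeneous pre-trained models (SAM-HQ, DINOv3, DGTRS-CLIP) work in close synergy. In cross-dataset evaluations the method reaches 84.89% of the fully supervised performance upper bound and significantly outperforms existing training-free methods.
Link: https://arxiv.org/abs/2602.06529
Authors: Mingyu Dou, Shi Qiu, Ming Hu, Yifan Chen, Huping Ye, Xiaohan Liao, Zhe Sun
Affiliations: Northwest Polytechnical University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: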
Abstract:Remote sensing change detection plays a pivotal role in domains such as environmental monitoring, urban planning, and disaster assessment. However, existing methods typically rely on predefined categories and large-scale pixel-level annotations, which limit their generalization and applicability in open-world scenarios. To address these limitations, this paper proposes AdaptOVCD, a training-free Open-Vocabulary Change Detection (OVCD) architecture based on dual-dimensional multi-level information fusion. The framework integrates multi-level information fusion across data, feature, and decision levels vertically while incorporating targeted adaptive designs horizontally, achieving deep synergy among heterogeneous pre-trained models to effectively mitigate error propagation. Specifically, (1) at the data level, Adaptive Radiometric Alignment (ARA) fuses radiometric statistics with original texture features and synergizes with SAM-HQ to achieve radiometrically consistent segmentation; (2) at the feature level, Adaptive Change Thresholding (ACT) combines global difference distributions with edge structure priors and leverages DINOv3 to achieve robust change detection; (3) at the decision level, Adaptive Confidence Filtering (ACF) integrates semantic confidence with spatial constraints and collaborates with DGTRS-CLIP to achieve high-confidence semantic identification. Comprehensive evaluations across nine scenarios demonstrate that AdaptOVCD detects arbitrary category changes in a zero-shot manner, significantly outperforming existing training-free methods. Meanwhile, it achieves 84.89% of the fully-supervised performance upper bound in cross-dataset evaluations and exhibits superior generalization capabilities. The code is available at this https URL.
[CV-34] MicroBi-ConvLSTM: An Ultra-Lightweight Efficient Model for Human Activity Recognition on Resource Constrained Devices
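[Quick read]: This paper addresses accurate Human Activity Recognition (HAR) on resource-constrained wearables, specifically within the limited SRAM budget of microcontrollers. Existing lightweight models such as TinierHAR and TinyHAR achieve strong accuracy but exceed these memory budgets once operating system overhead is considered. The key to the solution is the MicroBi-ConvLSTM architecture: two-stage convolutional feature extraction with 4x temporal pooling reduces the parameter count, and a single bidirectional LSTM layer retains temporal dependencies. The model averages 11.4K parameters, a 2.9x reduction versus TinierHAR and 11.9x versus DeepConvLSTM, while preserving linear O(N) complexity, which makes it suitable for deployment on edge devices.
Link: https://arxiv.org/abs/2602.06523
Authors: Mridankan Mandal
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: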
Abstract:Human Activity Recognition (HAR) on resource constrained wearables requires models that balance accuracy against strict memory and computational budgets. State of the art lightweight architectures such as TinierHAR (34K parameters) and TinyHAR (55K parameters) achieve strong accuracy, but exceed memory budgets of microcontrollers with limited SRAM once operating system overhead is considered. We present MicroBi-ConvLSTM, an ultra-lightweight convolutional-recurrent architecture achieving 11.4K parameters on average through two stage convolutional feature extraction with 4x temporal pooling and a single bidirectional LSTM layer. This represents 2.9x parameter reduction versus TinierHAR and 11.9x versus DeepConvLSTM while preserving linear O(N) complexity. Evaluation across eight diverse HAR benchmarks shows that MicroBi-ConvLSTM maintains competitive performance within the ultra-lightweight regime: 93.41% macro F1 on UCI-HAR, 94.46% on SKODA assembly gestures, and 88.98% on Daphnet gait freeze detection. Systematic ablation reveals task dependent component contributions where bidirectionality benefits episodic event detection, but provides marginal gains on periodic locomotion. INT8 post training quantization incurs only 0.21% average F1-score degradation, yielding a 23.0 KB average deployment footprint suitable for memory constrained edge devices.
[CV-35] DriveWorld-VLA: Unified Latent-Space World Modeling with Vision-Language-Action for Autonomous Driving
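[Quick read]: This paper addresses the difficulty, in current end-to-end (E2E) autonomous driving methods, of unifying future scene evolution and action planning within a single architecture. The root cause is inadequate sharing of latent states, which limits the influence of visual imagination on decisions. The key to the solution is the DriveWorld-VLA framework, which tightly integrates Vision-Language-Action (VLA) models and world models in the latent space, so that the VLA planner benefits directly from holistic scene-evolution modeling and uses the world model's latent states as its core decision-making states to assess how candidate actions affect future scene evolution. This design enables controllable, action-conditioned imagination at the feature level and avoids expensive pixel-level rollouts.
Link: https://arxiv.org/abs/2602.06521
Authors: Feiyang Jia, Lin Liu, Ziying Song, Caiyan Jia, Hangjun Ye, Xiaoshuai Hao, Long Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 20 pages, 7 tables, 12 figures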
Abstract:End-to-end (E2E) autonomous driving has recently attracted increasing interest in unifying Vision-Language-Action (VLA) with World Models to enhance decision-making and forward-looking imagination. However, existing methods fail to effectively unify future scene evolution and action planning within a single architecture due to inadequate sharing of latent states, limiting the impact of visual imagination on action decisions. To address this limitation, we propose DriveWorld-VLA, a novel framework that unifies world modeling and planning within a latent space by tightly integrating VLA and world models at the representation level, which enables the VLA planner to benefit directly from holistic scene-evolution modeling and reduces reliance on dense annotated supervision. Additionally, DriveWorld-VLA incorporates the latent states of the world model as core decision-making states for the VLA planner, allowing the planner to assess how candidate actions impact future scene evolution. By conducting world modeling entirely in the latent space, DriveWorld-VLA supports controllable, action-conditioned imagination at the feature level, avoiding expensive pixel-level rollouts. Extensive open-loop and closed-loop evaluations demonstrate the effectiveness of DriveWorld-VLA, which achieves state-of-the-art performance with 91.3 PDMS on NAVSIMv1, 86.8 EPDMS on NAVSIMv2, and 0.16 3-second average collision rate on nuScenes. Code and models will be released in this https URL.
[CV-36] FloorplanVLM: A Vision-Language Model for Floorplan Vectorization
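[Quick read]: This paper addresses the conversion of raster floorplans into engineering-grade vector graphics, where the main challenges are complex topology and strict geometric constraints. Pixel-based methods rely on fragile heuristics, while query-based transformers tend to generate fragmented rooms. The key to the solution is FloorplanVLM, a unified framework that reformulates floorplan vectorization as an image-conditioned sequence modeling task and directly outputs structured JSON sequences representing the global topology, a 'pixels-to-sequence' paradigm. This enables precise, holistic constraint satisfaction for complex geometries such as slanted walls and curved arcs. The approach is supported by a large-scale dataset (Floorplan-2M) with a high-fidelity subset (Floorplan-HQ-300K) and a progressive training strategy (SFT followed by GRPO) that targets both structural validity and geometric precision.
Link: https://arxiv.org/abs/2602.06507
Authors: Yuanqing Liu, Ziming Yang, Yulong Li, Yue Yang
Affiliations: Beike
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: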
Abstract:Converting raster floorplans into engineering-grade vector graphics is challenging due to complex topology and strict geometric constraints. To address this, we present FloorplanVLM, a unified framework that reformulates floorplan vectorization as an image-conditioned sequence modeling task. Unlike pixel-based methods that rely on fragile heuristics or query-based transformers that generate fragmented rooms, our model directly outputs structured JSON sequences representing the global topology. This ‘pixels-to-sequence’ paradigm enables the precise and holistic constraint satisfaction of complex geometries, such as slanted walls and curved arcs. To support this data-hungry approach, we introduce a scalable data engine: we construct a large-scale dataset (Floorplan-2M) and a high-fidelity subset (Floorplan-HQ-300K) to balance geometric diversity and pixel-level precision. We then employ a progressive training strategy, using Supervised Fine-Tuning (SFT) for structural grounding and quality annealing, followed by Group Relative Policy Optimization (GRPO) for strict geometric alignment. To standardize evaluation on complex layouts, we establish and open-source FPBench-2K. Evaluated on this rigorous benchmark, FloorplanVLM demonstrates exceptional structural validity, achieving 92.52% external-wall IoU and robust generalization across non-Manhattan architectures.
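The abstract names Group Relative Policy Optimization (GRPO) as the second training stage. The part of GRPO that gives it its name is the advantage computed relative to a group of sampled outputs for the same input: each reward is centered on the group mean and scaled by the group standard deviation. The sketch below shows only that normalization step. The reward values are hypothetical (for instance, they could be a geometric score per sampled floorplan), and the choice of population standard deviation and the `eps` term are assumptions; the abstract does not specify the paper's reward design.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four outputs sampled from the same input image.
rewards = [0.2, 0.5, 0.8, 0.5]
advantages = group_relative_advantages(rewards)
```

Samples that score above the group mean receive a positive advantage and those below receive a negative one, so no separate value network is needed to provide a baseline.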
[CV-37] MultiGraspNet: A Multitask 3D Vision Model for Multi-gripper Robotic Grasping
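[Quick read]: This paper addresses the limited applicability of current vision-based robotic grasping models to setups with multiple end effectors (such as parallel grippers and vacuum cups): existing methods either target a single gripper and may require costly dual-arm setups, or rely on custom hybrid grippers with ad-hoc learning procedures whose logic does not transfer across tasks. The key to the solution is MultiGraspNet, a unified multitask 3D deep learning framework that predicts feasible poses for parallel and vacuum grippers simultaneously within one model. By sharing early-stage features while keeping gripper-specific refiners, it exploits complementary information across grasping modalities and improves robustness and adaptability in cluttered scenes. It is trained on the aligned GraspNet-1Billion and SuctionNet-1Billion datasets and generates graspability masks that quantify the suitability of each scene point for a successful grasp. In real-world experiments on a single-arm multi-gripper setup, it outperforms the vacuum baseline, grasping 16% more seen objects and 32% more novel ones, while remaining competitive on the parallel-gripper task.
Link: https://arxiv.org/abs/2602.06504
Authors: Stephany Ortuno-Chanelo, Paolo Rabino, Enrico Civitelli, Tatiana Tommasi, Raffaello Camoriano
Affiliations: VANDAL Laboratory, Department of Control and Computer Engineering, Politecnico di Torino, Turin, Italy; Comau S.p.A., Advanced Automation Solutions, Grugliasco, Italy; Istituto Italiano di Tecnologia, Genoa, Italy
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: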
Abstract:Vision-based models for robotic grasping automate critical, repetitive, and draining industrial tasks. Existing approaches are typically limited in two ways: they either target a single gripper and are potentially applied on costly dual-arm setups, or rely on custom hybrid grippers that require ad-hoc learning procedures with logic that cannot be transferred across tasks, restricting their general applicability. In this work, we present MultiGraspNet, a novel multitask 3D deep learning method that predicts feasible poses simultaneously for parallel and vacuum grippers within a unified framework, enabling a single robot to handle multiple end effectors. The model is trained on the richly annotated GraspNet-1Billion and SuctionNet-1Billion datasets, which have been aligned for the purpose, and generates graspability masks quantifying the suitability of each scene point for successful grasps. By sharing early-stage features while maintaining gripper-specific refiners, MultiGraspNet effectively leverages complementary information across grasping modalities, enhancing robustness and adaptability in cluttered scenes. We characterize MultiGraspNet’s performance with an extensive experimental analysis, demonstrating its competitiveness with single-task models on relevant benchmarks. We run real-world experiments on a single-arm multi-gripper robotic setup showing that our approach outperforms the vacuum baseline, grasping 16% more seen objects and 32% more of the novel ones, while obtaining competitive results for the parallel task.
[CV-38] Forest canopy height estimation from satellite RGB imagery using large-scale airborne LiDAR-derived training data and monocular depth estimation
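[Quick read]: This paper addresses large-scale, high-resolution forest canopy height mapping, which is constrained by the spatial sparsity and inherent uncertainties of spaceborne LiDAR data. The key to the solution is to use canopy height models (CHMs) derived from large-scale, publicly available airborne LiDAR point clouds as training data, together with 3 m resolution PlanetScope and airborne RGB imagery, to train the state-of-the-art monocular depth estimation model Depth Anything V2. The resulting model, Depth2CHM, estimates spatially continuous CHMs directly from PlanetScope RGB imagery. In independent validation, it reduces the mean absolute error by approximately 1.5 m and the RMSE by approximately 2 m compared with an existing global meter-resolution CHM product.
Link: https://arxiv.org/abs/2602.06503
Authors: Yongkang Lai, Xihan Mu, Tim R. McVicar, Dasheng Fan, Donghui Xie, Shanxin Guo, Wenli Huang, Tianjie Zhao, Guangjian Yan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: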
Abstract:Large-scale, high-resolution forest canopy height mapping plays a crucial role in understanding regional and global carbon and water cycles. Spaceborne LiDAR missions, including the Ice, Cloud, and Land Elevation Satellite-2 (ICESat-2) and the Global Ecosystem Dynamics Investigation (GEDI), provide global observations of forest structure but are spatially sparse and subject to inherent uncertainties. In contrast, near-surface LiDAR platforms, such as airborne and unmanned aerial vehicle (UAV) LiDAR systems, offer much finer measurements of forest canopy structure, and a growing number of countries have made these datasets openly available. In this study, a state-of-the-art monocular depth estimation model, Depth Anything V2, was trained using approximately 16,000 km2 of canopy height models (CHMs) derived from publicly available airborne LiDAR point clouds and related products across multiple countries, together with 3 m resolution PlanetScope and airborne RGB imagery. The trained model, referred to as Depth2CHM, enables the estimation of spatially continuous CHMs directly from PlanetScope RGB imagery. Independent validation was conducted at sites in China (approximately 1 km2) and the United States (approximately 116 km2). The results showed that Depth2CHM could accurately estimate canopy height, with biases of 0.59 m and 0.41 m and root mean square errors (RMSEs) of 2.54 m and 5.75 m for these two sites, respectively. Compared with an existing global meter-resolution CHM product, the mean absolute error is reduced by approximately 1.5 m and the RMSE by approximately 2 m. These results demonstrated that monocular depth estimation networks trained with large-scale airborne LiDAR-derived canopy height data provide a promising and scalable pathway for high-resolution, spatially continuous forest canopy height estimation from satellite RGB imagery.
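The validation figures in the abstract are reported as bias, root mean square error (RMSE), and mean absolute error (MAE). For readers less familiar with these metrics, the sketch below computes all three for a pair of height rasters flattened to lists. The height values are made up for illustration, and taking bias as predicted minus reference is an assumption of this sketch; the abstract does not state the sign convention.

```python
import math

def height_errors(predicted, reference):
    """Return (bias, mae, rmse) in the same unit as the inputs (e.g. metres)."""
    diffs = [p - r for p, r in zip(predicted, reference)]
    n = len(diffs)
    bias = sum(diffs) / n                            # mean signed error
    mae = sum(abs(d) for d in diffs) / n             # mean absolute error
    rmse = math.sqrt(sum(d * d for d in diffs) / n)  # root mean square error
    return bias, mae, rmse

# Hypothetical canopy heights in metres for four pixels.
predicted = [10.0, 12.0, 20.0, 5.0]
reference = [9.0, 14.0, 19.0, 5.0]
bias, mae, rmse = height_errors(predicted, reference)
```

In this toy case the signed errors cancel, so the bias is zero even though the MAE and RMSE are not, which is why the abstract reports bias alongside the other two metrics rather than instead of them.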
[CV-39] DreamHome-Pano: Design-Aware and Conflict-Free Panoramic Interior Generation
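[Quick read]: This paper addresses 'condition conflicts' in personalized interior generation: rigid architectural structural constraints and specific stylistic preferences are hard to balance, so stylistic attributes can compromise the geometric precision of the layout. The key to the solution is the DreamHome-Pano framework. A Prompt-LLM serves as a semantic bridge that translates layout constraints and style references into professional descriptive prompts for cross-modal alignment. A Conflict-Free Control architecture combines structural-aware geometric priors with a multi-condition decoupling strategy to keep stylistic interference from eroding the spatial layout, yielding a better balance between aesthetic quality and structural consistency.
Link: https://arxiv.org/abs/2602.06494
Authors: Lulu Chen, Yijiang Hu, Yuanqing Liu, Yulong Li, Yue Yang
Affiliations: Beike
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: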
Abstract:In modern interior design, the generation of personalized spaces frequently necessitates a delicate balance between rigid architectural structural constraints and specific stylistic preferences. However, existing multi-condition generative frameworks often struggle to harmonize these inputs, leading to “condition conflicts” where stylistic attributes inadvertently compromise the geometric precision of the layout. To address this challenge, we present DreamHome-Pano, a controllable panoramic generation framework designed for high-fidelity interior synthesis. Our approach introduces a Prompt-LLM that serves as a semantic bridge, effectively translating layout constraints and style references into professional descriptive prompts to achieve precise cross-modal alignment. To safeguard architectural integrity during the generative process, we develop a Conflict-Free Control architecture that incorporates structural-aware geometric priors and a multi-condition decoupling strategy, effectively suppressing stylistic interference from eroding the spatial layout. Furthermore, we establish a comprehensive panoramic interior benchmark alongside a multi-stage training pipeline, encompassing progressive Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). Experimental results demonstrate that DreamHome-Pano achieves a superior balance between aesthetic quality and structural consistency, offering a robust and professional-grade solution for panoramic interior visualization.
[CV-40] Rebenchmarking Unsupervised Monocular 3D Occupancy Prediction
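[Quick read]: This paper addresses 3D occupancy prediction from a single image, particularly structural inference in occluded regions, where existing unsupervised methods are held back by inconsistent training and evaluation protocols and by reliance on 2D ground truth. The key to the solution is twofold. First, by interpreting the variables involved in volume rendering, the paper identifies the most physically consistent representation of occupancy probability and improves the evaluation protocol accordingly, so that unsupervised methods can be evaluated in a manner consistent with supervised ones. Second, it introduces an occlusion-aware polarization mechanism that uses multi-view visual cues to sharpen the distinction between occupied and free space in occluded regions.
Link: https://arxiv.org/abs/2602.06488
Authors: Zizhan Guo, Yi Feng, Mengtan Zhang, Haoran Zhang, Wei Ye, Rui Fan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: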
Abstract:Inferring the 3D structure from a single image, particularly in occluded regions, remains a fundamental yet unsolved challenge in vision-centric autonomous driving. Existing unsupervised approaches typically train a neural radiance field and treat the network outputs as occupancy probabilities during evaluation, overlooking the inconsistency between training and evaluation protocols. Moreover, the prevalent use of 2D ground truth fails to reveal the inherent ambiguity in occluded regions caused by insufficient geometric constraints. To address these issues, this paper presents a reformulated benchmark for unsupervised monocular 3D occupancy prediction. We first interpret the variables involved in the volume rendering process and identify the most physically consistent representation of the occupancy probability. Building on these analyses, we improve existing evaluation protocols by aligning the newly identified representation with voxel-wise 3D occupancy ground truth, thereby enabling unsupervised methods to be evaluated in a manner consistent with that of supervised approaches. Additionally, to impose explicit constraints in occluded regions, we introduce an occlusion-aware polarization mechanism that incorporates multi-view visual cues to enhance discrimination between occupied and free spaces in these regions. Extensive experiments demonstrate that our approach not only significantly outperforms existing unsupervised approaches but also matches the performance of supervised ones. Our source code and evaluation protocol will be made available upon publication.
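The abstract refers to "the variables involved in the volume rendering process" without listing them. In the standard discrete volume rendering used by neural radiance fields, each sample along a ray has a density, an opacity (alpha), an accumulated transmittance, and a rendering weight. The sketch below computes these quantities for one ray so the candidates are concrete. The density and step values are hypothetical, and the sketch deliberately does not claim which of these quantities the paper identifies as the occupancy probability.

```python
import math

def render_weights(densities, deltas):
    """Per-sample opacity, transmittance, and rendering weight along one ray.

    alpha_i = 1 - exp(-sigma_i * delta_i)
    T_i     = product of (1 - alpha_j) over samples j before i
    w_i     = T_i * alpha_i
    """
    alphas, transmittances, weights = [], [], []
    transmittance = 1.0
    for sigma, delta in zip(densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        alphas.append(alpha)
        transmittances.append(transmittance)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return alphas, transmittances, weights

# Hypothetical ray: empty space, a thin semi-transparent region, then a dense surface.
alphas, transmittances, weights = render_weights([0.0, 0.5, 5.0, 0.1], [1.0, 1.0, 1.0, 1.0])
```

Note that the sample behind the dense surface has a small weight mainly because transmittance has already dropped, not because its own opacity says anything about occupancy there, which illustrates why treating raw network outputs as occupancy in occluded regions needs care.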
[CV-41] Instance-Free Domain Adaptive Object Detection
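[Quick read]: This paper addresses Instance-Free Domain Adaptive Object Detection, that is, effective domain adaptation when the target domain contains no foreground instances of the objects of interest. Conventional methods assume the target data contains enough foreground instances for feature alignment, but in practical scenarios such as wildlife monitoring and lesion detection, collecting target-domain data that contains objects is prohibitively costly, while background-only data is abundant. The authors propose the Relational and Structural Consistency Network (RSCN), whose core idea is to align domains through background feature prototypes while encouraging consistency, within each domain, in the relationship between source foreground features and background features, enabling robust adaptation without target instances.
Link: https://arxiv.org/abs/2602.06484
Authors: Hengfu Yu, Jinhong Deng, Lixin Duan, Wen Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 12 figures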
Abstract:While Domain Adaptive Object Detection (DAOD) has made significant strides, most methods rely on unlabeled target data that is assumed to contain sufficient foreground instances. However, in many practical scenarios (e.g., wildlife monitoring, lesion detection), collecting target domain data with objects of interest is prohibitively costly, whereas background-only data is abundant. This common practical constraint introduces a significant technical challenge: the difficulty of achieving domain alignment when target instances are unavailable, forcing adaptation to rely solely on the target background information. We formulate this challenge as the novel problem of Instance-Free Domain Adaptive Object Detection. To tackle this, we propose the Relational and Structural Consistency Network (RSCN) which pioneers an alignment strategy based on background feature prototypes while simultaneously encouraging consistency in the relationship between the source foreground features and the background features within each domain, enabling robust adaptation even without target instances. To facilitate research, we further curate three specialized benchmarks, including simulative auto-driving detection, wildlife detection, and lung nodule detection. Extensive experiments show that RSCN significantly outperforms existing DAOD methods across all three benchmarks in the instance-free scenario. The code and benchmarks will be released soon.
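The abstract describes alignment "based on background feature prototypes". One common way to form a prototype is to average a set of feature vectors, and one simple way to measure misalignment is the distance between the source and target prototypes. The sketch below illustrates that idea only. The feature values and function names are hypothetical, and the mean-plus-Euclidean-distance formulation is an assumption; the abstract does not specify RSCN's actual prototype construction or loss.

```python
import math

def prototype(features):
    """Average a list of equal-length feature vectors into one prototype."""
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]

def prototype_gap(source_bg, target_bg):
    """Euclidean distance between source and target background prototypes."""
    return math.dist(prototype(source_bg), prototype(target_bg))

# Hypothetical 2-D background features from each domain.
source_bg = [(1.0, 0.0), (3.0, 2.0)]
target_bg = [(2.0, 1.0), (2.0, 5.0)]
gap = prototype_gap(source_bg, target_bg)
```

A training objective in this spirit would try to shrink `gap`, and because it uses background features only, it remains computable when the target domain contains no object instances.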
[CV-42] Efficient-LVSM: Faster Cheaper and Better Large View Synthesis Model via Decoupled Co-Refinement Attention ICLR2026
[Quick read]: This paper addresses the high computational complexity and rigid parameter sharing of transformer-based feedforward novel view synthesis (NVS) models. Existing methods such as LVSM use full self-attention, which incurs quadratic complexity in the number of input views and forces parameter sharing among heterogeneous tokens. The key to the solution is Efficient-LVSM, a dual-stream architecture with a decoupled co-refinement mechanism: it applies intra-view self-attention to input views and self-then-cross attention to target views, eliminating unnecessary computation. The design yields 2x faster training convergence and 4.4x faster inference, reaches 29.86 dB PSNR on RealEstate10K with 2 input views (0.2 dB above LVSM), and generalizes zero-shot to unseen view counts.
Link: https://arxiv.org/abs/2602.06478
Authors: Xiaosong Jia, Yihang Sun, Junqi You, Songbur Wong, Zichen Zou, Junchi Yan, Zuxuan Wu, Yu-Gang Jiang
Affiliations: Institute of Trustworthy Embodied AI (TEAI), Fudan University; Sch. of Computer Science & Sch. of Artificial Intelligence, Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ICLR 2026
Abstract:Feedforward models for novel view synthesis (NVS) have recently been advanced by transformer-based methods like LVSM, which use attention among all input and target views. In this work, we argue that its full self-attention design is suboptimal, suffering from quadratic complexity with respect to the number of input views and rigid parameter sharing among heterogeneous tokens. We propose Efficient-LVSM, a dual-stream architecture that avoids these issues with a decoupled co-refinement mechanism. It applies intra-view self-attention for input views and self-then-cross attention for target views, eliminating unnecessary computation. Efficient-LVSM achieves 29.86 dB PSNR on RealEstate10K with 2 input views, surpassing LVSM by 0.2 dB, with 2x faster training convergence and 4.4x faster inference speed. Efficient-LVSM achieves state-of-the-art performance on multiple benchmarks, exhibits strong zero-shot generalization to unseen view counts, and enables incremental inference with KV-cache, thanks to its decoupled designs.
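To make the complexity argument concrete, the sketch below counts query-key pairs per attention layer under two schemes. The first is full self-attention over all tokens of all views. The second is one reading of the decoupled design in the abstract: each input view attends only within itself, and each target view attends within itself and then to all input tokens. The exact attention pattern in Efficient-LVSM may differ, so the second formula is an assumption of this sketch, as are the function names and the view and token counts.

```python
def full_attention_pairs(n_input, n_target, tokens_per_view):
    """Query-key pairs when every token attends to every token of every view."""
    total = (n_input + n_target) * tokens_per_view
    return total * total

def decoupled_attention_pairs(n_input, n_target, tokens_per_view):
    """Query-key pairs under the assumed decoupled pattern."""
    t = tokens_per_view
    input_self = n_input * t * t                # intra-view self-attention per input view
    target_self = n_target * t * t              # self-attention within each target view
    target_cross = n_target * t * (n_input * t) # target tokens attend to all input tokens
    return input_self + target_self + target_cross

# Hypothetical sizes: 2 input views, 1 target view, 4 tokens per view.
full = full_attention_pairs(2, 1, 4)
decoupled = decoupled_attention_pairs(2, 1, 4)
```

Under this reading the decoupled count grows linearly with the number of input views while the full count grows quadratically, which matches the direction of the claim in the abstract.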
[CV-43] LAB-Det: Language as a Domain-Invariant Bridge for Training-Free One-Shot Domain Generalization in Object Detection
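[Quick read]: This paper addresses the marked performance drop of foundation object detectors (such as GLIP and Grounding DINO) in specialized, data-scarce domains such as underwater imagery or industrial defect detection. Typical cross-domain few-shot methods fine-tune on scarce target data, which incurs cost and risks overfitting. The key to the solution is a training-free one-shot domain generalization setting, in which a detector adapts using only one annotated exemplar per class and no weight updates. The proposed LAB-Det method projects each exemplar into a descriptive text and uses Language As a domain-invariant Bridge to condition and guide a frozen detector, replacing gradient-based feature adaptation and enabling robust generalization in data-scarce settings.
Link: https://arxiv.org/abs/2602.06474
Authors: Xu Zhang, Zhe Chen, Jing Zhang, Dacheng Tao
Affiliations: The University of Sydney; La Trobe University; Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: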
Abstract:Foundation object detectors such as GLIP and Grounding DINO excel on general-domain data but often degrade in specialized and data-scarce settings like underwater imagery or industrial defects. Typical cross-domain few-shot approaches rely on fine-tuning scarce target data, incurring cost and overfitting risks. We instead ask: Can a frozen detector adapt with only one exemplar per class without training? To answer this, we introduce training-free one-shot domain generalization for object detection, where detectors must adapt to specialized domains with only one annotated exemplar per class and no weight updates. To tackle this task, we propose LAB-Det, which exploits Language As a domain-invariant Bridge. Instead of adapting visual features, we project each exemplar into a descriptive text that conditions and guides a frozen detector. This linguistic conditioning replaces gradient-based adaptation, enabling robust generalization in data-scarce domains. We evaluate on UODD (underwater) and NEU-DET (industrial defects), two widely adopted benchmarks for data-scarce detection, where object boundaries are often ambiguous, and LAB-Det achieves up to 5.4 mAP improvement over state-of-the-art fine-tuned baselines without updating a single parameter. These results establish linguistic adaptation as an efficient and interpretable alternative to fine-tuning in specialized detection settings.
[CV-44] Exploring Specular Reflection Inconsistency for Generalizable Face Forgery Detection
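[Quick Read]: This paper addresses the difficulty that existing detectors based on spatial and frequency features have in identifying high-fidelity Generative AI forgeries, such as deepfake faces produced by diffusion models. The key to the solution is twofold. First, Retinex theory is used to estimate face texture quickly and accurately and to separate out the specular reflection component. Second, motivated by the nonlinearity and parametric complexity of specular reflection in the Phong illumination model, the authors propose the Specular-Reflection-Inconsistency-Network (SRI-Net), which uses a two-stage cross-attention mechanism to capture inconsistencies between the specular reflection and its corresponding face texture and direct light, enabling robust detection of generated forgeries.
Link: https://arxiv.org/abs/2602.06452
Authors: Hongyan Fei,Zexi Jia,Chuanwei Huang,Jinchao Zhang,Jie Zhou
Affiliations: Peking University; Tencent Inc
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: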
Abstract:Detecting deepfakes has become increasingly challenging as forgery faces synthesized by AI-generated methods, particularly diffusion models, achieve unprecedented quality and resolution. Existing forgery detection approaches relying on spatial and frequency features demonstrate limited efficacy against high-quality, entirely synthesized forgeries. In this paper, we propose a novel detection method grounded in the observation that facial attributes governed by complex physical laws and multiple parameters are inherently difficult to replicate. Specifically, we focus on illumination, particularly the specular reflection component in the Phong illumination model, which poses the greatest replication challenge due to its parametric complexity and nonlinear formulation. We introduce a fast and accurate face texture estimation method based on Retinex theory to enable precise specular reflection separation. Furthermore, drawing from the mathematical formulation of specular reflection, we posit that forgery evidence manifests not only in the specular reflection itself but also in its relationship with corresponding face texture and direct light. To address this issue, we design the Specular-Reflection-Inconsistency-Network (SRI-Net), incorporating a two-stage cross-attention mechanism to capture these correlations and integrate specular reflection related features with image features for robust forgery detection. Experimental results demonstrate that our method achieves superior performance on both traditional deepfake datasets and generative deepfake datasets, particularly those containing diffusion-generated forgery faces.
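The abstract singles out the specular term of the Phong illumination model for its nonlinear form and many parameters. As a rough illustration only, the textbook specular term can be sketched as below. The function and parameter names (`phong_specular`, `k_s`, `shininess`) are illustrative assumptions and are not taken from the paper, and this is not the authors' separation method.

```python
import math

def normalize(v):
    """Scale a 3D vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reflect(light_dir, normal):
    """Mirror the surface-to-light direction L about the normal N: R = 2 (N.L) N - L."""
    d = dot(normal, light_dir)
    return tuple(2.0 * d * n - l for n, l in zip(normal, light_dir))

def phong_specular(normal, light_dir, view_dir, k_s, shininess, light_intensity=1.0):
    """Textbook Phong specular term: k_s * max(R.V, 0)^shininess * light intensity."""
    n, l, v = normalize(normal), normalize(light_dir), normalize(view_dir)
    if dot(n, l) <= 0.0:
        return 0.0  # the light is behind the surface, so there is no highlight
    r = reflect(l, n)
    return k_s * max(dot(r, v), 0.0) ** shininess * light_intensity

# Looking straight down the mirror direction gives the full highlight;
# moving the viewer off-axis makes it fall off steeply because of the exponent.
on_axis = phong_specular((0, 0, 1), (0, 0, 1), (0, 0, 1), k_s=0.5, shininess=32)
off_axis = phong_specular((0, 0, 1), (0, 0, 1), (1, 0, 1), k_s=0.5, shininess=32)
```

The exponent is what makes the term hard to match by appearance alone: a small change in viewing or lighting geometry changes the highlight by orders of magnitude.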
[CV-45] What Is Wrong with Synthetic Data for Scene Text Recognition? A Strong Synthetic Engine with Diverse Simulations and Self-Evolution
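[Quick Read]: This paper addresses two problems in training Scene Text Recognition (STR) models: real data is hard and expensive to obtain, and existing synthetic data underperforms because of a domain gap. The core of the solution is UnionST, a data generation engine that combines diverse text corpora, fonts, and layouts to make synthetic data markedly more realistic in complex scenes. The authors also build the large-scale synthetic dataset UnionST-S and design a Self-Evolution Learning (SEL) framework that makes efficient use of a small number of real labels. The key idea is that greater data diversity plus a low-label annotation strategy narrows the gap between synthetic and real data, so that in certain scenarios the resulting models outperform those trained on real data alone.
Link: https://arxiv.org/abs/2602.06450
Authors: Xingsong Ye,Yongkun Du,JiaXin Zhang,Chen Li,Jing LYU,Zhineng Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: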
Abstract:Large-scale and categorical-balanced text data is essential for training effective Scene Text Recognition (STR) models, which is hard to achieve when collecting real data. Synthetic data offers a cost-effective and perfectly labeled alternative. However, its performance often lags behind, revealing a significant domain gap between real and current synthetic data. In this work, we systematically analyze mainstream rendering-based synthetic datasets and identify their key limitations: insufficient diversity in corpus, font, and layout, which restricts their realism in complex scenarios. To address these issues, we introduce UnionST, a strong data engine that synthesizes text covering a union of challenging samples and better aligns with the complexity observed in the wild. We then construct UnionST-S, a large-scale synthetic dataset with improved simulations in challenging scenarios. Furthermore, we develop a self-evolution learning (SEL) framework for effective real data annotation. Experiments show that models trained on UnionST-S achieve significant improvements over existing synthetic datasets. They even surpass real-data performance in certain scenarios. Moreover, when using SEL, the trained models achieve competitive performance by only seeing 9% of real data labels.
[CV-46] ChatUMM: Robust Context Tracking for Conversational Interleaved Generation
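[Quick Read]: This paper addresses the limitation that Unified Multimodal Models (UMMs) are confined to a single-turn interaction paradigm in practice: existing models handle only independent requests, cannot sustain a dialogue, and so fall short as true multi-turn assistants. The proposed solution is ChatUMM, a conversational unified model with strong context tracking. It rests on two innovations. The first is an interleaved multi-turn training strategy that models serialized text-image streams as a continuous conversational flow. The second is a systematic conversational data synthesis pipeline that turns standard single-turn data into fluid dialogues in three stages (constructing basic stateful dialogues, introducing history-dependent distractor turns to strengthen long-range dependency resolution, and synthesizing naturally interleaved multimodal responses), which markedly improves robustness and context awareness in complex multi-turn scenarios.
Link: https://arxiv.org/abs/2602.06442
Authors: Wenxun Dai,Zhiyuan Zhao,Yule Zhong,Yiji Cheng,Jianwei Zhang,Linqing Wang,Shiyi Zhang,Yunlong Lin,Runze He,Fellix Song,Wayne Zhuang,Yong Liu,Haoji Zhang,Yansong Tang,Qinglin Lu,Chunyu Wang
Affiliations: Tsinghua University; Tencent Hunyuan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ChatUMM Project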
Abstract:Unified multimodal models (UMMs) have achieved remarkable progress yet remain constrained by a single-turn interaction paradigm, effectively functioning as solvers for independent requests rather than assistants in continuous dialogue. To bridge this gap, we present ChatUMM. As a conversational unified model, it excels at robust context tracking to sustain interleaved multimodal generation. ChatUMM derives its capabilities from two key innovations: an interleaved multi-turn training strategy that models serialized text-image streams as a continuous conversational flow, and a systematic conversational data synthesis pipeline. This pipeline transforms a diverse set of standard single-turn datasets into fluid dialogues through three progressive stages: constructing basic stateful dialogues, enforcing long-range dependency resolution via “distractor” turns with history-dependent query rewriting, and synthesizing naturally interleaved multimodal responses. Extensive evaluations demonstrate that ChatUMM achieves state-of-the-art performance among open-source unified models on visual understanding and instruction-guided editing benchmarks, while maintaining competitive fidelity in text-to-image generation. Notably, ChatUMM exhibits superior robustness in complex multi-turn scenarios, ensuring fluid, context-aware dialogues.
[CV-47] Bridging the Indoor-Outdoor Gap: Vision-Centric Instruction-Guided Embodied Navigation for the Last Meters
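[Quick Read]: This paper addresses navigation in outdoor-to-indoor (out-to-in) scenarios where the agent lacks precise external priors such as GPS coordinates, and in particular the inability of existing methods to navigate at fine granularity from an outdoor location to a specific building entrance. The key to the solution is a prior-free, instruction-driven embodied navigation task together with a vision-centric navigation framework in which image-based prompts drive decision-making, so that the agent navigates across environments using only egocentric observations and natural-language instructions. The paper also builds the first open-source dataset for this task, with a data generation pipeline that incorporates trajectory-conditioned video synthesis, aimed at better generalization and practicality in complex real-world scenes.
Link: https://arxiv.org/abs/2602.06427
Authors: Yuxiang Zhao,Yirong Yang,Yanqing Zhu,Yanfen Shen,Chiyu Wang,Zhining Gu,Pei Shi,Wei Guo,Mu Xu
Affiliations: AMAP CV Lab, Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: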
Abstract:Embodied navigation holds significant promise for real-world applications such as last-mile delivery. However, most existing approaches are confined to either indoor or outdoor environments and rely heavily on strong assumptions, such as access to precise coordinate systems. While current outdoor methods can guide agents to the vicinity of a target using coarse-grained localization, they fail to enable fine-grained entry through specific building entrances, critically limiting their utility in practical deployment scenarios that require seamless outdoor-to-indoor transitions. To bridge this gap, we introduce a novel task: out-to-in prior-free instruction-driven embodied navigation. This formulation explicitly eliminates reliance on accurate external priors, requiring agents to navigate solely based on egocentric visual observations guided by instructions. To tackle this task, we propose a vision-centric embodied navigation framework that leverages image-based prompts to drive decision-making. Additionally, we present the first open-source dataset for this task, featuring a pipeline that integrates trajectory-conditioned video synthesis into the data generation process. Through extensive experiments, we demonstrate that our proposed method consistently outperforms state-of-the-art baselines across key metrics including success rate and path efficiency.
[CV-48] POPL-KF: A Pose-Only Geometric Representation-Based Kalman Filter for Point-Line-Based Visual-Inertial Odometry
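[Quick Read]: This paper addresses the performance degradation of conventional visual-inertial odometry (VIO) systems in challenging scenes, and in particular the limited localization accuracy of VIO systems based on the multi-state constraint Kalman filter (MSCKF), which suffer from linearization errors in feature 3D coordinates and from delayed measurement updates. The key to the solution is a pose-only geometric representation and the POPL-KF system built on it. By explicitly eliminating the 3D coordinates of both point and line features from the measurement equations, it mitigates that source of linearization error; it also updates visual measurements immediately, improving the timeliness and accuracy of state estimation. In addition, a unified base-frame selection algorithm optimizes the pose constraints, and a line feature filter based on image grid segmentation and bidirectional optical flow consistency improves line feature quality.
Link: https://arxiv.org/abs/2602.06425
Authors: Aiping Wang,Zhaolong Yang,Shuwen Chen,Hai Zhang
Affiliations: Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: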
Abstract:Mainstream Visual-inertial odometry (VIO) systems rely on point features for motion estimation and localization. However, their performance degrades in challenging scenarios. Moreover, the localization accuracy of multi-state constraint Kalman filter (MSCKF)-based VIO systems suffers from linearization errors associated with feature 3D coordinates and delayed measurement updates. To improve the performance of VIO in challenging scenes, we first propose a pose-only geometric representation for line features. Building on this, we develop POPL-KF, a Kalman filter-based VIO system that employs a pose-only geometric representation for both point and line features. POPL-KF mitigates linearization errors by explicitly eliminating both point and line feature coordinates from the measurement equations, while enabling immediate update of visual measurements. We also design a unified base-frames selection algorithm for both point and line features to ensure optimal constraints on camera poses within the pose-only measurement model. To further improve line feature quality, a line feature filter based on image grid segmentation and bidirectional optical flow consistency is proposed. Our system is evaluated on public datasets and real-world experiments, demonstrating that POPL-KF outperforms the state-of-the-art (SOTA) filter-based methods (OpenVINS, PO-KF) and optimization-based methods (PL-VINS, EPLF-VINS), while maintaining real-time performance.
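The abstract mentions a line feature filter that uses bidirectional optical flow consistency but does not say how the check is applied. A common form of such a check is the forward-backward error, sketched below for sample points on a line. The callables `forward_flow` and `backward_flow` and the threshold `max_error` are hypothetical placeholders, not the paper's interface.

```python
import math

def forward_backward_error(p, forward_flow, backward_flow):
    """Track p into the next frame and back again; return the distance to the start."""
    q = forward_flow(p)         # estimated position in the next frame
    p_back = backward_flow(q)   # position tracked back into the first frame
    return math.hypot(p_back[0] - p[0], p_back[1] - p[1])

def filter_consistent_points(points, forward_flow, backward_flow, max_error=1.0):
    """Keep only the points whose round trip lands within max_error pixels."""
    return [
        p for p in points
        if forward_backward_error(p, forward_flow, backward_flow) <= max_error
    ]

# Toy flows: the forward flow shifts everything 2 px to the right. The backward
# flow undoes that shift, except in a region (x > 10) where tracking is unreliable.
def toy_forward(p):
    return (p[0] + 2.0, p[1])

def toy_backward(q):
    shift = 5.0 if q[0] > 10.0 else 2.0
    return (q[0] - shift, q[1])

kept = filter_consistent_points([(0.0, 0.0), (20.0, 0.0)], toy_forward, toy_backward)
# kept == [(0.0, 0.0)]: the second point returns 3 px away from where it started.
```

A line could then be rejected when too few of its sample points survive; that acceptance rule is likewise an assumption rather than something stated in the abstract.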
[CV-49] Alleviating Sparse Rewards by Modeling Step-Wise and Long-Term Sampling Effects in Flow-Based GRPO
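[Quick Read]: This paper addresses two weaknesses of existing GRPO-based reinforcement learning for text-to-image generation: the reward signal is sparse and does not distinguish the local effect of each denoising step, which leads to inefficient training and poor modeling of long-term dependencies. The proposed solution is TurningPoint-GRPO (TP-GRPO), with two key innovations: (i) it replaces outcome-based rewards with step-level incremental rewards, giving a dense, step-aware learning signal that better isolates the "pure" effect of each denoising action; and (ii) it identifies "turning points", that is, steps where the local reward trend changes sign and the subsequent reward evolution is consistent with the overall trajectory trend, and assigns these actions an aggregated long-term reward, explicitly modeling delayed, implicit interactions within the denoising trajectory and capturing the long-term influence of early actions on later states. Turning points are detected solely from sign changes in the incremental rewards, so the method needs no hyperparameter tuning, and it uses reward signals more effectively and improves generation quality.
Link: https://arxiv.org/abs/2602.06422
Authors: Yunze Tong,Mushui Liu,Canyu Zhao,Wanggui He,Shiyi Zhang,Hongwei Zhang,Peng Zhang,Jinlong Liu,Ju Huang,Jiamang Wang,Hao Jiang,Pipei Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, in submission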
Abstract:Deploying GRPO on Flow Matching models has proven effective for text-to-image generation. However, existing paradigms typically propagate an outcome-based reward to all preceding denoising steps without distinguishing the local effect of each step. Moreover, current group-wise ranking mainly compares trajectories at matched timesteps and ignores within-trajectory dependencies, where certain early denoising actions can affect later states via delayed, implicit interactions. We propose TurningPoint-GRPO (TP-GRPO), a GRPO framework that alleviates step-wise reward sparsity and explicitly models long-term effects within the denoising trajectory. TP-GRPO makes two key innovations: (i) it replaces outcome-based rewards with step-level incremental rewards, providing a dense, step-aware learning signal that better isolates each denoising action’s “pure” effect, and (ii) it identifies turning points (steps that flip the local reward trend and make subsequent reward evolution consistent with the overall trajectory trend) and assigns these actions an aggregated long-term reward to capture their delayed impact. Turning points are detected solely via sign changes in incremental rewards, making TP-GRPO efficient and hyperparameter-free. Extensive experiments also demonstrate that TP-GRPO exploits reward signals more effectively and consistently improves generation. Demo code is available at this https URL.
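The sign-change test described in the abstract can be illustrated with a small sketch. It assumes a reward estimate is available after every denoising step (the list `rewards`, a hypothetical input) and uses a simplified reading of "turning point": the increment changes sign at that step and the new sign matches the overall trend from the first to the last reward. The paper's exact criterion and its aggregated long-term reward are not reproduced here.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def incremental_rewards(rewards):
    """Step-level increments: the change in reward produced by each step."""
    return [after - before for before, after in zip(rewards, rewards[1:])]

def turning_points(rewards):
    """Return the step indices t (1-based) whose increment flips the local trend
    and agrees with the overall trend of the trajectory."""
    deltas = incremental_rewards(rewards)
    overall = sign(rewards[-1] - rewards[0])
    points = []
    for i in range(1, len(deltas)):
        previous, current = sign(deltas[i - 1]), sign(deltas[i])
        if previous != 0 and current != 0 and current != previous and current == overall:
            points.append(i + 1)  # deltas[i] is the increment produced by step i + 1
    return points

# The reward falls for two steps, then step 3 reverses the trend for good.
print(turning_points([0.2, 0.1, 0.05, 0.3, 0.5]))  # prints [3]
```

Because the test only compares signs, it introduces no threshold to tune, which is consistent with the abstract's description of the method as hyperparameter-free.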
[CV-50] Learning Human Visual Attention on 3D Surfaces through Geometry-Queried Semantic Priors
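[Quick Read]: This paper addresses the limitations of existing 3D saliency methods in modeling human visual attention: they typically rely on hand-crafted geometric features or on learned models without semantic awareness, and so cannot explain why people fixate on regions that are semantically meaningful but geometrically unremarkable. The key to the solution is SemGeo-AttentionNet, a dual-stream architecture that uses asymmetric cross-modal fusion to model explicitly the interplay between bottom-up geometric processing and top-down semantic recognition. It draws diffusion-based semantic priors from geometry-conditioned multi-view rendering, extracts geometric features with a point cloud transformer, and uses cross-attention so that geometric features query semantic content, letting geometric distinctiveness guide semantic retrieval. The framework is further extended to temporal scanpath generation with the first reinforcement learning formulation that respects 3D mesh topology and includes inhibition-of-return dynamics, improving the modeling of human attention on 3D surfaces.
Link: https://arxiv.org/abs/2602.06419
Authors: Soham Pahari,Sandeep C. Kumain
Affiliations: UPES Dehradun
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: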
Abstract:Human visual attention on three-dimensional objects emerges from the interplay between bottom-up geometric processing and top-down semantic recognition. Existing 3D saliency methods rely on hand-crafted geometric features or learning-based approaches that lack semantic awareness, failing to explain why humans fixate on semantically meaningful but geometrically unremarkable regions. We introduce SemGeo-AttentionNet, a dual-stream architecture that explicitly formalizes this dichotomy through asymmetric cross-modal fusion, leveraging diffusion-based semantic priors from geometry-conditioned multi-view rendering and point cloud transformers for geometric processing. Cross-attention ensures geometric features query semantic content, enabling bottom-up distinctiveness to guide top-down retrieval. We extend our framework to temporal scanpath generation through reinforcement learning, introducing the first formulation respecting 3D mesh topology with inhibition-of-return dynamics. Evaluation on SAL3D, NUS3D and 3DVA datasets demonstrates substantial improvements, validating how cognitively motivated architectures effectively model human visual attention on three-dimensional surfaces.
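To make "inhibition-of-return on a mesh" concrete, here is a deliberately simple greedy sketch. It is not the paper's reinforcement learning formulation. It only shows the mechanism: once a vertex is fixated, its saliency and that of its mesh neighbours is suppressed so the next fixation moves elsewhere. The names `greedy_scanpath`, `neighbors`, and `decay` are illustrative assumptions.

```python
def greedy_scanpath(saliency, neighbors, num_fixations, decay=0.1):
    """Pick the most salient vertex repeatedly, suppressing each fixated vertex
    and its mesh neighbours (inhibition of return) after every fixation."""
    remaining = dict(saliency)  # work on a copy so the input map is not modified
    path = []
    for _ in range(num_fixations):
        vertex = max(remaining, key=remaining.get)
        path.append(vertex)
        remaining[vertex] *= decay
        for other in neighbors.get(vertex, ()):
            remaining[other] *= decay
    return path

# A four-vertex strip a - b - c - d with saliency concentrated near a and b.
saliency = {"a": 1.0, "b": 0.9, "c": 0.5, "d": 0.4}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
path = greedy_scanpath(saliency, neighbors, num_fixations=2)
# path == ["a", "c"]: after fixating a, its neighbour b is suppressed too,
# so the second fixation jumps to c instead of staying in the same region.
```

Using mesh adjacency for the suppression, rather than Euclidean distance, is one way a scanpath model can respect surface topology.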
[CV-51] Point Virtual Transformer
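[Quick Read]: This paper addresses the degraded long-range performance of LiDAR-based 3D object detection, which stems from sparse point clouds at long distances making geometric cues unreliable. The key to the solution is the Point Virtual Transformer (PointViT) framework, which takes depth-completed virtual points derived from RGB images and fuses a selectively sampled subset of them with the raw LiDAR points, avoiding the computational cost and fusion difficulties of adding all virtual points directly. The framework compares several fusion strategies, from early point-level fusion to gated fusion in BEV (bird's-eye view) space, and combines sparse convolutions with a transformer-based module to build the feature representation and refine object queries, with the goal of improving far-field detection accuracy.
Link: https://arxiv.org/abs/2602.06406
Authors: Veerain Sood,Bnalin,Gaurav Pandey
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures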
Abstract:LiDAR-based 3D object detectors often struggle to detect far-field objects due to the sparsity of point clouds at long ranges, which limits the availability of reliable geometric cues. To address this, prior approaches augment LiDAR data with depth-completed virtual points derived from RGB images; however, directly incorporating all virtual points leads to increased computational cost and introduces challenges in effectively fusing real and virtual information. We present Point Virtual Transformer (PointViT), a transformer-based 3D object detection framework that jointly reasons over raw LiDAR points and selectively sampled virtual points. The framework examines multiple fusion strategies, ranging from early point-level fusion to BEV-based gated fusion, and analyses their trade-offs in terms of accuracy and efficiency. The fused point cloud is voxelized and encoded using sparse convolutions to form a BEV representation, from which a compact set of high-confidence object queries is initialised and refined through a transformer-based context aggregation module. Experiments on the KITTI benchmark report 91.16% 3D AP, 95.94% BEV AP, and 99.36% AP on the KITTI 2D detection benchmark for the Car class.
[CV-52] A neuromorphic model of the insect visual system for natural image processing
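[Quick Read]: This paper addresses the tendency of current vision models to pursue task performance while neglecting biologically grounded processing pathways. The core challenge is to build a visual representation learning framework that is both biologically plausible and reusable across tasks. The key to the solution is a model inspired by the insect visual system that transforms dense visual input into sparse, discriminative codes. It is trained with a fully self-supervised contrastive objective, so it learns representations without labeled data and can be reused across tasks without domain-specific classifiers. The model is implemented both as an artificial neural network and as a Spiking Neural Network (SNN), and it outperforms a simple image downsampling baseline in a simulated localization setting, highlighting the functional benefit of neuromorphic processing pathways.
Link: https://arxiv.org/abs/2602.06405
Authors: Adam D. Hines,Karin Nordström,Andrew B. Barron
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments: 21 pages, 7 figures, under review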
Abstract:Insect vision supports complex behaviors including associative learning, navigation, and object detection, and has long motivated computational models for understanding biological visual processing. However, many contemporary models prioritize task performance while neglecting biologically grounded processing pathways. Here, we introduce a bio-inspired vision model that captures principles of the insect visual system to transform dense visual input into sparse, discriminative codes. The model is trained using a fully self-supervised contrastive objective, enabling representation learning without labeled data and supporting reuse across tasks without reliance on domain-specific classifiers. We evaluated the resulting representations on flower recognition tasks and natural image benchmarks. The model consistently produced reliable sparse codes that distinguish visually similar inputs. To support different modelling and deployment uses, we have implemented the model as both an artificial neural network and a spiking neural network. In a simulated localization setting, our approach outperformed a simple image downsampling comparison baseline, highlighting the functional benefit of incorporating neuromorphic visual processing pathways. Collectively, these results advance insect computational modelling by providing a generalizable bio-inspired vision model capable of sparse computation across diverse tasks.
[CV-53] MeDocVL: A Visual Language Model for Medical Document Understanding and Parsing
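[Quick Read]: This paper addresses the challenges of optical character recognition (OCR) for medical documents, including complex layouts, domain-specific terminology, and noisy annotations, under a requirement of exact matching at the field level. The key to the solution is MeDocVL, a post-trained vision-language model. It uses Training-driven Label Refinement to build high-quality supervision from noisy annotations, combined with a Noise-aware Hybrid Post-training strategy that integrates reinforcement learning and supervised fine-tuning, to achieve robust and precise medical document parsing.
Link: https://arxiv.org/abs/2602.06402
Authors: Wenjie Wang,Wei Wu,Ying Liu,Yuan Zhao,Xiaole Lv,Liang Diao,Zengjian Fan,Wenfeng Xie,Ziling Lin,De Shi,Lin Huang,Kaihe Xu,Hong Li
Affiliations: Ping An Property & Casualty Insurance Company
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 8 figures. Technical report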
Abstract:Medical document OCR is challenging due to complex layouts, domain-specific terminology, and noisy annotations, while requiring strict field-level exact matching. Existing OCR systems and general-purpose vision-language models often fail to reliably parse such documents. We propose MeDocVL, a post-trained vision-language model for query-driven medical document parsing. Our framework combines Training-driven Label Refinement to construct high-quality supervision from noisy annotations, with a Noise-aware Hybrid Post-training strategy that integrates reinforcement learning and supervised fine-tuning to achieve robust and precise extraction. Experiments on medical invoice benchmarks show that MeDocVL consistently outperforms conventional OCR systems and strong VLM baselines, achieving state-of-the-art performance under noisy supervision.
[CV-54] TFusionOcc: Student's t-Distribution Based Object-Centric Multi-Sensor Fusion Framework for 3D Occupancy Prediction
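[Quick Read]: This paper addresses a weakness of current 3D semantic occupancy prediction methods: their intermediate representations rely on 3D voxel volumes or sets of 3D Gaussians, which limits how efficiently and effectively they capture fine-grained geometric detail. The key to the solution is TFusionOcc, an object-centric multi-sensor fusion framework. Its main ingredients are a multi-stage multi-sensor fusion strategy, probabilistic modeling based on the Student's t-distribution and the T-Mixture Model (TMM), and more geometrically flexible primitives, namely the deformable superquadric (a superquadric with inverse warp). The method achieves state-of-the-art (SOTA) performance on the nuScenes benchmark, and experiments on nuScenes-C demonstrate its robustness under camera and lidar corruption.
Link: https://arxiv.org/abs/2602.06400
Authors: Zhenxing Ming,Julie Stephany Berrio,Mao Shan,Stewart Worrall
Affiliations: Australian Centre for Robotics; University of Sydney
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: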
Abstract:3D semantic occupancy prediction enables autonomous vehicles (AVs) to perceive fine-grained geometric and semantic structure of their surroundings from onboard sensors, which is essential for safe decision-making and navigation. Recent models for 3D semantic occupancy prediction have successfully addressed the challenge of describing real-world objects with varied shapes and classes. However, the intermediate representations used by existing methods for 3D semantic occupancy prediction rely heavily on 3D voxel volumes or a set of 3D Gaussians, hindering the model’s ability to efficiently and effectively capture fine-grained geometric details in the 3D driving environment. This paper introduces TFusionOcc, a novel object-centric multi-sensor fusion framework for predicting 3D semantic occupancy. By leveraging multi-stage multi-sensor fusion, Student’s t-distribution, and the T-Mixture model (TMM), together with more geometrically flexible primitives, such as the deformable superquadric (superquadric with inverse warp), the proposed method achieved state-of-the-art (SOTA) performance on the nuScenes benchmark. In addition, extensive experiments were conducted on the nuScenes-C dataset to demonstrate the robustness of the proposed method in different camera and lidar corruption scenarios. The code will be available at: this https URL
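The abstract uses the Student's t-distribution where earlier methods use Gaussians. The one-dimensional sketch below shows only the textbook density, not the paper's 3D primitives or its mixture model. It compares the two densities to illustrate the heavier tails of the t-distribution. The parameter names `nu`, `mu`, and `sigma` are conventional choices, not taken from the paper.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def student_t_pdf(x, nu, mu=0.0, sigma=1.0):
    """Density of a location-scale Student's t-distribution with nu degrees of freedom."""
    z = (x - mu) / sigma
    log_norm = (
        math.lgamma((nu + 1.0) / 2.0)
        - math.lgamma(nu / 2.0)
        - 0.5 * math.log(nu * math.pi)
        - math.log(sigma)
    )
    return math.exp(log_norm - (nu + 1.0) / 2.0 * math.log1p(z * z / nu))

# Five scale units from the centre, the t-density with 3 degrees of freedom is
# far larger than the Gaussian density, so distant points keep some influence.
tail_ratio = student_t_pdf(5.0, nu=3.0) / gaussian_pdf(5.0)
```

As `nu` grows, the t-density approaches the Gaussian, so the degrees of freedom act as a knob between heavy-tailed and Gaussian behaviour.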
[CV-55] POINTS-GUI-G: GUI-Grounding Journey
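[Quick Read]: This paper addresses insufficient localization accuracy in GUI (graphical user interface) grounding, in particular when starting from a base model with little spatial awareness (such as POINTS-1.5) on the way to end-to-end GUI automation. The solution rests on three elements: (1) refined data engineering, which unifies the formats of multiple open-source datasets and adds augmentation, filtering, and difficulty-grading strategies; (2) improved training strategies, which continue fine-tuning the vision encoder to raise perceptual accuracy and keep the resolution consistent between training and inference; and (3) Reinforcement Learning (RL) with verifiable rewards, which exploits the fact that GUI tasks naturally provide accurate feedback and significantly improves precision on this perception-intensive grounding task.
Link: https://arxiv.org/abs/2602.06391
Authors: Zhongyin Zhao,Yuan Liu,Yikun Liu,Haicheng Wang,Le Tian,Xiao Zhou,Yangxiu You,Zilin Yu,Yang Yu,Jie Zhou
Affiliations: Tencent
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: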
Abstract:The rapid advancement of vision-language models has catalyzed the emergence of GUI agents, which hold immense potential for automating complex tasks, from online shopping to flight booking, thereby alleviating the burden of repetitive digital workflows. As a foundational capability, GUI grounding is typically established as a prerequisite for end-to-end task execution. It enables models to precisely locate interface elements, such as text and icons, to perform accurate operations like clicking and typing. Unlike prior works that fine-tune models already possessing strong spatial awareness (e.g., Qwen3-VL), we aim to master the full technical pipeline by starting from a base model with minimal grounding ability, such as POINTS-1.5. We introduce POINTS-GUI-G-8B, which achieves state-of-the-art performance with scores of 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision. Our model’s success is driven by three key factors: (1) Refined Data Engineering, involving the unification of diverse open-source datasets format alongside sophisticated strategies for augmentation, filtering, and difficulty grading; (2) Improved Training Strategies, including continuous fine-tuning of the vision encoder to enhance perceptual accuracy and maintaining resolution consistency between training and inference; and (3) Reinforcement Learning (RL) with Verifiable Rewards. While RL is traditionally used to bolster reasoning, we demonstrate that it significantly improves precision in the perception-intensive GUI grounding task. Furthermore, GUI grounding provides a natural advantage for RL, as rewards are easily verifiable and highly accurate.
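The abstract notes that rewards for GUI grounding are easy to verify. One simple way such a reward could look is a point-in-box check, sketched below under the assumption that the model outputs a click point and the annotation is an axis-aligned bounding box. The paper's actual reward design is not given in the abstract, and the names here are illustrative.

```python
def click_reward(pred_xy, target_box):
    """Return 1.0 if the predicted click lies inside the target element's box, else 0.0.

    pred_xy is (x, y); target_box is (left, top, right, bottom) in the same pixel units.
    """
    x, y = pred_xy
    left, top, right, bottom = target_box
    inside = left <= x <= right and top <= y <= bottom
    return 1.0 if inside else 0.0

def mean_reward(predictions, boxes):
    """Average reward over a batch, which doubles as grounding accuracy."""
    rewards = [click_reward(p, b) for p, b in zip(predictions, boxes)]
    return sum(rewards) / len(rewards)

# A button occupying x in [100, 180] and y in [40, 70].
button = (100, 40, 180, 70)
hit = click_reward((140, 55), button)   # inside the button
miss = click_reward((90, 55), button)   # 10 px to the left of it
```

Because the check is deterministic and needs no learned judge, the reward cannot be gamed by plausible-looking but wrong outputs, which is what makes it "verifiable".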
[CV-56] Revisiting Salient Object Detection from an Observer-Centric Perspective
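[Quick Read]: This paper addresses the under-determined and fundamentally ill-posed nature of conventional Salient Object Detection (SOD), which treats the problem as a single objective prediction task even though observers with different priors, preferences, or intents may perceive different regions of the same image as salient. The key to the solution is an observer-centric framework, Observer-Centric SOD (OC-SOD), which combines visual cues with observer-specific factors (such as preferences or intents) to produce personalized, context-aware saliency predictions. To support it, the authors build the first OC-SOD dataset, OC-SODBench (33k images and 152k textual prompt and object pairs), with an efficient annotation pipeline based on multimodal large language models, and design the agentic baseline OC-SODAgent, which mimics a human-like "Perceive-Reflect-Adjust" process, with the aim of narrowing the gap between human perception and computational modeling.
Link: https://arxiv.org/abs/2602.06369
Authors: Fuxi Zhang,Yifan Wang,Hengrun Zhao,Zhuohan Sun,Changxing Xia,Lijun Wang,Huchuan Lu,Yangrui Shao,Chen Yang,Long Teng
Affiliations: Dalian University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: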
Abstract:Salient object detection is inherently a subjective problem, as observers with different priors may perceive different objects as salient. However, existing methods predominantly formulate it as an objective prediction task with a single groundtruth segmentation map for each image, which renders the problem under-determined and fundamentally ill-posed. To address this issue, we propose Observer-Centric Salient Object Detection (OC-SOD), where salient regions are predicted by considering not only the visual cues but also the observer-specific factors such as their preferences or intents. As a result, this formulation captures the intrinsic ambiguity and diversity of human perception, enabling personalized and context-aware saliency prediction. By leveraging multi-modal large language models, we develop an efficient data annotation pipeline and construct the first OC-SOD dataset named OC-SODBench, comprising 33k training, validation and test images with 152k textual prompts and object pairs. Built upon this new dataset, we further design OC-SODAgent, an agentic baseline which performs OC-SOD via a human-like “Perceive-Reflect-Adjust” process. Extensive experiments on our proposed OC-SODBench have justified the effectiveness of our contribution. Through this observer-centric perspective, we aim to bridge the gap between human perception and computational modeling, offering a more realistic and flexible understanding of what makes an object truly “salient.” Code and dataset are publicly available at: this https URL
[CV-57] Robust Pedestrian Detection with Uncertain Modality
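[Quick Read]: This paper addresses the performance drop in Cross-modal Pedestrian Detection (CMPD) caused by uncertainty about which input modalities are available. Existing methods usually assume that RGB, near-infrared (NIR), and thermal-infrared (TIR) are all available at once, but in real scenes device limits or environmental changes often leave modalities missing or arriving in unpredictable combinations, so models fail to extract robust pedestrian features and degrade significantly. The key to the solution is the Adaptive Uncertainty-aware Network (AUNet). Its core components are Unified Modality Validation Refinement (UMVR), which uses an uncertainty-aware router to judge the availability of each modality and semantically refines the valid ones to make their information more reliable, and the Modality-Aware Interaction (MAI) module, which activates or deactivates its internal interaction mechanisms according to the UMVR output to fuse complementary information adaptively from the available modalities.
Link: https://arxiv.org/abs/2602.06363
Authors: Qian Bie,Xiao Wang,Bin Yang,Zhixi Yu,Jun Chen,Xin Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Due to the limitation “The abstract field cannot be longer than 1,920 characters”, the abstract here is shorter than that in the PDF file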
Abstract:Existing cross-modal pedestrian detection (CMPD) employs complementary information from RGB and thermal-infrared (TIR) modalities to detect pedestrians in 24h surveillance. RGB captures rich pedestrian details under daylight, while TIR excels at night. However, TIR focuses primarily on the person’s silhouette, neglecting critical texture details essential for detection. Near-infrared (NIR), by contrast, captures texture under low-light conditions, which alleviates the performance issues of RGB and the detail loss in TIR and thereby reduces missed detections. To this end, we construct a new Triplet RGB-NIR-TIR (TRNT) dataset, comprising 8,281 pixel-aligned image triplets, establishing a comprehensive foundation for algorithmic research. However, due to the variable nature of real-world scenarios, imaging devices may not always capture all three modalities simultaneously. This results in input data with unpredictable combinations of modal types, which challenge existing CMPD methods that fail to extract robust pedestrian information under arbitrary input combinations, leading to significant performance degradation. To address these challenges, we propose the Adaptive Uncertainty-aware Network (AUNet) for accurately discriminating modal availability and fully utilizing the available information under uncertain inputs. Specifically, we introduce Unified Modality Validation Refinement (UMVR), which includes an uncertainty-aware router to validate modal availability and a semantic refinement to ensure the reliability of information within the modality. Furthermore, we design a Modality-Aware Interaction (MAI) module to adaptively activate or deactivate its internal interaction mechanisms per UMVR output, enabling effective complementary information fusion from available modalities.
[CV-58] Di3PO – Diptych Diffusion DPO for Targeted Improvements in Image
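[Quick Read]: This paper addresses problems that arise in preference tuning of text-to-image (T2I) diffusion models when positive and negative sample pairs are built through computationally expensive generation steps: the generated pairs may lack meaningful differences, are costly to sample and filter, and show large variance in irrelevant pixel regions, all of which reduce training efficiency. The key to the solution is "Di3PO", which isolates the region targeted for improvement while keeping the surrounding image context stable, giving a more efficient and more targeted way to construct positive and negative pairs and so improving preference tuning.
Link: https://arxiv.org/abs/2602.06355
Authors: Sanjana Reddy(1),Ishaan Malhi(2),Sally Ma(2),Praneet Dutta(2) ((1) Google, (2) Google DeepMind)
Affiliations: Google; Google DeepMind
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: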
Abstract:Existing methods for preference tuning of text-to-image (T2I) diffusion models often rely on computationally expensive generation steps to create positive and negative pairs of images. These approaches frequently yield training pairs that either lack meaningful differences, are expensive to sample and filter, or exhibit significant variance in irrelevant pixel regions, thereby degrading training efficiency. To address these limitations, we introduce “Di3PO”, a novel method for constructing positive and negative pairs that isolates specific regions targeted for improvement during preference tuning, while keeping the surrounding context in the image stable. We demonstrate the efficacy of our approach by applying it to the challenging task of text rendering in diffusion models, showcasing improvements over baseline methods of SFT and DPO.
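Since the abstract builds on preference tuning with DPO, a minimal sketch of the standard DPO objective for a single positive/negative pair may help. It uses plain log-likelihoods under the policy and a frozen reference model. Diffusion variants substitute a different per-sample quantity for the log-likelihood, and that detail is not reproduced here. The argument names and the default `beta` are illustrative assumptions.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one preference pair.

    The margin measures how much more the policy prefers the winning sample,
    relative to the reference model, than it prefers the losing sample.
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) written as softplus(-margin) for numerical stability.
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)

# When the policy equals the reference, the margin is zero and the loss is log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Raising the winner's likelihood relative to the loser lowers the loss.
improved = dpo_loss(-0.5, -1.5, -1.0, -1.0)
```

The loss depends only on the difference between the two samples, which is why pairs that differ mainly in the targeted region (as the abstract proposes) give a cleaner training signal than pairs that also differ in irrelevant pixels.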
[CV-59] Trifuse: Enhancing Attention-Based GUI Grounding via Multimodal Fusion
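[Quick Read]: This paper addresses GUI (graphical user interface) grounding, the task of mapping natural-language instructions to the correct interface elements. Existing methods mainly fine-tune multimodal large language models (MLLMs) on large-scale GUI datasets to predict target element coordinates, which is data-hungry and generalizes poorly to unseen interfaces; attention-based alternatives need no task-specific fine-tuning but localize unreliably because GUI images lack explicit, complementary spatial anchors. The paper proposes the Trifuse framework, whose key innovation is a Consensus-SinglePeak (CS) fusion strategy that explicitly integrates attention, OCR-derived textual cues, and icon-level caption semantics, enforcing cross-modal agreement while retaining sharp localization peaks. This yields strong GUI grounding performance without task-specific fine-tuning and substantially reduces reliance on expensive annotated data.
Link: https://arxiv.org/abs/2602.06351
Authors: Longhui Ma,Di Zhao,Siwei Wang,Zhao Lv,Miao Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 10 figures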
Abstract:GUI grounding maps natural language instructions to the correct interface elements, serving as the perception foundation for GUI agents. Existing approaches predominantly rely on fine-tuning multimodal large language models (MLLMs) using large-scale GUI datasets to predict target element coordinates, which is data-intensive and generalizes poorly to unseen interfaces. Recent attention-based alternatives exploit localization signals in MLLMs' attention mechanisms without task-specific fine-tuning, but suffer from low reliability due to the lack of explicit and complementary spatial anchors in GUI images. To address this limitation, we propose Trifuse, an attention-based grounding framework that explicitly integrates complementary spatial anchors. Trifuse integrates attention, OCR-derived textual cues, and icon-level caption semantics via a Consensus-SinglePeak (CS) fusion strategy that enforces cross-modal agreement while retaining sharp localization peaks. Extensive evaluations on four grounding benchmarks demonstrate that Trifuse achieves strong performance without task-specific fine-tuning, substantially reducing the reliance on expensive annotated data. Moreover, ablation studies reveal that incorporating OCR and caption cues consistently improves attention-based grounding performance across different backbones, highlighting its effectiveness as a general framework for GUI grounding.
[CV-60] FlowConsist: Make Your Flow Consistent with Real Trajectory
Quick read: This paper addresses two fundamental issues in training fast flow models. First, conditional velocities constructed from randomly paired noise-data samples cause trajectory drift, so the model cannot follow a consistent ordinary differential equation (ODE) path. Second, the model's approximation errors accumulate over time steps, causing severe deviations across long time intervals. The proposed FlowConsist training framework has two core ideas: it replaces conditional velocities with the marginal velocities predicted by the model itself, aligning the optimization target with the true trajectory; and it introduces a trajectory rectification strategy that aligns the marginal distributions of generated and real samples at every time step along the trajectory, suppressing error accumulation. The method achieves an FID of 1.52 on ImageNet 256×256 with a single sampling step, a new state of the art.
Link: https://arxiv.org/abs/2602.06346
Authors: Tianyi Zhang, Chengcheng Liu, Jinwei Chen, Chun-Le Guo, Chongyi Li, Ming-Ming Cheng, Bo Li, Peng-Tao Jiang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Fast flow models accelerate the iterative sampling process by learning to directly predict ODE path integrals, enabling one-step or few-step generation. However, we argue that current fast-flow training paradigms suffer from two fundamental issues. First, conditional velocities constructed from randomly paired noise-data samples introduce systematic trajectory drift, preventing models from following a consistent ODE path. Second, the model’s approximation errors accumulate over time steps, leading to severe deviations across long time intervals. To address these issues, we propose FlowConsist, a training framework designed to enforce trajectory consistency in fast flows. We propose a principled alternative that replaces conditional velocities with the marginal velocities predicted by the model itself, aligning optimization with the true trajectory. To further address error accumulation over time steps, we introduce a trajectory rectification strategy that aligns the marginal distributions of generated and real samples at every time step along the trajectory. Our method establishes a new state-of-the-art on ImageNet 256×256, achieving an FID of 1.52 with only 1 sampling step.
[CV-61] Uncertainty-Aware 4D Gaussian Splatting for Monocular Occluded Human Rendering
Quick read: This paper addresses the sharp degradation of high-fidelity dynamic human rendering from monocular video under occlusion. Existing methods either hallucinate missing content with generative models, causing severe temporal flickering, or rely on rigid geometric heuristics that cannot capture diverse appearances. The key idea is to recast the task as Maximum A Posteriori (MAP) estimation under heteroscedastic observation noise. The proposed U-4DGS framework combines a Probabilistic Deformation Network with a Double Rasterization pipeline to render pixel-aligned uncertainty maps, which act as an adaptive gradient modulator that automatically attenuates artifacts from unreliable observations. It also introduces Confidence-Aware Regularizations that use the learned uncertainty to selectively propagate spatial-temporal validity, preventing geometric drift in regions lacking reliable visual cues.
Link: https://arxiv.org/abs/2602.06343
Authors: Weiquan Wang, Feifei Shao, Lin Li, Zhen Wang, Jun Xiao, Long Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:High-fidelity rendering of dynamic humans from monocular videos typically degrades catastrophically under occlusions. Existing solutions incorporate external priors, either hallucinating missing content via generative models, which induces severe temporal flickering, or imposing rigid geometric heuristics that fail to capture diverse appearances. To this end, we reformulate the task as a Maximum A Posteriori estimation problem under heteroscedastic observation noise. In this paper, we propose U-4DGS, a framework integrating a Probabilistic Deformation Network and a Double Rasterization pipeline. This architecture renders pixel-aligned uncertainty maps that act as an adaptive gradient modulator, automatically attenuating artifacts from unreliable observations. Furthermore, to prevent geometric drift in regions lacking reliable visual cues, we enforce Confidence-Aware Regularizations, which leverage the learned uncertainty to selectively propagate spatial-temporal validity. Extensive experiments on ZJU-MoCap and OcMotion demonstrate that U-4DGS achieves SOTA rendering fidelity and robustness.
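To make the MAP reformulation concrete: under a Gaussian observation model with per-pixel standard deviation sigma, the negative log-likelihood of a residual r is r^2 / (2 sigma^2) + log(sigma) plus a constant, so a large predicted sigma shrinks that pixel's gradient. The sketch below illustrates only this standard loss; it is not U-4DGS's actual objective, and the function names and sample values are hypothetical.

```python
import math

def heteroscedastic_nll(residual, sigma):
    """Per-pixel Gaussian NLL with the constant dropped: r^2 / (2 sigma^2) + log(sigma)."""
    return residual ** 2 / (2.0 * sigma ** 2) + math.log(sigma)

def grad_wrt_residual(residual, sigma):
    """d(NLL)/d(residual) = r / sigma^2, so higher uncertainty means a smaller gradient."""
    return residual / sigma ** 2

# Same photometric error, different predicted uncertainty (values are illustrative).
confident = grad_wrt_residual(0.5, sigma=0.1)  # pixel treated as reliable
uncertain = grad_wrt_residual(0.5, sigma=1.0)  # pixel treated as unreliable, e.g. occluded
```

With these numbers the reliable pixel contributes a gradient roughly 100 times larger than the unreliable one, which is the sense in which an uncertainty map can modulate gradients. The log(sigma) term keeps the model from inflating sigma everywhere.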
[CV-62] SPDA-SAM: A Self-prompted Depth-Aware Segment Anything Model for Instance Segmentation
Quick read: This paper addresses two limitations of instance segmentation methods based on the Segment Anything Model (SAM): dependence on manual prompts, and weak perception of spatial structure because RGB images lack depth information. It proposes a Self-prompted Depth-Aware SAM (SPDA-SAM) with two core modules. The Semantic-Spatial Self-prompt Module (SSSPM) extracts semantic and spatial prompts from SAM's image encoder and mask decoder, respectively. The Coarse-to-Fine RGB-D Fusion Module (C2FFM) fuses features from a monocular RGB image with those of its estimated depth map: structural information in the depth map provides coarse-grained guidance, and local depth variations drive fine-grained feature fusion. Together these compensate for the spatial information missing from RGB images and improve boundary segmentation accuracy.
Link: https://arxiv.org/abs/2602.06335
Authors: Yihan Shang, Wei Wang, Chao Huang, Xinghui Dong
Affiliations: Ocean University of China; Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recently, Segment Anything Model (SAM) has demonstrated strong generalizability in various instance segmentation tasks. However, its performance is severely dependent on the quality of manual prompts. In addition, the RGB images that instance segmentation methods normally use inherently lack depth information. As a result, the ability of these methods to perceive spatial structures and delineate object boundaries is hindered. To address these challenges, we propose a Self-prompted Depth-Aware SAM (SPDA-SAM) for instance segmentation. Specifically, we design a Semantic-Spatial Self-prompt Module (SSSPM) which extracts the semantic and spatial prompts from the image encoder and the mask decoder of SAM, respectively. Furthermore, we introduce a Coarse-to-Fine RGB-D Fusion Module (C2FFM), in which the features extracted from a monocular RGB image and the depth map estimated from it are fused. In particular, the structural information in the depth map is used to provide coarse-grained guidance to feature fusion, while local variations in depth are encoded in order to fuse fine-grained feature representations. To our knowledge, SAM has not been explored in such self-prompted and depth-aware manners. Experimental results demonstrate that our SPDA-SAM outperforms its state-of-the-art counterparts across twelve different data sets. We attribute these results to the guidance of the self-prompts and to the coarse-to-fine RGB-D fusion, which compensates for the loss of spatial information.
[CV-63] Taming SAM3 in the Wild: A Concept Bank for Open-Vocabulary Segmentation
Quick read: This paper addresses the breakdown of alignment between visual evidence and prompts in Open-Vocabulary Segmentation (OVS) models when the target domain exhibits data drift or concept drift. Existing methods such as SAM3 rely on pre-defined concepts for promptable segmentation, and their performance drops markedly when the visual or label distribution of the target domain changes. The proposed solution is ConceptBank, a parameter-free calibration framework that builds a concept bank from target-domain statistics to restore this alignment on the fly. It first anchors target-domain evidence with class-wise visual prototypes, then mines representative supports to suppress outliers under data drift, and finally fuses candidate concepts to rectify concept drift. The method adapts to distribution shifts without retraining and sets a new baseline for robustness and efficiency in challenging natural-scene and remote-sensing scenarios.
Link: https://arxiv.org/abs/2602.06333
Authors: Gensheng Pei, Xiruo Jiang, Yazhou Yao, Xiangbo Shu, Fumin Shen, Byeungwoo Jeon
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The recent introduction of SAM3 has revolutionized Open-Vocabulary Segmentation (OVS) through promptable concept segmentation, which grounds pixel predictions in flexible concept prompts. However, this reliance on pre-defined concepts makes the model vulnerable: when visual distributions shift (data drift) or conditional label distributions evolve (concept drift) in the target domain, the alignment between visual evidence and prompts breaks down. In this work, we present ConceptBank, a parameter-free calibration framework to restore this alignment on the fly. Instead of adhering to static prompts, we construct a dataset-specific concept bank from the target statistics. Our approach (i) anchors target-domain evidence via class-wise visual prototypes, (ii) mines representative supports to suppress outliers under data drift, and (iii) fuses candidate concepts to rectify concept drift. We demonstrate that ConceptBank effectively adapts SAM3 to distribution drifts, including challenging natural-scene and remote-sensing scenarios, establishing a new baseline for robustness and efficiency in OVS. Code and model are available at this https URL.
[CV-64] Halt the Hallucination: Decoupling Signal and Semantic OOD Detection Based on Cascaded Early Rejection
Quick read: This paper addresses the inefficiency and semantic hallucination of current Out-of-Distribution (OOD) detection methods in safety-critical settings: existing methods still run full-scale inference on low-level statistical noise, which wastes computation and leads deep networks to interpret physical anomalies as high-confidence semantic content. The proposed Cascaded Early Rejection (CER) framework performs efficient anomaly detection through coarse-to-fine hierarchical filtering. Its key components are (1) the Structural Energy Sieve (SES), which uses the Laplacian operator to build a non-parametric barrier at the network entry that efficiently intercepts physical signal anomalies, and (2) the Semantically-aware Hyperspherical Energy (SHE) detector, which decouples feature magnitude from direction in intermediate layers to identify fine-grained semantic deviations.
Link: https://arxiv.org/abs/2602.06330
Authors: Ningkang Peng, Chuanjie Cheng, Jingyang Mao, Xiaoqian Peng, Feng Xing, Bo Zhang, Chao Tan, Zhichao Zheng, Peiheng Li, Yanhui Gu
Affiliations: Nanjing Normal University; Nanjing University of Chinese Medicine
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Efficient and robust Out-of-Distribution (OOD) detection is paramount for safety-critical this http URL, existing methods still execute full-scale inference on low-level statistical noise. This computational mismatch not only incurs resource waste but also induces semantic hallucination, where deep networks forcefully interpret physical anomalies as high-confidence semantic this http URL address this, we propose the Cascaded Early Rejection (CER) framework, which realizes hierarchical filtering for anomaly detection via a coarse-to-fine this http URL comprises two core modules: 1) the Structural Energy Sieve (SES), which establishes a non-parametric barrier at the network entry using the Laplacian operator to efficiently intercept physical signal anomalies; and 2) the Semantically-aware Hyperspherical Energy (SHE) detector, which decouples feature magnitude from direction in intermediate layers to identify fine-grained semantic deviations. Experimental results demonstrate that CER not only reduces computational overhead by 32% but also achieves a significant performance leap on the CIFAR-100 benchmark: the average FPR95 drastically decreases from 33.58% to 22.84%, and AUROC improves to 93.97%. Crucially, in real-world scenarios simulating sensor failures, CER exhibits performance far exceeding state-of-the-art methods. As a universal plugin, CER can be seamlessly integrated into various SOTA models to provide performance gains.
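The idea of a cheap, non-parametric check at the network entry can be illustrated with a Laplacian-based statistic. The sketch below is an assumption about what such a check might look like, not the paper's SES definition: it computes the mean squared response of the 4-neighbour discrete Laplacian and rejects inputs whose energy falls outside a calibrated band. The function names and the thresholds `low` and `high` are hypothetical.

```python
def laplacian_energy(img):
    """Mean squared response of the 4-neighbour discrete Laplacian over interior pixels.

    img is a 2D list of grayscale values with at least 3 rows and 3 columns.
    """
    h, w = len(img), len(img[0])
    total, count = 0.0, 0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1]
                   - 4.0 * img[y][x])
            total += lap * lap
            count += 1
    return total / count

def early_reject(img, low, high):
    """Return True if the input should be rejected before running the network."""
    energy = laplacian_energy(img)
    return energy < low or energy > high

# Illustrative inputs: a blank frame (e.g. a dead sensor) and pixel-level checkerboard noise.
flat = [[0.5] * 8 for _ in range(8)]
checker = [[float((x + y) % 2) for x in range(8)] for y in range(8)]
```

In practice the band would be calibrated on in-distribution data; here a band such as `low=1e-4, high=10.0` rejects both illustrative inputs, since the flat frame has zero energy and the checkerboard has energy 16.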
[CV-65] Adaptive and Balanced Re-initialization for Long-timescale Continual Test-time Domain Adaptation ICASSP2026
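Quick read: This paper addresses performance degradation in Continual Test-Time Domain Adaptation (CTTA) under long-term, non-stationary environment changes. Existing methods optimize the adaptation process but do not safeguard model stability over long timescales. The proposed re-initialization-based policy, Adaptive-and-Balanced Re-initialization (ABR), dynamically adjusts the re-initialization interval according to changes in label flip, preserving performance as the environment evolves. Experiments show that ABR significantly outperforms existing methods on multiple CTTA benchmarks.
Link: https://arxiv.org/abs/2602.06328
Authors: Yanshuo Wang, Jinguang Tong, Jun Lan, Weiqiang Wang, Huijia Zhu, Haoxing Chen, Xuesong Li, Jie Hong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in ICASSP 2026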
Abstract:Continual test-time domain adaptation (CTTA) aims to adjust models so that they can perform well over time across non-stationary environments. While previous methods have made considerable efforts to optimize the adaptation process, a crucial question remains: Can the model adapt to continually changing environments over a long time? In this work, we explore facilitating better CTTA in the long run using a re-initialization (or reset) based method. First, we observe that the long-term performance is associated with the trajectory pattern in label flip. Based on this observed correlation, we propose a simple yet effective policy, Adaptive-and-Balanced Re-initialization (ABR), towards preserving the model’s long-term performance. In particular, ABR performs weight re-initialization using adaptive intervals. The adaptive interval is determined based on the change in label flip. The proposed method is validated on extensive CTTA benchmarks, achieving superior performance.
[CV-66] Accelerating Vision Transformers on Brain Processing Unit
Quick read: This paper addresses the architectural mismatch that arises when deploying Vision Transformer (ViT) models on hardware optimized for convolutional neural networks (CNNs), such as Brain Processing Units (BPUs). BPUs are optimized for four-dimensional convolution operations, whereas the linear layers in ViTs operate on three-dimensional data, so BPU acceleration cannot be used effectively. The solution restructures the DeiT model by replacing its linear layers and Layer Normalization operations with carefully designed convolutional operators, adapting the model to the BPU's computation characteristics while inheriting the original weights without retraining or fine-tuning. Experiments on ImageNet and a flower classification dataset show high accuracy with a substantial inference speedup (up to 3.8×); the authors describe it as the first approach to fully accelerate ViTs on BPUs.
Link: https://arxiv.org/abs/2602.06300
Authors: Jinchi Tang, Yan Guo
Affiliations: Suzhou Institute for Advanced Research, USTC; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:With the advancement of deep learning technologies, specialized neural processing hardware such as Brain Processing Units (BPUs) have emerged as dedicated platforms for CNN acceleration, offering optimized INT8 computation capabilities for convolutional operations. Meanwhile, Vision Transformer (ViT) models, such as the Data-efficient Image Transformer (DeiT), have demonstrated superior performance and play increasingly crucial roles in computer vision tasks. However, due to the architectural mismatch between CNN-optimized hardware and Vision Transformer computation characteristics (namely, linear layers in Transformers operate on three-dimensional data while BPU acceleration is designed for four-dimensional convolution operations), it is difficult or even impossible to leverage BPU’s advantages when deploying Vision Transformers. To address this challenge, we propose a novel approach that restructures the Vision Transformer by replacing linear layers and layer normalization operations with carefully designed convolutional operators. This enables DeiT to fully utilize the acceleration capabilities of BPUs, while allowing the original weight parameters to be inherited by the restructured models without retraining or fine-tuning. To the best of our knowledge, this is the first successful deployment of Vision Transformers that fully leverages BPU acceleration. Experiments on classification datasets demonstrate the effectiveness of our approach. Specifically, the quantized DeiT-Base model achieves 80.4% accuracy on ImageNet, compared to the original 81.8%, while obtaining up to a 3.8× inference speedup. Our finetuned DeiT model on the flower classification dataset also achieves excellent performance, with only a 0.5% accuracy drop for the DeiT-Base model, further demonstrating the effectiveness of our method.
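The reason weights can be inherited without retraining is that a linear layer applied per token is mathematically the same as a 1×1 convolution over a feature map whose spatial positions are the tokens. The sketch below checks that equivalence in plain Python. It covers only the linear-layer case, says nothing about how the paper handles layer normalization or quantization, and all function names and sample values are hypothetical.

```python
def linear(tokens, weight, bias):
    """tokens: [N][C_in]; weight: [C_out][C_in]; returns [N][C_out]."""
    return [[sum(w * x for w, x in zip(row, tok)) + b
             for row, b in zip(weight, bias)] for tok in tokens]

def conv1x1(fmap, weight, bias):
    """fmap: [C_in][H][W]; weight reused unchanged as a 1x1 kernel; returns [C_out][H][W]."""
    c_in, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[[sum(weight[o][c] * fmap[c][y][x] for c in range(c_in)) + bias[o]
              for x in range(w)] for y in range(h)] for o in range(len(weight))]

def tokens_to_fmap(tokens):
    """[N][C] -> [C][N][1]: channels first, tokens along height, dummy width of 1."""
    return [[[tok[c]] for tok in tokens] for c in range(len(tokens[0]))]

def fmap_to_tokens(fmap):
    """[C][N][1] -> [N][C]: inverse of tokens_to_fmap."""
    return [[fmap[c][n][0] for c in range(len(fmap))] for n in range(len(fmap[0]))]

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # N=3 tokens, C_in=2
weight = [[0.5, -1.0], [2.0, 0.0], [1.0, 1.0]]   # C_out=3
bias = [0.1, 0.2, 0.3]

via_linear = linear(tokens, weight, bias)
via_conv = fmap_to_tokens(conv1x1(tokens_to_fmap(tokens), weight, bias))
```

Both paths produce the same values, so the only change needed is a reshape of the activations into a layout the convolution hardware accepts, with a batch axis added in front for the four-dimensional case.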
[CV-67] Unsupervised MRI-US Multimodal Image Registration with Multilevel Correlation Pyramidal Optimization MICCAI2025
Quick read: This paper addresses key challenges in registering preoperative and intraoperative multimodal medical images, in particular the differences between modalities and the deformation caused by intraoperative tissue displacement and resection. It proposes an unsupervised registration method based on multilevel correlation pyramidal optimization (MCPO). Features of each modality are first extracted with the modality independent neighborhood descriptor and mapped into a common feature space. A multilevel pyramidal fusion optimization mechanism then applies dense correlation analysis and weight-balanced coupled convex optimization at different scales, combining global optimization of the displacement field with complementary local detail, which significantly improves registration accuracy and robustness.
Link: https://arxiv.org/abs/2602.06288
Authors: Jiazheng Wang, Zeyu Liu, Min Liu, Xiang Chen, Hang Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: first-place method of ReMIND2Reg Learn2Reg 2025 (in MICCAI 2025)
Abstract:Surgical navigation based on multimodal image registration has played a significant role in providing intraoperative guidance to surgeons by showing the relative position of the target area to critical anatomical structures during surgery. However, due to the differences between multimodal images and intraoperative image deformation caused by tissue displacement and removal during the surgery, effective registration of preoperative and intraoperative multimodal images faces significant challenges. To address the multimodal image registration challenges in Learn2Reg 2025, we design an unsupervised multimodal medical image registration method based on multilevel correlation pyramidal optimization (MCPO). First, the features of each modality are extracted based on the modality independent neighborhood descriptor, and the multimodal images are mapped to the feature space. Second, a multilevel pyramidal fusion optimization mechanism is designed to achieve global optimization and local detail complementation of the displacement field through dense correlation analysis and weight-balanced coupled convex optimization for input features at different scales. Our method focuses on the ReMIND2Reg task in Learn2Reg 2025. Based on the results, our method achieved first place in both the validation phase and the test phase of ReMIND2Reg. The MCPO is also validated on the Resect dataset, achieving an average TRE of 1.798 mm. This demonstrates the broad applicability of our method in preoperative-to-intraoperative image registration. The code is available at this https URL.
[CV-68] MMEarth-Bench: Global Model Adaptation via Multimodal Test-Time Training
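Quick read: This paper addresses the limited generalization of pretrained models in geospatial machine learning across multiple modalities and at global scale. Existing benchmark datasets usually have few modalities and limited geographic coverage, making it hard to evaluate multimodal pretrained models thoroughly. The authors introduce MMEarth-Bench, a collection of five new multimodal environmental tasks with 12 modalities, globally distributed data, and both in- and out-of-distribution test splits. The key contribution is a model-agnostic method, test-time training with multimodal reconstruction (TTT-MMR), which uses reconstruction of all modalities available at test time as auxiliary tasks. It improves performance on both the random and the geographically out-of-distribution test splits, and geographic batching yields a good trade-off between regularization and specialization.
Link: https://arxiv.org/abs/2602.06285
Authors: Lucia Gordon, Serge Belongie, Christian Igel, Nico Lang
Affiliations: Harvard University; University of Copenhagen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: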
Abstract:Recent research in geospatial machine learning has demonstrated that models pretrained with self-supervised learning on Earth observation data can perform well on downstream tasks with limited training data. However, most of the existing geospatial benchmark datasets have few data modalities and poor global representation, limiting the ability to evaluate multimodal pretrained models at global scales. To fill this gap, we introduce MMEarth-Bench, a collection of five new multimodal environmental tasks with 12 modalities, globally distributed data, and both in- and out-of-distribution test splits. We benchmark a diverse set of pretrained models and find that while (multimodal) pretraining tends to improve model robustness in limited data settings, geographic generalization abilities remain poor. In order to facilitate model adaptation to new downstream tasks and geographic domains, we propose a model-agnostic method for test-time training with multimodal reconstruction (TTT-MMR) that uses all the modalities available at test time as auxiliary tasks, regardless of whether a pretrained model accepts them as input. Our method improves model performance on both the random and geographic test splits, and geographic batching leads to a good trade-off between regularization and specialization during TTT. Our dataset, code, and visualization tool are linked from the project page at this http URL.
[CV-69] An Interpretable Vision Transformer as a Fingerprint-Based Diagnostic Aid for Kabuki and Wiedemann-Steiner Syndromes
Quick read: This paper addresses the high rate of missed diagnoses of two rare genetic syndromes, Kabuki syndrome (KS) and Wiedemann-Steiner syndrome (WSS), caused by limited access to genetic testing. The key idea is to use a vision transformer-based deep learning model that automatically extracts syndrome-specific morphological features from fingerprint images to distinguish KS, WSS, and unaffected controls, reaching an AUC of up to 0.85 (KS vs. WSS). Attention-based visualizations improve interpretability, pointing to a feasible path toward an early, noninvasive, accessible, and interpretable diagnostic aid for genetic disorders.
Link: https://arxiv.org/abs/2602.06282
Authors: Marilyn Lionts, Arnhildur Tomasdottir, Viktor I. Agustsson, Yuankai Huo, Hans T. Bjornsson, Lotta M. Ellingsen
Affiliations: Vanderbilt University; Landspitali University Hospital; University of Iceland; Johns Hopkins University
Categories: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments:
Abstract:Kabuki syndrome (KS) and Wiedemann-Steiner syndrome (WSS) are rare but distinct developmental disorders that share overlapping clinical features, including neurodevelopmental delay, growth restriction, and persistent fetal fingertip pads. While genetic testing remains the diagnostic gold standard, many individuals with KS or WSS remain undiagnosed due to barriers in access to both genetic testing and expertise. Dermatoglyphic anomalies, despite being established hallmarks of several genetic syndromes, remain an underutilized diagnostic signal in the era of molecular testing. This study presents a vision transformer-based deep learning model that leverages fingerprint images to distinguish individuals with KS and WSS from unaffected controls and from one another. We evaluate model performance across three binary classification tasks. Across the three classification tasks, the model achieved AUC scores of 0.80 (control vs. KS), 0.73 (control vs. WSS), and 0.85 (KS vs. WSS), with corresponding F1 scores of 0.71, 0.72, and 0.83, respectively. Beyond classification, we apply attention-based visualizations to identify fingerprint regions most salient to model predictions, enhancing interpretability. Together, these findings suggest the presence of syndrome-specific fingerprint features, demonstrating the feasibility of a fingerprint-based artificial intelligence (AI) tool as a noninvasive, interpretable, and accessible future diagnostic aid for the early diagnosis of underdiagnosed genetic syndromes.
[CV-70] ASMa: Asymmetric Spatio-temporal Masking for Skeleton Action Representation Learning
Quick read: This paper addresses a bias in self-supervised learning (SSL) for skeleton-based action recognition: augmentation strategies concentrate on high-motion frames and high-degree joints (such as joints with degree 3 or 4), yielding biased, incomplete feature representations that generalize poorly across varied motion patterns. The proposed Asymmetric Spatio-temporal Masking (ASMa) uses two complementary masking strategies, one masking high-degree joints and low-motion frames and the other masking low-degree joints and high-motion frames, to learn more comprehensive and balanced spatio-temporal representations. A learnable feature alignment module fuses the representations from the two masked views, and knowledge distillation compresses the model for efficient deployment on resource-constrained devices.
Link: https://arxiv.org/abs/2602.06251
Authors: Aman Anand, Amir Eskandari, Elyas Rahsno, Farhana Zulkernine
Affiliations: Queen’s University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Self-supervised learning (SSL) has shown remarkable success in skeleton-based action recognition by leveraging data augmentations to learn meaningful representations. However, existing SSL methods rely on data augmentations that predominantly focus on masking high-motion frames and high-degree joints such as joints with degree 3 or 4. This results in biased and incomplete feature representations that struggle to generalize across varied motion patterns. To address this, we propose Asymmetric Spatio-temporal Masking (ASMa) for Skeleton Action Representation Learning, a novel combination of masking to learn a full spectrum of spatio-temporal dynamics inherent in human actions. ASMa employs two complementary masking strategies: one that selectively masks high-degree joints and low-motion frames, and another that masks low-degree joints and high-motion frames. These masking strategies ensure a more balanced and comprehensive skeleton representation learning. Furthermore, we introduce a learnable feature alignment module to effectively align the representations learned from both masked views. To facilitate deployment in resource-constrained settings and on low-resource devices, we compress the learned and aligned representation into a lightweight model using knowledge distillation. Extensive experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that our approach outperforms existing SSL methods with an average improvement of 2.7-4.4% in fine-tuning and up to 5.9% in transfer learning to noisy datasets and achieves competitive performance compared to fully supervised baselines. Our distilled model achieves 91.4% parameter reduction and 3x faster inference on edge devices while maintaining competitive accuracy, enabling practical deployment in resource-constrained scenarios.
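The two complementary views can be sketched as index selection over joints and frames. The example below is a simplified illustration under stated assumptions: it ranks joints by graph degree and frames by a motion score, then takes a fixed number from each end. The paper's actual selection rule (ratios, randomness, motion definition) is not given in the abstract, so the function names, scores, and counts here are hypothetical.

```python
def top_k_indices(scores, k, largest=True):
    """Indices of the k largest (or smallest) scores; ties keep original order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=largest)
    return set(order[:k])

def asymmetric_masks(joint_degree, frame_motion, n_joints_masked, n_frames_masked):
    """Build two complementary masked views of a skeleton sequence."""
    view_a = {  # high-degree joints + low-motion frames
        "joints": top_k_indices(joint_degree, n_joints_masked, largest=True),
        "frames": top_k_indices(frame_motion, n_frames_masked, largest=False),
    }
    view_b = {  # low-degree joints + high-motion frames
        "joints": top_k_indices(joint_degree, n_joints_masked, largest=False),
        "frames": top_k_indices(frame_motion, n_frames_masked, largest=True),
    }
    return view_a, view_b

# Toy skeleton: 6 joints with their graph degrees, 5 frames with motion magnitudes.
joint_degree = [1, 2, 3, 2, 1, 4]
frame_motion = [0.1, 0.9, 0.4, 0.05, 0.7]
view_a, view_b = asymmetric_masks(joint_degree, frame_motion, 2, 2)
```

With these toy values the two views mask disjoint joints and disjoint frames, so between them the encoder sees every part of the sequence hidden in one view and visible in the other.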
[CV-71] ForeHOI: Feed-forward 3D Object Reconstruction from Daily Hand-Object Interaction Videos
Quick read: This paper addresses reconstructing 3D object geometry from monocular hand-object interaction videos, which is difficult because of severe occlusion and the complex, coupled motion of the camera, hands, and object. The proposed feed-forward model, ForeHOI, jointly predicts 2D mask inpainting and 3D shape completion within a feed-forward framework, which mitigates severe hand occlusion and outperforms optimization-based methods. The information exchange between 2D and 3D shape completion improves overall reconstruction quality. The model needs no pre-processing, runs inference within one minute, and is about 100× faster than previous methods.
Link: https://arxiv.org/abs/2602.06226
Authors: Yuantao Chen, Jiahao Chang, Chongjie Ye, Chaoran Zhang, Zhaojie Fang, Chenghong Li, Xiaoguang Han
Affiliations: SSE, CUHKSZ; FNii-Shenzhen; SDS, CUHKSZ
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 7 figures, Page: this https URL
Abstract:The ubiquity of monocular videos capturing daily hand-object interactions presents a valuable resource for embodied intelligence. While 3D hand reconstruction from in-the-wild videos has seen significant progress, reconstructing the involved objects remains challenging due to severe occlusions and the complex, coupled motion of the camera, hands, and object. In this paper, we introduce ForeHOI, a novel feed-forward model that directly reconstructs 3D object geometry from monocular hand-object interaction videos within one minute of inference time, eliminating the need for any pre-processing steps. Our key insight is that the joint prediction of 2D mask inpainting and 3D shape completion in a feed-forward framework can effectively address the problem of severe occlusion in monocular hand-held object videos, thereby outperforming optimization-based methods. The information exchange between the 2D and 3D shape completion boosts the overall reconstruction quality, enabling the framework to effectively handle severe hand-object occlusion. Furthermore, to support the training of our model, we contribute the first large-scale, high-fidelity synthetic dataset of hand-object interactions with comprehensive annotations. Extensive experiments demonstrate that ForeHOI achieves state-of-the-art performance in object reconstruction, significantly outperforming previous methods with around a 100x speedup. Code and data are available at: this https URL.
[CV-72] Cross-Modal Redundancy and the Geometry of Vision-Language Embeddings ICLR2026
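Quick read: This paper addresses the poorly understood geometry of the cross-modal embedding space in Vision-Language Models (VLMs): although VLMs align images and text well, the intrinsic geometry of their shared representation space lacks systematic understanding. The key is the Iso-Energy Assumption, which exploits cross-modal redundancy: a truly shared concept should have the same average energy across modalities. Based on it, the authors propose an Aligned Sparse Autoencoder (SAE) that encourages energy consistency during training while preserving reconstruction. This inductive bias changes the SAE solution without harming reconstruction quality, yielding an interpretable representation for geometric analysis. Experiments reveal a clear structure in the VLM latent space: bimodal atoms carry the entire cross-modal alignment signal, while unimodal atoms act as modality-specific biases and fully explain the modality gap. Removing the unimodal atoms closes the gap without harming performance, and restricting vector arithmetic to the bimodal subspace yields in-distribution edits and improved retrieval.
Link: https://arxiv.org/abs/2602.06218
Authors: Grégoire Dhimoïla, Thomas Fel, Victor Boutin, Agustin Picard
Affiliations: Brown University; ENS Paris Saclay; IRT Saint Exupéry; Kempner Institute, Harvard University; CNRS
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Published as a conference paper at ICLR 2026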
Abstract:Vision-language models (VLMs) align images and text with remarkable success, yet the geometry of their shared embedding space remains poorly understood. To probe this geometry, we begin from the Iso-Energy Assumption, which exploits cross-modal redundancy: a concept that is truly shared should exhibit the same average energy across modalities. We operationalize this assumption with an Aligned Sparse Autoencoder (SAE) that encourages energy consistency during training while preserving reconstruction. We find that this inductive bias changes the SAE solution without harming reconstruction, giving us a representation that serves as a tool for geometric analysis. Sanity checks on controlled data with known ground truth confirm that alignment improves when Iso-Energy holds and remains neutral when it does not. Applied to foundational VLMs, our framework reveals a clear structure with practical consequences: (i) sparse bimodal atoms carry the entire cross-modal alignment signal; (ii) unimodal atoms act as modality-specific biases and fully explain the modality gap; (iii) removing unimodal atoms collapses the gap without harming performance; (iv) restricting vector arithmetic to the bimodal subspace yields in-distribution edits and improved retrieval. These findings suggest that the right inductive bias can both preserve model fidelity and render the latent geometry interpretable and actionable.
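One way to read the Iso-Energy Assumption is as a penalty on the difference in per-atom average energy between image and text codes. The sketch below assumes, purely for illustration, that "energy" means the mean squared activation of a dictionary atom. The abstract does not define it, so this definition and the function names are hypothetical rather than the paper's formulation.

```python
def atom_energy(codes):
    """codes: [n_samples][n_atoms] sparse codes; returns mean squared activation per atom."""
    n = len(codes)
    return [sum(row[k] ** 2 for row in codes) / n for k in range(len(codes[0]))]

def iso_energy_penalty(image_codes, text_codes):
    """Sum over atoms of the squared gap between image-side and text-side energy."""
    e_img, e_txt = atom_energy(image_codes), atom_energy(text_codes)
    return sum((a - b) ** 2 for a, b in zip(e_img, e_txt))

# Toy codes with two atoms: atom 0 fires equally in both modalities,
# atom 1 fires only on images.
image_codes = [[1.0, 0.0], [1.0, 2.0]]
text_codes = [[1.0, 0.0], [1.0, 0.0]]
penalty = iso_energy_penalty(image_codes, text_codes)
```

Here atom 0 contributes nothing to the penalty while atom 1 contributes all of it, which mirrors the distinction the abstract draws between bimodal atoms and unimodal atoms.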
[CV-73] Addressing the Waypoint-Action Gap in End-to-End Autonomous Driving via Vehicle Motion Models
Quick read: This paper addresses the training and evaluation gap between waypoint-based and action-based models in End-to-End Autonomous Driving (E2E-AD). Mainstream benchmark protocols and training pipelines are built around waypoint prediction, which makes action-based policies hard to train and compare fairly and slows their progress. The proposed solution is a novel differentiable vehicle-model framework that rolls out predicted action sequences into the corresponding ego-frame waypoint trajectories and supervises them in waypoint space. This allows action-based architectures to be trained and evaluated within waypoint-based benchmarks, for the first time, without modifying the underlying evaluation protocol.
Link: https://arxiv.org/abs/2602.06214
Authors: Jorge Daniel Rodríguez-Vidal, Gabriel Villalonga, Diego Porres, Antonio M. López Peña
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 8 pages, 3 figures
Abstract:End-to-End Autonomous Driving (E2E-AD) systems are typically grouped by the nature of their outputs: (i) waypoint-based models that predict a future trajectory, and (ii) action-based models that directly output throttle, steer and brake. Most recent benchmark protocols and training pipelines are waypoint-based, which makes action-based policies harder to train and compare, slowing their progress. To bridge this waypoint-action gap, we propose a novel, differentiable vehicle-model framework that rolls out predicted action sequences to their corresponding ego-frame waypoint trajectories while supervising in waypoint space. Our approach enables action-based architectures to be trained and evaluated, for the first time, within waypoint-based benchmarks without modifying the underlying evaluation protocol. We extensively evaluate our framework across multiple challenging benchmarks and observe consistent improvements over the baselines. In particular, on NAVSIM navhard our approach achieves state-of-the-art performance. Our code will be made publicly available upon acceptance.
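Rolling actions out into waypoints can be illustrated with a kinematic bicycle model, a common simple vehicle model. The abstract does not say which motion model the authors use, so this choice, the mapping of actions to (acceleration, steering angle), and the parameters `wheelbase` and `dt` are all assumptions. The sketch uses plain floats; a trainable version would express the same update in an autodiff framework so that a waypoint-space loss can propagate back to the actions.

```python
import math

def rollout(actions, v0, wheelbase=2.7, dt=0.5):
    """Integrate (accel, steer_angle) actions into ego-frame (x, y) waypoints.

    The ego vehicle starts at the origin facing +x with speed v0.
    """
    x, y, yaw, v = 0.0, 0.0, 0.0, v0
    waypoints = []
    for accel, steer in actions:
        x += v * math.cos(yaw) * dt
        y += v * math.sin(yaw) * dt
        yaw += v / wheelbase * math.tan(steer) * dt
        v = max(0.0, v + accel * dt)  # no reversing in this sketch
        waypoints.append((x, y))
    return waypoints

def waypoint_l2_loss(pred, target):
    """Mean squared distance between predicted and target waypoints."""
    return sum((px - tx) ** 2 + (py - ty) ** 2
               for (px, py), (tx, ty) in zip(pred, target)) / len(pred)

straight = rollout([(0.0, 0.0)] * 4, v0=2.0)   # constant speed, no steering
left_turn = rollout([(0.0, 0.2)] * 4, v0=2.0)  # constant positive steering angle
```

Supervising `waypoint_l2_loss(rollout(actions, v0), target)` is the sense in which an action-based policy can be scored by a waypoint-based protocol without changing that protocol.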
[CV-74] DroneKey++: A Size Prior-free Method and New Benchmark for Drone 3D Pose Estimation from Sequential Images ICRA2026
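Quick read: This paper addresses high-precision 3D pose estimation of drones for security and surveillance systems, targeting two limitations of existing work: reliance on prior information (such as physical size or 3D models), and datasets that are small, limited in scene variety, and weak for validating generalization. The proposed prior-free end-to-end framework, DroneKey++, has two key parts: (1) a keypoint encoder that performs keypoint detection and classification simultaneously, and (2) a pose decoder that estimates 3D pose using ray-based geometric reasoning and class embeddings. A large-scale synthetic dataset, 6DroneSyn (over 50K images, 7 drone models, 88 outdoor backgrounds), is built to support generalization. Experiments report competitive rotation and translation errors and real-time inference (414 FPS on GPU).
Link: https://arxiv.org/abs/2602.06211
Authors: Seo-Bin Hwang, Yeong-Jun Cho
Affiliations: Chonnam National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures, 6 tables, Accepted to ICRA 2026 (to appear)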
Abstract:Accurate 3D pose estimation of drones is essential for security and surveillance systems. However, existing methods often rely on prior drone information such as physical sizes or 3D meshes. At the same time, current datasets are small-scale, limited to single models, and collected under constrained environments, which makes reliable validation of generalization difficult. We present DroneKey++, a prior-free framework that jointly performs keypoint detection, drone classification, and 3D pose estimation. The framework employs a keypoint encoder for simultaneous keypoint detection and classification, and a pose decoder that estimates 3D pose using ray-based geometric reasoning and class embeddings. To address dataset limitations, we construct 6DroneSyn, a large-scale synthetic benchmark with over 50K images covering 7 drone models and 88 outdoor backgrounds, generated using 360-degree panoramic synthesis. Experiments show that DroneKey++ achieves MAE 17.34 deg and MedAE 17.1 deg for rotation, MAE 0.135 m and MedAE 0.242 m for translation, with inference speeds of 19.25 FPS (CPU) and 414.07 FPS (GPU), demonstrating both strong generalization across drone models and suitability for real-time applications. The dataset is publicly available.
[CV-75] AnyThermal: Towards Learning Universal Representations for Thermal Perception ICRA
Quick read: This paper addresses the limited generalization of existing thermal imaging backbones, which are trained task-specifically on small-scale data and struggle to adapt to multiple environments and tasks. The proposed AnyThermal model uses knowledge distillation to transfer general-purpose features from visual foundation models (such as DINOv2) into a thermal encoder, learning robust, task-agnostic thermal feature representations. To support this, the authors build TartanRGBT, the first open-source data collection platform with synchronized RGB-thermal image acquisition, and use it to collect the TartanRGBT dataset across four environments (indoor, aerial, off-road, and urban), narrowing the diversity gap of existing RGB-thermal datasets. The result is significant gains (up to 36%) on downstream tasks including cross-modal place recognition, thermal segmentation, and monocular depth estimation.
Link: https://arxiv.org/abs/2602.06203
Authors: Parv Maheshwari, Jay Karhade, Yogesh Chawla, Isaiah Adu, Florian Heisen, Andrew Porco, Andrew Jong, Yifei Liu, Santosh Pitla, Sebastian Scherer, Wenshan Wang
Affiliations: Robotics Institute, Carnegie Mellon University; Biological Systems Engineering, University of Nebraska-Lincoln; Mechanical Engineering, Penn State University; School of Engineering and Design, Technical University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted at IEEE ICRA (International Conference on Robotics and Automation) 2026
Abstract:We present AnyThermal, a thermal backbone that captures robust task-agnostic thermal features suitable for a variety of tasks such as cross-modal place recognition, thermal segmentation, and monocular depth estimation using thermal images. Existing thermal backbones that follow task-specific training from small-scale data result in utility limited to a specific environment and task. Unlike prior methods, AnyThermal can be used for a wide range of environments (indoor, aerial, off-road, urban) and tasks, all without task-specific training. Our key insight is to distill the feature representations from visual foundation models such as DINOv2 into a thermal encoder using thermal data from these multiple environments. To bridge the diversity gap of the existing RGB-Thermal datasets, we introduce the TartanRGBT platform, the first open-source data collection platform with synced RGB-Thermal image acquisition. We use this payload to collect the TartanRGBT dataset - a diverse and balanced dataset collected in 4 environments. We demonstrate the efficacy of AnyThermal and TartanRGBT, achieving state-of-the-art results with improvements of up to 36% across diverse environments and downstream tasks on existing datasets.
[CV-76] DeDPO: Debiased Direct Preference Optimization for Diffusion Models
Summary: This paper addresses the high cost and scalability bottleneck caused by the reliance of diffusion-model alignment on large-scale, high-quality human preference labels. The core solution is a semi-supervised framework, Debiased DPO (DeDPO). Its key idea is to bring a debiased estimation technique from causal inference into the DPO objective, explicitly identifying and correcting the systematic bias and noise introduced by synthetic annotators (such as self-training or vision-language models, VLMs). This enables robust learning from imperfect feedback sources, significantly improving performance and matching or occasionally exceeding the theoretical upper bound of models trained on fully human-labeled data.
Link: https://arxiv.org/abs/2602.06195
Authors: Khiem Pham, Quang Nguyen, Tung Nguyen, Jingsen Zhu, Michele Santacatterina, Dimitris Metaxas, Ramin Zabih
Affiliations: Cornell University; Rutgers University; NYU
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Direct Preference Optimization (DPO) has emerged as a predominant alignment method for diffusion models, facilitating off-policy training without explicit reward modeling. However, its reliance on large-scale, high-quality human preference labels presents a severe cost and scalability bottleneck. To overcome this, we propose a semi-supervised framework augmenting limited human data with a large corpus of unlabeled pairs annotated via cost-effective synthetic AI feedback. Our paper introduces Debiased DPO (DeDPO), which uniquely integrates a debiased estimation technique from causal inference into the DPO objective. By explicitly identifying and correcting the systematic bias and noise inherent in synthetic annotators, DeDPO ensures robust learning from imperfect feedback sources, including self-training and Vision-Language Models (VLMs). Experiments demonstrate that DeDPO is robust to the variations in synthetic labeling methods, achieving performance that matches and occasionally exceeds the theoretical upper bound of models trained on fully human-labeled data. This establishes DeDPO as a scalable solution for human-AI alignment using inexpensive synthetic supervision.
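The abstract does not spell out the estimator, so the following is only a minimal sketch of how a debiased objective of this kind could look. It assumes a prediction-powered-style correction: the loss on the large synthetic-labeled pool is adjusted by the average gap between human-label and synthetic-label losses on the small human-labeled set. The function names, the form of the correction, and the margin inputs are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -np.logaddexp(0.0, -x)

def dpo_loss(margin, label):
    """Per-pair DPO-style loss.

    margin: beta * (implicit reward of sample A - implicit reward of sample B).
    label:  1.0 if A is preferred, 0.0 if B is preferred (soft labels allowed).
    """
    return -(label * log_sigmoid(margin) + (1.0 - label) * log_sigmoid(-margin))

def debiased_dpo_loss(m_unlab, y_syn_unlab, m_lab, y_human, y_syn_lab):
    """Assumed debiased objective: plug-in term plus a bias correction.

    The plug-in term uses synthetic labels on the unlabeled pool; the
    correction estimates the synthetic annotator's bias on the pairs
    that also carry human labels.
    """
    plug_in = dpo_loss(m_unlab, y_syn_unlab).mean()
    correction = (dpo_loss(m_lab, y_human) - dpo_loss(m_lab, y_syn_lab)).mean()
    return plug_in + correction
```

Under this assumed form, the correction term vanishes when the synthetic annotator agrees with the human labels on the labeled set, and the objective reduces to plain DPO on synthetic labels.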
[CV-77] Unsupervised Anomaly Detection of Diseases in the Female Pelvis for Real-Time MR Imaging
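Summary: This paper addresses delayed MRI diagnosis of female pelvic diseases (such as uterine myomas, endometrial cancer, endometriosis, and adenomyosis) caused by high anatomical variability, as well as the limitations of existing AI methods, which are mostly disease-specific, lack real-time compatibility, and are hard to deploy clinically. The key to the solution is a disease-agnostic, parameter-agnostic, real-time-compatible unsupervised anomaly detection framework: a residual variational autoencoder is trained only on healthy sagittal T2-weighted MRI scans to learn normal pelvic anatomy, and at inference reconstruction-error heatmaps localize lesions, so anomalies are detected without labeled abnormal data. Diffusion-generated synthetic data is added to improve robustness. On a public dataset the method reaches an average AUC of 0.736, sensitivity of 0.828, and specificity of 0.692, with a reconstruction speed of about 92.6 frames per second, indicating potential for real-time clinical use.
Link: https://arxiv.org/abs/2602.06179
Authors: Anika Knupfer, Johanna P. Müller, Jordina A. Verdera, Martin Fenske, Claudius S. Mathy, Smiti Tripathy, Sebastian Arndt, Matthias May, Michael Uder, Matthias W. Beckmann, Stefanie Burghaus, Jana Hutter
Affiliations: University Hospital Erlangen (UKER); Friedrich-Alexander University Erlangen–Nürnberg (FAU); Leibniz University Hannover; Institute of Women’s Health
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 8 figures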
Abstract:Pelvic diseases in women of reproductive age represent a major global health burden, with diagnosis frequently delayed due to high anatomical variability, complicating MRI interpretation. Existing AI approaches are largely disease-specific and lack real-time compatibility, limiting generalizability and clinical integration. To address these challenges, we establish a benchmark framework for disease- and parameter-agnostic, real-time-compatible unsupervised anomaly detection in pelvic MRI. The method uses a residual variational autoencoder trained exclusively on healthy sagittal T2-weighted scans acquired across diverse imaging protocols to model normal pelvic anatomy. During inference, reconstruction error heatmaps indicate deviations from learned healthy structure, enabling detection of pathological regions without labeled abnormal data. The model is trained on 294 healthy scans and augmented with diffusion-generated synthetic data to improve robustness. Quantitative evaluation on the publicly available Uterine Myoma MRI Dataset yields an average area-under-the-curve (AUC) value of 0.736, with 0.828 sensitivity and 0.692 specificity. Additional inter-observer clinical evaluation extends analysis to endometrial cancer, endometriosis, and adenomyosis, revealing the influence of anatomical heterogeneity and inter-observer variability on performance interpretation. With a reconstruction time of approximately 92.6 frames per second, the proposed framework establishes a baseline for unsupervised anomaly detection in the female pelvis and supports future integration into real-time MRI. Code is available upon request (this https URL), prospective data sets are available for academic collaboration.
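As a rough illustration of reconstruction-error scoring, the sketch below computes a per-pixel heatmap and a slice-level anomaly score. The reconstruction is passed in as an array, standing in for the output of a trained autoencoder; the absolute-difference heatmap and the top-fraction pooling rule are assumptions chosen for simplicity, not details taken from the paper.

```python
import numpy as np

def anomaly_heatmap(image, reconstruction):
    """Per-pixel absolute reconstruction error."""
    return np.abs(np.asarray(image, dtype=float) - np.asarray(reconstruction, dtype=float))

def anomaly_score(heatmap, top_fraction=0.01):
    """Mean of the largest errors (assumed pooling rule, not from the paper)."""
    flat = np.sort(heatmap.ravel())
    k = max(1, int(round(top_fraction * flat.size)))
    return float(flat[-k:].mean())

# Toy example: a "healthy" reconstruction of zeros and an input with a bright patch.
reconstruction = np.zeros((64, 64))
healthy = np.zeros((64, 64))
lesion = healthy.copy()
lesion[20:24, 30:34] = 1.0  # 4x4 region the model fails to reconstruct

healthy_score = anomaly_score(anomaly_heatmap(healthy, reconstruction))
lesion_score = anomaly_score(anomaly_heatmap(lesion, reconstruction))
```

Thresholding such a score over a labeled test set is what would produce sensitivity, specificity, and AUC figures of the kind reported in the abstract.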
[CV-78] M3: High-fidelity Text-to-Image Generation via Multi-Modal Multi-Agent and Multi-Round Visual Reasoning
Summary: This paper addresses the weakness of generative AI in text-to-image (T2I) synthesis when handling complex compositional prompts: when a prompt contains multiple constraints, existing models often produce semantic deviations or structural errors. The key to the solution is M3 (Multi-Modal, Multi-Agent, Multi-Round), a training-free multi-agent iterative reasoning framework. Five specialized agents (Planner, Checker, Refiner, Editor, and Verifier) cooperate at inference time to decompose, verify, correct, and refine the generation, satisfying constraints one at a time with monotonic improvement, which markedly improves quality and robustness on complex compositional scenes.
Link: https://arxiv.org/abs/2602.06166
Authors: Bangji Yang, Ruihan Guo, Jiajun Fan, Chaoran Cheng, Ge Liu
Affiliations: University of Illinois Urbana-Champaign
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Generative models have achieved impressive fidelity in text-to-image synthesis, yet struggle with complex compositional prompts involving multiple constraints. We introduce M3 (Multi-Modal, Multi-Agent, Multi-Round), a training-free framework that systematically resolves these failures through iterative inference-time refinement. M3 orchestrates off-the-shelf foundation models in a robust multi-agent loop: a Planner decomposes prompts into verifiable checklists, while specialized Checker, Refiner, and Editor agents surgically correct constraints one at a time, with a Verifier ensuring monotonic improvement. Applied to open-source models, M3 achieves remarkable results on the challenging OneIG-EN benchmark, with our Qwen-Image+M3 surpassing commercial flagship systems including Imagen4 (0.515) and Seedream 3.0 (0.530), reaching state-of-the-art performance (0.532 overall). This demonstrates that intelligent multi-agent reasoning can elevate open-source models beyond proprietary alternatives. M3 also substantially improves GenEval compositional metrics, effectively doubling spatial reasoning performance on hardened test sets. As a plug-and-play module compatible with any pre-trained T2I model, M3 establishes a new paradigm for compositional generation without costly retraining.
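The check, edit, and verify loop described above can be sketched with plain callables. This is a toy under stated assumptions: `refine`, `check`, `edit`, and `score` are hypothetical names, an "image" is modeled as the set of constraints it satisfies, and the only behavior taken from the abstract is that constraints are fixed one at a time and a candidate is kept only if the verifier's score improves.

```python
def refine(image, checklist, check, edit, score, max_rounds=5):
    """Hypothetical multi-round loop; all callables are supplied by the caller."""
    best, best_score = image, score(image, checklist)
    for _ in range(max_rounds):
        failed = [c for c in checklist if not check(best, c)]
        if not failed:
            break  # every checklist item is satisfied
        candidate = edit(best, failed[0])      # fix one constraint at a time
        cand_score = score(candidate, checklist)
        if cand_score > best_score:            # verifier: accept only improvements
            best, best_score = candidate, cand_score
    return best, best_score

# Toy stand-ins: an "image" is just the set of constraints it satisfies.
checklist = ["red cube", "blue ball", "cube left of ball"]
check = lambda img, c: c in img
edit = lambda img, c: img | {c}
score = lambda img, cl: sum(c in img for c in cl)

final, final_score = refine(frozenset(), checklist, check, edit, score)
```

Because a rejected candidate never replaces the current best, the score returned by this loop cannot decrease from one round to the next, which is the monotonicity property the Verifier is said to enforce.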
[CV-79] MetaSSP: Enhancing Semi-supervised Implicit 3D Reconstruction through Meta-adaptive EMA and SDF-aware Pseudo-label Evaluation
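Summary: This paper addresses the heavy dependence of implicit SDF-based single-view 3D reconstruction on large labeled datasets, which limits scalability. The key to the solution is MetaSSP, a novel semi-supervised framework that introduces gradient-based parameter importance estimation to regularize adaptive exponential moving average (EMA) updates, together with a pseudo-label weighting mechanism that combines augmentation consistency with SDF variance, so that abundant unlabeled images can be exploited. Starting from a 10% supervised warm-up, the method jointly refines labeled and unlabeled data; on the Pix3D benchmark it reduces Chamfer Distance by about 20.61% and raises IoU by about 24.09%, setting a new state of the art.
Link: https://arxiv.org/abs/2602.06163
Authors: Luoxi Zhang, Chun Xie, Itaru Kitahara
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: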
Abstract:Implicit SDF-based methods for single-view 3D reconstruction achieve high-quality surfaces but require large labeled datasets, limiting their scalability. We propose MetaSSP, a novel semi-supervised framework that exploits abundant unlabeled images. Our approach introduces gradient-based parameter importance estimation to regularize adaptive EMA updates and an SDF-aware pseudo-label weighting mechanism combining augmentation consistency with SDF variance. Beginning with a 10% supervised warm-up, the unified pipeline jointly refines labeled and unlabeled data. On the Pix3D benchmark, our method reduces Chamfer Distance by approximately 20.61% and increases IoU by around 24.09% compared to existing semi-supervised baselines, setting a new state of the art.
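To make the idea of importance-regularized EMA concrete, here is a minimal sketch. It assumes that importance is the normalized gradient magnitude and that more important parameters receive a decay closer to 1, so the teacher changes them more slowly; both choices, and the function names, are illustrative assumptions rather than the rule used in the paper.

```python
import numpy as np

def importance_from_grads(grads, eps=1e-12):
    """Normalize gradient magnitudes to [0, 1] (assumed importance measure)."""
    g = np.abs(np.asarray(grads, dtype=float))
    return g / (g.max() + eps)

def adaptive_ema_update(teacher, student, grads, base_decay=0.99):
    """Teacher update with a per-parameter decay in [base_decay, 1]."""
    imp = importance_from_grads(grads)
    decay = base_decay + (1.0 - base_decay) * imp
    return decay * np.asarray(teacher, dtype=float) + (1.0 - decay) * np.asarray(student, dtype=float)

teacher = np.array([0.0, 0.0])
student = np.array([1.0, 1.0])
grads = np.array([0.0, 2.0])   # the second parameter is treated as important
updated = adaptive_ema_update(teacher, student, grads)
```

In this toy run the unimportant parameter moves by the usual `1 - base_decay` step toward the student, while the important one is left almost untouched.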
[CV-80] Driving with DINO: Vision Foundation Features as a Unified Bridge for Sim-to-Real Generation in Autonomous Driving
Summary: This paper addresses the "Consistency-Realism Dilemma" in existing Sim2Real video generation for autonomous driving: low-level signals (e.g., edges, blurred images) give precise control but "bake in" synthetic artifacts and hurt realism, whereas high-level priors (e.g., depth, semantics, HDMaps) improve realism but lack the structural detail needed for consistent control. The key to the solution is the Driving with DINO (DwD) framework, which uses Vision Foundation Module (VFM) features as a unified bridge. Principal Subspace Projection discards high-frequency texture information to mitigate baking, while Random Channel Tail Drop compensates for the structural loss caused by dimensionality reduction. A learnable Spatial Alignment Module adapts high-resolution DINOv3 features to the diffusion backbone, and a Causal Temporal Aggregator with causal convolutions preserves historical motion context, suppressing motion blur and ensuring temporal stability.
Link: https://arxiv.org/abs/2602.06159
Authors: Xuyang Chen, Conglang Zhang, Chuanheng Fu, Zihao Yang, Kaixuan Zhou, Yizhi Zhang, Jianan He, Yanfeng Zhang, Mingwei Sun, Zengmao Wang, Zhen Dong, Xiaoxiao Long, Liqiu Meng
Affiliations: Technical University of Munich; Huawei Hilbert Research Center; Wuhan University; Huawei Riemann Lab; University of Science and Technology of China; Nanjing University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website: this https URL
Abstract:Driven by the emergence of Controllable Video Diffusion, existing Sim2Real methods for autonomous driving video generation typically rely on explicit intermediate representations to bridge the domain gap. However, these modalities face a fundamental Consistency-Realism Dilemma. Low-level signals (e.g., edges, blurred images) ensure precise control but compromise realism by “baking in” synthetic artifacts, whereas high-level priors (e.g., depth, semantics, HDMaps) facilitate photorealism but lack the structural detail required for consistent guidance. In this work, we present Driving with DINO (DwD), a novel framework that leverages Vision Foundation Module (VFM) features as a unified bridge between the simulation and real-world domains. We first identify that these features encode a spectrum of information, from high-level semantics to fine-grained structure. To effectively utilize this, we employ Principal Subspace Projection to discard the high-frequency elements responsible for “texture baking,” while concurrently introducing Random Channel Tail Drop to mitigate the structural loss inherent in rigid dimensionality reduction, thereby reconciling realism with control consistency. Furthermore, to fully leverage DINOv3’s high-resolution capabilities for enhancing control precision, we introduce a learnable Spatial Alignment Module that adapts these high-resolution features to the diffusion backbone. Finally, we propose a Causal Temporal Aggregator employing causal convolutions to explicitly preserve historical motion context when integrating frame-wise DINO features, which effectively mitigates motion blur and guarantees temporal stability. Project page: this https URL
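One plausible reading of the two feature-processing steps is a PCA-style projection onto the leading principal directions, followed by randomly zeroing a trailing block of the projected channels during training. The sketch below implements that reading on a toy feature matrix; the function names, the uniform choice of cut index, and the use of SVD are assumptions, not details confirmed by the abstract.

```python
import numpy as np

def fit_principal_subspace(features, k):
    """Return the feature mean and the top-k principal directions, shape (C, k)."""
    mean = features.mean(axis=0)
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:k].T

def project(features, mean, basis):
    """Coordinates of the features in the principal subspace, shape (N, k)."""
    return (features - mean) @ basis

def random_channel_tail_drop(coeffs, rng, min_keep=1):
    """Zero a random tail of channels, keeping at least `min_keep` leading ones."""
    k = coeffs.shape[1]
    cut = int(rng.integers(min_keep, k + 1))  # keep channels [0, cut)
    out = coeffs.copy()
    out[:, cut:] = 0.0
    return out, cut

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 16))  # toy stand-in for per-token features
mean, basis = fit_principal_subspace(features, k=4)
coeffs = project(features, mean, basis)
dropped, cut = random_channel_tail_drop(coeffs, rng)
```

Since the principal directions are ordered by explained variance, dropping from the tail removes the finest retained detail first, which would let a downstream model learn to cope with varying amounts of structural information.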
[CV-81] MGP-KAD: Multimodal Geometric Priors and Kolmogorov-Arnold Decoder for Single-View 3D Reconstruction in Complex Scenes ICIP
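Summary: This paper addresses the challenges of single-view 3D reconstruction in complex real-world scenes, including noise, object diversity, and limited training data. The core solution is MGP-KAD, a multimodal feature fusion framework with two key innovations. First, it introduces a geometric prior, generated by sampling and clustering ground-truth object data into class-level features that are adjusted dynamically during training to improve geometric understanding. Second, it designs a hybrid decoder based on Kolmogorov-Arnold Networks (KAN) to overcome the limited ability of traditional linear decoders to handle complex multimodal inputs, which markedly improves geometric integrity, smoothness, and detail preservation.
Link: https://arxiv.org/abs/2602.06158
Authors: Luoxi Zhang, Chun Xie, Itaru Kitahara
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages. Published in IEEE International Conference on Image Processing (ICIP) 2025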
Abstract:Single-view 3D reconstruction in complex real-world scenes is challenging due to noise, object diversity, and limited dataset availability. To address these challenges, we propose MGP-KAD, a novel multimodal feature fusion framework that integrates RGB and geometric prior to enhance reconstruction accuracy. The geometric prior is generated by sampling and clustering ground-truth object data, producing class-level features that dynamically adjust during training to improve geometric understanding. Additionally, we introduce a hybrid decoder based on Kolmogorov-Arnold Networks (KAN) to overcome the limitations of traditional linear decoders in processing complex multimodal inputs. Extensive experiments on the Pix3D dataset demonstrate that MGP-KAD achieves state-of-the-art (SOTA) performance, significantly improving geometric integrity, smoothness, and detail preservation. Our work provides a robust and effective solution for advancing single-view 3D reconstruction in complex scenes.
[CV-82] EgoAVU: Egocentric Audio-Visual Understanding
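Summary: This paper addresses the weak ability of multimodal large language models (MLLMs) to jointly understand visual and audio information in egocentric videos. Existing work is limited by the lack of text annotations with coherent cross-modal information, so MLLMs struggle to fuse the two modalities. The key to the solution is EgoAVU, a scalable data engine that automatically generates audio-visual narrations, questions, and answers for egocentric videos through cross-modal correlation modeling, and combines token-based video filtering with modular, graph-based curation to ensure data diversity and quality. The resulting EgoAVU-Instruct training set and EgoAVU-Bench evaluation benchmark reveal a strong bias of current MLLMs toward visual signals; after fine-tuning, performance on EgoAVU-Bench improves by up to 113%, with relative gains of up to 28% on other benchmarks such as EgoTempo and EgoIllusion.
Link: https://arxiv.org/abs/2602.06139
Authors: Ashish Seth, Xinhao Mei, Changsheng Zhao, Varun Nagaraja, Ernie Chang, Gregory P. Meyer, Gael Le Lan, Yunyang Xiong, Vikas Chandra, Yangyang Shi, Dinesh Manocha, Zhipeng Cai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: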
Abstract:Understanding egocentric videos plays a vital role for embodied intelligence. Recent multi-modal large language models (MLLMs) can accept both visual and audio inputs. However, due to the challenge of obtaining text labels with coherent joint-modality information, whether MLLMs can jointly understand both modalities in egocentric videos remains under-explored. To address this problem, we introduce EgoAVU, a scalable data engine to automatically generate egocentric audio-visual narrations, questions, and answers. EgoAVU enriches human narrations with multimodal context and generates audio-visual narrations through cross-modal correlation modeling. Token-based video filtering and modular, graph-based curation ensure both data diversity and quality. Leveraging EgoAVU, we construct EgoAVU-Instruct, a large-scale training dataset of 3M samples, and EgoAVU-Bench, a manually verified evaluation split covering diverse tasks. EgoAVU-Bench clearly reveals the limitations of existing MLLMs: they bias heavily toward visual signals, often neglecting audio cues or failing to correspond audio with the visual source. Finetuning MLLMs on EgoAVU-Instruct effectively addresses this issue, enabling up to 113% performance improvement on EgoAVU-Bench. Such benefits also transfer to other benchmarks such as EgoTempo and EgoIllusion, achieving up to 28% relative performance gain. Code will be released to the community.
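[CV-83] Tempora: Characterising the Time-Contingent Utility of Online Test-Time Adaptation
Summary: This paper addresses the distorted evaluation of test-time adaptation (TTA) methods that results from ignoring time constraints in real deployments. Conventional evaluation assumes unbounded processing time and overlooks the accuracy-latency trade-off, whereas latency-sensitive or user-facing applications impose strict response-time requirements, so a model can be accurate yet unusable because its predictions arrive too late. The key to the solution is the Tempora framework, which has three parts: (1) temporal scenarios that model deployment time constraints, (2) evaluation protocols that operationalise measurement, and (3) three time-contingent utility metrics: discrete utility for asynchronous streams with hard deadlines, continuous utility for interactive settings where value decays with latency, and amortised utility for budget-constrained deployments. Across 240 temporal evaluations of seven TTA methods on ImageNet-C, conventional rankings fail to predict performance under temporal pressure, showing that method choice must depend on the task and latency constraints; this gives practitioners a basis for method selection and researchers a target for deployment-oriented adaptation.
Link: https://arxiv.org/abs/2602.06136
Authors: Sudarshan Sreeram, Young D. Kwon, Cecilia Mascolo
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. Under review. Code available upon acceptance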
Abstract:Test-time adaptation (TTA) offers a compelling remedy for machine learning (ML) models that degrade under domain shifts, improving generalisation on-the-fly with only unlabelled samples. This flexibility suits real deployments, yet conventional evaluations unrealistically assume unbounded processing time, overlooking the accuracy-latency trade-off. As ML increasingly underpins latency-sensitive and user-facing use-cases, temporal pressure constrains the viability of adaptable inference; predictions arriving too late to act on are futile. We introduce Tempora, a framework for evaluating TTA under this pressure. It consists of temporal scenarios that model deployment constraints, evaluation protocols that operationalise measurement, and time-contingent utility metrics that quantify the accuracy-latency trade-off. We instantiate the framework with three such metrics: (1) discrete utility for asynchronous streams with hard deadlines, (2) continuous utility for interactive settings where value decays with latency, and (3) amortised utility for budget-constrained deployments. Applying Tempora to seven TTA methods on ImageNet-C across 240 temporal evaluations reveals rank instability: conventional rankings do not predict rankings under temporal pressure; ETA, a state-of-the-art method in the conventional setting, falls short in 41.2% of evaluations. The highest-utility method varies with corruption type and temporal pressure, with no clear winner. By enabling systematic evaluation across diverse temporal constraints for the first time, Tempora reveals when and why rankings invert, offering practitioners a lens for method selection and researchers a target for deployable adaptation.
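The three utility metrics are only named in the abstract, so the sketch below shows one simple way each could be defined. The exact formulas are assumptions: discrete utility counts a correct prediction only if it meets the deadline, continuous utility discounts it exponentially with latency, and amortised utility counts correct predictions completed before a cumulative time budget runs out.

```python
import numpy as np

def discrete_utility(correct, latency, deadline):
    """Fraction of samples that are correct AND delivered by the deadline."""
    correct = np.asarray(correct, dtype=float)
    latency = np.asarray(latency, dtype=float)
    return float(np.mean(correct * (latency <= deadline)))

def continuous_utility(correct, latency, tau):
    """Correctness discounted by exp(-latency / tau) (assumed decay form)."""
    correct = np.asarray(correct, dtype=float)
    latency = np.asarray(latency, dtype=float)
    return float(np.mean(correct * np.exp(-latency / tau)))

def amortised_utility(correct, latency, budget):
    """Fraction of samples answered correctly before the total budget is spent."""
    correct = np.asarray(correct, dtype=float)
    done = np.cumsum(np.asarray(latency, dtype=float)) <= budget
    return float(np.mean(correct * done))

# Toy stream: per-sample correctness (1/0) and latency in seconds.
correct = [1, 1, 0, 1]
latency = [0.1, 0.3, 0.1, 0.2]

du = discrete_utility(correct, latency, deadline=0.2)
cu = continuous_utility(correct, latency, tau=0.5)
au = amortised_utility(correct, latency, budget=0.45)
```

Even in this toy stream the plain accuracy is 0.75 while all three utilities are lower, which is the kind of gap that can reorder methods once latency is taken into account.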
[CV-84] From Blurry to Believable: Enhancing Low-quality Talking Heads with 3D Generative Priors
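Summary: This paper addresses poor 3D face reconstruction from low-quality image or video sources, in particular the challenge of producing high-fidelity, animatable 3D talking heads. The key to the solution is the SuperHead framework, whose core innovation is to exploit the priors of pre-trained 3D generative models through a novel dynamics-aware 3D inversion scheme that optimizes the generative model's latent representation to obtain a super-resolved 3D Gaussian Splatting (3DGS) head model, which is then rigged to a parametric head model (e.g., FLAME) for high-quality animation. The inversion is jointly supervised by a sparse set of upscaled 2D face renderings and corresponding depth maps captured across multiple viewpoints and expressions, ensuring geometric consistency and identity fidelity under dynamic facial motion.
Link: https://arxiv.org/abs/2602.06122
Authors: Ding-Jiun Huang, Yuanhao Wang, Shao-Ji Yuan, Albert Mosella-Montoro, Francisco Vicente Carrasco, Cheng Zhang, Fernando De la Torre
Affiliations: Carnegie Mellon University; Texas A&M University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to 3DV 2026. Project Page: this https URL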
Abstract:Creating high-fidelity, animatable 3D talking heads is crucial for immersive applications, yet often hindered by the prevalence of low-quality image or video sources, which yield poor 3D reconstructions. In this paper, we introduce SuperHead, a novel framework for enhancing low-resolution, animatable 3D head avatars. The core challenge lies in synthesizing high-quality geometry and textures, while ensuring both 3D and temporal consistency during animation and preserving subject identity. Despite recent progress in image, video and 3D-based super-resolution (SR), existing SR techniques are ill-equipped to handle dynamic 3D inputs. To address this, SuperHead leverages the rich priors from pre-trained 3D generative models via a novel dynamics-aware 3D inversion scheme. This process optimizes the latent representation of the generative model to produce a super-resolved 3D Gaussian Splatting (3DGS) head model, which is subsequently rigged to an underlying parametric head model (e.g., FLAME) for animation. The inversion is jointly supervised using a sparse collection of upscaled 2D face renderings and corresponding depth maps, captured from diverse facial expressions and camera viewpoints, to ensure realism under dynamic facial motions. Experiments demonstrate that SuperHead generates avatars with fine-grained facial details under dynamic motions, significantly outperforming baseline methods in visual quality.
[CV-85] SVRepair: Structured Visual Reasoning for Automated Program Repair
Summary: This paper addresses the fact that most existing Automated Program Repair (APR) methods are unimodal and cannot exploit visual diagnostic information; when a bug report carries visual cues such as layout breakage or missing widgets, conventional methods struggle to link these dense visual inputs to the underlying code defect. The key to the solution is SVRepair, a multimodal APR framework built on a Structured Visual Representation (SVR): a fine-tuned vision-language model uniformly transforms heterogeneous visual artifacts (such as screenshots and control-flow graphs) into a semantic scene graph that captures GUI elements and their structural relations (e.g., hierarchy), providing normalized, code-relevant context for downstream repair. On top of this, SVRepair adds an iterative visual-artifact segmentation strategy that progressively narrows the input to bug-related regions to suppress irrelevant information and reduce hallucinations, leading to accurate fault localization and patch generation.
Link: https://arxiv.org/abs/2602.06090
Authors: Xiaoxuan Tang, Jincheng Wang, Liwei Luo, Jingxuan Xu, Sheng Zhou, Dajun Chen, Wei Jiang, Yong Li
Affiliations: Ant Group; Zhejiang University
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 3 figures
Abstract:Large language models (LLMs) have recently shown strong potential for Automated Program Repair (APR), yet most existing approaches remain unimodal and fail to leverage the rich diagnostic signals contained in visual artifacts such as screenshots and control-flow graphs. In practice, many bug reports convey critical information visually (e.g., layout breakage or missing widgets), but directly using such dense visual inputs often causes context loss and noise, making it difficult for MLLMs to ground visual observations into precise fault localization and executable patches. To bridge this semantic gap, we propose SVRepair, a multimodal APR framework with structured visual representation. SVRepair first fine-tunes a vision-language model, Structured Visual Representation (SVR), to uniformly transform heterogeneous visual artifacts into a semantic scene graph that captures GUI elements and their structural relations (e.g., hierarchy), providing normalized, code-relevant context for downstream repair. Building on the graph, SVRepair drives a coding agent to localize faults and synthesize patches, and further introduces an iterative visual-artifact segmentation strategy that progressively narrows the input to bug-centered regions to suppress irrelevant context and reduce hallucinations. Extensive experiments across multiple benchmarks demonstrate state-of-the-art performance: SVRepair achieves 36.47% accuracy on SWE-Bench M, 38.02% on MMCode, and 95.12% on CodeVision, validating the effectiveness of SVRepair for multimodal program repair.
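[CV-86] OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention
Summary: This paper addresses the weak performance of existing multimodal video understanding models on joint audio-visual understanding, especially their limited cross-modal collaborative reasoning. The core solution is the OmniVideo-R1 framework, which strengthens mixed-modality reasoning with two key strategies: query-intensive grounding based on self-supervised learning paradigms, which improves the model's ability to anchor multimodal cues precisely, and modality-attentive fusion based on contrastive learning paradigms, which lets the model weight information from different modalities dynamically. Experiments show that the method consistently outperforms strong baselines on multiple benchmarks.
Link: https://arxiv.org/abs/2602.05847
Authors: Zhangquan Chen, Jiale Tao, Ruihuang Li, Yihao Hu, Ruitao Chen, Zhantao Yang, Xinlei Yu, Haodong Jing, Manyuan Zhang, Shuai Shao, Biao Wang, Qinglin Lu, Ruqi Huang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 12 figures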
Abstract:While humans perceive the world through diverse modalities that operate synergistically to support a holistic understanding of their surroundings, existing omnivideo models still face substantial challenges on audio-visual understanding tasks. In this paper, we propose OmniVideo-R1, a novel reinforced framework that improves mixed-modality reasoning. OmniVideo-R1 empowers models to “think with omnimodal cues” by two key strategies: (1) query-intensive grounding based on self-supervised learning paradigms; and (2) modality-attentive fusion built upon contrastive learning paradigms. Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization capabilities.
[CV-87] CORP: Closed-Form One-shot Representation-Preserving Structured Pruning for Vision Transformers
Summary: This paper addresses the high compute and memory cost of Vision Transformer (ViT) inference, in particular how to perform efficient structured pruning without retraining or complex multi-stage optimization. The core challenge is to remove entire MLP hidden dimensions and attention substructures in one shot, using only a small unlabeled calibration set, while preserving accuracy. The key to the solution is CORP, a closed-form one-shot structured pruning framework that casts pruning as a representation recovery problem: removed activations and attention logits are modeled as affine functions of the retained components, and closed-form ridge regression solutions fold the compensation into the model weights, minimizing the expected representation error under the calibration distribution. The method preserves accuracy under aggressive sparsity without labels, gradients, or fine-tuning, and completes pruning in under 20 minutes on a single GPU.
Link: https://arxiv.org/abs/2602.05243
Authors: Boxiang Zhang, Baijian Yang
Affiliations: Purdue University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Vision Transformers achieve strong accuracy but incur high compute and memory cost. Structured pruning can reduce inference cost, but most methods rely on retraining or multi-stage optimization. These requirements limit post-training deployment. We propose CORP, a closed-form one-shot structured pruning framework for Vision Transformers. CORP removes entire MLP hidden dimensions and attention substructures without labels, gradients, or fine-tuning. It operates under strict post-training constraints using only a small unlabeled calibration set. CORP formulates structured pruning as a representation recovery problem. It models removed activations and attention logits as affine functions of retained components and derives closed-form ridge regression solutions that fold compensation into model weights. This minimizes expected representation error under the calibration distribution. Experiments on ImageNet with DeiT models show strong redundancy in MLP and attention representations. Without compensation, one-shot structured pruning causes severe accuracy degradation. With CORP, models preserve accuracy under aggressive sparsity. On DeiT-Huge, CORP retains 82.8% Top-1 accuracy after pruning 50% of both MLP and attention structures. CORP completes pruning in under 20 minutes on a single GPU and delivers substantial real-world efficiency gains.
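For the MLP case, the compensation idea can be sketched in a few lines of NumPy. The sketch assumes hidden activations `H` collected on a calibration set and a following linear layer `y = H @ W2 + b2`; removed hidden units are regressed on the kept ones with ridge regression, and the resulting affine map is folded into the weights and bias of the next layer. The function name and variables are illustrative, and attention pruning is not covered.

```python
import numpy as np

def ridge_compensation(H, W2, b2, keep, lam=1e-3):
    """Fold an affine ridge-regression estimate of removed units into (W2, b2).

    H:    (n, d) hidden activations on a calibration set.
    W2:   (d, m) weights of the following linear layer; b2: (m,) bias.
    keep: indices of hidden units that are retained.
    """
    keep = np.asarray(keep)
    removed = np.setdiff1d(np.arange(H.shape[1]), keep)
    Hk, Hr = H[:, keep], H[:, removed]
    mu_k, mu_r = Hk.mean(axis=0), Hr.mean(axis=0)
    Hk_c, Hr_c = Hk - mu_k, Hr - mu_r
    # Ridge solution for Hr ~ Hk @ A + c (centering handles the intercept).
    A = np.linalg.solve(Hk_c.T @ Hk_c + lam * np.eye(len(keep)), Hk_c.T @ Hr_c)
    c = mu_r - mu_k @ A
    W2_new = W2[keep] + A @ W2[removed]
    b2_new = b2 + c @ W2[removed]
    return W2_new, b2_new

# Toy check: removed units are exact affine functions of the kept ones.
rng = np.random.default_rng(0)
Hk = rng.normal(size=(200, 4))
H = np.concatenate([Hk, Hk @ rng.normal(size=(4, 2)) + 0.5], axis=1)
W2, b2 = rng.normal(size=(6, 3)), rng.normal(size=3)
keep = [0, 1, 2, 3]

W2_new, b2_new = ridge_compensation(H, W2, b2, keep)
y_full = H @ W2 + b2
y_naive = H[:, keep] @ W2[keep] + b2      # prune without compensation
y_comp = H[:, keep] @ W2_new + b2_new     # prune with compensation
```

In this constructed case the compensated output matches the unpruned one almost exactly, while the naive pruned output does not; on real activations the recovery is only approximate and depends on how redundant the removed units are.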
[CV-88] EUGens: Efficient Unified and General Dense Layers NEURIPS2025
Summary: This paper addresses the computation and parameter-count bottleneck of fully-connected feedforward layers (FFLs) in neural networks, especially for real-time applications and resource-constrained environments. The key to the solution is a new class of dense layers, Efficient, Unified and General dense layers (EUGens), which use random features to approximate standard FFLs and generalize them by adding a direct dependence on the input norms. EUGens reduce inference complexity from quadratic to linear time while preserving expressive power and adaptability, yield the first unbiased algorithms for approximating FFLs with arbitrary polynomial activation functions, and reduce parameter count and computational overhead.
Link: https://arxiv.org/abs/2410.09771
Authors: Sang Min Kim, Byeongchan Kim, Arijit Sehanobish, Somnath Basu Roy Chowdhury, Rahul Kidambi, Dongseok Shim, Avinava Dubey, Snigdha Chaturvedi, Min-hwan Oh, Krzysztof Choromanski
Affiliations: Seoul National University; Independent; Google Research; UNC Chapel Hill; Google DeepMind; Columbia University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: NeurIPS 2025
Abstract:Efficient neural networks are essential for scaling machine learning models to real-time applications and resource-constrained environments. Fully-connected feedforward layers (FFLs) introduce computation and parameter count bottlenecks within neural network architectures. To address this challenge, in this work, we propose a new class of dense layers that generalize standard fully-connected feedforward layers, Efficient, Unified and General dense layers (EUGens). EUGens leverage random features to approximate standard FFLs and go beyond them by incorporating a direct dependence on the input norms in their computations. The proposed layers unify existing efficient FFL extensions and improve efficiency by reducing inference complexity from quadratic to linear time. They also lead to the first unbiased algorithms approximating FFLs with arbitrary polynomial activation functions. Furthermore, EUGens reduce the parameter count and computational overhead while preserving the expressive power and adaptability of FFLs. We also present a layer-wise knowledge transfer technique that bypasses backpropagation, enabling efficient adaptation of EUGens to pre-trained models. Empirically, we observe that integrating EUGens into Transformers and MLPs yields substantial improvements in inference speed (up to 27%) and memory efficiency (up to 30%) across a range of tasks, including image classification, language model pre-training, and 3D scene reconstruction. Overall, our results highlight the potential of EUGens for the scalable deployment of large-scale neural networks in real-world scenarios.
[CV-89] Orientation-Robust Latent Motion Trajectory Learning for Annotation-free Cardiac Phase Detection in Fetal Echocardiography
Summary: This paper addresses automatic identification of end-diastolic (ED) and end-systolic (ES) frames in fetal echocardiography, a critical step in automated diagnosis of congenital heart disease (CHD) that currently depends on labor-intensive manual annotation because fetal electrocardiography (ECG) is unavailable. The key to the solution is ORBIT (Orientation-Robust Beat Inference from Trajectories), a self-supervised framework that uses image registration as the self-supervision task to learn a latent motion trajectory of cardiac deformation; the trajectory's turning points capture transitions between cardiac relaxation and contraction, enabling accurate, orientation-robust localization of ED and ES frames without fixed fetal heart orientation assumptions or manual annotation.
Link: https://arxiv.org/abs/2602.06761
Authors: Yingyu Yang, Qianye Yang, Can Peng, Elena D’Alberti, Olga Patey, Aris T. Papageorghiou, J. Alison Noble
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint, submitted to a journal
Abstract:Fetal echocardiography is essential for detecting congenital heart disease (CHD), facilitating pregnancy management, optimized delivery planning, and timely postnatal interventions. Among standard imaging planes, the four-chamber (4CH) view provides comprehensive information for CHD diagnosis, where clinicians carefully inspect the end-diastolic (ED) and end-systolic (ES) phases to evaluate cardiac structure and motion. Automated detection of these cardiac phases is thus a critical component toward fully automated CHD analysis. Yet, in the absence of fetal electrocardiography (ECG), manual identification of ED and ES frames remains a labor-intensive bottleneck. We present ORBIT (Orientation-Robust Beat Inference from Trajectories), a self-supervised framework that identifies cardiac phases without manual annotations under various fetal heart orientations. ORBIT employs registration as a self-supervision task and learns a latent motion trajectory of cardiac deformation, whose turning points capture transitions between cardiac relaxation and contraction, enabling accurate and orientation-robust localization of ED and ES frames across diverse fetal positions. Trained exclusively on normal fetal echocardiography videos, ORBIT achieves consistent performance on both normal (MAE = 1.9 frames for ED and 1.6 for ES) and CHD cases (MAE = 2.4 frames for ED and 2.1 for ES), outperforming existing annotation-free approaches constrained by fixed orientation assumptions. These results highlight the potential of ORBIT to facilitate robust cardiac phase detection directly from 4CH fetal echocardiography.
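Locating turning points of a motion trajectory can be illustrated on a one-dimensional signal. The sketch below assumes the learned latent trajectory has already been reduced to a scalar per frame; which kind of extremum corresponds to ED and which to ES depends on the sign convention of that scalar, so the mapping is left open here. The function name and the cosine test signal are illustrative only.

```python
import numpy as np

def turning_points(signal):
    """Return frame indices of local maxima and local minima of a 1D signal."""
    s = np.asarray(signal, dtype=float)
    direction = np.sign(np.diff(s))      # +1 rising, -1 falling
    change = np.diff(direction)
    peaks = np.where(change < 0)[0] + 1   # rising then falling
    troughs = np.where(change > 0)[0] + 1 # falling then rising
    return peaks, troughs

# Toy trajectory: two cardiac cycles of 20 frames each.
frames = np.arange(40)
trajectory = np.cos(2 * np.pi * frames / 20)
peaks, troughs = turning_points(trajectory)
```

Real trajectories are noisy, so in practice some smoothing or a minimum spacing between detected extrema would likely be needed before reading off frame indices.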
[CV-90] AS-Mamba: Asymmetric Self-Guided Mamba Decoupled Iterative Network for Metal Artifact Reduction
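【Quick Read】: This paper addresses the severe degradation of Computed Tomography (CT) image quality caused by metal artifacts, which undermines the accuracy of clinical diagnosis. Existing deep learning methods such as convolutional neural networks (CNNs) and Transformers struggle to explicitly capture the directional geometric features of metal-induced streak artifacts, which leads to poor structural restoration. The key to the solution is the Asymmetric Self-Guided Mamba (AS-Mamba) architecture, which exploits the sequential modeling capability of State Space Models (SSMs) to explicitly model and suppress linearly propagating directional streak artifacts. A frequency-domain correction mechanism rectifies the global amplitude spectrum to mitigate the intensity inhomogeneity caused by beam hardening, and a self-guided contrastive regularization strategy bridges the distribution gap across clinical scenarios, improving generalization.
Link: https://arxiv.org/abs/2602.06350
Authors: Bowen Ning, Zekun Zhou, Xinyi Zhong, Zhongzhen Wang, HongXin Wu, HaiTao Wang, Liu Shi, Qiegen Liu
Affiliations: Nanchang University; LargeV Instrument Corp., Ltd.
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 10 figures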
Abstract:Metal artifacts significantly degrade Computed Tomography (CT) image quality, impeding accurate clinical diagnosis. However, existing deep learning approaches, such as CNN and Transformer, often fail to explicitly capture the directional geometric features of artifacts, leading to compromised structural restoration. To address these limitations, we propose the Asymmetric Self-Guided Mamba (AS-Mamba) for metal artifact reduction. Specifically, the linear propagation of metal-induced streak artifacts aligns well with the sequential modeling capability of State Space Models (SSMs). Consequently, the Mamba architecture is leveraged to explicitly capture and suppress these directional artifacts. Simultaneously, a frequency domain correction mechanism is incorporated to rectify the global amplitude spectrum, thereby mitigating intensity inhomogeneity caused by beam hardening. Furthermore, to bridge the distribution gap across diverse clinical scenarios, we introduce a self-guided contrastive regularization strategy. Extensive experiments on public and clinical dental CBCT datasets demonstrate that AS-Mamba achieves superior performance in suppressing directional streaks and preserving structural details, validating the effectiveness of integrating physical geometric priors into deep network design.
[CV-91] Zero-shot Multi-Contrast Brain MRI Registration by Intensity Randomizing T1-weighted MRI (LUMIR25) MICCAI2025
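【Quick Read】: This paper addresses zero-shot medical image registration under domain shifts, covering high-field MRI, pathological brains, and various MRI contrasts. The training data contain only in-domain T1-weighted brain MRI, while the goal is to keep high accuracy and good deformation regularity on unseen contrasts (for example, T1-T2 registration). The key to the solution is three simple but effective strategies: (i) a multimodal loss based on the modality-independent neighborhood descriptor (MIND), which makes features robust to contrast differences; (ii) intensity randomization for appearance augmentation, which improves generalization to changes in image appearance; and (iii) lightweight instance-specific optimization (ISO) of the feature encoders at inference time, which adapts them to the new contrast. According to the abstract, the submission placed 1st overall on the LUMIR25 test set.
Link: https://arxiv.org/abs/2602.06292
Authors: Hengjie Liu, Yimeng Dou, Di Xu, Xinyi Fu, Dan Ruan, Ke Sheng
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to and reviewed by Learn2Reg MICCAI 2025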
Abstract:In this paper, we summarize the methods and results of our submission to the LUMIR25 challenge in Learn2Reg 2025, which achieved 1st place overall on the test set. Extended from LUMIR24, this year’s task focuses on zero-shot registration under domain shifts (high-field MRI, pathological brains, and various MRI contrasts), while the training data comprise only in-domain T1-weighted brain MRI. We start with a meticulous analysis of LUMIR24 winners to identify the main contributors to good monomodal registration performance. To achieve good generalization with diverse contrasts from a model trained with T1-weighted MRI only, we employ three simple but effective strategies: (i) a multimodal loss based on the modality-independent neighborhood descriptor (MIND), (ii) intensity randomization for appearance augmentation, and (iii) lightweight instance-specific optimization (ISO) on feature encoders at inference time. On the validation set, our approach achieves reasonable T1-T2 registration accuracy while maintaining good deformation regularity.
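Strategy (ii), intensity randomization, can be illustrated with a small sketch. This is not the submission's implementation: the helper name `randomize_intensity`, the number of knots, and the uniform sampling are all assumptions made for illustration. The idea shown is to pass T1-weighted intensities through a random, possibly non-monotonic, piecewise-linear curve so that the network sees many synthetic contrasts during training.

```python
import random

def randomize_intensity(values, num_knots=5, rng=None):
    """Remap intensities in [0, 1] through a random piecewise-linear curve.

    Hypothetical helper: the knot count and the uniform sampling are
    assumptions, not the settings used in the challenge submission.
    """
    rng = rng or random.Random()
    # One random output level per evenly spaced input knot. The curve need
    # not be monotonic, so relative tissue brightness can flip.
    knots = [rng.random() for _ in range(num_knots)]
    out = []
    for v in values:
        v = min(max(v, 0.0), 1.0)          # clamp to the expected range
        pos = v * (num_knots - 1)          # fractional knot position
        lo = min(int(pos), num_knots - 2)  # index of the left knot
        frac = pos - lo
        out.append(knots[lo] * (1.0 - frac) + knots[lo + 1] * frac)
    return out

# Toy "image": five voxel intensities already normalized to [0, 1].
voxels = [0.0, 0.25, 0.5, 0.75, 1.0]
augmented = randomize_intensity(voxels, rng=random.Random(0))
print(augmented)
```

Because the same curve is applied to every voxel, anatomy is preserved while appearance changes, which is what makes a mapping of this kind usable as an augmentation for registration.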
[CV-92] ALIEN: Analytic Latent Watermarking for Controllable Generation
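【Quick Read】: This paper addresses the difficulty of balancing robustness and fidelity when embedding watermarks in generative AI outputs, and in particular the high training overhead and local optima that result from existing methods' reliance on computationally intensive heuristic optimization for iterative signal refinement. The key to the solution is ALIEN (Analytical Watermarking Framework for Controllable Generation), which provides the first analytical derivation of the time-dependent modulation coefficient. This coefficient guides how watermark residuals evolve during diffusion, giving controllable and efficient watermark embedding with improved robustness and generation quality.
Link: https://arxiv.org/abs/2602.06101
Authors: Liangqi Lei, Keke Gai, Jing Yu, Qi Wu
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: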
Abstract:Watermarking is a technical alternative for safeguarding intellectual property and reducing misuse. Existing methods focus on optimizing watermarked latent variables to balance watermark robustness and fidelity, as latent diffusion models (LDMs) are considered a powerful tool for generative tasks. However, reliance on computationally intensive heuristic optimization for iterative signal refinement results in high training overhead and local optima. To address these issues, we propose an Analytical Watermarking Framework for Controllable Generation (ALIEN). We develop the first analytical derivation of the time-dependent modulation coefficient that guides the diffusion of watermark residuals to achieve controllable watermark embedding. Results show that ALIEN-Q outperforms the state-of-the-art by 33.1% across 5 quality metrics, and ALIEN-R demonstrates 14.0% improved robustness against generative variant and stability threats compared to the state-of-the-art across 15 distinct conditions. Code is available at this https URL.
[CV-93] COSMOS: Coherent Supergaussian Modeling with Spatial Priors for Sparse-View 3D Splatting
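【Quick Read】: This paper addresses the overfitting and structural degradation that 3D Gaussian Splatting (3DGS) suffers when trained on sparse views, with the aim of improving generalization to novel views. The core solution is Coherent supergaussian Modeling with Spatial Priors (COSMOS). Its key idea is a supergaussian grouping of Gaussians based on local geometric cues and appearance features. Global self-attention across groups and sparse local attention among individual Gaussians combine global and local spatial information, while an intra-group positional regularization maintains structural coherence and suppresses floaters, which stabilizes training and improves reconstruction quality under sparse-view conditions.
Link: https://arxiv.org/abs/2602.06044
Authors: Chaeyoung Jeong, Kwangsu Kim
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: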
Abstract:3D Gaussian Splatting (3DGS) has recently emerged as a promising approach for 3D reconstruction, providing explicit, point-based representations and enabling high-quality real-time rendering. However, when trained with sparse input views, 3DGS suffers from overfitting and structural degradation, leading to poor generalization on novel views. This limitation arises from its optimization relying solely on photometric loss without incorporating any 3D structure priors. To address this issue, we propose Coherent supergaussian Modeling with Spatial Priors (COSMOS). Inspired by the concept of superpoints from 3D segmentation, COSMOS introduces 3D structure priors by newly defining supergaussian groupings of Gaussians based on local geometric cues and appearance features. To this end, COSMOS applies inter-group global self-attention across supergaussian groups and sparse local attention among individual Gaussians, enabling the integration of global and local spatial information. These structure-aware features are then used for predicting Gaussian attributes, facilitating more consistent 3D reconstructions. Furthermore, by leveraging supergaussian-based grouping, COSMOS enforces an intra-group positional regularization to maintain structural coherence and suppress floaters, thereby enhancing training stability under sparse-view conditions. Our experiments on Blender and DTU show that COSMOS surpasses state-of-the-art methods in sparse-view settings without any external depth supervision.
Artificial Intelligence
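[AI-0] Agentic Uncertainty Reveals Agentic Overconfidence
【Quick Read】: This paper studies how accurately AI agents predict their own success on a task, that is, whether generative AI agents can reasonably estimate their probability of succeeding. The authors elicit success probability estimates before, during, and after task execution and find agentic overconfidence throughout (for example, some agents that succeed only 22% of the time predict 77% success). The key findings: although post-execution review is the standard approach, pre-execution assessment with strictly less information tends to discriminate better between successes and failures, though the differences are not always significant; and adversarial prompting that reframes assessment as bug-finding achieves the best calibration.
Link: https://arxiv.org/abs/2602.06948
Authors: Jean Kaddour, Srijan Patel, Gbètondji Dovonon, Leo Richter, Pasquale Minervini, Matt J. Kusner
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: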
Abstract:Can AI agents predict whether they will succeed at a task? We study agentic uncertainty by eliciting success probability estimates before, during, and after task execution. All results exhibit agentic overconfidence: some agents that succeed only 22% of the time predict 77% success. Counterintuitively, pre-execution assessment with strictly less information tends to yield better discrimination than standard post-execution review, though differences are not always significant. Adversarial prompting reframing assessment as bug-finding achieves the best calibration.
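The abstract separates calibration (do stated probabilities match observed success rates?) from discrimination (are successes scored higher than failures?). The sketch below shows one common way to measure each, assuming predictions are probabilities and outcomes are 0/1. The function names and the toy numbers are illustrative assumptions, not the paper's metrics or data.

```python
def overconfidence_gap(predicted, outcomes):
    """Mean predicted success probability minus the observed success rate."""
    return sum(predicted) / len(predicted) - sum(outcomes) / len(outcomes)

def brier_score(predicted, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(predicted, outcomes)) / len(predicted)

def auroc(predicted, outcomes):
    """Chance that a random success is scored above a random failure (ties count half)."""
    pos = [p for p, o in zip(predicted, outcomes) if o == 1]
    neg = [p for p, o in zip(predicted, outcomes) if o == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Made-up toy data: four tasks, one success, uniformly high stated confidence.
predicted = [0.9, 0.6, 0.8, 0.7]
outcomes = [1, 0, 0, 0]

print(overconfidence_gap(predicted, outcomes))  # positive means overconfident
print(brier_score(predicted, outcomes))
print(auroc(predicted, outcomes))
```

In this toy case the agent ranks its one success highest (perfect discrimination) while still overstating its success rate by a wide margin, which is why the two properties need separate metrics.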
[AI-1] Cochain Perspectives on Temporal-Difference Signals for Learning Beyond Markov Dynamics
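【Quick Read】: This paper addresses the fact that the Bellman equation holds only approximately in reinforcement learning (RL) under non-Markovian dynamics: in real-world settings with long-range dependencies, partial observability, and memory effects, conventional Bellman-based temporal-difference (TD) methods struggle to model the value function accurately. The key to the solution is a novel topological viewpoint: TD errors are treated as 1-cochains on the space of state transitions, and Markov dynamics are interpreted as topological integrability. A Bellman-de Rham projection yields a Hodge-type decomposition of TD errors into an integrable component and a topological residual. On this basis, the authors propose HodgeFlow Policy Search (HFPS), which fits a potential network to minimize the non-integrable projection residual and comes with stability/sensitivity guarantees in non-Markovian environments.
Link: https://arxiv.org/abs/2602.06939
Authors: Zuyuan Zhang, Sizhe Tang, Tian Lan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: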
Abstract:Non-Markovian dynamics are commonly found in real-world environments due to long-range dependencies, partial observability, and memory effects. The Bellman equation, the central pillar of reinforcement learning (RL), becomes only approximately valid under non-Markovian dynamics. Existing work often focuses on practical algorithm designs and offers limited theoretical treatment of key questions, such as what dynamics are indeed capturable by the Bellman framework and how to inspire new algorithm classes with optimal approximations. In this paper, we present a novel topological viewpoint on temporal-difference (TD) based RL. We show that TD errors can be viewed as 1-cochains in the topological space of state transitions, while Markov dynamics are then interpreted as topological integrability. This novel view enables us to obtain a Hodge-type decomposition of TD errors into an integrable component and a topological residual, through a Bellman-de Rham projection. We further propose HodgeFlow Policy Search (HFPS) by fitting a potential network to minimize the non-integrable projection residual in RL, achieving stability/sensitivity guarantees. In numerical evaluations, HFPS is shown to significantly improve RL performance under non-Markovian dynamics.
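To make "integrable component plus residual" concrete, here is a generic least-squares split of an edge signal on a small transition graph. It is a textbook graph Hodge-style projection, offered as an assumption about the flavour of the decomposition; it is not the paper's Bellman-de Rham projection or HFPS. The function name `hodge_split`, the Gauss-Seidel solver, and the toy values are all hypothetical.

```python
def hodge_split(num_nodes, edges, values, sweeps=500):
    """Split edge values into a gradient part phi[v] - phi[u] and a residual.

    Minimizes sum((phi[v] - phi[u] - x)^2) over potentials phi by Gauss-Seidel
    sweeps. Assumes a connected graph with no self-loops; node 0 is pinned to 0.
    """
    phi = [0.0] * num_nodes
    for _ in range(sweeps):
        for node in range(1, num_nodes):
            total, degree = 0.0, 0
            for (u, v), x in zip(edges, values):
                if v == node:      # edge arrives here: phi[node] ~ phi[u] + x
                    total += phi[u] + x
                    degree += 1
                elif u == node:    # edge leaves here: phi[node] ~ phi[v] - x
                    total += phi[v] - x
                    degree += 1
            if degree:
                phi[node] = total / degree
    gradient = [phi[v] - phi[u] for u, v in edges]
    residual = [x - g for x, g in zip(values, gradient)]
    return phi, gradient, residual

# Toy 3-state cycle. The edge signal is the gradient of phi = [0, 1, 3]
# plus a unit circulation that no potential can explain.
edges = [(0, 1), (1, 2), (2, 0)]
edge_signal = [2.0, 3.0, -2.0]
phi, gradient, residual = hodge_split(3, edges, edge_signal)
print(phi, gradient, residual)
```

The residual that survives the projection is the part of the signal that sums to a nonzero value around the cycle, which is the graph analogue of a non-integrable component.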
[AI-2] From Kepler to Newton: Inductive Biases Guide Learned World Models in Transformers
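【Quick Read】: This paper asks whether general-purpose AI architectures can go beyond prediction and discover the physical laws governing the universe. Existing “AI Physicist” approaches recover physical laws but rely on strong domain-specific priors, whereas generic Transformers reach high predictive accuracy without capturing the underlying physics. The key to the solution is three minimal inductive biases. Spatial smoothness (formulating prediction as continuous regression) and stability (training with noisy contexts to mitigate error accumulation) let the model learn a coherent Keplerian world model. A third bias, temporal locality, restricts the attention window to the immediate past and forces the model to abandon curve-fitting and discover Newtonian force representations. The results indicate that simple architectural choices determine whether an AI becomes a curve-fitter or a physicist.
Link: https://arxiv.org/abs/2602.06923
Authors: Ziming Liu, Sophia Sanborn, Surya Ganguli, Andreas Tolias
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Classical Physics (physics.class-ph)
Comments: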
Abstract:Can general-purpose AI architectures go beyond prediction to discover the physical laws governing the universe? True intelligence relies on “world models” – causal abstractions that allow an agent to not only predict future states but understand the underlying governing dynamics. While previous “AI Physicist” approaches have successfully recovered such laws, they typically rely on strong, domain-specific priors that effectively “bake in” the physics. Conversely, Vafa et al. recently showed that generic Transformers fail to acquire these world models, achieving high predictive accuracy without capturing the underlying physical laws. We bridge this gap by systematically introducing three minimal inductive biases. We show that ensuring spatial smoothness (by formulating prediction as continuous regression) and stability (by training with noisy contexts to mitigate error accumulation) enables generic Transformers to surpass prior failures and learn a coherent Keplerian world model, successfully fitting ellipses to planetary trajectories. However, true physical insight requires a third bias: temporal locality. By restricting the attention window to the immediate past – imposing the simple assumption that future states depend only on the local state rather than a complex history – we force the model to abandon curve-fitting and discover Newtonian force representations. Our results demonstrate that simple architectural choices determine whether an AI becomes a curve-fitter or a physicist, marking a critical step toward automated scientific discovery.
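[AI-3] TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering
【Quick Read】: This paper addresses the lack of a standard way to evaluate how open-weight large language models (LLMs) withstand unsafe modifications, whether accidental or intentional, which makes it hard to compare safety, utility, and robustness across models and defenses. The key to the solution is TamperBench, the first unified framework for systematically evaluating the tamper resistance of LLMs. It (i) curates state-of-the-art weight-space fine-tuning attacks and latent-space representation attacks; (ii) enables realistic adversarial evaluation through systematic hyperparameter sweeps for each attack-model pair; and (iii) provides both safety and utility evaluations. The framework needs minimal additional code to specify any fine-tuning configuration, alignment-stage defense method, and metric suite, and it ensures end-to-end reproducibility.
Link: https://arxiv.org/abs/2602.06911
Authors: Saad Hossain, Tom Tseng, Punya Syon Pandey, Samanvay Vajpayee, Matthew Kowal, Nayeema Nonta, Samuel Simko, Stephen Casper, Zhijing Jin, Kellin Pelrine, Sirisha Rambhatla
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 28 pages, 13 figures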
Abstract:As increasingly capable open-weight large language models (LLMs) are deployed, improving their tamper resistance against unsafe modifications, whether accidental or intentional, becomes critical to minimize risks. However, there is no standard approach to evaluate tamper resistance. Varied data sets, metrics, and tampering configurations make it difficult to compare safety, utility, and robustness across different models and defenses. To this end, we introduce TamperBench, the first unified framework to systematically evaluate the tamper resistance of LLMs. TamperBench (i) curates a repository of state-of-the-art weight-space fine-tuning attacks and latent-space representation attacks; (ii) enables realistic adversarial evaluation through systematic hyperparameter sweeps per attack-model pair; and (iii) provides both safety and utility evaluations. TamperBench requires minimal additional code to specify any fine-tuning configuration, alignment-stage defense method, and metric suite while ensuring end-to-end reproducibility. We use TamperBench to evaluate 21 open-weight LLMs, including defense-augmented variants, across nine tampering threats using standardized safety and capability metrics with hyperparameter sweeps per model-attack pair. This yields novel insights, including effects of post-training on tamper resistance, that jailbreak-tuning is typically the most severe attack, and that Triplet emerges as a leading alignment-stage defense. Code is available at: this https URL
[AI-4] Supercharging Simulation-Based Inference for Bayesian Optimal Experimental Design
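【Quick Read】: This paper addresses the difficulty of computing the expected information gain (EIG) in Bayesian optimal experimental design (BOED), especially when the likelihood is intractable. Existing work connecting BOED with simulation-based inference (SBI) is restricted to a single contrastive EIG bound and so cannot make full use of modern SBI tools. The key to the solution is to show that the EIG admits multiple formulations that directly use SBI neural posterior, likelihood, and ratio estimators. In particular, the authors define a novel EIG estimator based on neural likelihood estimation and show that a simple multi-start parallel gradient ascent procedure substantially eases the optimization bottleneck of gradient-based EIG maximization. The resulting methods match existing state-of-the-art approaches or outperform them by up to 22% on standard BOED benchmarks.
Link: https://arxiv.org/abs/2602.06900
Authors: Samuel Klein, Willie Neiswanger, Daniel Ratner, Michael Kagan, Sean Gasiorowski
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
Comments: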
Abstract:Bayesian optimal experimental design (BOED) seeks to maximize the expected information gain (EIG) of experiments. This requires a likelihood estimate, which in many settings is intractable. Simulation-based inference (SBI) provides powerful tools for this regime. However, existing work explicitly connecting SBI and BOED is restricted to a single contrastive EIG bound. We show that the EIG admits multiple formulations which can directly leverage modern SBI density estimators, encompassing neural posterior, likelihood, and ratio estimation. Building on this perspective, we define a novel EIG estimator using neural likelihood estimation. Further, we identify optimization as a key bottleneck of gradient-based EIG maximization and show that a simple multi-start parallel gradient ascent procedure can substantially improve reliability and performance. With these innovations, our SBI-based BOED methods are able to match existing state-of-the-art approaches, or outperform them by up to 22%, across standard BOED benchmarks.
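As background for what an EIG estimator computes, the sketch below evaluates EIG(d) = E[log p(y | θ, d) − log p(y | d)] by nested Monte Carlo on a toy linear-Gaussian model where the exact value is known. The model (y = d·θ + noise, θ ~ N(0, 1)), the sample sizes, and the function names are assumptions chosen for illustration; the paper's estimators use learned neural densities instead of the exact likelihood used here.

```python
import math
import random

def log_normal_pdf(x, mean, std):
    return -0.5 * math.log(2 * math.pi * std * std) - 0.5 * ((x - mean) / std) ** 2

def nested_mc_eig(design, noise_std=1.0, outer=400, inner=400, seed=0):
    """Nested Monte Carlo EIG for y = design * theta + noise, theta ~ N(0, 1)."""
    rng = random.Random(seed)
    # Prior draws reused for every marginal estimate (a simplifying assumption).
    inner_thetas = [rng.gauss(0.0, 1.0) for _ in range(inner)]
    total = 0.0
    for _ in range(outer):
        theta = rng.gauss(0.0, 1.0)
        y = design * theta + rng.gauss(0.0, noise_std)
        log_lik = log_normal_pdf(y, design * theta, noise_std)
        # p(y | design) approximated by averaging the likelihood over prior draws.
        marginal = sum(
            math.exp(log_normal_pdf(y, design * t, noise_std)) for t in inner_thetas
        ) / inner
        total += log_lik - math.log(marginal)
    return total / outer

def exact_eig(design, noise_std=1.0):
    """Closed form for the linear-Gaussian toy model."""
    return 0.5 * math.log(1.0 + (design / noise_std) ** 2)

estimate = nested_mc_eig(2.0)
print(estimate, exact_eig(2.0))
```

Larger designs carry more information in this toy model, so both the estimate and the closed form grow with |d|; when the likelihood is intractable, the exact density call is the piece an SBI density estimator would replace.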
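[AI-5] TraceCoder: A Trace-Driven Multi-Agent Framework for Automated Debugging of LLM-Generated Code
【Quick Read】: This paper addresses the subtle but critical bugs that large language models (LLMs) introduce when generating code, especially for complex tasks. Existing automated repair methods rely on superficial pass/fail signals, which makes precise error localization difficult, and they cannot learn from earlier failures, so repair falls into repetitive, inefficient cycles. The key to the solution is the TraceCoder framework, whose core components are: 1) instrumenting the code to collect fine-grained runtime traces that expose the program's internal execution state; 2) causal analysis on these traces to identify the root cause of a failure; 3) a Historical Lesson Learning Mechanism (HLLM) that distills insights from prior failed repair attempts to guide later correction strategies and prevent similar mistakes from recurring; and 4) a Rollback Mechanism that enforces that each repair iteration is a strict improvement toward the correct solution, which keeps convergence stable. Experiments show up to a 34.43% relative improvement in Pass@1 accuracy over existing advanced baselines.
Link: https://arxiv.org/abs/2602.06875
Authors: Jiangping Huang, Wenguang Ye, Weisong Sun, Jian Zhang, Mingyue Zhang, Yang Liu
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: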
Abstract:Large Language Models (LLMs) often generate code with subtle but critical bugs, especially for complex tasks. Existing automated repair methods typically rely on superficial pass/fail signals, offering limited visibility into program behavior and hindering precise error localization. In addition, without a way to learn from prior failures, repair processes often fall into repetitive and inefficient cycles. To overcome these challenges, we present TraceCoder, a collaborative multi-agent framework that emulates the observe-analyze-repair process of human experts. The framework first instruments the code with diagnostic probes to capture fine-grained runtime traces, enabling deep insight into its internal execution. It then conducts causal analysis on these traces to accurately identify the root cause of the failure. This process is further enhanced by a novel Historical Lesson Learning Mechanism (HLLM), which distills insights from prior failed repair attempts to inform subsequent correction strategies and prevent recurrence of similar mistakes. To ensure stable convergence, a Rollback Mechanism enforces that each repair iteration constitutes a strict improvement toward the correct solution. Comprehensive experiments across multiple benchmarks show that TraceCoder achieves up to a 34.43% relative improvement in Pass@1 accuracy over existing advanced baselines. Ablation studies verify the significance of each system component, with the iterative repair process alone contributing a 65.61% relative gain in accuracy. Furthermore, TraceCoder significantly outperforms leading iterative methods in terms of both accuracy and cost-efficiency.
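The "instrument, then inspect runtime traces" step can be illustrated with Python's standard `sys.settrace` hook. This is only a sketch of the general idea of capturing fine-grained execution state; the abstract does not say how TraceCoder inserts its diagnostic probes, and `run_with_trace` and `buggy_mean` are hypothetical names.

```python
import sys

def run_with_trace(func, *args):
    """Run func and record (line offset, local variables) before each executed line."""
    trace = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # ignore every function except the one under test
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            trace.append((offset, dict(frame.f_locals)))  # snapshot of locals
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return result, trace

def buggy_mean(values):
    total = 0
    for v in values:
        total += v
    return total / (len(values) - 1)  # bug: should divide by len(values)

result, trace = run_with_trace(buggy_mean, [2, 4, 6])
for offset, local_vars in trace:
    print(offset, local_vars)
```

A pass/fail signal would only report that the mean is wrong. The trace shows that `total` is correct right before the return statement, which narrows the fault to the final division.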
[AI-6] Zero-shot Generalizable Graph Anomaly Detection with Mixture of Riemannian Experts
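【Quick Read】: This paper addresses the limited cross-domain generalization of zero-shot Graph Anomaly Detection (GAD) methods that ignore the intrinsic geometric differences between anomaly patterns. Existing methods typically embed heterogeneous graph data into a single static curvature space, which distorts the structural signatures of anomalies and hurts detection. The key to the solution is the GAD-MoRE framework, built on a Mixture of Riemannian Experts (MoRE) architecture: several specialized Riemannian expert networks each operate in a distinct curvature space, so that each anomaly pattern is modeled in the geometry where it is most detectable. An anomaly-aware multi-curvature feature alignment module projects raw node features into parallel Riemannian spaces to capture diverse geometric characteristics, and a memory-based dynamic router assigns each input to the most compatible expert based on historical reconstruction performance on similar anomalies, which improves generalization to unseen anomaly patterns.
Link: https://arxiv.org/abs/2602.06859
Authors: Xinyu Zhao, Qingyun Sun, Jiayi Luo, Xingcheng Fu, Jianxin Li
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: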
Abstract:Graph Anomaly Detection (GAD) aims to identify irregular patterns in graph data, and recent works have explored zero-shot generalist GAD to enable generalization to unseen graph datasets. However, existing zero-shot GAD methods largely ignore intrinsic geometric differences across diverse anomaly patterns, substantially limiting their cross-domain generalization. In this work, we reveal that anomaly detectability is highly dependent on the underlying geometric properties and that embedding graphs from different domains into a single static curvature space can distort the structural signatures of anomalies. To address the challenge that a single curvature space cannot capture geometry-dependent graph anomaly patterns, we propose GAD-MoRE, a novel framework for zero-shot Generalizable Graph Anomaly Detection with a Mixture of Riemannian Experts architecture. Specifically, to ensure that each anomaly pattern is modeled in the Riemannian space where it is most detectable, GAD-MoRE employs a set of specialized Riemannian expert networks, each operating in a distinct curvature space. To align raw node features with curvature-specific anomaly characteristics, we introduce an anomaly-aware multi-curvature feature alignment module that projects inputs into parallel Riemannian spaces, enabling the capture of diverse geometric characteristics. Finally, to facilitate better generalization beyond seen patterns, we design a memory-based dynamic router that adaptively assigns each input to the most compatible expert based on historical reconstruction performance on similar anomalies. Extensive experiments in the zero-shot setting demonstrate that GAD-MoRE significantly outperforms state-of-the-art generalist GAD baselines, and even surpasses strong competitors that are few-shot fine-tuned with labeled data from the target domain.
[AI-7] AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents
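【Quick Read】: This paper addresses the lack of a systematic benchmark for evaluating large language model (LLM) agents on automating scientific research. Existing studies are mostly limited to a single task or a specific stage and cannot measure agent capability across the full research lifecycle, from idea generation and experiment analysis to iterative refinement. The authors propose AIRS-Bench (AI Research Science Benchmark), a suite of 20 tasks spanning language modeling, mathematics, bioinformatics, and time series forecasting. The tasks provide no baseline code, so they test whether agents can complete research tasks autonomously. The task format supports easy integration of new tasks and rigorous comparison across agentic frameworks. Baselines built from frontier models with sequential and parallel scaffolds show that agents exceed human state of the art (SOTA) in four tasks, fail to match it in the other sixteen, and do not reach the theoretical performance ceiling even where they surpass humans, which indicates the benchmark is far from saturated.
Link: https://arxiv.org/abs/2602.06855
Authors: Alisia Lupidi, Bhavul Gauri, Thomas Simon Foster, Bassel Al Omari, Despoina Magka, Alberto Pepe, Alexis Audran-Reiss, Muna Aghamelu, Nicolas Baldwin, Lucia Cipolina-Kun, Jean-Christophe Gagnon-Audet, Chee Hau Leow, Sandra Lefdal, Hossam Mossalam, Abhinav Moudgil, Saba Nazir, Emanuel Tewolde, Isabel Urrego, Jordi Armengol Estape, Amar Budhiraja, Gaurav Chaurasia, Abhishek Charnalia, Derek Dunfield, Karen Hambardzumyan, Daniel Izcovich, Martin Josifoski, Ishita Mediratta, Kelvin Niu, Parth Pathak, Michael Shvartsman, Edan Toledo, Anton Protopopov, Roberta Raileanu, Alexander Miller, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 49 pages, 14 figures, 10 tables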
Abstract:LLM agents hold significant promise for advancing scientific research. To accelerate this progress, we introduce AIRS-Bench (the AI Research Science Benchmark), a suite of 20 tasks sourced from state-of-the-art machine learning papers. These tasks span diverse domains, including language modeling, mathematics, bioinformatics, and time series forecasting. AIRS-Bench tasks assess agentic capabilities over the full research lifecycle – including idea generation, experiment analysis and iterative refinement – without providing baseline code. The AIRS-Bench task format is versatile, enabling easy integration of new tasks and rigorous comparison across different agentic frameworks. We establish baselines using frontier models paired with both sequential and parallel scaffolds. Our results show that agents exceed human SOTA in four tasks but fail to match it in sixteen others. Even when agents surpass human benchmarks, they do not reach the theoretical performance ceiling for the underlying tasks. These findings indicate that AIRS-Bench is far from saturated and offers substantial room for improvement. We open-source the AIRS-Bench task definitions and evaluation code to catalyze further development in autonomous scientific research.
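[AI-8] From Features to Actions: Explainability in Traditional and Agentic AI Systems
【Quick Read】: This paper addresses the failure of static explainable AI methods when applied to agentic AI systems that make multi-step decisions: attribution-based explanation techniques cannot reliably diagnose execution-level failures in agent trajectories. The key to the solution is to distinguish and compare two explanation paradigms: attribution methods that rank feature importance in static classification tasks, and trace-based diagnostics in agentic benchmarks (TAU-bench Airline and AssistantBench). Empirically, attribution methods give stable feature rankings in static settings (Spearman ρ = 0.86) but cannot reliably localize failures in agent behavior, whereas trace-grounded rubric evaluation consistently localizes behaviour breakdowns. In particular, state tracking inconsistency is 2.7× more prevalent in failed runs and reduces success probability by 49%. These findings motivate a shift from explaining individual outputs to trajectory-level explainability for evaluating and debugging autonomous AI behaviour.
Link: https://arxiv.org/abs/2602.06841
Authors: Sindhuja Chaduvula, Jessee Ho, Kina Kim, Aravind Narayanan, Mahshid Alinoori, Muskan Garg, Dhanesh Ramachandram, Shaina Raza
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: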
Abstract:Over the last decade, explainable AI has primarily focused on interpreting individual model predictions, producing post-hoc explanations that relate inputs to outputs under a fixed decision structure. Recent advances in large language models (LLMs) have enabled agentic AI systems whose behaviour unfolds over multi-step trajectories. In these settings, success and failure are determined by sequences of decisions rather than a single output. While useful, it remains unclear how explanation approaches designed for static predictions translate to agentic settings where behaviour emerges over time. In this work, we bridge the gap between static and agentic explainability by comparing attribution-based explanations with trace-based diagnostics across both settings. To make this distinction explicit, we empirically compare attribution-based explanations used in static classification tasks with trace-based diagnostics used in agentic benchmarks (TAU-bench Airline and AssistantBench). Our results show that while attribution methods achieve stable feature rankings in static settings (Spearman ρ = 0.86), they cannot be applied reliably to diagnose execution-level failures in agentic trajectories. In contrast, trace-grounded rubric evaluation for agentic settings consistently localizes behaviour breakdowns and reveals that state tracking inconsistency is 2.7× more prevalent in failed runs and reduces success probability by 49%. These findings motivate a shift towards trajectory-level explainability for agentic systems when evaluating and diagnosing autonomous AI behaviour. Resources: this https URL this https URL
[AI-9] An Adaptive Differentially Private Federated Learning Framework with Bi-level Optimization
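【Quick Read】: This paper addresses the unstable gradient updates, noise amplification, and degraded model performance that arise when Federated Learning is combined with Differential Privacy under device heterogeneity and non-independent and identically distributed (Non-IID) data. The key to the solution has three parts. On the client side, a lightweight local compressed module regularizes intermediate representations and constrains gradient variability, which mitigates noise amplification during local optimization. On the server side, an adaptive gradient clipping strategy adjusts clipping thresholds dynamically from historical update statistics to avoid over-clipping and noise domination. Finally, a constraint-aware aggregation mechanism suppresses unreliable or noise-dominated client updates and stabilizes global optimization.
Link: https://arxiv.org/abs/2602.06838
Authors: Jin Wang, Hui Ma, Fei Xing, Ming Yan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: submitted to a conference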
Abstract:Federated learning enables collaborative model training across distributed clients while preserving data privacy. However, in practical deployments, device heterogeneity and non-independent and identically distributed (Non-IID) data often lead to highly unstable and biased gradient updates. When differential privacy is enforced, conventional fixed gradient clipping and Gaussian noise injection may further amplify gradient perturbations, resulting in training oscillation and degraded model performance. To address these challenges, we propose an adaptive differentially private federated learning framework that explicitly targets model efficiency under heterogeneous and privacy-constrained settings. On the client side, a lightweight local compressed module is introduced to regularize intermediate representations and constrain gradient variability, thereby mitigating noise amplification during local optimization. On the server side, an adaptive gradient clipping strategy dynamically adjusts clipping thresholds based on historical update statistics to avoid over-clipping and noise domination. Furthermore, a constraint-aware aggregation mechanism is designed to suppress unreliable or noise-dominated client updates and stabilize global optimization. Extensive experiments on CIFAR-10 and SVHN demonstrate improved convergence stability and classification accuracy.
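The server-side idea of setting the clipping threshold from historical update statistics can be sketched as follows. The abstract does not give the adaptation rule, so the median-of-past-norms rule, the noise scaling, and the name `adaptive_private_mean` are assumptions for illustration only. The sketch also ignores the privacy cost of computing the threshold from raw norms, which a real differentially private system would have to account for.

```python
import math
import random
import statistics

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def clip(vec, threshold):
    """Scale vec down so its L2 norm is at most threshold."""
    norm = l2_norm(vec)
    scale = min(1.0, threshold / norm) if norm > 0 else 1.0
    return [x * scale for x in vec]

def adaptive_private_mean(updates, norm_history, noise_multiplier=1.0, rng=None):
    """Clip client updates at the median of past norms, average, add Gaussian noise."""
    rng = rng or random.Random()
    threshold = statistics.median(norm_history)   # assumed adaptation rule
    clipped = [clip(u, threshold) for u in updates]
    dim = len(updates[0])
    mean = [sum(u[i] for u in clipped) / len(clipped) for i in range(dim)]
    # Noise standard deviation is tied to the clip bound, as in DP-SGD-style averaging.
    sigma = noise_multiplier * threshold / len(clipped)
    noisy = [m + rng.gauss(0.0, sigma) for m in mean]
    # Record this round's raw norms so the next threshold tracks recent updates.
    norm_history.extend(l2_norm(u) for u in updates)
    return noisy, threshold

history = [1.0, 1.0, 2.0]                 # made-up norms from earlier rounds
updates = [[3.0, 4.0], [0.3, 0.4]]        # one large and one small client update
noisy_mean, threshold = adaptive_private_mean(updates, history, rng=random.Random(0))
print(threshold, noisy_mean)
```

With a fixed threshold that is too small, every update is clipped and noise dominates; with one that is too large, the noise scale grows needlessly. Tracking past norms is one simple way to stay between the two.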
[AI-10] LLM Active Alignment: A Nash Equilibrium Perspective
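【Quick Read】: This paper addresses the difficulty of predicting and steering the collective behavior of large language models (LLMs) in multi-agent settings, in particular how to guide LLM populations in an interpretable way when Nash equilibrium (NE) computation is intractable in open-ended text spaces. The key to the solution is to model each agent's action as a mixture over human subpopulations, so that agents actively and strategically choose which groups to align with, which yields an interpretable and behaviorally substantive policy class. Under standard concave-utility assumptions, the authors derive closed-form NE characterizations that enable system-level predictions and give explicit, actionable guidance for shifting alignment targets toward socially desirable outcomes. The method can serve as an active alignment layer on top of existing alignment pipelines such as reinforcement learning from human feedback (RLHF).
Link: https://arxiv.org/abs/2602.06836
Authors: Tonghan Wang, Yuqi Pan, Xinyi Yang, Yanchen Jiang, Milind Tambe, David C. Parkes
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: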
Abstract:We develop a game-theoretic framework for predicting and steering the behavior of populations of large language models (LLMs) through Nash equilibrium (NE) analysis. To avoid the intractability of equilibrium computation in open-ended text spaces, we model each agent’s action as a mixture over human subpopulations. Agents choose actively and strategically which groups to align with, yielding an interpretable and behaviorally substantive policy class. We derive closed-form NE characterizations, adopting standard concave-utility assumptions to enable analytical system-level predictions and give explicit, actionable guidance for shifting alignment targets toward socially desirable outcomes. The method functions as an active alignment layer on top of existing alignment pipelines such as RLHF. In a social-media setting, we show that a population of LLMs, especially reasoning-based models, may exhibit political exclusion, pathologies where some subpopulations are ignored by all LLM agents, which can be avoided by our method, illustrating the promise of applying the method to regulate multi-agent LLM dynamics across domains.
[AI-11] AI-Generated Music Detection in Broadcast Monitoring
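【Quick Read】: This paper addresses the sharp performance drop of current AI-music detectors on broadcast audio, where music appears as short excerpts (such as background music in television programs) mixed with speech. Existing detectors are mostly trained and validated on clean, full-length tracks from streaming platforms and do not transfer to broadcast audio, in which music is often masked by speech and is short. The key to the solution is AI-OpenBMAT, the first dataset designed for broadcast-style AI-music detection. It contains 3,294 one-minute audio excerpts (54.9 hours) that follow the duration patterns and loudness relations of real television audio, combining human-made production music with stylistically matched continuations generated with Suno v3.5. Evaluating a CNN baseline and state-of-the-art SpectTTTra models on this dataset shows that speech masking and short music length are the two main causes of degradation, and establishes AI-OpenBMAT as a benchmark for developing robust detectors for industrial broadcast scenarios.
Link: https://arxiv.org/abs/2602.06823
Authors: David Lopez-Ayala, Asier Cabello, Pablo Zinemanas, Emilio Molina, Martin Rocamora
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Comments: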
Abstract:AI music generators have advanced to the point where their outputs are often indistinguishable from human compositions. While detection methods have emerged, they are typically designed and validated in music streaming contexts with clean, full-length tracks. Broadcast audio, however, poses a different challenge: music appears as short excerpts, often masked by dominant speech, conditions under which existing detectors fail. In this work, we introduce AI-OpenBMAT, the first dataset tailored to broadcast-style AI-music detection. It contains 3,294 one-minute audio excerpts (54.9 hours) that follow the duration patterns and loudness relations of real television audio, combining human-made production music with stylistically matched continuations generated with Suno v3.5. We benchmark a CNN baseline and state-of-the-art SpectTTTra models to assess SNR and duration robustness, and evaluate on a full broadcast scenario. Across all settings, models that excel in streaming scenarios suffer substantial degradation, with F1-scores dropping below 60% when music is in the background or has a short duration. These results highlight speech masking and short music length as critical open challenges for AI music detection, and position AI-OpenBMAT as a benchmark for developing detectors capable of meeting industrial broadcast requirements.
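An SNR robustness test of the kind mentioned above requires mixing music under speech at a controlled level. The sketch below shows one simple way to do that with RMS energy, treating music as the signal and speech as the masker. The use of plain RMS instead of a perceptual loudness measure, and the helper names, are assumptions; the abstract does not describe how the dataset's loudness relations were set.

```python
import math

def rms(signal):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))

def mix_at_snr(music, speech, snr_db):
    """Scale speech so the music-to-speech RMS ratio equals snr_db, then sum.

    Negative snr_db means the speech is louder than the music (background music).
    """
    gain = rms(music) / (rms(speech) * 10 ** (snr_db / 20.0))
    mixed = [m + gain * s for m, s in zip(music, speech)]
    return mixed, gain

# Toy signals standing in for equal-length audio buffers.
music = [1.0, -1.0, 1.0, -1.0]
speech = [2.0, 2.0, -2.0, -2.0]

mixed, gain = mix_at_snr(music, speech, snr_db=-20.0)
achieved = 20 * math.log10(rms(music) / (gain * rms(speech)))
print(gain, achieved)
```

Sweeping `snr_db` from positive to strongly negative values produces the progression from foreground to background music over which a detector's F1-score can be tracked.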
[AI-12] POP: Online Structural Pruning Enables Efficient Inference of Large Foundation Models
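[Quick read]: This paper addresses the problem that current structural pruning methods apply fixed pruning decisions during inference and ignore the sparsity patterns that emerge dynamically during autoregressive token generation. The key to the solution is POP (Partition-guided Online Pruning), which partitions model channels into retained, candidate, and pruned regions: the prefilling stage defines a coarse pruning partition, and the decoding stage generates a fine-grained mask within the candidate region, giving context-aware dynamic pruning. The method needs no offline calibration, retraining, or learned predictors. It is lightweight and plug-and-play, lowers computational overhead and inference latency, and maintains higher accuracy across a range of large foundation models, including large language models (LLMs), mixture-of-experts models (MoEs), and vision-language models (VLMs).
Link: https://arxiv.org/abs/2602.06822
Authors: Yi Chen,Wonjin Shin,Shuhong Liu,Tho Mai,Jeongmo Lee,Chuanbo Hua,Kun Wang,Jun Liu,Joo-Young Kim
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: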
Abstract:Large foundation models (LFMs) achieve strong performance through scaling, yet current structural pruning methods derive fixed pruning decisions during inference, overlooking sparsity patterns that emerge in the autoregressive token generation. In this paper, we propose POP (Partition-guided Online Pruning), an efficient online structural pruning framework that enables context-conditioned dynamic pruning with minimal computational overhead. POP partitions model channels into retained, candidate, and pruned regions, where prefilling defines a coarse pruning partition, and the decoding stage generates a fine-grained mask within the candidate region, avoiding full-channel re-evaluation. The coarse pruning partition preserves consistently important weights, while the fine-grained masking provides context-conditioned variation during decoding. Moreover, POP is a lightweight, plug-and-play method that requires no preprocessing, including offline calibration, retraining, or learning predictors. Extensive evaluations across diverse LFMs, including large language models (LLMs), mixture-of-experts models (MoEs), and vision-language models (VLMs), demonstrate that POP consistently delivers higher accuracy than existing pruning approaches while incurring smaller computational overhead and minimizing inference latency.
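The two-stage partitioning described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the importance scores, the `keep_frac` and `prune_frac` fractions, the per-step top-`k` rule, and all function names are assumptions made for the example.

```python
def coarse_partition(prefill_scores, keep_frac=0.25, prune_frac=0.25):
    """Split channel indices into retained / candidate / pruned sets.

    Assumption: channels are ranked by a single importance score
    computed once during prefilling.
    """
    order = sorted(range(len(prefill_scores)),
                   key=lambda i: prefill_scores[i], reverse=True)
    n = len(order)
    n_keep = int(n * keep_frac)
    n_prune = int(n * prune_frac)
    retained = set(order[:n_keep])
    pruned = set(order[n - n_prune:]) if n_prune else set()
    candidate = set(order[n_keep:n - n_prune])
    return retained, candidate, pruned


def decode_mask(step_scores, retained, candidate, k):
    """Per-token mask: retained channels are always on, and only the
    candidate region is re-ranked at this decoding step."""
    top = sorted(candidate, key=lambda i: step_scores[i], reverse=True)[:k]
    active = retained | set(top)
    return [1 if i in active else 0 for i in range(len(step_scores))]


# Hypothetical scores for 8 channels.
prefill_scores = [0.9, 0.1, 0.5, 0.4, 0.8, 0.05, 0.3, 0.6]
retained, candidate, pruned = coarse_partition(prefill_scores)

# Channels 1 and 5 score highly at this step but stay off, because the
# pruned region is never re-evaluated during decoding.
step_scores = [0.0, 1.0, 0.2, 0.9, 0.1, 1.0, 0.7, 0.3]
mask = decode_mask(step_scores, retained, candidate, k=2)
```

The point of the sketch is the cost structure: the decoding step sorts only the candidate region, so the per-token work shrinks with the size of that region rather than with the full channel count.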
[AI-13] ScaleEnv: Scaling Environment Synthesis from Scratch for Generalist Interactive Tool-Use Agent Training
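[Quick read]: This paper addresses the limited ability of generalist agents to adapt to diverse scenarios. The core challenge is the scarcity of interactive environments and the limits of existing synthesis methods in environmental diversity and scalability. The key to the solution is the ScaleEnv framework, which ensures environment reliability through procedural testing and uses tool dependency graph expansion and executable action verification to guarantee task completeness and solvability, so that fully interactive environments and verifiable tasks can be built from scratch. Agents that learn through exploration in ScaleEnv improve significantly on unseen multi-turn tool-use benchmarks such as τ²-Bench and VitaBench, which supports the key role of environmental diversity in model generalization.
Link: https://arxiv.org/abs/2602.06820
Authors: Dunwei Tu,Hongyan Hao,Hansi Yang,Yihao Chen,Yi-Kai Zhang,Zhikang Xia,Yu Yang,Yueqing Sun,Xingchen Liu,Furao Shen,Qi Gu,Hui Su,Xunliang Cai
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: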
Abstract:Training generalist agents capable of adapting to diverse scenarios requires interactive environments for self-exploration. However, interactive environments remain critically scarce, and existing synthesis methods suffer from significant limitations regarding environmental diversity and scalability. To address these challenges, we introduce ScaleEnv, a framework that constructs fully interactive environments and verifiable tasks entirely from scratch. Specifically, ScaleEnv ensures environment reliability through procedural testing, and guarantees task completeness and solvability via tool dependency graph expansion and executable action verification. By enabling agents to learn through exploration within ScaleEnv, we demonstrate significant performance improvements on unseen, multi-turn tool-use benchmarks such as τ²-Bench and VitaBench, highlighting strong generalization capabilities. Furthermore, we investigate the relationship between increasing number of domains and model generalization performance, providing empirical evidence that scaling environmental diversity is critical for robust agent learning.
[AI-14] Wild Guesses and Mild Guesses in Active Concept Learning
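[Quick read]: This paper asks how human concept learning balances the informativeness of queries against the stability of inference, and in particular how to learn actively in an open-ended, sparse hypothesis space. The key to the solution is a neuro-symbolic Bayesian learning framework in which hypotheses are generated by a large language model (LLM) and reweighted by Bayesian updating. Two strategies are compared within it: a rational active learner based on approximate expected information gain (EIG), and the human-like Positive Test Strategy (PTS). The study finds that EIG performs well on complex rules that require falsification, but on simple rules a support mismatch causes the particle approximation to fail. PTS, although not information-optimal, keeps hypothesis generation valid by choosing "safe" queries and therefore converges faster on simple rules. This suggests that "confirmation bias" may be not a cognitive error but a rational strategy adapted to the sparse hypothesis spaces of human thought.
Link: https://arxiv.org/abs/2602.06818
Authors: Anirudh Chari,Neil Pattanaik
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: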
Abstract:Human concept learning is typically active: learners choose which instances to query or test in order to reduce uncertainty about an underlying rule or category. Active concept learning must balance informativeness of queries against the stability of the learner that generates and scores hypotheses. We study this trade-off in a neuro-symbolic Bayesian learner whose hypotheses are executable programs proposed by a large language model (LLM) and reweighted by Bayesian updating. We compare a Rational Active Learner that selects queries to maximize approximate expected information gain (EIG) and the human-like Positive Test Strategy (PTS) that queries instances predicted to be positive under the current best hypothesis. Across concept-learning tasks in the classic Number Game, EIG is effective when falsification is necessary (e.g., compound or exception-laden rules), but underperforms on simple concepts. We trace this failure to a support mismatch between the EIG policy and the LLM proposal distribution: highly diagnostic boundary queries drive the posterior toward regions where the generator produces invalid or overly specific programs, yielding a support-mismatch trap in the particle approximation. PTS is information-suboptimal but tends to maintain proposal validity by selecting “safe” queries, leading to faster convergence on simple rules. Our results suggest that “confirmation bias” may not be a cognitive error, but rather a rational adaptation for maintaining tractable inference in the sparse, open-ended hypothesis spaces characteristic of human thought.
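The contrast between the two query policies can be made concrete with a toy Number Game. The sketch below is an illustration under simplifying assumptions, not the paper's learner: the three hand-written hypotheses stand in for LLM-proposed programs, the prior weights are invented, and labels are assumed noise-free, in which case the expected information gain of a yes/no query reduces to the binary entropy of the posterior predictive.

```python
import math

NUMBERS = range(1, 21)

# Hypothetical hypothesis set; in the paper these are programs proposed by an LLM.
HYPOTHESES = {
    "even": lambda n: n % 2 == 0,
    "multiple_of_4": lambda n: n % 4 == 0,
    "power_of_2": lambda n: n in (1, 2, 4, 8, 16),
}


def predictive(posterior, n):
    """Posterior probability that n is a positive instance."""
    return sum(w for h, w in posterior.items() if HYPOTHESES[h](n))


def binary_entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def eig_query(posterior):
    """With deterministic hypotheses and noise-free labels, EIG equals
    the entropy of the predicted label, so pick the most uncertain n."""
    return max(NUMBERS, key=lambda n: binary_entropy(predictive(posterior, n)))


def pts_query(posterior, asked=()):
    """Positive Test Strategy: query an instance the current best
    hypothesis predicts to be positive."""
    best = max(posterior, key=posterior.get)
    return next(n for n in NUMBERS if HYPOTHESES[best](n) and n not in asked)


def update(posterior, n, label):
    """Bayesian reweighting: drop hypotheses inconsistent with the label.
    If every hypothesis is eliminated the normaliser is zero; that case is
    not handled here and is where new proposals would be needed."""
    unnorm = {h: (w if HYPOTHESES[h](n) == label else 0.0)
              for h, w in posterior.items()}
    z = sum(unnorm.values())
    return {h: w / z for h, w in unnorm.items()}


posterior = {"even": 0.5, "multiple_of_4": 0.3, "power_of_2": 0.2}
eig_choice = eig_query(posterior)   # a boundary case that splits the posterior
pts_choice = pts_query(posterior)   # a "safe" positive under the top hypothesis
after = update(posterior, eig_choice, True)
```

Here the EIG policy picks 6, a number on which the hypotheses disagree, while PTS picks 2, which the leading hypothesis already expects to be positive.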
[AI-15] SuReNav: Superpixel Graph-based Constraint Relaxation for Navigation in Over-constrained Environments ICRA2026
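[Quick read]: This paper addresses over-constrained planning in semi-static environments: finding a best-effort compromise that avoids all hard-constraint regions while minimizing entry into high-risk areas. Conventional methods rely on predefined area costs and generalize poorly, and the spatial continuity of navigation space makes it hard to identify passable regions accurately without overestimation. The key to the solution is the SuReNav framework, whose core components are: map construction and regional-constraint representation based on a superpixel graph; regional-constraint relaxation by a graph neural network trained on human demonstrations, to imitate human-like safe and efficient navigation; and interleaved relaxation, planning, and execution to complete the full navigation pipeline. The method outperforms state-of-the-art baselines on 2D semantic maps and 3D maps from OpenStreetMap, balances efficiency and safety, and its scalability and generalization are validated in a real urban environment with the quadruped robot Spot.
Link: https://arxiv.org/abs/2602.06807
Authors: Keonyoung Koh,Moonkyeong Jung,Samuel Seungsup Lee,Daehyung Park
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by ICRA 2026. Code and videos are available at this https URL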
Abstract:We address the over-constrained planning problem in semi-static environments. The planning objective is to find a best-effort solution that avoids all hard constraint regions while minimally traversing the least risky areas. Conventional methods often rely on pre-defined area costs, limiting generalizations. Further, the spatial continuity of navigation spaces makes it difficult to identify regions that are passable without overestimation. To overcome these challenges, we propose SuReNav, a superpixel graph-based constraint relaxation and navigation method that imitates human-like safe and efficient navigation. Our framework consists of three components: 1) superpixel graph map generation with regional constraints, 2) regional-constraint relaxation using graph neural network trained on human demonstrations for safe and efficient navigation, and 3) interleaving relaxation, planning, and execution for complete navigation. We evaluate our method against state-of-the-art baselines on 2D semantic maps and 3D maps from OpenStreetMap, achieving the highest human-likeness score of complete navigation while maintaining a balanced trade-off between efficiency and safety. We finally demonstrate its scalability and generalization performance in real-world urban navigation with a quadruped robot, Spot.
[AI-16] On the Identifiability of Steering Vectors in Large Language Models
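[Quick read]: This paper examines the interpretability of behavior-control methods for generative AI, such as persona vectors: whether these control vectors correspond uniquely and identifiably to internal model representations. It shows that, under realistic modeling and data conditions, such vectors are fundamentally non-identifiable because of large equivalence classes of behaviorally indistinguishable interventions. The key to the solution is to introduce structural assumptions, including statistical independence, sparsity constraints, multi-environment validation, or cross-layer consistency, under which identifiability is recovered. The finding exposes a fundamental interpretability limit of current behavior-control methods and clarifies the theoretical preconditions for safety-critical control.
Link: https://arxiv.org/abs/2602.06801
Authors: Sohan Venkatesh,Ashish Mahendran Kurapath
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 23 pages, 4 figures, 2 tables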
Abstract:Activation steering methods, such as persona vectors, are widely used to control large language model behavior and increasingly interpreted as revealing meaningful internal representations. This interpretation implicitly assumes steering directions are identifiable and uniquely recoverable from input-output behavior. We formalize steering as an intervention on internal representations and prove that, under realistic modeling and data conditions, steering vectors are fundamentally non-identifiable due to large equivalence classes of behaviorally indistinguishable interventions. Empirically, we validate this across multiple models and semantic traits, showing orthogonal perturbations achieve near-equivalent efficacy with negligible effect sizes. However, identifiability is recoverable under structural assumptions including statistical independence, sparsity constraints, multi-environment validation or cross-layer consistency. These findings reveal fundamental interpretability limits and clarify structural assumptions required for reliable safety-critical control.
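One simple way to see how an equivalence class of behaviorally indistinguishable interventions can arise is a linear readout with a non-trivial null space. The toy below is only an illustration under that linearity assumption and is not the paper's proof or experimental setup; the matrix `W`, the hidden state `h`, and both vectors are invented for the example.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


# Hypothetical readout from a 3-dimensional hidden state to 2 outputs.
W = [[1, 0, 1],
     [0, 1, 1]]

h = [1, 2, 3]            # hidden state before steering
v = [2, 0, 0]            # one candidate steering vector
null_dir = [1, 1, -1]    # W @ null_dir == [0, 0]

out_v = matvec(W, add(h, v))
out_alt = matvec(W, add(h, add(v, null_dir)))
```

Both interventions produce the same output, so input-output behavior alone cannot tell `v` apart from `v + null_dir`. Extra structure, such as a sparsity constraint that prefers `v`, is what would single out one member of the class.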
[AI-17] Next-generation cyberattack detection with large language models: anomaly analysis across heterogeneous logs
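[Quick read]: This paper addresses the high false positive rates, semantic blindness, and data scarcity that traditional intrusion detection systems face when detecting anomalies across heterogeneous log sources, in particular the difficulty of obtaining high-quality labeled data because logs are sensitive. The key to the solution is threefold: (1) two balanced and diverse log datasets, LogAtlas-Foundation-Sessions and LogAtlas-Defense-Set, with explicit attack annotations and privacy preservation; (2) empirical benchmarking that shows why standard metrics such as F1 and accuracy are misleading in security settings; and (3) a two-phase training framework that first uses a larger foundation model, Base-AMAN (3B parameters), to understand log semantics and then obtains a lightweight real-time detector, AMAN (0.5B parameters), through knowledge distillation. The result is inference at 0.3-0.5 seconds per session and operating costs below 50 USD per day, which makes detection both more accurate and more practical.
Link: https://arxiv.org/abs/2602.06777
Authors: Yassine Chagna,Antal Goldschmidt
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: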
Abstract:This project explores large language models (LLMs) for anomaly detection across heterogeneous log sources. Traditional intrusion detection systems suffer from high false positive rates, semantic blindness, and data scarcity, as logs are inherently sensitive, making clean datasets rare. We address these challenges through three contributions: (1) LogAtlas-Foundation-Sessions and LogAtlas-Defense-Set, balanced and heterogeneous log datasets with explicit attack annotations and privacy preservation; (2) empirical benchmarking revealing why standard metrics such as F1 and accuracy are misleading for security applications; and (3) a two phase training framework combining log understanding (Base-AMAN, 3B parameters) with real time detection (AMAN, 0.5B parameters via knowledge distillation). Results demonstrate practical feasibility, with inference times of 0.3-0.5 seconds per session and operational costs below 50 USD per day.
[AI-18] Towards Understanding What State Space Models Learn About Code
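[Quick read]: This paper addresses the unclear internal mechanisms of state space models (SSMs) on code understanding tasks and their performance degradation during fine-tuning. Its analysis shows that SSMs capture code syntax and semantics well during pretraining but forget certain syntactic and semantic relations during task fine-tuning, especially when the task emphasizes short-range dependencies. The key to the solution is SSM-Interpret, a frequency-domain diagnostic framework that exposes a spectral shift toward short-range dependencies during fine-tuning. Guided by this, the authors propose architectural modifications that significantly improve SSM performance on code modeling, confirming that the analysis directly informs model improvement.
Link: https://arxiv.org/abs/2602.06774
Authors: Jiali Wu,Abhinav Anand,Shweta Verma,Mira Mezini
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: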
Abstract:State Space Models (SSMs) have emerged as an efficient alternative to the transformer architecture. Recent studies show that SSMs can match or surpass Transformers on code understanding tasks, such as code retrieval, when trained under similar conditions. However, their internal mechanisms remain a black box. We present the first systematic analysis of what SSM-based code models actually learn and perform the first comparative analysis of SSM and Transformer-based code models. Our analysis reveals that SSMs outperform Transformers at capturing code syntax and semantics in pretraining but forgets certain syntactic and semantic relations during fine-tuning on task, especially when the task emphasizes short-range dependencies. To diagnose this, we introduce SSM-Interpret, a frequency-domain framework that exposes a spectral shift toward short-range dependencies during fine-tuning. Guided by these findings, we propose architectural modifications that significantly improve the performance of SSM-based code model, validating that our analysis directly enables better models.
[AI-19] AEGIS: Adversarial Target-Guided Retention-Data-Free Robust Concept Erasure from Diffusion Models
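[Quick read]: This paper addresses the difficulty current concept erasure methods have in achieving robustness and retention at the same time. Existing methods usually strengthen one property at the expense of the other: mapping a single erased prompt to a fixed safe target leaves class-level remnants that prompt attacks can exploit, while retention-oriented schemes perform poorly against adaptive adversaries. To meet this challenge, the paper proposes Adversarial Erasure with Gradient Informed Synergy (AEGIS). Its key feature is that it is retention-data-free: a gradient-informed synergy mechanism improves both the model's robustness on erased concepts and its retention of unrelated concepts without relying on retention data.
Link: https://arxiv.org/abs/2602.06771
Authors: Fengpeng Li,Kemou Li,Qizhou Wang,Bo Han,Jiantao Zhou
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 30 pages, 12 figures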
Abstract:Concept erasure helps stop diffusion models (DMs) from generating harmful content; but current methods face robustness retention trade off. Robustness means the model fine-tuned by concept erasure methods resists reactivation of erased concepts, even under semantically related prompts. Retention means unrelated concepts are preserved so the model’s overall utility stays intact. Both are critical for concept erasure in practice, yet addressing them simultaneously is challenging, as existing works typically improve one factor while sacrificing the other. Prior work typically strengthens one while degrading the other, e.g., mapping a single erased prompt to a fixed safe target leaves class level remnants exploitable by prompt attacks, whereas retention-oriented schemes underperform against adaptive adversaries. This paper introduces Adversarial Erasure with Gradient Informed Synergy (AEGIS), a retention-data-free framework that advances both robustness and retention.
[AI-20] A Unified Framework for LLM Watermarks
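[Quick read]: This paper addresses the lack of a unified theoretical framework for text watermarking of large language models (LLMs). Existing watermarking methods are mostly designed bottom-up without systematic principles, which makes their inherent constraints and performance trade-offs hard to analyze. The key contribution is a unified formulation based on constrained optimization from which most existing watermarking schemes can be derived, and which makes explicit the constraint each method optimizes, including the trade-off between quality, diversity, and detection power. The framework provides a theoretical basis for understanding existing methods and also supports designing new schemes for specific requirements, for example by using perplexity directly as a proxy for quality and deriving the scheme that is optimal under that constraint. Experiments confirm that schemes derived from a given constraint consistently maximize detection power with respect to that constraint.
Link: https://arxiv.org/abs/2602.06754
Authors: Thibaud Gloaguen,Robin Staab,Nikola Jovanović,Martin Vechev
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: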
Abstract:LLM watermarks allow tracing AI-generated texts by inserting a detectable signal into their generated content. Recent works have proposed a wide range of watermarking algorithms, each with distinct designs, usually built using a bottom-up approach. Crucially, there is no general and principled formulation for LLM watermarking. In this work, we show that most existing and widely used watermarking schemes can in fact be derived from a principled constrained optimization problem. Our formulation unifies existing watermarking methods and explicitly reveals the constraints that each method optimizes. In particular, it highlights an understudied quality-diversity-power trade-off. At the same time, our framework also provides a principled approach for designing novel watermarking schemes tailored to specific requirements. For instance, it allows us to directly use perplexity as a proxy for quality, and derive new schemes that are optimal with respect to this constraint. Our experimental evaluation validates our framework: watermarking schemes derived from a given constraint consistently maximize detection power with respect to that constraint.
[AI-21] Semantically Labelled Automata for Multi-Task Reinforcement Learning with LTL Instructions
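[Quick read]: This paper addresses how an agent in multi-task reinforcement learning (multi-task RL) can learn a single universal policy that generalizes to arbitrary, possibly unseen tasks. The core challenge is representing and exploiting the task specification efficiently, particularly when tasks are described in linear temporal logic (LTL). The key to the solution is a novel task embedding technique built on a new generation of semantic LTL-to-automata translations originally developed for temporal synthesis. Each state of the resulting semantically labelled automaton carries rich, structured information, which lets the system (i) compute the automaton efficiently on the fly, (ii) extract expressive task embeddings used to condition the policy, and (iii) support full LTL natively. Experiments show that the method reaches state-of-the-art performance in several domains and scales to complex specifications that existing methods cannot handle.
Link: https://arxiv.org/abs/2602.06746
Authors: Alessandro Abate,Giuseppe De Giacomo,Mathias Jackermeier,Jan Kretínský,Maximilian Prokop,Christoph Weinhuber
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: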
Abstract:We study multi-task reinforcement learning (RL), a setting in which an agent learns a single, universal policy capable of generalising to arbitrary, possibly unseen tasks. We consider tasks specified as linear temporal logic (LTL) formulae, which are commonly used in formal methods to specify properties of systems, and have recently been successfully adopted in RL. In this setting, we present a novel task embedding technique leveraging a new generation of semantic LTL-to-automata translations, originally developed for temporal synthesis. The resulting semantically labelled automata contain rich, structured information in each state that allow us to (i) compute the automaton efficiently on-the-fly, (ii) extract expressive task embeddings used to condition the policy, and (iii) naturally support full LTL. Experimental results in a variety of domains demonstrate that our approach achieves state-of-the-art performance and is able to scale to complex specifications where existing methods fail.
[AI-22] Optimal Abstractions for Verifying Properties of Kolmogorov-Arnold Networks (KANs)
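[Quick read]: This paper addresses property verification for Kolmogorov-Arnold Networks (KANs): deciding whether the network outputs satisfy a given property for all inputs in a given set. KANs are built from nonlinear, univariate activation functions (typically piecewise polynomial splines or Gaussian processes), which makes direct verification hard. The authors propose a method based on mathematical abstraction: each KAN unit is replaced by a piecewise affine (PWA) function, with local and global error estimates between the original network and its approximation, and the verification problem is encoded as a Mixed Integer Linear Program (MILP). The crux is the trade-off between the number of pieces and the error bound: too many pieces inflate the number of MILP variables and make solving intractable, while too few produce error margins too large to be informative. The core contribution is a systematic framework that exploits KAN structure, combining dynamic programming at the unit level with a knapsack optimization across the network to minimize the total number of pieces while guaranteeing specified error bounds, and thus to choose the optimal approximation strategy. Empirical results show that the upfront analysis cost is justified by better verification results.
Link: https://arxiv.org/abs/2602.06737
Authors: Noah Schwartz,Chandra Kanth Nagesh,Sriram Sankaranarayanan,Ramneet Kaur,Tuhin Sahai,Susmit Jha
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: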
Abstract:We present a novel approach for verifying properties of Kolmogorov-Arnold Networks (KANs), a class of neural networks characterized by nonlinear, univariate activation functions typically implemented as piecewise polynomial splines or Gaussian processes. Our method creates mathematical "abstractions" by replacing each KAN unit with a piecewise affine (PWA) function, providing both local and global error estimates between the original network and its approximation. These abstractions enable property verification by encoding the problem as a Mixed Integer Linear Program (MILP), determining whether outputs satisfy specified properties when inputs belong to a given set. A critical challenge lies in balancing the number of pieces in the PWA approximation: too many pieces add binary variables that make verification computationally intractable, while too few pieces create excessive error margins that yield uninformative bounds. Our key contribution is a systematic framework that exploits KAN structure to find optimal abstractions. By combining dynamic programming at the unit level with a knapsack optimization across the network, we minimize the total number of pieces while guaranteeing specified error bounds. This approach determines the optimal approximation strategy for each unit while maintaining overall accuracy requirements. Empirical evaluation across multiple KAN benchmarks demonstrates that the upfront analysis costs of our method are justified by superior verification results.
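The pieces-versus-error trade-off for a single univariate unit can be illustrated with a small sketch. It is deliberately simpler than the paper's method: it uses uniformly spaced knots and a linear search instead of dynamic programming, it covers one unit rather than a knapsack over the network, and the error is estimated on a sample grid, so it is not a sound bound of the kind verification needs. `math.tanh` stands in for a KAN activation, and all function names are assumptions.

```python
import math


def pwa_interpolant(f, lo, hi, pieces):
    """Piecewise affine interpolant of f on [lo, hi] with uniform knots."""
    knots = [lo + (hi - lo) * i / pieces for i in range(pieces + 1)]
    values = [f(x) for x in knots]

    def g(x):
        # Locate the piece containing x, clamping the right endpoint.
        i = min(int((x - lo) / (hi - lo) * pieces), pieces - 1)
        t = (x - knots[i]) / (knots[i + 1] - knots[i])
        return (1 - t) * values[i] + t * values[i + 1]

    return g


def max_error(f, g, lo, hi, samples=2001):
    """Sampled estimate of the sup-norm error (not a certified bound)."""
    grid = (lo + (hi - lo) * j / (samples - 1) for j in range(samples))
    return max(abs(f(x) - g(x)) for x in grid)


def fewest_pieces(f, lo, hi, budget, max_pieces=256):
    """Smallest uniform piece count whose sampled error fits the budget."""
    for pieces in range(1, max_pieces + 1):
        g = pwa_interpolant(f, lo, hi, pieces)
        if max_error(f, g, lo, hi) <= budget:
            return pieces
    return None


err_4 = max_error(math.tanh, pwa_interpolant(math.tanh, -2.0, 2.0, 4), -2.0, 2.0)
err_16 = max_error(math.tanh, pwa_interpolant(math.tanh, -2.0, 2.0, 16), -2.0, 2.0)
needed = fewest_pieces(math.tanh, -2.0, 2.0, budget=0.01)
```

Each additional piece would cost extra binary variables in a MILP encoding, which is why the smallest piece count that meets the error budget is the quantity of interest.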
[AI-23] GhostCite: A Large-Scale Analysis of Citation Validity in the Age of Large Language Models
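[Quick read]: This paper addresses fabricated citations ("ghost citations") arising from the widespread use of large language models (LLMs) in academic writing, which seriously threaten the validity and credibility of citations in the scientific literature. The core of the solution is CiteVerifier, an open-source framework for large-scale citation verification, used in three experiments to quantify the risk. First, a citation-generation benchmark of 13 state-of-the-art LLMs across 40 research domains finds that every model hallucinates citations at a substantial rate (14.23% to 94.93%). Second, an analysis of 2.2 million citations from 56,381 top-venue papers confirms that 1.07% of papers contain invalid or fabricated citations, with an 80.9% increase in 2025. Third, a survey of 97 researchers reveals a "verification gap": 41.5% of researchers copy-paste BibTeX without checking, and 76.7% of reviewers do not thoroughly check references. The study shows that unreliable AI tools and insufficient human verification together let fabricated citations contaminate the scientific record, and it recommends coordinated interventions by researchers, venues, and tool developers to protect citation integrity.
Link: https://arxiv.org/abs/2602.06718
Authors: Zuyao Xu,Yuqi Qiu,Lu Sun,FaSheng Miao,Fubin Wu,Xinyi Wang,Xiang Li,Haozhe Lu,ZhengZe Zhang,Yuxin Hu,Jialu Li,Jin Luo,Feng Zhang,Rui Luo,Xinran Liu,Yingxian Li,Jiaji Liu
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: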
Abstract:Citations provide the basis for trusting scientific claims; when they are invalid or fabricated, this trust collapses. With the advent of Large Language Models (LLMs), this risk has intensified: LLMs are increasingly used for academic writing, yet their tendency to fabricate citations ("ghost citations") poses a systemic threat to citation validity. To quantify this threat and inform mitigation, we develop CiteVerifier, an open-source framework for large-scale citation verification, and conduct the first comprehensive study of citation validity in the LLM era through three experiments built on it. We benchmark 13 state-of-the-art LLMs on citation generation across 40 research domains, finding that all models hallucinate citations at rates from 14.23% to 94.93%, with significant variation across research domains. Moreover, we analyze 2.2 million citations from 56,381 papers published at top-tier AI/ML and Security venues (2020-2025), confirming that 1.07% of papers contain invalid or fabricated citations (604 papers), with an 80.9% increase in 2025 alone. Furthermore, we survey 97 researchers and analyze 94 valid responses after removing 3 conflicting samples, revealing a critical "verification gap": 41.5% of researchers copy-paste BibTeX without checking and 44.4% choose no-action responses when encountering suspicious references; meanwhile, 76.7% of reviewers do not thoroughly check references and 80.0% never suspect fake citations. Our findings reveal an accelerating crisis where unreliable AI tools, combined with inadequate human verification by researchers and insufficient peer review scrutiny, enable fabricated citations to contaminate the scientific record. We propose interventions for researchers, venues, and tool developers to protect citation integrity.
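As a quick consistency check on the figures quoted in this abstract, the paper-level share and the survey sample size can be recomputed from the stated counts. The variable names are chosen for this example only.

```python
papers_total = 56381      # papers analyzed, as stated in the abstract
papers_flagged = 604      # papers with invalid or fabricated citations

share_percent = papers_flagged / papers_total * 100

survey_received = 97      # researchers surveyed
survey_removed = 3        # conflicting samples removed
survey_valid = survey_received - survey_removed
```

The recomputed share rounds to the reported 1.07%, and the survey count matches the 94 valid responses.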
[AI-24] F-GRPO: Don't Let Your Policy Learn the Obvious and Forget the Rare
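[Quick read]: This paper studies a problem with group sampling in Reinforcement Learning with Verifiable Rewards (RLVR): compute limits force small group sizes in practice, and small groups tend to miss rare-correct trajectories, so policy updates favor common solutions and fail to explore diverse, high-quality ones. The key to the solution is a difficulty-aware advantage scaling coefficient inspired by Focal loss, which down-weights updates on high-success prompts and thereby reduces over-focus on frequent paths in favor of learning rare but correct ones. This lightweight change integrates directly into mainstream group-relative RLVR algorithms such as GRPO, DAPO, and CISPO, and improves results on in-domain and out-of-domain benchmarks without extra computational cost; on Qwen2.5-7B, pass@256 rises by 3.2 to 6.2 points depending on the base algorithm.
Link: https://arxiv.org/abs/2602.06717
Authors: Daniil Plyusov,Alexey Gorbatovski,Boris Shaposhnikov,Viacheslav Sinii,Alexey Malakhov,Daniil Gavrilov
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: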
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) is commonly based on group sampling to estimate advantages and stabilize policy updates. In practice, large group sizes are not feasible due to computational limits, which biases learning toward trajectories that are already likely. Smaller groups often miss rare-correct trajectories while still containing mixed rewards, concentrating probability on common solutions. We derive the probability that updates miss rare-correct modes as a function of group size, showing non-monotonic behavior, and characterize how updates redistribute mass within the correct set, revealing that unsampled-correct mass can shrink even as total correct mass grows. Motivated by this analysis, we propose a difficulty-aware advantage scaling coefficient, inspired by Focal loss, that down-weights updates on high-success prompts. The lightweight modification can be directly integrated into any group-relative RLVR algorithm such as GRPO, DAPO, and CISPO. On Qwen2.5-7B across in-domain and out-of-domain benchmarks, our method improves pass@256 from 64.1 → 70.3 (GRPO), 69.3 → 72.5 (DAPO), and 73.2 → 76.8 (CISPO), while preserving or improving pass@1, without increasing group size or computational cost.
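The idea of down-weighting updates on high-success prompts can be sketched as follows. The abstract does not give the exact coefficient, so the `(1 - success_rate) ** gamma` form below is an assumption borrowed from the usual Focal loss modulating factor, and the function names and `gamma` default are likewise hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def focal_scaled_advantages(rewards, gamma=2.0):
    """Scale the group's advantages by an assumed focal-style coefficient.

    Rewards are taken to be binary (1 = verified correct), so their mean
    is the empirical success rate of the prompt.
    """
    success_rate = statistics.fmean(rewards)
    scale = (1.0 - success_rate) ** gamma
    return [scale * a for a in group_relative_advantages(rewards)]


easy_prompt = focal_scaled_advantages([1, 1, 1, 0])   # 75% success
hard_prompt = focal_scaled_advantages([0, 0, 0, 1])   # 25% success
solved_prompt = focal_scaled_advantages([1, 1, 1, 1])  # no learning signal
```

With the same group size, the hard prompt contributes a much larger update than the easy one, which is the intended shift of learning signal toward rare-correct trajectories.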
[AI-25] Autoregressive Models for Knowledge Graph Generation
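[Quick read]: This paper addresses how knowledge graph (KG) generation can model complex semantic dependencies while respecting domain validity constraints. Traditional approaches such as link prediction score triples independently and struggle to capture semantic coherence at the subgraph level. This work proposes ARK, a family of autoregressive models that treat a knowledge graph as a sequence of triples and learn constraints such as type consistency, temporal validity, and relational patterns implicitly from data, generating semantically valid novel graphs without explicit rule supervision. The key contributions are: 1) modeling global dependencies between triples autoregressively; 2) SAIL, a variational extension that enables controlled generation and supports conditional completion from partial graphs; and 3) the empirical finding that model capacity (hidden dimensionality = 64) matters more than architectural depth, with recurrent architectures reaching validity comparable to transformer-based alternatives at substantially lower computational cost.
Link: https://arxiv.org/abs/2602.06707
Authors: Thiviyan Thanapalasingam,Antonis Vozikis,Peter Bloem,Paul Groth
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: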
Abstract:Knowledge Graph (KG) generation requires models to learn complex semantic dependencies between triples while maintaining domain validity constraints. Unlike link prediction, which scores triples independently, generative models must capture interdependencies across entire subgraphs to produce semantically coherent structures. We present ARK (Auto-Regressive Knowledge Graph Generation), a family of autoregressive models that generate KGs by treating graphs as sequences of (head, relation, tail) triples. ARK learns implicit semantic constraints directly from data, including type consistency, temporal validity, and relational patterns, without explicit rule supervision. On the IntelliGraphs benchmark, our models achieve 89.2% to 100.0% semantic validity across diverse datasets while generating novel graphs not seen during training. We also introduce SAIL, a variational extension of ARK that enables controlled generation through learned latent representations, supporting both unconditional sampling and conditional completion from partial graphs. Our analysis reveals that model capacity (hidden dimensionality = 64) is more critical than architectural depth for KG generation, with recurrent architectures achieving comparable validity to transformer-based alternatives while offering substantial computational efficiency. These results demonstrate that autoregressive models provide an effective framework for KG generation, with practical applications in knowledge base completion and query answering.
[AI-26] SaDiT: Efficient Protein Backbone Design via Latent Structural Tokenization and Diffusion Transformers
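[Quick read]: This paper addresses the computational bottleneck of diffusion-based de novo protein backbone design: existing methods can generate novel protein structures, but their cost makes large-scale structural exploration difficult. The key to the solution is the SaDiT framework, which uses SaProt tokenization to compress protein geometry into a low-dimensional discrete latent space, significantly reducing the complexity of generation while preserving SE(3) equivariance. It also introduces an IPA Token Cache mechanism that reuses computed token states in the Invariant Point Attention (IPA) layers during iterative sampling. Together these changes substantially improve generation speed and designability while keeping structures viable.
Link: https://arxiv.org/abs/2602.06706
Authors: Shentong Mo,Lanqing Li
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: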
Abstract:Generative models for de novo protein backbone design have achieved remarkable success in creating novel protein structures. However, these diffusion-based approaches remain computationally intensive and slower than desired for large-scale structural exploration. While recent efforts like Proteina have introduced flow-matching to improve sampling efficiency, the potential of tokenization for structural compression and acceleration remains largely unexplored in the protein domain. In this work, we present SaDiT, a novel framework that accelerates protein backbone generation by integrating SaProt Tokenization with a Diffusion Transformer (DiT) architecture. SaDiT leverages a discrete latent space to represent protein geometry, significantly reducing the complexity of the generation process while maintaining theoretical SE(3) equivalence. To further enhance efficiency, we introduce an IPA Token Cache mechanism that optimizes the Invariant Point Attention (IPA) layers by reusing computed token states during iterative sampling. Experimental results demonstrate that SaDiT outperforms state-of-the-art models, including RFDiffusion and Proteina, in both computational speed and structural viability. We evaluate our model across unconditional backbone generation and fold-class conditional generation tasks, where SaDiT shows superior ability to capture complex topological features with high designability.
[AI-27] Multimodal Generative Retrieval Model with Staged Pretraining for Food Delivery on Meituan
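[Quick read]: This paper addresses modality dominance and inconsistent training speeds in the joint optimization of multimodal retrieval models, which can leave some modalities neglected or trigger the one-epoch problem and so prevent the model from using multimodal features effectively. The key to the solution is a staged pretraining strategy that guides the model to focus on a specific task at each stage, so that it attends to and uses multimodal features more effectively, while allowing flexible control of training at each stage to avoid imbalance. In addition, to make better use of semantic IDs (SIDs), which compress high-dimensional multimodal embeddings, the authors design generative and discriminative tasks that help the model understand the associations between SIDs, queries, and item features, improving overall retrieval performance.
Link: https://arxiv.org/abs/2602.06654
Authors: Boyu Chen,Tai Guo,Weiyu Cui,Yuqing Li,Xingxing Wang,Chuan Shi,Cheng Yang
Affiliation: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: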
Abstract:Multimodal retrieval models are becoming increasingly important in scenarios such as food delivery, where rich multimodal features can meet diverse user needs and enable precise retrieval. Mainstream approaches typically employ a dual-tower architecture between queries and items, and perform joint optimization of intra-tower and inter-tower tasks. However, we observe that joint optimization often leads to certain modalities dominating the training process, while other modalities are neglected. In addition, inconsistent training speeds across modalities can easily result in the one-epoch problem. To address these challenges, we propose a staged pretraining strategy, which guides the model to focus on specialized tasks at each stage, enabling it to effectively attend to and utilize multimodal features, and allowing flexible control over the training process at each stage to avoid the one-epoch problem. Furthermore, to better utilize the semantic IDs that compress high-dimensional multimodal embeddings, we design both generative and discriminative tasks to help the model understand the associations between SIDs, queries, and item features, thereby improving overall performance. Extensive experiments on large-scale real-world Meituan data demonstrate that our method achieves improvements of 3.80%, 2.64%, and 2.17% on R@5, R@10, and R@20, and 5.10%, 4.22%, and 2.09% on N@5, N@10, and N@20 compared to mainstream baselines. Online A/B testing on the Meituan platform shows that our approach achieves a 1.12% increase in revenue and a 1.02% increase in click-through rate, validating the effectiveness and superiority of our method in practical applications.
[AI-28] RAPID: Reconfigurable Adaptive Platform for Iterative Design
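[Quick read]: This paper addresses slow iteration in developing robotic manipulation policies when the end-effector changes: with traditional workflows, swapping a gripper or sensor configuration requires mechanical refitting and system re-integration, which lengthens the experimental cycle considerably. The key to the solution is RAPID, a full-stack reconfigurable platform. Its core innovations are a tool-free, modular hardware architecture that reduces reconfiguration to seconds, and a driver-level Physical Mask that tracks the hardware state in real time. The mask is derived from USB events and exposes modality presence as an explicit runtime signal, which enables auto-configuration and graceful degradation under sensor hot-plug events so that policies keep executing. This design makes multi-modal ablation studies practical and cuts setup time for multi-modal configurations by two orders of magnitude.
Link: https://arxiv.org/abs/2602.06653
Authors: Zi Yin,Fanhong Li,Shurui Zheng,Jia Liu
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: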
Abstract:Developing robotic manipulation policies is iterative and hypothesis-driven: researchers test tactile sensing, gripper geometries, and sensor placements through real-world data collection and training. Yet even minor end-effector changes often require mechanical refitting and system re-integration, slowing iteration. We present RAPID, a full-stack reconfigurable platform designed to reduce this friction. RAPID is built around a tool-free, modular hardware architecture that unifies handheld data collection and robot deployment, and a matching software stack that maintains real-time awareness of the underlying hardware configuration through a driver-level Physical Mask derived from USB events. This modular hardware architecture reduces reconfiguration to seconds and makes systematic multi-modal ablation studies practical, allowing researchers to sweep diverse gripper and sensing configurations without repeated system bring-up. The Physical Mask exposes modality presence as an explicit runtime signal, enabling auto-configuration and graceful degradation under sensor hot-plug events, so policies can continue executing when sensors are physically added or removed. System-centric experiments show that RAPID reduces the setup time for multi-modal configurations by two orders of magnitude compared to traditional workflows and preserves policy execution under runtime sensor hot-unplug events. The hardware designs, drivers, and software stack are open-sourced at this https URL .
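[AI-29] Humanoid Manipulation Interface: Humanoid Whole-Body Manipulation from Robot-Free Demonstrations
Quick read: This paper addresses the inefficient data collection, complex hardware logistics, and difficult reward design that arise in humanoid whole-body manipulation from relying on teleoperation or visual sim-to-real reinforcement learning, all of which limit how well autonomous skills generalize to uncontrolled environments. The key to the solution is HuMI (Humanoid Manipulation Interface), a portable and efficient framework that captures human whole-body motion with lightweight hardware, enabling robot-free data collection. On top of this data it builds a hierarchical learning pipeline that maps human motions to executable, physically feasible robot skills, improving data efficiency and adaptability to new environments.
Link: https://arxiv.org/abs/2602.06643
Authors: Ruiqian Nai,Boyuan Zheng,Junming Zhao,Haodong Zhu,Sicong Dai,Zunhao Chen,Yihang Hu,Yingdong Hu,Tong Zhang,Chuan Wen,Yang Gao
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Website: this https URL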
Abstract:Current approaches for humanoid whole-body manipulation, primarily relying on teleoperation or visual sim-to-real reinforcement learning, are hindered by hardware logistics and complex reward engineering. Consequently, demonstrated autonomous skills remain limited and are typically restricted to controlled environments. In this paper, we present the Humanoid Manipulation Interface (HuMI), a portable and efficient framework for learning diverse whole-body manipulation tasks across various environments. HuMI enables robot-free data collection by capturing rich whole-body motion using portable hardware. This data drives a hierarchical learning pipeline that translates human motions into dexterous and feasible humanoid skills. Extensive experiments across five whole-body tasks–including kneeling, squatting, tossing, walking, and bimanual manipulation–demonstrate that HuMI achieves a 3x increase in data collection efficiency compared to teleoperation and attains a 70% success rate in unseen environments.
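[AI-30] Temperature Scaling Attack Disrupting Model Confidence in Federated Learning
Quick read: This paper addresses malicious attacks on confidence calibration in federated learning: existing attacks mostly target accuracy or implant backdoors and overlook systematic corruption of predictive confidence. The key to the solution is the Temperature Scaling Attack (TSA), which injects temperature scaling with learning rate-temperature coupling during local training so that malicious updates retain benign-like optimization behavior, evading accuracy-based monitoring and similarity-based detection. The method preserves standard convergence bounds under non-IID settings while significantly distorting confidence outputs, creating a serious risk of misjudgment (for example, up to 7.2x increases in missed critical cases in healthcare or in false alarms in autonomous driving), and shows that calibration integrity is a critical attack surface in federated learning.
Link: https://arxiv.org/abs/2602.06638
Authors: Kichang Lee,Jaeho Jin,JaeYeon Park,Songkuk Kim,JeongGil Ko
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: 20 pages, 20 figures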
Abstract:Predictive confidence serves as a foundational control signal in mission-critical systems, directly governing risk-aware logic such as escalation, abstention, and conservative fallback. While prior federated learning attacks predominantly target accuracy or implant backdoors, we identify confidence calibration as a distinct attack objective. We present the Temperature Scaling Attack (TSA), a training-time attack that degrades calibration while preserving accuracy. By injecting temperature scaling with learning rate-temperature coupling during local training, malicious updates maintain benign-like optimization behavior, evading accuracy-based monitoring and similarity-based detection. We provide a convergence analysis under non-IID settings, showing that this coupling preserves standard convergence bounds while systematically distorting confidence. Across three benchmarks, TSA substantially shifts calibration (e.g., 145% error increase on CIFAR-100) with 2 accuracy change, and remains effective under robust aggregation and post-hoc calibration defenses. Case studies further show that confidence manipulation can cause up to 7.2x increases in missed critical cases (healthcare) or false alarms (autonomous driving), even when accuracy is unchanged. Overall, our results establish calibration integrity as a critical attack surface in federated learning.
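The mechanism behind the attack's name can be illustrated with a minimal sketch of temperature scaling on logits. This is not the authors' attack (which couples temperature with the learning rate during local training); it only shows why dividing logits by a temperature T changes the reported confidence while leaving the predicted class, and hence accuracy, unchanged. The function names and the example logits are assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits / temperature (numerically stabilized)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits, temperature=1.0):
    """Return (predicted class index, confidence assigned to that class)."""
    probs = softmax(logits, temperature)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]

logits = [2.0, 1.0, 0.1]  # hypothetical logits for one sample
for t in (0.5, 1.0, 2.0):
    label, conf = predict(logits, t)
    # The class stays the same; only the confidence moves with T.
    print(f"T={t}: class={label}, confidence={conf:.3f}")
```

Because the argmax is invariant to a positive rescaling of the logits, a monitor that only tracks accuracy would not see this change, which is consistent with the abstract's claim about evading accuracy-based monitoring.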
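[AI-31] Trust Regions Sell, But Who's Buying? Overlap Geometry as an Alternative Trust Region for Policy Optimization
Quick read: This paper addresses the fact that the KL-divergence constraint on policy updates in standard trust-region methods (such as TRPO) struggles in practice to prevent rare but drastic likelihood-ratio excursions, which is the root reason algorithms such as PPO introduce clipping. The key to the solution is overlap geometry as an alternative trust region: distributional overlap is constrained directly via the Bhattacharyya coefficient, giving tighter control over separation in the ratio tails. Concretely, the paper proposes Bhattacharyya-TRPO (BTRPO) and Bhattacharyya-PPO (BPPO): BPPO clips the square-root ratio q = \sqrt{r}, while BTRPO applies a quadratic Hellinger/Bhattacharyya penalty. Experiments show that, under matched training budgets, the approach improves robustness and aggregate performance on reinforcement learning tasks, offering a more principled alternative for stable policy optimization.
Link: https://arxiv.org/abs/2602.06627
Authors: Gaurish Trivedi,Alakh Sharma,Kartikey Singh Bhandari,Yash Sinha,Pratik Narang,Dhruv Kumar,Jagat Sesh Challa
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under Review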
Abstract:Standard trust-region methods constrain policy updates via Kullback-Leibler (KL) divergence. However, KL controls only an average divergence and does not directly prevent rare, large likelihood-ratio excursions that destabilize training, which is precisely the failure mode that motivates heuristics such as PPO's clipping. We propose overlap geometry as an alternative trust region, constraining distributional overlap via the Bhattacharyya coefficient (closely related to the Hellinger/Renyi-1/2 geometry). This objective penalizes separation in the ratio tails, yielding tighter control over likelihood-ratio excursions without relying on total variation bounds that can be loose in tail regimes. We derive Bhattacharyya-TRPO (BTRPO) and Bhattacharyya-PPO (BPPO), enforcing overlap constraints via square-root ratio updates: BPPO clips the square-root ratio q = \sqrt{r}, and BTRPO applies a quadratic Hellinger/Bhattacharyya penalty. Empirically, overlap-based updates improve robustness and aggregate performance as measured by RLiable under matched training budgets, suggesting overlap constraints as a practical, principled alternative to KL for stable policy optimization.
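A minimal sketch of the quantities named in the abstract, for discrete action distributions. The Bhattacharyya coefficient and squared Hellinger distance follow their standard definitions; the clipped surrogate on the square-root ratio is an assumed PPO-style form (with a hypothetical clip range `eps`), not necessarily the exact BPPO objective.

```python
import math

def bhattacharyya_coefficient(p, q):
    """BC(p, q) = sum_i sqrt(p_i * q_i); equals 1 for identical distributions."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

def squared_hellinger(p, q):
    """H^2(p, q) = 1 - BC(p, q)."""
    return 1.0 - bhattacharyya_coefficient(p, q)

def clipped_sqrt_ratio_surrogate(new_prob, old_prob, advantage, eps=0.1):
    """Assumed PPO-style clipped surrogate applied to q = sqrt(r), r = new/old."""
    q = math.sqrt(new_prob / old_prob)
    q_clipped = min(max(q, 1.0 - eps), 1.0 + eps)
    return min(q * advantage, q_clipped * advantage)

old_policy = [0.7, 0.3]  # hypothetical action probabilities
new_policy = [0.5, 0.5]

# The expected square-root ratio under the old policy is the overlap itself:
# E_old[sqrt(new/old)] = sum_i old_i * sqrt(new_i / old_i) = BC(old, new).
expected_q = sum(o * math.sqrt(n / o) for o, n in zip(old_policy, new_policy))
print(expected_q, bhattacharyya_coefficient(old_policy, new_policy))
```

The identity in the last comment is why constraining the square-root ratio can be read as constraining distributional overlap.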
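[AI-32] The challenge of generating and evolving real-life like synthetic test data without accessing real-world raw data – a Systematic Review
Quick read: This paper addresses the key challenge of generating and evolving realistic synthetic test data without using real raw data, in particular the tension between privacy protection and test realism faced by applications that take e-Government service data as input (for example cross-border information exchange, healthcare, and finance). The review identifies and summarizes the privacy-preserving data anonymization and synthesis techniques in existing approaches, and finds that most of them still require direct access to real data for preprocessing or generation and lack support for evolving synthetic data over time. The study therefore highlights the ability to evolve synthetic test data as a direction that needs to be explored in digital government solutions, in response to increasingly strict legal regulations.
Link: https://arxiv.org/abs/2602.06609
Authors: Maj-Annika Tammisto,Faiz Ali Shah,Daniel Rodriguez,Dietmar Pfahl
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 22 pages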
Abstract:Background: High-level system testing of applications that use data from e-Government services as input requires test data that is real-life-like but where the privacy of personal information is guaranteed. Applications with such strong requirement include information exchange between countries, medicine, banking, etc. This review aims to synthesize the current state-of-the-practice in this domain. Objectives: The objective of this Systematic Review is to identify existing approaches for creating and evolving synthetic test data without using real-life raw data. Methods: We followed well-known methodologies for conducting systematic literature reviews, including the ones from Kitchenham as well as guidelines for analysing the limitations of our review and its threats to validity. Results: A variety of methods and tools exist for creating privacy-preserving test data. Our search found 1,013 publications in IEEE Xplore, ACM Digital Library, and SCOPUS. We extracted data from 75 of those publications and identified 37 approaches that answer our research question partly. A common prerequisite for using these methods and tools is direct access to real-life data for data anonymization or synthetic test data generation. Nine existing synthetic test data generation approaches were identified that were closest to answering our research question. Nevertheless, further work would be needed to add the ability to evolve synthetic test data to the existing approaches. Conclusions: None of the publications really covered our requirements completely, only partially. Synthetic test data evolution is a field that has not received much attention from researchers but needs to be explored in Digital Government Solutions, especially since new legal regulations are being placed in force in many countries. 
Journal reference: Expert Systems, 2025; 42:e70164. Related DOI: https://doi.org/10.1111/exsy.70164
Submission history: [v1] Fri, 6 Feb 2026 11:12:54 UTC
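[AI-33] Scaling Speech Tokenizers with Diffusion Autoencoders ICLR2026
Quick read: This paper addresses the trade-off between semantic encoding and acoustic reconstruction in existing speech tokenizers, as well as how to reach low bit rates and low token rates. The key to the solution is a diffusion autoencoder architecture, the Speech Diffusion Tokenizer (SiTok), which jointly learns semantically rich representations through supervised learning and uses diffusion for high-fidelity audio reconstruction. Scaled to 1.6B parameters and trained on 2 million hours of speech, it reaches an extremely low token rate of 12.5 Hz and a bit rate of 200 bit/s while outperforming strong baselines on understanding, reconstruction, and generation tasks.
Link: https://arxiv.org/abs/2602.06602
Authors: Yuancheng Wang,Zhenyu Tang,Yun Wang,Arthur Hinsvark,Yingru Liu,Yinghao Li,Kainan Peng,Junyi Ao,Mingbo Ma,Mike Seltzer,Qing He,Xubo Liu
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: ICLR 2026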
Abstract:Speech tokenizers are foundational to speech language models, yet existing approaches face two major challenges: (1) balancing trade-offs between encoding semantics for understanding and acoustics for reconstruction, and (2) achieving low bit rates and low token rates. We propose Speech Diffusion Tokenizer (SiTok), a diffusion autoencoder that jointly learns semantic-rich representations through supervised learning and enables high-fidelity audio reconstruction with diffusion. We scale SiTok to 1.6B parameters and train it on 2 million hours of speech. Experiments show that SiTok outperforms strong baselines on understanding, reconstruction and generation tasks, at an extremely low token rate of 12.5 Hz and a bit-rate of 200 bits-per-second.
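The two reported rates imply a per-token bit budget, worked out below as simple arithmetic. Reading that budget as a single flat codebook is an assumption made here for illustration; the abstract does not describe the quantizer.

```python
token_rate_hz = 12.5   # tokens per second, from the abstract
bit_rate_bps = 200.0   # bits per second, from the abstract

# Bits carried by each token at these two rates.
bits_per_token = bit_rate_bps / token_rate_hz

# Assumption: if every token came from one flat codebook, that codebook
# would need 2 ** bits_per_token entries.
codebook_size_if_single = 2 ** int(bits_per_token)

print(bits_per_token, codebook_size_if_single)
```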
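[AI-34] AgentStepper: Interactive Debugging of Software Development Agents
Quick read: This paper addresses the lack of interpretability and controllability when debugging software development agents based on large language models (LLMs). Current tools struggle to present the intermediate state of an agent's execution, such as LLM query trajectories, tool-call sequences, and code modifications, in a clear and structured way, so developers have difficulty understanding and locating errors. The key to the solution is AgentStepper, the first interactive debugger for LLM-driven software engineering agents. Its core idea is to model agent behavior as a structured conversation among the LLM, the agent program, and tools, with support for breakpoints, stepwise execution, and live editing of prompts and tool invocations, together with a display of repository-level code changes. This higher-level abstraction lets developers analyze and intervene in agent behavior much as they would debug a conventional program, improving debugging efficiency and accuracy.
Link: https://arxiv.org/abs/2602.06593
Authors: Robert Hutter,Michael Pradel
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: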
Abstract:Software development agents powered by large language models (LLMs) have shown great promise in automating tasks like environment setup, issue solving, and program repair. Unfortunately, understanding and debugging such agents remain challenging due to their complex and dynamic nature. Developers must reason about trajectories of LLM queries, tool calls, and code modifications, but current techniques reveal little of this intermediate process in a comprehensible format. The key insight of this paper is that debugging software development agents shares many similarities with conventional debugging of software programs, yet requires a higher level of abstraction that raises the level from low-level implementation details to high-level agent actions. Drawing on this insight, we introduce AgentStepper, the first interactive debugger for LLM-based software engineering agents. AgentStepper enables developers to inspect, control, and interactively manipulate agent trajectories. AgentStepper represents trajectories as structured conversations among an LLM, the agent program, and tools. It supports breakpoints, stepwise execution, and live editing of prompts and tool invocations, while capturing and displaying intermediate repository-level code changes. Our evaluation applies AgentStepper to three state-of-the-art software development agents, ExecutionAgent, SWE-Agent, and RepairAgent, showing that integrating the approach into existing agents requires minor code changes (39-42 edited lines). Moreover, we report on a user study with twelve participants, indicating that AgentStepper improves the ability of participants to interpret trajectories (64% vs. 67% mean performance) and identify bugs in the agent’s implementation (17% vs. 60% success rate), while reducing perceived workload (e.g., frustration reduced from 5.4/7.0 to 2.4/7.0) compared to conventional tools.
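[AI-35] Target noise: A pre-training based neural network initialization for efficient high resolution learning
Quick read: This paper addresses the strong influence of weight initialization on optimization behavior and convergence efficiency, and in particular the limitation that traditional random initialization methods (such as Xavier and Kaiming) do not exploit information from the optimization process itself. The key to the solution is an initialization strategy based on self-supervised pre-training: the network is first trained to fit random noise, which yields a structured parameter configuration and significantly speeds up convergence on subsequent tasks. The method needs no additional data or architectural changes and is particularly suited to implicit neural representations (INRs) and Deep Image Prior (DIP)-style models, which have a strong low-frequency bias; it lets the network capture high-frequency components earlier, giving faster and more stable optimization.
Link: https://arxiv.org/abs/2602.06585
Authors: Shaowen Wang,Tariq Alkhalifah
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 12 figures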
Abstract:Weight initialization plays a crucial role in the optimization behavior and convergence efficiency of neural networks. Most existing initialization methods, such as Xavier and Kaiming initializations, rely on random sampling and do not exploit information from the optimization process itself. We propose a simple, yet effective, initialization strategy based on self-supervised pre-training using random noise as the target. Instead of directly training the network from random weights, we first pre-train it to fit random noise, which leads to a structured and non-random parameter configuration. We show that this noise-driven pre-training significantly improves convergence speed in subsequent tasks, without requiring additional data or changes to the network architecture. The proposed method is particularly effective for implicit neural representations (INRs) and Deep Image Prior (DIP)-style networks, which are known to exhibit a strong low-frequency bias during optimization. After noise-based pre-training, the network is able to capture high-frequency components much earlier in training, leading to faster and more stable convergence. Although random noise contains no semantic information, it serves as an effective self-supervised signal (considering its white spectrum nature) for shaping the initialization of neural networks. Overall, this work demonstrates that noise-based pre-training offers a lightweight and general alternative to traditional random initialization, enabling more efficient optimization of deep neural networks.
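[AI-36] Exploring Sparsity and Smoothness of Arbitrary $\ell_p$ Norms in Adversarial Attacks
Quick read: This paper addresses the lack of a systematic analysis of how the $\ell_p$ norm parameter is chosen in adversarial attack research, and in particular how different values of $p \in [1,2]$ affect the sparsity and smoothness of adversarial perturbations. The key to the solution is a quantitative evaluation framework that adopts two established sparsity measures and introduces three smoothness measures, including one based on first-order Taylor approximations and a general framework for deriving measures from smoothing operations. Through extensive experiments on multiple real-world image datasets and a range of model architectures (both convolutional and transformer-based networks), the authors find that $\ell_1$ or $\ell_2$ is usually not the best choice, and that $\ell_p$ norms with $p \in [1.3, 1.5]$ give the best trade-off between sparsity and smoothness, underlining the importance of principled norm selection when designing and evaluating adversarial attacks.
Link: https://arxiv.org/abs/2602.06578
Authors: Christof Duhme,Florian Eilers,Xiaoyi Jiang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: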
Abstract:Adversarial attacks against deep neural networks are commonly constructed under $\ell_p$ norm constraints, most often using $p=1$, $p=2$ or $p=\infty$, and potentially regularized for specific demands such as sparsity or smoothness. These choices are typically made without a systematic investigation of how the norm parameter $p$ influences the structural and perceptual properties of adversarial perturbations. In this work, we study how the choice of $p$ affects sparsity and smoothness of adversarial attacks generated under $\ell_p$ norm constraints for values of $p \in [1,2]$. To enable a quantitative analysis, we adopt two established sparsity measures from the literature and introduce three smoothness measures. In particular, we propose a general framework for deriving smoothness measures based on smoothing operations and additionally introduce a smoothness measure based on first-order Taylor approximations. Using these measures, we conduct a comprehensive empirical evaluation across multiple real-world image datasets and a diverse set of model architectures, including both convolutional and transformer-based networks. We show that the choice of $\ell_1$ or $\ell_2$ is suboptimal in most cases and the optimal $p$ value is dependent on the specific task. In our experiments, using $\ell_p$ norms with $p \in [1.3, 1.5]$ yields the best trade-off between sparse and smooth attacks. These findings highlight the importance of principled norm selection when designing and evaluating adversarial attacks.
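A minimal sketch of how the norm parameter interacts with sparsity, using the standard $\ell_p$ norm definition. The Hoyer sparsity measure is used here only as one well-known example; the abstract does not say which two sparsity measures the paper adopts, so that choice and the toy perturbations are assumptions.

```python
import math

def lp_norm(x, p):
    """Standard l_p norm: (sum_i |x_i|^p)^(1/p), for p >= 1."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

def hoyer_sparsity(x):
    """Hoyer sparsity in [0, 1]: 1 for a one-hot vector, 0 for a constant one.
    Chosen for illustration; not confirmed as one of the paper's measures."""
    n = len(x)
    l1 = sum(abs(v) for v in x)
    l2 = math.sqrt(sum(v * v for v in x))
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1.0)

# Two hypothetical perturbations with the same l_2 norm (both equal 1).
sparse = [1.0, 0.0, 0.0, 0.0]
dense = [0.5, 0.5, 0.5, 0.5]

for p in (1.0, 1.3, 1.5, 2.0):
    # The sparse vector has norm 1 for every p; the dense one costs more
    # as p approaches 1, so smaller p budgets favor sparse perturbations.
    print(p, lp_norm(sparse, p), lp_norm(dense, p))

print(hoyer_sparsity(sparse), hoyer_sparsity(dense))
```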
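[AI-37] Perturbing the Phase: Analyzing Adversarial Robustness of Complex-Valued Neural Networks
Quick read: This paper addresses the robustness of complex-valued neural networks (CVNNs) to outliers in practical applications, in particular their stability under adversarial attacks. The key to the solution is Phase Attacks, a new attack designed specifically to target the phase information of the input, together with complex-valued versions of commonly used adversarial attacks. Experiments show that CVNNs are more robust than real-valued neural networks (RVNNs) in some scenarios, but that both are highly sensitive to phase perturbations: at equal strength, Phase Attacks degrade model performance more than conventional attacks that perturb both phase and magnitude, revealing the critical role of phase information in complex-valued representations.
Link: https://arxiv.org/abs/2602.06577
Authors: Florian Eilers,Christof Duhme,Xiaoyi Jiang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: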
Abstract:Complex-valued neural networks (CVNNs) are rising in popularity for all kinds of applications. To safely use CVNNs in practice, analyzing their robustness against outliers is crucial. One well-known technique to understand the behavior of deep neural networks is to investigate their behavior under adversarial attacks, which can be seen as worst-case minimal perturbations. We design Phase Attacks, a kind of attack specifically targeting the phase information of complex-valued inputs. Additionally, we derive complex-valued versions of commonly used adversarial attacks. We show that in some scenarios CVNNs are more robust than RVNNs and that both are very susceptible to phase changes, with the Phase Attacks decreasing the model performance more than equally strong regular attacks, which can attack both phase and magnitude.
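A minimal sketch of what a phase-only perturbation means for complex-valued inputs: each value is rotated in the complex plane while its magnitude is left untouched. How the attack actually chooses the rotation angles (for example from gradients) is not described in the abstract, so the angles below are hypothetical placeholders.

```python
import cmath

def perturb_phase(z, delta):
    """Rotate complex value z by delta radians; |z| is unchanged."""
    return z * cmath.exp(1j * delta)

signal = [1 + 1j, -2 + 0.5j, 0.3 - 0.4j]  # hypothetical complex-valued input
deltas = [0.1, -0.2, 0.05]                # hypothetical per-element phase shifts

perturbed = [perturb_phase(z, d) for z, d in zip(signal, deltas)]

for z, p in zip(signal, perturbed):
    # Magnitudes match; only the phase differs.
    print(abs(z), abs(p), cmath.phase(z), cmath.phase(p))
```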
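[AI-38] Transformer-based Parameter Fitting of Models derived from Bloch-McConnell Equations for CEST MRI Analysis
Quick read: This paper addresses the difficulty of quantifying chemical exchange saturation transfer (CEST) MRI data, which arises because the CEST signal results from a complex interplay of many physiological variables, leaving traditional gradient-based solvers with insufficient accuracy in parameter estimation. The key to the solution is a transformer-based neural network, trained in a self-supervised manner to fit a physical model derived from the Bloch-McConnell equations to CEST spectra. It extracts key parameters such as metabolite concentrations, exchange rates, and relaxation rates efficiently and accurately, and clearly outperforms traditional optimization algorithms.
Link: https://arxiv.org/abs/2602.06574
Authors: Christof Duhme,Chris Lippe,Verena Hoerr,Xiaoyi Jiang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: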
Abstract:Chemical exchange saturation transfer (CEST) MRI is a non-invasive imaging modality for detecting metabolites. It offers higher resolution and sensitivity compared to conventional magnetic resonance spectroscopy (MRS). However, quantification of CEST data is challenging because the measured signal results from a complex interplay of many physiological variables. Here, we introduce a transformer-based neural network that fits the parameters of a physical model derived from the Bloch-McConnell equations, such as metabolite concentrations and exchange and relaxation rates, to in-vitro CEST spectra. We show that our neural network, trained in a self-supervised manner, clearly outperforms the solution of a classical gradient-based solver.
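[AI-39] Which Graph Shift Operator? A Spectral Answer to an Empirical Question
Quick read: This paper addresses the lack of theoretical guidance for choosing the graph shift operator (GSO) in graph neural networks (GNNs), that is, how to select the best GSO for a given prediction task without empirical trial and error. The key to the solution is a new alignment gain metric that quantifies the geometric distortion between the input-signal and label subspaces and, through a spectral proxy for the Lipschitz constant, ties this alignment to generalization bounds. The result is a theoretically grounded, computationally efficient criterion that can rank and select GSOs before training.
Link: https://arxiv.org/abs/2602.06557
Authors: Yassine Abbahaddou
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: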
Abstract:Graph Neural Networks (GNNs) have established themselves as the leading models for learning on graph-structured data, generally categorized into spatial and spectral approaches. Central to these architectures is the Graph Shift Operator (GSO), a matrix representation of the graph structure used to filter node signals. However, selecting the optimal GSO, whether fixed or learnable, remains largely empirical. In this paper, we introduce a novel alignment gain metric that quantifies the geometric distortion between the input signal and label subspaces. Crucially, our theoretical analysis connects this alignment directly to generalization bounds via a spectral proxy for the Lipschitz constant. This yields a principled, computation-efficient criterion to rank and select the optimal GSO for any prediction task prior to training, eliminating the need for extensive search.
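For readers unfamiliar with the term, the sketch below builds three commonly used graph shift operators (adjacency, combinatorial Laplacian, and symmetrically normalized adjacency) for a small path graph and applies each to a node signal. These are standard textbook choices, not a list taken from the paper, and the alignment gain metric itself is not reproduced because the abstract does not define it. The example graph and signal are assumptions.

```python
import math

def degrees(A):
    """Node degrees of an unweighted adjacency matrix."""
    return [sum(row) for row in A]

def laplacian(A):
    """Combinatorial Laplacian L = D - A."""
    d = degrees(A)
    n = len(A)
    return [[(d[i] if i == j else 0) - A[i][j] for j in range(n)]
            for i in range(n)]

def normalized_adjacency(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2}."""
    d = degrees(A)
    n = len(A)
    return [[A[i][j] / math.sqrt(d[i] * d[j]) if A[i][j] else 0.0
             for j in range(n)] for i in range(n)]

def apply_shift(S, x):
    """One shift step: the matrix-vector product S x."""
    return [sum(s_ij * x_j for s_ij, x_j in zip(row, x)) for row in S]

# Hypothetical 3-node path graph 0 - 1 - 2 with a node signal x.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
x = [1.0, 2.0, 3.0]

print(apply_shift(A, x))                        # neighbor sums
print(apply_shift(laplacian(A), x))             # local differences
print(apply_shift(normalized_adjacency(A), x))  # degree-scaled neighbor sums
```

The same signal is filtered very differently by each operator, which is the kind of choice the paper's criterion is meant to rank before training.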
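[AI-40] SeeUPO: Sequence-Level Agentic-RL with Convergence Guarantees
Quick read: This paper addresses the lack of convergence guarantees when training large language model (LLM)-based agents with reinforcement learning (RL), especially in multi-turn interaction, where combining mainstream algorithms such as PPO (Proximal Policy Optimization) with advantage estimators such as Group Relative Advantage Estimation (GRAE) can lead to unstable training or failure to converge. The core solution is SeeUPO (Sequence-level Sequential Update Policy Optimization), a critic-free method for multi-turn interaction with theoretical convergence guarantees. Its key idea is to model multi-turn interaction as sequentially executed multi-agent bandit problems and to update the policy turn by turn in reverse execution order, which gives monotonic improvement and convergence to the globally optimal solution. On the AppWorld and BFCL v4 benchmarks it substantially outperforms existing algorithms while improving training stability.
Link: https://arxiv.org/abs/2602.06554
Authors: Tianyi Hu,Qingxu Fu,Yanxi Chen,Zhaoyang Liu,Bolin Ding
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: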
Abstract:Reinforcement learning (RL) has emerged as the predominant paradigm for training large language model (LLM)-based AI agents. However, existing backbone RL algorithms lack verified convergence guarantees in agentic scenarios, especially in multi-turn settings, which can lead to training instability and failure to converge to optimal policies. In this paper, we systematically analyze how different combinations of policy update mechanisms and advantage estimation methods affect convergence properties in single/multi-turn scenarios. We find that REINFORCE with Group Relative Advantage Estimation (GRAE) can converge to the global optimum under undiscounted conditions, but the combination of PPO and GRAE breaks PPO's original monotonic improvement property. Furthermore, we demonstrate that mainstream backbone RL algorithms cannot simultaneously be critic-free and provide convergence guarantees in multi-turn scenarios. To address this, we propose SeeUPO (Sequence-level Sequential Update Policy Optimization), a critic-free approach with convergence guarantees for multi-turn interactions. SeeUPO models multi-turn interaction as sequentially executed multi-agent bandit problems. Through turn-by-turn sequential policy updates in reverse execution order, it ensures monotonic improvement and convergence to the globally optimal solution via backward induction. Experiments on AppWorld and BFCL v4 demonstrate SeeUPO's substantial improvements over existing backbone algorithms: relative gains of 43.3%-54.6% on Qwen3-14B and 24.1%-41.9% on Qwen2.5-14B (averaged across benchmarks), along with superior training stability.
Submission history: [v1] Fri, 6 Feb 2026 09:57:23 UTC
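A minimal sketch of a group-relative advantage estimate: several responses sampled for the same prompt are scored, and each response's advantage is its reward relative to the group. The mean-and-standard-deviation normalization shown here is the commonly used GRPO-style form and is an assumption; the abstract does not spell out the exact GRAE formula, and no critic (value network) is involved.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sample relative to its group (assumed normalization)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when all rewards in the group are equal.
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four responses sampled from the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```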
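[AI-41] Dynamics-Aligned Shared Hypernetworks for Zero-Shot Actuator Inversion
Quick read: This paper addresses zero-shot generalization in contextual reinforcement learning, particularly when a latent context must be inferred from data. A canonical failure mode is actuator inversion, where identical actions produce opposite physical effects under a latent binary context. The key to the solution is the DMA*-SH framework, built around a single hypernetwork trained solely via dynamics prediction, which generates a set of adapter weights shared across the dynamics model, the policy, and the action-value function. This shared modulation introduces an inductive bias matched to actuator inversion, while input/output normalization and random input masking stabilize context inference and promote directionally concentrated representations. The theoretical analysis, through an expressivity separation result and a policy-gradient variance decomposition, shows that within-mode compression improves learning efficiency under actuator inversion.
Link: https://arxiv.org/abs/2602.06550
Authors: Jan Benad,Pradeep Kr. Banerjee,Frank Röder,Nihat Ay,Martin V. Butz,Manfred Eppe
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: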
Abstract:Zero-shot generalization in contextual reinforcement learning remains a core challenge, particularly when the context is latent and must be inferred from data. A canonical failure mode is actuator inversion, where identical actions produce opposite physical effects under a latent binary context. We propose DMA*-SH, a framework where a single hypernetwork, trained solely via dynamics prediction, generates a small set of adapter weights shared across the dynamics model, policy, and action-value function. This shared modulation imparts an inductive bias matched to actuator inversion, while input/output normalization and random input masking stabilize context inference, promoting directionally concentrated representations. We provide theoretical support via an expressivity separation result for hypernetwork modulation, and a variance decomposition with policy-gradient variance bounds that formalize how within-mode compression improves learning under actuator inversion. For evaluation, we introduce the Actuator Inversion Benchmark (AIB), a suite of environments designed to isolate discontinuous context-to-dynamics interactions. On AIB’s held-out actuator-inversion tasks, DMA*-SH achieves zero-shot generalization, outperforming domain randomization by 111.8% and surpassing a standard context-aware baseline by 16.1%.
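[AI-42] HyPER: Bridging Exploration and Exploitation for Scalable LLM Reasoning with Hypothesis Path Expansion and Reduction
Quick read: This paper addresses the exploration-exploitation trade-off that multi-path chain-of-thought faces when scaling test-time compute. Existing methods either restrict exploration through hard-coded expansion rules (as in tree-structured search) or waste computation by over-exploring redundant hypothesis paths while relying on weak answer selection. The key to the solution is to recast test-time scaling as a dynamic expand-reduce control problem and to propose HyPER, a training-free online control policy that reallocates resources under a fixed compute budget using lightweight path statistics. Its core components are an online controller that moves from exploration to exploitation as the hypothesis pool evolves, a token-level refinement mechanism that avoids full-path resampling at generation time, and a length- and confidence-aware aggregation strategy. Together they produce answers more efficiently and accurately, improving accuracy by 8-10% and cutting token usage by 25-40% on several mixture-of-experts language models.
Link: https://arxiv.org/abs/2602.06527
Authors: Shengxuan Qiu,Haochen Huang,Shuzhang Zhong,Pengfei Zuo,Meng Li
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: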
Abstract:Scaling test-time compute with multi-path chain-of-thought improves reasoning accuracy, but its effectiveness depends critically on the exploration-exploitation trade-off. Existing approaches address this trade-off in rigid ways: tree-structured search hard-codes exploration through brittle expansion rules that interfere with post-trained reasoning, while parallel reasoning over-explores redundant hypothesis paths and relies on weak answer selection. Motivated by the observation that the optimal balance is phase-dependent and that correct and incorrect reasoning paths often diverge only at late stages, we reformulate test-time scaling as a dynamic expand-reduce control problem over a pool of hypotheses. We propose HyPER, a training-free online control policy for multi-path decoding in mixture-of-experts models that reallocates computation under a fixed budget using lightweight path statistics. HyPER consists of an online controller that transitions from exploration to exploitation as the hypothesis pool evolves, a token-level refinement mechanism that enables efficient generation-time exploitation without full-path resampling, and a length- and confidence-aware aggregation strategy for reliable answer-time exploitation. Experiments on four mixture-of-experts language models across diverse reasoning benchmarks show that HyPER consistently achieves a superior accuracy-compute trade-off, improving accuracy by 8 to 10 percent while reducing token usage by 25 to 40 percent.
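To make the answer-time step concrete, here is a minimal sketch of confidence-weighted answer aggregation over several reasoning paths. It is a generic illustration, not HyPER's aggregation rule: the abstract says the strategy is length- and confidence-aware but does not give a formula, so the length term is omitted and the function name, data layout, and scores are all assumptions.

```python
from collections import defaultdict

def confidence_weighted_vote(paths):
    """paths: list of (answer, confidence) pairs.
    Returns the answer with the largest total confidence."""
    scores = defaultdict(float)
    for answer, confidence in paths:
        scores[answer] += confidence
    return max(scores, key=scores.get)

# Hypothetical pool: two low-confidence paths agree on "41",
# one high-confidence path says "42".
paths = [("42", 0.9), ("41", 0.3), ("41", 0.3)]

# A plain majority vote would pick "41"; weighting by confidence picks "42".
print(confidence_weighted_vote(paths))
```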
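[AI-43] Progress Constraints for Reinforcement Learning in Behavior Trees
Quick read: This paper addresses the drop in overall performance caused by controllers interfering with one another in hybrid Behavior Tree (BT) and Reinforcement Learning (RL) architectures. Under challenges such as sparse rewards, safe exploration, and long-horizon credit assignment, naive integration easily lets some sub-controllers undo subgoals that were already achieved. The key to the solution is progress constraints, a mechanism grounded in theoretical BT convergence results that uses feasibility estimators to restrict the allowed action set, so that no sub-controller destroys previously achieved subgoals. This improves performance, sample efficiency, and constraint satisfaction.
Link: https://arxiv.org/abs/2602.06525
Authors: Finn Rietz,Mart Kartašev,Johannes A. Stork,Petter Ögren
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: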
Abstract:Behavior Trees (BTs) provide a structured and reactive framework for decision-making, commonly used to switch between sub-controllers based on environmental conditions. Reinforcement Learning (RL), on the other hand, can learn near-optimal controllers but sometimes struggles with sparse rewards, safe exploration, and long-horizon credit assignment. Combining BTs with RL has the potential for mutual benefit: a BT design encodes structured domain knowledge that can simplify RL training, while RL enables automatic learning of the controllers within BTs. However, naive integration of BTs and RL can lead to some controllers counteracting other controllers, possibly undoing previously achieved subgoals, thereby degrading the overall performance. To address this, we propose progress constraints, a novel mechanism where feasibility estimators constrain the allowed action set based on theoretical BT convergence results. Empirical evaluations in a 2D proof-of-concept and a high-fidelity warehouse environment demonstrate improved performance, sample efficiency, and constraint satisfaction, compared to prior methods of BT-RL integration.
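[AI-44] JADE: Expert-Grounded Dynamic Evaluation for Open-Ended Professional Tasks
[Quick read]: This paper addresses the fundamental tension between rigor and flexibility when evaluating agentic AI on open-ended professional tasks: static rubrics are reproducible but cannot accommodate diverse valid response strategies, while LLM-as-a-judge methods are flexible but unstable and biased. The key to the solution is the JADE framework, which has two layers. The first encodes expert knowledge as a predefined set of evaluation skills, keeping the evaluation criteria stable. The second performs report-specific, claim-level dynamic evaluation and adds evidence-dependency gating to exclude conclusions built on refuted claims, so that critical failure modes across different reasoning paths are identified flexibly while consistency is preserved.
Link: https://arxiv.org/abs/2602.06486
Authors: Lanbo Lin,Jiayao Liu,Tianyuan Yang,Li Cai,Yuanwu Xu,Lei Wei,Sicong Xie,Guannan Zhang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: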
Abstract:Evaluating agentic AI on open-ended professional tasks faces a fundamental dilemma between rigor and flexibility. Static rubrics provide rigorous, reproducible assessment but fail to accommodate diverse valid response strategies, while LLM-as-a-judge approaches adapt to individual responses yet suffer from instability and bias. Human experts address this dilemma by combining domain-grounded principles with dynamic, claim-level assessment. Inspired by this process, we propose JADE, a two-layer evaluation framework. Layer 1 encodes expert knowledge as a predefined set of evaluation skills, providing stable evaluation criteria. Layer 2 performs report-specific, claim-level evaluation to flexibly assess diverse reasoning strategies, with evidence-dependency gating to invalidate conclusions built on refuted claims. Experiments on BizBench show that JADE improves evaluation stability and reveals critical agent failure modes missed by holistic LLM-based evaluators. We further demonstrate strong alignment with expert-authored rubrics and effective transfer to a medical-domain benchmark, validating JADE across professional domains. Our code is publicly available at this https URL.
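The evidence-dependency gating mentioned in the abstract can be pictured as propagating refutations through a claim graph. The following is a minimal sketch under stated assumptions, not JADE's implementation: the function name `gate_claims`, the `verdicts` and `depends_on` dictionaries, and the treatment of cyclic dependencies are all hypothetical choices made for illustration.

```python
from typing import Dict, FrozenSet, List


def gate_claims(verdicts: Dict[str, bool],
                depends_on: Dict[str, List[str]]) -> Dict[str, bool]:
    """Return which claims remain valid after dependency gating.

    verdicts[c]   -- True if claim c passed its own check, False if refuted.
    depends_on[c] -- claims that c cites as evidence (hypothetical structure).
    A claim stays valid only if it passed and every claim it depends on is valid.
    """
    valid: Dict[str, bool] = {}

    def is_valid(claim: str, stack: FrozenSet[str]) -> bool:
        if claim in valid:
            return valid[claim]
        if claim in stack:
            # Assumption: a circular chain of evidence supports nothing.
            return False
        ok = verdicts.get(claim, False) and all(
            is_valid(dep, stack | {claim}) for dep in depends_on.get(claim, [])
        )
        valid[claim] = ok
        return ok

    for claim in verdicts:
        is_valid(claim, frozenset())
    return valid


if __name__ == "__main__":
    verdicts = {"revenue_grew": False, "share_up": True, "expand": True}
    depends_on = {"expand": ["revenue_grew", "share_up"]}
    print(gate_claims(verdicts, depends_on))
```

In this toy input the conclusion `expand` passed its own check but rests on the refuted claim `revenue_grew`, so gating marks it invalid.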
[AI-45] AgentCPM-Explore: Realizing Long-Horizon Deep Exploration for Edge-Scale Agents
[Quick read]: This paper addresses the limited performance of edge-scale generative AI models on complex tasks, focusing on three bottlenecks in training 4B-parameter models: catastrophic forgetting during Supervised Fine-Tuning (SFT), sensitivity to reward-signal noise during Reinforcement Learning (RL), and reasoning degradation caused by redundant information in long-context scenarios. The key to the solution is AgentCPM-Explore, a compact 4B agent model with high knowledge density and strong exploration capability, together with a systematic training framework that combines parameter-space model fusion, reward signal denoising, and contextual information refinement. This markedly improves the inference stability and overall performance of edge models, which match or surpass larger models on several benchmarks.
Link: https://arxiv.org/abs/2602.06485
Authors: Haotian Chen,Xin Cong,Shengda Fan,Yuyang Fu,Ziqin Gong,Yaxi Lu,Yishan Li,Boye Niu,Chengjun Pan,Zijun Song,Huadong Wang,Yesai Wu,Yueying Wu,Zihao Xie,Yukun Yan,Zhong Zhang,Yankai Lin,Zhiyuan Liu,Maosong Sun
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:While Large Language Model (LLM)-based agents have shown remarkable potential for solving complex tasks, existing systems remain heavily reliant on large-scale models, leaving the capabilities of edge-scale models largely underexplored. In this paper, we present the first systematic study on training agentic models at the 4B-parameter scale. We identify three primary bottlenecks hindering the performance of edge-scale models: catastrophic forgetting during Supervised Fine-Tuning (SFT), sensitivity to reward signal noise during Reinforcement Learning (RL), and reasoning degradation caused by redundant information in long-context scenarios. To address the issues, we propose AgentCPM-Explore, a compact 4B agent model with high knowledge density and strong exploration capability. We introduce a holistic training framework featuring parameter-space model fusion, reward signal denoising, and contextual information refinement. Through deep exploration, AgentCPM-Explore achieves state-of-the-art (SOTA) performance among 4B-class models, matches or surpasses 8B-class SOTA models on four benchmarks, and even outperforms larger-scale models such as Claude-4.5-Sonnet or DeepSeek-v3.2 in five benchmarks. Notably, AgentCPM-Explore achieves 97.09% accuracy on GAIA text-based tasks under pass@64. These results provide compelling evidence that the bottleneck for edge-scale models is not their inherent capability ceiling, but rather their inference stability. Based on our well-established training framework, AgentCPM-Explore effectively unlocks the significant, yet previously underestimated, potential of edge-scale models.
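[AI-46] Principle-Evolvable Scientific Discovery via Uncertainty Minimization
[Quick read]: This paper addresses the inefficiency of current Large Language Model (LLM)-based scientific agents that cling to their initial priors during scientific discovery; within a static hypothesis space they struggle to find novel phenomena and waste computation. The key to the solution is to recast scientific discovery from hypothesis search into the evolution of the underlying scientific principles. The proposed PiEvo framework performs Bayesian optimization over an expanding principle space. Through Gaussian-process-driven Information-Directed Hypothesis Selection and an anomaly-driven augmentation mechanism, agents can autonomously and iteratively update their theoretical worldview, which markedly improves discovery quality and convergence speed while remaining robust across domains and LLM backbones.
Link: https://arxiv.org/abs/2602.06448
Authors: Yingming Pu,Tao Lin,Hongyu Chen
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: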
Abstract:Large Language Model (LLM)-based scientific agents have accelerated scientific discovery, yet they often suffer from significant inefficiencies due to adherence to fixed initial priors. Existing approaches predominantly operate within a static hypothesis space, which restricts the discovery of novel phenomena, resulting in computational waste when baseline theories fail. To address this, we propose shifting the focus from searching hypotheses to evolving the underlying scientific principles. We present PiEvo, a principle-evolvable framework that treats scientific discovery as Bayesian optimization over an expanding principle space. By integrating Information-Directed Hypothesis Selection via Gaussian Process and an anomaly-driven augmentation mechanism, PiEvo enables agents to autonomously refine their theoretical worldview. Evaluation across four benchmarks demonstrates that PiEvo (1) achieves an average solution quality of up to 90.81%~93.15%, representing a 29.7%~31.1% improvement over the state-of-the-art, (2) attains an 83.3% speedup in convergence step via significantly reduced sample complexity by optimizing the compact principle space, and (3) maintains robust performance across diverse scientific domains and LLM backbones.
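[AI-47] TrajAD: Trajectory Anomaly Detection for Trustworthy LLM Agents
[Quick read]: This paper addresses runtime trajectory anomaly detection for Large Language Model (LLM) agents, with the goal of making such agents more trustworthy. Existing safety mechanisms rely mainly on static input/output filtering, but the authors argue that agent reliability requires auditing the intermediate execution process. The paper therefore frames anomaly detection as precise error localization, which supports efficient rollback-and-retry. The key to the solution is the TrajBench dataset, synthesized with a perturb-and-complete strategy to cover diverse procedural anomalies and used to assess models' process-supervision ability, together with TrajAD, a specialized verifier trained with fine-grained process supervision. Experiments show that TrajAD clearly outperforms the baselines, indicating that specialized supervision is essential for building trustworthy agents.
Link: https://arxiv.org/abs/2602.06443
Authors: Yibing Liu,Chong Zhang,Zhongyi Han,Hansong Liu,Yong Wang,Yang Yu,Xiaoyan Wang,Yilong Yin
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures, 1 table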
Abstract:We address the problem of runtime trajectory anomaly detection, a critical capability for enabling trustworthy LLM agents. Current safety measures predominantly focus on static input/output filtering. However, we argue that ensuring LLM agents' reliability requires auditing the intermediate execution process. In this work, we formulate the task of Trajectory Anomaly Detection. The goal is not merely detection, but precise error localization. This capability is essential for enabling efficient rollback-and-retry. To achieve this, we construct TrajBench, a dataset synthesized via a perturb-and-complete strategy to cover diverse procedural anomalies. Using this benchmark, we investigate the capability of models in process supervision. We observe that general-purpose LLMs, even with zero-shot prompting, struggle to identify and localize these anomalies. This reveals that generalized capabilities do not automatically translate to process reliability. To address this, we propose TrajAD, a specialized verifier trained with fine-grained process supervision. Our approach outperforms baselines, demonstrating that specialized supervision is essential for building trustworthy agents.
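[AI-48] A methodology for analyzing financial needs hierarchy from social discussions using LLM
[Quick read]: This paper addresses how to systematically identify and understand the hierarchical structure of financial needs expressed on social media, where traditional methods such as surveys struggle to capture complex behavior in real-world settings. The key to the solution is to use Large Language Models (LLMs), a form of generative AI, to computationally analyze large volumes of social media text, extracting and interpreting expressions of financial needs. This validates that financial needs are organized hierarchically, from short-term essentials to long-term aspirations, and offers a data-driven, scalable alternative to traditional surveys that yields a more dynamic and nuanced understanding of real-world financial behavior.
Link: https://arxiv.org/abs/2602.06431
Authors: Abhishek Jangra,Sachin Thukral,Arnab Chatterjee,Jayasree Raveendran
Affiliations: unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 15 pages, 5 figures, 4 tables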
Abstract:This study examines the hierarchical structure of financial needs as articulated in social media discourse, employing generative AI techniques to analyze large-scale textual data. While human needs encompass a broad spectrum, from fundamental survival to psychological fulfillment, financial needs are particularly critical, influencing both individual well-being and day-to-day decision-making. Our research advances the understanding of financial behavior by utilizing large language models (LLMs) to extract and analyze expressions of financial needs from social media posts. We hypothesize that financial needs are organized hierarchically, progressing from short-term essentials to long-term aspirations, consistent with theoretical frameworks established in the behavioral sciences. Through computational analysis, we demonstrate the feasibility of identifying these needs and validate the presence of a hierarchical structure within them. In addition to confirming this structure, our findings provide novel insights into the content and themes of financial discussions online. By inferring underlying needs from naturally occurring language, this approach offers a scalable and data-driven alternative to conventional survey methodologies, enabling a more dynamic and nuanced understanding of financial behavior in real-world contexts.
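[AI-49] Intrinsic Stability Limits of Autoregressive Reasoning: Structural Consequences for Long-Horizon Execution
[Quick read]: This paper addresses the sharp performance drop of Large Language Models (LLMs) on long-horizon reasoning tasks. Conventional explanations attribute it to task complexity (such as combinatorial search explosion or long-term credit assignment), but the author argues these are insufficient: even in simple, linear, unbranched tasks without semantic ambiguity, autoregressive generation itself has an intrinsic stability limit. The key contribution is a theoretical result, Theorem A, which shows that the decision advantage in single-path autoregressive reasoning decays exponentially with execution length, establishing a fundamental upper bound on the length of maintainable reasoning chains. The finding indicates that long-horizon reasoning failure is at root a process-level structural instability rather than purely a matter of task complexity. It follows that stable long-horizon reasoning requires discrete segmentation, which naturally leads to graph-structured execution such as directed acyclic graphs (DAGs), giving a theoretical basis and practical direction for shifting reasoning-system design from pure scaling toward structured governance.
Link: https://arxiv.org/abs/2602.06413
Authors: Hsien-Jyh Liao
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 16 Pages, 7 figures, Keywords: Autoregressive Reasoning, Long-Horizon Stability, Chain-of-Thought Reasoning, Information-Theoretic Analysis, Structured Reasoning, Inference Dynamics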
Abstract:Large language models (LLMs) demonstrate remarkable reasoning capabilities, yet their performance often deteriorates sharply in long-horizon tasks, exhibiting systematic breakdown beyond certain scales. Conventional explanations primarily attribute this phenomenon to task complexity, such as combinatorial search explosion or long-term credit assignment challenges. In this work, we argue that these explanations are incomplete: even in linear, unbranched tasks without semantic ambiguity, autoregressive execution is subject to an intrinsic stability limit. We propose that the fundamental constraint on long-horizon reasoning arises from process-level instability in autoregressive generation rather than solely from search or task complexity, reframing long-horizon reasoning as a problem of structural governance. We derive Theorem A, showing that decision advantage in single-path autoregressive reasoning decays exponentially with execution length, imposing a fundamental bound on maintainable reasoning chains. This result implies a structural consequence: stable long-horizon reasoning requires discrete segmentation, naturally inducing graph-like execution structures such as directed acyclic graphs (DAGs). Empirical studies in both synthetic environments and real TextWorld tasks reveal observable performance cliffs consistent with theoretical predictions. Our findings provide a dynamical perspective on long-horizon reasoning failure and suggest new limitations on maintaining long-term coherence under purely autoregressive architectures. Furthermore, we highlight that short-horizon evaluation protocols may obscure structural instability, indicating a potential shift from scaling toward structured governance in future reasoning systems.
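To get a feel for why exponential decay bounds the usable chain length, consider a deliberately simplified toy model (an assumption of this sketch, not the paper's Theorem A): every step succeeds independently with the same probability, so an unsegmented chain of n steps succeeds with probability `per_step_success ** n`. The helper name `max_reliable_length` is hypothetical.

```python
import math


def max_reliable_length(per_step_success: float, target: float) -> int:
    """Largest n with per_step_success ** n >= target in the toy decay model."""
    if not (0.0 < per_step_success < 1.0 and 0.0 < target < 1.0):
        raise ValueError("both arguments must lie strictly between 0 and 1")
    # Closed form: n <= ln(target) / ln(per_step_success).
    n = int(math.floor(math.log(target) / math.log(per_step_success)))
    # Correct for floating-point rounding right at the boundary.
    while per_step_success ** (n + 1) >= target:
        n += 1
    while n > 0 and per_step_success ** n < target:
        n -= 1
    return n


if __name__ == "__main__":
    for p in (0.9, 0.99, 0.999):
        print(p, max_reliable_length(p, target=0.5))
```

Under this toy model the bound grows only logarithmically in the target reliability, which is one informal way to read the abstract's argument for splitting long executions into discrete, separately checked segments.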
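[AI-50] Empirical Analysis of Adversarial Robustness and Explainability Drift in Cybersecurity Classifiers
[Quick read]: This paper addresses adversarial robustness and explainability drift in Machine Learning (ML) models used for cybersecurity, that is, how adversarial perturbations reduce detection accuracy and undermine interpretability. The key to the solution is a quantitative metric, the Robustness Index (RI), defined as the area under the accuracy-perturbation curve, combined with gradient-sensitivity and SHAP (Shapley Additive Explanations) attribution-drift analyses that identify vulnerable input features. Experiments show that adversarial training raises RI by up to 9 percent while maintaining clean-data accuracy, revealing the coupling between robustness and interpretability and underscoring the importance of quantitative evaluation for trustworthy AI-driven cybersecurity systems.
Link: https://arxiv.org/abs/2602.06395
Authors: Mona Rajhans,Vishal Khawarey
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted for publication in 18th ACM International Conference on Agents and Artificial Intelligence (ICAART 2026), Marbella, Spain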
Abstract:Machine learning (ML) models are increasingly deployed in cybersecurity applications such as phishing detection and network intrusion prevention. However, these models remain vulnerable to adversarial perturbations: small, deliberate input modifications that can degrade detection accuracy and compromise interpretability. This paper presents an empirical study of adversarial robustness and explainability drift across two cybersecurity domains: phishing URL classification and network intrusion detection. We evaluate the impact of L-infinity-bounded Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) perturbations on model accuracy and introduce a quantitative metric, the Robustness Index (RI), defined as the area under the accuracy-perturbation curve. Gradient-based feature sensitivity and SHAP-based attribution drift analyses reveal which input features are most susceptible to adversarial manipulation. Experiments on the Phishing Websites and UNSW-NB15 datasets show consistent robustness trends, with adversarial training improving RI by up to 9 percent while maintaining clean-data accuracy. These findings highlight the coupling between robustness and interpretability degradation and underscore the importance of quantitative evaluation in the design of trustworthy, AI-driven cybersecurity systems.
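The Robustness Index is described only as the area under the accuracy-perturbation curve, so it can be approximated from a handful of measured (perturbation budget, accuracy) points. The sketch below uses the trapezoidal rule; the function name `robustness_index` and the normalization by the perturbation range (so that a perfectly robust model scores 1.0) are assumptions of this example, not details confirmed by the abstract.

```python
from typing import Sequence


def robustness_index(epsilons: Sequence[float],
                     accuracies: Sequence[float]) -> float:
    """Trapezoidal area under accuracy vs. perturbation budget.

    Assumption: the area is divided by the epsilon range, so constant
    accuracy 1.0 gives an index of 1.0.
    """
    if len(epsilons) != len(accuracies) or len(epsilons) < 2:
        raise ValueError("need at least two (epsilon, accuracy) pairs")
    area = 0.0
    for i in range(1, len(epsilons)):
        width = epsilons[i] - epsilons[i - 1]
        if width <= 0:
            raise ValueError("epsilons must be strictly increasing")
        # Area of one trapezoid between two adjacent measurements.
        area += 0.5 * (accuracies[i] + accuracies[i - 1]) * width
    return area / (epsilons[-1] - epsilons[0])


if __name__ == "__main__":
    eps = [0.0, 0.05, 0.1, 0.2]          # hypothetical attack budgets
    baseline = [0.95, 0.80, 0.60, 0.30]  # hypothetical accuracies
    hardened = [0.95, 0.88, 0.75, 0.50]
    print(robustness_index(eps, baseline), robustness_index(eps, hardened))
```

Comparing the index of a baseline and an adversarially trained model on the same epsilon grid gives a single number per model, which is the kind of comparison the abstract reports.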
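[AI-51] Unlocking Noisy Real-World Corpora for Foundation Model Pre-Training via Quality-Aware Tokenization
[Quick read]: This paper addresses the limited effectiveness of conventional tokenization on noisy data, especially real-world corpora with varying signal quality (such as genomic sequences and financial time series), where existing tokenization strategies ignore data reliability and model performance suffers. The key to the solution is QA-Token (Quality-Aware Tokenization), which incorporates data reliability directly into vocabulary construction for more robust tokenization. It comprises (i) a bilevel optimization framework that jointly optimizes vocabulary construction and downstream performance, (ii) a reinforcement learning method that learns merge policies from quality-aware rewards with convergence guarantees, and (iii) a Gumbel-Softmax relaxation for end-to-end parameter learning. The method yields notable gains in genomics and finance and was applied to a pretraining corpus of 1.7 trillion base pairs, achieving state-of-the-art pathogen detection (94.53 MCC) with 15% fewer tokens.
Link: https://arxiv.org/abs/2602.06394
Authors: Arvid E. Gollwitzer,Paridhi Latawa,David de Gruijl,Deepak A. Subramanian,Adrián Noriega de la Colina
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Genomics (q-bio.GN); Computational Finance (q-fin.CP)
Comments: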
Abstract:Current tokenization methods process sequential data without accounting for signal quality, limiting their effectiveness on noisy real-world corpora. We present QA-Token (Quality-Aware Tokenization), which incorporates data reliability directly into vocabulary construction. We make three key contributions: (i) a bilevel optimization formulation that jointly optimizes vocabulary construction and downstream performance, (ii) a reinforcement learning approach that learns merge policies through quality-aware rewards with convergence guarantees, and (iii) an adaptive parameter learning mechanism via Gumbel-Softmax relaxation for end-to-end optimization. Our experimental evaluation demonstrates consistent improvements: genomics (6.7 percentage point F1 gain in variant calling over BPE), finance (30% Sharpe ratio improvement). At foundation scale, we tokenize a pretraining corpus comprising 1.7 trillion base-pairs and achieve state-of-the-art pathogen detection (94.53 MCC) while reducing token count by 15%. We unlock noisy real-world corpora, spanning petabases of genomic sequences and terabytes of financial time series, for foundation model training with zero inference overhead.
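[AI-52] Generating High-quality Privacy-preserving Synthetic Data
[Quick read]: This paper addresses the three-way trade-off that synthetic tabular data faces in practical deployment: distributional fidelity, downstream utility, and privacy protection. The core of the solution is a model-agnostic post-processing framework with two steps. The first, mode patching, repairs categories that are missing or severely underrepresented in the synthetic data while largely preserving the learned dependencies between variables. The second, a k-nearest-neighbor filter, enforces a minimum distance between real and synthetic samples by replacing synthetic records that lie too close to real data points. The method is validated on two neural generative models (a feed-forward generator and a variational autoencoder); on several public datasets it improves distributional similarity and dependence preservation while maintaining downstream model performance and improving distance-based privacy indicators, giving practical guidance for optimizing the quality and privacy of synthetic data.
Link: https://arxiv.org/abs/2602.06390
Authors: David Yavo,Richard Khoury,Christophe Pere,Sadoune Ait Kaci Azzou
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: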
Abstract:Synthetic tabular data enables sharing and analysis of sensitive records, but its practical deployment requires balancing distributional fidelity, downstream utility, and privacy protection. We study a simple, model agnostic post processing framework that can be applied on top of any synthetic data generator to improve this trade off. First, a mode patching step repairs categories that are missing or severely underrepresented in the synthetic data, while largely preserving learned dependencies. Second, a k nearest neighbor filter replaces synthetic records that lie too close to real data points, enforcing a minimum distance between real and synthetic samples. We instantiate this framework for two neural generative models for tabular data, a feed forward generator and a variational autoencoder, and evaluate it on three public datasets covering credit card transactions, cardiovascular health, and census based income. We assess marginal and joint distributional similarity, the performance of models trained on synthetic data and evaluated on real data, and several empirical privacy indicators, including nearest neighbor distances and attribute inference attacks. With moderate thresholds between 0.2 and 0.35, the post processing reduces divergence between real and synthetic categorical distributions by up to 36 percent and improves a combined measure of pairwise dependence preservation by 10 to 14 percent, while keeping downstream predictive performance within about 1 percent of the unprocessed baseline. At the same time, distance based privacy indicators improve and the success rate of attribute inference attacks remains largely unchanged. These results provide practical guidance for selecting thresholds and applying post hoc repairs to improve the quality and empirical privacy of synthetic tabular data, while complementing approaches that provide formal differential privacy guarantees.
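The nearest-neighbor filtering step can be sketched as a distance check of each synthetic row against the real data. This is an illustrative sketch only: the function name `split_by_distance`, the use of Euclidean distance on already-scaled numeric features, and the choice of k = 1 are assumptions. The abstract says flagged records are replaced; how replacements are drawn is not stated, so the sketch only identifies which rows would need replacing.

```python
import math
from typing import List, Sequence, Tuple

Row = Sequence[float]


def split_by_distance(real: Sequence[Row],
                      synthetic: Sequence[Row],
                      min_distance: float) -> Tuple[List[int], List[int]]:
    """Partition synthetic row indices by distance to the nearest real row.

    Returns (kept, too_close). Rows in `too_close` lie within `min_distance`
    of some real record and would be candidates for replacement.
    Assumption: features are numeric and scaled to comparable ranges.
    """
    if not real:
        raise ValueError("real data must not be empty")
    kept: List[int] = []
    too_close: List[int] = []
    for idx, row in enumerate(synthetic):
        nearest = min(math.dist(row, r) for r in real)  # k = 1 neighbour
        (kept if nearest >= min_distance else too_close).append(idx)
    return kept, too_close


if __name__ == "__main__":
    real_rows = [(0.0, 0.0), (1.0, 1.0)]
    synth_rows = [(0.05, 0.0), (0.5, 0.5), (3.0, 3.0)]
    print(split_by_distance(real_rows, synth_rows, min_distance=0.3))
```

The brute-force scan is quadratic in the number of rows; a spatial index would be the natural substitute for larger tables.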
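[AI-53] Difficulty-Estimated Policy Optimization
[Quick read]: This paper addresses convergence-stability problems caused by gradient signal attenuation when training Large Reasoning Models (LRMs): on problems that are too easy or too hard, the inter-group advantages vanish and the gradient becomes susceptible to noise. The key to the solution is Difficulty-Estimated Policy Optimization (DEPO), which adds an online difficulty estimator that dynamically assesses and filters out samples with low learning potential before the rollout phase, concentrating compute on high-value samples. This cuts rollout cost by up to 2x without sacrificing model performance and improves training efficiency and robustness.
Link: https://arxiv.org/abs/2602.06375
Authors: Yu Zhao,Fan Jiang,Tianle Liu,Bo Zeng,Yu Liu,Longyue Wang,Weihua Luo
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: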
Abstract:Recent advancements in Large Reasoning Models (LRMs), exemplified by DeepSeek-R1, have underscored the potential of scaling inference-time compute through Group Relative Policy Optimization (GRPO). However, GRPO frequently suffers from gradient signal attenuation when encountering problems that are either too trivial or overly complex. In these scenarios, the disappearance of inter-group advantages makes the gradient signal susceptible to noise, thereby jeopardizing convergence stability. While variants like DAPO attempt to rectify gradient vanishing, they do not alleviate the substantial computational overhead incurred by exhaustive rollouts on low-utility samples. In this paper, we propose Difficulty-Estimated Policy Optimization (DEPO), a novel framework designed to optimize the efficiency and robustness of reasoning alignment. DEPO integrates an online Difficulty Estimator that dynamically assesses and filters training data before the rollout phase. This mechanism ensures that computational resources are prioritized for samples with high learning potential. Empirical results demonstrate that DEPO achieves up to a 2x reduction in rollout costs without compromising model performance. Our approach significantly lowers the computational barrier for training high-performance reasoning models, offering a more sustainable path for reasoning scaling. Code and data will be released upon acceptance.
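The vanishing-advantage effect is easy to reproduce with the usual group-relative normalization, in which each rollout's reward is centered by the group mean and scaled by the group standard deviation. The sketch below is illustrative only: the helper names, the use of the population standard deviation, and the pass-rate thresholds in `keep_prompt` are assumptions, and the abstract does not say how DEPO's estimator is built.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Center rewards by the group mean and scale by the group std deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


def has_learning_signal(rewards: Sequence[float]) -> bool:
    """False when every rollout got the same reward (all advantages are zero)."""
    return max(rewards) != min(rewards)


def keep_prompt(predicted_pass_rate: float,
                low: float = 0.1, high: float = 0.9) -> bool:
    """Hypothetical pre-rollout filter: skip prompts predicted too hard or too easy."""
    return low < predicted_pass_rate < high


if __name__ == "__main__":
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # trivial prompt
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # informative prompt
    print(keep_prompt(0.02), keep_prompt(0.5))
```

A prompt whose rollouts all succeed or all fail contributes zero advantage for every sample, so the rollouts spent on it buy no gradient signal; filtering before the rollout phase is what saves the compute.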
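[AI-54] Training Data Selection with Gradient Orthogonality for Efficient Domain Adaptation
[Quick read]: This paper addresses catastrophic forgetting when fine-tuning Large Language Models (LLMs) for a specific domain, where gaining domain expertise makes it hard to retain general reasoning ability. Existing methods face a dilemma: gradient surgery offers geometric safety at a prohibitive computational cost, while efficient data selection is blind to conflicting gradient directions. The key to the solution is Orthogonal Gradient Selection (OGS), whose core idea is to move the geometric insight of gradient projection from the optimizer to the data selection stage. Using a lightweight Navigator model and reinforcement learning, it dynamically selects training samples whose gradients are orthogonal to a general-knowledge anchor, giving safe updates without modifying the optimizer or paying runtime projection costs, and it significantly improves domain performance and training efficiency while maintaining or improving general-task performance.
Link: https://arxiv.org/abs/2602.06359
Authors: Xiyang Zhang,Yuanhe Tian,Hongzhi Wang,Yan Song
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: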
Abstract:Fine-tuning large language models (LLMs) for specialized domains often necessitates a trade-off between acquiring domain expertise and retaining general reasoning capabilities, a phenomenon known as catastrophic forgetting. Existing remedies face a dichotomy: gradient surgery methods offer geometric safety but incur prohibitive computational costs via online projections, while efficient data selection approaches reduce overhead but remain blind to conflict-inducing gradient directions. In this paper, we propose Orthogonal Gradient Selection (OGS), a data-centric method that harmonizes domain performance, general capability retention, and training efficiency. OGS shifts the geometric insights of gradient projection from the optimizer to the data selection stage by treating data selection as a constrained decision-making process. By leveraging a lightweight Navigator model and reinforcement learning techniques, OGS dynamically identifies training samples whose gradients are orthogonal to a general-knowledge anchor. This approach ensures naturally safe updates for target models without modifying the optimizer or incurring runtime projection costs. Experiments across medical, legal, and financial domains demonstrate that OGS achieves excellent results, significantly improving domain performance and training efficiency while maintaining or even enhancing performance on general tasks such as GSM8K.
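The geometric criterion behind the selection, keeping samples whose gradient is close to orthogonal to a general-knowledge anchor direction, can be shown with plain cosine similarity. This sketch covers only that criterion: it does not model the Navigator or the reinforcement learning component, and the function names and the tolerance `tol` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_near_orthogonal(sample_grads: Sequence[Sequence[float]],
                           anchor_grad: Sequence[float],
                           tol: float = 0.1) -> List[int]:
    """Indices of samples whose gradient is nearly orthogonal to the anchor.

    A small |cosine| means an update along the sample gradient moves the
    anchor (general-capability) loss very little, to first order.
    """
    return [i for i, g in enumerate(sample_grads)
            if abs(cosine(g, anchor_grad)) <= tol]


if __name__ == "__main__":
    anchor = [1.0, 0.0]
    grads = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.05], [0.05, 1.0]]
    print(select_near_orthogonal(grads, anchor))
```

Computing per-sample gradients of a large model this way would be expensive, which is presumably why the abstract delegates the decision to a lightweight model instead.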
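[AI-55] Zero-Trust Runtime Verification for Agentic Payment Protocols: Mitigating Replay and Context-Binding Failures in AP2
[Quick read]: This paper addresses security gaps that appear at runtime in payment systems driven by autonomous AI agents, where mandate-based authorization protocols such as the Agent Payments Protocol (AP2) are exposed to concurrency, retries, and orchestration. AP2 provides specification-level guarantees through signature verification, explicit binding, and expiration semantics, but runtime dynamics can allow a mandate to be reused or its context to be altered, enabling replay attacks and context-redirect attacks. The key to the solution is a zero-trust runtime verification framework that enforces consume-once mandate semantics with dynamically generated, time-bound nonces and evaluates authorization decisions at execution time rather than at static issuance, giving explicit binding to the runtime context. The evaluation shows that the framework defends against these attacks, keeps verification latency stable at about 3.8 ms under high concurrency, and needs runtime state that depends only on peak concurrency, so its overhead is low and predictable.
Link: https://arxiv.org/abs/2602.06345
Authors: Qianlong Lan,Anuj Kaul,Shaun Jones,Stephanie Westrum
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: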
Abstract:The deployment of autonomous AI agents capable of executing commercial transactions has motivated the adoption of mandate-based payment authorization protocols, including the Universal Commerce Protocol (UCP) and the Agent Payments Protocol (AP2). These protocols replace interactive, session-based authorization with cryptographically issued mandates, enabling asynchronous and autonomous execution. While AP2 provides specification-level guarantees through signature verification, explicit binding, and expiration semantics, real-world agentic execution introduces runtime behaviors such as retries, concurrency, and orchestration that challenge implicit assumptions about mandate usage. In this work, we present a security analysis of the AP2 mandate lifecycle and identify enforcement gaps that arise during runtime in agent-based payment systems. We propose a zero-trust runtime verification framework that enforces explicit context binding and consume-once mandate semantics using dynamically generated, time-bound nonces, ensuring that authorization decisions are evaluated at execution time rather than assumed from static issuance properties. Through simulation-based evaluation under high concurrency, we show that context-aware binding and consume-once enforcement address distinct and complementary attack classes, and that both are required to prevent replay and context-redirect attacks. The proposed framework mitigates all evaluated attacks while maintaining stable verification latency of approximately 3.8 ms at throughput levels up to 10,000 transactions per second. We further demonstrate that the required runtime state is bounded by peak concurrency rather than cumulative transaction history, indicating that robust runtime security for agentic payment execution can be achieved with minimal and predictable overhead.
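Consume-once semantics with time-bound, context-bound nonces can be sketched as a small in-memory registry. This is a minimal illustration under stated assumptions, not the paper's framework or any part of AP2: the class `NonceRegistry`, its methods, the string-valued context, and the decision to leave a nonce valid after a wrong-context attempt are all hypothetical.

```python
import secrets
import threading
import time
from typing import Callable, Dict, Tuple


class NonceRegistry:
    """Hypothetical consume-once store for time-bound, context-bound nonces."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        # nonce -> (bound context, expiry time); holds only unconsumed nonces
        self._pending: Dict[str, Tuple[str, float]] = {}

    def issue(self, context: str) -> str:
        """Create a fresh nonce bound to `context` that expires after the TTL."""
        nonce = secrets.token_urlsafe(16)
        with self._lock:
            now = self._clock()
            # Drop expired entries so state tracks in-flight mandates only.
            for stale in [n for n, (_, exp) in self._pending.items() if exp <= now]:
                del self._pending[stale]
            self._pending[nonce] = (context, now + self._ttl)
        return nonce

    def consume(self, nonce: str, context: str) -> bool:
        """Return True at most once, and only for the bound context before expiry."""
        with self._lock:
            entry = self._pending.get(nonce)
            if entry is None:
                return False  # unknown, evicted, or already used (replay)
            bound_context, expiry = entry
            if self._clock() >= expiry:
                del self._pending[nonce]
                return False  # expired
            if bound_context != context:
                return False  # wrong context; nonce stays valid for its owner
            del self._pending[nonce]  # consume-once: a replay finds nothing
            return True


if __name__ == "__main__":
    registry = NonceRegistry(ttl_seconds=30.0)
    token = registry.issue("merchant-A|cart-1")
    print(registry.consume(token, "merchant-A|cart-1"))  # first use
    print(registry.consume(token, "merchant-A|cart-1"))  # replay attempt
```

Because consumed and expired nonces are deleted, the registry's size follows the number of in-flight mandates rather than the transaction history, in line with the bounded-state claim in the abstract. The lock makes check-and-delete atomic within one process; a distributed deployment would need an equivalent atomic operation in shared storage.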
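[AI-56] Action Hallucination in Generative Visual-Language-Action Models
[Quick read]: This paper addresses violations of physical constraints by generative robot policies in real deployment, termed action hallucinations, and their extension to plan-level failures. The key contribution is to expose the structural mismatch between common model architectures and feasible robot behavior and to systematically identify three barriers (topological, precision, and horizon) that impose unavoidable trade-offs. This mechanistic analysis provides a theoretical basis for improving the reliability and trustworthiness of generative robot policies while retaining their expressive power.
Link: https://arxiv.org/abs/2602.06339
Authors: Harold Soh,Eugene Lim
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 22 pages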
Abstract:Robot Foundation Models such as Vision-Language-Action models are rapidly reshaping how robot policies are trained and deployed, replacing hand-designed planners with end-to-end generative action models. While these systems demonstrate impressive generalization, it remains unclear whether they fundamentally resolve the long-standing challenges of robotics. We address this question by analyzing action hallucinations that violate physical constraints and their extension to plan-level failures. Focusing on latent-variable generative policies, we show that hallucinations often arise from structural mismatches between feasible robot behavior and common model architectures. We study three such barriers – topological, precision, and horizon – and show how they impose unavoidable tradeoffs. Our analysis provides mechanistic explanations for reported empirical failures of generative robot policies and suggests principled directions for improving reliability and trustworthiness, without abandoning their expressive power.
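[AI-57] Exposing Weaknesses of Large Reasoning Models through Graph Algorithm Problems
[Quick read]: This paper addresses limitations in how current Large Reasoning Models (LRMs) are evaluated on mathematics, code, and common-sense reasoning benchmarks, which lack long-context evaluation, are insufficiently challenging, and have answers that are hard to verify programmatically. The authors propose GrAlgoBench, a new benchmark built on graph algorithm problems, which demand long-context reasoning, allow fine-grained control of difficulty, and support standardized, programmatic verification of correctness. Experiments show that the accuracy of current LRMs drops sharply with longer contexts (below 50% once graphs exceed 120 nodes) and that the models over-think: redundant self-verification inflates reasoning traces without improving accuracy, exposing key weaknesses in memory and execution reliability.
Link: https://arxiv.org/abs/2602.06319
Authors: Qifan Zhang,Jianhao Ruan,Aochuan Chen,Kang Zeng,Nuo Chen,Jing Tang,Jia Li
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: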
Abstract:Large Reasoning Models (LRMs) have advanced rapidly; however, existing benchmarks in mathematics, code, and common-sense reasoning remain limited. They lack long-context evaluation, offer insufficient challenge, and provide answers that are difficult to verify programmatically. We introduce GrAlgoBench, a benchmark designed to evaluate LRMs through graph algorithm problems. Such problems are particularly well suited for probing reasoning abilities: they demand long-context reasoning, allow fine-grained control of difficulty levels, and enable standardized, programmatic evaluation. Across nine tasks, our systematic experiments reveal two major weaknesses of current LRMs. First, accuracy deteriorates sharply as context length increases, falling below 50% once graphs exceed 120 nodes. This degradation is driven by frequent execution errors, weak memory, and redundant reasoning. Second, LRMs suffer from an over-thinking phenomenon, primarily caused by extensive yet largely ineffective self-verification, which inflates reasoning traces without improving correctness. By exposing these limitations, GrAlgoBench establishes graph algorithm problems as a rigorous, multidimensional, and practically relevant testbed for advancing the study of reasoning in LRMs. Code is available at this https URL.
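Programmatic verification is what makes graph problems attractive as a benchmark: a reference algorithm yields the exact answer for any generated instance. The sketch below uses unweighted shortest path as an example task; the abstract does not list the nine tasks, so this choice, the function names, and the plain-integer answer format are assumptions.

```python
from collections import deque
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]


def shortest_path_length(num_nodes: int, edges: Sequence[Edge],
                         source: int, target: int) -> int:
    """Breadth-first search on an undirected graph; -1 if target is unreachable."""
    adjacency: List[List[int]] = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    distance = [-1] * num_nodes
    distance[source] = 0
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return distance[node]
        for neighbour in adjacency[node]:
            if distance[neighbour] == -1:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return -1


def is_correct(model_answer: str, num_nodes: int, edges: Sequence[Edge],
               source: int, target: int) -> bool:
    """Compare a model's textual answer with the reference BFS result."""
    try:
        value = int(model_answer.strip())
    except ValueError:
        return False  # unparsable answers count as wrong
    return value == shortest_path_length(num_nodes, edges, source, target)


if __name__ == "__main__":
    graph_edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
    print(is_correct("2", 4, graph_edges, source=0, target=2))
```

Difficulty can then be scaled simply by generating larger graphs, and the same checker applies unchanged.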
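[AI-58] Toward generative machine learning for boosting ensembles of climate simulations
[Quick read]: This paper addresses the quantification of uncertainty in climate predictions that arises from irreducible internal climate variability. Conventional approaches generate ensembles with physics-based climate models, but computational limits make it hard to achieve both high resolution and the large ensembles needed for robust uncertainty estimates. The key to the solution is a conditional Variational Autoencoder (cVAE), a generative machine learning model trained on a limited sample of climate simulations to generate physically consistent ensemble members in arbitrary numbers, improving the representation of uncertainty while remaining computationally efficient. The method reproduces low- and high-order statistics (including extremes) and captures global teleconnection patterns, even under climate conditions absent from the training data.
Link: https://arxiv.org/abs/2602.06287
Authors: Parsa Gooya,Reinel Sospedra-Alfonso,Johannes Exenberger
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: this http URL contains Supplementary Information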
Abstract:Accurately quantifying uncertainty in predictions and projections arising from irreducible internal climate variability is critical for informed decision making. Such uncertainty is typically assessed using ensembles produced with physics-based climate models. However, computational constraints impose a trade-off between generating the large ensembles required for robust uncertainty estimation and increasing model resolution to better capture fine-scale dynamics. Generative machine learning offers a promising pathway to alleviate these constraints. We develop a conditional Variational Autoencoder (cVAE) trained on a limited sample of climate simulations to generate arbitrarily large ensembles. The approach is applied to output from monthly CMIP6 historical and future scenario experiments produced with the Canadian Centre for Climate Modelling and Analysis’ (CCCma’s) Earth system model CanESM5. We show that the cVAE model learns the underlying distribution of the data and generates physically consistent samples that reproduce realistic low and high moment statistics, including extremes. Compared with more sophisticated generative architectures, cVAEs offer a mathematically transparent, interpretable, and computationally efficient framework. Their simplicity leads to some limitations, such as overly smooth outputs, spectral bias, and underdispersion, that we discuss along with strategies to mitigate them. Specifically, we show that incorporating output noise improves the representation of climate-relevant multiscale variability, and we propose a simple method to achieve this. Finally, we show that cVAE-enhanced ensembles capture realistic global teleconnection patterns, even under climate conditions absent from the training data.
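[AI-59] Do LLMs Act Like Rational Agents? Measuring Belief Coherence in Probabilistic Decision Making
[Quick Read]: This paper asks whether large language models (LLMs) behave as rational utility maximizers in high-stakes decision settings, that is, whether their reasoning is consistent with Bayesian utility maximization and whether the probabilities they report are consistent with their actual beliefs. The key to the approach is a set of testable conditions for deciding whether an LLM's reported probability distribution could stem from the internal beliefs of any rational agent; if the conditions fail, the LLM's inferences deviate from ideal Bayesian utility maximization. The study evaluates several LLMs on medical diagnosis tasks, revealing gaps between their decisions and the theoretically optimal policy and providing a quantitative basis for improving the trustworthiness and interpretability of LLMs in critical applications.
Link: https://arxiv.org/abs/2602.06286
Authors: Khurram Yamin, Jingjing Tang, Santiago Cortes-Gomez, Amit Sharma, Eric Horvitz, Bryan Wilder
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: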
Abstract: Large language models (LLMs) are increasingly deployed as agents in high-stakes domains where optimal actions depend on both uncertainty about the world and consideration of utilities of different outcomes, yet their decision logic remains difficult to interpret. We study whether LLMs are rational utility maximizers with coherent beliefs and stable preferences. We consider behaviors of models for diagnosis challenge problems. The results provide insights about the relationship of LLM inferences to ideal Bayesian utility maximization for elicited probabilities and observed actions. Our approach provides falsifiable conditions under which the reported probabilities cannot correspond to the true beliefs of any rational agent. We apply this methodology to multiple medical diagnostic domains with evaluations across several LLMs. We discuss implications of the results and directions forward for uses of LLMs in guiding high-stakes decisions.
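[AI-60] GRP-Obliteration: Unaligning LLMs With a Single Unlabeled Prompt
[Quick Read]: This paper addresses how easily safety-aligned models can be broken after deployment: existing methods can align a model, but the alignment is not robust to later fine-tuning attacks. The authors propose GRP-Obliteration (GRP-Oblit), which uses Group Relative Policy Optimization (GRPO) to remove safety constraints directly from the target model without relying on large amounts of labeled data or degrading model utility. The key result is that a single unlabeled prompt is enough to reliably unalign an aligned model, that the method outperforms existing state-of-the-art techniques, and that it generalizes across both language models and diffusion-based image generation systems.
Link: https://arxiv.org/abs/2602.06258
Authors: Mark Russinovich, Yanan Cai, Keegan Hines, Giorgio Severi, Blake Bullwinkel, Ahmed Salem
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: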
Abstract: Safety alignment is only as robust as its weakest failure mode. Despite extensive work on safety post-training, it has been shown that models can be readily unaligned through post-deployment fine-tuning. However, these methods often require extensive data curation and degrade model utility. In this work, we extend the practical limits of unalignment by introducing GRP-Obliteration (GRP-Oblit), a method that uses Group Relative Policy Optimization (GRPO) to directly remove safety constraints from target models. We show that a single unlabeled prompt is sufficient to reliably unalign safety-aligned models while largely preserving their utility, and that GRP-Oblit achieves stronger unalignment on average than existing state-of-the-art techniques. Moreover, GRP-Oblit generalizes beyond language models and can also unalign diffusion-based image generation systems. We evaluate GRP-Oblit on six utility benchmarks and five safety benchmarks across fifteen 7-20B parameter models, spanning instruct and reasoning models, as well as dense and MoE architectures. The evaluated model families include GPT-OSS, distilled DeepSeek, Gemma, Llama, Ministral, and Qwen.
[AI-61] REBEL: Hidden Knowledge Recovery via Evolutionary-Based Evaluation Loop
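[Quick Read]: This paper addresses how to evaluate the effectiveness of machine unlearning methods for Large Language Models (LLMs). Prevailing evaluation metrics rely on benign queries and cannot detect sensitive or copyrighted data that remains in the model after unlearning, so the true degree of forgetting is misjudged. The authors propose REBEL, an evolutionary framework for adversarial prompt generation whose key idea is to evolve prompts dynamically in order to probe whether knowledge marked as "forgotten" can still be extracted. Experiments show that REBEL markedly improves recovery of incompletely removed knowledge, reaching Attack Success Rates (ASRs) of up to 60% on TOFU and 93% on WMDP, and revealing that existing unlearning methods may provide only superficial protection.
Link: https://arxiv.org/abs/2602.06248
Authors: Patryk Rybak, Paweł Batorski, Paul Swoboda, Przemysław Spurek
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: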
Abstract: Machine unlearning for LLMs aims to remove sensitive or copyrighted data from trained models. However, the true efficacy of current unlearning methods remains uncertain. Standard evaluation metrics rely on benign queries that often mistake superficial information suppression for genuine knowledge removal. Such metrics fail to detect residual knowledge that more sophisticated prompting strategies could still extract. We introduce REBEL, an evolutionary approach for adversarial prompt generation designed to probe whether unlearned data can still be recovered. Our experiments demonstrate that REBEL successfully elicits “forgotten” knowledge from models that seemed to be forgotten in standard unlearning benchmarks, revealing that current unlearning methods may provide only a superficial layer of protection. We validate our framework on subsets of the TOFU and WMDP benchmarks, evaluating performance across a diverse suite of unlearning algorithms. Our experiments show that REBEL consistently outperforms static baselines, recovering “forgotten” knowledge with Attack Success Rates (ASRs) reaching up to 60% on TOFU and 93% on WMDP. We will make all code publicly available upon acceptance. Code is available at this https URL
[AI-62] ATEX-CF: Attack-Informed Counterfactual Explanations for Graph Neural Networks ICLR2026
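[Quick Read]: This paper addresses the interpretability of Graph Neural Networks (GNNs), specifically the generation of high-quality counterfactual explanations: identifying the minimal changes to graph structure that alter a model's prediction, thereby answering "what must differ for a different outcome?". Traditional methods usually treat adversarial attacks and counterfactual explanations separately and are often limited to a single type of edge operation (for example, edge deletion only), which makes it hard to balance fidelity, sparsity, and plausibility. The proposed ATEX-CF framework is the first to model adversarial attack techniques and counterfactual explanation generation jointly: it uses the edge-addition strategy of adversarial attacks to make perturbations more efficient, combines it with the edge-deletion strategy of counterfactual methods to keep explanations understandable, and jointly optimizes the three objectives under a constrained perturbation budget, yielding more efficient and more realistic instance-level explanations.
Link: https://arxiv.org/abs/2602.06240
Authors: Yu Zhang, Sean Bin Yang, Arijit Khan, Cuneyt Gurcan Akcora
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 30 pages, accepted by ICLR 2026, github code: this https URL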
Abstract: Counterfactual explanations offer an intuitive way to interpret graph neural networks (GNNs) by identifying minimal changes that alter a model’s prediction, thereby answering “what must differ for a different outcome?”. In this work, we propose a novel framework, ATEX-CF, that unifies adversarial attack techniques with counterfactual explanation generation, a connection made feasible by their shared goal of flipping a node’s prediction, yet differing in perturbation strategy: adversarial attacks often rely on edge additions, while counterfactual methods typically use deletions. Unlike traditional approaches that treat explanation and attack separately, our method efficiently integrates both edge additions and deletions, grounded in theory, leveraging adversarial insights to explore impactful counterfactuals. In addition, by jointly optimizing fidelity, sparsity, and plausibility under a constrained perturbation budget, our method produces instance-level explanations that are both informative and realistic. Experiments on synthetic and real-world node classification benchmarks demonstrate that ATEX-CF generates faithful, concise, and plausible explanations, highlighting the effectiveness of integrating adversarial insights into counterfactual reasoning for GNNs.
[AI-63] SR4-Fit: An Interpretable and Informative Classification Algorithm Applied to Prediction of U.S. House of Representatives Elections ICML
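[Quick Read]: This paper addresses the lack of interpretability of machine learning models in critical applications: high-accuracy black-box models such as random forests obscure the relationship between inputs and outputs, while traditional interpretable rule-based algorithms such as RuleFit suffer from weak predictive performance and unstable results. The proposed solution is a new interpretable classification algorithm, Sparse Relaxed Regularized Regression Rule-Fit (SR4-Fit), which introduces sparsity and relaxed regularization to produce stable, easily interpreted rule sets while maintaining strong predictive accuracy, thereby easing the traditional trade-off between interpretability and predictive capability.
Link: https://arxiv.org/abs/2602.06229
Authors: Shyam Sundar Murali Krishnan, Dean Frederick Hougen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 2 figures, 7 tables, to appear in the 24th IEEE AMLA International Conference on Machine Learning and Applications (ICMLA’25)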
Abstract: The growth of machine learning demands interpretable models for critical applications, yet most high-performing models are “black-box” systems that obscure input-output relationships, while traditional rule-based algorithms like RuleFit suffer from a lack of predictive power and instability despite their simplicity. This motivated our development of Sparse Relaxed Regularized Regression Rule-Fit (SR4-Fit), a novel interpretable classification algorithm that addresses these limitations while maintaining superior classification performance. Using demographic characteristics of U.S. congressional districts from the Census Bureau’s American Community Survey, we demonstrate that SR4-Fit can predict House election party outcomes with unprecedented accuracy and interpretability. Our results show that while the majority party remains the strongest predictor, SR4-Fit has revealed intrinsic combinations of demographic factors that affect prediction outcomes and that could not be interpreted in black-box algorithms such as random forests. The SR4-Fit algorithm surpasses both black-box models and existing interpretable rule-based algorithms such as RuleFit with respect to accuracy, simplicity, and robustness, generating stable and interpretable rule sets while maintaining superior predictive performance, thus addressing the traditional trade-off between model interpretability and predictive capability in electoral forecasting. To further validate SR4-Fit’s performance, we also apply it to six additional publicly available classification datasets, namely the breast cancer, Ecoli, page blocks, Pima Indians, vehicle, and yeast datasets, and find similar results.
[AI-64] Do It for HER: First-Order Temporal Logic Reward Specification in Reinforcement Learning (Extended Version) AAAI2026
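[Quick Read]: This paper addresses how to specify non-Markovian rewards logically in Markov Decision Processes (MDPs) with large state spaces. Traditional approaches rely on manually encoded Boolean predicates and struggle with tasks over complex, heterogeneous data domains. The key to the solution is Linear Temporal Logic Modulo Theories over finite traces (LTLfMT), which extends predicates to formulas of arbitrary first-order theories, allowing complex tasks to be expressed naturally without manual feature engineering. The authors also identify a fragment of LTLfMT that is tractable yet expressive enough for infinite state spaces, and combine reward machines with Hindsight Experience Replay (HER) to deal with sparse rewards, which substantially improves policy learning efficiency and task completion.
Link: https://arxiv.org/abs/2602.06227
Authors: Pierriccardo Olivieri, Fausto Lasca, Alessandro Gianola, Matteo Papini
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: This is the extended version of a paper accepted at AAAI 2026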
Abstract: In this work, we propose a novel framework for the logical specification of non-Markovian rewards in Markov Decision Processes (MDPs) with large state spaces. Our approach leverages Linear Temporal Logic Modulo Theories over finite traces (LTLfMT), a more expressive extension of classical temporal logic in which predicates are first-order formulas of arbitrary first-order theories rather than simple Boolean variables. This enhanced expressiveness enables the specification of complex tasks over unstructured and heterogeneous data domains, promoting a unified and reusable framework that eliminates the need for manual predicate encoding. However, the increased expressive power of LTLfMT introduces additional theoretical and computational challenges compared to standard LTLf specifications. We address these challenges from a theoretical standpoint, identifying a fragment of LTLfMT that is tractable but sufficiently expressive for reward specification in an infinite-state-space context. From a practical perspective, we introduce a method based on reward machines and Hindsight Experience Replay (HER) to translate first-order logic specifications and address reward sparsity. We evaluate this approach in a continuous-control setting using Non-Linear Arithmetic Theory, showing that it enables natural specification of complex tasks. Experimental results show how a tailored implementation of HER is fundamental in solving tasks with complex goals.
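To make the reward-machine idea in this abstract concrete, here is a minimal sketch of a reward machine whose transitions are guarded by arithmetic predicates over a continuous state. It is not the authors' implementation: the `RewardMachine` class, the predicates `in_region_a` and `in_region_b`, and the two-step task are all hypothetical, and HER is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]
Predicate = Callable[[State], bool]


@dataclass
class RewardMachine:
    # transitions[u] lists (predicate, next_machine_state, reward) triples.
    transitions: Dict[int, List[Tuple[Predicate, int, float]]]
    initial: int = 0

    def run(self, trace: List[State]) -> Tuple[int, float]:
        """Replay a finite trace; return the final machine state and total reward."""
        u, total = self.initial, 0.0
        for s in trace:
            for pred, nxt, reward in self.transitions.get(u, []):
                if pred(s):  # first matching guard fires
                    u, total = nxt, total + reward
                    break
        return u, total


# Hypothetical predicates written as arithmetic formulas over the state.
def in_region_a(s: State) -> bool:
    return s["x"] + s["y"] > 1.0            # linear constraint


def in_region_b(s: State) -> bool:
    return s["x"] ** 2 + s["y"] ** 2 < 0.25  # non-linear constraint


# Task: visit region A first, then region B; reward is paid only at the end.
rm = RewardMachine({0: [(in_region_a, 1, 0.0)], 1: [(in_region_b, 2, 1.0)]})

good = [{"x": 0.0, "y": 0.0}, {"x": 0.8, "y": 0.5}, {"x": 0.1, "y": 0.2}]
wrong_order = [{"x": 0.1, "y": 0.2}, {"x": 0.8, "y": 0.5}]
result_good = rm.run(good)
result_wrong = rm.run(wrong_order)
```

The reward depends on the order in which regions are visited, not just on the current state, which is what makes it non-Markovian with respect to the environment state alone.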
[AI-65] Coupled Local and Global World Models for Efficient First Order RL
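[Quick Read]: This paper addresses the difficulties that Reinforcement Learning (RL) faces in complex manipulation tasks because it depends on hand-crafted physics simulators, which struggle to model non-rigid contacts, complex sensory information such as visual perception, and high-dimensional state spaces. The proposed training framework does not rely on a physics simulator: a large-scale diffusion model learned from robots' interactions with real environments serves as the world model, and a novel decoupled First-order Gradient (FoG) method uses the full-scale world model to generate high-fidelity forward trajectories while a lightweight latent-space surrogate approximates local dynamics for efficient gradient computation. This coupling of global and local world models balances high-fidelity rollouts with tractable differentiation, which substantially improves sample efficiency; the method is validated on the Push-T manipulation task and on an ego-centric object manipulation task with a quadruped.
Link: https://arxiv.org/abs/2602.06219
Authors: Joseph Amigo, Rooholla Khorrambakht, Nicolas Mansard, Ludovic Righetti
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: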
Abstract: World models offer a promising avenue for more faithfully capturing complex dynamics, including contacts and non-rigidity, as well as complex sensory information, such as visual perception, in situations where standard simulators struggle. However, these models are computationally complex to evaluate, posing a challenge for popular RL approaches that have been successfully used with simulators to solve complex locomotion tasks yet still struggle with manipulation. This paper introduces a method that bypasses simulators entirely, training RL policies inside world models learned from robots’ interactions with real environments. At its core, our approach enables policy training with large-scale diffusion models via a novel decoupled first-order gradient (FoG) method: a full-scale world model generates accurate forward trajectories, while a lightweight latent-space surrogate approximates its local dynamics for efficient gradient computation. This coupling of a local and global world model ensures high-fidelity unrolling alongside computationally tractable differentiation. We demonstrate the efficacy of our method on the Push-T manipulation task, where it significantly outperforms PPO in sample efficiency. We further evaluate our approach through an ego-centric object manipulation task with a quadruped. Together, these results demonstrate that learning inside data-driven world models is a promising pathway for solving hard-to-model RL tasks in image space without reliance on hand-crafted physics simulators.
[AI-66] Emergent Low-Rank Training Dynamics in MLPs with Smooth Activations
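[Quick Read]: This paper addresses the limited theoretical explanation for why the training dynamics of large-scale deep neural networks occur within low-dimensional subspaces, a mechanism that remains unclear for nonlinear networks in particular. By analyzing how the weights of multi-layer perceptrons (MLPs) evolve under gradient descent (GD), the authors show that the weight dynamics stay concentrated within invariant low-dimensional subspaces, and they precisely characterize the structure of these subspaces for two-layer networks with smooth nonlinear activations, giving a theoretical basis for low-rank training, compression, and adaptation. Experiments indicate that the phenomenon extends beyond the theoretical assumptions: across a range of classification tasks, a low-rank parameterization initialized within the appropriate subspaces matches the classification performance of fully parameterized models.
Link: https://arxiv.org/abs/2602.06208
Authors: Alec S. Xu, Can Yaras, Matthew Asato, Qing Qu, Laura Balzano
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 41 pages, 15 figures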
Abstract: Recent empirical evidence has demonstrated that the training dynamics of large-scale deep neural networks occur within low-dimensional subspaces. While this has inspired new research into low-rank training, compression, and adaptation, theoretical justification for these dynamics in nonlinear networks remains limited. To address this gap, this paper analyzes the learning dynamics of multi-layer perceptrons (MLPs) under gradient descent (GD). We demonstrate that the weight dynamics concentrate within invariant low-dimensional subspaces throughout training. Theoretically, we precisely characterize these invariant subspaces for two-layer networks with smooth nonlinear activations, providing insight into their emergence. Experimentally, we validate that this phenomenon extends beyond our theoretical assumptions. Leveraging these insights, we empirically show there exists a low-rank MLP parameterization that, when initialized within the appropriate subspaces, matches the classification performance of fully-parameterized counterparts on a variety of classification tasks.
[AI-67] Multi-Way Representation Alignment
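[Quick Read]: This paper addresses the alignment of the representation spaces of multiple (M ≥ 3) neural networks. Existing methods support only pairwise mappings, which makes it hard to build a consistent global reference space, and strictly isometric alignment performs poorly on retrieval tasks. The proposed solution is Geometry-Corrected Procrustes Alignment (GCPA): it first uses Generalized Procrustes Analysis (GPA) to build a shared orthogonal space that preserves internal geometry, then applies a post-hoc step that corrects directional mismatch, so that any-to-any retrieval improves markedly while a practical shared representation is retained.
Link: https://arxiv.org/abs/2602.06205
Authors: Akshit Achara, Tatiana Gaintseva, Mateo Mahaut, Pritish Chakraborty, Viktor Stenby Johansson, Melih Barsbey, Emanuele Rodolà, Donato Crisostomi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: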
Abstract: The Platonic Representation Hypothesis suggests that independently trained neural networks converge to increasingly similar latent spaces. However, current strategies for mapping these representations are inherently pairwise, scaling quadratically with the number of models and failing to yield a consistent global reference. In this paper, we study the alignment of $M \ge 3$ models. We first adapt Generalized Procrustes Analysis (GPA) to construct a shared orthogonal universe that preserves the internal geometry essential for tasks like model stitching. We then show that strict isometric alignment is suboptimal for retrieval, where agreement-maximizing methods like Canonical Correlation Analysis (CCA) typically prevail. To bridge this gap, we finally propose Geometry-Corrected Procrustes Alignment (GCPA), which establishes a robust GPA-based universe followed by a post-hoc correction for directional mismatch. Extensive experiments demonstrate that GCPA consistently improves any-to-any retrieval while retaining a practical shared reference space.
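[AI-68] Personagram: Bridging Personas and Product Design for Creative Ideation with Multimodal LLMs
[Quick Read]: This paper addresses the shortcomings of handcrafted personas in product design: they are abstract, expensive to produce, and hard to translate into concrete design features, so they often end up as static references rather than tools that actively drive design decisions. The solution is Personagram, an interactive system built on Multimodal Large Language Models (MLLMs) that lets designers explore detailed personas derived from census data, extract product features inferred from persona attributes, and recombine them for specific customer segments. This turns abstract personas into actionable design elements and significantly increases designers' engagement with personas, perceived transparency, and satisfaction.
Link: https://arxiv.org/abs/2602.06197
Authors: Taewook Kim, Matthew K. Hong, Yan-Ying Chen, Jonathan Q. Li, Monica P Van, Shabnam Hakimi, Matthew Kay, Matthew Klenk
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 22 pages, 10 figures, 4 tables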
Abstract:Product designers often begin their design process with handcrafted personas. While personas are intended to ground design decisions in consumer preferences, they often fall short in practice by remaining abstract, expensive to produce, and difficult to translate into actionable design features. As a result, personas risk serving as static reference points rather than tools that actively shape design outcomes. To address these challenges, we built Personagram, an interactive system powered by multimodal large language models (MLLMs) that helps designers explore detailed census-based personas, extract product features inferred from persona attributes, and recombine them for specific customer segments. In a study with 12 professional designers, we show that Personagram facilitates more actionable ideation workflows by structuring multimodal thinking from persona attributes to product design features, achieving higher engagement with personas, perceived transparency, and satisfaction compared to a chat-based baseline. We discuss implications of integrating AI-generated personas into product design workflows.
[AI-69] Hear You in Silence: Designing for Active Listening in Human Interaction with Conversational Agents Using Context-Aware Pacing
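[Quick Read]: This paper addresses the lack of emotional resonance in interactions with Conversational Agents (CAs), in particular their neglect of the temporal cues that signal "active listening" in human conversation. Conventional CAs use a fixed response pace and cannot adjust the rhythm of the interaction to the user's input, which weakens users' emotional engagement and trust. The solution introduces five context-aware pacing strategies: Reflective Silence, Facilitative Silence, Empathic Silence, Holding Space, and Immediate Response. A controlled study shows that these strategies significantly improve perceived human-likeness, smoothness, interactivity, and depth of self-disclosure, and that in the career-support scenario they also raise perceived listening quality and affective trust.
Link: https://arxiv.org/abs/2602.06134
Authors: Zhihan Jiang, Qianhui Chen, Chu Zhang, Yanheng Li, Ray LC
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 29 pages, 10 figures. Conditionally Accepted to CHI '26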
Abstract:In human conversation, empathic dialogue requires nuanced temporal cues indicating whether the conversational partner is paying attention. This type of “active listening” is overlooked in the design of Conversational Agents (CAs), which use the same pacing for one conversation. To model the temporal cues in human conversation, we need CAs that dynamically adjust response pacing according to user input. We qualitatively analyzed ten cases of active listening to distill five context-aware pacing strategies: Reflective Silence, Facilitative Silence, Empathic Silence, Holding Space, and Immediate Response. In a between-subjects study (N=50) with two conversational scenarios (relationship and career-support), the context-aware agent scored higher than static-pacing control on perceived human-likeness, smoothness, and interactivity, supporting deeper self-disclosure and higher engagement. In the career support scenario, the CA yielded higher perceived listening quality and affective trust. This work shows how insights from human conversation like context-aware pacing can empower the design of more empathic human-AI communication.
[AI-70] Urban Spatio-Temporal Foundation Models for Climate-Resilient Housing: Scaling Diffusion Transformers for Disaster Risk Prediction
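[Quick Read]: This paper addresses the disruption that climate hazards cause to urban transportation and emergency-response operations through housing damage, infrastructure degradation, and reduced network accessibility. The core solution is the Skjold-DiT framework, which fuses heterogeneous spatio-temporal urban data to produce building-level climate-risk indicators and explicitly incorporates transportation-network structure and accessibility signals (such as emergency reachability and evacuation-route constraints), supporting hazard-conditioned routing for intelligent-vehicle dispatch and emergency command systems. Key components include: (1) Fjell-Prompt, a prompt-based conditioning interface that improves cross-city transfer; (2) Norrland-Fusion, a cross-modal attention mechanism that unifies hazard maps, building attributes, demographics, and transportation infrastructure into a shared latent representation; and (3) Valkyrie-Forecast, a counterfactual simulator that generates probabilistic risk trajectories under intervention prompts, enabling uncertainty-aware prediction of accessibility layers (reachability, travel-time inflation, and route redundancy).
Link: https://arxiv.org/abs/2602.06129
Authors: Olaf Yunus Laitinen Imanov, Derya Umut Kulali, Taner Yilmaz
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures. Submitted to IEEE Transactions on Intelligent Vehicles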
Abstract:Climate hazards increasingly disrupt urban transportation and emergency-response operations by damaging housing stock, degrading infrastructure, and reducing network accessibility. This paper presents Skjold-DiT, a diffusion-transformer framework that integrates heterogeneous spatio-temporal urban data to forecast building-level climate-risk indicators while explicitly incorporating transportation-network structure and accessibility signals relevant to intelligent vehicles (e.g., emergency reachability and evacuation-route constraints). Concretely, Skjold-DiT enables hazard-conditioned routing constraints by producing calibrated, uncertainty-aware accessibility layers (reachability, travel-time inflation, and route redundancy) that can be consumed by intelligent-vehicle routing and emergency dispatch systems. Skjold-DiT combines: (1) Fjell-Prompt, a prompt-based conditioning interface designed to support cross-city transfer; (2) Norrland-Fusion, a cross-modal attention mechanism unifying hazard maps/imagery, building attributes, demographics, and transportation infrastructure into a shared latent representation; and (3) Valkyrie-Forecast, a counterfactual simulator for generating probabilistic risk trajectories under intervention prompts. We introduce the Baltic-Caspian Urban Resilience (BCUR) dataset with 847,392 building-level observations across six cities, including multi-hazard annotations (e.g., flood and heat indicators) and transportation accessibility features. Experiments evaluate prediction quality, cross-city generalization, calibration, and downstream transportation-relevant outcomes, including reachability and hazard-conditioned travel times under counterfactual interventions.
[AI-71] Jackpot: Optimal Budgeted Rejection Sampling for Extreme Actor-Policy Mismatch Reinforcement Learning ICLR2026
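[Quick Read]: This paper addresses the low training efficiency of Reinforcement Learning (RL) for Large Language Models (LLMs) caused by the high cost of rollout generation. Rollouts are usually coupled with policy optimization; decoupling them (for example, by using a more efficient model for rollouts) introduces a distribution mismatch that destabilizes learning. The proposed Jackpot framework centers on Optimal Budget Rejection Sampling (OBRS), which directly narrows the gap between the rollout model and the evolving policy under a controlled acceptance budget. Combined with a unified training objective and an efficient system implementation (top-k probability estimation and batch-level bias correction), it substantially improves training stability and reaches performance close to on-policy RL on Qwen3-8B-Base.
Link: https://arxiv.org/abs/2602.06107
Authors: Zhuoming Chen, Hongyi Liu, Yang Zhou, Haizhong Zheng, Beidi Chen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: ICLR 2026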
Abstract: Reinforcement learning (RL) for large language models (LLMs) remains expensive, particularly because the rollout is expensive. Decoupling rollout generation from policy optimization (e.g., leveraging a more efficient model for rollouts) could enable substantial efficiency gains, yet doing so introduces a severe distribution mismatch that destabilizes learning. We propose Jackpot, a framework that leverages Optimal Budget Rejection Sampling (OBRS) to directly reduce the discrepancy between the rollout model and the evolving policy. Jackpot integrates a principled OBRS procedure, a unified training objective that jointly updates the policy and rollout models, and an efficient system implementation enabled by top-$k$ probability estimation and batch-level bias correction. Our theoretical analysis shows that OBRS consistently moves the rollout distribution closer to the target distribution under a controllable acceptance budget. Empirically, Jackpot substantially improves training stability compared to importance-sampling baselines, achieving performance comparable to on-policy RL when training Qwen3-8B-Base for up to 300 update steps with batch size 64. Taken together, our results show that OBRS-based alignment brings us a step closer to practical and effective decoupling of rollout generation from policy optimization for RL for LLMs.
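As a rough illustration of budgeted rejection sampling on a small discrete distribution, the sketch below accepts a draw x from the rollout distribution q with probability min(1, c * p(x) / q(x)) and picks the smallest c that meets an acceptance budget. This acceptance form and every name in the code are assumptions made for illustration, not Jackpot's actual procedure; the sketch also assumes p(x) > 0 for every x.

```python
def acceptance_rate(p, q, c):
    # Probability that x ~ q is accepted under a(x) = min(1, c * p(x) / q(x)).
    return sum(min(qi, c * pi) for pi, qi in zip(p, q))


def fit_scale(p, q, budget, iters=60):
    """Bisection for the smallest c whose acceptance rate reaches the budget."""
    lo, hi = 0.0, max(qi / pi for pi, qi in zip(p, q))  # at hi everything is accepted
    for _ in range(iters):
        mid = (lo + hi) / 2
        if acceptance_rate(p, q, mid) >= budget:
            hi = mid
        else:
            lo = mid
    return hi


def accepted_distribution(p, q, c):
    # Distribution of the samples that survive the accept/reject step.
    z = acceptance_rate(p, q, c)
    return [min(qi, c * pi) / z for pi, qi in zip(p, q)]


def total_variation(a, b):
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))


target = [0.5, 0.3, 0.2]   # hypothetical policy distribution p
rollout = [0.2, 0.3, 0.5]  # hypothetical rollout distribution q
c = fit_scale(target, rollout, budget=0.7)
accepted = accepted_distribution(target, rollout, c)
tv_before = total_variation(rollout, target)
tv_after = total_variation(accepted, target)
```

In this toy case, keeping 70% of the rollout samples already shrinks the total variation distance to the target; a smaller budget would move the accepted distribution closer still, at the cost of discarding more samples.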
[AI-72] Coding Agents with Environment Interaction: A Theoretical Perspective
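[Quick Read]: This paper addresses the unclear theoretical mechanisms behind the environment-interaction strategies that generative AI uses in test-driven software development, in particular how feedback from the execution environment is used during code selection and generation to improve correctness. The solution is a probabilistic framework. First, it formalizes several well-established code-selection heuristics as environment-aware estimators of code correctness and proves that estimators based on fuzzy functional similarity strictly dominate those based on functional equivalence in signal-to-noise ratio. Second, it models backprompting as an in-context approximation of Thompson sampling and derives a regret bound with unobservable reward components, which explains why the effectiveness of backprompting is limited by the ambiguity of the task description (an irreducible regret). The analysis supports existing practice and also suggests how to improve task descriptions, leading to a new benchmark, QiskitHumanEvalSimX.
Link: https://arxiv.org/abs/2602.06098
Authors: Nicolas Menet, Michael Hersche, Andreas Krause, Abbas Rahimi
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: preprint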
Abstract:Coding agents are increasingly utilized in test-driven software development, yet the theoretical mechanisms behind their environment-interaction strategies remain underexplored. We provide a probabilistic framework for two dominant paradigms: code selection after generation using the execution environment, and code generation conditioned on environment feedback. First, we formalize several well-established selection heuristics as environment-aware estimators of code correctness. We theoretically prove that estimators based on fuzzy functional similarity add an inductive bias and strictly dominate estimators based on functional equivalence in terms of signal-to-noise ratio. Second, we frame backprompting as an in-context approximation of Thompson sampling. We derive a novel regret bound for reward functions with unobservable components, theoretically explaining why the effectiveness of backprompting is limited by the ambiguity of the informal task description (an irreducible regret). Using three state-of-the-art open weight models, we corroborate these findings across BigCodeBenchHard, LeetCodeDataset, and QiskitHumanEvalSim. Our formalization also suggests how to improve task descriptions effectively, leading to a new benchmark, QiskitHumanEvalSimX.
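For readers unfamiliar with Thompson sampling, which the abstract uses as a lens on backprompting, the following is a minimal Beta-Bernoulli sketch. It shows plain Thompson sampling on a two-armed bandit, not the paper's in-context approximation; the function name, the arm success rates, and the seed are illustrative assumptions.

```python
import random


def thompson_sampling(success_probs, rounds, seed=0):
    """Beta-Bernoulli Thompson sampling over a fixed set of arms.

    success_probs holds the true (hidden) success rates and is used only to
    simulate feedback, e.g. whether a hypothetical candidate passes its tests.
    """
    rng = random.Random(seed)
    n = len(success_probs)
    alpha = [1.0] * n  # 1 + observed successes per arm
    beta = [1.0] * n   # 1 + observed failures per arm
    pulls = [0] * n
    for _ in range(rounds):
        # Draw one plausible success rate per arm from its Beta posterior.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(n)]
        arm = max(range(n), key=lambda i: samples[i])
        reward = 1 if rng.random() < success_probs[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


pulls = thompson_sampling([0.1, 0.9], rounds=1000)
```

Acting on a posterior sample rather than the posterior mean is what balances exploration and exploitation: uncertain arms occasionally produce high samples and get tried again.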
[AI-73] NanoNet: Parameter-Efficient Learning with Label-Scarce Supervision for Lightweight Text Mining Model
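[Quick Read]: This paper addresses three problems in Lightweight Semi-Supervised Learning (LSL): scarce labeled samples, high model inference cost, and training strategies that are computationally expensive and prone to local optima. The proposed NanoNet framework trains models under limited supervision through Parameter-Efficient Learning: it uses Online Knowledge Distillation to generate multiple small models and improves them with Mutual Learning Regularization, substantially reducing training overhead and dependence on labels while producing a lightweight text mining model for downstream inference.
Link: https://arxiv.org/abs/2602.06093
Authors: Qianren Mao, Yashuo Luo, Ziqi Qin, Junnan Liu, Weifeng Jiang, Zhijun Chen, Zhuoran Li, Likang Xiao, Chuou Xu, Qili Zhang, Hanwen Hao, Jingzheng Li, Chunghua Lin, Jianxin Li, Philip S. Yu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: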
Abstract: The lightweight semi-supervised learning (LSL) strategy provides an effective approach to conserving labeled samples and minimizing model inference costs. Prior research has effectively applied knowledge transfer learning and co-training regularization from large to small models in LSL. However, such training strategies are computationally intensive and prone to local optima, thereby increasing the difficulty of finding the optimal solution. This has prompted us to investigate the feasibility of integrating three low-cost scenarios for text mining tasks: limited labeled supervision, lightweight fine-tuning, and rapid-inference small models. We propose NanoNet, a novel framework for lightweight text mining that implements parameter-efficient learning with limited supervision. It employs online knowledge distillation to generate multiple small models and enhances their performance through mutual learning regularization. The entire process leverages parameter-efficient learning, reducing training costs and minimizing supervision requirements, ultimately yielding a lightweight model for downstream inference.
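One common way to write a mutual learning regularizer is to add, for each small model, a KL term that pulls its predictions toward a peer's. The sketch below assumes that form; the abstract does not give NanoNet's exact loss, and the function names and the `weight` parameter are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    # KL(p || q) for discrete distributions with q > 0 everywhere.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_learning_loss(own_logits, peer_logits, label=None, weight=1.0):
    """Loss for one model on one sample.

    The KL term asks this model to agree with its peer and needs no label;
    the cross-entropy term is added only when a label is available.
    """
    own, peer = softmax(own_logits), softmax(peer_logits)
    loss = weight * kl_divergence(peer, own)
    if label is not None:
        loss += -math.log(own[label])
    return loss


unlabeled_agree = mutual_learning_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
unlabeled_disagree = mutual_learning_loss([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0])
labeled = mutual_learning_loss([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0], label=0)
```

Because the agreement term works on unlabeled samples, a regularizer of this kind fits the label-scarce setting the abstract describes.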
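[AI-74] Transformer-Based Reinforcement Learning for Autonomous Orbital Collision Avoidance in Partially Observable Environments
[Quick Read]: This paper addresses the reliability of decisions in autonomous orbital collision avoidance under partial observability and imperfect monitoring. The solution is a Transformer-based reinforcement learning framework that combines a configurable encounter simulator, a distance-dependent observation model, and a sequential state estimator to represent uncertainty in relative motion. Its central contribution is a Transformer architecture for the Partially Observable Markov Decision Process (POMDP), which uses long-range temporal attention to interpret noisy and intermittent observations more effectively, improving the agent's collision-avoidance performance and robustness under imperfect monitoring.
Link: https://arxiv.org/abs/2602.06088
Authors: Thomas Georges, Adam Abdin
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: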
Abstract: We introduce a Transformer-based Reinforcement Learning framework for autonomous orbital collision avoidance that explicitly models the effects of partial observability and imperfect monitoring in space operations. The framework combines a configurable encounter simulator, a distance-dependent observation model, and a sequential state estimator to represent uncertainty in relative motion. A central contribution of this work is the use of a Transformer-based Partially Observable Markov Decision Process (POMDP) architecture, which leverages long-range temporal attention to interpret noisy and intermittent observations more effectively than traditional architectures. This integration provides a foundation for training collision avoidance agents that can operate more reliably in imperfectly monitored environments.
[AI-75] Allocate Marginal Reviews to Borderline Papers Using LLM Comparative Ranking
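[Quick Read]: This paper addresses the inefficient allocation of reviewing resources at large Machine Learning (ML) conferences, that is, how to use limited marginal review capacity to improve overall review quality and fairness. Current practice often relies on random or affinity-driven assignment and does not target papers near the acceptance boundary. The key to the solution is to run pairwise comparisons with Large Language Models (LLMs) before human reviewing and fit a Bradley-Terry model to obtain a relative ranking, which identifies a "borderline band" of papers. Given a venue-specific minimum number of reviews (e.g., 3 or 4), papers in that band are assigned one additional reviewer (e.g., a 4th or 5th), without conditioning on any human reviews and without using LLM outputs for accept/reject decisions. The expected impact is quantified through the overlap between the predicted and true borderline sets (ρ) and the incremental value of an extra review near the boundary (Δ).
Link: https://arxiv.org/abs/2602.06078
Authors: Elliot L. Epstein, Rajat Dwaraknath, John Winnicki, Thanawat Sornwanee
Affiliation: Unknown
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages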
Abstract: This paper argues that large ML conferences should allocate marginal review capacity primarily to papers near the acceptance boundary, rather than spreading extra reviews via random or affinity-driven heuristics. We propose using LLM-based comparative ranking (via pairwise comparisons and a Bradley–Terry model) to identify a borderline band before human reviewing and to allocate marginal reviewer capacity at assignment time. Concretely, given a venue-specific minimum review target (e.g., 3 or 4), we use this signal to decide which papers receive one additional review (e.g., a 4th or 5th), without conditioning on any human reviews and without using LLM outputs for accept/reject. We provide a simple expected-impact calculation in terms of (i) the overlap between the predicted and true borderline sets ($\rho$) and (ii) the incremental value of an extra review near the boundary ($\Delta$), and we provide retrospective proxies to estimate these quantities.
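The ranking step described above can be sketched as follows: fit Bradley-Terry strengths from pairwise outcomes with the standard minorization-maximization update, then take the papers ranked closest to the acceptance cutoff. The comparison data, the `accept_rate` and `band_width` parameters, and the function names are hypothetical; in the paper's setting the pairwise outcomes would come from LLM comparisons.

```python
from collections import defaultdict


def bradley_terry(n_items, comparisons, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) index pairs."""
    wins = [0.0] * n_items
    pair_counts = defaultdict(int)  # unordered pair -> number of comparisons
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[(min(winner, loser), max(winner, loser))] += 1

    strength = [1.0] * n_items
    for _ in range(iters):
        denom = [0.0] * n_items
        for (i, j), n in pair_counts.items():
            share = n / (strength[i] + strength[j])
            denom[i] += share
            denom[j] += share
        # MM update: wins divided by expected exposure; uncompared items keep their value.
        strength = [wins[i] / denom[i] if denom[i] > 0 else strength[i]
                    for i in range(n_items)]
        total = sum(strength)
        strength = [s * n_items / total for s in strength]  # fix the overall scale
    return strength


def borderline_band(strength, accept_rate, band_width):
    """Indices of items ranked within band_width positions of the cutoff."""
    order = sorted(range(len(strength)), key=lambda i: -strength[i])
    cutoff = round(accept_rate * len(order))
    lo, hi = max(0, cutoff - band_width), min(len(order), cutoff + band_width)
    return order[lo:hi]


# Four hypothetical papers; each adjacent pair is compared three times.
comparisons = [(0, 1), (0, 1), (1, 0), (1, 2), (1, 2), (2, 1),
               (2, 3), (2, 3), (3, 2)]
strength = bradley_terry(4, comparisons)
band = borderline_band(strength, accept_rate=0.5, band_width=1)
```

Papers in `band` would be the ones given the extra review; the fitted strengths themselves are never used as an accept/reject signal, in line with the abstract.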
[AI-76] HQP: Sensitivity-Aware Hybrid Quantization and Pruning for Ultra-Low-Latency Edge AI Inference
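[Quick read]: This paper addresses the pressing need for model optimization to support high-fidelity, real-time inference in distributed edge-cloud environments under severe latency and energy constraints. The core challenge is compressing and accelerating a model efficiently while keeping the accuracy loss under control. The key to the solution is an integrated Hybrid Quantization and Pruning (HQP) framework. It introduces a dynamic weight sensitivity metric based on an approximation of the Fisher Information Matrix (FIM) to guide a structured pruning algorithm that iteratively removes redundant filters, and it strictly requires that the accuracy drop of the pruned model not exceed a preset threshold (Δacc), so that the sparse model structure is maximally robust to quantization error and hardware-specific kernel optimization. 8-bit post-training quantization is then applied. Evaluation on several NVIDIA Jetson edge platforms shows up to a 3.12x inference speedup and a 55% reduction in model size, with accuracy loss kept within 1.5%.
Link: https://arxiv.org/abs/2602.06069
Authors: Dinesh Gopalan, Ratul Ali
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: 7 pages, 3 figures, 2 tables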
Abstract:The escalating demand for high-fidelity, real-time inference in distributed edge-cloud environments necessitates aggressive model optimization to counteract severe latency and energy constraints. This paper introduces the Hybrid Quantization and Pruning (HQP) framework, a novel, integrated methodology designed to achieve synergistic model acceleration while adhering to strict quality guarantees. We detail a sensitivity-aware structural pruning algorithm that employs a dynamic weight sensitivity metric, derived from a highly efficient approximation of the Fisher Information Matrix (FIM), to guide the iterative removal of redundant filters. This pruning is strictly conditional, enforcing an adherence to a maximum permissible accuracy drop (Delta ax) before the model proceeds to 8-bit post-training quantization. This rigorous coordination is critical, as it ensures the resultant sparse model structure is maximally robust to quantization error and hardware-specific kernel optimization. Exhaustive evaluation across heterogeneous NVIDIA Jetson edge platforms, utilizing resource-efficient architectures like MobileNetV3 and ResNet-18, demonstrates that the HQP framework achieves a peak performance gain of 3.12 times inference speedup and a 55 percent model size reduction, while rigorously containing the accuracy drop below the 1.5 percent constraint. A comprehensive comparative analysis against conventional single-objective compression techniques validates the HQP framework as a superior, hardware-agnostic solution for deploying ultra-low-latency AI in resource-limited edge infrastructures.
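The pruning stage described above (rank filters by a Fisher-based sensitivity, remove the least sensitive ones, stop once the accuracy budget would be exceeded) can be illustrated with a small sketch. The abstract does not specify how the FIM is approximated, so this example assumes the common empirical diagonal approximation (mean squared gradient, summed per filter). The function names and the `evaluate` callback are hypothetical.

```python
def fisher_sensitivity(grad_batches):
    """Per-filter sensitivity as the mean over batches of summed squared gradients.

    grad_batches: list of dicts mapping filter name -> list of weight gradients.
    This empirical diagonal Fisher estimate is an assumption, not the paper's formula.
    """
    totals = {}
    for batch in grad_batches:
        for name, grads in batch.items():
            totals[name] = totals.get(name, 0.0) + sum(g * g for g in grads)
    n_batches = len(grad_batches)
    return {name: total / n_batches for name, total in totals.items()}


def prune_under_budget(sensitivity, evaluate, baseline_acc, max_drop):
    """Greedily remove the least sensitive filters while the drop stays in budget.

    evaluate(kept_filters) -> accuracy of the model restricted to those filters
    (a hypothetical callback supplied by the caller).
    Returns (kept_filters, pruned_filters_in_removal_order).
    """
    kept = set(sensitivity)
    pruned = []
    for name in sorted(sensitivity, key=sensitivity.get):
        candidate = kept - {name}
        if not candidate:
            break  # never prune the last remaining filter
        if baseline_acc - evaluate(candidate) <= max_drop:
            kept = candidate
            pruned.append(name)
        else:
            break  # budget would be exceeded, so stop before quantization
    return kept, pruned
```

Only after this loop terminates would the model move on to 8-bit post-training quantization, which this sketch leaves out.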
[AI-77] iScheduler: Reinforcement Learning-Driven Continual Optimization for Large-Scale Resource Investment Problems
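[Quick read]: This paper addresses the scheduling of precedence-constrained tasks under shared renewable resources, i.e., the Resource Investment Problem (RIP), whose goal is to minimize the cost of the provisioned renewable resources while satisfying the precedence and timing constraints among tasks. Traditional approaches based on mixed-integer programming (MIP) and constraint programming (CP) are slow on large instances and struggle to support fast rescheduling under dynamic updates. The key to the solution is the iScheduler framework, which models RIP solving as a Markov decision process (MDP) over decomposed subproblems and uses a reinforcement-learning-driven iterative scheduling policy that builds the schedule by selecting processes sequentially. It also reuses the schedules of unaffected processes and reschedules only the changed ones, which substantially accelerates optimization and supports low-latency reconfiguration.
Link: https://arxiv.org/abs/2602.06064
Authors: Yi-Xiang Hu, Yuke Wang, Feng Wu, Zirui Huang, Shuli Zeng, Xiang-Yang Li
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: 13 pages, 7 figures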
Abstract:Scheduling precedence-constrained tasks under shared renewable resources is central to modern computing platforms. The Resource Investment Problem (RIP) models this setting by minimizing the cost of provisioned renewable resources under precedence and timing constraints. Exact mixed-integer programming and constraint programming become impractically slow on large instances, and dynamic updates require schedule revisions under tight latency budgets. We present iScheduler, a reinforcement-learning-driven iterative scheduling framework that formulates RIP solving as a Markov decision process over decomposed subproblems and constructs schedules through sequential process selection. The framework accelerates optimization and supports reconfiguration by reusing unchanged process schedules and rescheduling only affected processes. We also release L-RIPLIB, an industrial-scale benchmark derived from cloud-platform workloads with 1,000 instances of 2,500-10,000 tasks. Experiments show that iScheduler attains competitive resource costs while reducing time to feasibility by up to 43× against strong commercial baselines.
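To make the RIP objective concrete, the sketch below evaluates a given schedule: the provisioned amount of each renewable resource is its peak simultaneous usage, and the cost is the sum of peak usage times unit cost. This is only the textbook objective and a precedence check, not iScheduler itself. Integer start times and durations, and all names in the code, are assumptions made for illustration.

```python
def provisioned_cost(starts, durations, demands, unit_costs):
    """Return (total_cost, peaks) for a schedule with integer times.

    starts, durations: dicts task -> int.
    demands: dict task -> list with the per-resource demand of that task.
    unit_costs: list with the cost of one unit of each resource.
    """
    horizon = max(starts[t] + durations[t] for t in starts)
    peaks = [0] * len(unit_costs)
    for time in range(horizon):
        usage = [0] * len(unit_costs)
        for task in starts:
            if starts[task] <= time < starts[task] + durations[task]:
                for r, amount in enumerate(demands[task]):
                    usage[r] += amount
        peaks = [max(p, u) for p, u in zip(peaks, usage)]
    total = sum(cost * peak for cost, peak in zip(unit_costs, peaks))
    return total, peaks


def respects_precedence(starts, durations, precedences):
    """True if every (before, after) pair finishes `before` no later than `after` starts."""
    return all(starts[a] + durations[a] <= starts[b] for a, b in precedences)
```

Shifting a task so that it no longer overlaps with others lowers the peak and therefore the cost, which is the trade-off between makespan and provisioning that an RIP solver searches over.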
[AI-78] Git for Sketches: An Intelligent Tracking System for Capturing Design Evolution
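[Quick read]: This paper addresses the difficulty traditional sketching tools have in capturing the non-linear design history and cognitive intent during product conceptualization. The key to the solution is DIMES (Design Idea Management and Evolution capture System), a web-based system whose core innovation is sGIT (SketchGit), a visual version control architecture that maps Git primitives to design actions, integrated with generative AI modules. Through its AEGIS module, sGIT classifies six stroke types with hybrid deep learning and machine learning models and supports implicit branching and multi-modal commits (stroke data + voice intent), giving a structured, traceable record of the design process. The generative AI modules automatically produce narrative summaries and rendered images, which markedly improve knowledge transfer and user acceptance and support the claim that intelligent version control can connect creative action with cognitive documentation.
Link: https://arxiv.org/abs/2602.06047
Authors: Sankar B, Amogh A S, Sandhya Baranwal, Dibakar Sen
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 49 pages, 25 figures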
Abstract:During product conceptualization, capturing the non-linear history and cognitive intent is crucial. Traditional sketching tools often lose this context. We introduce DIMES (Design Idea Management and Evolution capture System), a web-based environment featuring sGIT (SketchGit), a custom visual version control architecture, and Generative AI. sGIT includes AEGIS, a module using hybrid Deep Learning and Machine Learning models to classify six stroke types. The system maps Git primitives to design actions, enabling implicit branching and multi-modal commits (stroke data + voice intent). In a comparative study, experts using DIMES demonstrated a 160% increase in breadth of concept exploration. Generative AI modules generated narrative summaries that enhanced knowledge transfer; novices achieved higher replication fidelity (Neural Transparency-based Cosine Similarity: 0.97 vs. 0.73) compared to manual summaries. AI-generated renderings also received higher user acceptance (Purchase Likelihood: 4.2 vs 3.1). This work demonstrates that intelligent version control bridges creative action and cognitive documentation, offering a new paradigm for design education.
[AI-79] The Quantum Sieve Tracer: A Hybrid Framework for Layer-Wise Activation Tracing in Large Language Models
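[Quick read]: This paper addresses mechanistic interpretability in Large Language Models (LLMs), specifically how to separate sparse semantic signals from high-dimensional polysemantic noise in order to pinpoint the internal computational pathways of factual recall. The key to the solution is a hybrid quantum-classical framework, the Quantum Sieve Tracer, which works in two stages: classical causal tracing first localizes the critical layers, and the activations of specific attention heads are then mapped into an exponentially large quantum Hilbert space, where quantum kernel methods distinguish constructive (recall) from reductive (suppression) mechanisms and reveal the functional differentiation of attention structures at a fine-grained topological level.
Link: https://arxiv.org/abs/2602.06852
Authors: Jonathan Pan
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: 4 pages, 4 figures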
Abstract:Mechanistic interpretability aims to reverse-engineer the internal computations of Large Language Models (LLMs), yet separating sparse semantic signals from high-dimensional polysemantic noise remains a significant challenge. This paper introduces the Quantum Sieve Tracer, a hybrid quantum-classical framework designed to characterize factual recall circuits. We implement a modular pipeline that first localizes critical layers using classical causal tracing, then maps specific attention head activations into an exponentially large quantum Hilbert space. Using open-weight models (Meta Llama-3.2-1B and Alibaba Qwen2.5-1.5B-Instruct), we perform a two-stage analysis that reveals a fundamental architectural divergence. While Qwen’s layer 7 circuit functions as a classic Recall Hub, we discover that Llama’s layer 9 acts as an Interference Suppression circuit, where ablating the identified heads paradoxically improves factual recall. Our results demonstrate that quantum kernels can distinguish between these constructive (recall) and reductive (suppression) mechanisms, offering a high-resolution tool for analyzing the fine-grained topology of attention.
[AI-80] Bridging 6G IoT and AI: LLM-Based Efficient Approach for Physical Layers Optimization Tasks
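[Quick read]: This paper addresses the real-time performance and adaptability of physical-layer optimization tasks in resource-constrained sixth-generation (6G) Internet of Things (IoT) networks, in particular how to optimize dynamically without retraining the model. The key to the solution is a Prompt-Engineering-based Real-Time Feedback and Verification (PE-RTFV) framework. It exploits the closed-loop feedback inherent in wireless communication systems: an optimization LLM (O-LLM) iteratively generates structured prompts, which are passed to an agent LLM (A-LLM) that produces task-specific solutions. The O-LLM refines the prompts from real-time system feedback in a gradient-descent-like iteration, guiding the A-LLM step by step toward better solutions. Experiments on a user-goal-driven constellation design task in a wireless-powered IoT scenario show that the method reaches near-genetic-algorithm performance within only a few iterations, supporting its effectiveness and efficiency for complex physical-layer optimization tasks.
Link: https://arxiv.org/abs/2602.06819
Authors: Ahsan Mehmood, Naveed Ul Hassan, Ghassan M. Kraidy
Affiliation: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: This paper is submitted to IEEE IoT Journal and is currently under review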
Abstract:This paper investigates the role of large language models (LLMs) in sixth-generation (6G) Internet of Things (IoT) networks and proposes a prompt-engineering-based real-time feedback and verification (PE-RTFV) framework that performs physical-layer optimization tasks through an iterative process. By leveraging the naturally available closed-loop feedback inherent in wireless communication systems, PE-RTFV enables real-time physical-layer optimization without requiring model retraining. The proposed framework employs an optimization LLM (O-LLM) to generate task-specific structured prompts, which are provided to an agent LLM (A-LLM) to produce task-specific solutions. Utilizing real-time system feedback, the O-LLM iteratively refines the prompts to guide the A-LLM toward improved solutions in a gradient-descent-like optimization process. We test the PE-RTFV approach in a wireless-powered IoT testbed case study on user-goal-driven constellation design, semantically solving a rate-energy (RE)-region optimization problem. The results demonstrate that PE-RTFV achieves near-genetic-algorithm performance within only a few iterations, validating its effectiveness for complex physical-layer optimization tasks in resource-constrained IoT networks.
[AI-81] ARIS-RSMA Enhanced ISAC System: Joint Rate Splitting and Beamforming Design
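[Quick read]: This paper addresses the fairness bottleneck caused by channel blockage in multi-target sensing under non-line-of-sight (NLOS) conditions, i.e., how to improve the sensing fairness of multi-target echo signals and overall system performance in cluttered scenes with obstacles. The key to the solution is an active reconfigurable intelligent surface (ARIS) assisted rate-splitting multiple access (RSMA) integrated sensing and communication (ISAC) system. Transceiver beamforming, the ARIS configuration, and the rate-splitting strategy are jointly optimized to maximize the minimum multi-target echo signal-to-interference-plus-noise ratio (SINR) under multi-user rate and power constraints. To handle this highly non-convex problem, it is decomposed into three subproblems that are solved iteratively with majorization-minimization (MM) and sequential rank-one constraint relaxation (SROCR) algorithms. Simulations show that the scheme clearly outperforms non-orthogonal multiple access (NOMA), space-division multiple access (SDMA), and passive RIS baselines, and approaches the sensing-only upper bound.
Link: https://arxiv.org/abs/2602.06399
Authors: Xin Jin, Tiejun Lv, Yashuai Cao, Jie Zeng, Mugen Peng
Affiliation: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: 5 pages, 5 figures, accepted by IEEE Wireless Communications Letters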
Abstract:This letter proposes an active reconfigurable intelligent surface (ARIS) assisted rate-splitting multiple access (RSMA) integrated sensing and communication (ISAC) system to overcome the fairness bottleneck in multi-target sensing under obstructed line-of-sight environments. Beamforming at the transceiver and ARIS, along with rate splitting, are optimized to maximize the minimum multi-target echo signal-to-interference-plus-noise ratio under multi-user rate and power constraints. The intricate non-convex problem is decoupled into three subproblems and solved iteratively by majorization-minimization (MM) and sequential rank-one constraint relaxation (SROCR) algorithms. Simulations show our scheme outperforms nonorthogonal multiple access, space-division multiple access, and passive RIS baselines, approaching sensing-only upper bounds.
[AI-82] Optimal rates for density and mode estimation with expand-and-sparsify representations AISTATS2026
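[Quick read]: This paper studies models of the sparse representations observed in sensory systems and asks how well they serve two fundamental statistical tasks, density estimation and mode estimation. The approach is built on the "expand-and-sparsify" representation: an input $x \in \mathbb{R}^d$ is first mapped by random linear projections to a much higher-dimensional space $\mathbb{R}^m$ (with $m \gg d$), after which only the $k \ll m$ largest entries are kept and the rest are set to zero, yielding a $k$-sparse vector. The paper shows that this representation can be used to build a density estimator with optimal $\ell_\infty$ convergence rates and, on top of it, algorithms that achieve optimal mode estimation rates (up to logarithmic factors) under mild conditions. The key is to exploit the structure and sparsity induced by the random projections to obtain estimators that are both efficient and statistically optimal.
Link: https://arxiv.org/abs/2602.06175
Authors: Kaushik Sinha, Christopher Tosh
Affiliation: Unknown
Categories: Statistics Theory (math.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted at AISTATS 2026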
Abstract:Expand-and-sparsify representations are a class of theoretical models that capture sparse representation phenomena observed in the sensory systems of many animals. At a high level, these representations map an input $x \in \mathbb{R}^d$ to a much higher dimension $m \gg d$ via random linear projections before zeroing out all but the $k \ll m$ largest entries. The result is a $k$-sparse vector in $\{0,1\}^m$. We study the suitability of this representation for two fundamental statistical problems: density estimation and mode estimation. For density estimation, we show that a simple linear function of the expand-and-sparsify representation produces an estimator with minimax-optimal $\ell_\infty$ convergence rates. In mode estimation, we provide simple algorithms on top of our density estimator that recover single or multiple modes at optimal rates up to logarithmic factors under mild conditions.
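The representation itself is easy to write down. The following sketch uses only the standard library and assumes Gaussian projection weights, which the abstract does not specify (it only says "random linear projections"). The function names are hypothetical, and the density estimator built on top of the representation is not reproduced here.

```python
import random


def make_projection(d, m, seed=0):
    """Draw an m x d random projection matrix (Gaussian entries are an assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]


def expand_and_sparsify(x, projection, k):
    """Map x in R^d to a k-sparse binary vector in {0,1}^m.

    Step 1 (expand): compute the m projections of x.
    Step 2 (sparsify): set the k largest projections to 1 and everything else to 0.
    """
    activations = [sum(w * xi for w, xi in zip(row, x)) for row in projection]
    top_k = sorted(range(len(activations)), key=lambda j: activations[j], reverse=True)[:k]
    z = [0] * len(projection)
    for j in top_k:
        z[j] = 1
    return z
```

Nearby inputs tend to activate overlapping sets of units, which is the property a linear read-out of this vector can exploit for density estimation.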
Machine Learning
[LG-0] Improving Credit Card Fraud Detection with an Optimized Explainable Boosting Machine
Link: https://arxiv.org/abs/2602.06955
Authors: Reza E. Fazel, Arash Bakhtiary, Siavash A. Bigdeli
Categories: Machine Learning (cs.LG)
Comments: 22 pages, 5 figures, 5 tables
Abstract:Addressing class imbalance is a central challenge in credit card fraud detection, as it directly impacts predictive reliability in real-world financial systems. To overcome this, the study proposes an enhanced workflow based on the Explainable Boosting Machine (EBM), a transparent, state-of-the-art implementation of the GA2M algorithm, optimized through systematic hyperparameter tuning, feature selection, and preprocessing refinement. Rather than relying on conventional sampling techniques that may introduce bias or cause information loss, the optimized EBM achieves an effective balance between accuracy and interpretability, enabling precise detection of fraudulent transactions while providing actionable insights into feature importance and interaction effects. Furthermore, the Taguchi method is employed to optimize both the sequence of data scalers and model hyperparameters, ensuring robust, reproducible, and systematically validated performance improvements. Experimental evaluation on benchmark credit card data yields an ROC-AUC of 0.983, surpassing prior EBM baselines (0.975) and outperforming Logistic Regression, Random Forest, XGBoost, and Decision Tree models. These results highlight the potential of interpretable machine learning and data-driven optimization for advancing trustworthy fraud analytics in financial systems.
[LG-1] Optimal Derivative Feedback Control for an Active Magnetic Levitation System: An Experimental Study on Data-Driven Approaches
Link: https://arxiv.org/abs/2602.06944
Authors: Saber Omidi, Rene Akupan Ebunle, Se Young Yoon
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: 10 pages, 9 figures. Preprint; manuscript under journal review
Abstract:This paper presents the design and implementation of data-driven optimal derivative feedback controllers for an active magnetic levitation system. A direct, model-free control design method based on the reinforcement learning framework is compared with an indirect optimal control design derived from a numerically identified mathematical model of the system. For the direct model-free approach, a policy iteration procedure is proposed, which adds an iteration layer called the epoch loop to gather multiple sets of process data, providing a more diverse dataset and helping reduce learning biases. This direct control design method is evaluated against a comparable optimal control solution designed from a plant model obtained through the combined Dynamic Mode Decomposition with Control (DMDc) and Prediction Error Minimization (PEM) system identification. Results show that while both controllers can stabilize and improve the performance of the magnetic levitation system when compared to controllers designed from a nominal model, the direct model-free approach consistently outperforms the indirect solution when multiple epochs are allowed. The iterative refinement of the optimal control law over the epoch loop provides the direct approach a clear advantage over the indirect method, which relies on a single set of system data to determine the identified model and control.
[LG-2] From Core to Detail: Unsupervised Disentanglement with Entropy-Ordered Flows
Link: https://arxiv.org/abs/2602.06940
Authors: Daniel Galperin, Ullrich Köthe
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Learning unsupervised representations that are both semantically meaningful and stable across runs remains a central challenge in modern representation learning. We introduce entropy-ordered flows (EOFlows), a normalizing-flow framework that orders latent dimensions by their explained entropy, analogously to PCA’s explained variance. This ordering enables adaptive injective flows: after training, one may retain only the top C latent variables to form a compact core representation while the remaining variables capture fine-grained detail and noise, with C chosen flexibly at inference time rather than fixed during training. EOFlows build on insights from Independent Mechanism Analysis, Principal Component Flows and Manifold Entropic Metrics. We combine likelihood-based training with local Jacobian regularization and noise augmentation into a method that scales well to high-dimensional data such as images. Experiments on the CelebA dataset show that our method uncovers a rich set of semantically interpretable features, allowing for high compression and strong denoising.
[LG-3] Reciprocal Latent Fields for Precomputed Sound Propagation DATE
Link: https://arxiv.org/abs/2602.06937
Authors: Hugo Seuté, Pranai Vasudev, Etienne Richan, Louis-Xavier Buffoni
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Temporary pre-print, will be updated. In review at a conference
Abstract:Realistic sound propagation is essential for immersion in a virtual scene, yet physically accurate wave-based simulations remain computationally prohibitive for real-time applications. Wave coding methods address this limitation by precomputing and compressing impulse responses of a given scene into a set of scalar acoustic parameters, which can reach unmanageable sizes in large environments with many source-receiver pairs. We introduce Reciprocal Latent Fields (RLF), a memory-efficient framework for encoding and predicting these acoustic parameters. The RLF framework employs a volumetric grid of trainable latent embeddings decoded with a symmetric function, ensuring acoustic reciprocity. We study a variety of decoders and show that leveraging Riemannian metric learning leads to a better reproduction of acoustic phenomena in complex scenes. Experimental validation demonstrates that RLF maintains replication quality while reducing the memory footprint by several orders of magnitude. Furthermore, a MUSHRA-like subjective listening test indicates that sound rendered via RLF is perceptually indistinguishable from ground-truth simulations.
[LG-4] When RL Meets Adaptive Speculative Training: A Unified Training-Serving System
Link: https://arxiv.org/abs/2602.06932
Authors: Junxiong Wang, Fengxiang Bie, Jisen Li, Zhongzhu Zhou, Zelei Shao, Yubo Wang, Yinghui Liu, Qingyang Wu, Avner May, Sri Yanamandra, Yineng Zhang, Ce Zhang, Tri Dao, Percy Liang, Ben Athiwaratkun, Shuaiwen Leon Song, Chenfeng Xu, Xiaoxia Wu
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Speculative decoding can significantly accelerate LLM serving, yet most deployments today disentangle speculator training from serving, treating speculator training as a standalone offline modeling problem. We show that this decoupled formulation introduces substantial deployment and adaptation lag: (1) high time-to-serve, since a speculator must be trained offline for a considerable period before deployment; (2) delayed utility feedback, since the true end-to-end decoding speedup is only known after training and cannot be inferred reliably from acceptance rate alone due to model-architecture and system-level overheads; and (3) domain-drift degradation, as the target model is repurposed to new domains and the speculator becomes stale and less effective. To address these issues, we present Aurora, a unified training-serving system that closes the loop by continuously learning a speculator directly from live inference traces. Aurora reframes online speculator learning as an asynchronous reinforcement-learning problem: accepted tokens provide positive feedback, while rejected speculator proposals provide implicit negative feedback that we exploit to improve sample efficiency. Our design integrates an SGLang-based inference server with an asynchronous training server, enabling hot-swapped speculator updates without service interruption. Crucially, Aurora supports day-0 deployment: a speculator can be served immediately and rapidly adapted to live traffic, improving system performance while providing immediate utility feedback. Across experiments, Aurora achieves a 1.5x day-0 speedup on recently released frontier models (e.g., MiniMax M2.1 229B and Qwen3-Coder-Next 80B). Aurora also adapts effectively to distribution shifts in user traffic, delivering an additional 1.25x speedup over a well-trained but static speculator on widely used models (e.g., Qwen3 and Llama3). 
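The feedback signal that the abstract describes (accepted tokens as positive examples, the rejected proposal as an implicit negative) can be illustrated with a toy greedy verifier. This is a simplified sketch, not Aurora's implementation: real systems score all draft positions in one batched forward pass of the target model, and the `+1` / `-1` reward values, the function name, and the `target_next_token` callback are assumptions made for this example.

```python
def verify_draft(prefix, draft_tokens, target_next_token):
    """Greedily verify a speculator's draft against the target model.

    target_next_token(context) -> the token the target model would emit next
    (a hypothetical callback standing in for the real model).
    Returns (accepted_tokens, feedback), where feedback is a list of
    (context, proposed_token, reward) records usable as training signal.
    """
    context = list(prefix)
    feedback = []
    for token in draft_tokens:
        expected = target_next_token(context)
        if token == expected:
            # Accepted proposal: positive feedback, and decoding continues.
            feedback.append((tuple(context), token, +1))
            context.append(token)
        else:
            # First mismatch: implicit negative feedback; later drafts are discarded.
            feedback.append((tuple(context), token, -1))
            break
    accepted = [tok for _, tok, reward in feedback if reward > 0]
    return accepted, feedback
```

The length of `accepted` relative to the draft length gives the acceptance rate, which the abstract notes is not by itself a reliable predictor of end-to-end speedup.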
[LG-5] Continuous-time reinforcement learning: ellipticity enables model-free value function approximation
Link: https://arxiv.org/abs/2602.06930
Authors: Wenlong Mou
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments:
Abstract:We study off-policy reinforcement learning for controlling continuous-time Markov diffusion processes with discrete-time observations and actions. We consider model-free algorithms with function approximation that learn value and advantage functions directly from data, without unrealistic structural assumptions on the dynamics. Leveraging the ellipticity of the diffusions, we establish a new class of Hilbert-space positive definiteness and boundedness properties for the Bellman operators. Based on these properties, we propose the Sobolev-prox fitted q-learning algorithm, which learns value and advantage functions by iteratively solving least-squares regression problems. We derive oracle inequalities for the estimation error, governed by (i) the best approximation error of the function classes, (ii) their localized complexity, (iii) exponentially decaying optimization error, and (iv) numerical discretization error. These results identify ellipticity as a key structural property that renders reinforcement learning with function approximation for Markov diffusions no harder than supervised learning.
[LG-6] Robustness Beyond Known Groups with Low-rank Adaptation
Link: https://arxiv.org/abs/2602.06924
Authors: Abinitha Gourabathina, Hyewon Jeong, Teya Bergamaschi, Marzyeh Ghassemi, Collin Stultz
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Deep learning models trained to optimize average accuracy often exhibit systematic failures on particular subpopulations. In real world settings, the subpopulations most affected by such disparities are frequently unlabeled or unknown, thereby motivating the development of methods that are performant on sensitive subgroups without being pre-specified. However, existing group-robust methods typically assume prior knowledge of relevant subgroups, using group annotations for training or model selection. We propose Low-rank Error Informed Adaptation (LEIA), a simple two-stage method that improves group robustness by identifying a low-dimensional subspace in the representation space where model errors concentrate. LEIA restricts adaptation to this error-informed subspace via a low-rank adjustment to the classifier logits, directly targeting latent failure modes without modifying the backbone or requiring group labels. Using five real-world datasets, we analyze group robustness under three settings: (1) truly no knowledge of subgroup relevance, (2) partial knowledge of subgroup relevance, and (3) full knowledge of subgroup relevance. Across all settings, LEIA consistently improves worst-group performance while remaining fast, parameter-efficient, and robust to hyperparameter choice.
[LG-7] Revisiting the Generic Transformer: Deconstructing a Strong Baseline for Time Series Foundation Models
Link: https://arxiv.org/abs/2602.06909
Authors: Yunshi Wen, Wesley M. Gifford, Chandra Reddy, Lam M. Nguyen, Jayant Kalagnanam, Anak Agung Julius
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The recent surge in Time Series Foundation Models has rapidly advanced the field, yet the heterogeneous training setups across studies make it difficult to attribute improvements to architectural innovations versus data engineering. In this work, we investigate the potential of a standard patch Transformer, demonstrating that this generic architecture achieves state-of-the-art zero-shot forecasting performance using a straightforward training protocol. We conduct a comprehensive ablation study that covers model scaling, data composition, and training techniques to isolate the essential ingredients for high performance. Our findings identify the key drivers of performance, while confirming that the generic architecture itself demonstrates excellent scalability. By strictly controlling these variables, we provide comprehensive empirical results on model scaling across multiple dimensions. We release our open-source model and detailed findings to establish a transparent, reproducible baseline for future research.
[LG-8] A first realization of reinforcement learning-based closed-loop EEG-TMS
Link: https://arxiv.org/abs/2602.06907
Authors: Dania Humaidan, Jiahua Xu, Jing Chen, Christoph Zrenner, David Emanuel Vetter, Laura Marzetti, Paolo Belardinelli, Timo Roine, Risto J. Ilmoniemi, Gian Luca Romani, Ulf Zieman
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Background: Transcranial magnetic stimulation (TMS) is a powerful tool to investigate neurophysiology of the human brain and treat brain disorders. Traditionally, therapeutic TMS has been applied in a one-size-fits-all approach, disregarding inter- and intra-individual differences. Brain state-dependent EEG-TMS, such as coupling TMS with a pre-specified phase of the sensorimotor mu-rhythm, enables the induction of differential neuroplastic effects depending on the targeted phase. But this approach is still user-dependent as it requires defining an a-priori target phase. Objectives: To present a first realization of a machine-learning-based, closed-loop real-time EEG-TMS setup to identify user-independently the individual mu-rhythm phase associated with high- vs. low-corticospinal excitability states. Methods: We applied EEG-TMS to 25 participants targeting the supplementary motor area-primary motor cortex network and used a reinforcement learning algorithm to identify the mu-rhythm phase associated with high- vs. low corticospinal excitability. We employed linear mixed effects models and Bayesian analysis to determine effects of reinforced learning on corticospinal excitability indexed by motor evoked potential amplitude, and functional connectivity indexed by the imaginary part of resting-state EEG coherence. Results: Reinforcement learning effectively identified the mu-rhythm phase associated with high- vs. low-excitability states, and their repetitive stimulation resulted in long-term increases vs. decreases in functional connectivity in the stimulated sensorimotor network. Conclusions: We demonstrated for the first time the feasibility of closed-loop EEG-TMS in humans, a critical step towards individualized treatment of brain disorders.
[LG-9] Parameter-free Dynamic Regret: Time-varying Movement Costs, Delayed Feedback, and Memory
Link: https://arxiv.org/abs/2602.06902
Authors: Emmanuel Esposito, Andrew Jacobsen, Hao Qiu, Mengxiao Zhang
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:In this paper, we study dynamic regret in unconstrained online convex optimization (OCO) with movement costs. Specifically, we generalize the standard setting by allowing the movement cost coefficients $\lambda_t$ to vary arbitrarily over time. Our main contribution is a novel algorithm that establishes the first comparator-adaptive dynamic regret bound for this setting, guaranteeing $\widetilde{\mathcal{O}}\big(\sqrt{(1+P_T)(T+\sum_t \lambda_t)}\big)$ regret, where $P_T$ is the path length of the comparator sequence over $T$ rounds. This recovers the optimal guarantees for both static and dynamic regret in standard OCO as a special case where $\lambda_t=0$ for all rounds. To demonstrate the versatility of our results, we consider two applications: OCO with delayed feedback and OCO with time-varying memory. We show that both problems can be translated into time-varying movement costs, establishing a novel reduction specifically for the delayed feedback setting that is of independent interest. A crucial observation is that the first-order dependence on movement costs in our regret bound plays a key role in enabling optimal comparator-adaptive dynamic regret guarantees in both settings.
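The special cases mentioned in the abstract follow directly from the stated bound. The worked lines below spell this out. The definition of the path length $P_T$ as the sum of consecutive comparator displacements is the usual one and is assumed here, since the abstract does not state the norm; the comparator sequence is written $u_1, \dots, u_T$, which is notation introduced for this example.

```latex
% Stated bound, with P_T the comparator path length (assumed definition):
\[
  R_T(u_{1:T}) = \widetilde{\mathcal{O}}\!\left(\sqrt{(1+P_T)\Big(T+\sum_{t=1}^{T}\lambda_t\Big)}\right),
  \qquad
  P_T = \sum_{t=2}^{T}\lVert u_t - u_{t-1}\rVert .
\]
% No movement costs (\lambda_t = 0 for all t): the dynamic regret rate of standard OCO.
\[
  \lambda_t = 0 \;\;\forall t
  \quad\Longrightarrow\quad
  R_T(u_{1:T}) = \widetilde{\mathcal{O}}\!\left(\sqrt{(1+P_T)\,T}\right).
\]
% Fixed comparator as well (u_1 = \dots = u_T, so P_T = 0): the static regret rate.
\[
  \lambda_t = 0,\; P_T = 0
  \quad\Longrightarrow\quad
  R_T(u) = \widetilde{\mathcal{O}}\!\left(\sqrt{T}\right).
\]
```

The movement costs enter only through the sum $\sum_t \lambda_t$ inside the square root, which is the first-order dependence the abstract highlights.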
[LG-10] Sample Complexity of Causal Identification with Temporal Heterogeneity
Link: https://arxiv.org/abs/2602.06899
Authors: Ameya Rathod, Sujay Belsare, Salvik Krishna Nautiyal, Dhruv Laad, Ponnurangam Kumaraguru
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Recovering a unique causal graph from observational data is an ill-posed problem because multiple generating mechanisms can lead to the same observational distribution. This problem becomes solvable only by exploiting specific structural or distributional assumptions. While recent work has separately utilized time-series dynamics or multi-environment heterogeneity to constrain this problem, we integrate both as complementary sources of heterogeneity. This integration yields unified necessary identifiability conditions and enables a rigorous analysis of the statistical limits of recovery under thin versus heavy-tailed noise. In particular, temporal structure is shown to effectively substitute for missing environmental diversity, possibly achieving identifiability even under insufficient heterogeneity. Extending this analysis to heavy-tailed (Student’s t) distributions, we demonstrate that while geometric identifiability conditions remain invariant, the sample complexity diverges significantly from the Gaussian baseline. Explicit information-theoretic bounds quantify this cost of robustness, establishing the fundamental limits of covariance-based causal graph recovery methods in realistic non-stationary systems. This work shifts the focus from whether causal structure is identifiable to whether it is statistically recoverable in practice.
[LG-11] A Cycle-Consistent Graph Surrogate for Full-Cycle Left Ventricular Myocardial Biomechanics
Link: https://arxiv.org/abs/2602.06884
Authors: Siyu Mu, Wei Xuan Chan, Choon Hwai Yap
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Image-based patient-specific simulation of left ventricular (LV) mechanics is valuable for understanding cardiac function and supporting clinical intervention planning, but conventional finite-element analysis (FEA) is computationally intensive. Current graph-based surrogates do not have full-cycle prediction capabilities, and physics-informed neural networks often struggle to converge on complex cardiac geometries. We present CardioGraphFENet (CGFENet), a unified graph-based surrogate for rapid full-cycle estimation of LV myocardial biomechanics, supervised by a large FEA simulation dataset. The proposed model integrates (i) a global–local graph encoder to capture mesh features with weak-form-inspired global coupling, (ii) a gated recurrent unit-based temporal encoder conditioned on the target volume-time signal to model cycle-coherent dynamics, and (iii) a cycle-consistent bidirectional formulation for both loading and inverse unloading within a single framework. These strategies enable high fidelity with respect to traditional FEA ground truths and produce physiologically plausible pressure-volume loops that match FEA results when coupled with a lumped-parameter model. In particular, the cycle-consistency strategy enables a significant reduction in FEA supervision with only minimal loss in accuracy.
[LG-12] Decoupling Variance and Scale-Invariant Updates in Adaptive Gradient Descent for Unified Vector and Matrix Optimization
Link: https://arxiv.org/abs/2602.06880
Authors: Zitao Song, Cedar Site Bai, Zhe Zhang, Brian Bullins, David F. Gleich
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Adaptive methods like Adam have become the de facto standard for large-scale vector and Euclidean optimization due to their coordinate-wise adaptation with a second-order nature. More recently, matrix-based spectral optimizers like Muon (Jordan et al., 2024b) show the power of treating weight matrices as matrices rather than long vectors. Linking these is hard because many natural generalizations are not feasible to implement, and we also cannot simply move the Adam adaptation to the matrix spectrum. To address this, we reformulate the AdaGrad update and decompose it into a variance adaptation term and a scale-invariant term. This decoupling produces DeVA (Decoupled Variance Adaptation), a framework that bridges between vector-based variance adaptation and matrix spectral optimization, enabling a seamless transition from Adam to adaptive spectral descent. Extensive experiments across language modeling and image classification demonstrate that DeVA consistently outperforms state-of-the-art methods such as Muon and SOAP (Vyas et al., 2024), reducing token usage by around 6.6%. Theoretically, we show that the variance adaptation term effectively improves the blockwise smoothness, facilitating faster convergence. Our implementation is available at this https URL
[LG-13] T-STAR: A Context-Aware Transformer Framework for Short-Term Probabilistic Demand Forecasting in Dock-Based Shared Micro-Mobility
Link: https://arxiv.org/abs/2602.06866
Authors: Jingyi Cheng, Gonçalo Homem de Almeida Correia, Oded Cats, Shadi Sharif Azadeh
Subjects: Machine Learning (cs.LG)
Comments: This work has been submitted to Transportation Research Part C
Abstract:Reliable short-term demand forecasting is essential for managing shared micro-mobility services and ensuring responsive, user-centered operations. This study introduces T-STAR (Two-stage Spatial and Temporal Adaptive contextual Representation), a novel transformer-based probabilistic framework designed to forecast station-level bike-sharing demand at a 15-minute resolution. T-STAR addresses key challenges in high-resolution forecasting by disentangling consistent demand patterns from short-term fluctuations through a hierarchical two-stage structure. The first stage captures coarse-grained hourly demand patterns, while the second stage improves prediction accuracy by incorporating high-frequency, localized inputs, including recent fluctuations and real-time demand variations in connected metro services, to account for temporal shifts in short-term demand. Time series transformer models are employed in both stages to generate probabilistic predictions. Extensive experiments using Washington D.C.'s Capital Bikeshare data demonstrate that T-STAR outperforms existing methods in both deterministic and probabilistic accuracy. The model exhibits strong spatial and temporal robustness across stations and time periods. A zero-shot forecasting experiment further highlights T-STAR’s ability to transfer to previously unseen service areas without retraining. These results underscore the framework’s potential to deliver granular, reliable, and uncertainty-aware short-term demand forecasts, which enable seamless integration to support multimodal trip planning for travelers and enhance real-time operations in shared micro-mobility services.
[LG-14] Designing a Robust Bounded and Smooth Loss Function for Improved Supervised Learning
Link: https://arxiv.org/abs/2602.06858
Authors: Soumi Mahato, Lineesh M.C
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:The loss function is crucial to machine learning, especially in supervised learning frameworks. It is a fundamental component that controls the behavior and general efficacy of learning algorithms. However, despite their widespread use, traditional loss functions have significant drawbacks when dealing with high-dimensional and outlier-sensitive datasets, which frequently results in reduced performance and slower convergence during training. In this work, we develop a robust, bounded, and smooth (RoBoS-NN) loss function to resolve the aforementioned hindrances. The generalization ability of the loss function has also been theoretically analyzed to rigorously justify its robustness. Moreover, we implement the RoBoS-NN loss in the framework of a neural network (NN) to forecast time series and present a new robust algorithm named $\mathcal{L}_{\text{RoBoS}}$-NN. To assess the potential of $\mathcal{L}_{\text{RoBoS}}$-NN, we conduct experiments on multiple real-world datasets. In addition, we infuse outliers into data sets to evaluate the performance of $\mathcal{L}_{\text{RoBoS}}$-NN in more challenging scenarios. Numerical results show that $\mathcal{L}_{\text{RoBoS}}$-NN outperforms the other benchmark models in terms of accuracy measures.
[LG-15] Improved Sampling Schedules for Discrete Diffusion Models
Link: https://arxiv.org/abs/2602.06849
Authors: Alberto Foresti, Mustapha Bounoua, Giulio Franzese, Luca Ambrogioni, Pietro Michiardi
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Discrete diffusion models have emerged as a powerful paradigm for generative modeling on sequence data; however, the information-theoretic principles governing their reverse processes remain significantly less understood than those of their continuous counterparts. In this work, we bridge this gap by analyzing the reverse process dynamics through the lens of thermodynamic entropy production. We propose the entropy production rate as a rigorous proxy for quantifying information generation, deriving as a byproduct a bound on the Wasserstein distance between intermediate states and the data distribution. Leveraging these insights, we introduce two novel sampling schedules that are uniformly spaced with respect to their corresponding physics-inspired metrics: the Entropic Discrete Schedule (EDS), which is defined by maintaining a constant rate of information gain, and the Wasserstein Discrete Schedule (WDS), which is defined by taking equal steps in terms of the Wasserstein distance. We empirically demonstrate that our proposed schedules significantly outperform state-of-the-art strategies across diverse application domains, including synthetic data, music notation, vision and language modeling, consistently achieving superior performance at a lower computational budget.
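The idea of a schedule that is "uniformly spaced with respect to a metric" can be sketched generically: given a monotone cumulative curve over time, pick the times at which that curve takes equally spaced values. The sketch below is only an illustration of that inversion step. The function name `uniform_in_metric` and the quadratic curve are hypothetical stand-ins, not the entropy-production or Wasserstein quantities derived in the paper.

```python
def uniform_in_metric(ts, cumulative, num_steps):
    """Return num_steps + 1 times whose cumulative values are equally spaced.

    ts          -- increasing grid of times
    cumulative  -- non-decreasing values of some accumulated quantity on ts
    """
    lo, hi = cumulative[0], cumulative[-1]
    targets = [lo + (hi - lo) * k / num_steps for k in range(num_steps + 1)]
    schedule, j = [], 0
    for target in targets:
        # Advance to the grid cell [ts[j], ts[j + 1]] that contains the target.
        while j < len(ts) - 2 and cumulative[j + 1] < target:
            j += 1
        c0, c1 = cumulative[j], cumulative[j + 1]
        frac = 0.0 if c1 == c0 else (target - c0) / (c1 - c0)
        # Linear interpolation inside the cell.
        schedule.append(ts[j] + frac * (ts[j + 1] - ts[j]))
    return schedule


# Hypothetical cumulative curve C(t) = t^2 (an assumption for illustration only).
ts = [i / 1000 for i in range(1001)]
cumulative = [t ** 2 for t in ts]
schedule = uniform_in_metric(ts, cumulative, 4)
```

With a curve that accumulates slowly at first, the resulting time steps are long early on and short later, so each step covers the same amount of the chosen quantity.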
[LG-16] Are Deep Learning Based Hybrid PDE Solvers Reliable? Why Training Paradigms and Update Strategies Matter
Link: https://arxiv.org/abs/2602.06842
Authors: Yuhan Wu, Jan Willem van Beek, Victorita Dolean, Alexander Heinlein
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments:
Abstract:Deep learning-based hybrid iterative methods (DL-HIMs) integrate classical numerical solvers with neural operators, utilizing their complementary spectral biases to accelerate convergence. Despite this promise, many DL-HIMs stagnate at false fixed points where neural updates vanish while the physical residual remains large, raising questions about reliability in scientific computing. In this paper, we provide evidence that performance is highly sensitive to training paradigms and update strategies, even when the neural architecture is fixed. Through a detailed study of a DeepONet-based hybrid iterative numerical transferable solver (HINTS) and an FFT-based Fourier neural solver (FNS), we show that significant physical residuals can persist when training objectives are not aligned with solver dynamics and problem physics. We further examine Anderson acceleration (AA) and demonstrate that its classical form is ill-suited for nonlinear neural operators. To overcome this, we introduce physics-aware Anderson acceleration (PA-AA), which minimizes the physical residual rather than the fixed-point update. Numerical experiments confirm that PA-AA restores reliable convergence in substantially fewer iterations. These findings provide a concrete answer to ongoing controversies surrounding AI-based PDE solvers: reliability hinges not only on architectures but on physically informed training and iteration design.
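For readers unfamiliar with the classical Anderson acceleration that the abstract contrasts with PA-AA, here is a minimal sketch with memory depth 1 on a small linear fixed-point problem. It shows only the classical form, which minimizes the fixed-point update `g(x) - x`. It is not the physics-aware variant from the paper, and the map `g` and the name `anderson_m1` are hypothetical choices for illustration.

```python
def g(x):
    """Hypothetical contractive linear map x -> A x + b."""
    return [0.5 * x[0] + 0.2 * x[1] + 1.0,
            0.1 * x[0] + 0.4 * x[1] + 1.0]


def anderson_m1(g, x0, iters=50, tol=1e-12):
    """Classical Anderson acceleration with memory depth 1."""
    x_prev = x0
    g_prev = g(x_prev)
    f_prev = [a - b for a, b in zip(g_prev, x_prev)]  # fixed-point update
    x = g_prev  # the first step is a plain fixed-point iteration
    for _ in range(iters):
        gx = g(x)
        f = [a - b for a, b in zip(gx, x)]
        if sum(v * v for v in f) ** 0.5 < tol:
            break
        # One-dimensional least squares: minimize ||f - gamma * (f - f_prev)||.
        df = [a - b for a, b in zip(f, f_prev)]
        denom = sum(v * v for v in df)
        gamma = sum(a * b for a, b in zip(df, f)) / denom if denom > 0 else 0.0
        # Mix the two most recent evaluations of g with the same coefficient.
        x_next = [a - gamma * (a - b) for a, b in zip(gx, g_prev)]
        x_prev, g_prev, f_prev = x, gx, f
        x = x_next
    return x


solution = anderson_m1(g, [0.0, 0.0])
```

The least-squares step targets the fixed-point update only. According to the abstract, PA-AA replaces that objective with the physical residual, which this sketch does not attempt.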
[LG-17] Learning Deep Hybrid Models with Sharpness-Aware Minimization
Link: https://arxiv.org/abs/2602.06837
Authors: Naoya Takeishi
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Hybrid modeling, the combination of machine learning models and scientific mathematical models, enables flexible and robust data-driven prediction with partial interpretability. However, the flexibility of the machine learning model means the scientific models may effectively be ignored in prediction, which defeats the purpose of hybrid modeling. Typically some regularization is applied to hybrid model learning to avoid such a failure case, but the formulation of the regularizer strongly depends on model architectures and domain knowledge. In this paper, we propose to focus on the flatness of loss minima in learning hybrid models, aiming to make the model as simple as possible. We employ the idea of sharpness-aware minimization and adapt it to the hybrid modeling setting. Numerical experiments show that the SAM-based method works well across different choices of models and datasets.
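As background, the generic sharpness-aware minimization (SAM) step can be sketched as follows: move to a nearby point along the normalized gradient, then update the original weights with the gradient taken at that perturbed point. This is a minimal sketch of plain SAM on a toy quadratic loss. The loss, step sizes, and function names are assumptions for illustration, and the hybrid-model adaptation proposed in the paper is not reproduced.

```python
import math


def loss(w):
    """Toy loss L(w) = 0.5 * (w0^2 + 10 * w1^2), an assumption for illustration."""
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)


def grad(w):
    """Gradient of the toy loss."""
    return [w[0], 10.0 * w[1]]


def sam_step(w, lr=0.05, rho=0.05):
    """One SAM update with learning rate lr and neighbourhood radius rho."""
    g = grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12
    # Ascent step: approximate worst-case point within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient from the perturbed point to w.
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


w = [1.0, 1.0]
initial_loss = loss(w)
for _ in range(100):
    w = sam_step(w)
final_loss = loss(w)
```

Each step costs two gradient evaluations, which is the usual price of SAM compared with plain gradient descent.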
[LG-18] Calibrating Tabular Anomaly Detection via Optimal Transport
Link: https://arxiv.org/abs/2602.06810
Authors: Hangting Ye, He Zhao, Wei Fan, Xiaozhuang Song, Dandan Guo, Yi Chang, Hongyuan Zha
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Tabular anomaly detection (TAD) remains challenging due to the heterogeneity of tabular data: features lack natural relationships, vary widely in distribution and scale, and exhibit diverse types. Consequently, each TAD method makes implicit assumptions about anomaly patterns that work well on some datasets but fail on others, and no method consistently outperforms across diverse scenarios. We present CTAD (Calibrating Tabular Anomaly Detection), a model-agnostic post-processing framework that enhances any existing TAD detector through sample-specific calibration. Our approach characterizes normal data via two complementary distributions, i.e., an empirical distribution from random sampling and a structural distribution from K-means centroids, and measures how adding a test sample disrupts their compatibility using Optimal Transport (OT) distance. Normal samples maintain low disruption while anomalies cause high disruption, providing a calibration signal to amplify detection. We prove that OT distance has a lower bound proportional to the test sample’s distance from centroids, and establish that anomalies systematically receive higher calibration scores than normals in expectation, explaining why the method generalizes across datasets. Extensive experiments on 34 diverse tabular datasets with 7 representative detectors spanning all major TAD categories (density estimation, classification, reconstruction, and isolation-based methods) demonstrate that CTAD consistently improves performance with statistical significance. Remarkably, CTAD enhances even state-of-the-art deep learning methods and shows robust performance across diverse hyperparameter settings, requiring no additional tuning for practical deployment.
[LG-19] FlowDA: Accurate Low-Latency Weather Data Assimilation via Flow Matching
Link: https://arxiv.org/abs/2602.06800
Authors: Ran Cheng, Lailai Zhu
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Data assimilation (DA) is a fundamental component of modern weather prediction, yet it remains a major computational bottleneck in machine learning (ML)-based forecasting pipelines due to reliance on traditional variational methods. Recent generative ML-based DA methods offer a promising alternative but typically require many sampling steps and suffer from error accumulation under long-horizon auto-regressive rollouts with cycling assimilation. We propose FlowDA, a low-latency weather-scale generative DA framework based on flow matching. FlowDA conditions on observations through a SetConv-based embedding and fine-tunes the Aurora foundation model to deliver accurate, efficient, and robust analyses. Experiments across observation rates decreasing from 3.9% to 0.1% demonstrate superior performance of FlowDA over strong baselines with similar tunable-parameter size. FlowDA further shows robustness to observational noise and stable performance in long-horizon auto-regressive cycling DA. Overall, FlowDA points to an efficient and scalable direction for data-driven DA.
[LG-20] Rare Event Analysis of Large Language Models
Link: https://arxiv.org/abs/2602.06791
Authors: Jake McAllister Dorman, Edward Gillman, Dominic C. Rose, Jamie F. Mair, Juan P. Garrahan
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech)
Comments:
Abstract:Being probabilistic models, during inference large language models (LLMs) display rare events: behaviour that is far from typical but highly significant. By definition all rare events are hard to see, but the enormous scale of LLM usage means that events completely unobserved during development are likely to become prominent in deployment. Here we present an end-to-end framework for the systematic analysis of rare events in LLMs. We provide a practical implementation spanning theory, efficient generation strategies, probability estimation and error analysis, which we illustrate with concrete examples. We outline extensions and applications to other models and contexts, highlighting the generality of the concepts and techniques presented here.
[LG-21] Displacement-Resistant Extensions of DPO with Nonconvex f-Divergences ICLR2026
Link: https://arxiv.org/abs/2602.06788
Authors: Idan Pipano, Shoham Sabach, Kavosh Asadi, Mohammad Ghavamzadeh
Subjects: Machine Learning (cs.LG)
Comments: Published as a conference paper at ICLR 2026
Abstract:DPO and related algorithms align language models by directly optimizing the RLHF objective: find a policy that maximizes the Bradley-Terry reward while staying close to a reference policy through a KL divergence penalty. Previous work showed that this approach could be further generalized: the original problem remains tractable even if the KL divergence is replaced by a family of $f$-divergences with a convex generating function $f$. Our first contribution is to show that convexity of $f$ is not essential. Instead, we identify a more general condition, referred to as DPO-inducing, that precisely characterizes when the RLHF problem remains tractable. Our next contribution is to establish a second condition on $f$ that is necessary to prevent probability displacement, a known empirical phenomenon in which the probabilities of the winner and the loser responses approach zero. We refer to any $f$ that satisfies this condition as displacement-resistant. We finally focus on a specific DPO-inducing and displacement-resistant $f$, leading to our novel SquaredPO loss. Compared to DPO, this new loss offers stronger theoretical guarantees while performing competitively in practice.
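For reference, the standard DPO loss for a single preference pair (the KL-based baseline that the abstract generalizes) can be written in a few lines. This is a sketch of the usual formulation only. The SquaredPO loss is not specified in the abstract and is not shown, and the argument names and the default `beta` are assumptions.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (winner, loser) pair.

    logp_*     -- log-probabilities of each response under the policy
    ref_logp_* -- log-probabilities under the frozen reference policy
    beta       -- strength of the KL penalty (assumed default)
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), evaluated in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is zero and the loss is log 2.
at_reference = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Raising the winner's log-probability relative to the reference lowers the loss.
improved = dpo_loss(-4.0, -7.0, -5.0, -7.0)
```

The loss depends only on the difference of log-ratios, so it can also decrease while both absolute probabilities fall. That is the probability displacement phenomenon the abstract refers to.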
[LG-22] Weisfeiler and Lehman Go Categorical
Link: https://arxiv.org/abs/2602.06787
Authors: Seongjin Choi, Gahee Kim, Se-Young Yun
Subjects: Machine Learning (cs.LG); Category Theory (math.CT)
Comments: Comments are welcome!
Abstract:While lifting maps have significantly enhanced the expressivity of graph neural networks, extending this paradigm to hypergraphs remains fragmented. To address this, we introduce the categorical Weisfeiler-Lehman framework, which formalizes lifting as a functorial mapping from an arbitrary data category to the unifying category of graded posets. When applied to hypergraphs, this perspective allows us to systematically derive Hypergraph Isomorphism Networks, a family of neural architectures where the message passing topology is strictly determined by the choice of functor. We introduce two distinct functors from the category of hypergraphs: an incidence functor and a symmetric simplicial complex functor. While the incidence architecture structurally mirrors standard bipartite schemes, our functorial derivation enforces a richer information flow over the resulting poset, capturing complex intersection geometries often missed by existing methods. We theoretically characterize the expressivity of these models, proving that both the incidence-based and symmetric simplicial approaches subsume the expressive power of the standard Hypergraph Weisfeiler-Lehman test. Extensive experiments on real-world benchmarks validate these theoretical findings.
[LG-23] Fair Transit Stop Placement: A Clustering Perspective and Beyond
Link: https://arxiv.org/abs/2602.06776
Authors: Haris Aziz, Ling Gai, Yuhang Guo, Jeremy Vollen
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments:
Abstract:We study the transit stop placement (TrSP) problem in general metric spaces, where agents travel between source-destination pairs and may either walk directly or utilize a shuttle service via selected transit stops. We investigate fairness in TrSP through the lens of justified representation (JR) and the core, and uncover a structural correspondence with fair clustering. Specifically, we show that a constant-factor approximation to proportional fairness in clustering can be used to guarantee a constant-factor biparameterized approximation to core. We establish a lower bound of 1.366 on the approximability of JR, and moreover show that no clustering algorithm can approximate JR within a factor better than 3. Going beyond clustering, we propose the Expanding Cost Algorithm, which achieves a tight 2.414-approximation for JR, but does not give any bounded core guarantee. In light of this, we introduce a parameterized algorithm that interpolates between these approaches, and enables a tunable trade-off between JR and core. Finally, we complement our results with an experimental analysis using small-market public carpooling data.
[LG-24] Robust Online Learning
Link: https://arxiv.org/abs/2602.06775
Authors: Sajad Ashkezari
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:We study the problem of learning robust classifiers where the classifier will receive a perturbed input. Unlike robust PAC learning studied in prior work, here the clean data and its label are also adversarially chosen. We formulate this setting as an online learning problem and consider both the realizable and agnostic learnability of hypothesis classes. We define a new dimension of classes and show it controls the mistake bounds in the realizable setting and the regret bounds in the agnostic setting. In contrast to the dimension that characterizes learnability in the PAC setting, our dimension is rather simple and resembles the Littlestone dimension. We generalize our dimension to multiclass hypothesis classes and prove similar results in the realizable case. Finally, we study the case where the learner does not know the set of allowed perturbations for each point and only has some prior on them.
[LG-25] On the Convergence of Multicalibration Gradient Boosting
Link: https://arxiv.org/abs/2602.06773
Authors: Daniel Haimovich, Fridolin Linder, Lorenzo Perini, Niek Tax, Milan Vojnovic
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Under submission
Abstract:Multicalibration gradient boosting has recently emerged as a scalable method that empirically produces approximately multicalibrated predictors and has been deployed at web scale. Despite this empirical success, its convergence properties are not well understood. In this paper, we bridge the gap by providing convergence guarantees for multicalibration gradient boosting in regression with squared-error loss. We show that the magnitude of successive prediction updates decays at $O(1/\sqrt{T})$, which implies the same convergence rate bound for the multicalibration error over rounds. Under additional smoothness assumptions on the weak learners, this rate improves to linear convergence. We further analyze adaptive variants, showing local quadratic convergence of the training loss, and we study rescaling schemes that preserve convergence. Experiments on real-world datasets support our theory and clarify the regimes in which the method achieves fast convergence and strong multicalibration.
[LG-26] Calibrating Generative AI to Produce Realistic Essays for Data Augmentation
Link: https://arxiv.org/abs/2602.06772
Authors: Edward W. Wolfe, Justin O. Barber
Subjects: Machine Learning (cs.LG)
Comments: Artificial Intelligence in Measurement and Education Conference (AIME-Con)
Abstract:Data augmentation can mitigate limited training data in machine-learning automated scoring engines for constructed response items. This study seeks to determine how well three approaches to large language model prompting produce essays that preserve the writing quality of the original essays and produce realistic text for augmenting ASE training datasets. We created simulated versions of student essays, and human raters assigned scores to them and rated the realism of the generated text. The results of the study indicate that the predict next prompting strategy produces the highest level of agreement between human raters regarding simulated essay scores, predict next and sentence strategies best preserve the rated quality of the original essay in the simulated essays, and predict next and 25 examples strategies produce the most realistic text as judged by human raters.
[LG-27] Soft Forward-Backward Representations for Zero-shot Reinforcement Learning with General Utilities
Link: https://arxiv.org/abs/2602.06769
Authors: Marco Bagatella, Thomas Rupf, Georg Martius, Andreas Krause
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Recent advancements in zero-shot reinforcement learning (RL) have facilitated the extraction of diverse behaviors from unlabeled, offline data sources. In particular, forward-backward algorithms (FB) can retrieve a family of policies that can approximately solve any standard RL problem (with additive rewards, linear in the occupancy measure), given sufficient capacity. While retaining zero-shot properties, we tackle the greater problem class of RL with general utilities, in which the objective is an arbitrary differentiable function of the occupancy measure. This setting is strictly more expressive, capturing tasks such as distribution matching or pure exploration, which may not be reduced to additive rewards. We show that this additional complexity can be captured by a novel, maximum entropy (soft) variant of the forward-backward algorithm, which recovers a family of stochastic policies from offline data. When coupled with zero-order search over compact policy embeddings, this algorithm can sidestep iterative optimization schemes, and optimizes general utilities directly at test-time. Across both didactic and high-dimensional experiments, we demonstrate that our method retains favorable properties of FB algorithms, while also extending their range to more general RL problems.
[LG-28] Disentanglement by means of action-induced representations
Link: https://arxiv.org/abs/2602.06741
Authors: Gorka Muñoz-Gil, Hendrik Poulsen Nautrup, Arunava Majumder, Paulin de Schoulepnikoff, Florian Fürrutter, Marius Krumm, Hans J. Briegel
Subjects: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Quantum Physics (quant-ph)
Comments: Main text: 10 pages, 4 figures
Abstract:Learning interpretable representations with variational autoencoders (VAEs) is a major goal of representation learning. The main challenge lies in obtaining disentangled representations, where each latent dimension corresponds to a distinct generative factor. This difficulty is fundamentally tied to the inability to perform nonlinear independent component analysis. Here, we introduce the framework of action-induced representations (AIRs) which models representations of physical systems given experiments (or actions) that can be performed on them. We show that, in this framework, we can provably disentangle degrees of freedom w.r.t. their action dependence. We further introduce a variational AIR architecture (VAIR) that can extract AIRs and therefore achieve provable disentanglement where standard VAEs fail. Beyond state representation, VAIR also captures the action dependence of the underlying generative factors, directly linking experiments to the degrees of freedom they influence.
[LG-29] Explaining Grokking in Transformers through the Lens of Inductive Bias
Link: https://arxiv.org/abs/2602.06702
Authors: Jaisidh Singh, Diganta Misra, Antonio Orvieto
Subjects: Machine Learning (cs.LG)
Comments: Total 15 pages, 9 figures
Abstract:We investigate grokking in transformers through the lens of inductive bias: dispositions arising from architecture or optimization that let the network prefer one solution over another. We first show that architectural choices such as the position of Layer Normalization (LN) strongly modulate grokking speed. This modulation is explained by isolating how LN on specific pathways shapes shortcut-learning and attention entropy. Subsequently, we study how different optimization settings modulate grokking, inducing distinct interpretations of previously proposed controls such as readout scale. Particularly, we find that using readout scale as a control for lazy training can be confounded by learning rate and weight decay in our setting. Accordingly, we show that features evolve continuously throughout training, suggesting grokking in transformers can be more nuanced than a lazy-to-rich transition of the learning regime. Finally, we show how generalization predictably emerges with feature compressibility in grokking, across different modulators of inductive bias. Our code is released at this https URL.
[LG-30] Taipan: A Query-free Transfer-based Multiple Sensitive Attribute Inference Attack Solely from Publicly Released Graphs
Link: https://arxiv.org/abs/2602.06700
Authors: Ying Song, Balaji Palanisamy
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:Graph-structured data underpin a wide spectrum of modern applications. However, complex graph topologies and homophilic patterns can facilitate attribute inference attacks (AIAs) by enabling sensitive information leakage to propagate across local neighborhoods. Existing AIAs predominantly assume that adversaries can probe sensitive attributes through repeated model queries. Such assumptions are often impractical in real-world settings due to stringent data protection regulations, prohibitive query budgets, and heightened detection risks, especially when inferring multiple sensitive attributes. More critically, this model-centric perspective obscures a pervasive blind spot: intrinsic multiple sensitive information leakage arising solely from publicly released graphs. To exploit this unexplored vulnerability, we introduce a new attack paradigm and propose Taipan, the first query-free transfer-based attack framework for multiple sensitive attribute inference attacks on graphs (G-MSAIAs). Taipan integrates Hierarchical Attack Knowledge Routing to capture intricate inter-attribute correlations, and Prompt-guided Attack Prototype Refinement to mitigate negative transfer and performance degradation. We further present a systematic evaluation framework tailored to G-MSAIAs. Extensive experiments on diverse real-world graph datasets demonstrate that Taipan consistently achieves strong attack performance across same-distribution settings and heterogeneous similar- and out-of-distribution settings with mismatched feature dimensionalities, and remains effective even under rigorous differential privacy guarantees. Our findings underscore the urgent need for more robust multi-attribute privacy-preserving graph publishing methods and data-sharing practices.
[LG-31] NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models
Link: https://arxiv.org/abs/2602.06694
Authors: Hyochan Chong, Dongkyu Kim, Changdong Kim, Minseop Choi
Subjects: Machine Learning (cs.LG)
Comments: 26 pages. Hyochan Chong and Dongkyu Kim contributed equally to this work
Abstract:Weight-only quantization has become a standard approach for efficiently serving large language models (LLMs). However, existing methods fail to efficiently compress models to binary (1-bit) levels, as they either require large amounts of data and compute or incur additional storage. In this work, we propose NanoQuant, the first post-training quantization (PTQ) method to compress LLMs to both binary and sub-1-bit levels. NanoQuant formulates quantization as a low-rank binary factorization problem, and compresses full-precision weights to low-rank binary matrices and scales. Specifically, it utilizes an efficient alternating direction method of multipliers (ADMM) procedure to precisely initialize latent binary matrices and scales, and then tunes the initialized parameters through a block and model reconstruction process. Consequently, NanoQuant establishes a new Pareto frontier in low-memory post-training quantization, achieving state-of-the-art accuracy even at sub-1-bit compression rates. NanoQuant makes large-scale deployment feasible on consumer hardware. For example, it compresses Llama2-70B by $25.8\times$ in just 13 hours on a single H100, enabling a 70B model to operate on a consumer 8 GB GPU.
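To see why a low-rank binary factorization can go below one bit per weight, a rough storage count helps. The sketch below assumes a weight matrix is replaced by two binary factors of rank `rank` plus one scale per row and per column. That scale layout, the 16-bit scale width, and the function name are assumptions for illustration, since the abstract does not give the exact format.

```python
def bits_per_weight(rows, cols, rank, scale_bits=16):
    """Average storage per original weight for a rank-`rank` binary factorization.

    Counts one bit per entry of the (rows x rank) and (rank x cols) binary
    factors, plus scale_bits for each assumed per-row and per-column scale.
    """
    binary_bits = rank * (rows + cols)
    scale_storage = scale_bits * (rows + cols)
    return (binary_bits + scale_storage) / (rows * cols)


# Hypothetical 4096 x 4096 layer factorized at rank 1024.
example = bits_per_weight(4096, 4096, 1024)
```

Under these assumptions, the rate drops below one bit per weight when the rank is well below `rows * cols / (rows + cols)`, and the scales add only a small overhead for large layers.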
[LG-32] Makespan Minimization in Split Learning: From Theory to Practice
Link: https://arxiv.org/abs/2602.06693
Authors: Robert Ganian, Fionn Mc Inerney, Dimitra Tsigkari
Subjects: Networking and Internet Architecture (cs.NI); Computational Complexity (cs.CC); Machine Learning (cs.LG)
Comments: This paper will appear at IEEE INFOCOM 2026
Abstract:Split learning recently emerged as a solution for distributed machine learning with heterogeneous IoT devices, where clients can offload part of their training to computationally-powerful helpers. The core challenge in split learning is to minimize the training time by jointly devising the client-helper assignment and the schedule of tasks at the helpers. We first study the model where each helper has a memory cardinality constraint on how many clients it may be assigned, which represents the case of homogeneous tasks. Through complexity theory, we rule out exact polynomial-time algorithms and approximation schemes even for highly restricted instances of this problem. We complement these negative results with a non-trivial polynomial-time 5-approximation algorithm. Building on this, we then focus on the more general heterogeneous task setting considered by Tirana et al. [INFOCOM 2024], where helpers have memory capacity constraints and clients have variable memory costs. In this case, we prove that, unless P=NP, the problem cannot admit a polynomial-time approximation algorithm for any approximation factor. However, by adapting our aforementioned 5-approximation algorithm, we develop a novel heuristic for the heterogeneous task setting and show that it outperforms heuristics from prior works through extensive experiments.
[LG-33] Memory-Conditioned Flow-Matching for Stable Autoregressive PDE Rollouts
Link: https://arxiv.org/abs/2602.06689
Authors: Victor Armegioiu
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Autoregressive generative PDE solvers can be accurate one step ahead yet drift over long rollouts, especially in coarse-to-fine regimes where each step must regenerate unresolved fine scales. This is the regime of diffusion and flow-matching generators: although their internal dynamics are Markovian, rollout stability is governed by per-step conditional law errors. Using the Mori–Zwanzig projection formalism, we show that eliminating unresolved variables yields an exact resolved evolution with a Markov term, a memory term, and an orthogonal forcing, exposing a structural limitation of memoryless closures. Motivated by this, we introduce memory-conditioned diffusion/flow-matching with a compact online state injected into denoising via latent features. Via disintegration, memory induces a structured conditional tail prior for unresolved scales and reduces the transport needed to populate missing frequencies. We prove Wasserstein stability of the resulting conditional kernel. We then derive discrete Grönwall rollout bounds that separate memory approximation from conditional generation error. Experiments on compressible flows with shocks and multiscale mixing show improved accuracy and markedly more stable long-horizon rollouts, with better fine-scale spectral and statistical fidelity.
[LG-34] Pruning at Initialisation through the lens of Graphon Limit: Convergence Expressivity and Generalisation
Link: https://arxiv.org/abs/2602.06675
Authors: Hoang Pham, The-Anh Ta, Long Tran-Thanh
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Pruning at Initialisation methods discover sparse, trainable subnetworks before training, but their theoretical mechanisms remain elusive. Existing analyses are often limited to finite-width statistics, lacking a rigorous characterisation of the global sparsity patterns that emerge as networks grow large. In this work, we connect discrete pruning heuristics to graph limit theory via graphons, establishing the graphon limit of PaI masks. We introduce a Factorised Saliency Model that encompasses popular pruning criteria and prove that, under regularity conditions, the discrete masks generated by these algorithms converge to deterministic bipartite graphons. This limit framework establishes a novel topological taxonomy for sparse networks: while unstructured methods (e.g., Random, Magnitude) converge to homogeneous graphons representing uniform connectivity, data-driven methods (e.g., SNIP, GraSP) converge to heterogeneous graphons that encode implicit feature selection. Leveraging this continuous characterisation, we derive two fundamental theoretical results: (i) a Universal Approximation Theorem for sparse networks that depends only on the intrinsic dimension of active coordinate subspaces; and (ii) a Graphon-NTK generalisation bound demonstrating how the limit graphon modulates the kernel geometry to align with informative features. Our results transform the study of sparse neural networks from combinatorial graph problems into a rigorous framework of continuous operators, offering a new mechanism for analysing expressivity and generalisation in sparse neural networks.
[LG-35] Confundo: Learning to Generate Robust Poison for Practical RAG Systems
Link: https://arxiv.org/abs/2602.06616
Authors: Haoyang Hu, Zhejun Jiang, Yueming Lyu, Junyuan Zhang, Yi Liu, Ka-Ho Chow
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:Retrieval-augmented generation (RAG) is increasingly deployed in real-world applications, where its reference-grounded design makes outputs appear trustworthy. This trust has spurred research on poisoning attacks that craft malicious content, inject it into knowledge sources, and manipulate RAG responses. However, when evaluated in practical RAG systems, existing attacks suffer from severely degraded effectiveness. This gap stems from two overlooked realities: (i) content is often processed before use, which can fragment the poison and weaken its effect, and (ii) users often do not issue the exact queries anticipated during attack design. These factors can lead practitioners to underestimate risks and develop a false sense of security. To better characterize the threat to practical systems, we present Confundo, a learning-to-poison framework that fine-tunes a large language model as a poison generator to achieve high effectiveness, robustness, and stealthiness. Confundo provides a unified framework supporting multiple attack objectives, demonstrated by manipulating factual correctness, inducing biased opinions, and triggering hallucinations. By addressing these overlooked challenges, Confundo consistently outperforms a wide range of purpose-built attacks across datasets and RAG configurations by large margins, even in the presence of defenses. Beyond exposing vulnerabilities, we also present a defensive use case that protects web content from unauthorized incorporation into RAG systems via scraping, with no impact on user experience.
[LG-36] Adaptive-CaRe: Adaptive Causal Regularization for Robust Outcome Prediction
Link: https://arxiv.org/abs/2602.06611
Authors: Nithya Bhasker, Fiona R. Kolbinger, Susu Hu, Gitta Kutyniok, Stefanie Speidel
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Accurate prediction of outcomes is crucial for clinical decision-making and personalized patient care. Supervised machine learning algorithms, which are commonly used for outcome prediction in the medical domain, optimize for predictive accuracy, which can result in models latching onto spurious correlations instead of robust predictors. Causal structure learning methods, on the other hand, have the potential to provide robust predictors for the target, but can be too conservative because of algorithmic and data assumptions, resulting in loss of diagnostic precision. Therefore, we propose a novel model-agnostic regularization strategy, Adaptive-CaRe, for generalized outcome prediction in the medical domain. Adaptive-CaRe strikes a balance between predictive value and causal robustness by incorporating a penalty that is proportional to the difference between the estimated statistical contribution and estimated causal contribution of the input features for model predictions. Our experiments on synthetic data establish the efficacy of the proposed Adaptive-CaRe regularizer in finding robust predictors for the target while maintaining competitive predictive accuracy. With experiments on a standard causal benchmark, we provide a blueprint for navigating the trade-off between predictive accuracy and causal robustness by tweaking the regularization strength, \lambda. Validation using a real-world dataset confirms that the results translate to practical, real-domain settings. Therefore, Adaptive-CaRe provides a simple yet effective solution to the long-standing trade-off between predictive accuracy and causal robustness in the medical domain. Future work would involve studying alternate causal structure learning frameworks and complex classification models to provide deeper insights at a larger scale.
[LG-37] The hidden risks of temporal resampling in clinical reinforcement learning
Link: https://arxiv.org/abs/2602.06603
Authors: Thomas Frost, Hrisheekesh Vaidya, Steve Harris
Subjects: Machine Learning (cs.LG)
Comments: 12 pages, 4 figures. Currently under submission to npj Digital Medicine
Abstract:Offline reinforcement learning (ORL) has shown potential for improving decision-making in healthcare. However, contemporary research typically aggregates patient data into fixed time intervals, simplifying their mapping to standard ORL frameworks. The impact of these temporal manipulations on model safety and efficacy remains poorly understood. In this work, using both a gridworld navigation task and the UVA/Padova clinical diabetes simulator, we demonstrate that temporal resampling significantly degrades the performance of offline reinforcement learning algorithms during live deployment. We propose three mechanisms that drive this failure: (i) the generation of counterfactual trajectories, (ii) the distortion of temporal expectations, and (iii) the compounding of generalisation errors. Crucially, we find that standard off-policy evaluation metrics can fail to detect these drops in performance. Our findings reveal a fundamental risk in current healthcare ORL pipelines and emphasise the need for methods that explicitly handle the irregular timing of clinical decision-making.
[LG-38] DiTS: Multimodal Diffusion Transformers Are Time Series Forecasters
Link: https://arxiv.org/abs/2602.06597
Authors: Haoran Zhang, Haixuan Liu, Yong Liu, Yunzhong Qiu, Yuxuan Wang, Jianmin Wang, Mingsheng Long
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:While generative modeling on time series facilitates more capable and flexible probabilistic forecasting, existing generative time series models do not address the multi-dimensional properties of time series data well. The prevalent architecture of Diffusion Transformers (DiT), which relies on simplistic conditioning controls and a single-stream Transformer backbone, tends to underutilize cross-variate dependencies in covariate-aware forecasting. Inspired by Multimodal Diffusion Transformers that integrate textual guidance into video generation, we propose Diffusion Transformers for Time Series (DiTS), a general-purpose architecture that frames endogenous and exogenous variates as distinct modalities. To better capture both inter-variate and intra-variate dependencies, we design a dual-stream Transformer block tailored for time-series data, comprising a Time Attention module for autoregressive modeling along the temporal dimension and a Variate Attention module for cross-variate modeling. Unlike the common approach for images, which flattens 2D token grids into 1D sequences, our design leverages the low-rank property inherent in multivariate dependencies, thereby reducing computational costs. Experiments show that DiTS achieves state-of-the-art performance across benchmarks, regardless of the presence of future exogenous variate observations, demonstrating unique generative forecasting strengths over traditional deterministic deep forecasting models.
[LG-39] Degradation of Feature Space in Continual Learning
Link: https://arxiv.org/abs/2602.06586
Authors: Chiara Lanza, Roberto Pereira, Marco Miozzo, Eduard Angelats, Paolo Dini
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:
Abstract:Centralized training is the standard paradigm in deep learning, enabling models to learn from a unified dataset in a single location. In such a setup, isotropic feature distributions naturally arise as a means to support well-structured and generalizable representations. In contrast, continual learning operates on streaming and non-stationary data, and trains models incrementally, inherently facing the well-known plasticity-stability dilemma. In such settings, learning dynamics tend to yield an increasingly anisotropic feature space. This raises a fundamental question: should isotropy be enforced to achieve a better balance between stability and plasticity, and thereby mitigate catastrophic forgetting? In this paper, we investigate whether promoting feature-space isotropy can enhance representation quality in continual learning. Through experiments using contrastive continual learning techniques on CIFAR-10 and CIFAR-100 data, we find that isotropic regularization fails to improve, and can in fact degrade, model accuracy in continual settings. Our results highlight essential differences in feature geometry between centralized and continual learning, suggesting that isotropy, while beneficial in centralized setups, may not constitute an appropriate inductive bias for non-stationary learning scenarios.
[LG-40] Learning to Allocate Resources with Censored Feedback
Link: https://arxiv.org/abs/2602.06565
Authors: Giovanni Montanari, Côme Fiegel, Corentin Pla, Aadirupa Saha, Vianney Perchet
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Comments:
Abstract:We study the online resource allocation problem in which at each round, a budget B must be allocated across K arms under censored feedback. An arm yields a reward if and only if two conditions are satisfied: (i) the arm is activated according to an arm-specific Bernoulli random variable with unknown parameter, and (ii) the allocated budget exceeds a random threshold drawn from a parametric distribution with unknown parameter. Over T rounds, the learner must jointly estimate the unknown parameters and allocate the budget so as to maximize cumulative reward facing the exploration–exploitation trade-off. We prove an information-theoretic regret lower bound \Omega(T^{1/3}), demonstrating the intrinsic difficulty of the problem. We then propose RA-UCB, an optimistic algorithm that leverages non-trivial parameter estimation and confidence bounds. When the budget B is known at the beginning of each round, RA-UCB achieves a regret of order \widetilde{\mathcal{O}}(\sqrt{T}), and even \mathcal{O}(\mathrm{poly}\text{-}\log T) under stronger assumptions. As for unknown, round dependent budget, we introduce MG-UCB, which allows within-round switching and infinitesimal allocations, and matches the regret guarantees of RA-UCB. We then validate our theoretical results through experiments on real-world datasets.
[LG-41] Reinforcement Learning-Based Dynamic Management of Structured Parallel Farm Skeletons on Serverless Platforms
Link: https://arxiv.org/abs/2602.06555
Authors: Lanpei Li, Massimo Coppola, Malio Li, Valerio Besozzi, Jack Bell, Vincenzo Lomonaco
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: Accepted at AHPC3 workshop, PDP 2026
Abstract:We present a framework for dynamic management of structured parallel processing skeletons on serverless platforms. Our goal is to bring HPC-like performance and resilience to serverless and continuum environments while preserving the programmability benefits of skeletons. As a first step, we focus on the well known Farm pattern and its implementation on the open-source OpenFaaS platform, treating autoscaling of the worker pool as a QoS-aware resource management problem. The framework couples a reusable farm template with a Gymnasium-based monitoring and control layer that exposes queue, timing, and QoS metrics to both reactive and learning-based controllers. We investigate the effectiveness of AI-driven dynamic scaling for managing the farm’s degree of parallelism via the scalability of serverless functions on OpenFaaS. In particular, we discuss the autoscaling model and its training, and evaluate two reinforcement learning (RL) policies against a baseline of reactive management derived from a simple farm performance model. Our results show that AI-based management can better accommodate platform-specific limitations than purely model-based performance steering, improving QoS while maintaining efficient resource usage and stable scaling behaviour.
[LG-42] Fine-Grained Model Merging via Modular Expert Recombination
Link: https://arxiv.org/abs/2602.06552
Authors: Haiyun Qiu, Xingyu Wu, Liang Feng, Kay Chen Tan
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Model merging constructs versatile models by integrating task-specific models without requiring labeled data or expensive joint retraining. Although recent methods improve adaptability to heterogeneous tasks by generating customized merged models for each instance, they face two critical limitations. First, the instance-specific merged models lack reusability, restricting the exploitation of high-quality merging configurations and efficient batch inference. Second, these methods treat each task-specific model as a monolithic whole, overlooking the diverse mergeability of homologous components such as attention and multilayer perceptron layers, and the differing merging sensitivities across components. To address these limitations, we propose MERGE (Modular Expert Recombination for fine-Grained mErging), a method that enables component-wise model merging and input-aware, on-demand module recombination at inference. MERGE formulates component-wise merging as a bi-objective optimization problem that balances cross-task performance and storage efficiency, and develops a surrogate-assisted evolutionary algorithm to efficiently identify Pareto-optimal merging configurations. These high-quality configurations underpin a reusable modular expert library, from which a lightweight routing network dynamically activates and recombines modular experts to assemble input-specific models and enable efficient inference under storage constraints. Extensive experiments across various model scales, task types, and fine-tuning strategies demonstrate that MERGE consistently outperforms strong baselines and generalizes effectively.
[LG-43] Refining the Information Bottleneck via Adversarial Information Separation
Link: https://arxiv.org/abs/2602.06549
Authors: Shuai Ning, Zhenpeng Wang, Lin Wang, Bing Chen, Shuangrong Liu, Xu Wu, Jin Zhou, Bo Yang
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Generalizing from limited data is particularly critical for models in domains such as material science, where task-relevant features in experimental datasets are often heavily confounded by measurement noise and experimental artifacts. Standard regularization techniques fail to precisely separate meaningful features from noise, while existing adversarial adaptation methods are limited by their reliance on explicit separation labels. To address this challenge, we propose the Adversarial Information Separation Framework (AdverISF), which isolates task-relevant features from noise without requiring explicit supervision. AdverISF introduces a self-supervised adversarial mechanism to enforce statistical independence between task-relevant features and noise representations. It further employs a multi-layer separation architecture that progressively recycles noise information across feature hierarchies to recover features inadvertently discarded as noise, thereby enabling finer-grained feature extraction. Extensive experiments demonstrate that AdverISF outperforms state-of-the-art methods in data-scarce scenarios. In addition, evaluations on real-world material design tasks show that it achieves superior generalization performance.
[LG-44] Live Knowledge Tracing: Real-Time Adaptation using Tabular Foundation Models
Link: https://arxiv.org/abs/2602.06542
Authors: Mounir Lbath, Alexandre Paresy, Abdelkayoum Kaddouri, Alan André, Alexandre Ittah, Jill-Jênn Vie (SODA)
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Deep knowledge tracing models have achieved significant breakthroughs in modeling student learning trajectories. However, these architectures require substantial training time and are prone to overfitting on datasets with short sequences. In this paper, we explore a new paradigm for knowledge tracing by leveraging tabular foundation models (TFMs). Unlike traditional methods that require offline training on a fixed training set, our approach performs real-time "live" knowledge tracing in an online way. The core of our method lies in a two-way attention mechanism: while attention knowledge tracing models only attend across earlier time steps, TFMs simultaneously attend across both time steps and interactions of other students in the training set. They align testing sequences with relevant training sequences at inference time, therefore skipping the training step entirely. We demonstrate, using several datasets of increasing size, that our method achieves competitive predictive performance with up to 273x speedups, in a setting where more student interactions are observed over time.
[LG-45] AlertBERT: A noise-robust alert grouping framework for simultaneous cyber attacks
Link: https://arxiv.org/abs/2602.06534
Authors: Lukas Karner, Max Landauer, Markus Wurzenberger, Florian Skopik
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:Automated detection of cyber attacks is a critical capability to counteract the growing volume and sophistication of cyber attacks. However, the high numbers of security alerts issued by intrusion detection systems lead to alert fatigue among analysts working in security operations centres (SOC), which in turn causes slow reaction time and incorrect decision making. Alert grouping, which refers to clustering of security alerts according to their underlying causes, can significantly reduce the number of distinct items analysts have to consider. Unfortunately, conventional time-based alert grouping solutions are unsuitable for large-scale computer networks characterised by high levels of false positive alerts and simultaneously occurring attacks. To address these limitations, we propose AlertBERT, a self-supervised framework designed to group alerts from isolated or concurrent attacks in noisy environments. Our open-source implementation of AlertBERT leverages masked language models and density-based clustering to support both real-time and forensic operation. To evaluate our framework, we further introduce a novel data augmentation method that enables flexible control over noise levels and simulates concurrent attack occurrences. Based on the data sets generated through this method, we demonstrate that AlertBERT consistently outperforms conventional time-based grouping techniques, achieving superior accuracy in identifying correct alert groups.
[LG-46] Topography scanning as a part of process monitoring in power cable insulation process
Link: https://arxiv.org/abs/2602.06519
Authors: Janne Harjuhahto, Jaakko Harjuhahto, Mikko Lahti, Jussi Hanhirova, Björn Sonerud
Subjects: Machine Learning (cs.LG)
Comments: 6 pages, 14 figures
Abstract:We present a novel topography scanning system developed for XLPE cable core monitoring. Modern measurement technology is utilized together with embedded high-performance computing to build a complete and detailed 3D surface map of the insulated core. Cross-sectional and lengthwise geometry errors are studied, and melt homogeneity is identified as one major factor for these errors. A surface defect detection system has been developed utilizing deep learning methods. Our results show that convolutional neural networks are well suited for real-time analysis of surface measurement data, enabling reliable detection of surface defects.
[LG-47] Evolutionary Generation of Multi-Agent Systems
Link: https://arxiv.org/abs/2602.06511
Authors: Yuntong Hu, Matthew Trager, Yuting Zhang, Yi Zhang, Shuo Yang, Wei Xia, Stefano Soatto
Subjects: Machine Learning (cs.LG)
Comments: 26 pages, 15 figures
Abstract:Large language model (LLM)-based multi-agent systems (MAS) show strong promise for complex reasoning, planning, and tool-augmented tasks, but designing effective MAS architectures remains labor-intensive, brittle, and hard to generalize. Existing automatic MAS generation methods either rely on code generation, which often leads to executability and robustness failures, or impose rigid architectural templates that limit expressiveness and adaptability. We propose Evolutionary Generation of Multi-Agent Systems (EvoMAS), which formulates MAS generation as structured configuration generation. EvoMAS performs evolutionary generation in configuration space. Specifically, EvoMAS selects initial configurations from a pool, applies feedback-conditioned mutation and crossover guided by execution traces, and iteratively refines both the candidate pool and an experience memory. We evaluate EvoMAS on diverse benchmarks, including BBEH, SWE-Bench, and WorkBench, covering reasoning, software engineering, and tool-use tasks. EvoMAS consistently improves task performance over both human-designed MAS and prior automatic MAS generation methods, while producing generated systems with higher executability and runtime robustness. EvoMAS outperforms the agent evolution method EvoAgent by +10.5 points on BBEH reasoning and +7.1 points on WorkBench. With Claude-4.5-Sonnet, EvoMAS also reaches 79.1% on SWE-Bench-Verified, matching the top of the leaderboard.
[LG-48] Can Microcanonical Langevin Dynamics Leverage Mini-Batch Gradient Noise?
Link: https://arxiv.org/abs/2602.06500
Authors: Emanuel Sommer, Kangning Diao, Jakob Robnik, Uros Seljak, David Rügamer
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Scaling inference methods such as Markov chain Monte Carlo to high-dimensional models remains a central challenge in Bayesian deep learning. A promising recent proposal, microcanonical Langevin Monte Carlo, has shown state-of-the-art performance across a wide range of problems. However, its reliance on full-dataset gradients makes it prohibitively expensive for large-scale problems. This paper addresses a fundamental question: Can microcanonical dynamics effectively leverage mini-batch gradient noise? We provide the first systematic study of this problem, establishing a novel continuous-time theoretical analysis of stochastic-gradient microcanonical dynamics. We reveal two critical failure modes: a theoretically derived bias due to anisotropic gradient noise and numerical instabilities in complex high-dimensional posteriors. To tackle these issues, we propose a principled gradient noise preconditioning scheme shown to significantly reduce this bias and develop a novel, energy-variance-based adaptive tuner that automates step size selection and dynamically informs numerical guardrails. The resulting algorithm is a robust and scalable microcanonical Monte Carlo sampler that achieves state-of-the-art performance on challenging high-dimensional inference tasks like Bayesian neural networks. Combined with recent ensemble techniques, our work unlocks a new class of stochastic microcanonical Langevin ensemble (SMILE) samplers for large-scale Bayesian inference.
[LG-49] Adaptive Uncertainty-Aware Tree Search for Robust Reasoning
Link: https://arxiv.org/abs/2602.06493
Authors: Zeen Song, Zihao Ma, Wenwen Qiang, Changwen Zheng, Gang Hua
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Inference-time reasoning scaling has significantly advanced the capabilities of Large Language Models (LLMs) in complex problem-solving. A prevalent approach involves external search guided by Process Reward Models (PRMs). However, a fundamental limitation of this framework is the epistemic uncertainty of PRMs when evaluating reasoning paths that deviate from their training distribution. In this work, we conduct a systematic analysis of this challenge. We first provide empirical evidence that PRMs exhibit high uncertainty and unreliable scoring on out-of-distribution (OOD) samples. We then establish a theoretical framework proving that while standard search incurs linear regret accumulation, an uncertainty-aware strategy can achieve sublinear regret. Motivated by these findings, we propose Uncertainty-Aware Tree Search (UATS), a unified method that estimates uncertainty via Monte Carlo Dropout and dynamically allocates compute budget using a reinforcement learning-based controller. Extensive experiments demonstrate that our approach effectively mitigates the impact of OOD errors.
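The abstract above mentions estimating uncertainty via Monte Carlo Dropout. As a rough illustration of that general idea only (not the authors' UATS implementation), the sketch below keeps dropout active at inference time, scores the same input many times, and treats the spread of the scores as an uncertainty estimate. The toy linear scorer, the function name `mc_dropout_score`, and all parameter values are assumptions made for this example; a real Process Reward Model would be a neural network.

```python
import random
import statistics

def mc_dropout_score(features, weights, drop_prob=0.2, n_samples=200, seed=0):
    """Mean and standard deviation of a toy linear score under random dropout.

    Hypothetical stand-in for a reward model: each forward pass randomly
    drops input features, and the spread across passes is read as uncertainty.
    """
    rng = random.Random(seed)
    keep_scale = 1.0 / (1.0 - drop_prob)  # inverted-dropout rescaling
    scores = []
    for _ in range(n_samples):
        total = 0.0
        for x, w in zip(features, weights):
            if rng.random() >= drop_prob:  # keep this feature in this pass
                total += w * x * keep_scale
        scores.append(total)
    return statistics.mean(scores), statistics.pstdev(scores)

if __name__ == "__main__":
    mean, std = mc_dropout_score([1.0, 2.0, -0.5], [0.5, 0.25, 1.0])
    print(f"score mean={mean:.3f}, uncertainty (std)={std:.3f}")
```

A search procedure could then, for instance, spend more compute on reasoning paths whose scores have a large standard deviation; how UATS actually allocates its budget is described in the paper, not here.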
[LG-50] Towards Generalizable Reasoning: Group Causal Counterfactual Policy Optimization for LLM Reasoning
Link: https://arxiv.org/abs/2602.06475
Authors: Jingyao Wang, Peizheng Guo, Wenwen Qiang, Jiahuan Zhou, Huijie Guo, Changwen Zheng, Hui Xiong
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Large language models (LLMs) excel at complex tasks with advances in reasoning capabilities. However, existing reward mechanisms remain tightly coupled to final correctness and pay little attention to the underlying reasoning process: trajectories with sound reasoning but wrong answers receive low credit, while lucky guesses with flawed logic may be highly rewarded, affecting reasoning generalization. From a causal perspective, we interpret multi-candidate reasoning for a fixed question as a family of counterfactual experiments with theoretical supports. Building on this, we propose Group Causal Counterfactual Policy Optimization to explicitly train LLMs to learn generalizable reasoning patterns. It proposes an episodic causal counterfactual reward that jointly captures (i) robustness, encouraging the answer distribution induced by a reasoning step to remain stable under counterfactual perturbations; and (ii) effectiveness, enforcing sufficient variability so that the learned reasoning strategy can transfer across questions. We then construct token-level advantages from this reward and optimize the policy, encouraging LLMs to favor reasoning patterns that are process-valid and counterfactually robust. Extensive experiments on diverse benchmarks demonstrate its advantages.
[LG-51] Achieving Better Local Regret Bound for Online Non-Convex Bilevel Optimization
Link: https://arxiv.org/abs/2602.06457
Authors: Tingkai Jia, Haiguang Wang, Cheng Chen
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:Online bilevel optimization (OBO) has emerged as a powerful framework for many machine learning problems. Prior works have developed several algorithms that minimize the standard bilevel local regret or the window-averaged bilevel local regret of the OBO problem, but the optimality of existing regret bounds remains unclear. In this work, we establish optimal regret bounds for both settings. For standard bilevel local regret, we propose an algorithm that achieves the optimal regret \Omega(1+V_T) with at most O(T\log T) total inner-level gradient evaluations. We further develop a fully single-loop algorithm whose regret bound includes an additional gradient-variation term. For the window-averaged bilevel local regret, we design an algorithm that captures sublinear environmental variation through a window-based analysis and achieves the optimal regret \Omega(T/W^2). Experiments validate our theoretical findings and demonstrate the practical effectiveness of the proposed methods.
[LG-52] The Window Dilemma: Why Concept Drift Detection is Ill-Posed
Link: https://arxiv.org/abs/2602.06456
Authors: Brandon Gower-Winter, Misja Groen, Georg Krempl
Subjects: Machine Learning (cs.LG)
Comments: 12 pages, 1 Figure, 5 Tables. Accepted to the 24th International Symposium on Intelligent Data Analysis (IDA) (2026)
Abstract:Non-stationarity of an underlying data generating process that leads to distributional changes over time is a key characteristic of Data Streams. This phenomenon, commonly referred to as Concept Drift, has been intensively studied, and Concept Drift Detectors have been established as a class of methods for detecting such changes (drifts). For the most part, Drift Detectors compare regions (windows) of the data stream and detect drift if those windows are sufficiently dissimilar. In this work, we introduce the Window Dilemma, an observation that perceived drift is a product of windowing and not necessarily the underlying data generating process. Additionally, we highlight that drift detection is ill-posed, primarily because verification of drift events is implausible in practice. We demonstrate these contributions first by an illustrative example, followed by empirical comparisons of drift detectors against a variety of alternative adaptation strategies. Our main finding is that traditional batch learning techniques often perform better than their drift-aware counterparts, further bringing into question the purpose of detectors in Stream Classification.
[LG-53] On the Plasticity and Stability for Post-Training Large Language Models
Link: https://arxiv.org/abs/2602.06453
Authors: Wenwen Qiang, Ziyin Gu, Jiahuan Zhou, Jie Hu, Jingyao Wang, Changwen Zheng, Hui Xiong
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Training stability remains a critical bottleneck for Group Relative Policy Optimization (GRPO), often manifesting as a trade-off between reasoning plasticity and general capability retention. We identify a root cause as the geometric conflict between plasticity and stability gradients, which leads to destructive interference. Crucially, we argue that deterministic projection methods are suboptimal for GRPO as they overlook the intrinsic stochasticity of group-based gradient estimates. To address this, we propose Probabilistic Conflict Resolution (PCR), a Bayesian framework that models gradients as random variables. PCR dynamically arbitrates conflicts via an uncertainty-aware "soft projection" mechanism, optimizing the signal-to-noise ratio. Extensive experiments demonstrate that PCR significantly smooths the training trajectory and achieves superior performance in various reasoning tasks.
[LG-54] BrokenBind: Universal Modality Exploration beyond Dataset Boundaries
Link: https://arxiv.org/abs/2602.06451
Authors: Zhuo Huang, Runnan Chen, Bo Han, Gang Niu, Masashi Sugiyama, Tongliang Liu
Subjects: Machine Learning (cs.LG)
Comments: 17 pages, 8 figures, and 10 tables
Abstract:Multi-modal learning combines various modalities to provide a comprehensive understanding of real-world problems. A common strategy is to directly bind different modalities together in a specific joint embedding space. However, the capability of existing methods is restricted within the modalities presented in the given dataset, thus they are biased when generalizing to unpresented modalities in downstream tasks. As a result, due to such inflexibility, the viability of previous methods is seriously hindered by the cost of acquiring multi-modal datasets. In this paper, we introduce BrokenBind, which focuses on binding modalities that are presented from different datasets. To achieve this, BrokenBind simultaneously leverages multiple datasets containing the modalities of interest and one shared modality. Though the two datasets do not correspond to each other due to distribution mismatch, we can capture their relationship to generate pseudo embeddings to fill in the missing modalities of interest, enabling flexible and generalized multi-modal learning. Under our framework, any two modalities can be bound together, free from the dataset limitation, to achieve universal modality exploration. Further, to reveal the capability of our method, we study intensified scenarios where more than two datasets are needed for modality binding and show the effectiveness of BrokenBind in low-data regimes. Through extensive evaluation, we carefully justify the superiority of BrokenBind compared to well-known multi-modal baseline methods.
[LG-55] Is Gradient Ascent Really Necessary? Memorize to Forget for Machine Unlearning
链接: https://arxiv.org/abs/2602.06441
作者: Zhuo Huang,Qizhou Wang,Ziming Hong,Shanshan Ye,Bo Han,Tongliang Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:For ethical and safe AI, machine unlearning rises as a critical topic aiming to protect sensitive, private, and copyrighted knowledge from misuse. To achieve this goal, it is common to conduct gradient ascent (GA) to reverse the training on undesired data. However, such a reversal is prone to catastrophic collapse, which leads to serious performance degradation in general tasks. As a solution, we propose model extrapolation as an alternative to GA, which reaches the counterpart direction in the hypothesis space from one model given another reference model. Therefore, we leverage the original model as the reference, further train it to memorize undesired data while keeping prediction consistency on the rest retained data, to obtain a memorization model. Counterfactual as it might sound, a forget model can be obtained via extrapolation from the memorization model to the reference model. Hence, we avoid directly acquiring the forget model using GA, but proceed with gradient descent for the memorization model, which successfully stabilizes the machine unlearning process. Our model extrapolation is simple and efficient to implement, and it can also effectively converge throughout training to achieve improved unlearning performance.
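The extrapolation step described in this abstract can be illustrated with a minimal sketch. The abstract gives no formula, so the linear rule in parameter space and the step size `alpha` below are assumptions made for illustration, not the authors' exact procedure; the parameter vectors are plain lists of floats.

```python
def extrapolate(reference, memorization, alpha=1.0):
    """Move from the memorization model through the reference model and beyond.

    Assumption (not stated in the abstract): the forget model is
    reference + alpha * (reference - memorization), applied per parameter.
    """
    return [r + alpha * (r - m) for r, m in zip(reference, memorization)]


# Toy parameters: the memorization model drifted by (+0.5, -1.0) from the
# reference, so the extrapolated model moves by (-0.5, +1.0) instead.
reference = [1.0, 2.0]
memorization = [1.5, 1.0]
forget = extrapolate(reference, memorization, alpha=1.0)
```

Only gradient descent is needed to obtain `memorization`; the reversal happens in the arithmetic above rather than through gradient ascent.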
[LG-56] Reclaiming First Principles: A Differentiable Framework for Conceptual Hydrologic Models
链接: https://arxiv.org/abs/2602.06429
作者: Jasper A. Vrugt,Jonathan M. Frame,Ethan Bollman
类目: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注: 85 pages, 14 figures
Abstract:Conceptual hydrologic models remain the cornerstone of rainfall-runoff modeling, yet their calibration is often slow and numerically fragile. Most gradient-based parameter estimation methods rely on finite-difference approximations or automatic differentiation frameworks (e.g., JAX, PyTorch and TensorFlow), which are computationally demanding and introduce truncation errors, solver instabilities, and substantial overhead. These limitations are particularly acute for the ODE systems of conceptual watershed models. Here we introduce a fully analytic and computationally efficient framework for differentiable hydrologic modeling based on exact parameter sensitivities. By augmenting the governing ODE system with sensitivity equations, we jointly evolve the model states and the Jacobian matrix with respect to all parameters. This Jacobian then provides fully analytic gradient vectors for any differentiable loss function. These include classical objective functions such as the sum of absolute and squared residuals, widely used hydrologic performance metrics such as the Nash-Sutcliffe and Kling-Gupta efficiencies, robust loss functions that down-weight extreme events, and hydrograph-based functionals such as flow-duration and recession curves. The analytic sensitivities eliminate the step-size dependence and noise inherent to numerical differentiation, while avoiding the instability of adjoint methods and the overhead of modern machine-learning autodiff toolchains. The resulting gradients are deterministic, physically interpretable, and straightforward to embed in gradient-based optimizers. Overall, this work enables rapid, stable, and transparent gradient-based calibration of conceptual hydrologic models, unlocking the full potential of differentiable modeling without reliance on external, opaque, or CPU-intensive automatic-differentiation libraries.
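The idea of augmenting the ODE system with sensitivity equations can be sketched on a toy model. The single linear reservoir dS/dt = P - kS, the constant forcing `precip`, and the RK4 integrator below are assumptions chosen for illustration; the paper's models and solver may differ. Differentiating the right-hand side with respect to k gives the sensitivity equation ds/dt = -S - k s for s = ∂S/∂k, which is integrated alongside the state.

```python
import math


def rhs(state, k, precip):
    """Right-hand side of the augmented system (storage S, sensitivity s = dS/dk)."""
    S, s = state
    dS = precip - k * S   # water balance of the reservoir
    ds = -S - k * s       # derivative of (precip - k*S) with respect to k
    return (dS, ds)


def rk4_step(state, k, precip, dt):
    """One classical Runge-Kutta step for the augmented system."""
    def shifted(base, slope, h):
        return tuple(b + h * v for b, v in zip(base, slope))

    k1 = rhs(state, k, precip)
    k2 = rhs(shifted(state, k1, dt / 2), k, precip)
    k3 = rhs(shifted(state, k2, dt / 2), k, precip)
    k4 = rhs(shifted(state, k3, dt), k, precip)
    return tuple(
        x + dt / 6 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(k, precip, s0, t_end, n_steps):
    """Return (S(t_end), dS/dk(t_end)); the sensitivity starts at zero."""
    state = (s0, 0.0)
    dt = t_end / n_steps
    for _ in range(n_steps):
        state = rk4_step(state, k, precip, dt)
    return state


def analytic(k, precip, s0, t):
    """Closed-form solution of the toy model, used only as a check."""
    e = math.exp(-k * t)
    S = precip / k + (s0 - precip / k) * e
    dS_dk = -precip / k**2 * (1 - e) - t * (s0 - precip / k) * e
    return S, dS_dk
```

The second component returned by `simulate` is the exact-in-the-ODE-sense gradient of the state with respect to k, with no finite-difference step size to tune.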
[LG-57] Beyond Code Contributions: How Network Position, Temporal Bursts, and Code Review Activities Shape Contributor Influence in Large-Scale Open Source Ecosystems
链接: https://arxiv.org/abs/2602.06426
作者: S M Rakib Ul Karim,Wenyi Lu,Sean Goggins
类目: Machine Learning (cs.LG)
*备注:
Abstract:Open source software (OSS) projects rely on complex networks of contributors whose interactions drive innovation and sustainability. This study presents a comprehensive analysis of OSS contributor networks using advanced graph neural networks and temporal network analysis on data spanning 25 years from the Cloud Native Computing Foundation ecosystem, encompassing sandbox, incubating, and graduated projects. Our analysis of thousands of contributors across hundreds of repositories reveals that OSS networks exhibit strong power-law distributions in influence, with the top 1% of contributors controlling a substantial portion of network influence. Using GPU-accelerated PageRank, betweenness centrality, and custom LSTM models, we identify five distinct contributor roles: Core, Bridge, Connector, Regular, and Peripheral, each with unique network positions and structural importance. Statistical analysis reveals significant correlations between specific action types (commits, pull requests, issues) and contributor influence, with multiple regression models explaining substantial variance in influence metrics. Temporal analysis shows that network density, clustering coefficients, and modularity exhibit statistically significant temporal trends, with distinct regime changes coinciding with major project milestones. Structural integrity simulations show that Bridge contributors, despite representing a small fraction of the network, have a disproportionate impact on network cohesion when removed. Our findings provide empirical evidence for strategic contributor retention policies and offer actionable insights into community health metrics.
[LG-58] Adaptive Protein Tokenization
链接: https://arxiv.org/abs/2602.06418
作者: Rohit Dilip,Ayush Varshney,David Van Valen
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:
Abstract:Tokenization is a promising path to multi-modal models capable of jointly understanding protein sequences, structure, and function. Existing protein structure tokenizers create tokens by pooling information from local neighborhoods, an approach that limits their performance on generative and representation tasks. In this work, we present a method for global tokenization of protein structures in which successive tokens contribute increasing levels of detail to a global representation. This change resolves several issues with generative models based on local protein tokenization: it mitigates error accumulation, provides embeddings without sequence-reduction operations, and allows task-specific adaptation of a tokenized sequence’s information content. We validate our method on reconstruction, generative, and representation tasks and demonstrate that it matches or outperforms existing models based on local protein structure tokenizers. We show how adaptive tokens enable inference criteria based on information content, which boosts designability. We validate representations generated from our tokenizer on CATH classification tasks and demonstrate that non-linear probing on our tokenized sequences outperforms equivalent probing on representations from other tokenizers. Finally, we demonstrate how our method supports zero-shot protein shrinking and affinity maturation.
[LG-59] EEG Emotion Classification Using an Enhanced Transformer-CNN-BiLSTM Architecture with Dual Attention Mechanisms
链接: https://arxiv.org/abs/2602.06411
作者: S M Rakib Ul Karim,Wenyi Lu,Diponkor Bala,Rownak Ara Rasul,Sean Goggins
类目: Machine Learning (cs.LG)
*备注:
Abstract:Electroencephalography (EEG)-based emotion recognition plays a critical role in affective computing and emerging decision-support systems, yet remains challenging due to high-dimensional, noisy, and subject-dependent signals. This study investigates whether hybrid deep learning architectures that integrate convolutional, recurrent, and attention-based components can improve emotion classification performance and robustness in EEG data. We propose an enhanced hybrid model that combines convolutional feature extraction, bidirectional temporal modeling, and self-attention mechanisms with regularization strategies to mitigate overfitting. Experiments conducted on a publicly available EEG dataset spanning three emotional states (neutral, positive, and negative) demonstrate that the proposed approach achieves state-of-the-art classification performance, significantly outperforming classical machine learning and neural baselines. Statistical tests confirm the robustness of these performance gains under cross-validation. Feature-level analyses further reveal that covariance-based EEG features contribute most strongly to emotion discrimination, highlighting the importance of inter-channel relationships in affective modeling. These findings suggest that carefully designed hybrid architectures can effectively balance predictive accuracy, robustness, and interpretability in EEG-based emotion recognition, with implications for applied affective computing and human-centered intelligent systems.
[LG-60] Near-Optimal Regret for Distributed Adversarial Bandits: A Black-Box Approach
链接: https://arxiv.org/abs/2602.06404
作者: Hao Qiu,Mengxiao Zhang,Nicolò Cesa-Bianchi
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study distributed adversarial bandits, where $N$ agents cooperate to minimize the global average loss while observing only their own local losses. We show that the minimax regret for this problem is $\tilde{\Theta}(\sqrt{(\rho^{-1/2}+K/N)T})$, where $T$ is the horizon, $K$ is the number of actions, and $\rho$ is the spectral gap of the communication matrix. Our algorithm, based on a novel black-box reduction to bandits with delayed feedback, requires agents to communicate only through gossip. It achieves an upper bound that significantly improves over the previous best bound $\tilde{O}(\rho^{-1/3}(KT)^{2/3})$ of Yi and Vojnovic (2023). We complement this result with a matching lower bound, showing that the problem’s difficulty decomposes into a communication cost $\rho^{-1/4}\sqrt{T}$ and a bandit cost $\sqrt{KT/N}$. We further demonstrate the versatility of our approach by deriving first-order and best-of-both-worlds bounds in the distributed adversarial setting. Finally, we extend our framework to distributed linear bandits in $\mathbb{R}^d$, obtaining a regret bound of $\tilde{O}(\sqrt{(\rho^{-1/2}+1/N)dT})$, achieved with only $O(d)$ communication cost per agent and per round via a volumetric spanner.
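The split of the minimax rate into a communication cost and a bandit cost follows from an elementary inequality on square roots. The short derivation below is a reading aid and is not taken from the paper.

```latex
% For a, b >= 0:  sqrt(a + b) <= sqrt(a) + sqrt(b) <= 2 sqrt(a + b).
% Apply it with a = \rho^{-1/2} T and b = KT/N:
\[
\sqrt{(\rho^{-1/2} + K/N)\,T}
  \;\le\; \rho^{-1/4}\sqrt{T} + \sqrt{KT/N}
  \;\le\; 2\sqrt{(\rho^{-1/2} + K/N)\,T}.
\]
```

Up to a constant factor, the rate therefore equals the sum of the two costs named in the abstract.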
[LG-61] Uniform Spectral Growth and Convergence of Muon in LoRA-Style Matrix Factorization
链接: https://arxiv.org/abs/2602.06385
作者: Changmin Kang,Jihun Yun,Baekrok Shin,Yeseul Cho,Chulhee Yun
类目: Machine Learning (cs.LG)
*备注:
Abstract:Spectral gradient descent (SpecGD) orthogonalizes the matrix parameter updates and has inspired practical optimizers such as Muon. They often perform well in large language model (LLM) training, but their dynamics remain poorly understood. In the low-rank adaptation (LoRA) setting, where weight updates are parameterized as a product of two low-rank factors, we find a distinctive spectral phenomenon under Muon in LoRA fine-tuning of LLMs: singular values of the LoRA product show near-uniform growth across the spectrum, despite orthogonalization being performed on the two factors separately. Motivated by this observation, we analyze spectral gradient flow (SpecGF), a continuous-time analogue of SpecGD, in a simplified LoRA-style matrix factorization setting and prove “equal-rate” dynamics: all singular values grow at equal rates up to small deviations. Consequently, smaller singular values attain their target values earlier than larger ones, sharply contrasting with the largest-first stepwise learning observed in standard gradient flow. Moreover, we prove that SpecGF in our setting converges to global minima from almost all initializations, provided the factor norms remain bounded; with $\ell_2$ regularization, we obtain global convergence. Lastly, we corroborate our theory with experiments in the same setting.
[LG-62] Advances in Battery Energy Storage Management: Control and Economic Synergies
链接: https://arxiv.org/abs/2602.06365
作者: Venkata Rajesh Chundru,Shreshta Rajakumar Deshpande,Stanislav A Gankov
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: Pre Print
Abstract:The existing literature on Battery Energy Storage Systems (BESS) predominantly focuses on two main areas: control system design aimed at achieving grid stability and the techno-economic analysis of BESS dispatch on the power grid. However, with the increasing incorporation of ancillary services into power grids, a more comprehensive approach to energy management systems is required. Such an approach should not only optimize revenue generation from BESS but also ensure the safe, efficient, and reliable operation of lithium-ion batteries. This research seeks to bridge this gap by exploring literature that addresses both the economic and operational dimensions of BESS. Specifically, it examines how economic aspects of grid duty cycles can align with control schemes deployed in BESS systems. This alignment, or synergy, could be instrumental in creating robust digital twins, i.e., virtual representations of BESS systems, that enhance both grid stability and revenue potential. The literature review is organized into five key categories: (1) ancillary services for BESS, exploring support functions that BESS can provide to power grids; (2) control systems developed for real-time BESS power flow management, ensuring smooth operations under dynamic grid conditions; (3) optimization algorithms for BESS dispatch, focusing on efficient energy allocation strategies; (4) techno-economic analyses of BESS and battery systems to assess their financial viability; and (5) digital twin technologies for real-world BESS deployments, enabling advanced predictive maintenance and performance optimization. This review will identify potential synergies, research gaps, and emerging trends, paving the way for future innovations in BESS management and deployment strategies.
[LG-63] Envy-Free Allocation of Indivisible Goods via Noisy Queries
链接: https://arxiv.org/abs/2602.06361
作者: Zihan Li,Yan Hao Ling,Jonathan Scarlett,Warut Suksompong
类目: Computer Science and Game Theory (cs.GT); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We introduce a problem of fairly allocating indivisible goods (items) in which the agents’ valuations cannot be observed directly, but instead can only be accessed via noisy queries. In the two-agent setting with Gaussian noise and bounded valuations, we derive upper and lower bounds on the required number of queries for finding an envy-free allocation in terms of the number of items, $m$, and the negative-envy of the optimal allocation, $\Delta$. In particular, when $\Delta$ is not too small (namely, $\Delta \gg m^{1/4}$), we establish that the optimal number of queries scales as $\frac{\sqrt{m}}{(\Delta/m)^2} = \frac{m^{2.5}}{\Delta^2}$ up to logarithmic factors. Our upper bound is based on non-adaptive queries and a simple thresholding-based allocation algorithm that runs in polynomial time, while our lower bound holds even under adaptive queries and arbitrary computation time.
[LG-64] Evaluating LLM-persona Generated Distributions for Decision-making
链接: https://arxiv.org/abs/2602.06357
作者: Jackie Baek,Yunhan Chen,Ziyu Chi,Will Ma
类目: Machine Learning (cs.LG)
*备注:
Abstract:LLMs can generate a wealth of data, ranging from simulated personas imitating human valuations and preferences, to demand forecasts based on world knowledge. But how well do such LLM-generated distributions support downstream decision-making? For example, when pricing a new product, a firm could prompt an LLM to simulate how much consumers are willing to pay based on a product description, but how useful is the resulting distribution for optimizing the price? We refer to this approach as LLM-SAA, in which an LLM is used to construct an estimated distribution and the decision is then optimized under that distribution. In this paper, we study metrics to evaluate the quality of these LLM-generated distributions, based on the decisions they induce. Taking three canonical decision-making problems (assortment optimization, pricing, and newsvendor) as examples, we find that LLM-generated distributions are practically useful, especially in low-data regimes. We also show that decision-agnostic metrics such as Wasserstein distance can be misleading when evaluating these distributions for decision-making.
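The LLM-SAA idea for the pricing example can be sketched as follows. The willingness-to-pay samples are hard-coded stand-ins for persona outputs (no LLM is called here), and the function names are hypothetical; the sketch only shows the "estimate a distribution, then optimize under it" step.

```python
def expected_revenue(price, wtp_samples):
    """Revenue per customer under the empirical willingness-to-pay distribution."""
    buyers = sum(1 for w in wtp_samples if w >= price)
    return price * buyers / len(wtp_samples)


def saa_price(wtp_samples):
    """Pick the revenue-maximizing price under the sampled distribution.

    Between two consecutive sample values demand is constant and revenue grows
    with price, so it suffices to search over the sample values themselves.
    """
    candidates = sorted(set(wtp_samples))
    return max(candidates, key=lambda p: expected_revenue(p, wtp_samples))


# Hypothetical persona responses (in currency units) for one product description.
samples = [10, 20, 20, 40]
best = saa_price(samples)  # price 20 sells to 3 of 4 personas
```

Evaluating the chosen price against real demand, rather than comparing distributions directly, is what the abstract means by judging a distribution through the decisions it induces.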
[LG-65] Enhance and Reuse: A Dual-Mechanism Approach to Boost Deep Forest for Label Distribution Learning
链接: https://arxiv.org/abs/2602.06353
作者: Jia-Le Xu,Shen-Huan Lyu,Yu-Nian Wang,Ning Chen,Zhihao Qu,Bin Tang,Baoliu Ye
类目: Machine Learning (cs.LG)
*备注:
Abstract:Label distribution learning (LDL) requires the learner to predict the degree of correlation between each sample and each label. To achieve this, a crucial task during learning is to leverage the correlation among labels. Deep Forest (DF) is a deep learning framework based on tree ensembles, whose training phase does not rely on backpropagation. DF performs in-model feature transform using the prediction of each layer and achieves competitive performance on many tasks. However, its exploration in the field of LDL is still in its infancy. The few existing methods that apply DF to the field of LDL do not have effective ways to utilize the correlation among labels. Therefore, we propose a method named Enhanced and Reused Feature Deep Forest (ERDF). It mainly contains two mechanisms: feature enhancement exploiting label correlation and measure-aware feature reuse. The first one is to utilize the correlation among labels to enhance the original features, enabling the samples to acquire more comprehensive information for the task of LDL. The second one performs a reuse operation on the features of samples that perform worse than the previous layer on the validation set, in order to ensure the stability of the training process. This kind of Enhance-Reuse pattern not only enables samples to enrich their features but also validates the effectiveness of their new features and conducts a reuse process to prevent the noise from spreading further. Experiments show that our method outperforms other comparison algorithms on six evaluation metrics.
[LG-66] Adversarial Learning in Games with Bandit Feedback: Logarithmic Pure-Strategy Maximin Regret
链接: https://arxiv.org/abs/2602.06348
作者: Shinji Ito,Haipeng Luo,Arnab Maiti,Taira Tsuchiya,Yue Wu
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:
Abstract:Learning to play zero-sum games is a fundamental problem in game theory and machine learning. While significant progress has been made in minimizing external regret in the self-play settings or with full-information feedback, real-world applications often force learners to play against unknown, arbitrary opponents and restrict learners to bandit feedback where only the payoff of the realized action is observable. In such challenging settings, it is well-known that $\Omega(\sqrt{T})$ external regret is unavoidable (where $T$ is the number of rounds). To overcome this barrier, we investigate adversarial learning in zero-sum games under bandit feedback, aiming to minimize the deficit against the maximin pure strategy – a metric we term Pure-Strategy Maximin Regret. We analyze this problem under two bandit feedback models: uninformed (only the realized reward is revealed) and informed (both the reward and the opponent’s action are revealed). For uninformed bandit learning of normal-form games, we show that the Tsallis-INF algorithm achieves $O(c \log T)$ instance-dependent regret with a game-dependent parameter $c$. Crucially, we prove an information-theoretic lower bound showing that the dependence on $c$ is necessary. To overcome this hardness, we turn to the informed setting and introduce Maximin-UCB, which obtains another regret bound of the form $O(c' \log T)$ for a different game-dependent parameter $c'$ that could potentially be much smaller than $c$. Finally, we generalize both results to bilinear games over an arbitrary, large action set, proposing Tsallis-FTRL-SPM and Maximin-LinUCB for the uninformed and informed setting respectively and establishing similar game-dependent logarithmic regret bounds.
[LG-67] AdFL: In-Browser Federated Learning for Online Advertisement
链接: https://arxiv.org/abs/2602.06336
作者: Ahmad Alemari,Pritam Sen,Cristian Borcea
类目: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:Since most countries are coming up with online privacy regulations, such as GDPR in the EU, online publishers need to find a balance between revenue from targeted advertisement and user privacy. One way to be able to still show targeted ads, based on user personal and behavioral information, is to employ Federated Learning (FL), which performs distributed learning across users without sharing user raw data with other stakeholders in the publishing ecosystem. This paper presents AdFL, an FL framework that works in the browsers to learn user ad preferences. These preferences are aggregated in a global FL model, which is then used in the browsers to show more relevant ads to users. AdFL can work with any model that uses features available in the browser such as ad viewability, ad click-through, user dwell time on pages, and page content. The AdFL server runs at the publisher and coordinates the learning process for the users who browse pages on the publisher’s website. The AdFL prototype does not require the client to install any software, as it is built utilizing standard APIs available on most modern browsers. We built a proof-of-concept model for ad viewability prediction that runs on top of AdFL. We tested AdFL and the model with two non-overlapping datasets from a website with 40K visitors per day. The experiments demonstrate AdFL’s feasibility to capture the training information in the browser in a few milliseconds, show that the ad viewability prediction achieves up to 92.59% AUC, and indicate that utilizing differential privacy (DP) to safeguard local model parameters yields adequate performance, with only modest declines in comparison to the non-DP variant.
[LG-68] Don't Break the Boundary: Continual Unlearning for OOD Detection Based on Free Energy Repulsion
链接: https://arxiv.org/abs/2602.06331
作者: Ningkang Peng,Kun Shao,Jingyang Mao,Linjing Qian,Xiaoqian Peng,Xichen Yang,Yanhui Gu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Deploying trustworthy AI in open-world environments faces a dual challenge: the necessity for robust Out-of-Distribution (OOD) detection to ensure system safety, and the demand for flexible machine unlearning to satisfy privacy compliance and model rectification. However, this objective encounters a fundamental geometric contradiction: current OOD detectors rely on a static and compact data manifold, whereas traditional classification-oriented unlearning methods disrupt this delicate structure, leading to a catastrophic loss of the model’s capability to discriminate anomalies while erasing target classes. To resolve this dilemma, we first define the problem of boundary-preserving class unlearning and propose a pivotal conceptual shift: in the context of OOD detection, effective unlearning is mathematically equivalent to transforming the target class into OOD samples. Based on this, we propose the TFER (Total Free Energy Repulsion) framework. Inspired by the free energy principle, TFER constructs a novel Push-Pull game mechanism: it anchors retained classes within a low-energy ID manifold through a pull mechanism, while actively expelling forgotten classes to high-energy OOD regions using a free energy repulsion force. This approach is implemented via parameter-efficient fine-tuning, circumventing the prohibitive cost of full retraining. Extensive experiments demonstrate that TFER achieves precise unlearning while maximally preserving the model’s discriminative performance on remaining classes and external OOD data. More importantly, our study reveals that the unique Push-Pull equilibrium of TFER endows the model with inherent structural stability, allowing it to effectively resist catastrophic forgetting without complex additional constraints, thereby demonstrating exceptional potential in continual unlearning tasks.
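The abstract does not define its free energy, so the sketch below assumes the commonly used energy-based OOD score, F(x) = -T log Σ_k exp(f_k(x)/T), computed from a classifier's logits. Under that assumption, "low-energy ID manifold" and "high-energy OOD regions" correspond to low and high values of this score; TFER's actual loss is not reproduced here.

```python
import math


def free_energy(logits, temperature=1.0):
    """Free energy -T * logsumexp(logits / T), computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in scaled))
    return -temperature * log_sum_exp


# A confident prediction (one dominant logit) has lower energy than a flat one,
# which is why a threshold on this score can separate ID from OOD inputs.
confident = free_energy([10.0, 0.0, 0.0])
flat = free_energy([0.0, 0.0, 0.0])
```

In this picture, a pull term would lower the energy of retained-class inputs and a repulsion term would raise it for forgotten-class inputs.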
[LG-69] Online Adaptive Reinforcement Learning with Echo State Networks for Non-Stationary Dynamics IJCNN2026
链接: https://arxiv.org/abs/2602.06326
作者: Aoi Yoshimura,Gouhei Tanaka
类目: Machine Learning (cs.LG)
*备注: Submitted to IJCNN 2026
Abstract:Reinforcement learning (RL) policies trained in simulation often suffer from severe performance degradation when deployed in real-world environments due to non-stationary dynamics. While Domain Randomization (DR) and meta-RL have been proposed to address this issue, they typically rely on extensive pretraining, privileged information, or high computational cost, limiting their applicability to real-time and edge systems. In this paper, we propose a lightweight online adaptation framework for RL based on Reservoir Computing. Specifically, we integrate an Echo State Network (ESN) as an adaptation module that encodes recent observation histories into a latent context representation, and update its readout weights online using Recursive Least Squares (RLS). This design enables rapid adaptation without backpropagation, pretraining, or access to privileged information. We evaluate the proposed method on CartPole and HalfCheetah tasks with severe and abrupt environment changes, including periodic external disturbances and extreme friction variations. Experimental results demonstrate that the proposed approach significantly outperforms DR and representative adaptive baselines under out-of-distribution dynamics, achieving stable adaptation within a few control steps. Notably, the method successfully handles intra-episode environment changes without resetting the policy. Due to its computational efficiency and stability, the proposed framework provides a practical solution for online adaptation in non-stationary environments and is well suited for real-world robotic control and edge deployment.
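The online readout update can be sketched with a textbook Recursive Least Squares step. This is a generic RLS with a forgetting factor, not the authors' code: the feature vector `x` stands in for an ESN reservoir state, and the class name, `forgetting`, and `delta` are hypothetical choices for illustration.

```python
import math


class RLSReadout:
    """Linear readout y = w . x trained online by Recursive Least Squares."""

    def __init__(self, n_features, forgetting=0.99, delta=1e3):
        self.n = n_features
        self.lam = forgetting
        self.w = [0.0] * n_features
        # Inverse correlation estimate, initialized to delta * identity.
        self.P = [[delta if i == j else 0.0 for j in range(n_features)]
                  for i in range(n_features)]

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, target):
        """One RLS step; returns the prediction error before the update."""
        n = self.n
        Px = [sum(self.P[i][j] * x[j] for j in range(n)) for i in range(n)]
        denom = self.lam + sum(x[i] * Px[i] for i in range(n))
        gain = [v / denom for v in Px]
        err = target - self.predict(x)
        self.w = [wi + g * err for wi, g in zip(self.w, gain)]
        # P stays symmetric, so x^T P equals Px and the rank-one update is gain * Px^T.
        self.P = [[(self.P[i][j] - gain[i] * Px[j]) / self.lam for j in range(n)]
                  for i in range(n)]
        return err


# Toy usage: recover y = 2*x0 - 3*x1 from a deterministic feature stream.
readout = RLSReadout(n_features=2)
for t in range(50):
    features = [math.sin(t), math.cos(2 * t)]
    readout.update(features, 2.0 * features[0] - 3.0 * features[1])
```

Each update costs a few matrix-vector products and needs no backpropagation, which is the property the abstract relies on for real-time adaptation.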
[LG-70] How (Not) to Hybridize Neural and Mechanistic Models for Epidemiological Forecasting
链接: https://arxiv.org/abs/2602.06323
作者: Yiqi Su,Ray Lee,Jiaming Cui,Naren Ramakrishnan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Epidemiological forecasting from surveillance data is a hard problem and hybridizing mechanistic compartmental models with neural models is a natural direction. The mechanistic structure helps keep trajectories epidemiologically plausible, while neural components can capture non-stationary, data-adaptive effects. In practice, however, many seemingly straightforward couplings fail under partial observability and continually shifting transmission dynamics driven by behavior, waning immunity, seasonality, and interventions. We catalog these failure modes and show that robust performance requires making non-stationarity explicit: we extract multi-scale structure from the observed infection series and use it as an interpretable control signal for a controlled neural ODE coupled to an epidemiological model. Concretely, we decompose infections into trend, seasonal, and residual components and use these signals to drive continuous-time latent dynamics while jointly forecasting and inferring time-varying transmission, recovery, and immunity-loss rates. Across seasonal and non-seasonal settings, including early outbreaks and multi-wave regimes, our approach reduces long-horizon RMSE by 15-35%, improves peak timing error by 1-3 weeks, and lowers peak magnitude bias by up to 30% relative to strong time-series, neural ODE, and hybrid baselines, without relying on auxiliary covariates.
[LG-71] SOCKET: SOft Collision Kernel EsTimator for Sparse Attention
链接: https://arxiv.org/abs/2602.06283
作者: Sahil Joshi,Agniva Chowdhury,Wyatt Bellinger,Amar Kanakamedala,Ekam Singh,Hoang Anh Duy Le,Aditya Desai,Anshumali Shrivastava
类目: Machine Learning (cs.LG)
*备注: 11 figures, 5 tables
Abstract:Exploiting sparsity during long-context inference is central to scaling large language models, as attention dominates the cost of autoregressive decoding. Sparse attention reduces this cost by restricting computation to a subset of tokens, but its effectiveness depends critically on efficient scoring and selection of relevant tokens at inference time. We revisit Locality-Sensitive Hashing (LSH) as a sparsification primitive and introduce SOCKET, a SOft Collision Kernel EsTimator that replaces hard bucket matches with probabilistic, similarity-aware aggregation. Our key insight is that hard LSH produces discrete collision signals and is therefore poorly suited for ranking. In contrast, soft LSH aggregates graded collision evidence across hash tables, preserving the stability of relative ordering among the true top-$k$ tokens. This transformation elevates LSH from a candidate-generation heuristic to a principled and mathematically grounded scoring kernel for sparse attention. Leveraging this property, SOCKET enables efficient token selection without an ad-hoc voting mechanism, and matches or surpasses established sparse attention baselines across multiple long-context benchmarks using a diverse set of models. With a custom CUDA kernel for scoring keys and a Flash Decode Triton backend for sparse attention, SOCKET achieves up to $1.5\times$ higher throughput than FlashAttention, making it an effective tool for long-context inference. Code is open-sourced at this https URL.
[LG-72] Statistical Learning from Attribution Sets
链接: https://arxiv.org/abs/2602.06276
作者: Lorne Applebaum,Robert Busa-Fekete,August Y. Chen,Claudio Gentile,Tomer Koren,Aryan Mokhtari
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We address the problem of training conversion prediction models in advertising domains under privacy constraints, where direct links between ad clicks and conversions are unavailable. Motivated by privacy-preserving browser APIs and the deprecation of third-party cookies, we study a setting where the learner observes a sequence of clicks and a sequence of conversions, but can only link a conversion to a set of candidate clicks (an attribution set) rather than a unique source. We formalize this as learning from attribution sets generated by an oblivious adversary equipped with a prior distribution over the candidates. Despite the lack of explicit labels, we construct an unbiased estimator of the population loss from these coarse signals via a novel approach. Leveraging this estimator, we show that Empirical Risk Minimization achieves generalization guarantees that scale with the informativeness of the prior and is also robust against estimation errors in the prior, despite complex dependencies among attribution sets. Simple empirical evaluations on standard datasets suggest our unbiased approach significantly outperforms common industry heuristics, particularly in regimes where attribution sets are large or overlapping.
[LG-73] PurSAMERE: Reliable Adversarial Purification via Sharpness-Aware Minimization of Expected Reconstruction Error
Link: https://arxiv.org/abs/2602.06269
Authors: Vinh Hoang, Sebastian Krumscheid, Holger Rauhut, Raúl Tempone
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR)
Comments:
Abstract:We propose a novel deterministic purification method to improve adversarial robustness by mapping a potentially adversarial sample toward a nearby sample that lies close to a mode of the data distribution, where classifiers are more reliable. We design the method to be deterministic to ensure reliable test accuracy and to prevent the degradation of effective robustness observed in stochastic purification approaches when the adversary has full knowledge of the system and its randomness. We employ a score model trained by minimizing the expected reconstruction error of noise-corrupted data, thereby learning the structural characteristics of the input data distribution. Given a potentially adversarial input, the method searches within its local neighborhood for a purified sample that minimizes the expected reconstruction error under noise corruption and then feeds this purified sample to the classifier. During purification, sharpness-aware minimization is used to guide the purified samples toward flat regions of the expected reconstruction error landscape, thereby enhancing robustness. We further show that, as the noise level decreases, minimizing the expected reconstruction error biases the purified sample toward local maximizers of the Gaussian-smoothed density; under additional local assumptions on the score model, we prove recovery of a local maximizer in the small-noise limit. Experimental results demonstrate significant gains in adversarial robustness over state-of-the-art methods under strong deterministic white-box attacks.
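The purification loop described above can be sketched as a sharpness-aware descent constrained to a neighborhood of the input. This is a minimal sketch under stated assumptions: the quadratic `toy_grad` stands in for the gradient of the expected reconstruction error (the paper uses a trained score model), the neighborhood is an L-infinity ball, and all names and default values are hypothetical.

```python
import math


def sam_purify(x_adv, grad_fn, radius=0.5, rho=0.05, lr=0.1, steps=50):
    """Deterministic purification sketch using sharpness-aware minimization.

    Each step (1) moves to an approximate worst-case point inside a rho-ball,
    (2) takes the gradient there, (3) descends from the current point, and
    (4) projects back into an L-infinity ball of `radius` around the input.
    """
    x = list(x_adv)
    for _ in range(steps):
        g = grad_fn(x)
        norm = math.sqrt(sum(gi * gi for gi in g))
        if norm == 0.0:
            break  # already at a stationary point
        x_pert = [xi + rho * gi / norm for xi, gi in zip(x, g)]
        g_sam = grad_fn(x_pert)
        x = [xi - lr * gi for xi, gi in zip(x, g_sam)]
        x = [min(max(xi, ai - radius), ai + radius)
             for xi, ai in zip(x, x_adv)]
    return x


def toy_grad(x, mode=(1.0, -1.0)):
    """Gradient of the toy energy ||x - mode||^2 (a stand-in, not a score model)."""
    return [2.0 * (xi - mi) for xi, mi in zip(x, mode)]
```

Calling `sam_purify([1.3, -0.8], toy_grad)` pulls the input toward the toy mode at `(1.0, -1.0)` while never leaving the allowed neighborhood. No randomness is involved, which matches the deterministic design goal stated in the abstract.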
[LG-74] Swap Regret Minimization Through Response-Based Approachability
Link: https://arxiv.org/abs/2602.06264
Authors: Ioannis Anagnostides, Gabriele Farina, Maxwell Fishelson, Haipeng Luo, Jon Schneider
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:We consider the problem of minimizing different notions of swap regret in online optimization. These forms of regret are tightly connected to correlated equilibrium concepts in games, and have been more recently shown to guarantee non-manipulability against strategic adversaries. The only computationally efficient algorithm for minimizing linear swap regret over a general convex set in $\mathbb{R}^d$ was developed recently by Daskalakis, Farina, Fishelson, Pipis, and Schneider (STOC '25). However, it incurs a highly suboptimal regret bound of $\Omega(d^4 \sqrt{T})$ and also relies on computationally intensive calls to the ellipsoid algorithm at each iteration. In this paper, we develop a significantly simpler, computationally efficient algorithm that guarantees $O(d^{3/2} \sqrt{T})$ linear swap regret for a general convex set and $O(d \sqrt{T})$ when the set is centrally symmetric. Our approach leverages the powerful response-based approachability framework of Bernstein and Shimkin (JMLR '15), previously overlooked in the line of work on swap regret minimization, combined with geometric preconditioning via the John ellipsoid. Our algorithm simultaneously minimizes profile swap regret, which was recently shown to guarantee non-manipulability. Moreover, we establish a matching information-theoretic lower bound: any learner must incur in expectation $\Omega(d \sqrt{T})$ linear swap regret for large enough $T$, even when the set is centrally symmetric. This also shows that the classic algorithm of Gordon, Greenwald, and Marks (ICML '08) is existentially optimal for minimizing linear swap regret, although it is computationally inefficient. Finally, we extend our approach to minimize regret with respect to the set of swap deviations with polynomial dimension, unifying and strengthening recent results in equilibrium computation and online learning.
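For intuition about the quantity being minimized, the sketch below computes swap regret in the classical finite-action setting and compares it with external regret. The paper works with linear swap regret over convex sets, so this finite special case is an illustrative assumption and the function names are hypothetical.

```python
def swap_regret(actions, losses):
    """Swap regret of a play history over a finite action set.

    actions[t] is the action index played at round t and losses[t][a] is the
    loss of action a at round t. For each played action a, compare the loss
    actually incurred on the rounds where a was played with the best single
    replacement action b for those rounds.
    """
    num_actions = len(losses[0])
    # cum[a][b]: total loss b would have incurred on rounds where a was played.
    cum = [[0.0] * num_actions for _ in range(num_actions)]
    for a, loss in zip(actions, losses):
        for b in range(num_actions):
            cum[a][b] += loss[b]
    return sum(cum[a][a] - min(cum[a]) for a in range(num_actions))


def external_regret(actions, losses):
    """Regret against the best single fixed action in hindsight."""
    incurred = sum(loss[a] for a, loss in zip(actions, losses))
    best_fixed = min(sum(loss[b] for loss in losses)
                     for b in range(len(losses[0])))
    return incurred - best_fixed
```

On the history `actions = [0, 1, 0, 1]` with `losses = [[1, 0], [0, 1], [1, 0], [0, 1]]`, the learner always picks the losing action. External regret is 2, but swap regret is 4, because the deviation "swap 0 and 1" would have avoided every loss.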
[LG-75] On Randomized Algorithms in Online Strategic Classification
Link: https://arxiv.org/abs/2602.06257
Authors: Chase Hutton, Adam Melrod, Han Shao
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Comments:
Abstract:Online strategic classification studies settings in which agents strategically modify their features to obtain favorable predictions. For example, given a classifier that determines loan approval based on credit scores, applicants may open or close credit cards and bank accounts to obtain a positive prediction. The learning goal is to achieve low mistake or regret bounds despite such strategic behavior. While randomized algorithms have the potential to offer advantages to the learner in strategic settings, they have been largely underexplored. In the realizable setting, no lower bound is known for randomized algorithms, and existing lower bound constructions for deterministic learners can be circumvented by randomization. In the agnostic setting, the best known regret upper bound is $O(T^{3/4}\log^{1/4}T|\mathcal{H}|)$, which is far from the standard online learning rate of $O(\sqrt{T\log|\mathcal{H}|})$. In this work, we provide refined bounds for online strategic classification in both settings. In the realizable setting, we extend, for $T > \mathrm{Ldim}(\mathcal{H}) \Delta^2$, the existing lower bound $\Omega(\mathrm{Ldim}(\mathcal{H}) \Delta)$ for deterministic learners to all learners. This yields the first lower bound that applies to randomized learners. We also provide the first randomized learner that improves the known (deterministic) upper bound of $O(\mathrm{Ldim}(\mathcal{H}) \cdot \Delta \log \Delta)$. In the agnostic setting, we give a proper learner using convex optimization techniques to improve the regret upper bound to $O(\sqrt{T \log |\mathcal{H}|} + |\mathcal{H}| \log(T|\mathcal{H}|))$. We show a matching lower bound up to logarithmic factors for all proper learning rules, demonstrating the optimality of our learner among proper learners. As such, improper learning is necessary to further improve regret guarantees.
[LG-76] Adaptive Sparse Möbius Transforms for Learning Polynomials
Link: https://arxiv.org/abs/2602.06246
Authors: Yigit Efe Erginbas, Justin Singh Kang, Elizabeth Polito, Kannan Ramchandran
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:We consider the problem of exactly learning an $s$-sparse real-valued Boolean polynomial of degree $d$ of the form $f: \{0,1\}^n \rightarrow \mathbb{R}$. This problem corresponds to decomposing functions in the AND basis and is known as taking a Möbius transform. While the analogous problem for the parity basis (Fourier transform) $f: \{-1,1\}^n \rightarrow \mathbb{R}$ is well-understood, the AND basis presents a unique challenge: the basis vectors are coherent, precluding standard compressed sensing methods. We overcome this challenge by identifying that we can exploit adaptive group testing to provide a constructive, query-efficient implementation of the Möbius transform (also known as Möbius inversion) for sparse functions. We present two algorithms based on this insight. The Fully-Adaptive Sparse Möbius Transform (FASMT) uses $O(sd \log(n/d))$ adaptive queries in $O((sd + n) sd \log(n/d))$ time, which we show is near-optimal in query complexity. Furthermore, we also present the Partially-Adaptive Sparse Möbius Transform (PASMT), which uses $O(sd^2\log(n/d))$ queries, trading a factor of $d$ to reduce the number of adaptive rounds to $O(d^2\log(n/d))$, with no dependence on $s$. When applied to hypergraph reconstruction from edge-count queries, our results improve upon baselines by avoiding the combinatorial explosion in the rank $d$. We demonstrate the practical utility of our method for hypergraph reconstruction by applying it to learning real hypergraphs in simulations.
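As background for the AND-basis decomposition, the following sketch computes Möbius coefficients by brute-force inclusion-exclusion, $c_S = \sum_{T \subseteq S} (-1)^{|S|-|T|} f(\mathbf{1}_T)$. It is the textbook dense inversion, not the adaptive FASMT or PASMT algorithms, and its query count grows combinatorially with the degree. The function names and the example polynomial are hypothetical.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of `items` as a tuple."""
    items = tuple(items)
    for r in range(len(items) + 1):
        yield from combinations(items, r)


def mobius_coefficient(f, subset):
    """Coefficient of the monomial prod_{i in subset} x_i in the AND basis.

    `f` takes the set of indices whose bits are 1 and returns a number.
    """
    return sum((-1) ** (len(subset) - len(t)) * f(frozenset(t))
               for t in subsets(subset))


def mobius_transform(f, n, max_degree):
    """Return the nonzero coefficients of degree <= max_degree as a dict."""
    coeffs = {}
    for r in range(max_degree + 1):
        for s in combinations(range(n), r):
            c = mobius_coefficient(f, s)
            if c != 0:
                coeffs[s] = c
    return coeffs


def example_f(on):
    """Example polynomial: 5 + 3*x0*x2 - 2*x1."""
    return 5 + 3 * (0 in on and 2 in on) - 2 * (1 in on)
```

Running `mobius_transform(example_f, 3, 2)` recovers the three monomials of the example polynomial. The sparse algorithms in the abstract aim to find the same coefficients with far fewer queries when only $s$ of them are nonzero.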
[LG-77] A Fast and Generalizable Fourier Neural Operator-Based Surrogate for Melt-Pool Prediction in Laser Processing
Link: https://arxiv.org/abs/2602.06241
Authors: Alix Benoit (1), Toni Ivas (1), Mateusz Papierz (2), Asel Sagingalieva (2), Alexey Melnikov (2), Elia Iseli (1) ((1) EMPA, (2) Terra Quantum AG)
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments: 29 pages, 12 figures, 6 tables
Abstract:High-fidelity simulations of laser welding capture complex thermo-fluid phenomena, including phase change, free-surface deformation, and keyhole dynamics; however, their computational cost limits large-scale process exploration and real-time use. In this work we present the Laser Processing Fourier Neural Operator (LP-FNO), a Fourier Neural Operator (FNO) based surrogate model that learns the parametric solution operator of various laser processes from multiphysics simulations generated with FLOW-3D WELD (registered trademark). Through a novel reformulation of the transient problem in the moving laser frame, combined with temporal averaging, the system reduces to a quasi-steady-state setting suitable for operator learning, even in the keyhole welding regime. The proposed LP-FNO maps process parameters to three-dimensional temperature fields and melt-pool boundaries across a broad process window spanning conduction and keyhole regimes using the non-dimensional normalized enthalpy formulation. The model achieves temperature prediction errors on the order of 1% and intersection-over-union scores for melt-pool segmentation over 0.9. We demonstrate that an LP-FNO model trained on coarse-resolution data can be evaluated on finer grids, yielding accurate super-resolved predictions in mesh-converged conduction regimes, whereas discrepancies in keyhole regimes reflect unresolved dynamics in the coarse-mesh training data. These results indicate that the LP-FNO provides an efficient surrogate modeling framework for laser welding, enabling prediction of full three-dimensional fields and phase interfaces over wide parameter ranges in just tens of milliseconds, up to a hundred thousand times faster than traditional Finite Volume multi-physics software.
[LG-78] Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution
Link: https://arxiv.org/abs/2602.06239
Authors: Adam Barla, Emanuele Nevali, Luca Viano, Volkan Cevher
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:We introduce PEPO (Pessimistic Ensemble based Preference Optimization), a single-step Direct Preference Optimization (DPO)-like algorithm that mitigates the well-known over-optimization issue in preference learning without requiring knowledge of the data-generating distribution or learning an explicit reward model. PEPO achieves pessimism via an ensemble of preference-optimized policies trained on disjoint data subsets, and then aggregates them through a worst-case construction that favors agreement across models. In the tabular setting, PEPO achieves sample complexity guarantees depending only on a single-policy concentrability coefficient, thus avoiding the all-policy concentrability that affects the guarantees of algorithms prone to over-optimization, such as DPO. The theoretical findings are corroborated by strong practical performance, while the method retains the simplicity and practicality of DPO-style training.
[LG-79] f-FUM: Federated Unlearning via min–max and f-divergence
Link: https://arxiv.org/abs/2602.06187
Authors: Radmehr Karimian, Amirhossein Bagheri, Meghdad Kurmanji, Nicholas D. Lane, Gholamali Aminian
Subjects: Machine Learning (cs.LG)
Comments: 16 pages, 1 figure
Abstract:Federated Learning (FL) has emerged as a powerful paradigm for collaborative machine learning across decentralized data sources, preserving privacy by keeping data local. However, increasing legal and ethical demands, such as the “right to be forgotten”, and the need to mitigate data poisoning attacks have underscored the urgent necessity for principled data unlearning in FL. Unlike centralized settings, the distributed nature of FL complicates the removal of individual data contributions. In this paper, we propose a novel federated unlearning framework formulated as a min-max optimization problem, where the objective is to maximize an $f$-divergence between the model trained with all data and the model retrained without specific data points, while minimizing the degradation on retained data. Our framework could act like a plugin and be added to almost any federated setup, unlike SOTA methods such as \cite{10269017}, which requires model degradation on the server, or \cite{khalil2025notfederatedunlearningweight}, which requires involving the model architecture and model weights. This formulation allows for efficient approximation of data removal effects in a federated setting. We provide empirical evaluations to show that our method achieves significant speedups over naive retraining, with minimal impact on utility.
[LG-80] To 2:4 Sparsity and Beyond: Neuron-level Activation Function to Accelerate LLM Pre-Training
Link: https://arxiv.org/abs/2602.06183
Authors: Meghana Madhyastha, Daniel Haziza, Jesse Cai, Newsha Ardalani, Zhiqi Bu, Carole-Jean Wu
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Training of Large Language Models is generally bottlenecked by matrix multiplications. In the Transformer architecture, a large portion of these operations happens in the Feed Forward Network (FFN), and this portion increases for larger models, up to 50% of the total pretraining floating point operations. We show that we can leverage hardware-accelerated sparsity to accelerate all matrix multiplications in the FFN, with 2:4 sparsity for weights and v:n:m (Venom) sparsity for activations. Our recipe relies on sparse training steps to accelerate a large part of the pretraining, combined with regular dense training steps towards the end. Overall, models trained with this approach exhibit the same performance on our quality benchmarks, and training speeds up end-to-end by 1.4 to 1.7x. This approach is applicable to all NVIDIA GPUs starting with the A100 generation, is orthogonal to common optimization techniques such as quantization, and can also be applied to mixture-of-experts model architectures.
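The 2:4 pattern mentioned above means that every group of four consecutive values keeps at most two nonzeros. A minimal sketch of magnitude-based 2:4 pruning on a single row follows; the magnitude criterion is a common choice assumed here for illustration, and the abstract does not say how the paper selects which weights to keep. The v:n:m activation format is not covered.

```python
def prune_2_4(row):
    """Zero out the two smallest-magnitude entries in each group of four.

    `row` is a flat list of floats whose length must be a multiple of 4.
    Ties are broken in favor of the earlier position in the group.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out
```

For example, `prune_2_4([0.1, -3.0, 2.0, 0.5])` keeps `-3.0` and `2.0` and zeroes the other two entries. The fixed 50% structure is what lets sparse tensor cores skip the zeroed multiplications.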
[LG-81] Know Your Scientist: KYC as Biosecurity Infrastructure
Link: https://arxiv.org/abs/2602.06172
Authors: Jonathan Feldman, Tal Feldman, Annie I Anton
Subjects: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:
Abstract:Biological AI tools for protein design and structure prediction are advancing rapidly, creating dual-use risks that existing safeguards cannot adequately address. Current model-level restrictions, including keyword filtering, output screening, and content-based access denials, are fundamentally ill-suited to biology, where reliable function prediction remains beyond reach and novel threats evade detection by design. We propose a three-tier Know Your Customer (KYC) framework, inspired by anti-money laundering (AML) practices in the financial sector, that shifts governance from content inspection to user verification and monitoring. Tier I leverages research institutions as trust anchors to vouch for affiliated researchers and assume responsibility for vetting. Tier II applies output screening through sequence homology searches and functional annotation. Tier III monitors behavioral patterns to detect anomalies inconsistent with declared research purposes. This layered approach preserves access for legitimate researchers while raising the cost of misuse through institutional accountability and traceability. The framework can be implemented immediately using existing institutional infrastructure, requiring no new legislation or regulatory mandates.
[LG-82] SCONE: A Practical Constraint-Aware Plug-in for Latent Encoding in Learned DNA Storage
Link: https://arxiv.org/abs/2602.06157
Authors: Cihan Ruan, Lebin Zhou, Rongduo Han, Linyi Han, Bingqing Zhao, Chenchen Zhu, Wei Jiang, Wei Wang, Nam Ling
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:DNA storage has matured from concept to practical stage, yet its integration with neural compression pipelines remains inefficient. Early DNA encoders applied redundancy-heavy constraint layers atop raw binary data, an approach that was workable but primitive. Recent neural codecs compress data into learned latent representations with rich statistical structure, yet still convert these latents to DNA via naive binary-to-quaternary transcoding, discarding the entropy model’s optimization. This mismatch undermines compression efficiency and complicates the encoding stack. We present SCONE, a plug-in module that collapses latent compression and DNA encoding into a single step. SCONE performs quaternary arithmetic coding directly on the latent space in DNA bases. Its Constraint-Aware Adaptive Coding module dynamically steers the entropy encoder’s learned probability distribution to enforce biochemical constraints, namely Guanine-Cytosine (GC) balance and homopolymer suppression, deterministically during encoding, eliminating post-hoc correction. The design preserves full reversibility and exploits the hyperprior model’s learned priors without modification. Experiments show SCONE achieves near-perfect constraint satisfaction with negligible computational overhead (2% latency), establishing a latent-agnostic interface for end-to-end DNA-compatible learned codecs.
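One way to picture "steering" a probability distribution toward biochemical constraints is to mask forbidden bases and renormalize before each arithmetic-coding step. The sketch below does that for homopolymer runs and windowed GC content. It is an assumption-laden illustration, not SCONE's module: the masking rule, the thresholds, and the function name are all hypothetical.

```python
BASES = "ACGT"


def constrained_distribution(probs, history, max_run=3, window=8,
                             gc_low=0.25, gc_high=0.75):
    """Mask bases that would violate constraints, then renormalize.

    probs: dict mapping each base in "ACGT" to its model probability.
    history: string of bases already emitted.
    """
    allowed = set(BASES)
    # Homopolymer suppression: forbid extending a run of length max_run.
    if len(history) >= max_run and len(set(history[-max_run:])) == 1:
        allowed.discard(history[-1])
    # GC balance over the trailing window of emitted bases.
    if len(history) >= window:
        recent = history[-window:]
        gc = sum(b in "GC" for b in recent) / window
        if gc >= gc_high:
            allowed -= {"G", "C"}
        elif gc <= gc_low:
            allowed -= {"A", "T"}
    masked = {b: (probs[b] if b in allowed else 0.0) for b in BASES}
    total = sum(masked.values())
    if total == 0.0:
        # Model put no mass on any allowed base: fall back to uniform.
        return {b: (1.0 / len(allowed) if b in allowed else 0.0)
                for b in BASES}
    return {b: p / total for b, p in masked.items()}
```

Because the adjusted distribution depends only on bases that have already been coded, a decoder can recompute it exactly, which is the property a reversible scheme needs. At least one base always stays allowed, since the run rule removes one base and the GC rule removes two.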
[LG-83] Latent Structure Emergence in Diffusion Models via Confidence-Based Filtering
Link: https://arxiv.org/abs/2602.06155
Authors: Wei Wei, Yizhou Zeng, Kuntian Chen, Sophie Langer, Mariia Seleznova, Hung-Hsu Chou
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Diffusion models rely on a high-dimensional latent space of initial noise seeds, yet it remains unclear whether this space contains sufficient structure to predict properties of the generated samples, such as their classes. In this work, we investigate the emergence of latent structure through the lens of confidence scores assigned by a pre-trained classifier to generated samples. We show that while the latent space appears largely unstructured when considering all noise realizations, restricting attention to initial noise seeds that produce high-confidence samples reveals pronounced class separability. By comparing class predictability across noise subsets of varying confidence and examining the class separability of the latent space, we find evidence of class-relevant latent structure that becomes observable only under confidence-based filtering. As a practical implication, we discuss how confidence-based filtering enables conditional generation as an alternative to guidance-based methods.
[LG-84] Optimistic Training and Convergence of Q-Learning – Extended Version
Link: https://arxiv.org/abs/2602.06146
Authors: Prashant Mehta, Sean Meyn
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:
Abstract:In recent work it is shown that Q-learning with linear function approximation is stable, in the sense of bounded parameter estimates, under the $(\varepsilon,\kappa)$-tamed Gibbs policy; $\kappa$ is the inverse temperature, and $\varepsilon > 0$ is introduced for additional exploration. Under these assumptions it also follows that there is a solution to the projected Bellman equation (PBE). Left open is uniqueness of the solution, and criteria for convergence outside of the standard tabular or linear MDP settings. The present work extends these results to other variants of Q-learning, and clarifies prior work: a one-dimensional example shows that under an oblivious policy for training there may be no solution to the PBE, or multiple solutions, and in each case the algorithm is not stable under oblivious training. The main contribution is that far more structure is required for convergence. An example is presented for which the basis is ideal, in the sense that the true Q-function is in the span of the basis. However, there are two solutions to the PBE under the greedy policy, and hence also for the $(\varepsilon,\kappa)$-tamed Gibbs policy for all sufficiently small $\varepsilon > 0$ and $\kappa \ge 1$.
MSC classes: 68T05, 93E35, 62L20, 93E20
[LG-85] Flow Matching for Offline Reinforcement Learning with Discrete Actions
Link: https://arxiv.org/abs/2602.06138
Authors: Fairoz Nower Khan, Nabuat Zaman Nahim, Ruiquan Huang, Haibo Yang, Peizhong Ju
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Generative policies based on diffusion models and flow matching have shown strong promise for offline reinforcement learning (RL), but their applicability remains largely confined to continuous action spaces. To address a broader range of offline RL settings, we extend flow matching to a general framework that supports discrete action spaces with multiple objectives. Specifically, we replace continuous flows with continuous-time Markov chains, trained using a Q-weighted flow matching objective. We then extend our design to multi-agent settings, mitigating the exponential growth of joint action spaces via a factorized conditional path. We theoretically show that, under idealized conditions, optimizing this objective recovers the optimal policy. Extensive experiments further demonstrate that our method performs robustly in practical scenarios, including high-dimensional control, multi-modal decision-making, and dynamically changing preferences over multiple objectives. Our discrete framework can also be applied to continuous-control problems through action quantization, providing a flexible trade-off between representational complexity and performance.
[LG-86] Compressing LLMs with MoP: Mixture of Pruners
Link: https://arxiv.org/abs/2602.06127
Authors: Bruno Lopes Yamamoto, Lucas Lauton de Alcantara, Victor Zacarias, Leandro Giusti Mugnaini, Keith Ando Ogawa, Lucas Pellicer, Rosimeire Pereira Costa, Edson Bollis, Anna Helena Reali Costa, Artur Jordao
Subjects: Machine Learning (cs.LG)
Comments: Code and models are available at: this https URL
Abstract:The high computational demands of Large Language Models (LLMs) motivate methods that reduce parameter count and accelerate inference. In response, model pruning emerges as an effective strategy, yet current methods typically focus on a single dimension: depth or width. We introduce MoP (Mixture of Pruners), an iterative framework that unifies these dimensions. At each iteration, MoP generates two branches (pruning in depth versus pruning in width) and selects a candidate to advance the path. On LLaMA-2 and LLaMA-3, MoP advances the frontier of structured pruning, exceeding the accuracy of competing methods across a broad set of compression regimes. It also consistently outperforms depth-only and width-only pruning. Furthermore, MoP translates structural pruning into real speedup, reducing end-to-end latency by 39% at 40% compression. Finally, extending MoP to the vision-language model LLaVA-1.5, we notably improve computational efficiency and demonstrate that text-only recovery fine-tuning can restore performance even on visual tasks.
[LG-87] Private and interpretable clinical prediction with quantum-inspired tensor train models
Link: https://arxiv.org/abs/2602.06110
Authors: José Ramón Pareja Monturiol, Juliette Sinnott, Roger G. Melko, Mohammad Kohandel
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Quantum Physics (quant-ph)
Comments: 21 pages, 5 figures, 9 tables. The code for the experiments is publicly available at this https URL
Abstract:Machine learning in clinical settings must balance predictive accuracy, interpretability, and privacy. Models such as logistic regression (LR) offer transparency, while neural networks (NNs) provide greater predictive power; yet both remain vulnerable to privacy attacks. We empirically assess these risks by designing attacks that identify which public datasets were used to train a model under varying levels of adversarial access, applying them to LORIS, a publicly available LR model for immunotherapy response prediction, as well as to additional shallow NN models trained for the same task. Our results show that both models leak significant training-set information, with LRs proving particularly vulnerable in white-box scenarios. Moreover, we observe that common practices such as cross-validation in LRs exacerbate these risks. To mitigate these vulnerabilities, we propose a quantum-inspired defense based on tensorizing discretized models into tensor trains (TTs), which fully obfuscates parameters while preserving accuracy, reducing white-box attacks to random guessing and degrading black-box attacks comparably to Differential Privacy. TT models retain LR interpretability and extend it through efficient computation of marginal and conditional distributions, while also enabling this higher level of interpretability for NNs. Our results demonstrate that tensorization is widely applicable and establishes a practical foundation for private, interpretable, and effective clinical prediction.
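To illustrate why a tensor train (TT) makes marginals cheap, the sketch below stores a small discrete table as a chain of matrices, one per feature value, evaluates single entries as a matrix product, and marginalizes a feature by summing its core over that feature's values. The two-feature example, the core values, and the function names are hypothetical, and the tensorization procedure from the paper is not shown.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def _mat_sum(mats):
    """Elementwise sum of equally shaped matrices."""
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) for j in range(cols)]
            for i in range(rows)]


def tt_entry(cores, index):
    """Evaluate one tensor entry: product of the selected core slices.

    cores[k][v] is the matrix for feature k taking discrete value v.
    """
    result = [[1.0]]
    for core, v in zip(cores, index):
        result = matmul(result, core[v])
    return result[0][0]


def tt_marginal(cores, fixed):
    """Sum the tensor over every feature not listed in `fixed`.

    fixed: dict mapping feature position to its fixed value. The cost is
    linear in the number of features instead of exponential.
    """
    result = [[1.0]]
    for k, core in enumerate(cores):
        slice_k = core[fixed[k]] if k in fixed else _mat_sum(core)
        result = matmul(result, slice_k)
    return result[0][0]


# Rank-2 example encoding the table [[0.1, 0.2], [0.3, 0.4]].
EXAMPLE_CORES = [
    [[[1.0, 0.0]], [[0.0, 1.0]]],      # feature 0: 1x2 slices
    [[[0.1], [0.3]], [[0.2], [0.4]]],  # feature 1: 2x1 slices
]
```

With `EXAMPLE_CORES`, `tt_entry(EXAMPLE_CORES, (1, 0))` evaluates the entry 0.3, and `tt_marginal(EXAMPLE_CORES, {0: 1})` sums out the second feature to give 0.7. A conditional distribution follows by dividing an entry by the matching marginal.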
[LG-88] Pragmatic Curiosity: A Hybrid Learning-Optimization Paradigm via Active Inference
Link: https://arxiv.org/abs/2602.06104
Authors: Yingke Li, Anjali Parashar, Enlu Zhou, Chuchu Fan
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Many engineering and scientific workflows depend on expensive black-box evaluations, requiring decision-making that simultaneously improves performance and reduces uncertainty. Bayesian optimization (BO) and Bayesian experimental design (BED) offer powerful yet largely separate treatments of goal-seeking and information-seeking, providing limited guidance for hybrid settings where learning and optimization are intrinsically coupled. We propose “pragmatic curiosity,” a hybrid learning-optimization paradigm derived from active inference, in which actions are selected by minimizing the expected free energy, a single objective that couples pragmatic utility with epistemic information gain. We demonstrate the practical effectiveness and flexibility of pragmatic curiosity on various real-world hybrid tasks, including constrained system identification, targeted active search, and composite optimization with unknown preferences. Across these benchmarks, pragmatic curiosity consistently outperforms strong BO-type and BED-type baselines, delivering higher estimation accuracy, better critical-region coverage, and improved final solution quality.
[LG-89] Toward Faithful and Complete Answer Construction from a Single Document
Link: https://arxiv.org/abs/2602.06103
Authors: Zhaoyang Chen, Cody Fleming
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Modern large language models (LLMs) are powerful generators driven by statistical next-token prediction. While effective at producing fluent text, this design biases models toward high-probability continuations rather than exhaustive and faithful answers grounded in source content. As a result, direct application of LLMs lacks systematic mechanisms to ensure both completeness (avoiding omissions) and faithfulness (avoiding unsupported content), which fundamentally conflicts with core AI safety principles. To address this limitation, we present EVE, a structured framework for document-grounded reasoning. Unlike free-form prompting, EVE constrains generation to a structured, verifiable pipeline that decomposes high-rigor reasoning into extraction, validation, and enumeration. Empirically, this design enables consistent and simultaneous improvements in recall, precision, and F1-score: recall and precision increase by up to 24% and 29%, respectively, with a corresponding 31% gain in F1-score. This effectively breaks the long-standing trade-off between coverage and accuracy typical of single-pass LLM generation, while also mitigating generation truncation caused by length limitations. At the same time, we emphasize that EVE exhibits performance saturation due to the inherent ambiguity of natural language, reflecting fundamental limits of language-based reasoning.
[LG-90] Agentic Workflow Using RBA_θ for Event Prediction
Link: https://arxiv.org/abs/2602.06097
Authors: Purbak Sengupta, Sambeet Mishra, Sonal Shreya
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Wind power ramp events are difficult to forecast due to strong variability, multi-scale dynamics, and site-specific meteorological effects. This paper proposes an event-first, frequency-aware forecasting paradigm that directly predicts ramp events and reconstructs the power trajectory thereafter, rather than inferring events from dense forecasts. The framework is built on the event representation of an enhanced Ramping Behaviour Analysis (RBA$_\theta$) method and progressively integrates statistical, machine-learning, and deep-learning models. Traditional forecasting models with post-hoc event extraction provide a strong interpretable baseline but exhibit limited generalisation across sites. Direct event prediction using Random Forests improves robustness over survival-based formulations, motivating fully event-aware modelling. To capture the multi-scale nature of wind ramps, we introduce an event-first deep architecture that integrates wavelet-based frequency decomposition, temporal excitation features, and adaptive feature selection. The resulting sequence models enable stable long-horizon event prediction, physically consistent trajectory reconstruction, and zero-shot transfer to previously unseen wind farms. Empirical analysis shows that ramp magnitude and duration are governed by distinct mid-frequency bands, allowing accurate signal reconstruction from sparse event forecasts. An agentic forecasting layer is proposed, in which specialised workflows are selected dynamically based on operational context. Together, the framework demonstrates that event-first, frequency-aware forecasting provides a transferable and operationally aligned alternative to trajectory-first wind-power prediction.
[LG-91] Canzona: A Unified Asynchronous and Load-Balanced Framework for Distributed Matrix-based Optimizers
Link: https://arxiv.org/abs/2602.06079
Authors: Liangyu Wang, Siqi Zhang, Junjie Wang, Yiming Dong, Bo Zheng, Zihan Qiu, Shengkun Tang, Di Wang, Rui Men, Dayiheng Liu
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:
Abstract:The scaling of Large Language Models (LLMs) drives interest in matrix-based optimizers (e.g., Shampoo, Muon, SOAP) for their convergence efficiency; yet their requirement for holistic updates conflicts with the tensor fragmentation in distributed frameworks like Megatron. Existing solutions are suboptimal: synchronous approaches suffer from computational redundancy, while layer-wise partitioning fails to reconcile this conflict without violating the geometric constraints of efficient communication primitives. To bridge this gap, we propose Canzona, a Unified, Asynchronous, and Load-Balanced framework that decouples logical optimizer assignment from physical parameter distribution. For Data Parallelism, we introduce an alpha-Balanced Static Partitioning strategy that respects atomicity while neutralizing the load imbalance. For Tensor Parallelism, we design an Asynchronous Compute pipeline utilizing Micro-Group Scheduling to batch fragmented updates and hide reconstruction overhead. Extensive evaluations on the Qwen3 model family (up to 32B parameters) on 256 GPUs demonstrate that our approach preserves the efficiency of established parallel architectures, achieving a 1.57x speedup in end-to-end iteration time and reducing optimizer step latency by 5.8x compared to the baseline.
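The idea of assigning whole (atomic) parameter tensors to ranks while balancing optimizer cost can be sketched with the classic longest-processing-time greedy heuristic. This is a swapped-in textbook technique used only for illustration; it is not the paper's alpha-Balanced Static Partitioning, and the tensor names and costs are hypothetical.

```python
import heapq


def lpt_partition(costs, num_ranks):
    """Assign whole tensors to ranks with the LPT greedy heuristic.

    costs: dict mapping tensor name to its estimated optimizer-step cost.
    Tensors are never split, so each matrix update stays on one rank.
    Returns (assignment, loads): a list of names per rank and per-rank load.
    """
    heap = [(0.0, rank) for rank in range(num_ranks)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_ranks)]
    # Place the most expensive tensors first, each on the lightest rank.
    ordered = sorted(costs.items(), key=lambda kv: kv[1], reverse=True)
    for name, cost in ordered:
        load, rank = heapq.heappop(heap)
        assignment[rank].append(name)
        heapq.heappush(heap, (load + cost, rank))
    loads = [0.0] * num_ranks
    for load, rank in heap:
        loads[rank] = load
    return assignment, loads
```

For instance, with costs `{"w_qkv": 8, "w_out": 4, "mlp_up": 7, "mlp_down": 7, "embed": 2}` and two ranks, the heuristic ends with both ranks carrying a load of 14. In general LPT is only an approximation, so perfect balance is not guaranteed.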
[LG-92] PackInfer: Compute- and I/O-Efficient Attention for Batched LLM Inference
Link: https://arxiv.org/abs/2602.06072
Authors: Rui Ning,Wei Zhang,Fan Lai
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:
Abstract:Attention efficiency is critical to large language model (LLM) inference. While prior advances optimize attention execution for individual requests (e.g., FlashAttention), production LLM serving relies on batching requests with highly heterogeneous sequence lengths for high serving throughput. This mismatch induces severe computation and I/O imbalance, exacerbates stragglers, and underutilizes GPU resources. We present PackInfer, a kernel-level attention framework that enables compute- and I/O-aware execution for heterogeneous batched inference. PackInfer orchestrates batched requests into load-balanced execution groups, effectively saturating GPU utilization by packing multiple requests into unified kernel launches. By constructing attention kernels directly over packed query-key regions, PackInfer eliminates redundant computation and balances thread-block execution. It then incorporates I/O-aware grouping that co-locates shared-prefix requests and reorganizes KV caches into group-contiguous layouts, reducing memory fragmentation and redundant data movement as generation evolves. Evaluations on real-world workloads show that PackInfer reduces inference latency by 13.0-20.1%, and improves throughput by 20% compared to the state-of-the-art FlashAttention.
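The packing step can be illustrated with a small sketch. This is not PackInfer's grouping algorithm (the abstract does not specify it); it is a plain first-fit-decreasing heuristic that packs requests of different sequence lengths into groups under a token budget. The names `pack_requests` and `token_budget` are hypothetical.

```python
def pack_requests(seq_lens, token_budget):
    """First-fit-decreasing packing (illustrative, not the paper's method).

    seq_lens: list of per-request sequence lengths.
    Returns a list of groups, each with the request indices it holds and
    the total number of tokens packed into it.
    """
    groups = []
    # Visit requests from longest to shortest.
    order = sorted(range(len(seq_lens)), key=lambda i: -seq_lens[i])
    for i in order:
        n = seq_lens[i]
        if n > token_budget:
            raise ValueError(f"request {i} exceeds the token budget")
        for group in groups:
            if group["tokens"] + n <= token_budget:
                group["ids"].append(i)
                group["tokens"] += n
                break
        else:
            # No existing group has room, so open a new one.
            groups.append({"ids": [i], "tokens": n})
    return groups


if __name__ == "__main__":
    lens = [900, 100, 500, 400, 60, 40]
    packed = pack_requests(lens, token_budget=1000)
    padded_slots = max(lens) * len(lens)  # cost of padding every request to the longest
    packed_slots = sum(g["tokens"] for g in packed)
    print(packed, padded_slots, packed_slots)
```

In this toy case, padding all six requests to the longest one would occupy 5400 token slots, while the two packed groups occupy 2000, which hints at why heterogeneous batches benefit from packing.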
[LG-93] Deep Unfolded Fractional Optimization for Maximizing Robust Throughput in 6G Networks
Link: https://arxiv.org/abs/2602.06062
Authors: Anh Thi Bui,Robert-Jeron Reifert,Hayssam Dahrouj,Aydin Sezgin
Categories: Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 6 pages, 5 figures
Abstract:The sixth-generation (6G) of wireless communication networks aims to leverage artificial intelligence tools for efficient and robust network optimization. This is especially the case since traditional optimization methods often face high computational complexity, motivating the use of deep learning (DL)-based optimization frameworks. In this context, this paper considers a multi-antenna base station (BS) serving multiple users simultaneously through transmit beamforming in downlink mode. To account for robustness, this work proposes an uncertainty-injected deep unfolded fractional programming (UI-DUFP) framework for weighted sum rate (WSR) maximization under imperfect channel conditions. The proposed method unfolds fractional programming (FP) iterations into trainable neural network layers refined by projected gradient descent (PGD) steps, while robustness is introduced by injecting sampled channel uncertainties during training and optimizing a quantile-based objective. Simulation results show that the proposed UI-DUFP achieves higher WSR and improved robustness compared to classical weighted minimum mean square error, FP, and DL baselines, while maintaining low inference time and good scalability. These findings highlight the potential of deep unfolding combined with uncertainty-aware training as a powerful approach for robust optimization in 6G networks.
[LG-94] Automatic Detection and Analysis of Singing Mistakes for Music Pedagogy
Link: https://arxiv.org/abs/2602.06917
Authors: Sumit Kumar,Suraj Jaiswal,Parampreet Singh,Vipul Arora
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Comments: Under Review at Transactions of Audio Speech and Language Processing
Abstract:The advancement of machine learning in audio analysis has opened new possibilities for technology-enhanced music education. This paper introduces a framework for automatic singing mistake detection in the context of music pedagogy, supported by a newly curated dataset. The dataset comprises synchronized teacher-learner vocal recordings, with annotations marking different types of mistakes made by learners. Using this dataset, we develop different deep learning models for mistake detection and benchmark them. To compare the efficacy of mistake detection systems, a new evaluation methodology is proposed. Experiments indicate that the proposed learning-based methods are superior to rule-based methods. A systematic study of errors and a cross-teacher study reveal insights into music pedagogy that can be utilised for various music applications. This work sets out new directions of research in music pedagogy. The code and dataset are publicly available.
[LG-95] RanSOM: Second-Order Momentum with Randomized Scaling for Constrained and Unconstrained Optimization
Link: https://arxiv.org/abs/2602.06824
Authors: El Mahdi Chayti
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:
Abstract:Momentum methods, such as Polyak’s Heavy Ball, are the standard for training deep networks but suffer from curvature-induced bias in stochastic settings, limiting convergence to suboptimal $\mathcal{O}(\epsilon^{-4})$ rates. Existing corrections typically require expensive auxiliary sampling or restrictive smoothness assumptions. We propose RanSOM, a unified framework that eliminates this bias by replacing deterministic step sizes with randomized steps drawn from distributions with mean $\eta_t$. This modification allows us to leverage Stein-type identities to compute an exact, unbiased estimate of the momentum bias using a single Hessian-vector product computed jointly with the gradient, avoiding auxiliary queries. We instantiate this framework in two algorithms: RanSOM-E for unconstrained optimization (using exponentially distributed steps) and RanSOM-B for constrained optimization (using beta-distributed steps to strictly preserve feasibility). Theoretical analysis confirms that RanSOM recovers the optimal $\mathcal{O}(\epsilon^{-3})$ convergence rate under standard bounded noise, and achieves optimal rates for heavy-tailed noise settings ($p \in (1, 2]$) without requiring gradient clipping.
[LG-96] Optimal Learning-Rate Schedules under Functional Scaling Laws: Power Decay and Warmup-Stable-Decay
Link: https://arxiv.org/abs/2602.06797
Authors: Binghui Li,Zilin Wang,Fengling Chen,Shiyang Zhao,Ruiheng Zheng,Lei Wu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:We study optimal learning-rate schedules (LRSs) under the functional scaling law (FSL) framework introduced in Li et al. (2025), which accurately models the loss dynamics of both linear regression and large language model (LLM) pre-training. Within FSL, loss dynamics are governed by two exponents: a source exponent $s > 0$ controlling the rate of signal learning, and a capacity exponent $\beta > 1$ determining the rate of noise forgetting. Focusing on a fixed training horizon $N$, we derive the optimal LRSs and reveal a sharp phase transition. In the easy-task regime $s \ge 1 - 1/\beta$, the optimal schedule follows a power decay to zero, $\eta^*(z) = \eta_{\mathrm{peak}}(1 - z/N)^{2\beta - 1}$, where the peak learning rate scales as $\eta_{\mathrm{peak}} \eqsim N^{-\nu}$ for an explicit exponent $\nu = \nu(s,\beta)$. In contrast, in the hard-task regime $s < 1 - 1/\beta$, the optimal LRS exhibits a warmup-stable-decay (WSD) (Hu et al. (2024)) structure: it maintains the largest admissible learning rate for most of training and decays only near the end, with the decay phase occupying a vanishing fraction of the horizon. We further analyze optimal shape-fixed schedules, where only the peak learning rate is tuned (a strategy widely adopted in practice), and characterize their strengths and intrinsic limitations. This yields a principled evaluation of commonly used schedules such as cosine and linear decay. Finally, we apply the power-decay LRS to one-pass stochastic gradient descent (SGD) for kernel regression and show the last iterate attains the exact minimax-optimal rate, eliminating the logarithmic suboptimality present in prior analyses. Numerical experiments corroborate our theoretical predictions.
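The two schedule shapes named in the abstract can be written down directly. The sketch below assumes the power-decay form reads as peak * (1 - z/N)^(2*beta - 1), and it uses linear warmup and linear decay for the WSD shape, which is an illustrative assumption since the abstract does not specify those ramps. The function names and the parameters `warmup_steps` and `decay_steps` are hypothetical.

```python
def power_decay_lr(step, horizon, peak_lr, beta):
    """Power decay to zero: peak_lr * (1 - step/horizon) ** (2*beta - 1)."""
    frac = 1.0 - step / horizon
    return peak_lr * frac ** (2.0 * beta - 1.0)


def wsd_lr(step, horizon, peak_lr, warmup_steps, decay_steps):
    """Warmup-stable-decay with linear ramps (the ramp shapes are assumed)."""
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    if decay_steps > 0 and step >= horizon - decay_steps:
        return peak_lr * (horizon - step) / decay_steps
    return peak_lr


if __name__ == "__main__":
    N = 100
    for z in (0, 25, 50, 75, 100):
        print(z, power_decay_lr(z, N, 0.1, beta=1.5), wsd_lr(z, N, 0.1, 10, 10))
```

With beta = 1.5 the power-decay exponent is 2, so the rate falls quadratically in the remaining fraction of training, whereas the WSD shape holds the peak rate until the last `decay_steps` steps.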
[LG-97] Missing At Random as Covariate Shift: Correcting Bias in Iterative Imputation
Link: https://arxiv.org/abs/2602.06713
Authors: Luke Shannon,Song Liu,Katarzyna Reluga
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 8 pages, 6 figures
Abstract:Accurate imputation of missing data is critical to downstream machine learning performance. We formulate missing data imputation as a risk minimisation problem, which highlights a covariate shift between the observed and unobserved data distributions. This covariate-shift-induced bias is not accounted for by popular imputation methods and leads to suboptimal performance. In this paper, we derive theoretically valid importance weights that correct for the induced distributional bias. Furthermore, we propose a novel imputation algorithm that jointly estimates both the importance weights and imputation models, enabling bias correction throughout the imputation process. Empirical results across benchmark datasets show reductions in root mean squared error and Wasserstein distance of up to 7% and 20%, respectively, compared to otherwise identical unweighted methods.
[LG-98] Infinite-dimensional generative diffusions via Doob's h-transform
Link: https://arxiv.org/abs/2602.06621
Authors: Thorben Pieper-Sethmacher,Daniel Paulin
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:This paper introduces a rigorous framework for defining generative diffusion models in infinite dimensions via Doob’s h-transform. Rather than relying on time reversal of a noising process, a reference diffusion is forced towards the target distribution by an exponential change of measure. Compared to existing methodology, this approach readily generalises to the infinite-dimensional setting, hence offering greater flexibility in the diffusion model. The construction is derived rigorously under verifiable conditions, and bounds with respect to the target measure are established. We show that the forced process under the changed measure can be approximated by minimising a score-matching objective and validate our method on both synthetic and real data.
[LG-99] Evolving Ranking Functions for Canonical Blow-Ups in Positive Characteristic
Link: https://arxiv.org/abs/2602.06553
Authors: Gergely Bérczi
Categories: Algebraic Geometry (math.AG); Machine Learning (cs.LG)
Comments: 41 pages
Abstract:Resolution of singularities in positive characteristic remains a long-standing open problem in algebraic geometry. In characteristic zero, the problem was solved by Hironaka in 1964, work for which he was awarded the Fields Medal. Modern proofs proceed by constructing suitable ranking functions, that is, invariants shown to strictly decrease along canonical sequences of blow-ups, ensuring termination. In positive characteristic, however, no such general ranking function is known: Frobenius-specific pathologies, such as the kangaroo phenomenon, can cause classical characteristic-zero invariants to plateau or even temporarily increase, presenting a fundamental obstruction to existing approaches. In this paper we report a sequence of experiments using the evolutionary search model AlphaEvolve, designed to discover candidate ranking functions for a toy canonical blow-up process. Our test benchmarks consist of carefully selected hypersurface singularities in dimension 4 and characteristic $p=3$, with monic purely inseparable leading term, a regime in which naive order-based invariants often fail. After iteratively refining the experimental design, we obtained a discretized five-component lexicographic ranking function satisfying a bounded-delay descent criterion with zero violations across the benchmark. These experiments in turn motivated our main results: the conjectural delayed ranking functions in characteristic 3 formulated in two conjectures.
[LG-100] Operationalizing Stein's Method for Online Linear Optimization: CLT-Based Optimal Tradeoffs
Link: https://arxiv.org/abs/2602.06545
Authors: Zhiyu Zhang,Aaditya Ramdas
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Adversarial online linear optimization (OLO) is essentially about making performance tradeoffs with respect to the unknown difficulty of the adversary. In the setting of one-dimensional fixed-time OLO on a bounded domain, it has been observed since Cover (1966) that achievable tradeoffs are governed by probabilistic inequalities, and these descriptive results can be converted into algorithms via dynamic programming, which, however, is not computationally efficient. We address this limitation by showing that Stein’s method, a classical framework underlying the proofs of probabilistic limit theorems, can be operationalized as computationally efficient OLO algorithms. The associated regret and total loss upper bounds are “additively sharp”, meaning that they surpass the conventional big-O optimality and match normal-approximation-based lower bounds by additive lower order terms. Our construction is inspired by the remarkably clean proof of a Wasserstein martingale central limit theorem (CLT) due to Röllin (2018). Several concrete benefits can be obtained from this general technique. First, with the same computational complexity, the proposed algorithm improves upon the total loss upper bounds of online gradient descent (OGD) and multiplicative weight update (MWU). Second, our algorithm can realize a continuum of optimal two-point tradeoffs between the total loss and the maximum regret over comparators, improving upon prior works in parameter-free online learning. Third, by allowing the adversary to randomize on an unbounded support, we achieve sharp in-expectation performance guarantees for OLO with noisy feedback.
[LG-101] HyQuRP: Hybrid quantum-classical neural network with rotational and permutational equivariance for 3D point clouds
Link: https://arxiv.org/abs/2602.06381
Authors: Semin Park,Chae-Yeun Park
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 8+28 pages; 2 figures
Abstract:We introduce HyQuRP, a hybrid quantum-classical neural network equivariant to rotational and permutational symmetries. While existing equivariant quantum machine learning models often rely on ad hoc constructions, HyQuRP is built upon the formal foundations of group representation theory. In the sparse-point regime, HyQuRP consistently outperforms strong classical and quantum baselines across multiple benchmarks. For example, when six subsampled points are used, HyQuRP ($\sim$1.5K parameters) achieves 76.13% accuracy on the 5-class ModelNet benchmark, compared to approximately 71% for PointNet, PointMamba, and PointTransformer with similar parameter counts. These results highlight HyQuRP’s exceptional data efficiency and suggest the potential of quantum machine learning models for processing 3D point cloud data.
[LG-102] A Multiplicative Neural Network Architecture: Locality and Regularity of Approximation
Link: https://arxiv.org/abs/2602.06374
Authors: Hee-Sun Choi,Beom-Seok Han
Categories: Functional Analysis (math.FA); Machine Learning (cs.LG)
Comments: 22 pages; MSC classes: 46E35, 41A30
Abstract:We introduce a multiplicative neural network architecture in which multiplicative interactions constitute the fundamental representation, rather than appearing as auxiliary components within an additive model. We establish a universal approximation theorem for this architecture and analyze its approximation properties in terms of locality and regularity in Bessel potential spaces. To complement the theoretical results, we conduct numerical experiments on representative targets exhibiting sharp transition layers or pointwise loss of higher-order regularity. The experiments focus on the spatial structure of approximation errors and on regularity-sensitive quantities, in particular the convergence of Zygmund-type seminorms. The results show that the proposed multiplicative architecture yields residual error structures that are more tightly aligned with regions of reduced regularity and exhibits more stable convergence in regularity-sensitive metrics. These results demonstrate that adopting a multiplicative representation format has concrete implications for the localization and regularity behavior of neural network approximations, providing a direct connection between architectural design and analytical properties of the approximating functions.
[LG-103] High-Dimensional Limit of Stochastic Gradient Flow via Dynamical Mean-Field Theory
Link: https://arxiv.org/abs/2602.06320
Authors: Sota Nishiyama,Masaaki Imaizumi
Categories: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
Comments:
Abstract:Modern machine learning models are typically trained via multi-pass stochastic gradient descent (SGD) with small batch sizes, and understanding their dynamics in high dimensions is of great interest. However, an analytical framework for describing the high-dimensional asymptotic behavior of multi-pass SGD with small batch sizes for nonlinear models is currently missing. In this study, we address this gap by analyzing the high-dimensional dynamics of a stochastic differential equation called a stochastic gradient flow (SGF), which approximates multi-pass SGD in this regime. In the limit where the number of data samples $n$ and the dimension $d$ grow proportionally, we derive a closed system of low-dimensional and continuous-time equations and prove that it characterizes the asymptotic distribution of the SGF parameters. Our theory is based on the dynamical mean-field theory (DMFT) and is applicable to a wide range of models encompassing generalized linear models and two-layer neural networks. We further show that the resulting DMFT equations recover several existing high-dimensional descriptions of SGD dynamics as special cases, thereby providing a unifying perspective on prior frameworks such as online SGD and high-dimensional linear regression. Our proof builds on the existing DMFT technique for gradient flow and extends it to handle the stochasticity in SGF using tools from stochastic calculus.
[LG-104] Time-uniform conformal and PAC prediction
Link: https://arxiv.org/abs/2602.06297
Authors: Kayla E. Scharfstein,Arun Kumar Kuchibhotla
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Given that machine learning algorithms are increasingly being deployed to aid in high stakes decision-making, uncertainty quantification methods that wrap around these black box models such as conformal prediction have received much attention in recent years. In sequential settings, where data are observed/generated in a streaming fashion, traditional conformal methods do not provide any guarantee without fixing the sample size. More importantly, traditional conformal methods cannot cope with sequentially updated predictions. As such, we develop an extension of the conformal prediction and related probably approximately correct (PAC) prediction frameworks to sequential settings where the number of data points is not fixed in advance. The resulting prediction sets are anytime-valid in that their expected coverage is at the required level at any time chosen by the analyst even if this choice depends on the data. We present theoretical guarantees for our proposed methods and demonstrate their validity and utility on simulated and real datasets.
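For context, the fixed-sample-size baseline that the abstract contrasts against can be sketched in a few lines. This is standard split conformal prediction with absolute-residual scores, not the paper's anytime-valid construction, and its coverage guarantee assumes the calibration size is fixed in advance. The function names are illustrative.

```python
import math


def split_conformal_radius(calibration_residuals, alpha):
    """Standard split-conformal quantile for a fixed calibration size n.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest absolute residual,
    or infinity when n is too small for the requested coverage level.
    """
    n = len(calibration_residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(calibration_residuals)[k - 1]


def prediction_interval(point_prediction, radius):
    """Symmetric interval around a point prediction."""
    return (point_prediction - radius, point_prediction + radius)


if __name__ == "__main__":
    residuals = [9, 1, 8, 2, 7, 3, 6, 4, 5]  # made-up |y - prediction| values
    r = split_conformal_radius(residuals, alpha=0.25)
    print(r, prediction_interval(10.0, r))
```

The quantile index depends on n, so recomputing it as data stream in and stopping at a data-dependent time falls outside this guarantee; that gap is what the time-uniform extension described in the abstract targets.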
[LG-105] Inheritance Between Feedforward and Convolutional Networks via Model Projection
Link: https://arxiv.org/abs/2602.06245
Authors: Nicolas Ewen,Jairo Diaz-Rodriguez,Kelly Ramsay
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Techniques for feedforward networks (FFNs) and convolutional networks (CNNs) are frequently reused across families, but the relationship between the underlying model classes is rarely made explicit. We introduce a unified node-level formalization with tensor-valued activations and show that generalized feedforward networks form a strict subset of generalized convolutional networks. Motivated by the mismatch in per-input parameterization between the two families, we propose model projection, a parameter-efficient transfer learning method for CNNs that freezes pretrained per-input-channel filters and learns a single scalar gate for each (output channel, input channel) contribution. Projection keeps all convolutional layers adaptable to downstream tasks while substantially reducing the number of trained parameters in convolutional layers. We prove that projected nodes take the generalized FFN form, enabling projected CNNs to inherit feedforward techniques that do not rely on homogeneous layer inputs. Experiments across multiple ImageNet-pretrained backbones and several downstream image classification datasets show that model projection is a strong transfer learning baseline under simple training recipes.
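A rough sketch of the gating idea, under assumptions: the abstract says pretrained per-input-channel filters are frozen and one scalar gate is learned per (output channel, input channel) pair. The code below illustrates that with 1D signals and pure Python for brevity; the function names, the 1D simplification, and the absence of bias, stride, and padding are all choices made here, not details from the paper.

```python
def correlate1d(signal, kernel):
    """Valid-mode cross-correlation of one channel with one filter."""
    k = len(kernel)
    return [
        sum(signal[t + j] * kernel[j] for j in range(k))
        for t in range(len(signal) - k + 1)
    ]


def projected_conv1d(x, frozen_filters, gates):
    """x[i]: input channel i; frozen_filters[o][i]: frozen filter; gates[o][i]: scalar.

    Output channel o is the gate-weighted sum of per-input-channel responses.
    Only `gates` would be trained; the filters stay fixed.
    """
    out = []
    for o, per_input in enumerate(frozen_filters):
        length = len(x[0]) - len(per_input[0]) + 1
        acc = [0.0] * length
        for i, kernel in enumerate(per_input):
            response = correlate1d(x[i], kernel)
            acc = [a + gates[o][i] * r for a, r in zip(acc, response)]
        out.append(acc)
    return out


if __name__ == "__main__":
    x = [[1, 2, 3, 4], [0, 1, 0, 1]]
    filters = [[[1, 0], [0, 1]], [[1, 1], [1, -1]]]
    ones = [[1, 1], [1, 1]]
    print(projected_conv1d(x, filters, ones))  # all-ones gates give the ordinary convolution
```

With C_out output channels, C_in input channels, and kernel size k, the trainable count in this sketch drops from C_out * C_in * k filter weights to C_out * C_in gates.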
[LG-106] Warm Starts Cold States: Exploiting Adiabaticity for Variational Ground-States
Link: https://arxiv.org/abs/2602.06137
Authors: Ricard Puig,Berta Casas,Alba Cervera-Lierta,Zoë Holmes,Adrián Pérez-Salinas
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 11 + 24 pages, 3 figures
Abstract:Reliable preparation of many-body ground states is an essential task in quantum computing, with applications spanning areas from chemistry and materials modeling to quantum optimization and benchmarking. A variety of approaches have been proposed to tackle this problem, including variational methods. However, variational training often struggles to navigate complex energy landscapes, frequently encountering suboptimal local minima or suffering from barren plateaus. In this work, we introduce an iterative strategy for ground-state preparation based on a stepwise (discretized) Hamiltonian deformation. By complementing the Variational Quantum Eigensolver (VQE) with adiabatic principles, we demonstrate that solving a sequence of intermediate problems facilitates tracking the ground-state manifold toward the target system, even as we scale the system size. We provide a rigorous theoretical foundation for this approach, proving a lower bound on the loss variance that suggests trainability throughout the deformation, provided the system remains away from gap closings. Numerical simulations, including the effects of shot noise, confirm that this path-dependent tracking consistently converges to the target ground state.
[LG-107] Algebraic Robustness Verification of Neural Networks
Link: https://arxiv.org/abs/2602.06105
Authors: Yulia Alexandr,Hao Duan,Guido Montúfar
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Algebraic Geometry (math.AG)
Comments:
Abstract:We formulate formal robustness verification of neural networks as an algebraic optimization problem. We leverage the Euclidean Distance (ED) degree, which is the generic number of complex critical points of the distance minimization problem to a classifier’s decision boundary, as an architecture-dependent measure of the intrinsic complexity of robustness verification. To make this notion operational, we define the associated ED discriminant, which characterizes input points at which the number of real critical points changes, distinguishing test instances that are easier or harder to verify. We provide an explicit algorithm for computing this discriminant. We further introduce the parameter discriminant of a neural network, identifying parameters where the ED degree drops and the decision boundary exhibits reduced algebraic complexity. We derive closed-form expressions for the ED degree for several classes of neural architectures, as well as formulas for the expected number of real critical points in the infinite-width limit. Finally, we present an exact robustness certification algorithm based on numerical homotopy continuation, establishing a concrete link between metric algebraic geometry and neural network verification.
Information Retrieval
[IR-0] On the Efficiency of Sequentially Aware Recommender Systems: Cotten4Rec
Link: https://arxiv.org/abs/2602.06935
Authors: Shankar Veludandi,Gulrukh Kurdistan,Uzma Mushtaque
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Sequential recommendation (SR) models predict a user’s next interaction by modeling their historical behaviors. Transformer-based SR methods, notably BERT4Rec, effectively capture these patterns but incur significant computational overhead due to extensive intermediate computations associated with Softmax-based attention. We propose Cotten4Rec, a novel SR model utilizing linear-time cosine similarity attention, implemented through a single optimized compute unified device architecture (CUDA) kernel. By minimizing intermediate buffers and kernel-launch overhead, Cotten4Rec substantially reduces resource usage compared to BERT4Rec and the linear-attention baseline, LinRec, especially for datasets with moderate sequence lengths and vocabulary sizes. Evaluations across three benchmark datasets confirm that Cotten4Rec achieves considerable reductions in memory and runtime with minimal compromise in recommendation accuracy, demonstrating Cotten4Rec’s viability as an efficient alternative for practical, large-scale sequential recommendation scenarios where computational resources are critical.
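Why cosine-similarity attention can run in linear time is easiest to see in code. Without a Softmax, attention is a plain matrix product, so (Q K^T) V can be regrouped as Q (K^T V), which never forms the n-by-n score matrix. The sketch below shows only this regrouping in pure Python; any scaling, masking, or kernel fusion used by Cotten4Rec is not given in the abstract and is omitted, and all function names are illustrative.

```python
import math


def l2_normalize(rows, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / max(norm, eps) for x in row])
    return out


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def cosine_attention_quadratic(q, k, v):
    """Reference version: builds the full n x n similarity matrix."""
    scores = matmul(l2_normalize(q), transpose(l2_normalize(k)))
    return matmul(scores, v)


def cosine_attention_linear(q, k, v):
    """Same result via associativity: first K^T V (d x d_v), then Q times that."""
    kv = matmul(transpose(l2_normalize(k)), v)
    return matmul(l2_normalize(q), kv)


if __name__ == "__main__":
    q = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    v = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    print(cosine_attention_linear(q, k, v))
```

For sequence length n and head dimensions d and d_v, the regrouped form costs on the order of n * d * d_v operations instead of n^2 * (d + d_v), and it needs no n-by-n intermediate buffer.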
[IR-1] R2LED: Equipping Retrieval and Refinement in Lifelong User Modeling with Semantic IDs for CTR Prediction
Link: https://arxiv.org/abs/2602.06622
Authors: Qidong Liu,Gengnan Wang,Zhichen Liu,Moranxin Wang,Zijian Zhang,Xiao Han,Ni Zhang,Tao Qin,Chen Li
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Lifelong user modeling, which leverages users’ long-term behavior sequences for CTR prediction, has been widely applied in personalized services. Existing methods generally adopt a two-stage “retrieval-refinement” strategy to balance effectiveness and efficiency. However, they still suffer from (i) noisy retrieval due to skewed data distribution and (ii) lack of semantic understanding in refinement. While semantic enhancement, e.g., LLM-based modeling or semantic embeddings, offers potential solutions to these two challenges, these approaches face impractical inference costs or insufficient representation granularity. Absorbing the multi-granularity and lightweight merits of semantic identity (SID), we propose a novel paradigm that equips retrieval and refinement in Lifelong User Modeling with SEmantic IDs (R2LED) to address these issues. First, we introduce a Multi-route Mixed Retrieval for the retrieval stage. On the one hand, it captures users’ interests at various granularities through several parallel recall routes. On the other hand, a mixed retrieval mechanism is proposed to efficiently retrieve candidates from both collaborative and semantic views, reducing noise. Then, for refinement, we design a Bi-level Fusion Refinement, including a target-aware cross-attention for route-level fusion and a gate mechanism for SID-level fusion. It can bridge the gap between semantic and collaborative spaces, exploiting the merits of SID. Comprehensive experimental results on two public datasets demonstrate the superiority of our method in both performance and efficiency. To facilitate reproduction, we have released the code online at this https URL.
[IR-2] TokenMixer-Large: Scaling Up Large Ranking Models in Industrial Recommenders
Link: https://arxiv.org/abs/2602.06563
Authors: Yuchen Jiang,Jie Zhu,Xintian Han,Hui Lu,Kunmin Bai,Mingyu Yang,Shikang Wu,Ruihao Zhang,Wenlin Zhao,Shipeng Bai,Sijin Zhou,Huizhi Yang,Tianyi Liu,Wenda Liu,Ziyan Gong,Haoran Ding,Zheng Chai,Deping Xie,Zhe Chen,Yuchao Zheng,Peng Xu
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:In recent years, the study of scaling laws for large recommendation models has gradually gained attention. Works such as Wukong, HiFormer, and DHEN have attempted to increase the complexity of interaction structures in ranking models and validate scaling laws between performance and parameters/FLOPs by stacking multiple layers. However, their experimental scale remains relatively limited. Our previous work introduced the TokenMixer architecture, an efficient variant of the standard Transformer where the self-attention mechanism is replaced by a simple reshape operation, and the feed-forward network is adapted to a per-token FFN. The effectiveness of this architecture was demonstrated in the ranking stage by the model presented in the RankMixer paper. However, this foundational TokenMixer architecture itself has several design limitations. In this paper, we propose TokenMixer-Large, which systematically addresses these core issues: sub-optimal residual design, insufficient gradient updates in deep models, incomplete MoE sparsification, and limited exploration of scalability. By leveraging a mixing-and-reverting operation, inter-layer residuals, the auxiliary loss and a novel Sparse-Pertoken MoE architecture, TokenMixer-Large successfully scales its parameters to 7-billion and 15-billion on online traffic and offline experiments, respectively. Currently deployed in multiple scenarios at ByteDance, TokenMixer-Large has achieved significant offline and online performance gains.
[IR-3] MuCo: Multi-turn Contrastive Learning for Multimodal Embedding Model
Link: https://arxiv.org/abs/2602.06393
Authors: Geonmo Gu,Byeongho Heo,Jaemyung Yu,Jaehui Hwang,Taekyung Kim,Sangmin Lee,HeeJae Jun,Yoohoon Kang,Sangdoo Yun,Dongyoon Han
Categories: Information Retrieval (cs.IR)
Comments: 22 pages
Abstract:Universal multimodal embedding models built on Multimodal Large Language Models (MLLMs) have traditionally employed contrastive learning, which aligns representations of query-target pairs across different modalities. Yet, despite their empirical success, they are primarily built on a “single-turn” formulation where each query-target pair is treated as an independent data point. This paradigm leads to computational inefficiency when scaling, as it requires a separate forward pass for each pair and overlooks potential contextual relationships between multiple queries that can relate to the same context. In this work, we introduce Multi-Turn Contrastive Learning (MuCo), a dialogue-inspired framework that revisits this process. MuCo leverages the conversational nature of MLLMs to process multiple, related query-target pairs associated with a single image within a single forward pass. This allows us to extract a set of multiple query and target embeddings simultaneously, conditioned on a shared context representation, amplifying the effective batch size and overall training efficiency. Experiments with a newly curated 5M multimodal multi-turn dataset (M3T) show that MuCo yields state-of-the-art retrieval performance on MMEB and M-BEIR benchmarks, while markedly enhancing both training efficiency and representation coherence across modalities. Code and M3T are available at this https URL
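As background for the "single-turn" formulation the abstract refers to, here is a minimal sketch of a standard in-batch contrastive (InfoNCE) loss over query-target pairs. It is not MuCo's multi-turn objective; it only shows the baseline loss in which row i of the queries is matched to row i of the targets and every other target in the batch acts as a negative. The function name, the temperature value, and the assumption of unit-normalized embeddings are illustrative.

```python
import math


def info_nce(queries, targets, temperature=0.07):
    """Mean InfoNCE loss; assumes unit-normalized embeddings, positives on the diagonal."""
    losses = []
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, t)) / temperature for t in targets]
        # Numerically stable log-sum-exp over all targets in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(info_nce(queries, aligned), info_nce(queries, swapped))
```

Every extra pair in the batch adds negatives for all other pairs, which is one way to read the abstract's point that obtaining several query and target embeddings from one forward pass enlarges the effective batch size.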


